Follow-up on my EC2 Latency Rant

I was testing out our latest Waffle release when I saw this horrid latency, averaging in the 12+ms range, so I decided to dig a bit deeper today:

I'll start with a single-threaded test on the EC2 setup I was complaining about. This is running Waffle 0.4 and memcached 1.2.5.


mysql> set @@global.innodb_memcached_enable=1;
Query OK, 0 rows affected (0.00 sec)
mysql> use dbt2;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> select count(*) from stock;
+----------+
| count(*) |
+----------+
|  2000000 |
+----------+
1 row in set (17.79 sec)
mysql> select count(*) from customer;
+----------+
| count(*) |
+----------+
|   600000 |
+----------+
1 row in set (2 min 9.56 sec)

mysql> select count(*) from stock;
+----------+
| count(*) |
+----------+
|  2000000 |
+----------+
1 row in set (2 min 11.39 sec)


---------
MEMCACHED
---------
Memcached puts 44050
Memcached hits 22016
Memcached misses 22034
Memcached Prefix: 4504
Memcached Get Total Lat 5842 (us)
Memcached Get Recent Lat 5936 (us)
Memcached Miss Total Lat 1733 (us)
Memcached Miss Recent Lat 9230 (us)
Memcached Set Total Lat 47 (us)
Memcached Set Recent Lat 45 (us)

So here we are seeing get times from memcached in the range of 6ms. On dedicated hardware over 1GbE, this same single-threaded test typically gives me around 600 microseconds.
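
To separate raw network round-trip time from anything memcached or Waffle is doing, a quick check is to time a trivial memcached command over a single persistent connection. Here is a minimal sketch (Python, standard library only; the host address is a placeholder for the memcached instance's private IP) that times the memcached version command, which does essentially no server-side work, so the numbers are dominated by the network:

import socket
import time

# Placeholder memcached endpoint and sample count; adjust for the setup under test.
HOST = "10.248.247.0"
PORT = 11211
SAMPLES = 1000

# One persistent connection, the same way a single-threaded client would talk to memcached.
sock = socket.create_connection((HOST, PORT))
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

latencies = []
for _ in range(SAMPLES):
    start = time.time()
    sock.sendall(b"version\r\n")          # trivial command, single-line reply
    reply = b""
    while not reply.endswith(b"\r\n"):    # read until the reply line is complete
        reply += sock.recv(4096)
    latencies.append((time.time() - start) * 1e6)  # microseconds

sock.close()
latencies.sort()
print("min %.0f us  median %.0f us  avg %.0f us  p99 %.0f us" % (
    latencies[0],
    latencies[len(latencies) // 2],
    sum(latencies) / len(latencies),
    latencies[int(len(latencies) * 0.99)],
))

On a healthy 1GbE link I would expect the median here to land in the low hundreds of microseconds; anything in the multi-millisecond range points at the network path rather than at memcached itself.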

I decided to terminate that instance and try again. If I got a server that went through a different switch maybe my performance would improve.


mysql> set @@global.innodb_memcached_enable=1;
Query OK, 0 rows affected (0.00 sec)
mysql> select count(*) from stock;
+----------+
| count(*) |
+----------+
|  2000000 |
+----------+
1 row in set (16.87 sec)
mysql> select count(*) from customer;
+----------+
| count(*) |
+----------+
|   600000 |
+----------+
1 row in set (27.55 sec)
mysql> select count(*) from stock;
+----------+
| count(*) |
+----------+
|  2000000 |
+----------+
1 row in set (32.31 sec)

---------
MEMCACHED
---------
Memcached puts 44144
Memcached hits 22071
Memcached misses 22073
Memcached Prefix: 4705
Memcached Get Total Lat 1359 (us)
Memcached Get Recent Lat 908 (us)
Memcached Miss Total Lat 298 (us)
Memcached Miss Recent Lat 923 (us)
Memcached Set Total Lat 46 (us)
Memcached Set Recent Lat 42 (us)

A different server gave me substantially better results, so I ran a traceroute to each:


Slow server:
traceroute to 10.248.247.0 (10.248.247.0), 30 hops max, 40 byte packets
1 dom0-10-254-128-156.compute-1.internal (10.254.128.156) 0.000 ms 0.000 ms 0.000 ms
2 10.254.128.3 (10.254.128.3) 0.000 ms 0.000 ms 0.000 ms
3 ec2-75-101-160-178.compute-1.amazonaws.com (75.101.160.178) 0.000 ms 0.000 ms 0.000 ms
4 ec2-75-101-160-183.compute-1.amazonaws.com (75.101.160.183) 0.000 ms 0.000 ms 0.000 ms
5 dom0-10-248-244-163.compute-1.internal (10.248.244.163) 0.000 ms 0.000 ms 0.000 ms
6 domU-12-31-39-02-F0-F2.compute-1.internal (10.248.247.0) 0.000 ms 0.000 ms 0.000 ms

Not sure why I got 0 timings.


Fast server:
traceroute to domU-12-31-39-00-B1-B5.compute-1.internal (10.254.182.67), 30 hops max, 40 byte packets
1 dom0-10-254-128-156.compute-1.internal (10.254.128.156) 0.081 ms 0.129 ms 0.114 ms
2 10.254.128.3 (10.254.128.3) 116.278 ms 231.327 ms 288.199 ms
3 dom0-10-254-180-151.compute-1.internal (10.254.180.151) 348.145 ms 405.002 ms 461.337 ms
4 domU-12-31-39-00-B1-B5.compute-1.internal (10.254.182.67) 523.199 ms 587.528 ms 644.319 ms

Both servers were in us-east-1b… at least I am pretty sure Elasticfox showed them as us-east-1b. I opened up 20 other EC2 instances and noticed the problem in 2 of the 20.

I thought maybe I had done the traceroute on the public address, so I checked again; no, it was the private one, which is weird because the traceroute shows the public names of the systems.
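
A quick way to double-check that sort of mix-up is to resolve both names from inside the instance and compare the addresses you actually get back. A tiny sketch, using hostnames lifted from the traceroute output above purely as placeholders:

import socket

# Placeholder hostnames from the traceroutes above; substitute the instance being tested.
names = [
    "domU-12-31-39-02-F0-F2.compute-1.internal",
    "ec2-75-101-160-178.compute-1.amazonaws.com",
]

for name in names:
    try:
        print("%s -> %s" % (name, socket.gethostbyname(name)))
    except socket.gaierror as err:
        print("%s -> could not resolve (%s)" % (name, err))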

Just to verify I was not seeing a server reported in the wrong zone,  I spun up an instance in us-east-1a and did a quick traceroute between the servers.


traceroute to ip-10-250-59-166.ec2.internal (10.250.59.166), 30 hops max, 40 byte packets
1 dom0-10-254-128-156.compute-1.internal (10.254.128.156) 0.070 ms 0.100 ms 0.090 ms
2 10.254.128.3 (10.254.128.3) 125.842 ms 246.171 ms 303.269 ms
3 ec2-75-101-160-178.compute-1.amazonaws.com (75.101.160.178) 364.750 ms 427.458 ms 489.541 ms
4 216.182.232.15 (216.182.232.15) 548.359 ms 610.895 ms 673.215 ms
5 216.182.224.18 (216.182.224.18) 735.725 ms 797.731 ms 860.069 ms
6 ec2-75-101-160-113.compute-1.amazonaws.com (75.101.160.113) 934.841 ms 982.372 ms 1049.491 ms
7 ip-10-250-56-173.ec2.internal (10.250.56.173) 1110.901 ms 1016.309 ms *
8 ip-10-250-59-166.ec2.internal (10.250.59.166) 1.213 ms 1.060 ms 1.036 ms

So that's different. I almost feel like I am playing Russian roulette: will I get a good network route or not? I am going to try to get another one of the servers with bad routing, but it's very sporadic. I will write more if I get another one to play with.

From a Waffle standpoint I think I need to either hack memslap or make a waffleslap-type tool to verify that the network between the two machines is good before proceeding.
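
A rough sketch of what that pre-flight check could look like, assuming the Python standard library and placeholder host, port, sample count, and latency threshold: it runs a short burst of memcached set/get pairs over one connection and refuses to proceed if the median get round trip looks worse than a healthy 1GbE link should.

import socket
import sys
import time

# All placeholders: the memcached host to validate, sample count, and the
# cutoff that counts as "bad network".
HOST = "10.254.182.67"
PORT = 11211
SAMPLES = 500
THRESHOLD_US = 1000          # fail if the median get round trip is over 1 ms

def recv_until(conn, marker):
    """Read from the socket until the reply ends with the given marker."""
    data = b""
    while not data.endswith(marker):
        data += conn.recv(4096)
    return data

sock = socket.create_connection((HOST, PORT))
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

value = b"x" * 1024          # 1 KB payload as a stand-in for a cached block
get_lat = []
for i in range(SAMPLES):
    key = "waffleslap:%d" % i
    # memcached text protocol set: "set <key> <flags> <exptime> <bytes>\r\n<data>\r\n"
    header = ("set %s 0 60 %d\r\n" % (key, len(value))).encode()
    sock.sendall(header + value + b"\r\n")
    recv_until(sock, b"\r\n")             # expect a single "STORED\r\n" line

    start = time.time()
    sock.sendall(("get %s\r\n" % key).encode())
    recv_until(sock, b"END\r\n")          # get reply terminates with "END\r\n"
    get_lat.append((time.time() - start) * 1e6)

sock.close()
get_lat.sort()
median = get_lat[len(get_lat) // 2]
print("median get latency: %.0f us over %d samples" % (median, SAMPLES))
if median > THRESHOLD_US:
    sys.exit("network between these machines looks bad, not proceeding")

memslap would exercise the same path with far more concurrency; the point of something like this is just a cheap go/no-go check before kicking off a full Waffle run on a box that drew the short straw on routing.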
