As some people have mentioned here and here, increasing the InnoDB log file size can lead to nice increases in performance. This is a trick we often deploy with clients, so there is not anything really new here. However, there is a caveat: please be aware there is a potentially huge downside to large log file sizes, and that is crash recovery time. You trade real-world performance for crash recovery time. When you are expecting your shiny Heartbeat-DRBD setup to fail over in under a minute, this can be disastrous! In fact I have been some places where recovery time was measured in hours. Just keep this in mind before you change your settings.
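For reference, the setting lives in my.cnf ( the sizes below are purely illustrative, not a recommendation for your workload ):

```ini
# my.cnf -- illustrative sizes only; tune against your own recovery-time budget
[mysqld]
# Bigger redo logs mean fewer checkpoints and better write throughput,
# but more redo to replay after a crash.
innodb_log_file_size      = 256M
innodb_log_files_in_group = 2
```

Remember that on the older MySQL versions you cannot just change this value and restart: shut the server down cleanly first, move the existing ib_logfile* files out of the datadir, then start with the new size so InnoDB recreates them.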
So I spent several hours over the last few days on the secondary index bug. Out of frustration I decided to try to bypass the LRU concept altogether and go to a true secondary page cache. In standard Waffle a page is written to memcached only when it is expunged ( or LRU’d ) from the main buffer pool. This means anything in the BP should not be in memcached. Obviously with this approach we missed something; as Heikki pointed out in a comment to a previous post, it seems likely we are getting an old version of a page. Logically this could happen if we do not correctly expire a page on get, or if we bypass a push/LRU, leaving an old page in memcached to be retrieved later on.
So I was thinking: why not bypass the LRU process? While I feel the LRU method is the most efficient way to do this, it’s not the only way. I modified InnoDB to use the default LRU code, and then modified the page get to push to memcached on any disk read. Additionally I added a second push to memcached when we flush a dirty page. This should ensure ( hopefully ) that memcached is always up to date. This way really means all of your pages will have some limited persistence in memcached, in that they never expire ( unless you run out of space ).
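The idea can be sketched with a toy model ( all names here are illustrative Python, not the actual InnoDB patch, which lives in the buffer pool and flush code paths ): every disk read and every dirty-page flush pushes the page to memcached, so memcached mirrors what is on disk.

```python
# Toy model of the "persistent" Waffle strategy: memcached shadows disk.
# Illustrative only -- the real patch is C inside InnoDB, not Python.

disk = {}       # space_id:offset -> page ("on disk")
memcached = {}  # secondary page cache

def read_page(key):
    """Read a page, preferring memcached over disk."""
    if key in memcached:          # cache hit: no disk I/O needed
        return memcached[key]
    page = disk[key]              # cache miss: real disk read
    memcached[key] = page         # push on ANY disk read
    return page

def flush_dirty_page(key, page):
    """Write a dirty page back, keeping memcached in sync with disk."""
    disk[key] = page
    memcached[key] = page         # second push on flush
```

With both push points in place, a page read once from disk should always be served from memcached afterwards, and a flushed page can never leave a stale copy behind.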
We decided against doing Waffle this way previously for a couple of reasons. First, we are duplicating the memory footprint. The LRU method is efficient in that a page is either in memcached or in the BP, but should not be in both. The “persistent” way can have pages in both… that seems like a waste. In fact, with the “persistent” way it would be possible to have a 32G BP and a 32G memcached pool that are exact copies of each other. The second reason is recoverability. In the LRU process we do not care if we miss a set. Since a get expires the cache, a get from memcached wipes out the page. So if the set fails, nothing is put into memcached and the page is read from disk the next time. The “persistent” way means a failed set leaves your data potentially out of sync. So we need to code around it: we either need to retry the missed set, build a blacklist of pages to read from disk, or disable the entire cache when an error occurs.
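The failure-safety of the original LRU handoff can be sketched the same way ( again purely illustrative names, not the real Waffle code ): a page lives in either the BP or memcached, a get expires the cached copy, and a page is only set on eviction, so a lost set just means a disk read later, never stale data.

```python
# Toy model of the LRU-handoff strategy: expire-on-get, set-on-evict.
# Illustrative only; the real hooks are buf_LRU_search_and_free_block
# and buf_read_page_low inside InnoDB.

disk = {}
memcached = {}

def evict_from_bp(key, page, set_succeeds=True):
    """On LRU eviction, hand the page to memcached (the set may fail)."""
    if set_succeeds:
        memcached[key] = page
    # on failure we do nothing: disk still holds the page, so no harm

def read_page(key):
    """A get expires the cached copy, so it can never go stale in there."""
    if key in memcached:
        return memcached.pop(key)   # expire on get
    return disk[key]                # miss: fall back to disk
```

A failed set costs one extra disk read; it cannot leave a wrong page in the cache, which is exactly why the LRU scheme could shrug off set errors.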
Now, on the flip side, having the cache be somewhat persistent means fewer memcached sets, especially during the LRU process, which should speed things up. After reading from disk once, you should never have to go back to disk for a read, so once again faster performance. In hindsight, I am not sure that a set failure under the LRU method is as safe as it needs to be anyway. If you miss a get and a set, then come back online, your data is probably going to be corrupt, so we probably need to add a blacklist/disable-on-failure feature anyway. Another plus is that the persistent method works with standard memcached ( no need for an MRU ).
So I spent the weekend looking at places where we may have missed something in the Waffle code. You can actually see some of the stuff I tried in the bug on Launchpad about this, but the weird thing is the very last thing I tried. As I took a step back and looked at the problem ( secondary index corruption ) and our assumption that we “missed” something, I decided to find the place where pages are written to disk and to push to memcached from there as well as from the LRU. With the doublewrite buffer enabled, that place should be buf_flush_buffered_writes. By pushing to memcached here we should catch any page that falls through the cracks of the LRU. Basically this should help ensure memcached has an exact copy of the data that exists on disk. The result? It failed with the same secondary index failure. This means:
a.) maybe we have a problem in the memcached/libmemcached layer ( seems unlikely that this would cause an error at the exact same time every run )
b.) Somehow multiple copies of the same page end up in the BP ( maybe left over from a merge process ), where the “invalid” or temp page is LRU’d but never makes it to disk… I think I will eliminate the push to memcached via the LRU process and see if that fixes it ( which should validate this theory )
c.) I missed something else
d.) A page is set into memcached via some mechanism
A page bypasses the normal memcached read process, to load a page into the BP
A page then is changed in the BP
Then that page is either re-read from memcached overwriting the change or the change is written to disk without going through the “normal” fil_io or lru process…
Not sure if I can see many other scenarios here….thoughts?
If you read Yves’ blog post about Waffle yesterday, we are seeing some weird gremlins in the system and could use some Scooby-Doo detective work if you have some ideas. The strange thing is it only shows up under high load. So it really seems like we may have missed some background cleanup process that accesses or removes pages from disk or the buffer pool without going through the functions we hook Waffle into ( buf_LRU_search_and_free_block & buf_read_page_low ).
One of the ideas I had was to narrow the scope of what’s being pushed to and read from memcached. Even though I am using file-per-table, system tablespace pages are still making it in and out of memcached. I thought if we missed something, maybe it was here ( even though I could not find it in the code ). I mean, cleaning up undo or internal data would seem like a logical place to miss something. So I hacked Waffle to only send blocks from space IDs > 1. This effectively means only actual table data should be going to and from memcached.
To my utter amazement, performance dropped by over 50% when I eliminated the reads/writes from memcached for space ID 1… that is just massively huge! In fact I counted nearly 2x more sets/gets on space 1 than I did on any other space. Maybe I am just tired right now, but this seems wrong. I mean, sure, space 1 pages may get LRU’d… but at that frequency it’s a bit crazy. I need to dig deeper into this.
***** I guess I was just tired :) Space ID for the system space is 0, dooooh “egg and my face are in alignment” *****
Not really a MySQL-related topic, but so I do not forget later on, and for those who end up working with the StorMan CLI to manage their controller cards, I thought I would publish my quick down-and-dirty cheatsheet. The great thing about the newer arcconf is that the help & usage output is very helpful and needs little explanation. In fact I almost skipped posting this because most people can just type arcconf and figure it out from there, but because I touch so many different cards I figured this would serve as a reminder to me and others in the future.
Display the config:
[root@tm163-110 ~]# /usr/StorMan/arcconf GETCONFIG
Usage: GETCONFIG <Controller#> [AD | LD [LD#] | PD | [AL]]
Prints controller configuration information.
Option AD : Adapter information only
LD : Logical device information only
LD# : Optionally display information about the specified logical device
PD : Physical device information only
AL : All information (optional)
[root@tm163-110 ~]# /usr/StorMan/arcconf GETCONFIG 1
Controllers found: 1
/usr/StorMan/arcconf getconfig 1 LD
/usr/StorMan/arcconf getconfig 1
Check this output for failures:
Failed stripes : No
You probably haven’t noticed, but I have not blogged since the UC. It is not because I am upset by the prospect of working for Oracle; I have simply been busy tracking down an issue we have with WaffleGrid. We discovered that under high load, with DBT2 on a tmpfs, we end up with an error in a secondary index. In the MySQL error log, we have entries like this one:
InnoDB: error in sec index entry update in
InnoDB: index `myidx1` of table `dbt2`.`new_order`
InnoDB: tuple DATA TUPLE: 3 fields;
0: len 4; hex 80000001; asc ;;
1: len 4; hex 80000bea; asc ;;
2: len 4; hex 80000005; asc ;;
InnoDB: record PHYSICAL RECORD: n_fields 3; compact format; info bits 32
0: len 4; hex 80000001; asc ;;
1: len 4; hex 80000bea; asc ;;
2: len 4; hex 80000004; asc ;;
TRANSACTION 14469, ACTIVE 1 sec, process no 7982, OS thread id 2995481488 updating or deleting
mysql tables in use 1, locked 1
26 lock struct(s), heap size 2496, 65 row lock(s), undo log entries 60
MySQL thread id 31, query id 1246503 localhost root updating
DELETE FROM new_order
WHERE no_o_id = 3050
AND no_w_id = 1
AND no_d_id = 5
These errors are triggered when transactions are purged. Basically, an entry in the secondary index has to be deleted, and when MySQL accesses the page, the row is missing.
Matt and I have dug into this issue to the limit of our sanity, and although we gained knowledge of the InnoDB code, we are still stuck. Basically what we are looking for is a way for a file page to go to disk without hitting the LRU list, or a possibility to have 2 pages in the buffer pool with the same space_id:offset pair. Anyone who has input on these topics, please comment on this post…
So let’s test some different configurations and try to build some best practices around multiple SSDs:
Which is better? RAID 5 or RAID 10?
As with regular disks, RAID 10 seems to perform better ( except for pure reads ). I did get a lot of movement from test to test, like with the 67% read test -vs- the 75% or 80% tests. But all in all RAID 10 seemed to be the optimal config.
So in my previous post I showed some benchmarks with a large drop-off in performance when you fill the X-25e. I wanted to follow up and say this: even if you do everything correctly ( i.e. leave 50%+ space free, disable controller cache, etc. ) you may still see a drop in performance if your workload is heavily write-skewed. To show this I ran a 100% random write sysbench fileio test over a 12GB dataset ( 37.5% full ); the tests were run back-to-back over several hours. Here is what we see:
*Note the scale is a little skewed here ( I start at 2500 reqs ).
Each data point represents 2 million IOs, so somewhere after about 6 million IOs we start to drop. At the end it looks like we stabilize around 2900-3000 requests per second, an overall drop of about 25%.
The plan was only to do two quick posts on RAID performance on the X-25e, but this was compelling enough to post on its own. So in part I, Mark Callaghan asked: hey, what gives with the SLC Intel’s single-drive random write performance? It’s lower than the MLC drive. To be completely honest with you, I had overlooked it; after all, I was focusing on RAID performance. This was my mistake, because it is actually caused by one of the Achilles’ heels of most flash on the market today: crappy performance when you fill more of the drive. I don’t really know what the official title for it is, but I will call it “Drive Overeating”.
Let me try to put this simply: a quick trick most vendors use to push better random write #’s and help wear leveling is to not fully erase blocks, and instead use free space on the drive as a place where they can basically “write random blocks sequentially”. They can then free the deleted blocks later on. When the drive becomes overly full ( overeats ), performance starts to degrade as blocks must be fully erased. So as you get closer to capacity, drive speeds will slow. I had heard that Intel reserves some flash that is not accessible, to help boost performance ( for example, 32GB is available, but an additional X GB is hidden for these operations ), but I have not seen this in print anywhere. If someone has a link on this I would appreciate it.
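The mechanism can be sketched with a toy flash-translation-layer model ( the numbers and class names are invented for illustration; real firmware is far more sophisticated ): writes are cheap while pre-erased blocks remain, and once the free pool is exhausted every write must first pay for an expensive block erase.

```python
# Toy FTL model: while pre-erased blocks remain, a random write is cheap;
# once the free pool is gone, each write must first erase a reclaimed block.
# All numbers are invented for illustration.

WRITE_COST = 1    # cost units to program a page
ERASE_COST = 20   # erases are an order of magnitude slower

class ToyFlashDrive:
    def __init__(self, total_blocks, spare_blocks):
        # spare area hidden from the user, as rumored for the Intel drives
        self.free_blocks = total_blocks + spare_blocks
        self.cost = 0

    def random_write(self):
        if self.free_blocks > 0:
            self.free_blocks -= 1                 # cheap: pre-erased block
            self.cost += WRITE_COST
        else:
            self.cost += ERASE_COST + WRITE_COST  # must erase before writing

def avg_cost(drive, n):
    """Average cost per write over the next n random writes."""
    start = drive.cost
    for _ in range(n):
        drive.random_write()
    return (drive.cost - start) / n
```

In this toy model an empty drive averages cost 1 per write, and the moment the free pool runs dry the average jumps to 21: the same cliff the X-25e graphs show, and why extra hidden spare blocks delay it.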
Anyway, what you want to see is consistent #’s from various-size datasets in fileio when using flash. We may be adjusting the size, but the threads + the # of requests are going to stay the same… so theoretically a 32-thread test with 2 million requests on a 20GB file should produce similar #’s to a 40GB test with the same setup ( unlike regular disk, we do not need to move around to find the random block ). Now there are always special circumstances, i.e. cache, external processes, etc., that can influence the numbers slightly, but the goal is consistency. Take a look at the earlier Violin SSD benchmarks: at 100% writes ( 0% on the graph ) the 20GB test is very close to the 360GB in terms of performance. Now let’s take a closer look at the X-25e:
Everyone loves SSD. It’s a hot topic all around the MySQL community, with vendors lining up all kinds of new solutions to attack the “disk io” problem that has plagued us all for years and years. At this year’s user conference I talked about SSDs and MySQL. Those who follow my blog know I love IO, and I love to benchmark anything that can help overcome IO issues. One of the most exciting things out there at this point are the Intel X-25e drives. These bad boys are not only fast but relatively inexpensive. How fast are they? Let’s just do a quick bit of review here and peek at the single-drive #’s from sysbench. Here you can see that a single X-25e outperforms all my other single-drive tests.