Waffle: Progress and a Rearchitecture?

So I spent several hours over the last few days on the secondary index bug. Out of frustration I decided to try bypassing the LRU concept altogether and going to a true secondary page cache. In standard Waffle a page is written to memcached only when it is expunged (or LRU'd) from the main buffer pool. This means anything in the BP should not be in memcached. Obviously with this approach we missed something; as Heikki pointed out in a comment on a previous post, it seems likely we are getting an old version of a page. Logically this could happen if we do not correctly expire a page on get, or if we bypass a push/LRU, leaving an old page in memcached to be retrieved later on.
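For context, the LRU flow looks roughly like this. This is a minimal sketch against libmemcached, not the actual Waffle Grid code: the bf2 key prefix matches the log output later in this post, but the waffle_* helper names are mine.

#include <stdio.h>
#include <string.h>
#include <libmemcached/memcached.h>

#define WAFFLE_PAGE_SIZE 16384	/* InnoDB default page size */

/* Build a "prefix:space:page" key like the bf2:7:1 entries in the
   log output later in this post.  Illustrative format only. */
static size_t page_key(char *buf, unsigned long space, unsigned long page_no)
{
	return (size_t) snprintf(buf, 64, "bf2:%lu:%lu", space, page_no);
}

/* LRU model: a page is set into memcached only when the buffer pool
   expunges it, so a page lives in the BP or in memcached, never both. */
static void waffle_push_on_lru(memcached_st *mc, unsigned long space,
			       unsigned long page_no, const char *frame)
{
	char key[64];
	size_t klen = page_key(key, space, page_no);

	/* A failed set is harmless here: the page is still on disk and
	   the next read simply misses the cache. */
	memcached_set(mc, key, klen, frame, WAFFLE_PAGE_SIZE, 0, 0);
}

/* LRU model: a get also expires the cached copy, since the page is
   about to live in the BP again. */
static char *waffle_get_and_expire(memcached_st *mc, unsigned long space,
				   unsigned long page_no, size_t *len)
{
	char key[64];
	size_t klen = page_key(key, space, page_no);
	uint32_t flags;
	memcached_return_t rc;
	char *frame = memcached_get(mc, key, klen, len, &flags, &rc);

	if (frame != NULL) {
		memcached_delete(mc, key, klen, 0);	/* expire on get */
	}
	return frame;	/* NULL means a miss: fall back to the disk read */
}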

So I was thinking: why not bypass the LRU process? While I still feel the LRU method is the most efficient way to do this, it's not the only way. I modified InnoDB to use the default LRU code, then modified the page get to push to memcached on any disk read. Additionally, I added a second push to memcached when we flush a dirty page. This should (hopefully) ensure that memcached is always up to date. It also means all of your pages will have some limited persistence in memcached, in that they never expire (unless you run out of space).
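The persistent model therefore pushes at two points, roughly like this (a sketch reusing the helpers from the block above; in the real code these pushes would hang off buf_read_page_low() and fil_io()'s write path, and the waffle_* names are again mine):

/* Persistent model: mirror pages into memcached at two points instead
   of waiting for an LRU expunge.  Sketch only. */
static void waffle_push_page(memcached_st *mc, unsigned long space,
			     unsigned long page_no, const char *frame)
{
	char key[64];
	size_t klen = page_key(key, space, page_no);

	/* Zero expiration: the copy stays until memcached evicts it
	   for lack of space. */
	memcached_set(mc, key, klen, frame, WAFFLE_PAGE_SIZE, 0, 0);
}

/* 1. After any disk read completes, so the page is cached right away. */
static void waffle_on_disk_read_complete(memcached_st *mc,
					 unsigned long space,
					 unsigned long page_no,
					 const char *frame)
{
	waffle_push_page(mc, space, page_no, frame);
}

/* 2. When a dirty page is flushed through fil_io, so the cached copy
   is never older than the one just written to disk. */
static void waffle_on_dirty_page_flush(memcached_st *mc,
				       unsigned long space,
				       unsigned long page_no,
				       const char *frame)
{
	waffle_push_page(mc, space, page_no, frame);
}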

We decided against doing Waffle this way previously for a couple of reasons. First, we are duplicating the memory footprint. The LRU method is efficient in that a page is either in memcached or in the BP, but should not be in both. The "persistent" way can have pages in both… that seems like a waste. In fact, with the "persistent" way you could have a 32G BP and a 32G memcached pool that are exact copies of each other. The second reason is recoverability. In the LRU process we do not care if we miss a set. Since a get expires the cache, a get from memcached wipes out the page; so if the set fails, nothing is put into memcached and the page is read from disk the next time. The "persistent" way means a failed set leaves your data potentially out of sync. So we need to code around it: we either retry the missed set, build a blacklist of pages to read from disk, or disable the entire cache when an error occurs.
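To make that failure mode concrete, a persistent-model set would need handling along these lines (a sketch; the retry/blacklist/disable logic is one possible shape, and the fixed-size blacklist is purely illustrative):

/* Persistent model: a failed set can leave a stale page in memcached,
   so the failure has to be handled, not ignored. */
static struct { unsigned long space; unsigned long page_no; }
	waffle_blacklist[1024];		/* illustrative only */
static int waffle_blacklist_len = 0;
static int waffle_cache_disabled = 0;

static void waffle_safe_set(memcached_st *mc, unsigned long space,
			    unsigned long page_no, const char *frame)
{
	char key[64];
	size_t klen = page_key(key, space, page_no);
	memcached_return_t rc;

	rc = memcached_set(mc, key, klen, frame, WAFFLE_PAGE_SIZE, 0, 0);
	if (rc != MEMCACHED_SUCCESS) {
		/* Option 1: retry the missed set once. */
		rc = memcached_set(mc, key, klen, frame,
				   WAFFLE_PAGE_SIZE, 0, 0);
	}
	if (rc != MEMCACHED_SUCCESS) {
		if (waffle_blacklist_len < 1024) {
			/* Option 2: blacklist the page so reads for it
			   bypass memcached and go to disk. */
			waffle_blacklist[waffle_blacklist_len].space = space;
			waffle_blacklist[waffle_blacklist_len].page_no = page_no;
			waffle_blacklist_len++;
		} else {
			/* Option 3: give up and disable the whole cache. */
			waffle_cache_disabled = 1;
		}
	}
}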

Now on the flip side: having the cache be somewhat persistent means fewer memcached sets, especially during the LRU process, which should speed things up. And after reading a page from disk once, you should never have to go back to disk to read it again, so once again faster performance. In hindsight, I am not sure that a set failure under the LRU method is as safe as it needs to be anyway. If you miss a get and a set, then come back online, your data is probably going to be corrupt, so we probably need to add a blacklist/disable-on-failure feature regardless. Another plus is that you can use standard memcached (no need for an MRU) with the persistent method.

I figured that the more persistent approach should eliminate any odd missed steps between the get and the LRU. Basically I am pushing to memcached only when fil_io is trying to write, which happens during a flush. So I gave it a try, coded it up, and it failed with the same damn secondary index error!

I knew that whatever was causing the issue had to be happening between the write and the next read. What would do that? Call it good luck, but I was reading through the Google v4 patches README, on the changes they made to help the insert buffer, and I had a thought: I wonder if I am missing something with the insert buffer? Long story short: yep, I was. Specifically, ibuf_merge_or_delete_for_page. What does this do? According to the comments:

/*************************************************************************
When an index page is read from a disk to the buffer pool, this function
inserts to the page the possible index entries buffered in the insert buffer.
The entries are deleted from the insert buffer. If the page is not read, but
created in the buffer pool, this function deletes its buffered entries from
the insert buffer; there can exist entries for such a page if the page
belonged to an index which subsequently was dropped. */

This function is actually called from buf_page_io_complete… which is called in buf_read_page_low, the function we use for our memcached gets. It runs when reading from disk; however, we skipped calling it during our gets from memcached. By skipping it we were potentially missing insert buffer data… it's strange that we did not see this more often.
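So the fix is to treat a memcached hit exactly like a completed disk read. Conceptually it looks like this (a sketch: buf_page_io_complete, buf_block_t, and block->frame are real pieces of the 5.0/5.1 InnoDB tree, while the waffle_* plumbing and the helpers from the earlier sketches are mine, with error handling omitted):

/* Sketch of the memcached branch in buf_read_page_low(): on a hit, copy
   the page into the block and run the normal io-complete path.
   buf_page_io_complete() is what ends up calling
   ibuf_merge_or_delete_for_page(), merging any buffered secondary index
   entries into the page -- the step we had been skipping.
   ibool/TRUE/FALSE come from InnoDB's univ.i; free() releases the
   buffer that memcached_get() malloc'd. */
static ibool waffle_read_from_memcached(memcached_st *mc, buf_block_t *block,
					unsigned long space,
					unsigned long page_no)
{
	char key[64];
	size_t klen = page_key(key, space, page_no);
	size_t len = 0;
	uint32_t flags;
	memcached_return_t rc;
	char *frame = memcached_get(mc, key, klen, &len, &flags, &rc);

	if (frame == NULL || len != WAFFLE_PAGE_SIZE) {
		return(FALSE);	/* miss: caller falls through to disk */
	}

	memcpy(block->frame, frame, WAFFLE_PAGE_SIZE);
	free(frame);

	/* The missing call: complete the "read" exactly as if it had
	   come off disk, so the insert buffer gets merged. */
	buf_page_io_complete(block);

	return(TRUE);
}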

After some playing around, this eliminated the secondary index errors in the persistent version I was using. In fact, this version was pushing 19K TPM in DBT2 tests, up from 13K with the LRU version. But before you get too excited, it also introduced a couple of other bugs. First, buf_page_io_complete now complains about retrieving the wrong page. Take a look:

147429-9567137-Memcached miss: Block: bf2:0:2 : Thread : 1144666448 time:0
147430-9567198-Memcached miss: Block: bf2:7:3 : Thread : 1144666448 time:52
147431-9567260-Memcached get: Block: bf2:7:1 : Thread : 1144666448 time:125
147432:9567322:090528 11:45:49 InnoDB: Error: space id and page n:o stored in the page
147433-9567395-InnoDB: read in are 7:3, should be 7:1!
147434-9567435-Memcached set: Block: bf2:7:1 : Thread : 1144666448 time:6028
147435-9567498-Memcached miss: Block: bf2:7:147 : Thread : 1144666448 time:43

The code for this check reads:

	if (io_type == BUF_IO_READ) {
		/* If this page is not uninitialized and not in the
		doublewrite buffer, then the page number and space id
		should be the same as in block. */
		ulint	read_page_no = mach_read_from_4(
			block->frame + FIL_PAGE_OFFSET);
		ulint	read_space_id = mach_read_from_4(
			block->frame + FIL_PAGE_ARCH_LOG_NO_OR_SPACE_ID);

		if (!block->space
		    && trx_doublewrite_page_inside(block->offset)) {

			ut_print_timestamp(stderr);
			fprintf(stderr,
				"  InnoDB: Error: reading page %lu\n"
				"InnoDB: which is in the"
				" doublewrite buffer!\n",
				(ulong) block->offset);
		} else if (!read_space_id && !read_page_no) {
			/* This is likely an uninitialized page. */
		} else if ((block->space && block->space != read_space_id)
			   || block->offset != read_page_no) {
			/* We did not compare space_id to read_space_id
			if block->space == 0, because the field on the
			page may contain garbage in MySQL < 4.1.1,
			which only supported block->space == 0. */

			ut_print_timestamp(stderr);
			fprintf(stderr,
				"  InnoDB: Error: space id and page n:o"
				" stored in the page\n"
				"InnoDB: read in are %lu:%lu,"
				" should be %lu:%lu!\n",
				(ulong) read_space_id, (ulong) read_page_no,
				(ulong) block->space, (ulong) block->offset);
		}

So basically, the page we retrieved from memcached was supposed to be space 7, page 1, but when we extract the space id and page number stored in the block we get space 7, page 3, which triggers the error. I have hacked this code up a lot, so it's possible I have a double write somewhere in a path that is not called very often. Time to debug this, but at least we have no more secondary index errors!

I also tried the LRU version with the fix, but adding the call there actually causes MySQL to crash shortly after starting. Obviously that's something I need to look at too.

So the short of it is, we are making progress. I would also love to hear thoughts and ideas on whether people prefer the LRU model or the persistent model we are testing.


One Response to Waffle: Progress and a Rearchitecture?

  1. Jay Paroline says:

    Would it be possible to combine the two approaches, in a way? I might not be following everything you guys have tried, so forgive me if I’m suggesting the obvious.

    Set when pushed out of LRU, delete from memcached on any disk read. That should eliminate redundancy between LRU & memcached, while helping to make sure that memcached doesn’t have stale/wrong data, right?
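    (For what it's worth, the hybrid being suggested would look roughly like this, using the illustrative helpers from earlier in the post: keep the set on LRU expunge, and invalidate on every disk read instead of expiring on get.)

    /* Hybrid sketch, not implemented: waffle_push_on_lru() still runs on
       expunge, and every disk read deletes any cached copy, so memcached
       can never serve a page older than the one just read from disk. */
    static void waffle_invalidate_on_disk_read(memcached_st *mc,
					       unsigned long space,
					       unsigned long page_no)
    {
	    char key[64];
	    size_t klen = page_key(key, space, page_no);

	    memcached_delete(mc, key, klen, 0);
    }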