How many fat marathon runners do you know? I have yet to see anyone run the Boston Marathon who looks like me. What about a fat man competing in the 100-yard dash at the Olympics? No? It seems to make sense, then, that a fat, bloated index or primary key may not be able to compete for the performance crown. As you may have read, one of my common performance mistakes is using incorrect or bloated datatypes in your table design (shameless plug: stop by the MySQLCamp unconference at the UC and hear me babble on about common PT mistakes). Unfortunately, even wise datatype selection can result in suboptimal indexes. Raise your hand if you have heard someone tell you not to use a char/varchar as a primary key in InnoDB.
Not everyone has. Two weeks in a row makes a trend, and a trend means I should probably blog about this. I ran into a pair of clients who are using char/varchar primary key fields in InnoDB. I strongly urge my clients against doing this, for many reasons. The most obvious is that InnoDB's primary key is used as the row reference in all your other indexes, which means a 40-byte-per-row primary key will increase the size of every secondary index by 40 bytes per row. This really adds up, and you can end up with a badly bloated disk footprint. Also, as I have harped on before, a bloated disk footprint generally means a bloated memory footprint. So you're going to fit fewer rows in memory and do more disk I/O. It really is a nasty effect.
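To put rough numbers on how that per-index overhead adds up, here is a back-of-envelope sketch. The row count, index count, and key widths below are illustrative assumptions, not measurements from either client:

```python
# Back-of-envelope estimate of how a wide InnoDB primary key inflates
# secondary indexes. All figures are illustrative assumptions.

rows = 18_000_000          # assumed table size
secondary_indexes = 4      # assumed number of secondary indexes
wide_pk = 40               # bytes: char/varchar primary key
narrow_pk = 4              # bytes: INT auto_increment primary key

# Every secondary index entry carries a copy of the primary key value,
# so the per-row difference is multiplied across all secondary indexes.
extra_bytes = rows * secondary_indexes * (wide_pk - narrow_pk)
print(f"Extra index footprint: {extra_bytes / 1024**3:.1f} GiB")
```

Even with these modest assumptions, the wide key costs a couple of gigabytes of index space that then competes for buffer pool memory.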
So what can you do? Well, the quickest fix to try is adding an auto_increment column, making it the primary key, and then making your former primary key a unique key. Most of the applications I run into can handle this type of change with minimal effort. Something like this:
alter table mytable drop primary key, add unique key (pk_column), add column autoid int not null auto_increment, add primary key (autoid);
Now keep in mind that since the InnoDB primary key is a clustered index, rows are stored on disk clumped together in primary key order, so this type of change may have little, some, or no impact on disk I/O depending on how your queries are searching for data.
I tested this out last week for a client who had a large 36-byte primary key, and I saw a reduction in table size anywhere from 30%-60% simply by adding the new primary key. From a performance perspective we were seeing gains of anywhere from 1.5-4x depending on the query.
The other side of the primary key is the amount of time it takes to do a string compare internally. Comparing and searching for a 4-byte integer should be much quicker and more efficient than a 20-byte string, but how much more?
This week I was working with a client who was doing simple primary-key lookups on a 20-character customer number. The table had only one index, the primary key. They wanted to push through as many selects as they could on this system. Memory was not a concern, as they had 10x more memory on the box than they had data in the database.
The client had actually written a test specifically to hammer their workload through the system. The app was simple: spawn x threads; each thread grabs a connection from a pool, then executes a really simple primary key lookup (select * from mytable where pk = 'xxx'). The number of these lookups was configurable from their app. Using this app we kicked off the benchmark from 2 servers, each with 20 threads. The database server we were testing against was an 8-core machine with 48GB of memory, and the table contained about 18 million rows. Everything should have fit into memory, and the cache was pre-warmed prior to the test. So what did we get?
About 17K queries per second. Let's take a look at the CPU for this test:
That's a lot of extra sys CPU; I really should check oprofile or something similar to see exactly what is being called here.
Now for this particular workload we altered our table to not only switch to an auto-increment primary key, but also to add a column storing the crc32 of the old primary key. Something like this:
alter table mytable drop primary key, add column autoid int not null auto_increment, add primary key (autoid), add column mycrc32_value int unsigned not null default 0, add key ( mycrc32_value );
Then we updated the table to store the crc32 value of pk_column:
Update mytable set mycrc32_value=crc32(pk_column);
Now, you would need to make sure this value is set on insert and kept current on update as new data comes in, but the idea is that it is more efficient to search an int than a larger character field. We changed the query from this:

select * from mytable where pk_column = 'xxx'

to this:

select * from mytable where mycrc32_value=crc32('xxx') and pk_column = 'xxx'
Why do we keep pk_column in the where clause if we have the hash lookup? Because we need it to prevent hash collisions: occasionally crc32 will produce the same hash for two different values. By including this extra check we guarantee you're going to get what you ask for.
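The same pattern can be sketched from the application side. This is an illustrative Python sketch (the table and column names follow the post; the insert and lookup strings are just examples, not wired to a real connection). Python's zlib.crc32 computes the same standard CRC-32 as MySQL's CRC32() function, so the application can precompute the value it stores:

```python
import zlib

# Compute the same CRC-32 value MySQL's CRC32() would return for the
# string, as an unsigned 32-bit integer.
def crc32_of(customer_number: str) -> int:
    return zlib.crc32(customer_number.encode("utf-8")) & 0xFFFFFFFF

# Hypothetical insert: compute the hash once and store it alongside
# the original key so it stays in sync with pk_column.
pk = "CUST-0000000001"
insert_sql = ("insert into mytable (pk_column, mycrc32_value) "
              "values (%s, %s)")
insert_params = (pk, crc32_of(pk))

# Lookup: the index narrows the search by the int hash, and the exact
# string comparison guards against crc32 collisions.
lookup_sql = ("select * from mytable "
              "where mycrc32_value = %s and pk_column = %s")
lookup_params = (crc32_of(pk), pk)
```

Computing the hash in the application also keeps the query free of the crc32() function call on the server side.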
But why not index the pk_column then?
Well, you could, but here I am willing to bank on the crc32 hash being unique enough to filter the result set down to a handful of rows. I can then search through that handful fairly quickly regardless of whether the column is indexed or not. Leaving it out of the index saves extra space in the meantime, so it's a trade-off.
So what is the net result of the change?
Yep, that's nearly 35K queries per second. And take a look at the CPU impact:
The time spent doing system tasks dropped off the face of the earth.
No, I do not expect this type of drop-off everywhere you try this, but it's definitely worth trying if you find yourself querying text fields often (it does not have to be the primary key, either).