Notes about my new benchmark

2009-03-07 at 07:21 am Matthew Yonkovit

I am going to get asked later today on this, so I figured I would head it off at the pass. So I am working on a new benchmark, and I am using an early test release of it to do some of my SSD testing. The quick 10K foot overview of what It does.

I build three large tables and several small tables. The large tables contain the exact same structure as each. Why 3? well its a cheap easy way to help ensure that the BP is turned over, and I can use them later to join to each other. One of the things I wanted to accomplish was to test different types of queries… I.e. full scan, index scan with low card, med card, high data skew, etc. So I added columns to help with each. Here is the structure of the large tables:

create table benchmark_a_lrg (
a_id int not null auto_increment,
b_id int,
c_id int,
a_name varchar(200),
a_post text,
a_var1 varchar(100),
a_var2 varchar(50),
a_var3 varchar(10),
a_tint_type1 tinyint,
a_tint_type2 tinyint,
a_int_type1 int,
a_int_type2 int,
a_char_type1 char(6),
a_char_type2 char(6),
a_varchar_type1 varchar(6),
a_varchar_type2 varchar(6),
parent_a_id int,
a_value1 bigint,
a_value2 bigint,
a_value3 bigint,
a_low_card_i int,
a_med_card_i int,
a_high_card_i int,
a_low_card_ni int,
a_med_card_ni int,
a_high_card_ni int,
a_high_skew_i int,
a_high_skew_ni int,
a_adddate datetime,
a_moddate timestamp,
primary key (a_id),
key (b_id),
key (c_id),
key (a_char_type1, a_int_type1),
key (a_varchar_type1, a_moddate ),
key (a_tint_type1, a_varchar_type1),
key (a_int_type1, a_tint_type1),
key (a_low_card_i),
key (a_med_card_i),
key (a_high_card_i),
key (a_high_skew_i),
key (parent_a_id, a_id)) engine = innodb;

Yes very large indeed. My idea here is to be able to use this to search for a multitude of different scenarios, especially in ssd & waffle environments… to be able to do something like “ yeah we improved in full scans, but saw a huge regression in self joins “… I think this type of analysis Is valuable.

I generate data for these structures based on a few different variables. By default I try and evenly distribute the type columns, while I use different methods for the card and skew columns.

After I generate the data I then build a set of meta data that looks for things like counts, the top 100 values for some of the columns etc. This way I can do things like say, I want to select x from table where column = really skewed…

For my Read-only IO test I am purposely using queries that are going to do a lot of IO. Search for thousands of records, do filesorts, build temp tables, etc. Here are the queries I have been using:


select a.a_id,a.a_name,a.a_value1,a.a_var3
from benchmark_a_lrg a where a.a_char_type2=?
select a_id,a_name,a_value1,a_var3
from benchmark_a_lrg where a_tint_type1 in (?,?,?) order by a_value1 limit 100
select count(distinct a_name),sum(a_value3)
from benchmark_a_lrg where a_value2 > ? and a_value1 < ?

select count(*)
from benchmark_a_lrg where a_int_type1 in (?,?,?,?)

select avg(a_value1),count(distinct a_id)
from benchmark_a_lrg where a_med_card_i = ? and a_low_card_ni = ?

select a.a_value2, b.a_value3
from benchmark_a_lrg a, benchmark_a_lrg b, benchmark_b_lrg c
where a.a_id=b.parent_a_id and a.b_id=c.b_id
and b.a_med_card_i in (?,?,?,?,?) and c.b_low_card_ni in (?,?) “;

Now for my mixed load test, I hacked together the following:

select a.a_id,a.a_name,a.a_value1,a.a_var3 from benchmark_a_lrg a where a_id=?
update benchmark_a_lrg set a_value1=?, a_post=? where a_id=?
update benchmark_a_lrg set a_value2=?, a_post=? where a_med_card_i=?
select a.a_id,a.a_name,a.a_value1,a.a_var3 from benchmark_a_lrg a where a.a_med_card_i=?
update benchmark_a_lrg set a_value3=? , a_name=? where a_int_type1 in (?,?)
select count(*) from benchmark_a_lrg where a_int_type1 in (?,?,?,?)
update benchmark_a_lrg set a_value1=?, a_post=? where a_id=?
update benchmark_c_lrg set c_varchar_type2=? where c_low_card_i=? and c_int_type2 in (?,?,?)

Yes there are no deletes or inserts… i wrote this test in 10 minutes. Why? First I was taxed for time. Second I wanted to not dramatically change the distribution of the data. This way when I restart the test I do not need to worry about the data altering significantly enough to effect the next run ( basically it enables me to know if I randomly select an int_type1 I am going to get the same # of records from run to run ). I will be adding inserts/deletes to this asap… but I need to figure out a good way to do it ( may try preserving the deletes and updates and coming back and undoing them, or using higher types for all these so they are easily identifiable, not sure right now. ). While my read-only io test was a 1 for 1, the mixed is not. I call updates/selects on the PK much more frequently then the rest.

I will release a version of this when I feel its polished, the code is just ugly right now and I can not in good conscious release it in such a state.

2 Responses to “Notes about my new benchmark”

  1. Mark Callaghan

    Please share the test code. One test I do for a read-heavy workload is a parallel checksum of all tables. You get to choose the SQL functions that compute the checksum but it avoids the network bottleneck. For it to be fast in InnoDB you need to disable InnoDB prefetching or patch InnoDB to use more than 1 background prefetch thread.

  2. We’re interested in the benchmark code as well. We’re currently working on benchmarks to stress IO bound performance for mixed (insert/update/delete/read) workloads. We posted a simple C++ benchmark that stresses inserts here:

    tokutek.com/benchmark.php

    We added deletes as well, but have not posted the code yet. Mark Callaghan posted a Python version of the insert test with queries added:

    http://mysqlha.blogspot.com/2009/02/code-for-benchmarking.html

    I wonder if there’s some way to combine our efforts into a single OpenSource benchmark since we seem to be driving towards similar objectives.

Leave a Reply