Follow up to my Common Disk Issues

Ahh seems like a few people do take the time to read my blog:) Peter Z Commented here on my common disk performance mistakes post. He makes some great arguments, and you may want to give it a read. While he does not agree with everything I say it is interesting to see his views. Remember different folks have different experiences and a lot of times there are multiple roads on the path the performance nirvana.

Let me start off saying I wholly admit that saying “everything” is a disk issue is a dramatic exaggeration. And i did not specifically say disk, I said “The problem is always an IO problem”, more on that later. I have run into my far share of issues outside of this sphere ( network, context switching, cpu ), but I still find disk performance to be by far the most common issue effecting systems I deal with. In my opinion It is the reason memcache had to be invented, the reason for explosive growth in the amount of memory in servers, etc. If all of your disks performed even at half the speed of memory the urgent need to fit everything into memory would not be so urgent ( Now that sounds silly ). This is not a MySQL specific issue either I was dealing with disk performance issues since the mid 90’s in other database systems.

While I completely agree IO/disk issues are not generally an issue when you can fit most of your data into memory, they do sometimes crop up in these installs and at the worst time. My point here is don’t buy 64GB of memory and a single drive:) All the gloriously speedy memory in the world will not help you when you need to load the data for the first time ( worst time something like Christmas rush for an e-tailer, when they bring up a system after a crash, basically not only did they have an outage they have to suffer through sluggish performance ) … Another big one is at some point your data may not fit into memory, while hopefully you can plan for it, sometimes it slaps you upside the head. Their are also things like flushing data to disk may take longer. Heavy write loads, etc. So while lotso memory can help mask or hide disk IO issues, they still exist. If you know their their and can plan for them awesome! But eventually something will need disk. A common topic I heard from many at the UC ( including Peter ) and a philosophy I try and live by is plan for the worst ( or tune to the edge case ).

100% completely agree that storage size should be considers. But I would size for IO first. The example Peter gave is that Sphinx or an Archive database will hit a space bottleneck well before an IO one. This gets back to knowing your IO. In those instances if you know your IO usage will use for example 200IOPS at the max, two 500GB monsters should handle both the IO and may give you the space needed for your app. I would never put 100IOPS capacity in solution that required 150IOPS.

I have point out something Peter brought up in the following statement:

“Another mistake is to measure IO capacity in spindles – So your hard drive can do 200 random reads per second and your SSD drive can do 10000 does it mean if you have array with 50 hard drives it would perform as good as SSD ? Not really. Leaving all potential serialization issues along you need enough concurrency to utilize multiple hard drives and with 50 drives you would need at least 50 outstanding requests all the time to fully utilize them. So for example these 50 drives will unlikely be helpful solving replication delay or speeding up this 3 hours reporting query or 5 hours ALTER TABLE
command.”

Adding harddrives MAY in fact speed up replication delays or make a 5 hour Alter table command run fast, but it really depends on the system. 50 Harddrives properly configured in say an emc array should blow away a SSD drive. Their a few hidden gems to think about. Their are a few differnt issues that can pop up around disk, but the most common is disk contention. You need to move from one spot on the plater to another in order to retrieve data to service a request. That arm movement time is generally what kills you. Two requests may come into the same disk and be in oposite locations. A simple example to a really complicated device ( I know we can get into caching and queuing, etc but lets stay simple)… lets flatten out the disk and think of this as how a disk works:

(* is where the arm is reading data, R is a request in the queue )

A B C D E F
*     R    

My Arm Reads data A, and data on D is in the queue to read. In order to move my ARM from A to D I have to cross over B&C. This is generally the Average service time of a disk ( other stuff does get sprinkled in ).

A B C D E F
      *    
R     *    

Now I move to D, a second request needs Data in A needs to be accessed again. The time I wait to move from D to A is roughly the avwait time.

With two disks or more the theory is this ( In a perfectly striped world, remember really dumb simple example ):

A B C
*    
*    
D E F
*    
*    

So now the arms are in position to read from data in A and D. We will reduce the wait time in this example because the requests can be processed at the same time on different disks, and any service time should be reduced because are arm is positioned correctly. Like I said oversimplified example, but the gist is correct. Now in this perfect world I agree the concurrency makes a huge difference. 8 cores means at any nanosecond only 8 things can really be active. This means that a break even point for most systems would logically be adding no more disks then the number of cores. While as a general rule I could live with this, this is in fact not always the case. What is nice is the Linux kernel can throw a request into the disk queue and then process another request, which can then throw another request at the disk queue. If your innodb concurrency settings are too high you can easily build up an excessive queue. But if even at more reasonable levels it can be an issues. Because we do not ( and do not want to ) control specifically where data ends up residing on disk when it is striped their is a likelihood that 2 or more active processes will ask for data that has to be processed from the same disk. The likelihood of this increases as you add additional data to the drives ( so the closer to 100% capacity the more likelyhood for contention ). So to get truly the best performance adding additional disks will in fact reduce the likelihood of disk contention. Their are however diminishing returns. With 2 disks you are really 50-50 whether any two requests are trying to access the same disk, 4 disks 25% chance, 8 disks 12.5% chance, etc. Performance gains will also yield diminishing returns, going from 10 disks to 12 may only yeild microsecond improvements. The key is know your io patterns. Know your bottlenecks. For someone 12 disks maybe overkill, for someone else it may yield huge benefits.

Peter mentioned that altering a table would unlikely yeild a performance increase from adding disk. If you alter a table that takes 5 hours and the AVWAIT time on the system ( iostat -x ) is really high its a very safe bet adding disk will help the process. Avwait is generally a measure of disk contention. So in this case here is my hypothesis: Say your wait time is 50MS and you do 1000 disk requests that equates to 50Seconds of time, if through magic ( where is that ibm pixie dust? ), a better controller, or more disk you reduce the time to 25ms you can shave off 25 seconds. Less waiting, less contention faster requests.

Bottom line: Their are really a couple of ways to get faster disk times: #1 Faster disk ( i.e. ssd with <.1ms response time ) , #2 reducing contention, more disks … #3 add more cache/memory.

Completely agree with peter on my memory diagram. Its my test not yours! Benchmark your data. It is included only as a guideline to illustrate a point, memory can help overcome disk issue:) But the point of the graph and analysis is try and fit your data into memory!

We definitely agree on the separation of certain file systems. In commodity hardware with external disk I generally recommend 2 disks for the OS, tmp, swap, os logs, etc. Then a separate set of disk for the database. 99% of the time I will generally recommend 4 disks in a RAID 10 or Raid 5 depending on the access patterns of disk. Here is something else. If you have 8 disk slots in your system, and going to go with 6 disks for the OS and the database, why not spend the extra $300-400 and fully populate it? It generally is a small expense that will save you the hassle latter on of having to rebuild your raid set if you need to add disk! Disk is cheap.

Remember I said “The problem is always an IO problem”, this can mean disk… but it can also mean io in memory, or io in queries. I consider queries that read through to many rows an IO problem. The more read to disk you have to do, the higher the chance of disk contention, not to mention disk reads are measured in ms, memory in nano. The majority of gigs I am doing tend to have IO problems that can only be solved by reducing the amount of data a query has to retrieve and look through. Here the proper indexing strategy is your best friend. Just yesterday I was looking at a query that had to search through 1.5 million rows, with the proper index we were able to get that down to 250K ( 10X decrease ). This means that query could potentially do 10 times less requests to disk or memory. Also Just because all the data is in memory does not mean you should not worry about reducing the number of rows your queries use. Less rows = less memory = more varied stuff in memory. The query that need 1.5 million rows effectively wasted 1.25M rows worth of space.

Good stuff! Awesome discussion going on here though!

This entry was posted in Common Mistakes, hardware, mysql, performance. Bookmark the permalink.

3 Responses to Follow up to my Common Disk Issues

  1. Peter says:

    Matt,

    Sure system needs to be balanced and this is common mistake. I just had one guy who bought box with 16 cores 32G of memory and single disk as MySQL box – not a good choice.

    Now regarding ALTER TABLE or replication – multiple drives have some affect but not as much as you expect. You still have requests issues one after another which need to be services.

    SAN is no magic – you either have cache (we really do not take this into account as if it would be in SAN cache it would be in buffer pool unless it is initial start or something) or you have multiple data copies and you can find one with smaller seek than original drive.

    Though do not over estimate it – you still will have some seek (though track to track seek is fraction of ms these days) and most important you have rotation latency. Disk spinning at 15000 RPM makes 250 rotations a second. In average you would have to wait half of rotation before your data piece will be under the head which makes it 2ms for the same hard drive.

  2. Peter says:

    Matt,

    I am not sure what number of cores has to do with number of hard drives. Controller can execute IO at the same time as CPU is busy doing something else.

  3. matt says:

    IN Re: # of cores
    Here is the way I view this: In a dual core machine at any point in time how many processes can make an active call to IO ( make the request and get queued? )? 2, because at any slice of time at most 2 processes are active ( we are talk less then nanosecond ). How many can make the io request with 8 cores? 8, same as two because the 8 cores will be making their requests at that single point in time. So the queue depth on the controller with two cores ( in the worst possible case where all requests require IO ) will get 2 requests, then the queue will grow to 4,6,8,etc. The 8 core machine under the same circumstances will see its queue grow from 8 to 16, to 24, to 32, etc. So the queue is filling up much, much faster. This is going to cause more contention.

    Worst case is your 8 requests that were coming from those 8 cpu processes are all for the same disk in vastly random places. This is going to slow down the queue for that disk. With 1 disk it is a guarantee, with 2 its 50-50, etc.

    Now with advanced queuing techniques in controllers this is lessened some, but it is still not 100% eliminated.