Over the past several years whether I am working with a small company or a fortune 500 client I have seen lots of issues and mistakes made around the configuration, setup, and ongoing maintenance of disk on Unix and Linux Servers. This is not only an issue with MySQL shops, rather it can be an issue with all database setups whether it is Oracle, DB2, or Sybase. Neglecting the disk is setting yourself up for long term issues. These mistakes often force companies to throw more and more hardware at the problem… Lets look at some common mistakes around disk:
#1 The problem is always an IO problem, and remember spindles not capacity
In performance disk is everything. It makes you happy, it makes you sad. Learn to love it. Learn to hate it. Learn to understand it. Everything eventually comes back to disk. Take this example: Lets say you have a bad query that you fix with an index. Why was it slow? The easy answer : was a missing index. But really the answer is the missing index caused the query to read too much data from disk. The index merely reduced the number of ios that needed to be used to complete the query. In a perfect world the disk would be fast enough to return the query without you even realizing it was missing. But our world is not perfect. So what are some disk related performance tips?
Do not size your systems for capacity, size it for performance. I often walk into a client site and they are sized for space. For instance 2 500GB disks mirrored running a 400GB database. That in an of itself does not tell you much ( In fact a 400GB database running on 2 disks can be very fast given the right workload). The problem is many clients are running databases that have an IO footprint that requires a much higher number of disks. For instance my single raptor was able to pull off around 170 random iops. So 2 drives could “theoretically” push 340. If you had a system that peaked at 600iops this would be way to slow. The exact number of needed drives varies… (and unfortunately the formula to figure out the exact number of drives is very complicated before you have production sized load. Its much easier if you are migrating to something new, after all hind sight is 20-20). If you know the # of IOPS you need to support you can start benchmarking drives and drive configurations to try and meet that. To try and get the IOPS you can use Sar or iostat to grab statistics from your running boxes. Personally I think you need to look for the peak load and add 20% at least.
Straight IOPS alone is not enough. One of the other key components in disk performance is query and memory usage. Are your queries reading all the data or part of the data? How much of the data needs to be queried? How much can you fit into memory? You basically need to understand the system inside and out. This helps you determine how much memory -vs- disk you require.
Let me talk a little bit about memory usage and its impact on performance. On our 400GB test system mentioned above if most of your queries read the last 30 days worth of data and you have 1 year stored online that means about 8.2% (30/365) of your data is “hot”. Lets round to 10% because this is not an exact science and I would rather overcompensate then under. 10% of 400GB means 40GB worth of data is going to be active at any one time ( theoretically ). Ideally you want to have this always be accessed from memory if possible. So go out and buy 64GB of memory, so you access times will be great ( after warm up most of your hot data should stay in cache). But because many customers do not put that much memory into the server you have to start looking at your IO. With a 16GB system with 10GB allocated to the innodb buffer pool this means that 30GB of data or 75% will be read from disk. So you could assume that 75% of your IOPS is hitting disk. That means you want to take this into account and bump up the number of IOPS you are shooting for. To illustrate memory’s impact on query and disk performance I do have some concrete numbers from a previous benchmark. Lets go back and look at my pervious SSD tests.
This graph shows the Transactions per minute (TPM) for drives with different amounts of allocated innodb buffer pools. By looking at this you can kind of extrapolate the cost of disk io vs memory. With a 10K drive when 4.5% of my dataset is in memory I was able to hit 285TPM, but when I increased that to 25% I was able to hit 625TPM. Which means by using disk instead of memory in this case it caused this benchmark to run 2.2 times slower. But add in faster disk (in this case solid state) and the 4.5% number with a single disk is 2.5 times faster, and when looking at the combination of faster disk + more memory (the 25% test) was 18X faster. That is a huge bump. But you do not need SSD to get these results. I am sure that adding another 4 disks will yield somewhere between the 2.5X increase and the 18X increase. This type of performance boost could be the difference between happy and ticked off users.
Another key piece of data is disk latency. If you are pushing 200IOPS but your response time is 40ms something is terribly wrong. Generally I start to get worried when my avg wait+srv time is > 10ms… that is not 10ms per query, but 10ms per disk io. 1 Query could cause 1000′s of disk IO’s, so cutting this down can have huge benefits. Lets say you average 10ms latency over a 10,000 io query. That means 100,000 ms is spent in the disk subsystem ( or 100 seconds ). This means by making that component faster, say reducing to 7 ms your query time should see a reduction by 30 seconds! Basically an extra disk or two ( or more cache ) will result in lower latency, faster response times, and happier users.
What does all this babbling on mean? Benchmark your disks. Use standard benchmark tools if you have to, For instance you can setup sysbench/dbt2 to emulate your environment. Alternatively you could stress out your disks with other benchmarks like bonnie++ or iozone. In a perfect world you would do a controlled test something like run production load against your servers to see the exact performance impact. Run the tests with multiple disk configurations, this allows you to get a baseline and determine exactly how much IO load you can handle, and what configuration is best for your environment. Know your system! More disks generally mean faster performance ( there is a direct correlation ). Watch disk latency.
#2 File system and LUN Layout:
Several years ago amongst Oralce DBA’s their was a philosophy that was widely adopted called SAME http://www.oracle.com/technology/deploy/availability/pdf/OOW2000_same_ppt.pdf ( Stripe and Mirror Everything ). The idea was more spindles equates to faster io and higher rate throughputs. So take every disk you have and make 1 giant lun. This helps alleviate management issues often associated with JBOD or trying to balance multiple datafiles over several individual LUNS. I have seen this philosophy adopted widely in most MySQL shops I go into today. The problem is most clients or hosting providers get over zealous with this. Same was designed for large disk arrays, not server with 2 or 4 drives. I am going to look at this from 2 angles. The first is availability, the second performance.
For many companies I see the single giant Lun. /, /tmp, /boot, /home, swap, mysql, etc are all on the same lun. In fact many time this are all under / ( single mount). This in the minds of many admins makes sense. You reduce management and monitoring in one swoop. The problem is this setup can cause your mysqld to become unresponsive. How? Easy, lets say a hacker try’s to brute force your system ( don’t say the admin should detect this, because they should… but I have seen many who are not, and I ran into this example ). This brute force attack will log messages into your log files. The logs fill up the available harddrive, and wham MySQL can not extend, so it hangs. Or how about someone dumps a giant file to /tmp and fills the / fiesystem? Seen that happen. Or you run a special backup and redirect it to your home dir, the backup consumes all available space… unlikely you say? Nope, I have seen it happen more times then I care to remember. Creating a separate mount point for file systems like /var/log , /tmp, /home, etc can help reduce the risk of this happening.
How about performance? If you put the OS, Swap, and the database on the same disks you are just asking for issues. On a side note: I know some people out there think turning off swap is a great idea, but in my mind its an absolutely horrible thing to do. Do you know how many times I have run into a setup where swap is used? Lots. It’s not used because the admin does not know his application well enough, its used because its the “oh crap” last line of defense against someone doing something stupid. Ask yourself would you rather have your database slow or dead? I know the argument is sometimes the server can go unresponsive. When I see this, it is generally because everything is shared on the same disks. When swap and the database are thrashing on the same physical disk it is very, very bad. Really any additional overhead on the disk can be bad. Let me give you an example. I was working with a decent size DB2 database a few years ago, on a big box San, with 1 giant disk set. I was grabbing a backup from tape that I needed from the previous night writing it to my home directory on the box. Every alert we had on the system started going off, query response time started going insane ( like 30 seconds, vs 10ms normally ). I killed the restore, the response time dropped back to normal. I kick off the restore again, bad response time again. Iostat ( which is your friend! ) was showing 300ms response times. Why did a single restore hammer the system? It was writing a ton, flooding the cache on the controller and exposing the disk to tons of writes on top of the normal load. I can get into a bunch more technical details, but the point is there is nothing magical or technical about this, most performance issues can be traced back to disk. By sharing non-database related workloads on the same physical disk you are risking having non-database related work impact your system.
#3 The myth of the blackbox
No matter what the vendor tells you there is no such thing as “self managing”, “pain free”, “0 Administration”, or “Black Box” Sans. Because disk access times can have a dramatic impact on performance you want to make sure you understand your San and what is happening behind the scenes. Case in point several years ago I was working with a leading storage vendor’s SAN. The sysadmin in charge of the SAN treated it like a black box (“its black magic voodoo, we are not meant to understand it”). Even Sar and Iostat showing 100ms + response times during peak load would not dissuade him from his black box ways. Once in a while on the systems performance would get better for 1 database, and horrible for another. Everything pointed to disk IO. After a several month battle we were able to bring in an expert to look things over. It turns out that the majority of the database data was spread amongst 10% of the physical drives, the rest were filled with archive data and in a different storage pool. Additionally the “self managing” black box moved chunks of hot data from disk to disk ( 4GB chunks ). So a chunk that was really active kept moving to disks that were sort of active. This made the sort of active disks, really active. This movement in the backend caused the bottleneck to move around.