One of the services we (Sun/MySQL) offer is a Performance Tuning and Optimization consulting package. These engagements take several different forms, as each client has a specific need that must be resolved, but there is a commonality between all of these assignments: they all require some amount of bottleneck detection as well as benchmarking. The topic of benchmarking and bottleneck detection is huge and has a very broad audience; I really cannot do it justice in a single post or two. In fact, we are planning a webinar in March (more details forthcoming) about how we go about finding bottlenecks while out at client sites, and even in an hour I can barely scratch the surface.
What I thought I would share are a few of the easier cases of finding and fixing a bottleneck I have encountered. I am still amazed that more people have not performed similar steps to eliminate the easy stuff before I get out on site. So what are some of the easy things you can do?
In my opinion, one of the most important steps in the process of finding a bottleneck, if not the most important, is to identify the issue you are trying to fix as thoroughly as possible. Simply stating “my website is slow” does no one any good. You need to define the specifics. For instance, “my website is slow” could be any of these:
Every May our users renew their credentials and the website crawls during this time
We can only process 100K SOAP calls an hour and we need to process 1 million
Every night @ 7pm our MySQL database hits 100% CPU utilization and the website crawls
Our website slows down when we hit over 100 concurrent users.
The page where people post comments is dog slow, and times out all the time
Each one of these could lead to a very different process for finding the bottleneck in the application. If your specific problem happens to be that the comment-posting page is slow, what good does it do you if I make the rest of the site 5x faster but that single page is still a dog? It's all about giving you the most bang for your buck, and ensuring that we address your pain points. If there is time left over, all the other tuning is just gravy.
Trying to narrow down the scope of the issue should also lead to more questions. For instance, if at 7pm the system is always busy, why? Is there a cron job? Is there a backup running? Is new content deployed then? All of this will help you further narrow down the problem, and with this knowledge you can better proceed to the actual gathering of data. If you are only busy at 7pm, wouldn't that be the best time to capture data? If you know a specific process runs during that time, wouldn't that be the best place to perform a code review? The more specific you get here, the better the chance of a speedy fix.
Armed with your specific problem, now you probably want to start looking at all the appropriate information.
A huge mistake I see over and over again is people looking only at the slow log (often with a long_query_time of 5 seconds or more) or the status variables. Sure, the slow log is great for finding long-running queries, but it will generally only take you so far. Depending on the granularity, you can miss a lot of important information (5.1 allows for sub-second times, but in 5.0 and earlier 1 second was the finest you could get out of a standard MySQL build). Generally I find tools/logs like the general log, sar, iostat, SHOW GLOBAL STATUS, the MySQL Enterprise Monitor, Cacti, and the like as useful or even more so than just the slow log.
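For reference, in 5.1 the relevant logging switches are dynamic and the slow log threshold accepts fractional seconds, so a capture session can be set up without a restart. A sketch, with illustrative values:

```sql
-- MySQL 5.1+: these are dynamic variables, no server restart needed.
SET GLOBAL slow_query_log = ON;
SET GLOBAL long_query_time = 0.5;  -- sub-second threshold, new in 5.1
-- Turn the general log on briefly to capture *every* statement,
-- then turn it back off; it is far too verbose to leave running.
SET GLOBAL general_log = ON;
```

In 5.0 and earlier these were startup options only, which is part of why so many people never adjust them.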
Let me give you a few examples:
I had a client with a single page that killed his performance; the page load time was in excess of 15 seconds. The database had no slow queries, the status variables looked solid... on the surface the server just looked busy. Because we had defined the problem down to a specific page, we isolated their entire stack in a QA environment, turned on the general query log, and hit only that single page. Now armed with a log of every query that page issued, I started to look at what was happening when the page was loaded. In this case the single page actually issued over 13K queries. Using my keen sense of deduction, I deduced this was far more than any normal page should issue. So I pulled out mysqlsla (http://hackmysql.com/mysqlsla) to summarize the general query log. Looking at the distribution of queries, I found the exact same statements executed over and over, just with different variables. So I got a copy of the PHP code and found what I was looking for: nicely nested loops, i.e. select every username, then for each individual username loop through every one of their posts, then for each post loop through each comment. Something like this:
select * from users where userid in (x,y,z)
for each username:
    select * from posts where username='<username>'
    for each post:
        select * from comments where postid='<postid>'
This generated lots of individual statements. By merely changing the queries to do their operations in bulk (1 select vs. 1 select for each row returned) we reduced the number of queries from 13K to under 2K, and dropped the page load time from 15+ seconds to less than 3 seconds.
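The shape of that fix can be sketched in a few lines. This is a minimal stand-in using Python's sqlite3 module (SQLite stands in for MySQL, and the users/posts/comments schema here is hypothetical); it simply counts statements to show why the bulk version wins:

```python
# Sketch of the N+1 query pattern vs. the bulk fix, using SQLite
# stand-ins for the client's tables (schema and data are made up).
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE users    (userid INTEGER PRIMARY KEY, username TEXT);
CREATE TABLE posts    (postid INTEGER PRIMARY KEY, username TEXT, body TEXT);
CREATE TABLE comments (commentid INTEGER PRIMARY KEY, postid INTEGER, body TEXT);
INSERT INTO users VALUES (1,'alice'),(2,'bob');
INSERT INTO posts VALUES (1,'alice','p1'),(2,'alice','p2'),(3,'bob','p3');
INSERT INTO comments VALUES (1,1,'c1'),(2,1,'c2'),(3,3,'c3');
""")

def page_nested():
    """The slow version: one query per user, then one per post."""
    queries = 1
    cur.execute("SELECT username FROM users")
    for (username,) in cur.fetchall():
        cur.execute("SELECT postid FROM posts WHERE username=?", (username,))
        queries += 1
        for (postid,) in cur.fetchall():
            cur.execute("SELECT body FROM comments WHERE postid=?", (postid,))
            cur.fetchall()
            queries += 1
    return queries

def page_bulk():
    """The fix: fetch everything in a single joined statement."""
    cur.execute("""
        SELECT u.username, p.postid, c.body
        FROM users u
        JOIN posts p ON p.username = u.username
        LEFT JOIN comments c ON c.postid = p.postid
    """)
    cur.fetchall()
    return 1

print(page_nested(), "statements nested vs", page_bulk(), "statement bulk")
```

On the toy data the nested version issues one statement per user plus one per post on top of the initial select, while the joined version issues exactly one; on the client's real data that gap was 13K vs. a handful.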
Really, the key data here was missing from the slow log (the small, individual primary-key-based lookups); by narrowing the problem to a single page and then analyzing what that single page did, we were able to successfully determine what was happening internally. Many times a small change like the above can make all the difference in the world.
Several times I have run into clients who need to process a certain number of transactions every few seconds, minutes, or hours. My goal in these situations is to ramp up the load to first meet their requirements and then see how high they can go. In these cases I find it best to generate the transaction throughput using the entire application stack while watching for bottlenecks (after all, the bottleneck could be web, app, network, or db related, etc.). Using the full stack generally means benchmarking and load testing is more realistic and more applicable to your goals, because you are mimicking your production process. Far too often I have seen database systems capable of handling X number of transactions, but the weight of an inefficient network or application server gets in the way when running the system in the real world. In fact, one client I worked with kept falling far below his expected transaction levels; when we looked into network performance (insert plug: he was using the nifty networking graph in MySQL Enterprise Monitor) we noticed he was hitting the upper limit of a 10Mb/s half-duplex network, and lo and behold his servers' network interfaces were set to 10Mb/s half duplex (side note: network speed/NIC configuration is generally one of the key items I suggest people look at right away).
But what happens when the full stack is not available? Or you cannot generate realistic user load? It varies. With one client we tried to scale his application up to his required “transaction rates” but his application would not support it (the application server died). Did the application server die because of the database? Not likely, as the CPU on the DB server was under 20%, with no errors and no issues on the surface... but we had to be sure. In this case we had to analyze what we did have available to us. We were easily able to grab the SQL for one transaction (using the general log). Looking at the SQL for that single transaction, we determined that even if the same data came into the system over and over again, we would always be adding new records. This logic enabled us to replay the same transaction code over and over again using mysqlslap until we reached the desired transaction rate. This verified that the database could indeed handle the client's desired load and that the bottleneck was elsewhere, enabling us to move on and look at the other components.
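mysqlslap did the heavy lifting for us, but the underlying idea, replaying the same write transaction from many concurrent threads and checking the totals, is easy to sketch. This toy version uses Python threads against SQLite (the orders table is hypothetical, and a lock stands in for MySQL's own concurrency control):

```python
# Replay one captured "transaction" concurrently, mysqlslap-style.
# SQLite plus a lock stand in for MySQL; the orders schema is made up.
import sqlite3
import threading

conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE orders (orderid INTEGER PRIMARY KEY, item TEXT)")
lock = threading.Lock()

def replay(iterations):
    # The captured transaction always inserts a new record, which is
    # what made it safe to replay any number of times.
    for _ in range(iterations):
        with lock:
            conn.execute("INSERT INTO orders (item) VALUES ('widget')")

threads = [threading.Thread(target=replay, args=(100,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(total, "transactions replayed")
```

A real run would point mysqlslap (with --concurrency and --iterations set toward the target rate) at the actual server, then watch CPU, I/O, and network on the database host while it runs.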
Certainly there are many other tools and ways to generate realistic load on a system. For instance, in the above case we could have looked at replaying the general log from the full-stack test that failed. While that load did not exceed the required throughput, doubling or tripling the number of concurrent times the same log is replayed might have given us a good picture. There are other good ways to solve the above scenarios, but these worked well in these situations.
Even what I have written here on benchmarking and load testing seems light... maybe I will post more later.