An often overlooked and underrated component of any site is the collection and reporting of solid historical performance metrics. Don’t get me wrong, everyone collects some sort of web page stats, has access to MySQL status variables, and uses top to check CPU and process stats (what else could anyone need, right?). But the metrics I am talking about are not the simple number-of-pages, number-of-users, apache-log metrics that everyone has in their web server, nor are they point-in-time CPU stats or cumulative rows-accessed counters. No, we are talking about true performance stats, gathered over a long period of time. These stats can be used for troubleshooting and trending, and they make great art and conversation pieces (have you seen my one-of-a-kind CPU graph?).
Think of it this way: someone complains your site is slow. They say when they hit the first page it just seems sluggish. What does sluggish mean? Ideally you should be able to say that page 1 took 3 seconds to load last week and this week it is taking 5 seconds. That enables you to drastically narrow your scope to figuring out where the extra 2 seconds came from. If you rely on a gut feeling, then you never really know if you fixed the problem, or even where it came from. We will talk about page load times in a few paragraphs. The same holds true for any performance issue. If your site is slow and you run “top” and see the current CPU is at 50%, what does that mean? What context is that in? Was it at 50% all morning? Does it normally run like that? Or did it just spike up? There are too many questions that, unless you are proactively monitoring and collecting data, you may never be able to answer. This is especially true on the database side, where you need more than just the MySQL slow query log, the general log, or the other easily available logs and stats in the database (though these do help). When tracking down performance issues it is key… no, actually it is absolutely imperative, that you have something to compare the current numbers to.
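To make the “3 seconds last week, 5 seconds this week” comparison concrete, here is a minimal sketch in Python (with made-up sample numbers, since the real values would come from whatever store you log load times into) of checking a page’s current numbers against a baseline:

```python
import statistics

# Hypothetical load-time samples (in seconds) for one page.
# In practice these would come from your collected historical metrics.
last_week = [2.9, 3.1, 3.0, 2.8, 3.2]
this_week = [4.8, 5.1, 5.0, 5.2, 4.9]

baseline = statistics.mean(last_week)   # 3.0s
current = statistics.mean(this_week)    # 5.0s
slowdown = current - baseline           # the 2 seconds you need to explain

print(f"baseline={baseline:.1f}s current={current:.1f}s slowdown={slowdown:.1f}s")
```

Now instead of a gut feeling you have a concrete number, and the question becomes “where did these 2 seconds come from?” rather than “is it actually slow?”.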
On the database/OS side there are a couple of tools that I find help a lot. The first, of course, is our enterprise monitoring tool. There is a lot already written on the MySQL Enterprise Monitor (EM), so I will keep it short. EM is a very solid tool that appeals to a broad audience. I personally find it extremely useful (and I am a seasoned DB veteran who is very picky). Client sites, especially those shops without deep MySQL technical knowledge, find it to be a lifesaver. Another very solid tool is Cacti (http://www.cacti.net/). While Cacti lacks the “technical knowledge base” and some of the more advanced Enterprise Monitor features, it does a pretty good job of giving you access to historical metrics.
Let me give you a couple of examples of how I have used EM in the past. Here we have a CPU utilization graph from a client over several days. Things look busy, with sustained spikes up to 70% CPU utilization.
After adding a few indexes and fixing some other issues, we see an immediate effect on our CPU:
Here is the full graph:
Because we were using the Enterprise Monitor to collect and watch the system while we were tuning, we were able to quantify our performance improvement. These graphs don’t just show off our improvements; they also help us identify issues. Take a look at the network graph here:
With this client, we found that their application froze during the above spikes in network traffic. A little digging determined that the client was running in a lab environment that only had 100 Mb/s network connections. They maxed out the pipe between the application and the database and experienced a mysterious “freeze”.
As database people, we tend to put too much emphasis on MySQL and not enough on the whole picture. Even when tuning database performance, sometimes it’s best to look outside the database for clues. A holistic view of your application can help target your efforts and give you a quicker resolution than hunting and pecking through the various layers. Different applications have different key metrics, so make sure you understand which metrics are important for yours. For dynamic websites, a great source of information is page load times. In fact, load times can be one of the most beneficial metrics when looking into overall web performance. It’s amazing that many companies simply choose to ignore collecting page load times (or SOAP/remote call times, if your application does that sort of thing). When I say page load times, I mean that index.php took 3.5 seconds to load and myinfo.php took 5 seconds to load. These give you a concrete number and a place to look when people report problems. Wouldn’t it be great to be able to say: “Oh, the user reported problems when loading index.php? Hmmm, normally that page takes 3.5 seconds, but today it’s taking 10 seconds. That’s unusual, let’s look into it”…
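Collecting these numbers does not have to be elaborate. Here is a rough sketch of the idea, in Python rather than PHP (the decorator name `track_load_time` and the handler `render_index` are my own inventions for illustration): wrap each page handler so its load time gets logged, and over time you build up exactly the history you need.

```python
import logging
import time
from functools import wraps

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")

def track_load_time(page_name):
    """Log how long a page handler takes, tagged with the page name."""
    def decorator(handler):
        @wraps(handler)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return handler(*args, **kwargs)
            finally:
                elapsed = time.monotonic() - start
                # These log lines are your raw historical data; ship them
                # to whatever stores your metrics.
                logging.info("page=%s load_time=%.3fs", page_name, elapsed)
        return wrapper
    return decorator

@track_load_time("index.php")
def render_index():
    time.sleep(0.05)  # stand-in for the real work of building the page
    return "<html>...</html>"

page = render_index()
```

The same wrapper works just as well around SOAP or other remote calls; the point is simply that every request leaves behind a timestamped number you can compare against later.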
More often than not, without accurate metrics like page load times, you resort to a needle-in-a-haystack approach to performance resolution:
- user reports slowness in a certain section of the site
- web server/HTTP looks OK: same number of visitors, same number of pages accessed, my simple page load test shows normal times
- Unix servers look OK: CPU looks fine, memory OK
- network looks OK: no spikes, humming along
- database looks OK: database metrics look fine, no slow queries, no locks…
- ??? well, all the pieces are OK… go get more information…
- insert finger-pointing between the various groups
- repeat; eventually get a developer involved…
I have been part of this process. Generally all the admins can check the general status of things rather quickly… but any sort of deep dive ends up being a search for a needle in a haystack. Even if you did find a few slow queries on the database side, how do you really know which page they came from? Even if your Unix server is running hot, what if it’s because of some other process and not “page” slowness? You end up generating a lot of false positives here.
How many times have you seen this scenario: someone reports slowness, someone looks and finds something running on the system, say a backup, and immediately says the page is slow because there is a backup running. The issue is dropped, until the next day when someone else reports the same issue, only this time there is no backup running. So you look again. It’s a vicious cycle, and it can cost the company lots of time and money.
With accurate metrics like page load times, your resolution tends to be much more targeted. An example:
- user reports slowness on certain areas of the site
- check baselines for that page against the current numbers; yep, they are off
- check the overall server; nope, everything but that page is running OK
- if you’re forward-thinking, you flip a debug flag for that page which logs individual steps… or you could simply make a copy of the page and put periodic print statements outputting the time at various steps…
- on the database side, you could run over to your QA server, turn on the general query log, and hit just that one page while it’s running
We are focusing on a more targeted area of the site as opposed to the site as a whole.
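The “debug flag with print statements at various steps” idea might look something like this sketch (Python again; `timed_step` and the three step functions are hypothetical stand-ins for whatever work the real page does):

```python
import time

DEBUG_TIMING = True  # the "debug flag" you flip for the suspect page

def timed_step(label, func, *args):
    """Run one step of the page and print its elapsed time when debugging."""
    start = time.monotonic()
    result = func(*args)
    if DEBUG_TIMING:
        print(f"step={label} elapsed={time.monotonic() - start:.3f}s")
    return result

# Hypothetical steps that make up the slow page:
def fetch_user():
    time.sleep(0.01)          # stand-in for a session/user lookup
    return {"id": 42}

def run_report_query(user):
    time.sleep(0.20)          # stand-in for the suspect SQL
    return ["row"] * 100

def render(rows):
    return f"rendered {len(rows)} rows"

user = timed_step("fetch_user", fetch_user)
rows = timed_step("report_query", run_report_query, user)
page = timed_step("render", render, rows)
```

One look at the per-step output tells you which step owns the extra seconds, which is exactly the narrowing-down the list above describes.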
In my opinion as a database geek, page load times enable you to do the same sort of analysis we are used to doing with the slow query log. We find a query in the slow log or by some other means, run the query to see the time, EXPLAIN the query, add an index or change the SQL, EXPLAIN it again, and then rerun it. We rinse and repeat this over and over again. Why not find the slow page, analyze the steps on the page, find the SQL or the code that’s causing the slowdown, fix it, and retest? It seems pretty straightforward to me. Remember: think outside the database.