I am cleaning out some old scripts and it got me thinking back to a time before I rejoined the ranks of consultants… to many different jobs and many different people. Why is it that some companies or “tech” people miss the boat when it comes to performance tuning? They don't take a holistic approach to tuning; rather, they fire off one thing and if it looks good it must be someone else's problem. Case in point: VMSTAT… some people live and die by vmstat; their attitude is, if I don't see it there, it does not exist. Others rely heavily on tools they do not understand.
I worked at one company as a DBA several years ago. We were experiencing horrible performance issues across the board with a large third-party Oracle-based application. Sar -d and iostat were screaming about wait times on the server in excess of 200ms. Iowait was high but not that high (whoever invented or decided to put io wait in the cpu stats should be taken outside and beaten, by the way, but that's another story). The sysadmin at the time said glance+ did not show any issues, and he trusted its results because he had never heard of those other “unreliable” tools. After months of arguing back and forth, I was able to get a copy of glance to compare numbers. Looking at io stats over the same period, I noticed an avserv time of 103 from sar but 03 from glance… after some digging, a bug popped up: that particular version of glance rolled over on certain metrics once they hit three digits (98, 99, 00, 01… you get it). It still took months to convince them of a disk issue.
At another company we saw low cpu, but the users' response time was unbearable. Disk io was higher than normal, but still not terrible; memory use was higher than normal as well. The lack of common metrics pointing to a problem caused the finger pointing to start… the SA blames the dba, the dba blames the developer, etc. This was solved very easily once someone simply looked at the amount of data being pushed to the server over the nic. Someone had kicked off some giant ftp processes. Amazing: once the ftp processes were killed, everything was back to normal… start them up again and performance took a nose dive.
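The check that cracked that case is simple to sketch: sample the interface byte counters twice and divide the delta by the interval. The counter values below are canned sample data for illustration; on a real box you would pull them from netstat -i, entstat, or /proc/net/dev depending on the platform.

```shell
#!/bin/sh
# Hypothetical sketch: turn two samples of NIC byte counters into a
# throughput rate. RX/TX values are canned; a real script would read
# them from the OS between two sleeps.
RX1=1000000;  TX1=500000     # bytes received/sent at time t0
RX2=22000000; TX2=600000     # bytes received/sent 5 seconds later
INTERVAL=5

RX_RATE=$(( (RX2 - RX1) / INTERVAL ))
TX_RATE=$(( (TX2 - TX1) / INTERVAL ))
echo "rx ${RX_RATE} B/s  tx ${TX_RATE} B/s"
```

A receive rate wildly above the box's normal baseline is exactly the kind of "obscure detail" that points at something like a runaway ftp, long before cpu or disk stats give it away.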
These types of problems require IT folks who know and understand their systems… they need to look at obscure details and know when something is a little off. I am all for applications with pretty graphs and wonderful gui interfaces, but admins still need to be aware of the details and the standard tools across the board. Very few get it… others are just hanging around waiting for retirement.
To get a good idea of system performance and bottlenecks you can look at things like:

cpu utilization (vmstat)
memory use and paging
disk io wait and service times (sar -d, iostat)
network throughput over the nic
HBA throughput
All of these things yield some key piece of information. Items like HBA throughput are often overlooked as potential bottlenecks, but knowing what a normal load looks like versus a high load can be beneficial.
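The payoff of gathering these in one place is that you can filter against a known baseline. As a minimal sketch of that idea, here is a hypothetical filter over sar -d style device lines that flags anything with an average service time above a threshold; the sample lines are canned, and a real script would feed it live `sar -d` or `iostat` output instead.

```shell
#!/bin/sh
# Hypothetical sketch: flag disks whose avserv exceeds a baseline.
# The threshold and sample data are made up for illustration.
THRESHOLD=30   # ms; pick whatever is "normal" for your disks

flag_slow_disks() {
    awk -v limit="$THRESHOLD" '
        # expected fields: device  %busy  avque  r+w/s  blks/s  avwait  avserv
        NF == 7 && $7 + 0 > limit { print $1, $7 "ms" }
    '
}

flag_slow_disks <<'EOF'
hdisk0   12   0.4   80   640    2.1    8.3
hdisk1   95   6.8  310  2480   41.7  212.5
hdisk2   40   1.1  120   960    5.0   29.9
EOF
```

With a 30ms baseline, only hdisk1 gets flagged here, which mirrors the 200ms avserv story above: the point is not the absolute number but knowing your normal and catching the outlier.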
I put together a little code for gathering all of these into one script on AIX… here is the file: sysperf.pl