I/O Load Monitoring :: A few days ago a client of mine complained about a slowdown on his website. I checked his server, and noticed that it seemed to be burdened with heavy I/O operations.

When checked with tools such as "top", the "%wa" column (meaning: I/O wait) was very high. Normally it's under 10%, but instead it was ranging from 50% up to 98%.

This is very alarming, because the server uses SSD disks :) so its I/O (input/output) should be really fast. Not bogged down like this.

I checked the daemons (server software) such as Apache, MySQL, Varnish, etc - and they were all idling. None were busy.

So the I/O load came from somewhere else - probably from the hypervisor (the physical server) itself. Which means a possible hardware problem.

Especially since the datacenter had concluded that it was not a "noisy neighbour" - another VM (virtual machine) on the same physical server hogging all the I/O resources. Pretty much all of them were idling, just like mine.

===
However, I'd need some data to convince the datacenter to do a hardware check on its SSD storage cluster. So I wrote this little bash script: http://pastebin.com/SxxuaVy4

The script logs the server's I/O status in CSV format, so it can be very easily graphed later (e.g. in Excel).

Using the iostat tool, it probes the server's current I/O load.
The script is then executed every minute, by running it as a cronjob.
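For reference, a "run every minute" crontab entry might look like the following (the script path here is just an assumption for illustration):

```shell
# m h dom mon dow  command
# Run the I/O logging script every minute:
* * * * * /usr/local/bin/io-log.sh
```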

You may notice that iostat is executed with the "-d 1 3" parameters, which mean "report 3 times, with a 1-second interval in between".

This is because iostat's first report always shows a spike in I/O load :) (the first report gives averages since system boot, not the current load), so the numbers would be inaccurate. I noticed the numbers tend to stabilize by the 3rd run, so I set it up that way.
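The approach described above can be sketched roughly like this. Note this is not the actual pastebin script - the device name, log path, and field positions are all assumptions, and may need adjusting for your iostat version and disk layout:

```shell
#!/bin/bash
# Sketch of an iostat-to-CSV logger (assumptions: device "sda",
# log path below, and tps being the 2nd whitespace-separated field).

LOGFILE="${LOGFILE:-/tmp/io-load.csv}"
DEVICE="${DEVICE:-sda}"

# Take 3 samples, 1 second apart, and keep only the last device line:
# the first report shows averages since boot rather than current load.
sample=$(iostat -d 1 3 | grep "^$DEVICE" | tail -n 1)

# Squeeze repeated spaces so cut can split on a single delimiter,
# then take field 2 (tps, transfers per second).
tps=$(echo "$sample" | tr -s ' ' | cut -d' ' -f2)

# Append a CSV row: timestamp,tps
echo "$(date '+%Y-%m-%d %H:%M'),$tps" >> "$LOGFILE"
```

Run from cron every minute, this produces one timestamped row per minute, ready to be graphed.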

===
Of course, you can very easily modify this script to monitor something totally different :) just change the iostat / head / tail / cut part to something else - voila.
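For example, a hypothetical variant that logs the 1-minute load average instead of disk I/O - just the probe command changes, the CSV logging stays the same (the log path is again an assumption):

```shell
#!/bin/bash
# Variant sketch: log the 1-minute load average instead of disk I/O.
# /proc/loadavg's first field is the 1-minute load average (Linux only).
load=$(cut -d' ' -f1 /proc/loadavg)
echo "$(date '+%Y-%m-%d %H:%M'),$load" >> /tmp/load-avg.csv
```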

===
Attached is a graph created from one of the logs. The X axis is the timestamp, in 24-hour format.

I submitted the logs to the datacenter.

It convinced them to run checks on the storage cluster - and voila, they found some degraded disks in that cluster :)

===
Damaged disks replaced, storage cluster rebuilt - and everyone lives happily ever after? :) Fingers crossed. Happy ending!
