Monitoring disk load is one of the harder things to monitor, but also one of the most crucial things you should monitor. Disk load problems can really give your applications a hard time, slowing them down or crippling them completely. On Linux servers it’s easy, as the CPU wait counter gives clear hints of issues with your disk io.
I rolled out check_diskstat on our Linux servers in September 2014 and really missed a similar plugin for monitoring disk load on Windows servers. Hence, I started thinking about a new Powershell script, which would use the Powershell command ‘get-counter’, to gather all disk related information from the Performance Monitor. I started with making a list of the requirements:
- The main requirement was that it had to be multilingual, as I work on English and Dutch versions of Windows Server 2003, 2003 R2, 2008 and 2008 R2.
- Another requirement was that the script had to allow an argument that specifies the amount of samples over which an average could be calculated.
- The perfdata output should be outputted in a way where all disk load related values had to be visible in a graph. I had to deal with very high values, eg 8763098004 and very small decimals, eg 0,00014. This implied I had to find some way to make it visually attractive and correct in Highcharts, for example by outputting in milliseconds instead of seconds or megabytes instead of bytes.
- The plugin also had to work culture independent. Some culture use ‘,’ and other use ‘.’ as decimal. I solved this by replacing [System.Threading.Thread]::CurrentThread.CurrentCulture with ‘en-US’ ans setting it back to the original value once I’m done.
Monitoring disk load may be useful in finding the cause of performance issues. If a component of an application starts writing huge logs or big amounts of data in a database on your Windows disks, a bottleneck could be created in your application’s flow. This bottleneck could quickly result in any kind of lag, latency or slowness for end-users, resulting in more incidents, calls or complaints. An integral part of the job as monitoring engineer, is to avoid situations as described above. Here Nagios can help you, by alerting you before applications start getting slow. Up until now, the only way to monitor performance counters for Windows servers, was using an agent like NSClient++ (or NCPA?) to retrieve one performance counter. My check_ms_windows_disk_load plugin enables you to combine several disk load related performance counters with only one service. This method has several advantages:
- You don’t need to worry what counters to monitor. The plugin will do that for you.
- As the plugin monitors 8 performance counters, and you only need one service, this would save you 7 services for each disk. So your Nagios server has less work, which enables you to monitor other stuff instead or increase the monitor interval on your checks.
- As you can pass maxsamples (-ms or –MaxSamples) as a parameter, you can choose yourself how long you want the plugin to run before calculating averages. Each sample should be one second.
You could also prove to your application engineers that the storage is or is not the cause of their application’s performance. You can use comprehensive graphs visualizing a collection of disk performance related information. You also need knowledge about your disk load in order to choose the right disk type for the job. Are your 3TB SATA disks strong enough to handle the job or will you have to buy more expensive SSD’s to achieve the performance you need?
How to monitor your disk load?
- Put the script in the NSClient++ scripts folder, preferably in a subfolder Powershell.
- In the nsclient.ini configuration file, define the script like this:
1check_ms_win_disk_load = cmd /c echo scripts\powershell\check_ms_win_disk_load.ps1 $ARG1$; exit $LastExitCode | powershell.exe /noprofile -command -
- Make a command in Nagios like this:
1check_ms_win_disk_load => $USER1$/check_nrpe -H $HOSTADDRESS$ -p 5666 -t 60 -c check_ms_win_disk_load $ARG1$
- Configure your service in Nagios. Make use of the above created command. Configure something similar like this as $ARG1$:
1-a '-dl C -ms 5 -rqw 20 -rqc 50'
One day after everything is configured correctly, your Highcharts graphs should look like this:
If you want to test the load on your Windows disks, you can use this Storage Load Generator DiskSPD from Microsoft to play. (Yes Microsoft has a GitHub account!!)
I hope this plugin can help you monitor the disk load on your Windows hosts. Please rate it on the Nagios Exchange if you like my work.