Monitoring NetApp Ontap

Introduction

I’d like to start by thanking the original developer, John Murphy. Thanks to his plugin to monitor NetApp Ontap storage, we don’t need to buy the expensive NetApp plugin from Quorum. He also inspired me to continue developing Nagios plugins in my free time. So if you want to monitor your NetApp Ontap cluster, this plugin could help you do that. It is written in Perl and is not as actively developed as my PowerShell plugins, because my Perl knowledge is less evolved than my PowerShell knowledge. Sometimes users send me a piece of code to add; if you want to test the latest additions, give the dev branch a try. All help is welcome to improve this plugin. Read the post about debugging Perl scripts, make a fork of the project on GitHub and start experimenting.

The plugin is able to monitor NetApp Ontap components, from disks to aggregates to volumes, and alerts you if it finds any unhealthy components.

NetApp Ontap Logical View

How to monitor your NetApp Ontap?

  1. Download the latest release from GitHub to a temp directory and then navigate to it.
  2. Copy the contents of NetApp/* to your /usr/lib/perl5 or /usr/lib64/perl5 directory to install the required version of the NetApp Perl SDK. (confirmed to work with SDK 5.1 and 5.2)
  3. Copy the check_netapp_ontap.pl script to your Nagios libexec folder and set the correct permissions.

Parameters:

--hostname, -H => Hostname or address of the cluster administrative interface.

--node, -n => Name of a vhost or cluster-node to restrict this query to.

--user, -u => Username of a NetApp ONTAPI enabled user.

--password, -p => Password for the NetApp ONTAPI enabled user.

--option, -o => The name of the option you want to check. See the option and threshold list at the bottom of this help text.

--warning, -w => A custom warning threshold value. See the option and threshold list at the bottom of this help text.

--critical, -c => A custom critical threshold value. See the option and threshold list at the bottom of this help text.

--modifier, -m => This modifier is used to set an inclusive or exclusive filter on what you want to monitor.

--help, -h => Display this help text.

Option list:

volume_health: Check the space and inode health of a vServer volume on a NetApp Ontap cluster. If space % and space in *B are both defined, the smaller value of the two will be used when deciding if the volume is in a warning or critical state. This allows you to better accommodate large volume monitoring. thresh: space % used, space in *B (e.g. MB) remaining, inode count remaining, inode % used (usage example: 80%i), “offline” keyword. node: The node option restricts this check by vserver name.

aggregate_health: Check the space and inode health of a cluster aggregate on a NetApp Ontap cluster. If space % and space in *B are both defined, the smaller value of the two will be used when deciding if the aggregate is in a warning or critical state. This allows you to better accommodate large aggregate monitoring. thresh: space % used, space in *B (e.g. MB) remaining, inode count remaining, inode % used (usage example: 80%i), “offline” keyword, “is-home” keyword. node: The node option restricts this check by cluster-node name.

snapshot_health: Check the space and inode health of a vServer snapshot. If space % and space in *B are both defined, the smaller value of the two will be used when deciding if the snapshot is in a warning or critical state. This allows you to better accommodate large snapshot monitoring. thresh: space % used, space in *B (e.g. MB) remaining, inode count remaining, inode % used (usage example: 80%i), “offline” keyword. node: The node option restricts this check by vserver name.
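The "smaller value of the two" rule shared by the three space checks above can be read in more than one way. The sketch below shows one possible reading in Python: both thresholds are converted to "bytes remaining" and the smaller one decides. It is not taken from the plugin's Perl source, and the function and parameter names are hypothetical.

```python
GIB = 1024 ** 3
TIB = 1024 ** 4


def effective_free_threshold(total_bytes, pct_used_limit=None, bytes_remaining_limit=None):
    """Return the free-space level (in bytes) below which the check would alert.

    Assumption: a "% used" threshold is first converted to "bytes remaining",
    and the smaller of the two remaining-space limits is the one that counts.
    """
    candidates = []
    if pct_used_limit is not None:
        # "80% used" on this volume equals "20% of total_bytes remaining".
        candidates.append(total_bytes * (100 - pct_used_limit) / 100)
    if bytes_remaining_limit is not None:
        candidates.append(bytes_remaining_limit)
    if not candidates:
        raise ValueError("no space threshold defined")
    return min(candidates)


def is_breached(total_bytes, free_bytes, pct_used_limit=None, bytes_remaining_limit=None):
    """True when the free space has dropped below the effective threshold."""
    limit = effective_free_threshold(total_bytes, pct_used_limit, bytes_remaining_limit)
    return free_bytes < limit
```

Under this reading, a 10 TiB volume checked with 80% used and 100 GiB remaining only alerts once less than 100 GiB is free, instead of alerting while 2 TiB is still available. That would match the remark that the rule helps with large volumes.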

quota_health: Check that the space and file thresholds have not been crossed on a quota. thresh: N/A, the thresholds are defined on the storage. node: The node option restricts this check by vserver name.

snapmirror_health: Check the lag time and health flag of the snapmirror relationships. thresh: snapmirror lag time (valid intervals are s, m, h, d). node: The node option restricts this check by snapmirror destination cluster-node name.
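To make the lag time intervals concrete, here is a minimal Python sketch that converts such a threshold to seconds. It assumes the threshold is written as a whole number directly followed by one of the suffixes s, m, h or d (for example 2h). The help text only lists the valid intervals, so the exact syntax and the function name are assumptions.

```python
import re

# Seconds per unit for the four intervals named in the help text.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def lag_to_seconds(value):
    """Convert a lag threshold such as '30m' or '2h' to seconds.

    Assumed syntax: <integer><unit>, with unit one of s, m, h, d.
    """
    match = re.fullmatch(r"(\d+)([smhd])", value.strip())
    if match is None:
        raise ValueError(f"unsupported lag threshold: {value!r}")
    amount, unit = match.groups()
    return int(amount) * UNIT_SECONDS[unit]
```

A warning threshold of 2h would then be compared against the relationship's lag as 7200 seconds.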

filer_hardware_health: Check the environment hardware health of the filers (fan, psu, temperature, battery). thresh: component name (fan, psu, temperature, battery). There is no default alert level; they MUST be defined. node: The node option restricts this check by cluster-node name.

port_health: Checks the state of a physical network port. thresh: N/A, not customizable. node: The node option restricts this check by cluster-node name.

interface_health: Check that a LIF is in the correctly configured state and that it is on its home node and port. Additionally checks the state of a physical port. thresh: N/A, not customizable. node: The node option restricts this check by vserver name.

netapp_alarms: Check for NetApp console alarms. thresh: N/A, not customizable. node: The node option restricts this check by cluster-node name.

cluster_health: Check the cluster disks for failure or other potentially undesirable states. thresh: N/A, not customizable. node: The node option restricts this check by cluster-node name.

disk_health: Check the health of the disks in the cluster. thresh: Not customizable yet. node: The node option restricts this check by cluster-node name.

For keyword thresholds, if you want to ignore alerts for a particular keyword, set it at the same threshold that the alert defaults to.


Monitoring Windows Disk Load

Introduction

Disk load is one of the harder things to monitor, but also one of the most crucial. Disk load problems can really give your applications a hard time, slowing them down or crippling them completely. On Linux servers it’s easy, as the CPU wait (iowait) counter gives clear hints of issues with your disk I/O.

I rolled out check_diskstat on our Linux servers in September 2014 and really missed a similar plugin for monitoring disk load on Windows servers. Hence, I started thinking about a new PowerShell script, which would use the PowerShell command ‘Get-Counter’ to gather all disk related information from the Performance Monitor. I started by making a list of the requirements:

  • The main requirement was that it had to be multilingual, as I work on English and Dutch versions of Windows Server 2003, 2003 R2, 2008 and 2008 R2. 
  • Another requirement was that the script had to accept an argument that specifies the number of samples over which an average is calculated.
  • The perfdata had to be output in such a way that all disk load related values are visible in one graph. I had to deal with very high values, e.g. 8763098004, and very small decimals, e.g. 0,00014. This implied I had to find some way to make it visually attractive and correct in Highcharts, for example by outputting in milliseconds instead of seconds or megabytes instead of bytes.
  • The plugin also had to work culture-independently. Some cultures use ‘,’ and others use ‘.’ as the decimal separator. I solved this by replacing [System.Threading.Thread]::CurrentThread.CurrentCulture with ‘en-US’ and setting it back to the original value once I’m done.
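The last two requirements can be illustrated with a small Python sketch. It is not the plugin's PowerShell code, which switches the thread culture instead. It only shows the underlying problem: the same value arriving as 0,00014 or 0.00014, and raw values being rescaled to units that are easier to graph. The function names are hypothetical, and 1 MB is assumed to be 1024 * 1024 bytes.

```python
def parse_counter(raw):
    """Parse a counter value that uses ',' or '.' as the decimal separator.

    Assumption: the input has no thousands separators, only a decimal mark.
    """
    return float(raw.strip().replace(",", "."))


def seconds_to_ms(seconds):
    """Rescale a latency in seconds to milliseconds."""
    return seconds * 1000.0


def bytes_to_mb(byte_count):
    """Rescale a byte count to megabytes (assumed here: 1 MB = 1024 * 1024 bytes)."""
    return byte_count / (1024 * 1024)


# The two example values mentioned in the requirements above.
latency_ms = seconds_to_ms(parse_counter("0,00014"))
throughput_mb = bytes_to_mb(parse_counter("8763098004"))
```

With these conversions, the tiny latency becomes roughly 0.14 ms and the large byte count roughly 8357 MB, which puts both in a range that can share one readable graph.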

Monitoring disk load may be useful in finding the cause of performance issues. If a component of an application starts writing huge logs or large amounts of data to a database on your Windows disks, a bottleneck could be created in your application’s flow. This bottleneck could quickly result in lag, latency or slowness for end users, resulting in more incidents, calls or complaints. An integral part of the job of a monitoring engineer is to avoid situations like these. Here Nagios can help you, by alerting you before applications start getting slow. Up until now, the only way to monitor performance counters on Windows servers was to use an agent like NSClient++ (or NCPA?) to retrieve one performance counter per service. My check_ms_windows_disk_load plugin enables you to combine several disk load related performance counters in only one service. This method has several advantages:

  • You don’t need to worry about which counters to monitor. The plugin will do that for you.
  • As the plugin monitors 8 performance counters and you only need one service, this saves you 7 services for each disk. Your Nagios server has less work, which enables you to monitor other things instead or to run your checks more frequently.
  • As you can pass maxsamples (-ms or --MaxSamples) as a parameter, you can choose for yourself how long the plugin runs before calculating averages. Each sample takes one second.
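The averaging over samples mentioned above can be sketched as follows in Python. It assumes every sample is a mapping from counter name to value and that every sample contains the same counters. The counter names in the usage note are hypothetical, and the plugin itself does this in PowerShell on Get-Counter output.

```python
def average_counters(samples):
    """Average each counter over all collected samples.

    samples: list of dicts mapping a counter name to its value in one sample.
    Assumption: every sample contains the same set of counters.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    totals = {}
    for sample in samples:
        for name, value in sample.items():
            totals[name] = totals.get(name, 0.0) + value
    return {name: total / len(samples) for name, total in totals.items()}
```

For example, average_counters([{"read_ms": 2.0, "queue": 1.0}, {"read_ms": 4.0, "queue": 3.0}]) averages two one-second samples into a single set of values for the service. A larger number of samples smooths out short spikes at the cost of a longer-running check.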

You could also prove to your application engineers that the storage is or is not the cause of their application’s performance problems, using comprehensive graphs that visualize a collection of disk performance related information. You also need knowledge about your disk load in order to choose the right disk type for the job. Are your 3TB SATA disks strong enough to handle the job, or will you have to buy more expensive SSDs to achieve the performance you need?

How to monitor your disk load?

  1. Put the script in the NSClient++ scripts folder, preferably in a subfolder Powershell.
  2. In the nsclient.ini configuration file, define the script like this:

  3. Make a command in Nagios like this:

  4. Configure your service in Nagios, using the command created above. Configure something similar to this as $ARG1$:

Examples:

One day after everything is configured correctly, your Highcharts graphs should look like this:

disk load graph 01

If you want to test the load on your Windows disks, you can experiment with DiskSPD, a storage load generator from Microsoft. (Yes, Microsoft has a GitHub account!)

I hope this plugin can help you monitor the disk load on your Windows hosts. Please rate it on the Nagios Exchange if you like my work.