Monitoring NetApp Ontap

Introduction

There are of course numerous way to monitor your NetApp Ontap storage, but this post focusses for now on how to achieve quality monitoring with the help of a Nagios plugin, which was originally developed by John Murphy. The plugin definitely has some flaws, so all help is welcome to improve it. Read the post about debugging Perl scripts, make a fork of the project on Github and start experimenting.

The plugin is able monitor multiple critical NetApp Ontap components, from disk to aggregates to volumes. It can also alert you if it finds any unhealthy components.

NetApp Ontap Logical View

How to monitor Netapp Ontap with Nagios?

  • Download the latest release from GitHub to a temp directory and then navigate to it.
  • Copy the contents of NetApp/* to your /usr/lib/perl5 or /usr/lib64/perl5 directory to install the required version of the NetApp Perl SDK. (confirmed to work with SDK 5.1 and 5.2)
  • Copy check_netapp_ontap.pl script to your nagios libexec folder and configure the correct permissions

Parameters:

–hostname, -H => Hostname or address of the cluster administrative interface.

–node, -n => Name of a vhost or cluster-node to restrict this query to.

–user, -u => Username of a Netapp Ontapi enabled user.

–password, -p => Password for the netapp Ontapi enabled user.

–option, -o => The name of the option you want to check. See the option and threshold list at the bottom of this help text.

–warning, -w => A custom warning threshold value. See the option and threshold list at the bottom of this help text.

–critical, -c => A custom warning threshold value. See the option and threshold list at the bottom of this help text.

–modifier, -m => This modifier is used to set an inclusive or exclusive filter on what you want to monitor.

–help, -h => Display this help text.

Option list:

volume_health:

Check the space and inode health of a vServer volume on a NetApp Ontap cluster. If space % and space in *B are both defined the smaller value of the two will be used when deciding if the volume is in a warning or critical state. This allows you to accomodate large volume monitoring better. thresh: space % used, space in *B (i.e MB) remaining, inode count remaining, inode % used (Usage example: 80%i), “offline” keyword node: The node option restricts this check by vserver name.

aggregate_health:

Check the space and inode health of a cluster aggregate on a NetApp Ontap cluster. If space % and space in *B are both defined the smaller value of the two will be used when deciding if the volume is in a warning or critical state. This allows you to better accomodate large aggregate monitoring. thresh: space % used, space in *B (i.e MB) remaining, inode count remaining, inode % used (Usage example: 80%i), “offline” keyword, “is-home” keyword node: The node option restricts this check by cluster-node name.

snapshot_health:

Check the space and inode health of a vServer snapshot. If space % and space in *B are both defined the smaller value of the two will be used when deciding if the volume is in a warning or critical state. This allows you to better accomodate large snapshot monitoring. thresh: space % used, space in *B (i.e MB) remaining, inode count remaining, inode % used (Usage example: 80%i), “offline” keyword node: The node option restricts this check by vserver name.

quota_health:

Check that the space and file thresholds have not been crossed on a quota. thresh: N/A storage defined. node: The node option restricts this check by vserver name. snapmirror_health: Check the lag time and health flag of the snapmirror relationships. thresh: snapmirror lag time (valid intervals are s, m, h, d). node: The node options restricts this check by snapmirror destination cluster-node name.

filer_hardware_health:

Check the environment hardware health of the filers (fan, psu, temperature, battery). thresh: component name (fan, psu, temperature, battery). There is no default alert level they MUST be defined. node: The node option restricts this check by cluster-node name. port_health: Checks the state of a physical network port. thresh: N/A not customizable. node: The node option restricts this check by cluster-node name.

interface_health desc:

Check that a LIF is in the correctly configured state and that it is on its home node and port. Additionally checks the state of a physical port. thresh: N/A not customizable. node: The node option restricts this check by vserver name.

netapp_alarms:

Check for Netapp console alarms. thresh: N/A not customizable. node: The node option restricts this check by cluster-node name. cluster_health desc: Check the cluster disks for failure or other potentially undesirable states. thresh: N/A not customizable. node: The node option restricts this check by cluster-node name. disk_health: Check the health of the disks in the cluster. thresh: Not customizable yet. node: The node option restricts this check by cluster-node name. For keyword thresholds, if you want to ignore alerts for that particular keyword you set it at the same threshold that the alert defaults to.  

Monitoring Microsoft Failover Cluster Preferred Node

 

 


Error: Your Requested widget " wp-github-commits-5" is not in the widget list.
  • [do_widget_area sidebar-1]
    • [do_widget_area sidebar-2]
      • [do_widget id="tag_cloud-2"]
    • [do_widget_area widgets_for_shortcodes]
      • [do_widget id="wp-github-commits-14"]
      • [do_widget id="wp-github-commits-17"]
      • [do_widget id="wp-github-commits-15"]
      • [do_widget id="wp-github-commits-25"]
      • [do_widget id="wp-github-commits-3"]
      • [do_widget id="wp-github-commits-13"]
      • [do_widget id="wp-github-commits-20"]
      • [do_widget id="wp-github-commits-16"]
      • [do_widget id="wp-github-commits-2"]
      • [do_widget id="wp-github-commits-24"]
      • [do_widget id="wp-github-commits-23"]
      • [do_widget id="wp-github-commits-21"]
      • [do_widget id="wp-github-commits-5"]
      • [do_widget id="wp-github-commits-6"]
      • [do_widget id="wp-github-commits-26"]
      • [do_widget id="wp-github-commits-9"]
      • [do_widget id="wp-github-commits-22"]
      • [do_widget id="wp-github-commits-19"]
      • [do_widget id="wp-github-commits-8"]
      • [do_widget id="wp-github-commits-7"]
      • [do_widget id="wp-github-commits-18"]
      • [do_widget id="wp-github-commits-11"]
      • [do_widget id="wp-github-commits-4"]
      • [do_widget id="wp-github-commits-12"]
      • [do_widget id="wp-github-commits-27"]
    • [do_widget_area wp_inactive_widgets]
      • [do_widget id="recent-posts-2"]
      • [do_widget id="recent-comments-2"]

    Introduction

    Clustering is a very important technology to ensure application availability and performance. Accurate monitoring of your clusters is crucial to keep your applications stable. Not monitoring is not an option, as you might never know when a failover has taken place and waste valuable time looking in the wrong places for solutions to your problems. There are different options and levels of monitoring you can choose for monitoring Microsoft Windows failover clusters.

    Preferred Node Option

    One of the easiest is checking each cluster service if it is still running on it’s preferred node. The plugin from Nedstars (link) to check if MS cluster services are running on their preferred node only works for Windows 2008 R2 or later versions. Apart from that I noticed that there seemed to be a bug in Nedstars script. If a MS cluster contained more then one cluster service, it seemed to only check one of them. I did not investigate this further, so forgive me if I’m wrong here. As we are still using multiple Windows 2003 failover clusters,  I decided to write a completely new plugin that uses WMI to get information about failover cluster services on Microsoft Windows 2003 failover clusters. Windows Server 2008 R2 Failover Cluster Preferred Node: preferred node on 2008 Windows Server 2003 R2 Failover Cluster Preferred Node: check_ms_cluster_preferrede_node_2003R2 The plugin starts by checking the version of the Windows server where it is running on with WMI. If the version is lesser then 6.1 (which is the version number of Windows Server 2008 R2), the script will continue to use WMI to get all cluster services and then check each cluster service individually to see where it is running and compare that with it’s preferred node. If the OS version is not lesser then 6.1, the plugin will import the module ‘failoverclusters’ and start using module commands instead of WMI. I did not check if the script works on a cluster with more then two nodes, as we don’t have clusters with three or more nodes. If anyone could test this for me and let me know… 

    How to use the check_ms_cluster_preferred_node?

    1. Copy the ‘check_ms_cluster_preferred_node.ps1’ Powershell script to the NSClient++ scripts folder, preferably in a sub-folder ‘Powershell’ on all the failover cluster nodes you wish to monitor.
    2. In the nsclient.ini configuration file, define the external command like this (and restart the NSClient++ service (nscp) afterwards):
    3. Make a command in Nagios like this:
    4. Configure your service in Nagios. Use the command you previously made. A healthy check in Nagios XI would look like this:check_ms_cluster_preferred_node

    Other Microsoft Failover Cluster monitoring options?

    So is monitoring the preferred node the best way to monitor clusters. Definitely not. A cluster might have no or multiple preferred nodes for some reason. If you happen to own some MS clusters for critical applications, you will most likely want to be alerted for more issues then only a cluster service failover. I’ve seen multiple clusters with issues that did not consist of a failover. Instead one or more cluster service just went offline and online on the same node. In order to monitor this, I wrote another Powershell script that checks for certain event id’s in the Windows application eventlog and alerts when these events contain information about failing cluster services.