Monitoring NetApp Ontap

Introduction

There are of course numerous ways to monitor your NetApp Ontap storage, but this post focuses for now on how to achieve quality monitoring with the help of a Nagios plugin originally developed by John Murphy. The plugin definitely has some flaws, so all help to improve it is welcome. Read the post about debugging Perl scripts, fork the project on GitHub and start experimenting.

The plugin is able to monitor multiple critical NetApp Ontap components, from disks to aggregates to volumes. It can also alert you if it finds any unhealthy components.

NetApp Ontap Logical View

How to monitor NetApp Ontap with Nagios?

  • Download the latest release from GitHub to a temp directory and then navigate to it.
  • Copy the contents of NetApp/* to your /usr/lib/perl5 or /usr/lib64/perl5 directory to install the required version of the NetApp Perl SDK (confirmed to work with SDK 5.1 and 5.2).
  • Copy the check_netapp_ontap.pl script to your Nagios libexec folder and set the correct permissions.
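The last two steps above can be sketched in Python as follows. This is only an illustration: the directory arguments are whatever applies on your system, and the 0755 mode is an assumption about what "correct permissions" means for a Nagios plugin, not something stated by the plugin's documentation.

```python
import shutil
from pathlib import Path


def install_plugin(release_dir, perl_lib_dir, libexec_dir):
    """Copy the bundled SDK modules and the plugin script into place."""
    release = Path(release_dir)
    perl_lib = Path(perl_lib_dir)
    libexec = Path(libexec_dir)

    # Copy the contents of NetApp/* into the Perl library directory.
    for item in (release / "NetApp").iterdir():
        target = perl_lib / item.name
        if item.is_dir():
            shutil.copytree(item, target, dirs_exist_ok=True)
        else:
            shutil.copy2(item, target)

    # Copy the plugin and make it executable (0755 is an assumption).
    script = libexec / "check_netapp_ontap.pl"
    shutil.copy2(release / "check_netapp_ontap.pl", script)
    script.chmod(0o755)
    return script
```

A call could look like `install_plugin("/tmp/release", "/usr/lib64/perl5", "/usr/local/nagios/libexec")`, where the first and last paths are hypothetical and depend on where you unpacked the release and where Nagios is installed.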

Parameters:

--hostname, -H => Hostname or address of the cluster administrative interface.

--node, -n => Name of a vhost or cluster-node to restrict this query to.

--user, -u => Username of a NetApp ONTAPI-enabled user.

--password, -p => Password for the NetApp ONTAPI-enabled user.

--option, -o => The name of the option you want to check. See the option and threshold list at the bottom of this help text.

--warning, -w => A custom warning threshold value. See the option and threshold list at the bottom of this help text.

--critical, -c => A custom critical threshold value. See the option and threshold list at the bottom of this help text.

--modifier, -m => This modifier is used to set an inclusive or exclusive filter on what you want to monitor.

--help, -h => Display this help text.
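To show how these parameters fit together, here is a minimal Python sketch that assembles a command line for the plugin. The plugin path, host name, credentials and threshold values are all hypothetical, and the exact threshold syntax the plugin accepts should be checked against its help text.

```python
def build_check_command(hostname, user, password, option,
                        warning=None, critical=None, node=None, modifier=None,
                        plugin="/usr/local/nagios/libexec/check_netapp_ontap.pl"):
    """Return the plugin invocation as an argument list (plugin path is an assumption)."""
    cmd = [plugin,
           "--hostname", hostname,
           "--user", user,
           "--password", password,
           "--option", option]
    # Optional parameters are only added when a value was supplied.
    for flag, value in (("--node", node), ("--warning", warning),
                        ("--critical", critical), ("--modifier", modifier)):
        if value is not None:
            cmd += [flag, value]
    return cmd


# Hypothetical example values, for illustration only.
example = build_check_command("cluster01.example.com", "nagios", "secret",
                              "volume_health", warning="80%", critical="90%")
```

The resulting list can be passed to `subprocess.run` for a manual test, or joined with `shlex.join` to paste into a Nagios command definition.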

Option list:

volume_health:

Check the space and inode health of a vServer volume on a NetApp Ontap cluster. If space % and space in *B are both defined, the smaller value of the two will be used when deciding if the volume is in a warning or critical state. This allows you to better accommodate large volume monitoring.

thresh: space % used, space in *B (e.g. MB) remaining, inode count remaining, inode % used (usage example: 80%i), “offline” keyword.

node: The node option restricts this check by vserver name.
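The "smaller value of the two" rule can be read in more than one way. The sketch below shows one plausible reading, not verified against the plugin source: the percentage threshold is converted to the amount of free space it corresponds to, and the smaller of the two free-space levels becomes the effective threshold. All function names are hypothetical.

```python
TIB = 1024 ** 4
GIB = 1024 ** 3


def remaining_threshold_bytes(total_bytes, percent_used=None, bytes_remaining=None):
    """Return the free-space level (in bytes) below which an alert would fire."""
    candidates = []
    if percent_used is not None:
        # Convert "N % used" into the equivalent number of bytes still free.
        candidates.append(total_bytes * (100 - percent_used) / 100)
    if bytes_remaining is not None:
        candidates.append(bytes_remaining)
    if not candidates:
        raise ValueError("at least one threshold is required")
    # Assumed reading of the help text: the smaller value wins.
    return min(candidates)


def is_breached(total_bytes, used_bytes, percent_used=None, bytes_remaining=None):
    """True when the free space has dropped below the effective threshold."""
    free = total_bytes - used_bytes
    return free < remaining_threshold_bytes(total_bytes, percent_used, bytes_remaining)
```

Under this reading, a 100 TiB volume that is 85 % full would breach an 80 % threshold on its own, but not when a 500 GiB remaining threshold is set as well, which is how very large volumes avoid alerting while terabytes are still free.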

aggregate_health:

Check the space and inode health of a cluster aggregate on a NetApp Ontap cluster. If space % and space in *B are both defined, the smaller value of the two will be used when deciding if the aggregate is in a warning or critical state. This allows you to better accommodate large aggregate monitoring.

thresh: space % used, space in *B (e.g. MB) remaining, inode count remaining, inode % used (usage example: 80%i), “offline” keyword, “is-home” keyword.

node: The node option restricts this check by cluster-node name.

snapshot_health:

Check the space and inode health of a vServer snapshot. If space % and space in *B are both defined, the smaller value of the two will be used when deciding if the snapshot is in a warning or critical state. This allows you to better accommodate large snapshot monitoring.

thresh: space % used, space in *B (e.g. MB) remaining, inode count remaining, inode % used (usage example: 80%i), “offline” keyword.

node: The node option restricts this check by vserver name.

quota_health:

Check that the space and file thresholds have not been crossed on a quota.

thresh: N/A, storage defined.

node: The node option restricts this check by vserver name.

snapmirror_health:

Check the lag time and health flag of the snapmirror relationships.

thresh: snapmirror lag time (valid intervals are s, m, h, d).

node: The node option restricts this check by snapmirror destination cluster-node name.
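As an illustration of the snapmirror lag intervals, the following Python sketch converts a threshold such as `30m` into seconds. The `<number><unit>` format is an assumption based on the listed intervals (s, m, h, d). The plugin's actual parsing may differ.

```python
import re

# Seconds per interval unit, matching the intervals listed for snapmirror_health.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def lag_to_seconds(threshold):
    """Convert a lag threshold like '30m' or '2h' into seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", threshold.strip())
    if match is None:
        raise ValueError(f"invalid lag threshold: {threshold!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_SECONDS[unit]
```

With this helper, a warning at `2h` and a critical at `1d` correspond to 7200 and 86400 seconds of lag respectively.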

filer_hardware_health:

Check the environment hardware health of the filers (fan, psu, temperature, battery).

thresh: component name (fan, psu, temperature, battery). There is no default alert level; the thresholds MUST be defined.

node: The node option restricts this check by cluster-node name.

port_health:

Checks the state of a physical network port.

thresh: N/A, not customizable.

node: The node option restricts this check by cluster-node name.

interface_health:

Check that a LIF is in the correctly configured state and that it is on its home node and port. Additionally checks the state of a physical port.

thresh: N/A, not customizable.

node: The node option restricts this check by vserver name.

netapp_alarms:

Check for NetApp console alarms.

thresh: N/A, not customizable.

node: The node option restricts this check by cluster-node name.

cluster_health:

Check the cluster disks for failure or other potentially undesirable states.

thresh: N/A, not customizable.

node: The node option restricts this check by cluster-node name.

disk_health:

Check the health of the disks in the cluster.

thresh: Not customizable yet.

node: The node option restricts this check by cluster-node name.

For keyword thresholds, if you want to ignore alerts for a particular keyword, set it at the same threshold that the alert defaults to.

Linux Vulnerabilities Overview

Introduction

Linux is considered to be much more secure than Windows. Over the last few years, however, several big Linux vulnerabilities were discovered. This definitely doesn’t mean that Linux is suddenly an insecure operating system. What it does mean is that you need to monitor and patch your systems. The same goes of course for Windows Server, but I’ll try to go into detail about WSUS updates in another post.

When you look at the latest Red Hat security advisories, it becomes very clear that you need to implement a system which automatically installs security updates. Doing this manually on 500+ servers would be crazy and a big waste of time. You also need to make sure you always have a recent snapshot or backup in place, preferably taken right before the security updates are installed.

RunDeck allows you to do exactly that. After adding your Linux servers as nodes to RunDeck, you can easily schedule a job containing a workflow that first takes a VMware snapshot, after which the installation of the security updates can be started safely.

I’ll try to go over the most famous Linux vulnerabilities and summarize some very basic information about them.

Heartbleed

Security bug in the OpenSSL cryptography library, reported to the OpenSSL team on 01/04/2014 by Neel Mehta (Google) and publicly disclosed on 07/04/2014. It is classified as a buffer over-read, a situation where more data can be read than should be allowed.

  • CVE-2014-0160

Linux vulnerabilities Heartbleed

Shellshock (Bashdoor)

Everybody must have heard of Shellshock, discovered by Stéphane Chazelas and disclosed on 24/09/2014. Shellshock allows attackers to execute arbitrary code smuggled in through environment variables. Anything that invokes the flawed Bash shell and passes in malicious variables, which turns out to be surprisingly easy to do, is vulnerable to being hijacked.

To check whether specific CGI scripts are vulnerable, you could use Shellshock Tester or Shellshock Test Tool.
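For a quick local check of Bash itself, rather than of CGI scripts, the well-known environment-variable probe for CVE-2014-6271 can be wrapped in a few lines of Python. This sketch covers only that first CVE, not the follow-up issues listed below, and the function name is hypothetical.

```python
import shutil
import subprocess

# Classic CVE-2014-6271 probe: a function definition with a trailing command.
PROBE = "() { :;}; echo vulnerable"


def bash_is_vulnerable(bash_path="bash"):
    """Return True if Bash executes the command smuggled in via the environment."""
    bash = shutil.which(bash_path)
    if bash is None:
        raise FileNotFoundError(f"{bash_path} not found")
    result = subprocess.run(
        [bash, "-c", "echo probe-done"],
        env={"PATH": "/usr/bin:/bin", "x": PROBE},
        capture_output=True, text=True, timeout=10,
    )
    # A vulnerable Bash prints "vulnerable" while importing the variable x.
    return "vulnerable" in result.stdout.splitlines()
```

On a patched system the probe variable is ignored and only `probe-done` is printed, so the function returns `False`.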

  • CVE-2014-6271
  • CVE-2014-6277
  • CVE-2014-6278
  • CVE-2014-7169
  • CVE-2014-7186
  • CVE-2014-7187

Linux vulnerabilities Shellshock

Ghost

The last critical security flaw to hit the news, on 16/02/2016, is often mentioned together with Ghost (CVE-2015-0235, the glibc gethostbyname overflow from January 2015), but it is a separate bug. It’s a stack-based buffer overflow in the glibc DNS client-side resolver that puts Linux machines at risk of remote code execution. It was discovered by a Google engineer. The glibc maintainers had previously been alerted to the issue via their bug tracker in July 2015. The issue was solved by a combined effort of two engineers of the Red Hat team, the Google team and the glibc team. Check out the Google blogpost.

  • CVE-2015-7547: glibc getaddrinfo stack-based buffer overflow

Linux vulnerabilities Ghost

Kernel Zero-Day Flaw

On 19/01/2016 a new critical zero-day Linux vulnerability was found in the kernel that could allow attackers to gain root privileges. It was discovered by a research group named Perception Point. The issue had apparently been present since 2012 and is the result of a reference leak in the keyrings facility built into Linux. The keyrings facility is a way to encrypt and store login data, encryption keys and certificates and make them available to applications.

A PoC with example exploit code was released on GitHub.

  • CVE-2016-0728

Patch your impacted systems against Linux vulnerabilities

Ensure that you are running the latest patch level. If it’s a virtual machine, take a VMware snapshot first, so that in a worst-case scenario you can roll back.

CentOS / Red Hat / Fedora

Ubuntu / Debian
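A minimal sketch of choosing the update commands per distribution family, assuming the stock package managers (yum for CentOS / Red Hat / Fedora, apt-get for Ubuntu / Debian). The identifiers match the `ID` field of /etc/os-release on systems that ship that file. The helper name is hypothetical.

```python
# Update commands per distribution family (stock package managers assumed).
_YUM = [["yum", "-y", "update"]]
_APT = [["apt-get", "update"], ["apt-get", "-y", "upgrade"]]

_COMMANDS_BY_ID = {
    "centos": _YUM,
    "rhel": _YUM,
    "fedora": _YUM,
    "ubuntu": _APT,
    "debian": _APT,
}


def update_commands(os_id):
    """Return the list of commands that bring the given distribution up to date."""
    try:
        return _COMMANDS_BY_ID[os_id.strip().lower()]
    except KeyError:
        raise ValueError(f"unsupported distribution: {os_id!r}") from None
```

Each returned command is an argument list that can be run as root, for example with `subprocess.run(command, check=True)`, once the snapshot is in place.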

You can schedule this easily with, for example, Nagios Reactor. It allows you to execute commands over SSH at scheduled intervals. In combination with the VMware snapshot chain, you can easily create a robust patching ecosystem. Please note that Nagios Reactor is completely free, but is still in beta. It also only seems to work on CentOS 6.

RunDeck

You can use an inline script such as this to start a yum update on your Linux servers:

The job requires only one variable, which I called reboot. This can be set to true or false.
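As a rough Python sketch of such a job step, not the original inline script: RunDeck exports job options to scripts as `RD_OPTION_<NAME>` environment variables, so the reboot option arrives as `RD_OPTION_REBOOT`. The delayed `shutdown -r +1` is an assumption, chosen so the step can finish and report back before the node goes down.

```python
#!/usr/bin/env python3
import os
import subprocess


def planned_commands(environ):
    """Return the commands to run, based on the job's 'reboot' option."""
    commands = [["yum", "-y", "update"]]
    # RunDeck passes the job option 'reboot' as RD_OPTION_REBOOT.
    if environ.get("RD_OPTION_REBOOT", "false").strip().lower() == "true":
        # Reboot one minute later so the job step can exit cleanly (assumption).
        commands.append(["shutdown", "-r", "+1"])
    return commands


def main():
    """Run the planned commands in order, stopping at the first failure."""
    for command in planned_commands(os.environ):
        subprocess.run(command, check=True)
```

When used as a job step, add a call to `main()` as the last line. Keeping the decision in `planned_commands` makes it easy to verify the reboot logic without touching a real server.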

This is a screenshot of the Log Output of a RunDeck job:

DAF Linux Yum