
Monitoring Windows Network Connections

Introduction

Monitoring the network connections on your Windows servers can be crucial for examining server load and investigating bottlenecks and anomalies. There are many ways to monitor your network connections. This blog post goes into detail on some of the tools you can use to monitor your Windows network connections.

How To monitor your Windows Network Connections?

PerfMon

In the Windows Performance Monitor, you can find several counters for all kinds of network connections. The set of counters below is available for both TCPv4 and TCPv6 connections.

Counter Name | Counter Description
Connection Failures | Connection Failures is the number of times TCP connections have made a direct transition to the CLOSED state from the SYN-SENT state or the SYN-RCVD state, plus the number of times TCP connections have made a direct transition to the LISTEN state from the SYN-RCVD state.
Connections Active | Connections Active is the number of times TCP connections have made a direct transition to the SYN-SENT state from the CLOSED state. In other words, it shows the number of connections which are initiated by the local computer. The value is a cumulative total.
Connections Established | Connections Established is the number of TCP connections for which the current state is either ESTABLISHED or CLOSE-WAIT.
Connections Passive | Connections Passive is the number of times TCP connections have made a direct transition to the SYN-RCVD state from the LISTEN state. In other words, it shows the number of connections to the local computer which are initiated by remote computers. The value is a cumulative total.
Connections Reset | Connections Reset is the number of times TCP connections have made a direct transition to the CLOSED state from either the ESTABLISHED state or the CLOSE-WAIT state.
Segments Received/sec | Segments Received/sec is the rate at which segments are received, including those received in error. This count includes segments received on currently established connections.
Segments Retransmitted/sec | Segments Retransmitted/sec is the rate at which segments are retransmitted, that is, segments transmitted containing one or more previously transmitted bytes.
Segments Sent/sec | Segments Sent/sec is the rate at which segments are sent, including those on current connections, but excluding those containing only retransmitted bytes.
Segments/sec | Segments/sec is the rate at which TCP segments are sent or received using the TCP protocol.

At the moment there seems to be no Performance Monitor counter available in Windows to show the UDP connection count. Although the Windows Performance Monitor is an easy choice for a quick glance at how many TCP connections are currently active, it is not an optimal tool for debugging or alerting. The PerfMon user interface also hasn’t changed much over the years.

UDP Connection Count

This means that we will have to look at other options, such as Netstat:

Netstat

Netstat is a command-line tool that displays very detailed information about your network connections, both incoming and outgoing, routing tables, network interfaces and network protocol statistics.
It is mostly used for finding problems in the network and to determine the amount of traffic on the network as a performance measurement. 

Although Netstat is the perfect tool for looking in real-time at your network connections, you will need some way to graph the Netstat values. Being able to analyze the connection count over time really helps with getting a better understanding of what your servers and applications are doing.

Nagios

I found multiple plugins that check network connections with Netstat on Linux hosts, but none for Windows hosts, so I decided to write a PowerShell script that uses Netstat to monitor the TCP and UDP network connections on Windows hosts.

How to monitor your network connections with Nagios?

  1. Download the latest version of check_ms_win_network_connections on GitHub.
  2. Put the script in the NSClient++ scripts folder, preferably in a subfolder Powershell.
  3. In the nsclient.ini configuration file, define the script like this:

  4. Make a command in Nagios like this:

  5. Configure your service in Nagios. Make use of the above created command. Configure something similar like this as $ARG1$:

Additional Information

The script runs ‘netstat -ano’, which displays all active network connections with their respective IP addresses, port numbers and corresponding process IDs. It then parses the results and applies the optional filters.
This could of course also be accomplished by just retrieving the ‘\TCPv4\Connections Established’ performance counter and its UDP variant, but the real strength of the script is its parameters. If you think your systems have been compromised by a virus or other malicious software, you can distribute the check_ms_win_network_connections plugin to all Windows servers and then check your network connections for a given process, port or IP address. This can quickly give you an overview of all impacted systems.
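To make the parsing step more concrete, here is a minimal Python sketch of how ‘netstat -ano’ output could be split into fields and counted per protocol. It is not the plugin’s PowerShell code: the sample output, the addresses and PIDs in it, and the function name parse_netstat are all made up for illustration.

```python
from collections import Counter

# Assumed sample of `netstat -ano` output; addresses and PIDs are invented.
SAMPLE_OUTPUT = """\
Active Connections

  Proto  Local Address          Foreign Address        State           PID
  TCP    0.0.0.0:135            0.0.0.0:0              LISTENING       888
  TCP    10.0.0.5:49712         10.0.0.9:443           ESTABLISHED     4321
  TCP    [::]:445               [::]:0                 LISTENING       4
  UDP    0.0.0.0:123            *:*                                    1044
"""


def parse_netstat(text):
    """Return one dict per connection line; header and blank lines are skipped."""
    connections = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] not in ("TCP", "UDP"):
            continue
        # rpartition splits on the last colon, so IPv6 addresses like [::] stay intact.
        local_ip, _, local_port = fields[1].rpartition(":")
        remote_ip, _, remote_port = fields[2].rpartition(":")
        connections.append({
            "proto": fields[0],
            "local_ip": local_ip,
            "local_port": local_port,
            "remote_ip": remote_ip,
            "remote_port": remote_port,
            # UDP rows have no State column, so they only have four fields.
            "state": fields[3] if len(fields) == 5 else "",
            # The PID is always the last field on the line.
            "pid": int(fields[-1]),
        })
    return connections


connections = parse_netstat(SAMPLE_OUTPUT)
counts = Counter(conn["proto"] for conn in connections)
print(counts["TCP"], counts["UDP"])
```

Once every connection is a small record like this, filtering on a PID, port or IP address becomes a simple comparison per record.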

Usage

Because the PowerShell command Get-Process doesn’t include file extensions, the -P parameter does not need a file extension such as ‘.exe’ either. For example, to look for all connections made by svchost.exe, the parameters would look like this: -H server.fqdn -P svchost

Another usage example is monitoring a server that needs a continuous link with another server. By specifying the -wl and -cl parameters like this: -H server.fqdn -wl 2 -cl 0 -wh 10 -ch 15, you should get a warning alert when the number of TCP connections drops below 2 and a critical alert when there is no TCP connection with the remote server.
Please note that when you combine different filter parameters, ‘or’ is used, not ‘and’. So if any of the filters applies, the connection is counted.
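The ‘or’ behaviour of the filters can be illustrated with a short Python sketch. The record layout (process, port, ip) and the function name matches_any are hypothetical and chosen for this example only; the plugin itself is written in PowerShell and may match on different fields.

```python
def matches_any(conn, processes=(), ports=(), ips=()):
    """True when the connection satisfies at least one of the supplied filters."""
    return (
        conn["process"] in processes
        or conn["port"] in ports
        or conn["ip"] in ips
    )


# Hypothetical, already parsed connections.
connections = [
    {"process": "svchost", "port": 135, "ip": "10.0.0.9"},
    {"process": "w3wp", "port": 443, "ip": "10.0.0.7"},
    {"process": "sqlservr", "port": 1433, "ip": "10.0.0.8"},
]

# Filter on process "svchost" OR port 443.
selected = [
    conn for conn in connections
    if matches_any(conn, processes={"svchost"}, ports={443})
]
print(len(selected))
```

Two connections are selected here: one matches the process filter and another matches the port filter. With ‘and’ semantics neither would have been counted, which is why combining filters widens the result instead of narrowing it.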

If you don’t want to filter on IP address or port, I suggest you use the ‘-c’ parameter, which improves performance a lot. I also suggest using it when you run the plugin on a server with a very high number of connections.
The ‘-c’ parameter will execute (netstat -abn -proto TCP).count, which is much faster than looping through each individual connection. The trade-off is that you get less information, as it only counts the active TCP connections.

Results

The result of using Nagios XI to monitor your network connections looks like this:

Monitoring Network Connections Nagios

TIG

A third option is to use a TIG stack, which uses Telegraf to query the counters from PerfMon and send them to an InfluxDB time series database. Visualization is done with Grafana.

The Telegraf agent configuration file needs this input:

TIG Network Connections

Grafana allows you to create a query which will show all values for all hosts with a certain tag. With the help of templates, it becomes very easy to create beautiful graphs with filterable, sortable min, max, avg and current values of all your network connection counters, all at a one-second interval.

TIG-Windows-Network-Connections-Top-Avg

A disadvantage of using Telegraf is that you are limited to PerfMon counters. This means it’s not possible to get the UDP connection count. There seems to be a way to execute PowerShell scripts with Telegraf, but my guess is that the resulting load would be too high to run this at a one-second interval.

Final Words

As you can see, there are multiple options to monitor your Windows network connections. I’ll try to extend this documentation with some alerting examples.

Monitoring Windows Scheduled Tasks

Introduction

Task Scheduler is a Microsoft Windows component that allows you to schedule programs or scripts to start at pre-defined intervals. There are two major versions of the Task Scheduler. In version 1.0, definitions and schedules are stored in binary .job files, and every task corresponds to a single action. This plugin will not work on version 1.0 of the Task Scheduler, which runs on Windows 2000 and Windows Server 2003. In version 2.0, the Task Scheduler got a redesigned user interface based on the Management Console. Version 2.0 also supports calendar and event-based triggers, such as starting a task when a particular event is logged to the event log, or when a combination of events has occurred. Also, several tasks that are triggered by the same event can be configured to run either simultaneously or in a pre-determined chained sequence of actions.

Tasks can also be configured to run based on system status, such as being idle for a pre-configured amount of time, on startup, on logoff, or only during or for a specified time. Another new feature is a credential manager that stores passwords so they cannot be retrieved easily. Also, scheduled tasks are executed in their own session, instead of the same session as system services or the current user. You can find a list of all Task Scheduler 2.0 interfaces here.

Requirements

Starting from Windows PowerShell 4.0, you can use a whole range of PowerShell cmdlets to manage your scheduled tasks. This plugin for Nagios does not use these cmdlets, as it has to be PowerShell 2.0 compatible. Maybe in a few years, when PowerShell 2.0 becomes obsolete, I’ll patch the script to make use of the new cmdlets. You can find the complete list of cmdlets here.

Failing tasks will always end with some sort of error code. You can find the complete list of error codes here. This plugin will output the exit codes of failing tasks in the Nagios service description. The output will also notify you of tasks that are still running. We have multiple Windows servers at work with a growing number of scheduled tasks, and each scheduled task needs to be monitored. With the help of Nagios and this plugin you can find out:

  • How many are running at the same time?
  • How many are failing?
  • How long are they running?
  • Who created them?

Versions

Disabled scheduled tasks are excluded by default since version 3.14.12.06. In earlier versions, you had to exclude them manually with -EF or -ET. Excluding disabled tasks by default seemed like a logical decision and was suggested by someone reviewing the plugin on the Nagios Exchange. Maybe one day I’ll add a switch to include them again if specified. As some scheduled tasks do not need to be monitored, the script enables you to exclude complete folders.

Since v5.13.160614 it is possible to include hidden tasks. Just add the ‘-Hidden 1’ switch to your parameters and your hidden tasks will be monitored.

One of the folders I exclude almost all the time is the “Microsoft” folder, as several tasks in it tend to fail now and then. So unless you absolutely need to know the state of every single scheduled task running on your Windows Server, I advise you to exclude it too. You can find the folders and tasks in this location: C:\Windows\System32\Tasks
It is possible to include tasks or task folders with the ‘-InclFolders’ and ‘-InclTasks’ parameters. This filter is applied after the exclude parameters. Please note that including a folder is not recursive. Only tasks in the root of the folder will be included.

Help

This is the help of the plugin, which lists all valid parameters:

You could put every scheduled task you don’t want to monitor in a separate folder and exclude it with the -EF parameter. Alternatively, you can use the -ET parameter to exclude tasks based on name patterns. One quite important thing to know is that in order to exclude or include the root folder, you need to escape the backslash, like this: “\\”.

How to monitor your scheduled tasks?

  1. Put the script in the NSClient++ scripts folder, preferably in a subfolder Powershell.
  2. In the nsclient.ini configuration file, define the script like this:

    For more information about external scripts configuration, please review the NSClient documentation. You can also consider defining a wrapped script in nsclient.ini to simplify configuration.
  3. Make a command in Nagios like this:
  4. Configure your service in Nagios. Make use of the above created command. Configure something similar like this as $ARG1$:

Some things to consider to make it work:

  • Set the PowerShell execution policy: “Set-ExecutionPolicy RemoteSigned”
  • Nscp service account permissions => Running as Local System should suffice, but some users told me it only worked with a local admin account. I found out that on some NSClient++ versions, more specifically version 0.4.3.88 and probably some earlier versions too, the following error occurred when running the nscp service as Local System: “CHECK_NRPE: Invalid packet type received from server”. After I filed an issue on the GitHub project page of NSClient++, Michael Medin quickly acknowledged the issue and fixed it in version 0.4.3.102, so the plugin should work again as Local System.

Examples

If you run the script on the command line from your Nagios plugin folder, this would be the command:

If you would want to exclude one noisy unimportant scheduled task, the command used in cli would look like this:

If you only want the scheduled tasks in the root to be monitored, you can use this command:

This would only give you the scheduled tasks available in the root folder. The output now looks like this:

Final Words

It seems the perfdata in the Highcharts graphs sometimes contains decimal numbers (see screenshot), which is kind of strange as I’m sure I only pass whole numbers. This seems to be related to the way RRD files work. To reduce the amount of storage space used, NPCD and RRD will average out the data, resulting in decimals even when you don’t expect them.
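The averaging effect is easy to reproduce. The Python sketch below is only an illustration of averaging whole-number samples into fewer data points; it does not model how RRDtool actually stores or interpolates data, and the sample values and the function name consolidate are invented.

```python
def consolidate(samples, step):
    """Average consecutive groups of `step` samples into one data point each."""
    return [
        sum(samples[i:i + step]) / len(samples[i:i + step])
        for i in range(0, len(samples), step)
    ]


# Whole-number task counts as a plugin might report them (made-up values).
task_counts = [3, 4, 4, 5, 2, 2]

# Keep one averaged point per two samples.
averaged = consolidate(task_counts, 2)
print(averaged)  # [3.5, 4.5, 2.0]
```

Even though every input value is a whole number, two of the three stored points end up with a fractional part.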

This is a small to do list:

  • Add switches to change returned values and output.
  • Add array parameter with exit codes that should be excluded.
  • Test remote execution. In some cases it might be useful to be able to check remotely for failed windows tasks.
  • Include a warning / critical threshold when discovered tasks exceed a certain duration.
  • I was hoping to add some more exit codes to check, which would make failed tasks easier to troubleshoot. You can find the list of scheduled task exit codes here. The constants that begin with SCHED_S_ are success constants, and the constants that begin with SCHED_E_ are error constants.

Screenshots:

These are some screenshots of the Nagios XI Graph Explorer for two of our servers that use the plugin to monitor scheduled tasks:

[Screenshots: Tasks 01, check_ms_win_tasks_graph_02]

Let me know on the Nagios Exchange what you think of my plugin by rating it or submitting a review. Please also consider starring the project on GitHub.

Willem

Monitoring Microsoft Windows Updates

Introduction

Monitoring WSUS updates on Microsoft Windows Server is critical to ensure you get alerted when your systems need to be patched. Installing Windows Updates on high-priority servers requires proper planning to avoid post-installation problems. If we could trust Microsoft patches 100%, WSUS updates would be installed as soon as a maintenance window could be scheduled for a system. Unfortunately, in my personal experience, WSUS updates are more a cause of problems than a solution. That’s why we prefer not to install them too quickly, as you might experience major issues with your production systems or with the software running on them. A recent example: a colleague accidentally patched some production SharePoint servers, which prevented the creation of new site collections and caused issues with some icons. The only solution was to restore a backup…

Ideally, the updates would first be tested on QA systems. If the QA servers have been running for some time without issues, the production systems can get patched. This is one of the reasons I spent some time combining the best features from the available Windows Update plugins on the Nagios Exchange.
One of them is Christian Kaufmann’s idea to cache the list of Windows Updates in a file, which results in a much lower performance impact of the plugin on the servers you are monitoring. If you have any experience with WSUS updates, you will have noticed the ‘TrustedInstaller.exe’ process, a Windows system process that takes care of querying the WSUS server and installing updates if requested.

The plugin will count all available WSUS updates and output the count in every possible state. However it will only alert in case a set number of days have passed since the last successful update was installed. By using this method, you can then define a policy and agree to patch all systems which had no updates for a certain time. You could use different policies for QA and PR (production) systems to prevent problems. 
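The alerting rule described above boils down to comparing the age of the last successful update with two thresholds. Here is a minimal Python sketch of that idea, using the 120 and 150 day defaults mentioned in the setup steps. The function name and the choice to alert when the age is equal to the threshold are assumptions; the plugin’s actual PowerShell logic may differ.

```python
from datetime import datetime

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL = 0, 1, 2


def update_age_status(last_success, now,
                      days_before_warning=120, days_before_critical=150):
    """Map the age of the last successful update to a Nagios state."""
    age_days = (now - last_success).days
    # Assumption: a threshold is considered reached when the age equals it.
    if age_days >= days_before_critical:
        return CRITICAL
    if age_days >= days_before_warning:
        return WARNING
    return OK


last_success = datetime(2016, 1, 1)
print(update_age_status(last_success, datetime(2016, 3, 1)))   # 60 days  -> 0
print(update_age_status(last_success, datetime(2016, 5, 15)))  # 135 days -> 1
print(update_age_status(last_success, datetime(2016, 6, 1)))   # 152 days -> 2
```

A stricter policy for QA servers and a more relaxed one for production servers would then just be two different pairs of threshold values.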

WSUS

 

Details

Some things you need to know about Windows Updates. Microsoft saves the date of the ‘last successful update’ in the registry. The location of the String Value is:

This date, however, is stored in Coordinated Universal Time (UTC), also referred to as Greenwich Mean Time (GMT). My plugin tries to convert it to local time with the help of a function called Get-LocalTime. This function uses the [System.TimeZoneInfo] .NET class, which is only available with .NET 3.5 or higher. So keep in mind that the ‘Last Successful Update’ date is reported in UTC on servers where .NET 3.5 or higher is not installed.
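For readers who want to see what such a conversion involves, here is a small Python sketch. It is not the Get-LocalTime function from the plugin, which uses .NET; the timestamp format ‘yyyy-MM-dd HH:mm:ss’ and the function name utc_string_to_local are assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone


def utc_string_to_local(value, local_tz=None):
    """Parse a UTC timestamp string and convert it to another time zone.

    With local_tz=None the time zone of the machine running the code is used.
    """
    # Assumed timestamp layout, for example "2016-10-12 08:23:45".
    utc_time = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    utc_time = utc_time.replace(tzinfo=timezone.utc)
    return utc_time.astimezone(local_tz)


# A fixed UTC+2 offset keeps the example reproducible on any machine.
utc_plus_two = timezone(timedelta(hours=2))
local_time = utc_string_to_local("2016-10-12 08:23:45", utc_plus_two)
print(local_time.strftime("%Y-%m-%d %H:%M:%S"))  # 2016-10-12 10:23:45
```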

The plugin will also check this registry key:

And give a warning if the system has a required reboot pending.

PSWindowsUpdate

Starting from Windows 10, Microsoft apparently decided to no longer make use of the above registry key. The only way I found to retrieve the last successful update date and time is with the help of the PSWindowsUpdate module. So I added another argument which allows you to select a different method, named ‘PSWindowsUpdate’, to retrieve the necessary information. Please note that the default method is still the original one, which I called ‘UpdateSearcher’.

In order for this method to work, you will need to install the PSWindowsUpdate module in this location: C:\Windows\System32\WindowsPowerShell\v1.0\Modules. If you are using Powershell 5 you can just do:

I’ve included the 1.5.1.11 and 1.5.2 version of the module in the GitHub repository. Or you can download it on the Microsoft Script Center Repository.

How to monitor your WSUS updates?

  1. Please note that the default DaysBeforeWarning and DaysBeforeCritical parameters are set to 120 and 150. Feel free to adjust them as required or pass them as an argument.
  2. Put the script in the NSClient++ scripts folder, preferably in a subfolder Powershell.
  3. In the nsclient.ini configuration file, define the script like this:
  4. Make a command in Nagios like this:
  5. Configure your service in Nagios. Make use of the above created command. Configure something similar like this as $ARG1$:
    QA servers =>

    PR servers =>

  6. If you want to make use of the new ‘PSWindowsUpdate’ method you will need to have an argument like this:

(Almost) Final words

So why did I create another plugin to check WSUS updates? Because I’m using a system which completely automates Windows Update installation with the help of Nagios XI and Rundeck. The existing plugins did not meet my requirements.

Please note that there are several known issues with WSUS on some operating systems. It’s recommended to always update to the latest ‘Windows Update Client’. Please check Windows 8.1 and Windows Server 2012 R2 update history for more information. More specifically, when using Windows Server 2012 R2, you will really want the following KBs:

  • KB3172614 => “July 2016 update rollup for Windows 8.1 and Windows Server 2012 R2”
  • KB3179574 => “August 2016 update rollup for Windows 8.1 and Windows Server 2012 R2”
  • KB3185279 => “September 2016 update rollup for Windows 8.1 and Windows Server 2012 R2”

Without these update rollups, checking for updates and updating your Windows Server 2012 R2 systems can be very slow. In our case an update check could take up to 40 minutes instead of 10 seconds.

Let me know on the Nagios Exchange what you think of my plugin by rating it or submitting a review. Please also consider starring the project on GitHub.

Monitoring Microsoft IIS Application Pools

Introduction

For those who are not aware, IIS is an HTTP web server from Microsoft which can host both static and dynamic content. Incoming requests are handled by a Windows kernel-mode driver named http.sys. It listens for incoming TCP requests on a configured port, performs some basic security checks and passes the request to a user-mode worker process. The worker process fulfills the request and sends the response back to the requester. Web applications are grouped into IIS application pools, each of which has its own worker process assigned to it.

As we migrated all our IIS applications to a new IIS 8.5 farm on Windows Server 2012 R2 servers, we needed a way to reliably monitor the state of our most critical IIS application pools. So I created a PowerShell script which is able to check the state of an application pool and count the number of web applications using it. As each IIS application pool has one w3wp.exe IIS worker process assigned, I added the % processor usage and memory usage to the perfdata.

The latest version also contains a new method to retrieve the IIS application pool information. As Get-ChildItem IIS:\AppPools has a weird bug where the command sometimes hangs, I had to look for an alternative. This method uses C:\Windows\system32\inetsrv\appcmd.exe instead, which seems to perform much better.
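As a rough illustration of the appcmd.exe approach, the Python sketch below extracts the state of each application pool from the text printed by ‘appcmd.exe list apppool’. The sample lines, the pool name ‘Intranet’ and the function name parse_apppools are assumptions made for this example; check them against the output on your own server, and note that the plugin itself does this in PowerShell.

```python
import re

# Assumed output of `appcmd.exe list apppool`; pool names are invented.
SAMPLE_OUTPUT = """\
APPPOOL "DefaultAppPool" (MgdVersion:v4.0,MgdMode:Integrated,state:Started)
APPPOOL "Intranet" (MgdVersion:v4.0,MgdMode:Integrated,state:Stopped)
"""

# One line per pool: the quoted name followed by key:value pairs in brackets.
LINE_PATTERN = re.compile(r'^APPPOOL "(?P<name>[^"]+)" \((?P<attrs>.*)\)$')


def parse_apppools(text):
    """Return a dict that maps each application pool name to its state."""
    states = {}
    for line in text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match is None:
            continue
        # Split "key:value" pairs on the first colon only.
        attrs = dict(
            pair.split(":", 1)
            for pair in match.group("attrs").split(",")
            if ":" in pair
        )
        states[match.group("name")] = attrs.get("state", "Unknown")
    return states


states = parse_apppools(SAMPLE_OUTPUT)
stopped = [name for name, state in states.items() if state != "Started"]
print(states, stopped)
```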

How to monitor your MS IIS Application Pools with Nagios?

  • Put the script in the NSClient++ scripts folder, preferably in a subfolder Powershell.
  • In the nsclient.ini configuration file, define the script like this:
  • Make a command in Nagios like this:
  • Configure your service in Nagios. Make use of the above created command. Configure something similar like this as $ARG1$:

    Or if you want to monitor an application pool which has OnDemand startmode where there is no IIS worker process when it isn’t used.

    IIS application pools OnDemand Startmode
    When you want to use the AppCmd.exe method:

Final Words

I only had the chance to test this on a Windows Server 2012 R2. It’s very possible you will experience issues on lower IIS versions. You need to install the IIS Management Scripts and Tools feature for the script to work properly.

IIS Application Pool

Once you have it up and running, your Nagios server should look like this:

monitoring iis application pools

 

Monitoring NetApp Ontap

Introduction

I’d like to start by thanking the original developer, John Murphy. Thanks to his plugin to monitor NetApp Ontap storage, we don’t need to buy the expensive NetApp plugin from Quorum. He also inspired me to continue developing Nagios plugins in my free time. So if you want to monitor your NetApp Ontap cluster, this plugin could help you do that. It is written in Perl and is not as actively developed as my PowerShell plugins, as my Perl knowledge is less evolved than my PowerShell knowledge. Sometimes users send me a piece of code to add. If you want to test the latest additions, give the dev branch a try. All help is welcome to improve this plugin. Read the post about debugging Perl scripts, make a fork of the project on GitHub and start experimenting.

The plugin is able to monitor NetApp Ontap components, from disks to aggregates to volumes, and alert you if it finds any unhealthy components.

NetApp Ontap Logical View

How to monitor your NetApp Ontap?

  1. Download the latest release from GitHub to a temp directory and then navigate to it.
  2. Copy the contents of NetApp/* to your /usr/lib/perl5 or /usr/lib64/perl5 directory to install the required version of the NetApp Perl SDK. (confirmed to work with SDK 5.1 and 5.2)
  3. Copy check_netapp_ontap.pl script to your nagios libexec folder and configure the correct permissions

Parameters:

--hostname, -H => Hostname or address of the cluster administrative interface.

--node, -n => Name of a vhost or cluster-node to restrict this query to.

--user, -u => Username of a NetApp Ontapi enabled user.

--password, -p => Password for the NetApp Ontapi enabled user.

--option, -o => The name of the option you want to check. See the option and threshold list at the bottom of this help text.

--warning, -w => A custom warning threshold value. See the option and threshold list at the bottom of this help text.

--critical, -c => A custom critical threshold value. See the option and threshold list at the bottom of this help text.

--modifier, -m => This modifier is used to set an inclusive or exclusive filter on what you want to monitor.

--help, -h => Display this help text.

Option list:

volume_health: Check the space and inode health of a vServer volume on a NetApp Ontap cluster. If space % and space in *B are both defined the smaller value of the two will be used when deciding if the volume is in a warning or critical state. This allows you to better accommodate large volume monitoring. thresh: space % used, space in *B (e.g. MB) remaining, inode count remaining, inode % used (Usage example: 80%i), “offline” keyword. node: The node option restricts this check by vserver name.

aggregate_health: Check the space and inode health of a cluster aggregate on a NetApp Ontap cluster. If space % and space in *B are both defined the smaller value of the two will be used when deciding if the volume is in a warning or critical state. This allows you to better accommodate large aggregate monitoring. thresh: space % used, space in *B (e.g. MB) remaining, inode count remaining, inode % used (Usage example: 80%i), “offline” keyword, “is-home” keyword. node: The node option restricts this check by cluster-node name.

snapshot_health: Check the space and inode health of a vServer snapshot. If space % and space in *B are both defined the smaller value of the two will be used when deciding if the volume is in a warning or critical state. This allows you to better accommodate large snapshot monitoring. thresh: space % used, space in *B (e.g. MB) remaining, inode count remaining, inode % used (Usage example: 80%i), “offline” keyword. node: The node option restricts this check by vserver name.

quota_health: Check that the space and file thresholds have not been crossed on a quota. thresh: N/A storage defined. node: The node option restricts this check by vserver name.

snapmirror_health: Check the lag time and health flag of the snapmirror relationships. thresh: snapmirror lag time (valid intervals are s, m, h, d). node: The node option restricts this check by snapmirror destination cluster-node name.

filer_hardware_health: Check the environment hardware health of the filers (fan, psu, temperature, battery). thresh: component name (fan, psu, temperature, battery). There is no default alert level, they MUST be defined. node: The node option restricts this check by cluster-node name.

port_health: Checks the state of a physical network port. thresh: N/A not customizable. node: The node option restricts this check by cluster-node name.

interface_health: Check that a LIF is in the correctly configured state and that it is on its home node and port. Additionally checks the state of a physical port. thresh: N/A not customizable. node: The node option restricts this check by vserver name.

netapp_alarms: Check for NetApp console alarms. thresh: N/A not customizable. node: The node option restricts this check by cluster-node name.

cluster_health: Check the cluster disks for failure or other potentially undesirable states. thresh: N/A not customizable. node: The node option restricts this check by cluster-node name.

disk_health: Check the health of the disks in the cluster. thresh: Not customizable yet. node: The node option restricts this check by cluster-node name.

For keyword thresholds, if you want to ignore alerts for that particular keyword you set it at the same threshold that the alert defaults to.
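The snapmirror_health threshold above is expressed as a lag time with an interval suffix (s, m, h or d). The Python sketch below shows one way such a value could be turned into seconds and compared with a measured lag. The plugin itself is written in Perl; the exact input format (a whole number directly followed by one unit letter, such as 30m) and the function name lag_to_seconds are assumptions made for this example.

```python
import re

# Seconds per supported interval suffix.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def lag_to_seconds(value):
    """Convert a threshold such as '30m' or '2h' into seconds."""
    # Assumed format: digits immediately followed by a single unit letter.
    match = re.fullmatch(r"(\d+)([smhd])", value.strip())
    if match is None:
        raise ValueError("unsupported lag threshold: %r" % value)
    amount, unit = match.groups()
    return int(amount) * UNIT_SECONDS[unit]


# A measured lag of 90 minutes against a 1 hour warning and 2 hour critical.
measured_lag = 90 * 60
is_warning = measured_lag >= lag_to_seconds("1h")
is_critical = measured_lag >= lag_to_seconds("2h")
print(is_warning, is_critical)  # True False
```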