Monitoring @ FOSDEM 2018 => Observability, Tracing and the RED method

FOSDEM 2018

Yesterday 03/02/18, I went to FOSDEM in Brussels. FOSDEM is a two-day non-commercial event organised by volunteers to promote the widespread use of free and open source software. The goal is to provide free and open source software developers and communities a place to meet in order to:

  • get in touch with other developers and projects
  • be informed about the latest developments in the free software world
  • be informed about the latest developments in the open source world
  • attend interesting talks and presentations on various topics by project leaders and committers
  • promote the development and benefits of free software and open source solutions

It seems like FOSDEM is getting more popular each year, resulting in rooms and lecture halls filling up fast.


Take a look at the Saturday schedule to get an idea of the wide range of topics that were discussed. I’ll try to give a quick overview of the most interesting monitoring related topics I learned about.

Observability

Christine Yen gave an interesting talk about observability, an up-and-coming term that is trending in the DevOps monitoring world. Wikipedia describes observability as a measure of how well the internal states of a system can be inferred from knowledge of its external outputs.

But is there a significant difference between monitoring and observability? It seems like monitoring vendors are quickly updating their websites to promise observability, while a few months back the word was nowhere to be found.

It seems to me that observability belongs in the same list of system attributes as:

  • functionality
  • performance
  • testability
  • debuggability
  • operability
  • maintainability
  • efficiency
  • teachability
  • usability

By making systems observable, we can make it easier to come to decisions and take action, making sure the OODA loop is actually a loop, instead of having a break in it.

Observability OODA Loop

Observability isn’t a substitute for monitoring, nor does it obviate the need for monitoring; they are complementary. Monitoring is best suited to report the overall health of systems and, as such, is best limited to key business and system metrics derived from time-series based instrumentation, known failure modes and blackbox tests. Observability, on the other hand, aims to provide highly granular insights into the behavior of systems along with rich context, which is perfect for debugging purposes. It can be used to better understand system performance and behavior, even during what can be perceived as the “normal” operation of a system. Since it’s impossible to predict every single failure mode a system could potentially run into, or every possible way in which a system could misbehave, it becomes important that we build systems that can be debugged armed with evidence and not conjecture.

How can we make microservices observable?

  • Observable systems should emit events, metrics, logs and traces
  • All components (including non-critical ones) should be instrumented
  • Instrumentation should not be ‘opt-in’, manual or ‘hard to do’
  • Find the right balance of metrics, logs and traces for a given service
  • Use advanced analytics, such as anomaly detection and forecasting, for diagnosing services

Tracing

Another interesting topic, trace buffering with LTTng, was discussed by Julien Desfossez. Some LTTng features:

  • System-wide insights: LTTng makes it possible to understand the interactions between the multiple components of a given system. By producing a unified log of events, LTTng provides great insight into the system’s behavior.
  • High performance: Tracers achieve great performance through a combination of essential techniques such as per-CPU buffering, RCU data structures, a compact and efficient binary trace format, and more.
  • Flexibility by offering multiple modes:
    • Post-processing mode: Gather as much data as possible and take the time to investigate
    • Snapshot mode: When an error is detected, extract the last trace buffer from memory
    • Live mode: Continuously monitor a stream of events
    • Rotation mode: Periodically, or on a specific trigger, process a chunk of trace without stopping the tracing

Tracing tends to be quite expensive, so think long and hard about whether the added complexity is warranted. Are you falling into the trap of premature optimisation? Is optimisation that important when you could just scale horizontally?

Tracing to disk with all kernel events enabled can quickly generate huge traces:

  • 54k events/sec on an idle 4-core laptop, 2.2 MB/sec
  • 2.7M events/sec on a busy 8-core server, 95 MB/sec

So make sure you watch your available storage closely: at 95 MB/sec you would be writing roughly 8 TB of trace data per day.

Some service instrumentation methods

Tom Wilkie gave a presentation about the RED Method, while also covering the USE Method and the Four Golden Signals. He explained why consistency between these methods is important for reducing cognitive load.

The USE Method

For every resource, monitor:

  • Utilisation: % time the resource was busy
  • Saturation: Amount of work resource has to do (often queue length)
  • Errors: Count of error events

The RED Method

For every service, monitor request:

  • Rate: Number of requests per second
  • Errors: The number of those requests that are failing
  • Duration: The amount of time those requests take

Google’s Four Golden Signals

  • Latency: The time it takes to service a request, with a focus on distinguishing between the latency of successful requests and the latency of failed requests.
  • Traffic: A measure of how much demand is being placed on the service. This is measured using a high-level service-specific metric, like HTTP requests per second in the case of an HTTP REST API.
  • Errors: The rate of requests that fail. The failures can be explicit (e.g., HTTP 500 errors) or implicit (e.g., an HTTP 200 OK response with a response body having too few items).
  • Saturation: How “full” is the service. This is a measure of the system utilization, emphasizing the resources that are most constrained (e.g., memory, I/O or CPU). Services degrade in performance as they approach high saturation.

Conclusion

FOSDEM 2018 seems to be a very popular event which covers a wide spectrum of free and open source software projects. I learned some new things and met several interesting people. Hopefully I’m able to return next year. If open source software piques your interest, I can certainly recommend attending this event. You will not be disappointed.

Monitoring F5 BIG-IP Platform

Introduction

The F5 BIG-IP platform consists of software and hardware that acts as a reverse proxy and distributes network or application traffic across a number of servers. Load balancers are used to increase the capacity and reliability of applications. They improve the overall performance of applications by decreasing the burden on servers associated with managing and maintaining application and network sessions, as well as by performing application-specific tasks. As it’s a critical part of your network, monitoring F5 BIG-IP health is essential to ensure operations are working as expected. This can be achieved via traditional SNMP polling, but in order to get a detailed view of the performance of F5 network services, you will need a combination of SNMP polling and F5 syslog message analysis.

The best platform for syslog message analysis is still the Elastic Stack. Over the past years I’ve been working on a set of F5 Logstash filters, which can be used to create beautiful Kibana dashboards that give you detailed insights into the inner workings of your F5 BIG-IP load balancer.

Load balancers are generally grouped into two categories: Layer 4 and Layer 7. Layer 4 load balancers act upon data found in network and transport layer protocols (IP, TCP, FTP, UDP). Layer 7 load balancers distribute requests based upon data found in application layer protocols such as HTTP. Requests are received by both types of load balancers and they are distributed to a particular server based on a configured algorithm. Some industry standard algorithms are:

  • Round robin
  • Weighted round robin
  • Least connections
  • Least response time

Layer 7 load balancers can further distribute requests based on application specific data such as HTTP headers, cookies, or data within the application message itself, such as the value of a specific parameter. Load balancers ensure reliability and availability by monitoring the “health” of applications and only sending requests to servers and applications that can respond in a timely manner.

Monitoring F5 BIG-IP Platform

Nagios

Nagios allows you to actively monitor the health of your F5 Load Balancer with SNMP. I’ll add some examples here as soon as possible.

Elastic

You can find the required configuration files on GitHub. The project includes F5 Logstash filters, F5 Elasticsearch templates and F5 Logstash patterns.

Logstash configuration

F5 Logstash input
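
A minimal sketch of what the input side can look like, assuming a dedicated TCP port for the F5 syslog traffic (the port number and type tag below are illustrative; the actual input configuration is part of the GitHub project):

    input {
      tcp {
        port => 5514          # illustrative: a port dedicated to F5 syslog traffic
        type => "f5"          # tag events so that only the F5 filters process them
      }
    }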

F5 Logstash filters

dcc => ASM related messages. BIG-IP Application Security Manager (ASM) enables organizations to protect against OWASP top 10 threats, application vulnerabilities, and zero-day attacks. Leading Layer 7 DDoS defenses, detection and mitigation techniques, virtual patching, and granular attack visibility thwart even the most sophisticated threats before they reach your servers.

apd => Access Policy Demon. The apd process runs a BIG-IP APM access policy for a user session.

tmm => The traffic management microkernel is the process running on the BIG-IP host O/S that performs all of the local / global traffic management for the system.

sshd => The ssh daemon provides remote access to the BIG-IP system command line interface.

logger => If a BIG-IP high-availability redundant pair has the Detect ConfigSync Status feature enabled, each unit in the pair sends periodic iControl queries to its peer to determine if the redundant pair configuration is synchronized. These iControl requests occur approximately every 30 seconds on each unit. Each inbound request generates an entry in both the local /var/log/httpd/ssl_access_log file and the /var/log/httpd/ssl_request_log file. As I never saw anything useful coming out of it, I asked our F5 engineer to have a look at this F5 article, which describes how to exclude these messages in the F5 syslog configuration.
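
As a rough illustration of how messages from these processes can be routed to dedicated grok patterns (a sketch only, with an illustrative syslog header pattern; the real filters live in the GitHub project):

    filter {
      if [type] == "f5" {
        grok {
          # split the syslog header from the message body (illustrative pattern)
          match => { "message" => "<%{NONNEGINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}" }
        }
        if [syslog_program] == "dcc" {
          mutate { add_tag => ["f5_asm"] }    # ASM violation parsing would go here
        } else if [syslog_program] == "apd" {
          mutate { add_tag => ["f5_apm"] }    # APM access policy parsing would go here
        }
      }
    }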

F5 Logstash custom grok patterns

You will need to add these F5 Logstash custom grok patterns to your Logstash patterns directory. For me it’s located in /etc/logstash/patterns.
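
To give an idea of what such a pattern file looks like, here is a purely illustrative example (the pattern names and regular expressions below are made up; use the actual patterns from the GitHub project):

    # /etc/logstash/patterns/f5 -- illustrative example of the grok pattern file format
    F5_DEVICE %{HOSTNAME:f5_device}
    F5_VIRTUALSERVER /Common/%{NOTSPACE:virtual_server}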

Elasticsearch configuration

Included in the GitHub project you can find my F5 Elasticsearch template, with the correct mapping for each field. This enables you to use your data more efficiently and allows for advanced IP aggregations. You can find more information about mapping types here. If you have ideas about better mappings (I know they need some work), please let me know by opening an issue on GitHub.

F5 Remote Logging Configuration

You will need to configure your F5 with one or more remote syslog servers to send logs to your Logstash nodes. Ideally you will want to specify a custom port dedicated to F5 syslog traffic. You can find the official F5 remote syslog documentation here.

You can use the F5 Configuration Utility to add a remote syslog server like this:

  1. Log on to the Configuration utility.
  2. Navigate to System > Logs > Configuration > Remote Logging.
  3. Enter the destination syslog server IP address in the Remote IP text box.
  4. Enter the remote syslog server UDP port (default is 514) in the Remote Port text box.
  5. Enter the local IP address of the BIG-IP system in the Local IP text box (optional).

    Note: For BIG-IP systems in a high availability (HA) configuration, the non-floating self IP address is recommended if using a Traffic Management Microkernel (TMM) based IP address.

  6. Click Add.
  7. Click Update.

Kibana Dashboards

The Logstash filters I created allow you to do some awesome things in Kibana. I’m working on a set of dashboards with a menu that will allow you to drill down to interesting stuff, such as apd sessions, dcc scraping and other violations.

Elastic F5 Home Dashboard


Elastic F5 dcc scraping dashboard


I will open source them when I consider them ready for the public. If you are truly interested in helping me develop and expand them, send me an email (willem.dhaeseatgmail), and I’ll consider sending you my Kibana 5 dashboard JSON exports. The only requirement is that you use f5-* as the index pattern.

Greetings

Willem

CentOS 7 – An Enterprise Ready Problemless OS

Introduction

It must be about 8 years now since we chose CentOS as our default operating system for Linux servers. A lot has changed since then, and it has always been on my to-do list to write a blog post about it. Karanbir Singh announced the release of CentOS 7.4.1708 on 13/09/17. As with all CentOS 7 components, this release was built from sources hosted at git.centos.org. It also supersedes all previously released content for CentOS Linux 7, and users are highly encouraged to upgrade all systems running CentOS 7. Make sure to read the release notes before upgrading.

One month later, we were able to patch all our CentOS 7 systems and did not run into a single upgrade problem. I would say that merits a big congratulations to the whole CentOS team, and of course also all Red Hat engineers for producing a problemless and stable distribution.

In this post I’ll try to give a general overview of what CentOS is about and why you should choose this particular operating system.


CentOS Lifecycle

It’s very important to keep an eye on the lifecycles of the operating systems you are managing. Good planning ensures you have enough time to migrate your applications before your operating system is no longer supported.

CentOS Version | Release Date  | Full Updates    | Maintenance Updates
3              | 19 March 2004 | 20 July 2006    | 31 October 2010
4              | 9 March 2005  | 31 March 2009   | 29 February 2012
5              | 12 April 2007 | 31 January 2014 | 31 March 2017
6              | 10 July 2011  | 10 May 2017     | 30 November 2020
7              | 7 July 2014   | Q4 2020         | 30 June 2024

Source: https://en.wikipedia.org/wiki/CentOS 

CentOS 7 Repositories

There are three primary CentOS repositories (also known as channels) containing the software packages that make up the main CentOS distribution, plus the optional centosplus repository.

  • base – Contains the packages that form a CentOS point release, and gets updated when the actual point release is formally made available in the form of ISO images.
  • updates – Contains packages that serve as security, bugfix or enhancement updates, issued between the regular update sets for point releases.
  • extras – Provides additional packages that may be useful.
  • centosplus – Provides additional packages that extend the functionality of existing packages. Please note that this repository is disabled by default. Using it is more dangerous than using the other CentOS repositories, as it is designed to have several updated packages and is not really meant to be enabled completely. You should only pick the packages you are looking for and use exclude= and includepkgs= (or yum-plugin-priorities) to load only those packages from the centosplus repository (also check the official centosplus documentation).

CentOS vs Red Hat Enterprise Linux

While CentOS is derived from the Red Hat Enterprise Linux codebase, CentOS and Red Hat Enterprise Linux are distinguished by divergent build environments, QA processes and, in some editions, different kernels and other open source components. For this reason, the CentOS binaries are not the same as the Red Hat Enterprise Linux binaries. Red Hat Enterprise Linux (RHEL) is also open source, but although the code is available to Red Hat users, it is not free to use. Red Hat and the CentOS project announced on 7 January 2014 that they were joining forces.

Category | CentOS | RHEL
License | FOSS – GPL and others | Commercial – Red Hat EULA
Security | SELinux, NSS, Linux PAM, firewalld | SELinux, NSS, Linux PAM, firewalld
Patches/fixes | As promptly as possible given available project resources | SLA through Red Hat
Support | Self-support | 24x7 support through Red Hat
Package management | Yum | Yum
Enterprise package management | Spacewalk / Katello | Red Hat Satellite
Clustering | Linux-HA | Red Hat Cluster Suite (RHCS)
Bootloader | GRUB 2 | GRUB 2
Graphical user interface (GUI) | GNOME 3 / KDE SC 4.10 | GNOME 3 / KDE SC 4.10
Service management | systemd | systemd
Storage management | LVM / SSM | LVM / SSM
Default file system | XFS | XFS
Containerization | Docker, Kubernetes | Red Hat OpenShift
Virtual device interface (VDI) | SPICE | SPICE


There are a lot of advantages to choosing Red Hat 7 over CentOS 7.

  • Enterprise-level support
  • Access to engineering resources
  • Red Hat’s Customer Portal
  • Certifications
  • Latest features

But choosing Red Hat also has some considerable disadvantages:

  • Not free
  • Administration overhead for license management

And yes, I do mention the administration overhead as a problem. It might not apply to everyone, but in my case the process of ordering new Red Hat licenses or prolonging expiring licenses just takes a lot of (unnecessary) time.

Final words

So I hope my blog post gave you some additional information to make a better-informed decision about which operating system is best suited for your use case. If you need professional support, Red Hat is there for you; if you feel comfortable supporting your own Linux servers, follow the CentOS rabbit.

Monitoring Infoblox DDI

Introduction

Infoblox is a DDI (DNS, DHCP and IP address management) solution which simplifies network management a lot. Over the past 8 years I have been able to work with it and never looked into another solution, as it completely fulfills all our DNS and DHCP needs. During that time, I’ve been fine-tuning my Infoblox Logstash grok patterns and index template mappings. As I didn’t find any existing Infoblox Logstash grok patterns, I decided to make mine open source. You can download the Logstash configuration file on GitHub here. There is also a template included with the mappings for Elasticsearch.

Infoblox Logstash

Infoblox Logging

Thanks to Infoblox, we can:

  • Consolidate DNS, DHCP, IP address management, and other core network services into a single platform, managed from a common console
  • Centrally orchestrate DDI functions across diverse infrastructure
  • Boost IT efficiency and automation by seamlessly integrating with other IT systems (such as Rundeck) through RESTful APIs

Infoblox has integrated reporting & analytics capabilities, but imho DNS- and DHCP-related logs belong at the top of the priority list for sending to a log aggregator, such as Elasticsearch or NLS. DHCP and DNS logs allow us to link IP addresses to device hostnames and MAC addresses. As IP addresses are logged everywhere, this is a vital log source for tracing what was done by whom on your network. A good Logstash filter is able to parse all the relevant fields, so they can be used in aggregations.

Infoblox Logstash filter (named, dhcpd and httpd)
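
The complete filter is published on GitHub; the sketch below only shows its overall shape, with an illustrative port number, a simplified custom syslog header and an example dhcpd grok for DHCPACK messages:

    input {
      tcp {
        port => 5515              # illustrative: a port dedicated to Infoblox syslog traffic
        type => "infoblox"
      }
    }

    filter {
      if [type] == "infoblox" {
        grok {
          # simplified custom syslog header with our own field names (illustrative)
          match => { "message" => "<%{NONNEGINT:syslog_priority}>%{SYSLOGTIMESTAMP:timestamp} %{SYSLOGHOST:hostname} %{DATA:process}(?:\[%{POSINT:pid}\])?: %{GREEDYDATA:logmessage}" }
        }
        if [process] == "dhcpd" {
          grok {
            # example: extract the ip, mac and client hostname from DHCPACK messages
            match => { "logmessage" => "DHCPACK on %{IP:ip} to %{MAC:mac}( \(%{DATA:client_hostname}\))? via %{NOTSPACE:interface}" }
          }
        }
      }
    }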

Please note hat I’m not using a syslog input, but a tcp input. I’ve had considerable issue with the default syslog patterns used by Elasticsearch.  Apart from that I prefer to apply my own field names for syslog data. Using my own custom syslog grok pattern allows me to match the parsed field to our internally used naming conventions. Feel free to adjust the field names as needed.