Monitoring @ FOSDEM 2018 => Observability, Tracing and the RED method


Yesterday 03/02/18, I went to FOSDEM in Brussels. FOSDEM is a two-day non-commercial event organised by volunteers to promote the widespread use of free and open source software. The goal is to provide free and open source software developers and communities a place to meet in order to:

  • get in touch with other developers and projects
  • be informed about the latest developments in the free software world
  • be informed about the latest developments in the open source world
  • attend interesting talks and presentations on various topics by project leaders and committers
  • promote the development and benefits of free software and open source solutions

It seems like FOSDEM is getting more popular each year, with rooms and aulas filling up fast.


Take a look at the Saturday schedule to get an idea of the wide range of topics that were discussed. I’ll try to give a quick overview of the most interesting monitoring-related topics I learned about.


Christine Yen gave an interesting talk about observability, a term that is currently trending in the devops monitoring world. Wikipedia describes observability as a measure of how well internal states of a system can be inferred from knowledge of its external outputs.

But is there a significant difference between monitoring and observability? It seems like monitoring vendors are quickly updating their websites and promise observability while a few months back, the word was nowhere to be found.

It seems to me like observability could be in the same list as other system attributes, such as:

  • functionality
  • performance
  • testability
  • debuggability
  • operability
  • maintainability
  • efficiency
  • teachability
  • usability

By making systems observable, we can make it easier to come to decisions and take action, making sure the OODA loop is actually a loop, instead of having a break in it.

Observability OODA Loop

Observability isn’t a substitute for monitoring, nor does it obviate the need for monitoring; the two are complementary. Monitoring is best suited to report the overall health of systems. As such, it is best limited to key business and systems metrics derived from time-series based instrumentation, known failure modes and blackbox tests. Observability, on the other hand, aims to provide highly granular insights into the behavior of systems along with rich context, perfect for debugging purposes. It can be used to better understand system performance and behavior, even during what can be perceived as “normal” operation of a system. Since it’s impossible to predict every single failure mode a system could run into, or every possible way in which it could misbehave, it becomes important that we build systems that can be debugged armed with evidence and not conjecture.

How can we make microservices observable?

  • Observable systems should emit events, metrics, logs and traces
  • All components (also non-critical) should be instrumented
  • Instrumentation should not be ‘opt-in’, manual or ‘hard to do’
  • Need to find the right balance of metrics, logs and traces for a given service
  • Using advanced analytics for diagnosing services, such as anomaly detection and forecasting
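To make the first point of the list above a bit more concrete, here is a minimal sketch of what emitting a structured event could look like. It only uses the Python standard library. The field names, the `make_event` and `format_event` helpers and the `checkout` service are my own assumptions for illustration, not something taken from the talk.

```python
import json
import time
import uuid


def make_event(service, name, duration_ms, **context):
    """Build one structured event describing a unit of work.

    The extra keyword arguments carry the "rich context" (status code,
    user id, build version, ...) that makes the event useful for debugging.
    """
    event = {
        "timestamp": time.time(),
        "event_id": uuid.uuid4().hex,
        "service": service,
        "name": name,
        "duration_ms": duration_ms,
    }
    event.update(context)
    return event


def format_event(event):
    """Serialise the event as one JSON line, ready for a log shipper."""
    return json.dumps(event, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical request handled by a hypothetical "checkout" service.
    print(format_event(make_event("checkout", "http_request", 12.5,
                                  status=200, user_id="u42")))
```

Because every event is a flat JSON document, the same record can feed logs, metrics (by counting and aggregating) and traces (by adding a trace id to the context).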


Another interesting topic, trace buffering with LTTng, was discussed by Julien Desfossez. Some LTTng features:

  • System-wide insights: LTTng makes it possible to understand the interactions between multiple components of a given system. By producing a unified log of events, LTTng provides great insight into the system’s behavior.
  • High performance: Tracers achieve great performance through a combination of essential techniques such as per-CPU buffering, RCU data structures, a compact and efficient binary trace format, and more.
  • Flexibility by offering multiple modes:
    • Post-processing mode: Gather as much data as possible and take the time to investigate
    • Snapshot mode: When an error is detected, extract the last trace buffer from memory
    • Live mode: Continuously monitor a stream of events
    • Rotation mode: Periodically, or on a specific trigger, process a chunk of trace without stopping the tracing

Tracing tends to be quite expensive. Think long and hard about whether the added complexity is warranted. You might be falling into the trap of premature optimisation. Is optimisation that important when you could just scale horizontally?

Tracing to disk with all kernel events enabled can quickly generate huge traces:

  • 54k events/sec on an idle 4-core laptop, 2.2 MB/sec
  • 2.7M events/sec on a busy 8-core server, 95 MB/sec

So make sure you watch your available storage closely.
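To get a feeling for what these rates mean for your disks, a bit of back-of-the-envelope arithmetic helps. The sketch below uses the two data rates quoted above; the helper names and the choice of decimal units (1 GB = 1000 MB) are my own assumptions.

```python
def trace_volume_gb(mb_per_sec, seconds):
    """Trace data written in the given time, in GB (1 GB = 1000 MB)."""
    return mb_per_sec * seconds / 1000.0


def seconds_until_full(free_gb, mb_per_sec):
    """How long a disk with free_gb of free space lasts at this rate."""
    return free_gb * 1000.0 / mb_per_sec


def bytes_per_event(mb_per_sec, events_per_sec):
    """Average on-disk size of a single event (1 MB = 1,000,000 bytes)."""
    return mb_per_sec * 1_000_000 / events_per_sec


if __name__ == "__main__":
    hour = 3600
    print("idle laptop:", trace_volume_gb(2.2, hour), "GB/hour")
    print("busy server:", trace_volume_gb(95, hour), "GB/hour")
    print("bytes/event on the server:", bytes_per_event(95, 2_700_000))
```

That works out to roughly 8 GB per hour on the idle laptop and 342 GB per hour on the busy server, at an average of about 35 to 41 bytes per event.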

Some service instrumentation methods

Tom Wilkie gave a presentation about the RED Method, while also talking about the USE Method and the Four Golden Signals. He explained why consistency is an important approach for reducing cognitive load.

The USE Method

For every resource, monitor

  • Utilisation: % time the resource was busy
  • Saturation: Amount of work resource has to do (often queue length)
  • Errors: Count of error events

The RED Method

For every service, monitor request

  • Rate: Number of requests per second
  • Errors: The number of those requests that are failing
  • Duration: The amount of time those requests take
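As an illustration, the three RED values can be derived from a plain list of request records. This is only a sketch under my own assumptions: the `Request` record, the `red_summary` helper and the nearest-rank percentile are made up for this example, and in practice you would normally let your metrics system calculate these from counters and histograms.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float  # how long the request took
    failed: bool       # True when the request ended in an error


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list (None if empty)."""
    if not sorted_values:
        return None
    rank = math.ceil(pct / 100 * len(sorted_values))
    return sorted_values[max(rank, 1) - 1]


def red_summary(requests, window_s):
    """Rate, Errors and Duration for the requests seen in one time window."""
    total = len(requests)
    errors = sum(1 for r in requests if r.failed)
    durations = sorted(r.duration_s for r in requests)
    return {
        "rate_per_s": total / window_s,
        "errors_per_s": errors / window_s,
        "error_ratio": errors / total if total else 0.0,
        "duration_p50_s": percentile(durations, 50),
        "duration_p99_s": percentile(durations, 99),
    }


if __name__ == "__main__":
    sample = [Request(0.1, False), Request(0.2, False),
              Request(0.3, True), Request(0.4, False)]
    print(red_summary(sample, window_s=2))
```

Reporting duration as percentiles rather than as an average keeps slow outliers visible, which is usually what you care about when a service misbehaves.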

Google’s Four Golden Signals

  • Latency: The time it takes to service a request, with a focus on distinguishing between the latency of successful requests and the latency of failed requests.
  • Traffic: A measure of how much demand is being placed on the service. This is measured using a high-level service-specific metric, like HTTP requests per second in the case of an HTTP REST API.
  • Errors: The rate of requests that fail. The failures can be explicit (e.g., HTTP 500 errors) or implicit (e.g., an HTTP 200 OK response with a response body having too few items).
  • Saturation: How “full” is the service. This is a measure of the system utilization, emphasizing the resources that are most constrained (e.g., memory, I/O or CPU). Services degrade in performance as they approach high saturation.


FOSDEM 2018 seems to be a very popular event which covers a wide spectrum of free and open source software projects. I learned some new things and met several interesting people. Hopefully I’m able to return next year. If open source software piques your interest, I can certainly recommend attending this event. You will not be disappointed.

CentOS 7 – An Enterprise Ready Problemless OS


It must be about 8 years now since we chose CentOS as our default operating system for Linux servers. A lot has changed since then and it has always been on my to-do list to write a blog post about it. Karanbir Singh announced the release of CentOS 7.4.1708 on 13/09/17. As with all CentOS 7 components, this release was built from the sources published by the CentOS project. It also supersedes all previously released content for CentOS Linux 7, and users are highly encouraged to upgrade all systems running CentOS 7. Make sure to read the release notes before upgrading.

One month later, we were able to patch all our CentOS 7 systems and did not run into a single upgrade problem. I would say that merits a big congratulations to the whole CentOS team, and of course also all Red Hat engineers for producing a problemless and stable distribution.

In this post I’ll try to give a general overview of what CentOS is about and why you should choose this particular operating system.

centos 7

CentOS Lifecycle

It’s very important to keep an eye on the lifecycles of the operating systems you are managing. Good planning ensures you have enough time to migrate your applications before your operating systems are no longer supported.

CentOS Version | Release Date | Full Updates | Maintenance Updates
3 | 19 March 2004 | 20 July 2006 | 31 October 2010
4 | 9 March 2005 | 31 March 2009 | 29 February 2012
5 | 12 April 2007 | 31 January 2014 | 31 March 2017
6 | 10 July 2011 | 10 May 2017 | 30 November 2020
7 | 7 July 2014 | Q4 2020 | 30 June 2024
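For planning purposes it can help to turn these dates into a simple countdown. Below is a small sketch that does this for the two versions that are still maintained, using the maintenance end dates from the table above. The `days_of_support_left` helper is just an illustration of the idea, not part of any CentOS tooling.

```python
from datetime import date

# End of maintenance updates, taken from the lifecycle table above.
MAINTENANCE_END = {
    "6": date(2020, 11, 30),
    "7": date(2024, 6, 30),
}


def days_of_support_left(version, today):
    """Days until maintenance updates end (negative once it has passed)."""
    return (MAINTENANCE_END[version] - today).days


if __name__ == "__main__":
    for version in sorted(MAINTENANCE_END):
        left = days_of_support_left(version, date.today())
        print(f"CentOS {version}: {left} days of maintenance updates left")
```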


CentOS 7 Repositories

There are three primary CentOS repositories (also known as channels), containing software packages that make up the main CentOS distribution, plus the optional centosplus repository.

  • base – Contains packages that form CentOS point releases, and gets updated when the actual point release is formally made available in the form of ISO images.
  • updates – Contains packages that serve as security, bugfix or enhancement updates, issued between the regular update sets for point releases. 

  • extras – Provides additional packages that may be useful.

  • centosplus – Provides additional packages that extend functionality of existing packages. Please note that this repository is disabled by default. Using this repository is more dangerous than using other CentOS repositories, as it is designed to have several updated packages and it is not really designed to be completely enabled. You should only pick the packages you are looking for and use exclude= and includepkgs= (or exclude= and yum-plugin-priorities) to load only those packages from the centosplus repository. (also check the official centosplus documentation)

CentOS vs Red Hat Enterprise Linux

While CentOS is derived from the Red Hat Enterprise Linux codebase, CentOS and Red Hat Enterprise Linux are distinguished by divergent build environments, QA processes, and, in some editions, different kernels and other open source components. For this reason, the CentOS binaries are not the same as the Red Hat Enterprise Linux binaries. Red Hat Enterprise Linux (RHEL) is actually also open source. But although the code is available for Red Hat users, it is not free to use. On 7 January 2014, Red Hat and the CentOS project announced that they were joining forces.

Feature | CentOS | Red Hat Enterprise Linux
License | FOSS – GPL and others | Commercial – Red Hat EULA
Security | SELinux, NSS, Linux PAM, firewalld | SELinux, NSS, Linux PAM, firewalld
Patches/fixes | As promptly as possible given available project resources | SLA through Red Hat
Support | Self-support | 24x7 support through Red Hat
Package management | Yum | Yum
Enterprise package management | Spacewalk / Katello | Red Hat Satellite
Clustering | Linux-HA | Red Hat Cluster Suite (RHCS)
Bootloader | GRUB 2 | GRUB 2
Graphical user interface (GUI) | GNOME 3 / KDE SC 4.10 | GNOME 3 / KDE SC 4.10
Service management | systemd | systemd
Storage management | LVM / SSM | LVM / SSM
Default file system | XFS | XFS
Containerization | Docker, Kubernetes | Red Hat OpenShift
Virtual device interface (VDI) | SPICE | SPICE

red hat

There are a lot of advantages in choosing Red Hat 7 over CentOS 7. 

  • Enterprise-level support
  • Access to engineering resources
  • Red Hat’s Customer Portal
  • Certifications
  • Latest features

But choosing Red Hat also has some considerable disadvantages:

  • Not free
  • Administration overhead for license management

And yes, I do mention the administration overhead as a problem. This problem might not apply to everyone. In my case, though, the process of ordering new Red Hat licenses or renewing expiring licenses just takes a lot of (unnecessary) time.

Final words

So I hope my blog post gave you some additional information to make a better-informed decision about which operating system is best suited for your use case. If you need professional support, Red Hat is there for you. If you feel comfortable supporting your own Linux servers, follow the CentOS rabbit.

Rundeck 2.10 – Ultimate Open Source Job scheduler

Rundeck Review

In June 2016, Nagios announced they were stopping development on Nagios Reactor. So I had to start looking for a replacement. After playing with Foreman, Jenkins, Rundeck and StackStorm, I decided the best solution for my needs was definitely Rundeck. In this Rundeck review, I’ll try to go into detail on some of the most useful Rundeck features I’ve been using over the last few years.


Rundeck was definitely a hidden gem in the open source automation landscape, which has been dominated by configuration management oriented tools, such as Ansible, Chef, Puppet and Salt. But imho we don’t always need full configuration management. Usage of a job scheduler and orchestrator is in a lot of cases a more suitable option. And an added bonus is that Rundeck integrates with Ansible thanks to this plugin.

Rundeck is being very actively developed, meaning they regularly release new features. The nice thing is that they truly listen to their community, by allowing us to vote for popular features in a Trello board. Feel free to create an account and vote for the features you think deserve priority development time.

So what if you want professional support? Then you can opt into Rundeck Pro, which has some additional features and pro plugins available. Ok, I hope this Rundeck review helps you make a better-informed decision on which automation platform to start using in your digital transformation.

Rundeck Projects and Jobs

Rundeck projects contain definitions of nodes, as well as a set of jobs that reference these nodes. Using access control policies allows you to choose which teams have access to perform actions on jobs. Each node in the Rundeck project can be customized with tags, allowing you to target each kind of node rather than reference specific host names or IP addresses. All these Rundeck features allow you to create job libraries with useful scripts. Integrating the Rundeck access, job and execution logs into an Elastic stack gives you full visibility of what’s happening in your Rundeck server.
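The idea behind targeting nodes by tag rather than by host name can be shown with a few lines of Python. To be clear, this is not Rundeck code and not its node filter syntax. The `NODES` inventory and the `select_nodes` helper are hypothetical and only illustrate the concept.

```python
# Hypothetical node inventory: every node carries a set of tags.
NODES = [
    {"name": "web01", "tags": {"web", "production"}},
    {"name": "web02", "tags": {"web", "staging"}},
    {"name": "db01", "tags": {"database", "production"}},
]


def select_nodes(nodes, required_tags):
    """Return the names of all nodes that carry every required tag."""
    required = set(required_tags)
    return [node["name"] for node in nodes if required <= node["tags"]]


if __name__ == "__main__":
    # A job that targets "all production web servers" never has to
    # mention a host name or an IP address.
    print(select_nodes(NODES, ["web", "production"]))
```

When a new web server is added with the right tags, every job that targets those tags picks it up without any change to the job definition.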

You can group Rundeck jobs in folders and subfolders. A collapsed view of all jobs in my DAF project:


Rundeck Security

Please note I’m just listing a few security-related topics in this Rundeck review. Please refer to the official Rundeck documentation for all the information you need to set up a secure Rundeck instance.

Active Directory integration

Active Directory integration is a basic requirement for any automation tool. Using Active Directory groups allows you to group users and assign specific permissions to them. Please refer to the official Rundeck documentation if you want more information on how to configure this.

Agentless SSH based automation

A critical feature of any automation tool is a way to encrypt its traffic. As Rundeck uses SSH for executing commands on nodes, it already has a big advantage over tools that rely on unencrypted protocols. SSH is a secure protocol used as the primary means of connecting to Linux servers remotely. When you connect, you are dropped into a shell session, which is a text-based interface where you can interact with your server. For the duration of your SSH session, any commands that you type into your local terminal are sent through an encrypted tunnel and executed on your server. Clients generally authenticate either using passwords (less secure and not recommended) or SSH keys, which are very secure.


The Rundeck URL also needs to be protected, otherwise attackers could easily sniff your network and extract usernames, passwords, job options and more from API calls or logins. This procedure describes the steps that need to be taken in order to configure SSL for your Rundeck server. I decided to create my own version of the official documentation, but it’s only applicable to Microsoft .pfx certificates.


How to configure SSL for Rundeck?

  • Generate a .pfx server certificate with your private root CA
  • Copy the generated server certificate <servername>.pfx to /etc/rundeck/ssl
  • Create a keystore to hold the server certificate <servername>.pfx

  • Retrieve the alias from the <servername>.pfx file

  • Import the Certificate and Private Key into the Java keystore

  • Create a keystore for the CA certificate

  • Add the CA certificate to the CA keystore

  • Edit /etc/rundeck/ssl/ and update all properties with their current values:

  • Edit /etc/rundeck/profile and uncomment:

  • Edit /etc/rundeck/

  • Edit /etc/rundeck/

  • Make sure port 4443 is opened in the firewall:

  • Restart the rundeckd daemon

  • Tail the Rundeck logs to make sure everything works fine:
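Once the steps above are done, you can also verify from a client that the server really presents a certificate your CA trusts. The sketch below does this with Python’s standard ssl module. Port 4443 comes from the steps above; the host name `rundeck.example.com` and the CA file path are placeholders I made up, so replace them with your own values.

```python
import socket
import ssl


def build_context(ca_file=None):
    """Client context that verifies the certificate chain and host name.

    With ca_file=None the system trust store is used, otherwise only the
    given CA bundle (for example your private root CA in PEM format).
    """
    return ssl.create_default_context(cafile=ca_file)


def peer_certificate(host, port=4443, ca_file=None, timeout=5.0):
    """Connect, complete the TLS handshake and return (protocol, cert)."""
    context = build_context(ca_file)
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.version(), tls.getpeercert()


if __name__ == "__main__":
    # Placeholder host name and CA path, replace with your own.
    try:
        protocol, cert = peer_certificate("rundeck.example.com",
                                          ca_file="/path/to/root-ca.pem")
        print(protocol, cert.get("subject"), cert.get("notAfter"))
    except (OSError, ssl.SSLError) as exc:
        print("TLS check failed:", exc)
```

If the certificate is not signed by the given CA, or the host name does not match, the handshake fails with an error instead of silently succeeding, which is exactly what you want from such a check.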

Final words

I’d love to give a big thanks to the Rundeck developers for making Rundeck available to the public. I’m sorry if important stuff is missing in this (basic) Rundeck review, I’ll try to add more information over time. It’s also on my to-do list to open source my Elastic pipeline configurations, which enable analytics on the access, job and execution logs.



There seems to be quite some development work done on NSClient++ lately by Michael Medin, as you can see in this GitHub commit graph.


As I’m still on an older version, I thought it was time to write a little review of the latest (nightly) version of NSClient++. To be honest, I’m not looking forward to migrating all our old NSClient installations to later versions. As the nsclient.ini configuration has changed drastically, it will take some work to migrate everything without issues.

There aren’t really any alternatives at the moment. As far as I know NSClient++ is still the only client offering real-time eventlog monitoring capabilities and this is imho a must-have. 

So in this review I will go through all my old checks and see whether they still work in the nightly build. Please note that a nightly build is not fit for production environments yet.


So download the latest version of NSClient++ and start the installation.
Choose the generic monitoring option. 


Choose custom setup:


Set the IP address of your Nagios server in the allowed hosts field and a strong password in the password field. You will need this password later to log in to the website. For now, choose the ‘Insecure legacy mode’ option. In order to use the ‘Safe mode’ and ‘Secure’ mode, you will have to install NSClient on your Nagios server too, but if you are only monitoring internal servers (not over the Internet), the ‘Insecure legacy mode’ should be ok. I’ll try to make a post about the other modes in the future.


Click next and NSClient++ will finish the installation. In order to understand all the default options and settings, it’s generally a good idea to add the default settings to the nsclient.ini file. This can be done with the following command from the NSClient++ installation folder:

In my case, with a fresh install, this generated some errors, but I’m quite sure this won’t be a real issue. Probably just related to the nightly build.


So I was curious to see if I could get the NSClient++ webserver working, as in my previous tests (0.4.3.x) I never got it to work properly.

As I had checked the ‘Enable Web server’ checkbox, I was expecting it to work out of the box, which was not the case. Browsing to https://localhost:8443/ resulted in an ‘ERR_CONNECTION_REFUSED’ error.

So I had a look through the nsclient.ini and noticed that although I did enable it in the installation I still found this:

So after setting it to 1 and restarting the nscp service, I was able to log in with the password I configured during the installation. The webserver is using a self-signed certificate, which is better than nothing. If you have a certificate authority, you should be able to generate trusted certificates so you don’t get the ‘red cross’ in your browser.



After logging in, you immediately arrive in the Home webpage with some basic information, such as CPU Load, Processes, threads, handles and uptime


It looks a bit like a remake of the Windows Task Manager, but with a little less information.


There seems to be no X-axis information in the NSClient CPU webpage. The interval used in the Task Manager is one second, while the NSClient webpage uses a different interval, which of course results in slightly different values.

The NSClient webpage is of course accessible from ‘anywhere’ if set up correctly, which is definitely a plus. 

One more small remark is that Michael seems to have chosen ‘CPU Load’ as the name for what is really the CPU utilization. Imho this is quite confusing, as on Linux servers the CPU load is a value representing the current CPU run queue. As NSClient is supposed to also work on Linux servers now, I think it should be named ‘CPU Usage’ (which is a bit shorter than ‘CPU Utilization’).
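To illustrate why the two terms should not be mixed up, here is a small sketch that shows both. CPU usage is calculated from two samples of cumulative idle and total CPU time, while the load average is read straight from the operating system. The `cpu_utilisation` helper and the sample numbers are my own example and have nothing to do with how NSClient++ calculates its values.

```python
import os


def cpu_utilisation(previous, current):
    """CPU usage (0.0 to 1.0) between two (idle, total) time samples.

    Both samples are cumulative counters, for example the idle and total
    jiffies summed up from the first line of /proc/stat on Linux.
    """
    idle_delta = current[0] - previous[0]
    total_delta = current[1] - previous[1]
    if total_delta <= 0:
        return 0.0
    return 1.0 - idle_delta / total_delta


if __name__ == "__main__":
    # Made-up samples: 200 ticks passed, 50 of them idle, so 75% usage.
    print("CPU usage:", cpu_utilisation((100, 1000), (150, 1200)))

    # The load average is a different thing: the average number of
    # runnable tasks over the last 1, 5 and 15 minutes (Unix only).
    if hasattr(os, "getloadavg"):
        print("Load average:", os.getloadavg())
```

A usage value can never exceed 100%, while a load average of 8 on a 4-core machine is perfectly possible and simply means tasks are queuing up.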

Besides CPU info, there is also some memory information:


And a list of 38 metrics. I think these are all the metrics NSClient++ is caching, enabling it to calculate nice averages instead of current values.



The second menu item ‘Modules’ lists all the available modules and their state. 


So I tried checking an extra module to see if it is changed in the nsclient.ini, but apart from the checkbox being checked, nothing really changed. 


As there is almost no documentation about the new webserver, I tried some things myself, but to no effect. I’m not quite sure what the reload and shutdown actions are supposed to do.


I’ve tested this and it does not restart the nscp service. Shutdown doesn’t really seem to do anything yet.

And then I suddenly noticed that a new menu item appears ‘Changes’, which allows me to Save or Undo the configuration.


It felt a bit weird that this menu item just appeared out of nowhere. Maybe it would be better if it was always there, but with a green icon or something similar when there are no detected changes. Something else I noticed is that when loading a module, you cannot enable this module unless you save it first.

In the nsclient.ini the modules I activated were properly adjusted. The only weird thing is that changes made with the web GUI use ‘enabled’ or ‘disabled’, while changes made on the command line, such as generating the defaults, use ‘0’ or ‘1’ to disable or enable a module. It would be nice if this was somehow more consistent.
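If you ever need to process such a mixed nsclient.ini with your own scripts, you can normalise both notations to a boolean first. This is a minimal sketch that only covers the four values mentioned above. The `parse_module_flag` helper is hypothetical and not part of NSClient++.

```python
# The two notations seen in nsclient.ini, mapped to one boolean.
_TRUE_VALUES = {"1", "enabled"}
_FALSE_VALUES = {"0", "disabled"}


def parse_module_flag(value):
    """Turn '1'/'0' or 'enabled'/'disabled' into True/False."""
    normalised = value.strip().lower()
    if normalised in _TRUE_VALUES:
        return True
    if normalised in _FALSE_VALUES:
        return False
    raise ValueError(f"unexpected module flag: {value!r}")


if __name__ == "__main__":
    for raw in ("1", "enabled", "0", "disabled"):
        print(raw, "->", parse_module_flag(raw))
```

Raising an error for anything else is a deliberate choice here: silently treating an unknown value as disabled could hide a typo in the configuration.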


The settings menu seems to need some work, as I saw a lot of ‘TODO’ and ‘Unknown’ strings for several items.

Also, I’m not quite sure what the ‘Changed’, ‘Basic’, ‘Advanced’ tabs are supposed to do.



The queries menu gives an overview of all possible queries. 


When you click on a query, you are linked to the module which enables you to use this query and you are able to see a ‘Help’ file with the usable arguments for the selected query.


And it seems Michael also enables us to test a query:


Which is a very nice feature. It would be nice to see a list of more complex working examples.


The Log menu gives a nice filterable overview of the NSClient logfile:



Similar to the Log menu, the Console menu also gives a filterable overview of all console messages.


(Almost) Final words

The features I just listed are just a few of the many new exciting features in the new NSClient++. The webserver has a nice GUI and is a nice preview of things to come. Thanks a lot Michael for sharing your work with the world.

I will continue writing on this review when I find the time.