Monitoring @ FOSDEM 2018 => Observability, Tracing and the RED method

FOSDEM 2018

Yesterday 03/02/18, I went to FOSDEM in Brussels. FOSDEM is a two-day non-commercial event organised by volunteers to promote the widespread use of free and open source software. The goal is to provide free and open source software developers and communities a place to meet in order to:

  • get in touch with other developers and projects
  • be informed about the latest developments in the free and open source software world
  • attend interesting talks and presentations on various topics by project leaders and committers
  • promote the development and benefits of free software and open source solutions

It seems like FOSDEM is getting more popular each year, resulting in rooms and auditoriums filling up fast.

[Image: FOSDEM 2018]

Take a look at the Saturday schedule to get an idea of the wide range of topics that were discussed. I’ll try to give a quick overview of the most interesting monitoring-related topics I learned about.

Observability

Christine Yen gave an interesting talk about observability, a term that is trending heavily in the DevOps monitoring world. Wikipedia describes observability as a measure of how well internal states of a system can be inferred from knowledge of its external outputs.

But is there a significant difference between monitoring and observability? It seems like monitoring vendors are quickly updating their websites to promise observability, while a few months back the word was nowhere to be found.

It seems to me that observability belongs in the same list as other system attributes, such as:

  • functionality
  • performance
  • testability
  • debuggability
  • operability
  • maintainability
  • efficiency
  • teachability
  • usability

By making systems observable, we make it easier to reach decisions and take action, ensuring the OODA loop (Observe, Orient, Decide, Act) is actually a loop instead of having a break in it.

[Image: Observability and the OODA loop]

Observability isn’t a substitute for monitoring, nor does it obviate the need for monitoring; they are complementary. Monitoring is best suited to reporting the overall health of systems, and as such is best limited to key business and system metrics derived from time-series based instrumentation, known failure modes and blackbox tests.

Observability, on the other hand, aims to provide highly granular insight into the behavior of systems along with rich context, which makes it perfect for debugging. It can be used to better understand system performance and behavior, even during what can be perceived as “normal” operation of a system. Since it’s impossible to predict every single failure mode a system could potentially run into, or every possible way in which a system could misbehave, it becomes important that we build systems that can be debugged armed with evidence and not conjecture.

How can we make microservices observable?

  • Observable systems should emit events, metrics, logs and traces (a small sketch follows this list)
  • All components (including non-critical ones) should be instrumented
  • Instrumentation should not be opt-in, manual or hard to do
  • Find the right balance of metrics, logs and traces for a given service
  • Use advanced analytics, such as anomaly detection and forecasting, for diagnosing services
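
To make the first point concrete, here is a minimal, hand-rolled Go sketch of a service emitting a structured log event and a toy trace span. Everything here (the span type, startSpan, handleRequest) is my own hypothetical illustration, not code from the talk; a real system would use a proper tracer and metrics library.

```go
package main

import (
	"log/slog"
	"time"
)

// span is a toy trace span: an operation name plus a start timestamp.
// A real tracer would also carry trace and parent-span IDs and ship
// the result to a collector.
type span struct {
	op    string
	start time.Time
}

func startSpan(op string) *span { return &span{op: op, start: time.Now()} }

// finish emits the span as a structured event.
func (s *span) finish() {
	slog.Info("span finished", "op", s.op, "duration", time.Since(s.start))
}

func handleRequest() {
	sp := startSpan("handleRequest")
	defer sp.finish()

	time.Sleep(10 * time.Millisecond)                             // stand-in for real work
	slog.Info("request processed", "user", "demo", "status", 200) // a log/event
}

func main() { handleRequest() }
```

Note that the instrumentation lives in startSpan/finish rather than in the business logic, which is what “not opt-in, manual or hard to do” means in practice.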

Tracing

Another interesting topic, trace buffering with LTTng, was discussed by Julien Desfossez. Some LTTng features:

  • System-wide insights: LTTng makes it possible to understand the interactions between multiple components of a given system. By producing a unified log of events, it provides great insight into the system’s behavior.
  • High performance: Tracers achieve great performance through a combination of essential techniques such as per-CPU buffering, RCU data structures, a compact and efficient binary trace format, and more.
  • Flexibility by offering multiple modes:
    • Post-processing mode: Gather as much data as possible and take the time to investigate
    • Snapshot mode: When an error is detected, extract the last trace buffer from memory
    • Live mode: Continuously monitor a stream of events
    • Rotation mode: Periodically, or on a specific trigger, process a chunk of trace without stopping the tracing

Tracing tends to be quite expensive, so think long and hard about whether the added complexity is warranted. You might be falling into the trap of premature optimisation: is optimisation really that important when you could just scale horizontally?

Tracing to disk with all kernel events enabled can quickly generate huge traces:

  • 54k events/sec on an idle 4-core laptop, 2.2 MB/sec
  • 2.7M events/sec on a busy 8-core server, 95 MB/sec
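
To put the second number in perspective: at 95 MB/sec, a single day of continuous tracing works out to roughly 95 × 86,400 ≈ 8,200,000 MB, or about 8 TB of trace data.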

So make sure you watch your available storage closely.

Service Instrumentation Methods

Tom Wilkie gave a presentation about the RED Method, while also covering the USE Method and Google’s Four Golden Signals. He explained why consistency across such methods is important for reducing cognitive load.

The USE Method

For every resource, monitor (a sketch follows the list):

  • Utilisation: % time the resource was busy
  • Saturation: Amount of work the resource has to do (often queue length)
  • Errors: Count of error events
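
As a hedged illustration of how those three signals might be exposed, here is my own sketch using Prometheus’s Go client; this is not from the talk, and the worker pool, metric names and report helper are all hypothetical.

```go
package main

import "github.com/prometheus/client_golang/prometheus"

var (
	// Utilisation: % of time the resource is busy, approximated here as
	// the fraction of workers currently occupied.
	poolUtilisation = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "pool_utilisation_ratio",
		Help: "Fraction of workers currently busy (0-1).",
	})
	// Saturation: work the resource cannot service yet (queue length).
	poolSaturation = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "pool_queue_length",
		Help: "Jobs queued and waiting for a free worker.",
	})
	// Errors: count of error events on the resource.
	poolErrors = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "pool_errors_total",
		Help: "Jobs that ended in an error.",
	})
)

func init() {
	prometheus.MustRegister(poolUtilisation, poolSaturation, poolErrors)
}

// report is a hypothetical hook the pool calls after every state change.
func report(busy, size, queued int) {
	poolUtilisation.Set(float64(busy) / float64(size))
	poolSaturation.Set(float64(queued))
}

func main() {
	report(3, 4, 7)  // 75% utilisation, 7 jobs waiting
	poolErrors.Inc() // one failed job
}
```

The gauge/counter split mirrors the definitions above: utilisation and saturation are instantaneous states, while errors only ever accumulate.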

The RED Method

For every service, monitor its requests (a sketch follows the list):

  • Rate: Number of requests per second
  • Errors: The number of those requests that are failing
  • Duration: The amount of time those requests take
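
In practice, the three RED signals are often derived from a single request-duration histogram, since rate and error counts can be computed from it. The sketch below is my own Prometheus-flavoured reconstruction, so the handler and metric names are illustrative.

```go
package main

import (
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// One histogram covers all three RED signals: its count gives the request
// rate, the code label isolates errors, and the buckets give duration.
var reqDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "request_duration_seconds",
		Help:    "Time taken to serve HTTP requests.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"method", "code"},
)

func init() { prometheus.MustRegister(reqDuration) }

// statusRecorder captures the status code the handler writes.
type statusRecorder struct {
	http.ResponseWriter
	code int
}

func (s *statusRecorder) WriteHeader(code int) {
	s.code = code
	s.ResponseWriter.WriteHeader(code)
}

// red wraps a handler so every request is timed and labelled.
func red(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, code: http.StatusOK}
		start := time.Now()
		next.ServeHTTP(rec, r)
		reqDuration.WithLabelValues(r.Method, strconv.Itoa(rec.code)).
			Observe(time.Since(start).Seconds())
	})
}

func main() {
	http.Handle("/ping", red(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("pong"))
	})))
	http.Handle("/metrics", promhttp.Handler()) // Prometheus scrape endpoint
	http.ListenAndServe(":8080", nil)
}
```

With this in place, standard PromQL such as rate(request_duration_seconds_count[1m]) yields the request rate, and the same query filtered on code=~"5.." yields the error rate, so one metric carries all three signals.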

Google’s Four Golden Signals

  • Latency: The time it takes to service a request, with a focus on distinguishing between the latency of successful requests and the latency of failed requests.
  • Traffic: A measure of how much demand is being placed on the service. This is measured using a high-level service-specific metric, like HTTP requests per second in the case of an HTTP REST API.
  • Errors: The rate of requests that fail. The failures can be explicit (e.g., HTTP 500 errors) or implicit (e.g., an HTTP 200 OK response with a response body having too few items).
  • Saturation: How “full” the service is. This is a measure of the system utilization, emphasizing the resources that are most constrained (e.g., memory, I/O or CPU). Services degrade in performance as they approach high saturation.

Conclusion

FOSDEM 2018 seems to be a very popular event covering a wide spectrum of free and open source software projects. I learned some new things and met several interesting people. Hopefully I’ll be able to return next year. If open source software piques your interest, I can certainly recommend attending this event. You will not be disappointed.