Introducing Observability & Monitoring Into Your System

  • May 12, 2021

Your system, application, and infrastructure are going to have problems. The goal is to prevent as many issues as possible, but the reality is that problems are unavoidable. What matters most is detecting them before they impact your business or users. This is where monitoring and, more importantly, Observability come into play: once you have your infrastructure under control, Observability lets you understand what is happening inside the system.

Monitoring is something you and your team actively do. You monitor your system to detect problems, which might mean running tests to check its availability and performance. Observability, on the other hand, is a property of your system: the degree to which its outputs let you understand what is going on inside. If your system doesn’t externalize its internal state, no amount of monitoring will help you detect specific problems in time. Observability is about knowing not only what is happening in your system but why it is happening.


Understanding Your System

Observability has advantages beyond simply understanding what’s going on in your system, application, and infrastructure. Some problems can be detected early, before users notice them. Internal latency, too many locks in the database, or any other problem you see on the backend may become visible to users if you do not take action.

Detection and communication are key here. You want to detect problems before your users do, and you will also want to let them know if an issue will affect them. Letting the world know that you are actively monitoring and troubleshooting a given problem will help put your customers at ease. You will be able to debug and troubleshoot outages and service degradations, tackle bugs, and investigate unauthorized activity. Observability also helps you understand the uptime of your SaaS and the quality of service your users receive.


Visualizing Your Metrics

Metrics are a crucial component of Observability and of the business itself. They help us improve the application and infrastructure, and we are constantly looking for more places to add them in order to understand how our systems work. For example, latency and responsiveness can be measured with metrics. When Observability and metrics are applied across the board, they should show how well you are addressing the main challenges of managing your infrastructure: availability, productivity, costs, security, compliance, and scalability. Dashboards and correct visualizations are key. Once appropriately implemented and visualized, metrics have a direct impact on the performance and quality of your business.
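
To make the latency example concrete, here is a minimal sketch of how raw response times could be reduced to the handful of numbers a dashboard panel usually shows. The function name latency_summary and the field names are hypothetical, and the percentiles are only rough estimates when the sample set is small.

```python
import statistics


def latency_summary(samples_ms):
    """Summarize latency samples (in milliseconds) for a dashboard panel."""
    # statistics.quantiles with n=100 returns 99 cut points:
    # index 49 is the median, index 94 the 95th percentile, and so on.
    cuts = statistics.quantiles(samples_ms, n=100)
    return {
        "count": len(samples_ms),
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "max_ms": max(samples_ms),
    }


if __name__ == "__main__":
    # Illustrative data only: 100 requests taking 1 ms to 100 ms.
    print(latency_summary(list(range(1, 101))))
```

Percentiles are used here instead of an average because a few very slow requests can hide behind a healthy-looking mean.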


Implementing Observability

When you start implementing Observability, you will combine the metrics discussed above with monitoring, log centralization, and tracing.

Monitoring: Run active checks to verify the availability of critical components.

Metrics: Collect data points to count errors, load, and other variables from all the components required to run your application and the application itself. You will obtain your metrics from your application and required services, infrastructure, pipelines, costs, and security incidents.

Log Centralization: Record information about events in your system in a central location.

Tracing: Follow requests as they cross your system so you can identify where problems occur.
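
As a rough illustration of how metrics, logs, and tracing fit together, the sketch below wraps a request handler so that it counts requests and errors, writes structured log lines, and tags every line with a trace ID. All the names (metrics, log_lines, log_event, handle_request) are hypothetical, and the in-memory list and counter only stand in for a real log destination and metrics backend.

```python
import json
import time
import uuid
from collections import Counter

# Hypothetical in-process stores; a real setup would ship these values
# to a metrics backend and a centralized log system instead.
metrics = Counter()
log_lines = []


def log_event(trace_id, message, **fields):
    """Record one structured (JSON) log line tagged with the trace ID."""
    record = {"ts": time.time(), "trace_id": trace_id, "message": message, **fields}
    line = json.dumps(record)
    log_lines.append(line)
    return line


def handle_request(path, work, trace_id=None):
    """Run work() for a request while emitting metrics and logs."""
    # Reuse the caller's trace ID if one was passed in, so log lines from
    # several services can be joined into a single request trace.
    trace_id = trace_id or uuid.uuid4().hex
    started = time.perf_counter()
    metrics["requests_total"] += 1
    try:
        result = work()
        status = 200
    except Exception as exc:
        metrics["errors_total"] += 1
        status = 500
        result = None
        log_event(trace_id, "request failed", path=path, error=repr(exc))
    elapsed_ms = (time.perf_counter() - started) * 1000
    log_event(trace_id, "request finished", path=path, status=status,
              elapsed_ms=round(elapsed_ms, 2))
    return status, result, trace_id
```

Calling handle_request("/orders", some_function) would add one or two JSON lines to log_lines and update the counters. The trace ID is what lets you go from "errors are rising" in the metrics to the exact requests that failed in the logs.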


Observability and Metrics checklist:

At Flugel, we’ve put together an Observability and metrics checklist to help you detect problems before your clients notice them. 

  • Place logs in a centralized place and use a visualization tool to review and query them.
  • Detect, count, and display in a dashboard error strings in logs.
  • Collect metrics at the operating system level, service level, and CSP level.
  • Group metrics to correlate technical metrics with business metrics. 
  • Create a dashboard providing metric data information to business stakeholders. This dashboard must display costs, security, availability, response time, and other metrics impacting the business.
  • Track HTTP 5xx metrics in all environments, and set alerts on them in production.
  • Monitor TLS certificates.
  • Monitor exposed endpoints with HTTP checks.
  • Configure alerts properly to avoid alert spam, and define clear methods to keep it under control. Events that can be detected from metrics but don’t require immediate action should not be treated as alerts; record them in an event log instead.
  • Provide distributed tracing.
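
A few of these items can be sketched in a handful of lines. The example below counts error strings in log lines, decides whether a batch of HTTP status codes deserves an alert or just an event-log entry, and computes how many days a TLS certificate has left. The marker list, the 5% threshold, and the function names are assumptions chosen for illustration, not recommended values.

```python
import ssl
from collections import Counter

# Hypothetical error markers and alert threshold; tune both for your system.
ERROR_STRINGS = ("ERROR", "Traceback", "timeout")
ALERT_5XX_RATE = 0.05


def count_error_strings(lines, markers=ERROR_STRINGS):
    """Count how many log lines contain each marker."""
    counts = Counter()
    for line in lines:
        for marker in markers:
            if marker in line:
                counts[marker] += 1
    return counts


def classify_5xx(status_codes, environment, threshold=ALERT_5XX_RATE):
    """Return 'alert', 'event', or 'ok' for a batch of HTTP status codes."""
    if not status_codes:
        return "ok"
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    rate = errors / len(status_codes)
    if rate == 0:
        return "ok"
    # Only production pages someone; everything else goes to the event log.
    if environment == "production" and rate >= threshold:
        return "alert"
    return "event"


def days_until_expiry(not_after, now_epoch):
    """Days left on a certificate, given its notAfter string.

    not_after uses the format found in certificate data,
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)
    return (expires - now_epoch) / 86400
```

Returning "event" instead of "alert" for non-production environments and low error rates is one simple way to keep alert spam under control while still recording what happened.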

By putting this checklist in place, you will begin to understand how well Observability and monitoring are working in your organization. Over time, you will gain insight into the health and performance of your products, processes, and people.


Written by: Gabriel Vasquez
General corrections and editing: Diego Woitasen