Logs, Metrics, & Traces: Making sense of data through observable microservices

Kayalvizhi Kandasamy

Microservices are no longer a buzzword, but a staple of IT and IT-driven organizations across the world. They enable organizations to be nimble and to cut costs substantially, thanks to their granularity and reusability. The approach builds applications by breaking a large application down into compact, self-sufficient services that are not tied to any particular language.

As interesting and advantageous as microservices may be, they also bring certain challenges. One of them is making sense of the data generated by a microservices implementation. But before delving into this challenge, let’s first take a look at a simple example of how a monolithic application is transformed into a microservices-based application:

From the above illustration, you can understand that:

  • Each service owns a single responsibility.
  • The large database is broken into smaller units and managed by the respective services.
  • The services communicate with the external world via REST and GraphQL APIs.
  • The services communicate internally with each other via an enterprise message bus.
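
The two communication styles above can be sketched in a few lines. This is a minimal, illustrative sketch with hypothetical service and event names; an in-memory queue stands in for the enterprise message bus (a real system would use Kafka, RabbitMQ, or similar).

```python
import json
import queue

message_bus = queue.Queue()  # in-memory stand-in for the enterprise message bus

def handle_create_order(request_body: str) -> dict:
    """External-facing REST-style handler for a hypothetical Orders service."""
    order = json.loads(request_body)
    # Internal communication: publish an event on the bus rather than
    # calling other services directly.
    message_bus.put({"event": "order.created", "order_id": order["id"]})
    return {"status": "accepted", "order_id": order["id"]}

response = handle_create_order('{"id": 42, "sku": "ABC-1"}')
```

The key design choice is that the handler never talks to another service synchronously; downstream services subscribe to the bus and react to the event on their own schedule.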

Runtime Complexities with Microservices

Microservices may not be easy, but they are definitely necessary. Distributed systems are inherently unpredictable and contain certain grey areas. One of these is the complexity of communication between services.

Trying to understand how the individual pieces of the overall system interact is a daunting task. A single transaction can flow through many independently deployed microservices or pods, so identifying where performance bottlenecks occur yields valuable information. Let’s take a look at a typical workflow to get a sense of the complexity involved:

The above illustration depicts a hypothetical situation:

  • The Orders service calls the Inventory service.
  • Consider a bug that has evaded the quality assurance checkpoints and made its way into the Inventory service’s production environment.
  • When a user tries to place an order, the Orders service checks with the Inventory service for the availability of goods. The Inventory service fails to respond due to the bug, and a connection timeout occurs. The Orders service then responds to the user with a “Something went wrong; please try again after some time” message.
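
The failure path above can be sketched as follows. This is a hedged illustration, not the article’s actual code: `fetch_stock` is a hypothetical stand-in for the real HTTP call to the Inventory service, which here always times out, as the production bug would cause it to.

```python
def fetch_stock(sku: str) -> int:
    # Stand-in for the real HTTP call to the Inventory service; the bug
    # in production manifests as a connection timeout.
    raise TimeoutError("connection to inventory timed out")

def place_order(sku: str) -> dict:
    try:
        return {"status": "ok", "stock": fetch_stock(sku)}
    except TimeoutError:
        # Degrade gracefully instead of surfacing a stack trace to the user.
        return {"status": "error",
                "message": "Something went wrong; please try again after some time."}

result = place_order("ABC-1")
```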

When the development team tries to fix the bug and understand its root cause, they first have to look into the logs. Logs are the de facto debugging tool in a production environment.

The above image contains vital information in the form of immutable, discrete, timestamped events – what happened and when, what was requested, what was sent, and so on.

Manually going through hundreds or even thousands of lines may be manageable, but remember that production logs record millions of events. That makes manual scanning a futile exercise.
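
One common answer to this is to emit logs as structured events rather than free-form lines, so that tools can filter them mechanically instead of a human scanning them by eye. A minimal sketch, assuming a hypothetical `orders` service and illustrative field names (not any standard schema):

```python
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("orders")

def log_event(level: int, message: str, **fields) -> str:
    """Emit one immutable, timestamped, machine-searchable event."""
    event = {"ts": datetime.now(timezone.utc).isoformat(),
             "msg": message, **fields}
    line = json.dumps(event)
    logger.log(level, line)
    return line

line = log_event(logging.ERROR, "inventory call failed",
                 service="orders", sku="ABC-1", timeout_ms=2000)
```

Because each event is a JSON object, a log aggregator such as the ELK stack can query for, say, all `inventory call failed` events in a time window, rather than grepping raw text.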

Observability: Going over information with a fine-toothed comb

Observability is a well-entrenched term in engineering and control theory. It refers to how well the internal state of a system can be deduced from its external outputs. Here are a couple of examples of components designed for observability.

Load Balancer

A load balancer is a reverse proxy that distributes application traffic across a number of application servers. It also routinely monitors the health of application server instances using health-check pings. If an instance is down, the load balancer stops sending requests to it; when the failed instance recovers, the load balancer resumes routing traffic to it.

The observability in this case is that the load balancer is unaware of the internals of the application; instead, it is aware of the state of the system, which is the health of the instances. The external outputs are the health check pings.
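
The health-check logic can be sketched in a few lines. This is a minimal illustration, not a real load balancer: the backend addresses are hypothetical, and `ping` is any callable that reports whether an instance answered its health check.

```python
def healthy_backends(backends, ping):
    """Keep only the instances whose health-check ping succeeds."""
    return [b for b in backends if ping(b)]

def pick_backend(backends, ping):
    """Route to the first healthy instance; fail if none are up."""
    alive = healthy_backends(backends, ping)
    if not alive:
        raise RuntimeError("no healthy backends")
    return alive[0]

# Simulate one failed instance, as in the scenario above.
status = {"10.0.0.1": False, "10.0.0.2": True}
chosen = pick_backend(["10.0.0.1", "10.0.0.2"], lambda b: status[b])
```

Note that the decision uses only the external output (did the ping succeed?), never the application’s internal state – which is exactly the observability property described above.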


Autoscaling

Autoscaling helps ensure that the right number of instances is available to handle the application traffic. It can launch or terminate instances based on the traffic.

As per the above image, the scaling policy is configured so that when utilization crosses 65%, new instances are launched as configured; at 33%, no action is taken.

The observability here is that autoscaling constantly keeps an eye on utilization through monitoring, and scales in and out as per the configured scaling policy.
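
The policy above reduces to a small decision function. A sketch, assuming the 65% scale-out threshold from the image; the 30% scale-in threshold is an assumption of this example (the article only shows that 33% results in no action):

```python
SCALE_OUT_ABOVE = 65  # percent, from the policy described above
SCALE_IN_BELOW = 30   # percent, assumed for illustration

def scaling_decision(utilization_pct: float) -> str:
    if utilization_pct > SCALE_OUT_ABOVE:
        return "scale-out"   # launch new instances
    if utilization_pct < SCALE_IN_BELOW:
        return "scale-in"    # terminate surplus instances
    return "no-action"       # stay in the comfortable middle band

decisions = [scaling_decision(u) for u in (80, 33, 10)]
```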

Can Observability be equated with APM?

Observability can mean different things to different people. For some, it is the old wine of monitoring in a new bottle. However, as with any IT trend, perceptions differ, and conclusions are often drawn without much analysis. One thing is certain, though: observability is definitely not APM (Application Performance Monitoring).

The three pillars of observability

The three essential components that drive observability are logs, metrics, and traces.

There are many powerful open-source tools, such as ELK, Prometheus, Zipkin, and so on, that can help you keep an eye on your system. However, merely configuring such tools does not guarantee that your application will be observable. These tools generate a vast number of events and logs; how you use them is what determines the resilience of your system.

In short, observability is not a panacea, but the ability to act on the data collected from these tools. Here are some of the ways that data can be leveraged:

  • Retry & Schedule for Later – When an incident occurs, you can orchestrate retries of different services at different times. Retrying immediately may not help, but adding a back-off time definitely will.
  • Failover & Fall Back – When a downstream application fails for any reason, a fallback service call reduces the failure rate and increases the resiliency of the system.
  • Notify Controllers – Code the services to report their failures.
  • Communicate – Communicate gracefully to the caller of the service about what is going on, when the request is likely to be fulfilled, and so on.
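
The first two tactics above combine naturally. A minimal sketch with illustrative names and delays: retry with exponential back-off, then fall back (for example, to a cached response) once the retries are exhausted.

```python
import time

def call_with_retry(call, fallback, attempts=3, base_delay=0.01):
    """Retry with exponential back-off, then fall back on the last failure."""
    delay = base_delay
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                return fallback()    # failover after exhausting retries
            time.sleep(delay)        # back off before the next attempt
            delay *= 2               # exponential back-off

def flaky_service(state={"calls": 0}):
    # Fails twice, then recovers -- a stand-in for a downstream service.
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("downstream unavailable")
    return "fresh response"

result = call_with_retry(flaky_service, lambda: "cached response")
```

Here the third attempt succeeds, so the fallback is never used; had the service stayed down, the caller would have received the cached response instead of an error.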

Resilience4j is a lightweight fault-tolerance library inspired by Netflix Hystrix. Its lightweight, modular structure enables you to pull in specific modules for specific capabilities, such as circuit breaking, rate limiting, retry, and bulkheading, to build observable microservices.
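
Resilience4j itself is a Java library; as a language-neutral illustration of the circuit-breaking pattern it provides (and not of Resilience4j’s actual API), here is a minimal circuit breaker. The threshold and state names are illustrative.

```python
class CircuitBreaker:
    """Minimal circuit breaker: trips open after consecutive failures."""

    def __init__(self, failure_threshold: int = 3):
        self.failure_threshold = failure_threshold
        self.failures = 0
        self.state = "CLOSED"

    def call(self, fn):
        if self.state == "OPEN":
            raise RuntimeError("circuit open; failing fast")
        try:
            result = fn()
        except ConnectionError:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.state = "OPEN"  # stop hammering a failing dependency
            raise
        self.failures = 0            # any success resets the failure count
        return result
```

Once the breaker is open, callers fail fast instead of waiting for timeouts, giving the failing dependency room to recover. (A production breaker, like Resilience4j’s, also has a half-open state that periodically probes the dependency.)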

To observe is to understand

The goal of observable microservices is not just to collect logs, traces, or metrics; it is to build an engineering culture based on facts and feedback. Observability means being data-driven, especially during debugging, and it simplifies monitoring to a great extent.

