Kafka monitoring: Message brokers and how to troubleshoot them

Message brokers like Kafka enable microservices to scale. But the same decoupling that makes this possible also makes them hard to troubleshoot. How can developers keep messages and errors from getting lost without anyone noticing? In this post we look at a few solutions: Kafka Owl, Redpanda, and Helios.

A short reminder of message brokers and why they sometimes break

Services in a distributed system, such as microservices, communicate with each other through mechanisms like REST, gRPC, or message brokers. While each method has its own advantages, message brokers are reliable and enable complex flows because they are based on asynchronous communication.

Asynchronous communication means that a message sent by a service (the producer) doesn’t have to be consumed immediately. Instead, the message can be stored in a message queue and consumed by another service (the consumer) at a later time, for example based on availability or need.

This provides reliability and redundancy: messages are held by the broker until they are consumed, so they are not lost when a consumer is temporarily unavailable. As a result, systems can scale.
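
The decoupling between producer and consumer described above can be sketched with Python's standard library alone. This is a minimal in-memory simulation, not Kafka itself: `queue.Queue` stands in for the broker, and the names `produce`, `consume_all`, and the `order_id` payloads are hypothetical.

```python
import queue

# In-memory stand-in for a broker topic (an assumption for illustration;
# a real Kafka topic is durable and partitioned).
topic = queue.Queue()

def produce(message: dict) -> None:
    """Enqueue a message and return immediately, without waiting for a consumer."""
    topic.put(message)

def consume_all() -> list:
    """Drain whatever is currently queued, whenever the consumer is ready."""
    received = []
    while True:
        try:
            received.append(topic.get_nowait())
        except queue.Empty:
            return received

# The producer sends three messages while no consumer is running.
for order_id in (1, 2, 3):
    produce({"order_id": order_id})

pending = topic.qsize()      # messages wait in the queue
consumed = consume_all()     # the consumer picks them up later, in order
```

The producer finishes its work before any consumer exists, which is the property that lets each side scale and fail independently.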

One of the most popular message brokers is Apache Kafka, which is open source. Kafka offers high performance and scalability and can store messages durably, even permanently if retention is configured that way. For these reasons and others, Kafka is widely used.

While there is probably no better option for microservices communication than message brokers, they do pose some challenges. Mainly, it is very difficult to identify and troubleshoot errors. In synchronous communication, the lack of an immediate response to a request is a clear indicator that something in the system has failed. In asynchronous communication, the producer receives no feedback about how its message was processed, so an error can occur downstream without anyone being alerted.
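
To make the difference concrete, here is a small simulation, again using an in-memory queue rather than a real broker. The function names and the "amount" validation rule are hypothetical. The synchronous call surfaces the error to the caller at once, while the asynchronous send succeeds and the failure only appears later, inside the consumer.

```python
import queue

def handle(message: dict) -> str:
    """Hypothetical business logic that rejects messages without an amount."""
    if "amount" not in message:
        raise ValueError("missing amount")
    return "ok"

# Synchronous style: the caller invokes the handler directly and sees the error.
def call_sync(message: dict) -> str:
    try:
        return handle(message)
    except ValueError as exc:
        return f"error: {exc}"

# Asynchronous style: the producer only enqueues, so it cannot see the failure.
topic = queue.Queue()

def send_async(message: dict) -> str:
    topic.put(message)
    return "accepted"   # says nothing about whether processing will succeed

def run_consumer() -> list:
    """Process queued messages; failures are known only to the consumer."""
    failures = []
    while not topic.empty():
        message = topic.get()
        try:
            handle(message)
        except ValueError as exc:
            failures.append((message, str(exc)))
    return failures

bad_message = {"order_id": 7}            # no "amount" field
sync_result = call_sync(bad_message)     # caller learns about the error
async_result = send_async(bad_message)   # producer sees only "accepted"
consumer_failures = run_consumer()       # the error surfaces here, later
```

Unless the consumer reports `consumer_failures` somewhere a human or an alerting system will look, the producer side has no indication that anything went wrong.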

Solutions for monitoring microservices in message brokers

Tools like Kafka Owl and Redpanda make it possible to monitor messages in message brokers. Kafka Owl lets you explore and fetch messages in Kafka clusters, while Redpanda lets you explore message topics.

The challenge with existing solutions: message error context for troubleshooting

However, these solutions do not provide the context of the message requests. As developers, we lack an understanding of why a certain message created an issue. We only see the details of the error, but not the big picture.

As a result, we’re not able to:

  • Easily reproduce, troubleshoot, and debug the issue
  • Prevent this issue from recurring

In other words, we might have more information than before, but we’re still spending a lot of time dealing with microservices issues, so the underlying problem remains unsolved.

The solution: trace-based monitoring and troubleshooting of message brokers

Traces can solve this context issue because they allow us to see all the operations that are triggered in our distributed system, no matter the communication type: REST, Kafka, gRPC, or others.

A trace groups all of these operations under a single request, making it easy to see each operation as part of a whole rather than as an isolated action. In addition, unlike logs, traces can be created automatically through instrumentation, without a developer having to decide where to insert them.
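
One way to picture how a trace ties operations together across a broker: the producer attaches a trace identifier to the message headers, and the consumer reads it and records its own work under the same identifier. The sketch below uses the W3C Trace Context `traceparent` header layout (version, trace ID, parent span ID, flags). It is a hand-rolled illustration with hypothetical helper names; in practice an instrumentation library such as OpenTelemetry would normally handle this automatically.

```python
import secrets

def new_traceparent() -> str:
    """Start a trace: 16-byte trace ID and 8-byte span ID, hex-encoded."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-01"   # version 00, sampled flag 01

def child_traceparent(parent: str) -> str:
    """Continue a trace: keep the trace ID, mint a new span ID."""
    version, trace_id, _parent_span_id, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"

def produce(payload: bytes) -> dict:
    """Hypothetical producer: attaches the trace context as a message header.

    A dict is used for headers to keep the sketch short; real Kafka clients
    typically represent headers as a list of (key, bytes) pairs.
    """
    return {"value": payload, "headers": {"traceparent": new_traceparent()}}

def consume(message: dict) -> str:
    """Hypothetical consumer: continues the producer's trace from the header."""
    return child_traceparent(message["headers"]["traceparent"])

message = produce(b"order-created")
producer_ctx = message["headers"]["traceparent"]
consumer_ctx = consume(message)

# Both sides share one trace ID, so their spans can be stitched together.
producer_trace_id = producer_ctx.split("-")[1]
consumer_trace_id = consumer_ctx.split("-")[1]
```

Because the trace ID travels with the message, the consumer's work can be linked back to the producer even if it runs minutes later on a different machine.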

Traces can give the full context of a message, including its flow and behavior between different components and services. This makes them the ideal solution for troubleshooting. When an error occurs, developers can see its complete context within the microservices architecture, making it easy to troubleshoot and fix.

Helios provides a free trace-based solution for troubleshooting message brokers like Kafka. With Helios, we can look at Kafka messages through their traces. This provides the context of the message, i.e., which services it is connected to, where and why the error occurred, and more.

Helios is based on the open-source OpenTelemetry project, and leverages traces for visualization and for making the data actionable. With Helios, developers can understand sync and async flows, event streams, and queues, search through traces, and identify bottlenecks and errors.
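
As a rough picture of what searching through traces for errors and bottlenecks means, the sketch below models spans as plain records and looks for the failing span and the slowest one. The `Span` fields and the sample data are invented for illustration and do not reflect the Helios or OpenTelemetry data model.

```python
from dataclasses import dataclass

@dataclass
class Span:
    trace_id: str
    service: str
    operation: str
    duration_ms: float
    error: bool = False

# Invented sample data: two traces flowing through a Kafka topic.
spans = [
    Span("t1", "checkout", "kafka.produce orders", 4.0),
    Span("t1", "billing", "kafka.consume orders", 120.0),
    Span("t1", "email", "send receipt", 15.0),
    Span("t2", "checkout", "kafka.produce orders", 5.0),
    Span("t2", "billing", "kafka.consume orders", 30.0, error=True),
]

def failing_traces(all_spans):
    """Map each trace ID that contains an error to the services where it failed."""
    result = {}
    for span in all_spans:
        if span.error:
            result.setdefault(span.trace_id, []).append(span.service)
    return result

def bottleneck(all_spans, trace_id):
    """Return the slowest span within one trace."""
    in_trace = [s for s in all_spans if s.trace_id == trace_id]
    return max(in_trace, key=lambda s: s.duration_ms)

errors = failing_traces(spans)      # which traces failed, and where
slowest = bottleneck(spans, "t1")   # where trace t1 spent most of its time
```

With the failing trace ID in hand, every other span in that trace (including the producer side) is available as context for reproducing the issue.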

Trace-based Kafka monitoring

One of our customers, a marketing platform that produces tens of thousands of message requests every day in a Kafka-based architecture, was not able to identify the root cause of issues. Errors occurred in approximately one in a few thousand messages, which made combing through logs ineffective. But by visualizing the issue in a trace-based view, they were able to reproduce issues and prevent them from recurring.

Try Helios for yourself and monitor your own message brokers.

About Helios

Helios is a dev-first observability platform that helps Dev and Ops teams shorten the time to find and fix issues in distributed applications. Built on OpenTelemetry, Helios provides traces and correlates them with logs and metrics, enabling end-to-end app visibility and faster troubleshooting.
