Message brokers and how to troubleshoot them: monitoring Kafka

Message broker systems like Kafka enable microservices to scale. But this same quality makes them hard to troubleshoot. How can developers avoid messages and errors getting stuck in oblivion? In this post we provide a few solutions: Kafka Owl, Redpanda, and Helios.

A short reminder of message brokers and why they sometimes break

Distributed systems like microservices communicate with each other through frameworks like REST, gRPC, or message brokers. While each method has its own advantages, message brokers are reliable and enable complex flows since they are based on asynchronous communication.

Asynchronous communication means that a message sent by a service (the producer) doesn’t have to be consumed immediately. Instead, the message can be stored in a message queue and consumed by another service (the consumer) at a later time, for example when the consumer becomes available or needs the data.

This decoupling provides reliability and redundancy: messages are not lost when a consumer is temporarily unavailable. As a result, systems can scale.
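The decoupling between producer and consumer can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: an in-memory queue.Queue stands in for a broker topic, and the names produce, consume, and topic are hypothetical. A real broker such as Kafka persists messages and serves many consumers.

```python
import queue
import threading

# In-memory stand-in for a broker topic (assumption for illustration only).
topic = queue.Queue()

def produce(messages):
    """Producer: enqueue messages and return without waiting for a consumer."""
    for msg in messages:
        topic.put(msg)

def consume(expected_count, received):
    """Consumer: pull messages whenever it happens to be running."""
    for _ in range(expected_count):
        received.append(topic.get())
        topic.task_done()

orders = ["order-1", "order-2", "order-3"]
produce(orders)                          # returns at once; nobody is consuming yet
queued_before_consumer = topic.qsize()   # messages wait in the queue

received = []
consumer = threading.Thread(target=consume, args=(len(orders), received))
consumer.start()                         # the consumer comes online later
consumer.join()
```

The producer finishes before any consumer exists, and the messages are still delivered in order once the consumer starts.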

One of the most popular message brokers is the open-source Apache Kafka. Kafka offers high performance and scalability and can store messages durably. For these reasons and others, it is widely used.

While message brokers are arguably the best fit for communication between microservices, they do pose some challenges. Chiefly, errors are very difficult to identify and troubleshoot. In synchronous communication, the lack of an immediate response to a message is a clear indicator that something in the system is wrong. In asynchronous communication, the producer gets no feedback from the consumer, so an error can occur without anyone being alerted.
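The difference in error visibility can be shown with a small sketch. It assumes a hypothetical handler, handle, that rejects payloads without a user_id, and again uses an in-memory queue in place of a real broker. The function names are illustrative and not part of any library.

```python
import queue

def handle(payload):
    """Hypothetical business logic that rejects malformed payloads."""
    if "user_id" not in payload:
        raise ValueError("missing user_id")
    return "ok"

def call_sync(payload):
    """Synchronous call: the caller sees the failure immediately."""
    return handle(payload)

topic = queue.Queue()       # in-memory stand-in for a broker topic
consumer_errors = []        # errors that only the consumer knows about

def send_async(payload):
    """Asynchronous send: the producer only learns the message was queued."""
    topic.put(payload)
    return "queued"

def run_consumer():
    """Consumer loop: failures here never travel back to the producer."""
    while not topic.empty():
        payload = topic.get()
        try:
            handle(payload)
        except ValueError as exc:
            consumer_errors.append(str(exc))

bad = {"email": "a@example.com"}

try:
    call_sync(bad)
    sync_failed = False
except ValueError:
    sync_failed = True      # the synchronous caller is told right away

producer_view = send_async(bad)   # looks like a success to the producer
run_consumer()                    # the error surfaces later, somewhere else
```

From the producer's point of view the asynchronous send succeeded. The failure exists only on the consumer side, which is why it can go unnoticed without dedicated monitoring.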

Solutions for monitoring microservices in message brokers

Solutions like Kafka Owl and Redpanda enable monitoring messages in message brokers. Kafka Owl enables exploring and fetching messages in Kafka clusters while Redpanda enables exploring message topics.


The challenge with existing solutions: message error context for troubleshooting

However, these solutions do not provide the context of the message requests. As developers, we lack an understanding of why a certain message created an issue. We only see the details of the error, but not the big picture.

As a result, we’re not able to:

  • Easily reproduce, troubleshoot, and debug the issue
  • Prevent this issue from recurring

In other words, we might have more information than before, but we still spend a lot of time dealing with microservices issues, so the underlying problem remains unsolved.

The solution: trace-based monitoring and troubleshooting of message brokers

Traces can solve this context issue because they allow us to see all the operations that are triggered in our distributed system, no matter the communication type: REST, Kafka, gRPC, or others.

A trace groups all the operations triggered by a single request, making it easy to see each operation as part of a whole rather than as an isolated action. In addition, unlike logs, traces can be created automatically, without a developer having to decide where to insert them.

Traces can give the full context of a message, including its flow and behavior between different components and services. This makes them the ideal solution for troubleshooting. When an error occurs, developers can see its complete context within the microservices architecture, making it easy to troubleshoot and fix.
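To make the idea concrete, here is a minimal sketch of how trace context can travel with a message. It uses only the Python standard library and is not the OpenTelemetry API; in practice, instrumentation libraries do this propagation automatically. The helper names (start_span, inject, extract, trace_of) and the topic name are hypothetical, and the header layout follows the W3C Trace Context traceparent format (version, trace id, parent span id, flags).

```python
import secrets

SPANS = []  # stand-in for a tracing backend (assumption for illustration)

def start_span(name, parent=None, error=None):
    """Record a span; reuse the parent's trace id so spans stay linked."""
    span = {
        "name": name,
        "trace_id": parent["trace_id"] if parent else secrets.token_hex(16),
        "span_id": secrets.token_hex(8),
        "parent_id": parent["span_id"] if parent else None,
        "error": error,
    }
    SPANS.append(span)
    return span

def inject(span):
    """Encode the span as a traceparent-style message header."""
    return {"traceparent": f"00-{span['trace_id']}-{span['span_id']}-01"}

def extract(headers):
    """Decode the traceparent header back into a parent context."""
    _version, trace_id, span_id, _flags = headers["traceparent"].split("-")
    return {"trace_id": trace_id, "span_id": span_id}

def trace_of(trace_id):
    """Return the names of all spans in a trace, in recording order."""
    return [s["name"] for s in SPANS if s["trace_id"] == trace_id]

# Producer side: an HTTP request triggers a message.
request_span = start_span("POST /signup")
publish_span = start_span("publish signup-events", parent=request_span)
message = {"headers": inject(publish_span), "value": {"email": "a@example.com"}}

# Consumer side: the context arrives with the message, not the call stack.
parent_ctx = extract(message["headers"])
start_span("consume signup-events", parent=parent_ctx, error="missing user_id")

# Starting from the failing span, recover the whole flow that led to it.
failed = next(s for s in SPANS if s["error"])
flow = trace_of(failed["trace_id"])
```

Because the consumer's span carries the same trace id as the original request, the error can be viewed together with the request and the publish step that caused it, rather than as an isolated log line.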

Helios provides a free trace-based solution for troubleshooting message brokers like Kafka. With Helios, we can look at Kafka messages through their traces. This provides the context of the message, i.e., which services it is connected to, where and why the error occurred, and more.

Helios is based on the open-source OpenTelemetry project and leverages traces for visualization and for making the data actionable. With Helios, developers can understand sync and async flows, event streams, and queues, search through traces, and identify bottlenecks and errors.


Trace-based Kafka monitoring

One of our customers, a marketing platform that produces tens of thousands of message requests every day in a Kafka-based architecture, was not able to identify the root cause of issues. Errors occurred in approximately one in a few thousand messages, which made combing through logs ineffective. By visualizing the issue in a trace-based view, they were able to reproduce issues and prevent them from recurring.

Try Helios for yourself and monitor your own message brokers.

About Helios

Helios is a developer platform that helps you increase dev velocity when building cloud-native applications. With Helios, dev teams can easily and quickly perform tasks such as getting a full view of their API inventory, reproducing failures, and automatically generating tests, from local to production environments. Helios accelerates R&D work, streamlining activities from troubleshooting and testing to design and collaboration.
