Troubleshooting message brokers


Message brokers like Kafka support high-performance communication and enable microservices to scale. However, because they communicate asynchronously, the sender gets no feedback when a message fails downstream, which makes it hard to tell when troubleshooting is needed. Tools for monitoring messages in brokers do exist: Kafka Owl, for example, lets you explore and fetch messages in Kafka clusters, while Redpanda lets you explore message topics. Yet these tools are also limited, because they do not provide the full context of message requests. You see the details of the error but not the whole picture, so it is not immediately clear why a particular message caused an issue. As a result, developers can’t easily reproduce and debug an issue, let alone prevent it from recurring.
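
To illustrate the missing feedback, here is a minimal sketch in Python. It uses an in-memory `queue.Queue` as a stand-in for a broker topic (it is not Kafka), and the names `produce` and `consume` are hypothetical. The producer's call succeeds for every message, even though the consumer later fails on one of them.

```python
import json
import queue

# In-memory stand-in for a broker topic (an assumption for illustration; not Kafka).
topic = queue.Queue()


def produce(payload: str) -> bool:
    """Enqueue a message and return as soon as the 'broker' accepts it."""
    topic.put(payload)
    return True  # only means "accepted by the broker", not "processed"


def consume() -> list:
    """Drain the topic and collect processing errors on the consumer side."""
    errors = []
    while not topic.empty():
        payload = topic.get()
        try:
            json.loads(payload)  # hypothetical processing step
        except json.JSONDecodeError as exc:
            errors.append(f"{payload!r}: {exc.msg}")
    return errors


# The producer sees success for both messages...
results = [produce('{"user": 1}'), produce("not json")]
# ...while the failure only surfaces later, on the consumer side.
consumer_errors = consume()
```

The producer never learns about the second message's failure unless something ties the consumer-side error back to the original send.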

How Helios can help your team

Helios, a developer platform based on OpenTelemetry, leverages distributed tracing for full visualization of all operations triggered in a distributed system, regardless of communication type: REST, Kafka, gRPC, or others. Traces can give the full context of a message, including its flow and behavior between different components and services. When an error occurs, developers can see its complete context within the microservices architecture, making it easy to troubleshoot and fix errors. Helios provides a free trace-based solution for troubleshooting message brokers like Kafka.
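
As a rough sketch of how a trace can follow a message across services, the Python example below carries a W3C Trace Context `traceparent` value in a message header, so the consumer's span joins the same trace as the producer's. This is not the Helios API or an OpenTelemetry SDK call; the function names and the dictionary-based message are hypothetical, and in practice instrumentation libraries handle this propagation for you.

```python
import secrets


def new_traceparent() -> str:
    """Start a new trace: version, 16-byte trace id, 8-byte span id, flags."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"  # 01 = sampled


def child_traceparent(parent: str) -> str:
    """Create a child context: same trace id, fresh span id."""
    version, trace_id, _parent_span_id, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def produce_with_trace(value: bytes, traceparent: str) -> dict:
    """Build a message whose headers carry the trace context.

    The dict is a hypothetical stand-in for a broker record;
    headers are modeled as (key, bytes) pairs.
    """
    return {"value": value, "headers": [("traceparent", traceparent.encode())]}


def consume_with_trace(message: dict) -> str:
    """Read the trace context from the headers and continue the trace."""
    headers = dict(message["headers"])
    parent = headers["traceparent"].decode()
    return child_traceparent(parent)


producer_ctx = new_traceparent()
message = produce_with_trace(b"order-created", producer_ctx)
consumer_ctx = consume_with_trace(message)
```

Because both contexts share one trace id, an error recorded by the consumer can be linked back to the producer that sent the message.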

With Helios you can:

  • Get visibility into Kafka messages in the full context of their traces, helping you understand the flow of messages (for example, which services they pass through, and where and why an error occurred)
  • Understand sync and async flows, event streams, and queues, and search through traces to identify bottlenecks and errors
  • Create traces automatically, without having to decide where to insert them (unlike logs)
  • Leverage actionable insights when you need them: as early as in your local and integration environments, all the way to production

See the live trace visualization in the Helios Sandbox.

Example scenario

One of our customers, a marketing platform that produces tens of thousands of message requests every day in a Kafka-based architecture, was not able to identify the root cause of issues. Errors occurred in roughly one out of every few thousand messages, which made combing through logs ineffective. Using Helios, the customer's developers can now visualize each issue in a trace-based view and reproduce it, which helps prevent it from happening again.
