Distributed tracing is a method for tracking all the operations within a distributed system that have been triggered by a specific request. These include which components were touched, how the data flowed between the components, the dependencies that exist, and any changes that occurred to the systems and services. The information provided by distributed tracing enables end-to-end visibility into the microservices architecture and insights for troubleshooting errors.
How is distributed tracing used in microservices?
The core concept that enables Distributed Tracing is Context Propagation.
A Context is an object that contains the information for the sending and receiving service to correlate one span with another and associate it with the trace overall.
Propagation is the mechanism that moves Context between services and processes. By doing so, it assembles a Distributed Trace.
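As a rough illustration of Context and Propagation, the sketch below carries a trace ID and span ID between two services in a `traceparent` HTTP header, following the layout of the W3C Trace Context format. The names `SpanContext`, `new_root_context`, `child_of`, `inject` and `extract` are hypothetical helpers written for this example, not the API of any particular tracing library.

```python
import re
import secrets
from dataclasses import dataclass
from typing import Dict, Optional

# W3C Trace Context "traceparent" layout: version-traceid-parentid-flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


@dataclass(frozen=True)
class SpanContext:
    trace_id: str         # 32 hex chars, shared by every span in the trace
    span_id: str          # 16 hex chars, unique to one span
    sampled: bool = True


def new_root_context() -> SpanContext:
    """Start a new trace in the service that first receives the request."""
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8))


def child_of(parent: SpanContext) -> SpanContext:
    """Create a context for a new span that belongs to the parent's trace."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled)


def inject(ctx: SpanContext, headers: Dict[str, str]) -> None:
    """Sending side: write the context into the outgoing request headers."""
    flags = "01" if ctx.sampled else "00"
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def extract(headers: Dict[str, str]) -> Optional[SpanContext]:
    """Receiving side: read the context back, or None if absent or malformed."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    return SpanContext(trace_id, span_id, sampled=bool(int(flags, 16) & 1))


# Service A starts a trace and calls service B.
outgoing_headers: Dict[str, str] = {}
service_a_ctx = new_root_context()
inject(service_a_ctx, outgoing_headers)

# Service B reads the header and continues the same trace with its own span.
incoming_ctx = extract(outgoing_headers)
service_b_ctx = child_of(incoming_ctx)
```

Because both services end up with the same `trace_id`, a tracing backend can later correlate their spans into one trace.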
This method tracks requests across services, while monitoring the flow of the request across all the devices, databases, serverless functions and third-party APIs. Each such tracked action is called a ‘span’. Related solutions aggregate these spans and build a directed graph from them. This graph is called a trace. Some developer tools, like Jaeger and Helios, enable visualization of the trace and show developers how the data flows through the app, including complex sync and async flows (HTTP requests, gRPC calls, serverless invocations, messaging queues, event streams and more). Distributed tracing with OpenTelemetry will be discussed below.
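To make the relationship between spans and a trace concrete, here is a minimal sketch that assembles spans into a tree using each span's parent ID. The `Span` fields, the helper names and the sample span names are assumptions made up for this example; real tracing backends store considerably more data per span.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    name: str


def build_trace(spans: List[Span]) -> Dict[Optional[str], List[Span]]:
    """Group spans by parent ID; the key None holds the root span(s)."""
    children: Dict[Optional[str], List[Span]] = defaultdict(list)
    for span in spans:
        children[span.parent_id].append(span)
    return children


def render(children: Dict[Optional[str], List[Span]],
           parent_id: Optional[str] = None, depth: int = 0) -> List[str]:
    """Walk the tree depth-first and indent each span under its parent."""
    lines: List[str] = []
    for span in children.get(parent_id, []):
        lines.append("  " * depth + span.name)
        lines.extend(render(children, span.span_id, depth + 1))
    return lines


spans = [
    Span("a", None, "GET /checkout"),
    Span("b", "a", "orders-service: create order"),
    Span("c", "b", "postgres: INSERT"),
    Span("d", "a", "kafka: publish order.created"),
]
print("\n".join(render(build_trace(spans))))
```

The printed outline shows the HTTP request at the top, with the database write nested under the order service call and the message publish as a sibling of that call.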
Distributed tracing vs. logging
Distributed tracing and logging are two different approaches that can be used by developers to understand the behavior of a system.
Distributed tracing provides context and information about how a request is processed across multiple microservices, giving a complete picture of the request flow through all of them. Visualizing this information can help pinpoint where a problem occurred in the system. These insights can then be used for troubleshooting and debugging, and for improving development velocity.
Logging, on the other hand, is the process of recording system events, typically for debugging and auditing purposes. Logs provide a historical record of what has happened in the system, but do not give a complete picture of a request flow.
Logs and traces complement each other. Logs can help identify that an issue occurred, while traces give more context about where and why it occurred.
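One common way to make logs and traces work together is to stamp every log line with the ID of the active trace, so a developer can jump from an error log to the matching trace. The sketch below does this with Python's standard `logging` module. The `current_trace_id` variable is a hypothetical stand-in for whatever a tracing SDK would expose, and the logger name and trace ID are made up for the example.

```python
import contextvars
import io
import logging

# Hypothetical holder for the active trace ID; a tracing SDK would normally manage this.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every record that passes through the handler."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


def make_logger(stream) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger = logging.getLogger("orders")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = make_logger(buffer)
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
log.error("payment declined")
print(buffer.getvalue(), end="")
```

The log line tells the developer that something failed; the trace ID in it points to the trace that shows where and why.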
Distributed tracing: languages and components
Distributed tracing use cases
Visualized tracing data can help developers gain a broad understanding of their microservices architecture, as well as granular insights into specific requests and dependencies. This capability makes tracing a good solution for a number of developer use cases:
Troubleshooting and debugging
It is time-consuming and challenging for developers to identify issues, reproduce scenarios and fix bugs in microservices. This is mainly due to the lack of visibility into the architecture and to information that is not available through logs, such as HTTP request bodies, Kafka messages and Lambda events.
To help troubleshoot and debug issues, some distributed tracing systems gather payloads and error data for identifying bottlenecks, identifying broken flows, and reproducing them.
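As a simplified sketch of how collected span data can surface bottlenecks and broken flows, the example below scans one trace for its slowest operation and for spans that recorded an error. The `RecordedSpan` structure and its sample values are invented for illustration, and `duration_ms` is assumed to be each span's own time, excluding the time spent in child spans.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RecordedSpan:
    service: str
    operation: str
    duration_ms: float             # assumed to exclude time spent in child spans
    payload: Optional[str] = None  # e.g. a captured request body
    error: Optional[str] = None    # error message, if the operation failed


def find_bottleneck(spans: List[RecordedSpan]) -> RecordedSpan:
    """Return the span that took the longest."""
    return max(spans, key=lambda span: span.duration_ms)


def find_broken(spans: List[RecordedSpan]) -> List[RecordedSpan]:
    """Return every span that recorded an error, with its payload for reproduction."""
    return [span for span in spans if span.error is not None]


trace = [
    RecordedSpan("api-gateway", "POST /orders", 12.0, payload='{"sku": "A1", "qty": 2}'),
    RecordedSpan("inventory", "reserve_stock", 480.0),
    RecordedSpan("payments", "charge_card", 95.0, payload='{"amount": 40}', error="card declined"),
]

slow = find_bottleneck(trace)
print(f"bottleneck: {slow.service}/{slow.operation} ({slow.duration_ms} ms)")
for span in find_broken(trace):
    print(f"broken: {span.service}/{span.operation}: {span.error}, payload={span.payload}")
```

Keeping the payload next to the error is what makes the failing request reproducible rather than merely visible.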
Lack of visibility into microservices means developers lack confidence to make changes in microservices, since they don’t know what might break. This is due to how microservices are designed: they require properly configured APIs, taking into account response handling, error handling, requests, security, and a number of other factors. Otherwise, the services won’t be able to communicate.
A tracing system provides visibility into the architecture and data insights across all environments – from local to testing to staging to production. By seeing data flows, payloads, dependencies, and errors, developers can then acquire the information they need to develop and deploy production-ready code, while also ensuring services interact with each other as part of a software development lifecycle.
Microservices provide teams with the flexibility to choose which technology stacks and frameworks to use and implement. However, this poses operational challenges in terms of communication, monitoring, scalability, and consistency among services. With distributed tracing, requests, queries, and payloads can be shared and reused, which helps developers operate and collaborate more effectively.
Testing microservices takes a very long time and is often inaccurate. The loosely coupled nature of microservices, along with their optional boundaries and the connection points that create dependencies, makes testing them very complex. In most cases, developers can only reliably test their own services, since broader tests require relying on potentially outdated testing or staging environments, or on mocking, which is also complex. Even when testing does take place, results can be flaky. This means developers cannot be confident that testing will ensure code quality and application functionality and performance.
Using trace-based testing, comprehensive tests can be automatically generated. Trace-based tests can even be generated directly from production, ensuring consistency and reliability.
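The idea behind a trace-based test can be sketched as follows: instead of asserting only on the response of the service under test, the test asserts on the spans the whole flow produced. The helper `assert_trace_contains` and the recorded spans are hypothetical and exist only for this example; they do not represent how any specific tool generates or runs such tests.

```python
from typing import Dict, List

Span = Dict[str, object]


def assert_trace_contains(spans: List[Span], expected: List[Span]) -> None:
    """Raise AssertionError unless every expected attribute set matches some span."""
    for want in expected:
        matched = any(
            all(span.get(key) == value for key, value in want.items())
            for span in spans
        )
        if not matched:
            raise AssertionError(f"no recorded span matches {want}")


# Spans recorded while one "place order" request ran through the system.
recorded = [
    {"service": "api-gateway", "operation": "POST /orders", "status": 201},
    {"service": "orders", "operation": "kafka publish", "topic": "order.created"},
    {"service": "email", "operation": "send_confirmation", "status": "ok"},
]

# The test passes only if the request reached every downstream step.
assert_trace_contains(recorded, [
    {"service": "api-gateway", "status": 201},
    {"service": "orders", "topic": "order.created"},
    {"service": "email", "operation": "send_confirmation"},
])
```

A test like this fails when a downstream step silently stops happening, which a check on the HTTP response alone would miss.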
Who can use distributed tracing
This method can be used by developers and DevOps to improve their understanding of their architectures and environments. It helps them improve their velocity and efficiency. However, tracing holds especially high promise for developers, who gain visibility that lets them track and monitor activity across the entire development process before production. As a result, they are able to develop, test and troubleshoot with confidence.
A tracing solution for developers should enable them to:
Distributed tracing with OpenTelemetry (OTel)
OpenTelemetry (OTel) is an open-source collection of tools, APIs, and SDKs for creating and gathering telemetry data, including traces from microservices. Then, through solutions like open-source Jaeger or Zipkin, or Helios, developers can visualize the traces, see their microservices architecture and troubleshoot errors.
OpenTelemetry enables easy integration with existing tools, is vendor-agnostic and supports multiple technologies. The project also boasts a vibrant open source community that contributes to it constantly. To date, OpenTelemetry is the industry standard for collecting distributed tracing data.
Visualization leverages OpenTelemetry tracing to provide granular visibility and immediate insights. Developers can see and understand how their services interact with each other and where any errors and performance issues actually occurred, which makes troubleshooting considerably easier.
Understand your architecture and identify workflows and dependencies.
Use the provided information to quickly identify the problem and proceed to fix it.
Easily share traces, tests and triggers with your team.
Deep visibility into your services
Distributed tracing aggregates the operations that occur in microservices-based apps with a certain context. But without purpose-built visualization methods, views are cumbersome and lack the data needed to understand and troubleshoot complex requests.
Helios provides deep visualization that enables seeing into complex sync and async workflows, understanding the dependencies between different components and detecting changes across versions. This visibility is key to troubleshooting your applications.
How does tracing with Helios work?
Distributed tracing with Jaeger
Jaeger is a popular open source distributed tracing solution that was developed and released by Uber. With Jaeger, developers can monitor and troubleshoot microservices. They can use it for distributed context propagation, transaction monitoring, root cause analysis, service dependency analysis and identifying bottlenecks for optimization. Jaeger was inspired by Dapper and OpenZipkin and was built with OpenTracing support. It includes built-in Cassandra, Elasticsearch and in-memory storage backends, provides adaptive sampling, and more.
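Sampling decides which traces are kept so that tracing overhead stays manageable. The sketch below shows a simple fixed-rate sampler that derives its decision from the trace ID, so every service makes the same choice for the same trace. It is a generic illustration and not Jaeger's implementation: adaptive sampling adjusts rates over time, while the rate here is constant, and the function name `should_sample` is an assumption made for this example.

```python
def should_sample(trace_id: str, rate: float) -> bool:
    """Keep a trace when the low 64 bits of its ID fall in the lowest `rate` fraction."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    bound = int(rate * (1 << 64))            # upper limit of the accepted ID range
    return int(trace_id[-16:], 16) < bound   # last 16 hex chars = low 64 bits


# The decision depends only on the trace ID, so it is repeatable across services.
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
print(should_sample(trace_id, 1.0))   # a rate of 1.0 keeps every trace
print(should_sample(trace_id, 0.0))   # a rate of 0.0 drops every trace
```

Deriving the decision from the trace ID avoids the situation where one service records its part of a trace and another service drops its part.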
Augment Jaeger with Helios
Jaeger enables a basic level of visualization of distributed tracing. By enhancing Jaeger with Helios, you can get:
Visualized tracing data can help developers troubleshoot and debug their microservices, thanks to the granular visibility and insights that distributed tracing solutions provide. Testing is another valuable use case.
A tracing solution for developers should enable them to access and view data, get contextual information, filter data and collaborate with their team and with DevOps.
The most popular open source solutions that can leverage tracing are Jaeger and Zipkin. They rely on the telemetry data gathered by OpenTelemetry.
More recent solutions can provide additional capabilities, such as more tracing data from payloads and richer visualization. This enables troubleshooting before production.
Leveraging a range of instrumentation layers, distributed tracing can be adopted in a way that will help engineering teams boost productivity without compromising data privacy.