Distributed Tracing: All you need to know to get started

Distributed tracing in microservices

Distributed tracing is a method for tracking all the operations within a distributed system that have been triggered by a specific request. These include which components were touched, how the data flowed between the components, the dependencies that exist, and any changes that occurred to the systems and services. The information provided by distributed tracing enables end-to-end visibility into the microservices architecture and insights for troubleshooting errors.

How does tracing work?

How is distributed tracing used in microservices? The core concept that enables Distributed Tracing is Context Propagation.

Context is an object that contains the information for the sending and receiving service to correlate one span with another and associate it with the trace overall.

Propagation is the mechanism that moves Context between services and processes. By doing so, it assembles a Distributed Trace.
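To make Context and Propagation concrete, here is a minimal sketch in plain Python. It assumes the context travels in a `traceparent` HTTP header laid out as in the W3C Trace Context format (version, trace ID, parent span ID, flags). The helper names (`new_context`, `inject`, `extract`, `child_context`) are hypothetical and are not taken from any particular tracing library.

```python
import secrets

def new_context():
    """Start a new trace: a random 16-byte trace ID and 8-byte span ID (hex)."""
    return {"trace_id": secrets.token_hex(16), "span_id": secrets.token_hex(8)}

def inject(context, headers):
    """Sending side: serialize the context into an outgoing header."""
    headers["traceparent"] = f"00-{context['trace_id']}-{context['span_id']}-01"
    return headers

def extract(headers):
    """Receiving side: recover the caller's context from the header."""
    _version, trace_id, parent_span_id, _flags = headers["traceparent"].split("-")
    return {"trace_id": trace_id, "parent_span_id": parent_span_id}

def child_context(incoming):
    """Continue the same trace with a fresh span ID for the local work."""
    return {
        "trace_id": incoming["trace_id"],
        "parent_span_id": incoming["parent_span_id"],
        "span_id": secrets.token_hex(8),
    }

# Service A starts a trace and calls service B.
ctx_a = new_context()
outgoing_headers = inject(ctx_a, {})

# Service B receives the request and continues the trace.
ctx_b = child_context(extract(outgoing_headers))
```

Both services now share one trace ID, and service B's span names service A's span as its parent. That shared ID and parent link are what let a backend correlate the two spans.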

This method tracks requests across services, monitoring the flow of each request across all the devices, databases, serverless functions and third-party APIs it touches. Each tracked action is called a ‘span’. Tracing backends aggregate these spans and build a directed graph from them. This graph is called a trace. Some developer tools, like Jaeger and Helios, visualize the trace and show developers how the data flows through the app, including complex sync and async flows (HTTP requests, gRPC calls, serverless invocations, messaging queues, event streams and more). Distributed tracing with OpenTelemetry is discussed below.
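As a rough illustration of how spans become a trace, the sketch below links a handful of made-up spans by their parent IDs and renders the resulting graph as an indented tree. The span fields and sample values are assumptions for the example, not the data model of any specific backend.

```python
from collections import defaultdict

# Hypothetical spans, as a backend might receive them (in any order).
spans = [
    {"span_id": "c", "parent_id": "b", "name": "SELECT orders", "service": "postgres"},
    {"span_id": "a", "parent_id": None, "name": "GET /checkout", "service": "gateway"},
    {"span_id": "b", "parent_id": "a", "name": "POST /orders", "service": "orders"},
    {"span_id": "d", "parent_id": "a", "name": "publish order.created", "service": "kafka"},
]

def build_trace(spans):
    """Return the root span and a parent -> children adjacency map."""
    children = defaultdict(list)
    root = None
    for span in spans:
        if span["parent_id"] is None:
            root = span
        else:
            children[span["parent_id"]].append(span)
    return root, children

def render(span, children, depth=0):
    """Flatten the graph into indented lines, parents before children."""
    lines = ["  " * depth + f"{span['service']}: {span['name']}"]
    for child in children[span["span_id"]]:
        lines.extend(render(child, children, depth + 1))
    return lines

root, children = build_trace(spans)
trace_lines = render(root, children)
print("\n".join(trace_lines))
```

Visualization tools do essentially this at a much larger scale, adding timing, errors, and payload details to each node.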

Distributed tracing vs. logging

Distributed tracing and logging are two different approaches that can be used by developers to understand the behavior of a system.
Distributed tracing provides context and information about how a request is processed across multiple microservices, giving a complete picture of the request flow. Visualizing this information can help pinpoint where a problem occurred in the system. These insights can then be used for troubleshooting and debugging, and for improving development velocity.
Logging, on the other hand, is the process of recording system events, typically for debugging and auditing purposes. Logs provide a historical record of what has happened in the system, but do not give a complete picture of a request flow.
Logs and traces complement each other. Logs can help identify that an issue occurred, while traces give more context about where and why it occurred.
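One common way to make logs and traces work together is to stamp every log line with the ID of the trace it belongs to. The sketch below does this with Python's standard `logging` module. The trace ID value and the `TraceIdFilter` class are made up for illustration; in a real setup the ID would come from the active trace context.

```python
import io
import logging

class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def __init__(self, get_trace_id):
        super().__init__()
        self.get_trace_id = get_trace_id

    def filter(self, record):
        record.trace_id = self.get_trace_id()
        return True  # never drop the record, only enrich it

# Hypothetical "current trace"; a tracing SDK would normally provide this.
current = {"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"}

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
)
handler.addFilter(TraceIdFilter(lambda: current["trace_id"]))

logger = logging.getLogger("orders")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.error("payment declined")
log_output = stream.getvalue()
print(log_output, end="")
```

With the ID in place, a log line that shows an issue occurred can be used to look up the trace that shows where and why.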

Distributed tracing: languages and components

Tracing with Node.js

Node.js allows implementations to be overridden at runtime by replacing functions (i.e., monkey-patching), which makes distributed tracing relatively simple to implement. To trace a Node.js application, developers can use open-source solutions like OpenTracing or OpenTelemetry for Node.js, or adopt a tool such as Helios that builds on this open-source software and adds visualization and other advanced capabilities.

Tracing with Golang

Go's strong typing and compilation to machine code make it difficult to add instrumentation code dynamically. In addition, making runtime changes to compiled machine code is risky and may be considered a security problem. Helios took the legwork out of OpenTelemetry instrumentation in Go with an approach that is both easy to implement and non-intrusive.

Tracing with Grafana

Grafana has developed Grafana Tempo, an open source distributed tracing backend. Integrated with Grafana and Prometheus, Grafana Tempo can be used for ingesting traces. With Helios, traces can be displayed on Grafana dashboards.

Tracing with Kafka

Kafka is an open source event streaming platform that captures real-time data. It is a popular tool, but it impedes developer observability because it decouples producers from consumers and processes messages asynchronously. As a result, there are no direct transactions to trace and no explicit dependencies, which makes tracing for Kafka all the more important.
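Because the producer and consumer never talk to each other directly, the trace context has to ride along with the message itself. The sketch below illustrates the idea with an in-memory queue standing in for a Kafka topic; it is not a real Kafka client. It assumes the context is carried in a message header as a key with a bytes value, using the W3C `traceparent` layout. All function names are hypothetical.

```python
import secrets
from collections import deque

topic = deque()  # in-memory stand-in for a Kafka topic (not a real client)

def produce(value, trace_id, span_id):
    """Producer side: attach the trace context as a message header."""
    headers = [("traceparent", f"00-{trace_id}-{span_id}-01".encode())]
    topic.append({"value": value, "headers": headers})

def consume():
    """Consumer side: read the header and continue the producer's trace."""
    message = topic.popleft()
    headers = dict(message["headers"])
    _, trace_id, parent_span_id, _ = headers["traceparent"].decode().split("-")
    return {
        "trace_id": trace_id,              # same trace as the producer
        "parent_span_id": parent_span_id,  # links back to the producing span
        "span_id": secrets.token_hex(8),   # new span for the consumer's work
        "name": "process order.created",
        "value": message["value"],
    }

producer_trace_id = secrets.token_hex(16)
producer_span_id = secrets.token_hex(8)
produce(b'{"order_id": 42}', producer_trace_id, producer_span_id)

consumer_span = consume()
```

The consumer's span ends up in the same trace as the producer's, even though the two sides ran at different times and never exchanged a request.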

Tracing with Java

Java does not allow implementations to be overridden at runtime by replacing functions. However, it supports a mechanism called the Java agent, which performs dynamic bytecode modification and provides capabilities similar to those available in Node.js. The Java agent is a separate JAR that is passed to the JVM (via the -javaagent option) when the application JAR is launched, and it performs the instrumentation.

Tracing with Python

Python supports object-oriented and procedural programming and, as a dynamically typed language, does not require variable declarations. Like many other instrumentation libraries, OpenTelemetry-based instrumentation for Python works by wrapping existing function implementations and extracting the necessary data.
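The wrapping technique can be sketched in a few lines of plain Python. This is an illustration of the general idea only, not OpenTelemetry's actual implementation. `HttpClient`, `traced`, and `recorded_spans` are hypothetical names invented for the example.

```python
import functools
import time

recorded_spans = []  # hypothetical in-memory sink; a real SDK would export these

def traced(func):
    """Wrap a function so that each call records a span-like dict."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        error = None
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            error = repr(exc)
            raise
        finally:
            recorded_spans.append({
                "name": func.__qualname__,
                "duration_s": time.perf_counter() - start,
                "error": error,
            })
    return wrapper

class HttpClient:
    """Hypothetical library class that knows nothing about tracing."""

    def get(self, url):
        return f"200 OK from {url}"

# Patch the method at runtime; callers keep using HttpClient unchanged.
HttpClient.get = traced(HttpClient.get)

response = HttpClient().get("http://orders.internal/health")
```

The application code that calls `HttpClient.get` does not change, which is why this style of instrumentation can be added without touching the code being traced.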

Distributed tracing use cases

Visualized tracing data can help developers gain a broad understanding of their microservices architecture, as well as granular insights into specific requests and dependencies. This capability makes tracing a good solution for a number of developer use cases:

Troubleshooting and debugging

It is time-consuming and challenging for developers to identify issues, reproduce scenarios and fix bugs in microservices. This is mainly due to the lack of visibility into the architecture and to information that is not available through logs, like HTTP request bodies, Kafka messages and Lambda events.

To help troubleshoot and debug issues, some distributed tracing systems gather payloads and error data for identifying bottlenecks, identifying broken flows, and reproducing them.

Developer-first observability

Lack of visibility into microservices means developers lack the confidence to change them, since they don’t know what might break. This is due to how microservices are designed: they require properly configured APIs, taking into account response handling, error handling, requests, security, and a number of other factors. Otherwise, the services won’t be able to communicate.

A tracing system provides visibility into the architecture and data insights across all environments – from local to testing to staging to production. By seeing data flows, payloads, dependencies, and errors, developers can then acquire the information they need to develop and deploy production-ready code, while also ensuring services interact with each other as part of a software development lifecycle.

Operations

Microservices give teams the flexibility to choose which technology stacks and frameworks to use. However, this poses operational challenges in terms of communication, monitoring, scalability, and consistency among services. With distributed tracing, requests, queries, and payloads can be shared and reused, which helps developers operate and collaborate more effectively.

Testing

Testing microservices is slow and often inaccurate. Their loosely coupled nature, together with the boundaries and connection points that create dependencies between them, makes testing very complex. In most cases, developers can only reliably test their own services, since broader tests require relying on potentially outdated testing or staging environments, or on mocking, which is also complex. Even when testing does take place, results can be flaky. This means developers cannot be confident that testing will ensure code quality, application functionality, and performance.

Using trace-based testing, comprehensive tests can be automatically generated. Trace-based tests can even be generated directly from production, ensuring consistency and reliability.
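To give a feel for what a trace-based test checks, here is a small hand-written sketch: it takes a recorded trace and verifies that the request reached every expected downstream operation successfully. The trace data, span fields, and function names are all hypothetical. Tools that generate such tests automatically would derive the expectations from real traces instead.

```python
# A hypothetical recorded trace for one checkout request.
recorded_trace = [
    {"service": "gateway", "name": "POST /checkout", "status": "ok"},
    {"service": "orders", "name": "INSERT orders", "status": "ok"},
    {"service": "kafka", "name": "publish order.created", "status": "ok"},
    {"service": "email", "name": "send confirmation", "status": "ok"},
]

# Operations the checkout flow is expected to trigger downstream.
EXPECTED = [
    ("orders", "INSERT orders"),
    ("kafka", "publish order.created"),
    ("email", "send confirmation"),
]

def find_span(trace, service, name):
    """Return the first span matching a service and operation name, or None."""
    for span in trace:
        if span["service"] == service and span["name"] == name:
            return span
    return None

def check_checkout_trace(trace):
    """Return a list of failed expectations; an empty list means the test passes."""
    failures = []
    for service, name in EXPECTED:
        span = find_span(trace, service, name)
        if span is None:
            failures.append(f"missing span: {service} / {name}")
        elif span["status"] != "ok":
            failures.append(f"span failed: {service} / {name}")
    return failures

failures = check_checkout_trace(recorded_trace)

# Simulate a regression in which the confirmation email is never sent.
broken_trace = [s for s in recorded_trace if s["service"] != "email"]
broken_failures = check_checkout_trace(broken_trace)
```

Unlike a test that only inspects the HTTP response, this kind of check also catches failures in asynchronous steps that the caller never sees.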

Who can use distributed tracing

This method can be used by developers and DevOps to improve their understanding of their architectures and environments, helping them improve their velocity and efficiency. However, tracing holds especially high promise for developers, who gain visibility into the entire development process that occurs before production and can track and monitor activity throughout it. As a result, they are able to develop, test and troubleshoot with confidence.

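A tracing solution for developers should enable them to access and view data, get contextual information, filter data, and collaborate with their team and with DevOps.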

Distributed tracing with OpenTelemetry (OTel)

OpenTelemetry (OTel) is an open-source collection of tools, APIs, and SDKs for creating and gathering telemetry data, including traces from microservices. Then, through solutions like open-source Jaeger or Zipkin, or Helios, developers can visualize the traces, see their microservices architecture and troubleshoot errors.

OpenTelemetry enables easy integration with existing tools, is vendor-agnostic and supports multiple technologies. The project also boasts a vibrant open source community that contributes to it constantly. To date, OpenTelemetry is the industry standard for collecting distributed tracing data.

Tracing visualization

Visualization leverages OpenTelemetry tracing to provide granular visibility and immediate insights. Developers can see and understand how their services interact with each other and where errors and performance issues actually occurred, which makes troubleshooting much easier.

Deep Visualization

Understand your architecture and identify workflows and dependencies.

Immediate Insights

Get granular visibility into issues and errors.

Simple Troubleshooting

Use the provided information to quickly pinpoint the problem and proceed to fixing it.

Built-in Collaboration

Easily share traces, tests and triggers with your team.

Deep visibility into your services

Distributed tracing aggregates the operations that occur in microservices-based apps within a certain context. But without purpose-built visualization methods, views are cumbersome and lack the data needed to understand and troubleshoot complex requests.

Helios provides deep visualization that enables seeing into complex sync and async workflows, understanding the dependencies between different components and detecting changes across versions. This visibility is key to troubleshooting your applications.

Here’s an example:

[Figure: Jaeger visualization]

[Figure: Helios visualization]

How does tracing with Helios work?

Contextual information about any trace

Helios collects all the data. You will be able to see a wide scope of information about any trace: errors, attributes, payloads, logs, how the data flows in the system, span durations, commit hashes, and more.

Drill down to identify issues

Helios gathers log information, so you can drill down into the error logs to identify failed requests and issues. Then, go back to your code and resolve the issue.

Visualize again to affirm the fix

Once the issue is resolved, visualize the trace again to confirm that the problem no longer occurs.

Trace search

Filter your traces by condition type, such as HTTP, AWS, Lambda, Database, and Messaging, to easily find the traces you are looking for. You can also customize the attributes you filter on.

Issue reproduction

Helios provides flow replaying capabilities that can replay requests to API endpoints, message queues, Lambda functions, and more. The flow can be triggered as a script, a cURL command or Postman request.

Distributed tracing with Jaeger

Jaeger is a popular open source distributed tracing solution that was developed and released by Uber. With Jaeger, developers can monitor and troubleshoot microservices. They can use it for distributed context propagation, transaction monitoring, root cause analysis, service dependency analysis and identifying bottlenecks for optimization. Jaeger's data model was inspired by OpenTracing. It includes built-in Cassandra, Elasticsearch and in-memory storage backends, provides adaptive sampling, and more.

Augment Jaeger with Helios

Jaeger enables a basic level of visualization of distributed tracing. By enhancing Jaeger with Helios, you can get:

FAQs

Visualized tracing data can help developers troubleshoot and debug their microservices, thanks to the granular visibility and insights that distributed tracing solutions provide. Testing is another useful use case.

A tracing solution for developers should enable them to access and view data, get contextual information, filter data and collaborate with their team and with DevOps.

The most popular open source solutions that can leverage tracing are Jaeger and Zipkin. They rely on the telemetry data gathered by OpenTelemetry.

More recent solutions can provide additional capabilities: more tracing data from payloads, and visualization. This enables troubleshooting before production.

Leveraging a range of instrumentation layers, distributed tracing can be adopted in a way that will help engineering teams boost productivity without compromising data privacy.
