OpenTelemetry Tracing: Everything you need to know


OTel's distributed tracing capabilities compensate for the limits of traditional observability methods, which work well for monolithic apps but are hardly sufficient for observing and debugging distributed environments. Here's everything you need to know.

Applications are increasingly switching from the traditional monolithic design to a modern microservices-based design. This shift brings several operational benefits, but it also introduces challenges: conventional methods for collecting metrics and logs become ineffective because of the distributed nature of the application design. The result is gaps in the end-to-end (E2E) visibility that developers and administrators need to monitor, troubleshoot, and analyze the underlying issues within applications effectively.

Distributed tracing plays a crucial role in this process, providing a robust method to trace the flow of requests and responses across different services and components. This approach offers valuable insights into performance and error patterns.

Related: OpenTelemetry – a full guide 

What is OpenTelemetry Tracing?

OpenTelemetry is an open-source observability framework and a Cloud Native Computing Foundation (CNCF) incubating project. It comprises a collection of APIs, libraries, agents, and SDKs that enable teams to instrument, collect, and export telemetry data from modern applications.

This framework allows developers to instrument their applications with standard instrumentation libraries that generate telemetry data in the form of logs, metrics, and traces. OpenTelemetry agents can then collect and export this telemetry data to multiple systems for logging, tracing, and monitoring. A core concept of OpenTelemetry is that it aims to be vendor-agnostic: the collected data can be sent to different backends, and switching between them should require configuration changes rather than re-instrumenting the application.

The OTel tracing framework provides multiple benefits that are especially valuable when dealing with distributed microservice-based architectures. The following key points highlight some of these benefits:

  1. Standardization: OpenTelemetry provides a common instrumentation and data format standard that allows developers to generate and export telemetry data in a vendor-agnostic way. This ensures that telemetry data is consistent and interoperable across different systems, making it easier to analyze and troubleshoot issues. It also provides end-to-end observability of distributed systems by collecting telemetry data across multiple signals, namely logs, metrics, and traces.
  2. Flexibility: OpenTelemetry supports multiple programming languages, frameworks, and cloud environments, making it a versatile and flexible tool that can be used in various scenarios.
  3. Interoperability: OpenTelemetry can integrate with other observability tools, such as tracing systems, logging platforms, and monitoring tools, making it easier to adopt and extend existing observability solutions.

OpenTelemetry vs. OpenTracing

OpenTelemetry and OpenTracing are both open-source projects aimed at providing a standard for distributed tracing, but they have evolved differently over time.

OpenTracing made its mark on the technology landscape as a vendor-neutral API for distributed tracing. It provided a set of APIs for instrumenting applications to generate and propagate trace information, and a specification for vendors to implement their own distributed tracing systems that would be compatible with the API.

However, as distributed systems became more complex and the need for observability grew, it became clear that the existing tracing systems needed to be improved. In response, the OpenTelemetry project was launched in 2019 as a merger between OpenTracing and another open-source project called OpenCensus.

Observability in distributed systems can be addressed more thoroughly thanks to OpenTelemetry's design, which offers metrics and logging capabilities in addition to tracing.

How OpenTelemetry Works

OpenTelemetry relies on several components to instrument distributed applications and collect logs, metrics, and traces from them.

Figure: OpenTelemetry Components

Components of OpenTelemetry

The following components of the OpenTelemetry framework work together to generate, process, and deliver telemetry data.

  1. OTel SDK

    The SDK provides libraries for different programming languages that developers use to instrument their applications to collect telemetry data. It also includes the exporter, which is responsible for transmitting telemetry data to a destination; in most cases, that destination is an OpenTelemetry Collector.

  2. OpenTelemetry Exporter

    The exporter transmits telemetry data to a backend system. Exporters are available for backend systems like Jaeger, Zipkin, Prometheus, and more.

  3. OpenTelemetry Collector

    The collector receives telemetry data from exporters and forwards it to backend systems. It also supports the processing, sampling, and transformation of telemetry data.

  4. OpenTelemetry Backend systems

    Backend systems store and process telemetry data. Examples include Jaeger, Zipkin, Prometheus, and others.
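
To make the flow between these components concrete, here is a minimal sketch in plain Python. It is a toy model and not the OpenTelemetry SDK or Collector API: `SpanRecord`, `ToyCollector`, `InMemoryBackend`, and the two processor functions are hypothetical names invented for illustration. It only shows the shape of the pipeline, in which an exporter hands spans to a collector, the collector processes them, and the result is forwarded to one or more backends.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Protocol


@dataclass
class SpanRecord:
    """A finished span, reduced to the fields this sketch needs."""
    name: str
    service: str
    duration_ms: float
    attributes: Dict[str, str] = field(default_factory=dict)


class Exporter(Protocol):
    """Anything that can accept a batch of finished spans."""
    def export(self, spans: List[SpanRecord]) -> None: ...


class InMemoryBackend:
    """Stands in for a storage backend such as Jaeger or Zipkin."""
    def __init__(self) -> None:
        self.stored: List[SpanRecord] = []

    def export(self, spans: List[SpanRecord]) -> None:
        self.stored.extend(spans)


Processor = Callable[[List[SpanRecord]], List[SpanRecord]]


class ToyCollector:
    """Receives spans, runs processors in order, then forwards to backends."""
    def __init__(self, processors: List[Processor], backends: List[Exporter]) -> None:
        self.processors = processors
        self.backends = backends

    def export(self, spans: List[SpanRecord]) -> None:
        for process in self.processors:
            spans = process(spans)
        for backend in self.backends:
            backend.export(spans)


def drop_health_checks(spans: List[SpanRecord]) -> List[SpanRecord]:
    """Filtering step: discard noisy health-check spans."""
    return [s for s in spans if s.name != "GET /health"]


def tag_environment(spans: List[SpanRecord]) -> List[SpanRecord]:
    """Transformation step: add a (hypothetical) 'env' attribute to every span."""
    for s in spans:
        s.attributes["env"] = "staging"
    return spans


backend = InMemoryBackend()
collector = ToyCollector([drop_health_checks, tag_environment], [backend])

# The application side only knows about "something with export()".
collector.export([
    SpanRecord("GET /health", "api", 1.2),
    SpanRecord("GET /orders", "api", 48.0),
])
print([(s.name, s.attributes) for s in backend.stored])
```

Because the application depends only on the `export` method, pointing it at a different backend or adding a second one is a change to the wiring rather than to the instrumented code, which is the idea behind the vendor-agnostic design described earlier.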

OpenTelemetry Tracing process

Even though OpenTelemetry provides a robust and comprehensive method for tracing, the process can be simplified into the following steps.

  1. Instrumentation: Developers instrument their applications using the OpenTelemetry SDK to collect telemetry data. They add code to their applications to start and end spans, set attributes, and add events to spans.
  2. Span Creation: When an application receives a request, it creates a new span to represent that request. Spans have a start and end time and can include attributes, events, and child spans.
  3. Context Propagation: OpenTelemetry propagates trace context across service boundaries using headers (by default, the W3C Trace Context `traceparent` header). This allows the trace to be continued across multiple services and systems.
  4. Span Export: The exporter collects the finished spans and sends them to a backend system like Jaeger or Zipkin for storage and analysis.
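
The span lifecycle in the steps above can be sketched with a small stand-in tracer. This is a simplified model written with the standard library only; `ToySpan` and `ToyTracer` are hypothetical names and not the OpenTelemetry API. The sketch assumes single-threaded code, so it tracks the current span with a plain list instead of a real context mechanism.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional, Tuple


@dataclass
class ToySpan:
    name: str
    trace_id: str               # shared by every span in the same trace
    span_id: str                # unique per span
    parent_id: Optional[str]    # None for the root span
    start: float = 0.0
    end: Optional[float] = None
    attributes: Dict[str, object] = field(default_factory=dict)
    events: List[Tuple[float, str]] = field(default_factory=list)

    def set_attribute(self, key: str, value: object) -> None:
        self.attributes[key] = value

    def add_event(self, name: str) -> None:
        self.events.append((time.time(), name))


class ToyTracer:
    def __init__(self) -> None:
        self.finished: List[ToySpan] = []   # what an exporter would receive
        self._stack: List[ToySpan] = []     # currently open spans

    @contextmanager
    def start_span(self, name: str) -> Iterator[ToySpan]:
        parent = self._stack[-1] if self._stack else None
        span = ToySpan(
            name=name,
            trace_id=parent.trace_id if parent else uuid.uuid4().hex,
            span_id=uuid.uuid4().hex[:16],
            parent_id=parent.span_id if parent else None,
            start=time.time(),
        )
        self._stack.append(span)
        try:
            yield span
        finally:
            # A span is ended and queued for export even if the body raised.
            span.end = time.time()
            self._stack.pop()
            self.finished.append(span)


tracer = ToyTracer()
with tracer.start_span("GET /orders") as request_span:
    request_span.set_attribute("http.method", "GET")
    with tracer.start_span("db.query") as db_span:
        db_span.add_event("rows fetched")

for span in tracer.finished:
    print(span.name, "parent:", span.parent_id)
```

The child span finishes first, so it appears first in `finished`; both spans carry the same `trace_id`, and the child's `parent_id` points at the request span, which is what lets a backend reassemble the tree.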

OpenTelemetry Integration

The versatility of the OpenTelemetry framework allows a multitude of integration options. Developers may choose to integrate at any of the following levels:

  1. Integration with different programming languages.
  2. Integration with different frameworks.
  3. Integration with different services.

Integration with different programming languages and platforms

OpenTelemetry provides SDKs for various programming languages, including Java, Python, Go, Node.js, and .NET, and offers tooling for platforms such as Kubernetes. These SDKs allow developers to instrument their applications to collect telemetry data in a standardized way.

Integration with different frameworks

OpenTelemetry integrates with popular application frameworks such as Spring, Flask, and Django. These integrations enable automatic instrumentation of these frameworks, allowing developers to collect telemetry data without writing additional code.

Integration with different services

OpenTelemetry provides exporters for various backend systems, including Jaeger, Zipkin, Prometheus, and more. These exporters allow developers to send telemetry data to these backend systems for storage and analysis. OpenTelemetry supports integration with cloud providers like AWS, Azure, and Google Cloud Platform, allowing telemetry data to be collected from services running in these environments.

OpenTelemetry Instrumentation

Instrumentation is the practice of adding code to an application to collect telemetry data. In the context of OpenTelemetry, instrumentation involves using the OpenTelemetry SDK to collect data about an application's performance and behaviour. The aim is to gain visibility into how the application behaves and performs, allowing developers to identify and troubleshoot issues.

Automated vs manual instrumentation

OpenTelemetry allows for two methods of instrumentation.

  1. Automated Instrumentation: The developer can use this approach to enable instrumentation without updating the source code. However, for this method of instrumentation to work, the language used within the application must be one of the languages supported by OpenTelemetry.
  2. Manual Instrumentation: This method involves the developer manually creating the telemetry data by creating spans, events, and metrics using the tracer and meter objects available. Like automatic instrumentation, manual instrumentation is only available for languages that OpenTelemetry supports.
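
The difference between the two approaches can be illustrated with a toy example. This is not how any particular OpenTelemetry agent is implemented; `record_span`, `auto_instrument`, `checkout`, and `lookup_price` are hypothetical names. The manual variant edits the function body, while the automatic variant wraps an existing function from the outside and leaves its source untouched.

```python
import functools
import time
from typing import Callable, List, Tuple

recorded: List[Tuple[str, float]] = []  # (span name, duration in seconds)


def record_span(name: str, started: float) -> None:
    recorded.append((name, time.perf_counter() - started))


# Manual instrumentation: the developer adds timing code inside the function.
def checkout(cart_total: float, shipping: float) -> float:
    started = time.perf_counter()
    try:
        return cart_total + shipping
    finally:
        record_span("checkout", started)


# Automatic instrumentation (in spirit): wrap a function without editing it.
def auto_instrument(func: Callable) -> Callable:
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            record_span(func.__name__, started)
    return wrapper


def lookup_price(sku: str) -> float:
    """Unmodified application code."""
    return {"apple": 0.5}.get(sku, 0.0)


# The wrapping happens at startup, outside the function's own source.
lookup_price = auto_instrument(lookup_price)

checkout(10.0, 5.0)
lookup_price("apple")
print([name for name, _ in recorded])
```

Manual instrumentation gives full control over span names and attributes at the cost of touching application code; wrapping from the outside covers many call sites cheaply but only knows what it can observe at the function boundary.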

OpenTelemetry data collection

Data collection in OpenTelemetry refers to collecting telemetry data from various sources, such as applications, services, and infrastructure components. The collected data is then processed and analyzed to provide insights into the performance and behaviour of the system.

OpenTelemetry data collection methods

Developers can use OpenTelemetry to collect traces, metrics, and logs from their applications in several ways:

  1. Instrumentation: OpenTelemetry instrumentation involves adding monitoring or tracing code to an application or service. The instrumentation code generates telemetry data, which is then collected by an OpenTelemetry agent and sent to a telemetry backend.
  2. Collector receivers: Telemetry data can also be collected from non-instrumented sources, such as logs, metrics, or tracing data emitted by other systems. In the OpenTelemetry Collector, receivers ingest this data and exporters forward it to a backend. This approach is typically used for systems that are not directly controlled by the R&D teams, such as third-party services, API gateways, and cloud providers.
  3. Service-to-service communication: OpenTelemetry can also collect telemetry data through service-to-service communication. When services communicate, they can exchange telemetry data, such as tracing information, to provide insights into the performance and behaviour of the system.
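
For service-to-service communication, the tracing information exchanged between services is typically a small header. The sketch below builds and parses a header in the W3C Trace Context `traceparent` format (`version-traceid-parentid-flags`). The helper names `inject`, `extract`, and `new_span_id` are chosen for illustration and this is not the OpenTelemetry propagator API; the sketch handles only version `00` and ignores the companion `tracestate` header.

```python
import re
import secrets
from typing import Dict, Optional, Tuple

# version "00", 32 hex chars of trace ID, 16 hex chars of parent span ID, 2 hex chars of flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_span_id() -> str:
    return secrets.token_hex(8)  # 8 random bytes -> 16 hex characters


def inject(headers: Dict[str, str], trace_id: str, span_id: str, sampled: bool) -> None:
    """Caller side: write the current trace context into outgoing headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: Dict[str, str]) -> Optional[Tuple[str, str, bool]]:
    """Callee side: return (trace_id, parent_span_id, sampled), or None if absent or invalid."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None  # all-zero IDs are not valid
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


# Service A starts a trace and calls service B.
trace_id = secrets.token_hex(16)
caller_span_id = new_span_id()
outgoing_headers: Dict[str, str] = {}
inject(outgoing_headers, trace_id, caller_span_id, sampled=True)

# Service B continues the same trace with a new span of its own.
incoming = extract(outgoing_headers)
if incoming is not None:
    parent_trace_id, parent_span_id, sampled = incoming
    child_span_id = new_span_id()
    print(outgoing_headers["traceparent"])
    print("child", child_span_id, "continues trace", parent_trace_id)
```

The callee reuses the trace ID and records the caller's span ID as its parent, so spans produced by different services can later be stitched into a single trace.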

Sampling and data volume

Sampling is an essential technique for managing the volume of telemetry data collected by OpenTelemetry. Sampling involves collecting a subset of the telemetry data generated by an application or service rather than collecting all of it.

In OpenTelemetry, sampling can be done at various levels, including:

  1. Head-Based Sampling: With head-based sampling, the sampling decision is made as early as possible, at the start of the trace, and is propagated to the other participants through the context. Because no telemetry is collected for dropped spans, this approach conserves CPU and memory resources.
  2. Tail-Based Sampling: With tail-based sampling, the sampling decision is postponed until all spans of a trace are available, which allows better decisions based on the complete trace data. We could, for instance, keep only failed or unusually long traces.
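
Both strategies can be sketched in a few lines. The functions below are illustrative stand-ins and not OpenTelemetry's built-in samplers: `head_sample` derives a keep-or-drop decision from the trace ID alone, so any service seeing the same ID reaches the same verdict, and `tail_sample` inspects a completed trace. The 500 ms threshold and the `FinishedSpan` type are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


def head_sample(trace_id: str, ratio: float) -> bool:
    """Head-based: decide up front, using only the 32-hex-character trace ID.

    The low 64 bits of the ID are compared against ratio * 2**64, so roughly
    `ratio` of all randomly generated trace IDs are kept.
    """
    bound = int(ratio * (1 << 64))
    return int(trace_id[-16:], 16) < bound


@dataclass
class FinishedSpan:
    name: str
    duration_ms: float
    is_error: bool = False


def tail_sample(spans: List[FinishedSpan], slow_ms: float = 500.0) -> bool:
    """Tail-based: decide after the whole trace is available.

    Keeps the trace if any span failed or if any span ran longer than slow_ms.
    """
    has_error = any(s.is_error for s in spans)
    is_slow = max((s.duration_ms for s in spans), default=0.0) > slow_ms
    return has_error or is_slow


healthy = [FinishedSpan("GET /orders", 40.0), FinishedSpan("db.query", 12.0)]
failed = [FinishedSpan("GET /orders", 35.0), FinishedSpan("charge card", 20.0, is_error=True)]
slow = [FinishedSpan("GET /report", 900.0)]

print(tail_sample(healthy), tail_sample(failed), tail_sample(slow))
print(head_sample("0" * 32, 0.5), head_sample("f" * 32, 0.5))
```

The trade-off is visible in the signatures: `head_sample` needs nothing but the ID and can run before any work is done, whereas `tail_sample` needs every span of the trace to be buffered somewhere until the trace completes.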

How to make use of the data? OpenTelemetry Analysis and Visualization

OpenTelemetry traces and spans provide a detailed view of the performance and behaviour of applications and services. Analysis of traces and spans can provide insights into latency, errors, and other performance metrics. Some standard techniques for trace and span analysis in OpenTelemetry include:

  1. Distributed tracing analysis: Distributed tracing analysis involves analyzing traces and spans across multiple services to identify performance bottlenecks, latency issues, and other performance problems.
  2. Resource utilization analysis: This technique looks for potential resource conflicts or inefficiencies by looking at each span’s resource usage (such as CPU, memory, and network) in a trace.
  3. Error analysis: Error analysis involves analyzing traces and spans to identify errors and exceptions and to understand the impact of errors on system performance.
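
As a rough illustration of bottleneck and error analysis, the sketch below works on a flat list of exported spans. `SpanRow` and both helper functions are hypothetical, and the self-time calculation assumes that a span's children run one after another rather than concurrently.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SpanRow:
    span_id: str
    parent_id: Optional[str]
    service: str
    duration_ms: float
    error: bool = False


def self_time_by_service(spans: List[SpanRow]) -> Dict[str, float]:
    """Time each service spent in its own spans, excluding time spent in child spans."""
    child_total: Dict[str, float] = defaultdict(float)
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] += s.duration_ms
    totals: Dict[str, float] = defaultdict(float)
    for s in spans:
        totals[s.service] += max(s.duration_ms - child_total[s.span_id], 0.0)
    return dict(totals)


def error_rate_by_service(spans: List[SpanRow]) -> Dict[str, float]:
    """Fraction of each service's spans that were marked as errors."""
    counts: Dict[str, int] = defaultdict(int)
    errors: Dict[str, int] = defaultdict(int)
    for s in spans:
        counts[s.service] += 1
        errors[s.service] += int(s.error)
    return {service: errors[service] / counts[service] for service in counts}


trace = [
    SpanRow("a", None, "gateway", 120.0),
    SpanRow("b", "a", "orders", 100.0),
    SpanRow("c", "b", "payments", 80.0, error=True),
]

self_times = self_time_by_service(trace)
bottleneck = max(self_times, key=self_times.get)
print(self_times)
print("bottleneck:", bottleneck)
print(error_rate_by_service(trace))
```

Looking at self time rather than total duration avoids blaming the gateway for latency that actually originates two hops downstream, which is the kind of question distributed tracing analysis is meant to answer.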

The OpenTelemetry framework allows developers to use third-party OSS tools to consolidate and visualize the trace information collected from multiple sources, for instance:

  1. Jaeger: It is a popular open-source distributed tracing system that supports OpenTelemetry. It provides a web-based UI for analyzing and visualizing trace data.
  2. Grafana: It is an open-source analytics and visualization platform. It provides various visualization options for telemetry data, including graphs, charts, and dashboards.
  3. Prometheus: It is an open-source monitoring and alerting system. It provides a web-based UI for analyzing and visualizing metric data.

Related:

How Novacy Shortened Troubleshooting Time by 90% 

How we slashed detection and resolution time in half (Salt Security)

OTel data enrichment and smart visualization – Helios

While OpenTelemetry provides a standardized way of collecting and exporting trace data from applications, Helios builds on this by providing a comprehensive tracing system that includes analysis, visualization, and alerting capabilities on top of simplifying the process of OSS installation and maintenance. Helios allows users to trace requests across multiple services, providing end-to-end visibility into the performance and behaviour of distributed systems.

Example – We choose to troubleshoot a mobile app:

With trace visualization, we can easily follow the 401 authentication error from the server back to the mobile app.

Figure: 401 authentication error shown in a distributed trace

Another issue could be an error occurring deeper in the application. A mobile app can only show an error message returned by the endpoint it calls or, even worse, may not show anything at all. The cause could be, for example, a failed downstream request to a third-party product, after which the distributed application simply continued with its flow.

Figure: E2E flow error shown in a distributed trace

In the screenshot above we see a request to a third-party app which received a rate limit error – but the application flow continued without exposing any error to the mobile developer. Helios helps developers visualize what really happens in their application as well as the integrations with third-party apps to quickly get to the bottom of things.

 

OpenTelemetry Tracing Use Cases

OpenTelemetry allows its users to troubleshoot issues both in the back end and front end of applications.

Frontend use cases

  1. Detect faulty logic that can cause errors within the application
  2. Locate and identify components that make the UI slow and hinder customer experience
  3. Locate geo-specific bottlenecks

Backend use cases

  1. Detect incorrect user input that leads to errors being generated
  2. Identify API calls to the back end that have delayed response times
  3. Detect ineffective API code that increases the response times of each call

Related: How to adopt distributed tracing without compromising data privacy

OpenTelemetry Adoption

OpenTelemetry has been adopted by many major companies and organizations, including Google, Microsoft, Amazon, Uber, and Netflix, and is supported by a vibrant and active community of developers and contributors.

The future of OpenTelemetry and distributed tracing looks bright, with continued growth and adoption expected in the coming years. The project constantly evolves, with new features and capabilities being added regularly, and the community is actively involved in shaping its future.

As more organizations adopt microservices and cloud-native architectures, the need for observability and performance optimization will only continue to grow, making OpenTelemetry an increasingly important tool for modern application development. OpenTelemetry is a valuable tool for improving system visibility, reducing MTTR, and improving the overall user experience.

Conclusion

OpenTelemetry is a robust open-source framework for distributed tracing, metrics, and logging in modern applications. It provides a standardized way to collect, analyze, and visualize telemetry data, making it easier to troubleshoot issues, optimize performance, and improve the overall user experience. OpenTelemetry has been adopted by many major companies and organizations and is supported by an active community of developers and contributors.

Distributed tracing is becoming increasingly important in modern software development as applications become more complex and distributed. It allows developers to trace requests across multiple services and systems, providing valuable insights into system performance and behaviour. With the adoption of microservices and cloud-native architectures, the need for observability and performance optimization will only grow, making distributed tracing and OpenTelemetry increasingly essential tools for modern application development.

As applications become even more distributed and complex, OpenTelemetry is poised to play a vital role, with its community actively shaping the project's future and regularly adding new features and capabilities.


About Helios

Helios is an applied observability platform that produces actionable security and monitoring insights. We apply our deep runtime data collection capabilities to help Sec, Dev, and Ops teams understand the actual application risk posture, prioritize vulnerabilities, shorten troubleshooting time, and reduce MTTR.

The Author

Helios
