What is Observability?
Observability is a term that has been thrown around a lot in the software development industry over the past few years. Different people use it in different ways, but one thing is clear: it attempts to solve a real pain engineers are feeling. It is the pain of not knowing what is happening in a microservices architecture, or how systems are behaving in production and why.
According to Splunk, observability is “the ability to measure the internal states of a system by examining its outputs. A system is considered ‘observable’ if the current state can be estimated by only using information from outputs, namely sensor data.”
New Relic adds: “Observability is proactively collecting, visualizing, and applying intelligence to all of your metrics, events, logs, and traces—so you can understand the behavior of your complex digital system.”
In other words, what these APM giants are telling us is that observability lets us understand the internal state of our systems just by looking at their output data, which is collected through metrics, events, logs and traces.
Observability: Seeing, Understanding, Explaining
The way I see it, observability is made up of three principles: visibility, understandability and explainability. First, we need to be able to see the services and the system components. Then we need to be able to perceive and recognize how our code is affected by these different services when it is pushed to production. Finally, we need to be able to explain why these changes occurred.
These three principles ensure developers can detect issues, investigate and troubleshoot them, and prevent them from recurring.
Dynatrace nicely sums up why we need observability: “The goal of observability is to understand what’s happening across all these environments and among the technologies, so you can detect and resolve issues to keep your systems efficient and reliable and your customers happy.”
Observability seems to be the solution for all of our microservices perils. A quick Twitter scroll reveals high hopes for observability’s ability to help developers:
Monitoring and observability are debugging for your whole production. Not enough people talk about this – they should be your main priority
— Linux Handbook (@LinuxHandbook) February 18, 2022
Once someone has experienced how much easier it is writing code with real observability, you cannot pull it out of their cold dead hands. It’s like getting glasses for the first time, and realizing you could barely see the world around you.
— Charity Majors (@mipsytipsy) January 17, 2022
7. Observability is the most important operational need of … any piece of software!
Amen!! If there’s one good trend above all that envoy has ushered in, let it be the need for APIs for everything, especially operational sine qua nons like stats, runtime config etc.!
— Cindy Sridharan (@copyconstruct) July 31, 2020
Observability enables us to know when our services don’t meet our availability goals, and ideally highlight the root cause. If we were to share that info publicly, then that would be a form of transparency.
— Kelsey Hightower (@kelseyhightower) February 13, 2022
Beyond Observability: Leveraging Telemetry Data for Actionable Insights
While I do see observability as a game-changing concept in software development, there is more that we can do, and should be doing, with this data.
Distributed tracing data enables us to gain much more advanced and actionable insights than just seeing into the system and being able to explain what happened. “Passive” data may provide developers with an understanding of what happened, but it doesn’t explain what to do next.
We asked our own developers what they needed for observability. Some of their answers were:
- “I need observability daily: it helps me gain control over my systems and reduce code errors.”
- “I want to feel safe that if something happens I would know about it and what to do about it.”
- “Just adding observability capabilities doesn’t solve the problem. We need something actionable, we need to know what to do about it.”
By going beyond observability, we can achieve what we set out to do in the first place: deliver better software at higher velocity. Enormous amounts of data are available to us through logs, metrics, traces and events (whether gathered manually or through solutions like OpenTelemetry). Yet we use that data only passively, for issue identification and investigation, once an error occurs.
There’s so much more we can do!
Distributed tracing data insights can actively assist us with development, debugging, troubleshooting, testing, reproducing scenarios, comparing between deployments and environments and identifying changes across pull requests.
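As a rough illustration of comparing deployments or pull requests, the sketch below diffs the service-to-service calls found in two traces of the same flow. The `Span` record, the `call_edges` and `diff_traces` helpers, and the sample traces are all hypothetical and simplified; they are not taken from any particular tracing SDK or product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """Hypothetical, simplified span record (not a real tracing SDK type)."""
    span_id: str
    parent_id: Optional[str]
    service: str
    operation: str


def call_edges(spans):
    """Return the set of (caller, callee) pairs found in one trace."""
    by_id = {s.span_id: s for s in spans}
    edges = set()
    for s in spans:
        parent = by_id.get(s.parent_id)  # root spans have no parent
        if parent is not None:
            edges.add((f"{parent.service}:{parent.operation}",
                       f"{s.service}:{s.operation}"))
    return edges


def diff_traces(baseline, candidate):
    """Report calls that appear or disappear between two traces of one flow."""
    before, after = call_edges(baseline), call_edges(candidate)
    return {"added": sorted(after - before), "removed": sorted(before - after)}


# Made-up traces of the same checkout flow on main and on a pull request.
main_trace = [
    Span("1", None, "gateway", "POST /checkout"),
    Span("2", "1", "orders", "create_order"),
    Span("3", "2", "payments", "charge"),
]
pr_trace = main_trace + [Span("4", "2", "inventory", "reserve_stock")]

changes = diff_traces(main_trace, pr_trace)
print(changes)
```

Here the pull request introduces a new call from the orders service to an inventory service, which shows up under `added`. A real comparison would likely also look at latencies, payload shapes and error rates, but the idea is the same.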
Solving Development Challenges with Data Insights
Imagine a typical day in the life of a developer of a modern application. Coding is only one aspect of it. Understanding how those coding changes propagate through the system – which may include multiple microservices, serverless functions, job queues like Celery or even big data pipelines like Databricks – is another, fairly large and confusing, aspect.
Today, understanding the impact of code changes is no longer as effortless as adding breakpoints to a monolith server used to be. When something goes wrong, and it often does, identifying the failure point is non-trivial. Scrolling through the logs just won’t cut it, as the relevant logs may appear in any of the components in the flow and may not be correlated in one centralized place. Reproducing issues by triggering internal parts of the flows is also hard – just finding the correct payloads, let alone re-triggering them (in all the different protocols), takes a lot of time and is often done inefficiently and insecurely.
This is where actionable distributed tracing insights take center stage. The list of actionable insights and operations we can offer developers goes well beyond observability. A partial list of examples:
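To make the correlation problem concrete, here is a minimal sketch that groups log records from several components by a shared trace ID and then finds the first failure in one request's timeline. It assumes every component attaches the same `trace_id` to its log lines; the field names, the sample records and the helper functions are made up for illustration.

```python
from collections import defaultdict

# Made-up log records from three components. `trace_id` is assumed to be
# propagated with each request and attached to every log line.
logs = [
    {"ts": 3, "service": "payments", "trace_id": "abc", "level": "ERROR", "msg": "card declined"},
    {"ts": 1, "service": "gateway", "trace_id": "abc", "level": "INFO", "msg": "POST /checkout"},
    {"ts": 2, "service": "orders", "trace_id": "abc", "level": "INFO", "msg": "order created"},
    {"ts": 1, "service": "gateway", "trace_id": "xyz", "level": "INFO", "msg": "GET /health"},
]


def group_by_trace(records):
    """Group log records by trace ID and order each group by timestamp."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["trace_id"]].append(record)
    return {tid: sorted(recs, key=lambda r: r["ts"]) for tid, recs in grouped.items()}


def first_failure(timeline):
    """Return the earliest ERROR record in one request's timeline, or None."""
    return next((r for r in timeline if r["level"] == "ERROR"), None)


timelines = group_by_trace(logs)
failure = first_failure(timelines["abc"])
print([r["service"] for r in timelines["abc"]])
print(failure["service"], failure["msg"])
```

Without the shared identifier, the three lines for request `abc` would sit in three separate log streams with nothing tying them together, which is exactly the situation described above.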
- Test your flows with complex assertions to validate deep system behaviors – Imagine you could easily test the end-to-end flow of your microservices, including the different calls to third-party APIs or even your cloud native services. Well, you can.
- Share and reuse requests, queries and payloads among developers and teams – Remember how tedious it is to generate a specific request to your microservice? Especially if you want to trigger it from a point in the “middle” of your flow? Not to mention if it is not an HTTP request. Not anymore.
- Reproduce application states for debugging and troubleshooting – Stop manually chaining the different calls to bring your system to a certain state. You can do it automatically based on recommendations from the tracing data you collect.
- Troubleshoot your event streams (like Kafka) or message queues (like RabbitMQ) – Finally, get insights into what is actually happening in your message broker, in one click.
- Simplify onboarding for new developers to the system – Instead of pointing new developers to outdated documentation, let them deep dive into how the system really behaves. Today. Right now.
- Automatically convert behavior to documentation – Documentation is outdated, almost by definition. Automatically create live docs based on traces from your system and stop chasing changes.
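The first item in the list, testing flows with assertions on deep system behavior, could look roughly like the sketch below. It checks that a collected trace visited the expected operations in order, with no failed spans, and within a latency budget. The span fields (`start_ms`, `end_ms`, `status`), the `assert_flow` helper and the sample trace are assumptions made for this example, not the API of any specific tool.

```python
def assert_flow(spans, expected_path, max_total_ms):
    """Check that a trace visits the expected operations in order, without
    errors, and within a latency budget. Raises AssertionError otherwise."""
    ordered = sorted(spans, key=lambda s: s["start_ms"])
    path = [f'{s["service"]}:{s["operation"]}' for s in ordered]
    assert path == expected_path, f"unexpected path: {path}"
    failed = [s for s in ordered if s["status"] != "OK"]
    assert not failed, f"spans with errors: {failed}"
    total = max(s["end_ms"] for s in ordered) - min(s["start_ms"] for s in ordered)
    assert total <= max_total_ms, f"flow took {total} ms"
    return total


# Made-up trace of a checkout flow that ends in a third-party payment API.
trace = [
    {"service": "gateway", "operation": "POST /checkout", "start_ms": 0, "end_ms": 120, "status": "OK"},
    {"service": "orders", "operation": "create_order", "start_ms": 10, "end_ms": 100, "status": "OK"},
    {"service": "payments-api", "operation": "charge", "start_ms": 30, "end_ms": 90, "status": "OK"},
]
expected = ["gateway:POST /checkout", "orders:create_order", "payments-api:charge"]

total_ms = assert_flow(trace, expected, max_total_ms=200)
print(f"flow completed in {total_ms} ms")
```

The point of asserting on the trace rather than on the HTTP response alone is that the test can fail when an internal call is skipped, reordered or slow, even if the caller still receives a successful response.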
Not only do we get more insights, but these actions can all take place before production, which lets us increase our dev velocity and ensure production readiness as early as the developer’s environment. In other words, we are improving our day-to-day routine instead of waiting for a crash or an error before we investigate, and we are making sure features are production-ready.
Observability is the first step in this direction. Gaining observability helps pinpoint some of the issues and sheds light on where we need to look. But to leverage the real power of microservices and develop with high velocity, we need to take the next step. Data-driven insights and recommendations will let engineers truly develop and push code with confidence.