When code fails in production –
and how to fix it in minutes

How fast can you discover problems in your code so that you’re delivering features quickly and confidently?


All developers know that moment when, after we’ve reviewed everything and completed our sanity tests, our code somehow doesn’t work in production. The question is, how quickly and easily can we discover the problem so that we can continue delivering features faster and more confidently to our users? 

I found myself in such a situation recently, when I discovered that my code didn't work in production. I had developed a data retention feature in our product, Helios. In a nutshell, we retain each customer's data (traces) for a certain number of days, according to their plan. In some cases, we identify traces that should never be deleted, and we mark them as persistent. I created a new endpoint in the Helios app, called persist, which updates the data in two different databases: a Postgres database that keeps the trace metadata, and an Elasticsearch cluster that holds the raw data itself. Both should be affected by the data retention logic.
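To make the setup concrete, here is a minimal sketch of what the happy path of such an endpoint might look like. The function names and in-memory dictionaries are hypothetical stand-ins for the real Postgres and Elasticsearch clients, not the actual Helios code:

```python
# Hypothetical stand-ins for the two data stores (plain dicts instead of
# the real Postgres and Elasticsearch clients).
postgres_metadata = {}    # trace_id -> metadata row (Postgres in the real system)
elasticsearch_docs = {}   # trace_id -> raw trace document (Elasticsearch in the real system)

def persist(org_id, trace_id):
    """Mark a trace as persistent in both stores so retention never deletes it."""
    # Update the metadata row in the relational store.
    row = postgres_metadata.setdefault(trace_id, {"org": org_id})
    row["persistent"] = True
    # Update the raw document in the search cluster.
    doc = elasticsearch_docs.setdefault(trace_id, {"org": org_id})
    doc["persistent"] = True
    return {"trace_id": trace_id, "persistent": True}
```

In the real service, the two writes go over the network to separate systems, which is exactly why seeing both requests inside a single trace is so useful later on.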

I released the feature after testing it locally and ensuring that it worked in production as expected, but when I showed it live to the team during our sprint demo, only the first trace was marked properly as persistent, while the second trace wasn’t.

What went wrong

Like my colleague Lior, who dogfoods Helios all the time (that is, uses our own product to visualize and debug flows in our product), I made Helios my go-to application for checking what went wrong. I looked up the persist API endpoint and quickly identified the two calls from the live demo. Each call is itself a trace that can be investigated in Helios.


Using Helios to search for the ‘persist’ API endpoint and then easily locate the two calls that needed to be compared


I opened trace #1 to see what had worked, then opened trace #2 to see what had failed. I immediately saw that in the failing call, the queries to Postgres and Elasticsearch never happened.


A side-by-side comparison of the two calls to the ‘persist’ API endpoint during the live demo: it’s easy to identify from the visualization which requests did not take place


This led me back to the code, where I saw that a validation runs right before the updates are made to both databases, and that this validation doesn't pass in all cases. This also explains why the sanity tests passed after deployment but the feature failed in the sprint demo.

How I fixed it

Once I got to the bottom of this issue, I proceeded to change the way we fetch the trace for this new endpoint. 

The point is not the fix itself, but how fast I arrived at the root cause. In any other microservices architecture I've worked with, I haven't had a tool that could explain what happened in a single call with such ease and simplicity. Looking at the screenshots above, it's obvious that at 10:04 the call to mark the trace as persistent worked, and at 10:09 it didn't.

In the first call (trace #1), the request to fetch the trace returned with a response body, and two subsequent requests were made to update the Postgres and Elasticsearch databases. In the second call (trace #2), the response body returned empty, meaning no trace was found for the combination of organization and trace ID, so no updates were made to the databases.
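The failing path boils down to a guard of roughly this shape (the function and parameter names are illustrative, not the actual Helios code): when the fetch by organization and trace ID comes back empty, the two update calls are skipped entirely.

```python
def persist(org_id, trace_id, fetch_trace, update_postgres, update_elasticsearch):
    """Mark a trace as persistent; skip all updates when the trace isn't found."""
    trace = fetch_trace(org_id, trace_id)
    if not trace:
        # Trace #2 hit this branch: the fetch returned an empty body,
        # so neither database was updated.
        return {"updated": False}
    # Trace #1 took this path: both backends were updated.
    update_postgres(trace_id, persistent=True)
    update_elasticsearch(trace_id, persistent=True)
    return {"updated": True}
```

Seen as spans in a trace, the first path shows three requests (fetch plus two updates), while the second shows only the fetch, which is what made the visual comparison so immediate.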

Without a product like Helios, I would have had to invest a lot of time to understand what changed between 10:04 and 10:09. With Helios, I can look at these two visualizations and piece together in seconds what happened, make the appropriate fix in my code, and even generate a test for this scenario to ensure the problem doesn’t happen again. All in all, from when I started troubleshooting to when I went back into the code, fixed the bug, and released the fix – it was a matter of minutes.

How I made sure it wouldn't happen again

As mentioned above, Helios provides the ability to create E2E tests from specific flows in your system. The tests can be modified and parameterized to run on every environment – whether it’s a local environment, staging, or even in production. In this case, I took the first call (trace #1) which was successful and created a test out of it. The test consists of various validations, one of which ensures that the two final requests (which update the Postgres and Elasticsearch databases) actually happen. If we were to run the test on the second call (trace #2) it would fail.
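Conceptually, the key assertion in such a test looks something like this sketch (a pytest-style check with a hypothetical span log standing in for the trace data Helios captures; it is not the actual test Helios generates):

```python
def run_persist(trace_exists, span_log):
    """Simulates the 'persist' flow, recording which backend calls ran
    (a stand-in for the real spans captured from the instrumented service)."""
    if trace_exists:
        span_log.append("postgres.update")
        span_log.append("elasticsearch.update")

def test_persist_updates_both_databases():
    spans = []
    run_persist(trace_exists=True, span_log=spans)
    # The essential validation: both final update requests actually happened.
    assert "postgres.update" in spans
    assert "elasticsearch.update" in spans

test_persist_updates_both_databases()
```

Run against the failing flow (trace #2), the same assertions would fail, which is exactly what makes the test a useful regression guard.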


Once the root cause is identified, it's easy to build a test in Helios to make sure it doesn't occur again

Final thoughts

No code is immune to bugs, least of all in production. If you have a tool like Helios in place that helps you discover issues faster and more efficiently, you can save time and cost while delivering quality features to your users. Perhaps more importantly, you can build tests from your flows to make sure issues don't repeat themselves.


About Helios

Helios empowers developers to deliver production-ready code by reducing cloud-native development friction. By making OpenTelemetry data actionable, Helios provides a new type of developer suite for understanding, troubleshooting, and testing distributed applications. With Helios, developers get the power to deliver code faster and with more confidence.
