How fast can you discover problems in your code so that you’re delivering features quickly and confidently?
All developers know that moment when, after we’ve reviewed everything and completed our sanity tests, our code somehow doesn’t work in production. The question is, how quickly and easily can we discover the problem so that we can continue delivering features faster and more confidently to our users?
I found myself in such a situation recently where I discovered my code didn’t work in production. I had developed a data retention feature in our product, Helios. In a nutshell, we retain each customer’s data (traces) for a certain number of days, according to their plan. In some cases, we identify traces that should never be deleted and mark them as persistent. I created a new endpoint in the Helios app called persist, which updates data in two different databases: the first is a Postgres database that keeps the traces’ metadata, and the second is an Elasticsearch cluster that holds the raw data itself. Both are affected by the data retention logic.
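To make the dual-write concrete, here is a minimal sketch of that flow. Everything in it is an assumption for illustration: the function name, the plain dicts standing in for the two stores, and the key shape are mine, not Helios code; the real implementation talks to Postgres and Elasticsearch.

```python
# Hypothetical sketch of the "persist" dual-write described above.
# The dict arguments are stand-ins for the two real data stores.

def mark_trace_persistent(org_id, trace_id, metadata_db, raw_data_store):
    """Mark a trace as persistent in both stores."""
    key = (org_id, trace_id)
    # First write: the Postgres table that keeps the traces' metadata.
    metadata_db[key] = {"persistent": True}
    # Second write: the Elasticsearch index that holds the raw trace data.
    raw_data_store[key] = {"persistent": True}
    return {"trace_id": trace_id, "persistent": True}
```

Whatever the concrete clients look like, the important property is that one endpoint call produces two visible downstream writes, which is exactly what a trace of the call should show.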
I released the feature after testing it locally and ensuring that it worked in production as expected, but when I showed it live to the team during our sprint demo, only the first trace was marked properly as persistent, while the second trace wasn’t.
What went wrong
Like my colleague Lior, who dogfoods Helios all the time (that is, uses our product to visualize and debug flows in our own product), I went straight to Helios to see what went wrong. I looked up the persist API endpoint and quickly identified the two calls from the live demo. Each call is itself a trace that can be investigated in Helios.
I opened trace #1 to see what had worked, then opened trace #2 to see what had failed. I immediately saw that in the second call, the queries to Postgres and Elasticsearch never happened.
This led me back to the code, where I saw that a validation runs right before the updates to both databases, and that this validation doesn’t pass in all cases. That also explains why the sanity tests passed after deployment but the feature failed during the sprint demo.
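In code, the failure mode had roughly this shape. This is a reconstruction, not the actual Helios source: `fetch_trace`, the dict stand-ins, and the return values are all assumptions, kept only to show why the bug was silent.

```python
# Reconstruction of the failure mode -- not the actual Helios source.

def persist_trace(org_id, trace_id, fetch_trace, metadata_db, raw_data_store):
    trace = fetch_trace(org_id, trace_id)
    # The validation that didn't pass in all cases: when the lookup
    # returns nothing, both database updates are skipped -- silently.
    if not trace:
        return {"updated": False}
    metadata_db[(org_id, trace_id)] = {"persistent": True}     # Postgres metadata
    raw_data_store[(org_id, trace_id)] = {"persistent": True}  # Elasticsearch raw data
    return {"updated": True}
```

The guard sits in front of both writes, so a single empty lookup makes the entire call a no-op without raising any error, which is exactly what the trace of the second call showed.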
How I fixed it
Once I got to the bottom of this issue, I proceeded to change the way we fetch the trace for this new endpoint.
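The post doesn’t include the actual patch, so the snippet below is purely illustrative. Assuming the root cause was a trace lookup that came back empty, one plausible shape of such a fix is to surface the not-found case explicitly instead of silently skipping the writes; every name here is hypothetical.

```python
# Illustrative only -- not the real fix from the post. Assumes the empty
# lookup was the root cause and makes that case fail loudly.

def persist_trace(org_id, trace_id, fetch_trace, metadata_db, raw_data_store):
    trace = fetch_trace(org_id, trace_id)
    if not trace:
        # Fail loudly instead of quietly returning with nothing updated.
        raise LookupError(
            f"no trace {trace_id!r} found for organization {org_id!r}"
        )
    metadata_db[(org_id, trace_id)] = {"persistent": True}
    raw_data_store[(org_id, trace_id)] = {"persistent": True}
    return {"trace_id": trace_id, "persistent": True}
```

The design point is general: a validation that guards destructive or important writes should make its failures observable, so the next demo fails with an error message rather than a shrug.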
The point is not the fix itself but how fast I arrived at the root cause. In no other microservices architecture I’ve worked with did I have a tool that could explain what happened in a single call with such ease and simplicity. Looking at the screenshots above, it’s obvious that at 10:04 the call to mark the trace as persistent worked, and at 10:09 it didn’t.
In the first call (trace #1), the request to fetch the trace returned with a response body, and two subsequent requests were made to update the Postgres and Elasticsearch databases. In the second call (trace #2), the response body returned empty, meaning no trace was found for the combination of organization and trace ID, so no updates were made to the databases.
Without a product like Helios, I would have had to invest a lot of time to understand what changed between 10:04 and 10:09. With Helios, I can look at these two visualizations and piece together in seconds what happened, make the appropriate fix in my code, and even generate a test for this scenario to ensure the problem doesn’t happen again. All in all, from when I started troubleshooting to when I went back into the code, fixed the bug, and released the fix – it was a matter of minutes.
How I made sure it wouldn’t happen again
As mentioned above, Helios provides the ability to create E2E tests from specific flows in your system. The tests can be modified and parameterized to run in any environment – whether local, staging, or even production. In this case, I took the first call (trace #1), which was successful, and created a test out of it. The test consists of various validations, one of which ensures that the two final requests (which update the Postgres and Elasticsearch databases) actually happen. If we were to run the test on the second call (trace #2), it would fail.
No code is immune to bugs, least of all in production. If you have a tool like Helios in place that helps you discover issues faster and more efficiently, you can save time and cost while delivering quality features to your users. Perhaps more importantly, you can build tests from your flows to make sure issues don’t repeat themselves.