Anatomy of Failure or: How to Deploy the Same Bug Four Times

Code Pray Deploy

Setting the scene, meeting the players

All frontend code produced by Mews is continuously integrated and delivered via automated build steps powered by custom Azure Pipelines workflows. The deployment process begins with a pull request merge into the master branch. The pipeline runs the usual build steps, code analysis, and tests, then emits source code prepared for delivery in the form of a folder stored in Azure Blob Storage with a unique, auto-generated, semantically versioned name. Since, at this point, the changes in the pull request have already been checked by the QA department, the content of such a folder is delivered right away via a CDN.

Folder structure

📁 4.98.2
📁 5.0.0
📁 5.0.1
🗎 manifest.json

A simple .json file keeps track of the latest version of the app that the server is supposed to serve to users.
manifest.json

{
  "latest": "5.0.1"
}
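
To make this concrete, here is a minimal sketch of how a server might resolve the manifest into a build folder to serve. This is an illustration under assumptions, not Mews’ actual code; the storage URL and function names are made up:

// Hypothetical sketch: turn manifest.json into the folder to serve.
const STORAGE_URL = "https://example.blob.core.windows.net/frontend"; // made-up URL

interface Manifest {
  latest: string; // e.g. "5.0.1", the build folder to serve
}

const resolveEntryPoint = async (): Promise<string> => {
  // Global fetch, as in browsers and Node 18+.
  const response = await fetch(`${STORAGE_URL}/manifest.json`);
  const manifest = (await response.json()) as Manifest;
  // Each deployed build lives in a folder named after its semantic version.
  return `${STORAGE_URL}/${manifest.latest}/index.html`;
};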

Such a setup allows Mews to deploy new features, bug-fixes, and code improvements for multiple projects, multiple times a day. Sound good? Do you wanna know more? We have a whole talk about this.
With the scene laid out, it’s time to meet the important players:

  • Albert is the code owner of the deployment pipeline, but also quite busy with management duties.
  • Bernard is the code owner of code shared across multiple products and also a contributor to the pipeline setup.
  • Cecil is the code owner of one product impacted by a bug and responsible for that product’s health and well-being.

Albert, Bernard and/or Cecil may or may not be in this picture

Deploy 1 – An honest mistake

As usual, it all begins with good intentions. Bernard’s team, wanting to improve code quality and prevent bugs, rewrites a simple shared component from JavaScript to TypeScript. The code is written, approved by team members, and verified by the QA team. Unfortunately, one simple error slips in:
ListItem.jsx

const handleClick = event => {
  if (this.props.onClick) {
    this.props.onClick(event, this.props.value);
  }
}

ListItem.tsx

const handleClick = (event: React.MouseEvent) => {
  // The newly added truthiness check on `this.props.value` is the mistake:
  // it silently swallows clicks on items that have no value.
  if (this.props.onClick && this.props.value) {
    this.props.onClick(event, this.props.value);
  }
}

This change causes ListItems without an assigned value to no longer fire their onClick handler. That’s fine for normal ListItems, where a value is always present, but ExpandableListItems do not have a value of their own. Instead, their onClick function shows or hides additional ListItems.

Working ExpandableListItem

This mistake is not caught by automated tests, nor by QA team members, and is deployed to the production environment. And the bug reports start to pour in.
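
In hindsight, even a tiny regression test over the shared component would have flagged the change before it shipped. Below is a hedged sketch assuming React Testing Library and Jest; MinimalListItem is our stand-in that reproduces the rewritten guard, since the real ExpandableListItem lives in Mews’ shared codebase:

import * as React from "react";
import { render, fireEvent } from "@testing-library/react";

type Props = {
  label: string;
  value?: string;
  onClick?: (event: React.MouseEvent, value?: string) => void;
};

// Stand-in that reproduces the buggy guard from the TypeScript rewrite.
const MinimalListItem = ({ label, value, onClick }: Props) => (
  <div
    onClick={event => {
      if (onClick && value) {
        onClick(event, value);
      }
    }}
  >
    {label}
  </div>
);

it("fires onClick even when no value is assigned", () => {
  const onClick = jest.fn();
  const { getByText } = render(<MinimalListItem label="Show more" onClick={onClick} />);
  fireEvent.click(getByText("Show more"));
  // Fails with the rewrite: the added `value` truthiness check swallows the click.
  expect(onClick).toHaveBeenCalled();
});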
Cecil’s immediate reaction is to roll back the application to the last version known to work; there will be time for investigation once the impact on users has been reduced. So he goes to the manifest.json file, but instead of seeing:

{
  "latest": "5.84.0"
}

he sees:

{
  "latest": "5.84.0",
  "current": "5.84.0"
}

Wait, what? What’s current? But time is of the essence. Cecil changes latest to 5.83.5, the last known working version, as he would under normal conditions, but minutes tick by and the change has no effect. Cecil therefore changes current to the same value: 5.83.5. Finally, the working version of the application is delivered to users and the bug investigation can begin. Cecil can relax for a bit until the fixed version is ready to deploy.

Deploy 2 – Timing is everything

But the peace and quiet doesn’t last long for Cecil. While he was busy lowering the impact of the deployed bug, Bernard and his team merged a new commit to the application source code, triggering a new build in the deployment pipeline. The problem is that Cecil, while rolling back the application to the bug-free version, didn’t have time to stop the deployment pipeline from emitting new builds. A few short minutes after the buggy version of the app was removed from production, the build server creates a new application version, 5.83.6, from the latest source code. Source code that, you guessed it, still contains the same ListItem bug. The version is deployed to customers automatically, ListItems stop working properly again, and bug reports start to pour in.
"But how can this happen?" I hear you cry. "There was already a version 5.84.0! How does the server create 5.83.6 now!?". You see, that’s why Albert and Bernard added the current property to the manifest. It was meant to keep track of the latest version available through semantic version numbering. Unfortunately, while doing it, they made the naming of current and latest very confusing. Cecil’s edits to manifest.json, and the unfortunate timing of the build pipeline, broke the semantic ordering of application versions, making it possible to emit an application version with a lower version number than its predecessor.
This time, Cecil is much quicker to roll back, and he also realizes that he is not supposed to touch latest anymore. The deployment pipeline is shut down until Albert, Bernard, and Cecil can figure out what went wrong. In the meantime, customers are served an older, bug-free version of the app.

Deploy 3 – 5.84.0 strikes back

The day after the initial deploy, all seems good. Albert and Bernard have fixed the manifest.json formatting to CLEARLY_STATE_WHAT_TO_DO! They also fixed the naming of latest and current, making latest once again the only field a human is supposed to touch. The deployment pipeline is open again, Cecil’s team has deployed several patch versions of the application, including a fix for the original issue, and everything seems to be working nicely again.

{
  "__CHANGE_ONLY_LATEST_NOTHING_ELSE__": null,
  "latest": "5.83.13",
  "current": "5.83.13"
}

The tranquility persists until, a few moments later, Bernard tries to deploy a new minor version of the app, bumping the version number from 5.83.13 to 5.84.0… the same version that the server emitted the day before.
We don’t need to go into details here. Suffice to say, the build server silently reuses the already existing version 5.84.0, bug and all, instead of emitting a new build from the fixed source code. Bug reports pour in, and Cecil chews through a pencil while rolling back the app for the third time in 24 hours.

Deploy 4 – The cache never forgets

It’s a few days after the initial deploy. Lessons have been learned and issues have been fixed. The app is switched to a clean and 100% non-existent version, 5.83.100. All other versions are removed from storage to prevent further numbering conflicts. The build server is fixed so that it never silently reuses existing versions again. Multiple patch versions of the product are deployed without incident. Bernard and Cecil are relaxed and back to their normal routines. Until, that is, another commit causes a minor version increase.
This time, version 5.84.0 is not present in storage, so the build server emits a new, flawless app version with that number. Unfortunately, the browsers of users already received an app version with this number a few days ago and, unless their caches were cleared, that version is still there, still has the same bug, and is used instead of a costly download of the “same” version from the CDN. Deploy #4 completes by delivering a good app version under an unfortunate version number.
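Why do browsers cling to the old build? Folders named after a version are typically treated as immutable and cached for a very long time. Here is a hedged, Express-style sketch of the kind of header that causes this behavior; the real CDN configuration is Mews’ own and may differ:

import express from "express";

const app = express();

// Files under a version folder "never change", so they may be cached
// essentially forever. A browser that cached the buggy /5.84.0/* assets
// once will keep using them, even after a different build is published
// under the very same version number.
app.use("/:version/", (_req, res, next) => {
  res.setHeader("Cache-Control", "public, max-age=31536000, immutable");
  next();
});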

Cecil does the fourth and final rollback, musing on the inevitable heat death of the universe and the sweet, sweet comforts of dark nothingness.

The dust settles

There are certainly lessons we can take away from all this.

The bad news

  • four freaking times!

  • avoidable mistakes with testing and documentation

  • more complete test coverage would have prevented all these issues

  • if you use semantic versioning, make sure it cannot be broken and that your build environment can never emit two different builds with the same version number (see the sketch after this list)
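
As an illustration of that last point, here is a minimal sketch of a guard a pipeline could run before emitting a build. It is a sketch under assumptions, not Mews’ pipeline code; how the list of versions already in storage is obtained is left abstract:

// Refuse to emit a build whose version does not strictly supersede
// every version already in storage.
const assertVersionIsNew = (candidate: string, existingVersions: string[]): void => {
  const parse = (version: string) => version.split(".").map(Number);

  const isGreater = (a: number[], b: number[]): boolean => {
    for (let i = 0; i < 3; i++) {
      if (a[i] !== b[i]) {
        return a[i] > b[i];
      }
    }
    return false; // an equal version is not new either
  };

  for (const existing of existingVersions) {
    if (!isGreater(parse(candidate), parse(existing))) {
      throw new Error(`Refusing to emit ${candidate}: it does not supersede ${existing}.`);
    }
  }
};

// assertVersionIsNew("5.83.6", ["5.83.5", "5.84.0"]); // throws: ordering is broken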

The good news

  • different causes at least mean we can learn from our mistakes

  • decent reaction times: issues with all deploys were resolved within ten minutes, minimizing the impact on customers

  • rollback works well!

  • swift identification of root causes, with minimal impact on overall team velocity and on feature and bug-fix delivery

Expect the unexpected: don’t trust anyone, especially not robots and automatons!

Hopefully this story can help you avoid at least some of our mistakes. Do you have similar experiences under your belt? What’s the greatest number of times you’ve had to fight the same bug? Let’s share in the comments, and may your deploys be bug-free and your feature-to-bug ratio favorable.
