Why we do incident drills and why you should too

SRE at Mews; apart from incidents and observability, also really into fantasy books

Intro

At Mews, incident handling is integral to maintaining a robust and reliable service for our clients. Effective incident management is crucial as we scale and the potential impact of incidents widens. Severe bugs and issues will inevitably happen in software development; there isn’t a service provider anywhere in the world for whom this is not the case. To a certain degree, they can be prevented by good practices and architecture, and we consistently attempt to raise the bar on our approach at Mews to ensure our clients have the smoothest experience possible. Every once in a while, though, there will be an out-of-luck, “this would never happen” event that nobody scheduled. Prevention is good, but you don’t want to be caught off guard and forced to improvise when something does happen.

Although our incident process at Mews is already mature and continually evolving, we firmly believe that deliberate practice is essential to reacting quickly and overcoming such critical events with ease. This post is about how purposeful exercises and incident drills have significantly contributed to our improvement.

The lessons

When you are building an app and growing as a business, there will surely come a time when it’s no longer scalable to pick up the phone and call the one developer who might be able to fix a problem. At this point, it makes sense to establish a process to learn from recurring issues and prevent them.

Past challenges of incident management at Mews

In the past, incident management at Mews worked well for our scale, though there were numerous opportunities for improvement. The process for handling incidents was largely manual (imagine a 19-point checklist) with a lengthy postmortem process that required addressing every identified root cause, no matter how impractical the solution. When you need to address everything that could have prevented the incident or improved detection, the list of remediation tasks grows rapidly, and many of them are neither practical nor necessary to resolve. Our process was also quite rigid — introducing a change required a full Request for Comments (RFC), discouraging the small incremental changes that would keep it up to date with the needs of its actual users (the engineers handling the incidents).

Lack of training

The second issue we found was the insufficient preparation of incident responders. If you were a team manager, you could automatically be called in the middle of the night to deal with an incident, even if you had never handled or supported the code causing the issue. It’s unpleasant for anyone to have this kind of responsibility; some people can wing it, but it’s incredibly stressful for others. This naturally led to many deviations from the process, with people carving out their own desire paths. Incident responders would do their best to restore normal service but often skipped steps due to lack of time, permissions, or clarity, creating confusion and unnecessary stress.

Establishing a new process

With these two important lessons in mind, we wanted to get it right and develop an easy-to-follow procedure that wouldn’t bury responders under so many unnecessary steps and so much paperwork that they forget they even have a production problem to fix. Instead, it should make their lives easier and guide them towards a resolution, offering prompts for necessary human input such as status updates, while automating tasks that don’t require human interaction, like creating channels, cross-posting information, or adding relevant people. Rather than building a homegrown solution, we opted for incident.io to help us establish good practices and do the heavy lifting.
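
To make that automation concrete, here is a minimal sketch of the kind of glue such tooling runs for you when an incident is opened, written against Slack’s Web API using the slack_sdk Python package. This is not how incident.io is implemented; the channel naming scheme, the responder ID, and the message text are all hypothetical.

```python
# Illustrative only: incident.io does this for us out of the box, but this is
# roughly the glue you would otherwise write yourself against Slack's Web API.
# The channel naming scheme, responder ID, and message text are hypothetical.
import os

from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

def open_incident_channel(incident_id: str, summary: str) -> str:
    """Create a dedicated channel, pull in responders, and post the summary."""
    # Create the incident channel (e.g. #inc-2024-042).
    response = client.conversations_create(name=f"inc-{incident_id}")
    channel_id = response["channel"]["id"]

    # Invite the on-call responders; in a real setup these IDs would come
    # from your paging tool rather than being hardcoded.
    client.conversations_invite(channel=channel_id, users="U012ABCDEF")

    # Cross-post the initial status so nobody has to ask "what's going on?".
    client.chat_postMessage(channel=channel_id, text=f":rotating_light: {summary}")
    return channel_id
```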

The new process shouldn’t just be thrown at people; they should have multiple options to improve their readiness. We introduced incident drills to cater to those who prefer interactive training. It can be difficult to convince some people to participate (“Don’t we have enough live incidents?”), but others are relieved to have someone to answer their questions. While we recognize that everyone has different learning preferences, whether through documentation, videos, or observing real incidents, we also believe there is great value in encouraging hands-on experimentation and addressing specific team needs and concerns.

Initially, we focused on familiarizing ourselves with the new tools and processes, but other approaches can be utilized as the response process matures.

Incident response basics drill

We want all responders to be familiar with the tools and processes at their disposal (so they don’t need to search for documentation or reinvent solutions under pressure, especially in the middle of the night).

During these drills, nothing is actually broken. An experienced incident responder arranges a meeting and guides the team through the incident lifecycle, explaining what the tools can do, what happens automatically, and where input is needed. We highlight the importance of sharing good internal and external updates. A playful scenario can help people relax and engage more than a typical “Incident drill” or “NullReferenceException”. Urgent situations featuring a cake or adorable animals have worked well for us, and ChatGPT is incredibly useful for brainstorming new topics!

There should be enough time for discussion, allowing people to try the steps themselves instead of watching someone else, and to explore the reasons behind each action. This helps identify anything missing or unclear and provides an opportunity to discuss improvements and needs.

Shadowing

Would you feel confident dealing with an incident on your own? It can be immensely helpful to watch how others handle them. When shadowing incident responders during an active incident, try to determine if you could perform the same steps yourself — and even give it a try! You might discover you lack access to some system, which is not something you want to find out in the middle of the night.

Afterwards, talk to the responders and ask them to walk you through what they did so you can learn.

Next time, try stepping up and handling the incident yourself, with a more experienced responder to help you. This approach is valuable no matter how experienced you are — remember, you are not alone. Involve the rest of your team when you need help. It can be overwhelming when many things are happening simultaneously; having an extra pair of eyes (or an accomplice for a tricky production fix) can make a huge difference.

Team-specific runbooks

Knowing the processes is nice, but what about actually fixing the problem? Do you know the typical remediation steps your team takes, like restarting a service, performing a failover, rolling back a deployment, or recreating a corrupted cloud resource? These actions can seem trivial, but imagine being woken up at night — could you remember all the manual steps? Wouldn’t you prefer to follow a checklist or a straightforward tutorial? Or even better, press a button and let automation handle the steps?

It can be very beneficial to sit down with the whole team, identify the common actions, and write them down. If a task is frequent, consider automating it; the automation can even be triggered before an incident is raised.
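
As a minimal sketch of a runbook reduced to a script, assuming a Kubernetes deployment managed with the kubectl CLI (the deployment and namespace names are hypothetical), a rollback becomes a single reviewed command instead of steps remembered at 3 a.m.:

```python
# Minimal runbook-as-a-script sketch. Assumes kubectl is installed and
# authenticated; "payments-api" and "production" are hypothetical names.
import subprocess

def rollback_deployment(deployment: str = "payments-api",
                        namespace: str = "production") -> None:
    """Roll back a deployment to its previous revision and wait for it."""
    # Undo the most recent rollout.
    subprocess.run(
        ["kubectl", "rollout", "undo", f"deployment/{deployment}",
         "-n", namespace],
        check=True,
    )
    # Block until the rollback has fully rolled out (or fail loudly).
    subprocess.run(
        ["kubectl", "rollout", "status", f"deployment/{deployment}",
         "-n", namespace, "--timeout=5m"],
        check=True,
    )

if __name__ == "__main__":
    rollback_deployment()
```

The same shape works for restarts or failovers; the point is that the checklist lives in version control rather than in someone’s memory.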

Chaos engineering

There is another approach we have yet to try. In a nutshell, you deliberately break something in a nonproduction environment (or, if you dare, a production one) and observe the following; a minimal sketch of such an experiment follows the list:

  • What is the time to detect?
    How long does it take to actually notice the problem? Is there sufficient observability and alerting?
  • How does the response team react?
    This can be quite controversial. We don’t want this to be perceived as “trying to catch some slackers”, but rather to identify where there is confusion or an ineffective process.
  • Can the system recover on its own?
    Ideally, your system degrades gracefully or recovers completely (for example, you change the configuration manually, but it gets automatically reconciled).
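
For the first question, a tiny harness can put a number on time to detect. This is only a sketch under stated assumptions: inject_fault() is a placeholder for whatever you choose to break, and detection is taken as the moment the first alert reaches on-call.

```python
# Minimal chaos-drill harness sketch. Everything here is an assumption:
# inject_fault() stands in for the real experiment (killing a pod, blocking
# a port, ...), and "time to detect" is measured until the drill runner
# confirms that the first alert reached the on-call responder.
import time

def inject_fault() -> None:
    """Hypothetical fault injection; replace with your real experiment."""
    print("Fault injected into the staging environment.")

def run_experiment() -> float:
    started = time.monotonic()
    inject_fault()
    # Have the person running the drill confirm when alerting fired.
    input("Press Enter as soon as the first alert reaches on-call... ")
    return time.monotonic() - started

if __name__ == "__main__":
    delay = run_experiment()
    print(f"Time to detect: {delay:.0f} seconds")
```

Tracking that number across drills shows whether your observability is actually improving.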

Conclusion

Incident drills are essential for preparing teams to handle unexpected issues effectively. By establishing an easy-to-follow process, conducting basic incident response drills, shadowing experienced responders, and creating team-specific runbooks, teams can better equip themselves to manage incidents confidently and efficiently. Successfully resolving an incident is a reason to celebrate and appreciate the people who made it happen! Do you regularly refresh your incident processes? Let us know what works best for you!
