When scaling our tech organization, the biggest paradigm shift happened at around 20 developers. Prior to that, we had a single tech team that took care of everything. The only distinctions within this team were by function: backend developers, frontend developers, etc. Different functions had distinct methods of logging, error reporting, and performance monitoring. Some tools were shared, some were specialized. However, managing this stack was rather simple due to its finite and fixed nature. With the transformation to multiple cross-functional teams (currently 13), it's no longer just about building the "machine" (the product) but about building a "machine" that builds "machines" (building teams that build products).
One big part of this higher-order "machine" is the incident management framework. For me, it is one of the indicators of a technical organization's maturity, and asking about it is one of my favorite interview questions. By incidents, I mean all unhandled exceptions, errors, outages, performance issues, security issues, and whatever else comes to mind, not only directly within the product but in the surrounding tooling as well. Like many other companies, we faced several typical problems:
- Not capturing all incidents, being blind to them and waiting for customers to report them.
- Capturing the incidents but ignoring them for various reasons.
- Not acting on the incidents in a timely manner according to their severity.
- Generating lots of false positives.
- Having multiple tools to look for incidents.
- Not having good enough data and insight into how big of a problem an incident actually was.
In this article, I’ll describe how we designed our new incident management process, addressed all those problems, and implemented the technical solution.
When designing the new workflow, we defined several requirements and constraints based on our previous experience and the issues we'd been dealing with. Our first aim was for incident delivery to be extremely targeted, meaning that issues are delivered to the group of people who are most likely responsible for them and have the power and knowledge to resolve them. If this didn't happen, the bystander effect would kick in. Therefore, each incident needed to specify the owning team and the function within that team. That narrowed down the recipients from the whole organization to a few people.
The second significant requirement for us was actionability and traceability. In simple terms, we didn't want the incidents to just be entries in a logging platform, Slack notifications, or emails, because there are no guarantees someone would look at them or act on them. With such channels, you also cannot easily report on the number of incidents, their resolution time, or whether some of them have been unresolved for longer than appropriate.
Being a tech lead of a team is one of the most difficult roles, so we wanted to make the incident workflow very user-friendly for them. In the end, the tech leads are accountable for incident resolution within their teams. And since our teams are cross-functional, we didn't want the tech lead to have to go through a lot of different tools just to get an overall picture of what's going on in their team.
In order to ensure actionability and traceability, we decided early on that all incidents have to be routed to and stored in our issue tracking system, YouTrack. Most of the incidents are problems that haven't been reported by an end user yet, so it makes no sense to keep them segregated from reported errors. Also, we have quite powerful workflows, alerting, and reporting built on top of YouTrack, so we wanted to reuse that. By keeping track of incidents there, we get more precise data about what the teams spend time on. Moreover, nothing gets lost in spam or overlooked on Slack, and developers have a single place to look for their work items. Routing all your incidents into issue tracking software might be a scary thing in the beginning, especially if you have a lot of them, so we recommend a soft rollout, giving teams some time to prepare for the inflow.
Sending all incidents directly to the issue tracking software, however, wouldn't be smart, since it has no rate limits, for example. That's why we employ Sentry as a "sink" for all incidents that stands in front of YouTrack and protects it from flooding. We're mainly taking advantage of its deduplication, rate limiting, and alerting rules. It doesn't make sense to report every single occurrence of the same error, nor does it make sense to report the first occurrence of a warning. We control this with alert rules based on incident type (error, warning, fatal), environment (production, staging, development), number of events, and other criteria.
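To make the idea of such alert rules more concrete, here is a minimal sketch in Python. It is not Sentry's rule engine or our actual configuration: the `IncidentGroup` type, the `should_create_issue` function, and all the threshold values are hypothetical and only illustrate combining incident type, environment, and event count into one decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncidentGroup:
    """A deduplicated group of events (hypothetical shape)."""
    level: str        # "warning", "error" or "fatal"
    environment: str  # "production", "staging" or "development"
    event_count: int  # occurrences seen so far in this group


# Hypothetical thresholds: how many occurrences a group needs before it is
# worth an issue. Combinations that are not listed never create an issue.
THRESHOLDS = {
    ("fatal", "production"): 1,
    ("error", "production"): 1,
    ("warning", "production"): 50,
    ("fatal", "staging"): 1,
    ("error", "staging"): 10,
}


def should_create_issue(group: IncidentGroup) -> bool:
    """Return True once the group has reached the threshold for its type."""
    threshold = THRESHOLDS.get((group.level, group.environment))
    if threshold is None:
        return False
    return group.event_count >= threshold
```

With these assumed numbers, a single production error is forwarded right away, a production warning only after it has repeated 50 times, and nothing from development is forwarded at all.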
For each combination of team, function, and, optionally, product, there is a Sentry project. You can easily choose which projects you're interested in, which allows slicing by those dimensions and creating "views" for the team lead (selecting all projects involving the team) or function lead (selecting all projects including, e.g., frontend). The reason for such granularity is that we also track deployments and product versions there to spot when a deployment causes a surge of errors. Sentry allows this only on a per-project basis, which forces us to segregate the projects, even though this introduces some operational overhead.
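The slicing described above can be sketched as follows. The `Project` type, the slug format, the `view` helper, and the team and product names are all assumptions made for illustration; they do not reflect our real project names or any Sentry API.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Project:
    """One project per team, function and (optionally) product."""
    team: str
    function: str
    product: Optional[str] = None

    @property
    def slug(self) -> str:
        # Hypothetical naming scheme: team-function[-product]
        parts = [self.team, self.function]
        if self.product:
            parts.append(self.product)
        return "-".join(parts)


def view(projects: Iterable[Project],
         team: Optional[str] = None,
         function: Optional[str] = None) -> List[str]:
    """Select project slugs for a team lead, a function lead, or both."""
    return [
        p.slug
        for p in projects
        if (team is None or p.team == team)
        and (function is None or p.function == function)
    ]


# Example with made-up teams and products.
PROJECTS = [
    Project("payments", "backend"),
    Project("payments", "frontend", "webapp"),
    Project("search", "frontend", "webapp"),
]
team_lead_view = view(PROJECTS, team="payments")
function_lead_view = view(PROJECTS, function="frontend")
```

Here `team_lead_view` contains both payments projects, while `function_lead_view` contains the two frontend projects across teams.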
There are two main sources of incidents: our applications and our other tools and systems. Collecting incidents from our applications is rather straightforward. Sentry provides SDKs for most platforms. The work on our end resides in determining the team, function, and product of an incident. Function and product are simple since it's clear in which app the incident originated and which function is responsible for it.
The team is trickier, especially when multiple teams maintain a single application. We have a monolithic backend application with most of the teams contributing, so this is something we definitely had to address :) Our solution was to annotate all transactions in the system with the owning team on the code level. So, whenever code is being executed, it's clear which is the owning team. The owning team can be defined on larger scopes, e.g., modules or whole APIs, which helps when rolling out the initial distribution of owners. The teams then continually fiddle with the assignments over time, so that the ownership converges to a stable state. Until a team gets split again, of course.
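As an illustration of code-level ownership annotation, here is a minimal sketch in Python. Python is used purely for illustration and is not the language of our backend; the `owned_by` decorator, the `report_incident` helper, and the team names are hypothetical. The idea is that the innermost annotated scope determines the owner attached to any incident raised while it runs.

```python
import contextvars
import functools

# Owner of the code that is currently executing; "unowned" is a fallback.
_current_owner = contextvars.ContextVar("current_owner", default="unowned")


def owned_by(team):
    """Mark a function as owned by `team` for the duration of each call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            token = _current_owner.set(team)
            try:
                return func(*args, **kwargs)
            finally:
                # Restore the outer owner once the call finishes.
                _current_owner.reset(token)
        return wrapper
    return decorator


def report_incident(message):
    """Stand-in for an error reporter: tags the incident with the owner."""
    return {"message": message, "team": _current_owner.get()}


@owned_by("payments")
def charge_card(amount):
    if amount <= 0:
        return report_incident("invalid amount")
    return None


@owned_by("reservations")
def create_booking(amount):
    # A larger scope owned by one team can call into code owned by another;
    # the incident is attributed to the innermost owner.
    return charge_card(amount)
```

Calling `create_booking(0)` yields an incident tagged with the payments team, because that is the owner of the code that was executing when the incident was reported.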
Detour 1: our monolithic applications are actually pretty modular with clear ownership of every single line of executable code. We try to adopt the best aspects of both approaches: the simplicity and performance of monoliths with the clear ownership and isolation of microservices. The transactional ownership is used not only for incidents, but also for performance ownership or larger refactoring initiatives when we're asking, "Who should refactor this endpoint?"
The second category of incident producers consists of other tools and systems like New Relic, Azure Monitor, Azure DevOps, Rapid7 InsightOps, etc. Such tools usually do not support annotating alerts with custom metadata, so we use naming conventions to distinguish the team and other attributes. And since these tools usually don't integrate directly with Sentry, we route their webhooks through a unified collection "middleware" built inside Zapier, which we use as a simple FaaS. There we transform all the proprietary webhook formats into a unified one, extract the team, function, and product from the alert name, and route the incident to Sentry.
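The extraction step could look roughly like the sketch below. The naming convention (`team.function[.product]: title`), the `normalize` function, and the shape of the unified incident are assumptions for illustration only; the real convention and the payloads of the individual tools differ.

```python
import re
from typing import Optional

# Hypothetical convention: "team.function[.product]: human readable title"
ALERT_NAME = re.compile(
    r"^(?P<team>[a-z0-9]+)\.(?P<function>[a-z0-9]+)"
    r"(?:\.(?P<product>[a-z0-9]+))?:\s*(?P<title>.+)$"
)


def normalize(source: str, alert_name: str, severity: str) -> Optional[dict]:
    """Turn a tool-specific alert into one unified incident dictionary.

    Returns None when the alert name does not follow the convention, so the
    caller can decide how to handle unroutable alerts.
    """
    match = ALERT_NAME.match(alert_name)
    if match is None:
        return None
    # groupdict() yields team, function, product (None if absent) and title.
    return {"source": source, "severity": severity, **match.groupdict()}


incident = normalize("newrelic", "payments.backend.api: High error rate", "error")
```

Each tool then only needs a small adapter that pulls the alert name and severity out of its own webhook format before calling the shared normalization.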
The last, easiest, most visible (and for some people the most annoying) part is the notification system. When an issue is raised or escalated in our issue tracking software, a workflow picks it up and sends a webhook to Zapier. There, based on various criteria, we distribute notifications to different channels. Always to a Slack issue channel of the owning team, sometimes to a more general channel, and, for the most severe incidents, into PagerDuty.
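The distribution logic can be pictured with a small sketch like the one below. The channel names, the severity levels, and the exact criteria are hypothetical; the only part taken from our setup is the general shape: the owning team's Slack channel always, a more general channel sometimes, and PagerDuty for the most severe incidents.

```python
from typing import List


def notification_channels(team: str, severity: str) -> List[str]:
    """Pick notification targets for an issue (criteria are assumed)."""
    # The owning team's issue channel is always notified.
    channels = [f"slack:#{team}-issues"]
    # Assumed rule: errors and fatals also go to a general channel.
    if severity in ("error", "fatal"):
        channels.append("slack:#incidents")
    # Assumed rule: only fatal incidents page someone.
    if severity == "fatal":
        channels.append("pagerduty")
    return channels
```

Keeping this decision in one small function makes it easy to add further criteria later, such as escalation after a period without a response.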
We already have many ideas about what we can improve in each step of the workflow. From nicer issue descriptions and smarter and more granular alerting rules to better notifications with even more narrowed down recipients with escalation rules and resolution time SLOs. However, even without such perks, the "basic" incident workflow brought several benefits to our organization:
- We now have clear, consolidated, and complete data about our incidents allowing us to make data-informed decisions.
- The way we report errors or unhandled exceptions keeps improving: the teams are deduplicating incidents, muting false positives, and introducing new incidents that had previously been ignored.
- Thanks to a single source of truth, we can quickly spot if a team is struggling with an overflow of incidents and arrange for help.
- We have much better insight into the quality of our deployments by correlating incidents with them.
- The workflow is preconfigured for all teams and functions and provided as a service of the platform. Setting it up takes about 5 minutes when a new team is introduced or renamed.
- It's easily extensible by implementing new incident collectors for new tools in Zapier.
- No incident is "lost in translation" which results in a better quality product and faster resolution times, supporting our vision of achieving technical excellence.
Hopefully this overview gives you some inspiration on what to improve in your incident management process. And, vice versa, we'd like to hear what you are doing better or what we could improve. Or, if you'd like me to go into more depth on some part of the article, let me know that as well!