Adopting an incident management framework can transform your company, turning chaotic incidents into calm, effective responses. This reduces lost revenue and lost customer trust.
| Severity Level | Customer Impact | Notes |
|---|---|---|
| SEV-1 | Customer unable to sign in, core use cases blocked, data loss, or a security vulnerability around customer data | Drop everything, declare an incident |
| SEV-2 | Significant degradation of the customer experience, customer blocked from some specific use cases | Top priority for the affected service team, declare an incident |
| SEV-3 | Some latency, new production bug identified, UI bug not blocking core use cases | Response can wait until normal business hours |
| SEV-4 | Minor degradation, minor bug discovered in production | Not a priority |
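As a sketch, the severity policy in the table above could be encoded in alerting automation. This is a minimal illustration; the function and dictionary names are assumptions, not part of any real alerting product:

```python
# Illustrative sketch: encoding the severity table in alerting automation.
# All names here are hypothetical; adapt to your paging/ticketing tooling.

SEVERITY_POLICY = {
    "SEV-1": {"declare_incident": True,  "page_now": True},   # drop everything
    "SEV-2": {"declare_incident": True,  "page_now": True},   # affected team's top priority
    "SEV-3": {"declare_incident": False, "page_now": False},  # wait for business hours
    "SEV-4": {"declare_incident": False, "page_now": False},  # backlog, not a priority
}

def route_alert(severity: str) -> str:
    """Decide what an alert of a given severity should trigger."""
    policy = SEVERITY_POLICY[severity]
    if policy["page_now"]:
        return "page on-call and declare an incident"
    return "file a ticket for normal business hours"
```

Keeping the policy in one table-shaped structure makes it easy to keep the documentation and the automation in sync.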
For SEV-1 incidents, there are two main strategies for structuring the response team, along with a hybrid of the two.
A good place to start, especially for small and medium-sized engineering organizations, is to train all on-calls to assume the incident commander role during SEV-1 and SEV-2 incidents. An individual can, of course, page in a teammate who is more qualified to assume that role.
In this model there is no centralized incident response team; incident commanders exist on every team.
For companies with the resources, a dedicated team can be created to support service teams during SEV-1 incidents. The people on this team are the most practiced at incident command and at supporting roles such as scribe and liaison.
SEV-2 incidents would still be handled by individual service teams, but when an event's severity is raised to SEV-1, this centralized team takes over.
Individual teams should still be trained in incident command, but especially tricky outages and larger cross-organizational issues are escalated to this centralized team for help.
This team should be staffed to provide 24/7/365 support for SEV-1 incidents. They should not have other feature development responsibilities, but they can focus on long-term operational and engineering process improvements, such as facilitating a postmortem program.
Very large enterprises should consider a follow-the-sun model of support. This allows centralized incident commanders to work during, or close to, normal business hours in their own time zones.
A hybrid approach is to identify a virtual team whose members can be asked to join an incident and assume the incident commander and other roles. These ICs or managers are not dedicated to an incident management team; they have normal day-to-day jobs on traditional teams and are simply reachable through this virtual team.
Create a PagerDuty schedule or other type of roster and socialize the name of this virtual team. Document that its members will be paged in to assist during SEV-1 incidents.
| Role | Responsibilities |
|---|---|
| Incident Commander | The single individual organizing incident responders. They run the call, delegate tasks, and drive consensus. |
| Deputy | Incident Commander support role. Assists in tracking delegated tasks and can assume Incident Commander duties for relief. |
| Communications Liaison | Writes informative, audience-sensitive descriptions of the current situation and periodically sends updates to key stakeholders and customers. |
| Scribe | Captures timestamps when key facts are discovered and records hypotheses, decisions, and any other details about the outage. These notes aid in reconstructing events during the blameless postmortem. |
| Incident Responder | Everyone else. Typically a Subject Matter Expert (SME) on a service team, often that team's on-call. They do the work of troubleshooting, research, analysis, generating hypotheses, triggering deployments, writing patches, etc. |
At a bare minimum, it is highly recommended that you have the incident commander and incident responder roles defined.
Deputy, Scribe, and Liaison are optional. These can be completely omitted from your framework or merged into a single supporting role. We will describe these roles below, but you can tailor them based on your org size, the nature of your business and other factors.
If you do adopt dedicated roles for any of Deputy, Liaison, and/or Scribe, they should not act as incident responders. A scribe will miss timestamps, task owners, and details if they are simultaneously grepping through logs or tracking down "just one more thing". If you do define these as required roles, be sure to document this "non-response" requirement.
During a SEV-1 outage, the company should operate under a strict hierarchical command structure. This is totally unlike day-to-day work culture. During an emergency, there must be a clear, single leader who makes decisions, allocates resources, and ensures that only one thing is changed at a time.
Document that during a SEV-1 outage, the incident commander has ultimate authority, above the CEO or anyone else in the company. Train executives inside and outside of the engineering org so that they are not surprised by this fact during an incident.
Incident Command must only be assumed by trained staff. Commanders do not necessarily have to be engineers or SMEs. Anyone who has passed the training who has good organization, communication, and delegation skills can assume the role.
Internal and external stakeholders and customers need to be informed in a timely fashion.
For a smaller org, this role may be assumed by the incident commander. If there is a dedicated liaison, they may handle both internal and external audiences. In larger enterprises, it is good not only to have a dedicated liaison but also to split the role into internal and external communications.
Liaisons are often staffed from the customer support org. When communicating with external customers, use language that is informative but that doesn't cause your customers to panic or lose trust in your services.
For a small engineering org, you may create a single on-call schedule. Most companies will want to define an on-call schedule per team. Schedules should be named so that teams can quickly discover each other's on-call during an emergency.
A good starting plan is a `<team-name>-primary` schedule that is 24/7 with one-week shifts. Only trained engineers are put into the on-call loop, and new employees are exempt from on-call for at least 90 days.
Optionally, a recommended addition is a "secondary" on-call schedule, which provides a backup escalation path if the primary misses a page.
Another common pattern is to have an employee go from being primary in week one to secondary in week two. The secondary week affords them time to complete operational work, investigations, and follow-ups from their primary week. Be careful: this can become a band-aid that hides toil, which is discouraged, and losing two weeks to non-feature work may not be acceptable to the business. For example, on a 5-person team the typical engineer would have 3 weeks of feature work for every 2 weeks of on-call.
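The primary-then-secondary weekly rotation described above reduces to simple modular arithmetic. A minimal sketch, with a hypothetical roster and function name:

```python
# Sketch of a weekly primary/secondary rotation: each engineer is primary
# for one week, then winds down as secondary the following week.

def on_call_pair(roster: list[str], week: int) -> tuple[str, str]:
    """Return (primary, secondary) for a given week number."""
    n = len(roster)
    primary = roster[week % n]
    secondary = roster[(week - 1) % n]  # last week's primary
    return primary, secondary

# Hypothetical 5-person team: each engineer spends 2 of every 5 weeks on-call.
team = ["ana", "bo", "cam", "dev", "eli"]
```

Generating the pairing in code makes it easy to preload a full quarter of shifts into your scheduling tool and spot conflicts before they page anyone.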
Distributed companies may want to take advantage of time zones and split the on-call schedule into 12-hour shifts, so that employees are not paged during sleeping hours. This is rare across the industry, but it can work well.
Now that you have documented your incident management framework, it is time to get everyone on board. Establish a training program. OpsDrill offers Incident Response Training and can also develop training specific to your company's needs.
Aside from company-wide definitions of SEV-1 and SEV-2, a long-term goal should be for every service team to define service-specific incident severity thresholds.
A valuable exercise for the business is to model the financial loss of a service being down for 5 minutes.
From a product perspective, it is important to understand how a degraded service, as well as a complete outage, affects customers.
Compared to the other steps in this framework rollout, this is a lower priority.
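The financial-loss exercise above can start as back-of-the-envelope arithmetic. In this sketch every figure is an illustrative assumption, not real data:

```python
# Back-of-the-envelope outage cost model. All figures are illustrative
# assumptions; replace them with your own revenue and traffic data.

ANNUAL_REVENUE = 50_000_000          # assumed: $50M/year flows through the product
MINUTES_PER_YEAR = 365 * 24 * 60     # 525,600

def outage_cost(outage_minutes: float, revenue_fraction: float = 1.0) -> float:
    """Revenue at risk, assuming uniform traffic, scaled by the share of
    revenue that depends on the affected service."""
    per_minute = ANNUAL_REVENUE / MINUTES_PER_YEAR
    return per_minute * outage_minutes * revenue_fraction

# A 5-minute full outage of a service carrying all revenue: ~$476 at these numbers.
cost = outage_cost(5)
```

A uniform-traffic model understates losses during peak hours, so a next step is to weight the per-minute figure by your actual traffic curve.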
The last piece of the incident management puzzle is the postmortem, or after-action review.
Companies that take the time to learn from past mistakes, and that actively improve systems to make repeating those mistakes impossible, benefit from the compounding interest of constant improvement.
Atlassian has a good explanation of blameless postmortems.