
Two Person Rule (2PR)

As on-call engineers, we often do operational work that has fewer safeguards and a higher probability of causing an outage.

For example, we might issue commands to remove nodes or capacity from a production system.

The AWS S3 outage of 2017 was caused by an operator who was trying to remove a few hosts and accidentally removed a significant amount of capacity. This sent the system into a death spiral, and the outage lasted for four hours.

While we should write automation and put safety mechanisms in place, a general-purpose rule that can help reduce this risk is the two person rule, or 2PR.
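
As an illustration of the kind of safety mechanism mentioned above, here is a minimal sketch of a guard that refuses to remove more than a small share of a fleet in one operation. The function name `check_removal` and the 5% default are assumptions made for this example, not a description of any real tool.

```python
def check_removal(total_hosts: int, hosts_to_remove: int, max_percent: int = 5) -> None:
    """Raise ValueError if a removal exceeds the allowed share of the fleet.

    Hypothetical helper: the name and the 5% default are illustrative only.
    """
    if total_hosts <= 0:
        raise ValueError("total_hosts must be positive")
    if hosts_to_remove < 0:
        raise ValueError("hosts_to_remove must not be negative")

    # Integer arithmetic avoids floating-point rounding surprises.
    limit = total_hosts * max_percent // 100
    if hosts_to_remove > limit:
        raise ValueError(
            f"refusing to remove {hosts_to_remove} of {total_hosts} hosts; "
            f"the limit is {limit}"
        )
```

A guard like this would reject a mistyped host count before any capacity is removed, but it only covers the cases someone thought to encode ahead of time.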

Much like a code review, we want others to "sanity check" what we are about to do. It doesn’t matter whether this is clicking submit on a form, executing a command on the CLI, or making a judgement call that could have some impact. Basically, any process that affects production and doesn’t already have a review step should require a second person to review the on-call’s work.

It is as simple as requesting a 2PR from the team and then screen sharing what you are going to do. As with a code review, you should provide some context and explain why you are going to do it.

The other person should then either approve the request, or deny it and give constructive feedback.
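
For teams that want to enforce this flow in tooling, the request, approve, and deny steps above can be sketched roughly as follows. This is only an illustration: `TwoPersonRequest`, `run_if_approved`, and the field names are hypothetical and not part of any existing library.

```python
from dataclasses import dataclass
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


@dataclass
class TwoPersonRequest:
    """Hypothetical record of a 2PR: what will be done, why, and by whom."""

    requester: str
    action: str   # the exact command or step to be performed
    context: str  # why the action is needed
    reviewer: Optional[str] = None
    approved: bool = False
    feedback: str = ""

    def _check_reviewer(self, reviewer: str) -> None:
        # The whole point of 2PR: the reviewer is a second person.
        if reviewer == self.requester:
            raise ValueError("the reviewer must be a different person from the requester")

    def approve(self, reviewer: str) -> None:
        self._check_reviewer(reviewer)
        self.reviewer = reviewer
        self.approved = True
        self.feedback = ""

    def deny(self, reviewer: str, feedback: str) -> None:
        self._check_reviewer(reviewer)
        if not feedback.strip():
            raise ValueError("a denial should come with constructive feedback")
        self.reviewer = reviewer
        self.approved = False
        self.feedback = feedback


def run_if_approved(request: TwoPersonRequest, action_fn: Callable[[], T]) -> T:
    """Run action_fn only when a second person has approved the request."""
    if not request.approved:
        raise PermissionError("action has not been approved by a second person")
    return action_fn()
```

In this sketch, a request opened by `"alice"` cannot be approved by `"alice"`, and `run_if_approved` raises `PermissionError` until a different reviewer approves. The screen share and the conversation remain the real review; the code only records the outcome.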

If you have a secondary on-call, making 2PR one of their main responsibilities is very handy and keeps the rest of the team focused. Otherwise, you can request a 2PR in team chat.

Although 2PR won’t eliminate all errors, it can catch mistakes before they reach production and helps facilitate knowledge sharing within the team.
