The games DevOps teams play to stay sharp

Pictures of various games, like Keep Talking and Nobody Explodes, next to a pyramid categorizing types of games

We don't play enough games at work. In this post I will break down various types of games, from mainstream games to DevOps-specific games that test the resilience of your teams and systems.

We can play games for team building, but is it worthwhile? We will put "team building" under a microscope and examine the different levels of benefits for improving your team. We will look at specific games in each category to discover the game mechanics that allow us to have productive practice sessions.

A hierarchy of team building games, laid out in a pyramid shape.

Hierarchy of Team Building Games

  1. Ops Drills - Core competencies and skills
  2. Role Based Co-op Games - Delegation
  3. Asymmetrical Games - Communication
  4. Party Games - Trust, Bonding, and Safety

Each layer in this pyramid builds upon the layers below it. All are valuable for teams, but the impact on a team is higher as you go up the stack.

Teams get challenges at work on an almost daily basis. How resilient and high functioning a team is will determine how quickly they overcome these challenges.

If a team lacks rapport and trust, they will be slow to collaborate. In a less resilient team, a singular hero/team martyr will always be making the heroic sacrifice to put out the current fire.

During a high severity outage, it is critical that your incident commander stop the bleeding for customers to avoid losing revenue and losing customer trust. If your team can't communicate and delegate well, it won't be able to quickly apply the right subject matter experts and bring systems back online.

So, how can we build a strong and effective DevOps culture?

Games.

Games, but don't limit yourself to "team building" games. Certain games can give us a playground to practice valuable skills. To stay remote and virtual friendly, we will focus on examples of online team building activities.

In this post we will look at games your team can play to:

Increase social familiarity with Party Games

Trust falls are a cliche, but you can't work on the finer points of team execution if you are having challenges with the most basic levels of team cohesion.

Party games are the most basic category of virtual team building games. They increase interaction, build rapport, and can help individuals work on building trust with each other.

The value is getting to know each other a little bit better, having a good laugh, and being creative.

Games teams are playing include Jackbox Party Pack, Among Us, and Animal Crossing.

Example: Jackbox Party Games

Jackbox is a classic in the party game genre. It is a collection of different mini-games with a mechanic like "Fill in the Blank".

There is a timer and once all the answers are in, the other players vote on which answer they liked the best.

Often these are funny, witty, or shocking. Everyone gets to know each other better and it is a great ice breaker.

Improve communication with Asymmetrical Games

We often overestimate how well we communicate. As individual contributors in our day-to-day jobs, we don't always get to practice communicating clearly and succinctly.

We carry a lot of context in our heads and assume too much. We don't provide enough context.

An asymmetrical game is one where there are at least two different "flavors" of the game. This might be players having different rules, a different perspective, or different knowledge.

An asymmetrical game situated in an unfamiliar world can help us exercise these communication skills, because we don't have a shared implicit context to fall back on.

Example: Keep Talking and Nobody Explodes (KTANE)

In KTANE one player, the defuser, can see "The Bomb". Everyone else cannot see the bomb, but they can read the bomb defusal manual.

This is an asymmetrical game, because the person with the bomb has knowledge which is quite different from the people with the instruction manuals.

The person with the bomb needs to clearly articulate the situation: "This bomb has a dial, two batteries, and three horizontal wires. The wires from top to bottom are green, white, yellow."

The rest of the team needs to ask very clear questions and give very clear instructions. If the wrong wire is cut, the bomb explodes and everyone dies.

Improve delegation with Role Based Games

Building upon social foundations and communication skills, the next level of games focuses on delegation and coordination. These skills are critical during incident response and high severity incident management.

In order for games to give us a space to practice these skills, a common mechanism is to give players specialized roles. This is like being a subject matter expert (SME) on a team.

Some example games in this area are Overcooked, Pandemic, World of Warcraft, and Star Trek Bridge Crew.

Example: Overcooked

In Overcooked, the level design often creates specialization for each player. One player may focus on chopping ingredients, another cooking the food, and a third person might be moving finished orders to the pick-up window.

The origin of the term Chef is French and means boss or leader.

Teams with an effective leader will perform much better. They can delegate and ensure bottlenecks are identified and taken care of.

No one wants to do the dishes, but the Chef can push the team to take care of that task or outright assign that to the least utilized teammate.

Improve on relevant knowledge and skills through Ops Drills

At the top of our gaming pyramid are "Ops Drills". These games provide the most benefits to SRE, DevOps, Security, and software teams.

These activities are skill based drills. They let us practice our individual and team core competencies.

We will expand this category and cover each type of game or activity in detail.

Some games and drills in this space include:

Table-top exercises

No, I am not talking about Dungeons and Dragons. Okay, I'm probably not talking about D&D 😜.

You may have done something like this called "What If Planning".

Table-top exercises (TTX) have a rich history in Cybersecurity, the military, and the government (FEMA).

Someone prepares a scenario and presents it to a large cross section of the company. Everyone works through how they would respond. How can they mitigate that threat? They may also brainstorm other related threats or weaknesses.

It is a very powerful tool, because it is so cheap. It is called Table-topping, because much like in D&D, there is only a written scenario and people's imagination. There are unlimited possibilities; you can take these scenarios anywhere.

These are best with wide attendance. You can bring in legal counsel, your CEO, head of marketing, etc. Having a cross functional team allows us to bake in "defense in depth" by mitigating risks at multiple levels. It also can help forge a better understanding of constraints and responsibilities across different parts of the company.

Example: Backdoors & Breaches

Fun shout out to Backdoors & Breaches here. It is a deck of cards that you can shuffle to stimulate ideas for a Cybersecurity table-top session.

It has cards like Phish, Malicious Service, and Weaponizing Active Directory.

This may inspire a scenario where employees were tricked into clicking a link that launches malware. The malware installs a malicious service which starts up on every reboot and connects to Active Directory using that employee's credentials.

Given that scenario, your group can brainstorm about mitigations such as Firewall Log Reviews, SIEM Log Analysis, etc.

Capture the Flag

Capture the Flag, also popularized by the Cybersecurity community, involves a contest between different teams to find clues and vulnerabilities by exploring and entering systems through security exploits or non-obvious paths into a system.

There are two main types of CTF games: jeopardy-style and attack-defence. In jeopardy-style, questions or tasks are presented, and a team gets points for the number of questions answered or tasks completed.

Attack-defence is a war-game style where every team is operating compromised systems. They want to be the first to exploit a competing team's systems, but they also can work on hardening their own system against exploits.

Unlike Table-topping, Capture the Flag (CTF) requires a software system that users can interact with.

These are often elaborate websites, apps, and services constructed with well hidden flaws or clues. Solving them takes very creative problem solving and domain knowledge, such as knowing how to manipulate machine code or inspect memory layout. Players can try different attacks, such as cross site scripting or SQL injection, against the target system.
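To make SQL injection a little more concrete, here is a minimal sketch of the kind of flaw a CTF target might hide. The table, the user names, and the flag value are all made up for illustration, and it uses an in-memory SQLite database so nothing real is touched.

```python
import sqlite3


def build_db() -> sqlite3.Connection:
    """Create an in-memory database holding one hidden flag (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "nothing here"), ("admin", "FLAG{example}")],
    )
    return conn


def lookup_vulnerable(conn: sqlite3.Connection, name: str) -> list:
    # Vulnerable on purpose: user input is pasted straight into the SQL text.
    query = f"SELECT secret FROM users WHERE name = '{name}'"
    return [row[0] for row in conn.execute(query)]


def lookup_safe(conn: sqlite3.Connection, name: str) -> list:
    # Parameterized query: the driver treats the input as data, not as SQL.
    rows = conn.execute("SELECT secret FROM users WHERE name = ?", (name,))
    return [row[0] for row in rows]


conn = build_db()
payload = "x' OR '1'='1"  # classic injection string that makes the WHERE clause always true
leaked = lookup_vulnerable(conn, payload)
blocked = lookup_safe(conn, payload)
```

With the vulnerable lookup, the payload returns every row, including the flag. The parameterized version returns nothing, because no user is literally named `x' OR '1'='1`.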

Example: DEF CON CTF

DEF CON is the largest and one of the oldest hacker conventions.

Games, pranks, and hacking are core to DEF CON culture.

CTF games are very popular at DEF CON and there is currently a qualifier round for DEF CON CTF 2021.

CTF Time is an amazing resource for finding upcoming CTF games as well as reading teams' writeups for previous CTF events.

Runbook Drills

The idea of a Runbook drill or SOP drill (SOP is Standard Operating Procedure) is to have an engineer execute a runbook step by step. This can be done in production, in a test environment, or in production in "dry-run mode", which simulates the steps without actually affecting production. The details depend on the nature of the steps in each Runbook. A Runbook drill is an easier-to-adopt version of a Gameday, which we will cover later.

The benefits include:

  • Spotting out of date and incorrect steps in a Runbook
  • Keeping operators familiar with various systems that they own

These drills are great, because they are very cheap to run and give good results back to the team. More teams should take advantage of this type of drill.
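As a rough illustration of "dry-run mode", here is a small sketch of a runbook runner. Everything in it is hypothetical: the `Step` class, the `run_runbook` function, and the example steps (including the `legacy-worker` service) are invented for this post and are not part of any real tool.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    description: str
    action: Callable[[], None]


def run_runbook(steps: List[Step], dry_run: bool = True) -> List[str]:
    """Walk the runbook in order. In dry-run mode, log each step without executing it."""
    log = []
    for number, step in enumerate(steps, start=1):
        if dry_run:
            log.append(f"[dry-run] step {number}: {step.description}")
            continue
        try:
            step.action()
            log.append(f"[ok] step {number}: {step.description}")
        except Exception as exc:
            log.append(f"[failed] step {number}: {step.description} ({exc})")
            break  # stop at the first broken step so the runbook can be fixed
    return log


# Hypothetical runbook: the second step is out of date and refers to a removed service.
state = {"cache_flushed": False}


def flush_cache() -> None:
    state["cache_flushed"] = True


def restart_missing_service() -> None:
    raise RuntimeError("service 'legacy-worker' not found")


steps = [
    Step("Flush the cache", flush_cache),
    Step("Restart legacy-worker", restart_missing_service),
    Step("Verify health check", lambda: None),
]

dry_log = run_runbook(steps, dry_run=True)
```

The dry run lists all three steps and changes nothing. Running the same steps with `dry_run=False` stops at step 2, which is exactly the kind of out-of-date step a drill is meant to catch.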

Incident Response Drills

An incident response drill takes place in a simulated live site outage. The simulation provides a distributed system with familiar concepts such as graphs, logs, runbooks, continuous integration, and deployment.

You could deploy an example distributed system, such as a photo sharing app, and then break it on purpose. You could then page in your team and have them work together to bring it back online.

It attempts to simulate a stressful environment to test team cohesion.

Incident commanders can practice delegating and coordinating.

Being a simulation, it is easy for teams to adopt without any risk to their customers.

Incident Response Escape Room

Full disclosure: At OpsDrill we have created the first ever purpose-built escape room for giving teams a space to practice incident response.

It uses the escape room game genre, but instead of attempting to unlock a door and escape a room, the team is searching for clues as to why systems are down and to bring everything back online.

Chaos testing

Chaos testing involves automated breakage or sabotage of your infrastructure and software systems to identify gaps in operational readiness.

Some examples are filling up the memory on a host, filling up the disk on a host, killing or restarting processes, removing DNS, and adding packet loss. The list goes on and on.
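The core loop of a chaos experiment can be sketched in a few lines. This is a toy simulation only: `SimulatedHost`, the two faults, and the health rule are assumptions made up for illustration, and nothing here touches a real machine.

```python
import random


class SimulatedHost:
    """Stand-in for a real host; nothing here touches an actual machine."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.disk_free_pct = 60
        self.process_running = True

    def healthy(self) -> bool:
        # Assumed health rule: the process is up and the disk is not nearly full.
        return self.process_running and self.disk_free_pct > 5


def fill_disk(host: SimulatedHost) -> None:
    host.disk_free_pct = 0


def kill_process(host: SimulatedHost) -> None:
    host.process_running = False


FAULTS = {"fill_disk": fill_disk, "kill_process": kill_process}


def run_experiment(hosts, rng: random.Random) -> dict:
    """Inject one random fault into one random host and report what is still healthy."""
    victim = rng.choice(hosts)
    fault_name = rng.choice(sorted(FAULTS))
    FAULTS[fault_name](victim)
    healthy = [h.name for h in hosts if h.healthy()]
    return {"victim": victim.name, "fault": fault_name, "healthy": healthy}


hosts = [SimulatedHost(f"app-{i}") for i in range(3)]
report = run_experiment(hosts, random.Random(42))

# The hypothesis under test: the service survives losing any single host.
service_up = len(report["healthy"]) >= 2
```

A real chaos tool would inject the fault into real infrastructure and check real monitoring, but the shape is the same: pick a fault, apply it, then verify that your hypothesis about the system still holds.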

Chaos testing is performed in production.

Chaos testing has many benefits:

  • Test that monitoring and alerting are working.
  • Test your backups and automated failover.
  • Test that your pager works at 3am ;)

Chaos testing can be hard to adopt. Your systems will go down. You could lose real customer data. You could lose real revenue.

Chaos testing is something that most teams want to do, but sadly most teams never get around to fully implementing it. Because of that, it is good to look at the other, more achievable games presented here as alternatives that can augment chaos testing.

Gamedays

A related drill to chaos testing is the Gameday event. Chaos testing is ideally an ongoing process, whereas a Gameday is a discrete event.

A Gameday is planned and executed, usually with the company's knowledge. An example would be to de-provision all the EC2 hosts for all the application servers in an availability zone.
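Part of planning a Gameday like this is knowing what you are about to take away. Here is a small planning sketch that works on a plain list of hosts rather than calling any cloud API. The inventory, the instance IDs, and the `plan_az_outage` function are all hypothetical.

```python
from collections import Counter


def plan_az_outage(instances, zone: str, role: str = "app") -> dict:
    """Return the hosts a Gameday would remove and the capacity that would remain."""
    targets = [i["id"] for i in instances if i["az"] == zone and i["role"] == role]
    survivors = Counter(
        i["az"] for i in instances if i["role"] == role and i["az"] != zone
    )
    return {"terminate": targets, "remaining_by_az": dict(survivors)}


# Made-up inventory for illustration; a real plan would be built from your own host list.
inventory = [
    {"id": "i-aaa", "az": "us-east-1a", "role": "app"},
    {"id": "i-bbb", "az": "us-east-1a", "role": "app"},
    {"id": "i-ccc", "az": "us-east-1b", "role": "app"},
    {"id": "i-ddd", "az": "us-east-1c", "role": "app"},
    {"id": "i-eee", "az": "us-east-1a", "role": "db"},
]

plan = plan_az_outage(inventory, zone="us-east-1a")
```

Reviewing a plan like this before the event makes the conversation with leadership easier: everyone can see which hosts go away and how many application servers are expected to carry the load.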

Much like chaos testing, this is done in production and tests your real world resiliency.

You may find some Runbook drills can't be performed without risk or without tampering with production. These ideas can be scheduled into a Gameday, so Runbook drill preparation often provides ideas for future Gameday events.

Also like chaos testing, executing your first Gameday requires buy-in from the highest levels of the company, because bad things could happen to real customers. There are risks, but this is a great tool for developing operational resilience.

Summary

From Party Games through to Gamedays, we have categorized the mechanics and benefits of playing games, from basic team building to deliberate practice of highly relevant skills and core competencies.

Hopefully this has inspired you to schedule some playtime with your team and given you ideas for what types of games are the best next step for growing your team.

So the next time you get caught playing games at work, share this article with your manager 😜
