A Gameday, also known as Adversarial Gameday, is a planned event designed to stress the resiliency of your teams and systems.
A Gameday can help us validate that our knowledge and automated processes are up to the task of handling partial degradation or complete failure of one or more subsystems.
It helps test our observability: monitoring, alarms, SLIs, and so on. It helps teams validate Runbooks. It gives us a chance to practice our incident response framework in a more controlled outage environment.
An example: if we intentionally increase latency between a frontend and a backend service...
- Do the alerts that we expect to fire actually fire?
- Are dashboards easy to use and do they highlight the relevant facts?
- Does the on-call know where to find the relevant Runbook? Is it up to date?
- Was troubleshooting the root cause(s) relatively straightforward?
- Was the team able to bring about recovery in a timely manner?
- Did we see the latency impact in the service level indicators?
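One simple way to stage such a latency experiment is to wrap calls to the backend with an artificial delay. The sketch below is purely illustrative: `fetch_user` and the delay values are hypothetical, and real Gamedays more often inject latency at the network layer (for example with a fault-injection proxy such as Toxiproxy, or `tc netem`).

```python
import random
import time

def inject_latency(func, base_delay_ms=200, jitter_ms=100):
    """Return a wrapped version of func that pays an artificial delay
    on every call, simulating a degraded backend dependency."""
    def wrapper(*args, **kwargs):
        delay_s = (base_delay_ms + random.uniform(0, jitter_ms)) / 1000.0
        time.sleep(delay_s)  # the injected fault
        return func(*args, **kwargs)
    return wrapper

# Hypothetical backend call, for illustration only.
def fetch_user(user_id):
    return {"id": user_id}

# During the Gameday window, route traffic through the slow version
# and watch whether alerts fire and dashboards reflect the impact.
slow_fetch_user = inject_latency(fetch_user)
```

The wrapper leaves the backend's behavior unchanged apart from the delay, so any alert that fires is attributable to latency alone.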
These intentional "experiments" help to identify action items for improving site reliability.
We want to make our systems more resilient, but we also want to invest in team resiliency.
Introducing latency is a fairly conservative example. Gamedays at Amazon have been as drastic as turning off the power to a data center. Typically a Gameday includes one or more fairly serious intentional breakages, much like a chaos testing experiment.
Gameday events are often large scale and involve multiple teams or organizations. They are not the small-scale ops drills that one might conduct with a new employee before their first on-call shift.
Gameday in Production or Prod-like Environment?
Ideally, a Gameday takes place in production. There are many reasons this may not currently be possible:
- Lack of buy-in from key executives
- Lack of SLOs or key metrics for knowing whether the customer experience is degraded or unavailable; it is very dangerous to fly blind during such a risky experiment
- The risk/reward tradeoff just isn't there, as with life-critical services such as aviation or medical applications
Sometimes Gamedays are conducted in a test or dedicated Gameday environment.
Testing in non-prod environments often has limitations that reduce some of the benefits. It may take significant effort to set up PagerDuty, observability, alarming, CI/CD pipeline support, and other supplementary infrastructure.
For example, if everything is "just like prod" except that engineers do not receive PagerDuty alerts and the SLOs/SLIs are not implemented, then the practice is less realistic and it is harder to gauge severity in the heat of the moment.
Planning a Gameday
Gamedays require a lot of planning and some infrastructure work, such as realistic load-generation scripts, shutoff valves, and syncing environments.
- Identify the service or services that will be purposefully broken.
- Identify the metric levels at which the experiment should be stopped, either a specific experiment or all of the ongoing experiments.
- If Gamedays in production are new to your business, you may need to work on gaining executive buy-in before risking customer traffic.
- Identify a day, a time window, and the participants.
- Set up dedicated Slack channels and Zoom calls (or other tools, of course) to facilitate communication.
What to Test?
Much like the practice of Chaos testing, it is a good idea to have a hypothesis and an experiment instead of blindly breaking things in production.
In a test or dedicated Gameday environment, this is less important, but having a plan is a best practice.
Post-mortems for prior outages are a rich source of Gameday experiments. In theory, your systems are now hardened against these failures, so you should be able to run through a list of them. You can replay traffic or reproduce the circumstances of a previous outage.
Thinking about the business can also generate ideas for what to test. Amazon has an annual sale called "Prime Day" which puts significant load onto the website. They therefore plan and execute one or more Prime Day Gameday events to help ensure systems will be able to handle the traffic and unique use cases.
Hold a brainstorming meeting during the planning phase and think of subsystems that could fail, such as replication, DNS, etc.
An operationally mature service may make the Gameday boring, but that is a best-case scenario to aspire to.
Gameday versus Chaos Testing
There is a lot of overlap between the concept of a Gameday and Chaos Testing. A Gameday is a discrete event in time, held perhaps once a year or once a quarter.
As a company adopts these kinds of experiments on a more continuous schedule, the Gameday practice graduates into a chaos testing program.