In a true DevOps organization, the same team that handles application development (“Dev”) also takes care of production deployments and support (“Ops”). Since application failure can happen at any time of the day or night, DevOps teams need to “carry a pager” and be ready to respond 24×7. Though notification services like PagerDuty have eliminated the need for an actual pager, the concept of being “on-call” in case there is a problem remains the same.
To share the burden of 24×7 on-call, pager duty typically rotates among all team members. Rotation spreads the load across the team, and is usually good for the sanity and sleep habits of the team members, but it can unfortunately slow response times when failure strikes: the long gaps between each member's on-call rotations dull reactions and fade memories. So, to keep reactions crisp and memories strong for handling failures, your team needs to practice failure – often and regularly.
The best way to practice failure is to pick team members not currently on-call and assign one the role of Monitor and the other(s) the role of Responder (the Responder can be one person or a team). The Monitor creates a real problem and then observes how the Responder identifies and reacts to the issue. For example, the Monitor might play a little Chaos Monkey with some of the hardware, unplugging a server here or there, and see how long it takes the Responder (and the rest of the team) to locate and fix or isolate the malfunctioning hardware. The Monitor takes notes on each practice run, and the team reviews them together at the next stand-up. Deficiencies in monitoring, alerts, or runbooks are noted and added to the backlog so that they are addressed before the next practice session.
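The Monitor's drill can be as simple as taking one host down and timing how long the Responder's tooling needs to find it. A minimal sketch of that loop, using an in-memory stand-in for a fleet (all host names here are illustrative; in a real drill the Monitor would disrupt actual hardware or a staging environment):

```python
import time

# Hypothetical fleet: host name -> healthy? In a real exercise this would be
# real servers or VMs, not a dict.
fleet = {"web-1": True, "web-2": True, "db-1": True}

def inject_fault(fleet, host):
    """Monitor's step: silently take one server offline."""
    fleet[host] = False

def find_failed_hosts(fleet):
    """Responder's step: a health sweep that lists every broken host."""
    return [host for host, healthy in fleet.items() if not healthy]

start = time.monotonic()
inject_fault(fleet, "web-2")        # the Monitor "unplugs" a server
failed = find_failed_hosts(fleet)   # the Responder locates it
time_to_detect = time.monotonic() - start

print(failed)          # the drill's result: which host failed
print(time_to_detect)  # the metric the Monitor records for the stand-up
```

The number worth tracking across sessions is `time_to_detect` – if it trends up, the team's monitoring or familiarity is slipping.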
During the incident response, it is critical that the team not cut corners in the process just because this is practice. If the process requires 30-minute updates to the management chain, those should be provided. If it requires updating status in a ticketing system or a group chat room, that should be done too. This builds muscle memory for everything that must happen during an incident, and it keeps developers from spending 100% of their time on the technical issue while customers and management wonder whether anyone is working it.
Once the incident is resolved, the team should meet and discuss the outcome. This post-mortem style meeting should focus on a few specific questions, such as:
- What went well?
- What could have gone better?
- Was the process followed? If not, does the process need to be changed?
- Is the tooling and monitoring adequate to diagnose issues like this?
- Was intra-team, customer, and management communication adequate?
- If you received a similar page-out at 3am, would you have adequate tools, training, documentation, and knowledge to work the issue?
The outcome of the practice session and post-mortem should not only include more familiarity with tools and documented processes, but should provide your team with real tasks that can improve your ability to respond to future issues. Some typical areas for improvement include:
- Automation – If this is a common or likely issue, can the team automate the solution to this issue? Don’t forget to consider that rebuilding systems with automation is sometimes faster (and better) than fixing the issue.
- Monitoring – Is the monitoring granular enough to help the team diagnose problems like this? A monitoring tool that reports “RAID failure on server X” is far more useful than one that only reports “slow website”.
- Training – Is there a specific tool someone used that seems valuable? Based on how the team interacted, is any cross-training needed?
- Documentation – Is the documentation for the failure response process adequate? Are there missing notes on common issues and fixes? What about the escalation path? Does contact information for service providers and hardware vendors exist? Is the document repository available at 3am from home or another random location?
- Runbooks – Developers can document how specific failures typically manifest in order to give teams a head start on the issues. For example: “When you receive this alarm, check X, then restart Y, then reinstall Z”.
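One way to make runbook entries like the "check X, restart Y, reinstall Z" example above easy to find at 3am is to keep them as structured data keyed by alarm name, with a default that points to the escalation path. A minimal sketch, with purely illustrative alarm names and steps:

```python
# Hypothetical runbook entries keyed by alarm name; each value is the ordered
# list of steps the on-call engineer should walk through.
RUNBOOK = {
    "alarm:queue-backlog": [
        "check that the consumer process is running",
        "restart the consumer service",
        "if the backlog persists, escalate to the on-call lead",
    ],
}

def steps_for(alarm):
    """Return the documented response steps, or a safe default that
    sends the responder to the escalation path instead of a dead end."""
    return RUNBOOK.get(alarm, ["no runbook entry found – escalate to the on-call lead"])

for step in steps_for("alarm:queue-backlog"):
    print(step)
```

Keeping the entries as data rather than free-form prose also makes the practice sessions sharper: a drill that ends at "no runbook entry found" is itself a backlog item.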
In summary, running practice sessions to test your failure response is very beneficial in a DevOps organization. The sessions let developers feel more confident that they can handle a page-out when it comes, and it validates and improves your process for dealing with failures. If your organization runs these sessions on a regular basis, it will be more effective at dealing with issues when they arrive … at 3am, of course.
At IBB, our Cloud and Software Transformation group has real-world experience in testing and optimizing your failure response process. Our experience ranges from choosing the right tools and implementing a plan to orchestrating experiments that help test it. Let us know how we can help you.