What does good incident response look like?
You have an incident response plan.
These are good steps to follow during an incident:
Focus on recovery first, rather than on diagnosis or fixing forward. In most environments, this means establishing whether the failure is due to a recent change and, if so, safely rolling it back. Fix forward or in place only as a last resort.
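As a rough illustration of checking whether a recent change is the likely cause, the sketch below lists the changes deployed shortly before an incident began, newest first, as rollback candidates. The `Change` record, the `rollback_candidates` function and the 24-hour lookback window are all assumptions made for this example, not part of any particular deployment tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    """A deployed change. Hypothetical record, not taken from a real tool."""
    change_id: str
    deployed_at: datetime


def rollback_candidates(changes, incident_start, lookback=timedelta(hours=24)):
    """Return changes deployed within `lookback` before the incident, newest first.

    The 24-hour default is an assumption; pick a window that suits
    your own release cadence.
    """
    window_start = incident_start - lookback
    recent = [c for c in changes if window_start <= c.deployed_at <= incident_start]
    return sorted(recent, key=lambda c: c.deployed_at, reverse=True)


# Example data, invented for illustration.
changes = [
    Change("config-update", datetime(2022, 1, 20, 9, 0)),
    Change("api-release", datetime(2022, 1, 22, 13, 30)),
    Change("db-migration", datetime(2022, 1, 22, 10, 0)),
]
incident_start = datetime(2022, 1, 22, 14, 0)

candidates = rollback_candidates(changes, incident_start)
print([c.change_id for c in candidates])  # ['api-release', 'db-migration']
```

The most recent change is listed first because it is usually the first one worth ruling out. Whether a rollback is actually safe still needs a human judgement.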
While many people can help diagnose and test the response to an incident, in most environments only one person should be making the fix. In distributed systems, particularly in failure modes, take care not to have "too many cooks".
The surgical-team method of organising programming teams is a long-established approach to this: have one technical lead, with others supporting them with tools and data. A useful idea is to ask other team members to "damage report": that is, to be on hand to test the system across different edge and corner cases, and to relay the errors and issues that others are seeing to the diagnosis team.
Once a major incident has begun, it is important to keep stakeholders, customers and others informed of the issue. An outage on a public-facing site will quickly attract attention, so it is important to have a clear and consistent message about what is going wrong. One member of the incident response team should be dedicated to sending regular updates; during a major incident this may be as often as every 15 or 20 minutes.
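To make the update cadence concrete, here is a minimal sketch that works out when the next few updates are due and formats each message the same way. The function names, the 15-minute default and the message wording are assumptions for illustration only.

```python
from datetime import datetime, timedelta


def update_times(incident_start, interval=timedelta(minutes=15), count=4):
    """Return the times at which the next `count` status updates are due."""
    return [incident_start + interval * n for n in range(1, count + 1)]


def format_update(sent_at, status, next_due):
    """Build one consistently shaped, plain-text status message."""
    return f"[{sent_at:%H:%M}] {status} Next update by {next_due:%H:%M}."


# Example values, invented for illustration.
start = datetime(2022, 1, 22, 14, 0)
due = update_times(start)
message = format_update(due[0], "We are investigating errors on the site.", due[1])
print(message)
# [14:15] We are investigating errors on the site. Next update by 14:30.
```

Stating when the next update is due, even if there is nothing new to report, helps keep the message consistent and reduces ad hoc questions to the people doing the fixing.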
Once the system is back, lock the doors (hold off on further changes) and preserve the data needed to investigate later.
After each incident, as soon as possible and only within the response team, review how the process went and allow people to share and process how the incident response went, separately from the issue itself. Consider including this information in the CoE where it is directly relevant, particularly to the timeline.
v0.1 22/01/22