Cause of Error

What does good incident response look like?

You have an incident response plan.

These are good steps to follow during an incident:

1. Get The System Back

Focus on recovery rather than diagnosis or fixing forward. In most environments, this means ascertaining whether the failure is due to a recent change and, if so, safely rolling it back. Fix forward or in place only as a last resort.

2. Only One Surgeon

While many people can help diagnose and test the response to an incident, in most environments there should be only one person doing the fixing. In distributed system environments, particularly in failure modes, care must be taken not to have "too many cooks".

The surgical method of organising programming teams has long been established as a useful approach to this: have one technical lead, with others supporting them with tools and data. A useful idea is to ask other team members to "damage report", that is to say, be on hand to test the system against different edge and corner cases, and report the errors and issues that others are seeing to the diagnosis team.

3. One Person Communicates Out

Once a major incident has begun, it is important to keep stakeholders, customers and others informed of the issue. A public-facing site will quickly generate attention, and it is important to have a clear and consistent message on what is going wrong. One member of the incident response team should be devoted to communicating out regularly - during a major incident this may be as often as every 15 or 20 minutes.

4. Lock The Doors

Once the system is back, lock the doors (hold off on further changes) and preserve the data, such as logs, that the review will need.

5. Review and be Vocally Self Critical

After each incident, as soon as possible and only within the response team, review how the response process went, aside from the issue itself, and allow people to share and process their experience of it. Consider including this information in the CoE where it is directly relevant, particularly to the timeline.
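Step 1 above depends on quickly spotting whether a recent change lines up with the failure. A minimal sketch of that check, assuming a hypothetical change log: the `Change` record, its fields, the sample entries and the 24-hour lookback window are all illustrative assumptions, not part of any particular deployment tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    """One deployed change. All fields are hypothetical, for illustration."""
    change_id: str
    deployed_at: datetime
    description: str


def rollback_candidates(changes, incident_start, lookback=timedelta(hours=24)):
    """Return changes deployed within the lookback window before the incident,
    most recent first, as the order in which to consider rolling them back."""
    window_start = incident_start - lookback
    recent = [c for c in changes if window_start <= c.deployed_at <= incident_start]
    return sorted(recent, key=lambda c: c.deployed_at, reverse=True)


# Invented sample data: three changes, one of them well before the incident.
changes = [
    Change("c-101", datetime(2022, 1, 20, 9, 0), "database index rebuild"),
    Change("c-102", datetime(2022, 1, 22, 13, 30), "new checkout service build"),
    Change("c-103", datetime(2022, 1, 22, 14, 45), "load balancer config update"),
]
incident_start = datetime(2022, 1, 22, 15, 0)

for change in rollback_candidates(changes, incident_start):
    print(change.change_id, change.deployed_at.isoformat(), change.description)
```

The list only suggests an order to investigate; the single person doing the fixing still decides whether a rollback is safe.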

Produce a single Cause of Error (CoE) document with seven sections:

1. Summary

A simple description of what happened.

2. Customer Impact

Describe the issue from the point of view of our customers. What did they see?

3. Security Impact

Was any system, data or privacy breached?

4. Timeline

Who did what and when, and when the problem was resolved.

5. Five Whys

Keep asking why until you reach a root cause. Dissect or deconstruct the answer at every stage.

6. Lessons Learned

What did we learn from this problem?

7. Next Actions

Given the things we learned, what will we do next about this?
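The seven sections above can be captured as a simple template so that every report has the same shape. This is a rough sketch only: the class names, field names and rendering format are assumptions made for illustration, and the sample incident is invented.

```python
from dataclasses import dataclass, field


@dataclass
class TimelineEntry:
    """Who did what, and when (hypothetical structure)."""
    when: str
    who: str
    what: str


@dataclass
class CauseOfError:
    """The seven sections of the report, in order."""
    summary: str
    customer_impact: str
    security_impact: str
    timeline: list = field(default_factory=list)
    five_whys: list = field(default_factory=list)  # each answer prompts the next "why"
    lessons_learned: list = field(default_factory=list)
    next_actions: list = field(default_factory=list)

    def render(self) -> str:
        """Render the report as plain text with numbered section headings."""
        lines = ["1. Summary", self.summary,
                 "2. Customer Impact", self.customer_impact,
                 "3. Security Impact", self.security_impact,
                 "4. Timeline"]
        lines += [f"{e.when} {e.who}: {e.what}" for e in self.timeline]
        lines.append("5. Five Whys")
        lines += [f"Why? {answer}" for answer in self.five_whys]
        lines.append("6. Lessons Learned")
        lines += self.lessons_learned
        lines.append("7. Next Actions")
        lines += self.next_actions
        return "\n".join(lines)


# Invented example incident, for illustration only.
coe = CauseOfError(
    summary="Checkout was unavailable for 40 minutes.",
    customer_impact="Customers saw an error page when paying.",
    security_impact="No system, data or privacy breach.",
    timeline=[TimelineEntry("15:00", "on-call", "alert received"),
              TimelineEntry("15:40", "technical lead", "change rolled back")],
    five_whys=["The service ran out of connections.",
               "A config change lowered the connection limit."],
    lessons_learned=["Config changes need the same review as code."],
    next_actions=["Add connection limit checks to the release checklist."],
)
print(coe.render())
```

Keeping the Five Whys as an ordered list makes the chain of reasoning easy to read back: each entry answers the "why" raised by the one before it.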

Implementation Notes

How to implement this method in practice.

v0.1 22/01/22