Cause of Error

What does bad look like?

The most common issues are:

Lack of Telemetry

The most common issue identified in any root-cause analysis involving system engineering will be "we didn't monitor that". Monitoring and observability are never finished, but having little or no telemetry at all in a domain is a major issue and should be remedied as soon as it is discovered.

Change Management

Poor change management is a very common issue identified in most root-cause analyses. Many outages are caused by configuration errors made by system engineers. Most often this is a function of inadequate or hard-to-use testing environments.

Lack of Support

The second most common issue identified in any root-cause analysis involving system engineering will be "we didn't have time for that". Technical debt and non-functional requirements are second-class citizens in many organisations. Incident response highlights these and focuses attention on them.

Communication

A very common problem during an incident is insufficient communication from the response team to the outside world. This makes stakeholders anxious, which leads them to contact the response team. Responding to stakeholders ad hoc rapidly distracts (and upsets) the response team, further hampering effective recovery actions.

This is an easy issue to solve - appoint one person in the response team to communicate out.

Produce a single document with seven sections:

1. Summary

A simple description of what happened.

2. Customer Impact

Describe the issue from the point of view of our customers. What did they see?

3. Security Impact

Was any system, data or privacy breached?

4. Timeline

Who did what and when, and when the problem was resolved.

5. Five Whys

Keep asking why until you reach a root cause. Dissect or deconstruct the answer at every stage.

6. Lessons Learned

What did we learn from this problem?

7. Next Actions

Given the things we learned, what will we do next about this?
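
The seven sections above can be kept consistent across incidents by generating the document skeleton from code. The following is a minimal sketch, not a prescribed tool: the names IncidentReport, TimelineEntry, render and missing_sections are hypothetical and chosen only for illustration. It renders the sections in order, sorts timeline entries by time, and reports which sections are still empty.

```python
from dataclasses import dataclass, field
from datetime import datetime

# The seven sections, in the order given in the text above.
SECTIONS = [
    "Summary",
    "Customer Impact",
    "Security Impact",
    "Timeline",
    "Five Whys",
    "Lessons Learned",
    "Next Actions",
]


@dataclass
class TimelineEntry:
    """One 'who did what when' record (hypothetical structure)."""
    when: datetime
    who: str
    what: str


@dataclass
class IncidentReport:
    """Hypothetical container for a single post-incident document."""
    title: str
    sections: dict = field(default_factory=dict)   # section name -> text
    timeline: list = field(default_factory=list)   # list of TimelineEntry

    def missing_sections(self) -> list:
        """Return the names of sections that have no content yet."""
        missing = []
        for name in SECTIONS:
            if name == "Timeline":
                if not self.timeline:
                    missing.append(name)
            elif not self.sections.get(name, "").strip():
                missing.append(name)
        return missing

    def render(self) -> str:
        """Render the report as plain text, one numbered heading per section."""
        lines = [self.title, ""]
        for number, name in enumerate(SECTIONS, start=1):
            lines.append(f"{number}. {name}")
            lines.append("")
            if name == "Timeline" and self.timeline:
                # Sort so the timeline reads chronologically.
                for entry in sorted(self.timeline, key=lambda e: e.when):
                    lines.append(f"{entry.when:%H:%M} {entry.who}: {entry.what}")
            else:
                # Unfilled sections are marked so they are not overlooked.
                lines.append(self.sections.get(name, "TODO"))
            lines.append("")
        return "\n".join(lines)
```

A response team could create one IncidentReport per incident, append TimelineEntry records as events happen, and check missing_sections() before circulating the document.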

Implementation Notes

How to implement this method in practice.

v0.1 22/01/22