Cause of Error

6. Lessons Learned

Every event has lessons learned, and these should be candidly and clearly articulated.

It usually follows that each lesson learned has an action, and a person attached to that action as a follow-up. It perhaps isn't worth making this a cast-iron rule in your process, but it is a good rule of thumb. If we learned something, then there is probably an action associated with it, if only to ensure that everyone else learns the point too.

Our monitoring is not sufficient for this type of event.

We were monitoring the service concerned, but not at sufficient granularity or regularly enough to catch the outages when they occurred.

We learned that it takes too long to test the system manually in production.

If we learned nothing, was it even an outage?
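
The rule of thumb that each lesson carries an action and an owner can be checked mechanically. What follows is only a minimal sketch, not part of the method itself: the Lesson class, the lessons_without_followup helper, the example action and the owner name are all hypothetical, introduced purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Lesson:
    """One lesson learned, with an optional follow-up action and owner."""
    text: str
    action: Optional[str] = None
    owner: Optional[str] = None


def lessons_without_followup(lessons: List[Lesson]) -> List[Lesson]:
    """Return the lessons that lack either an action or an owner."""
    return [lesson for lesson in lessons if not (lesson.action and lesson.owner)]


lessons = [
    Lesson(
        text="Our monitoring is not sufficient for this type of event.",
        action="Check the affected service more often",  # hypothetical action
        owner="alice",  # hypothetical owner
    ),
    # No action or owner recorded yet, so this one gets flagged.
    Lesson(text="It takes too long to test the system manually in production."),
]

for lesson in lessons_without_followup(lessons):
    print("No follow-up yet:", lesson.text)
```

Because this is a rule of thumb rather than a cast-iron rule, the sketch only reports lessons without a follow-up; it does not reject them.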

Produce a single document with seven sections:

1. Summary

A simple description of what happened.

2. Customer Impact

Describe the issue from the point of view of our customers. What did they see?

3. Security Impact

Was any system, data or privacy breached?

4. Timeline

Who did what, and when, including when the problem was resolved.

5. Five Whys

Keep asking Why until you reach a root cause. Dissect or deconstruct the answer at every stage.

6. Lessons Learned

What did we learn from this problem?

7. Next Actions

Given the things we learned, what will we do next about this?
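
One way to keep every report in the same shape is to generate an empty skeleton from the seven sections listed above. This is a sketch under assumptions: the SECTIONS list and the render_skeleton function are hypothetical names, and the prompts are copied from the section descriptions in this document.

```python
# Section titles and prompts, taken from the seven sections described above.
SECTIONS = [
    ("Summary", "A simple description of what happened."),
    ("Customer Impact", "What did our customers see?"),
    ("Security Impact", "Was any system, data or privacy breached?"),
    ("Timeline", "Who did what when, and when the problem was resolved."),
    ("Five Whys", "Keep asking Why until you have a root cause."),
    ("Lessons Learned", "What did we learn from this problem?"),
    ("Next Actions", "Given the things we learned, what will we do next?"),
]


def render_skeleton(sections=SECTIONS):
    """Return a plain-text report skeleton with numbered section headings."""
    lines = []
    for number, (title, prompt) in enumerate(sections, start=1):
        lines.append(f"{number}. {title}")
        lines.append("")
        lines.append(prompt)
        lines.append("")
    # Drop the trailing blank line and end with a single newline.
    return "\n".join(lines).rstrip() + "\n"


print(render_skeleton())
```

The output is plain text in the same numbered style as the list above, so it can be pasted into whatever tool the team already uses for reports.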

Implementation Notes

How to implement this method in practice.

v0.1 22/01/22