Cause of Error

4. Timeline

The timeline of a CoE describes what happened, when, and who did what. It should answer most of the following questions:

When was the problem first discovered, and how?

Who first noticed the problem or reported it to us?

Who engaged in the incident, and helped recover from the failure?

When was the failure resolved?

What indicators did we use to prove it was resolved?

How did we prove that the failure was resolved?

The more accurate and specific the timeline, the better. Process failures often occur during the incident response itself, and these should be assessed and learned from. In many cases a failure could have been responded to or fixed more quickly had we understood the exact nature of the problem sooner. The alerts we received, and the actions we took to discover and analyse the problem, are therefore just as important as those relating to the cause of the failure itself.

The CoE timeline should be a simple list of times and events, focusing on people and their actions.

21:01 - Incident opened. First alarm tripped, engineer Bob investigates

21:02 - Bob alerts Operations team of initial impact

21:05 - Cathy comes online

21:07 - Bob begins reviewing all changes from today

21:10 - Cathy opens up conference bridge and sends initial alert email

21:14 - Configuration item changed this afternoon identified as potential issue - testing

21:17 - Testing confirms likely configuration issue as cause

21:18 - Rollback of changes Xd4DD begun

21:22 - Rollback completed in production, testing begins

21:23 - System confirmed online, testing begins

21:31 - All production tests pass and system confirmed as working again

21:35 - Conference bridge and incident closed.
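Because the timeline is a plain list of time and event, it is easy to treat as data, for example to measure how long each phase of the response took. The following is a minimal Python sketch under stated assumptions, not part of the CoE method itself: the names `TimelineEntry` and `parse_timeline` are hypothetical, the date `2024-01-15` is a placeholder (the example timeline gives no date), and incidents that cross midnight are not handled.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class TimelineEntry:
    time: datetime
    event: str


def parse_timeline(text: str, day: str) -> list[TimelineEntry]:
    """Parse 'HH:MM - event' lines. `day` is an ISO date, e.g. '2024-01-15'."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        # Split on the first ' - ' only, so events may contain ' - ' themselves.
        stamp, sep, event = line.partition(" - ")
        if not sep:
            continue  # blank line or not a timeline entry
        try:
            when = datetime.strptime(f"{day} {stamp}", "%Y-%m-%d %H:%M")
        except ValueError:
            continue  # the text before ' - ' was not an HH:MM time
        entries.append(TimelineEntry(when, event.strip()))
    return entries


# A shortened copy of the example timeline above.
SAMPLE = """
21:01 - Incident opened. First alarm tripped, engineer Bob investigates
21:18 - Rollback of changes Xd4DD begun
21:31 - All production tests pass and system confirmed as working again
21:35 - Conference bridge and incident closed.
"""

entries = parse_timeline(SAMPLE, "2024-01-15")  # placeholder date
opened_to_rollback = entries[1].time - entries[0].time  # 17 minutes
opened_to_closed = entries[-1].time - entries[0].time   # 34 minutes
```

Intervals such as "opened to rollback" give a concrete starting point when assessing whether the problem could have been understood sooner.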

After investigation, and particularly after subsequent actions have been taken, it is perfectly valid to extend the timeline back to well before the incident, especially if new data comes to light showing how the situation developed into a failure or how it could have been mitigated earlier.

Many engineering teams use chat systems (such as Slack, Teams, Discord or IRC) habitually, and particularly during incident response. Incident response teams should think aloud in these channels during an incident. The log of chat and actions taken makes an excellent starting point for any CoE timeline, and should be preserved as soon as possible after the incident.
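As an illustration of turning a preserved chat log into a first-draft timeline, here is a small Python sketch. It assumes the log has already been exported as a list of messages with a Unix timestamp (`ts`), a `user` and a `text` field. That shape, and the function name `draft_timeline`, are assumptions made for this example and do not describe any particular chat system's export format.

```python
from datetime import datetime, timezone


def draft_timeline(messages: list[dict]) -> list[str]:
    """Render messages as 'HH:MM - user: text' lines, oldest first, in UTC."""
    lines = []
    for msg in sorted(messages, key=lambda m: m["ts"]):
        when = datetime.fromtimestamp(msg["ts"], tz=timezone.utc)
        lines.append(f"{when:%H:%M} - {msg['user']}: {msg['text']}")
    return lines


# Hypothetical messages, deliberately out of order; the date is a placeholder.
base = datetime(2024, 1, 15, 21, 1, tzinfo=timezone.utc).timestamp()
messages = [
    {"ts": base + 60, "user": "bob", "text": "alerting Operations of initial impact"},
    {"ts": base, "user": "bob", "text": "first alarm tripped, investigating"},
]

draft = draft_timeline(messages)
for line in draft:
    print(line)
```

The output is only a draft: it still needs editing down to the events that matter, with the people and actions made explicit.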

Experienced CoE practitioners will begin keeping the timeline during the incident itself.

Produce a single document with seven sections:

1. Summary

A simple description of what happened.

2. Customer Impact

Describe the issue from the point of view of our customers. What did they see?

3. Security Impact

Was any system, data or privacy breached?

4. Timeline

Who did what when, and when the problem was resolved.

5. Five Whys

Keep asking why until you reach a root cause. Dissect or deconstruct at every stage.

6. Lessons Learned

What did we learn from this problem?

7. Next Actions

Given the things we learned, what will we do next about this?

Implementation Notes

How to implement this method in practice.