Introduction

It is inevitable that things will go wrong and failures will occur. Failure is a chance to reflect, diagnose, learn and improve.

These pages provide a simple approach for root-cause analysis and documentation:

Cause of Error (CoE) is a method for analysing and documenting a failure, performing a root-cause analysis and identifying follow-up actions. CoE can be applied to any kind of process, people or systems problem. Its primary output is a document with seven sections.

This work is an attempt to provide clear examples and guidance on how to create CoEs effectively.

Background

When things go wrong, rather than blame or punish, we need a method to drive out root causes, be they process, system or people failures; identify measures that fully mitigate the issue; take action; and deliberately focus on making sure that the failure cannot occur again in the same way. We invite people to embrace failure and use it as an opportunity to learn and grow, to improve reliability and to be excellent at post-mortem analysis.

Distributed systems engineering focuses on building systems that are self-healing, highly available and fault-tolerant. Meanwhile, in the real world, systems go wrong every day, most often through a combination of poor testing, poor change and configuration management, and/or a lack of appreciation for single points of failure.

This method is not designed to replace good systems engineering, architecture or operations practice; it is designed to provide a simple blueprint for how to conduct good post-mortem analysis, focusing on driving out a root cause and ensuring that lessons are addressed and actions taken to avoid or mitigate the problem in the future. Its output is a simple document, containing seven sections, which all stakeholders should be able to follow and understand.

By focusing on a simple document structure as its output, CoE tries to build confidence in the process through transparency and a clear description of the underlying issues.

Why Do This?

The benefits of CoE are:

Your customers, stakeholders, peers and management will build trust and gain confidence in you. With a clearly articulated process and output document, they will understand what happened, what you did and what you are doing next. This goes a long way towards resolving the communication issues that engineering teams are often perceived to have post-incident; they will leave you alone (for a while).

You can identify patterns in system failures. By analysing patterns in historical CoEs, you can identify process or technical improvements in any delivery pipeline; this can form part of an error budget and help drive product-development priorities in reliability.

You will better focus actions on concrete improvements to systems after a failure.

You and your team will feel more positive about engaging with potential failure scenarios and incidents.

You will learn from failures quickly.

Intended Audience

This site is focused on software and system engineering, but CoE can easily be applied to other practices.

Cause of Error

1. Summary

A simple description of what happened.

2. Customer Impact

Describe the issue from the point of view of our customers. What did they see?

3. Security Impact

Was any system, data or privacy breached?

4. Timeline

Who did what and when, and when the problem was resolved.

5. Five Whys

Keep asking why until you reach a root cause. Dissect or deconstruct at every stage.

6. Lessons Learned

What did we learn from this problem?

7. Next Actions

Given the things we learned, what will we do next about this?

Implementation Notes

How to implement this method in practice.
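
As one possible starting point, the seven sections can be captured in a small template so that every CoE has the same shape. The sketch below is an illustration only and is not part of the CoE method itself: the class names, field names and the sample incident are all hypothetical, and it assumes the root cause is the answer to the last "why" in the chain.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TimelineEntry:
    when: str  # free-form timestamp
    who: str
    what: str


@dataclass
class CauseOfError:
    """Hypothetical container for the seven CoE sections."""
    summary: str
    customer_impact: str
    security_impact: str
    timeline: List[TimelineEntry] = field(default_factory=list)
    # Each entry is a (question, answer) pair in the order it was asked.
    five_whys: List[Tuple[str, str]] = field(default_factory=list)
    lessons_learned: List[str] = field(default_factory=list)
    next_actions: List[str] = field(default_factory=list)

    def root_cause(self) -> Optional[str]:
        # Assumption: the answer to the final "why" is the root cause.
        return self.five_whys[-1][1] if self.five_whys else None

    def render(self) -> str:
        # Produce a plain-text document with the seven numbered sections.
        sections = [
            ("1. Summary", [self.summary]),
            ("2. Customer Impact", [self.customer_impact]),
            ("3. Security Impact", [self.security_impact]),
            ("4. Timeline",
             [f"{e.when}  {e.who}: {e.what}" for e in self.timeline]),
            ("5. Five Whys", [f"{q} {a}" for q, a in self.five_whys]),
            ("6. Lessons Learned", self.lessons_learned),
            ("7. Next Actions", self.next_actions),
        ]
        out: List[str] = []
        for heading, body in sections:
            out.append(heading)
            out.append("")
            out.extend(body)
            out.append("")
        return "\n".join(out).rstrip() + "\n"


# Entirely made-up incident, used only to show the shape of the output.
coe = CauseOfError(
    summary="Checkout requests failed after a configuration change.",
    customer_impact="Customers saw an error page when paying.",
    security_impact="No system, data or privacy breach was found.",
    timeline=[
        TimelineEntry("09:00", "on-call engineer", "deployed new config"),
        TimelineEntry("09:40", "on-call engineer", "rolled back; resolved"),
    ],
    five_whys=[
        ("Why did checkout fail?", "The service loaded an invalid config."),
        ("Why was the config invalid?", "A required key was missing."),
        ("Why was this not caught?",
         "The deploy pipeline had no configuration validation step."),
    ],
    lessons_learned=["Configuration changes need the same checks as code."],
    next_actions=["Add a validation step to the deploy pipeline."],
)

print(coe.render())
```

Keeping the whys as an ordered list of question and answer pairs preserves the chain of reasoning, so readers can see how each answer led to the next question rather than only the final conclusion.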