Introduction
It is inevitable that things will go wrong and failures will occur. Failure is a chance to reflect, diagnose, learn and improve.
These pages provide a simple approach for root-cause analysis and documentation:
Cause of Error (CoE) is a method for analysing and documenting a failure, performing a root-cause analysis and identifying follow-up actions. CoE can be applied to any kind of process, people or systems problem. Its primary output is a document with seven sections.
This work is an attempt to provide clear examples and guidance on how to create CoEs effectively.
Background
When things go wrong, rather than blame or punish, we need a method to drive out root causes, be they process, system or people failures. That method should identify measures that fully mitigate the issue, turn them into actions, and deliberately focus on making sure the failure cannot occur again in the same way. We invite people to embrace failure and use it as an opportunity to learn and grow, to improve reliability and to be excellent at post-mortem analysis.
Distributed systems engineering focuses on building systems that are self-healing, highly available and fault-tolerant. Meanwhile, in the real world, systems go wrong every day, most often through a combination of poor testing, poor change and configuration management, and/or a lack of appreciation for single points of failure.
This method is not designed to replace good systems engineering, architecture or operations practice; it is designed to provide a simple blueprint for how to conduct good post-mortem analysis, focusing on driving out a root cause and ensuring that lessons are addressed and actions taken to avoid or mitigate the problem in the future. Its output is a simple document, containing seven sections, which all stakeholders should be able to follow and understand.
By focusing on a simple document structure as its output, CoE aims to build confidence in the process through transparency and a clear, shared description of the underlying issues.
Why Do This?
The benefits of CoE are:
Your customers, stakeholders, peers and management will build trust and gain confidence in you. With a clearly articulated process and output document, they will understand what happened, what you did and what you are doing next. This eases many of the perceived communication issues engineering teams have post-incident, and those stakeholders will leave you alone (for a while).
You can identify patterns in system failures. By analysing patterns in historical CoEs, you can identify process or technical improvements in any delivery pipeline; this can form part of an error budget and help drive product-development priorities in reliability.
You will better focus actions on concrete improvements to systems after a failure.
You and your team will feel more positive about engaging with potential failure scenarios and incidents.
You will learn from failures quickly.
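As a rough illustration of the pattern-finding benefit above, the sketch below tallies root-cause categories across a handful of past CoEs. It is a minimal example under stated assumptions, not part of the CoE method itself: the record layout, the `past_coes` data and the `tally_root_causes` helper are all hypothetical, and the three categories simply reuse the process, system and people split mentioned in the Background.

```python
from collections import Counter

# Hypothetical records: each past CoE reduced to an id and one root-cause category.
# The ids and categories are made up for illustration only.
past_coes = [
    {"id": "coe-001", "root_cause": "system"},
    {"id": "coe-002", "root_cause": "process"},
    {"id": "coe-003", "root_cause": "system"},
    {"id": "coe-004", "root_cause": "people"},
    {"id": "coe-005", "root_cause": "system"},
]


def tally_root_causes(coes):
    """Return (category, count) pairs, most frequent category first."""
    return Counter(coe["root_cause"] for coe in coes).most_common()


if __name__ == "__main__":
    # Print one line per category so recurring causes stand out.
    for category, count in tally_root_causes(past_coes):
        print(f"{category}: {count}")
```

A category that dominates the tally is a candidate for a process or technical improvement; in practice the categories would come from whatever classification your own CoEs use.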
Intended Audience
This site is focused on software and systems engineering, but CoE can easily be applied to other practices.