Cause of Error

Implementation Notes

The most important aspect of any root cause method is that it is conducted in an atmosphere of psychological safety. People will not discuss their own failings, or be candid and reflective about system or process failings, if they feel that the exercise is a chance to punish or blame.

Many organisations pretend to be No Blame, but in commercial and highly regulated industries managers are held accountable and are often predisposed to find someone to blame when things go wrong, rather than see a failure as a learning opportunity for themselves, the people involved, or the organisation at large.

Post-mortem and recovery will not be successful without a full commitment to psychological safety throughout the process.

CoE should be a forensic, detailed, emotion-free process; blame actively distorts it.

Psychological Safety and No Blame

The biggest barriers to effective root cause analysis are poor-quality engagement by stakeholders and the involvement of management.

There is a natural tendency to look for people to blame when things go wrong, but in most cases process and technology failures are more culpable than individuals.

Even when humans are at fault, they are much less likely to engage honestly with a process they feel is out to punish them.

Whilst negligence, dishonesty and incompetence are reasonable grounds for "blame", they are not helpful when diagnosing and recovering from a large-scale catastrophic event and then rapidly optimising that system for reliability; in other words, even if you must blame and punish, do it afterwards.

Be serious about not blaming or punishing the people involved in the failure, particularly those working on the recovery, as they may not have been part of the root cause or original failure. Humour is important in business, but during a failure event most people will not appreciate the joke. Don't make jokes about the failure or the people within it.

No blame is difficult, and managers in particular will default to blame. It takes concerted, serious effort to maintain an effective root-cause analysis process. A strong collateral-based process, with outcomes that everyone can understand, will go a long way towards abating any undue pressure to "find a culprit". By sharing responsibility and making a strong commitment to growth and learning from failure, any negative event can be turned into a positive one.

Timing and Incident Response

The normal cadence of a CoE process is to produce the first version within 24-48 hours of the original event occurring. Data gathering should begin as part of the incident process; NASA's "lock the doors" is a particularly useful idea. In most cases the root cause or causes will be known by then, but Next Actions might not be completed. It is normal then to summarise and republish the CoE at the end of the process, perhaps months later, once all Next Actions are fully addressed.
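As a small illustration of the cadence, the sketch below computes the 24-hour target and 48-hour limit for the first CoE from an incident timestamp. The function name and the example incident time are hypothetical; only the 24-48 hour window comes from the text above.

```python
from datetime import datetime, timedelta, timezone


def first_coe_window(incident_time):
    """Return (target, latest) times for publishing the first CoE."""
    target = incident_time + timedelta(hours=24)
    latest = incident_time + timedelta(hours=48)
    return target, latest


# Hypothetical incident time, used only for illustration.
incident = datetime(2022, 1, 20, 9, 30, tzinfo=timezone.utc)
target, latest = first_coe_window(incident)
print(f"First CoE: aim for {target:%Y-%m-%d %H:%M}, no later than {latest:%Y-%m-%d %H:%M} UTC")
```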

Lock The Doors

"Lock the doors" is a command given by a Flight Director during a NASA failure incident, such as a loss of vehicle. It begins a FREEZE, ISOLATE and PROTECT process. Access to the facility becomes restricted, and those people who were not involved in the incident are removed; communication into and out of the room is also restricted. The FREEZE and ISOLATE steps are designed to preserve evidence in the facility, and ensure that those who were involved are not influenced or affected by communicating with others who were not there. The PROTECT step allows those involved to preserve and gather any data required to perform the root-cause analysis.

The contents of the bins are preserved.

Most of us are not NASA, but these steps have many parallels in a commercial environment.

It is vitally important that an incident response team preserves any required data as close to the incident as possible. Although care should always be taken with engineers potentially working long hours after an incident, it is better to gather data in the hours after a problem than to risk losing it to archival or other time-related issues by leaving the gathering and analysis until long after the incident.
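One way to preserve data close to the incident is to copy the relevant files into a dedicated evidence folder and record a checksum for each, so later analysis can show the copies were not altered. The following is a minimal sketch under assumptions: the function name, the folder layout and the manifest format are all hypothetical, and it assumes the source files have distinct names.

```python
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path


def preserve_evidence(sources, evidence_dir):
    """Copy each source file into evidence_dir and write a SHA-256 manifest.

    Assumes source file names are unique; a same-named file would be overwritten.
    """
    evidence_dir = Path(evidence_dir)
    evidence_dir.mkdir(parents=True, exist_ok=True)
    manifest = []
    for source in map(Path, sources):
        target = evidence_dir / source.name
        shutil.copy2(source, target)  # copy2 also tries to keep file timestamps
        manifest.append({
            "source": str(source),
            "copy": str(target),
            "sha256": hashlib.sha256(target.read_bytes()).hexdigest(),
            "captured_at": datetime.now(timezone.utc).isoformat(),
        })
    # The manifest lives next to the copies (hypothetical file name).
    (evidence_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

Calling preserve_evidence(["app.log"], "incident-evidence") would copy a hypothetical app.log and record its checksum. Taking the copy during the incident process, rather than days later, is the point of the exercise.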

An important aspect of "Lock The Doors" is keeping out people who were not involved, to stop them distracting or influencing the analysis, and particularly to stop them influencing people's immediate recollection, communication and evidence about the incident. Some may find this a useful idea when managing stakeholders, particularly senior management, in the early part of an incident/failure response and particularly during the post-incident phase.

Produce a single document with seven sections:

1. Summary

A simple description of what happened.

2. Customer Impact

Describe the issue from the point of view of our customers. What did they see?

3. Security Impact

Was any system, data or privacy breached?

4. Timeline

Who did what when, and when the problem was resolved.

5. Five Whys

Keep asking Why until you have a root cause. Dissect or deconstruct at every stage.

6. Lessons Learned

What did we learn from this problem?

7. Next Actions

Given the things we learned, what will we do next about this?
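The seven sections above can be captured as a simple template so that every CoE has the same shape and empty sections are easy to spot. This is only a sketch: the class name, field names and example incident are hypothetical, and a shared document or wiki page serves equally well.

```python
from dataclasses import dataclass, field

# (attribute name, heading) pairs, in the order the sections appear.
SECTIONS = [
    ("summary", "Summary"),
    ("customer_impact", "Customer Impact"),
    ("security_impact", "Security Impact"),
    ("timeline", "Timeline"),
    ("five_whys", "Five Whys"),
    ("lessons_learned", "Lessons Learned"),
    ("next_actions", "Next Actions"),
]


@dataclass
class CoEReport:
    summary: str = ""
    customer_impact: str = ""
    security_impact: str = ""
    timeline: list = field(default_factory=list)       # "when, who, what" strings
    five_whys: list = field(default_factory=list)      # one entry per "Why?" answered
    lessons_learned: list = field(default_factory=list)
    next_actions: list = field(default_factory=list)

    def missing_sections(self):
        """Headings of the sections that are still empty."""
        return [title for attr, title in SECTIONS if not getattr(self, attr)]

    def render(self):
        """Render the report as plain text with numbered headings."""
        lines = []
        for number, (attr, title) in enumerate(SECTIONS, start=1):
            value = getattr(self, attr)
            lines.append(f"{number}. {title}")
            if not value:
                lines.append("(not yet written)")
            elif isinstance(value, str):
                lines.append(value)
            else:
                lines.extend(f"- {item}" for item in value)
            lines.append("")
        return "\n".join(lines)


# Hypothetical incident, used only for illustration.
report = CoEReport(
    summary="Checkout was unavailable for 40 minutes.",
    five_whys=[
        "Why did checkout fail? The payment service ran out of disk space.",
        "Why did the disk fill up? Log rotation was disabled.",
    ],
)
print(report.render())
print("Still to write:", ", ".join(report.missing_sections()))
```

Listing the missing sections makes it easy to publish a first version early and republish once Next Actions are complete.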

v0.1 22/01/22