What does bad look like?
The most common issues are:
Lack of Telemetry
The most common finding in any root-cause analysis involving system engineering is "we didn't monitor that". Monitoring and observability are never finished, but having little or no telemetry at all in a domain is a major issue and should be remedied as soon as it is discovered.
Change Management
Poor change management is a very common finding in root-cause analyses. Many outages are caused by configuration errors made by system engineers. Most often this is a consequence of inadequate or hard-to-use testing environments.
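Where a full testing environment is not available, even a small pre-deployment check can catch some configuration errors before they reach production. The following is a minimal sketch only, not a prescribed method: the function name validate_config and the keys service_name, listen_port and upstream_hosts are hypothetical and stand in for whatever a real configuration contains.

```python
def validate_config(config: dict) -> list[str]:
    """Return a list of problems found; an empty list means no problems were detected."""
    problems = []

    # Hypothetical required keys, chosen for illustration only.
    for key in ("service_name", "listen_port", "upstream_hosts"):
        if key not in config:
            problems.append(f"missing required key: {key}")

    # A port must be an integer in the valid TCP/UDP range.
    port = config.get("listen_port")
    if port is not None and not (isinstance(port, int) and 1 <= port <= 65535):
        problems.append(f"listen_port out of range: {port!r}")

    # At least one upstream host is needed for the service to be useful.
    hosts = config.get("upstream_hosts")
    if hosts is not None and (not isinstance(hosts, list) or not hosts):
        problems.append("upstream_hosts must be a non-empty list")

    return problems
```

A change pipeline could run a check like this against every proposed configuration and refuse to apply the change when the returned list is not empty.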
Lack of Support
The second most common finding in any root-cause analysis involving system engineering is "we didn't have time for that". Technical debt and non-functional requirements are treated as second-class citizens in many organisations. Incident response highlights these gaps and focuses attention on them.
Communication
A very common problem during an incident is insufficient communication from the response team to the outside world. This makes stakeholders anxious, which leads them to contact the response team directly. Responding to stakeholders ad hoc rapidly distracts (and upsets) the response team, further hampering effective recovery actions.
This is an easy issue to solve: appoint one person in the response team to handle all outbound communication.