The investigation of incidents provides important lessons for improving safety, environmental performance and profits.
This knol is based on Chapter 12 of the book Process Risk and Reliability Management.
The thorough investigation and analysis of incidents (both actual events and near misses), along with the appropriate follow-up, provides one of the most effective means of improving the safety and reliability of process facilities. Other risk management programs, such as hazards analysis and management of change, are directed toward anticipating problems so that corrective actions can be taken before an event occurs. Yet, in spite of their undoubted value, these predictive techniques do have the following limitations:
- Such analyses are, of necessity, theoretical and speculative; there can be no assurance that all plausible events have actually been identified. Indeed, it is more than likely that some important failure mechanisms will be overlooked.
- It is difficult to predict the true level of risk associated with each identified event because estimated values of both consequence and likelihood are usually very approximate. In particular, predictions as to what might happen are invariably colored by the personal experiences of the persons carrying out the analysis. If a person has witnessed an event, he or she is likely to assign it a high probability of occurring again, and vice versa.
- Most serious events have multiple causes, some of which appear to be totally implausible or even weird ahead of time (which is why serious accidents so often seem to come out of the blue). Even the best qualified hazards analysis team will have trouble identifying such multiple-contingency events, and/or persuading others to treat the finding as being plausible.
- It is very difficult to predict and quantify human error — yet most events involve such error.
Actual incidents, by contrast, provide hard information as to how things can go wrong, thus helping to cut through wishful thinking, prejudice, ignorance and misunderstandings. The root cause analysis that follows an incident investigation will help identify weaknesses and limitations in a facility’s management system.
Another reason for emphasizing the importance of incident investigation is that process safety management (PSM) systems — of which Incident Investigation and Analysis constitutes one element — have been in place in many cases for more than fifteen years. Many of these facilities have made good progress in meeting regulatory requirements. However, the fact that such systems can ‘survive an audit’ and are working well on paper does not mean that they are as effective at actually improving safety as they might be. Incident investigations help identify how the elements of PSM really are functioning, and can provide management with insights as to how the PSM program can be improved.
Incident Investigation and Analysis Methods
Publications in the field of incident investigation and analysis often promote a particular methodology with the implicit claim that their approach is better than the methods promulgated by other organizations. Such publications are often commercial in their approach, thus tending to create a concern in the mind of the reader as to the objectivity of the materials that are presented.
This knol does not advocate or promote any particular methodology. Indeed, it is suggested here that an effective incident investigation and analysis requires much more than the mere application of a particular investigation technique. Equally important — maybe more so — is the ability on the part of the investigators to inculcate an atmosphere of trust and confidence with everyone with whom they work — not only those involved in the incident itself, but also the managers who will be charged with taking appropriate corrective actions. Each analytical technique has its strengths and weaknesses — an effective investigation will use a judicious mix of approaches as circumstances dictate.
Therefore, rather than stress the use of just one particular analytical method, this knol suggests that a successful investigation should be conducted through use of the six strategies and techniques listed below and also shown in Figure 1.
- Establish trust, and thereby encourage candid discourse with those involved in the event and also from the managers responsible for follow-up.
- Listen to what people actually say, base all findings on verifiable facts, and be thorough in all phases of the investigation.
- Establish a clear cause and effect chart — backed up with solid evidence — integrated into a timeline.
- Use technical experts to explain specialized issues.
- Develop an understanding of root causes and systemic issues at different management levels.
- Manage an incident investigation and analysis as a project with its own schedule, budget and deliverables.
Elements of Successful Incident Investigation and Analysis
The most important feature of a successful investigation is the establishment of trust between the investigators (who are not interrogators) and the persons involved in the incident itself. In one instance a technician whose actions had contributed to the occurrence of an injury event approached his boss twenty-four hours after the interviews had been concluded; he voluntarily reported that a valve that should have been open at the time of the incident was actually closed. Without that information it is unlikely that the investigating team would ever have fully understood what happened. The technician was not the only person who was candid. His boss, who had twenty-five years of experience with the equipment involved, took the initiative to work out the complex sequence of events that led to the incident, even though the upshot was to make both his own company and himself look more accountable for the event. The integrity and candor displayed by these two persons were crucial to the quality of the investigation (and neither of them got into any kind of trouble).
It is also important to establish trust with the managers of the facility where the incident occurred. This can be done by ensuring that the investigation team keeps management fully informed at each stage in the investigation process. Thus the project becomes “our investigation”, not “their investigation”.
Listen to the Facts
Many incident investigators are intelligent, highly experienced, and are not lacking in self-confidence. Although these attributes are important they can get in the way of simply listening to the facts. For example, during one investigation, the team members noted that a piece of equipment was damaged. By assuming that the damage occurred immediately prior to the event a plausible explanation as to what happened was developed. Unfortunately for the credibility of the team members who had jumped to the (incorrect) conclusion as to what had happened, a manager who arrived at the site a few hours after the incident noted that the equipment had not been damaged at that time; therefore the damage must have occurred when the affected equipment item was being removed for inspection and repair. This inconvenient fact overturned the investigators’ elegant and satisfying analysis.
An investigator must always be thorough — particularly when he or she thinks that the investigation is complete, and no more fact finding work is needed. For example, on one investigation the equipment involved in the event was moved from its location in the field to the vendor’s yard. The lead investigator felt that there was really no point in going to the site of the incident because there would be nothing to be learned. Nevertheless, he did visit the site as a matter of duty. When he did so he uncovered some new information that led to a basic reassessment as to how serious the event could have been.
Cause and Effect
Virtually all investigations include the development of a timeline. By ordering the events sequentially it becomes possible to determine their causes, and then the causes of those causes.
One engineering company had as its motto, “There’s no substitute for knowing what you’re doing”. In many investigations it is found that a real expert is needed in order to establish the technical details as to what happened. As already noted in the example provided above, a senior manager who had twenty-five years of experience with the equipment involved took it upon himself to determine what happened. Without his insight, knowledge and experience the investigation team would have taken much longer to determine what happened; indeed, they might never have done so.
Root Cause Analysis
Once the facts have been established and the event is understood, a root cause analysis can be carried out in order to apply lessons learned to a broader set of circumstances. There are four types of root cause analysis. They are:
Argument by analogy (story-telling);
Each of these approaches can be of value; to purposely exclude any of them, particularly for commercial reasons, is short-sighted. (Further discussion of these four approaches is provided in the book Process Risk and Reliability Management that was referenced at the start of this knol.)
One of the difficulties associated with many investigations is that they tend to suffer from “scope creep”. They grow and grow and grow with root causes being piled upon root causes, without any clear idea as to when the end point has been reached. As one manager once sarcastically observed, “The team seems to be trying to solve world poverty”. It must be understood by all the team members that an investigation is a project, just like any other project. As such it needs a budget, a schedule, a clear scope of work and a contingency plan for when things go awry.
A particularly common project management difficulty is that an investigation proceeds well until the root causes of the incident have been established, at which time the investigators are instructed to return to doing their ‘real work’. Therefore the ‘danger point’ to watch for with respect to project management is that time between completion of the analysis and the writing of the report.
An effective incident investigation and analysis program generally contains two major components: technical and human. The technical side of the investigation is what most publications in this area focus on, particularly with regard to root cause analysis. However, what does not always receive the same degree of attention is the human aspect of incident investigation work. An effective investigator understands how people think and behave. Consequently he or she must be able to communicate with a wide range of people, particularly those listed below.
- Front-Line Technicians
Most incidents involve front-line technicians (operators and maintenance workers), some of whom may have been injured or emotionally shaken. These people will often be feeling defensive and upset. They may also be feeling guilty if any of their colleagues were injured or died.
Technicians often may not understand what caused the incident, but they worry that they will be blamed anyway. An effective investigator encourages these front-line technicians to be open and candid, primarily by simply shutting up and letting them talk. Unfortunately, many investigators, who often possess years of experience, are much too quick to interrupt the technician’s narrative flow with questions, war stories or snap judgments as to what happened. An investigator should also make it clear to the technicians that the goal of the investigation is to find out what happened, not to apportion blame or to demonstrate how smart the investigator is.
- Mid-Level Managers
Most investigations find that changes are needed at the facility’s mid-level management systems. Examples of such changes include an increased emphasis on equipment inspection, upgraded operating procedures and more training for the technicians. The implementation of such changes requires that the facility managers commit scarce resources that they would prefer to spend on achieving other goals. An effective investigator will empathize with these mid-level managers, and will understand the demands that are being placed on the organization by the investigation and its follow‑up.
- Senior Managers
Many investigators find that technicians are candid and open, and mid-level managers are generally willing to honestly address the need for improvements to the facility’s systems. What these investigators sometimes find, however, is that senior managers can be more resistant to the findings and implications of an investigation. These findings may indicate that systemic changes to the company’s management systems are required; the senior managers in charge of such systems can become quite defensive — they don’t like being told that their baby is ugly. Hence an effective investigator will know how to communicate with these senior managers, and how to get their buy-in — not least because they are the ones who provide the funding needed to implement the investigation’s recommendations.
An additional concern regarding the involvement of senior managers is that they are usually strong personalities; they may try to take over the investigation and direct it to meet their own opinions, goals and agendas. A strong investigator is able to resist these blandishments.
Words and phrases such as ‘incident’, ‘accident’ and ‘near miss’ tend to be used quite loosely in general conversation. They also tend to have different connotations in English, American and Canadian usage. However, in the context of formal incident investigation and analysis such words need to be tightly defined. The definitions used for these terms in this knol are provided below.
An incident is an event that has either caused harm or loss, or that has the potential to cause harm or loss, and that could have been prevented or reduced in severity through use of the company’s management systems or by improvements to those systems.
The key to the above definition of the term ‘incident’ is that the event be preventable through use of the facility’s normal management systems, thus excluding bizarre external events such as an airplane crashing into the facility. However, many external events, such as earthquakes or very severe weather, can be anticipated and should therefore be considered in the design and operation of the facility and in the development of the emergency response program.
Some incidents are outside the control of the facility managers; such incidents require attention at a higher level. For example, most large corporations have a procurement policy that is used throughout the whole company. If an incident investigation at one site shows that problems with procurement were a contributing factor then the corrective action will probably have to be addressed at the corporate level.
The definition of the word ‘incident’ covers not just safety and environmental harm but also economic loss. Most of the literature to do with incident investigation and analysis focuses on safety-related events. But there is no reason why the techniques developed to investigate and understand such events could not also be used to address lost production, reduced efficiency and unexpected equipment failure.
The word ‘accident’ is not used in this publication because the word implies surprise and lack of controllability. There is nothing anyone can do about accidents. The whole point of an incident investigation and analysis program is that all aspects of an operation are under the control of management. Only unpredictable external events such as the airplane crash alluded to above are true accidents.
Near Miss / Hit
The term ‘near miss’ (which may better be called ‘near hit’) describes an incident that did not result in an actual loss but that had the potential to do so. For example, if an object is dropped from a crane but no one is hurt then the incident is a near miss. In terms of fault tree analysis a near miss is an event in which one or more of the inputs to an AND Gate was negative. Near misses, particularly those that could have had high consequences, should be investigated thoroughly because they are a strong indicator of system failures. They are a free lesson learned.
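The AND Gate description of a near miss can be illustrated with a short sketch. The function name, the condition names and the ‘no event’ label for the case in which no input occurred are assumptions made for illustration; they are not taken from a fault tree standard or from the book.

```python
def classify_and_gate(inputs):
    """Classify the state of one AND gate in a fault tree.

    `inputs` maps each contributing condition to True (it occurred)
    or False (it did not occur, i.e. the input was negative).
    """
    if not inputs:
        raise ValueError("an AND gate needs at least one input")
    occurred = [name for name, happened in inputs.items() if happened]
    if len(occurred) == len(inputs):
        return "incident"   # every input occurred, so the gate output occurs
    if occurred:
        return "near miss"  # some inputs occurred, at least one was negative
    return "no event"       # assumption: label used when nothing occurred


# Hypothetical dropped-object example: harm requires both conditions.
dropped_object = {
    "object falls from crane": True,
    "person is underneath": False,
}
print(classify_and_gate(dropped_object))
```

The sketch treats the gate as a simple Boolean AND; a real fault tree would normally attach probabilities to each input.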
The following are examples of near misses:
- Process conditions go outside safe operating limits, but are returned to normal before anything untoward occurs;
- An emergency shut down system is unnecessarily activated;
- A safeguard such as a relief valve or fire suppression system is called upon to operate;
- A hazardous chemical is released but does not affect workers in the area.
A potential incident creates the possibility of an event, but nothing actually happens. The key difference between a near miss and a potential incident is that, with a near miss, an event did take place but the consequences were minor. With a potential incident nothing happened at all. For example, if a worker drops a wrench from an upper deck and it hits the floor three stories below but no one is hurt then a near miss has taken place. If the same worker holds the same wrench such that, were he to drop it, it would fall to a lower deck, then he has created a potential incident.
Potential incidents can be classified as either unsafe acts or unsafe conditions. The worker who holds the wrench such that it may fall has committed an unsafe act. If he fails to secure the area immediately below him with barricade tape then an unsafe condition has been created.
Failures to employ authorized management systems properly can also be considered as being potential incidents. For example, if a maintenance manager authorizes a change to a piece of equipment without following his facility’s Management of Change procedure then his decision has created a potential incident.
High Potential Incident
A High Potential Incident (HPI) is a potential incident which, were it to have occurred, would have led to major loss. For example, if a toxic gas leaks from a flange into the atmosphere but no one is present then an HPI has occurred. No one was present, but the potential for a fatality existed.
In general, if a system has had to use its last safeguard, then the incident is probably high potential. For example, if the pressure in a vessel rises above the safe limit (see below) but the safety instrumentation systems bring the pressure back under control, then a potential incident has occurred. However, if the instrumentation does not work, and the high pressure has to be taken care of by the vessel’s pressure relief valve, then the last line of defense has been used and the incident can be considered to be high potential.
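The distinctions drawn above between near misses, potential incidents and high potential incidents can be sketched as a simple classifier. This is one possible reading of the definitions, not a method prescribed by the text: the field names, the label strings and the rule for combining the ‘high potential’ flag with the other terms are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Event:
    description: str
    something_happened: bool           # did an event actually take place?
    loss_occurred: bool                # was there actual harm or loss?
    major_loss_possible: bool = False  # could it have led to major loss?
    last_safeguard_used: bool = False  # was the last line of defense needed?


def classify(event):
    """Return a label for the event using the definitions in this section."""
    if event.loss_occurred:
        return "incident with actual loss"
    base = "near miss" if event.something_happened else "potential incident"
    # Assumption: either flag is enough to treat the event as high potential.
    if event.major_loss_possible or event.last_safeguard_used:
        return base + " (high potential)"
    return base


# Hypothetical examples based on the wrench and pressure vessel cases.
examples = [
    Event("wrench dropped three stories, no one hurt", True, False),
    Event("wrench held over the edge of the deck", False, False),
    Event("overpressure handled by the relief valve", True, False,
          last_safeguard_used=True),
]
for e in examples:
    print(f"{e.description}: {classify(e)}")
```

In practice the judgment as to whether major loss was possible is the difficult part; the sketch only records that judgment once it has been made.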
The end result of an incident analysis is the development of one or more root causes. A careful study of root causes suggests ways in which management can implement systemic improvements throughout the organization.
It can be argued that there is actually no such thing as a root cause. Every event has a cause; that cause, in turn, is an event which itself has cause(s), and so on. The chain can regress indefinitely (in principle, back to the formation of the earth), thus creating what has been referred to as the ‘Root Cause Myth’. Nevertheless, a working definition for root cause is required. One definition (adapted from Paradies 2000) is as follows:
A root cause of an incident is the most basic cause that can reasonably be identified and that management can change.
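The working definition above can be illustrated by walking a chain of causes and keeping the most basic one that management can change. The sketch assumes a single linear chain, whereas real events usually have several causes at each level; the class, the function and the example chain are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cause:
    description: str
    management_can_change: bool
    caused_by: Optional["Cause"] = None  # the next, more basic cause


def find_root_cause(immediate):
    """Return the most basic cause in the chain that management can change."""
    root = None
    current = immediate
    while current is not None:
        if current.management_can_change:
            root = current  # deepest changeable cause seen so far
        current = current.caused_by
    return root


# Hypothetical chain, listed from the immediate cause to the most remote one.
chain = Cause(
    "valve left closed", True,
    Cause("operating procedure omitted the valve", True,
          Cause("no periodic review of procedures", True,
                Cause("commodity prices limit budgets", False))))

root = find_root_cause(chain)
print(root.description)
```

The regress stops at the procedure review program because the cause behind it, although real, is not something that management can change.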
Six-Step Investigation and Analysis Process
The process of investigating and analyzing incidents that is used throughout this knol can be divided into the six steps shown in Figure 3.
Incident Investigation/Analysis Steps
Step 1 — Initial Investigation
The initial investigation, which can start as soon as the emergency response is completed, is carried out by what is sometimes referred to as the ‘Go Team’. Speed is of the essence at this stage of the investigation. The Go Team provides management with quick feedback as to what happened and what immediate corrective actions may need to be taken at other sites or facilities that share the same technology. The team collects information as soon as possible, particularly information that may change quickly such as that to do with process conditions. The team should not disturb evidence, except as needed to ensure the continued safety of the facility. One of the team’s most important tasks is to interview participants and witnesses as soon as possible. People’s memories quickly fade; their memories also change based on what they think should have happened or on what other people tell them. It is vital that these people be asked to tell their story as soon as possible.
During the initial investigation the team will start to develop the document and information management systems that will be needed for the remainder of the project.
Step 2 — Evaluation and Team Formation
Following the initial investigation, management will evaluate the seriousness of the incident and assess the potential it provides for lessons learned. Management must also decide as to how detailed the investigation should be. This means that a method for evaluating the seriousness of events — particularly near misses and potential incidents — has to be selected.
Based on their incident evaluation, management will put together the formal incident investigation team (which will usually have a similar composition to that of the Go Team) as illustrated in Figure 4.
Incident Investigation Team Structure
At the top of Figure 4 are the sponsor and the incident owner. The sponsor is a senior executive, usually with line authority over the persons involved in the incident. He or she will authorize the terms of reference for the investigation and fund the work. The incident owner will typically be the line manager over the facility where the incident occurs. The owner may not spend a lot of time working with the team, but he or she provides direction, ensures that the terms of reference are being followed and is the recipient of the final report. Figure 4 also shows that the owner has delegated the task of managing the information to do with the incident to the Process Safety Management (PSM) coordinator, who uses an incident register to document the progress of the investigation and to manage the subsequent follow-up.
A major incident investigation can have as many as three investigative teams, one for each of the steps shown in Figure 4 (Go Team, Investigation Team and Analysis Team). The composition of the teams will depend on a variety of issues such as the seriousness of the event, the likelihood of litigation, and the technical aspects of the incident. It is likely that the three teams will have many members in common, but it is useful to make the distinctions shown in Figure 4 so that the team members have a clear idea as to their role at each stage in the process. (For example, the analysis team may have a member whose only role is to help the team understand the incident investigation methodology that has been selected and to run the applicable software.)
As an incident investigation proceeds the teams will be required to brief management as to their findings on a regular basis. The frequency and level of detail of the briefings will naturally depend on the severity of the event. Figure 4 outlines a representative reporting procedure. The Go Team issues its first report which summarizes the major issues to do with the incident. The formal investigation team issues one or more interim reports as it progresses with its work. Lastly, the analysis team delivers the final report, containing the root cause analysis, the findings and suggested action items.
Step 3 — Information Gathering
Referring once more to Figure 3, after the team has been assembled and the terms of reference generated, the first step in the formal investigation process itself is to collect information about what happened. The information will generally come from interviews, documents such as log books, instrument records and field observations made during site inspections. At this stage of the investigation it is especially important not to jump to conclusions but to let the facts speak for themselves.
Step 4 — Timeline Development
Once the information-gathering step is more or less complete the investigation team can develop a timeline that outlines the sequence of events. As the timeline is developed it will become clear that certain information items are either missing or not detailed enough, so the team will go back to Step 3 — Information Gathering.
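The loop between Step 3 and Step 4 can be sketched as follows: the records are placed in time order, and any entry that lacks supporting evidence is flagged so that the team knows what to collect when it returns to information gathering. The record layout, the dates and the rule that an empty list of sources marks a gap are assumptions made for illustration.

```python
from datetime import datetime


def build_timeline(records):
    """Sort records chronologically and list the events that need follow-up."""
    timeline = sorted(records, key=lambda r: r["time"])
    # Assumption: an event with no supporting source counts as a gap.
    gaps = [r["event"] for r in timeline if not r["sources"]]
    return timeline, gaps


# Hypothetical records, entered in the order in which they were collected.
records = [
    {"time": datetime(2024, 1, 5, 14, 20), "event": "high-pressure alarm",
     "sources": ["instrument record"]},
    {"time": datetime(2024, 1, 5, 14, 5), "event": "valve closed",
     "sources": []},
    {"time": datetime(2024, 1, 5, 14, 32), "event": "relief valve lifts",
     "sources": ["interview", "log book"]},
]

timeline, gaps = build_timeline(records)
for r in timeline:
    print(r["time"].strftime("%H:%M"), r["event"], r["sources"])
print("Return to Step 3 for:", gaps)
```

A fuller version would also record the cause and effect links between events, so that the timeline and the cause and effect chart stay integrated.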
Step 5 — Root Cause Analysis
Individual incidents are usually indicative of a broader range of management or system problems. Simply correcting the actions and events that led to the particular incident that is being investigated represents an opportunity missed — what is needed is a process for identifying underlying or root causes so that a broader range of future incidents can be avoided. This is done through the process of root cause analysis.
Step 6 — Report and Recommendations
The final step in an investigation as shown in Figure 4 is to issue a report to all affected parties and then to take the appropriate corrective actions to prevent recurrence of similar events. Typically the report will summarize the event itself, the root causes that were identified and recommendations as to how the findings may be addressed. Follow-up of the recommendations will not usually be the responsibility of the investigation team; that activity is the responsibility of the facility management, particularly the sponsor.