You are likely well acquainted with the story of the Challenger space shuttle disaster; we use it regularly in our classes on data analysis. On January 28, 1986, the Challenger broke apart 73 seconds after liftoff. An investigation found that low temperatures had prevented an O-ring from sealing, causing a hot gas leak.
The O-ring concern had been raised at a pre-launch meeting but was deemed innocuous, because the data under review did not present a complete picture. The team had data on O-ring incidents above 50°F, but missions with no recorded incidents had been excluded under the misconception that zero events meant zero information, a fatal error as it turned out. The missing data points would have shown that higher launch temperatures were associated with fewer O-ring incidents.
This alone did not answer the question of whether the Challenger disaster could have been predicted, so NASA set out to produce a risk assessment of O-ring failure under the conditions at launch: 31°F. The estimated probability of failure at that temperature was at least 13%, and postponing the launch until temperatures reached 60°F would have reduced the risk to as little as 2%.
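Analyses of this kind are typically done with logistic regression: model the probability of failure as a function of launch temperature, then evaluate the fitted curve at the temperature in question. Here is a minimal sketch in plain Python using gradient ascent. The flight records below are made up for illustration (they only mimic the general pattern of the real data), so the exact numbers will not match the published analysis.

```python
import math

# Illustrative, made-up flight records: (launch temp in °F, any O-ring incident?).
# These are NOT the actual NASA data -- they only mimic its general pattern.
FLIGHTS = [
    (53, 1), (57, 1), (58, 1), (63, 1), (66, 0), (67, 0), (67, 0),
    (68, 0), (69, 0), (70, 1), (70, 0), (72, 0), (73, 0), (75, 1),
    (75, 0), (76, 0), (78, 0), (79, 0), (81, 0),
]

def scale(temp):
    return (temp - 70.0) / 10.0   # center and scale so gradient ascent behaves

def fit_logistic(data, lr=0.05, epochs=10000):
    """Maximize the log-likelihood of P(incident) = sigmoid(b0 + b1 * x)."""
    b0 = b1 = 0.0
    for _ in range(epochs):
        g0 = g1 = 0.0
        for temp, y in data:
            x = scale(temp)
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p          # gradient of log-likelihood w.r.t. intercept
            g1 += (y - p) * x    # ... and w.r.t. the temperature coefficient
        b0 += lr * g0
        b1 += lr * g1
    return b0, b1

def p_incident(b0, b1, temp):
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * scale(temp))))

b0, b1 = fit_logistic(FLIGHTS)
```

With data like this, the fitted probability climbs steeply as temperature falls; comparing the curve at 31°F and 60°F reproduces the qualitative gap described above.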
The work on the Challenger had a huge impact on the risk assessment community, including the formation of a risk assessment group within NASA. Since then, many companies and agencies that have not traditionally been concerned with risk have begun analyzing and implementing resiliency strategies.
Risk and resiliency are really two points on a continuum. To make things safer and more resilient, you need to know not only what can go wrong, but also how best to recover when bad things inevitably happen. The devastation we have witnessed from the recent floods in the Midwest and other regions of the world has underscored the need to develop resilient systems that help communities recover more quickly. One way to think about resilient systems is through the lens of control theory, which studies the behavior of dynamic systems and their response to feedback. Accidents happen; we need to not only reduce the chances of their occurrence, but also build systems and processes that are flexible enough to deal with them. In the Challenger's case, the probability of the explosion could have been reduced by delaying the launch, but another solution would have been to build additional redundancies into the system so that it could recover quickly from a potential accident.
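One way to make the control-theory framing concrete is a toy feedback loop: a simple first-order system gets knocked off its setpoint by a disturbance, and a proportional controller pulls it back. Everything below (the plant dynamics, the gain, the size of the shock) is invented purely for illustration, not a model of any real process.

```python
def simulate(kp, steps=200, dt=0.1, setpoint=1.0, shock_at=50):
    """First-order plant x' = u - x under proportional feedback u = kp * error,
    hit once by a disturbance. Returns the state trajectory."""
    x = setpoint
    history = []
    for t in range(steps):
        if t == shock_at:
            x -= 0.8                      # sudden disturbance knocks the system down
        error = setpoint - x
        u = kp * error                    # feedback: push proportional to the error
        x += dt * (u - x)                 # Euler step of the plant dynamics
        history.append(x)
    return history

recovery = simulate(kp=4.0)
```

After the shock at step 50 the state climbs back toward the setpoint, and a higher gain both recovers faster and settles closer to the target. (Proportional-only control leaves a small steady-state error, which is one reason real controllers add integral action.)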
In short, risk analysis is largely about identifying control variables within a system that carry uncertainty; understanding what that uncertainty means for the rest of the system is the key to building resilient systems. An extremely useful tool in this endeavor is process simulation, or discrete event simulation, a core component of MoreSteam's DFSS training and of the soon-to-be-released "Process Playground" tool in our EngineRoom data analytics application. With this tool, you can model the behavior of processes under different scenarios and different levels of stress so that you can design leaner, stronger, and more resilient systems. Stay tuned for more information on using Process Playground within EngineRoom, coming out soon.
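To give a flavor of what discrete event simulation looks like under the hood (this is a generic sketch, not Process Playground itself), here is a single-server queue with exponential arrival and service times, the classic M/M/1 setup. Pushing utilization toward 100% is one simple way to "stress" a process and watch its resilience erode:

```python
import random

def simulate_queue(arrival_rate, service_rate, n_customers, seed=42):
    """Minimal discrete-event sketch of a single-server queue.
    Returns the average time a customer spends waiting plus in service."""
    rng = random.Random(seed)
    clock = 0.0           # arrival time of the current customer
    server_free_at = 0.0  # when the server finishes its current job
    total_time = 0.0
    for _ in range(n_customers):
        clock += rng.expovariate(arrival_rate)   # advance to the next arrival
        start = max(clock, server_free_at)       # wait if the server is busy
        service = rng.expovariate(service_rate)
        server_free_at = start + service
        total_time += server_free_at - clock     # waiting time + service time
    return total_time / n_customers

# Stress the system: as utilization rises toward 100%, delays balloon.
light = simulate_queue(arrival_rate=0.5, service_rate=1.0, n_customers=5000)
heavy = simulate_queue(arrival_rate=0.95, service_rate=1.0, n_customers=5000)
```

Even this toy model shows the nonlinear behavior that makes simulation so valuable: doubling the load does far more than double the delay, which is exactly the kind of insight you want before a stressor hits the real process.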