The Basic Facts
In October 2008, the city of L’Aquila in Italy began experiencing earth tremors. Given that the city has been pretty much destroyed by earthquakes twice in the past, this was a matter of some concern. For the next six months the tremors continued. On March 31 the National Committee for the Prediction and Prevention of Major Risks held a meeting, and a civil service spokesman reassured the public that there was no immediate danger. Within a week, more than 300 people were killed when a major earthquake struck.
Seven members of the National Committee were charged with manslaughter for providing the public with information that was “inexact, incomplete and contradictory”, and on Monday this week they were sentenced to jail terms of six years. There are many misleading summaries of this case available, perhaps because the prosecution and defence cases barely engaged with each other. The defence, both in court and in public, maintained that it was not reasonable to expect scientists to predict the timing of an earthquake. However, that isn’t what the prosecution was alleging. The prosecution case was that the scientists didn’t communicate the risk appropriately. According to the prosecution, at least 29 of the victims stayed in an unsafe situation because they received misleading information about the risk they faced.
Two news stories with accurate portrayal:
There are three underlying questions here:
1) Can you do risk estimation for an earthquake?
2) How should you communicate risk information of this sort?
3) Should you get six years in jail if you do either of these incorrectly?
Risk Estimation for Earthquakes
Unlike most infrequent high-consequence events, the risk of an earthquake in certain areas does build up over time. So while it is nonsense to say that we are “overdue” for a big storm, a flood, or a stock-market crash, it does actually make some sense to say that an area is overdue for an earthquake. These are still low probabilities though – Southern California, for example, is waiting for the next big one, with around a 2% chance of a major quake in the next thirty years.
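For back-of-the-envelope purposes, a multi-year probability like this can be converted to an annual figure. The sketch below assumes a constant, memoryless hazard rate – which is exactly the assumption that fails for faults where strain accumulates – so treat it as a rough illustration, not seismology:

```python
def annual_probability(p_period: float, years: float) -> float:
    """Convert a probability over a multi-year period into an
    equivalent annual probability, assuming a constant hazard rate."""
    return 1 - (1 - p_period) ** (1 / years)

# A ~2% chance of a major quake in 30 years works out to roughly
# a 0.07% chance in any given year, under the constant-rate assumption.
p_annual = annual_probability(0.02, 30)
```

For a real fault, the annual probability grows with time since the last rupture, which is precisely why “overdue” makes sense for earthquakes when it doesn’t for storms.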
Most earthquake science is focussed on modelling what’s likely to happen during an earthquake, rather than when the earthquake will come. This makes sense, because the most effective protective measures are long-term anyway: good building design; removing or protecting buildings built before the modern building codes; and teaching people appropriate earthquake behaviour. By understanding the range of scenarios that are likely or possible, we can put appropriate protection in place.
As far as short-term prediction goes, big earthquakes do often have foreshocks – smaller movements leading up to the highest-energy event. These can, on a scale of a few minutes to a couple of days, give warning of increased risk. There is a lot of uncertainty involved, and it’s an area where more research and more computer power may be able to improve predictions in the future. At the moment, it’s akin to clouds as a predictor of rain. You aren’t totally safe if the sky is clear, and seeing clouds doesn’t tell you that it’s going to rain, but looking at the sky does help you determine the odds.
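The clouds-and-rain analogy can be made precise with Bayes’ rule. The numbers below are purely illustrative assumptions, not real seismological likelihoods; the point is only that evidence like a foreshock shifts the odds without making the outcome anywhere near certain:

```python
def update_probability(prior: float, likelihood_ratio: float) -> float:
    """Bayes' rule in odds form: multiply the prior odds by the
    likelihood ratio of the evidence, then convert back to a probability."""
    posterior_odds = (prior / (1 - prior)) * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

# Assumed numbers only: a 0.1% baseline chance of a major quake in the
# coming days, and tremors taken to be ten times more likely when a big
# quake is imminent than when it isn't.
p = update_probability(0.001, 10)  # roughly 1% -- elevated, but still low
```

A tenfold shift in the odds still leaves a roughly 99% chance of no big quake, which is why “increased risk” and “prediction” are very different claims.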
The Southern California Earthquake Center has good general information on earthquake modelling, as well as a handy guide to living in an earthquake-prone area.
Risk communication is a whole scientific field in its own right, with a compelling central message: it isn’t enough to simply tell the truth about risk, you have to tell it in a way that people can make sense of. Australian earthquake expert Professor Paul Somerville said that the Italian team had a case to answer because of the way they communicated. As far as I can tell from the news reports though, his answer is that the scientists should stick to reporting numbers and let society do the interpretation. This isn’t what the research on risk communication suggests.
The archetypal example is doctors telling patients about the risks of treatment. Simply quoting numbers like one in a thousand or one in ten-thousand patients experiencing a side-effect is meaningless without context. Patients can’t make useful sense of these numbers. Comparisons, instead or as well, are often very useful. “This drug is about as dangerous as Panadol, and much safer than doing nothing”; or “There is a small chance that this drug will make things worse – about the same chance as you being in a car accident in the next year”.
One consistent result in risk communication research is that deliberately being inaccurate in order to reassure people doesn’t work. The situation in Italy was complicated by scare-mongering by various parties, including one seismic technician giving warnings through a megaphone. The scientists rightly thought that it was important for people to have a realistic view of the risk. With hindsight, it’s easy to put better words in their mouths such as “The tremors are a sign of increased risk, but they don’t tell us whether or not a big earthquake is about to happen. Citizens should, as always, make sure that they have followed the steps in the earthquake preparedness booklet, and know what to do if an earthquake does happen”.
The work of Professor Terje Aven at the University of Stavanger is a good starting point for reading about risk and risk communication.
Jail for Poor Risk Communication?
Personally, I think that not being able to properly communicate the risk of a major hazard is a pretty big deal when your job description is to communicate the risk of major hazards. The committee deserved to be dragged over the proverbial coals whether or not the earthquake happened.
Drawing a causal link between their inept communication and people dying is another thing altogether. It’s easy to say afterwards that person X would not have stayed in their shaky house if they’d known the true risk. That’s rather hard to prove though, given that people were living in poor housing in an earthquake zone for months and years. Yes, some people might have skipped town for a couple of days, but that wouldn’t have been a rational response to the risk. Those same people would still have been killed if the earthquake happened a week later, or two weeks later, or a month later.
Essentially, the court has found that a combination of proper risk communication, irrational response, and good luck would have saved lives, therefore the scientists are at fault. Hopefully, they win their appeal. I don’t think that this will have a chilling effect, however. I hope and believe that this will make more scientists realise that communication is an important part of their role, and requires as much care and attention as the research itself.
The Professor Paul Somerville interview is here.
“And there came a day, a day unlike any other, when Earth’s mightiest heroes and heroines found themselves united against a common threat. On that day, the Avengers were born—to fight the foes no single super hero could withstand! Through the years, their roster has prospered, changing many times, but their glory has never been denied! Heed the call, then—for now, the Avengers Assemble!”
“The Avengers” is a team of super-heroes from the Marvel Comics universe. As with any hazard identification, it is important to be precise about what is considered in and out of the system. This is complicated by the changing composition of the super-hero team, including multiple incarnations in parallel continuities. There are also interfaces with other organisations such as S.H.I.E.L.D. and non-aligned super-heroes. In fact, it may be more useful to consider the Avengers as a “System of Systems” (SoS).
Hall-May describes SoS as “systems whose constituent components are sufficiently complex and autonomous to be considered as systems in their own right and which operate collectively with a shared purpose”. This would certainly apply to the Avengers.
Held says that a System of Systems has the following characteristics:
- The system can be subdivided into independently operating systems. The independent systems must themselves be systems.
- The system does not depend on all elements for survival. For example, if the rudder on a 747 fails, the aircraft is very likely to crash, destroying the rest of the nodes and ceasing to be a system as a whole. The 747 cannot be an SoS. The airport, however, will continue to operate. The airport can be an SoS.
- Systems in an SoS have some form of communication. Communication is any form of information passing, regardless of intent. For example a deer showing a white tail while running is passing the information of danger to any other observing deer. The intent of the deer was to run in fear, not to communicate the danger.
- Elements have a common mission. A mission can be described which encapsulates the behavior of the group.
This definition also applies to the Avengers. My own view is that Systems of Systems is not an absolute definition, but should be applied when it is useful to do so. One case where it is definitely useful is when systems have a fluid configuration, with limited information about future configurations. Again, this suggests that System of Systems treatment of the Avengers is appropriate.
Each Avengers Ensemble will comprise between four and six super-hero systems, selected from the following set. Each system may have multiple operating modes.
- Iron Man / Tony Stark (Powered flying suit and operator | Genius Technologist)
- Hulk / Bruce Banner (Enraged Green Monster | Gamma-Ray Scientist)
- Thor (Norse God)
- Henry Pym (Ant-Man | Giant Man | Wasp)
- Captain America (Super Soldier)
- Hawkeye (Archer)
- Quicksilver (Fast Moving Mutant)
- Scarlet Witch (Magic Wielding Mutant)
- Black Widow (Spy)
- Beast (Scientist)
Configurations may also include up to two other external super-hero systems. Modes and capabilities of these other systems cannot be fully anticipated. A stereotypical example is Spiderman. Spiderman will be used throughout this analysis as a test case for the ability of the System of Systems to incorporate an external super-hero system.
Each configuration will typically include one or more vehicles.
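As an aside, the fluidity of configuration is easy to quantify. Counting only the ten core super-hero systems listed above, and ignoring operating modes, external systems and vehicles (all of which multiply the total), the number of distinct Ensembles is already substantial:

```python
from math import comb

CORE_HEROES = 10          # the core super-hero systems listed above
TEAM_SIZES = range(4, 7)  # each Ensemble comprises four to six systems

# Number of ways to pick a team of each permitted size from the core set.
configurations = sum(comb(CORE_HEROES, k) for k in TEAM_SIZES)
print(configurations)  # 672
```

672 core configurations before modes and guest stars are counted, which is why analysing each configuration individually is impractical, and the hazard analysis has to work at the System of Systems level.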
An important aspect of system-of-systems is that the lifecycle of the SoS does not neatly align with the lifecycles of the component systems. All of the initial component super-hero systems were specified, designed and implemented before the assembly of the Avengers system. As a consequence it was not possible to incorporate features into the design of each of them to support Avengers working. A key example of this is inter-operability of equipment and power sources. Thor’s main weapon system is incompatible with everyone but Thor. Iron Man’s entire weapons and propulsion platform uses bespoke technology.
Contrast this with a lifecycle where the Avenger’s concept was determined before the design of the superheroes. At the very least, dangerous fashion incompatibility could have been avoided.
Top Level Hazards (TLH)
With any system-of-systems there are a standard set of hazards that may be applicable. In fact, with the Avengers we find that all of them are applicable.
TLH1: Fratricide
Fratricide, also known as “friendly fire” or “blue on blue” incidents, typically results from misidentification of targets, incorrect aiming of direct or indirect fire, or failure to establish and enforce zones with clear rules of engagement. The mix of indirect-fire weapons (ants, lightning), direct fire weapons (shield, hammer, arrows, guns, magnetic repulsors) and melee suggests that Avenger fratricide is a risk which must be carefully managed.
The Avengers have rightly determined that procedural mitigations are insufficient here. It’s one thing to tell Iron Man, Thor or Hulk to carefully identify targets, it is another thing to actually expect them to do so. An alternate strategy is to mitigate the consequences of fratricide by designing defensive capability to be universally stronger than the offensive systems. The main mechanisms of these systems are:
- Invulnerability (Thor, Hulk)
- Physical Protection (Iron Man, Captain America)
- Agility (Beast, Quicksilver, Black Widow)
Fratricide risk applying to the other super-hero systems remains high-likelihood, high-consequence. Blind luck and comic-book physics are the only explanations for the continued survival of Henry Pym, Hawkeye and the Scarlet Witch.
TLH2: Collision
Combat frequently requires fast-moving objects to co-ordinate their movements in close proximity. Whilst in many respects the speed and mass of the super-hero units makes collision a greater threat than fratricide, the mitigations are the same. Ideally zones of operation and movement would be determined and enforced through non-procedural means. In practice ambiguous verbal communication (“look out”, “gang way”, “duck”) seems to be the main strategy for collision avoidance. For units such as Hulk, Thor or Iron Man this level of mitigation is acceptable, because the consequences of collision are minor. In the case of Ant-Man his small size not only magnifies the consequence, but reduces the effectiveness of the mitigation. Other units are unlikely to see Ant-Man, and his own voice requires magnification to be heard.
TLH3: Resource Competition
Typical finite resources which must be considered in System of Systems are:
- Communication channels
- Logistic support channels
- Movement corridors
- Shared support facilities
Fortunately, none of the super-hero units are heavily resource dependent. Units such as Hawkeye have ammunition restrictions, but these restrictions are typically plot-imposed rather than dependent on logistic organisation. In most respects the lack of weapon-system compatibility is an advantage here, as no pair of units have common consumable supplies or parts.
TLH4: Mission Capability Shortfall
The fluid configuration of a system-of-systems presents a risk that a vital capability will be lacking in some configurations. To account for this hazard we must consider what the vital capabilities are, and how they may be provided.
For a super-hero team, plotlines typically require:
- Investigative ability
- Novel solutions to complex problems
- Transport to and from the location of incidents
- Combat capability
Full details are not provided here, but a simple table showing the capabilities of each hero-unit will show that not all combinations of hero-unit can deliver all of the required capabilities. For example, a team consisting of Thor, Henry Pym as Giant Man, Quicksilver and Bruce Banner in Enraged Green Monster mode would have considerable combat capability, but so little intellectual capability that they would present an enormous threat to the general public. On the other hand, a team consisting of Tony Stark in Genius mode, Bruce Banner in Gamma Scientist mode, Black Widow and Henry Pym as Ant-Man would have insufficient firepower for most plot challenges, presenting serious risk to themselves and unprotected bystanders.
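The table-based check described above amounts to a coverage test: does the union of a team’s capabilities include every required capability? A minimal sketch follows. The capability assignments are my own illustrative guesses, not anything canonical:

```python
REQUIRED = {"investigation", "novel solutions", "transport", "combat"}

# Illustrative assignments only; a real analysis would need an agreed
# capability table for every hero-unit and operating mode.
CAPABILITIES = {
    "Thor":                     {"combat", "transport"},
    "Giant Man":                {"combat"},
    "Quicksilver":              {"combat"},
    "Enraged Green Monster":    {"combat"},
    "Tony Stark (Genius)":      {"investigation", "novel solutions"},
    "Banner (Gamma Scientist)": {"investigation", "novel solutions"},
    "Black Widow":              {"investigation", "transport"},
    "Ant-Man":                  {"novel solutions"},
}

def capability_shortfall(team):
    """Return the required capabilities this team cannot deliver."""
    provided = set().union(*(CAPABILITIES[hero] for hero in team))
    return REQUIRED - provided

brute_squad = ["Thor", "Giant Man", "Quicksilver", "Enraged Green Monster"]
think_tank = ["Tony Stark (Genius)", "Banner (Gamma Scientist)",
              "Black Widow", "Ant-Man"]
```

`capability_shortfall(brute_squad)` flags the missing intellectual capabilities, while `capability_shortfall(think_tank)` flags the missing firepower, matching the two example teams described above.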
Here’s the headline story: the world has as much chance of ending in December 2012 as it does in any other year. None of the best minds in any civilisation, ever, have been able to predict the types of events that destroy civilisations with any sort of accuracy. Modern risk assessment experts can’t agree within five orders of magnitude on the likelihood of a specific hazardous outcome for a specific factory, but even they are yards ahead of the Meso-American calendar makers, who were focusing on things a little more certain, like the phases of the moon and the orbit of Venus.
The Mayan civilisation kept calendars for the same reasons that any civilisation likes to keep track of time:
- To attach dates to things in order to keep some sort of history (the “Long Count” calendar)
- To keep track of the seasons and manage a civilisation heavily dependent on crop management (the “Haab”, or civil calendar)
- To keep track of the moon and Venus – they didn’t really cotton on to the idea of planets – (the “Tzolkin”).
The Long Count calendar didn’t need any sort of careful mathematics. They just declared a particular date to be the start (no one can tell for sure anymore exactly what date that was) and started counting. This calendar doesn’t match up with anything in the real world, it just keeps ticking. If the Mayan civilisation were still around, this calendar would be the one printed on the front of newspapers or in the timestamp on photos.
The Haab required careful records to work out how long a year was. They did a better job than the Romans did a few centuries later, but it wasn’t perfect. This is the calendar that would need leap-year style corrections to keep it lined up with the seasons if it was still in use today.
The Tzolkin required careful records of the rising and setting times of Venus. They probably used this calendar to work out “good luck days” and “bad luck days”, when to go to war and whether to shake hands with passing European Conquistadors. Evidently, this calendar was very good at predicting moonrise, and rubbish at predicting anything actually useful.
Now, if anyone tells you that “THE WORLD IS GOING TO END!” because the Mayan calendar predicts it, they’re talking about the Long Count Calendar. Remember, this is the calendar that doesn’t even predict Summer, or the full moon. It’s just there so people know what the date is. By some interpretations (remember, no one actually knows when this calendar was meant to start counting from), this calendar will need an extra digit sometime soon. This is just like the Y2K bug – only no one uses any Mayan computers, so there’s no bug. So 2012 marks the Mayan Y2K NotaBug.
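The “extra digit” business is easier to see if you treat the Long Count as what it is: a mixed-radix odometer for counting days. The sketch below uses the standard place values (20 kin to a uinal, 18 uinal to a tun, 20 tun to a katun, 20 katun to a baktun); the day-zero correlation is, as noted above, disputed:

```python
# Place values in days for each digit: baktun, katun, tun, uinal, kin.
# Note the mixed radix: 18 uinal (not 20) make a tun of 360 days.
PLACE_VALUES = (144_000, 7_200, 360, 20, 1)

def long_count(days_since_epoch: int) -> tuple:
    """Express a day number as baktun.katun.tun.uinal.kin digits."""
    digits = []
    for value in PLACE_VALUES:
        digits.append(days_since_epoch // value)
        days_since_epoch %= value
    return tuple(digits)

print(long_count(1_871_999))  # (12, 19, 19, 17, 19)
print(long_count(1_872_000))  # (13, 0, 0, 0, 0) -- the odometer ticks over
```

The much-hyped date is just the baktun digit ticking over from 12 to 13, about as cosmically significant as a car odometer reaching 100,000.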
The Mayans did try to predict the future, but they got it right just as often as anyone else – exactly as often as if they’d just guessed or made it all up. And their guesses had nothing to do with 2012 or any other date in our near future.
Good sources on the topic:
- The NASA page does a good Q&A on some of the details of the myths at http://www.nasa.gov/topics/earth/features/2012.html
- More info than you want about the 2012 myths and their damaging consequences at http://www.2012hoax.org/
I was at PSAM11/ESREL12 this week, and I plan to write a few posts on topics I found interesting. First, though, there were some papers that I mentioned during my own talk that are not in the bibliography of the accompanying paper. Normally, this would cause me to start questioning just how comprehensive my work was – after all, I was making claims about the lack of empirical evaluation of quantitative analysis, and here are three relevant papers that I hadn’t even seen. On the other hand, no one in my research group had heard of the studies either, and they aren’t well known in the whole Probabilistic Risk Assessment community. Important work, bypassing its key audience. Maybe the research councils have a point when they insist on an impact plan in grant applications.
These papers report three investigations into variation in risk assessment. In the first study, 11 teams were asked to quantify the risk presented by an ammonia plant. All teams had access to the same information, but the results were evenly spread over six orders of magnitude. In the second study they tried to pin down what was causing the variation. At several stages during this study they reduced the variation by standardising assumptions and methods, but the teams (using similar source information to the first study) still spread over four orders of magnitude. Once you take into account the reductions in variation along the way, this was a good (but not completely independent) replication of the first study’s results.
The third study is to do with the ability of reviewers to interpret studies. Multiple reviewers were given the same study, and asked to draw conclusions from it.
Anyway, the three papers are:
Easy to read but informative and provocative:
Except where noted, links are to PDFs of the documents and homepages of the authors.
Why System Safety Professionals Should Read Accident Reports by Michael Holloway and Chris Johnson
Software Safety: Where’s the Evidence? by John McDermid
Predictable Surprises: The Disasters You Should have Seen Coming by Michael Watkins and Max Bazerman
The Limits to Safety? Culture, Politics, Learning and Man–Made Disasters by Nick Pidgeon (paywall)
The Ironies of Automation by Lisanne Bainbridge
Analyzing software requirements errors in safety-critical, embedded systems by Robin Lutz (paywall)
The Seven Samurai of Systems Engineering by James Martin
My own work on specific topics that often get raised when I’m speaking or teaching:
In service safety, temporal ALARP, and operational decision making:
A.J. Rae, M. Nicholson (2009), Making Robust ALARP Decisions for In-Service Systems, 4th IET System Safety Conference, London
Should you trust humans or computers to make decisions for you?
A.J. Rae (2008), The Operator or the Automation: In Whom Should We Place Ultimate Trust?, 3rd IET International Conference on System Safety, vol., no., pp.1-6, 20-22 Oct. 2008
ALARP, MeM, GAME, GAMAB – Where do I find a reference for all these? Ignore the “solution” presented in this paper, but the summary of the existing approaches is quite good.
A.J. Rae (2007), Acceptable Residual Risk: Principles, Philosophy and Practicalities, 2nd IET System Safety Conference, London
In Part C2 of this report we apply Layers of Protection Analysis (LOPA). LOPA was developed within the chemical process industry as a method of examining the mitigations in place or available for a subset of the initiating events identified during preliminary hazard analysis. It was first documented in 2001 by the American Institute of Chemical Engineers. The basis of LOPA is a set of “Independent Protection Layers.” Protection against the initiating event is provided separately by each layer. The method provides a standard set of layers which can be instantiated for any specific system and initiating event.
To perform the analysis, we begin by specifying the system, the initiating event of interest and the consequences of concern. The system in this case is the Death Star power core. This core consisted of a single “hypermatter” reactor, capable of powering all of the systems on the Death Star. The initiating event is entry of hostile craft into the superstructure, and the consequences of concern are an uncontained reaction within the core.
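The arithmetic behind LOPA is deliberately simple: the frequency of the consequence is the initiating event frequency multiplied by the probability of failure on demand (PFD) of each genuinely independent layer. The figures below are invented for illustration – the Empire did not publish reliability data:

```python
from math import prod

def mitigated_frequency(initiating_events_per_year: float, pfds) -> float:
    """Consequence frequency after the initiating event passes through
    each independent protection layer, each characterised by a
    probability of failure on demand (PFD)."""
    return initiating_events_per_year * prod(pfds)

# Assumed figures: hostile craft penetrate the superstructure once per
# decade, and two independent layers each fail once in 100 demands.
f = mitigated_frequency(0.1, [0.01, 0.01])  # ~1e-5 uncontained reactions/year
```

The multiplication is only valid when the layers really are independent, which, as the analysis of the individual layers shows, is precisely where the Death Star design falls down.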
IPL1 – Design:
Ideal protection against any hazard is to design the system to exclude the hazard. Typically, this is achieved through substitution – finding a less dangerous way to provide the same functionality. For example, convection-cooled nuclear reactors eliminate the hazard of loss of coolant pump power by not using coolant pumps. In the case of the Death Star, it would be unreasonable to expect the possibility of an explosion to be designed out. The oxygen rich “space vacuum” of the Star Wars Universe renders all power sources vulnerable to explosions, particularly under circumstances of enemy fire or dramatic tension. No non-explosive power core was available.
However, what about the initiating event? Could the Death Star not have been designed such that small attack craft could not penetrate to the heart of the vessel? Probably this was the case in the “steady state” of the Death Star, but proper risk analysis should consider startup and shutdown of the system as well. Clearly this analysis was performed, resulting in the placement of a shield generator on the Endor moon. “Bolting on” protection in this way is a clear sign of safety analysis lagging behind design processes. An actively-generated shield was always a weaker protection than designing-in physical security.
IPL2 – Basic Controls:
Basic controls are used to keep chemical plants within normal operating conditions. Most initiating events involve progressive drift from normal conditions, and this can be detected and corrected before the event can even be considered part of an emergency incident. The assumption that hostile craft would not penetrate the shield meant that little consideration was given to this layer of protection.
IPL3 – Alarms and Manual Intervention:
When things go wrong, it is important that operators have sufficient information to detect the crisis and act. This layer of protection was certainly in place, and functioned as designed. The staff on board the Death Star were immediately informed when the small craft penetrated the super-structure, and were able to direct manual interventions precisely. Unfortunately no system of manual intervention is 100% reliable against any threat.
IPL4 – Automated Safety
Automated safety in this circumstance was provided by the Power Regulator located on the North Tower of the hypermatter reactor. Enemy fire destroyed this regulator shortly before the explosion – it is arguable that destruction of the regulator alone may have been sufficient to cause an uncontained reaction.
Here our analysis shows a flaw in the reactor protective systems. Rather than providing a separate layer of protection, the Power Regulator provided an alternate mechanism for the accident. Whilst good design would require failure of both the core and the power regulator, in fact failure of either may have been sufficient. Even were this not the case, the near simultaneous destruction of both systems shows that they were not independent layers of protection.
Far more appropriate would be an automated system which forced the reactor core into a fail-safe state in the event of an attack. Whilst this would be operationally undesirable, a power-less Death Star would certainly be preferable to no Death Star at all.
IPL5 – Physical Protection
Physical protection can take two forms – containment, and isolation. Clearly containment was not an option, as the energy generated from the uncontained reaction was sufficient to destroy the entire structure. Given that that was the case, however, why put the reactor in the middle of the Death Star at all? A logical design would be a multi-hull craft, with the majority of the crew in one hull, and the dangerously explosive core in the other. This is, in fact, the prevailing design philosophy in the Star Trek universe, explicitly due to the dangers presented by breach of the warp-cores.
IPL6 – Emergency Response
It is uncertain what, if any, actions were taken in response to the destruction of the reactor core. Options available may have included venting pressure into space, or rapid evacuation. The fact that the attacking craft and a personnel shuttle were able to depart in the time between the collision with Executor and the final explosion suggests that with adequate emergency response in place, a significant portion of the crew may have managed to evacuate, even if the explosion could not be mitigated.
IPL7 – Community Response
Emergency response at the community level was limited by two factors. Firstly, there was a lack of independence between the emergency and the ability to respond. This is unfortunately not unusual. In a large number of cases response has been hindered by damage to communication and rescue equipment by the incident requiring response. Secondly, there was a lack of competence and training on the part of the Rebel Alliance management. Promotion of an ex-pirate, a gambler and a religious acolyte to leadership positions takes the “Peter Principle” to new heights. It may have been effective in winning the battle, but that was small consolation to the Ewok populace.
Part C considers the circumstances of the accident up until the time of the explosion. Part D will consider the post-explosion response.
In Part C1 we apply the Kaoru Ishikawa method of analysis.
Ishikawa diagrams, also known as “fishbone diagrams” and “cause and effect diagrams”, were invented by Kaoru Ishikawa as a way of illustrating how various factors cause a problem in quality control. They were adopted by Boeing Aerospace, first for quality control, then for analysis of accidents and incidents.
The strength of the technique is that it applies categories (traditionally the six Ms – Manpower, Methods, Machines, Materials, Measurements, Mother Nature) which result in considering a broader range of factors than would otherwise be thought about.
The main limitation of the technique is that causality is only loosely defined. In theory the method is based on “necessary and sufficient” causes, but in practice a wider range of influencing factors need to be considered. The method also struggles when people, methods and equipment are inter-related rather than being causally distinct.
In the diagram presented here, it is clear that the accident could be naively analysed as a technical design problem (an exposed reactor core), as human error (allowing a small band of plucky heroes within the protective shield), enemy action (the rebel attack) or poor safety process (failing to learn lessons from the first Death Star, and not considering whole-of-lifecycle risk). In fact, the accident was all of these things. The Ishikawa approach allows us to find a range of areas in which our organisation can be improved by learning from this accident.
One thing that the approach does not tell us is why this particular accident occurred. The features present in the accident were equally features of many Imperial operations. For example, a very similar diagram could be drawn for the Imperial attack on the Hoth Ice World, a near-complete success. On the one hand, this shows that careful analysis of past operations could cause organisational learning in time to prevent accidents such as the destruction of the Death Star. On the other hand, if the diagrams look the same for both events, we may have failed to consider important factors which caused the accident.