Root cause analysis is dead.

July 9, 2008 at 12:20 pm | Posted in KT

So many times I’ve heard the panic in the voices of senior and executive management during or just after an incident, demanding that the Root Cause be found. “What’s the root cause of the Terminal 5 Baggage Handling failure?” “What’s the Root Cause of this…? What’s the Root Cause of that…?” Enough, I say!

Humans are in general pretty good at preventing things from going wrong, and we have built complex layers of interlocks, processes, procedures and preventive actions to stop bad things happening. In general it’s not possible to destroy a modern car by doing something stupid to the controls: new automatic cars can’t be moved from park to drive without the brake being applied, and so on. There are many interlocks to prevent humans inflicting damage on the mechanisms.

We have also created complex action plans to minimise the effects of bad things happening when we can’t prevent them. Witness, and share, the frustration when those contingent actions are not ‘allowed’, as in the immediate aftermath of the Burmese cyclone: the aid agencies knew what to do, where it needed doing and how to assist, but were prevented from doing so.

Given the complexity of the systems that we have created, I put it to you that in order for a bad thing to happen there has to be a miracle, as miraculous as those times when there is an alignment of good things for a traditional miracle to occur.

While the word miracle can be defined as something extraordinary or surprising, or an extremely outstanding or unusual event, thing or accomplishment, I can find no equivalent word to describe the coming together of multiple bad events that culminate in badness or disaster. The only antonyms I’ve found are normalcy and usualness, which are not what I’m looking for. Turning to Latin, perhaps I’m looking for something like conspiratio, which seems to mean a union of bad things. That leads me to conspiration (noun): the act of plotting or secretly combining; a joint effort toward a particular end. If you know of a better word to describe a bad miracle, please let me know.

So for a conspiration to happen, given the large number of preventive actions we generally have in place, I suggest that many things, benign and untroublesome on their own, need to come together. In the case of the T5 opening, a recent Radio 4 ‘File on 4’ programme interviewed a number of people and concluded that there were a number of contributory factors and breached procedures which led to the collapse of the baggage processing. Had the staff been allowed on site days before, the car park not letting staff in would have been sorted out in advance. Had there been a test with a properly large volume of people and bags, the software problem would have been found, and so on. I suspect people in BA and BAA knew a disaster was impending, and did not have the positional or political power to prevent it – still a conspiration, just at a ‘higher’ level.

It is generally understood that the IT world is becoming ever more technically challenging, so in Incident Management and Problem Management the problems are becoming harder to solve. The escalations are also becoming more complex to handle, and as Einstein is alleged to have once said: “We can’t solve problems by using the same kind of thinking we used when we created them.”

In the world of ever increasing technical complexity, Root Cause Analysis may be living near the end of its useful life, and it could be time to usher in the next shift in thinking – Root Causes Analysis – mapping the complex web of technical, people, process and project management issues that contribute to the genesis and the severity of the problem at hand. The important (and tough) challenge is to prevent the analysis from becoming a sea of unsubstantiated opinion or blame, and keep it real.

Heisenberg, Einstein & Moore – An Eternal Source of Misery

May 28, 2008 at 7:58 am | Posted in KT

Three well-known observations are coming together to herald the death of Root Cause Analysis. The ‘Observer Effect’, popularly linked with Heisenberg, suggests that the act of observation can alter the results; just stare, unblinking, at the tip of a colleague’s nose for a few minutes to see the effect it has on people. The observation that the significant problems we have cannot be solved at the same level of thinking with which we created them is attributed to Einstein. And Moore’s Law suggests that the number of transistors that can be inexpensively placed on an integrated circuit is increasing exponentially, doubling approximately every two years.
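To put rough numbers on that doubling, here is a minimal sketch. It assumes an idealised, perfectly regular doubling period of exactly two years; the function name `moore_growth_factor` is made up for illustration and real transistor counts do not follow the curve this neatly.

```python
def moore_growth_factor(years: float, doubling_period_years: float = 2.0) -> float:
    """Multiplicative growth after `years`, assuming one doubling per period.

    The two-year default is an idealisation of Moore's Law, not measured data.
    """
    return 2.0 ** (years / doubling_period_years)


# Growth over a few illustrative time spans.
for years in (2, 10, 20):
    print(f"{years} years: x{moore_growth_factor(years):,.0f}")
```

Under that assumption, twenty years gives ten doublings, or roughly a thousandfold increase in what has to be understood when something breaks.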

Sooner or later, the complexity curve will overwhelm the capability continuum, and my recent experience suggests that this is already happening. Companies are producing hardware that is so densely packaged, and are moving so many terabytes of data about, that it has already become impossible to find the root cause of some outages. Some time ago I helped facilitate the troubleshooting of a network problem that caused a number of several-hour infrastructure outages for a UK bank. The most likely cause turned out to be some new, faster network interconnect, and we identified that one single bit in several terabytes of data was being ‘lost’. We could get no further: too many bits, too critical an issue. The hardware was ripped out and replaced with different equipment. In another situation a company used to have three cards in a frame doing a particular job; transistor density increased and the function of all three was reduced into one new card. Only afterwards was it realised that the observability of the activity of the three cards, at both the backplane and the card front interconnects, had allowed the technical staff to identify and rectify faults. Now it’s a ‘black box’, and if something goes wrong there is insufficient observability for root cause analysis: if it’s a hardware or software bug, replacing the card with another one may not solve the problem.
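For a sense of scale on the lost-bit story, a back-of-envelope sketch follows. The post only says “several terabytes”, so the volumes used here (1 TB and 5 TB, counted as decimal terabytes of 10^12 bytes) are pure assumptions for illustration, as is the function name.

```python
BITS_PER_TERABYTE = 8 * 10**12  # decimal terabyte: 10**12 bytes, 8 bits each


def single_bit_loss_rate(terabytes: float) -> float:
    """Fraction of bits lost if exactly one bit goes missing in `terabytes` of data."""
    return 1.0 / (terabytes * BITS_PER_TERABYTE)


# Assumed volumes only; the real figure from the incident is not known.
for tb in (1, 5):
    print(f"1 bit in {tb} TB is a loss rate of about {single_bit_loss_rate(tb):.2e}")
```

Whatever the true volume was, the fault shows up about once in trillions of bits, which is why catching it in the act was so hard.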

If we are going to troubleshoot in complex environments, we need new thinking in the commercially available, heavy-duty case handling support tools. Commonly available tools are based around the original case handling tool created by HP for Hertz, a transactional model fit for handling vehicle rentals. Troubleshooters don’t care when vital data was collected; they care that it is recognised as vital data and is stored and presented in a manner that uses the way human brains work to maximise effectiveness. Vital troubleshooting data stored several pages back in a narrative, date-ordered story is easily overlooked. Data presentation really does matter for successful and efficient Root Cause Analysis. If we are making equipment so complex as to be impossible to troubleshoot, we need to know quickly that we will fail, so that months are not wasted on troubleshooting something ultimately unsolvable. Since we are assembling high-density, high-throughput configurations that take teams to troubleshoot, we need better tools to allow those teams to function efficiently. If we cannot arrest Moore’s Law, and we cannot navigate around Heisenberg, at least we can give Einstein a fair crack and introduce new thinking into the tooling we provide our troubleshooters.
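One way to picture the difference between a date-ordered narrative and a presentation that puts vital data first is the small sketch below. It is not how any real case handling tool works; the `CaseNote` structure, the `vital` flag, the `vital_first` function and the sample notes are all invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CaseNote:
    recorded_at: datetime
    text: str
    vital: bool = False  # hypothetical flag set by the troubleshooter


def vital_first(notes: list[CaseNote]) -> list[CaseNote]:
    """Return notes with vital ones on top, each group in chronological order."""
    return sorted(notes, key=lambda note: (not note.vital, note.recorded_at))


# Invented sample case log, entered in date order as a transactional tool would store it.
notes = [
    CaseNote(datetime(2008, 5, 1, 9, 0), "Customer reports intermittent outage"),
    CaseNote(datetime(2008, 5, 2, 14, 30), "New interconnect installed last month", vital=True),
    CaseNote(datetime(2008, 5, 3, 11, 15), "Call scheduled with vendor"),
    CaseNote(datetime(2008, 5, 6, 16, 45), "Fault only seen under peak load", vital=True),
]

for note in vital_first(notes):
    marker = "VITAL" if note.vital else "     "
    print(f"{marker} {note.recorded_at:%Y-%m-%d} {note.text}")
```

The timestamps are kept, but they only break ties. What decides the position of a note is whether someone has recognised it as vital, so it can no longer sink several pages back in the story.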

Epoch

May 27, 2008 at 9:07 am | Posted in Uncategorized

The beginning.
