Heisenberg, Einstein & Moore – An Eternal Source of Misery
May 28, 2008 at 7:58 am | Posted in KT

Three well-known observations are coming together to herald the death of Root Cause Analysis. The ‘Observer Effect’, popularly associated with Heisenberg, suggests that the act of observation can alter the results; just stare, unblinking, at the tip of a colleague’s nose for a few minutes to see the effect it has on people. The remark that the significant problems we have cannot be solved at the same level of thinking with which we created them is attributed to Einstein. And Moore’s Law suggests that the number of transistors that can be inexpensively placed on an integrated circuit is increasing exponentially, doubling approximately every two years.
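
To put a number on that doubling, here is a minimal sketch of the arithmetic. The function name and the fixed two-year period are my own simplifications for illustration, not a precise statement of Moore’s Law.

```python
def growth_factor(years: float, doubling_period_years: float = 2.0) -> float:
    """Multiplicative growth after `years`, assuming a constant doubling period.

    The two-year default is the approximate figure quoted above; real
    doubling periods vary, so treat the output as a rough guide only.
    """
    return 2.0 ** (years / doubling_period_years)


for years in (2, 10, 20):
    print(f"after {years:>2} years: roughly x{growth_factor(years):,.0f}")
```

On that assumption a decade gives about a 32-fold increase and two decades about a thousand-fold, which is the kind of curve troubleshooting capability is being asked to keep up with.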

Sooner or later, the complexity curve will overwhelm the capability continuum, and my recent experience suggests that this is already happening. Companies are producing hardware that is so densely packaged, and are moving so many terabytes of data about, that it has already become impossible to find the root cause of some outages. Some time ago I helped facilitate the troubleshooting of a network problem that caused a number of infrastructure outages, each lasting several hours, for a UK bank. The most likely cause turned out to be some new, faster network interconnect, and we identified that a single bit in roughly every several terabytes of data was being ‘lost’. We could get no further: too many bits, too critical an issue. The hardware was ripped out and replaced with different equipment. In another situation a company used to have three cards in a frame doing a particular job; transistor density increased and the function of all three was consolidated into one new card. Only afterwards was it realised that being able to observe the activity of the three cards, at both the backplane and the card front interconnects, was what had allowed the technical staff to identify and rectify faults. Now it’s a ‘black box’, and if something goes wrong there is insufficient observability for root cause analysis: if it’s a hardware or software bug, replacing the card with another one may not solve the problem.

If we are going to troubleshoot in complex environments, we need new thinking in the commercially available, heavy-duty case handling support tools. Commonly available tools are based around the original case handling tool created by HP for Hertz, a transactional model fit for handling vehicle rentals. Troubleshooters care less about when vital data was collected than that it is recognised as vital data and is stored and presented in a manner that works with the way human brains work, to maximise effectiveness. Vital troubleshooting data stored several pages back in a date-ordered narrative is easily overlooked; data presentation really does matter for successful and efficient Root Cause Analysis. If we are making equipment so complex as to be impossible to troubleshoot, we need to know quickly that we will fail, so that months are not wasted on troubleshooting something ultimately unsolvable. Since we are assembling high-density, high-throughput configurations that take teams to troubleshoot, we need better tools to allow those teams to function efficiently. If we cannot arrest Moore’s Law, and we cannot navigate around Heisenberg, at least we can give Einstein a fair crack and introduce new thinking into the tooling we provide our troubleshooters.
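
As a rough illustration of the presentation point, the sketch below models a case log in which entries can be flagged as vital and then shown ahead of the routine narrative, whatever their date. Everything in it is hypothetical: the `CaseEntry` type, the `vital` flag, the `troubleshooting_view` function and the sample entries are invented for the example and do not describe any existing case handling tool.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CaseEntry:
    recorded_at: datetime
    text: str
    vital: bool = False  # hypothetical flag, set by the troubleshooter


def troubleshooting_view(entries):
    """Return vital entries first, then the rest; each group stays in date order."""
    # False sorts before True, so `not e.vital` puts the vital entries on top.
    return sorted(entries, key=lambda e: (not e.vital, e.recorded_at))


# Invented sample data, purely for illustration.
case_log = [
    CaseEntry(datetime(2008, 5, 1, 9, 0), "Customer reports intermittent outage"),
    CaseEntry(datetime(2008, 5, 2, 14, 30), "Outages began after interconnect upgrade", vital=True),
    CaseEntry(datetime(2008, 5, 6, 11, 15), "Status call held, no change"),
    CaseEntry(datetime(2008, 5, 9, 16, 45), "Fault not seen on old interconnect", vital=True),
]

for entry in troubleshooting_view(case_log):
    marker = "VITAL" if entry.vital else "     "
    print(f"{marker}  {entry.recorded_at:%Y-%m-%d}  {entry.text}")
```

The date is kept on every entry but no longer decides what the reader sees first. A real tool would need far more than a single flag, but even this much stops the key facts from sinking several pages back.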