The lessons pilots can teach surgeons…

by jonathan on February 15, 2009

Wonderful checklist. From Design for Mankind (click image for site).

…and the lesson surgeons and soldiers teach us: use checklists and hold regular briefings and debriefings (After Action Reviews).

From the BBC:

Before take-off, every pilot needs to brief their crew about what to expect.

At the end of each flight, they talk briefly about what went right, what went wrong and what could be done better.

Pilots say this brief and debrief system has reduced errors and made flying safer, and a growing number of NHS medics think this system should be adapted – to make surgery safer.

A report by researchers at the University of York claims that accidents, errors and mishaps in hospital affect as many as one in 10 in-patients – but that up to half of these were preventable.

One doctor who has trialled the brief and debrief system in two units at his hospital says incidents were reduced by between 30-50% over the period they used it. [BBC News]

This, of course, is also a military staple. Before every mission there is extensive briefing and after every mission there is a debriefing, and if there was combat, an in-depth After Action Review.

Good project managers also know the power of post-project analysis, which they call the project review or postmortem[1. A future post will address how to conduct these properly so as to get the most benefit from the practice.].

Meanwhile, briefings and debriefings are not the only aviation practice now being widely adopted by doctors.

In a brilliant Must Read article in the New Yorker, Atul Gawande explains how using checklists completely transformed aviation and is now transforming hospital intensive care units, having a massive impact on patient survival rates. He begins by explaining how the aviation checklist came into being:

A small crowd of Army brass and manufacturing executives watched as the Model 299 test plane taxied onto the runway. It was sleek and impressive, with a hundred-and-three-foot wingspan and four engines jutting out from the wings, rather than the usual two. The plane roared down the tarmac, lifted off smoothly, and climbed sharply to three hundred feet. Then it stalled, turned on one wing, and crashed in a fiery explosion. Two of the five crew members died, including the pilot, Major Ployer P. Hill.

An investigation revealed that nothing mechanical had gone wrong. The crash had been due to “pilot error,” the report said. Substantially more complex than previous aircraft, the new plane required the pilot to attend to the four engines, a retractable landing gear, new wing flaps, electric trim tabs that needed adjustment to maintain control at different airspeeds, and constant-speed propellers whose pitch had to be regulated with hydraulic controls, among other features. While doing all this, Hill had forgotten to release a new locking mechanism on the elevator and rudder controls. The Boeing model was deemed, as a newspaper put it, “too much airplane for one man to fly.” The Army Air Corps declared Douglas’s smaller design the winner. Boeing nearly went bankrupt.

Still, the Army purchased a few aircraft from Boeing as test planes, and some insiders remained convinced that the aircraft was flyable. So a group of test pilots got together and considered what to do.

They could have required Model 299 pilots to undergo more training. But it was hard to imagine having more experience and expertise than Major Hill, who had been the U.S. Army Air Corps’ chief of flight testing. Instead, they came up with an ingeniously simple approach: they created a pilot’s checklist, with step-by-step checks for takeoff, flight, landing, and taxiing. Its mere existence indicated how far aeronautics had advanced. In the early years of flight, getting an aircraft into the air might have been nerve-racking, but it was hardly complex. Using a checklist for take-off would no more have occurred to a pilot than to a driver backing a car out of the garage. But this new plane was too complicated to be left to the memory of any pilot, however expert.

With the checklist in hand, the pilots went on to fly the Model 299 a total of 1.8 million miles without one accident. The Army ultimately ordered almost thirteen thousand of the aircraft, which it dubbed the B-17. And, because flying the behemoth was now possible, the Army gained a decisive air advantage in the Second World War which enabled its devastating bombing campaign across Nazi Germany.

Medicine today has entered its B-17 phase. Substantial parts of what hospitals do—most notably, intensive care—are now too complex for clinicians to carry them out reliably from memory alone. I.C.U. life support has become too much medicine for one person to fly.

Later in the article Gawande quotes Peter Pronovost, the medical pioneer who introduced aviation-style checklists at Johns Hopkins Hospital, where they are now recognised as being enormously helpful.

The checklists provided two main benefits, Pronovost observed. First, they helped with memory recall, especially with mundane matters that are easily overlooked in patients undergoing more drastic events. (When you’re worrying about what treatment to give a woman who won’t stop seizing, it’s hard to remember to make sure that the head of her bed is in the right position.) A second effect was to make explicit the minimum, expected steps in complex processes. Pronovost was surprised to discover how often even experienced personnel failed to grasp the importance of certain precautions. In a survey of I.C.U. staff taken before introducing the ventilator checklists, he found that half hadn’t realized that there was evidence strongly supporting giving ventilated patients antacid medication. Checklists established a higher standard of baseline performance.

The parallels with IT operations are striking. Just like pilots and ICU doctors, system administrators and IT managers operate in highly complex, dynamic environments[2. One of the reasons IT departments are always demanding standardisation is to reduce that complexity]. In these environments relatively small mistakes can quickly cascade into disasters. In my experience, mistakes by qualified and properly trained staff[3. I am making a distinction here between mistakes due to bad training, sloppy procedures, poor communication or unqualified staff asked to operate beyond their competencies, and mistakes made by qualified IT professionals who have all the skills, knowledge and information they need to carry out their tasks] are overwhelmingly caused by two common factors – overconfidence and stress – that give rise to one dangerous practice: rushing.

Overconfidence

Mistakes due to overconfidence or over-familiarity typically happen when highly skilled and creative people (like system administrators) are required to carry out a boring multi-step process. An example might be provisioning a server or installing some enterprise software.

As the operator becomes more familiar with the process, they pay less attention to it. It becomes second nature. The greater the familiarity and boredom, the faster the operator rushes through the steps and the more errors are made or steps accidentally skipped. They buy speed at the cost of a higher error rate, with the consequences of those errors often emerging much later (and therefore not attributed to their real cause). Merely warning people about this does not help much. As with driving, people tend to overestimate their own competence, skill and attention to detail (the Dunning-Kruger and Lake Wobegon effects).

So how do checklists and briefing/debriefing help?

Checklists can be useful here for reminding the operator to carry out all the steps but, perhaps more importantly, they are vital for quality control testing after the process is completed. I have found that people who are very familiar with a process hate the “paperwork” of following a checklist whilst they do it. If it is forced on them, they tend to tick the checks thoughtlessly, often post facto.

A better idea is to leave the operator to execute the process, but have someone else – preferably a junior – check the result for errors using a checklist. This basic structured quality control allows the process to be done fast, while errors are detected before they become costly (i.e. before the system is delivered to the customer or goes live). “But the error checking is a process too,” I hear you say. “Is it not in danger of falling victim to overconfidence?” The answer, of course, is yes. There is a recursive danger here. The solution is to use less skilled or junior staff to do the quality control check. It will be much harder and less familiar to them, so less likely to trigger overconfidence.
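As a rough illustration of that split between operator and checker, here is a minimal Python sketch. The class names, the example process and the individual checks are all hypothetical, invented for this post rather than taken from any real tool. The only rule it enforces is the one argued for above: the person who did the work is not the person who ticks the boxes.

```python
from dataclasses import dataclass, field


@dataclass
class Check:
    """One verifiable item on a post-process quality control checklist."""
    description: str
    passed: bool = False
    note: str = ""


@dataclass
class QualityReview:
    """A checklist review carried out by someone other than the operator."""
    process: str
    operator: str
    reviewer: str
    checks: list[Check] = field(default_factory=list)

    def __post_init__(self) -> None:
        # The point is a second pair of eyes, so self-review is refused.
        if self.operator == self.reviewer:
            raise ValueError("reviewer must not be the operator")

    def failures(self) -> list[Check]:
        """Return every check that did not pass."""
        return [c for c in self.checks if not c.passed]

    def ready_for_delivery(self) -> bool:
        """True only when there is at least one check and all of them passed."""
        return bool(self.checks) and not self.failures()


# Hypothetical example: a junior reviews a server a senior has provisioned.
review = QualityReview(
    process="provision web server",
    operator="senior_admin",
    reviewer="junior_admin",
    checks=[
        Check("Backups scheduled", passed=True),
        Check("Monitoring agent installed", passed=False, note="agent missing"),
    ],
)

for failed in review.failures():
    print(f"FAILED: {failed.description} ({failed.note})")
```

The failed checks are exactly what would be written into an operational error log for later review.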

The other weapon is briefing/debriefing (After Action Review). Every week my senior technical team members and I review the operational error log (a list of the mistakes our quality control process discovers), support tickets and other reported failures or errors that we have registered in the Operations Diary. We are “Looking for Ugly”. This team-level kaizen helps us keep our knowledge base fresh, detect problems early, prioritise and schedule our work for the upcoming week and, most importantly, learn from our mistakes.

Because feedback is immediate, public (before peers) and directly linked to individuals, learning and remedial behaviour are strongly stimulated. If someone makes a mistake, it's on the record. There is no diffusion of responsibility; there is no postponement of consequences; in the next briefing they will need to explain what happened and how we can prevent it happening again. The individual is presumed to be innocent and the system faulty. Our objective is not to apportion blame, but to tweak our systems. If an individual is making too many mistakes, it is a flag for their manager – Are they overworked? Do they need more training? Are they bored? Are they incompetent?

The briefing element gives us a chance to publicly agree our strategic priorities, let each other know what we are up to and generally synchronise our watches. One of its most important functions is to emphasise for people what is important in the blizzard of communications they receive. So many managers whine that despite mailing important instructions or information, their staff did not act on, understand or retain the information. This is a human given, so there is no point in railing against it. Instead, select what is truly important or the current priority, and emphasise it at your briefing. Your people will go back and read your mail (which you should use for reference rather than as the only vector) and hopefully get the message.

These reviews are quick – 10 or 15 minutes a week – but they yield vital knowledge about the state of your operation.

Panic and stress

The second factor that contributes disproportionately to mistakes and damage in IT operations is stress or its extremely disabling offspring, panic.

With a major system down, monitoring alarm klaxons sounding, phones beeping with alert SMS and furious clients or bosses on the phone demanding to know what is going on, it is sometimes hard to remember what to do or even where to start.

Reading long procedures or the 200-page disaster recovery plan document is pointless. What you need and crave is a checklist, the pre-thought-out best practice for the situation.

Cognitive narrowing (the inability to think under pressure) and learned helplessness (giving up under overwhelming stress) are best addressed by doing your thinking before you are stressed, and having the critical steps and actions made explicit in a checklist. A checklist acts as an external memory module for your overladen cognitive circuitry. Checklists get people focussed on doing rather than fretting and they are efficient; they make sure the essentials are covered.
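To make the “external memory” idea concrete, here is a minimal Python sketch of an incident checklist that only ever answers one question: what do I do next? The steps listed are hypothetical placeholders, not a recommended outage procedure; the real ones have to be written by your own team, in advance, while nobody is under pressure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    """One explicit action on an incident checklist."""
    action: str
    done: bool = False


def next_step(steps: list[Step]) -> Optional[Step]:
    """Return the first step not yet completed, or None when all are done."""
    for step in steps:
        if not step.done:
            return step
    return None


# Hypothetical steps, for illustration only.
outage_checklist = [
    Step("Acknowledge the alarm and note the time"),
    Step("Tell clients and bosses that you are investigating"),
    Step("Check what changed most recently"),
    Step("Decide whether to roll back or repair"),
]

# Mark the first step as completed, then ask what comes next.
outage_checklist[0].done = True
current = next_step(outage_checklist)
print(current.action if current else "Checklist complete")
```

Showing a single step at a time is a deliberate choice in this sketch: under stress, a short instruction is easier to act on than a page of them.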

As good as checklists are, After Action Reviews are even more important if you have found yourself in a panic or high stress situation.

The secret weapon against unreasonable demands

Something is clearly wrong if you or your team are under stress or your systems are failing. It is vital to take time out to calm down and collectively analyse what happened or is happening.

Again, it is all about learning and making needed adjustments. Project managers hold postmortems to discuss what went right or wrong in their projects, so that the same mistakes are avoided in the next project. You need to calmly analyse:

  1. How you and your team performed under stress
  2. What caused or is causing that stress
  3. How you can avoid the situation again
  4. If it is unavoidable, how you can handle yourselves better next time.

You will tend to find that IT staff are most often stressed by one of two things:

  1. Impossible external demands (e.g. from Sales) and
  2. Firefighting the consequences of previous poor work, poor planning or neglect (often because resources are diverted away to deal with external demands).

The key to managing external demands is to know exactly what your team is doing and why. Armed with a clear strategy, clear priorities and knowledge of what your people are working on, you can negotiate with those making demands on your resources by forcing them to choose between limited options. “If you want D, you have to sacrifice A, B or C. Which is it?”

One option you cannot and must not ever compromise on is maintaining the core operation and customer support. But you must know what they cost in terms of time. The rest is theoretically elective.
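The arithmetic behind that negotiation is simple enough to sketch in a few lines of Python. All the figures below are hypothetical (a team with 200 hours a week, of which 120 go on core operations and support), and the function names are mine. Core hours come off the top, and any new request has to be paid for out of what is left.

```python
def elective_hours(team_hours: float, core_hours: float) -> float:
    """Hours left for project work once core operations and support are covered."""
    if core_hours > team_hours:
        raise ValueError("core operations already exceed the team's capacity")
    return team_hours - core_hours


def trade_offs(committed: dict[str, float], request_hours: float,
               elective: float) -> list[str]:
    """Name the committed projects that, if dropped, would free enough time.

    Returns an empty list when the new request fits without dropping anything.
    Only single-project sacrifices are considered in this sketch.
    """
    shortfall = sum(committed.values()) + request_hours - elective
    if shortfall <= 0:
        return []
    return sorted(name for name, hours in committed.items() if hours >= shortfall)


# Hypothetical weekly figures.
available = elective_hours(team_hours=200, core_hours=120)
projects = {"A": 30, "B": 30, "C": 20}

# Sales asks for project D, estimated at 25 hours a week.
options = trade_offs(projects, request_hours=25, elective=available)
print(f"To get D, one of these has to go: {options}")
```

With these numbers, dropping C would not free enough time, so the honest choice to put to the requester is between A and B.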

What happens in most IT operations is the opposite. IT managers cannot account for what their people are doing even though (or maybe because) they are extremely busy. The team is firefighting. They are not logging their work or working from a plan. There are no priorities beyond serving the loudest or the most insistent complainers. When an overzealous Sales team or excitable executives demand new products (R&D) or overcommit IT resources on customer projects, IT cannot defend itself from the demands because it cannot account for where its resources are being deployed. Executives soon tire of being told their projects have to wait because IT is “too busy”. Eventually they order IT to deliver on those promises or demands. IT scrambles to meet the demands, even though it never agreed to them, usually by rushing or putting in overtime.

This does not have the intended effect of taking the pressure off. Quite the opposite. People now believe that the rushed overtime efforts are the benchmark for “normal”, that IT is too conservative in its estimates and that when pushed or ordered it can deliver in half the time it said it could. The silent evidence of IT having to work nights and weekends is never considered.

Sales, now thinking that IT secretly has extra capacity, starts to increase its demands for product development or oversells capacities based on the assumption that IT is exaggerating its workload and timescale estimates. As work is piled on, a demoralised IT starts failing to cope. It is blamed for being “late” on everything (even though the schedules were never even approaching reality); it is pressured to divert resources from core operations and maintenance to “catch up” on projects; core service standards start to drop; project work is rushed and consequently mistake-ridden. When the rushed work starts to generate complaints from dissatisfied customers, Sales will not remember the superhuman up-all-night efforts the IT team put in to satisfy their demands; they will blame IT for losing customers. The best staff in the IT department start to leave the company, aggravating the problems. No one wants to do thankless work under massive pressure. Eventually a critical mass is reached, and there are catastrophic core failures. IT goes into a tailspin and, in a technology company, that can often mean the company folds.

Don’t let that be your company.

By knowing your team (the individuals, their capabilities, workload) and exactly what everyone is doing, you can easily avoid the scenario above.

The easiest way to know your team is to talk together safely and respectfully as a group in weekly (or even daily) briefings and to have regular one-on-one chats with all your direct reports.
