I’m going to cheat a little here. This article by Roger Bohn from Harvard Business Review is so outstanding that I’m just going to include it in its entirety. Have fun!
By Roger Bohn
FROM THE JULY–AUGUST 2000 ISSUE OF HARVARD BUSINESS REVIEW.
In business organizations, there are invariably more problems than people have the time to deal with. At best, this leads to situations where minor problems are ignored. At worst, chronic fire fighting consumes an operation’s resources. Companies with complex R&D and manufacturing processes are particularly prone to destructive fire fighting. Managers and engineers rush from task to task, not completing one before another interrupts them. Serious problem-solving efforts degenerate into quick-and-dirty patching. Productivity suffers. Managing becomes a constant juggling act of deciding where to allocate overworked people and which incipient crisis to ignore for the moment.
For several years, my late colleague Ramchandran Jaikumar and I observed fire-fighting behavior in many manufacturing and new-product-development settings. When we described what we’d seen, people instantly recognized what we were talking about; indeed, most of them said they fought fires in their own professional lives all the time. Yet with a few exceptions, the fire-fighting syndrome has stayed off the radar screens of organizational theorists.[1] It deserves far more attention. In fact, fire fighting is one of the most serious problems facing many managers of complex, change-driven processes.
From our observations, fire fighting is best characterized as a collection of symptoms. You’re a victim if three of the following linked elements are chronic within your business unit or division.
- There isn’t enough time to solve all the problems. There are more problems than the problem solvers (engineers, managers, or other knowledge workers) can deal with properly.
- Solutions are incomplete. Many problems are patched, not solved. That is, the superficial effects are dealt with, but the underlying causes are not fixed.
- Problems recur and cascade. Incomplete solutions cause old problems to reemerge or actually create new problems, sometimes elsewhere in the organization.
- Urgency supersedes importance. Ongoing problem-solving efforts and long-range activities, such as developing new processes, are repeatedly interrupted or deferred because fires must be extinguished.
- Many problems become crises. Problems smolder until they flare up, often just before a deadline. Then they require heroic efforts to solve.
- Performance drops. So many problems are solved inadequately and so many opportunities forgone that overall performance plummets.
The recent Mars Climate Orbiter crash is an example of the insidious nature of fire fighting. The crash was traced to a simple communication problem—one engineering group used metric units of measurement, another used English units—but that explanation masks a more complex underlying problem. According to a NASA report published shortly before the crash, the subcontractor staff early in the project was smaller than planned. This led to delays, work-arounds, and poor technical decisions, all of which required catch-up work later. Engineering staff was borrowed from other projects in their early phases—thus forcing those projects into the same position. Engineers worked 70-hour weeks to meet deadlines, causing more errors in the short run and declines in effectiveness in the long run. Early warning signs were missed or ignored. According to a report after the crash, the navigation error that caused the crash could probably have been corrected by a contingency burn, but a decision on whether to perform the burn was never made because of the crush of other urgent work. This is classic fire fighting.
Fire fighting isn’t necessarily disastrous. Clearly it hampers performance, but there are worse alternatives. Rigid bureaucratic rules, for example, can help a company avoid fire fighting altogether, but at the price of almost no problems getting solved. Also, sometimes even a well-managed organization slips into fire-fighting mode temporarily without creating long-term problems. The danger is that the more intense fire fighting becomes, the more difficult it is to escape from.
There are some companies that never fight fires, even though they have just as much work and just as many resource constraints as companies that do. How do they avoid fire fighting? The short answer is that they have strong problem-solving cultures. They don’t tackle a problem unless they’re committed to understanding its root cause and finding a valid solution. They perform triage. They set realistic deadlines. Perhaps most important, they don’t reward fire fighting.
A Simple Model of Fire Fighting
Before we can move on to what to do about fire fighting, we need to look more closely at its underlying causes. A simple model captures the essential issues. (See the exhibit “How Problems Flow Through Organizations.”)
How Problems Flow Through Organizations
Let’s assume that the organization is a factory-engineering group, although it could easily be an R&D facility or a software development group. As problems arise—from customers’ complaints, special orders, quality lapses, supplier difficulties, and other sources—they are sent into a queue until an engineer has time to work on them.
As engineers finish a problem, they report to a manager who presides over the queue, deciding which problems are the most urgent and who should solve each one. Solving a problem takes time: an engineer must study the symptoms, confirm that the problem is real, conduct background research, diagnose its causes, search for a good solution, and implement the solution. Problems come in different shapes and sizes and hence require different amounts of time. Allocating the tasks is itself fairly complex. Each engineer works on several problems at once, and each is better at some problems than others. Engineers may function in teams, and the teams for each problem can differ. Vacations and routine tasks complicate scheduling. Each of these complexities reduces the attainable efficiency of the system and makes the manager’s decisions harder.
A key number in this system is the traffic intensity—the number of problems relative to the resources devoted to problem solving. Traffic intensity increases when there are more problems or when the problems take longer to solve. It decreases when more problem solvers are brought into the picture.
When there’s some slack—when the traffic intensity is below about 80%—the system works well. But when the traffic intensity nears 100%, problems start sitting in the queue for a while. When traffic intensity is greater than 100%—that is, when there are more problems than can be solved, even if everyone works flat out—organizations get into real trouble. The queue lengthens and problems aren’t resolved for long periods of time. Suppose, for example, that a factory is ramping up for a new product, and roughly three significant problems crop up every day. Four engineers each take an average of two days per problem, so every day the queue of unsolved problems grows by one. By the end of the third week, 15 problems are waiting for attention.
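The arithmetic in the ramp-up example can be checked with a short sketch. The function names are illustrative rather than taken from the article, and the sketch assumes steady average rates, an empty queue at the start, and that three weeks means 15 working days.

```python
def traffic_intensity(arrivals_per_day, engineers, days_per_problem):
    """Ratio of incoming work to problem-solving capacity (1.0 means 100%)."""
    capacity_per_day = engineers / days_per_problem
    return arrivals_per_day / capacity_per_day


def backlog_after(working_days, arrivals_per_day, engineers, days_per_problem):
    """Unsolved problems after some working days, starting from an empty queue.

    Assumption: arrivals and solution times are steady averages, so the
    queue grows by (arrivals - capacity) each day and never goes negative.
    """
    capacity_per_day = engineers / days_per_problem
    daily_growth = max(0.0, arrivals_per_day - capacity_per_day)
    return daily_growth * working_days


# The article's example: 3 problems a day, 4 engineers, 2 days per problem.
rho = traffic_intensity(3, 4, 2)       # 1.5, i.e. 150%
queue = backlog_after(15, 3, 4, 2)     # 15.0 problems waiting
```

Under these assumptions the four engineers clear two problems a day, so the traffic intensity is 150% and the queue grows by one problem per working day.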
As the queue grows, the engineers and their managers experience various pressures—the self-imposed pressure of knowing they’re behind, pressure from customers who want the product immediately, pressure from senior managers who are upset by customers’ complaints. This is when severe fire fighting can set in. Managers and engineers let some problems jump the queue for political reasons. They drop problem A (a machine that keeps breaking down, causing bottlenecks) to find a solution for problem B (a serious quality defect) because B reaches crisis proportions. They put lots of effort into problem C (implementing manufacturing changes for a new set of product enhancements), only to find that the enhancements are indefinitely postponed because they did not work in beta testing. And they find themselves spending more time responding to irate inquiries than working productively. (See the exhibit “Rational Rules, Irrational Results” for more detail about the seemingly reasonable organizational responses to over-long work queues.)
Rational Rules, Irrational Results
Organizations have developed many rules of thumb for problem solving. And indeed, when a company is not under stress, these rules may be good ones. They can also be helpful for knowledge workers who are developing their individual reputations. But when an organization is in fire-fighting mode, these rules are pernicious.
In other words, work becomes far less efficient precisely when the most work needs to get done. The longer the backlog, the more things bog down. Engineers start spending time away from normal work—they’re stuck in meetings to set priorities about which fire to fight next; they’re handling special rush jobs for customers whose orders have been delayed; they’re solving problems that later get “overtaken by events.” In general, they’re dealing with the chaos and information overload that ensue when fire fighting is rampant. But that’s not the worst of it.
Counterproductive Problem Solving
The really bad news is that under fire-fighting conditions, pressures push engineers to solve problems not just inefficiently but badly. They don’t work on a problem long enough to uncover its root cause—they just make a gut-feel diagnosis. Then, instead of testing their hypothetical diagnosis offline, they introduce a hasty change in the process. And if the quick fix doesn’t solve the problem completely (it is usually unclear whether it helped or not), they leave it in place and try another solution. They don’t solve the problem because they don’t take the time to approach it systematically.
At best, this superficial problem solving, or patching, takes more time than systematic problem solving. Consider the following example: A manufacturer of steel cords had hundreds of machines in one facility. Because machine uptime was important, the company encouraged maintenance engineers to respond to breakdowns as quickly as possible. Even so, overall performance didn’t improve. Only after the company started keeping and analyzing records machine by machine instead of person by person did it realize that engineers were constantly interrupted while repairing one machine because another had failed. They would make a quick fix and move on to the next machine. Each original machine breakdown, as it turned out, generated many visits; on average, a problem was patched three times before it was finally solved.
Patching not only takes more time than systematic problem solving, it also fails to fix problems. A longer story shows why. A colleague and I recently helped an electronics company solve a major yield problem. The company fabricated parts in one U.S. factory and assembled them in another. The company had transferred assembly to Asia to reduce labor costs just as a new product was being introduced. At about this time, the assembly yields crashed; half or more of the devices failed. Customers were pleading for more product, but the company couldn’t meet demand. The result was an outbreak of fire fighting. A team was charged with finding a quick solution. Each member had a pet theory about what was happening and how changing the process would fix it. The Asian factory dutifully implemented one trial “solution” after another. Because of constraints at the factory, it took about a month to get the results from each trial.
Although this went on for months, the team got no closer to understanding the problem’s cause. Because team members didn’t think they had time, they never ran a controlled trial in which the same batch was assembled in both Asia and the United States. Hence there was no proof that the problem was due to a difference between the two facilities; it could have resulted from a change in fabrication that happened to coincide with the factory move. After all, the fabrication process was ramping up at the same time. Ironically, the company’s senior management talked a lot about using modern quality methods and systematic problem solving. But once the pressure from customers got too great, people fell back on patching, believing it would deliver faster results.
We suggested that the company develop a scientific understanding of the problem. To that end, we used lab experiments, mathematical analysis, and large controlled experiments in the factory. The main problem turned out to be previously unknown temperature sensitivity in assembly, the direct result of a process change that had been instituted to solve a problem the year before. It had been happening in both the U.S. and Asian plants, but seemingly inconsequential differences between them made the situation much worse in Asia. Once the cause of the problem was understood, fixing it was straightforward. Based on its new knowledge, the company also improved yields on many other products. And the knowledge gave the company a significant advantage over competitors grappling with similar problems. It took months to solve the problem this way, but fire fighting had taken even longer, to no avail.
When changes are introduced haphazardly, as they were for this process, they are frequently institutionalized without careful study. For example, noted author Primo Levi worked briefly after World War II as a chemist in a paint factory. Many years later, he met an old friend who was still working there. The friend told him the factory was producing an anticorrosion paint that contained a compound likely to accelerate corrosion. When the friend had questioned his bosses, he was told that the paint had always been made that way, that the compound was absolutely necessary, and that he shouldn’t change anything. As it happened, Levi had first included the compound in the formula. He did it strictly as a temporary measure to counter contamination in an important raw material, but his rationale was forgotten when he left, and the recipe was carved in stone.
Haphazardly introduced changes raise an even more serious issue: they can easily create new problems elsewhere in the process. That happens all the time in software development: while patching one bug, you create another. The same thing often happens in factories. In a metalworking factory, in order to improve performance of their part of the process, engineers changed the makeup of a coating. Months later, the company’s biggest customer developed a major problem: a metal framework that needed to be affixed to a rubber part no longer stuck to that part. The problem, stemming from the change in the coating, turned out to be extremely expensive for both the customer and the company—all the more so, of course, because the cause didn’t become clear for a long time. Patching can create new problems whenever the patches are not validated carefully.
Patching can be justified in a few situations. For example, in software development it’s common to add code that checks for a particular error. If that error occurs, the software delivers an error message and stops further computation. This is a patch because it does not solve the real problem, but it does prevent it from worsening. And in manufacturing, it’s common to add another inspection step when an as-yet-unresolved problem exists so flawed products can be pulled. This weeding out raises costs, but it avoids passing on the defective parts. Such superficial solutions are acceptable if several conditions are met. First, the patches should ameliorate much of the damage even though they don’t address the cause. Second, they should be solid enough that they won’t break down later. And third, they should have a better benefit-cost ratio than other solutions. The key cost here is not dollars but engineering time.
With those exceptions, patching is destructive. Solution rates fall and the number of hidden problems rises. (See the exhibit “The Effects of Fire-fighting Syndrome.”) The new problems that patching has created, and the old ones that it has failed to solve, act up more and more, until a large fraction of the incoming problems are actually old ones returning. The engineer’s environment becomes increasingly chaotic, which makes it harder to run experiments and pin down problems. In some cases, the organization’s ability to solve problems collapses completely, and overall performance plummets.[2] At that point, senior management may need to take drastic action, such as outsourcing much of the work, shutting down and starting over, or bringing in a massive infusion of outside help. Such turnarounds are a major drain on money and management time. And even when executives intervene, they sometimes make fire fighting worse by tackling only the current crisis without fixing its deeper causes. Fortunately, there are ways to avoid reaching such a crisis point.
The Effects of Fire-fighting Syndrome
You Can Prevent Fires
There are several ways to eliminate fire fighting. They can be loosely sorted into three categories: tactical, strategic, and cultural.
Tactical methods can be put into effect quickly without making high-level policy changes. Although some of the methods are culturally difficult in U.S. companies, many are simple to apply once a company recognizes the dangers of fire fighting.
Add temporary problem solvers.
When the rate of new problems jumps, bringing in temporary assistance is a good short-term solution. Fire-fighting departments, the real kind, put out calls for fire-fighters from neighboring areas to deal with the biggest blazes. In high tech, most hard-disk-drive companies have learned to send development engineers from the United States directly to their Asian factories when they start manufacturing a new product. These extra troubleshooters bring crucial expertise because they have often seen related problems during prototyping. Furthermore, this practice creates a powerful incentive: the U.S. engineers quickly learn that patching problems in development leads to more problems during ramp-up, leaving them in Asia longer.
There are drawbacks to temporary workers, of course. First, they’re effective only when the excess workload is sporadic rather than chronic. Second, pulling problem solvers from other parts of the organization risks setting fires in those areas. Third, temporary workers are often unfamiliar with the situation.
Shut down operations.
It’s been said before, but an ounce of prevention is worth a pound of cure. When the number of problems becomes too large, shut down operations until all are solved. Or, allow a new problem into the queue only when an old one is removed. Organizations where fire fighting is not part of the culture do this instinctively during product ramp-up. Some Hewlett-Packard development centers shut down a pilot line for the rest of the day once a certain number of problems is backlogged because until those problems are solved, there is no stable baseline for detecting and solving additional problems. Few companies have the fortitude to limit the queue size during normal operations.
Another approach to limiting the queue is to do deliberately what will happen anyway: admit that some problems will not be solved. The triage technique, borrowed from military medicine, controls the queue by regulating entry. Rather than let problems queue up indefinitely, or work their way through the queue only to be worked on sporadically, decide whether to commit resources to a problem when it first arises. This technique is organizationally difficult. It is much easier to tell people, “We’ll get to your problem as soon as we can,” and delegate it to someone who is overworked, than to say, “We’ve decided your problem isn’t critical so we’re not going to fix it.”
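The two queue-limiting ideas above, capping the queue and deciding at entry whether to commit to a problem, can be sketched as a small data structure. This is only an illustration: the class name `CappedQueue`, the `is_critical` flag, and the choice to decline new work while the queue is full are assumptions, not something the article prescribes.

```python
from collections import deque


class CappedQueue:
    """Hypothetical problem queue that admits work only below a fixed limit."""

    def __init__(self, limit):
        self.limit = limit
        self.open_problems = deque()
        self.declined = []  # problems the group explicitly chose not to fix

    def submit(self, problem, is_critical):
        """Triage at entry: commit to the problem now or decline it openly."""
        if not is_critical or len(self.open_problems) >= self.limit:
            self.declined.append(problem)
            return False
        self.open_problems.append(problem)
        return True

    def solve_next(self):
        """Remove the oldest open problem, freeing a slot for new work."""
        return self.open_problems.popleft()


queue = CappedQueue(limit=2)
queue.submit("machine 7 keeps jamming", is_critical=True)    # accepted
queue.submit("typo on shipping label", is_critical=False)    # declined by triage
queue.submit("assembly yield drop", is_critical=True)        # accepted
queue.submit("supplier delay", is_critical=True)             # declined: queue full
queue.solve_next()                                           # frees one slot
queue.submit("supplier delay", is_critical=True)             # accepted now
```

The point of the design is that every problem gets an explicit yes or no when it arrives, so nothing sits in the queue indefinitely.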
Strategic approaches to fire fighting take longer to implement than tactical methods, but they pay off across a range of projects and over long periods. Even if they don’t eliminate fire fighting completely, they increase the number of problems solved. The first several changes we mention focus on product design, but they have a major impact on manufacturing as well.
Change design strategies.
New product design has come a long way in the past decade. In some industries, companies have increased the commonality of designs across generations and products. That has reduced the number of design problems within and across product generations as well as the changes, and therefore the problems, in manufacturing. Commonality can be further enhanced by modular designs, which allow improvement of one section of a product without much change to others.
For example, hard-disk-drive companies used to have separate teams on successive drive generations—leading to designs where even the screws changed with each product. But gradually they adopted the “platform” concept. Now the capacity of a drive can be doubled by changing only the heads, media, and firmware, and substituting the latest and fastest signal processing chips on the circuit card. The new design is manufactured almost exactly as the previous one was, and the problems are concentrated in the new areas. Seagate, for instance, can transfer some products into manufacturing by moving only a few development engineers to the factory temporarily; five years ago, 20 or more were commonly needed.
Outsource some parts of design.
Companies in the auto industry have moved toward “black box” design: they specify only the characteristics of a subsystem, including its size, weight, power requirements, and performance. The subcontractor building the subsystem determines the best way to achieve the objectives, including which technologies to use. While the total number of problems may not go down, many are removed from the central design team.
Solve classes of problems, not individual problems.
It’s sometimes possible to group seemingly diverse problems together, determine a set of underlying causes, and then learn about those causes. Once they are understood, solving the individual problems is often straightforward.
An example of problem classes comes from the semiconductor industry. When integrated circuits were first manufactured, most circuits had to be thrown away because of defects. Companies developed codes for describing the failures based on which tests the circuits failed. When losses due to a particular code were high, a team would try to figure out why. Gradually it became clear that many individual problems were caused by particles falling on wafer surfaces and ruining the circuits. “Particle-caused defects” was thus a problem class. It shifted engineers’ attention from various specific failure codes to one well-defined problem: learning where the particles came from and how to get rid of them.
Eventually companies redesigned manufacturing facilities, turning them into “clean rooms” where air filters remove particles. As circuits got smaller, ever smaller particles became problematic. Another breakthrough came when engineers realized that people brought particles into the clean rooms. Again, this was a problem class: “contamination from people.” The companies solved many contamination problems at once by requiring workers to dress in special garments and forbidding the use of makeup and other skin coatings.
Determining classes of problems requires a deliberate, extensive, and sustained commitment to formal problem solving. A company must correlate information from different parts of the organization over long periods of time, develop scientific models that yield higher levels of process understanding, and run controlled experimentation in the factory. These are exactly the kinds of long-term activities that fire fighting pushes out.
Use learning lines.
Learning lines are production lines run to maximize real problem solving. Unlike pilot lines, which use special equipment and operators, learning lines use standard materials to make real products for customers. Thus they’re exposed to all the vicissitudes of the rest of the factory, such as bad material batches, unreliable machines, and careless operators. They’re used to gather data, run diagnostic experiments, debug processes, and do intensive problem solving, especially during manufacturing ramp-up. Often the performance of learning lines is the best in the factory, because innovations are put in place there first and problems are rapidly detected and solved. Part of the art of using a learning line is ensuring that it faithfully reflects conditions throughout the factory—and that improvements are quickly transferred to the rest of the plant. This is accomplished by not isolating engineers on the learning line. Every engineer should be able to use the learning line as a laboratory for investigation.
Develop more problem solvers.
One of the successes of the TQM movement is that nonengineers have been trained to solve simple problems. Even though they are not as fast or as knowledgeable as engineers, technicians and others can free up resources for difficult problems by handling many of the more mundane ones.
Cultural changes require shifts in the mind-set of the whole organization and in the behavior of senior managers. Extra work in organizations—even those that don’t usually fight fires—will occasionally create pressure to begin fire fighting. At these times, the organization’s problem-solving culture is critical. If managers are too far removed from the problems to see the consequences, and if the reward system favors firefighters, then the vicious cycle of fire fighting will begin. Avoiding this depends on the culture of middle and senior managers. We suggest the following guidelines.
A computer operating system (OS) can exhibit a behavior that is analogous to fire fighting. An OS has many demands placed on it simultaneously. For example, desktop PCs monitor a network connection, update the clock, interpret keystrokes, update the display, perform calculations, drive a printer, accept data from the hard-disk drive, collect garbage, and so forth. PC operating systems now also let end users run several programs at once: playing a music CD, downloading e-mail, and typing on a word processor, for example. Yet almost all computers have only one central processor, which switches among the various activities so that they appear to be going on simultaneously. Of course, sometimes the computer has more work than it can handle immediately—the traffic intensity is greater than 100%. The user experiences this as a slowdown, such as a stutter in the music or a pause in the screen update.
For a computer to work effectively, the OS must follow rules for switching among tasks on the basis of priorities. Handling communications inputs, for instance, is more urgent than most other jobs. But each “switch” itself consumes CPU time, much like the reduction in problem-solving effectiveness that occurs when engineers divide their attention between problems. Systems designers learned early that without well-defined rules governing task switching, an OS can become consumed with it—a phenomenon known as thrashing.
One insight from the OS analogy is that no fixed set of priority rules is effective under all circumstances. If, for example, an OS must process many short jobs and a few very long ones, priorities should be set to ensure that long jobs are allotted more time than short ones. But if all are long jobs, assigning equal time slices to each will result in none getting completed for a very long time. In that case, priorities should be set to ensure that some jobs are completed before time is allocated to the others. Managing fire fighting, like setting priorities in OSs, will not happen if you follow simple, rigid rules.
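The all-long-jobs case can be illustrated with a toy simulation. It is a sketch under stated assumptions, not a model of any real operating system: the function names are made up, every job has a positive length, and each switch to a job costs a fixed amount of time.

```python
def round_robin_finish_times(job_lengths, time_slice, switch_cost):
    """Finish time of each job when all jobs share equal time slices."""
    remaining = list(job_lengths)
    finish = [None] * len(remaining)
    clock = 0.0
    active = list(range(len(remaining)))
    while active:
        still_active = []
        for i in active:
            clock += switch_cost              # overhead of switching to job i
            run = min(time_slice, remaining[i])
            clock += run
            remaining[i] -= run
            if remaining[i] > 0:
                still_active.append(i)
            else:
                finish[i] = clock
        active = still_active
    return finish


def run_to_completion_finish_times(job_lengths, switch_cost):
    """Finish time of each job when jobs are done one at a time, in order."""
    finish = []
    clock = 0.0
    for length in job_lengths:
        clock += switch_cost + length         # one switch, then work until done
        finish.append(clock)
    return finish


shared = round_robin_finish_times([10, 10, 10], time_slice=1, switch_cost=0.5)
serial = run_to_completion_finish_times([10, 10, 10], switch_cost=0.5)
# shared == [42.0, 43.5, 45.0]
# serial == [10.5, 21.0, 31.5]
```

With three equally long jobs, equal time slices mean nothing finishes until time 42, and the switching overhead pushes the last job out to 45. Working on one job at a time delivers the first result at 10.5 and the last at 31.5.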
Don’t tolerate patching.
The importance of this point has already been discussed, but enforcing it requires support at all levels of management. At Intel, for example, managers from the CEO on down have extensive line problem-solving experience and can distinguish a patch from a real solution. If subordinates find that hasty solutions come back and bite them, organizationally speaking, they will avoid patching.
Don’t push to meet deadlines at all costs.
Such goals always favor fire fighting. Instead, be flexible about deadlines. Measure development projects by the number of outstanding problems. Most companies measure “open issues” and problems discovered after product release, and many good factories have accurate lists and measures of unsolved problems. If this list stays the same or grows for more than a month after a product introduction, the organization is in fire-fighting mode.
Don’t reward fire fighting.
In most U.S. organizations, the hero is the one who puts out the biggest fires. But where were these heroes when the problems started? Why didn’t they intervene sooner, before the problems grew so big? Companies should reward managers who don’t have a lot of fires to put out and who practice long-term prevention and systematic problem solving.
Building a Problem-Solving Organization
In today’s highly dynamic business environments, the key tasks for people in charge are innovating, improving, and dealing with the unexpected. The “unexpected” takes the form of problems whose solutions can open the door to innovation and improvement. Seen in this light, managers have a lot in common with engineers—and they’re just as prone to fire fighting.
So how does a fire-fighting organization transform itself into a problem-solving organization? It must recognize that the self-perpetuating fire-fighting syndrome is not an inherently irrational response to high-pressure management situations. Its genesis, rather, is a set of rules and behaviors that seem reasonable but really cause fire fighting in the long run. An organization must not only abandon these seemingly logical practices but also adopt techniques that, at first blush, appear irrational.
Curing the fire-fighting syndrome is not easy. Established organizational culture and shortsighted logic often work against it. Management that believes people work harder under pressure exacerbates the condition. But pulling your company out of fire-fighting mode is worth the effort because fire fighting drains your best workers; no matter how hard they work, they can’t put all the fires out. The obvious, albeit extreme, risk of fire fighting is that the organization will have to withdraw a product line or shut down a plant that has been rendered ineffectual. The less obvious but just as important risk is that the organization’s best problem solvers will become fed up and leave. Building a problem-solving organization is difficult, but the benefits are clear and the choice unambiguous.
[1] Robert Hayes is one of the exceptions, although he didn’t use our term for it. In 1981, he hypothesized that one reason American factories were more chaotic than Japanese factories was a difference in culture. He wrote: “American managers actually enjoy crises; they often get their greatest personal satisfaction, the most recognition, and their biggest rewards from solving crises. Crises are part of what makes work fun. To Japanese managers, however, a crisis is evidence of failure.” (“Why Japanese Factories Work,” HBR July–August 1981).
[2] In “Beating Murphy’s Law,” which I coauthored with Dorothy Leonard-Barton and W. Bruce Chew (Sloan Management Review, Spring 1991), we discuss a case where a factory’s entire production came to a halt, partly as a result of cascading fire fighting.
A version of this article appeared in the July–August 2000 issue of Harvard Business Review.
Roger Bohn is an associate professor at the University of California at San Diego. He developed the ideas in this article with Ramchandran Jaikumar, who was the Daewoo Professor of Business Administration at Harvard Business School until his death in 1998. This would have been his fourth HBR article.
Mahalo and much success,
Bohn, R. (2000, July–August). Stop Fighting Fires. Retrieved from HBR.org: https://hbr.org/2000/07/stop-fighting-fires