Wednesday, August 24, 2011

Why Do Iterations Work?

Tip of the Month: January 2011

In product development we often use iterations to increase the quality and robustness of our products. Why does this work?

To begin, let me clarify my terminology. By "iteration" I mean covering the same ground twice. I do not use the term iteration to mean breaking a larger task into several smaller pieces; I call that batch size reduction. I must mention this because many people in the agile software community use the term iteration to refer to breaking a project into a number of smaller pieces. It is a superb technique, but I consider it confusing to label it iteration.

To me, a reference point for thinking clearly about iteration is Newton's Method, a numerical analysis technique for finding the root of an equation. In it, a calculation is repeated multiple times and the answer from each iteration is used as the basis for the next calculation. The answer gets better after each iteration. (Ignoring, for simplicity, the issue of convergence.) Newton's Method captures the essential mechanism of iteration: we repeat substantially the same activity in order to improve our result. However, it is important to recognize that, even in this case, while the form of the calculation is repeated, it is not precisely the same calculation. Each iteration uses different, and better, data.
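To make this concrete, here is a minimal Python sketch of Newton's Method (my own illustration, not part of the original tip) finding the square root of 2. Each pass repeats the same update rule, but starts from the improved answer produced by the previous pass:

```python
# Newton's Method: repeat the same form of calculation, but each
# iteration starts from the better answer produced by the last one.
# Here we find the root of f(x) = x^2 - 2, i.e. the square root of 2.

def newton(f, df, x0, iterations=5):
    """Apply the Newton update repeatedly, reusing each result as the next input."""
    x = x0
    for _ in range(iterations):
        x = x - f(x) / df(x)   # same calculation form, different (better) data
        print(x)               # watch the estimate improve each iteration
    return x

newton(lambda x: x * x - 2.0,   # f(x)
       lambda x: 2.0 * x,       # f'(x)
       x0=1.0)                  # converges toward 1.41421356...
```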

Now let's look at the difference between iteration and batch size using a physical analogy. Suppose I am going to paint a wall in my house with an imperfect paint roller. I can quickly apply a single coat and then iterate by applying a second coat. Let's say the roller, which has a small bare spot, fails to cover 2 percent of the surface area. Then, by the time I have done the second coat, the missed area will be (0.02)*(0.02), or 0.0004, just 0.04 percent of the surface. This quality-improving effect is a very common reason to use iteration.

Why did quality improve when we iterated? The key lies in the independence of each iteration. Because the probability of a defect for each coat was independent, the defect probabilities multiplied. If the defects in each coat were not independent, then iterating would not improve quality. For example, if there were a 3 cm concave pit in the wall, the second coat of paint would not solve this problem -- nor would the third, or the fourth.
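A small simulation, using the same illustrative numbers as the painting analogy, shows why independence is doing the work here:

```python
import random

# Each coat independently misses 2% of the surface, so the chance that a
# given spot is missed by both coats is 0.02 * 0.02 = 0.0004. A defect that
# is not independent of the coats (like a pit in the wall) never improves.

MISS_RATE = 0.02
SPOTS = 100_000

def fraction_missed(coats):
    """Fraction of spots still bare after a number of independent coats."""
    missed = sum(
        all(random.random() < MISS_RATE for _ in range(coats))
        for _ in range(SPOTS)
    )
    return missed / SPOTS

print(fraction_missed(1))  # roughly 0.02
print(fraction_missed(2))  # roughly 0.0004; the probabilities multiplied
```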

We can contrast this iterative solution with one that uses batch size reduction. Suppose that I apply the paint in two smaller batches, painting the left half of the wall completely before I paint the right half. This will not improve my defect rate. I will still have a 2 percent defect rate over the entire surface. The quality improvement due to smaller batch size arises from a different source: feedback. For example, before you paint your entire house shocking pink, you may want to show your spouse what the color looks like on a single wall. Thus, the mechanism by which small batch size improves quality is different than that of iteration.

What is centrally important to the power of iteration is the value that is added by each iteration. If Newton's method simply repeated the same calculation with the same data, quality would not improve. If you simply run exactly the same test, on exactly the same product, test outcomes will not be independent, and the iteration will produce no increase in quality. The power of iteration comes from how much new information is generated by each iteration.

This new information generally comes from two sources. First, when the second iteration covers something different than the first, it will generate new information. This difference may arise from a change in the test or a change in what is being tested. A second, more subtle, source of information arises when the performance of the system we are testing is stochastic, rather than deterministic. In such cases we can repeat exactly the same test on exactly the same product and we will still derive information from it. What information? We illuminate the probability function associated with performance. For example, suppose we wanted to determine whether a coin was fair or biased. Could we do this with one flip of the coin? Certainly not. Repetition is critical for understanding random behavior.
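The coin example can be made concrete with a few lines of Python (my own sketch, not part of the original tip). Repeating the identical test traces out the underlying probability distribution:

```python
import random

# One flip cannot distinguish a fair coin from a biased one.
# Repeating exactly the same test reveals the probability distribution.

def observed_heads_rate(p_heads, flips):
    """Estimate a coin's bias by running the identical test 'flips' times."""
    heads = sum(random.random() < p_heads for _ in range(flips))
    return heads / flips

print(observed_heads_rate(0.5, 1))       # 0.0 or 1.0: tells us almost nothing
print(observed_heads_rate(0.5, 10_000))  # about 0.5: looks fair
print(observed_heads_rate(0.7, 10_000))  # about 0.7: the bias becomes visible
```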

In the end, our development activities must add value to our products. It is not iteration that inherently improves quality; what matters is how efficiently the iteration generates useful information.

Don Reinertsen

The Lean Approach to Context Switching

Tip of the Month: March 2011

A great insight of lean manufacturing was recognizing the pivotal importance of reducing changeover costs. American manufacturers would run the same parts on their stamping machines for two weeks because it took 24 hours to change over the machine. Along came the Japanese, who reduced the changeover time by 100x, and suddenly short run lengths became cost-effective. With shorter run lengths, batch sizes became smaller, and this improved quality, efficiency, and flow-through time. The great blindness of the American manufacturers was accepting the cost of changeovers as immutable. This condemned them to use large batch sizes.

Today software developers wrestle with a similar problem. Some view the cost of switching context as a form of waste. They think they can eliminate this waste by minimizing the number of times that developers must switch context. This approach inherently treats the cost of context switching the same way American manufacturers treated the cost of changeovers.

Is there a leaner approach? Rather than avoiding context switching we should ask how we can minimize the cost of switching context. Let’s use a simple technical analogy. When we design a microprocessor-based system we can choose to service interrupts immediately when they come in, or we can periodically check for waiting interrupts, a technique called polling. If we service interrupts immediately we must stop the operation in process, unload data from registers into memory, fetch the interrupt data, process it, and then restore the data from the operation we just interrupted. This creates lots of overhead.

What happens when we poll interrupts? We periodically check a memory location to see if an interrupt is waiting and process it if it is. The advantage of polling is that we control when we check for interrupts. By checking during the natural breaks between jobs, we massively reduce the cost of context switching. The key point is that we can engineer technical and human systems to lower the cost of context switching – we don’t need to simply accept this cost as a constraint.
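As a rough sketch of the polling idea (my own illustration; the tip's analogy is about hardware interrupts), pending events queue up while a job runs and are serviced only at the natural break between jobs, so there is nothing to save and restore:

```python
import queue

# Polling: instead of stopping work the instant an "interrupt" arrives,
# we check for pending events at the natural break between jobs. No work
# in progress has to be suspended, unloaded, and restored.

pending = queue.Queue()   # stands in for the memory location that holds interrupts

def handle(event):
    print(f"handled: {event}")

def run_jobs(jobs):
    for job in jobs:
        job()                       # finish the current unit of work first
        while not pending.empty():  # then poll; the context switch is cheap here
            handle(pending.get())

pending.put("keyboard input")
run_jobs([lambda: print("job 1 done"),
          lambda: print("job 2 done")])
```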

But why would we want to switch context more frequently? It isn't always desirable to have long uninterrupted efforts on a primary activity. There are cases where parallel secondary activities can improve the quality or efficiency of the primary activity. For example, most engineering degree programs force students to switch context between different subjects. We could teach a year of pure math before we tackle physics, but students would have a much harder time seeing the connections between these two subjects. Similarly, as an author, I never write my books by completing single chapters in isolation. By periodically shifting between chapters I can achieve much better integration between them. Authors who work one chapter at a time often lapse into repetitiveness. By the time they are writing Chapter 10, Chapter 1 is a distant memory.

So, if you find yourself instinctively trying to reduce context switching you should ask yourself two questions. First, have you done everything you can to lower the cost of switching context? Second, are you capturing any benefits by switching contexts? If the benefits of context switching exceed its cost, then don't try to eliminate it.

Don Reinertsen

Please Wear Your Clown Hat When You Celebrate Failure

Tip of the Month: June 2011

A recent column in Wired magazine, "No Innovator's Dilemma Here: In Praise of Failure," recounted the story of the 5,127 prototypes used to create the first Dyson vacuum cleaner. In this column, Sir James Dyson notes his similarity to Thomas Edison, who said, "I have not failed. I've just found 10,000 ways that won't work." Dyson appears to take pride in his 5,127 prototypes as emblematic of the persistence and fortitude of an entrepreneur. In contrast, I think this extraordinary number of unsuccessful trials may illustrate a very fundamental misconception about innovation.

First, I should point out that I think Dyson is a brilliant entrepreneur who has created a very successful company. I also greatly admire his advocacy of design and engineering education. He gets an extraordinary number of things right. Nevertheless, I believe his approach to innovation, brute force trial and error, has severe weaknesses. While Dyson says that "…each failure brought me closer to solving the problem," it is not clear that his 5,127-prototype, 15-year journey should be used as a model.

I agree that if Edison were alive today, he would undoubtedly use Dyson’s approach. The real question is whether the approach of Edison and Dyson is good engineering. While Edison may be a deity of innovation to the lay public, not all engineers share this view. Consider the viewpoint of Edison’s contemporary and former employee, Nikola Tesla. Tesla was the technical genius behind alternating current (AC) power. This is the dominant form of power distribution today, and it became dominant because of its compelling advantages over direct current (DC) power. (DC power was tirelessly advocated by Edison.) Like Edison, Tesla was a creative genius; unlike Edison, he was a skilled engineer. What did Tesla think of Edison’s brute force, trial and error approach?

"If Edison had a needle to find in a haystack, he would proceed at once with the diligence of the bee to examine straw after straw until he found the object of his search…. I was a sorry witness of such doings, knowing that a little theory and calculation would have saved him ninety per cent of his labor." -- Nikola Tesla, New York Times, October 19, 1931

I believe that Tesla would have the same opinion of Dyson’s 5,127 prototypes. In 30 years working with engineering organizations I have never seen a product, innovative or otherwise, come within an order of magnitude of this number of prototypes. Why is this the case? Because great engineering organizations don’t just design products, they also design efficient strategies to find solutions. Unfortunately, these strategies are much less visible than the products they produce.

What do we mean by an efficient strategy? It is one that generates the maximum valuable information with the minimum expenditure of time and money. There is a science behind generating information, and it is called information theory. Why is it relevant? Because design processes must remove risk, and removing risk requires generating information. Information theory shows us that a pass/fail test generates the most information when its failure rate is 50 percent. In fact, the two worst places to operate a design process are at a 0 percent failure rate and at a 100 percent failure rate. A very high failure rate is as dysfunctional as a very low failure rate.
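The 50 percent figure follows from the entropy of a pass/fail outcome. A short calculation (my own sketch of the standard information-theory result) makes the shape of the curve visible:

```python
from math import log2

# Expected information (Shannon entropy, in bits) produced by a pass/fail
# test that fails with probability p. It peaks at p = 0.5 and drops to
# zero at the two worst operating points: 0% and 100% failure rates.

def information_bits(p_fail):
    """Binary entropy of a test with the given failure probability."""
    if p_fail <= 0.0 or p_fail >= 1.0:
        return 0.0
    return -(p_fail * log2(p_fail) + (1 - p_fail) * log2(1 - p_fail))

for p in (0.01, 0.10, 0.50, 0.90, 0.99):
    print(f"failure rate {p:.2f} -> {information_bits(p):.3f} bits per test")
```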

How do great engineering organizations achieve optimum failure rates?

1. The direction of their first step is determined by a hypothesis as to where a solution may be found. This step is the easiest.
2. The magnitude of their first step is chosen to create a 50 percent chance of failure. When steps are either too small, or too large, they will be inefficient at generating information. Unfortunately most companies gravitate to failure rates that are either too low or too high.
3. Finally, the information generated by each experiment must be carefully analyzed and used to modify the search strategy. Each chunk of new information alters conditional probabilities and thus suggests a new direction for the next trial. Many companies fail to pivot in the presence of new information.

You may recognize the above approach as the winning strategy in playing the game of Twenty Questions. If each question is carefully chosen to have a 50 percent chance of being correct, then over 1,000,000 possible alternatives can be explored in twenty questions. Success arrives by proceeding from general to specific questions; if you begin with specific questions you will never win.
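The arithmetic behind Twenty Questions is easy to verify (an illustrative sketch, not from the original tip): each well-chosen question halves the remaining possibilities, and twenty halvings cover 2^20 = 1,048,576 alternatives:

```python
# Twenty Questions as a search strategy: a question with a 50% chance of
# a "yes" eliminates half of the remaining possibilities, so the number
# of alternatives you can distinguish grows as 2 ** questions.

def questions_needed(possibilities):
    """Count the 50/50 questions required to narrow a field to one item."""
    questions = 0
    remaining = possibilities
    while remaining > 1:
        remaining = (remaining + 1) // 2   # each answer rules out half
        questions += 1
    return questions

print(2 ** 20)                       # 1,048,576 alternatives covered
print(questions_needed(1_000_000))   # 20 questions suffice
```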

So please do not help to perpetuate the myth that success at innovation is due to brute-force trial and error. Successful innovators, from Henry Ford to present day Internet entrepreneurs, explore possibilities systematically. With each result they modify their next move.

When I began consulting in product development 30 years ago a skilled entrepreneur told me success at innovation came from a willingness to, "…build a tall junk pile." I now realize that the height of the junk was not as important as the underlying logic behind each experiment. All observers will notice the height of the junk pile; only the most discerning will spot the careful logic behind the entrepreneur’s search strategy. Try to be in the second group.

Don Reinertsen

The Cult of the Root Cause

Tip of the Month: August 2011

“Why?” is my favorite question because it illuminates relationships between cause and effect. And when we ask this question more than once we expose even deeper causal relationships. Unfortunately, my favorite question has been hijacked by the Cult of the Root Cause and been transformed into the ritual of “The Five Whys”. The concept behind this ritual is simple: when trying to solve a problem, ask “Why” at least five times. Each “Why” will bring you closer to the ultimate cause of the problem. Finally, you will arrive at the root cause, and once there, you can fix the real problem instead of merely treating symptoms.

The wisdom of this approach seems obvious. After all, fixing problems is like weeding a garden. If you only remove the visible top of the weed, it can grow back; if you remove the root, then the weed is gone forever. Why not trace problems back to their root cause and fix them there? The logic seems flawless – that is, unless you stop to think about it.

Invisibly embedded in this approach are two important assumptions. First, the approach assumes that causality progresses from root cause to final effect through a linear chain of stages. Second, it assumes that the best location to intervene in this chain of causality is at its source: the root cause. Certainly there are many simple cases where both these assumptions are true; in such cases, it is indeed desirable to intervene at the root cause. However, these two assumptions are frequently wrong, and in such cases the five “Whys” can lead us astray.

Upstream Isn’t Always Best
Let’s look at the second assumption first. Is it always most desirable to intervene at the beginning of the chain, at the root cause? There are two important circumstances that can make it undesirable to intervene at the level of the root cause. First, when speed of response is important, attacking an intermediate stage may produce faster results. For example, you turn on your computer and see smoke rising from the cabinet. You brilliantly deduce that the smoke is probably a symptom of a deeper problem. Should you treat the symptom or fix the root cause? Most of us would treat the symptom by shutting off the power, even though we realize this does not address the root cause. Thus, we commonly attack symptoms instead of root causes when response time is important.

The second reason to attack a symptom is when this is a more cost-effective solution. For example, people who type produce spelling errors; in many cases the root cause of these errors is that they never learned to spell. We could address the root cause by sentencing bad spellers to long hours in spelling reeducation camps. While this may appeal to our sense of orthographic justice, it is more efficient to use spell checkers to treat the symptoms. Thus, we often choose to attack symptoms when it is more cost-effective to fix an intermediate cause than the root cause.

Networks Are Not Chains
Now let’s look at the first assumption: root cause and final effect are linked in a linear chain of causality. In many cases it is more correct to think of causes generating effects through a causal network rather than a linear chain. In such networks the paths that lead from cause to effect are much more complex than the linear sequence found in the root cause model. There are often multiple causes for an effect, and there can be multiple effects branching out from a single cause.

In such cases it is very misleading to focus on a single linear path. Doing so causes us to ignore the other paths that are entering and exiting the chain, paths that connect the chain to ancillary causes and effects. When we ignore these ancillary paths, we miscalculate the economics of our choices, and this in turn leads us to make bad economic decisions.

Consider, for example, problems with multiple causes. When you view such problems as having a single cause you cannot access the full range of options available to fix the problem. For example, every schoolchild learns that fires require a combination of heat, fuel, and oxygen. Which one is the root cause of fire? There is no one root cause; we can intervene in three different places to prevent fires, and each of these places can be attractive under specific circumstances. When we can’t control heat, we might choose to remove fuel. When we can’t eliminate fuel, we might eliminate heat. When we can’t eliminate either heat or fuel, we might eliminate sources of oxygen. The point is that by fixating on a single cause we lose access to a broader range of solutions.

Now, consider an intermediate stage with multiple effects. For example, diabetes is a complicated disease that affects many systems within the body. One of its key symptoms is high blood glucose levels. Some patients with Type II diabetes can bring their blood glucose levels under control with careful exercise and diet, but it takes time to do this. Meanwhile, a patient’s high blood glucose levels can lead to conditions like blindness, kidney disease, and heart disease. While high blood glucose is indeed a symptom, it is actually quite sensible to treat this symptom by using insulin. Treating the symptom alleviates the multiple effects of the symptom. If we only focused on a single effect we would underestimate the full benefits of treating the symptom. When selecting interventions it is important to consider the multitude of effects that can fan out from a node in the causal network.

Opening New Horizons
Once we have broken the spell of root cause fixation, two additional insights become available. First, the optimum intervention point may change with time. For example, let’s say that while sailing you get an alarm indicating high water levels in the bilge of your sailboat. Your immediate intervention may be to pump water out of the bilge. After the water is pumped down you may observe a crack in the hull, which you can temporarily plug until you return to port. When you return to port you can have the crack in the hull investigated and repaired. The optimum place to intervene has shifted from pumping, to plugging, to hull repair; thus, it is time dependent. Such dynamic solutions will exist in both causal chains and causal networks.

Second, because there are multiple possible intervention points, we can consider intervening at multiple stages simultaneously. For example, despite our best attempt to plug the crack in our sailboat, there may still be water coming in. If we want to return to port safely we may have to patch the leak and run our bilge pump. The idea that interventions should only take place at the “one best place” is an illusion.

So, ask the question “Why,” but use the answers with care. Don’t assume you will only encounter problems that can be reduced to a simple single chain of causality where the best intervention lies at the start of the chain. Be open to the possibility that you are dealing with a causal network that has multiple starting points and endpoints. You might even consider adding a few more questions to your toolkit:

1. Why do I think the root cause is the best place to fix this problem?
2. Why do I think I should only intervene at a single location?
3. Why do I think the best intervention point will remain static?
4. What other important causes and effects are entering and exiting my causal chain?

Happy problem solving!

Don Reinertsen