This week I took an adventure in 1980’s manufacturing plant optimization in Eliyahu M. Goldratt’s “The Goal: A process of ongoing improvement”. The book is featured in discussions around DevOps, particularly concerning its heavy reference in Kim, Behr and Stafford’s The Phoenix Project and all things Lean.
It was an unusual read for me but has certainly changed the way I approach problems in IT operations and even in getting my daughter to school on time.
In this post, I’ll aim to condense some of the principals taught in the book, their application in DevOps and finally offer a challenge to IT organizations where application deployment is not the primary line of business
The Goal is a fictional business parable that follows Alex Rogo, a manufacturing plant manager at UniCo manufacturing. His plant (and the business as a whole) are deep in the red, despite employing some of the best management wizardry of cost cutting exercises, order expedition, localized optimizations and even fancy new robots.
Alex has a chance meeting with a physics professor, Jonah, from his university days, who challenges some of Alex’s (and my own) most basic assumptions about productivity and the reason the business exists at all. Jonah leads Alex, via provocative Socratic inquiry, to approach his manufacturing and personal issues scientifically.
The result for Alex was saving the business and some general kickassery.
I was surprised how entertaining the book was. For me the subject matter of the plant was fairly foreign and uninteresting but the story telling is very human and relatable.
Manufacturing jargon is kept to a minimum, with only enough detail given to solidify some of the abstract ideas being taught.
The book didn’t feel dated at all. The only reminder that we are not in the current millennium is the mention of dot-matrix printers and smoking in the office.
Being a work of fiction, I did doubt whether the scale of some of the successes encountered in the book could be replicated in reality. While the turnaround of the fictional UniCo is extreme, the common-sense nature of the science described lends credibility. Fortunately for us, there is now also a long history of well published successes and failures with Eliyahu’s process and similar ideas from DevOps, the Toyota Production System or Lean manufacturing.
All of the lessons taught in the book could be condensed into a two or three page cheat sheet; even with substantial context. Unfortunately, some of the lessons are so contrary to our modus operandi that in this form they would read as nonsense; much like the notion of the world being round once did.
For example, Jonah suggests that most efficiencies gained in local areas of production actually serve to cripple the entire system, contrary to my intuition.
The beauty of the relationship between Jonah and Alex, is that Jonah never provides answers to anything. He only provides questions and validation of Alex’s assertions. In doing so, much like the Greek philosopher Socrates, Jonah teaches Alex how to think. Essentially, Jonah could give Alex a fish and feed him for a day, but instead teaches Alex how to fish, so Alex can own his own destiny.
This form of dialog caused me to continually consider my own situation. Clearly the nuances of manufacturing do not directly apply to IT operations, so some of the answers Alex discovers have no relevance to me (I don’t use heat treat ovens in the datacenter). Instead the challenges Jonah offers to my assumptions and way of thinking lead me to answers of my own.
The first question Jonah poses to Alex is “What is the goal”? In my position in IT Operations, I might have answered “uptime” or “minimizing costs” or “information security” or “good customer service”. Jonah highlights that these sorts of answers are actually just a means to an end, not the end itself.
For UniCo manufacturing, the end, or the goal, is to make money. I work for an outsourcing company, who have assigned me to a government body concerned with education. For me the goal is a little unclear, so I’ll give it some more thought and perhaps conversation with the powers-at-be. Is the goal to make money for my employer? Or to improve education for my client? Or to make money for my employer by improving education for my client? Or is the goal to assist the current government in winning the next election (not educational outcomes) so we can keep the client and continue to make money?
The point is, without clarity on the goal, you might unwittingly work to undermine it. For example, UniCo purchased new welding robots that improved local efficiency, but actually impacted negatively on their goal; to make money.
Once the goal is clear, Alex devises a succinct list of critical measures of progress towards the goal which are used to classify everything in the plant:
- Throughput: the rate at which the system generates money though completed sales
- Inventory: the money currently invested in assets it intends to sell (unfinished products, work-in-process, etc.)
- Operational expenses: money spent to turn inventory into throughput (wages, power, etc.)
These measures are specific to UniCo’s goal of making money through manufacturing but could be abstracted to other applications.
For example in education, throughput might be the number of student who graduate and proceed into successful work placements, inventory might be the number of students currently in the system and operational expenses remain; the cost of employing staff and keeping the lawns.
Dependent events, statistical fluctuations
Jonah highlights that a manufacturing plant represents a sequence (or many sequences) of dependent events. Meaning, some events cannot happen until preceding events have completed. E.g. assembly cannot begin until all required parts have arrived. Each step in the process is also subject to statistical fluctuations; where the duration of each execution of a discrete process will fluctuate. Welding or painting a part, will always vary in time.
This is fairly self-evident, and like me, Alex assumes that the average execution time of each production line station will determine the throughput of the system (i.e. time from customer order to dispatch and receipt). In practice however, the plant is failing to keep up and the issue is compounding over time.
While hiking a group of scouts in single file through a nature trail, Alex becomes frustrated that the group of boys cannot seem to stay together and are running behind time to complete the hike. Alex identifies that the hike is much like his plant:
- there is a goal: everyone arriving on time, all together
- one boy cannot move forward until the next boy has moved forward
- each step of each boy fluctuates in distance and duration
- the distance covered by the last hiker is the throughput of the system
- the energy expended is the operational expense of the system
- the distance between the first and last hiker, is the inventory or work-in-process.
Much like the factory, the work-in-process is in continual growth, with the distance between front and back of the line constantly expanding, increasing risk and the energy required to compensate for the distance.
Alex’s expectation was that the average pace of the boy scouts would determine the arrival time. Instead he discovers, that it’s actually the maximum accumulated, negative deviation from the average that determines the throughput (arrival time) and inventory (distance between front and back) of the group.
Each step that each boy takes will deviate slightly from the average. Every time a deviation is in the negative (slightly slower), that deviation accumulates as the boys behind cannot pass and slow down. Additionally the distance between the hiker in front increases. Compensating for this phenomena, requires significant effort and is often outside the capacity of the boys.
Alex further models the problem by passing matches along a sequence of bowls, based on the fluctuating rolls of a dice. The throughput of the last bowl is significantly lower than anticipated.
This misconception about the capacity of the hikers being based on average pace is shared at the UniCo plant and in my own estimation.
The illusion of local efficiency
What Alex learns (and Jonah later validates) is that the whole process is actually hinged on the bottleneck (or constraint), the slowest boy scout hiker. Alex forces all of the boys to maintain pace with the slowest boy, young Herbie, by placing Herbie at the front of the pack. As a result, the boys behind start operating below capacity which in turn leads to the entire pack walking closely together (a reduction in inventory or work-in-process). The boys are no longer required to jog to catch up the gaps between them and so operational expenditure is reduced.
Now that Herbie is setting the pace for the group, his capacity gets greater scrutiny and the group realize Herbie is carrying several kilos of obscure camping equipment in his ruck sack. The equipment (or load) is distributed amongst the boys with higher capacity and suddenly the entire group is moving quicker (throughput)!
Prior to understanding this metaphor, I might have hinged the success of the group on the fastest kids (highest capacity resources). Ultimately, any efficiency gained in the faster boys (strength, speed, etc.) is a complete and utter waste when considering the goal of arriving on time and together.
Likewise, the robots purchased by UniCo were a waste. Because of their impressive capacity, they only served to increase inventory (by piling up products in front of the next work station) and operational cost (staff working overtime to catch up). The robots did increase the local efficiency of their work centers, but undermined the goal of the company, to make money, by increasing inventory and operational expenditure without increasing throughput of the entire system.
The process of ongoing improvements
The author provides a process to solve these problems, which in the parable, is devised by Alex and his team of colleagues. This is the process Alex used to rectify the hiking boy scouts, and his manufacturing plant woes:
- Identify the system’s primary bottleneck/s (Herbie, or the plant’s heat treat machine)
- Decide how to exploit the bottleneck (move Herbie to the front or keep the heat treat machine running during lunch breaks)
- Subordinate everything else to the above decision (all kids slow down to Herbie’s pace and all other workstations prioritize parts bound for the heat treat machine and remain idle if there are sufficient parts ready)
- Elevate the system’s constraints (distribute the contents of Herbie’s rucksack or install additional heat treat machines)
- If a bottleneck is broken (i.e. no longer the bottleneck) repeat from step 1, but don’t allow inertia to become a constraint (in this case, market demand became the new bottleneck for Alex’s plant, but inertia caused them to stockpile too many completed products)
Once the manufacturing plant is once again profitable, Alex faces the challenge of responding to changes in the marketplace and planning ahead to prevent any further decline. With further prompting from his mentor Jonah, Alex determines there are three critical challenges for managers in applying his new knowledge:
- What needs to change?
- What does it need to change to?
- How do you execute the change?
These steps are particularly critical when approaching changes to the most difficult of all business resources to manage: people. When trying to change the culture, values and thinking of people, the book highlights two important strategies:
Socratic inquiry: leading people to answers by simply invoking curiosity. Alex did this by sharing Jonah’s questions with his colleagues and working together to find the answers.
The scientific method: create a hypothesis, A/B test the hypothesis (with a control) and share the conclusion. Alex did this by comparing the results of his hypothesis to the results attained in other UniCo plants, or simply with “before and after” results.
Contribution to DevOps
Prior to reading this book and The Phoenix Project, I would have described DevOps as:
Enabling frequent production releases by making developers and sysadmins play nice together and using some cool new automation tool chains.
I now have a completely different perspective. In fact, the cooperative culture and tools have taken a heavy dive in significance for me. They are simply a means to an end. They were a tailored response to a discrete problem. Truth is, in my own experience in adopting “DevOps”, the CI/CD (Continuous Integration/Continuous Delivery) tools simply became the overpriced, under performing, state-of- the-art, highly “efficient” robots in Alex’s manufacturing plant.
CI/CD tools and DevOps culture were born from a specific problem. That is, that successful implementations of Agile development methodologies have significantly increased the production capacity of development teams, typically without exceeding marketplace demand. Historically, operations teams held the title of “most efficient” as software releases were infrequent. Now the tables have turned and operations has become a bottleneck in businesses achieving their primary goal; make money first, fast and forever.
DevOps addresses this problem, using Goldratt’s principals in the following ways:
- Production deployment is identified as the bottleneck in achieving the goal
- The bottleneck is “exploited” in that more frequent releases keep operations functioning at full capacity. Operational activities that don’t directly progress the goal are deprioritized.
- Work is buffered in front of the bottleneck using “feature toggles” that operations can enable via configuration when ready
- Testing and quality assurance are moved before the bottleneck using Continuous Integration and automated testing tools. This elevates the bottleneck by preventing it from processing defective goods (i.e. buggy software)
- Batch sizes are reduced, increasing the work of the testing, packaging and release work centers, but increasing flow through the bottleneck (smaller changes, lower risk, easier planning, etc.)
- Automation is used to elevate the operations bottleneck. E.g. one touch deployments, infrastructure as code, containers, etc.
- As bottlenecks are defeated (i.e. ops can keep up with dev), other bottlenecks (prioritized according to the goal) are identified and attacked such as operational outages, technical debt, infrastructure projects, compliance problems, etc.
Application in IT Operations
Deploying applications into production is actually a very small part of what my organization does. Our focus is more in service desk, process management and infrastructure services.
Essentially, CI/CD is of very little (if any) value to us in achieving our goals. For this reason, often my colleagues express that DevOps is also of very little use to us. Eliyahu’s parable illustrates that the Theory of Constraints, a core component of DevOps thinking, is actually more relevant than ever; it’s just that the application will look different to CI/CD.
Our goal, hypothetically, is to make money. We do this by increasing throughput (projects delivered, requests fulfilled, etc.) while minimizing operational expenses and inventory (incomplete works).
Our bottleneck is not the frequent deployment of new software features. What we need to determine, is where the primary bottleneck is in achieving our goal. Hypothetically, it could be:
- Service outages
- Change and release management
- Skills shortages
- Quality assurance
- Inter-team politics
- Project delivery
Improving any one of these in isolation could actually serve to undermine our goal by increasing inventory or operational expenses, without increasing throughput.
The next step, the challenge to you and hopefully the subject of a future article, is the application of Goldratt’s process of ongoing improvement to more diverse IT Operations organizations and overcoming the challenges of organizational change.