flakiness

Jim, Tom, and Joe study Computer Science together in college. After graduation, the friends part ways and go to work at three different tech companies. The companies vary in size, business segment, mission, and technology stack. Nonetheless, on the first day of work, each receives the same assignment: he’ll be developing an integration testing suite that will be run against a critical software product.

It’s not that the software has gone untested or that any of the organizations exclusively employs cowboy coders. Each codebase is supported by thousands of unit tests as a first line of defense. Major changes are subject to code review. Release candidates are evaluated by select groups of end users before being, well, released. At the heart of the testing suite is a battery of rigorous tests that have been developed over the years to catch bugs that are out of the scope of unit tests and too tedious to have end users with different goals evaluate. In other words, integration tests. There’s one catch–the tests are being carried out manually.

These manual tests constitute the single biggest bottleneck to each company’s release pipeline. As the software product has grown, the cost of manual testing per release cycle–in paid software developer hours–can be measured in the thousands of dollars. And humans are imperfect. There have been tests wrongly executed to the detriment of each firm. At Jim’s company, a startup social network, consumer data was exposed. Now the company is facing legal action. Tom’s employer–a financial firm–suffered a direct financial loss due to a flaw in logic in an algorithm. Joe works at a SaaS business. There was a prolonged outage and some customers are demanding refunds and switching to competitors.

What all three companies have in common is a desperate need to automate integration tests and automate them well. Across different business functions, the hurdle is clear–writing integration tests is hard, imperfect, and, frankly, exhausting. Integration tests are living, breathing things and must be afforded tender loving care to be effective.

The three begin clocking long hours, working to understand the scope of the tests while researching–or building–tools that can interface with the underlying system and test each component in a production-like environment. The only thing making this project even remotely approachable for one person is that the test assertions can be translated almost exactly from the manual plan. But that proves to be the hard part.

After just a month, Jim has completed his testing suite. He utilized as many 3rd party tools as he could and then used as many open source packages on top of those as he could. His code, while creating many new black boxes, is clean and efficient. The cost savings to the business are enormous and Jim’s boss is impressed. He immediately awards him another high-visibility project.

Tom takes ten weeks. Most 3rd party solutions didn’t apply to his use case, and he had to create his own libraries to help interface with the code. As he started missing project deliverable deadlines, he took a few shortcuts and didn’t spend time refactoring his code or writing good documentation. But the tests work and release velocity is increased substantially for the cycle following rollout. Tom’s boss is happy and pegs him as a high performer.

Joe delivers his product late. Really late. So late in fact that his boss simply stopped asking when it would be completed. He took time not only to understand the fundamentals of the software product, but also to talk to developers and construct a mental image of which components were at greatest risk of changing. He used this information to make design decisions that allowed flexibility and ease of adjusting to future releases. The libraries he had to write were elegant and all open source software was written in languages that he was intimately familiar with. When he delivered his product, though, there wasn’t much fanfare. In fact, some of the developers who stood to benefit the most from having their work automated had left the firm. Joe’s boss doesn’t have much faith in him.

Each developer puts a feather in his cap and gaily moves on to his next project.

Then some tests start failing. This is anticipated due to changes in underlying business logic, but there’s a separate class of tests that passes unreliably–seemingly at random. Sometimes they pass and sometimes they fail. They are flaky tests. They have driven many a sane engineer to the padded cell.

Jim can’t stand the sight of red (or yellow) in his testing dashboards. He digs into his code and tries everything that comes to mind. He manages to correct some, but not all, of the flakiness. He asks others for help but tends to cast a wide net and is often rebuffed. In the time since he completed his tests, most junior developers have lost their sharpness when it comes to the finer points of what could go wrong with the tests when they were run manually. Wisdom surrounding the most frustrating gotchas has been lost to the sands of time. Feeling the pressure of frequently failing builds, Jim doubles down and takes a myopic approach–the cause of flakiness must be identified at all costs. He starts neglecting other work. He gets worse at correcting business logic issues. There are powwows at his desk during every release where he makes up narratives for why code is failing. Sometimes he sounds like a conspiracy theorist. The release cycles slow down, and since the manual tests were automated so long ago, only a few developers–the highest paid ones–have the competency to execute the testing suite by hand. Jim’s neurosis causes him to become the biggest bottleneck of the release cycle. Upward mobility becomes a pipe dream.

Tom is making great strides in his career. He implements a disciplined approach for differentiating between business logic changes and flakiness. But he gets annoyed when he has to spend too much time on his testing suite. He views his newer projects as more important and the testing suite as a relic of the past. He doesn’t maintain his suite as he should and is quick to implement hacks. Sometimes his code confuses him, he admits. When pressure for sustained release velocity goes up, Tom takes a lax approach and is able to fly under the radar. He simply flags tests as “flaky” whenever he doesn’t have the time to investigate sources of failure. These tests get removed from the testing suite. Nobody notices. Eventually, coverage for certain esoteric but critical functions is removed entirely. One day, code running in full production causes a major loss to the firm. It turns out that one of the tests Tom had disabled would have caught the bug. He is promptly fired.

Joe is slowly building his reputation after he flubbed delivery of his project. He tries the neurotic approach to uncovering the causes of flakiness and too often finds himself mentally exhausted. Laxity doesn’t pay off either. He is quick to notice how slippery of a slope it is to code hacks and remove tests without thorough investigation. Joe realizes that sources of flakiness can be out of his hands. They can be due to the network, the host machine, 3rd party libraries, or other miscellaneous issues. What makes fixing flaky integration tests so difficult is that they aren’t easy to reproduce. He decides to implement a procedure for dealing with flaky tests.

When a test passes and fails several times in succession, his test engine marks it as “potentially flaky.” When he has time, he deconstructs the test to its most basic (but functional) components and begins running them in parallel to the main build. While this doesn’t allow him to reach definite conclusions, he’s able to isolate potential causes. He adds generous padding for operations that may be held to performance standards such as network communication, I/O, and threading. If a test continues to flake, it’s marked as unreliable and is excluded from the suite. The whole team is alerted to this. Once a test is marked as flaky, its severity (how risky it is to exclude) is assessed by Joe and it is prioritized accordingly by the software team, just as they would any bug report. The fix for each of these flakiness reports can vary, but is generally along the lines of 1. Fixing the test’s root cause–accompanied by a detailed investigative report, 2. Changing the logic of the test to work around a source of flakiness but still perform the same function, or 3. Removing the test entirely and explaining why the business logic isn’t possible to test or isn’t necessary to test.
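The first steps of Joe’s procedure can be sketched in a few lines of Python. This is only an illustration, not Joe’s actual test engine: the names RunHistory, Status, flip_threshold, window, and mark_unreliable are all hypothetical, and the rule for “passes and fails several times in succession” is assumed here to mean that the result flips between pass and fail at least three times within the last ten runs.

```python
from collections import deque
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Deque


class Status(Enum):
    STABLE = "stable"
    POTENTIALLY_FLAKY = "potentially flaky"
    UNRELIABLE = "unreliable"


@dataclass
class RunHistory:
    """Recent pass/fail results for one test (hypothetical helper)."""
    name: str
    window: int = 10          # assumed: how many recent runs to remember
    flip_threshold: int = 3   # assumed: flips needed to raise suspicion
    results: Deque[bool] = field(default_factory=deque)
    status: Status = Status.STABLE

    def flips(self) -> int:
        """Count how often the result changed between consecutive runs."""
        runs = list(self.results)
        return sum(1 for prev, cur in zip(runs, runs[1:]) if prev != cur)

    def record(self, passed: bool) -> Status:
        """Store one result and flag the test if it keeps changing its mind."""
        self.results.append(passed)
        if len(self.results) > self.window:
            self.results.popleft()
        if self.status is Status.STABLE and self.flips() >= self.flip_threshold:
            self.status = Status.POTENTIALLY_FLAKY
        return self.status

    def mark_unreliable(self, notify: Callable[[str], None] = print) -> None:
        """Exclude the test after investigation and alert the team."""
        self.status = Status.UNRELIABLE
        notify(f"{self.name}: marked unreliable and excluded; severity assessment needed")


# A test that alternates between passing and failing gets flagged.
history = RunHistory("test_checkout_flow")
for outcome in [True, False, True, False]:
    history.record(outcome)
print(history.status.value)
```

Counting flips rather than failures is a deliberate choice in this sketch: a test that fails on every run is probably reporting a real change in business logic, while one that keeps switching sides is the kind worth pulling aside for investigation.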

After two years on the job, the three friends meet up at a bar. Jim hasn’t received a promotion yet and is languishing in the backwaters of the startup he once idolized. Tom has worked for a financial services startup since he was fired, but is thinking about going back to school to get his master’s degree. Joe has received a promotion and is now the team lead for a small DevOps group. His pragmatism more than made up for his initial slowness and his work-life balance is so good that he’s taken to contributing to popular open source projects spanning the entire DevOps toolchain.

The three discuss their professional lives in great detail. Tom insists that if he had just gotten that promotion before his testing shortcuts caused the firm to take a loss, he would have been able to pass off his project to someone else. Jim says that if his colleagues helped him more and if his boss had dedicated more resources to the project, he would have attained a small fortune in stock options by now. His startup is set to go public next year. Joe is one of those guys who often suggests taking a step back, but he’s interrupted by the clock striking midnight.

Time for the nightly build.

Jim jumps out of his seat, pulls his phone out of his pocket, and starts frantically trying to make sense of the alerts being sent to it. He gets home to the safety of his personal laptop as quickly as he can. He barely says goodbye. Tom’s phone is ringing, but he just orders another Modelo. Tom and Joe split the tab and part ways. When Joe unlocks his phone to call an Uber, he has one alert:

“Nightly build finished — PASSING — 99% of tests succeeded”

Parable of the Flaky Test