I’m not a fan of horror stories, but I often find myself as the main character in them.
As a DevOps engineer, it’s important to approach production issues with poise and tact, regardless of how tempting it may be to close your eyes and imagine you’re instead on a ski slope somewhere. Maybe for some people the ski slope is even scarier than badly behaving code. My point is that any engineer trusted to oversee or directly participate in the handling of a software incident must be comfortable with the mental and emotional gymnastics that it may incur.
Over a year ago, I assumed ownership of a monitoring platform that was central to the business. It tracked uptime, displayed configuration information, and provided emergency shutoff functionality for almost every application in the firm. It was regarded as the most useful piece of support software we had, and, to the creator’s credit, it’s incredibly reliable!
During the first few months, I made a few changes and gained experience releasing and maintaining the app. Then, as it had reached a level of maturity, changes were less frequent and several months went by without a release.
Before I knew it, the time came to do a major release. The tests were passing, two people had reviewed the code from top to bottom, and the overall risk was classified as being quite low. You might say that this was the calm before the storm, or that innocuous part of the horror film where the main character finds delight in the mundanities of the daily routine.
Deployment time came. I started up the first container and it appeared that there was some network congestion. There were no errors, but the app took much longer to start up than usual. This hadn’t been seen in simulation testing. Since there was no service level agreement or, generally speaking, any established performance metrics, I thought that this was okay. The truth is that networking problems are tricky and deceptive, and I hadn’t yet developed the intuition to thoroughly investigate this issue.
I deployed all of the redundant containers and, aside from the initial startup delay, everything seemed to be working. Applications were getting monitored just fine and there were no complaints.
When I arrived at work the next morning, there were quite a few complaints. The service was slow to show new applications or updates to existing applications. How slow? Five seconds, ten seconds? After a test of my own, it was apparent that something was wrong: it took almost a full minute for updates to propagate through the system. This was the chase scene of the horror movie. As I peeled back the layers of the web of consequences, my heart started racing and it seemed inevitable that my brains would be eaten by a family of zombies.
Immediately, I initiated a rollback. While the update slowness wasn’t the end of the world, if there was blocking at the network level it was possible that the emergency stop command (via SSH) would be delayed or even lost in the shuffle. This wasn’t just an unacceptable defect, it was a huge problem. That this error could have made it into production showed that testing, specifically acceptance testing after deployment to production, was sorely lacking. Thankfully, no major issues were reported. I lived to tell the story.
I went back to the drawing board. I re-deployed in a simulation environment and the issue didn’t recur. I deployed on my VM. I deployed on other servers. Finally, I had to test a one-off instance on the production server. Clear as day, there was the slowness.
One by one, I ruled out the sources of this issue, especially those related to code changes. Finally, I stumbled upon the requirements.txt file that was passed to the pip install in the Dockerfile. None of the package versions were pinned.
To pin a package version means to confine that package, upon installation, to either one version or a small subset of versions. Without pinning a package version, you’re susceptible to inconsistent behavior and bugs introduced in new versions of those packages.
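As a rough illustration of what pinning looks like in practice, here is a minimal Python sketch that flags requirement lines not fixed to one exact version. The function name, the regular expression, and the sample package list are my own assumptions for illustration, not anything from the monitoring app, and the check only understands simple name==version lines.

```python
import re

# An exact pin looks like "name==1.2.3", optionally with extras or an
# environment marker. Wildcards such as "==1.*" do not count as a pin.
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]*\])?\s*==\s*[^\s;*]+\s*(;.*)?$")


def unpinned_requirements(text):
    """Return the requirement lines that are not pinned to one exact version."""
    loose = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options (-r, -e, ...)
            continue
        if not PINNED.match(line):
            loose.append(line)
    return loose


# Hypothetical requirements.txt contents, invented for this example.
sample = """\
eventlet
flask>=0.12
requests==2.18.4
"""

print(unpinned_requirements(sample))  # ['eventlet', 'flask>=0.12']
```

A check like this could run in a build pipeline and fail the build when the returned list is not empty, which would have caught the problem before the image was ever built.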
Don’t get me wrong. I don’t think anyone has ever made a case against pinning versions. The only objection that would make sense is that an application may, after years of developer inattention through release cycles, end up using deprecated and unsafe versions of packages. Even then, at least the app will still work.
After noticing the issue, I compared package versions between the old image and the new one that I had deployed to little fanfare. There were some massive disparities in versioning. By working through the list one by one, I eventually realized that the problem was being caused by Eventlet, the concurrent networking library for Python. I was then able to determine that the regression occurred between v0.20.0 and v0.20.1. I could still only recreate the problem on the virtual machines used for the production monitoring service.
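The comparison itself is easy to script. The sketch below assumes you have captured the installed packages of each image as pip freeze output, for example by running pip freeze inside a container started from each image. The function names and both package listings are invented for illustration and are not the actual contents of either image.

```python
def parse_freeze(text):
    """Map package name -> version from `pip freeze`-style lines (name==version)."""
    versions = {}
    for line in text.splitlines():
        line = line.strip()
        if "==" in line and not line.startswith("#"):
            name, version = line.split("==", 1)
            versions[name.lower()] = version
    return versions


def version_drift(old_text, new_text):
    """Return {package: (old_version, new_version)} for every package that differs.

    A version of None means the package is missing from that listing.
    """
    old, new = parse_freeze(old_text), parse_freeze(new_text)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    }


# Made-up listings standing in for the old and new images.
old_image = "eventlet==0.20.0\nsix==1.10.0\n"
new_image = "eventlet==0.20.1\nsix==1.10.0\ngreenlet==0.4.12\n"

for package, (before, after) in version_drift(old_image, new_image).items():
    print(package, before, "->", after)
```

With a short list of changed packages in hand, you can pin them back one at a time until the slowness disappears, which is essentially the manual process I went through.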
I came out of this gauntlet alive and intact, and with some important lessons about using Docker in production:
1. Pin versions of all dependencies both in the Dockerfile and all requirements files
2. Have a test environment that is an exact replica of the production environment
3. Consider implementing acceptance tests after deployments
4. Have basic SLAs to enforce acceptable performance metrics
5. Eliminate as many sources of Docker toil as possible by streamlining the build, release, and deploy processes and always using a Docker registry
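To make lessons 3 and 4 a little more concrete, here is a minimal sketch of a post-deployment acceptance check that measures how long an update takes to become visible and compares it to a latency budget. Everything in it is hypothetical: the trigger and is_visible callables stand in for whatever your service exposes, and the budget values are placeholders, not numbers we actually agreed on.

```python
import time


def measure_propagation(trigger, is_visible, timeout_s=60.0, poll_s=0.5):
    """Call trigger(), then poll is_visible() until it returns True.

    Returns the elapsed seconds, or None if the timeout is reached first.
    """
    start = time.monotonic()
    trigger()
    while time.monotonic() - start < timeout_s:
        if is_visible():
            return time.monotonic() - start
        time.sleep(poll_s)
    return None


def within_budget(elapsed, budget_s):
    """True only if the update became visible and did so within the budget."""
    return elapsed is not None and elapsed <= budget_s


# Stand-in for a real service: the "update" becomes visible after 50 ms.
state = {}


def fake_trigger():
    state["sent_at"] = time.monotonic()


def fake_is_visible():
    return time.monotonic() - state["sent_at"] >= 0.05


elapsed = measure_propagation(fake_trigger, fake_is_visible, timeout_s=2.0, poll_s=0.01)
print("propagated within budget:", within_budget(elapsed, budget_s=1.0))
```

Run right after a deployment, a check along these lines would have flagged the near one-minute propagation delay before anyone arrived at work the next morning.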
As for the actual, definitive cause of this bug (which Eventlet code change broke things, and why it’s only triggered on this one class of server that we have), the jury is still out. We’ve explored several different hypotheses but, much to my chagrin, haven’t had the bandwidth to dig in any further, as pinning the version was all that was required to ship the code changes and move forward.