
Sometimes production code blows up. Though rolling back to the previously tagged version is a common solution, it isn’t always the most appropriate. Certain situations demand a hotfix. In contrast to a rollback, a hotfix is a forward fix. It’s a modification to a live codebase, potentially skirting testing procedures, necessitated when the expected loss from taking more time to fix it outweighs the risk of cowboy coding making the problem worse.

When you’re working with microservices in Docker, it’s inevitable that things will go wrong at some point. We all strive to catch errors as early as possible in the development process, but sometimes bugs slip through to the end user. While systems should be designed for fault-tolerance, a single service returning a 500 can wreak havoc across several business processes at once.

I’ve found myself in situations where production code is failing and users are being impacted. The first step is issue triage. What’s wrong, who uncovered the issue, and who else might already be poking around in the code base? If there hasn’t been a recent deployment, there is no failover process, and it’s looking like I’m the only developer in the office, I’ll immediately try to judge the severity and the need for escalation.

As you may imagine, there are times where fixing the service is of the utmost priority. These are the cases in which hotfixing a docker container is pragmatic.

Through a combination of descriptive error messages and intuition, let’s say that I immediately know what part of the code I need to fix. But the firm is also losing hundreds of dollars per second (no–this hasn’t happened to me). Rebuilding the docker image is out of the question. The fastest turnaround time involves editing the hot container and then restarting it.

Let’s say I decide to curl the route for a service called my_prod_service and I get a 500.


[root@Home-Dockertest my_production_app]# curl -I http://67.205.177.44:8001/
HTTP/1.1 500 BIG_PROBLEM
Content-Type: text/html

If you’re familiar with docker, you know that docker exec is a powerful command that gets you inside the filesystem of a container. Before I use that, I have to find the container ID.


[root@Home-Dockertest my_production_app]# sudo docker ps | grep my_prod_service
40d98f158d1d        8031004f07cf        "uwsgi --plugins 'htt"   2 minutes ago       Up 2 minutes        0.0.0.0:8001->8001/tcp   my_prod_service

The container ID is 40d98f158d1d. Now it’s time to look inside.


[root@Home-Dockertest my_production_app]# sudo docker exec -it 40d98f158d1d sh
/usr/src/app # ls
Dockerfile        app.py            requirements.txt
/usr/src/app # cat app.py
def application(env, start_response):
    cause_a_problem = True
    if cause_a_problem:
        start_response('500 BIG_PROBLEM', [('Content-Type','text/html')])
        return [b"Something is wrong"]
    start_response('200 OK', [('Content-Type','text/html')])

    return [b"Hello World"]

In the most contrived example I will hopefully ever provide, the cause_a_problem flag being set to true is the immediate culprit. I can edit the file directly and the changes will take effect when I restart the container.


/usr/src/app # cp app.py app.py.BEFORE_HOTFIX
/usr/src/app # vi app.py
/usr/src/app # diff app.py app.py.BEFORE_HOTFIX
--- app.py
+++ app.py.BEFORE_HOTFIX
@@ -1,5 +1,5 @@
 def application(env, start_response):
-    cause_a_problem = False
+    cause_a_problem = True
     if cause_a_problem:
         start_response('500 BIG_PROBLEM', [('Content-Type','text/html')])
         return [b"Something is wrong"]

Even though time is of the essence in this scenario, you should still consider making copies of the files you will be editing. I recommend it even for simple, timely fixes like this, because if you end up having to exec in again (this can become a vicious cycle), you’ll want a checkpoint that tells you, for certain, what you’ve changed.

Once done making changes, exit the container with ‘exit’.

Then restart the container.


[root@Home-Dockertest my_production_app]# sudo docker restart 40d98f158d1d
40d98f158d1d

Confirm that the issue is solved.1


[root@Home-Dockertest my_production_app]# curl -I http://67.205.177.44:8001/
HTTP/1.1 200 OK
Content-Type: text/html


And your hotfix is complete. Almost. Because we cut out the intermediate steps, we now need to back up and clean up after ourselves.

First, never try to recreate your hotfixed code changes from memory. This is error-prone and, plainly, a waste of effort. All you have to do is copy the affected files from the docker container back into their respective repositories, add them to a hotfix-* branch, and commit your changes. It’s safer this way2.


[root@Home-Dockertest my_production_app]# cp app.py app.py.BEFORE_HOTFIX
cp: overwrite ‘app.py.BEFORE_HOTFIX’? y
[root@Home-Dockertest my_production_app]# docker cp 40d98f158d1d:/usr/src/app/app.py app.py
[root@Home-Dockertest my_production_app]# diff app.py app.py.BEFORE_HOTFIX
2c2
<     cause_a_problem = False
---
>     cause_a_problem = True

Of course, telling you to commit your changes assumes you have a CI pipeline set up to run your tests. If not, run your tests before you commit, however it is that you accomplish testing. For those with mature CD pipelines, this topic may get the wheels turning about how to fit hotfixes into your workflow. I would say more about how that should be done, but it’s outside the scope of this article and something I am not yet comfortable writing about at length.
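To make the copy-back-and-commit step concrete, here is a minimal sketch. The repository layout, commit messages, and branch name are all hypothetical, and a throwaway repo (plus a sed edit standing in for docker cp) is used so the commands can be followed end to end:

```shell
#!/bin/sh
set -e

# Throwaway repo standing in for the real application repository.
cd "$(mktemp -d)"
git init -q .
git config user.email "dev@example.com"
git config user.name "Dev"

# The repo still holds the broken version that shipped as v1.0.0.
printf 'cause_a_problem = True\n' > app.py
git add app.py
git commit -qm "broken v1.0.0"

# Step 1: pull the patched file out of the container. Simulated with sed here;
# the real command is: docker cp 40d98f158d1d:/usr/src/app/app.py app.py
sed -i 's/True/False/' app.py

# Step 2: branch, inspect the diff, then commit the hotfix.
git checkout -qb hotfix-cause-a-problem-flag
git diff                                   # confirm only the intended change
git commit -qam "Hotfix: disable cause_a_problem flag"
git log --oneline -1
```

The point of branching first is that the hotfix stays isolated and reviewable, even though it reached production before it reached version control.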

Once that’s done, it’s safe to create a new image from the running container and retag it as you see fit:


[root@Home-Dockertest my_production_app]# docker commit 40d98f158d1d my_prod_service:v1.0.0-hotfix-cause-a-problem-flag
sha256:c3b1efd1b5896251292998dfe04f036ce814bf4d04273056761b838f8cb8b0c6
[root@Home-Dockertest my_production_app]# sudo docker images
REPOSITORY          TAG                                  IMAGE ID            CREATED             SIZE
my_prod_service     v1.0.0-hotfix-cause-a-problem-flag   c3b1efd1b589        9 seconds ago       45.92 MB
my_prod_service     v1.0.0                               8031004f07cf        21 minutes ago      45.92 MB

And then either A. Redeploy it during downtime if you’re running a single point of failure3, B. Do a rolling deployment among your redundant instances, or C. If Docker ever allows mutable tagging, change the tag on the existing container and skip redeploying a new image entirely.
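As a sketch of option B, assuming your redundant instances happen to run under Docker swarm mode and the service name matches the container name (both assumptions here, not something shown above), a rolling update would look roughly like:

```
docker service update \
  --image my_prod_service:v1.0.0-hotfix-cause-a-problem-flag \
  --update-parallelism 1 \
  --update-delay 10s \
  my_prod_service
```

This replaces replicas one at a time with a pause between each, so the service stays up while the hotfixed image rolls out.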


While this isn’t always the best solution, there is no faster way to persist changes to your containers. When practiced regularly and used properly, it’s an indispensable approach for dealing with docker containers in production environments.


[1] If you are running multiple containers, repeat this step for every one. If you’re running a massively-redundant system, it may be faster to edit the image and redeploy via swarm instead of using exec. A well-written Ansible script could be used to distribute the change perhaps even more quickly. This will be the topic of another blog post.

[2] I don’t have git set up on this server, so pretend the diff was a git diff on master

[3] This is a terrible idea

Hotfixing Docker Containers