schoolofwhales

I’ve been doing a lot of microservice work recently. Specifically, I’ve been building Flask apps and deploying them in Docker containers. As I’ve started experimenting with running these services redundantly, I’ve been faced with an interesting problem: how does one identify which redundant instance he is connected to?

Let’s say I run three instances of a service behind HAProxy. The URL http://my_websrv maps to three services exposed across two servers: 192.168.1.1:5000, 192.168.1.1:5001, and 192.168.1.2:5000. The routing algorithm used is irrelevant.

When everything is running correctly, users will hit http://my_websrv and will have the expected user experience. Sometimes, one of the services will go down for maintenance, and its users will be routed to a different (but indistinguishable) available node. What I’ve seen in production scenarios, though, is that being able to identify which specific instance you are connected to can help the service owner triage issues much more quickly, especially when proper log aggregation, monitoring, and alerting are lacking. It also enables users to write more detailed bug reports.

For example, if a user reports a problem with the Flask app, what’s immediately important is being able to recreate the issue. Since the service is still accessible (just not working properly), all we know is that whatever is wrong is not tripping the service’s health check. When debugging a redundant system, it’s a good idea to test on the instance where the incident was first discovered. If the problem is confined to one box, which, admittedly, is rare, that fundamentally shifts the priority and manpower that must be dedicated to the investigation. The problem is that there is no out-of-the-box way to determine which instance HAProxy has routed that user to. If it’s an obvious problem requiring few or no steps to recreate, the developer can quickly click through each IP address and port (assuming there are relatively few) until the offending instance is found. But there must be a better way.

Each redundant microservice instance will have two unique identifiers. The first will be the IP:PORT pair, and the second will be the Docker container ID (technically, a third is the IP:CONTAINER_NAME but why complicate things?). What I have started doing is exposing the container ID (yes, you can retrieve it inside the container) as a Flask route served at /inst. This way, you can ask a user to hit that address to instantly (ha-ha) confirm what container they’re on. If your monitoring, logging, and alerting systems are exceptional, you likely expose a unique identifier elsewhere and may already have been alerted by the time the user reports the problem, but there’s no guarantee.

In addition, this route can help when experimenting with different routing algorithms, since each response shows which instance served the request.
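As a rough sketch of that kind of experiment, the snippet below calls the route repeatedly and counts how often each instance answers. The helper names (fetch_instance_id, tally_instances), the request count, and the timeout are my own placeholders, and it assumes the /inst route defined below is deployed behind http://my_websrv.

```python
from collections import Counter
from urllib.request import urlopen


def fetch_instance_id(url="http://my_websrv/inst"):
    """Ask the load-balanced URL which instance answered (hypothetical helper)."""
    with urlopen(url, timeout=5) as response:
        return response.read().decode().strip()


def tally_instances(fetch, requests=30):
    """Call fetch() repeatedly and count how often each instance ID comes back."""
    return Counter(fetch() for _ in range(requests))


if __name__ == "__main__":
    # Each urlopen call opens a new connection, so the proxy picks a backend every time.
    for instance, hits in tally_instances(fetch_instance_id).most_common():
        print(instance, hits)
```

A roughly even spread would suggest something like round-robin, while a heavily skewed one might point to sticky sessions or weighting.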

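import subprocess
from flask import Flask

app = Flask(__name__)

@app.route('/inst')
def get_container_id():
    # Take the last 'docker' line of the cgroup file and keep what follows the final slash.
    # Raw string so the backslash in the sed expression reaches the shell unchanged.
    cmd = r"cat /proc/self/cgroup | grep 'docker' | sed 's/^.*\///' | tail -n1"
    # Popen (capital P) is the class; subprocess.popen does not exist.
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, shell=True)
    # communicate() returns bytes; decode and drop the trailing newline.
    output = p.communicate()[0].decode().strip()
    if not output:
        return "Not using docker or failure to parse container ID for this version of docker"
    return output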

The layout of /proc/self/cgroup differs between Docker versions, so the container ID has to be parsed differently for each. If the above doesn’t work for you, exec into your container and start with the cat command, then build your succession of pipes from there. Once you’re ready to test in Python, run the Python shell inside the container and make sure your code works as intended.
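If you would rather not shell out, the same idea can be sketched in pure Python, which is also easier to poke at from the Python shell. This is not the code I run above: parse_container_id and read_container_id are hypothetical names, and it assumes the ID appears as 64 lowercase hex characters on a line that mentions docker.

```python
import re

# Assumption: full Docker container IDs are 64 lowercase hex characters.
_CONTAINER_ID = re.compile(r"[0-9a-f]{64}")


def parse_container_id(cgroup_text):
    """Return the ID from the last line mentioning docker, or None if none is found."""
    found = None
    for line in cgroup_text.splitlines():
        if "docker" not in line:
            continue
        match = _CONTAINER_ID.search(line)
        if match:
            found = match.group(0)
    return found


def read_container_id(path="/proc/self/cgroup"):
    """Read the cgroup file and parse it; None when the file is missing or has no ID."""
    try:
        with open(path) as handle:
            return parse_container_id(handle.read())
    except OSError:
        return None
```

Unlike the sed expression, which keeps everything after the last slash, this keeps only the hex ID, so a line ending in docker-<id>.scope yields just the ID.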

The usefulness of this method is that, rather than having to assign some arbitrary identifier (ID_01) that you then have to map to a container ID, you have the most useful piece of identifying information served directly. Where this method fails is that if you are deploying a swarm and can’t guarantee consistent Docker versioning, you’ll have to get more creative (e.g., if the first parse returns no output, fall back to the parsing that matches another Docker version, and so on). I like exposing the container ID better than I like exposing the IP and port because, if direct access isn’t disabled, users can hit the route, for some reason decide “Oh, I like instance #3!” and directly connect to it, thereby defeating the purpose of redundancy. Allowing users to access the IP and port can be dangerous in a production environment.
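One way that "get more creative" fallback might look is a list of strategies tried in order until one returns something. Everything here is a sketch under assumptions: the function names are hypothetical, the /proc/self/mountinfo layout is what I would expect on hosts where the cgroup file no longer contains the ID, and the hostname only equals the short container ID when it has not been overridden (for example with --hostname).

```python
import re
import socket


def _read(path):
    """Return the file's contents, or an empty string if it cannot be read."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return ""


def from_cgroup(text):
    # Matches ".../docker/<id>" as well as ".../docker-<id>.scope".
    match = re.search(r"docker[/-]([0-9a-f]{64})", text)
    return match.group(1) if match else None


def from_mountinfo(text):
    # Assumed layout: ".../containers/<id>/hostname" among the mount entries.
    match = re.search(r"/containers/([0-9a-f]{64})/", text)
    return match.group(1) if match else None


def first_hit(strategies):
    """Return the first non-empty result from the given callables, else None."""
    for strategy in strategies:
        value = strategy()
        if value:
            return value
    return None


def container_id():
    return first_hit([
        lambda: from_cgroup(_read("/proc/self/cgroup")),
        lambda: from_mountinfo(_read("/proc/self/mountinfo")),
        socket.gethostname,  # last resort: short ID by default, unless overridden
    ])
```

Adding support for another Docker version then means adding one more entry to the list rather than another nested if.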

It seems possible that I’m overlooking an easier way to do this, but I wasn’t able to find anything else after a few cursory Google searches. I’d be interested to know if you are aware of a different approach.

Identifying Redundant Microservices