It started as a regular day in the office. At my job, regularity means that I sat down some time before 8:30. After that, there is rarely a predictable pattern of work, which is why being a DevOps Engineer is so exciting. On this day, I became aware that a manual build service I maintained had deployed two applications with the same automatically assigned ports, despite a safety feature that detected port collisions before the build's POST request was dispatched.
I had tested this fail-safe mechanism thoroughly and was sure that it worked. The only reason it was necessary at all was that, instead of comprehensively managing port assignments (reserving a port out of the pool before the build was initiated, and thus having to define under what circumstances it could be returned to the pool), I was told it was acceptable to check which ports were already assigned only when sending a user to the confirmation page, where the ports needed to be displayed. However, if two users reached the confirmation page at the same time for builds destined for the same server, there was a small chance that the ports could conflict. If that happened, the port check would prevent one of the builds from being dispatched.
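The race here is the classic check-then-act gap. A toy sketch (the port numbers and helper are hypothetical, not the real service's code) makes it concrete: two builds that each check for a free port before either one's assignment is recorded will pick the same port.

```python
def next_free_port(assigned_ports, start=8000):
    """Return the lowest port at or above `start` not yet assigned."""
    port = start
    while port in assigned_ports:
        port += 1
    return port

assigned = {8000, 8001}

# User A and user B both reach the confirmation page before either
# build is dispatched, so neither port is recorded as taken yet:
port_a = next_free_port(assigned)  # A sees 8002 as free
port_b = next_free_port(assigned)  # B sees 8002 as free too

print(port_a, port_b)  # 8002 8002 -- a collision
```

Reserving the port at check time (and defining when it returns to the pool) closes the gap; checking alone does not.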
I approached the problem traditionally: I checked the code, I carefully initiated a dummy build, I viewed the logs, and I repeated.
Eventually, I was able to recreate the issue. The validation feature worked when it was triggered. The problem was that the port didn't seem to be getting updated in the appropriate microservice (which meant the collision wasn't being detected), yet I was able to confirm that it was indeed being updated. The only conclusion I could reach was that the Python request's response was being cached locally. But I wasn't caching anything. Or was I?
As I looked through the codebase, I came across a configuration variables file that globally installed a requests cache, with a call along the lines of requests_cache.install_cache('cache_name', expire_after=300).
From the docs:
All responses will be cached transparently!
What the requests-cache library does is monkey patch a hook onto the requests.get method that caches all queries. Once 'installed' using a statement like the one above, the code will essentially map URLs to their respective response data (by default in sqlite, but it can also happen in memory). For example, that one line of code will cause the results of r = requests.get("http://myservice.com/api/v0/customers") to be cached for five minutes. All subsequent calls to that URL within that window will instead be routed to a local database containing the previously retrieved data. It's a useful abstraction on top of one of the most brilliant Python libraries. But it's also very dangerous.
Carelessly implemented, this cache monkey patched a hook onto every GET request in my program, not just those in the configuration file it lived in, a file other modules imported only for its variables and functions. The result was a snappy user experience (everything was being cached for five minutes), but also a dangerous and potentially stale one (everything was being cached for five minutes).
The danger was recounted above. A request that I didn't think was being cached (or, in another scenario, that I thought was safe to cache) had the potential to cause old data to be passed to a critical function of the build process. To plug this hole, I turned the cache off for this request using with requests_cache.disabled():. It seemed like an easy fix. However, in the days that followed, coworkers who had grown accustomed to builds taking a certain amount of time started to complain about performance. Is there a pragmatic tradeoff between the potential for staleness and an acceptable fetch time?
Before determining a local caching strategy, it's important to define how critical fresh data is. It's also important to have an idea of how long the round trip of the request in question takes. Obviously, you want to minimize network traffic even if the request itself is lightweight and quick, but optimizing a local cache prematurely will cause problems.
For example, I might have a service that queries a list of servers. Servers are seldom activated or taken down, and when they are, it only happens at a certain time each day. For some reason, the request to get all of the servers takes five seconds. Assuming we can't thread and can't modify the servers microservice, there are three main options. The first is to just make the request every time the list of servers needs to be populated. The data is always fresh, but the user experience will be poor. The second option assumes that the microservice has an update_time field. The first request will take a long time but, after that, the data will be stored in memory. On each subsequent call, only the diff will be merged into the trunk. This is a good hybrid approach. The query is safe and is bound to be much faster than five seconds. But the request is still dispatched, resulting in network overhead in the critical path.
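That second option can be sketched as follows. This assumes the microservice accepts a hypothetical updated_since filter on its update_time field; the endpoint and field names are illustrative, not a real API:

```python
import requests

class ServerCache:
    """Keep the server list in memory; after the first full fetch, ask
    the service only for records changed since the last sync and merge
    that diff into the local copy."""

    def __init__(self, base_url):
        self.base_url = base_url
        self.servers = {}       # server id -> latest record seen
        self.last_sync = None   # highest update_time seen so far

    def get_servers(self):
        if self.last_sync is None:
            # First call: the slow (~5 second) full fetch.
            resp = requests.get(f'{self.base_url}/servers')
        else:
            # Later calls: only the diff, via the assumed
            # updated_since filter on update_time.
            resp = requests.get(f'{self.base_url}/servers',
                                params={'updated_since': self.last_sync})
        records = resp.json()
        for rec in records:            # merge the diff into the trunk
            self.servers[rec['id']] = rec
        if records:
            self.last_sync = max(r['update_time'] for r in records)
        return list(self.servers.values())
```

The data returned is always current as of the call, but after the first request the payload shrinks to whatever actually changed.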
In this case, local caching using requests-cache wins out. The staleness of the data isn't a big deal. If a change to the servers list needs to propagate quickly, the sqlite cache can be flushed manually, or the service itself can be restarted.
Then there is my use case. It's a slow query and data freshness is paramount. What about using requests-cache with a short timeout interval, like 15 seconds? It sounds good in theory, but remember that the ports can only conflict when two builds are queued in rapid succession. Even reducing the caching interval to three or five seconds is the wrong way to approach the problem. In actuality, the best solution is to implement my own in-memory cache in the style of option two above. By only fetching (and merging in) updates, I can guarantee freshness and drastically reduce query time.
Requests-cache is a useful library, especially for users querying APIs that are black boxes, but it has limitations. It isn't always the right tool for the job, and it can hide critical application flaws under the guise of speed.