April 09, 2015

Rethinking the cache

Caching is a very difficult problem, and not just because of invalidation or some inherent algorithmic complexity, but because of the complex semantics of the process and its purpose of increasing performance.

In one of our products I had once written an application-specific mini-ORM for caching objects persisted in a database. It was a Python 3 application built on the Pythomnic3k framework with a companion PostgreSQL database.

The database structure was really simple: one table per class, named simply like "users" or "certificates", and a handful of stored procedures named like "users__load" or "certificates__save". The tables contained the actual data plus a few special fields, like "checksum", which made it possible to detect concurrent update conflicts optimistically.
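The checksum mechanics are easy to illustrate. Below is a minimal Python sketch of the idea only, not the actual stored procedure code; rows, load and save are made-up stand-ins for the table and the users__load/users__save procedures:

import hashlib

rows = { 123: { "user_id": 123, "passport_number": "123456" } }  # stand-in for the "users" table

def _checksum(row):
    return hashlib.md5(repr(sorted(row.items())).encode()).hexdigest()  # any stable digest will do

def load(user_id):                            # like users__load: returns data plus its checksum
    row = rows[user_id]
    return dict(row), _checksum(row)

def save(user_id, new_row, checksum):         # like users__save: optimistic conflict check
    if _checksum(rows[user_id]) != checksum:  # someone saved the row since we loaded it
        raise Exception("concurrent update conflict")
    rows[user_id] = new_row
    return _checksum(new_row)                 # the new checksum returned to the caller

row, checksum = load(123)
row["passport_number"] = "654321"
new_checksum = save(123, row, checksum)       # would fail had the row changed in the meantime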

So each time there was an ORM access in the code
user = pmnc.db_cache.User(user_id = 123)
user.passport_number = "654321"
user.save()
what happened behind the scenes was that the ORM implementation in db_cache.py executed
SELECT *, checksum FROM users__load(123)
then created an instance of class User based on the returned data, one property per database column, cached the instance and returned it to the caller. After the passport_number property was modified, the call to save() executed
SELECT new_checksum
FROM users__save(123, ..., '654321', ..., checksum)
to flush the changes to the database. Should another request for user 123 arrive in the meantime
user2 = pmnc.db_cache.User(user_id = 123)
it would return the cached instance without going to the database.
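The caching part of the ORM is equally straightforward. Roughly, and this is only a simplified sketch rather than the actual db_cache.py code (the names _cache, _load_user and get_user are made up), the lookup works like this:

_cache = {}                                   # (class name, object id) -> cached instance

def _load_user(user_id):                      # stand-in for the real database call,
    return { "user_id": user_id,              # i.e. SELECT *, checksum FROM users__load(...)
             "passport_number": "123456",
             "checksum": "abc" }

class User:
    def __init__(self, **columns):            # one property per database column
        self.__dict__.update(columns)

def get_user(user_id):
    instance = _cache.get(("User", user_id))
    if instance is None:                      # first access goes to the database
        instance = User(**_load_user(user_id))
        _cache[("User", user_id)] = instance
    return instance                           # repeated access returns the cached instance

user = get_user(123)
assert get_user(123) is user                  # no second round trip to the database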

Like any ORM, this one did not answer all the questions. It did not allow ad-hoc queries which did not map directly to the object paradigm, and it was impractical to create a separate method for every request. Therefore, ad-hoc queries started finding their way directly into the application code, like
pmnc.transaction.db.execute(
"
  SELECT u1.last_name, u2.last_name
  FROM users u1 INNER JOIN users u2
  ON u1.passport_number = u2.passport_number
  WHERE u1.suspended = {suspended}
",
suspended = True)
So now there were requests that bypassed the ORM cache and went straight to the database. This was still fine as long as they were read-only. But later there were more of them, and they were getting more analytical and heavy (think history of changes of all objects belonging to a particular user), so the question of caching came up again.

It was then that I implemented and released a universal read caching mechanism for Pythomnic3k (version 1.4), which made it possible to enable caching on any resource, a database being the most probable candidate of course. I wanted to put it into production and make all the requests, including the ORM's, go through the resource cache. The first thing that surfaced immediately was that the read caching implementation was pretty much useless as it was, because while caching reads it did not take into account the existence of concurrent writes.

Actually, I knew this before the release, but simply did not understand well enough how such concurrent cache activity should behave. I did, however, leave a couple of callable hooks that allowed the cache behavior to be customized at the application level. So I hooked into the cache and made reads and writes coexist. Because the cache couldn't tell whether a given SQL request had side effects, that fact had to be declared by the application with each call. In simple terms, a read like this
pmnc.transaction.cached_db.execute(
"
  SELECT *, checksum FROM users__load({user_id})
", 
user_id = 123, 
pool__cache_read_keys = { "users/123" })
would later have its cached result invalidated by a conflicting write
pmnc.transaction.cached_db.execute(
"
  SELECT new_checksum FROM users__save({user_id})
", 
user_id = 123, 
pool__cache_write_keys = { "users/123" })
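Behind those two declarations the hook's job conceptually boils down to remembering cached results under their declared read keys and dropping whatever is stored under a key that a write declares. A toy sketch of that idea (cache, read and write below are illustrative names, not the actual Pythomnic3k hook API):

cache = {}                        # read key -> cached query result

def read(result, read_keys):
    for key in read_keys:         # remember the result under each declared key
        cache[key] = result
    return result

def write(write_keys):
    for key in write_keys:        # a conflicting write drops whatever
        cache.pop(key, None)      # was cached under its keys

read({ "passport_number": "123456" }, { "users/123" })
assert "users/123" in cache
write({ "users/123" })
assert "users/123" not in cache   # the now stale result is gone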
This way it went to production and worked fine for about half a year. Until one fine after-deployment morning it didn't.

There was another thing not taken into account: a race condition between reads and writes. For example, if these two conflicting requests executed concurrently:
pmnc.transaction.cached_db.execute(
"
  SELECT ... FROM users WHERE user_id = 123
")
pmnc.transaction.cached_db.execute(
"
  UPDATE users SET ... WHERE user_id = 123
")
there was a chance that the read would start before the write but end after it. In this case the write wouldn't invalidate the result of the read, simply because there was no cached result yet at that moment, but the result would still arrive later and be cached, containing already stale data. The problem was resolved by patching in an industry-standard sleep() in the right place, and it indeed remedied the situation. But now I started to rethink the entire thing. Clearly, the caching semantics needed to be improved.

So I went and made a lot of changes to the cache code, using the new experience and focusing on the behaviour of concurrent reads and writes. In particular, the above race condition was fixed by registering the affected cache keys before any request, read or write, is sent to the database. This way, if a write arrives while a conflicting read is in progress, or the other way around, both are allowed to proceed, but the read result is not cached when the read returns. The result is still returned to the caller, and it may or may not be stale, but now that is the responsibility of the database, not the cache, so the semantics are not broken.
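A toy sketch of this behaviour, with made-up names and none of the real cache's bookkeeping: keys are registered in begin_read/begin_write before the SQL is sent, a write poisons any overlapping read that is still in flight, and a poisoned read returns its result without caching it.

import itertools

cache = {}                 # read key -> cached query result
in_flight_writes = {}      # write id -> set of keys being written
in_flight_reads = {}       # read id -> [set of keys, cacheable flag]
_ids = itertools.count(1)

def begin_write(write_keys):                     # registered before the SQL is sent
    wid = next(_ids)
    in_flight_writes[wid] = set(write_keys)
    for rid, (keys, _) in in_flight_reads.items():
        if keys & set(write_keys):               # poison overlapping in-flight reads
            in_flight_reads[rid][1] = False
    return wid

def end_write(wid):
    for key in in_flight_writes.pop(wid):
        cache.pop(key, None)                     # the usual invalidation on write

def begin_read(read_keys):                       # registered before the SQL is sent
    rid = next(_ids)
    conflict = any(set(read_keys) & keys for keys in in_flight_writes.values())
    in_flight_reads[rid] = [set(read_keys), not conflict]
    return rid

def end_read(rid, result):
    keys, cacheable = in_flight_reads.pop(rid)
    if cacheable:                                # cache only if no write overlapped
        for key in keys:
            cache[key] = result
    return result                                # the caller gets the result either way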

Now, as I was overhauling the cache anyway, I also wanted to examine the evidence.

First, I picked up whatever log files with enough debug information we had from production installations of our product and read through the recorded SQL queries. Predictably enough, they fell into two categories:
  1. Ad-hoc queries. Fewer but slower.
  2. ORM queries. A lot more numerous but also much faster.
Second, I checked the cache settings on the database connections, and they were not right. The cache sizes were too small, and the "weight" eviction policy exclusively favored the requests of the first type, which produced fewer hits. By weight I mean the time it originally took to produce the actual value from the database.

So I thought it would be nice to improve the cache by accounting not only for the weight of a cached result, but also for its "usefulness", which would be weight multiplied by hit count. And so I added such an eviction policy, only it was called "useless", as in "evict useless entries first".
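Ranking entries by that rule is a one-liner over (weight, hits) pairs; a sketch with illustrative numbers:

# each cached entry remembers how long it originally took to produce (its
# weight, in seconds) and how many times it has been served from the cache
entries = {
    "ad-hoc report":    { "weight": 4.0,  "hits": 1 },    # slow but rarely reused
    "users__load(123)": { "weight": 0.02, "hits": 500 },  # fast but hit constantly
}

def usefulness(entry):
    return entry["weight"] * entry["hits"]   # time actually saved so far

# "useless" policy: evict the entries that have saved the least time first
eviction_order = sorted(entries, key = lambda name: usefulness(entries[name]))
print(eviction_order)   # ['ad-hoc report', 'users__load(123)'] - the ORM entry stays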

Some time later I thought that since the ORM produces a lot of identical parametrized queries:
SELECT * FROM users__load(?)
it would be reasonable to assume that each entry cached from any such request has the same average usefulness. For example, if we have 1000 entries, cache hits to which saved 1 second each, and there are 10 new entries that have just been added and have not had a chance yet, the newcomers should not be evicted right away. It would be better to let them stay, hoping that each will turn out to be as helpful as the 1000 before them.

Therefore I added an optional "cache group" parameter for a query, the simplest kind being the literal SQL string itself. As entries produced by the same SQL string are added to the cache, they are assigned to the same cache group and have their usefulness accounted for collectively. Even though a new entry may not have had any hits yet, it is under the umbrella of a high-ranked group with high average savings.
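A sketch of this grouped accounting with made-up names: entries produced by the same SQL text share one group record, and a fresh entry is ranked by the group's average rather than by its own zero hits.

groups = {}    # SQL text (the cache group) -> accumulated stats

def add_entry(sql):
    group = groups.setdefault(sql, { "saved": 0.0, "entries": 0 })
    group["entries"] += 1                     # a new entry joins its group
    return group

def record_hit(group, seconds_saved):
    group["saved"] += seconds_saved           # time saved by one more cache hit

def average_usefulness(group):
    return group["saved"] / group["entries"]  # shared by every entry in the group

sql = "SELECT * FROM users__load(?)"
for user_id in range(1000):                   # 1000 entries, one 1-second hit each
    record_hit(add_entry(sql), 1.0)
newcomer = add_entry(sql)                     # just cached, no hits of its own yet
print(average_usefulness(newcomer))           # ~0.999, the group vouches for the newcomer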

Eviction now had to work differently. At first I thought I would simply evict the low-ranked groups first, leaving the high-ranked ones intact if possible. But experiments indicated that one winning group would simply take over the entire cache over time. So I had to implement eviction using a weighted average amongst groups, where a high-ranked group has a better multiplier than a low-ranked one. This means that a value from the former kind of group can still be evicted if it has low weight, and likewise a high-weight value from the latter kind can stay.
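What that weighting might look like, again only as an illustration with made-up numbers: each entry's eviction score is its own weight scaled by its group's rank, so a light entry from a strong group can still lose to a heavy entry from a weak one.

group_rank = { "users__load": 0.9, "ad-hoc report": 0.1 }   # normalized group ranks

entries = [
    { "group": "users__load",   "weight": 0.02 },   # light entry, strong group
    { "group": "users__load",   "weight": 0.001 },  # very light entry, strong group
    { "group": "ad-hoc report", "weight": 5.0 },    # heavy entry, weak group
]

def eviction_score(entry):                           # higher score means "keep longer"
    return entry["weight"] * group_rank[entry["group"]]

evict_first = sorted(entries, key = eviction_score)
print([ (e["group"], e["weight"]) for e in evict_first ])
# the very light entry goes first despite its strong group,
# while the heavy ad-hoc entry outranks it and stays longer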

Fiction mode off. I have just committed all this code to the Pythomnic3k SVN repository, and it will be in the next version, so anyone who is interested may check it out; the cache should now be usable. Since making it work right in an application may not be obvious, I will later include a sample specifically demonstrating cached access to a database.