September 11, 2012

On interrupting application code

So, I've released a new version of Pythomnic3k framework recently. As usual, the changes that get incorporated are essentially answers to questions from real life usage. There is a few in current release too.

Now I'm thinking what to do next. Among the things that still make me uncomfortable is the real-time guarantee, or, specifically in Pythomnic3k, a guarantee that every request will return by deadline. Just anything, a failure perhaps, as soon as it won't hang.

The problem is, sometimes a developer writes something that may not look like an infinite loop, but behaves like one. A situation when execution hits a
while True:
    pass
is a disaster in any architecture. It never returns, it has to be executed, there goes a CPU. Even the perfect scheduler cannot decide that this code effectively does nothing and just prevent it from being scheduled.

In Python this is worse, because it is effectively single-threaded for CPU-bound code. And even if that would not have been so, the next request hits the same spot and before you know it your load average is over 26. Therefore a single mistake like this could bring entire service down.

There has to be a way of interrupting the application code.

And in Python there is. A thread can inject an exception to another thread's execution path and as soon as the victim gets scheduled next, that exception is thrown. This is very easy:
PyThreadState_SetAsyncExc(other_thread, Error)
As soon as the application has a watchdog (and in Pythomnic3k there is), it could interrupt worker threads using such injection. I tried it and it worked.

Now there appear other problems.

First is the problem I could easily turn blind eye to. A code which executes an OS call cannot be interrupted this way. Easy to see, because while it does it is outside the Python scheduler's reach. Therefore
time.sleep(86400)
will still return tomorrow.

Second is the bigger problem of unpredictability. You don't know when or where the deadline hence the exception hits you. This effectively means that no code can be considered exception-safe now.

As a framework author, I could protect sensitive fragments of code by essentially "disabling interrupts" for the moments when interruption would not be convenient. So I write something like this:
current_thread.can_interrupt = True
application_code()
current_thread.can_interrupt = False
That protects the framework, but not the application code. The same developer who wrote the infinite loop in the first place could very reasonably write something like this:
lock.acquire()
try:
    ...
finally:
    lock.release()
Now consider what happens if the exception is injected after acquire but before try. Although an opening try statement actually does something non-trivial, it is never considered to be a possible source of exceptions. Paranoid as I am, I read try as noop. If try started failing, all bets are off.

Similarly, it is common to put clean up code in finally and make that code exception-safe. For example:
d["foo"] = "I exist therefore I am"
try:
    print(d["foo"])
finally:
    del d["foo"] # this could not throw
    something very important
Now any code, however safe it looks could throw. One could write very defensive code wrapping every line in a try-finally block, but then again, it is still possible that in
finally:
    try:
        del d["foo"]
    finally:
        something very important
an exception is thrown just after the second finally statement and something very important is still not executed.

As it turns out to be, we have a "programming in presence of asynchronous signals" situation here. As soon as some external mechanism could interrupt execution, you have to always account for that. This was typical when programming interrupt handlers in assembler, where much of your code were cli and sti instructions. All hell rained down if you ever forgot one.

Granted, such programming is possible and could even be considered stylish and felt elitist. But it is entirely different style from application programming in high level dynamic language. Even if facilities to disable interrupts are provided by the framework, it would require much experience and care on developer's behalf, much more that could be expected. And it will be a lot of trouble to use correctly.

Therefore I don't think I will instrument Pythomnic3k with such deadline enforcing mechanism. A developer who wishes to make his code deadline-friendly could always do it in the same way it is done now, by explicitly checking for request expiration at well defined points. Something like
while not pmnc.request.expired:
    read_data(timeout = pmnc.request.remain)
And if someone makes a mistake and the service hangs... Well, you have to be ready for that too.

No comments: