May 24, 2006

So, how does it feel writing infallible code?

I think many people feel uncomfortable about the code they write or maintain and the systems they support or administer. The stuff is just bad, let's face it. But people (including myself) tend to find a comfortable spot in which they are most emotionally stable and least disturbed by the reality they have to live in, and that comfort is often based on ignoring the truth. Let's see how this works in one particular case.

I had this idea while thinking about moving one very small but very common piece of functionality into a network service. Each of our systems has to determine whether a given day is a workday or a holiday. As there is no stable rule for this, each system consults its own volatile piece of data in whatever format it happens to keep, and every administrator of every system must remember to keep that data in good shape. The same thing repeats in tens of variations, burdens the developers, and is a constant administrative pain. Sounds like a good idea to factor it out into a network service?

Paradoxically, it does not, precisely from the point of view of those troubled systems' developers and administrators. But why not? It seems odd. Before the refactoring they have

def workday(date):
    return consult_own_database(date) == "W"
...
if workday(today):
    do_stuff()
...

whereas after refactoring they have

def workday(date):
    return consult_magic_service(date) == "W"
...
if workday(today):
    do_stuff()
...

(mind you, this is pretty real-life-looking code). It looks like nothing changes at all, except that now they have one less problem. To simplify, let's assume such a change would require no work from them, would be done in a snap, and would cause no trouble in itself. Still they refuse. What is the likely reason they give for cancelling the change?

- It's less reliable. A call to a local database (or filesystem) is less likely to fail than some network service.

Reliability, that I understand. And it's reasonable to assume that the probability of a local database failure is lower than that of a remote network service. But I argue that it's not the differing degree of reliability that matters here. After all, the two chunks of code above are identical; why would one be less reliable than the other?

The first (existing) piece of code does not indicate a less probable failure. In fact, it indicates a zero probability of failure. The call may fail, but the code does nothing about it. The developer consciously or unconsciously knows there is a possibility of failure, but as long as it is perceived to be below a certain threshold, she is comfortable ignoring it. Now the subtlest change brings an increased failure probability that is too uncomfortable to ignore.

And what are the options? Rewrite the code so that it accepts failures? Impossible, too difficult, can't be done, no way, period. Thus we don't need the change; it will break a system that works fine. Ironic, isn't it?
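For the record, the failure-accepting rewrite need not be heroic. Here is a minimal Python sketch of one possible policy; consult_magic_service, consult_own_database and ServiceError are hypothetical placeholders, and falling back to the stale local data is just one choice among several:

def workday(date):
    # Prefer the shared network service, but admit it can fail.
    try:
        return consult_magic_service(date) == "W"
    except ServiceError:
        # Fall back to the old local data: possibly stale,
        # but it keeps the system running through an outage.
        return consult_own_database(date) == "W"

Whether to fall back to stale data, assume a safe default, or simply skip do_stuff() for the day is precisely the design decision the comfortable version never had to face.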

And so the thing remains unchanged. Whether this switch to a network service happens or not, the system has become too fragile: touching it is dangerously uncomfortable. The point of this argument is that a developer's comfort based on false assumptions, such as the improbability of failures, forbids any major change in the software.
