November 19, 2007

On software reliability

Despite the common prejudice, and despite the name, a reliable system is not one that has no errors or never crashes (no one should ever promise that). The way I see it, reliability is more about predictability. A system is reliable if it behaves in a predictable fashion - so much so that you can rely on it. Even if all it does reliably is crash.

The systems we build don't exist in isolation. Again, despite a popular myth, programmers don't pull things out of thin air. We base our work on the work of others - hardware, operating systems, servers, frameworks, libraries, compilers, you name it. Reuse is the boon and the bane of the industry. Client-server pairs, interfaces and contracts are not just about OOP, they are everywhere. Any interaction between software components is about grabbing something somebody else has.

I therefore take it for granted that there are parts of the system not written by us or not belonging to us. This implies that they cannot be controlled in any way (actually, I mostly work on integration middleware, which makes my experience worse). This inability to control leads in turn to an inability to fix. More often than not, you cannot fix problems in systems you have to depend on.

What do you do when you encounter unexpected behaviour from somebody else's system you have to depend on? I do this: first - fetch a cookie for being lucky, second - understand the cause, third - find a workaround.

The point here is - a known problem is not a problem. If you know the exact circumstances under which it hits, you can work around it. Figuratively, you have to walk the minefield, but every previously found mine can be sidestepped. Therefore,

For a component to be reliable, you must be able to work around any problem encountered, and for a system to be reliable, you need to know all the problems it can cause.

To conclude, here are a few examples of something which is broken and reliable at the same time, because there is a workaround:
  • Reliable is a broken CPU which always fails on a certain floating-point division. The numeric libraries fall back to software emulation when encountering this particular kind of chip.
  • Reliable is the XML parsing library which crashes when you attempt to set an attribute value to something with the letter "Ё" in it. I replace "Ё" with "Е", which is a good enough if slightly ambiguous substitute.
  • Reliable is the compiler which chokes and dies upon too complex a template. I rearrange the angle brackets and it goes on fine.
  • Reliable is the provider's server which crashes every other Friday at 13:00. I schedule an offline gap at that time and no one complains.
  • Moreover, reliable is the provider's server which crashes sporadically, but will gracefully handle repeated access attempts. As you might guess, this last example is the dearest to my heart and is one of the foundations of Pythomnic - the platform for developing network services that I am building and using.
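
The last example can be illustrated with a small Python sketch. This is only a minimal illustration of the retry workaround under stated assumptions, not how Pythomnic itself does it: the names ProviderUnavailable, call_with_retries and make_flaky_provider are hypothetical, and the "provider" is a stand-in function that fails a fixed number of times before answering.

```python
import time


class ProviderUnavailable(Exception):
    """Hypothetical error raised when the provider's server is down."""


def call_with_retries(request, attempts=3, delay=0.0, sleep=time.sleep):
    """Call request() until it succeeds or the attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return request()
        except ProviderUnavailable:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the failure
            sleep(delay)  # pause before the next attempt


def make_flaky_provider(failures):
    """Stand-in for a provider that crashes `failures` times, then answers."""
    state = {"left": failures}

    def request():
        if state["left"] > 0:
            state["left"] -= 1
            raise ProviderUnavailable("sporadic crash")
        return "ok"

    return request


if __name__ == "__main__":
    # Two crashes followed by a success fit within three attempts.
    print(call_with_retries(make_flaky_provider(2), attempts=3))
```

The workaround only holds if the failure is known well enough to be caught by name and if repeating the request is harmless, which is exactly the "gracefully handles repeated access attempts" condition above.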


Pickerel Yee said...

what happen to when i visit it, i got the page:

"This page is parked free, courtesy of"

Dmitry Dvoinikov said...

That's too bad, I have notified the administrator. Thank you for the quick notice.