May 03, 2006

Expecting software errors, not fixing one and pretending none are left

Is there a problem with your software? Something strange is going on: an error appeared that has never appeared before, or should never have appeared in the first place; the system produced insane output or just died.

It's unwise to think of anything you are experiencing as "unbelievable". You see it, you have hard evidence - it exists. Why it happened - you have no idea, but it exists. Finding the origin of such a problem requires a totally different approach to debugging: since you have no idea why it happened, you likely could never have predicted it, tested for it, or even thought about it.

Stop and think about it. You could never have imagined your software would fail in that particular way. Taken to the extreme, this would mean that you never thought your software could fail at all.

But this is exactly how novice programmers think. As they become more experienced, the amazing world of software failures appears before them, and they start applying the typical code/debug cycle and error-handling or try/catch patchwork, which still limits them to the errors they are experiencing at that single moment.

It seems logical that if the programmer's experience is taken to the other extreme, she would expect failures everywhere, simply out of vigilance. It does not matter that a failure of some kind has never happened before - it's there, waiting to happen.

The point of this argument is that the key to writing stable software is being proactive about failure and expecting everything to fail. Just point a finger at any piece of your system and ask yourself: "what happens if this fails (for whatever reason)?" The consequences can be far-reaching.
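
To make this concrete, here is a minimal sketch in Python of asking that question about a single external lookup (fetch_exchange_rate and the fallback value are hypothetical names, not taken from any real system):

    import logging

    def fetch_exchange_rate(currency):
        # Hypothetical call to an external service; anything can go wrong
        # here: a timeout, a refused connection, garbage in the response.
        raise OSError("service unreachable")

    def exchange_rate_or_fallback(currency, fallback=1.0):
        # Decide in advance what the rest of the system does when this
        # fails, instead of assuming the lookup always succeeds.
        try:
            rate = fetch_exchange_rate(currency)
        except Exception as exc:
            logging.warning("rate lookup for %s failed: %s", currency, exc)
            return fallback
        if rate <= 0:
            # "Impossible" output is treated as just another failure.
            logging.warning("rate lookup for %s returned %r", currency, rate)
            return fallback
        return rate

The details don't matter; what matters is that the failure path was written before the failure was ever observed.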

I therefore argue that ordinary debugging is a far less important process than it is usually thought to be. Of course, finding an error and fixing it is important and must be done, but if you had expected it, your whole system is already prepared for any error anywhere, and this one error merely puts that preparation to the test.

Moreover, I also argue that trying to break your own software is at least as valuable a practice as the aforementioned debugging.

Yes, it's a well-known maxim that testing is about breaking things, not about watching them work. But there is a catch. A tester can only test against problems he can think of, much like the developer. Moreover, once developers and testers work side by side (unless they are the same person), they tend to share the same narrow field of view, which greatly reduces testing efficiency.

That's why I like stress testing. Not because it shows me impressive transactions per second, but because it tends to reveal another layer of problems that are usually never seen otherwise. Stress testing is always useful, but it truly shines when one failure leads to a cascade of other failures. Whoa! I couldn't have thought of THAT!
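
A stress load doesn't have to be sophisticated to be useful, either. Here is a rough Python sketch of the kind of hammer I mean (handle_request stands in for whatever entry point your system has; the numbers are arbitrary):

    import collections
    import concurrent.futures

    def handle_request(i):
        ...  # hypothetical entry point of the system under test

    def hammer(n_requests=10000, n_workers=50):
        # Fire many overlapping requests. We are not measuring throughput;
        # we are watching which kinds of errors start to appear under load.
        outcomes = collections.Counter()
        with concurrent.futures.ThreadPoolExecutor(max_workers=n_workers) as pool:
            futures = [pool.submit(handle_request, i) for i in range(n_requests)]
            for f in concurrent.futures.as_completed(futures):
                try:
                    f.result()
                    outcomes["ok"] += 1
                except Exception as exc:
                    outcomes[type(exc).__name__] += 1
        return outcomes

The interesting output is not the "ok" count but the list of exception types you have never seen before.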

Now, how else can you intentionally break your system, other than by giving it a stress load?

Failure injection comes to mind, but it goes against the normal development cycle (there is simply no place for it) and requires the failure-ready system structure mentioned above from day one. I also tend to remove the injected failures from a production system; after all, being a failure-vigilant software developer won't save you from black cats and broken mirrors.
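
For the record, a failure injector can be as small as this Python sketch (the INJECT_FAULTS switch and write_record are hypothetical names; the point is that the injected faults can be switched off, or stripped out entirely, before the code reaches production):

    import functools
    import os
    import random

    # Hypothetical switch: off by default, enabled only in test environments.
    FAULTS_ENABLED = os.environ.get("INJECT_FAULTS") == "1"

    def maybe_fail(probability=0.01, exc=IOError):
        # When enabled, make a fraction of calls to the decorated function
        # fail, so that callers' failure handling is exercised long before
        # a real outage.
        def decorator(fn):
            @functools.wraps(fn)
            def wrapper(*args, **kwargs):
                if FAULTS_ENABLED and random.random() < probability:
                    raise exc("injected failure in %s" % fn.__name__)
                return fn(*args, **kwargs)
            return wrapper
        return decorator

    @maybe_fail(probability=0.05)
    def write_record(record):
        ...  # hypothetical disk or database write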

Unplugging cables from a working system is also fun and can reveal yet more unexpected problems, although, again, it's probably not applicable to a production system. For another joke of this kind, see "The China Syndrome Test" in "Blueprints for High Availability" by Evan Marcus & Hal Stern.

In conclusion, this discussion turns me once again to the PITOMNIK principle: as the system grows, its small errors grow with it and eventually lead to an inevitable failure. You cannot ignore the presence of these errors, nor can you fix any significant fraction of them, so you'd better be ready.
