Management Perception of System Administration

Jim Hickstein jxh at jxh.com
Fri Feb 22 13:13:19 PST 2002


One thing struck me about the technology risks: Someone says "we want 5 
nines", but then they set up a business that cannot tolerate any service 
interruption above that point.  0.99999 is a _probability_, not a 
certainty, and an average one at that.  Some days will be below average.
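
To put a number on that: a rough sketch of the arithmetic, assuming a 365-day year and treating availability as simple uptime divided by total time. The function names are mine, purely for illustration.

```python
# Downtime budget implied by an availability target.
# Assumptions (not from any standard): 365-day year, and availability
# measured as uptime / total time averaged over the period.

SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def downtime_budget_seconds(availability, period_seconds=SECONDS_PER_YEAR):
    """Average downtime allowed per period at the given availability."""
    return (1.0 - availability) * period_seconds


def years_of_budget_consumed(outage_seconds, availability):
    """How many years' worth of budget a single outage uses up."""
    return outage_seconds / downtime_budget_seconds(availability)


if __name__ == "__main__":
    five_nines = 0.99999
    budget = downtime_budget_seconds(five_nines)
    print(f"five nines allows about {budget / 60:.2f} minutes down per year")
    used = years_of_budget_consumed(30 * 60, five_nines)
    print(f"one 30-minute outage uses about {used:.1f} years of that budget")
```

Five nines works out to a little over five minutes a year on average, so one half-hour outage spends several years of budget in a single afternoon. The long-run average can still come out right; the business has to survive the afternoon.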

I was thinking of systems where this would seem to matter more, and seem to 
achieve better certainty: airline reservation systems, and better yet, 
air-traffic control systems.  Yet, in the latter case anyway, they _do_ 
have major system failures, and they _do_ have major service interruptions. 
But they also have manual procedures.  When the radar goes black, you talk 
on the radio; when the radio falls silent, you look at your pieces of paper 
and start talking on the telephone (to other radio operators).  The 
airplanes have procedures for clearing the airspace around such an 
emergency, and they don't all fall out of the sky.  This happens 
_routinely_.  (P.S. Don't tell the passengers.)

I didn't see a failsafe system when Paul was describing the totes going 
round the distribution center.  Partly this may be because they didn't set 
out to design one.  But that, IMO, is a business failure, not a technological 
one.  Some service interruptions, at some level, are _inevitable_, period. 
(And the harder you try -- and succeed -- to reduce the small disasters, 
the larger the average disaster becomes.)  If you set up a business that 
won't survive one, and don't have the humility to admit that Plan B should 
exist, that's not the technology's fault.

You can buy a certain number of nines these days.  But so can your 
competitors.  The next couple of nines are harder, and they consist of 
putting systems in place to help people avoid making mistakes.  I've 
achieved some modest success at this in my operations career.  It's the 
most interesting part, to me.


