YuenYan Daily: 原人日記: Software disasters: are inevitable, and will we see more in the future? Part I

I know that I have abandoned the blog for a long time, and yes originally it was supposed to be a blog about code. But I sort of moved to my Windows Live Space blog and vacated this one.

I’m not sure if there is anyone who reads the blog now, but if you do, I still think it is kind of cool to share my thoughts with you here.

The 3-parts essay is originally a homework assignment submitted for CSE P 503, taught by Dr. David Notkin. To be honest, I don’t need the grade anymore, but Dr. Notkin is someone that I admire and I just want to be a student of his class, so that I can learn from him.

The essay is about analysis of modern software disasters. I expanded it a little to incorporate my thoughts about the complexity of the disasters has evolved in the past few years.

Enjoy.

We have also arranged things so that almost no one understands science and technology. This is a prescription for disaster. We might get away with it for a while, but sooner or later this combustible mixture of ignorance and power is going to blow up in our faces. --- Carl Sagan

The first question is so incredibly naïve that I almost feel ashamed of answering it. On average, a program with > 512KLOC has 4-100 errors per KLOC [1], and no matter how aggressive the test and how robust the application behave, we can never be sure if the program will always run correctly. Incompatible platform (Windows drivers in Vista OS), regression caused by upgrade (Ariane 5 crash and AT&T outage in 1990)… the possibilities for failures are innumerable…

The second question, however, worth more of our consideration. As the software engineering discipline slowly maturely, we have devised new methodologies and tools such as Test-Driven development, coverage code tools, and model-checking etc that reduces the code defects in the system and also the possibility of disasters. We shouldn’t deny that these tools and conventions improves the quality of the code written by us, but does it guarantee that we will see less software disasters in the future. Any seasoned developer would give a moderate and sensible “probably no”. However, unseasoned and naïve as I am, I give this some thoughts, and I came to a horrid conclusion..

We are bound to see a lot more software disaster. A lot more…

What is a software disaster?

Disaster (n.): a sudden calamitous event bringing great damage, loss, or destruction; broadly : a sudden or great misfortune or failure – Webster Dictionary

Generally speaking, we assume that software disasters are calamitous events that are caused by code defects that result in the loss of financial resources or human lives. Therefore, in the postmortem of these calamities, we focus our efforts in discovering the root cause (the bug) in the code, and seek to provide solutions that prevent and detect these types of defects. The crash of Ariane 5 (caused by number overflow error) and the Patriot missile failure to intercept Scud missile during the first Iraq war [2] (bad floating point arithmetic) are two prominent examples of these types of errors. There are many more of these software disasters, and I’m sure that my fellow classmates, who are smarter and also more experienced, would have dissected the cases and offered their own insights.

Indeed, we now have better programming model and hardware and also methodology to deal with these problems. But “disaster” carries a unique meaning for different people – small and trivial incidents that aren’t noticeable by the majority of us maybe disasters for some. For instance, the AmazonFail event, in which some items in the Amazon database were incorrectly flagged as “adults”, caused many gay-themed books had disappeared from Amazon's sales rankings and search algorithms. As a result, many publishers and authors lost sales, and Amazon was confronted with a PR disaster as initially the action was perceived as anti-gay by the public. Early in the year, Facebook also introduced a privacy flaw during their upgrade, which allowed users to gain access to portions of their friends’ profile that they should not have been able to see. Both incidents occurred no more than three months apart and although most of us may consider them minor, the implications may be disastrous for some.

It is my intention that, as most of us are concentrating our analysis on the causes of the past software disasters (rightfully so and excellently done), I want to take a step back, compare a few “disasters” in the distant past and the recent past, and give my own thoughts about how software disasters have evolved. So first, let us look at a well-studied case, about a software disaster in the distant past.

During the Good Old Days of AT&T…

n 1990, I was only 8-years-old, and for the majority of us still primarily communicated using phones. But on January 15th 1990, for 9 hours, an estimated 60 thousand people lost their AT&T long distance service. Since it also affected the 800-numbers calls, hotels and car rentals lost sales, and it was estimated that it cost AT&T $60 to $70 million in lost revenues [3].

The cause was of course, inadvertently, due to a bug in the AT&T phone switch software that was introduced during an upgrade.

Back in the days, it took around 20 seconds for a long distance call to get connected. In order to improve the latency, AT&T deployed new software in #4 ESS switching system, which intends to reduces the overhead required in signaling between the switches by eliminating a signal that explicitly indicates if a node is ready to receives traffic, and reduces to the time between dialing and ringing to 4 seconds. The software was deployed to all 114 computers in the AT&T network.

Then, on January 15th 1990, one of the switches in New York started signaling “out-of-service”, and rebooted itself and resumed service. A second switch would receive the signal from the first switch and start processing. However, when another signal from the first switch arrived when the second switch was busy, the signal was not processed properly, and the second switch also signaled “out of service”. One by one, the “out-of-service” signal propagated to the neighboring switches, and eventually to all 114 switches, because they all deployed the same software.

The intermittent “out of service” signals blocks some of the long distance calls. Eventually, the engineers discovered then a switch would signal “out-of-service” when another signal arrives too soon. By reducing the workload of the network which increases the latency between signals, the erratic behavior eventually disappeared from the network. But during the blackout period, it was an estimated that more than 5 million calls were blocked [4].

But the question remains: what causes the switch to erratically signal “out-of-service”? To answer the question, let’s look at a snippet of C code in the #4 ESS switching system that is used for processing incoming signal:

/* ``C'' Fragment to Illustrate AT&T Defect */
do
{
  switch (expression)
  {
      // some code...
      case value:
          if (logical)
          {
              //sequence of statements
              break;
          } else
          {
              //another sequence of statements
          }

      //statements after if...else statement
  }
  //statements after case statement
} while (expression);

The root cause of the disaster is caused by nothing more than the break statement being placed at a wrong location. When the break statement is reached in the code, it would prematurely cause the program to break from the enclosing switch statement. However, this sudden termination was not anticipated – the programmer did not realize that the break statement would terminate the enclosing switch statement. The result of the termination would crash the switching node, causing it to send an “out-of-service” signal to its neighboring nodes (5). The “out-of-service” propagated between nodes and eventually crashed the system.

The cause of this disaster is absurdly simple: the programmer simply misunderstood or overlooked the effect of break statement in the code. In this particular case, some case studies also suggest that a comprehensive code coverage test would also reveal that a portion of the code is not executed due to the misplaced break statement, (and apparently ignoring the fact that the code coverage as a testing technique only began to get popular in the 1988s [5]). Therefore, like most typical software disasters that we studied, we can blame:

Programmer,

Inadequate testing, since they should’ve caught it,

The management, well only if they weren’t so aggressive and provide enough time for the programming and the IT department to test and deploy the software in a timely manner.

For most modern software disasters, any analysis or case studies will inadvertently return to these causes and blame the same people. However, even if we have smart programmers who write code carefully, comprehensive test suites and smoke test for the application, and management that set the robustness and the security of the software as the most important criteria for shipping, does it mean that we could largely avoid software disasters? No, we won’t, and that’s not only because we aren’t perfect…

To be continued…

YuenYan Daily: 原人日記

Friday, April 17, 2009

Software disasters: are inevitable, and will we see more in the future? Part I

We have also arranged things so that almost no one understands science and technology. This is a prescription for disaster. We might get away with it for a while, but sooner or later this combustible mixture of ignorance and power is going to blow up in our faces. --- Carl Sagan

What is a software disaster?

During the Good Old Days of AT&T…

1 comment:

About Me

Blog Archive

Twitter Updates

Twitter Updates

My Footprints

Friends" sites

Highly Recommended