YuenYan Daily: 原人日記

Wednesday, April 22, 2009

SortedList, SortedDictionary, vs LinQ

I had a situation at work today where I need to sort a Dictionary with string as key.

The keys in the Dictionary are in this order: “Title 1”, “Title 2”, “Title 3” etc.

I cannot use the normal framework SortedDictionary because if you sort by key as string, the items in the SortedDictionary will be sorted according to the unicode value…

e.g. you will get something like “Title 1”, “Title 10”, “Title 2” etc… totally undesirable…

Since I will need to write my own IComparer to do the custom comparison, I wonder if using Linq will provide a better solution.

So I spent probably an hour as work trying to create the query but I realized again that I am not that familiar with Linq (mostly about how to use lambda expression properly with Orderby. s I dropped the work, but then when I came back to work on the code later at home, it all became clear:

Here is the code:

static SortedList<string, double> FrameworkListSort(
    Dictionary<string, double> columnWidthDictionary)
{
    return new SortedList<string, double>(columnWidthDictionary, new TitleComparer());
}

static SortedDictionary<string, double> FrameworkDictionarySort(
    Dictionary<string, double> columnWidthDictionary)
{
    return new SortedDictionary<string, double>(columnWidthDictionary, new TitleComparer());
}

static IOrderedEnumerable<KeyValuePair<string, double>> LinqListSort(
    Dictionary<string, double> columnWidthDictionary)
{
    return columnWidthDictionary.OrderBy(
        column => /*column.Key);*/
        {
            int rank;
            Int32.TryParse(column.Key.Substring(sTitle.Length).Trim(), out rank);
            return rank;
        });
}

static IOrderedEnumerable<KeyValuePair<string, double>> LinqListSort(
    IEnumerable<KeyValuePair<string, double>> columnWidthDictionary,
    string criteria)
{
    return columnWidthDictionary.Where(targetColumn => targetColumn.Key != criteria).OrderBy(
        column => 
        {
            int rank;
            Int32.TryParse(column.Key.Substring(sTitle.Length).Trim(), out rank);
            return rank;
        });
}

Although the Linq code is not as simple as a constructor, the I know that using Linq is superior performance-wise compared to SortedDictionary.

However, there’s only one way to test it out, so I created a 100,000 items Dictionary with “Title x” as string key, and then compared the time it takes to sort created a sorted list using the three algorithms. There is the result:

While it isn’t too surprising that using Linq yield a “faster” solution, I was surprised that element removal using Linq (since it requires looping via the whole IEnumberable structure) is not worse than SortedDictionary. But SortedList is still faster – probably because it only needs to remove the one element in a linked-list, instead of reorganizing the whole b-tree in SortedDictionary case.

So for sorting: IOrderedEnumerable >> SortedList > SortedDictionary

For element removal: SortedList > IOrderedEnumerable > SortedDictionary.

Here is the test code, so you may try it at home :)

static void Main(string[] args)
{
    const int numItems = 100000;
    Random rand = new Random();
    Dictionary<string, double> columnWidthDictionary = new Dictionary<string, double>();

    for (int index = 1; index <= numItems; index++)
    {
        columnWidthDictionary.Add(String.Format("{0} {1}", sTitle, index), rand.NextDouble());
    }

    Stopwatch watch = new Stopwatch();

    watch.Reset();
    watch.Start();
    SortedList<string, double> sortedColumnWidthDictionary = 
        FrameworkListSort(columnWidthDictionary);
    watch.Stop();
    Console.WriteLine(String.Format("Time elasped (FrameworkListSort) : {0} ms", 
                                    watch.ElapsedMilliseconds));

    watch.Reset();
    watch.Start();
    SortedDictionary<string, double> sortedColumnWidthDictionary2 = 
            FrameworkDictionarySort(columnWidthDictionary);
    watch.Stop();
    Console.WriteLine(String.Format("Time elasped (FrameworkDictionarySort) : {0} ms", 
                                    watch.ElapsedMilliseconds));

    watch.Reset();
    watch.Start();
    IOrderedEnumerable<KeyValuePair<string, double>> sortedEumerableColumns = 
        LinqListSort(columnWidthDictionary);
    watch.Stop();
    Console.WriteLine(String.Format("Time elasped (LinqListSort) : {0} ms", 
                                    watch.ElapsedMilliseconds));

    watch.Reset();
    watch.Start();
    //sortedColumnWidthDictionary.Remove(sortedColumnWidthDictionary.Keys.Last());
    sortedColumnWidthDictionary.RemoveAt(sortedColumnWidthDictionary.Count-1);
    watch.Stop();
    Console.WriteLine(String.Format("Time elasped (SortedList + Remove last item) : {0} ms", 
                                    watch.ElapsedMilliseconds));

    watch.Reset();
    watch.Start();
    //sortedColumnWidthDictionary2.Remove(sortedColumnWidthDictionary.Keys.Last());
    sortedColumnWidthDictionary2.Remove("Title 1");
    watch.Stop();

    Console.WriteLine(String.Format("Time elasped (SortedDictionary - Remove last item) : {0} ms", 
                                    watch.ElapsedMilliseconds));
    watch.Reset();
    watch.Start();
    sortedEumerableColumns = LinqListSort(sortedEumerableColumns, "Title 1");
    watch.Stop();
    Console.WriteLine(String.Format("Time elasped (LinqListSort with criteria for removing last item) : {0} ms", 
                                    watch.ElapsedMilliseconds));
    Console.ReadLine();
}

Friday, April 17, 2009

Some Thoughts About Measuring Software Complexity

I have learned from J.D. that I should try to drop down my thought as much as I can or else I will forget about it… so here it goes.

At the end of the CSE P 503 today, we went into the a heated discussion about measuring complexity in software design. I was pretty tired so I mostly listened. And Dr. Notkin pointed out, the complexity is deduced by the ability of human to understand the model, and therefore, complexity in software is not measurable because there is no way to quantify how well human understand the code based of artificial metrics.

I agree: quantifying a complexity in already simplifying it and the the interactions in the entities of the model is inherently complex, simplifying it would mean over-emphasizing some elements while bypassing some other vital relationship, and the resulting metrics is probably not generalizable.

I disagree:since complexity is an approximation of how difficult is it for human to understand the model, I think that we can deduce the complexity based on some essential cognitive characteristics.

That’s similar to Bounded Model Checking – we don’t need to measure the whole complexity, but just one or two important factors that affect the complexity.

My suggestions for two metrics:

1. No. of statements in a method calls

Human likes sequences. That’s why we creates step-by-step procedure, and prefer imperative programming over functional programming. However, humans also have this limitation of simultaneously remembering only a small (7 in general) things in the brain. That’s why we prefer short list over long list.

One measurement of complexity is to check the overall LOC in a method: in general, we prefer 30 methods that contain 30 statements over 3 methods with 300 statements. The scope of each method should be small because it is easier to understand. And having 30 binary relations (method Foo() = doing-this-and-that) is actually a good thing. We love binary relations. I will explain why next.

2. No. of interactions in a method invocation

Binary relations are the backbone of our interpretation of the physical world. It is intrinsic in our language! See! We go shopping; you love this movie; I sell fish! Binary relations are the easier relation to understand. However, complexity arise when we try to introduce ternary, or n-aray relations: I go to a store this morning! Or, in a more familiar software dialect: a = b.Foo(c, d.E, f.bar(g) + i). It is just a lot harder to figure our the relationship between the entities in the relationship now.

Therefore, I suggest we measure the amounts of n-arary relations in the code. More binary/ternary relations there are, lower is the complexity is.

Of course, when you try to quantify complexity, esp. averaging it or putting it into a 1-10 scale, the number basically lost its meaning because complexity is judged between a super simple model (1) to a ultra-complex model (10), which don’t actually exist. But estimating the complexity of a system based on these two criteria, I argue, is a workable solution.

Software disasters: are inevitable, and will we see more in the future? Part I

I know that I have abandoned the blog for a long time, and yes originally it was supposed to be a blog about code. But I sort of moved to my Windows Live Space blog and vacated this one.

I’m not sure if there is anyone who reads the blog now, but if you do, I still think it is kind of cool to share my thoughts with you here.

The 3-parts essay is originally a homework assignment submitted for CSE P 503, taught by Dr. David Notkin. To be honest, I don’t need the grade anymore, but Dr. Notkin is someone that I admire and I just want to be a student of his class, so that I can learn from him.

The essay is about analysis of modern software disasters. I expanded it a little to incorporate my thoughts about the complexity of the disasters has evolved in the past few years.

Enjoy.

We have also arranged things so that almost no one understands science and technology. This is a prescription for disaster. We might get away with it for a while, but sooner or later this combustible mixture of ignorance and power is going to blow up in our faces. --- Carl Sagan

The first question is so incredibly naïve that I almost feel ashamed of answering it. On average, a program with > 512KLOC has 4-100 errors per KLOC [1], and no matter how aggressive the test and how robust the application behave, we can never be sure if the program will always run correctly. Incompatible platform (Windows drivers in Vista OS), regression caused by upgrade (Ariane 5 crash and AT&T outage in 1990)… the possibilities for failures are innumerable…

The second question, however, worth more of our consideration. As the software engineering discipline slowly maturely, we have devised new methodologies and tools such as Test-Driven development, coverage code tools, and model-checking etc that reduces the code defects in the system and also the possibility of disasters. We shouldn’t deny that these tools and conventions improves the quality of the code written by us, but does it guarantee that we will see less software disasters in the future. Any seasoned developer would give a moderate and sensible “probably no”. However, unseasoned and naïve as I am, I give this some thoughts, and I came to a horrid conclusion..

We are bound to see a lot more software disaster. A lot more…

What is a software disaster?

Disaster (n.): a sudden calamitous event bringing great damage, loss, or destruction; broadly : a sudden or great misfortune or failure – Webster Dictionary

Generally speaking, we assume that software disasters are calamitous events that are caused by code defects that result in the loss of financial resources or human lives. Therefore, in the postmortem of these calamities, we focus our efforts in discovering the root cause (the bug) in the code, and seek to provide solutions that prevent and detect these types of defects. The crash of Ariane 5 (caused by number overflow error) and the Patriot missile failure to intercept Scud missile during the first Iraq war [2] (bad floating point arithmetic) are two prominent examples of these types of errors. There are many more of these software disasters, and I’m sure that my fellow classmates, who are smarter and also more experienced, would have dissected the cases and offered their own insights.

Indeed, we now have better programming model and hardware and also methodology to deal with these problems. But “disaster” carries a unique meaning for different people – small and trivial incidents that aren’t noticeable by the majority of us maybe disasters for some. For instance, the AmazonFail event, in which some items in the Amazon database were incorrectly flagged as “adults”, caused many gay-themed books had disappeared from Amazon's sales rankings and search algorithms. As a result, many publishers and authors lost sales, and Amazon was confronted with a PR disaster as initially the action was perceived as anti-gay by the public. Early in the year, Facebook also introduced a privacy flaw during their upgrade, which allowed users to gain access to portions of their friends’ profile that they should not have been able to see. Both incidents occurred no more than three months apart and although most of us may consider them minor, the implications may be disastrous for some.

It is my intention that, as most of us are concentrating our analysis on the causes of the past software disasters (rightfully so and excellently done), I want to take a step back, compare a few “disasters” in the distant past and the recent past, and give my own thoughts about how software disasters have evolved. So first, let us look at a well-studied case, about a software disaster in the distant past.

During the Good Old Days of AT&T…

n 1990, I was only 8-years-old, and for the majority of us still primarily communicated using phones. But on January 15th 1990, for 9 hours, an estimated 60 thousand people lost their AT&T long distance service. Since it also affected the 800-numbers calls, hotels and car rentals lost sales, and it was estimated that it cost AT&T $60 to $70 million in lost revenues [3].

The cause was of course, inadvertently, due to a bug in the AT&T phone switch software that was introduced during an upgrade.

Back in the days, it took around 20 seconds for a long distance call to get connected. In order to improve the latency, AT&T deployed new software in #4 ESS switching system, which intends to reduces the overhead required in signaling between the switches by eliminating a signal that explicitly indicates if a node is ready to receives traffic, and reduces to the time between dialing and ringing to 4 seconds. The software was deployed to all 114 computers in the AT&T network.

Then, on January 15th 1990, one of the switches in New York started signaling “out-of-service”, and rebooted itself and resumed service. A second switch would receive the signal from the first switch and start processing. However, when another signal from the first switch arrived when the second switch was busy, the signal was not processed properly, and the second switch also signaled “out of service”. One by one, the “out-of-service” signal propagated to the neighboring switches, and eventually to all 114 switches, because they all deployed the same software.

The intermittent “out of service” signals blocks some of the long distance calls. Eventually, the engineers discovered then a switch would signal “out-of-service” when another signal arrives too soon. By reducing the workload of the network which increases the latency between signals, the erratic behavior eventually disappeared from the network. But during the blackout period, it was an estimated that more than 5 million calls were blocked [4].

But the question remains: what causes the switch to erratically signal “out-of-service”? To answer the question, let’s look at a snippet of C code in the #4 ESS switching system that is used for processing incoming signal:

/* ``C'' Fragment to Illustrate AT&T Defect */
do
{
  switch (expression)
  {
      // some code...
      case value:
          if (logical)
          {
              //sequence of statements
              break;
          } else
          {
              //another sequence of statements
          }

      //statements after if...else statement
  }
  //statements after case statement
} while (expression);

The root cause of the disaster is caused by nothing more than the break statement being placed at a wrong location. When the break statement is reached in the code, it would prematurely cause the program to break from the enclosing switch statement. However, this sudden termination was not anticipated – the programmer did not realize that the break statement would terminate the enclosing switch statement. The result of the termination would crash the switching node, causing it to send an “out-of-service” signal to its neighboring nodes (5). The “out-of-service” propagated between nodes and eventually crashed the system.

The cause of this disaster is absurdly simple: the programmer simply misunderstood or overlooked the effect of break statement in the code. In this particular case, some case studies also suggest that a comprehensive code coverage test would also reveal that a portion of the code is not executed due to the misplaced break statement, (and apparently ignoring the fact that the code coverage as a testing technique only began to get popular in the 1988s [5]). Therefore, like most typical software disasters that we studied, we can blame:

Programmer,

Inadequate testing, since they should’ve caught it,

The management, well only if they weren’t so aggressive and provide enough time for the programming and the IT department to test and deploy the software in a timely manner.

For most modern software disasters, any analysis or case studies will inadvertently return to these causes and blame the same people. However, even if we have smart programmers who write code carefully, comprehensive test suites and smoke test for the application, and management that set the robustness and the security of the software as the most important criteria for shipping, does it mean that we could largely avoid software disasters? No, we won’t, and that’s not only because we aren’t perfect…

To be continued…

Wednesday, August 06, 2008

The return of the prodigal son, Day 42

I jumped a total of 34 days after the last update. During this period I a. bought a new laptop, b. went to Vancouver for a wedding and spent an weekend there. c. farewell-ed my summer roommate d. spent around a week soul-searching of what I'm supposed to do.

Today I finally unpacked the ConcernExtractor... however, just as any kind of software update, CE no longers works on Eclipse 3.4. However, the plugin did load now :) so the rest is just code defect.

Finally, spending the night on CE also makes me rethink how I need to write my final section - how does CE help to locate and refactor the crosscutting clusters. I realized that I don't need to try very hard into making sense of the data. I only need to show that:

1. CE lets us visualize the location of the clusters

2. CE can simplifies the pointcut expression needed.

But now I am having a stomachache so I will stop and resume tomorrow morning perhaps?

Wednesday, July 02, 2008

The return of the prodigal son, Day 8

What was supposed to be light dinner with a friend/colleague turned out to be a ultra-Korean-feast. My friend went back to work but then I was too full (what an excuse) and went to Barnes & Nobles. Then went home and researched about Charles Petzold and his new Turing book instead.

But the seafood pancake at Seoul Hot Pot is good... so good that I ate half of it and now still feeling the pancake in my stomach

Moral 1: If I want to accomplish something, then eat quick.

Moral 2: Don't overeat Isaac. You know the moment you eat the 4th slice that you just won't work after that.

Moral 3: Have a very definite plan of what I need to do at night. Now I didn't do what I wanted (determined which laptop to buy) and it's like 2 am already.

Goals tomorrow:

Buy the laptop
Restart the work - cannot let the fire extinguish
Eat light

Tuesday, July 01, 2008

The return of the prodigal son, Day 7

Favor asked, favor refused, more questions, then favor (kind of) understood, if not totally granted.

So Thank you.

But as I fear... even I try to avoid as much distraction as possible, there's a big possibility that the resistance is futile.

To conclude my day, it ended up that the biggest distracter is still myself. I started to think - even if I skip whatever activities that I wanted to go because I want to finish this thing, it just probably won't work out as well as I thought. I have countless ways of wasting my time in many things, some of which are of no importance, some just bad, that may be I just better of partying.

Alright that's just one silly thought...

Thesis Progress

I wrote today:

-- Thesis content begins ---

If the target code snipplet is located at the beginning or the end of a method body, refactoring it into aspect using simple call or execution pointcut is straight forward.

If the target snipplet is located in the middle of a method body, refactoring into aspects using only a simple call or execution pointcut is not possible, unless you move the the snipplet to the beginning or the end of the method which is unsafe in most cases. Refactoring this type of snipplet typically requires a composite pointcut expression that involves using control flow-based pointcuts.

If the target snipplet is located inside a loop, an if block, or a try block, then refactoring the snipplet into an aspect requires a composite pointcut expression that involves using control flow-based pointcuts.

-- Thesis content ends ---

The next goal tomorrow is to describe how the distribution of clusters affect the applicability of ConcernExtractor: if we find that all of our refactoring targets that belong to the same group are located at the beginning/end of a method body, CE is really not much use. If some are located in the middle of the body, then CE can be useful. If most refactoring target are located in a block, or middle of the method body, then CE is useful because I don't need to create complex pointcut that describes each location of target join point.

Initially I was stucked at pondering if a try block would catch an exception that is thrown by advice of a join point that the try block wraps around. Since I know my computer will just blow up if I install AJDT, and I don't have the Java SDK for now, I cannot test it and need to find some documents that talk about that.

Finally I found the online Aspect J book which explains it. But then my paragraph (point) starts to balloon as I try to explain how complicated the pointcut will be, if I try to create a pointcut expression that describes code inside a loop; or that how I cannot create a common advices if some of the refactoring target that belongs to the same group are located in a try block: I may need to declare that I'm throwing an exception for in some scenario, but not in others if the containing block re-throws it. Since I don't have AspectJ installed (yet) it is hard to really visualize what might happen.

Until I realize that I only need to prove that the Concern Extraction technique will be useful - I don't really need to talk about how complicated the pointcut expression will look for edge cases (at least not for now) and I need to move on first. So it ended up that I cut around 100 words instead of producing it...

Monday, June 30, 2008

The return of the prodigal son, Day 3, 4, 5, and 6

Fighting off distraction is a very tough job... and I realized how easy it gets to just start typing www.facebook.com (esp now FF and IE have auto-complete) and start time-killing...

Another tough part is setting up the machines correctly: I remember when I was first starting the thesis writing I started using a combination of Eclipse and TexLipse plugin. It ran fine on Linux but on Windows I needed to use MikTex and then all hell broke lose. Because McGill thesis template requires several external packages, it took me several days to find those packages on some obscure, configure them, and make MikTex produce a correct DVI of the thesis sample.

I remembered when I first came to Seattle and attempted to work on the thesis. MikTex just had a new version (2.6) but my laptop had an older version and I had tons of troubles using 2.6 to keep working, I had to literally mirrored the whole file system from my laptop to one of the desktop machine so I can work using the desktop as well?

The main reason I didn't want to use the laptop to write was also because Eclipse 3.2 was taking so much memories and it took forever to generate a DVI layout, and I needed to terminate every essential service manually so the system wouldn't crash.

So as I am "rebooting" my effort, I decide to go back to the good old Emacs and MikTex. To my surprise, MikTex 2.7 has this very neat feature that download missing packages on-the-fly! So as I ran 'latex mythesis.tex' command, it automatically downloaded the packages that I so painstakingly tried to configure before!

I also go back to Emacs as the primary editor... not that I don't like TexLipse... but my 7 years with Emacs tells me that it's by far the best editor available - no I will rather die than to use Notepad...

So after a few emacs tweaking, I re-acquaint Emacs again back to my college days!

Now I can start working!

Thursday, June 26, 2008

The journey of the prodigal son, Day 2

Sometimes when you are determined to accomplish something, the whole world turns to distract you. Today I still have not started the work; however, my tools are prepared. Both the IDE and the MikTex are downloaded and ready to be load.

The remaining things on the list are:

Buy a new computer that has enough RAM for programming if necessary - probably the hardest thing to decide cuz of the costs.
Make sure that I can generate latex with my MikTex configuration
Get the tools ready: 1. the IDE, and 2. the emacs, 3. the latex plugin.

Work - well I was pulled from the crew work to work on creating branch and unfortunately the branch work is tedious and unchallenging intellectually but fails very often and time/attention sucking. Thanks to the MyRadio shows, I kept my attention on the task at hand instead of avoiding it. But tomorrow will be the difficult day. In order not to suck my energy so I cannot give to anything else, the only solution is to start early.

For the last hour I was hoping to upload the pictures on FB but it has been failing me... I wish our Microsoft Live has a similar service where I can upload pictures (with good quality) and allow us to tag our MSN contacts on the pictures and share it on the web. But don't use it on the Spaces... that thing (along the eye-torturing banner ad) is just ridiculous and I don't know about you, but I just hate using it.