Algorithms in C#: Connected Component Labeling
I have to admit, I'm not very good at interviews. For some reason my mind isn't trained well for sorting linked lists or balancing binary trees on the whiteboard. I have no problem designing 600+ type hierarchies and building complex object-oriented frameworks, but typical interview questions were never an area where I was shining. Maybe it's because I like to think slowly and take my time, maybe for some other reasons :) Anyway, just out of the blue I decided to implement an algorithm in C# with the goal to improve my algorithm skills and to enjoy the power and expressiveness of my most favorite language and platform.
As it usually goes, I first sat down and spent three hours coding the whole thing, and after that I did some online research, which revealed the algorithms official name:
The goal is, given a 2D-matrix of colors, to find all adjacent areas of the same color, similar to the flood-fill algorithm used in Paint.
Standard (existing) solution - Union-Find
Online research had revealed that there is a standard algorithm for this that heavily employs Union-Find. The union-find approach starts with a disjoint matrix, where every element is its own set and then iteratively joins (unions) neighboring sets together. The secret of Union-Find is twofold:
- Extremely fast (constant-time) union of two sets
- Extremely fast (almost constant-time) determining whether two points belong to the same set or not
I remember when I was at the university, I even knew how to prove that the complexity of Union-Find is bounded by the inverse Ackermann function, which, believe me, grows very slowly. The important thing that I carried out for the future is that for all practical purposes you can safely consider this O(1).
I was somewhat proud to have implemented a different approach. If you have seen this anywhere before, I wouldn't be surprised though, so please bear with me. The idea is to split the "image" into horizontal stripes ("spans") of the same color. Then scan all the rows of the image row-by-row, top-to-bottom, and for each row, for each span in a row, attach the span to the neighbor spans of the previous row. If you're "touching" more than one existing component, join them into one. If you're not touching any pixels of the same color, create a new component. This way, we split all horizontal spans into three generations: 0, 1 and 2 (like the .NET Garbage Collector!). Generation 0 is the current row, Generation 1 is the row above it, and Generation 2 are all the rows above these two. As we finish processing a row, Generation 1 is added to Generation 2 and Generation 0 becomes Generation 1.
I published the source code at http://code.msdn.microsoft.com/ConnectedComponents. MSDN Code Gallery is an awesome new website where you can publish your code samples and source code. I discovered that I'm not the only one to publish algorithms in C#: see for example, A* path finding.
I'd be very curious to measure the performance of my current code. I know it isn't optimal because there is at least one place where I can do considerably better. For the profiling wizards among you, I won't reveal where it is for now. For the tuning wizards, if my algorithm takes 100% time, how many percent can you get? I myself expect (but not guarantee) to be around 5-10% faster. Can anyone do better?
Also, I'd be curious how my algorithm performs compared to the classical union-find implementation. Anyone interested to implement and see if it does better? (I hope I won't regret posting this when someone shares an implementation which is 100 times faster than mine...)
Cleaner code challenge
I dare to hope that my code is more or less clean and more or less well-factored. However I learned one lesson in life - whenever I start thinking that I'm good, life immediately proves otherwise. That's why I welcome any improvement suggestions. I understand that on such Mickey Mouse-size projects it doesn't really matter that much if the code is clean or not, but anyway. The design could be better, the use of .NET and C# could be better, the distribution of responsibilities could be better, my OOP kung-fu could be better. I'd like to use little projects like this to learn to code better.
Care to implement it in F# or, say, Nemerle? This is your chance to show the world that C# is not the only .NET language! Of course, non-.NET languages are highly welcome as well. I'd personally be interested in, for example, Haskell, Lisp, Prolog, Scala, Nemerle, Boo, (Iron)Ruby, (Iron)Python, Smalltalk, Comega, SpecSharp, SML.NET and so on.
Have you installed Silverlight 2 beta? Need an idea what to implement? ;-)