We haven’t forgotten about other models – honest!

Disclaimer: The Maestro project has been renamed. While referred to in this article as ‘Maestro’, moving forward the project is referred to with codename ‘Axum’.

Copied from 'Parallel Programming with .NET'  

When discussing parallel computing and managed code, for the most part we’ve stuck to talking about a few parallel paradigms such as task parallelism (TPL) and data parallelism (PLINQ) (our brethren at the http://blogs.msdn.com/stmteam blog have also discussed some specific mechanisms for ensuring safety in parallel applications). Mostly, this has been due to our focus on what we’re shipping next. The parallel support in Visual Studio 2010 has always been focused on “low-hanging fruit”: how do we enable developers to quickly take advantage of multi-core and manycore in both their legacy applications and new applications? How do we make it easy for them to express the parallelism in their applications? That doesn’t mean we haven’t been thinking about and working on other models – models that may require a larger conceptual jump but may be better suited for architecting applications for parallelism from the ground up.

The constructs we’re providing in Visual Studio 2010 are most useful in shared-memory systems where multiple threads operate concurrently on shared state. Often times this type of concurrency is OK. For simple applications, the dependencies between threads can be manageable, allowing those apps to scale well on today’s machines. However, when we start to parallelize complex applications for tens or hundreds or thousands of cores, these dependencies become far too complex to manage and, regardless of complexity, the contention caused by so many threads fighting for access to the same state severely limits scalability. We like to learn by example, and surely there must be other successful applications that are complex and scale well to a large number of CPUs.

You betcha; you’re lookin’ it at.

Honestly, what’s more successful, parallel, and scalable than the web? Given that we may someday see 1,000+ cores on our desktops, wouldn’t it be great if we could shrink the web’s architecture down onto a single multi-core computer? The web blends a lot of different paradigms and models, and the jargon tossed around can make your head spin. So I’ll keep it simple and talk about a few of the properties that make the web such a great parallel application:


Each component that makes up to the web is an isolated machine that works in parallel with other components but does not implicitly share resources – specifically memory. Any component that wishes to use another component’s resources must do so via an explicit request and typically via a very formal and narrow channel. There are pros and cons to this restriction.

The negative aspect of this is performance: when a node needs access to the resources of another node, it can’t just take it when it needs it; it must wait until the request has been processed.

The positive aspect of isolation is how it helps reduce complexity. As I noted earlier, managing the dependencies between components in a large complex system can be extremely difficult. This difficulty can lead to a severe hindrance on developer productivity. The great emergent property of this restriction is that it actually eliminates implicit data dependencies, thus reducing the surface area that components can depend on. This drastically reduces the complexity not only for client components, but for server components as well. When components are isolated, they are the exclusive owners of their state and can read and write to state without fear that it may be changed erratically by another component.


Figure 1. Example dependency graphs of a concurrent application in a model which allows implicit dependencies (left) vs. a concurrent application in a model which disallows implicit dependencies (right). Each rectangle represents code acting on shared state.


Message-passing is really an artifact of isolation, and it’s a communication mechanism that provides many benefits to massively parallel applications. The difference between message-passing and shared-state communication is equivalent to the difference between sending a colleague an e-mail requesting her to complete a task and opening up her organizer to write down the task directly in her to-do list. More than just being rude, the latter is likely to confuse her – she might erase it, not notice it, or accidentally prioritize it incorrectly. The former is a bit gentler: she can process the e-mail when she’s ready, and because she received the message in a very standard, formal manner, she’s more likely to use the information correctly. This is much like how the web works. When a component needs to communicate state with other components or needs access to a resource, it sends this information in a message.

There are both beneficial and detrimental aspects to message-passing. Typically, to communicate state outside of an isolation boundary, the component passing the information must copy it, potentially introducing a significant amount of overhead, particularly if the data is large. If, however, the overhead is not prohibitively expensive, message-passing is a very scalable means of communication between concurrent components. To understand why, let’s first consider the alternative. Let’s imagine that there are five parallel components (threads in this case) that each need to perform 100 complex calculations and add each resulting integer to a shared list. The caveat is that the list must maintain a sorted order. The code might look something like the following:

 using System;
using System.Collections.Generic;
using System.Threading;

class Program
    static void Main()
        // create a list and add something to it
        LinkedList<int> sortedList = new LinkedList<int>();
        // thread synchronization...
        ManualResetEvent mre = new ManualResetEvent(false);
        int count = 5;
        object mutex = new Object();
        // 5 threads each add 100 items to the list in sorted order   
        for (int t = 0; t < 5; ++t)
            Thread thread = new Thread(new ThreadStart(() =>
                // using the Random class in a multi-thread environment is not
                // guaranteed to give truly random numbers, but it’s ok
                // for this example   
                Random r = new Random();

                for (int i = 0; i < 100; ++i)
                    // emulate a complex calculation
                    for (int z = 0; z < 1000000; ++z) ;
                    int num = r.Next();

                    // insert the item in sorted order with mutual exclusion
                    lock (mutex)
                        LinkedListNode<int> curNode = sortedList.First;
                        while (true)
                            if (curNode == null || curNode.Next == null)
                            else if (curNode.Value >= num)
                                sortedList.AddBefore(curNode, num);
                            curNode = curNode.Next;
                // if we're the last thread to finish, set the MRE
                if (Interlocked.Decrement(ref count) == 0) mre.Set();
        foreach (int val in sortedList)

Figure 2. Example of a shared-state, parallel algorithm.

Note that in this example it’s necessary to use a lock around all of the code that must analyze and modify the data atomically. In effect, after the first five “complex calculations” complete, each thread halts progress as it blocks, waiting for exclusive access to the lock. As shown in the figure below, this results in a large amount of time spent waiting that, instead, could be spent processing more items and reducing the overall time to completion.


Figure 3. Allocation of execution time for the example parallel algorithm using shared-state.


Figure 4. Screen clip of Thread Blocking View (Visual Studio 2010) analysis of example program.

What’s particularly bad about this situation is how it behaves when a lot of threads are added. Notice how the longest red block (time wasted on waiting) is directly proportional to the number of threads. To be more precise, the last thread that acquires the lock will spend roughly (n-1) x c milliseconds waiting, where n is equal to the number of threads waiting on the lock and c is the time it takes to update the data structure. If there were 10 times as many threads, the longest waiting period would be 10 times as long, adding a significant amount to the overall time to completion. This design also has the potential to adversely affect responsiveness. Imagine if the thread on the far left of Figure 3 was a Windows UI thread and the green chunks represent time spent processing user input. All the yellow and red times are times when the UI would be non-responsive, possibly grayed-out and stuck with a nasty hour-glass cursor.

Now let’s take the same scenario and use a hypothetical message-passing technology. Typically, message-passing technologies have the notion of a mailbox – what is essentially a thread-safe queue – for asynchronously passing messages and buffering them in order. So now we re-architect our application such that there is one parallel component, complete with mailbox, that is the sole owner of the list and the only component that makes updates to it. We create five other components that each perform 100 complex calculations in parallel (just like in the shared-state version) except that this time, instead of waiting to acquire exclusive access to the list, they simply place it into the list component’s mailbox and continue calculating the next number. Now we’ve moved the contention in the application strictly to the list component’s mailbox. Fortunately for us, a thread-safe queue can be implemented lock-free such that a single enqueue could translate to as little as a single interlocked operation. This significantly reduces the amount of time spent wasted while contending for resources, and the synchronization overhead (not including the cost of copying the message) typically impacts performance and scalability significantly less.


Figure 5. Allocation of execution time for the example parallel algorithm using message passing.

Notice now how scalable our algorithm is. While there is still some contention, using an Interlocked operation has effectively reduced the amount of work that needs to be done atomically down to a single hardware instruction, enabling us to add many more cores to the algorithm while still reducing the total time to completion.

Fault Tolerance

Much of the web’s success is owed to its resiliency, in turn owed to isolation. In a shared-memory system, a rogue thread has the ability to corrupt state willy-nilly, potentially corrupting the entire process. Detecting, dealing with, and correcting this type of failure can be incredibly difficult if not impossible. In an isolated system where message-passing is used to update state, the most damage a component or thread can do is corrupt its own state. If a single component goes down, other components that depend on it may detect the failure via messages or timeouts and either reroute their dependencies or restart the component that failed.

Especially in the face of asynchronous operations, shared-memory parallel applications struggle with failure handling. Often times, exceptions are propagated up the stack of a thread that does not have enough context to make a proper decision when an exception arrives. With message-passing, failures can be easily routed to the components that know how to handle them best.


With shared-state models it can be difficult to architect applications for reusability and maintainability. Not only are the dependencies complex and difficult to manage correctly, they also may impede a developer’s ability to cleanly separate components. With message-passing and isolation, much like the web, components interact with each other through formal, narrow channels with a very specific protocol. These channels are similar to what interfaces are to classes in the OO world. As long as a developer programs to the component’s interface and adheres to the protocol, the implementation can change with minimal impact on other components.

I don’t want to hand-wave and take credit for reaping the benefits of the web’s architecture on multi-core systems; none of these concepts are new – not even to Microsoft. Erlang is a language that has been around for years and has run many large incredibly scalable telco networks efficiently. MPI is arguably one of the most used technologies for message-passing in the highly concurrent distributed systems world. At Microsoft’s very own Robotics group, the CCR was developed to bring the powerful scalability, parallelism, and fault tolerance of message passing to robots. It’s been so successful there that it’s found its way into high-visibility parallel applications like MySpace. This applies to native code just as it does to managed: we’ve brought some of these concepts to the Parallel Pattern Library (PPL) in the form of Asynchronous Agents.

While each of these technologies has seen some degree of success they all lack a little something that makes parallel programming safe and productive. Erlang, for example, has a syntax that, while succinct, is difficult for the average developer and MPI, the CCR, and Asynchronous Agents lack the very thing that makes the web so beautifully safe: isolation.

What we need is a technology that wraps all of the concepts into a nice package – and we’re working on it. To get both safety and productivity we think a good way to go is with a language – whether that means a new language altogether or extensions to an existing one, we’re not quite sure yet. But in the interest of verifying that we can actually provide the value we claim we can, we’ve been incubating a language called Maestro, which you can learn all about on Channel 9 (also embedded below).

Get Microsoft Silverlight

Check back frequently – both here and on Channel 9 – to see what we’re working on in this space.