Query Data with Parallel LINQ

This post shows a simple way to write code that takes advantage of multiple processors. You will see that LINQ queries can allow you to side step the difficult tasks normally involved in writing multi-threaded code. To get started, all you need is a little basic knowledge of how to write simple LINQ queries.

The code shown in this post uses a pre-release version of PLINQ called the Microsoft Parallel Extensions to .NET Framework 3.5. When PLINQ finally ships, it will run only on .NET 4.0 or later. The version I'm using that runs on top of 3.5 is for evaluation purposes only. There will never be a shipping version that runs on .NET 3.5.

This LINQ provider is being created at Microsoft by the Parallel Computing team; it is not the work of the C# team that created LINQ to Objects and LINQ to SQL. Here is the website for the Parallel Computing team:

http://msdn.microsoft.com/en-us/concurrency/

At the time of this writing, these extensions were available only in pre-release form. You could download them either as Visual Studio 2008 compatible extensions to .NET 3.5, or as part of the pre-release version of Visual Studio 2010. Since the download sites might change over the coming months, I suggest that you find these resources by going to the Parallel Computing site, or to the Visual Studio site:

http://msdn.microsoft.com/en-us/vs2008

Parallel LINQ, or PLINQ, is only a small part of the Parallel Extensions to the .NET Framework. It is, however, an important part. Since it is a simple and natural extension of the LINQ syntax, I think developers familiar with that technology will find it easy to use.

Consider this code:

 var list = Enumerable.Range(1, 10000);

var q = from x in list.AsParallel()
        where x < 3300
        select x;

foreach (var x in q)
{
    Console.WriteLine(x);
}

These lines look nearly identical to the code you have seen in many simple LINQ samples. The only significant difference is the call to AsParallel at the end of the first line. Though we have often used type inference to hide the return type of a LINQ query, I'm going to pause and take a second look at this instance. Rather than returning IEnumerable<T>, this version of PLINQ returns IParallelEnumerable<int>:

 IParallelEnumerable<int> q = from x in list.AsParallel() etc….

In the near future, PLINQ queries of this type will probably return ParallelQuery<int>. Because this product is still evolving, it might be simplest to use var, at least during the pre-release phase, and let the compiler choose the type. That way you can save typing, avoid problems with anonymous types, and you need not concern yourself about changes in the API as the product develops. It is almost always appropriate to use var to designate the return type of a LINQ query, and there are only special circumstances when you would do otherwise.

Here are the results from this first PLINQ query:

 2
1
3
4
6
512
5
7
513
8
12
514
9
13
515
10
14
516
11
15
517
16
72
518
17

The numbers shown here are in a relatively random order because they are being returned from different threads. It is important to remember that the sequence of values returned by LINQ is not always guaranteed to be presented in a particular order. If Order is important in your code, you can add a call to AsOrdered to the query after the call to AsParallel. Alternatively, you could insert a GroupBy clause to establish the desired ordering. Otherwise developers should assume that the ordering from a PLINQ query will be entirely random

Now that you understand the basics of Parallel LINQ, let’s move on to look at a more interesting example. Improved performance is the main reason to write code that can run in parallel. The program shown in this post uses a timer to demonstrate how PLINQ can improve performance in a program.

Performance improvements become more evident when our code has access to more processors. The code I show here runs faster on a two processor machine, but it really starts to come into its own on a four processor machine. Moving up to even more processors yields more powerful results. Here, for instance, are the results showing an improvement of 1.33 times when using two processors, and almost two times when using 4 processors:

 2 Processors = 1.44 x improvement:
Linear: 00:00:13.15
Parallels: 00:00:09.10

4 Processors = 1.96 x improvement:
Linear: 00:00:15.00
Parallel: 00:00:07.68

These tests are being running against pre-release software, so these numbers are almost certain to change before release, and of course different machines will yield different results. Furthermore, the degree of improvement that you see is likely to change depending on the type of algorithm you run, the number of cores on your machine, the architecture of the machine, how many caches there are and how they’re laid out, etc. Though it is rare, some queries show superlinear performance enhancements. In other words, there is a greater than 4x speedup on a 4-core box. An improvement of 2 times, such as the one shown, or even a 3 time improvement, is common.

This sample program is called FakeWeatherData, and it is available for download from the LINQ Farm on Code Gallery. It features a simple LINQ to XML query run against a file with 10,000 records in it. The data I'm querying is not real, but consists of random dates and temperatures generated by a simple algorithm included in the FakeWeatherData program.

The XML file is structured like this:

 <?xml version="1.0" encoding="utf-8" ?>
<Samples>
  <Sample>
    <Year>1973</Year>
    <Month>May</Month>
    <Day>15</Day>
    <Temperature>10</Temperature>
  </Sample>
  <Sample>
    <Year>1970</Year>
    <Month>Feb</Month>
    <Day>10</Day>
    <Temperature>14</Temperature>
  </Sample>
  <Sample>
    <Year>1970</Year>
    <Month>Jan</Month>
    <Day>15</Day>
    <Temperature>11</Temperature>
  </Sample>
   ... Many lines of code omitted here
 </Samples>

There is also a simple C# class used by the program to encapsulate the data from the XML file:

 class WeatherData
{
    public string Year { get; set; }
    public string Month { get; set; }
    public string Day { get; set; }
    public string Temperature { get; set; }
}

The parallel version of the query in the program looks like this:

 for (int i = 0; i < NUM_REPS; i++)
{
    var list = (from x in doc.Root.Elements("Sample").AsParallel()
                where x.Element("Year").Value == "1973" &&
                   x.Element("Month").Value == "Apr" &&
                   x.Element("Day").Value == "15"
                select new WeatherData
                {
                    Day = x.Element("Day").Value,
                    Month = x.Element("Month").Value,
                    Temperature = x.Element("Temperature").Value,
                    Year = x.Element("Year").Value
                }).ToList();

}

Accompanying this code is a similar LINQ query that does not use PLINQ

 for (int i = 0; i < NUM_REPS; i++)
{
    var list = (from x in doc.Root.Elements("Sample")
                where x.Element("Year").Value == "1973" && 
                   x.Element("Month").Value == "Apr" && 
                   x.Element("Day").Value == "15"
                select new WeatherData
                {
                    Day = x.Element("Day").Value,
                    Month = x.Element("Month").Value,
                    Temperature = x.Element("Temperature").Value,
                    Year = x.Element("Year").Value
                }).ToList();

}

The program queries the data in the XML file first using the Parallel code, then using standard LINQ. By comparing the time it takes each block of code to execute you can get a sense of the relative improvement available through PLINQ. I'll show you how to make such comparisons in just a moment. I will also discuss some tools that will become available to help profile code of this type.

You can see that the PLINQ query contains a call to AsParallel, while the other query does not. Other than that the two queries are identical. The fact that the two queries look so much alike points to a primary strength of PLINQ: very little specialized knowledge is necessary in order to begin using it. This does not mean that the subject is trivial, but only that the barrier to entry is low. This is not the case with most concurrent programming models.

LINQ queries are designed to be read-only, working with immutable data. This is a good model for parallelism, because it makes it unlikely that data will mutate, thereby setting up the potential for a race condition. You should note, however, that PLINQ does nothing to prevent this from happening, it is simply that LINQ is designed to make it unlikely.

Note also that the declarative LINQ programming style ensures that developers specify what they want done, rather than how it should be done. This leaves PLINQ free to ensure that concurrent LINQ queries run in the safest manner possible. If LINQ had been defined more strictly, such that it had to process each element in a certain order, then the PLINQ team would have had a much more difficult task.

The code in both these queries pulls out only the records from the XML file that have their date set to April 15, 1973. Because of deferred execution, the query would not do anything if I did not call ToList(). As a result, I added that call and converted the result into a List<WeatherData>. Though hardly earthshaking in import, these calls ensure that the code actually does something, and thus gives PLINQ scope to take advantage of the multiple processers on your system.

Simple timers are created to measure the difference between the standard LINQ query and the PLINQ query. I've also used a method used in many of Parallel LINQ team's samples for displaying the time elapsed during a test run:

 private static void RunTest()
{
    XDocument doc = XDocument.Load("XMLFile1.xml");

    Stopwatch sw = new Stopwatch();

    sw.Start();
    LinqOrdinarie(doc);
    sw.Stop();
    ShowElapsedTime("Ordinaire", sw.Elapsed);

    sw.Reset();

    sw.Start();
    ParallelLinq(doc);
    sw.Stop();
    ShowElapsedTime("Parallels", sw.Elapsed);
}
 private static TimeSpan ShowElapsedTime(string caption, TimeSpan ts)
{
    string elapsedTime = String.Format("{0}: {1:00}:{2:00}:{3:00}.{4:00}",
        caption, ts.Hours, ts.Minutes, ts.Seconds,
        ts.Milliseconds / 10);
    Console.WriteLine(elapsedTime, "RunTime");
    return ts;
}

At least with the pre-release version of PLINQ that I've played with, I've found it very useful to set up timers to confirm that PLINQ is actually able to speed up an operation. My record at guessing which code will benefit from running in parallel is not good, and so I find that confirming the effectiveness of the code by explicitly measuring it is worthwhile. You can either use the simple StopWatch class from the System.Diagnostics namespace, as shown here, or else you can use a profiler. Note that a thread aware profiler might ship with some versions of Visual Studio 2010.

I've found that the advantages of concurrent LINQ become more obvious the longer the operation I'm timing lasts. As a result, I've placed the query inside a loop, and added a variable to the program called NUM_REPS. By setting NUM_REPS to a large number, say 500, you can clearly see the benefits that can be accrued when you run LINQ queries in parallel on multiple processors. Note that the first time PLINQ is used, its assembly will need to be loaded, the relevant types will need to be JIT compiled, and new threads will need to be spun up, etc. As a result, many developers see improved performance after they get past the initial warm-up time.

Though it is very easy to get started with PLINQ, there are still complexities inherent in the subject that you need to consider. For instance, PLINQ will sometimes develop a different partitioning scheme for your data depending on whether you are working with an Enumerable or an Array. To learn more about this subject, see the following post from the Parallel Programming team:

http://blogs.msdn.com/pfxteam/archive/2007/12/02/6558579.aspx

The simple PLINQ examples shown in this post should help you get started with this powerful and interesting technology. Parallel LINQ is still in its infancy, but already it provides means of greatly simplifying tasks that are not normally easy to perform.

kick it on DotNetKicks.com