The LINQ to HPC and DSC Object Models

The following figure illustrates the main object types that you use to create LINQ to HPC queries.

Figure 1. Overview of LINQ to HPC and DSC objects

You must first construct an HpcLinqConfiguration object that specifies the head node of the HPC cluster that your queries will use. You can also set other options. See Configuring LINQ to HPC Queries for more information. Next, you create an HpcLinqContext object and pass the configuration object as an argument to the constructor.

An HpcLinqContext object can be used in two ways.

  • If you read the context object’s DscService property, you get a DscService object that has methods that enable you to create DSC file sets, add DSC nodes, and perform other management functions for the DSC server-side component that runs on the HPC cluster’s head node. For example, if you invoke the CreateFileSet method, the system creates a new empty file set on the cluster, and then returns a DscFileSet object that represents the newly created file set. For more information, see Creating DSC File Sets.

  • If you invoke the context object’s FromDsc method, the system creates a new distributed query that can read the records of an existing DSC file set on the HPC cluster. The Microsoft® .NET type of the query is IQueryable<T>. For more information about basic queries, see Querying DSC File Sets. For information about advanced queries, see Implementing Distributed Algorithms by Using LINQ to HPC.

A LINQ to HPC query that is created by the FromDsc method processes individual records from a given file set. As you add LINQ operators such as Select and GroupBy to your query, you can think of the underlying file set either as a sequence of text lines or as a sequence of objects that you have previously saved. In other words, a record is either a line of text (represented by the LineRecord class) or a previously saved object of a serializable .NET type.

Although LINQ to HPC queries view a file set as a single sequence of records, the DSC stores the records on the HPC cluster in segments called DSC files. Each DSC file is stored on an individual DSC node (a computer) and is implemented as an ordinary NTFS file. DSC files may optionally be replicated across multiple DSC nodes for performance and fault tolerance. The DscFile class represents a subsequence of the records that are stored in the DSC file set.

The following diagram illustrates how a DSC file set is divided into DSC files that are distributed across the DSC nodes of the cluster.

Figure 2. A DSC file set

Note

In addition to the API, you can use a command-line utility named Dsc.exe to manage the DSC. With Dsc.exe you can create and delete file sets, add DSC files to file sets, and specify which DSC nodes are used to store DSC files. See DSC Command-Line Reference for more information.

Even though DSC file sets are partitioned across the DSC nodes of the HPC cluster into DSC files, you need only a single LINQ to HPC query in your application to interact with the data. The LINQ to HPC and DSC components that run on the HPC head node handle the details of executing queries in a distributed manner across the DSC nodes. In addition, if a LINQ to HPC query produces intermediate results, the server-side components manage them automatically so that subsequent stages of the query can also be processed in a distributed manner. This situation occurs, for example, when two query operations are performed in succession, such as a GroupBy operation followed by a Select operation.
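The GroupBy-then-Select pattern mentioned above can be tried locally before you run it on a cluster. The following is a minimal sketch that uses ordinary LINQ to Objects on an in-memory array, not LINQ to HPC. The class name GroupThenSelectSketch, the method name CountLinesByLength, and the sample data are hypothetical and chosen only for illustration. In a LINQ to HPC program the source sequence would come from the FromDsc method instead of an array, and the grouped data would be the intermediate result that the server-side components manage.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GroupThenSelectSketch
{
    // Groups lines by their length, then projects each group to a
    // (length, count) pair. The groups produced by GroupBy are the
    // intermediate result that the following Select consumes.
    public static Dictionary<int, int> CountLinesByLength(IEnumerable<string> lines)
    {
        return lines
            .GroupBy(line => line.Length)
            .Select(g => new KeyValuePair<int, int>(g.Key, g.Count()))
            .ToDictionary(p => p.Key, p => p.Value);
    }

    public static void Main()
    {
        // Hypothetical in-memory stand-in for the records of a file set.
        var lines = new[] { "a", "bb", "cc", "d", "eee" };
        foreach (var pair in CountLinesByLength(lines).OrderBy(p => p.Key))
        {
            Console.WriteLine("{0} line(s) of length {1}", pair.Value, pair.Key);
        }
    }
}
```

Because the sketch runs in a single process, it shows only the shape of the query, not how the work is distributed.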

The following figure illustrates how a Select query works in a distributed environment.

Figure 3. Execution of a distributed Select query

Figure 3 shows how the query Select(x => f(x)) is executed. LINQ to HPC invokes the method f in a distributed manner. It automatically compiles the code needed to run the Select query, and copies that code to the DSC nodes of the cluster. Server-side components then run the code on each node and simultaneously process the DSC files that make up the input file set. The output may be saved as a new DSC file set, or used immediately for queries that contain multiple operations. The results of the operations are divided into separate DSC files in a way that matches the distribution of the input file set.
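The behavior that Figure 3 describes can be approximated on one computer. The sketch below is an illustration only and is not how LINQ to HPC is implemented: it models a file set as a list of in-memory partitions and uses PLINQ (AsParallel) so that each partition is transformed independently. The names PartitionedSelectSketch and SelectPerPartition, and the sample data, are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PartitionedSelectSketch
{
    // Applies f to every record of every partition. Each partition is
    // processed independently, so output partition i holds exactly the
    // transformed records of input partition i.
    public static List<List<TOut>> SelectPerPartition<TIn, TOut>(
        List<List<TIn>> partitions, Func<TIn, TOut> f)
    {
        return partitions
            .AsParallel()
            .AsOrdered()   // keep output partitions in input order
            .Select(part => part.Select(f).ToList())
            .ToList();
    }

    public static void Main()
    {
        // Three in-memory partitions stand in for three DSC files.
        var fileSet = new List<List<string>>
        {
            new List<string> { "alpha", "be" },
            new List<string> { "gamma" },
            new List<string> { "delta", "epsilon", "z" }
        };
        var lengths = SelectPerPartition(fileSet, line => line.Length);
        for (int i = 0; i < lengths.Count; i++)
        {
            Console.WriteLine("partition {0}: {1}", i, string.Join(", ", lengths[i]));
        }
    }
}
```

The output has the same number of partitions as the input, and each output partition has as many records as its input partition, which mirrors the statement that the results of a Select are distributed in the same way as the input file set.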

Here is a complete LINQ to HPC console application that performs a Select operation and a Max operation.

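using System;
using System.Linq;
using Microsoft.Hpc.Linq;

namespace MyProgram
{
  class Program
  {
    static void Main(string[] args)
    {
      // Identify the cluster by the name of its head node.
      var config = new HpcLinqConfiguration("MyHpcClusterHeadNode");
      var context = new HpcLinqContext(config);

      // Build a distributed query that maps each line of the
      // MyTextData file set to its length.
      var lengths = context.FromDsc<LineRecord>("MyTextData")
                           .Select(r => r.Line.Length);

      // Max() returns the largest line length as a single value.
      Console.WriteLine("The maximum line length is {0}", lengths.Max());
    }
  }
}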

Note

This code references the Microsoft.Hpc.Linq and Microsoft.Hpc.Dsc assemblies.

This program finds the length of the longest line in a text-based file set. If the file set named MyTextData has been partitioned into multiple DSC files, the program uses multiple DSC nodes of the cluster.

To run this example, you must perform the following steps.

  1. Change the name of the head node to the name of your cluster's head node.

  2. Change the name of the input DSC file set to the name of a file set on your cluster that contains lines of text.

  3. If you do not have such a file set, create one by using the DSC command-line interface. For more information, see Creating a DSC File Set from Text Files with Dsc.exe.

Here is an example of a command that creates a file set named MyTextData. The files that are located in the \\MyServer\data directory provide the line records. The individual files in the directory are copied as DSC files on the cluster.

dsc.exe fileset add \\MyServer\data MyTextData /service:MyHpcClusterHeadNode

Not all queries can operate with independent, distributed tasks as the Select operator does. For example, you might want to sort the records of a file set. The sorted records will not be distributed in the same way as the input file set. Figure 4 illustrates how an OrderBy query executes on the HPC cluster.

Figure 4. Execution of a distributed OrderBy query (merge phase omitted)

Figure 4 shows that the records of a file set that has been partitioned into multiple DSC files can produce an output file set whose partitions contain records that were taken from many DSC files of the input file set.
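One common way to obtain this kind of redistribution is range partitioning: each record is routed to an output partition according to the range its key falls in, and each output partition is then sorted locally. The source does not state that LINQ to HPC uses exactly this technique, so treat the following as a single-process sketch of the idea under that assumption. The names RangePartitionSortSketch and Sort, the boundary values, and the sample data are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RangePartitionSortSketch
{
    // Redistributes records into boundaries.Length + 1 output partitions by
    // key range, then sorts each output partition locally. Boundaries must
    // be in ascending order. Output partition i receives the records that
    // are <= boundaries[i] and greater than boundaries[i - 1].
    public static List<List<int>> Sort(List<List<int>> input, int[] boundaries)
    {
        var output = Enumerable.Range(0, boundaries.Length + 1)
                               .Select(_ => new List<int>())
                               .ToList();
        foreach (var partition in input)
        {
            foreach (var record in partition)
            {
                // Index of the first boundary that is >= record,
                // or the last partition if there is none.
                int target = Array.FindIndex(boundaries, b => record <= b);
                if (target < 0) target = boundaries.Length;
                output[target].Add(record);
            }
        }
        foreach (var partition in output) partition.Sort();
        return output;
    }

    public static void Main()
    {
        // Three unsorted input partitions stand in for three DSC files.
        var input = new List<List<int>>
        {
            new List<int> { 42, 7, 93 },
            new List<int> { 15, 88, 3 },
            new List<int> { 61, 29, 50 }
        };
        var sorted = Sort(input, new[] { 30, 60 });
        for (int i = 0; i < sorted.Count; i++)
        {
            Console.WriteLine("output partition {0}: {1}", i, string.Join(", ", sorted[i]));
        }
    }
}
```

With this sample data, the first output partition receives records from all three input partitions, and reading the output partitions in order yields the fully sorted sequence.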

When you execute a LINQ to HPC query, the run-time components analyze the query and create a query plan that distributes as much of the work as the operations in the query allow. You don’t have to be an expert in distributed computing to take advantage of the built-in algorithms that LINQ to HPC provides. You can simply create LINQ queries and rely on LINQ to HPC to determine how your query should best be divided into distributed tasks that run on the cluster.

However, in some cases, you may want to be aware of how the data in a file set is partitioned. How you partition your data into DSC files can affect the performance of your query. For more information, see Choosing the Right Number of DSC Files.

Also, if you are knowledgeable about distributed algorithms, there may be cases where you want to use LINQ to HPC to implement your own distributed algorithms. For more information, see Implementing Distributed Algorithms by Using LINQ to HPC.

Note

The LINQ to HPC and DSC beta releases have scalability limits. LINQ to HPC has been tested with the Histogram and Terasort samples on on-premises HPC clusters of up to 256 DSC nodes. The DSC has been tested with a maximum of 4,000 individual DSC files in a single DSC file set, and a maximum of 2,000 file sets in the DSC. Each record of a file set has a .NET-imposed maximum size of 2 gigabytes (GB).

In theory, the maximum size of a DSC file is limited only by the constraints of NTFS. However, in practice, the limit on DSC file sizes is based on system memory. Very large DSC files may encounter a memory limit as a vertex operates on a file and uses memory to create intermediate results. For more information, see Choosing the Right Number of DSC Files.