Trying the LINQ to HPC SDK Sample Code

This topic describes how to use the LINQ to HPC code samples that are available with the Microsoft HPC Pack 2008 R2 SP3 SDK. Before getting started, ensure that you have reviewed the basic requirements for Setting Up a Development Environment for LINQ to HPC.

In this topic:

  • How to install and configure the samples

  • Using the basic samples

    • FGrep

    • Join

    • Sort

    • Histogram

    • MapReduce

    • Programming guide samples

  • Scenario demonstrations

    • DupPic

    • Web Analytics

  • Advanced samples

    • Page Rank

    • KMeans

How to install and configure the samples

The following procedure describes how to install and configure the LINQ to HPC code samples.

Note

The database-related samples require an instance of the Northwind Traders sample database. The database must be loaded on a SQL Server instance that is visible to the DSC nodes in the cluster and to your client computer. You can download the Northwind Traders sample database from: https://www.microsoft.com/downloads/en/details.aspx?FamilyId=06616212-0356-46A0-8DA2-EEBC53A68034

To install the samples

  1. Download the code samples folder for the Microsoft HPC Pack 2008 R2 SP3 SDK from here.

  2. Extract the zipped folder into a folder on your local hard disk drive.

  3. Open the Samples.sln file in Visual Studio 2010.

  4. Open the SamplesConfiguration\SampleConfiguration.cs file and modify the following code. Change HeadNode to the name of the head node of the cluster (must have Microsoft HPC Pack 2008 R2 SP3 installed). Change LocalShare to the UNC path of a share to which you have read and write access, and that is visible to the DSC nodes in the cluster.

    Optionally, if you would like to run the database-related samples in the Programming guide samples project, set DbConnectionString to the connection string for the SQL Server instance where the Northwind Traders sample database is installed.

    // The name of the cluster head node:
    public static readonly string HeadNode = "MY-HEADNODE";

    // A read/write file share that is accessible by the nodes within the cluster:
    private static readonly string LocalShare = @"\\MY-WORKSTATION\Shared\";

    // A DB connection string for the Northwind Traders sample DB:
    public static readonly string DbConnectionString =
        "Data Source=MYDB;Initial Catalog=Northwind;Integrated Security=true";

Using the basic samples

FGrep

The FGrep sample is a LINQ to HPC implementation of the UNIX grep command. FGrep searches a DSC file set and returns all the lines that match a regular expression passed in as an argument to the program.

To run this example, you need a DSC file set that contains text data. You can use the sample data that is provided for the histogram sample project, as described in the following procedure.

Alternatively, if you want to try this with your own text data, you can add it to the DSC instead. For example, the following command loads all of the files from \\MyServer\Shared\MyData onto a cluster with a head node named MyHeadNode and names the file set MyData.

DSC FILESET ADD  \\MyServer\Shared\MyData MyData /compression:None /service:MyHeadNode

Run the sample

The following procedure describes how to load a set of files into the DSC and then run the FGrep sample to count the number of lines in the file set that start with the word “sed”.

To run FGrep

  1. Use the DSC FILESET ADD command to load the histogram sample data onto the cluster.

    The following command loads all of the files from Histogram\data onto a cluster with a head node named MyHeadNode and names the file set “MyData”.

    DSC FILESET ADD Histogram\data MyData /compression:None /service:MyHeadNode

  2. Run FGREP from the command line and pass in two arguments. The first argument is the name of the DSC file set, and the second is the regular expression to search for in each line. For example, the following command searches the file set named “MyData” for all lines that begin with the word "sed".

    FGREP MyData ^sed

Review the code

The FGREP sample uses a single LINQ to HPC query to return the matching lines:

int count = 0;
foreach (LineRecord line in context.FromDsc<LineRecord>(fileSetName)
    .Where(r => regex.IsMatch(r.Line)))
{
    Console.WriteLine(line);
    count++;
}
Console.WriteLine("\nFound {0} matching lines.", count);

The sample uses the Regex.IsMatch method, which returns true if a string matches the specified pattern. For more information about how to write Regex patterns, see The Regular Expression Object Model on MSDN (https://msdn.microsoft.com/en-us/library/30wbz966(v=VS.100).aspx).
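
For example, the following minimal illustration (not part of the sample) shows how the pattern “^sed” matches only strings that begin with “sed”:

// "^sed" anchors the match at the start of the string.
Regex regex = new Regex("^sed");
Console.WriteLine(regex.IsMatch("sed quia non numquam"));  // True
Console.WriteLine(regex.IsMatch("ut labore et dolore"));   // False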

Exercises

After running this sample and reviewing the code, you can use information in the programming guide to try the following:

  • You can experiment with LINQ to HPC by modifying the query. For example, you can use the OrderBy operator to sort the returned lines or limit the number of lines returned by using the Take operator. For more information about basic LINQ operators, see 101 LINQ Samples.

  • Modify the program to store the results as a separate file set in DSC, and then issue a second query to retrieve the results. You could use this approach to cache the FGrep results and avoid re-executing the query. See Saving the Results of a LINQ to HPC Query to a New DSC File Set. A sketch follows this list.
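
The following sketch outlines the second exercise. It assumes the same context, file set name, and regex as the sample; the result file set name FGrepResults is hypothetical:

// Save the matching lines to a new DSC file set.
context.FromDsc<LineRecord>(fileSetName)
    .Where(r => regex.IsMatch(r.Line))
    .ToDsc("FGrepResults")
    .SubmitAndWait();

// A second query reads the cached results without re-running the search.
foreach (LineRecord line in context.FromDsc<LineRecord>("FGrepResults"))
{
    Console.WriteLine(line);
}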

Join

The Join sample demonstrates how to use LINQ to HPC to join two DSC file sets. The file sets are joined by using a common key field that is present in each file set. This is very similar to the common practice of using JOIN queries in a normalized relational database. The example uses two input files that are included in the sample data folder (in the Join project).

Review the code

This sample joins records from two files. One file, geoip.txt, maps IP addresses to countries. The other file, log.txt, is a log of URL requests, where each request contains an IP address.

For example, a record in geoip.txt looks like this:

10.10.1.5 US

And a record in log.txt looks like this:

10.10.1.5 GET /index.html

In both cases the files contain fields separated by a single space, with the first field being the IP address. The program uses the DSC APIs to load each file into a DSC file set and then executes a query containing the Join operator. The following query joins the two tables on the common key (the IP address):

IQueryable<LineRecord> joined = strtable2.Join(strtable1,
    l1 => l1.Line.Split(' ')[0],
    l2 => l2.Line.Split(' ')[0],
    (l1, l2) => new LineRecord(l1.Line.Split(' ')[1] + " " + l2.Line));

This query returns a new collection of joined records, each containing the country resolved from the IP address followed by the original log entry.

For example, a record in the new collection looks like this:

US 10.10.1.5 GET /index.html

Exercises

After running this sample and reviewing the code, you can use information in the programming guide to try the following:

  • Modify the query to return only requests from the US; a sketch follows this list. For more information about basic LINQ operators, see 101 LINQ Samples.

  • This example creates two file sets, each containing a single DSC file, which means the query will run on only a single node. Use the HashPartition operator to repartition both input file sets so that all the nodes in the cluster are used when running a query. An example of hash partitioning is shown in the Histogram sample. Does the program still produce the same output? See also The HashPartition Operator in the programmer’s guide.
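
For the first exercise, a possible starting point is to filter the joined records, relying on the fact that each joined record begins with the country code (a hedged sketch, not part of the sample):

// Keep only the joined records whose resolved country is the US.
IQueryable<LineRecord> fromUS = joined
    .Where(r => r.Line.StartsWith("US "));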

Sort

The Sort sample generates a data set of 100-character string records, and then sorts them. The example demonstrates how to sort data using the OrderBy operator and also how to generate a data set from seed input using all the available DSC nodes in the cluster.

Run the sample

The following procedure describes how to run the Sort sample. Sort takes two arguments: the first specifies how many records to generate, and the second specifies how to partition the records.

To run Sort

  1. Run Sort from the command line and specify the arguments so that 2 DSC files are created for each DSC node, each containing 1000 records. Type the following command:

    sort 1000 2

Review the code

The code that generates the data uses the CalculateRanges method, defined in the HpcLinqExtras project, to create an array of start and end values for each partition.

The FromEnumerableAsDistributed method is also defined in the HpcLinqExtras project. It creates a set of records containing a single unique number for each file (0, 1, …, totalFiles - 1), and hash partitions them so that each file contains a single record: its rank. The SelectMany operator then uses the application-defined GenerateData function to generate a set of random strings based on the range appropriate for each rank.

long[] ranges = Utilities.CalculateRanges(totalFiles, 0, totalFiles*recordsPerFile, true);

context
    .FromEnumerableAsDistributed(Enumerable.Range(0, totalFiles), totalFiles)
    .SelectMany(r => GenerateData(ranges[r], ranges[r + 1]))
    .ToDsc(inputFileSetName)
    .SubmitAndWait();

The end result of this query is a DSC file set where each file contains data generated according to the input range ranges[n] to ranges[n + 1], where n is the rank of the partition.

Note

This sample uses a simplistic approach to generating random numbers on a distributed system. To generate random numbers with a correct distribution, a better generator than System.Random is required, and more attention must be paid to the seed values used. The sample does, however, show how to generate data based on different input criteria per partition.

Exercises

After running this sample and reviewing the code, you can use information in the programming guide to try the following:

  • Remove the AssumeOrderBy operator from the final query in the sample. Does the program still generate the same output? Is it as fast?

  • Modify the program to execute all the work as one query, rather than three. Is the AssumeOrderBy operator still required?

Histogram

The Histogram sample counts the occurrences of words in a DSC file set that contains text files. It first uses the DSC APIs to load the example data into the DSC directly from the data folder within the sample.

Review the code

The following code uses the DSC APIs to load the example data into the DSC directly from the data folder within the sample. It uses the AddAndCopyNewFiles extension method from the HpcLinqExtras project to load the files listed in inputFiles into a DSC file set.

string[] inputFiles = { @"data\InputPart0.txt", @"data\InputPart1.txt" };

context.DscService.CreateFileSet(inputFileSetName, DscCompressionScheme.None)
    .AddAndCopyNewFiles(inputFiles)
    .Seal();

This creates a file set containing two DSC files, one for each input file. If your cluster has more nodes than the number of files in the file set, queries will be inefficient because only two nodes will be used to execute them. For more information about partitioning DSC file sets, see The HashPartition Operator in the programmer’s guide. Before running the main histogram calculation, the program hash partitions the data to create a new file set with at least as many partitions as there are DSC nodes in the cluster.

int stageSize = context.GetNodeCount();

context.WithJobName("Histogram sample - partition data")
    .FromDsc<LineRecord>(inputFileSetName)
    .HashPartition(r => r, stageSize)
    .ToDsc(outputFileSetName)
    .SubmitAndWait(); 

The program then executes a query that breaks each line into words and counts the occurrences of each word. The query returns a collection of Pair objects, each containing a word and the number of times it occurred. The collection is ordered by frequency, and only the top 200 words are returned. Note: The Pair type is another addition provided by the HpcLinqExtras project.

IQueryable<Pair> results = context.WithJobName("Histogram sample - count words")
    .FromDsc<LineRecord>(outputFileSetName)
    .SelectMany(line => line.Line.Split(new[] {' ', '\t'}, StringSplitOptions.RemoveEmptyEntries))
    .GroupBy(word => word)
    .Select(word => new Pair(word.Key, word.Count()))
    .OrderByDescending(pair => pair.Count)
    .Take(200);

Exercises

After running this sample and reviewing the code, you can use information in the programming guide to try the following:

  • Combine the two queries into a single query; a sketch follows this list. Is there any improvement in the overall runtime of the application? How does this scale when you use a much larger input data set?

  • Modify the query to produce the least frequently occurring words.

  • The histogram problem is often solved by applying a MapReduce pattern. Compare the LINQ to HPC query in this example with the MapReduce implementation in the MapReduce project.
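
A hedged sketch of the combined query from the first exercise, assuming the same context, input file set, stageSize, and Pair type used above:

// Repartition, split, group, and count in a single query.
IQueryable<Pair> results = context.WithJobName("Histogram sample - combined")
    .FromDsc<LineRecord>(inputFileSetName)
    .HashPartition(r => r, stageSize)
    .SelectMany(line => line.Line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries))
    .GroupBy(word => word)
    .Select(word => new Pair(word.Key, word.Count()))
    .OrderByDescending(pair => pair.Count)
    .Take(200);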

MapReduce

The MapReduce sample counts occurrences of words in a DSC file set that contains text files. It takes the same approach as the Histogram sample but expresses the computation in terms of the MapReduce pattern, rather than directly as a LINQ to HPC query.

Review the code

In this example the MapReduce operator is an extension method implemented in the HpcLinqExtras project. For more discussion of the MapReduce algorithm see Wikipedia (http://en.wikipedia.org/wiki/Mapreduce).

// Define a map expression:

Expression<Func<LineRecord, IEnumerable<string>>> mapper = (line) =>
    line.Line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);

// Define a key selector:

Expression<Func<string, string>> selector = (word) => word;

// Define a reducer (LINQ to HPC is able to infer the 
// Decomposable nature of this expression):

Expression<Func<string, IEnumerable<string>, Pair>> reducer = 
    (key, words) => new Pair(key, words.Count());

// Map-reduce query with ordered results and take top 200.

IQueryable<Pair> results = context.FromDsc<LineRecord>(inputFileSetName)
    .MapReduce(mapper, selector, reducer)
    .OrderByDescending(pair => pair.Count)
    .Take(200);

Exercises

After running this sample and reviewing the code, you can use information in the programming guide to try the following:

  • This is another example of generating a frequency histogram from input data. Compare the code here, which uses the MapReduce pattern implemented on top of LINQ to HPC, with the approach shown in the Histogram project, which uses LINQ to HPC directly.

  • Review the implementation of the MapReduce operator in the HpcLinqExtras project to see how it maps the MapReduce pattern onto a LINQ query; a simplified sketch follows this list.
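
For orientation, the following is a simplified sketch of how such an operator can be expressed in terms of standard LINQ operators. The actual HpcLinqExtras implementation may differ, for example by decomposing the reducer to enable partial aggregation:

public static class MapReduceSketch
{
    public static IQueryable<TResult> MapReduce<TSource, TMapped, TKey, TResult>(
        this IQueryable<TSource> source,
        Expression<Func<TSource, IEnumerable<TMapped>>> mapper,
        Expression<Func<TMapped, TKey>> keySelector,
        Expression<Func<TKey, IEnumerable<TMapped>, TResult>> reducer)
    {
        // Map each record to intermediate values, group the values by key,
        // and reduce each group to a single result.
        return source.SelectMany(mapper).GroupBy(keySelector, reducer);
    }
}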

Programming guide samples

This project includes executable versions of all of the sample code fragments contained in the LINQ to HPC Programmer's Guide (Beta 2).

Running this sample executes all the queries shown in the programming guide, plus some additional examples of uploading data to and downloading data from the DSC using both the product APIs and the additional methods shipped in the HpcLinqExtras project.

Note

This is the only sample that uses the Northwind Traders example database. If you have not installed the database and set DbConnectionString in SampleConfiguration.cs, the database examples will be skipped during execution.

Scenario demonstrations

DupPic

The DupPic sample searches for duplicate images within a collection of images stored on disk. It demonstrates how to move binary file data into the DSC and query the data as a partitioned record set.

There are two versions of the sample. DupPic1 uses LINQ queries running on the client to search for duplicate images. It calculates a checksum for each image, looks for images with identical checksums, and returns a list of the duplicate images stored in the project’s data folder. DupPic2 uses LINQ to HPC queries to search for duplicate images within the LocalShare\pics folder. Before running the application, you can copy some example images (including duplicates) into this folder, or you can let the program copy in a few sample images for you.

Review the code

DupPic2 first loads the images stored in the LocalShare\pics folder. It does this with a LINQ to HPC query that takes a list of file paths to images and hash partitions them across all the nodes in the cluster. For each file path, a corresponding FileRecord object, containing the file path and the binary image data, is created and stored in a DSC file set of FileRecord objects.

string[] imageFilePaths = Directory.GetFiles(SampleConfiguration.ImageFilesPath, "*.jpg",
                                                SearchOption.AllDirectories);

int stageSize = context.GetNodeCount();
context.FromEnumerable(imageFilePaths)
    .HashPartition(r => r, stageSize)
    .Select(path => new FileRecord(path))
    .ToDsc(inputFileSetName)
    .SubmitAndWait();

Each FileRecord in the resulting file set contains both the binary image data and the original path to the image. DupPic2 then executes a second LINQ to HPC query that looks for records with identical checksums for the binary image data and returns a list of file paths for the duplicates.

var duplicatedFiles = context.FromDsc<FileRecord>(inputFileSetName)
    .Select(r => new
                        {
                            Path = r.FilePath,
                            Checksum = GetChecksum(r.FileData)
                        })
    .GroupBy(record => record.Checksum)
    .Where(group => group.Count() > 1)
    .SelectMany(group => group.Select(record => record.Path));
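
The GetChecksum helper is defined by the sample. A minimal sketch of what such a helper might look like, using an MD5 hash from System.Security.Cryptography (the sample's actual implementation may differ):

// Hypothetical sketch of a checksum helper, not the sample's code.
static string GetChecksum(byte[] data)
{
    using (var md5 = MD5.Create())
    {
        // Render the hash as a hex string so that checksums can be grouped
        // and compared as ordinary strings.
        return BitConverter.ToString(md5.ComputeHash(data));
    }
}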

Web Analytics

The WebAnalyticsConsole sample demonstrates how to use LINQ to HPC to analyze large text files such as web server logs. The sample scenario involves a hypothetical company that buys TV advertising to drive viewers to buy its products on the web. The company wants to track the impact of individual ad purchases on its web site traffic and visitor purchasing behavior, and to modify future purchases accordingly. In this scenario, data comes from the following sources:

  • The Ad Buying Manager buys some advertising on several different TV stations at different times of the day to get in front of different types of potential customers. She loads the ad purchase data into a spreadsheet.

  • The IT department captures log files from the company’s public web site and each day updates the latest web traffic data, loading the log files directly into a LINQ to HPC cluster using the DSC command line.

Run the sample

The application should be run once to generate some ad buy data and corresponding dummy log data and then a second time to analyze it with LINQ to HPC.

To run WebAnalyticsConsole

  1. Change directory to the directory containing the WEBANALYTICSCONSOLE.EXE binary.

  2. Run the application with no arguments to generate data:

    WEBANALYTICSCONSOLE.EXE

  3. Use the DSC command line tool to add the dummy log file data to DSC as the file set “MyLogs”.

    dsc fileset add .\LogData "MyLogs" /compression:none

  4. Run the application passing it the name of the DSC file set:

    WEBANALYTICSCONSOLE.EXE "MyLogs"

Review the code

The application proceeds in two steps. First, it preprocesses the log files, using a LINQ to HPC query to read each line from the logs and create a corresponding HttpLogRecord object that represents a log entry. The HttpLogRecord type is defined by the application and is responsible for parsing log file lines and dealing with ill-formatted data. It exposes a clean object model that subsequent queries can use, rather than having to parse the line text each time.
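
The following is a hedged sketch of the kind of type this describes; the field names and parsing logic are illustrative, and the sample's actual HttpLogRecord differs:

[Serializable]
public class HttpLogRecord
{
    public DateTime Time;
    public string Request;

    public static HttpLogRecord FromCsv(string line)
    {
        string[] fields = line.Split(',');
        var record = new HttpLogRecord();
        // Tolerate ill-formatted rows rather than failing the whole query.
        DateTime time;
        DateTime.TryParse(fields[0], out time);
        record.Time = time;
        record.Request = fields.Length > 1 ? fields[1] : string.Empty;
        return record;
    }
}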

var state = context
    .WithJobName("Web analytics sample - clean log files")
    .FromDsc<LineRecord>(inputFileSetName)
    .Select(l => HttpLogRecord.FromCsv(l.ToString()))
    .ToDsc<HttpLogRecord>(processedFileSetName)
    .SubmitAndWait();

The program then executes a LINQ to HPC query that returns the number of HTTP requests that occur per hour, throughout the day. This query groups records by hour and returns, for each hour, an HpcTuple containing the hour and the number of requests. The results are written to a CSV file, which can be loaded into the files\WebAnalytics.xls spreadsheet to display the web site traffic driven by a particular set of ad buys.

Note

The HpcTuple struct type is defined in the HpcLinqExtras project. HpcTuple is preferred over the .NET 4 System.Tuple class because arrays of structs have less memory management overhead than arrays of the equivalent class, especially when the struct itself contains only value types. HpcTuple is also available to applications written against the .NET 3.5 runtime.

var requestsPerHour = context
    .FromDsc<HttpLogRecord>(processedFileSetName)
    .GroupBy(record => record.Time.Hour)
    .Select(rs => HpcTuple<int, int>.Create(rs.Key, rs.Count()))
    .OrderBy(rs => rs.Item1);
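
For reference, the following is a minimal sketch of the shape that this query implies for HpcTuple; the actual HpcLinqExtras definition may differ:

// A serializable value type; arrays of structs avoid one heap allocation
// per element, unlike arrays of System.Tuple class instances.
[Serializable]
public struct HpcTuple<T1, T2>
{
    public T1 Item1;
    public T2 Item2;

    public static HpcTuple<T1, T2> Create(T1 item1, T2 item2)
    {
        return new HpcTuple<T1, T2> { Item1 = item1, Item2 = item2 };
    }
}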

Advanced samples

Page Rank

This sample shows a simplified implementation of the page rank algorithm. The example consists of two applications, one to generate artificial page data and a second to calculate the page ranks for the generated pages.

Run the sample

The following procedure describes how to run the page rank sample.

To run Page Rank

  1. Change directory to the directory containing the PageRankTextDataGenerator.EXE binary.

  2. Type the following command to run the generator application:

    PageRankTextDataGenerator.exe /sn:amiller-hn /plf:PageLinks-input /prf:PageRanks-input /ppf:100

    This generates two input file sets, called PageLinks-input and PageRanks-input, with one DSC file per cluster node and 100 pages in each file. Replace amiller-hn with the name of your head node.

    Note: There are further options for customizing the data generation; use PageRankTextDataGenerator.exe /? to review them.

  3. Change directory to the directory containing the page rank calculator binary.

  4. Run the calculator application, specifying the same head node and the names of the two input file sets generated in step 2 (PageLinks-input and PageRanks-input).

Review the code

The page rank calculator executes an iterative algorithm that repeatedly calculates a page rank for each page, using the ranks from the previous iteration, until some convergence criterion is met. For more discussion of the page rank algorithm, see Wikipedia (http://en.wikipedia.org/wiki/Page_rank).
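
The following is a hedged, single-node sketch of one such iteration; the Page type (with a Rank value and an array of outgoing Links) is illustrative, and the sample's distributed queries differ:

// Each page distributes its current rank evenly across its outgoing links;
// a page's new rank is the sum of the contributions it receives.
var newRanks = pages
    .SelectMany(p => p.Links.Select(target =>
        new { Target = target, Share = p.Rank / p.Links.Length }))
    .GroupBy(c => c.Target)
    .Select(g => new { Page = g.Key, Rank = g.Sum(c => c.Share) });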

KMeans

K-means clustering is an iterative technique for partitioning a data set of n observations into k clusters, where each observation belongs to the cluster with the nearest mean. The iterative approach used here selects an initial set of centers at random and calculates the resulting clusters by associating each observation with the nearest center. It then calculates new centers based on the centroid of each cluster. This is repeated until the calculation converges to within some acceptable variance for the centers. For more discussion of the k-means clustering algorithm, see Wikipedia (http://en.wikipedia.org/wiki/K-means_clustering).

The k-means clustering example has two steps: generate some random vector data, and then use a series of LINQ to HPC queries to find the cluster centers. These two steps can be executed separately, using the generate and run switches, or combined, using the all switch (kmeans.exe all).

The program runs a series of queries, first to calculate the random vectors and then the initial centers. Finally, a single query is executed that runs ten k-means iterations and prints out the result.
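
The following is a hedged, single-node sketch of one k-means iteration; the Vector type (with + and / operators) and the NearestCenter helper are illustrative, and the sample's LINQ to HPC queries differ:

// Assign each vector to its nearest center, then recompute each cluster's
// center as the centroid (mean) of its members.
Vector[] newCenters = vectors
    .GroupBy(v => NearestCenter(v, centers))
    .Select(g => g.Aggregate((a, b) => a + b) / g.Count())
    .ToArray();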