Creating a DSC File Set for Application-Specific Files

In many situations, you may want to process application-specific files from a LINQ to HPC query. For example, you might want to apply an operation on a per-file basis.

The DupPic2 sample, which ships with the LINQ to HPC samples, illustrates this scenario. It calculates a checksum for each image in a large set of images, and then looks for duplicates; duplicate images have the same checksum.

To perform a query that accesses individual files, you first need to transfer the JPEG files to the cluster. The following code does this.

var config = new HpcLinqConfiguration("MyHpcClusterHeadNode");
config.OutputDataCompressionScheme = DscCompressionScheme.None;
var context = new HpcLinqContext(config);           
string[] imageFilePaths = ...
string inputFileSetName = ...

// Load all the images from a share onto the cluster and store them in
// FileRecords. The FileRecords are stored in a file set that contains
// multiple records per file and is distributed over the cluster.

int stageSize = context.GetNodeCount();
context.FromEnumerable(imageFilePaths)
       .HashPartition(r => r, stageSize)
       .Select(path => new FileRecord(path))
       .ToDsc(inputFileSetName)
       .SubmitAndWait();

The FromEnumerable operator serializes the objects in the application and copies them to a temporary file set that consists of a single DSC file. The rest of the query then opens that temporary file set on the cluster.

Note

This example uses the OutputDataCompressionScheme property of the HpcLinqConfiguration class to disable compression. Image data is already compressed, and there is no performance benefit to be gained by compressing it a second time.

When the HashPartition operator is applied, the list of path names is written into an output file set that has as many DSC files as there are DSC nodes on the cluster (stageSize is set to the node count).
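The effect of hash partitioning can be illustrated locally with LINQ to Objects. The following is only a sketch of the idea, assuming a partition is chosen from the key's hash code; it is not the LINQ to HPC implementation, which distributes the partitions across DSC nodes.

```csharp
using System;
using System.Linq;

class PartitionSketch
{
    static void Main()
    {
        string[] paths = { "a.jpg", "b.jpg", "c.jpg", "d.jpg" };
        int partitionCount = 2; // stands in for the number of DSC nodes

        // Each record goes to the partition given by its key's hash code,
        // modulo the partition count -- the same idea HashPartition applies
        // when it spreads records across DSC files on the cluster.
        var partitions = paths.GroupBy(
            p => Math.Abs(p.GetHashCode()) % partitionCount);

        foreach (var partition in partitions)
        {
            Console.WriteLine("Partition {0}: {1}",
                partition.Key, string.Join(", ", partition));
        }
    }
}
```

Because the assignment depends only on the key's hash, records with equal keys always land in the same partition.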

Then, a FileRecord object is created for each path. The constructor of the FileRecord object reads the binary image of the JPEG file and stores it as a byte array along with the path to the original file. Here is the definition of the FileRecord class.

[Serializable]
public class FileRecord
{
  public string FilePath { get; private set; }
  public byte[] FileData { get; private set; }

  public FileRecord(string filePath)
  {
    FilePath = filePath;
    FileData = File.ReadAllBytes(filePath);
  }
}

After the data is loaded onto the cluster, you can perform operations on the application-specific files. For example, to list duplicate JPEG files, you can use the following LINQ to HPC query.

var config = new HpcLinqConfiguration("MyHpcClusterHeadNode");
var context = new HpcLinqContext(config); 
string inputFileSetName = ...

var duplicatedFiles = 
  context.FromDsc<FileRecord>(inputFileSetName)
         .Select(r => new {
                 Path = r.FilePath,
                 Checksum = GetChecksum(r.FileData)
             })
         .GroupBy(record => record.Checksum)
         .Where(group => group.Count() > 1)
         .SelectMany(group => group.Select(record => record.Path));

Console.WriteLine("\nThe following files are duplicates:");

foreach (var filepath in duplicatedFiles)
  Console.WriteLine("  {0}", filepath);

The records processed in the Select operation are instances of the FileRecord class. Each record contains the original path name and a binary copy of the application-specific file, stored on the local compute node. In this example, the code calculates a checksum for each file and then checks for duplicate checksums.
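The query above calls a GetChecksum helper that is not shown. A minimal sketch, assuming an MD5 digest from System.Security.Cryptography is acceptable as the checksum (the method name and hex-string return type are assumptions, not part of the sample), might look like this:

```csharp
using System;
using System.Security.Cryptography;

public static class ChecksumHelper
{
    // Computes a hex-encoded MD5 digest of the file contents.
    // Identical byte arrays always produce identical digests, which is
    // the property the duplicate-detection query relies on.
    public static string GetChecksum(byte[] fileData)
    {
        using (var md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(fileData);
            return BitConverter.ToString(hash);
        }
    }
}
```

Any hash with a sufficiently low collision probability would work here; MD5 is adequate for detecting duplicates, though not for security-sensitive uses.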