Creating DSC File Sets

LINQ to HPC enables developers to write distributed queries against data that is spread across the compute nodes of an HPC cluster. (Remember that these nodes must be known to the DSC.) The guiding principle behind LINQ to HPC queries is to compute where the data resides. To be efficient enough to process very large data sets, computations must run on the computers that store the data. Consequently, developers must load their data onto the nodes of the HPC cluster before they can perform any distributed computations. This is often a one-time process.

Note

The DSC is a cataloging service and does not control the actual transport of files, except for replication. It is not a general-purpose distributed file system. You cannot use the DSC for general data storage.

The following table lists some common scenarios that developers encounter when they need to load data into DSC file sets. (For an overview of how the DSC works, see Overview of LINQ to HPC and the Distributed Storage Catalog (Beta 2).)

| Scenario | To distribute the data to the compute nodes… |
| --- | --- |
| You have many lines of text that are spread over multiple files, and you want to access the text from a LINQ to HPC query. For example, you have web server logs that must be analyzed. | Either use the Dsc.exe command-line tool or write a program. For more information, see Creating a DSC File Set from Text Files. |
| Your application stores data as .NET types, and you want to apply a LINQ to HPC query to a sequence of these application objects. For example, you must adapt an existing data-parallel .NET program to cluster computing. | Use a LINQ to HPC query that creates a DSC file set. You can then run distributed queries over the data set. For more information, see Creating a DSC File Set from Serialized Objects. |
| You have data in a SQL Server database, and you want to apply a LINQ to HPC query to it. For example, you have log records in a database that must be analyzed. | Use a LINQ to HPC query to create a DSC file set that contains a serialized .NET object for each database record. For more information, see Creating a DSC File Set from a Database Table. |
| You have many application-specific files that you want to process with a LINQ to HPC query. For example, you want to analyze JPEG images. | Use a LINQ to HPC query that copies each file into a binary record. For more information, see Creating a DSC File Set for Application-Specific Files. |

Note

Many developers are tempted to write path names or database keys to a file set, and then load the files or database records across the network as the LINQ to HPC query executes. Copying data over the network during a LINQ to HPC query is not always wrong, but you will generally get much better performance if you first transfer the data to the cluster and then perform the computation where the data resides.

Copying large amounts of data over the network from central file shares or databases does not scale to large computing clusters. It can create I/O bottlenecks and is generally not a recommended way to use LINQ to HPC.
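The scaling argument can be made concrete with a back-of-the-envelope model. The following Python sketch is only an illustration and does not use LINQ to HPC or the DSC. The function names and all bandwidth figures are hypothetical assumptions, and the model ignores the one-time cost of loading the data onto the cluster. It compares reading a data set from one central share, whose bandwidth is shared by every node, with reading evenly distributed partitions from each node's local disk.

```python
# Back-of-the-envelope model. Every figure here is a hypothetical assumption,
# not a measurement of any real cluster, file share, or DSC deployment.

def central_read_seconds(total_mb, share_mb_per_s):
    """Time for the cluster to pull the whole data set from one central share.

    The share's bandwidth is split among the nodes, so the total time does
    not improve as nodes are added.
    """
    return total_mb / share_mb_per_s


def local_read_seconds(total_mb, nodes, disk_mb_per_s):
    """Time when the data is already spread evenly across the nodes.

    Each node reads only its own partition from local disk, in parallel.
    """
    return (total_mb / nodes) / disk_mb_per_s


if __name__ == "__main__":
    total_mb = 1_000_000      # hypothetical data set size (about 1 TB)
    share_mb_per_s = 1000     # hypothetical central share bandwidth
    disk_mb_per_s = 100       # hypothetical per-node local disk bandwidth

    for nodes in (1, 10, 100):
        central = central_read_seconds(total_mb, share_mb_per_s)
        local = local_read_seconds(total_mb, nodes, disk_mb_per_s)
        print(f"{nodes:>3} nodes: central share {central:.0f} s, local disks {local:.0f} s")
```

In this model the central share takes the same time no matter how many nodes are added, while the local read time falls in proportion to the node count. With a single node the central share is actually faster under these assumed figures, which is consistent with the point that copying over the network is not always wrong but does not scale.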