DSC File Sets and Files

An application's distributed data is stored in DSC file sets. A DSC file set is a distributed store of strongly typed .NET objects. A file set is exposed in the client application as an instance of the DscFileSet class and also as an IQueryable object that is returned by the FromDsc method.

Although LINQ to HPC queries view file sets as sequences of records, the DSC divides the individual data records of a DSC file set into segments called DSC files. The DSC files that make up a file set are distributed by the DSC across the available DSC nodes.

DSC files are internal mechanisms that are used to partition and replicate groups of records, and programmers rarely need to interact with them directly. After a DSC file set is created and finalized (this is known as the seal operation), it is not possible to modify the file set. DSC files use automatically generated names, and cannot be renamed. It is helpful to think of a DSC file set as a single large file that contains many records rather than as a directory or container of files. While it is true that the records of a file set are partitioned and grouped into individual files that are distributed across DSC nodes, the fact that the records are distributed does not fundamentally alter the principal operational view of a file set as a sequence of records.
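The "single large file of records" view can be sketched with a small in-memory model. This is only an illustration of the concept, written in Python, and is not the DSC API: the class FileSetModel and its methods are hypothetical names invented for this sketch.

```python
from itertools import chain


class FileSetModel:
    """Toy in-memory stand-in for a file set (hypothetical; not the DSC API)."""

    def __init__(self, name):
        self.name = name
        self.files = []       # each "file" is one partition: a list of records
        self.sealed = False

    def add_file(self, records):
        # Partitions can be added only until the file set is sealed.
        if self.sealed:
            raise RuntimeError("file set is sealed and cannot be modified")
        self.files.append(list(records))

    def seal(self):
        # After sealing, the file set is read-only.
        self.sealed = True

    def records(self):
        # Readers see one flat sequence, regardless of how records are partitioned.
        return chain.from_iterable(self.files)


logs = FileSetModel("logs")
logs.add_file(["a", "b"])
logs.add_file(["c"])
logs.seal()
all_records = list(logs.records())
```

In this model, a reader iterates over records() and never needs to know how many partitions exist, which mirrors the operational view described above.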

You can control some aspects of file sets and files. For example, the DscFileSet.AddExistingFile method adds an existing DSC file to a newly created file set (one that has not yet been sealed). This method updates the catalog in the DSC database. It does not copy the file on disk; it simply adds the existing file to the list of files that the DSC considers part of the new file set. As a result, a DSC file can be part of more than one file set.
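The catalog-only behavior can be illustrated with a toy model in which file sets hold references to files rather than copies. This Python sketch is an assumption-laden illustration, not the DSC implementation: CatalogModel, create_file, create_file_set, add_existing_file, and read are hypothetical names.

```python
class CatalogModel:
    """Toy catalog (hypothetical; not the DSC API or database schema)."""

    def __init__(self):
        self.files = {}       # file id -> records stored "on disk"
        self.file_sets = {}   # file set name -> ordered list of file ids

    def create_file(self, file_id, records):
        self.files[file_id] = list(records)

    def create_file_set(self, name):
        self.file_sets[name] = []

    def add_existing_file(self, file_set_name, file_id):
        # Only the catalog entry changes; the stored records are not copied.
        if file_id not in self.files:
            raise KeyError(file_id)
        self.file_sets[file_set_name].append(file_id)

    def read(self, file_set_name):
        # Yield the records of every file that the file set references.
        for file_id in self.file_sets[file_set_name]:
            yield from self.files[file_id]


catalog = CatalogModel()
catalog.create_file("f1", ["r1", "r2"])
catalog.create_file_set("original")
catalog.create_file_set("derived")
catalog.add_existing_file("original", "f1")
catalog.add_existing_file("derived", "f1")
```

Both file sets in the model now return the same records, while only one copy of the file exists.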

When a LINQ to HPC query reads a file set, it processes individual records from that file set. A record can be a line of text, which is represented by the LineRecord class, or a previously saved object of a .NET type. You can therefore think of a file set either as a sequence of objects that you have previously saved or as a sequence of text lines. Several serialization modes are available for records, such as plain text and binary. LINQ to HPC queries that operate on file sets use the HPC cluster to iterate over each record in the specified file set.

The following diagram illustrates how a DSC file set is divided into DSC files that are distributed across DSC nodes.

Figure 2. Division of a DSC file set into DSC files

Even though DSC file sets contain distributed data, you can use a single LINQ to HPC query in your application to interact with the data. Applications can apply LINQ operators as needed, and the system handles the details of executing the query on the nodes. In addition, if a LINQ to HPC query produces intermediate results—for example, when two query operations are chained together—these intermediate results are also automatically stored by the LINQ to HPC runtime.
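The idea of one query running over distributed data can be sketched as follows. This Python model only mimics the shape of the computation on local lists. It does not use LINQ to HPC or schedule any work on cluster nodes, and run_query and the sample data are hypothetical.

```python
def run_query(partitions, predicate, projection):
    """Apply a filter and then a projection to each partition, then combine.

    Hypothetical model: the real runtime decides where and how each
    partition is processed.
    """
    # Each partition is processed independently, as a node would process its files.
    per_partition = [
        [projection(record) for record in partition if predicate(record)]
        for partition in partitions
    ]
    # The caller receives a single combined sequence of results.
    return [record for partition in per_partition for record in partition]


partitions = [
    ["error: disk", "info: ok"],
    ["error: net"],
    ["info: done"],
]
errors = run_query(
    partitions,
    predicate=lambda line: line.startswith("error"),
    projection=lambda line: line.split(": ", 1)[1],
)
```

The caller writes the filter and the projection once. The partitioning of the data affects only where the work happens, not how the query is expressed.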