Creating a DSC File Set from Serialized Objects

When an application uses structured data that is not stored in text files, you can create DSC file sets that contain serialized .NET objects. You can either use the default serialization that is provided by LINQ to HPC, or you can implement your own serialization code for types that you define.

Note

Binary data that is serialized by LINQ to HPC does not use the same data format as .NET serializers such as BinaryFormatter, or other common serializers such as the WCF DataContractSerializer. You must use either LINQ to HPC queries or the LINQ to HPC API to read and write binary data for LINQ to HPC jobs.

Creating a DSC file set by using default binary serialization

LINQ to HPC automatically generates serialization code for all types that are marked with the Serializable attribute, which is in the System namespace.

The following code is an example of how to use the Serializable attribute.

[Serializable]
public class ProductRecord
{
  public int ProductId;
  public decimal UnitPrice;
  public string ProductName;

  // The default serializer requires a public default constructor.
  public ProductRecord() { }

  public ProductRecord(int id, decimal price, string name)
  {
    ProductId = id;
    UnitPrice = price;
    ProductName = name;
  }
} 

You can use the ToDsc method to write instances of the ProductRecord class to a file set. Here is an example.

var config = new HpcLinqConfiguration("MyHpcClusterHeadNode");
var context = new HpcLinqContext(config);  

var productsMyFileSetName = ...
var discountedMyFileSetName = ...

var query = context.FromDsc<ProductRecord>(productsMyFileSetName)
                   .Select(x => new ProductRecord(
                             x.ProductId, x.UnitPrice * 0.9m, x.ProductName))
                   .ToDsc(discountedMyFileSetName); 

query.SubmitAndWait();

There are some restrictions on the data that can be serialized by the default serializer. These restrictions apply both to the type to be serialized, and to its data members.

  • The most important restriction is that the object’s graph of object dependencies must be tree structured. It cannot be cyclical or contain shared elements. For example, a class that contains an array in one of its fields is allowed, but a class that contains a doubly-linked list cannot use default serialization. If this is your situation, see Creating a DSC File Set by Using Custom Serialization.

  • The default serializer does not support types that have subtypes or that derive from a user-defined type. In other words, no inheritance of user-defined classes is supported, although you can use the default serializer with structures, which derive implicitly from the built-in ValueType type. The default serializer can also serialize types that implement interfaces.

  • The default serializer only supports types that provide a public default constructor. This is a constructor that does not take any arguments, and has a visibility of public. The type can have other constructors in addition to the default constructor.

  • The default serializer does not support generic types.

  • The default serializer requires that the [Serializable] attribute is used for each serializable type. Every field of a serializable type must be a serializable type. If you want to exclude some fields (such as caches) from being serialized, you must implement a custom IHpcSerializer class instead of using the default serializer.

  • Every serializable type must have a visibility of public. This includes the types that are used in the data members of the serializable types. The default serializer can serialize the private fields of a public type.

  • The default serializer does not support null values. If you want to serialize null values, you must implement a custom IHpcSerializer class for the object type that may be null.

  • The default serializer does not serialize delegate types.

  • The default serializer does not serialize anonymous types from or to DSC file sets. However, the default serializer can serialize anonymous types if the types are used as inputs and outputs of intermediate queries. For example, you could use an anonymous type as the result value of a Select operation that is followed by a GroupBy operation, but you can’t write an anonymous type using the ToDsc operator.
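Several of the restrictions above can be checked with reflection before a job is submitted. The following code is a minimal sketch of such a check. The DefaultSerializationChecks class and the two record types are hypothetical names used only for illustration. They are not part of LINQ to HPC, and this is not how LINQ to HPC itself validates types. The sketch is meant for user-defined record types, and it does not examine field types, null values, or whether the object graph is tree structured.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of LINQ to HPC): reports which of the
// type-level restrictions listed above a user-defined type violates.
public static class DefaultSerializationChecks
{
    public static List<string> FindProblems(Type type)
    {
        var problems = new List<string>();

        if (!type.IsDefined(typeof(SerializableAttribute), false))
            problems.Add("missing [Serializable]");

        // IsVisible is true only if the type and any enclosing types are public.
        if (!type.IsVisible)
            problems.Add("not public");

        if (type.IsGenericType)
            problems.Add("generic type");

        // A class may derive only from Object; structures derive from ValueType.
        if (type.IsClass && type.BaseType != typeof(object))
            problems.Add("derives from a class other than Object");

        // GetConstructor returns only public instance constructors.
        // Structures always have an implicit parameterless constructor.
        if (type.IsClass && type.GetConstructor(Type.EmptyTypes) == null)
            problems.Add("no public default constructor");

        return problems;
    }
}

[Serializable]
public class GoodRecord
{
    public int Id;
    public GoodRecord() { }
}

// Violates two restrictions: no [Serializable] and no default constructor.
public class BadRecord
{
    public int Id;
    public BadRecord(int id) { Id = id; }
}
```

Calling DefaultSerializationChecks.FindProblems(typeof(GoodRecord)) returns an empty list, while the call for BadRecord reports the missing attribute and the missing default constructor.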

Note

LINQ to HPC uses either its own default serialization or a custom serialization that is based on the IHpcSerializer class. It never uses the .NET ISerializable interface for serialization.

Note

LINQ to HPC queries are the only way to write binary DSC files that use the LINQ to HPC default serializer. This presents a bootstrapping problem for types that use the default serialization approach. There are two ways to bootstrap data that needs to be stored by using the LINQ to HPC default binary serialization.

The first is to write text data such as XML or CSV (comma-separated values) into files that you load as DSC files. You can then use a LINQ to HPC query to parse the text lines into objects, which you can then save with the default binary serialization.

Alternatively, you can implement custom binary serialization methods and use .NET serialization classes to write binary data that can also be read by LINQ to HPC queries. This technique enables you to partition the set of .NET objects, and to create one binary file for each DSC file that you want to have in your file set. You then can transfer the binary files into a DSC file set programmatically, or you can use the Dsc.exe command line utility. After the data is distributed to the cluster, you can use a LINQ to HPC query to save it with the default binary serializer. For more information, see Creating a DSC File Set by Using Custom Serialization.
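The first bootstrapping approach, parsing text lines into objects, can be sketched as follows. The code assumes a simple line format of "id,price,name" with no quoting or embedded commas. The ProductTextParser class and its methods are hypothetical names, and the ParseAll method uses LINQ to Objects as a local stand-in for the LINQ to HPC query so that the parsing logic can be tried without a cluster.

```csharp
using System;
using System.Globalization;
using System.Linq;

[Serializable]
public class ProductRecord
{
    public int ProductId;
    public decimal UnitPrice;
    public string ProductName;

    public ProductRecord() { }

    public ProductRecord(int id, decimal price, string name)
    {
        ProductId = id;
        UnitPrice = price;
        ProductName = name;
    }
}

// Hypothetical helper: turns text lines into ProductRecord objects.
public static class ProductTextParser
{
    // Parses one line of the assumed form "id,price,name".
    public static ProductRecord ParseLine(string line)
    {
        string[] parts = line.Split(',');
        if (parts.Length != 3)
            throw new FormatException("Expected 3 fields: " + line);

        // Invariant culture keeps parsing independent of machine settings.
        return new ProductRecord(
            int.Parse(parts[0], CultureInfo.InvariantCulture),
            decimal.Parse(parts[1], CultureInfo.InvariantCulture),
            parts[2].Trim());
    }

    // Local stand-in for the query: skips blank lines, parses the rest.
    public static ProductRecord[] ParseAll(string[] lines)
    {
        return lines.Where(l => !string.IsNullOrWhiteSpace(l))
                    .Select(l => ParseLine(l))
                    .ToArray();
    }
}
```

In a LINQ to HPC job, the same ParseLine method would be called from a Select operation over the text file set, and the resulting records would be written with the ToDsc operator so that they are stored with the default binary serialization.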

Creating a DSC file set by using custom serialization

If you create a custom serializer, you must use the CustomHpcSerializer custom attribute and implement the IHpcSerializer<T> interface.

The following code is an example of a class that implements its own LINQ to HPC serialization. The class, ProductCustomRecord, reads and writes each of its fields directly by using the HpcBinaryReader and HpcBinaryWriter classes. ProductCustomRecord is taken from the accompanying sample code. Note that any class that is written by the custom serializer must provide a default constructor.

[CustomHpcSerializer(typeof(ProductCustomRecord))]
public class ProductCustomRecord : IHpcSerializer<ProductCustomRecord>
{
    public int ProductId;
    public decimal UnitPrice;
    public string ProductName;

    // Required by CustomHpcSerializer
    public ProductCustomRecord() { }

    public ProductCustomRecord(int id, decimal price, string name)
    {
        ProductId = id;
        UnitPrice = price;
        ProductName = name;
    }

    #region IHpcSerializer<ProductCustomRecord> Members

    public ProductCustomRecord Read(HpcBinaryReader reader)
    {
        if (reader == null)
            throw new ArgumentNullException("reader");

        var record = new ProductCustomRecord();
        record.ProductId = reader.ReadInt32();
        record.UnitPrice = reader.ReadDecimal();
        record.ProductName = reader.ReadString();
        return record;
    }

    public void Write(HpcBinaryWriter writer, ProductCustomRecord record)
    {
        if (writer == null)
            throw new ArgumentNullException("writer");
        if (record == null)
            throw new ArgumentNullException("record");

        writer.Write(record.ProductId);
        writer.Write(record.UnitPrice);
        writer.Write(record.ProductName);
    }

    #endregion
}

This ProductCustomRecord class provides custom LINQ to HPC serialization. The following code is an example of how to use this class to load data to the cluster.

var config = new HpcLinqConfiguration("MyHpcClusterHeadNode");
var context = new HpcLinqContext(config);


ProductCustomRecord[] data = new[] { 
    new ProductCustomRecord(1, 1.00m, "Widgets"),
    new ProductCustomRecord(2, 3.50m, "Thingies"),
    new ProductCustomRecord(3, 10.50m, "Blobs")
};

int stageSize = context.GetNodeCount();

context.FromEnumerable(data)
    .HashPartition(r => r, new ProductCustomRecordComparer(), stageSize)
    .ToDsc(DataPartitionedFileSetName)
    .SubmitAndWait();

// Read the data within a query.

var result = context.FromDsc<ProductCustomRecord>(DataPartitionedFileSetName);

Console.WriteLine("\n\nPartitioned data as DSC file set of ProductCustomRecord records:");
foreach (ProductCustomRecord p in result)
    Console.WriteLine("  {0}\t{1:c}\t{2}", p.ProductId, p.UnitPrice, p.ProductName);

In this example, the HashPartition operator also requires an equality comparer. The following code shows the comparer.

[Serializable]
public class ProductCustomRecordComparer : IEqualityComparer<ProductCustomRecord>
{
    public bool Equals(ProductCustomRecord x, ProductCustomRecord y)
    {
        return x.ProductId == y.ProductId;
    }

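    // Records that compare as equal must return the same hash code,
    // so hash the same field that Equals compares.
    public int GetHashCode(ProductCustomRecord obj)
    {
        return obj.ProductId.GetHashCode();
    }
}
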
After the code runs, the output file set contains ProductCustomRecord records that were serialized by the custom serializer. The file set is distributed across the cluster. The name of the output file set is given by the DataPartitionedFileSetName variable.

A custom serializer’s read/write behavior must be symmetrical in the sense that records written by the Write method to the bit stream match the expectations of the Read method. The Read method must pull exactly the same number of bytes, integers, or characters from the stream as the Write method provides. If there are discrepancies between the Read and Write methods of the user-provided IHpcSerializer implementation, an error will occur within the LINQ to HPC provider. In this situation, the call stack in the debugger will show an exception in the user-provided serializer that is called by the system code, or an end-of-stream error if the serializer attempted to read more data than was written.
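The symmetry requirement can be exercised locally with a round-trip check. The following code is a minimal sketch that uses the .NET BinaryWriter and BinaryReader classes as stand-ins for HpcBinaryWriter and HpcBinaryReader, on the assumption that the same pattern carries over. The ProductData and SymmetricSerialization names are hypothetical. The sketch also shows one way to handle a field that may be null: Write emits a presence flag, and Read consumes that flag before deciding whether a string follows.

```csharp
using System;
using System.IO;

public class ProductData
{
    public int ProductId;
    public decimal UnitPrice;
    public string ProductName;   // may be null
}

// Hypothetical helper: a Write/Read pair with matching layouts.
public static class SymmetricSerialization
{
    // The order of the calls here defines the layout of one record.
    public static void Write(BinaryWriter writer, ProductData record)
    {
        writer.Write(record.ProductId);
        writer.Write(record.UnitPrice);
        // Presence flag: tells Read whether a string follows.
        writer.Write(record.ProductName != null);
        if (record.ProductName != null)
            writer.Write(record.ProductName);
    }

    // Reads the fields in exactly the order that Write wrote them.
    public static ProductData Read(BinaryReader reader)
    {
        var record = new ProductData();
        record.ProductId = reader.ReadInt32();
        record.UnitPrice = reader.ReadDecimal();
        bool hasName = reader.ReadBoolean();
        record.ProductName = hasName ? reader.ReadString() : null;
        return record;
    }

    // Writes all records to memory, reads them back, and verifies that
    // Read consumed every byte that Write produced.
    public static ProductData[] RoundTrip(ProductData[] records)
    {
        using (var stream = new MemoryStream())
        {
            var writer = new BinaryWriter(stream);
            foreach (ProductData r in records)
                Write(writer, r);
            writer.Flush();

            stream.Position = 0;
            var reader = new BinaryReader(stream);
            var result = new ProductData[records.Length];
            for (int i = 0; i < result.Length; i++)
                result[i] = Read(reader);

            if (stream.Position != stream.Length)
                throw new InvalidDataException(
                    "Read consumed fewer bytes than Write produced.");
            return result;
        }
    }
}
```

If Read tries to consume more data than Write produced, the reader throws an EndOfStreamException. If it consumes less, the final position check fails. A check of this kind runs without a cluster, so it may catch a mismatch before the job is submitted.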

The LINQ to HPC local debug mode won’t always invoke custom serializers. The reason for this behavioral difference is that all records are stored in memory when using the local debug mode. Custom serializers won’t be called unless the type is present in an input or output file set. Therefore, failures that are due to a faulty custom serializer may not be seen when you run the query in the local debug mode and will only appear after the job is shipped to the cluster.

You should be careful not to confuse .NET custom serializers and HPC custom serializers, which are implementations of the IHpcSerializer interface. It is possible to use .NET serialization as part of the implementation of an HPC serializer. It is also possible to implement the IHpcSerializer interface in a way that does not make use of .NET serialization.