What is the Microsoft Data Prep SDK for .NET?

The Microsoft Data Prep SDK for .NET helps developers and data scientists explore, cleanse, and transform data for machine learning workflows in a .NET Core 2.1 environment. It is currently in preview and offers a subset of the functionality of the Python SDK. In the future, we will add functionality to bring the .NET SDK to parity with the Python SDK.

This SDK includes the following functionality:

  • Automatic file type detection for delimited text files. The SDK can automatically detect whether your data is in any of the supported file types. You don’t need to use special file readers for formats like CSV, TSV, text, etc., or to specify delimiter, header, or encoding parameters.
  • Summary statistics. You can generate summary statistics for a dataflow with a single line of code.

How does it differ?

The Microsoft Data Prep SDK offers an intelligent and scalable experience for essential data preparation scenarios, while maintaining interoperability with common data analysis libraries.

Key benefits of the SDK:

  • Cross-platform functionality. You can use the SDK in any environment that supports .NET Core 2.1, alongside familiar libraries. Write your code against a single SDK and run it on Windows, macOS, or Linux.
  • Ability to work with multiple large files that have different schemas.
  • Seamless integration with other Azure Machine Learning services. You can pass your prepared IDataView object directly to automated machine learning for training.

Prerequisites

  • Microsoft .NET Core 2.1
  • Visual Studio 2017 with the latest updates

Installation

Skip to the tutorial if you want to try out our sample code.

  1. Launch Visual Studio 2017.
  2. Create a new .NET Core 2.1 project.
  3. Right-click the project name in Solution Explorer.
  4. Click Manage NuGet Packages...
  5. Click Settings in the NuGet Package Manager and configure nuget.org as an available package source:
     Name: nuget.org
     Source: https://api.nuget.org/v3/index.json
  6. Select nuget.org as the Package source.
  7. Select Browse in the NuGet Package Manager window.
  8. Type "Microsoft.DataPrep" in the search box.
  9. Select the Include prerelease option.
  10. Click Install to install the latest Microsoft.DataPrep NuGet package.
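
If you prefer to keep the package source in a file rather than setting it through the Visual Studio dialog, the source from step 5 can also be declared in a NuGet.config file. The fragment below is a minimal sketch: it assumes the file is named NuGet.config and sits next to your solution or project file, and it declares only the nuget.org source named above. Installing the package itself still follows the remaining steps.

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- Sketch only: registers nuget.org as a package source (same name and URL as step 5). -->
<configuration>
  <packageSources>
    <add key="nuget.org" value="https://api.nuget.org/v3/index.json" />
  </packageSources>
</configuration>
```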

File Type Detection

Use the Reader.AutoReadFile() function to load your data without having to specify the file type. This function automatically recognizes and parses the file type.

 DataFlow dataFlow = Reader.AutoReadFile("./localpath/*.csv");

Note: Currently, Reader.AutoReadFile() supports only delimited text files in the Microsoft.DataPrep .NET SDK.

Summary statistics

Generate summary statistics for a dataflow with one line of code by calling the GetProfile() method.

dataFlow.GetProfile();

Calling this method on a dataflow object produces output like the following table.

[Image: summary-statistics]

Get Support

To get help or ask questions, please email: askamldataprep@microsoft.com

Next steps

For detailed examples and code for each preparation step, see the tutorial.