Load data from files and other sources

This how-to shows you how to load data for processing and training into ML.NET. The data is originally stored in files or other data sources such as databases, JSON, XML or in-memory collections.

Create the data model

ML.NET enables you to define data models via classes. For example, given the following input data:

Size (Sq. ft.), HistoricalPrice1 ($), HistoricalPrice2 ($), HistoricalPrice3 ($), Current Price ($)
700, 100000, 3000000, 250000, 500000
1000, 600000, 400000, 650000, 700000

Create a data model that represents the snippet below:

public class HousingData
{
    [LoadColumn(0)]
    public float Size { get; set; }

    [LoadColumn(1, 3)]
    [VectorType(3)]
    public float[] HistoricalPrices { get; set; }

    [LoadColumn(4)]
    [ColumnName("Label")]
    public float CurrentPrice { get; set; }
}

Annotating the data model with column attributes

Attributes give ML.NET more information about the data model and the data source.

The LoadColumn attribute specifies your properties' column indices.

Important

LoadColumn is only required when loading data from a file.

Load columns as:

  • Individual columns like Size and CurrentPrices in the HousingData class.
  • Multiple columns at a time in the form of a vector like HistoricalPrices in the HousingData class.

If you have a vector property, apply the VectorType attribute to the property in your data model. It's important to note that all of the elements in the vector need to be the same type. Keeping the columns separated allows for ease and flexibility of feature engineering, but for a very large number of columns, operating on the individual columns causes an impact on training speed.

ML.NET Operates through column names. If you want to change the name of a column to something other than the property name, use the ColumnName attribute. When creating in-memory objects, you still create objects using the property name. However, for data processing and building machine learning models, ML.NET overrides and references the property with the value provided in the ColumnName attribute.

Load data from a single file

To load data from a file use the LoadFromTextFile method along with the data model for the data to be loaded. Since separatorChar parameter is tab-delimited by default, change it for your data file as needed. If your file has a header, set the hasHeader parameter to true to ignore the first line in the file and begin to load data from the second line.

//Create MLContext
MLContext mlContext = new MLContext();

//Load Data
IDataView data = mlContext.Data.LoadFromTextFile<HousingData>("my-data-file.csv", separatorChar: ',', hasHeader: true);

Load data from multiple files

In the event that your data is stored in multiple files, as long as the data schema is the same, ML.NET allows you to load data from multiple files that are either in the same directory or multiple directories.

Load from files in a single directory

When all of your data files are in the same directory, use wildcards in the LoadFromTextFile method.

//Create MLContext
MLContext mlContext = new MLContext();

//Load Data File
IDataView data = mlContext.Data.LoadFromTextFile<HousingData>("Data/*", separatorChar: ',', hasHeader: true);

Load from files in multiple directories

To load data from multiple directories, use the CreateTextLoader method to create a TextLoader. Then, use the TextLoader.Load method and specify the individual file paths (wildcards can't be used).

//Create MLContext
MLContext mlContext = new MLContext();

// Create TextLoader
TextLoader textLoader = mlContext.Data.CreateTextLoader<HousingData>(separatorChar: ',', hasHeader: true);

// Load Data
IDataView data = textLoader.Load("DataFolder/SubFolder1/1.txt", "DataFolder/SubFolder2/1.txt");

Load data from a relational database

Note

DatabaseLoader is currently in preview. It can be used by referencing the Microsoft.ML.Experimental and System.Data.SqlClient NuGet packages.

ML.NET supports loading data from a variety of relational databases supported by System.Data that include SQL Server, Azure SQL Database, Oracle, SQLite, PostgreSQL, Progress, IBM DB2, and many more.

Given a database with a table named House and the following schema:

CREATE TABLE [House] (
    [HouseId] int NOT NULL IDENTITY,
    [Size] real NOT NULL,
    [Price] real NOT NULL
    CONSTRAINT [PK_House] PRIMARY KEY ([HouseId])
);

The data can be modeled by a class like HouseData.

public class HouseData
{
    public float Size { get; set; }

    public float Price { get; set; }
}

Then, inside of your application, create a DatabaseLoader.

MLContext mlContext = new MLContext();

DatabaseLoader loader = mlContext.Data.CreateDatabaseLoader<HouseData>();

Define your connection string as well as the SQL command to be executed on the database and create a DatabaseSource instance. This sample uses a LocalDB SQL Server database with a file path. However, DatabaseLoader supports any other valid connection string for databases on-premises and in the cloud.

string connectionString = @"Data Source=(LocalDB)\MSSQLLocalDB;AttachDbFilename=<YOUR-DB-FILEPATH>;Database=<YOUR-DB-NAME>;Integrated Security=True;Connect Timeout=30";

string sqlCommand = "SELECT Size,Price FROM House";

DatabaseSource dbSource = new DatabaseSource(SqlClientFactory.Instance,connectionString,sqlCommand);

Finally, use the Load method to load the data into an IDataView.

IDataView data = loader.Load(dbSource);

Load data from other sources

In addition to loading data stored in files, ML.NET supports loading data from sources that include but are not limited to:

  • In-memory collections
  • JSON/XML

Note that when working with streaming sources, ML.NET expects input to be in the form of an in-memory collection. Therefore, when working with sources like JSON/XML, make sure to format the data into an in-memory collection.

Given the following in-memory collection:

HousingData[] inMemoryCollection = new HousingData[]
{
    new HousingData
    {
        Size =700f,
        HistoricalPrices = new float[]
        {
            100000f, 3000000f, 250000f
        },
        CurrentPrice = 500000f
    },
    new HousingData
    {
        Size =1000f,
        HistoricalPrices = new float[]
        {
            600000f, 400000f, 650000f
        },
        CurrentPrice=700000f
    }
};

Load the in-memory collection into an IDataView with the LoadFromEnumerable method:

Important

LoadFromEnumerable assumes that the IEnumerable it loads from is thread-safe.

// Create MLContext
MLContext mlContext = new MLContext();

//Load Data
IDataView data = mlContext.Data.LoadFromEnumerable<HousingData>(inMemoryCollection);