Load data from files and other sources

Learn how to load data into ML.NET for processing and training using the API. The data is originally stored in files or in other data sources such as databases, JSON, XML, or in-memory collections.

If you're using Model Builder, see Load training data into Model Builder.

Create the data model

ML.NET enables you to define data models via classes. For example, given the following input data:

Size (Sq. ft.), HistoricalPrice1 ($), HistoricalPrice2 ($), HistoricalPrice3 ($), Current Price ($)
700, 100000, 3000000, 250000, 500000
1000, 600000, 400000, 650000, 700000

Create a data model that represents this data, as in the following snippet:

public class HousingData
{
    [LoadColumn(0)]
    public float Size { get; set; }

    [LoadColumn(1, 3)]
    [VectorType(3)]
    public float[] HistoricalPrices { get; set; }

    [LoadColumn(4)]
    [ColumnName("Label")]
    public float CurrentPrice { get; set; }
}

Annotating the data model with column attributes

Attributes give ML.NET more information about the data model and the data source.

The LoadColumn attribute specifies the column indices of your properties.

Important

LoadColumn is only required when loading data from a file.

Load columns as:

  • Individual columns, like Size and CurrentPrice in the HousingData class.
  • Multiple columns at a time in the form of a vector, like HistoricalPrices in the HousingData class.

If you have a vector property, apply the VectorType attribute to the property in your data model. All of the elements in the vector must be the same type. Keeping the columns separated makes feature engineering easier and more flexible, but for a very large number of columns, operating on individual columns slows training.

ML.NET operates through column names. If you want to change the name of a column to something other than the property name, use the ColumnName attribute. When creating in-memory objects, you still create objects using the property name. However, for data processing and building machine learning models, ML.NET refers to the property by the value provided in the ColumnName attribute instead of the property name.
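The effect of the ColumnName attribute can be checked by inspecting the schema of a loaded IDataView. The following is a minimal sketch, not part of the original sample: the SchemaExample class and its GetColumnNames method are hypothetical names introduced for illustration, and the data is loaded from a one-row in-memory array so that no file is needed.

```csharp
using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

public class HousingData
{
    [LoadColumn(0)]
    public float Size { get; set; }

    [LoadColumn(1, 3)]
    [VectorType(3)]
    public float[] HistoricalPrices { get; set; }

    [LoadColumn(4)]
    [ColumnName("Label")]
    public float CurrentPrice { get; set; }
}

// Hypothetical helper class, introduced only for this illustration.
public static class SchemaExample
{
    // Returns the column names ML.NET assigns to the HousingData properties.
    public static string[] GetColumnNames()
    {
        var mlContext = new MLContext();

        // Objects are still created through the property names.
        var rows = new[]
        {
            new HousingData
            {
                Size = 700f,
                HistoricalPrices = new[] { 100000f, 3000000f, 250000f },
                CurrentPrice = 500000f
            }
        };

        IDataView data = mlContext.Data.LoadFromEnumerable(rows);

        // The schema exposes the names ML.NET uses for data processing.
        return data.Schema.Select(column => column.Name).ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", GetColumnNames()));
    }
}
```

In the printed list, the CurrentPrice property should appear under the name Label, while Size and HistoricalPrices keep their property names.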

Load data from a single file

To load data from a file, use the LoadFromTextFile method along with the data model for the data to be loaded. The separatorChar parameter defaults to the tab character, so change it as needed for your data file. If your file has a header, set the hasHeader parameter to true to skip the first line of the file and begin loading data from the second line.

// Create MLContext
MLContext mlContext = new MLContext();

// Load Data
IDataView data = mlContext.Data.LoadFromTextFile<HousingData>("my-data-file.csv", separatorChar: ',', hasHeader: true);
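To see how the LoadColumn indices map file columns to properties, the loaded rows can be read back as objects. The sketch below is an illustration under stated assumptions: it writes the two sample rows to a temporary file (the file name, the shortened header text, and the TextFileExample helper are all hypothetical), loads them with LoadFromTextFile, and materializes them with CreateEnumerable.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

public class HousingData
{
    [LoadColumn(0)]
    public float Size { get; set; }

    [LoadColumn(1, 3)]
    [VectorType(3)]
    public float[] HistoricalPrices { get; set; }

    [LoadColumn(4)]
    [ColumnName("Label")]
    public float CurrentPrice { get; set; }
}

// Hypothetical helper class, introduced only for this illustration.
public static class TextFileExample
{
    public static List<HousingData> LoadSample()
    {
        // Assumption: a throwaway temp file stands in for "my-data-file.csv".
        string path = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName() + ".csv");
        File.WriteAllLines(path, new[]
        {
            "Size,HistoricalPrice1,HistoricalPrice2,HistoricalPrice3,CurrentPrice",
            "700,100000,3000000,250000,500000",
            "1000,600000,400000,650000,700000"
        });

        try
        {
            var mlContext = new MLContext();
            IDataView data = mlContext.Data.LoadFromTextFile<HousingData>(
                path, separatorChar: ',', hasHeader: true);

            // IDataView is evaluated lazily, so read all rows before the file is deleted.
            return mlContext.Data
                .CreateEnumerable<HousingData>(data, reuseRowObject: false)
                .ToList();
        }
        finally
        {
            File.Delete(path);
        }
    }

    public static void Main()
    {
        foreach (HousingData row in LoadSample())
        {
            Console.WriteLine($"{row.Size}: [{string.Join(", ", row.HistoricalPrices)}] -> {row.CurrentPrice}");
        }
    }
}
```

Columns 1 through 3 of each line end up in the HistoricalPrices vector, and column 4 ends up in CurrentPrice, which ML.NET exposes as the Label column.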

Load data from multiple files

If your data is stored in multiple files, ML.NET allows you to load it from files in the same directory or in multiple directories, as long as the data schema is the same.

Load from files in a single directory

When all of your data files are in the same directory, use wildcards in the LoadFromTextFile method.

// Create MLContext
MLContext mlContext = new MLContext();

// Load Data File
IDataView data = mlContext.Data.LoadFromTextFile<HousingData>("Data/*", separatorChar: ',', hasHeader: true);

Load from files in multiple directories

To load data from multiple directories, use the CreateTextLoader method to create a TextLoader. Then, use the TextLoader.Load method and specify the individual file paths (wildcards can't be used).

// Create MLContext
MLContext mlContext = new MLContext();

// Create TextLoader
TextLoader textLoader = mlContext.Data.CreateTextLoader<HousingData>(separatorChar: ',', hasHeader: true);

// Load Data
IDataView data = textLoader.Load("DataFolder/SubFolder1/1.txt", "DataFolder/SubFolder2/1.txt");
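As a rough check that rows from several files end up in one IDataView, the following sketch creates two temporary directories with one file each and loads both through a single TextLoader. The directory layout, file contents, and the MultiFileExample helper are assumptions made for illustration. The sample files have no header row, so hasHeader is set to false here.

```csharp
using System;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

public class HousingData
{
    [LoadColumn(0)]
    public float Size { get; set; }

    [LoadColumn(1, 3)]
    [VectorType(3)]
    public float[] HistoricalPrices { get; set; }

    [LoadColumn(4)]
    [ColumnName("Label")]
    public float CurrentPrice { get; set; }
}

// Hypothetical helper class, introduced only for this illustration.
public static class MultiFileExample
{
    // Loads one file from each of two directories and returns the total row count.
    public static int CountRows()
    {
        // Assumption: temp directories stand in for DataFolder/SubFolder1 and SubFolder2.
        string root = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        string file1 = Path.Combine(root, "SubFolder1", "1.txt");
        string file2 = Path.Combine(root, "SubFolder2", "1.txt");
        Directory.CreateDirectory(Path.GetDirectoryName(file1));
        Directory.CreateDirectory(Path.GetDirectoryName(file2));
        File.WriteAllText(file1, "700,100000,3000000,250000,500000" + Environment.NewLine);
        File.WriteAllText(file2, "1000,600000,400000,650000,700000" + Environment.NewLine);

        try
        {
            var mlContext = new MLContext();
            TextLoader textLoader = mlContext.Data.CreateTextLoader<HousingData>(
                separatorChar: ',', hasHeader: false);

            // Each path is listed explicitly; wildcards are not used here.
            IDataView data = textLoader.Load(file1, file2);

            // Enumerate the rows while the files still exist.
            return mlContext.Data
                .CreateEnumerable<HousingData>(data, reuseRowObject: false)
                .Count();
        }
        finally
        {
            Directory.Delete(root, recursive: true);
        }
    }

    public static void Main()
    {
        Console.WriteLine(CountRows());
    }
}
```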

Load data from a relational database

ML.NET supports loading data from a variety of relational databases supported by System.Data, including SQL Server, Azure SQL Database, Oracle, SQLite, PostgreSQL, Progress, IBM DB2, and many more.

Note

To use DatabaseLoader, reference the System.Data.SqlClient NuGet package.

Given a database with a table named House and the following schema:

CREATE TABLE [House] (
    [HouseId] INT NOT NULL IDENTITY,
    [Size] INT NOT NULL,
    [NumBed] INT NOT NULL,
    [Price] REAL NOT NULL,
    CONSTRAINT [PK_House] PRIMARY KEY ([HouseId])
);

The data can be modeled by a class like HouseData.

public class HouseData
{
    public float Size { get; set; }

    public float NumBed { get; set; }

    public float Price { get; set; }
}

Then, inside your application, create a DatabaseLoader.

MLContext mlContext = new MLContext();

DatabaseLoader loader = mlContext.Data.CreateDatabaseLoader<HouseData>();

Define your connection string and the SQL command to be executed on the database, and create a DatabaseSource instance. This sample uses a LocalDB SQL Server database with a file path. However, DatabaseLoader supports any other valid connection string for databases on-premises and in the cloud.

string connectionString = @"Data Source=(LocalDB)\MSSQLLocalDB;AttachDbFilename=<YOUR-DB-FILEPATH>;Database=<YOUR-DB-NAME>;Integrated Security=True;Connect Timeout=30";

string sqlCommand = "SELECT CAST(Size as REAL) as Size, CAST(NumBed as REAL) as NumBed, Price FROM House";

DatabaseSource dbSource = new DatabaseSource(SqlClientFactory.Instance, connectionString, sqlCommand);

Numerical data that is not of type Real must be converted to Real. The Real type is represented as a single-precision floating-point value, or Single, which is the input type expected by ML.NET algorithms. In this sample, the Size and NumBed columns are integers in the database, so the query converts them to Real with the built-in CAST function. Because the Price column is already of type Real, it is loaded as is.

Use the Load method to load the data into an IDataView.

IDataView data = loader.Load(dbSource);

Load data from other sources

In addition to loading data stored in files, ML.NET supports loading data from sources that include but are not limited to:

  • In-memory collections
  • JSON/XML

When working with streaming sources, ML.NET expects input to be in the form of an in-memory collection. Therefore, when working with sources like JSON/XML, make sure to format the data into an in-memory collection first.

Given the following in-memory collection:

HousingData[] inMemoryCollection = new HousingData[]
{
    new HousingData
    {
        Size = 700f,
        HistoricalPrices = new float[]
        {
            100000f, 3000000f, 250000f
        },
        CurrentPrice = 500000f
    },
    new HousingData
    {
        Size = 1000f,
        HistoricalPrices = new float[]
        {
            600000f, 400000f, 650000f
        },
        CurrentPrice = 700000f
    }
};

Load the in-memory collection into an IDataView with the LoadFromEnumerable method:

Important

LoadFromEnumerable assumes that the IEnumerable it loads from is thread-safe.

// Create MLContext
MLContext mlContext = new MLContext();

// Load Data
IDataView data = mlContext.Data.LoadFromEnumerable<HousingData>(inMemoryCollection);
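For a JSON source, one possible approach is to deserialize the document into a list first and then pass that list to LoadFromEnumerable. The sketch below assumes System.Text.Json as the parser and a JSON array whose property names match the HousingData property names exactly. Both are choices made for this illustration rather than ML.NET requirements, and the JsonExample helper is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;
using Microsoft.ML;
using Microsoft.ML.Data;

public class HousingData
{
    [LoadColumn(0)]
    public float Size { get; set; }

    [LoadColumn(1, 3)]
    [VectorType(3)]
    public float[] HistoricalPrices { get; set; }

    [LoadColumn(4)]
    [ColumnName("Label")]
    public float CurrentPrice { get; set; }
}

// Hypothetical helper class, introduced only for this illustration.
public static class JsonExample
{
    // Assumption: the JSON uses the C# property names (System.Text.Json is case-sensitive by default).
    public const string SampleJson = @"[
        { ""Size"": 700, ""HistoricalPrices"": [100000, 3000000, 250000], ""CurrentPrice"": 500000 },
        { ""Size"": 1000, ""HistoricalPrices"": [600000, 400000, 650000], ""CurrentPrice"": 700000 }
    ]";

    public static IDataView LoadFromJson(MLContext mlContext, string json)
    {
        // Step 1: turn the JSON document into an in-memory collection.
        List<HousingData> rows =
            JsonSerializer.Deserialize<List<HousingData>>(json) ?? new List<HousingData>();

        // Step 2: hand the collection to ML.NET.
        return mlContext.Data.LoadFromEnumerable(rows);
    }

    public static void Main()
    {
        var mlContext = new MLContext();
        IDataView data = LoadFromJson(mlContext, SampleJson);
        Console.WriteLine(string.Join(", ", data.Schema.Select(column => column.Name)));
    }
}
```

The same two-step pattern applies to XML or any other format: parse it with a library of your choice into a collection of data model objects, then load that collection.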

Next steps