Prepare data for building a model

Learn how to use ML.NET to prepare data for additional processing or for building a model.

Data is often unclean and sparse. ML.NET machine learning algorithms expect input or features to be in a single numerical vector. Similarly, the value to predict (the label) has to be encoded, especially when it is categorical data. One of the goals of data preparation is therefore to get the data into the format expected by ML.NET algorithms.

Filter data

Sometimes, not all data in a dataset is relevant for analysis. One way to remove irrelevant data is filtering. The DataOperationsCatalog contains a set of filter operations that take in an IDataView containing all of the data and return an IDataView containing only the data points of interest. Note that filter operations are not IEstimator or ITransformer instances like those in the TransformsCatalog, so they cannot be included as part of an EstimatorChain or TransformerChain data preparation pipeline.

Using the following input data, which is loaded into an IDataView:

HomeData[] homeDataList = new HomeData[]
{
    new HomeData
    {
        NumberOfBedrooms=1f,
        Price=100000f
    },
    new HomeData
    {
        NumberOfBedrooms=2f,
        Price=300000f
    },
    new HomeData
    {
        NumberOfBedrooms=6f,
        Price=600000f
    }
};

To filter data based on the value of a column, use the FilterRowsByColumn method.

// Apply filter
IDataView filteredData = mlContext.Data.FilterRowsByColumn(data, "Price", lowerBound: 200000, upperBound: 1000000);

The sample above keeps the rows in the dataset with a price between 200,000 and 1,000,000. Applying this filter returns only the last two rows and excludes the first row, because its price of 100,000 is outside the specified range.
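The snippets in this article assume an existing mlContext and an IDataView named data. The following is a minimal sketch of that surrounding setup, assuming the Microsoft.ML NuGet package is referenced. The HomeData class is not defined in this article, so its shape here (two float properties) is an assumption, as are the Program class and the printing loop.

```csharp
using System;
using Microsoft.ML;

// Assumed shape of the input type; the article does not show its definition.
public class HomeData
{
    public float NumberOfBedrooms { get; set; }
    public float Price { get; set; }
}

public static class Program
{
    public static void Main()
    {
        // MLContext is the entry point for all ML.NET operations.
        var mlContext = new MLContext();

        HomeData[] homeDataList =
        {
            new HomeData { NumberOfBedrooms = 1f, Price = 100000f },
            new HomeData { NumberOfBedrooms = 2f, Price = 300000f },
            new HomeData { NumberOfBedrooms = 6f, Price = 600000f }
        };

        // Load the in-memory collection into an IDataView.
        IDataView data = mlContext.Data.LoadFromEnumerable(homeDataList);

        // Keep only the rows whose Price falls within the given bounds.
        IDataView filteredData = mlContext.Data.FilterRowsByColumn(
            data, "Price", lowerBound: 200000, upperBound: 1000000);

        // Convert back to strongly typed objects to inspect the remaining rows.
        foreach (HomeData row in mlContext.Data.CreateEnumerable<HomeData>(
                     filteredData, reuseRowObject: false))
        {
            Console.WriteLine($"Bedrooms: {row.NumberOfBedrooms}, Price: {row.Price}");
        }
    }
}
```

Because the filter is applied directly through mlContext.Data rather than through an estimator, there is no Fit step here, which is consistent with filters not being part of an EstimatorChain.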

Replace missing values

Missing values are a common occurrence in datasets. One approach to dealing with missing values is to replace them with the default value for the given type, if there is one, or with another meaningful value such as the mean of the data.

Using the following input data, which is loaded into an IDataView:

HomeData[] homeDataList = new HomeData[]
{
    new HomeData
    {
        NumberOfBedrooms=1f,
        Price=100000f
    },
    new HomeData
    {
        NumberOfBedrooms=2f,
        Price=300000f
    },
    new HomeData
    {
        NumberOfBedrooms=6f,
        Price=float.NaN
    }
};

Notice that the last element in the list has a missing value for Price. To replace the missing values in the Price column, use the ReplaceMissingValues method.

Important

ReplaceMissingValues only works with numerical data.

// Define replacement estimator
var replacementEstimator = mlContext.Transforms.ReplaceMissingValues("Price", replacementMode: MissingValueReplacingEstimator.ReplacementMode.Mean);

// Fit data to estimator
// Fitting generates a transformer that applies the operations defined by the estimator
ITransformer replacementTransformer = replacementEstimator.Fit(data);

// Transform data
IDataView transformedData = replacementTransformer.Transform(data);

ML.NET supports various replacement modes. The sample above uses the Mean replacement mode, which fills in the missing value with that column's average value. As a result, the Price property of the last element in the data is filled in with 200,000, the average of 100,000 and 300,000.
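As a worked check of the Mean mode, assuming the average is taken over the non-missing values only (the NaN entry is excluded from the computation):

```latex
\[
\bar{x}_{\text{Price}} = \frac{100000 + 300000}{2} = 200000
\]
```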

Use normalizers

Normalization is a data pre-processing technique used to standardize features that are not on the same scale, which helps algorithms converge faster. For example, the ranges for values like age and income differ significantly: age is generally in the range 0-100, while income generally ranges from zero into the thousands. Visit the transforms page for a more detailed list and description of normalization transforms.

Min-Max normalization

Using the following input data, which is loaded into an IDataView:

HomeData[] homeDataList = new HomeData[]
{
    new HomeData
    {
        NumberOfBedrooms = 2f,
        Price = 200000f
    },
    new HomeData
    {
        NumberOfBedrooms = 1f,
        Price = 100000f
    }
};

Normalization can be applied to columns with single numerical values as well as to vectors. Normalize the data in the Price column using min-max normalization with the NormalizeMinMax method.

// Define min-max estimator
var minMaxEstimator = mlContext.Transforms.NormalizeMinMax("Price");

// Fit data to estimator
// Fitting generates a transformer that applies the operations defined by the estimator
ITransformer minMaxTransformer = minMaxEstimator.Fit(data);

// Transform data
IDataView transformedData = minMaxTransformer.Transform(data);

The original price values [200000,100000] are converted to [ 1, 0.5 ] using the MinMax normalization formula, which generates output values in the range of 0-1.
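The output [ 1, 0.5 ] differs from the textbook min-max formula, which would map the smallest value to 0. A likely explanation, assuming the fixZero parameter of NormalizeMinMax is left at its default of true, is that values are scaled by the largest absolute value so that zero stays at zero:

```latex
\[
x' = \frac{x}{\max\left(|x_{\min}|,\ |x_{\max}|\right)}
\qquad\Rightarrow\qquad
\frac{200000}{200000} = 1, \quad \frac{100000}{200000} = 0.5
\]
% For comparison, the textbook form x' = (x - x_min) / (x_max - x_min)
% would map the same two values to [1, 0].
```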

Binning

Binning converts continuous values into a discrete representation of the input. For example, suppose one of your features is age. Instead of using the actual age value, binning creates ranges for that value. 0-18 could be one bin, 19-35 another, and so on.

Using the following input data, which is loaded into an IDataView:

HomeData[] homeDataList = new HomeData[]
{
    new HomeData
    {
        NumberOfBedrooms=1f,
        Price=100000f
    },
    new HomeData
    {
        NumberOfBedrooms=2f,
        Price=300000f
    },
    new HomeData
    {
        NumberOfBedrooms=6f,
        Price=600000f
    }
};

Normalize the data into bins using the NormalizeBinning method. The maximumBinCount parameter lets you specify the number of bins needed to classify your data. In this example, data will be put into two bins.

// Define binning estimator
var binningEstimator = mlContext.Transforms.NormalizeBinning("Price", maximumBinCount: 2);

// Fit data to estimator
// Fitting generates a transformer that applies the operations defined by the estimator
var binningTransformer = binningEstimator.Fit(data);

// Transform Data
IDataView transformedData = binningTransformer.Transform(data);

Binning creates bin bounds of [0,200000,Infinity]. The resulting bins are therefore [0,1,1], because the first observation is between 0 and 200,000 and the others are greater than 200,000 but less than infinity.
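The mapping from price to bin can be written out as a piecewise function. This is a sketch based only on the bounds stated above; how a value exactly equal to a boundary is assigned is not specified here and is left open:

```latex
\[
\operatorname{bin}(x) =
\begin{cases}
0, & 0 \le x < 200000 \\
1, & 200000 < x < \infty
\end{cases}
\qquad
\operatorname{bin}(100000) = 0,\quad
\operatorname{bin}(300000) = 1,\quad
\operatorname{bin}(600000) = 1
\]
```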

Work with categorical data

Non-numeric categorical data needs to be converted to a number before being used to build a machine learning model.

Using the following input data, which is loaded into an IDataView:

CarData[] cars = new CarData[]
{
    new CarData
    {
        Color="Red",
        VehicleType="SUV"
    },
    new CarData
    {
        Color="Blue",
        VehicleType="Sedan"
    },
    new CarData
    {
        Color="Black",
        VehicleType="SUV"
    }
};

The categorical VehicleType property can be converted into a number using the OneHotEncoding method.

// Define categorical transform estimator
var categoricalEstimator = mlContext.Transforms.Categorical.OneHotEncoding("VehicleType");

// Fit data to estimator
// Fitting generates a transformer that applies the operations defined by the estimator
ITransformer categoricalTransformer = categoricalEstimator.Fit(data);

// Transform Data
IDataView transformedData = categoricalTransformer.Transform(data);

The resulting transform converts the text value of VehicleType to a number. The entries in the VehicleType column become the following when the transform is applied:

[
    1, // SUV
    2, // Sedan
    1 // SUV
]
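The numbers above can be read as category keys assigned in order of first appearance. As a sketch of what one-hot encoding means for these two categories, assuming the indicator-vector output kind is used, each key corresponds to a vector with a single 1 in a unique position:

```latex
\[
\text{SUV} \mapsto 1 \mapsto (1,\ 0), \qquad
\text{Sedan} \mapsto 2 \mapsto (0,\ 1)
\]
```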

Work with text data

Text data needs to be transformed into numbers before it can be used to build a machine learning model. Visit the transforms page for a more detailed list and description of text transforms.

Using data like the following, which has been loaded into an IDataView:

ReviewData[] reviews = new ReviewData[]
{
    new ReviewData
    {
        Description="This is a good product",
        Rating=4.7f
    },
    new ReviewData
    {
        Description="This is a bad product",
        Rating=2.3f
    }
};

The simplest way to convert text to a numerical vector representation is to use the FeaturizeText method. The FeaturizeText transform applies a series of transformations to the input text column, resulting in a numerical vector that represents the lp-normalized word and character n-grams.

// Define text transform estimator
var textEstimator = mlContext.Transforms.Text.FeaturizeText("Description");

// Fit data to estimator
// Fitting generates a transformer that applies the operations defined by the estimator
ITransformer textTransformer = textEstimator.Fit(data);

// Transform data
IDataView transformedData = textTransformer.Transform(data);

The resulting transform converts the text values in the Description column to a numerical vector that looks similar to the output below:

[ 0.2041241, 0.2041241, 0.2041241, 0.4082483, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0.2041241, 0, 0, 0, 0, 0.4472136, 0.4472136, 0.4472136, 0.4472136, 0.4472136, 0 ]

Combine complex text processing steps into an EstimatorChain to remove noise and, where needed, reduce the amount of processing resources required.

// Define text transform estimator
var textEstimator = mlContext.Transforms.Text.NormalizeText("Description")
    .Append(mlContext.Transforms.Text.TokenizeIntoWords("Description"))
    .Append(mlContext.Transforms.Text.RemoveDefaultStopWords("Description"))
    .Append(mlContext.Transforms.Conversion.MapValueToKey("Description"))
    .Append(mlContext.Transforms.Text.ProduceNgrams("Description"))
    .Append(mlContext.Transforms.NormalizeLpNorm("Description"));

textEstimator contains a subset of the operations performed by the FeaturizeText method. The benefit of a more complex pipeline is control and visibility over the transformations applied to the data.

Using the first entry as an example, the following is a detailed description of the results produced by the transformation steps defined by textEstimator:

Original Text: This is a good product

| Transform | Description | Result |
| --- | --- | --- |
| 1. NormalizeText | Converts all letters to lowercase by default | this is a good product |
| 2. TokenizeIntoWords | Splits the string into individual words | ["this","is","a","good","product"] |
| 3. RemoveDefaultStopWords | Removes stop words such as "is" and "a" | ["good","product"] |
| 4. MapValueToKey | Maps the values to keys (categories) based on the input data | [1,2] |
| 5. ProduceNgrams | Transforms text into sequences of consecutive words | [1,1,1,0,0] |
| 6. NormalizeLpNorm | Scales inputs by their lp-norm | [ 0.577350529, 0.577350529, 0.577350529, 0, 0 ] |
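The final step can be checked by hand. Assuming the L2 norm (p = 2), which matches the numbers in the last row, the n-gram vector is divided by its Euclidean length:

```latex
\[
v = (1, 1, 1, 0, 0), \qquad
\lVert v \rVert_2 = \sqrt{1^2 + 1^2 + 1^2} = \sqrt{3} \approx 1.7320508, \qquad
\frac{v}{\lVert v \rVert_2} \approx (0.5773503,\ 0.5773503,\ 0.5773503,\ 0,\ 0)
\]
```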