Tutorial: Detect anomalies in product sales with ML.NET

Learn how to build an anomaly detection application for product sales data. This tutorial creates a .NET Core console application using C# in Visual Studio.

In this tutorial, you learn how to:

  • Load the data
  • Create a transform for spike anomaly detection
  • Detect spike anomalies with the transform
  • Create a transform for change point anomaly detection
  • Detect change point anomalies with the transform

You can find the source code for this tutorial at the dotnet/samples repository.

Prerequisites

Note

The data format in product-sales.csv is based on the dataset "Shampoo Sales Over a Three Year Period", originally sourced from DataMarket and provided by the Time Series Data Library (TSDL), created by Rob Hyndman. The "Shampoo Sales Over a Three Year Period" dataset is licensed under the DataMarket Default Open License.

Create a console application

  1. Create a .NET Core Console Application called "ProductSalesAnomalyDetection".

  2. Create a directory named Data in your project to save your data set files.

  3. Install the Microsoft.ML NuGet Package:

    In Solution Explorer, right-click on your project and select Manage NuGet Packages. Choose "nuget.org" as the Package source, select the Browse tab, search for Microsoft.ML, and select the Install button. Select the OK button on the Preview Changes dialog and then select the I Accept button on the License Acceptance dialog if you agree with the license terms for the packages listed. Repeat these steps for Microsoft.ML.TimeSeries.

  4. Add the following using statements at the top of your Program.cs file:

    using System;
    using System.IO;
    using Microsoft.ML;
    using System.Collections.Generic;
    

Download your data

  1. Download the dataset and save it to the Data folder you previously created:

    • Right-click product-sales.csv and select "Save Link (or Target) As..."

      Make sure you either save the *.csv file to the Data folder, or after you save it elsewhere, move the *.csv file to the Data folder.

  2. In Solution Explorer, right-click the *.csv file and select Properties. Under Advanced, change the value of Copy to Output Directory to Copy if newer.

The following table is a data preview from your *.csv file:

Month     ProductSales
1-Jan     271
2-Jan     150.9
.....     .....
1-Feb     199.3
.....     .....

Create classes and define paths

Next, define your input and prediction class data structures.

Add a new class to your project:

  1. In Solution Explorer, right-click the project, and then select Add > New Item.

  2. In the Add New Item dialog box, select Class and change the Name field to ProductSalesData.cs. Then, select the Add button.

    The ProductSalesData.cs file opens in the code editor.

  3. Add the following using statement to the top of ProductSalesData.cs:

    using Microsoft.ML.Data;
    
  4. Remove the existing class definition and add the following code, which has two classes, ProductSalesData and ProductSalesPrediction, to the ProductSalesData.cs file:

    public class ProductSalesData
    {
        [LoadColumn(0)]
        public string Month;
    
        [LoadColumn(1)]
        public float numSales;
    }
    
    public class ProductSalesPrediction
    {
        //vector to hold alert,score,p-value values
        [VectorType(3)]
        public double[] Prediction { get; set; }
    }
    

    ProductSalesData specifies an input data class. The LoadColumn attribute specifies which columns (by column index) in the dataset should be loaded.

    ProductSalesPrediction specifies the prediction data class. For anomaly detection, the prediction consists of an alert to indicate whether there is an anomaly, a raw score, and a p-value. The closer the p-value is to 0, the more likely an anomaly has occurred.

  5. Create two global fields to hold the path to the downloaded dataset file and the number of records it contains:

    • _dataPath has the path to the dataset used to train the model.
    • _docsize has the number of records in the dataset file. You'll use _docsize to calculate pvalueHistoryLength.
  6. Add the following code to the line right above the Main method to specify those fields:

    static readonly string _dataPath = Path.Combine(Environment.CurrentDirectory, "Data", "product-sales.csv");
    //assign the Number of records in dataset file to constant variable
    const int _docsize = 36;
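
The record count above is hard-coded for the 36-row sample file. As a minimal sketch, assuming you would rather derive the count from the file itself, the hypothetical helper below (DatasetInfo.CountRecords is not part of the tutorial or of ML.NET) counts the non-blank lines and subtracts the header row. It also fails early with a clear message if the file was not copied to the output directory.

```csharp
using System;
using System.IO;
using System.Linq;

public static class DatasetInfo
{
    // Hypothetical helper (an assumption, not part of the tutorial):
    // counts data rows in a CSV file, ignoring blank lines and,
    // when hasHeader is true, one header row.
    public static int CountRecords(string csvPath, bool hasHeader = true)
    {
        if (!File.Exists(csvPath))
        {
            throw new FileNotFoundException(
                "Dataset file not found. Check the Copy to Output Directory setting.",
                csvPath);
        }

        // File.ReadLines streams the file line by line instead of loading it all at once.
        int nonEmptyLines = File.ReadLines(csvPath)
                                .Count(line => !string.IsNullOrWhiteSpace(line));

        return hasHeader ? Math.Max(0, nonEmptyLines - 1) : nonEmptyLines;
    }
}
```

Because _docsize is declared const in the tutorial, a computed value would have to be stored in a local variable or a static readonly field instead, for example int docSize = DatasetInfo.CountRecords(_dataPath);. Later steps pass docSize / 4 to the detectors; with 36 records, C# integer division gives 9.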
    

Initialize variables in Main

  1. Replace the Console.WriteLine("Hello World!") line in the Main method with the following code to declare and initialize the mlContext variable:

    MLContext mlContext = new MLContext();
    

    The MLContext class is a starting point for all ML.NET operations, and initializing mlContext creates a new ML.NET environment that can be shared across the model creation workflow objects. It's similar, conceptually, to DbContext in Entity Framework.

Load the data

Data in ML.NET is represented by the IDataView interface. IDataView is a flexible, efficient way of describing tabular data (numeric and text). Data can be loaded from a text file or from other sources (for example, a SQL database or log files) to an IDataView object.

  1. Add the following code as the next line of the Main() method:

    IDataView dataView = mlContext.Data.LoadFromTextFile<ProductSalesData>(path: _dataPath, hasHeader: true, separatorChar: ',');
    

    LoadFromTextFile() defines the data schema and reads in the file. It takes in the data path variable and returns an IDataView.

Time series anomaly detection

Anomaly detection flags unexpected or unusual events or behaviors. It gives clues where to look for problems and helps you answer the question "Is this weird?".

Example of "Is this weird?" anomaly detection.

Anomaly detection is the process of detecting time-series data outliers: points on a given input time series where the behavior isn't what was expected, or is "weird".

Anomaly detection can be useful in lots of ways. For instance:

If you have a car, you might want to know: Is this oil gauge reading normal, or do I have a leak? If you're monitoring power consumption, you'd want to know: Is there an outage?

There are two types of time series anomalies that can be detected:

  • Spikes indicate temporary bursts of anomalous behavior in the system.

  • Change points indicate the beginning of persistent changes over time in the system.

In ML.NET, the IID Spike Detection and IID Change Point Detection algorithms are suited for independent and identically distributed datasets.

Unlike the models in the other tutorials, the time series anomaly detector transforms operate directly on input data. The IEstimator.Fit() method does not need training data to produce the transform. It does need the data schema though, which is provided by a data view generated from an empty list of ProductSalesData.

You'll analyze the same product sales data to detect spikes and change points. The process of building and training the model is the same for spike detection and change point detection; the main difference is the specific detection algorithm used.

Spike detection

The goal of spike detection is to identify sudden yet temporary bursts that significantly differ from the majority of the time series data values. It's important to detect these suspicious rare items, events, or observations in a timely manner so that they can be minimized. The following approach can be used to detect a variety of anomalies such as outages, cyber-attacks, or viral web content. The following image is an example of spikes in a time series dataset:

Screenshot that shows two spike detections.

Add the CreateEmptyDataView() method

Add the following method to Program.cs:

static IDataView CreateEmptyDataView(MLContext mlContext) {
    // Create empty DataView. We just need the schema to call Fit() for the time series transforms
    IEnumerable<ProductSalesData> enumerableData = new List<ProductSalesData>();
    return mlContext.Data.LoadFromEnumerable(enumerableData);
}

CreateEmptyDataView() produces an empty data view object with the correct schema to be used as input to the IEstimator.Fit() method.

Create the DetectSpike() method

The DetectSpike() method:

  • Creates the transform from the estimator.
  • Detects spikes based on historical sales data.
  • Displays the results.
  1. Create the DetectSpike() method, just after the Main() method, using the following code:

    static void DetectSpike(MLContext mlContext, int docSize, IDataView productSales)
    {
    
    }
    
  2. Use the IidSpikeEstimator to train the model for spike detection. Add it to the DetectSpike() method with the following code:

    var iidSpikeEstimator = mlContext.Transforms.DetectIidSpike(outputColumnName: nameof(ProductSalesPrediction.Prediction), inputColumnName: nameof(ProductSalesData.numSales), confidence: 95, pvalueHistoryLength: docSize / 4);
    
  3. Create the spike detection transform by adding the following as the next line of code in the DetectSpike() method:

    ITransformer iidSpikeTransform = iidSpikeEstimator.Fit(CreateEmptyDataView(mlContext));
    
  4. Add the following line of code to transform the productSales data as the next line in the DetectSpike() method:

    IDataView transformedData = iidSpikeTransform.Transform(productSales);
    

    The previous code uses the Transform() method to make predictions for multiple input rows of a dataset.

  5. Convert your transformedData into a strongly typed IEnumerable for easier display using the CreateEnumerable() method with the following code:

    var predictions = mlContext.Data.CreateEnumerable<ProductSalesPrediction>(transformedData, reuseRowObject: false);
    
  6. Create a display header line using the following Console.WriteLine() code:

    Console.WriteLine("Alert\tScore\tP-Value");
    

    You'll display the following information in your spike detection results:

    • Alert indicates a spike alert for a given data point.
    • Score is the ProductSales value for a given data point in the dataset.
    • P-Value: the "P" stands for probability. The closer the p-value is to 0, the more likely the data point is an anomaly.
  7. Use the following code to iterate through the predictions IEnumerable and display the results:

    foreach (var p in predictions)
    {
        var results = $"{p.Prediction[0]}\t{p.Prediction[1]:f2}\t{p.Prediction[2]:F2}";
    
        if (p.Prediction[0] == 1)
        {
            results += " <-- Spike detected";
        }
    
        Console.WriteLine(results);
    }
    Console.WriteLine("");
    
  8. Add the call to the DetectSpike() method in the Main() method:

    DetectSpike(mlContext, _docsize, dataView);
    

Spike detection results

Your results should be similar to the following. During processing, messages are displayed. You may see warnings or processing messages. Some of the messages have been removed from the following results for clarity.

Detect temporary changes in pattern
=============== Training the model ===============
=============== End of training process ===============
Alert   Score   P-Value
0       271.00  0.50
0       150.90  0.00
0       188.10  0.41
0       124.30  0.13
0       185.30  0.47
0       173.50  0.47
0       236.80  0.19
0       229.50  0.27
0       197.80  0.48
0       127.90  0.13
1       341.50  0.00 <-- Spike detected
0       190.90  0.48
0       199.30  0.48
0       154.50  0.24
0       215.10  0.42
0       278.30  0.19
0       196.40  0.43
0       292.00  0.17
0       231.00  0.45
0       308.60  0.18
0       294.90  0.19
1       426.60  0.00 <-- Spike detected
0       269.50  0.47
0       347.30  0.21
0       344.70  0.27
0       445.40  0.06
0       320.90  0.49
0       444.30  0.12
0       406.30  0.29
0       442.40  0.21
1       580.50  0.00 <-- Spike detected
0       412.60  0.45
1       687.00  0.01 <-- Spike detected
0       480.30  0.40
0       586.30  0.20
0       651.90  0.14
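
The display loop in DetectSpike() builds each row with an interpolated string, which formats numbers using the current culture, so on a machine whose culture uses a decimal comma the rows would not match the output above. As a minimal sketch, assuming you want culture-independent rows, the hypothetical helper below (ResultFormatter.FormatSpikeRow is not part of the tutorial) formats one prediction vector of alert, score, and p-value with the invariant culture.

```csharp
using System;
using System.Globalization;

public static class ResultFormatter
{
    // Hypothetical helper (an assumption, not part of the tutorial):
    // formats one spike prediction vector laid out as { alert, score, p-value }.
    public static string FormatSpikeRow(double[] prediction)
    {
        if (prediction == null || prediction.Length < 3)
        {
            throw new ArgumentException(
                "Expected at least three values: alert, score, p-value.",
                nameof(prediction));
        }

        // InvariantCulture always uses '.' as the decimal separator.
        string row = string.Format(
            CultureInfo.InvariantCulture,
            "{0}\t{1:F2}\t{2:F2}",
            prediction[0], prediction[1], prediction[2]);

        // An alert value of 1 marks the data point as a detected spike.
        if (prediction[0] == 1)
        {
            row += " <-- Spike detected";
        }

        return row;
    }
}
```

Inside the loop, Console.WriteLine(ResultFormatter.FormatSpikeRow(p.Prediction)); would then replace the interpolated string and the if block.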

Change point detection

Change points are persistent changes in the distribution of values in a time series event stream, such as level changes and trends. These persistent changes last much longer than spikes and could indicate catastrophic events. Change points are not usually visible to the naked eye, but can be detected in your data using approaches such as the following method. The following image is an example of a change point detection:

Screenshot that shows a change point detection.

Create the DetectChangepoint() method

The DetectChangepoint() method executes the following tasks:

  • Creates the transform from the estimator.
  • Detects change points based on historical sales data.
  • Displays the results.
  1. Create the DetectChangepoint() method, just after the Main() method, using the following code:

    static void DetectChangepoint(MLContext mlContext, int docSize, IDataView productSales)
    {
    
    }
    
  2. Create the iidChangePointEstimator in the DetectChangepoint() method with the following code:

    var iidChangePointEstimator = mlContext.Transforms.DetectIidChangePoint(outputColumnName: nameof(ProductSalesPrediction.Prediction), inputColumnName: nameof(ProductSalesData.numSales), confidence: 95, changeHistoryLength: docSize / 4);
    
  3. As you did previously, create the transform from the estimator by adding the following line of code in the DetectChangepoint() method:

    var iidChangePointTransform = iidChangePointEstimator.Fit(CreateEmptyDataView(mlContext));
    
  4. Use the Transform() method to transform the data by adding the following code to DetectChangepoint():

    IDataView transformedData = iidChangePointTransform.Transform(productSales);
    
  5. As you did previously, convert your transformedData into a strongly typed IEnumerable for easier display using the CreateEnumerable() method with the following code:

    var predictions = mlContext.Data.CreateEnumerable<ProductSalesPrediction>(transformedData, reuseRowObject: false);
    
  6. Create a display header with the following code as the next line in the DetectChangepoint() method:

    Console.WriteLine("Alert\tScore\tP-Value\tMartingale value");
    

    You'll display the following information in your change point detection results:

    • Alert indicates a change point alert for a given data point.
    • Score is the ProductSales value for a given data point in the dataset.
    • P-Value: the "P" stands for probability. The closer the p-value is to 0, the more likely the data point is an anomaly.
    • Martingale value is used to identify how "weird" a data point is, based on the sequence of p-values.
  7. Iterate through the predictions IEnumerable and display the results with the following code:

    foreach (var p in predictions)
    {
        var results = $"{p.Prediction[0]}\t{p.Prediction[1]:f2}\t{p.Prediction[2]:F2}\t{p.Prediction[3]:F2}";
    
        if (p.Prediction[0] == 1)
        {
            results += " <-- alert is on, predicted changepoint";
        }
        Console.WriteLine(results);
    }
    Console.WriteLine("");
    
  8. Add the following call to the DetectChangepoint() method in the Main() method:

    DetectChangepoint(mlContext, _docsize, dataView);
    

Change point detection results

Your results should be similar to the following. During processing, messages are displayed. You may see warnings or processing messages. Some messages have been removed from the following results for clarity.

Detect Persistent changes in pattern
=============== Training the model Using Change Point Detection Algorithm===============
=============== End of training process ===============
Alert   Score   P-Value Martingale value
0       271.00  0.50    0.00
0       150.90  0.00    2.33
0       188.10  0.41    2.80
0       124.30  0.13    9.16
0       185.30  0.47    9.77
0       173.50  0.47    10.41
0       236.80  0.19    24.46
0       229.50  0.27    42.38
1       197.80  0.48    44.23 <-- alert is on, predicted changepoint
0       127.90  0.13    145.25
0       341.50  0.00    0.01
0       190.90  0.48    0.01
0       199.30  0.48    0.00
0       154.50  0.24    0.00
0       215.10  0.42    0.00
0       278.30  0.19    0.00
0       196.40  0.43    0.00
0       292.00  0.17    0.01
0       231.00  0.45    0.00
0       308.60  0.18    0.00
0       294.90  0.19    0.00
0       426.60  0.00    0.00
0       269.50  0.47    0.00
0       347.30  0.21    0.00
0       344.70  0.27    0.00
0       445.40  0.06    0.02
0       320.90  0.49    0.01
0       444.30  0.12    0.02
0       406.30  0.29    0.01
0       442.40  0.21    0.01
0       580.50  0.00    0.01
0       412.60  0.45    0.01
0       687.00  0.01    0.12
0       480.30  0.40    0.08
0       586.30  0.20    0.03
0       651.90  0.14    0.09
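
To act on detections in code rather than read them off the console, one option is to collect the positions of the rows whose alert value is 1. The following is a minimal sketch under that assumption; AlertScanner.FindAlertIndices is a hypothetical helper, not part of the tutorial or of ML.NET, and it only relies on the alert being the first element of each prediction vector, as in both detection methods above.

```csharp
using System;
using System.Collections.Generic;

public static class AlertScanner
{
    // Hypothetical helper (an assumption, not part of the tutorial):
    // returns the zero-based positions of prediction vectors whose
    // first element (the alert) equals 1.
    public static List<int> FindAlertIndices(IEnumerable<double[]> predictions)
    {
        if (predictions == null)
        {
            throw new ArgumentNullException(nameof(predictions));
        }

        var indices = new List<int>();
        int index = 0;

        foreach (double[] prediction in predictions)
        {
            // Skip null or empty vectors instead of failing on them.
            if (prediction != null && prediction.Length > 0 && prediction[0] == 1)
            {
                indices.Add(index);
            }

            index++;
        }

        return indices;
    }
}
```

With using System.Linq; added, it could be called as AlertScanner.FindAlertIndices(predictions.Select(p => p.Prediction)). For the sample output above, the only row with an alert of 1 is the ninth (score 197.80), which corresponds to index 8.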

Congratulations! You've now successfully built machine learning models for detecting spikes and change point anomalies in sales data.

You can find the source code for this tutorial at the dotnet/samples repository.

In this tutorial, you learned how to:

  • Load the data
  • Train the model for spike anomaly detection
  • Detect spike anomalies with the trained model
  • Train the model for change point anomaly detection
  • Detect change point anomalies with the trained model

Next steps

Check out the Machine Learning samples GitHub repository to explore a Power Consumption Anomaly Detection sample.