Tutorial: Forecast bike rental service demand with time series analysis and ML.NET

Learn how to forecast demand for a bike rental service using univariate time series analysis on data stored in a SQL Server database with ML.NET.

In this tutorial, you learn how to:

  • Understand the problem
  • Load data from a database
  • Create a forecasting model
  • Evaluate a forecasting model
  • Save a forecasting model
  • Use a forecasting model

Prerequisites

Time series forecasting sample overview

This sample is a C# .NET Core console application that forecasts demand for bike rentals using a univariate time series analysis algorithm known as Singular Spectrum Analysis. The code for this sample can be found in the dotnet/machinelearning-samples repository on GitHub.

Understand the problem

Inventory management plays a key role in running an efficient operation. Having too much of a product in stock means unsold products sitting on the shelves, not generating any revenue. Having too little leads to lost sales and customers purchasing from competitors. Therefore, the constant question is: what is the optimal amount of inventory to keep on hand? Time-series analysis helps answer this question by looking at historical data, identifying patterns, and using this information to forecast values some time in the future.

The technique for analyzing data used in this tutorial is univariate time-series analysis. Univariate time-series analysis looks at a single numerical observation over a period of time at specific intervals, such as monthly sales.

The algorithm used in this tutorial is Singular Spectrum Analysis (SSA). SSA works by decomposing a time series into a set of principal components. These components can be interpreted as the parts of a signal that correspond to trends, noise, seasonality, and many other factors. Then, these components are reconstructed and used to forecast values some time in the future.
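
To make the decomposition step more concrete, the following minimal sketch shows only the first stage of SSA, usually called embedding: the series is arranged into a trajectory matrix whose columns are overlapping windows of the series. The class name, method name, and the toy values are assumptions made for illustration. This is not ML.NET's internal implementation, and the later stages (decomposing the matrix, grouping components, and reconstructing the series) are not shown.

```csharp
using System;

public static class SsaEmbeddingSketch
{
    // Builds the L x K trajectory matrix, where L = windowSize and K = N - L + 1.
    // Column j holds the window series[j], series[j + 1], ..., series[j + L - 1].
    public static double[,] BuildTrajectoryMatrix(double[] series, int windowSize)
    {
        if (series == null)
            throw new ArgumentNullException(nameof(series));
        if (windowSize < 1 || windowSize > series.Length)
            throw new ArgumentOutOfRangeException(nameof(windowSize));

        int columns = series.Length - windowSize + 1;
        var matrix = new double[windowSize, columns];

        for (int row = 0; row < windowSize; row++)
        {
            for (int col = 0; col < columns; col++)
            {
                // Every anti-diagonal of the matrix repeats one value of the series.
                matrix[row, col] = series[row + col];
            }
        }

        return matrix;
    }

    public static void Main()
    {
        // Toy values chosen for illustration; they are not taken from the dataset.
        double[] series = { 10, 12, 9, 14, 11 };
        double[,] matrix = BuildTrajectoryMatrix(series, windowSize: 3);

        for (int row = 0; row < matrix.GetLength(0); row++)
        {
            for (int col = 0; col < matrix.GetLength(1); col++)
            {
                Console.Write(matrix[row, col] + " ");
            }
            Console.WriteLine();
        }
    }
}
```

With five observations and a window of three, the matrix has three rows and three columns. A larger window captures longer patterns but leaves fewer columns to learn from.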

Create console application

  1. Create a new C# .NET Core console application called "BikeDemandForecasting".
  2. Install the Microsoft.ML version 1.4.0 NuGet package:
    1. In Solution Explorer, right-click your project and select Manage NuGet Packages.
    2. Choose "nuget.org" as the Package source, select the Browse tab, and search for Microsoft.ML.
    3. Check the Include prerelease checkbox.
    4. Select the Install button.
    5. Select the OK button on the Preview Changes dialog, and then select the I Accept button on the License Acceptance dialog if you agree with the license terms for the packages listed.
    6. Repeat these steps for System.Data.SqlClient version 4.7.0 and Microsoft.ML.TimeSeries version 1.4.0.

Prepare and understand the data

  1. Create a directory called Data.
  2. Download the DailyDemand.mdf database file and save it to the Data directory.

Note

The data used in this tutorial comes from the UCI Bike Sharing Dataset. Fanaee-T, Hadi, and Gama, Joao, 'Event labeling combining ensemble detectors and background knowledge', Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg, Web Link.

The original dataset contains several columns corresponding to seasonality and weather. For brevity, and because the algorithm used in this tutorial only requires the values from a single numerical column, the original dataset has been condensed to include only the following columns:

  • dteday: The date of the observation.
  • year: The encoded year of the observation (0=2011, 1=2012).
  • cnt: The total number of bike rentals for that day.

The original dataset is mapped to a database table with the following schema in a SQL Server database.

CREATE TABLE [Rentals] (
    [RentalDate] DATE NOT NULL,
    [Year] INT NOT NULL,
    [TotalRentals] INT NOT NULL
);

The following is a sample of the data:

RentalDate  Year  TotalRentals
1/1/2011    0     985
1/2/2011    0     801
1/3/2011    0     1349

Create input and output classes

  1. Open the Program.cs file and replace the existing using statements with the following:

    using System;
    using System.Collections.Generic;
    using System.Data.SqlClient;
    using System.IO;
    using System.Linq;
    using Microsoft.ML;
    using Microsoft.ML.Data;
    using Microsoft.ML.Transforms.TimeSeries;
    
  2. Create the ModelInput class. Below the Program class, add the following code.

    public class ModelInput
    {
        public DateTime RentalDate { get; set; }
    
        public float Year { get; set; }
    
        public float TotalRentals { get; set; }
    }
    

    The ModelInput class contains the following columns:

    • RentalDate: The date of the observation.
    • Year: The encoded year of the observation (0=2011, 1=2012).
    • TotalRentals: The total number of bike rentals for that day.
  3. Create the ModelOutput class below the newly created ModelInput class.

    public class ModelOutput
    {
        public float[] ForecastedRentals { get; set; }
    
        public float[] LowerBoundRentals { get; set; }
    
        public float[] UpperBoundRentals { get; set; }
    }
    

    The ModelOutput class contains the following columns:

    • ForecastedRentals: The predicted values for the forecasted period.
    • LowerBoundRentals: The predicted minimum values for the forecasted period.
    • UpperBoundRentals: The predicted maximum values for the forecasted period.

Define paths and initialize variables

  1. Inside the Main method, define variables to store the location of your data, the connection string, and where to save the trained model.

    string rootDir = Path.GetFullPath(Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "../../../"));
    string dbFilePath = Path.Combine(rootDir, "Data", "DailyDemand.mdf");
    string modelPath = Path.Combine(rootDir, "MLModel.zip");
    var connectionString = $"Data Source=(LocalDB)\\MSSQLLocalDB;AttachDbFilename={dbFilePath};Integrated Security=True;Connect Timeout=30;";
    
  2. Initialize the mlContext variable with a new instance of MLContext by adding the following line to the Main method.

    MLContext mlContext = new MLContext();
    

    The MLContext class is the starting point for all ML.NET operations. Initializing mlContext creates a new ML.NET environment that can be shared across the model creation workflow objects. It's similar, conceptually, to DbContext in Entity Framework.

Load the data

  1. Create a DatabaseLoader that loads records of type ModelInput.

    DatabaseLoader loader = mlContext.Data.CreateDatabaseLoader<ModelInput>();
    
  2. Define the query to load the data from the database.

    string query = "SELECT RentalDate, CAST(Year as REAL) as Year, CAST(TotalRentals as REAL) as TotalRentals FROM Rentals";
    

    ML.NET algorithms expect data to be of type Single. Therefore, numerical values coming from the database that are not of type Real, a single-precision floating-point value, have to be converted to Real.

    The Year and TotalRentals columns are both integer types in the database. Using the CAST built-in function, they are both cast to Real.

  3. Create a DatabaseSource to connect to the database and execute the query.

    DatabaseSource dbSource = new DatabaseSource(SqlClientFactory.Instance,
                                    connectionString,
                                    query);
    
  4. Load the data into an IDataView.

    IDataView dataView = loader.Load(dbSource);
    
  5. The dataset contains two years' worth of data. Only data from the first year is used for training; the second year is held out to compare the actual values against the forecast produced by the model. Filter the data using the FilterRowsByColumn transform.

    IDataView firstYearData = mlContext.Data.FilterRowsByColumn(dataView, "Year", upperBound: 1);
    IDataView secondYearData = mlContext.Data.FilterRowsByColumn(dataView, "Year", lowerBound: 1);
    

    For the first year, only the values in the Year column less than 1 are selected by setting the upperBound parameter to 1. Conversely, for the second year, values greater than or equal to 1 are selected by setting the lowerBound parameter to 1.
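
The bounds described above form a half-open range: the lower bound is included and the upper bound is excluded. The following minimal sketch mimics that rule with LINQ on a plain array, without using ML.NET. The FilterByRange helper and the sample values are hypothetical and exist only to illustrate the semantics.

```csharp
using System;
using System.Linq;

public static class RangeFilterSketch
{
    // Keeps the values v that satisfy lowerBound <= v < upperBound.
    // Omitting a bound leaves that side of the range open.
    public static float[] FilterByRange(
        float[] values,
        float lowerBound = float.NegativeInfinity,
        float upperBound = float.PositiveInfinity)
    {
        return values
            .Where(v => v >= lowerBound && v < upperBound)
            .ToArray();
    }

    public static void Main()
    {
        // Encoded years for five hypothetical rows (0 = 2011, 1 = 2012).
        float[] years = { 0f, 0f, 1f, 1f, 1f };

        float[] firstYear = FilterByRange(years, upperBound: 1f);  // values below 1
        float[] secondYear = FilterByRange(years, lowerBound: 1f); // values of 1 and above

        Console.WriteLine($"First year rows: {firstYear.Length}");   // prints 2
        Console.WriteLine($"Second year rows: {secondYear.Length}"); // prints 3
    }
}
```

Because one bound is inclusive and the other exclusive, using the same value of 1 for both filters assigns every row to exactly one of the two sets.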

Define time series analysis pipeline

  1. Define a pipeline that uses the SsaForecastingEstimator to forecast values in a time-series dataset.

    var forecastingPipeline = mlContext.Forecasting.ForecastBySsa(
        outputColumnName: "ForecastedRentals",
        inputColumnName: "TotalRentals",
        windowSize: 7,
        seriesLength: 30,
        trainSize: 365,
        horizon: 7,
        confidenceLevel: 0.95f,
        confidenceLowerBoundColumn: "LowerBoundRentals",
        confidenceUpperBoundColumn: "UpperBoundRentals");
    

    The forecastingPipeline takes 365 data points for the first year and samples or splits the time-series dataset into 30-day (monthly) intervals as specified by the seriesLength parameter. Each of these samples is analyzed through a weekly, or 7-day, window. When determining the forecasted value for the next period(s), the values from the previous seven days are used to make a prediction. The model is set to forecast seven periods into the future as defined by the horizon parameter. Because a forecast is an informed guess, it's not always 100% accurate. Therefore, it's good to know the range of values in the best and worst-case scenarios as defined by the upper and lower bounds. In this case, the level of confidence for the lower and upper bounds is set to 95%. The confidence level can be increased or decreased accordingly. The higher the value, the wider the range between the upper and lower bounds needed to achieve the desired level of confidence.

  2. Use the Fit method to train the model and fit the data to the previously defined forecastingPipeline.

    SsaForecastingTransformer forecaster = forecastingPipeline.Fit(firstYearData);
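
To see why a higher confidence level produces a wider range, consider the following minimal sketch. It assumes forecast errors that are roughly normally distributed, so that the bounds are the forecast plus or minus a z-score times a standard error. That assumption, the Bounds helper, and the numbers used are illustrative only; ML.NET computes its bounds internally and may use a different method.

```csharp
using System;

public static class IntervalWidthSketch
{
    // Returns symmetric bounds around a forecast: forecast +/- zScore * standardError.
    public static (double Lower, double Upper) Bounds(
        double forecast, double standardError, double zScore)
    {
        double margin = zScore * standardError;
        return (forecast - margin, forecast + margin);
    }

    public static void Main()
    {
        // Hypothetical forecast and standard error, not produced by the model.
        double forecast = 2000;
        double standardError = 500;

        // Two-sided z-scores of the standard normal distribution.
        var levels = new (string Label, double Z)[]
        {
            ("90%", 1.645),
            ("95%", 1.960),
            ("99%", 2.576)
        };

        foreach (var level in levels)
        {
            var bounds = Bounds(forecast, standardError, level.Z);
            double width = bounds.Upper - bounds.Lower;
            Console.WriteLine($"{level.Label}: lower={bounds.Lower}, upper={bounds.Upper}, width={width}");
        }
    }
}
```

Under these assumptions, the width grows from 1645 at 90% to 1960 at 95% and 2576 at 99%, which is the trade-off between certainty and precision described above.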
    

Evaluate the model

Evaluate how well the model performs by forecasting next year's data and comparing it against the actual values.

  1. Below the Main method, create a new utility method called Evaluate.

    static void Evaluate(IDataView testData, ITransformer model, MLContext mlContext)
    {
    
    }
    
  2. Inside the Evaluate method, forecast the second year's data by using the Transform method with the trained model.

    IDataView predictions = model.Transform(testData);
    
  3. Get the actual values from the data by using the CreateEnumerable method.

    IEnumerable<float> actual =
        mlContext.Data.CreateEnumerable<ModelInput>(testData, true)
            .Select(observed => observed.TotalRentals);
    
  4. Get the forecast values by using the CreateEnumerable method.

    IEnumerable<float> forecast =
        mlContext.Data.CreateEnumerable<ModelOutput>(predictions, true)
            .Select(prediction => prediction.ForecastedRentals[0]);
    
  5. Calculate the difference between the actual and forecast values, commonly referred to as the error.

    var metrics = actual.Zip(forecast, (actualValue, forecastValue) => actualValue - forecastValue);
    
  6. Measure performance by computing the Mean Absolute Error and Root Mean Squared Error values.

    var MAE = metrics.Average(error => Math.Abs(error)); // Mean Absolute Error
    var RMSE = Math.Sqrt(metrics.Average(error => Math.Pow(error, 2))); // Root Mean Squared Error
    

    To evaluate performance, the following metrics are used:

    • Mean Absolute Error: Measures how close predictions are to the actual value. This value ranges between 0 and infinity. The closer to 0, the better the quality of the model.
    • Root Mean Squared Error: Summarizes the error in the model. This value ranges between 0 and infinity. The closer to 0, the better the quality of the model.
  7. Output the metrics to the console.

    Console.WriteLine("Evaluation Metrics");
    Console.WriteLine("---------------------");
    Console.WriteLine($"Mean Absolute Error: {MAE:F3}");
    Console.WriteLine($"Root Mean Squared Error: {RMSE:F3}\n");
    
  8. Use the Evaluate method inside the Main method.

    Evaluate(secondYearData, forecaster, mlContext);
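
As a quick check of the two metrics used in the steps above, the following minimal sketch computes them by hand for three made-up observations. The class and method names are hypothetical, and the values are not taken from the dataset.

```csharp
using System;
using System.Linq;

public static class ErrorMetricsSketch
{
    // Mean Absolute Error: the average of |actual - forecast|.
    public static double MeanAbsoluteError(float[] actual, float[] forecast) =>
        actual.Zip(forecast, (a, f) => Math.Abs((double)a - f)).Average();

    // Root Mean Squared Error: the square root of the average of (actual - forecast)^2.
    public static double RootMeanSquaredError(float[] actual, float[] forecast) =>
        Math.Sqrt(actual.Zip(forecast, (a, f) => Math.Pow((double)a - f, 2)).Average());

    public static void Main()
    {
        // Made-up values; the errors are -10, 10, and -30.
        float[] actual = { 100f, 200f, 300f };
        float[] forecast = { 110f, 190f, 330f };

        Console.WriteLine($"Mean Absolute Error: {MeanAbsoluteError(actual, forecast):F3}");
        Console.WriteLine($"Root Mean Squared Error: {RootMeanSquaredError(actual, forecast):F3}");
    }
}
```

Here the Mean Absolute Error is (10 + 10 + 30) / 3, or about 16.667, while the Root Mean Squared Error is the square root of (100 + 100 + 900) / 3, or about 19.149. The second value is larger because squaring gives the single 30-unit miss more weight, so a wide gap between the two metrics hints at a few large errors rather than many small ones.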
    

Save the model

If you're satisfied with your model, save it for later use in other applications.

  1. In the Main method, create a TimeSeriesPredictionEngine. TimeSeriesPredictionEngine is a convenience class for making single predictions.

    var forecastEngine = forecaster.CreateTimeSeriesEngine<ModelInput, ModelOutput>(mlContext);
    
  2. Save the model to a file called MLModel.zip as specified by the previously defined modelPath variable. Use the CheckPoint method to save the model.

    forecastEngine.CheckPoint(mlContext, modelPath);
    

Use the model to forecast demand

  1. Below the Evaluate method, create a new utility method called Forecast.

    static void Forecast(IDataView testData, int horizon, TimeSeriesPredictionEngine<ModelInput, ModelOutput> forecaster, MLContext mlContext)
    {
    
    }
    
  2. Inside the Forecast method, use the Predict method to forecast rentals for the next seven days.

    ModelOutput forecast = forecaster.Predict();
    
  3. Align the actual and forecast values for seven periods.

    IEnumerable<string> forecastOutput =
        mlContext.Data.CreateEnumerable<ModelInput>(testData, reuseRowObject: false)
            .Take(horizon)
            .Select((ModelInput rental, int index) =>
            {
                string rentalDate = rental.RentalDate.ToShortDateString();
                float actualRentals = rental.TotalRentals;
                float lowerEstimate = Math.Max(0, forecast.LowerBoundRentals[index]);
                float estimate = forecast.ForecastedRentals[index];
                float upperEstimate = forecast.UpperBoundRentals[index];
                return $"Date: {rentalDate}\n" +
                $"Actual Rentals: {actualRentals}\n" +
                $"Lower Estimate: {lowerEstimate}\n" +
                $"Forecast: {estimate}\n" +
                $"Upper Estimate: {upperEstimate}\n";
            });
    
  4. Iterate through the forecast output and display it on the console.

    Console.WriteLine("Rental Forecast");
    Console.WriteLine("---------------------");
    foreach (var prediction in forecastOutput)
    {
        Console.WriteLine(prediction);
    }
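
The alignment step pairs each actual value with the forecast at the same index and clamps the lower bound at zero, since a rental count can't be negative. The following minimal sketch reproduces that logic with plain arrays so it can be tried without a trained model. The Align helper and all values are hypothetical.

```csharp
using System;
using System.Linq;

public static class ForecastAlignmentSketch
{
    // Pairs the first `horizon` actual values with the forecast arrays by index.
    // Assumes each forecast array holds at least `horizon` elements.
    public static string[] Align(
        float[] actual, float[] lower, float[] forecast, float[] upper, int horizon)
    {
        return actual
            .Take(horizon)
            .Select((value, index) =>
                $"Actual: {value}, " +
                $"Lower: {Math.Max(0f, lower[index])}, " + // a negative bound is reported as 0
                $"Forecast: {forecast[index]}, " +
                $"Upper: {upper[index]}")
            .ToArray();
    }

    public static void Main()
    {
        // Made-up values; the first lower bound is negative on purpose.
        float[] actual = { 120f, 95f, 140f };
        float[] lower = { -50f, 40f };
        float[] forecast = { 100f, 110f };
        float[] upper = { 250f, 180f };

        foreach (string line in Align(actual, lower, forecast, upper, horizon: 2))
        {
            Console.WriteLine(line);
        }
    }
}
```

Only two lines are produced even though three actual values are supplied, because Take limits the comparison to the forecast horizon.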
    

Run the application

  1. Inside the Main method, call the Forecast method.

    Forecast(secondYearData, 7, forecastEngine, mlContext);
    
  2. Run the application. Output similar to the following should appear on the console. For brevity, the output has been condensed.

    Evaluation Metrics
    ---------------------
    Mean Absolute Error: 726.416
    Root Mean Squared Error: 987.658
    
    Rental Forecast
    ---------------------
    Date: 1/1/2012
    Actual Rentals: 2294
    Lower Estimate: 1197.842
    Forecast: 2334.443
    Upper Estimate: 3471.044
    
    Date: 1/2/2012
    Actual Rentals: 1951
    Lower Estimate: 1148.412
    Forecast: 2360.861
    Upper Estimate: 3573.309
    

Inspection of the actual and forecasted values shows the following relationships:

[Figure: Comparison of actual and forecast values]

While the forecasted values don't predict the exact number of rentals, they provide a narrower range of values that allows an operation to optimize its use of resources.

Congratulations! You've now successfully built a time series machine learning model to forecast bike rental demand.

You can find the source code for this tutorial in the dotnet/machinelearning-samples repository.

Next steps