用于 SQL Server Python 和 R 教程的纽约市出租车演示数据NYC Taxi demo data for SQL Server Python and R tutorials

适用于:Applies to: 是SQL Server 2016 (13.x)SQL Server 2016 (13.x)yesSQL Server 2016 (13.x)SQL Server 2016 (13.x) 及更高版本 是Azure SQL 托管实例Azure SQL Managed InstanceYesAzure SQL 托管实例Azure SQL Managed Instance适用于:Applies to: 是SQL Server 2016 (13.x)SQL Server 2016 (13.x)yesSQL Server 2016 (13.x)SQL Server 2016 (13.x) and later 是Azure SQL 托管实例Azure SQL Managed InstanceYesAzure SQL 托管实例Azure SQL Managed Instance

本文介绍了如何设置一个示例数据库,该数据库包含来自纽约市出租车和轿车委员会的公共数据。This article explains how to set up a sample database consisting of public data from the New York City Taxi and Limousine Commission. 此数据用于多个 R 和 Python 教程,用于 SQL Server 上的数据库内分析。This data is used in several R and Python tutorials for in-database analytics on SQL Server. 为了使示例代码运行速度更快,我们创建了一个具有代表性的 1% 采样数据。To make the sample code run quicker, we created a representative 1% sampling of the data. 在你的系统上,数据库备份文件略大于 90 MB,在主数据表中提供 170 万行。On your system, the database backup file is slightly over 90 MB, providing 1.7 million rows in the primary data table.

若要完成此练习,你应该拥有 SQL Server Management Studio 或其他可以还原数据库备份文件并运行 T-SQL 查询的工具。To complete this exercise, you should have SQL Server Management Studio or another tool that can restore a database backup file and run T-SQL queries.

使用此数据集的教程和快速入门包括以下内容:Tutorials and quickstarts using this data set include the following:

下载文件Download files

示例数据库为 Microsoft 托管的 SQL Server 2016 BAK 文件。The sample database is a SQL Server 2016 BAK file hosted by Microsoft. 可在 SQL Server 2016 及更高版本上还原它。You can restore it on SQL Server 2016 and later. 单击链接后,将立即开始文件下载。File download begins immediately when you click the link.

文件大小约为 90 MB。File size is approximately 90 MB.

备注

若要还原 SQL Server 大数据群集上的示例数据库,请下载NYCTaxi_Sample.bak 并遵循将数据库还原到 SQL Server 大数据群集主实例中的说明进行操作。To restore the sample database on SQL Server Big Data Clusters, download NYCTaxi_Sample.bak and follow the directions in Restore a database into the SQL Server big data cluster master instance.

备注

若要还原 Azure SQL 托管实例中的机器学习服务上的示例数据库,请按照快速入门:将数据库还原到 Azure SQL 托管实例中的说明,使用纽约市出租车演示数据库 .bak 文件进行操作:https://sqlmldoccontent.blob.core.windows.net/sqlml/NYCTaxi_Sample.bakTo restore the sample database on Machine Learning Services in Azure SQL Managed Instance, follow the instructions in Quickstart: Restore a database to Azure SQL Managed Instance using the NYC Taxi demo database .bak file: https://sqlmldoccontent.blob.core.windows.net/sqlml/NYCTaxi_Sample.bak.

  1. 单击 NYCTaxi_Sample.bak 下载数据库备份文件。Click NYCTaxi_Sample.bak to download the database backup file.

  2. 将该文件复制到 C:\Program files\Microsoft SQL Server\MSSQL-instance-name\MSSQL\Backup 文件夹。Copy the file to C:\Program files\Microsoft SQL Server\MSSQL-instance-name\MSSQL\Backup folder.

  3. 在 Management Studio 中,右键单击“数据库”,然后选择“还原文件和文件组” 。In Management Studio, right-click Databases and select Restore Files and File Groups.

  4. 输入“NYCTaxi_Sample”作为数据库名称。Enter NYCTaxi_Sample as the database name.

  5. 单击“从设备”,然后打开文件选择页以选择备份文件。Click From device and then open the file selection page to select the backup file. 单击“添加”以选择 NYCTaxi_Sample.bak。Click Add to select NYCTaxi_Sample.bak.

  6. 选中“还原”复选框,然后单击“确定”以还原数据库 。Select the Restore checkbox and click OK to restore the database.

查看数据库对象Review database objects

使用 SQL Server Management StudioSQL Server Management Studio 确认 SQL ServerSQL Server 实例上存在数据库对象。Confirm the database objects exist on the SQL ServerSQL Server instance using SQL Server Management StudioSQL Server Management Studio. 你可看到数据库、表、函数和存储过程。You should see the database, tables, functions, and stored procedures.

rsql_devtut_BrowseTablesrsql_devtut_BrowseTables

NYCTaxi_Sample 数据库中的对象Objects in NYCTaxi_Sample database

下表总结了纽约市出租车演示数据库中创建的对象。The following table summarizes the objects created in the NYC Taxi demo database.

对象名称Object name 对象类型Object type 说明Description
NYCTaxi_SampleNYCTaxi_Sample databasedatabase 创建一个数据库和两个表:Creates a database and two tables:

dbo.nyctaxi_sample 表:包含纽约市出租车主数据集。dbo.nyctaxi_sample table: Contains the main NYC Taxi dataset. 将在表中添加一个聚集列存储索引,改善存储和查询性能。A clustered columnstore index is added to the table to improve storage and query performance. 此表中插入了纽约市出租车数据集的 1% 采样。The 1% sample of the NYC Taxi dataset is inserted into this table.

dbo.nyc_taxi_models 表:用于保留已定型的高级分析模型。dbo.nyc_taxi_models table: Used to persist the trained advanced analytics model.
fnCalculateDistancefnCalculateDistance 标量值函数 (scalar-valued function)scalar-valued function 计算搭乘位置和下车位置之间的直接距离。Calculates the direct distance between pickup and dropoff locations. 创建数据功能定型和保存模型以及操作 R 模型中使用此功能。This function is used in Create data features, Train and save a model and Operationalize the R model.
fnEngineerFeaturesfnEngineerFeatures 表值函数 (table-valued function)table-valued function 为模型定型创建新的数据功能。Creates new data features for model training. 创建数据功能操作 R 模型中使用此功能。This function is used in Create data features and Operationalize the R model.

使用在各种教程中找到的 R 和 Python 脚本创建存储过程。Stored procedures are created using R and Python script found in various tutorials. 下表总结了从不同课程运行脚本时可以选择添加到纽约市出租车演示数据库的存储过程。The following table summarizes the stored procedures that you can optionally add to the NYC Taxi demo database when you run script from various lessons.

存储过程Stored procedure 语言Language 说明Description
RxPlotHistogramRxPlotHistogram RR 调用 RevoScaleR rxHistogram 函数绘制变量的直方图,然后将绘图作为二进制对象返回。Calls the RevoScaleR rxHistogram function to plot the histogram of a variable and then returns the plot as a binary object. 浏览并可视化数据中使用此存储过程。This stored procedure is used in Explore and visualize data.
RPlotRHistRPlotRHist RR 使用 Hist 函数创建图形,然后将输出保存为本地 PDF 文件。Creates a graphic using the Hist function and saves the output as a local PDF file. 浏览并可视化数据中使用此存储过程。This stored procedure is used in Explore and visualize data.
RxTrainLogitModelRxTrainLogitModel RR 通过调用 R 程序包定型逻辑回归模型。Trains a logistic regression model by calling an R package. 该模型预测附属列的值,并使用随机选择的 70% 的数据进行定型。The model predicts the value of the tipped column, and is trained using a randomly selected 70% of the data. 存储过程的输出是定型模型,保存在表 nyc_taxi_models 中。The output of the stored procedure is the trained model, which is saved in the table nyc_taxi_models. 定型和保存模型中使用此存储过程。This stored procedure is used in Train and save a model.
RxPredictBatchOutputRxPredictBatchOutput RR 调用已定型模型以便使用模型创建预测。Calls the trained model to create predictions using the model. 该存储过程接受查询作为其输入参数,并返回包含输入行分数的数值的列。The stored procedure accepts a query as its input parameter and returns a column of numeric values containing the scores for the input rows. 预测潜在结果中使用此存储过程。This stored procedure is used in Predict potential outcomes.
RxPredictSingleRowRxPredictSingleRow RR 调用已定型模型以便使用模型创建预测。Calls the trained model to create predictions using the model. 该存储过程接受一个新观察值作为输入,而传递的单个功能值作为嵌入式参数,并返回一个预测新观察值结果的值。This stored procedure accepts a new observation as input, with individual feature values passed as in-line parameters, and returns a value that predicts the outcome for the new observation. 预测潜在结果中使用此存储过程。This stored procedure is used in Predict potential outcomes.

查询数据Query the data

作为验证步骤,运行查询以确认已上传数据。As a validation step, run a query to confirm the data was uploaded.

  1. 在“对象资源管理器”中的“数据库”下,右键单击“NYCTaxi_Sample”数据库,然后启动一个新查询。In Object Explorer, under Databases, right-click the NYCTaxi_Sample database, and start a new query.

  2. 运行一些简单的查询:Run some simple queries:

    SELECT TOP(10) * FROM dbo.nyctaxi_sample;
    SELECT COUNT(*) FROM dbo.nyctaxi_sample;
    

数据库包含 170 万行。The database contains 1.7 million rows.

  1. 数据库内是一个包含数据集的 nyctaxi_sample 表。Within the database is a nyctaxi_sample table that contains the data set. 表已针对基于集的计算进行了优化,并且添加了列存储索引The table has been optimized for set-based calculations with the addition of a columnstore index. 运行此语句,以生成表的快速摘要。Run this statement to generate a quick summary on the table.

    SELECT DISTINCT [passenger_count]
        , ROUND (SUM ([fare_amount]),0) as TotalFares
        , ROUND (AVG ([fare_amount]),0) as AvgFares
    FROM [dbo].[nyctaxi_sample]
    GROUP BY [passenger_count]
    ORDER BY  AvgFares DESC
    

结果应类似于以下屏幕截图中显示的结果。Results should be similar to those showing in the following screenshot.

表摘要信息Table summary information

后续步骤Next steps

纽约市出租车示例数据现在可用于实践学习。NYC Taxi sample data is now available for hands-on learning.