Python 教程:使用二元分类来预测纽约市出租车费用Python tutorial: Predict NYC taxi fares with binary classification

适用于:Applies to: 是SQL Server 2017 (14.x)SQL Server 2017 (14.x)yesSQL Server 2017 (14.x)SQL Server 2017 (14.x) 及更高版本 是Azure SQL 托管实例Azure SQL Managed InstanceYesAzure SQL 托管实例Azure SQL Managed Instance适用于:Applies to: 是SQL Server 2017 (14.x)SQL Server 2017 (14.x)yesSQL Server 2017 (14.x)SQL Server 2017 (14.x) and later 是Azure SQL 托管实例Azure SQL Managed InstanceYesAzure SQL 托管实例Azure SQL Managed Instance

在本面向 SQL 程序员的由五部分组成的系列教程中,你将学习如何在 SQL Server 机器学习服务大数据群集中集成 Python。In this five-part tutorial series for SQL programmers, you'll learn about Python integration in SQL Server Machine Learning Services or on Big Data Clusters.

在本面向 SQL 程序员的由五部分组成的系列教程中,你将学习如何在 SQL Server 机器学习服务中集成 Python。In this five-part tutorial series for SQL programmers, you'll learn about Python integration in SQL Server Machine Learning Services.

在本面向 SQL 程序员的由五部分组成的系列教程中,你将学习如何在 Azure SQL 托管实例中的机器学习服务上集成 Python。In this five-part tutorial series for SQL programmers, you'll learn about Python integration in Machine Learning Services in Azure SQL Managed Instance.

你将使用 SQL Server 上的示例数据库来生成和部署基于 Python 的机器学习解决方案。You'll build and deploy a Python-based machine learning solution using a sample database on SQL Server. 你将使用 T-SQL、Azure Data Studio 或 SQL Server Management Studio,以及支持 SQL 机器学习和 Python 语言的数据库实例。You'll use T-SQL, Azure Data Studio or SQL Server Management Studio, and a database instance with SQL machine learning and Python language support.

本系列教程介绍了在数据建模工作流中使用的 Python 函数。This tutorial series introduces you to Python functions used in a data modeling workflow. 其中的部分包括数据浏览、生成和训练二元分类模型,以及模型部署等。Parts include data exploration, building and training a binary classification model, and model deployment. 你将使用来自纽约市出租车和豪华轿车委员会的示例数据。You'll use sample data from the New York City Taxi and Limousine Commission. 要生成的模型会根据一天中的时间、行程距离和上车位置来预测行程是否可能会产生小费。The model you'll build predicts whether a trip is likely to result in a tip based on the time of day, distance traveled, and pick-up location.

在本系列的第一部分中,你将安装必备组件,并还原示例数据库。In the first part of this series, you'll install the prerequisites and restore the sample database. 在第二和第三部分中,你将开发一些 Python 脚本,以准备数据并训练机器学习模型。In parts two and three, you'll develop some Python scripts to prepare your data and train a machine learning model. 然后,在第四和第五部分中,你将使用 T-SQL 存储过程在数据库中运行这些 Python 脚本。Then, in parts four and five, you'll run those Python scripts inside the database using T-SQL stored procedures.

在本文中,你将:In this article, you'll:

  • 安装必备组件Install prerequisites
  • 还原示例数据库Restore the sample database

第二部分中,你将探索示例数据,并生成一些绘图。In part two, you'll explore the sample data and generate some plots.

第三部分中,你将学习如何使用 Transact-SQL 函数根据原始数据创建特征。In part three, you'll learn how to create features from raw data by using a Transact-SQL function. 然后从存储过程调用该函数,创建包含该功能值的表。You'll then call that function from a stored procedure to create a table that contains the feature values.

第四部分中,你将加载模块,并调用必要的函数,以使用 SQL Server 存储过程来创建和训练模型。In part four, you'll load the modules and call the necessary functions to create and train the model using a SQL Server stored procedure.

第五部分中,你将了解如何操作在第四部分中训练和保存的模型。In part five, you'll learn how to operationalize the models that you trained and saved in part four.

备注

R 和 Python 均提供此教程。This tutorial is available in both R and Python. 有关 R 版本,请参阅 R 教程:使用二元分类来预测纽约市出租车费用For the R version, see R tutorial: Predict NYC taxi fares with binary classification.

必备知识Prerequisites

所有任务都可以使用 Azure Data Studio 或 Management StudioManagement Studio 中的 Transact-SQLTransact-SQL 存储过程来完成。All tasks can be done using Transact-SQLTransact-SQL stored procedures in Azure Data Studio or Management StudioManagement Studio.

若要更好地学习本系列教程,你需要熟悉基本数据库操作,如创建数据库和表、导入数据以及编写 SQL 查询。This tutorial series assumes familiarity with basic database operations such as creating databases and tables, importing data, and writing SQL queries. 但无需了解 Python,因为所有 Python 代码都已提供。It does not assume you know Python and all Python code is provided.

SQL 开发者背景Background for SQL developers

构建机器学习解决方案是一种复杂的过程,它可能涉及多种工具,并且需要主题专家跨多个阶段进行协调:The process of building a machine learning solution is a complex one that can involve multiple tools, and the coordination of subject matter experts across several phases:

  • 获取和清除数据obtaining and cleaning data
  • 探索数据并构建有助于建模的功能exploring the data and building features useful for modeling
  • 定型和优化模型training and tuning the model
  • 部署到生产环境deployment to production

使用专用的开发环境,可以以最佳方式执行实际代码的开发和测试。Development and testing of the actual code is best performed using a dedicated development environment. 不过,在完全测试脚本后,可以在熟悉的 Azure Data Studio 或 Management StudioManagement Studio 环境中使用 Transact-SQLTransact-SQL 存储过程将脚本轻松部署到 SQL ServerSQL ServerHowever, after the script is fully tested, you can easily deploy it to SQL ServerSQL Server using Transact-SQLTransact-SQL stored procedures in the familiar environment of Azure Data Studio or Management StudioManagement Studio. 在存储过程中包装外部代码是在 SQL Server 中操作代码的主要机制。Wrapping external code in stored procedures is the primary mechanism for operationalizing code in SQL Server.

将模型保存到数据库后,可以使用存储过程从 Transact-SQLTransact-SQL 调用该模型以用于预测。After the model has been saved to the database, you can call the model for predictions from Transact-SQLTransact-SQL by using stored procedures.

无论你是初次接触 Python 的 SQL 程序员,还是初次接触 SQL 的 Python 开发人员,都可以学习由五部分组成的本系列教程,其中介绍了使用 Python 和 SQL Server 进行数据库内分析的典型工作流。Whether you're a SQL programmer new to Python, or a Python developer new to SQL, this five-part tutorial series introduces a typical workflow for conducting in-database analytics with Python and SQL Server.

后续步骤Next steps

本文内容:In this article, you:

  • 安装必备组件Installed prerequisites
  • 还原示例数据库Restored the sample database