Python 教程:将 K-Means 聚类分析与 SQL 机器学习配合使用,对客户进行聚类分析Python tutorial: Categorizing customers using k-means clustering with SQL machine learning

适用于:Applies to: 是SQL Server 2017 (14.x)SQL Server 2017 (14.x)yesSQL Server 2017 (14.x)SQL Server 2017 (14.x) 及更高版本 是Azure SQL 托管实例Azure SQL Managed InstanceYesAzure SQL 托管实例Azure SQL Managed Instance适用于:Applies to: 是SQL Server 2017 (14.x)SQL Server 2017 (14.x)yesSQL Server 2017 (14.x)SQL Server 2017 (14.x) and later 是Azure SQL 托管实例Azure SQL Managed InstanceYesAzure SQL 托管实例Azure SQL Managed Instance

本系列教程由四个部分组成,引导你使用 Python 在 SQL Server 机器学习服务中或在大数据群集上开发和部署 K-Means 聚类分析模型,以便对客户数据进行聚类分析。In this four-part tutorial series, you'll use Python to develop and deploy a K-Means clustering model in SQL Server Machine Learning Services or on Big Data Clusters to categorize customer data.

本系列教程由四个部分组成,引导你使用 Python 在 SQL Server 机器学习服务中开发和部署 K-Means 群集模型,以便对客户数据进行聚类分析。In this four-part tutorial series, you'll use Python to develop and deploy a K-Means clustering model in SQL Server Machine Learning Services to cluster customer data.

本系列教程由四个部分组成,引导你使用 Python 在 Azure SQL 托管实例机器学习服务中开发和部署 K-Means 群集模型,以便对客户数据进行聚类分析。In this four-part tutorial series, you'll use Python to develop and deploy a K-Means clustering model in Azure SQL Managed Instance Machine Learning Services to cluster customer data.

在本系列的第一部分中,你将设置本教程的先决条件,然后将示例数据集还原到一个数据库。In part one of this series, you'll set up the prerequisites for the tutorial and then restore a sample dataset to a database. 在本系列的后续部分中,使用这些数据在 Python 中通过 SQL 机器学习来训练和部署聚类分析模型。Later in this series, you'll use this data to train and deploy a clustering model in Python with SQL machine learning.

在本系列的第二和第三部分中,在 Azure Data Studio 笔记本中开发一些 Python 脚本,用于分析和准备数据以及定型机器学习模型。In parts two and three of this series, you'll develop some Python scripts in an Azure Data Studio notebook to analyze and prepare your data and train a machine learning model. 然后,在第四部分中,使用存储过程在数据库中运行这些 Python 脚本。Then, in part four, you'll run those Python scripts inside a database using stored procedures.

聚类分析可解释为将数据组织成组,其中一个组的成员在某些方面类似。Clustering can be explained as organizing data into groups where members of a group are similar in some way. 对于本系列教程,假设你拥有一家零售企业。For this tutorial series, imagine you own a retail business. 你将使用 K-Means 算法在产品购买及退货的数据集中执行针对客户的聚类分析。You'll use the K-Means algorithm to perform the clustering of customers in a dataset of product purchases and returns. 通过对客户进行聚类分析,可以将特定组定为目标,更加高效地专注于市场营销工作。By clustering customers, you can focus your marketing efforts more effectively by targeting specific groups. K-Means 群集是一种无监督式学习算法,该算法根据相似性寻找数据中的规律。K-Means clustering is an unsupervised learning algorithm that looks for patterns in data based on similarities.

本文将指导如何进行以下操作:In this article, you'll learn how to:

  • 还原示例数据库Restore a sample database

第二部分介绍如何从数据库准备数据以执行聚类分析。In part two, you'll learn how to prepare the data from a database to perform clustering.

第三部分介绍如何在 Python 中创建和定型 K-Means 群集模型。In part three, you'll learn how to create and train a K-Means clustering model in Python.

第四部分中,你将了解如何在数据库中创建存储过程,以便基于新数据在 Python 中执行聚类分析。In part four, you'll learn how to create a stored procedure in a database that can perform clustering in Python based on new data.

先决条件Prerequisites

  • Azure Data StudioAzure Data Studio. 你将使用 Azure Data Studio 中同时适用于 Python 和 SQL 的笔记本。You'll use a notebook in Azure Data Studio for both Python and SQL. 若要详细了解笔记本,请参阅如何使用 Azure Data Studio 中的笔记本For more information about notebooks, see How to use notebooks in Azure Data Studio.

  • 附加 Python 包 - 在本教程系列中的示例所使用的 Python 包中,有些可能是你已经安装了的,有些可能是你尚未安装的。Additional Python packages - The examples in this tutorial series use Python packages that you may or may not have installed.

    打开命令提示符,并更改为 Azure Data Studio 中使用的 Python 版本的安装路径。Open a Command Prompt and change to the installation path for the version of Python you use in Azure Data Studio. 例如,cd %LocalAppData%\Programs\Python\Python37-32For example, cd %LocalAppData%\Programs\Python\Python37-32. 然后,运行下面的命令,以安装所有尚未安装的包。Then run the following commands to install any of these packages that are not already installed.

    pip install matplotlib
    pip install pandas
    pip install pyodbc
    pip install scipy
    pip install sklearn
    

还原示例数据库Restore the sample database

本教程中使用的示例数据集已保存到 .bak 数据库备份文件,以供下载和使用。The sample dataset used in this tutorial has been saved to a .bak database backup file for you to download and use. 此数据集派生自 事务处理性能委员会 (TPC) 提供的 tpcx-bb 数据集。This dataset is derived from the tpcx-bb dataset provided by the Transaction Processing Performance Council (TPC).

备注

如果在大数据群集上使用机器学习服务,请了解如何将数据库还原成 SQL Server 大数据群集主实例If you are using Machine Learning Services on Big Data Clusters, see how to Restore a database into the SQL Server big data cluster master instance.

  1. 下载 tpcxbb_1gb.bak 文件。Download the file tpcxbb_1gb.bak.

  2. 使用以下详细信息,按 Azure Data Studio 中从备份文件还原数据库中的说明操作:Follow the directions in Restore a database from a backup file in Azure Data Studio, using these details:

    • 从下载的 tpcxbb_1gb.bak 文件导入Import from the tpcxbb_1gb.bak file you downloaded
    • 将目标数据库命名为“tpcxbb_1gb”Name the target database "tpcxbb_1gb"
  3. 可以查询 dbo.customer 表验证数据库还原后数据集是否存在:You can verify that the dataset exists after you have restored the database by querying the dbo.customer table:

    USE tpcxbb_1gb;
    SELECT * FROM [dbo].[customer];
    
  1. 下载 tpcxbb_1gb.bak 文件。Download the file tpcxbb_1gb.bak.

  2. 使用以下详细信息,按 SQL Server Management Studio 中将数据库还原到托管实例的说明操作:Follow the directions in Restore a database to a Managed Instance in SQL Server Management Studio, using these details:

    • 从下载的 tpcxbb_1gb.bak 文件导入Import from the tpcxbb_1gb.bak file you downloaded
    • 将目标数据库命名为“tpcxbb_1gb”Name the target database "tpcxbb_1gb"
  3. 可以查询 dbo.customer 表验证数据库还原后数据集是否存在:You can verify that the dataset exists after you have restored the database by querying the dbo.customer table:

    USE tpcxbb_1gb;
    SELECT * FROM [dbo].[customer];
    

清理资源Clean up resources

如果不打算继续学习本教程,请删除 tpcxbb_1gb 数据库。If you're not going to continue with this tutorial, delete the tpcxbb_1gb database.

后续步骤Next steps

在本系列教程的第一部分中,你已完成以下步骤:In part one of this tutorial series, you completed these steps:

  • 还原示例数据库Restore a sample database

若要为机器学习模型准备数据,按本系列教程的第二部分进行操作:To prepare the data for the machine learning model, follow part two of this tutorial series: