
Connect to data with the Azure Machine Learning studio

In this article, learn how to access your data with the Azure Machine Learning studio. Connect to your data in storage services on Azure with Azure Machine Learning datastores, and then package that data for tasks in your ML workflows with Azure Machine Learning datasets.

The following table defines datastores and datasets and summarizes their benefits.

Object: Datastores
Description: Securely connect to your storage service on Azure by storing your connection information, such as your subscription ID and token authorization, in the Key Vault associated with the workspace.
Benefits: Because your information is securely stored, you

  • Don't put authentication credentials or original data sources at risk.
  • No longer need to hard-code them in your scripts.

Object: Datasets
Description: By creating a dataset, you create a reference to the data source location, along with a copy of its metadata.
Benefits: With datasets you can

  • Access data during model training.
  • Share data and collaborate with other users.
  • Use open-source libraries, like pandas, for data exploration.

Because datasets are lazily evaluated, and the data remains in its existing location, you

  • Keep a single copy of data in your storage.
  • Incur no extra storage cost.
  • Don't risk unintentionally changing your original data sources.
  • Improve ML workflow performance speeds.
To understand where datastores and datasets fit in Azure Machine Learning's overall data access workflow, see the Securely access data article.

For a code-first experience, see the articles on using the Azure Machine Learning Python SDK to create datastores and datasets.

Prerequisites

• An Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try the free or paid version of Azure Machine Learning.

• Access to Azure Machine Learning studio.

• An Azure Machine Learning workspace. See Create an Azure Machine Learning workspace.

  • When you create a workspace, an Azure blob container and an Azure file share are automatically registered as datastores to the workspace. They're named workspaceblobstore and workspacefilestore, respectively. If blob storage is sufficient for your needs, workspaceblobstore is set as the default datastore and is already configured for use. Otherwise, you need a storage account on Azure with a supported storage type.

Create datastores

You can create datastores from these Azure storage solutions. For unsupported storage solutions, and to save data egress cost during ML experiments, you must move your data to a supported Azure storage solution. Learn more about datastores.

Create a new datastore in a few steps with the Azure Machine Learning studio.

Important

If your data storage account is in a virtual network, additional configuration steps are required to ensure the studio has access to your data. See Network isolation & privacy to ensure the appropriate configuration steps are applied.

1. Sign in to Azure Machine Learning studio.
2. Select Datastores on the left pane under Manage.
3. Select + New datastore.
4. Complete the form to create and register a new datastore. The form updates based on the Azure storage type and authentication type you select. See the storage access and permissions section to understand where to find the authentication credentials you need to populate this form.

The following example demonstrates what the form looks like when you create an Azure blob datastore:

[Image: Form for a new datastore]
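For readers who prefer code, the same fields the form collects can be gathered in a script. The sketch below is an illustration rather than the studio's own implementation: `blob_datastore_args` and `register_blob_datastore` are hypothetical helper names, the "exactly one credential" rule is a simplification chosen for this sketch, and the registration step assumes the `azureml-core` (SDK v1) package and a workspace `config.json` are available.

```python
from typing import Optional


def blob_datastore_args(
    datastore_name: str,
    container_name: str,
    account_name: str,
    account_key: Optional[str] = None,
    sas_token: Optional[str] = None,
    skip_validation: bool = False,
) -> dict:
    """Collect the fields the form asks for (hypothetical helper).

    This sketch requires exactly one credential: an account key or a SAS token.
    """
    if (account_key is None) == (sas_token is None):
        raise ValueError("Provide exactly one of account_key or sas_token.")
    args = {
        "datastore_name": datastore_name,
        "container_name": container_name,
        "account_name": account_name,
        "skip_validation": skip_validation,
    }
    if account_key is not None:
        args["account_key"] = account_key
    else:
        args["sas_token"] = sas_token
    return args


def register_blob_datastore(args: dict):
    """Register the datastore with the workspace described by config.json.

    Assumes azureml-core is installed; it is imported lazily so the helper
    above can be used without the SDK.
    """
    from azureml.core import Datastore, Workspace

    ws = Workspace.from_config()
    return Datastore.register_azure_blob_container(workspace=ws, **args)
```

Keeping the argument collection separate from the registration call makes it possible to check the inputs locally before anything is sent to Azure.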

Create datasets

After you create a datastore, create a dataset to interact with your data. Datasets package your data into a lazily evaluated object that machine learning tasks, like training, can consume. Learn more about datasets.

There are two types of datasets: FileDataset and TabularDataset. FileDatasets create references to single or multiple files or public URLs, whereas TabularDatasets represent your data in a tabular format. You can create TabularDatasets from .csv, .tsv, .parquet, and .jsonl files, and from SQL query results.
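As a rough illustration of the distinction above, the following sketch suggests a dataset type from a file extension. The function name `suggest_dataset_type` is hypothetical and not part of any Azure SDK. It only encodes the file types listed above, and a FileDataset remains a valid choice for any file.

```python
from pathlib import PurePosixPath

# File extensions the article lists as sources for TabularDatasets.
TABULAR_EXTENSIONS = {".csv", ".tsv", ".parquet", ".jsonl"}


def suggest_dataset_type(path: str) -> str:
    """Suggest 'TabularDataset' for listed tabular formats, else 'FileDataset'."""
    suffix = PurePosixPath(path).suffix.lower()
    return "TabularDataset" if suffix in TABULAR_EXTENSIONS else "FileDataset"


if __name__ == "__main__":
    for example in ("data/train.csv", "logs/events.jsonl", "images/cat.png"):
        print(example, "->", suggest_dataset_type(example))
```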

The following steps and animation show how to create a dataset in Azure Machine Learning studio.

Note

Datasets created through Azure Machine Learning studio are automatically registered to the workspace.

[Animation: Create a dataset with the UI]

To create a dataset in the studio:

1. Sign in to the Azure Machine Learning studio.
2. Select Datasets in the Assets section of the left pane.
3. Select Create Dataset to choose the source of your dataset. This source can be local files, a datastore, public URLs, or Azure Open Datasets.
4. Select Tabular or File for Dataset type.
5. Select Next to open the Datastore and file selection form. On this form, you select where to keep your dataset after creation and which data files to use for it.
  1. Enable skip validation if your data is in a virtual network. Learn more about virtual network isolation and privacy.
  2. For Tabular datasets, you can specify a 'timeseries' trait to enable time-related operations on your dataset. Learn how to add the timeseries trait to your dataset.
6. Select Next to populate the Settings and preview and Schema forms. They are populated automatically based on file type, and you can further configure your dataset on these forms before creation.
7. Select Next to review the Confirm details form. Check your selections and create an optional data profile for your dataset. Learn more about data profiling.
8. Select Create to complete your dataset creation.
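The steps above have a rough code equivalent in the Azure Machine Learning Python SDK v1 (`azureml-core`). The sketch below is one possible way to do it, not the studio's implementation. `create_tabular_dataset` is a hypothetical helper name, and it assumes a workspace `config.json`, an already registered datastore, and delimited files such as .csv. Treating the studio's skip validation option as equivalent to `validate=False` is also an assumption.

```python
def create_tabular_dataset(datastore_name, relative_path, dataset_name,
                           skip_validation=False):
    """Create and register a TabularDataset from files in a datastore.

    Hypothetical helper; azureml-core is imported lazily so that defining
    the function does not require the SDK to be installed.
    """
    from azureml.core import Dataset, Datastore, Workspace

    ws = Workspace.from_config()
    datastore = Datastore.get(ws, datastore_name)

    # Reference the files in place; the data stays in the storage service.
    dataset = Dataset.Tabular.from_delimited_files(
        path=[(datastore, relative_path)],
        validate=not skip_validation,  # assumed match for "skip validation"
    )

    # Datasets created in the studio are registered automatically;
    # in code, registration is an explicit step.
    return dataset.register(workspace=ws, name=dataset_name,
                            create_new_version=True)
```

A call might look like `create_tabular_dataset("workspaceblobstore", "data/train.csv", "train-data")`, where the path and dataset name are placeholders.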

Data profile and preview

After you create your dataset, verify you can view the profile and preview in the studio with the following steps.

1. Sign in to the Azure Machine Learning studio.
2. Select Datasets in the Assets section of the left pane.
3. Select the name of the dataset you want to view.
4. Select the Explore tab.
5. Select the Preview or Profile tab.

[Image: View dataset profile and preview]

You can get a variety of summary statistics across your dataset to verify whether it is ML-ready. For non-numeric columns, the profile includes only basic statistics like min, max, and error count. For numeric columns, you can also review their statistical moments and estimated quantiles.

Specifically, the data profile of an Azure Machine Learning dataset includes:

Note

Blank entries appear for features with irrelevant types.

Statistic: Description
Feature: Name of the column that is being summarized.
Profile: In-line visualization based on the inferred type. For example, strings, booleans, and dates have value counts, while decimals (numerics) have approximated histograms. This gives you a quick understanding of the distribution of the data.
Type distribution: In-line value count of types within a column. Nulls are their own type, so this visualization is useful for detecting odd or missing values.
Type: Inferred type of the column. Possible values include strings, booleans, dates, and decimals.
Min: Minimum value of the column. Blank entries appear for features whose type does not have an inherent ordering (like booleans).
Max: Maximum value of the column.
Count: Total number of missing and non-missing entries in the column.
Not missing count: Number of entries in the column that are not missing. Empty strings and errors are treated as values, so they will not contribute to the "not missing count."
Quantiles: Approximated values at each quantile to provide a sense of the distribution of the data.
Mean: Arithmetic mean or average of the column.
Standard deviation: Measure of the amount of dispersion or variation of this column's data.
Variance: Measure of how far spread out this column's data is from its average value.
Skewness: Measure of how asymmetric this column's data is compared with a normal distribution.
Kurtosis: Measure of how heavily tailed this column's data is compared to a normal distribution.
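To make the numeric statistics in this table concrete, here is a small standard-library sketch that computes them for one column. It is not how the service computes its profile. The function name `profile_numeric_column` is hypothetical, `None` stands in for a missing entry, and the specific estimators (sample variance, moment-based skewness, excess kurtosis, exact quartiles rather than approximated quantiles) are assumptions made for illustration.

```python
import math
import statistics
from typing import Optional, Sequence


def profile_numeric_column(values: Sequence[Optional[float]]) -> dict:
    """Summarize one numeric column; None marks a missing entry."""
    present = [v for v in values if v is not None]
    if len(present) < 2:
        raise ValueError("Need at least two non-missing values.")
    n = len(present)
    mean = statistics.fmean(present)

    # Population central moments, used for skewness and kurtosis.
    m2, m3, m4 = (sum((v - mean) ** k for v in present) / n for k in (2, 3, 4))
    if m2 == 0:
        raise ValueError("Skewness and kurtosis are undefined for a constant column.")

    variance = statistics.variance(present)  # sample variance (divides by n - 1)
    return {
        "count": len(values),                 # missing + non-missing entries
        "not_missing_count": n,
        "min": min(present),
        "max": max(present),
        "quantiles": statistics.quantiles(present, n=4),  # three quartile cut points
        "mean": mean,
        "standard_deviation": math.sqrt(variance),
        "variance": variance,
        "skewness": m3 / m2 ** 1.5,           # 0 for a symmetric column
        "kurtosis": m4 / m2 ** 2 - 3,         # excess kurtosis; 0 for a normal distribution
    }


if __name__ == "__main__":
    print(profile_numeric_column([1.0, 2.0, 3.0, 4.0, None]))
```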

Storage access and permissions

To ensure you securely connect to your Azure storage service, Azure Machine Learning requires that you have permission to access the corresponding data storage. This access depends on the authentication credentials used to register the datastore.

Virtual network

If your data storage account is in a virtual network, additional configuration steps are required to ensure Azure Machine Learning has access to your data. See Network isolation & privacy to ensure the appropriate configuration steps are applied when you create and register your datastore.

Access validation

As part of the initial datastore creation and registration process, Azure Machine Learning automatically validates that the underlying storage service exists and that the user-provided principal (username, service principal, or SAS token) has access to the specified storage.

After datastore creation, this validation is only performed for methods that require access to the underlying storage container, not each time datastore objects are retrieved. For example, validation happens if you want to download files from your datastore, but it does not happen if you just want to change your default datastore.

To authenticate your access to the underlying storage service, you can provide an account key, a shared access signature (SAS) token, or a service principal, depending on the datastore type you want to create. The storage type matrix lists the supported authentication types that correspond to each datastore type.

You can find account key, SAS token, and service principal information in the Azure portal.

• If you plan to use an account key or SAS token for authentication, select Storage Accounts on the left pane, and choose the storage account that you want to register.

  • The Overview page provides information such as the account name, container, and file share name.
    1. For account keys, go to Access keys on the Settings pane.
    2. For SAS tokens, go to Shared access signatures on the Settings pane.
• If you plan to use a service principal for authentication, go to your App registrations and select the app you want to use.

  • Its corresponding Overview page contains required information like tenant ID and client ID.

Important

• If you need to change your access keys for an Azure Storage account (account key or SAS token), be sure to sync the new credentials with your workspace and the datastores connected to it. Learn how to sync your updated credentials.

• If you unregister and re-register a datastore with the same name, and it fails, the Azure Key Vault for your workspace may not have soft-delete enabled. By default, soft-delete is enabled for the key vault instance created by your workspace, but it may not be enabled if you used an existing key vault or have a workspace created prior to October 2020. For information on how to enable soft-delete, see Turn on Soft Delete for an existing key vault.

Permissions

For Azure blob container and Azure Data Lake Gen 2 storage, make sure your authentication credentials have Storage Blob Data Reader access. Learn more about Storage Blob Data Reader. An account SAS token defaults to no permissions.

• For data read access, your authentication credentials must have a minimum of list and read permissions for containers and objects.

• For data write access, write and add permissions are also required.
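The permission requirements above can be checked against an account SAS token before you paste it into the form. The sketch below is a local sanity check only, under stated assumptions: `sas_access_level` is a hypothetical helper, the token is an account SAS whose query string carries the signed permissions in `sp` (`r` read, `l` list, `w` write, `a` add) and the signed resource types in `srt` (`c` container, `o` object), and it does not verify the signature or expiry.

```python
from urllib.parse import parse_qs

READ_PERMISSIONS = {"r", "l"}          # read and list
WRITE_PERMISSIONS = {"w", "a"}         # write and add
REQUIRED_RESOURCE_TYPES = {"c", "o"}   # containers and objects


def sas_access_level(sas_token: str) -> str:
    """Classify an account SAS token as 'insufficient', 'read', or 'read-write'."""
    params = parse_qs(sas_token.lstrip("?"))
    permissions = set(params.get("sp", [""])[0])
    resource_types = set(params.get("srt", [""])[0])

    has_scope = REQUIRED_RESOURCE_TYPES <= resource_types
    if not has_scope or not READ_PERMISSIONS <= permissions:
        return "insufficient"
    if WRITE_PERMISSIONS <= permissions:
        return "read-write"
    return "read"


if __name__ == "__main__":
    # The signature below is a placeholder, not a real credential.
    print(sas_access_level("?sv=2020-08-04&ss=b&srt=co&sp=rl&sig=placeholder"))
```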

Train with datasets

Use your datasets in your machine learning experiments for training ML models. Learn more about how to train with datasets.

Next steps