Use Azure Data Lake Storage Gen2 with Azure HDInsight clusters

Azure Data Lake Storage Gen2 is a cloud storage service dedicated to big data analytics, built on Azure Blob storage. It combines the capabilities of Azure Blob storage and Azure Data Lake Storage Gen1. The resulting service offers features from Data Lake Storage Gen1, such as file system semantics, directory-level and file-level security, and scalability, along with the low-cost tiered storage, high availability, and disaster-recovery capabilities of Azure Blob storage.

Data Lake Storage Gen2 availability

Data Lake Storage Gen2 is available as a storage option for almost all Azure HDInsight cluster types, as both a default and an additional storage account. HBase, however, can have only one Data Lake Storage Gen2 account.

For a full comparison of cluster creation options using Data Lake Storage Gen2, see Compare storage options for use with Azure HDInsight clusters.

Note

After you select Data Lake Storage Gen2 as your primary storage type, you can't select a Data Lake Storage Gen1 account as additional storage.

Create a cluster with Data Lake Storage Gen2 through the Azure portal

To create an HDInsight cluster that uses Data Lake Storage Gen2 for storage, follow these steps to configure a Data Lake Storage Gen2 account.

Create a user-assigned managed identity

Create a user-assigned managed identity, if you don't already have one.

  1. Sign in to the Azure portal.
  2. In the upper-left corner, click Create a resource.
  3. In the search box, type user assigned, and then click User Assigned Managed Identity.
  4. Click Create.
  5. Enter a name for your managed identity, and select the correct subscription, resource group, and location.
  6. Click Create.

For more information on how managed identities work in Azure HDInsight, see Managed identities in Azure HDInsight.

Create a Data Lake Storage Gen2 account

Create an Azure Data Lake Storage Gen2 storage account.

  1. Sign in to the Azure portal.
  2. In the upper-left corner, click Create a resource.
  3. In the search box, type storage, and then click Storage account.
  4. Click Create.
  5. On the Create storage account screen:
    1. Select the correct subscription and resource group.
    2. Enter a name for your Data Lake Storage Gen2 account. For more information on storage account naming conventions, see Naming conventions for Azure resources.
    3. Click the Advanced tab.
    4. Under Data Lake Storage Gen2, click Enabled next to Hierarchical namespace.
    5. Click Review + create.
    6. Click Create.

For more information on other options during storage account creation, see Quickstart: Create an Azure Data Lake Storage Gen2 storage account.

Screenshot showing storage account creation in the Azure portal

Set up permissions for the managed identity on the Data Lake Storage Gen2 account

Assign the managed identity to the Storage Blob Data Owner role on the storage account.

  1. In the Azure portal, go to your storage account.

  2. Select your storage account, then select Access control (IAM) to display the access control settings for the account. Select the Role assignments tab to see the list of role assignments.

    Screenshot showing the storage access control settings

  3. Select the + Add role assignment button to add a new role.

  4. In the Add role assignment window, select the Storage Blob Data Owner role. Then, select the subscription that has the managed identity and storage account. Next, search for the user-assigned managed identity that you created previously. Finally, select the managed identity; it is then listed under Selected members.

    Screenshot showing how to assign an RBAC role

  5. Select Save. The user-assigned identity that you selected is now listed under the selected role.

  6. After this initial setup is complete, you can create a cluster through the portal. The cluster must be in the same Azure region as the storage account. In the Storage section of the cluster creation menu, select the following options:

    • For Primary storage type, select Azure Data Lake Storage Gen2.

    • Under Select a Storage account, search for and select the newly created Data Lake Storage Gen2 storage account.

      Screenshot showing the storage settings for using Data Lake Storage Gen2 with Azure HDInsight

    • Under Identity, select the correct subscription and the newly created user-assigned managed identity.

      Screenshot showing the identity settings for using Data Lake Storage Gen2 with HDInsight

Note

  • To add a secondary Data Lake Storage Gen2 account, assign the managed identity created earlier, at the storage account level, to the new Data Lake Storage Gen2 storage account that you want to add. Adding a secondary Data Lake Storage Gen2 account through the "Additional storage accounts" blade on HDInsight isn't supported.
  • You can enable RA-GRS or RA-ZRS on the Azure storage account that HDInsight uses. However, creating a cluster against the RA-GRS or RA-ZRS secondary endpoint isn't supported.

Create a cluster with Data Lake Storage Gen2 through the Azure CLI

You can download a sample template file and a sample parameters file. Before using the template and the Azure CLI code snippet below, replace the following placeholders with their correct values:

Placeholder              Description
<SUBSCRIPTION_ID>        The ID of your Azure subscription.
<RESOURCEGROUPNAME>      The resource group where you want the new cluster and storage account created.
<MANAGEDIDENTITYNAME>    The name of the managed identity that will be given permissions on your Azure Data Lake Storage Gen2 account.
<STORAGEACCOUNTNAME>     The new Azure Data Lake Storage Gen2 account that will be created.
<CLUSTERNAME>            The name of your HDInsight cluster.
<PASSWORD>               Your chosen password for signing in to the cluster through SSH and the Ambari dashboard.

The code snippet below does the following initial steps:

  1. Logs in to your Azure account.
  2. Sets the active subscription where the create operations will be done.
  3. Creates a new resource group for the new deployment activities.
  4. Creates a user-assigned managed identity.
  5. Adds an extension to the Azure CLI to use features for Data Lake Storage Gen2.
  6. Creates a new Data Lake Storage Gen2 account by using the --hierarchical-namespace true flag.

az login
az account set --subscription <SUBSCRIPTION_ID>

# Create resource group
az group create --name <RESOURCEGROUPNAME> --location eastus

# Create managed identity
az identity create -g <RESOURCEGROUPNAME> -n <MANAGEDIDENTITYNAME>

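# Add the Azure CLI extension that provides Data Lake Storage Gen2 features
az extension add --name storage-preview

# Create the Data Lake Storage Gen2 account (hierarchical namespace enabled)
az storage account create --name <STORAGEACCOUNTNAME> \
    --resource-group <RESOURCEGROUPNAME> \
    --location eastus --sku Standard_LRS \
    --kind StorageV2 --hierarchical-namespace true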

Next, sign in to the portal. Add the new user-assigned managed identity to the Storage Blob Data Owner role on the storage account, as described in the portal section above (Set up permissions for the managed identity on the Data Lake Storage Gen2 account).
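The role assignment can also be scripted rather than done in the portal. The following is a minimal sketch, not part of the original procedure. The function names storage_account_scope and assign_storage_role are hypothetical helpers introduced here for illustration. The sketch assumes that you are already signed in with az login and that your account is allowed to create role assignments on the storage account. The role name is passed as an argument; the commented example uses the Storage Blob Data Owner role named in the portal steps.

```shell
# Build the Azure resource ID of a storage account, used as the role-assignment scope.
# Arguments: subscription ID, resource group, storage account name.
storage_account_scope() {
    printf '/subscriptions/%s/resourceGroups/%s/providers/Microsoft.Storage/storageAccounts/%s\n' "$1" "$2" "$3"
}

# Assign a role on the storage account to the user-assigned managed identity.
# Arguments: subscription ID, resource group, identity name, storage account name, role name.
# Assumption: the identity and the storage account live in the same resource group.
assign_storage_role() {
    local principal_id scope
    # Look up the object ID of the managed identity's service principal.
    principal_id=$(az identity show --resource-group "$2" --name "$3" \
        --query principalId --output tsv) || return 1
    scope=$(storage_account_scope "$1" "$2" "$4")
    az role assignment create \
        --assignee-object-id "$principal_id" \
        --assignee-principal-type ServicePrincipal \
        --role "$5" \
        --scope "$scope"
}

# Example call (uncomment after replacing the placeholders):
# assign_storage_role "<SUBSCRIPTION_ID>" "<RESOURCEGROUPNAME>" "<MANAGEDIDENTITYNAME>" \
#     "<STORAGEACCOUNTNAME>" "Storage Blob Data Owner"
```

Scoping the assignment to the single storage account, instead of the resource group or subscription, keeps the identity's permissions as narrow as the portal steps do.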

After you've assigned the role for the user-assigned managed identity, deploy the template by using the following code snippet.

az group deployment create --name HDInsightADLSGen2Deployment \
    --resource-group <RESOURCEGROUPNAME> \
    --template-file hdinsight-adls-gen2-template.json \
    --parameters parameters.json

Create a cluster with Data Lake Storage Gen2 through Azure PowerShell

Using PowerShell to create an HDInsight cluster with Azure Data Lake Storage Gen2 isn't currently supported.

Access control for Data Lake Storage Gen2 in HDInsight

What kinds of permissions does Data Lake Storage Gen2 support?

Data Lake Storage Gen2 uses an access control model that supports both role-based access control (RBAC) and POSIX-like access control lists (ACLs). Data Lake Storage Gen1 supports only access control lists for controlling access to data.

RBAC uses role assignments to apply sets of permissions to users, groups, and service principals for Azure resources. Typically, those Azure resources are constrained to top-level resources (for example, Azure storage accounts). For Azure Storage, and also Data Lake Storage Gen2, this mechanism has been extended to the file system resource.

For more information about file permissions with RBAC, see Azure role-based access control (RBAC).

For more information about file permissions with ACLs, see Access control lists on files and directories.

How do I control access to my data in Data Lake Storage Gen2?

Your HDInsight cluster's ability to access files in Data Lake Storage Gen2 is controlled through managed identities. A managed identity is an identity registered in Azure Active Directory (Azure AD) whose credentials are managed by Azure. With managed identities, you don't need to register service principals in Azure AD or maintain credentials such as certificates.

Azure services have two types of managed identities: system-assigned and user-assigned. HDInsight uses user-assigned managed identities to access Data Lake Storage Gen2. A user-assigned managed identity is created as a standalone Azure resource. Through a create process, Azure creates an identity in the Azure AD tenant that's trusted by the subscription in use. After the identity is created, it can be assigned to one or more Azure service instances.

The lifecycle of a user-assigned identity is managed separately from the lifecycle of the Azure service instances to which it's assigned. For more information about managed identities, see How do the managed identities for Azure resources work?

How do I set permissions for Azure AD users to query data in Data Lake Storage Gen2 by using Hive or other services?

To set permissions for users to query data, use Azure AD security groups as the assigned principal in ACLs. Don't directly assign file-access permissions to individual users or service principals. When you use Azure AD security groups to control the flow of permissions, you can add and remove users or service principals without reapplying ACLs to an entire directory structure. You only have to add or remove the users from the appropriate Azure AD security group. ACLs aren't inherited, so reapplying ACLs requires updating the ACL on every file and subdirectory.
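As an illustration of granting access to a security group instead of to individual users, the sketch below applies one group ACL entry to a directory tree from the cluster head node. It is a hedged example, not a procedure from this article: group_acl_entry and grant_group_access are hypothetical helper names, the object ID is a made-up placeholder, and the sketch assumes that the ABFS driver on your cluster accepts the Azure AD group's object ID as the group name in hdfs dfs -setfacl. The -R flag is used because ACLs aren't inherited by existing files and subdirectories.

```shell
# Build an ACL entry that grants a permission set to an Azure AD security group.
# Arguments: group object ID, permissions (for example r-x).
group_acl_entry() {
    printf 'group:%s:%s\n' "$1" "$2"
}

# Apply the entry to a directory and to everything below it.
# Arguments: group object ID, permissions, path on cluster storage.
grant_group_access() {
    hdfs dfs -setfacl -R -m "$(group_acl_entry "$1" "$2")" "$3"
}

# Example call (hypothetical object ID; run on the cluster head node):
# grant_group_access "00000000-0000-0000-0000-000000000000" "r-x" "abfs:///example/data/"
```

After the group entry is in place, later membership changes are made in Azure AD only, and the ACLs on storage stay untouched.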

Access files from the cluster

There are several ways you can access the files in Data Lake Storage Gen2 from an HDInsight cluster.

  • Using the fully qualified name. With this approach, you provide the full path to the file that you want to access.

    abfs://<containername>@<accountname>.dfs.core.windows.net/<file.path>/
    
  • Using the shortened path format. With this approach, you replace the path up to the cluster root with:

    abfs:///<file.path>/
    
  • Using the relative path. With this approach, you provide only the relative path to the file that you want to access.

    /<file.path>/
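To make the relationship between the three forms concrete, here is a small sketch that assembles the fully qualified form from its parts. The function name abfs_uri and the sample container and account names are hypothetical and used only for illustration. The shortened and relative forms correspond to the same path with the container and account taken from the cluster's default storage.

```shell
# Build a fully qualified abfs:// URI from its parts.
# Arguments: container name, storage account name, path inside the container.
abfs_uri() {
    local container="$1" account="$2" path="${3#/}"   # drop one leading slash, if present
    printf 'abfs://%s@%s.dfs.core.windows.net/%s\n' "$container" "$account" "$path"
}

# Prints: abfs://mycontainer@mystorageaccount.dfs.core.windows.net/example/data/
abfs_uri mycontainer mystorageaccount /example/data/
```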
    

Data access examples

Examples are based on an ssh connection to the head node of the cluster. The examples use all three URI schemes. Replace CONTAINERNAME and STORAGEACCOUNT with the relevant values.

A few hdfs commands

  1. Create a simple file on local storage.

    touch testFile.txt
    
  2. Create directories on cluster storage.

    hdfs dfs -mkdir abfs://CONTAINERNAME@STORAGEACCOUNT.dfs.core.windows.net/sampledata1/
    hdfs dfs -mkdir abfs:///sampledata2/
    hdfs dfs -mkdir /sampledata3/
    
  3. Copy data from local storage to cluster storage.

    hdfs dfs -copyFromLocal testFile.txt  abfs://CONTAINERNAME@STORAGEACCOUNT.dfs.core.windows.net/sampledata1/
    hdfs dfs -copyFromLocal testFile.txt  abfs:///sampledata2/
    hdfs dfs -copyFromLocal testFile.txt  /sampledata3/
    
  4. List directory contents on cluster storage.

    hdfs dfs -ls abfs://CONTAINERNAME@STORAGEACCOUNT.dfs.core.windows.net/sampledata1/
    hdfs dfs -ls abfs:///sampledata2/
    hdfs dfs -ls /sampledata3/
    

Creating a Hive table

Three file locations are shown for illustrative purposes. For actual execution, use only one of the LOCATION entries.

DROP TABLE myTable;
CREATE EXTERNAL TABLE myTable (
    t1 string,
    t2 string,
    t3 string,
    t4 string,
    t5 string,
    t6 string,
    t7 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
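-- Use exactly one LOCATION clause; the two commented lines are alternative forms of the same path.
LOCATION 'abfs://CONTAINERNAME@STORAGEACCOUNT.dfs.core.windows.net/example/data/';
-- LOCATION 'abfs:///example/data/';
-- LOCATION '/example/data/';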

Next steps