Use Azure storage with Azure HDInsight clusters
You can store data in Azure Blob storage, Azure Data Lake Storage Gen1, or Azure Data Lake Storage Gen2, or a combination of these options. These storage options enable you to safely delete HDInsight clusters that are used for computation without losing user data.
Apache Hadoop supports a notion of the default file system. The default file system implies a default scheme and authority. It can also be used to resolve relative paths. During the HDInsight cluster creation process, you can specify a blob container in Azure Storage as the default file system. With HDInsight 3.6, you can select either Azure Blob storage or Azure Data Lake Storage Gen1/Azure Data Lake Storage Gen2 as the default file system, with a few exceptions. For the supportability of using Data Lake Storage Gen1 as both the default and linked storage, see Availability for HDInsight cluster.
In this article, you learn how Azure Storage works with HDInsight clusters.
- To learn how Data Lake Storage Gen1 works with HDInsight clusters, see Use Azure Data Lake Storage Gen1 with Azure HDInsight clusters.
- To learn how Data Lake Storage Gen2 works with HDInsight clusters, see Use Azure Data Lake Storage Gen2 with Azure HDInsight clusters.
- For more information about creating an HDInsight cluster, see Create Apache Hadoop clusters in HDInsight.
Important
Storage account kind BlobStorage can only be used as secondary storage for HDInsight clusters.
Storage account kind | Supported services | Supported performance tiers | Not supported performance tiers | Supported access tiers
---|---|---|---|---
StorageV2 (general-purpose v2) | Blob | Standard | Premium | Hot, Cool, Archive*
Storage (general-purpose v1) | Blob | Standard | Premium | N/A
BlobStorage | Blob | Standard | Premium | Hot, Cool, Archive*
We don't recommend that you use the default blob container for storing business data. Deleting the default blob container after each use to reduce storage cost is a good practice. The default container contains application and system logs. Make sure to retrieve the logs before deleting the container.
Sharing one blob container as the default file system for multiple clusters isn't supported.
Note
The Archive access tier is an offline tier with a retrieval latency of several hours, and it isn't recommended for use with HDInsight. For more information, see Archive access tier.
Access files from within the cluster
There are several ways you can access files in Data Lake Storage from an HDInsight cluster. The URI scheme provides unencrypted access (with the wasb: prefix) and TLS-encrypted access (with wasbs:). We recommend using wasbs wherever possible, even when accessing data that lives in the same Azure region.
Using the fully qualified name. With this approach, you provide the full path to the file that you want to access.
wasb://<containername>@<accountname>.blob.core.windows.net/<file.path>/
wasbs://<containername>@<accountname>.blob.core.windows.net/<file.path>/
Using the shortened path format. With this approach, you replace the path up to the cluster root with:
wasb:///<file.path>/
wasbs:///<file.path>/
Using the relative path. With this approach, you provide only the relative path to the file that you want to access.
/<file.path>/
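As a combined illustration, all three forms can point at the same file on the default container. This is a sketch: mycontainer, mystorageaccount, and the file path are hypothetical placeholders.

```shell
# Hypothetical container and account names; substitute your own values.
CONTAINERNAME="mycontainer"
STORAGEACCOUNT="mystorageaccount"
FILEPATH="example/data/sample.log"

# When the container is the cluster's default file system, all three URIs
# below resolve to the same file:
echo "wasbs://${CONTAINERNAME}@${STORAGEACCOUNT}.blob.core.windows.net/${FILEPATH}"
echo "wasbs:///${FILEPATH}"
echo "/${FILEPATH}"
```

Only the fully qualified form is unambiguous outside the default container, which is why the examples below show it alongside the shortened and relative forms.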
Data access examples
The examples are based on an ssh connection to the head node of the cluster and use all three URI schemes. Replace CONTAINERNAME and STORAGEACCOUNT with the relevant values.
A few hdfs commands
Create a file on local storage.
touch testFile.txt
Create directories on cluster storage.
hdfs dfs -mkdir wasbs://CONTAINERNAME@STORAGEACCOUNT.blob.core.windows.net/sampledata1/
hdfs dfs -mkdir wasbs:///sampledata2/
hdfs dfs -mkdir /sampledata3/
Copy data from local storage to cluster storage.
hdfs dfs -copyFromLocal testFile.txt wasbs://CONTAINERNAME@STORAGEACCOUNT.blob.core.windows.net/sampledata1/
hdfs dfs -copyFromLocal testFile.txt wasbs:///sampledata2/
hdfs dfs -copyFromLocal testFile.txt /sampledata3/
List directory contents on cluster storage.
hdfs dfs -ls wasbs://CONTAINERNAME@STORAGEACCOUNT.blob.core.windows.net/sampledata1/
hdfs dfs -ls wasbs:///sampledata2/
hdfs dfs -ls /sampledata3/
Note
When working with blobs outside of HDInsight, most utilities don't recognize the WASB format and instead expect a basic path format, such as example/jars/hadoop-mapreduce-examples.jar.
Creating a Hive table
Three file locations are shown for illustrative purposes. For actual execution, use only one of the LOCATION entries.
DROP TABLE myTable;
CREATE EXTERNAL TABLE myTable (
t1 string,
t2 string,
t3 string,
t4 string,
t5 string,
t6 string,
t7 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION 'wasbs://CONTAINERNAME@STORAGEACCOUNT.blob.core.windows.net/example/data/';
LOCATION 'wasbs:///example/data/';
LOCATION '/example/data/';
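One way to run this DDL from an ssh session on the head node is to save it to a script file and submit it with Beeline. This is a sketch: the JDBC URL shown follows the usual HDInsight head-node convention and is an assumption here, and the commented beeline command requires a live cluster.

```shell
# Save the DDL to a script file (abbreviated to one LOCATION entry).
cat > createTable.hql <<'EOF'
DROP TABLE myTable;
CREATE EXTERNAL TABLE myTable (
    t1 string, t2 string, t3 string, t4 string,
    t5 string, t6 string, t7 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION 'wasbs:///example/data/';
EOF

# On the cluster you would then submit it with Beeline, for example:
# beeline -u 'jdbc:hive2://headnodehost:10001/;transportMode=http' -f createTable.hql
```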
Access files from outside the cluster
Microsoft provides the following tools to work with Azure Storage:
Tool | Linux | OS X | Windows
---|---|---|---
Azure portal | ✔ | ✔ | ✔
Azure CLI | ✔ | ✔ | ✔
Azure PowerShell | | | ✔
AzCopy | ✔ | | ✔
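As a minimal sketch of working with the default container from outside the cluster with the Azure CLI: the account and container names below are hypothetical, and authentication (az login, or an --account-key/--sas-token argument) is assumed to be configured, so the az commands are shown commented.

```shell
# Hypothetical storage account and container names.
ACCOUNT="mystorageaccount"
CONTAINER="mycontainer"

# Upload a local file into the container:
# az storage blob upload --account-name "$ACCOUNT" --container-name "$CONTAINER" \
#     --name example/data/testFile.txt --file testFile.txt

# List the container's contents:
# az storage blob list --account-name "$ACCOUNT" --container-name "$CONTAINER" --output table

echo "target: https://${ACCOUNT}.blob.core.windows.net/${CONTAINER}"
```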
Identify storage path from Ambari
To identify the complete path to the configured default store, navigate to HDFS > Configs and enter fs.defaultFS in the filter input box.

To check whether the wasb store is configured as secondary storage, navigate to HDFS > Configs and enter blob.core.windows.net in the filter input box.
To obtain the path using the Ambari REST API, see Get the default storage.
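A rough sketch of that REST call with curl and jq: the cluster name and admin credentials are placeholders, the endpoint pattern follows the standard HDInsight Ambari URL, and the exact jq filter is an assumption, so the remote call is shown commented.

```shell
# Hypothetical cluster name; substitute your own.
CLUSTERNAME="mycluster"
URL="https://${CLUSTERNAME}.azurehdinsight.net/api/v1/clusters/${CLUSTERNAME}/configurations/service_config_versions?service_name=HDFS&is_current=true"

# With valid cluster-login credentials you would run something like:
# curl -u admin -sS "$URL" \
#   | jq -r '.items[].configurations[].properties["fs.defaultFS"] | select(. != null)'

echo "$URL"
```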
Blob containers
To use blobs, you first create an Azure Storage account. As part of this step, you specify an Azure region where the storage account is created. The cluster and the storage account must be hosted in the same region. The Hive metastore SQL Server database and Apache Oozie metastore SQL Server database must be located in the same region.
Wherever it lives, each blob you create belongs to a container in your Azure Storage account. This container may be an existing container created outside of HDInsight, or it may be a container created for an HDInsight cluster.
The default blob container stores cluster-specific information such as job history and logs. Don't share a default blob container with multiple HDInsight clusters; this action might corrupt the job history. It's recommended to use a different container for each cluster. Put shared data on a linked storage account specified for all relevant clusters rather than the default storage account. For more information on configuring linked storage accounts, see Create HDInsight clusters. However, you can reuse a default storage container after the original HDInsight cluster has been deleted. For HBase clusters, you can keep the HBase table schema and data by creating a new HBase cluster that uses the default blob container of a deleted HBase cluster.
Note
The secure transfer required feature enforces all requests to your account through a secure connection. Only HDInsight cluster version 3.6 or newer supports this feature. For more information, see Create Apache Hadoop cluster with secure transfer storage accounts in Azure HDInsight.
Use additional storage accounts
While creating an HDInsight cluster, you specify the Azure Storage account you want to associate with it. You can also add additional storage accounts, from the same Azure subscription or different Azure subscriptions, during the creation process or after a cluster has been created. For instructions about adding additional storage accounts, see Create HDInsight clusters.
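Because additional storage accounts are not the default file system, their data isn't reachable through the shortened or relative path forms; only the fully qualified wasbs URI works. A sketch, with hypothetical linked account and container names:

```shell
# Hypothetical linked (non-default) storage account and container names.
LINKEDACCOUNT="linkedstorage"
LINKEDCONTAINER="shareddata"

# Only the fully qualified form reaches a non-default storage account:
URI="wasbs://${LINKEDCONTAINER}@${LINKEDACCOUNT}.blob.core.windows.net/"
echo "$URI"

# On the cluster you would list it with:
# hdfs dfs -ls "$URI"
```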
Warning
Using an additional storage account in a different location than the HDInsight cluster isn't supported.
Next steps
In this article, you learned how to use HDFS-compatible Azure Storage with HDInsight. This storage allows you to build adaptable, long-term archiving and data acquisition solutions, and to use HDInsight to unlock the information in the stored structured and unstructured data.
For more information, see:
- Quickstart: Create Apache Hadoop cluster
- Tutorial: Create HDInsight clusters
- Use Azure Data Lake Storage Gen2 with Azure HDInsight clusters
- Upload data to HDInsight
- Tutorial: Extract, transform, and load data using Interactive Query in Azure HDInsight
- Use Azure Storage Shared Access Signatures to restrict access to data with HDInsight