
Compare storage options for use with Azure HDInsight clusters

You can choose between a few different Azure storage services when creating HDInsight clusters:

  • Azure Storage
  • Azure Data Lake Storage Gen2
  • Azure Data Lake Storage Gen1

This article provides an overview of these storage types and their unique features.

The following table summarizes the Azure Storage services that are supported with different versions of HDInsight:

| Storage service | Account type | Namespace type | Supported services | Supported performance tiers | Supported access tiers | HDInsight version | Cluster type |
|---|---|---|---|---|---|---|---|
| Azure Data Lake Storage Gen2 | General-purpose V2 | Hierarchical (filesystem) | Blob | Standard | Hot, Cool, Archive | 3.6+ | All |
| Azure Storage | General-purpose V2 | Object | Blob | Standard | Hot, Cool, Archive | 3.6+ | All |
| Azure Storage | General-purpose V1 | Object | Blob | Standard | N/A | All | All |
| Azure Storage | Blob Storage** | Object | Block Blob | Standard | Hot, Cool, Archive | All | All |
| Azure Data Lake Storage Gen1 | N/A | Hierarchical (filesystem) | N/A | N/A | N/A | 3.6 only | All except HBase |

**For HDInsight clusters, only secondary storage accounts can be of type BlobStorage, and page blobs are not a supported storage option.

For more information on Azure Storage account types, see Azure storage account overview.

For more information on Azure Storage access tiers, see Azure Blob storage: Premium (preview), Hot, Cool, and Archive storage tiers.

You can create a cluster using different combinations of services for primary and optional secondary storage. The following table summarizes the cluster storage configurations that are currently supported in HDInsight:

| HDInsight version | Primary storage | Secondary storage | Supported |
|---|---|---|---|
| 3.6 & 4.0 | General-purpose V1, General-purpose V2 | General-purpose V1, General-purpose V2, BlobStorage (block blobs) | Yes |
| 3.6 & 4.0 | General-purpose V1, General-purpose V2 | Data Lake Storage Gen2 | No |
| 3.6 & 4.0 | General-purpose V1, General-purpose V2 | Data Lake Storage Gen1 | Yes |
| 3.6 & 4.0 | Data Lake Storage Gen2* | Data Lake Storage Gen2 | Yes |
| 3.6 & 4.0 | Data Lake Storage Gen2* | General-purpose V1, General-purpose V2, BlobStorage (block blobs) | Yes |
| 3.6 & 4.0 | Data Lake Storage Gen2 | Data Lake Storage Gen1 | No |
| 3.6 | Data Lake Storage Gen1 | Data Lake Storage Gen1 | Yes |
| 3.6 | Data Lake Storage Gen1 | General-purpose V1, General-purpose V2, BlobStorage (block blobs) | Yes |
| 3.6 | Data Lake Storage Gen1 | Data Lake Storage Gen2 | No |
| 4.0 | Data Lake Storage Gen1 | Any | No |

*This can be one or multiple Data Lake Storage Gen2 accounts, as long as they are all set up to use the same managed identity for cluster access.

Use Azure Data Lake Storage Gen2 with Apache Hadoop in Azure HDInsight

Azure Data Lake Storage Gen2 takes core features from Azure Data Lake Storage Gen1 and integrates them into Azure Blob storage. These features include a Hadoop-compatible file system, Azure Active Directory (Azure AD), and POSIX-based access control lists (ACLs). This combination allows you to take advantage of the performance of Azure Data Lake Storage Gen1 while also using the tiering and data life-cycle management of Blob storage.

For more information on Azure Data Lake Storage Gen2, see Introduction to Azure Data Lake Storage Gen2.

Core functionality of Azure Data Lake Storage Gen2

  • Hadoop-compatible access: In Azure Data Lake Storage Gen2, you can manage and access data just as you would with a Hadoop Distributed File System (HDFS). The Azure Blob File System (ABFS) driver is available in all Apache Hadoop environments, including Azure HDInsight and Azure Databricks. Use ABFS to access data stored in Data Lake Storage Gen2.

  • A superset of POSIX permissions: The security model for Data Lake Storage Gen2 supports ACLs and POSIX permissions, along with some extra granularity specific to Data Lake Storage Gen2. Settings can be configured through admin tools or frameworks like Apache Hive and Apache Spark.

  • Cost effectiveness: Data Lake Storage Gen2 offers low-cost storage capacity and transactions. Features such as the Azure Blob storage life cycle help lower costs by adjusting billing rates as data moves through its life cycle.

  • Compatibility with Blob storage tools, frameworks, and apps: Data Lake Storage Gen2 continues to work with a wide array of tools, frameworks, and applications for Blob storage.

  • Optimized driver: The ABFS driver is optimized specifically for big data analytics. The corresponding REST APIs are surfaced through the distributed file system (DFS) endpoint, dfs.core.windows.net.

What's new for Azure Data Lake Storage Gen2

Managed identities for secure file access

Azure HDInsight uses managed identities to secure cluster access to files in Azure Data Lake Storage Gen2. Managed identities are a feature of Azure Active Directory that provides Azure services with a set of automatically managed credentials. These credentials can be used to authenticate to any service that supports Active Directory authentication. Using managed identities doesn't require you to store credentials in code or configuration files.

For more information, see Managed identities for Azure resources.

Azure Blob File System driver

Apache Hadoop applications natively expect to read and write data from local disk storage. A Hadoop file system driver like ABFS enables Hadoop applications to work with cloud storage by emulating regular Hadoop file system operations. The driver converts the commands it receives from the application into operations that the actual cloud storage platform understands.

Previously, the Hadoop file system driver converted all file system operations to Azure Storage REST API calls on the client side and then invoked the REST API. This client-side conversion, however, resulted in multiple REST API calls for a single file system operation, such as renaming a file. ABFS moves some of the Hadoop file system logic from the client side to the server side, and the Azure Data Lake Storage Gen2 API now runs in parallel with the Blob API. This migration improves performance because common Hadoop file system operations can now be executed with a single REST API call.

For more information, see The Azure Blob Filesystem driver (ABFS): A dedicated Azure Storage driver for Hadoop.

URI scheme for Azure Data Lake Storage Gen2

Azure Data Lake Storage Gen2 uses a new URI scheme to access files in Azure Storage from HDInsight:

abfs://<FILE_SYSTEM_NAME>@<ACCOUNT_NAME>.dfs.core.windows.net/<PATH>

The URI scheme provides SSL-encrypted access.

<FILE_SYSTEM_NAME> identifies the path of the Data Lake Storage Gen2 file system.

<ACCOUNT_NAME> identifies the Azure Storage account name. A fully qualified domain name (FQDN) is required.

<PATH> is the file or directory HDFS path name.

If values for <FILE_SYSTEM_NAME> and <ACCOUNT_NAME> aren't specified, the default file system is used. For files on the default file system, you can use a relative or absolute path. For example, the hadoop-mapreduce-examples.jar file that comes with HDInsight clusters can be referred to by using one of the following paths:

abfs://myfilesystempath@myaccount.dfs.core.windows.net/example/jars/hadoop-mapreduce-examples.jar
abfs:///example/jars/hadoop-mapreduce-examples.jar
/example/jars/hadoop-mapreduce-examples.jar
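To make the URI structure above concrete, here's a minimal Python sketch. The `parse_abfs_uri` helper is hypothetical (it's not part of any Azure SDK); it simply splits an abfs URI into its file system, account, and path components, returning `None` for the file system and account when the default-file-system form (`abfs:///...`) is used:

```python
from urllib.parse import urlparse

def parse_abfs_uri(uri: str):
    """Split an abfs:// URI into (file_system, account, path).

    An empty netloc (abfs:///...) means the cluster's default
    file system and storage account are used.
    """
    parts = urlparse(uri)
    if parts.scheme != "abfs":
        raise ValueError(f"not an abfs URI: {uri}")
    if not parts.netloc:  # abfs:///path -> default file system
        return None, None, parts.path
    file_system, _, host = parts.netloc.partition("@")
    account = host.split(".")[0]  # <account>.dfs.core.windows.net
    return file_system, account, parts.path

print(parse_abfs_uri(
    "abfs://myfilesystempath@myaccount.dfs.core.windows.net"
    "/example/jars/hadoop-mapreduce-examples.jar"
))
# -> ('myfilesystempath', 'myaccount', '/example/jars/hadoop-mapreduce-examples.jar')
```

The names `myfilesystempath` and `myaccount` are the same placeholders used in the path examples above.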

Note

The file name is hadoop-examples.jar in HDInsight version 2.1 and 1.6 clusters. When you're working with files outside of HDInsight, you'll find that most utilities don't recognize the ABFS format and instead expect a basic path format, such as example/jars/hadoop-mapreduce-examples.jar.

For more information, see Use the Azure Data Lake Storage Gen2 URI.

Azure Storage

Azure Storage is a robust general-purpose storage solution that integrates seamlessly with HDInsight. HDInsight can use a blob container in Azure Storage as the default file system for the cluster. Through an HDFS interface, the full set of components in HDInsight can operate directly on structured or unstructured data stored as blobs.

We recommend using separate storage containers for your default cluster storage and your business data, to isolate the HDInsight logs and temporary files from your own business data. We also recommend deleting the default blob container, which contains application and system logs, after each use to reduce storage costs. Make sure to retrieve the logs before deleting the container.

If you choose to secure your storage account with the Firewalls and virtual networks restrictions on Selected networks, be sure to enable the exception Allow trusted Microsoft services... so that HDInsight can access your storage account.

HDInsight storage architecture

The following diagram provides an abstract view of the HDInsight architecture of Azure Storage:

(Diagram: HDInsight storage architecture)

HDInsight provides access to the distributed file system that is locally attached to the compute nodes. This file system can be accessed by using the fully qualified URI, for example:

hdfs://<namenodehost>/<path>

Through HDInsight you can also access data in Azure Storage. The syntax is as follows:

wasb://<containername>@<accountname>.blob.core.windows.net/<path>
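As an illustration of the syntax above, here's a small Python sketch. The `wasb_uri` helper is hypothetical, and the `secure` flag assumes the `wasbs://` variant of the scheme, which is commonly used for TLS-encrypted access:

```python
def wasb_uri(container: str, account: str, path: str, secure: bool = True) -> str:
    """Build a WASB URI for a blob in Azure Storage.

    secure=True uses the wasbs:// scheme (TLS-encrypted access);
    the container and account names passed in are placeholders.
    """
    scheme = "wasbs" if secure else "wasb"
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"

print(wasb_uri("mycontainer", "myaccount", "/data/input.csv", secure=False))
# -> wasb://mycontainer@myaccount.blob.core.windows.net/data/input.csv
```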

Consider the following principles when using an Azure Storage account with HDInsight clusters:

  • Containers in the storage accounts that are connected to a cluster: Because the account name and key are associated with the cluster during creation, you have full access to the blobs in those containers.

  • Public containers or public blobs in storage accounts that are not connected to a cluster: You have read-only permission to the blobs in the containers.

    Note

    Public containers allow you to get a list of all blobs that are available in that container and to get container metadata. Public blobs allow you to access the blobs only if you know the exact URL. For more information, see Manage anonymous read access to containers and blobs.

  • Private containers in storage accounts that are not connected to a cluster: You can't access the blobs in the containers unless you define the storage account when you submit the WebHCat jobs.

The storage accounts that are defined in the creation process, along with their keys, are stored in %HADOOP_HOME%/conf/core-site.xml on the cluster nodes. By default, HDInsight uses the storage accounts defined in the core-site.xml file. You can modify this setting by using Apache Ambari.
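To make the core-site.xml lookup concrete, here's a minimal sketch. The XML fragment, the container/account names, and the `get_property` helper are illustrative only, not copied from a real cluster; Hadoop's core-site.xml does use this `<property><name>/<value>` layout:

```python
import xml.etree.ElementTree as ET

# Illustrative core-site.xml fragment; the account and container
# names are placeholders, not real Azure resources.
CORE_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>wasb://mycontainer@myaccount.blob.core.windows.net</value>
  </property>
</configuration>"""

def get_property(xml_text: str, name: str):
    """Return the value of a named Hadoop property, or None if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None

print(get_property(CORE_SITE, "fs.defaultFS"))
# -> wasb://mycontainer@myaccount.blob.core.windows.net
```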

Multiple WebHCat jobs, including Apache Hive, MapReduce, Apache Hadoop streaming, and Apache Pig, can carry a description of storage accounts and metadata with them. (This is currently true for Pig with storage accounts but not for metadata.) For more information, see Using an HDInsight cluster with alternate storage accounts and metastores.

Blobs can be used for structured and unstructured data. Blob containers store data as key/value pairs and have no directory hierarchy. However, the key name can include a slash character (/) to make it appear as if a file is stored within a directory structure. For example, a blob's key can be input/log1.txt. No actual input directory exists, but because of the slash character in the key name, the key looks like a file path.
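The flat key/value model can be sketched as follows. The keys and the `virtual_dirs` helper are illustrative: it groups flat blob keys by their first path segment, the way storage tools render "virtual folders" for a container that has no real directories:

```python
# Blob storage keys are flat; slashes in key names merely suggest
# a hierarchy. These sample keys mimic a container's contents.
keys = [
    "input/log1.txt",
    "input/log2.txt",
    "output/result.txt",
    "readme.txt",
]

def virtual_dirs(keys):
    """Return the top-level 'directory' names implied by slash-separated keys."""
    return sorted({k.split("/", 1)[0] for k in keys if "/" in k})

print(virtual_dirs(keys))  # -> ['input', 'output']
```

Note that `readme.txt` appears in no "directory" at all: without a slash in its key, there is no hierarchy to infer.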

Benefits of Azure Storage

Compute clusters and storage resources that aren't colocated carry an implied performance cost. These costs are mitigated by creating the compute clusters close to the storage account resources inside the Azure region, where the compute nodes can efficiently access the data over the high-speed network inside Azure Storage.

When you store the data in Azure Storage instead of HDFS, you get several benefits:

  • Data reuse and sharing: The data in HDFS is located inside the compute cluster. Only the applications that have access to the compute cluster can use the data through HDFS APIs. The data in Azure Storage, by contrast, can be accessed through either the HDFS APIs or the Blob storage REST APIs. Because of this arrangement, a larger set of applications (including other HDInsight clusters) and tools can be used to produce and consume the data.

  • Data archiving: When data is stored in Azure Storage, the HDInsight clusters used for computation can be safely deleted without losing user data.

  • Data storage cost: Storing data in DFS for the long term is more costly than storing the data in Azure Storage, because the cost of a compute cluster is higher than the cost of Azure Storage. Also, because the data doesn't have to be reloaded for every compute cluster generation, you save data-loading costs as well.

  • Elastic scale-out: Although HDFS provides you with a scaled-out file system, the scale is determined by the number of nodes that you create for your cluster. Changing the scale can be more complicated than relying on the elastic scaling capabilities that you get automatically in Azure Storage.

  • Geo-replication: Your Azure Storage can be geo-replicated. Although geo-replication gives you geographic recovery and data redundancy, a failover to the geo-replicated location severely affects performance and can incur additional costs. So choose geo-replication cautiously, and only if the value of the data justifies the additional cost.

Certain MapReduce jobs and packages might create intermediate results that you wouldn't want to store in Azure Storage. In that case, you can choose to store the data in the local HDFS. HDInsight uses DFS for several of these intermediate results in Hive jobs and other processes.

Note

Most HDFS commands (for example, ls, copyFromLocal, and mkdir) work as expected in Azure Storage. Only the commands that are specific to the native HDFS implementation (which is referred to as DFS), such as fschk and dfsadmin, show different behavior in Azure Storage.

Overview of Azure Data Lake Storage Gen1

Azure Data Lake Storage Gen1 is an enterprise-wide hyperscale repository for big data analytic workloads. Using Azure Data Lake, you can capture data of any size, type, and ingestion speed in one place for operational and exploratory analytics.

Access Data Lake Storage Gen1 from Hadoop (available with an HDInsight cluster) by using the WebHDFS-compatible REST APIs. Data Lake Storage Gen1 is designed to enable analytics on the stored data and is tuned for performance in data analytics scenarios. Out of the box, it includes the capabilities that are essential for real-world enterprise use cases: security, manageability, scalability, reliability, and availability.

For more information on Azure Data Lake Storage Gen1, see the detailed Overview of Azure Data Lake Storage Gen1.

The key capabilities of Data Lake Storage Gen1 include the following.

Compatibility with Hadoop

Data Lake Storage Gen1 is an Apache Hadoop file system that is compatible with HDFS and works with the Hadoop ecosystem. Your existing HDInsight applications or services that use the WebHDFS API can easily integrate with Data Lake Storage Gen1. Data Lake Storage Gen1 also exposes a WebHDFS-compatible REST interface for applications.

Data stored in Data Lake Storage Gen1 can be easily analyzed using Hadoop analytic frameworks such as MapReduce or Hive. Azure HDInsight clusters can be provisioned and configured to directly access data stored in Data Lake Storage Gen1.

Unlimited storage, petabyte files

Data Lake Storage Gen1 provides unlimited storage and is suitable for storing a variety of data for analytics. It doesn't impose limits on account sizes, file sizes, or the amount of data that can be stored in a data lake. Individual files can range in size from kilobytes to petabytes, making Data Lake Storage Gen1 a great choice for storing any type of data. Data is stored durably by making multiple copies, and there are no limits on how long the data can be stored in the data lake.

Performance tuning for big data analytics

Data Lake Storage Gen1 is built to run large-scale analytic systems that require massive throughput to query and analyze large amounts of data. The data lake spreads parts of a file over several individual storage servers. When you're analyzing data, this setup improves the read throughput when the file is read in parallel.

Readiness for enterprise: Highly available and secure

Data Lake Storage Gen1 provides industry-standard availability and reliability. Data assets are stored durably: redundant copies guard against unexpected failures. Enterprises can use Data Lake Storage Gen1 in their solutions as an important part of their existing data platform.

Data Lake Storage Gen1 also provides enterprise-grade security for stored data. For more information, see Securing data in Azure Data Lake Storage Gen1.

Flexible data structures

Data Lake Storage Gen1 can store any data in its native format, as is, without requiring prior transformations. Data Lake Storage Gen1 doesn't require a schema to be defined before the data is loaded; the individual analytic framework interprets the data and defines a schema at the time of analysis. Because it can store files of arbitrary sizes and formats, Data Lake Storage Gen1 can handle structured, semistructured, and unstructured data.

Data Lake Storage Gen1 containers for data are essentially folders and files. You operate on the stored data by using SDKs, the Azure portal, and Azure PowerShell. As long as you put your data into the store by using these interfaces and the appropriate containers, you can store any type of data. Data Lake Storage Gen1 doesn't perform any special handling of data based on the type of data it stores.

Data security in Data Lake Storage Gen1

Data Lake Storage Gen1 uses Azure Active Directory for authentication and uses access control lists (ACLs) to manage access to your data.

| Feature | Description |
|---|---|
| Authentication | Data Lake Storage Gen1 integrates with Azure Active Directory (Azure AD) for identity and access management for all the data stored in Data Lake Storage Gen1. Because of the integration, Data Lake Storage Gen1 benefits from all Azure AD features, including multifactor authentication, Conditional Access, role-based access control, application usage monitoring, security monitoring and alerting, and so on. Data Lake Storage Gen1 supports the OAuth 2.0 protocol for authentication within the REST interface. See Authentication within Azure Data Lake Storage Gen1 using Azure Active Directory. |
| Access control | Data Lake Storage Gen1 provides access control by supporting POSIX-style permissions that are exposed by the WebHDFS protocol. ACLs can be enabled on the root folder, on subfolders, and on individual files. For more information on how ACLs work in the context of Data Lake Storage Gen1, see Access control in Data Lake Storage Gen1. |
| Encryption | Data Lake Storage Gen1 also provides encryption for data that is stored in the account. You specify the encryption settings while creating a Data Lake Storage Gen1 account, and you can choose to have your data encrypted or opt for no encryption. For more information, see Encryption in Data Lake Storage Gen1. For instructions on how to provide an encryption-related configuration, see Get started with Azure Data Lake Storage Gen1 using the Azure portal. |

To learn more about securing data in Data Lake Storage Gen1, see Securing data stored in Azure Data Lake Storage Gen1.

Applications that are compatible with Data Lake Storage Gen1

Data Lake Storage Gen1 is compatible with most open-source components in the Hadoop ecosystem. It also integrates nicely with other Azure services. Follow the links below to learn more about how Data Lake Storage Gen1 can be used with both open-source components and other Azure services.

Data Lake Storage Gen1 file system (adl://)

In Hadoop environments (available with an HDInsight cluster), you can access Data Lake Storage Gen1 through the new file system, the AzureDataLakeFilesystem (adl://). The performance of applications and services that use adl:// can be optimized in ways that aren't currently available in WebHDFS. As a result, when you use Data Lake Storage Gen1, you have the flexibility to either get the best performance by using the recommended adl:// or maintain existing code by continuing to use the WebHDFS API directly. Azure HDInsight takes full advantage of the AzureDataLakeFilesystem to provide the best performance with Data Lake Storage Gen1.

Access your data in Data Lake Storage Gen1 by using the following URL:

adl://<data_lake_storage_gen1_name>.azuredatalakestore.net

For more information on how to access the data in Data Lake Storage Gen1, see Actions available on the stored data.
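Mirroring the adl:// pattern above, here's a brief Python sketch. The `adl_uri` helper and the store name are illustrative placeholders, not part of any Azure SDK:

```python
def adl_uri(store_name: str, path: str = "") -> str:
    """Build an adl:// URI for a Data Lake Storage Gen1 account.

    The store name passed in is a placeholder, not a real account.
    """
    return f"adl://{store_name}.azuredatalakestore.net/{path.lstrip('/')}"

print(adl_uri("mydatalakestore", "/clusters/data.csv"))
# -> adl://mydatalakestore.azuredatalakestore.net/clusters/data.csv
```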

Next steps