Introduction to Azure Data Lake Storage Gen2

Azure Data Lake Storage Gen2 is a set of capabilities dedicated to big data analytics, built on Azure Blob storage. Data Lake Storage Gen2 is the result of converging the capabilities of our two existing storage services, Azure Blob storage and Azure Data Lake Storage Gen1. Features from Azure Data Lake Storage Gen1, such as file system semantics, directory- and file-level security, and scale, are combined with the low-cost, tiered storage and high availability/disaster recovery capabilities of Azure Blob storage.

Designed for enterprise big data analytics

Data Lake Storage Gen2 makes Azure Storage the foundation for building enterprise data lakes on Azure. Designed from the start to service multiple petabytes of information while sustaining hundreds of gigabits of throughput, Data Lake Storage Gen2 allows you to easily manage massive amounts of data.

A fundamental part of Data Lake Storage Gen2 is the addition of a hierarchical namespace to Blob storage. The hierarchical namespace organizes objects/files into a hierarchy of directories for efficient data access. A common object store naming convention uses slashes in the name to mimic a hierarchical directory structure. This structure becomes real with Data Lake Storage Gen2. Operations such as renaming or deleting a directory become single atomic metadata operations on the directory, rather than requiring the service to enumerate and process every object that shares the directory's name prefix.
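The difference between the two rename strategies can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: the `FlatStore` and `HierarchicalStore` classes and the `logs/2019` paths are hypothetical and exist purely for illustration; they do not reflect how Azure Storage is implemented internally.

```python
# Toy in-memory models for illustration only; not the Azure Storage implementation.

class FlatStore:
    """Flat object store: keys are full names, slashes are just characters."""

    def __init__(self):
        self.objects = {}

    def put(self, name, data):
        self.objects[name] = data

    def rename_prefix(self, old, new):
        """Rename a 'directory' by rewriting every key that shares the prefix."""
        touched = 0
        for name in [n for n in self.objects if n.startswith(old + "/")]:
            self.objects[new + name[len(old):]] = self.objects.pop(name)
            touched += 1
        return touched  # one operation per object under the prefix


class HierarchicalStore:
    """Hierarchical namespace: directories are real nodes holding their children."""

    def __init__(self):
        self.root = {}

    def put(self, path, data):
        *dirs, leaf = path.split("/")
        node = self.root
        for d in dirs:
            node = node.setdefault(d, {})
        node[leaf] = data

    def rename_dir(self, parent, old, new):
        """Rename a directory by relinking a single entry in its parent."""
        parent[new] = parent.pop(old)
        return 1  # a single metadata update, regardless of directory size


flat, tree = FlatStore(), HierarchicalStore()
for i in range(1000):
    flat.put(f"logs/2019/part-{i}.csv", b"")
    tree.put(f"logs/2019/part-{i}.csv", b"")

flat_ops = flat.rename_prefix("logs/2019", "logs/archive")
tree_ops = tree.rename_dir(tree.root["logs"], "2019", "archive")
print(flat_ops, tree_ops)  # 1000 1
```

In the flat model the cost of a rename grows with the number of objects under the prefix, and a failure partway through leaves the store half-renamed. In the hierarchical model the rename is one update, which is what makes it natural to treat as atomic.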

In the past, cloud-based analytics had to compromise in areas of performance, management, and security. Data Lake Storage Gen2 addresses each of these aspects in the following ways:

  • Performance is optimized because you do not need to copy or transform data as a prerequisite for analysis. The hierarchical namespace greatly improves the performance of directory management operations, which improves overall job performance.

  • Management is easier because you can organize and manipulate files through directories and subdirectories.

  • Security is enforceable because you can define POSIX permissions on directories or individual files.

  • Cost-effectiveness is possible because Data Lake Storage Gen2 is built on top of low-cost Azure Blob storage. The additional features further lower the total cost of ownership for running big data analytics on Azure.
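To make the POSIX permissions mentioned in the list above more concrete, the following sketch parses a short-form ACL string and checks whether a caller holds a given permission. Several things here are assumptions rather than facts from this article: the `scope:qualifier:perms` string layout, the helper names `parse_acl` and `is_allowed`, and the identities `alice`, `bob`, `carol`, `dave`, and `analysts` are all hypothetical. The evaluation order is a simplified POSIX-style one that deliberately omits the ACL mask and superuser rules, so it should not be read as the service's actual enforcement logic.

```python
def parse_acl(acl):
    """Parse 'scope:qualifier:perms' entries into a dict keyed by (scope, qualifier)."""
    entries = {}
    for entry in acl.split(","):
        scope, qualifier, perms = entry.split(":")
        entries[(scope, qualifier)] = perms
    return entries


def is_allowed(entries, user, groups, owner, owning_group, wanted):
    """Return True if `user` holds the `wanted` permission ('r', 'w' or 'x').

    Simplified evaluation order: owning user, named user, groups, other.
    The ACL mask and superuser rules are deliberately left out.
    """
    if user == owner:
        return wanted in entries[("user", "")]
    if ("user", user) in entries:
        return wanted in entries[("user", user)]
    group_perms = []
    if owning_group in groups:
        group_perms.append(entries[("group", "")])
    group_perms += [entries[("group", g)] for g in groups if ("group", g) in entries]
    if group_perms:
        # Once any group entry matches, 'other' is never consulted.
        return any(wanted in p for p in group_perms)
    return wanted in entries[("other", "")]


# Hypothetical directory owned by user 'bob' and group 'analysts'.
acl = parse_acl("user::rwx,user:alice:r-x,group::r-x,other::---")

print(is_allowed(acl, "alice", [], "bob", "analysts", "r"))            # True
print(is_allowed(acl, "alice", [], "bob", "analysts", "w"))            # False
print(is_allowed(acl, "carol", ["analysts"], "bob", "analysts", "r"))  # True
print(is_allowed(acl, "dave", [], "bob", "analysts", "r"))             # False
```

Because the same entry format applies to directories and to individual files, one check like this can be applied at any level of the hierarchy.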

Key features of Data Lake Storage Gen2

  • Hadoop-compatible access: Data Lake Storage Gen2 allows you to manage and access data just as you would with a Hadoop Distributed File System (HDFS). The new ABFS driver is available within all Apache Hadoop environments, including Azure HDInsight, Azure Databricks, and SQL Data Warehouse, to access data stored in Data Lake Storage Gen2.

  • A superset of POSIX permissions: The security model for Data Lake Storage Gen2 supports ACL and POSIX permissions along with some extra granularity specific to Data Lake Storage Gen2. Settings may be configured through Storage Explorer or through frameworks like Hive and Spark.

  • Cost effective: Data Lake Storage Gen2 offers low-cost storage capacity and transactions. As data transitions through its complete lifecycle, billing rates change, keeping costs to a minimum through built-in features such as Azure Blob storage lifecycle.

  • Optimized driver: The ABFS driver is optimized specifically for big data analytics. The corresponding REST APIs are surfaced through the endpoint dfs.core.windows.net.
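As an illustration of how the ABFS driver and the dfs.core.windows.net endpoint address the same data, the sketch below builds both kinds of address for one path. The URI shape `abfs[s]://<file_system>@<account>.dfs.core.windows.net/<path>` is the form used by the Hadoop ABFS driver. The helper names `abfs_uri` and `dfs_rest_url`, and the account and file system names `myaccount` and `raw`, are hypothetical and chosen only for this example.

```python
from urllib.parse import quote

DFS_ENDPOINT_SUFFIX = "dfs.core.windows.net"  # endpoint named in the text above


def abfs_uri(account, file_system, path="", secure=True):
    """Build the URI a Hadoop-compatible framework would use with the ABFS driver."""
    scheme = "abfss" if secure else "abfs"  # abfss = ABFS over TLS
    clean = quote(path.lstrip("/"))         # keeps '/' and percent-encodes the rest
    return f"{scheme}://{file_system}@{account}.{DFS_ENDPOINT_SUFFIX}/{clean}"


def dfs_rest_url(account, file_system, path=""):
    """Build the HTTPS URL for the same path on the dfs endpoint."""
    clean = quote(path.lstrip("/"))
    return f"https://{account}.{DFS_ENDPOINT_SUFFIX}/{file_system}/{clean}"


# 'myaccount' and 'raw' are placeholder names, not real resources.
print(abfs_uri("myaccount", "raw", "sales/2019/data.csv"))
# abfss://raw@myaccount.dfs.core.windows.net/sales/2019/data.csv
print(dfs_rest_url("myaccount", "raw", "sales/2019/data.csv"))
# https://myaccount.dfs.core.windows.net/raw/sales/2019/data.csv
```

A framework such as Spark would typically be handed the first form as an input or output location, while tools that call the REST APIs directly would use the second.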

Scalability

Azure Storage is scalable by design, whether you access it through the Data Lake Storage Gen2 or Blob storage interfaces. It is able to store and serve many exabytes of data. This amount of storage is available with throughput measured in gigabits per second (Gbps) at high levels of input/output operations per second (IOPS). Beyond just persistence, processing is executed at near-constant per-request latencies that are measured at the service, account, and file levels.

Cost effectiveness

One of the many benefits of building Data Lake Storage Gen2 on top of Azure Blob storage is the low cost of storage capacity and transactions. Unlike other cloud storage services, data stored in Data Lake Storage Gen2 is not required to be moved or transformed prior to performing analysis. For more information about pricing, see Azure Storage pricing.

Additionally, features such as the hierarchical namespace significantly improve the overall performance of many analytics jobs. This improvement in performance means that you require less compute power to process the same amount of data, resulting in a lower total cost of ownership (TCO) for the end-to-end analytics job.

One service, multiple concepts

Data Lake Storage Gen2 is an additional capability for big data analytics, built on top of Azure Blob storage. While there are many benefits in leveraging existing platform components of Blobs to create and operate data lakes for analytics, it does lead to multiple concepts describing the same, shared things.

The following are the equivalent entities, as described by different concepts. Unless specified otherwise, these entities are directly synonymous:

| Concept | Top Level Organization | Lower Level Organization | Data Container |
|---|---|---|---|
| Blobs – General purpose object storage | Container | Virtual directory (SDK only – does not provide atomic manipulation) | Blob |
| ADLS Gen2 – Analytics Storage | File system | Directory | File |
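The mapping in the table above can be sketched by describing one path in both vocabularies. The type names `BlobView` and `DataLakeView`, the helper `describe`, and the path `raw/sales/2019/data.csv` are hypothetical and used only for illustration. The sketch also assumes that in the Blob view the blob name carries the full slash-separated prefix, which is why the virtual directory is not a separate entity there.

```python
from typing import NamedTuple


class BlobView(NamedTuple):
    container: str
    virtual_directory: str  # only a name prefix, not a real object
    blob: str               # full blob name, including the prefix


class DataLakeView(NamedTuple):
    file_system: str
    directory: str          # a real entity in the hierarchical namespace
    file: str


def describe(path):
    """Split '<top>/<dirs...>/<leaf>' and name the parts in both vocabularies."""
    top, _, rest = path.strip("/").partition("/")
    lower, _, leaf = rest.rpartition("/")
    return BlobView(top, lower, rest), DataLakeView(top, lower, leaf)


blob_view, lake_view = describe("raw/sales/2019/data.csv")
print(blob_view)
print(lake_view)
```

The container and the file system refer to the same top-level entity, and the blob and the file refer to the same stored data. Only the middle level differs in kind, as the table notes.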

Supported open source platforms

Several open source platforms support Data Lake Storage Gen2. Those platforms appear in the following table.

Note

Only the versions that appear in this table are supported.

| Platform | Supported Version(s) | More Information |
|---|---|---|
| HDInsight | 3.6+ | What are the Apache Hadoop components and versions available with HDInsight? |
| Hadoop | 3.2+ | Apache Hadoop releases archive |
| Cloudera | 6.1+ | Cloudera Enterprise 6.x release notes |
| Azure Databricks | 5.1+ | Databricks Runtime versions |
| Hortonworks | 3.1.x++ | Configuring cloud data access |

Next steps

The following articles describe some of the main concepts of Data Lake Storage Gen2 and detail how to store, access, manage, and gain insights from your data: