Introduction to Azure Data Lake Storage Gen2 Preview

Azure Data Lake Storage Gen2 Preview is a set of capabilities dedicated to big data analytics, built on Azure Blob storage. Data Lake Storage Gen2 is the result of converging the capabilities of our two existing storage services, Azure Blob storage and Azure Data Lake Storage Gen1. Features from Azure Data Lake Storage Gen1, such as file system semantics, directory- and file-level security, and scale, are combined with the low-cost tiered storage and high availability/disaster recovery capabilities of Azure Blob storage.

Designed for enterprise big data analytics

Data Lake Storage Gen2 makes Azure Storage the foundation for building enterprise data lakes on Azure. Designed from the start to service multiple petabytes of information while sustaining hundreds of gigabits of throughput, Data Lake Storage Gen2 allows you to easily manage massive amounts of data.

A fundamental part of Data Lake Storage Gen2 is the addition of a hierarchical namespace to Blob storage. The hierarchical namespace organizes objects/files into a hierarchy of directories for efficient data access. A common object store naming convention uses slashes in the name to mimic a hierarchical directory structure. This structure becomes real with Data Lake Storage Gen2. Operations such as renaming or deleting a directory become single atomic metadata operations on the directory, instead of requiring the enumeration and processing of every object that shares the directory's name prefix.
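The difference between a prefix-based rename and a directory rename can be illustrated with a toy in-memory model. This is only a sketch: the dictionaries and the function names `rename_prefix_flat` and `rename_dir_hierarchical` are hypothetical stand-ins, not part of any Azure SDK, and the model ignores concurrency and networking.

```python
"""Toy model contrasting directory rename in a flat versus a hierarchical namespace."""


def rename_prefix_flat(store, old_prefix, new_prefix):
    """Flat namespace: every object whose name starts with the prefix is rewritten.

    Returns the number of objects touched.
    """
    touched = 0
    for name in list(store):  # copy the keys because the dict is mutated in the loop
        if name.startswith(old_prefix):
            store[new_prefix + name[len(old_prefix):]] = store.pop(name)
            touched += 1
    return touched


def rename_dir_hierarchical(root, parent_path, old_name, new_name):
    """Hierarchical namespace: one directory entry is relinked in its parent.

    Returns the number of entries touched, which is always 1.
    """
    parent = root
    for part in parent_path:
        parent = parent[part]
    parent[new_name] = parent.pop(old_name)
    return 1


# Flat store: slashes in the names only mimic directories.
flat = {
    "logs/2018/a.csv": b"a",
    "logs/2018/b.csv": b"b",
    "logs/2019/c.csv": b"c",
    "raw/d.csv": b"d",
}
flat_touched = rename_prefix_flat(flat, "logs/", "archive/")

# Hierarchical store: directories are real nodes (nested dicts in this model).
tree = {
    "logs": {"2018": {"a.csv": b"a", "b.csv": b"b"}, "2019": {"c.csv": b"c"}},
    "raw": {"d.csv": b"d"},
}
tree_touched = rename_dir_hierarchical(tree, [], "logs", "archive")

print(flat_touched, tree_touched)
```

In the flat model the work grows with the number of objects under the prefix, while in the hierarchical model a single entry changes regardless of how many files the directory contains.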

In the past, cloud-based analytics had to compromise in areas of performance, management, and security. Data Lake Storage Gen2 addresses each of these aspects in the following ways:

  • Performance is optimized because you do not need to copy or transform data as a prerequisite for analysis. The hierarchical namespace greatly improves the performance of directory management operations, which improves overall job performance.

  • Management is easier because you can organize and manipulate files through directories and subdirectories.

  • Security is enforceable because you can define POSIX permissions on directories or individual files.

  • Cost effectiveness is made possible as Data Lake Storage Gen2 is built on top of the low-cost Azure Blob storage. The additional features further lower the total cost of ownership for running big data analytics on Azure.
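As a rough illustration of what POSIX-style permissions on a directory or file look like, the sketch below parses a permission list written in the conventional POSIX ACL short text form (`scope:qualifier:rwx`). The entry syntax, the helper names, and the user `alice` are assumptions made for this example; the sketch deliberately ignores masks, default ACLs, and the order in which a real service evaluates entries.

```python
"""Simplified parser for POSIX-style permission entries (illustrative only)."""

READ, WRITE, EXECUTE = 4, 2, 1


def triad_to_bits(triad):
    """Convert a triad such as 'r-x' into permission bits (here 5)."""
    if len(triad) != 3:
        raise ValueError(f"expected three characters, got {triad!r}")
    bits = 0
    for ch, expected, value in zip(triad, "rwx", (READ, WRITE, EXECUTE)):
        if ch == expected:
            bits |= value
        elif ch != "-":
            raise ValueError(f"unexpected character {ch!r} in {triad!r}")
    return bits


def parse_acl(acl):
    """Parse 'user::rwx,group::r-x,other::---' into {(scope, qualifier): bits}."""
    entries = {}
    for entry in acl.split(","):
        scope, qualifier, perms = entry.split(":")
        entries[(scope, qualifier)] = triad_to_bits(perms)
    return entries


def is_allowed(entries, scope, qualifier, needed):
    """Return True when the matching entry grants every requested bit."""
    granted = entries.get((scope, qualifier), 0)
    return (granted & needed) == needed


# Hypothetical permission list for a single directory.
entries = parse_acl("user::rwx,user:alice:r-x,group::r-x,other::---")

alice_can_list = is_allowed(entries, "user", "alice", READ | EXECUTE)
alice_can_write = is_allowed(entries, "user", "alice", WRITE)
print(alice_can_list, alice_can_write)
```

Here the named user can read and traverse the directory but cannot write to it, while the owning user keeps full access.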

Key features of Data Lake Storage Gen2

  • Hadoop compatible access: Data Lake Storage Gen2 allows you to manage and access data just as you would with a Hadoop Distributed File System (HDFS). The new ABFS driver, which is used to access data stored in Data Lake Storage Gen2, is available within all Apache Hadoop environments, including Azure HDInsight, Azure Databricks, and SQL Data Warehouse.

  • A superset of POSIX permissions: The security model for Data Lake Storage Gen2 supports ACLs and POSIX permissions, along with some extra granularity specific to Data Lake Storage Gen2. Settings may be configured through Storage Explorer or through frameworks like Hive and Spark.

  • Cost effective: Data Lake Storage Gen2 offers low-cost storage capacity and transactions. As data transitions through its complete lifecycle, billing rates change, keeping costs to a minimum via built-in features such as Azure Blob storage lifecycle.

  • Optimized driver: The ABFS driver is optimized specifically for big data analytics. The corresponding REST APIs are surfaced through the dfs endpoint, dfs.core.windows.net.
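To make the addressing concrete, the sketch below builds the kind of URI that Hadoop-based tools typically pass to the ABFS driver, along with the matching dfs endpoint. The helper names and the account and file system names (`myaccount`, `myfs`) are hypothetical. The `abfss` scheme (the TLS variant of `abfs`) is not mentioned in this article and comes from the Hadoop ABFS documentation, so treat it as an assumption here.

```python
"""Build ABFS URIs and dfs endpoint URLs from account and file system names."""

DFS_SUFFIX = "dfs.core.windows.net"  # dfs endpoint suffix named in this article


def abfs_uri(filesystem, account, path="", secure=True):
    """Return a URI of the form scheme://filesystem@account.dfs.core.windows.net/path."""
    scheme = "abfss" if secure else "abfs"
    return f"{scheme}://{filesystem}@{account}.{DFS_SUFFIX}/{path.lstrip('/')}"


def dfs_endpoint(account):
    """Return the base URL of the dfs REST endpoint for a storage account."""
    return f"https://{account}.{DFS_SUFFIX}"


# Hypothetical names, used only for illustration.
uri = abfs_uri("myfs", "myaccount", "/raw/2018/events.csv")
endpoint = dfs_endpoint("myaccount")
print(uri)
print(endpoint)
```

A URI built this way is what you would typically hand to a Hadoop-compatible framework as the input or output location for a job.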

Scalability

Azure Storage is scalable by design, whether you access it through the Data Lake Storage Gen2 or Blob storage interfaces. It can store and serve many exabytes of data. This amount of storage is available with throughput measured in gigabits per second (Gbps) at high levels of input/output operations per second (IOPS). Beyond just persistence, processing is executed at near-constant per-request latencies that are measured at the service, account, and file levels.

Cost effectiveness

One of the many benefits of building Data Lake Storage Gen2 on top of Azure Blob storage is the low cost of storage capacity and transactions. Unlike with other cloud storage services, data stored in Data Lake Storage Gen2 does not need to be moved or transformed before analysis. For more information about pricing, see Azure Storage pricing.

Additionally, features such as the hierarchical namespace significantly improve the overall performance of many analytics jobs. This improvement in performance means that you require less compute power to process the same amount of data, resulting in a lower total cost of ownership (TCO) for the end-to-end analytics job.

One service, multiple concepts

Data Lake Storage Gen2 is an additional capability for big data analytics, built on top of Azure Blob storage. While there are many benefits in leveraging existing platform components of Blobs to create and operate data lakes for analytics, it does lead to multiple concepts describing the same, shared things.

The following are the equivalent entities, as described by different concepts. Unless specified otherwise, these entities are directly synonymous:

Concept | Top Level Organization | Lower Level Organization | Data Container
Blobs – General purpose object storage | Container | Virtual directory (SDK only – does not provide atomic manipulation) | Blob
ADLS Gen2 – Analytics Storage | Filesystem | Directory | File

Supported open source platforms

Several open source platforms support Data Lake Storage Gen2. Those platforms appear in the following table.

Note

Only the versions that appear in this table are supported.

Platform | Supported Version(s) | More Information
HDInsight | 3.6+ | What are the Apache Hadoop components and versions available with HDInsight?
Hadoop | 2.7+ | Apache Hadoop releases archive
Cloudera | 6.1+ | Cloudera Enterprise 6.x release notes
Azure Databricks | 4.2+ | Databricks Runtime versions
Hortonworks | 2.6+ | Hortonworks documentation

Next steps

The following articles describe some of the main concepts of Data Lake Storage Gen2 and detail how to store, access, manage, and gain insights from your data: