Migrate Azure Data Lake Storage from Gen1 to Gen2

You can migrate your data, workloads, and applications from Data Lake Storage Gen1 to Data Lake Storage Gen2.

Azure Data Lake Storage Gen2 is built on Azure Blob storage and provides a set of capabilities dedicated to big data analytics. Data Lake Storage Gen2 combines features from Azure Data Lake Storage Gen1, such as file system semantics, directory- and file-level security, and scale, with the low-cost, tiered storage and high availability/disaster recovery capabilities of Azure Blob storage.

Note

For easier reading, this article uses the term Gen1 to refer to Azure Data Lake Storage Gen1, and the term Gen2 to refer to Azure Data Lake Storage Gen2.

To migrate to Gen2, we recommend the following approach.

✔️ Step 1: Assess readiness

✔️ Step 2: Prepare to migrate

✔️ Step 3: Migrate data, workloads, and applications

✔️ Step 4: Cut over from Gen1 to Gen2

Note

Gen1 and Gen2 are different services. There is no in-place upgrade experience, so migration requires intentional effort.

Step 1: Assess readiness

  1. Learn about the Data Lake Storage Gen2 offering: its benefits, costs, and general architecture.

  2. Compare the capabilities of Gen1 with those of Gen2.

  3. Review a list of known issues to assess any gaps in functionality.

  4. Gen2 supports Blob storage features such as diagnostic logging, access tiers, and Blob storage lifecycle management policies. If you're interested in using any of these features, review the current level of support.

  5. Review the current state of Azure ecosystem support to ensure that Gen2 supports any services that your solutions depend upon.

Step 2: Prepare to migrate

  1. Identify the data sets that you'll migrate.

    Take this opportunity to clean up data sets that you no longer use. Unless you plan to migrate all of your data at one time, also identify logical groups of data that you can migrate in phases.

  2. Determine the impact that a migration will have on your business.

    For example, consider whether you can afford any downtime while the migration takes place. These considerations can help you to identify a suitable migration pattern, and to choose the most appropriate tools.

  3. Create a migration plan.

    We recommend these migration patterns. You can choose one of these patterns, combine them together, or design a custom pattern of your own.

Step 3: Migrate data, workloads, and applications

Migrate data, workloads, and applications by using the pattern that you prefer. We recommend that you validate scenarios incrementally.

  1. Create a storage account and enable the hierarchical namespace feature.

  2. Migrate your data.

  3. Configure services in your workloads to point to your Gen2 endpoint.

  4. Update applications to use Gen2 APIs. See guides for .NET, Java, Python, JavaScript and REST.

  5. Update scripts to use Data Lake Storage Gen2 PowerShell cmdlets and Azure CLI commands.

  6. Search for URI references that contain the string adl:// in code files, Databricks notebooks, Apache Hive HQL files, or any other file used as part of your workloads. Replace these references with the Gen2-formatted URI of your new storage account. For example, the Gen1 URI adl://mydatalakestore.azuredatalakestore.net/mydirectory/myfile might become abfss://myfilesystem@mydatalakestore.dfs.core.windows.net/mydirectory/myfile.

  7. Configure the security on your account to include Azure roles, file and folder level security, and Azure Storage firewalls and virtual networks.
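The URI replacement described in step 6 can be sketched as a small script. This is a minimal illustration, not an official tool: the function names `to_gen2_uri` and `rewrite_text` are hypothetical, the Gen2 file system (container) name must be supplied by you because Gen1 has no equivalent, and the regular expression assumes URIs end at whitespace or a quote character.

```python
import re

# Matches Gen1 URIs such as adl://<account>.azuredatalakestore.net/<path>.
# Assumption: a URI ends at the first whitespace or quote character.
GEN1_URI = re.compile(r"adl://([A-Za-z0-9]+)\.azuredatalakestore\.net(/[^\s'\"]*)?")


def to_gen2_uri(gen1_uri, file_system, gen2_account=None):
    """Convert one Gen1 adl:// URI into a Gen2 abfss:// URI.

    file_system is the Gen2 container name (hypothetical parameter).
    gen2_account defaults to the Gen1 account name.
    """
    match = GEN1_URI.fullmatch(gen1_uri)
    if match is None:
        raise ValueError(f"not a Gen1 URI: {gen1_uri!r}")
    account, path = match.group(1), match.group(2) or "/"
    return f"abfss://{file_system}@{gen2_account or account}.dfs.core.windows.net{path}"


def rewrite_text(text, file_system, gen2_account=None):
    """Replace every Gen1 URI found in a block of text (script, notebook, HQL)."""
    return GEN1_URI.sub(
        lambda m: to_gen2_uri(m.group(0), file_system, gen2_account), text
    )
```

For instance, `rewrite_text(open(path).read(), "myfilesystem")` would return the file contents with each adl:// reference rewritten. Review the result before saving it, because URIs built by string concatenation in code are not caught by a pattern match like this.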

Step 4: Cut over from Gen1 to Gen2

After you're confident that your applications and workloads are stable on Gen2, you can begin using Gen2 to satisfy your business scenarios. Turn off any remaining pipelines that are running on Gen1 and decommission your Gen1 account.

Gen1 vs Gen2 capabilities

This table compares the capabilities of Gen1 to those of Gen2.

| Area | Gen1 | Gen2 |
|---|---|---|
| Data organization | Hierarchical namespace; File and folder support | Hierarchical namespace; Container, file and folder support |
| Geo-redundancy | LRS | LRS, ZRS, GRS, RA-GRS |
| Authentication | AAD managed identity; Service principals | AAD managed identity; Service principals; Shared Access Key |
| Authorization | Management - RBAC; Data - ACLs | Management - RBAC; Data - ACLs, RBAC |
| Encryption - Data at rest | Server side - with Microsoft-managed or customer-managed keys | Server side - with Microsoft-managed or customer-managed keys |
| VNET Support | VNET Integration | Service Endpoints, Private Endpoints |
| Developer experience | REST, .NET, Java, Python, PowerShell, Azure CLI | Generally available - REST, .NET, Java, Python; Public preview - JavaScript, PowerShell, Azure CLI |
| Resource logs | Classic logs; Azure Monitor integrated | Classic logs - Generally available; Azure Monitor integration - timeline TBD |
| Ecosystem | HDInsight (3.6), Azure Databricks (3.1 and above), SQL DW, ADF | HDInsight (3.6, 4.0), Azure Databricks (5.1 and above), SQL DW, ADF |

Gen1 to Gen2 patterns

Choose a migration pattern, and then modify that pattern as needed.

Lift and shift: The simplest pattern. Ideal if your data pipelines can afford downtime.
Incremental copy: Similar to lift and shift, but with less downtime. Ideal for large amounts of data that take longer to copy.
Dual pipeline: Ideal for pipelines that can't afford any downtime.
Bidirectional sync: Similar to dual pipeline, but with a more phased approach that is suited for more complicated pipelines.
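As a rough aid, the guidance above can be written as a decision function. This is a simplified sketch: the function `suggest_pattern` and its three yes/no inputs are hypothetical, and a real decision would weigh more factors, such as tooling, cost, and the option of combining patterns.

```python
def suggest_pattern(can_afford_downtime, large_data_volume=False, complex_pipelines=False):
    """Map three coarse constraints to one of the four pattern names.

    Simplification: incremental copy still needs a short downtime window
    at cutover, so it is grouped with the patterns that tolerate downtime.
    """
    if can_afford_downtime:
        # Large data sets take longer to copy, so copy incrementally.
        return "Incremental copy" if large_data_volume else "Lift and shift"
    # No downtime allowed: run Gen1 and Gen2 side by side.
    return "Bidirectional sync" if complex_pipelines else "Dual pipeline"
```

Treat the result as a starting point and adjust the pattern as needed.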

Let's take a closer look at each pattern.

Lift and shift pattern

This is the simplest pattern.

  1. Stop all writes to Gen1.

  2. Move data from Gen1 to Gen2. We recommend Azure Data Factory. ACLs copy with the data.

  3. Point ingest operations and workloads to Gen2.

  4. Decommission Gen1.

Considerations for using the lift and shift pattern:

✔️ Cut over from Gen1 to Gen2 for all workloads at the same time.

✔️ Expect downtime during the migration and the cutover period.

✔️ Ideal for pipelines that can afford downtime and whose apps can all be upgraded at one time.

Incremental copy pattern

  1. Start moving data from Gen1 to Gen2. We recommend Azure Data Factory. ACLs copy with the data.

  2. Incrementally copy new data from Gen1.

  3. After all data is copied, stop all writes to Gen1, and point workloads to Gen2.

  4. Decommission Gen1.

Considerations for using the incremental copy pattern:

✔️ Cut over from Gen1 to Gen2 for all workloads at the same time.

✔️ Expect downtime during the cutover period only.

✔️ Ideal for pipelines where all apps are upgraded at one time, but the data copy requires more time.
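One common way to implement the incremental step is a high-watermark on the last-modified time: each pass copies only the files changed since the previous pass. The sketch below shows that selection logic in isolation. It is an assumption for illustration, not a description of how Azure Data Factory works internally, and `FileInfo` and `plan_incremental_copy` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical record for one file in a Gen1 listing."""
    path: str
    modified: datetime


def plan_incremental_copy(listing, watermark):
    """Return (files modified after the watermark, new watermark).

    A watermark of None means no pass has run yet, so every file is selected.
    """
    changed = [f for f in listing if watermark is None or f.modified > watermark]
    # Advance the watermark to the newest file selected; keep it if nothing changed.
    new_watermark = max((f.modified for f in changed), default=watermark)
    return changed, new_watermark
```

Run the first pass with `watermark=None`, store the returned watermark, and pass it to the next run. After writes to Gen1 stop, one final pass picks up the remaining changes before cutover.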

Dual pipeline pattern

  1. Move data from Gen1 to Gen2. We recommend Azure Data Factory. ACLs copy with the data.

  2. Ingest new data to both Gen1 and Gen2.

  3. Point workloads to Gen2.

  4. Stop all writes to Gen1 and then decommission Gen1.

Considerations for using the dual pipeline pattern:

✔️ Gen1 and Gen2 pipelines run side-by-side.

✔️ Supports zero downtime.

✔️ Ideal in situations where your workloads and applications can't afford any downtime, and you can ingest into both storage accounts.
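The dual-ingest step can be pictured as a writer that fans each record out to both accounts until Gen1 is retired. The sketch below is illustrative only: `DualWriter` is a hypothetical class, and each sink is assumed to be any callable that takes a path and data, standing in for your real Gen1 and Gen2 ingestion code.

```python
class DualWriter:
    """Fan each write out to Gen1 and Gen2 until Gen1 is switched off."""

    def __init__(self, gen1_sink, gen2_sink):
        # Each sink is a callable: sink(path, data).
        self.sinks = {"gen1": gen1_sink, "gen2": gen2_sink}

    def write(self, path, data):
        """Write to every active sink; an exception from a sink propagates to the caller."""
        for sink in self.sinks.values():
            sink(path, data)

    def stop_gen1(self):
        """Stop writing to Gen1 once workloads point to Gen2."""
        self.sinks.pop("gen1", None)
```

In this sketch a failure in either sink stops the write, so the caller decides whether to retry. A production pipeline would also need to reconcile the case where one write succeeded and the other did not.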

Bidirectional sync pattern

  1. Set up bidirectional replication between Gen1 and Gen2. We recommend WANdisco, which offers a repair feature for existing data.

  2. When all moves are complete, stop all writes to Gen1 and turn off bidirectional replication.

  3. Decommission Gen1.

Considerations for using the bidirectional sync pattern:

✔️ Ideal for complex scenarios that involve a large number of pipelines and dependencies where a phased approach might make more sense.

✔️ Migration effort is high, but it provides side-by-side support for Gen1 and Gen2.
