Indexing documents in Azure Data Lake Storage Gen2
When setting up an Azure storage account, you have the option to enable hierarchical namespace. This allows the collection of content in an account to be organized into a hierarchy of directories and nested subdirectories. By enabling hierarchical namespace, you enable Azure Data Lake Storage Gen2.
This article describes how to get started with indexing documents that are in Azure Data Lake Storage Gen2.
Set up the Azure Data Lake Storage Gen2 indexer
There are a few steps you'll need to complete to index content from Data Lake Storage Gen2.
Step 1: Sign up for the preview
Sign up for the Data Lake Storage Gen2 indexer preview by filling out this form. You will receive a confirmation email once you have been accepted into the preview.
Step 2: Follow the Azure Blob storage indexing setup steps
Once you've received confirmation that your preview sign-up was successful, you're ready to create the indexing pipeline.
You can index content and metadata from Data Lake Storage Gen2 by using the REST API version 2019-05-06-Preview. There is no portal or .NET SDK support at this time.
Indexing content in Data Lake Storage Gen2 is identical to indexing content in Azure Blob storage. To set up the Data Lake Storage Gen2 data source, index, and indexer, refer to How to index documents in Azure Blob Storage with Azure Cognitive Search. The Blob storage article also covers which document formats are supported, which blob metadata properties are extracted, incremental indexing, and more. All of that information applies equally to Data Lake Storage Gen2.
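As a rough illustration of the three resources involved, the following Python sketch builds (but does not send) the REST requests for a data source, an index, and an indexer against API version 2019-05-06-Preview. The service name, resource names, field names, and the helper build_put are hypothetical. The data source type "azureblob" is an assumption based on the statement that setup is identical to Blob storage; confirm the exact payloads against the Blob storage article.

```python
import json
import urllib.request

API_VERSION = "2019-05-06-Preview"  # preview API version named in this article


def build_put(service, kind, name, body, api_key):
    """Build (without sending) a PUT request that creates or updates a resource.

    `kind` is one of "datasources", "indexes", or "indexers".
    """
    url = (
        f"https://{service}.search.windows.net/{kind}/{name}"
        f"?api-version={API_VERSION}"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="PUT",
    )


# All names and values below are placeholders for illustration only.
SERVICE = "my-search-service"
API_KEY = "<admin-api-key>"

data_source = {
    "name": "adls-gen2-ds",
    "type": "azureblob",  # assumption: same type as the Blob storage data source
    "credentials": {"connectionString": "<storage-connection-string>"},
    "container": {"name": "my-container"},
}

index = {
    "name": "adls-gen2-index",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        {"name": "content", "type": "Edm.String", "searchable": True},
    ],
}

indexer = {
    "name": "adls-gen2-indexer",
    "dataSourceName": data_source["name"],
    "targetIndexName": index["name"],
}

requests_to_send = [
    build_put(SERVICE, "datasources", data_source["name"], data_source, API_KEY),
    build_put(SERVICE, "indexes", index["name"], index, API_KEY),
    build_put(SERVICE, "indexers", indexer["name"], indexer, API_KEY),
]

# To actually create the resources, send each request in order, for example:
# for req in requests_to_send:
#     urllib.request.urlopen(req)
```

The requests are created in dependency order: the indexer refers to both the data source and the index by name, so those two need to exist first.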
Azure Data Lake Storage Gen2 implements an access control model that supports both Azure role-based access control (RBAC) and POSIX-like access control lists (ACLs). When indexing content from Data Lake Storage Gen2, Azure Cognitive Search will not extract the RBAC and ACL information from the content. As a result, this information will not be included in your Azure Cognitive Search index.
If maintaining access control on each document in the index is important, it is up to the application developer to implement security trimming.
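One common way to approach security trimming is to store the identifiers of the groups allowed to read each document in a collection field and to filter every query by the caller's groups. The sketch below builds such an OData filter using the search.in function. The field name group_ids and the helper build_security_filter are hypothetical, and the sketch assumes your application populates that field itself, since the indexer does not extract RBAC or ACL information.

```python
def build_security_filter(group_ids, field="group_ids"):
    """Return an OData filter that matches documents whose `field`
    collection contains at least one of the caller's group IDs.

    `field` is a hypothetical Collection(Edm.String) field that the
    application must populate; the indexer does not fill it in.
    """
    if not group_ids:
        raise ValueError("at least one group ID is required")
    for gid in group_ids:
        # Keep the sketch simple: reject characters that would need
        # escaping or that collide with the list delimiter.
        if "," in gid or "'" in gid:
            raise ValueError(f"unsupported character in group ID: {gid!r}")
    joined = ",".join(group_ids)
    return f"{field}/any(g: search.in(g, '{joined}', ','))"


# Example: a user who belongs to two (made-up) groups.
example_filter = build_security_filter(["group-a", "group-b"])
```

The resulting string would be passed as the filter parameter of a search request so that only documents tagged with one of the caller's groups are returned.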
The Data Lake Storage Gen2 indexer supports change detection. This means that when the indexer runs, it reindexes only the changed blobs, as determined by each blob's LastModified timestamp.
Data Lake Storage Gen2 allows directories to be renamed. When a directory is renamed, the timestamps for the blobs in that directory are not updated, so the indexer will not reindex those blobs. If you need the blobs in a directory to be reindexed after a rename because they now have new URLs, you will need to update the LastModified timestamp for all the blobs in the directory so that the indexer knows to reindex them during a future run.
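One possible way to refresh the timestamps is to rewrite each blob's existing metadata, since setting blob metadata updates the blob's LastModified value. The sketch below assumes container_client behaves like azure.storage.blob.ContainerClient from the azure-storage-blob package (list_blobs, get_blob_client, set_blob_metadata). The helper touch_blobs_under is hypothetical, and the check for the hdi_isfolder metadata key is an assumption about how directory entries appear when listed through the Blob API.

```python
def touch_blobs_under(container_client, directory):
    """Rewrite the metadata of every blob under `directory` so that each
    blob's LastModified timestamp is refreshed.

    `container_client` is assumed to expose the same methods as
    azure.storage.blob.ContainerClient. Returns the names of the blobs
    that were touched.
    """
    prefix = directory.rstrip("/") + "/"
    touched = []
    for blob in container_client.list_blobs(
        name_starts_with=prefix, include=["metadata"]
    ):
        metadata = dict(blob.metadata or {})
        # Assumption: directory entries carry hdi_isfolder=true; skip them.
        if metadata.get("hdi_isfolder") == "true":
            continue
        # Writing the same metadata back leaves the content unchanged but
        # counts as a modification of the blob.
        container_client.get_blob_client(blob.name).set_blob_metadata(metadata)
        touched.append(blob.name)
    return touched
```

With the real SDK, the client could be created with ContainerClient.from_connection_string(conn_str, container_name) and the function called with the new directory name after the rename. Test this on a small directory first, because rewriting metadata on many blobs generates one write operation per blob.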