Filter data by using Azure Data Lake Storage query acceleration

This article shows you how to use query acceleration to retrieve a subset of data from your storage account.

Query acceleration enables applications and analytics frameworks to dramatically optimize data processing by retrieving only the data that they require to perform a given operation. To learn more, see Azure Data Lake Storage Query Acceleration.

Prerequisites


Enable query acceleration

To use query acceleration, you must register the query acceleration feature with your subscription. Once you've verified that the feature is registered, you must register the Azure Storage resource provider.

Step 1: Register the query acceleration feature

To use query acceleration, you must first register the query acceleration feature with your subscription.

  1. Open a Windows PowerShell command window.

  2. Sign in to your Azure subscription with the Connect-AzAccount command and follow the on-screen directions.

    Connect-AzAccount
    
  3. If your identity is associated with more than one subscription, then set your active subscription.

    $context = Get-AzSubscription -SubscriptionId <subscription-id>
    Set-AzContext $context
    

    Replace the <subscription-id> placeholder value with the ID of your subscription.

  4. Register the query acceleration feature by using the Register-AzProviderFeature command.

    Register-AzProviderFeature -ProviderNamespace Microsoft.Storage -FeatureName BlobQuery
    

Step 2: Verify that the feature is registered

To verify that the registration is complete, use the Get-AzProviderFeature command.

Get-AzProviderFeature -ProviderNamespace Microsoft.Storage -FeatureName BlobQuery

Step 3: Register the Azure Storage resource provider

After your registration is approved, you must re-register the Azure Storage resource provider.

To register the resource provider, use the Register-AzResourceProvider command.

Register-AzResourceProvider -ProviderNamespace 'Microsoft.Storage'

Set up your environment

Step 1: Install packages

Install the Az module version 4.6.0 or higher.

Install-Module -Name Az -Repository PSGallery -Force

To update from an older version of Az, run the following command:

Update-Module -Name Az

Step 2: Add statements

Not applicable

Retrieve data by using a filter

You can use SQL to specify the row filter predicates and column projections in a query acceleration request. The following code queries a CSV file in storage and returns all rows of data where the third column matches the value Hemingway, Ernest.

  • In the SQL query, the keyword BlobStorage is used to denote the file that is being queried.

  • Column references are specified as _N where the first column is _1. If the source file contains a header row, then you can refer to columns by the name that is specified in the header row.

Function Get-QueryCsv($ctx, $container, $blob, $query, $hasheaders) {
    $tempfile = New-TemporaryFile
    $informat = New-AzStorageBlobQueryConfig -AsCsv -HasHeader:$hasheaders
    Get-AzStorageBlobQueryResult -Context $ctx -Container $container -Blob $blob -InputTextConfiguration $informat -OutputTextConfiguration (New-AzStorageBlobQueryConfig -AsCsv -HasHeader) -ResultFile $tempfile.FullName -QueryString $query -Force
    Get-Content $tempfile.FullName
}

$container = "data"
$blob = "csv/csv-general/seattle-library.csv"
Get-QueryCsv $ctx $container $blob "SELECT * FROM BlobStorage WHERE _3 = 'Hemingway, Ernest, 1899-1961'" $false

Retrieve specific columns

You can scope your results to a subset of columns. That way you retrieve only the columns needed to perform a given calculation. This improves application performance and reduces cost because less data is transferred over the network.

Note

The maximum number of columns that you can scope your results to is 49. If you need your results to contain more than 49 columns, then use a wildcard character (*) for the SELECT expression (For example: SELECT *).

This code retrieves only the BibNum column for all books in the data set. It also uses the information from the header row in the source file to reference columns in the query.

Function Get-QueryCsv($ctx, $container, $blob, $query, $hasheaders) {
    $tempfile = New-TemporaryFile
    $informat = New-AzStorageBlobQueryConfig -AsCsv -HasHeader:$hasheaders
    Get-AzStorageBlobQueryResult -Context $ctx -Container $container -Blob $blob -InputTextConfiguration $informat -OutputTextConfiguration (New-AzStorageBlobQueryConfig -AsCsv -HasHeader) -ResultFile $tempfile.FullName -QueryString $query -Force
    Get-Content $tempfile.FullName
}

$container = "data"
$blob = "csv/csv-general/seattle-library-with-headers.csv"
Get-QueryCsv $ctx $container $blob "SELECT BibNum FROM BlobStorage" $true

The following code combines row filtering and column projections into the same query.

Get-QueryCsv $ctx $container $blob $query $true

Function Get-QueryCsv($ctx, $container, $blob, $query, $hasheaders) {
    $tempfile = New-TemporaryFile
    $informat = New-AzStorageBlobQueryConfig -AsCsv -HasHeader:$hasheaders
    Get-AzStorageBlobQueryResult -Context $ctx -Container $container -Blob $blob -InputTextConfiguration $informat -OutputTextConfiguration (New-AzStorageBlobQueryConfig -AsCsv -HasHeader) -ResultFile $tempfile.FullName -QueryString $query -Force
    Get-Content $tempfile.FullName
}

$container = "data"
$query = "SELECT BibNum, Title, Author, ISBN, Publisher, ItemType
            FROM BlobStorage
            WHERE ItemType IN
                ('acdvd', 'cadvd', 'cadvdnf', 'calndvd', 'ccdvd', 'ccdvdnf', 'jcdvd', 'nadvd', 'nadvdnf', 'nalndvd', 'ncdvd', 'ncdvdnf')"

Next steps