Input Files (U-SQL)

Summary

U-SQL provides access to files that are stored either in Azure Data Lake Storage accounts or in Windows Azure Blob Storage accounts. Azure Data Lake Storage accounts have to be in the same Azure region as the Azure Data Lake Analytics account where you run your U-SQL script. Windows Azure Blob Storage accounts need to be registered with the Azure Data Lake Analytics account where you will run your U-SQL script but can be in any region. Note that if the data is in a different region than the Azure Data Lake Analytics account, you will get billed for data egress and data transfer rates will be based on the internet connection between the two regions and considerably lower than accessing data inside the same region. Any input files referenced in a script have to exist at the time the script is compiled.

In order to process files with U-SQL, they have to be schematized with an EXTRACT expression that will read from either a single file, a list of files, or a set of files, a so-called unstructured file set.

Input File Path URIs

The input file path URIs (Universal Resource Identifier) can be specified as a single file path URI, a list of file path URIs, or a path pattern expression that identifies an unstructured file set.

  • Since file paths are URIs, characters such as spaces need to be URI-encoded. For example the file name a has 3 spaces has to be written as a%20has%203%20spaces.

  • A file path URI or a path pattern expression has to be provided as a static string expression (an expression that can be constant folded) in the FROM clause of an EXTRACT expression.

  • A file path can also be specified as a local file path.

Syntax

Input_File_Path :=                                                                                       
     Single_File_Path 
|    List_File_Path 
|    File_Set_Path.

Single_File_Path

A single file path URI can be a relative or absolute Azure Data Lake Storage URI, or a Windows Azure Blob Storage URI:

Syntax

Single_File_Path :=                                                                                      
     Default_Account_Path_URI 
|    ADL_Path_URI 
|    WASB_Path_URI.

The following sections show the syntax and semantics for each of the supported paths.

  • Default_Account_Path_URI
    The Default Account Path URI is a relative Azure Data Lake Storage URI (a path without the URI header). It refers to a path in the Data Lake Analytics account’s associated default Data Lake Storage account.

Syntax

  Default_Account_Path_URI :=                                                                         
       ['/']+Relative_Path.
Relative_Path := {pathstep+'/'}+file_name.

Note

The + in the grammar indicates that spaces are not allowed between the parts.

Examples

The URI "/Samples/Data/searchlog.tsv" refers to the searchlog.tsv file in the directory path /Samples/Data inside the default Azure Data Lake Storage account.

  • ADL_Path_URI (Azure Data Lake Path URI)
    You can specify a full URI using either the webhdfs or adl URI schemes. They both look the same, except for the scheme name. webhdfs is supported for compatibility reasons with WebHDFS. It is recommended to use the adl URI scheme with U-SQL, since it provides some performance benefits.

    Both URI schemes use encryption during data retrieval.

Syntax

  ADL_Path_URI :=                                                                                     
       ADL_URI | WebHDFS_URI.
ADL_URI := ADL_Header +'/'+Relative_Path.
ADL_Header := 'adl://'+ADL_Domain.
ADL_Domain := adl_accountname+'.azuredatalake.net'.
WebHDFS_URI := WebHDFS_Header+'/'+Local_Path.
WebHDFS_Header := 'swebhdfs://'+ADL_Domain.

Note

The + in the grammar indicates that spaces are not allowed between the parts.

Examples

Both URIs "webhdfs://myadl.azuredatalake.net/Samples/Data/searchlog.tsv" and "adl://myadl.azuredatalake.net/Samples/Data/searchlog.tsv" refer to the searchlog.tsv file in the directory path /Samples/Data inside the Azure Data Lake Storage account myadl.

  • WASB_Path_URI (Windows Azure Blob Storage Path URI)
    U-SQL can refer to block blobs and append blobs stored in Windows Azure Blob Storage with the wasb URI scheme.If a short-form account name is provided, it will default to the blob.core.windows.net domain.

    The wasb URI scheme uses encryption during data retrieval.

Syntax

  WASB_Path :=                                                                                        
       WASB_Header+Relative_Path.
WASB_Header := 'wasb://'+wasb_container+'@'+wasb_accountname+ ['.blob.core.windows.net']+'/'.

Note

The + in the grammar indicates that spaces are not allowed between the parts.

Examples

The URI "wasb://samples@mywasb/Samples/Data/searchlog.tsv" refers to the searchlog.tsv file in the directory path /Samples/Data inside the Windows Azure Blob Store account mywasb's samples container.

List_File_Path

Files can be specified in a list that enumerates all the files that are being processed.

Syntax

List_File_Path :=                                                                                        
     Single_File_Path {',' Single_File_Path}.

File_Set_Path

If the list of files cannot be enumerated or it would be too cumbersome to enumerate them, a file set path can be specified. A file set path is a relative or absolute URI where parts of the folder or file names (but not the domain or container names) can be replaced with a file set pattern that is embedded inside curly braces {}.

Syntax

File_Set_Path :=                                                                                         
     [ADL_Header | WebHDFS_Header | WASB_Header]+  
         FileSet_Pattern.
FileSet_Pattern := '/'+{directory_name | Directory_Name_Pattern +'/'}+ file_name | File_Name_Pattern.
Directory_Name_Pattern := (character | Pattern) {character | Pattern}.
Pattern := '{'+((Virtual_Col_Name+[':'+ Date_Pattern ]) | '*')+'}'.
Date_Pattern := 'yyyy' | 'MM' | 'M' | 'dd' | 'd' | 'HH' | 'H' | 'mm' | 'm' | 'ss' | 's'.
Virtual_Col_Name := Quoted_or_Unquoted_Identifier.
File_Name_Pattern := (character | Pattern) {character | Pattern}.

Note

The + in the grammar indicates that spaces are not allowed between the parts.

The semantics of file sets is the following:

  • File set patterns can appear either in part of a directory path or in a filename.
  • Several patterns can appear in the same directory step or file name.
  • Each pattern is either associated with a virtual column name that has an optional date pattern, or is just a * wild card.
  • If the pattern is just a wild card, then file paths that are matching the pattern will be included in the file set, but you cannot refer to the identified pattern in the query.
  • If the pattern is associated with a virtual column name then file paths matching the pattern will be included, and in addition, the matched values for the pattern will be available through the virtual column name.
  • The virtual column names can be used in subsequent SELECT expressions to eliminate files from the matched files. If the expression can be evaluated at compile time, the files will not be included in the query at all.
  • The data type of the column is determined by the EXTRACT schema.
  • If a date pattern is provided, then the virtual column has to be of type DateTime and the column can be compared against constant DateTime values in a predicate in the script to eliminate files from the processing.

Not all of the date patterns have to be specified for a given virtual column. If a pattern is conflicting with another pattern then no path will be found. Currently the year (yyyy), month (MM or M), day (dd or d), hours (HH or H), minutes (mm or m) and second (ss or s) patterns are supported.

For more details and examples on extracting from file sets, please see EXTRACT Expression (U-SQL).

Examples

Single_File_Path: Default_Account_Path_URI

@searchLog =  
    EXTRACT UserId          int  
          , Start           DateTime  
          , Region          string  
          , Query           string  
          , Duration        int  
          , Urls            string  
          , ClickedUrls     string
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

// The file created below will be used in later examples; otherwise, the OUTPUT statement is not necessarily needed.
OUTPUT @searchLog
TO "/Samples/Outputs/ExampleA.csv" 
USING Outputters.Csv();

Single_File_Path: ADL_Path_URI

@searchLog =  
    EXTRACT UserId          int  
          , Start           DateTime  
          , Region          string  
          , Query           string  
          , Duration        int  
          , Urls            string  
          , ClickedUrls     string  
    FROM "adl://<adl_accountname>.azuredatalakestore.net/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

// The file created below will be used in later examples; otherwise, the OUTPUT statement is not necessarily needed.
OUTPUT @searchLog
TO "/Samples/Outputs/ExampleB.csv" 
USING Outputters.Csv();

Single_File_Path: WASB_Path_URI

@searchLog =  
    EXTRACT UserId          int  
          , Start           DateTime  
          , Region          string  
          , Query           string  
          , Duration        int  
          , Urls            string  
          , ClickedUrls     string  
    FROM "wasb://<wasb_container>@<wasb_accountname>/SearchLog.tsv"
    USING Extractors.Tsv();

// The OUTPUT statement is not necessarily needed.
OUTPUT @searchLog
TO "/Samples/Outputs/wasb.csv" 
USING Outputters.Csv();

List_File_Path
This example extracts data from the two files created in earlier examples by enumerating each file.

@searchLog =  
    EXTRACT UserId          int  
          , Start           DateTime  
          , Region          string  
          , Query           string  
          , Duration        int  
          , Urls            string  
          , ClickedUrls     string
    FROM "/Samples/Outputs/ExampleA.csv",
         "/Samples/Outputs/ExampleB.csv"
    USING Extractors.Csv();

// The file created below will be used in a later example; otherwise, the OUTPUT statement is not necessarily needed.
OUTPUT @searchLog
TO "/Samples/Outputs/ExampleC.csv" 
USING Outputters.Csv();

File_Set_Path
This example extracts data from the three files created in earlier examples by using a file set pattern.

@searchLog =  
    EXTRACT UserId          int  
          , Start           DateTime  
          , Region          string  
          , Query           string  
          , Duration        int  
          , Urls            string  
          , ClickedUrls     string
    FROM "/Samples/Outputs/Example{*}.csv"
    USING Extractors.Csv();

// The OUTPUT statement is not necessarily needed.
OUTPUT @searchLog
TO "/Samples/Outputs/FileSetPattern.csv" 
USING Outputters.Csv();

See Also