開始在 Azure Data Lake Analytics 中使用 U-SQLGet started with U-SQL in Azure Data Lake Analytics

U-SQL 是一種語言,結合了宣告式 SQL 與命令式 C#,可讓您處理任何規模的資料。U-SQL is a language that combines declarative SQL with imperative C# to let you process data at any scale. 透過 U-SQL 的可調整分散式查詢功能,您可以有效率地分析各關聯式存放區 (Azure SQL Database) 中的資料。Through the scalable, distributed-query capability of U-SQL, you can efficiently analyze data across relational stores such as Azure SQL Database. 使用 U-SQL,您可以藉由在讀取時套用結構描述並插入自訂邏輯和 UDF,來處理非結構化資料。With U-SQL, you can process unstructured data by applying schema on read and inserting custom logic and UDFs. 此外,U-SQL 所含有的擴充性可讓您細微控制如何大規模執行。Additionally, U-SQL includes extensibility that gives you fine-grained control over how to execute at scale.

學習資源Learning resources

先決條件Prerequisites

您瀏覽這份文件中的 U-SQL 範例前,閱讀並完成教學課程:開發 U-SQL 指令碼,使用 Data Lake Tools for Visual StudioBefore you go through the U-SQL samples in this document, read and complete Tutorial: Develop U-SQL scripts using Data Lake Tools for Visual Studio. 本教學課程說明將 U-SQL 搭配 Azure Data Lake Tools for Visual Studio 使用的機制。That tutorial explains the mechanics of using U-SQL with Azure Data Lake Tools for Visual Studio.

您的第一個 U-SQL 指令碼Your first U-SQL script

以下是一個簡單的 U-SQL 指令碼,可讓我們探索 U-SQL 語言的許多層面。The following U-SQL script is simple and lets us explore many aspects the U-SQL language.

@searchlog =
    EXTRACT UserId          int,
            Start           DateTime,
            Region          string,
            Query           string,
            Duration        int?,
            Urls            string,
            ClickedUrls     string
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

OUTPUT @searchlog   
    TO "/output/SearchLog-first-u-sql.csv"
    USING Outputters.Csv();

此指令碼沒有任何轉換步驟。This script doesn't have any transformation steps. 它會從名為 SearchLog.tsv 的原始程式檔讀取資料,為其建立結構描述,並將資料列集回寫到名為 SearchLog-first-u-sql.csv 檔案。It reads from the source file called SearchLog.tsv, schematizes it, and writes the rowset back into a file called SearchLog-first-u-sql.csv.

請注意 Duration 欄位中資料類型旁的問號,Notice the question mark next to the data type in the Duration field. 它表示 Duration 欄位可能是 null。It means that the Duration field could be null.

重要概念Key concepts

  • 資料列集變數:每個產生的資料列集的查詢運算式可以指派給變數。Rowset variables: Each query expression that produces a rowset can be assigned to a variable. 在指令碼中,U-SQL 會遵循 T-SQL 變數命名模式 (例如 @searchlog)。U-SQL follows the T-SQL variable naming pattern (@searchlog, for example) in the script.
  • EXTRACT 關鍵字會從檔案讀取資料,並在讀取時定義結構描述。The EXTRACT keyword reads data from a file and defines the schema on read. Extractors.Tsv 是內建的 U-SQL 擷取器,適用於以定位點分隔值的檔案。Extractors.Tsv is a built-in U-SQL extractor for tab-separated-value files. 您可以開發自訂擷取器。You can develop custom extractors.
  • OUTPUT 會將資料列集的資料寫入檔案。The OUTPUT writes data from a rowset to a file. Outputters.Csv() 是內建的 U-SQL 輸出器,用於建立以逗號分隔值的檔案。Outputters.Csv() is a built-in U-SQL outputter to create a comma-separated-value file. 您可以開發自訂輸出器。You can develop custom outputters.

檔案路徑File paths

EXTRACT 與 OUTPUT 陳述式使用檔案路徑。The EXTRACT and OUTPUT statements use file paths. 檔案路徑可以是絕對或相對路徑:File paths can be absolute or relative:

下面這個絕對檔案路徑會參考名為 mystore 的 Data Lake Store 中的檔案:This following absolute file path refers to a file in a Data Lake Store named mystore:

adl://mystore.azuredatalakestore.net/Samples/Data/SearchLog.tsv

下面這個檔案路徑的開頭為 "/"This following file path starts with "/". 它會參考預設 Data Lake Store 帳戶中的檔案:It refers to a file in the default Data Lake Store account:

/output/SearchLog-first-u-sql.csv

使用純量變數Use scalar variables

您也可以使用純量變數以方便維護指令碼。You can use scalar variables as well to make your script maintenance easier. 前述 U-SQL 指令碼也可以撰寫成:The previous U-SQL script can also be written as:

DECLARE @in  string = "/Samples/Data/SearchLog.tsv";
DECLARE @out string = "/output/SearchLog-scalar-variables.csv";

@searchlog =
    EXTRACT UserId          int,
            Start           DateTime,
            Region          string,
            Query           string,
            Duration        int?,
            Urls            string,
            ClickedUrls     string
    FROM @in
    USING Extractors.Tsv();

OUTPUT @searchlog   
    TO @out
    USING Outputters.Csv();

轉換資料列集Transform rowsets

使用 SELECT 來轉換資料列集:Use SELECT to transform rowsets:

@searchlog =
    EXTRACT UserId          int,
            Start           DateTime,
            Region          string,
            Query           string,
            Duration        int?,
            Urls            string,
            ClickedUrls     string
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

@rs1 =
    SELECT Start, Region, Duration
    FROM @searchlog
WHERE Region == "en-gb";

OUTPUT @rs1   
    TO "/output/SearchLog-transform-rowsets.csv"
    USING Outputters.Csv();

WHERE 子句使用 C# 布林運算式The WHERE clause uses a C# Boolean expression. 您可以使用 C# 運算式語言來建置自己的運算式和函式。You can use the C# expression language to do your own expressions and functions. 您甚至可以將它們與邏輯結合 (AND) 和邏輯分離 (OR) 做結合,以執行更複雜的篩選。You can even perform more complex filtering by combining them with logical conjunctions (ANDs) and disjunctions (ORs).

下列指令碼使用 DateTime.Parse() 方法和邏輯結合。The following script uses the DateTime.Parse() method and a conjunction.

@searchlog =
    EXTRACT UserId          int,
            Start           DateTime,
            Region          string,
            Query           string,
            Duration        int?,
            Urls            string,
            ClickedUrls     string
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

@rs1 =
    SELECT Start, Region, Duration
    FROM @searchlog
WHERE Region == "en-gb";

@rs1 =
    SELECT Start, Region, Duration
    FROM @rs1
    WHERE Start >= DateTime.Parse("2012/02/16") AND Start <= DateTime.Parse("2012/02/17");

OUTPUT @rs1   
    TO "/output/SearchLog-transform-datetime.csv"
    USING Outputters.Csv();

注意

第二個查詢會對第一個資料列集的結果起作用,因而建立兩個篩選條件的組合。The second query is operating on the result of the first rowset, which creates a composite of the two filters. 您也可以重複使用變數名稱,因為它們是語彙範圍型名稱。You can also reuse a variable name, and the names are scoped lexically.

彙總資料列集Aggregate rowsets

U-SQL 提供您已熟悉使用的 ORDER BY、GROUP BY 和各種彙總語法。U-SQL gives you the familiar ORDER BY, GROUP BY, and aggregations.

下列查詢會尋找每個區域的總持續時間,然後按順序顯示前五大持續時間。The following query finds the total duration per region, and then displays the top five durations in order.

U-SQL 資料列集不會保留它們的順序以供下一次查詢使用。U-SQL rowsets do not preserve their order for the next query. 因此,若要對輸出排序,您需要將 ORDER BY 新增至 OUTPUT 陳述式:Thus, to order an output, you need to add ORDER BY to the OUTPUT statement:

DECLARE @outpref string = "/output/Searchlog-aggregation";
DECLARE @out1    string = @outpref+"_agg.csv";
DECLARE @out2    string = @outpref+"_top5agg.csv";

@searchlog =
    EXTRACT UserId          int,
            Start           DateTime,
            Region          string,
            Query           string,
            Duration        int?,
            Urls            string,
            ClickedUrls     string
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

@rs1 =
    SELECT
        Region,
        SUM(Duration) AS TotalDuration
    FROM @searchlog
GROUP BY Region;

@res =
    SELECT *
    FROM @rs1
    ORDER BY TotalDuration DESC
    FETCH 5 ROWS;

OUTPUT @rs1
    TO @out1
    ORDER BY TotalDuration DESC
    USING Outputters.Csv();

OUTPUT @res
    TO @out2
    ORDER BY TotalDuration DESC
    USING Outputters.Csv();

U-SQL ORDER BY 子句必須在 SELECT 運算式中使用 FETCH 子句。The U-SQL ORDER BY clause requires using the FETCH clause in a SELECT expression.

U-SQL 的 HAVING 子句可以用來將輸出限制為符合 HAVING 條件的群組:The U-SQL HAVING clause can be used to restrict the output to groups that satisfy the HAVING condition:

@searchlog =
    EXTRACT UserId          int,
            Start           DateTime,
            Region          string,
            Query           string,
            Duration        int?,
            Urls            string,
            ClickedUrls     string
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

@res =
    SELECT
        Region,
        SUM(Duration) AS TotalDuration
    FROM @searchlog
    GROUP BY Region
    HAVING SUM(Duration) > 200;

OUTPUT @res
    TO "/output/Searchlog-having.csv"
    ORDER BY TotalDuration DESC
    USING Outputters.Csv();

如需進階的彙總案例,請參閱 U-SQL 的彙總、分析及參考函式參考文件For advanced aggregation scenarios, see the U-SQL reference documentation for aggregate, analytic, and reference functions

後續步驟Next steps