Azure Stream Analytics - write to Delta Lake table

Article
03/29/2024

Delta Lake is an open format that brings reliability, quality, and performance to data lakes. Azure Stream Analytics allows you to directly write streaming data to your delta lake tables without writing a single line of code.

A stream analytics job can be configured to write through a native delta lake output connector, either to a new or a precreated Delta table in an Azure Data Lake Storage Gen2 account. This connector is optimized for high-speed ingestion to delta tables in append mode and also provides exactly once semantics, which guarantees that no data is lost or duplicated. Ingesting real-time data streams from Azure Event Hubs into Delta tables allows you to perform ad-hoc interactive or batch analytics.

Delta Lake configuration

To write data in Delta Lake, you need to connect to an Azure Data Lake Storage Gen2 account. The below table lists the properties related to Delta Lake configuration.

Property Name	Description
Event Serialization Format	Serialization format for output data. JSON, CSV, AVRO, Parquet are supported. Delta Lake is listed as an option here. The data is in Parquet format if Delta Lake is selected.
Delta path name	The path that is used to write your delta lake table within the specified container. It includes the table name. More details in the next section.
Partition Column	Optional. The {field} name from your output data to partition. Only one partition column is supported. The column's value must be of string type

To see the full list of ADLS Gen2 configuration, see ALDS Gen2 Overview.

Delta Path name

The Delta Path Name is used to specify the location and name of your Delta Lake table stored in Azure Data Lake Storage Gen2.

You can choose to use one or more path segments to define the path to the delta table and the delta table name. A path segment is the string between consecutive delimiter characters (for example, the forward slash /) that corresponds to the name of a virtual directory.

The segment name is alphanumeric and can include spaces, hyphens, and underscores. The last path segment is used as the table name.

Restrictions on Delta Path name include the following ones:

Field names aren't case-sensitive. For example, the service can't differentiate between column ID and id.
No dynamic {field} name is allowed. For example, {ID} is treated as text {ID}.
The number of path segments comprising the name can't exceed 254.

Examples

Examples for Delta path name:

Example 1: WestUS/CA/factory1/device-table
Example 2: Test/demo
Example 3: mytable

Example output files:

Under the chosen container, directory path would be WestEurope/CA/factory1 and delta table folder name would be device-table.
Under the chosen container, directory path would be Test and delta table folder name would be demo.
Under the chosen container, delta table folder name would be mytable.

Creating a new table

If there isn't already a Delta Lake table with the same name and in the location specified by the Delta Path name, by default, Azure Stream Analytics creates a new Delta Table. This new table is created with the following configuration:

Writer Version 2
Reader Version 1
The table is Append-Only
The table schema is created with the schema of the first record encountered.

Writing to the table

If there's already a Delta Lake table existing with the same name and in the location specified by the Delta Path name, by default, Azure Stream Analytics writes new records to the existing table.

Exactly once delivery

The transaction log enables Delta Lake to guarantee exactly once processing. Azure Stream Analytics also provides exactly once delivery when outputting data to Azure Data Lake Storage Gen2 during a single job run.

Schema enforcement

Schema enforcement means that all new writes to a table are enforced to be compatible with the target table's schema at write time, to ensure data quality.

All records of output data are projected to the schema of the existing table. If the output is being written to a new delta table, the table schema is created with the first record. If the incoming data has one extra column compared to the existing table schema, it's written in the table without the extra column. If the incoming data is missing one column compared to the existing table schema, it's written in the table with the column being null.

If there's no intersection between the schema of the delta table and the schema of a record of the streaming job, it's considered an instance of schema conversion failure. It isn't the only case that would be considered schema conversion failure.

At the failure of schema conversion, the job behavior follows the output data error handing policy configured at the job level.

Delta Log checkpoints

The Stream Analytics job creates Delta Log checkpoints periodically in the V1 format. Delta Log checkpoints are snapshots of the Delta Table and typically contain the name of the data file generated by the Stream Analytics job. If the number of data files is large, then it leads to large checkpoints, which can cause memory issues in the Stream Analytics Job.

Limitations

Dynamic partition key(specifying the name of a column of the record schema in the Delta Path) isn't supported.
Multiple partition columns aren't supported. If multiple partition columns are desired, the recommendation is to use a composite key in the query and then specify it as the partition column.
- A composite key can be created in the query for example: "SELECT concat (col1, col2) AS compositeColumn INTO [blobOutput] FROM [input]".
Writing to Delta Lake is append only.
Schema checking in query testing isn't available.
Small file compaction isn't performed by Stream Analytics.
All data files are created without compression.
The Date and Decimal types aren't supported.
Writing to existing tables of Writer Version 7 or above with writer features fail.
- Example: Writing to existing tables with Deletion Vectors enabled fail.
- The exceptions here are the changeDataFeed and appendOnly Writer Features.
When a Stream Analytics job writes a batch of data to a Delta Lake, it can generate multiple Add File Action. When there are too many Add File Actions generated for a single batch, a Stream Analytics Job can be stuck.
- The number of Add File Actions generated are determined by many factors:
  - Size of the batch. It's determined by the data volume and the batching parameters Minimum Rows and Maximum Time
  - Cardinality of the Partition Column values of the batch.
- To reduce the number of Add File Actions generated for a batch the following steps can be taken:
  - Reduce the batching configurations Minimum Rows and Maximum Time
  - Reduce the cardinality of the Partition Column values by tweaking the input data or choosing a different partition column
Stream Analytics jobs can only read and write single part V1 Checkpoints. Multi-part checkpoints and the Checkpoint V2 format aren't supported.

Next steps

Capture data from Event Hubs in Delta Lake format