Azure Data Explorer data ingestion

Data ingestion is the process used to load data records from one or more sources to create or update a table in Azure Data Explorer. Once ingested, the data becomes available for query. Ingestion is one stage in the end-to-end flow for working in Azure Data Explorer.

Data flow

The Azure Data Explorer data management service, which is responsible for data ingestion, provides the following functionality:

  1. Data pull: Pull data from external sources (such as Event Hubs) or read ingestion requests from an Azure queue.

  2. Batching: Batch data flowing to the same database and table to optimize ingestion throughput.

  3. Validation: Perform preliminary validation and format conversion, if necessary.

  4. Data manipulation: Match the schema, then organize, index, encode, and compress the data.

  5. Persistence point in the ingestion flow: Manage ingestion load on the engine and handle retries upon transient failures.

  6. Commit the data ingest: Make the data available for query.

Ingestion methods

Azure Data Explorer supports several ingestion methods, each with its own target scenarios, advantages, and disadvantages. Azure Data Explorer offers pipelines and connectors to common services, programmatic ingestion using SDKs, and direct access to the engine for exploration purposes.

Ingestion using pipelines, connectors, and plugins

Azure Data Explorer currently supports ingestion pipelines and connectors such as Event Hubs, IoT Hub, the Kafka connector, and the Logstash plugin.

Ingestion using integration services

Azure Data Explorer can also serve as a destination for integration services such as Azure Data Factory.

Programmatic ingestion

Azure Data Explorer provides SDKs that can be used for query and data ingestion. Programmatic ingestion is optimized for reducing ingestion costs (COGS) by minimizing storage transactions during and following the ingestion process.

Available SDKs and open-source projects

Kusto offers client SDKs that can be used to ingest and query data with:

  • Python SDK
  • .NET SDK
  • Java SDK
  • Node SDK
  • REST API

Programmatic ingestion techniques:

  • Ingesting data through the Azure Data Explorer data management service (high-throughput and reliable ingestion):

    Batch ingestion (provided by the SDK): the client uploads the data to Azure Blob storage (designated by the Azure Data Explorer data management service) and posts a notification to an Azure queue. Batch ingestion is the recommended technique for high-volume, reliable, and low-cost data ingestion (see the first sketch after this list).

  • Ingesting data directly into the Azure Data Explorer engine (most appropriate for exploration and prototyping; see the second sketch after this list):

    • Inline ingestion: a control command (.ingest inline) containing in-band data, intended for ad hoc testing purposes.

    • Ingest from query: a control command (.set, .set-or-append, .set-or-replace) that points to query results, used for generating reports or small temporary tables.

    • Ingest from storage: a control command (.ingest into) with data stored externally (for example, in Azure Blob storage), which allows efficient bulk ingestion of data.
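As a concrete illustration of batch ingestion, here is a minimal sketch using recent versions of the Python SDK (azure-kusto-ingest). The cluster URI, database, table, and file name are hypothetical placeholders, and the authentication method will vary by environment.

    # Minimal sketch of queued (batch) ingestion with the Python SDK.
    from azure.kusto.data import KustoConnectionStringBuilder
    from azure.kusto.data.data_format import DataFormat
    from azure.kusto.ingest import QueuedIngestClient, IngestionProperties

    # Queued ingestion targets the data management endpoint
    # (note the "ingest-" prefix on the cluster URI).
    kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
        "https://ingest-mycluster.westus.kusto.windows.net"  # hypothetical cluster
    )
    client = QueuedIngestClient(kcsb)

    props = IngestionProperties(
        database="MyDatabase",  # hypothetical database
        table="MyTable",        # hypothetical table
        data_format=DataFormat.CSV,
    )

    # The SDK uploads the file to Azure Blob storage and posts a message to
    # an Azure queue; the service batches, processes, and commits the data
    # asynchronously.
    client.ingest_from_file("data.csv", ingestion_properties=props)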
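The direct-ingestion control commands above can be issued through the Python SDK's management-command API (azure-kusto-data). The following hedged sketch assumes hypothetical cluster, database, table, and column names (including a Timestamp column), and the blob's SAS token is elided.

    # Minimal sketch of direct ingestion using control commands (Python SDK).
    from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

    kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
        "https://mycluster.westus.kusto.windows.net"  # hypothetical cluster
    )
    client = KustoClient(kcsb)
    db = "MyDatabase"  # hypothetical database

    # Inline ingestion: in-band data, for ad hoc testing only.
    client.execute_mgmt(db, ".ingest inline into table MyTable <| 1,hello")

    # Ingest from query: materialize query results into a (small) table.
    client.execute_mgmt(
        db,
        ".set-or-append HourlyCounts <| MyTable "
        "| summarize Count=count() by bin(Timestamp, 1h)",
    )

    # Ingest from storage: bulk-load an external blob (SAS token elided).
    client.execute_mgmt(
        db,
        ".ingest into table MyTable "
        "(h'https://mystorage.blob.core.windows.net/container/data.csv')",
    )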

Latency of different methods:

Method                 Latency
Inline ingestion       Immediate
Ingest from query      Query time + processing time
Ingest from storage    Download time + processing time
Queued ingestion       Batching time + processing time

Processing time depends on the data size and is typically less than a few seconds. Batching time defaults to 5 minutes.
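The 5-minute default is governed by the IngestionBatching policy, which can be tuned per database or table. A hedged sketch of shortening the batching window, assuming a hypothetical cluster and table:

    # Hedged sketch: shorten the batching window for a table by altering its
    # IngestionBatching policy (cluster, table, and values are hypothetical).
    from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

    kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
        "https://mycluster.westus.kusto.windows.net"  # hypothetical cluster
    )
    client = KustoClient(kcsb)

    client.execute_mgmt(
        "MyDatabase",
        '.alter table MyTable policy ingestionbatching '
        '\'{"MaximumBatchingTimeSpan": "00:01:00", "MaximumNumberOfItems": 500}\'',
    )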

Choosing the most appropriate ingestion method

Before you start to ingest data, you should ask yourself the following questions.

  • Where does my data reside?
  • What is the data format, and can it be changed?
  • What are the required fields to be queried?
  • What is the expected data volume and velocity?
  • How many event types are expected (reflected as the number of tables)?
  • How often is the event schema expected to change?
  • How many nodes will generate the data?
  • What is the source OS?
  • What are the latency requirements?
  • Can one of the existing managed ingestion pipelines be used?

For organizations with existing infrastructure based on a messaging service such as Event Hubs or IoT Hub, using a connector is likely the most appropriate solution. Queued ingestion is appropriate for large data volumes.

Supported data formats

For all ingestion methods other than ingest from query, format the data so that Azure Data Explorer can parse it. The supported data formats are:

  • TXT, CSV, TSV, TSVE, PSV, SCSV, SOH
  • JSON (line-separated, multi-line), Avro
  • ZIP and GZIP

Note

When data is being ingested, data types are inferred based on the target table columns. If a record is incomplete or a field cannot be parsed as the required data type, the corresponding table columns will be populated with null values.
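As a hypothetical illustration of this behavior, inline-ingesting a record whose second field can't be parsed as a datetime leaves that column null (cluster, database, and table names are placeholders):

    # Hypothetical illustration: a field that fails datetime parsing is
    # stored as null (cluster, database, and table names are placeholders).
    from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

    kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
        "https://mycluster.westus.kusto.windows.net"  # hypothetical cluster
    )
    client = KustoClient(kcsb)
    db = "MyDatabase"  # hypothetical database

    client.execute_mgmt(db, ".create table NullDemo (Id: int, EventTime: datetime)")
    # "not-a-date" can't be parsed as datetime, so EventTime becomes null.
    client.execute_mgmt(db, ".ingest inline into table NullDemo <| 1,not-a-date")

    result = client.execute(db, "NullDemo | where isnull(EventTime) | count")
    for row in result.primary_results[0]:
        print(row)  # expected: a count of 1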

Ingestion recommendations and limitations

  • The effective retention policy of ingested data is derived from the database's retention policy. See retention policy for details.
  • Ingesting data requires Table ingestor or Database ingestor permissions.
  • Ingestion supports a maximum file size of 5 GB. The recommendation is to ingest files between 100 MB and 1 GB.

Schema mapping

Schema mapping helps bind source data fields to destination table columns.
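For example, a JSON ingestion mapping binds JSON fields to table columns by name and path. The following is a hedged sketch (cluster, table, mapping name, and JSON paths are all hypothetical, and the table is assumed to already exist); the mapping can then be referenced by name when ingesting JSON data:

    # Hedged sketch: create a named JSON ingestion mapping that binds source
    # fields to table columns (cluster, table, and paths are hypothetical).
    from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

    kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
        "https://mycluster.westus.kusto.windows.net"  # hypothetical cluster
    )
    client = KustoClient(kcsb)

    # Assumes an existing Events table with Timestamp and Message columns.
    client.execute_mgmt(
        "MyDatabase",
        ".create table Events ingestion json mapping 'EventMapping' "
        "'[{\"column\":\"Timestamp\",\"path\":\"$.ts\",\"datatype\":\"datetime\"},"
        "{\"column\":\"Message\",\"path\":\"$.msg\",\"datatype\":\"string\"}]'",
    )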
