Azure Data Explorer data ingestion
Data ingestion is the process used to load data records from one or more sources to create or update a table in Azure Data Explorer. Once ingested, the data becomes available for query. The diagram below shows the end-to-end flow for working in Azure Data Explorer, including data ingestion.
The Azure Data Explorer data management service, which is responsible for data ingestion, provides the following functionality:
Data pull: Pull data from external sources (Event Hubs) or read ingestion requests from an Azure Queue.
Batching: Batch data flowing to the same database and table to optimize ingestion throughput.
Validation: Preliminary validation and format conversion if necessary.
Data manipulation: Matching schema, organizing, indexing, encoding and compressing the data.
Persistence point in the ingestion flow: Manage ingestion load on the engine and handle retries upon transient failures.
Commit the data ingest: Make the data available for query.
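The batching step in the list above can be pictured with a minimal sketch. This is only an illustration of grouping by target database and table, not the data management service's actual implementation; the function name `batch_by_target` and the record shape `(database, table, payload)` are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record shape for illustration: (database, table, payload).
Record = Tuple[str, str, str]


def batch_by_target(records: Iterable[Record]) -> Dict[Tuple[str, str], List[str]]:
    """Group payloads that flow to the same database and table into one batch."""
    batches: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for database, table, payload in records:
        batches[(database, table)].append(payload)
    return dict(batches)


if __name__ == "__main__":
    incoming = [
        ("TelemetryDb", "Requests", "row-1"),
        ("TelemetryDb", "Errors", "row-2"),
        ("TelemetryDb", "Requests", "row-3"),
    ]
    for target, payloads in batch_by_target(incoming).items():
        print(target, payloads)
```

Grouping by target means each table receives fewer, larger ingest operations, which is the throughput optimization the list describes.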
Azure Data Explorer supports several ingestion methods, each with its own target scenarios, advantages, and disadvantages. Azure Data Explorer offers pipelines and connectors to common services, programmatic ingestion using SDKs, and direct access to the engine for exploration purposes.
Ingestion using pipelines, connectors, and plugins
Azure Data Explorer currently supports:
Event Grid pipeline, which can be managed using the management wizard in the Azure portal. For more information, see Ingest Azure Blobs into Azure Data Explorer.
Event Hub pipeline, which can be managed using the management wizard in the Azure portal. For more information, see Ingest data from Event Hub into Azure Data Explorer.
Logstash plugin, see Ingest data from Logstash to Azure Data Explorer.
Kafka connector, see Ingest data from Kafka into Azure Data Explorer.
Ingestion using integration services
- Azure Data Factory (ADF), a fully managed data integration service for analytic workloads in Azure, to copy data to and from Azure Data Explorer using supported data stores and formats. For more information, see Copy data from Azure Data Factory to Azure Data Explorer.
Azure Data Explorer provides SDKs that can be used for query and data ingestion. Programmatic ingestion is optimized for reducing ingestion costs (COGS) by minimizing storage transactions during and following the ingestion process.
Available SDKs and open-source projects:
Kusto offers client SDKs that can be used to ingest and query data with:
Programmatic ingestion techniques:
Ingesting data through the Azure Data Explorer data management service (high-throughput and reliable ingestion):
Batch ingestion (provided by SDK): the client uploads the data to Azure Blob storage (designated by the Azure Data Explorer data management service) and posts a notification to an Azure Queue. Batch ingestion is the recommended technique for high-volume, reliable, and cheap data ingestion.
Ingesting data directly into the Azure Data Explorer engine (most appropriate for exploration and prototyping):
Inline ingestion: control command (.ingest inline) containing in-band data is intended for ad hoc testing purposes.
Ingest from query: control command (.set, .set-or-append, .set-or-replace) that points to query results is used for generating reports or small temporary tables.
Ingest from storage: control command (.ingest into) with data stored externally (for example, Azure Blob Storage) allows efficient bulk ingestion of data.
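As a rough sketch, the three direct-ingestion control commands listed above could be assembled as text like this. The helper names, the table name, and the blob URI are hypothetical. The command shapes are written from memory of the Kusto control-command syntax, so check the command reference before relying on them. The helpers do no escaping of names or string literals.

```python
from typing import List


def ingest_inline_command(table: str, rows: List[str]) -> str:
    """Build an inline ingestion command with in-band data (ad hoc testing)."""
    data = "\n".join(rows)
    return f".ingest inline into table {table} <|\n{data}"


def set_or_append_command(table: str, query: str) -> str:
    """Build an ingest-from-query command that appends query results to a table."""
    return f".set-or-append {table} <|\n{query}"


def ingest_from_storage_command(table: str, blob_uri: str, data_format: str) -> str:
    """Build an ingest-from-storage command that points at externally stored data."""
    return f".ingest into table {table} ('{blob_uri}') with (format='{data_format}')"


if __name__ == "__main__":
    print(ingest_inline_command("Logs", ["1,started", "2,stopped"]))
    print(set_or_append_command("RecentLogs", "Logs | take 100"))
    print(ingest_from_storage_command(
        "Logs", "https://example.blob.core.windows.net/container/file.csv", "csv"))
```

The resulting strings would then be sent to the engine through whichever client is in use; that step is left out here.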
Latency of different methods:
|Method|Latency|
|---|---|
|Ingest from query|Query time + processing time|
|Ingest from storage|Download time + processing time|
|Queued ingestion|Batching time + processing time|
Processing time depends on the data size and is usually less than a few seconds. Batching time defaults to 5 minutes.
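The latency sums in the table above can be worked through with a small calculation. The function name, the method labels, and the sample timings are assumptions for illustration; only the additive structure and the 5-minute batching default come from the text.

```python
DEFAULT_BATCHING_SECONDS = 5 * 60  # batching time defaults to 5 minutes


def estimated_latency_seconds(
    method: str,
    processing_seconds: float,
    query_seconds: float = 0.0,
    download_seconds: float = 0.0,
    batching_seconds: float = DEFAULT_BATCHING_SECONDS,
) -> float:
    """Add up the latency components listed for each ingestion method."""
    if method == "ingest_from_query":
        return query_seconds + processing_seconds
    if method == "ingest_from_storage":
        return download_seconds + processing_seconds
    if method == "queued":
        return batching_seconds + processing_seconds
    raise ValueError(f"unknown ingestion method: {method}")


if __name__ == "__main__":
    # Sample numbers only: 2 s of processing, 10 s of download.
    print(estimated_latency_seconds("queued", processing_seconds=2.0))
    print(estimated_latency_seconds("ingest_from_storage", 2.0, download_seconds=10.0))
```

With the default batching time, queued ingestion latency is dominated by the batching window rather than by processing.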
Choosing the most appropriate ingestion method
Before you start to ingest data, you should ask yourself the following questions.
- Where does my data reside?
- What is the data format, and can it be changed?
- What are the required fields to be queried?
- What is the expected data volume and velocity?
- How many event types are expected (reflected as the number of tables)?
- How often is the event schema expected to change?
- How many nodes will generate the data?
- What is the source OS?
- What are the latency requirements?
- Can one of the existing managed ingestion pipelines be used?
For organizations with an existing infrastructure that is based on a messaging service such as Event Hub or IoT Hub, using a connector is likely the most appropriate solution. Queued ingestion is appropriate for large data volumes.
Supported data formats
For all ingestion methods other than ingest from query, format the data so that Azure Data Explorer can parse it.
- The supported data formats are: TXT, CSV, TSV, TSVE, PSV, SCSV, SOH, JSON (line-separated, multi-line), Avro, and Parquet.
- ZIP and GZIP compression are supported.
When data is being ingested, data types are inferred based on the target table columns. If a record is incomplete or a field cannot be parsed as the required data type, the corresponding table columns will be populated with null values.
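The null-filling behavior described above can be simulated locally. This sketch is not the service's parser: the `PARSERS` table, the type names, and the `parse_record` function are assumptions chosen to mirror the description, with `None` standing in for a null value.

```python
from datetime import datetime
from typing import Any, Callable, Dict, List, Optional, Tuple

# Assumed mapping from column type names to Python parsers (illustrative only).
PARSERS: Dict[str, Callable[[str], Any]] = {
    "int": int,
    "real": float,
    "string": str,
    "datetime": datetime.fromisoformat,
}


def parse_record(fields: List[str], schema: List[Tuple[str, str]]) -> Dict[str, Optional[Any]]:
    """Convert raw fields to the target column types, using None for bad or missing fields."""
    row: Dict[str, Optional[Any]] = {}
    for index, (column, type_name) in enumerate(schema):
        raw = fields[index] if index < len(fields) else None  # incomplete record
        if raw is None:
            row[column] = None
            continue
        try:
            row[column] = PARSERS[type_name](raw)
        except ValueError:
            row[column] = None  # field cannot be parsed as the required type
    return row


if __name__ == "__main__":
    schema = [("Id", "int"), ("Name", "string"), ("Score", "real")]
    print(parse_record(["7", "alpha", "not-a-number"], schema))
    print(parse_record(["7"], schema))
```

The target schema drives the conversion here, which matches the statement that data types are inferred from the target table columns rather than from the data itself.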
Ingestion recommendations and limitations
- The effective retention policy of ingested data is derived from the database's retention policy. See retention policy for details.
- Ingesting data requires Table ingestor or Database ingestor permissions.
- Ingestion supports a maximum file size of 5 GB. The recommendation is to ingest files between 100 MB and 1 GB.
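A pre-flight check against the size guidance above might look like the following sketch. The function name and the returned labels are hypothetical, and the use of binary units (1 GB = 1024³ bytes) is an assumption, since the text does not say which convention applies.

```python
MB = 1024 ** 2  # assumption: binary units
GB = 1024 ** 3

MAX_FILE_BYTES = 5 * GB       # maximum supported file size
RECOMMENDED_MIN = 100 * MB    # lower end of the recommended range
RECOMMENDED_MAX = 1 * GB      # upper end of the recommended range


def classify_file_size(size_bytes: int) -> str:
    """Label a file size relative to the ingestion limit and the recommended range."""
    if size_bytes > MAX_FILE_BYTES:
        return "too_large"
    if size_bytes < RECOMMENDED_MIN:
        return "below_recommended"
    if size_bytes > RECOMMENDED_MAX:
        return "above_recommended"
    return "recommended"


if __name__ == "__main__":
    for size in (10 * MB, 500 * MB, 3 * GB, 6 * GB):
        print(size, classify_file_size(size))
```

Files labeled `below_recommended` are candidates for merging before upload, and files labeled `too_large` need to be split.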
Schema mapping helps bind source data fields to destination table columns.
- CSV mapping (optional) works with all ordinal-based formats. It can be performed using the ingest command parameter or pre-created on the table and referenced from the ingest command parameter.
- JSON mapping (mandatory) and Avro mapping (mandatory) can be performed using the ingest command parameter. They can also be pre-created on the table and referenced from the ingest command parameter.
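To make the idea of binding source fields to table columns concrete, here is a local simulation of both mapping styles. The dictionaries used here are simplified stand-ins, not the mapping JSON schema that the ingest command expects, and the function names and sample columns are hypothetical.

```python
import csv
import io
import json
from typing import Any, Dict, Optional


def apply_ordinal_mapping(line: str, mapping: Dict[str, int]) -> Dict[str, Optional[str]]:
    """Bind CSV fields to columns by position (ordinal-based formats)."""
    fields = next(csv.reader(io.StringIO(line)))
    return {
        column: fields[ordinal] if ordinal < len(fields) else None
        for column, ordinal in mapping.items()
    }


def apply_path_mapping(document: str, mapping: Dict[str, str]) -> Dict[str, Any]:
    """Bind JSON properties to columns using simple '$.a.b' style paths."""
    record = json.loads(document)
    row: Dict[str, Any] = {}
    for column, path in mapping.items():
        keys = (path[2:] if path.startswith("$.") else path).split(".")
        value: Any = record
        for key in keys:
            # Stop descending as soon as the path leaves a JSON object.
            value = value.get(key) if isinstance(value, dict) else None
        row[column] = value
    return row


if __name__ == "__main__":
    print(apply_ordinal_mapping("2024-01-01,web01,200", {"Timestamp": 0, "Status": 2}))
    print(apply_path_mapping(
        '{"ts": "2024-01-01", "host": {"name": "web01"}}',
        {"Timestamp": "$.ts", "Host": "$.host.name"},
    ))
```

Ordinal formats carry an implicit column order, which is one way to read why the CSV mapping is optional, while JSON and Avro records need an explicit path for each column.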