
Azure Data Explorer data ingestion overview

Data ingestion is the process used to load data records from one or more sources into a table in Azure Data Explorer. Once ingested, the data becomes available for query.

The diagram below shows the end-to-end flow for working in Azure Data Explorer and the different ingestion methods.

[Diagram: overview scheme of data ingestion and management]

The Azure Data Explorer data management service, which is responsible for data ingestion, implements the following process:

  1. Azure Data Explorer pulls data from an external source and reads requests from a pending Azure queue.
  2. Data is batched or streamed to the Data Manager. Batch data flowing to the same database and table is optimized for ingestion throughput.
  3. Azure Data Explorer validates initial data and converts data formats where necessary.
  4. Further data manipulation includes matching schema, organizing, indexing, encoding, and compressing the data.
  5. Data is persisted in storage according to the set retention policy.
  6. The Data Manager then commits the data ingest to the engine, where it's available for query.

Supported data formats, properties, and permissions

Batching vs streaming ingestion

  • Batching ingestion does data batching and is optimized for high ingestion throughput. This method is the preferred and most performant type of ingestion. Data is batched according to ingestion properties. Small batches of data are then merged, and optimized for fast query results. The ingestion batching policy can be set on databases or tables. By default, the maximum batching value is 5 minutes, 1000 items, or a total size of 500 MB (see the batching-policy sketch after this list).

  • Streaming ingestion is ongoing data ingestion from a streaming source. Streaming ingestion allows near real-time latency for small sets of data per table. Data is initially ingested to row store, then moved to column store extents. Streaming ingestion can be done using an Azure Data Explorer client library or one of the supported data pipelines (see the streaming-policy sketch after this list).
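
The batching defaults above can be tuned per table (or database) with the ingestion batching policy. A minimal sketch, assuming a hypothetical table named MyTable; the values shown are illustrative, not recommendations:

```kusto
// Seal a batch after 30 seconds, 500 items, or 1 GB of raw data,
// whichever comes first (table name and values are hypothetical).
.alter table MyTable policy ingestionbatching '{"MaximumBatchingTimeSpan": "00:00:30", "MaximumNumberOfItems": 500, "MaximumRawDataSizeMB": 1024}'
```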
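
Streaming ingestion must also be enabled on the target table or database before a client can stream to it (and on the cluster itself). A minimal sketch with the same hypothetical table:

```kusto
// Enable the streaming ingestion policy on a hypothetical table.
.alter table MyTable policy streamingingestion enable
```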

Ingestion methods and tools

Azure Data Explorer supports several ingestion methods, each with its own target scenarios. These methods include ingestion tools, connectors and plugins to diverse services, managed pipelines, programmatic ingestion using SDKs, and direct access to ingestion.

Ingestion using managed pipelines

For organizations who wish to have management (throttling, retries, monitors, alerts, and more) done by an external service, using a connector is likely the most appropriate solution. Queued ingestion is appropriate for large data volumes. Azure Data Explorer supports the following Azure pipelines:

Ingestion using connectors and plugins

Programmatic ingestion using SDKs

Azure Data Explorer provides SDKs that can be used for query and data ingestion. Programmatic ingestion is optimized for reducing ingestion costs (COGs), by minimizing storage transactions during and following the ingestion process.

Available SDKs and open-source projects

Tools

  • One click ingestion: Enables you to quickly ingest data by creating and adjusting tables from a wide range of source types. One click ingestion automatically suggests tables and mapping structures based on the data source in Azure Data Explorer. One click ingestion can be used for one-time ingestion, or to define continuous ingestion via Event Grid on the container to which the data was ingested.

  • LightIngest: A command-line utility for ad-hoc data ingestion into Azure Data Explorer. The utility can pull source data from a local folder or from an Azure Blob Storage container.

Kusto Query Language ingest control commands

There are a number of methods by which data can be ingested directly to the engine using Kusto Query Language (KQL) control commands (sketches of each follow the list below). Because this method bypasses the Data Management services, it's only appropriate for exploration and prototyping. Don't use this method in production or high-volume scenarios.

  • Inline ingestion: A control command .ingest inline is sent to the engine, with the data to be ingested being a part of the command text itself. This method is intended for improvised testing purposes.

  • Ingest from query: A control command .set, .append, .set-or-append, or .set-or-replace is sent to the engine, with the data specified indirectly as the results of a query or a command.

  • Ingest from storage (pull): A control command .ingest into is sent to the engine, with the data stored in some external storage (for example, Azure Blob Storage) accessible by the engine and pointed to by the command.
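
Minimal sketches of the three command families above, assuming hypothetical table names (MyTable, MyTableSummary) and a placeholder storage URI; run each command separately:

```kusto
// Inline ingestion: the records travel in the command text itself.
.ingest inline into table MyTable <|
2024-01-01T00:00:00Z,device-1,25.2
2024-01-01T00:01:00Z,device-2,26.0

// Ingest from query: ingest the results of a query into a table,
// creating the table if it doesn't exist.
.set-or-append MyTableSummary <| MyTable | summarize Count = count() by DeviceId

// Ingest from storage (pull): the engine fetches a blob that the command
// points to (URI is a placeholder; a SAS token is typically required).
.ingest into table MyTable ('https://mystorage.blob.core.windows.net/container/data.csv') with (format='csv')
```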

Comparing ingestion methods and tools

| Ingestion name | Data type | Maximum file size | Streaming, batching, direct | Most common scenarios | Considerations |
|---|---|---|---|---|---|
| One click ingestion | *sv, JSON | 1 GB uncompressed (see note) | Batching to container, local file, and blob in direct ingestion | One-off, create table schema, definition of continuous ingestion with Event Grid, bulk ingestion with container (up to 10,000 blobs) | 10,000 blobs are randomly selected from container |
| LightIngest | All formats supported | 1 GB uncompressed (see note) | Batching via DM or direct ingestion to engine | Data migration, historical data with adjusted ingestion timestamps, bulk ingestion (no size restriction) | Case-sensitive, space-sensitive |
| ADX Kafka | | | | | |
| ADX to Apache Spark | | | | | |
| LogStash | | | | | |
| Azure Data Factory | Supported data formats | Unlimited (per ADF restrictions) | Batching or per ADF trigger | Supports formats that are usually unsupported, large files, can copy from over 90 sources, from on-premises to cloud | Time of ingestion |
| Azure Data Flow | | | | Ingestion commands as part of flow | Must have high-performing response time |
| IoT Hub | Supported data formats | N/A | Batching, streaming | IoT messages, IoT events, IoT properties | |
| Event Hub | Supported data formats | N/A | Batching, streaming | Messages, events | |
| Event Grid | Supported data formats | 1 GB uncompressed | Batching | Continuous ingestion from Azure storage, external data in Azure storage | 100 KB is optimal file size; used for blob renaming and blob creation |
| Net Std | All formats supported | 1 GB uncompressed (see note) | Batching, streaming, direct | Write your own code according to organizational needs | |
| Python | All formats supported | 1 GB uncompressed (see note) | Batching, streaming, direct | Write your own code according to organizational needs | |
| Node.js | All formats supported | 1 GB uncompressed (see note) | Batching, streaming, direct | Write your own code according to organizational needs | |
| Java | All formats supported | 1 GB uncompressed (see note) | Batching, streaming, direct | Write your own code according to organizational needs | |
| REST | All formats supported | 1 GB uncompressed (see note) | Batching, streaming, direct | Write your own code according to organizational needs | |
| Go | All formats supported | 1 GB uncompressed (see note) | Batching, streaming, direct | Write your own code according to organizational needs | |

Note

When referenced in the above table, ingestion supports a maximum file size of 5 GB. The recommendation is to ingest files between 100 MB and 1 GB.

Ingestion process

Once you have chosen the most suitable ingestion method for your needs, do the following steps:

  1. Set retention policy

    Data ingested into a table in Azure Data Explorer is subject to the table's effective retention policy. Unless set on a table explicitly, the effective retention policy is derived from the database's retention policy. Hot retention is a function of cluster size and your retention policy. Ingesting more data than you have available space will force the first-in data to cold retention.

    Make sure that the database's retention policy is appropriate for your needs. If not, explicitly override it at the table level (a retention-policy sketch appears after these steps). For more information, see retention policy.

  2. Create a table

    In order to ingest data, a table needs to be created beforehand. Use one of the following options (a table-creation sketch appears after these steps):

    Note

    If a record is incomplete or a field cannot be parsed as the required data type, the corresponding table columns will be populated with null values.

  3. Create schema mapping

    Schema mapping helps bind source data fields to destination table columns. Mapping allows you to take data from different sources into the same table, based on the defined attributes. Different types of mappings are supported, both row-oriented (CSV, JSON, and AVRO) and column-oriented (Parquet). In most methods, mappings can also be pre-created on the table and referenced from the ingest command parameter (a mapping sketch appears after these steps).

  4. Set update policy (optional)

    Some of the data format mappings (Parquet, JSON, and Avro) support simple and useful ingest-time transformations. Where the scenario requires more complex processing at ingest time, use an update policy, which allows for lightweight processing using Kusto Query Language commands. The update policy automatically runs extractions and transformations on ingested data on the original table, and ingests the resulting data into one or more destination tables (an update-policy sketch appears after these steps). Set your update policy.
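
As referenced in step 1, a minimal sketch of overriding retention at the table level; the table name and the 30-day period are hypothetical:

```kusto
// Keep data queryable for 30 days on this table, overriding the
// database-level retention policy (table name and period are hypothetical).
.alter-merge table MyTable policy retention softdelete = 30d
```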
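
As referenced in step 2, a minimal sketch of creating a table with an explicit schema; all names and column types are hypothetical:

```kusto
// Create a three-column table to receive the ingested records
// (names and types are hypothetical).
.create table MyTable (Timestamp: datetime, DeviceId: string, Temperature: real)
```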
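
As referenced in step 3, a minimal sketch of pre-creating a row-oriented (CSV) mapping on the table, binding source columns to table columns by ordinal; the mapping name and ordinals are hypothetical:

```kusto
// Map CSV columns 0-2 to the table columns by ordinal
// (mapping name and ordinals are hypothetical).
.create table MyTable ingestion csv mapping "MyCsvMapping" '[{"column": "Timestamp", "Properties": {"Ordinal": "0"}}, {"column": "DeviceId", "Properties": {"Ordinal": "1"}}, {"column": "Temperature", "Properties": {"Ordinal": "2"}}]'
```

The mapping can then be referenced from an ingest command, for example with (format='csv', ingestionMappingReference='MyCsvMapping').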
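
As referenced in step 4, a minimal sketch of an update policy that filters ingested rows from a source table into a destination table; the table names and query are hypothetical, and the query's output schema must match the destination table:

```kusto
// Whenever data lands in MyTable, run the query and ingest its
// results into MyAlerts (all names are hypothetical).
.alter table MyAlerts policy update '[{"IsEnabled": true, "Source": "MyTable", "Query": "MyTable | where Temperature > 100", "IsTransactional": false, "PropagateIngestionProperties": false}]'
```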

Next steps