
Capture events through Azure Event Hubs in Azure Blob Storage or Azure Data Lake Storage

Azure Event Hubs enables you to automatically capture the streaming data in Event Hubs in an Azure Blob storage or Azure Data Lake Storage account of your choice, with the added flexibility of specifying a time or size interval. Setting up Capture is fast, there are no administrative costs to run it, and it scales automatically with Event Hubs throughput units. Event Hubs Capture is the easiest way to load streaming data into Azure, and enables you to focus on data processing rather than on data capture.

Event Hubs Capture enables you to process real-time and batch-based pipelines on the same stream. This means you can build solutions that grow with your needs over time. Whether you're building batch-based systems today with an eye towards future real-time processing, or you want to add an efficient cold path to an existing real-time solution, Event Hubs Capture makes working with streaming data easier.

How Event Hubs Capture works

Event Hubs is a time-retention durable buffer for telemetry ingress, similar to a distributed log. The key to scaling in Event Hubs is the partitioned consumer model. Each partition is an independent segment of data and is consumed independently. Over time this data ages off, based on the configurable retention period. As a result, a given event hub never gets "too full."

Event Hubs Capture enables you to specify your own Azure Blob storage account and container, or Azure Data Lake Store account, which are used to store the captured data. These accounts can be in the same region as your event hub or in another region, adding to the flexibility of the Event Hubs Capture feature.

Captured data is written in Apache Avro format: a compact, fast, binary format that provides rich data structures with inline schema. This format is widely used in the Hadoop ecosystem, Stream Analytics, and Azure Data Factory. More information about working with Avro is available later in this article.

Capture windowing

Event Hubs Capture enables you to set up a window to control capturing. This window is a minimum size and time configuration with a "first wins" policy, meaning that the first trigger encountered causes a capture operation. If you have a fifteen-minute, 100 MB capture window and send 1 MB per second, the size window triggers before the time window. Each partition captures independently and writes a completed block blob at the time of capture, named for the time at which the capture interval was encountered. The storage naming convention is as follows:

{Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}

Note that the date values are padded with zeroes; an example filename might be:

https://mystorageaccount.blob.core.windows.net/mycontainer/mynamespace/myeventhub/0/2017/12/08/03/03/17.avro
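
To illustrate the convention, here is a small Python sketch that builds the zero-padded blob name for a given partition and capture time. The namespace, event hub, and partition values are hypothetical placeholders:

from datetime import datetime, timezone

def capture_blob_name(namespace: str, eventhub: str, partition_id: int,
                      window: datetime) -> str:
    # {Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/{Hour}/{Minute}/{Second}
    # The strftime directives below are zero-padded by default.
    return (f"{namespace}/{eventhub}/{partition_id}/"
            f"{window:%Y/%m/%d/%H/%M/%S}.avro")

# Example: partition 0, capture interval hit at 2017-12-08 03:03:17 UTC
print(capture_blob_name("mynamespace", "myeventhub", 0,
                        datetime(2017, 12, 8, 3, 3, 17, tzinfo=timezone.utc)))
# mynamespace/myeventhub/0/2017/12/08/03/03/17.avro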

In the event that your Azure storage blob is temporarily unavailable, Event Hubs Capture will retain your data for the data retention period configured on your event hub and backfill the data once your storage account is available again.

Scaling to throughput units

Event Hubs traffic is controlled by throughput units. A single throughput unit allows 1 MB per second or 1000 events per second of ingress and twice that amount of egress. Standard Event Hubs can be configured with 1-20 throughput units, and you can purchase more with a quota increase support request. Usage beyond your purchased throughput units is throttled. Event Hubs Capture copies data directly from the internal Event Hubs storage, bypassing throughput unit egress quotas and saving your egress for other processing readers, such as Stream Analytics or Spark.

Once configured, Event Hubs Capture runs automatically when you send your first event, and continues running. To make it easier for your downstream processing to know that the process is working, Event Hubs writes empty files when there is no data. This process provides a predictable cadence and marker that can feed your batch processors.
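
If a batch job should treat those empty files purely as cadence markers, one minimal sketch is to skip any capture file that contains no records. This assumes the avro package from PyPI, and the file names are placeholders:

from avro.datafile import DataFileReader
from avro.io import DatumReader

def has_records(path: str) -> bool:
    # An "empty" capture file still carries a valid Avro header and schema,
    # so check for at least one record rather than for file size.
    with DataFileReader(open(path, "rb"), DatumReader()) as reader:
        for _ in reader:
            return True
    return False

for f in ["17.avro", "18.avro"]:  # placeholder file names
    print(f, "-> process" if has_records(f) else "-> empty marker, skip")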

Setting up Event Hubs Capture

You can configure Capture at event hub creation time using the Azure portal, or using Azure Resource Manager templates.
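
The same capture settings are also exposed through the management SDKs. The following is a minimal sketch, not a definitive recipe, using the azure-mgmt-eventhub Python package: all resource names and IDs are placeholders, and exact model and enum names can vary between SDK versions.

from azure.identity import DefaultAzureCredential
from azure.mgmt.eventhub import EventHubManagementClient
from azure.mgmt.eventhub.models import (
    Eventhub, CaptureDescription, Destination, EncodingCaptureDescription)

client = EventHubManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Enable Capture with a 5-minute / 300 MB window into a Blob container.
capture = CaptureDescription(
    enabled=True,
    encoding=EncodingCaptureDescription.AVRO,
    interval_in_seconds=300,
    size_limit_in_bytes=300 * 1024 * 1024,
    destination=Destination(
        name="EventHubArchive.AzureBlockBlob",
        storage_account_resource_id="<storage-account-resource-id>",
        blob_container="mycontainer",
        archive_name_format=(
            "{Namespace}/{EventHub}/{PartitionId}/{Year}/{Month}/{Day}/"
            "{Hour}/{Minute}/{Second}")))

client.event_hubs.create_or_update(
    "myresourcegroup", "mynamespace", "myeventhub",
    Eventhub(message_retention_in_days=1, partition_count=2,
             capture_description=capture))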

Exploring the captured files and working with Avro

Event Hubs Capture creates files in Avro format according to the configured capture window. You can view these files in any tool, such as Azure Storage Explorer, and you can download them locally to work on them.

The files produced by Event Hubs Capture have the following Avro schema:

(Image: the Avro schema of the capture files; the same schema appears in the Use Avro Tools section below.)

An easy way to explore Avro files is by using the Avro Tools jar from Apache. You can also use Apache Drill for a lightweight SQL-driven experience, or Apache Spark to perform complex distributed processing on the ingested data.

Use Apache Drill

Apache Drill is an "open-source SQL query engine for Big Data exploration" that can query structured and semi-structured data wherever it is. The engine can run as a standalone node or as a huge cluster for great performance.

Native support for Azure Blob storage is available, which makes it easy to query data in an Avro file, as described in the documentation:

Apache Drill: Azure Blob Storage Plugin

To easily query captured files, you can create and execute a VM with Apache Drill enabled via a container to access Azure Blob storage:

https://github.com/yorek/apache-drill-azure-blob

A full end-to-end sample is available in the Streaming at Scale repository:

Streaming at Scale: Event Hubs Capture

Use Apache Spark

Apache Spark is a "unified analytics engine for large-scale data processing." It supports different languages, including SQL, and can easily access Azure Blob storage. There are two options to run Apache Spark in Azure, HDInsight and Azure Databricks, and both provide easy access to Azure Blob storage.
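
In either environment, reading the captured Avro files is straightforward. The following PySpark sketch assumes placeholder storage account, container, and path names, and that the spark-avro package is on the classpath:

from pyspark.sql import SparkSession

# Requires the spark-avro package, for example:
#   spark-submit --packages org.apache.spark:spark-avro_2.12:3.1.2 ...
spark = SparkSession.builder.appName("ReadCapture").getOrCreate()

# wasbs:// addresses Azure Blob storage; the names below are placeholders.
df = spark.read.format("avro").load(
    "wasbs://mycontainer@mystorageaccount.blob.core.windows.net/"
    "mynamespace/myeventhub/*/2017/12/08/*/*/*.avro")

# The Body column holds the raw event payload as bytes.
df.selectExpr("CAST(Body AS STRING) AS body").show(truncate=False)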

Use Avro Tools

Avro Tools are available as a jar package. After you download the jar file, you can see the schema of a specific Avro file by running the following command:

java -jar avro-tools-1.9.1.jar getschema <name of capture file>

This command returns:

{
    "type":"record",
    "name":"EventData",
    "namespace":"Microsoft.ServiceBus.Messaging",
    "fields":[
                 {"name":"SequenceNumber","type":"long"},
                 {"name":"Offset","type":"string"},
                 {"name":"EnqueuedTimeUtc","type":"string"},
                 {"name":"SystemProperties","type":{"type":"map","values":["long","double","string","bytes"]}},
                 {"name":"Properties","type":{"type":"map","values":["long","double","string","bytes"]}},
                 {"name":"Body","type":["null","bytes"]}
             ]
}

You can also use Avro Tools to convert the file to JSON format (for example, with the tojson subcommand) and perform other processing.

To perform more advanced processing, download and install Avro for your choice of platform. At the time of this writing, there are implementations available for C, C++, C#, Java, NodeJS, Perl, PHP, Python, and Ruby.
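
For example, with the Python implementation (the avro package from PyPI; the file name below is a placeholder for a downloaded capture file), you can walk the records of a capture file and decode the Body field shown in the schema above:

import json
from avro.datafile import DataFileReader
from avro.io import DatumReader

with DataFileReader(open("17.avro", "rb"), DatumReader()) as reader:
    for record in reader:
        # Fields follow the capture schema: SequenceNumber, Offset,
        # EnqueuedTimeUtc, SystemProperties, Properties, Body.
        print(record["SequenceNumber"], record["EnqueuedTimeUtc"])
        if record["Body"] is not None:
            # Body is raw bytes; decode it if your events are JSON text.
            print(json.loads(record["Body"].decode("utf-8")))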

Apache Avro has complete Getting Started guides for Java and Python. You can also read the Getting started with Event Hubs Capture article.

How Event Hubs Capture is charged

Event Hubs Capture is metered similarly to throughput units: as an hourly charge. The charge is directly proportional to the number of throughput units purchased for the namespace. As throughput units are increased and decreased, the Event Hubs Capture meters increase and decrease in tandem to provide matching performance. For pricing details, see Event Hubs pricing.

Note that Capture does not consume egress quota, as it is billed separately.

Integration with Event Grid

You can create an Azure Event Grid subscription with an Event Hubs namespace as its source. The following tutorial shows you how to create an Event Grid subscription with an event hub as a source and an Azure Functions app as a sink: Process and migrate captured Event Hubs data to a SQL Data Warehouse using Event Grid and Azure Functions.
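
As a rough sketch of that pattern, an Event Grid-triggered Azure Function receives a Microsoft.EventHub.CaptureFileCreated event whose payload points at the newly written capture file. This assumes the Python Functions programming model with an eventGridTrigger binding configured in function.json; only the event payload is inspected here:

import logging
import azure.functions as func

def main(event: func.EventGridEvent):
    # Fired for Microsoft.EventHub.CaptureFileCreated events.
    data = event.get_json()
    # The payload includes the URL of the new capture file and its event count,
    # which a downstream job can use to fetch and process the blob.
    logging.info("New capture file: %s", data.get("fileUrl"))
    logging.info("Events in file: %s", data.get("eventCount"))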

Next steps

Event Hubs Capture is the easiest way to get data into Azure. Using Azure Data Lake, Azure Data Factory, and Azure HDInsight, you can perform batch processing and other analytics using familiar tools and platforms of your choosing, at any scale you need.

You can learn more about Event Hubs by visiting the Event Hubs documentation.