What is Apache Kafka in Azure HDInsight?

Apache Kafka is an open-source distributed streaming platform that can be used to build real-time streaming data pipelines and applications. Kafka also provides message broker functionality similar to a message queue, where you can publish and subscribe to named data streams.

The following are specific characteristics of Kafka on HDInsight:

  • It is a managed service that provides a simplified configuration process. The result is a configuration that is tested and supported by Microsoft.

  • Microsoft provides a 99.9% Service Level Agreement (SLA) on Kafka uptime. For more information, see the SLA information for HDInsight document.

  • It uses Azure Managed Disks as the backing store for Kafka. Managed Disks can provide up to 16 TB of storage per Kafka broker. For information on configuring managed disks with Kafka on HDInsight, see Increase scalability of Apache Kafka on HDInsight.

    For more information on managed disks, see Azure Managed Disks.

  • Kafka was designed with a single-dimensional view of a rack. Azure separates a rack into two dimensions: Update Domains (UDs) and Fault Domains (FDs). Microsoft provides tools that rebalance Kafka partitions and replicas across UDs and FDs.

    For more information, see High availability with Apache Kafka on HDInsight.

  • HDInsight allows you to change the number of worker nodes (which host the Kafka brokers) after cluster creation. Scaling can be performed from the Azure portal, Azure PowerShell, and other Azure management interfaces. For Kafka, you should rebalance partition replicas after scaling operations. Rebalancing partitions allows Kafka to take advantage of the new number of worker nodes.

    For more information, see High availability with Apache Kafka on HDInsight.

  • Azure Monitor logs can be used to monitor Kafka on HDInsight. Azure Monitor logs surface virtual machine level information, such as disk and NIC metrics, as well as JMX metrics from Kafka.

    For more information, see Analyze logs for Apache Kafka on HDInsight.

Apache Kafka on HDInsight architecture

The following diagram shows a typical Kafka configuration that uses consumer groups, partitioning, and replication to offer parallel reading of events with fault tolerance:

Kafka cluster configuration diagram

Apache ZooKeeper manages the state of the Kafka cluster. ZooKeeper is built for concurrent, resilient, and low-latency transactions.

Kafka stores records (data) in topics. Records are produced by producers, and consumed by consumers. Producers send records to Kafka brokers. Each worker node in your HDInsight cluster is a Kafka broker.

Topics partition records across brokers. When consuming records, you can use up to one consumer per partition within a consumer group to achieve parallel processing of the data.
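The partitioning idea can be illustrated with a small, self-contained Python sketch. It is a simulation only and does not use any Kafka client library. The function names `partition_for` and `assign_partitions`, the CRC32 hash, and the round-robin assignment are all assumptions made for illustration, not Kafka's actual partitioner or group assignment strategy.

```python
import zlib
from typing import Dict, List


def partition_for(key: str, num_partitions: int) -> int:
    """Map a record key to a partition index with a stable hash.

    CRC32 is an assumption made for this sketch; it is not the hash that
    Kafka's own partitioner uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions: int, consumers: List[str]) -> Dict[str, List[int]]:
    """Give every partition exactly one owning consumer, round-robin."""
    assignment: Dict[str, List[int]] = {name: [] for name in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


if __name__ == "__main__":
    # Records with the same key always land in the same partition.
    for key in ["user-1", "user-2", "user-1"]:
        print(key, "-> partition", partition_for(key, 4))

    # Two consumers share four partitions.
    print(assign_partitions(4, ["consumer-a", "consumer-b"]))
    # With five consumers and four partitions, the fifth consumer gets nothing.
    print(assign_partitions(4, ["c0", "c1", "c2", "c3", "c4"]))
```

In this sketch, adding consumers beyond the number of partitions does not increase parallelism, because the extra consumers own no partition.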

Replication is employed to duplicate partitions across nodes, protecting against node (broker) outages. A partition denoted with an (L) in the diagram is the leader for the given partition. Producer traffic is routed to the leader of each partition, using the state managed by ZooKeeper.
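To make replication and leadership more concrete, here is a simplified Python simulation. It assumes a naive placement rule (replicas on consecutive brokers, first replica acts as leader) and a naive failover rule (the first surviving replica takes over). The names `place_replicas` and `leader_of` are hypothetical, and real Kafka leader election is more involved than this sketch.

```python
from typing import Dict, Iterable, List, Optional


def place_replicas(
    num_partitions: int, brokers: List[str], replication_factor: int
) -> Dict[int, List[str]]:
    """Place each partition's replicas on consecutive brokers.

    The first broker in each list is treated as the preferred leader.
    """
    if replication_factor > len(brokers):
        raise ValueError("replication factor cannot exceed the number of brokers")
    placement: Dict[int, List[str]] = {}
    for partition in range(num_partitions):
        placement[partition] = [
            brokers[(partition + i) % len(brokers)]
            for i in range(replication_factor)
        ]
    return placement


def leader_of(replicas: List[str], live_brokers: Iterable[str]) -> Optional[str]:
    """Return the first replica whose broker is still alive, or None."""
    live = set(live_brokers)
    for broker in replicas:
        if broker in live:
            return broker
    return None


if __name__ == "__main__":
    brokers = ["broker-0", "broker-1", "broker-2"]
    placement = place_replicas(3, brokers, replication_factor=2)
    print(placement)

    # Simulate an outage of broker-0: partition 0 fails over to broker-1.
    survivors = ["broker-1", "broker-2"]
    for partition, replicas in placement.items():
        print(partition, "leader:", leader_of(replicas, survivors))
```

With a replication factor of 2, every partition in this sketch survives the loss of any single broker, which is the property the paragraph above describes.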

Why use Apache Kafka on HDInsight?

The following are common tasks and patterns that can be performed using Kafka on HDInsight:

  • Replication of Apache Kafka data: Kafka provides the MirrorMaker utility, which replicates data between Kafka clusters.

    For information on using MirrorMaker, see Replicate Apache Kafka topics with Apache Kafka on HDInsight.

  • Publish-subscribe messaging pattern: Kafka provides a Producer API for publishing records to a Kafka topic. The Consumer API is used when subscribing to a topic.

    For more information, see Start with Apache Kafka on HDInsight.

  • Stream processing: Kafka is often used with Apache Storm or Spark for real-time stream processing. Kafka 0.10.0.0 (HDInsight version 3.5 and 3.6) introduced a streaming API that allows you to build streaming solutions without requiring Storm or Spark.

    For more information, see Start with Apache Kafka on HDInsight.

  • Horizontal scale: Kafka partitions streams across the nodes in the HDInsight cluster. Consumer processes can be associated with individual partitions to provide load balancing when consuming records.

    For more information, see Start with Apache Kafka on HDInsight.

  • In-order delivery: Within each partition, records are stored in the stream in the order that they were received. By associating one consumer process per partition, you can guarantee that records are processed in order.

    For more information, see Start with Apache Kafka on HDInsight.
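The in-order delivery pattern in the list above can be sketched as an append-only log read by a single consumer. This is a pure Python simulation under stated assumptions: the classes `PartitionLog` and `SingleConsumer` are hypothetical and are not part of the Kafka Producer or Consumer APIs.

```python
from typing import List


class PartitionLog:
    """An append-only list standing in for one partition of a topic."""

    def __init__(self) -> None:
        self._records: List[str] = []

    def append(self, value: str) -> int:
        """Store a record and return its offset (its position in the log)."""
        self._records.append(value)
        return len(self._records) - 1

    def read(self, offset: int, max_records: int = 10) -> List[str]:
        """Return up to max_records records starting at the given offset."""
        return self._records[offset:offset + max_records]


class SingleConsumer:
    """One consumer bound to one partition; it remembers its own position."""

    def __init__(self, log: PartitionLog) -> None:
        self._log = log
        self.position = 0

    def poll(self, max_records: int = 10) -> List[str]:
        """Fetch the next batch in log order and advance the position."""
        batch = self._log.read(self.position, max_records)
        self.position += len(batch)
        return batch


if __name__ == "__main__":
    log = PartitionLog()
    for event in ["login", "view-item", "logout"]:
        log.append(event)

    consumer = SingleConsumer(log)
    print(consumer.poll(max_records=2))
    print(consumer.poll(max_records=2))
```

Because only one consumer reads the log and it always resumes from its last position, records come out in the order they were appended. Ordering across different partitions is not addressed by this sketch.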

Use cases

  • Messaging: Since it supports the publish-subscribe message pattern, Kafka is often used as a message broker.

  • Activity tracking: Since Kafka provides in-order logging of records, it can be used to track and re-create activities, such as user actions on a website or within an application.

  • Aggregation: Using stream processing, you can aggregate information from different streams to combine and centralize the information into operational data.

  • Transformation: Using stream processing, you can combine and enrich data from multiple input topics into one or more output topics.
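As a rough illustration of the aggregation and transformation use cases, the following Python sketch treats each stream as a plain iterable of (key, value) pairs. It does not use Kafka's streaming API, Storm, or Spark. The function names, the sample records, and the `profiles` lookup table are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

Record = Tuple[str, str]  # (key, value), an assumed record shape for this sketch


def aggregate_counts(*streams: Iterable[Record]) -> Dict[str, int]:
    """Aggregation: count records per key across several input streams."""
    counts: Counter = Counter()
    for stream in streams:
        for key, _value in stream:
            counts[key] += 1
    return dict(counts)


def enrich(events: Iterable[Record], profiles: Dict[str, str]) -> Iterator[Dict[str, str]]:
    """Transformation: attach profile data to each event for an output stream."""
    for user, action in events:
        yield {
            "user": user,
            "action": action,
            "country": profiles.get(user, "unknown"),
        }


if __name__ == "__main__":
    web_events = [("alice", "click"), ("bob", "click")]
    app_events = [("alice", "tap")]

    # Combine two input streams into one per-user activity count.
    print(aggregate_counts(web_events, app_events))

    # Enrich one stream with data from a lookup table.
    for record in enrich(app_events, {"alice": "NZ"}):
        print(record)
```

A real deployment would read from input topics and write results to output topics; here lists and a generator stand in for both.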

Next steps

Use the following links to learn how to use Apache Kafka on HDInsight: