
Performance optimization for Apache Kafka HDInsight clusters

This article gives some suggestions for optimizing the performance of your Apache Kafka workloads in HDInsight. The focus is on adjusting producer and broker configuration. There are different ways of measuring performance, and the optimizations that you apply will depend on your business needs.

Architecture overview

Kafka topics are used to organize records. Records are produced by producers and consumed by consumers. Producers send records to Kafka brokers, which then store the data. Each worker node in your HDInsight cluster is a Kafka broker.

Topics partition records across brokers. When consuming records, you can use up to one consumer per partition to achieve parallel processing of the data.

Replication is used to duplicate partitions across nodes. This protects against node (broker) outages. A single partition among the group of replicas is designated as the partition leader. Producer traffic is routed to the leader of each partition, using the state managed by ZooKeeper.

Identify your scenario

Apache Kafka performance has two main aspects: throughput and latency. Throughput is the maximum rate at which data can be processed. Higher throughput is usually better. Latency is the time it takes for data to be stored or retrieved. Lower latency is usually better. Finding the right balance between throughput, latency, and the cost of the application's infrastructure can be challenging. Your performance requirements will likely match one of the following three common situations, based on whether you require high throughput, low latency, or both:

  • High throughput, low latency. This scenario requires both high throughput and low latency (~100 milliseconds). An example of this type of application is service availability monitoring.
  • High throughput, high latency. This scenario requires high throughput (~1.5 GBps) but can tolerate higher latency (< 250 ms). An example of this type of application is telemetry data ingestion for near real-time processes like security and intrusion detection applications.
  • Low throughput, low latency. This scenario requires low latency (< 10 ms) for real-time processing, but can tolerate lower throughput. An example of this type of application is online spelling and grammar checks.

Producer configurations

The following sections will highlight some of the most important configuration properties to optimize performance of your Kafka producers. For a detailed explanation of all configuration properties, see the Apache Kafka documentation on producer configurations.

Batch size

Apache Kafka producers assemble groups of messages (called batches), which are sent as a unit to be stored in a single storage partition. Batch size means the number of bytes that must be present before that group is transmitted. Increasing the batch.size parameter can increase throughput, because it reduces the processing overhead from network and IO requests. Under light load, increased batch size may increase Kafka send latency as the producer waits for a batch to be ready. Under heavy load, it's recommended to increase the batch size to improve throughput and latency.
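As a rough sketch of how this might look in practice, the following hypothetical Java producer sets batch.size; the broker endpoints, topic name, and the 32 KB value are placeholders rather than recommendations:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class BatchingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker list; replace with your cluster's broker endpoints.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "wn0-kafka:9092,wn1-kafka:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Larger batches reduce per-request network and IO overhead;
        // 32 KB is an illustrative value only, not a tuned recommendation.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 32 * 1024);

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("test-topic", "key", "value"));
        }
    }
}
```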

Producer required acknowledgments

The producer required acks configuration determines the number of acknowledgments required by the partition leader before a write request is considered completed. This setting affects data reliability and takes values of 0, 1, or -1. The value of -1 means that an acknowledgment must be received from all replicas before the write is completed. Setting acks = -1 provides stronger guarantees against data loss, but it also results in higher latency and lower throughput. If your application requirements demand higher throughput, try setting acks = 0 or acks = 1. Keep in mind that not acknowledging all replicas can reduce data reliability.
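For illustration, the acks setting can be added to the producer properties sketch above; the value shown is an example, not a recommendation for every workload:

```java
// Illustrative only, added to the earlier producer properties sketch:
// "0"   -> don't wait for any acknowledgment (highest throughput, weakest durability)
// "1"   -> wait for the partition leader only
// "all" -> equivalent to -1; wait for all in-sync replicas (strongest durability, higher latency)
props.put(ProducerConfig.ACKS_CONFIG, "all");
```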

Compression

A Kafka producer can be configured to compress messages before sending them to brokers. The compression.type setting specifies the compression codec to be used. Supported compression codecs are "gzip," "snappy," and "lz4." Compression is beneficial and should be considered if there's a limitation on disk capacity.

Of the two commonly used compression codecs, gzip and snappy, gzip has a higher compression ratio, which results in lower disk usage at the cost of higher CPU load. The snappy codec provides less compression with less CPU overhead. You can decide which codec to use based on broker disk or producer CPU limitations. gzip can compress data at a rate five times higher than snappy.

Using data compression will increase the number of records that can be stored on a disk. It may also increase CPU overhead in cases where there's a mismatch between the compression formats being used by the producer and the broker, as the data must be compressed before sending and then decompressed before processing.
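As an illustrative addition to the same producer properties sketch, compression is enabled with a single setting; the codec chosen here is an example only:

```java
// Illustrative only: "gzip" trades CPU for a higher compression ratio, while
// "snappy" or "lz4" compress less but use less CPU. The default is "none".
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "gzip");
```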

Broker settings

The following sections will highlight some of the most important settings to optimize performance of your Kafka brokers. For a detailed explanation of all broker settings, see the Apache Kafka documentation on broker configurations.

Number of disks

Storage disks have limited IOPS (Input/Output Operations Per Second) and read/write bytes per second. When creating new partitions, Kafka stores each new partition on the disk with the fewest existing partitions to balance them across the available disks. Despite this storage strategy, when processing hundreds of partition replicas on each disk, Kafka can easily saturate the available disk throughput. The tradeoff here is between throughput and cost. If your application requires greater throughput, create a cluster with more managed disks per broker. HDInsight doesn't currently support adding managed disks to a running cluster. For more information on how to configure the number of managed disks, see Configure storage and scalability for Apache Kafka on HDInsight. Understand the cost implications of increasing storage space for the nodes in your cluster.

Number of topics and partitions

Kafka producers write to topics. Kafka consumers read from topics. A topic is associated with a log, which is a data structure on disk. Kafka appends records from producers to the end of a topic log. A topic log consists of many partitions that are spread over multiple files. These files are, in turn, spread across multiple Kafka cluster nodes. Consumers read from Kafka topics at their own cadence and can pick their position (offset) in the topic log.

Each Kafka partition is a log file on the system, and producer threads can write to multiple logs simultaneously. Similarly, since each consumer thread reads messages from one partition, consuming from multiple partitions is handled in parallel as well.

Increasing the partition density (the number of partitions per broker) adds an overhead related to metadata operations and per-partition request/response traffic between the partition leader and its followers. Even in the absence of data flowing through, partition replicas still fetch data from leaders, which results in extra processing for send and receive requests over the network.

For Apache Kafka clusters 1.1 and above in HDInsight, we recommend having a maximum of 1000 partitions per broker, including replicas. Increasing the number of partitions per broker decreases throughput and may also cause topic unavailability. For more information on Kafka partition support, see the official Apache Kafka blog post on the increase in the number of supported partitions in version 1.1.0. For details on modifying topics, see Apache Kafka: modifying topics.
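For illustration, the following hypothetical sketch creates a topic with an explicit partition count and replication factor using the Kafka AdminClient; the broker endpoint, topic name, and counts are placeholder values, chosen only to show where these settings are applied:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder broker endpoint; replace with your cluster's broker list.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "wn0-kafka:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // 8 partitions and a replication factor of 3 are illustrative values only;
            // keep the total partition count per broker (including replicas) within the
            // limits discussed above.
            NewTopic topic = new NewTopic("test-topic", 8, (short) 3);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```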

Number of replicas

A higher replication factor results in additional requests between the partition leader and followers. Consequently, a higher replication factor consumes more disk and CPU to handle additional requests, increasing write latency and decreasing throughput.

We recommend that you use at least 3x replication for Kafka in Azure HDInsight. Most Azure regions have three fault domains, but in regions with only two fault domains, users should use 4x replication.
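As an optional check, the AdminClient can also be used to inspect the replication factor of an existing topic; this sketch uses the same placeholder broker endpoint and topic name as the example above:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class CheckReplication {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "wn0-kafka:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            Map<String, TopicDescription> topics =
                admin.describeTopics(Collections.singletonList("test-topic")).all().get();
            // Report the replica count of the first partition as the topic's replication factor.
            topics.forEach((name, desc) ->
                System.out.println(name + " replication factor: "
                    + desc.partitions().get(0).replicas().size()));
        }
    }
}
```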

For more information on replication, see Apache Kafka: replication and Apache Kafka: increasing replication factor.

Next steps