Introducing Apache Kafka on HDInsight

Apache Kafka is an open-source distributed streaming platform that can be used to build real-time streaming data pipelines and applications. Kafka also provides message broker functionality similar to a message queue, where you can publish and subscribe to named data streams. Kafka on HDInsight provides you with a managed, highly scalable, and highly available service in the Microsoft Azure cloud.

Why use Kafka on HDInsight?

Kafka on HDInsight provides the following features:

  • Service Level Agreement (SLA): SLA information for HDInsight.

  • Publish-subscribe messaging pattern: Kafka provides a Producer API for publishing records to a Kafka topic. The Consumer API is used when subscribing to a topic.

  • Stream processing: Kafka is often used with Apache Storm or Spark for real-time stream processing. Kafka 0.10.0.0 (HDInsight version 3.5 and 3.6) introduced a streaming API that allows you to build streaming solutions without requiring Storm or Spark.

  • Horizontal scale: Kafka partitions streams across the nodes in the HDInsight cluster. Consumer processes can be associated with individual partitions to provide load balancing when consuming records.

  • In-order delivery: Within each partition, records are stored in the stream in the order that they were received. By associating one consumer process per partition, you can guarantee that records are processed in-order.

  • Fault-tolerant: Partitions can be replicated between nodes to provide fault tolerance.

  • Integration with Azure Managed Disks: Managed disks provide higher scale and throughput for the disks used by the virtual machines in the HDInsight cluster.

    Managed disks are enabled by default for Kafka on HDInsight. The number of disks used per node can be configured during HDInsight creation. For more information on managed disks, see Azure Managed Disks.

    For information on configuring managed disks with Kafka on HDInsight, see Increase scalability of Kafka on HDInsight.

Use cases

  • Messaging: Since it supports the publish-subscribe message pattern, Kafka is often used as a message broker.

  • Activity tracking: Since Kafka provides in-order logging of records, it can be used to track and re-create activities. For example, user actions on a web site or within an application.

  • Aggregation: Using stream processing, you can aggregate information from different streams to combine and centralize the information into operational data.

  • Transformation: Using stream processing, you can combine and enrich data from multiple input topics into one or more output topics.

Architecture

Kafka cluster configuration

This diagram shows a typical Kafka configuration that uses consumer groups, partitioning, and replication to offer parallel reading of events with fault tolerance. Apache ZooKeeper is built for concurrent, resilient, and low-latency transactions, as it manages the state of the Kafka cluster. Kafka stores records in topics. Records are produced by producers, and consumed by consumers. Producers retrieve records from Kafka brokers. Each worker node in your HDInsight cluster is a Kafka broker. One partition is created for each consumer, allowing parallel processing of the streaming data. Replication is employed to spread the partitions across nodes, protecting against node (broker) outages. A partition denoted with an (L) is the leader for the given partition. Producer traffic is routed to the leader of each node, using the state managed by ZooKeeper.

Each Kafka broker uses Azure Managed Disks. The number of disks is user-defined, and can provide up to 16 TB of storage per broker.

Important

Kafka is not aware of the underlying hardware (rack) in the Azure data center. To ensure that partitions are correctly balanced across the underlying hardware, see configure high availability of data (Kafka).

Next steps

Use the following links to learn how to use Apache Kafka on HDInsight: