Apache Kafka migration to Azure


Apache Kafka is a highly scalable, fault-tolerant distributed messaging system that implements a publish-subscribe architecture. It's used as an ingestion layer in real-time streaming scenarios, such as IoT and real-time log monitoring systems. It's also increasingly used as the immutable append-only data store in Kappa architectures.

Apache®, Apache Spark®, Apache Hadoop®, Apache HBase, Apache Storm®, Apache Sqoop®, Apache Kafka®, and the flame logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.

Migration approach

This article presents several strategies for migrating Kafka to Azure: running Kafka on Azure infrastructure as a service (IaaS), moving to Azure Event Hubs for Kafka, and running Kafka on Azure HDInsight.

Here's a decision flowchart for deciding which to use:

Diagram that shows a decision chart for determining a strategy for migrating Kafka to Azure.

Migrate Kafka to Azure infrastructure as a service (IaaS)

For one way to migrate Kafka to Azure IaaS, see Kafka on Ubuntu VMs.

Migrate Kafka to Azure Event Hubs for Kafka

Event Hubs provides an endpoint that's compatible with the Apache Kafka producer and consumer APIs. This endpoint can be used by most Apache Kafka client applications, so it's an alternative to running a Kafka cluster on Azure. The endpoint supports clients that use versions 1.0 and later of the APIs. For more information about this feature, see Azure Event Hubs for Apache Kafka overview.
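Because the Event Hubs Kafka endpoint speaks the Kafka wire protocol, pointing an existing client at it is mostly a configuration change. The sketch below builds such a configuration using librdkafka-style keys (as used by the confluent-kafka client); the namespace and connection-string values are placeholders, and you'd pass the resulting dictionary to your Kafka producer or consumer constructor.

```python
def event_hubs_kafka_config(namespace: str, connection_string: str) -> dict:
    """Build a Kafka client configuration that targets the Event Hubs
    Kafka endpoint (librdkafka-style keys, as used by confluent-kafka)."""
    return {
        # Event Hubs exposes its Kafka-compatible endpoint on port 9093.
        "bootstrap.servers": f"{namespace}.servicebus.windows.net:9093",
        # The endpoint requires TLS plus SASL/PLAIN authentication.
        "security.protocol": "SASL_SSL",
        "sasl.mechanism": "PLAIN",
        # With Event Hubs, the SASL username is the literal string
        # "$ConnectionString" and the password is the namespace's
        # connection string.
        "sasl.username": "$ConnectionString",
        "sasl.password": connection_string,
    }

# Placeholder values for illustration only.
config = event_hubs_kafka_config("contoso-ns", "Endpoint=sb://...")
```

Existing producer and consumer code is otherwise unchanged, which is what makes this migration path attractive for clients on API version 1.0 and later.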

To learn how to migrate your Apache Kafka applications to use Azure Event Hubs, see Migrate to Azure Event Hubs for Apache Kafka Ecosystems.

Kafka and Event Hubs feature differences

How are Kafka and Event Hubs similar?

  • Both use partitions.
  • Partitions are independent.
  • Both use a client-side cursor concept.
  • Both can scale to very high workloads.
  • Conceptually, they're nearly the same.
  • Neither uses the HTTP protocol for receiving.

How are Kafka and Event Hubs different? There are differences in these areas:

  • PaaS vs. software
  • Partitioning
  • APIs
  • Runtime
  • Protocols
  • Durability
  • Security
  • Throttling
Partitioning differences

  • Kafka: Scale is managed by partition count. Event Hubs: Scale is managed by throughput units.
  • Kafka: You must load-balance partitions across machines. Event Hubs: Load balancing is automatic.
  • Kafka: You must manually re-shard by using split and merge. Event Hubs: Repartitioning isn't required.
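Because Event Hubs scales by throughput units (TUs) rather than partitions, capacity planning during a migration changes shape. As a rough sizing sketch based on the standard-tier limits (one TU covers up to 1 MB/s of ingress or 1,000 events per second, whichever is hit first), a hypothetical helper might look like this:

```python
import math

def required_throughput_units(ingress_mb_per_sec: float,
                              events_per_sec: float) -> int:
    """Estimate Event Hubs standard-tier throughput units (TUs).

    One TU covers up to 1 MB/s of ingress or 1,000 events/s,
    whichever limit is reached first, so take the larger requirement.
    """
    by_bytes = math.ceil(ingress_mb_per_sec / 1.0)
    by_events = math.ceil(events_per_sec / 1000.0)
    return max(1, by_bytes, by_events)

required_throughput_units(3.5, 1200)  # 4: the byte rate dominates here
```

This is an estimate only; validate against your actual workload and the current Event Hubs quotas for your tier.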
Durability differences

  • Kafka: Volatile by default. Event Hubs: Always durable.
  • Kafka: Replicated after ACK. Event Hubs: Replicated before ACK.
  • Kafka: Durability depends on disk and quorum. Event Hubs: Durability is provided by storage.
Security differences

  • Kafka: SSL and SASL. Event Hubs: SAS and SASL/PLAIN (RFC 4616).
  • Kafka: File-like ACLs. Event Hubs: Policy-based.
  • Kafka: Optional transport encryption. Event Hubs: Mandatory TLS.
  • Kafka: User-based. Event Hubs: Token-based (unlimited).
Other differences

  • Kafka doesn't throttle. Event Hubs supports throttling.
  • Kafka uses a proprietary protocol. Event Hubs uses the AMQP 1.0 protocol.
  • Kafka doesn't use HTTP for send. Event Hubs uses HTTP Send and Batch Send.
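The HTTP send path noted above means events can be posted to Event Hubs over plain HTTPS, authenticated with a SAS token. The sketch below generates such a token following the documented signing scheme (HMAC-SHA256 over the URL-encoded resource URI and an expiry timestamp); the namespace, hub, policy name, and key are placeholders for illustration.

```python
import base64
import hashlib
import hmac
import time
import urllib.parse

def generate_sas_token(resource_uri: str, key_name: str, key: str,
                       ttl_seconds: int = 3600) -> str:
    """Build a SharedAccessSignature token for an Event Hubs HTTPS request.

    Signs the URL-encoded resource URI plus an expiry timestamp with the
    shared access key, per the documented SAS scheme.
    """
    expiry = str(int(time.time()) + ttl_seconds)
    encoded_uri = urllib.parse.quote_plus(resource_uri)
    string_to_sign = f"{encoded_uri}\n{expiry}"
    signature = base64.b64encode(
        hmac.new(key.encode("utf-8"),
                 string_to_sign.encode("utf-8"),
                 hashlib.sha256).digest()
    ).decode("utf-8")
    return (f"SharedAccessSignature sr={encoded_uri}"
            f"&sig={urllib.parse.quote_plus(signature)}"
            f"&se={expiry}&skn={key_name}")

# The token goes in the Authorization header of a POST to
# https://<namespace>.servicebus.windows.net/<eventhub>/messages
token = generate_sas_token(
    "https://contoso-ns.servicebus.windows.net/myhub",
    "send-policy", "placeholder-key")
```

Kafka itself has no equivalent of this HTTPS path, which can matter for lightweight producers that can't carry a Kafka client library.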

Migrate Kafka on Azure HDInsight

You can migrate Kafka to Kafka on Azure HDInsight. For more information, see What is Apache Kafka in Azure HDInsight?.

Use AKS with Kafka on HDInsight

See Use Azure Kubernetes Service with Apache Kafka on HDInsight.

Kafka data migration

You can use Kafka's MirrorMaker tool to replicate topics from one cluster to another. This technique can help you migrate data after a Kafka cluster is provisioned. For more information, see Use MirrorMaker to replicate Apache Kafka topics with Kafka on HDInsight.

Here's a migration approach that uses mirroring:

  • Move producers first and then move consumers. When you migrate the producers, you prevent the production of new messages on the source Kafka cluster.
  • After the source Kafka consumes all remaining messages, you can migrate the consumers.

Here are the implementation steps:

  1. Change the Kafka connection address of the producer client to point to the new Kafka instance.
  2. Restart the producer business services and send new messages to the new Kafka instance.
  3. Wait for the data in the source Kafka to be consumed.
  4. Change the Kafka connection address of the consumer client to point to the new Kafka instance.
  5. Restart the consumer business services to consume messages from the new Kafka instance.
  6. Verify that consumers succeed in getting data from the new Kafka instance.
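Step 3 above, waiting for the source Kafka to drain, can be automated by polling the consumer group's per-partition lag until it reaches zero. The sketch below takes the lag reader as an injected function so it stays self-contained; in a real migration you'd parse the output of kafka-consumer-groups.sh or query an admin API instead, and the topic names here are hypothetical.

```python
import time
from typing import Callable, Mapping

def wait_for_drain(read_lag: Callable[[], Mapping[str, int]],
                   poll_seconds: float = 5.0,
                   timeout_seconds: float = 600.0) -> bool:
    """Poll per-partition consumer lag on the source cluster until every
    partition reaches zero (safe to cut consumers over) or until timeout."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        lag = read_lag()  # e.g. parsed from kafka-consumer-groups.sh output
        if all(v == 0 for v in lag.values()):
            return True
        time.sleep(poll_seconds)
    return False

# Simulated lag source that drains after two polls.
samples = iter([{"orders-0": 42, "orders-1": 7},
                {"orders-0": 0, "orders-1": 3},
                {"orders-0": 0, "orders-1": 0}])
drained = wait_for_drain(lambda: next(samples), poll_seconds=0.01)
```

Gating the consumer cutover (steps 4 and 5) on this check ensures no messages are stranded on the source cluster.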

Monitor the Kafka cluster

You can use Azure Monitor logs to analyze logs that are generated by Apache Kafka on HDInsight. For more information, see Analyze logs for Apache Kafka on HDInsight.

Apache Kafka Streams API

The Kafka Streams API makes it possible to process data in near real-time, and it provides the ability to join and aggregate data. There are many more features of the API worth knowing about. For more information, see Introducing Kafka Streams: Stream Processing Made Simple - Confluent.
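Kafka Streams itself is a Java library, so the following is not the Streams API; it's a minimal Python sketch, purely conceptual, of the kind of windowed aggregation the Streams DSL expresses with groupByKey/windowedBy/count. The event data is made up for illustration.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_ms):
    """Count events per key in fixed (tumbling) time windows.

    `events` is an iterable of (timestamp_ms, key) pairs; the result maps
    (window_start_ms, key) to the number of events in that window.
    """
    counts = defaultdict(int)
    for ts, key in events:
        # A tumbling window assigns each event to exactly one fixed bucket.
        window_start = (ts // window_ms) * window_ms
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(1000, "click"), (1500, "click"), (2500, "view"), (61000, "click")]
tumbling_window_counts(events, window_ms=60000)
# Window 0-60s: 2 clicks, 1 view. Window 60-120s: 1 click.
```

The real Streams API adds fault tolerance, state stores, and joins on top of this idea; see the Confluent article linked above for the full feature set.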

The Microsoft and Confluent partnership

Confluent provides a cloud-native service for Apache Kafka. Microsoft and Confluent have a strategic alliance. For more information, see:

Contributors

This article is maintained by Microsoft. It was originally written by the following contributors.

Principal authors:

Other contributors:


Next steps

Azure product introductions

Azure product reference

Other