Kafka + Raft = KRaft (KIP-500)

How Apache Kafka's KRaft consensus protocol works in 2 minutes

For a distributed system (like Kafka) to detect and agree on an action (like electing a new leader), consensus needs to be reached amongst all the nodes.

To achieve this, Kafka utilizes a centralized coordination model.1

The central coordinator is none other than… a log.

Kafka durably stores all metadata changes in a special single partition called __cluster_metadata.

Each record in the log represents a single cluster event (a delta). When replayed in full, in the same order, a node deterministically rebuilds the exact same cluster end state. (Look up the stream-table duality if you haven’t.)2

[Figure: each record corresponds to a change in the cluster’s state]

In other words, the cluster metadata log is the source of truth for the latest metadata in Kafka.
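
To make the delta-and-replay idea concrete, here is a minimal sketch. Everything in it (the record shape, the keys, the values) is invented for illustration; real __cluster_metadata records are structured binary records with far richer schemas.

```java
import java.util.*;

// Toy model of "replay the deltas, get the table": every node that replays
// the same log, in the same order, ends up with the same metadata state.
// Record shape and keys are made up for this example.
public class MetadataReplay {
    record Delta(String key, String value) {}

    public static void main(String[] args) {
        List<Delta> log = List.of(
            new Delta("topic.orders", "created"),
            new Delta("topic.orders.leader.p0", "broker-1"),
            new Delta("topic.orders.leader.p0", "broker-2")  // a later re-election overrides the old value
        );

        Map<String, String> state = new HashMap<>();
        for (Delta d : log) {
            state.put(d.key(), d.value());   // apply each delta in log order
        }
        // Same end state on every node that replays the full log:
        // topic.orders=created, topic.orders.leader.p0=broker-2
        System.out.println(state);
    }
}
```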

💡 Every broker in the cluster is subscribed to this topic.

In real time, each broker pulls the latest committed3 updates.

When a new record is fetched, the broker applies it to its in-memory metadata image, which represents the latest state of the cluster.
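
As a simplification, you can picture the broker side as: fetch whatever the quorum has newly committed, apply it, remember how far you got. The sketch below is illustrative only; the types and method are invented, and the real broker uses internal metadata listeners rather than anything like this API.

```java
import java.util.*;

// Toy sketch of a broker applying fetched metadata records, but only up to
// the committed offset. All names and types are invented for illustration.
public class BrokerMetadataApply {
    record MetadataRecord(long offset, String key, String value) {}

    private final Map<String, String> image = new HashMap<>();
    private long lastApplied = -1;

    void apply(List<MetadataRecord> fetched, long committedOffset) {
        for (MetadataRecord r : fetched) {
            if (r.offset() <= lastApplied) continue;   // already applied
            if (r.offset() > committedOffset) break;   // not yet committed by the quorum, wait
            image.put(r.key(), r.value());
            lastApplied = r.offset();
        }
    }

    public static void main(String[] args) {
        BrokerMetadataApply broker = new BrokerMetadataApply();
        List<MetadataRecord> fetched = List.of(
            new MetadataRecord(0, "topic.orders.leader.p0", "broker-1"),
            new MetadataRecord(1, "topic.orders.leader.p0", "broker-2"));

        broker.apply(fetched, 0);   // only offset 0 is committed so far
        System.out.println(broker.image + " @ offset " + broker.lastApplied);  // broker-1, offset 0
        broker.apply(fetched, 1);   // offset 1 becomes committed later
        System.out.println(broker.image + " @ offset " + broker.lastApplied);  // broker-2, offset 1
    }
}
```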

Controllers

Controllers serve as the control plane for a Kafka cluster. They’re a special kind of broker that doesn’t host regular topics - they only perform cluster-management operations. Usually, you’d deploy three controller brokers.4
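
For a rough sense of what that looks like in practice, a dedicated controller node is configured with the controller role and the voter set. The snippet below just prints the relevant settings as a Java Properties object purely for illustration; the node IDs, host names and ports are placeholders, and this is not a complete configuration.

```java
import java.util.Properties;

// Sketch of the key settings for a dedicated KRaft controller node.
// IDs, hosts and ports are placeholders; see the Kafka docs for the full list.
public class ControllerConfigSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("process.roles", "controller");   // controller only, hosts no regular topics
        props.put("node.id", "1");                  // unique per node in the cluster
        props.put("controller.quorum.voters",
                  "1@controller1:9093,2@controller2:9093,3@controller3:9093");  // the 3-node quorum
        props.put("listeners", "CONTROLLER://controller1:9093");
        props.put("controller.listener.names", "CONTROLLER");
        props.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```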

At any one time, there is only one active controller - the leader of the log. Only the active controller can write to the log.

The other controllers serve as hot standbys (followers).

Controller Responsibility

The active controller is responsible for making all metadata decisions in the cluster, like electing new partition leaders (when a broker dies), creating topics, changing configs at runtime, etc.

Most importantly, it’s responsible for determining broker liveness.5

Every broker issues heartbeats to the active controller. If a broker does not send heartbeats for 6 seconds straight, it is fenced off from the cluster by the controller. The controller then assigns other brokers to act as the leaders for the partitions the fenced broker hosted.
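
A toy version of that controller-side bookkeeping is sketched below. The class, the method names and the hard-coded timeout are illustrative only, not Kafka’s actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch of controller-side liveness tracking: a broker that hasn't
// heartbeated within the session timeout is treated as fenced, and its
// partition leadership gets reassigned. Names and values are illustrative.
public class LivenessTracker {
    private static final long SESSION_TIMEOUT_MS = 6_000;
    private final Map<Integer, Long> lastHeartbeatMs = new HashMap<>();

    void onHeartbeat(int brokerId, long nowMs) {
        lastHeartbeatMs.put(brokerId, nowMs);
    }

    boolean isFenced(int brokerId, long nowMs) {
        Long last = lastHeartbeatMs.get(brokerId);
        return last == null || nowMs - last > SESSION_TIMEOUT_MS;
    }

    public static void main(String[] args) {
        LivenessTracker tracker = new LivenessTracker();
        tracker.onHeartbeat(1, 0);
        System.out.println(tracker.isFenced(1, 3_000));  // false: heartbeat 3s ago
        System.out.println(tracker.isFenced(1, 9_000));  // true: silent too long, fence and re-elect its leaders
    }
}
```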

The careful reader will now ask:

if the active controller is responsible for electing partition leaders, who’s responsible for electing the __cluster_metadata leader?

- the careful reader

This partition is special - it’s a strongly-consistent Raft-based log.

KRaft

Leader election in a distributed system is a subset of the consensus problem. Many consensus algorithms exist, like Raft, Paxos, Zab, etc.

Kafka uses a Raft-inspired algorithm to:

  • a) elect the active controller 👑

    • the controller nodes comprise a Raft quorum

    • the quorum runs a Raft election protocol to elect a leader

  • b) agree on the latest state of the metadata log

    • 1. Metadata updates are first appended to the Raft log on the active controller.

    • 2. They are marked committed only when a majority of the quorum has persisted them.

The active controller determines the leaders for all the other regular topic partitions. It writes these decisions to the log, and once they are committed by the quorum, they’re set in stone.
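
To make the “committed by a majority” rule concrete: with three controllers, a record at a given offset counts as committed once at least two of the three voters have persisted it. A toy calculation (not Kafka’s code, and the offsets are made up) looks like this:

```java
import java.util.List;

// Toy sketch of the Raft commit rule: an offset is committed once a majority
// of the voters have persisted the log at least up to that offset.
public class QuorumCommit {
    static boolean isCommitted(long offset, List<Long> voterLogEndOffsets) {
        long haveIt = voterLogEndOffsets.stream().filter(end -> end >= offset).count();
        int majority = voterLogEndOffsets.size() / 2 + 1;   // 2 of 3, 3 of 5, ...
        return haveIt >= majority;
    }

    public static void main(String[] args) {
        // Leader has written up to offset 120; the two followers are at 118 and 95.
        List<Long> voterEnds = List.of(120L, 118L, 95L);
        System.out.println(isCommitted(110, voterEnds));  // true: 2 of 3 voters have it
        System.out.println(isCommitted(119, voterEnds));  // false: only the leader has it so far
    }
}
```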

This is not too dissimilar to how things worked when Kafka used ZooKeeper. There, the controller also determined leaders, and the resulting metadata state was committed via the Zab algorithm (abstracted away under the hood of ZooKeeper).

In other words, the way leader election works is:

  • Leader election between the controllers (picking the active one) is done through a variant of Raft (KRaft)

  • Leader election between regular brokers is done through the controller.

This is in contrast to systems like Redpanda, which use a separate Raft quorum per partition.

[Figure: different leader election models]

Stability

It’s reasonable to be skeptical of the robustness of a brand-new consensus implementation.
As of writing (~Kafka 4.1), KRaft is as stable as ever. The migration has been run thoroughly in production at scale, and the feature has had several releases to marinate.

The timeline of KRaft was roughly:

  • Aug 2019: KRaft (KIP-500) proposed

  • Apr 2021 (Kafka 2.8): First early access KRaft mode shipped

  • Oct 2022 (Kafka 3.3): KRaft marked production-ready for new clusters

  • Oct 2023 (Kafka 3.6): ZooKeeper→KRaft migration marked production-ready

  • Nov 2024 (Kafka 3.9): Best (and final) version of ZooKeeper→KRaft migration.

  • March 2025 (Kafka 4.0): ZooKeeper support completely removed. It’s just KRaft now

Confluent migrated thousands of Kafka clusters from ZK to KRaft without downtime (Oct 2024)

If you learned something from this newsletter, I have two suggestions:

  1. Subscribe for more

  2. Share to a channel in Slack (so more of your colleagues learn)

Other Newsletter 🎈

I also write longer-form posts over at Big Data Stream.

I did 3 posts there recently:

🤔 Why Was Apache Kafka Created?

This one went viral and reached the first page of Hacker News.

🌊 A Quick Intro to Kafka Streams

🏆 How to Size your Kafka Deployment

This is a 30-minute read… a valuable piece for anybody who needs to think about how to size their cluster - including what instances to choose, disk type, disk space, etc. KIP-405 Tiered Storage completely flips the Kafka deployment model on its head, so this piece is a must-read.

Apache®, Apache Kafka®, Kafka, and the Kafka logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.

1  centralized coordination in distributed systems =
all the nodes rely on one single authority (a coordinator / leader) to make decisions, enforce rules, or keep state consistent.

2  stream-table duality (coined by Confluent) is the simple idea that a stream of events and a table are different representations of the same thing.
- Any mutation to a table (update/delete/insert) is in itself an event (in the stream).
- The table simply represents the end state collection of all events. If you start from zero and apply all the events, you reach the same table.
The idea has its roots in materialized view theory in databases and in change data capture. See the original paper: https://assets.confluent.io/m/7e921eefd5200e0d/original/20180808-WP-Streams_and_Tables_Two_Sides_of_the_Same_Coin.pdf

3  committed - under KRaft (and Raft), an update is committed only when a majority (a quorum) of nodes has replicated it. In this case, that majority is the controller set (read on).

4  Kafka shipped with a “Combined Mode” feature that technically lets you run both a controller and a regular broker server in the same JVM process, but it isn’t recommended in production and has limitations.

5  liveness is a tricky distributed-systems term that basically means the system will eventually make progress (won’t freeze up). In the context of broker liveness, it means that a dead broker will get fenced so the rest of the cluster can move forward (partitions don’t get stuck on a dead node), and that an alive broker will eventually be unfenced.