KIP-392: Fetch From Follower

Why enabling this feature can save you up to 50% of your Apache Kafka deployment's total cost in the cloud

The Fetch Problem

Kafka is predominantly deployed across multiple data centers (or AZs in the cloud) for availability and durability purposes.

By default, Kafka consumers read from the leader replica.
But in most cases, that leader will sit in a different data center than the consumer. ❗️

In distributed systems, it is best practice to process data as locally as possible. The benefits are:

  • 📉 better latency - your request needs to travel less

  • 💸 (massive) cloud cost savings in avoiding sending data across availability zones

Cost

Any production Kafka environment spans at least three availability zones (AZs), which results in Kafka racking up a lot of cross-zone traffic.

Assuming even distribution:

  1. 2/3 of all producer traffic

  2. all replication traffic

  3. 2/3 of all consumer traffic

will cross zone boundaries. 😱

network cost example of 100MiB/s produce / 300 MiB/s consume on AWS
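Some back-of-the-envelope math for that example. This is a sketch, assuming 3 AZs, replication factor 3, evenly spread clients, and AWS's roughly $0.01/GB charged on each side of a cross-AZ transfer (~$0.02/GB total; check current pricing):

```python
# Rough cross-AZ traffic and cost for 100 MiB/s produce, 300 MiB/s consume.
produce, consume = 100, 300          # MiB/s
producer_cross = produce * 2 / 3     # 2/3 of producers hit a leader in another AZ
replication_cross = produce * 2      # RF=3: both follower copies cross zones
consumer_cross = consume * 2 / 3     # 2/3 of consumers read from another AZ
total_mib_s = producer_cross + replication_cross + consumer_cross

seconds_per_month = 60 * 60 * 24 * 30
gib_per_month = total_mib_s * seconds_per_month / 1024
cost = gib_per_month * 0.02          # ~$0.01/GB billed in each direction
print(f"{total_mib_s:.0f} MiB/s cross-zone ≈ ${cost:,.0f}/month")
# → 467 MiB/s cross-zone ≈ $23,625/month
```

The exact dollar figure will vary with your provider and whether you count GB vs GiB, but the shape of the bill doesn't: the consumer term alone is a large slice, and it's the slice KIP-392 attacks.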

Cloud providers charge you egregiously for cross-zone networking.

How do we fix this?

There is no fundamental reason why the Consumer wouldn’t be able to read from the follower replicas in the same AZ.

💡 The log is immutable, so once written - the data isn’t subject to change.

Enter KIP-392.

KIP-392

⭐️ the feature: consumers read from follower brokers.

The feature is configurable with all sorts of custom logic to have the leader broker choose the right follower for the consumer. The default implementation chooses a broker in the same rack.
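For reference, wiring it up takes just two settings: the broker-side selector (the class below is Kafka's built-in rack-aware implementation) plus the consumer's rack hint. The rack values here are illustrative:

```properties
# broker: server.properties
broker.rack=us-east-1a
replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector

# consumer client config
client.rack=us-east-1a
```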

Despite the data living closer, fetching the latest data actually incurs slightly higher latency. The high watermark needs an extra request to propagate from the leader to the follower, which artificially delays when the follower can “reveal” the record to the consumer.

How it Works 👇

  1. The client sends its configured client.rack to the broker in each fetch request.

  2. For each partition the broker leads, it uses its configured replica.selector.class to choose what the PreferredReadReplica for that partition should be and returns it in the response (without any extra record data).

  3. The consumer will connect to the follower and start fetching from it for that partition 🙌
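The selection logic in step 2 can be sketched like so. This is a simplified illustration, not Kafka's actual code (the real RackAwareReplicaSelector is a Java plugin running on the broker):

```python
# Sketch: pick an in-sync replica in the client's rack, else fall back to the leader.
def preferred_read_replica(client_rack, replicas):
    """replicas: list of (broker_id, rack, in_sync) tuples for one partition."""
    same_rack = [bid for bid, rack, in_sync in replicas
                 if in_sync and rack == client_rack]
    if same_rack:
        return same_rack[0]   # a replica (follower or leader) in the client's rack
    return None               # no local replica: consumer keeps reading the leader

replicas = [(1, "us-east-1a", True), (2, "us-east-1b", True), (3, "us-east-1c", True)]
print(preferred_read_replica("us-east-1b", replicas))  # → 2
```

Note the broker only returns the PreferredReadReplica ID; the consumer then issues its subsequent fetches directly to that broker.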

KIP-392 in Action

🤑 The Savings

KIP-392 can essentially eliminate ALL of the consumers' cross-zone networking costs.

same AWS cluster setup, this time with KIP-392 enabled

This is always a significant chunk of the total networking costs. 💡

The higher the fanout, the higher the savings:
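One way to see the fanout effect, under the same illustrative assumptions as before (3 AZs, RF=3, evenly spread clients):

```python
# Sketch: fraction of the cross-zone bill that KIP-392 removes, as fanout grows.
def cross_zone_savings(produce_mib_s, fanout):
    consume = produce_mib_s * fanout
    producer_cross = produce_mib_s * 2 / 3   # unavoidable: produces go to the leader
    replication_cross = produce_mib_s * 2    # RF=3: two follower copies cross zones
    consumer_cross = consume * 2 / 3         # this term is what KIP-392 eliminates
    total = producer_cross + replication_cross + consumer_cross
    return consumer_cross / total            # fraction of cross-zone traffic saved

for fanout in (1, 3, 9):
    print(f"fanout {fanout}x -> {cross_zone_savings(100, fanout):.0%} saved")
# → fanout 1x -> 20% saved
# → fanout 3x -> 43% saved
# → fanout 9x -> 69% saved
```

Consumer fanout of 3x already cuts the cross-zone bill nearly in half, and read-heavy workloads save even more.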

Support Table

Released in AK 2.4 (October 2019), this feature is 5+ years old yet there is STILL no wide support for it in the cloud:

I would have never expected MSK to have led the way here, especially by 3 years. 👏
They’re the least incentivized out of all the providers to do so - they make money off of cross-zone traffic.

Speaking of which… why aren’t any of these providers offering pricing discounts when FFF is used? 🤔

Liked this edition?

Help support our growth so that we can continue to deliver valuable content!

And if you really enjoy the newsletter in general - please forward it to an engineer. It only takes 5 seconds. Writing it takes me 5+ hours.

🗣 The Latest on The Socials

What more did we post on other platforms?

🤑 The Story of How Confluent Bought WarpStream for $220 MILLION (…after just 13 months of operation…)

✨ Apache Kafka 3.9 Release Summary

😱 Kafka HDD Costs on AWS vs Hetzner

🤫 5 Log Details that you probably didn’t know about

More Content?

Make sure to follow me on all mediums to not miss anything:

Apache®, Apache Kafka®, Kafka, and the Kafka logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.