KIP-392: Fetch From Follower

Why enabling this feature can save you up to 50% of your Apache Kafka deployment's total cost in the cloud

The Fetch Problem

Kafka is predominantly deployed across multiple data centers (or AZs in the cloud) for availability and durability purposes.

By default, Kafka consumers read from the leader replica.
But in most cases, that leader will sit in a different data center than the consumer. ❗️

In distributed systems, it is best practice to process data as locally as possible. The benefits are:

  • 📉 better latency - your request needs to travel less

  • 💸 (massive) cloud cost savings in avoiding sending data across availability zones

Cost

Any production Kafka environment spans at least three availability zones (AZs), which results in Kafka racking up a lot of cross-zone traffic.

Assuming even distribution:

  1. 2/3 of all producer traffic

  2. all replication traffic

  3. 2/3 of all consumer traffic

will cross zone boundaries. 😱

network cost example of 100MiB/s produce / 300 MiB/s consume on AWS
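Some back-of-the-envelope math for that example. This is a sketch, assuming 3 AZs, replication factor 3, evenly spread clients, and AWS's roughly $0.01/GB charged on each side of a cross-AZ transfer (~$0.02/GB total; check current pricing):

```python
# Rough cross-AZ traffic and cost for 100 MiB/s produce, 300 MiB/s consume.
produce, consume = 100, 300          # MiB/s
producer_cross = produce * 2 / 3     # 2/3 of producers hit a leader in another AZ
replication_cross = produce * 2      # RF=3: both follower copies cross zones
consumer_cross = consume * 2 / 3     # 2/3 of consumers read from another AZ
total_mib_s = producer_cross + replication_cross + consumer_cross

seconds_per_month = 60 * 60 * 24 * 30
gib_per_month = total_mib_s * seconds_per_month / 1024
cost = gib_per_month * 0.02          # ~$0.01/GB billed in each direction
print(f"{total_mib_s:.0f} MiB/s cross-zone ≈ ${cost:,.0f}/month")
# → 467 MiB/s cross-zone ≈ $23,625/month
```

The exact dollar figure will vary with your provider and whether you count GB vs GiB, but the shape of the bill doesn't: the consumer term alone is a large slice, and it's the slice KIP-392 attacks.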

Cloud providers charge you egregiously for cross-zone networking.

How do we fix this?

There is no fundamental reason why the Consumer wouldn’t be able to read from the follower replicas in the same AZ.

💡 The log is immutable, so once written - the data isn’t subject to change.

Enter KIP-392.

KIP-392

⭐️ the feature: consumers read from follower brokers.

The feature is configurable with all sorts of custom logic to have the leader broker choose the right follower for the consumer. The default implementation chooses a broker in the same rack.
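For reference, wiring it up takes just two settings: the broker-side selector (the class below is Kafka's built-in rack-aware implementation) plus the consumer's rack hint. The rack values here are illustrative:

```properties
# broker: server.properties
broker.rack=us-east-1a
replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector

# consumer client config
client.rack=us-east-1a
```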

Despite the data living closer, fetching the latest data actually incurs slightly higher latency. The high watermark needs an extra request to propagate from the leader to the follower, which artificially delays when the follower can “reveal” the record to the consumer.

How it Works 👇

  1. The client sends its configured client.rack to the broker in each fetch request.

  2. For each partition the broker leads, it uses its configured replica.selector.class to choose what the PreferredReadReplica for that partition should be and returns it in the response (without any extra record data).

  3. The consumer will connect to the follower and start fetching from it for that partition 🙌
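The selection logic in step 2 can be sketched like so. This is a simplified illustration, not Kafka's actual code (the real RackAwareReplicaSelector is a Java plugin running on the broker):

```python
# Sketch: pick an in-sync replica in the client's rack, else fall back to the leader.
def preferred_read_replica(client_rack, replicas):
    """replicas: list of (broker_id, rack, in_sync) tuples for one partition."""
    same_rack = [bid for bid, rack, in_sync in replicas
                 if in_sync and rack == client_rack]
    if same_rack:
        return same_rack[0]   # a replica (follower or leader) in the client's rack
    return None               # no local replica: consumer keeps reading the leader

replicas = [(1, "us-east-1a", True), (2, "us-east-1b", True), (3, "us-east-1c", True)]
print(preferred_read_replica("us-east-1b", replicas))  # → 2
```

Note the broker only returns the PreferredReadReplica ID; the consumer then issues its subsequent fetches directly to that broker.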

KIP-392 in Action

🤑 The Savings

KIP-392 can essentially eliminate ALL of the consumers' cross-zone networking costs.

same AWS cluster setup, this time with KIP-392 enabled

This is always a significant chunk of the total networking costs. 💡

The higher the fanout, the higher the savings:
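One way to see the fanout effect, under the same illustrative assumptions as before (3 AZs, RF=3, evenly spread clients):

```python
# Sketch: fraction of the cross-zone bill that KIP-392 removes, as fanout grows.
def cross_zone_savings(produce_mib_s, fanout):
    consume = produce_mib_s * fanout
    producer_cross = produce_mib_s * 2 / 3   # unavoidable: produces go to the leader
    replication_cross = produce_mib_s * 2    # RF=3: two follower copies cross zones
    consumer_cross = consume * 2 / 3         # this term is what KIP-392 eliminates
    total = producer_cross + replication_cross + consumer_cross
    return consumer_cross / total            # fraction of cross-zone traffic saved

for fanout in (1, 3, 9):
    print(f"fanout {fanout}x -> {cross_zone_savings(100, fanout):.0%} saved")
# → fanout 1x -> 20% saved
# → fanout 3x -> 43% saved
# → fanout 9x -> 69% saved
```

Consumer fanout of 3x already cuts the cross-zone bill nearly in half, and read-heavy workloads save even more.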

Support Table

Released in AK 2.4 (October 2019), this feature is 5+ years old yet there is STILL no wide support for it in the cloud:

I would have never expected MSK to have led the way here, especially by 3 years. 👏
They’re the least incentivized out of all the providers to do so - they make money off of cross-zone traffic.

Speaking of which… why aren’t any of these providers offering pricing discounts when FFF is used? 🤔

Liked this edition?

Help support our growth so that we can continue to deliver valuable content!

And if you really enjoy the newsletter in general - please forward it to an engineer. It only takes 5 seconds. Writing it takes me 5+ hours.

🗣 The Latest on The Socials

What more did we post on other platforms?

🤑 The Story of How Confluent Bought WarpStream for $220 MILLION (…after just 13 months of operation…)

✨ Apache Kafka 3.9 Release Summary

😱 Kafka HDD Costs on AWS vs Hetzner

🤫 5 Log Details that you probably didn’t know about

More Content?

Make sure to follow me on all mediums to not miss anything:

Apache®, Apache Kafka®, Kafka, and the Kafka logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.