Slack 🤝 Kafka
Slack started their Kafka journey with… Redis.
They initially used Redis as a queue for async processing of any tasks that were too slow for a web request - e.g. unfurling links, sending notifications, updating search indexes, etc.

High Level Redis Architecture
Then, one day in 2016,
they had a big incident.
A database slowdown led to a:
job execution slowdown, which led to:
Redis hitting its memory limit.
When Redis exhausts memory, it doesn’t allow you to enqueue new jobs. Slack lost data during this incident (the jobs they couldn’t enqueue).
And ultimately - Redis got completely stuck. It turns out dequeueing something from Redis requires a tiny amount of memory too. 😬
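A toy simulation of that failure mode (not Redis itself - just a memory-capped queue with Redis-style "noeviction" behavior; all names here are illustrative):

```python
class BoundedQueue:
    """Toy model of a memory-capped queue that rejects writes when full,
    like Redis with maxmemory set and a noeviction policy."""

    def __init__(self, max_items):
        self.max_items = max_items  # stand-in for Redis's maxmemory limit
        self.items = []
        self.lost = 0  # jobs we could not enqueue

    def enqueue(self, job):
        if len(self.items) >= self.max_items:
            # Redis returns an OOM error here; the job is gone unless the
            # producer retries or buffers it somewhere durable.
            self.lost += 1
            return False
        self.items.append(job)
        return True

    def dequeue(self):
        return self.items.pop(0) if self.items else None


q = BoundedQueue(max_items=3)
for job in ["unfurl", "notify", "reindex", "bill"]:
    q.enqueue(job)

print(q.lost)  # 1 -- the fourth job was dropped, mirroring the data Slack lost
```

Putting a durable log (Kafka) in front of the queue is exactly what removes this failure mode: writes land on disk instead of competing for the queue's memory.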
They solved this by migrating to Kafka. Incrementally.
They first placed Kafka in front of Redis to act as a durable store.
They wrote kafkagate, an HTTP proxy placed in front of Kafka, for their PHP web apps to interface with.

Kafka + Redis Intermediate Architecture
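kafkagate's internals aren't public, but the core idea - an HTTP endpoint that turns a POSTed job into a Kafka produce call - can be sketched like this (the "jobs" topic name and the `produce` callback are assumptions; in practice `produce` would be a real client call such as kafka-python's `KafkaProducer.send`):

```python
import json


def handle_enqueue(raw_body, produce):
    """kafkagate-style request handler: decode the job payload sent by
    the PHP web app and hand it to Kafka via `produce(topic, value)`."""
    job = json.loads(raw_body)  # reject malformed payloads early
    produce("jobs", json.dumps(job).encode())  # topic name is an assumption
    return 204  # ack to the web app once the produce is issued


# Usage with a stub producer, so the sketch runs without a broker:
sent = []
status = handle_enqueue(
    b'{"type": "unfurl", "url": "https://example.com"}',
    lambda topic, value: sent.append((topic, value)),
)
print(status, len(sent))  # 204 1
```

The proxy keeps the PHP apps free of Kafka client code: they speak plain HTTP, and durability concerns live in one place.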
Kafka 🤝 Slack’s Data Warehouse
In 2017, Slack shared that Kafka is also used to collect data (logs, jobs, etc) & push it to their data warehouse in S3.
They used Pinterest’s Secor library as the service that persists Kafka messages to S3 (essentially the equivalent of a sink connector).

Kafka 🤝 Slack’s Distributed Tracing Events
Another use case they have is shepherding distributed tracing events into the appropriate data stores for visualization purposes.
This is at the following scale:
310M traces a day (3587/s)
8.5B spans a day (98.3k/s)
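Those per-second figures are just the daily totals divided by the 86,400 seconds in a day:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

traces_per_sec = 310_000_000 / SECONDS_PER_DAY    # 310M traces/day
spans_per_sec = 8_500_000_000 / SECONDS_PER_DAY   # 8.5B spans/day

print(int(traces_per_sec))  # 3587
print(int(spans_per_sec))   # 98379 -> ~98.3k/s
```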
The Latest Stats ✨
Slack continued to grow their Kafka usage across the org - with different teams adopting Kafka in their own setups. This eventually led to a fragmentation of versions & duplicate effort in managing Kafka.
Year by year, Kafka became an increasingly central nervous system at the company, moving mission-critical data.
In 2022, it powered:
logging pipelines
trace data
billing
enterprise analytics
security analytics
The latest numbers I could find were the following:
6.5 Gbps
1,000,000s of messages a second
700TB of data (0.7PB)
10 Kafka clusters
100s of nodes
Managing 10 clusters at this scale required some work - they invested in automating many processes:
topic/partition creation & reassignment
capacity planning
adding/removing brokers
replacing nodes / upgrading versions
observability
reducing general ops toil
And Slack’s Kafka usage is only growing.
They formed a new Data Streaming team to handle all current and future Kafka use cases.
Immediate future plans include a new Change Data Capture (CDC) project. It will support the caching needs for Slack’s Permission Service used to authorize actions in Slack and enable near real-time updates to their Data Warehouse.
Have any questions for Slack?
We managed to find that platform team’s manager - Ryan - on Twitter, and he agreed to an AMA 👇 (thanks Ryan!)
Stan: This is how Kafka seems to naturally proliferate inside companies. It starts with one thing, and then teams just continue to adopt it as the network effect of expertise & experience grows inside the company.
I’m curious to hear how adoption is going in your company. Please reply to this e-mail if you feel like sharing! We will make sure to highlight it in the next issue 🙂
Liked this edition?
Help support our growth so that we can continue to deliver valuable content!
Apache®, Apache Kafka®, Kafka, and the Kafka logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.


