Apache Kafka Lines of Code Analysis (Java, Scala, Python)
An analysis of Apache Kafka's codebase from versions 0.7.2 to 3.7.0
Talk is Cheap. Show Me The Code
A distributed streaming platform like Kafka is the infrastructure backbone of many, many companies today.
But how much code does it actually take to create such a platform?
Around 1.2 million lines.
That is… a lot of code when you think about it.
Apache Kafka 3.7 lines of code by module
If we have to give them a rough grouping:
Backend Server (420k)
core - 242k
metadata - 57k
group-coordinator - 50k
storage - 28k
raft - 27k
server-common - 16k
Clients (329k)
clients - 287k
tools - 24k
trogdor - 18k
Kafka Streams
streams - 329k
Connect
connect - 134k
We see it starts to make more sense, as the Kafka repository is well split between a few different projects.
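As a quick sanity check on the grouping above, here is a minimal sketch that adds up the per-module numbers and works out each group's share. The figures are copied from the list; the function names are just illustrative.

```python
# Module sizes in thousands of lines, copied from the grouping above.
MODULES_KLOC = {
    "Backend Server": {
        "core": 242, "metadata": 57, "group-coordinator": 50,
        "storage": 28, "raft": 27, "server-common": 16,
    },
    "Clients": {"clients": 287, "tools": 24, "trogdor": 18},
    "Kafka Streams": {"streams": 329},
    "Connect": {"connect": 134},
}


def group_totals(modules):
    """Return the total size of each group, in thousands of lines."""
    return {group: sum(sizes.values()) for group, sizes in modules.items()}


def group_shares(modules):
    """Return each group's share of the listed code, as a percentage."""
    totals = group_totals(modules)
    grand_total = sum(totals.values())
    return {group: 100 * total / grand_total for group, total in totals.items()}


if __name__ == "__main__":
    totals = group_totals(MODULES_KLOC)
    shares = group_shares(MODULES_KLOC)
    for group, total in totals.items():
        print(f"{group:15} {total:5}k  {shares[group]:5.1f}%")
    print(f"{'Listed total':15} {sum(totals.values()):5}k")
```

The listed modules add up to 1,212k, slightly under the overall figure. Presumably the remainder sits in smaller modules that did not make the list.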
Started From the Bottom, Now We're Here
Can you guess how many lines of code Kafka started with?
Picture a number in your mind.
I will now give you a hint: the repository grew at an average rate of 24% per release. There were 24 releases. (cue mental algebra)
…
Kafka started with 24,400 lines of code!
That's literally as much as the tools module today!
But then developers started cracking…
Release over release code growth rate (in percentage)
The very first release tripled the codebase's size, and each of the two subsequent releases roughly doubled it.
With such a growth rate, you didn't need many releases to grow the size substantially.
2012 - 24k
2015 - 138k
2017 - 400k
2020 - 726k
2022 - 994k
(start of) 2024 - 1,262k
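If you want to check the mental algebra, here is a minimal sketch using only numbers from this post: the 24,400-line starting point, the 1,262k figure for the start of 2024, and 24 releases. It assumes the 24% hint is a plain average of the per-release percentages. An average like that sits above the constant compound rate, because a few huge early jumps pull it up.

```python
START_LINES = 24_400     # starting size, from the text
END_LINES = 1_262_000    # start of 2024, from the timeline above
RELEASES = 24            # number of releases, from the text


def compound_rate(start, end, periods):
    """Constant per-period growth rate that turns start into end."""
    return (end / start) ** (1 / periods) - 1


def project(start, rate, periods):
    """Size after compounding a constant rate for the given number of periods."""
    return start * (1 + rate) ** periods


if __name__ == "__main__":
    rate = compound_rate(START_LINES, END_LINES, RELEASES)
    print(f"Implied compound growth per release: {rate:.1%}")
    flat = project(START_LINES, 0.24, RELEASES)
    print(f"Compounding a flat 24% per release: {flat:,.0f} lines")
```

The implied compound rate comes out at roughly 18% per release. Compounding a flat 24% for 24 releases would land well above today's total.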
One thing is clear - development has NOT slowed at all.
If anything, Kafka is having more code contributed to it than ever.
Talk about a healthy community.
Top Contributors
What's a newsletter without some shout outs?
While many people have written a lot of code, the top contributors in Apache Kafka have consistently contributed for the last ~7 years. This is an amazing feat.
Ismael Juma
(many more, but this is all that fit on my screen)
Thank you to all the open source contributors!
Languages
Kafka is mainly Java. But there was a fair amount of Scala code back in the day, and large parts of the server (core) are still in Scala.
The project is slowly migrating away from Scala by simply writing most new code in Java. For example - all the new major Kafka features are written in Java:
KRaft
The Java codebase growing at hyperspeed
Bonus: Largest Files
As a final present - here are the top 5 Java/Scala files by size. Challenge yourself to go read some of them.
Less-Interesting Notes
this data was counted via cloc
we only count the three main languages in Kafka - Java, Scala and Python
we count blank lines, comments and code lines
the data includes test code, which is in all likelihood a large majority of the codebase (that's a good thing!)
More than half of Kafka's code is test code!
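If you don't have cloc at hand, the counting rules above can be roughly approximated in a few lines of Python. This is only a sketch, not how cloc works internally. Because blank lines, comments and code all count, it simply counts every line in files whose extension maps to one of the three languages. The extension mapping and the function name are assumptions for illustration, and the totals will not match cloc exactly (cloc, for example, skips duplicate files).

```python
from collections import Counter
from pathlib import Path

# Extension-to-language mapping for the three languages counted in this post.
LANGUAGES = {".java": "Java", ".scala": "Scala", ".py": "Python"}


def count_lines(root):
    """Count every line (blank, comment and code) per language under root."""
    totals = Counter()
    for path in Path(root).rglob("*"):
        language = LANGUAGES.get(path.suffix)
        if language is None or not path.is_file():
            continue
        # Read as bytes so files with unusual encodings do not raise errors.
        with path.open("rb") as handle:
            totals[language] += sum(1 for _ in handle)
    return totals
```

Calling `count_lines(".")` from the root of a checkout returns a `Counter` keyed by language name.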
We were quiet on the socials this time.
Back to basics, this is a great share with any newbie you know:
Apache Kafka 101 in 1 minute.
Let's go!
It's a distributed commit log.
A log is the simplest data structure - an ordered sequence of records that only supports appends.
It's immutable, so you can't delete or edit the records in place.
Kafka stores its data in topics.… twitter.com/i/web/status/1… - Stanislav Kozlovski (@BdKozlovski)
3:23 PM • Jan 13, 2024
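To make the log idea from the tweet concrete, here is a toy in-memory sketch: an ordered sequence of records that only supports appends and reads. It is not Kafka's actual implementation. The class name, the method names and the use of a list position as an "offset" are all assumptions for illustration.

```python
class Log:
    """A toy in-memory append-only log (illustrative only, not Kafka's code)."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Add a record at the end and return its offset (its position)."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset):
        """Return the record stored at the given offset."""
        if not 0 <= offset < len(self._records):
            raise IndexError(f"offset {offset} out of range")
        return self._records[offset]

    def __len__(self):
        return len(self._records)


if __name__ == "__main__":
    log = Log()
    first = log.append("user signed up")
    second = log.append("user logged in")
    print(first, second, log.read(first))
```

There are deliberately no methods for editing or deleting a record in place: the only way to change the log is to append to it.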
Apache®, Apache Kafka®, Kafka, and the Kafka logo are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.