The Numbers Behind Uber's Data Infrastructure (The Uber Series Part II)
What it takes to process hundreds of gigabytes per second under competing requirements
Last edition we mentioned:
the 9 open source Apache products Uber uses
the exabyte-scale they operate at
the complex requirements their data infra demands
With:
138M messages/s (89 GB/s) processed in Kafka
170K+ peak QPS, with 1M+ events/s and 6K+ tables over 800 nodes in Pinot
4,000 jobs processing 6.5 petabytes a day in Flink
500K Presto queries/day, reading over 90 PB daily
(7K weekly active users, 3 regions, 12K nodes, 20 clusters)
400K Spark apps/day, 10K+ of them processing hundreds of petabytes every day; Spark accounts for more than 95% of analytics compute resources
2M Hive queries/day over 500K+ tables
an HDFS deployment serving 10B calls a day at 150K peak RPS, storing exabytes of data
This is a master-class example of what a world-renowned data infrastructure stack can look like. A must-read for any data engineer.
Lambda Architecture
Their architecture maintains two separate pipelines for different use cases:
real-time processing: Flink processes the data, and Pinot stores it to serve it in real time
batch processing: data is ingested through Spark into an HDFS data lake, using Hudi as the ingestion table format and Parquet as the file format.
Hive-on-Spark jobs then transform the data to build ML models
SQL for everything: Presto gives SQL access to all of these sources, and is also used to deploy jobs.
All the data comes from Kafka.
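The two pipelines above can be sketched in miniature. Below is a toy, in-memory illustration of the Lambda pattern: the same event stream feeds a periodically recomputed batch view and an incrementally updated speed layer, and the serving layer merges the two at query time. All names (`Trip`, `build_batch_view`, `SpeedLayer`) are illustrative, not Uber's actual code.

```python
# Toy Lambda architecture: batch view + speed layer, merged at query time.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Trip:
    city: str
    fare: float

def build_batch_view(history: list) -> Counter:
    """Full recomputation over historical data (the batch pipeline)."""
    view = Counter()
    for t in history:
        view[t.city] += 1
    return view

class SpeedLayer:
    """Incremental, per-event updates (the real-time pipeline)."""
    def __init__(self):
        self.view = Counter()

    def on_event(self, t: Trip):
        self.view[t.city] += 1

def query(city: str, batch: Counter, speed: SpeedLayer) -> int:
    # Serving layer: merge the (stale) batch view with recent events.
    return batch[city] + speed.view[city]

history = [Trip("SF", 12.0), Trip("NYC", 9.5), Trip("SF", 20.0)]
batch = build_batch_view(history)

speed = SpeedLayer()
speed.on_event(Trip("SF", 15.0))  # arrived after the last batch run

print(query("SF", batch, speed))  # 2 from batch + 1 from speed = 3
```

The real systems (Spark/Hudi on one side, Flink/Pinot on the other) replace these dicts with distributed storage and stream processors, but the merge-at-read idea is the same.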
Kafka
It all starts with Kafka. Data is ingested into Kafka and dispersed to the appropriate places.
Batch - Data Lake
Storage - HDFS & GCS
Their data lake is largely based on HDFS, though it is gradually migrating to Google Cloud Storage.
HDFS is so useful that other platforms (e.g. Flink, Pinot) use it to manage their own storage.
Formats
The data in the lake is in the Apache Hive and Hudi table formats.
The main file format is Apache Parquet. Avro is used for data in motion.
Apache Hudi was originally created at Uber to solve incremental data processing - the traditional formats don’t support it efficiently, requiring you to rewrite the whole table (terabytes or more) rather than updating only the changed records.
Hudi decreased their batch runtime by 82%! 🔥
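The semantics Hudi adds can be illustrated with plain dicts: upserts are applied by record key, so a handful of changed records merges into the table instead of forcing a full rewrite. This is a sketch of the idea only, not Hudi's API; the `trip_id` key and record shapes are made up.

```python
# Illustrative upsert-by-record-key semantics (what Hudi provides on a
# data lake), sketched with in-memory dicts.
def upsert(table: dict, changes: list, key: str = "trip_id") -> dict:
    """Merge incoming records into the table by primary key."""
    merged = dict(table)
    for record in changes:
        merged[record[key]] = record  # insert new or overwrite existing
    return merged

table = {
    "t1": {"trip_id": "t1", "status": "ongoing"},
    "t2": {"trip_id": "t2", "status": "completed"},
}
changes = [
    {"trip_id": "t1", "status": "completed"},  # update an existing row
    {"trip_id": "t3", "status": "ongoing"},    # insert a new row
]
table = upsert(table, changes)
print(table["t1"]["status"], len(table))  # completed 3
```

Without this, applying those two changes to an append-only Parquet table would mean rewriting every file that might contain `t1` - at Uber's scale, terabytes of I/O for a two-row change.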
Hive & Spark
Spark accounts for more than 95% of Uber’s analytics compute resources…
With such scale, what isn’t Spark used for?
They also run Hive as a query engine on top of the Spark execution framework.
Real Time
Flink
Flink is used both for customer-facing products (e.g. city-specific market-condition calculations, surge pricing) and for internal analytics (e.g. global financial estimations).
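The kind of computation Flink does for these use cases is typically a windowed streaming aggregation - for example, counting ride requests per city in fixed 60-second windows as a crude surge-pricing signal. Here is a pure-Python sketch of a tumbling-window count, not Flink's API; the event shapes and window size are assumptions for illustration.

```python
# Tumbling-window aggregation sketch: count events per (window, city).
from collections import defaultdict

WINDOW = 60  # window size in seconds

def tumbling_counts(events):
    """events: (timestamp_s, city) pairs -> {(window_start, city): count}"""
    counts = defaultdict(int)
    for ts, city in events:
        window_start = (ts // WINDOW) * WINDOW  # bucket by window
        counts[(window_start, city)] += 1
    return dict(counts)

events = [(3, "SF"), (45, "SF"), (59, "NYC"), (61, "SF"), (130, "SF")]
print(tumbling_counts(events))
# {(0, 'SF'): 2, (0, 'NYC'): 1, (60, 'SF'): 1, (120, 'SF'): 1}
```

Flink adds what this sketch omits: distributed state, event-time semantics, watermarks for late data, and fault tolerance via checkpointing.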
Pinot
Pinot is used for everything OLAP, getting data from both the real-time system and the batch system.
Especially user-facing OLAP: low-latency, high-QPS use cases.
A lot of pre-aggregated tables are used here to make querying faster.
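The reason pre-aggregation helps: a rollup keyed on the query dimensions turns a scan over raw events into a single lookup. A toy sketch of the idea below - the column names and the `(city, day)` rollup key are illustrative, not Uber's schema.

```python
# Ingestion-time rollup vs. query-time scan: a pre-aggregated table
# keyed on (city, day) answers the query with one lookup.
from collections import defaultdict

raw_events = [
    {"city": "SF",  "day": "2024-08-05", "fare": 12.0},
    {"city": "SF",  "day": "2024-08-05", "fare": 20.0},
    {"city": "NYC", "day": "2024-08-05", "fare": 9.5},
]

# Build the rollup once at ingestion time: (city, day) -> [count, fare_sum]
rollup = defaultdict(lambda: [0, 0.0])
for e in raw_events:
    agg = rollup[(e["city"], e["day"])]
    agg[0] += 1
    agg[1] += e["fare"]

# Query time: O(1) lookup instead of scanning every raw event.
count, fare_sum = rollup[("SF", "2024-08-05")]
print(count, fare_sum)  # 2 32.0
```

At millions of events per second, the difference between "scan everything" and "look up one row" is exactly what makes low-latency, high-QPS serving feasible.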
Presto
Presto allows them to construct ad-hoc queries that join tables from different systems together.
As we will cover in a later newsletter, Uber integrated Presto very tightly with Pinot (which didn’t support SQL before).
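What "joining tables from different systems" means in practice: one query reads from a batch source (Hive) and a real-time source (Pinot) and combines them. Sketched below as a plain hash join over two in-memory "connectors" - the table names, columns, and the SQL in the comment are made-up examples, not Uber's schema.

```python
# Federated join sketch: combine rows from a batch source and a
# real-time source, the way a single Presto query would.
hive_drivers = [  # batch source (data lake dimension table)
    {"driver_id": 1, "home_city": "SF"},
    {"driver_id": 2, "home_city": "NYC"},
]
pinot_trips_today = [  # real-time source (fresh facts)
    {"driver_id": 1, "trips": 7},
    {"driver_id": 2, "trips": 3},
    {"driver_id": 1, "trips": 2},
]

# Equivalent in spirit to:
#   SELECT h.home_city, SUM(p.trips)
#   FROM pinot.trips_today p JOIN hive.drivers h USING (driver_id)
#   GROUP BY h.home_city
city_of = {d["driver_id"]: d["home_city"] for d in hive_drivers}
trips_by_city = {}
for row in pinot_trips_today:
    city = city_of[row["driver_id"]]
    trips_by_city[city] = trips_by_city.get(city, 0) + row["trips"]

print(trips_by_city)  # {'SF': 9, 'NYC': 3}
```

Presto does this across connectors with a single SQL statement, so analysts never have to export data from one system to query it alongside another.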
Next edition, we will cover how SQL dictates everything Uber does.
Liked this edition?
Help support our growth so that we can continue to deliver valuable content!
Where else are you going to find content like this?
It took me 40+ hours to compile this information: browsing through way too many blog posts, watching too many talks, and directly reaching out to numerous stakeholders at Uber.
It takes 2 seconds to boost this post, whereas this took me 2 weeks to gather!
Why don’t you? 😇
NOTE: This was not an exhaustive list of all the infrastructure in Uber.
A large company like Uber uses a ton of technologies (teams are free to choose after all), but they tend to converge and standardize on a few. These are the main technologies used.
Nevertheless, as a counter-example, there are cases where Uber uses a Kappa architecture as well.
What more did we post on other platforms?
Let’s see:
I wish I had this cheatsheet for Apache Flink earlier...
— Stanislav Kozlovski (@BdKozlovski)
5:28 PM • Aug 5, 2024
Everybody is talking about Apache Kafka.
Wonder what it is?
This single cheatsheet has all you need 👇
— Stanislav Kozlovski (@BdKozlovski)
2:30 PM • Aug 10, 2024
Amazon S3 processes more than 100,000,000 requests a second.
And it offers 11 nines of durability - 99.999999999%.
It’s not easy. Any theoretical failure that could happen - they’ve already hit.
What does that durability mean?
Durability is usually thought of as something… x.com/i/web/status/1…
— Stanislav Kozlovski (@BdKozlovski)
4:21 PM • Aug 8, 2024
A learning roadmap for Distributed Systems
What would you add here?
— Stanislav Kozlovski (@BdKozlovski)
7:23 PM • Aug 9, 2024
More Content?
Make sure to follow me on all mediums to not miss anything:
Big Data Stream - my long-form newsletter on Substack
BdKozlovski - my X profile
/in/stanislavkozlovski - my LinkedIn profile