the requirements of: Uber's Exabyte-Scale Data Infrastructure (The Uber Series Part I)
how 9 open source technologies solve for massive scale that continues to grow exponentially amidst complex, competing requirements
Nobody does data like Uber.
With more people than Mexico’s population 🇲🇽 using Uber each month (137M) - you know they have data:
1 EXABYTE of data across tens of thousands of servers in EACH of their two regions (that’s 2,000,000 terabytes in total)
500,000+ Hive Tables
400,000+ Spark apps run each day, processing HUNDREDS of PETABYTES
Each day, they stream 12 TRILLION messages worth 7.7 petabytes of data. 🔥
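A quick back-of-envelope on that last number (my math, not Uber’s):
12 trillion messages / 86,400 seconds ≈ 139 million messages per second on average
7.7 petabytes / 86,400 seconds ≈ 89 GB/s of sustained throughput
7.7 PB / 12 trillion messages ≈ ~640 bytes per message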
Let’s go through their stack:
The Stack
Uber’s Data Infra Stack
GCS ☁️
Apache Hadoop - HDFS 🐘
Apache Spark ⚡️
Apache Kafka 🏎️
Apache Pinot 🍷
Apache Flink 🐿️
Apache Parquet 🪵
Apache Hudi 🔼
Apache Hive 🐝
Presto 🪄
The common denominator?
Apache. 🪶
It’s all open source. (well… 99%)
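To make the stack a bit more concrete, here’s a minimal sketch of how a few of these pieces typically chain together: a Spark Structured Streaming job reading a Kafka topic and landing Parquet files on HDFS. This is my own illustration, not Uber’s code; the topic name and paths are made up.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class TripEventsToParquet {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder()
        .appName("trip-events-to-parquet") // hypothetical job name
        .getOrCreate();

    // Read a stream of events from Kafka (topic name is made up)
    Dataset<Row> events = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "kafka:9092")
        .option("subscribe", "trip-events")
        .load();

    // Kafka rows expose key/value as binary; keep the payload as a string plus the event timestamp
    Dataset<Row> payload = events.selectExpr("CAST(value AS STRING) AS json", "timestamp");

    // Continuously land the data as Parquet files on HDFS
    StreamingQuery query = payload.writeStream()
        .format("parquet")
        .option("path", "hdfs:///data/raw/trip-events")
        .option("checkpointLocation", "hdfs:///checkpoints/trip-events")
        .start();

    query.awaitTermination();
  }
}
```

In a real lakehouse pipeline the files would typically be written through Hudi and registered in Hive, so Presto and Spark can query them later; the overall shape stays the same.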
The reasons they cited:
The Requirements
They generally relate to these aspects:
consistency / exactly-once - mission-critical financial dashboards require consistent data across regions. Can’t skew data when it comes to 💵! (see the code sketch below)
availability - 99.99%. Loss of availability == significant financial losses (e.g. being unable to price rides)
data freshness - most use cases require seconds-level freshness, i.e. data must be available for processing seconds after being produced (critical for demand-supply skews, etc.)
query latency - some cases require p99 latency < 1 second.
e.g. UberEats’ Restaurant Manager executes a few analytical queries on each page load
scalability - the ability to scale with the ever-growing data set (petabytes collected a day) without re-architecting is a fundamental requirement
cost - Uber is a low-margin business, so they need to keep costs low.
This influences a lot of decisions - how much data to keep in memory, tiered storage, pre-materialization vs. runtime computation, etc.
flexibility - they need to provide both a SQL and a programmatic interface.
You can’t guarantee all of these for one use case. (CAP theorem)
So it’s a question of how you provide the right mixture for each particular use case.
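To show what the exactly-once requirement from the list above looks like in code, here’s a minimal sketch of Kafka’s transactional producer. This is a standard Kafka client pattern, not Uber’s code; the broker address, topic names and transactional id are made up.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.serialization.StringSerializer;

public class ExactlyOnceProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    // Idempotence lets the broker de-duplicate producer retries;
    // a transactional.id enables atomic writes across topics/partitions
    props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
    props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "payments-producer-1"); // made-up id
    props.put(ProducerConfig.ACKS_CONFIG, "all");

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      producer.initTransactions();
      try {
        producer.beginTransaction();
        // Both records become visible atomically, or not at all
        producer.send(new ProducerRecord<>("payments", "trip-123", "{\"amountUsd\": 14.20}"));
        producer.send(new ProducerRecord<>("payments-audit", "trip-123", "charged"));
        producer.commitTransaction();
      } catch (KafkaException e) {
        // Non-fatal errors can be aborted and retried; fatal ones (e.g. a fenced producer) require closing
        producer.abortTransaction();
        throw e;
      }
    }
  }
}
```

Downstream consumers read with isolation.level=read_committed, so they only ever see committed transactions.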
More Challenges
As if that wasn’t hard enough! 😮💨
Like the axes of a 3D cube, their requirements grow in three different directions at once:
the 3 dimensions
Scaling Data - total incoming data volume is growing at an exponential rate
The replication factor & copies across several geo regions multiply that data further.
Can’t afford to regress on data freshness, e2e latency & availability while growing.
Scaling Use Cases - new use cases arise from various verticals & groups, each with competing requirements.
Scaling Users - the diverse users fall on a big spectrum of technical skills. (some none, some a lot)
The Use Cases
just some of the use cases
compute Surge Pricing - dynamic trip prices based on real-time supply & demand (see the sketch after this list)
needs data freshness, availability
fine-tune your ETA prediction ML model with real-world outcomes
needs scale: absurd volume & cardinality - highest QPS model in their stack
power internal dashboards for ad-hoc exploration (e.g. Growth Marketing teams figuring out which incentives grow the business)
needs flexibility
generate reports for your Ops team to share with local authorities
needs completeness
push notifications / track ad impressions
needs exactly once, scale, low latency
UberEats:
visualize restaurant’s numbers in a dashboard
needs low latency, data completeness
generate daily restaurant statements
needs completeness
show popular orders near me
needs low latency
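To ground the surge-pricing use case above, here’s a minimal Flink sketch that counts ride requests per area over 10-second windows from a Kafka topic. This is my own illustration, not Uber’s pipeline; the topic, consumer group, and the simplification that each message value is just an area id are all assumptions. A real system would also ingest driver supply and feed the ratio into a pricing model.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class SurgeDemandCounter {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Ride-request events from Kafka (topic and consumer group are made up)
    KafkaSource<String> source = KafkaSource.<String>builder()
        .setBootstrapServers("kafka:9092")
        .setTopics("ride-requests")
        .setGroupId("surge-demand")
        .setStartingOffsets(OffsetsInitializer.latest())
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();

    DataStream<String> requests =
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "ride-requests");

    // Simplification: treat each message value as the area/zone id of the request
    DataStream<Tuple2<String, Long>> demandPerArea = requests
        .map(areaId -> Tuple2.of(areaId, 1L))
        .returns(Types.TUPLE(Types.STRING, Types.LONG))
        .keyBy(t -> t.f0)
        .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
        .sum(1);

    // A real pipeline would join this with driver supply and push the ratio to a pricing service
    demandPerArea.print();

    env.execute("surge-demand-counter");
  }
}
```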
The Architecture
That’s all for two minutes!
the missing pieces
Wanna learn more about the details?
Tune in next week.
Liked this edition?
Help support our growth so that we can continue to deliver valuable content!
And if you really enjoy the newsletter in general - please forward it to a fellow engineer who would be interested in this topic. It only takes 10 seconds. Writing it takes me 10+ hours.
What more did we post on socials?
Let’s see:
After 6 years at Confluent, starting with:
• just ~300 people
• a hacky Kafka SaaS product
• a share price of $1
and growing into:
• $91 share price, then back to $30
• 3000+ people
• the world’s best Kafka cloud service
I bid my farewell 👋
A quick story... 🧵
— Stanislav Kozlovski (@BdKozlovski)
3:02 PM • Jun 27, 2024
The Blue July Crisis.
When one bad production deployment took down MILLIONS of machines in EVERY country across the globe...
Today we experienced the largest global software-inflicted outage.
A world-wide deployment of the CrowdStrike Falcon agent caused massive widespread… x.com/i/web/status/1…
— Stanislav Kozlovski (@BdKozlovski)
1:19 PM • Jul 19, 2024
After 6 years of working on the world’s largest Kafka SaaS (1000s of customers) and participating in 100s of incidents, I share my top 9 DEAD-SIMPLE tips to manage large Kafka deployments.
Guaranteed to make you want to buy a Kafka SaaS 🥲 (not joking)
Why?
Because of how hard… x.com/i/web/status/1…
— Stanislav Kozlovski (@BdKozlovski)
4:10 PM • Jul 18, 2024
Most engineers don’t understand the largest performance improvements often come the easiest.
Here’s yet another Kafka config tweak to increase your performance by 50% 🔥
a 1-minute thread 🧵
— Stanislav Kozlovski (@BdKozlovski)
2:54 PM • Jul 25, 2024
Databricks spent $1-2 BILLION dollars to acquire a ~30 person company from the creators of Apache Iceberg.
A revolution is going on in the Big Data space and it’s centering around Iceberg. 🧊
Why would Databricks spend this outrageous amount of money on such a small company?… x.com/i/web/status/1…
— Stanislav Kozlovski (@BdKozlovski)
3:07 PM • Jun 21, 2024
Slack uses Apache Kafka at scale:
• 6.5 Gbps
• 700TB of data
• 100s of nodes
Here's their fun story 👇
— Stanislav Kozlovski (@BdKozlovski)
3:07 PM • Jun 29, 2024
Stop wasting hours trying to understand Kafka's difficult zero copy optimization.
Just read this short & simple summary 👇
(there is a surprise at the end)
If you’ve ever read about Kafka, a particular optimization it makes use of might have caught your eye — the operating… x.com/i/web/status/1…
— Stanislav Kozlovski (@BdKozlovski)
3:12 PM • Jul 3, 2024
Everyone is using Kafka.
But almost no one is using its new Infinite Storage feature. ✨
KIP-405 is introducing the ability to store Kafka data in S3.
And any other external store for that matter.
It’s incredibly needed, because storage is Kafka’s biggest flaw right now.… x.com/i/web/status/1…
— Stanislav Kozlovski (@BdKozlovski)
5:40 PM • Jul 23, 2024