
Nobody does data like Uber.
With more people than Mexico’s population 🇲🇽 using Uber each month (137M) - you know they have data:
1 EXABYTE of data across tens of thousands of servers in EACH of their two regions (that’s 2,000,000 terabytes in total)
500,000+ Hive Tables
400,000+ Spark apps run each day, processing HUNDREDS of PETABYTES
Each day, they stream 12 TRILLION messages worth 7.7 petabytes of data. 🔥
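A quick back-of-the-envelope check on those streaming numbers. This is only a sketch: it assumes decimal units (1 PB = 10^15 bytes) and traffic spread perfectly evenly over the day, which real traffic certainly isn't (peaks will be higher).

```python
# Back-of-the-envelope math on the daily streaming figures quoted above.
# Assumptions (not from Uber): decimal units and perfectly even traffic.
MESSAGES_PER_DAY = 12 * 10**12   # 12 trillion messages
BYTES_PER_DAY = 7.7 * 10**15     # 7.7 petabytes
SECONDS_PER_DAY = 24 * 60 * 60

avg_message_bytes = BYTES_PER_DAY / MESSAGES_PER_DAY
messages_per_second = MESSAGES_PER_DAY / SECONDS_PER_DAY
gigabytes_per_second = BYTES_PER_DAY / SECONDS_PER_DAY / 10**9

print(f"average message size: {avg_message_bytes:.0f} bytes")
print(f"average rate: {messages_per_second / 10**6:.0f}M messages/s")
print(f"average throughput: {gigabytes_per_second:.0f} GB/s")
```

Under those assumptions, that is roughly 640 bytes per message, about 139M messages per second and about 89 GB/s, every second of the day.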
Let’s go through their stack:
The Stack

Uber’s Data Infra Stack
GCS ☁️
Apache Hadoop - HDFS 🐘
Apache Spark ⚡️
Apache Kafka 🏎️
Apache Pinot 🍷
Apache Flink 🐿️
Apache Parquet 🪵
Apache Hudi 🔼
Apache Hive 🐝
Presto 🪄
The common denominator?

Apache. 🪶
It’s all open source. (well… 99%)
The reasons they cited:

The Requirements
They generally relate to these aspects:
consistency / exactly once - mission-critical financial dashboards require consistent data across regions. Can’t skew data when it comes to 💵 !
availability - 99.99%. Loss of availability == significant financial losses (e.g. unable to price rides)
data freshness - most use cases require seconds-level freshness, i.e. data must be available for processing seconds after being produced (critical for demand-supply skews, etc.)
query latency - some cases require p99 latency < 1 second.
e.g. UberEats’ Restaurant Manager executes a few analytical queries on each page load
scalability - the ability to scale with the ever-growing data set (petabytes collected a day) without re-architecting is a fundamental requirement
cost - Uber is a low-margin business, so they need to keep costs low.
This influences a lot of things - like data to keep in memory, tiered storage, pre-materialization vs. runtime computation, etc.
flexibility - need to provide both a SQL and programmatic interface.
You can’t guarantee all of these for a single use case. (Same spirit as the CAP theorem: some guarantees have to be traded off against others.)
So it’s a question of how you provide the right mixture for each particular use case.
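To make two of those requirements concrete, here is a minimal sketch of what a 99.99% availability target and a "p99 < 1 second" target mean in numbers. The helper names and the sample latencies are made up for illustration; nothing here is Uber's actual tooling.

```python
import math

def downtime_budget_seconds(availability: float, period_seconds: float) -> float:
    """Maximum downtime allowed in a period for a given availability target."""
    return (1.0 - availability) * period_seconds

def p99(latencies_ms: list[float]) -> float:
    """99th percentile using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = math.ceil(len(ordered) * 99 / 100)  # 1-based nearest rank
    return ordered[rank - 1]

DAY = 24 * 60 * 60
YEAR = 365 * DAY

per_day = downtime_budget_seconds(0.9999, DAY)
per_year_minutes = downtime_budget_seconds(0.9999, YEAR) / 60
print(f"99.99% allows {per_day:.2f} s of downtime per day")
print(f"99.99% allows {per_year_minutes:.2f} min of downtime per year")

# Hypothetical query latencies: 98 fast queries and 2 slow ones.
samples = [120.0] * 98 + [1500.0] * 2
meets_target = p99(samples) < 1000.0
print(f"p99 = {p99(samples):.0f} ms, meets the < 1 s target: {meets_target}")
```

99.99% leaves under 9 seconds of downtime per day, and just two slow queries out of a hundred are enough to blow a p99 target even when everything else is fast.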
More Challenges
As if that wasn’t hard enough! 😮‍💨
Similar to a 3D cube, they have requirements growing along 3 different dimensions:

the 3 dimensions
Scaling Data - total incoming data volume is growing at an exponential rate
The replication factor & multiple geo regions multiply the stored data even further.
Can’t afford to regress on data freshness, e2e latency & availability while growing.
Scaling Use Cases - new use cases arise from various verticals & groups, each with competing requirements.
Scaling Users - the users span a wide spectrum of technical skill. (some have none, some have a lot)
The Use Cases

just some of the use cases
compute Surge Pricing - dynamic trip prices based on real-time supply & demand
needs data freshness, availability
fine tune your ETA prediction ML model with real-world outcomes
needs scale: absurd volume & cardinality - highest QPS model in their stack
power internal dashboards for ad-hoc exploration (e.g. Growth Marketing teams figuring out which incentives grow the business)
needs flexibility
generate reports for your Ops team to share with local authorities
needs completeness
push notifications / track ad impressions
needs exactly once, scale, low latency
UberEats:
visualize restaurant’s numbers in a dashboard
needs low latency, data completeness
generate daily restaurant statements
needs completeness
show popular orders near me
needs low latency
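One way to see how the requirements compete is to put the list above into a small lookup table and slice it by requirement. This is just a sketch: the dictionary is transcribed from the list, and the shortened names and the helper function are my own, not anything from Uber.

```python
# Use cases and their key requirements, transcribed from the list above.
# The shortened names are illustrative labels, not Uber's terminology.
USE_CASE_NEEDS = {
    "surge pricing": {"freshness", "availability"},
    "ETA prediction model tuning": {"scale"},
    "internal ad-hoc dashboards": {"flexibility"},
    "ops reports for local authorities": {"completeness"},
    "push notifications / ad impressions": {"exactly once", "scale", "low latency"},
    "UberEats restaurant dashboard": {"low latency", "completeness"},
    "UberEats daily restaurant statements": {"completeness"},
    "UberEats popular orders near me": {"low latency"},
}

def use_cases_needing(requirement: str) -> list[str]:
    """Return the use cases (in listed order) that name the given requirement."""
    return [name for name, needs in USE_CASE_NEEDS.items() if requirement in needs]

for requirement in ("low latency", "completeness", "scale"):
    print(requirement, "->", use_cases_needing(requirement))
```

No single requirement covers every use case, which is why each one ends up with its own mixture of guarantees.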
The Architecture
That’s all for two minutes!

the missing pieces
Wanna learn more about the details?
Tune in next week.

