Design Case Studies · Lesson 14
Design: Ride-hailing API
A ride-hailing system looks simple from the passenger seat — tap a button, a car appears. Underneath, it coordinates a real-time geo index updated by a million moving drivers, a state machine that enforces a strict trip lifecycle, and a bidirectional stream that keeps the rider's map smooth at one update per second. This lesson traces each piece from requirements to latency budget.
By the end you'll be able to
- Name every state in the trip lifecycle and explain why invalid transitions are rejected server-side.
- Explain how a geohash spatial index answers "find 5 nearby drivers" in O(log N) — and why you must query the 8 neighboring cells, not just the pickup cell.
- Choose between WebSocket and batched HTTP for driver location updates, and justify the tradeoff in each direction.
- Apply idempotency keys to ride requests so a double-tap never creates two active trips.
- Calculate a latency budget for a ride match and verify it fits within the 5-second target.
1 — Requirements
Functional requirements
- Request a ride: rider specifies pickup and dropoff coordinates. The system finds an available nearby driver and assigns them. A fare estimate (returned before booking) is locked in at request time.
- Real-time driver location: while a trip is active, the rider sees the driver moving on a map. Updates arrive at approximately one per second.
- Trip lifecycle: a ride moves through defined states —
REQUESTING→MATCHING→DRIVER_EN_ROUTE→PICKUP→IN_TRIP→COMPLETED— with cancellation possible from several of these states. - Fare estimate before booking: the rider sees an estimated fare before committing. The system holds the estimate so the final price matches.
- Trip receipts on completion: once a trip reaches
COMPLETED, a receipt is available showing distance, time, and fare breakdown. - Cancellation: rider or driver can cancel before pickup. Defined penalty rules apply depending on how far the driver has traveled.
Non-functional requirements
- Matching latency: a driver must be assigned within 5 seconds at p95. This is the customer experience threshold — longer feels like the app is broken.
- Location write throughput: at 1 million active trips with one driver update per second each, the location store must handle 1 million writes per second at peak. This is not a database — it is a write-optimized geo index.
- Geo precision: the spatial index must support sub-second lookup for the nearest N drivers across millions of active positions. A B-tree index on latitude and longitude columns cannot do this efficiently.
- Idempotent ride request: a rider double-tapping "Request" — or a client retrying after a network timeout — must not create two active rides. This mirrors the idempotency pattern from payments: the client sends an idempotency key, the server deduplicates.
- Fault tolerance during matching: if no driver is found within the initial search radius, the system retries with a broader radius. A confirmed ride with no driver assigned is a product failure — never leave the rider waiting with no recourse.
2 — Design decisions
Location updates: WebSocket vs. batched HTTP
Driver location arrives at 1 Hz from millions of active devices. Two implementation options:
| Option | Mechanism | Tradeoff |
|---|---|---|
| WebSocket (persistent) | Driver app keeps a WS open. Each GPS fix is sent as a small JSON frame immediately. The same connection receives inbound commands: match notifications, route updates. | Low per-update overhead; full-duplex. Requires stable connection management at scale. Falls back when the OS suspends the app. |
| Batched HTTP | Driver app buffers 5–10 GPS points and POSTs them every 5–10 seconds. | Simpler to implement; survives background OS restrictions. Introduces up to 10 seconds of location lag — unacceptable while a rider is watching the driver approach. |
Decision: WebSocket while the trip is active; HTTP batch as a fallback when the app is backgrounded or the WS drops. The two modes coexist — the server accepts both input shapes and writes to the same geo index regardless of how the update arrived.
Spatial index for driver matching
The naive approach — a database table with lat and lng columns and a B-tree index — cannot efficiently answer "give me 5 drivers within 2 km of this point." A B-tree sorts in one dimension; a two-dimensional proximity query becomes a full scan or an expensive polygon intersection.
The right tool is a geohash index. Geohashing encodes a coordinate as a hierarchical string: nearby positions share a common prefix. The string 9q8yy and 9q8yz are adjacent cells; both start with 9q8y. A range query on the prefix 9q8y returns all points in that neighborhood in a single index scan — O(log N) against an indexed string column, or O(1) against Redis GEO, which implements exactly this under the hood.
GEORADIUS drivers:active -122.4194 37.7749 2 km ASC COUNT 10 — returns the 10 nearest active drivers within 2 km of the given coordinate, sorted by distance. Redis stores positions internally as geohash-encoded sorted set scores.
Trip as a state machine
Think of the trip state machine like air traffic control. A plane cannot jump from "taxiing" to "at gate" without passing through "landed." The controller — the server — enforces every transition and rejects nonsense routes. A request that tries to advance a trip from IN_TRIP back to REQUESTING is invalid by definition; the server rejects it without touching the trip record.
Every transition is validated against a server-side table of allowed moves. State is a single strongly-consistent row; transitions use optimistic locking — the UPDATE includes a WHERE status = 'MATCHING' predicate. If two concurrent requests both try to transition the same trip, exactly one succeeds and the other finds zero rows updated and retries or errors. This prevents ghost trips: a ride that appears DRIVER_EN_ROUTE to two different drivers simultaneously.
Idempotent ride request
The client includes an Idempotency-Key header with POST /v1/rides. The server stores the key alongside the new ride record before returning 201. A retry with the same key within the deduplication window (10 minutes) returns the existing ride object — same ID, same status — without creating a new ride or dispatching a second driver. See rel-02 for the full pattern.
Real-time status push to the rider
The rider needs two things continuously while a trip is active: driver position and trip state changes. Polling GET /v1/rides/:id every second would work but wastes one full HTTP round-trip per update. Instead, the rider app opens a WebSocket to the realtime service after requesting a ride. The server pushes two event types down this connection: LOCATION_UPDATE frames (lat, lng, bearing) at 1 Hz, and STATE_CHANGE frames when the trip transitions state. The rider never needs to poll.
3 — The API model
Request a ride
POST /v1/rides
Authorization: Bearer <rider-token>
Idempotency-Key: idem_a3f8b1c2-0041-4e9d-b7a5-ff3820d40011
Content-Type: application/json
{
"pickup": { "lat": 37.7749, "lng": -122.4194 },
"dropoff": { "lat": 37.7849, "lng": -122.4094 },
"fare_estimate_id": "fare_xyz" // locks the quoted fare
}
HTTP/1.1 201 Created
{
"id": "ride_abc",
"status": "MATCHING",
"estimated_wait_sec": 180
}
// Retry with same Idempotency-Key (double-tap / network timeout):
HTTP/1.1 201 Created
{
"id": "ride_abc", // same ride — no second driver dispatched
"status": "MATCHING"
}
Poll ride status
GET /v1/rides/ride_abc
Authorization: Bearer <rider-token>
HTTP/1.1 200 OK
{
"id": "ride_abc",
"status": "DRIVER_EN_ROUTE",
"driver": {
"id": "drv_7",
"name": "Marcus",
"vehicle": "Prius"
},
"eta_sec": 210
}
Real-time stream
// Rider opens WebSocket after POST /v1/rides returns
WebSocket wss://realtime.example.com/v1/rides/ride_abc/stream
Authorization: Bearer <rider-token>
// Server pushes: driver location at ~1 Hz
{ "type": "LOCATION_UPDATE", "lat": 37.7760, "lng": -122.4180, "bearing": 45 }
{ "type": "LOCATION_UPDATE", "lat": 37.7762, "lng": -122.4176, "bearing": 47 }
// Server pushes: state transition
{ "type": "STATE_CHANGE", "status": "PICKUP", "timestamp": "2025-09-10T18:32:01Z" }
// Server pushes: trip ended
{ "type": "STATE_CHANGE", "status": "COMPLETED", "timestamp": "2025-09-10T18:49:14Z" }
Cancel a ride
POST /v1/rides/ride_abc/cancel
Authorization: Bearer <rider-token>
Content-Type: application/json
{ "reason": "RIDER_CANCELLED" }
HTTP/1.1 200 OK
{
"id": "ride_abc",
"status": "CANCELLED",
"cancellation_fee": 0 // no penalty: driver not yet en route
}
Cancellation does not delete the ride record — it transitions the ride to CANCELLED. The ride remains for audit, receipts, and dispute resolution. DELETE /v1/rides/:id would remove a resource; POST /v1/rides/:id/cancel changes its state. This distinction matters: a DELETE that returns 404 on retry looks like an error; a POST /cancel that finds an already-cancelled ride can return the current state gracefully.
4 — Evaluation & latency budget
Geo matching performance
A geohash lookup for nearest drivers is O(log N) against a sorted index — in practice under 2 ms on a warm Redis GEO instance with millions of entries. When no drivers are available within the initial radius, the system widens by one geohash precision level (dropping the last character of the hash roughly quadruples the search area) and re-queries. This costs one additional index scan, not a full-table scan — the expanding-radius strategy stays cheap.
Real-time location latency
With a WebSocket connection, a GPS update from driver to rider travels: driver device → network (~25 ms) → location service → relay to rider WS (~2 ms) → rider network (~25 ms). Total end-to-end: approximately 52 ms. At 1 Hz update rate, the rider's map interpolates between updates smoothly. The critical metric here is not absolute latency but consistency: a jitter spike that delays one update by 500 ms creates a visible "snap" on the map. The WebSocket keep-alive and backpressure handling must account for this.
State machine consistency
Trip state lives in a single strongly-consistent row. Transitions use optimistic locking: the UPDATE trips SET status='DRIVER_EN_ROUTE' WHERE id=:id AND status='MATCHING' predicate ensures only one concurrent request can advance the state. If two requests race, exactly one updates zero rows and backs off. This is the same approach as payments' ledger consistency — one version of the truth, enforced at the write path.
Throughput: location writes
1 million active trips × 1 update/second = 1 million writes/second to the location store. A single Redis GEO instance handles roughly 100,000 operations per second. The solution is horizontal sharding by geographic region: Americas, EMEA, APAC each run their own location shard. A driver in São Paulo never touches the APAC shard. The matching service queries only the relevant regional shard for a given pickup coordinate.
Latency budget for ride matching
| Step | Latency | Bottleneck |
|---|---|---|
| API validation + idempotency key write | ~5 ms | One DB write |
| Geohash lookup (nearest drivers) | ~2 ms | Redis GEO range query |
| Driver notification via WS push | ~15 ms | Network to driver device |
| Driver acceptance (human reaction) | ~2000 ms | Human |
| Total visible to rider | 2–4 seconds | Driver acceptance dominates |
The system budget is 5 seconds at p95. The algorithmic steps add less than 25 ms total; the constraint is human response time. If a driver doesn't accept within a threshold window, the system offers the ride to the next candidate — this retry adds one more 2-second window but keeps the total under 5 seconds in most markets.
When asked to design a ride-hailing system, interviewers are checking three things: (1) whether you recognize that location throughput is a distinct problem from relational data (1M writes/sec needs Redis or Cassandra, not Postgres); (2) whether you model the trip as a formal state machine rather than a bag of boolean flags; (3) whether you bring up idempotency for the ride request without being prompted. All three signal that you think in systems, not just endpoints. A bonus signal: noting that the geohash query must include the 8 neighboring cells to handle pickup points near cell boundaries.
The rider app doesn't need to open the WebSocket before the ride exists. The safe sequence: (1) POST /v1/rides → get back ride_abc; (2) open wss://.../v1/rides/ride_abc/stream. Any events that occurred between step 1 and step 2 (unlikely but possible with very fast matching) can be fetched via GET /v1/rides/ride_abc before trusting the stream's first frame. This prevents a race where the ride advances past MATCHING before the rider's WS is established.
✍️ Exercise: geohash boundary edge case
Design the geohash-based driver lookup for a city with 10,000 active drivers. How would you handle the edge case where a pickup point is on a geohash cell boundary?
Model answer: Encode each driver's position at geohash precision level 6 (cells of roughly 1.2 km × 0.6 km). Index the geohash string in Redis GEO (or a sorted set keyed by geohash prefix). To find drivers near a pickup, compute the pickup's level-6 geohash and query that cell plus all 8 neighboring cells.
Why 9 cells? A driver 400 m from the pickup but in the adjacent cell would be missed by a single-cell query. Geohash boundaries are arbitrary lines across geography — a driver 350 m away on the wrong side of a boundary is closer than a driver 800 m away on the correct side. Querying all 8 neighbors guarantees that any driver within the search radius is included, regardless of which cell they land in.
Rubric: must mention querying neighbor cells; must explain why a single-cell query misses boundary drivers; bonus for noting that at precision level 6 the 9-cell union covers roughly a 2.4 km × 1.8 km area, which comfortably contains any driver within a 1 km search radius.
Under the hood: the core mechanism
Two mechanisms do the real work: the geohash spatial index that locates drivers in O(log N), and the trip state machine enforced through optimistic locking. Understanding both at the data-structure level makes the latency budget credible.
Geohash spatial index — how it works
A geohash encodes a (lat, lng) pair into a base-32 string by recursively bisecting the world into alternating vertical and horizontal halves. Each additional character narrows the cell by a factor of 8 or 4:
# Coordinate: (37.7749, -122.4194) — San Francisco
# Geohash prefix length → cell size (approximate):
# 4 chars → 39.1 km × 20 km (city-level)
# 5 chars → 4.9 km × 4.9 km (neighborhood-level)
# 6 chars → 1.2 km × 0.6 km (street-level) ← Uber uses ~level 6
# 7 chars → 153 m × 153 m
# Encode driver at that position (level 6):
geohash.encode(37.7749, -122.4194, precision=6) → "9q8yy6"
# The geohash prefix property:
# "9q8yy6" and "9q8yy7" share prefix "9q8yy" → they are adjacent cells
# A prefix scan on "9q8yy" returns all drivers in that neighborhood
Redis GEO stores positions as sorted set scores — the coordinate is encoded into a 52-bit integer (a 52-character geohash) that is used as the score, giving O(log N) range queries on geographically proximate points:
# Driver comes online → write their position
GEOADD drivers:active -122.4194 37.7749 "drv_7"
# Rider requests at (-122.4180, 37.7760) — find 5 nearest drivers within 2 km
GEORADIUS drivers:active -122.4180 37.7760 2 km ASC COUNT 5
# → ["drv_7", "drv_22", "drv_41", "drv_5", "drv_88"] (sorted nearest first)
# Internally: computes geohash at the search point, then queries the sorted set
# for the geohash range covering the 2 km radius + 8 neighbor cells
Why query 9 cells, not 1
A pickup point on a geohash cell boundary has nearby drivers in the adjacent cell. A driver 400 m away in the next cell is closer than one 800 m away inside the same cell. Redis GEO handles this automatically — but if you were building a raw geohash index, you must query the pickup cell plus all 8 neighbors:
| NW | N | NE | |
|---|---|---|---|
| W | NW cell | N cell | NE cell |
| W cell | pickup cell | E cell | |
| SW cell | S cell | SE cell |
Request → match worked trace
A rider at (37.7760, -122.4180) taps "Request". Here is the complete server-side sequence with approximate timing for each step:
── T=0 ms: POST /v1/rides arrives ───────────────────────────────
1. Idempotency key lookup
GET idem:idem_a3f8b1c2 → (nil) # not seen before
SET idem:idem_a3f8b1c2 "pending" EX 600 # lock for 10 min
2. Create ride record in DB (status = REQUESTING)
INSERT rides (id, rider, pickup, dropoff, status)
VALUES ('ride_abc', 'usr_42', ..., 'REQUESTING')
── T=5 ms: matching begins ───────────────────────────────────────
3. Geo lookup: nearest 5 drivers within 2 km
GEORADIUS drivers:active -122.4180 37.7760 2 km ASC COUNT 5
→ ["drv_7", "drv_22", "drv_41"] # 3 available in radius
4. Filter: check each candidate's current state in DB
SELECT status FROM drivers WHERE id IN ('drv_7','drv_22','drv_41')
→ drv_7 = AVAILABLE, drv_22 = IN_TRIP (skip), drv_41 = AVAILABLE
Top candidate: drv_7 (closest distance from GEORADIUS sort)
── T=7 ms: dispatch ──────────────────────────────────────────────
5. Transition ride to MATCHING, optimistic lock
UPDATE rides SET status='MATCHING' WHERE id='ride_abc' AND status='REQUESTING'
→ 1 row updated (success)
6. Send match offer to drv_7 via WebSocket push
→ WS frame: {"type":"MATCH_OFFER","ride_id":"ride_abc","pickup":...,"timeout":15}
── T=7–2007 ms: awaiting driver acceptance ───────────────────────
7. drv_7 accepts (T ≈ 500 ms): sends ACTION {"accept":"ride_abc"}
8. Transition: MATCHING → DRIVER_EN_ROUTE (optimistic lock)
UPDATE rides SET status='DRIVER_EN_ROUTE', driver_id='drv_7'
WHERE id='ride_abc' AND status='MATCHING'
→ 1 row updated
9. Notify rider via WebSocket
→ WS frame: {"type":"STATE_CHANGE","status":"DRIVER_EN_ROUTE","driver":{"id":"drv_7",...}}
── T ≈ 512 ms: rider sees "Driver on the way" ────────────────────
If drv_7 does not respond within 15 seconds, the matching service falls back to the next candidate (drv_41) and repeats steps 6–8. The optimistic lock in step 8 ensures that even if two concurrent requests both attempt to claim the same ride, exactly one succeeds.
Operating & debugging it
The two most important live metrics are geo index staleness (are driver positions current?) and match timeout rate (how often are rides offered and declined, requiring fallback to the next candidate?). High match timeout rates indicate the geo radius is too tight or available driver supply is low in that area.
Diagnosing a trip state machine issue — checking the current state and attempting a manual transition audit:
| Symptom | Likely cause | Fix |
|---|---|---|
| No drivers found — rider stuck in MATCHING indefinitely | Geo index empty or stale; drivers heartbeating to wrong shard; all nearby drivers in IN_TRIP state | Check ZCARD drivers:active; verify driver heartbeat key TTL; inspect match fallback radius widening |
| Two drivers dispatched to same ride | Optimistic lock missing or not checked; race condition in matching service | Confirm the WHERE status='MATCHING' predicate on the UPDATE; check that matched rows = 1 before proceeding |
| Rider map frozen — no location updates | Driver's WebSocket dropped; location service not relaying updates; Redis GEO write failing | Check driver WS connection health; verify location service logs for dropped frames; test GEOPOS for the driver |
| Ride stays MATCHING after driver acceptance | State transition UPDATE affected 0 rows — another process already advanced the state | Log rows-updated on every transition; implement retry with exponential backoff for 0-row transitions |
| Idempotent retry creates second ride | Idempotency key write failed or was not checked before INSERT | Use SET key NX and confirm key was set before writing the ride row; wrap both in a transaction |
| Driver appears nearby but is not offered the ride | Driver's DB status is not AVAILABLE (e.g. still IN_TRIP from a completed trip); stale state not updated | Confirm trip COMPLETED transition also sets driver status = AVAILABLE; check for stuck trips that never completed |
- To diagnose a failed match: run
GEORADIUSmanually from the pickup coords and verify drivers appear; then check each candidate's DB status. - To diagnose a stale map: run
GEOPOS drivers:active drv_Xrepeatedly — the coordinate should update every ~1 second; a frozen coordinate means the driver's WS heartbeat has stopped. - For state machine issues: query the ride row directly; compare DB status to what the client is displaying; trace the transition log for the ride ID.
- For scale issues: check that the matching service is routing to the correct regional geo shard; a São Paulo ride should never touch the APAC shard.
Quick check
1. The trip is in state DRIVER_EN_ROUTE. A rider cancels. What HTTP endpoint handles this, and what response code is expected?
Cancellation is a state transition, not a deletion. The ride record is preserved for audit. POST to a sub-resource with the action name (/cancel) is the REST convention for state-change operations that don't map cleanly to CRUD verbs. The response returns the updated ride object — status CANCELLED, with any applicable cancellation fee — so the client doesn't need a second GET.
2. Why does the system use an idempotency key on POST /v1/rides?
Without idempotency, a network timeout followed by a retry creates two separate rides: two drivers are dispatched, the rider sees two trips. The idempotency key lets the server recognize the duplicate request and return the existing ride instead of creating a new one. This is identical in structure to the payments case — the key converts an unsafe retry into a safe one.
3. A geohash lookup returns no drivers within the initial search radius. What is the correct fallback?
Geohash precision levels form a hierarchy. Dropping one character from the end of the hash roughly quadruples the search area. Iterative widening — try level 6, then level 5 — costs two cheap sorted-set scans and avoids a full-table scan. The linear scan option would require examining every driver in the city, which is O(N) and unacceptable at 10,000+ active drivers.
4. Location updates from 1 million active drivers arrive at 1 update/second each. Which data store property matters most for this write path?
Driver position is ephemeral — it is superseded by the next update in one second. A location that is 500 ms stale is fine for rider map display. Strong consistency on every write would bottleneck the path at 1M writes/sec: acquiring distributed locks or waiting for quorum acknowledgment at that rate is impractical. Write-optimized stores like Redis or Cassandra accept eventual consistency in exchange for the throughput the problem demands.
Key takeaways
- The trip is a state machine, not a bag of flags. Explicit states, server-side transition validation, and optimistic locking prevent ghost trips and race conditions. Think air traffic control — the server is the controller and nonsense routes get rejected at the runway.
- Location writes need a write-optimized geo index. 1M writes/sec rules out a relational table. Redis GEO (or any geohash-backed sorted set) solves both the throughput and the spatial query problems in one data structure.
- Geohash queries must include the 8 neighboring cells. A pickup point on a cell boundary has nearby drivers in adjacent cells. Single-cell queries silently miss them.
- Idempotency on ride creation is non-negotiable. A double-tap or network retry must not dispatch two drivers. Use an idempotency key; return the existing ride on duplicate.
- WebSocket for active trips, HTTP batch as fallback. The 1 Hz location stream needs low overhead and full-duplex; a persistent WS delivers both. Background OS restrictions make an HTTP fallback necessary for resilience.
- The human is the bottleneck. Algorithmic steps in the matching pipeline add <25 ms. Driver acceptance takes 2+ seconds. Design the retry logic (offer to next driver if no acceptance) around the human, not the algorithm.