Design Case Studies · Lesson 01
Design: Search API
Keyword search is deceptively hard to get right at scale. This case study works through every design choice — GET vs. POST, cursor pagination, query caching, rate limiting, and eventual index consistency — for a system that must return results in under 200 ms at 10 000 requests per second.
By the end you'll be able to
- Design a complete search endpoint including query params, filters, sort, and cursor pagination.
- Explain why cursor pagination is correct for search and when offset pagination fails at scale.
- Walk the p99 latency budget for a cache hit and a cache miss, and show that both fit the target.
Requirements
Functional requirements
- Clients can search a catalog of items by keyword.
- Results can be filtered on at least two dimensions (e.g., category, price range).
- Results can be sorted by relevance score or by a sortable field (e.g., price, date).
- Results are paginated; clients can retrieve the next page without repeating query params.
- The endpoint is public (no authentication required for basic search).
Non-functional requirements
- p99 response time ≤ 200 ms measured at the client.
- 99.9% availability (≤ 8.7 h downtime per year).
- Search index may be up to 30 seconds stale (eventual consistency is acceptable).
- Individual callers are rate-limited to prevent index overload.
Scale assumption
~10 000 requests per second at peak with a catalog of 50 million documents. Search traffic is highly skewed: roughly 80% of queries repeat the same ~5 000 popular terms. That skew is the single most important fact in this design.
Design decisions
GET, not POST
A search is a read. Using GET makes the request safe (no side effects) and idempotent, which has concrete downstream benefits: HTTP caches can store it, CDN edge nodes can serve it without consulting the origin, and clients and load balancers can safely retry it on timeout. The only counterargument for POST is when filter complexity overflows a URL (typically > 2 000 characters); at that scale you'd add a POST /v1/search/sessions endpoint that saves a query body and returns a cursor — but that is not needed at the requirements stated here.
Cursor pagination, not offset
Offset pagination (?page=42&page_size=20) has a fatal flaw at scale: OFFSET 840 LIMIT 20 forces the search engine to score and rank the first 860 results before discarding 840 of them. At 50 million documents, deep pages become prohibitively expensive. More importantly, if a document is inserted or deleted between page 1 and page 2, items shift — the client sees duplicates or misses items. Cursor pagination solves both problems. The cursor encodes the position of the last item seen (e.g., a serialised sort key + document ID), so the engine resumes from an exact point regardless of intervening writes. See RESTful maturity levels for why this is idiomatic REST. See data fetching and pagination for the full treatment.
Cache popular queries
Given that 80% of traffic hits ~5 000 popular queries, a query-level cache in front of the search engine dramatically reduces index load. The cache key is the normalised query string (lowercased, trimmed, parameters in canonical order). The TTL is 30 seconds — matching the acceptable staleness from the non-functional requirements, not a coincidence. A longer TTL would conflict with the consistency requirement; a shorter one would defeat the cache. See caching strategies for how to choose TTL and when to invalidate eagerly.
Rate limiting
Even with a cache, an adversarial or misconfigured client issuing thousands of cache-miss queries per second would saturate the search index. Rate limiting is applied per-caller (IP or API key) using a token bucket: 300 requests per minute with a burst allowance of 60 requests per second. The response includes Retry-After on 429 so clients can back off correctly. See rate limiting for the full token bucket vs. leaky bucket analysis.
Eventual consistency of the index
The write path (ingesting new documents) and the read path (serving search results) are decoupled. Documents are ingested via an internal pipeline that updates the search index asynchronously. This means a newly created document may not appear in search results for up to 30 seconds — which is explicitly allowed by the non-functional requirements. Attempting strong consistency would require holding writes until the index is updated, adding hundreds of milliseconds to the write path and creating a coupling that makes the read path brittle. Eventual consistency is the right trade-off here.
Designing the search endpoint to accept a POST with a JSON body "because it's cleaner than a long URL." This makes the response uncacheable by any standard HTTP caching layer, immediately invalidating the caching strategy that keeps p99 under 200 ms. Use GET unless you have a specific, documented reason not to — and document the reason.
The API model
-------------------------------------------------------------------
Search for items
-------------------------------------------------------------------
GET /v1/search?q={query}&filter={filter}&sort={sort}&cursor={cursor}&limit={limit}
Query parameters:
q string required The search keywords, URL-encoded.
filter string optional Comma-separated key:value pairs.
e.g. filter=category:shoes,max_price:120
sort string optional Field to sort by + direction.
e.g. sort=price:asc, sort=score:desc (default)
cursor string optional Opaque cursor from the previous page's
next_cursor. Omit on first request.
limit int optional Page size 1–50, default 20.
-------------------------------------------------------------------
Successful response — 200 OK
-------------------------------------------------------------------
HTTP/1.1 200 OK
Content-Type: application/json
Cache-Control: public, max-age=30, stale-while-revalidate=60
X-RateLimit-Limit: 300
X-RateLimit-Remaining: 287
X-RateLimit-Reset: 1720000060
{
"data": [
{
"id": "item_a7f3",
"title": "Trail Runner Pro",
"category": "shoes",
"price_cents": 11900,
"score": 0.94
}
/* … up to limit items … */
],
"next_cursor": "eyJzY29yZSI6MC40Miwi...", // null if last page
"total_estimate": 4821, // approximate; may change
"query_id": "qry_88ac2" // for debugging / tracing
}
-------------------------------------------------------------------
Rate-limited response — 429 Too Many Requests
-------------------------------------------------------------------
HTTP/1.1 429 Too Many Requests
Retry-After: 14
Content-Type: application/json
{
"error": "rate_limit_exceeded",
"message": "You have exceeded 300 requests per minute. Retry after 14 s.",
"retry_after_seconds": 14
}
Return next_cursor: null on the last page rather than omitting the field. Clients that check if response.next_cursor work correctly for both cases; clients that dereference an absent field will throw a null-pointer error on the last page. Consistent presence of the field makes the contract easier to consume.
Evaluation & latency budget
Walk a single GET /v1/search?q=trail+runner&filter=category:shoes&sort=price:asc request through the system under two scenarios.
Scenario A — cache hit (80% of traffic)
| Segment | Estimated time |
|---|---|
| Client → CDN edge (network RTT) | ~15 ms |
| CDN cache lookup (in-memory) | ~1 ms |
| CDN → client (response transfer) | ~8 ms |
| Total p99 (cache hit) | ~24 ms |
Well under the 200 ms budget. The origin never sees these requests.
Scenario B — cache miss (20% of traffic)
| Segment | Estimated time |
|---|---|
| Client → CDN edge | ~15 ms |
| CDN cache miss → rate limiter check | ~2 ms |
| Rate limiter → search service (internal network) | ~3 ms |
| Search service: parse query, decode cursor | ~2 ms |
| Search index query (p50 ~15 ms, p99 ~50 ms) | ~50 ms |
| Search service: rank, serialise response | ~5 ms |
| Search service → CDN → client | ~20 ms |
| Total p99 (cache miss) | ~97 ms |
Under the 200 ms budget with significant headroom. The search index is the dominant cost — optimise it first if the budget tightens.
What happens to the budget if cache TTL is raised?
Doubling the TTL from 30 s to 60 s would reduce cache-miss traffic by roughly 10–15%, each miss contributing ~97 ms. But the non-functional requirements cap staleness at 30 s — so raising the TTL would violate a stated requirement, not just a preference. The correct fix for a tight latency budget at cache miss is to reduce search index p99, not to extend the TTL beyond the consistency window.
When asked "how does this scale to 10x the load?", the correct answer for this design is: "Cache hit rate doesn't change — those 80% of requests still terminate at the CDN. Only the 20% cache-miss traffic grows, which means search index throughput needs to scale with total traffic growth, not cache-hit growth. I'd add read replicas to the search index and a load balancer in front of the search service tier. The API layer itself is stateless and horizontally scalable."
Under the hood: the core mechanism
The CDN and rate limiter are well understood, but the part that deserves a closer look is what the search service actually does: how an inverted index is built, how a query matches and ranks documents, and how a cursor lets pagination resume precisely over ranked results.
The inverted index: term → postings list
A search engine does not scan documents for each query. Instead, it builds an inverted index at ingest time: a lookup from each unique term to the list of documents that contain it, along with position and frequency information. At query time, looking up "trail" takes microseconds regardless of corpus size.
-- INGEST: index 3 documents --
doc_1: "Trail Runner Pro" → terms: [trail, runner, pro]
doc_2: "Trail Blazer Red" → terms: [trail, blazer, red]
doc_3: "Road Runner Shoes" → terms: [road, runner, shoes]
-- Resulting inverted index (term → postings list) --
"trail" → [ {doc:doc_1, tf:1}, {doc:doc_2, tf:1} ]
"runner" → [ {doc:doc_1, tf:1}, {doc:doc_3, tf:1} ]
"pro" → [ {doc:doc_1, tf:1} ]
"blazer" → [ {doc:doc_2, tf:1} ]
"road" → [ {doc:doc_3, tf:1} ]
"shoes" → [ {doc:doc_3, tf:1} ]
N = 3 documents total
-- QUERY: "trail runner" --
Step 1: look up "trail" → { doc_1, doc_2 }
Step 2: look up "runner" → { doc_1, doc_3 }
Step 3: intersect (AND semantics) → { doc_1 } ← only doc matching both terms
OR semantics would give → { doc_1, doc_2, doc_3 } — ranked by match count
-- SCORE each result (simplified TF-IDF) --
For doc_1 matching query terms "trail" and "runner":
idf("trail") = log(N / df) = log(3 / 2) ≈ 0.41 (df=2: appears in 2 docs)
idf("runner") = log(3 / 2) ≈ 0.41 (df=2: appears in 2 docs)
tf-idf(doc_1) = tf×idf(trail) + tf×idf(runner)
= 1×0.41 + 1×0.41 = 0.82
For doc_2 (OR semantics, only matches "trail"):
tf-idf(doc_2) = 1×0.41 = 0.41
Ranked results: [ {doc_1, score:0.82}, {doc_2, score:0.41} ]
-- CURSOR construction for second page --
Last item on page 1: doc_1 with score 0.82
Cursor payload: { "score": 0.82, "doc_id": "doc_1" }
Encoded: base64url({ "score": 0.82, "doc_id": "doc_1" })
→ "eyJzY29yZSI6MC44MiwiZG9jX2lkIjoiZG9jXzEifQ"
-- Page 2 query uses cursor to resume --
WHERE (score, doc_id) < (0.82, "doc_1") ORDER BY score DESC, doc_id DESC
Returns: [ {doc_2, score:0.41} ] next_cursor: null (last page)
If two documents have the same relevance score and the cursor encodes only score, the next-page query WHERE score < :cursor_score may skip or repeat items when scores are equal. Always include doc_id as a deterministic tie-breaker: encode {"score": 0.82, "doc_id": "doc_1"} and use WHERE (score, doc_id) < (:s, :id) ORDER BY score DESC, doc_id DESC. This is not optional on real corpora where many items have identical scores (e.g., all items with zero purchases).
Operating & debugging it
The search endpoint sits behind a CDN, a rate limiter, and a search index. Each layer has a distinct observable signature.
| Symptom | Likely cause | Fix |
|---|---|---|
| Newly created documents don't appear in results for several minutes | Indexing pipeline lag exceeds the 30-second SLA — ingestion backlog or slow index refresh | Monitor indexing queue depth; reduce index refresh interval or scale the ingest worker fleet |
| Same item appears on page 1 and page 2 | Cursor encodes only the score field, not the doc_id tie-breaker | Encode both score and doc_id in the cursor; use WHERE (score, doc_id) < (:s, :id) |
| Response has no Cache-Control header; every request reaches origin | CORS middleware or error handler runs before the cache headers are set and returns early | Ensure Cache-Control is added unconditionally, even on 4xx/5xx responses on GET /v1/search |
| 429 on every request even at low volume | Rate limiter is keyed on the wrong dimension (e.g., server IP instead of client IP or API key) | Verify the key extraction logic; log the rate-limit key per request to confirm it matches the caller |
total_estimate drops dramatically on page 2 | The estimate is recomputed per page from the search engine's approximate hit count, which is less accurate deep in the result set | Capture total_estimate on page 1 and re-use it client-side for display; don't re-read it from later pages |
Sorting by price:asc returns results in unexpected order | Sort field is null for some documents; nulls sort last or first depending on engine config | Ensure all documents have a non-null value for any sortable field, or set explicit null-handling in the sort spec |
Debug checklist:
- Check
x-cache: HIT/MISS— if every request is a MISS for the same query, the cache key is wrong or the response is not cacheable (e.g., POST instead of GET). - Decode the cursor from the response and confirm it contains both the sort field and the doc_id tie-breaker before any pagination bug investigation.
- Run the same query twice in quick succession; if the second request is faster and shows
x-cache: HIT, the CDN layer is working; if not, inspect Cache-Control headers. - For rate-limit issues: log the exact rate-limit key being used per request — a common bug is using the API gateway's internal IP as the key rather than the forwarded client IP.
- To diagnose indexing lag: compare the timestamp of a recently created document against the first time it appears in search results; if the gap exceeds the 30-second SLA, check the ingest pipeline queue depth and index refresh interval.
🧠 Quick check
1. Why is GET preferred over POST for a search endpoint with filter and sort parameters?
The cachability of GET responses is the design-forcing constraint. A POST response is not cacheable by default in HTTP, which would eliminate the CDN layer that handles 80% of traffic at low latency.
2. A client receives "next_cursor": null in a search response. This means:
A null next_cursor is the conventional signal that pagination has reached the end of the result set. The client should stop fetching, not retry.
3. The search index may return results that are up to 30 seconds out of date. Which requirement explicitly permits this?
Consistency behaviour is always a non-functional requirement — it describes how the system behaves, not what it does. The 30-second staleness window was stated explicitly to enable asynchronous index updates and the CDN cache TTL.
4. Which segment dominates the latency budget on a cache miss?
The search index is an I/O-bound operation (scanning and ranking 50M documents) and accounts for roughly half the total miss latency. Optimising the index (shard sizing, replica count, query tuning) yields the biggest gains when the budget is tight.
✍️ Exercise: add a spell-correction hint to the response
A product requirement arrives: "When the user's query matches very few results, the API should suggest a corrected spelling." For example, ?q=tral+runer might suggest "trail runner." Design the change to the response schema without breaking existing clients. Consider: what field name, what data type, when is it present?
Model answer:
{
"data": [...],
"next_cursor": "eyJ...",
"total_estimate": 3,
"spell_suggestion": { // present only when total_estimate < threshold
"corrected_query": "trail runner",
"suggestion_href": "/v1/search?q=trail+runner&filter=category:shoes"
}
// null or omitted when results are ample
}
Rubric: ✓ additive change — new field doesn't break clients that ignore unknown fields ✓ field is absent (or null) for normal responses, not an empty object ✓ includes a ready-to-use suggestion_href so clients don't have to reconstruct the URL ✓ candidate notes threshold for "few results" and who sets it (configurable server-side, not hardcoded in the contract) ✓ no new endpoint needed. All five = strong answer.
Key takeaways
- Use GET for search: it makes responses cacheable, which is the primary tool for hitting low-latency targets at high QPS.
- Cursor pagination avoids the deep-offset performance cliff and prevents duplicate/missing items when the index changes between pages.
- Match cache TTL to the stated consistency window — extending it beyond the non-functional requirement violates the contract, even if latency improves.
- The search index is the latency bottleneck on cache misses; horizontal scaling of the index tier is the right lever when traffic grows.
- Rate limiting protects the index from cache-miss storms; always return
Retry-Afterso clients can recover gracefully.
Sources & further reading
- RFC 9110 §9.3.1 — GET method semantics (safe + idempotent = cacheable)
- Elasticsearch — Paginate search results (search_after cursor, deep pagination tradeoffs)
- Twitter API — Pagination (cursor-based pagination in production)
- MDN — Cache-Control header (max-age, stale-while-revalidate)
- Google Cloud — Rate-limiting strategies and techniques
- API Design course — Caching strategies
- API Design course — Rate limiting
- API Design course — RESTful maturity levels