Design Case Studies · Lesson 17
Design: Price-tracker API
A price tracker quietly polls product pages it has no official access to, stores years of price history, and fires a notification the instant a price drops below a user's target. The interesting design challenges are not in the API surface — they are behind it: how do you poll thousands of upstream sources politely, deduplicate redundant writes, recover gracefully from upstream failures, and deliver an alert within minutes of a price drop?
By the end you'll be able to
- Explain the tension between polling frequency, upstream politeness, and price freshness.
- Design a time-series price history schema and a range query endpoint.
- Describe how a circuit breaker protects the ingestion pipeline from a failing upstream source.
- Model the alert delivery path from ingested price point to user notification.
- Deduplicate redundant price ingestion points without a full-table scan.
1 — Requirements
Before the API shape, nail down what the system actually needs to accomplish and what external constraints bind the design.
Functional requirements
- Watch a product: a user submits a product URL and an alert threshold price. The system begins tracking that product's price.
- Price history: for any tracked product, return the full price time-series (timestamp + price) queryable by date range.
- Threshold alerts: when an observed price drops at or below a user's threshold, deliver a notification to the user via their configured channel (email, webhook, push).
- Current price: return the most recently observed price for a tracked product on demand.
- Source ingestion: periodically crawl upstream product pages to read the current price. Handle multiple retailers as upstream sources.
- Watch management: users can list, pause, and delete their watches.
Non-functional requirements
- Upstream politeness: the crawler must not hammer upstream sources. Respect
robots.txt, use per-domain rate limits and exponential backoff, and add jitter to scheduled jobs. - Upstream failure resilience: if an upstream source goes down, the system must not cascade that failure into the rest of the pipeline. Use a circuit breaker per source domain.
- Alert timeliness: a price drop should trigger a user notification within 10 minutes of the drop being observed — sooner for actively watched popular products.
- History durability: price history must be retained and queryable for at least 3 years per product.
- Deduplication: if the crawl returns the same price as the previous data point for a product, do not write a duplicate row. Store only meaningful changes.
- Read scalability: price history queries (charting, export) may return thousands of data points; the query path must paginate and be independent of the ingestion write path.
Scale assumption: 2 million tracked product-URL pairs. Each product is crawled on average once every 30 minutes, yielding ~1,100 crawl completions per second. Most crawls find no price change (~80%), so effective write throughput is ~220 new price points per second. Price drops in practice are rare — assume 0.1% of crawls produce a price drop that triggers an alert, giving ~1 alert delivery per second at baseline with spikes during sale events.
2 — Design decisions
Decision 1: Scheduled polling with per-domain rate limiting and jitter
Think of the crawler as a polite librarian who makes copies of pages it needs, but calls ahead before arriving at each library and spaces its visits so it never overwhelms any single branch. The "calling ahead" is robots.txt checking; the spacing is rate limiting and jitter.
Concretely: a job scheduler queues crawl tasks per product. Each task is tagged with its domain (e.g. amazon.com). A per-domain rate-limiter allows at most N requests per second to that domain across all crawler workers. To prevent the thundering-herd problem where all products for a domain happen to be scheduled at the same minute, each task's scheduled time is offset by a random jitter of ±50% of the base interval (see retries and backoff).
Crawl interval adapts to demand: a product with 10,000 active watches is polled every 5 minutes; a product with 1 watch is polled every 60 minutes. This ensures high-value products are fresher without increasing total crawl load proportionally.
Upstreams can and do block IP ranges that ignore crawl policies. The crawler must fetch and cache each domain's robots.txt and honour Crawl-delay directives. A polite crawler that respects these limits can operate for years; an aggressive one gets blocked within days. Store the parsed robots.txt with a TTL of 24 hours and re-fetch before it expires.
Decision 2: Time-series price history with change-only writes
Price history is an append-only log: each record is (product_id, observed_at, price, currency). The schema is intentionally narrow — no product metadata, just the price signal. This lets you partition the table by product_id and sub-partition by time bucket (month or year) to keep partition sizes bounded over 3-year retention.
Deduplication (change-only writes): before inserting a new price point, the ingestor reads the product's last known price. If the incoming price equals the last stored price, the row is discarded. Only a price change produces a write. This cuts write volume by ~80% at the stated scale assumptions and keeps the time-series compact — the chart shows when prices actually moved, not thousands of identical "no change" records.
-- Pseudo-SQL: change-only insert
INSERT INTO price_points (product_id, observed_at, price)
SELECT $product_id, $now, $price
WHERE NOT EXISTS (
SELECT 1 FROM price_points
WHERE product_id = $product_id
AND price = $price
AND observed_at = (
SELECT MAX(observed_at) FROM price_points
WHERE product_id = $product_id
)
);
In practice, cache the last known price per product in Redis (key: last_price:{product_id}, value: price as a string). The ingestor reads Redis, compares, and only hits the database if the price changed. This eliminates the subquery entirely on the common (no-change) path.
Decision 3: Alert delivery via webhook / notification path
When the ingestor writes a new price point that is at or below any user's threshold for that product, it enqueues an alert task. The alert worker looks up all watchers for the product whose threshold_price >= new_price, then delivers a notification to each via the user's preferred channel (see Notification API pattern).
Delivery options per user:
- Email: lowest latency requirement; acceptable delay up to a few minutes.
- Webhook: user provides an endpoint URL; the alert worker POSTs the event payload. Retried with exponential backoff on failure (see retries & backoff).
- Push notification: mobile push via APNs / FCM; fastest perceived delivery for mobile users.
The alert pipeline is asynchronous — the ingestor does not wait for notification delivery before acknowledging the crawl task as complete. This decouples ingestion throughput from notification delivery latency.
Decision 4: Circuit breaker per upstream source domain
If a retailer's site goes down for an hour, the crawler should not continue hammering it with failing requests — wasting crawler worker threads, consuming rate-limit budget, and filling logs with errors. A circuit breaker per source domain handles this.
States per domain circuit breaker: CLOSED (normal crawling) → OPEN (upstream unavailable, skip all crawls for this domain, save capacity) → HALF-OPEN (probe with one request; if it succeeds, close; if it fails, reopen with longer timeout). Thresholds: open after 5 consecutive failures for a domain; probe after 10 minutes of OPEN state.
When a domain's circuit is OPEN, its crawl tasks are deferred (rescheduled to after the breaker probe window) rather than discarded. Price history shows a gap during the outage; that is correct and honest. Users are not alerted during a gap because no data was ingested — no false "price dropped to $0" events.
A crawl failure that returns a null price is not a price drop. Always check that the newly ingested price is a valid numeric value before comparing it against thresholds. An alert fired on a parsing error (upstream returned "Out of stock" where a price was expected) trains users to ignore real alerts. Add explicit validation: the price must parse as a positive decimal; if not, log the anomaly and skip the alert check entirely.
3 — The API model
Create a watch (start tracking a product)
POST /v1/watches
Authorization: Bearer <token>
Content-Type: application/json
// Request body
{
"product_url": "https://www.example-shop.com/products/headphones-pro",
"threshold_price": 79.99,
"currency": "USD",
"notify_via": "webhook",
"webhook_url": "https://my-app.example.com/price-alerts"
}
// 201 Created
{
"id": "watch_c9d2",
"product_id": "prod_ff01", // internal product identity
"product_url": "https://www.example-shop.com/products/headphones-pro",
"threshold_price": 79.99,
"currency": "USD",
"current_price": 109.00, // last observed price if already tracked
"status": "active",
"next_poll_at": "2024-06-01T10:08:23Z",
"created_at": "2024-06-01T10:00:00Z"
}
// 422 Unprocessable Entity — URL from unsupported domain
{
"error": "unsupported_domain",
"detail": "example-shop.com is not in the list of supported retailers"
}
Get price history for a product
GET /v1/products/prod_ff01/history?range=30d&limit=500&cursor=eyJ0IjoiMjAyNC0wNS0wMSJ9
Authorization: Bearer <token>
// 200 OK
{
"product_id": "prod_ff01",
"currency": "USD",
"range": "30d",
"points": [
{ "observed_at": "2024-05-01T08:12:00Z", "price": 119.00 },
{ "observed_at": "2024-05-03T14:05:00Z", "price": 109.00 },
{ "observed_at": "2024-05-18T22:01:00Z", "price": 95.00 },
// … up to limit
],
"next_cursor": "eyJ0IjoiMjAyNC0wNS0xOSJ9", // null if no more pages
"total_points": 87
}
// range values: 7d | 30d | 90d | 1y | all
Alert delivery via webhook
// Server → user's webhook_url (POST from our alert worker)
POST https://my-app.example.com/price-alerts
Content-Type: application/json
X-Signature-SHA256: hmac_sha256(secret, body) // for verification
{
"event": "price_drop",
"watch_id": "watch_c9d2",
"product_id": "prod_ff01",
"product_url": "https://www.example-shop.com/products/headphones-pro",
"threshold_price": 79.99,
"observed_price": 74.95,
"observed_at": "2024-06-01T10:08:10Z",
"currency": "USD"
}
The X-Signature-SHA256 header is an HMAC of the request body using a secret shared with the user on watch creation (see Stripe's webhook verification as a reference pattern). The recipient verifies the signature before acting on the alert. If the webhook endpoint returns a non-2xx status, the alert worker retries with exponential backoff up to 3 attempts over 30 minutes, then marks the delivery as failed.
A common mistake in system design interviews is to check thresholds at query time — scanning all watches at read time. That is O(W) work per history read, where W is the number of watches. The correct design checks thresholds at ingest time: when a new price point arrives in the ingestor, look up watches for that product with threshold_price >= new_price. You index on (product_id, threshold_price) and touch only the relevant rows. Alert delivery becomes an event-driven push, not a read-time scan.
4 — Evaluation & latency budget
Polling cadence vs. price freshness
The fresher you want price data, the more frequently you must crawl. But higher frequency means more upstream load. The relationship is linear: halving the poll interval doubles your crawl request rate. At 2 million tracked products:
| Poll interval | Crawl req/s | Max price lag | Notes |
|---|---|---|---|
| 5 min | 6,667 | 5 min | High upstream load; require priority tiers |
| 30 min | 1,111 | 30 min | Baseline design target |
| 60 min | 556 | 60 min | Gentle; appropriate for low-watch products |
The priority-tiered approach reconciles freshness with politeness: popular products (many watchers) get shorter intervals; niche products get longer. This keeps total crawl rate near 1,100 req/s without giving every product a 5-minute interval.
Upstream-failure handling
When a domain's circuit opens (5 consecutive failures), all crawl tasks for that domain are deferred. The price history shows a gap. Users are not falsely alerted. The circuit probes after 10 minutes; if the probe succeeds, the circuit closes and the scheduler resumes normal cadence for that domain's products — it does not immediately hammer the domain with the backlog of deferred tasks. Instead, the deferred tasks are re-queued with standard jitter to spread the recovery load.
What is lost during an outage? Freshness. A product that normally updates every 30 minutes may show prices from before the outage. If the actual price on the upstream site changed during the outage, we will not know until the circuit recovers and the next successful crawl. This is an inherent limitation of polling: you can only know what you can observe.
History storage and query
With change-only writes, a product that never changes price produces at most 1 row per period. A product with volatile pricing (flash sales, dynamic pricing) may produce dozens of rows per day. Over 3 years, a volatile product accumulates at most ~90,000 rows (1 per hour). At 200 bytes per row, that is 18 MB per volatile product — manageable.
Query: GET /v1/products/:id/history?range=30d translates to a range scan on (product_id, observed_at) with a composite index. The cursor-based pagination ensures that a large history (export, charting) does not block on a single large query. The query path is read-only and can be served from a read replica without touching the ingestor's write path.
Alert timeliness budget
| Stage | Max delay | Notes |
|---|---|---|
| Price drop occurs on upstream site | 0 | Ground truth |
| Next scheduled crawl fires | 0 – poll_interval | Worst case = full interval; average = half interval |
| Crawl completes, ingestor writes, threshold check | ~5 s | Network + parse + DB write |
| Alert task enqueued and picked up by alert worker | ~10 s | Depends on queue depth under load |
| First delivery attempt (webhook/push) | ~5 s | HTTP call to user endpoint |
| Total (p50) | ~15 min | Average half-interval wait dominates; within 10 min for popular products polled every 5–10 min |
The dominant term is the polling interval, not the processing pipeline. Improving alert timeliness means polling more frequently for high-value products — not optimising the ingestor.
A GET /history?offset=5000&limit=500 query requires the database to count and skip 5,000 rows before returning 500. On a large time-series table this gets slower as offset grows. A cursor-based approach — GET /history?cursor=<opaque_token> where the token encodes the last observed_at timestamp — translates to WHERE observed_at < $cursor_time ORDER BY observed_at DESC LIMIT 500, which uses the index efficiently regardless of how deep in the history you are.
Exercise — Price drop during a 2-hour upstream outage
Prompt: A flash sale drops the price of a product from $109 to $74 for exactly 2 hours, then the price returns to $109. The upstream domain was experiencing an outage for those exact 2 hours (circuit open). The user had a watch threshold of $80. What does your system do, and how do you handle the user's expectation of an alert?
Analysis:
- The price dropped to $74 (below the $80 threshold) during the outage window. No crawl ran during that window — the circuit was open and tasks were deferred.
- When the circuit closes and crawling resumes, the upstream price is back at $109. The crawl observes $109 — which is a change from the last stored price of $109 (no change), so no write occurs.
- The system never sees the $74 price. No alert is fired. The user misses the sale.
- This is correct and honest behaviour — the system cannot alert on information it never received. Attempting to infer a price drop that was never observed would produce false positives.
What you can do: surface the gap transparently. When a product's history shows a gap (no data for 2 hours in a normally-30-minute-poll product), the API could include an "outage_gaps" field in the history response indicating intervals where data was unavailable. Users can see that data was missing and understand why the alert did not fire.
Rubric: must correctly state that no alert fires — the dropped price was never observed. Must not suggest "we can infer the price dropped." Must acknowledge the gap. Bonus: describe the outage gap metadata in the history response.
Under the hood: the core mechanism
The visible API surface — watch creation, history queries, alert webhooks — is thin. All the interesting engineering is the ingestion pipeline that runs continuously behind it. This section traces exactly what happens inside one full poll cycle, from scheduler tick to alert delivery, and shows the data structures that make it fast.
Key data structures
| Structure | Where | Fields & purpose |
|---|---|---|
products table | PostgreSQL | id, url, domain, poll_interval_s, last_crawled_at, watcher_count — canonical product registry; watcher_count drives poll interval assignment |
price_points table | PostgreSQL (partitioned by month) | product_id, observed_at, price_cents, currency — append-only time-series; partitioning keeps range scans fast over 3-year retention |
watches table | PostgreSQL | id, user_id, product_id, threshold_price_cents, notify_via, webhook_url, status; indexed on (product_id, threshold_price_cents) for ingest-time threshold queries |
last_price:{product_id} | Redis (string) | Most recent observed price in cents as a string; TTL = poll_interval × 3. Hot-path deduplication read — avoids a DB query on the common no-change case |
domain_circuit:{domain} | Redis (hash) | state (CLOSED/OPEN/HALF_OPEN), failure_count, opened_at — circuit breaker state per upstream domain |
| Crawl task queue | Redis list / SQS | Each task: {product_id, domain, url, scheduled_for}; scheduled_for includes jitter |
| Alert task queue | Redis list / SQS | Each task: {watch_id, product_id, observed_price_cents, observed_at}; drained by alert workers |
Worked trace: one full poll cycle that crosses a threshold and fires an alert
Setup: product prod_ff01 (headphones), last stored price $109.00 (10900 cents in Redis). One user (user_42) has a watch with threshold $80.00 (8000 cents). The scheduler fires a crawl task for this product.
What happens when the upstream is flaky: circuit breaker trace
The deduplication pipeline in detail
The Redis fast-path and the SQL change-only write complement each other. Here is the decision tree the ingestor runs for every crawl result:
Operating & debugging it
Inspecting the ingestion pipeline
Symptom → cause → fix table
| Symptom | Cause | Fix |
|---|---|---|
| Alert fires but price shown in webhook is stale (previous price, not the one that crossed threshold) | Alert task was enqueued using the Redis last_price value read before the DB write; race between Redis update and task enqueue | Enqueue the alert task with the price_cents value from the current crawl result, not from a subsequent Redis read |
| Duplicate alerts fire for the same price drop (user gets 3 emails) | Alert worker is not idempotent — retries on a slow delivery create duplicates; or threshold check runs twice (worker crash mid-task) | Write an alert_log row with a unique key (watch_id + observed_at) and use INSERT … ON CONFLICT DO NOTHING before delivery; make delivery idempotent with a delivery_id in the webhook payload |
| Price history shows a gap of several hours with no data points | Domain circuit was OPEN (upstream outage), or crawler worker pool was exhausted, or the product was paused | Query domain_circuit Redis keys for the domain; check crawler worker health metrics; surface the gap as an outage_gaps field in the history API response |
| Same price point written to DB twice with slightly different timestamps | Redis cache miss (TTL expired) caused both a Redis read-miss and a fallback DB subquery, but a concurrent ingestor wrote between the two checks | Use Redis SET … NX (set-if-not-exists) as an atomic guard; the first writer wins and subsequent workers see the key and discard their result |
| Alert fires for "price = 0" or "$0.00" | Upstream HTML changed structure; parser returns null or 0; validation not catching it | Add explicit validation: price_cents must be > 0 (e.g., > 100 cents, i.e., $1.00 minimum); log parse failures as anomalies; never treat parse failure as a valid price point |
| Crawl worker threads exhausted; queue growing faster than consumed | One slow domain (high latency / connection timeouts) is tying up all worker threads waiting on I/O | Set a per-request HTTP timeout (e.g. 5 s); use async I/O or per-domain worker pools so a slow domain cannot starve workers for other domains |
| Circuit breaker never closes after an upstream outage ends | HALF-OPEN probe is using the same worker pool that is blocked by other tasks; probe never fires | Run circuit-breaker probes in a dedicated lightweight goroutine/thread, independent of the main crawl worker pool; use a scheduled job (every 5 min) to check OPEN circuits and issue probes |
Production monitoring checklist
- Circuit breaker state dashboard: alert if any domain circuit has been OPEN for more than 30 minutes — indicates a prolonged upstream outage or a mis-configured breaker.
- Crawl queue depth: should stay below 5× worker throughput per minute. A growing queue means workers are falling behind; check for timeout spikes or a large domain outage deferring tasks.
- Deduplication ratio: track the ratio of writes / crawl completions. Should be ~0.2 (20% are actual changes). A sudden spike toward 1.0 may mean your Redis TTL expired and every crawl is hitting the DB, or upstream is returning volatile/randomised prices.
- Alert delivery success rate: track delivered / attempted per hour. Sustained failures above 5% warrant investigation — check webhook endpoint health, HMAC signing key rotation, or whether users are returning non-2xx intentionally.
- Price anomaly log: count parse failures per domain per hour. A spike means the upstream changed its HTML structure and the price extractor needs updating.
🧠 Check your understanding
The last stored price for a product is $109.00. The latest crawl also returns $109.00. What should the ingestor do?
The design uses change-only writes to avoid duplicate rows and keep time-series storage compact. A price observation identical to the last stored value is silently discarded. A chart drawn from this data correctly shows "price was $109 from time T" without repeating the same value hundreds of times.
The crawler has hit an upstream domain 5 consecutive times and all 5 requests failed. What should happen next?
A circuit breaker protects the pipeline from a known-failing upstream. After the threshold of failures, the circuit opens and the system stops trying — freeing crawler workers and rate-limit budget for other domains. The tasks are deferred (not deleted) so they resume when the circuit closes after the probe succeeds.
Alert threshold checks happen at ingest time rather than query time. What is the primary reason?
At ingest time, you know exactly which product just changed price. You query watches WHERE product_id = $id AND threshold_price >= $new_price — one indexed lookup touching only relevant rows. At query time you'd have to check every watch against the history on every read, which is O(all watches) per request. Push the work to where the information first arrives.
A user's webhook endpoint returns 503 Service Unavailable on the first alert delivery attempt. What should the alert worker do?
A 503 indicates a temporary server-side failure. Immediate tight-loop retrying amplifies load on an already-struggling endpoint. Exponential backoff gives the recipient time to recover while still making reasonable delivery attempts. Marking as permanently failed on the first attempt would lose valid alerts whenever a user's endpoint briefly restarts.
Key takeaways
- Poll cadence drives freshness. The dominant source of alert latency is the polling interval, not the pipeline. Prioritise high-watch-count products with shorter intervals.
- Be polite to upstreams. Per-domain rate limiting,
robots.txtrespect, and jittered scheduling keep the crawler sustainable long-term. Aggressive crawling leads to IP bans. - Change-only writes cut storage and write load by ~80% at typical crawl volumes. Cache last price in Redis for fast deduplication without a DB subquery.
- Open the circuit on repeated upstream failures. Defer tasks, do not discard them. Resume with standard jitter after the probe succeeds — never flood-recover.
- Check alert thresholds at ingest time, not query time. An indexed lookup on (product_id, threshold_price) touches only the relevant rows.
- Never fire an alert on missing data. A null/unparseable price is not a price drop. Validate the parsed value before the threshold check.
- Use cursor pagination for history queries; offset pagination degrades as history depth grows.
Sources & further reading
- RFC 9309 — Robots Exclusion Protocol — the standard the crawler must implement before making any requests
- Stripe — Webhook best practices — canonical reference for idempotent, signed webhook delivery with retry
- Martin Fowler — Circuit Breaker — the original description of the circuit-breaker pattern used for upstream failure isolation
- PostgreSQL — Partial Indexes — useful for indexing only active watches when querying alert thresholds at ingest time
- TimescaleDB — Time-series data — purpose-built extension for time-series storage and range queries, directly applicable to price_points table design
- AWS Builders' Library — Avoiding insurmountable queue backlogs — covers the deferred-task recovery pattern used when a circuit breaker closes