Design Case Studies · Lesson 17

Design: Price-tracker API

A price tracker quietly polls product pages it has no official access to, stores years of price history, and fires a notification the instant a price drops below a user's target. The interesting design challenges are not in the API surface — they are behind it: how do you poll thousands of upstream sources politely, deduplicate redundant writes, recover gracefully from upstream failures, and deliver an alert within minutes of a price drop?

⏱ ~20 min advanced Prereq: Retries & backoff, Circuit breaker, Notification API

By the end you'll be able to

Explain the tension between polling frequency, upstream politeness, and price freshness.
Design a time-series price history schema and a range query endpoint.
Describe how a circuit breaker protects the ingestion pipeline from a failing upstream source.
Model the alert delivery path from ingested price point to user notification.
Deduplicate redundant price ingestion points without a full-table scan.

1 — Requirements

Before the API shape, nail down what the system actually needs to accomplish and what external constraints bind the design.

Functional requirements

Watch a product: a user submits a product URL and an alert threshold price. The system begins tracking that product's price.
Price history: for any tracked product, return the full price time-series (timestamp + price) queryable by date range.
Threshold alerts: when an observed price drops at or below a user's threshold, deliver a notification to the user via their configured channel (email, webhook, push).
Current price: return the most recently observed price for a tracked product on demand.
Source ingestion: periodically crawl upstream product pages to read the current price. Handle multiple retailers as upstream sources.
Watch management: users can list, pause, and delete their watches.

Non-functional requirements

Upstream politeness: the crawler must not hammer upstream sources. Respect robots.txt, use per-domain rate limits and exponential backoff, and add jitter to scheduled jobs.
Upstream failure resilience: if an upstream source goes down, the system must not cascade that failure into the rest of the pipeline. Use a circuit breaker per source domain.
Alert timeliness: a price drop should trigger a user notification within 10 minutes of the drop being observed — sooner for actively watched popular products.
History durability: price history must be retained and queryable for at least 3 years per product.
Deduplication: if the crawl returns the same price as the previous data point for a product, do not write a duplicate row. Store only meaningful changes.
Read scalability: price history queries (charting, export) may return thousands of data points; the query path must paginate and be independent of the ingestion write path.

Scale assumption: 2 million tracked product-URL pairs. Each product is crawled on average once every 30 minutes, yielding ~1,100 crawl completions per second. Most crawls find no price change (~80%), so effective write throughput is ~220 new price points per second. Price drops in practice are rare — assume 0.1% of crawls produce a price drop that triggers an alert, giving ~1 alert delivery per second at baseline with spikes during sale events.

2 — Design decisions

Decision 1: Scheduled polling with per-domain rate limiting and jitter

Think of the crawler as a polite librarian who makes copies of pages it needs, but calls ahead before arriving at each library and spaces its visits so it never overwhelms any single branch. The "calling ahead" is robots.txt checking; the spacing is rate limiting and jitter.

Concretely: a job scheduler queues crawl tasks per product. Each task is tagged with its domain (e.g. amazon.com). A per-domain rate-limiter allows at most N requests per second to that domain across all crawler workers. To prevent the thundering-herd problem where all products for a domain happen to be scheduled at the same minute, each task's scheduled time is offset by a random jitter of ±50% of the base interval (see retries and backoff).

Crawl interval adapts to demand: a product with 10,000 active watches is polled every 5 minutes; a product with 1 watch is polled every 60 minutes. This ensures high-value products are fresher without increasing total crawl load proportionally.

✅ Respect robots.txt or risk a ban

Upstreams can and do block IP ranges that ignore crawl policies. The crawler must fetch and cache each domain's robots.txt and honour Crawl-delay directives. A polite crawler that respects these limits can operate for years; an aggressive one gets blocked within days. Store the parsed robots.txt with a TTL of 24 hours and re-fetch before it expires.

Decision 2: Time-series price history with change-only writes

Price history is an append-only log: each record is (product_id, observed_at, price, currency). The schema is intentionally narrow — no product metadata, just the price signal. This lets you partition the table by product_id and sub-partition by time bucket (month or year) to keep partition sizes bounded over 3-year retention.

Deduplication (change-only writes): before inserting a new price point, the ingestor reads the product's last known price. If the incoming price equals the last stored price, the row is discarded. Only a price change produces a write. This cuts write volume by ~80% at the stated scale assumptions and keeps the time-series compact — the chart shows when prices actually moved, not thousands of identical "no change" records.

-- Pseudo-SQL: change-only insert
INSERT INTO price_points (product_id, observed_at, price)
SELECT $product_id, $now, $price
WHERE NOT EXISTS (
  SELECT 1 FROM price_points
  WHERE  product_id = $product_id
  AND    price      = $price
  AND    observed_at = (
    SELECT MAX(observed_at) FROM price_points
    WHERE product_id = $product_id
  )
);

In practice, cache the last known price per product in Redis (key: last_price:{product_id}, value: price as a string). The ingestor reads Redis, compares, and only hits the database if the price changed. This eliminates the subquery entirely on the common (no-change) path.

Decision 3: Alert delivery via webhook / notification path

When the ingestor writes a new price point that is at or below any user's threshold for that product, it enqueues an alert task. The alert worker looks up all watchers for the product whose threshold_price >= new_price, then delivers a notification to each via the user's preferred channel (see Notification API pattern).

Delivery options per user:

Email: lowest latency requirement; acceptable delay up to a few minutes.
Webhook: user provides an endpoint URL; the alert worker POSTs the event payload. Retried with exponential backoff on failure (see retries & backoff).
Push notification: mobile push via APNs / FCM; fastest perceived delivery for mobile users.

The alert pipeline is asynchronous — the ingestor does not wait for notification delivery before acknowledging the crawl task as complete. This decouples ingestion throughput from notification delivery latency.

Decision 4: Circuit breaker per upstream source domain

If a retailer's site goes down for an hour, the crawler should not continue hammering it with failing requests — wasting crawler worker threads, consuming rate-limit budget, and filling logs with errors. A circuit breaker per source domain handles this.

States per domain circuit breaker: CLOSED (normal crawling) → OPEN (upstream unavailable, skip all crawls for this domain, save capacity) → HALF-OPEN (probe with one request; if it succeeds, close; if it fails, reopen with longer timeout). Thresholds: open after 5 consecutive failures for a domain; probe after 10 minutes of OPEN state.

When a domain's circuit is OPEN, its crawl tasks are deferred (rescheduled to after the breaker probe window) rather than discarded. Price history shows a gap during the outage; that is correct and honest. Users are not alerted during a gap because no data was ingested — no false "price dropped to $0" events.

⚠️ Don't fire alerts on missing data

A crawl failure that returns a null price is not a price drop. Always check that the newly ingested price is a valid numeric value before comparing it against thresholds. An alert fired on a parsing error (upstream returned "Out of stock" where a price was expected) trains users to ignore real alerts. Add explicit validation: the price must parse as a positive decimal; if not, log the anomaly and skip the alert check entirely.

Price-tracker ingestion pipeline. The scheduler assigns crawl tasks with jitter; crawler workers respect per-domain rate limits and a circuit breaker. The ingestor deduplicates against the last known price before writing. Price drops that meet a threshold are enqueued for asynchronous alert delivery.

3 — The API model

Create a watch (start tracking a product)

POST /v1/watches
Authorization: Bearer <token>
Content-Type: application/json

// Request body
{
  "product_url":     "https://www.example-shop.com/products/headphones-pro",
  "threshold_price": 79.99,
  "currency":        "USD",
  "notify_via":      "webhook",
  "webhook_url":     "https://my-app.example.com/price-alerts"
}

// 201 Created
{
  "id":               "watch_c9d2",
  "product_id":       "prod_ff01",    // internal product identity
  "product_url":      "https://www.example-shop.com/products/headphones-pro",
  "threshold_price":  79.99,
  "currency":         "USD",
  "current_price":    109.00,          // last observed price if already tracked
  "status":           "active",
  "next_poll_at":     "2024-06-01T10:08:23Z",
  "created_at":       "2024-06-01T10:00:00Z"
}

// 422 Unprocessable Entity — URL from unsupported domain
{
  "error":  "unsupported_domain",
  "detail": "example-shop.com is not in the list of supported retailers"
}

Get price history for a product

GET /v1/products/prod_ff01/history?range=30d&limit=500&cursor=eyJ0IjoiMjAyNC0wNS0wMSJ9
Authorization: Bearer <token>

// 200 OK
{
  "product_id":  "prod_ff01",
  "currency":    "USD",
  "range":       "30d",
  "points": [
    { "observed_at": "2024-05-01T08:12:00Z", "price": 119.00 },
    { "observed_at": "2024-05-03T14:05:00Z", "price": 109.00 },
    { "observed_at": "2024-05-18T22:01:00Z", "price": 95.00  },
    // … up to limit
  ],
  "next_cursor": "eyJ0IjoiMjAyNC0wNS0xOSJ9",   // null if no more pages
  "total_points": 87
}

// range values: 7d | 30d | 90d | 1y | all

Alert delivery via webhook

// Server → user's webhook_url (POST from our alert worker)
POST https://my-app.example.com/price-alerts
Content-Type: application/json
X-Signature-SHA256: hmac_sha256(secret, body)   // for verification

{
  "event":            "price_drop",
  "watch_id":         "watch_c9d2",
  "product_id":       "prod_ff01",
  "product_url":      "https://www.example-shop.com/products/headphones-pro",
  "threshold_price":  79.99,
  "observed_price":   74.95,
  "observed_at":      "2024-06-01T10:08:10Z",
  "currency":         "USD"
}

The X-Signature-SHA256 header is an HMAC of the request body using a secret shared with the user on watch creation (see Stripe's webhook verification as a reference pattern). The recipient verifies the signature before acting on the alert. If the webhook endpoint returns a non-2xx status, the alert worker retries with exponential backoff up to 3 attempts over 30 minutes, then marks the delivery as failed.

🎯 Interview angle: where to check alert thresholds

A common mistake in system design interviews is to check thresholds at query time — scanning all watches at read time. That is O(W) work per history read, where W is the number of watches. The correct design checks thresholds at ingest time: when a new price point arrives in the ingestor, look up watches for that product with threshold_price >= new_price. You index on (product_id, threshold_price) and touch only the relevant rows. Alert delivery becomes an event-driven push, not a read-time scan.

Alert flow triggered at ingest time. The ingestor queries only watchers whose threshold price is at or above the new price — an indexed lookup, not a full scan. Each match produces one alert task. The alert worker retries failed deliveries with exponential backoff. If no watches match, the pipeline is silent.

4 — Evaluation & latency budget

Polling cadence vs. price freshness

The fresher you want price data, the more frequently you must crawl. But higher frequency means more upstream load. The relationship is linear: halving the poll interval doubles your crawl request rate. At 2 million tracked products:

Poll interval	Crawl req/s	Max price lag	Notes
5 min	6,667	5 min	High upstream load; require priority tiers
30 min	1,111	30 min	Baseline design target
60 min	556	60 min	Gentle; appropriate for low-watch products

The priority-tiered approach reconciles freshness with politeness: popular products (many watchers) get shorter intervals; niche products get longer. This keeps total crawl rate near 1,100 req/s without giving every product a 5-minute interval.

Upstream-failure handling

When a domain's circuit opens (5 consecutive failures), all crawl tasks for that domain are deferred. The price history shows a gap. Users are not falsely alerted. The circuit probes after 10 minutes; if the probe succeeds, the circuit closes and the scheduler resumes normal cadence for that domain's products — it does not immediately hammer the domain with the backlog of deferred tasks. Instead, the deferred tasks are re-queued with standard jitter to spread the recovery load.

What is lost during an outage? Freshness. A product that normally updates every 30 minutes may show prices from before the outage. If the actual price on the upstream site changed during the outage, we will not know until the circuit recovers and the next successful crawl. This is an inherent limitation of polling: you can only know what you can observe.

History storage and query

With change-only writes, a product that never changes price produces at most 1 row per period. A product with volatile pricing (flash sales, dynamic pricing) may produce dozens of rows per day. Over 3 years, a volatile product accumulates at most ~90,000 rows (1 per hour). At 200 bytes per row, that is 18 MB per volatile product — manageable.

Query: GET /v1/products/:id/history?range=30d translates to a range scan on (product_id, observed_at) with a composite index. The cursor-based pagination ensures that a large history (export, charting) does not block on a single large query. The query path is read-only and can be served from a read replica without touching the ingestor's write path.

Alert timeliness budget

Stage	Max delay	Notes
Price drop occurs on upstream site	0	Ground truth
Next scheduled crawl fires	0 – poll_interval	Worst case = full interval; average = half interval
Crawl completes, ingestor writes, threshold check	~5 s	Network + parse + DB write
Alert task enqueued and picked up by alert worker	~10 s	Depends on queue depth under load
First delivery attempt (webhook/push)	~5 s	HTTP call to user endpoint
Total (p50)	~15 min	Average half-interval wait dominates; within 10 min for popular products polled every 5–10 min

The dominant term is the polling interval, not the processing pipeline. Improving alert timeliness means polling more frequently for high-value products — not optimising the ingestor.

✅ Prefer cursor pagination over offset for history exports

A GET /history?offset=5000&limit=500 query requires the database to count and skip 5,000 rows before returning 500. On a large time-series table this gets slower as offset grows. A cursor-based approach — GET /history?cursor=<opaque_token> where the token encodes the last observed_at timestamp — translates to WHERE observed_at < $cursor_time ORDER BY observed_at DESC LIMIT 500, which uses the index efficiently regardless of how deep in the history you are.

Exercise — Price drop during a 2-hour upstream outage

Prompt: A flash sale drops the price of a product from $109 to $74 for exactly 2 hours, then the price returns to $109. The upstream domain was experiencing an outage for those exact 2 hours (circuit open). The user had a watch threshold of $80. What does your system do, and how do you handle the user's expectation of an alert?

Analysis:

The price dropped to $74 (below the $80 threshold) during the outage window. No crawl ran during that window — the circuit was open and tasks were deferred.
When the circuit closes and crawling resumes, the upstream price is back at $109. The crawl observes $109 — which is a change from the last stored price of $109 (no change), so no write occurs.
The system never sees the $74 price. No alert is fired. The user misses the sale.
This is correct and honest behaviour — the system cannot alert on information it never received. Attempting to infer a price drop that was never observed would produce false positives.

What you can do: surface the gap transparently. When a product's history shows a gap (no data for 2 hours in a normally-30-minute-poll product), the API could include an "outage_gaps" field in the history response indicating intervals where data was unavailable. Users can see that data was missing and understand why the alert did not fire.

Rubric: must correctly state that no alert fires — the dropped price was never observed. Must not suggest "we can infer the price dropped." Must acknowledge the gap. Bonus: describe the outage gap metadata in the history response.

Under the hood: the core mechanism

The visible API surface — watch creation, history queries, alert webhooks — is thin. All the interesting engineering is the ingestion pipeline that runs continuously behind it. This section traces exactly what happens inside one full poll cycle, from scheduler tick to alert delivery, and shows the data structures that make it fast.

Key data structures

Structure	Where	Fields & purpose
`products` table	PostgreSQL	`id, url, domain, poll_interval_s, last_crawled_at, watcher_count` — canonical product registry; `watcher_count` drives poll interval assignment
`price_points` table	PostgreSQL (partitioned by month)	`product_id, observed_at, price_cents, currency` — append-only time-series; partitioning keeps range scans fast over 3-year retention
`watches` table	PostgreSQL	`id, user_id, product_id, threshold_price_cents, notify_via, webhook_url, status`; indexed on `(product_id, threshold_price_cents)` for ingest-time threshold queries
`last_price:{product_id}`	Redis (string)	Most recent observed price in cents as a string; TTL = poll_interval × 3. Hot-path deduplication read — avoids a DB query on the common no-change case
`domain_circuit:{domain}`	Redis (hash)	`state (CLOSED/OPEN/HALF_OPEN), failure_count, opened_at` — circuit breaker state per upstream domain
Crawl task queue	Redis list / SQS	Each task: `{product_id, domain, url, scheduled_for}`; scheduled_for includes jitter
Alert task queue	Redis list / SQS	Each task: `{watch_id, product_id, observed_price_cents, observed_at}`; drained by alert workers

Worked trace: one full poll cycle that crosses a threshold and fires an alert

Setup: product prod_ff01 (headphones), last stored price $109.00 (10900 cents in Redis). One user (user_42) has a watch with threshold $80.00 (8000 cents). The scheduler fires a crawl task for this product.

# ① Scheduler fires at T=10:08:00 UTC # poll_interval_s = 600 (10 min — product has ~50 watchers) # jitter applied: +/-50% → actual scheduled_for = 10:08:23 (offset +23s) RPUSH crawl_queue '{"product_id":"prod_ff01","domain":"example-shop.com", "url":"https://www.example-shop.com/products/headphones-pro", "scheduled_for":"2024-06-01T10:08:23Z"}' # ② Crawler worker dequeues task at T=10:08:23 # Check circuit breaker for domain HGETALL domain_circuit:example-shop.com state CLOSED failure_count 0 # Circuit CLOSED → proceed # ③ Rate-limiter check: max 2 req/s to example-shop.com across all workers # Sliding window token available → proceed # ④ HTTP fetch with polite headers (T=10:08:23.088) GET https://www.example-shop.com/products/headphones-pro User-Agent: PriceBot/1.0 (+https://our-service.example.com/bot) Accept: text/html HTTP/1.1 200 OK [elapsed: 812ms] # HTML parser extracts price: $74.95 from ld+json schema.org/Product # Parsed price_cents = 7495 # ⑤ Deduplication check (T=10:08:24.012) GET last_price:prod_ff01 "10900" # 7495 != 10900 → price changed → proceed to write # ⑥ Write new price point INSERT INTO price_points(product_id, observed_at, price_cents, currency) VALUES ('prod_ff01', '2024-06-01T10:08:24Z', 7495, 'USD'); INSERT 1 SET last_price:prod_ff01 "7495" EX 1800 OK # ⑦ Threshold query: which watches trigger on this product at this price? SELECT id, user_id, notify_via, webhook_url FROM watches WHERE product_id = 'prod_ff01' AND threshold_price_cents >= 7495 AND status = 'active'; id | user_id | notify_via | webhook_url -----------+----------+------------+------------------------------------------ watch_c9d2 | user_42 | webhook | https://my-app.example.com/price-alerts # 1 match → enqueue 1 alert task # ⑧ Enqueue alert task RPUSH alert_queue '{"watch_id":"watch_c9d2","product_id":"prod_ff01", "observed_price_cents":7495,"threshold_cents":8000,"observed_at":"2024-06-01T10:08:24Z"}' (integer) 1 # ⑨ Alert worker dequeues task (T≈10:08:28) # Builds webhook payload, signs body with HMAC-SHA256, POSTs POST https://my-app.example.com/price-alerts HTTP/1.1 Content-Type: application/json X-Signature-SHA256: sha256=b3e1f2a4c... HTTP/1.1 200 OK [elapsed: 145ms] # Delivery confirmed. Mark alert as delivered. Total elapsed from crawl: ~5s.

What happens when the upstream is flaky: circuit breaker trace

# Five consecutive fetch failures for example-shop.com [10:15:00] GET example-shop.com → ConnectionTimeout (1/5) [10:15:30] GET example-shop.com → ConnectionTimeout (2/5) [10:16:00] GET example-shop.com → 503 Service Unavailable (3/5) [10:16:30] GET example-shop.com → ConnectionTimeout (4/5) [10:17:00] GET example-shop.com → ConnectionTimeout (5/5) → OPEN circuit HSET domain_circuit:example-shop.com state OPEN failure_count 5 opened_at 2024-06-01T10:17:00Z # All subsequent crawl tasks for example-shop.com products are deferred (+10 min) [10:17:05] prod_ff01 crawl deferred: circuit OPEN for example-shop.com [10:17:05] prod_ff02 crawl deferred: circuit OPEN for example-shop.com # After 10-minute cooldown: HALF-OPEN probe [10:27:00] PROBE GET https://www.example-shop.com/robots.txt HTTP/1.1 200 OK → probe success → CLOSE circuit HSET domain_circuit:example-shop.com state CLOSED failure_count 0 # Deferred tasks re-enter queue with standard jitter (not all at once)

The deduplication pipeline in detail

The Redis fast-path and the SQL change-only write complement each other. Here is the decision tree the ingestor runs for every crawl result:

Ingestor decision tree. Roughly 80% of crawl results are discarded at the Redis check — no DB write, no alert. Only genuine price changes reach the database and the threshold query.

Operating & debugging it

Inspecting the ingestion pipeline

# Check circuit breaker state for all domains redis-cli --scan --pattern 'domain_circuit:*' | xargs -I{} redis-cli HGETALL {} domain_circuit:amazon.com state CLOSED failure_count 0 domain_circuit:example-shop.com state OPEN failure_count 5 opened_at 2024-06-01T10:17:00Z # example-shop.com is OPEN — all tasks for it are deferred # Check queue depth — are tasks piling up? redis-cli LLEN crawl_queue ; redis-cli LLEN alert_queue crawl_queue: 142 alert_queue: 3 # Alert queue near-empty: good. Crawl queue at 142: normal for a domain outage. # Inspect last observed price and when it was last written redis-cli GET last_price:prod_ff01 "7495" psql -c "SELECT observed_at, price_cents FROM price_points WHERE product_id='prod_ff01' ORDER BY observed_at DESC LIMIT 3;" observed_at | price_cents ------------------------+------------- 2024-06-01 10:08:24+00 | 7495 2024-06-01 09:05:01+00 | 10900 2024-05-31 17:12:44+00 | 11900 # Check which alerts fired for a product today psql -c "SELECT w.id, w.user_id, w.threshold_price_cents, al.delivered_at, al.status FROM alert_log al JOIN watches w ON w.id=al.watch_id WHERE al.product_id='prod_ff01' AND al.created_at > now() - interval '1 day' ORDER BY al.created_at;" watch_c9d2 | user_42 | 8000 | 2024-06-01 10:08:29+00 | delivered

Symptom → cause → fix table

Symptom	Cause	Fix
Alert fires but price shown in webhook is stale (previous price, not the one that crossed threshold)	Alert task was enqueued using the Redis last_price value read before the DB write; race between Redis update and task enqueue	Enqueue the alert task with the `price_cents` value from the current crawl result, not from a subsequent Redis read
Duplicate alerts fire for the same price drop (user gets 3 emails)	Alert worker is not idempotent — retries on a slow delivery create duplicates; or threshold check runs twice (worker crash mid-task)	Write an alert_log row with a unique key (watch_id + observed_at) and use INSERT … ON CONFLICT DO NOTHING before delivery; make delivery idempotent with a delivery_id in the webhook payload
Price history shows a gap of several hours with no data points	Domain circuit was OPEN (upstream outage), or crawler worker pool was exhausted, or the product was paused	Query domain_circuit Redis keys for the domain; check crawler worker health metrics; surface the gap as an `outage_gaps` field in the history API response
Same price point written to DB twice with slightly different timestamps	Redis cache miss (TTL expired) caused both a Redis read-miss and a fallback DB subquery, but a concurrent ingestor wrote between the two checks	Use Redis SET … NX (set-if-not-exists) as an atomic guard; the first writer wins and subsequent workers see the key and discard their result
Alert fires for "price = 0" or "$0.00"	Upstream HTML changed structure; parser returns null or 0; validation not catching it	Add explicit validation: price_cents must be > 0 (e.g., > 100 cents, i.e., $1.00 minimum); log parse failures as anomalies; never treat parse failure as a valid price point
Crawl worker threads exhausted; queue growing faster than consumed	One slow domain (high latency / connection timeouts) is tying up all worker threads waiting on I/O	Set a per-request HTTP timeout (e.g. 5 s); use async I/O or per-domain worker pools so a slow domain cannot starve workers for other domains
Circuit breaker never closes after an upstream outage ends	HALF-OPEN probe is using the same worker pool that is blocked by other tasks; probe never fires	Run circuit-breaker probes in a dedicated lightweight goroutine/thread, independent of the main crawl worker pool; use a scheduled job (every 5 min) to check OPEN circuits and issue probes

Production monitoring checklist

Circuit breaker state dashboard: alert if any domain circuit has been OPEN for more than 30 minutes — indicates a prolonged upstream outage or a mis-configured breaker.
Crawl queue depth: should stay below 5× worker throughput per minute. A growing queue means workers are falling behind; check for timeout spikes or a large domain outage deferring tasks.
Deduplication ratio: track the ratio of writes / crawl completions. Should be ~0.2 (20% are actual changes). A sudden spike toward 1.0 may mean your Redis TTL expired and every crawl is hitting the DB, or upstream is returning volatile/randomised prices.
Alert delivery success rate: track delivered / attempted per hour. Sustained failures above 5% warrant investigation — check webhook endpoint health, HMAC signing key rotation, or whether users are returning non-2xx intentionally.
Price anomaly log: count parse failures per domain per hour. A spike means the upstream changed its HTML structure and the price extractor needs updating.

🧠 Check your understanding

The last stored price for a product is $109.00. The latest crawl also returns $109.00. What should the ingestor do?

The design uses change-only writes to avoid duplicate rows and keep time-series storage compact. A price observation identical to the last stored value is silently discarded. A chart drawn from this data correctly shows "price was $109 from time T" without repeating the same value hundreds of times.

The crawler has hit an upstream domain 5 consecutive times and all 5 requests failed. What should happen next?

A circuit breaker protects the pipeline from a known-failing upstream. After the threshold of failures, the circuit opens and the system stops trying — freeing crawler workers and rate-limit budget for other domains. The tasks are deferred (not deleted) so they resume when the circuit closes after the probe succeeds.

Alert threshold checks happen at ingest time rather than query time. What is the primary reason?

At ingest time, you know exactly which product just changed price. You query watches WHERE product_id = $id AND threshold_price >= $new_price — one indexed lookup touching only relevant rows. At query time you'd have to check every watch against the history on every read, which is O(all watches) per request. Push the work to where the information first arrives.

A user's webhook endpoint returns 503 Service Unavailable on the first alert delivery attempt. What should the alert worker do?

A 503 indicates a temporary server-side failure. Immediate tight-loop retrying amplifies load on an already-struggling endpoint. Exponential backoff gives the recipient time to recover while still making reasonable delivery attempts. Marking as permanently failed on the first attempt would lose valid alerts whenever a user's endpoint briefly restarts.

Key takeaways

Poll cadence drives freshness. The dominant source of alert latency is the polling interval, not the pipeline. Prioritise high-watch-count products with shorter intervals.
Be polite to upstreams. Per-domain rate limiting, robots.txt respect, and jittered scheduling keep the crawler sustainable long-term. Aggressive crawling leads to IP bans.
Change-only writes cut storage and write load by ~80% at typical crawl volumes. Cache last price in Redis for fast deduplication without a DB subquery.
Open the circuit on repeated upstream failures. Defer tasks, do not discard them. Resume with standard jitter after the probe succeeds — never flood-recover.
Check alert thresholds at ingest time, not query time. An indexed lookup on (product_id, threshold_price) touches only the relevant rows.
Never fire an alert on missing data. A null/unparseable price is not a price drop. Validate the parsed value before the threshold check.
Use cursor pagination for history queries; offset pagination degrades as history depth grows.

Sources & further reading

RFC 9309 — Robots Exclusion Protocol — the standard the crawler must implement before making any requests
Stripe — Webhook best practices — canonical reference for idempotent, signed webhook delivery with retry
Martin Fowler — Circuit Breaker — the original description of the circuit-breaker pattern used for upstream failure isolation
PostgreSQL — Partial Indexes — useful for indexing only active watches when querying alert thresholds at ingest time
TimescaleDB — Time-series data — purpose-built extension for time-series storage and range queries, directly applicable to price_points table design
AWS Builders' Library — Avoiding insurmountable queue backlogs — covers the deferred-task recovery pattern used when a circuit breaker closes