Reliability & Scale · Lesson 10
Event-driven & Pub/Sub
Not every service call needs an immediate answer. Asynchronous messaging — queues and publish/subscribe topics — lets producers and consumers evolve, scale, and fail independently, trading synchronous convenience for resilience and decoupling at scale.
By the end you'll be able to
- Contrast synchronous request/response with asynchronous messaging and name the benefits of each.
- Distinguish a point-to-point queue from a pub/sub topic and explain when to use each.
- Describe at-most-once, at-least-once, and exactly-once delivery semantics, and explain why idempotent consumers are essential.
The problem with always waiting for an answer
When a user places an order on an e-commerce site, at least five things need to happen: deduct inventory, charge the payment, send a confirmation email, update a search index, and log for fraud analysis. If the HTTP handler does all five inline and waits for each to finish before responding, two bad things follow: the user stares at a spinner for several seconds, and one slow or unavailable downstream service (say, the email provider is rate-limiting you) blocks the entire transaction.
The alternative is to do the one thing only the user is waiting for — create the order record — respond immediately with 202 Accepted, and publish an event that other services can consume in their own time. The services are now decoupled: they don't need to be online at the same instant, and a spike of orders buffers in the queue rather than overwhelming the email service.
Queue vs pub/sub topic
Two patterns for asynchronous delivery differ in how many consumers receive each message:
A queue (point-to-point) delivers each message to exactly one consumer. Workers compete to pull jobs — when Consumer A picks up a message, B and C never see it. This is ideal for work distribution: resize an image, send one email, process one payment. The message is gone once processed.
A pub/sub topic fans a message out to all subscribers simultaneously. When an "order.placed" event is published, the inventory service, the email service, the fraud service, and the analytics pipeline all receive their own copy independently. Adding a new subscriber requires zero changes to the producer — pure decoupling. This is the model Kafka topics, AWS SNS, and Google Pub/Sub use.
Benefits of async messaging
- Decoupling. Producer and consumer don't need to know about each other — or even be running at the same time.
- Buffering. A burst of 50,000 orders can land in the queue instantly; the email service works through them at its own pace instead of being overwhelmed.
- Resilience. If the email service crashes, the messages wait in the queue. When it restarts, it picks up where it left off — no orders are lost.
- Independent scalability. You can add ten email workers without touching the order service.
Delivery guarantees
How hard does the messaging system try to deliver each message? Three levels exist, and they involve painful trade-offs:
| Guarantee | What it means | Risk |
|---|---|---|
| At-most-once | Fire and forget — the broker sends once and moves on | Messages can be silently lost on failure |
| At-least-once | Broker retries until the consumer acknowledges | Consumer may receive the same message more than once |
| Exactly-once | Delivered precisely one time, no duplicates | Extremely hard; requires distributed transactions; only some systems (Kafka with idempotent producers + transactions) offer it, and at real throughput cost |
In practice, at-least-once is the sweet spot for most systems: it's reliable without the complexity cost of exactly-once. The catch is that your consumer must handle duplicate messages safely. That's what idempotent consumers (covered in Lesson 02) solve: design the handler so that processing the same message twice produces the same result as processing it once.
# Pattern: idempotent SQS consumer in pseudo-Python
def handle_order_placed(message):
order_id = message["order_id"]
# Guard: skip if already processed
if db.exists("processed_events", order_id):
logger.info("duplicate, skipping", order_id=order_id)
return # idempotent — safe to ack
with db.transaction():
update_inventory(order_id)
db.insert("processed_events", order_id, ttl=7_days)
message.ack() # tell broker: delivered successfully
Ordering and the difficulty of global order
A common assumption: "messages will arrive in the order I sent them." With most distributed queues, that's not guaranteed across partitions. Kafka guarantees order within a partition for the same key — so if you partition by user ID, all events for one user arrive in order to a single consumer. AWS SQS standard queues offer best-effort ordering; SQS FIFO queues guarantee order but at lower throughput.
Global ordering (all messages across all producers, globally in sequence) is nearly impossible at scale without sacrificing throughput — avoid designing systems that require it.
Backpressure
If a producer sends messages faster than consumers can process them, the queue grows without bound. Backpressure is a mechanism to signal this: the consumer or broker tells the producer "slow down." In Kafka this happens naturally — the producer can be configured to block or throw when the broker's internal buffer fills. In HTTP-based systems, a 429 Too Many Requests response is a form of backpressure. Without it, an unbounded queue is a time-delayed crash.
Real systems: Kafka, RabbitMQ, SNS/SQS
- Apache Kafka: Append-only distributed log, partitioned topics, very high throughput, consumer groups, replay-able history. Industry standard for event streaming and event sourcing.
- RabbitMQ: Traditional message broker with flexible routing (direct, fanout, topic, headers exchanges), lower throughput than Kafka, but simpler to operate for work queues.
- AWS SNS + SQS: SNS is a managed pub/sub topic (fan-out); SQS is a managed queue. A common pattern: SNS fans an event to multiple SQS queues, one per subscriber service — each service drains its own queue at its own pace.
# Before: synchronous, blocking
POST /posts
→ save post to DB
→ call notification service # 800 ms
→ call timeline service # 1200 ms
→ call analytics service # 400 ms
← 200 after ~2.5 s
# After: async fan-out
POST /posts
→ save post to DB
→ publish "post.created" event to SNS topic # <10 ms
← 202 Accepted { "post_id": "p_xyz" }
# SNS fans to three SQS queues:
# notifications-queue → push-notification worker
# timelines-queue → timeline-fanout worker
# analytics-queue → analytics worker
# Each drains independently; failures retry independently
Two common prompts: "when would you choose async over sync?" and "design a notification fan-out system." For the first: async wins when the caller doesn't need an immediate result, when the work is slow or flaky, or when multiple services need the same event. For the fan-out design: producer publishes to an SNS topic → SNS pushes to per-subscriber SQS queues → each subscriber drains its queue with idempotent consumers. Mention delivery guarantees (at-least-once), idempotency, and dead-letter queues for messages that fail repeatedly.
Two in one: assuming exactly-once delivery and relying on global message ordering. Neither is available cheaply at scale. Exactly-once requires coordination overhead that most systems cannot afford; build idempotent consumers instead. Global ordering requires a single sequencer — a scalability bottleneck. Partition by the right key (user ID, entity ID) to get local ordering within a consumer group.
Do include a unique event_id in every message payload and record processed IDs to achieve idempotent consumption under at-least-once delivery. Don't rely on the messaging system to guarantee exactly-once for you — most do not, and those that claim to do so require careful producer + consumer configuration that is easy to misconfigure.
Under the hood: how a broker actually delivers
The phrase "message is delivered to a consumer" glosses over a surprisingly concrete mechanism. Here is how Kafka — the most widely used event-streaming broker — actually works, followed by a worked trace. The same ideas apply (with different vocabulary) to RabbitMQ and SQS.
The append-only partition log
A Kafka topic is divided into one or more partitions. A partition is nothing more than an append-only, immutable file on disk — like a growing list. Producers append records to the end of a partition. Each record gets a monotonically increasing integer index called an offset. Offset 0 is the first record ever written; offset 1 is the second; and so on. Records are never overwritten or reordered — only appended.
This design is what gives Kafka its speed: sequential disk writes are far faster than random-access writes, and the log can be served to consumers via sendfile(2) (zero-copy) from the OS page cache. The broker is not a smart router — it is essentially a durable, indexed file server for a log.
Consumer groups and offset tracking
A consumer group is a named set of consumer instances that together consume a topic. Kafka assigns each partition to exactly one consumer in the group at a time — so work is distributed but never duplicated within a group. Multiple groups can independently consume the same topic, each with its own position in the log (this is the pub/sub fan-out: the inventory group and the email group each have their own offsets).
Each consumer group tracks its position with a committed offset: the offset of the last message the group has successfully processed. Kafka stores this in an internal topic (__consumer_offsets). On each poll cycle the consumer:
- Calls
poll()— fetches a batch of records at offsets above the last committed offset. - Processes each record in the batch (runs your handler code).
- Commits the offset of the last successfully processed record back to Kafka.
Step 3 is the ack. Only after commit does Kafka consider those records "consumed" by that group. If the consumer crashes between steps 2 and 3, Kafka will redeliver those records on the next poll — this is the mechanism behind at-least-once delivery.
Worked trace: offset, processing, commit, and redelivery
# Topic: orders | Partition 0 | Consumer group: email-service
# State of the partition log
Offset │ Record
───────┼───────────────────────────────────────
0 │ {"order_id": "ord_10", "event": "order.placed"}
1 │ {"order_id": "ord_11", "event": "order.placed"}
2 │ {"order_id": "ord_12", "event": "order.placed"}
3 │ {"order_id": "ord_13", "event": "order.placed"}
...
# Committed offset for email-service group = 1
# (records at offsets 0 and 1 have been acked)
# Consumer polls — fetches offsets 2 and 3
poll() → [record@2, record@3]
# Consumer processes record@2 (sends email for ord_12) ✓
# Consumer processes record@3 (sends email for ord_13) ✓
# Consumer crashes HERE — before committing
# On restart, committed offset is still 1.
# Consumer polls again — gets offsets 2 and 3 AGAIN (redelivery)
poll() → [record@2, record@3] # redelivered!
# Without idempotency guard: ord_12 and ord_13 get duplicate emails.
# With idempotency guard (seen-events table):
if db.exists("processed_events", "ord_12"):
skip() # safe — no duplicate email
else:
send_email("ord_12")
db.insert("processed_events", "ord_12")
# After successful processing, commit offset 3:
# Committed offset for email-service = 3
This trace explains why exactly-once is hard: to guarantee the email is sent exactly once, you would need to atomically send the email and commit the offset in a single transaction — but those are two different systems (SMTP and Kafka). Kafka's transactional API allows atomic writes to Kafka itself (producer-to-Kafka exactly-once), but it cannot make the downstream side-effect (sending an email, charging a card) atomic. This is why the industry default is at-least-once + idempotent consumers, not exactly-once.
Kafka routes records with the same partition key to the same partition, which is always assigned to the same consumer within a group. This is both a feature (order is preserved per key) and a trap. If you use a single hot key — say, the ID of a single very-active entity — all its records pile onto one partition and one consumer, even if you have twenty consumer instances. The other nineteen sit idle. Choose partition keys that distribute load evenly (user ID works; status="pending" does not).
Consumer lag: the operational heartbeat
The gap between the latest offset produced and the committed offset of a consumer group is called consumer lag. Lag = 0 means the consumer is caught up; lag growing means the consumer is falling behind the producer. Monitoring lag is the single most important operational signal for an async system — a growing lag is the earliest warning of a slow consumer, a resource bottleneck, or a silent processing failure.
How to debug & inspect it
Most async messaging bugs fall into one of four categories: the consumer is not running (lag grows), the consumer is crashing on a bad message (lag stalls at one offset), the consumer is processing but not committing (redelivery loop), or messages are being duplicated in the output (missing idempotency guard). Here is how to surface each one.
Inspect consumer lag (Kafka)
Inspect queue depth (SQS)
Check DLQ depth (the alarm you always want)
Trace a redelivery
Include a delivery_attempt counter or use the broker's native attribute (SQS: ApproximateReceiveCount; Kafka: there is no built-in counter — add it to the record headers at publish time) to distinguish first delivery from retries. Log it on every processing attempt. A message with ApproximateReceiveCount > 1 that your handler processed without a dedup guard is a double-processing event.
| Symptom | Likely cause | Fix |
|---|---|---|
| Consumer lag grows monotonically and never decreases | Consumer is down, underscaled, or blocked on a slow downstream call | Check consumer pod/process health; add more consumer instances; add a timeout to the downstream call |
| Lag stalls at one specific offset for minutes | Poison-pill message: the consumer crashes or throws on every attempt | Inspect the message at that offset; fix the consumer or route the message to a DLQ via a max-retry policy |
| Messages appear in DLQ unexpectedly | Consumer throws an unhandled exception or exceeds the visibility/processing timeout | Log the exception; extend timeout if slow; fix the handler bug |
| Duplicate side-effects (double emails, double charges) | At-least-once redelivery without an idempotency guard in the consumer | Add a seen-events dedup table keyed on the message/event ID; check before processing |
| Out-of-order processing for the same entity | Multiple partitions / consumers processing different events for the same entity key | Ensure the partition key is the entity ID so all events for one entity go to one partition/consumer |
| Consumer group shows 0 lag but work isn't happening | Auto-commit is advancing the offset even when processing fails | Switch to manual offset commit (commit only after successful processing) |
Debug checklist:
- Check consumer lag / queue depth — is there a backlog? Is it growing or stable?
- Check DLQ depth — any non-zero count demands investigation.
- Look at consumer logs for exceptions or timeouts on processing.
- Inspect a sample message in the DLQ to see the actual payload that failed.
- Verify offset commit strategy — is the consumer committing manually after success, or relying on auto-commit?
- Check partition key distribution — are all records landing on one hot partition?
- If duplicate output is observed, verify the idempotency guard is in place and the dedup store is being checked before (not after) processing.
🧠 Quick check
1. You publish an "invoice.generated" event and need the accounting service, the tax service, and the email service each to process it independently. Which pattern fits?
Pub/sub topics fan a single message out to all subscribers simultaneously. Each service receives an independent copy without any service needing to know about the others — that's the decoupling value of the pattern.
2. Your queue uses at-least-once delivery. The email consumer crashes after sending an email but before acknowledging the message. What happens?
With at-least-once delivery, an unacknowledged message is redelivered. The consumer must be idempotent — in this case, it should check whether the email was already sent before sending again.
3. Why is global ordering across all partitions nearly impossible at scale?
Global ordering means all messages must pass through one sequencer to be numbered. That single point cannot be partitioned, so it limits throughput to a single machine — the opposite of horizontal scale.
4. A consumer cannot keep up with the message rate and the queue grows unboundedly. What is the correct mechanism to address this before the system crashes?
Backpressure propagates the signal "you're producing faster than I can consume" back to the producer. The real fix is to slow the producer or scale the consumers — not to drop messages or extend TTL.
✍️ Exercise: design a notification fan-out system for 10 million users
A social platform sends notifications whenever a celebrity (with up to 10 million followers) posts. Design the messaging architecture from "celebrity posts" to "followers receive a push notification." Address: throughput, delivery guarantee, idempotency, and what happens if the push-notification provider is temporarily down.
Model answer:
- Publish event. When the celebrity posts, the posts service publishes a single
post.createdevent to a Kafka topic (or SNS), keyed by celebrity user ID. The HTTP handler returns 202 immediately. - Fan-out worker. A fan-out service reads the event and looks up the 10M follower list in batches (e.g. 1,000 at a time). For each batch it publishes individual
notify_user:{user_id}messages to an SQS queue (or a Kafka topic partitioned by user ID). This spreading is the slow step and happens asynchronously. - Notification workers. A pool of workers drains the notification queue. Each worker calls the push provider (APNs/FCM) and records the
notification_id+user_id+post_idin a deduplication store (Redis with 24 h TTL) to make processing idempotent. - Provider outage. SQS retries with exponential back-off. After N retries, messages land in a Dead Letter Queue (DLQ). An alarm on DLQ depth pages on-call; once the provider recovers, the DLQ is replayed.
- Delivery guarantee. At-least-once throughout (SQS default). Idempotency guards prevent duplicate pushes. Exactly-once is not required — receiving a duplicate push once in a while is acceptable.
Rubric: ✓ async publish (202 Accepted) ✓ fan-out decoupled from the post handler ✓ idempotent consumer with dedup store ✓ DLQ for push provider outage ✓ at-least-once + idempotency instead of exactly-once ✓ partitioning for ordering within a user's events. Five or more = strong answer.
Key takeaways
- Async messaging decouples producers and consumers in time and space — neither needs to be online simultaneously.
- Queues (point-to-point) deliver each message to one consumer; topics (pub/sub) fan messages to all subscribers independently.
- Benefits: decoupling, buffering spiky load, resilience to downstream failures, independent scalability.
- At-least-once delivery is the practical default; exactly-once is expensive and rarely necessary.
- At-least-once delivery mandates idempotent consumers — deduplicate on a business-level ID, not message metadata alone.
- Global ordering across partitions is a throughput bottleneck; prefer per-key local ordering instead.
- Backpressure prevents unbounded queue growth by signalling producers to slow down before the system crashes.