Resource Design Patterns · Lesson 05

Long-running Operations (LRO)

Exporting a project, generating a report, or bulk-reindexing 10 000 records cannot realistically finish within one HTTP request. The LRO pattern gives the client an immediate receipt — an Operation resource — and a clear contract for checking progress without pinning a connection open for minutes.

⏱ 16 min Difficulty: core Prereq: rdp-04 Field Masks, HTTP basics

By the end you'll be able to

Describe the LRO lifecycle: create → poll → done (or error/cancel), and what each field in the Operation resource carries.
Design the request, polling, and completion contracts for a Tasks API export operation, including per-phase metadata.
Explain when LRO is appropriate versus a plain 202 Accepted or a synchronous response, and what trade-offs each choice involves.

Why a plain 202 is not enough

HTTP's 202 Accepted status code says "I received the request and will do something with it eventually." That is useful as far as it goes — but it leaves critical questions unanswered: How do I check whether the work is done? Where does the result live when it is? How do I cancel if I change my mind? Without a standard pattern, every team invents its own polling URL, result envelope, and error shape, and every client has to learn each one separately.

Think of the Operations pattern like a baggage claim ticket. The airline takes your bags immediately (the server accepts the request), hands you a numbered ticket (the Operation resource), and sends you to a specific carousel (a stable polling URL). You can check the carousel as often as you like, and the ticket tells you which bags are yours when they arrive. Losing the ticket is recoverable — you can list your operations and find the reference again.

The Operation resource

An LRO call returns an Operation resource immediately. The shape follows Google AIP-151:

{
  "name":     "operations/export-abc123",   // stable, globally unique id
  "done":     false,                        // false until complete or failed
  "metadata": {                               // progress snapshot; shape is defined by the service
    "@type":          "type.googleapis.com/tasks.ExportTasksMetadata",
    "phase":          "serializing",
    "tasksProcessed": 412,
    "tasksTotal":     1843,
    "createTime":     "2025-06-20T08:00:00Z",
    "updateTime":     "2025-06-20T08:00:07Z"
  }
}

When the operation completes, the server sets done: true and adds exactly one of two sibling fields: response (success, typed to the specific result type) or error (failure, following the standard error model). There is no intermediate state where both exist.
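The done/response/error rule is easy to check mechanically. The following is a minimal sketch of such a check, assuming the Operation has already been parsed into a dict; the function name `validate_operation` is illustrative and not part of any library.

```python
from typing import Any, Mapping


def validate_operation(op: Mapping[str, Any]) -> None:
    """Raise ValueError if `op` breaks the done/response/error invariant."""
    if not op.get("name"):
        raise ValueError("operation has no name")
    has_response = "response" in op
    has_error = "error" in op
    if op.get("done", False):
        # Finished: exactly one of the two result fields must be set.
        if has_response == has_error:
            raise ValueError("done operation needs exactly one of response/error")
    elif has_response or has_error:
        # Still running: neither result field may appear yet.
        raise ValueError("running operation must not carry response or error")


running = {"name": "operations/export-abc123", "done": False, "metadata": {}}
finished = {"name": "operations/export-abc123", "done": True, "response": {}}
validate_operation(running)   # passes silently
validate_operation(finished)  # passes silently
```

A client test suite or a server-side serializer could run a check like this on every Operation it emits or receives.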

The async lifecycle

The client fires a single POST and immediately receives an Operation reference. It polls on its own schedule. The background worker updates metadata on each phase transition. When done:true, the response field holds the typed result.

Worked example: exporting all tasks from a project

Step 1 — Initiate the export

# Request
POST /v1/projects/acme-prod:exportTasks
Authorization: Bearer <token>
Content-Type: application/json

{
  "format":  "CSV",
  "filter":  "status = OPEN"
}

# Response — 200 OK (not 202, per AIP-151)
{
  "name":     "operations/export-7f3e9c12",
  "done":     false,
  "metadata": {
    "@type":          "type.googleapis.com/tasks.ExportTasksMetadata",
    "phase":          "queued",
    "tasksProcessed": 0,
    "tasksTotal":     1843,
    "createTime":     "2025-06-20T09:00:00Z",
    "updateTime":     "2025-06-20T09:00:00Z"
  }
}

Step 2 — Poll for progress

# Poll (after a few seconds)
GET /v1/operations/export-7f3e9c12
Authorization: Bearer <token>

# Response — still running
{
  "name":     "operations/export-7f3e9c12",
  "done":     false,
  "metadata": {
    "@type":          "type.googleapis.com/tasks.ExportTasksMetadata",
    "phase":          "serializing",
    "tasksProcessed": 921,
    "tasksTotal":     1843,
    "createTime":     "2025-06-20T09:00:00Z",
    "updateTime":     "2025-06-20T09:00:08Z"
  }
}

Step 3 — Operation completes

# Next poll — done!
GET /v1/operations/export-7f3e9c12

{
  "name":     "operations/export-7f3e9c12",
  "done":     true,
  "metadata": {
    "@type":          "type.googleapis.com/tasks.ExportTasksMetadata",
    "phase":          "complete",
    "tasksProcessed": 1843,
    "tasksTotal":     1843,
    "createTime":     "2025-06-20T09:00:00Z",
    "updateTime":     "2025-06-20T09:00:18Z"
  },
  "response": {
    "@type":     "type.googleapis.com/tasks.ExportTasksResponse",
    "exportUrl": "https://storage.example.com/exports/export-7f3e9c12.csv",
    "rowCount":  1843,
    "expiresAt": "2025-06-27T09:00:18Z"
  }
}

Cancel path

# Client decides to abort mid-export
POST /v1/operations/export-7f3e9c12:cancel
Authorization: Bearer <token>

# Response — 200 OK (empty body or confirmation)
{}

# Subsequent GET returns done=true with an error
{
  "name": "operations/export-7f3e9c12",
  "done": true,
  "error": {
    "code":    1,          // CANCELLED in gRPC status
    "message": "Operation was cancelled by the caller."
  }
}
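On the worker side, cancellation is usually cooperative: the worker checks a flag at checkpoints and resolves the operation itself. A rough sketch of that idea follows, assuming the :cancel handler sets a `threading.Event`; the function `run_export` and the `checkpoint_every` parameter are hypothetical names chosen for this illustration.

```python
import threading
from typing import Iterable


def run_export(tasks: Iterable[str], cancel_requested: threading.Event,
               checkpoint_every: int = 100) -> dict:
    """Process tasks, checking the cancel flag at periodic checkpoints."""
    processed = 0
    for _task in tasks:
        if processed % checkpoint_every == 0 and cancel_requested.is_set():
            # A real worker would clean up partial output here before resolving.
            return {
                "done": True,
                "error": {
                    "code": 1,  # CANCELLED in gRPC status
                    "message": "Operation was cancelled by the caller.",
                },
            }
        processed += 1  # stand-in for the real per-task work
    return {"done": True, "response": {"rowCount": processed}}


flag = threading.Event()
flag.set()  # what a :cancel handler would do (assumption)
result = run_export((f"tasks/t{i}" for i in range(1843)), flag)
```

Because the flag is only read at checkpoints, a client may still see done: false for a short while after calling :cancel.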

Under the hood: how it actually works

When the server receives the :exportTasks call, it runs a short synchronous setup phase — validate the caller's authorization, look up the project to estimate work, pick an operation ID — and writes a row into an Operations table in the database before returning. The actual export work is handed off to a background job system (a queue, a scheduled task, or a worker pool). This is why the HTTP response arrives in milliseconds even for a multi-minute job.

The API handler writes the Operations row synchronously, enqueues the job, and returns. The background worker processes the export in phases and writes progress back to the same row. Every poll request is a simple row read — the DB is the single source of truth.

The worker updates the metadata field (and therefore updateTime) at every phase boundary, e.g. when moving from queued to serializing to uploading to complete. Within a phase it does NOT write on every row processed; that would turn the database into a write hotspot. A practical cadence for in-phase progress is one write every N records or every T seconds, whichever comes first.
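The "every N records or every T seconds" cadence can be sketched as a small helper that tells the worker when a progress write is due. This is one possible shape, not a prescribed implementation; the class name `ProgressThrottle` and its parameters are assumptions made for the example.

```python
import time
from typing import Callable


class ProgressThrottle:
    """Decide when a worker should persist progress: every N records or T seconds."""

    def __init__(self, every_n: int, every_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.every_n = every_n
        self.every_seconds = every_seconds
        self.clock = clock  # injectable so tests can control time
        self.last_count = 0
        self.last_time = clock()

    def should_write(self, processed: int) -> bool:
        now = self.clock()
        due = (processed - self.last_count >= self.every_n
               or now - self.last_time >= self.every_seconds)
        if due:
            # Remember this write so the next one is measured from here.
            self.last_count = processed
            self.last_time = now
        return due


throttle = ProgressThrottle(every_n=500, every_seconds=5.0)
for n in range(1, 1844):
    if throttle.should_write(n):
        pass  # a real worker would UPDATE metadata and updateTime here
```

The worker would still write unconditionally at each phase boundary; the throttle only governs progress updates inside a phase.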

When the worker finishes, it does a final atomic write: set done = true and insert either the response or error payload in the same transaction. This ensures a polling client never sees a half-written state where done = true but response is absent.
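As an illustration of the atomic final write, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The table layout, the CHECK constraint, and the function `finish_operation` are assumptions for this example; the lesson does not prescribe a schema.

```python
import json
import sqlite3
from typing import Optional

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE operations (
           name     TEXT PRIMARY KEY,
           done     INTEGER NOT NULL DEFAULT 0,
           response TEXT,
           error    TEXT,
           CHECK (done = 0 OR (response IS NULL) <> (error IS NULL))
       )"""
)


def finish_operation(name: str, response: Optional[dict] = None,
                     error: Optional[dict] = None) -> None:
    """Set done and the result payload together in one transaction."""
    if (response is None) == (error is None):
        raise ValueError("pass exactly one of response or error")
    with conn:  # commits on success, rolls back if the block raises
        conn.execute(
            "UPDATE operations SET done = 1, response = ?, error = ? WHERE name = ?",
            (
                json.dumps(response) if response is not None else None,
                json.dumps(error) if error is not None else None,
                name,
            ),
        )


conn.execute("INSERT INTO operations (name) VALUES ('operations/export-7f3e9c12')")
conn.commit()
finish_operation("operations/export-7f3e9c12", response={"rowCount": 1843})
```

The CHECK constraint is a second line of defence: even if application code tried to set done without a result, the database would reject the row.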

Polling back-off and the `Retry-After` hint

Clients that poll on a tight loop hammer the database for no benefit. A well-designed LRO contract signals the recommended poll interval. Two approaches are common:

Return a Retry-After response header (seconds until the next suggested poll) on every GET that returns done: false.
Include a metadata.estimatedCompletionTime field so the client can sleep until near that timestamp, then fall back to exponential back-off.

Clients should implement exponential back-off regardless — start at 1 s, double each time up to a ceiling (e.g. 30 s) — so that a long-running job doesn't generate a wall of traffic after several minutes.
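A client-side polling loop along these lines might look as follows. This is a sketch under stated assumptions: `fetch` is a hypothetical callable that performs the GET and returns the parsed Operation together with the Retry-After value in seconds (or None when the header is absent), and `sleep` is injectable so the loop can be exercised without waiting.

```python
import time
from typing import Callable, Iterator, Optional, Tuple

Poll = Tuple[dict, Optional[float]]  # (operation, Retry-After seconds or None)


def backoff_delays(start: float = 1.0, ceiling: float = 30.0) -> Iterator[float]:
    """Yield start, 2*start, 4*start, ... capped at `ceiling`."""
    delay = start
    while True:
        yield min(delay, ceiling)
        delay = min(delay * 2, ceiling)


def wait_for_operation(fetch: Callable[[], Poll],
                       sleep: Callable[[float], None] = time.sleep,
                       max_polls: int = 50) -> dict:
    """Poll until done, preferring the server's hint over local back-off."""
    delays = backoff_delays()
    for _ in range(max_polls):
        op, retry_after = fetch()
        if op.get("done"):
            return op
        fallback = next(delays)  # advance back-off even when a hint is used
        sleep(retry_after if retry_after is not None else fallback)
    raise TimeoutError("operation did not finish within max_polls")


# Simulated server: two "still running" polls, then completion.
polls = iter([
    ({"done": False}, None),
    ({"done": False}, 3.0),
    ({"done": True, "response": {"rowCount": 1843}}, None),
])
slept: list = []
final = wait_for_operation(lambda: next(polls), sleep=slept.append)
print(slept)  # [1.0, 3.0]
```

The loop gives up after `max_polls` attempts so that a stuck operation surfaces as an error instead of an endless wait.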

LRO vs. plain 202 vs. synchronous response

Pattern	When to use	Client contract	Trade-off
Synchronous response	Work finishes in < ~5 seconds reliably	Block and read the response body	Simplest for the client; ties up a connection; times out on slow networks
Plain 202 Accepted	Fire-and-forget tasks where the result is never needed by the caller	Assume success; no polling path	Simple to implement; client has no way to detect failures or retrieve results
LRO (Operations resource)	Work takes seconds to minutes; caller needs the result or progress	Poll `GET /operations/id` with back-off	Requires persistent operation state; adds complexity; gives full visibility
Webhook / event on completion	Caller can't poll (serverless, browser tab might close)	Register a callback URL; handle incoming POST when done	Decouples completely; requires the caller to expose an endpoint; see event-driven pub/sub

How to debug & inspect it

# 1. Check operation status
curl -s -H "Authorization: Bearer $TOKEN" \
  https://api.example.com/v1/operations/export-7f3e9c12 | jq .
{
  "name": "operations/export-7f3e9c12",
  "done": false,
  "metadata": { "phase": "serializing", "tasksProcessed": 921, "tasksTotal": 1843 }
}

# If done=true, check for "response" or "error" field
curl -s ... | jq '{done, hasResponse: (.response != null), hasError: (.error != null)}'

# Poll with back-off in bash
for i in 1 2 4 8 16 30; do
  STATUS=$(curl -s ... | jq -r .done)
  [[ "$STATUS" == "true" ]] && break
  sleep $i
done

Symptom	Likely cause	Fix
Operation stuck in `queued` phase	Background worker is down or queue backlog is too deep	Check worker health, queue depth metrics, dead-letter queue
`done: true` but `response` absent and `error` absent	Worker wrote the done flag in a separate transaction from the result	Enforce atomic write of done+response in one DB transaction
Poll returns `404 Not Found`	Operation ID is wrong, was never persisted (handler crashed before the INSERT), or was garbage-collected	Return `name` only after successful DB insert; document TTL for operation records
`metadata.updateTime` not advancing	Worker is alive but not writing progress	Add heartbeat writes at regular intervals; alert if `updateTime` is stale for > 2× expected cycle time
Client sees `done: false` indefinitely after cancel	Cancel is best-effort — worker did not observe the cancellation signal	Poll a short time after cancel; force-complete with error if still running after grace period
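The stale-updateTime alert from the table can be expressed as a small predicate. The sketch below assumes RFC 3339 timestamps with a trailing Z, as in the examples in this lesson; the function name `is_stalled` and the `expected_cycle` parameter are illustrative.

```python
from datetime import datetime, timedelta, timezone


def is_stalled(op: dict, expected_cycle: timedelta, now: datetime) -> bool:
    """True if a running operation has not updated for more than 2x the cycle."""
    if op.get("done"):
        return False  # finished operations are never considered stalled
    # Replace the trailing Z so fromisoformat also works on older Python versions.
    update_time = datetime.fromisoformat(
        op["metadata"]["updateTime"].replace("Z", "+00:00")
    )
    return now - update_time > 2 * expected_cycle


op = {
    "name": "operations/export-7f3e9c12",
    "done": False,
    "metadata": {"phase": "serializing", "updateTime": "2025-06-20T09:00:08Z"},
}
now = datetime(2025, 6, 20, 9, 0, 40, tzinfo=timezone.utc)
stalled = is_stalled(op, expected_cycle=timedelta(seconds=10), now=now)
```

A monitoring job could run this over all operations with done: false and raise an alert for each one that returns True.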

🎯 Interview angle

"Design a bulk task export feature" is a classic async system design question. A senior answer immediately reaches for the Operation resource pattern: explain that the POST returns synchronously with an Operation, describe the polling contract with back-off, note that the worker writes atomic done+result, cover the cancel path, and mention webhooks as an alternative for clients that can't poll. Interviewers also probe for the atomicity trap — make sure you explain why done and response must land in one transaction.

⚠️ Common trap: returning 202 and leaving the client guessing

A plain 202 Accepted with no Location header and no body is a dead end. The client knows the request was received but has no way to check progress, retrieve the result, or detect a failure. If the background job crashes silently, the caller will wait forever. Always return an Operation resource with a stable name the client can poll.

✅ Idempotency on the initiating request

Make the LRO-creating request idempotent using a client-supplied requestId, sent as a header or as a request field (see idempotency lesson). If the client fires the same export twice due to a retry, the server returns the existing Operation rather than starting two concurrent exports of the same project. The client gets back the same name and can poll as usual: no duplicate work, no confused state.

Listing and managing operations

# List operations for the authenticated user (paginated)
GET /v1/operations?filter=done%3Dfalse&pageSize=10

{
  "operations": [
    { "name": "operations/export-7f3e9c12", "done": false, "metadata": { "phase": "serializing" } }
  ],
  "nextPageToken": "Cg8IARILb3BzL2V4cG9ydA"
}

# Delete a completed operation (clean up)
DELETE /v1/operations/export-7f3e9c12
# → 200 OK; {} (idempotent — deleting twice is safe)
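A client that wants every operation, not just the first page, has to follow nextPageToken. A minimal sketch of that loop follows, assuming a hypothetical `list_page` callable that performs the GET for a given page token and returns the parsed body.

```python
from typing import Callable, Iterator, Optional


def iter_operations(list_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every operation, following nextPageToken until it is absent."""
    token: Optional[str] = None
    while True:
        page = list_page(token)
        yield from page.get("operations", [])
        token = page.get("nextPageToken")
        if not token:
            return


# Simulated two-page listing keyed by page token.
pages = {
    None: {"operations": [{"name": "operations/a"}], "nextPageToken": "t1"},
    "t1": {"operations": [{"name": "operations/b"}]},
}
names = [op["name"] for op in iter_operations(lambda token: pages[token])]
print(names)  # ['operations/a', 'operations/b']
```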

Cross-reference: the async pattern appears in the code-submission model described in Case Study: LeetCode-style judge and in the event-driven pub/sub pattern when the worker publishes its completion event.

🧠 Quick check

1. When a client calls POST /v1/projects/p:exportTasks and receives an Operation resource, what HTTP status code should the server return?

Google AIP-151 specifies 200 OK. The Operation resource IS the synchronous result of the call — the server successfully created and returned it. 202 is for fire-and-forget where no resource is returned; 201 is for standard resource creation. The choice of 200 signals to the client: "you received a complete, well-formed response; use the name inside to poll."

2. A polling client fires GET /v1/operations/export-abc and receives {"done": true} but neither response nor error is present. This indicates:

When done=true, exactly one of response or error must be present. If neither is present, the worker used a non-atomic write — it committed done=true in one transaction and the result in another. A client that polls in the window between the two commits sees an inconsistent state. Fix: always write done+result/error atomically in a single DB transaction.

3. What is the primary reason to implement exponential back-off when polling an LRO endpoint?

Exponential back-off is a courtesy and a self-protection mechanism. Tight polling (e.g. every 100 ms for a 3-minute job) generates thousands of requests that all just read the same "done: false" row. Back-off reduces load and gives the server headroom to serve real work. It also automatically handles transient server overload — if the server is slow, backing off reduces pressure rather than compounding it.

4. A client calls POST /v1/operations/export-abc:cancel. A subsequent GET returns {"done": false}. What should the client conclude?

Cancellation is a signal, not an instant hard stop. The worker must reach a checkpoint where it checks for a cancellation flag, cleans up any partial state, and then writes done=true with error.code = CANCELLED. This can take seconds. The client should continue polling after requesting a cancel; it will eventually see done=true with an error field.

5. Which of the following makes an LRO-creating request safe to retry if the client times out waiting for the response?

A client-generated idempotency key lets the server return the already-created Operation if it sees the same key again, without starting a second export job. Without this, a network timeout on the POST causes the client to retry, two exports run in parallel, and the client has no way to know which Operation is theirs. See the idempotency lesson for the full pattern.

✍️ Exercise: design the LRO contract for bulk task status update

Your team wants to add a :bulkUpdateStatus action to the Tasks API: given a list of task names and a target status, flip all of them to that status. The project might have up to 50 000 tasks. Design the full LRO contract: the initiating request/response, the metadata shape, the completion response, the cancel semantics, and the polling strategy you'd recommend to clients.

Model answer:

# Initiating request
POST /v1/projects/acme-prod/tasks:bulkUpdateStatus
{ "names": ["tasks/t1","tasks/t2",...], "status": "DONE" }

# Returns Operation immediately
{
  "name": "operations/bulk-upd-9a1f",
  "done": false,
  "metadata": {
    "@type": "type.googleapis.com/tasks.BulkUpdateStatusMetadata",
    "tasksUpdated": 0, "tasksTotal": 50000,
    "phase": "queued",
    "createTime": "...", "updateTime": "..."
  }
}

# On success (done=true)
"response": {
  "@type": "type.googleapis.com/tasks.BulkUpdateStatusResponse",
  "tasksUpdated": 50000, "tasksFailed": 0
}

# On partial failure — still resolves as "done"
"response": {
  "tasksUpdated": 49997,
  "tasksFailed": 3,
  "failedTasks": [
    {"name": "tasks/t101", "error": {"code": 5, "message": "NOT_FOUND"}}
  ]
}

Key design decisions: The operation allows partial success rather than being all-or-nothing. Return a tasksFailed count and a list of failures in the response (not the error field), because the operation as a whole completed. Reserve the top-level error field for systemic failures (e.g. the worker crashed, the database was unreachable). Recommended client poll strategy: start at 2 s, double each poll up to a 60 s ceiling; honor a Retry-After header if provided. For 50 000 tasks, tell clients to expect 15–30 seconds at peak.
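The partial-success response can be assembled from per-task outcomes once the worker has finished. The sketch below shows one way to do that; the function `build_bulk_response` and the shape of its `results` argument are assumptions made for this example.

```python
from typing import Dict, Optional


def build_bulk_response(results: Dict[str, Optional[dict]]) -> dict:
    """Fold per-task outcomes into the Operation's `response` payload.

    `results` maps a task name to None on success or to an error dict.
    """
    failed = [
        {"name": name, "error": err}
        for name, err in results.items()
        if err is not None
    ]
    return {
        "tasksUpdated": len(results) - len(failed),
        "tasksFailed": len(failed),
        "failedTasks": failed,
    }


response = build_bulk_response({
    "tasks/t1": None,
    "tasks/t2": None,
    "tasks/t101": {"code": 5, "message": "NOT_FOUND"},
})
```

Individual task failures land in `failedTasks` inside the response, which leaves the top-level error field free for failures of the operation as a whole.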

Rubric: Full marks for correct LRO shape (name/done/metadata/response), a defined metadata type with progress fields, the partial-success distinction (response vs. error), and a documented poll back-off strategy. Bonus for the idempotency key recommendation.

Key takeaways

Return an Operation resource immediately from any call that can't finish within a single request. The resource has a stable name, a done boolean, and typed metadata for progress.
When done: true, exactly one of response or error is present — write them atomically in one DB transaction.
The initiating call returns 200 OK (not 202). The Operation itself was successfully created; the underlying work is asynchronous.
Clients should poll with exponential back-off; surface a Retry-After hint in the response to help them choose an interval.
Use an idempotency key on the initiating POST so retries after network failures don't spin up duplicate jobs.
Cancel via :cancel is best-effort and asynchronous — always keep polling until done: true.