Resource Design Patterns · Lesson 06

Batch Operations

N individual API calls cost N round trips — and on a mobile connection or across a continent that can mean seconds of added latency. Batch operations pack many reads or writes into one request and one response, cutting that to a single round trip while keeping the resource model clean.

⏱ 15 min · Difficulty: core · Prereq: REST basics, rdp-05 LRO

By the end you'll be able to

Describe the four batch verb patterns (batchGet, batchCreate, batchUpdate, batchDelete) and write correct request/response shapes for each.
Explain the atomicity trade-off — all-or-nothing vs. partial success — and choose the right model for a given use case.
Design per-item error reporting for a partial-success batch and avoid the common traps (silent drops, oversized payloads, missing ordering guarantees).

The N+1 problem that batch solves

Imagine a dashboard that needs to display the details of 30 tasks assigned to the current user. Without batch support, the client makes 30 sequential GETs, or 30 concurrent GETs that may each open a separate TLS connection and incur their own round-trip latency. On a transatlantic link with a 60 ms round trip, 30 sequential calls add 30 × 60 ms = 1.8 seconds of network wait before the dashboard has all of its data.

Think of it like a supermarket conveyor belt. You could walk up to the cashier 30 times with one item each. Batch is loading everything onto the belt in one trip. The cashier (server) processes the items, and you collect everything in one bag at the other end.

See also: data fetching & pagination for the broader N+1 discussion, and GraphQL for a different approach to the same problem (client-specified fields instead of predefined batch endpoints).

The four batch verbs

Following Google AIP-231 and related AIPs, batch operations use a custom method suffix on the collection URL: :batchGet, :batchCreate, :batchUpdate, :batchDelete. The parent resource (the project) scopes the batch to its own children.

The four batch verbs and the atomicity choice that shapes the response envelope. The atomicity model is your most important design decision — it must be documented clearly and must be consistent for all items in the same call.

Worked example — batchGet: fetch several tasks at once

# Request — names as query parameters (GET, no body)
GET /v1/projects/acme-prod/tasks:batchGet?names=projects%2Facme-prod%2Ftasks%2Ft1&names=projects%2Facme-prod%2Ftasks%2Ft2&names=projects%2Facme-prod%2Ftasks%2Ft9
Authorization: Bearer <token>

# Response — 200 OK
{
  "tasks": [
    {
      "name":        "projects/acme-prod/tasks/t1",
      "title":       "Design login flow",
      "status":      "OPEN",
      "assignee":    "users/ada",
      "due_time":    "2025-07-01T00:00:00Z",
      "create_time": "2025-06-10T09:00:00Z",
      "update_time": "2025-06-18T14:22:11Z"
    },
    {
      "name":        "projects/acme-prod/tasks/t2",
      "title":       "Write API spec",
      "status":      "IN_PROGRESS",
      "assignee":    "users/grace",
      "due_time":    "2025-06-30T00:00:00Z",
      "create_time": "2025-06-12T10:00:00Z",
      "update_time": "2025-06-19T08:05:00Z"
    },
    {
      "name":        "projects/acme-prod/tasks/t9",
      "title":       "Set up CI pipeline",
      "status":      "DONE",
      "completed":   true,
      "create_time": "2025-06-01T07:30:00Z",
      "update_time": "2025-06-15T16:00:00Z"
    }
  ]
}

The response preserves the order of the input names. If one name is not found, the server must choose: omit it silently (bad, because the client cannot tell a missing resource from a dropped one), return a per-item error for it, or fail the whole batch with a 404 (bad for partial-read use cases). AIP-231 itself takes the strict route and requires batchGet to be atomic: if any name cannot be returned, the whole call fails. A common, more lenient alternative is to return a top-level 404 only if all names are unknown; if any name is valid, return the results and list the missing ones in an errors section. Document your choice explicitly.
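
As a rough client-side sketch of the batchGet call above, the Python below builds the repeated `names` query string and checks a response for silently dropped names. The helper names `batch_get_url` and `find_missing` are made up for illustration, and the check assumes the response shape shown in this lesson (a `tasks` array whose entries carry a `name`).

```python
from urllib.parse import urlencode


def batch_get_url(base, parent, names):
    """Build a batchGet URL with one repeated `names` parameter per resource."""
    # doseq=True expands the list into names=...&names=...; urlencode also
    # percent-encodes the "/" characters inside each resource name.
    query = urlencode({"names": names}, doseq=True)
    return f"{base}/v1/{parent}/tasks:batchGet?{query}"


def find_missing(requested, response):
    """Return requested names that have no entry in the response's tasks array."""
    returned = {task["name"] for task in response.get("tasks", [])}
    return [name for name in requested if name not in returned]
```

Any name reported by `find_missing` should also appear in the server's error reporting; if it does not, the server dropped it silently.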

Worked example — batchCreate: create several tasks at once

# Request — array of individual CreateTask request objects
POST /v1/projects/acme-prod/tasks:batchCreate
Authorization: Bearer <token>
Content-Type: application/json

{
  "requests": [
    {
      "task": {
        "title":    "Implement rate limiting",
        "notes":    "Token bucket, 100 req/min per user",
        "status":   "OPEN",
        "assignee": "users/grace",
        "labels":   ["backend", "infra"]
      }
    },
    {
      "task": {
        "title":    "Add field mask support to PATCH",
        "status":   "OPEN",
        "due_time": "2025-07-15T00:00:00Z"
      }
    },
    {
      "task": {
        "title":    "",     // intentionally invalid: empty title
        "status":   "OPEN"
      }
    }
  ]
}

All-or-nothing response

# If the batch is atomic — the empty title causes full rollback
HTTP/1.1 400 Bad Request
{
  "error": {
    "code":    400,
    "message": "requests[2]: task.title must not be empty",
    "status":  "INVALID_ARGUMENT"
  }
}
# Neither of the valid tasks was created.

Partial-success response

# If the batch allows partial success
HTTP/1.1 200 OK
{
  "tasks": [
    {
      "name":        "projects/acme-prod/tasks/t44",
      "title":       "Implement rate limiting",
      "create_time": "2025-06-20T10:00:01Z",
      "update_time": "2025-06-20T10:00:01Z"
    },
    {
      "name":        "projects/acme-prod/tasks/t45",
      "title":       "Add field mask support to PATCH",
      "create_time": "2025-06-20T10:00:01Z",
      "update_time": "2025-06-20T10:00:01Z"
    }
  ],
  "errors": [
    {
      "index":   2,
      "status": "INVALID_ARGUMENT",
      "message": "task.title must not be empty"
    }
  ]
}
# Items 0 and 1 were created; item 2 failed.
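
A client consuming the partial-success response above has to line results back up with the requests it sent. The sketch below shows one way to do that in Python, assuming the compact shape used in this example: `tasks` holds only the successes in request order, and `errors` tags each failure with its request `index`. The function name `reconcile` is hypothetical.

```python
def reconcile(request_count, response):
    """Map a compact partial-success response back onto request indices.

    Returns one entry per request: ("ok", task) or ("error", error).
    """
    errors = {err["index"]: err for err in response.get("errors", [])}
    successes = iter(response.get("tasks", []))
    results = []
    for index in range(request_count):
        if index in errors:
            results.append(("error", errors[index]))
            continue
        task = next(successes, None)
        if task is None:
            # Neither a result nor an error: the server dropped this item.
            raise ValueError(f"no result or error reported for request {index}")
        results.append(("ok", task))
    return results
```

Raising on an unaccounted-for item turns the silent-drop failure mode into a loud one on the client side.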

Under the hood: how it actually works

A batch request doesn't magically bypass the server's individual resource logic. The server unpacks the batch, runs each sub-request through the same validation, authorization, and business rules as a standalone call, then aggregates the results. Understanding this is key to predicting behavior.

Server-side execution model

Parse and validate the envelope. The server checks that the batch payload is well-formed (array not null, not too large) and that each sub-request has the required fields. This is a cheap synchronous pass before any DB access.
Authorize each item. Each item is checked against the caller's permissions for that specific resource. A user with read-only access to tasks/t1 but write access to tasks/t2 will cause item 0 to fail with 403 and item 1 to succeed in a partial-success model.
Execute in a transaction (atomic) or fan out (partial). For atomic batches, all operations run inside a single database transaction. For partial-success, operations run independently — either sequentially or in parallel using a worker pool.
Aggregate the results. Collect per-item results (successes + created/updated resources) and per-item errors, preserving the input order.
Return one response. The entire aggregated result goes back in a single HTTP response body.

In the partial-success model, sub-operations run independently. Successes and failures are collected and returned together in one response. The client must scan the errors array to detect which items failed.
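
The five steps can be sketched as a small in-memory handler. This is a simplified illustration, not a reference implementation: the dict-backed `store`, the `can_create` permission callback, the helper names, and the cap of 100 are all assumptions made for the example.

```python
MAX_BATCH_SIZE = 100  # assumed cap, matching the limit suggested in this lesson


def _error(message):
    return {"error": {"code": 400, "message": message, "status": "INVALID_ARGUMENT"}}


def _item_error(index, status, message):
    return {"index": index, "status": status, "message": message}


def create_task(task, store):
    """Stand-in for the standalone create logic: validate, then write."""
    if not task.get("title"):
        raise ValueError("task.title must not be empty")
    name = f"projects/acme-prod/tasks/t{len(store) + 1}"
    store[name] = {**task, "name": name}
    return store[name]


def batch_create_partial(body, store, can_create=lambda task: True):
    """Partial-success batchCreate; returns (http_status, response_body)."""
    requests = body.get("requests")
    # Step 1: validate the envelope before touching the store.
    if not isinstance(requests, list) or not requests:
        return 400, _error("requests must be a non-empty array")
    if len(requests) > MAX_BATCH_SIZE:
        return 400, _error(f"batch size exceeds limit of {MAX_BATCH_SIZE}")

    tasks, errors = [], []
    for index, request in enumerate(requests):
        task = request.get("task") or {}
        # Step 2: authorize each item on its own.
        if not can_create(task):
            errors.append(_item_error(index, "PERMISSION_DENIED",
                                      "caller may not create this task"))
            continue
        # Step 3: run the same logic a standalone create would run.
        try:
            tasks.append(create_task(task, store))
        except ValueError as exc:
            errors.append(_item_error(index, "INVALID_ARGUMENT", str(exc)))
    # Steps 4 and 5: aggregate in input order and return one response.
    return 200, {"tasks": tasks, "errors": errors}
```

Because `create_task` is the same function a single-resource create would call, the batch path cannot drift from the standalone validation rules.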

The atomicity trade-off in depth

Model	How it works	Best for	Pitfall
All-or-nothing (atomic)	All writes happen inside one DB transaction. If any item fails, the transaction rolls back — zero items are committed.	Financial ledger entries, order line items, anything that must stay consistent as a group.	One bad item in a 500-item batch blocks all 499 good ones. Clients must fix the bad item and resend the entire batch. Throughput drops when the batch is large and error rates are non-zero.
Partial success	Each item is committed independently. The response carries a per-item result that can be either the created/updated resource or an error status.	Bulk import, notification sends, log ingestion — use cases where partial delivery is better than no delivery.	The client must read the errors array on every call — there is no status code signal that some items failed (the HTTP response is 200). Easy to miss silently dropped items.
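
To make the all-or-nothing row concrete, here is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database. The one-column `tasks` table, the function name, and the simplified success body are assumptions for illustration; a real service would use its own schema and storage engine.

```python
import sqlite3


def batch_create_atomic(conn, titles):
    """Insert every title in one transaction; any failure rolls back all of them."""
    try:
        # Using the connection as a context manager commits on success and
        # rolls back if an exception escapes the block.
        with conn:
            for index, title in enumerate(titles):
                if not title:
                    raise ValueError(f"requests[{index}]: task.title must not be empty")
                conn.execute("INSERT INTO tasks (title) VALUES (?)", (title,))
    except ValueError as exc:
        return 400, {"error": {"code": 400, "message": str(exc),
                               "status": "INVALID_ARGUMENT"}}
    return 200, {"created": len(titles)}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (title TEXT NOT NULL)")
```

Calling `batch_create_atomic(conn, ["Implement rate limiting", ""])` returns a 400 and leaves the table empty, even though the first insert had already run inside the transaction.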

Per-item result ordering

Regardless of the atomicity model, the response must preserve input order. If the client sent names [t1, t2, t9], the results must come back in that same order. The strictest form keeps a 1:1 index correspondence: if item 1 failed, position 1 holds an error object (or a null placeholder, paired with an errors entry whose index is 1). The compact form used in this lesson's examples lists only the successes, still in request order, and reports every failure in errors[] with its request index. Either way, never silently omit an item: the client has no way to detect the absence.
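
Parallel execution and in-order results are compatible. The sketch below uses `ThreadPoolExecutor.map` from Python's standard library, which yields results in input order no matter which worker finishes first. It returns one slot per input name (the index-aligned form), and `fetch` is a hypothetical per-resource lookup that raises `KeyError` for unknown names.

```python
from concurrent.futures import ThreadPoolExecutor


def batch_get_in_order(names, fetch, max_workers=8):
    """Fetch all names concurrently; return one result slot per input name."""
    def fetch_one(name):
        try:
            return {"task": fetch(name)}
        except KeyError:
            # A failed lookup still occupies its slot, so nothing is omitted.
            return {"error": {"status": "NOT_FOUND",
                              "message": f"{name} does not exist"}}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves the order of `names` in its output.
        return list(pool.map(fetch_one, names))
```

With this shape, position i of the result always corresponds to position i of the request, so the client needs no separate index bookkeeping.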

How to debug & inspect it

# batchGet: URL-encode each name, repeat the param
curl -s -G "https://api.example.com/v1/projects/acme-prod/tasks:batchGet" \
  --data-urlencode "names=projects/acme-prod/tasks/t1" \
  --data-urlencode "names=projects/acme-prod/tasks/t2" \
  -H "Authorization: Bearer $TOKEN" | jq '.tasks | length'
2

# Verify ordering is preserved
curl -s ... | jq '[.tasks[].name]'
["projects/acme-prod/tasks/t1","projects/acme-prod/tasks/t2"]

# batchCreate: check for partial failures
RESPONSE=$(curl -s -X POST .../tasks:batchCreate -d @payload.json -H "...")
echo "$RESPONSE" | jq '.errors // [] | length'
1

echo "$RESPONSE" | jq '.errors[]'
{
  "index": 2,
  "status": "INVALID_ARGUMENT",
  "message": "task.title must not be empty"
}

# Extract only the successfully created tasks
echo "$RESPONSE" | jq '.tasks'

Symptom	Likely cause	Fix
200 response but fewer items than requested	Partial-success model dropped failing items without reporting them	Always return an `errors[]` array (even empty) so clients can check; never silently omit
`414 URI Too Long` on batchGet	Too many names encoded in the query string (URL length limit ~8 KB on many servers)	Switch to POST body for batchGet when name count is large; or page the batch into chunks
Some items succeed on retry, others get duplicate-created	Partial-success batchCreate is not idempotent; retry resends already-created items	Support an `Idempotency-Key` per sub-request, or accept a client-assigned `taskId` so a resent item is detected as already existing instead of being inserted again
All-or-nothing batch times out for large payloads	A single large transaction locks many rows for the entire execution duration	Cap batch size (e.g. 100 items); document the limit and return 400 INVALID_ARGUMENT if exceeded
Response item ordering differs from request ordering	Server executed items in parallel and collected in arrival order	Sort results by input index before returning; use a `requestIndex` field in each result to allow out-of-order execution with in-order response

🎯 Interview angle

"How would you design a bulk import endpoint for 10 000 tasks?" is a classic API design interview question. A strong answer covers: batch verbs with a documented size cap, the atomicity choice and the reasoning behind it (import = partial success; financial = all-or-nothing), per-item error reporting with index references, an idempotency story for retries, and the LRO fallback for very large batches (see LRO lesson). Interviewers probe for the silent-drop trap, so make sure you explain why omitting failed items is dangerous.

⚠️ Common trap: an unbounded batch size

Without a documented maximum, clients will eventually send batches with thousands of items. On an atomic batch, that is one enormous database transaction that holds locks for seconds and risks timing out. On a partial-success batch, it is a huge response body and a long tail of serialization time. Always document a hard limit (e.g. requests.length <= 100) and return 400 INVALID_ARGUMENT with the message "batch size exceeds limit of 100" when the client exceeds it. Then let clients paginate their large datasets through the batch endpoint in chunks.

✅ Treat batchGet as a read optimization, not a query

A batchGet takes a list of known resource names and returns those specific resources. It is NOT a filtered list endpoint — use a standard GET /tasks?filter=... for that. The distinction matters for caching: individual resource GETs are cacheable at the CDN layer; a batchGet with arbitrary name lists generally isn't. Keep the two patterns separate in your API surface.

Designing the size cap and pagination strategy

# Client sends more than the 100-item cap
POST /v1/projects/acme-prod/tasks:batchCreate

{ "requests": [ ... 150 items ... ] }

# Server rejects immediately — before any processing
HTTP/1.1 400 Bad Request
{
  "error": {
    "code":    400,
    "message": "requests array exceeds maximum batch size of 100. Split into smaller batches.",
    "status":  "INVALID_ARGUMENT"
  }
}

# Correct client strategy: chunk into pages of 100
# Chunk 1: items 0-99  → POST :batchCreate
# Chunk 2: items 100-149 → POST :batchCreate
# Collect errors from each response and retry only the failed indices
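
The chunking strategy can be sketched on the client side as follows. One detail is easy to miss: each response's `errors[].index` is relative to its own chunk, so the client has to add the chunk offset to recover the position in the original list. Here `send_batch` is a hypothetical callable that posts one `:batchCreate` body and returns the parsed response; the default cap of 100 mirrors the example above.

```python
def chunked(items, size):
    """Yield (offset, chunk) pairs covering `items` in slices of at most `size`."""
    for start in range(0, len(items), size):
        yield start, items[start:start + size]


def batch_create_all(requests, send_batch, max_batch_size=100):
    """Send `requests` in capped chunks; return (created_tasks, failed_items)."""
    created, failed = [], []
    for offset, chunk in chunked(requests, max_batch_size):
        response = send_batch({"requests": chunk})
        created.extend(response.get("tasks", []))
        for err in response.get("errors", []):
            # Translate the chunk-relative index into an index in `requests`.
            failed.append({**err, "index": offset + err["index"]})
    return created, failed
```

The `failed` list then identifies exactly which original items to fix and resend.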

🧠 Quick check

1. A client sends a batchCreate with 5 tasks. The server uses the partial-success model and item 2 is invalid. What should the HTTP response status code be?

In the partial-success model the HTTP response itself is 200 OK because the batch request was processed correctly as a whole. The failure of individual items is communicated inside the response body via the errors array, not via the HTTP status code. This is why clients must always inspect the errors array — a 200 does not mean all items succeeded.

2. You are designing a batch endpoint for a financial ledger: each batch contains a set of debit/credit entries that must always balance. Which atomicity model is correct?

Financial entries must be consistent as a group — applying some debits without the corresponding credits produces an unbalanced ledger. All-or-nothing is the only model that preserves this invariant. Partial success is correct for use cases like bulk import where partial delivery is better than nothing, but it cannot be used for anything requiring transactional consistency across items.

3. A batchGet response comes back with 2 items, but the request included 3 names. What is the safest conclusion?

Silently dropping items from a batchGet is a design defect. The client has no way to distinguish "resource not found" from "server decided to omit it for another reason." A well-designed batchGet either returns a result for every input name (with an error entry for missing ones) or returns a 404 for the whole batch if all names are unknown. Silent omissions are the "null pointer exception" of batch APIs.

4. Why is a partial-success batchCreate dangerous to retry naively when a network error interrupts the response?

In a partial-success batch, items are committed independently as they succeed. If the network fails after the server processed items 0–3 but before the client received the response, a naive retry resends all 5 items. Items 0–3 are created a second time (duplicates), while item 4 may or may not have been committed. The fix is to use a per-request idempotency key so the server de-duplicates on retry.
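
A minimal sketch of per-request de-duplication is shown below. The in-memory `TaskStore` class and the `idempotencyKey` field on each sub-request are assumptions for illustration; a production server would persist the key-to-resource mapping and expire old keys.

```python
class TaskStore:
    """In-memory store that remembers which idempotency key created which task."""

    def __init__(self):
        self.tasks = {}
        self.by_key = {}  # idempotency key -> task name

    def create(self, task, idempotency_key):
        # A key seen before means this is a retry: return the original task
        # instead of creating a duplicate.
        if idempotency_key in self.by_key:
            return self.tasks[self.by_key[idempotency_key]]
        name = f"tasks/t{len(self.tasks) + 1}"
        self.tasks[name] = {**task, "name": name}
        self.by_key[idempotency_key] = name
        return self.tasks[name]


def batch_create(store, body):
    """Create each sub-request once, keyed by its hypothetical idempotencyKey."""
    tasks = [store.create(req["task"], req["idempotencyKey"])
             for req in body["requests"]]
    return {"tasks": tasks, "errors": []}
```

Resending the same body after a dropped response returns the same tasks and leaves the store unchanged, which is what makes the naive retry safe.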

5. When is a Long-running Operation (LRO) a better choice than a batch endpoint?

A batch endpoint is a synchronous pattern — the client waits for the entire response before proceeding. This works well for tens or low hundreds of items. For very large datasets (thousands of items, multi-second execution time), a synchronous response risks client-side timeouts and produces enormous response bodies. In those cases, use an LRO: return an Operation immediately, process in the background, and let the client poll for results.

✍️ Exercise: design batchUpdate for the Tasks API

Design the full request and response contract for POST /v1/projects/acme-prod/tasks:batchUpdate. The caller wants to change the status of tasks t1 and t2 to IN_PROGRESS, and change the due_time of task t3 (without touching any other field). Use field masks (see rdp-04) for each sub-request. Show: the request body, the success response (partial-success model), and an error response for the case where t2 does not exist.

Model answer:

# Request
POST /v1/projects/acme-prod/tasks:batchUpdate
{
  "requests": [
    {
      "task": { "name": "projects/acme-prod/tasks/t1", "status": "IN_PROGRESS" },
      "updateMask": "status"
    },
    {
      "task": { "name": "projects/acme-prod/tasks/t2", "status": "IN_PROGRESS" },
      "updateMask": "status"
    },
    {
      "task": {
        "name": "projects/acme-prod/tasks/t3",
        "due_time": "2025-08-01T00:00:00Z"
      },
      "updateMask": "due_time"
    }
  ]
}

# Response — partial success, t2 not found
HTTP/1.1 200 OK
{
  "tasks": [
    {
      "name": "projects/acme-prod/tasks/t1",
      "status": "IN_PROGRESS",
      "update_time": "2025-06-20T10:05:00Z"
    },
    {
      "name": "projects/acme-prod/tasks/t3",
      "due_time": "2025-08-01T00:00:00Z",
      "update_time": "2025-06-20T10:05:00Z"
    }
  ],
  "errors": [
    {
      "index": 1,
      "status": "NOT_FOUND",
      "message": "Task 'projects/acme-prod/tasks/t2' does not exist."
    }
  ]
}

Key points: Each sub-request has its own updateMask because different tasks are changing different fields. The tasks array keeps the successful results in request order, but because t2 failed it holds only two entries: t1 at position 0 and t3 at position 1, even though t3 was request index 2. The errors array reports the failure explicitly at index 1. Note the ambiguity this creates: array positions in tasks[] no longer match request indices, so clients must use errors[].index to work out which request each result belongs to. An alternative design places a null/placeholder at position 1 in the tasks array to maintain a 1:1 index correspondence. Both approaches are valid; the key is to document the choice unambiguously.

Rubric: Full marks for correct requests[] shape with per-item updateMask, partial-success 200 response, errors[] with index field, and a note about the index-alignment ambiguity. Partial marks for missing updateMask or missing errors structure.
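
For readers who want to see the model answer's server side, here is a simplified sketch of applying a per-item updateMask. It assumes a comma-separated mask of top-level field names only, treats a masked field that is absent from the request as "clear this field", and uses a plain dict as the store; the function names are hypothetical.

```python
def apply_update_mask(existing, patch, update_mask):
    """Return a copy of `existing` with only the masked fields taken from `patch`."""
    updated = dict(existing)
    for field in (part.strip() for part in update_mask.split(",")):
        if field in patch:
            updated[field] = patch[field]
        else:
            # Assumption: a masked field missing from the patch is cleared.
            updated.pop(field, None)
    return updated


def batch_update(store, body):
    """Partial-success batchUpdate over a dict keyed by task name."""
    tasks, errors = [], []
    for index, request in enumerate(body["requests"]):
        name = request["task"]["name"]
        if name not in store:
            errors.append({"index": index, "status": "NOT_FOUND",
                           "message": f"Task '{name}' does not exist."})
            continue
        store[name] = apply_update_mask(store[name], request["task"],
                                        request["updateMask"])
        tasks.append(store[name])
    return {"tasks": tasks, "errors": errors}
```

Fields outside the mask, such as a task's title, are left untouched even if the request body happens to include them.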

Key takeaways

Batch operations (batchGet, batchCreate, batchUpdate, batchDelete) collapse N round trips into one. They live at the collection URL using a :batch* custom method suffix.
The atomicity decision is your most important design choice. All-or-nothing preserves consistency across items; partial success tolerates individual failures. Make the choice explicit in your API documentation and never mix them in the same endpoint.
In the partial-success model, always return an errors array (even when empty) and preserve input ordering. Never silently drop failing items.
Cap the batch size (e.g. 100 items) and document the limit. Return 400 INVALID_ARGUMENT if exceeded so clients know to paginate.
Use an idempotency key per sub-request to make batchCreate safe to retry after network failures.
For batches too large to process synchronously, fall back to an LRO (see rdp-05) rather than timing out the client.