Design Case Studies · Lesson 16
Design: Chess / Turn-based Game API
A chess API looks deceptively simple — two players, alternate turns, finite board. The hard parts are invisible: preventing a crafty client from submitting two moves in a row, keeping a spectating crowd in sync with the live board at sub-second latency, and ensuring a disconnect mid-game doesn't corrupt time controls. This lesson designs the whole system from requirements through latency budget.
By the end you'll be able to
- Explain why server-authoritative move validation is a non-negotiable anti-cheat boundary.
- Model a game lifecycle as a state machine and show how each transition maps to an API operation.
- Choose between REST and WebSocket for each operation in a turn-based game and justify the split.
- Design idempotent move submission keyed by ply number and explain what happens on a retry.
- Sketch a latency budget for real-time move broadcasting to spectators.
1 — Requirements
Capture what the system must do before deciding how to build it. Clarity here prevents scope creep and surfaces hidden constraints.
Functional requirements
- Create a game: one player creates a game (choosing time control, variant), and a second player joins or is paired via matchmaking.
- Make a move: a player submits a move in algebraic notation. The server validates legality, enforces turn order, and advances the board state.
- Game state: any authenticated client can fetch the current board state, move history, clocks, and result at any time.
- Matchmaking: automatically pair two players of similar rating for a new game.
- Spectators: any number of read-only observers can watch a live game in real time; they see each move within ~1 second of it being played.
- Real-time move updates: both players and spectators receive move notifications via push (not polling).
- Time controls: each player has a clock (e.g. blitz 5+0, rapid 10+5). The server is the authoritative clock; it deducts time on each move and flags a player as having lost on time if their clock expires.
Non-functional requirements
- Move latency: the opponent and spectators should see a move within 500 ms of submission in the same region; p99 < 1 s globally.
- Turn-order consistency: no player may submit two consecutive moves without the opponent having moved in between. Enforced server-side regardless of client behaviour.
- Anti-cheat: the server validates every move against legal chess rules. An illegal move is rejected (422) and does not advance the game.
- Idempotent moves: a network retry of the same move at the same ply must not create a duplicate move.
- Clock precision: server clock deductions are monotonic. A client cannot gain extra time by disconnecting and reconnecting.
- High availability for reads: game state reads should remain available even if the primary game server is temporarily unavailable.
Scale assumption: 500,000 concurrent active games. Each game produces roughly one move per 30 seconds on average (blitz games move faster; classical slower). That is ~17,000 move submissions per second at peak, each fanning out to 2 players plus a variable number of spectators. A popular tournament game may have 10,000 spectators.
2 — Design decisions
Decision 1: Server-authoritative move validation (anti-cheat boundary)
Imagine a locked vault with a slot in the door. You slide a move proposal through the slot; the vault checks it against the rules, stamps it "legal" or "illegal," and only if legal does it update the canonical game record inside. The client never touches the canonical record directly — it only sees what the vault stamps and returns.
In practice: the server receives a move string like "e2-e4", loads the current board state, runs a legal-move generator, and either advances the game or returns 422. The client holds a display board for smooth UX (premove, animation), but that board is considered speculative until the server confirms. This is what "server-authoritative" means: the server's board is the only board that counts.
Why it matters for anti-cheat: without this boundary, a client can lie about the board state, submit moves for the wrong colour, or use a local engine to calculate moves without network round-trips. With it, all of those attacks require compromising the server itself.
Decision 2: REST for game CRUD, WebSocket for real-time push
Two interaction styles serve different needs:
| Operation | Style | Reason |
|---|---|---|
| Create a game | REST POST | One-off, idempotent-safe with a client-generated ID, cacheable response |
| Make a move | REST POST | Discrete action with a clear response (accepted / rejected), idempotent with ply key |
| Fetch game state | REST GET | Cacheable, stateless, works for spectator replay and history |
| Receive live move updates | WebSocket | Server-push to an unknown number of subscribers; avoids polling overhead |
| Clock tick to clients | WebSocket | Continuous low-bandwidth updates; HTTP overhead per tick is prohibitive |
The REST / WebSocket split is clean: REST handles writes (games, moves) and point-in-time reads. WebSocket handles everything that is push-based — moves arriving at the opponent, clock updates, game-over signals. This means a player client needs one persistent WebSocket connection per active game, and uses REST for submitting their own moves.
You could submit moves over WebSocket, and some implementations do. The downside: you lose the HTTP request-response contract for move acceptance. A REST POST returns a clear 422 with a structured error when the move is illegal. Over WebSocket you have to invent your own request-response correlation (request IDs, ack messages). For the move-submission path specifically, the latency cost of a REST round-trip is acceptable and buys you a much cleaner error model. Use REST where you need a response guarantee; use WebSocket where you only need to push.
Decision 3: Game lifecycle as a state machine
A game is not just a collection of moves — it is a stateful entity with well-defined transitions. Modelling it as a state machine forces you to enumerate every transition and its preconditions, which makes the API surface explicit.
States: WAITING (created, awaiting second player) → ACTIVE (both players present, moves allowed) → PAUSED (optional: a player disconnected during casual/correspondence play) → COMPLETED (checkmate, stalemate, resignation, timeout, draw agreement). COMPLETED is terminal; no transitions out.
Each API operation is a transition guard: POST /moves is only valid when the game is ACTIVE and it is the submitting player's turn. Attempting it in any other state returns 409 Conflict with a game_state field in the error body so the client can handle it correctly.
Decision 4: Idempotent move submission keyed by ply number
Network retries are inevitable — a mobile player's connection drops mid-submit. Without idempotency, two copies of the same move arrive at the server; the first succeeds and the second either fails with a confusing error or, worse, is interpreted as a second move by the same player (violating turn order).
The solution (idempotency): the client includes a ply field in the move body — the half-move number this move should be. If the server receives a move with ply=23 and the game is already at ply=23 (the move was already applied), it returns the same 200 response as if it just accepted the move. If the server receives ply=23 and the game is at ply=25 (the move is stale), it returns 409. The ply number is a natural idempotency key: it encodes "this is the 23rd half-move in this game" uniquely.
Some designs key retries by a client-generated timestamp ("submitted_at": "2024-05-01T12:00:00.123Z"). This fails for two reasons: clocks differ between client and server, and a retry 50 ms later will have a different timestamp, making the server treat it as a new move. The ply number is deterministic, shared between client and server, and guaranteed to be unique within a game — use it.
Decision 5: Optimistic concurrency for turn order
Two rapid move submissions from the same player (a network glitch replays the POST) could race at the server. To prevent turn-order corruption without a distributed lock: the server stores the current ply number alongside the game record. The move handler runs a conditional update:
-- Pseudo-SQL: atomic turn check + advance
UPDATE games
SET ply = ply + 1,
position = $new_fen,
to_move = $next_colour
WHERE id = $game_id
AND ply = $expected_ply -- optimistic lock on ply
AND to_move = $submitting_colour;
If the WHERE clause matches zero rows (because ply advanced before this query ran), the handler returns 409. No extra locking infrastructure required.
3 — The API model
Create a game
POST /v1/games
Authorization: Bearer <token>
Content-Type: application/json
// Request body
{
"time_control": {
"initial_seconds": 300, // 5 minutes
"increment_seconds": 0 // no increment — blitz 5+0
},
"variant": "standard",
"opponent_id": "player_77" // null for matchmaking
}
// 201 Created
{
"id": "game_ab3f",
"state": "WAITING",
"white": "player_42",
"black": null, // not yet joined
"to_move": "white",
"ply": 0,
"created_at": "2024-06-01T10:00:00Z"
}
Make a move
POST /v1/games/game_ab3f/moves
Authorization: Bearer <token>
// Request body — ply is the idempotency key
{
"move": "e2e4", // long algebraic notation
"ply": 1 // this should be the 1st half-move
}
// 200 OK — move accepted (or idempotent re-accept on retry)
{
"ply": 1,
"move": "e2e4",
"fen": "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1",
"to_move": "black",
"clock": { "white": 292, "black": 300 }, // seconds remaining
"game_state": "ACTIVE"
}
// 422 Unprocessable Entity — illegal move
{
"error": "illegal_move",
"detail": "e2e5 is not a legal move for the piece on e2 in this position"
}
// 409 Conflict — wrong turn or stale ply
{
"error": "wrong_turn",
"current_ply": 4,
"to_move": "black"
}
Fetch game state
GET /v1/games/game_ab3f
// 200 OK
{
"id": "game_ab3f",
"state": "ACTIVE",
"white": "player_42",
"black": "player_77",
"ply": 4,
"to_move": "white",
"fen": "rnbqkbnr/pppp1ppp/8/4p3/4P3/8/PPPP1PPP/RNBQKBNR w KQkq e6 0 2",
"moves": [ "e2e4", "e7e5", "g1f3", "b8c6" ],
"clock": { "white": 286, "black": 289 }
}
WebSocket: subscribe to live game events
// Connect — one connection per game per client
wss://live.example.com/v1/games/game_ab3f/stream
Authorization: Bearer <token> // via query param or upgrade header
// Server → client: move event
{
"type": "MOVE",
"ply": 5,
"move": "f1c4",
"fen": "...",
"clock": { "white": 281, "black": 289 }
}
// Server → client: clock tick (every 1 s while ACTIVE)
{ "type": "CLOCK", "clock": { "white": 280, "black": 289 } }
// Server → client: game over
{
"type": "GAME_OVER",
"state": "COMPLETED",
"result": "white_wins",
"reason": "checkmate"
}
When an interviewer asks "how do you handle retries on the move endpoint?", the correct answer is not "add an Idempotency-Key header." That works generically but requires the client to generate and cache a UUID per move. The better answer for chess is: use the ply number. It's a natural, server-known sequence number that both sides agree on. A retry carrying ply=7 when the game is at ply=7 is immediately recognisable as an idempotent replay. Name the domain property that makes it work.
4 — Evaluation & latency budget
Real-time latency path
The player submitting a move cares about two latencies: (1) how long until they get the 200 confirmation (move-submit latency), and (2) how long until the opponent sees the move on their board (opponent-receive latency). Both matter for blitz games.
| Stage | Budget (p50) | Notes |
|---|---|---|
| Client → API server network | ~15 ms | Same-region; regional endpoints reduce this |
| State check + legal-move validation | ~2 ms | In-memory chess engine; no I/O on the hot path |
| Optimistic UPDATE (DB write) | ~5 ms | Single-row update with index on game_id + ply |
| Publish to WebSocket fan-out broker | ~2 ms | In-process pub/sub or Redis pub/sub |
| 200 response back to submitter | ~15 ms | Waits only until DB write confirmed |
| Broker → opponent WS delivery | ~3 ms | In-region; same server or same DC |
| Opponent sees move | ~35 ms total | Well under 500 ms target |
For spectator fan-out: a popular game with 10,000 spectators must push the same MOVE event to all 10,000 WebSocket connections. The WebSocket server cannot iterate 10,000 send calls synchronously on the hot path — it will block move acceptance for subsequent moves. The solution: publish to an internal game channel; a dedicated broadcast worker asynchronously drains the fan-out. The move-submit latency is unaffected; spectators may lag by up to ~200 ms behind the players, which is acceptable.
Turn-order consistency
The optimistic concurrency control described in Decision 5 is the consistency guarantee. Two concurrent POSTs for the same ply: the database accepts the first (ply matches) and rejects the second (WHERE ply = $expected_ply returns zero rows). No distributed lock needed; the database serialises competing writes through its own row-level locking on the game row.
Anti-cheat trade-offs
Server-side legal-move validation is comprehensive but has a cost: every move requires loading the current board position (FEN string) from storage, initialising a chess engine, and running the move generator. For a well-implemented engine in a compiled language this takes 1–3 ms. At 17,000 moves/second across 500,000 concurrent games, the CPU cost is proportional to the number of game workers, not the number of games — each game is independent.
The attack surface that server-authoritative validation does not eliminate: engine assistance (a player using a chess engine on their own device to choose moves). Detecting engine use requires statistical analysis of move quality over time (a separate anti-cheat service), not move legality checks. Legality checks only prevent cheating through illegal board manipulation.
A common mistake is to accept "move_timestamp" from the client and deduct the difference from the player's clock. A client can lie about its timestamp — or simply have a slow clock — and gain extra seconds per move. The server must record its own arrival timestamp when it accepts a move and use that for clock deduction. The client timestamp, if sent at all, is useful only for latency measurement telemetry, never for authoritative clock management.
Should the WebSocket MOVE message carry just the move notation ("f1c4") or the full FEN board position? If you send only the move, a spectator who joined late or missed a message has a stale board and cannot reconstruct the position from a single event. If you send the full FEN (128-byte string), each message is self-healing — even if a spectator misses 10 events, the next one corrects their view. For a game where position history is compact and board size is fixed, send the full state. Save delta-only for domains where state is large (video frames, 3D world state).
Exercise — Correspondence chess: how do time controls and state transitions change?
Prompt: The product team wants to add "correspondence chess" — games where players have 3 days per move instead of minutes. How does your API and state machine change? What new failure modes appear?
Key differences from blitz:
- No persistent WebSocket required: players check in every few hours, not every few seconds. REST polling (or email/push notification on opponent move) replaces a persistent WS connection. The MOVE event becomes a push notification rather than a WS broadcast.
- Clock granularity: clocks measure days, not seconds. You no longer need per-second CLOCK tick events. Store
move_deadline(a UTC timestamp) instead of remaining seconds. - PAUSED state becomes first-class: in blitz, PAUSED is a brief transient. In correspondence, a game can sit idle for 2.9 days between moves. The state machine should expose PAUSED visibly in the API response, and a background job must monitor for timeout.
- New failure mode — abandonment: players may simply stop checking in. The server needs a timeout job: if
now() > move_deadline, transition the game to COMPLETED withreason: "timeout"automatically. In blitz, timeout detection is triggered by the clock-tick logic on the game server; in correspondence, it must be a scheduled background sweep.
Rubric: must identify that WS is unnecessary for correspondence. Must propose a move-deadline timestamp instead of a clock. Must describe a background job for timeout detection. Bonus: mention push notification (email / mobile push) as the delivery mechanism instead of WebSocket.
Under the hood: the core mechanism
Every move submission flows through a strict pipeline: the server never accepts the client's view of the board as ground truth — it owns the canonical position and re-derives everything from it. Here is the exact data structure the server holds for a live game and the precise steps it executes on each move request.
The canonical game record
The server stores one row per active game. The key fields that enforce correctness:
| Field | Type | Purpose |
|---|---|---|
fen | VARCHAR(100) | Forsyth–Edwards Notation — the full board position, active colour, castling rights, en passant square, and half/full move counters in one compact string |
ply | INTEGER | Half-move counter, starting at 0. Incremented by 1 for every move by either colour. Also the idempotency key for move retries. |
to_move | ENUM(white, black) | Whose turn it is. Checked before the legal-move generator runs. |
state | ENUM(WAITING, ACTIVE, PAUSED, COMPLETED) | State machine node. Move submission is gated on ACTIVE. |
white_clock_ms | BIGINT | Milliseconds remaining for white. Set by the server using its own monotonic clock — never the client's timestamp. |
black_clock_ms | BIGINT | Same for black. |
last_move_at | TIMESTAMPTZ | Server-recorded timestamp of the previous move. Used to compute clock deduction for the current move. |
The FEN string is the complete game state. Example after 1.e4 e5:
"rnbqkbnr/pppp1ppp/8/4p3/4P3/8/PPPP1PPP/RNBQKBNR w KQkq e6 0 2"
# ranks 8→1 / active colour (w=white) / castling rights / en passant / halfmove / fullmove
The move history is stored as a separate moves append-only table (game_id, ply, move_san). The FEN in the games table is always the current position — the move table is the audit log. On retry, the server looks up the moves table for the requested ply to return the original response without re-running the engine.
Worked trace: submit move → validate → broadcast
Concrete example: White submits 1.e4 (e2e4) as ply 1. The game is brand new (starting FEN, ply 0, state ACTIVE, to_move white).
Worked trace: illegal move rejected
Now black tries e7e5 — a legal move — but the client has a bug and submits e7e2 (a pawn jumping 5 squares).
The game state machine: transition guards in code
The state machine is not just documentation — each transition is enforced as a pre-condition check before the move handler runs the engine. The server checks all of these in order, short-circuiting on the first failure:
| Check | Fail condition | Response |
|---|---|---|
| Game exists | Unknown game_id | 404 Not Found |
| State == ACTIVE | WAITING, PAUSED, or COMPLETED | 409 Conflict — includes current state |
| Submitter is a player | Token belongs to neither white nor black | 403 Forbidden |
| It is the submitter's turn | to_move != submitter's colour | 409 Conflict — wrong_turn |
| Ply matches current + 1 | ply < current+1 (stale / idempotent replay) or ply > current+1 | 200 OK (idempotent) or 409 Conflict |
| Move is legal | Not in legal-move set from engine | 422 Unprocessable Entity |
| Clock not expired | Player's clock_ms already 0 | 409 Conflict — clock_expired (game auto-completes) |
Move idempotency detail: when the server receives ply=N and the game is already at ply=N (or higher), it queries the moves table for ply=N and returns the stored response without re-executing. This means a retry after a network timeout returns a bit-for-bit identical response to the original — the FEN, clock, and to_move are all as they were when the move was first applied.
Operating & debugging it
Inspecting a live game from the server side
The three things you need to verify when a move is behaving unexpectedly:
Symptom → cause → fix table
| Symptom | Cause | Fix |
|---|---|---|
| Every move returns 409 wrong_turn even for the correct player | to_move in the DB is stuck on one colour — the UPDATE is silently not advancing it | Check that the SET clause includes to_move = $next_colour; verify it is not hard-coded |
| Legal move returns 422 illegal_move intermittently | FEN in the DB is stale — a prior write partially succeeded (crash between INSERT and UPDATE) | Wrap the moves table INSERT and games UPDATE in a single transaction; add a reconciliation job that rebuilds FEN from move history on discrepancy |
| Retry of the same move creates a duplicate entry in the moves table | Idempotency check missing — ply not checked before INSERT | Add WHERE ply = $expected_ply to the UPDATE and check rowsAffected == 0 before inserting into moves |
| Player's clock shows more time than they should have | Server is using the client-supplied move_timestamp for clock deduction instead of its own arrival time | Record received_at = now() at request entry; use received_at - last_move_at for deduction; ignore client timestamps for clocks |
| WebSocket MOVE events arrive out of order at some clients | Fan-out is multi-threaded and messages publish without ordering guarantee | Clients should use the ply field to order incoming events; if ply N+2 arrives before N+1, hold N+2 in a buffer and apply in order, or re-fetch state with GET /games/:id |
| GET /games/:id returns a board position 2 moves behind what players see | Read is hitting a replica with replication lag behind the primary | For read-your-own-write consistency on game state, route GET /games/:id to the primary (or use replication lag monitoring + fallback); spectator reads can tolerate replica lag |
| Game stays in ACTIVE after checkmate is played | End-of-game detection (checkmate, stalemate) not wired into the move handler — engine evaluates legality but not termination | After applying the move, call engine.isGameOver(); if true, set state=COMPLETED, result, and reason in the same transaction; publish GAME_OVER event |
Production monitoring checklist
- Move acceptance rate: alert if illegal_move rejections exceed 1% of submissions — could indicate a client bug, a corrupted FEN, or a chess engine version mismatch.
- Move handler p99 latency: should be under 50 ms. If it spikes, check the legal-move generator CPU time (profile the engine) and the games table UPDATE latency.
- Optimistic update collision rate: if WHERE ply=N matches 0 rows more than ~0.1% of moves, concurrent duplicate submissions are higher than expected — check client retry logic.
- WebSocket fan-out lag: measure time from DB write to opponent receives MOVE event. Should be under 100 ms. Spikes indicate pub/sub broker backpressure.
- Games stuck in ACTIVE after clock expiry: run a periodic query for games where
state='ACTIVE' AND (white_clock_ms=0 OR black_clock_ms=0)— these should have been transitioned to COMPLETED by the clock-expiry handler.
🧠 Check your understanding
A client submits POST /v1/games/g1/moves with {"move": "e2e4", "ply": 3} but the game's current ply is already 3 (this move was applied in a previous request). What should the server return?
The ply number is the idempotency key. If ply=3 has already been applied, the server recognises this as a retry and returns 200 with the same result it returned originally — it does not apply the move a second time. This is idempotency: the same operation with the same key yields the same outcome.
Why does the move-submission endpoint use REST (HTTP POST) rather than sending moves over the WebSocket connection?
WebSocket can carry JSON and supports authentication. The deciding factor is the error model: REST POST returns a typed HTTP status code per request automatically. Over WebSocket you'd have to invent request-response correlation (custom request IDs, ack messages) to know whether your specific move was accepted. REST handles this for free.
Two concurrent POST requests for ply=7 arrive at the server for the same game. The optimistic concurrency check is UPDATE games … WHERE ply = 7. What happens?
Databases serialise concurrent writes to the same row. After the first update sets ply to 8, the second update's WHERE clause (ply=7) matches zero rows. The application detects zero rows affected and returns 409 Conflict. The game is never corrupted — no distributed lock required.
The server wants to broadcast a move to 10,000 spectators without blocking the move-acceptance path. Which approach achieves this?
Blocking on 10,000 WS sends would create a severe tail latency for the move-acceptance response. Publishing to an internal channel decouples the write path from the fan-out path: the move is committed and the 200 response returned immediately; spectator delivery happens asynchronously, with acceptable lag of ~200 ms.
Key takeaways
- Server-authoritative validation is the only effective move-legality anti-cheat boundary. The client's board is speculative until the server confirms.
- Model the game as a state machine. Enumerating states and transitions explicitly makes the API surface precise: which operations are valid, and what status code each invalid transition returns.
- Split REST and WebSocket by role: REST for writes and point-in-time reads; WebSocket for push-based delivery to players and spectators.
- Ply number as idempotency key is more robust than a client-generated UUID for move retries — it is domain-native, naturally unique per game, and shared by both sides.
- Never use client timestamps for clock deduction. The server records its own arrival time; the client clock is untrusted.
- For high-spectator fan-out, decouple write acceptance from broadcast delivery via an internal channel and async worker to keep move latency predictable.
Sources & further reading
- Chess.com — How cheating detection works — covers engine detection vs. move legality, distinct problems
- MDN — WebSocket API — transport used for live game streaming
- RFC 6455 — The WebSocket Protocol
- Lichess API reference — open-source chess platform's public API; useful real-world comparison
- Wikipedia — Forsyth–Edwards Notation (FEN) — the standard compact encoding for chess board positions used in the API model above
- Stripe — Idempotent Requests — canonical reference for idempotency key design patterns