Design Case Studies · Lesson 16

Design: Chess / Turn-based Game API

A chess API looks deceptively simple — two players, alternate turns, finite board. The hard parts are invisible: preventing a crafty client from submitting two moves in a row, keeping a spectating crowd in sync with the live board at sub-second latency, and ensuring a disconnect mid-game doesn't corrupt time controls. This lesson designs the whole system from requirements through latency budget.

⏱ ~22 min advanced Prereq: WebSockets, Idempotency

By the end you'll be able to

Explain why server-authoritative move validation is a non-negotiable anti-cheat boundary.
Model a game lifecycle as a state machine and show how each transition maps to an API operation.
Choose between REST and WebSocket for each operation in a turn-based game and justify the split.
Design idempotent move submission keyed by ply number and explain what happens on a retry.
Sketch a latency budget for real-time move broadcasting to spectators.

1 — Requirements

Capture what the system must do before deciding how to build it. Clarity here prevents scope creep and surfaces hidden constraints.

Functional requirements

Create a game: one player creates a game (choosing time control, variant), and a second player joins or is paired via matchmaking.
Make a move: a player submits a move in algebraic notation. The server validates legality, enforces turn order, and advances the board state.
Game state: any authenticated client can fetch the current board state, move history, clocks, and result at any time.
Matchmaking: automatically pair two players of similar rating for a new game.
Spectators: any number of read-only observers can watch a live game in real time; they see each move within ~1 second of it being played.
Real-time move updates: both players and spectators receive move notifications via push (not polling).
Time controls: each player has a clock (e.g. blitz 5+0, rapid 10+5). The server is the authoritative clock; it deducts time on each move and flags a player as having lost on time if their clock expires.

Non-functional requirements

Move latency: the opponent and spectators should see a move within 500 ms of submission in the same region; p99 < 1 s globally.
Turn-order consistency: no player may submit two consecutive moves without the opponent having moved in between. Enforced server-side regardless of client behaviour.
Anti-cheat: the server validates every move against legal chess rules. An illegal move is rejected (422) and does not advance the game.
Idempotent moves: a network retry of the same move at the same ply must not create a duplicate move.
Clock precision: server clock deductions are monotonic. A client cannot gain extra time by disconnecting and reconnecting.
High availability for reads: game state reads should remain available even if the primary game server is temporarily unavailable.

Scale assumption: 500,000 concurrent active games. Each game produces roughly one move per 30 seconds on average (blitz games move faster; classical slower). That is ~17,000 move submissions per second at peak, each fanning out to 2 players plus a variable number of spectators. A popular tournament game may have 10,000 spectators.

2 — Design decisions

Decision 1: Server-authoritative move validation (anti-cheat boundary)

Imagine a locked vault with a slot in the door. You slide a move proposal through the slot; the vault checks it against the rules, stamps it "legal" or "illegal," and only if legal does it update the canonical game record inside. The client never touches the canonical record directly — it only sees what the vault stamps and returns.

In practice: the server receives a move string like "e2-e4", loads the current board state, runs a legal-move generator, and either advances the game or returns 422. The client holds a display board for smooth UX (premove, animation), but that board is considered speculative until the server confirms. This is what "server-authoritative" means: the server's board is the only board that counts.

Why it matters for anti-cheat: without this boundary, a client can lie about the board state, submit moves for the wrong colour, or use a local engine to calculate moves without network round-trips. With it, all of those attacks require compromising the server itself.

Decision 2: REST for game CRUD, WebSocket for real-time push

Two interaction styles serve different needs:

Operation	Style	Reason
Create a game	REST POST	One-off, idempotent-safe with a client-generated ID, cacheable response
Make a move	REST POST	Discrete action with a clear response (accepted / rejected), idempotent with ply key
Fetch game state	REST GET	Cacheable, stateless, works for spectator replay and history
Receive live move updates	WebSocket	Server-push to an unknown number of subscribers; avoids polling overhead
Clock tick to clients	WebSocket	Continuous low-bandwidth updates; HTTP overhead per tick is prohibitive

The REST / WebSocket split is clean: REST handles writes (games, moves) and point-in-time reads. WebSocket handles everything that is push-based — moves arriving at the opponent, clock updates, game-over signals. This means a player client needs one persistent WebSocket connection per active game, and uses REST for submitting their own moves.

✅ Why not submit moves over WebSocket?

You could submit moves over WebSocket, and some implementations do. The downside: you lose the HTTP request-response contract for move acceptance. A REST POST returns a clear 422 with a structured error when the move is illegal. Over WebSocket you have to invent your own request-response correlation (request IDs, ack messages). For the move-submission path specifically, the latency cost of a REST round-trip is acceptable and buys you a much cleaner error model. Use REST where you need a response guarantee; use WebSocket where you only need to push.

Decision 3: Game lifecycle as a state machine

A game is not just a collection of moves — it is a stateful entity with well-defined transitions. Modelling it as a state machine forces you to enumerate every transition and its preconditions, which makes the API surface explicit.

States: WAITING (created, awaiting second player) → ACTIVE (both players present, moves allowed) → PAUSED (optional: a player disconnected during casual/correspondence play) → COMPLETED (checkmate, stalemate, resignation, timeout, draw agreement). COMPLETED is terminal; no transitions out.

Each API operation is a transition guard: POST /moves is only valid when the game is ACTIVE and it is the submitting player's turn. Attempting it in any other state returns 409 Conflict with a game_state field in the error body so the client can handle it correctly.

Decision 4: Idempotent move submission keyed by ply number

Network retries are inevitable — a mobile player's connection drops mid-submit. Without idempotency, two copies of the same move arrive at the server; the first succeeds and the second either fails with a confusing error or, worse, is interpreted as a second move by the same player (violating turn order).

The solution (idempotency): the client includes a ply field in the move body — the half-move number this move should be. If the server receives a move with ply=23 and the game is already at ply=23 (the move was already applied), it returns the same 200 response as if it just accepted the move. If the server receives ply=23 and the game is at ply=25 (the move is stale), it returns 409. The ply number is a natural idempotency key: it encodes "this is the 23rd half-move in this game" uniquely.

⚠️ Don't use wall-clock timestamps as idempotency keys

Some designs key retries by a client-generated timestamp ("submitted_at": "2024-05-01T12:00:00.123Z"). This fails for two reasons: clocks differ between client and server, and a retry 50 ms later will have a different timestamp, making the server treat it as a new move. The ply number is deterministic, shared between client and server, and guaranteed to be unique within a game — use it.

Decision 5: Optimistic concurrency for turn order

Two rapid move submissions from the same player (a network glitch replays the POST) could race at the server. To prevent turn-order corruption without a distributed lock: the server stores the current ply number alongside the game record. The move handler runs a conditional update:

-- Pseudo-SQL: atomic turn check + advance
UPDATE games
SET    ply      = ply + 1,
       position = $new_fen,
       to_move  = $next_colour
WHERE  id       = $game_id
  AND  ply      = $expected_ply   -- optimistic lock on ply
  AND  to_move  = $submitting_colour;

If the WHERE clause matches zero rows (because ply advanced before this query ran), the handler returns 409. No extra locking infrastructure required.

Chess game lifecycle state machine. Only games in the ACTIVE state accept move submissions. Reaching COMPLETED is terminal — the game record becomes immutable. The optional PAUSED state applies to correspondence play with extended time controls.

3 — The API model

Create a game

POST /v1/games
Authorization: Bearer <token>
Content-Type: application/json

// Request body
{
  "time_control": {
    "initial_seconds": 300,    // 5 minutes
    "increment_seconds": 0      // no increment — blitz 5+0
  },
  "variant":      "standard",
  "opponent_id":  "player_77"     // null for matchmaking
}

// 201 Created
{
  "id":         "game_ab3f",
  "state":      "WAITING",
  "white":      "player_42",
  "black":      null,              // not yet joined
  "to_move":    "white",
  "ply":        0,
  "created_at": "2024-06-01T10:00:00Z"
}

Make a move

POST /v1/games/game_ab3f/moves
Authorization: Bearer <token>

// Request body — ply is the idempotency key
{
  "move":  "e2e4",       // long algebraic notation
  "ply":   1             // this should be the 1st half-move
}

// 200 OK — move accepted (or idempotent re-accept on retry)
{
  "ply":        1,
  "move":       "e2e4",
  "fen":        "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1",
  "to_move":    "black",
  "clock":      { "white": 292, "black": 300 },   // seconds remaining
  "game_state": "ACTIVE"
}

// 422 Unprocessable Entity — illegal move
{
  "error":  "illegal_move",
  "detail": "e2e5 is not a legal move for the piece on e2 in this position"
}

// 409 Conflict — wrong turn or stale ply
{
  "error":        "wrong_turn",
  "current_ply":  4,
  "to_move":      "black"
}

Fetch game state

GET /v1/games/game_ab3f

// 200 OK
{
  "id":         "game_ab3f",
  "state":      "ACTIVE",
  "white":      "player_42",
  "black":      "player_77",
  "ply":        4,
  "to_move":    "white",
  "fen":        "rnbqkbnr/pppp1ppp/8/4p3/4P3/8/PPPP1PPP/RNBQKBNR w KQkq e6 0 2",
  "moves":      [ "e2e4", "e7e5", "g1f3", "b8c6" ],
  "clock":      { "white": 286, "black": 289 }
}

WebSocket: subscribe to live game events

// Connect — one connection per game per client
wss://live.example.com/v1/games/game_ab3f/stream
Authorization: Bearer <token>  // via query param or upgrade header

// Server → client: move event
{
  "type":       "MOVE",
  "ply":        5,
  "move":       "f1c4",
  "fen":        "...",
  "clock":      { "white": 281, "black": 289 }
}

// Server → client: clock tick (every 1 s while ACTIVE)
{ "type": "CLOCK", "clock": { "white": 280, "black": 289 } }

// Server → client: game over
{
  "type":       "GAME_OVER",
  "state":      "COMPLETED",
  "result":     "white_wins",
  "reason":     "checkmate"
}

🎯 Interview angle: ply as idempotency key

When an interviewer asks "how do you handle retries on the move endpoint?", the correct answer is not "add an Idempotency-Key header." That works generically but requires the client to generate and cache a UUID per move. The better answer for chess is: use the ply number. It's a natural, server-known sequence number that both sides agree on. A retry carrying ply=7 when the game is at ply=7 is immediately recognisable as an idempotent replay. Name the domain property that makes it work.

Move submission pipeline. The server validates state, turn order, and legality before writing. An optimistic ply-keyed UPDATE prevents concurrent duplicates. On success, the new game state is fanned out via WebSocket to the opponent and all spectators.

4 — Evaluation & latency budget

Real-time latency path

The player submitting a move cares about two latencies: (1) how long until they get the 200 confirmation (move-submit latency), and (2) how long until the opponent sees the move on their board (opponent-receive latency). Both matter for blitz games.

Stage	Budget (p50)	Notes
Client → API server network	~15 ms	Same-region; regional endpoints reduce this
State check + legal-move validation	~2 ms	In-memory chess engine; no I/O on the hot path
Optimistic UPDATE (DB write)	~5 ms	Single-row update with index on game_id + ply
Publish to WebSocket fan-out broker	~2 ms	In-process pub/sub or Redis pub/sub
200 response back to submitter	~15 ms	Waits only until DB write confirmed
Broker → opponent WS delivery	~3 ms	In-region; same server or same DC
Opponent sees move	~35 ms total	Well under 500 ms target

For spectator fan-out: a popular game with 10,000 spectators must push the same MOVE event to all 10,000 WebSocket connections. The WebSocket server cannot iterate 10,000 send calls synchronously on the hot path — it will block move acceptance for subsequent moves. The solution: publish to an internal game channel; a dedicated broadcast worker asynchronously drains the fan-out. The move-submit latency is unaffected; spectators may lag by up to ~200 ms behind the players, which is acceptable.

Turn-order consistency

The optimistic concurrency control described in Decision 5 is the consistency guarantee. Two concurrent POSTs for the same ply: the database accepts the first (ply matches) and rejects the second (WHERE ply = $expected_ply returns zero rows). No distributed lock needed; the database serialises competing writes through its own row-level locking on the game row.

Anti-cheat trade-offs

Server-side legal-move validation is comprehensive but has a cost: every move requires loading the current board position (FEN string) from storage, initialising a chess engine, and running the move generator. For a well-implemented engine in a compiled language this takes 1–3 ms. At 17,000 moves/second across 500,000 concurrent games, the CPU cost is proportional to the number of game workers, not the number of games — each game is independent.

The attack surface that server-authoritative validation does not eliminate: engine assistance (a player using a chess engine on their own device to choose moves). Detecting engine use requires statistical analysis of move quality over time (a separate anti-cheat service), not move legality checks. Legality checks only prevent cheating through illegal board manipulation.

⚠️ Clock drift: never trust the client timestamp

A common mistake is to accept "move_timestamp" from the client and deduct the difference from the player's clock. A client can lie about its timestamp — or simply have a slow clock — and gain extra seconds per move. The server must record its own arrival timestamp when it accepts a move and use that for clock deduction. The client timestamp, if sent at all, is useful only for latency measurement telemetry, never for authoritative clock management.

💬 Interview angle: state vs. event in game streaming

Should the WebSocket MOVE message carry just the move notation ("f1c4") or the full FEN board position? If you send only the move, a spectator who joined late or missed a message has a stale board and cannot reconstruct the position from a single event. If you send the full FEN (128-byte string), each message is self-healing — even if a spectator misses 10 events, the next one corrects their view. For a game where position history is compact and board size is fixed, send the full state. Save delta-only for domains where state is large (video frames, 3D world state).

Exercise — Correspondence chess: how do time controls and state transitions change?

Prompt: The product team wants to add "correspondence chess" — games where players have 3 days per move instead of minutes. How does your API and state machine change? What new failure modes appear?

Key differences from blitz:

No persistent WebSocket required: players check in every few hours, not every few seconds. REST polling (or email/push notification on opponent move) replaces a persistent WS connection. The MOVE event becomes a push notification rather than a WS broadcast.
Clock granularity: clocks measure days, not seconds. You no longer need per-second CLOCK tick events. Store move_deadline (a UTC timestamp) instead of remaining seconds.
PAUSED state becomes first-class: in blitz, PAUSED is a brief transient. In correspondence, a game can sit idle for 2.9 days between moves. The state machine should expose PAUSED visibly in the API response, and a background job must monitor for timeout.
New failure mode — abandonment: players may simply stop checking in. The server needs a timeout job: if now() > move_deadline, transition the game to COMPLETED with reason: "timeout" automatically. In blitz, timeout detection is triggered by the clock-tick logic on the game server; in correspondence, it must be a scheduled background sweep.

Rubric: must identify that WS is unnecessary for correspondence. Must propose a move-deadline timestamp instead of a clock. Must describe a background job for timeout detection. Bonus: mention push notification (email / mobile push) as the delivery mechanism instead of WebSocket.

Under the hood: the core mechanism

Every move submission flows through a strict pipeline: the server never accepts the client's view of the board as ground truth — it owns the canonical position and re-derives everything from it. Here is the exact data structure the server holds for a live game and the precise steps it executes on each move request.

The canonical game record

The server stores one row per active game. The key fields that enforce correctness:

Field	Type	Purpose
`fen`	VARCHAR(100)	Forsyth–Edwards Notation — the full board position, active colour, castling rights, en passant square, and half/full move counters in one compact string
`ply`	INTEGER	Half-move counter, starting at 0. Incremented by 1 for every move by either colour. Also the idempotency key for move retries.
`to_move`	ENUM(white, black)	Whose turn it is. Checked before the legal-move generator runs.
`state`	ENUM(WAITING, ACTIVE, PAUSED, COMPLETED)	State machine node. Move submission is gated on ACTIVE.
`white_clock_ms`	BIGINT	Milliseconds remaining for white. Set by the server using its own monotonic clock — never the client's timestamp.
`black_clock_ms`	BIGINT	Same for black.
`last_move_at`	TIMESTAMPTZ	Server-recorded timestamp of the previous move. Used to compute clock deduction for the current move.

The FEN string is the complete game state. Example after 1.e4 e5:

"rnbqkbnr/pppp1ppp/8/4p3/4P3/8/PPPP1PPP/RNBQKBNR w KQkq e6 0 2"
# ranks 8→1 / active colour (w=white) / castling rights / en passant / halfmove / fullmove

The move history is stored as a separate moves append-only table (game_id, ply, move_san). The FEN in the games table is always the current position — the move table is the audit log. On retry, the server looks up the moves table for the requested ply to return the original response without re-running the engine.

Worked trace: submit move → validate → broadcast

Concrete example: White submits 1.e4 (e2e4) as ply 1. The game is brand new (starting FEN, ply 0, state ACTIVE, to_move white).

# ① Client sends POST /v1/games/game_ab3f/moves POST /v1/games/game_ab3f/moves HTTP/1.1 Authorization: Bearer eyJhbGciOiJSUzI1NiJ9... Content-Type: application/json {"move": "e2e4", "ply": 1} # ② Server loads games row: state=ACTIVE, ply=0, to_move=white # Checks: state == ACTIVE? YES # to_move == white? YES (submitter is white) # request ply == current+1? YES (1 == 0+1) # → proceed to legal-move validation # ③ Chess engine: initialise board from FEN, generate all legal moves for white # Legal moves (partial): e2e4, d2d4, g1f3, b1c3, a2a3, … (20 total from start) # Is "e2e4" in the set? YES → legal # ④ Compute clock deduction # server_now = 2024-06-01T10:00:08.142Z # last_move_at = 2024-06-01T10:00:00.000Z (game start) # elapsed_ms = 8142 # new white_clock_ms = 300000 - 8142 = 291858 # ⑤ Atomic DB write (optimistic concurrency) # UPDATE games # SET fen='rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1', # ply=1, to_move='black', white_clock_ms=291858, last_move_at=now() # WHERE id='game_ab3f' AND ply=0 AND to_move='white' # → 1 row affected (success) # INSERT INTO moves(game_id, ply, move_san) VALUES('game_ab3f', 1, 'e2e4') # ⑥ Publish to WebSocket game channel # publish("game:game_ab3f", {"type":"MOVE","ply":1,"move":"e2e4","fen":"...","clock":{...}}) # ⑦ Return 200 OK to submitter HTTP/1.1 200 OK Content-Type: application/json {"ply":1,"move":"e2e4","fen":"rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1", "to_move":"black","clock":{"white":291,"black":300},"game_state":"ACTIVE"}

Worked trace: illegal move rejected

Now black tries e7e5 — a legal move — but the client has a bug and submits e7e2 (a pawn jumping 5 squares).

POST /v1/games/game_ab3f/moves {"move": "e7e2", "ply": 2} # ② State checks: ACTIVE, to_move=black, ply matches (2 == 1+1) → all pass # ③ Engine: generate all legal moves for black in current FEN # Legal moves: e7e5, e7e6, d7d5, g8f6, … (20 total) # Is "e7e2" in the set? NO # ④ No DB write. No clock deduction. No publish. HTTP/1.1 422 Unprocessable Entity {"error":"illegal_move","detail":"e7e2 is not a legal move for the piece on e7 in this position"} # Game ply remains 1. to_move remains black. Board is unchanged.

The game state machine: transition guards in code

The state machine is not just documentation — each transition is enforced as a pre-condition check before the move handler runs the engine. The server checks all of these in order, short-circuiting on the first failure:

Check	Fail condition	Response
Game exists	Unknown game_id	404 Not Found
State == ACTIVE	WAITING, PAUSED, or COMPLETED	409 Conflict — includes current state
Submitter is a player	Token belongs to neither white nor black	403 Forbidden
It is the submitter's turn	to_move != submitter's colour	409 Conflict — wrong_turn
Ply matches current + 1	ply < current+1 (stale / idempotent replay) or ply > current+1	200 OK (idempotent) or 409 Conflict
Move is legal	Not in legal-move set from engine	422 Unprocessable Entity
Clock not expired	Player's clock_ms already 0	409 Conflict — clock_expired (game auto-completes)

Move idempotency detail: when the server receives ply=N and the game is already at ply=N (or higher), it queries the moves table for ply=N and returns the stored response without re-executing. This means a retry after a network timeout returns a bit-for-bit identical response to the original — the FEN, clock, and to_move are all as they were when the move was first applied.

Operating & debugging it

Inspecting a live game from the server side

The three things you need to verify when a move is behaving unexpectedly:

# 1. Check the canonical game record psql -c "SELECT ply, to_move, state, white_clock_ms, black_clock_ms, LEFT(fen, 60) AS fen_preview, last_move_at FROM games WHERE id = 'game_ab3f';" ply | to_move | state | white_clock_ms | black_clock_ms | fen_preview | last_move_at -----+---------+--------+----------------+----------------+----------------------------------------------------+---------------------------- 5 | white | ACTIVE | 271000 | 285000 | rnbqkbnr/pppp1ppp/8/4p3/4P3/8/PPPP1PPP/RNBQKBNR w | 2024-06-01 10:01:23.442+00 # 2. Check move history for the game psql -c "SELECT ply, move_san, created_at FROM moves WHERE game_id='game_ab3f' ORDER BY ply;" ply | move_san | created_at -----+----------+---------------------------- 1 | e2e4 | 2024-06-01 10:00:08.142+00 2 | e7e5 | 2024-06-01 10:00:16.803+00 3 | g1f3 | 2024-06-01 10:00:55.217+00 4 | b8c6 | 2024-06-01 10:01:10.590+00 5 | f1c4 | 2024-06-01 10:01:23.442+00 # 3. Verify FEN consistency — paste FEN into a chess position validator echo "rnbqkbnr/pppp1ppp/8/4p3/2B1P3/5N2/PPPP1PPP/RNBQK2R b KQkq - 3 3" | \ node -e "const {Chess}=require('chess.js'); const c=new Chess(require('fs').readFileSync('/dev/stdin','utf8').trim()); console.log(c.turn(), c.moves().length, 'legal moves')" b 29 legal moves

# Monitor move acceptance rate and rejection reasons in real time tail -f /var/log/chess-api/moves.log | jq -r 'select(.event=="move_result") | "\(.ts) \(.game_id) ply=\(.ply) result=\(.result) \(.reason // "")"' 2024-06-01T10:01:23Z game_ab3f ply=5 result=accepted 2024-06-01T10:01:31Z game_xk9q ply=12 result=accepted 2024-06-01T10:01:35Z game_ab3f ply=5 result=rejected reason=idempotent_replay 2024-06-01T10:01:41Z game_mm7r ply=8 result=rejected reason=illegal_move move=d1h7 2024-06-01T10:01:55Z game_pp2n ply=14 result=rejected reason=wrong_turn submitter=black to_move=white

Symptom → cause → fix table

Symptom	Cause	Fix
Every move returns 409 wrong_turn even for the correct player	`to_move` in the DB is stuck on one colour — the UPDATE is silently not advancing it	Check that the SET clause includes `to_move = $next_colour`; verify it is not hard-coded
Legal move returns 422 illegal_move intermittently	FEN in the DB is stale — a prior write partially succeeded (crash between INSERT and UPDATE)	Wrap the moves table INSERT and games UPDATE in a single transaction; add a reconciliation job that rebuilds FEN from move history on discrepancy
Retry of the same move creates a duplicate entry in the moves table	Idempotency check missing — ply not checked before INSERT	Add `WHERE ply = $expected_ply` to the UPDATE and check `rowsAffected == 0` before inserting into moves
Player's clock shows more time than they should have	Server is using the client-supplied `move_timestamp` for clock deduction instead of its own arrival time	Record `received_at = now()` at request entry; use `received_at - last_move_at` for deduction; ignore client timestamps for clocks
WebSocket MOVE events arrive out of order at some clients	Fan-out is multi-threaded and messages publish without ordering guarantee	Clients should use the `ply` field to order incoming events; if ply N+2 arrives before N+1, hold N+2 in a buffer and apply in order, or re-fetch state with GET /games/:id
GET /games/:id returns a board position 2 moves behind what players see	Read is hitting a replica with replication lag behind the primary	For read-your-own-write consistency on game state, route GET /games/:id to the primary (or use replication lag monitoring + fallback); spectator reads can tolerate replica lag
Game stays in ACTIVE after checkmate is played	End-of-game detection (checkmate, stalemate) not wired into the move handler — engine evaluates legality but not termination	After applying the move, call engine.isGameOver(); if true, set state=COMPLETED, result, and reason in the same transaction; publish GAME_OVER event

Production monitoring checklist

Move acceptance rate: alert if illegal_move rejections exceed 1% of submissions — could indicate a client bug, a corrupted FEN, or a chess engine version mismatch.
Move handler p99 latency: should be under 50 ms. If it spikes, check the legal-move generator CPU time (profile the engine) and the games table UPDATE latency.
Optimistic update collision rate: if WHERE ply=N matches 0 rows more than ~0.1% of moves, concurrent duplicate submissions are higher than expected — check client retry logic.
WebSocket fan-out lag: measure time from DB write to opponent receives MOVE event. Should be under 100 ms. Spikes indicate pub/sub broker backpressure.
Games stuck in ACTIVE after clock expiry: run a periodic query for games where state='ACTIVE' AND (white_clock_ms=0 OR black_clock_ms=0) — these should have been transitioned to COMPLETED by the clock-expiry handler.

🧠 Check your understanding

A client submits POST /v1/games/g1/moves with {"move": "e2e4", "ply": 3} but the game's current ply is already 3 (this move was applied in a previous request). What should the server return?

The ply number is the idempotency key. If ply=3 has already been applied, the server recognises this as a retry and returns 200 with the same result it returned originally — it does not apply the move a second time. This is idempotency: the same operation with the same key yields the same outcome.

Why does the move-submission endpoint use REST (HTTP POST) rather than sending moves over the WebSocket connection?

WebSocket can carry JSON and supports authentication. The deciding factor is the error model: REST POST returns a typed HTTP status code per request automatically. Over WebSocket you'd have to invent request-response correlation (custom request IDs, ack messages) to know whether your specific move was accepted. REST handles this for free.

Two concurrent POST requests for ply=7 arrive at the server for the same game. The optimistic concurrency check is UPDATE games … WHERE ply = 7. What happens?

Databases serialise concurrent writes to the same row. After the first update sets ply to 8, the second update's WHERE clause (ply=7) matches zero rows. The application detects zero rows affected and returns 409 Conflict. The game is never corrupted — no distributed lock required.

The server wants to broadcast a move to 10,000 spectators without blocking the move-acceptance path. Which approach achieves this?

Blocking on 10,000 WS sends would create a severe tail latency for the move-acceptance response. Publishing to an internal channel decouples the write path from the fan-out path: the move is committed and the 200 response returned immediately; spectator delivery happens asynchronously, with acceptable lag of ~200 ms.

Key takeaways

Server-authoritative validation is the only effective move-legality anti-cheat boundary. The client's board is speculative until the server confirms.
Model the game as a state machine. Enumerating states and transitions explicitly makes the API surface precise: which operations are valid, and what status code each invalid transition returns.
Split REST and WebSocket by role: REST for writes and point-in-time reads; WebSocket for push-based delivery to players and spectators.
Ply number as idempotency key is more robust than a client-generated UUID for move retries — it is domain-native, naturally unique per game, and shared by both sides.
Never use client timestamps for clock deduction. The server records its own arrival time; the client clock is untrusted.
For high-spectator fan-out, decouple write acceptance from broadcast delivery via an internal channel and async worker to keep move latency predictable.

Sources & further reading

Chess.com — How cheating detection works — covers engine detection vs. move legality, distinct problems
MDN — WebSocket API — transport used for live game streaming
RFC 6455 — The WebSocket Protocol
Lichess API reference — open-source chess platform's public API; useful real-world comparison
Wikipedia — Forsyth–Edwards Notation (FEN) — the standard compact encoding for chess board positions used in the API model above
Stripe — Idempotent Requests — canonical reference for idempotency key design patterns