Design Case Studies · Lesson 11

Design: Online Judge API

Running user-submitted code is one of the most dangerous things a server can do. Designing it correctly means separating the act of accepting a submission from the act of running it — and then running it inside a cage it cannot escape.

⏱ 17 min Difficulty: advanced Prereq: async patterns, event-driven pub/sub, idempotency

By the end you'll be able to

Explain why code execution is inherently asynchronous and why a synchronous submission API is a poor fit.
Sketch the submission → queue → sandbox → verdict flow and name every component's responsibility.
Identify the three security boundaries that prevent user code from affecting the host or other users.

1 — Requirements

An online judge does one thing: accept code, run it against test cases, and return a verdict. The operational constraints make that simple statement complicated.

Functional requirements

Submit code: accept a problem ID, a language, and source code. Return an identifier the client can use to check status.
Run in isolation: execute the code against a private set of judge test cases. User code must not be able to read the test cases, communicate with the network, or affect other submissions.
Return a verdict: Accepted, Wrong Answer, Time Limit Exceeded, Memory Limit Exceeded, Runtime Error, or Compile Error — plus which test case failed and why.
Async by default: execution can take up to a few seconds. The API must not hold an HTTP connection open waiting for it.
Webhook or poll: clients need a way to learn when the verdict is ready, either by polling GET or by registering a webhook URL.

Non-functional requirements

Security isolation: untrusted code runs in a sandboxed container with a read-only filesystem, no network, no host IPC, strict seccomp syscall filtering, and CPU/memory/time limits enforced by the kernel.
Scale: submissions burst heavily around contest start times. The queue absorbs spikes; worker count scales independently of the API tier.
Idempotency: a client retrying a submission (after a network timeout) should not produce two judge runs for the same code.
Rate limiting: per-user and per-IP submission limits prevent compute abuse. See Lesson rel-03.
Resource limits: each run has a wall-clock time limit (typically 1–3 s), a memory ceiling (typically 256 MB), and an output size cap to prevent fork bombs and output floods.

2 — Design decisions

Decision 1: Accept synchronously, execute asynchronously

A naive design executes code inline during the HTTP request and returns the verdict in the response body. This breaks under load: a single slow Python solution (3 s time limit) holds an HTTP worker thread for 3 seconds. At 100 concurrent slow submissions, 100 threads are blocked doing nothing. HTTP servers are designed for millisecond-range request handling — holding threads for seconds is an antipattern that collapses throughput.

The correct split is a 202 Accepted pattern: the API validates the submission, writes it to a queue, and returns an ID immediately. Workers pull from the queue at their own pace. This decoupling appears in Lesson rel-10 (Event-driven pub/sub): the API is the producer, judge workers are consumers.

Decision 2: Idempotency on submission

Network timeouts happen. A client that submits and then never receives a response will retry. Without idempotency, two identical retries produce two separate judge runs — burning compute and potentially producing two entries on a leaderboard. The fix is a client-generated idempotency key (see Lesson rel-02): if the key was seen before, the server returns the original submission's ID without creating a new one.

Decision 3: Verdict delivery — poll vs webhook

Clients need to know when execution finishes. Two options:

Poll: client calls GET /v1/submissions/:id repeatedly. Simple, stateless, slightly wasteful. Good for browser clients that can't receive incoming connections.
Webhook: client registers a callback URL when submitting; the judge POSTs the verdict to that URL when done. Efficient for server-to-server integrations (CI pipelines, contest backends). Requires the client to run an HTTPS endpoint. See Lesson dbg-04 for reliability patterns.

Expose both. Browser clients poll; backend systems use webhooks.

Decision 4: Sandbox architecture

User code is hostile code until proven otherwise. The sandbox enforces isolation at multiple layers:

Container isolation (Linux namespaces + cgroups): each run gets a fresh container. Network namespace has no external routes. PID namespace prevents processes from seeing or signaling the host or other containers.
Read-only filesystem + mounted test cases: source code is injected into an ephemeral tmpfs. Test case input is mounted read-only. The judge binary itself is read-only. The user process cannot write to anything persistent.
Seccomp syscall filter: only a minimal whitelist of syscalls is permitted — read, write, mmap, clock_gettime, exit. execve to a different binary, socket, ptrace, and fork-into-fork are all blocked. Attempting a forbidden syscall kills the process immediately with a Runtime Error verdict.

Decision 5: Queue depth as the primary scaling lever

Contest start events create submission spikes that can be 20× the average rate. The queue absorbs that burst: the API tier scales with web traffic (fast, cheap), while the worker tier scales with execution throughput (slow, CPU-bound). Operators can pre-warm workers before a contest starts without touching the API layer.

The API server accepts and enqueues within a single HTTP request (202). Workers execute in sandboxed containers at their own pace. Clients retrieve the verdict via poll or webhook — no HTTP connection stays open during execution.

3 — The API model

Submit a solution

POST /v1/submissions
Authorization: Bearer <user-token>
Idempotency-Key: client-generated-uuid
Content-Type: application/json

{
  "problem_id": "prob_twosum",
  "language":   "python3",
  "source_code": "class Solution:\n    def twoSum...",
  "webhook_url": "https://app.example.com/hooks/judge"  // optional
}

# Accepted — execution has NOT started yet
HTTP/1.1 202 Accepted
{
  "id":         "sub_4mNpQ7",
  "status":     "queued",
  "problem_id": "prob_twosum",
  "language":   "python3",
  "submitted_at": "2025-09-01T14:22:11Z"
}

# If the Idempotency-Key was seen before, same 202 with the original submission id
# No duplicate judge run is created

Poll for the verdict

GET /v1/submissions/sub_4mNpQ7
Authorization: Bearer <user-token>

# While running
HTTP/1.1 200 OK
{
  "id":     "sub_4mNpQ7",
  "status": "running",
  "verdict": null
}

# Completed — Accepted
HTTP/1.1 200 OK
{
  "id":       "sub_4mNpQ7",
  "status":   "finished",
  "verdict":  "Accepted",
  "runtime_ms": 48,
  "memory_kb":  14320,
  "passed_cases": 72,
  "total_cases":  72
}

# Completed — Wrong Answer
{
  "verdict":   "Wrong Answer",
  "failed_case": 5,           // 1-indexed
  "expected":  "[0,1]",
  "got":       "[1,0]"
}

# Completed — Time Limit Exceeded
{
  "verdict":    "Time Limit Exceeded",
  "failed_case": 61,
  "limit_ms":   2000
}

Webhook delivery (when webhook_url was provided)

# Judge POSTs to the registered webhook URL when execution finishes
POST https://app.example.com/hooks/judge
X-Judge-Signature: sha256=...   // HMAC — verify before trusting
Content-Type: application/json

{
  "event":         "submission.finished",
  "submission_id": "sub_4mNpQ7",
  "verdict":       "Accepted",
  "runtime_ms":    48
}

# Client must respond 200; non-200 triggers exponential retry up to 3×

Three independent containment layers. A privileged syscall kills the process (seccomp). A fork bomb hits the cgroup PID limit. A network probe finds no route. All three must hold — defence in depth, not a single fence.

4 — Evaluation & latency budget

The async pattern under load

Why 202 instead of 200? The POST contract is: I have accepted your request and will process it. If we returned 200, the client would expect the response body to contain the final verdict — which we cannot provide until execution finishes. 202 lets us return immediately while being honest about what has happened: the submission is in the queue, not yet run.

This is the same pattern as image transcoding, email sending, or any job that is measured in seconds rather than milliseconds. The API tier and the worker tier scale completely independently: if a contest attracts 10× the usual submissions, operators spin up more workers without touching the API servers — because the queue absorbs the burst.

Queue depth as a health signal

Queue depth	Typical wait	Operational signal
0–50	< 1 s	Healthy; workers keeping up
50–500	1–10 s	Normal contest load; monitor trends
500–5000	10–100 s	Scale workers; alert on page if growing
> 5000	> 100 s	Worker shortage or crash; incident

Security isolation analysis

Why three layers and not one? Because each layer has a different failure mode:

Container/namespace isolation prevents a running process from seeing the host filesystem, other containers, or the network. But if the container runtime itself has a kernel exploit (rare but possible), the process could escape.
Seccomp syscall filtering is enforced by the kernel BPF engine — not by the container runtime. Even if the container boundary were breached, a process that cannot call socket() or ptrace() has severely limited escape paths.
Resource limits via cgroups prevent denial-of-service against the worker host — a solution that loops forever hits the wall-clock limit; a solution that allocates unbounded memory hits the cgroup ceiling and is killed, not the host.

🎯 Interview angle

The interviewer will almost certainly ask why the submission endpoint returns 202 rather than blocking until the verdict is ready. Be precise: it is not primarily about performance — it is about the correct semantics. HTTP 200 means "here is the result you asked for." HTTP 202 means "I have accepted your work and will process it." Blocking until execution finishes conflates acceptance with completion. Disentangling them is the design insight, not just an optimisation. From there, explain that it also lets the API and worker tiers scale independently.

⚠️ Common trap

Forgetting idempotency on submission. A client that times out after 5 s (execution takes 3 s, network adds 2 s) will retry. Without an idempotency key, you now have two submissions and two judge runs for the same solution. On a timed contest leaderboard, a double submission that gets judged twice can change a user's score or eat their attempt count. Always accept an idempotency key and deduplicate server-side — see Lesson rel-02.

✅ Do this, not that

Do expose both poll and webhook so clients choose the pattern that fits them. Don't force synchronous long-polling (holding an HTTP connection open during execution) — it holds a thread on the server, and clients behind NATs or corporate proxies get connections reset after 30–60 s of inactivity anyway.

Under the hood: the core mechanism

Two mechanisms make an online judge simultaneously safe and fair: the multi-layer sandbox that prevents user code from escaping, and the submit→queue→worker→verdict pipeline that ensures execution never blocks the API server. Understanding both at the implementation level lets you reason about what each verdict actually means and how to debug unexpected ones.

Sandbox isolation: how it actually works

When a worker dequeues a submission, it does not simply run the code on the host machine. It builds a tight cage using three independent kernel mechanisms, each of which would alone be insufficient:

Layer	Kernel mechanism	What it enforces	What escapes it
1 — Isolation	Linux namespaces (net, pid, mnt, ipc) + cgroups v2	No network routes; no visibility of host processes; no persistent filesystem writes; CPU ≤ 1 core, mem ≤ 256 MB, wall time ≤ 3 s	A container-runtime kernel exploit (rare, but real)
2 — Syscall filter	seccomp BPF (Berkeley Packet Filter)	Every syscall is checked against a BPF program before execution; denied calls send SIGKILL immediately — no exception is raised	Syscalls on the whitelist (read, write, mmap, exit…) still execute
3 — Filesystem	read-only bind-mounts + tmpfs	Judge binary and stdlib are read-only; test input is an ephemeral tmpfs mount; no writable paths survive after the container exits	Reads from the mounted read-only paths are allowed

The seccomp whitelist for a typical judge contains roughly 30–50 syscalls out of the ~400+ available on Linux x86-64. The filter is a compiled BPF bytecode program loaded into the kernel with prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bpf_prog) before the user binary starts. From that point, every syscall instruction in the user process first passes through the BPF VM; if the syscall number is not on the allow-list, the kernel delivers SIGKILL synchronously — there is no way for user code to catch or defer it.

The submit → queue → worker → verdict pipeline

The four stages are deliberately decoupled so that each one can scale and fail independently:

Stage 1 — API server (milliseconds)
  Client:  POST /v1/submissions  { problem_id, language, source_code }
           Idempotency-Key: <client-uuid>
  Server:
    1a. Check idempotency key in Redis/DB — if found, return stored submission
    1b. Validate request (language supported? code within size limit? rate limit OK?)
    1c. Write submission row to DB:  { id, status:"queued", code, problem_id, user_id }
    1d. Publish message to queue:   { submission_id: "sub_4mNpQ7" }
    1e. Return 202:  { id: "sub_4mNpQ7", status: "queued" }

Stage 2 — Message queue (milliseconds to seconds)
  Queue holds: [ "sub_4mNpQ7", "sub_...", ... ]
  Workers compete for messages; each message is leased (invisible) for 30 s.
  If no worker ACKs within 30 s, message becomes visible again → retry.

Stage 3 — Judge worker (1–5 seconds per submission)
  Worker:
    3a. Dequeue submission_id; fetch code + problem from DB
    3b. Update DB: status → "running"
    3c. Create sandbox container (namespace + cgroups + seccomp + ro-fs)
    3d. For each test case (sequentially):
          Inject test case stdin into tmpfs
          Run compiled binary (or interpreter) inside sandbox
          Capture stdout, stderr, exit code, wall time, peak RSS
          Compare stdout to expected output
          If any check fails → record failed_case, stop iteration
    3e. Compile all results → verdict
    3f. Write verdict to DB:  { status:"finished", verdict:"Accepted", runtime_ms, memory_kb }
    3g. ACK queue message (delete lease)
    3h. If webhook_url present → enqueue webhook delivery job

Stage 4 — Verdict delivery (poll or webhook)
  Poll:    Client GET /v1/submissions/sub_4mNpQ7 → reads DB row
  Webhook: Delivery worker POSTs result to client URL with HMAC signature

Worked trace: one submission from POST to verdict

T=0 ms    Client sends POST /v1/submissions
           Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000
           { problem_id:"prob_twosum", language:"python3", source_code:"class Solution:..." }

T=8 ms    API writes row to DB (status: queued), publishes to queue
           Returns: 202 { id:"sub_4mNpQ7", status:"queued" }

T=12 ms   Worker dequeues sub_4mNpQ7 (queue depth was 3, wait ≈ 0)
           Fetches submission + problem metadata from DB
           Updates DB: status → "running"

T=15 ms   Worker spawns sandbox container
           - unshare(CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWNS)
           - cgroup limits: cpu.max=1000000 1000000, memory.max=268435456, pids.max=64
           - seccomp BPF loaded
           - source_code written to tmpfs /sandbox/solution.py

T=18 ms   Compile step (Python: no-op; C++: g++ -O2 -o /sandbox/a.out /sandbox/sol.cpp)
T=70 ms   Test case 1/72: stdin → [2,7,11,15], 9
           stdout: [0,1]   expected: [0,1]   ✓   runtime: 48 ms, rss: 14 MB

           ... test cases 2–72 run in 48–96 ms each ...
T=3700 ms All 72 cases passed
           Verdict: Accepted   total_runtime: 3682 ms   peak_memory: 14320 KB

T=3705 ms DB update: { status:"finished", verdict:"Accepted", runtime_ms:3682, memory_kb:14320 }
           Queue message ACKed (deleted)

T=3710 ms Webhook job enqueued (if webhook_url was provided)

T=3800 ms Client polls GET /v1/submissions/sub_4mNpQ7
           ← 200 { status:"finished", verdict:"Accepted", runtime_ms:48, memory_kb:14320,
                   passed_cases:72, total_cases:72 }
           (runtime_ms here is the median/best-case per test case, not total wall time)

Key timing insight: the 3682 ms total wall time is the sum of 72 test case runs. The runtime_ms: 48 returned to the client is the per-case execution time (the slowest case), which is what's compared against the problem's time limit — not the full wall-clock duration of the judge run.

Operating & debugging it

Judge debugging splits into two categories: unexpected verdicts (code that should pass but doesn't, or passes when it shouldn't) and operational problems (slow verdicts, stuck queue, workers crashing). The tools and investigation paths are different for each.

Inspecting the pipeline

# Check queue depth (Redis-backed queue example) redis-cli LLEN judge:submissions:queue 42 ← healthy; workers keeping up 8400 ← incident: workers not draining, scale up or check for crashes # Check worker health: running submissions in DB psql -c "SELECT id, status, submitted_at, NOW()-submitted_at AS age FROM submissions WHERE status='running' ORDER BY submitted_at LIMIT 20;" sub_4mNpQ7 | running | 2025-09-01 14:22:11 | 00:00:03 ← normal sub_2xKpR1 | running | 2025-09-01 14:18:02 | 00:04:10 ← stuck! worker crashed mid-run # View sandbox container logs for a specific submission (Docker example) docker logs judge-worker-$(hostname) 2>&1 | grep sub_4mNpQ7 verdict=Accepted runtime_ms=48 memory_kb=14320 verdict=RuntimeError seccomp_violation syscall=socket ← user code tried to open a socket # Monitor worker throughput (submissions per minute) watch -n5 'redis-cli INFO stats | grep instantaneous_ops_per_sec' # A healthy worker fleet processes 3–5 submissions/s per worker at 2 s average TLE

Symptom → cause → fix

Symptom	Likely cause	Fix
Correct code gets "Runtime Error" on first submit, "Accepted" on retry	Worker OOM-killed mid-run (unrelated submission consumed RAM on same host), or cgroup memory limit too low for that language's runtime	Check cgroup memory.events on worker; raise memory limit for language; pin workers to dedicated hosts
"Time Limit Exceeded" on the judge but code runs in 50 ms locally	Judge wall-clock limit includes compile time + sandbox setup + all test cases; or judge machine is CPU-throttled under load	Report per-case runtime separately from setup time; dedicate CPU cores to workers; check for cgroup CPU throttling (`cpu.stat throttled_usec`)
"Runtime Error" with no meaningful output	seccomp SIGKILL (forbidden syscall) — Python's `os.system()`, `socket`, `subprocess` are common triggers	Log the syscall number from the seccomp audit log; surface it in the verdict detail to the user
Two judge runs created for one submission (duplicate on leaderboard)	Idempotency key not sent by client, or key store missed (Redis eviction under memory pressure)	Enforce idempotency key as required header; set Redis `maxmemory-policy allkeys-lru` on a separately provisioned instance from the queue
Submissions stay "queued" indefinitely during contest	Workers exhausted; queue depth growing faster than workers drain	Pre-warm workers before contest start; alert on queue depth > 500 with > 5 min trend
"Wrong Answer" on a test case the user claims is correct	Output has trailing whitespace, Windows line endings (\r\n), or floating-point precision differs from expected	Normalise stdout before comparison (strip trailing whitespace, normalise line endings); document precision expectations on problem statement
Webhook never arrives after verdict is ready	Webhook delivery worker not running, or client URL returning non-200 and exhausting retries silently	Expose webhook delivery attempt history in the API; alert on delivery failure rate; implement exponential backoff up to 3× before marking delivery failed

⚠️ The "stuck running" submission

If a worker crashes (OOM, kernel panic, node replacement) while a submission is "running," the queue message's visibility timeout will expire and another worker will pick it up — but the DB row still says "running" from the first worker. The second worker will try to update status to "running" again (harmless) and then complete normally. However, if the visibility timeout is longer than the problem's TLE (e.g. 30 s timeout, 2 s TLE), a submission can appear stuck in "running" for up to 30 s before recovery. Fix: set the queue visibility timeout to max(time_limit + compile_time + overhead, 15 s) and add a background sweeper that marks submissions "failed" if they have been "running" for more than 2× the visibility timeout with no worker heartbeat.

🧠 Quick check

1. The submission endpoint returns 202 Accepted. What does that status code communicate precisely?

202 Accepted means "I have the request; I will process it asynchronously." It says nothing about the outcome — only that the work has been queued. The verdict comes later via poll or webhook.

2. A user's Python script calls import socket; socket.connect(("8.8.8.8", 53)). What happens inside the sandbox?

Seccomp intercepts the socket syscall before the network namespace ever gets involved. The kernel BPF program sees a disallowed syscall and delivers SIGKILL. The judge records this as Runtime Error — there is no exception for the user code to catch.

3. Queue depth is rising rapidly during a contest. What should the on-call engineer do first?

A rising queue with healthy API servers means the production rate exceeds the consumption rate. The workers are the bottleneck — they execute the code. Adding workers increases throughput. Restarting API servers or rate-limiting submissions does not drain a queue that already exists.

4. A client submits the same solution twice with the same Idempotency-Key. What should the server return on the second request?

An idempotency key maps to a single logical operation. A duplicate request with the same key must return the same response as the first — not an error, not a new resource. The client cannot tell the difference between a retry and a new request; the server resolves the ambiguity by replaying the original outcome.

✍️ Exercise: design the "contest leaderboard" edge — submissions from 10 000 users arriving in the first 60 seconds

A competitive programming contest begins. 10 000 users submit in the first 60 seconds. Your judge workers can process 200 submissions per minute. Sketch: what happens to the queue, how you communicate wait time to users, and what safeguards prevent a single user from hoarding worker capacity.

Model answer:

At 10 000 submissions / 60 s = ~167 submissions/s incoming, and 200 submissions/min = ~3.3 submissions/s outgoing, the queue grows at roughly 163/s for the first minute. After 60 s, the queue holds ~9 800 submissions. At 200/min consumption, clearing takes ~49 minutes — unacceptably long for a 2-hour contest.

The correct response is not to design around this at the API level, but to pre-warm the worker tier before contest start. Queue depth is the operational signal: alert at depth > 500 before the contest to catch under-provisioning early.

For per-user fairness, impose a per-user submission rate limit: e.g., 1 submission per problem per 30 seconds. This prevents a single user from consuming disproportionate queue capacity (or from brute-forcing solutions). The queue position can be included in the GET poll response so users see their estimated wait:

{ "status": "queued", "queue_position": 312, "estimated_wait_s": 94 }

Rubric: ✓ Identified the queue imbalance quantitatively ✓ Proposed pre-warming, not API-level fixes ✓ Mentioned per-user rate limiting ✓ Surfaced queue position in the poll response ✓ Did not propose synchronous execution as a solution.

Key takeaways

Accept synchronously, execute asynchronously. 202 + queue decouples API latency from execution time and lets each tier scale independently.
Idempotency keys prevent double runs when clients retry on network timeout — use a client-generated key, deduplicate server-side.
Expose both poll and webhook for verdict delivery — browser clients poll, server-to-server integrations use webhooks.
The sandbox needs three layers: namespaces (isolation), seccomp (syscall filter), cgroups (resource limits) — each catches a different failure mode.
Queue depth is the primary scaling signal — rising depth means more workers, not more API servers.