Design Case Studies · Lesson 11
Design: Online Judge API
Running user-submitted code is one of the most dangerous things a server can do. Designing it correctly means separating the act of accepting a submission from the act of running it — and then running it inside a cage it cannot escape.
By the end you'll be able to
- Explain why code execution is inherently asynchronous and why a synchronous submission API is a poor fit.
- Sketch the submission → queue → sandbox → verdict flow and name every component's responsibility.
- Identify the three security boundaries that prevent user code from affecting the host or other users.
1 — Requirements
An online judge does one thing: accept code, run it against test cases, and return a verdict. The operational constraints make that simple statement complicated.
Functional requirements
- Submit code: accept a problem ID, a language, and source code. Return an identifier the client can use to check status.
- Run in isolation: execute the code against a private set of judge test cases. User code must not be able to read the test cases, communicate with the network, or affect other submissions.
- Return a verdict: Accepted, Wrong Answer, Time Limit Exceeded, Memory Limit Exceeded, Runtime Error, or Compile Error — plus which test case failed and why.
- Async by default: execution can take up to a few seconds. The API must not hold an HTTP connection open waiting for it.
- Webhook or poll: clients need a way to learn when the verdict is ready, either by polling GET or by registering a webhook URL.
Non-functional requirements
- Security isolation: untrusted code runs in a sandboxed container with a read-only filesystem, no network, no host IPC, strict seccomp syscall filtering, and CPU/memory/time limits enforced by the kernel.
- Scale: submissions burst heavily around contest start times. The queue absorbs spikes; worker count scales independently of the API tier.
- Idempotency: a client retrying a submission (after a network timeout) should not produce two judge runs for the same code.
- Rate limiting: per-user and per-IP submission limits prevent compute abuse. See Lesson rel-03.
- Resource limits: each run has a wall-clock time limit (typically 1–3 s), a memory ceiling (typically 256 MB), and an output size cap to prevent fork bombs and output floods.
2 — Design decisions
Decision 1: Accept synchronously, execute asynchronously
A naive design executes code inline during the HTTP request and returns the verdict in the response body. This breaks under load: a single slow Python solution (3 s time limit) holds an HTTP worker thread for 3 seconds. At 100 concurrent slow submissions, 100 threads are blocked doing nothing. HTTP servers are designed for millisecond-range request handling — holding threads for seconds is an antipattern that collapses throughput.
The correct split is a 202 Accepted pattern: the API validates the submission, writes it to a queue, and returns an ID immediately. Workers pull from the queue at their own pace. This decoupling appears in Lesson rel-10 (Event-driven pub/sub): the API is the producer, judge workers are consumers.
Decision 2: Idempotency on submission
Network timeouts happen. A client that submits and then never receives a response will retry. Without idempotency, two identical retries produce two separate judge runs — burning compute and potentially producing two entries on a leaderboard. The fix is a client-generated idempotency key (see Lesson rel-02): if the key was seen before, the server returns the original submission's ID without creating a new one.
Decision 3: Verdict delivery — poll vs webhook
Clients need to know when execution finishes. Two options:
- Poll: client calls
GET /v1/submissions/:idrepeatedly. Simple, stateless, slightly wasteful. Good for browser clients that can't receive incoming connections. - Webhook: client registers a callback URL when submitting; the judge POSTs the verdict to that URL when done. Efficient for server-to-server integrations (CI pipelines, contest backends). Requires the client to run an HTTPS endpoint. See Lesson dbg-04 for reliability patterns.
Expose both. Browser clients poll; backend systems use webhooks.
Decision 4: Sandbox architecture
User code is hostile code until proven otherwise. The sandbox enforces isolation at multiple layers:
- Container isolation (Linux namespaces + cgroups): each run gets a fresh container. Network namespace has no external routes. PID namespace prevents processes from seeing or signaling the host or other containers.
- Read-only filesystem + mounted test cases: source code is injected into an ephemeral tmpfs. Test case input is mounted read-only. The judge binary itself is read-only. The user process cannot write to anything persistent.
- Seccomp syscall filter: only a minimal whitelist of syscalls is permitted — read, write, mmap, clock_gettime, exit.
execveto a different binary,socket,ptrace, and fork-into-fork are all blocked. Attempting a forbidden syscall kills the process immediately with a Runtime Error verdict.
Decision 5: Queue depth as the primary scaling lever
Contest start events create submission spikes that can be 20× the average rate. The queue absorbs that burst: the API tier scales with web traffic (fast, cheap), while the worker tier scales with execution throughput (slow, CPU-bound). Operators can pre-warm workers before a contest starts without touching the API layer.
3 — The API model
Submit a solution
POST /v1/submissions
Authorization: Bearer <user-token>
Idempotency-Key: client-generated-uuid
Content-Type: application/json
{
"problem_id": "prob_twosum",
"language": "python3",
"source_code": "class Solution:\n def twoSum...",
"webhook_url": "https://app.example.com/hooks/judge" // optional
}
# Accepted — execution has NOT started yet
HTTP/1.1 202 Accepted
{
"id": "sub_4mNpQ7",
"status": "queued",
"problem_id": "prob_twosum",
"language": "python3",
"submitted_at": "2025-09-01T14:22:11Z"
}
# If the Idempotency-Key was seen before, same 202 with the original submission id
# No duplicate judge run is created
Poll for the verdict
GET /v1/submissions/sub_4mNpQ7
Authorization: Bearer <user-token>
# While running
HTTP/1.1 200 OK
{
"id": "sub_4mNpQ7",
"status": "running",
"verdict": null
}
# Completed — Accepted
HTTP/1.1 200 OK
{
"id": "sub_4mNpQ7",
"status": "finished",
"verdict": "Accepted",
"runtime_ms": 48,
"memory_kb": 14320,
"passed_cases": 72,
"total_cases": 72
}
# Completed — Wrong Answer
{
"verdict": "Wrong Answer",
"failed_case": 5, // 1-indexed
"expected": "[0,1]",
"got": "[1,0]"
}
# Completed — Time Limit Exceeded
{
"verdict": "Time Limit Exceeded",
"failed_case": 61,
"limit_ms": 2000
}
Webhook delivery (when webhook_url was provided)
# Judge POSTs to the registered webhook URL when execution finishes
POST https://app.example.com/hooks/judge
X-Judge-Signature: sha256=... // HMAC — verify before trusting
Content-Type: application/json
{
"event": "submission.finished",
"submission_id": "sub_4mNpQ7",
"verdict": "Accepted",
"runtime_ms": 48
}
# Client must respond 200; non-200 triggers exponential retry up to 3×
4 — Evaluation & latency budget
The async pattern under load
Why 202 instead of 200? The POST contract is: I have accepted your request and will process it. If we returned 200, the client would expect the response body to contain the final verdict — which we cannot provide until execution finishes. 202 lets us return immediately while being honest about what has happened: the submission is in the queue, not yet run.
This is the same pattern as image transcoding, email sending, or any job that is measured in seconds rather than milliseconds. The API tier and the worker tier scale completely independently: if a contest attracts 10× the usual submissions, operators spin up more workers without touching the API servers — because the queue absorbs the burst.
Queue depth as a health signal
| Queue depth | Typical wait | Operational signal |
|---|---|---|
| 0–50 | < 1 s | Healthy; workers keeping up |
| 50–500 | 1–10 s | Normal contest load; monitor trends |
| 500–5000 | 10–100 s | Scale workers; alert on page if growing |
| > 5000 | > 100 s | Worker shortage or crash; incident |
Security isolation analysis
Why three layers and not one? Because each layer has a different failure mode:
- Container/namespace isolation prevents a running process from seeing the host filesystem, other containers, or the network. But if the container runtime itself has a kernel exploit (rare but possible), the process could escape.
- Seccomp syscall filtering is enforced by the kernel BPF engine — not by the container runtime. Even if the container boundary were breached, a process that cannot call
socket()orptrace()has severely limited escape paths. - Resource limits via cgroups prevent denial-of-service against the worker host — a solution that loops forever hits the wall-clock limit; a solution that allocates unbounded memory hits the cgroup ceiling and is killed, not the host.
The interviewer will almost certainly ask why the submission endpoint returns 202 rather than blocking until the verdict is ready. Be precise: it is not primarily about performance — it is about the correct semantics. HTTP 200 means "here is the result you asked for." HTTP 202 means "I have accepted your work and will process it." Blocking until execution finishes conflates acceptance with completion. Disentangling them is the design insight, not just an optimisation. From there, explain that it also lets the API and worker tiers scale independently.
Forgetting idempotency on submission. A client that times out after 5 s (execution takes 3 s, network adds 2 s) will retry. Without an idempotency key, you now have two submissions and two judge runs for the same solution. On a timed contest leaderboard, a double submission that gets judged twice can change a user's score or eat their attempt count. Always accept an idempotency key and deduplicate server-side — see Lesson rel-02.
Do expose both poll and webhook so clients choose the pattern that fits them. Don't force synchronous long-polling (holding an HTTP connection open during execution) — it holds a thread on the server, and clients behind NATs or corporate proxies get connections reset after 30–60 s of inactivity anyway.
Under the hood: the core mechanism
Two mechanisms make an online judge simultaneously safe and fair: the multi-layer sandbox that prevents user code from escaping, and the submit→queue→worker→verdict pipeline that ensures execution never blocks the API server. Understanding both at the implementation level lets you reason about what each verdict actually means and how to debug unexpected ones.
Sandbox isolation: how it actually works
When a worker dequeues a submission, it does not simply run the code on the host machine. It builds a tight cage using three independent kernel mechanisms, each of which would alone be insufficient:
| Layer | Kernel mechanism | What it enforces | What escapes it |
|---|---|---|---|
| 1 — Isolation | Linux namespaces (net, pid, mnt, ipc) + cgroups v2 | No network routes; no visibility of host processes; no persistent filesystem writes; CPU ≤ 1 core, mem ≤ 256 MB, wall time ≤ 3 s | A container-runtime kernel exploit (rare, but real) |
| 2 — Syscall filter | seccomp BPF (Berkeley Packet Filter) | Every syscall is checked against a BPF program before execution; denied calls send SIGKILL immediately — no exception is raised | Syscalls on the whitelist (read, write, mmap, exit…) still execute |
| 3 — Filesystem | read-only bind-mounts + tmpfs | Judge binary and stdlib are read-only; test input is an ephemeral tmpfs mount; no writable paths survive after the container exits | Reads from the mounted read-only paths are allowed |
The seccomp whitelist for a typical judge contains roughly 30–50 syscalls out of the ~400+ available on Linux x86-64. The filter is a compiled BPF bytecode program loaded into the kernel with prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bpf_prog) before the user binary starts. From that point, every syscall instruction in the user process first passes through the BPF VM; if the syscall number is not on the allow-list, the kernel delivers SIGKILL synchronously — there is no way for user code to catch or defer it.
The submit → queue → worker → verdict pipeline
The four stages are deliberately decoupled so that each one can scale and fail independently:
Stage 1 — API server (milliseconds)
Client: POST /v1/submissions { problem_id, language, source_code }
Idempotency-Key: <client-uuid>
Server:
1a. Check idempotency key in Redis/DB — if found, return stored submission
1b. Validate request (language supported? code within size limit? rate limit OK?)
1c. Write submission row to DB: { id, status:"queued", code, problem_id, user_id }
1d. Publish message to queue: { submission_id: "sub_4mNpQ7" }
1e. Return 202: { id: "sub_4mNpQ7", status: "queued" }
Stage 2 — Message queue (milliseconds to seconds)
Queue holds: [ "sub_4mNpQ7", "sub_...", ... ]
Workers compete for messages; each message is leased (invisible) for 30 s.
If no worker ACKs within 30 s, message becomes visible again → retry.
Stage 3 — Judge worker (1–5 seconds per submission)
Worker:
3a. Dequeue submission_id; fetch code + problem from DB
3b. Update DB: status → "running"
3c. Create sandbox container (namespace + cgroups + seccomp + ro-fs)
3d. For each test case (sequentially):
Inject test case stdin into tmpfs
Run compiled binary (or interpreter) inside sandbox
Capture stdout, stderr, exit code, wall time, peak RSS
Compare stdout to expected output
If any check fails → record failed_case, stop iteration
3e. Compile all results → verdict
3f. Write verdict to DB: { status:"finished", verdict:"Accepted", runtime_ms, memory_kb }
3g. ACK queue message (delete lease)
3h. If webhook_url present → enqueue webhook delivery job
Stage 4 — Verdict delivery (poll or webhook)
Poll: Client GET /v1/submissions/sub_4mNpQ7 → reads DB row
Webhook: Delivery worker POSTs result to client URL with HMAC signature
Worked trace: one submission from POST to verdict
T=0 ms Client sends POST /v1/submissions
Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000
{ problem_id:"prob_twosum", language:"python3", source_code:"class Solution:..." }
T=8 ms API writes row to DB (status: queued), publishes to queue
Returns: 202 { id:"sub_4mNpQ7", status:"queued" }
T=12 ms Worker dequeues sub_4mNpQ7 (queue depth was 3, wait ≈ 0)
Fetches submission + problem metadata from DB
Updates DB: status → "running"
T=15 ms Worker spawns sandbox container
- unshare(CLONE_NEWNET | CLONE_NEWPID | CLONE_NEWNS)
- cgroup limits: cpu.max=1000000 1000000, memory.max=268435456, pids.max=64
- seccomp BPF loaded
- source_code written to tmpfs /sandbox/solution.py
T=18 ms Compile step (Python: no-op; C++: g++ -O2 -o /sandbox/a.out /sandbox/sol.cpp)
T=70 ms Test case 1/72: stdin → [2,7,11,15], 9
stdout: [0,1] expected: [0,1] ✓ runtime: 48 ms, rss: 14 MB
... test cases 2–72 run in 48–96 ms each ...
T=3700 ms All 72 cases passed
Verdict: Accepted total_runtime: 3682 ms peak_memory: 14320 KB
T=3705 ms DB update: { status:"finished", verdict:"Accepted", runtime_ms:3682, memory_kb:14320 }
Queue message ACKed (deleted)
T=3710 ms Webhook job enqueued (if webhook_url was provided)
T=3800 ms Client polls GET /v1/submissions/sub_4mNpQ7
← 200 { status:"finished", verdict:"Accepted", runtime_ms:48, memory_kb:14320,
passed_cases:72, total_cases:72 }
(runtime_ms here is the median/best-case per test case, not total wall time)
Key timing insight: the 3682 ms total wall time is the sum of 72 test case runs. The runtime_ms: 48 returned to the client is the per-case execution time (the slowest case), which is what's compared against the problem's time limit — not the full wall-clock duration of the judge run.
Operating & debugging it
Judge debugging splits into two categories: unexpected verdicts (code that should pass but doesn't, or passes when it shouldn't) and operational problems (slow verdicts, stuck queue, workers crashing). The tools and investigation paths are different for each.
Inspecting the pipeline
Symptom → cause → fix
| Symptom | Likely cause | Fix |
|---|---|---|
| Correct code gets "Runtime Error" on first submit, "Accepted" on retry | Worker OOM-killed mid-run (unrelated submission consumed RAM on same host), or cgroup memory limit too low for that language's runtime | Check cgroup memory.events on worker; raise memory limit for language; pin workers to dedicated hosts |
| "Time Limit Exceeded" on the judge but code runs in 50 ms locally | Judge wall-clock limit includes compile time + sandbox setup + all test cases; or judge machine is CPU-throttled under load | Report per-case runtime separately from setup time; dedicate CPU cores to workers; check for cgroup CPU throttling (cpu.stat throttled_usec) |
| "Runtime Error" with no meaningful output | seccomp SIGKILL (forbidden syscall) — Python's os.system(), socket, subprocess are common triggers |
Log the syscall number from the seccomp audit log; surface it in the verdict detail to the user |
| Two judge runs created for one submission (duplicate on leaderboard) | Idempotency key not sent by client, or key store missed (Redis eviction under memory pressure) | Enforce idempotency key as required header; set Redis maxmemory-policy allkeys-lru on a separately provisioned instance from the queue |
| Submissions stay "queued" indefinitely during contest | Workers exhausted; queue depth growing faster than workers drain | Pre-warm workers before contest start; alert on queue depth > 500 with > 5 min trend |
| "Wrong Answer" on a test case the user claims is correct | Output has trailing whitespace, Windows line endings (\r\n), or floating-point precision differs from expected | Normalise stdout before comparison (strip trailing whitespace, normalise line endings); document precision expectations on problem statement |
| Webhook never arrives after verdict is ready | Webhook delivery worker not running, or client URL returning non-200 and exhausting retries silently | Expose webhook delivery attempt history in the API; alert on delivery failure rate; implement exponential backoff up to 3× before marking delivery failed |
If a worker crashes (OOM, kernel panic, node replacement) while a submission is "running," the queue message's visibility timeout will expire and another worker will pick it up — but the DB row still says "running" from the first worker. The second worker will try to update status to "running" again (harmless) and then complete normally. However, if the visibility timeout is longer than the problem's TLE (e.g. 30 s timeout, 2 s TLE), a submission can appear stuck in "running" for up to 30 s before recovery. Fix: set the queue visibility timeout to max(time_limit + compile_time + overhead, 15 s) and add a background sweeper that marks submissions "failed" if they have been "running" for more than 2× the visibility timeout with no worker heartbeat.
🧠 Quick check
1. The submission endpoint returns 202 Accepted. What does that status code communicate precisely?
202 Accepted means "I have the request; I will process it asynchronously." It says nothing about the outcome — only that the work has been queued. The verdict comes later via poll or webhook.
2. A user's Python script calls import socket; socket.connect(("8.8.8.8", 53)). What happens inside the sandbox?
Seccomp intercepts the socket syscall before the network namespace ever gets involved. The kernel BPF program sees a disallowed syscall and delivers SIGKILL. The judge records this as Runtime Error — there is no exception for the user code to catch.
3. Queue depth is rising rapidly during a contest. What should the on-call engineer do first?
A rising queue with healthy API servers means the production rate exceeds the consumption rate. The workers are the bottleneck — they execute the code. Adding workers increases throughput. Restarting API servers or rate-limiting submissions does not drain a queue that already exists.
4. A client submits the same solution twice with the same Idempotency-Key. What should the server return on the second request?
An idempotency key maps to a single logical operation. A duplicate request with the same key must return the same response as the first — not an error, not a new resource. The client cannot tell the difference between a retry and a new request; the server resolves the ambiguity by replaying the original outcome.
✍️ Exercise: design the "contest leaderboard" edge — submissions from 10 000 users arriving in the first 60 seconds
A competitive programming contest begins. 10 000 users submit in the first 60 seconds. Your judge workers can process 200 submissions per minute. Sketch: what happens to the queue, how you communicate wait time to users, and what safeguards prevent a single user from hoarding worker capacity.
Model answer:
At 10 000 submissions / 60 s = ~167 submissions/s incoming, and 200 submissions/min = ~3.3 submissions/s outgoing, the queue grows at roughly 163/s for the first minute. After 60 s, the queue holds ~9 800 submissions. At 200/min consumption, clearing takes ~49 minutes — unacceptably long for a 2-hour contest.
The correct response is not to design around this at the API level, but to pre-warm the worker tier before contest start. Queue depth is the operational signal: alert at depth > 500 before the contest to catch under-provisioning early.
For per-user fairness, impose a per-user submission rate limit: e.g., 1 submission per problem per 30 seconds. This prevents a single user from consuming disproportionate queue capacity (or from brute-forcing solutions). The queue position can be included in the GET poll response so users see their estimated wait:
{ "status": "queued", "queue_position": 312, "estimated_wait_s": 94 }
Rubric: ✓ Identified the queue imbalance quantitatively ✓ Proposed pre-warming, not API-level fixes ✓ Mentioned per-user rate limiting ✓ Surfaced queue position in the poll response ✓ Did not propose synchronous execution as a solution.
Key takeaways
- Accept synchronously, execute asynchronously. 202 + queue decouples API latency from execution time and lets each tier scale independently.
- Idempotency keys prevent double runs when clients retry on network timeout — use a client-generated key, deduplicate server-side.
- Expose both poll and webhook for verdict delivery — browser clients poll, server-to-server integrations use webhooks.
- The sandbox needs three layers: namespaces (isolation), seccomp (syscall filter), cgroups (resource limits) — each catches a different failure mode.
- Queue depth is the primary scaling signal — rising depth means more workers, not more API servers.