API Design

API Reliability · Lesson 19

Reliability: Chaos & Progressive Delivery

Writing code is only half the battle; a resilient production API must also be deployed safely under heavy load without degrading the user experience. This lesson covers progressive rollouts (canaries and blue-green deploys), connection draining during rolling updates, and chaos engineering fault injection to test resilience limits before outages occur.

⏱ ~16 min · Advanced · Prereq: rel-01, rel-14, rel-18


1. Progressive Delivery: Blue-Green vs. Canaries

Exposing a brand-new API version to 100% of production traffic at once is high risk. **Progressive Delivery** mitigates this by shifting traffic incrementally, allowing operations teams to observe performance signals (latencies, error rates) before committing to a full release.
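The incremental shift described above can be sketched as a small control loop. This is a minimal illustration, not a real deployment tool: `progressive_rollout`, `set_canary_percent`, and `is_healthy` are hypothetical names, and the stage percentages are assumptions.

```python
from typing import Callable, Sequence


def progressive_rollout(
    stages: Sequence[int],
    is_healthy: Callable[[int], bool],
    set_canary_percent: Callable[[int], None],
) -> bool:
    """Shift traffic stage by stage; roll back to 0% on the first bad signal.

    stages             -- increasing traffic percentages, e.g. [5, 25, 50, 100]
    is_healthy         -- hypothetical callback that inspects latency and
                          error-rate signals at the given percentage
    set_canary_percent -- hypothetical callback that reconfigures the router
    """
    for percent in stages:
        set_canary_percent(percent)
        if not is_healthy(percent):
            set_canary_percent(0)  # roll back: all traffic returns to old code
            return False
    return True  # every stage passed, the release is fully committed
```

In practice `is_healthy` would block for an observation window before answering; here it is left to the caller so the loop itself stays easy to test.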

| Strategy | How it works | Pros | Cons |
| --- | --- | --- | --- |
| Blue-Green Deployment | Maintain two identical server clusters: Blue (active production) and Green (new release staging). Switch load-balancer routing instantly from Blue to Green. | Zero-downtime cutover; instant rollback by flipping the router pointer back to Blue. | Double the infrastructure cost; database migrations must support both versions simultaneously. |
| Canary Release | Route a small fraction of real traffic (e.g. 2% to 5%) to a small cluster running the new code (the Canaries) while the remaining 95%+ flows to old code. | Detects memory leaks, exceptions, and latency regressions with minimal blast radius. | Complex routing required; sticky sessions can skew distribution metrics. |

Canary Shifting Keys
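A minimal sketch of keyed canary assignment, assuming the split is keyed on a stable identifier such as a user ID (the choice of key, the bucket count, and the function names are assumptions for illustration). Hashing the key gives each caller a fixed bucket, so the same caller keeps hitting the same version, and raising the percentage only ever adds callers to the canary.

```python
import hashlib

BUCKETS = 10_000  # resolution of the split: 0.01% per bucket


def canary_bucket(key: str) -> int:
    """Map a routing key to a stable bucket in [0, BUCKETS)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def routes_to_canary(key: str, canary_percent: float) -> bool:
    """Return True if this key falls inside the canary slice."""
    threshold = round(canary_percent * BUCKETS / 100)
    return canary_bucket(key) < threshold
```

Because assignment is sticky per key, a handful of very active callers can skew canary metrics, which is the sticky-session caveat noted in the comparison table.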

2. Graceful Shutdown & Connection Draining

During a rolling deployment, old containers/servers must be destroyed and replaced by new ones. If you kill the old process immediately, active HTTP requests will terminate mid-flight (returning `502 Bad Gateway` or connection resets to users), and WebSocket connections will disconnect abruptly.

**Graceful shutdown** means the application server handles termination signals cleanly: it stops accepting *new* connections, finishes processing *active* in-flight requests, and only then terminates the process.

Under the hood: Canary routing & connection draining

This diagram traces how a load balancer routes traffic progressively, shifting 5% to the canary nodes, and how old nodes drain connections gracefully upon receiving a SIGTERM signal.

[Figure: public traffic (HTTP / WS) reaches a load balancer that applies a weighted split. 95% of traffic goes to the active v1.0 pool, which has received SIGTERM and is draining active connections; 5% goes to the canary v1.1 pool (new release, active).]
Figure caption: Canary routing and connection draining sequence. The load balancer diverts 5% of requests to the new canary pool. During updates, the active v1.0 server receives a SIGTERM signal and ceases accepting new connections, entering a draining phase to complete in-flight transactions safely.

3. Chaos Engineering: Fault Injection

Resilience patterns like circuit breakers, retries, and rate limiters look excellent in architecture diagrams, but they often fail silently in production because they are rarely exercised. **Chaos Engineering** (popularized by Netflix's Chaos Monkey) proactively injects real failures into staging or production systems to confirm that these safeguards actually engage and that clients degrade gracefully.

Common Chaos Injection Vectors
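Two vectors that are easy to illustrate are added latency and forced errors. The wrapper below is a minimal sketch under stated assumptions: `with_chaos`, `InjectedFault`, and `call_with_fallback` are hypothetical names, and a real setup would usually inject faults at a proxy or gateway rather than in application code.

```python
import random
import time
from typing import Any, Callable, Optional


class InjectedFault(ConnectionError):
    """Stands in for a real downstream failure during a chaos experiment."""


def with_chaos(
    handler: Callable[..., Any],
    *,
    latency_s: float = 0.0,
    error_rate: float = 0.0,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[..., Any]:
    """Wrap a downstream call with optional delay and random failures."""
    rng = rng or random.Random()

    def wrapped(*args: Any, **kwargs: Any) -> Any:
        if latency_s > 0:
            sleep(latency_s)  # latency vector: simulate a slow dependency
        if rng.random() < error_rate:
            raise InjectedFault("chaos: injected failure")  # error vector
        return handler(*args, **kwargs)

    return wrapped


def call_with_fallback(call: Callable[[], Any], fallback: Any) -> Any:
    """The behaviour under test: degrade to a fallback instead of crashing."""
    try:
        return call()
    except ConnectionError:
        return fallback
```

`InjectedFault` subclasses `ConnectionError` so the client code under test handles it through the same path as a genuine network failure, without knowing an experiment is running.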

By the numbers: canary statistical significance math

When running a canary deployment, how long must you wait to be statistically confident that the new version is not introducing more errors than the old version?

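Governing Equations

For a two-proportion Z-test, the number of requests $N$ required in each group is: $$N = \frac{(Z_{\alpha/2} + Z_\beta)^2 \left[ p_1(1 - p_1) + p_2(1 - p_2) \right]}{(p_1 - p_2)^2}$$ The observation window then follows from the canary's request rate: $$Duration_{seconds} = \frac{N}{\text{canary QPS}}$$

Scenario Parameters

The worked calculations use the following values:
$p_1 = 0.001$ (a 0.1% error rate, taken here as the baseline) and $p_2 = 0.003$ (a 0.3% error rate, the degraded canary rate to detect).
$Z_{\alpha/2} = 1.96$ (95% confidence, two-sided) and $Z_\beta = 0.84$ (80% power).
Production traffic of 10,000 QPS with a 2% canary allocation, so the canary receives 200 QPS.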

Worked Calculations: Canary Evaluation Window Sizing

  1. Compute statistical constants: $$(Z_{\alpha/2} + Z_\beta)^2 = (1.96 + 0.84)^2 = 7.84$$
  2. Compute numerator variance: $$\text{Variance} = 0.001(0.999) + 0.003(0.997) = 0.000999 + 0.002991 = 0.00399$$
  3. Compute the denominator effect size: $$(p_1 - p_2)^2 = (0.001 - 0.003)^2 = (-0.002)^2 = 0.000004$$
  4. Calculate required requests $N$: $$N = \frac{7.84 \cdot 0.00399}{0.000004} = \frac{0.03128}{0.000004} \approx \mathbf{7{,}820 \text{ requests}}$$ At least 7,820 requests must hit the canary version before an error-rate change from 0.1% to 0.3% can be distinguished from random noise.
  5. Compute required runtime duration: at 10,000 QPS of production traffic, a 2% canary allocation receives 200 QPS: $$Duration_{seconds} = \frac{7{,}820}{200\text{ QPS}} = 39.1\text{ seconds}$$ At this traffic level, the canary collects enough requests in **under 1 minute** of observation.
⚠️ The Low-Traffic Canary Trap

If your API only receives 2 QPS, a 2% canary allocation receives just 0.04 QPS (1 request every 25 seconds). To collect the same 7,820 canary requests, you would need to run the canary for: $$\frac{7{,}820}{0.04\text{ QPS}} = 195{,}500\text{ seconds} \approx \mathbf{54.3\text{ hours}}$$ For low-traffic APIs, a small-percentage canary is impractical. Use Blue-Green deployments or larger canary percentages (e.g. 50%) to collect sufficient data quickly.
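The arithmetic in this section can be checked with a short script. This is a sketch that restates the sample-size formula in code; the function names are hypothetical, and the default Z values are the 1.96 and 0.84 used in the worked calculation.

```python
def canary_sample_size(
    p1: float, p2: float, z_alpha: float = 1.96, z_beta: float = 0.84
) -> float:
    """Requests needed per group to tell error rates p1 and p2 apart."""
    z_squared = (z_alpha + z_beta) ** 2
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return z_squared * variance / (p1 - p2) ** 2


def observation_seconds(
    required_requests: float, total_qps: float, canary_fraction: float
) -> float:
    """Seconds the canary must run to receive the required requests."""
    return required_requests / (total_qps * canary_fraction)


n = canary_sample_size(0.001, 0.003)  # roughly 7,820 requests
high_traffic_s = observation_seconds(n, total_qps=10_000, canary_fraction=0.02)
low_traffic_h = observation_seconds(n, total_qps=2, canary_fraction=0.02) / 3600
print(f"N = {n:.0f}, high traffic: {high_traffic_s:.1f} s, "
      f"low traffic: {low_traffic_h:.1f} h")
```

Raising `canary_fraction` to 0.5 in the low-traffic case cuts the window by a factor of 25, which is why a larger canary share is one of the suggested remedies.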

How to debug & inspect it

Send SIGTERM and watch the shutdown logs to confirm draining, then exercise failure paths by adding fault-injection headers to a request.

# 1. Trigger and inspect graceful shutdown logs in a Go/Node API server
$ kill -SIGTERM $(pgrep server)
[2026-07-02T16:08] INFO: SIGTERM received. Initiating graceful shutdown...
[2026-07-02T16:08] INFO: Load balancer connection pool detached. New connections rejected.
[2026-07-02T16:08] INFO: Draining connections... 142 requests in-flight.
[2026-07-02T16:09] INFO: Drained all connections. Exiting cleanly (code 0).

# 2. Trigger active chaos latency injection via API gateway debug headers
$ curl -i -H "X-Chaos-Inject-Latency: 3000" -H "X-Chaos-Inject-Rate: 1.0" \
    https://api-staging.example.com/v1/users/self
HTTP/1.1 504 Gateway Timeout
X-Circuit-Status: Open
# Verdict: Downstream gateway timeout handled, circuit breaker opened successfully.

The table below maps each operation to its typical failure symptom and the design mitigation:

| Operation | Symptom of Failure | Design Mitigation |
| --- | --- | --- |
| Rolling Deployment | Client requests drop with 502/504 errors or connection resets during deployment | Implement SIGTERM trapping; pause new connections; set a 30s draining buffer before exit. |
| Canary Release verification | Faulty code version deployed to 100% due to false-negative metrics under low traffic | Calculate sample thresholds; dynamically adjust canary sizes based on traffic volume. |
| Fault Tolerance Audit | System experiences cascading outages because fallback logic contains bugs | Proactively inject faults (chaos engineering) using staging headers to verify circuit breakers. |
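The canary verification row can be sketched as an automated gate that refuses to decide until the sample threshold is met. This is an illustrative sketch only: `canary_verdict` is a hypothetical name, the pooled two-proportion Z-test is one reasonable choice of test, and the defaults (7,820 requests, a 1.96 cutoff) are assumptions borrowed from the worked example in this lesson.

```python
import math


def canary_verdict(
    base_errors: int,
    base_total: int,
    canary_errors: int,
    canary_total: int,
    min_canary_requests: int = 7820,
    z_threshold: float = 1.96,
) -> str:
    """Return "wait", "rollback", or "promote" for a running canary."""
    if base_total <= 0:
        raise ValueError("baseline must have received traffic")
    if canary_total < min_canary_requests:
        return "wait"  # too few samples: a clean result could be a false negative

    p_base = base_errors / base_total
    p_canary = canary_errors / canary_total
    pooled = (base_errors + canary_errors) / (base_total + canary_total)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / base_total + 1 / canary_total))
    if std_err == 0:
        return "promote"  # both groups show identical rates (e.g. no errors at all)

    z = (p_canary - p_base) / std_err
    return "rollback" if z > z_threshold else "promote"
```

For example, 30 errors in 10,000 canary requests against 100 errors in 100,000 baseline requests gives a Z-score above 5, so the gate returns `"rollback"`; the same comparison with only 500 canary requests returns `"wait"`.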

🧠 Quick check

1. What is the primary difference between a Blue-Green deployment and a Canary release?

Blue-Green shifts traffic completely from one environment (Blue) to another (Green). Canary releases shift traffic in small percentages (e.g. 5%) to observe behavior on a small target group before scale-up.

2. Why should a deployment stop the server with SIGTERM rather than SIGKILL during updates?

SIGKILL cannot be trapped: it terminates the process immediately, cutting off active user requests. SIGTERM can be trapped by the server code, enabling connection draining where the process finishes in-flight requests before exiting.

3. Z-test calculations show that low-traffic APIs require days to reach statistical confidence in canary tests. How should you deploy changes to these APIs?

Under low QPS, canary data collects too slowly to yield statistical significance in a reasonable window. Instead, use Blue-Green deployments paired with automated validation tests to confirm release health quickly.

4. What is the goal of chaos engineering latency injection?

By artificially delaying requests, chaos engineering simulates downstream service exhaustion, allowing developers to verify that client timeouts are correctly configured and circuit breakers open to protect resources.

✍️ Exercise: design a graceful shutdown wrapper

Write out the pseudocode for a server wrapper that traps the terminating signal, stops accepting new requests, and exits after active requests resolve.


Model answer:

A graceful shutdown wrapper traps `SIGTERM`, answers new requests with `503` so the load balancer detaches the node, tracks the in-flight request count, and exits once that count reaches zero or a drain timeout expires:

class GracefulServer {
    active_requests_count = 0  // update atomically if requests run concurrently
    is_shutting_down = false
    
    function start() {
        register_signal_handler("SIGTERM", handle_sigterm)
    }
    
    function handle_request(req, res) {
        if is_shutting_down {
            res.set_status(503) // Tell LB/Client to try another server
            res.write("Server is shutting down")
            return
        }
            
        active_requests_count += 1
        try {
            process_request(req, res)
        } finally {
            active_requests_count -= 1
        }
    }
    
    function handle_sigterm() {
        is_shutting_down = true
        logger.info("SIGTERM received. Starting connection draining.")
        
        // Let the load balancer detect 503s or health-check failure to detach us
        sleep(5) // seconds
        
        drain_timeout = 30 // seconds max to wait
        start_time = current_time()
        
        while active_requests_count > 0 {
            if current_time() - start_time > drain_timeout {
                logger.warn("Force shutdown: drain timeout reached with active requests remaining")
                system_exit(1) // non-zero exit marks an unclean drain
            }
            sleep(0.5) // seconds; poll until requests finish
        }
            
        logger.info("All requests drained. Exiting.")
        system_exit(0)
    }
}
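For comparison, the same idea can be sketched in runnable Python. This is a minimal sketch, not a full server: `GracefulDrain` and `install_sigterm_handler` are hypothetical names, and the HTTP layer is left out so that the draining logic stands alone. It uses a condition variable instead of polling, which also makes the request counter safe across threads.

```python
import signal
import threading


class GracefulDrain:
    """Tracks in-flight requests and lets shutdown wait for them to finish."""

    def __init__(self, drain_timeout_s: float = 30.0) -> None:
        self.drain_timeout_s = drain_timeout_s
        self.shutting_down = False
        self._active = 0
        self._cond = threading.Condition()

    def try_begin_request(self) -> bool:
        """Return False once shutdown has begun (caller should answer 503)."""
        with self._cond:
            if self.shutting_down:
                return False
            self._active += 1
            return True

    def end_request(self) -> None:
        """Call in a finally block after every accepted request."""
        with self._cond:
            self._active -= 1
            self._cond.notify_all()

    def begin_shutdown(self, *_signal_args: object) -> None:
        """Usable as a signal handler: it only flips a flag."""
        self.shutting_down = True

    def wait_drained(self) -> bool:
        """Block until no requests remain; False means the timeout expired."""
        with self._cond:
            return self._cond.wait_for(
                lambda: self._active == 0, timeout=self.drain_timeout_s
            )


def install_sigterm_handler(drain: GracefulDrain) -> None:
    """Register the drain flag flip for SIGTERM (call from the main thread)."""
    signal.signal(signal.SIGTERM, drain.begin_shutdown)
```

A main loop would wait for `shutting_down` to become true, pause a few seconds so the load balancer can detach the node, call `wait_drained()`, and exit with code 0 when it returns `True` or a non-zero code when it returns `False`.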
