API Design

Platform & API Product Engineering · Lesson 05

Multi-tenancy & Isolation

Every SaaS API serves multiple customers on the same infrastructure. How you draw the lines between them determines your cost structure, blast radius, compliance posture, and the degree to which one customer can harm another. Getting those lines wrong is the kind of mistake you discover at 2 AM when a noisy tenant crushes a paying customer's throughput — or worse, when a missing WHERE clause returns the wrong company's data.

⏱ 20 min · Advanced · Prereq: HTTP basics, Database basics, AuthN/AuthZ

By the end you'll be able to

The three isolation models

Every multi-tenant system makes a fundamental choice about where to draw boundaries between tenants. There are three canonical positions, not a dozen — everything else is a variation or a marketing name for one of these.

[Fig 1: Three canonical isolation models. SILO: per-tenant infrastructure (Client A → App Instance A → DB A; Client B → App Instance B → DB B); hard walls, total isolation, high cost at scale. POOL: shared app layer and shared DB with tenant_id on every row; low infra cost, isolation via query filter. BRIDGE: shared app layer with a shared pool DB for SMB tenants and an isolated DB for large enterprise tenants; best of both, at the cost of routing complexity.]
Each model is a deliberate trade-off between infrastructure cost, isolation strength, and operational complexity.

The models differ on where the tenant boundary lives:

Pros and cons across five dimensions

Dimension | SILO | POOL | BRIDGE
Infrastructure cost | Very high: scales linearly with tenant count | Very low: shared resources, marginal cost per tenant | Medium: pool for most; isolated infra only for large tenants
Isolation strength | Maximum: physical separation; no shared kernel, no shared DB | Logical: depends entirely on query-layer correctness | Logical for small tenants; physical for enterprise
Blast radius | Minimal: a crashed instance affects one tenant only | High: a bad migration or DB outage affects all tenants in the cluster | Medium: pool failure hits SMB tenants; enterprise tenants are isolated
Noisy-neighbor risk | None | High without explicit per-tenant resource controls | Low for enterprise; remains for the SMB pool
Ops complexity | High: N × (deploy, monitor, scale, patch) pipelines | Low: one codebase, one schema, one deploy | High: routing logic, schema variation, per-tenant provisioning for enterprise

Data isolation deep dive

The POOL model's entire security guarantee rests on one invariant: every query that touches tenant data includes a WHERE tenant_id = ? filter. Miss it once — in any query path, background job, report, or admin tool — and you have a cross-tenant data leak. In OWASP API Security terms, this is a BOLA (Broken Object-Level Authorization) vulnerability. See sec-05-authn-authz.html for the full BOLA taxonomy.

The unsafe query vs. the correct query

The canonical failure looks like this: a developer adds a "get invoice by ID" endpoint. They implement the authN check (valid JWT required) but forget to scope the DB query to the authenticated tenant.

-- ❌ UNSAFE: no tenant filter — any authenticated user can fetch any invoice
SELECT id, amount, status, customer_name
FROM   invoices
WHERE  id = $1;  -- $1 = invoice ID from request path

-- ✅ CORRECT: tenant_id bound at authentication layer, enforced at query layer
SELECT id, amount, status, customer_name
FROM   invoices
WHERE  id        = $1
AND    tenant_id = $2;  -- $2 = tenant_id extracted from validated JWT claim

The unsafe version authenticates the caller but does not authorize them for the specific object. Tenant B can brute-force or enumerate invoice IDs belonging to Tenant A. The correct version binds the query to the tenant extracted from the verified token — even if the caller guesses Tenant A's invoice ID, the query returns nothing.
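
The difference is easy to reproduce locally. The following is a minimal sketch, not the lesson's implementation: it uses Python's built-in sqlite3 module (so placeholders are `?` rather than `$1`/`$2`), and the `fetch_invoice` helper, table layout, and tenant names are assumptions made for illustration.

```python
import sqlite3

def fetch_invoice(conn, invoice_id, tenant_id):
    """Return the invoice row only if it belongs to tenant_id, else None."""
    # tenant_id must come from the verified token, never from the request body.
    return conn.execute(
        "SELECT id, amount, status FROM invoices WHERE id = ? AND tenant_id = ?",
        (invoice_id, tenant_id),
    ).fetchone()

# Demo data: a single invoice owned by tenant_a.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices (id TEXT PRIMARY KEY, tenant_id TEXT, amount INTEGER, status TEXT)"
)
conn.execute("INSERT INTO invoices VALUES ('inv_1', 'tenant_a', 1200, 'paid')")

owner_view = fetch_invoice(conn, "inv_1", "tenant_a")     # the row is returned
attacker_view = fetch_invoice(conn, "inv_1", "tenant_b")  # None: right ID, wrong tenant
```

Returning None for a foreign ID lets the handler answer 404 without revealing that the invoice exists for someone else.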

[Fig 2: Correct enforcement. Client (JWT in header) → Auth Middleware (verify JWT signature, extract tenant_id claim) → Query Builder (injects tenant_id = ? into every WHERE clause) → Database (returns only rows where tenant_id matches) → tenant-scoped Response. If the middleware is skipped: cross-tenant leak (BOLA). tenant_id is extracted once and injected everywhere; skip any step and data leaks.]
The tenant_id must flow from the verified token claim through every query. A bypass at any layer — a background job, an admin API, a report query — is a data breach.

Schema and database strategies

Beyond tenant_id on rows, platforms use three structural approaches, each trading isolation for cost:

PostgreSQL Row-Level Security (RLS)

RLS is a PostgreSQL feature that enforces the tenant filter at the database layer. The application cannot accidentally bypass it, because the DB itself silently filters out rows that fail the policy on reads and rejects writes that would violate it. This is defense in depth: even if a developer writes an unsafe query, the DB returns nothing rather than leaking data.

-- 1. Enable RLS on the table
ALTER TABLE invoices ENABLE ROW LEVEL SECURITY;
ALTER TABLE invoices FORCE ROW LEVEL SECURITY;  -- applies even to the table owner
-- (superusers and roles with BYPASSRLS still bypass every policy)

-- 2. Create the isolation policy
--    The ::uuid cast assumes tenant_id is a uuid column; drop the cast if tenant IDs are text.
CREATE POLICY tenant_isolation ON invoices
  USING (tenant_id = current_setting('app.current_tenant')::uuid);

-- 3. Application code sets the tenant before any query in the transaction.
--    SET LOCAL only takes effect inside a transaction block.
--    'org_7f3a9c12' is an illustrative ID; with the ::uuid cast the real value must be a valid UUID.
BEGIN;
SET LOCAL app.current_tenant = 'org_7f3a9c12';

-- Now this query returns only Tenant org_7f3a9c12's rows, enforced by the DB:
SELECT * FROM invoices;  -- implicitly filtered; no WHERE needed in application code

-- An unsafe application query, still safe with RLS active:
SELECT * FROM invoices WHERE id = $1;  -- RLS policy ANDs in tenant_id = current_setting(…)
COMMIT;
✅ RLS as defense in depth

Use RLS even when your application already enforces the tenant filter on every query. Application code changes; people write new query paths under deadline pressure; ORMs can be configured incorrectly. RLS at the database layer is the last line of defense that doesn't depend on developer discipline. The cost is usually small: the policy adds a predicate on the indexed tenant_id column, which a correctly scoped query was already filtering on.
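
One practical detail: `SET LOCAL` does not accept bind parameters, so application code usually sets the tenant through PostgreSQL's `set_config(name, value, is_local)` function instead. The sketch below only builds the statement sequence for one tenant-scoped transaction; it does not talk to a database. The `tenant_scoped_statements` helper is hypothetical, and the `%s` placeholder style assumes a psycopg-like driver.

```python
def tenant_scoped_statements(tenant_id, query, params=()):
    """Build the (sql, params) pairs for one tenant-scoped transaction.

    The caller is expected to run them in order on a single connection.
    """
    if not tenant_id:
        # Fail closed: never run a query without a tenant context.
        raise ValueError("tenant_id is required; refusing to run an unscoped query")
    return [
        ("BEGIN", ()),
        # set_config(..., true) is transaction-local, like SET LOCAL, but it
        # takes the tenant ID as a bind parameter rather than as SQL text.
        ("SELECT set_config('app.current_tenant', %s, true)", (tenant_id,)),
        (query, tuple(params)),
        ("COMMIT", ()),
    ]

plan = tenant_scoped_statements(
    "org_7f3a9c12", "SELECT * FROM invoices WHERE id = %s", ("inv_1",)
)
```

Because the setting is transaction-local, it is discarded at COMMIT or ROLLBACK, which matters when connections are returned to a pool and reused by the next tenant.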

Compute and throughput isolation

Data isolation prevents cross-tenant reads. Compute isolation prevents one tenant's traffic from degrading another's performance. In a pooled system, these are independent problems — you can have perfect data isolation and still have a noisy-neighbor problem that makes enterprise customers call you at midnight.

The noisy-neighbor problem

In a shared thread pool, a tenant running a bulk data export (thousands of requests fired in parallel, each holding a DB connection and a thread) can exhaust the pool and starve every other tenant. The mechanism: each request acquires a thread from the shared pool. With a pool of 100 threads and a tenant firing 150 concurrent requests, the pool is full and 50 of that tenant's own requests are already waiting. All other tenants queue behind them. Their P99 latency goes from 40ms to 4s. No one crossed a tenant boundary, and the data is perfectly isolated, but the customer experience is broken.

[Fig 3: Bulkhead. ❌ Shared pool (noisy neighbor): Tenants A, B, and C share one thread pool; Tenant C's bulk export fills all threads, A and B are queued, P99 grows without bound, and DB connections are exhausted too. ✅ Bulkhead (per-tier pools): a router classifies requests by tenant tier; A and B use a Standard Pool (32 threads) with headroom available, while Tenant C's bulk export uses a Bulk Pool (8 threads) and is capped. Tenant tiers route to separate thread and connection pools, so Tenant C's bulk export cannot starve Tenant A or B.]
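The bulkhead pattern belongs to the same fault-isolation family as the circuit breaker, applied here to tenant isolation: it partitions capacity so that one failure domain cannot drain another. A pool that can fill is a pool that will fill.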
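
A bulkhead can be sketched with one bounded semaphore per tier. This is a minimal illustration under assumptions: the `Bulkhead` class and tier names are hypothetical, the pool sizes (32 and 8) are taken from the figure, and a real service would pair each bulkhead with its own DB connection pool.

```python
import threading

class Bulkhead:
    """Caps concurrent in-flight requests per tier; rejects instead of queueing."""

    def __init__(self, limits):
        # One semaphore per tier, sized to that tier's concurrency budget.
        self._slots = {tier: threading.BoundedSemaphore(n) for tier, n in limits.items()}

    def try_acquire(self, tier):
        # Non-blocking: a full pool returns False so the caller can answer 429 or 503.
        return self._slots[tier].acquire(blocking=False)

    def release(self, tier):
        # Call once per successful try_acquire, when the request finishes.
        self._slots[tier].release()

bulkhead = Bulkhead({"standard": 32, "bulk": 8})

# Tenant C's export tries to open 20 concurrent requests in the bulk pool.
admitted = sum(bulkhead.try_acquire("bulk") for _ in range(20))
# Tenants A and B still find a free slot in the standard pool.
standard_ok = bulkhead.try_acquire("standard")
```

Rejecting immediately rather than blocking is a deliberate choice in this sketch: a blocked caller still holds a thread, which would recreate the starvation the bulkhead is meant to prevent.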

Per-tenant rate limits and quotas

Rate limiting in a multi-tenant system is not just "protect the API from overload" — it is a fairness contract between tenants. The implementation follows the same algorithms as single-tenant rate limiting (token bucket, sliding window counter), but the key space is tenant:{tenant_id}:{endpoint} rather than ip:{ip}. Each tenant tier gets its own limit: free tier at 10 req/s, growth at 100 req/s, enterprise at 1,000 req/s with burst allowance. See also plat-01 (Rate Limits & Quotas) for the nested rate limit patterns that apply at the org, user, and API-key levels simultaneously.
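
As a sketch of the tenant-keyed key space, here is a small in-memory token bucket. It assumes a single process (a shared store such as Redis would be needed across instances), sets burst capacity equal to one second of traffic, and takes an injectable clock so the behavior can be checked deterministically. `TenantRateLimiter` and `TIER_LIMITS` are illustrative names.

```python
import time

TIER_LIMITS = {"free": 10, "growth": 100, "enterprise": 1000}  # req/s per tier

class TenantRateLimiter:
    """Token bucket per (tenant, endpoint); capacity equals one second of rate."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._buckets = {}  # key -> (tokens, last_refill_time)

    def allow(self, tenant_id, tier, endpoint):
        key = f"tenant:{tenant_id}:{endpoint}"
        rate = TIER_LIMITS[tier]
        now = self._clock()
        tokens, last = self._buckets.get(key, (float(rate), now))
        # Refill for the elapsed time, capped at the bucket capacity.
        tokens = min(float(rate), tokens + (now - last) * rate)
        if tokens >= 1.0:
            self._buckets[key] = (tokens - 1.0, now)
            return True
        self._buckets[key] = (tokens, now)
        return False

fake_now = [0.0]  # mutable cell so the demo can advance time by hand
limiter = TenantRateLimiter(clock=lambda: fake_now[0])

# A free-tier tenant sends 15 requests in the same instant: 10 pass, 5 are rejected.
free_allowed = sum(limiter.allow("t1", "free", "/invoices") for _ in range(15))
# A different tenant has its own bucket and is unaffected.
other_tenant_ok = limiter.allow("t2", "free", "/invoices")
```

Because the bucket key starts with the tenant, one tenant exhausting its budget never consumes another tenant's tokens.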

Fair scheduling and weighted fair queuing

When multiple tenants share a processing queue (a job queue, a request queue, a DB connection pool), pure FIFO scheduling allows one heavy tenant to monopolize the queue. Weighted fair queuing (WFQ) assigns each tenant a weight proportional to their tier, then schedules requests so that each backlogged tenant receives approximately their share of processing capacity over any window, regardless of request arrival pattern. In practice: a tenant firing 1,000 requests into the queue receives their fair share (say 10% of throughput if they are a standard-tier tenant among ten equal, busy tenants) and cannot push the others below theirs. WFQ is work-conserving, so when other tenants are idle their unused capacity goes to whoever has work queued; if a tenant must be capped even on an idle system, that is the job of a separate rate limit or quota.
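
The scheduling idea can be sketched with virtual finish times. This is a simplified, self-clocked variant rather than a full WFQ implementation: every job is assumed to cost one unit, weights are fixed up front, and the `WeightedFairQueue` class is a name chosen for illustration.

```python
import heapq
import itertools

class WeightedFairQueue:
    """Serves jobs in order of virtual finish time (unit cost divided by weight)."""

    def __init__(self, weights):
        self._weights = weights
        self._last_finish = {tenant: 0.0 for tenant in weights}
        self._virtual_time = 0.0
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps equal tags in arrival order

    def enqueue(self, tenant, job):
        # A tenant's next job starts after its previous one, or "now" if it was idle.
        start = max(self._virtual_time, self._last_finish[tenant])
        finish = start + 1.0 / self._weights[tenant]
        self._last_finish[tenant] = finish
        heapq.heappush(self._heap, (finish, next(self._seq), tenant, job))

    def dequeue(self):
        finish, _, tenant, job = heapq.heappop(self._heap)
        self._virtual_time = finish  # the clock advances with the job being served
        return tenant, job

    def __len__(self):
        return len(self._heap)

queue = WeightedFairQueue({"heavy": 1, "light": 1})
for i in range(6):
    queue.enqueue("heavy", f"export-{i}")  # the heavy tenant floods the queue first
for i in range(2):
    queue.enqueue("light", f"report-{i}")  # the light tenant arrives afterwards

order = [queue.dequeue()[0] for _ in range(len(queue))]
```

Under FIFO the light tenant would wait behind all six exports; here the two tenants alternate until the light tenant's queue is empty, after which the heavy tenant uses the remaining capacity. The sketch is work-conserving, so a hard per-tenant ceiling would need a separate limit.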

Tenant-scoped resource naming

URL structure is itself a form of access control and a documentation of scope. The /v1/orgs/{org_id}/... pattern makes the tenant scope explicit in the resource identifier — both the client and the server can see at a glance whose data is being addressed. It also makes access control tests obvious: if org_id in the path doesn't match tenant_id in the token, reject the request immediately in middleware, before it ever reaches the query layer.

-- Tenant-scoped URL namespace
GET  /v1/orgs/{org_id}/invoices
GET  /v1/orgs/{org_id}/invoices/{invoice_id}
POST /v1/orgs/{org_id}/invoices
GET  /v1/orgs/{org_id}/customers/{customer_id}/orders

-- Auth middleware: path org_id must match token claim
if params.org_id != token.claims.tenant_id:
    return 403  -- forbidden: path tenant doesn't match token

This pattern also makes test writing easy: security tests can assert that fetching /v1/orgs/org_A/invoices/{invoice_id_from_org_B} returns 403 or 404, not 200.
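
The middleware check and the security test it enables might look like the sketch below. It assumes the token has already been verified upstream and that its claims arrive as a plain dict; `authorize_org_path` is a hypothetical helper, and returning bare status codes stands in for a real framework response.

```python
def authorize_org_path(path, claims):
    """Return an HTTP status: 403 when the path's org differs from the token's tenant."""
    parts = path.strip("/").split("/")
    # Expect /v1/orgs/{org_id}/...
    if len(parts) < 3 or parts[0] != "v1" or parts[1] != "orgs":
        return 404
    # A missing tenant_id claim never equals a path segment, so this fails closed.
    if parts[2] != claims.get("tenant_id"):
        return 403
    return 200

own = authorize_org_path("/v1/orgs/org_A/invoices/inv_1", {"tenant_id": "org_A"})
cross = authorize_org_path("/v1/orgs/org_A/invoices/inv_1", {"tenant_id": "org_B"})
```

This check only proves the caller may address the org in the path. The query layer still has to scope the invoice lookup by tenant, otherwise a caller can put their own org in the path and another tenant's invoice ID after it.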

Per-tenant configuration

Encryption keys — envelope encryption

Storing all tenants' data encrypted with a single key means one key compromise exposes everyone. Per-tenant encryption isolates the blast radius. The standard pattern is envelope encryption:

  1. Generate a Data Encryption Key (DEK) per tenant (or per table, per object — the granularity is a policy choice).
  2. Encrypt the DEK with a Key Encryption Key (KEK) managed by a KMS (AWS KMS, GCP Cloud KMS, HashiCorp Vault). The wrapped DEK is stored with the data.
  3. At query time: fetch the wrapped DEK, call KMS to unwrap it (one KMS API call, cacheable for the session), decrypt the data with the unwrapped DEK.
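  4. To revoke access to a tenant's data (crypto-shredding): destroy the key material needed to decrypt it. Delete every copy of the wrapped DEK and evict cached plaintext DEKs, or, if each tenant has its own KEK, schedule that KEK for deletion in the KMS so the wrapped DEK can never be unwrapped again. The data becomes permanently inaccessible without needing to re-encrypt or delete rows.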

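The KEK never leaves the KMS. The plaintext DEK exists only in application memory, for the decryption operation or for as long as it is cached. Envelope encryption is the pattern the major cloud KMS services document for this purpose, and it is widely used by regulated SaaS platforms for tenant-level encryption.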

Data residency

EU customers may be contractually or legally required to have their data stored in EU data centers. This adds a routing problem: the same API endpoint must direct EU tenants' reads and writes to the EU cluster and everyone else to the default cluster. The common implementation: a global routing layer holds a tenant-to-region mapping; on every request, after extracting tenant_id, it proxies to the appropriate regional cluster. This mapping itself must be available globally (typically a small, highly-replicated lookup table with aggressive caching). See rel-16-consistency-cap.html for the consistency considerations this introduces — the routing table and the data are in different systems with different replication lag.
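
The routing step itself is small. The following sketch assumes an in-memory copy of the tenant-to-region mapping and made-up internal hostnames; `TENANT_REGION`, `REGION_BASE_URLS`, and `upstream_for` are illustrative names, not part of any real routing product.

```python
DEFAULT_REGION = "us"

# Hypothetical internal base URLs, one per regional cluster.
REGION_BASE_URLS = {
    "us": "https://us.cluster.internal.example",
    "eu": "https://eu.cluster.internal.example",
}

# Tenants pinned to a region by contract; everyone else uses the default.
TENANT_REGION = {"org_eu_1": "eu"}

def upstream_for(tenant_id, path):
    """Pick the regional cluster for this tenant and build the upstream URL."""
    region = TENANT_REGION.get(tenant_id, DEFAULT_REGION)
    return REGION_BASE_URLS[region] + path

eu_target = upstream_for("org_eu_1", "/v1/orgs/org_eu_1/invoices")
default_target = upstream_for("org_us_7", "/v1/orgs/org_us_7/invoices")
```

Falling back to the default region on a lookup miss matches the description above, but it also means a stale or partially replicated mapping can send an EU tenant's write to the wrong cluster. Where residency is contractual, treating an unknown tenant as an error rather than defaulting is a reasonable alternative.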

Per-tenant feature flags

Feature flags in a multi-tenant system are scoped to tenants, not just code paths. This lets you: roll out a new API behavior to enterprise tenants first; run A/B experiments on a subset of tenants without affecting others; gate features behind plan tier; or disable a specific feature for a tenant who reported a bug. The implementation is a simple lookup: before executing any code path gated by a flag, resolve feature_flag(tenant_id, flag_name) — typically a Redis lookup with a fallback to a configuration database.
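
A sketch of that lookup, with plain dicts standing in for the Redis cache and the configuration database (both stand-ins are assumptions, as is the exact `feature_flag` signature and the flag name used in the demo):

```python
def feature_flag(tenant_id, flag_name, cache, config_db, default=False):
    """Resolve a tenant-scoped flag: cache first, then the config store, then default."""
    key = f"flag:{tenant_id}:{flag_name}"  # the tenant is part of the key
    if key in cache:
        return cache[key]
    value = config_db.get((tenant_id, flag_name), default)
    cache[key] = value  # populate the cache for the next lookup
    return value

cache = {}
config_db = {("org_A", "new_export_api"): True}

enabled_for_a = feature_flag("org_A", "new_export_api", cache, config_db)
enabled_for_b = feature_flag("org_B", "new_export_api", cache, config_db)
```

A real cache entry would carry a TTL so that flipping a flag in the configuration store takes effect without a manual purge.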

The cache cross-tenant leak

Caches are a frequent source of cross-tenant data leaks because the mistake is invisible in normal operation and only surfaces when two tenants' requests resolve to the same cache key. The mechanism:

  1. Tenant A requests GET /v1/invoices/summary. The cache layer stores the response under key invoices:summary.
  2. Tenant B requests the same endpoint. The cache hits on invoices:summary and returns Tenant A's data to Tenant B.

The fix is exact and non-negotiable: every cache key must include the tenant ID. The correct key is invoices:summary:{tenant_id} or {tenant_id}:invoices:summary. See rel-07-caching.html for cache invalidation patterns and the full cache key design framework.
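
One way to make the rule hard to forget is to route every cache access through a single key builder that refuses to produce a key without a tenant. A minimal sketch, assuming a dict as the cache and hypothetical `cache_key` and `get_summary` helpers:

```python
def cache_key(tenant_id, resource):
    """Build a tenant-prefixed cache key; refuse to build one without a tenant."""
    if not tenant_id:
        raise ValueError("tenant_id is required in every cache key")
    return f"{tenant_id}:{resource}"

cache = {}

def get_summary(tenant_id, load):
    """Return the cached invoice summary for this tenant, loading it on a miss."""
    key = cache_key(tenant_id, "invoices:summary")
    if key not in cache:
        cache[key] = load(tenant_id)
    return cache[key]

# Two tenants hit the same endpoint within the same TTL window.
summary_a = get_summary("tenant_a", lambda t: {"owner": t, "total": 3})
summary_b = get_summary("tenant_b", lambda t: {"owner": t, "total": 7})
```

With the tenant in the key, the second request misses the cache and loads its own data instead of receiving the first tenant's entry.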

⚠️ Pitfall: forgetting tenant in cache keys

This leak is particularly dangerous because it doesn't throw an error and passes all functional tests (which typically run with a single tenant). It only manifests in production when two tenants make the same request in the same cache TTL window. Audit your cache key generation code as a standalone review pass — it is easy to miss when reviewing application logic holistically.

Under the hood: the full BOLA walkthrough

Walk through the complete vulnerable-to-fixed progression, including how RLS catches what the application missed.

-- System: SaaS analytics platform, pooled multi-tenant, shared DB
-- Table: events(id UUID, tenant_id UUID, event_type TEXT, payload JSONB, created_at TIMESTAMPTZ)
-- Tenant A: org_7f3a9c12 | Tenant B: org_2d8b1e45

-- STEP 1: Tenant B makes an authenticated request
GET /v1/events/evt_00a1b2c3
Authorization: Bearer <JWT with tenant_id=org_2d8b1e45>

-- STEP 2: Application code (UNSAFE): auth passed, but no tenant filter:
SELECT id, event_type, payload
FROM   events
WHERE  id = 'evt_00a1b2c3';
-- ✗ returns Tenant A's event! evt_00a1b2c3 belongs to org_7f3a9c12

-- STEP 3: Application code (FIXED): tenant_id from token injected into query:
SELECT id, event_type, payload
FROM   events
WHERE  id        = 'evt_00a1b2c3'
AND    tenant_id = 'org_2d8b1e45';
-- ✓ returns 0 rows: correct, Tenant B cannot see Tenant A's events

-- STEP 4: With RLS active, even the UNSAFE query from STEP 2 is safe:
SET LOCAL app.current_tenant = 'org_2d8b1e45';
SELECT id, event_type, payload
FROM   events
WHERE  id = 'evt_00a1b2c3';
-- PostgreSQL RLS policy ANDs in: tenant_id = current_setting('app.current_tenant')::uuid
-- Result: 0 rows. RLS caught what the application missed

RLS is not a replacement for application-level enforcement — setting app.current_tenant correctly is itself application work that could be misconfigured. It is a defense-in-depth layer that prevents the worst case: a bug in application code that would otherwise return another tenant's data.

By the numbers

Modeled cost comparison — all figures are estimates

10,000 tenants, SILO vs. POOL:

Model | Infrastructure | Unit cost (modeled) | Monthly total (modeled)
SILO | 1 DB instance per tenant × 10,000 | $200/month per small RDS instance | $2,000,000/month
POOL | ~20 shared DB clusters, each serving ~500 tenants | $4,000/month per cluster (larger instance) | $80,000/month
BRIDGE | Shared pool for 9,800 SMB tenants + 200 isolated instances for 200 enterprise tenants | Pool: $60k/mo; Enterprise instances: $400/mo each | ~$140,000/month

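At 10,000 tenants, POOL is ~25× cheaper than SILO on infrastructure alone ($2,000,000 vs. $80,000 per month in this model). The absolute gap widens with tenant count: SILO cost grows linearly with every tenant, while POOL cost grows one cluster at a time and can grow sub-linearly when each cluster absorbs more tenants as pooled utilization improves.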

Blast radius in the POOL model. If each cluster serves 500 tenants and one cluster crashes, 500 tenants (5% of your total) are affected. With 20 clusters and no traffic isolation, an outage is bounded to 5% of tenants rather than 100% — but that is still 500 customers. Cluster-level sharding is a blast-radius reduction strategy: the more clusters, the smaller the fraction affected per incident, but the higher the operational overhead.

Break-even: when does SILO become operationally untenable? The rough formula:

-- Break-even point: where POOL ops overhead < SILO infra + ops overhead

SILO_total     = N * (infra_per_tenant + ops_per_tenant_per_month)
POOL_total     = (N / tenants_per_cluster) * (cluster_cost + ops_per_cluster_per_month)

-- Approximate: at N ≈ 50 tenants, SILO is viable if tenants are high-value.
-- At N ≈ 500+, the ops burden (patching, monitoring, incident response per tenant)
-- dominates. At N ≈ 5,000+, only POOL or BRIDGE is operationally sustainable.

-- The inflection point moves with automation level:
-- fully automated provisioning + IaC pushes SILO viability to ~200–500 tenants.
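
The two formulas can be turned into a small calculator to reproduce the modeled figures. This is a sketch under the same modeled numbers as the table; the function names are illustrative, ops costs default to zero because the table counts infrastructure only, and the cluster count is rounded up, which the continuous formula above does not do.

```python
import math

def silo_total(n_tenants, infra_per_tenant, ops_per_tenant=0.0):
    """Monthly SILO cost: every tenant carries its own infra and ops cost."""
    return n_tenants * (infra_per_tenant + ops_per_tenant)

def pool_total(n_tenants, tenants_per_cluster, cluster_cost, ops_per_cluster=0.0):
    """Monthly POOL cost: pay per cluster, rounding up to whole clusters."""
    clusters = math.ceil(n_tenants / tenants_per_cluster)
    return clusters * (cluster_cost + ops_per_cluster)

silo = silo_total(10_000, 200)          # $200/month small instance per tenant
pool = pool_total(10_000, 500, 4_000)   # 20 clusters at $4,000/month each
ratio = silo / pool                     # how many times cheaper POOL is
blast_radius = 500 / 10_000             # share of tenants hit by one cluster outage
```

Plugging in non-zero `ops_per_tenant` and `ops_per_cluster` values for your own team is the quickest way to see where the break-even point sits for a given level of automation.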

How real platforms do it

Platform | Model | Key mechanism | Reference
Salesforce | POOL, pioneered at scale | OrgId on every object in every table. All queries at the ORM layer include the OrgId filter. Hundreds of thousands of orgs in shared infrastructure since 2000. | Salesforce Multi-Tenant Architecture
Workday | POOL with shared kernel | Shared application kernel with tenant-aware data access layer. Configuration, workflows, and schema variations are layered on top of a shared base via a metadata-driven approach. | Workday Multi-Tenant Architecture
AWS | SILO, each account is the isolation unit | AWS accounts are hard isolation boundaries: separate IAM namespaces, separate resource ARNs, separate billing, separate API rate limit pools. Cross-account access requires explicit trust policies. | AWS SaaS Tenant Isolation Strategies (whitepaper)
Stripe | POOL with strong API-layer isolation | All API calls are scoped to an Account object. API keys are scoped to an account and cannot access other accounts' resources regardless of key permissions. Account ID is the first lookup on every API request. | Stripe API Authentication & Account Scoping
🎯 Interview angle: "How would you design a multi-tenant API?"

A senior answer names the three models and explains the selection criteria: tenant count, size distribution (homogeneous vs. tiered), compliance requirements, and budget. It then covers the data isolation mechanism (tenant_id + RLS as defense in depth), the compute isolation problem (bulkheads for noisy neighbors, per-tenant rate limits), and the operational dimensions (blast radius, deploy complexity). Candidates who answer only "use tenant_id in every query" have covered one of the four required dimensions. Candidates who propose SILO for 50,000 tenants without addressing the cost curve have not thought through the trade-offs at scale.

🧠 Quick check

1. Which isolation model has the highest infrastructure cost but the strongest isolation?

SILO gives each tenant their own infrastructure: separate DB, separate app instances, separate everything. This is physically the strongest isolation — a breach in one tenant's environment cannot spill to another. The cost is that infrastructure scales linearly with tenant count, making SILO impractical at thousands of tenants unless the per-tenant revenue justifies it.

2. In a pooled multi-tenant system, what is the BOLA risk?

BOLA (Broken Object-Level Authorization) in a pooled system means a valid authenticated user accesses a resource they do not own. The canonical mechanism: an API endpoint takes an object ID from the request path, queries the DB by that ID alone without adding a tenant_id filter, and returns whatever row matches — even if it belongs to another tenant. The fix is to always AND the tenant_id from the verified token into every data retrieval query.

3. Why must cache keys include the tenant ID?

Cache keys that omit the tenant ID create a namespace collision: the first tenant to populate the cache entry defines what all subsequent tenants receive until the TTL expires. This is a cross-tenant data leak that bypasses all database-level isolation controls — the DB query was correct, but the cached result is served to the wrong tenant. The fix is mechanical: always prefix or suffix cache keys with the tenant ID.

4. What is the primary purpose of PostgreSQL Row-Level Security (RLS) in a multi-tenant database?

RLS attaches a policy to a table that the database evaluates for every statement: SELECT, INSERT, UPDATE, DELETE. For reads, updates, and deletes, the policy's tenant_id predicate is ANDed into the WHERE clause automatically; for inserted and updated rows, the new row is checked against the policy and rejected if it fails. This means an application bug that omits the tenant filter is caught by the database: the query returns only the rows belonging to the tenant set in the session variable, not all rows. It is defense in depth. The application should still enforce the filter, but RLS ensures a missed filter doesn't leak data.

✍️ Exercise: choose an isolation model for a mixed-tier analytics SaaS

Scenario: You are the platform architect for a SaaS analytics product. You currently have 500 tenants. The distribution is: 490 are small customers (free or growth tier, averaging 10 req/s each), and 10 are enterprise customers (each running 500 req/s, each paying $50k/year, each with a data processing agreement requiring EU data residency for their data). Recommend an isolation model and justify your recommendation. Address: data isolation, compute isolation, blast radius, cost, and compliance.

Model answer:

The correct recommendation is the BRIDGE model:

Why not pure SILO? At 500 tenants, a fully dedicated DB per tenant costs $200 × 500 = $100k/month. That is 10× the BRIDGE cost with no benefit for the SMB segment — they have no compliance requirement for isolation and their load is low. The operational burden (500 separate DB instances to patch, monitor, and back up) is also disproportionate.

Why not pure POOL? The 10 enterprise customers have a contractual data residency requirement. In a pure pool, you cannot guarantee EU data stays in EU without per-tenant routing complexity that is equivalent to BRIDGE anyway. Also, at 500 req/s each, enterprise customers are noisy neighbors even with rate limiting — dedicated infrastructure makes the performance contract enforceable.

Rubric:

Key takeaways

Sources & further reading