Platform & API Product Engineering · Lesson 05

Multi-tenancy & Isolation

Every SaaS API serves multiple customers on the same infrastructure. How you draw the lines between them determines your cost structure, blast radius, compliance posture, and the degree to which one customer can harm another. Getting those lines wrong is the kind of mistake you discover at 2 AM when a noisy tenant crushes a paying customer's throughput — or worse, when a missing WHERE clause returns the wrong company's data.

⏱ 20 min · advanced · Prereq: HTTP basics, Database basics, AuthN/AuthZ

By the end you'll be able to

Name the three canonical isolation models and select the right one given tenant count, size distribution, and compliance requirements.
Describe the cross-tenant data leak (BOLA) risk in a pooled system and explain how PostgreSQL Row-Level Security enforces the tenant filter at the database layer.
Explain the noisy-neighbor problem and describe the bulkhead pattern that contains it.
Work through the cost arithmetic that determines when POOL beats SILO operationally.
Identify at least three per-tenant configuration concerns: encryption keys, data residency, and feature flags.

The three isolation models

Every multi-tenant system makes a fundamental choice about where to draw boundaries between tenants. There are three canonical positions, not a dozen — everything else is a variation or a marketing name for one of these.

Each model is a deliberate trade-off between infrastructure cost, isolation strength, and operational complexity.

The models differ on where the tenant boundary lives:

SILO — each tenant has its own infrastructure stack: separate application instances, separate database, separate everything. The boundary is a network or account boundary. The AWS account is the familiar unit for this: each account is a hard isolation boundary with its own resource namespace, which is why account-per-tenant is a common way to build a silo on AWS.
POOL — all tenants share the same infrastructure. The boundary is enforced by a tenant_id column on every row and a filter in every query. Salesforce pioneered this at scale — OrgId on every object, in every query, for hundreds of thousands of orgs.
BRIDGE — the application layer is shared, but the data layer is configurable per tenant. Small tenants land in the pool; enterprise tenants with compliance requirements or sheer data volume get their own isolated data store, routed to transparently.

Pros and cons across five dimensions

Dimension	SILO	POOL	BRIDGE
Infrastructure cost	Very high — scales linearly with tenant count	Very low — shared resources, marginal cost per tenant	Medium — pool for most; isolated infra only for large tenants
Isolation strength	Maximum — physical separation; no shared kernel, no shared DB	Logical — depends entirely on query-layer correctness	Logical for small tenants; physical for enterprise
Blast radius	Minimal — a crashed instance affects one tenant only	High — a bad migration or DB outage affects all tenants in the cluster	Medium — pool failure hits SMB tenants; enterprise tenants are isolated
Noisy-neighbor risk	None	High without explicit per-tenant resource controls	Low for enterprise; remains for the SMB pool
Ops complexity	High — N × (deploy, monitor, scale, patch) pipelines	Low — one codebase, one schema, one deploy	High — routing logic, schema variation, per-tenant provisioning for enterprise

Data isolation deep dive

The POOL model's entire security guarantee rests on one invariant: every query that touches tenant data includes a WHERE tenant_id = ? filter. Miss it once — in any query path, background job, report, or admin tool — and you have a cross-tenant data leak. In OWASP API Security terms, this is a BOLA (Broken Object-Level Authorization) vulnerability. See sec-05-authn-authz.html for the full BOLA taxonomy.

The unsafe query vs. the correct query

The canonical failure looks like this: a developer adds a "get invoice by ID" endpoint. They implement the authN check (valid JWT required) but forget to scope the DB query to the authenticated tenant.

-- ❌ UNSAFE: no tenant filter — any authenticated user can fetch any invoice
SELECT id, amount, status, customer_name
FROM   invoices
WHERE  id = $1;  -- $1 = invoice ID from request path

-- ✅ CORRECT: tenant_id bound at authentication layer, enforced at query layer
SELECT id, amount, status, customer_name
FROM   invoices
WHERE  id        = $1
AND    tenant_id = $2;  -- $2 = tenant_id extracted from validated JWT claim

The unsafe version authenticates the caller but does not authorize them for the specific object. Tenant B can brute-force or enumerate invoice IDs belonging to Tenant A. The correct version binds the query to the tenant extracted from the verified token — even if the caller guesses Tenant A's invoice ID, the query returns nothing.

The tenant_id must flow from the verified token claim through every query. A bypass at any layer — a background job, an admin API, a report query — is a data breach.
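The difference between the two queries can be exercised end to end. The following is a minimal sketch using an in-memory SQLite database as a stand-in for the real store; the function names and the sample rows are hypothetical, and in a real service the tenant_id argument would come from the verified token, never from request input.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices (id TEXT PRIMARY KEY, tenant_id TEXT NOT NULL, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?)",
    [("inv_1", "org_a", 100), ("inv_2", "org_b", 250)],
)


def get_invoice_unsafe(invoice_id):
    # No tenant filter: any authenticated caller can read any invoice.
    return conn.execute(
        "SELECT id, amount FROM invoices WHERE id = ?", (invoice_id,)
    ).fetchone()


def get_invoice(invoice_id, tenant_id):
    # Scoped lookup: a row is returned only if it belongs to the caller's tenant.
    return conn.execute(
        "SELECT id, amount FROM invoices WHERE id = ? AND tenant_id = ?",
        (invoice_id, tenant_id),
    ).fetchone()
```

Calling get_invoice("inv_1", "org_b") returns None even though inv_1 exists, which is the behavior a cross-tenant security test should assert.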

Schema and database strategies

Beyond tenant_id on rows, platforms use three structural approaches, each trading isolation for cost:

Shared schema, tenant_id column. All tenants share the same table. Cheapest, but every query must include the filter. The leak surface is large — any query path that doesn't enforce the filter is an exposure.
Schema-per-tenant. Each tenant gets their own PostgreSQL schema (namespace): tenant_acme.invoices vs. tenant_globex.invoices. Queries route to the right schema via SET search_path TO tenant_acme, set at connection setup or, behind a connection pooler, per transaction with SET LOCAL so the setting cannot carry over to the next tenant's request. Reduces cross-contamination risk; enables per-tenant schema variations; still shares the underlying database cluster (and its failure domain).
DB-per-tenant. Full physical isolation — each tenant has their own database instance or cluster. Maximum isolation; linear cost. Appropriate for regulated industries (healthcare, finance) where tenants may demand proof of separation.

PostgreSQL Row-Level Security (RLS)

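RLS is a PostgreSQL feature that enforces the tenant filter at the database layer. The application cannot accidentally skip it, because the database applies the policy to every statement issued by a role that is subject to RLS: reads, updates, and deletes silently skip rows that fail the policy, and inserts or updates that would produce a row for another tenant are rejected with an error. This is defense in depth: even if a developer writes an unsafe query, the DB returns nothing rather than leaking data.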

-- 1. Enable RLS on the table
ALTER TABLE invoices ENABLE ROW LEVEL SECURITY;
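ALTER TABLE invoices FORCE ROW LEVEL SECURITY;  -- applies even to the table owner (superusers and BYPASSRLS roles still bypass)

-- 2. Create the isolation policy
--    Assumes tenant_id is TEXT, matching the org_... identifiers used in this lesson.
--    For a UUID column, add ::uuid to the setting and store a valid UUID string in it.
CREATE POLICY tenant_isolation ON invoices
  USING (tenant_id = current_setting('app.current_tenant'));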

-- 3. Application code sets the tenant before any query in the transaction
SET LOCAL app.current_tenant = 'org_7f3a9c12';

-- Now this query returns only Tenant org_7f3a9c12's rows, enforced by the DB:
SELECT * FROM invoices;  -- implicitly filtered; no WHERE needed in application code

-- An unsafe application query — still safe with RLS active:
SELECT * FROM invoices WHERE id = $1;  -- RLS policy ANDs in tenant_id = current_setting(…)

✅ RLS as defense in depth

Use RLS even when your application already enforces the tenant filter on every query. Application code changes; people write new query paths under deadline pressure; ORMs can be configured incorrectly. RLS at the database layer is the last line of defense that doesn't depend on developer discipline. The cost is usually small: the policy adds an equality predicate on tenant_id, which an index on that column serves the same way it serves your explicit filter. Measure it on your heaviest queries, since a missing index or a more complex policy changes the picture.

Compute and throughput isolation

Data isolation prevents cross-tenant reads. Compute isolation prevents one tenant's traffic from degrading another's performance. In a pooled system, these are independent problems — you can have perfect data isolation and still have a noisy-neighbor problem that makes enterprise customers call you at midnight.

The noisy-neighbor problem

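In a shared thread pool, a tenant running a bulk data export (thousands of requests, many in flight at once, each holding a DB connection and a thread) can exhaust the pool and starve every other tenant. The mechanism: each request acquires a thread from the shared pool. With a pool of 100 threads and a tenant firing 150 concurrent requests, that tenant alone fills the pool; its remaining 50 requests queue, and so does every other tenant's traffic. Their P99 latency goes from 40ms to 4s. No one crossed a tenant boundary, and the data is perfectly isolated, but the customer experience is broken.

The bulkhead pattern contains this. Like the watertight compartments in a ship's hull, it partitions a shared resource (threads, DB connections, queue slots) per tenant or per tier, so a tenant that exhausts its own partition cannot take capacity from anyone else. It is a sibling of the circuit breaker rather than a form of it: the breaker stops calls to a failing dependency, while the bulkhead caps how much of a shared pool any one caller can hold. A pool that can fill is a pool that will fill.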
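A bulkhead can be as small as one bounded semaphore per tenant. The sketch below is illustrative only: the TenantBulkhead class and its method names are hypothetical, and it caps concurrent in-flight requests per tenant rather than partitioning a real thread pool.

```python
import threading


class TenantBulkhead:
    """Caps concurrent in-flight requests per tenant (hypothetical helper)."""

    def __init__(self, max_concurrent_per_tenant):
        self._limit = max_concurrent_per_tenant
        self._lock = threading.Lock()
        self._slots = {}  # tenant_id -> BoundedSemaphore

    def _semaphore(self, tenant_id):
        # Create each tenant's compartment lazily, under a lock.
        with self._lock:
            if tenant_id not in self._slots:
                self._slots[tenant_id] = threading.BoundedSemaphore(self._limit)
            return self._slots[tenant_id]

    def try_acquire(self, tenant_id):
        # Non-blocking: a tenant at its cap is turned away (for example with
        # HTTP 429) instead of waiting on capacity other tenants need.
        return self._semaphore(tenant_id).acquire(blocking=False)

    def release(self, tenant_id):
        self._semaphore(tenant_id).release()
```

A request handler would call try_acquire before doing work and release in a finally block. With a cap of 2, a third concurrent request from the same tenant is refused while other tenants are unaffected.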

Per-tenant rate limits and quotas

Rate limiting in a multi-tenant system is not just "protect the API from overload" — it is a fairness contract between tenants. The implementation follows the same algorithms as single-tenant rate limiting (token bucket, sliding window counter), but the key space is tenant:{tenant_id}:{endpoint} rather than ip:{ip}. Each tenant tier gets its own limit: free tier at 10 req/s, growth at 100 req/s, enterprise at 1,000 req/s with burst allowance. See also plat-01 (Rate Limits & Quotas) for the nested rate limit patterns that apply at the org, user, and API-key levels simultaneously.
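As a rough illustration of the tenant-keyed key space, here is a token-bucket sketch that uses the tier limits quoted above. The class name, the in-process dictionary (a shared store such as Redis would be needed across instances), and the choice of one second of traffic as bucket capacity are all assumptions made for the example.

```python
import time

# Requests per second per tier, taken from the tiers described above.
TIER_LIMITS = {"free": 10, "growth": 100, "enterprise": 1000}


class TenantRateLimiter:
    """Token bucket per tenant and endpoint (hypothetical, single-process)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._buckets = {}  # key -> (tokens, last_refill_time)

    def allow(self, tenant_id, endpoint, tier):
        rate = TIER_LIMITS[tier]
        key = f"tenant:{tenant_id}:{endpoint}"
        now = self._clock()
        # A new bucket starts full; capacity is assumed to be one second of traffic.
        tokens, last = self._buckets.get(key, (float(rate), now))
        # Refill in proportion to elapsed time, never above capacity.
        tokens = min(float(rate), tokens + (now - last) * rate)
        if tokens < 1.0:
            self._buckets[key] = (tokens, now)
            return False
        self._buckets[key] = (tokens - 1.0, now)
        return True
```

Because the key includes the tenant, one tenant draining its bucket has no effect on another tenant's bucket for the same endpoint.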

Fair scheduling and weighted fair queuing

When multiple tenants share a processing queue (a job queue, a request queue, a DB connection pool), pure FIFO scheduling allows one heavy tenant to monopolize the queue. Weighted fair queuing (WFQ) assigns each tenant a weight proportional to their tier, then schedules requests so that each backlogged tenant receives approximately their share of processing capacity over any sufficiently long window, regardless of request arrival pattern. In practice: with ten equal-weight tenants all backlogged, a tenant firing 1,000 requests into the queue still receives only its 10% share of throughput. WFQ is work-conserving, though: when other tenants are idle, their unused share is redistributed, so the heavy tenant can temporarily exceed 10%. If you need a hard ceiling even on an otherwise idle system, pair the scheduler with a per-tenant rate limit.
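One common way to approximate weighted fair queuing is to tag each job with a virtual finish time and always run the job with the smallest tag. The sketch below uses the self-clocked variant of that idea; the WeightedFairQueue class and its cost parameter are hypothetical names chosen for illustration, not a reference implementation.

```python
import heapq
import itertools


class WeightedFairQueue:
    """Self-clocked fair queuing over per-tenant weights (illustrative sketch)."""

    def __init__(self, weights):
        self._weights = weights        # tenant_id -> weight
        self._last_finish = {}         # tenant_id -> finish tag of its newest job
        self._virtual_time = 0.0       # finish tag of the most recently served job
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps ordering deterministic

    def enqueue(self, tenant_id, job, cost=1.0):
        # A job starts after the tenant's previous job, or at the current virtual time.
        start = max(self._virtual_time, self._last_finish.get(tenant_id, 0.0))
        finish = start + cost / self._weights[tenant_id]
        self._last_finish[tenant_id] = finish
        heapq.heappush(self._heap, (finish, next(self._seq), tenant_id, job))

    def dequeue(self):
        finish, _, tenant_id, job = heapq.heappop(self._heap)
        self._virtual_time = finish
        return tenant_id, job
```

If a heavy tenant enqueues four jobs and a light tenant then enqueues two, both with weight 1, the service order alternates heavy, light, heavy, light, and the heavy tenant's last two jobs run back to back once the light tenant's queue is empty. That last part is the work-conserving behavior of this sketch: idle capacity is not held back.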

Tenant-scoped resource naming

URL structure documents scope and gives access control an obvious place to check it. The /v1/orgs/{org_id}/... pattern makes the tenant scope explicit in the resource identifier, so both the client and the server can see at a glance whose data is being addressed. It also makes access control tests obvious: if org_id in the path doesn't match tenant_id in the token, reject the request immediately in middleware, before it ever reaches the query layer. The middleware check does not replace the query-layer filter: a nested ID such as invoice_id still has to be looked up with the tenant_id predicate.

-- Tenant-scoped URL namespace
GET  /v1/orgs/{org_id}/invoices
GET  /v1/orgs/{org_id}/invoices/{invoice_id}
POST /v1/orgs/{org_id}/invoices
GET  /v1/orgs/{org_id}/customers/{customer_id}/orders

-- Auth middleware: path org_id must match token claim
if params.org_id != token.claims.tenant_id:
    return 403  -- forbidden: path tenant doesn't match token

This pattern also makes test writing easy: security tests can assert that fetching /v1/orgs/org_A/invoices/{invoice_id_from_org_B} returns 403 or 404, not 200.
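The middleware check and its security test can be sketched in a few lines. The function name and the shape of the claims dictionary are assumptions for illustration; a real framework would raise or return its own response type rather than a bare status code.

```python
def authorize_tenant_path(path_org_id, token_claims):
    """Return 403 on tenant mismatch, 200 when the request may proceed."""
    token_tenant = token_claims.get("tenant_id")
    # A token without a tenant claim is treated as a mismatch (fail closed).
    if token_tenant is None or path_org_id != token_tenant:
        return 403
    return 200
```

A cross-tenant test then reduces to asserting that a token for org_B is refused on any /v1/orgs/org_A/... path.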

Per-tenant configuration

Encryption keys — envelope encryption

Storing all tenants' data encrypted with a single key means one key compromise exposes everyone. Per-tenant encryption isolates the blast radius. The standard pattern is envelope encryption:

Generate a Data Encryption Key (DEK) per tenant (or per table, per object — the granularity is a policy choice).
Encrypt the DEK with a Key Encryption Key (KEK) managed by a KMS (AWS KMS, GCP Cloud KMS, HashiCorp Vault). The wrapped DEK is stored with the data.
At query time: fetch the wrapped DEK, call KMS to unwrap it (one KMS API call, cacheable for the session), decrypt the data with the unwrapped DEK.
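To revoke a tenant's access to their data: destroy every copy of their wrapped DEK or, if each tenant has its own KEK, schedule that KEK for deletion in the KMS. Without a usable DEK the ciphertext is unreadable (crypto-shredding), so there is no need to re-encrypt or delete rows.

The KEK never leaves the KMS. The plaintext DEK exists only in application memory, for as long as it is cached, and is never written to disk. Envelope encryption is the pattern the major cloud KMS services document for this purpose, and it is a common basis for tenant-level encryption on regulated SaaS platforms.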

Data residency

EU customers may be contractually or legally required to have their data stored in EU data centers. This adds a routing problem: the same API endpoint must direct EU tenants' reads and writes to the EU cluster and everyone else to the default cluster. The common implementation: a global routing layer holds a tenant-to-region mapping; on every request, after extracting tenant_id, it proxies to the appropriate regional cluster. This mapping itself must be available globally (typically a small, highly-replicated lookup table with aggressive caching). See rel-16-consistency-cap.html for the consistency considerations this introduces — the routing table and the data are in different systems with different replication lag.
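The routing step itself is a small lookup. The sketch below assumes a hypothetical in-memory tenant-to-region mapping and placeholder cluster URLs; in practice the mapping would be the replicated, cached table described above.

```python
DEFAULT_REGION = "us"

# Hypothetical tenant-to-region mapping; only tenants with a residency
# requirement need an entry.
TENANT_REGION = {"org_eu_1": "eu"}

# Placeholder base URLs for the regional clusters.
REGION_BASE_URL = {
    "us": "https://us.cluster.example.internal",
    "eu": "https://eu.cluster.example.internal",
}


def resolve_cluster(tenant_id):
    """Return the base URL of the cluster that holds this tenant's data."""
    region = TENANT_REGION.get(tenant_id, DEFAULT_REGION)
    return REGION_BASE_URL[region]
```

Falling back to the default region keeps the example short, but it also means a stale or incomplete mapping would send an EU tenant to the wrong cluster. A stricter design could require an explicit entry for every tenant and refuse the request when none is found.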

Per-tenant feature flags

Feature flags in a multi-tenant system are scoped to tenants, not just code paths. This lets you: roll out a new API behavior to enterprise tenants first; run A/B experiments on a subset of tenants without affecting others; gate features behind plan tier; or disable a specific feature for a tenant who reported a bug. The implementation is a simple lookup: before executing any code path gated by a flag, resolve feature_flag(tenant_id, flag_name) — typically a Redis lookup with a fallback to a configuration database.
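The lookup-with-fallback shape can be sketched with plain dictionaries standing in for Redis and the configuration database. The FeatureFlags class, its constructor arguments, and the flag names in the usage note are hypothetical.

```python
class FeatureFlags:
    """Tenant-scoped flag lookup: cache first, then config store, then default."""

    def __init__(self, cache, config_store, defaults):
        self._cache = cache                # stand-in for Redis
        self._config_store = config_store  # stand-in for the configuration database
        self._defaults = defaults          # flag_name -> global default

    def is_enabled(self, tenant_id, flag_name):
        key = (tenant_id, flag_name)
        if key in self._cache:
            return self._cache[key]
        # Cache miss: fall back to the store, then to the global default.
        value = self._config_store.get(key, self._defaults.get(flag_name, False))
        self._cache[key] = value
        return value
```

For example, with a store containing {("org_a", "new_export"): True} and a default of False, is_enabled("org_a", "new_export") is True while every other tenant gets False. Note that the cache key includes the tenant, for the same reason every other cache key in a pooled system should.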

The cache cross-tenant leak

Caches are a frequent source of cross-tenant data leaks because the mistake is invisible in normal operation and surfaces only when two tenants resolve to the same cache key. The mechanism:

Tenant A requests GET /v1/invoices/summary. The cache layer stores the response under key invoices:summary.
Tenant B requests the same endpoint. The cache hits on invoices:summary and returns Tenant A's data to Tenant B.

The fix is exact and non-negotiable: every cache key must include the tenant ID. The correct key is invoices:summary:{tenant_id} or {tenant_id}:invoices:summary. See rel-07-caching.html for cache invalidation patterns and the full cache key design framework.

⚠️ Pitfall: forgetting tenant in cache keys

This leak is particularly dangerous because it doesn't throw an error and passes all functional tests (which typically run with a single tenant). It only manifests in production when two tenants make the same request in the same cache TTL window. Audit your cache key generation code as a standalone review pass — it is easy to miss when reviewing application logic holistically.
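One way to make the audit tractable is to route all key construction through a single helper that refuses to build a key without a tenant. The cache_key function below is a hypothetical sketch of that idea, using the {tenant_id}:invoices:summary layout.

```python
def cache_key(tenant_id, resource, *parts):
    """Build a tenant-prefixed cache key such as 'org_a:invoices:summary'."""
    if not tenant_id:
        # Fail loudly rather than fall back to a key shared across tenants.
        raise ValueError("tenant_id is required for every cache key")
    return ":".join([str(tenant_id), resource, *map(str, parts)])
```

With this in place, cache_key("org_a", "invoices", "summary") and cache_key("org_b", "invoices", "summary") can never collide, and a code search for direct string keys becomes the review checklist.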

Under the hood: the full BOLA walkthrough

Walk through the complete vulnerable-to-fixed progression, including how RLS catches what the application missed.

-- System: SaaS analytics platform, pooled multi-tenant, shared DB
-- Table: events(id TEXT, tenant_id TEXT, event_type TEXT, payload JSONB, created_at TIMESTAMPTZ)
--        (IDs are TEXT here to match the prefixed evt_/org_ values; with UUID columns, use real UUIDs and cast the setting with ::uuid)
-- Tenant A: org_7f3a9c12 | Tenant B: org_2d8b1e45

-- STEP 1: Tenant B makes an authenticated request
GET /v1/events/evt_00a1b2c3
Authorization: Bearer <JWT with tenant_id=org_2d8b1e45>

-- STEP 2: Application code (UNSAFE): auth passed, but no tenant filter
SELECT id, event_type, payload
FROM   events
WHERE  id = 'evt_00a1b2c3';
-- ✗ returns Tenant A's event! evt_00a1b2c3 belongs to org_7f3a9c12

-- STEP 3: Application code (FIXED): tenant_id from token injected into query
SELECT id, event_type, payload
FROM   events
WHERE  id        = 'evt_00a1b2c3'
AND    tenant_id = 'org_2d8b1e45';
-- ✓ returns 0 rows: correct, Tenant B cannot see Tenant A's events

-- STEP 4: With RLS active, even the UNSAFE query from STEP 2 is safe
SET LOCAL app.current_tenant = 'org_2d8b1e45';
SELECT id, event_type, payload
FROM   events
WHERE  id = 'evt_00a1b2c3';
-- PostgreSQL RLS policy ANDs in: tenant_id = current_setting('app.current_tenant')
-- Result: 0 rows. RLS caught what the application missed

RLS is not a replacement for application-level enforcement — setting app.current_tenant correctly is itself application work that could be misconfigured. It is a defense-in-depth layer that prevents the worst case: a bug in application code that would otherwise return another tenant's data.

By the numbers

Modeled cost comparison — all figures are estimates

10,000 tenants, SILO vs. POOL:

Model	Infrastructure	Unit cost (modeled)	Monthly total (modeled)
SILO	1 DB instance per tenant × 10,000	$200/month per small RDS instance	$2,000,000/month
POOL	~20 shared DB clusters, each serving ~500 tenants	$4,000/month per cluster (larger instance)	$80,000/month
BRIDGE	1 pool for 9,800 SMB + 200 isolated for 200 enterprise	Pool: $60k/mo; Enterprise instances: $400/mo each	~$140,000/month

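At 10,000 tenants, POOL is ~25× cheaper than SILO on infrastructure alone. In this simple model both totals grow linearly with tenant count, so the ratio stays near 25×. In practice the gap tends to widen, because SILO also pays per-tenant operational overhead and a pooled cluster can pack tenants more densely when their peak loads do not coincide.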

Blast radius in the POOL model. If each cluster serves 500 tenants and one cluster crashes, 500 tenants (5% of your total) are affected. With 20 independent clusters, a single-cluster outage is bounded to 5% of tenants rather than 100%, but that is still 500 customers. Cluster-level sharding is a blast-radius reduction strategy: the more clusters, the smaller the fraction affected per incident, but the higher the operational overhead.

Break-even: when does SILO become operationally untenable? The rough formula:

-- Break-even point: where POOL ops overhead < SILO infra + ops overhead

SILO_total     = N * (infra_per_tenant + ops_per_tenant_per_month)
POOL_total     = (N / tenants_per_cluster) * (cluster_cost + ops_per_cluster_per_month)

-- Approximate: at N ≈ 50 tenants, SILO is viable if tenants are high-value.
-- At N ≈ 500+, the ops burden (patching, monitoring, incident response per tenant)
-- dominates. At N ≈ 5,000+, only POOL or BRIDGE is operationally sustainable.

-- The inflection point moves with automation level:
-- fully automated provisioning + IaC pushes SILO viability to ~200–500 tenants.
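The two formulas can be evaluated directly against the modeled figures from the table. This is a sketch under the same assumptions as the table (all figures modeled); the function names are hypothetical, ops costs default to zero so the result is infrastructure only, and the cluster count is rounded up, which the formula above leaves implicit.

```python
import math


def silo_monthly(n_tenants, infra_per_tenant, ops_per_tenant=0.0):
    # SILO_total = N * (infra_per_tenant + ops_per_tenant_per_month)
    return n_tenants * (infra_per_tenant + ops_per_tenant)


def pool_monthly(n_tenants, tenants_per_cluster, cluster_cost, ops_per_cluster=0.0):
    # POOL_total = clusters * (cluster_cost + ops_per_cluster_per_month),
    # with the cluster count rounded up to a whole number.
    clusters = math.ceil(n_tenants / tenants_per_cluster)
    return clusters * (cluster_cost + ops_per_cluster)


silo = silo_monthly(10_000, 200)          # modeled: $200 per small instance
pool = pool_monthly(10_000, 500, 4_000)   # modeled: $4,000 per 500-tenant cluster
ratio = silo / pool
```

With the table's inputs this gives $2,000,000 for SILO, $80,000 for POOL, and a ratio of 25. Plugging in your own ops_per_tenant and ops_per_cluster estimates is what moves the break-even point.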

How real platforms do it

Platform	Model	Key mechanism	Reference
Salesforce	POOL — pioneered at scale	OrgId on every object in every table. All queries at the ORM layer include the OrgId filter. Hundreds of thousands of orgs in shared infrastructure since 2000.	Salesforce Multi-Tenant Architecture
Workday	POOL with shared kernel	Shared application kernel with tenant-aware data access layer. Configuration, workflows, and schema variations are layered on top of a shared base via a metadata-driven approach.	Workday Multi-Tenant Architecture
AWS	SILO — each account is the isolation unit	AWS accounts are hard isolation boundaries: separate IAM namespaces, separate resource ARNs, separate billing, separate API rate limit pools. Cross-account access requires explicit trust policies.	AWS SaaS Tenant Isolation Strategies (whitepaper)
Stripe	POOL with strong API-layer isolation	All API calls are scoped to an Account object. API keys are scoped to an account and cannot access other accounts' resources regardless of key permissions. Account ID is the first lookup on every API request.	Stripe API Authentication & Account Scoping

🎯 Interview angle: "How would you design a multi-tenant API?"

A senior answer names the three models and explains the selection criteria: tenant count, size distribution (homogeneous vs. tiered), compliance requirements, and budget. It then covers the data isolation mechanism (tenant_id + RLS as defense in depth), the compute isolation problem (bulkheads for noisy neighbors, per-tenant rate limits), and the operational dimensions (blast radius, deploy complexity). Candidates who answer only "use tenant_id in every query" have covered one of the four required dimensions. Candidates who propose SILO for 50,000 tenants without addressing the cost curve have not thought through the trade-offs at scale.

🧠 Quick check

1. Which isolation model has the highest infrastructure cost but the strongest isolation?

SILO gives each tenant their own infrastructure: separate DB, separate app instances, separate everything. This is physically the strongest isolation — a breach in one tenant's environment cannot spill to another. The cost is that infrastructure scales linearly with tenant count, making SILO impractical at thousands of tenants unless the per-tenant revenue justifies it.

2. In a pooled multi-tenant system, what is the BOLA risk?

BOLA (Broken Object-Level Authorization) in a pooled system means a valid authenticated user accesses a resource they do not own. The canonical mechanism: an API endpoint takes an object ID from the request path, queries the DB by that ID alone without adding a tenant_id filter, and returns whatever row matches — even if it belongs to another tenant. The fix is to always AND the tenant_id from the verified token into every data retrieval query.

3. Why must cache keys include the tenant ID?

Cache keys that omit the tenant ID create a namespace collision: the first tenant to populate the cache entry defines what all subsequent tenants receive until the TTL expires. This is a cross-tenant data leak that bypasses all database-level isolation controls — the DB query was correct, but the cached result is served to the wrong tenant. The fix is mechanical: always prefix or suffix cache keys with the tenant ID.

4. What is the primary purpose of PostgreSQL Row-Level Security (RLS) in a multi-tenant database?

RLS attaches a policy to a table that the database evaluates for every statement: SELECT, INSERT, UPDATE, DELETE. The policy ANDs a tenant_id predicate into every WHERE clause automatically and checks new or modified rows against the same predicate. This means an application bug that omits the tenant filter is caught by the database: the query returns only the rows belonging to the tenant set in the session variable, not all rows. It is defense in depth: the application should still enforce the filter, but RLS ensures a missed filter doesn't leak data.

✍️ Exercise: choose an isolation model for a mixed-tier analytics SaaS

Scenario: You are the platform architect for a SaaS analytics product. You currently have 500 tenants. The distribution is: 490 are small customers (free or growth tier, averaging 10 req/s each), and 10 are enterprise customers (each running 500 req/s, each paying $50k/year, each with a data processing agreement requiring EU data residency for their data). Recommend an isolation model and justify your recommendation. Address: data isolation, compute isolation, blast radius, cost, and compliance.

Model answer:

The correct recommendation is the BRIDGE model:

SMB pool (490 tenants): Put the 490 small tenants in a shared POOL on 2–3 DB clusters (each cluster serving ~150–200 tenants) with per-tenant row-level security and a tenant_id-keyed cache layer. Per-tenant rate limits at 10–50 req/s enforce the tier contract. Cost: 2–3 clusters at ~$4k/month = ~$10k/month for all SMB data infrastructure.
Enterprise silo (10 tenants): Each enterprise tenant gets their own DB cluster (or at minimum their own DB schema on a dedicated cluster) in the EU region. This satisfies the data residency requirement: their data never leaves EU infrastructure. It also keeps a failure or heavy query in one enterprise customer's data store from reaching any other enterprise customer. At 500 req/s each, they generate enough load to justify dedicated resources. Cost: 10 clusters at $400–800/month each = $4k–8k/month, paid for by the $50k/year revenue per tenant.
Shared application layer: One codebase, one deploy pipeline. A routing layer inspects the tenant_id on each request and directs to the correct data store (pool or dedicated). This keeps operational complexity manageable — you are not maintaining 11 separate application stacks.
Compute isolation: Implement per-tenant rate limits and bulkhead thread pools partitioned by tier. Each enterprise tenant gets its own thread-pool partition with a larger limit; SMB tenants share a pool with fair-queuing scheduling. A bulk export from an enterprise tenant cannot affect other enterprise tenants (separate partitions and separate data stores) and cannot affect SMB tenants (separate thread pool).

Why not pure SILO? At 500 tenants, a fully dedicated DB per tenant costs $200 × 500 = $100k/month. That is roughly 6× the BRIDGE cost (~$14k–18k/month: ~$10k for the SMB pool plus $4k–8k for the enterprise clusters) with no benefit for the SMB segment, which has no compliance requirement for isolation and low load. The operational burden (500 separate DB instances to patch, monitor, and back up) is also disproportionate.

Why not pure POOL? The 10 enterprise customers have a contractual data residency requirement. In a pure pool, you cannot guarantee EU data stays in EU without per-tenant routing complexity that is equivalent to BRIDGE anyway. Also, at 500 req/s each, enterprise customers are noisy neighbors even with rate limiting — dedicated infrastructure makes the performance contract enforceable.

Rubric:

Full marks: BRIDGE recommendation with justification across all five dimensions (data isolation, compute isolation, blast radius, cost arithmetic, compliance). Notes that enterprise tenants have data residency constraints that push them toward isolation regardless of model preference.
Partial marks: BRIDGE or POOL recommendation with justification for three or more dimensions.
Minimum pass: identifies that the heterogeneous tenant size distribution (10 very large, 490 small) is the key driver of the hybrid recommendation.
Deductions: recommending pure SILO without addressing the cost at 500 tenants; recommending pure POOL without addressing the data residency constraint for enterprise customers.

Key takeaways

Three models, not infinitely many: SILO (per-tenant infra), POOL (shared infra + tenant_id filter), and BRIDGE (shared app + isolated data for large tenants). Select based on tenant count, size distribution, and compliance requirements.
The BOLA risk in POOL is a missing WHERE clause. Every DB query that touches tenant data must include WHERE tenant_id = ?, bound to the tenant_id extracted from the verified token. PostgreSQL Row-Level Security enforces this at the DB layer as defense in depth.
The noisy-neighbor problem is a compute isolation problem, separate from data isolation. Bulkhead thread pools and connection pools per tenant tier prevent one heavy tenant from degrading all others.
Cache keys must include the tenant ID. Any cache key that omits it creates a cross-tenant data leak that bypasses all DB-level isolation controls.
At 10,000 tenants, POOL is ~25× cheaper than SILO on infrastructure. The operational overhead of maintaining N separate environments compounds this advantage. SILO is appropriate for regulated markets with hard isolation requirements and a small, high-value tenant count (typically under 200).