Platform & API Product Engineering · Lesson 05
Multi-tenancy & Isolation
Every SaaS API serves multiple customers on the same infrastructure. How you draw the lines between them determines your cost structure, blast radius, compliance posture, and the degree to which one customer can harm another. Getting those lines wrong is the kind of mistake you discover at 2 AM when a noisy tenant crushes a paying customer's throughput — or worse, when a missing WHERE clause returns the wrong company's data.
By the end you'll be able to
- Name the three canonical isolation models and select the right one given tenant count, size distribution, and compliance requirements.
- Describe the cross-tenant data leak (BOLA) risk in a pooled system and explain how PostgreSQL Row-Level Security enforces the tenant filter at the database layer.
- Explain the noisy-neighbor problem and describe the bulkhead pattern that contains it.
- Work through the cost arithmetic that determines when POOL beats SILO operationally.
- Identify at least three per-tenant configuration concerns: encryption keys, data residency, and feature flags.
The three isolation models
Every multi-tenant system makes a fundamental choice about where to draw boundaries between tenants. There are three canonical positions, not a dozen — everything else is a variation or a marketing name for one of these.
The models differ on where the tenant boundary lives:
- SILO: each tenant has its own infrastructure stack: separate application instances, separate database, separate everything. The boundary is a physical network or account boundary. AWS runs this way: each AWS account is a hard isolation unit with separate resource namespaces.
- POOL: all tenants share the same infrastructure. The boundary is enforced by a tenant_id column on every row and a filter in every query. Salesforce pioneered this at scale: OrgId on every object, in every query, for hundreds of thousands of orgs.
- BRIDGE: the application layer is shared, but the data layer is configurable per tenant. Small tenants land in the pool; enterprise tenants with compliance requirements or sheer data volume get their own isolated data store, routed to transparently.
Pros and cons across five dimensions
| Dimension | SILO | POOL | BRIDGE |
|---|---|---|---|
| Infrastructure cost | Very high — scales linearly with tenant count | Very low — shared resources, marginal cost per tenant | Medium — pool for most; isolated infra only for large tenants |
| Isolation strength | Maximum — physical separation; no shared kernel, no shared DB | Logical — depends entirely on query-layer correctness | Logical for small tenants; physical for enterprise |
| Blast radius | Minimal — a crashed instance affects one tenant only | High — a bad migration or DB outage affects all tenants in the cluster | Medium — pool failure hits SMB tenants; enterprise tenants are isolated |
| Noisy-neighbor risk | None | High without explicit per-tenant resource controls | Low for enterprise; remains for the SMB pool |
| Ops complexity | High — N × (deploy, monitor, scale, patch) pipelines | Low — one codebase, one schema, one deploy | High — routing logic, schema variation, per-tenant provisioning for enterprise |
Data isolation deep dive
The POOL model's entire security guarantee rests on one invariant: every query that touches tenant data includes a WHERE tenant_id = ? filter. Miss it once — in any query path, background job, report, or admin tool — and you have a cross-tenant data leak. In OWASP API Security terms, this is a BOLA (Broken Object-Level Authorization) vulnerability. See sec-05-authn-authz.html for the full BOLA taxonomy.
The unsafe query vs. the correct query
The canonical failure looks like this: a developer adds a "get invoice by ID" endpoint. They implement the authN check (valid JWT required) but forget to scope the DB query to the authenticated tenant.
-- ❌ UNSAFE: no tenant filter — any authenticated user can fetch any invoice
SELECT id, amount, status, customer_name
FROM invoices
WHERE id = $1; -- $1 = invoice ID from request path
-- ✅ CORRECT: tenant_id bound at authentication layer, enforced at query layer
SELECT id, amount, status, customer_name
FROM invoices
WHERE id = $1
AND tenant_id = $2; -- $2 = tenant_id extracted from validated JWT claim
The unsafe version authenticates the caller but does not authorize them for the specific object. Tenant B can brute-force or enumerate invoice IDs belonging to Tenant A. The correct version binds the query to the tenant extracted from the verified token — even if the caller guesses Tenant A's invoice ID, the query returns nothing.
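The difference between the two queries can be exercised end to end. The following is a minimal sketch using Python's built-in sqlite3 module rather than PostgreSQL, so placeholders are `?` instead of `$1`; the table contents and the function names `get_invoice_unsafe` and `get_invoice_scoped` are hypothetical.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build an in-memory table holding one invoice per tenant (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE invoices (id INTEGER PRIMARY KEY, tenant_id TEXT, amount INTEGER)"
    )
    conn.executemany(
        "INSERT INTO invoices (id, tenant_id, amount) VALUES (?, ?, ?)",
        [(1, "tenant_a", 500), (2, "tenant_b", 900)],
    )
    return conn


def get_invoice_unsafe(conn: sqlite3.Connection, invoice_id: int):
    # Looks up by ID alone: any authenticated caller can read any tenant's row.
    return conn.execute(
        "SELECT id, tenant_id, amount FROM invoices WHERE id = ?",
        (invoice_id,),
    ).fetchone()


def get_invoice_scoped(conn: sqlite3.Connection, invoice_id: int, tenant_id: str):
    # tenant_id is assumed to come from the verified token, never from the request.
    return conn.execute(
        "SELECT id, tenant_id, amount FROM invoices WHERE id = ? AND tenant_id = ?",
        (invoice_id, tenant_id),
    ).fetchone()
```

Calling `get_invoice_scoped(conn, 1, "tenant_b")` returns `None` because invoice 1 belongs to tenant_a, while `get_invoice_unsafe(conn, 1)` hands the row to whoever asks.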
Schema and database strategies
Beyond tenant_id on rows, platforms use three structural approaches, each trading isolation for cost:
- Shared schema, tenant_id column. All tenants share the same table. Cheapest, but every query must include the filter. The leak surface is large — any query path that doesn't enforce the filter is an exposure.
- Schema-per-tenant. Each tenant gets their own PostgreSQL schema (namespace): tenant_acme.invoices vs. tenant_globex.invoices. Queries route to the right schema at connection setup via SET search_path TO tenant_acme. Reduces cross-contamination risk; enables per-tenant schema variations; still shares the underlying database cluster (and its failure domain).
- DB-per-tenant. Full physical isolation: each tenant has their own database instance or cluster. Maximum isolation; linear cost. Appropriate for regulated industries (healthcare, finance) where tenants may demand proof of separation.
PostgreSQL Row-Level Security (RLS)
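RLS is a PostgreSQL feature that enforces the tenant filter at the database layer. An application query that forgets the filter still cannot see other tenants' rows, because the database silently excludes rows that fail the policy and rejects writes that would violate it. This is defense in depth: even if a developer writes an unsafe query, the DB returns nothing rather than leaking data. Superusers and roles with BYPASSRLS are exempt from policies, so the application must connect as an ordinary role.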
-- 1. Enable RLS on the table
ALTER TABLE invoices ENABLE ROW LEVEL SECURITY;
ALTER TABLE invoices FORCE ROW LEVEL SECURITY; -- applies even to the table owner
-- 2. Create the isolation policy
CREATE POLICY tenant_isolation ON invoices
USING (tenant_id = current_setting('app.current_tenant')::uuid);
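-- 3. Application code sets the tenant at the start of each transaction.
--    SET LOCAL only lasts until the transaction ends, so it must run inside BEGIN ... COMMIT.
--    The value must be a valid UUID because the policy casts it with ::uuid
--    (the UUID below is illustrative).
BEGIN;
SET LOCAL app.current_tenant = '7f3a9c12-5b1e-4c0a-9d2f-3e8a6b4c1d70';
-- Now this query returns only that tenant's rows, enforced by the DB:
SELECT * FROM invoices; -- implicitly filtered; no WHERE needed in application code
-- An unsafe application query is still safe with RLS active:
SELECT * FROM invoices WHERE id = $1; -- RLS policy ANDs in tenant_id = current_setting(…)
COMMIT;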
Use RLS even when your application already enforces the tenant filter on every query. Application code changes; people write new query paths under deadline pressure; ORMs can be configured incorrectly. RLS at the database layer is the last line of defense that doesn't depend on developer discipline. The cost is usually small: the policy adds a tenant_id predicate that an index on tenant_id can serve, and a correctly scoped query needed that predicate anyway.
Compute and throughput isolation
Data isolation prevents cross-tenant reads. Compute isolation prevents one tenant's traffic from degrading another's performance. In a pooled system, these are independent problems — you can have perfect data isolation and still have a noisy-neighbor problem that makes enterprise customers call you at midnight.
The noisy-neighbor problem
In a shared thread pool, a tenant running a bulk data export (thousands of requests fired concurrently, each holding a DB connection and a thread) can exhaust the pool and starve every other tenant. The mechanism: each request acquires a thread from the shared pool. With a pool of 100 threads and a tenant firing 150 concurrent requests, that tenant alone fills the pool. All other tenants queue. Their P99 latency goes from 40ms to 4s. No one crossed a tenant boundary, and the data is perfectly isolated, but the customer experience is broken.
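The bulkhead pattern named in the learning objectives contains this failure by capping how much of the shared pool any one tenant can hold. Below is a minimal sketch of a per-tenant concurrency cap; the class name `TenantBulkhead` and the limit of 20 are assumptions for illustration, not values from the lesson.

```python
import threading


class TenantBulkhead:
    """Caps the number of in-flight requests each tenant may hold at once."""

    def __init__(self, max_concurrent_per_tenant: int) -> None:
        self._limit = max_concurrent_per_tenant
        self._lock = threading.Lock()
        self._in_flight: dict[str, int] = {}

    def try_acquire(self, tenant_id: str) -> bool:
        """Reserve a slot for the tenant; False means reject (e.g. HTTP 429) or queue."""
        with self._lock:
            current = self._in_flight.get(tenant_id, 0)
            if current >= self._limit:
                return False
            self._in_flight[tenant_id] = current + 1
            return True

    def release(self, tenant_id: str) -> None:
        """Return a slot when the request finishes (call from a finally block)."""
        with self._lock:
            current = self._in_flight.get(tenant_id, 0)
            if current > 0:
                self._in_flight[tenant_id] = current - 1
```

With a cap of 20 against the 100-thread pool from the example, the tenant firing 150 concurrent requests gets 20 slots and 130 rejections, and the remaining 80 threads stay available to everyone else.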
Per-tenant rate limits and quotas
Rate limiting in a multi-tenant system is not just "protect the API from overload" — it is a fairness contract between tenants. The implementation follows the same algorithms as single-tenant rate limiting (token bucket, sliding window counter), but the key space is tenant:{tenant_id}:{endpoint} rather than ip:{ip}. Each tenant tier gets its own limit: free tier at 10 req/s, growth at 100 req/s, enterprise at 1,000 req/s with burst allowance. See also plat-01 (Rate Limits & Quotas) for the nested rate limit patterns that apply at the org, user, and API-key levels simultaneously.
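As a rough illustration of the tenant-keyed key space, here is a token bucket sketch that uses the tier limits quoted above. It keeps state in a local dict where a real deployment would likely use a shared store; the class name, the `burst_seconds` parameter, and the explicit `now` argument (passed in to keep the example deterministic) are assumptions.

```python
from dataclasses import dataclass

# Requests per second per tier, taken from the lesson text.
TIER_LIMITS = {"free": 10.0, "growth": 100.0, "enterprise": 1000.0}


@dataclass
class Bucket:
    tokens: float
    last_refill: float


class TenantRateLimiter:
    """Token bucket keyed by tenant:{tenant_id}:{endpoint}."""

    def __init__(self, burst_seconds: float = 1.0) -> None:
        # Bucket capacity = rate * burst_seconds (hypothetical burst policy).
        self._burst_seconds = burst_seconds
        self._buckets: dict[str, Bucket] = {}

    def allow(self, tenant_id: str, endpoint: str, tier: str, now: float) -> bool:
        rate = TIER_LIMITS[tier]
        capacity = rate * self._burst_seconds
        key = f"tenant:{tenant_id}:{endpoint}"
        bucket = self._buckets.get(key)
        if bucket is None:
            bucket = Bucket(tokens=capacity, last_refill=now)
            self._buckets[key] = bucket
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = max(0.0, now - bucket.last_refill)
        bucket.tokens = min(capacity, bucket.tokens + elapsed * rate)
        bucket.last_refill = now
        if bucket.tokens >= 1.0:
            bucket.tokens -= 1.0
            return True
        return False
```

Because the key includes the tenant ID, one tenant draining its bucket has no effect on another tenant's bucket for the same endpoint.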
Fair scheduling and weighted fair queuing
When multiple tenants share a processing queue (a job queue, a request queue, a DB connection pool), pure FIFO scheduling allows one heavy tenant to monopolize the queue. Weighted fair queuing (WFQ) assigns each tenant a weight proportional to their tier, then schedules requests so that each backlogged tenant receives approximately their share of processing capacity over any window, regardless of request arrival pattern. In practice: a tenant firing 1,000 requests into the queue receives their fair share under contention (say 10% of throughput if they are a standard-tier tenant among ten equal, busy tenants) and cannot crowd out the others. WFQ is work-conserving, so when other tenants are idle the heavy tenant may use the spare capacity; a hard ceiling that holds even on an idle system is the job of the per-tenant rate limit, not the scheduler.
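One way to sketch the scheduling idea is with virtual finish times: each job is stamped with a finish tag that advances by cost divided by weight, and the scheduler always serves the smallest tag. This is a simplified, self-clocked approximation of weighted fair queuing, not a production scheduler; the class and tenant names are hypothetical.

```python
import heapq
import itertools


class WeightedFairQueue:
    """Serves jobs in order of virtual finish time (simplified WFQ)."""

    def __init__(self, weights: dict[str, float]) -> None:
        self._weights = weights            # tenant_id -> weight (higher = larger share)
        self._virtual_time = 0.0           # finish tag of the most recently served job
        self._last_finish: dict[str, float] = {}
        self._heap: list = []
        self._seq = itertools.count()      # tie-breaker that preserves arrival order

    def enqueue(self, tenant_id: str, job: object, cost: float = 1.0) -> None:
        # A tenant's next job starts where its previous job finished, or at the
        # current virtual time if the tenant has been idle.
        start = max(self._virtual_time, self._last_finish.get(tenant_id, 0.0))
        finish = start + cost / self._weights[tenant_id]
        self._last_finish[tenant_id] = finish
        heapq.heappush(self._heap, (finish, next(self._seq), tenant_id, job))

    def dequeue(self):
        finish, _, tenant_id, job = heapq.heappop(self._heap)
        self._virtual_time = finish
        return tenant_id, job

    def __len__(self) -> int:
        return len(self._heap)
```

If a heavy tenant enqueues 1,000 jobs and an equally weighted light tenant then enqueues 10, the first 20 jobs served alternate between the two, whereas FIFO would make the light tenant wait behind all 1,000.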
Tenant-scoped resource naming
URL structure is itself a form of access control and a documentation of scope. The /v1/orgs/{org_id}/... pattern makes the tenant scope explicit in the resource identifier — both the client and the server can see at a glance whose data is being addressed. It also makes access control tests obvious: if org_id in the path doesn't match tenant_id in the token, reject the request immediately in middleware, before it ever reaches the query layer.
# Tenant-scoped URL namespace
GET /v1/orgs/{org_id}/invoices
GET /v1/orgs/{org_id}/invoices/{invoice_id}
POST /v1/orgs/{org_id}/invoices
GET /v1/orgs/{org_id}/customers/{customer_id}/orders
# Auth middleware (pseudocode): path org_id must match the token's tenant claim
if params.org_id != token.claims.tenant_id:
    return 403  # forbidden: path tenant doesn't match token
This pattern also makes test writing easy: security tests can assert that fetching /v1/orgs/org_A/invoices/{invoice_id_from_org_B} returns 403 or 404, not 200.
Per-tenant configuration
Encryption keys — envelope encryption
Storing all tenants' data encrypted with a single key means one key compromise exposes everyone. Per-tenant encryption isolates the blast radius. The standard pattern is envelope encryption:
- Generate a Data Encryption Key (DEK) per tenant (or per table, per object — the granularity is a policy choice).
- Encrypt the DEK with a Key Encryption Key (KEK) managed by a KMS (AWS KMS, GCP Cloud KMS, HashiCorp Vault). The wrapped DEK is stored with the data.
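- At query time: fetch the wrapped DEK, call KMS to unwrap it (one KMS API call; the unwrapped DEK can be cached in memory for the session), decrypt the data with the unwrapped DEK.
- To revoke access to a tenant's data (crypto-shredding): destroy that tenant's KEK in the KMS, or every copy of the wrapped DEK. This requires a KEK per tenant; destroying a KEK shared across tenants would lock out all of them. The data becomes permanently unreadable without needing to re-encrypt or delete rows.
The KEK never leaves the KMS. The plaintext DEK exists only in application memory, for the decryption operation or the lifetime of its cache entry, and is never persisted. Envelope encryption is the pattern documented for cloud KMS services such as AWS KMS and is widely used by regulated SaaS platforms for tenant-level encryption.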
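The key hierarchy can be sketched as follows. Two loud assumptions: `FakeKMS` is a hypothetical in-memory stand-in for a real KMS, and `_keystream_xor` is a toy cipher used only so the example runs on the standard library. It is not secure; a real system would use an authenticated cipher such as AES-GCM from a vetted library. The sketch also assumes one KEK per tenant and a fresh DEK per object.

```python
import hashlib
import secrets


def _keystream_xor(key: bytes, data: bytes) -> bytes:
    """TOY cipher, illustration only: XOR with a SHA-256 counter keystream."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        block = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ k for b, k in zip(chunk, block))
    return bytes(out)


class FakeKMS:
    """Hypothetical KMS stand-in: KEKs are created, used, and destroyed inside it."""

    def __init__(self) -> None:
        self._keks: dict[str, bytes] = {}

    def create_key(self, key_id: str) -> None:
        self._keks[key_id] = secrets.token_bytes(32)

    def wrap(self, key_id: str, dek: bytes) -> bytes:
        return _keystream_xor(self._keks[key_id], dek)

    def unwrap(self, key_id: str, wrapped_dek: bytes) -> bytes:
        return _keystream_xor(self._keks[key_id], wrapped_dek)

    def destroy_key(self, key_id: str) -> None:
        del self._keks[key_id]


def encrypt_for_tenant(kms: FakeKMS, tenant_id: str, plaintext: bytes):
    dek = secrets.token_bytes(32)              # fresh DEK for this object
    ciphertext = _keystream_xor(dek, plaintext)
    wrapped_dek = kms.wrap(tenant_id, dek)     # only the wrapped DEK is stored
    return wrapped_dek, ciphertext


def decrypt_for_tenant(kms: FakeKMS, tenant_id: str, wrapped_dek: bytes, ciphertext: bytes) -> bytes:
    dek = kms.unwrap(tenant_id, wrapped_dek)   # one KMS call per unwrap
    return _keystream_xor(dek, ciphertext)
```

After `kms.destroy_key(tenant_id)`, `decrypt_for_tenant` raises `KeyError`: the wrapped DEK and ciphertext are still on disk, but nothing can unwrap them.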
Data residency
EU customers may be contractually or legally required to have their data stored in EU data centers. This adds a routing problem: the same API endpoint must direct EU tenants' reads and writes to the EU cluster and everyone else to the default cluster. The common implementation: a global routing layer holds a tenant-to-region mapping; on every request, after extracting tenant_id, it proxies to the appropriate regional cluster. This mapping itself must be available globally (typically a small, highly-replicated lookup table with aggressive caching). See rel-16-consistency-cap.html for the consistency considerations this introduces — the routing table and the data are in different systems with different replication lag.
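A minimal sketch of the routing lookup follows, assuming the tenant-to-region mapping is already loaded into a dict. The endpoint URLs and the function name are hypothetical. The sketch fails closed: a tenant missing from the mapping raises an error rather than falling through to a default region, since a silent default could send EU data to the wrong cluster.

```python
# Hypothetical regional cluster endpoints.
REGION_ENDPOINTS = {
    "us": "https://us.api.internal.example",
    "eu": "https://eu.api.internal.example",
}


def resolve_cluster(tenant_id: str, tenant_regions: dict[str, str]) -> str:
    """Return the regional endpoint that must serve this tenant's reads and writes."""
    region = tenant_regions.get(tenant_id)
    if region is None:
        # Fail closed: never guess a region for an unmapped tenant.
        raise LookupError(f"no region mapping for tenant {tenant_id!r}")
    if region not in REGION_ENDPOINTS:
        raise LookupError(f"unknown region {region!r} for tenant {tenant_id!r}")
    return REGION_ENDPOINTS[region]
```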
Per-tenant feature flags
Feature flags in a multi-tenant system are scoped to tenants, not just code paths. This lets you: roll out a new API behavior to enterprise tenants first; run A/B experiments on a subset of tenants without affecting others; gate features behind plan tier; or disable a specific feature for a tenant who reported a bug. The implementation is a simple lookup: before executing any code path gated by a flag, resolve feature_flag(tenant_id, flag_name) — typically a Redis lookup with a fallback to a configuration database.
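The lookup with fallback might look like the sketch below. Plain dicts stand in for the Redis cache and the configuration database, and the key format and function name are assumptions for illustration.

```python
def feature_enabled(
    tenant_id: str,
    flag_name: str,
    cache: dict,
    config_store: dict,
    default: bool = False,
) -> bool:
    """Resolve a tenant-scoped flag: cache first, then the configuration store."""
    key = f"flag:{tenant_id}:{flag_name}"   # tenant ID is part of the cache key
    if key in cache:
        return cache[key]
    # Cache miss: fall back to the source of truth, then populate the cache.
    value = config_store.get((tenant_id, flag_name), default)
    cache[key] = value
    return value
```

A real cache entry would carry a TTL so that flag changes in the configuration store eventually take effect; the dict here never expires.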
The cache cross-tenant leak
Caches are a frequent source of cross-tenant data leaks because the mistake is invisible in normal operation and only surfaces when two tenants resolve to the same cache key. The mechanism:
- Tenant A requests GET /v1/invoices/summary. The cache layer stores the response under key invoices:summary.
- Tenant B requests the same endpoint. The cache hits on invoices:summary and returns Tenant A's data to Tenant B.
The fix is exact and non-negotiable: every cache key must include the tenant ID. The correct key is invoices:summary:{tenant_id} or {tenant_id}:invoices:summary. See rel-07-caching.html for cache invalidation patterns and the full cache key design framework.
This leak is particularly dangerous because it doesn't throw an error and passes all functional tests (which typically run with a single tenant). It only manifests in production when two tenants make the same request in the same cache TTL window. Audit your cache key generation code as a standalone review pass — it is easy to miss when reviewing application logic holistically.
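One way to make the rule hard to forget is to route every cache access through a wrapper that refuses to build a key without a tenant ID. This is a sketch with a dict as the backing store; the class name is hypothetical.

```python
class TenantScopedCache:
    """Cache wrapper whose keys always carry the tenant ID."""

    def __init__(self) -> None:
        self._store: dict[str, object] = {}

    @staticmethod
    def _key(tenant_id: str, resource: str) -> str:
        if not tenant_id:
            raise ValueError("tenant_id is required for every cache key")
        return f"{tenant_id}:{resource}"

    def get(self, tenant_id: str, resource: str):
        return self._store.get(self._key(tenant_id, resource))

    def set(self, tenant_id: str, resource: str, value: object) -> None:
        self._store[self._key(tenant_id, resource)] = value
```

A two-tenant test against this wrapper (set as tenant A, get as tenant B, expect a miss) is the functional test that single-tenant suites never run.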
Under the hood: the full BOLA walkthrough
Walk through the complete vulnerable-to-fixed progression, including how RLS catches what the application missed.
RLS is not a replacement for application-level enforcement — setting app.current_tenant correctly is itself application work that could be misconfigured. It is a defense-in-depth layer that prevents the worst case: a bug in application code that would otherwise return another tenant's data.
By the numbers
10,000 tenants, SILO vs. POOL:
| Model | Infrastructure | Unit cost (modeled) | Monthly total (modeled) |
|---|---|---|---|
| SILO | 1 DB instance per tenant × 10,000 | $200/month per small RDS instance | $2,000,000/month |
| POOL | ~20 shared DB clusters, each serving ~500 tenants | $4,000/month per cluster (larger instance) | $80,000/month |
| BRIDGE | Shared pool for 9,800 SMB tenants + 200 isolated instances for enterprise tenants | Pool: $60k/mo; enterprise instances: $400/mo each | ~$140,000/month |
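At 10,000 tenants, POOL is ~25× cheaper than SILO on infrastructure alone ($80,000 vs. $2,000,000 per month). The absolute gap widens with tenant count: SILO adds a full instance for every new tenant, while POOL adds a cluster only once per ~500 tenants, and its per-tenant cost falls further if each cluster can absorb more tenants at higher utilization.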
Blast radius in the POOL model. If each cluster serves 500 tenants and one cluster crashes, 500 tenants (5% of your total) are affected. Sharding across 20 clusters bounds an outage to 5% of tenants rather than 100%, even with no other isolation in place, but that is still 500 customers. Cluster-level sharding is a blast-radius reduction strategy: the more clusters, the smaller the fraction affected per incident, but the higher the operational overhead.
Break-even: when does SILO become operationally untenable? The rough formula:
-- Break-even: compare total monthly cost (infra + ops) under each model
SILO_total = N * (infra_per_tenant + ops_per_tenant_per_month)
POOL_total = (N / tenants_per_cluster) * (cluster_cost + ops_per_cluster_per_month)
-- Approximate: at N ≈ 50 tenants, SILO is viable if tenants are high-value.
-- At N ≈ 500+, the ops burden (patching, monitoring, incident response per tenant)
-- dominates. At N ≈ 5,000+, only POOL or BRIDGE is operationally sustainable.
-- The inflection point moves with automation level:
-- fully automated provisioning + IaC pushes SILO viability to ~200–500 tenants.
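The formulas above can be checked against the table with a few lines of arithmetic. This sketch uses the lesson's modeled prices with ops costs left at zero; the function names are hypothetical, and cluster counts are rounded up, which the prose formula leaves implicit.

```python
import math


def silo_monthly(n_tenants: int, infra_per_tenant: float, ops_per_tenant: float = 0.0) -> float:
    """SILO_total = N * (infra_per_tenant + ops_per_tenant_per_month)."""
    return n_tenants * (infra_per_tenant + ops_per_tenant)


def pool_monthly(
    n_tenants: int,
    tenants_per_cluster: int,
    cluster_cost: float,
    ops_per_cluster: float = 0.0,
) -> float:
    """POOL_total = clusters * (cluster_cost + ops_per_cluster_per_month)."""
    clusters = math.ceil(n_tenants / tenants_per_cluster)  # partial clusters still cost a cluster
    return clusters * (cluster_cost + ops_per_cluster)


def blast_radius_fraction(n_tenants: int, tenants_per_cluster: int) -> float:
    """Fraction of tenants affected when one evenly loaded cluster fails."""
    clusters = math.ceil(n_tenants / tenants_per_cluster)
    return 1 / clusters
```

With the table's inputs, `silo_monthly(10_000, 200)` gives 2,000,000 and `pool_monthly(10_000, 500, 4_000)` gives 80,000, which is the 25× ratio quoted; `blast_radius_fraction(10_000, 500)` gives 0.05. Plugging in non-zero ops costs is where the break-even point moves with automation level.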
How real platforms do it
| Platform | Model | Key mechanism | Reference |
|---|---|---|---|
| Salesforce | POOL — pioneered at scale | OrgId on every object in every table. All queries at the ORM layer include the OrgId filter. Hundreds of thousands of orgs in shared infrastructure since 2000. | Salesforce Multi-Tenant Architecture |
| Workday | POOL with shared kernel | Shared application kernel with tenant-aware data access layer. Configuration, workflows, and schema variations are layered on top of a shared base via a metadata-driven approach. | Workday Multi-Tenant Architecture |
| AWS | SILO — each account is the isolation unit | AWS accounts are hard isolation boundaries: separate IAM namespaces, separate resource ARNs, separate billing, separate API rate limit pools. Cross-account access requires explicit trust policies. | AWS SaaS Tenant Isolation Strategies (whitepaper) |
| Stripe | POOL with strong API-layer isolation | All API calls are scoped to an Account object. API keys are scoped to an account and cannot access other accounts' resources regardless of key permissions. Account ID is the first lookup on every API request. | Stripe API Authentication & Account Scoping |
A senior answer names the three models and explains the selection criteria: tenant count, size distribution (homogeneous vs. tiered), compliance requirements, and budget. It then covers the data isolation mechanism (tenant_id + RLS as defense in depth), the compute isolation problem (bulkheads for noisy neighbors, per-tenant rate limits), and the operational dimensions (blast radius, deploy complexity). Candidates who answer only "use tenant_id in every query" have covered one of the four required dimensions. Candidates who propose SILO for 50,000 tenants without addressing the cost curve have not thought through the trade-offs at scale.
🧠 Quick check
1. Which isolation model has the highest infrastructure cost but the strongest isolation?
SILO gives each tenant their own infrastructure: separate DB, separate app instances, separate everything. This is physically the strongest isolation — a breach in one tenant's environment cannot spill to another. The cost is that infrastructure scales linearly with tenant count, making SILO impractical at thousands of tenants unless the per-tenant revenue justifies it.
2. In a pooled multi-tenant system, what is the BOLA risk?
BOLA (Broken Object-Level Authorization) in a pooled system means a valid authenticated user accesses a resource they do not own. The canonical mechanism: an API endpoint takes an object ID from the request path, queries the DB by that ID alone without adding a tenant_id filter, and returns whatever row matches — even if it belongs to another tenant. The fix is to always AND the tenant_id from the verified token into every data retrieval query.
3. Why must cache keys include the tenant ID?
Cache keys that omit the tenant ID create a namespace collision: the first tenant to populate the cache entry defines what all subsequent tenants receive until the TTL expires. This is a cross-tenant data leak that bypasses all database-level isolation controls — the DB query was correct, but the cached result is served to the wrong tenant. The fix is mechanical: always prefix or suffix cache keys with the tenant ID.
4. What is the primary purpose of PostgreSQL Row-Level Security (RLS) in a multi-tenant database?
RLS attaches a policy to a table that the database evaluates for every query: SELECT, INSERT, UPDATE, and DELETE. The policy ANDs a tenant_id predicate into every WHERE clause automatically. This means an application bug that omits the tenant filter is caught by the database: the query returns only the rows belonging to the tenant set in the session variable, not all rows. It is defense in depth: the application should still enforce the filter, but RLS ensures a missed filter doesn't leak data.
✍️ Exercise: choose an isolation model for a mixed-tier analytics SaaS
Scenario: You are the platform architect for a SaaS analytics product. You currently have 500 tenants. The distribution is: 490 are small customers (free or growth tier, averaging 10 req/s each), and 10 are enterprise customers (each running 500 req/s, each paying $50k/year, each with a data processing agreement requiring EU data residency for their data). Recommend an isolation model and justify your recommendation. Address: data isolation, compute isolation, blast radius, cost, and compliance.
Model answer:
The correct recommendation is the BRIDGE model:
- SMB pool (490 tenants): Put the 490 small tenants in a shared POOL on 2–3 DB clusters (each serving roughly 165–245 tenants) with per-tenant row-level security and a tenant_id-keyed cache layer. Per-tenant rate limits at 10–50 req/s enforce the tier contract. Cost: 2–3 clusters at ~$4k/month = ~$8k–12k/month for all SMB data infrastructure.
- Enterprise silo (10 tenants): Each enterprise tenant gets their own DB cluster (or at minimum their own DB schema on a dedicated cluster) in the EU region. This satisfies the data residency requirement — their data never leaves EU infrastructure. It also eliminates blast radius cross-contamination between enterprise customers. At 500 req/s each, they generate enough load to justify dedicated resources. Cost: 10 clusters at $400–800/month each = $4k–8k/month, paid for by the $50k/year revenue per tenant.
- Shared application layer: One codebase, one deploy pipeline. A routing layer inspects the tenant_id on each request and directs to the correct data store (pool or dedicated). This keeps operational complexity manageable — you are not maintaining 11 separate application stacks.
- Compute isolation: Implement per-tenant rate limits and bulkhead thread pools partitioned by tier. Each enterprise tenant gets its own thread pool with a larger limit; SMB tenants share a pool with fair-queuing scheduling. A bulk export from an enterprise tenant cannot affect other enterprise tenants (separate data stores and separate thread pools) and cannot affect SMB tenants (separate thread pool).
Why not pure SILO? At 500 tenants, a fully dedicated DB per tenant costs $200 × 500 = $100k/month. That is roughly 5–8× the BRIDGE cost (about $12k–20k/month for the pool plus the enterprise clusters) with no benefit for the SMB segment: they have no compliance requirement for isolation and their load is low. The operational burden (500 separate DB instances to patch, monitor, and back up) is also disproportionate.
Why not pure POOL? The 10 enterprise customers have a contractual data residency requirement. In a pure pool, you cannot guarantee EU data stays in EU without per-tenant routing complexity that is equivalent to BRIDGE anyway. Also, at 500 req/s each, enterprise customers are noisy neighbors even with rate limiting — dedicated infrastructure makes the performance contract enforceable.
Rubric:
- Full marks: BRIDGE recommendation with justification across all five dimensions (data isolation, compute isolation, blast radius, cost arithmetic, compliance). Notes that enterprise tenants have data residency constraints that push them toward isolation regardless of model preference.
- Partial marks: BRIDGE or POOL recommendation with justification for three or more dimensions.
- Minimum pass: identifies that the heterogeneous tenant size distribution (10 very large, 490 small) is the key driver of the hybrid recommendation.
- Deductions: recommending pure SILO without addressing the cost at 500 tenants; recommending pure POOL without addressing the data residency constraint for enterprise customers.
Key takeaways
- Three models, not infinitely many: SILO (per-tenant infra), POOL (shared infra + tenant_id filter), and BRIDGE (shared app + isolated data for large tenants). Select based on tenant count, size distribution, and compliance requirements.
- The BOLA risk in POOL is a missing WHERE clause. Every DB query that touches tenant data must include WHERE tenant_id = ?, bound to the tenant_id extracted from the verified token. PostgreSQL Row-Level Security enforces this at the DB layer as defense in depth.
- The noisy-neighbor problem is a compute isolation problem, separate from data isolation. Bulkhead thread pools and connection pools per tenant tier prevent one heavy tenant from degrading all others.
- Cache keys must include the tenant ID. Any cache key that omits it creates a cross-tenant data leak that bypasses all DB-level isolation controls.
- At 10,000 tenants, POOL is ~25× cheaper than SILO on infrastructure. The operational overhead of maintaining N separate environments compounds this advantage. SILO is appropriate for regulated markets with hard isolation requirements and a small, high-value tenant count (typically under 200).
Sources & further reading
- AWS Whitepaper — SaaS Tenant Isolation Strategies
- Salesforce Developer Wiki — Multi-Tenant Architecture
- PostgreSQL Documentation — Row Security Policies (RLS)
- OWASP API Security — BOLA (Broken Object-Level Authorization)
- AWS Partner Blog — SaaS Tenant Isolation Models with EKS
- Stripe API Documentation — Authentication & Account Scoping
- Martin Fowler — Writing Resilient Platform Code (bulkhead & isolation patterns)