Medium, Intermediate

Multi-Tenant SaaS Platform

Databases, Auth, API Design, Sharding, Monitoring

Problem Statement

TenantOS is building a B2B SaaS platform framework (like what powers Notion, Linear, or Jira internally). The platform hosts thousands of customer organizations (tenants) on shared infrastructure. Key challenges:

- Tenant isolation - each tenant's data is logically isolated. A bug or query in one tenant must never access another's data. Choose between: shared database with tenant_id column, schema-per-tenant, or database-per-tenant.
- Authentication & authorization - SAML SSO for enterprise tenants. Each tenant has its own user directory, roles, and permissions.
- Billing & metering - track usage (API calls, storage, seats) per tenant. Generate invoices monthly. Support per-seat and usage-based pricing.
- Custom domains - enterprise tenants use their own domain (e.g., `app.customer.com`) rather than a subdomain. Requires automated SSL certificate provisioning (Let's Encrypt).
- Rate limiting per tenant - prevent noisy neighbors from degrading performance for others. Enforce per-tenant API rate limits.
- Data residency - some tenants require their data to stay in a specific region (EU, US). Route their requests to the correct regional deployment.
- Tenant onboarding - automated provisioning: create DB schema, seed default data, configure SSO, issue API keys.

Targeting 5,000 tenants with 2 million total users.
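The per-tenant rate limiting requirement above can be illustrated with a token bucket kept per tenant. This is a minimal in-process sketch, not a production design: the class names, the `rate` and `capacity` parameters, and the injectable `clock` are all assumptions made for illustration, and a real deployment with several API nodes would likely need the bucket state in a shared store.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class TokenBucket:
    rate: float        # tokens refilled per second
    capacity: float    # maximum burst size
    tokens: float      # tokens currently available
    updated_at: float  # clock reading at the last refill

    def allow(self, now: float, cost: float = 1.0) -> bool:
        # Refill in proportion to elapsed time, capped at the burst capacity.
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantRateLimiter:
    """Keeps one bucket per tenant so one tenant cannot spend another's quota."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._rate = rate
        self._capacity = capacity
        self._clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, tenant_id: str) -> bool:
        now = self._clock()
        bucket = self._buckets.get(tenant_id)
        if bucket is None:
            # New tenants start with a full bucket.
            bucket = TokenBucket(self._rate, self._capacity, self._capacity, now)
            self._buckets[tenant_id] = bucket
        return bucket.allow(now)
```

A request handler would call `limiter.allow(tenant_id)` before doing any work and return HTTP 429 when it yields `False`. Per-plan limits could be modeled by looking up `rate` and `capacity` per tenant instead of sharing one setting.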

What You'll Learn

Design a multi-tenant SaaS backend with tenant isolation, per-tenant billing, custom domains, and data residency. Build this architecture under realistic production constraints, then validate tradeoffs in the design lab simulation.


Constraints

Tenants: ~5,000
Total users: ~2,000,000
API requests/day: ~50,000,000
Per-tenant latency: < 200 ms
Noisy neighbor tolerance: Zero - enforce limits
Data regions: 3 (US, EU, Asia)
Availability target: 99.99%
Approach

Interview-Ready Approach

1) Clarify Scope and SLOs

  • Problem statement: Design a multi-tenant SaaS backend with tenant isolation, per-tenant billing, custom domains, and data residency.
  • Design for a peak load target around 2,894 RPS (including burst headroom). That is about 5x the daily average of ~579 RPS (50,000,000 requests / 86,400 s).
  • Tenants: ~5,000
  • Total users: ~2,000,000
  • API requests/day: ~50,000,000
  • Per-tenant latency: < 200 ms
  • Noisy neighbor tolerance: Zero - enforce limits
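The figures above can be turned into per-second and per-tenant numbers with a few lines of arithmetic. The sketch below is one way to arrive at the ~2,894 RPS target; the 5x peak-to-average multiplier is an inference from the stated numbers rather than something the problem specifies, and the variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400
MINUTES_PER_YEAR = 365 * 24 * 60

# Inputs taken from the constraints.
requests_per_day = 50_000_000
tenants = 5_000
total_users = 2_000_000
availability_target = 0.9999

# Assumption: peak traffic is about 5x the daily average.
peak_to_average = 5

average_rps = requests_per_day / SECONDS_PER_DAY             # ~579
peak_rps = average_rps * peak_to_average                     # ~2,894
requests_per_tenant_per_day = requests_per_day / tenants     # 10,000
users_per_tenant = total_users / tenants                     # 400
downtime_minutes_per_year = MINUTES_PER_YEAR * (1 - availability_target)  # ~52.6

print(f"average RPS:            {average_rps:,.0f}")
print(f"peak RPS:               {peak_rps:,.0f}")
print(f"requests/tenant/day:    {requests_per_tenant_per_day:,.0f}")
print(f"users/tenant (mean):    {users_per_tenant:,.0f}")
print(f"downtime budget (min/yr): {downtime_minutes_per_year:.1f}")
```

The per-tenant values are means. Real tenant sizes are usually skewed, so a handful of large tenants can sit far above 10,000 requests per day, which is what the per-tenant limits are there to contain.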

2) Capacity Planning Method

  • Convert traffic and growth constraints into request rate, storage growth, and concurrency budgets.
  • Keep at least 2-3x safety margin per tier (ingress, compute, storage, async workers).
  • Reserve explicit latency budgets per hop so p95 can be defended in review.
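A per-hop latency budget can be written down as data and checked against measurements. In the sketch below the hop names and millisecond values are invented for illustration; only the 200 ms target comes from the constraints. Percentiles do not add exactly across hops, so the sum is a rough planning figure rather than a guarantee.

```python
from typing import Dict, List

P95_TARGET_MS = 200  # from the per-tenant latency constraint

# Hypothetical allocation of the end-to-end budget across hops.
hop_budget_ms: Dict[str, int] = {
    "load_balancer": 5,
    "api_gateway": 15,
    "auth_check": 20,
    "api_service": 40,
    "database": 60,
}


def remaining_headroom_ms(budget: Dict[str, int], target_ms: int) -> int:
    """Budget left over after every hop spends its full allocation."""
    return target_ms - sum(budget.values())


def over_budget_hops(observed_p95_ms: Dict[str, float],
                     budget: Dict[str, int]) -> List[str]:
    """Hops whose observed p95 exceeds their allocation, sorted by name."""
    return sorted(
        hop for hop, ms in observed_p95_ms.items()
        if ms > budget.get(hop, 0)
    )


print(remaining_headroom_ms(hop_budget_ms, P95_TARGET_MS))  # 60
print(over_budget_hops({"database": 75, "auth_check": 10}, hop_budget_ms))  # ['database']
```

The leftover headroom is what absorbs network jitter, retries, and cross-region hops. Keeping it explicit makes it easier to argue in review about which tier has to give up budget when a new dependency is added.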

3) Architecture Decisions

  • Databases: Define a clear system-of-record and design read/write paths separately before adding optimizations.
  • Auth: Centralize identity verification and keep authorization checks close to domain resources.
  • API Design: Standardize API boundaries, idempotency keys, pagination, and error contracts first.
  • Sharding: Choose shard keys around access patterns and growth hotspots, not just data size.
  • Monitoring: Instrument golden signals (latency, traffic, errors, saturation) per tier and per tenant/domain.
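The sharding and data residency decisions can be combined into one routing step: resolve the tenant's home region first, then pick a shard inside that region. The sketch below assumes a hypothetical tenant directory and made-up shard counts per region. It uses simple hash-modulo placement, which is easy to show but makes rebalancing harder than a lookup table or consistent hashing would.

```python
import hashlib
from typing import Dict

# Hypothetical shard counts for the three regions named in the constraints.
REGION_SHARDS: Dict[str, int] = {"us": 8, "eu": 4, "asia": 4}

# Hypothetical tenant directory; in practice this would live in a control-plane store.
TENANT_REGION: Dict[str, str] = {"acme": "eu", "globex": "us"}


def shard_for(tenant_id: str, region: str) -> str:
    """Pick a shard within the region using a stable hash of the tenant id."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % REGION_SHARDS[region]
    return f"{region}-shard-{index}"


def route(tenant_id: str) -> str:
    """Resolve region first so residency is never decided by the hash."""
    region = TENANT_REGION.get(tenant_id)
    if region is None:
        raise LookupError(f"unknown tenant: {tenant_id}")
    return shard_for(tenant_id, region)
```

Because the region comes from the directory and not from the hash, adding shards in one region can never move a tenant's data across a residency boundary. Unknown tenants fail closed instead of falling back to a default region.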

4) Reliability and Failure Strategy

  • Use strong write constraints (transactions or conditional writes) and explicit backup/restore strategy.
  • Use short-lived tokens and secure key rotation workflows.
  • Apply strict input validation and backward-compatible versioning.
  • Support rebalancing and hotspot detection from day one.
  • Alert on user-impact SLOs, not only infrastructure metrics.

5) Validation Plan

  • Run one peak-load test, one dependency-degradation test, and one failover test.
  • Verify idempotency for all retried writes and async consumers.
  • Track user-facing SLOs first: p95 latency, error rate, and successful throughput.
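The idempotency check in the plan above amounts to replaying a write with the same key and confirming the side effect happens once. Below is a minimal in-memory sketch of that behavior. `IdempotentWriter` and its method names are hypothetical, the store is not safe under concurrency, and a real system would more likely rely on a unique constraint over (tenant_id, idempotency_key) in the database.

```python
from typing import Any, Callable, Dict, Tuple


class IdempotentWriter:
    """Runs an operation at most once per (tenant, idempotency key) pair."""

    def __init__(self) -> None:
        self._results: Dict[Tuple[str, str], Any] = {}

    def execute(self, tenant_id: str, idempotency_key: str,
                operation: Callable[[], Any]) -> Any:
        # Scope keys by tenant so two tenants can reuse the same key safely.
        scoped_key = (tenant_id, idempotency_key)
        if scoped_key in self._results:
            return self._results[scoped_key]  # replay: return the stored result
        result = operation()
        self._results[scoped_key] = result
        return result


# Example: a retried invoice creation only runs once.
created = []


def create_invoice() -> str:
    created.append("invoice")
    return f"inv-{len(created)}"


writer = IdempotentWriter()
first = writer.execute("t1", "req-123", create_invoice)
retry = writer.execute("t1", "req-123", create_invoice)
```

Here `first` and `retry` hold the same invoice id and `created` contains a single entry. The same test shape applies to async consumers: deliver a message twice and assert on the resulting state, not on the number of deliveries.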

6) Trade-offs to Call Out in Interviews

  • Databases: SQL gives stronger transactional guarantees; NoSQL often gives better write scaling and flexibility.
  • Auth: Central auth simplifies policy, but makes auth service availability/security critical.
  • API Design: Rich APIs improve developer speed but can create long-term compatibility burden.
  • Sharding: Sharding improves horizontal scale but makes cross-shard queries and transactions harder.
  • Monitoring: Deep observability speeds incident response but raises ingestion and tooling costs.

Practical Notes

  • Shared database with tenant_id column is simplest but riskiest (bugs can leak data). Schema-per-tenant balances isolation and operational cost. Database-per-tenant is safest but hardest to manage.
  • Inject the tenant_id filter into every database query in a middleware layer, and reject any query that lacks it.
  • Custom domains: automate SSL with Let's Encrypt and a reverse proxy (Caddy or Nginx) that maps domains to tenants.
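The tenant_id middleware note can be sketched as a thin wrapper around a database connection that refuses unscoped queries and binds the tenant id itself. This example uses Python's built-in `sqlite3` purely for illustration; `TenantScopedConnection`, `MissingTenantFilter`, and the `projects` table are hypothetical. The string check is deliberately crude, since it only proves the placeholder is present, not that it is used correctly.

```python
import sqlite3
from typing import Any, Dict, Optional


class MissingTenantFilter(Exception):
    """Raised when a query does not reference the tenant placeholder."""


class TenantScopedConnection:
    def __init__(self, conn: sqlite3.Connection, tenant_id: str) -> None:
        self._conn = conn
        self._tenant_id = tenant_id

    def execute(self, sql: str,
                params: Optional[Dict[str, Any]] = None) -> sqlite3.Cursor:
        # Crude guard: every statement must mention the :tenant_id placeholder.
        if ":tenant_id" not in sql:
            raise MissingTenantFilter(f"query lacks tenant filter: {sql}")
        bound = dict(params or {})
        # The wrapper sets the value, overriding anything the caller passed.
        bound["tenant_id"] = self._tenant_id
        return self._conn.execute(sql, bound)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE projects (tenant_id TEXT NOT NULL, name TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO projects (tenant_id, name) VALUES (?, ?)",
    [("t1", "alpha"), ("t2", "beta")],
)

db = TenantScopedConnection(conn, "t1")
rows = db.execute(
    "SELECT name FROM projects WHERE tenant_id = :tenant_id"
).fetchall()
print(rows)  # [('alpha',)]
```

Because the wrapper supplies the tenant id, application code cannot accidentally query on behalf of another tenant by passing the wrong value. A database-level control such as row-level security, where the engine supports it, would be a stronger backstop than string inspection.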

Reference Solution

Why This Solution Works

Request path: The solution keeps ingress, service logic, and stateful dependencies separated so each layer can scale independently.

Reference flow: Web Clients -> Load Balancer -> API Gateway -> API Service -> Auth Service -> Primary NoSQL DB, with Monitoring and a Log Aggregator collecting telemetry from the request path.

Design strengths

  • Monitoring and logs are wired in from day one for rapid incident triage.
  • Security controls are enforced at ingress to protect downstream capacity.

Interview defense

  • This design makes bottlenecks explicit (ingress, core compute, persistence, async workers).
  • It supports progressive scaling without re-architecting the core request path.
  • It keeps correctness-sensitive state changes in durable systems while offloading background work asynchronously.