Hard · Enterprise

Design Slack

WebSockets, Databases, Search, API Design, Auth, Microservices

Problem Statement

Design the architecture for Slack - the leading enterprise communication platform with 30+ million daily active users across 750,000 organizations. Your design must cover:

- Channels & DMs - public channels, private channels, multi-person DMs. Each workspace can have 100,000+ channels. Messages support rich text, code blocks, emoji reactions, and threading.
- Real-time messaging - messages appear on all connected clients within 500 ms. Users can be connected on desktop, mobile, and web simultaneously. Typing indicators and presence (online/away/DND) update in real time.
- Threads - threaded replies to messages, with a "Threads" view aggregating all threads you're participating in. Threads reduce noise in busy channels.
- Search - full-text search across all messages a user has access to, with filters (in:channel, from:user, has:link, before:date). Results must respect permissions - users only see messages from channels they belong to. Returns results in < 1 second.
- File & app integrations - share files (up to 1 GB), integrate with 2,500+ third-party apps (Jira, GitHub, Google Calendar). Slash commands and bot users.
- Notifications - smart notifications: don't notify for every message in a busy channel, but always notify for DMs and @mentions. Badge counts update in real time.
- Enterprise features - SSO (SAML/OIDC), data retention policies, eDiscovery exports, DLP scanning, audit logs, and Enterprise Key Management (EKM).
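The notification requirement reduces to a small decision rule. The Python sketch below is one way to express it, not Slack's actual logic: the message fields (`sender_id`, `is_dm`, `mentions`, `channel_id`), the per-channel preference values `"all"` and `"mentions"`, and the rule that users are never notified about their own messages are all assumptions made for illustration.

```python
def should_notify(message: dict, user_id: str, channel_prefs: dict) -> bool:
    """Decide whether `user_id` gets a notification for `message`.

    `channel_prefs` maps channel_id -> "all" or "mentions" (hypothetical values).
    """
    if message["sender_id"] == user_id:
        return False  # assumption: never notify users about their own messages
    if message.get("is_dm", False):
        return True   # DMs always notify
    if user_id in message.get("mentions", ()):
        return True   # @mentions always notify
    # Otherwise fall back to the per-channel preference; defaulting to
    # "mentions" keeps busy channels quiet unless the user opts in.
    return channel_prefs.get(message["channel_id"], "mentions") == "all"


# Example messages (hypothetical data).
dm = {"sender_id": "u2", "is_dm": True, "channel_id": "D1"}
mention = {"sender_id": "u2", "channel_id": "general", "mentions": ["u1"]}
chatter = {"sender_id": "u2", "channel_id": "general", "mentions": []}
```

With these examples, `dm` and `mention` notify `u1` regardless of preferences, while `chatter` notifies only if `u1` has set `general` to `"all"`.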

The key challenge is real-time delivery + search + compliance at enterprise scale.

What You'll Learn

Design Slack's team communication platform - channels, threads, search, file sharing, integrations, and enterprise compliance at 30 M+ DAU. Build this architecture under realistic production constraints, then validate tradeoffs in the design lab simulation.


Constraints

Daily active users: 30,000,000+
Organizations: 750,000+
Peak messages/second: ~200,000
Message delivery latency: < 500 ms
Search latency: < 1 second
Max file size: 1 GB
Integrations: 2,500+
Message retention: Configurable per org
Availability target: 99.99%
Approach

Interview-Ready Approach

1) Clarify Scope and SLOs

  • Problem statement: Design Slack's team communication platform - channels, threads, search, file sharing, integrations, and enterprise compliance at 30 M+ DAU.
  • Design the write path for the stated peak of ~200,000 messages/second plus burst headroom; delivery fanout to connected clients multiplies that rate on the read side.
  • Daily active users: 30,000,000+
  • Organizations: 750,000+
  • Peak messages/second: ~200,000
  • Message delivery latency: < 500 ms
  • Search latency: < 1 second

2) Capacity Planning Method

  • Convert traffic and growth constraints into request rate, storage growth, and concurrency budgets.
  • Keep at least 2-3x safety margin per tier (ingress, compute, storage, async workers).
  • Reserve explicit latency budgets per hop so p95 can be defended in review.
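The method above can be made concrete with a back-of-the-envelope calculation. In the Python sketch below, the peak message rate and the DAU figure come from the constraints; the average message size, peak-to-average ratio, connections per user, and connections per gateway are illustrative assumptions, not figures from the problem, and should be replaced with measured values.

```python
# From the constraints.
PEAK_MESSAGES_PER_SEC = 200_000
DAILY_ACTIVE_USERS = 30_000_000

# Assumptions (hypothetical, chosen only to illustrate the method).
AVG_MESSAGE_BYTES = 1_000          # payload plus metadata
PEAK_TO_AVERAGE_RATIO = 5          # peak traffic relative to the daily mean
CONNECTIONS_PER_USER = 2           # e.g. desktop + mobile
CONNECTIONS_PER_GATEWAY = 100_000  # sockets one gateway node can hold
SAFETY_MARGIN = 2                  # lower end of the 2-3x guidance

SECONDS_PER_DAY = 86_400


def capacity_plan() -> dict:
    """Turn the constraints into request-rate, storage, and concurrency budgets."""
    avg_messages_per_sec = PEAK_MESSAGES_PER_SEC // PEAK_TO_AVERAGE_RATIO
    messages_per_day = avg_messages_per_sec * SECONDS_PER_DAY
    storage_tb_per_day = messages_per_day * AVG_MESSAGE_BYTES / 1e12
    # Upper bound: every daily active user connected at the same time.
    max_connections = DAILY_ACTIVE_USERS * CONNECTIONS_PER_USER
    provisioned_connections = max_connections * SAFETY_MARGIN
    # Ceiling division so a partial gateway still counts as a whole node.
    gateways = -(-provisioned_connections // CONNECTIONS_PER_GATEWAY)
    return {
        "provisioned_writes_per_sec": PEAK_MESSAGES_PER_SEC * SAFETY_MARGIN,
        "messages_per_day": messages_per_day,
        "storage_tb_per_day": storage_tb_per_day,
        "max_connections": max_connections,
        "gateways": gateways,
    }
```

Under these assumptions the plan provisions for 400,000 writes/second, about 3.5 TB of new message data per day before replication, and 1,200 gateway nodes. The point is the method, not the specific numbers.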

3) Architecture Decisions

  • WebSockets: Use persistent connection gateways and decouple fanout via pub/sub or queues.
  • Databases: Define a clear system-of-record and design read/write paths separately before adding optimizations.
  • Search: Write to the primary store first and update the search index asynchronously, so indexing load never slows the write path and the index can scale independently.
  • API Design: Standardize API boundaries, idempotency keys, pagination, and error contracts first.
  • Auth: Centralize identity verification and keep authorization checks close to domain resources.
  • Microservices: Split services by business boundary, not by technical layer, and enforce ownership per domain.
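The WebSockets decision, gateways that hold connections with fanout decoupled through pub/sub, can be sketched in a few lines. The Python below is an in-memory stand-in only: `Bus` plays the role of a real pub/sub system, `Gateway` the role of a WebSocket server, and a plain list stands in for each client socket. All class and method names are hypothetical.

```python
from collections import defaultdict


class Bus:
    """In-memory stand-in for a pub/sub layer keyed by channel id."""

    def __init__(self):
        self.subscribers = defaultdict(set)  # channel_id -> gateways

    def subscribe(self, channel_id, gateway):
        self.subscribers[channel_id].add(gateway)

    def publish(self, channel_id, message):
        """Deliver to every gateway with at least one interested connection."""
        gateways = self.subscribers[channel_id]
        for gateway in gateways:
            gateway.deliver(channel_id, message)
        return len(gateways)


class Gateway:
    """Holds live connections; each connection is modelled as an outbox list."""

    def __init__(self, name):
        self.name = name
        self.connections = defaultdict(list)  # channel_id -> outboxes

    def connect(self, bus, channel_id):
        # Subscribe once per channel, on the first local connection.
        if not self.connections[channel_id]:
            bus.subscribe(channel_id, self)
        outbox = []
        self.connections[channel_id].append(outbox)
        return outbox

    def deliver(self, channel_id, message):
        for outbox in self.connections[channel_id]:
            outbox.append(message)


# One user on two devices lands on two gateways; a third gateway is unrelated.
bus = Bus()
gw_a, gw_b, gw_c = Gateway("a"), Gateway("b"), Gateway("c")
desktop = gw_a.connect(bus, "general")
mobile = gw_b.connect(bus, "general")
other = gw_c.connect(bus, "random")
reached = bus.publish("general", "hello")
```

The publisher never needs to know which gateway holds which socket: it publishes once per channel, and only gateways with subscribers for that channel do any work.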

4) Reliability and Failure Strategy

  • Track connection churn, backpressure, and session resumption behavior.
  • Use strong write constraints (transactions or conditional writes) and an explicit, tested backup/restore strategy.
  • Track indexing lag and support reindex from source of truth.
  • Apply strict input validation and backward-compatible versioning.
  • Use short-lived tokens and secure key rotation workflows.
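Short-lived tokens and key rotation can be illustrated with the standard library alone. The Python sketch below signs claims with HMAC-SHA256 and tags each token with a key id so an older key can keep verifying until it is retired. It is a teaching sketch under stated assumptions, not a replacement for a vetted token format such as JWT; the key ring, the claim names, and the 300-second default lifetime are all hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time

# Hypothetical key ring: key id -> secret. Rotation adds a new key, switches
# ACTIVE_KEY_ID, and later deletes the old entry once its tokens have expired.
KEYS = {"k1": b"previous-secret", "k2": b"current-secret"}
ACTIVE_KEY_ID = "k2"


def _sign(key_id: str, payload: bytes) -> bytes:
    return hmac.new(KEYS[key_id], payload, hashlib.sha256).hexdigest().encode()


def issue_token(user_id: str, ttl_seconds: int = 300, now=None) -> str:
    now = time.time() if now is None else now
    claims = {"sub": user_id, "exp": now + ttl_seconds, "kid": ACTIVE_KEY_ID}
    payload = base64.urlsafe_b64encode(json.dumps(claims).encode())
    return (payload + b"." + _sign(ACTIVE_KEY_ID, payload)).decode()


def verify_token(token: str, now=None):
    """Return the user id if the token is authentic and unexpired, else None."""
    now = time.time() if now is None else now
    try:
        payload_text, signature = token.split(".")
        payload = payload_text.encode()
        claims = json.loads(base64.urlsafe_b64decode(payload))
        expected = _sign(claims["kid"], payload)  # KeyError if the key was retired
        expires_at, subject = claims["exp"], claims["sub"]
    except (ValueError, KeyError, TypeError):
        return None
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(signature.encode(), expected):
        return None
    if expires_at <= now:
        return None
    return subject
```

Because the lifetime is short, retiring a key only requires waiting out the longest token lifetime before removing it from the key ring.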

5) Validation Plan

  • Run one peak-load test, one dependency-degradation test, and one failover test.
  • Verify idempotency for all retried writes and async consumers.
  • Track user-facing SLOs first: p95 latency, error rate, and successful throughput.
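Two of these checks are easy to sketch in Python: verifying that a retried write is idempotent, and computing p95 latency from samples. `MessageStore` is a hypothetical in-memory stand-in for the real write path, and the nearest-rank percentile is one common definition; a monitoring system may interpolate differently.

```python
class MessageStore:
    """In-memory stand-in for a write path that honours idempotency keys."""

    def __init__(self):
        self.messages = []
        self._by_key = {}  # idempotency key -> message id

    def post_message(self, idempotency_key, channel_id, text):
        # A retry with a known key returns the original result, with no new write.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        message_id = len(self.messages) + 1
        self.messages.append({"id": message_id, "channel_id": channel_id, "text": text})
        self._by_key[idempotency_key] = message_id
        return message_id


def retry_is_idempotent(store: MessageStore) -> bool:
    """Send the same write twice and confirm only one message was stored."""
    first = store.post_message("key-1", "general", "hello")
    retry = store.post_message("key-1", "general", "hello")
    return first == retry and len(store.messages) == 1


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]
```

In a load test, `p95` would be fed per-request delivery latencies and compared against the 500 ms budget; the idempotency check would run against each retried write path and each async consumer.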

6) Trade-offs to Call Out in Interviews

  • WebSockets: WebSockets reduce interaction latency but complicate scaling and state management.
  • Databases: SQL gives stronger transactional guarantees; NoSQL often gives better write scaling and flexibility.
  • Search: Search index gives rich querying but introduces eventual consistency and index ops overhead.
  • API Design: Rich APIs improve developer speed but can create long-term compatibility burden.
  • Auth: Central auth simplifies policy, but makes auth service availability/security critical.

Practical Notes

  • WebSocket gateway fleet: each user connects to a gateway. A pub/sub layer (Redis pub/sub or NATS) distributes messages to all gateways with subscribers.
  • Partition data by workspace_id - tenant isolation is natural and enables per-tenant compliance policies.
  • Search: Elasticsearch index per workspace. Index messages asynchronously via a message queue to avoid slowing the write path.
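The partitioning and search notes combine into one write-then-index flow, sketched below in Python with in-memory stand-ins: a dict of lists for the sharded store, a `deque` for the message queue, and a per-workspace list for the search index. The shard count, class name, and substring matching are assumptions for illustration; a real index would tokenize and rank.

```python
import zlib
from collections import defaultdict, deque

NUM_SHARDS = 16  # hypothetical shard count


def shard_for(workspace_id: str) -> int:
    # crc32 is stable across processes, unlike the built-in hash() for strings.
    return zlib.crc32(workspace_id.encode()) % NUM_SHARDS


class MessagePipeline:
    def __init__(self):
        self.store = defaultdict(list)         # shard -> messages (system of record)
        self.index_queue = deque()             # stand-in for a message queue
        self.search_index = defaultdict(list)  # workspace_id -> indexed messages

    def write(self, workspace_id, channel_id, text):
        message = {"workspace_id": workspace_id, "channel_id": channel_id, "text": text}
        self.store[shard_for(workspace_id)].append(message)
        self.index_queue.append(message)  # indexing happens off the write path
        return message

    def run_indexer(self) -> int:
        """Drain the queue into the per-workspace index; returns messages indexed."""
        indexed = 0
        while self.index_queue:
            message = self.index_queue.popleft()
            self.search_index[message["workspace_id"]].append(message)
            indexed += 1
        return indexed

    def search(self, workspace_id, term, member_channels):
        """Return matching texts, restricted to channels the user belongs to."""
        needle = term.lower()
        return [
            m["text"]
            for m in self.search_index[workspace_id]
            if m["channel_id"] in member_channels and needle in m["text"].lower()
        ]
```

Two properties are visible even in this toy version: a message is not searchable until the indexer has run (eventual consistency), and the permission filter is applied inside the query rather than after it, so results from channels the user has not joined never leave the index.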

Reference Solution

Why This Solution Works

Request path: The solution keeps ingress, service logic, and stateful dependencies separated so each layer can scale independently.

Reference flow: Web Clients -> Load Balancer -> API Gateway -> Core Service -> Auth Service -> Primary NoSQL DB -> Realtime Bus -> Search Index

Design strengths

  • Security controls are enforced at ingress to protect downstream capacity.

Interview defense

  • This design makes bottlenecks explicit (ingress, core compute, persistence, async workers).
  • It supports progressive scaling without re-architecting the core request path.
  • It keeps correctness-sensitive state changes in durable systems while offloading background work asynchronously.