Difficulty: Medium (Intermediate)

Online Code Sandbox

Topics: Containerization, Databases, WebSockets, API Design, Storage

Problem Statement

DevBox is building an online code sandbox where developers write, run, and share code in the browser. Features:

- Browser IDE - code editor (Monaco) with syntax highlighting, autocomplete, and a file tree. Supports JavaScript, Python, Go, and Rust.
- Secure execution - run user code in isolated sandboxes. Prevent malicious code from accessing the host system, the network, or other users' data. Enforce CPU (10 s max) and memory (256 MB max) limits.
- Real-time output - stdout and stderr stream to the browser in real time as the program runs.
- Project persistence - save projects with multiple files and folders. Share via URL. Fork other people's projects.
- Package installation - users can install npm/pip packages. Pre-warm popular packages in a base image.
- Preview - for web projects (React, HTML), show a live preview that updates as the user types (hot reload).
- Collaboration - two users can code together with live cursors (like Google Docs for code).
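The CPU and memory limits in the secure-execution requirement can be sketched at the process level. The Python below is a minimal, Linux-only sketch, assuming the sandbox worker launches each run as a child process: `run_limited` and `RunResult` are hypothetical names, `resource.setrlimit` caps CPU time and address space in the child, and `subprocess.run(timeout=...)` adds a wall-clock cap. It is not an isolation boundary on its own; filesystem, network, and cross-user isolation still need a sandbox runtime such as gVisor or Firecracker.

```python
import resource
import subprocess
import sys
from dataclasses import dataclass
from typing import Optional, Sequence, Union

# Limits taken from the problem statement (256 MB treated as 256 MiB).
CPU_SECONDS = 10
MEMORY_BYTES = 256 * 1024 * 1024


@dataclass
class RunResult:
    stdout: str
    stderr: str
    returncode: Optional[int]  # None when the wall-clock timeout killed the process
    timed_out: bool


def _to_text(data: Union[bytes, str, None]) -> str:
    # TimeoutExpired carries captured output as bytes even when text=True.
    if data is None:
        return ""
    return data.decode(errors="replace") if isinstance(data, bytes) else data


def run_limited(
    argv: Sequence[str],
    wall_seconds: float = CPU_SECONDS,
    cpu_seconds: int = CPU_SECONDS,
    memory_bytes: int = MEMORY_BYTES,
) -> RunResult:
    """Run argv with CPU-time, address-space, and wall-clock limits (Linux only).

    NOT a security boundary: it does not isolate the filesystem, the network,
    or other processes.
    """

    def apply_limits() -> None:
        # Runs in the child between fork and exec.
        # RLIMIT_CPU: the kernel terminates the process once it has used this much CPU time.
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        # RLIMIT_AS caps virtual memory; allocations beyond it fail (MemoryError in Python).
        resource.setrlimit(resource.RLIMIT_AS, (memory_bytes, memory_bytes))

    try:
        proc = subprocess.run(
            list(argv),
            capture_output=True,
            text=True,
            timeout=wall_seconds,  # also catches programs that sleep or block instead of using CPU
            preexec_fn=apply_limits,
        )
    except subprocess.TimeoutExpired as exc:
        # subprocess.run has already killed the child before re-raising.
        return RunResult(_to_text(exc.stdout), _to_text(exc.stderr), None, True)
    return RunResult(proc.stdout, proc.stderr, proc.returncode, False)


if __name__ == "__main__":
    print(run_limited([sys.executable, "-c", "print('hello')"]))
```

Using the address-space limit as the memory cap is a simplification; container runtimes usually enforce memory through cgroups, which account for memory actually used rather than virtual address space reserved.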

Targeting 200,000 users running 500,000 code executions per day.

What You'll Learn

Design an online code execution sandbox (like CodeSandbox / Replit) that runs user code safely with real-time collaboration. Build this architecture under realistic production constraints, then validate tradeoffs in the design lab simulation.

Constraints

  • Registered users: ~200,000
  • Code executions/day: ~500,000
  • Max execution time: 10 seconds
  • Max memory per sandbox: 256 MB
  • Sandbox startup time: < 2 seconds
  • Supported languages: 4 (JS, Python, Go, Rust)
  • Availability target: 99.9%
Approach

Interview-Ready Approach

1) Clarify Scope and SLOs

  • Problem statement: Design an online code execution sandbox (like CodeSandbox / Replit) that runs user code safely with real-time collaboration.
  • Design for a peak load target of around 100 RPS (including burst headroom). For reference, 500,000 executions/day averages about 5.8 executions/s (500,000 / 86,400 s).
  • Registered users: ~200,000
  • Code executions/day: ~500,000
  • Max execution time: 10 seconds
  • Max memory per sandbox: 256 MB
  • Sandbox startup time: < 2 seconds

2) Capacity Planning Method

  • Convert traffic and growth constraints into request rate, storage growth, and concurrency budgets.
  • Keep at least 2-3x safety margin per tier (ingress, compute, storage, async workers).
  • Reserve explicit latency budgets per hop so p95 can be defended in review.
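To make the method concrete for this problem, the sketch below converts executions/day into average and peak rates, then uses Little's law (concurrency = arrival rate x time in system) to size the sandbox fleet. The function name and every input other than the problem's constraints are assumptions: a 10x peak-to-average ratio, every run hitting the 10 s cap (worst case), every sandbox using its full 256 MiB, and the low end of the 2-3x margin. Under those assumptions it gives about 5.8 executions/s on average, about 58/s at peak, roughly 579 concurrent sandboxes (about 145 GiB of sandbox memory), and 1,158 sandbox slots after a 2x margin.

```python
import math
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class CapacityEstimate:
    avg_exec_per_sec: float
    peak_exec_per_sec: float
    peak_concurrent_sandboxes: float
    peak_memory_gib: float
    provisioned_sandboxes: int


def estimate_capacity(
    executions_per_day: int,
    peak_to_average: float,
    avg_runtime_s: float,
    memory_mib_per_sandbox: int,
    safety_margin: float,
) -> CapacityEstimate:
    avg = executions_per_day / SECONDS_PER_DAY
    peak = avg * peak_to_average
    # Little's law: items in the system = arrival rate x time each item spends in it.
    concurrent = peak * avg_runtime_s
    memory_gib = concurrent * memory_mib_per_sandbox / 1024
    return CapacityEstimate(
        avg_exec_per_sec=avg,
        peak_exec_per_sec=peak,
        peak_concurrent_sandboxes=concurrent,
        peak_memory_gib=memory_gib,
        provisioned_sandboxes=math.ceil(concurrent * safety_margin),
    )


if __name__ == "__main__":
    # Assumed inputs: 10x peak ratio, worst-case 10 s runs, full 256 MiB each, 2x margin.
    est = estimate_capacity(500_000, peak_to_average=10, avg_runtime_s=10,
                            memory_mib_per_sandbox=256, safety_margin=2)
    print(est)
```

Using measured average runtime instead of the 10 s cap shrinks the fleet proportionally, which is why the runtime distribution is worth instrumenting early.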

3) Architecture Decisions

  • Containerization: Run services/jobs in isolated containers with reproducible images and resource quotas.
  • Databases: Define a clear system-of-record and design read/write paths separately before adding optimizations.
  • WebSockets: Use persistent connection gateways and decouple fanout via pub/sub or queues.
  • API Design: Standardize API boundaries, idempotency keys, pagination, and error contracts first.
  • Storage: Use object storage for large blobs and keep metadata/authorization separate in the API tier.
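The WebSockets decision (decouple fanout via pub/sub) can be illustrated with an in-process stand-in for the realtime bus. In this sketch, the hypothetical `OutputHub` lets a sandbox worker publish output chunks for a session without waiting on any browser, and gives each subscriber (for example, both collaborators watching the same run) its own bounded buffer that the WebSocket gateway drains. Dropping the oldest chunk when a buffer fills is an assumed backpressure policy; disconnecting slow clients or pausing the producer are equally valid choices.

```python
from collections import defaultdict, deque
from typing import DefaultDict, Deque, Dict, List


class OutputHub:
    """In-process stand-in for the realtime bus: fans sandbox output out to every
    subscriber of a session, with a bounded buffer per subscriber."""

    def __init__(self, max_buffered: int = 1000) -> None:
        self._max_buffered = max_buffered
        self._subs: DefaultDict[str, Dict[str, Deque[str]]] = defaultdict(dict)
        self.dropped: DefaultDict[str, int] = defaultdict(int)  # chunks lost per subscriber

    def subscribe(self, session_id: str, subscriber_id: str) -> None:
        self._subs[session_id][subscriber_id] = deque(maxlen=self._max_buffered)

    def unsubscribe(self, session_id: str, subscriber_id: str) -> None:
        self._subs[session_id].pop(subscriber_id, None)

    def publish(self, session_id: str, chunk: str) -> None:
        # The producer (sandbox worker) never blocks on slow consumers.
        for subscriber_id, buf in self._subs.get(session_id, {}).items():
            if len(buf) == buf.maxlen:
                # Assumed backpressure policy: drop the oldest chunk and count the loss.
                self.dropped[subscriber_id] += 1
            buf.append(chunk)  # a full deque(maxlen=...) evicts its oldest item

    def drain(self, session_id: str, subscriber_id: str) -> List[str]:
        # Called by the WebSocket gateway when it is ready to write to this client.
        buf = self._subs[session_id][subscriber_id]
        chunks = list(buf)
        buf.clear()
        return chunks
```

In a multi-node deployment the same shape would sit on top of a real pub/sub system, with each gateway subscribing only to the sessions its connected clients are watching.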

4) Reliability and Failure Strategy

  • Use rolling deploys with readiness probes and fast rollback.
  • Use strong write constraints (transactions or conditional writes) and explicit backup/restore strategy.
  • Track connection churn, backpressure, and session resumption behavior.
  • Apply strict input validation and backward-compatible versioning.
  • Enforce lifecycle policies, retention tiers, and checksum validation.
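The "conditional writes" point matters most for project saves, where two tabs or two collaborators can save the same project concurrently. Below is a minimal in-memory sketch of optimistic concurrency, assuming each project record carries a version number (`ProjectStore`, `ProjectRecord`, `save`, and `VersionConflict` are hypothetical names): a save succeeds only if the caller's `expected_version` still matches the stored one. In SQL the same check is `UPDATE ... SET version = version + 1 WHERE id = ? AND version = ?` followed by a check that exactly one row changed; most NoSQL stores offer an equivalent conditional put.

```python
import threading
from dataclasses import dataclass
from typing import Dict


class VersionConflict(Exception):
    """Raised when the stored project changed since the caller last read it."""


@dataclass(frozen=True)
class ProjectRecord:
    version: int
    files: Dict[str, str]  # path -> file contents


class ProjectStore:
    """In-memory stand-in for the projects table in the system of record."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._rows: Dict[str, ProjectRecord] = {}

    def get(self, project_id: str) -> ProjectRecord:
        with self._lock:
            return self._rows[project_id]

    def save(self, project_id: str, expected_version: int, files: Dict[str, str]) -> int:
        """Conditional write: apply only if the stored version equals expected_version.

        expected_version=0 means "create; the project must not exist yet".
        Returns the new version number.
        """
        with self._lock:  # makes check-and-write atomic, like a DB conditional write
            current = self._rows.get(project_id)
            current_version = current.version if current is not None else 0
            if current_version != expected_version:
                raise VersionConflict(
                    f"{project_id}: expected v{expected_version}, found v{current_version}"
                )
            record = ProjectRecord(current_version + 1, dict(files))
            self._rows[project_id] = record
            return record.version
```

On `VersionConflict`, the client re-reads the latest version and retries or asks the user to resolve the conflict, instead of silently overwriting another save.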

5) Validation Plan

  • Run one peak-load test, one dependency-degradation test, and one failover test.
  • Verify idempotency for all retried writes and async consumers.
  • Track user-facing SLOs first: p95 latency, error rate, and successful throughput.
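One way to make "run" requests safe to retry, assuming the client sends an idempotency key with each submission: the API tier maps `(user_id, idempotency_key)` to the execution it already created, so a retried request returns the same execution id instead of queueing the code twice. `IdempotentSubmitter` and the `enqueue` callback are hypothetical; the in-memory dict stands in for a database table with a unique constraint and a TTL.

```python
import threading
import uuid
from typing import Callable, Dict, Tuple


class IdempotentSubmitter:
    """Hypothetical API-tier helper that dedupes execution submissions."""

    def __init__(self, enqueue: Callable[[str, str], None]) -> None:
        self._enqueue = enqueue  # pushes (execution_id, source) onto the job queue
        self._lock = threading.Lock()
        # In production this mapping would live in the database so that retries
        # landing on a different API instance are deduped too.
        self._by_key: Dict[Tuple[str, str], str] = {}

    def submit(self, user_id: str, idempotency_key: str, source: str) -> str:
        key = (user_id, idempotency_key)  # keys are scoped per user
        with self._lock:
            existing = self._by_key.get(key)
            if existing is not None:
                # A retry of a request already accepted: return the same execution id.
                return existing
            execution_id = str(uuid.uuid4())
            # Enqueue before recording the key, so a failed enqueue can be retried cleanly.
            self._enqueue(execution_id, source)
            self._by_key[key] = execution_id
            return execution_id
```

A validation test can replay each submission with the same key and assert that the queue received exactly one job.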

6) Trade-offs to Call Out in Interviews

  • Containerization: Containerization standardizes environments but increases orchestration complexity.
  • Databases: SQL gives stronger transactional guarantees; NoSQL often gives better write scaling and flexibility.
  • WebSockets: WebSockets reduce interaction latency but complicate scaling and state management.
  • API Design: Rich APIs improve developer speed but can create long-term compatibility burden.
  • Storage: Object storage is cheap and durable, but random low-latency reads are weaker than databases/caches.

Practical Notes

  • Use a hardened sandbox runtime such as gVisor (a user-space kernel that intercepts a container's system calls) or Firecracker microVMs. Both give much stronger isolation than plain containers while still starting quickly.
  • Pre-warm a pool of idle sandboxes per language so users don't wait for container creation.
  • Stream stdout/stderr via WebSocket from the sandbox worker to the browser for real-time output.
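The pre-warmed pool can be sketched as one queue of idle sandboxes per language. In this single-threaded sketch, `WarmPool`, `Sandbox`, and the `create` hook (which would boot a gVisor or Firecracker sandbox in practice) are hypothetical; `acquire` serves from the warm queue and falls back to a cold start when the queue is empty, and `refill` tops each language back up to its target. Sandboxes are handed out once and never returned, an assumed policy that avoids leaking one user's state into another user's run.

```python
import itertools
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict


@dataclass
class Sandbox:
    sandbox_id: str
    language: str


class WarmPool:
    """Single-threaded sketch of per-language pools of pre-started, single-use sandboxes."""

    def __init__(self, targets: Dict[str, int], create: Callable[[str], Sandbox]) -> None:
        self._targets = dict(targets)  # desired idle sandboxes per language
        self._create = create          # hypothetical hook that boots a sandbox
        self._idle: Dict[str, Deque[Sandbox]] = {lang: deque() for lang in self._targets}
        self.cold_starts = 0

    def idle_count(self, language: str) -> int:
        return len(self._idle.get(language, ()))

    def refill(self) -> None:
        # A background loop would call this periodically to top pools back up.
        for lang, target in self._targets.items():
            while len(self._idle[lang]) < target:
                self._idle[lang].append(self._create(lang))

    def acquire(self, language: str) -> Sandbox:
        idle = self._idle.get(language)
        if idle:
            # Warm path: hand out an already-started sandbox.
            return idle.popleft()
        # Pool empty (or language not pre-warmed): fall back to a slow cold start.
        self.cold_starts += 1
        return self._create(language)


if __name__ == "__main__":
    ids = itertools.count()
    pool = WarmPool({"python": 2, "go": 1}, lambda lang: Sandbox(f"{lang}-{next(ids)}", lang))
    pool.refill()
    print(pool.acquire("python"), pool.idle_count("python"))
```

Per-language targets can start from the concurrency estimate and be tuned by watching `cold_starts`; a rising count means the pool is too small to keep startup under the 2 s target.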

Reference Solution

Why This Solution Works

Request path: The solution keeps ingress, service logic, and stateful dependencies separated so each layer can scale independently.

Reference flow: Web Clients -> Load Balancer -> API Gateway -> API Service -> Primary NoSQL DB -> Realtime Bus -> Object Storage

Design strengths

  • The architecture keeps synchronous paths short and isolates stateful dependencies behind clear boundaries.

Interview defense

  • This design makes bottlenecks explicit (ingress, core compute, persistence, async workers).
  • It supports progressive scaling without re-architecting the core request path.
  • It keeps correctness-sensitive state changes in durable systems while offloading background work asynchronously.