
Distributed Job Scheduler

Databases · Message Queues · Monitoring · API Design

Problem Statement

CronCloud is building a managed job scheduling service (like a cloud-scale cron). Features:

- Schedule jobs - cron expressions ("every day at 3 AM"), one-time schedules, or interval-based ("every 15 minutes"). Jobs are HTTP webhook calls or message queue publishes.
- Reliability - jobs must execute exactly once (or at-least-once with idempotency). No missed executions, even during deployments or node failures.
- Distributed execution - a cluster of scheduler nodes that coordinate. If one node dies, another picks up its jobs.
- Retries & dead letter - failed jobs retry with exponential backoff (up to 5 retries). After max retries, move to a dead-letter queue for manual inspection.
- Job history - searchable log of every execution: start time, duration, status (success/failure), and output/error logs.
- Alerting - notify via webhook or email when a job consistently fails or misses its schedule.

Handle 5 million scheduled jobs triggering 50 million executions per day.

What You'll Learn

Design a distributed cron/job scheduler that runs millions of scheduled tasks reliably with retries and monitoring. Build this architecture under realistic production constraints, then validate tradeoffs in the design lab simulation.


Constraints

  • Registered jobs: ~5,000,000
  • Executions per day: ~50,000,000
  • Schedule accuracy: Within 5 seconds of target time
  • Missed execution tolerance: Zero
  • Max retry attempts: 5
  • Job history retention: 30 days
  • Availability target: 99.99%
Approach

Interview-Ready Approach

1) Clarify Scope and SLOs

  • Problem statement: Design a distributed cron/job scheduler that runs millions of scheduled tasks reliably with retries and monitoring.
  • Design for a peak load target around 2,894 RPS: 50M executions/day is ~579 RPS on average, with roughly 5x burst headroom on top.
  • Registered jobs: ~5,000,000
  • Executions per day: ~50,000,000
  • Schedule accuracy: Within 5 seconds of target time
  • Missed execution tolerance: Zero
  • Max retry attempts: 5
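The peak-load figure above follows from back-of-envelope arithmetic. A minimal sketch; the 5x burst multiplier is an assumed headroom factor, not part of the problem statement:

```python
# Capacity math: executions/day -> average and peak request rate.
EXECUTIONS_PER_DAY = 50_000_000
SECONDS_PER_DAY = 86_400
BURST_FACTOR = 5                 # assumed headroom multiplier

avg_rps = EXECUTIONS_PER_DAY / SECONDS_PER_DAY   # ~579 RPS average
peak_rps = avg_rps * BURST_FACTOR                # ~2,894 RPS at peak
```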

2) Capacity Planning Method

  • Convert traffic and growth constraints into request rate, storage growth, and concurrency budgets.
  • Keep at least 2-3x safety margin per tier (ingress, compute, storage, async workers).
  • Reserve explicit latency budgets per hop so p95 can be defended in review.

3) Architecture Decisions

  • Databases: Define a clear system-of-record and design read/write paths separately before adding optimizations.
  • Message Queues: Move non-blocking and retry-heavy work to async consumers with explicit retry and DLQ policies.
  • Monitoring: Instrument golden signals (latency, traffic, errors, saturation) per tier and per tenant/domain.
  • API Design: Standardize API boundaries, idempotency keys, pagination, and error contracts first.
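The retry-with-DLQ policy from the requirements (exponential backoff, at most 5 attempts) can be sketched as follows; the base delay and growth factor are illustrative assumptions, not spec values:

```python
def backoff_delays(base_s=30, factor=2, max_retries=5):
    """Seconds to wait before each retry attempt (capped at max_retries)."""
    return [base_s * factor ** attempt for attempt in range(max_retries)]

# With these assumed parameters a persistently failing job waits
# 30s, 60s, 120s, 240s, 480s between attempts, then moves to the
# dead-letter queue for manual inspection.
```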

4) Reliability and Failure Strategy

  • Use strong write constraints (transactions or conditional writes) and explicit backup/restore strategy.
  • Guarantee idempotent consumers and trace every message with correlation IDs.
  • Alert on user-impact SLOs, not only infrastructure metrics.
  • Apply strict input validation and backward-compatible versioning.
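An idempotent consumer turns at-least-once queue delivery into exactly-once effects. A minimal sketch, assuming each execution carries a unique `execution_id` (in production the dedup set would be a unique database index or a Redis key, not process memory):

```python
processed = set()   # illustrative stand-in for a durable dedup store

def handle(message):
    """Consume a job-execution message at-least-once, act exactly once."""
    key = message["execution_id"]      # doubles as the correlation ID
    if key in processed:
        return "duplicate-skipped"     # safe no-op on redelivery
    # ... fire the webhook / record the state change here ...
    processed.add(key)
    return "executed"
```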

5) Validation Plan

  • Run one peak-load test, one dependency-degradation test, and one failover test.
  • Verify idempotency for all retried writes and async consumers.
  • Track user-facing SLOs first: p95 latency, error rate, and successful throughput.

6) Trade-offs to Call Out in Interviews

  • Databases: SQL gives stronger transactional guarantees; NoSQL often gives better write scaling and flexibility.
  • Message Queues: Async pipelines absorb spikes well, but increase eventual-consistency complexity.
  • Monitoring: Deep observability speeds incident response but raises ingestion and tooling costs.
  • API Design: Rich APIs improve developer speed but can create long-term compatibility burden.

Practical Notes

  • Partition jobs across scheduler nodes (e.g., by hash of job_id). Use a consensus protocol (Raft) or a lease-based system to reassign partitions on node failure.
  • Pre-compute the next execution time for each job and store it in a 'next_fire' column. Poll for jobs where next_fire ≤ now().
  • Use a database row-level lock or an optimistic lock to prevent two nodes from executing the same job.
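The poll-and-claim loop above can be sketched with an optimistic lock (compare-and-swap on a version column). This is an in-memory illustration of the idea, not a production implementation; the field names and the 15-minute interval are assumptions, and the SQL equivalents are noted in comments:

```python
# Hypothetical stand-in for the jobs table: rows of (next_fire, version).
jobs = {"job-1": {"next_fire": 0.0, "version": 1}}

def due_jobs(now):
    """Poll phase. SQL equivalent:
    SELECT id, version FROM jobs WHERE next_fire <= now."""
    return [(jid, j["version"]) for jid, j in jobs.items()
            if j["next_fire"] <= now]

def claim(job_id, expected_version, interval_s=900):
    """Claim phase: only one node wins the compare-and-swap. SQL equivalent:
    UPDATE jobs SET version = version + 1, next_fire = next_fire + interval
    WHERE id = ? AND version = ?  -- 0 rows updated means another node won."""
    job = jobs[job_id]
    if job["version"] != expected_version:
        return False                  # another node already claimed it
    job["version"] += 1
    job["next_fire"] += interval_s    # pre-compute the next execution time
    return True
```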


Reference Solution

Why This Solution Works

Request path: The solution keeps ingress, service logic, and stateful dependencies separated so each layer can scale independently.

Reference flow: Web Clients -> Load Balancer -> API Gateway -> API Service -> Primary SQL DB -> Message Queue -> Background Workers -> Monitoring

Design strengths

  • Async queue/event bus isolates bursty workloads and supports retries without blocking synchronous requests.
  • Monitoring and logs are wired in from day one for rapid incident triage.

Interview defense

  • This design makes bottlenecks explicit (ingress, core compute, persistence, async workers).
  • It supports progressive scaling without re-architecting the core request path.
  • It keeps correctness-sensitive state changes in durable systems while offloading background work asynchronously.