HardIoT Platform · Part 2

IoT Platform 2 - Industrial Scale & Edge Computing

Databases · Message Queues · Monitoring · Sharding · Analytics · Containerization

This challenge builds on IoT Platform 1 - Smart Home Hub. Complete it first for the best experience.

Problem Statement

HomeLink has pivoted to industrial IoT (IIoT), serving manufacturing plants, energy grids, and logistics fleets. The platform now manages 50 million sensors across 10,000 facilities worldwide. Challenges:

- Telemetry ingestion - sensors report data (temperature, vibration, pressure, GPS) every second. The platform ingests 50 million events per second at peak.
- Edge computing - each facility runs an edge gateway that pre-processes data locally (filtering, aggregation, anomaly pre-screening) before forwarding to the cloud. Edge nodes must operate autonomously during internet outages.
- Anomaly detection - an ML pipeline monitors for equipment failures (predictive maintenance). It must alert operators within 10 seconds of detecting an anomaly.
- Data retention - store raw telemetry for 90 days, aggregated data for 5 years. Total storage in petabytes.
- Multi-tenant isolation - each facility's data is strictly isolated for security and compliance.
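The edge-autonomy requirement can be sketched as a bounded local buffer on the gateway: readings accumulate while the cloud link is down and flush in batches once it returns. This is a minimal illustration, not a production gateway; the `EdgeBuffer` name, batch size, and event shape are assumptions for the sketch.

```python
import collections
import json
import time

class EdgeBuffer:
    """Bounded local buffer for an edge gateway (illustrative sketch).

    Readings accumulate locally while the cloud link is down and are
    flushed in batches once connectivity returns. `max_events` caps
    memory so extended offline operation cannot exhaust the node;
    when the cap is hit, the oldest readings are dropped first.
    """

    def __init__(self, max_events):
        self._queue = collections.deque(maxlen=max_events)

    def record(self, sensor_id, value):
        self._queue.append({"sensor_id": sensor_id, "value": value, "ts": time.time()})

    def flush(self, batch_size=500):
        """Yield JSON-encoded batches for upload; called when the link is up."""
        while self._queue:
            batch = [self._queue.popleft()
                     for _ in range(min(batch_size, len(self._queue)))]
            yield json.dumps(batch)

buf = EdgeBuffer(max_events=10_000)
for i in range(3):
    buf.record(f"sensor-{i}", 21.5 + i)
batches = list(buf.flush(batch_size=2))  # 3 buffered events -> 2 batches
```

A real gateway would spill to local disk rather than RAM so a reboot during a 24-hour outage does not lose data, but the buffer-then-sync shape is the same.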

This challenge tests time-series data at extreme scale, edge architecture, and real-time ML.
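To make the 10-second anomaly budget concrete, a cheap edge-side pre-screen can flag obvious deviations before the full ML pipeline sees the data. The rolling z-score below is a stand-in for illustration only, not the platform's actual model; window size and threshold are assumed values.

```python
import collections
import math

class RollingZScore:
    """Flag readings more than `threshold` standard deviations from the
    recent mean. A stand-in for a real predictive-maintenance model,
    useful as a cheap edge-side pre-screen before cloud-side ML."""

    def __init__(self, window=60, threshold=3.0):
        self.window = collections.deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value):
        anomalous = False
        if len(self.window) >= 10:  # need some history before judging
            mean = sum(self.window) / len(self.window)
            var = sum((x - mean) ** 2 for x in self.window) / len(self.window)
            std = math.sqrt(var)
            if std > 0 and abs(value - mean) / std > self.threshold:
                anomalous = True
        self.window.append(value)
        return anomalous

det = RollingZScore(window=30)
normal = [det.observe(20.0 + (i % 3) * 0.1) for i in range(30)]  # steady readings
spike = det.observe(95.0)  # sudden vibration spike is flagged
```

Running the screen at the edge also shrinks the alert path: a flagged reading can be forwarded immediately instead of waiting for batch aggregation.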

What You'll Learn

Scale to 50 M industrial sensors with edge processing, anomaly detection, and five-nines uptime. Build this architecture under realistic production constraints, then validate tradeoffs in the design lab simulation.


Constraints

Total sensors: ~50,000,000
Peak events/second: ~50,000,000
Facilities: ~10,000
Anomaly alert latency: < 10 seconds
Raw data retention: 90 days
Aggregate retention: 5 years
Edge autonomy (offline): up to 24 hours
Availability target: 99.999%

Interview-Ready Approach

1) Clarify Scope and SLOs

  • Problem statement: Scale to 50 M industrial sensors with edge processing, anomaly detection, and five-nines uptime.
  • Design for a peak ingest target around 50,000,000 events/second, with burst headroom on top.
  • Total sensors: ~50,000,000
  • Peak events/second: ~50,000,000
  • Facilities: ~10,000
  • Anomaly alert latency: < 10 seconds
  • Raw data retention: 90 days

2) Capacity Planning Method

  • Convert traffic and growth constraints into request rate, storage growth, and concurrency budgets.
  • Keep at least 2-3x safety margin per tier (ingress, compute, storage, async workers).
  • Reserve explicit latency budgets per hop so p95 can be defended in review.
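The method above reduces to back-of-envelope arithmetic from the stated constraints. The 100-byte average event size is an assumption for illustration; everything else comes from the constraints table.

```python
# Back-of-envelope capacity math from the stated constraints.
PEAK_EVENTS_PER_SEC = 50_000_000
FACILITIES = 10_000
EVENT_BYTES = 100            # assumed average payload size (not in the spec)
RAW_RETENTION_DAYS = 90
SAFETY_MARGIN = 3            # the 2-3x headroom per tier, taken at the high end

per_facility_eps = PEAK_EVENTS_PER_SEC / FACILITIES          # events/s per site
ingest_bytes_per_sec = PEAK_EVENTS_PER_SEC * EVENT_BYTES     # bytes/s at peak
raw_storage_pb = (ingest_bytes_per_sec * 86_400 * RAW_RETENTION_DAYS) / 1e15
provisioned_eps = PEAK_EVENTS_PER_SEC * SAFETY_MARGIN

print(f"per-facility ingest: {per_facility_eps:,.0f} events/s")
print(f"peak ingest bandwidth: {ingest_bytes_per_sec / 1e9:.1f} GB/s")
print(f"raw storage over 90 days: ~{raw_storage_pb:.0f} PB")
print(f"provisioned capacity: {provisioned_eps:,.0f} events/s")
```

Under the assumed payload size this lands at roughly 5 GB/s of peak ingest and tens of petabytes of raw retention, which is consistent with the problem statement's "total storage in petabytes".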

3) Architecture Decisions

  • Databases: Define a clear system-of-record and design read/write paths separately before adding optimizations.
  • Message Queues: Move non-blocking and retry-heavy work to async consumers with explicit retry and DLQ policies.
  • Monitoring: Instrument golden signals (latency, traffic, errors, saturation) per tier and per tenant/domain.
  • Sharding: Choose shard keys around access patterns and growth hotspots, not just data size.
  • Analytics: Maintain separate OLTP and analytics paths; stream events into a warehouse/time-series layer.
  • Containerization: Run services/jobs in isolated containers with reproducible images and resource quotas.
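For the sharding decision, keying on facility fits both the access pattern (queries are per-facility time ranges) and the multi-tenant isolation requirement. A minimal routing sketch, with the shard count and hash choice as assumptions:

```python
import hashlib

NUM_SHARDS = 256  # assumed; a power of two eases future resharding

def shard_for(facility_id: str) -> int:
    """Route by facility so one tenant's data lives on one shard.

    This keeps per-facility time-range queries single-shard and gives a
    natural isolation boundary. A hot facility can be sub-partitioned by
    sensor_id within its shard without changing this routing layer.
    """
    digest = hashlib.sha256(facility_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# The same facility always maps to the same shard:
a = shard_for("plant-berlin-03")
b = shard_for("plant-berlin-03")
```

Hashing the facility ID (rather than ranging on it) spreads facilities evenly, at the cost of making cross-facility scans fan out to all shards, which matches the trade-off called out later.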

4) Reliability and Failure Strategy

  • Use strong write constraints (transactions or conditional writes) and explicit backup/restore strategy.
  • Guarantee idempotent consumers and trace every message with correlation IDs.
  • Alert on user-impact SLOs, not only infrastructure metrics.
  • Support rebalancing and hotspot detection from day one.
  • Version event schemas and monitor drop/late-event rates.
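The idempotent-consumer and correlation-ID points above can be shown in a few lines. This is a sketch under simplifying assumptions: a real deployment would back the dedup set with a TTL'd store (e.g. Redis) rather than in-process memory.

```python
import json

class IdempotentConsumer:
    """Consumer that deduplicates on a correlation ID, so messages
    redelivered by an at-least-once queue are applied exactly once.

    `seen` is an in-memory set for illustration; production code would
    use a shared store with expiry so the set does not grow unbounded.
    """

    def __init__(self, handler):
        self.handler = handler
        self.seen = set()

    def consume(self, raw_message: str) -> bool:
        msg = json.loads(raw_message)
        cid = msg["correlation_id"]
        if cid in self.seen:
            return False          # duplicate delivery: skip, but still ack
        self.handler(msg["payload"])
        self.seen.add(cid)        # mark applied only after the handler succeeds
        return True

applied = []
consumer = IdempotentConsumer(applied.append)
m = json.dumps({"correlation_id": "evt-42", "payload": {"temp": 88.1}})
first = consumer.consume(m)
second = consumer.consume(m)   # simulated redelivery: ignored
```

The same correlation ID also serves as the trace key, so a message can be followed from edge gateway through queue to worker during incident triage.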

5) Validation Plan

  • Run one peak-load test, one dependency-degradation test, and one failover test.
  • Verify idempotency for all retried writes and async consumers.
  • Track user-facing SLOs first: p95 latency, error rate, and successful throughput.

6) Trade-offs to Call Out in Interviews

  • Databases: SQL gives stronger transactional guarantees; NoSQL often gives better write scaling and flexibility.
  • Message Queues: Async pipelines absorb spikes well, but increase eventual-consistency complexity.
  • Monitoring: Deep observability speeds incident response but raises ingestion and tooling costs.
  • Sharding: Sharding improves horizontal scale but makes cross-shard queries and transactions harder.
  • Analytics: Analytics pipeline unlocks insights, but adds eventual consistency and governance overhead.

Practical Notes

  • Time-series databases (TimescaleDB, InfluxDB, or QuestDB) are purpose-built for this ingest rate.
  • Edge gateways can run lightweight containers (K3s) that buffer data locally during outages and sync when reconnected.
  • Downsample older data - raw 1-second data for 90 days, 1-minute aggregates for 5 years.
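The downsampling note above amounts to collapsing 1-second readings into per-minute aggregate rows. A minimal sketch, with the min/max/avg/count row shape as an assumption (real systems often also keep percentiles):

```python
from collections import defaultdict

def downsample(readings, bucket_seconds=60):
    """Collapse (timestamp, value) pairs at 1 s resolution into
    per-minute aggregates: the shape retained for the 5-year tier."""
    buckets = defaultdict(list)
    for ts, value in readings:
        buckets[ts - ts % bucket_seconds].append(value)
    return {
        bucket: {
            "min": min(vals),
            "max": max(vals),
            "avg": sum(vals) / len(vals),
            "count": len(vals),
        }
        for bucket, vals in buckets.items()
    }

raw = [(t, 20.0 + (t % 60) * 0.01) for t in range(0, 120)]  # 2 min of 1 s data
agg = downsample(raw)   # 120 raw points collapse to 2 aggregate rows
```

At the stated scale this is a 60x row reduction for the long-term tier, which is what makes 5-year retention affordable; time-series databases like TimescaleDB can maintain such rollups continuously.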


Reference Solution

Why This Solution Works

Request path: The solution keeps ingress, service logic, and stateful dependencies separated so each layer can scale independently.

Reference flow: IoT Devices -> Load Balancer -> Core Service -> Primary NoSQL DB -> Event Bus -> Background Workers -> Stream Processor -> Data Warehouse

Design strengths

  • Async queue/event bus isolates bursty workloads and supports retries without blocking synchronous requests.
  • Analytics pipeline is separated from OLTP path to avoid reporting workloads impacting transactions.
  • Monitoring and logs are wired in from day one for rapid incident triage.

Interview defense

  • This design makes bottlenecks explicit (ingress, core compute, persistence, async workers).
  • It supports progressive scaling without re-architecting the core request path.
  • It keeps correctness-sensitive state changes in durable systems while offloading background work asynchronously.