Payment Gateway · Part 2 (Hard)

Payment Gateway 2 - Global & Fraud Detection

Databases · Sharding · Message Queues · Analytics · Geo Distribution · Monitoring

This challenge builds on Payment Gateway 1 - Online Checkout. Complete it first for the best experience.

Problem Statement

PayFlow has grown into a global payment platform processing 50 million transactions per day across 30 countries. New challenges:

- Real-time fraud detection - an ML pipeline must score every transaction in < 100 ms. It analyzes velocity (how many charges in the last hour), geolocation anomalies, device fingerprints, and behavioral patterns. Flagged transactions are held for manual review.
- Multi-currency settlement - merchants receive payouts in their local currency. The system must handle FX conversion, ledger entries in multiple currencies, and end-of-day settlement batches.
- Regulatory compliance - PSD2 / Strong Customer Authentication in Europe, different reserve requirements per country, and real-time reporting to financial regulators.
- Disaster recovery - a payment system can never lose a transaction. RPO = 0, RTO < 60 seconds.
- Ledger integrity - implement a double-entry accounting ledger that can be reconciled to the penny across billions of transactions.

This challenge tests distributed transactions, financial system design, and real-time ML at scale.

What You'll Learn

Scale to 50 M transactions/day across 30 countries with real-time fraud detection and multi-currency settlement. Build this architecture under realistic production constraints, then validate tradeoffs in the design lab simulation.


Constraints

Daily transactions: ~50,000,000
Countries: 30
Fraud scoring latency: < 100 ms
Currencies supported: 25+
Settlement frequency: Daily
RPO: 0 (zero data loss)
RTO: < 60 seconds
Availability target: 99.999%

Interview-Ready Approach

1) Clarify Scope and SLOs

  • Problem statement: Scale to 50 M transactions/day across 30 countries with real-time fraud detection and multi-currency settlement.
  • Design for a peak load target around 2,894 RPS (~579 RPS average at 50 M transactions/day, with ~5x burst headroom).
  • Daily transactions: ~50,000,000
  • Countries: 30
  • Fraud scoring latency: < 100 ms
  • Currencies supported: 25+
  • Settlement frequency: Daily

2) Capacity Planning Method

  • Convert traffic and growth constraints into request rate, storage growth, and concurrency budgets.
  • Keep at least 2-3x safety margin per tier (ingress, compute, storage, async workers).
  • Reserve explicit latency budgets per hop so p95 can be defended in review.
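The capacity-planning step above reduces to back-of-envelope arithmetic. The sketch below derives the ~2,894 RPS peak figure from the stated 50 M/day constraint; the burst factor and the 2 KB-per-transaction record size are assumptions for illustration, not numbers from the challenge.

```python
# Back-of-envelope capacity math from the stated constraints.
DAILY_TXNS = 50_000_000
SECONDS_PER_DAY = 86_400
BURST_FACTOR = 5          # assumed headroom multiplier; tune per traffic profile

avg_rps = DAILY_TXNS / SECONDS_PER_DAY   # ~579 RPS average
peak_rps = avg_rps * BURST_FACTOR        # ~2,894 RPS design target

# Storage growth: assume ~2 KB per transaction (ledger entries + metadata).
BYTES_PER_TXN = 2 * 1024
daily_bytes = DAILY_TXNS * BYTES_PER_TXN  # ~100 GB/day before replication
```

Running the same arithmetic per tier (ingress, compute, storage, async workers) gives the 2-3x safety margins a concrete baseline to be measured against.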

3) Architecture Decisions

  • Databases: Define a clear system-of-record and design read/write paths separately before adding optimizations.
  • Sharding: Choose shard keys around access patterns and growth hotspots, not just data size.
  • Message Queues: Move non-blocking and retry-heavy work to async consumers with explicit retry and DLQ policies.
  • Analytics: Maintain separate OLTP and analytics paths; stream events into a warehouse/time-series layer.
  • Geo Distribution: Route users to nearest region/edge while keeping write-consistency boundaries explicit.
  • Monitoring: Instrument golden signals (latency, traffic, errors, saturation) per tier and per tenant/domain.
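For the sharding decision above, the Practical Notes suggest keying the ledger on `merchant_id`. A minimal sketch of stable hash-based shard routing follows; the shard count of 64 is an assumed value, and a real deployment would add a directory or consistent-hashing layer to support rebalancing.

```python
import hashlib

NUM_SHARDS = 64  # assumed; choose from growth projections, plan for resharding

def shard_for(merchant_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Stable hash-based shard routing keyed on merchant_id.

    Keying on merchant_id keeps each merchant's ledger on one shard, so
    settlement and reconciliation avoid cross-shard transactions.
    """
    digest = hashlib.sha256(merchant_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

shard = shard_for("merchant_42")
```

Hashing avoids hotspots from sequential IDs, at the cost of losing range scans across merchants, which is acceptable when cross-merchant queries go through the analytics path instead.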

4) Reliability and Failure Strategy

  • Use strong write constraints (transactions or conditional writes) and explicit backup/restore strategy.
  • Support rebalancing and hotspot detection from day one.
  • Guarantee idempotent consumers and trace every message with correlation IDs.
  • Version event schemas and monitor drop/late-event rates.
  • Design region failover and data residency controls as first-class requirements.
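The idempotent-consumer guarantee above can be sketched as dedup-on-message-ID with a correlation ID carried through for tracing. This is a hypothetical in-memory shape; a real consumer would back the dedup set with a durable store and commit it atomically with the effect.

```python
class IdempotentConsumer:
    """Sketch of an idempotent event consumer (assumed shape, not a real API).

    Deduplicates on message_id so a redelivered message never applies its
    effect twice; correlation_id is kept with every applied effect for tracing.
    """

    def __init__(self):
        self.processed: set[str] = set()  # stand-in for a durable dedup store
        self.applied: list[dict] = []

    def handle(self, message: dict) -> bool:
        msg_id = message["message_id"]
        if msg_id in self.processed:
            return False  # duplicate delivery; effect already applied
        self.applied.append({"correlation_id": message["correlation_id"],
                             "payload": message["payload"]})
        self.processed.add(msg_id)  # mark only after the effect commits
        return True

consumer = IdempotentConsumer()
msg = {"message_id": "m-1", "correlation_id": "c-1", "payload": {"amount": 100}}
first = consumer.handle(msg)
second = consumer.handle(msg)  # redelivery is a no-op
```

With at-least-once delivery from the queue, this dedup step is what turns retries into safe operations for correctness-sensitive writes like ledger postings.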

5) Validation Plan

  • Run one peak-load test, one dependency-degradation test, and one failover test.
  • Verify idempotency for all retried writes and async consumers.
  • Track user-facing SLOs first: p95 latency, error rate, and successful throughput.
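Scoring a load-test run against the p95 SLO above is a one-liner once you have samples. The sketch below uses a nearest-rank percentile and made-up latency numbers, checked against the < 100 ms fraud-scoring budget from the constraints.

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile; enough for a quick SLO check."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

# Hypothetical fraud-scoring latencies from one peak-load test run (ms).
latencies_ms = [42, 55, 61, 48, 97, 120, 53, 49, 58, 44]
p95 = percentile(latencies_ms, 95)
fraud_slo_met = p95 < 100  # < 100 ms budget from the constraints
```

Here the single 120 ms outlier lands at p95 and fails the budget, which is exactly why tail percentiles, not averages, are the user-facing metric to track.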

6) Trade-offs to Call Out in Interviews

  • Databases: SQL gives stronger transactional guarantees; NoSQL often gives better write scaling and flexibility.
  • Sharding: Sharding improves horizontal scale but makes cross-shard queries and transactions harder.
  • Message Queues: Async pipelines absorb spikes well, but increase eventual-consistency complexity.
  • Analytics: Analytics pipeline unlocks insights, but adds eventual consistency and governance overhead.
  • Geo Distribution: Global latency improves, but cross-region consistency and operations become harder.

Practical Notes

  • A streaming pipeline (Kafka + Flink) can compute fraud features in real time and feed an ML scoring service.
  • Shard the ledger by merchant_id - each merchant's financial data is independent.
  • Use an append-only, immutable ledger design - never update or delete entries, only add correcting entries.
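The append-only double-entry ledger in the notes above can be sketched as follows: entries are immutable, every transaction must net to zero per currency, and mistakes are fixed with correcting entries rather than updates. The class and field names are illustrative, and a real ledger would persist entries durably rather than in a list.

```python
from dataclasses import dataclass
from decimal import Decimal

@dataclass(frozen=True)
class Entry:
    """Immutable ledger entry; corrections are new entries, never updates."""
    txn_id: str
    account: str
    currency: str
    amount: Decimal  # positive = debit, negative = credit

class Ledger:
    def __init__(self):
        self.entries: list[Entry] = []  # append-only

    def post(self, *entries: Entry) -> None:
        # Double-entry invariant: each transaction nets to zero per currency.
        totals: dict[str, Decimal] = {}
        for e in entries:
            totals[e.currency] = totals.get(e.currency, Decimal(0)) + e.amount
        if any(v != 0 for v in totals.values()):
            raise ValueError("unbalanced transaction")
        self.entries.extend(entries)

    def balance(self, account: str, currency: str) -> Decimal:
        return sum((e.amount for e in self.entries
                    if e.account == account and e.currency == currency),
                   Decimal(0))

ledger = Ledger()
ledger.post(
    Entry("t1", "customer_cash", "USD", Decimal("-25.00")),
    Entry("t1", "merchant_payable", "USD", Decimal("25.00")),
)
```

Because every posting balances and nothing is ever mutated, reconciliation reduces to summing entries, which is what makes "to the penny across billions of transactions" tractable. `Decimal` (not float) keeps monetary arithmetic exact.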


Reference Solution

Why This Solution Works

Request path: The solution keeps ingress, service logic, and stateful dependencies separated so each layer can scale independently.

Reference flow: Web Clients -> DNS -> Load Balancer -> Core Service -> Primary NoSQL DB -> Replica SQL DB -> Event Bus -> Background Workers

Design strengths

  • Async queue/event bus isolates bursty workloads and supports retries without blocking synchronous requests.
  • Analytics pipeline is separated from OLTP path to avoid reporting workloads impacting transactions.
  • Monitoring and logs are wired in from day one for rapid incident triage.

Interview defense

  • This design makes bottlenecks explicit (ingress, core compute, persistence, async workers).
  • It supports progressive scaling without re-architecting the core request path.
  • It keeps correctness-sensitive state changes in durable systems while offloading background work asynchronously.