Distributed Systems Design: From Theory to Production
The academic view of distributed systems is elegant: Lamport clocks, consensus protocols, formal proofs. The real-world view is messier: timeouts, partial failures, cascading crashes, and trade-offs that theory doesn't prepare you for.
I've built distributed systems for fintech platforms serving millions. Here's what theory gets right—and what it misses.
Three Things Will Go Wrong
Any distributed system will face these realities:
- Networks fail. Partitions happen. Messages get lost or delayed. Accept this.
- Clocks lie. Your servers' clocks will drift. Don't rely on wall-clock time for ordering.
- Cascades. One failure triggers another, which triggers another. Design for isolation.
Most systems fail because they assume these don't happen. They do. Always.
Consistency vs. Availability: The Real Trade-Off
The CAP theorem is often misunderstood. Here's the truth:
During a network partition, you can't have both consistency and availability. Partition tolerance isn't optional (partitions will happen whether you like it or not), so the real choice is what you sacrifice when one hits. But here's the practical version:
- Strong consistency (linearizable): Every read sees the latest write. Used for financial transactions and user authentication. Costs: higher latency, reduced availability during partitions.
- Eventual consistency: Reads might be stale, but will converge. Used for caches, timelines, non-critical data. Costs: complexity in conflict resolution.
- Causal consistency: Middle ground. Reads respect causality but not total ordering. Good for collaborative systems.
"The choice of consistency model is the most important decision in system architecture. Everything else follows from it."
Real Example: Payment Processing (Strong Consistency)
At alt.bank, payment processing requires strong consistency. A user can't spend the same $100 twice.
The architecture:
User initiates payment
↓
Request hits payment service
↓
Acquire lock on user's account
↓
Check balance (now guaranteed fresh)
↓
If balance sufficient:
- Deduct amount
- Write transaction record
- Release lock
Else:
- Reject payment
- Release lock
The lock ensures no concurrent transaction can double-spend. Cost: latency, since concurrent requests wait on the lock. But that's non-negotiable for payments.
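Here's a minimal sketch of that flow in Python. The per-account lock below is an in-process stand-in for a real distributed lock (a Redis lock or a database row lock in practice), and the account store and transaction log are illustrative, not alt.bank's actual code:

import threading
from collections import defaultdict

# Stand-in for a distributed lock service; production code would use a
# Redis lock or a SELECT ... FOR UPDATE row lock instead.
_account_locks = defaultdict(threading.Lock)

# Hypothetical in-memory stores; real code reads and writes the ledger DB.
balances = {"alice": 250}
transactions = []

def process_payment(account_id: str, amount: int) -> bool:
    with _account_locks[account_id]:                 # acquire lock on the account
        balance = balances[account_id]               # guaranteed fresh under the lock
        if balance < amount:
            return False                             # reject payment
        balances[account_id] = balance - amount      # deduct amount
        transactions.append((account_id, -amount))   # write transaction record
        return True                                  # lock released on exit

# Two concurrent $200 payments against a $250 balance: only one can succeed.
results = []
threads = [threading.Thread(target=lambda: results.append(process_payment("alice", 200)))
           for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(results), balances["alice"])   # [False, True] 50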
Real Example: User Timeline (Eventual Consistency)
But not everything needs strong consistency. When a user posts to their timeline:
User creates post
↓
Write to primary database
↓
Immediately return to user ("success!")
↓
Async replicate to followers' feeds
↓
Followers see post (possibly seconds later)
A few seconds of staleness is fine. Users expect Twitter to be fast, not perfectly consistent. This architecture scales to billions of posts.
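A sketch of that write path, assuming a simple queue-based fan-out worker; the follower lookup and feed store are illustrative:

import queue
import threading
import time

posts = {}                    # primary store: post_id -> text
feeds = {}                    # follower feeds: user -> [post_id, ...]
fanout_queue = queue.Queue()

def followers_of(user):
    # Hypothetical follower lookup; a real system reads the follower graph.
    return ["bob", "carol"]

def create_post(user, post_id, text):
    posts[post_id] = text                # write to primary database
    fanout_queue.put((user, post_id))    # schedule async replication
    return "success!"                    # return to the user immediately

def fanout_worker():
    while True:
        user, post_id = fanout_queue.get()
        time.sleep(0.1)                  # simulated replication lag
        for follower in followers_of(user):
            feeds.setdefault(follower, []).append(post_id)
        fanout_queue.task_done()

threading.Thread(target=fanout_worker, daemon=True).start()

print(create_post("alice", "p1", "hello"))   # returns before any follower sees it
fanout_queue.join()                          # feeds converge a moment later
print(feeds)                                 # {'bob': ['p1'], 'carol': ['p1']}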
The Hidden Complexity: Consensus
When you need multiple nodes to agree on state (critical for leader election, shard coordination), you need consensus.
Most teams rely on Raft or Paxos, but abstracted away behind systems like etcd, Consul, or Kafka. You don't usually implement consensus yourself.
But understanding consensus helps:
- Why writes to the leader are fast but reads might lag
- Why you need at least three nodes (a two-node cluster can't form a majority after a partition, which risks split-brain)
- Why leader election takes time (election timeouts keep transient blips from cascading into spurious elections)
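The arithmetic behind those points is just majority math. A quick sketch, independent of any particular Raft or Paxos implementation:

def majority_quorum(n):
    # Smallest number of nodes that forms a majority of n.
    return n // 2 + 1

def tolerated_failures(n):
    # Nodes that can fail while the rest can still form a majority.
    return (n - 1) // 2

for n in (1, 2, 3, 5):
    print(n, majority_quorum(n), tolerated_failures(n))
# 1 node:  quorum 1, tolerates 0 failures
# 2 nodes: quorum 2, tolerates 0 failures (no better than 1 node; split-brain risk)
# 3 nodes: quorum 2, tolerates 1 failure
# 5 nodes: quorum 3, tolerates 2 failures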
The Pattern That Actually Works: Saga Pattern
Transactions across microservices are hard. Two-phase commit (2PC) is slow and fragile. The better pattern is the saga:
Orchestrator: "Transfer $100 from account A to B"
↓
1. Service A: Deduct $100 from A
(saves compensating action: "add $100 back")
↓
2. Service B: Add $100 to B
(saves compensating action: "deduct $100")
↓
If either fails:
- Execute compensating actions in reverse order
- Transaction rolls back consistently
This handles failures gracefully. If step 2 fails, you revert step 1. No deadlocks, no long-running locks.
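A minimal orchestrated saga in Python. The two services are faked as local functions and the transfer itself is illustrative; the part that matters is the compensation stack unwinding in reverse order:

class SagaFailed(Exception):
    pass

def run_saga(steps):
    # steps is a list of (action, compensation) pairs.
    compensations = []
    try:
        for action, compensation in steps:
            action()
            compensations.append(compensation)
    except Exception as exc:
        for compensation in reversed(compensations):
            compensation()                    # roll back completed steps
        raise SagaFailed(str(exc)) from exc

# Hypothetical service calls for "transfer $100 from A to B".
accounts = {"A": 300, "B": 50}

def deduct_a():   accounts["A"] -= 100
def refund_a():   accounts["A"] += 100
def credit_b():   raise RuntimeError("service B unavailable")   # simulated failure
def uncredit_b(): accounts["B"] -= 100

try:
    run_saga([(deduct_a, refund_a), (credit_b, uncredit_b)])
except SagaFailed:
    pass

print(accounts)   # {'A': 300, 'B': 50}: step 1 was compensated, nothing is lost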
Failure Detection: The Underrated Problem
How do you know a server is dead? You can't just ping it—networks are unreliable.
- Naive approach: If no heartbeat in 5 seconds, it's dead.
- Problem: Network jitter causes false positives. You mark healthy servers as dead.
- Better approach: timeouts plus failure-pattern detection. "3 timeouts in a row" is a stronger signal than "1 timeout."
In Kubernetes, liveness and readiness probes handle this. Your job: trust them, not your intuition.
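Here's what the "N timeouts in a row" rule looks like in code. This is a sketch of the idea, not what the probes actually run; in production you lean on the platform, as noted above:

class FailureDetector:
    # Mark a peer suspect only after `threshold` consecutive missed heartbeats.

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.misses = 0

    def heartbeat_received(self):
        self.misses = 0                  # any successful heartbeat resets suspicion

    def heartbeat_timed_out(self):
        self.misses += 1

    def is_suspect(self):
        return self.misses >= self.threshold

d = FailureDetector()
d.heartbeat_timed_out()                  # one blip of network jitter
print(d.is_suspect())                    # False: not marked dead yet
d.heartbeat_timed_out()
d.heartbeat_timed_out()
print(d.is_suspect())                    # True: three misses in a row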
Load Balancing: More Nuanced Than You Think
Round-robin load balancing sounds fair but isn't:
- Some servers are slower (higher latency)
- Some requests are heavier (take more CPU)
- Round-robin ignores both
Better approach: least-outstanding-requests. Route each new request to the server with the fewest pending requests. Slow or overloaded servers keep requests outstanding longer, so load naturally shifts away from them.
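A sketch of least-outstanding-requests routing; the server names and in-flight counter are illustrative:

from collections import defaultdict

servers = ["app-1", "app-2", "app-3"]
outstanding = defaultdict(int)        # server -> requests currently in flight

def pick_server():
    # Route to the server with the fewest pending requests.
    return min(servers, key=lambda s: outstanding[s])

def send_request(handler):
    server = pick_server()
    outstanding[server] += 1
    try:
        return handler(server)        # perform the actual call
    finally:
        outstanding[server] -= 1      # slow servers stay "busy" longer

# A slow server keeps its outstanding count elevated while its requests are
# in flight, so new requests naturally drain toward faster servers.
print(send_request(lambda server: f"handled by {server}"))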
Key Takeaways
- Accept that failures happen. Design for them.
- Choose your consistency model first; everything else follows from it.
- Use sagas for distributed transactions, not 2PC.
- Failure detection is harder than it seems; use proven libraries.
- Understand your consensus layer (even if it's hidden).
- Test failure scenarios. Chaos engineering isn't optional.
Distributed systems are hard because the real world is hard. But with the right patterns, they're predictable.
Building scalable systems?
I specialize in distributed architecture, system design, and production scaling.