Challenges of Distribution

Updated June 3, 2026

Magic Magnets Team

There's a moment every engineer hits when they first build something distributed: things start failing in ways that don't make sense. Requests time out for no obvious reason. Two nodes disagree about what happened. A message was definitely sent, but the receiver never got it.

Welcome to distributed systems. The hard part isn't the technology; it's that your mental model of how computers work stops being accurate.

The 8 Fallacies of Distributed Computing

In the 1990s, engineers at Sun Microsystems (Peter Deutsch, James Gosling, and others) catalogued a set of assumptions that developers incorrectly make when building distributed systems. They called them the 8 fallacies of distributed computing.

Every single one of them has bitten production systems at scale. Read them slowly.

The network is reliable: Packets get dropped. Connections time out. TCP retransmits silently. Your call to a remote service might have gone through, might not have, and you might not know which.
Latency is zero: A function call takes nanoseconds. A network call takes milliseconds. This sounds obvious, but systems built assuming fast networks fall apart when deployed across data centers or the open internet.
Bandwidth is infinite: You can't send 10 GB of data across the wire in a millisecond. Bandwidth costs money and has real limits. Systems that ignore this bloat their serialization formats and wonder why they're slow.
The network is secure: Data in transit can be intercepted. Services can be impersonated. TLS everywhere isn't paranoia. It's baseline hygiene.
Topology doesn't change: Nodes come and go. IP addresses change. Kubernetes pods restart with new IPs. Service discovery isn't optional.
There is one administrator: In a distributed system, different parts might be owned by different teams, companies, or even clouds. Coordination assumptions break.
Transport cost is zero: Network calls have overhead: serialization, TCP handshakes, TLS negotiation. Chatty protocols are expensive.
The network is homogeneous: Different nodes might run different OS versions, different hardware, different network stacks. You can't assume uniform behavior.

These aren't theoretical. Each one is a category of real production incidents. The fallacies exist because distributed systems force you to confront realities that a single machine hides.

Network Unreliability and Partial Failures

On a single machine, a function either returns or throws an exception. You always know what happened.

On a distributed system, you have a third possibility: you don't know what happened. You sent a request, and you got... nothing. Did the request arrive? Did it arrive and get processed? Did the processing fail halfway through? Did the response get lost on the way back?

This is called a partial failure — some parts of the system are working, others aren't, and you can't be sure which. Handling partial failures is one of the central challenges of distributed design.

The naive solution is to retry. But retrying isn't safe if the operation isn't idempotent. If you charged a customer's card, timed out, and retried, you might charge them twice. This is why idempotency keys exist: the payment processor deduplicates based on a unique request ID, so retrying is safe.
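The deduplication idea can be sketched in a few lines. This is a minimal in-memory illustration, not how any particular payment processor works: the `PaymentProcessor` class, its `charge` method, and the `cust_42` customer ID are hypothetical names invented for the example.

```python
import uuid


class PaymentProcessor:
    """Toy in-memory processor that deduplicates charges by idempotency key."""

    def __init__(self):
        self._results = {}      # idempotency key -> stored result
        self.charges_made = 0   # how many times a card was actually charged

    def charge(self, idempotency_key, customer_id, amount_cents):
        # A repeated key means this is a retry: return the stored result
        # instead of charging the card again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        result = {
            "customer": customer_id,
            "amount_cents": amount_cents,
            "status": "charged",
        }
        self._results[idempotency_key] = result
        return result


processor = PaymentProcessor()
key = str(uuid.uuid4())  # the client generates one key per logical payment

first = processor.charge(key, "cust_42", 1999)
retry = processor.charge(key, "cust_42", 1999)  # e.g. after a client-side timeout

print(processor.charges_made)  # 1
```

The key point is that the client reuses the same key for every retry of the same logical payment. A production version would also need durable storage and an atomic check-and-insert, which this sketch leaves out.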

The Two Generals Problem

The Two Generals Problem is a classic thought experiment that illustrates why coordination over an unreliable network is fundamentally hard.

Imagine two generals planning a coordinated attack on a city. They can only communicate by sending messengers through enemy territory (messengers might be captured, meaning messages might be lost). General A sends a message: "Attack at dawn." General B receives it and sends a confirmation. But General A doesn't know if the confirmation arrived. So A sends a confirmation of the confirmation. Now B doesn't know if that arrived...

This regress never ends. There is no protocol that can guarantee both generals know they're in agreement, given an unreliable channel. This isn't a solvable engineering problem. It's mathematically proven to be impossible.

The relevance to distributed systems: you can never achieve perfect coordination over an unreliable network. What you can do is design systems that tolerate the uncertainty. You use timeouts and assume failure. You use idempotent operations so retries are safe. You use consensus algorithms that give you strong enough guarantees for practical purposes (but not theoretical perfection).
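The "use timeouts and assume failure" advice, combined with safe retries, might look roughly like this. It is a sketch under assumptions: `RequestTimeout`, `call_with_retries`, and `flaky_operation` are hypothetical names, the operation is assumed to be idempotent, and exponential backoff with jitter is one common retry policy rather than the only reasonable one.

```python
import random
import time


class RequestTimeout(Exception):
    """Raised when no response arrived in time; the outcome is unknown."""


def call_with_retries(operation, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call an idempotent operation, retrying on timeout with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except RequestTimeout:
            if attempt == max_attempts:
                raise  # give up and surface the uncertainty to the caller
            # Exponential backoff with jitter so retries do not arrive in lockstep.
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay))


# Usage: a fake operation that times out twice, then succeeds.
attempts = []


def flaky_operation():
    attempts.append(1)
    if len(attempts) < 3:
        raise RequestTimeout("no response")
    return "ok"


# The sleep function is injected so the example runs instantly.
result = call_with_retries(flaky_operation, sleep=lambda seconds: None)
print(result, len(attempts))  # ok 3
```

Note that the caller never learns whether the timed-out attempts were processed. The retry is only safe because the operation is assumed to be idempotent.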

Latency as a First-Class Concern

In a monolith, a method call takes ~1 nanosecond. In a distributed system, a network call within the same data center takes ~0.5ms. Across data centers: 5-150ms. Across continents: 100-300ms.

That gap is enormous: roughly five to six orders of magnitude inside a single data center, and up to eight across continents. If your system makes 10 sequential service calls to render a page, and each takes 50ms, you've already spent 500ms waiting on the network before doing any real work. This is the microservices latency trap many companies have discovered the hard way.

Design for latency:

Fan out in parallel: if calls are independent, make them concurrently
Cache aggressively: reduce remote calls to local reads
Batch: aggregate multiple small calls into fewer large ones
Co-locate: keep latency-sensitive services in the same region
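The first item, fanning out in parallel, is easy to demonstrate with `asyncio`. The sketch below simulates ten independent calls of about 50 ms each; `call_service` is a hypothetical stand-in that only sleeps, so the numbers illustrate the shape of the saving rather than real network behavior.

```python
import asyncio
import time


async def call_service(name, latency_s=0.05):
    """Stand-in for a remote call: waits about 50 ms and returns a payload."""
    await asyncio.sleep(latency_s)
    return f"{name}: ok"


async def render_sequential(services):
    # Each call waits for the previous one, so latencies add up.
    return [await call_service(s) for s in services]


async def render_parallel(services):
    # gather runs the independent calls concurrently; total time is
    # roughly that of the slowest call, not the sum of all calls.
    return await asyncio.gather(*(call_service(s) for s in services))


def timed(coro):
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


services = [f"service-{i}" for i in range(10)]
seq_result, seq_elapsed = timed(render_sequential(services))
par_result, par_elapsed = timed(render_parallel(services))
print(f"sequential: {seq_elapsed * 1000:.0f} ms, parallel: {par_elapsed * 1000:.0f} ms")
```

This only helps when the calls are truly independent. If call B needs the result of call A, the two stay sequential no matter how they are scheduled.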

Why Distributed Systems Are Fundamentally Harder

Distributed systems don't just add complexity. They change the nature of the problem.

On a single machine: fail or succeed. On a distributed system: fail, succeed, partially succeed, or unknown.

On a single machine: sequential execution is the default. On a distributed system: concurrent execution is unavoidable, and race conditions become subtle and hard to reproduce.
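One of the simplest races is the lost update, where two clients read the same value and each writes back a result based on its stale copy. The sketch below replays that interleaving deterministically, then shows one possible mitigation, a version check in the style of optimistic concurrency control. The `VersionedStore` class is hypothetical, and the version check is an illustrative technique rather than something prescribed above.

```python
class VersionedStore:
    """Toy single-value store where every write bumps a version number."""

    def __init__(self, value):
        self.value = value
        self.version = 0

    def read(self):
        return self.value, self.version

    def write(self, value):
        # Blind write: the last writer wins.
        self.value = value
        self.version += 1

    def compare_and_set(self, value, expected_version):
        # Only write if nobody else has written since we read.
        if self.version != expected_version:
            return False
        self.write(value)
        return True


# Lost update: both clients read 100 before either one writes.
store = VersionedStore(100)
a_value, _ = store.read()
b_value, _ = store.read()
store.write(a_value + 10)   # client A adds 10
store.write(b_value + 20)   # client B adds 20, overwriting A's update
lost_update_result = store.value

# Same interleaving with compare-and-set: B's stale write is rejected and retried.
store = VersionedStore(100)
a_value, a_version = store.read()
b_value, b_version = store.read()
a_ok = store.compare_and_set(a_value + 10, a_version)
b_ok = store.compare_and_set(b_value + 20, b_version)
if not b_ok:
    b_value, b_version = store.read()
    b_ok = store.compare_and_set(b_value + 20, b_version)
safe_result = store.value

print(lost_update_result, safe_result)  # 120 130
```

In the first run A's increment silently disappears. In a real system the interleaving depends on network timing, which is why such bugs are hard to reproduce.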

On a single machine: a transaction either commits or rolls back. On a distributed system: distributed transactions require coordination protocols (2PC, Saga pattern) that are expensive and still have failure modes.
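To make the Saga pattern concrete: each step is paired with a compensating action, and a failure triggers the compensations for the steps already completed, in reverse order. The following is a minimal sketch with hypothetical step names (`reserve_inventory`, `charge_card`, `schedule_shipping`); it assumes compensations always succeed, which is exactly the kind of failure mode a real implementation has to handle.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order; undo completed steps on failure."""
    completed = []
    for action, compensation in steps:
        try:
            action()
        except Exception:
            # Roll back by running compensations in reverse order.
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensation)
    return True


log = []


def reserve_inventory():
    log.append("inventory reserved")


def release_inventory():
    log.append("inventory released")


def charge_card():
    log.append("card charged")


def refund_card():
    log.append("card refunded")


def schedule_shipping():
    raise RuntimeError("shipping service unavailable")


def cancel_shipping():
    log.append("shipping cancelled")


succeeded = run_saga([
    (reserve_inventory, release_inventory),
    (charge_card, refund_card),
    (schedule_shipping, cancel_shipping),
])

print(succeeded)  # False
print(log)  # ['inventory reserved', 'card charged', 'card refunded', 'inventory released']
```

Unlike a local rollback, the intermediate states here were visible to other services for a while: the card really was charged before it was refunded. A production saga would also persist its progress so it can resume after a crash.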

None of this means you shouldn't build distributed systems — you often have no choice if you want scale, fault tolerance, or geo-distribution. But you should go in with clear eyes. Every capability you take for granted on a single machine has to be deliberately re-engineered in a distributed context.

The best distributed system is the one you don't build. Before distributing, ask if a well-tuned single server (or a primary + read replicas) solves your problem. Often it does.

Summary

Distributed systems are hard because they expose realities that a single machine hides. The 8 fallacies of distributed computing capture the assumptions developers incorrectly carry over, such as a reliable network, zero latency, and a fixed topology. In practice the network is unreliable, latency is real, topology changes, and partial failures are unavoidable. The Two Generals Problem proves that perfect coordination over an unreliable network is mathematically impossible. The best you can do is design for tolerable uncertainty. Every primitive you take for granted (transactions, sequential execution, fail-or-succeed semantics) must be deliberately re-engineered in a distributed context. Approach distributed systems with respect for their inherent complexity.
