Scalability

What it actually means to scale a system, and why performance and scalability are so often confused.

Performance vs Scalability

These two terms get thrown around interchangeably, but they describe fundamentally different problems.

A system has a performance problem when it's slow for a single user. A system has a scalability problem when it's fast for a single user but falls apart under load. You can have one without the other.

The simplest test: If adding more resources to your system (machines, CPU, memory) improves throughput proportionally, your system scales. If doubling your servers doesn't roughly double your capacity, you have a scalability bottleneck — and throwing hardware at it won't save you.
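That test can be run as arithmetic. A minimal sketch, using hypothetical throughput measurements (the numbers are illustrative, not from any real system), computes scaling efficiency as each doubling of servers is compared against proportional growth:

```python
# Back-of-envelope scaling check. Hypothetical measurements: throughput
# (req/s) observed while doubling server count. Numbers are illustrative.
measurements = {1: 1000, 2: 1900, 4: 3400, 8: 4800}

baseline = measurements[1]
for servers, throughput in measurements.items():
    # Perfect linear scaling would give baseline * servers req/s.
    efficiency = throughput / (baseline * servers)
    print(f"{servers} servers: {throughput} req/s "
          f"(scaling efficiency {efficiency:.0%})")
```

Falling efficiency as servers are added (here 95% → 85% → 60%) is the signature of a scalability bottleneck: some shared resource or serial step isn't splitting with the work.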

Performance optimization makes individual operations faster. Scalability engineering makes the system handle more operations. The techniques often overlap, but the goals are distinct. A hand-tuned SQL query is a performance win. Sharding your database across nodes is a scalability play.
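To make the sharding side of that distinction concrete, here is a minimal sketch of hash-based shard routing (the key format and 4-shard cluster are assumptions for illustration):

```python
import hashlib

NUM_SHARDS = 4  # assumption: a fixed 4-node cluster for illustration

def shard_for(key: str) -> int:
    """Route a key to a shard using a stable hash (not Python's
    per-process-randomized hash()), so routing survives restarts."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

print(shard_for("user:42"))  # same key always lands on the same shard
```

Note the trade-off baked into the modulo: changing `NUM_SHARDS` remaps most keys, which is why production systems often reach for consistent hashing instead.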

Latency vs Throughput

Latency is how long a single request takes — the time between "I asked" and "I got an answer." Throughput is how many requests the system handles per unit of time.

[Diagram: a request goes in, and 120 ms later a response comes out — latency is the time for one round trip. 1,000 requests handled per 1 second — throughput is total operations per second.]

Latency measures speed. Throughput measures capacity.

In most systems, you're aiming for maximum throughput with acceptable latency. These are often at odds. Batching requests improves throughput (you process more per cycle) but increases latency for individual requests (each one waits for the batch).

Where it gets tricky

Under low load, latency and throughput seem independent. Under high load, they become coupled. When throughput approaches capacity, latency spikes — requests start queuing, and each one waits longer. This is why load testing matters: your system's latency profile at 10% utilization tells you almost nothing about how it behaves at 80%.

Vertical vs Horizontal Scaling

There are two fundamental approaches to adding capacity:

Vertical Scaling (Scale Up)

Add more power to an existing machine — more CPU, RAM, faster disks.

[Diagram: one machine upgraded from 16 CPU / 64GB to 64 CPU / 256GB.]
Horizontal Scaling (Scale Out)

Add more machines and distribute work across them.

[Diagram: one server becomes three (S1, S2, S3), with work distributed across them.]

Vertical scaling is simpler. Horizontal scaling is more resilient.

Vertical scaling is appealing because your code doesn't change — you just run it on a bigger box. But there's a ceiling. Hardware has physical limits, single machines are single points of failure, and the cost curve isn't linear (a 2x bigger machine costs more than 2x the price).

Horizontal scaling is harder up front. Your application needs to handle distribution: load balancing, data partitioning, consensus. But it offers near-unlimited growth and better fault tolerance. Most serious production systems end up here.
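As a flavor of that distribution work, here is a minimal round-robin load balancer sketch (the server names are hypothetical; real balancers add health checks, weights, and retries):

```python
import itertools

class RoundRobin:
    """Distribute requests evenly across a fixed server pool.
    A deliberately minimal sketch: no health checks or weighting."""
    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self) -> str:
        """Return the next server in rotation."""
        return next(self._cycle)

lb = RoundRobin(["s1", "s2", "s3"])  # hypothetical server names
print([lb.pick() for _ in range(5)])  # → ['s1', 's2', 's3', 's1', 's2']
```

Even this toy version surfaces the real questions: what happens when `s2` dies, and which server owns a given user's data? Those are the load-balancing and partitioning problems named above.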

Real-world pattern: Most teams start vertical ("just get a bigger RDS instance") until they hit a wall, then invest in horizontal scaling. This is usually the right call — premature distribution adds complexity you don't need yet. Scale vertically until you can't, then go horizontal where it matters most.

Where Bottlenecks Hide

Systems don't scale uniformly. You'll find constraints in specific layers:

  • CPU-bound: Compute-heavy work like image processing, encryption, or ML inference. More cores or faster CPUs help.
  • I/O-bound: Waiting on disk, network, or database calls. Async processing, caching, and connection pooling help.
  • Memory-bound: Working set exceeds available RAM, causing cache misses or swap. More memory or data partitioning helps.
  • Network-bound: Bandwidth or latency between services. CDNs, compression, and co-location help.

The first job in any scalability effort is identifying which bottleneck you're actually hitting. Optimizing CPU when you're I/O-bound is wasted work.
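One quick diagnostic, sketched below under the assumption that you can run the suspect work in isolation: compare wall-clock time to CPU time. A CPU-bound task keeps the CPU busy for most of its wall time; an I/O-bound task mostly waits.

```python
import time

def classify(fn) -> str:
    """Run fn once and compare wall-clock time to CPU time.
    The 0.8 threshold is an assumed heuristic, not a standard."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    fn()
    wall = time.perf_counter() - wall0
    cpu = time.process_time() - cpu0
    kind = "CPU-bound" if cpu / wall > 0.8 else "likely I/O- or wait-bound"
    print(f"wall={wall:.2f}s cpu={cpu:.2f}s -> {kind}")
    return kind

classify(lambda: sum(i * i for i in range(2_000_000)))  # busy loop
classify(lambda: time.sleep(0.3))                       # mostly waiting
```

This is a blunt instrument — a real investigation reaches for a profiler — but it cheaply separates "the CPU is the problem" from "waiting is the problem," which is the first fork in the road.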

| Approach | Best for | Watch out for |
| --- | --- | --- |
| Vertical scaling | Quick wins, simple architectures | Hardware limits, single point of failure |
| Horizontal scaling | High availability, elastic load | Distributed complexity, data consistency |
| Caching | Read-heavy workloads | Cache invalidation, stale data |
| Async processing | Decoupling, spiky workloads | Eventual consistency, debugging difficulty |