Databases — System Design Primer

Relational Databases (RDBMS)

Relational databases organize data in tables with rows and columns, linked by foreign keys. They've been the default for decades because they offer something powerful: ACID guarantees.

ACID Properties

Property	What it means	Why it matters
Atomicity	A transaction either fully completes or fully rolls back	No partial writes — money doesn't vanish mid-transfer
Consistency	Every transaction moves the DB from one valid state to another	Constraints (unique keys, foreign keys) are always enforced
Isolation	Concurrent transactions don't interfere with each other	Two users updating the same row don't corrupt each other's writes
Durability	Once committed, data survives crashes	Power failure doesn't lose your committed transactions

When to use RDBMS: Structured data with clear relationships, complex queries involving JOINs, transactions that require strong consistency. Think e-commerce orders, user accounts, financial records.

NoSQL Databases

NoSQL isn't a single technology — it's a category of databases that trade some relational features for specific strengths. The four main types solve different problems:

Key-Value Stores

Simple get/set by key. Blazing fast, limited query capability.

Use for: Caching, session storage, feature flags

Examples: Redis, Memcached, DynamoDB

Document Stores

Store semi-structured documents (JSON/BSON). Flexible schema, nested data.

Use for: Content management, user profiles, catalogs

Examples: MongoDB, CouchDB, Firestore

Wide Column Stores

Columns grouped into families. Optimized for queries over large datasets.

Use for: Time series, IoT data, analytics

Examples: Cassandra, HBase, Bigtable

Graph Databases

Data as nodes and edges. Optimized for traversing relationships.

Use for: Social graphs, recommendation engines, fraud detection

Examples: Neo4j, Amazon Neptune, Dgraph

BASE Properties

While RDBMS targets ACID, many NoSQL systems follow BASE:

Basically Available: The system guarantees some level of availability, even during failures.
Soft state: The system's state may change over time, even without new input (due to background replication).
Eventually consistent: Given enough time without new writes, all replicas converge.

SQL vs NoSQL is not a religious war. Most production systems use both. Your user accounts table? Probably belongs in Postgres. Your user activity feed? Might be better in DynamoDB. Your social connections? A graph database shines there. The right answer is usually "both, for different things."

Scaling Patterns

Replication

Leader-Follower

Leader (reads + writes)

replicates ↓

Follower (reads)

Scales reads. Leader is the write bottleneck.

Multi-Leader

Leader A

⇄

Leader B

both accept writes

Scales writes. Needs conflict resolution.

Federation (Functional Partitioning)

Federation splits your monolithic database into multiple databases by function. Instead of one giant database handling users, products, orders, and analytics, you give each domain its own database.

Single monolithic DB

↓ split by function ↓

Users DB

Products DB

Orders DB

Each domain gets its own database, reducing load and enabling independent scaling.

Upside: Smaller databases, less replication lag, independent scaling. Downside: Cross-domain JOINs require application-level logic, and you need more infrastructure.

Sharding (Horizontal Partitioning)

Sharding distributes rows of a single table across multiple databases based on a shard key. If you shard users by user_id % 4, users 1, 5, 9 go to shard 0; users 2, 6, 10 go to shard 1; and so on.

Users table (100M rows)

↓ shard by user_id ↓

Shard 0 (25M)

Shard 1 (25M)

Shard 2 (25M)

Shard 3 (25M)

Each shard holds a subset of the data. Queries target only the relevant shard.

Choosing a shard key is critical. A bad shard key creates "hot shards" — one shard gets disproportionate traffic while others idle. Good shard keys distribute data and queries evenly. User ID is often good. Created date is often bad (recent data gets all the traffic). Celebrity user IDs are a special kind of hot shard problem.

Denormalization

Normalization removes data duplication. Denormalization adds it back intentionally. Why? Because JOINs are expensive at scale.

Instead of joining users and orders tables on every page load, you store the user's name directly in the orders table. The data is duplicated, but reads are faster because they hit a single table.

Trade-off: Faster reads, slower and more complex writes (you need to update the duplicated data everywhere). Use denormalization when read performance is critical and your data doesn't change frequently.

SQL Tuning Essentials

Indexes: An index on a column is like a book's index — it lets the DB find rows without scanning the entire table. But indexes slow down writes and consume storage. Index columns you query by, not everything.
Query analysis: Use EXPLAIN ANALYZE to see how your database executes a query. Look for sequential scans on large tables — that's your DB reading every single row.
Connection pooling: Opening a DB connection is expensive. Use a pool (like PgBouncer) that maintains pre-opened connections and lends them to your application.
Avoid N+1 queries: If you fetch 100 users and then make 100 queries to get each user's orders, that's 101 queries. Use JOINs or batch fetching instead.