~/home/blog/scale-to-millions
System DesignBackendArchitecture·May 14, 2025·15 min read

Scale From Zero To Millions of Users

Designing a system that supports millions of users is challenging — a journey of continuous refinement. Here we build a system starting from a single server and gradually scale it to serve millions, learning key techniques along the way.

Single Server Setup

A journey of a thousand miles begins with a single step. Everything starts on one server — the web app, database, and cache all running together.

🌐BrowserSINGLE SERVERWeb AppDatabaseCacheDNSapi.mysite.comIP: 15.125.23.214
Figure 1 — single server setup

Users access the site via a domain name. DNS resolves it to an IP address, the browser sends an HTTP request, and the server returns HTML or JSON. Traffic comes from two sources: a web app (server-side logic + client HTML/JS) and a mobile app (communicates via HTTP + JSON API).

Database

As the user base grows, we separate the web tier from the database tier so each can scale independently.

Relational vs Non-Relational

Relational databases (MySQL, PostgreSQL) store data in tables and support SQL joins — the default for 40+ years. NoSQL databases (Cassandra, DynamoDB) trade joins for massive scale, low latency, and flexible schemas.

💡When to choose NoSQL

  • Your app requires super-low latency
  • Data is unstructured or has no relational shape
  • You only need to serialize/deserialize (JSON, YAML)
  • You need to store a massive amount of data

Vertical vs Horizontal Scaling

Vertical scaling (scale up) adds more CPU/RAM to an existing server — simple, but has a hard ceiling and no failover. Horizontal scaling (scale out) adds more servers — more complex, but far more suitable for large-scale systems.

Load Balancer

A load balancer evenly distributes incoming traffic across web servers. Users connect to the load balancer's public IP; servers communicate via private IPs, making them unreachable directly from the internet.

👥UsersLoadBalancerPublic IPServer 1Private IPServer 2Private IPFailover + horizontal scaling enabled
Figure 4 — load balancer with two web servers

What this solves

If Server 1 goes offline, all traffic routes to Server 2 — no downtime. When traffic spikes, just add more servers and the load balancer handles the rest automatically.

Database Replication

The master/slave model: a master database handles all writes; one or more slave databases replicate from master and serve reads. Since most apps read far more than they write, this dramatically improves throughput.

Web ServerswritesreadsMaster DBwrites onlySlave DB 1reads onlySlave DB 2reads onlyBetter performance · Reliability · High availability
Figure 5 — master DB with multiple slave DBs

Benefits: better performance (parallel reads), reliability (data is replicated across locations), and high availability (if one DB goes offline, others serve traffic).

Cache & CDN

Cache

A cache stores expensive responses in memory so subsequent requests are served faster. The read-through pattern: check cache first — if hit, serve it; if miss, query the DB, store in cache, then serve.

⚠️Cache considerations

  • Use cache for data read often but modified rarely
  • Set sensible expiry — too short = DB overload, too long = stale data
  • Run multiple cache servers to avoid a single point of failure
  • LRU is the most common eviction policy

Content Delivery Network (CDN)

A CDN is a network of geographically dispersed servers that cache and deliver static content (JS, CSS, images, video). When a user requests content, the edge server closest to them responds — dramatically reducing latency.

👤UserCDNstatic assetsWeb ServerCacheRedis / MemcachedDatabaseMaster + Slavesedge servercache miss → query DB
Figure 7 — cache + CDN architecture

Stateless Web Tier

To scale horizontally, session state must leave the web servers. A stateful server remembers client data between requests, forcing sticky sessions. A stateless server reads session data from a shared store (Redis, NoSQL) that any server can access.

👥Any UserLoadBalancerServer 1Server 2SharedState StoreRedis / NoSQLAny server can handle any request — auto-scaling is easy
Figure 13 — stateless architecture

With stateless servers, auto-scaling becomes trivial: spin servers up or down based on traffic load without worrying about session affinity.

Data Centers

To serve a global audience with low latency, run multiple geographically distributed data centers. GeoDNS routes users to the nearest data center. In a failure, 100% of traffic redirects to the healthy data center automatically.

🌐Key challenges

  • Traffic redirection — GeoDNS routes to nearest healthy center
  • Data synchronization — async multi-region replication
  • Consistent deployment — automated pipelines across all centers

Message Queue

A message queue enables asynchronous communication between services. Producers publish messages to the queue; consumers pull and process them independently. This decoupling makes the system resilient and independently scalable.

ProducerWeb ServerMessage QueueConsumer 1Consumer 2Producer and consumer scale independently
Figure 17 — producer → queue → consumer

Example: a photo-editing app publishes processing jobs to the queue. Worker nodes pick them up asynchronously. When the queue grows long, scale up workers; when it empties, scale them down.

Database Scaling

Vertical Scaling

Add more CPU/RAM/disk to the existing database. AWS RDS supports up to 24 TB RAM. But hardware limits exist, it creates a single point of failure, and powerful servers are expensive.

Horizontal Scaling — Sharding

Sharding splits the database into smaller shards. Each shard holds the same schema but a unique subset of data. A hash function (e.g. user_id % 4) routes queries to the correct shard.

Applicationuser_id % 4Shard 0user_ids: 0, 4, 8, 12Shard 1user_ids: 1, 5, 9, 13Shard 2user_ids: 2, 6, 10, 14Shard 3user_ids: 3, 7, 11, 15← hash functionConsistent sharding key = even data distribution
Figure 21 — user data sharded across 4 nodes

⚠️Sharding challenges

  • Resharding — needed when a shard is exhausted; consistent hashing helps
  • Hotspot problem — a single shard may receive disproportionate traffic
  • Join complexity — cross-shard joins require denormalization

Summary

Scaling is an iterative process. Here is the full playbook for scaling from a single server to millions of users:

  • Keep web tier stateless

    Any server handles any request

  • Build redundancy at every tier

    No single point of failure

  • Cache data aggressively

    Reduce DB load and latency

  • Support multiple data centers

    Geo-routing and failover

  • Host static assets in CDN

    Serve from the edge, not origin

  • Scale data tier by sharding

    Distribute load across nodes

  • Use message queues

    Decouple services asynchronously

  • Monitor and automate

    Observability and CI/CD are non-negotiable

Based on concepts from System Design Interview by Alex Xu. Part of my ongoing series on backend architecture and distributed systems.

Views

If this post helped, tap the heart after reading.