A structured guide to system design concepts, tradeoffs, and interview-style problem solving.
Chapter 1. Foundations
- What system design means
- Requirements and constraints
- Functional requirements
- Nonfunctional requirements
- Latency, throughput, and availability
- Capacity estimation
- Back-of-the-envelope math
- Traffic patterns
- Read-heavy vs write-heavy systems
- State, storage, and computation
- APIs and contracts
- Data models
- Consistency basics
- Durability basics
- Fault tolerance basics
- Scalability basics
- Observability basics
- Security basics
- Cost as a design constraint
- Simplicity as a design constraint
- Tradeoffs
- Failure modes
- Design review format
- Design documentation
- Common interview patterns
Chapter 2. Networking and Protocols
- IP, TCP, and UDP
- DNS resolution
- HTTP and HTTPS
- HTTP/1.1, HTTP/2, and HTTP/3
- TLS termination
- Reverse proxies
- Load balancers
- Connection pooling
- Keep-alive and timeouts
- Retries and idempotency
- Rate limits
- Backpressure
- WebSockets
- Server-sent events
- gRPC
- REST
- GraphQL
- Message framing
- Serialization formats
- Compression
- Caching headers
- CDN behavior
- Network partitions
- Network debugging
- Protocol selection
Chapter 3. API Design
- Resource modeling
- Endpoint design
- Request and response schemas
- Pagination
- Filtering and sorting
- Partial updates
- Bulk operations
- Idempotency keys
- Error models
- Status codes
- Versioning
- Compatibility
- Authentication
- Authorization
- Rate limiting
- Quotas
- Webhooks
- Long-running operations
- Async APIs
- SDK design
- OpenAPI
- API gateways
- Contract testing
- Documentation
- API evolution
Chapter 4. Data Modeling
- Entities and relationships
- Keys and identifiers
- Natural keys vs surrogate keys
- Normalization
- Denormalization
- Index design
- Constraints
- Foreign keys
- Time fields
- Soft deletes
- Event tables
- Audit logs
- Schema migration
- Backfills
- Data retention
- Multitenancy models
- Ownership boundaries
- Read models
- Write models
- Materialized views
- Data validation
- Data quality
- Metadata
- Lineage
- Modeling tradeoffs
Chapter 5. Storage Systems
- Filesystems
- Block storage
- Object storage
- Relational databases
- Key-value stores
- Document databases
- Wide-column stores
- Time-series databases
- Graph databases
- Search indexes
- Vector databases
- In-memory stores
- Logs as storage
- Columnar formats
- Row-oriented formats
- Compression
- Compaction
- Replication
- Partitioning
- Sharding
- Indexing
- Query planning
- Backup and restore
- Storage cost
- Storage selection
Chapter 6. Relational Database Design
- Table design
- Primary keys
- Secondary indexes
- Composite indexes
- Covering indexes
- Query plans
- Transactions
- Isolation levels
- Locks
- Deadlocks
- MVCC
- Constraints
- Migrations
- Online schema changes
- Read replicas
- Connection pools
- Partitioned tables
- Materialized views
- Stored procedures
- Triggers
- Full-text search
- Analytical queries
- Backup strategy
- Failover strategy
- Operational pitfalls
Chapter 7. Distributed Data
- Replication models
- Leader-follower replication
- Multi-leader replication
- Leaderless replication
- Quorums
- Read repair
- Anti-entropy
- Consistent hashing
- Range partitioning
- Hash partitioning
- Rebalancing
- Hot partitions
- Consensus
- Raft
- Paxos
- ZooKeeper-style coordination
- Split brain
- Fencing tokens
- Clock skew
- Logical clocks
- Vector clocks
- Conflict resolution
- CRDTs
- Distributed transactions
- CAP and PACELC
Chapter 8. Caching
- Cache use cases
- Cache-aside
- Read-through cache
- Write-through cache
- Write-back cache
- Refresh-ahead
- TTLs
- Eviction policies
- Invalidation
- Cache stampede
- Request coalescing
- Negative caching
- CDN caching
- Browser caching
- Edge caching
- Distributed caches
- Local caches
- Cache consistency
- Cache warming
- Hot keys
- Cache sizing
- Redis patterns
- Memcached patterns
- Observability
- Failure modes
Chapter 9. Messaging and Event Systems
- Queues
- Pub/sub
- Logs
- Message brokers
- Kafka-style systems
- RabbitMQ-style systems
- SQS-style systems
- Producers
- Consumers
- Consumer groups
- Ordering
- Delivery guarantees
- At-least-once delivery
- At-most-once delivery
- Exactly-once semantics
- Dead-letter queues
- Retries
- Delayed messages
- Backpressure
- Fanout
- Event schemas
- Schema registry
- Outbox pattern
- Saga pattern
- Event replay
Chapter 10. Search, Ranking, and Retrieval
- Inverted indexes
- Tokenization
- Normalization
- Stemming and lemmatization
- Term frequency
- BM25
- Filters and facets
- Sorting
- Pagination
- Freshness
- Index updates
- Shard layout
- Replication
- Query routing
- Highlighting
- Autocomplete
- Spell correction
- Semantic search
- Vector indexes
- Hybrid retrieval
- Ranking pipelines
- Learning to rank
- Evaluation metrics
- Search observability
- Search failure modes
Chapter 11. Streaming and Realtime Systems
- Streams vs batches
- Event time and processing time
- Windows
- Watermarks
- Late events
- Stream joins
- Aggregations
- Stateful processing
- Checkpointing
- Reprocessing
- Exactly-once processing
- WebSocket fanout
- Presence systems
- Live counters
- Notifications
- Activity feeds
- Realtime collaboration
- Operational transformation
- CRDT collaboration
- Realtime analytics
- Alerting pipelines
- Stream backpressure
- Capacity planning
- Debugging streams
- Realtime tradeoffs
Chapter 12. Batch Data Systems
- Batch processing
- ETL and ELT
- Data lakes
- Warehouses
- Lakehouse patterns
- File formats
- Parquet
- ORC
- Avro
- CSV pitfalls
- Partitioning
- Compaction
- Catalogs
- Metadata stores
- Workflow orchestration
- Scheduling
- Incremental jobs
- Backfills
- Data quality checks
- Deduplication
- Slowly changing dimensions
- Aggregation tables
- Cost control
- Data observability
- Failure recovery
Chapter 13. Scalability Patterns
- Vertical scaling
- Horizontal scaling
- Stateless services
- Stateful services
- Load shedding
- Backpressure
- Queue-based load leveling
- Partitioning by tenant
- Partitioning by geography
- Partitioning by time
- Fanout on write
- Fanout on read
- Precomputation
- Approximation
- Sampling
- Bloom filters
- Count-min sketch
- Rate limiting algorithms
- Circuit breakers
- Bulkheads
- Graceful degradation
- Brownout patterns
- Autoscaling
- Capacity planning
- Scaling checklist
Chapter 14. Reliability and Resilience
- Availability targets
- SLOs and SLIs
- Error budgets
- Redundancy
- Failover
- Disaster recovery
- Backup verification
- Health checks
- Timeouts
- Retries
- Circuit breakers
- Load shedding
- Bulkheads
- Chaos testing
- Incident response
- Postmortems
- Runbooks
- Dependency failure
- Partial outages
- Regional outages
- Data corruption
- Recovery time objective
- Recovery point objective
- Reliability testing
- Resilience tradeoffs
Chapter 15. Observability
- Logs
- Metrics
- Traces
- Events
- Correlation IDs
- Structured logging
- RED metrics
- USE metrics
- High cardinality data
- Sampling
- Dashboards
- Alerts
- Alert fatigue
- SLO dashboards
- Distributed tracing
- Profiling
- Error tracking
- Synthetic checks
- Real user monitoring
- Business metrics
- Data observability
- Cost observability
- Debugging production
- Observability pipelines
- Instrumentation checklist
Chapter 16. Security and Privacy
- Threat modeling
- Authentication
- Authorization
- Sessions
- Tokens
- OAuth
- API keys
- Password storage
- Secrets management
- Encryption in transit
- Encryption at rest
- Key management
- Network isolation
- Input validation
- Injection attacks
- SSRF
- XSS
- CSRF
- Rate limiting for abuse
- Audit logging
- Privacy by design
- Data minimization
- Retention and deletion
- Compliance basics
- Security review checklist
Chapter 17. Deployment and Operations
- Build pipelines
- Release strategies
- Blue-green deployment
- Canary deployment
- Feature flags
- Rollbacks
- Configuration
- Environment separation
- Containers
- Kubernetes
- Serverless
- Service discovery
- Service meshes
- Infrastructure as code
- Secrets in deployment
- Database migrations
- Operational readiness
- Runbooks
- On-call handoff
- Incident drills
- Cost monitoring
- Capacity reviews
- Dependency reviews
- Production checklists
- Operational maturity
Chapter 18. Cost and Capacity Engineering
- Unit economics
- Cost drivers
- Compute pricing
- Storage pricing
- Network egress
- Database cost
- Cache cost
- Queue cost
- Search cost
- Observability cost
- Overprovisioning
- Autoscaling economics
- Reserved capacity
- Spot capacity
- Multi-cloud cost
- Build vs buy
- Cost allocation
- Tenant-level cost
- FinOps basics
- Capacity forecasts
- Load testing
- Benchmarking
- Cost regression tests
- Architecture cost review
- Cost reduction playbook
Chapter 19. Common System Designs
- URL shortener
- Pastebin
- File storage service
- Photo sharing service
- Video streaming service
- News feed
- Chat system
- Notification system
- Rate limiter
- Search engine
- Web crawler
- Payment system
- Booking system
- Ride hailing system
- Food delivery system
- Ecommerce marketplace
- Inventory system
- Logging platform
- Metrics platform
- Feature flag service
- Recommendation system
- Collaborative editor
- CDN
- Analytics platform
- Multi-tenant SaaS
Chapter 20. Design Review and Case Studies
- Reading a design prompt
- Asking clarifying questions
- Defining scope
- Estimating load
- Sketching APIs
- Choosing storage
- Choosing consistency
- Designing data flow
- Identifying bottlenecks
- Identifying failure modes
- Security review
- Privacy review
- Cost review
- Operational review
- Scaling plan
- Migration plan
- Tradeoff table
- Design document template
- Review checklist
- Interview walkthrough
- Production design walkthrough
- Startup-scale case study
- Enterprise-scale case study
- Global-scale case study
- Final system design rubric