Deep Learning with PyTorch

Part I. PyTorch Foundations

Chapter 1. Introduction to Deep Learning and PyTorch

1.1 What Is Deep Learning
1.2 The PyTorch Ecosystem
1.3 Dynamic Computation Graphs
1.4 Tensor-Based Computation
1.5 GPUs and Accelerators
1.6 PyTorch Versus Other Frameworks
1.7 Installing and Configuring PyTorch
1.8 Structure of a PyTorch Project

Chapter 2. Tensors and Tensor Operations

2.1 Creating Tensors
2.2 Tensor Shapes and Dimensions
2.3 Tensor Arithmetic
2.4 Broadcasting Rules
2.5 Indexing and Slicing
2.6 Tensor Reshaping
2.7 Matrix Operations
2.8 Random Tensor Generation
2.9 Tensor Memory Layout
2.10 CPU and GPU Tensors

Chapter 3. Automatic Differentiation

3.1 Computational Graphs
3.2 Gradient Computation
3.3 Reverse-Mode Differentiation
3.4 The requires_grad Mechanism
3.5 Backpropagation with backward()
3.6 Gradient Accumulation
3.7 Disabling Gradient Tracking
3.8 Custom Autograd Functions
3.9 Higher-Order Derivatives

Chapter 4. PyTorch Modules and Model Structure

4.1 The nn.Module Interface
4.2 Parameters and Buffers
4.3 Forward Methods
4.4 Sequential Models
4.5 Custom Layers
4.6 Parameter Initialization
4.7 Saving and Loading Models
4.8 Organizing Large Projects

Chapter 5. Data Loading and Preprocessing

5.1 Datasets and DataLoaders
5.2 Batch Processing
5.3 Data Shuffling
5.4 Parallel Data Loading
5.5 Transform Pipelines
5.6 Tokenization and Text Processing
5.7 Image Augmentation
5.8 Streaming and Large Datasets
5.9 Custom Dataset Classes

Part II. Neural Network Fundamentals

Chapter 6. Linear Models and Optimization

6.1 Linear Regression
6.2 Logistic Regression
6.3 Loss Functions
6.4 Gradient Descent
6.5 Stochastic Gradient Descent
6.6 Momentum and Adaptive Methods
6.7 Learning Rate Scheduling
6.8 Weight Decay and Regularization

Chapter 7. Multilayer Neural Networks

7.1 Feedforward Networks
7.2 Hidden Layers
7.3 Activation Functions
7.4 Universal Approximation
7.5 Deep Representations
7.6 Batch Normalization
7.7 Residual Connections

Chapter 8. Training Neural Networks

8.1 Training Loops
8.2 Validation and Testing
8.3 Metrics and Evaluation
8.4 Overfitting and Underfitting
8.5 Early Stopping
8.6 Dropout
8.7 Gradient Clipping
8.8 Mixed Precision Training

Chapter 9. Experiment Management

9.1 Configuration Systems
9.2 Logging and Visualization
9.3 TensorBoard Integration
9.4 Reproducibility
9.5 Checkpointing
9.6 Hyperparameter Search
9.7 Benchmarking and Profiling

Part III. Computer Vision with PyTorch

Chapter 10. Convolutional Neural Networks

10.1 Convolution Operations
10.2 Pooling Layers
10.3 Feature Maps
10.4 Padding and Stride
10.5 CNN Architectures
10.6 Residual Networks
10.7 Efficient Convolutions

Chapter 11. Image Classification

11.1 Classification Pipelines
11.2 Transfer Learning
11.3 Fine-Tuning Pretrained Models
11.4 Data Augmentation Strategies
11.5 Large-Scale Training
11.6 Calibration and Confidence

Chapter 12. Object Detection and Segmentation

12.1 Bounding Box Prediction
12.2 Region Proposal Methods
12.3 YOLO Architectures
12.4 Semantic Segmentation
12.5 Instance Segmentation
12.6 Vision Foundation Models

Chapter 13. Vision Transformers

13.1 Patch Embeddings
13.2 Self-Attention for Images
13.3 Transformer Encoders
13.4 Hybrid CNN-Transformer Models
13.5 Efficient Vision Transformers
13.6 Multimodal Vision Models

Part IV. Sequence Models and NLP

Chapter 14. Recurrent Neural Networks

14.1 Sequential Data
14.2 Recurrent Computation
14.3 Backpropagation Through Time
14.4 Vanishing Gradients
14.5 LSTM Networks
14.6 GRU Networks
14.7 Sequence Modeling Applications

Chapter 15. Attention and Transformers

15.1 Attention Mechanisms
15.2 Self-Attention
15.3 Multi-Head Attention
15.4 Positional Encoding
15.5 Transformer Encoders
15.6 Transformer Decoders
15.7 Efficient Attention Methods

Chapter 16. Natural Language Processing with PyTorch

16.1 Word Embeddings
16.2 Subword Tokenization
16.3 Text Classification
16.4 Named Entity Recognition
16.5 Machine Translation
16.6 Question Answering
16.7 Conversational Systems

Chapter 17. Large Language Models

17.1 Autoregressive Language Models
17.2 Pretraining Objectives
17.3 Instruction Tuning
17.4 Reinforcement Learning from Human Feedback
17.5 Retrieval-Augmented Generation
17.6 Long-Context Models
17.7 Tool-Using Agents

Part V. Generative Deep Learning

Chapter 18. Autoencoders and Representation Learning

18.1 Dimensionality Reduction
18.2 Sparse Autoencoders
18.3 Denoising Autoencoders
18.4 Variational Autoencoders
18.5 Latent Space Manipulation
18.6 Representation Learning

Chapter 19. Generative Adversarial Networks

19.1 Adversarial Training
19.2 Generator and Discriminator Models
19.3 Conditional GANs
19.4 Style-Based GANs
19.5 GAN Stabilization Techniques
19.6 Evaluation of Generative Models

Chapter 20. Diffusion Models

20.1 Forward Noise Processes
20.2 Reverse Denoising Processes
20.3 Score-Based Models
20.4 U-Net Architectures
20.5 Latent Diffusion
20.6 Text-to-Image Generation
20.7 Video Diffusion Systems

Part VI. Graph and Geometric Learning

Chapter 21. Graph Neural Networks

21.1 Graph Representations
21.2 Message Passing Networks
21.3 Graph Convolutions
21.4 Graph Attention Networks
21.5 Knowledge Graph Embeddings
21.6 PyTorch Geometric

Chapter 22. Geometric Deep Learning

22.1 Symmetry and Equivariance
22.2 Point Cloud Networks
22.3 Neural Fields
22.4 Implicit Representations
22.5 Geometric Transformers

Part VII. Reinforcement Learning

Chapter 23. Foundations of Reinforcement Learning

23.1 Agents and Environments
23.2 Markov Decision Processes
23.3 Value Functions
23.4 Policy Optimization
23.5 Exploration Strategies

Chapter 24. Deep Reinforcement Learning with PyTorch

24.1 Deep Q-Networks
24.2 Policy Gradient Methods
24.3 Actor-Critic Systems
24.4 Model-Based Reinforcement Learning
24.5 Offline Reinforcement Learning
24.6 RL for Language Models

Part VIII. Scaling and Systems

Chapter 25. Efficient Training Systems

25.1 GPU Optimization
25.2 Memory Management
25.3 Gradient Checkpointing
25.4 Quantization
25.5 Distillation
25.6 Low-Rank Adaptation

Chapter 26. Distributed Training

26.1 Data Parallelism
26.2 Distributed Data Parallel
26.3 Model Parallelism
26.4 Pipeline Parallelism
26.5 Fault Tolerance
26.6 Multi-Node Training

Chapter 27. PyTorch Compilation and Performance

27.1 TorchScript
27.2 torch.compile
27.3 Graph Optimization
27.4 Kernel Fusion
27.5 CUDA Extensions
27.6 Profiling Bottlenecks

Chapter 28. Deployment and Inference

28.1 Model Serialization
28.2 ONNX Export
28.3 TorchServe
28.4 Mobile Deployment
28.5 Edge Inference
28.6 High-Throughput Serving
28.7 Real-Time Systems

Part IX. Advanced Topics

Chapter 29. Probabilistic Deep Learning

29.1 Bayesian Neural Networks
29.2 Variational Inference
29.3 Monte Carlo Methods
29.4 Uncertainty Estimation
29.5 Gaussian Processes

Chapter 30. Robustness and Interpretability

30.1 Adversarial Examples
30.2 Distribution Shift
30.3 Saliency Maps
30.4 Attribution Methods
30.5 Mechanistic Interpretability
30.6 Model Editing

Chapter 31. Multimodal and Foundation Models

31.1 Vision-Language Models
31.2 Audio-Visual Learning
31.3 Unified Foundation Models
31.4 Retrieval Systems
31.5 Long-Horizon Agents

Chapter 32. Future Directions

32.1 Scaling Laws
32.2 Efficient AI Systems
32.3 Scientific Deep Learning
32.4 Robotics and Embodied AI
32.5 Open Research Problems

Computational Graphs (11 min): A computational graph is a graph that represents a numerical computation. The nodes represent values or operations. The edges describe how data flows from one operation to the next.
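
As a rough sketch (with made-up values, not an example from the book), a few PyTorch operations already build such a graph: each result records the operation that produced it, and walking those records backward yields gradients.

```python
import torch

# Two leaf tensors and a short chain of operations.
x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)

y = w * x          # multiplication node
z = y + 1.0        # addition node
loss = z ** 2      # squaring node

# Each intermediate tensor remembers the operation that produced it;
# these records are the graph's nodes and edges.
print(loss.grad_fn)                 # e.g. <PowBackward0 ...>
print(loss.grad_fn.next_functions)  # edges back toward y + 1.0

# Walking the graph backward computes gradients for the leaves.
loss.backward()
print(x.grad, w.grad)
```
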
Datasets and DataLoaders (10 min): A deep learning model does not train directly from files. It trains from tensors. The purpose of a data pipeline is to convert stored data into batches of tensors with consistent shapes, data types, and labels.
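
A minimal sketch of that pipeline, assuming small in-memory data and the hypothetical class name ToyDataset: a Dataset returns one (input, label) pair per index, and a DataLoader assembles batches.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Wraps in-memory features and labels as (tensor, tensor) pairs."""
    def __init__(self, features, labels):
        self.features = torch.as_tensor(features, dtype=torch.float32)
        self.labels = torch.as_tensor(labels, dtype=torch.long)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

dataset = ToyDataset([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]],
                     [0, 1, 0, 1])
loader = DataLoader(dataset, batch_size=2, shuffle=True)

for xb, yb in loader:
    print(xb.shape, yb.shape)   # torch.Size([2, 2]) torch.Size([2])
```
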
Linear Regression (8 min): Linear regression is the simplest supervised learning model used in deep learning.
Mean Squared Error (10 min): Mean squared error is one of the simplest and most widely used loss functions in supervised learning. It measures the average squared difference between a model’s prediction and the target value.
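
That definition is short enough to verify directly. The values below are invented for illustration; the built-in F.mse_loss gives the same result as the hand-written average.

```python
import torch
import torch.nn.functional as F

pred   = torch.tensor([2.5, 0.0, 2.0])
target = torch.tensor([3.0, -0.5, 2.0])

# Mean squared error computed directly from its definition ...
mse_manual = ((pred - target) ** 2).mean()

# ... and with the built-in functional form.
mse_builtin = F.mse_loss(pred, target)

print(mse_manual.item(), mse_builtin.item())  # both approximately 0.1667
```
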
Scalars, Vectors, Matrices, and Tensors (10 min): Deep learning represents data and computation using arrays of numbers.
Sigmoid and Hyperbolic Tangent (8 min): Activation functions give neural networks their nonlinear structure.
Supervised Learning (7 min): Supervised learning is the central paradigm of modern machine learning and deep learning.
What Is Deep Learning (9 min): Deep learning is a branch of machine learning that studies models built from many layers of learned computation.
Cross-Entropy Loss (10 min): Cross-entropy loss is the standard loss function for classification. It measures how well a model’s predicted class distribution matches the true class label.
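
A small illustration with invented logits and labels: F.cross_entropy works on raw class scores, and the manual log-softmax computation below matches it.

```python
import torch
import torch.nn.functional as F

# Raw class scores (logits) for a batch of 2 examples and 3 classes.
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 1.5,  0.3]])
targets = torch.tensor([0, 1])  # true class indices

# cross_entropy applies log-softmax and negative log-likelihood in one step.
loss = F.cross_entropy(logits, targets)

# Equivalent manual computation from the definition.
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(2), targets].mean()

print(loss.item(), manual.item())  # the two values agree
```
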
Gradient Computation (7 min): Gradient computation is the process of measuring how a scalar output changes when its input values change. In deep learning, the scalar output is usually the loss, and the inputs are usually the model parameters.
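
A minimal sketch of that idea in PyTorch, with arbitrary example values: mark the parameters with requires_grad, compute a scalar, and call backward() to populate the gradients.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)  # "parameters"
x = torch.tensor([0.5, 3.0])                        # fixed input

loss = (w * x).sum() ** 2     # scalar output
loss.backward()               # fills w.grad with d(loss)/d(w)

print(w.grad)   # gradient of the loss with respect to each entry of w
```
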
Logistic Regression (7 min): Linear regression predicts a real number. Logistic regression predicts a probability for binary classification.
Logistic Regression (6 min): Logistic regression is a linear model for classification. It predicts a probability instead of a raw numerical value. Despite its name, logistic regression is mainly used for classification, not regression.
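
One way to express this in PyTorch (a sketch with invented sizes, not the book's code): a single nn.Linear layer whose output is passed through a sigmoid, trained with a binary cross-entropy loss.

```python
import torch
import torch.nn as nn

# Logistic regression: one linear layer followed by a sigmoid.
model = nn.Linear(in_features=4, out_features=1)

x = torch.randn(8, 4)                      # batch of 8 examples
prob = torch.sigmoid(model(x))             # probabilities in (0, 1)
pred = (prob > 0.5).long()                 # hard class decisions

# For training, BCEWithLogitsLoss fuses the sigmoid and the
# binary cross-entropy in a numerically stable way.
criterion = nn.BCEWithLogitsLoss()
y = torch.randint(0, 2, (8, 1)).float()
loss = criterion(model(x), y)
```
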
Rectified Linear Units (7 min): The rectified linear unit, usually called ReLU, is the most widely used activation function in modern deep learning.
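
ReLU simply replaces negative inputs with zero, which is easy to confirm on a few hand-picked values:

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0])

print(torch.relu(x))          # tensor([0.0, 0.0, 0.0, 1.5, 3.0])
print(torch.clamp(x, min=0))  # same result, from the definition max(0, x)

relu_layer = nn.ReLU()        # module form, for use inside nn.Sequential
print(relu_layer(x))
```
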
Tensor Shapes, Dimensions, and Memory Layout (7 min): Deep learning systems manipulate tensors with millions or billions of numerical entries.
The PyTorch Ecosystem (7 min): PyTorch is a deep learning platform built around tensors, automatic differentiation, and composable neural network modules.
Unsupervised Learning (6 min): Unsupervised learning studies data without explicit target labels. The dataset contains inputs only, with no accompanying targets.
Dynamic Computation Graphs (8 min): Deep learning models are built from sequences of mathematical operations.
Leaky and Parametric ReLU (7 min): ReLU is simple and effective, but it has one sharp weakness.
Likelihood-Based Objectives (8 min): Many deep learning loss functions can be understood as likelihood maximization.
Loss Functions (6 min): A loss function measures how wrong a model’s predictions are.
Part III of Deep Learning with PyTorch (28 pages · 224 min).
Reverse-Mode Differentiation (9 min): Reverse-mode differentiation is the method used by backpropagation. It computes derivatives by first evaluating a function forward, then propagating gradient information backward from the output to the inputs.
Self-Supervised Learning (7 min): Self-supervised learning is a form of learning where the training signal is created from the data itself. The dataset does not need human-written labels, but the model still receives a prediction task.
Softmax Regression (7 min): Softmax regression extends logistic regression from two classes to many classes. It is the standard linear model for multiclass classification.
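
A sketch of the model in PyTorch, with arbitrary feature and class counts: a single linear layer produces one score per class, and softmax turns those scores into a probability distribution.

```python
import torch
import torch.nn as nn

# Softmax regression: one linear layer mapping features to class scores.
num_features, num_classes = 10, 4
model = nn.Linear(num_features, num_classes)

x = torch.randn(5, num_features)           # batch of 5 examples
logits = model(x)                          # shape (5, 4)
probs = torch.softmax(logits, dim=1)       # each row sums to 1
pred = probs.argmax(dim=1)                 # predicted class per example

# During training, nn.CrossEntropyLoss is applied to the raw logits;
# the softmax is folded into the loss for numerical stability.
```
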
Tensor Creation and Initialization (7 min): Neural networks start with tensors. Some tensors come from data.
ELU, GELU, and Swish (7 min): ReLU and its variants improved optimization in deep networks, but they still have limitations.
Gradient Descent (6 min): Gradient descent is the basic optimization method used to train neural networks. It updates model parameters in the direction that reduces the loss.
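
The update rule is w ← w − lr · ∇L(w). A hand-written single step on a toy quadratic loss might look like the following (the learning rate and starting values are illustrative):

```python
import torch

# One gradient descent step on a tiny quadratic loss.
w = torch.tensor([1.0, -3.0], requires_grad=True)
lr = 0.1

loss = (w ** 2).sum()        # minimized at w = 0
loss.backward()              # compute d(loss)/d(w)

with torch.no_grad():        # the update itself is not part of the graph
    w -= lr * w.grad         # w <- w - lr * gradient
w.grad.zero_()               # clear the gradient before the next step
```
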
Margin-Based Losses (9 min): Margin-based losses are used when the goal is not only to make the correct prediction, but to make it by a sufficient margin.
Part IV of Deep Learning with PyTorch (28 pages · 238 min).
Reinforcement Learning Overview (6 min): Reinforcement learning studies how an agent learns to act through interaction with an environment.
Tensor Arithmetic and Broadcasting (8 min): Tensor arithmetic is the basic computation layer of PyTorch.
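
Broadcasting lets tensors of different shapes be combined by implicitly expanding size-1 (or missing) dimensions. A couple of shape checks on made-up tensors show the idea:

```python
import torch

a = torch.ones(4, 3)                 # shape (4, 3)
b = torch.tensor([10., 20., 30.])    # shape (3,)

# b is broadcast across the first dimension of a: result shape (4, 3).
print((a + b).shape)

col = torch.tensor([[1.], [2.], [3.], [4.]])   # shape (4, 1)
# (4, 1) and (3,) broadcast together to (4, 3).
print((col * b).shape)
```
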
Tensor-Based Computation (10 min): PyTorch programs are tensor programs. A tensor stores numbers in a structured array, and most model computation is expressed as operations over tensors.
The Chain Rule (8 min): The chain rule is the mathematical rule that makes backpropagation possible. Neural networks are built by composing many functions. The chain rule tells us how to differentiate such compositions.
The Perceptron Algorithm (7 min): The perceptron is one of the earliest algorithms for binary classification. It learns a linear decision boundary by updating its weights whenever it makes a mistake.
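
A compact sketch of that update rule, assuming labels in {-1, +1} and a tiny hand-made dataset; on a mistake the weights move toward or away from the misclassified point.

```python
import torch

def perceptron_train(X, y, epochs=10):
    """Classic perceptron update: w <- w + y_i * x_i on each mistake.

    X: (n, d) float tensor, y: (n,) tensor with labels in {-1, +1}.
    """
    n, d = X.shape
    w = torch.zeros(d)
    b = torch.zeros(1)
    for _ in range(epochs):
        for i in range(n):
            if y[i] * (X[i] @ w + b) <= 0:   # mistake (or on the boundary)
                w += y[i] * X[i]
                b += y[i]
    return w, b

# A linearly separable toy problem.
X = torch.tensor([[2., 1.], [1., 3.], [-1., -2.], [-2., -1.]])
y = torch.tensor([1., 1., -1., -1.])
w, b = perceptron_train(X, y)
```
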
Contrastive Objectives (10 min): Contrastive objectives train a model by comparing examples. Instead of learning only from an input and its target, the model learns which examples should be close together and which examples should be far apart.
GPUs and Accelerators (10 min): Deep learning became practical at scale because neural network computation maps well to parallel hardware.
Indexing, Slicing, and Tensor Views (8 min): Indexing and slicing select parts of a tensor. These operations are used constantly in PyTorch: selecting batches, cropping images, extracting token positions, applying masks, gathering logits, and rearranging model outputs.
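
A few representative operations on a small made-up tensor; note that basic slices are views, so writing through them modifies the original storage:

```python
import torch

x = torch.arange(24).reshape(4, 6)   # a 4-by-6 tensor of integers

row = x[1]              # second row, shape (6,)
block = x[:2, 3:]       # first two rows, last three columns, shape (2, 3)
masked = x[x % 2 == 0]  # boolean mask selects the even entries

# Slices are views: writing through the slice changes the original tensor.
block[:] = 0
print(x)
```
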
Jacobians and Hessians (9 min): Gradients are enough for most neural network training. A gradient tells us how a scalar loss changes with respect to parameters.
Linear Separability (7 min): Linear separability describes when a classification dataset can be divided perfectly by a linear decision boundary. It is one of the central geometric ideas behind linear classification.
Part V of Deep Learning with PyTorch (19 pages · 155 min).
Softmax and Output Activations (8 min): Many neural networks produce raw scores. These scores are called logits.
Stochastic Gradient Descent (8 min): Stochastic gradient descent, usually abbreviated as SGD, is the standard form of gradient-based training used in deep learning.
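
In PyTorch this is usually expressed through torch.optim.SGD. The loop below is a schematic sketch on a single synthetic batch rather than a real minibatch stream:

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.MSELoss()

x = torch.randn(16, 3)
y = torch.randn(16, 1)

for step in range(100):
    optimizer.zero_grad()          # clear gradients from the previous step
    loss = criterion(model(x), y)  # forward pass on the current batch
    loss.backward()                # compute fresh gradients
    optimizer.step()               # apply the SGD update
```
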
Training, Validation, and Test Sets (9 min): A machine learning dataset is usually divided into three parts: a training set, a validation set, and a test set.
Backpropagation (8 min): Backpropagation is the algorithm used to compute gradients in neural networks efficiently.
Bias and Variance (8 min): Bias and variance describe two different sources of prediction error. They are useful because they separate errors caused by an overly simple model from errors caused by an overly sensitive model.
Limits of Linear Decision Boundaries (7 min): A linear classifier separates classes using a hyperplane. In two dimensions this boundary is a line. In three dimensions it is a plane. In higher dimensions it is a hyperplane.
Matrix Operations (9 min): Matrix operations are the main arithmetic language of deep learning.
Momentum and Adaptive Methods (6 min): Stochastic gradient descent uses the current minibatch gradient to update the parameters.
Multi-Task Objectives (10 min): Multi-task learning trains one model on several objectives at the same time.
Part VI of Deep Learning with PyTorch (15 pages · 133 min).
PyTorch Versus Other Frameworks (8 min): PyTorch is one of several major frameworks for deep learning.
Saturation and Gradient Flow (9 min): Activation functions control both the forward signal and the backward signal.
Gradient Flow in Deep Networks (7 min): Gradient flow describes how derivative information moves backward through a neural network during training.
Installing and Configuring PyTorch (6 min): A PyTorch installation must match three things: the Python environment, the operating system, and the available hardware.
Learning Rate Scheduling (7 min): The learning rate controls the size of each parameter update.
Limits of Linear Models (8 min): Linear models are the first useful class of predictive models in deep learning.
Overfitting and Underfitting (8 min): Overfitting and underfitting describe two common ways a model can fail.
Part VII of Deep Learning with PyTorch (15 pages · 143 min).
Practical Activation Selection (5 min): Activation functions should be chosen for the architecture, loss, initialization, normalization, and training scale. There is no universal best activation. The right choice depends on what the layer must do.
Random Tensor Generation (7 min): Random tensors are used throughout deep learning. They initialize parameters, shuffle examples, sample noise, apply dropout, augment data, and generate outputs from probabilistic models.
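
A few of the common generators, with a fixed seed so the draws are reproducible (the shapes and ranges here are arbitrary):

```python
import torch

torch.manual_seed(42)                 # make the random draws reproducible

noise = torch.randn(2, 3)             # standard normal samples
uniform = torch.rand(2, 3)            # uniform samples in [0, 1)
labels = torch.randint(0, 10, (5,))   # random integer class labels
perm = torch.randperm(8)              # a random permutation, e.g. for shuffling
```
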
Self-Supervised Objectives (9 min): Self-supervised learning trains a model using supervision constructed from the data itself. Instead of requiring human labels, the training task is derived from structure already present in the input.
Automatic Differentiation Engines (9 min): An automatic differentiation engine is the system that records numerical operations and computes derivatives from them.
Choosing and Combining Loss Functions (7 min): A loss function defines what the model is trained to improve. It translates a modeling goal into a scalar value that can be minimized by gradient-based optimization.
Evaluation Metrics (9 min): Evaluation metrics convert model behavior into numbers. A loss function guides training. A metric reports performance. Sometimes they are the same. Often they are different.
Part VIII of Deep Learning with PyTorch (28 pages · 244 min).
Structure of a PyTorch Project (8 min): A PyTorch project should separate concerns. Model code should define computation.
Tensor Data Types and Devices (9 min): A tensor has values, shape, data type, and device placement.
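
A short sketch of inspecting and changing those properties; the device string is chosen at run time depending on whether CUDA is available:

```python
import torch

x = torch.zeros(2, 3)                 # float32 on the CPU by default
print(x.dtype, x.device)

x16 = x.to(torch.float16)             # change the data type
idx = torch.tensor([0, 2], dtype=torch.long)   # integer tensor for indexing

device = "cuda" if torch.cuda.is_available() else "cpu"
x_dev = x.to(device)                  # move the tensor to the chosen device
```
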
Weight Decay and Regularization (6 min): Training loss measures how well a model fits the training data.
Data Leakage and Experimental Design (7 min): Data leakage occurs when information that should be unavailable during training or evaluation enters the modeling process. It causes performance estimates to look better than they really are.
Part IX of Deep Learning with PyTorch (26 pages · 210 min).
Symbolic Versus Dynamic Computation (10 min): Deep learning frameworks need a way to represent computation.
Tensor Memory Layout and Performance (9 min): A tensor has a logical shape and a physical memory layout. The shape tells us how to interpret the tensor as an array. The memory layout tells us how the entries are stored in memory.
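
Strides make the distinction concrete: a transpose is just a view with swapped strides, and .contiguous() copies the data into a fresh row-major layout (the values below are illustrative):

```python
import torch

x = torch.arange(12).reshape(3, 4)
print(x.stride())            # (4, 1): row-major layout

t = x.t()                    # transpose is a view with swapped strides
print(t.stride(), t.is_contiguous())   # (1, 4) False

t_c = t.contiguous()         # copies data into a fresh row-major layout
print(t_c.is_contiguous())   # True
```
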
CPU and GPU Tensors (9 min): PyTorch tensors live on devices. A device is the hardware location where tensor storage exists and where tensor operations execute.