Deep Learning with PyTorch
1.1 What Is Deep Learning
1.2 The PyTorch Ecosystem
1.3 Dynamic Computation Graphs
1.4 Tensor-Based Computation
1.5 GPUs and Accelerators
1.6 PyTorch Versus Other Frameworks
1.7 Installing and Configuring PyTorch
1.8
PyTorch tensors live on devices. A device is the hardware location where tensor storage exists and where tensor operations execute.
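For example, a tensor can be created on the CPU and moved to an accelerator when one is available. A minimal sketch:

```python
import torch

# Choose an accelerator if one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(3, 4)   # created on the CPU by default
x = x.to(device)        # copy the tensor's storage to the chosen device
print(x.device)         # e.g. cuda:0, or cpu on a machine without a GPU
```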
A tensor has a logical shape and a physical memory layout. The shape tells us how to interpret the tensor as an array. The memory layout tells us how the entries are stored in memory.
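The distinction is visible through `shape` and `stride()`: transposing a tensor changes how the same memory is read rather than moving any entries.

```python
import torch

a = torch.arange(6).reshape(2, 3)
print(a.shape, a.stride())   # torch.Size([2, 3]) (3, 1)

b = a.t()                    # transpose: same storage, different view
print(b.shape, b.stride())   # torch.Size([3, 2]) (1, 3)
print(b.is_contiguous())     # False: logical order no longer matches memory
```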
Deep learning frameworks need a way to represent computation.
Data leakage occurs when information that should be unavailable during training or evaluation enters the modeling process. It causes performance estimates to look better than they really are.
Training loss measures how well a model fits the training data.
A tensor has values, shape, data type, and device placement.
Attention is a differentiable retrieval mechanism. A query asks for information, keys define where information can be found, and values carry the content returned to the model.
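A minimal sketch of this retrieval view using scaled dot-product attention; the sequence length and dimension here are chosen only for illustration.

```python
import torch
import torch.nn.functional as F

T, d = 5, 16                         # illustrative sequence length and dimension
Q = torch.randn(T, d)                # queries: what each position asks for
K = torch.randn(T, d)                # keys: where information can be found
V = torch.randn(T, d)                # values: the content returned to the model

scores = Q @ K.t() / d ** 0.5        # how well each query matches each key
weights = F.softmax(scores, dim=-1)  # normalized retrieval weights per query
out = weights @ V                    # each output is a weighted mix of values
print(out.shape)                     # torch.Size([5, 16])
```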
A PyTorch project should separate concerns. Model code should define computation.
Stochastic depth regularizes deep residual networks by randomly skipping residual branches during training.
Retrieval-augmented generation, usually abbreviated RAG, combines a language model with an external information retrieval system.
A pretraining objective defines the prediction task used to train a model before it is adapted to a downstream use case.
Language modeling is the task of predicting text sequences. A language model assigns probabilities to sequences of tokens and learns the statistical structure of language.
Training produces model parameters. Inference uses those parameters to generate predictions.
This chapter covered scaling, efficient systems, scientific AI, robotics, and open research problems. The following books, papers, and resources provide deeper treatment of these areas.
Evaluation metrics convert model behavior into numbers. A loss function guides training. A metric reports performance. Sometimes they are the same. Often they are different.
Early diffusion systems used convolutional U-Nets as denoising networks. U-Nets worked well because images contain strong local structure, and convolutions efficiently model nearby spatial relationships.
A loss function defines what the model is trained to improve. It translates a modeling goal into a scalar value that can be minimized by gradient-based optimization.
An automatic differentiation engine is the system that records numerical operations and computes derivatives from them.
Video diffusion extends image diffusion from still images to moving sequences. Instead of generating one image, the model generates a sequence of frames that should remain visually coherent over time.
Foundation models are large neural networks trained on broad datasets and adapted to many downstream tasks.
A language model becomes more useful when it can interact with external systems.
Probabilistic deep learning extends neural networks with explicit probability models.
Stable training means that a model can make steady progress without numerical collapse, uncontrolled gradients, or large oscillations in the loss.
Dense transformers activate every parameter for every token.
Self-supervised learning trains a model using supervision constructed from the data itself. Instead of requiring human labels, the training task is derived from structure already present in the input.
Random tensors are used throughout deep learning. They initialize parameters, shuffle examples, sample noise, apply dropout, augment data, and generate outputs from probabilistic models.
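A few of these uses in miniature; seeding the generator makes the draws reproducible.

```python
import torch

torch.manual_seed(0)             # fix the seed so the draws are reproducible

w = torch.randn(4, 3)            # Gaussian draw, e.g. parameter initialization
perm = torch.randperm(8)         # random permutation, e.g. shuffling examples
noise = torch.rand(2, 2)         # uniform noise in [0, 1)
keep = torch.rand(10) > 0.5      # random boolean mask, as dropout uses
```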
Activation functions should be chosen for the architecture, loss, initialization, normalization, and training scale. There is no universal best activation. The right choice depends on what the layer must do.
Overfitting and underfitting describe two common ways a model can fail.
Mixup and CutMix are data augmentation methods that create new training examples by combining two examples and their labels. They regularize the model by discouraging overly sharp decision boundaries.
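A minimal sketch of mixup on a single batch, assuming inputs `x` and integer labels `y`; the mixing coefficient is drawn from a Beta distribution, and the loss is then mixed with the same coefficient.

```python
import torch

def mixup_batch(x, y, alpha=0.2):
    """Blend a batch with a shuffled copy of itself (sketch, not a full recipe)."""
    lam = torch.distributions.Beta(alpha, alpha).sample()  # mixing coefficient
    idx = torch.randperm(x.size(0))                        # random pairing
    x_mixed = lam * x + (1 - lam) * x[idx]                 # blend the inputs
    return x_mixed, y, y[idx], lam

# The training loss mixes the two labels the same way:
# loss = lam * loss_fn(pred, y_a) + (1 - lam) * loss_fn(pred, y_b)
```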
Linear models are the first useful class of predictive models in deep learning.
The learning rate controls the size of each parameter update.
A PyTorch installation must match three things: the Python environment, the operating system, and the available hardware.
Gradient flow describes how derivative information moves backward through a neural network during training.
After tokenization, text is represented as integer token IDs.
Efficient convolutions reduce computation, memory use, or latency while preserving useful spatial modeling.
Standard self-attention compares every token with every other token. For a sequence of length $T$, this produces a $T \times T$ attention matrix. The cost grows quadratically with sequence length.
Cross-lingual transfer is the ability of a model trained or adapted in one language to work in another language.
A conversational system processes dialogue between users and machines.
Automated machine learning, or AutoML, refers to systems that automate parts of the model development process.
Attention gives a model direct access between positions in a sequence.
A transformer decoder maps a partial output sequence to predictions for the next token or next output step.
Text-to-image generation aims to synthesize images from natural language descriptions. A model receives a prompt such as "a watercolor painting of a lighthouse at sunset" and must produce an image that matches the description.
Subword methods split text into units smaller than words but usually larger than single characters.
Stochastic depth is a regularization method for deep residual networks.
Recurrent neural networks were among the first deep learning architectures capable of handling variable-length sequential data.
Activation functions control both the forward signal and the backward signal.
Residual networks are convolutional networks built from blocks with skip connections.
Residual connections allow a layer or block to add its input directly to its output. Instead of forcing a block to learn a complete transformation from scratch, the block learns a correction to the input.
Representation learning is the study of how a model converts raw data into useful internal variables.
Representation learning is the study of how models learn useful internal descriptions of data.
Question answering, often abbreviated QA, is the task of producing an answer to a question.
PyTorch is one of several major frameworks for deep learning.
Probabilistic circuits are tractable probabilistic models built from simple computational graphs.
Probabilistic deep learning adds distributions to ordinary neural networks.
Neural architecture search, or NAS, is the process of automatically searching for model architectures.
Multi-task learning trains one model on several objectives at the same time.
Multi-node training uses more than one machine for a single training job. Each machine contributes one or more accelerators, and all machines cooperate to train the same model.
Multi-head attention runs several attention operations in parallel. Each head has its own query, key, and value projections. The outputs of the heads are concatenated and projected back to the model dimension.
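PyTorch ships this as `nn.MultiheadAttention`; a minimal self-attention call, with `batch_first=True` so inputs are batch × time × dimension (sizes illustrative).

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 64)     # 2 sequences of 10 tokens, model dimension 64
out, weights = mha(x, x, x)    # self-attention: queries = keys = values
print(out.shape)               # torch.Size([2, 10, 64])
```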
Stochastic gradient descent uses the current minibatch gradient to update the parameters.
Model editing modifies a trained model so that it changes a specific behavior while preserving most other behaviors.
Matrix operations are the main arithmetic language of deep learning.
A linear classifier separates classes using a hyperplane. In two dimensions this boundary is a line. In three dimensions it is a plane. In higher dimensions it is a hyperplane.
Large language models can often perform new tasks without updating their parameters.
Standard transformer attention scales quadratically with sequence length. For a sequence of length $T$, self-attention constructs a score matrix of size $T \times T$.
A dialogue system is a model or collection of models that interacts with users through natural language.
Deep learning systems have progressed from small task-specific models to large multimodal foundation systems capable of perception, language understanding, reasoning, planning, generation, and interaction.
A classifier returns scores. Users often interpret those scores as confidence. This interpretation is safe only when the scores are calibrated. A calibrated model assigns probabilities that match empirical correctness.
Bias and variance describe two different sources of prediction error. They are useful because they separate errors caused by an overly simple model from errors caused by an overly sensitive model.
Backpropagation is the algorithm used to compute gradients in neural networks efficiently.
A transformer encoder is a stack of layers that maps an input sequence to a contextual sequence representation.
A machine learning dataset is usually divided into three parts: a training set, a validation set, and a test set.
A language model does not read raw text directly. It reads tokens. Tokenization is the process that maps a string of text into a sequence of discrete symbols, and later maps generated symbols back into text.
Stochastic gradient descent, usually abbreviated as SGD, is the standard form of gradient-based training used in deep learning.
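One SGD training step in PyTorch, as a sketch; the model, batch, and loss here are stand-ins.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                        # stand-in model
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

x, y = torch.randn(32, 10), torch.randn(32, 1)  # one illustrative minibatch
opt.zero_grad()                 # clear gradients from the previous step
loss = loss_fn(model(x), y)     # forward pass on the current minibatch
loss.backward()                 # compute the minibatch gradient
opt.step()                      # update the parameters
```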
Speech recognition maps an acoustic signal to a text sequence. The input is continuous audio. The output is discrete symbols: characters, subword tokens, words, or phonemes.
Many neural networks produce raw scores. These scores are called logits.
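Softmax converts logits into a probability distribution; a small illustration:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 0.5, -1.0])  # raw, unnormalized scores
probs = F.softmax(logits, dim=-1)        # nonnegative and sums to 1
print(probs, probs.sum())
```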
Scaling a transformer means increasing its capacity, data exposure, context length, training compute, or serving throughput.
Population-based training, or PBT, is a hyperparameter optimization method that trains many models at the same time.
Deep learning has made large empirical gains, but many scientific and engineering questions remain open.
Mechanistic interpretability studies neural networks by treating them as learned computational systems.
Machine translation converts text from one language into another. Given a source sentence in one language, the model generates a semantically equivalent sentence in a target language.
A long-horizon agent is a model-driven system that pursues goals over many steps. It observes the environment, chooses actions, records intermediate state, uses tools, and adjusts its plan as new information arrives.
Linear separability describes when a classification dataset can be divided perfectly by a linear decision boundary. It is one of the central geometric ideas behind linear classification.
A latent space is the internal coordinate system learned by an encoder or generative model.
Latent space manipulation studies how to change a learned representation $z$ in order to produce controlled changes in the decoded output. In an autoencoder, the encoder maps an input into a latent vector, and the decoder maps that latent vector back into data space.
Early diffusion models operated directly in pixel space. A model generated images by iteratively denoising tensors with the same shape as the target image.
Large-scale training means training models on datasets, model sizes, or hardware configurations that exceed a simple single-GPU workflow.
Label smoothing is a regularization method for classification.
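PyTorch exposes label smoothing directly on the cross-entropy loss; a minimal sketch:

```python
import torch.nn as nn

# Move a small amount of target probability mass onto the non-target classes.
loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)
```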
Gradients are enough for most neural network training. A gradient tells us how a scalar loss changes with respect to parameters.
Information retrieval is the task of finding relevant items from a collection in response to a query.
Indexing and slicing select parts of a tensor. These operations are used constantly in PyTorch: selecting batches, cropping images, extracting token positions, applying masks, gathering logits, and rearranging model outputs.
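A few representative selections, assuming a batch × tokens × features layout:

```python
import torch

x = torch.arange(24).reshape(2, 3, 4)  # e.g. batch × tokens × features

first_example = x[0]        # drop the batch dimension: shape (3, 4)
last_token = x[:, -1, :]    # last token of every example: shape (2, 4)
every_other = x[:, ::2, :]  # strided slice along the token axis
above_ten = x[x > 10]       # boolean mask: returns a flat 1-D tensor
```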
Batch normalization and layer normalization are the two most common normalization layers, but they do not cover every setting well.
Deep learning became practical at scale because neural network computation maps well to parallel hardware.
A Gaussian process is a probabilistic model over functions. Instead of defining a probability distribution over parameters, as in Bayesian neural networks, a Gaussian process defines a probability distribution directly over functions.
Flow-based models are generative models that learn an invertible transformation between a simple probability distribution and a complex data distribution. Unlike many other generative models, flow-based systems provide exact log-likelihoods through the change-of-variables formula.
Distributed training systems fail regularly. GPUs crash, network connections reset, processes hang, disks fill, filesystems become unavailable, and nodes disappear from the cluster.
Cross-attention is attention between two different sequences or sources of information. The queries come from one sequence, while the keys and values come from another.
Contrastive objectives train a model by comparing examples. Instead of learning only from an input and its target, the model learns which examples should be close together and which examples should be far apart.
Reinforcement learning from human feedback improves model behavior using preference data. However, collecting large amounts of human feedback is expensive, slow, and difficult to scale consistently.
A convolutional neural network architecture defines how convolutional layers, activation functions, normalization layers, pooling layers, residual paths, and classifier heads are arranged.
Standard recurrent neural networks process sequences in one direction, usually from left to right. At time step $t$, the hidden state summarizes only the past: $h_t = f(h_{t-1}, x_t)$.
A variational autoencoder, or VAE, is an autoencoder with a probabilistic latent space.
A variational autoencoder, or VAE, is a generative latent variable model trained with neural networks.
Recurrent neural networks were designed to process sequential data by maintaining a hidden state over time.
Uncertainty estimation measures how much confidence a model should place in its own predictions.
The perceptron is one of the earliest algorithms for binary classification. It learns a linear decision boundary by updating its weights whenever it makes a mistake.
The chain rule is the mathematical rule that makes backpropagation possible. Neural networks are built by composing many functions. The chain rule tells us how to differentiate such compositions.
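For a composition $L = f(g(h(x)))$, the derivative factors into a product of local derivatives, which is exactly the product backpropagation evaluates:

$$\frac{dL}{dx} = \frac{df}{dg} \cdot \frac{dg}{dh} \cdot \frac{dh}{dx}$$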
PyTorch programs are tensor programs. A tensor stores numbers in a structured array, and most model computation is expressed as operations over tensors.
Tensor arithmetic is the basic computation layer of PyTorch.
Summarization is the task of producing a shorter version of one or more source texts while preserving the important information.
Self-attention is attention applied within a single sequence.
Robotics and embodied AI study learning systems that act in the physical world.
A retrieval system finds relevant information from an external memory source.
Transformer layers are deep stacks of attention and feedforward blocks.
Reinforcement learning studies how an agent learns to act through interaction with an environment.
Instruction tuning teaches a model to imitate demonstrations.
Self-attention compares tokens by content. By itself, it has no built-in notion of token order.
Pipeline parallelism splits a model into sequential stages and places each stage on a different device. It is a form of model parallelism designed to reduce idle time.
Padding and stride control the spatial size of convolutional feature maps.
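For input size $n$, kernel size $k$, padding $p$, and stride $s$, the output size along each spatial dimension is

$$\left\lfloor \frac{n + 2p - k}{s} \right\rfloor + 1.$$

For example, $n = 32$, $k = 3$, $p = 1$, $s = 1$ gives an output of size 32, which is why 3×3 convolutions with padding 1 preserve spatial size.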
A diffusion model needs a rule for how noise increases during the forward process.
Neural machine translation maps a sentence in one language to a sentence in another language using a neural sequence model. The model receives a source sentence and generates a target sentence.
Named entity recognition, usually abbreviated NER, identifies spans of text that refer to named or typed entities.
Masked language modeling trains a model to recover missing tokens from their surrounding context.
Margin-based losses are used when the goal is not only to make the correct prediction, but to make it by a sufficient margin.
Layer normalization is a normalization method that normalizes features within each individual example.
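In PyTorch, `nn.LayerNorm` normalizes over the trailing feature dimensions of each example; a minimal sketch:

```python
import torch
import torch.nn as nn

ln = nn.LayerNorm(64)                 # normalize over the last dimension
x = torch.randn(2, 10, 64)            # batch × tokens × features
y = ln(x)                             # each token vector normalized on its own
print(y[0, 0].mean(), y[0, 0].std())  # roughly 0 and 1 per token
```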
Gradient descent is the basic optimization method used to train neural networks. It updates model parameters in the direction that reduces the loss.
Energy-based models, or EBMs, define probability distributions using energy functions rather than normalized output probabilities directly.
ReLU and its variants improved optimization in deep networks, but they still have limitations.
Data augmentation creates modified versions of training examples without changing their labels.
Data augmentation is a regularization method that creates modified versions of training examples while preserving their labels.
Bayesian optimization is a hyperparameter optimization method for expensive black-box functions. It is useful when each training run costs enough that random search wastes too much compute.
Attribution methods assign credit or blame to parts of an input, hidden representation, neuron, feature, or training example for a model output.
A unified foundation model is a neural network trained across many modalities, tasks, and domains using a shared architecture and shared representations.
Text classification assigns one or more labels to a piece of text.
Neural networks start with tensors. Some tensors come from data; others, such as parameters, must be created and initialized before training begins.
Softmax regression extends logistic regression from two classes to many classes. It is the standard linear model for multiclass classification.
Self-supervised learning is a form of learning where the training signal is created from the data itself. The dataset does not need human-written labels, but the model still receives a prediction task.
Diffusion models can be understood from multiple mathematical viewpoints.
Scientific deep learning applies neural networks and differentiable computation to scientific and engineering problems.
A saliency map is a visualization that assigns an importance score to each part of an input.
Reverse-mode differentiation is the method used by backpropagation. It computes derivatives by first evaluating a function forward, then propagating gradient information backward from the output to the inputs.
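In PyTorch this engine is autograd: the forward pass records operations, and `backward()` runs the reverse pass. A tiny example:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3 + x     # forward pass: operations are recorded
y.backward()       # reverse pass: propagate gradients back to x
print(x.grad)      # dy/dx = 3x^2 + 1 = 13.0 at x = 2
```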
Random search is a hyperparameter optimization method that samples configurations at random from a search space.
Question answering is the task of producing an answer to a question. The input may contain only the question, or it may contain both a question and one or more passages that may contain the answer.
Self-attention compares tokens to other tokens, but by itself it has no built-in notion of order.
Multi-head attention runs several attention operations in parallel.
Monte Carlo methods approximate difficult mathematical quantities using random samples.
Model parallelism splits a model across multiple devices. Instead of copying the whole model onto every GPU, different parts of the model live on different GPUs.
A loss function measures how wrong a model’s predictions are.
Many deep learning loss functions can be understood as likelihood maximization.
ReLU is simple and effective, but it has one sharp weakness.
Pretraining teaches a language model to predict text. It does not directly teach the model to follow user instructions, answer safely, maintain dialogue structure, or format outputs in a useful way.
Fine-tuning adapts a pretrained model to a target dataset by continuing training from learned weights instead of starting from random initialization.
A feature map is the spatial output produced by a convolutional filter. In a convolutional neural network, each output channel can be read as a map of where a learned feature appears in the input.
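The output channels of a convolution are its feature maps; a small shape illustration:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
img = torch.randn(1, 3, 32, 32)  # one RGB image (illustrative size)
fmap = conv(img)
print(fmap.shape)                # torch.Size([1, 16, 32, 32]): 16 feature maps
```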
Deep learning models are built from sequences of mathematical operations.
Dropout is a regularization method that randomly removes parts of a neural network during training.
Dot-product attention uses an inner product to measure how well a query matches a key.
A denoising autoencoder learns to reconstruct a clean input from a corrupted version of that input. Instead of copying $x$ to $\hat{x}$, the model receives a noisy input $\tilde{x}$ and must recover the original $x$.
A denoising autoencoder learns to recover a clean input from a corrupted version of that input.
A deep belief network, or DBN, is a probabilistic generative model formed by stacking multiple layers of latent variables.
Beam search is a decoding algorithm for autoregressive sequence models. It is used when a model must generate a sequence, but greedy decoding is too narrow.
Batch normalization is a layer that normalizes activations using statistics computed from a mini-batch.
Recurrent networks reuse the same parameters at every time step.
Autoregressive modeling is the dominant formulation for modern language generation. The model predicts the next token from previous tokens. Repeating this prediction step produces a sequence.
Bayesian neural networks require inference over a posterior distribution: $p(\theta \mid \mathcal{D}) = \dfrac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}$.
Deep networks train by sending information in two directions.
Unsupervised learning studies data without explicit target labels. The dataset contains inputs only: $\mathcal{D} = \{x_1, x_2, \dots, x_N\}$.
A transformer decoder is a neural network block that maps a prefix sequence to a sequence of next-token representations. It is used when the model must generate output one step at a time.
Transfer learning reuses a model trained on one task as the starting point for another task.
PyTorch is a deep learning platform built around tensors, automatic differentiation, and composable neural network modules.
Deep learning systems manipulate tensors with millions or billions of numerical entries.
Teacher forcing is a training method for autoregressive sequence models. It is used when a model generates an output sequence one token at a time, but during training we already know the correct output sequence.
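Concretely, the ground-truth prefix is the input and the same sequence shifted one position is the target; a sketch, assuming a batch of token IDs and a hypothetical `model`:

```python
import torch

tokens = torch.randint(0, 1000, (2, 8))  # illustrative batch of token IDs

inputs = tokens[:, :-1]   # ground-truth prefix fed to the model
targets = tokens[:, 1:]   # the same tokens shifted one step left
# logits = model(inputs)  # model predicts each next token from the true prefix
# loss = F.cross_entropy(logits.transpose(1, 2), targets)
```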
A language model cannot process raw text directly. Text must first be converted into a sequence of token IDs. The procedure that performs this conversion is called tokenization.
An undercomplete autoencoder constrains the representation by reducing the latent dimension.
An ordinary autoencoder compresses information by forcing the latent representation to have fewer dimensions than the input.
Self-attention is attention applied within a single sequence. The same input supplies the queries, keys, and values. Each position builds a new representation by reading from other positions in the same sequence.
Scaling laws describe how model performance changes as we increase compute, parameter count, dataset size, and training tokens.
The forward diffusion process gradually transforms data into noise.
A restricted Boltzmann machine, or RBM, is a simplified Boltzmann machine with a bipartite structure.
A feedforward neural network processes inputs through a fixed sequence of layers. Once the output is produced, the computation ends. There is no memory of previous inputs.
The rectified linear unit, usually called ReLU, is the most widely used activation function in modern deep learning.
Pooling is a downsampling operation used in convolutional neural networks.
Statistical language models estimate probabilities from discrete counts.
Named entity recognition, or NER, is the task of finding spans of text that refer to entities and assigning each span a type.
Logistic regression is a linear model for classification. It predicts a probability instead of a raw numerical value. Despite its name, logistic regression is mainly used for classification, not regression.
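The model passes a linear score through the sigmoid to produce a probability:

$$p(y = 1 \mid x) = \sigma(w^\top x + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$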
Linear regression predicts a real number. Logistic regression predicts a probability for binary classification.
Grid search is one of the simplest methods for hyperparameter optimization.
Gradient computation is the process of measuring how a scalar output changes when its input values change. In deep learning, the scalar output is usually the loss, and the inputs are usually the model parameters.
Modern deep learning systems are constrained by compute, memory, bandwidth, latency, and energy. As models become larger, efficiency becomes a central engineering problem rather than a secondary optimization.
Neural networks are usually trained iteratively. An optimizer repeatedly updates model parameters to reduce the training loss.
A distribution shift occurs when the data seen at deployment differs from the data used during training.
Distributed Data Parallel, usually abbreviated as DDP, is PyTorch’s primary system for synchronous multi-GPU training.
Cross-entropy loss is the standard loss function for classification. It measures how well a model’s predicted class distribution matches the true class label.
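In PyTorch, `F.cross_entropy` takes raw logits and integer class labels and applies log-softmax internally; a minimal sketch:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)             # 4 examples, 10 classes (illustrative)
labels = torch.tensor([3, 0, 9, 1])     # true class index for each example
loss = F.cross_entropy(logits, labels)  # log-softmax + negative log-likelihood
print(loss)                             # a scalar to minimize
```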
Audio-visual learning studies models that jointly process sound and visual information. The goal is to learn representations that combine what is seen with what is heard.
Additive attention was one of the first successful neural attention mechanisms. It was introduced for neural machine translation to allow a decoder to selectively focus on different encoder states during generation.
Natural language models cannot operate directly on words as strings.
Deep learning is a branch of machine learning that studies models built from many layers of learned computation.
A vision-language model learns a joint representation of images and text.
A transformer encoder is a neural network block that maps a sequence of input vectors to a sequence of contextualized output vectors.
Text classification is the task of assigning one or more labels to a piece of text.
Supervised learning is the central paradigm of modern machine learning and deep learning.
A language model assigns probabilities to sequences of tokens. The tokens may be words, subwords, characters, bytes, or other discrete symbols. In the classical setting, a sentence is represented as a finite sequence of tokens $(w_1, w_2, \dots, w_T)$.
Activation functions give neural networks their nonlinear structure.
Many learning problems involve data whose meaning depends on order.
Hyperparameter optimization begins by deciding what may vary.
Modern deep learning systems often improve when we increase three quantities: model size, dataset size, and compute. This empirical regularity is called a scaling law.
Deep learning represents data and computation using arrays of numbers.
A large language model is trained in two broad phases. The first phase is pretraining.
A neural network begins training with parameters that have not yet been learned from data.
Sequence models often need to decide which parts of an input are relevant to a particular output.
Mean squared error is one of the simplest and most widely used loss functions in supervised learning. It measures the average squared difference between a model’s prediction and the target value.
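For predictions $\hat{y}_i$ and targets $y_i$ over $N$ examples:

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$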
Linear regression is the simplest supervised learning model used in deep learning.
A neural network is trained by minimizing a loss function. For a supervised learning problem, this loss measures how far the model predictions are from the target values.
Diffusion models are generative models built around a simple idea: learn to reverse a gradual corruption process.
A sequence-to-sequence model maps one sequence to another sequence.
High-dimensional data often contains structure that can be described with fewer variables than the raw representation suggests.
Deep learning often begins with data that has many coordinates.
A deep learning model does not train directly from files. It trains from tensors. The purpose of a data pipeline is to convert stored data into batches of tensors with consistent shapes, data types, and labels.
Data parallelism is the simplest and most widely used form of distributed deep learning.
Convolution is the central operation in convolutional neural networks.
A computational graph is a graph that represents a numerical computation. The nodes represent values or operations. The edges describe how data flows from one operation to the next.
Image classification assigns one label, or a small set of labels, to an image.
A Boltzmann machine is a probabilistic neural network that defines a probability distribution over binary variables.
A Bayesian neural network is a neural network whose parameters are treated as random variables rather than fixed unknown constants.
Attention is a method for letting a model choose which parts of an input are most relevant when producing an output.
An adversarial example is an input that has been deliberately modified so that a model makes a wrong prediction, while the modification is small enough that a human observer still sees the original object.