Deep Learning with PyTorch
1.1 What Is Deep Learning
1.2 The PyTorch Ecosystem
1.3 Dynamic Computation Graphs
1.4 Tensor-Based Computation
1.5 GPUs and Accelerators
1.6 PyTorch Versus Other Frameworks
1.7 Installing and Configuring PyTorch
1.8
PyTorch tensors live on devices. A device is the hardware location where tensor storage exists and where tensor operations execute.
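For example, a tensor can be created on the CPU and moved to an accelerator when one is available. A minimal sketch:

```python
import torch

# Choose an accelerator if one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(3, 4)   # created on the CPU by default
x = x.to(device)        # copy the tensor's storage to the chosen device
print(x.device)         # e.g. cuda:0, or cpu on a machine without a GPU
```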
A tensor has a logical shape and a physical memory layout. The shape tells us how to interpret the tensor as an array. The memory layout tells us how the entries are stored in memory.
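The distinction is visible through `shape` and `stride()`: transposing a tensor changes how the same memory is read rather than moving any entries.

```python
import torch

a = torch.arange(6).reshape(2, 3)
print(a.shape, a.stride())   # torch.Size([2, 3]) (3, 1)

b = a.t()                    # transpose: same storage, different view
print(b.shape, b.stride())   # torch.Size([3, 2]) (1, 3)
print(b.is_contiguous())     # False: logical order no longer matches memory
```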
Deep learning frameworks need a way to represent computation.
Data leakage occurs when information that should be unavailable during training or evaluation enters the modeling process. It causes performance estimates to look better than they really are.
Training loss measures how well a model fits the training data.
A tensor has values, shape, data type, and device placement.
Attention is a differentiable retrieval mechanism. A query asks for information, keys define where information can be found, and values carry the content returned to the model.
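A minimal sketch of this retrieval view using scaled dot-product attention; the sequence length and dimension here are chosen only for illustration.

```python
import torch
import torch.nn.functional as F

T, d = 5, 16                         # illustrative sequence length and dimension
Q = torch.randn(T, d)                # queries: what each position asks for
K = torch.randn(T, d)                # keys: where information can be found
V = torch.randn(T, d)                # values: the content returned to the model

scores = Q @ K.t() / d ** 0.5        # how well each query matches each key
weights = F.softmax(scores, dim=-1)  # normalized retrieval weights per query
out = weights @ V                    # each output is a weighted mix of values
print(out.shape)                     # torch.Size([5, 16])
```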
A PyTorch project should separate concerns. Model code should define computation.
Stochastic depth regularizes deep residual networks by randomly skipping residual branches during training.
Retrieval-augmented generation, usually abbreviated RAG, combines a language model with an external information retrieval system.
A pretraining objective defines the prediction task used to train a model before it is adapted to a downstream use case.
Language modeling is the task of predicting text sequences. A language model assigns probabilities to sequences of tokens and learns the statistical structure of language.
Training produces model parameters. Inference uses those parameters to generate predictions.
This chapter covered scaling, efficient systems, scientific AI, robotics, and open research problems. The following books, papers, and resources provide deeper treatment of these areas.
Evaluation metrics convert model behavior into numbers. A loss function guides training. A metric reports performance. Sometimes they are the same. Often they are different.
Early diffusion systems used convolutional U-Nets as denoising networks. U-Nets worked well because images contain strong local structure, and convolutions efficiently model nearby spatial relationships.
A loss function defines what the model is trained to improve. It translates a modeling goal into a scalar value that can be minimized by gradient-based optimization.
An automatic differentiation engine is the system that records numerical operations and computes derivatives from them.
Video diffusion extends image diffusion from still images to moving sequences. Instead of generating one image, the model generates a sequence of frames that should remain visually coherent over time.
Foundation models are large neural networks trained on broad datasets and adapted to many downstream tasks.
A language model becomes more useful when it can interact with external systems.
Probabilistic deep learning extends neural networks with explicit probability models.
Stable training means that a model can make steady progress without numerical collapse, uncontrolled gradients, or large oscillations in the loss.
Dense transformers activate every parameter for every token.
Self-supervised learning trains a model using supervision constructed from the data itself. Instead of requiring human labels, the training task is derived from structure already present in the input.
Random tensors are used throughout deep learning. They initialize parameters, shuffle examples, sample noise, apply dropout, augment data, and generate outputs from probabilistic models.
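A few of these uses in miniature; seeding the generator makes the draws reproducible.

```python
import torch

torch.manual_seed(0)             # fix the seed so the draws are reproducible

w = torch.randn(4, 3)            # Gaussian draw, e.g. parameter initialization
perm = torch.randperm(8)         # random permutation, e.g. shuffling examples
noise = torch.rand(2, 2)         # uniform noise in [0, 1)
keep = torch.rand(10) > 0.5      # random boolean mask, as dropout uses
```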
Activation functions should be chosen for the architecture, loss, initialization, normalization, and training scale. There is no universal best activation. The right choice depends on what the layer must do.
Overfitting and underfitting describe two common ways a model can fail.
Mixup and CutMix are data augmentation methods that create new training examples by combining two examples and their labels. They regularize the model by discouraging overly sharp decision boundaries.
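A minimal sketch of mixup on a single batch, assuming inputs `x` and integer labels `y`; the mixing coefficient is drawn from a Beta distribution, and the loss is then mixed with the same coefficient.

```python
import torch

def mixup_batch(x, y, alpha=0.2):
    """Blend a batch with a shuffled copy of itself (sketch, not a full recipe)."""
    lam = torch.distributions.Beta(alpha, alpha).sample()  # mixing coefficient
    idx = torch.randperm(x.size(0))                        # random pairing
    x_mixed = lam * x + (1 - lam) * x[idx]                 # blend the inputs
    return x_mixed, y, y[idx], lam

# The training loss mixes the two labels the same way:
# loss = lam * loss_fn(pred, y_a) + (1 - lam) * loss_fn(pred, y_b)
```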
Linear models are the first useful class of predictive models in deep learning.
The learning rate controls the size of each parameter update.
A PyTorch installation must match three things: the Python environment, the operating system, and the available hardware.
Gradient flow describes how derivative information moves backward through a neural network during training.
After tokenization, text is represented as integer token IDs.
Efficient convolutions reduce computation, memory use, or latency while preserving useful spatial modeling.
Standard self-attention compares every token with every other token. For a sequence of length $T$, this produces a $T \times T$ attention matrix. The cost grows quadratically with sequence length.
Cross-lingual transfer is the ability of a model trained or adapted in one language to work in another language.
A conversational system processes dialogue between users and machines.
Automated machine learning, or AutoML, refers to systems that automate parts of the model development process.
Attention gives a model direct access between positions in a sequence.
A transformer decoder maps a partial output sequence to predictions for the next token or next output step.
Text-to-image generation aims to synthesize images from natural language descriptions. A model receives a prompt such as "a watercolor painting of a lighthouse at sunset" and must produce an image that matches the description.
Subword methods split text into units smaller than words but usually larger than single characters.
Stochastic depth is a regularization method for deep residual networks.
Recurrent neural networks were among the first deep learning architectures capable of handling variable-length sequential data.
Activation functions control both the forward signal and the backward signal.
Residual networks are convolutional networks built from blocks with skip connections.
Residual connections allow a layer or block to add its input directly to its output. Instead of forcing a block to learn a complete transformation from scratch, the block learns a correction to the input.
Representation learning is the study of how a model converts raw data into useful internal variables.
Representation learning is the study of how models learn useful internal descriptions of data.
Question answering, often abbreviated QA, is the task of producing an answer to a question.
PyTorch is one of several major frameworks for deep learning.
Probabilistic circuits are tractable probabilistic models built from simple computational graphs.
Probabilistic deep learning adds distributions to ordinary neural networks.
Neural architecture search, or NAS, is the process of automatically searching for model architectures.
Multi-task learning trains one model on several objectives at the same time.
Multi-node training uses more than one machine for a single training job. Each machine contributes one or more accelerators, and all machines cooperate to train the same model.
Multi-head attention runs several attention operations in parallel. Each head has its own query, key, and value projections. The outputs of the heads are concatenated and projected back to the model dimension.
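PyTorch ships this as `nn.MultiheadAttention`; a minimal self-attention call, with `batch_first=True` so inputs are batch × time × dimension (sizes illustrative).

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)
x = torch.randn(2, 10, 64)     # 2 sequences of 10 tokens, model dimension 64
out, weights = mha(x, x, x)    # self-attention: queries = keys = values
print(out.shape)               # torch.Size([2, 10, 64])
```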
Stochastic gradient descent uses the current minibatch gradient to update the parameters.
Model editing modifies a trained model so that it changes a specific behavior while preserving most other behaviors.
Matrix operations are the main arithmetic language of deep learning.
A linear classifier separates classes using a hyperplane. In two dimensions this boundary is a line. In three dimensions it is a plane. In higher dimensions it is a hyperplane.
Large language models can often perform new tasks without updating their parameters.
Standard transformer attention scales quadratically with sequence length. For a sequence of length $T$, self-attention constructs a score matrix of size $T \times T$.
A dialogue system is a model or collection of models that interacts with users through natural language.
Deep learning systems have progressed from small task-specific models to large multimodal foundation systems capable of perception, language understanding, reasoning, planning, generation, and interaction.
A classifier returns scores. Users often interpret those scores as confidence. This interpretation is safe only when the scores are calibrated. A calibrated model assigns probabilities that match empirical correctness.
Bias and variance describe two different sources of prediction error. They are useful because they separate errors caused by an overly simple model from errors caused by an overly sensitive model.
Backpropagation is the algorithm used to compute gradients in neural networks efficiently.
A transformer encoder is a stack of layers that maps an input sequence to a contextual sequence representation.
A machine learning dataset is usually divided into three parts: a training set, a validation set, and a test set.
A language model does not read raw text directly. It reads tokens. Tokenization is the process that maps a string of text into a sequence of discrete symbols, and later maps generated symbols back into text.
Stochastic gradient descent, usually abbreviated as SGD, is the standard form of gradient-based training used in deep learning.
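One SGD training step in PyTorch, as a sketch; the model, batch, and loss here are stand-ins.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                        # stand-in model
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

x, y = torch.randn(32, 10), torch.randn(32, 1)  # one illustrative minibatch
opt.zero_grad()                 # clear gradients from the previous step
loss = loss_fn(model(x), y)     # forward pass on the current minibatch
loss.backward()                 # compute the minibatch gradient
opt.step()                      # update the parameters
```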
Speech recognition maps an acoustic signal to a text sequence. The input is continuous audio. The output is discrete symbols: characters, subword tokens, words, or phonemes.
Many neural networks produce raw scores. These scores are called logits.
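Softmax converts logits into a probability distribution; a small illustration:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 0.5, -1.0])  # raw, unnormalized scores
probs = F.softmax(logits, dim=-1)        # nonnegative and sums to 1
print(probs, probs.sum())
```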
Scaling a transformer means increasing its capacity, data exposure, context length, training compute, or serving throughput.
Population-based training, or PBT, is a hyperparameter optimization method that trains many models at the same time.
Deep learning has made large empirical gains, but many scientific and engineering questions remain open.
Mechanistic interpretability studies neural networks by treating them as learned computational systems.
Machine translation converts text from one language into another. Given a source sentence in one language, the model generates a semantically equivalent sentence in a target language.
A long-horizon agent is a model-driven system that pursues goals over many steps. It observes the environment, chooses actions, records intermediate state, uses tools, and adjusts its plan as new information arrives.
Linear separability describes when a classification dataset can be divided perfectly by a linear decision boundary. It is one of the central geometric ideas behind linear classification.
A latent space is the internal coordinate system learned by an encoder or generative model.
Latent space manipulation studies how to change a learned representation $z$ in order to produce controlled changes in the decoded output. In an autoencoder, the encoder maps an input into a latent vector, and the decoder maps that latent vector back into data space.
Early diffusion models operated directly in pixel space. A model generated images by iteratively denoising tensors with the same shape as the target image.
Large-scale training means training models on datasets, model sizes, or hardware configurations that exceed a simple single-GPU workflow.
Label smoothing is a regularization method for classification.
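PyTorch exposes label smoothing directly on the cross-entropy loss; a minimal sketch:

```python
import torch.nn as nn

# Move a small amount of target probability mass onto the non-target classes.
loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)
```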
Gradients are enough for most neural network training. A gradient tells us how a scalar loss changes with respect to parameters.
Information retrieval is the task of finding relevant items from a collection in response to a query.
Indexing and slicing select parts of a tensor. These operations are used constantly in PyTorch: selecting batches, cropping images, extracting token positions, applying masks, gathering logits, and rearranging model outputs.
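A few representative selections, assuming a batch × tokens × features layout:

```python
import torch

x = torch.arange(24).reshape(2, 3, 4)  # e.g. batch × tokens × features

first_example = x[0]        # drop the batch dimension: shape (3, 4)
last_token = x[:, -1, :]    # last token of every example: shape (2, 4)
every_other = x[:, ::2, :]  # strided slice along the token axis
above_ten = x[x > 10]       # boolean mask: returns a flat 1-D tensor
```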
Batch normalization and layer normalization are the two most common normalization layers, but they do not cover every setting well.
Deep learning became practical at scale because neural network computation maps well to parallel hardware.
A Gaussian process is a probabilistic model over functions. Instead of defining a probability distribution over parameters, as in Bayesian neural networks, a Gaussian process defines a probability distribution directly over functions.
Flow-based models are generative models that learn an invertible transformation between a simple probability distribution and a complex data distribution. Unlike many other generative models, flow-based systems provide exact log-likelihoods through the change-of-variables formula.
Distributed training systems fail regularly. GPUs crash, network connections reset, processes hang, disks fill, filesystems become unavailable, and nodes disappear from the cluster.
Cross-attention is attention between two different sequences or sources of information. The queries come from one sequence, while the keys and values come from another.
Contrastive objectives train a model by comparing examples. Instead of learning only from an input and its target, the model learns which examples should be close together and which examples should be far apart.
Reinforcement learning from human feedback improves model behavior using preference data. However, collecting large amounts of human feedback is expensive, slow, and difficult to scale consistently.
A convolutional neural network architecture defines how convolutional layers, activation functions, normalization layers, pooling layers, residual paths, and classifier heads are arranged.
Standard recurrent neural networks process sequences in one direction, usually from left to right. At time step $t$, the hidden state summarizes only the past: $h_t = f(h_{t-1}, x_t)$.
A variational autoencoder, or VAE, is an autoencoder with a probabilistic latent space.
A variational autoencoder, or VAE, is a generative latent variable model trained with neural networks.
Recurrent neural networks were designed to process sequential data by maintaining a hidden state over time.
Uncertainty estimation measures how much confidence a model should place in its own predictions.
The perceptron is one of the earliest algorithms for binary classification. It learns a linear decision boundary by updating its weights whenever it makes a mistake.
The chain rule is the mathematical rule that makes backpropagation possible. Neural networks are built by composing many functions. The chain rule tells us how to differentiate such compositions.
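For a composition $L = f(g(h(x)))$, the derivative factors into a product of local derivatives, which is exactly the product backpropagation evaluates:

$$\frac{dL}{dx} = \frac{df}{dg} \cdot \frac{dg}{dh} \cdot \frac{dh}{dx}$$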
PyTorch programs are tensor programs. A tensor stores numbers in a structured array, and most model computation is expressed as operations over tensors.
Tensor arithmetic is the basic computation layer of PyTorch.
Summarization is the task of producing a shorter version of one or more source texts while preserving the important information.
Self-attention is attention applied within a single sequence.
Robotics and embodied AI study learning systems that act in the physical world.
A retrieval system finds relevant information from an external memory source.
Transformer layers are deep stacks of attention and feedforward blocks.
Reinforcement learning studies how an agent learns to act through interaction with an environment.
Instruction tuning teaches a model to imitate demonstrations.
Self-attention compares tokens by content. By itself, it has no built-in notion of token order.
Pipeline parallelism splits a model into sequential stages and places each stage on a different device. It is a form of model parallelism designed to reduce idle time.
Padding and stride control the spatial size of convolutional feature maps.
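For input size $n$, kernel size $k$, padding $p$, and stride $s$, the output size along each spatial dimension is

$$\left\lfloor \frac{n + 2p - k}{s} \right\rfloor + 1.$$

For example, $n = 32$, $k = 3$, $p = 1$, $s = 1$ gives an output of size 32, which is why 3×3 convolutions with padding 1 preserve spatial size.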
A diffusion model needs a rule for how noise increases during the forward process.
Neural machine translation maps a sentence in one language to a sentence in another language using a neural sequence model. The model receives a source sentence and generates a target sentence.
Named entity recognition, usually abbreviated NER, identifies spans of text that refer to named or typed entities.
Masked language modeling trains a model to recover missing tokens from their surrounding context.
Margin-based losses are used when the goal is not only to make the correct prediction, but to make it by a sufficient margin.
Layer normalization is a normalization method that normalizes features within each individual example.
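In PyTorch, `nn.LayerNorm` normalizes over the trailing feature dimensions of each example; a minimal sketch:

```python
import torch
import torch.nn as nn

ln = nn.LayerNorm(64)                 # normalize over the last dimension
x = torch.randn(2, 10, 64)            # batch × tokens × features
y = ln(x)                             # each token vector normalized on its own
print(y[0, 0].mean(), y[0, 0].std())  # roughly 0 and 1 per token
```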
Gradient descent is the basic optimization method used to train neural networks. It updates model parameters in the direction that reduces the loss.
Energy-based models, or EBMs, define probability distributions using energy functions rather than normalized output probabilities directly.
ReLU and its variants improved optimization in deep networks, but they still have limitations.
Data augmentation creates modified versions of training examples without changing their labels.
Data augmentation is a regularization method that creates modified versions of training examples while preserving their labels.
Bayesian optimization is a hyperparameter optimization method for expensive black-box functions. It is useful when each training run costs enough that random search wastes too much compute.
Attribution methods assign credit or blame to parts of an input, hidden representation, neuron, feature, or training example for a model output.
A unified foundation model is a neural network trained across many modalities, tasks, and domains using a shared architecture and shared representations.
Text classification assigns one or more labels to a piece of text.
Neural networks start with tensors. Some tensors come from data; others, such as parameters, must be created and initialized before training begins.
Softmax regression extends logistic regression from two classes to many classes. It is the standard linear model for multiclass classification.
Self-supervised learning is a form of learning where the training signal is created from the data itself. The dataset does not need human-written labels, but the model still receives a prediction task.
Diffusion models can be understood from multiple mathematical viewpoints.
Scientific deep learning applies neural networks and differentiable computation to scientific and engineering problems.
A saliency map is a visualization that assigns an importance score to each part of an input.
Reverse-mode differentiation is the method used by backpropagation. It computes derivatives by first evaluating a function forward, then propagating gradient information backward from the output to the inputs.
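In PyTorch this engine is autograd: the forward pass records operations, and `backward()` runs the reverse pass. A tiny example:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 3 + x     # forward pass: operations are recorded
y.backward()       # reverse pass: propagate gradients back to x
print(x.grad)      # dy/dx = 3x^2 + 1 = 13.0 at x = 2
```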
Random search is a hyperparameter optimization method that samples configurations at random from a search space.
Question answering is the task of producing an answer to a question. The input may contain only the question, or it may contain both a question and one or more passages that may contain the answer.
Self-attention compares tokens to other tokens, but by itself it has no built-in notion of order.
Multi-head attention runs several attention operations in parallel.
Monte Carlo methods approximate difficult mathematical quantities using random samples.
Model parallelism splits a model across multiple devices. Instead of copying the whole model onto every GPU, different parts of the model live on different GPUs.
A loss function measures how wrong a model’s predictions are.
Many deep learning loss functions can be understood as likelihood maximization.
ReLU is simple and effective, but it has one sharp weakness.
Pretraining teaches a language model to predict text. It does not directly teach the model to follow user instructions, answer safely, maintain dialogue structure, or format outputs in a useful way.
Fine-tuning adapts a pretrained model to a target dataset by continuing training from learned weights instead of starting from random initialization.
A feature map is the spatial output produced by a convolutional filter. In a convolutional neural network, each output channel can be read as a map of where a learned feature appears in the input.
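The output channels of a convolution are its feature maps; a small shape illustration:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
img = torch.randn(1, 3, 32, 32)  # one RGB image (illustrative size)
fmap = conv(img)
print(fmap.shape)                # torch.Size([1, 16, 32, 32]): 16 feature maps
```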
Deep learning models are built from sequences of mathematical operations.
Dropout is a regularization method that randomly removes parts of a neural network during training.
Dot-product attention uses an inner product to measure how well a query matches a key.
A denoising autoencoder learns to reconstruct a clean input from a corrupted version of that input. Instead of copying $x$ to $\hat{x}$, the model receives a noisy input $\tilde{x}$ and must recover the original $x$.
A denoising autoencoder learns to recover a clean input from a corrupted version of that input.
A deep belief network, or DBN, is a probabilistic generative model formed by stacking multiple layers of latent variables.
Beam search is a decoding algorithm for autoregressive sequence models. It is used when a model must generate a sequence, but greedy decoding is too narrow.
Batch normalization is a layer that normalizes activations using statistics computed from a mini-batch.
Recurrent networks reuse the same parameters at every time step.
Autoregressive modeling is the dominant formulation for modern language generation. The model predicts the next token from previous tokens. Repeating this prediction step produces a sequence.
Bayesian neural networks require inference over a posterior distribution: $p(\theta \mid \mathcal{D}) = \dfrac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}$.
Deep networks train by sending information in two directions.
Unsupervised learning studies data without explicit target labels. The dataset contains inputs only: $\mathcal{D} = \{x_1, x_2, \dots, x_N\}$.
A transformer decoder is a neural network block that maps a prefix sequence to a sequence of next-token representations. It is used when the model must generate output one step at a time.
Transfer learning reuses a model trained on one task as the starting point for another task.
PyTorch is a deep learning platform built around tensors, automatic differentiation, and composable neural network modules.
Deep learning systems manipulate tensors with millions or billions of numerical entries.
Teacher forcing is a training method for autoregressive sequence models. It is used when a model generates an output sequence one token at a time, but during training we already know the correct output sequence.
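Concretely, the ground-truth prefix is the input and the same sequence shifted one position is the target; a sketch, assuming a batch of token IDs and a hypothetical `model`:

```python
import torch

tokens = torch.randint(0, 1000, (2, 8))  # illustrative batch of token IDs

inputs = tokens[:, :-1]   # ground-truth prefix fed to the model
targets = tokens[:, 1:]   # the same tokens shifted one step left
# logits = model(inputs)  # model predicts each next token from the true prefix
# loss = F.cross_entropy(logits.transpose(1, 2), targets)
```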
A language model cannot process raw text directly. Text must first be converted into a sequence of token IDs. The procedure that performs this conversion is called tokenization.
An undercomplete autoencoder constrains the representation by reducing the latent dimension.
An ordinary autoencoder compresses information by forcing the latent representation to have fewer dimensions than the input.
Self-attention is attention applied within a single sequence. The same input supplies the queries, keys, and values. Each position builds a new representation by reading from other positions in the same sequence.
Scaling laws describe how model performance changes as we increase compute, parameter count, dataset size, and training tokens.
The forward diffusion process gradually transforms data into noise.
A restricted Boltzmann machine, or RBM, is a simplified Boltzmann machine with a bipartite structure.
A feedforward neural network processes inputs through a fixed sequence of layers. Once the output is produced, the computation ends. There is no memory of previous inputs.
The rectified linear unit, usually called ReLU, is the most widely used activation function in modern deep learning.
Pooling is a downsampling operation used in convolutional neural networks.
Statistical language models estimate probabilities from discrete counts.
Named entity recognition, or NER, is the task of finding spans of text that refer to entities and assigning each span a type.
Logistic regression is a linear model for classification. It predicts a probability instead of a raw numerical value. Despite its name, logistic regression is mainly used for classification, not regression.
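The model passes a linear score through the sigmoid to produce a probability:

$$p(y = 1 \mid x) = \sigma(w^\top x + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$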
Linear regression predicts a real number. Logistic regression predicts a probability for binary classification.
Grid search is one of the simplest methods for hyperparameter optimization.
Gradient computation is the process of measuring how a scalar output changes when its input values change. In deep learning, the scalar output is usually the loss, and the inputs are usually the model parameters.
Modern deep learning systems are constrained by compute, memory, bandwidth, latency, and energy. As models become larger, efficiency becomes a central engineering problem rather than a secondary optimization.
Neural networks are usually trained iteratively. An optimizer repeatedly updates model parameters to reduce the training loss.
A distribution shift occurs when the data seen at deployment differs from the data used during training.
Distributed Data Parallel, usually abbreviated as DDP, is PyTorch’s primary system for synchronous multi-GPU training.
Cross-entropy loss is the standard loss function for classification. It measures how well a model’s predicted class distribution matches the true class label.
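In PyTorch, `F.cross_entropy` takes raw logits and integer class labels and applies log-softmax internally; a minimal sketch:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)             # 4 examples, 10 classes (illustrative)
labels = torch.tensor([3, 0, 9, 1])     # true class index for each example
loss = F.cross_entropy(logits, labels)  # log-softmax + negative log-likelihood
print(loss)                             # a scalar to minimize
```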
Audio-visual learning studies models that jointly process sound and visual information. The goal is to learn representations that combine what is seen with what is heard.
Additive attention was one of the first successful neural attention mechanisms. It was introduced for neural machine translation to allow a decoder to selectively focus on different encoder states during generation.
Natural language models cannot operate directly on words as strings.
Deep learning is a branch of machine learning that studies models built from many layers of learned computation.
A vision-language model learns a joint representation of images and text.
A transformer encoder is a neural network block that maps a sequence of input vectors to a sequence of contextualized output vectors.
Text classification is the task of assigning one or more labels to a piece of text.
Supervised learning is the central paradigm of modern machine learning and deep learning.
A language model assigns probabilities to sequences of tokens. The tokens may be words, subwords, characters, bytes, or other discrete symbols. In the classical setting, a sentence is represented as a finite sequence of tokens $(w_1, w_2, \dots, w_T)$.
Activation functions give neural networks their nonlinear structure.
Many learning problems involve data whose meaning depends on order.
Hyperparameter optimization begins by deciding what may vary.
Modern deep learning systems often improve when we increase three quantities: model size, dataset size, and compute. This empirical regularity is called a scaling law.
Deep learning represents data and computation using arrays of numbers.
A large language model is trained in two broad phases. The first phase is pretraining.
A neural network begins training with parameters that have not yet been learned from data.
Sequence models often need to decide which parts of an input are relevant to a particular output.
Mean squared error is one of the simplest and most widely used loss functions in supervised learning. It measures the average squared difference between a model’s prediction and the target value.
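For predictions $\hat{y}_i$ and targets $y_i$ over $N$ examples:

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$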
Linear regression is the simplest supervised learning model used in deep learning.
A neural network is trained by minimizing a loss function. For a supervised learning problem, this loss measures how far the model predictions are from the target values.
Diffusion models are generative models built around a simple idea: learn to reverse a gradual corruption process.
A sequence-to-sequence model maps one sequence to another sequence.
High-dimensional data often contains structure that can be described with fewer variables than the raw representation suggests.
Deep learning often begins with data that has many coordinates.
A deep learning model does not train directly from files. It trains from tensors. The purpose of a data pipeline is to convert stored data into batches of tensors with consistent shapes, data types, and labels.
Data parallelism is the simplest and most widely used form of distributed deep learning.
Convolution is the central operation in convolutional neural networks.
A computational graph is a graph that represents a numerical computation. The nodes represent values or operations. The edges describe how data flows from one operation to the next.
Image classification assigns one label, or a small set of labels, to an image.
A Boltzmann machine is a probabilistic neural network that defines a probability distribution over binary variables.
A Bayesian neural network is a neural network whose parameters are treated as random variables rather than fixed unknown constants.
Attention is a method for letting a model choose which parts of an input are most relevant when producing an output.
An adversarial example is an input that has been deliberately modified so that a model makes a wrong prediction, while the modification is small enough that a human observer still sees the original object.