A Gaussian process is a probabilistic model over functions. Instead of defining a probability distribution over parameters, as in Bayesian neural networks, a Gaussian process defines a probability distribution directly over functions.
This gives a flexible nonparametric approach to regression, uncertainty estimation, Bayesian optimization, and probabilistic modeling.
Gaussian processes are important in deep learning because they provide:
- principled uncertainty estimates
- exact Bayesian inference in small settings
- connections between kernels and neural networks
- theoretical insight into infinitely wide neural networks
Functions as Random Variables
In ordinary regression, we seek a function

$$f : \mathcal{X} \to \mathbb{R}$$

that maps inputs to outputs.

A Gaussian process treats the function itself as random:

$$f \sim \mathcal{GP}\big(m(x),\, k(x, x')\big)$$
A Gaussian process is completely specified by:
| Component | Meaning |
|---|---|
| Mean function $m(x)$ | Expected value of the function at each input |
| Covariance kernel $k(x, x')$ | Covariance between function values at pairs of inputs |
The mean function gives the expected function value:

$$m(x) = \mathbb{E}[f(x)]$$

The kernel defines covariance between function values:

$$k(x, x') = \mathrm{Cov}\big(f(x), f(x')\big)$$
The kernel determines smoothness, similarity structure, periodicity, and inductive bias.
Gaussian Processes as Infinite Gaussian Distributions
A Gaussian process generalizes the multivariate Gaussian distribution.
A multivariate Gaussian defines a distribution over vectors:

$$\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$$
A Gaussian process defines a distribution over functions.
For any finite set of inputs

$$x_1, x_2, \ldots, x_n$$

the corresponding function values

$$\mathbf{f} = \big(f(x_1), f(x_2), \ldots, f(x_n)\big)$$

are jointly Gaussian:

$$\mathbf{f} \sim \mathcal{N}(\mathbf{m}, K)$$

The mean vector is

$$m_i = m(x_i)$$

and the covariance matrix is

$$K_{ij} = k(x_i, x_j)$$
Thus, a Gaussian process defines consistent Gaussian distributions over all finite collections of function values.
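This finite-dimensional view translates directly into code. The sketch below is a minimal illustration, assuming a zero mean function and a squared-exponential kernel with arbitrary hyperparameters: it builds the mean vector and covariance matrix for a handful of inputs and draws jointly Gaussian samples of the function values.

```python
import numpy as np

# Illustrative choices: zero mean and a squared-exponential kernel.
def kernel(x1, x2, lengthscale=1.0, outputscale=1.0):
    return outputscale * np.exp(-(x1 - x2) ** 2 / (2.0 * lengthscale ** 2))

x = np.linspace(-3.0, 3.0, 5)                        # a finite set of inputs
m = np.zeros_like(x)                                 # mean vector m_i = m(x_i) = 0
K = np.array([[kernel(a, b) for b in x] for a in x]) # covariance K_ij = k(x_i, x_j)

# Each draw is one realization of the function values at these inputs.
samples = np.random.multivariate_normal(m, K, size=3)
print(samples.shape)  # (3, 5)
```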
Mean Functions
The simplest mean function is zero:

$$m(x) = 0$$
This is common in practice because the kernel usually dominates the behavior.
A nonzero mean function may encode prior trends, for example a linear trend:

$$m(x) = w^\top x + b$$
For example, one may combine a linear mean function with a smooth kernel to model deviations around a trend.
Kernel Functions
The kernel is the core of a Gaussian process. It determines how function values at different inputs relate to one another.
A kernel must produce a positive semidefinite covariance matrix.
Common kernels include:
| Kernel | Formula | Properties |
|---|---|---|
| Linear | $k(x, x') = x^\top x'$ | Linear functions |
| RBF | $k(x, x') = \sigma_f^2 \exp\!\left(-\frac{\|x - x'\|^2}{2\ell^2}\right)$ | Smooth functions |
| Periodic | $k(x, x') = \sigma_f^2 \exp\!\left(-\frac{2 \sin^2(\pi \|x - x'\| / p)}{\ell^2}\right)$ | Periodic structure |
| Matérn | Depends on smoothness parameter $\nu$ | Controlled roughness |
The radial basis function kernel is especially important:
$$k(x, x') = \sigma_f^2 \exp\!\left(-\frac{\|x - x'\|^2}{2\ell^2}\right)$$
Here:
| Parameter | Meaning |
|---|---|
| $\sigma_f^2$ | Output variance |
| $\ell$ | Length scale |
Nearby points have high covariance. Distant points have low covariance.
The length scale controls how quickly correlations decay.
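A quick numerical sketch, using hypothetical length scales chosen purely for illustration, shows how the length scale governs this decay:

```python
import numpy as np

def rbf(dist, lengthscale, outputscale=1.0):
    # RBF covariance as a function of the distance between two inputs.
    return outputscale * np.exp(-dist ** 2 / (2.0 * lengthscale ** 2))

distances = np.array([0.0, 0.5, 1.0, 2.0, 4.0])
print(rbf(distances, lengthscale=0.5))  # short length scale: correlations decay quickly
print(rbf(distances, lengthscale=2.0))  # long length scale: correlations decay slowly
```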
Gaussian Process Regression
Suppose we observe training data

$$\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$$

Assume

$$y_i = f(x_i) + \varepsilon_i$$

with Gaussian noise

$$\varepsilon_i \sim \mathcal{N}(0, \sigma_n^2)$$

The prior over function values is

$$\mathbf{f} \sim \mathcal{N}(\mathbf{0}, K)$$

Including observation noise gives

$$\mathbf{y} \sim \mathcal{N}(\mathbf{0}, K + \sigma_n^2 I)$$

Now consider a test input $x_\star$.

The joint distribution of training outputs and test output is

$$\begin{bmatrix} \mathbf{y} \\ f_\star \end{bmatrix} \sim \mathcal{N}\!\left(\mathbf{0},\; \begin{bmatrix} K + \sigma_n^2 I & k_\star \\ k_\star^\top & k(x_\star, x_\star) \end{bmatrix}\right)$$

where $k_\star = \big(k(x_\star, x_1), \ldots, k(x_\star, x_n)\big)^\top$ collects the covariances between the test input and the training inputs.
Conditioning this Gaussian gives the posterior predictive distribution.
Predictive Mean and Variance
The posterior predictive mean is
$$\mu_\star = k_\star^\top (K + \sigma_n^2 I)^{-1} \mathbf{y}$$
The predictive variance is
$$\sigma_\star^2 = k(x_\star, x_\star) - k_\star^\top (K + \sigma_n^2 I)^{-1} k_\star$$
These equations are central to Gaussian process regression.
The predictive mean interpolates nearby observations according to kernel similarity.
The predictive variance increases in regions far from training data.
This gives uncertainty estimates automatically.
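These two formulas are short enough to implement directly. The following NumPy sketch assumes the RBF kernel and small toy data chosen purely for illustration, and computes the predictive mean and variance at a single test input.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, outputscale=1.0):
    # Pairwise RBF covariances between 1-D inputs A and B.
    d2 = (A[:, None] - B[None, :]) ** 2
    return outputscale * np.exp(-d2 / (2.0 * lengthscale ** 2))

# Toy training data (illustrative values, not from the text).
X = np.array([-2.0, -1.0, 0.0, 1.5])
y = np.sin(X)
noise = 0.01  # sigma_n^2

K = rbf_kernel(X, X) + noise * np.eye(len(X))
x_star = np.array([0.5])
k_star = rbf_kernel(X, x_star)            # covariances between training and test inputs

alpha = np.linalg.solve(K, y)             # (K + sigma_n^2 I)^{-1} y
mu_star = k_star.T @ alpha                # predictive mean
v = np.linalg.solve(K, k_star)
var_star = rbf_kernel(x_star, x_star) - k_star.T @ v  # predictive variance

print(mu_star.item(), var_star.item())
```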
Interpretation of Predictive Variance
The predictive variance behaves intuitively.
Near observed training points, uncertainty becomes small because the model has evidence.
Far from training data, uncertainty grows because the model has little information.
This differs from ordinary neural networks, which may produce confident predictions even far outside the training distribution.
Gaussian processes therefore provide well-calibrated uncertainty in many low-data settings.
Kernel Matrices
For training inputs

$$x_1, x_2, \ldots, x_n$$

the kernel matrix is

$$K_{ij} = k(x_i, x_j)$$
The matrix must be symmetric and positive semidefinite.
The kernel matrix acts as a similarity matrix between training examples.
In Gaussian process regression, training requires inversion of

$$K + \sigma_n^2 I$$

This is computationally expensive.
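In practice the matrix is rarely inverted explicitly; a Cholesky factorization is the usual route because it is cheaper and numerically more stable. A minimal sketch, assuming SciPy is available and that the kernel matrix `K`, noise variance `noise`, and targets `y` come from the regression setup above:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_solve(K, noise, y):
    # K, noise, and y are assumed to come from the regression setup above.
    # Factorize K + sigma_n^2 I once, then reuse it for every solve.
    c = cho_factor(K + noise * np.eye(K.shape[0]))
    return cho_solve(c, y)  # equivalent to (K + sigma_n^2 I)^{-1} y
```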
Computational Complexity
Exact Gaussian process regression requires:
| Operation | Complexity |
|---|---|
| Kernel matrix storage | $O(n^2)$ |
| Matrix inversion | $O(n^3)$ |
| Prediction per test point | $O(n)$ to $O(n^2)$ |
This limits exact Gaussian processes to relatively small datasets.
For large-scale problems, approximations are necessary.
Sparse Gaussian Processes
Sparse Gaussian processes reduce complexity using inducing points.
Instead of modeling all training points directly, the model introduces a smaller set of inducing variables:

$$\mathbf{u} = f(Z)$$

where

$$Z = \{z_1, \ldots, z_m\}, \qquad m \ll n$$

are inducing inputs.

The inducing points summarize the function.

Sparse approximations reduce complexity from

$$O(n^3)$$

to roughly

$$O(n m^2)$$
This makes larger datasets practical.
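As a concrete illustration, a sparse variational model in GPyTorch might look like the sketch below. It follows GPyTorch's approximate-GP API, and the zero mean, RBF kernel, and choice of inducing locations are assumptions made only for this example.

```python
import torch
import gpytorch

class SparseGPModel(gpytorch.models.ApproximateGP):
    def __init__(self, inducing_points):
        # Variational distribution over the inducing variables u = f(Z).
        variational_distribution = gpytorch.variational.CholeskyVariationalDistribution(
            inducing_points.size(0)
        )
        variational_strategy = gpytorch.variational.VariationalStrategy(
            self, inducing_points, variational_distribution,
            learn_inducing_locations=True,  # the inducing inputs Z are learned
        )
        super().__init__(variational_strategy)
        self.mean_module = gpytorch.means.ZeroMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )

# m inducing points, placed here on a simple grid purely for illustration.
model = SparseGPModel(inducing_points=torch.linspace(-3, 3, 20).unsqueeze(-1))
```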
Gaussian Processes and Feature Spaces
Many kernels can be interpreted as inner products in a feature space.
Suppose a feature map

$$\phi : \mathcal{X} \to \mathcal{H}$$

maps inputs into a feature representation.

Then a kernel may be written as

$$k(x, x') = \langle \phi(x), \phi(x') \rangle$$
This is called the kernel trick.
The feature space may be extremely high-dimensional or even infinite-dimensional, but the kernel computes similarities directly without explicitly constructing the features.
This idea connects Gaussian processes, support vector machines, and kernel methods.
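A small numerical check makes the kernel trick concrete. Using a degree-2 polynomial kernel as an illustrative example (not one of the kernels listed above), the kernel value matches an explicit inner product in a quadratic feature space:

```python
import numpy as np

def poly2_kernel(x, z):
    # Degree-2 polynomial kernel: k(x, z) = (x . z)^2
    return np.dot(x, z) ** 2

def phi(x):
    # Explicit quadratic feature map for 2-D inputs.
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(poly2_kernel(x, z), np.dot(phi(x), phi(z)))  # both print 2.25
```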
Gaussian Processes and Neural Networks
Gaussian processes are deeply connected to neural networks.
Suppose a neural network becomes infinitely wide. Under suitable assumptions, the distribution over functions induced by random weights converges to a Gaussian process.
For example, an infinitely wide one-hidden-layer network with random Gaussian weights produces a kernel of the form

$$k(x, x') = \sigma_b^2 + \sigma_w^2\, \mathbb{E}_{w \sim \mathcal{N}(0, I)}\big[\phi(w^\top x)\, \phi(w^\top x')\big]$$

where $\phi$ is the activation function and $\sigma_b^2$, $\sigma_w^2$ are bias and weight variances.
Different activation functions produce different kernels.
This connection provides theoretical insight into deep learning.
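The expectation above can be checked empirically by averaging over many random hidden units. The sketch below assumes a ReLU activation and unit weight variance purely for illustration, and estimates the induced covariance between two inputs by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(a):
    return np.maximum(a, 0.0)

x1 = np.array([1.0, 0.5])
x2 = np.array([0.2, -1.0])

# Many random hidden units w ~ N(0, I); average phi(w.x1) * phi(w.x2).
W = rng.standard_normal((200_000, 2))
cov_estimate = np.mean(relu(W @ x1) * relu(W @ x2))
print(cov_estimate)  # Monte Carlo estimate of E[phi(w^T x1) phi(w^T x2)]
```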
Neural Tangent Kernels
During training, infinitely wide neural networks can behave like kernel methods.
The neural tangent kernel, or NTK, describes how predictions evolve under gradient descent.
The NTK is

$$\Theta(x, x') = \nabla_\theta f(x; \theta)^\top \nabla_\theta f(x'; \theta)$$
As width approaches infinity, the NTK can remain approximately constant during training.
In this regime, neural network training becomes mathematically similar to kernel regression.
This perspective has become important in theoretical deep learning.
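For finite networks the same quantity can be computed empirically with automatic differentiation. A minimal PyTorch sketch, assuming a small scalar-output network defined only for illustration:

```python
import torch
import torch.nn as nn

def empirical_ntk(model, x1, x2):
    # Gradient of the scalar output with respect to all parameters, flattened.
    def grad_vector(x):
        out = model(x).squeeze()
        grads = torch.autograd.grad(out, list(model.parameters()))
        return torch.cat([g.reshape(-1) for g in grads])
    return torch.dot(grad_vector(x1), grad_vector(x2))

# Small illustrative network; widths and inputs are arbitrary.
net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))
x1 = torch.randn(1, 2)
x2 = torch.randn(1, 2)
print(empirical_ntk(net, x1, x2).item())
```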
Gaussian Processes in Bayesian Optimization
Gaussian processes are widely used in Bayesian optimization.
Suppose we want to optimize a function

$$x^\star = \arg\max_x f(x)$$

whose evaluations are expensive.
Examples include:
- hyperparameter tuning
- scientific experiments
- engineering design
- reinforcement learning evaluation
A Gaussian process models the unknown function and its uncertainty.
An acquisition function chooses where to sample next.
Common acquisition functions include:
| Acquisition function | Idea |
|---|---|
| Expected improvement | Prefer likely improvements |
| Upper confidence bound | Balance mean and uncertainty |
| Probability of improvement | Prefer high success probability |
Because Gaussian processes provide predictive uncertainty naturally, they work well for efficient exploration.
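For instance, expected improvement can be computed directly from the GP posterior mean and standard deviation. A minimal sketch, assuming maximization and using SciPy's normal distribution:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_y, xi=0.01):
    # mu, sigma: GP predictive mean and standard deviation at candidate points.
    # best_y: best observed objective value so far; xi: exploration offset.
    sigma = np.maximum(sigma, 1e-12)
    z = (mu - best_y - xi) / sigma
    return (mu - best_y - xi) * norm.cdf(z) + sigma * norm.pdf(z)
```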
Gaussian Process Classification
Gaussian processes also support classification.
Instead of Gaussian likelihoods, classification uses Bernoulli or categorical likelihoods.

For binary classification:

$$p(y = 1 \mid x) = \sigma\big(f(x)\big)$$

where $\sigma$ is a sigmoid function (for example, the logistic function) and $f$ has a Gaussian process prior.
The posterior becomes non-Gaussian because the likelihood is nonlinear.
Inference therefore requires approximation:
- Laplace approximation
- expectation propagation
- variational inference
- MCMC
Gaussian process classification is usually more computationally difficult than regression.
Gaussian Processes Versus Neural Networks
| Property | Gaussian Processes | Neural Networks |
|---|---|---|
| Uncertainty estimation | Natural | Usually approximate |
| Data efficiency | Strong in small data | Often needs more data |
| Scalability | Limited | Excellent |
| Flexibility | Kernel dependent | Highly flexible |
| Training | Closed-form or approximate inference | Gradient optimization |
| Interpretability | Often easier | Harder |
| Large-scale deployment | Difficult | Standard |
Gaussian processes are often strong in low-data regimes with expensive labels. Neural networks dominate in large-scale representation learning.
Gaussian Processes in Deep Learning Systems
Modern deep learning combines Gaussian process ideas with neural architectures.
Examples include:
| Method | Idea |
|---|---|
| Deep kernel learning | Neural network learns kernel representation |
| Neural processes | Neural networks that imitate Gaussian process predictive behavior |
| Bayesian neural networks | Approximate probabilistic neural inference |
| NTK theory | Infinite-width neural kernels |
| Gaussian process output layers | Uncertainty-aware prediction heads |
These hybrid approaches attempt to combine the representation power of neural networks with the uncertainty quality of Gaussian processes.
Practical PyTorch Example
PyTorch libraries such as GPyTorch provide scalable Gaussian process implementations.
A minimal regression model may look like:
```python
import torch
import gpytorch


class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        # Zero mean function; the kernel carries the modeling assumptions.
        self.mean_module = gpytorch.means.ZeroMean()
        # RBF kernel with a learnable output scale.
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel()
        )

    def forward(self, x):
        mean = self.mean_module(x)
        covar = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean, covar)
```

The kernel and likelihood define the probabilistic structure of the model.
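A typical way to train and query this model, sketched here with one-dimensional toy data assumed only for illustration, pairs it with a Gaussian likelihood and the exact marginal log likelihood:

```python
# Toy data (illustrative only).
train_x = torch.linspace(0, 1, 50)
train_y = torch.sin(train_x * 6.0) + 0.1 * torch.randn(train_x.size(0))

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood)

# Fit kernel and noise hyperparameters by maximizing the marginal log likelihood.
model.train(); likelihood.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

for _ in range(100):
    optimizer.zero_grad()
    loss = -mll(model(train_x), train_y)
    loss.backward()
    optimizer.step()

# Posterior prediction with mean and variance at new inputs.
model.eval(); likelihood.eval()
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    pred = likelihood(model(torch.linspace(0, 1, 100)))
    mean, var = pred.mean, pred.variance
```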
Limitations of Gaussian Processes
Gaussian processes are elegant, but they have practical limits.
Exact inference scales poorly with dataset size.
Kernel design strongly influences performance.
High-dimensional input spaces can become difficult because distance-based kernels may degrade.
Large modern datasets and foundation-model-scale systems are usually beyond exact Gaussian process methods.
Approximate methods help, but large-scale deep neural networks remain more practical for many applications.
Summary
A Gaussian process defines a probability distribution over functions using a mean function and a covariance kernel.
Gaussian process regression provides exact Bayesian prediction for finite datasets, including predictive uncertainty estimates. Kernels encode assumptions about smoothness and similarity.
Gaussian processes connect probabilistic modeling, kernel methods, Bayesian optimization, and deep learning theory. Infinite-width neural networks and neural tangent kernels reveal deep mathematical relationships between Gaussian processes and modern neural networks.