A Gaussian process is a probabilistic model over functions. Instead of defining a probability distribution over parameters, as in Bayesian neural networks, a Gaussian process defines a probability distribution directly over functions.
This gives a flexible nonparametric approach to regression, uncertainty estimation, Bayesian optimization, and probabilistic modeling.
Gaussian processes are important in deep learning because they provide:
- principled uncertainty estimates
- exact Bayesian inference in small settings
- connections between kernels and neural networks
- theoretical insight into infinitely wide neural networks
Functions as Random Variables
In ordinary regression, we seek a function

$$f : \mathcal{X} \to \mathbb{R}$$

that maps inputs to outputs.

A Gaussian process treats the function itself as random:

$$f \sim \mathcal{GP}\big(m(x),\, k(x, x')\big)$$
A Gaussian process is completely specified by:
| Component | Meaning |
|---|---|
| Mean function $m(x)$ | Expected value of the function at each input |
| Covariance kernel $k(x, x')$ | Covariance between function values at pairs of inputs |
The mean function gives the expected function value:

$$m(x) = \mathbb{E}[f(x)]$$

The kernel defines covariance between function values:

$$k(x, x') = \mathrm{Cov}\big(f(x), f(x')\big)$$
The kernel determines smoothness, similarity structure, periodicity, and inductive bias.
Gaussian Processes as Infinite Gaussian Distributions
A Gaussian process generalizes the multivariate Gaussian distribution.
A multivariate Gaussian defines a distribution over vectors:

$$\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$$
A Gaussian process defines a distribution over functions.
For any finite set of inputs

$$x_1, x_2, \ldots, x_n$$

the corresponding function values

$$\mathbf{f} = \big(f(x_1), f(x_2), \ldots, f(x_n)\big)$$

are jointly Gaussian:

$$\mathbf{f} \sim \mathcal{N}(\mathbf{m}, K)$$

The mean vector is

$$m_i = m(x_i)$$

and the covariance matrix is

$$K_{ij} = k(x_i, x_j)$$
Thus, a Gaussian process defines consistent Gaussian distributions over all finite collections of function values.
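This finite-dimensional view translates directly into code. The sketch below is a minimal illustration, assuming a zero mean function and a squared-exponential kernel with arbitrary hyperparameters: it builds the mean vector and covariance matrix for a handful of inputs and draws jointly Gaussian samples of the function values.

```python
import numpy as np

# Illustrative choices: zero mean and a squared-exponential kernel.
def kernel(x1, x2, lengthscale=1.0, outputscale=1.0):
    return outputscale * np.exp(-(x1 - x2) ** 2 / (2.0 * lengthscale ** 2))

x = np.linspace(-3.0, 3.0, 5)                        # a finite set of inputs
m = np.zeros_like(x)                                 # mean vector m_i = m(x_i) = 0
K = np.array([[kernel(a, b) for b in x] for a in x]) # covariance K_ij = k(x_i, x_j)

# Each draw is one realization of the function values at these inputs.
samples = np.random.multivariate_normal(m, K, size=3)
print(samples.shape)  # (3, 5)
```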
Mean Functions
The simplest mean function is zero:

$$m(x) = 0$$
This is common in practice because the kernel usually dominates the behavior.
A nonzero mean function may encode prior trends, for example a linear trend:

$$m(x) = w^\top x + b$$
For example, one may combine a linear mean function with a smooth kernel to model deviations around a trend.
Kernel Functions
The kernel is the core of a Gaussian process. It determines how function values at different inputs relate to one another.
A kernel must produce a positive semidefinite covariance matrix.
Common kernels include:
| Kernel | Formula | Properties |
|---|---|---|
| Linear | $k(x, x') = x^\top x'$ | Linear functions |
| RBF | $k(x, x') = \sigma_f^2 \exp\!\left(-\frac{\|x - x'\|^2}{2\ell^2}\right)$ | Smooth functions |
| Periodic | $k(x, x') = \sigma_f^2 \exp\!\left(-\frac{2 \sin^2(\pi \|x - x'\| / p)}{\ell^2}\right)$ | Periodic structure |
| Matérn | Depends on smoothness parameter $\nu$ | Controlled roughness |
The radial basis function kernel is especially important:
$$k(x, x') = \sigma_f^2 \exp\!\left(-\frac{\|x - x'\|^2}{2\ell^2}\right)$$
Here:
| Parameter | Meaning |
|---|---|
| $\sigma_f^2$ | Output variance |
| $\ell$ | Length scale |
Nearby points have high covariance. Distant points have low covariance.
The length scale controls how quickly correlations decay.
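A quick numerical sketch, using hypothetical length scales chosen purely for illustration, shows how the length scale governs this decay:

```python
import numpy as np

def rbf(dist, lengthscale, outputscale=1.0):
    # RBF covariance as a function of the distance between two inputs.
    return outputscale * np.exp(-dist ** 2 / (2.0 * lengthscale ** 2))

distances = np.array([0.0, 0.5, 1.0, 2.0, 4.0])
print(rbf(distances, lengthscale=0.5))  # short length scale: correlations decay quickly
print(rbf(distances, lengthscale=2.0))  # long length scale: correlations decay slowly
```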
Gaussian Process Regression
Suppose we observe training data

$$\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$$

Assume

$$y_i = f(x_i) + \varepsilon_i$$

with Gaussian noise

$$\varepsilon_i \sim \mathcal{N}(0, \sigma_n^2)$$

The prior over function values is

$$\mathbf{f} \sim \mathcal{N}(\mathbf{0}, K)$$

Including observation noise gives

$$\mathbf{y} \sim \mathcal{N}(\mathbf{0}, K + \sigma_n^2 I)$$

Now consider a test input $x_\star$.

The joint distribution of training outputs and test output is

$$\begin{bmatrix} \mathbf{y} \\ f_\star \end{bmatrix} \sim \mathcal{N}\!\left(\mathbf{0},\; \begin{bmatrix} K + \sigma_n^2 I & k_\star \\ k_\star^\top & k(x_\star, x_\star) \end{bmatrix}\right)$$

where $k_\star = \big(k(x_\star, x_1), \ldots, k(x_\star, x_n)\big)^\top$ collects the covariances between the test input and the training inputs.
Conditioning this Gaussian gives the posterior predictive distribution.
Predictive Mean and Variance
The posterior predictive mean is
$$\mu_\star = k_\star^\top (K + \sigma_n^2 I)^{-1} \mathbf{y}$$
The predictive variance is
$$\sigma_\star^2 = k(x_\star, x_\star) - k_\star^\top (K + \sigma_n^2 I)^{-1} k_\star$$
These equations are central to Gaussian process regression.
The predictive mean interpolates nearby observations according to kernel similarity.
The predictive variance increases in regions far from training data.
This gives uncertainty estimates automatically.
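These two formulas are short enough to implement directly. The following NumPy sketch assumes the RBF kernel and small toy data chosen purely for illustration, and computes the predictive mean and variance at a single test input.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, outputscale=1.0):
    # Pairwise RBF covariances between 1-D inputs A and B.
    d2 = (A[:, None] - B[None, :]) ** 2
    return outputscale * np.exp(-d2 / (2.0 * lengthscale ** 2))

# Toy training data (illustrative values, not from the text).
X = np.array([-2.0, -1.0, 0.0, 1.5])
y = np.sin(X)
noise = 0.01  # sigma_n^2

K = rbf_kernel(X, X) + noise * np.eye(len(X))
x_star = np.array([0.5])
k_star = rbf_kernel(X, x_star)            # covariances between training and test inputs

alpha = np.linalg.solve(K, y)             # (K + sigma_n^2 I)^{-1} y
mu_star = k_star.T @ alpha                # predictive mean
v = np.linalg.solve(K, k_star)
var_star = rbf_kernel(x_star, x_star) - k_star.T @ v  # predictive variance

print(mu_star.item(), var_star.item())
```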
Interpretation of Predictive Variance
The predictive variance behaves intuitively.
Near observed training points, uncertainty becomes small because the model has evidence.
Far from training data, uncertainty grows because the model has little information.
This differs from ordinary neural networks, which may produce confident predictions even far outside the training distribution.
Gaussian processes therefore provide well-calibrated uncertainty in many low-data settings.
Kernel Matrices
For training inputs

$$x_1, x_2, \ldots, x_n$$

the kernel matrix is

$$K_{ij} = k(x_i, x_j)$$
The matrix must be symmetric and positive semidefinite.
The kernel matrix acts as a similarity matrix between training examples.
In Gaussian process regression, training requires inversion of

$$K + \sigma_n^2 I$$

This is computationally expensive.
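In practice the matrix is rarely inverted explicitly; a Cholesky factorization is the usual route because it is cheaper and numerically more stable. A minimal sketch, assuming SciPy is available and that the kernel matrix `K`, noise variance `noise`, and targets `y` come from the regression setup above:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_solve(K, noise, y):
    # K, noise, and y are assumed to come from the regression setup above.
    # Factorize K + sigma_n^2 I once, then reuse it for every solve.
    c = cho_factor(K + noise * np.eye(K.shape[0]))
    return cho_solve(c, y)  # equivalent to (K + sigma_n^2 I)^{-1} y
```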
Computational Complexity
Exact Gaussian process regression requires:
| Operation | Complexity |
|---|---|
| Kernel matrix storage | $O(n^2)$ |
| Matrix inversion | $O(n^3)$ |
| Prediction per test point | $O(n)$ to $O(n^2)$ |
This limits exact Gaussian processes to relatively small datasets.
For large-scale problems, approximations are necessary.
Sparse Gaussian Processes
Sparse Gaussian processes reduce complexity using inducing points.
Instead of modeling all training points directly, the model introduces a smaller set of inducing variables:

$$\mathbf{u} = f(Z)$$

where

$$Z = \{z_1, \ldots, z_m\}, \qquad m \ll n$$

are inducing inputs.

The inducing points summarize the function.

Sparse approximations reduce complexity from

$$O(n^3)$$

to roughly

$$O(n m^2)$$
This makes larger datasets practical.
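As a concrete illustration, a sparse variational model in GPyTorch might look like the sketch below. It follows GPyTorch's approximate-GP API, and the zero mean, RBF kernel, and choice of inducing locations are assumptions made only for this example.

```python
import torch
import gpytorch

class SparseGPModel(gpytorch.models.ApproximateGP):
    def __init__(self, inducing_points):
        # Variational distribution over the inducing variables u = f(Z).
        variational_distribution = gpytorch.variational.CholeskyVariationalDistribution(
            inducing_points.size(0)
        )
        variational_strategy = gpytorch.variational.VariationalStrategy(
            self, inducing_points, variational_distribution,
            learn_inducing_locations=True,  # the inducing inputs Z are learned
        )
        super().__init__(variational_strategy)
        self.mean_module = gpytorch.means.ZeroMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x)
        )

# m inducing points, placed here on a simple grid purely for illustration.
model = SparseGPModel(inducing_points=torch.linspace(-3, 3, 20).unsqueeze(-1))
```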
Gaussian Processes and Feature Spaces
Many kernels can be interpreted as inner products in a feature space.
Suppose a feature map

$$\phi : \mathcal{X} \to \mathcal{H}$$

maps inputs into a feature representation.

Then a kernel may be written as

$$k(x, x') = \langle \phi(x), \phi(x') \rangle$$
This is called the kernel trick.
The feature space may be extremely high-dimensional or even infinite-dimensional, but the kernel computes similarities directly without explicitly constructing the features.
This idea connects Gaussian processes, support vector machines, and kernel methods.
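A small numerical check makes the kernel trick concrete. Using a degree-2 polynomial kernel as an illustrative example (not one of the kernels listed above), the kernel value matches an explicit inner product in a quadratic feature space:

```python
import numpy as np

def poly2_kernel(x, z):
    # Degree-2 polynomial kernel: k(x, z) = (x . z)^2
    return np.dot(x, z) ** 2

def phi(x):
    # Explicit quadratic feature map for 2-D inputs.
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(poly2_kernel(x, z), np.dot(phi(x), phi(z)))  # both print 2.25
```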
Gaussian Processes and Neural Networks
Gaussian processes are deeply connected to neural networks.
Suppose a neural network becomes infinitely wide. Under suitable assumptions, the distribution over functions induced by random weights converges to a Gaussian process.
For example, an infinitely wide one-hidden-layer network with random Gaussian weights produces a kernel of the form

$$k(x, x') = \sigma_b^2 + \sigma_w^2\, \mathbb{E}_{w \sim \mathcal{N}(0, I)}\big[\phi(w^\top x)\, \phi(w^\top x')\big]$$

where $\phi$ is the activation function and $\sigma_b^2$, $\sigma_w^2$ are bias and weight variances.
Different activation functions produce different kernels.
This connection provides theoretical insight into deep learning.
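The expectation above can be checked empirically by averaging over many random hidden units. The sketch below assumes a ReLU activation and unit weight variance purely for illustration, and estimates the induced covariance between two inputs by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(a):
    return np.maximum(a, 0.0)

x1 = np.array([1.0, 0.5])
x2 = np.array([0.2, -1.0])

# Many random hidden units w ~ N(0, I); average phi(w.x1) * phi(w.x2).
W = rng.standard_normal((200_000, 2))
cov_estimate = np.mean(relu(W @ x1) * relu(W @ x2))
print(cov_estimate)  # Monte Carlo estimate of E[phi(w^T x1) phi(w^T x2)]
```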
Neural Tangent Kernels
During training, infinitely wide neural networks can behave like kernel methods.
The neural tangent kernel, or NTK, describes how predictions evolve under gradient descent.
The NTK is

$$\Theta(x, x') = \nabla_\theta f(x; \theta)^\top \nabla_\theta f(x'; \theta)$$
As width approaches infinity, the NTK can remain approximately constant during training.
In this regime, neural network training becomes mathematically similar to kernel regression.
This perspective has become important in theoretical deep learning.
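For finite networks the same quantity can be computed empirically with automatic differentiation. A minimal PyTorch sketch, assuming a small scalar-output network defined only for illustration:

```python
import torch
import torch.nn as nn

def empirical_ntk(model, x1, x2):
    # Gradient of the scalar output with respect to all parameters, flattened.
    def grad_vector(x):
        out = model(x).squeeze()
        grads = torch.autograd.grad(out, list(model.parameters()))
        return torch.cat([g.reshape(-1) for g in grads])
    return torch.dot(grad_vector(x1), grad_vector(x2))

# Small illustrative network; widths and inputs are arbitrary.
net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))
x1 = torch.randn(1, 2)
x2 = torch.randn(1, 2)
print(empirical_ntk(net, x1, x2).item())
```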
Gaussian Processes in Bayesian Optimization
Gaussian processes are widely used in Bayesian optimization.
Suppose we want to optimize a function

$$x^\star = \arg\max_x f(x)$$

whose evaluations are expensive.
Examples include:
- hyperparameter tuning
- scientific experiments
- engineering design
- reinforcement learning evaluation
A Gaussian process models the unknown function and its uncertainty.
An acquisition function chooses where to sample next.
Common acquisition functions include:
| Acquisition function | Idea |
|---|---|
| Expected improvement | Prefer likely improvements |
| Upper confidence bound | Balance mean and uncertainty |
| Probability of improvement | Prefer high success probability |
Because Gaussian processes provide predictive uncertainty naturally, they work well for efficient exploration.
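For instance, expected improvement can be computed directly from the GP posterior mean and standard deviation. A minimal sketch, assuming maximization and using SciPy's normal distribution:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, best_y, xi=0.01):
    # mu, sigma: GP predictive mean and standard deviation at candidate points.
    # best_y: best observed objective value so far; xi: exploration offset.
    sigma = np.maximum(sigma, 1e-12)
    z = (mu - best_y - xi) / sigma
    return (mu - best_y - xi) * norm.cdf(z) + sigma * norm.pdf(z)
```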
Gaussian Process Classification
Gaussian processes also support classification.
Instead of Gaussian likelihoods, classification uses Bernoulli or categorical likelihoods.

For binary classification:

$$p(y = 1 \mid x) = \sigma\big(f(x)\big)$$

where $\sigma$ is a sigmoid function (for example, the logistic function) and $f$ has a Gaussian process prior.
The posterior becomes non-Gaussian because the likelihood is nonlinear.
Inference therefore requires approximation:
- Laplace approximation
- expectation propagation
- variational inference
- MCMC
Gaussian process classification is usually more computationally difficult than regression.
Gaussian Processes Versus Neural Networks
| Property | Gaussian Processes | Neural Networks |
|---|---|---|
| Uncertainty estimation | Natural | Usually approximate |
| Data efficiency | Strong in small data | Often needs more data |
| Scalability | Limited | Excellent |
| Flexibility | Kernel dependent | Highly flexible |
| Training | Closed-form or approximate inference | Gradient optimization |
| Interpretability | Often easier | Harder |
| Large-scale deployment | Difficult | Standard |
Gaussian processes are often strong in low-data regimes with expensive labels. Neural networks dominate in large-scale representation learning.
Gaussian Processes in Deep Learning Systems
Modern deep learning combines Gaussian process ideas with neural architectures.
Examples include:
| Method | Idea |
|---|---|
| Deep kernel learning | Neural network learns kernel representation |
| Neural processes | Neural networks that imitate Gaussian process predictive behavior |
| Bayesian neural networks | Approximate probabilistic neural inference |
| NTK theory | Infinite-width neural kernels |
| Gaussian process output layers | Uncertainty-aware prediction heads |
These hybrid approaches attempt to combine the representation power of neural networks with the uncertainty quality of Gaussian processes.
Practical PyTorch Example
PyTorch libraries such as GPyTorch provide scalable Gaussian process implementations.
A minimal regression model may look like:
```python
import torch
import gpytorch


class ExactGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        # Zero mean function; the kernel carries the modeling assumptions.
        self.mean_module = gpytorch.means.ZeroMean()
        # RBF kernel with a learnable output scale.
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.RBFKernel()
        )

    def forward(self, x):
        mean = self.mean_module(x)
        covar = self.covar_module(x)
        return gpytorch.distributions.MultivariateNormal(mean, covar)
```

The kernel and likelihood define the probabilistic structure of the model.
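A typical way to train and query this model, sketched here with one-dimensional toy data assumed only for illustration, pairs it with a Gaussian likelihood and the exact marginal log likelihood:

```python
# Toy data (illustrative only).
train_x = torch.linspace(0, 1, 50)
train_y = torch.sin(train_x * 6.0) + 0.1 * torch.randn(train_x.size(0))

likelihood = gpytorch.likelihoods.GaussianLikelihood()
model = ExactGPModel(train_x, train_y, likelihood)

# Fit kernel and noise hyperparameters by maximizing the marginal log likelihood.
model.train(); likelihood.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)

for _ in range(100):
    optimizer.zero_grad()
    loss = -mll(model(train_x), train_y)
    loss.backward()
    optimizer.step()

# Posterior prediction with mean and variance at new inputs.
model.eval(); likelihood.eval()
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    pred = likelihood(model(torch.linspace(0, 1, 100)))
    mean, var = pred.mean, pred.variance
```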
Limitations of Gaussian Processes
Gaussian processes are elegant, but they have practical limits.
Exact inference scales poorly with dataset size.
Kernel design strongly influences performance.
High-dimensional input spaces can become difficult because distance-based kernels may degrade.
Large modern datasets and foundation-model-scale systems are usually beyond exact Gaussian process methods.
Approximate methods help, but large-scale deep neural networks remain more practical for many applications.
Summary
A Gaussian process defines a probability distribution over functions using a mean function and a covariance kernel.
Gaussian process regression provides exact Bayesian prediction for finite datasets, including predictive uncertainty estimates. Kernels encode assumptions about smoothness and similarity.
Gaussian processes connect probabilistic modeling, kernel methods, Bayesian optimization, and deep learning theory. Infinite-width neural networks and neural tangent kernels reveal deep mathematical relationships between Gaussian processes and modern neural networks.