Chapter 134. Numerical Optimization

134.1 Introduction

Numerical optimization studies algorithms for finding minima or maxima of functions using finite computation.

Linear algebra is one of the foundations of numerical optimization. Vectors represent variables, matrices represent constraints and derivatives, and linear systems appear inside nearly every optimization algorithm.

The basic optimization problem is:

\text{minimize } f(x)

subject to

x \in C,

where:

| Symbol | Meaning |
| --- | --- |
| x | Unknown vector |
| f(x) | Objective function |
| C | Feasible set |

Optimization problems appear throughout science and engineering:

| Area | Optimization task |
| --- | --- |
| Machine learning | Loss minimization |
| Statistics | Maximum likelihood |
| Control theory | Optimal control |
| Signal processing | Error minimization |
| Finance | Portfolio optimization |
| Physics | Energy minimization |
| Numerical PDEs | Variational methods |

The central challenge is computational: how to efficiently approximate optimal solutions in finite precision arithmetic.

134.2 Optimization Problems

An optimization problem consists of:

  1. Variables,
  2. An objective function,
  3. Constraints.

The objective function assigns a real number to each feasible point:

f:\mathbb{R}^n\to\mathbb{R}.

The feasible set contains all allowed vectors.

For example:

\text{minimize } x_1^2+x_2^2

subject to

x_1+x_2=1.

The objective measures squared distance from the origin. The constraint restricts the solution to a line.

Optimization seeks the feasible point with smallest objective value.
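
As a quick numerical check of this example, the sketch below solves it with scipy.optimize.minimize; the SLSQP method and starting point are illustrative choices, not something the text prescribes.

```python
# Minimal sketch: solve  min x1^2 + x2^2  s.t.  x1 + x2 = 1
# with scipy.optimize.minimize (SLSQP handles equality constraints).
import numpy as np
from scipy.optimize import minimize

objective = lambda x: x[0]**2 + x[1]**2                      # squared distance from origin
constraint = {"type": "eq", "fun": lambda x: x[0] + x[1] - 1.0}

result = minimize(objective, x0=np.array([0.0, 0.0]),
                  constraints=[constraint], method="SLSQP")
print(result.x)   # expected optimum: [0.5, 0.5]
```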

134.3 Linear Optimization

A linear optimization problem has:

| Component | Form |
| --- | --- |
| Objective | Linear |
| Constraints | Linear |

The standard form of linear programming is:

\text{minimize } c^Tx

subject to

Ax=b,

and

x\ge0.

Here:

| Symbol | Meaning |
| --- | --- |
| x | Decision vector |
| c | Cost vector |
| A | Constraint matrix |
| b | Constraint values |

The feasible region is a convex polyhedron.

Linear programming is important because:

  1. Many practical problems are naturally linear,
  2. Nonlinear problems are often approximated linearly,
  3. Efficient algorithms exist.

The geometry of linear programming is fundamentally linear-algebraic.
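
The sketch below sets up a small made-up standard-form problem with scipy.optimize.linprog; the data and the "highs" solver choice are illustrative assumptions.

```python
# Minimal sketch of a standard-form LP  min c^T x  s.t.  Ax = b, x >= 0.
import numpy as np
from scipy.optimize import linprog

c = np.array([1.0, 2.0, 0.0])        # cost vector (made up)
A = np.array([[1.0, 1.0, 1.0]])      # equality-constraint matrix
b = np.array([1.0])                  # constraint values

# linprog's default bounds are (0, None), i.e. x >= 0, matching standard form.
res = linprog(c, A_eq=A, b_eq=b, method="highs")
print(res.x, res.fun)                # optimum puts all weight on the zero-cost variable
```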

134.4 Least Squares Problems

One of the simplest optimization problems is least squares.

Given

A\in\mathbb{R}^{m\times n}, \qquad b\in\mathbb{R}^m,

the least squares problem minimizes

\|Ax-b\|^2.

This problem appears when the system

Ax=b

has no exact solution.

The least squares solution satisfies the normal equations:

A^TAx=A^Tb.

The matrix

A^TA

is symmetric and positive semidefinite.

Least squares is central in regression, data fitting, signal reconstruction, and inverse problems.
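
As a sketch, the snippet below compares the normal-equation solution with numpy's built-in least squares routine on synthetic data; the random problem is only meant to show that the two routes agree.

```python
# Minimal sketch: overdetermined least squares ||Ax - b||^2 on random data.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 3))
b = rng.standard_normal(100)

# Normal equations A^T A x = A^T b (fine here, but squares the condition number).
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Library route based on orthogonal factorizations (numerically preferable).
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_normal, x_lstsq))   # True: both minimize the same residual
```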

134.5 Convex Optimization

A convex optimization problem minimizes a convex function over a convex set.

The problem is:

minimize f(x) \text{minimize } f(x)

subject to

x\in C,

where:

| Property | Meaning |
| --- | --- |
| C convex | Contains every line segment between its points |
| f convex | Local minima are global |

Convex optimization is especially important because it avoids many pathological behaviors of general nonlinear optimization.

If f is differentiable and convex, then:

f(x)=0 \nabla f(x)=0

implies x is globally optimal.

Convexity transforms optimization from a potentially difficult global problem into a tractable local one.

134.6 Gradient Descent

Gradient descent is one of the most fundamental optimization algorithms.

Given a differentiable function

f:\mathbb{R}^n\to\mathbb{R},

the gradient vector is

\nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1}\\ \vdots\\ \frac{\partial f}{\partial x_n} \end{bmatrix}.

The gradient points in the direction of steepest increase.

Therefore:

-\nabla f(x)

points in the direction of steepest decrease.

Gradient descent updates:

x_{k+1} = x_k-\alpha_k\nabla f(x_k),

where:

| Symbol | Meaning |
| --- | --- |
| x_k | Current iterate |
| \alpha_k | Step size |
| \nabla f(x_k) | Gradient |

The algorithm repeatedly moves downhill.
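
A minimal sketch of the update, assuming a fixed step size and a hand-coded gradient; the test function, step size, and stopping rule are illustrative.

```python
# Minimal gradient-descent sketch with a fixed step size.
import numpy as np

def gradient_descent(grad, x0, step=0.1, tol=1e-8, max_iter=10_000):
    """Iterate x_{k+1} = x_k - step * grad(x_k) until the gradient is small."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x - step * g
    return x

# Example: f(x) = x1^2 + 4*x2^2 has gradient (2*x1, 8*x2) and minimizer (0, 0).
grad_f = lambda x: np.array([2.0 * x[0], 8.0 * x[1]])
print(gradient_descent(grad_f, [1.0, 1.0]))
```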

134.7 Quadratic Optimization

A quadratic optimization problem minimizes a quadratic function:

f(x) = \frac12 x^TQx+c^Tx+d,

where:

| Symbol | Meaning |
| --- | --- |
| Q | Symmetric matrix |
| c | Linear term |
| d | Constant |

The gradient is

\nabla f(x)=Qx+c.

Setting the gradient equal to zero gives:

Qx=-c.

Thus quadratic optimization reduces to solving a linear system.

If Q is positive definite, then the problem has a unique minimizer.

Quadratic problems are central because many nonlinear functions are locally approximated by quadratics.
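
A short sketch of this reduction, with an illustrative positive definite Q:

```python
# Minimal sketch: minimize (1/2) x^T Q x + c^T x by solving Qx = -c.
import numpy as np

Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])          # symmetric positive definite (made up)
c = np.array([-1.0, 1.0])

x_star = np.linalg.solve(Q, -c)     # stationary point; unique minimizer here
print(x_star, Q @ x_star + c)       # gradient at the solution is ~0
```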

134.8 Hessian Matrices

The Hessian matrix collects second derivatives.

For a twice-differentiable function

f:\mathbb{R}^n\to\mathbb{R},

the Hessian is

H_f(x) = \left( \frac{\partial^2 f}{\partial x_i\partial x_j} \right).

The Hessian measures local curvature.

Important cases:

| Hessian | Interpretation |
| --- | --- |
| Positive definite | Local minimum |
| Negative definite | Local maximum |
| Indefinite | Saddle point |

Second-order optimization methods use Hessians to accelerate convergence.
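
As a sketch, the snippet below classifies a stationary point from the eigenvalue signs of a symmetric Hessian; the tolerance and test matrix are illustrative.

```python
# Minimal sketch: classify a stationary point from the Hessian's eigenvalues.
import numpy as np

def classify_hessian(H, tol=1e-10):
    """Return a rough label based on the eigenvalue signs of a symmetric Hessian."""
    eigs = np.linalg.eigvalsh(H)
    if np.all(eigs > tol):
        return "local minimum (positive definite)"
    if np.all(eigs < -tol):
        return "local maximum (negative definite)"
    if np.any(eigs > tol) and np.any(eigs < -tol):
        return "saddle point (indefinite)"
    return "inconclusive (semidefinite)"

print(classify_hessian(np.array([[2.0, 0.0], [0.0, -1.0]])))  # saddle point
```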

134.9 Newton’s Method

Newton’s method uses second-order information.

The update rule is:

x_{k+1} = x_k - H_f(x_k)^{-1}\nabla f(x_k).

The method approximates the objective locally by a quadratic function and minimizes that quadratic approximation.

Advantages:

| Advantage | Reason |
| --- | --- |
| Fast local convergence | Uses curvature |
| Scale adaptation | Hessian rescales directions |

Disadvantages:

| Disadvantage | Reason |
| --- | --- |
| Expensive | Requires solving linear systems |
| Hessian storage | Large matrices |
| Possible instability | Hessian may be singular |

Large-scale optimization often approximates Newton steps rather than computing them exactly.
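
A minimal sketch of the update, solving the linear system H_f(x) d = -\nabla f(x) at each step rather than forming an inverse; the Rosenbrock-style test function and starting point are illustrative choices.

```python
# Minimal Newton's-method sketch: solve H_f(x) d = -grad f(x) each iteration.
import numpy as np

def newton(grad, hess, x0, tol=1e-10, max_iter=50):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        d = np.linalg.solve(hess(x), -g)   # Newton direction
        x = x + d
    return x

# Example: Rosenbrock function f(x) = (x1 - 1)^2 + 100*(x2 - x1^2)^2.
grad_f = lambda x: np.array([
    2*(x[0] - 1) - 400*x[0]*(x[1] - x[0]**2),
    200*(x[1] - x[0]**2),
])
hess_f = lambda x: np.array([
    [2 - 400*x[1] + 1200*x[0]**2, -400*x[0]],
    [-400*x[0],                    200.0],
])
print(newton(grad_f, hess_f, [-1.2, 1.0]))   # should approach the minimizer (1, 1)
```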

134.10 Constrained Optimization

Many optimization problems include constraints.

Equality constraints have the form:

g_i(x)=0.

Inequality constraints have the form:

h_j(x)\le0.

The feasible set becomes:

C = \{x:g_i(x)=0,\ h_j(x)\le0\}.

Constraints change the geometry of optimization.

The optimum may occur on a boundary rather than in the interior.

134.11 Lagrange Multipliers

Lagrange multipliers solve equality-constrained optimization problems.

Suppose we minimize

f(x)

subject to

g(x)=0.

Define the Lagrangian:

L(x,\lambda) = f(x)+\lambda g(x).

Optimality conditions become:

\nabla_x L(x,\lambda)=0,

and

g(x)=0.

This yields:

\nabla f(x) = -\lambda \nabla g(x).

Geometrically, the objective gradient becomes parallel to the constraint gradient.
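
For the earlier example (minimize x_1^2+x_2^2 subject to x_1+x_2=1), the optimality conditions are linear, so the sketch below simply assembles and solves the resulting 3x3 system; the ordering of the unknowns is an illustrative choice.

```python
# Minimal sketch: stationarity of L(x, lam) = f(x) + lam*g(x) plus feasibility
# for  min x1^2 + x2^2  s.t.  x1 + x2 = 1  gives a small linear system.
import numpy as np

# Unknowns: (x1, x2, lam).  Rows: dL/dx1 = 0, dL/dx2 = 0, g(x) = 0.
K = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
rhs = np.array([0.0, 0.0, 1.0])

x1, x2, lam = np.linalg.solve(K, rhs)
print(x1, x2, lam)     # expected: 0.5, 0.5, -1.0
```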

134.12 Karush-Kuhn-Tucker Conditions

The Karush-Kuhn-Tucker conditions generalize Lagrange multipliers to inequality constraints.

For problems with constraints

h_i(x)\le0,

the KKT conditions include:

| Condition | Meaning |
| --- | --- |
| Stationarity | Gradient balance |
| Primal feasibility | Constraints satisfied |
| Dual feasibility | Multipliers nonnegative |
| Complementary slackness | Active constraints identified |

The KKT system forms the basis of many modern optimization algorithms.

134.13 Interior-Point Methods

Interior-point methods solve constrained optimization problems by moving through the interior of the feasible region.

Rather than handling the inequality constraints directly, they add barrier terms such as:

-\mu\sum_i \log(-h_i(x)).

The logarithm approaches infinity near the boundary, preventing the iterates from leaving the feasible region.

Interior-point methods are especially important in:

| Problem class | Application |
| --- | --- |
| Linear programming | Large sparse systems |
| Quadratic programming | Control and finance |
| Semidefinite programming | Convex matrix optimization |

These methods rely heavily on sparse linear algebra.
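
A one-dimensional sketch of the barrier idea, assuming a single inequality constraint and using scipy.optimize.minimize_scalar for the inner minimizations; the constraint, barrier schedule, and bounds are illustrative simplifications.

```python
# Minimal log-barrier sketch for  min (x - 2)^2  s.t.  x <= 1.
# As mu -> 0, the barrier minimizer approaches the constrained optimum x = 1.
import numpy as np
from scipy.optimize import minimize_scalar

def barrier_objective(x, mu):
    return (x - 2.0)**2 - mu * np.log(1.0 - x)     # defined only for x < 1

x_path = []
for mu in [1.0, 0.1, 0.01, 0.001]:
    res = minimize_scalar(barrier_objective, args=(mu,),
                          bounds=(-5.0, 1.0 - 1e-9), method="bounded")
    x_path.append(res.x)

print(x_path)    # iterates move through the interior toward the boundary x = 1
```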

134.14 Conjugate Gradient Method

The conjugate gradient method solves large symmetric positive definite systems:

Ax=b.

It may also be viewed as minimizing the quadratic function:

f(x) = \frac12x^TAx-b^Tx.

Conjugate gradient constructs search directions that are mutually A-orthogonal.

Advantages include:

| Advantage | Reason |
| --- | --- |
| No matrix inversion | Iterative method |
| Exploits sparsity | Efficient for large systems |
| Low memory | Matrix-vector products only |

Conjugate gradient is one of the central algorithms in numerical optimization.
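
A minimal sketch of the iteration, assuming A is symmetric positive definite; the tolerance and the small test system are illustrative.

```python
# Minimal conjugate-gradient sketch for  Ax = b  with A symmetric positive definite.
# Only matrix-vector products with A are required.
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float)
    r = b - A @ x                    # residual = -gradient of (1/2)x^T A x - b^T x
    p = r.copy()                     # first search direction
    rs_old = r @ r
    for _ in range(max_iter or n):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p    # keep directions A-orthogonal
        rs_old = rs_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b), np.linalg.solve(A, b))   # the two should agree
```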

134.15 Stochastic Gradient Descent

Stochastic gradient descent, or SGD, approximates gradients using random subsets of data.

Instead of computing:

\nabla f(x),

one computes an estimate:

g_k \approx \nabla f(x_k).

The update becomes:

x_{k+1} = x_k-\alpha_k g_k.

SGD is fundamental in machine learning because exact gradients may be too expensive for massive datasets.

Noise in the gradient introduces randomness but greatly reduces computational cost.
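
A minimal minibatch sketch on a synthetic least-squares problem; the step size, batch size, and iteration count are illustrative choices rather than recommendations.

```python
# Minimal minibatch SGD sketch for the least-squares loss (1/m)||Ax - b||^2.
import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 5
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
b = A @ x_true + 0.01 * rng.standard_normal(m)

x = np.zeros(n)
step, batch = 0.01, 32
for _ in range(2000):
    idx = rng.integers(0, m, size=batch)                      # random subset of the data
    g = 2.0 / batch * A[idx].T @ (A[idx] @ x - b[idx])        # stochastic gradient estimate
    x -= step * g

print(np.linalg.norm(x - x_true))    # should be small
```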

134.16 Regularization

Optimization problems may be ill-posed or unstable.

Regularization adds penalty terms to stabilize solutions.

Examples include:

| Method | Objective |
| --- | --- |
| Ridge regression | \lVert Ax-b\rVert^2+\lambda\lVert x\rVert^2 |
| Lasso | \lVert Ax-b\rVert^2+\lambda\lVert x\rVert_1 |

Ridge regression penalizes large Euclidean norm.

Lasso promotes sparsity.

Regularization changes both the geometry and numerical conditioning of optimization problems.
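
A short sketch of the ridge case, which solves the penalized normal equations (A^TA + \lambda I)x = A^Tb on synthetic data; the value of \lambda is illustrative.

```python
# Minimal ridge-regression sketch: the penalized normal equations
# (A^T A + lambda I) x = A^T b have a unique solution for any lambda > 0.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 10))
b = rng.standard_normal(50)
lam = 0.1

x_ridge = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ b)
print(np.linalg.norm(x_ridge))       # the penalty keeps the solution norm moderate
```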

134.17 Duality

Many optimization problems possess dual formulations.

The primal problem:

\text{minimize } f(x)

may correspond to a dual problem involving multipliers or supporting hyperplanes.

Duality provides:

| Benefit | Interpretation |
| --- | --- |
| Lower bounds | Optimality certification |
| Sensitivity analysis | Constraint interpretation |
| Alternative algorithms | Easier computation |

Strong duality occurs when primal and dual optimal values coincide.

Convexity is closely tied to duality theory.

134.18 Numerical Stability

Optimization algorithms operate in finite precision arithmetic.

Important issues include:

| Issue | Effect |
| --- | --- |
| Roundoff error | Loss of accuracy |
| Ill-conditioning | Amplified perturbations |
| Cancellation | Numerical instability |
| Overflow/underflow | Computational failure |

Stable algorithms minimize error growth.

Condition numbers are especially important because they measure sensitivity of solutions to perturbations.
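
A small illustration with np.linalg.cond; the two matrices are made up, one well conditioned and one nearly singular.

```python
# Minimal sketch: the condition number measures how much a solve can
# amplify perturbations; np.linalg.cond reports it for a given matrix.
import numpy as np

A_good = np.array([[2.0, 0.0], [0.0, 1.0]])
A_bad = np.array([[1.0, 1.0], [1.0, 1.0 + 1e-10]])   # nearly singular

print(np.linalg.cond(A_good))   # ~2
print(np.linalg.cond(A_bad))    # huge: tiny data changes move the solution a lot
```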

134.19 Sparse Optimization

Many optimization problems involve sparse matrices.

A sparse matrix has mostly zero entries.

Sparse linear algebra allows:

| Advantage | Reason |
| --- | --- |
| Lower memory usage | Store nonzeros only |
| Faster computation | Skip zero operations |
| Scalable algorithms | Large systems feasible |

Sparse optimization is fundamental in:

| Area | Typical problem |
| --- | --- |
| PDE discretization | Sparse linear systems |
| Network optimization | Graph sparsity |
| Machine learning | Sparse features |
| Signal processing | Compressed sensing |
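
A sketch assuming scipy's sparse module: a large tridiagonal system stored in CSR format and solved with the sparse conjugate-gradient routine; the matrix and its size are illustrative.

```python
# Minimal sparse sketch: a large, well-conditioned tridiagonal SPD system stored
# in CSR format and solved with scipy's sparse conjugate-gradient routine.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

n = 100_000
A = sp.diags([-np.ones(n - 1), 4.0 * np.ones(n), -np.ones(n - 1)],
             offsets=[-1, 0, 1], format="csr")      # only ~3n nonzeros stored
b = np.ones(n)

x, info = cg(A, b)                  # info == 0 signals convergence
print(info, np.linalg.norm(A @ x - b))
```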

134.20 Optimization and Linear Algebra

Nearly every optimization method relies on linear algebra.

| Optimization concept | Linear algebra role |
| --- | --- |
| Gradient | Vector |
| Hessian | Matrix |
| Constraints | Linear systems |
| Newton step | Solve matrix equation |
| Least squares | Orthogonal projection |
| Convexity | Positive semidefinite Hessian |
| Duality | Linear functionals |
| Iterative methods | Krylov subspaces |

Modern optimization is fundamentally computational linear algebra.

134.21 Summary

Numerical optimization studies computational methods for minimizing or maximizing functions.

The central concepts are:

| Concept | Meaning |
| --- | --- |
| Objective function | Quantity being optimized |
| Feasible set | Allowed region |
| Convexity | Guarantees global optimality |
| Gradient | First-order change |
| Hessian | Second-order curvature |
| Newton method | Quadratic local approximation |
| Least squares | Error minimization |
| Conjugate gradient | Iterative quadratic solver |
| Interior-point method | Barrier-based constrained optimization |
| Stochastic gradient | Randomized optimization |
| Duality | Complementary optimization formulation |
| Regularization | Stabilization through penalties |

Numerical optimization combines geometry, calculus, probability, and linear algebra into algorithms for solving large computational problems.