
Chapter 110. Optimization and Linear Algebra

Optimization studies how to choose variables so that a quantity becomes as small or as large as possible. Linear algebra enters optimization because many objective functions, constraints, approximations, and algorithms are naturally written with vectors and matrices.

A standard optimization problem has the form

\text{minimize } f(x)

where $x \in \mathbb{R}^n$ is the vector of decision variables and $f : \mathbb{R}^n \to \mathbb{R}$ is the objective function. With constraints, the problem becomes

\begin{aligned} \text{minimize } \quad & f_0(x) \\ \text{subject to } \quad & f_i(x) \leq b_i, \qquad i = 1, \ldots, m. \end{aligned}

This is the usual mathematical form of a constrained optimization problem. Linear algebra provides the notation and the computational machinery for expressing such problems, especially least squares, linear programming, quadratic programming, Newton methods, and constrained convex optimization.

110.1 Decision Variables

The unknown in an optimization problem is usually a vector

x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}.

Each component $x_i$ represents a quantity to be chosen.

For example, in portfolio optimization, the entries of $x$ may represent asset weights. In machine learning, they may represent model parameters. In engineering design, they may represent physical dimensions. In numerical linear algebra, they may represent an approximate solution to a system.

The objective function assigns a scalar value to each vector $x$. The optimizer searches for a vector that gives the best value.

110.2 Linear Objective Functions

The simplest objective function is linear:

f(x) = c^T x.

Here $c$ is a fixed vector of coefficients.

Expanding,

c^T x = c_1 x_1 + c_2 x_2 + \cdots + c_n x_n.

A linear objective has constant slope. Its level sets are hyperplanes.

Linear objectives appear in linear programming, resource allocation, transportation problems, scheduling, and economic planning.

A linear optimization problem with linear constraints has the form

\begin{aligned} \text{minimize } \quad & c^T x \\ \text{subject to } \quad & Ax \leq b. \end{aligned}

Here the feasible set is the polyhedron described by the system of linear inequalities $Ax \leq b$.

110.3 Quadratic Objective Functions

The next important class is quadratic optimization.

A quadratic objective has the form

f(x) = \frac{1}{2} x^T Q x + c^T x + d.

Here $Q$ is an $n \times n$ matrix, $c$ is a vector, and $d$ is a scalar.

The matrix $Q$ controls curvature. If $Q$ is symmetric positive definite, the function has a unique global minimizer.

The gradient is

\nabla f(x) = Qx + c,

assuming $Q$ is symmetric.

The Hessian is

\nabla^2 f(x) = Q.

Thus quadratic optimization is directly governed by a matrix.

Quadratic functions are important because second-order Taylor approximation replaces a smooth nonlinear function near a point by a quadratic model. In optimization, this is the basis for Newton methods and many local algorithms.

110.4 Positive Definite Matrices

A symmetric matrix $Q$ is positive definite if

x^T Q x > 0

for every nonzero vector $x$.

Positive definiteness means that the quadratic form curves upward in every direction. In optimization, this corresponds to strict convexity for the quadratic part.

If $Q$ is positive definite, then

f(x) = \frac{1}{2} x^T Q x + c^T x + d

has a unique minimizer.

The minimizer satisfies

\nabla f(x) = 0.

Therefore

Qx + c = 0,

and

x^\ast = -Q^{-1} c.

This formula shows the direct relation between minimization and solving a linear system.
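As a numerical sketch of this relation (the matrix $Q$ and vector $c$ below are illustrative values, not taken from the text):

```python
import numpy as np

# Illustrative symmetric positive definite Q and vector c (assumed values).
Q = np.array([[4.0, 1.0],
              [1.0, 3.0]])
c = np.array([1.0, 2.0])

# The minimizer solves the linear system Q x = -c; calling solve() is
# preferable to forming Q^{-1} explicitly.
x_star = np.linalg.solve(Q, -c)

# At the minimizer the gradient Qx + c vanishes.
gradient = Q @ x_star + c
```

Minimizing the quadratic and solving the linear system are the same computation.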

110.5 Gradients

The gradient of a scalar-valued function

f : \mathbb{R}^n \to \mathbb{R}

is the vector

\nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1}(x) \\ \frac{\partial f}{\partial x_2}(x) \\ \vdots \\ \frac{\partial f}{\partial x_n}(x) \end{bmatrix}.

The gradient points in the direction of steepest increase. Therefore the vector

-\nabla f(x)

points in the direction of steepest local decrease.

Gradient-based methods build sequences

x_0, x_1, x_2, \ldots

that move toward a minimizer.

The most basic method is gradient descent:

x_{k+1} = x_k - \alpha_k \nabla f(x_k).

Here $\alpha_k > 0$ is the step size.
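A minimal gradient-descent sketch on a quadratic objective, assuming a fixed step size and hand-picked values for $Q$ and $c$:

```python
import numpy as np

# Illustrative quadratic f(x) = 1/2 x^T Q x + c^T x (assumed values).
Q = np.array([[3.0, 0.0],
              [0.0, 1.0]])
c = np.array([-3.0, 1.0])

def grad(x):
    # Gradient of the quadratic: Qx + c.
    return Q @ x + c

x = np.zeros(2)
alpha = 0.2                      # fixed step size alpha_k
for _ in range(200):
    x = x - alpha * grad(x)      # gradient descent update

x_star = np.linalg.solve(Q, -c)  # exact minimizer, for comparison
```

For a fixed step size to converge on a quadratic, $\alpha$ must be small relative to the largest eigenvalue of $Q$.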

110.6 Hessians

The Hessian matrix of a twice differentiable scalar function is the matrix of second partial derivatives:

\nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}.

The Hessian describes local curvature. When the second partial derivatives are continuous, the Hessian is symmetric.

The Hessian is central in second-order optimization methods.

At a critical point where

\nabla f(x^\ast) = 0,

the Hessian helps classify the point.

| Hessian condition     | Interpretation         |
| --------------------- | ---------------------- |
| Positive definite     | Strict local minimum   |
| Negative definite     | Strict local maximum   |
| Indefinite            | Saddle point           |
| Positive semidefinite | Possible local minimum |
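The classification above can be checked numerically from the Hessian's eigenvalues; a minimal sketch (the test matrices are invented):

```python
import numpy as np

def classify(H, tol=1e-10):
    # Classify a critical point from the eigenvalues of a symmetric Hessian H.
    eigs = np.linalg.eigvalsh(H)
    if np.all(eigs > tol):
        return "strict local minimum"
    if np.all(eigs < -tol):
        return "strict local maximum"
    if eigs.min() < -tol and eigs.max() > tol:
        return "saddle point"
    return "degenerate (semidefinite)"

# One positive and one negative curvature direction: a saddle.
kind = classify(np.array([[2.0, 0.0],
                          [0.0, -1.0]]))
```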

110.7 Taylor Approximation

Near a point xx, a smooth function can be approximated by its second-order Taylor expansion:

f(x+p) \approx f(x) + \nabla f(x)^T p + \frac{1}{2} p^T \nabla^2 f(x) p.

This expression is a quadratic model in the step vector $p$.

Optimization algorithms often choose pp by minimizing this local model.

If the Hessian is positive definite, the minimizer of the quadratic model satisfies

\nabla^2 f(x) p = -\nabla f(x).

This is a linear system.

Thus each Newton step requires solving a linear algebra problem.

110.8 Newton’s Method

Newton’s method uses first and second derivative information.

At iteration $k$, it solves

\nabla^2 f(x_k) p_k = -\nabla f(x_k)

for the step $p_k$, then updates

x_{k+1} = x_k + p_k.

If the Hessian is positive definite near the solution, Newton’s method can converge rapidly. In many settings, Newton’s method has quadratic local convergence, while basic gradient descent often has only linear convergence.

The computational cost lies in forming or approximating the Hessian and solving the linear system.

Linear algebra determines both cost and stability.
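A minimal Newton iteration, sketched on an invented smooth test function $f(x) = x_1^2 + x_2^2 + e^{x_1}$:

```python
import numpy as np

# Gradient and Hessian of f(x) = x_1^2 + x_2^2 + exp(x_1) (assumed example).
def grad(x):
    return np.array([2 * x[0] + np.exp(x[0]), 2 * x[1]])

def hess(x):
    return np.array([[2 + np.exp(x[0]), 0.0],
                     [0.0,              2.0]])

x = np.array([1.0, 1.0])
for _ in range(20):
    # Each Newton step solves the linear system H p = -g.
    p = np.linalg.solve(hess(x), -grad(x))
    x = x + p
```

The per-iteration work is dominated by forming the Hessian and solving the linear system, exactly as the text describes.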

110.9 Least Squares as Optimization

Least squares is one of the most important optimization problems in linear algebra.

Given

Ax \approx b,

the least squares problem is

\min_x \|Ax - b\|^2.

The objective can be expanded as

\|Ax - b\|^2 = x^T A^T A x - 2 b^T A x + b^T b.

Thus least squares is a quadratic optimization problem.

The gradient is

\nabla f(x) = 2 A^T (Ax - b).

Setting the gradient equal to zero gives

A^T A x = A^T b.

These are the normal equations. Least squares is therefore both a projection problem and an optimization problem.
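A small sketch comparing the normal equations with NumPy's built-in least squares solver (the data are illustrative):

```python
import numpy as np

# Illustrative overdetermined system: fit a line through three points.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

# Normal equations: solve A^T A x = A^T b.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Library solver based on orthogonal factorizations.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
```

In finite precision the normal equations square the condition number of $A$, so QR- or SVD-based solvers such as `np.linalg.lstsq` are usually preferred when $A$ is ill-conditioned.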

110.10 Constrained Optimization

Many optimization problems include constraints.

A linear equality constraint has the form

Ax = b.

A linear inequality constraint has the form

Ax \leq b.

A constrained problem may be written as

\begin{aligned} \text{minimize } \quad & f(x) \\ \text{subject to } \quad & Ax = b. \end{aligned}

Linear algebra appears in the description of the feasible set. The equation $Ax = b$ describes an affine subspace when the system is consistent.

The null space of $A$ describes feasible directions. If $x_0$ is one feasible point, then every feasible point has the form

x = x_0 + z,

where

z \in \operatorname{Null}(A).

Thus constrained optimization can often be reduced to optimization over a subspace.
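A sketch of this reduction, using the SVD to construct a null-space basis (the constraint data are invented):

```python
import numpy as np

# One equality constraint in R^3: x_1 + x_2 + x_3 = 3 (illustrative).
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([3.0])

# One particular feasible point x0 (here the minimum-norm solution).
x0, *_ = np.linalg.lstsq(A, b, rcond=None)

# Null-space basis Z from the SVD: right singular vectors whose
# singular values are (numerically) zero.
_, s, Vt = np.linalg.svd(A)
rank = int(np.sum(s > 1e-12))
Z = Vt[rank:].T          # columns of Z span Null(A)

# Every point x0 + Z y remains feasible, for any y.
y = np.array([0.7, -1.3])
x = x0 + Z @ y
```

Optimization over the feasible set then becomes unconstrained optimization over the lower-dimensional variable $y$.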

110.11 Lagrange Multipliers

Consider the equality-constrained problem

\begin{aligned} \text{minimize } \quad & f(x) \\ \text{subject to } \quad & Ax = b. \end{aligned}

The Lagrangian is

L(x, \lambda) = f(x) + \lambda^T (Ax - b).

Here $\lambda$ is the vector of Lagrange multipliers.

At a constrained optimum, under suitable regularity conditions,

\nabla_x L(x, \lambda) = 0

and

Ax = b.

For a quadratic objective

f(x) = \frac{1}{2} x^T Q x + c^T x,

the conditions become

Qx + c + A^T \lambda = 0, \qquad Ax = b.

These equations can be written as the block linear system

\begin{bmatrix} Q & A^T \\ A & 0 \end{bmatrix} \begin{bmatrix} x \\ \lambda \end{bmatrix} = \begin{bmatrix} -c \\ b \end{bmatrix}.

This matrix is called a KKT matrix. It is fundamental in constrained optimization.
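Assembling and solving the KKT system can be sketched directly (the problem data are illustrative):

```python
import numpy as np

# Illustrative equality-constrained quadratic program:
# minimize 1/2 x^T Q x + c^T x  subject to  x_1 + x_2 = 1.
Q = np.array([[2.0, 0.0],
              [0.0, 2.0]])
c = np.array([-2.0, -4.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

n, m = Q.shape[0], A.shape[0]

# Block KKT matrix [[Q, A^T], [A, 0]] and right-hand side [-c; b].
KKT = np.block([[Q, A.T],
                [A, np.zeros((m, m))]])
rhs = np.concatenate([-c, b])

sol = np.linalg.solve(KKT, rhs)
x, lam = sol[:n], sol[n:]
```

One linear solve returns both the optimal point and the multipliers.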

110.12 Convexity

A function $f$ is convex if for all $x, y$ and all $t \in [0,1]$,

f(tx + (1-t)y) \leq t f(x) + (1-t) f(y).

Convexity means that the function has no false local minima: every local minimum is also a global minimum.

For a twice differentiable function, convexity is equivalent to the Hessian condition

\nabla^2 f(x) \succeq 0

for all $x$ in the domain.

Here $\succeq 0$ means positive semidefinite.

Convex optimization is important because local optimality implies global optimality under standard convex assumptions. Linear algebra enters through gradients, Hessians, affine constraints, cones, projections, and matrix inequalities.

110.13 Projection Methods

Projection methods use linear algebra to enforce constraints.

Given a closed subspace $S$, the projection of a vector $x$ onto $S$ is the closest point in $S$.

If $S$ is the column space of a matrix $A$ with full column rank, the projection matrix is

P = A (A^T A)^{-1} A^T.

Projection methods appear in constrained optimization, least squares, alternating projections, projected gradient descent, and signal reconstruction.

Projected gradient descent has the form

x_{k+1} = P_C(x_k - \alpha_k \nabla f(x_k)),

where $P_C$ denotes projection onto the feasible set $C$.

When $C$ is a subspace or an affine set, this projection is a linear algebra operation.
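A sketch of projected gradient descent on an affine feasible set, assuming $A$ has full row rank so that $AA^T$ is invertible (problem data invented):

```python
import numpy as np

# Feasible set C = {x : x_1 + x_2 = 2}; objective f(x) = 1/2 ||x||^2.
A = np.array([[1.0, 1.0]])
b = np.array([2.0])

def project(x):
    # Orthogonal projection onto {x : Ax = b}, valid for full row rank A.
    return x - A.T @ np.linalg.solve(A @ A.T, A @ x - b)

def grad(x):
    # Gradient of 1/2 ||x||^2 is x itself.
    return x

x = project(np.array([5.0, -1.0]))   # start from a feasible point
for _ in range(100):
    x = project(x - 0.5 * grad(x))   # gradient step, then project back
```

The iterates stay feasible, and the method converges to the point of $C$ closest to the origin.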

110.14 Quadratic Programming

A quadratic program has the form

\begin{aligned} \text{minimize } \quad & \frac{1}{2} x^T Q x + c^T x \\ \text{subject to } \quad & Ax \leq b. \end{aligned}

If $Q$ is positive semidefinite, the problem is convex.

Quadratic programs appear in support vector machines, portfolio optimization, model predictive control, structural mechanics, and constrained least squares.

Solving quadratic programs usually involves repeated solution of linear systems, factorization of KKT matrices, or iterative methods.

110.15 Eigenvalues and Conditioning

Eigenvalues describe the geometry of quadratic objectives.

For

f(x) = \frac{1}{2} x^T Q x,

where $Q$ is symmetric positive definite, the level sets are ellipsoids.

The eigenvectors of $Q$ give the principal axes. The eigenvalues determine curvature along those axes.

The condition number

\kappa(Q) = \frac{\lambda_{\max}(Q)}{\lambda_{\min}(Q)}

measures how elongated the level sets are.

A large condition number makes optimization harder. Gradient descent may move slowly because the function is steep in some directions and flat in others.

This explains why preconditioning is important.
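A quick sketch of the eigenvalue-ratio definition (illustrative matrix):

```python
import numpy as np

# Illustrative SPD matrix with curvature 100 along one axis, 1 along the other.
Q = np.array([[100.0, 0.0],
              [0.0,   1.0]])

eigs = np.linalg.eigvalsh(Q)         # eigenvalues in ascending order
kappa = eigs.max() / eigs.min()      # condition number lambda_max / lambda_min
```

For symmetric positive definite matrices this ratio agrees with the 2-norm condition number `np.linalg.cond(Q, 2)`.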

110.16 Preconditioning

Preconditioning changes the geometry of an optimization problem.

Instead of moving by

-\nabla f(x),

one may move by

-M^{-1} \nabla f(x),

where $M$ approximates the Hessian or another curvature matrix.

The goal is to transform an ill-conditioned problem into a better-conditioned one.

For linear systems and quadratic optimization, preconditioning often determines whether an iterative method is practical.

A good preconditioner should be cheap to apply and close enough to the original curvature to improve convergence.
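A sketch of a preconditioned gradient step with the Jacobi-style choice $M = \operatorname{diag}(Q)$, one common cheap preconditioner (the data are illustrative):

```python
import numpy as np

# Ill-conditioned quadratic f(x) = 1/2 x^T Q x - b^T x (assumed values).
Q = np.array([[100.0, 1.0],
              [1.0,   2.0]])
b = np.array([1.0, 1.0])

# Jacobi preconditioner: M = diag(Q), so M^{-1} is trivial to apply.
M_inv = np.diag(1.0 / np.diag(Q))

x = np.zeros(2)
for _ in range(500):
    g = Q @ x - b
    x = x - 0.5 * M_inv @ g       # preconditioned step -M^{-1} grad f

x_star = np.linalg.solve(Q, b)    # exact minimizer, for comparison
```

Scaling each coordinate by its own curvature lets one fixed step size work in all directions.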

110.17 Conjugate Gradient Method

The conjugate gradient method solves systems

Qx = b

where $Q$ is symmetric positive definite.

It can also be interpreted as minimizing the quadratic function

f(x) = \frac{1}{2} x^T Q x - b^T x.

Conjugate gradient uses search directions that are conjugate with respect to $Q$. This means

p_i^T Q p_j = 0

when $i \neq j$.

Unlike ordinary gradient descent, conjugate gradient accounts for the geometry induced by $Q$. It is especially useful for large sparse systems where forming a factorization would be too expensive.
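A textbook conjugate gradient loop can be sketched as follows (a minimal dense-matrix version for illustration; practical implementations exploit sparsity):

```python
import numpy as np

def conjugate_gradient(Q, b, tol=1e-10):
    # Solve Qx = b for symmetric positive definite Q.
    x = np.zeros_like(b)
    r = b - Q @ x                # residual, equal to -grad of the quadratic
    p = r.copy()                 # first search direction
    rs = r @ r
    for _ in range(len(b)):      # at most n steps in exact arithmetic
        Qp = Q @ p
        alpha = rs / (p @ Qp)    # exact line search along p
        x = x + alpha * p
        r = r - alpha * Qp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p   # next direction, Q-conjugate to the old
        rs = rs_new
    return x

# Illustrative SPD system.
Q = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(Q, b)
```

Note that each iteration touches $Q$ only through the product $Qp$, which is why the method suits large sparse matrices.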

110.18 Linear Algebra in Modern Optimization

Modern optimization algorithms rely heavily on linear algebra.

| Optimization idea    | Linear algebra object            |
| -------------------- | -------------------------------- |
| Variables            | Vectors                          |
| Linear constraints   | Matrix equations                 |
| Quadratic objectives | Quadratic forms                  |
| Curvature            | Hessians                         |
| Newton steps         | Linear systems                   |
| Least squares        | Projections                      |
| Regularization       | Norms and penalties              |
| Conditioning         | Singular values and eigenvalues  |
| Large-scale methods  | Sparse matrices                  |
| Preconditioning      | Approximate inverses             |
| Low-rank methods     | Matrix decompositions            |

Even nonlinear optimization uses linear algebra locally. At each iteration, the nonlinear problem is approximated by a linear or quadratic problem.

110.19 Summary

Optimization and linear algebra are tightly connected.

Linear algebra supplies the language of vectors, matrices, subspaces, projections, eigenvalues, and decompositions. Optimization uses that language to choose the best vector under an objective and possible constraints.

The most important bridge is the quadratic model. Least squares, Newton methods, quadratic programming, and conjugate gradient methods all reduce optimization to structured linear algebra.

In practice, the success of an optimization method often depends on matrix structure: rank, sparsity, positive definiteness, conditioning, and the availability of stable factorizations.