
Chapter 110. Optimization and Linear Algebra

Optimization studies how to choose variables so that a quantity becomes as small or as large as possible. Linear algebra enters optimization because many objective functions, constraints, approximations, and algorithms are naturally written with vectors and matrices.

A standard optimization problem has the form

\text{minimize } f(x)

where $x \in \mathbb{R}^n$ is the vector of decision variables and $f : \mathbb{R}^n \to \mathbb{R}$ is the objective function. With constraints, the problem becomes

\begin{aligned} \text{minimize } \quad & f_0(x) \\ \text{subject to } \quad & f_i(x) \leq b_i, \qquad i = 1, \ldots, m. \end{aligned}

This is the usual mathematical form of a constrained optimization problem. Linear algebra provides the notation and the computational machinery for expressing such problems, especially least squares, linear programming, quadratic programming, Newton methods, and constrained convex optimization.

110.1 Decision Variables

The unknown in an optimization problem is usually a vector

x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}.

Each component $x_i$ represents a quantity to be chosen.

For example, in portfolio optimization, the entries of $x$ may represent asset weights. In machine learning, they may represent model parameters. In engineering design, they may represent physical dimensions. In numerical linear algebra, they may represent an approximate solution to a system.

The objective function assigns a scalar value to each vector $x$. The optimizer searches for a vector that gives the best value.

110.2 Linear Objective Functions

The simplest objective function is linear:

f(x) = c^T x.

Here $c$ is a fixed vector of coefficients.

Expanding,

c^T x = c_1 x_1 + c_2 x_2 + \cdots + c_n x_n.

A linear objective has constant slope. Its level sets are hyperplanes.

Linear objectives appear in linear programming, resource allocation, transportation problems, scheduling, and economic planning.

A linear optimization problem with linear constraints has the form

\begin{aligned} \text{minimize } \quad & c^T x \\ \text{subject to } \quad & Ax \leq b. \end{aligned}

Here the feasible set is the polyhedron described by the system of linear inequalities $Ax \leq b$.

110.3 Quadratic Objective Functions

The next important class is quadratic optimization.

A quadratic objective has the form

f(x) = \frac{1}{2} x^T Q x + c^T x + d.

Here $Q$ is an $n \times n$ matrix, $c$ is a vector, and $d$ is a scalar.

The matrix $Q$ controls curvature. If $Q$ is symmetric positive definite, the function has a unique global minimizer.

The gradient is

\nabla f(x) = Qx + c,

assuming $Q$ is symmetric.

The Hessian is

\nabla^2 f(x) = Q.

Thus quadratic optimization is directly governed by a matrix.

Quadratic functions are important because second-order Taylor approximation replaces a smooth nonlinear function near a point by a quadratic model. In optimization, this is the basis for Newton methods and many local algorithms.

110.4 Positive Definite Matrices

A symmetric matrix $Q$ is positive definite if

x^T Q x > 0

for every nonzero vector $x$.

Positive definiteness means that the quadratic form curves upward in every direction. In optimization, this corresponds to strict convexity for the quadratic part.

If $Q$ is positive definite, then

f(x) = \frac{1}{2} x^T Q x + c^T x + d

has a unique minimizer.

The minimizer satisfies

\nabla f(x) = 0.

Therefore

Qx + c = 0,

and

x^\ast = -Q^{-1} c.

This formula shows the direct relation between minimization and solving a linear system.
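As a numerical sketch of this relation (the matrix $Q$ and vector $c$ below are illustrative values, not taken from the text):

```python
import numpy as np

# Illustrative symmetric positive definite Q and vector c (assumed values).
Q = np.array([[4.0, 1.0],
              [1.0, 3.0]])
c = np.array([1.0, 2.0])

# The minimizer solves the linear system Q x = -c; calling solve() is
# preferable to forming Q^{-1} explicitly.
x_star = np.linalg.solve(Q, -c)

# At the minimizer the gradient Qx + c vanishes.
gradient = Q @ x_star + c
```

Minimizing the quadratic and solving the linear system are the same computation.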

110.5 Gradients

The gradient of a scalar-valued function

f : \mathbb{R}^n \to \mathbb{R}

is the vector

\nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1}(x) \\ \frac{\partial f}{\partial x_2}(x) \\ \vdots \\ \frac{\partial f}{\partial x_n}(x) \end{bmatrix}.

The gradient points in the direction of steepest increase. Therefore the vector

-\nabla f(x)

points in the direction of steepest local decrease.

Gradient-based methods build sequences

x_0, x_1, x_2, \ldots

that move toward a minimizer.

The most basic method is gradient descent:

x_{k+1} = x_k - \alpha_k \nabla f(x_k).

Here $\alpha_k > 0$ is the step size.
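A minimal gradient-descent sketch on a quadratic objective, assuming a fixed step size and hand-picked values for $Q$ and $c$:

```python
import numpy as np

# Illustrative quadratic f(x) = 1/2 x^T Q x + c^T x (assumed values).
Q = np.array([[3.0, 0.0],
              [0.0, 1.0]])
c = np.array([-3.0, 1.0])

def grad(x):
    # Gradient of the quadratic: Qx + c.
    return Q @ x + c

x = np.zeros(2)
alpha = 0.2                      # fixed step size alpha_k
for _ in range(200):
    x = x - alpha * grad(x)      # gradient descent update

x_star = np.linalg.solve(Q, -c)  # exact minimizer, for comparison
```

For a fixed step size to converge on a quadratic, $\alpha$ must be small relative to the largest eigenvalue of $Q$.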

110.6 Hessians

The Hessian matrix of a twice differentiable scalar function is the matrix of second partial derivatives:

\nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}.

The Hessian describes local curvature. When the second partial derivatives are continuous, the Hessian is symmetric.

The Hessian is central in second-order optimization methods.

At a critical point where

\nabla f(x^\ast) = 0,

the Hessian helps classify the point.

| Hessian condition     | Interpretation         |
| --------------------- | ---------------------- |
| Positive definite     | Strict local minimum   |
| Negative definite     | Strict local maximum   |
| Indefinite            | Saddle point           |
| Positive semidefinite | Possible local minimum |
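The classification above can be checked numerically from the Hessian's eigenvalues; a minimal sketch (the test matrices are invented):

```python
import numpy as np

def classify(H, tol=1e-10):
    # Classify a critical point from the eigenvalues of a symmetric Hessian H.
    eigs = np.linalg.eigvalsh(H)
    if np.all(eigs > tol):
        return "strict local minimum"
    if np.all(eigs < -tol):
        return "strict local maximum"
    if eigs.min() < -tol and eigs.max() > tol:
        return "saddle point"
    return "degenerate (semidefinite)"

# One positive and one negative curvature direction: a saddle.
kind = classify(np.array([[2.0, 0.0],
                          [0.0, -1.0]]))
```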

110.7 Taylor Approximation

Near a point xx, a smooth function can be approximated by its second-order Taylor expansion:

f(x+p) \approx f(x) + \nabla f(x)^T p + \frac{1}{2} p^T \nabla^2 f(x) p.

This expression is a quadratic model in the step vector $p$.

Optimization algorithms often choose pp by minimizing this local model.

If the Hessian is positive definite, the minimizer of the quadratic model satisfies

\nabla^2 f(x) p = -\nabla f(x).

This is a linear system.

Thus each Newton step requires solving a linear algebra problem.

110.8 Newton’s Method

Newton’s method uses first and second derivative information.

At iteration $k$, it solves

\nabla^2 f(x_k) p_k = -\nabla f(x_k)

for the step $p_k$, then updates

x_{k+1} = x_k + p_k.

If the Hessian is positive definite near the solution, Newton’s method can converge rapidly. In many settings, Newton’s method has quadratic local convergence, while basic gradient descent often has only linear convergence.

The computational cost lies in forming or approximating the Hessian and solving the linear system.

Linear algebra determines both cost and stability.
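A minimal Newton iteration, sketched on an invented smooth test function $f(x) = x_1^2 + x_2^2 + e^{x_1}$:

```python
import numpy as np

# Gradient and Hessian of f(x) = x_1^2 + x_2^2 + exp(x_1) (assumed example).
def grad(x):
    return np.array([2 * x[0] + np.exp(x[0]), 2 * x[1]])

def hess(x):
    return np.array([[2 + np.exp(x[0]), 0.0],
                     [0.0,              2.0]])

x = np.array([1.0, 1.0])
for _ in range(20):
    # Each Newton step solves the linear system H p = -g.
    p = np.linalg.solve(hess(x), -grad(x))
    x = x + p
```

The per-iteration work is dominated by forming the Hessian and solving the linear system, exactly as the text describes.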

110.9 Least Squares as Optimization

Least squares is one of the most important optimization problems in linear algebra.

Given

Ax \approx b,

the least squares problem is

\min_x \|Ax - b\|^2.

The objective can be expanded as

\|Ax - b\|^2 = x^T A^T A x - 2 b^T A x + b^T b.

Thus least squares is a quadratic optimization problem.

The gradient is

\nabla f(x) = 2 A^T (Ax - b).

Setting the gradient equal to zero gives

A^T A x = A^T b.

These are the normal equations. Least squares is therefore both a projection problem and an optimization problem.
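A small sketch comparing the normal equations with NumPy's built-in least squares solver (the data are illustrative):

```python
import numpy as np

# Illustrative overdetermined system: fit a line through three points.
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

# Normal equations: solve A^T A x = A^T b.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Library solver based on orthogonal factorizations.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
```

In finite precision the normal equations square the condition number of $A$, so QR- or SVD-based solvers such as `np.linalg.lstsq` are usually preferred when $A$ is ill-conditioned.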

110.10 Constrained Optimization

Many optimization problems include constraints.

A linear equality constraint has the form

Ax = b.

A linear inequality constraint has the form

Ax \leq b.

A constrained problem may be written as

\begin{aligned} \text{minimize } \quad & f(x) \\ \text{subject to } \quad & Ax = b. \end{aligned}

Linear algebra appears in the description of the feasible set. The equation $Ax = b$ describes an affine subspace when the system is consistent.

The null space of $A$ describes feasible directions. If $x_0$ is one feasible point, then every feasible point has the form

x = x_0 + z,

where

z \in \operatorname{Null}(A).

Thus constrained optimization can often be reduced to optimization over a subspace.
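A sketch of this reduction, using the SVD to construct a null-space basis (the constraint data are invented):

```python
import numpy as np

# One equality constraint in R^3: x_1 + x_2 + x_3 = 3 (illustrative).
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([3.0])

# One particular feasible point x0 (here the minimum-norm solution).
x0, *_ = np.linalg.lstsq(A, b, rcond=None)

# Null-space basis Z from the SVD: right singular vectors whose
# singular values are (numerically) zero.
_, s, Vt = np.linalg.svd(A)
rank = int(np.sum(s > 1e-12))
Z = Vt[rank:].T          # columns of Z span Null(A)

# Every point x0 + Z y remains feasible, for any y.
y = np.array([0.7, -1.3])
x = x0 + Z @ y
```

Optimization over the feasible set then becomes unconstrained optimization over the lower-dimensional variable $y$.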

110.11 Lagrange Multipliers

Consider the equality-constrained problem

\begin{aligned} \text{minimize } \quad & f(x) \\ \text{subject to } \quad & Ax = b. \end{aligned}

The Lagrangian is

L(x, \lambda) = f(x) + \lambda^T (Ax - b).

Here $\lambda$ is the vector of Lagrange multipliers.

At a constrained optimum, under suitable regularity conditions,

\nabla_x L(x, \lambda) = 0

and

Ax = b.

For a quadratic objective

f(x) = \frac{1}{2} x^T Q x + c^T x,

the conditions become

Qx + c + A^T \lambda = 0, \qquad Ax = b.

These equations can be written as the block linear system

\begin{bmatrix} Q & A^T \\ A & 0 \end{bmatrix} \begin{bmatrix} x \\ \lambda \end{bmatrix} = \begin{bmatrix} -c \\ b \end{bmatrix}.

This matrix is called a KKT matrix. It is fundamental in constrained optimization.
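Assembling and solving the KKT system can be sketched directly (the problem data are illustrative):

```python
import numpy as np

# Illustrative equality-constrained quadratic program:
# minimize 1/2 x^T Q x + c^T x  subject to  x_1 + x_2 = 1.
Q = np.array([[2.0, 0.0],
              [0.0, 2.0]])
c = np.array([-2.0, -4.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

n, m = Q.shape[0], A.shape[0]

# Block KKT matrix [[Q, A^T], [A, 0]] and right-hand side [-c; b].
KKT = np.block([[Q, A.T],
                [A, np.zeros((m, m))]])
rhs = np.concatenate([-c, b])

sol = np.linalg.solve(KKT, rhs)
x, lam = sol[:n], sol[n:]
```

One linear solve returns both the optimal point and the multipliers.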

110.12 Convexity

A function $f$ is convex if for all $x, y$ and all $t \in [0,1]$,

f(tx + (1-t)y) \leq t f(x) + (1-t) f(y).

Convexity means that the function has no false local minima: every local minimum is also a global minimum.

For a twice differentiable function, convexity is equivalent to the Hessian condition

\nabla^2 f(x) \succeq 0

for all $x$ in the domain.

Here $\succeq 0$ means positive semidefinite.

Convex optimization is important because local optimality implies global optimality under standard convex assumptions. Linear algebra enters through gradients, Hessians, affine constraints, cones, projections, and matrix inequalities.

110.13 Projection Methods

Projection methods use linear algebra to enforce constraints.

Given a closed subspace $S$, the projection of a vector $x$ onto $S$ is the closest point in $S$.

If $S$ is the column space of a matrix $A$ with full column rank, the projection matrix is

P = A (A^T A)^{-1} A^T.

Projection methods appear in constrained optimization, least squares, alternating projections, projected gradient descent, and signal reconstruction.

Projected gradient descent has the form

x_{k+1} = P_C(x_k - \alpha_k \nabla f(x_k)),

where $P_C$ denotes projection onto the feasible set $C$.

When $C$ is a subspace or an affine set, this projection is a linear algebra operation.
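A sketch of projected gradient descent on an affine feasible set, assuming $A$ has full row rank so that $AA^T$ is invertible (problem data invented):

```python
import numpy as np

# Feasible set C = {x : x_1 + x_2 = 2}; objective f(x) = 1/2 ||x||^2.
A = np.array([[1.0, 1.0]])
b = np.array([2.0])

def project(x):
    # Orthogonal projection onto {x : Ax = b}, valid for full row rank A.
    return x - A.T @ np.linalg.solve(A @ A.T, A @ x - b)

def grad(x):
    # Gradient of 1/2 ||x||^2 is x itself.
    return x

x = project(np.array([5.0, -1.0]))   # start from a feasible point
for _ in range(100):
    x = project(x - 0.5 * grad(x))   # gradient step, then project back
```

The iterates stay feasible, and the method converges to the point of $C$ closest to the origin.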

110.14 Quadratic Programming

A quadratic program has the form

\begin{aligned} \text{minimize } \quad & \frac{1}{2} x^T Q x + c^T x \\ \text{subject to } \quad & Ax \leq b. \end{aligned}

If $Q$ is positive semidefinite, the problem is convex.

Quadratic programs appear in support vector machines, portfolio optimization, model predictive control, structural mechanics, and constrained least squares.

Solving quadratic programs usually involves repeated solution of linear systems, factorization of KKT matrices, or iterative methods.

110.15 Eigenvalues and Conditioning

Eigenvalues describe the geometry of quadratic objectives.

For

f(x) = \frac{1}{2} x^T Q x,

where $Q$ is symmetric positive definite, the level sets are ellipsoids.

The eigenvectors of $Q$ give the principal axes. The eigenvalues determine curvature along those axes.

The condition number

\kappa(Q) = \frac{\lambda_{\max}(Q)}{\lambda_{\min}(Q)}

measures how elongated the level sets are.

A large condition number makes optimization harder. Gradient descent may move slowly because the function is steep in some directions and flat in others.

This explains why preconditioning is important.
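A quick sketch of the eigenvalue-ratio definition (illustrative matrix):

```python
import numpy as np

# Illustrative SPD matrix with curvature 100 along one axis, 1 along the other.
Q = np.array([[100.0, 0.0],
              [0.0,   1.0]])

eigs = np.linalg.eigvalsh(Q)         # eigenvalues in ascending order
kappa = eigs.max() / eigs.min()      # condition number lambda_max / lambda_min
```

For symmetric positive definite matrices this ratio agrees with the 2-norm condition number `np.linalg.cond(Q, 2)`.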

110.16 Preconditioning

Preconditioning changes the geometry of an optimization problem.

Instead of moving by

-\nabla f(x),

one may move by

-M^{-1} \nabla f(x),

where $M$ approximates the Hessian or another curvature matrix.

The goal is to transform an ill-conditioned problem into a better-conditioned one.

For linear systems and quadratic optimization, preconditioning often determines whether an iterative method is practical.

A good preconditioner should be cheap to apply and close enough to the original curvature to improve convergence.
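A sketch of a preconditioned gradient step with the Jacobi-style choice $M = \operatorname{diag}(Q)$, one common cheap preconditioner (the data are illustrative):

```python
import numpy as np

# Ill-conditioned quadratic f(x) = 1/2 x^T Q x - b^T x (assumed values).
Q = np.array([[100.0, 1.0],
              [1.0,   2.0]])
b = np.array([1.0, 1.0])

# Jacobi preconditioner: M = diag(Q), so M^{-1} is trivial to apply.
M_inv = np.diag(1.0 / np.diag(Q))

x = np.zeros(2)
for _ in range(500):
    g = Q @ x - b
    x = x - 0.5 * M_inv @ g       # preconditioned step -M^{-1} grad f

x_star = np.linalg.solve(Q, b)    # exact minimizer, for comparison
```

Scaling each coordinate by its own curvature lets one fixed step size work in all directions.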

110.17 Conjugate Gradient Method

The conjugate gradient method solves systems

Qx = b

where $Q$ is symmetric positive definite.

It can also be interpreted as minimizing the quadratic function

f(x) = \frac{1}{2} x^T Q x - b^T x.

Conjugate gradient uses search directions that are conjugate with respect to $Q$. This means

p_i^T Q p_j = 0

when $i \neq j$.

Unlike ordinary gradient descent, conjugate gradient accounts for the geometry induced by $Q$. It is especially useful for large sparse systems where forming a factorization would be too expensive.
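A textbook conjugate gradient loop can be sketched as follows (a minimal dense-matrix version for illustration; practical implementations exploit sparsity):

```python
import numpy as np

def conjugate_gradient(Q, b, tol=1e-10):
    # Solve Qx = b for symmetric positive definite Q.
    x = np.zeros_like(b)
    r = b - Q @ x                # residual, equal to -grad of the quadratic
    p = r.copy()                 # first search direction
    rs = r @ r
    for _ in range(len(b)):      # at most n steps in exact arithmetic
        Qp = Q @ p
        alpha = rs / (p @ Qp)    # exact line search along p
        x = x + alpha * p
        r = r - alpha * Qp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p   # next direction, Q-conjugate to the old
        rs = rs_new
    return x

# Illustrative SPD system.
Q = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(Q, b)
```

Note that each iteration touches $Q$ only through the product $Qp$, which is why the method suits large sparse matrices.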

110.18 Linear Algebra in Modern Optimization

Modern optimization algorithms rely heavily on linear algebra.

| Optimization idea    | Linear algebra object            |
| -------------------- | -------------------------------- |
| Variables            | Vectors                          |
| Linear constraints   | Matrix equations                 |
| Quadratic objectives | Quadratic forms                  |
| Curvature            | Hessians                         |
| Newton steps         | Linear systems                   |
| Least squares        | Projections                      |
| Regularization       | Norms and penalties              |
| Conditioning         | Singular values and eigenvalues  |
| Large-scale methods  | Sparse matrices                  |
| Preconditioning      | Approximate inverses             |
| Low-rank methods     | Matrix decompositions            |

Even nonlinear optimization uses linear algebra locally. At each iteration, the nonlinear problem is approximated by a linear or quadratic problem.

110.19 Summary

Optimization and linear algebra are tightly connected.

Linear algebra supplies the language of vectors, matrices, subspaces, projections, eigenvalues, and decompositions. Optimization uses that language to choose the best vector under an objective and possible constraints.

The most important bridge is the quadratic model. Least squares, Newton methods, quadratic programming, and conjugate gradient methods all reduce optimization to structured linear algebra.

In practice, the success of an optimization method often depends on matrix structure: rank, sparsity, positive definiteness, conditioning, and the availability of stable factorizations.