# Limits of Linear Models

Linear models are the first useful class of predictive models on the path to deep learning. They introduce weighted sums, biases, logits, losses, gradients, and optimization, and they are the simplest examples of models trained by minimizing an objective function.

Their limits explain why deep networks are needed.

A linear model has the form

$$
f(x) = w^\top x + b.
$$

For regression, this value is used directly. For binary classification, it is passed through a sigmoid or sign function. For multiclass classification, several linear scores are passed through softmax.

In every case, the model depends on a linear score. The model can only use the features it is given, and it combines them additively.
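The three uses of the same linear score can be sketched in NumPy. The weights and inputs below are illustrative, not learned:

```python
import numpy as np

def linear_score(w, x, b):
    """The linear score w^T x + b shared by all three uses."""
    return w @ x + b

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -1.0, 0.25])

# Regression: use the score directly.
y_reg = linear_score(w, x, 0.1)

# Binary classification: squash the score into a probability.
p_pos = sigmoid(linear_score(w, x, 0.1))

# Multiclass: one score per class, normalized with softmax.
W = np.stack([w, -w, 2 * w])  # three classes, illustrative weights
b = np.zeros(3)
probs = softmax(W @ x + b)
```

In all three cases the model's only access to the input is through one or more linear scores.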

### Additive Structure

The linear score is

$$
w^\top x = w_1x_1 + w_2x_2 + \cdots + w_dx_d.
$$

Each feature contributes independently to the final score. This means the model assumes that feature effects can be added together.

This works well for many problems. If house price increases roughly with size, number of rooms, and location score, a linear model may be a strong baseline.

But many real problems depend on interactions. The effect of one feature may depend on another. For example, the value of a keyword in a sentence depends on context. The meaning of a pixel depends on nearby pixels. The risk of a medical measurement may depend on age, history, and other measurements.

A plain linear model cannot learn these interactions unless they are manually added as features.

### Manual Feature Engineering

Before deep learning became dominant, much of applied machine learning depended on feature engineering. A practitioner designed transformations such as

$$
x_1^2,\quad x_1x_2,\quad \log(x_3),\quad \mathbf{1}[x_4 > c].
$$

A linear model was then trained on the transformed feature vector:

$$
\phi(x) =
\begin{bmatrix}
x_1 \\
x_2 \\
x_1^2 \\
x_1x_2 \\
\log(x_3)
\end{bmatrix}.
$$

The model remained linear in \(\phi(x)\):

$$
f(x) = w^\top \phi(x) + b.
$$

But it became nonlinear in the original input \(x\).

This idea is powerful, but it shifts the burden to the feature designer. The model can only use nonlinearities that have been explicitly supplied. If an important feature interaction is missing, the model may fail.
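A minimal sketch of this pattern, using the hand-crafted feature map from the display above (the weights are arbitrary illustrative values):

```python
import numpy as np

def phi(x):
    """Hand-crafted feature map: linear, square, interaction, and log terms."""
    x1, x2, x3 = x
    return np.array([x1, x2, x1**2, x1 * x2, np.log(x3)])

# The model is linear in phi(x) but nonlinear in the original x.
w = np.array([1.0, -0.5, 0.3, 2.0, 0.1])
b = 0.2

def f(x):
    return w @ phi(x) + b

x = np.array([2.0, 3.0, 1.0])
score = f(x)
```

The term `x1 * x2` is exactly the kind of interaction a plain linear model on \(x\) cannot express; here it only exists because the designer put it into \(\phi\).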

Deep learning replaces much of this manual feature design with learned representations.

### Poor Fit to Raw Data

Linear models often perform poorly on raw sensory data.

An image is a tensor of pixel values. A linear classifier on raw pixels assigns one weight to each pixel and sums the result. It has no built-in notion of edges, textures, parts, objects, translation, or spatial hierarchy.

An audio waveform is a sequence of amplitudes. A linear model on raw samples has no natural representation of frequency, phonemes, rhythm, or speaker identity.

A text sequence is a sequence of tokens. A bag-of-words linear model can count words, but it loses order and syntax. A linear model over token positions is sensitive to exact placement and cannot easily represent paraphrase.

In these domains, the useful structure is compositional. Small local patterns combine into larger patterns. Linear models do not naturally represent this hierarchy.

### Fixed Representation

A linear model does not learn a representation. It learns weights over an existing representation.

If the input features are good, linear models can be excellent. If the input features are weak, linear models have limited room to improve.

This distinction is central:

$$
\text{linear model}:
\quad
x \mapsto w^\top x + b
$$

$$
\text{deep model}:
\quad
x \mapsto h_1 \mapsto h_2 \mapsto \cdots \mapsto h_L \mapsto y.
$$

A deep network learns intermediate representations \(h_1,\ldots,h_L\). These representations can become increasingly task-specific.

For example, in a vision model, early layers may detect edges, middle layers may detect parts, and later layers may represent object categories. In a language model, lower layers may encode local syntax, while higher layers may encode semantic or task-level information.

### Limited Decision Geometry

For binary classification, a linear model produces one hyperplane. The input space is divided into two half-spaces.

This geometry is too simple for many tasks. Classes may appear in multiple separated regions. Boundaries may be curved. Labels may depend on combinations of features.

For multiclass classification, softmax regression creates several linear boundaries. The boundary between class \(a\) and class \(b\) is

$$
(w_a - w_b)^\top x + (b_a - b_b) = 0.
$$

This still gives linear boundaries in the input space.

Deep networks make the boundary nonlinear in the input by first transforming the input into learned features:

$$
h = \phi_\theta(x),
\quad
z = Wh + b.
$$

The final layer is linear in \(h\), but the whole model is nonlinear in \(x\).
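The pairwise-boundary equation can be checked numerically. With illustrative score vectors for classes \(a\) and \(b\), any point satisfying \((w_a - w_b)^\top x + (b_a - b_b) = 0\) receives equal scores from both classes:

```python
import numpy as np

# Illustrative class-score parameters for a softmax classifier.
w_a, b_a = np.array([1.0, 2.0]), 0.5
w_b, b_b = np.array([3.0, -1.0]), -0.5

# The boundary satisfies (w_a - w_b)^T x + (b_a - b_b) = 0.
# Pick x2 and solve for x1 to get a point on the boundary.
dw, db = w_a - w_b, b_a - b_b
x2 = 1.0
x1 = -(dw[1] * x2 + db) / dw[0]
x = np.array([x1, x2])

score_a = w_a @ x + b_a
score_b = w_b @ x + b_b
# On the boundary the two class scores coincide, so softmax
# assigns the two classes equal probability at this point.
```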

### The XOR Failure

The XOR problem is the smallest clean example of the limitation.

| \(x_1\) | \(x_2\) | Label |
|---:|---:|---:|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

No line can separate the positive examples from the negative examples.

A linear model fails because the label depends on an interaction:

$$
y = 1
\quad
\text{when exactly one of } x_1,x_2 \text{ is active}.
$$

A small neural network can solve XOR by creating hidden features. One hidden unit can detect one region. Another hidden unit can detect another region. The output layer combines them.

This example is simple, but it captures the main reason hidden layers matter.
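The hidden-feature construction can be written out by hand. The sketch below wires a two-hidden-unit network whose units act as OR and AND detectors; XOR is "OR but not AND" (the thresholds are one standard choice, not the only one):

```python
def step(z):
    """Hard threshold activation."""
    return 1.0 if z > 0 else 0.0

def xor_net(x1, x2):
    """Hand-wired 2-hidden-unit network: OR minus AND gives XOR."""
    h1 = step(x1 + x2 - 0.5)    # fires when at least one input is active (OR)
    h2 = step(x1 + x2 - 1.5)    # fires only when both are active (AND)
    return step(h1 - h2 - 0.5)  # fires when OR holds but AND does not

outputs = [xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

Each hidden unit is itself linear-plus-threshold; it is their combination in a second layer that produces the nonlinear decision region.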

### Lack of Hierarchical Composition

Deep learning is effective partly because many real-world signals are hierarchical.

In images:

$$
\text{pixels} \to \text{edges} \to \text{textures} \to \text{parts} \to \text{objects}.
$$

In language:

$$
\text{characters} \to \text{subwords} \to \text{words} \to \text{phrases} \to \text{meaning}.
$$

In speech:

$$
\text{waveform} \to \text{frequency patterns} \to \text{phonemes} \to \text{words}.
$$

A single linear transformation has no depth. It cannot build features from features. It maps input directly to output.

A deep network composes functions:

$$
f(x) =
f_L(f_{L-1}(\cdots f_1(x))).
$$

Each layer can reuse and transform earlier features. This compositional structure lets deep models represent some functions far more efficiently than shallow or linear models.

### Linear Models and Invariance

Many tasks require invariance. An image classifier should recognize an object even if it shifts slightly. A speech recognizer should recognize a word across speakers. A text classifier should recognize similar meaning across different phrasings.

A plain linear model has no built-in invariance. If an object moves to a different part of an image, the active pixels change. A linear model treats those pixel positions as different features.

Architectures such as convolutional networks introduce useful inductive biases. A convolutional layer shares weights across spatial positions, making it natural to detect the same pattern in different locations.

Transformers introduce another kind of structure: token interactions through attention. Graph neural networks introduce message passing over graph structure. These architectures go beyond linear modeling by encoding assumptions about the data.
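Weight sharing can be illustrated with a tiny 1-D cross-correlation. The same two-weight "edge" kernel slides across the signal, so the same step pattern is detected at any position, with the response simply shifted (the kernel and signals are illustrative):

```python
import numpy as np

# A 1-D edge detector whose weights are shared across all positions.
kernel = np.array([-1.0, 1.0])

def correlate(signal, kernel):
    """Valid cross-correlation: slide the same weights across the signal."""
    n = len(signal) - len(kernel) + 1
    return np.array([signal[i:i + len(kernel)] @ kernel for i in range(n)])

# The same step pattern at two different positions...
a = np.array([0.0, 0.0, 1.0, 1.0, 1.0])
b = np.array([0.0, 0.0, 0.0, 1.0, 1.0])

# ...produces the same peak response, just shifted (translation equivariance).
ra = correlate(a, kernel)
rb = correlate(b, kernel)
```

A plain linear model would need to learn a separate weight for every position where the edge might appear; the shared kernel learns it once.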

### Data Efficiency

Linear models are often data-efficient when the representation is strong. They have relatively few parameters and simple objectives. They can generalize well from limited data.

Deep models usually need more data because they learn both representation and prediction. But when enough data is available, deep models can learn patterns that linear models cannot express.

This tradeoff is practical:

| Situation | Linear model | Deep model |
|---|---|---|
| Small data, good features | Often strong | May overfit |
| Large data, raw inputs | Usually weak | Often strong |
| Need interpretability | Strong | Harder |
| Need learned representations | Weak | Strong |
| Complex perceptual task | Weak | Strong |

Linear models remain valuable baselines. A deep model should justify its extra complexity.

### Optimization and Stability

Linear models are usually easier to optimize than deep networks. Linear regression with squared loss has a convex objective. Logistic regression with cross-entropy also has a convex objective.

Convexity means that any local minimum is global. Optimization is stable and well understood.

Deep networks have nonconvex objectives. They contain many layers, many parameters, and many symmetries. Optimization is harder to analyze, although in practice gradient-based methods often work well at large scale.

Thus deep learning gains expressive power at the cost of more complex training behavior.

### Interpretability

Linear models are easier to interpret because each weight corresponds directly to an input feature.

If features are standardized, the magnitude and sign of a weight give useful information about how the feature affects the prediction.

Deep networks distribute information across many parameters and layers. A single weight rarely has a simple meaning. Interpretability requires additional tools such as saliency maps, feature attribution, probing, activation analysis, or mechanistic interpretability.

This does not mean linear models are always transparent. Correlated features, feature scaling, and dataset bias can still make interpretation difficult. But compared with deep networks, their structure is simpler.

### Why Study Linear Models First

Linear models are worth studying carefully because they contain the core training pattern used throughout deep learning:

$$
\text{prediction}
\to
\text{loss}
\to
\text{gradient}
\to
\text{parameter update}.
$$

They also introduce logits, cross-entropy, regularization, margins, separability, and evaluation.
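The prediction → loss → gradient → update pattern can be sketched as a minimal logistic-regression training loop on synthetic, linearly separable data (the data, learning rate, and step count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linearly separable data: label is 1 when x1 + x2 > 0.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # prediction (sigmoid of the score)
    grad_w = X.T @ (p - y) / len(y)          # gradient of mean cross-entropy wrt w
    grad_b = (p - y).mean()                  # gradient wrt b
    w -= lr * grad_w                         # parameter update
    b -= lr * grad_b                         # (plain gradient descent)

p_final = 1.0 / (1.0 + np.exp(-(X @ w + b)))
acc = ((p_final > 0.5) == (y == 1)).mean()
```

The exact same loop structure, with the prediction and gradient computed through many layers, is what trains a deep network.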

Deep networks generalize these ideas. A multilayer perceptron replaces the single affine map with a composition of affine maps and nonlinear activations:

$$
h_1 = \phi(W_1x + b_1),
$$

$$
h_2 = \phi(W_2h_1 + b_2),
$$

$$
z = W_3h_2 + b_3.
$$

The last layer may still be linear. The difference is that the input to that layer is a learned representation, not the raw feature vector.
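The three equations above translate directly into a forward pass. The sketch below uses ReLU for \(\phi\) and random parameters purely for shape-checking; nothing here is trained:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
d, h, k = 4, 8, 3  # input dim, hidden width, number of classes

# Random parameters for illustration only (untrained).
W1, b1 = rng.normal(size=(h, d)), np.zeros(h)
W2, b2 = rng.normal(size=(h, h)), np.zeros(h)
W3, b3 = rng.normal(size=(k, h)), np.zeros(k)

def mlp(x):
    h1 = relu(W1 @ x + b1)   # first learned representation
    h2 = relu(W2 @ h1 + b2)  # second learned representation
    return W3 @ h2 + b3      # final layer: linear in h2, nonlinear in x

x = rng.normal(size=d)
z = mlp(x)  # one logit per class
```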

### Summary

Linear models are efficient, interpretable, and mathematically clean. They are strong baselines and remain useful as classifier heads on top of learned representations.

Their weakness is expressivity. They cannot directly learn nonlinear interactions, hierarchical features, invariances, or complex decision regions from raw input. Deep learning addresses these limits by learning representations through stacked nonlinear transformations.

