# Linear Separability

Linear separability describes when a classification dataset can be divided perfectly by a linear decision boundary. It is one of the central geometric ideas behind linear classification.

For binary classification, each example has an input vector

$$
x_i \in \mathbb{R}^d
$$

and a label

$$
y_i \in \{-1,+1\}.
$$

The dataset is linearly separable if there exists a weight vector \(w\in\mathbb{R}^d\) and a bias \(b\in\mathbb{R}\) such that

$$
y_i(w^\top x_i + b) > 0
$$

for every training example \(i\).

This condition says that every positive example lies on one side of the hyperplane, and every negative example lies on the other side.
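As a concrete check, the condition can be verified directly for a candidate hyperplane. The sketch below is a minimal NumPy illustration; the data, `w`, and `b` are made-up values, not taken from any particular dataset.

```python
import numpy as np

# Toy data: rows of X are the x_i, entries of y are labels in {-1, +1}.
X = np.array([[2.0, 1.5], [1.0, 2.5], [-2.0, -1.0], [-1.5, -2.5]])
y = np.array([+1, +1, -1, -1])

def is_separating(w, b, X, y):
    """True if y_i * (w^T x_i + b) > 0 for every example."""
    scores = X @ w + b
    return bool(np.all(y * scores > 0))

print(is_separating(np.array([1.0, 1.0]), 0.0, X, y))  # True for this toy data
```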

### Hyperplanes

A linear decision boundary has the form

$$
w^\top x + b = 0.
$$

In two dimensions, this boundary is a line. In three dimensions, it is a plane. In general, it is a hyperplane: a flat \((d-1)\)-dimensional subset of \(\mathbb{R}^d\).

The vector \(w\) is normal to the hyperplane. It points in the direction in which the score

$$
w^\top x + b
$$

increases most quickly.

The classifier predicts

$$
\hat{y} =
\operatorname{sign}(w^\top x + b).
$$

The positive class is assigned to points where

$$
w^\top x + b > 0.
$$

The negative class is assigned to points where

$$
w^\top x + b < 0.
$$
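The prediction rule is a one-liner; the sketch below is a minimal illustration of the sign convention, with scores of exactly zero mapped to the negative class as an arbitrary choice.

```python
import numpy as np

def predict(w, b, X):
    """Predict +1 or -1 for each row of X using the sign of w^T x + b."""
    return np.where(X @ w + b > 0, 1, -1)   # a score of exactly 0 maps to -1 here
```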

### Geometric Meaning

For a point \(x\), the quantity

$$
w^\top x + b
$$

is a signed score. Its sign tells us which side of the boundary the point lies on. Its magnitude is related to the distance from the boundary.

The signed distance from \(x\) to the hyperplane is

$$
\frac{w^\top x + b}{\|w\|}.
$$

For a labeled example \((x_i,y_i)\), the signed margin is

$$
y_i(w^\top x_i + b).
$$

The geometric margin is

$$
\frac{y_i(w^\top x_i + b)}{\|w\|}.
$$

A positive margin means the example is correctly classified. A negative margin means it is misclassified. A larger positive margin means the example is farther from the decision boundary on the correct side.
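These quantities are straightforward to compute. The sketch below is a small self-contained illustration: it returns the geometric margin of each example, and the margin of the whole dataset is the minimum over examples.

```python
import numpy as np

def geometric_margins(w, b, X, y):
    """Signed distance of each labeled example (x_i, y_i) to the hyperplane w^T x + b = 0."""
    return y * (X @ w + b) / np.linalg.norm(w)

# Example: two points, one on each side of the line x1 + x2 = 0.
X = np.array([[2.0, 2.0], [-1.0, -1.0]])
y = np.array([+1, -1])
print(geometric_margins(np.array([1.0, 1.0]), 0.0, X, y))  # [2*sqrt(2), sqrt(2)], both positive
```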

### Strict Separability and Weak Separability

A dataset is strictly linearly separable if every example satisfies

$$
y_i(w^\top x_i + b) > 0.
$$

It is weakly linearly separable if every example satisfies

$$
y_i(w^\top x_i + b) \ge 0.
$$

Strict separability is usually the more useful condition. If some examples lie exactly on the boundary, tiny perturbations can change their predicted labels. Strict separation guarantees a positive margin.

When a finite dataset is strictly separable, the separating hyperplane is never unique: there are infinitely many, since small perturbations of \(w\) and \(b\) preserve the strict inequalities. Some separate the data with a small margin; others with a larger margin.

### Example in Two Dimensions

Consider points in \(\mathbb{R}^2\). Suppose positive examples are clustered near \((2,2)\), and negative examples are clustered near \((-2,-2)\). A line can separate them.

For example, the boundary

$$
x_1 + x_2 = 0
$$

can be written as

$$
w^\top x + b = 0
$$

with

$$
w =
\begin{bmatrix}
1 \\
1
\end{bmatrix},
\quad
b = 0.
$$

The classifier predicts positive when

$$
x_1 + x_2 > 0
$$

and negative when

$$
x_1 + x_2 < 0.
$$

If all positive points satisfy \(x_1+x_2>0\), and all negative points satisfy \(x_1+x_2<0\), the dataset is linearly separable.
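A quick numerical check of this example, using synthetic clusters (the spread and sample sizes are assumptions, only the cluster centers come from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
pos = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2))    # cluster near (2, 2)
neg = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(50, 2))  # cluster near (-2, -2)

X = np.vstack([pos, neg])
y = np.concatenate([np.ones(50), -np.ones(50)])

w, b = np.array([1.0, 1.0]), 0.0      # the boundary x1 + x2 = 0
print(np.all(y * (X @ w + b) > 0))    # True unless a sample strays across the line
```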

### Nonseparable Data

Many datasets are not linearly separable. The classic example is XOR.

The XOR input points are

$$
(0,0),\quad (0,1),\quad (1,0),\quad (1,1).
$$

Assign labels, using the \(\pm 1\) convention from above:

| Input | Label |
|---|---:|
| \((0,0)\) | \(-1\) |
| \((0,1)\) | \(+1\) |
| \((1,0)\) | \(+1\) |
| \((1,1)\) | \(-1\) |

The positive examples lie on opposite corners of a square. The negative examples lie on the other two corners. No single line can separate the positive points from the negative points.
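To see why, note that \((0,1) + (1,0) = (0,0) + (1,1)\), so any linear score \(f(x) = w^\top x + b\) satisfies

$$
f(0,1) + f(1,0) = w_1 + w_2 + 2b = f(0,0) + f(1,1).
$$

Separating XOR would force the left-hand side to be positive and the right-hand side to be negative, which is impossible.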

This example is small but important. It shows that linear classifiers cannot represent all simple logical patterns.

### Feature Maps

A dataset that is not linearly separable in its original input space may become linearly separable after a feature transformation.

Let

$$
\phi:\mathbb{R}^d \to \mathbb{R}^m
$$

be a feature map. Instead of classifying \(x\), we classify \(\phi(x)\):

$$
\hat{y} =
\operatorname{sign}(w^\top \phi(x) + b).
$$

For XOR, one useful feature is the product

$$
x_1x_2.
$$

By adding nonlinear features, the classifier can separate patterns that were not linearly separable in the original coordinates.
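A minimal sketch of this fix, with a hand-picked (not learned) weight vector that happens to work for the mapped XOR points:

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, +1, +1, -1])

def phi(X):
    """Feature map (x1, x2, x1*x2)."""
    return np.column_stack([X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])

w = np.array([1.0, 1.0, -2.0])   # hand-picked weights in feature space
b = -0.5

scores = phi(X) @ w + b
print(np.all(y * scores > 0))    # True: XOR is separable in the lifted space
```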

Classical machine learning often depends on hand-designed feature maps. Deep learning learns feature maps from data. The layers of a network before the final classifier compute a learned transformation

$$
h = \phi_\theta(x),
$$

and the final classifier applies a linear boundary in the learned representation space:

$$
\hat{y} = \operatorname{sign}(w^\top h + b).
$$

This is one of the main reasons deep networks are powerful. They do not merely fit linear boundaries in raw input space. They learn representations where useful linear boundaries become possible.
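As a toy illustration of this idea, the two-layer network below uses hand-chosen (not learned) weights; its ReLU hidden layer maps the XOR points to a representation in which a single linear boundary works.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, +1, +1, -1])

# Hidden layer: h = relu(X W1 + b1), with weights chosen by hand for illustration.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
h = np.maximum(X @ W1 + b1, 0.0)     # the "learned" representation in a real network

# Final linear classifier in representation space.
w, b = np.array([1.0, -2.0]), -0.5
print(np.all(y * (h @ w + b) > 0))   # True: XOR is linearly separable in h-space
```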

### Separability in Higher Dimensions

Increasing the feature dimension can make separation easier.

Suppose data is represented in a low-dimensional space where classes overlap under every linear boundary. A nonlinear mapping into a higher-dimensional space may spread the data out. In the new space, a hyperplane may separate the classes.

This idea appears in kernel methods. A kernel implicitly computes inner products in a feature space, often without constructing the feature vectors explicitly.
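For instance, the quadratic kernel \(k(x,z) = (x^\top z)^2\) equals the inner product of the explicit degree-two feature vectors \(\phi(x) = (x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2)\). The sketch below verifies this numerically for random points; it is an illustration of the identity, not of any particular kernel library.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D inputs."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, z):
    """Quadratic kernel: the same inner product, computed without building phi."""
    return (x @ z) ** 2

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)
print(np.isclose(k(x, z), phi(x) @ phi(z)))  # True
```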

Deep networks use a different strategy. They explicitly compute learned representations through layers. Each layer changes the geometry of the data. Ideally, later layers make task-relevant distinctions easier to separate.

### Linear Separability and the Perceptron

The perceptron convergence theorem depends on linear separability.

If a dataset is strictly linearly separable and the examples have bounded norm, then the perceptron algorithm finds a separating hyperplane after making only finitely many mistakes. The classical bound is at most \((R/\gamma)^2\) mistakes, where \(R\) bounds the norms of the examples and \(\gamma\) is the geometric margin of some separating hyperplane (with the bias absorbed into the weight vector).

If the dataset is not linearly separable, the perceptron may never converge. It can keep updating forever because no hyperplane can correctly classify every training example.
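The algorithm itself is only a few lines. The sketch below is a standard formulation (cycle through the data, update only on mistakes), written in NumPy as a minimal illustration rather than a production implementation.

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Perceptron: on each mistake, add y_i * x_i to w and y_i to b."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (x_i @ w + b) <= 0:   # misclassified (or exactly on the boundary)
                w += y_i * x_i
                b += y_i
                mistakes += 1
        if mistakes == 0:                  # a mistake-free pass: (w, b) separates the data
            return w, b
    return w, b                            # reached max_epochs; data may not be separable
```

On the separable clusters from the earlier example this loop stops after a few epochs. On the XOR points it exhausts `max_epochs` without ever completing a mistake-free pass.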

This explains why the perceptron is mainly a conceptual algorithm. Real datasets often contain noise, overlapping classes, and mislabeled or ambiguous examples. A hard requirement of perfect separability is usually too strong.

### Separability and Generalization

Perfect separation of the training set does not guarantee good generalization.

A high-dimensional model may separate training examples perfectly but perform poorly on new data. This can happen when the model memorizes noise or relies on accidental patterns.

The margin matters. A classifier that separates the data with a large margin often generalizes better than one that barely separates it.

For this reason, support vector machines choose a separating hyperplane that maximizes the margin. Neural networks do not usually solve this exact optimization problem, but margin-like ideas remain important in classification, robustness, and representation learning.
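A hedged sketch of this idea using scikit-learn (assumed available): with a large `C`, a linear SVM approximates the hard-margin solution on the toy clusters from earlier, and the margin it achieves is, by construction, at least as large as that of the arbitrary hyperplane \(x_1 + x_2 = 0\).

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 2], 0.5, (50, 2)), rng.normal([-2, -2], 0.5, (50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])

svm = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates a hard margin
w, b = svm.coef_[0], svm.intercept_[0]

# Geometric margin achieved by the SVM hyperplane (minimum over the training set).
print((y * (X @ w + b) / np.linalg.norm(w)).min())
```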

### Linear Separability in Neural Networks

In a deep classifier, the final layer is often linear:

$$
z = Wh + b.
$$

Here \(h\) is the representation produced by earlier layers. The final layer assigns classes using linear decision boundaries in representation space.

For multiclass classification, the model predicts

$$
\hat{y} = \arg\max_k z_k.
$$

The boundary between class \(a\) and class \(b\) is where

$$
z_a = z_b.
$$

If

$$
z_a = w_a^\top h + b_a
$$

and

$$
z_b = w_b^\top h + b_b,
$$

then the boundary is

$$
(w_a - w_b)^\top h + (b_a - b_b) = 0.
$$

Thus the final classifier is still linear in \(h\). The nonlinear part of the network is responsible for producing a representation where these linear boundaries work well.
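A small sketch of this pairwise structure: given a weight matrix \(W\) and bias vector \(b\) (random here, purely for illustration), the predicted class is the row with the largest score, and the boundary between any two classes is defined by the difference of their rows.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 4, 8                          # 4 classes, 8-dimensional representation h (made-up sizes)
W, b = rng.normal(size=(K, D)), rng.normal(size=K)

h = rng.normal(size=D)               # representation of one example
z = W @ h + b                        # per-class scores
print(int(np.argmax(z)))             # predicted class

# Boundary between classes a and j: (w_a - w_j)^T h + (b_a - b_j) = 0.
a, j = 0, 1
print(((W[a] - W[j]) @ h + (b[a] - b[j])) > 0)   # True iff class a outscores class j at h
```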

### Practical Diagnosis

Linear separability is useful as a diagnostic idea.

If a simple linear classifier performs well on learned features, those features already contain useful structure. For example, researchers often freeze a pretrained model and train a linear classifier on top of its representations. This is called linear probing.

A strong linear probe means the representation makes the target classes easy to separate. A weak linear probe means the information may be absent, entangled, or not linearly accessible.

Linear probes are widely used to analyze vision models, language models, speech models, and self-supervised representations.
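A minimal probing sketch with scikit-learn (assumed available). In practice the feature matrices would come from a frozen pretrained encoder; here they are random placeholders, so the printed accuracy is only chance level and the snippet shows the mechanics, not a real result.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder features: in a real probe these come from a frozen pretrained encoder.
rng = np.random.default_rng(0)
train_feats, train_labels = rng.normal(size=(1000, 256)), rng.integers(0, 10, 1000)
test_feats, test_labels = rng.normal(size=(200, 256)), rng.integers(0, 10, 200)

# The probe is just a linear classifier trained on the frozen features.
probe = LogisticRegression(max_iter=1000).fit(train_feats, train_labels)
print(probe.score(test_feats, test_labels))  # probe accuracy: how linearly accessible the labels are
```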

### Summary

A dataset is linearly separable when a hyperplane can classify all training examples correctly. This idea gives a geometric foundation for perceptrons, logistic regression, softmax regression, support vector machines, and the final layers of neural networks.

Linear separability in raw input space is often too restrictive. Deep learning addresses this by learning representations. The goal is to transform data into a space where simple linear decision boundaries become effective.

