Linear separability describes when a classification dataset can be divided perfectly by a linear decision boundary. It is one of the central geometric ideas behind linear classification.
For binary classification, each example has an input vector $x_i \in \mathbb{R}^d$ and a label $y_i \in \{-1, +1\}$.

The dataset is linearly separable if there exists a weight vector $w$ and a bias $b$ such that

$$y_i (w^\top x_i + b) > 0$$

for every training example $(x_i, y_i)$.
This condition says that every positive example lies on one side of the hyperplane, and every negative example lies on the other side.
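The separability condition can be checked directly. The sketch below uses a made-up toy dataset and a hand-picked $w$ and $b$, purely for illustration:

```python
import numpy as np

# Toy dataset: rows of X are input vectors x_i, y holds labels in {-1, +1}.
# (These values are illustrative, not from the text.)
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -2.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

# A candidate hyperplane.
w = np.array([1.0, 1.0])
b = 0.0

# (w, b) separates the data if y_i * (w . x_i + b) > 0 for every i.
separable = bool(np.all(y * (X @ w + b) > 0))
print(separable)  # True for this choice of w and b
```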
## Hyperplanes
A linear decision boundary has the form

$$w^\top x + b = 0.$$
In two dimensions, this boundary is a line. In three dimensions, it is a plane. In higher dimensions, it is called a hyperplane.
The vector $w$ is normal to the hyperplane. It points in the direction in which the score $w^\top x + b$ increases most quickly.
The classifier predicts

$$\hat{y} = \operatorname{sign}(w^\top x + b).$$

The positive class is assigned to points where $w^\top x + b > 0$, and the negative class to points where $w^\top x + b < 0$.
## Geometric Meaning
For a point $x$, the quantity

$$s(x) = w^\top x + b$$

is a signed score. Its sign tells us which side of the boundary the point lies on. Its magnitude is related to the distance from the boundary.
The signed distance from $x$ to the hyperplane is

$$\frac{w^\top x + b}{\lVert w \rVert}.$$
For a labeled example $(x, y)$, the signed margin is

$$y (w^\top x + b),$$

and the geometric margin is

$$\frac{y (w^\top x + b)}{\lVert w \rVert}.$$
A positive margin means the example is correctly classified. A negative margin means it is misclassified. A larger positive margin means the example is farther from the decision boundary on the correct side.
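These quantities take only a few lines to compute. The values of $w$, $b$, $x$, and $y$ below are hand-picked so the arithmetic is easy to follow:

```python
import numpy as np

w = np.array([3.0, 4.0])   # ||w|| = 5, chosen for easy arithmetic
b = -5.0

x = np.array([3.0, 4.0])   # an example point (illustrative)
y = 1                      # its label

score = w @ x + b                                  # signed score: 9 + 16 - 5 = 20
signed_distance = score / np.linalg.norm(w)        # 20 / 5 = 4
geometric_margin = y * score / np.linalg.norm(w)   # positive => correctly classified
print(score, signed_distance, geometric_margin)    # 20.0 4.0 4.0
```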
## Strict Separability and Weak Separability
A dataset is strictly linearly separable if every example satisfies

$$y_i (w^\top x_i + b) > 0.$$

It is weakly linearly separable if every example satisfies

$$y_i (w^\top x_i + b) \ge 0.$$
Strict separability is usually the useful condition. If some examples lie exactly on the boundary, tiny perturbations may change their predicted labels. Strict separation gives a positive margin.
When a finite dataset is strictly separable, there is usually not just one separating hyperplane. There are often infinitely many. Some separate the data with a small margin; others separate the data with a larger margin.
## Example in Two Dimensions
Consider points in $\mathbb{R}^2$. Suppose positive examples are clustered near $(2, 2)$, and negative examples are clustered near $(-2, -2)$. A line can separate them.

For example, the boundary

$$x_1 + x_2 = 0$$

can be written as $w^\top x + b = 0$ with

$$w = (1, 1)^\top, \quad b = 0.$$

The classifier predicts positive when $x_1 + x_2 > 0$ and negative when $x_1 + x_2 < 0$.

If all positive points satisfy $x_1 + x_2 > 0$, and all negative points satisfy $x_1 + x_2 < 0$, the dataset is linearly separable.
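A quick numerical check of this kind of setup. The cluster centers $(2, 2)$ and $(-2, -2)$ and the boundary $x_1 + x_2 = 0$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: positives near (2, 2), negatives near (-2, -2),
# boundary x1 + x2 = 0, i.e. w = (1, 1), b = 0.
pos = rng.normal(loc=[2.0, 2.0], scale=0.3, size=(20, 2))
neg = rng.normal(loc=[-2.0, -2.0], scale=0.3, size=(20, 2))

w, b = np.array([1.0, 1.0]), 0.0

# Every positive point should score > 0 and every negative point < 0.
print(bool(np.all(pos @ w + b > 0) and np.all(neg @ w + b < 0)))
```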
## Nonseparable Data
Many datasets are not linearly separable. The simplest example is XOR.
The XOR input points are $(0, 0)$, $(0, 1)$, $(1, 0)$, and $(1, 1)$.

Assign labels:

| Input | Label |
|---|---|
| $(0, 0)$ | 0 |
| $(0, 1)$ | 1 |
| $(1, 0)$ | 1 |
| $(1, 1)$ | 0 |
The positive examples lie on opposite corners of a square. The negative examples lie on the other two corners. No single line can separate the positive points from the negative points.
This example is small but important. It shows that linear classifiers cannot represent all simple logical patterns.
## Feature Maps
A dataset that is not linearly separable in its original input space may become linearly separable after a feature transformation.
Let

$$\phi : \mathbb{R}^d \to \mathbb{R}^m$$

be a feature map. Instead of classifying $x$, we classify $\phi(x)$:

$$\hat{y} = \operatorname{sign}(w^\top \phi(x) + b).$$

For XOR, one useful feature is the product

$$x_1 x_2.$$
By adding nonlinear features, the classifier can separate patterns that were not linearly separable in the original coordinates.
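A minimal check that XOR becomes separable once the product feature is added. The weights below are hand-picked for illustration, and labels are mapped to $\{-1, +1\}$:

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])  # XOR, with labels mapped to {-1, +1}

# Feature map phi(x) = (x1, x2, x1 * x2): append the product feature.
phi = np.column_stack([X, X[:, 0] * X[:, 1]])

# Hand-picked hyperplane in feature space (an assumption, easy to verify).
w = np.array([1.0, 1.0, -2.0])
b = -0.5

# Scores are -0.5, 0.5, 0.5, -0.5: every example is on the correct side.
print(bool(np.all(y * (phi @ w + b) > 0)))  # True
```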
Classical machine learning often depends on hand-designed feature maps. Deep learning learns feature maps from data. A neural network layer computes a transformation

$$h = \sigma(W x + b),$$

and the final classifier applies a linear boundary in the learned representation space:

$$\hat{y} = \operatorname{sign}(v^\top h + c).$$
This is one of the main reasons deep networks are powerful. They do not merely fit linear boundaries in raw input space. They learn representations where useful linear boundaries become possible.
## Separability in Higher Dimensions
Increasing the feature dimension can make separation easier.
Suppose data is represented in a low-dimensional space where classes overlap under every linear boundary. A nonlinear mapping into a higher-dimensional space may spread the data out. In the new space, a hyperplane may separate the classes.
This idea appears in kernel methods. A kernel implicitly computes inner products in a feature space, often without constructing the feature vectors explicitly.
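The kernel identity can be verified numerically. Below, a degree-2 polynomial kernel on 2-D inputs is compared against one standard explicit feature map for that kernel:

```python
import numpy as np

def phi(x):
    # Explicit degree-2 feature map for 2-D inputs.
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, z):
    # Polynomial kernel: computes the same inner product implicitly,
    # without building the feature vectors.
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

# phi(x) . phi(z) and k(x, z) agree (both equal 1 here).
print(bool(np.isclose(phi(x) @ phi(z), k(x, z))))  # True
```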
Deep networks use a different strategy. They explicitly compute learned representations through layers. Each layer changes the geometry of the data. Ideally, later layers make task-relevant distinctions easier to separate.
## Linear Separability and the Perceptron
The perceptron convergence theorem depends on linear separability.
If a dataset is linearly separable and the examples have bounded norm, then the perceptron algorithm eventually finds a separating hyperplane. It will make only finitely many mistakes.
If the dataset is not linearly separable, the perceptron may never converge. It can keep updating forever because no hyperplane can correctly classify every training example.
This explains why the perceptron is mainly a conceptual algorithm. Real datasets often contain noise, overlapping classes, mislabeled examples, or ambiguous examples. A hard requirement of perfect separability is usually too strong.
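A minimal NumPy sketch of the perceptron update rule, run on a made-up separable dataset. On separable data it stops after an epoch with no mistakes:

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Classic perceptron: on separable data, stops after finitely many mistakes."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (x_i @ w + b) <= 0:   # misclassified (or on the boundary)
                w += y_i * x_i             # push the hyperplane toward the correct side
                b += y_i
                mistakes += 1
        if mistakes == 0:                  # converged: a separating hyperplane found
            return w, b
    return w, b                            # may never separate nonseparable data

X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
print(bool(np.all(y * (X @ w + b) > 0)))  # True on this separable toy dataset
```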
## Separability and Generalization
Perfect separation of the training set does not guarantee good generalization.
A high-dimensional model may separate training examples perfectly but perform poorly on new data. This can happen when the model memorizes noise or relies on accidental patterns.
The margin matters. A classifier that separates the data with a large margin often generalizes better than one that barely separates it.
For this reason, support vector machines choose a separating hyperplane that maximizes the margin. Neural networks do not usually solve this exact optimization problem, but margin-like ideas remain important in classification, robustness, and representation learning.
## Linear Separability in Neural Networks
In a deep classifier, the final layer is often linear:

$$z = W h + b.$$

Here $h$ is the representation produced by earlier layers. The final layer assigns classes using linear decision boundaries in representation space.

For multiclass classification, the model predicts

$$\hat{y} = \arg\max_k z_k.$$

The boundary between class $i$ and class $j$ is where $z_i = z_j$. If

$$z_i = w_i^\top h + b_i$$

and

$$z_j = w_j^\top h + b_j,$$

then the boundary is

$$(w_i - w_j)^\top h + (b_i - b_j) = 0.$$

Thus the final classifier is still linear in $h$. The nonlinear part of the network is responsible for producing a representation where these linear boundaries work well.
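The pairwise-boundary algebra is easy to confirm numerically. The shapes below are assumptions: a 4-dimensional representation and 3 classes with random weights:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed shapes: h is a learned representation, (W, b) the final linear layer.
h = rng.normal(size=4)        # representation from earlier layers
W = rng.normal(size=(3, 4))   # one weight row w_k per class
b = rng.normal(size=3)        # one bias b_k per class

z = W @ h + b                 # class scores z_k = w_k . h + b_k
pred = int(np.argmax(z))      # predicted class

# The boundary between classes i and j is where z_i == z_j,
# i.e. (w_i - w_j) . h + (b_i - b_j) == 0.
i, j = 0, 1
gap = (W[i] - W[j]) @ h + (b[i] - b[j])
print(pred, bool(np.isclose(gap, z[i] - z[j])))  # the gap equals the score difference
```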
## Practical Diagnosis
Linear separability is useful as a diagnostic idea.
If a simple linear classifier performs well on learned features, those features already contain useful structure. For example, researchers often freeze a pretrained model and train a linear classifier on top of its representations. This is called linear probing.
A strong linear probe means the representation makes the target classes easy to separate. A weak linear probe means the information may be absent, entangled, or not linearly accessible.
Linear probes are widely used to analyze vision models, language models, speech models, and self-supervised representations.
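A linear probe can be sketched in a few lines. Here the "frozen features" are simulated as two clusters in representation space, and only a linear classifier (logistic loss, gradient descent) is trained on top; everything below is an illustrative assumption, not a real pretrained model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frozen features from a pretrained model: two well-separated
# clusters in an 8-dimensional "representation space".
feats = np.vstack([rng.normal(1.5, 1.0, size=(100, 8)),
                   rng.normal(-1.5, 1.0, size=(100, 8))])
labels = np.array([1] * 100 + [-1] * 100)

# Linear probe: train only a linear classifier via logistic-loss gradient descent.
w = np.zeros(8)
b = 0.0
lr = 0.1
for _ in range(200):
    margins = labels * (feats @ w + b)
    g = -labels / (1.0 + np.exp(margins))   # d(logistic loss)/d(score)
    w -= lr * (feats.T @ g) / len(labels)
    b -= lr * g.mean()

accuracy = np.mean(np.sign(feats @ w + b) == labels)
print(accuracy)  # high accuracy => classes are linearly accessible in the features
```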
## Summary
A dataset is linearly separable when a hyperplane can classify all training examples correctly. This idea gives a geometric foundation for perceptrons, logistic regression, softmax regression, support vector machines, and the final layers of neural networks.
Linear separability in raw input space is often too restrictive. Deep learning addresses this by learning representations. The goal is to transform data into a space where simple linear decision boundaries become effective.