The perceptron is one of the earliest algorithms for binary classification. It learns a linear decision boundary by updating its weights whenever it makes a mistake.
Unlike logistic regression, the perceptron does not predict calibrated probabilities. It predicts a class label directly. Its purpose is simple: find a hyperplane that separates two classes, when such a hyperplane exists.
Binary Labels
For the perceptron, it is convenient to use labels $y \in \{-1, +1\}$.

The input is a vector $x \in \mathbb{R}^d$. The model has weight vector $w \in \mathbb{R}^d$ and bias $b \in \mathbb{R}$. It computes a score

$$s = w^\top x + b.$$

The predicted class is

$$\hat{y} = \operatorname{sign}(w^\top x + b).$$

The decision boundary is the hyperplane $\{x : w^\top x + b = 0\}$.
Correct and Incorrect Classification
A point is correctly classified when

$$y \,(w^\top x + b) > 0.$$

This expression is positive when the label and score have the same sign. If $y = +1$, then correctness requires $w^\top x + b > 0$. If $y = -1$, then correctness requires $w^\top x + b < 0$. Thus the single condition $y \,(w^\top x + b) > 0$ covers both cases.

A point is misclassified when

$$y \,(w^\top x + b) \le 0.$$
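As a quick sketch of the correctness condition, with arbitrary hand-picked numbers (the weights, bias, and points below are illustrative, not from the text):

```python
# Illustrative parameters for the condition y * (w.x + b) > 0.
w = [1.0, -2.0]
b = 0.5

def score(x):
    return w[0] * x[0] + w[1] * x[1] + b

# A positive example with a positive score: correctly classified.
x_pos, y_pos = [3.0, 1.0], +1          # score = 3 - 2 + 0.5 = 1.5
assert y_pos * score(x_pos) > 0

# A negative example with a negative score: also correctly classified.
x_neg, y_neg = [-1.0, 1.0], -1         # score = -1 - 2 + 0.5 = -2.5
assert y_neg * score(x_neg) > 0        # (-1) * (-2.5) = 2.5 > 0

# A positive example with a negative score: misclassified.
x_bad, y_bad = [-1.0, 1.0], +1
assert y_bad * score(x_bad) <= 0
```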
Update Rule
The perceptron scans through the training data. Whenever it finds a misclassified example, it updates the parameters:

$$w \leftarrow w + \eta\, y\, x, \qquad b \leftarrow b + \eta\, y.$$

Here $\eta > 0$ is the learning rate.

The update moves the decision boundary so that the current example becomes more likely to be classified correctly. To see why, consider the new score margin for the same example:

$$y \left( (w + \eta\, y\, x)^\top x + (b + \eta\, y) \right).$$

Expanding, and using $y^2 = 1$:

$$y \,(w^\top x + b) + \eta\, y^2 \left( \lVert x \rVert^2 + 1 \right) = y \,(w^\top x + b) + \eta \left( \lVert x \rVert^2 + 1 \right).$$

Since $\eta \left( \lVert x \rVert^2 + 1 \right) > 0$, the update increases the signed margin for that example.
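The margin increase can be checked numerically. In this sketch the example, parameters, and learning rate are arbitrary choices, picked so the point starts out misclassified:

```python
import torch

# Hypothetical misclassified example and arbitrary parameters.
w = torch.tensor([0.5, -1.0])
b = torch.tensor(0.2)
x = torch.tensor([1.0, 2.0])
y = torch.tensor(1.0)
eta = 0.1

margin_before = y * (w @ x + b)        # negative: the point is misclassified

# Perceptron update.
w_new = w + eta * y * x
b_new = b + eta * y
margin_after = y * (w_new @ x + b_new)

# The margin grows by exactly eta * (||x||^2 + 1).
expected_gain = eta * (x @ x + 1)
assert torch.isclose(margin_after - margin_before, expected_gain)
```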
Geometric Interpretation
The vector $w$ is normal to the decision boundary. The perceptron update rotates and shifts this boundary.

If a positive example ($y = +1$) is classified incorrectly, then the model score is too small. The update $w \leftarrow w + \eta\, x$ moves $w$ toward that positive example.

If a negative example ($y = -1$) is classified incorrectly, then the model score is too large. The update $w \leftarrow w - \eta\, x$ moves $w$ away from that negative example.
Thus the perceptron learns by mistake correction. It does not use a smooth loss or probability model. It only asks whether the current prediction is right or wrong.
Perceptron Loss
The perceptron algorithm can also be viewed as minimizing the perceptron loss:

$$\ell(w, b; x, y) = \max\bigl(0,\; -y \,(w^\top x + b)\bigr).$$

If the example is correctly classified with positive margin, the loss is zero. If it is misclassified, the loss is positive.

The gradient for a misclassified example is

$$\nabla_w \ell = -y\, x, \qquad \frac{\partial \ell}{\partial b} = -y.$$

A gradient descent step gives

$$w \leftarrow w - \eta \,(-y\, x) = w + \eta\, y\, x, \qquad b \leftarrow b + \eta\, y.$$
This matches the perceptron update.
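This correspondence can be verified with autograd. The sketch below uses hand-picked parameters under which the example is strictly misclassified, and checks that the gradients of the perceptron loss are $-y\,x$ and $-y$:

```python
import torch

# Hypothetical example, strictly misclassified under these parameters.
x = torch.tensor([1.0, -2.0])
y = torch.tensor(1.0)
w = torch.tensor([0.5, 0.5], requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

# Perceptron loss: max(0, -y * (w.x + b)).
loss = torch.clamp(-y * (w @ x + b), min=0.0)
assert loss.item() > 0                  # confirm the example is misclassified
loss.backward()

# For a misclassified example the gradients are -y * x and -y,
# so gradient descent reproduces the perceptron update.
assert torch.allclose(w.grad, -y * x)
assert torch.isclose(b.grad, -y)
```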
Algorithm
The basic perceptron algorithm is:

```
initialize w = 0, b = 0
repeat for several passes over the data:
    for each training example (x, y):
        if y * (dot(w, x) + b) <= 0:
            w = w + eta * y * x
            b = b + eta * y
```

A full pass over the training set is often called an epoch. The algorithm can stop after a fixed number of epochs or after an epoch with no mistakes.
Perceptron in PyTorch
Although PyTorch is usually used for differentiable neural networks, we can implement the perceptron update directly with tensors.
```python
import torch

torch.manual_seed(0)

# Two Gaussian clusters, one per class.
n = 200
d = 2
X_pos = torch.randn(n // 2, d) + torch.tensor([2.0, 2.0])
X_neg = torch.randn(n // 2, d) + torch.tensor([-2.0, -2.0])
X = torch.cat([X_pos, X_neg], dim=0)
y = torch.cat([
    torch.ones(n // 2),
    -torch.ones(n // 2),
])

w = torch.zeros(d)
b = torch.tensor(0.0)
eta = 0.1
num_epochs = 10

for epoch in range(num_epochs):
    mistakes = 0
    for i in range(n):
        score = X[i] @ w + b
        if y[i] * score <= 0:
            w = w + eta * y[i] * X[i]
            b = b + eta * y[i]
            mistakes += 1
    print(f"epoch {epoch + 1}, mistakes {mistakes}")
```

This code does not use `loss.backward()`. The update rule is explicit.
Mini-Batch Version
The classical perceptron updates after each mistake. We can also write a batch-style version that updates using all misclassified examples in a batch.
```python
w = torch.zeros(d)
b = torch.tensor(0.0)
eta = 0.1

for epoch in range(20):
    scores = X @ w + b
    margins = y * scores
    mask = margins <= 0           # True where an example is misclassified
    if mask.any():
        X_bad = X[mask]
        y_bad = y[mask]
        w = w + eta * (y_bad[:, None] * X_bad).mean(dim=0)
        b = b + eta * y_bad.mean()
    mistakes = mask.sum().item()
    print(f"epoch {epoch + 1}, mistakes {mistakes}")
```

This version is closer to the vectorized style used in deep learning. It computes all scores at once, creates a mask for mistakes, and updates using tensor operations.
Linear Separability
A dataset is linearly separable if there exists some $w$ and $b$ such that

$$y_i \,(w^\top x_i + b) > 0$$

for every training example $(x_i, y_i)$.
The perceptron convergence theorem states that if the data is linearly separable and the examples have bounded norm, then the perceptron algorithm makes only finitely many mistakes. In other words, it eventually finds a separating hyperplane.
This result is important historically and conceptually. It shows that a simple mistake-driven linear classifier can learn under a clear geometric condition.
If the data is not linearly separable, the basic perceptron may keep making mistakes forever. In practice, we stop after a fixed number of epochs.
Margin
The signed margin of an example is

$$y \,(w^\top x + b).$$

A positive margin means correct classification. A larger positive margin means the example is farther from the decision boundary.

The geometric margin normalizes by the weight norm:

$$\frac{y \,(w^\top x + b)}{\lVert w \rVert}.$$
Margin matters because classifiers with larger margins tend to generalize better. This idea leads naturally to support vector machines and margin-based losses.
The perceptron only requires examples to be on the correct side of the boundary. It does not explicitly maximize the margin.
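Both margins are one-liners in tensor code. The parameters and points in this sketch are hand-picked so that every example is correctly classified:

```python
import torch

# Hypothetical separating parameters and a few labeled points.
w = torch.tensor([1.0, 1.0])
b = torch.tensor(-1.0)
X = torch.tensor([[2.0, 2.0], [0.0, 0.0], [1.0, 0.5]])
y = torch.tensor([1.0, -1.0, 1.0])

signed = y * (X @ w + b)        # signed margins
geometric = signed / w.norm()   # signed distance to the boundary

assert (signed > 0).all()       # every point is on the correct side
```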
Comparison with Logistic Regression
The perceptron and logistic regression are both linear classifiers, but they optimize different objectives.
| Model | Output | Loss | Update signal |
|---|---|---|---|
| Perceptron | Class label | Perceptron loss | Only mistakes |
| Logistic regression | Probability | Cross-entropy | Every example |
| Softmax regression | Class probabilities | Cross-entropy | Every example |
The perceptron ignores correctly classified examples, even if they are close to the boundary. Logistic regression uses all examples and assigns stronger gradients to confident mistakes.
This makes logistic regression better suited for probabilistic modeling and modern neural network training.
Perceptron as a Neural Unit
A perceptron can be viewed as a single artificial neuron with a step activation:

$$\hat{y} = \operatorname{sign}(w^\top x + b).$$
The step function is not differentiable. This makes it unsuitable for gradient-based training in multilayer networks.
Modern neural networks replace the step function with differentiable or almost-everywhere differentiable activations such as sigmoid, tanh, ReLU, GELU, and SiLU. The underlying structure remains similar: compute an affine transformation, then apply a nonlinear function.
Limits of the Perceptron
The perceptron can only learn linear decision boundaries. It cannot solve problems that require nonlinear separation unless the input features are transformed first.
The classic example is XOR. The four points

$$(0, 0), \quad (0, 1), \quad (1, 0), \quad (1, 1)$$

with labels determined by exclusive-or cannot be separated by a single line.
A multilayer neural network can solve XOR because hidden layers create new features. The first layer maps the input into a representation where the classes become separable, and later layers classify that representation.
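To make this concrete, here is a sketch of a two-layer network that computes XOR with hand-picked (not learned) weights: one hidden unit detects OR, another detects AND, and the output fires when OR holds but AND does not.

```python
import torch

def step(s):
    # 0/1 step activation.
    return (s > 0).float()

# Hand-constructed weights: hidden unit 1 = OR, hidden unit 2 = AND.
W1 = torch.tensor([[1.0, 1.0],     # OR detector
                   [1.0, 1.0]])    # AND detector
b1 = torch.tensor([-0.5, -1.5])
w2 = torch.tensor([1.0, -1.0])     # output = OR and not AND
b2 = torch.tensor(-0.5)

X = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])

h = step(X @ W1.T + b1)   # hidden layer: new features where XOR is separable
out = step(h @ w2 + b2)   # a single linear unit now suffices

assert out.tolist() == [0.0, 1.0, 1.0, 0.0]   # XOR truth table
```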
Summary
The perceptron is a mistake-driven linear classifier. It predicts labels using the sign of a linear score and updates parameters only when it misclassifies an example.
It introduces core ideas that remain useful: linear decision boundaries, signed margins, mistake correction, and the role of separability. Its limitations motivate differentiable activations, smooth losses, and multilayer neural networks.