The Perceptron Algorithm

The perceptron is one of the earliest algorithms for binary classification. It learns a linear decision boundary by updating its weights whenever it makes a mistake.

Unlike logistic regression, the perceptron does not predict calibrated probabilities. It predicts a class label directly. Its purpose is simple: find a hyperplane that separates two classes, when such a hyperplane exists.

Binary Labels

For the perceptron, it is convenient to use labels

y \in \{-1, +1\}.

The input is a vector

x \in \mathbb{R}^d.

The model has weight vector

w \in \mathbb{R}^d

and bias

b \in \mathbb{R}.

It computes a score

s = w^\top x + b.

The predicted class is

\hat{y} = \begin{cases} +1 & \text{if } s \ge 0, \\ -1 & \text{if } s < 0. \end{cases}

The decision boundary is the hyperplane

w^\top x + b = 0.
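
As a quick illustration, here is a minimal sketch of this prediction rule in PyTorch. The weights, bias, and input below are made-up values chosen only to show the computation, not part of any dataset.

import torch

# Made-up weights, bias, and input, just to illustrate the prediction rule.
w = torch.tensor([1.0, -2.0])
b = torch.tensor(0.5)
x = torch.tensor([3.0, 1.0])

s = w @ x + b                   # score s = w^T x + b = 1*3 + (-2)*1 + 0.5 = 1.5
y_hat = 1 if s >= 0 else -1     # predicted class from the sign of the score
print(s.item(), y_hat)          # prints: 1.5 1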

Correct and Incorrect Classification

A point is correctly classified when

y(w^\top x + b) > 0.

This expression is positive when the label and score have the same sign.

If y = +1, then correctness requires

w^\top x + b > 0.

If y = -1, then correctness requires

w^\top x + b < 0.

Thus the single condition

y(w^\top x + b) > 0

covers both cases.

A point is misclassified when

y(w^\top x + b) \le 0.

Update Rule

The perceptron scans through the training data. Whenever it finds a misclassified example, it updates the parameters:

w \leftarrow w + \eta y x,
b \leftarrow b + \eta y.

Here η > 0 is the learning rate.

The update moves the decision boundary so that the current example becomes more likely to be classified correctly.

To see why, consider the new score margin for the same example:

y(w_{\text{new}}^\top x + b_{\text{new}}) = y((w + \eta y x)^\top x + b + \eta y).

Expanding, and using y² = 1:

y(w^\top x + b) + \eta \|x\|^2 + \eta.

Since η‖x‖² + η > 0, the update increases the signed margin for that example.
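
We can check this numerically. The sketch below uses made-up values for w, b, x, and y, applies one update to a misclassified example, and confirms that the signed margin grows by exactly η‖x‖² + η.

import torch

eta = 0.1
w = torch.tensor([0.5, -1.0])
b = torch.tensor(0.0)
x = torch.tensor([1.0, 2.0])
y = 1.0  # this example is misclassified: y * (w^T x + b) = -1.5 <= 0

margin_before = y * (w @ x + b)

# One perceptron update on the misclassified example.
w_new = w + eta * y * x
b_new = b + eta * y

margin_after = y * (w_new @ x + b_new)

print(margin_before.item())                      # -1.5
print(margin_after.item())                       # -0.9
print((eta * x.norm() ** 2 + eta).item())        # ~0.6, equal to margin_after - margin_before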

Geometric Interpretation

The vector w is normal to the decision boundary. The perceptron update rotates and shifts this boundary.

If a positive example is classified incorrectly, then the model score is too small. The update

w \leftarrow w + \eta x

moves w toward that positive example.

If a negative example is classified incorrectly, then the model score is too large. The update

w \leftarrow w - \eta x

moves w away from that negative example.

Thus the perceptron learns by mistake correction. It does not use a smooth loss or probability model. It only asks whether the current prediction is right or wrong.

Perceptron Loss

The perceptron algorithm can also be viewed as minimizing the perceptron loss:

\ell(w, b; x, y) = \max(0, -y(w^\top x + b)).

If the example is correctly classified with positive margin, the loss is zero. If it is misclassified, the loss is positive.

The gradient for a misclassified example is

\nabla_w \ell = -y x,
\nabla_b \ell = -y.

A gradient descent step gives

w \leftarrow w - \eta(-y x) = w + \eta y x,
b \leftarrow b - \eta(-y) = b + \eta y.

This matches the perceptron update.
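
As a sanity check, autograd recovers the same gradients. The sketch below uses made-up values for a misclassified example and compares PyTorch's gradients of the perceptron loss with -yx and -y.

import torch

x = torch.tensor([1.0, 2.0])
y = torch.tensor(1.0)

w = torch.tensor([0.5, -1.0], requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

# Perceptron loss max(0, -y (w^T x + b)); positive here because the example is misclassified.
loss = torch.clamp(-y * (w @ x + b), min=0)
loss.backward()

print(w.grad)   # tensor([-1., -2.])  ==  -y * x
print(b.grad)   # tensor(-1.)         ==  -y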

Algorithm

The basic perceptron algorithm is:

initialize w = 0, b = 0

repeat for several passes over the data:
    for each training example (x, y):
        if y * (dot(w, x) + b) <= 0:
            w = w + eta * y * x
            b = b + eta * y

A full pass over the training set is often called an epoch.

The algorithm can stop after a fixed number of epochs or after an epoch with no mistakes.

Perceptron in PyTorch

Although PyTorch is usually used for differentiable neural networks, we can implement the perceptron update directly with tensors.

import torch

torch.manual_seed(0)

n = 200
d = 2

# Two Gaussian clusters centered at (2, 2) and (-2, -2): a linearly separable toy dataset.
X_pos = torch.randn(n // 2, d) + torch.tensor([2.0, 2.0])
X_neg = torch.randn(n // 2, d) + torch.tensor([-2.0, -2.0])

X = torch.cat([X_pos, X_neg], dim=0)
y = torch.cat([
    torch.ones(n // 2),
    -torch.ones(n // 2),
])

w = torch.zeros(d)
b = torch.tensor(0.0)

eta = 0.1
num_epochs = 10

for epoch in range(num_epochs):
    mistakes = 0

    for i in range(n):
        score = X[i] @ w + b

        # Misclassified or exactly on the boundary: apply the perceptron update.
        if y[i] * score <= 0:
            w = w + eta * y[i] * X[i]
            b = b + eta * y[i]
            mistakes += 1

    print(f"epoch {epoch + 1}, mistakes {mistakes}")

This code does not use loss.backward(). The update rule is explicit.

Mini-Batch Version

The classical perceptron updates after each mistake. We can also write a batch-style version that updates using all misclassified examples in a batch.

w = torch.zeros(d)
b = torch.tensor(0.0)

eta = 0.1

for epoch in range(20):
    scores = X @ w + b
    margins = y * scores

    # Boolean mask of misclassified (or boundary) examples.
    mask = margins <= 0

    if mask.any():
        X_bad = X[mask]
        y_bad = y[mask]

        # Average of the per-example updates over all misclassified examples.
        w = w + eta * (y_bad[:, None] * X_bad).mean(dim=0)
        b = b + eta * y_bad.mean()

    mistakes = mask.sum().item()
    print(f"epoch {epoch + 1}, mistakes {mistakes}")

This version is closer to the vectorized style used in deep learning. It computes all scores at once, creates a mask for mistakes, and updates using tensor operations.

Linear Separability

A dataset is linearly separable if there exist some w and b such that

y_i(w^\top x_i + b) > 0

for every training example.
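
Given a candidate w and b, this condition is easy to check with tensor operations. A minimal sketch, reusing X, y, w, and b from the code above; with the separable blobs generated earlier, it should print True once training has stopped making mistakes.

# True only if every example has a strictly positive signed margin under (w, b).
separates = bool((y * (X @ w + b) > 0).all())
print(separates)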

The perceptron convergence theorem states that if the data is linearly separable and the examples have bounded norm, then the perceptron algorithm makes only finitely many mistakes. In other words, it eventually finds a separating hyperplane.

This result is important historically and conceptually. It shows that a simple mistake-driven linear classifier can learn under a clear geometric condition.

If the data is not linearly separable, the basic perceptron may keep making mistakes forever. In practice, we stop after a fixed number of epochs.

Margin

The signed margin of an example is

y(w^\top x + b).

A positive margin means correct classification. A larger positive margin means the example is farther from the decision boundary.

The geometric margin normalizes by the weight norm:

\frac{y(w^\top x + b)}{\|w\|}.
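
In code, the geometric margins of a whole dataset can be computed in one line. A short sketch, again reusing X, y, w, and b from the earlier PyTorch example (w must be nonzero, which it is after training).

# Signed margins scaled by the weight norm; the smallest one is the dataset's margin.
geom_margins = y * (X @ w + b) / w.norm()
print(geom_margins.min().item())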

Margin matters because classifiers with larger margins tend to generalize better. This idea leads naturally to support vector machines and margin-based losses.

The perceptron only requires examples to be on the correct side of the boundary. It does not explicitly maximize the margin.

Comparison with Logistic Regression

The perceptron and logistic regression are both linear classifiers, but they optimize different objectives.

Model                 Output                Loss              Update signal
Perceptron            Class label           Perceptron loss   Only mistakes
Logistic regression   Probability           Cross-entropy     Every example
Softmax regression    Class probabilities   Cross-entropy     Every example

The perceptron ignores correctly classified examples, even if they are close to the boundary. Logistic regression uses all examples and assigns stronger gradients to confident mistakes.

This makes logistic regression better suited for probabilistic modeling and modern neural network training.

Perceptron as a Neural Unit

A perceptron can be viewed as a single artificial neuron with a step activation:

\hat{y} = \operatorname{sign}(w^\top x + b).
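
Here is a minimal sketch of such a unit as a function (the name perceptron_unit is just for illustration). One small detail: torch.sign maps a score of exactly 0 to 0, so torch.where is used to keep the "score ≥ 0 means +1" convention from earlier.

import torch

def perceptron_unit(X, w, b):
    # Affine transformation followed by a hard-threshold (step) activation.
    # torch.where keeps the ">= 0 -> +1" convention; torch.sign would map an exact 0 to 0.
    s = X @ w + b
    return torch.where(s >= 0, torch.tensor(1.0), torch.tensor(-1.0))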

The step function is not differentiable. This makes it unsuitable for gradient-based training in multilayer networks.

Modern neural networks replace the step function with differentiable or almost-everywhere differentiable activations such as sigmoid, tanh, ReLU, GELU, and SiLU. The underlying structure remains similar: compute an affine transformation, then apply a nonlinear function.

Limits of the Perceptron

The perceptron can only learn linear decision boundaries. It cannot solve problems that require nonlinear separation unless the input features are transformed first.

The classic example is XOR. The four points

(0,0), (0,1), (1,0), (1,1)

with labels determined by exclusive-or cannot be separated by a single line.

A multilayer neural network can solve XOR because hidden layers create new features. The first layer maps the input into a representation where the classes become separable, and later layers classify that representation.
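
To make this concrete, here is a minimal sketch in which two hand-picked hidden features (an OR-like and an AND-like feature, chosen for illustration rather than learned) map the XOR points into a space where the perceptron update from above does converge.

import torch

# The four XOR points and their labels (+1 where exactly one input is 1).
X_xor = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y_xor = torch.tensor([-1.0, 1.0, 1.0, -1.0])

# Hand-picked hidden layer: an OR-like feature and an AND-like feature.
H = torch.stack([
    torch.clamp(X_xor.sum(dim=1), max=1.0),        # OR(x1, x2)
    torch.clamp(X_xor.sum(dim=1) - 1.0, min=0.0),   # AND(x1, x2)
], dim=1)

# Run the perceptron on the hidden features; XOR is linearly separable in this space.
w = torch.zeros(2)
b = torch.tensor(0.0)
eta = 0.1

for epoch in range(50):
    mistakes = 0
    for i in range(4):
        if y_xor[i] * (H[i] @ w + b) <= 0:
            w = w + eta * y_xor[i] * H[i]
            b = b + eta * y_xor[i]
            mistakes += 1
    if mistakes == 0:
        print(f"converged after epoch {epoch + 1}")
        break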

Summary

The perceptron is a mistake-driven linear classifier. It predicts labels using the sign of a linear score and updates parameters only when it misclassifies an example.

It introduces core ideas that remain useful: linear decision boundaries, signed margins, mistake correction, and the role of separability. Its limitations motivate differentiable activations, smooth losses, and multilayer neural networks.