
Supervised Learning

Supervised learning is the central paradigm of modern machine learning and deep learning. In supervised learning, a model learns a mapping from inputs to outputs using examples where the correct outputs are already known.

A supervised learning system receives pairs of data:

(x, y),

where x is the input and y is the target output or label.

The goal is to learn a function

f_\theta(x) \approx y,

where f_\theta is a parameterized model with parameters \theta. During training, the model adjusts its parameters so that its predictions become close to the true targets on the training data.

Supervised learning forms the basis of image classification, speech recognition, machine translation, spam detection, recommendation systems, medical diagnosis, and large language model fine-tuning.

The Learning Problem

Suppose we observe a dataset

\mathcal{D} = \{ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(N)}, y^{(N)}) \}.

Each pair contains an input example and its corresponding target.

Examples:

Task | Input x | Target y
Image classification | Image pixels | Class label
Speech recognition | Audio waveform | Text transcript
Translation | English sentence | French sentence
Sentiment analysis | Review text | Positive or negative label
House price prediction | House features | Numerical price

The model receives the input x and produces a prediction

\hat{y} = f_\theta(x).

The prediction is compared with the true target y. A loss function measures the difference between them. The training algorithm then modifies the parameters \theta to reduce this loss.

The full supervised learning pipeline is therefore

x \longrightarrow f_\theta(x) \longrightarrow \hat{y} \longrightarrow L(\hat{y}, y).
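For concreteness, here is a minimal sketch of this pipeline in PyTorch. The small linear model and mean squared error loss are illustrative choices, anticipating the examples later in this section:

import torch
import torch.nn as nn

# x -> f_theta(x) -> y_hat -> L(y_hat, y), for a single example
f_theta = nn.Linear(4, 1)        # a small parameterized model (4 features -> 1 output, illustrative)
x = torch.randn(1, 4)            # one input example
y = torch.randn(1, 1)            # its true target
y_hat = f_theta(x)               # prediction
loss = nn.MSELoss()(y_hat, y)    # loss measures the gap between prediction and target
print(loss)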

Inputs and Targets

Inputs may have many forms:

Data type | Typical tensor shape
Tabular data | [B, D]
Images | [B, C, H, W]
Text tokens | [B, T]
Audio spectrograms | [B, F, T]
Graph data | Node and edge tensors
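For example, dummy tensors with these shapes can be created in PyTorch; the batch size and dimensions below are arbitrary illustrative values:

import torch

B = 8                                         # batch size (arbitrary)
tabular = torch.randn(B, 20)                  # [B, D]
images = torch.randn(B, 3, 224, 224)          # [B, C, H, W]
tokens = torch.randint(0, 1000, (B, 128))     # [B, T], integer token ids
spectrograms = torch.randn(B, 80, 400)        # [B, F, T]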

Targets also vary by task.

For classification tasks, targets are often integer class labels:

y \in \{0, 1, \dots, K-1\}.

For regression tasks, targets are continuous values:

y \in \mathbb{R}.

For sequence tasks such as translation, the target may itself be a sequence:

y = (y_1, y_2, \dots, y_T).

The structure of the target determines the choice of model architecture and loss function.

Regression

Regression predicts continuous numerical values.

Examples include:

  • Predicting house prices
  • Forecasting temperatures
  • Estimating stock volatility
  • Predicting energy usage

Suppose the input vector is

x \in \mathbb{R}^d.

A linear regression model predicts

\hat{y} = w^\top x + b.

Here:

Symbol | Meaning
x | Input vector
w | Weight vector
b | Bias
\hat{y} | Predicted output

The model attempts to minimize prediction error across the dataset.

A common regression loss is mean squared error:

L(y, \hat{y}) = (y - \hat{y})^2.

In PyTorch:

import torch
import torch.nn as nn

model = nn.Linear(3, 1)        # linear model: 3 input features -> 1 output value

x = torch.randn(16, 3)         # batch of 16 input vectors
y = torch.randn(16, 1)         # batch of 16 continuous targets

pred = model(x)                # predictions, shape [16, 1]

loss_fn = nn.MSELoss()         # mean squared error
loss = loss_fn(pred, y)

print(loss)

The model outputs one scalar prediction for each example in the batch.
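A quick shape check on the snippet above confirms this:

print(pred.shape)  # torch.Size([16, 1]): one prediction per example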

Classification

Classification predicts discrete categories.

Examples include:

  • Identifying objects in images
  • Detecting spam emails
  • Recognizing diseases from scans
  • Predicting customer churn

Suppose there are K classes. The model produces a vector of scores called logits:

z \in \mathbb{R}^K.

The softmax function converts logits into probabilities:

p_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}.

Each probability satisfies

0 \le p_i \le 1,

and

\sum_{i=1}^{K} p_i = 1.

The predicted class is usually

\hat{y} = \arg\max_i p_i.

Cross-entropy loss is commonly used for classification:

L = -\log p_y,

where p_y is the predicted probability assigned to the correct class.

In PyTorch:

model = nn.Linear(128, 10)               # 128 input features -> 10 class logits

x = torch.randn(32, 128)                 # batch of 32 inputs
targets = torch.randint(0, 10, (32,))    # integer class labels in {0, ..., 9}

logits = model(x)                        # raw scores, shape [32, 10]

loss_fn = nn.CrossEntropyLoss()          # softmax + negative log-likelihood
loss = loss_fn(logits, targets)

print(loss)

The output tensor has shape

[batch_size, num_classes]

In this example:

[32, 10]
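To recover probabilities and predicted classes from these logits, the softmax and argmax operations above can be applied directly:

probs = logits.softmax(dim=1)    # probabilities, shape [32, 10], each row sums to 1
preds = probs.argmax(dim=1)      # predicted class per example, shape [32]

print(preds.shape)

Because softmax is monotonic, taking the argmax of the raw logits gives the same predicted classes.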

Binary Classification

Binary classification predicts one of two possible classes.

Examples include:

  • Fraud or non-fraud
  • Positive or negative sentiment
  • Benign or malignant tumor

The model produces a scalar logit z. The sigmoid function converts it into a probability:

\sigma(z) = \frac{1}{1 + e^{-z}}.

The output represents the estimated probability of the positive class.

Binary cross-entropy loss is often used:

L = -y \log \hat{y} - (1 - y)\log(1 - \hat{y}).

In PyTorch:

model = nn.Linear(64, 1)                         # 64 input features -> single logit

x = torch.randn(16, 64)                          # batch of 16 inputs
targets = torch.randint(0, 2, (16, 1)).float()   # binary labels, 0.0 or 1.0

logits = model(x)                                # raw logits, shape [16, 1]

loss_fn = nn.BCEWithLogitsLoss()                 # sigmoid + binary cross-entropy
loss = loss_fn(logits, targets)

print(loss)

The class BCEWithLogitsLoss combines the sigmoid operation and cross-entropy computation in a numerically stable form.
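As an illustrative check using the logits and targets from the example above, applying torch.sigmoid followed by nn.BCELoss should give nearly the same value; the combined class is preferred because computing the two steps separately can be numerically unstable for large-magnitude logits:

probs = torch.sigmoid(logits)               # probability of the positive class
manual_loss = nn.BCELoss()(probs, targets)

print(loss, manual_loss)                    # the two values should agree up to floating-point error

preds = (probs > 0.5).float()               # hard class predictions by thresholding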

The Role of the Dataset

A supervised model can only learn patterns present in its training data.

The dataset defines:

  • What the model sees
  • What patterns are learnable
  • What biases may appear
  • Which environments the model can generalize to

Training data quality is often more important than model complexity.

A dataset usually contains three splits:

Split | Purpose
Training set | Parameter learning
Validation set | Hyperparameter tuning
Test set | Final evaluation

The model learns only from the training set. Validation data helps select architectures and hyperparameters. Test data estimates real-world performance.
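A minimal sketch of creating these splits in PyTorch, assuming a small synthetic dataset and an 80/10/10 split (the sizes and proportions are illustrative):

import torch
from torch.utils.data import TensorDataset, random_split

X = torch.randn(1000, 3)
Y = torch.randn(1000, 1)
dataset = TensorDataset(X, Y)

# 800 training, 100 validation, 100 test examples
train_set, val_set, test_set = random_split(dataset, [800, 100, 100])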

Empirical Risk Minimization

Supervised learning is usually framed as minimizing expected loss.

The ideal objective is

\mathcal{R}(\theta) = \mathbb{E}_{(x,y)\sim p_{\text{data}}}[L(f_\theta(x), y)].

This quantity is called the population risk.

Because the true data distribution is unknown, we approximate it using the dataset:

\hat{\mathcal{R}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} L(f_\theta(x^{(i)}), y^{(i)}).

This quantity is called empirical risk.

Training attempts to find parameters

\theta^\ast = \arg\min_\theta \hat{\mathcal{R}}(\theta).

Gradient-based optimization algorithms approximate this minimization process.
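As an illustration, the empirical risk is simply the average per-example loss over the dataset; here is a sketch using the linear regression model from earlier, with synthetic data:

import torch
import torch.nn as nn

model = nn.Linear(3, 1)
X = torch.randn(100, 3)                    # N = 100 examples
Y = torch.randn(100, 1)

loss_fn = nn.MSELoss(reduction="mean")     # averaging over examples gives the empirical risk
empirical_risk = loss_fn(model(X), Y)

print(empirical_risk)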

Generalization

The central challenge in supervised learning is not memorizing training examples. The real challenge is generalization.

A model generalizes when it performs well on unseen data drawn from the same underlying distribution.

Two models may achieve near-zero training error, yet one may generalize far better than the other.

Generalization depends on many factors:

  • Dataset size
  • Noise levels
  • Model architecture
  • Optimization method
  • Regularization
  • Distribution mismatch

Deep learning systems are often heavily overparameterized, yet still generalize well in practice. Understanding why this occurs remains an active research area.

Batch Training

Modern supervised learning trains on mini-batches rather than individual examples.

Suppose:

X \in \mathbb{R}^{B \times d}

is a batch of inputs and

Y \in \mathbb{R}^{B}

contains the targets.

The model processes the entire batch simultaneously:

\hat{Y} = f_\theta(X).

Batch training improves computational efficiency because GPUs operate efficiently on large tensor operations.

In PyTorch:

for x_batch, y_batch in dataloader:
    optimizer.zero_grad()            # clear gradients from the previous step
    pred = model(x_batch)            # forward pass on the mini-batch
    loss = loss_fn(pred, y_batch)    # compute the batch loss
    loss.backward()                  # backpropagate gradients
    optimizer.step()                 # update the parameters

This loop performs one gradient update per mini-batch.
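For completeness, here is one way the objects used by this loop might be set up; the synthetic data, dimensions, batch size, and learning rate are illustrative assumptions:

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

model = nn.Linear(3, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

X = torch.randn(256, 3)
Y = torch.randn(256, 1)
dataloader = DataLoader(TensorDataset(X, Y), batch_size=32, shuffle=True)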

Supervised Learning in Deep Networks

Modern deep learning systems are usually supervised at large scale.

Examples include:

System | Supervised objective
ImageNet classifiers | Predict object category
Speech systems | Predict text transcription
Translation systems | Predict target sentence
Chat models | Predict next token
Recommendation systems | Predict user interaction

Even many self-supervised systems eventually rely on supervised fine-tuning for downstream tasks.

Large language models are often trained in multiple supervised stages:

  1. Self-supervised pretraining
  2. Supervised instruction tuning
  3. Preference optimization or reinforcement learning

Thus supervised learning remains fundamental even in modern foundation model pipelines.

Limitations of Supervised Learning

Supervised learning has several limitations.

First, labeled data is expensive. Human annotation may require domain experts, large budgets, and extensive quality control.

Second, supervised models may learn spurious correlations instead of causal structure.

Third, models assume that future data resembles training data. Distribution shift can severely reduce performance.

Fourth, labels themselves may be noisy, inconsistent, or biased.

Finally, supervised learning often struggles to learn from small datasets when models contain millions or billions of parameters.

These limitations motivated the development of self-supervised learning, transfer learning, few-shot learning, and reinforcement learning.

Summary

Supervised learning learns a mapping from inputs to targets using labeled examples.

The model produces predictions, a loss function measures prediction error, and optimization algorithms update the model parameters to reduce this error.

Regression predicts continuous values. Classification predicts discrete categories. Training minimizes empirical risk over the dataset. The ultimate goal is generalization to unseen data.

Most practical deep learning systems, including modern foundation models, rely heavily on supervised learning objectives and supervised fine-tuning procedures.