
Supervised Learning

Supervised learning is the central paradigm of modern machine learning and deep learning. In supervised learning, a model learns a mapping from inputs to outputs using examples where the correct outputs are already known.

A supervised learning system receives pairs of data:

(x, y),

where x is the input and y is the target output or label.

The goal is to learn a function

f_\theta(x) \approx y,

where f_\theta is a parameterized model with parameters \theta. During training, the model adjusts its parameters so that its predictions become close to the true targets on the training data.

Supervised learning forms the basis of image classification, speech recognition, machine translation, spam detection, recommendation systems, medical diagnosis, and large language model fine-tuning.

The Learning Problem

Suppose we observe a dataset

\mathcal{D} = \{ (x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(N)}, y^{(N)}) \}.

Each pair contains an input example and its corresponding target.

Examples:

Task | Input x | Target y
Image classification | Image pixels | Class label
Speech recognition | Audio waveform | Text transcript
Translation | English sentence | French sentence
Sentiment analysis | Review text | Positive or negative label
House price prediction | House features | Numerical price

The model receives the input x and produces a prediction

\hat{y} = f_\theta(x).

The prediction is compared with the true target y. A loss function measures the difference between them. The training algorithm then modifies the parameters \theta to reduce this loss.

The full supervised learning pipeline is therefore

x \longrightarrow f_\theta(x) \longrightarrow \hat{y} \longrightarrow L(\hat{y}, y).
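For concreteness, here is a minimal sketch of this pipeline in PyTorch. The small linear model and mean squared error loss are illustrative choices, anticipating the examples later in this section:

import torch
import torch.nn as nn

# x -> f_theta(x) -> y_hat -> L(y_hat, y), for a single example
f_theta = nn.Linear(4, 1)        # a small parameterized model (4 features -> 1 output, illustrative)
x = torch.randn(1, 4)            # one input example
y = torch.randn(1, 1)            # its true target
y_hat = f_theta(x)               # prediction
loss = nn.MSELoss()(y_hat, y)    # loss measures the gap between prediction and target
print(loss)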

Inputs and Targets

Inputs may have many forms:

Data type | Typical tensor shape
Tabular data | [B, D]
Images | [B, C, H, W]
Text tokens | [B, T]
Audio spectrograms | [B, F, T]
Graph data | Node and edge tensors
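For example, dummy tensors with these shapes can be created in PyTorch; the batch size and dimensions below are arbitrary illustrative values:

import torch

B = 8                                         # batch size (arbitrary)
tabular = torch.randn(B, 20)                  # [B, D]
images = torch.randn(B, 3, 224, 224)          # [B, C, H, W]
tokens = torch.randint(0, 1000, (B, 128))     # [B, T], integer token ids
spectrograms = torch.randn(B, 80, 400)        # [B, F, T]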

Targets also vary by task.

For classification tasks, targets are often integer class labels:

y \in \{0, 1, \dots, K-1\}.

For regression tasks, targets are continuous values:

y \in \mathbb{R}.

For sequence tasks such as translation, the target may itself be a sequence:

y = (y_1, y_2, \dots, y_T).

The structure of the target determines the choice of model architecture and loss function.

Regression

Regression predicts continuous numerical values.

Examples include:

  • Predicting house prices
  • Forecasting temperatures
  • Estimating stock volatility
  • Predicting energy usage

Suppose the input vector is

x \in \mathbb{R}^d.

A linear regression model predicts

\hat{y} = w^\top x + b.

Here:

Symbol | Meaning
x | Input vector
w | Weight vector
b | Bias
\hat{y} | Predicted output

The model attempts to minimize prediction error across the dataset.

A common regression loss is mean squared error:

L(y, \hat{y}) = (y - \hat{y})^2.

In PyTorch:

import torch
import torch.nn as nn

model = nn.Linear(3, 1)        # linear model: 3 input features -> 1 output value

x = torch.randn(16, 3)         # batch of 16 input vectors
y = torch.randn(16, 1)         # batch of 16 continuous targets

pred = model(x)                # predictions, shape [16, 1]

loss_fn = nn.MSELoss()         # mean squared error
loss = loss_fn(pred, y)

print(loss)

The model outputs one scalar prediction for each example in the batch.
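A quick shape check on the snippet above confirms this:

print(pred.shape)  # torch.Size([16, 1]): one prediction per example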

Classification

Classification predicts discrete categories.

Examples include:

  • Identifying objects in images
  • Detecting spam emails
  • Recognizing diseases from scans
  • Predicting customer churn

Suppose there are K classes. The model produces a vector of scores called logits:

z \in \mathbb{R}^K.

The softmax function converts logits into probabilities:

p_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}.

Each probability satisfies

0 \le p_i \le 1,

and

\sum_{i=1}^{K} p_i = 1.

The predicted class is usually

\hat{y} = \arg\max_i p_i.

Cross-entropy loss is commonly used for classification:

L = -\log p_y,

where p_y is the predicted probability assigned to the correct class.

In PyTorch:

model = nn.Linear(128, 10)               # 128 input features -> 10 class logits

x = torch.randn(32, 128)                 # batch of 32 inputs
targets = torch.randint(0, 10, (32,))    # integer class labels in {0, ..., 9}

logits = model(x)                        # raw scores, shape [32, 10]

loss_fn = nn.CrossEntropyLoss()          # softmax + negative log-likelihood
loss = loss_fn(logits, targets)

print(loss)

The output tensor has shape

[batch_size, num_classes]

In this example:

[32, 10]
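To recover probabilities and predicted classes from these logits, the softmax and argmax operations above can be applied directly:

probs = logits.softmax(dim=1)    # probabilities, shape [32, 10], each row sums to 1
preds = probs.argmax(dim=1)      # predicted class per example, shape [32]

print(preds.shape)

Because softmax is monotonic, taking the argmax of the raw logits gives the same predicted classes.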

Binary Classification

Binary classification predicts one of two possible classes.

Examples include:

  • Fraud or non-fraud
  • Positive or negative sentiment
  • Benign or malignant tumor

The model produces a scalar logit z. The sigmoid function converts it into a probability:

\sigma(z) = \frac{1}{1 + e^{-z}}.

The output represents the estimated probability of the positive class.

Binary cross-entropy loss is often used:

L = -y \log \hat{y} - (1 - y)\log(1 - \hat{y}).

In PyTorch:

model = nn.Linear(64, 1)                         # 64 input features -> single logit

x = torch.randn(16, 64)                          # batch of 16 inputs
targets = torch.randint(0, 2, (16, 1)).float()   # binary labels, 0.0 or 1.0

logits = model(x)                                # raw logits, shape [16, 1]

loss_fn = nn.BCEWithLogitsLoss()                 # sigmoid + binary cross-entropy
loss = loss_fn(logits, targets)

print(loss)

The class BCEWithLogitsLoss combines the sigmoid operation and cross-entropy computation in a numerically stable form.
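As an illustrative check using the logits and targets from the example above, applying torch.sigmoid followed by nn.BCELoss should give nearly the same value; the combined class is preferred because computing the two steps separately can be numerically unstable for large-magnitude logits:

probs = torch.sigmoid(logits)               # probability of the positive class
manual_loss = nn.BCELoss()(probs, targets)

print(loss, manual_loss)                    # the two values should agree up to floating-point error

preds = (probs > 0.5).float()               # hard class predictions by thresholding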

The Role of the Dataset

A supervised model can only learn patterns present in its training data.

The dataset defines:

  • What the model sees
  • What patterns are learnable
  • What biases may appear
  • Which environments the model can generalize to

Training data quality is often more important than model complexity.

A dataset usually contains three splits:

Split | Purpose
Training set | Parameter learning
Validation set | Hyperparameter tuning
Test set | Final evaluation

The model learns only from the training set. Validation data helps select architectures and hyperparameters. Test data estimates real-world performance.
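A minimal sketch of creating these splits in PyTorch, assuming a small synthetic dataset and an 80/10/10 split (the sizes and proportions are illustrative):

import torch
from torch.utils.data import TensorDataset, random_split

X = torch.randn(1000, 3)
Y = torch.randn(1000, 1)
dataset = TensorDataset(X, Y)

# 800 training, 100 validation, 100 test examples
train_set, val_set, test_set = random_split(dataset, [800, 100, 100])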

Empirical Risk Minimization

Supervised learning is usually framed as minimizing expected loss.

The ideal objective is

\mathcal{R}(\theta) = \mathbb{E}_{(x,y)\sim p_{\text{data}}}[L(f_\theta(x), y)].

This quantity is called the population risk.

Because the true data distribution is unknown, we approximate it using the dataset:

\hat{\mathcal{R}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} L(f_\theta(x^{(i)}), y^{(i)}).

This quantity is called empirical risk.

Training attempts to find parameters

\theta^\ast = \arg\min_\theta \hat{\mathcal{R}}(\theta).

Gradient-based optimization algorithms approximate this minimization process.
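As an illustration, the empirical risk is simply the average per-example loss over the dataset; here is a sketch using the linear regression model from earlier, with synthetic data:

import torch
import torch.nn as nn

model = nn.Linear(3, 1)
X = torch.randn(100, 3)                    # N = 100 examples
Y = torch.randn(100, 1)

loss_fn = nn.MSELoss(reduction="mean")     # averaging over examples gives the empirical risk
empirical_risk = loss_fn(model(X), Y)

print(empirical_risk)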

Generalization

The central challenge in supervised learning is not memorizing training examples. The real challenge is generalization.

A model generalizes when it performs well on unseen data drawn from the same underlying distribution.

Two models may achieve near-zero training error, yet one may generalize far better than the other.

Generalization depends on many factors:

  • Dataset size
  • Noise levels
  • Model architecture
  • Optimization method
  • Regularization
  • Distribution mismatch

Deep learning systems are often heavily overparameterized, yet still generalize well in practice. Understanding why this occurs remains an active research area.

Batch Training

Modern supervised learning trains on mini-batches rather than individual examples.

Suppose:

X \in \mathbb{R}^{B \times d}

is a batch of inputs and

Y \in \mathbb{R}^{B}

contains the targets.

The model processes the entire batch simultaneously:

\hat{Y} = f_\theta(X).

Batch training improves computational efficiency because GPUs operate efficiently on large tensor operations.

In PyTorch:

for x_batch, y_batch in dataloader:
    optimizer.zero_grad()            # clear gradients from the previous step
    pred = model(x_batch)            # forward pass on the mini-batch
    loss = loss_fn(pred, y_batch)    # compute the batch loss
    loss.backward()                  # backpropagate gradients
    optimizer.step()                 # update the parameters

This loop performs one gradient update per mini-batch.
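For completeness, here is one way the objects used by this loop might be set up; the synthetic data, dimensions, batch size, and learning rate are illustrative assumptions:

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

model = nn.Linear(3, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

X = torch.randn(256, 3)
Y = torch.randn(256, 1)
dataloader = DataLoader(TensorDataset(X, Y), batch_size=32, shuffle=True)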

Supervised Learning in Deep Networks

Modern deep learning systems are usually supervised at large scale.

Examples include:

System | Supervised objective
ImageNet classifiers | Predict object category
Speech systems | Predict text transcription
Translation systems | Predict target sentence
Chat models | Predict next token
Recommendation systems | Predict user interaction

Even many self-supervised systems eventually rely on supervised fine-tuning for downstream tasks.

Large language models are often trained in multiple supervised stages:

  1. Self-supervised pretraining
  2. Supervised instruction tuning
  3. Preference optimization or reinforcement learning

Thus supervised learning remains fundamental even in modern foundation model pipelines.

Limitations of Supervised Learning

Supervised learning has several limitations.

First, labeled data is expensive. Human annotation may require domain experts, large budgets, and extensive quality control.

Second, supervised models may learn spurious correlations instead of causal structure.

Third, models assume that future data resembles training data. Distribution shift can severely reduce performance.

Fourth, labels themselves may be noisy, inconsistent, or biased.

Finally, supervised learning often struggles to learn from small datasets when models contain millions or billions of parameters.

These limitations motivated the development of self-supervised learning, transfer learning, few-shot learning, and reinforcement learning.

Summary

Supervised learning learns a mapping from inputs to targets using labeled examples.

The model produces predictions, a loss function measures prediction error, and optimization algorithms update the model parameters to reduce this error.

Regression predicts continuous values. Classification predicts discrete categories. Training minimizes empirical risk over the dataset. The ultimate goal is generalization to unseen data.

Most practical deep learning systems, including modern foundation models, rely heavily on supervised learning objectives and supervised fine-tuning procedures.