# Supervised Learning

Supervised learning is the central paradigm of modern machine learning and deep learning. In supervised learning, a model learns a mapping from inputs to outputs using examples where the correct outputs are already known.

A supervised learning system receives pairs of data:

$$
(x, y),
$$

where \(x\) is the input and \(y\) is the target output or label.

The goal is to learn a function

$$
f_\theta(x) \approx y,
$$

where \(f_\theta\) is a parameterized model with parameters \(\theta\). During training, the model adjusts its parameters so that its predictions become close to the true targets on the training data.

Supervised learning forms the basis of image classification, speech recognition, machine translation, spam detection, recommendation systems, medical diagnosis, and large language model fine-tuning.

### The Learning Problem

Suppose we observe a dataset

$$
\mathcal{D} =
\{
(x^{(1)}, y^{(1)}),
(x^{(2)}, y^{(2)}),
\dots,
(x^{(N)}, y^{(N)})
\}.
$$

Each pair contains an input example and its corresponding target.

Examples:

| Task | Input \(x\) | Target \(y\) |
|---|---|---|
| Image classification | Image pixels | Class label |
| Speech recognition | Audio waveform | Text transcript |
| Translation | English sentence | French sentence |
| Sentiment analysis | Review text | Positive or negative label |
| House price prediction | House features | Numerical price |

The model receives the input \(x\) and produces a prediction

$$
\hat{y} = f_\theta(x).
$$

The prediction is compared with the true target \(y\). A loss function measures the difference between them. The training algorithm then modifies the parameters \(\theta\) to reduce this loss.

The full supervised learning pipeline is therefore

$$
x
\longrightarrow
f_\theta(x)
\longrightarrow
\hat{y}
\longrightarrow
L(\hat{y}, y).
$$
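
This pipeline can be sketched in a few lines of PyTorch. The model, feature count, and batch size below are illustrative assumptions, not part of any particular task:

```python id="p1qv3a"
import torch
import torch.nn as nn

# Illustrative model f_theta: 4 input features -> 1 output
model = nn.Linear(4, 1)

x = torch.randn(8, 4)            # a batch of 8 inputs
y = torch.randn(8, 1)            # the corresponding true targets

y_hat = model(x)                 # prediction: y_hat = f_theta(x)
loss = nn.MSELoss()(y_hat, y)    # L(y_hat, y), a single scalar

print(loss.item())
```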

### Inputs and Targets

Inputs may have many forms:

| Data type | Typical tensor shape |
|---|---|
| Tabular data | \([B, D]\) |
| Images | \([B, C, H, W]\) |
| Text tokens | \([B, T]\) |
| Audio spectrograms | \([B, F, T]\) |
| Graph data | Node and edge tensors |

Targets also vary by task.

For classification tasks, targets are often integer class labels:

$$
y \in \{0,1,\dots,K-1\}.
$$

For regression tasks, targets are continuous values:

$$
y \in \mathbb{R}.
$$

For sequence tasks such as translation, the target may itself be a sequence:

$$
y = (y_1, y_2, \dots, y_T).
$$

The structure of the target determines the choice of model architecture and loss function.
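
As a concrete illustration, the same batch size yields differently shaped and typed target tensors depending on the task. The sizes below (10 classes, sequence length 20) are arbitrary examples:

```python id="t4wx8m"
import torch

B = 32  # batch size

# Classification: integer class indices in {0, ..., K-1}
y_cls = torch.randint(0, 10, (B,))      # shape [B], dtype int64

# Regression: continuous values
y_reg = torch.randn(B, 1)               # shape [B, 1], dtype float32

# Sequence targets: one class index per time step
y_seq = torch.randint(0, 10, (B, 20))   # shape [B, T]

print(y_cls.dtype, y_reg.dtype, tuple(y_seq.shape))
```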

### Regression

Regression predicts continuous numerical values.

Examples include:

- Predicting house prices
- Forecasting temperatures
- Estimating stock volatility
- Predicting energy usage

Suppose the input vector is

$$
x \in \mathbb{R}^d.
$$

A linear regression model predicts

$$
\hat{y} = w^\top x + b.
$$

Here:

| Symbol | Meaning |
|---|---|
| \(x\) | Input vector |
| \(w\) | Weight vector |
| \(b\) | Bias |
| \(\hat{y}\) | Predicted output |

The model attempts to minimize prediction error across the dataset.

A common regression loss is mean squared error:

$$
L(y,\hat{y}) = (y - \hat{y})^2.
$$

In PyTorch:

```python id="y9v1tq"
import torch
import torch.nn as nn

# Linear model: 3 input features -> 1 output
model = nn.Linear(3, 1)

x = torch.randn(16, 3)   # batch of 16 inputs
y = torch.randn(16, 1)   # batch of 16 targets

pred = model(x)          # predictions, shape [16, 1]

loss_fn = nn.MSELoss()
loss = loss_fn(pred, y)  # mean squared error over the batch

print(loss)
```

The model outputs one scalar prediction for each example in the batch.

### Classification

Classification predicts discrete categories.

Examples include:

- Identifying objects in images
- Detecting spam emails
- Recognizing diseases from scans
- Predicting customer churn

Suppose there are \(K\) classes. The model produces a vector of scores called logits:

$$
z \in \mathbb{R}^K.
$$

The softmax function converts logits into probabilities:

$$
p_i =
\frac{e^{z_i}}
{\sum_{j=1}^{K} e^{z_j}}.
$$

Each probability satisfies

$$
0 \le p_i \le 1,
$$

and

$$
\sum_{i=1}^{K} p_i = 1.
$$

The predicted class is usually

$$
\hat{y} = \arg\max_i p_i.
$$
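
These properties can be checked directly. The logits below are arbitrary example values for \(K = 3\) classes:

```python id="s6nd2r"
import torch

z = torch.tensor([2.0, 0.5, -1.0])   # logits for K = 3 classes

p = torch.softmax(z, dim=0)          # probabilities in [0, 1]
y_hat = p.argmax()                   # index of the most probable class

print(p, y_hat)
```

The probabilities always sum to 1 regardless of the logit values, because softmax normalizes by the sum of exponentials.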

Cross-entropy loss is commonly used for classification:

$$
L = -\log p_y,
$$

where \(p_y\) is the predicted probability assigned to the correct class.

In PyTorch:

```python id="8mkh6q"
# torch and nn are imported as in the previous example
model = nn.Linear(128, 10)   # 128 features -> 10 class logits

x = torch.randn(32, 128)
targets = torch.randint(0, 10, (32,))  # integer labels in {0, ..., 9}

logits = model(x)

# CrossEntropyLoss applies log-softmax internally, so it takes raw logits
loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits, targets)

print(loss)
```

The output tensor `logits` has shape `[batch_size, num_classes]`, which in this example is `[32, 10]`.
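
To recover discrete class predictions from the logits, take the argmax over the class dimension. The sketch below reuses the same illustrative shapes:

```python id="v2hc5k"
import torch
import torch.nn as nn

model = nn.Linear(128, 10)
x = torch.randn(32, 128)

logits = model(x)              # shape [32, 10]
preds = logits.argmax(dim=1)   # shape [32]: one class index per example

print(preds.shape)
```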

### Binary Classification

Binary classification predicts one of two possible classes.

Examples include:

- Fraud or non-fraud
- Positive or negative sentiment
- Benign or malignant tumor

The model produces a scalar logit \(z\). The sigmoid function converts it into a probability:

$$
\sigma(z) =
\frac{1}{1 + e^{-z}}.
$$

The output represents the estimated probability of the positive class.

Binary cross-entropy loss is often used:

$$
L =
-y\log \hat{y}
-(1-y)\log(1-\hat{y}).
$$

In PyTorch:

```python id="r8u2je"
model = nn.Linear(64, 1)   # one scalar logit per example

x = torch.randn(16, 64)
targets = torch.randint(0, 2, (16, 1)).float()  # 0/1 labels as floats

logits = model(x)

# BCEWithLogitsLoss applies the sigmoid internally, so it takes raw logits
loss_fn = nn.BCEWithLogitsLoss()
loss = loss_fn(logits, targets)

print(loss)
```

The class `BCEWithLogitsLoss` combines the sigmoid operation and cross-entropy computation in a numerically stable form.
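
At inference time, probabilities and hard predictions can be recovered by applying the sigmoid explicitly. The logits below are arbitrary examples, and the 0.5 threshold is a common default rather than a requirement:

```python id="h9fb4w"
import torch

logits = torch.tensor([[2.0], [-1.5], [0.1]])  # example logits

probs = torch.sigmoid(logits)     # probability of the positive class
preds = (probs > 0.5).float()     # hard 0/1 predictions

print(probs.squeeze(), preds.squeeze())
```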

### The Role of the Dataset

A supervised model can only learn patterns present in its training data.

The dataset defines:

- What the model sees
- What patterns are learnable
- What biases may appear
- Which environments the model can generalize to

Training data quality is often more important than model complexity.

A dataset usually contains three splits:

| Split | Purpose |
|---|---|
| Training set | Parameter learning |
| Validation set | Hyperparameter tuning |
| Test set | Final evaluation |

The model learns only from the training set. Validation data helps select architectures and hyperparameters. Test data estimates real-world performance.
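
One common way to create these splits in PyTorch is `torch.utils.data.random_split`. The synthetic data and the 80/10/10 proportions below are illustrative choices, not rules:

```python id="m7zk1e"
import torch
from torch.utils.data import TensorDataset, random_split

# A small synthetic dataset of 100 labeled examples
X = torch.randn(100, 8)
y = torch.randint(0, 2, (100,))
dataset = TensorDataset(X, y)

# Split into training, validation, and test sets
train_set, val_set, test_set = random_split(dataset, [80, 10, 10])

print(len(train_set), len(val_set), len(test_set))
```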

### Empirical Risk Minimization

Supervised learning is usually framed as minimizing expected loss.

The ideal objective is

$$
\mathcal{R}(\theta) =
\mathbb{E}_{(x,y)\sim p_{\text{data}}}
[L(f_\theta(x), y)].
$$

This quantity is called the population risk.

Because the true data distribution is unknown, we approximate it using the dataset:

$$
\hat{\mathcal{R}}(\theta) =
\frac{1}{N}
\sum_{i=1}^{N}
L(f_\theta(x^{(i)}), y^{(i)}).
$$

This quantity is called empirical risk.

Training attempts to find parameters

$$
\theta^\ast =
\arg\min_\theta
\hat{\mathcal{R}}(\theta).
$$

Gradient-based optimization algorithms approximate this minimization process.
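
Empirical risk is simply the average per-example loss over the dataset, so for a fixed model it can be computed directly. The model and data below are illustrative:

```python id="e5rj8n"
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
loss_fn = nn.MSELoss(reduction="none")   # keep per-example losses

X = torch.randn(100, 4)
Y = torch.randn(100, 1)

with torch.no_grad():
    per_example = loss_fn(model(X), Y)   # L(f_theta(x_i), y_i) for each i
    empirical_risk = per_example.mean()  # (1/N) * sum of the losses

print(empirical_risk.item())
```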

### Generalization

The central challenge in supervised learning is not memorizing training examples. The real challenge is generalization.

A model generalizes when it performs well on unseen data drawn from the same underlying distribution.

Two models may achieve near-zero training error, yet one may generalize far better than the other.

Generalization depends on many factors:

- Dataset size
- Noise levels
- Model architecture
- Optimization method
- Regularization
- Distribution mismatch

Deep learning systems are often heavily overparameterized, yet still generalize well in practice. Understanding why this occurs remains an active research area.

### Batch Training

Modern supervised learning trains on mini-batches rather than individual examples.

Suppose:

$$
X \in \mathbb{R}^{B \times d}
$$

is a batch of inputs and

$$
Y \in \mathbb{R}^{B}
$$

contains the targets.

The model processes the entire batch simultaneously:

$$
\hat{Y} = f_\theta(X).
$$

Batch training improves computational efficiency because GPUs are highly optimized for large, parallel tensor operations.

In PyTorch:

```python id="ybw1n7"
for x_batch, y_batch in dataloader:
    optimizer.zero_grad()          # clear gradients from the previous step
    pred = model(x_batch)          # forward pass on the full mini-batch
    loss = loss_fn(pred, y_batch)  # average loss over the batch
    loss.backward()                # backpropagate to compute gradients
    optimizer.step()               # update the parameters
```

This loop performs one gradient update per mini-batch.
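
A fully self-contained version of this loop, with the model, data, and optimizer set up explicitly, might look as follows. The synthetic regression task, sizes, and hyperparameters are all illustrative:

```python id="q3ts6y"
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

# Synthetic regression data: y is roughly the sum of the inputs plus noise
X = torch.randn(256, 4)
Y = X.sum(dim=1, keepdim=True) + 0.1 * torch.randn(256, 1)

dataloader = DataLoader(TensorDataset(X, Y), batch_size=32, shuffle=True)

model = nn.Linear(4, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(20):
    for x_batch, y_batch in dataloader:
        optimizer.zero_grad()
        pred = model(x_batch)
        loss = loss_fn(pred, y_batch)
        loss.backward()
        optimizer.step()

print(loss.item())  # loss on the final mini-batch
```

Because the data really are linear in the inputs, a few epochs of SGD drive the training loss close to the noise floor.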

### Supervised Learning in Deep Networks

Modern deep learning systems are usually supervised at large scale.

Examples include:

| System | Supervised objective |
|---|---|
| ImageNet classifiers | Predict object category |
| Speech systems | Predict text transcription |
| Translation systems | Predict target sentence |
| Chat models | Predict next token |
| Recommendation systems | Predict user interaction |

Even many self-supervised systems eventually rely on supervised fine-tuning for downstream tasks.

Large language models are often trained in multiple supervised stages:

1. Self-supervised pretraining  
2. Supervised instruction tuning  
3. Preference optimization or reinforcement learning  

Thus supervised learning remains fundamental even in modern foundation model pipelines.

### Limitations of Supervised Learning

Supervised learning has several limitations.

First, labeled data is expensive. Human annotation may require domain experts, large budgets, and extensive quality control.

Second, supervised models may learn spurious correlations instead of causal structure.

Third, models assume that future data resembles training data. Distribution shift can severely reduce performance.

Fourth, labels themselves may be noisy, inconsistent, or biased.

Finally, supervised learning often struggles to learn from small datasets when models contain millions or billions of parameters.

These limitations motivated the development of self-supervised learning, transfer learning, few-shot learning, and reinforcement learning.

### Summary

Supervised learning learns a mapping from inputs to targets using labeled examples.

The model produces predictions, a loss function measures prediction error, and optimization algorithms update the model parameters to reduce this error.

Regression predicts continuous values. Classification predicts discrete categories. Training minimizes empirical risk over the dataset. The ultimate goal is generalization to unseen data.

Most practical deep learning systems, including modern foundation models, rely heavily on supervised learning objectives and supervised fine-tuning procedures.

