# Logistic Regression

Logistic regression is a linear model that predicts a probability instead of a raw numerical value. Despite its name, it is used for classification, not regression.

For binary classification, each target belongs to one of two classes:

$$
y \in \{0, 1\}.
$$

The model receives an input vector

$$
x \in \mathbb{R}^{d}
$$

and computes a score:

$$
z = w^\top x + b.
$$

This score is called a logit. A logit can be any real number. To convert it into a probability, we apply the sigmoid function:

$$
\hat{p} = \sigma(z) = \frac{1}{1 + e^{-z}}.
$$

The output \(\hat{p}\) lies between 0 and 1. It can be interpreted as the model’s estimated probability that the input belongs to class 1.
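
A concrete toy computation ties these pieces together. The weights below are hand-picked, purely illustrative values:

```python
import torch

w = torch.tensor([0.5, -1.0])  # illustrative weights
b = torch.tensor(0.25)         # illustrative bias
x = torch.tensor([2.0, 1.0])   # one input vector

z = w @ x + b          # logit: 0.5*2 - 1.0*1 + 0.25 = 0.25
p = torch.sigmoid(z)   # estimated probability of class 1

print(z.item(), p.item())
# 0.25 0.562...
```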

### From Linear Scores to Probabilities

Linear regression produces an unrestricted value:

$$
\hat{y} = w^\top x + b.
$$

For classification, this is inconvenient because probabilities must lie in the interval \([0,1]\). Logistic regression solves this by passing the linear score through the sigmoid function.

When \(z\) is large and positive, \(\sigma(z)\) is close to 1. When \(z\) is large and negative, \(\sigma(z)\) is close to 0. When \(z = 0\), \(\sigma(z) = 0.5\).

In PyTorch:

```python
import torch

z = torch.tensor([-4.0, 0.0, 4.0])
p = torch.sigmoid(z)

print(p)
# tensor([0.0180, 0.5000, 0.9820])
```

A decision threshold converts the probability into a class prediction:

```python
pred = (p >= 0.5).long()
print(pred)
# tensor([0, 1, 1])
```

The threshold 0.5 is common, but it is not mandatory. In imbalanced or high-risk tasks, a different threshold can trade false positives against false negatives more appropriately, as the short example below shows.
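
The probabilities here are made up for illustration:

```python
import torch

p = torch.tensor([0.15, 0.40, 0.90])  # illustrative probabilities

print((p >= 0.5).long())  # default threshold
# tensor([0, 0, 1])

print((p >= 0.3).long())  # a lower threshold flags more positives
# tensor([0, 1, 1])
```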

### The Decision Boundary

Logistic regression is still a linear classifier. The decision boundary is the set of points where the model is exactly uncertain:

$$
\sigma(w^\top x + b) = 0.5.
$$

Since \(\sigma(0)=0.5\), this is equivalent to

$$
w^\top x + b = 0.
$$

This equation defines a hyperplane: a line in two dimensions, a plane in three, and in general a \((d-1)\)-dimensional hyperplane.

The sigmoid converts distance from the hyperplane into probability. Points far on one side receive probability close to 1. Points far on the other side receive probability close to 0.
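
A small sketch with hand-picked, illustrative parameters makes this concrete. The boundary here is the line \(x_1 + x_2 = 0\), and points on either side of it receive probabilities on either side of 0.5:

```python
import torch

# Illustrative parameters: the decision boundary is the line x1 + x2 = 0.
w = torch.tensor([1.0, 1.0])
b = torch.tensor(0.0)

X = torch.tensor([
    [2.0, 2.0],    # far on the positive side
    [0.0, 0.0],    # exactly on the boundary
    [-2.0, -2.0],  # far on the negative side
])

z = X @ w + b
print(torch.sigmoid(z))
# tensor([0.9820, 0.5000, 0.0180])
```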

### Binary Cross-Entropy Loss

Mean squared error can be used to train logistic regression, but it is usually the wrong objective. The standard loss is binary cross-entropy:

$$
L =
-\frac{1}{N}
\sum_{i=1}^{N}
\left[
y_i\log(\hat{p}_i)
+
(1-y_i)\log(1-\hat{p}_i)
\right].
$$

If \(y_i=1\), the loss becomes

$$
-\log(\hat{p}_i).
$$

The model is penalized when it assigns low probability to class 1.

If \(y_i=0\), the loss becomes

$$
-\log(1-\hat{p}_i).
$$

The model is penalized when it assigns high probability to class 1.
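
The shape of these penalties is easy to inspect numerically: a confident wrong prediction costs far more than an uncertain one. The probabilities below are arbitrary illustrative values:

```python
import torch

p_hat = torch.tensor([0.99, 0.50, 0.01])  # illustrative predicted probabilities

print(-torch.log(p_hat))      # penalty when y = 1
# tensor([0.0101, 0.6931, 4.6052])

print(-torch.log(1 - p_hat))  # penalty when y = 0
# tensor([4.6052, 0.6931, 0.0101])
```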

This loss corresponds to maximum likelihood under a Bernoulli model:

$$
p(y\mid x) =
\hat{p}^{y}(1-\hat{p})^{1-y}.
$$

Taking the negative log-likelihood gives binary cross-entropy.
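
As a sanity check, the formula written out by hand matches PyTorch's built-in binary cross-entropy (again with arbitrary illustrative values):

```python
import torch
import torch.nn.functional as F

p_hat = torch.tensor([0.9, 0.2, 0.7])
y = torch.tensor([1.0, 0.0, 1.0])

# Negative log-likelihood of the Bernoulli model, averaged over examples.
manual = -(y * torch.log(p_hat) + (1 - y) * torch.log(1 - p_hat)).mean()

print(manual)                            # tensor(0.2284)
print(F.binary_cross_entropy(p_hat, y))  # tensor(0.2284)
```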

### Stable Loss Computation in PyTorch

In PyTorch, the preferred loss for binary logistic regression is `nn.BCEWithLogitsLoss`.

This loss expects logits, not sigmoid probabilities. It combines the sigmoid operation and binary cross-entropy in one numerically stable function.

```python
import torch
from torch import nn

logits = torch.tensor([-4.0, 0.0, 4.0])
targets = torch.tensor([0.0, 1.0, 1.0])

loss_fn = nn.BCEWithLogitsLoss()
loss = loss_fn(logits, targets)

print(loss)
```

Avoid this pattern during training:

```python
p = torch.sigmoid(logits)
loss = nn.BCELoss()(p, targets)
```

It is mathematically valid, but less numerically stable for very large positive or negative logits.
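
The failure mode is easy to see with the raw formula. For a large negative logit, the float32 sigmoid underflows to exactly zero and the log blows up; the fused loss stays finite:

```python
import torch
from torch import nn

z = torch.tensor([-100.0])
y = torch.tensor([1.0])

p = torch.sigmoid(z)  # underflows to exactly 0.0 in float32
print(-torch.log(p))  # tensor([inf]) -- the naive formula breaks

print(nn.BCEWithLogitsLoss()(z, y))
# tensor(100.) -- finite, because the loss never forms sigmoid(z) explicitly
```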

### Logistic Regression with `nn.Linear`

A binary logistic regression model can be implemented as a linear layer with one output:

```python
import torch
from torch import nn

model = nn.Linear(in_features=4, out_features=1)

X = torch.randn(32, 4)
logits = model(X).squeeze(-1)

print(logits.shape)
# torch.Size([32])
```

The model returns logits. During training, we pass these logits directly into `BCEWithLogitsLoss`.

```python
targets = torch.randint(0, 2, (32,)).float()

loss_fn = nn.BCEWithLogitsLoss()
loss = loss_fn(logits, targets)
```

During inference, we convert logits to probabilities:

```python
probs = torch.sigmoid(logits)
preds = (probs >= 0.5).long()
```

### Full Training Example

The following example creates a synthetic binary classification dataset and trains logistic regression.

```python
import torch
from torch import nn

torch.manual_seed(0)

N = 1000
d = 2

X0 = torch.randn(N // 2, d) + torch.tensor([-2.0, -2.0])
X1 = torch.randn(N // 2, d) + torch.tensor([2.0, 2.0])

X = torch.cat([X0, X1], dim=0)
y = torch.cat([
    torch.zeros(N // 2),
    torch.ones(N // 2),
], dim=0)

model = nn.Linear(d, 1)
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(200):
    logits = model(X).squeeze(-1)
    loss = loss_fn(logits, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

with torch.no_grad():
    logits = model(X).squeeze(-1)
    probs = torch.sigmoid(logits)
    preds = (probs >= 0.5).long()
    accuracy = (preds == y.long()).float().mean()

print("loss:", loss.item())
print("accuracy:", accuracy.item())
```

The training loop is the same pattern used for larger neural networks: compute logits, compute loss, backpropagate, and update parameters.

### Gradients of Logistic Regression

For one example, let

$$
z = w^\top x + b,
\quad
\hat{p} = \sigma(z).
$$

The binary cross-entropy loss is

$$
\ell =
-\left[
y\log(\hat{p}) + (1-y)\log(1-\hat{p})
\right].
$$

A useful result, which follows from the chain rule and the sigmoid identity \(\sigma'(z) = \sigma(z)\,(1 - \sigma(z))\), is

$$
\frac{\partial \ell}{\partial z} =
\hat{p} - y.
$$

Therefore,

$$
\nabla_w \ell =
(\hat{p}-y)x,
$$

and

$$
\frac{\partial \ell}{\partial b} =
\hat{p}-y.
$$

This has a simple interpretation. The gradient is driven by prediction error. If the model predicts a probability larger than the target, \(\hat{p}-y\) is positive. If it predicts a probability smaller than the target, the term is negative.

For a batch:

$$
\nabla_w L =
\frac{1}{N}
X^\top(\hat{p}-y).
$$
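
We can check the batch formula against autograd. This is a small sketch on random data; the name `manual_grad` is ours:

```python
import torch
from torch import nn

torch.manual_seed(0)

X = torch.randn(8, 3)
y = torch.randint(0, 2, (8,)).float()

model = nn.Linear(3, 1)
loss = nn.BCEWithLogitsLoss()(model(X).squeeze(-1), y)
loss.backward()

# Closed-form batch gradient: X^T (p_hat - y) / N
with torch.no_grad():
    p_hat = torch.sigmoid(model(X).squeeze(-1))
    manual_grad = X.t() @ (p_hat - y) / X.shape[0]

print(torch.allclose(model.weight.grad.squeeze(0), manual_grad, atol=1e-6))
# True
```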

The same structure appears in many neural network classifiers: logits produce probabilities, probabilities are compared with labels, and the resulting error signal flows backward through the model.

### Multi-Label Classification

Binary logistic regression extends naturally to multi-label classification, where each example may belong to several classes at once.

For example, an image may contain `cat`, `car`, and `tree` at the same time. Each class receives an independent binary label.

If there are \(K\) labels, the model outputs \(K\) logits:

$$
z \in \mathbb{R}^{K}.
$$

Each logit is passed through a sigmoid:

$$
\hat{p}_k = \sigma(z_k).
$$

In PyTorch:

```python
import torch
from torch import nn

model = nn.Linear(in_features=128, out_features=10)  # one logit per label

X = torch.randn(32, 128)
targets = torch.randint(0, 2, (32, 10)).float()  # independent 0/1 per label

logits = model(X)
loss = nn.BCEWithLogitsLoss()(logits, targets)  # sigmoid applied per logit
```

This differs from multi-class classification, where each example belongs to exactly one class and the model usually uses softmax cross-entropy.
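
For contrast, here is a minimal multi-class sketch with the same illustrative shapes. The targets are class indices rather than independent binary labels, and `nn.CrossEntropyLoss` applies the softmax internally:

```python
import torch
from torch import nn

model = nn.Linear(in_features=128, out_features=10)

X = torch.randn(32, 128)
targets = torch.randint(0, 10, (32,))  # one class index per example

logits = model(X)  # shape (32, 10)
loss = nn.CrossEntropyLoss()(logits, targets)
```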

### Logistic Regression as a Neural Network

Logistic regression is a neural network with one linear layer followed by a sigmoid:

$$
x \mapsto \sigma(w^\top x+b).
$$

During training in PyTorch, we usually keep the sigmoid inside the loss function by using `BCEWithLogitsLoss`. During inference, we apply the sigmoid explicitly to obtain probabilities.

This model is limited because its decision boundary is linear. It cannot separate data that requires nonlinear boundaries unless the input features already contain useful nonlinear transformations.

A multilayer neural network generalizes logistic regression by replacing the raw input with learned hidden features:

$$
h = \phi(W_1x+b_1),
$$

$$
z = w_2^\top h + b_2.
$$

The final layer is still often a logistic regression head. The difference is that earlier layers learn a representation where the classes become easier to separate.
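
In PyTorch this generalization is a few lines. The hidden width and the choice of ReLU for \(\phi\) below are illustrative:

```python
import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(2, 16),  # h = phi(W1 x + b1) ...
    nn.ReLU(),         # ... with phi = ReLU here
    nn.Linear(16, 1),  # z = w2^T h + b2, a single logit
)

X = torch.randn(32, 2)
logits = model(X).squeeze(-1)  # trained with BCEWithLogitsLoss, exactly as before
```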

### Summary

Logistic regression is a linear classifier for binary classification. It computes a logit with a linear layer and converts the logit into a probability with the sigmoid function.

The standard training loss is binary cross-entropy. In PyTorch, `nn.BCEWithLogitsLoss` should usually be preferred because it combines sigmoid and cross-entropy in a numerically stable way.

The model is simple, but it introduces concepts used throughout classification: logits, probabilities, decision thresholds, cross-entropy, and classification accuracy.

