# Logistic Regression

Linear regression predicts a real number. Logistic regression predicts a probability for binary classification.

A binary classification problem has two possible labels:

$$
y \in \{0,1\}.
$$

Examples include spam versus not spam, fraud versus legitimate, disease versus no disease, and click versus no click.

The model receives an input vector

$$
x \in \mathbb{R}^d
$$

and predicts the probability that the label is 1.

$$
\hat{p} = P(y = 1 \mid x).
$$

### From Linear Score to Probability

Logistic regression first computes a linear score:

$$
z = w^\top x + b.
$$

The value \(z\) can be any real number. A probability, however, must lie between 0 and 1. Logistic regression converts the score into a probability using the sigmoid function:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}.
$$

The prediction is

$$
\hat{p} = \sigma(w^\top x + b).
$$

When \(z\) is large and positive, \(\hat{p}\) is close to 1. When \(z\) is large and negative, \(\hat{p}\) is close to 0. When \(z = 0\), \(\hat{p} = 0.5\).
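
A quick check of this behavior, evaluating the sigmoid at a few assumed values of \(z\):

```python
import torch

z = torch.tensor([-10.0, -2.0, 0.0, 2.0, 10.0])
print(torch.sigmoid(z))
# approximately [4.5e-05, 0.12, 0.50, 0.88, 1.00]
```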

### Decision Boundary

To turn a probability into a class prediction, we choose a threshold. The usual threshold is 0.5:

$$
\hat{y} =
\begin{cases}
1 & \text{if } \hat{p} \ge 0.5, \\
0 & \text{otherwise.}
\end{cases}
$$

Since \(\sigma(z) \ge 0.5\) exactly when \(z \ge 0\), the decision boundary is

$$
w^\top x + b = 0.
$$

This is a line in two dimensions, a plane in three dimensions, and a hyperplane in higher dimensions.

Thus logistic regression is a linear classifier. The probability is nonlinear because of the sigmoid function, but the decision boundary remains linear.
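
A small numerical sketch in two dimensions, with assumed weights \(w = (1, -2)\) and bias \(b = 0.5\), so the boundary is \(x_1 - 2x_2 + 0.5 = 0\):

```python
import torch

w = torch.tensor([1.0, -2.0])
b = 0.5

x_on_boundary = torch.tensor([0.5, 0.5])  # satisfies x1 - 2*x2 + 0.5 = 0
x_above       = torch.tensor([2.0, 0.0])  # z = 2.5 > 0, positive side

print(torch.sigmoid(w @ x_on_boundary + b))  # 0.5, exactly on the boundary
print(torch.sigmoid(w @ x_above + b))        # about 0.92, predicted class 1
```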

### Binary Cross-Entropy Loss

For binary classification, mean squared error is usually a poor choice: combined with a sigmoid output it gives a non-convex objective, and its gradient nearly vanishes when the prediction is confidently wrong. Logistic regression instead uses binary cross-entropy.

For one example, the loss is

$$
\ell(\hat{p}, y) =
-\left[
y\log(\hat{p}) + (1-y)\log(1-\hat{p})
\right].
$$

If \(y=1\), this becomes

$$
\ell(\hat{p}, 1) = -\log(\hat{p}).
$$

The loss is small when \(\hat{p}\) is close to 1 and large when \(\hat{p}\) is close to 0.

If \(y=0\), this becomes

$$
\ell(\hat{p}, 0) = -\log(1-\hat{p}).
$$

The loss is small when \(\hat{p}\) is close to 0 and large when \(\hat{p}\) is close to 1.

For a dataset with \(n\) examples:

$$
L(w,b) =
-\frac{1}{n}
\sum_{i=1}^{n}
\left[
y_i\log(\hat{p}_i)
+
(1-y_i)\log(1-\hat{p}_i)
\right].
$$
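
A sketch that computes this average directly and compares it with PyTorch's built-in `binary_cross_entropy`, using assumed probabilities and labels:

```python
import torch
import torch.nn.functional as F

p_hat = torch.tensor([0.9, 0.2, 0.7, 0.4])  # assumed predicted probabilities
y     = torch.tensor([1.0, 0.0, 1.0, 0.0])  # labels

# Average loss computed directly from the formula above.
manual = -(y * torch.log(p_hat) + (1 - y) * torch.log(1 - p_hat)).mean()

# Built-in binary cross-entropy on probabilities (mean reduction by default).
builtin = F.binary_cross_entropy(p_hat, y)

print(torch.allclose(manual, builtin))  # True
```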

### Maximum Likelihood View

Binary cross-entropy can be derived from probability.

For each example, logistic regression models the label as a Bernoulli random variable:

$$
y_i \sim \operatorname{Bernoulli}(\hat{p}_i).
$$

The probability of observing \(y_i\) is

$$
P(y_i \mid x_i) =
\hat{p}_i^{y_i}(1-\hat{p}_i)^{1-y_i}.
$$

For the full dataset, the likelihood is

$$
\prod_{i=1}^n
\hat{p}_i^{y_i}(1-\hat{p}_i)^{1-y_i}.
$$

Training by maximum likelihood means choosing parameters that make the observed labels as likely as possible. Taking the negative log of the likelihood turns the product into a sum and gives binary cross-entropy.
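
Written out, the negative log-likelihood is

$$
-\log \prod_{i=1}^{n} \hat{p}_i^{y_i}(1-\hat{p}_i)^{1-y_i}
=
-\sum_{i=1}^{n}
\left[
y_i\log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i)
\right],
$$

which is \(n\) times the average loss \(L(w,b)\) above, so minimizing one minimizes the other.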

This connection matters because it explains why the loss has logarithms. The model is not merely fitting numbers. It is fitting a conditional probability model.

### Gradients

Let

$$
z_i = w^\top x_i + b
$$

and

$$
\hat{p}_i = \sigma(z_i).
$$

For binary cross-entropy with sigmoid, the derivative of the loss with respect to the logit \(z_i\) has a simple form:

$$
\frac{\partial \ell_i}{\partial z_i} =
\hat{p}_i - y_i.
$$

Therefore,

$$
\nabla_w L =
\frac{1}{n}
X^\top(\hat{p} - y),
$$

and

$$
\nabla_b L =
\frac{1}{n}
\sum_{i=1}^{n}
(\hat{p}_i - y_i).
$$

The gradient is prediction minus target, multiplied by the input. This same pattern appears throughout deep learning.
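
A small sketch that checks these formulas against autograd, using assumed random data:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d = 8, 3
X = torch.randn(n, d)
y = torch.randint(0, 2, (n,)).float()

w = torch.zeros(d, requires_grad=True)
b = torch.zeros((), requires_grad=True)

# Autograd gradients of the average binary cross-entropy.
loss = F.binary_cross_entropy_with_logits(X @ w + b, y)
loss.backward()

# Gradients from the closed-form expressions above.
p_hat = torch.sigmoid(X @ w + b).detach()
print(torch.allclose(w.grad, X.T @ (p_hat - y) / n))  # True
print(torch.allclose(b.grad, (p_hat - y).mean()))     # True
```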

### Logits

In deep learning, the value before the sigmoid is called a logit.

$$
z = w^\top x + b.
$$

A logit is an unconstrained real number. A probability is constrained to the interval \([0,1]\).

Many PyTorch loss functions expect logits rather than probabilities because working in logit space is numerically safer: computing the sigmoid first and then taking its log can produce infinite values when the probability underflows to 0 or rounds to 1.

For binary classification, PyTorch provides:

```python
nn.BCEWithLogitsLoss()
```

This combines sigmoid and binary cross-entropy in one stable operation.

The recommended pattern is:

```python
logits = model(X)
loss = loss_fn(logits, y)
```

rather than:

```python
probs = torch.sigmoid(model(X))
loss = torch.nn.functional.binary_cross_entropy(probs, y)
```
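
A minimal illustration of the instability, using an assumed extreme logit: in float32 the sigmoid of a very negative logit underflows to 0, so the subsequent log becomes \(-\infty\), whereas a log-space computation stays finite:

```python
import torch
import torch.nn.functional as F

z = torch.tensor(-100.0)

unstable = torch.log(torch.sigmoid(z))  # -inf: sigmoid(-100) underflows to 0
stable   = F.logsigmoid(z)              # -100.0: evaluated in log space

print(unstable, stable)
```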

### Logistic Regression in PyTorch

A logistic regression model is a single linear layer producing one logit per example.

```python
import torch
from torch import nn

class LogisticRegression(nn.Module):
    def __init__(self, in_features):
        super().__init__()
        self.linear = nn.Linear(in_features, 1)  # one logit per example

    def forward(self, x):
        return self.linear(x).squeeze(-1)  # (B, 1) -> (B)
```

The model maps an input batch

$$
X \in \mathbb{R}^{B \times d}
$$

to logits

$$
z \in \mathbb{R}^{B}.
$$
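
A quick shape check with the class above, assuming a batch of 32 examples and 4 features:

```python
model = LogisticRegression(in_features=4)

X = torch.randn(32, 4)
logits = model(X)

print(logits.shape)  # torch.Size([32]): one logit per example
```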

A training loop:

```python
torch.manual_seed(0)

B = 256
d = 10

# Synthetic data drawn from a known logistic model.
X = torch.randn(B, d)
true_w = torch.randn(d)
true_b = -0.2

logits_true = X @ true_w + true_b
probs_true = torch.sigmoid(logits_true)
y = torch.bernoulli(probs_true)  # float labels in {0.0, 1.0}

model = LogisticRegression(d)

loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(200):
    logits = model(X)          # forward pass: one logit per example
    loss = loss_fn(logits, y)  # sigmoid + binary cross-entropy in one step

    optimizer.zero_grad()      # clear gradients from the previous iteration
    loss.backward()            # backpropagate
    optimizer.step()           # update w and b
```

The target tensor `y` must contain floating-point values 0.0 or 1.0. Here `torch.bernoulli` already returns floats, but labels loaded as integers need an explicit conversion:

```python
y = y.float()
```

This is required because binary cross-entropy is defined over real-valued targets, which also allows soft labels between 0 and 1.

### Computing Predictions

During inference, we convert logits to probabilities with sigmoid:

```python
with torch.no_grad():
    logits = model(X)
    probs = torch.sigmoid(logits)
    preds = (probs >= 0.5).long()
```

The threshold may be changed depending on the application.

For spam detection, false positives may be costly because legitimate emails end up in the spam folder. For disease screening, false negatives may be costly because a sick patient could be missed. The threshold should reflect the relative cost of each type of error.

### Evaluation Metrics

Accuracy measures the fraction of correct predictions:

$$
\text{accuracy} =
\frac{\text{number of correct predictions}}
{\text{number of examples}}.
$$
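
In code, accuracy is a single comparison followed by a mean; a minimal sketch with assumed predictions and labels:

```python
import torch

preds = torch.tensor([1, 0, 1, 1, 0])
y     = torch.tensor([1, 0, 0, 1, 0])

accuracy = (preds == y).float().mean()
print(accuracy.item())  # 0.8
```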

Accuracy is useful when classes are balanced. It can be misleading when classes are imbalanced.

Suppose only 1 percent of transactions are fraudulent. A classifier that always predicts “not fraud” has 99 percent accuracy, but detects no fraud.

For imbalanced classification, other metrics are often better:

| Metric | Meaning |
|---|---|
| Precision | Among predicted positives, how many are true positives |
| Recall | Among actual positives, how many are found |
| F1 score | Harmonic mean of precision and recall |
| ROC-AUC | Ranking quality across thresholds |
| PR-AUC | Precision-recall quality across thresholds |

Metric choice is part of model design. The loss is what the optimizer minimizes during training; the metric measures whether the model is actually useful for the task.
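
Precision, recall, and F1 can be computed directly from counts of true positives, false positives, and false negatives; a minimal sketch with assumed predictions, treating class 1 as positive:

```python
import torch

preds = torch.tensor([1, 0, 1, 1, 0, 1])
y     = torch.tensor([1, 0, 0, 1, 1, 1])

tp = ((preds == 1) & (y == 1)).sum().float()
fp = ((preds == 1) & (y == 0)).sum().float()
fn = ((preds == 0) & (y == 1)).sum().float()

precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

print(precision.item(), recall.item(), f1.item())  # 0.75 0.75 0.75
```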

### Feature Scaling

Logistic regression is sensitive to feature scale. If one feature has values around 10000 and another has values around 0.01, the loss surface is poorly conditioned and gradient descent converges slowly.

Standardization is usually helpful:

$$
x_j' = \frac{x_j - \mu_j}{\sigma_j}.
$$

After standardization, each feature has approximately zero mean and unit variance. This improves gradient behavior and makes regularization more consistent across features.
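
A minimal standardization sketch with assumed tensors; the mean and standard deviation are computed on the training set only and reused for new data:

```python
import torch

torch.manual_seed(0)
X_train = torch.randn(256, 10)
X_test  = torch.randn(64, 10)

# Per-feature statistics from the training set.
mu    = X_train.mean(dim=0)
sigma = X_train.std(dim=0)

X_train_std = (X_train - mu) / sigma
X_test_std  = (X_test - mu) / sigma  # reuse training statistics
```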

### Regularized Logistic Regression

Logistic regression is often trained with weight decay, also called L2 regularization.

The regularized objective is

$$
L_{\text{reg}}(w,b) =
L(w,b)
+
\lambda \lVert w \rVert_2^2.
$$

The regularization term discourages large weights. This can improve generalization, especially when the dataset is small or features are correlated.

In PyTorch, weight decay is usually passed to the optimizer:

```python
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    weight_decay=1e-2,
)
```

Some implementations avoid applying weight decay to bias parameters. This distinction becomes more important in larger neural networks.
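
One way to do this is with optimizer parameter groups; a sketch assuming the `LogisticRegression` model defined above, whose bias parameter name ends in `bias`:

```python
# Split parameters into those that receive weight decay and those that do not.
decay, no_decay = [], []
for name, param in model.named_parameters():
    (no_decay if name.endswith("bias") else decay).append(param)

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 1e-2},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=1e-3,
)
```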

### Multiclass Extension

Binary logistic regression handles two classes. For \(K\) classes, we use softmax regression, covered in the next section.

The binary model produces one logit:

$$
z \in \mathbb{R}.
$$

The multiclass model produces one logit per class:

$$
z \in \mathbb{R}^K.
$$

The sigmoid function becomes the softmax function, and binary cross-entropy becomes multiclass cross-entropy.

### Logistic Regression as a Neural Network

Logistic regression is a neural network with one linear layer and a sigmoid output. It has no hidden layers.

The architecture is simple:

$$
x
\longrightarrow
w^\top x + b
\longrightarrow
\sigma
\longrightarrow
\hat{p}.
$$

In practice, we keep the sigmoid outside the model during training and use `BCEWithLogitsLoss`.

This model introduces several ideas that remain central in deep learning: logits, probabilities, cross-entropy, thresholds, calibration, and classification metrics.

### Summary

Logistic regression is a linear model for binary classification. It computes a logit, converts it to a probability with the sigmoid function, and trains using binary cross-entropy.

Its decision boundary is linear, but its probabilistic interpretation is powerful. It is the simplest bridge from linear models to neural classifiers.

