# Multi-Task Objectives

Multi-task learning trains one model on several objectives at the same time. The model may predict several targets from the same input, share a common representation across related tasks, or combine supervised, self-supervised, and auxiliary losses.

The basic idea is that related tasks can provide useful training signal to each other. A model trained for only one objective may learn narrow features. A model trained across related objectives can learn representations that generalize better.

For example, an autonomous driving model may jointly predict lane boundaries, object boxes, semantic segmentation masks, depth, and future trajectories. A language model may combine next-token prediction, instruction following, preference modeling, retrieval supervision, and tool-use objectives.

### Basic Form

Suppose a model has parameters \(\theta\) and is trained on \(T\) tasks. Each task has its own loss:

$$
L_1(\theta), L_2(\theta), \ldots, L_T(\theta).
$$

A simple multi-task objective is a weighted sum:

$$
L(\theta) =
\sum_{t=1}^{T}
\lambda_t L_t(\theta).
$$

Here \(\lambda_t\) is the weight assigned to task \(t\).

The weights control the relative influence of each task. If \(\lambda_t\) is large, task \(t\) contributes strongly to the gradient. If \(\lambda_t\) is small, it has weaker influence.

In PyTorch:

```python
loss = (
    lambda_cls * classification_loss
    + lambda_reg * regression_loss
    + lambda_aux * auxiliary_loss
)
```

Then training uses the ordinary pattern:

```python
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The final scalar loss is backpropagated through all model components that contributed to it.

### Shared Trunk and Task Heads

A common architecture uses a shared trunk and several task-specific heads.

The shared trunk computes a representation:

$$
h = f_\theta(x).
$$

Each task head produces a task-specific output:

$$
\hat{y}_t = g_t(h).
$$

The full model can be written as

$$
\hat{y}_t = g_t(f_\theta(x)).
$$

The trunk learns features useful across tasks. The heads specialize those features for particular targets.

In PyTorch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskModel(nn.Module):
    def __init__(self, in_features, hidden_features, num_classes):
        super().__init__()

        self.trunk = nn.Sequential(
            nn.Linear(in_features, hidden_features),
            nn.ReLU(),
            nn.Linear(hidden_features, hidden_features),
            nn.ReLU(),
        )

        self.class_head = nn.Linear(hidden_features, num_classes)
        self.reg_head = nn.Linear(hidden_features, 1)

    def forward(self, x):
        h = self.trunk(x)
        class_logits = self.class_head(h)
        value = self.reg_head(h)
        return class_logits, value
```

Training:

```python
model = MultiTaskModel(
    in_features=20,
    hidden_features=128,
    num_classes=5,
)

x = torch.randn(32, 20)
class_target = torch.randint(0, 5, (32,))
value_target = torch.randn(32, 1)

class_logits, value_pred = model(x)

classification_loss = F.cross_entropy(class_logits, class_target)
regression_loss = F.mse_loss(value_pred, value_target)

loss = classification_loss + 0.1 * regression_loss
```

The classification head receives gradients from cross-entropy. The regression head receives gradients from MSE. The shared trunk receives gradients from both.
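This gradient flow can be checked directly. The sketch below uses a minimal stand-in trunk and two heads (not the `MultiTaskModel` above) and verifies, after one backward pass on the combined loss, that each head has gradients while the trunk accumulates contributions from both tasks:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal trunk with two heads, to inspect where gradients land.
trunk = nn.Linear(4, 8)
head_cls = nn.Linear(8, 3)
head_reg = nn.Linear(8, 1)

x = torch.randn(6, 4)
h = trunk(x)
loss_cls = F.cross_entropy(head_cls(h), torch.randint(0, 3, (6,)))
loss_reg = F.mse_loss(head_reg(h), torch.randn(6, 1))

(loss_cls + loss_reg).backward()

# Each head receives gradients only from its own loss; the trunk
# accumulates gradient contributions from both losses.
print(trunk.weight.grad.shape)  # torch.Size([8, 4])
```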

### Why Loss Scales Matter

Loss terms may have very different numerical scales.

For example, a cross-entropy loss may be near \(1.0\), while an unnormalized MSE loss may be near \(10000.0\). If we simply add them, the MSE term can dominate training.

This does not necessarily mean that the regression task is more important. It may only mean that the target variable has a larger numerical scale.

Suppose

$$
L_{\text{cls}} = 0.8,
\qquad
L_{\text{reg}} = 5000.
$$

Then

$$
L = L_{\text{cls}} + L_{\text{reg}}
$$

will behave almost entirely like the regression loss.

A common correction is to normalize targets, normalize losses, or introduce task weights:

$$
L = L_{\text{cls}} + \lambda L_{\text{reg}}.
$$

For example:

```python
loss = classification_loss + 0.001 * regression_loss
```

Choosing \(\lambda\) is a central practical problem in multi-task learning.

### Task Weighting

Task weights can be fixed manually, tuned with validation data, or learned during training.

Manual weighting is simple:

```python
loss = (
    1.0 * loss_main
    + 0.3 * loss_aux
    + 0.1 * loss_reg
)
```

This works when the losses are well understood and their scales are stable.

A more adaptive approach is uncertainty weighting. For a task with observation noise \(\sigma_t^2\), the objective may include a learned scale:

$$
L(\theta) =
\sum_t
\left(
\frac{1}{2\sigma_t^2}L_t(\theta)
+
\log \sigma_t
\right).
$$

The model learns how much uncertainty to assign to each task. High-uncertainty tasks receive smaller effective weights.

In practice, we usually optimize log variances for numerical stability:

```python
class WeightedMultiTaskLoss(nn.Module):
    def __init__(self, num_tasks):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, losses):
        # exp(-log_var) * loss + log_var equals (1/sigma^2) * L + log(sigma^2),
        # which is twice the objective above; the constant factor does not
        # change the optimum.
        total = 0.0
        for i, loss in enumerate(losses):
            precision = torch.exp(-self.log_vars[i])
            total = total + precision * loss + self.log_vars[i]
        return total
```

This method can reduce manual tuning, but it should still be monitored. A task can receive too little weight if its loss is noisy or poorly specified.

### Auxiliary Losses

An auxiliary loss supports the main objective without being the final task of interest.

For example, an image model may use a main classification loss and an auxiliary rotation-prediction loss. A sequence model may use next-token prediction plus an auxiliary sentence-order objective. A reinforcement learning agent may use a value loss, policy loss, entropy loss, and representation loss.

The objective has the form

$$
L = L_{\text{main}} + \lambda L_{\text{aux}}.
$$

Auxiliary losses can help by improving representation learning, stabilizing gradients, adding supervision to intermediate layers, or discouraging degenerate solutions.

Example:

```python
main_loss = F.cross_entropy(main_logits, labels)
aux_loss = F.cross_entropy(aux_logits, aux_labels)

loss = main_loss + 0.2 * aux_loss
```

During inference, the auxiliary head may be discarded if it is only a training aid.

### Deep Supervision

Deep supervision attaches losses to intermediate layers, not only the final output.

Suppose a network produces intermediate predictions:

$$
\hat{y}^{(1)}, \hat{y}^{(2)}, \ldots, \hat{y}^{(K)}.
$$

The loss may be

$$
L =
L_{\text{final}}
+
\sum_{k=1}^{K-1}
\lambda_k L^{(k)}.
$$

This gives earlier layers direct gradient signal.

Deep supervision is common in segmentation, detection, and very deep architectures. It can improve optimization because early layers do not depend only on gradients passing through the entire network.

Example:

```python
loss_final = F.cross_entropy(final_logits, target)
loss_mid = F.cross_entropy(mid_logits, target)

loss = loss_final + 0.4 * loss_mid
```

At inference time, the final output is often used alone.

### Multi-Task Learning with Missing Labels

In real datasets, not every example has labels for every task.

For example, an image may have a class label but no segmentation mask. Another image may have a segmentation mask but no bounding boxes.

The objective should include only the losses whose labels are available.

Let \(m_{i,t}\in\{0,1\}\) indicate whether example \(i\) has a label for task \(t\). A masked multi-task loss can be written as

$$
L =
\sum_t
\lambda_t
\frac{
\sum_i m_{i,t} L_{i,t}
}{
\sum_i m_{i,t} + \epsilon
}.
$$

In PyTorch:

```python
per_example_loss = F.cross_entropy(
    logits,
    targets,
    reduction="none",
)

mask = has_label.float()
loss = (per_example_loss * mask).sum() / mask.sum().clamp_min(1.0)
```

This prevents missing labels from contributing invalid loss values.

For regression:

```python
# pred and target have shape (N, 1); squeeze to per-example errors of shape (N,)
errors = (pred - target).pow(2).squeeze(-1)
loss = (errors * mask).sum() / mask.sum().clamp_min(1.0)
```

Masking is essential in partially labeled datasets.

### Gradient Interference

Multi-task learning can fail when task gradients conflict.

For two task losses \(L_a\) and \(L_b\), their gradients with respect to shared parameters are

$$
g_a = \nabla_\theta L_a,
\qquad
g_b = \nabla_\theta L_b.
$$

If the dot product is negative,

$$
g_a^\top g_b < 0,
$$

then the tasks disagree locally. A gradient step that helps one task may hurt the other.

This is called gradient interference or negative transfer.

Negative transfer occurs when tasks are unrelated, loss weights are poor, one task is much noisier than another, or the shared trunk lacks enough capacity.

A simple diagnostic is to monitor validation metrics for each task. If adding a task improves its own metric but degrades another task, the tasks may be interfering.
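The dot-product check can be made concrete by measuring the cosine similarity between per-task gradients on shared parameters. A minimal sketch (the helper `grad_cosine` and the toy losses are illustrative, not a standard API):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def grad_cosine(params, loss_a, loss_b):
    """Cosine similarity between two task gradients over shared parameters."""
    g_a = torch.autograd.grad(loss_a, params, retain_graph=True)
    g_b = torch.autograd.grad(loss_b, params, retain_graph=True)
    flat_a = torch.cat([g.flatten() for g in g_a])
    flat_b = torch.cat([g.flatten() for g in g_b])
    return F.cosine_similarity(flat_a, flat_b, dim=0).item()

trunk = nn.Linear(4, 2)
out = trunk(torch.randn(8, 4))
cos = grad_cosine(
    list(trunk.parameters()),
    out[:, 0].pow(2).mean(),  # toy "task a" loss
    out[:, 1].pow(2).mean(),  # toy "task b" loss
)
# cos < 0 would indicate locally conflicting tasks
```

Periodically logging this value during training gives a direct signal of interference on the shared trunk.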

### Gradient Balancing

Several methods try to balance multi-task gradients.

One simple approach is to normalize losses so they have similar magnitudes. Another approach is to normalize gradient norms. A task with an extremely large gradient may dominate shared parameters.

A gradient norm for task \(t\) is

$$
\|g_t\|_2.
$$

If one task has much larger gradient norms, it may control the shared trunk.

In practice, full gradient balancing can be expensive because it may require separate backward passes for each task. But a simpler approximation is often enough: normalize target scales, inspect loss magnitudes, and tune weights.

More advanced methods modify gradients directly. For example, when two gradients conflict, one may project part of a gradient away from another task’s direction. These methods can help, but they add implementation complexity and are not always better than careful task design.
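As a sketch of the projection idea, assuming the two task gradients have already been flattened into vectors (this is a simplified single-pair version in the style of PCGrad, not a full implementation):

```python
import torch

def project_conflict(g_a: torch.Tensor, g_b: torch.Tensor) -> torch.Tensor:
    """If g_a conflicts with g_b (negative dot product), remove the
    component of g_a along g_b; otherwise return g_a unchanged."""
    dot = torch.dot(g_a, g_b)
    if dot < 0:
        g_a = g_a - (dot / g_b.dot(g_b).clamp_min(1e-12)) * g_b
    return g_a

g_a = torch.tensor([1.0, -1.0])
g_b = torch.tensor([0.0, 1.0])
g_a_proj = project_conflict(g_a, g_b)  # conflict removed: tensor([1., 0.])
```

After projection, the resulting gradient no longer has a component opposing the other task's direction.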

### Task Sampling

In some systems, tasks use different datasets. The training loop must choose which task to sample at each step.

A simple strategy samples tasks uniformly. Another samples tasks in proportion to dataset size. A third uses temperature-based sampling to avoid letting very large datasets dominate.

If task \(t\) has \(n_t\) examples, proportional sampling uses

$$
P(t) =
\frac{n_t}{\sum_j n_j}.
$$

Temperature sampling uses

$$
P(t) =
\frac{n_t^\alpha}{\sum_j n_j^\alpha},
$$

where \(0 \leq \alpha \leq 1\). When \(\alpha=1\), sampling is proportional to size. When \(\alpha=0\), sampling is uniform over tasks.

This is useful when one task has millions of examples and another has only thousands.
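A sketch of temperature-based sampling (the dataset sizes below are hypothetical):

```python
import torch

def task_probs(sizes, alpha=0.5):
    """Sampling probabilities P(t) proportional to n_t ** alpha."""
    n = torch.tensor(sizes, dtype=torch.float)
    p = n.pow(alpha)
    return p / p.sum()

# Hypothetical sizes: one very large task, one small task.
probs = task_probs([1_000_000, 1_000], alpha=0.5)
task = torch.multinomial(probs, num_samples=1).item()  # sampled task index
```

With `alpha=0.5`, the small task is sampled far more often than its raw share of the data would allow, without being sampled as often as the large task.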

### Multi-Objective Evaluation

A multi-task model should be evaluated per task, not only by the total training loss.

The weighted training objective is an optimization tool. It may not correspond directly to what matters in deployment.

For example, a model may have

$$
L = L_{\text{classification}} + 0.1L_{\text{regression}}.
$$

The number \(0.1\) was chosen for training stability. It does not mean the regression task is ten times less important in the real system.

A proper evaluation report includes separate metrics:

| Task | Loss | Metric |
|---|---|---|
| Classification | Cross-entropy | Accuracy, F1, calibration |
| Regression | MSE | RMSE, MAE |
| Segmentation | Cross-entropy, Dice | IoU, Dice score |
| Detection | Box loss, class loss | mAP |
| Retrieval | Contrastive loss | Recall@k, MRR |

Do not select models solely by the summed training loss unless that sum exactly matches the application’s utility.

### Multi-Task Objectives in Detection

Object detection is a standard example of multi-task learning. A detector usually predicts both object categories and bounding boxes.

The loss may have several terms:

$$
L =
L_{\text{cls}}
+
\lambda_{\text{box}}L_{\text{box}}
+
\lambda_{\text{obj}}L_{\text{obj}}.
$$

Here:

| Term | Purpose |
|---|---|
| \(L_{\text{cls}}\) | Class prediction |
| \(L_{\text{box}}\) | Bounding box regression |
| \(L_{\text{obj}}\) | Objectness prediction |

Bounding box losses may use smooth L1, generalized IoU, distance IoU, or complete IoU. Classification may use cross-entropy or focal loss.

This example shows why multi-task losses often mix different loss families.

### Multi-Task Objectives in Language Models

Large language model training often combines several objectives across stages.

Pretraining commonly uses next-token prediction:

$$
L_{\text{LM}} =
-\sum_t
\log p_\theta(x_t \mid x_{<t}).
$$

Instruction tuning uses supervised response modeling:

$$
L_{\text{SFT}} =
-\sum_t
\log p_\theta(y_t \mid x, y_{<t}).
$$

Preference optimization may compare a preferred response \(y^+\) with a rejected response \(y^-\). A common form is a pairwise objective:

$$
L_{\text{pref}} =
-\log \sigma
\left(
\beta
[
r_\theta(x,y^+) - r_\theta(x,y^-)
]
\right).
$$

Tool-use and retrieval training may add auxiliary losses for selecting tools, formatting calls, citing documents, or grounding answers.

Thus, modern language models are often trained through a sequence of objectives rather than one fixed loss.
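The pairwise preference term above can be sketched as follows. The scalar scores `r_plus` and `r_minus` stand in for \(r_\theta(x,y^+)\) and \(r_\theta(x,y^-)\); the values are illustrative:

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(r_plus, r_minus, beta=1.0):
    """-log sigmoid(beta * (r+ - r-)); small when the preferred
    response scores higher than the rejected one."""
    return -F.logsigmoid(beta * (r_plus - r_minus)).mean()

r_plus = torch.tensor([2.0, 1.5])   # scores for preferred responses
r_minus = torch.tensor([0.5, 1.0])  # scores for rejected responses
loss = pairwise_preference_loss(r_plus, r_minus, beta=1.0)
```

Using `F.logsigmoid` rather than `torch.log(torch.sigmoid(...))` avoids numerical underflow when the score gap is large and negative.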

### Regularization Terms as Auxiliary Objectives

Regularization terms can also be treated as auxiliary objectives.

For example, weight decay adds a penalty:

$$
L =
L_{\text{data}}
+
\lambda \|\theta\|_2^2.
$$

A sparsity penalty may add

$$
\lambda \|h\|_1
$$

to encourage sparse activations.

A consistency loss may require predictions to remain stable under augmentation:

$$
L_{\text{consistency}} =
\|f_\theta(x) - f_\theta(\tilde{x})\|^2.
$$

These terms do not always correspond to a separate task, but they still contribute additional training signal.
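For instance, the consistency term can be sketched with a toy model and a hypothetical noise augmentation (`augment` is illustrative; real pipelines use domain-specific augmentations):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 3)

def augment(x):
    # Hypothetical augmentation: a small Gaussian perturbation.
    return x + 0.01 * torch.randn_like(x)

x = torch.randn(8, 10)

# Penalize predictions that change under augmentation.
consistency_loss = (model(x) - model(augment(x))).pow(2).mean()
```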

### Implementation Pattern

A clean PyTorch implementation returns a dictionary of outputs and computes a dictionary of losses.

```python
outputs = model(batch["x"])

losses = {}

losses["classification"] = F.cross_entropy(
    outputs["class_logits"],
    batch["class_target"],
)

losses["regression"] = F.mse_loss(
    outputs["value"],
    batch["value_target"],
)

losses["regularization"] = 1e-4 * sum(
    p.pow(2).sum()
    for p in model.parameters()
)

weights = {
    "classification": 1.0,
    "regression": 0.1,
    "regularization": 1.0,
}

loss = sum(weights[name] * losses[name] for name in losses)
```

This pattern makes the training objective explicit. It also makes logging easier:

```python
for name, value in losses.items():
    logger.log(f"loss/{name}", value.item())
```

Logging each component separately is important. A decreasing total loss can hide the fact that one task is getting worse.

### Practical Guidelines

Start with a single main objective and add auxiliary tasks only when they have a clear reason to help. Normalize targets and inspect loss scales before choosing weights. Log every loss component separately. Evaluate each task separately. Watch for negative transfer.

Use shared trunks when tasks depend on similar features. Use separate heads when outputs have different structures. For weakly related tasks, consider partial sharing, adapters, or separate models.

Use masks when labels are missing. Avoid letting missing labels silently produce fake losses. Tune task weights using validation metrics, not only training loss.

A multi-task objective is a design choice. It defines which signals shape the representation, how gradients are shared, and which tradeoffs the model is allowed to make.

