Bayesian Optimization

Bayesian optimization is a hyperparameter optimization method for expensive black-box functions. It is useful when each training run costs enough that random search wastes too much compute.

The central idea is to build a probabilistic model of the relationship between hyperparameters and validation performance. This model is called a surrogate model. Instead of blindly sampling configurations, Bayesian optimization uses previous results to decide which configuration to try next.

The Optimization Problem

Let λ denote a hyperparameter configuration. Training a model with λ produces validation performance:

f(\lambda) = \mathcal{L}_{\text{val}}(\theta^\ast(\lambda)).

We want to find

\lambda^\ast = \arg\min_{\lambda \in \Lambda} f(\lambda).

The function f is usually expensive to evaluate. One evaluation may require training a model for minutes, hours, or days. The function may also be noisy because different random seeds, data orders, and hardware kernels can produce different results.

Bayesian optimization treats f as an unknown function. After each trial, it updates its belief about where good configurations may be.

Surrogate Models

A surrogate model approximates the expensive objective function. Instead of training every possible model, we train a cheap statistical model on completed trials.

A trial produces a pair:

(\lambda_i, y_i),

where λ_i is the sampled configuration and y_i is the observed validation score or loss.

After n trials, the observed dataset is

\mathcal{D}_n = \{(\lambda_1,y_1),(\lambda_2,y_2),\ldots,(\lambda_n,y_n)\}.

The surrogate model estimates the likely value of f(λ) at untried configurations.

Common surrogate models include:

| Surrogate | Common use |
| --- | --- |
| Gaussian process | Small continuous search spaces |
| Tree-structured Parzen estimator | Mixed and conditional spaces |
| Random forest | Discrete and categorical spaces |
| Bayesian neural network | Larger or more complex spaces |

The surrogate should provide both a prediction and an uncertainty estimate. This uncertainty is what allows Bayesian optimization to balance exploration and exploitation.

Exploration and Exploitation

Bayesian optimization must decide between two kinds of trials.

Exploitation means trying configurations near known good regions. If a learning rate around 3×10⁻⁴ has worked well, nearby values may be promising.

Exploration means trying uncertain regions. A region may have few completed trials, so the surrogate model has high uncertainty there. Exploring it may discover a better solution.

Good Bayesian optimization balances both. Too much exploitation can get stuck near a local optimum. Too much exploration behaves like random search.

Acquisition Functions

An acquisition function selects the next configuration to evaluate. It uses the surrogate model’s predicted mean and uncertainty.

The acquisition function is cheap to evaluate, so we can optimize it many times before running the next expensive training job.

If the surrogate predicts mean μ(λ) and uncertainty σ(λ), an acquisition function may prefer configurations with low predicted loss, high uncertainty, or both.

Common acquisition functions include:

| Acquisition function | Idea |
| --- | --- |
| Probability of Improvement | Choose points likely to improve over the current best |
| Expected Improvement | Choose points with high expected gain |
| Upper Confidence Bound | Trade off predicted score and uncertainty |
| Thompson Sampling | Sample a possible objective function and optimize it |

For minimization, Expected Improvement measures how much a point is expected to improve over the best observed value.

Let y_min be the best validation loss observed so far. The improvement at λ is

I(\lambda) = \max(0, y_{\min} - f(\lambda)).

The acquisition function uses the expected value of this improvement under the surrogate model:

\text{EI}(\lambda) = \mathbb{E}\left[\max(0, y_{\min} - f(\lambda))\right].

A point with a low predicted loss can have high expected improvement. A point with high uncertainty can also have high expected improvement because it may turn out better than expected.
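
When the surrogate's posterior at a point is Gaussian with mean μ and standard deviation σ, this expectation has a closed form. A minimal sketch for minimization (the numbers are illustrative):

from scipy.stats import norm

def expected_improvement(mu, sigma, y_min):
    # Closed-form EI for minimization, assuming the surrogate posterior at the
    # candidate point is Gaussian with mean mu and standard deviation sigma.
    if sigma <= 0.0:
        return max(0.0, y_min - mu)
    z = (y_min - mu) / sigma
    return (y_min - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# A confident good prediction and an uncertain mediocre one can both look attractive.
print(expected_improvement(mu=0.20, sigma=0.01, y_min=0.25))  # ~0.05
print(expected_improvement(mu=0.30, sigma=0.10, y_min=0.25))  # ~0.02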

Gaussian Process Surrogates

Gaussian processes are the classical surrogate model for Bayesian optimization.

A Gaussian process defines a distribution over functions:

f \sim \mathcal{GP}(m, k),

where m is the mean function and k is the kernel function.

The kernel controls how similar two configurations are expected to be. If two learning rates are close on a logarithmic scale, their validation losses may be correlated.

After observing trials, the Gaussian process gives a posterior distribution for f(λ). This posterior provides both a mean and a variance.

Gaussian process Bayesian optimization works well when:

| Condition | Reason |
| --- | --- |
| Number of trials is small | GP inference can become expensive as trials grow |
| Search space is mostly continuous | Kernels are natural for continuous variables |
| Dimensionality is modest | GPs degrade in high-dimensional spaces |
| Objective is expensive | Surrogate overhead is acceptable |

For modern deep learning, Gaussian processes are useful for small to medium search spaces. They become less convenient for complex architecture search spaces with many conditional categorical choices.
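
As a sketch of what a GP surrogate provides, the snippet below fits scikit-learn's GaussianProcessRegressor to a handful of completed trials over the log learning rate; the trial data is illustrative:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Completed trials: log10 learning rate -> validation loss (illustrative numbers).
X = np.array([[-5.0], [-4.0], [-3.0], [-2.0]])
y = np.array([0.92, 0.55, 0.41, 0.63])

# A Matern kernel is a common default for hyperparameter response surfaces.
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X, y)

# Posterior mean and standard deviation at candidate configurations,
# exactly the two quantities an acquisition function needs.
candidates = np.linspace(-6.0, -1.0, 51).reshape(-1, 1)
mean, std = gp.predict(candidates, return_std=True)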

Tree-Structured Parzen Estimators

The tree-structured Parzen estimator, often abbreviated TPE, is widely used in practical hyperparameter optimization systems.

Instead of directly modeling p(y | λ), TPE models two distributions:

p(\lambda \mid y < y^\ast)

and

p(\lambda \mid y \ge y^\ast),

where y* is a threshold that separates good trials from bad trials.

The algorithm then chooses configurations that are likely under the good distribution and unlikely under the bad distribution.

TPE handles mixed spaces well:

| Search space feature | TPE suitability |
| --- | --- |
| Continuous variables | Good |
| Discrete variables | Good |
| Categorical choices | Good |
| Conditional parameters | Good |
| Tree-structured configurations | Good |

This makes TPE a common default for deep learning projects, especially through libraries such as Optuna and Hyperopt.
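
In Optuna, TPE is the default sampler, and it can also be requested explicitly. A minimal sketch, where objective stands for a training-and-evaluation function like the one defined in the Optuna example later in this page, and the seed is only for reproducibility:

import optuna

sampler = optuna.samplers.TPESampler(seed=42)
study = optuna.create_study(direction="minimize", sampler=sampler)
study.optimize(objective, n_trials=100)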

A Simple Bayesian Optimization Loop

Bayesian optimization follows a repeated loop:

  1. Start with several random trials.
  2. Fit a surrogate model to completed trials.
  3. Use an acquisition function to choose the next configuration.
  4. Train and evaluate the model.
  5. Add the result to the trial history.
  6. Repeat until the budget is exhausted.

In pseudocode:

# Warm-start with a handful of random trials.
history = []

for config in initial_random_configs:
    score = train_and_evaluate(config)
    history.append((config, score))

# Model-based phase: each new trial depends on all previous results.
for step in range(num_bo_steps):
    surrogate = fit_surrogate(history)

    # Cheap inner optimization of the acquisition function.
    next_config = optimize_acquisition(
        surrogate=surrogate,
        search_space=search_space,
    )

    score = train_and_evaluate(next_config)
    history.append((next_config, score))

best_config, best_score = select_best(history)

The main difference from random search is that each new trial depends on previous results.

Using Optuna with PyTorch

Optuna is a common Python library for hyperparameter optimization. It supports random search, TPE, pruning, and database-backed study tracking.

A PyTorch training objective can be written as a function that accepts a trial object:

import optuna
import torch
from torch import nn

# MLP (an nn.Module) and train_and_evaluate are assumed to be defined elsewhere.
def objective(trial):
    learning_rate = trial.suggest_float(
        "learning_rate",
        1e-5,
        1e-1,
        log=True,
    )

    weight_decay = trial.suggest_float(
        "weight_decay",
        1e-6,
        1e-1,
        log=True,
    )

    hidden_dim = trial.suggest_categorical(
        "hidden_dim",
        [128, 256, 512, 1024],
    )

    dropout = trial.suggest_float(
        "dropout",
        0.0,
        0.5,
    )

    model = MLP(
        input_dim=784,
        hidden_dim=hidden_dim,
        output_dim=10,
        num_layers=3,
        dropout=dropout,
    )

    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=learning_rate,
        weight_decay=weight_decay,
    )

    val_accuracy = train_and_evaluate(
        model=model,
        optimizer=optimizer,
    )

    return val_accuracy

Then create and run a study:

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)

print(study.best_value)
print(study.best_params)

This example maximizes validation accuracy. For validation loss, use direction="minimize".

Pruning Poor Trials

Many Bayesian optimization systems support pruning. Pruning stops bad trials early before they consume the full training budget.

A trial may report intermediate validation scores:

def objective(trial):
    # build_model_from_trial and build_optimizer_from_trial are placeholder
    # helpers that read hyperparameters from the trial, as in the earlier example.
    model = build_model_from_trial(trial)
    optimizer = build_optimizer_from_trial(trial)

    for epoch in range(num_epochs):
        train_one_epoch(model, optimizer)
        val_accuracy = evaluate(model)

        trial.report(val_accuracy, step=epoch)

        if trial.should_prune():
            raise optuna.TrialPruned()

    return val_accuracy

Pruning is useful when early validation results are predictive of final results. It can greatly reduce wasted compute.

However, pruning must be used carefully. Some configurations learn slowly but eventually perform well. Aggressive pruning may remove them too early.
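
In Optuna, the pruning decision is delegated to the pruner attached to the study. A sketch using the built-in MedianPruner, whose warmup settings are one way to reduce the risk of cutting slow starters too early:

study = optuna.create_study(
    direction="maximize",
    pruner=optuna.pruners.MedianPruner(
        n_startup_trials=5,  # never prune until a few trials have finished
        n_warmup_steps=3,    # give every trial a few epochs before pruning
    ),
)
study.optimize(objective, n_trials=50)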

Noisy Objectives

Deep learning validation performance is noisy. The same configuration can produce different results under different seeds.

Noise comes from:

| Source | Example |
| --- | --- |
| Initialization | Random initial weights |
| Data order | Shuffled mini-batches |
| Regularization | Dropout and augmentation |
| Hardware | Nondeterministic kernels |
| Evaluation | Small validation sets |

Bayesian optimization can overreact to noise. A configuration may look good because of a lucky seed.

Several practices help:

| Practice | Purpose |
| --- | --- |
| Repeat top configurations | Estimate stability |
| Use larger validation sets | Reduce metric noise |
| Log random seeds | Improve auditability |
| Compare confidence intervals | Avoid chasing noise |
| Use median over seeds | Prefer robust configurations |

For expensive models, it may be impractical to repeat every trial. A common compromise is to repeat only the best few configurations.
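
A sketch of that compromise, assuming a hypothetical train_and_evaluate_params helper that rebuilds a model from trial.params and trains it with an explicit seed:

import statistics

def rerun_top_configs(study, n_top=3, seeds=(0, 1, 2)):
    # Rank completed trials by their reported value (assuming maximization).
    completed = [t for t in study.trials if t.value is not None]
    top = sorted(completed, key=lambda t: t.value, reverse=True)[:n_top]

    results = []
    for trial in top:
        # train_and_evaluate_params is a hypothetical helper that rebuilds the
        # model from trial.params and trains it with the given seed.
        scores = [train_and_evaluate_params(trial.params, seed=s) for s in seeds]
        results.append((trial.params, statistics.median(scores)))

    # Prefer the configuration with the best median across seeds, not the luckiest run.
    return max(results, key=lambda item: item[1])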

Parallel Bayesian Optimization

Classic Bayesian optimization is sequential. It chooses one point, observes the result, then updates the surrogate.

Modern training systems often have many GPUs available. We want to run several trials in parallel.

Parallel Bayesian optimization chooses a batch of candidate configurations. This is harder because pending trials have no results yet.

Common approaches include:

| Method | Idea |
| --- | --- |
| Constant liar | Pretend pending trials have temporary outcomes |
| Thompson sampling | Sample multiple possible good configurations |
| Local penalization | Avoid sampling points too close together |
| Asynchronous BO | Launch new trials whenever workers become free |

Asynchronous methods are common in cluster settings. They avoid waiting for the slowest trial before launching the next one.
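
With Optuna, a simple asynchronous setup points several worker processes at the same storage backend, so each worker asks for a new configuration whenever it becomes free. A sketch, with an illustrative study name and SQLite storage path:

# Run this script once per worker, for example one process per GPU.
study = optuna.create_study(
    study_name="shared-search",
    storage="sqlite:///optuna_study.db",  # shared storage; a cluster would use a database server
    direction="maximize",
    load_if_exists=True,
)
study.optimize(objective, n_trials=25)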

Multi-Objective Bayesian Optimization

Sometimes we care about more than one objective.

For example, we may want high accuracy and low latency. A model with the best accuracy may be too slow for deployment.

We can define objectives such as:

\text{maximize accuracy}, \qquad \text{minimize latency}, \qquad \text{minimize memory}.

A configuration is Pareto optimal if no other configuration improves one objective without worsening another.

For deployment-oriented deep learning, multi-objective search is often more realistic than optimizing validation accuracy alone.
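
Optuna supports this directly: the objective returns one value per objective and the study keeps the Pareto-optimal trials. A sketch, assuming hypothetical build_model_from_trial, train_and_evaluate, and measure_latency helpers:

def objective(trial):
    model = build_model_from_trial(trial)
    accuracy = train_and_evaluate(model)   # higher is better
    latency_ms = measure_latency(model)    # lower is better; hypothetical helper
    return accuracy, latency_ms

study = optuna.create_study(directions=["maximize", "minimize"])
study.optimize(objective, n_trials=50)

pareto_trials = study.best_trials  # the Pareto-optimal trials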

A practical scalarized objective may be:

J(\lambda) = \text{accuracy}(\lambda) - \alpha \cdot \text{latency}(\lambda) - \beta \cdot \text{memory}(\lambda).

The constants α and β encode deployment tradeoffs.
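
The same tradeoff can also be run as an ordinary single-objective study. A sketch, with the penalty weights and measurement helpers as assumptions:

ALPHA = 0.001    # penalty per millisecond of latency (assumed deployment weight)
BETA = 0.0001    # penalty per megabyte of memory (assumed deployment weight)

def scalarized_objective(trial):
    model = build_model_from_trial(trial)
    accuracy = train_and_evaluate(model)
    return accuracy - ALPHA * measure_latency(model) - BETA * measure_memory(model)

study = optuna.create_study(direction="maximize")
study.optimize(scalarized_objective, n_trials=50)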

Strengths and Weaknesses

| Strengths | Weaknesses |
| --- | --- |
| More sample-efficient than random search | More complex to implement |
| Uses previous trial results | Surrogate can be misleading |
| Good for expensive evaluations | Less effective in very high dimensions |
| Handles noisy objectives with care | Sequential dependence can limit parallelism |
| Works well with pruning | Requires careful search-space design |

Bayesian optimization is strongest when each trial is expensive, the search space is moderate, and previous trials provide useful information about future ones.

When to Use Bayesian Optimization

Bayesian optimization is appropriate when:

| Situation | Reason |
| --- | --- |
| Training runs are expensive | Sample efficiency matters |
| Trial budget is small or moderate | BO uses previous results |
| Search space has important continuous variables | Surrogates can model smooth structure |
| You can log results reliably | BO depends on trial history |
| You want pruning and adaptive search | Modern BO systems support both |

Random search may be better when the search budget is large, trials are cheap, or the space is highly irregular.

Manual tuning may be better when only two or three hyperparameters matter and expert intuition is strong.

Summary

Bayesian optimization chooses hyperparameters by combining a surrogate model with an acquisition function. The surrogate estimates validation performance and uncertainty. The acquisition function selects the next trial by balancing exploitation of known good regions with exploration of uncertain regions.

For deep learning, Bayesian optimization is useful when each training run is costly and the search space is not too large. Practical systems often use TPE, pruning, asynchronous execution, and repeated evaluation of top configurations.

Bayesian optimization does not remove the need for a good search space. It improves how the space is explored, but the quality of the result still depends on meaningful ranges, valid constraints, reliable metrics, and careful experiment logging.