Bayesian Optimization

Bayesian optimization is a hyperparameter optimization method for expensive black-box functions. It is useful when each training run costs enough that random search wastes too much compute.

The central idea is to build a probabilistic model of the relationship between hyperparameters and validation performance. This model is called a surrogate model. Instead of blindly sampling configurations, Bayesian optimization uses previous results to decide which configuration to try next.

The Optimization Problem

Let λ denote a hyperparameter configuration. Training a model with λ produces validation performance:

f(\lambda) = \mathcal{L}_{\text{val}}(\theta^\ast(\lambda)).

We want to find

\lambda^\ast = \arg\min_{\lambda \in \Lambda} f(\lambda).

The function f is usually expensive to evaluate. One evaluation may require training a model for minutes, hours, or days. The function may also be noisy because different random seeds, data orders, and hardware kernels can produce different results.

Bayesian optimization treats f as an unknown function. After each trial, it updates its belief about where good configurations may be.

Surrogate Models

A surrogate model approximates the expensive objective function. Instead of training every possible model, we train a cheap statistical model on completed trials.

A trial produces a pair:

(\lambda_i, y_i),

where λ_i is the sampled configuration and y_i is the observed validation score or loss.

After n trials, the observed dataset is

\mathcal{D}_n = \{(\lambda_1,y_1),(\lambda_2,y_2),\ldots,(\lambda_n,y_n)\}.

The surrogate model estimates the likely value of f(λ) at untried configurations.

Common surrogate models include:

| Surrogate | Common use |
| --- | --- |
| Gaussian process | Small continuous search spaces |
| Tree-structured Parzen estimator | Mixed and conditional spaces |
| Random forest | Discrete and categorical spaces |
| Bayesian neural network | Larger or more complex spaces |

The surrogate should provide both a prediction and an uncertainty estimate. This uncertainty is what allows Bayesian optimization to balance exploration and exploitation.

Exploration and Exploitation

Bayesian optimization must decide between two kinds of trials.

Exploitation means trying configurations near known good regions. If a learning rate around 3×10⁻⁴ has worked well, nearby values may be promising.

Exploration means trying uncertain regions. A region may have few completed trials, so the surrogate model has high uncertainty there. Exploring it may discover a better solution.

Good Bayesian optimization balances both. Too much exploitation can get stuck near a local optimum. Too much exploration behaves like random search.

Acquisition Functions

An acquisition function selects the next configuration to evaluate. It uses the surrogate model’s predicted mean and uncertainty.

The acquisition function is cheap to evaluate, so we can optimize it many times before running the next expensive training job.

If the surrogate predicts mean μ(λ) and uncertainty σ(λ), an acquisition function may prefer configurations with low predicted loss, high uncertainty, or both.

Common acquisition functions include:

| Acquisition function | Idea |
| --- | --- |
| Probability of Improvement | Choose points likely to improve over the current best |
| Expected Improvement | Choose points with high expected gain |
| Upper Confidence Bound | Trade off predicted score and uncertainty |
| Thompson Sampling | Sample a possible objective function and optimize it |

For minimization, Expected Improvement measures how much a point is expected to improve over the best observed value.

Let y_min be the best validation loss observed so far. The improvement at λ is

I(\lambda) = \max(0, y_{\min} - f(\lambda)).

The acquisition function uses the expected value of this improvement under the surrogate model:

\text{EI}(\lambda) = \mathbb{E}\left[\max(0, y_{\min} - f(\lambda))\right].

A point with a low predicted loss can have high expected improvement. A point with high uncertainty can also have high expected improvement because it may turn out better than expected.
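
When the surrogate's posterior at a point is Gaussian with mean μ and standard deviation σ, this expectation has a closed form. A minimal sketch for minimization (the numbers are illustrative):

from scipy.stats import norm

def expected_improvement(mu, sigma, y_min):
    # Closed-form EI for minimization, assuming the surrogate posterior at the
    # candidate point is Gaussian with mean mu and standard deviation sigma.
    if sigma <= 0.0:
        return max(0.0, y_min - mu)
    z = (y_min - mu) / sigma
    return (y_min - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# A confident good prediction and an uncertain mediocre one can both look attractive.
print(expected_improvement(mu=0.20, sigma=0.01, y_min=0.25))  # ~0.05
print(expected_improvement(mu=0.30, sigma=0.10, y_min=0.25))  # ~0.02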

Gaussian Process Surrogates

Gaussian processes are the classical surrogate model for Bayesian optimization.

A Gaussian process defines a distribution over functions:

f \sim \mathcal{GP}(m, k),

where m is the mean function and k is the kernel function.

The kernel controls how similar two configurations are expected to be. If two learning rates are close on a logarithmic scale, their validation losses may be correlated.

After observing trials, the Gaussian process gives a posterior distribution for f(λ). This posterior provides both a mean and a variance.

Gaussian process Bayesian optimization works well when:

| Condition | Reason |
| --- | --- |
| Number of trials is small | GP inference can become expensive as trials grow |
| Search space is mostly continuous | Kernels are natural for continuous variables |
| Dimensionality is modest | GPs degrade in high-dimensional spaces |
| Objective is expensive | Surrogate overhead is acceptable |

For modern deep learning, Gaussian processes are useful for small to medium search spaces. They become less convenient for complex architecture search spaces with many conditional categorical choices.
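
As a sketch of what a GP surrogate provides, the snippet below fits scikit-learn's GaussianProcessRegressor to a handful of completed trials over the log learning rate; the trial data is illustrative:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Completed trials: log10 learning rate -> validation loss (illustrative numbers).
X = np.array([[-5.0], [-4.0], [-3.0], [-2.0]])
y = np.array([0.92, 0.55, 0.41, 0.63])

# A Matern kernel is a common default for hyperparameter response surfaces.
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X, y)

# Posterior mean and standard deviation at candidate configurations,
# exactly the two quantities an acquisition function needs.
candidates = np.linspace(-6.0, -1.0, 51).reshape(-1, 1)
mean, std = gp.predict(candidates, return_std=True)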

Tree-Structured Parzen Estimators

The tree-structured Parzen estimator, often abbreviated TPE, is widely used in practical hyperparameter optimization systems.

Instead of directly modeling p(y | λ), TPE models two distributions:

p(\lambda \mid y < y^\ast)

and

p(\lambda \mid y \ge y^\ast),

where y* is a threshold that separates good trials from bad trials.

The algorithm then chooses configurations that are likely under the good distribution and unlikely under the bad distribution.

TPE handles mixed spaces well:

| Search space feature | TPE suitability |
| --- | --- |
| Continuous variables | Good |
| Discrete variables | Good |
| Categorical choices | Good |
| Conditional parameters | Good |
| Tree-structured configurations | Good |

This makes TPE a common default for deep learning projects, especially through libraries such as Optuna and Hyperopt.
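
In Optuna, TPE is the default sampler, and it can also be requested explicitly. A minimal sketch, where objective stands for a training-and-evaluation function like the one defined in the Optuna example later in this page, and the seed is only for reproducibility:

import optuna

sampler = optuna.samplers.TPESampler(seed=42)
study = optuna.create_study(direction="minimize", sampler=sampler)
study.optimize(objective, n_trials=100)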

A Simple Bayesian Optimization Loop

Bayesian optimization follows a repeated loop:

  1. Start with several random trials.
  2. Fit a surrogate model to completed trials.
  3. Use an acquisition function to choose the next configuration.
  4. Train and evaluate the model.
  5. Add the result to the trial history.
  6. Repeat until the budget is exhausted.

In pseudocode:

# Warm-start with a handful of random trials.
history = []

for config in initial_random_configs:
    score = train_and_evaluate(config)
    history.append((config, score))

# Model-based phase: each new trial depends on all previous results.
for step in range(num_bo_steps):
    surrogate = fit_surrogate(history)

    # Cheap inner optimization of the acquisition function.
    next_config = optimize_acquisition(
        surrogate=surrogate,
        search_space=search_space,
    )

    score = train_and_evaluate(next_config)
    history.append((next_config, score))

best_config, best_score = select_best(history)

The main difference from random search is that each new trial depends on previous results.

Using Optuna with PyTorch

Optuna is a common Python library for hyperparameter optimization. It supports random search, TPE, pruning, and database-backed study tracking.

A PyTorch training objective can be written as a function that accepts a trial object:

import optuna
import torch
from torch import nn

# MLP (an nn.Module) and train_and_evaluate are assumed to be defined elsewhere.
def objective(trial):
    learning_rate = trial.suggest_float(
        "learning_rate",
        1e-5,
        1e-1,
        log=True,
    )

    weight_decay = trial.suggest_float(
        "weight_decay",
        1e-6,
        1e-1,
        log=True,
    )

    hidden_dim = trial.suggest_categorical(
        "hidden_dim",
        [128, 256, 512, 1024],
    )

    dropout = trial.suggest_float(
        "dropout",
        0.0,
        0.5,
    )

    model = MLP(
        input_dim=784,
        hidden_dim=hidden_dim,
        output_dim=10,
        num_layers=3,
        dropout=dropout,
    )

    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=learning_rate,
        weight_decay=weight_decay,
    )

    val_accuracy = train_and_evaluate(
        model=model,
        optimizer=optimizer,
    )

    return val_accuracy

Then create and run a study:

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)

print(study.best_value)
print(study.best_params)

This example maximizes validation accuracy. For validation loss, use direction="minimize".

Pruning Poor Trials

Many Bayesian optimization systems support pruning. Pruning stops bad trials early before they consume the full training budget.

A trial may report intermediate validation scores:

def objective(trial):
    # build_model_from_trial and build_optimizer_from_trial are placeholder
    # helpers that read hyperparameters from the trial, as in the earlier example.
    model = build_model_from_trial(trial)
    optimizer = build_optimizer_from_trial(trial)

    for epoch in range(num_epochs):
        train_one_epoch(model, optimizer)
        val_accuracy = evaluate(model)

        trial.report(val_accuracy, step=epoch)

        if trial.should_prune():
            raise optuna.TrialPruned()

    return val_accuracy

Pruning is useful when early validation results are predictive of final results. It can greatly reduce wasted compute.

However, pruning must be used carefully. Some configurations learn slowly but eventually perform well. Aggressive pruning may remove them too early.
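
In Optuna, the pruning decision is delegated to the pruner attached to the study. A sketch using the built-in MedianPruner, whose warmup settings are one way to reduce the risk of cutting slow starters too early:

study = optuna.create_study(
    direction="maximize",
    pruner=optuna.pruners.MedianPruner(
        n_startup_trials=5,  # never prune until a few trials have finished
        n_warmup_steps=3,    # give every trial a few epochs before pruning
    ),
)
study.optimize(objective, n_trials=50)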

Noisy Objectives

Deep learning validation performance is noisy. The same configuration can produce different results under different seeds.

Noise comes from:

| Source | Example |
| --- | --- |
| Initialization | Random initial weights |
| Data order | Shuffled mini-batches |
| Regularization | Dropout and augmentation |
| Hardware | Nondeterministic kernels |
| Evaluation | Small validation sets |

Bayesian optimization can overreact to noise. A configuration may look good because of a lucky seed.

Several practices help:

| Practice | Purpose |
| --- | --- |
| Repeat top configurations | Estimate stability |
| Use larger validation sets | Reduce metric noise |
| Log random seeds | Improve auditability |
| Compare confidence intervals | Avoid chasing noise |
| Use median over seeds | Prefer robust configurations |

For expensive models, it may be impractical to repeat every trial. A common compromise is to repeat only the best few configurations.
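
A sketch of that compromise, assuming a hypothetical train_and_evaluate_params helper that rebuilds a model from trial.params and trains it with an explicit seed:

import statistics

def rerun_top_configs(study, n_top=3, seeds=(0, 1, 2)):
    # Rank completed trials by their reported value (assuming maximization).
    completed = [t for t in study.trials if t.value is not None]
    top = sorted(completed, key=lambda t: t.value, reverse=True)[:n_top]

    results = []
    for trial in top:
        # train_and_evaluate_params is a hypothetical helper that rebuilds the
        # model from trial.params and trains it with the given seed.
        scores = [train_and_evaluate_params(trial.params, seed=s) for s in seeds]
        results.append((trial.params, statistics.median(scores)))

    # Prefer the configuration with the best median across seeds, not the luckiest run.
    return max(results, key=lambda item: item[1])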

Parallel Bayesian Optimization

Classic Bayesian optimization is sequential. It chooses one point, observes the result, then updates the surrogate.

Modern training systems often have many GPUs available. We want to run several trials in parallel.

Parallel Bayesian optimization chooses a batch of candidate configurations. This is harder because pending trials have no results yet.

Common approaches include:

| Method | Idea |
| --- | --- |
| Constant liar | Pretend pending trials have temporary outcomes |
| Thompson sampling | Sample multiple possible good configurations |
| Local penalization | Avoid sampling points too close together |
| Asynchronous BO | Launch new trials whenever workers become free |

Asynchronous methods are common in cluster settings. They avoid waiting for the slowest trial before launching the next one.
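
With Optuna, a simple asynchronous setup points several worker processes at the same storage backend, so each worker asks for a new configuration whenever it becomes free. A sketch, with an illustrative study name and SQLite storage path:

# Run this script once per worker, for example one process per GPU.
study = optuna.create_study(
    study_name="shared-search",
    storage="sqlite:///optuna_study.db",  # shared storage; a cluster would use a database server
    direction="maximize",
    load_if_exists=True,
)
study.optimize(objective, n_trials=25)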

Multi-Objective Bayesian Optimization

Sometimes we care about more than one objective.

For example, we may want high accuracy and low latency. A model with the best accuracy may be too slow for deployment.

We can define objectives such as:

\text{maximize accuracy}, \qquad \text{minimize latency}, \qquad \text{minimize memory}.

A configuration is Pareto optimal if no other configuration improves one objective without worsening another.

For deployment-oriented deep learning, multi-objective search is often more realistic than optimizing validation accuracy alone.
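
Optuna supports this directly: the objective returns one value per objective and the study keeps the Pareto-optimal trials. A sketch, assuming hypothetical build_model_from_trial, train_and_evaluate, and measure_latency helpers:

def objective(trial):
    model = build_model_from_trial(trial)
    accuracy = train_and_evaluate(model)   # higher is better
    latency_ms = measure_latency(model)    # lower is better; hypothetical helper
    return accuracy, latency_ms

study = optuna.create_study(directions=["maximize", "minimize"])
study.optimize(objective, n_trials=50)

pareto_trials = study.best_trials  # the Pareto-optimal trials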

A practical scalarized objective may be:

J(\lambda) = \text{accuracy}(\lambda) - \alpha \cdot \text{latency}(\lambda) - \beta \cdot \text{memory}(\lambda).

The constants α and β encode deployment tradeoffs.
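
The same tradeoff can also be run as an ordinary single-objective study. A sketch, with the penalty weights and measurement helpers as assumptions:

ALPHA = 0.001    # penalty per millisecond of latency (assumed deployment weight)
BETA = 0.0001    # penalty per megabyte of memory (assumed deployment weight)

def scalarized_objective(trial):
    model = build_model_from_trial(trial)
    accuracy = train_and_evaluate(model)
    return accuracy - ALPHA * measure_latency(model) - BETA * measure_memory(model)

study = optuna.create_study(direction="maximize")
study.optimize(scalarized_objective, n_trials=50)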

Strengths and Weaknesses

| Strengths | Weaknesses |
| --- | --- |
| More sample-efficient than random search | More complex to implement |
| Uses previous trial results | Surrogate can be misleading |
| Good for expensive evaluations | Less effective in very high dimensions |
| Handles noisy objectives with care | Sequential dependence can limit parallelism |
| Works well with pruning | Requires careful search-space design |

Bayesian optimization is strongest when each trial is expensive, the search space is moderate, and previous trials provide useful information about future ones.

When to Use Bayesian Optimization

Bayesian optimization is appropriate when:

| Situation | Reason |
| --- | --- |
| Training runs are expensive | Sample efficiency matters |
| Trial budget is small or moderate | BO uses previous results |
| Search space has important continuous variables | Surrogates can model smooth structure |
| You can log results reliably | BO depends on trial history |
| You want pruning and adaptive search | Modern BO systems support both |

Random search may be better when the search budget is large, trials are cheap, or the space is highly irregular.

Manual tuning may be better when only two or three hyperparameters matter and expert intuition is strong.

Summary

Bayesian optimization chooses hyperparameters by combining a surrogate model with an acquisition function. The surrogate estimates validation performance and uncertainty. The acquisition function selects the next trial by balancing exploitation of known good regions with exploration of uncertain regions.

For deep learning, Bayesian optimization is useful when each training run is costly and the search space is not too large. Practical systems often use TPE, pruning, asynchronous execution, and repeated evaluation of top configurations.

Bayesian optimization does not remove the need for a good search space. It improves how the space is explored, but the quality of the result still depends on meaningful ranges, valid constraints, reliable metrics, and careful experiment logging.