Bayesian optimization is a hyperparameter optimization method for expensive black-box functions. It is useful when each training run costs enough that random search wastes too much compute.
The central idea is to build a probabilistic model of the relationship between hyperparameters and validation performance. This model is called a surrogate model. Instead of blindly sampling configurations, Bayesian optimization uses previous results to decide which configuration to try next.
The Optimization Problem
Let $x$ denote a hyperparameter configuration. Training a model with $x$ produces a validation performance $y = f(x)$.
We want to find $x^{*} = \arg\min_{x \in \mathcal{X}} f(x)$, where $\mathcal{X}$ is the search space and $f$ is a validation loss (for a score such as accuracy, replace the minimum with a maximum).
The function $f$ is usually expensive to evaluate. One evaluation may require training a model for minutes, hours, or days. The function may also be noisy because different random seeds, data orders, and hardware kernels can produce different results.
Bayesian optimization treats $f$ as an unknown function. After each trial, it updates its belief about where good configurations may be.
Surrogate Models
A surrogate model approximates the expensive objective function. Instead of training every possible model, we train a cheap statistical model on completed trials.
A trial produces a pair $(x_i, y_i)$,
where $x_i$ is the sampled configuration and $y_i$ is the observed validation score or loss.
After $n$ trials, the observed dataset is $\mathcal{D}_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$.
The surrogate model estimates the likely value of $f$ at untried configurations.
Common surrogate models include:
| Surrogate | Common use |
|---|---|
| Gaussian process | Small continuous search spaces |
| Tree-structured Parzen estimator | Mixed and conditional spaces |
| Random forest | Discrete and categorical spaces |
| Bayesian neural network | Larger or more complex spaces |
The surrogate should provide both a prediction and uncertainty. This uncertainty is what allows Bayesian optimization to balance exploration and exploitation.
Exploration and Exploitation
Bayesian optimization must decide between two kinds of trials.
Exploitation means trying configurations near known good regions. If learning rates around a particular value have worked well, nearby values may be promising.
Exploration means trying uncertain regions. A region may have few completed trials, so the surrogate model has high uncertainty there. Exploring it may discover a better solution.
Good Bayesian optimization balances both. Too much exploitation can get stuck near a local optimum. Too much exploration behaves like random search.
Acquisition Functions
An acquisition function selects the next configuration to evaluate. It uses the surrogate model’s predicted mean and uncertainty.
The acquisition function is cheap to evaluate, so we can optimize it many times before running the next expensive training job.
If the surrogate predicts a mean $\mu(x)$ and an uncertainty $\sigma(x)$, an acquisition function may prefer configurations with low predicted loss, high uncertainty, or both.
Common acquisition functions include:
| Acquisition function | Idea |
|---|---|
| Probability of Improvement | Choose points likely to improve over the current best |
| Expected Improvement | Choose points with high expected gain |
| Upper Confidence Bound | Trade off predicted score and uncertainty |
| Thompson Sampling | Sample a possible objective function and optimize it |
For minimization, Expected Improvement measures how much a point is expected to improve over the best observed value.
Let $y^{*}$ be the best validation loss observed so far. The improvement at $x$ is $I(x) = \max(y^{*} - f(x),\ 0)$.
The acquisition function uses the expected value of this improvement under the surrogate model: $\mathrm{EI}(x) = \mathbb{E}\left[\max(y^{*} - f(x),\ 0)\right]$.
A point with a low predicted loss can have high expected improvement. A point with high uncertainty can also have high expected improvement because it may turn out better than expected.
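As a concrete sketch, assuming the surrogate's posterior at each candidate is Gaussian with mean `mu` and standard deviation `sigma` (as a Gaussian process surrogate would provide), Expected Improvement for minimization has a closed form:

```python
import numpy as np
from scipy.stats import norm


def expected_improvement(mu, sigma, y_best):
    """Closed-form EI for minimization under a Gaussian posterior.

    mu, sigma : posterior mean and standard deviation at candidate points
    y_best    : lowest validation loss observed so far
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    improvement = y_best - mu
    with np.errstate(divide="ignore", invalid="ignore"):
        z = improvement / sigma
        ei = improvement * norm.cdf(z) + sigma * norm.pdf(z)
    # Where the surrogate is certain (sigma == 0), EI reduces to max(improvement, 0).
    return np.where(sigma > 0, ei, np.maximum(improvement, 0.0))
```

Because this is cheap to compute, it can be evaluated on many sampled candidates and the maximizer chosen as the next trial.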
Gaussian Process Surrogates
Gaussian processes are the classical surrogate model for Bayesian optimization.
A Gaussian process defines a distribution over functions: $f \sim \mathcal{GP}(m(x),\ k(x, x'))$,
where $m(x)$ is the mean function and $k(x, x')$ is the kernel function.
The kernel controls how similar two configurations are expected to be. If two learning rates are close on a logarithmic scale, their validation losses may be correlated.
After observing trials, the Gaussian process gives a posterior distribution for $f(x)$ at any candidate configuration $x$. This posterior provides both a mean and a variance.
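As an illustrative sketch (the numbers and the one-dimensional search over a log-scale learning rate are assumptions, and scikit-learn's `GaussianProcessRegressor` is only one possible surrogate implementation):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Completed trials: log10(learning rate) and the observed validation loss.
# (Illustrative numbers, not real results.)
X_observed = np.array([[-4.0], [-3.0], [-2.0], [-1.0]])
y_observed = np.array([0.92, 0.55, 0.48, 0.83])

surrogate = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
surrogate.fit(X_observed, y_observed)

# The posterior provides a mean and a standard deviation at untried points,
# which the acquisition function combines to choose the next trial.
X_candidates = np.linspace(-5.0, -0.5, 20).reshape(-1, 1)
mean, std = surrogate.predict(X_candidates, return_std=True)
```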
Gaussian process Bayesian optimization works well when:
| Condition | Reason |
|---|---|
| Number of trials is small | GP inference can become expensive as trials grow |
| Search space is mostly continuous | Kernels are natural for continuous variables |
| Dimensionality is modest | GPs degrade in high-dimensional spaces |
| Objective is expensive | Surrogate overhead is acceptable |
For modern deep learning, Gaussian processes are useful for small to medium search spaces. They become less convenient for complex architecture search spaces with many conditional categorical choices.
Tree-Structured Parzen Estimators
The tree-structured Parzen estimator, often abbreviated TPE, is widely used in practical hyperparameter optimization systems.
Instead of directly modeling $p(y \mid x)$, TPE models two distributions:
$\ell(x) = p(x \mid y < y^{\dagger})$ and $g(x) = p(x \mid y \ge y^{\dagger})$,
where $y^{\dagger}$ is a threshold that separates good trials from bad trials.
The algorithm then chooses configurations that are likely under the good distribution and unlikely under the bad distribution, which amounts to maximizing the ratio $\ell(x) / g(x)$.
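As a toy one-dimensional sketch of this idea (assuming enough completed trials to fit kernel density estimates; the `gamma` split fraction and helper structure are illustrative, and real TPE implementations such as Optuna's are considerably more involved):

```python
import numpy as np
from scipy.stats import gaussian_kde


def tpe_suggest(trials, candidates, gamma=0.25):
    """Pick the candidate that maximizes l(x) / g(x).

    trials     : list of (config_value, validation_loss) pairs
    candidates : 1-D array of candidate configuration values
    gamma      : fraction of trials treated as "good" (assumed value)
    """
    xs = np.array([x for x, _ in trials])
    losses = np.array([y for _, y in trials])

    threshold = np.quantile(losses, gamma)   # separates good trials from bad
    good = xs[losses <= threshold]
    bad = xs[losses > threshold]

    l = gaussian_kde(good)   # density of configurations among good trials
    g = gaussian_kde(bad)    # density of configurations among bad trials

    scores = l(candidates) / np.maximum(g(candidates), 1e-12)
    return candidates[np.argmax(scores)]
```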
TPE handles mixed spaces well:
| Search space feature | TPE suitability |
|---|---|
| Continuous variables | Good |
| Discrete variables | Good |
| Categorical choices | Good |
| Conditional parameters | Good |
| Tree-structured configurations | Good |
This makes TPE a common default for deep learning projects, especially through libraries such as Optuna and Hyperopt.
A Simple Bayesian Optimization Loop
Bayesian optimization follows a repeated loop:
- Start with several random trials.
- Fit a surrogate model to completed trials.
- Use an acquisition function to choose the next configuration.
- Train and evaluate the model.
- Add the result to the trial history.
- Repeat until the budget is exhausted.
In pseudocode:
```python
history = []

# Warm-up: evaluate a few random configurations first.
for config in initial_random_configs:
    score = train_and_evaluate(config)
    history.append((config, score))

# Main loop: each new trial is chosen using all previous results.
for step in range(num_bo_steps):
    surrogate = fit_surrogate(history)
    next_config = optimize_acquisition(
        surrogate=surrogate,
        search_space=search_space,
    )
    score = train_and_evaluate(next_config)
    history.append((next_config, score))

best_config, best_score = select_best(history)
```

The main difference from random search is that each new trial depends on previous results.
Using Optuna with PyTorch
Optuna is a common Python library for hyperparameter optimization. It supports random search, TPE, pruning, and database-backed study tracking.
A PyTorch training objective can be written as a function that accepts a trial object:
```python
import optuna
import torch
from torch import nn


def objective(trial):
    # Sample hyperparameters for this trial.
    learning_rate = trial.suggest_float(
        "learning_rate",
        1e-5,
        1e-1,
        log=True,
    )
    weight_decay = trial.suggest_float(
        "weight_decay",
        1e-6,
        1e-1,
        log=True,
    )
    hidden_dim = trial.suggest_categorical(
        "hidden_dim",
        [128, 256, 512, 1024],
    )
    dropout = trial.suggest_float(
        "dropout",
        0.0,
        0.5,
    )

    # Build the model and optimizer from the sampled values.
    model = MLP(
        input_dim=784,
        hidden_dim=hidden_dim,
        output_dim=10,
        num_layers=3,
        dropout=dropout,
    )
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=learning_rate,
        weight_decay=weight_decay,
    )

    val_accuracy = train_and_evaluate(
        model=model,
        optimizer=optimizer,
    )
    return val_accuracy
```

Then create and run a study:
```python
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)

print(study.best_value)
print(study.best_params)
```

This example maximizes validation accuracy. For validation loss, use `direction="minimize"`.
Pruning Poor Trials
Many Bayesian optimization systems support pruning. Pruning stops bad trials early before they consume the full training budget.
A trial may report intermediate validation scores:
```python
def objective(trial):
    model = build_model_from_trial(trial)
    optimizer = build_optimizer_from_trial(trial)

    for epoch in range(num_epochs):
        train_one_epoch(model, optimizer)
        val_accuracy = evaluate(model)

        # Report the intermediate score so the pruner can act on it.
        trial.report(val_accuracy, step=epoch)
        if trial.should_prune():
            raise optuna.TrialPruned()

    return val_accuracy
```

Pruning is useful when early validation results are predictive of final results. It can greatly reduce wasted compute.
However, pruning must be used carefully. Some configurations learn slowly but eventually perform well. Aggressive pruning may remove them too early.
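One way to soften this in Optuna is to give the pruner warm-up settings so that no trial is stopped before it has trained for a minimum number of epochs; the exact values below are illustrative:

```python
import optuna

# Median pruning: a trial is stopped if its intermediate score is worse than
# the median of previous trials at the same step. The warm-up settings give
# slow starters several epochs before pruning can trigger.
pruner = optuna.pruners.MedianPruner(
    n_startup_trials=5,   # never prune during the first few trials
    n_warmup_steps=10,    # never prune before this many reported steps
)

study = optuna.create_study(direction="maximize", pruner=pruner)
study.optimize(objective, n_trials=50)
```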
Noisy Objectives
Deep learning validation performance is noisy. The same configuration can produce different results under different seeds.
Noise comes from:
| Source | Example |
|---|---|
| Initialization | Random initial weights |
| Data order | Shuffled mini-batches |
| Regularization | Dropout and augmentation |
| Hardware | Nondeterministic kernels |
| Evaluation | Small validation sets |
Bayesian optimization can overreact to noise. A configuration may look good because of a lucky seed.
Several practices help:
| Practice | Purpose |
|---|---|
| Repeat top configurations | Estimate stability |
| Use larger validation sets | Reduce metric noise |
| Log random seeds | Improve auditability |
| Compare confidence intervals | Avoid chasing noise |
| Use median over seeds | Prefer robust configurations |
For expensive models, it may be impractical to repeat every trial. A common compromise is to repeat only the best few configurations.
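A minimal sketch of that compromise, assuming a completed Optuna study that maximizes accuracy and a `train_and_evaluate(params, seed)` helper (both names are placeholders):

```python
import statistics

# Assumes `study` (a finished Optuna study, direction="maximize") and a
# `train_and_evaluate(params, seed)` helper exist in the surrounding code.
top_trials = sorted(
    (t for t in study.trials if t.value is not None),
    key=lambda t: t.value,
    reverse=True,
)[:3]

# Re-run only the best few configurations under several seeds and rank them
# by the median score, which is less sensitive to a single lucky seed.
robust_scores = {
    t.number: statistics.median(
        train_and_evaluate(t.params, seed=seed) for seed in (0, 1, 2)
    )
    for t in top_trials
}

best_trial_number = max(robust_scores, key=robust_scores.get)
```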
Parallel Bayesian Optimization
Classic Bayesian optimization is sequential. It chooses one point, observes the result, then updates the surrogate.
Modern training systems often have many GPUs available. We want to run several trials in parallel.
Parallel Bayesian optimization chooses a batch of candidate configurations. This is harder because pending trials have no results yet.
Common approaches include:
| Method | Idea |
|---|---|
| Constant liar | Pretend pending trials have temporary outcomes |
| Thompson sampling | Sample multiple possible good configurations |
| Local penalization | Avoid sampling points too close together |
| Asynchronous BO | Launch new trials whenever workers become free |
Asynchronous methods are common in cluster settings. They avoid waiting for the slowest trial before launching the next one.
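In Optuna, for example, the TPE sampler exposes a constant-liar option, and thread-based parallelism can run several trials at once in a single process (worker and trial counts below are illustrative; cluster-scale asynchronous search typically shares one study through database-backed storage):

```python
import optuna

# constant_liar=True makes pending (still running) trials count as temporary
# outcomes, which discourages the sampler from proposing near-duplicate points.
sampler = optuna.samplers.TPESampler(constant_liar=True)
study = optuna.create_study(direction="maximize", sampler=sampler)

# Run several trials concurrently within one process.
study.optimize(objective, n_trials=50, n_jobs=4)
```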
Multi-Objective Bayesian Optimization
Sometimes we care about more than one objective.
For example, we may want high accuracy and low latency. A model with the best accuracy may be too slow for deployment.
We can define objectives such as $f_{1}(x) = \text{validation accuracy}$ and $f_{2}(x) = \text{inference latency}$.
A configuration is Pareto optimal if no other configuration improves one objective without worsening another.
For deployment-oriented deep learning, multi-objective search is often more realistic than optimizing validation accuracy alone.
A practical scalarized objective may be $g(x) = \alpha\, f_{1}(x) - \beta\, f_{2}(x)$, maximized over configurations.
The constants $\alpha$ and $\beta$ encode deployment tradeoffs.
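Optuna supports multi-objective search directly: the study declares one direction per objective, the objective returns a tuple, and `study.best_trials` holds the Pareto-optimal trials (the model-building and latency helpers below are assumed placeholders):

```python
import optuna


def objective(trial):
    model = build_model_from_trial(trial)
    optimizer = build_optimizer_from_trial(trial)

    val_accuracy = train_and_evaluate(model=model, optimizer=optimizer)
    latency_ms = measure_latency(model)  # assumed helper: inference time in ms

    # One value per declared direction, in the same order.
    return val_accuracy, latency_ms


study = optuna.create_study(directions=["maximize", "minimize"])
study.optimize(objective, n_trials=50)

for trial in study.best_trials:  # the Pareto front
    print(trial.values, trial.params)
```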
Strengths and Weaknesses
| Strengths | Weaknesses |
|---|---|
| More sample-efficient than random search | More complex to implement |
| Uses previous trial results | Surrogate can be misleading |
| Good for expensive evaluations | Less effective in very high dimensions |
| Handles noisy objectives with care | Sequential dependence can limit parallelism |
| Works well with pruning | Requires careful search-space design |
Bayesian optimization is strongest when each trial is expensive, the search space is moderate, and previous trials provide useful information about future ones.
When to Use Bayesian Optimization
Bayesian optimization is appropriate when:
| Situation | Reason |
|---|---|
| Training runs are expensive | Sample efficiency matters |
| Trial budget is small or moderate | BO uses previous results |
| Search space has important continuous variables | Surrogates can model smooth structure |
| You can log results reliably | BO depends on trial history |
| You want pruning and adaptive search | Modern BO systems support both |
Random search may be better when the search budget is large, trials are cheap, or the space is highly irregular.
Manual tuning may be better when only two or three hyperparameters matter and expert intuition is strong.
Summary
Bayesian optimization chooses hyperparameters by combining a surrogate model with an acquisition function. The surrogate estimates validation performance and uncertainty. The acquisition function selects the next trial by balancing exploitation of known good regions with exploration of uncertain regions.
For deep learning, Bayesian optimization is useful when each training run is costly and the search space is not too large. Practical systems often use TPE, pruning, asynchronous execution, and repeated evaluation of top configurations.
Bayesian optimization does not remove the need for a good search space. It improves how the space is explored, but the quality of the result still depends on meaningful ranges, valid constraints, reliable metrics, and careful experiment logging.