Hyperparameter optimization begins by deciding what may vary. This set of possible choices is called the search space. Before we run grid search, random search, Bayesian optimization, or any automated method, we must define the hyperparameters, their allowed values, and the rules that make some combinations valid or invalid.
A hyperparameter is a value chosen outside the training process. It controls how the model is built or how training is performed. Examples include the learning rate, batch size, number of layers, hidden dimension, dropout rate, optimizer type, weight decay, data augmentation strength, and learning rate schedule.
Parameters and hyperparameters have different roles. Parameters are learned from data. Hyperparameters are chosen by the practitioner or by an outer optimization procedure.
For a neural network with weights $\theta$, training usually solves an optimization problem of the form

$$\theta^*(\lambda) = \arg\min_{\theta} \; \mathcal{L}_{\text{train}}(\theta; \lambda),$$

where $\lambda$ denotes the hyperparameters. Hyperparameter optimization then chooses $\lambda$ so that the trained model performs well on validation data:

$$\lambda^* = \arg\min_{\lambda \in \Lambda} \; \mathcal{L}_{\text{val}}(\theta^*(\lambda)).$$

Here $\Lambda$ is the search space.
What a Search Space Contains
A search space specifies the allowed values of each hyperparameter.
For example, a simple multilayer perceptron may have the following search space:
| Hyperparameter | Type | Example values |
|---|---|---|
| Learning rate | Continuous | $10^{-5}$ to $10^{-1}$ |
| Batch size | Discrete | 32, 64, 128, 256 |
| Hidden dimension | Discrete | 128, 256, 512, 1024 |
| Number of layers | Discrete | 2, 3, 4, 6 |
| Dropout rate | Continuous | 0.0 to 0.5 |
| Optimizer | Categorical | SGD, Adam, AdamW |
| Weight decay | Continuous | $10^{-6}$ to $10^{-1}$ |
This table defines the domain of possible experiments. Each complete assignment of values gives one configuration.
For example:
```python
config = {
    "learning_rate": 3e-4,
    "batch_size": 128,
    "hidden_dim": 512,
    "num_layers": 4,
    "dropout": 0.1,
    "optimizer": "AdamW",
    "weight_decay": 1e-2,
}
```

This configuration can be used to construct and train a model.
Types of Hyperparameters
Hyperparameters may be continuous, discrete, categorical, conditional, or structured.
A continuous hyperparameter takes values from an interval. Learning rate and dropout rate are common examples:

$$\eta \in [10^{-5}, 10^{-1}], \qquad p_{\text{drop}} \in [0.0, 0.5].$$

A discrete hyperparameter takes values from a countable set. Batch size, hidden dimension, and number of layers are usually discrete:

$$B \in \{32, 64, 128, 256\}.$$

A categorical hyperparameter selects from named choices:

$$\text{optimizer} \in \{\text{SGD}, \text{Adam}, \text{AdamW}\}.$$
A conditional hyperparameter is active only when another choice enables it. For example, momentum matters when the optimizer is SGD, but may be irrelevant when the optimizer is AdamW:
```python
if config["optimizer"] == "SGD":
    momentum = config["momentum"]
```

A structured hyperparameter describes a more complex object. Examples include the full architecture of a network, the list of channel widths in a CNN, or the schedule used by the learning rate.
Linear and Logarithmic Scales
The scale of a search dimension matters. Some hyperparameters should be searched linearly. Others should be searched logarithmically.
A dropout rate from 0.0 to 0.5 is usually searched on a linear scale because changes of 0.1 have roughly comparable meaning across the interval:
A learning rate should usually be searched on a logarithmic scale. The difference between $10^{-4}$ and $10^{-3}$ is often as important as the difference between $10^{-2}$ and $10^{-1}$. A linear search over $[10^{-5}, 10^{-1}]$ would spend most trials near large values and very few near small values.
A logarithmic search may be written as

$$\log_{10} \eta \sim \mathcal{U}(-5, -1),$$

or equivalently

$$\eta = 10^{u}, \quad u \sim \mathcal{U}(-5, -1).$$
In Python:
```python
import random

u = random.uniform(-5, -1)
learning_rate = 10 ** u
```

This samples learning rates between $10^{-5}$ and $10^{-1}$, with equal coverage per order of magnitude.
Weight decay, Adam epsilon, and initialization scale are also commonly searched on logarithmic scales.
Common Search Dimensions in Deep Learning
The most important hyperparameters depend on the model class, dataset, and training budget. Still, several choices appear in many deep learning systems.
| Area | Common hyperparameters |
|---|---|
| Optimization | Learning rate, optimizer, momentum, beta values, weight decay |
| Training | Batch size, number of epochs, gradient clipping, accumulation steps |
| Architecture | Number of layers, hidden dimension, activation function, normalization |
| Regularization | Dropout, label smoothing, augmentation strength, stochastic depth |
| Scheduling | Warmup steps, decay schedule, minimum learning rate |
| Data | Input resolution, sequence length, sampling strategy, tokenization |
| Efficiency | Mixed precision, checkpointing, parallelism, compilation |
The learning rate is often the most sensitive hyperparameter. A learning rate that is too large may cause unstable training. A learning rate that is too small may waste computation and produce undertrained models.
Batch size affects optimization, memory use, and throughput. Larger batches make better use of modern accelerators, but they may require learning rate adjustment. Smaller batches introduce more gradient noise, which can sometimes help generalization.
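One common heuristic for that learning rate adjustment is linear scaling: multiply the base learning rate by the batch-size ratio. A minimal sketch of the rule of thumb (the specific values are illustrative):

```python
def scaled_lr(base_lr, base_batch_size, batch_size):
    # Linear scaling heuristic: the learning rate grows with the batch-size ratio.
    return base_lr * (batch_size / base_batch_size)

# Doubling the batch size from 128 to 256 doubles the starting learning rate.
print(scaled_lr(3e-4, 128, 256))
```

This is only a starting point; the scaled value is usually still tuned, and the heuristic breaks down at very large batch sizes.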
Architecture hyperparameters control model capacity. Increasing hidden dimension or depth can improve performance when data and compute are sufficient. It can also increase overfitting, memory cost, and latency.
Regularization hyperparameters control how strongly the model is discouraged from fitting accidental patterns in the training data.
PyTorch Example: Defining a Search Space
A search space can be represented directly as a Python dictionary.
```python
search_space = {
    "learning_rate": ("log_uniform", 1e-5, 1e-1),
    "batch_size": ("choice", [32, 64, 128, 256]),
    "hidden_dim": ("choice", [128, 256, 512, 1024]),
    "num_layers": ("choice", [2, 3, 4, 6]),
    "dropout": ("uniform", 0.0, 0.5),
    "optimizer": ("choice", ["SGD", "Adam", "AdamW"]),
    "weight_decay": ("log_uniform", 1e-6, 1e-1),
}
```

A simple sampler can draw one configuration:
```python
import math
import random

def sample_config(space):
    config = {}
    for name, spec in space.items():
        kind = spec[0]
        if kind == "choice":
            values = spec[1]
            config[name] = random.choice(values)
        elif kind == "uniform":
            low, high = spec[1], spec[2]
            config[name] = random.uniform(low, high)
        elif kind == "log_uniform":
            low, high = spec[1], spec[2]
            log_low = math.log10(low)
            log_high = math.log10(high)
            config[name] = 10 ** random.uniform(log_low, log_high)
        else:
            raise ValueError(f"Unknown search type: {kind}")
    return config
```

Example:
```python
config = sample_config(search_space)
print(config)
```

This kind of representation is sufficient for simple random search. More advanced systems, such as Optuna, Ray Tune, or Ax, provide richer ways to define search spaces, conditional parameters, pruning, parallel execution, and storage.
Using a Configuration to Build a Model
A useful search space should map cleanly into training code. The training function should accept a configuration object and construct the model, optimizer, and data loaders from that configuration.
```python
import torch
from torch import nn

class MLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_layers, dropout):
        super().__init__()
        layers = []
        dim = input_dim
        for _ in range(num_layers):
            layers.append(nn.Linear(dim, hidden_dim))
            layers.append(nn.ReLU())
            layers.append(nn.Dropout(dropout))
            dim = hidden_dim
        layers.append(nn.Linear(dim, output_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```

The configuration determines the architecture:
```python
model = MLP(
    input_dim=784,
    hidden_dim=config["hidden_dim"],
    output_dim=10,
    num_layers=config["num_layers"],
    dropout=config["dropout"],
)
```

It also determines the optimizer:
```python
if config["optimizer"] == "SGD":
    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=config["learning_rate"],
        momentum=0.9,
        weight_decay=config["weight_decay"],
    )
elif config["optimizer"] == "Adam":
    optimizer = torch.optim.Adam(
        model.parameters(),
        lr=config["learning_rate"],
        weight_decay=config["weight_decay"],
    )
elif config["optimizer"] == "AdamW":
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=config["learning_rate"],
        weight_decay=config["weight_decay"],
    )
else:
    raise ValueError(f"Unknown optimizer: {config['optimizer']}")
```

This separation between configuration and training code is important. Hyperparameter optimization becomes difficult when choices are scattered throughout the program.
Valid and Invalid Configurations
Not every combination in a search space is valid.
A large hidden dimension and a large batch size may exceed GPU memory. A very deep network without normalization may train poorly. A learning rate that works for AdamW may be unsuitable for SGD. A sequence length of 8192 may be valid for one transformer architecture but impossible for another.
Therefore, a search space may need constraints.
For example:
```python
def is_valid(config):
    if config["hidden_dim"] >= 1024 and config["batch_size"] >= 256:
        return False
    if config["optimizer"] == "SGD" and config["learning_rate"] < 1e-4:
        return False
    return True
```

A sampler can reject invalid configurations:
```python
def sample_valid_config(space, max_attempts=100):
    for _ in range(max_attempts):
        config = sample_config(space)
        if is_valid(config):
            return config
    raise RuntimeError("Could not sample a valid configuration")
```

For small projects, manual constraints are enough. For large projects, it is better to encode constraints explicitly and log every failed trial. Silent failures can corrupt the interpretation of search results.
Search Space Size
The size of a search space determines how difficult exploration becomes.
Suppose we define a small grid with 3 learning rates, 3 batch sizes, and 3 hidden dimensions. The total number of configurations is

$$3 \times 3 \times 3 = 27.$$

If we add optimizer choice with 3 values and dropout with 5 values, the total becomes

$$27 \times 3 \times 5 = 405.$$
Search spaces grow multiplicatively. This is one reason exhaustive grid search becomes expensive in deep learning.
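The multiplicative growth can be checked directly. Assuming a hypothetical grid with three values each for learning rate, batch size, and hidden dimension, then adding three optimizers and five dropout values:

```python
import math

# Hypothetical number of grid values per search dimension
base = {"learning_rate": 3, "batch_size": 3, "hidden_dim": 3}
extended = {**base, "optimizer": 3, "dropout": 5}

print(math.prod(base.values()))      # 27 configurations
print(math.prod(extended.values()))  # 405 configurations
```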
If each training run takes two hours, then 405 configurations require 810 GPU-hours. If the model is large, even a single failed configuration may be costly.
A practical search space should be large enough to contain good solutions and small enough to explore under the available compute budget.
Coarse-to-Fine Search
A useful strategy is to begin with a broad search and then refine.
In the first stage, we search over a wide range, for example a learning rate drawn log-uniformly from $[10^{-5}, 10^{-1}]$ with all optimizer choices allowed.
After observing good configurations, we narrow the ranges, for example to learning rates in $[10^{-4}, 10^{-3}]$ with the best optimizer fixed.
This coarse-to-fine approach is common because early trials reveal the scale of useful values. The first stage answers broad questions. Is AdamW better than SGD? Is the model under-regularized? Does the learning rate need warmup? Does the batch size affect convergence?
The second stage answers finer questions. Which learning rate gives the best validation loss? How much dropout is beneficial? Which hidden size gives the best accuracy under a latency constraint?
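The two stages can be written as two search-space dictionaries in the same format used earlier; the narrowed ranges below are hypothetical values of the kind one might choose after inspecting stage-one results:

```python
# Stage 1: broad ranges over the most important dimensions
coarse_space = {
    "learning_rate": ("log_uniform", 1e-5, 1e-1),
    "weight_decay": ("log_uniform", 1e-6, 1e-1),
    "dropout": ("uniform", 0.0, 0.5),
    "optimizer": ("choice", ["SGD", "Adam", "AdamW"]),
}

# Stage 2: narrowed around the best stage-1 region (hypothetical)
fine_space = {
    "learning_rate": ("log_uniform", 1e-4, 1e-3),
    "weight_decay": ("log_uniform", 1e-3, 1e-1),
    "dropout": ("uniform", 0.05, 0.2),
    "optimizer": ("choice", ["AdamW"]),
}
```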
Architecture Search Spaces
Architecture search spaces are more complex than optimizer search spaces. They may include choices such as:
| Model family | Search dimensions |
|---|---|
| MLP | Layers, width, activation, normalization, dropout |
| CNN | Channels, kernel sizes, strides, residual blocks |
| Transformer | Layers, hidden size, attention heads, MLP ratio, context length |
| GNN | Message passing layers, aggregation function, readout function |
| Diffusion model | U-Net width, attention resolutions, noise schedule, sampling steps |
For a transformer, a simplified search space may be:
```python
transformer_space = {
    "num_layers": ("choice", [6, 12, 24]),
    "hidden_dim": ("choice", [512, 768, 1024]),
    "num_heads": ("choice", [8, 12, 16]),
    "mlp_ratio": ("choice", [2, 4]),
    "dropout": ("uniform", 0.0, 0.2),
    "learning_rate": ("log_uniform", 1e-5, 5e-4),
}
```

But this space has hidden constraints. The hidden dimension must usually be divisible by the number of attention heads:

$$d_{\text{model}} \bmod n_{\text{heads}} = 0.$$
In code:

```python
def valid_transformer_config(config):
    return config["hidden_dim"] % config["num_heads"] == 0
```

This constraint exists because each attention head receives a subspace of size

$$d_{\text{head}} = \frac{d_{\text{model}}}{n_{\text{heads}}}.$$

If $d_{\text{model}}$ cannot be divided evenly by $n_{\text{heads}}$, the model shape becomes invalid.
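A quick enumeration over the hidden dimensions and head counts in the space above shows how the divisibility constraint prunes the grid:

```python
# Enumerate which (hidden_dim, num_heads) pairs satisfy the constraint
hidden_dims = [512, 768, 1024]
num_heads = [8, 12, 16]

valid = [(d, h) for d in hidden_dims for h in num_heads if d % h == 0]
print(valid)
# 7 of the 9 pairs are valid; (512, 12) and (1024, 12) are not
```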
Budget-Aware Search Spaces
A search space should reflect the available budget. There are several kinds of budget:
| Budget type | Constraint |
|---|---|
| Compute | GPU-hours, accelerator type, parallel workers |
| Memory | Maximum batch size, sequence length, model size |
| Time | Wall-clock deadline |
| Data | Number of examples, labeling cost |
| Latency | Inference speed requirement |
| Storage | Checkpoints, logs, generated artifacts |
For example, if the model must run on a mobile device, the search space should include model size and latency constraints. A configuration with the best validation accuracy may be unusable if it exceeds the deployment budget.
If the training budget is small, the search space should focus on the most important variables: learning rate, weight decay, batch size, model width, and dropout. Searching too many dimensions with too few trials usually gives noisy conclusions.
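As a sketch of a budget constraint, a parameter-count filter for the MLP defined earlier could look like the following; the 784/10 input and output dimensions and the five-million-parameter cap are illustrative assumptions:

```python
def mlp_param_count(input_dim, hidden_dim, output_dim, num_layers):
    # Weights plus biases for each Linear layer, mirroring the MLP class above.
    count = 0
    dim = input_dim
    for _ in range(num_layers):
        count += dim * hidden_dim + hidden_dim
        dim = hidden_dim
    count += dim * output_dim + output_dim
    return count

def within_budget(config, max_params=5_000_000):
    # Reject configurations whose parameter count exceeds the budget.
    n = mlp_param_count(784, config["hidden_dim"], 10, config["num_layers"])
    return n <= max_params

print(mlp_param_count(784, 512, 10, 4))  # 1195018 parameters
```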
Practical Rules for Search Spaces
A good search space encodes prior knowledge. It avoids values that are clearly unreasonable while retaining enough freedom to find unexpected improvements.
For deep learning, these rules are often useful:
| Hyperparameter | Practical search rule |
|---|---|
| Learning rate | Search logarithmically |
| Weight decay | Search logarithmically |
| Dropout | Search linearly |
| Batch size | Search powers of two or hardware-friendly values |
| Hidden dimension | Search hardware-friendly multiples |
| Number of layers | Search small discrete sets |
| Optimizer | Compare a few strong defaults |
| Schedule | Start with simple schedules before complex ones |
The search space should also include a baseline configuration. Without a baseline, it becomes difficult to know whether the search has helped.
A baseline can be represented as:
```python
baseline_config = {
    "learning_rate": 3e-4,
    "batch_size": 128,
    "hidden_dim": 512,
    "num_layers": 4,
    "dropout": 0.1,
    "optimizer": "AdamW",
    "weight_decay": 1e-2,
}
```

The baseline is trained first. Search results are then compared against it.
Common Mistakes
A common mistake is defining a search space that is too large. If we vary every possible choice, the number of trials needed becomes impractical. Most trials will be wasted on poor regions.
Another mistake is using a linear scale for parameters whose useful values span orders of magnitude. Learning rate and weight decay should almost always be sampled logarithmically.
A third mistake is ignoring conditional structure. If momentum is sampled even when the optimizer is AdamW, the search algorithm receives irrelevant variables. This can reduce search efficiency.
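One way to respect conditional structure is to branch during sampling so that inactive variables are never drawn. A minimal self-contained sketch (the momentum range is an assumption):

```python
import random

def sample_conditional():
    # Draw the optimizer first, then only the hyperparameters it activates.
    config = {"optimizer": random.choice(["SGD", "Adam", "AdamW"])}
    config["learning_rate"] = 10 ** random.uniform(-5, -1)
    if config["optimizer"] == "SGD":
        # Momentum is only meaningful for SGD (hypothetical range).
        config["momentum"] = random.uniform(0.8, 0.99)
    return config
```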
A fourth mistake is treating failed trials as missing data without explanation. Failed trials often reveal important information. They may indicate memory limits, unstable training, invalid shapes, or bad learning rates.
A fifth mistake is choosing the search space after repeatedly looking at the test set. Hyperparameter search should use validation data. The test set should remain reserved for final evaluation.
Summary
A search space is the set of hyperparameter configurations considered during hyperparameter optimization. It defines which values may vary, how they are sampled, and which combinations are valid.
In deep learning, important search dimensions include learning rate, batch size, optimizer, weight decay, model depth, model width, dropout, normalization, data augmentation, and learning rate schedule.
Good search spaces use logarithmic scales for scale-sensitive quantities, encode constraints explicitly, respect compute and memory budgets, and separate configuration from training code. The quality of the search space often matters more than the choice of search algorithm.