Search Spaces

Hyperparameter optimization begins by deciding what may vary. This set of possible choices is called the search space. Before we run grid search, random search, Bayesian optimization, or any automated method, we must define the hyperparameters, their allowed values, and the rules that make some combinations valid or invalid.

A hyperparameter is a value chosen outside the training process. It controls how the model is built or how training is performed. Examples include the learning rate, batch size, number of layers, hidden dimension, dropout rate, optimizer type, weight decay, data augmentation strength, and learning rate schedule.

Parameters and hyperparameters have different roles. Parameters are learned from data. Hyperparameters are chosen by the practitioner or by an outer optimization procedure.

For a neural network with weights $\theta$, training usually solves an optimization problem of the form

$$\theta^\ast(\lambda) = \arg\min_{\theta} \mathcal{L}_{\text{train}}(\theta; \lambda),$$

where $\lambda$ denotes the hyperparameters. Hyperparameter optimization then chooses $\lambda$ so that the trained model performs well on validation data:

$$\lambda^\ast = \arg\min_{\lambda \in \Lambda} \mathcal{L}_{\text{val}}(\theta^\ast(\lambda)).$$

Here $\Lambda$ is the search space.
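
Even before choosing a search algorithm, this outer problem can be made concrete in code. The following is a minimal sketch of the outer loop as random search, assuming each hyperparameter is given as a finite list of candidate values and a hypothetical train_and_validate function that solves the inner training problem and returns the validation loss:

import random

def random_search(space, n_trials, train_and_validate):
    # Outer loop: sample lambda, solve the inner problem, keep the best.
    best_config, best_loss = None, float("inf")
    for _ in range(n_trials):
        config = {name: random.choice(values) for name, values in space.items()}
        val_loss = train_and_validate(config)  # hypothetical inner solver
        if val_loss < best_loss:
            best_config, best_loss = config, val_loss
    return best_config, best_loss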

What a Search Space Contains

A search space specifies the allowed values of each hyperparameter.

For example, a simple multilayer perceptron may have the following search space:

| Hyperparameter   | Type        | Example values         |
|------------------|-------------|------------------------|
| Learning rate    | Continuous  | $10^{-5}$ to $10^{-1}$ |
| Batch size       | Discrete    | 32, 64, 128, 256       |
| Hidden dimension | Discrete    | 128, 256, 512, 1024    |
| Number of layers | Discrete    | 2, 3, 4, 6             |
| Dropout rate     | Continuous  | 0.0 to 0.5             |
| Optimizer        | Categorical | SGD, Adam, AdamW       |
| Weight decay     | Continuous  | $10^{-6}$ to $10^{-1}$ |

This table defines the domain of possible experiments. Each complete assignment of values gives one configuration.

For example:

config = {
    "learning_rate": 3e-4,
    "batch_size": 128,
    "hidden_dim": 512,
    "num_layers": 4,
    "dropout": 0.1,
    "optimizer": "AdamW",
    "weight_decay": 1e-2,
}

This configuration can be used to construct and train a model.

Types of Hyperparameters

Hyperparameters may be continuous, discrete, categorical, conditional, or structured.

A continuous hyperparameter takes values from an interval. Learning rate and dropout rate are common examples:

$$\eta \in [10^{-5}, 10^{-1}], \qquad p_{\text{drop}} \in [0, 0.5].$$

A discrete hyperparameter takes values from a countable set. Batch size, hidden dimension, and number of layers are usually discrete:

$$B \in \{32, 64, 128, 256\}.$$

A categorical hyperparameter selects from named choices:

$$\text{optimizer} \in \{\text{SGD}, \text{Adam}, \text{AdamW}\}.$$

A conditional hyperparameter is active only when another choice enables it. For example, momentum matters when the optimizer is SGD, but may be irrelevant when the optimizer is AdamW:

if config["optimizer"] == "SGD":
    momentum = config["momentum"]

A structured hyperparameter describes a more complex object. Examples include the full architecture of a network, the list of channel widths in a CNN, or the schedule used by the learning rate.
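
As a sketch of how a structured hyperparameter might be sampled, the following draws a list of CNN channel widths whose length is itself random (the stage counts and base widths are illustrative):

import random

def sample_channel_widths():
    # The number of stages is itself a hyperparameter...
    num_stages = random.choice([2, 3, 4])
    # ...and each stage doubles a randomly chosen base width.
    base = random.choice([16, 32, 64])
    return [base * (2 ** i) for i in range(num_stages)]

widths = sample_channel_widths()  # e.g. [32, 64, 128]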

Linear and Logarithmic Scales

The scale of a search dimension matters. Some hyperparameters should be searched linearly. Others should be searched logarithmically.

A dropout rate from 0.0 to 0.5 is usually searched on a linear scale because changes of 0.1 have roughly comparable meaning across the interval:

$$p_{\text{drop}} \in [0.0, 0.5].$$

A learning rate should usually be searched on a logarithmic scale. The difference between $10^{-5}$ and $10^{-4}$ is often as important as the difference between $10^{-3}$ and $10^{-2}$. A linear search over $[10^{-5}, 10^{-1}]$ would spend most trials near large values and very few near small values: under a uniform distribution on this interval, roughly 99% of samples exceed $10^{-3}$.

A logarithmic search may be written as

$$\log_{10}(\eta) \sim \text{Uniform}(-5, -1),$$

or equivalently

$$\eta = 10^u, \qquad u \sim \text{Uniform}(-5, -1).$$

In Python:

import random

# Sample the exponent uniformly, then exponentiate.
u = random.uniform(-5, -1)
learning_rate = 10 ** u

This samples learning rates between $10^{-5}$ and $10^{-1}$, with equal coverage per order of magnitude.

Weight decay, Adam epsilon, and initialization scale are also commonly searched on logarithmic scales.

Common Search Dimensions in Deep Learning

The most important hyperparameters depend on the model class, dataset, and training budget. Still, several choices appear in many deep learning systems.

| Area           | Common hyperparameters                                                  |
|----------------|-------------------------------------------------------------------------|
| Optimization   | Learning rate, optimizer, momentum, beta values, weight decay           |
| Training       | Batch size, number of epochs, gradient clipping, accumulation steps     |
| Architecture   | Number of layers, hidden dimension, activation function, normalization  |
| Regularization | Dropout, label smoothing, augmentation strength, stochastic depth       |
| Scheduling     | Warmup steps, decay schedule, minimum learning rate                     |
| Data           | Input resolution, sequence length, sampling strategy, tokenization      |
| Efficiency     | Mixed precision, checkpointing, parallelism, compilation                |

The learning rate is often the most sensitive hyperparameter. A learning rate that is too large may cause unstable training. A learning rate that is too small may waste computation and produce undertrained models.

Batch size affects optimization, memory use, and throughput. Larger batches make better use of modern accelerators, but they may require learning rate adjustment. Smaller batches introduce more gradient noise, which can sometimes help generalization.
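
One common heuristic for that adjustment, sometimes called the linear scaling rule, increases the learning rate in proportion to the batch size. A sketch, with illustrative reference values:

# Linear scaling heuristic: scale the learning rate with batch size,
# relative to a reference configuration (base values are illustrative).
base_lr = 1e-3
base_batch_size = 128
scaled_lr = base_lr * config["batch_size"] / base_batch_size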

Architecture hyperparameters control model capacity. Increasing hidden dimension or depth can improve performance when data and compute are sufficient. It can also increase overfitting, memory cost, and latency.

Regularization hyperparameters control how strongly the model is discouraged from fitting accidental patterns in the training data.

PyTorch Example: Defining a Search Space

A search space can be represented directly as a Python dictionary.

search_space = {
    "learning_rate": ("log_uniform", 1e-5, 1e-1),
    "batch_size": ("choice", [32, 64, 128, 256]),
    "hidden_dim": ("choice", [128, 256, 512, 1024]),
    "num_layers": ("choice", [2, 3, 4, 6]),
    "dropout": ("uniform", 0.0, 0.5),
    "optimizer": ("choice", ["SGD", "Adam", "AdamW"]),
    "weight_decay": ("log_uniform", 1e-6, 1e-1),
}

A simple sampler can draw one configuration:

import math
import random

def sample_config(space):
    """Draw one configuration by sampling each dimension independently."""
    config = {}

    for name, spec in space.items():
        kind = spec[0]

        if kind == "choice":
            values = spec[1]
            config[name] = random.choice(values)

        elif kind == "uniform":
            low, high = spec[1], spec[2]
            config[name] = random.uniform(low, high)

        elif kind == "log_uniform":
            low, high = spec[1], spec[2]
            log_low = math.log10(low)
            log_high = math.log10(high)
            config[name] = 10 ** random.uniform(log_low, log_high)

        else:
            raise ValueError(f"Unknown search type: {kind}")

    return config

Example:

config = sample_config(search_space)
print(config)

This kind of representation is sufficient for simple random search. More advanced systems, such as Optuna, Ray Tune, or Ax, provide richer ways to define search spaces, conditional parameters, pruning, parallel execution, and storage.
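
As one illustration, a similar space can be expressed in Optuna, which also handles the conditional momentum parameter from earlier. This is a sketch rather than a complete training script; train_and_validate is again a hypothetical helper that returns validation loss:

import optuna

def objective(trial):
    config = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-1, log=True),
        "batch_size": trial.suggest_categorical("batch_size", [32, 64, 128, 256]),
        "dropout": trial.suggest_float("dropout", 0.0, 0.5),
        "optimizer": trial.suggest_categorical("optimizer", ["SGD", "Adam", "AdamW"]),
    }
    # Conditional hyperparameter: momentum is only sampled for SGD.
    if config["optimizer"] == "SGD":
        config["momentum"] = trial.suggest_float("momentum", 0.0, 0.99)
    return train_and_validate(config)  # hypothetical helper

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)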

Using a Configuration to Build a Model

A useful search space should map cleanly into training code. The training function should accept a configuration object and construct the model, optimizer, and data loaders from that configuration.

import torch
from torch import nn

class MLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_layers, dropout):
        super().__init__()

        layers = []
        dim = input_dim

        for _ in range(num_layers):
            layers.append(nn.Linear(dim, hidden_dim))
            layers.append(nn.ReLU())
            layers.append(nn.Dropout(dropout))
            dim = hidden_dim

        layers.append(nn.Linear(dim, output_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

The configuration determines the architecture:

model = MLP(
    input_dim=784,
    hidden_dim=config["hidden_dim"],
    output_dim=10,
    num_layers=config["num_layers"],
    dropout=config["dropout"],
)

It also determines the optimizer:

if config["optimizer"] == "SGD":
    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=config["learning_rate"],
        momentum=0.9,
        weight_decay=config["weight_decay"],
    )

elif config["optimizer"] == "Adam":
    optimizer = torch.optim.Adam(
        model.parameters(),
        lr=config["learning_rate"],
        weight_decay=config["weight_decay"],
    )

elif config["optimizer"] == "AdamW":
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=config["learning_rate"],
        weight_decay=config["weight_decay"],
    )

else:
    raise ValueError(f"Unknown optimizer: {config['optimizer']}")

This separation between configuration and training code is important. Hyperparameter optimization becomes difficult when choices are scattered throughout the program.
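
A minimal sketch of the pattern: a single entry point receives the configuration and builds everything from it. Here build_optimizer is a hypothetical helper wrapping the if/elif chain above, and evaluate is a hypothetical validation routine:

def train(config, train_loader, val_loader, num_epochs=10):
    # Everything is derived from `config`; nothing is hard-coded elsewhere.
    model = MLP(
        input_dim=784,
        hidden_dim=config["hidden_dim"],
        output_dim=10,
        num_layers=config["num_layers"],
        dropout=config["dropout"],
    )
    optimizer = build_optimizer(model, config)  # hypothetical wrapper
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(num_epochs):
        for x, y in train_loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()

    return evaluate(model, val_loader)  # hypothetical validation loss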

Valid and Invalid Configurations

Not every combination in a search space is valid.

A large hidden dimension and a large batch size may exceed GPU memory. A very deep network without normalization may train poorly. A learning rate that works for AdamW may be unsuitable for SGD. A sequence length of 8192 may be valid for one transformer architecture but impossible for another.

Therefore, a search space may need constraints.

For example:

def is_valid(config):
    if config["hidden_dim"] >= 1024 and config["batch_size"] >= 256:
        return False

    if config["optimizer"] == "SGD" and config["learning_rate"] < 1e-4:
        return False

    return True

A sampler can reject invalid configurations:

def sample_valid_config(space, max_attempts=100):
    for _ in range(max_attempts):
        config = sample_config(space)
        if is_valid(config):
            return config

    raise RuntimeError("Could not sample a valid configuration")

For small projects, manual constraints are enough. For large projects, it is better to encode constraints explicitly and log every failed trial. Silent failures can corrupt the interpretation of search results.
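
A minimal way to log failures, assuming the same hypothetical train_and_validate helper, is to catch exceptions per trial and record the offending configuration:

import logging

def run_trial(config, train_and_validate):
    try:
        return train_and_validate(config)
    except RuntimeError as err:  # e.g. CUDA out of memory
        # Keep the failed configuration on record instead of dropping it.
        logging.warning("Trial failed for config %s: %s", config, err)
        return None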

Search Space Size

The size of a search space determines how difficult exploration becomes.

Suppose we define:

$$\text{learning rate} \in \{10^{-4}, 10^{-3}, 10^{-2}\}, \qquad \text{batch size} \in \{32, 64, 128\}, \qquad \text{hidden dimension} \in \{128, 256, 512\}.$$

The total number of configurations is

$$3 \times 3 \times 3 = 27.$$

If we add optimizer choice with 3 values and dropout with 5 values, the total becomes

$$3 \times 3 \times 3 \times 3 \times 5 = 405.$$
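
The growth can be checked directly by enumerating the grid:

import math
from itertools import product

grid = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "batch_size": [32, 64, 128],
    "hidden_dim": [128, 256, 512],
    "optimizer": ["SGD", "Adam", "AdamW"],
    "dropout": [0.0, 0.1, 0.2, 0.3, 0.4],
}

n_configs = math.prod(len(v) for v in grid.values())  # 405
all_configs = [dict(zip(grid, values)) for values in product(*grid.values())]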

Search spaces grow multiplicatively. This is one reason exhaustive grid search becomes expensive in deep learning.

If each training run takes two hours, then 405 configurations require 810 GPU-hours. If the model is large, even a single failed configuration may be costly.

A practical search space should be large enough to contain good solutions and small enough to explore under the available compute budget.

Coarse-to-Fine Search

A useful strategy is to begin with a broad search and then refine.

In the first stage, we search over a wide range:

$$\eta \in [10^{-5}, 10^{-1}], \qquad \lambda_{\text{wd}} \in [10^{-6}, 10^{-1}].$$

After observing good configurations, we narrow the ranges:

$$\eta \in [10^{-4}, 10^{-3}], \qquad \lambda_{\text{wd}} \in [10^{-4}, 10^{-2}].$$

This coarse-to-fine approach is common because early trials reveal the scale of useful values. The first stage answers broad questions. Is AdamW better than SGD? Is the model under-regularized? Does the learning rate need warmup? Does the batch size affect convergence?

The second stage answers finer questions. Which learning rate gives the best validation loss? How much dropout is beneficial? Which hidden size gives the best accuracy under a latency constraint?
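
One way to derive the second-stage ranges is to bound them by the best first-stage trials. A sketch, assuming results is a list of (config, val_loss) pairs collected during the coarse stage:

import math

# Take the five best trials from the coarse stage.
top = sorted(results, key=lambda r: r[1])[:5]
lrs = [cfg["learning_rate"] for cfg, _ in top]

# Pad the observed range by half an order of magnitude on each side.
low = 10 ** (math.log10(min(lrs)) - 0.5)
high = 10 ** (math.log10(max(lrs)) + 0.5)
stage_two_space = {"learning_rate": ("log_uniform", low, high)}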

Architecture Search Spaces

Architecture search spaces are more complex than optimizer search spaces. They may include choices such as:

| Model family    | Search dimensions                                                   |
|-----------------|---------------------------------------------------------------------|
| MLP             | Layers, width, activation, normalization, dropout                   |
| CNN             | Channels, kernel sizes, strides, residual blocks                    |
| Transformer     | Layers, hidden size, attention heads, MLP ratio, context length     |
| GNN             | Message passing layers, aggregation function, readout function      |
| Diffusion model | U-Net width, attention resolutions, noise schedule, sampling steps  |

For a transformer, a simplified search space may be:

transformer_space = {
    "num_layers": ("choice", [6, 12, 24]),
    "hidden_dim": ("choice", [512, 768, 1024]),
    "num_heads": ("choice", [8, 12, 16]),
    "mlp_ratio": ("choice", [2, 4]),
    "dropout": ("uniform", 0.0, 0.2),
    "learning_rate": ("log_uniform", 1e-5, 5e-4),
}

But this space has hidden constraints. The hidden dimension must usually be divisible by the number of attention heads:

$$d_{\text{model}} \bmod h = 0.$$

In code:

def valid_transformer_config(config):
    return config["hidden_dim"] % config["num_heads"] == 0

This constraint exists because each attention head receives a subspace of size

$$d_{\text{head}} = \frac{d_{\text{model}}}{h}.$$

If $d_{\text{model}}$ cannot be divided evenly by $h$, the model shape becomes invalid. In the space above, for example, $d_{\text{model}} = 1024$ with $h = 16$ gives $d_{\text{head}} = 64$, but $d_{\text{model}} = 1024$ with $h = 12$ is invalid.

Budget-Aware Search Spaces

A search space should reflect the available budget. There are several kinds of budget:

| Budget type | Constraint                                      |
|-------------|-------------------------------------------------|
| Compute     | GPU-hours, accelerator type, parallel workers   |
| Memory      | Maximum batch size, sequence length, model size |
| Time        | Wall-clock deadline                             |
| Data        | Number of examples, labeling cost               |
| Latency     | Inference speed requirement                     |
| Storage     | Checkpoints, logs, generated artifacts          |

For example, if the model must run on a mobile device, the search space should include model size and latency constraints. A configuration with the best validation accuracy may be unusable if it exceeds the deployment budget.

If the training budget is small, the search space should focus on the most important variables: learning rate, weight decay, batch size, model width, and dropout. Searching too many dimensions with too few trials usually gives noisy conclusions.
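
A size budget can be encoded as another validity check. The sketch below rejects MLP configurations whose rough parameter count exceeds a hypothetical deployment limit:

def fits_budget(config, max_params=5_000_000):
    # Rough weight count for the MLP defined earlier: input layer,
    # hidden layers, output layer (biases ignored; the limit is illustrative).
    d, n = config["hidden_dim"], config["num_layers"]
    params = 784 * d + (n - 1) * d * d + d * 10
    return params <= max_params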

Practical Rules for Search Spaces

A good search space encodes prior knowledge. It avoids values that are clearly unreasonable while retaining enough freedom to find unexpected improvements.

For deep learning, these rules are often useful:

| Hyperparameter   | Practical search rule                            |
|------------------|--------------------------------------------------|
| Learning rate    | Search logarithmically                           |
| Weight decay     | Search logarithmically                           |
| Dropout          | Search linearly                                  |
| Batch size       | Search powers of two or hardware-friendly values |
| Hidden dimension | Search hardware-friendly multiples               |
| Number of layers | Search small discrete sets                       |
| Optimizer        | Compare a few strong defaults                    |
| Schedule         | Start with simple schedules before complex ones  |

The search space should also include a baseline configuration. Without a baseline, it becomes difficult to know whether the search has helped.

A baseline can be represented as:

baseline_config = {
    "learning_rate": 3e-4,
    "batch_size": 128,
    "hidden_dim": 512,
    "num_layers": 4,
    "dropout": 0.1,
    "optimizer": "AdamW",
    "weight_decay": 1e-2,
}

The baseline is trained first. Search results are then compared against it.
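
In code, the comparison can be as simple as training the baseline first and requiring every searched configuration to beat it, again using the hypothetical train_and_validate helper:

baseline_loss = train_and_validate(baseline_config)

best_config, best_loss = None, baseline_loss
for _ in range(50):
    config = sample_valid_config(search_space)
    val_loss = train_and_validate(config)
    if val_loss < best_loss:
        best_config, best_loss = config, val_loss

# If best_config is still None, the search did not beat the baseline.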

Common Mistakes

A common mistake is defining a search space that is too large. If we vary every possible choice, the number of trials needed becomes impractical. Most trials will be wasted on poor regions.

Another mistake is using a linear scale for parameters whose useful values span orders of magnitude. Learning rate and weight decay should almost always be sampled logarithmically.

A third mistake is ignoring conditional structure. If momentum is sampled even when the optimizer is AdamW, the search algorithm receives irrelevant variables. This can reduce search efficiency.

A fourth mistake is treating failed trials as missing data without explanation. Failed trials often reveal important information. They may indicate memory limits, unstable training, invalid shapes, or bad learning rates.

A fifth mistake is choosing the search space after repeatedly looking at the test set. Hyperparameter search should use validation data. The test set should remain reserved for final evaluation.

Summary

A search space is the set of hyperparameter configurations considered during hyperparameter optimization. It defines which values may vary, how they are sampled, and which combinations are valid.

In deep learning, important search dimensions include learning rate, batch size, optimizer, weight decay, model depth, model width, dropout, normalization, data augmentation, and learning rate schedule.

Good search spaces use logarithmic scales for scale-sensitive quantities, encode constraints explicitly, respect compute and memory budgets, and separate configuration from training code. The quality of the search space often matters more than the choice of search algorithm.