Hyperparameter optimization begins by deciding what may vary. This set of possible choices is called the search space. Before we run grid search, random search, Bayesian optimization, or any automated method, we must define the hyperparameters, their allowed values, and the rules that make some combinations valid or invalid.
A hyperparameter is a value chosen outside the training process. It controls how the model is built or how training is performed. Examples include the learning rate, batch size, number of layers, hidden dimension, dropout rate, optimizer type, weight decay, data augmentation strength, and learning rate schedule.
Parameters and hyperparameters have different roles. Parameters are learned from data. Hyperparameters are chosen by the practitioner or by an outer optimization procedure.
For a neural network with weights $\theta$, training usually solves an optimization problem of the form

$$\theta^*(\lambda) = \arg\min_{\theta} \; \mathcal{L}_{\text{train}}(\theta; \lambda),$$

where $\lambda$ denotes the hyperparameters. Hyperparameter optimization then chooses $\lambda$ so that the trained model performs well on validation data:

$$\lambda^* = \arg\min_{\lambda \in \Lambda} \; \mathcal{L}_{\text{val}}(\theta^*(\lambda)).$$

Here $\Lambda$ is the search space.
What a Search Space Contains
A search space specifies the allowed values of each hyperparameter.
For example, a simple multilayer perceptron may have the following search space:
| Hyperparameter | Type | Example values |
|---|---|---|
| Learning rate | Continuous | $10^{-5}$ to $10^{-1}$ |
| Batch size | Discrete | 32, 64, 128, 256 |
| Hidden dimension | Discrete | 128, 256, 512, 1024 |
| Number of layers | Discrete | 2, 3, 4, 6 |
| Dropout rate | Continuous | 0.0 to 0.5 |
| Optimizer | Categorical | SGD, Adam, AdamW |
| Weight decay | Continuous | $10^{-6}$ to $10^{-1}$ |
This table defines the domain of possible experiments. Each complete assignment of values gives one configuration.
For example:
```python
config = {
    "learning_rate": 3e-4,
    "batch_size": 128,
    "hidden_dim": 512,
    "num_layers": 4,
    "dropout": 0.1,
    "optimizer": "AdamW",
    "weight_decay": 1e-2,
}
```

This configuration can be used to construct and train a model.
Types of Hyperparameters
Hyperparameters may be continuous, discrete, categorical, conditional, or structured.
A continuous hyperparameter takes values from an interval. Learning rate and dropout rate are common examples:

$$\eta \in [10^{-5}, 10^{-1}], \qquad p_{\text{drop}} \in [0.0, 0.5].$$

A discrete hyperparameter takes values from a countable set. Batch size, hidden dimension, and number of layers are usually discrete:

$$B \in \{32, 64, 128, 256\}.$$

A categorical hyperparameter selects from named choices:

$$\text{optimizer} \in \{\text{SGD}, \text{Adam}, \text{AdamW}\}.$$
A conditional hyperparameter is active only when another choice enables it. For example, momentum matters when the optimizer is SGD, but may be irrelevant when the optimizer is AdamW:
```python
if config["optimizer"] == "SGD":
    momentum = config["momentum"]
```

A structured hyperparameter describes a more complex object. Examples include the full architecture of a network, the list of channel widths in a CNN, or the schedule used by the learning rate.
Linear and Logarithmic Scales
The scale of a search dimension matters. Some hyperparameters should be searched linearly. Others should be searched logarithmically.
A dropout rate from 0.0 to 0.5 is usually searched on a linear scale because changes of 0.1 have roughly comparable meaning across the interval:
A learning rate should usually be searched on a logarithmic scale. The difference between $10^{-4}$ and $10^{-3}$ is often as important as the difference between $10^{-2}$ and $10^{-1}$. A linear search over $[10^{-5}, 10^{-1}]$ would spend most trials near large values and very few near small values.
A logarithmic search may be written as

$$\log_{10} \eta \sim \mathcal{U}(-5, -1),$$

or equivalently

$$\eta = 10^{u}, \quad u \sim \mathcal{U}(-5, -1).$$
In Python:
```python
import random

u = random.uniform(-5, -1)
learning_rate = 10 ** u
```

This samples learning rates between $10^{-5}$ and $10^{-1}$, with equal coverage per order of magnitude.
Weight decay, Adam epsilon, and initialization scale are also commonly searched on logarithmic scales.
Common Search Dimensions in Deep Learning
The most important hyperparameters depend on the model class, dataset, and training budget. Still, several choices appear in many deep learning systems.
| Area | Common hyperparameters |
|---|---|
| Optimization | Learning rate, optimizer, momentum, beta values, weight decay |
| Training | Batch size, number of epochs, gradient clipping, accumulation steps |
| Architecture | Number of layers, hidden dimension, activation function, normalization |
| Regularization | Dropout, label smoothing, augmentation strength, stochastic depth |
| Scheduling | Warmup steps, decay schedule, minimum learning rate |
| Data | Input resolution, sequence length, sampling strategy, tokenization |
| Efficiency | Mixed precision, checkpointing, parallelism, compilation |
The learning rate is often the most sensitive hyperparameter. A learning rate that is too large may cause unstable training. A learning rate that is too small may waste computation and produce undertrained models.
Batch size affects optimization, memory use, and throughput. Larger batches make better use of modern accelerators, but they may require learning rate adjustment. Smaller batches introduce more gradient noise, which can sometimes help generalization.
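One common heuristic for that learning rate adjustment is linear scaling: multiply the base learning rate by the batch-size ratio. A minimal sketch of the rule of thumb (the specific values are illustrative):

```python
def scaled_lr(base_lr, base_batch_size, batch_size):
    # Linear scaling heuristic: the learning rate grows with the batch-size ratio.
    return base_lr * (batch_size / base_batch_size)

# Doubling the batch size from 128 to 256 doubles the starting learning rate.
print(scaled_lr(3e-4, 128, 256))
```

This is only a starting point; the scaled value is usually still tuned, and the heuristic breaks down at very large batch sizes.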
Architecture hyperparameters control model capacity. Increasing hidden dimension or depth can improve performance when data and compute are sufficient. It can also increase overfitting, memory cost, and latency.
Regularization hyperparameters control how strongly the model is discouraged from fitting accidental patterns in the training data.
PyTorch Example: Defining a Search Space
A search space can be represented directly as a Python dictionary.
```python
search_space = {
    "learning_rate": ("log_uniform", 1e-5, 1e-1),
    "batch_size": ("choice", [32, 64, 128, 256]),
    "hidden_dim": ("choice", [128, 256, 512, 1024]),
    "num_layers": ("choice", [2, 3, 4, 6]),
    "dropout": ("uniform", 0.0, 0.5),
    "optimizer": ("choice", ["SGD", "Adam", "AdamW"]),
    "weight_decay": ("log_uniform", 1e-6, 1e-1),
}
```

A simple sampler can draw one configuration:
```python
import math
import random

def sample_config(space):
    config = {}
    for name, spec in space.items():
        kind = spec[0]
        if kind == "choice":
            values = spec[1]
            config[name] = random.choice(values)
        elif kind == "uniform":
            low, high = spec[1], spec[2]
            config[name] = random.uniform(low, high)
        elif kind == "log_uniform":
            low, high = spec[1], spec[2]
            log_low = math.log10(low)
            log_high = math.log10(high)
            config[name] = 10 ** random.uniform(log_low, log_high)
        else:
            raise ValueError(f"Unknown search type: {kind}")
    return config
```

Example:
```python
config = sample_config(search_space)
print(config)
```

This kind of representation is sufficient for simple random search. More advanced systems, such as Optuna, Ray Tune, or Ax, provide richer ways to define search spaces, conditional parameters, pruning, parallel execution, and storage.
Using a Configuration to Build a Model
A useful search space should map cleanly into training code. The training function should accept a configuration object and construct the model, optimizer, and data loaders from that configuration.
```python
import torch
from torch import nn

class MLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_layers, dropout):
        super().__init__()
        layers = []
        dim = input_dim
        for _ in range(num_layers):
            layers.append(nn.Linear(dim, hidden_dim))
            layers.append(nn.ReLU())
            layers.append(nn.Dropout(dropout))
            dim = hidden_dim
        layers.append(nn.Linear(dim, output_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```

The configuration determines the architecture:
```python
model = MLP(
    input_dim=784,
    hidden_dim=config["hidden_dim"],
    output_dim=10,
    num_layers=config["num_layers"],
    dropout=config["dropout"],
)
```

It also determines the optimizer:
```python
if config["optimizer"] == "SGD":
    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=config["learning_rate"],
        momentum=0.9,
        weight_decay=config["weight_decay"],
    )
elif config["optimizer"] == "Adam":
    optimizer = torch.optim.Adam(
        model.parameters(),
        lr=config["learning_rate"],
        weight_decay=config["weight_decay"],
    )
elif config["optimizer"] == "AdamW":
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=config["learning_rate"],
        weight_decay=config["weight_decay"],
    )
else:
    raise ValueError(f"Unknown optimizer: {config['optimizer']}")
```

This separation between configuration and training code is important. Hyperparameter optimization becomes difficult when choices are scattered throughout the program.
Valid and Invalid Configurations
Not every combination in a search space is valid.
A large hidden dimension and a large batch size may exceed GPU memory. A very deep network without normalization may train poorly. A learning rate that works for AdamW may be unsuitable for SGD. A sequence length of 8192 may be valid for one transformer architecture but impossible for another.
Therefore, a search space may need constraints.
For example:
```python
def is_valid(config):
    if config["hidden_dim"] >= 1024 and config["batch_size"] >= 256:
        return False
    if config["optimizer"] == "SGD" and config["learning_rate"] < 1e-4:
        return False
    return True
```

A sampler can reject invalid configurations:
```python
def sample_valid_config(space, max_attempts=100):
    for _ in range(max_attempts):
        config = sample_config(space)
        if is_valid(config):
            return config
    raise RuntimeError("Could not sample a valid configuration")
```

For small projects, manual constraints are enough. For large projects, it is better to encode constraints explicitly and log every failed trial. Silent failures can corrupt the interpretation of search results.
Search Space Size
The size of a search space determines how difficult exploration becomes.
Suppose we define a small grid with 3 learning rates, 3 batch sizes, and 3 hidden dimensions. The total number of configurations is

$$3 \times 3 \times 3 = 27.$$

If we add optimizer choice with 3 values and dropout with 5 values, the total becomes

$$27 \times 3 \times 5 = 405.$$
Search spaces grow multiplicatively. This is one reason exhaustive grid search becomes expensive in deep learning.
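The multiplicative growth can be checked directly. Assuming a hypothetical grid with three values each for learning rate, batch size, and hidden dimension, then adding three optimizers and five dropout values:

```python
import math

# Hypothetical number of grid values per search dimension
base = {"learning_rate": 3, "batch_size": 3, "hidden_dim": 3}
extended = {**base, "optimizer": 3, "dropout": 5}

print(math.prod(base.values()))      # 27 configurations
print(math.prod(extended.values()))  # 405 configurations
```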
If each training run takes two hours, then 405 configurations require 810 GPU-hours. If the model is large, even a single failed configuration may be costly.
A practical search space should be large enough to contain good solutions and small enough to explore under the available compute budget.
Coarse-to-Fine Search
A useful strategy is to begin with a broad search and then refine.
In the first stage, we search over a wide range, for example a learning rate drawn log-uniformly from $[10^{-5}, 10^{-1}]$ with all optimizer choices allowed.
After observing good configurations, we narrow the ranges, for example to learning rates in $[10^{-4}, 10^{-3}]$ with the best optimizer fixed.
This coarse-to-fine approach is common because early trials reveal the scale of useful values. The first stage answers broad questions. Is AdamW better than SGD? Is the model under-regularized? Does the learning rate need warmup? Does the batch size affect convergence?
The second stage answers finer questions. Which learning rate gives the best validation loss? How much dropout is beneficial? Which hidden size gives the best accuracy under a latency constraint?
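The two stages can be written as two search-space dictionaries in the same format used earlier; the narrowed ranges below are hypothetical values of the kind one might choose after inspecting stage-one results:

```python
# Stage 1: broad ranges over the most important dimensions
coarse_space = {
    "learning_rate": ("log_uniform", 1e-5, 1e-1),
    "weight_decay": ("log_uniform", 1e-6, 1e-1),
    "dropout": ("uniform", 0.0, 0.5),
    "optimizer": ("choice", ["SGD", "Adam", "AdamW"]),
}

# Stage 2: narrowed around the best stage-1 region (hypothetical)
fine_space = {
    "learning_rate": ("log_uniform", 1e-4, 1e-3),
    "weight_decay": ("log_uniform", 1e-3, 1e-1),
    "dropout": ("uniform", 0.05, 0.2),
    "optimizer": ("choice", ["AdamW"]),
}
```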
Architecture Search Spaces
Architecture search spaces are more complex than optimizer search spaces. They may include choices such as:
| Model family | Search dimensions |
|---|---|
| MLP | Layers, width, activation, normalization, dropout |
| CNN | Channels, kernel sizes, strides, residual blocks |
| Transformer | Layers, hidden size, attention heads, MLP ratio, context length |
| GNN | Message passing layers, aggregation function, readout function |
| Diffusion model | U-Net width, attention resolutions, noise schedule, sampling steps |
For a transformer, a simplified search space may be:
```python
transformer_space = {
    "num_layers": ("choice", [6, 12, 24]),
    "hidden_dim": ("choice", [512, 768, 1024]),
    "num_heads": ("choice", [8, 12, 16]),
    "mlp_ratio": ("choice", [2, 4]),
    "dropout": ("uniform", 0.0, 0.2),
    "learning_rate": ("log_uniform", 1e-5, 5e-4),
}
```

But this space has hidden constraints. The hidden dimension must usually be divisible by the number of attention heads:

$$d_{\text{model}} \bmod n_{\text{heads}} = 0.$$
In code:

```python
def valid_transformer_config(config):
    return config["hidden_dim"] % config["num_heads"] == 0
```

This constraint exists because each attention head receives a subspace of size

$$d_{\text{head}} = \frac{d_{\text{model}}}{n_{\text{heads}}}.$$

If $d_{\text{model}}$ cannot be divided evenly by $n_{\text{heads}}$, the model shape becomes invalid.
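A quick enumeration over the hidden dimensions and head counts in the space above shows how the divisibility constraint prunes the grid:

```python
# Enumerate which (hidden_dim, num_heads) pairs satisfy the constraint
hidden_dims = [512, 768, 1024]
num_heads = [8, 12, 16]

valid = [(d, h) for d in hidden_dims for h in num_heads if d % h == 0]
print(valid)
# 7 of the 9 pairs are valid; (512, 12) and (1024, 12) are not
```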
Budget-Aware Search Spaces
A search space should reflect the available budget. There are several kinds of budget:
| Budget type | Constraint |
|---|---|
| Compute | GPU-hours, accelerator type, parallel workers |
| Memory | Maximum batch size, sequence length, model size |
| Time | Wall-clock deadline |
| Data | Number of examples, labeling cost |
| Latency | Inference speed requirement |
| Storage | Checkpoints, logs, generated artifacts |
For example, if the model must run on a mobile device, the search space should include model size and latency constraints. A configuration with the best validation accuracy may be unusable if it exceeds the deployment budget.
If the training budget is small, the search space should focus on the most important variables: learning rate, weight decay, batch size, model width, and dropout. Searching too many dimensions with too few trials usually gives noisy conclusions.
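As a sketch of a budget constraint, a parameter-count filter for the MLP defined earlier could look like the following; the 784/10 input and output dimensions and the five-million-parameter cap are illustrative assumptions:

```python
def mlp_param_count(input_dim, hidden_dim, output_dim, num_layers):
    # Weights plus biases for each Linear layer, mirroring the MLP class above.
    count = 0
    dim = input_dim
    for _ in range(num_layers):
        count += dim * hidden_dim + hidden_dim
        dim = hidden_dim
    count += dim * output_dim + output_dim
    return count

def within_budget(config, max_params=5_000_000):
    # Reject configurations whose parameter count exceeds the budget.
    n = mlp_param_count(784, config["hidden_dim"], 10, config["num_layers"])
    return n <= max_params

print(mlp_param_count(784, 512, 10, 4))  # 1195018 parameters
```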
Practical Rules for Search Spaces
A good search space encodes prior knowledge. It avoids values that are clearly unreasonable while retaining enough freedom to find unexpected improvements.
For deep learning, these rules are often useful:
| Hyperparameter | Practical search rule |
|---|---|
| Learning rate | Search logarithmically |
| Weight decay | Search logarithmically |
| Dropout | Search linearly |
| Batch size | Search powers of two or hardware-friendly values |
| Hidden dimension | Search hardware-friendly multiples |
| Number of layers | Search small discrete sets |
| Optimizer | Compare a few strong defaults |
| Schedule | Start with simple schedules before complex ones |
The search space should also include a baseline configuration. Without a baseline, it becomes difficult to know whether the search has helped.
A baseline can be represented as:
```python
baseline_config = {
    "learning_rate": 3e-4,
    "batch_size": 128,
    "hidden_dim": 512,
    "num_layers": 4,
    "dropout": 0.1,
    "optimizer": "AdamW",
    "weight_decay": 1e-2,
}
```

The baseline is trained first. Search results are then compared against it.
Common Mistakes
A common mistake is defining a search space that is too large. If we vary every possible choice, the number of trials needed becomes impractical. Most trials will be wasted on poor regions.
Another mistake is using a linear scale for parameters whose useful values span orders of magnitude. Learning rate and weight decay should almost always be sampled logarithmically.
A third mistake is ignoring conditional structure. If momentum is sampled even when the optimizer is AdamW, the search algorithm receives irrelevant variables. This can reduce search efficiency.
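One way to respect conditional structure is to branch during sampling so that inactive variables are never drawn. A minimal self-contained sketch (the momentum range is an assumption):

```python
import random

def sample_conditional():
    # Draw the optimizer first, then only the hyperparameters it activates.
    config = {"optimizer": random.choice(["SGD", "Adam", "AdamW"])}
    config["learning_rate"] = 10 ** random.uniform(-5, -1)
    if config["optimizer"] == "SGD":
        # Momentum is only meaningful for SGD (hypothetical range).
        config["momentum"] = random.uniform(0.8, 0.99)
    return config
```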
A fourth mistake is treating failed trials as missing data without explanation. Failed trials often reveal important information. They may indicate memory limits, unstable training, invalid shapes, or bad learning rates.
A fifth mistake is choosing the search space after repeatedly looking at the test set. Hyperparameter search should use validation data. The test set should remain reserved for final evaluation.
Summary
A search space is the set of hyperparameter configurations considered during hyperparameter optimization. It defines which values may vary, how they are sampled, and which combinations are valid.
In deep learning, important search dimensions include learning rate, batch size, optimizer, weight decay, model depth, model width, dropout, normalization, data augmentation, and learning rate schedule.
Good search spaces use logarithmic scales for scale-sensitive quantities, encode constraints explicitly, respect compute and memory budgets, and separate configuration from training code. The quality of the search space often matters more than the choice of search algorithm.