Grid Search

Grid search is one of the simplest methods for hyperparameter optimization. The idea is straightforward: define a finite set of candidate values for each hyperparameter, construct every possible combination, train a model for each configuration, and select the configuration with the best validation performance.

Although modern deep learning systems often use more advanced methods, grid search remains important because it is easy to implement, easy to reason about, reproducible, and useful for small search spaces.

The Basic Idea

Suppose we want to tune two hyperparameters:

\eta \in \{10^{-4}, 10^{-3}, 10^{-2}\}, \qquad B \in \{32, 64, 128\},

where η is the learning rate and B is the batch size.

Grid search evaluates every pair:

Learning rate    Batch size
10^{-4}          32
10^{-4}          64
10^{-4}          128
10^{-3}          32
10^{-3}          64
10^{-3}          128
10^{-2}          32
10^{-2}          64
10^{-2}          128

This produces

3 \times 3 = 9

training runs.

For each configuration, we train a model and compute a validation metric such as accuracy or loss. The best-performing configuration is selected.

Formally, if the search space is

\Lambda = \Lambda_1 \times \Lambda_2 \times \cdots \times \Lambda_n,

then grid search evaluates every point in the Cartesian product.
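
As a minimal sketch, the nine (learning rate, batch size) pairs above can be enumerated with itertools.product:

from itertools import product

learning_rates = [1e-4, 1e-3, 1e-2]
batch_sizes = [32, 64, 128]

# every point in the Cartesian product of the two candidate sets
grid = list(product(learning_rates, batch_sizes))
print(len(grid))  # 9 configurations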

Why Grid Search Works

Grid search works because it guarantees systematic coverage of the specified search space. Every candidate value is evaluated.

This property makes grid search deterministic and reproducible. If the same grid and random seeds are used, the same configurations will always be explored.

Grid search is especially useful when:

Situation               Reason
Small search space      Exhaustive coverage is affordable
Few hyperparameters     Number of combinations remains manageable
Expensive failures      Deterministic exploration is easier to debug
Baseline experiments    Easy comparison across runs
Educational settings    Clear and interpretable behavior

For example, if we only want to compare:

  • SGD versus AdamW
  • three learning rates
  • two dropout values

then grid search gives complete coverage with only

2 \times 3 \times 2 = 12

runs.

A Simple PyTorch Example

Suppose we want to tune:

  • learning rate
  • hidden dimension
  • dropout rate

We first define the candidate values:

search_grid = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "hidden_dim": [128, 256, 512],
    "dropout": [0.0, 0.1, 0.3],
}

The Cartesian product can be generated using itertools.product:

from itertools import product

keys = search_grid.keys()
values = search_grid.values()

configs = [
    dict(zip(keys, v))
    for v in product(*values)
]
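
Each element of configs is a plain dictionary, one per grid point. Given the key order above, the first entry is:

print(configs[0])

Output:

{'learning_rate': 0.0001, 'hidden_dim': 128, 'dropout': 0.0}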

The number of configurations is:

print(len(configs))

Output:

27

Each configuration is then evaluated:

import torch

# build_model and train_and_evaluate are assumed project-specific helpers
best_config = None
best_score = float("-inf")

for config in configs:

    model = build_model(
        hidden_dim=config["hidden_dim"],
        dropout=config["dropout"],
    )

    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=config["learning_rate"],
    )

    val_accuracy = train_and_evaluate(
        model=model,
        optimizer=optimizer,
    )

    if val_accuracy > best_score:
        best_score = val_accuracy
        best_config = config

At the end:

print(best_config)
print(best_score)

This is the core structure of grid search.
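
A small variant of this loop, sketched here, records every result instead of only the best one; the full record is useful for the visualizations discussed later (it reuses the same assumed build_model and train_and_evaluate helpers):

results = []

for config in configs:
    model = build_model(
        hidden_dim=config["hidden_dim"],
        dropout=config["dropout"],
    )
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=config["learning_rate"],
    )
    val_accuracy = train_and_evaluate(model=model, optimizer=optimizer)
    # keep every configuration and its score, not just the running best
    results.append({**config, "val_accuracy": val_accuracy})

# best configuration first
results.sort(key=lambda r: r["val_accuracy"], reverse=True)
print(results[0])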

Computational Cost

The major weakness of grid search is combinatorial growth.

If each hyperparameter has k candidate values and there are n hyperparameters, then the total number of configurations is

k^n.

This grows exponentially with the number of dimensions.

For example:

Hyperparameters    Values each    Total configurations
2                  5              25
4                  5              625
6                  5              15,625
10                 5              9,765,625

Even moderately sized search spaces quickly become impractical to explore exhaustively.

Suppose one training run takes 4 hours:

15,625 \times 4 = 62,500

GPU-hours are required.
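
A quick back-of-the-envelope sketch of this cost:

values_per_hparam = 5
num_hparams = 6
hours_per_run = 4

total_runs = values_per_hparam ** num_hparams  # 15,625 configurations
gpu_hours = total_runs * hours_per_run         # 62,500 GPU-hours
print(total_runs, gpu_hours)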

Large deep learning models may require days or weeks per run, making exhaustive search impractical.

Curse of Dimensionality

Grid search suffers from the curse of dimensionality. As the number of hyperparameters increases, most grid points become wasteful.

This problem becomes clearer when only a few hyperparameters strongly affect performance.

Suppose validation accuracy depends mostly on learning rate and weight decay, while other variables matter little. Grid search still allocates equal resolution to every dimension.

For example:

Hyperparameter      Values
Learning rate       5
Weight decay        5
Batch size          5
Dropout             5
Hidden dimension    5

Total runs:

5^5 = 3125.

But only the first two dimensions significantly matter. Most runs are redundant.

This inefficiency is one reason random search often outperforms grid search in high-dimensional spaces.
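
For contrast, a minimal sketch of random search over the two important hyperparameters, sampling log-uniformly so that every trial probes a fresh value of each dimension (the ranges here are only illustrative):

import numpy as np

rng = np.random.default_rng(seed=0)

# each trial draws new values instead of reusing the same few grid points
random_configs = [
    {
        "learning_rate": 10 ** rng.uniform(-5, -1),
        "weight_decay": 10 ** rng.uniform(-6, -2),
    }
    for _ in range(20)
]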

Resolution Problems

Grid search depends heavily on the chosen resolution.

Suppose we search learning rate using:

\{10^{-4}, 10^{-3}, 10^{-2}\}.

If the best value is actually 3×10^{-4}, the grid misses it entirely.

Increasing resolution improves coverage but increases cost:

\{10^{-5}, 3\times10^{-5}, 10^{-4}, 3\times10^{-4}, \ldots\}

A fine grid rapidly becomes expensive.

This issue is especially severe for hyperparameters that vary over many orders of magnitude.

Linear Versus Logarithmic Grids

Many hyperparameters should be searched logarithmically rather than linearly.

A linear grid:

[0.0001, 0.025, 0.05, 0.075, 0.1]

allocates almost all resolution to large values.

A logarithmic grid:

[10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}]

allocates equal resolution per order of magnitude.

For learning rate and weight decay, logarithmic spacing is usually better.

In Python:

import numpy as np

learning_rates = np.logspace(-5, -1, num=5)

print(learning_rates)

Output:

[1.e-05 1.e-04 1.e-03 1.e-02 1.e-01]
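
If finer resolution is needed, for example to capture intermediate values such as 3×10^{-4} from the resolution example above, the same call can generate a denser logarithmic grid (a sketch with two points per decade):

fine_learning_rates = np.logspace(-5, -1, num=9)

print(fine_learning_rates)

This spacing places a point near 3.16×10^{-k} in every decade, so candidates close to 3×10^{-4} are no longer missed.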

Parallel Execution

One advantage of grid search is that every configuration is independent. Runs can therefore execute in parallel.

If we have 16 GPUs and 160 configurations, the full grid can be evaluated in

160 / 16 = 10

waves of experiments.

This parallelism is called embarrassingly parallel computation because no communication between runs is required.
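
A minimal sketch of this idea using Python's standard library, reusing configs and the hypothetical build_model and train_and_evaluate helpers from the example above (a real setup would also pin each worker to its own GPU):

from concurrent.futures import ProcessPoolExecutor

import torch

def evaluate_config(config):
    # train one grid point and return its validation accuracy
    model = build_model(
        hidden_dim=config["hidden_dim"],
        dropout=config["dropout"],
    )
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=config["learning_rate"],
    )
    return train_and_evaluate(model=model, optimizer=optimizer)

if __name__ == "__main__":
    # grid points are independent, so they can be evaluated concurrently
    with ProcessPoolExecutor(max_workers=16) as pool:
        scores = list(pool.map(evaluate_config, configs))

    best_index = max(range(len(scores)), key=lambda i: scores[i])
    best_config = configs[best_index]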

In practice, experiment schedulers often distribute grid search jobs across clusters automatically.

Validation Metrics

Grid search requires a validation metric to compare configurations.

Common choices include:

Task                 Metric
Classification       Accuracy, F1 score
Regression           Mean squared error
Language modeling    Perplexity
Object detection     mAP
Segmentation         IoU
Retrieval            Recall@K

The metric should align with the deployment objective.

For example, maximizing top-1 accuracy may not be appropriate when latency or calibration matters. In production systems, hyperparameter optimization may use a combined objective:

J = \text{accuracy} - \lambda \cdot \text{latency}.

This balances predictive quality against inference cost.
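
A small illustration of such a combined objective (the weight lambda here is arbitrary):

def combined_objective(val_accuracy, latency_ms, lam=0.001):
    # trade predictive quality off against inference cost
    return val_accuracy - lam * latency_ms

print(combined_objective(0.874, 35.0))  # roughly 0.839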

Overfitting to the Validation Set

A large grid search can indirectly overfit to validation data.

Suppose we test thousands of configurations. Some may perform well purely by chance. Selecting the best configuration may therefore exploit random variation in the validation set.

This phenomenon is sometimes called hyperparameter overfitting.

To reduce this problem:

Strategy                                   Purpose
Use sufficiently large validation sets     Reduce variance
Keep the test set untouched                Preserve unbiased evaluation
Repeat experiments across seeds            Measure stability
Use cross-validation for small datasets    Reduce sensitivity

The test set should only be used after hyperparameter selection is complete.
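
One of these strategies, repeating runs across seeds, can be sketched as follows; train_and_evaluate_seeded is a hypothetical helper that fixes the random seed before training a given configuration:

import statistics

seeds = [0, 1, 2]

def evaluate_across_seeds(config):
    # compare configurations by mean validation accuracy, and report
    # the spread across seeds as a stability check
    scores = [train_and_evaluate_seeded(config, seed=s) for s in seeds]
    return statistics.mean(scores), statistics.stdev(scores)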

Early Stopping in Grid Search

Grid search becomes more efficient when poor runs are terminated early.

Suppose a configuration performs very poorly after a few epochs. Continuing training may waste computation.

A simple early stopping rule:

# inside the training loop for one configuration
if epoch >= 5 and val_accuracy < 0.5:
    break  # abandon this clearly failing run early

This idea appears in more advanced methods such as Hyperband and successive halving.

Visualizing Grid Search Results

Grid search results are often visualized as heatmaps.

Suppose we vary:

  • learning rate
  • weight decay

We can create a matrix:

              WD 10^{-5}    WD 10^{-4}    WD 10^{-3}
LR 10^{-4}    81.2          82.1          80.4
LR 10^{-3}    85.6          87.4          84.8
LR 10^{-2}    70.1          68.4          61.3

This reveals patterns:

  • very large learning rates destabilize training
  • moderate weight decay improves generalization
  • the optimum lies near a learning rate of 10^{-3}

Visualization helps interpret interactions between hyperparameters.
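
A minimal matplotlib sketch that draws such a heatmap from the values in the table above:

import matplotlib.pyplot as plt
import numpy as np

# validation accuracies from the table above
# (rows: learning rate, columns: weight decay)
accuracy = np.array([
    [81.2, 82.1, 80.4],
    [85.6, 87.4, 84.8],
    [70.1, 68.4, 61.3],
])
lr_labels = ["1e-4", "1e-3", "1e-2"]
wd_labels = ["1e-5", "1e-4", "1e-3"]

fig, ax = plt.subplots()
im = ax.imshow(accuracy)
ax.set_xticks(range(len(wd_labels)))
ax.set_xticklabels(wd_labels)
ax.set_yticks(range(len(lr_labels)))
ax.set_yticklabels(lr_labels)
ax.set_xlabel("weight decay")
ax.set_ylabel("learning rate")
fig.colorbar(im, ax=ax, label="validation accuracy")
plt.show()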

Grid Search in Practice

Modern deep learning rarely uses pure exhaustive grid search for large models. However, grid search still appears in several situations:

Use case                   Reason
Small datasets             Training is cheap
Baseline tuning            Easy interpretation
Reproducible benchmarks    Deterministic coverage
Educational experiments    Simple implementation
Small discrete spaces      Exhaustive search feasible

For example, transformer research papers often use small grids for:

  • learning rate
  • weight decay
  • warmup ratio
  • dropout rate

while using more advanced methods for large architecture searches.

Advantages and Disadvantages

Advantages             Disadvantages
Simple to implement    Exponential cost growth
Deterministic          Inefficient in high dimensions
Easy to parallelize    Wastes trials on unimportant dimensions
Reproducible           Poor resolution between grid points
Easy to debug          Expensive for large models

Grid search is therefore best viewed as a baseline method rather than a universal solution.

Summary

Grid search evaluates every configuration in a predefined Cartesian product of hyperparameter values. It is simple, reproducible, easy to parallelize, and useful for small search spaces.

Its main weakness is exponential growth in the number of configurations. As dimensionality increases, most evaluations become redundant or wasteful. Resolution also becomes problematic because useful values may lie between grid points.

Despite these limitations, grid search remains important as a baseline method, a debugging tool, and a practical solution for small-scale deep learning experiments.