Grid Search

Grid search is one of the simplest methods for hyperparameter optimization. The idea is straightforward: define a finite set of candidate values for each hyperparameter, construct every possible combination, train a model for each configuration, and select the configuration with the best validation performance.

Although modern deep learning systems often use more advanced methods, grid search remains important because it is easy to implement, easy to reason about, reproducible, and useful for small search spaces.

The Basic Idea

Suppose we want to tune two hyperparameters:

\eta \in \{10^{-4}, 10^{-3}, 10^{-2}\}, \qquad B \in \{32, 64, 128\},

where η is the learning rate and B is the batch size.

Grid search evaluates every pair:

Learning rate    Batch size
10^{-4}          32
10^{-4}          64
10^{-4}          128
10^{-3}          32
10^{-3}          64
10^{-3}          128
10^{-2}          32
10^{-2}          64
10^{-2}          128

This produces

3 \times 3 = 9

training runs.

For each configuration, we train a model and compute a validation metric such as accuracy or loss. The best-performing configuration is selected.

Formally, if the search space is

\Lambda = \Lambda_1 \times \Lambda_2 \times \cdots \times \Lambda_n,

then grid search evaluates every point in the Cartesian product.
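
As a minimal sketch, the nine (learning rate, batch size) pairs above can be enumerated with itertools.product:

from itertools import product

learning_rates = [1e-4, 1e-3, 1e-2]
batch_sizes = [32, 64, 128]

# every point in the Cartesian product of the two candidate sets
grid = list(product(learning_rates, batch_sizes))
print(len(grid))  # 9 configurations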

Why Grid Search Works

Grid search works because it guarantees systematic coverage of the specified search space. Every candidate value is evaluated.

This property makes grid search deterministic and reproducible. If the same grid and random seeds are used, the same configurations will always be explored.

Grid search is especially useful when:

Situation               Reason
Small search space      Exhaustive coverage is affordable
Few hyperparameters     Number of combinations remains manageable
Expensive failures      Deterministic exploration is easier to debug
Baseline experiments    Easy comparison across runs
Educational settings    Clear and interpretable behavior

For example, if we only want to compare:

  • SGD versus AdamW
  • three learning rates
  • two dropout values

then grid search gives complete coverage with only

2 \times 3 \times 2 = 12

runs.

A Simple PyTorch Example

Suppose we want to tune:

  • learning rate
  • hidden dimension
  • dropout rate

We first define the candidate values:

search_grid = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "hidden_dim": [128, 256, 512],
    "dropout": [0.0, 0.1, 0.3],
}

The Cartesian product can be generated using itertools.product:

from itertools import product

keys = search_grid.keys()
values = search_grid.values()

configs = [
    dict(zip(keys, v))
    for v in product(*values)
]
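
Each element of configs is a plain dictionary, one per grid point. Given the key order above, the first entry is:

print(configs[0])

Output:

{'learning_rate': 0.0001, 'hidden_dim': 128, 'dropout': 0.0}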

The number of configurations is:

print(len(configs))

Output:

27

Each configuration is then evaluated:

import torch

# build_model and train_and_evaluate are assumed project-specific helpers
best_config = None
best_score = float("-inf")

for config in configs:

    model = build_model(
        hidden_dim=config["hidden_dim"],
        dropout=config["dropout"],
    )

    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=config["learning_rate"],
    )

    val_accuracy = train_and_evaluate(
        model=model,
        optimizer=optimizer,
    )

    if val_accuracy > best_score:
        best_score = val_accuracy
        best_config = config

At the end:

print(best_config)
print(best_score)

This is the core structure of grid search.
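
A small variant of this loop, sketched here, records every result instead of only the best one; the full record is useful for the visualizations discussed later (it reuses the same assumed build_model and train_and_evaluate helpers):

results = []

for config in configs:
    model = build_model(
        hidden_dim=config["hidden_dim"],
        dropout=config["dropout"],
    )
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=config["learning_rate"],
    )
    val_accuracy = train_and_evaluate(model=model, optimizer=optimizer)
    # keep every configuration and its score, not just the running best
    results.append({**config, "val_accuracy": val_accuracy})

# best configuration first
results.sort(key=lambda r: r["val_accuracy"], reverse=True)
print(results[0])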

Computational Cost

The major weakness of grid search is combinatorial growth.

If each hyperparameter has k candidate values and there are n hyperparameters, then the total number of configurations is

k^n.

This grows exponentially with the number of dimensions.

For example:

Hyperparameters    Values each    Total configurations
2                  5              25
4                  5              625
6                  5              15,625
10                 5              9,765,625

Even moderately sized search spaces quickly become impractical to explore exhaustively.

Suppose one training run takes 4 hours:

15,625 \times 4 = 62,500

GPU-hours are required.
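
A quick back-of-the-envelope sketch of this cost:

values_per_hparam = 5
num_hparams = 6
hours_per_run = 4

total_runs = values_per_hparam ** num_hparams  # 15,625 configurations
gpu_hours = total_runs * hours_per_run         # 62,500 GPU-hours
print(total_runs, gpu_hours)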

Large deep learning models may require days or weeks per run, making exhaustive search impractical.

Curse of Dimensionality

Grid search suffers from the curse of dimensionality. As the number of hyperparameters increases, most grid points become wasteful.

This problem becomes clearer when only a few hyperparameters strongly affect performance.

Suppose validation accuracy depends mostly on learning rate and weight decay, while other variables matter little. Grid search still allocates equal resolution to every dimension.

For example:

Hyperparameter      Values
Learning rate       5
Weight decay        5
Batch size          5
Dropout             5
Hidden dimension    5

Total runs:

5^5 = 3125.

But only the first two dimensions significantly matter. Most runs are redundant.

This inefficiency is one reason random search often outperforms grid search in high-dimensional spaces.
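
For contrast, a minimal sketch of random search over the two important hyperparameters, sampling log-uniformly so that every trial probes a fresh value of each dimension (the ranges here are only illustrative):

import numpy as np

rng = np.random.default_rng(seed=0)

# each trial draws new values instead of reusing the same few grid points
random_configs = [
    {
        "learning_rate": 10 ** rng.uniform(-5, -1),
        "weight_decay": 10 ** rng.uniform(-6, -2),
    }
    for _ in range(20)
]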

Resolution Problems

Grid search depends heavily on the chosen resolution.

Suppose we search learning rate using:

\{10^{-4}, 10^{-3}, 10^{-2}\}.

If the best value is actually 3×10^{-4}, the grid misses it entirely.

Increasing resolution improves coverage but increases cost:

\{10^{-5}, 3\times10^{-5}, 10^{-4}, 3\times10^{-4}, \ldots\}

A fine grid rapidly becomes expensive.

This issue is especially severe for hyperparameters that vary over many orders of magnitude.

Linear Versus Logarithmic Grids

Many hyperparameters should be searched logarithmically rather than linearly.

A linear grid:

[0.0001, 0.025, 0.05, 0.075, 0.1]

allocates almost all resolution to large values.

A logarithmic grid:

[10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}]

allocates equal resolution per order of magnitude.

For learning rate and weight decay, logarithmic spacing is usually better.

In Python:

import numpy as np

learning_rates = np.logspace(-5, -1, num=5)

print(learning_rates)

Output:

[1.e-05 1.e-04 1.e-03 1.e-02 1.e-01]
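
If finer resolution is needed, for example to capture intermediate values such as 3×10^{-4} from the resolution example above, the same call can generate a denser logarithmic grid (a sketch with two points per decade):

fine_learning_rates = np.logspace(-5, -1, num=9)

print(fine_learning_rates)

This spacing places a point near 3.16×10^{-k} in every decade, so candidates close to 3×10^{-4} are no longer missed.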

Parallel Execution

One advantage of grid search is that every configuration is independent. Runs can therefore execute in parallel.

If we have 16 GPUs and 160 configurations, the full grid can be evaluated in

160 / 16 = 10

waves of experiments.

This parallelism is called embarrassingly parallel computation because no communication between runs is required.
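
A minimal sketch of this idea using Python's standard library, reusing configs and the hypothetical build_model and train_and_evaluate helpers from the example above (a real setup would also pin each worker to its own GPU):

from concurrent.futures import ProcessPoolExecutor

import torch

def evaluate_config(config):
    # train one grid point and return its validation accuracy
    model = build_model(
        hidden_dim=config["hidden_dim"],
        dropout=config["dropout"],
    )
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=config["learning_rate"],
    )
    return train_and_evaluate(model=model, optimizer=optimizer)

if __name__ == "__main__":
    # grid points are independent, so they can be evaluated concurrently
    with ProcessPoolExecutor(max_workers=16) as pool:
        scores = list(pool.map(evaluate_config, configs))

    best_index = max(range(len(scores)), key=lambda i: scores[i])
    best_config = configs[best_index]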

In practice, experiment schedulers often distribute grid search jobs across clusters automatically.

Validation Metrics

Grid search requires a validation metric to compare configurations.

Common choices include:

Task                 Metric
Classification       Accuracy, F1 score
Regression           Mean squared error
Language modeling    Perplexity
Object detection     mAP
Segmentation         IoU
Retrieval            Recall@K

The metric should align with the deployment objective.

For example, maximizing top-1 accuracy may not be appropriate when latency or calibration matters. In production systems, hyperparameter optimization may use a combined objective:

J = \text{accuracy} - \lambda \cdot \text{latency}.

This balances predictive quality against inference cost.
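
A small illustration of such a combined objective (the weight lambda here is arbitrary):

def combined_objective(val_accuracy, latency_ms, lam=0.001):
    # trade predictive quality off against inference cost
    return val_accuracy - lam * latency_ms

print(combined_objective(0.874, 35.0))  # roughly 0.839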

Overfitting to the Validation Set

A large grid search can indirectly overfit to validation data.

Suppose we test thousands of configurations. Some may perform well purely by chance. Selecting the best configuration may therefore exploit random variation in the validation set.

This phenomenon is sometimes called hyperparameter overfitting.

To reduce this problem:

Strategy                                   Purpose
Use sufficiently large validation sets     Reduce variance
Keep the test set untouched                Preserve unbiased evaluation
Repeat experiments across seeds            Measure stability
Use cross-validation for small datasets    Reduce sensitivity

The test set should only be used after hyperparameter selection is complete.
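
One of these strategies, repeating runs across seeds, can be sketched as follows; train_and_evaluate_seeded is a hypothetical helper that fixes the random seed before training a given configuration:

import statistics

seeds = [0, 1, 2]

def evaluate_across_seeds(config):
    # compare configurations by mean validation accuracy, and report
    # the spread across seeds as a stability check
    scores = [train_and_evaluate_seeded(config, seed=s) for s in seeds]
    return statistics.mean(scores), statistics.stdev(scores)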

Early Stopping in Grid Search

Grid search becomes more efficient when poor runs are terminated early.

Suppose a configuration performs very poorly after a few epochs. Continuing training may waste computation.

A simple early stopping rule:

# inside the training loop for one configuration
if epoch >= 5 and val_accuracy < 0.5:
    break  # abandon this clearly failing run early

This idea appears in more advanced methods such as Hyperband and successive halving.

Visualizing Grid Search Results

Grid search results are often visualized as heatmaps.

Suppose we vary:

  • learning rate
  • weight decay

We can create a matrix:

              WD 10^{-5}    WD 10^{-4}    WD 10^{-3}
LR 10^{-4}    81.2          82.1          80.4
LR 10^{-3}    85.6          87.4          84.8
LR 10^{-2}    70.1          68.4          61.3

This reveals patterns:

  • very large learning rates destabilize training
  • moderate weight decay improves generalization
  • the optimum lies near a learning rate of 10^{-3}

Visualization helps interpret interactions between hyperparameters.
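
A minimal matplotlib sketch that draws such a heatmap from the values in the table above:

import matplotlib.pyplot as plt
import numpy as np

# validation accuracies from the table above
# (rows: learning rate, columns: weight decay)
accuracy = np.array([
    [81.2, 82.1, 80.4],
    [85.6, 87.4, 84.8],
    [70.1, 68.4, 61.3],
])
lr_labels = ["1e-4", "1e-3", "1e-2"]
wd_labels = ["1e-5", "1e-4", "1e-3"]

fig, ax = plt.subplots()
im = ax.imshow(accuracy)
ax.set_xticks(range(len(wd_labels)))
ax.set_xticklabels(wd_labels)
ax.set_yticks(range(len(lr_labels)))
ax.set_yticklabels(lr_labels)
ax.set_xlabel("weight decay")
ax.set_ylabel("learning rate")
fig.colorbar(im, ax=ax, label="validation accuracy")
plt.show()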

Grid Search in Practice

Modern deep learning rarely uses pure exhaustive grid search for large models. However, grid search still appears in several situations:

Use case                   Reason
Small datasets             Training is cheap
Baseline tuning            Easy interpretation
Reproducible benchmarks    Deterministic coverage
Educational experiments    Simple implementation
Small discrete spaces      Exhaustive search feasible

For example, transformer research papers often use small grids for:

  • learning rate
  • weight decay
  • warmup ratio
  • dropout rate

while using more advanced methods for large architecture searches.

Advantages and Disadvantages

Advantages             Disadvantages
Simple to implement    Exponential cost growth
Deterministic          Inefficient in high dimensions
Easy to parallelize    Wastes trials on unimportant dimensions
Reproducible           Poor resolution between grid points
Easy to debug          Expensive for large models

Grid search is therefore best viewed as a baseline method rather than a universal solution.

Summary

Grid search evaluates every configuration in a predefined Cartesian product of hyperparameter values. It is simple, reproducible, easy to parallelize, and useful for small search spaces.

Its main weakness is exponential growth in the number of configurations. As dimensionality increases, most evaluations become redundant or wasteful. Resolution also becomes problematic because useful values may lie between grid points.

Despite these limitations, grid search remains important as a baseline method, a debugging tool, and a practical solution for small-scale deep learning experiments.