Grid search is one of the simplest methods for hyperparameter optimization. The idea is straightforward: define a finite set of candidate values for each hyperparameter, construct every possible combination, train a model for each configuration, and select the configuration with the best validation performance.
Although modern deep learning systems often use more advanced methods, grid search remains important because it is easy to implement, easy to reason about, reproducible, and useful for small search spaces.
## The Basic Idea
Suppose we want to tune two hyperparameters:

$$\eta \in \{10^{-4}, 10^{-3}, 10^{-2}\}, \qquad B \in \{32, 64, 128\},$$

where $\eta$ is the learning rate and $B$ is the batch size.
Grid search evaluates every pair:
| Learning rate | Batch size |
|---|---|
| $10^{-4}$ | 32 |
| $10^{-4}$ | 64 |
| $10^{-4}$ | 128 |
| $10^{-3}$ | 32 |
| $10^{-3}$ | 64 |
| $10^{-3}$ | 128 |
| $10^{-2}$ | 32 |
| $10^{-2}$ | 64 |
| $10^{-2}$ | 128 |
This produces $3 \times 3 = 9$ training runs.
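As a minimal sketch, the nine pairs can be enumerated with two nested loops:

```python
# Candidate values for each hyperparameter (matching the grid above).
learning_rates = [1e-4, 1e-3, 1e-2]
batch_sizes = [32, 64, 128]

# Two nested loops enumerate every (learning rate, batch size) pair.
grid = [(lr, bs) for lr in learning_rates for bs in batch_sizes]
print(len(grid))  # 9
```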
For each configuration, we train a model and compute a validation metric such as accuracy or loss. The best-performing configuration is selected.
Formally, if hyperparameter $i$ has a candidate set $S_i$, the search space is

$$\mathcal{S} = S_1 \times S_2 \times \cdots \times S_k,$$

and grid search evaluates every point in this Cartesian product.
## Why Grid Search Works
Grid search works because it guarantees systematic coverage of the specified search space. Every candidate value is evaluated.
This property makes grid search deterministic and reproducible. If the same grid and random seeds are used, the same configurations will always be explored.
Grid search is especially useful when:
| Situation | Reason |
|---|---|
| Small search space | Exhaustive coverage is affordable |
| Few hyperparameters | Number of combinations remains manageable |
| Expensive failures | Deterministic exploration is easier to debug |
| Baseline experiments | Easy comparison across runs |
| Educational settings | Clear and interpretable behavior |
For example, if we only want to compare:
- SGD versus AdamW
- three learning rates
- two dropout values
then grid search gives complete coverage with only $2 \times 3 \times 2 = 12$ runs.
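As a sketch of this small grid (the specific learning rates and dropout values here are illustrative assumptions):

```python
from itertools import product

optimizers = ["sgd", "adamw"]        # SGD versus AdamW
learning_rates = [1e-4, 1e-3, 1e-2]  # three learning rates (assumed values)
dropouts = [0.0, 0.1]                # two dropout values (assumed values)

configs = list(product(optimizers, learning_rates, dropouts))
print(len(configs))  # 2 * 3 * 2 = 12
```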
## A Simple PyTorch Example
Suppose we want to tune:
- learning rate
- hidden dimension
- dropout rate
We first define the candidate values:

```python
search_grid = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "hidden_dim": [128, 256, 512],
    "dropout": [0.0, 0.1, 0.3],
}
```

The Cartesian product can be generated using `itertools.product`:
```python
from itertools import product

keys = search_grid.keys()
values = search_grid.values()
configs = [
    dict(zip(keys, v))
    for v in product(*values)
]
```

The number of configurations is:

```python
print(len(configs))
```

Output:

```
27
```

Each configuration is then evaluated:
```python
import torch

# build_model and train_and_evaluate are assumed to be defined
# elsewhere: they construct the network and return a validation score.
best_config = None
best_score = float("-inf")

for config in configs:
    model = build_model(
        hidden_dim=config["hidden_dim"],
        dropout=config["dropout"],
    )
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=config["learning_rate"],
    )
    val_accuracy = train_and_evaluate(
        model=model,
        optimizer=optimizer,
    )
    if val_accuracy > best_score:
        best_score = val_accuracy
        best_config = config
```

At the end:

```python
print(best_config)
print(best_score)
```

This is the core structure of grid search.
## Computational Cost
The major weakness of grid search is combinatorial growth.
If each hyperparameter has $v$ candidate values and there are $k$ hyperparameters, then the total number of configurations is

$$v^k.$$

This grows exponentially with the number of dimensions.
For example:
| Hyperparameters | Values each | Total configurations |
|---|---|---|
| 2 | 5 | 25 |
| 4 | 5 | 625 |
| 6 | 5 | 15,625 |
| 10 | 5 | 9,765,625 |
Even moderate search spaces become impossible to explore exhaustively.
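The totals in the table follow directly from $v^k$:

```python
# Total configurations = values_per_hyperparameter ** num_hyperparameters
for k in [2, 4, 6, 10]:
    print(f"{k} hyperparameters, 5 values each: {5 ** k} configurations")
```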
Suppose one training run takes 4 hours. The 625-configuration search above then requires

$$625 \times 4 = 2500$$

GPU-hours.
Large deep learning models may require days or weeks per run, making exhaustive search impractical.
## Curse of Dimensionality
Grid search suffers from the curse of dimensionality. As the number of hyperparameters increases, most grid points become wasteful.
This problem becomes clearer when only a few hyperparameters strongly affect performance.
Suppose validation accuracy depends mostly on learning rate and weight decay, while other variables matter little. Grid search still allocates equal resolution to every dimension.
For example:
| Hyperparameter | Values |
|---|---|
| Learning rate | 5 |
| Weight decay | 5 |
| Batch size | 5 |
| Dropout | 5 |
| Hidden dimension | 5 |
Total runs: $5^5 = 3125$.
But only the first two dimensions significantly matter. Most runs are redundant.
This inefficiency is one reason random search often outperforms grid search in high-dimensional spaces.
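A small simulation illustrates the point. Assume only the learning rate matters: a 25-run grid over two dimensions tries just 5 distinct learning rates, while 25 random runs each sample a fresh one.

```python
import random

random.seed(0)

# Grid: 5 learning rates x 5 values of an unimportant dimension.
lr_candidates = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
grid_lrs = {lr for lr in lr_candidates for _ in range(5)}

# Random search with the same 25-run budget samples a fresh
# learning rate (log-uniform) on every run.
random_lrs = {10 ** random.uniform(-5, -1) for _ in range(25)}

print(len(grid_lrs))    # 5 distinct learning rates tried
print(len(random_lrs))  # 25 distinct learning rates tried
```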
## Resolution Problems
Grid search depends heavily on the chosen resolution.
Suppose we search the learning rate over $\{10^{-4}, 10^{-3}, 10^{-2}\}$. If the best value is actually $3 \times 10^{-4}$, the grid misses it entirely.
Increasing resolution improves coverage but increases cost: doubling the number of values in every dimension multiplies the number of runs by $2^k$. A fine grid rapidly becomes expensive.
This issue is especially severe for hyperparameters that vary over many orders of magnitude.
## Linear Versus Logarithmic Grids
Many hyperparameters should be searched logarithmically rather than linearly.
A linear grid over $[10^{-5}, 10^{-1}]$, such as

$$\{10^{-5},\ 0.025,\ 0.05,\ 0.075,\ 0.1\},$$

allocates almost all resolution to large values. A logarithmic grid over the same range,

$$\{10^{-5},\ 10^{-4},\ 10^{-3},\ 10^{-2},\ 10^{-1}\},$$

allocates equal resolution per order of magnitude.
For learning rate and weight decay, logarithmic spacing is usually better.
In Python:

```python
import numpy as np

learning_rates = np.logspace(-5, -1, num=5)
print(learning_rates)
```

Output:

```
[1.e-05 1.e-04 1.e-03 1.e-02 1.e-01]
```

## Parallel Execution
One advantage of grid search is that every configuration is independent. Runs can therefore execute in parallel.
If we have 16 GPUs and 160 configurations, we can evaluate them in $160 / 16 = 10$ waves of experiments.
This parallelism is called embarrassingly parallel computation because no communication between runs is required.
In practice, experiment schedulers often distribute grid search jobs across clusters automatically.
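A sketch of this independence using Python's standard library; `evaluate` here is a toy stand-in for a full training run:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def evaluate(config):
    # Placeholder for a real training run that returns a validation
    # score; this toy score peaks near lr = 1e-3 with zero weight decay.
    lr, wd = config
    return -abs(lr - 1e-3) - wd

configs = list(product([1e-4, 1e-3, 1e-2], [0.0, 1e-4, 1e-2]))

# Every configuration is independent, so runs can execute concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(evaluate, configs))

best_score, best_config = max(zip(scores, configs))
print(best_config)
```

In a real system, each worker would be a separate GPU job rather than a thread, but the structure is identical.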
## Validation Metrics
Grid search requires a validation metric to compare configurations.
Common choices include:
| Task | Metric |
|---|---|
| Classification | Accuracy, F1 score |
| Regression | Mean squared error |
| Language modeling | Perplexity |
| Object detection | mAP |
| Segmentation | IoU |
| Retrieval | Recall@K |
The metric should align with the deployment objective.
For example, maximizing top-1 accuracy may not be appropriate when latency or calibration matters. In production systems, hyperparameter optimization may use a combined objective such as

$$\text{score} = \text{accuracy} - \lambda \cdot \text{latency},$$

where $\lambda$ weights the latency penalty. This balances predictive quality against inference cost.
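As an illustrative sketch (the linear form and the weight are assumptions, not a standard recipe):

```python
def combined_objective(accuracy, latency_ms, lam=0.01):
    # Penalize each millisecond of latency by lam accuracy points
    # (lam = 0.01 is an illustrative choice).
    return accuracy - lam * latency_ms

# A slightly less accurate but much faster model can win overall.
fast = combined_objective(accuracy=0.91, latency_ms=5.0)
slow = combined_objective(accuracy=0.93, latency_ms=20.0)
print(fast > slow)  # True
```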
## Overfitting to the Validation Set
A large grid search can indirectly overfit to validation data.
Suppose we test thousands of configurations. Some may perform well purely by chance. Selecting the best configuration may therefore exploit random variation in the validation set.
This phenomenon is sometimes called hyperparameter overfitting.
To reduce this problem:
| Strategy | Purpose |
|---|---|
| Use sufficiently large validation sets | Reduce variance |
| Keep the test set untouched | Preserve unbiased evaluation |
| Repeat experiments across seeds | Measure stability |
| Use cross-validation for small datasets | Reduce sensitivity |
The test set should only be used after hyperparameter selection is complete.
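A sketch of seed repetition; `evaluate_config` is a hypothetical stand-in for a full training run:

```python
import random
import statistics

def evaluate_config(config, seed):
    # Placeholder for a real training run: returns a noisy
    # validation accuracy that varies with the seed.
    rng = random.Random(seed)
    return 0.80 + 0.05 * rng.random()

def mean_score(config, seeds=(0, 1, 2)):
    # Rank configurations by their mean across seeds rather than
    # a single, possibly lucky, run.
    return statistics.mean(evaluate_config(config, s) for s in seeds)

score = mean_score({"learning_rate": 1e-3})
print(round(score, 3))
```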
## Early Stopping in Grid Search
Grid search becomes more efficient when poor runs are terminated early.
Suppose a configuration performs very poorly after a few epochs. Continuing training may waste computation.
A simple early stopping rule:

```python
if epoch >= 5 and val_accuracy < 0.5:
    stop_training()
```

This idea appears in more advanced methods such as Hyperband and successive halving.
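A sketch of how such a rule can sit inside a per-configuration training loop; `train_epoch` is a hypothetical callable that trains one epoch and returns the current validation accuracy:

```python
def run_config(train_epoch, max_epochs=20):
    # train_epoch(epoch) trains one epoch and returns validation accuracy.
    best = 0.0
    for epoch in range(max_epochs):
        val_accuracy = train_epoch(epoch)
        best = max(best, val_accuracy)
        # Abandon clearly hopeless configurations early.
        if epoch >= 5 and val_accuracy < 0.5:
            break
    return best

# Toy learning curve that plateaus at 0.42: the run stops after epoch 5.
print(run_config(lambda e: min(0.42, 0.1 * (e + 1))))  # 0.42
```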
## Visualizing Grid Search Results
Grid search results are often visualized as heatmaps.
Suppose we vary:
- learning rate
- weight decay
We can create a matrix:
| | WD $10^{-5}$ | WD $10^{-4}$ | WD $10^{-3}$ |
|---|---|---|---|
| LR $10^{-4}$ | 81.2 | 82.1 | 80.4 |
| LR $10^{-3}$ | 85.6 | 87.4 | 84.8 |
| LR $10^{-2}$ | 70.1 | 68.4 | 61.3 |
This reveals patterns:
- very large learning rates destabilize training
- moderate weight decay improves generalization
- the optimum lies near learning rate $10^{-3}$ with moderate weight decay
Visualization helps interpret interactions between hyperparameters.
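The accuracies in the table above can be rendered as a heatmap with matplotlib (the axis tick labels follow the grid used in the table):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

# Validation accuracies: rows are learning rates, columns weight decays.
acc = np.array([
    [81.2, 82.1, 80.4],
    [85.6, 87.4, 84.8],
    [70.1, 68.4, 61.3],
])
lrs = ["1e-4", "1e-3", "1e-2"]
wds = ["1e-5", "1e-4", "1e-3"]

fig, ax = plt.subplots()
im = ax.imshow(acc)
ax.set_xticks(range(len(wds)))
ax.set_xticklabels(wds)
ax.set_yticks(range(len(lrs)))
ax.set_yticklabels(lrs)
ax.set_xlabel("weight decay")
ax.set_ylabel("learning rate")
fig.colorbar(im, ax=ax, label="validation accuracy")
fig.savefig("grid_heatmap.png")
```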
## Grid Search in Practice
Modern deep learning rarely uses pure exhaustive grid search for large models. However, grid search still appears in several situations:
| Use case | Reason |
|---|---|
| Small datasets | Training is cheap |
| Baseline tuning | Easy interpretation |
| Reproducible benchmarks | Deterministic coverage |
| Educational experiments | Simple implementation |
| Small discrete spaces | Exhaustive search feasible |
For example, transformer research papers often use small grids for:
- learning rate
- weight decay
- warmup ratio
- dropout rate
while using more advanced methods for large architecture searches.
## Advantages and Disadvantages
| Advantages | Disadvantages |
|---|---|
| Simple to implement | Exponential cost growth |
| Deterministic | Inefficient in high dimensions |
| Easy to parallelize | Wastes trials on unimportant dimensions |
| Reproducible | Poor resolution between grid points |
| Easy to debug | Expensive for large models |
Grid search is therefore best viewed as a baseline method rather than a universal solution.
## Summary
Grid search evaluates every configuration in a predefined Cartesian product of hyperparameter values. It is simple, reproducible, easy to parallelize, and useful for small search spaces.
Its main weakness is exponential growth in the number of configurations. As dimensionality increases, most evaluations become redundant or wasteful. Resolution also becomes problematic because useful values may lie between grid points.
Despite these limitations, grid search remains important as a baseline method, a debugging tool, and a practical solution for small-scale deep learning experiments.