# Grid Search

Grid search is one of the simplest methods for hyperparameter optimization. The idea is straightforward: define a finite set of candidate values for each hyperparameter, construct every possible combination, train a model for each configuration, and select the configuration with the best validation performance.

Although modern deep learning systems often use more advanced methods, grid search remains important because it is easy to implement, easy to reason about, reproducible, and useful for small search spaces.

### The Basic Idea

Suppose we want to tune two hyperparameters:

$$
\eta \in \{10^{-4},10^{-3},10^{-2}\},
$$

$$
B \in \{32,64,128\},
$$

where $\eta$ is the learning rate and $B$ is the batch size.

Grid search evaluates every pair:

| Learning rate | Batch size |
|---|---:|
| $10^{-4}$ | 32 |
| $10^{-4}$ | 64 |
| $10^{-4}$ | 128 |
| $10^{-3}$ | 32 |
| $10^{-3}$ | 64 |
| $10^{-3}$ | 128 |
| $10^{-2}$ | 32 |
| $10^{-2}$ | 64 |
| $10^{-2}$ | 128 |

This produces

$$
3\times 3 = 9
$$

training runs.

For each configuration, we train a model and compute a validation metric such as accuracy or loss. The best-performing configuration is selected.

Formally, if the search space is

$$
\Lambda =
\Lambda_1 \times \Lambda_2 \times \cdots \times \Lambda_n,
$$

then grid search evaluates every point in the Cartesian product.
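In this notation, the selected configuration is

$$
\lambda^{*} =
\arg\max_{\lambda \in \Lambda}
\mathrm{score}_{\mathrm{val}}(\lambda),
$$

(or an $\arg\min$ when the metric is a loss), where $\mathrm{score}_{\mathrm{val}}$ denotes the validation metric used to compare runs.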

### Why Grid Search Works

Grid search works because it guarantees systematic coverage of the specified search space. Every candidate value is evaluated.

This property makes grid search deterministic and reproducible. If the same grid and random seeds are used, the same configurations will always be explored.

Grid search is especially useful when:

| Situation | Reason |
|---|---|
| Small search space | Exhaustive coverage is affordable |
| Few hyperparameters | Number of combinations remains manageable |
| Expensive failures | Deterministic exploration is easier to debug |
| Baseline experiments | Easy comparison across runs |
| Educational settings | Clear and interpretable behavior |

For example, if we only want to compare:

- SGD versus AdamW  
- three learning rates  
- two dropout values  

then grid search gives complete coverage with only

$$
2\times 3\times 2 = 12
$$

runs.

### A Simple PyTorch Example

Suppose we want to tune:

- learning rate  
- hidden dimension  
- dropout rate  

We first define the candidate values:

```python
search_grid = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "hidden_dim": [128, 256, 512],
    "dropout": [0.0, 0.1, 0.3],
}
```

The Cartesian product can be generated using `itertools.product`:

```python
from itertools import product

keys = search_grid.keys()
values = search_grid.values()

configs = [
    dict(zip(keys, v))
    for v in product(*values)
]
```

The number of configurations is:

```python
print(len(configs))
```

Output:

```
27
```

Each configuration is then evaluated:

```python
import torch

best_config = None
best_score = float("-inf")

# build_model and train_and_evaluate are assumed helpers defined elsewhere
for config in configs:

    model = build_model(
        hidden_dim=config["hidden_dim"],
        dropout=config["dropout"],
    )

    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=config["learning_rate"],
    )

    val_accuracy = train_and_evaluate(
        model=model,
        optimizer=optimizer,
    )

    if val_accuracy > best_score:
        best_score = val_accuracy
        best_config = config
```

At the end:

```python
print(best_config)
print(best_score)
```

This is the core structure of grid search.

### Computational Cost

The major weakness of grid search is combinatorial growth.

If each hyperparameter has $k$ candidate values and there are $n$ hyperparameters, then the total number of configurations is

$$
k^n.
$$

This grows exponentially with the number of dimensions.

For example:

| Hyperparameters | Values each | Total configurations |
|---|---:|---:|
| 2 | 5 | 25 |
| 4 | 5 | 625 |
| 6 | 5 | 15,625 |
| 10 | 5 | 9,765,625 |
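The totals in the table above follow directly from $k^n$ and can be checked with a quick loop:

```python
# number of configurations for n hyperparameters with 5 candidate values each
for n in (2, 4, 6, 10):
    print(n, 5 ** n)
```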

Even moderate search spaces become impossible to explore exhaustively.

Suppose one training run takes 4 hours:

$$
15{,}625 \times 4 = 62{,}500
$$

GPU-hours are required.

Large deep learning models may require days or weeks per run, making exhaustive search impractical.

### Curse of Dimensionality

Grid search suffers from the curse of dimensionality. As the number of hyperparameters increases, most grid points become wasteful.

This problem becomes clearer when only a few hyperparameters strongly affect performance.

Suppose validation accuracy depends mostly on learning rate and weight decay, while other variables matter little. Grid search still allocates equal resolution to every dimension.

For example:

| Hyperparameter | Values |
|---|---|
| Learning rate | 5 |
| Weight decay | 5 |
| Batch size | 5 |
| Dropout | 5 |
| Hidden dimension | 5 |

Total runs:

$$
5^5 = 3125.
$$

But only the first two dimensions significantly matter. Most runs are redundant.

This inefficiency is one reason random search often outperforms grid search in high-dimensional spaces.
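One way to see the difference: a 5-by-5 grid tests only 5 distinct values of the important dimension, while random search draws a fresh value on every trial. A minimal sketch, with illustrative ranges:

```python
import random

rng = random.Random(0)

# 25 random trials: each draws its own learning rate and weight decay
trials = [
    {
        "learning_rate": 10 ** rng.uniform(-5, -1),
        "weight_decay": 10 ** rng.uniform(-6, -2),
    }
    for _ in range(25)
]

# 25 distinct learning rates explored, versus only 5 on a 5x5 grid
distinct_lrs = {t["learning_rate"] for t in trials}
print(len(distinct_lrs))
```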

### Resolution Problems

Grid search depends heavily on the chosen resolution.

Suppose we search learning rate using:

$$
\{10^{-4},10^{-3},10^{-2}\}.
$$

If the best value is actually $3\times10^{-4}$, the grid misses it entirely.

Increasing resolution improves coverage but increases cost:

$$
\{10^{-5},3\times10^{-5},10^{-4},3\times10^{-4},\ldots\}
$$

A fine grid rapidly becomes expensive.

This issue is especially severe for hyperparameters that vary over many orders of magnitude.

### Linear Versus Logarithmic Grids

Many hyperparameters should be searched logarithmically rather than linearly.

A linear grid:

$$
[0.0001,0.025,0.05,0.075,0.1]
$$

allocates almost all resolution to large values.

A logarithmic grid:

$$
[10^{-5},10^{-4},10^{-3},10^{-2},10^{-1}]
$$

allocates equal resolution per order of magnitude.

For learning rate and weight decay, logarithmic spacing is usually better.

In Python:

```python
import numpy as np

learning_rates = np.logspace(-5, -1, num=5)

print(learning_rates)
```

Output:

```
[1.e-05 1.e-04 1.e-03 1.e-02 1.e-01]
```

### Parallel Execution

One advantage of grid search is that every configuration is independent. Runs can therefore execute in parallel.

If we have 16 GPUs and 160 configurations, the full grid completes in

$$
160 / 16 = 10
$$

sequential waves of 16 runs each.

Because no communication or synchronization between runs is required, grid search is embarrassingly parallel.

In practice, experiment schedulers often distribute grid search jobs across clusters automatically.
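The pattern can be sketched locally with the standard library; here `evaluate` is a toy stand-in for a full training run, and on a real cluster a scheduler would play the role of the executor:

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate(config):
    # stand-in for a real training run; returns a toy score
    return -(config["learning_rate"] - 1e-3) ** 2

configs = [{"learning_rate": lr} for lr in (1e-4, 1e-3, 1e-2)]

# each configuration is independent, so all runs can execute concurrently
with ThreadPoolExecutor(max_workers=3) as pool:
    scores = list(pool.map(evaluate, configs))

best_idx = max(range(len(configs)), key=lambda i: scores[i])
print(configs[best_idx])  # {'learning_rate': 0.001}
```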

### Validation Metrics

Grid search requires a validation metric to compare configurations.

Common choices include:

| Task | Metric |
|---|---|
| Classification | Accuracy, F1 score |
| Regression | Mean squared error |
| Language modeling | Perplexity |
| Object detection | mAP |
| Segmentation | IoU |
| Retrieval | Recall@K |

The metric should align with the deployment objective.

For example, maximizing top-1 accuracy may not be appropriate when latency or calibration matters. In production systems, hyperparameter optimization may use a combined objective:

$$
J =
\text{accuracy} -
\lambda \cdot \text{latency}.
$$

This balances predictive quality against inference cost.
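A small numeric sketch of this objective, using hypothetical accuracy and latency figures and an illustrative weight $\lambda$:

```python
# hypothetical (accuracy, latency in ms) per candidate configuration
candidates = {
    "small": (0.84, 5.0),
    "medium": (0.87, 12.0),
    "large": (0.88, 40.0),
}

lam = 0.002  # illustrative trade-off weight

def combined_objective(accuracy, latency):
    # higher accuracy is rewarded, higher latency is penalized
    return accuracy - lam * latency

best = max(candidates, key=lambda name: combined_objective(*candidates[name]))
print(best)  # "medium": its accuracy edge over "small" outweighs its latency cost
```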

### Overfitting to the Validation Set

A large grid search can indirectly overfit to validation data.

Suppose we test thousands of configurations. Some may perform well purely by chance. Selecting the best configuration may therefore exploit random variation in the validation set.

This phenomenon is sometimes called hyperparameter overfitting.

To reduce this problem:

| Strategy | Purpose |
|---|---|
| Use sufficiently large validation sets | Reduce variance |
| Keep the test set untouched | Preserve unbiased evaluation |
| Repeat experiments across seeds | Measure stability |
| Use cross-validation for small datasets | Reduce sensitivity |

The test set should only be used after hyperparameter selection is complete.
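Repeating runs across seeds can be folded directly into the scoring step. A sketch with a stub in place of real training:

```python
import random
import statistics

def train_and_evaluate_stub(config, seed):
    # stand-in for a real run: a base score plus seed-dependent noise
    rng = random.Random(seed)
    return config["base_score"] + rng.gauss(0, 0.01)

def mean_val_score(config, seeds=(0, 1, 2)):
    # averaging across seeds reduces the chance of selecting a lucky outlier
    return statistics.mean(train_and_evaluate_stub(config, s) for s in seeds)

score = mean_val_score({"base_score": 0.85})
```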

### Early Stopping in Grid Search

Grid search becomes more efficient when poor runs are terminated early.

Suppose a configuration performs very poorly after a few epochs. Continuing training may waste computation.

A simple early stopping rule:

```python
# inside the training loop of a single configuration
if epoch >= 5 and val_accuracy < 0.5:
    break  # abandon this configuration early
```

This idea appears in more advanced methods such as Hyperband and successive halving.
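A toy sketch of the successive-halving pattern, assuming a `score_fn(config, budget)` that evaluates a configuration under a given training budget:

```python
def successive_halving(configs, score_fn, rounds=3):
    # evaluate all survivors on a small budget, keep the top half,
    # double the budget, and repeat
    survivors = list(configs)
    budget = 1
    for _ in range(rounds):
        survivors.sort(key=lambda c: score_fn(c, budget), reverse=True)
        survivors = survivors[: max(1, len(survivors) // 2)]
        budget *= 2
    return survivors[0]

# toy score: higher "quality" configurations improve more with budget
configs = [{"quality": q} for q in (0.2, 0.5, 0.8, 0.9)]
winner = successive_halving(configs, lambda c, b: c["quality"] * b)
print(winner)  # {'quality': 0.9}
```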

### Visualizing Grid Search Results

Grid search results are often visualized as heatmaps.

Suppose we vary:

- learning rate  
- weight decay  

We can create a matrix:

| | WD $10^{-5}$ | WD $10^{-4}$ | WD $10^{-3}$ |
|---|---:|---:|---:|
| LR $10^{-4}$ | 81.2 | 82.1 | 80.4 |
| LR $10^{-3}$ | 85.6 | 87.4 | 84.8 |
| LR $10^{-2}$ | 70.1 | 68.4 | 61.3 |

This reveals patterns:

- very large learning rates destabilize training  
- moderate weight decay improves generalization  
- the optimum lies near $10^{-3}$ learning rate  

Visualization helps interpret interactions between hyperparameters.
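The accuracies in the table above can be placed in a matrix and the best cell located programmatically, which is also the natural input format for a heatmap plot:

```python
import numpy as np

# validation accuracies from the table (rows: learning rate, cols: weight decay)
acc = np.array([
    [81.2, 82.1, 80.4],   # LR 1e-4
    [85.6, 87.4, 84.8],   # LR 1e-3
    [70.1, 68.4, 61.3],   # LR 1e-2
])

lrs = [1e-4, 1e-3, 1e-2]
wds = [1e-5, 1e-4, 1e-3]

# locate the best (learning rate, weight decay) cell
i, j = np.unravel_index(np.argmax(acc), acc.shape)
print(lrs[i], wds[j], acc[i, j])  # 0.001 0.0001 87.4
```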

### Grid Search in Practice

Modern deep learning rarely uses pure exhaustive grid search for large models. However, grid search still appears in several situations:

| Use case | Reason |
|---|---|
| Small datasets | Training is cheap |
| Baseline tuning | Easy interpretation |
| Reproducible benchmarks | Deterministic coverage |
| Educational experiments | Simple implementation |
| Small discrete spaces | Exhaustive search feasible |

For example, transformer research papers often use small grids for:

- learning rate  
- weight decay  
- warmup ratio  
- dropout rate  

while using more advanced methods for large architecture searches.

### Advantages and Disadvantages

| Advantages | Disadvantages |
|---|---|
| Simple to implement | Exponential cost growth |
| Deterministic | Inefficient in high dimensions |
| Easy to parallelize | Wastes trials on unimportant dimensions |
| Reproducible | Poor resolution between grid points |
| Easy to debug | Expensive for large models |

Grid search is therefore best viewed as a baseline method rather than a universal solution.

### Summary

Grid search evaluates every configuration in a predefined Cartesian product of hyperparameter values. It is simple, reproducible, easy to parallelize, and useful for small search spaces.

Its main weakness is exponential growth in the number of configurations. As dimensionality increases, most evaluations become redundant or wasteful. Resolution also becomes problematic because useful values may lie between grid points.

Despite these limitations, grid search remains important as a baseline method, a debugging tool, and a practical solution for small-scale deep learning experiments.

