# Random Search

Random search is a hyperparameter optimization method that samples configurations at random from a search space. Instead of evaluating every point on a fixed grid, random search chooses a fixed number of trials and draws each trial independently.

This method is simple, but it is often more effective than grid search in deep learning. The reason is that only a small number of hyperparameters usually dominate performance. Random search spends more trials exploring different values of important dimensions instead of repeatedly evaluating unimportant combinations.

### The Basic Idea

Suppose we want to tune learning rate, batch size, hidden dimension, dropout, and optimizer. Grid search would require choosing a small finite set for each one and evaluating the full Cartesian product.
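To make the combinatorial cost concrete, here is a small counting sketch; the candidate values are illustrative, chosen only to show how quickly the grid grows:

```python
from itertools import product

# Four candidate values per hyperparameter is already a large grid.
grid = list(product(
    [1e-4, 3e-4, 1e-3, 3e-3],   # learning rate
    [32, 64, 128, 256],         # batch size
    [128, 256, 512, 1024],      # hidden dimension
    [0.0, 0.1, 0.3, 0.5],       # dropout
    ["SGD", "Adam", "AdamW"],   # optimizer
))
print(len(grid))   # 4 * 4 * 4 * 4 * 3 = 768 full training runs
```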

Random search instead defines distributions:

$$
\eta \sim \text{LogUniform}(10^{-5},10^{-1}),
$$

$$
B \sim \text{Choice}\{32,64,128,256\},
$$

$$
p_{\text{drop}} \sim \text{Uniform}(0,0.5).
$$

Each trial samples one configuration from these distributions.

For example:

```python
config = {
    "learning_rate": 2.7e-4,
    "batch_size": 128,
    "hidden_dim": 512,
    "dropout": 0.18,
    "optimizer": "AdamW",
}
```

After training and validation, the result is recorded. The best configuration found so far is retained.

### Why Random Search Helps

Grid search allocates equal attention to every search dimension. This is inefficient when some dimensions matter much more than others.

Assume validation performance depends strongly on learning rate but weakly on dropout. A grid with five learning rates and five dropout values requires

$$
5 \times 5 = 25
$$

runs, yet only five distinct learning rates are ever evaluated.

Random search with 25 trials can evaluate 25 different learning rates. This gives much better coverage of the important dimension.

The advantage becomes larger as the number of weak or irrelevant dimensions increases.
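A tiny counting sketch makes the difference concrete; the specific ranges and values below are illustrative:

```python
import random

# Grid: 5 learning rates x 5 dropout values -> 25 runs, only 5 distinct learning rates.
grid_lrs = [1e-4, 3e-4, 1e-3, 3e-3, 1e-2]
grid_dropouts = [0.0, 0.1, 0.2, 0.3, 0.4]
grid_runs = [(lr, p) for lr in grid_lrs for p in grid_dropouts]
print(len(grid_runs), len({lr for lr, _ in grid_runs}))      # 25 runs, 5 distinct learning rates

# Random: 25 runs, and (almost surely) 25 distinct learning rates.
random_runs = [(10 ** random.uniform(-4, -2), random.uniform(0.0, 0.4)) for _ in range(25)]
print(len(random_runs), len({lr for lr, _ in random_runs}))  # 25 runs, 25 distinct learning rates
```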

### Random Search Versus Grid Search

Consider two hyperparameters:

| Method | Learning rate values explored | Dropout values explored | Total trials |
|---|---:|---:|---:|
| Grid search | 5 fixed values | 5 fixed values | 25 |
| Random search | 25 sampled values | 25 sampled values | 25 |

With the same number of trials, random search explores more distinct values per dimension.

This matters especially for continuous hyperparameters. A grid imposes artificial resolution. Random sampling avoids this fixed lattice.

### Defining Sampling Distributions

Random search requires distributions, not just candidate sets.

Common choices are:

| Hyperparameter | Suggested distribution |
|---|---|
| Learning rate | Log-uniform |
| Weight decay | Log-uniform |
| Dropout | Uniform |
| Batch size | Categorical |
| Hidden dimension | Categorical |
| Number of layers | Categorical |
| Warmup ratio | Uniform |
| Gradient clipping norm | Log-uniform |
| Label smoothing | Uniform |

A log-uniform distribution is useful when the right scale is unknown. For example, a good learning rate might lie near $10^{-4}$, $10^{-3}$, or $10^{-2}$. Sampling uniformly on a linear scale would concentrate almost all draws in the largest decade.

A log-uniform draw can be written as:

$$
u \sim \text{Uniform}(\log a,\log b),
\qquad
x = \exp(u).
$$

If base 10 is used:

$$
u \sim \text{Uniform}(\log_{10} a,\log_{10} b),
\qquad
x = 10^u.
$$

In Python:

```python
import random
import math

def log_uniform(low, high):
    return math.exp(random.uniform(math.log(low), math.log(high)))

learning_rate = log_uniform(1e-5, 1e-1)
```
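To see why the log scale matters, the following sketch counts how many of 10,000 draws land in each decade of $[10^{-5}, 10^{-1}]$ under plain uniform versus log-uniform sampling (sample count and range are illustrative):

```python
import math
import random
from collections import Counter

def decade(x):
    # Integer part of log10(x), e.g. 3e-4 -> -4 (the decade [1e-4, 1e-3)).
    return math.floor(math.log10(x))

uniform_draws = [random.uniform(1e-5, 1e-1) for _ in range(10_000)]
log_draws = [math.exp(random.uniform(math.log(1e-5), math.log(1e-1)))
             for _ in range(10_000)]

# Uniform draws pile up in the top decade [1e-2, 1e-1];
# log-uniform draws spread roughly evenly across all four decades.
print(Counter(decade(x) for x in uniform_draws))
print(Counter(decade(x) for x in log_draws))
```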

### A Minimal Random Search Implementation

A search space can be represented as a dictionary:

```python
search_space = {
    "learning_rate": ("log_uniform", 1e-5, 1e-1),
    "weight_decay": ("log_uniform", 1e-6, 1e-1),
    "batch_size": ("choice", [32, 64, 128, 256]),
    "hidden_dim": ("choice", [128, 256, 512, 1024]),
    "dropout": ("uniform", 0.0, 0.5),
    "optimizer": ("choice", ["SGD", "Adam", "AdamW"]),
}
```

We can implement a sampler:

```python
import random
import math

def sample_value(spec):
    kind = spec[0]

    if kind == "choice":
        return random.choice(spec[1])

    if kind == "uniform":
        low, high = spec[1], spec[2]
        return random.uniform(low, high)

    if kind == "log_uniform":
        low, high = spec[1], spec[2]
        return math.exp(random.uniform(math.log(low), math.log(high)))

    raise ValueError(f"unknown search distribution: {kind}")

def sample_config(search_space):
    return {
        name: sample_value(spec)
        for name, spec in search_space.items()
    }
```

Then random search becomes:

```python
best_config = None
best_score = float("-inf")

num_trials = 50

for trial in range(num_trials):
    config = sample_config(search_space)

    score = train_and_evaluate(config)

    if score > best_score:
        best_score = score
        best_config = config

print("best score:", best_score)
print("best config:", best_config)
```

The function `train_and_evaluate` should construct the model, optimizer, scheduler, data loaders, and training loop from the configuration.
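One possible shape for that function is sketched below; `build_model`, `get_dataloaders`, `train_one_epoch`, and `evaluate` are hypothetical project-specific helpers, not library functions:

```python
import torch

def train_and_evaluate(config, num_epochs=5):
    # build_model, get_dataloaders, train_one_epoch, and evaluate are assumed
    # project helpers; their names and signatures are illustrative only.
    model = build_model(hidden_dim=config["hidden_dim"], dropout=config["dropout"])
    train_loader, val_loader = get_dataloaders(batch_size=config["batch_size"])

    optimizer_cls = {
        "SGD": torch.optim.SGD,
        "Adam": torch.optim.Adam,
        "AdamW": torch.optim.AdamW,
    }[config["optimizer"]]
    optimizer = optimizer_cls(
        model.parameters(),
        lr=config["learning_rate"],
        weight_decay=config["weight_decay"],
    )

    for _ in range(num_epochs):
        train_one_epoch(model, train_loader, optimizer)

    # Return a score where higher is better, e.g. validation accuracy.
    return evaluate(model, val_loader)
```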

### Conditional Random Search

Some hyperparameters only make sense under certain choices.

For example, momentum matters for SGD:

```python
if optimizer == "SGD":
    momentum = random.uniform(0.0, 0.99)
```

Adam beta values matter for Adam and AdamW:

```python
if optimizer in {"Adam", "AdamW"}:
    beta1 = random.uniform(0.8, 0.99)
    beta2 = random.uniform(0.9, 0.9999)
```

A conditional sampler can encode this directly:

```python
def sample_optimizer_config():
    optimizer = random.choice(["SGD", "Adam", "AdamW"])

    config = {"optimizer": optimizer}

    if optimizer == "SGD":
        config["momentum"] = random.uniform(0.0, 0.99)

    if optimizer in {"Adam", "AdamW"}:
        config["beta1"] = random.uniform(0.8, 0.99)
        config["beta2"] = log_uniform(0.9, 0.9999)

    return config
```

Conditional search spaces avoid meaningless parameters. This improves search efficiency and makes the result easier to interpret.

### Number of Trials

Random search requires choosing a trial budget. The budget depends on training cost, available hardware, and search-space size.

For small models, hundreds of trials may be feasible. For large models, even ten trials may be expensive.

A practical pattern is:

| Training cost per run | Reasonable initial trials |
|---|---:|
| Seconds | 100 to 1000 |
| Minutes | 50 to 200 |
| Hours | 10 to 50 |
| Days | 3 to 10 |

The search should begin with wide ranges and a modest budget. After promising regions are found, a second random search can focus on narrower ranges.

### Coarse-to-Fine Random Search

Random search is often used in stages.

First, run a broad search:

```python
broad_space = {
    "learning_rate": ("log_uniform", 1e-5, 1e-1),
    "weight_decay": ("log_uniform", 1e-6, 1e-1),
    "dropout": ("uniform", 0.0, 0.5),
}
```

Suppose the best results cluster near:

$$
\eta \in [10^{-4},10^{-3}],
\qquad
\lambda_{\text{wd}} \in [10^{-3},10^{-2}].
$$

Then define a narrower search:

```python
narrow_space = {
    "learning_rate": ("log_uniform", 1e-4, 1e-3),
    "weight_decay": ("log_uniform", 1e-3, 1e-2),
    "dropout": ("uniform", 0.05, 0.2),
}
```

This second stage increases resolution where useful configurations are likely.

### Random Seeds and Reproducibility

Random search introduces randomness in two places.

First, the search algorithm samples configurations randomly. Second, model training itself is stochastic due to initialization, data shuffling, dropout, nondeterministic GPU kernels, and augmentation.

To improve reproducibility, save:

| Item | Purpose |
|---|---|
| Search seed | Reproduce sampled configurations |
| Training seed | Reproduce model initialization and data order |
| Full configuration | Rebuild the experiment |
| Validation metrics | Compare trials |
| Checkpoint path | Reload selected model |
| Code version | Match implementation |

Example:

```python
import random
import torch

def set_seed(seed):
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
```

Each trial can receive its own seed:

```python
base_seed = 1234

for trial in range(num_trials):
    trial_seed = base_seed + trial
    set_seed(trial_seed)

    config = sample_config(search_space)
    config["seed"] = trial_seed

    score = train_and_evaluate(config)
```

This makes the search easier to audit.
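Persisting each trial record on disk supports that audit. A minimal sketch, continuing the loop above, writes one JSON line per trial (the file name is illustrative):

```python
import json

def log_trial(record, path="random_search_log.jsonl"):
    # Append one JSON object per line so the log can be re-read trial by trial.
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

log_trial({
    "trial": trial,
    "seed": trial_seed,
    "config": config,
    "score": score,
})
```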

### Handling Failed Trials

Random search may sample unstable or invalid configurations. A learning rate may be too large. A batch size may exceed memory. A transformer hidden size may be incompatible with the number of heads.

Failed trials should be recorded, not silently ignored.

```python
results = []

for trial in range(num_trials):
    config = sample_config(search_space)

    try:
        score = train_and_evaluate(config)
        status = "ok"
        error = None

    except RuntimeError as e:
        score = None
        status = "failed"
        error = str(e)

    results.append({
        "trial": trial,
        "config": config,
        "score": score,
        "status": status,
        "error": error,
    })
```

This record helps identify bad regions of the search space. If many trials fail, the search space should be constrained.

### Comparing Configurations Fairly

A configuration should be compared under the same evaluation protocol.

This means:

| Factor | Should be fixed across trials |
|---|---|
| Training data | Same split |
| Validation data | Same split |
| Number of epochs | Same budget unless using early stopping |
| Evaluation metric | Same metric |
| Preprocessing | Same rules unless intentionally searched |
| Random seed policy | Same procedure |
| Hardware assumptions | Same precision and device type |

If one configuration receives more training steps than another, its score may reflect extra compute rather than better hyperparameters.

For budget-aware optimization, use an explicit objective such as validation accuracy after a fixed number of steps, or validation loss under a fixed GPU-hour budget.
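As an illustration of a fixed-step objective, the sketch below gives every configuration exactly the same number of optimizer updates; `build_everything`, `training_step`, and `validation_accuracy` are hypothetical helpers:

```python
MAX_STEPS = 10_000   # identical update budget for every configuration

def fixed_budget_objective(config):
    # build_everything, training_step, and validation_accuracy are assumed
    # project helpers; the names are illustrative only.
    model, optimizer, train_loader, val_loader = build_everything(config)

    step = 0
    while step < MAX_STEPS:
        for batch in train_loader:
            training_step(model, optimizer, batch)
            step += 1
            if step >= MAX_STEPS:
                break

    return validation_accuracy(model, val_loader)
```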

### Search Results as Data

Random search produces useful diagnostic data. Even failed or mediocre trials can show which hyperparameters matter.

After running trials, we can rank the successful ones by validation score (failed trials carry a `None` score and are excluded first):

```python
completed = [r for r in results if r["status"] == "ok"]
completed.sort(key=lambda r: r["score"], reverse=True)
```

We can inspect the best configurations:

```python
for r in completed[:5]:
    print(r["score"], r["config"])
```

Patterns are often more valuable than a single best configuration. For example, the top configurations may all use AdamW, learning rates near $3\times10^{-4}$, dropout below 0.2, and moderate weight decay. This suggests a stable region of the search space.
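A small aggregation over the top trials can surface such patterns; this sketch assumes the `completed` list built above:

```python
from collections import Counter

top = completed[:10]

# Which optimizers dominate the best trials, and where do their learning rates cluster?
print(Counter(r["config"]["optimizer"] for r in top))
print(sorted(r["config"]["learning_rate"] for r in top))
```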

### Advantages and Disadvantages

| Advantages | Disadvantages |
|---|---|
| Simple to implement | No guarantee of finding the optimum |
| Works well in high-dimensional spaces | Results vary across random seeds |
| Efficient for continuous variables | Can waste trials in bad regions |
| Easy to parallelize | Does not learn from previous trials |
| Better than grid search for many DL tasks | Needs careful distribution design |

Random search is a strong default when the search space is moderately large and the cost per run is acceptable.

### When to Use Random Search

Random search is a good choice when:

| Situation | Reason |
|---|---|
| Several hyperparameters matter | Better coverage than grid search |
| Some dimensions are continuous | Avoids fixed grid resolution |
| Search budget is limited | Can stop after any number of trials |
| Parallel workers are available | Trials are independent |
| Baselines are needed quickly | Simple and robust |

Random search is less suitable when each run is extremely expensive and only a few trials are possible. In that case, expert tuning or Bayesian optimization may use the budget more effectively.

### Summary

Random search samples hyperparameter configurations from predefined distributions. It replaces exhaustive enumeration with stochastic exploration.

Compared with grid search, random search often covers important dimensions more effectively under the same trial budget. It works especially well when only a few hyperparameters strongly influence performance.

A good random search depends on well-designed sampling distributions, proper logging, valid constraints, reproducible seeds, and fair evaluation. It is simple enough to be a baseline and strong enough to be useful in real deep learning systems.

