Random Search

Random search is a hyperparameter optimization method that samples configurations at random from a search space. Instead of evaluating every point on a fixed grid, random search chooses a fixed number of trials and draws each trial independently.

This method is simple, but it is often more effective than grid search in deep learning. The reason is that only a small number of hyperparameters usually dominate performance. Random search spends more trials exploring different values of important dimensions instead of repeatedly evaluating unimportant combinations.

The Basic Idea

Suppose we want to tune learning rate, batch size, hidden dimension, dropout, and optimizer. Grid search would require choosing a small finite set for each one and evaluating the full Cartesian product.

Random search instead defines distributions:

\eta \sim \text{LogUniform}(10^{-5},10^{-1}), \qquad B \sim \text{Choice}\{32,64,128,256\}, \qquad p_{\text{drop}} \sim \text{Uniform}(0,0.5).

Each trial samples one configuration from these distributions.

For example:

config = {
    "learning_rate": 2.7e-4,
    "batch_size": 128,
    "hidden_dim": 512,
    "dropout": 0.18,
    "optimizer": "AdamW",
}

After training and validation, the result is recorded. The best configuration found so far is retained.

Why Random Search Helps

Grid search allocates equal attention to every search dimension. This is inefficient when some dimensions matter much more than others.

Assume validation performance depends strongly on learning rate but weakly on dropout. A grid with five learning rates and five dropout values performs

5 \times 5 = 25

runs, but the learning rate takes only five distinct values.

Random search with 25 trials can evaluate 25 different learning rates. This gives much better coverage of the important dimension.

The advantage becomes larger as the number of weak or irrelevant dimensions increases.
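This effect can be sketched numerically. The toy objective below is hypothetical: it depends only on `x`, standing in for the important dimension, while `y` stands in for an irrelevant one.

```python
import random

random.seed(0)

# Hypothetical objective: depends only on x (the "important" dimension);
# y (the "irrelevant" dimension) has no effect. The optimum is at x = 0.37.
def objective(x, y):
    return -(x - 0.37) ** 2

# Grid search: a 5 x 5 lattice performs 25 runs but tries only 5 distinct x values.
grid = [i / 4 for i in range(5)]
grid_best = max(objective(x, y) for x in grid for y in grid)

# Random search: 25 trials try 25 distinct x values.
random_best = max(
    objective(random.uniform(0, 1), random.uniform(0, 1)) for _ in range(25)
)

print(f"grid best:   {grid_best:.5f}")
print(f"random best: {random_best:.5f}")
```

With the same 25-trial budget, the random draws almost always land closer to the optimum along the important axis than the five fixed grid points do.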

Random Search Versus Grid Search

Consider two hyperparameters:

| Method | Learning rate values explored | Dropout values explored | Total trials |
| --- | --- | --- | --- |
| Grid search | 5 fixed values | 5 fixed values | 25 |
| Random search | 25 sampled values | 25 sampled values | 25 |

With the same number of trials, random search explores more distinct values per dimension.

This matters especially for continuous hyperparameters. A grid imposes artificial resolution. Random sampling avoids this fixed lattice.

Defining Sampling Distributions

Random search requires distributions, not just candidate sets.

Common choices are:

| Hyperparameter | Suggested distribution |
| --- | --- |
| Learning rate | Log-uniform |
| Weight decay | Log-uniform |
| Dropout | Uniform |
| Batch size | Categorical |
| Hidden dimension | Categorical |
| Number of layers | Categorical |
| Warmup ratio | Uniform |
| Gradient clipping norm | Log-uniform |
| Label smoothing | Uniform |

A log-uniform distribution is useful when the right scale is unknown. For example, the useful learning rate may be 10^{-4}, 10^{-3}, or 10^{-2}. Sampling uniformly in ordinary space would over-sample large values.

A log-uniform draw can be written as:

u \sim \text{Uniform}(\log a,\log b), \qquad x = \exp(u).

If base 10 is used:

u \sim \text{Uniform}(\log_{10} a,\log_{10} b), \qquad x = 10^u.

In Python:

import random
import math

def log_uniform(low, high):
    return math.exp(random.uniform(math.log(low), math.log(high)))

learning_rate = log_uniform(1e-5, 1e-1)
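The base-10 form can be coded the same way by sampling the exponent directly (the helper name here is illustrative):

```python
import random

def log_uniform_base10(low_exp, high_exp):
    # Sample the base-10 exponent uniformly, then exponentiate.
    u = random.uniform(low_exp, high_exp)
    return 10 ** u

learning_rate = log_uniform_base10(-5, -1)  # between 1e-5 and 1e-1
```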

A Minimal Random Search Implementation

A search space can be represented as a dictionary:

search_space = {
    "learning_rate": ("log_uniform", 1e-5, 1e-1),
    "weight_decay": ("log_uniform", 1e-6, 1e-1),
    "batch_size": ("choice", [32, 64, 128, 256]),
    "hidden_dim": ("choice", [128, 256, 512, 1024]),
    "dropout": ("uniform", 0.0, 0.5),
    "optimizer": ("choice", ["SGD", "Adam", "AdamW"]),
}

We can implement a sampler:

import random
import math

def sample_value(spec):
    kind = spec[0]

    if kind == "choice":
        return random.choice(spec[1])

    if kind == "uniform":
        low, high = spec[1], spec[2]
        return random.uniform(low, high)

    if kind == "log_uniform":
        low, high = spec[1], spec[2]
        return math.exp(random.uniform(math.log(low), math.log(high)))

    raise ValueError(f"unknown search distribution: {kind}")

def sample_config(search_space):
    return {
        name: sample_value(spec)
        for name, spec in search_space.items()
    }

Then random search becomes:

best_config = None
best_score = float("-inf")

num_trials = 50

for trial in range(num_trials):
    config = sample_config(search_space)

    score = train_and_evaluate(config)

    if score > best_score:
        best_score = score
        best_config = config

print("best score:", best_score)
print("best config:", best_config)

The function train_and_evaluate should construct the model, optimizer, scheduler, data loaders, and training loop from the configuration.
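The details of that function depend entirely on the project. As a placeholder sketch, the fake score below only simulates a validation metric so the search loop can run end to end; its shape (a peak near learning rate 3e-4, a dropout penalty) is invented for illustration.

```python
import math

def train_and_evaluate(config):
    """Stand-in for a real training run (illustrative only).

    A real implementation would build the model, optimizer, scheduler,
    and data loaders from `config`, train under a fixed budget, and
    return a validation metric.
    """
    # Fake score: peaks when the learning rate is 3e-4 and
    # decreases with heavier dropout.
    lr_term = -(math.log10(config["learning_rate"]) - math.log10(3e-4)) ** 2
    drop_term = -config.get("dropout", 0.0)
    return lr_term + drop_term

score = train_and_evaluate({"learning_rate": 3e-4, "dropout": 0.0})
```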

Conditional Random Search

Some hyperparameters only make sense under certain choices.

For example, momentum matters for SGD:

if optimizer == "SGD":
    momentum = random.uniform(0.0, 0.99)

Adam beta values matter for Adam and AdamW:

if optimizer in {"Adam", "AdamW"}:
    beta1 = random.uniform(0.8, 0.99)
    beta2 = random.uniform(0.9, 0.9999)

A conditional sampler can encode this directly:

def sample_optimizer_config():
    optimizer = random.choice(["SGD", "Adam", "AdamW"])

    config = {"optimizer": optimizer}

    if optimizer == "SGD":
        config["momentum"] = random.uniform(0.0, 0.99)

    if optimizer in {"Adam", "AdamW"}:
        config["beta1"] = random.uniform(0.8, 0.99)
        config["beta2"] = random.uniform(0.9, 0.9999)

    return config

Conditional search spaces avoid meaningless parameters. This improves search efficiency and makes the result easier to interpret.

Number of Trials

Random search requires choosing a trial budget. The budget depends on training cost, available hardware, and search-space size.

For small models, hundreds of trials may be feasible. For large models, even ten trials may be expensive.

A practical pattern is:

| Training cost per run | Reasonable initial trials |
| --- | --- |
| Seconds | 100 to 1000 |
| Minutes | 50 to 200 |
| Hours | 10 to 50 |
| Days | 3 to 10 |

The search should begin with wide ranges and a modest budget. After promising regions are found, a second random search can focus on narrower ranges.

Coarse-to-Fine Random Search

Random search is often used in stages.

First, run a broad search:

broad_space = {
    "learning_rate": ("log_uniform", 1e-5, 1e-1),
    "weight_decay": ("log_uniform", 1e-6, 1e-1),
    "dropout": ("uniform", 0.0, 0.5),
}

Suppose the best results cluster near:

\eta \in [10^{-4},10^{-3}], \qquad \lambda_{\text{wd}} \in [10^{-3},10^{-2}].

Then define a narrower search:

narrow_space = {
    "learning_rate": ("log_uniform", 1e-4, 1e-3),
    "weight_decay": ("log_uniform", 1e-3, 1e-2),
    "dropout": ("uniform", 0.05, 0.2),
}

This second stage increases resolution where useful configurations are likely.
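The narrowing step can also be derived from the recorded trials. The helper below is a sketch (its name and the `top_frac` knob are illustrative): it returns the interval spanned by the best-scoring trials as the next search range.

```python
def narrow_range(results, name, top_frac=0.2):
    # Keep only finished trials and sort them by score, best first.
    finished = [r for r in results if r["score"] is not None]
    finished.sort(key=lambda r: r["score"], reverse=True)

    # Take the top fraction and return the interval their values span.
    top = finished[: max(1, int(len(finished) * top_frac))]
    values = [r["config"][name] for r in top]
    return min(values), max(values)

# Example with fabricated trial records:
results = [
    {"config": {"learning_rate": 3e-4}, "score": 0.92},
    {"config": {"learning_rate": 8e-4}, "score": 0.90},
    {"config": {"learning_rate": 5e-2}, "score": 0.40},
    {"config": {"learning_rate": 2e-5}, "score": 0.55},
]
low, high = narrow_range(results, "learning_rate", top_frac=0.5)
print(low, high)
```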

Random Seeds and Reproducibility

Random search introduces randomness in two places.

First, the search algorithm samples configurations randomly. Second, model training itself is stochastic due to initialization, data shuffling, dropout, nondeterministic GPU kernels, and augmentation.

To improve reproducibility, save:

| Item | Purpose |
| --- | --- |
| Search seed | Reproduce sampled configurations |
| Training seed | Reproduce model initialization and data order |
| Full configuration | Rebuild the experiment |
| Validation metrics | Compare trials |
| Checkpoint path | Reload selected model |
| Code version | Match implementation |

Example:

import random
import torch

def set_seed(seed):
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

Each trial can receive its own seed:

base_seed = 1234

for trial in range(num_trials):
    trial_seed = base_seed + trial
    set_seed(trial_seed)

    config = sample_config(search_space)
    config["seed"] = trial_seed

    score = train_and_evaluate(config)

This makes the search easier to audit.

Handling Failed Trials

Random search may sample unstable or invalid configurations. A learning rate may be too large. A batch size may exceed memory. A transformer hidden size may be incompatible with the number of heads.

Failed trials should be recorded, not silently ignored.

results = []

for trial in range(num_trials):
    config = sample_config(search_space)
    error = None

    try:
        score = train_and_evaluate(config)
        status = "ok"

    except RuntimeError as e:
        score = None
        status = "failed"
        error = str(e)

    results.append({
        "trial": trial,
        "config": config,
        "score": score,
        "status": status,
        "error": error,
    })

This record helps identify bad regions of the search space. If many trials fail, the search space should be constrained.

Comparing Configurations Fairly

A configuration should be compared under the same evaluation protocol.

This means:

| Factor | Should be fixed across trials |
| --- | --- |
| Training data | Same split |
| Validation data | Same split |
| Number of epochs | Same budget unless using early stopping |
| Evaluation metric | Same metric |
| Preprocessing | Same rules unless intentionally searched |
| Random seed policy | Same procedure |
| Hardware assumptions | Same precision and device type |

If one configuration receives more training steps than another, its score may reflect extra compute rather than better hyperparameters.

For budget-aware optimization, use an explicit objective such as validation accuracy after a fixed number of steps, or validation loss under a fixed GPU-hour budget.

Search Results as Data

Random search produces useful diagnostic data. Even failed or mediocre trials can show which hyperparameters matter.

After running trials, we can sort by validation score:

# Failed trials have score None; sort them last.
results = sorted(
    results,
    key=lambda r: r["score"] if r["score"] is not None else float("-inf"),
    reverse=True,
)

We can inspect the best configurations:

for r in results[:5]:
    print(r["score"], r["config"])

Patterns are often more valuable than a single best configuration. For example, the top configurations may all use AdamW, learning rates near 3 \times 10^{-4}, dropout below 0.2, and moderate weight decay. This suggests a stable region of the search space.
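One way to surface such patterns is to aggregate over the top trials rather than read off a single winner. This sketch (the function name and `k` are illustrative) counts optimizers and brackets the learning rates among the best records:

```python
from collections import Counter

def summarize_top(results, k=5):
    # Sort finished trials by score and keep the best k.
    top = sorted(
        (r for r in results if r["score"] is not None),
        key=lambda r: r["score"],
        reverse=True,
    )[:k]
    optimizers = Counter(r["config"]["optimizer"] for r in top)
    lrs = sorted(r["config"]["learning_rate"] for r in top)
    return optimizers, lrs[0], lrs[-1]

# Example with fabricated records:
results = [
    {"config": {"optimizer": "AdamW", "learning_rate": 3e-4}, "score": 0.92},
    {"config": {"optimizer": "AdamW", "learning_rate": 5e-4}, "score": 0.91},
    {"config": {"optimizer": "SGD", "learning_rate": 1e-2}, "score": 0.70},
]
optimizers, lr_low, lr_high = summarize_top(results, k=2)
print(optimizers, lr_low, lr_high)
```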

Advantages and Disadvantages

| Advantages | Disadvantages |
| --- | --- |
| Simple to implement | No guarantee of finding the optimum |
| Works well in high-dimensional spaces | Results vary across random seeds |
| Efficient for continuous variables | Can waste trials in bad regions |
| Easy to parallelize | Does not learn from previous trials |
| Better than grid search for many DL tasks | Needs careful distribution design |

Random search is a strong default when the search space is moderately large and the cost per run is acceptable.

When to Use Random Search

Random search is a good choice when:

| Situation | Reason |
| --- | --- |
| Several hyperparameters matter | Better coverage than grid search |
| Some dimensions are continuous | Avoids fixed grid resolution |
| Search budget is limited | Can stop after any number of trials |
| Parallel workers are available | Trials are independent |
| Baselines are needed quickly | Simple and robust |

Random search is less suitable when each run is extremely expensive and only a few trials are possible. In that case, expert tuning or Bayesian optimization may use the budget more effectively.
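Because trials are independent, they map naturally onto parallel workers. The sketch below uses a thread pool and a fake scoring function (both illustrative); a real setup would dispatch the actual training function to separate processes or machines.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def fake_train_and_evaluate(config):
    # Stand-in objective so the sketch runs without a real model.
    return -abs(config["x"] - 0.3)

def parallel_random_search(num_trials, max_workers=4):
    random.seed(0)  # sample all configurations up front, reproducibly
    configs = [{"x": random.uniform(0, 1)} for _ in range(num_trials)]

    # Evaluate trials concurrently; each worker scores one configuration.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(fake_train_and_evaluate, configs))

    best_score, best_config = max(zip(scores, configs), key=lambda t: t[0])
    return best_score, best_config

best_score, best_config = parallel_random_search(25)
print(best_score, best_config)
```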

Summary

Random search samples hyperparameter configurations from predefined distributions. It replaces exhaustive enumeration with stochastic exploration.

Compared with grid search, random search often covers important dimensions more effectively under the same trial budget. It works especially well when only a few hyperparameters strongly influence performance.

A good random search depends on well-designed sampling distributions, proper logging, valid constraints, reproducible seeds, and fair evaluation. It is simple enough to be a baseline and strong enough to be useful in real deep learning systems.