Neural architecture search, or NAS, is the process of automatically searching for model architectures. Ordinary hyperparameter optimization usually tunes values such as learning rate, batch size, dropout, or weight decay. NAS searches the structure of the network itself.
Architecture choices include the number of layers, hidden width, convolution kernel sizes, attention heads, skip connections, normalization placement, activation functions, and block types. In large models, architecture search may also include mixture-of-experts routing, context length, embedding dimension, MLP expansion ratio, and parameter sharing.
The Architecture Search Problem
A neural architecture defines a function class. Once the architecture is fixed, training chooses parameters inside that class.
Let $a$ denote an architecture and $\theta$ denote its trainable parameters. Training solves

$$\theta^*(a) = \arg\min_{\theta} \mathcal{L}_{\text{train}}(\theta; a)$$

Architecture search chooses the architecture that performs best on validation data:

$$a^* = \arg\min_{a \in \mathcal{A}} \mathcal{L}_{\text{val}}(\theta^*(a); a)$$

Here $\mathcal{A}$ is the architecture search space.
This is harder than ordinary hyperparameter optimization because architectures may have different tensor shapes, parameter counts, memory costs, and training dynamics.
Architecture Search Spaces
A search space defines which architectures may be considered. A poor search space can exclude good models or include too many invalid models.
For an MLP, a simple search space might include:
| Choice | Possible values |
|---|---|
| Number of layers | 2, 3, 4, 6, 8 |
| Hidden dimension | 128, 256, 512, 1024 |
| Activation | ReLU, GELU, SiLU |
| Normalization | None, BatchNorm, LayerNorm |
| Dropout | 0.0, 0.1, 0.2, 0.3 |
For a CNN, the space may include:
| Choice | Possible values |
|---|---|
| Number of blocks | 3 to 8 |
| Channels per block | 32, 64, 128, 256 |
| Kernel size | 3, 5, 7 |
| Stride | 1, 2 |
| Residual connection | True, False |
| Squeeze-excitation | True, False |
For a transformer, common architecture choices include:
| Choice | Possible values |
|---|---|
| Number of layers | 6, 12, 24, 32 |
| Model dimension | 512, 768, 1024, 2048 |
| Attention heads | 8, 12, 16, 32 |
| MLP ratio | 2, 4, 8 |
| Normalization placement | Pre-norm, post-norm |
| Positional encoding | Learned, sinusoidal, rotary |
| Attention type | Full, local, sparse, linear |
| Experts | Dense, MoE |
Architecture search spaces are often constrained. For example, in a transformer, the model dimension must divide evenly across attention heads:

$$d_{\text{model}} \bmod h = 0$$

where $h$ is the number of attention heads. Each head then has dimension

$$d_{\text{head}} = \frac{d_{\text{model}}}{h}$$
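In practice, a sampler can enforce such constraints by rejection. Below is a minimal sketch; the configuration keys `d_model` and `num_heads` are hypothetical and would need to match your own configuration format.

```python
def is_valid_transformer_config(config):
    # The model dimension must split evenly across attention heads.
    return config["d_model"] % config["num_heads"] == 0

def sample_valid(sample_fn, max_tries=100):
    # Rejection sampling: draw configurations until one satisfies the constraint.
    for _ in range(max_tries):
        config = sample_fn()
        if is_valid_transformer_config(config):
            return config
    raise RuntimeError("no valid configuration found within the budget")
```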
Manual Search Versus Automated Search
Most successful deep learning architectures have involved substantial human design. NAS does not remove design work. It changes where design work happens.
Manual design chooses the architecture directly. NAS chooses the search space, objective, budget, and search algorithm.
| Approach | Human decides | Algorithm decides |
|---|---|---|
| Manual design | Architecture | Nothing |
| Hyperparameter search | Search space and ranges | Best configuration |
| NAS | Search space and objective | Architecture within space |
The search space usually encodes expert assumptions. For example, a CNN search space assumes locality and translation structure. A transformer search space assumes attention, residual connections, and token embeddings.
Search Algorithms
Several algorithms can be used for NAS.
| Method | Basic idea |
|---|---|
| Random search | Sample architectures randomly |
| Bayesian optimization | Model performance as a function of architecture choices |
| Evolutionary search | Mutate and select architectures |
| Reinforcement learning | Train a controller to propose architectures |
| Differentiable NAS | Relax discrete architecture choices into continuous weights |
| Weight-sharing NAS | Train a supernet containing many subnetworks |
Simple random search is often a strong baseline. More complex methods are useful when each architecture is expensive to train and the search space has exploitable structure.
Evolutionary Architecture Search
Evolutionary NAS maintains a population of architectures. Each architecture is trained and evaluated. Better architectures are selected as parents. New architectures are created by mutation or crossover.
A mutation might:
| Mutation | Example |
|---|---|
| Add a layer | 6 layers to 7 layers |
| Change width | 512 hidden units to 768 |
| Change kernel size | 3x3 to 5x5 |
| Add skip connection | Insert residual path |
| Change activation | ReLU to GELU |
A simple evolutionary loop is:
```python
population = initialize_architectures()
for generation in range(num_generations):
    scores = evaluate_population(population)
    parents = select_best(population, scores)
    children = []
    for parent in parents:
        child = mutate(parent)
        children.append(child)
    population = parents + children
```

Evolutionary methods are flexible. They handle discrete, conditional, and irregular architecture spaces well. Their main cost is the need to train many candidate architectures.
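The `mutate` helper is left abstract above. A minimal sketch for the MLP configuration format used later in this chapter might look like the following; the specific mutation choices are illustrative assumptions rather than a fixed recipe.

```python
import copy
import random

def mutate(config):
    """Return a copy of config with one randomly chosen field perturbed."""
    child = copy.deepcopy(config)
    field = random.choice(["depth", "width", "activation", "dropout"])
    if field == "depth":
        # Add or remove one hidden layer (never drop below one layer).
        if random.random() < 0.5 or len(child["hidden_dims"]) == 1:
            child["hidden_dims"].append(child["hidden_dims"][-1])
        else:
            child["hidden_dims"].pop()
    elif field == "width":
        # Resample the width of one randomly chosen layer.
        i = random.randrange(len(child["hidden_dims"]))
        child["hidden_dims"][i] = random.choice([128, 256, 512, 1024])
    elif field == "activation":
        child["activation"] = random.choice(["relu", "gelu", "silu"])
    else:
        child["dropout"] = random.choice([0.0, 0.1, 0.2, 0.3])
    return child
```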
Reinforcement Learning NAS
In reinforcement learning NAS, a controller generates architecture descriptions. The generated architecture is trained and evaluated. The validation score is used as a reward to update the controller.
The controller may output a sequence of architecture decisions one choice at a time, for example `num_layers=12, hidden_dim=768, num_heads=12, activation=gelu`.
The reward may be validation accuracy, or a deployment-aware score such as

$$R = \text{accuracy} - \lambda \cdot \text{latency}$$

where $\lambda$ trades off predictive quality against inference cost.
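A minimal sketch of such a reward, assuming the candidate's `validation_accuracy` and `measured_latency_ms` have already been computed:

```python
def deployment_reward(validation_accuracy, measured_latency_ms, lam=0.01):
    # lam controls how strongly slow architectures are penalized.
    return validation_accuracy - lam * measured_latency_ms
```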
RL-based NAS can discover useful structures, but it is expensive and difficult to reproduce. For many projects, simpler methods give most of the benefit with less complexity.
Differentiable NAS
Differentiable NAS makes architecture choices continuous.
Suppose a layer may choose between several candidate operations $o_1, \dots, o_k$, for example a 3×3 convolution, a 5×5 convolution, and an identity (skip) connection.

Instead of choosing one operation directly, differentiable NAS computes a weighted mixture:

$$\bar{o}(x) = \sum_{i=1}^{k} \alpha_i \, o_i(x)$$

where $\alpha_i$ are learnable architecture weights. The weights are usually normalized with softmax:

$$\alpha_i = \frac{\exp(a_i)}{\sum_{j=1}^{k} \exp(a_j)}$$
Training then optimizes both model parameters and architecture weights. After training, the strongest operations are selected to form a discrete architecture.
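As a concrete sketch of the relaxation in PyTorch, assuming all candidate operations preserve the input shape:

```python
import torch
from torch import nn

class MixedOp(nn.Module):
    """Continuous relaxation of a discrete operation choice (DARTS-style)."""

    def __init__(self, ops):
        super().__init__()
        self.ops = nn.ModuleList(ops)
        # One learnable logit per candidate operation.
        self.logits = nn.Parameter(torch.zeros(len(ops)))

    def forward(self, x):
        weights = torch.softmax(self.logits, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# Example: choose among two convolutions and a skip connection.
layer = MixedOp([
    nn.Conv2d(16, 16, kernel_size=3, padding=1),
    nn.Conv2d(16, 16, kernel_size=5, padding=2),
    nn.Identity(),
])
```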
Differentiable NAS can be more efficient than training many architectures from scratch, but the continuous relaxation may introduce bias. The best relaxed architecture may not correspond to the best discrete architecture.
Weight Sharing and Supernets
Weight-sharing NAS trains one large supernet that contains many possible subnetworks. Each candidate architecture is a path through the supernet.
Instead of training every architecture separately, we train the supernet and estimate candidate performance using inherited weights.
This reduces cost, but introduces approximation error. A subnetwork may look strong inside the supernet but perform differently when trained alone.
Weight sharing is common in efficient NAS systems because full training of every architecture is usually too expensive.
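One simple form of weight sharing is to carve subnetworks out of a single oversized layer. The sketch below, loosely in the spirit of slimmable networks, shares one weight matrix across several candidate widths; it is an illustration under that assumption, not a full supernet implementation.

```python
import torch
from torch import nn
import torch.nn.functional as F

class SharedWidthLinear(nn.Module):
    """A linear layer whose first `width` output units form a subnetwork."""

    def __init__(self, in_dim, max_out_dim):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(max_out_dim, in_dim) * 0.02)
        self.bias = nn.Parameter(torch.zeros(max_out_dim))

    def forward(self, x, width):
        # Slice the shared parameters down to the sampled width.
        return F.linear(x, self.weight[:width], self.bias[:width])

# Training alternates over sampled widths so every subnetwork receives updates.
layer = SharedWidthLinear(in_dim=256, max_out_dim=1024)
x = torch.randn(32, 256)
for width in (256, 512, 1024):
    out = layer(x, width)  # shape: (32, width)
```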
PyTorch Example: Configurable MLP
A simple architecture search can be implemented by building models from configuration dictionaries.
```python
import torch
from torch import nn

class SearchMLP(nn.Module):
    def __init__(self, input_dim, output_dim, config):
        super().__init__()
        activation_name = config["activation"]
        if activation_name == "relu":
            activation = nn.ReLU
        elif activation_name == "gelu":
            activation = nn.GELU
        elif activation_name == "silu":
            activation = nn.SiLU
        else:
            raise ValueError(f"unknown activation: {activation_name}")
        layers = []
        dim = input_dim
        for hidden_dim in config["hidden_dims"]:
            layers.append(nn.Linear(dim, hidden_dim))
            if config["normalization"] == "layernorm":
                layers.append(nn.LayerNorm(hidden_dim))
            layers.append(activation())
            if config["dropout"] > 0:
                layers.append(nn.Dropout(config["dropout"]))
            dim = hidden_dim
        layers.append(nn.Linear(dim, output_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```

A sampled architecture might be:
```python
config = {
    "hidden_dims": [512, 512, 256],
    "activation": "gelu",
    "normalization": "layernorm",
    "dropout": 0.1,
}
```

The training code can treat this like any other PyTorch model.
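For instance, a quick sanity check that a sampled configuration builds and runs:

```python
model = SearchMLP(input_dim=784, output_dim=10, config=config)
x = torch.randn(32, 784)  # a batch of 32 flattened 28x28 inputs
logits = model(x)         # shape: (32, 10)
```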
PyTorch Example: Sampling Architectures
A basic random architecture sampler:
```python
import random

def sample_architecture():
    num_layers = random.choice([2, 3, 4, 6])
    width = random.choice([128, 256, 512, 1024])
    return {
        "hidden_dims": [width] * num_layers,
        "activation": random.choice(["relu", "gelu", "silu"]),
        "normalization": random.choice(["none", "layernorm"]),
        "dropout": random.choice([0.0, 0.1, 0.2, 0.3]),
    }
```

A search loop:
```python
best_score = float("-inf")
best_config = None
for trial in range(50):
    config = sample_architecture()
    model = SearchMLP(
        input_dim=784,
        output_dim=10,
        config=config,
    )
    score = train_and_evaluate(model)
    if score > best_score:
        best_score = score
        best_config = config
```

This is NAS in its simplest form. It searches over architecture configurations by repeatedly building, training, and evaluating candidate models.
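The `train_and_evaluate` helper is left abstract above. A minimal sketch, assuming `train_loader` and `val_loader` are existing DataLoaders for a classification task, and giving every candidate the same fixed budget:

```python
import torch
from torch import nn

def train_and_evaluate(model, epochs=3, lr=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):  # identical budget for every candidate
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for x, y in val_loader:
            pred = model(x).argmax(dim=1)
            correct += (pred == y).sum().item()
            total += y.numel()
    return correct / total  # validation accuracy as the search score
```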
Cost-Aware Architecture Search
Architecture quality cannot be measured only by accuracy. Large models may be too slow or expensive to deploy.
Useful architecture objectives include:
| Objective | Meaning |
|---|---|
| Validation accuracy | Predictive quality |
| Validation loss | Calibration and optimization quality |
| Parameter count | Model size |
| FLOPs | Approximate compute cost |
| Latency | Real inference speed |
| Memory use | Deployment feasibility |
| Energy use | Operational cost |
A cost-aware objective may be, for example,

$$\text{score} = \text{accuracy} - \lambda \cdot \frac{\text{latency}}{\text{latency}_{\text{target}}}$$

This prefers architectures that are accurate and efficient.
Latency should be measured on the target hardware when possible. FLOPs do not always predict real speed because memory access, kernel fusion, batching, and hardware utilization matter.
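A rough way to measure latency in PyTorch is sketched below. It times forward passes on whatever device the model currently lives on; for GPU measurements you would additionally need to synchronize, for example with `torch.cuda.synchronize()`.

```python
import time
import torch

@torch.no_grad()
def measure_latency_ms(model, input_shape, warmup=10, iters=50):
    model.eval()
    x = torch.randn(*input_shape)
    for _ in range(warmup):  # warm up caches and lazy initialization
        model(x)
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / iters  # average milliseconds per forward pass
```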
Common Failure Modes
NAS can fail in several ways.
One failure mode is an unrealistic search space. If the space excludes strong architectures, the search algorithm cannot find them.
Another failure mode is unfair evaluation. If one architecture trains longer or uses stronger augmentation, the comparison becomes confounded.
A third failure mode is search overfitting. The algorithm may exploit noise in the validation set after testing many architectures.
A fourth failure mode is proxy mismatch. An architecture that performs well on a small proxy dataset may perform poorly on the full dataset.
A fifth failure mode is hardware mismatch. An architecture that is efficient on one device may be slow on another.
Practical Guidelines
Use NAS only after strong baselines are established. A simple manually designed architecture with careful training often beats an automatically searched architecture with weak training.
Keep the search space small at first. Search over a few high-impact choices, such as width, depth, activation, normalization, and dropout.
Use fair training budgets. Each architecture should receive the same number of training steps or the same compute budget.
Log architecture, training configuration, score, parameter count, latency, and random seed. Without complete logs, NAS results are difficult to interpret.
Retain the top architectures and retrain them from scratch. This checks whether their performance was due to the architecture itself or to lucky inherited weights, random seeds, or noisy validation.
When NAS Is Useful
NAS is useful when architecture choices strongly affect the result and many training runs are affordable. It is especially relevant when deployment constraints matter.
Examples include mobile vision models, efficient transformers, speech models, recommendation systems, and specialized scientific models.
NAS is less useful when the dominant performance gains come from data quality, training recipe, pretrained weights, or scale. In many modern foundation model settings, data, compute, and optimization recipe matter more than small architecture changes.
Summary
Neural architecture search automates the exploration of model structures. It searches over layers, widths, operations, connections, normalization, activation functions, and other architectural choices.
NAS can use random search, evolutionary algorithms, reinforcement learning, Bayesian optimization, differentiable relaxation, or weight-sharing supernets. The main challenge is cost: each architecture may require expensive training and careful evaluation.
A practical NAS workflow starts with strong baselines, defines a constrained search space, uses fair evaluation, includes deployment costs, and retrains top candidates from scratch.