Automated machine learning, or AutoML, refers to systems that automate parts of the model development process. Hyperparameter optimization and neural architecture search are both parts of AutoML, but AutoML is broader. It may include data preprocessing, feature construction, model selection, training recipe selection, ensembling, compression, deployment, and monitoring.
In deep learning, AutoML usually means a system that searches over training configurations and model structures under a compute budget. The goal is not to remove human judgment. The goal is to make repeated experimental decisions systematic, logged, and reproducible.
What AutoML Optimizes
A deep learning project contains many choices: some are numerical, some categorical, some structural, and some operational.
| Area | Examples |
|---|---|
| Data processing | Normalization, augmentation, tokenization, filtering |
| Model architecture | Depth, width, block type, attention type |
| Optimization | Optimizer, learning rate, weight decay, schedule |
| Regularization | Dropout, label smoothing, stochastic depth |
| Training system | Batch size, precision, gradient accumulation |
| Evaluation | Metric, validation split, threshold |
| Deployment | Quantization, pruning, latency target |
An AutoML system defines a search space over these choices and evaluates candidate pipelines.
A complete configuration might include:
```python
config = {
    "model": {
        "family": "transformer",
        "num_layers": 12,
        "hidden_dim": 768,
        "num_heads": 12,
        "mlp_ratio": 4,
    },
    "optimizer": {
        "name": "AdamW",
        "learning_rate": 3e-4,
        "weight_decay": 0.01,
    },
    "training": {
        "batch_size": 128,
        "epochs": 20,
        "mixed_precision": True,
    },
    "regularization": {
        "dropout": 0.1,
        "label_smoothing": 0.05,
    },
}
```

The configuration is then passed to a training pipeline.
Pipeline Search
AutoML searches over pipelines, not only isolated hyperparameters.
A pipeline is a sequence of decisions: data processing, then model construction, then training, then evaluation, then deployment. Each stage may have its own search space.
For image classification, the pipeline may include:
| Stage | Search choices |
|---|---|
| Input | Image resolution |
| Augmentation | Crop scale, color jitter, mixup, cutmix |
| Model | CNN, ViT, hybrid model |
| Optimizer | SGD, AdamW |
| Schedule | cosine decay, step decay, warmup |
| Regularization | dropout, stochastic depth |
| Inference | batch size, quantization |
For NLP, the pipeline may include:
| Stage | Search choices |
|---|---|
| Tokenization | vocabulary size, subword method |
| Sequence handling | max length, truncation, packing |
| Model | encoder, decoder, encoder-decoder |
| Objective | masked LM, causal LM, contrastive |
| Fine-tuning | full fine-tune, LoRA, adapters |
| Retrieval | chunk size, top-k documents |
| Inference | decoding strategy, temperature |
This broader view distinguishes AutoML from ordinary hyperparameter search.
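To make this concrete, a staged search space for the image classification pipeline above might be written as a nested dictionary. The stage names and candidate values here are illustrative assumptions, not fixed conventions:

```python
pipeline_space = {
    "input": {"resolution": [160, 192, 224]},
    "augmentation": {"mixup": [0.0, 0.2], "color_jitter": [0.0, 0.4]},
    "model": {"family": ["cnn", "vit", "hybrid"]},
    "optimizer": {"name": ["sgd", "adamw"], "learning_rate": [1e-4, 3e-4, 1e-3]},
    "schedule": {"name": ["cosine", "step"], "warmup_epochs": [0, 5]},
    "regularization": {"dropout": [0.0, 0.1, 0.2]},
}
```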
AutoML as Nested Optimization
AutoML can be described as nested optimization.
The inner optimization trains model parameters:

$$\theta^*(c) = \arg\min_{\theta} \, \mathcal{L}_{\text{train}}(\theta; c)$$

where $c$ is a full pipeline configuration.

The outer optimization selects the configuration:

$$c^* = \arg\min_{c \in \mathcal{C}} \, \mathcal{L}_{\text{val}}\bigl(\theta^*(c); c\bigr)$$

Here $\mathcal{C}$ is the AutoML search space.
The outer loop is expensive because every candidate configuration may require training. This is why AutoML systems rely on early stopping, pruning, surrogate models, low-fidelity evaluations, and parallel execution.
Search Strategies
AutoML systems combine search strategies from earlier sections.
| Strategy | Role in AutoML |
|---|---|
| Grid search | Small fixed spaces |
| Random search | Strong baseline for broad search |
| Bayesian optimization | Sample-efficient search |
| Population-based training | Dynamic hyperparameter schedules |
| Neural architecture search | Structural model choices |
| Multi-fidelity search | Cheap approximations before full training |
| Evolutionary algorithms | Irregular and discrete spaces |
A mature AutoML system may use several strategies at once. For example, it may use random search for initial trials, Bayesian optimization after enough observations, pruning for poor configurations, and final retraining for the top candidates.
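Frameworks such as Optuna can express this hybrid directly. The sketch below is illustrative: it uses random startup trials before TPE-based Bayesian optimization and a median rule for pruning, and train_for_one_epoch stands in for project-specific training code:

```python
import optuna

def objective(trial):
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True)
    weight_decay = trial.suggest_float("weight_decay", 1e-6, 1e-1, log=True)
    dropout = trial.suggest_float("dropout", 0.0, 0.5)
    for epoch in range(20):
        # train_for_one_epoch is a placeholder for project-specific code
        # that trains briefly and returns the current validation accuracy.
        accuracy = train_for_one_epoch(lr, weight_decay, dropout)
        trial.report(accuracy, epoch)
        if trial.should_prune():  # median rule stops weak trials early
            raise optuna.TrialPruned()
    return accuracy

study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(n_startup_trials=10),  # random first
    pruner=optuna.pruners.MedianPruner(n_warmup_steps=5),
)
study.optimize(objective, n_trials=100)
```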
Multi-Fidelity Optimization
Full training is expensive. Multi-fidelity optimization evaluates many configurations cheaply, then spends more compute on promising ones.
Lower-fidelity evaluations include:
| Lower-fidelity method | Approximation |
|---|---|
| Fewer epochs | Train briefly |
| Smaller dataset | Train on a subset |
| Lower resolution | Use smaller images |
| Shorter sequence length | Use fewer tokens |
| Smaller model | Use proxy architecture |
| Fewer diffusion steps | Approximate generation quality |
The assumption is that cheap evaluations are correlated with full training results.
Successive halving is a common multi-fidelity strategy. It begins with many configurations trained for a small budget. It keeps the best fraction and increases their budgets.
Hyperband extends this idea by running several such schedules, called brackets, each with a different trade-off between the number of configurations and the budget per configuration (see the sketch after the loop below).
A simplified successive halving loop:
```python
configs = sample_many(search_space, n=64)
budget = 1
while len(configs) > 1:
    results = []
    for config in configs:
        score = train_and_evaluate(config, epochs=budget)
        results.append((score, config))
    # Sort by score only; comparing config dicts directly would fail on ties.
    results.sort(key=lambda item: item[0], reverse=True)
    configs = [config for score, config in results[: len(results) // 2]]
    budget *= 2
```

This avoids fully training configurations that perform poorly early.
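A rough sketch of how Hyperband's bracket sizes could be enumerated, assuming a maximum budget of 81 units and a halving rate eta = 3 for illustration:

```python
import math

def hyperband_brackets(max_budget=81, eta=3):
    # Each bracket trades the number of configurations against the
    # starting budget per configuration, then runs successive halving.
    s_max = int(math.log(max_budget, eta))  # 4 when max_budget=81, eta=3
    brackets = []
    for s in range(s_max, -1, -1):
        n = math.ceil((s_max + 1) * eta**s / (s + 1))  # initial configs
        r = max_budget / eta**s                         # initial budget each
        brackets.append({"bracket": s, "n_configs": n, "initial_budget": r})
    return brackets
```

For these values this yields five brackets, ranging from 81 configurations at budget 1 down to 5 configurations trained at the full budget.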
AutoML with PyTorch
A practical PyTorch AutoML system usually has four layers.
| Layer | Responsibility |
|---|---|
| Configuration schema | Defines all tunable choices |
| Model factory | Builds models from configs |
| Training engine | Runs training and evaluation |
| Search controller | Chooses configs and records results |
A simple model factory:
```python
def build_model(config):
    family = config["model"]["family"]
    if family == "mlp":
        return MLP(
            input_dim=config["data"]["input_dim"],
            hidden_dim=config["model"]["hidden_dim"],
            output_dim=config["data"]["num_classes"],
            num_layers=config["model"]["num_layers"],
            dropout=config["regularization"]["dropout"],
        )
    if family == "cnn":
        return CNN(
            num_classes=config["data"]["num_classes"],
            channels=config["model"]["channels"],
            blocks=config["model"]["blocks"],
        )
    raise ValueError(f"unknown model family: {family}")
```

A training function:
```python
def run_trial(config):
    model = build_model(config)
    train_loader, val_loader = build_loaders(config)
    optimizer = build_optimizer(model, config)
    scheduler = build_scheduler(optimizer, config)
    for epoch in range(config["training"]["epochs"]):
        train_one_epoch(model, train_loader, optimizer)
        scheduler.step()
    return evaluate(model, val_loader)
```

The search controller can call run_trial(config) using random search, Bayesian optimization, or another strategy.
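For instance, a minimal random-search controller could sample from a nested dictionary of candidate lists and keep the best trial. The helper sample_config below is written for this sketch, and run_trial is assumed to return a validation score to maximize:

```python
import random

def sample_config(search_space):
    # Draw one value per leaf of a two-level dict of candidate lists.
    return {
        section: {key: random.choice(values) for key, values in options.items()}
        for section, options in search_space.items()
    }

def random_search(search_space, n_trials=20):
    best_score, best_config = float("-inf"), None
    for _ in range(n_trials):
        config = sample_config(search_space)
        score = run_trial(config)  # defined above; higher is better
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score
```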
Configuration Schemas
AutoML systems need explicit configuration schemas. The schema defines valid fields, types, defaults, and constraints.
A loose dictionary is easy to start with, but a schema becomes important as experiments grow.
Example schema style:
```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    family: str
    hidden_dim: int
    num_layers: int
    dropout: float

@dataclass
class OptimizerConfig:
    name: str
    learning_rate: float
    weight_decay: float

@dataclass
class TrainingConfig:
    batch_size: int
    epochs: int
    mixed_precision: bool

@dataclass
class Config:
    model: ModelConfig
    optimizer: OptimizerConfig
    training: TrainingConfig
```

A schema prevents accidental errors such as misspelled keys, missing fields, or invalid types. This matters because AutoML may run hundreds of experiments without human inspection.
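With this schema, a raw dictionary can be converted into typed objects once, at the start of a run, so a misspelled or missing key fails immediately rather than deep inside training. A minimal sketch:

```python
raw = {
    "model": {"family": "mlp", "hidden_dim": 256, "num_layers": 4, "dropout": 0.1},
    "optimizer": {"name": "AdamW", "learning_rate": 3e-4, "weight_decay": 0.01},
    "training": {"batch_size": 128, "epochs": 20, "mixed_precision": True},
}

config = Config(
    model=ModelConfig(**raw["model"]),  # a misspelled key raises TypeError here
    optimizer=OptimizerConfig(**raw["optimizer"]),
    training=TrainingConfig(**raw["training"]),
)
```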
Experiment Tracking
AutoML produces many trials. Each trial should be logged completely.
At minimum, log:
| Item | Purpose |
|---|---|
| Full configuration | Reproduce the run |
| Validation metrics | Compare candidates |
| Training metrics | Diagnose learning behavior |
| Random seed | Reproduce stochastic choices |
| Code version | Match implementation |
| Dataset version | Avoid data drift |
| Hardware information | Interpret speed and memory |
| Failure reason | Debug invalid regions |
| Checkpoint path | Reload top models |
Without tracking, AutoML results become difficult to trust. The best trial may be impossible to reproduce.
A simple log record:
```python
record = {
    "trial_id": trial_id,
    "config": config,
    "val_accuracy": val_accuracy,
    "val_loss": val_loss,
    "seed": seed,
    "status": "completed",
}
```

For production work, these records are usually stored in a database or experiment tracking system.
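For small local studies, an append-only JSON Lines file is often enough. A sketch, with an arbitrary path:

```python
import json

def log_trial(record, path="trials.jsonl"):
    # Append one JSON record per line; completed trials are never
    # rewritten, so the log survives crashes mid-search.
    # Assumes the record (including the config) is JSON-serializable.
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```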
Constraints and Deployment Objectives
AutoML often optimizes under constraints.
For example:

$$\max_{c \in \mathcal{C}} \, \text{accuracy}(c) \quad \text{subject to} \quad \text{latency}(c) \le L_{\max}, \quad \text{memory}(c) \le M_{\max}$$
The system can reject configurations that violate hard constraints or include penalties in the objective.
A scalar objective may be:

$$J(c) = \text{accuracy}(c) - \lambda_1 \max\bigl(0, \text{latency}(c) - L_{\max}\bigr) - \lambda_2 \max\bigl(0, \text{memory}(c) - M_{\max}\bigr)$$
Hard constraints are easier to interpret. Soft penalties are easier to optimize. In deployment-sensitive deep learning, both are common.
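A sketch combining both styles, with assumed latency and memory measurements and illustrative budget values:

```python
def score_trial(accuracy, latency_ms, memory_gb,
                latency_budget=50.0, memory_budget=8.0,
                penalty_weight=0.01):
    # Hard constraint: reject configurations far outside the memory budget.
    if memory_gb > memory_budget:
        return float("-inf")
    # Soft penalty: trade accuracy against latency overruns.
    overrun = max(0.0, latency_ms - latency_budget)
    return accuracy - penalty_weight * overrun
```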
Ensembling and Model Selection
Some AutoML systems produce not one model but a set of models. An ensemble combines predictions from multiple trained models.
For classification, an ensemble of $M$ models may average probabilities:

$$\hat{p}(y \mid x) = \frac{1}{M} \sum_{m=1}^{M} p_m(y \mid x)$$
Ensembles often improve accuracy and calibration, but they increase inference cost. They are useful when validation performance matters more than latency.
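A minimal probability-averaging sketch in PyTorch, assuming the models share input and output shapes:

```python
import torch

def ensemble_predict(models, x):
    # Average class probabilities across ensemble members.
    probs = [torch.softmax(model(x), dim=-1) for model in models]
    return torch.stack(probs).mean(dim=0)
```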
For deployment, the selected model may be the best single model under a cost budget. For competitions, the selected model may be an ensemble of top trials.
AutoML for Foundation Models
For foundation models, full AutoML over architecture and pretraining is usually too expensive. Instead, automation is applied to smaller decisions.
Common targets include:
| Area | Automated choices |
|---|---|
| Fine-tuning | Learning rate, LoRA rank, batch size, epochs |
| Retrieval | chunk size, embedding model, top-k, reranker |
| Prompting | prompt templates, demonstrations, decoding |
| Alignment | reward model settings, preference data mixture |
| Inference | temperature, top-p, max tokens, caching |
| Compression | quantization level, distillation target |
For example, retrieval-augmented generation may search over:
```python
rag_space = {
    "chunk_size": [256, 512, 1024],
    "chunk_overlap": [0, 64, 128],
    "top_k": [3, 5, 10, 20],
    "rerank": [True, False],
}
```

The objective may combine answer accuracy, citation quality, latency, and cost.
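Because this space has only 72 combinations, it can be swept exhaustively. In the sketch below, evaluate_rag is a hypothetical project-specific scorer:

```python
import itertools

keys = list(rag_space)
best_score, best_candidate = float("-inf"), None
for values in itertools.product(*rag_space.values()):
    candidate = dict(zip(keys, values))
    score = evaluate_rag(candidate)  # hypothetical combined-objective scorer
    if score > best_score:
        best_score, best_candidate = score, candidate
```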
Human Judgment in AutoML
AutoML does not make model development automatic in the strong sense. It automates search within a space chosen by humans.
Human judgment remains necessary for:
| Decision | Why it matters |
|---|---|
| Problem formulation | Defines what should be optimized |
| Data quality | Often dominates model performance |
| Search space design | Determines what can be found |
| Metric choice | Controls what the system prefers |
| Constraint design | Reflects deployment reality |
| Result interpretation | Detects spurious wins |
| Final validation | Prevents test-set leakage |
A system can optimize the wrong objective very efficiently. This is a common failure mode.
Common Failure Modes
AutoML can fail in predictable ways.
First, it can overfit the validation set. Testing hundreds or thousands of configurations increases the chance that one of them looks good through noise alone.
Second, it can exploit metric weaknesses. If the metric is incomplete, the search may find configurations that improve the metric while harming real performance.
Third, it can use unfair comparisons. Some trials may receive more compute, better preprocessing, or different data.
Fourth, it can ignore deployment constraints. A model with high validation accuracy may be too large, slow, or expensive.
Fifth, it can produce irreproducible results when random seeds, code versions, or dataset versions are missing.
Practical Guidelines
Start with a strong manual baseline. AutoML should improve on a known reference, not replace basic modeling discipline.
Keep the first search space small. Search only high-impact choices such as learning rate, weight decay, batch size, dropout, and one or two architecture dimensions.
Use multi-fidelity methods when full training is expensive. Promote promising configurations and stop weak ones early.
Retrain the best configurations from scratch. This checks whether the result is stable.
Reserve a final test set. Use it only once the search process is complete.
Log everything needed to reproduce the result.
Summary
Automated machine learning searches over training pipelines, model choices, hyperparameters, and deployment settings. In deep learning, it combines hyperparameter optimization, neural architecture search, pruning, multi-fidelity evaluation, experiment tracking, and final model selection.
AutoML is useful when the search space is well designed, metrics are reliable, and experiments are logged carefully. It is most effective as a disciplined engineering system around model development, rather than a substitute for understanding data, objectives, and deployment constraints.