Automated Machine Learning

Automated machine learning, or AutoML, refers to systems that automate parts of the model development process. Hyperparameter optimization and neural architecture search are both parts of AutoML, but AutoML is broader. It may include data preprocessing, feature construction, model selection, training recipe selection, ensembling, compression, deployment, and monitoring.

In deep learning, AutoML usually means a system that searches over training configurations and model structures under a compute budget. The goal is not to remove human judgment. The goal is to make repeated experimental decisions systematic, logged, and reproducible.

What AutoML Optimizes

A deep learning project contains many choices. Some are numerical. Some are categorical. Some are structural. Some are operational.

Area | Examples
Data processing | Normalization, augmentation, tokenization, filtering
Model architecture | Depth, width, block type, attention type
Optimization | Optimizer, learning rate, weight decay, schedule
Regularization | Dropout, label smoothing, stochastic depth
Training system | Batch size, precision, gradient accumulation
Evaluation | Metric, validation split, threshold
Deployment | Quantization, pruning, latency target

An AutoML system defines a search space over these choices and evaluates candidate pipelines.

A complete configuration might include:

config = {
    "model": {
        "family": "transformer",
        "num_layers": 12,
        "hidden_dim": 768,
        "num_heads": 12,
        "mlp_ratio": 4,
    },
    "optimizer": {
        "name": "AdamW",
        "learning_rate": 3e-4,
        "weight_decay": 0.01,
    },
    "training": {
        "batch_size": 128,
        "epochs": 20,
        "mixed_precision": True,
    },
    "regularization": {
        "dropout": 0.1,
        "label_smoothing": 0.05,
    },
}

The configuration is then passed to a training pipeline.
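
As a sketch, the search space behind such a configuration can be written as a dictionary of candidate values, with each trial drawing one value per choice. The space and the sample_config helper below are hypothetical illustrations, not part of any specific library:

import random

search_space = {
    ("model", "num_layers"): [6, 12, 24],
    ("model", "hidden_dim"): [256, 512, 768],
    ("optimizer", "learning_rate"): [1e-4, 3e-4, 1e-3],
    ("training", "batch_size"): [64, 128, 256],
    ("regularization", "dropout"): [0.0, 0.1, 0.2],
}

def sample_config(space):
    # Draw one value per choice to build a complete nested configuration.
    config = {}
    for (section, key), values in space.items():
        config.setdefault(section, {})[key] = random.choice(values)
    return config

config = sample_config(search_space)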

Pipeline Search

AutoML searches over pipelines, not only isolated hyperparameters.

A pipeline is a sequence of decisions:

\text{data} \rightarrow \text{preprocessing} \rightarrow \text{model} \rightarrow \text{training} \rightarrow \text{evaluation} \rightarrow \text{deployment}.

Each stage may have its own search space.

For image classification, the pipeline may include:

Stage | Search choices
Input | Image resolution
Augmentation | Crop scale, color jitter, mixup, cutmix
Model | CNN, ViT, hybrid model
Optimizer | SGD, AdamW
Schedule | Cosine decay, step decay, warmup
Regularization | Dropout, stochastic depth
Inference | Batch size, quantization

For NLP, the pipeline may include:

Stage | Search choices
Tokenization | Vocabulary size, subword method
Sequence handling | Max length, truncation, packing
Model | Encoder, decoder, encoder-decoder
Objective | Masked LM, causal LM, contrastive
Fine-tuning | Full fine-tune, LoRA, adapters
Retrieval | Chunk size, top-k documents
Inference | Decoding strategy, temperature

This broader view distinguishes AutoML from ordinary hyperparameter search.

AutoML as Nested Optimization

AutoML can be described as nested optimization.

The inner optimization trains model parameters:

\theta^\ast(c) = \arg\min_{\theta} \mathcal{L}_{\text{train}}(\theta; c),

where c is a full pipeline configuration.

The outer optimization selects the configuration:

c^\ast = \arg\min_{c \in \mathcal{C}} \mathcal{L}_{\text{val}}(\theta^\ast(c); c).

Here \mathcal{C} is the AutoML search space.

The outer loop is expensive because every candidate configuration may require training. This is why AutoML systems rely on early stopping, pruning, surrogate models, low-fidelity evaluations, and parallel execution.
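
A minimal sketch of the nested loop, assuming hypothetical train, validate, and candidate_configs that implement the inner optimization and the validation objective:

best_config, best_val_loss = None, float("inf")

for config in candidate_configs:
    theta = train(config)               # inner loop: fit parameters theta*(c)
    val_loss = validate(theta, config)  # outer objective: L_val(theta*(c); c)
    if val_loss < best_val_loss:
        best_config, best_val_loss = config, val_loss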

Search Strategies

AutoML systems combine search strategies from earlier sections.

Strategy | Role in AutoML
Grid search | Small fixed spaces
Random search | Strong baseline for broad search
Bayesian optimization | Sample-efficient search
Population-based training | Dynamic hyperparameter schedules
Neural architecture search | Structural model choices
Multi-fidelity search | Cheap approximations before full training
Evolutionary algorithms | Irregular and discrete spaces

A mature AutoML system may use several strategies at once. For example, it may use random search for initial trials, Bayesian optimization after enough observations, pruning for poor configurations, and final retraining for the top candidates.
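
The sketch below illustrates that staging with a simple stand-in: random sampling for warmup trials, then local perturbation of the best configuration found so far in place of a full Bayesian surrogate. It reuses the hypothetical sample_config sampler sketched earlier and assumes a run_trial helper like the one defined in the PyTorch section below; perturb is also hypothetical.

def staged_search(space, n_trials, n_warmup=16):
    history = []  # (score, config) pairs
    for t in range(n_trials):
        if t < n_warmup:
            config = sample_config(space)  # broad random exploration first
        else:
            # Exploit afterwards: perturb the best config found so far.
            # perturb() is a local-search stand-in for a Bayesian surrogate.
            best = max(history, key=lambda p: p[0])[1]
            config = perturb(best, space)
        score = run_trial(config)
        history.append((score, config))
    return max(history, key=lambda p: p[0])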

Multi-Fidelity Optimization

Full training is expensive. Multi-fidelity optimization evaluates many configurations cheaply, then spends more compute on promising ones.

Lower-fidelity evaluations include:

Lower-fidelity method | Approximation
Fewer epochs | Train briefly
Smaller dataset | Train on a subset
Lower resolution | Use smaller images
Shorter sequence length | Use fewer tokens
Smaller model | Use a proxy architecture
Fewer diffusion steps | Approximate generation quality

The assumption is that cheap evaluations are correlated with full training results.
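
That assumption can be checked directly. A hedged sketch: evaluate a small sample of configurations at both fidelities and compute the rank correlation between the two score lists, using the train_and_evaluate helper from the loop below (spearmanr is SciPy's rank-correlation function):

from scipy.stats import spearmanr

configs = [sample_config(search_space) for _ in range(8)]
cheap_scores = [train_and_evaluate(c, epochs=2) for c in configs]
full_scores = [train_and_evaluate(c, epochs=40) for c in configs]

# High rank correlation suggests the cheap proxy orders configs
# roughly the way full training would.
rho, _ = spearmanr(cheap_scores, full_scores)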

Successive halving is a common multi-fidelity strategy. It begins with many configurations trained for a small budget. It keeps the best fraction and increases their budgets.

Hyperband extends this idea by trying multiple budget schedules.

A simplified successive halving loop:

configs = sample_many(search_space, n=64)
budget = 1

while len(configs) > 1:
    results = []

    for config in configs:
        score = train_and_evaluate(config, epochs=budget)
        results.append((score, config))

    # Sort by score only; configs are dicts and are not comparable on ties.
    results.sort(key=lambda pair: pair[0], reverse=True)
    configs = [config for score, config in results[: len(results) // 2]]
    budget *= 2

This avoids fully training configurations that perform poorly early.

AutoML with PyTorch

A practical PyTorch AutoML system usually has four layers.

Layer | Responsibility
Configuration schema | Defines all tunable choices
Model factory | Builds models from configs
Training engine | Runs training and evaluation
Search controller | Chooses configs and records results

A simple model factory:

def build_model(config):
    family = config["model"]["family"]

    if family == "mlp":
        return MLP(
            input_dim=config["data"]["input_dim"],
            hidden_dim=config["model"]["hidden_dim"],
            output_dim=config["data"]["num_classes"],
            num_layers=config["model"]["num_layers"],
            dropout=config["regularization"]["dropout"],
        )

    if family == "cnn":
        return CNN(
            num_classes=config["data"]["num_classes"],
            channels=config["model"]["channels"],
            blocks=config["model"]["blocks"],
        )

    raise ValueError(f"unknown model family: {family}")

A training function:

def run_trial(config):
    model = build_model(config)
    train_loader, val_loader = build_loaders(config)

    optimizer = build_optimizer(model, config)
    scheduler = build_scheduler(optimizer, config)

    for epoch in range(config["training"]["epochs"]):
        train_one_epoch(model, train_loader, optimizer)
        scheduler.step()

    return evaluate(model, val_loader)

The search controller can call run_trial(config) using random search, Bayesian optimization, or another strategy.
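
A minimal random-search controller over the pieces above, reusing the hypothetical sample_config sampler sketched earlier:

def random_search(space, n_trials):
    trials = []
    for trial_id in range(n_trials):
        config = sample_config(space)
        score = run_trial(config)  # assumes higher is better, e.g. accuracy
        trials.append({"trial_id": trial_id, "config": config, "score": score})
    # Return the trial with the best validation metric.
    return max(trials, key=lambda t: t["score"])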

Configuration Schemas

AutoML systems need explicit configuration schemas. The schema defines valid fields, types, defaults, and constraints.

A loose dictionary is easy to start with, but a schema becomes important as experiments grow.

Example schema style:

from dataclasses import dataclass

@dataclass
class ModelConfig:
    family: str
    hidden_dim: int
    num_layers: int
    dropout: float

@dataclass
class OptimizerConfig:
    name: str
    learning_rate: float
    weight_decay: float

@dataclass
class TrainingConfig:
    batch_size: int
    epochs: int
    mixed_precision: bool

@dataclass
class Config:
    model: ModelConfig
    optimizer: OptimizerConfig
    training: TrainingConfig

A schema prevents accidental errors such as misspelled keys, missing fields, or invalid types. This matters because AutoML may run hundreds of experiments without human inspection.
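
Validation can go one step further than types. A hedged extension of the schema above: an explicit check that rejects out-of-range values before a trial is launched, so invalid configurations fail fast instead of producing silently broken runs.

def validate_config(config: Config) -> None:
    # Reject values that would silently produce broken trials.
    if not 0.0 <= config.model.dropout <= 1.0:
        raise ValueError(f"dropout out of range: {config.model.dropout}")
    if config.optimizer.learning_rate <= 0:
        raise ValueError("learning_rate must be positive")
    if config.training.batch_size < 1:
        raise ValueError("batch_size must be at least 1")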

Experiment Tracking

AutoML produces many trials. Each trial should be logged completely.

At minimum, log:

Item | Purpose
Full configuration | Reproduce the run
Validation metrics | Compare candidates
Training metrics | Diagnose learning behavior
Random seed | Reproduce stochastic choices
Code version | Match implementation
Dataset version | Avoid data drift
Hardware information | Interpret speed and memory
Failure reason | Debug invalid regions
Checkpoint path | Reload top models

Without tracking, AutoML results become difficult to trust. The best trial may be impossible to reproduce.

A simple log record:

record = {
    "trial_id": trial_id,
    "config": config,
    "val_accuracy": val_accuracy,
    "val_loss": val_loss,
    "seed": seed,
    "status": "completed",
}

For production work, these records are usually stored in a database or experiment tracking system.
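
For small projects, a minimal sketch is to append each record to a JSON Lines file, one record per line, which keeps the log append-only and easy to parse (the config must be JSON-serializable):

import json

def log_record(record, path="trials.jsonl"):
    # One JSON object per line; appending preserves earlier trials
    # even if a run crashes mid-search.
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")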

Constraints and Deployment Objectives

AutoML often optimizes under constraints.

For example:

\text{maximize accuracy}

subject to

\text{latency} \le 20\ \text{ms}, \qquad \text{memory} \le 512\ \text{MB}.

The system can reject configurations that violate hard constraints or include penalties in the objective.

A scalar objective may be:

J(c) = \text{accuracy}(c) - \alpha \cdot \max(0, \text{latency}(c) - 20) - \beta \cdot \max(0, \text{memory}(c) - 512).

Hard constraints are easier to interpret. Soft penalties are easier to optimize. In deployment-sensitive deep learning, both are common.
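
The scalar objective above translates directly into code. A sketch, assuming measured accuracy, latency in milliseconds, and memory in MB for a trained candidate; the penalty weights alpha and beta are illustrative:

def penalized_objective(accuracy, latency_ms, memory_mb, alpha=0.01, beta=0.001):
    # Penalties apply only beyond the budget; within budget they are zero.
    latency_penalty = alpha * max(0.0, latency_ms - 20.0)
    memory_penalty = beta * max(0.0, memory_mb - 512.0)
    return accuracy - latency_penalty - memory_penalty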

Ensembling and Model Selection

Some AutoML systems produce not one model but a set of models. An ensemble combines predictions from multiple trained models.

For classification, an ensemble may average probabilities:

p(y \mid x) = \frac{1}{M} \sum_{m=1}^{M} p_m(y \mid x).

Ensembles often improve accuracy and calibration, but they increase inference cost. They are useful when validation performance matters more than latency.

For deployment, the selected model may be the best single model under a cost budget. For competitions, the selected model may be an ensemble of top trials.
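
A minimal PyTorch sketch of probability averaging over trained ensemble members:

import torch

@torch.no_grad()
def ensemble_predict(models, x):
    # Average class probabilities across ensemble members.
    probs = torch.stack([torch.softmax(m(x), dim=-1) for m in models])
    return probs.mean(dim=0)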

AutoML for Foundation Models

For foundation models, full AutoML over architecture and pretraining is usually too expensive. Instead, automation is applied to smaller decisions.

Common targets include:

Area | Automated choices
Fine-tuning | Learning rate, LoRA rank, batch size, epochs
Retrieval | Chunk size, embedding model, top-k, reranker
Prompting | Prompt templates, demonstrations, decoding
Alignment | Reward model settings, preference data mixture
Inference | Temperature, top-p, max tokens, caching
Compression | Quantization level, distillation target

For example, retrieval-augmented generation may search over:

rag_space = {
    "chunk_size": [256, 512, 1024],
    "chunk_overlap": [0, 64, 128],
    "top_k": [3, 5, 10, 20],
    "rerank": [True, False],
}

The objective may combine answer accuracy, citation quality, latency, and cost.
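
Because this space is small (3 × 3 × 4 × 2 = 72 combinations), exhaustive grid search is feasible. A sketch, assuming a hypothetical evaluate_rag helper that reduces those criteria to one scalar score:

from itertools import product

keys = list(rag_space)
results = []
for values in product(*rag_space.values()):
    candidate = dict(zip(keys, values))
    results.append((evaluate_rag(candidate), candidate))

best_score, best_candidate = max(results, key=lambda pair: pair[0])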

Human Judgment in AutoML

AutoML does not make model development automatic in the strong sense. It automates search within a space chosen by humans.

Human judgment remains necessary for:

Decision | Why it matters
Problem formulation | Defines what should be optimized
Data quality | Often dominates model performance
Search space design | Determines what can be found
Metric choice | Controls what the system prefers
Constraint design | Reflects deployment reality
Result interpretation | Detects spurious wins
Final validation | Prevents test-set leakage

A system can optimize the wrong objective very efficiently. This is a common failure mode.

Common Failure Modes

AutoML can fail in predictable ways.

First, it can overfit the validation set. Testing hundreds or thousands of configurations increases the chance that one looks good purely due to noise.

Second, it can exploit metric weaknesses. If the metric is incomplete, the search may find configurations that improve the metric while harming real performance.

Third, it can use unfair comparisons. Some trials may receive more compute, better preprocessing, or different data.

Fourth, it can ignore deployment constraints. A model with high validation accuracy may be too large, slow, or expensive.

Fifth, it can produce irreproducible results when random seeds, code versions, or dataset versions are missing.

Practical Guidelines

Start with a strong manual baseline. AutoML should improve on a known reference, not replace basic modeling discipline.

Keep the first search space small. Search only high-impact choices such as learning rate, weight decay, batch size, dropout, and one or two architecture dimensions.

Use multi-fidelity methods when full training is expensive. Promote promising configurations and stop weak ones early.

Retrain the best configurations from scratch. This checks whether the result is stable.

Reserve a final test set. Use it only once the search process is complete.

Log everything needed to reproduce the result.

Summary

Automated machine learning searches over training pipelines, model choices, hyperparameters, and deployment settings. In deep learning, it combines hyperparameter optimization, neural architecture search, pruning, multi-fidelity evaluation, experiment tracking, and final model selection.

AutoML is useful when the search space is well designed, metrics are reliable, and experiments are logged carefully. It is most effective as a disciplined engineering system around model development, rather than a substitute for understanding data, objectives, and deployment constraints.