Training produces model parameters. Inference uses those parameters to generate predictions.
Inference optimization studies how to make model execution faster, cheaper, smaller, and more memory-efficient while preserving acceptable output quality.
For small models, naive inference may be sufficient. For foundation models, inference often becomes more expensive than training because deployed systems may serve millions or billions of requests.
A language model trained once may perform inference continuously for years.
Inference systems therefore optimize:
- latency
- throughput
- memory usage
- energy efficiency
- hardware utilization
- serving cost
## Training Versus Inference
Training and inference have different computational characteristics.
| Property | Training | Inference |
|---|---|---|
| Forward pass | Yes | Yes |
| Backward pass | Yes | No |
| Gradient storage | Required | Not needed |
| Optimizer state | Required | Not needed |
| Numerical precision | Often mixed precision | Often lower precision |
| Latency sensitivity | Usually low | Often critical |
| Throughput focus | Tokens/images per second | Requests per second |
Inference is usually memory-bandwidth constrained rather than compute constrained, especially for large transformers.
## Inference Workloads
Inference workloads vary substantially.
| Workload | Example |
|---|---|
| Batch inference | Offline embedding generation |
| Real-time inference | Chat applications |
| Streaming generation | Token-by-token LLM decoding |
| Edge inference | Mobile or embedded devices |
| Interactive multimodal systems | Vision-language assistants |
Different workloads require different optimizations.
For example:
| Goal | Important metric |
|---|---|
| Real-time chatbot | Low latency |
| Embedding pipeline | High throughput |
| Mobile model | Low memory and energy |
| Datacenter serving | Cost efficiency |
## Autoregressive Decoding
Large language models usually generate tokens autoregressively.
Given previous tokens $x_1, \dots, x_{t-1}$, the model predicts a distribution over the next token:

$$p(x_t \mid x_1, \dots, x_{t-1})$$

Then the chosen token is appended and the process repeats.
This sequential dependency limits parallelism because token $x_t$ must be generated before predicting $x_{t+1}$.
Training parallelizes across sequence positions. Inference cannot fully do this during generation.
Autoregressive decoding is therefore a major inference bottleneck.
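The sequential loop can be sketched in a few lines. The `next_token` function below is a toy stand-in for a real model's forward pass, not an actual network:

```python
# Toy sketch of greedy autoregressive decoding.
def next_token(prefix):
    # Dummy "model": the next token is the sum of the prefix modulo 10.
    return sum(prefix) % 10

def generate(prompt, num_new_tokens):
    tokens = list(prompt)
    for _ in range(num_new_tokens):
        # Each step depends on all previously generated tokens,
        # so the loop cannot be parallelized across steps.
        tokens.append(next_token(tokens))
    return tokens

print(generate([1, 2, 3], 4))  # → [1, 2, 3, 6, 2, 4, 8]
```

However cheap each step is, the steps must run one after another; this is the bottleneck the following sections attack.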
## KV Cache
Recomputing attention keys and values from scratch at every step of transformer inference would be extremely inefficient.
Suppose a sequence has length $n$. Naively recomputing all attention states at every generation step would repeatedly reprocess earlier tokens.
Instead, inference systems use a key-value cache, usually called a KV cache.
For each transformer layer:
| Stored tensor | Meaning |
|---|---|
| Keys | Attention key projections |
| Values | Attention value projections |
At generation step $t$, only the newest token requires new computation. Earlier keys and values are reused.
Without KV caching, attention cost at step $t$ grows roughly as $O(t^2)$ per generated token, because the entire prefix is reprocessed.
With caching, only new attention interactions are computed.
KV caching is essential for efficient transformer serving.
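A minimal single-head sketch of the idea, in pure Python for clarity (a real implementation uses batched tensor kernels): each step appends one key/value pair to the cache and attends over it, rather than recomputing earlier projections.

```python
import math

# Illustrative single-head attention step with a KV cache.
class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        # Cache the newest key/value; earlier entries are reused, so each
        # step only computes scores between the new query and the cache.
        self.keys.append(k)
        self.values.append(v)
        scores = [sum(qi * ki for qi, ki in zip(q, key)) for key in self.keys]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]  # stable softmax
        z = sum(weights)
        dim = len(v)
        return [sum(w * val[d] for w, val in zip(weights, self.values)) / z
                for d in range(dim)]

cache = KVCache()
out1 = cache.step([1.0, 0.0], [1.0, 0.0], [2.0, 0.0])
out2 = cache.step([0.0, 1.0], [0.0, 1.0], [0.0, 3.0])
print(out1)  # → [2.0, 0.0] (only one entry to attend over)
```

Note that `step` does work proportional to the cache length, not to the square of it: the old keys and values are read, never recomputed.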
## Memory Cost of KV Caches
KV caches consume substantial memory.
Approximate KV cache memory per sequence:

$$\text{KV memory} \approx 2 \times L \times n \times h \times d \times (\text{bytes per element})$$

where:

| Symbol | Meaning |
|---|---|
| $L$ | Number of layers |
| $n$ | Sequence length |
| $h$ | Attention heads |
| $d$ | Head dimension |

The leading factor of 2 accounts for storing both keys and values.
Long-context inference therefore becomes memory-intensive.
Example pressures include:
- many concurrent users
- long conversations
- retrieval-augmented prompts
- large batch serving
Modern inference systems often spend more memory on KV caches than on model parameters.
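A back-of-envelope calculation makes the pressure concrete. The configuration below (2 × layers × tokens × heads × head dimension × bytes, counting both keys and values) uses illustrative numbers loosely in the range of a 7B-class model, not any specific model's published configuration:

```python
# Rough KV cache size: keys and values for every layer, token, and head.
def kv_cache_bytes(layers, seq_len, heads, head_dim, bytes_per_elem=2, batch=1):
    # Factor of 2 accounts for storing both keys and values.
    return 2 * layers * seq_len * heads * head_dim * bytes_per_elem * batch

size = kv_cache_bytes(layers=32, seq_len=4096, heads=32, head_dim=128,
                      bytes_per_elem=2)  # fp16 elements
print(f"{size / 2**30:.1f} GiB per 4096-token sequence")  # → 2.0 GiB
```

At 2 GiB per long sequence, a few dozen concurrent users can consume more accelerator memory than the weights themselves.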
## Quantization
Quantization reduces numerical precision to lower memory and compute cost.
Instead of storing parameters in fp16 or fp32, systems may use:
| Format | Bits |
|---|---|
| fp32 | 32 |
| fp16 | 16 |
| bf16 | 16 |
| int8 | 8 |
| int4 | 4 |
A quantized parameter approximation:

$$w \approx s \cdot (q - z)$$

where:

| Symbol | Meaning |
|---|---|
| $q$ | Quantized integer |
| $s$ | Scale |
| $z$ | Zero point |
Quantization reduces:
- memory footprint
- bandwidth usage
- inference latency
A 4-bit model may require roughly one-quarter the parameter memory of a 16-bit model.
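A minimal sketch of asymmetric affine quantization of the form $w \approx s \cdot (q - z)$. Real systems quantize per-channel or per-group and handle rounding and clipping more carefully; this illustrates only the core mapping:

```python
# Map floats into the int range [0, 2^bits - 1] with a scale and zero point.
def quantize(values, bits=8):
    lo, hi = min(values), max(values)
    qmax = 2**bits - 1
    scale = (hi - lo) / qmax if hi != lo else 1.0
    zero_point = round(-lo / scale)
    q = [max(0, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    # Reconstruct the approximation w ≈ s * (q - z).
    return [scale * (qi - zero_point) for qi in q]

weights = [-1.0, -0.5, 0.0, 0.75, 1.0]
q, s, z = quantize(weights)
approx = dequantize(q, s, z)
print([round(a, 3) for a in approx])
```

Each reconstructed value lands within one quantization step ($s$) of the original, which is the best a uniform grid can guarantee.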
## Quantization Tradeoffs
Quantization introduces approximation error.
Tradeoffs include:
| Advantage | Cost |
|---|---|
| Lower memory | Lower numerical precision |
| Faster inference | Possible accuracy degradation |
| Larger batch serving | More implementation complexity |
Some layers are more sensitive than others.
Common approaches include:
| Method | Idea |
|---|---|
| Post-training quantization | Quantize after training |
| Quantization-aware training | Simulate quantization during training |
| Mixed-precision quantization | Different layers use different precision |
Modern language models can often tolerate surprisingly aggressive quantization.
## Weight-Only Quantization
In many transformer systems, weights dominate memory usage.
Weight-only quantization stores weights in lower precision while keeping activations in higher precision.
Example:
| Tensor type | Precision |
|---|---|
| Weights | int4 |
| Activations | fp16 |
| KV cache | fp16 |
This approach is attractive because it simplifies implementation while greatly reducing parameter memory.
## Activation Quantization
Activation quantization reduces precision of intermediate tensors during inference.
This further reduces memory and bandwidth, but activations are often more sensitive than weights.
Challenges include:
- outlier activations
- varying tensor distributions
- dynamic ranges changing during inference
Activation quantization is especially difficult for transformers with long contexts.
## Operator Fusion
Modern neural networks contain many small operations:
- matrix multiplication
- bias addition
- normalization
- activation functions
Naively executing each operation separately creates overhead from:
- kernel launches
- memory reads and writes
- synchronization
Operator fusion combines multiple operations into one kernel.
Example: the pattern

$$y = \sigma(Wx + b)$$

(matrix multiply, bias add, then activation $\sigma$) may be fused into one execution unit instead of separate:
- matrix multiplication
- bias addition
- activation
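The effect can be mimicked in plain Python, where each separate loop stands in for one kernel launch and each intermediate list for one materialized tensor. This is an analogy for what a compiler does, not real kernel fusion:

```python
# y = relu(W @ x + b), computed two ways.
def unfused(W, x, b):
    t1 = [sum(wij * xj for wij, xj in zip(row, x)) for row in W]  # "matmul kernel"
    t2 = [ti + bi for ti, bi in zip(t1, b)]                       # "bias kernel"
    return [max(0.0, ti) for ti in t2]                            # "relu kernel"

def fused(W, x, b):
    # One pass over the data: no intermediate tensors are materialized.
    return [max(0.0, sum(wij * xj for wij, xj in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

W = [[1.0, -1.0], [0.5, 0.5]]
x = [2.0, 1.0]
b = [-0.5, 0.0]
print(unfused(W, x, b), fused(W, x, b))  # identical results
```

The fused version writes each output element once instead of three times, which is exactly the memory-traffic saving fusion provides on accelerators.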
Fusion improves:
| Benefit | Reason |
|---|---|
| Throughput | Less overhead |
| Memory efficiency | Fewer intermediate tensors |
| Cache locality | Better reuse |
Inference compilers rely heavily on fusion.
## Compilation and Graph Optimization
Eager execution is flexible but may introduce overhead.
Inference systems often convert models into optimized computation graphs.
Common graph optimizations include:
| Optimization | Purpose |
|---|---|
| Operator fusion | Reduce overhead |
| Constant folding | Precompute constants |
| Dead code elimination | Remove unused operations |
| Kernel selection | Choose optimized implementations |
| Layout optimization | Improve memory access |
Common inference runtimes include:
| Runtime | Use |
|---|---|
| TorchScript | PyTorch graph execution |
| TensorRT | NVIDIA inference optimization |
| ONNX Runtime | Portable graph execution |
| TVM | Compiler optimization |
| XLA | Accelerated graph compilation |
## Batch Inference
Inference systems often combine requests into batches.
Instead of processing one example $y = f(x)$, the system processes a batch $Y = f(X)$, where $X$ stacks many inputs $x_1, \dots, x_B$.
Batching improves hardware utilization because GPUs are optimized for large tensor operations.
Advantages:
| Benefit | Reason |
|---|---|
| Higher throughput | Better GPU occupancy |
| Better amortization | Shared kernel overhead |
| Improved efficiency | Larger matrix multiplications |
Disadvantages:
| Problem | Explanation |
|---|---|
| Higher latency | Requests wait for batching |
| Uneven sequence lengths | Padding inefficiency |
| Scheduling complexity | Dynamic request arrival |
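The padding inefficiency noted above can be made concrete with a toy example (real servers measure this per batch):

```python
# Batching uneven sequences: every sequence is padded to the longest,
# so short requests waste compute and memory on pad slots.
def pad_batch(seqs, pad=0):
    max_len = max(len(s) for s in seqs)
    padded = [s + [pad] * (max_len - len(s)) for s in seqs]
    waste = sum(max_len - len(s) for s in seqs) / (max_len * len(seqs))
    return padded, waste

batch, waste = pad_batch([[1, 2, 3, 4, 5, 6, 7, 8], [1, 2], [1, 2, 3]])
print(f"padded shape: {len(batch)} x {len(batch[0])}, wasted slots: {waste:.0%}")
```

Here roughly 46% of the batch tensor is padding; grouping requests by length or using continuous batching recovers most of that waste.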
Serving systems must balance throughput against latency.
## Continuous Batching
Traditional batching waits for a full batch before execution.
Continuous batching dynamically inserts and removes requests during generation.
This is especially important for LLM serving because different requests finish at different times.
Example:
| Request | Length |
|---|---|
| A | 20 tokens |
| B | 300 tokens |
| C | 50 tokens |
Without continuous batching, short requests may wait behind long requests.
Continuous batching keeps the GPU busy while minimizing wasted slots.
Modern LLM serving systems heavily rely on this technique.
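A toy simulation of the scheduling idea, using the three requests above. The slot count and one-token-per-step pacing are simplifying assumptions:

```python
from collections import deque

# Continuous batching: finished requests free their slot immediately,
# and waiting requests join the batch mid-flight.
def serve(requests, max_batch):
    waiting = deque(requests)              # (request_id, tokens_remaining)
    active, finish_order, step = [], [], 0
    while waiting or active:
        # Refill free slots every step instead of waiting for a full batch.
        while waiting and len(active) < max_batch:
            active.append(list(waiting.popleft()))
        step += 1
        for req in active:
            req[1] -= 1                    # generate one token per request
        for req in [r for r in active if r[1] == 0]:
            active.remove(req)
            finish_order.append(req[0])
    return finish_order, step

order, steps = serve([("A", 20), ("B", 300), ("C", 50)], max_batch=2)
print(order, steps)  # → ['A', 'C', 'B'] 300
```

Request C enters as soon as A finishes at step 20, so the total run takes only as long as the longest request (300 steps) instead of serializing behind it.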
## Speculative Decoding
Autoregressive generation is sequential and slow.
Speculative decoding accelerates generation using a smaller draft model.
Workflow:
- small model predicts several candidate tokens
- large model verifies them
- accepted tokens are committed
- rejected tokens are recomputed
If the draft model predicts correctly often enough, throughput increases substantially.
The idea exploits the fact that verification can be cheaper than full sequential generation.
## Mixture-of-Experts Inference
Mixture-of-experts models activate only part of the network for each token.
Instead of computing all experts $E_1, \dots, E_N$ for every token, only a router-selected subset (typically the top-$k$ by gate score) is evaluated:

$$y = \sum_{i \in \text{top-}k} g_i(x)\, E_i(x)$$
This reduces computation per token while increasing total parameter count.
However, MoE inference introduces routing complexity:
- token dispatch
- load balancing
- expert communication
Efficient MoE serving is therefore partly a systems problem.
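A toy top-$k$ routing step makes the control flow concrete (the gate scores and expert functions here are arbitrary stand-ins):

```python
import math

# Dispatch one token to the k highest-scoring experts and mix their
# outputs with softmax-normalized gate weights.
def route(x, gate_scores, experts, k=2):
    top = sorted(range(len(experts)),
                 key=lambda i: gate_scores[i], reverse=True)[:k]
    # Softmax over the selected scores only.
    m = max(gate_scores[i] for i in top)
    w = {i: math.exp(gate_scores[i] - m) for i in top}
    z = sum(w.values())
    # Only the selected experts run; the rest are skipped entirely.
    return sum(w[i] / z * experts[i](x) for i in top)

experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
print(route(3.0, [0.1, 2.0, 2.0, -1.0], experts, k=2))  # → 7.5
```

With `k=2` of four experts, only half the expert compute runs per token; the systems difficulty arises when tokens in a batch choose different experts and must be regrouped and dispatched efficiently.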
## Attention Optimization
Attention becomes expensive for long contexts.
Standard attention has complexity $O(n^2)$, where $n$ is sequence length.
Long-context systems therefore use:
| Method | Idea |
|---|---|
| FlashAttention | Memory-efficient kernels |
| Sliding-window attention | Local attention regions |
| Sparse attention | Ignore many token pairs |
| Linear attention | Approximate softmax attention |
| Paged attention | Efficient KV cache management |
FlashAttention became especially important because it reduces memory traffic while preserving exact attention computation.
## Paged Attention
Large language model serving often suffers from KV cache fragmentation.
Paged attention organizes KV memory into blocks or pages, similar to virtual memory systems.
Benefits include:
| Benefit | Explanation |
|---|---|
| Better memory utilization | Reduced fragmentation |
| Flexible request scheduling | Easier dynamic batching |
| Efficient cache reuse | Improved serving throughput |
Paged attention systems became important for large-scale multi-user inference servers.
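The block-table idea can be sketched in a few lines. The block size and storage layout here are illustrative, not vLLM's actual implementation:

```python
# Toy paged KV storage: logical token positions map through a block
# table to fixed-size physical blocks, so a sequence can grow without
# needing contiguous memory.
BLOCK = 4

class PagedKV:
    def __init__(self):
        self.blocks = []          # physical block storage
        self.block_table = []     # logical block index -> physical block id

    def append(self, kv):
        # Allocate a fresh block only when the last one is full.
        if not self.block_table or len(self.blocks[self.block_table[-1]]) == BLOCK:
            self.blocks.append([])
            self.block_table.append(len(self.blocks) - 1)
        self.blocks[self.block_table[-1]].append(kv)

    def get(self, pos):
        # Translate a logical position into (block, offset).
        return self.blocks[self.block_table[pos // BLOCK]][pos % BLOCK]

kv = PagedKV()
for t in range(10):
    kv.append(("k%d" % t, "v%d" % t))
print(len(kv.block_table), kv.get(9))  # 10 tokens fill 3 blocks of 4
```

Because blocks are allocated on demand and freed independently, memory fragments far less than with one contiguous buffer per sequence, and blocks can even be shared between requests with a common prefix.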
## CPU, GPU, and Accelerator Inference
Inference hardware varies widely.
| Hardware | Strength |
|---|---|
| CPU | Flexibility, low-volume serving |
| GPU | High throughput |
| TPU | Large-scale serving |
| Edge accelerators | Low power |
| Mobile NPUs | On-device inference |
The optimal deployment depends on:
- latency requirements
- throughput requirements
- memory constraints
- deployment environment
- cost targets
A small quantized model may run efficiently on a phone. A frontier language model may require dozens of GPUs for interactive serving.
## Edge Inference
Edge inference runs models close to the user:
- phones
- browsers
- robots
- embedded devices
- autonomous systems
Advantages:
| Benefit | Reason |
|---|---|
| Lower latency | No network round-trip |
| Better privacy | Data stays local |
| Offline capability | No server required |
Constraints:
| Constraint | Problem |
|---|---|
| Limited memory | Small devices |
| Power consumption | Battery limits |
| Thermal limits | Sustained compute restrictions |
Edge systems therefore rely heavily on:
- quantization
- pruning
- compact architectures
- hardware-specific optimization
## Serving Systems
Modern inference serving systems coordinate:
- batching
- scheduling
- memory management
- caching
- load balancing
- request routing
Common serving frameworks include:
| Framework | Use |
|---|---|
| TorchServe | PyTorch deployment |
| Triton Inference Server | Multi-model serving |
| vLLM | Efficient LLM serving |
| TensorRT-LLM | NVIDIA optimized LLM inference |
| Ray Serve | Distributed serving |
Serving infrastructure often becomes a major engineering domain separate from model training.
## Cost as the Main Constraint
At scale, inference cost dominates deployment economics.
A model serving millions of users may process enormous token volumes daily.
Key cost drivers include:
| Driver | Impact |
|---|---|
| Parameter count | Memory and compute |
| Context length | Attention cost |
| Output length | Sequential decoding cost |
| Concurrent users | KV cache memory |
| Precision | Hardware efficiency |
Inference optimization therefore directly affects commercial viability.
## The Central Tradeoff
Inference optimization balances latency, throughput, memory, energy, output quality, and cost.
Improving one dimension often worsens another.
Examples:
| Optimization | Possible downside |
|---|---|
| Lower precision | Accuracy degradation |
| Larger batches | Higher latency |
| Longer context | Higher memory use |
| Smaller models | Reduced capability |
| Aggressive caching | More memory consumption |
Inference engineering is therefore largely an optimization problem under hardware and economic constraints.
## From Models to Systems
A trained neural network is only one part of a production AI system.
Real-world deployment also requires:
- runtime compilers
- schedulers
- distributed caches
- request routers
- memory managers
- observability systems
- autoscaling infrastructure
As model size increased, inference optimization evolved from a minor deployment detail into one of the central engineering problems in modern AI systems.