Training produces model parameters. Inference uses those parameters to generate predictions.
Inference optimization studies how to make model execution faster, cheaper, smaller, and more memory-efficient while preserving acceptable output quality.
For small models, naive inference may be sufficient. For foundation models, inference often becomes more expensive than training because deployed systems may serve millions or billions of requests.
A language model trained once may perform inference continuously for years.
Inference systems therefore optimize:
- latency
- throughput
- memory usage
- energy efficiency
- hardware utilization
- serving cost
## Training Versus Inference
Training and inference have different computational characteristics.
| Property | Training | Inference |
|---|---|---|
| Forward pass | Yes | Yes |
| Backward pass | Yes | No |
| Gradient storage | Required | Not needed |
| Optimizer state | Required | Not needed |
| Numerical precision | Often mixed precision | Often lower precision |
| Latency sensitivity | Usually low | Often critical |
| Throughput focus | Tokens/images per second | Requests per second |
Inference is usually memory-bandwidth constrained rather than compute constrained, especially for large transformers.
## Inference Workloads
Inference workloads vary substantially.
| Workload | Example |
|---|---|
| Batch inference | Offline embedding generation |
| Real-time inference | Chat applications |
| Streaming generation | Token-by-token LLM decoding |
| Edge inference | Mobile or embedded devices |
| Interactive multimodal systems | Vision-language assistants |
Different workloads require different optimizations.
For example:
| Goal | Important metric |
|---|---|
| Real-time chatbot | Low latency |
| Embedding pipeline | High throughput |
| Mobile model | Low memory and energy |
| Datacenter serving | Cost efficiency |
## Autoregressive Decoding
Large language models usually generate tokens autoregressively.
Given previous tokens $x_1, \dots, x_{t-1}$, the model predicts a distribution over the next token:

$$p(x_t \mid x_1, \dots, x_{t-1})$$

Then the chosen token is appended and the process repeats.
This sequential dependency limits parallelism because token $x_t$ must be generated before predicting $x_{t+1}$.
Training parallelizes across sequence positions. Inference cannot fully do this during generation.
Autoregressive decoding is therefore a major inference bottleneck.
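The sequential loop can be sketched in a few lines. The `next_token` function below is a toy stand-in for a real model's forward pass, not an actual network:

```python
# Toy sketch of greedy autoregressive decoding.
def next_token(prefix):
    # Dummy "model": the next token is the sum of the prefix modulo 10.
    return sum(prefix) % 10

def generate(prompt, num_new_tokens):
    tokens = list(prompt)
    for _ in range(num_new_tokens):
        # Each step depends on all previously generated tokens,
        # so the loop cannot be parallelized across steps.
        tokens.append(next_token(tokens))
    return tokens

print(generate([1, 2, 3], 4))  # → [1, 2, 3, 6, 2, 4, 8]
```

However cheap each step is, the steps must run one after another; this is the bottleneck the following sections attack.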
## KV Cache
Recomputing attention keys and values from scratch at every step of transformer inference would be extremely inefficient.
Suppose a sequence has length $n$. Naively recomputing all attention states at every generation step would repeatedly reprocess earlier tokens.
Instead, inference systems use a key-value cache, usually called a KV cache.
For each transformer layer:
| Stored tensor | Meaning |
|---|---|
| Keys | Attention key projections |
| Values | Attention value projections |
At generation step $t$, only the newest token requires new computation. Earlier keys and values are reused.
Without KV caching, attention cost at step $t$ grows roughly as $O(t^2)$ per generated token, because the entire prefix is reprocessed.
With caching, only new attention interactions are computed.
KV caching is essential for efficient transformer serving.
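A minimal single-head sketch of the idea, in pure Python for clarity (a real implementation uses batched tensor kernels): each step appends one key/value pair to the cache and attends over it, rather than recomputing earlier projections.

```python
import math

# Illustrative single-head attention step with a KV cache.
class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        # Cache the newest key/value; earlier entries are reused, so each
        # step only computes scores between the new query and the cache.
        self.keys.append(k)
        self.values.append(v)
        scores = [sum(qi * ki for qi, ki in zip(q, key)) for key in self.keys]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]  # stable softmax
        z = sum(weights)
        dim = len(v)
        return [sum(w * val[d] for w, val in zip(weights, self.values)) / z
                for d in range(dim)]

cache = KVCache()
out1 = cache.step([1.0, 0.0], [1.0, 0.0], [2.0, 0.0])
out2 = cache.step([0.0, 1.0], [0.0, 1.0], [0.0, 3.0])
print(out1)  # → [2.0, 0.0] (only one entry to attend over)
```

Note that `step` does work proportional to the cache length, not to the square of it: the old keys and values are read, never recomputed.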
## Memory Cost of KV Caches
KV caches consume substantial memory.
Approximate KV cache memory per sequence:

$$\text{KV memory} \approx 2 \times L \times n \times h \times d \times (\text{bytes per element})$$

where:

| Symbol | Meaning |
|---|---|
| $L$ | Number of layers |
| $n$ | Sequence length |
| $h$ | Attention heads |
| $d$ | Head dimension |

The leading factor of 2 accounts for storing both keys and values.
Long-context inference therefore becomes memory-intensive.
Example pressures include:
- many concurrent users
- long conversations
- retrieval-augmented prompts
- large batch serving
Modern inference systems often spend more memory on KV caches than on model parameters.
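A back-of-envelope calculation makes the pressure concrete. The configuration below (2 × layers × tokens × heads × head dimension × bytes, counting both keys and values) uses illustrative numbers loosely in the range of a 7B-class model, not any specific model's published configuration:

```python
# Rough KV cache size: keys and values for every layer, token, and head.
def kv_cache_bytes(layers, seq_len, heads, head_dim, bytes_per_elem=2, batch=1):
    # Factor of 2 accounts for storing both keys and values.
    return 2 * layers * seq_len * heads * head_dim * bytes_per_elem * batch

size = kv_cache_bytes(layers=32, seq_len=4096, heads=32, head_dim=128,
                      bytes_per_elem=2)  # fp16 elements
print(f"{size / 2**30:.1f} GiB per 4096-token sequence")  # → 2.0 GiB
```

At 2 GiB per long sequence, a few dozen concurrent users can consume more accelerator memory than the weights themselves.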
## Quantization
Quantization reduces numerical precision to lower memory and compute cost.
Instead of storing parameters in fp16 or fp32, systems may use:
| Format | Bits |
|---|---|
| fp32 | 32 |
| fp16 | 16 |
| bf16 | 16 |
| int8 | 8 |
| int4 | 4 |
A quantized parameter approximation:

$$w \approx s \cdot (q - z)$$

where:

| Symbol | Meaning |
|---|---|
| $q$ | Quantized integer |
| $s$ | Scale |
| $z$ | Zero point |
Quantization reduces:
- memory footprint
- bandwidth usage
- inference latency
A 4-bit model may require roughly one-quarter the parameter memory of a 16-bit model.
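A minimal sketch of asymmetric affine quantization of the form $w \approx s \cdot (q - z)$. Real systems quantize per-channel or per-group and handle rounding and clipping more carefully; this illustrates only the core mapping:

```python
# Map floats into the int range [0, 2^bits - 1] with a scale and zero point.
def quantize(values, bits=8):
    lo, hi = min(values), max(values)
    qmax = 2**bits - 1
    scale = (hi - lo) / qmax if hi != lo else 1.0
    zero_point = round(-lo / scale)
    q = [max(0, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    # Reconstruct the approximation w ≈ s * (q - z).
    return [scale * (qi - zero_point) for qi in q]

weights = [-1.0, -0.5, 0.0, 0.75, 1.0]
q, s, z = quantize(weights)
approx = dequantize(q, s, z)
print([round(a, 3) for a in approx])
```

Each reconstructed value lands within one quantization step ($s$) of the original, which is the best a uniform grid can guarantee.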
## Quantization Tradeoffs
Quantization introduces approximation error.
Tradeoffs include:
| Advantage | Cost |
|---|---|
| Lower memory | Lower numerical precision |
| Faster inference | Possible accuracy degradation |
| Larger batch serving | More implementation complexity |
Some layers are more sensitive than others.
Common approaches include:
| Method | Idea |
|---|---|
| Post-training quantization | Quantize after training |
| Quantization-aware training | Simulate quantization during training |
| Mixed-precision quantization | Different layers use different precision |
Modern language models can often tolerate surprisingly aggressive quantization.
## Weight-Only Quantization
In many transformer systems, weights dominate memory usage.
Weight-only quantization stores weights in lower precision while keeping activations in higher precision.
Example:
| Tensor type | Precision |
|---|---|
| Weights | int4 |
| Activations | fp16 |
| KV cache | fp16 |
This approach is attractive because it simplifies implementation while greatly reducing parameter memory.
## Activation Quantization
Activation quantization reduces precision of intermediate tensors during inference.
This further reduces memory and bandwidth, but activations are often more sensitive than weights.
Challenges include:
- outlier activations
- varying tensor distributions
- dynamic ranges changing during inference
Activation quantization is especially difficult for transformers with long contexts.
## Operator Fusion
Modern neural networks contain many small operations:
- matrix multiplication
- bias addition
- normalization
- activation functions
Naively executing each operation separately creates overhead from:
- kernel launches
- memory reads and writes
- synchronization
Operator fusion combines multiple operations into one kernel.
Example: the pattern

$$y = \sigma(Wx + b)$$

(matrix multiply, bias add, then activation $\sigma$) may be fused into one execution unit instead of separate:
- matrix multiplication
- bias addition
- activation
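The effect can be mimicked in plain Python, where each separate loop stands in for one kernel launch and each intermediate list for one materialized tensor. This is an analogy for what a compiler does, not real kernel fusion:

```python
# y = relu(W @ x + b), computed two ways.
def unfused(W, x, b):
    t1 = [sum(wij * xj for wij, xj in zip(row, x)) for row in W]  # "matmul kernel"
    t2 = [ti + bi for ti, bi in zip(t1, b)]                       # "bias kernel"
    return [max(0.0, ti) for ti in t2]                            # "relu kernel"

def fused(W, x, b):
    # One pass over the data: no intermediate tensors are materialized.
    return [max(0.0, sum(wij * xj for wij, xj in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

W = [[1.0, -1.0], [0.5, 0.5]]
x = [2.0, 1.0]
b = [-0.5, 0.0]
print(unfused(W, x, b), fused(W, x, b))  # identical results
```

The fused version writes each output element once instead of three times, which is exactly the memory-traffic saving fusion provides on accelerators.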
Fusion improves:
| Benefit | Reason |
|---|---|
| Throughput | Less overhead |
| Memory efficiency | Fewer intermediate tensors |
| Cache locality | Better reuse |
Inference compilers rely heavily on fusion.
## Compilation and Graph Optimization
Eager execution is flexible but may introduce overhead.
Inference systems often convert models into optimized computation graphs.
Common graph optimizations include:
| Optimization | Purpose |
|---|---|
| Operator fusion | Reduce overhead |
| Constant folding | Precompute constants |
| Dead code elimination | Remove unused operations |
| Kernel selection | Choose optimized implementations |
| Layout optimization | Improve memory access |
Common inference runtimes include:
| Runtime | Use |
|---|---|
| TorchScript | PyTorch graph execution |
| TensorRT | NVIDIA inference optimization |
| ONNX Runtime | Portable graph execution |
| TVM | Compiler optimization |
| XLA | Accelerated graph compilation |
## Batch Inference
Inference systems often combine requests into batches.
Instead of processing one example $y = f(x)$, the system processes a batch $Y = f(X)$, where $X$ stacks many inputs $x_1, \dots, x_B$.
Batching improves hardware utilization because GPUs are optimized for large tensor operations.
Advantages:
| Benefit | Reason |
|---|---|
| Higher throughput | Better GPU occupancy |
| Better amortization | Shared kernel overhead |
| Improved efficiency | Larger matrix multiplications |
Disadvantages:
| Problem | Explanation |
|---|---|
| Higher latency | Requests wait for batching |
| Uneven sequence lengths | Padding inefficiency |
| Scheduling complexity | Dynamic request arrival |
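The padding inefficiency noted above can be made concrete with a toy example (real servers measure this per batch):

```python
# Batching uneven sequences: every sequence is padded to the longest,
# so short requests waste compute and memory on pad slots.
def pad_batch(seqs, pad=0):
    max_len = max(len(s) for s in seqs)
    padded = [s + [pad] * (max_len - len(s)) for s in seqs]
    waste = sum(max_len - len(s) for s in seqs) / (max_len * len(seqs))
    return padded, waste

batch, waste = pad_batch([[1, 2, 3, 4, 5, 6, 7, 8], [1, 2], [1, 2, 3]])
print(f"padded shape: {len(batch)} x {len(batch[0])}, wasted slots: {waste:.0%}")
```

Here roughly 46% of the batch tensor is padding; grouping requests by length or using continuous batching recovers most of that waste.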
Serving systems must balance throughput against latency.
## Continuous Batching
Traditional batching waits for a full batch before execution.
Continuous batching dynamically inserts and removes requests during generation.
This is especially important for LLM serving because different requests finish at different times.
Example:
| Request | Length |
|---|---|
| A | 20 tokens |
| B | 300 tokens |
| C | 50 tokens |
Without continuous batching, short requests may wait behind long requests.
Continuous batching keeps the GPU busy while minimizing wasted slots.
Modern LLM serving systems heavily rely on this technique.
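A toy simulation of the scheduling idea, using the three requests above. The slot count and one-token-per-step pacing are simplifying assumptions:

```python
from collections import deque

# Continuous batching: finished requests free their slot immediately,
# and waiting requests join the batch mid-flight.
def serve(requests, max_batch):
    waiting = deque(requests)              # (request_id, tokens_remaining)
    active, finish_order, step = [], [], 0
    while waiting or active:
        # Refill free slots every step instead of waiting for a full batch.
        while waiting and len(active) < max_batch:
            active.append(list(waiting.popleft()))
        step += 1
        for req in active:
            req[1] -= 1                    # generate one token per request
        for req in [r for r in active if r[1] == 0]:
            active.remove(req)
            finish_order.append(req[0])
    return finish_order, step

order, steps = serve([("A", 20), ("B", 300), ("C", 50)], max_batch=2)
print(order, steps)  # → ['A', 'C', 'B'] 300
```

Request C enters as soon as A finishes at step 20, so the total run takes only as long as the longest request (300 steps) instead of serializing behind it.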
## Speculative Decoding
Autoregressive generation is sequential and slow.
Speculative decoding accelerates generation using a smaller draft model.
Workflow:
- small model predicts several candidate tokens
- large model verifies them
- accepted tokens are committed
- rejected tokens are recomputed
If the draft model predicts correctly often enough, throughput increases substantially.
The idea exploits the fact that verification can be cheaper than full sequential generation.
## Mixture-of-Experts Inference
Mixture-of-experts models activate only part of the network for each token.
Instead of computing all experts $E_1, \dots, E_N$ for every token, only a router-selected subset (typically the top-$k$ by gate score) is evaluated:

$$y = \sum_{i \in \text{top-}k} g_i(x)\, E_i(x)$$
This reduces computation per token while increasing total parameter count.
However, MoE inference introduces routing complexity:
- token dispatch
- load balancing
- expert communication
Efficient MoE serving is therefore partly a systems problem.
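A toy top-$k$ routing step makes the control flow concrete (the gate scores and expert functions here are arbitrary stand-ins):

```python
import math

# Dispatch one token to the k highest-scoring experts and mix their
# outputs with softmax-normalized gate weights.
def route(x, gate_scores, experts, k=2):
    top = sorted(range(len(experts)),
                 key=lambda i: gate_scores[i], reverse=True)[:k]
    # Softmax over the selected scores only.
    m = max(gate_scores[i] for i in top)
    w = {i: math.exp(gate_scores[i] - m) for i in top}
    z = sum(w.values())
    # Only the selected experts run; the rest are skipped entirely.
    return sum(w[i] / z * experts[i](x) for i in top)

experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
print(route(3.0, [0.1, 2.0, 2.0, -1.0], experts, k=2))  # → 7.5
```

With `k=2` of four experts, only half the expert compute runs per token; the systems difficulty arises when tokens in a batch choose different experts and must be regrouped and dispatched efficiently.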
## Attention Optimization
Attention becomes expensive for long contexts.
Standard attention has complexity $O(n^2)$, where $n$ is sequence length.
Long-context systems therefore use:
| Method | Idea |
|---|---|
| FlashAttention | Memory-efficient kernels |
| Sliding-window attention | Local attention regions |
| Sparse attention | Ignore many token pairs |
| Linear attention | Approximate softmax attention |
| Paged attention | Efficient KV cache management |
FlashAttention became especially important because it reduces memory traffic while preserving exact attention computation.
## Paged Attention
Large language model serving often suffers from KV cache fragmentation.
Paged attention organizes KV memory into blocks or pages, similar to virtual memory systems.
Benefits include:
| Benefit | Explanation |
|---|---|
| Better memory utilization | Reduced fragmentation |
| Flexible request scheduling | Easier dynamic batching |
| Efficient cache reuse | Improved serving throughput |
Paged attention systems became important for large-scale multi-user inference servers.
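The block-table idea can be sketched in a few lines. The block size and storage layout here are illustrative, not vLLM's actual implementation:

```python
# Toy paged KV storage: logical token positions map through a block
# table to fixed-size physical blocks, so a sequence can grow without
# needing contiguous memory.
BLOCK = 4

class PagedKV:
    def __init__(self):
        self.blocks = []          # physical block storage
        self.block_table = []     # logical block index -> physical block id

    def append(self, kv):
        # Allocate a fresh block only when the last one is full.
        if not self.block_table or len(self.blocks[self.block_table[-1]]) == BLOCK:
            self.blocks.append([])
            self.block_table.append(len(self.blocks) - 1)
        self.blocks[self.block_table[-1]].append(kv)

    def get(self, pos):
        # Translate a logical position into (block, offset).
        return self.blocks[self.block_table[pos // BLOCK]][pos % BLOCK]

kv = PagedKV()
for t in range(10):
    kv.append(("k%d" % t, "v%d" % t))
print(len(kv.block_table), kv.get(9))  # 10 tokens fill 3 blocks of 4
```

Because blocks are allocated on demand and freed independently, memory fragments far less than with one contiguous buffer per sequence, and blocks can even be shared between requests with a common prefix.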
## CPU, GPU, and Accelerator Inference
Inference hardware varies widely.
| Hardware | Strength |
|---|---|
| CPU | Flexibility, low-volume serving |
| GPU | High throughput |
| TPU | Large-scale serving |
| Edge accelerators | Low power |
| Mobile NPUs | On-device inference |
The optimal deployment depends on:
- latency requirements
- throughput requirements
- memory constraints
- deployment environment
- cost targets
A small quantized model may run efficiently on a phone. A frontier language model may require dozens of GPUs for interactive serving.
## Edge Inference
Edge inference runs models close to the user:
- phones
- browsers
- robots
- embedded devices
- autonomous systems
Advantages:
| Benefit | Reason |
|---|---|
| Lower latency | No network round-trip |
| Better privacy | Data stays local |
| Offline capability | No server required |
Constraints:
| Constraint | Problem |
|---|---|
| Limited memory | Small devices |
| Power consumption | Battery limits |
| Thermal limits | Sustained compute restrictions |
Edge systems therefore rely heavily on:
- quantization
- pruning
- compact architectures
- hardware-specific optimization
## Serving Systems
Modern inference serving systems coordinate:
- batching
- scheduling
- memory management
- caching
- load balancing
- request routing
Common serving frameworks include:
| Framework | Use |
|---|---|
| TorchServe | PyTorch deployment |
| Triton Inference Server | Multi-model serving |
| vLLM | Efficient LLM serving |
| TensorRT-LLM | NVIDIA optimized LLM inference |
| Ray Serve | Distributed serving |
Serving infrastructure often becomes a major engineering domain separate from model training.
## Cost as the Main Constraint
At scale, inference cost dominates deployment economics.
A model serving millions of users may process enormous token volumes daily.
Key cost drivers include:
| Driver | Impact |
|---|---|
| Parameter count | Memory and compute |
| Context length | Attention cost |
| Output length | Sequential decoding cost |
| Concurrent users | KV cache memory |
| Precision | Hardware efficiency |
Inference optimization therefore directly affects commercial viability.
## The Central Tradeoff
Inference optimization balances latency, throughput, memory, energy, output quality, and cost.
Improving one dimension often worsens another.
Examples:
| Optimization | Possible downside |
|---|---|
| Lower precision | Accuracy degradation |
| Larger batches | Higher latency |
| Longer context | Higher memory use |
| Smaller models | Reduced capability |
| Aggressive caching | More memory consumption |
Inference engineering is therefore largely an optimization problem under hardware and economic constraints.
## From Models to Systems
A trained neural network is only one part of a production AI system.
Real-world deployment also requires:
- runtime compilers
- schedulers
- distributed caches
- request routers
- memory managers
- observability systems
- autoscaling infrastructure
As model size increased, inference optimization evolved from a minor deployment detail into one of the central engineering problems in modern AI systems.