Inference Optimization

Training produces model parameters. Inference uses those parameters to generate predictions.

Inference optimization studies how to make model execution faster, cheaper, smaller, and more memory-efficient while preserving acceptable output quality.

For small models, naive inference may be sufficient. For foundation models, inference often becomes more expensive than training because deployed systems may serve millions or billions of requests.

A language model trained once may perform inference continuously for years.

Inference systems therefore optimize:

  • latency
  • throughput
  • memory usage
  • energy efficiency
  • hardware utilization
  • serving cost

Training Versus Inference

Training and inference have different computational characteristics.

| Property | Training | Inference |
| --- | --- | --- |
| Forward pass | Yes | Yes |
| Backward pass | Yes | No |
| Gradient storage | Required | Not needed |
| Optimizer state | Required | Not needed |
| Numerical precision | Often mixed precision | Often lower precision |
| Latency sensitivity | Usually low | Often critical |
| Throughput focus | Tokens/images per second | Requests per second |

Inference is often memory-bandwidth bound rather than compute bound, especially during autoregressive decoding with large transformers.

Inference Workloads

Inference workloads vary substantially.

| Workload | Example |
| --- | --- |
| Batch inference | Offline embedding generation |
| Real-time inference | Chat applications |
| Streaming generation | Token-by-token LLM decoding |
| Edge inference | Mobile or embedded devices |
| Interactive multimodal systems | Vision-language assistants |

Different workloads require different optimizations.

For example:

| Goal | Important metric |
| --- | --- |
| Real-time chatbot | Low latency |
| Embedding pipeline | High throughput |
| Mobile model | Low memory and energy |
| Datacenter serving | Cost efficiency |

Autoregressive Decoding

Large language models usually generate tokens autoregressively.

Given previous tokens:

$$[t_1, t_2, \ldots, t_n],$$

the model predicts:

$$p(t_{n+1} \mid t_{\le n}).$$

Then the next token is appended and the process repeats.

This sequential dependency limits parallelism because token $t_{n+1}$ must be generated before predicting $t_{n+2}$.

Training parallelizes across sequence positions. Inference cannot fully do this during generation.

Autoregressive decoding is therefore a major inference bottleneck.
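
A minimal greedy decoding loop makes the sequential dependency concrete. This is a sketch, assuming a hypothetical `model` callable that maps a batch of token ids to next-token logits; each iteration must finish before the next can start.

```python
import torch

@torch.no_grad()
def greedy_decode(model, input_ids, max_new_tokens):
    """Generate tokens one at a time; step n+1 cannot start until step n finishes."""
    tokens = input_ids                                    # shape (1, n): prompt tokens
    for _ in range(max_new_tokens):
        logits = model(tokens)                            # shape (1, seq_len, vocab_size)
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_token], dim=1)   # append and repeat
    return tokens
```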

KV Cache

Recomputing attention keys and values for every earlier token at each step of transformer inference would be extremely inefficient.

Suppose a sequence has length $T$. Naively recomputing all attention states at every generation step would repeatedly reprocess those earlier tokens.

Instead, inference systems use a key-value cache, usually called a KV cache.

For each transformer layer:

| Stored tensor | Meaning |
| --- | --- |
| Keys | Attention key projections |
| Values | Attention value projections |

At generation step $t$, only the newest token requires new computation. Earlier keys and values are reused.

Without KV caching, generation cost grows roughly as:

$$O(T^2)$$

per generated token.

With caching, only new attention interactions are computed.

KV caching is essential for efficient transformer serving.
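
The mechanics can be sketched for a single attention layer and a single decoding step, assuming projections of shape (batch, heads, 1, head_dim) for the newest token and PyTorch's `scaled_dot_product_attention`:

```python
import torch
import torch.nn.functional as F

def attend_with_cache(q_new, k_new, v_new, cache):
    """One decoding step for one attention layer.

    q_new, k_new, v_new: projections for the newest token, shape (batch, heads, 1, head_dim).
    cache: {"k": ..., "v": ...} holding all earlier tokens, or None at the first step.
    """
    if cache is None:
        k_all, v_all = k_new, v_new
    else:
        # Reuse cached keys/values; only the newest token's projections are computed this step.
        k_all = torch.cat([cache["k"], k_new], dim=2)
        v_all = torch.cat([cache["v"], v_new], dim=2)
    cache = {"k": k_all, "v": v_all}
    out = F.scaled_dot_product_attention(q_new, k_all, v_all)  # newest query attends to all positions
    return out, cache
```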

Memory Cost of KV Caches

KV caches consume substantial memory.

Approximate KV cache memory:

$$\text{memory} \propto L \times T \times H \times D,$$

where:

| Symbol | Meaning |
| --- | --- |
| $L$ | Number of layers |
| $T$ | Sequence length |
| $H$ | Attention heads |
| $D$ | Head dimension |

Long-context inference therefore becomes memory-intensive.

Example pressures include:

  • many concurrent users
  • long conversations
  • retrieval-augmented prompts
  • large batch serving

Modern inference systems often spend more memory on KV caches than on model parameters.
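
A back-of-envelope calculation makes the pressure concrete. The numbers below are illustrative rather than taken from any particular model; the factor of two accounts for storing both keys and values, in fp16.

```python
# Approximate KV cache size for a single long sequence (illustrative numbers only).
layers = 32          # L
seq_len = 8192       # T
heads = 32           # H
head_dim = 128       # D
bytes_per_elem = 2   # fp16
kv_factor = 2        # keys and values

kv_bytes = layers * seq_len * heads * head_dim * bytes_per_elem * kv_factor
print(f"{kv_bytes / 2**30:.1f} GiB per sequence")  # 4.0 GiB with these numbers
```

Multiply by the number of concurrent sequences and the cache can easily outgrow the weights.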

Quantization

Quantization reduces numerical precision to lower memory and compute cost.

Instead of storing parameters in fp16 or fp32, systems may use:

| Format | Bits |
| --- | --- |
| fp32 | 32 |
| fp16 | 16 |
| bf16 | 16 |
| int8 | 8 |
| int4 | 4 |

A quantized parameter approximation:

$$W \approx s(q - z),$$

where:

| Symbol | Meaning |
| --- | --- |
| $q$ | Quantized integer |
| $s$ | Scale |
| $z$ | Zero point |

Quantization reduces:

  • memory footprint
  • bandwidth usage
  • inference latency

A 4-bit model may require roughly one-quarter the parameter memory of a 16-bit model.
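
A minimal sketch of asymmetric int8 quantization and dequantization follows the approximation above; a per-tensor scale and zero point are used here only for simplicity.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Asymmetric per-tensor quantization: W ≈ s * (q - z)."""
    qmin, qmax = -128, 127
    scale = (w.max() - w.min()) / (qmax - qmin)
    zero_point = torch.round(qmin - w.min() / scale).clamp(qmin, qmax)
    q = (torch.round(w / scale) + zero_point).clamp(qmin, qmax).to(torch.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.float() - zero_point)

w = torch.randn(256, 256)
q, s, z = quantize_int8(w)
print((w - dequantize(q, s, z)).abs().max())  # worst-case approximation error
```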

Quantization Tradeoffs

Quantization introduces approximation error.

Tradeoffs include:

| Advantage | Cost |
| --- | --- |
| Lower memory | Lower numerical precision |
| Faster inference | Possible accuracy degradation |
| Larger batch serving | More implementation complexity |

Some layers are more sensitive than others.

Common approaches include:

| Method | Idea |
| --- | --- |
| Post-training quantization | Quantize after training |
| Quantization-aware training | Simulate quantization during training |
| Mixed-precision quantization | Different layers use different precision |

Modern language models can often tolerate surprisingly aggressive quantization.

Weight-Only Quantization

In many transformer systems, weights dominate memory usage.

Weight-only quantization stores weights in lower precision while keeping activations in higher precision.

Example:

| Tensor type | Precision |
| --- | --- |
| Weights | int4 |
| Activations | fp16 |
| KV cache | fp16 |

This approach is attractive because it simplifies implementation while greatly reducing parameter memory.
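
A sketch of the idea, using int8 rather than int4 because int4 storage needs bit packing and custom kernels: weights are kept as integers with per-output-channel scales and dequantized just before the matrix multiplication, while activations stay in higher precision.

```python
import torch

class WeightOnlyInt8Linear(torch.nn.Module):
    """Linear layer with int8-stored weights, dequantized at matmul time."""

    def __init__(self, linear: torch.nn.Linear):
        super().__init__()
        w = linear.weight.data
        # Symmetric per-output-channel scales.
        self.register_buffer("scale", w.abs().amax(dim=1, keepdim=True) / 127.0)
        self.register_buffer("w_int8", torch.round(w / self.scale).clamp(-127, 127).to(torch.int8))
        self.bias = linear.bias

    def forward(self, x):
        w = self.w_int8.float() * self.scale          # dequantize just before use
        return torch.nn.functional.linear(x, w, self.bias)

layer = WeightOnlyInt8Linear(torch.nn.Linear(1024, 4096))
y = layer(torch.randn(8, 1024))
```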

Activation Quantization

Activation quantization reduces precision of intermediate tensors during inference.

This further reduces memory and bandwidth, but activations are often more sensitive than weights.

Challenges include:

  • outlier activations
  • varying tensor distributions
  • dynamic ranges changing during inference

Activation quantization is especially difficult for transformers with long contexts.

Operator Fusion

Modern neural networks contain many small operations:

  • matrix multiplication
  • bias addition
  • normalization
  • activation functions

Naively executing each operation separately creates overhead from:

  • kernel launches
  • memory reads and writes
  • synchronization

Operator fusion combines multiple operations into one kernel.

Example:

$$y = \text{GELU}(Wx + b)$$

may be fused into a single kernel instead of three separate steps:

  1. matrix multiplication
  2. bias addition
  3. activation

Fusion improves:

| Benefit | Reason |
| --- | --- |
| Throughput | Less overhead |
| Memory efficiency | Fewer intermediate tensors |
| Cache locality | Better reuse |

Inference compilers rely heavily on fusion.
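
As a sketch, `torch.compile` can be pointed at the example above; whether the bias add and activation are actually fused into surrounding kernels depends on the backend and hardware, but the intent is to remove separate kernel launches and intermediate tensors.

```python
import torch
import torch.nn.functional as F

def gelu_linear(x, W, b):
    # Three logical operations: matrix multiplication, bias add, activation.
    return F.gelu(x @ W.T + b)

fused = torch.compile(gelu_linear)   # the compiler may fuse the elementwise ops

x = torch.randn(32, 1024)
W = torch.randn(4096, 1024)
b = torch.randn(4096)
y = fused(x, W, b)                   # same result as gelu_linear(x, W, b), up to numerics
```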

Compilation and Graph Optimization

Eager execution is flexible but may introduce overhead.

Inference systems often convert models into optimized computation graphs.

Common graph optimizations include:

| Optimization | Purpose |
| --- | --- |
| Operator fusion | Reduce overhead |
| Constant folding | Precompute constants |
| Dead code elimination | Remove unused operations |
| Kernel selection | Choose optimized implementations |
| Layout optimization | Improve memory access |

Common inference runtimes include:

| Runtime | Use |
| --- | --- |
| TorchScript | PyTorch graph execution |
| TensorRT | NVIDIA inference optimization |
| ONNX Runtime | Portable graph execution |
| TVM | Compiler optimization |
| XLA | Accelerated graph compilation |
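
A minimal example of capturing a static graph from a toy PyTorch model is sketched below; the graph optimizations actually applied afterwards depend on the runtime that consumes it.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.GELU(),
    torch.nn.Linear(512, 10),
).eval()

example = torch.randn(1, 512)

# Capture a static graph with TorchScript so graph-level optimizations can be applied.
traced = torch.jit.trace(model, example)
traced = torch.jit.freeze(traced)        # folds weights and other constants into the graph

# Or export to ONNX for runtimes such as ONNX Runtime or TensorRT.
torch.onnx.export(model, example, "model.onnx", opset_version=17)
```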

Batch Inference

Inference systems often combine requests into batches.

Instead of processing one example:

$$x_1,$$

the system processes:

$$[x_1, x_2, \ldots, x_B].$$

Batching improves hardware utilization because GPUs are optimized for large tensor operations.

Advantages:

| Benefit | Reason |
| --- | --- |
| Higher throughput | Better GPU occupancy |
| Better amortization | Shared kernel overhead |
| Improved efficiency | Larger matrix multiplications |

Disadvantages:

| Problem | Explanation |
| --- | --- |
| Higher latency | Requests wait for batching |
| Uneven sequence lengths | Padding inefficiency |
| Scheduling complexity | Dynamic request arrival |

Serving systems must balance throughput against latency.
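
A small sketch of static batching with padding, using made-up token-id tensors, shows both the benefit (one batched forward pass) and the cost (short requests padded to the longest length in the batch).

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three requests with different prompt lengths (token id 0 reserved for padding).
requests = [torch.randint(1, 32000, (n,)) for n in (12, 47, 8)]

# Pad to a common length so all requests run in one batched forward pass.
batch = pad_sequence(requests, batch_first=True, padding_value=0)   # shape (3, 47)
attention_mask = (batch != 0).long()                                # 1 = real token, 0 = padding

# One call over the whole batch keeps the accelerator busier than three separate calls,
# but the two short requests each pay for 47 positions: padding inefficiency.
```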

Continuous Batching

Traditional batching waits for a full batch before execution.

Continuous batching dynamically inserts and removes requests during generation.

This is especially important for LLM serving because different requests finish at different times.

Example:

| Request | Length |
| --- | --- |
| A | 20 tokens |
| B | 300 tokens |
| C | 50 tokens |

Without continuous batching, short requests may wait behind long requests.

Continuous batching keeps the GPU busy while minimizing wasted slots.

Modern LLM serving systems heavily rely on this technique.

Speculative Decoding

Autoregressive generation is sequential and slow.

Speculative decoding accelerates generation using a smaller draft model.

Workflow:

  1. small model predicts several candidate tokens
  2. large model verifies them
  3. accepted tokens are committed
  4. rejected tokens are recomputed

If the draft model predicts correctly often enough, throughput increases substantially.

The idea exploits the fact that the large model can verify several proposed tokens in one parallel forward pass, which is cheaper than generating them one at a time.
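
A greedy variant of one draft-and-verify round can be sketched as follows, assuming hypothetical `draft_model` and `target_model` callables that map token ids to logits. Production systems use a probabilistic acceptance rule that preserves the target distribution, but the control flow is similar.

```python
import torch

@torch.no_grad()
def speculative_step(draft_model, target_model, tokens, k=4):
    """One draft-and-verify round (greedy acceptance for clarity)."""
    draft = tokens
    for _ in range(k):                                           # cheap sequential drafting
        nxt = draft_model(draft)[:, -1].argmax(-1, keepdim=True)
        draft = torch.cat([draft, nxt], dim=1)

    target_logits = target_model(draft)                          # one parallel verification pass
    n = tokens.shape[1]
    accepted = tokens
    for i in range(k):
        target_tok = target_logits[:, n - 1 + i].argmax(-1, keepdim=True)
        accepted = torch.cat([accepted, target_tok], dim=1)      # target's choice is always valid
        if not torch.equal(target_tok, draft[:, n + i : n + i + 1]):
            break                                                # draft diverged; stop accepting
    return accepted
```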

Mixture-of-Experts Inference

Mixture-of-experts models activate only part of the network for each token.

Instead of computing all experts:

$$f(x) = \sum_{i=1}^{N} g_i(x)\,E_i(x),$$

only a subset is selected.

This reduces computation per token while increasing total parameter count.
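
A sketch of top-k routing, assuming a hypothetical linear `router` and a Python list of feed-forward `experts`; each expert processes only the tokens that selected it.

```python
import torch

def moe_forward(x, router, experts, top_k=2):
    """Sparse MoE layer: each token runs only its top-k experts.

    x: (tokens, d_model); router: nn.Linear(d_model, num_experts); experts: list of FFN modules.
    """
    gate = router(x).softmax(dim=-1)                        # (tokens, num_experts)
    weights, indices = gate.topk(top_k, dim=-1)             # chosen experts per token
    weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize over the chosen experts

    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        token_ids, slot = (indices == e).nonzero(as_tuple=True)
        if token_ids.numel() == 0:
            continue                                        # expert e is idle for this batch
        out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
    return out
```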

However, MoE inference introduces routing complexity:

  • token dispatch
  • load balancing
  • expert communication

Efficient MoE serving is therefore partly a systems problem.

Attention Optimization

Attention becomes expensive for long contexts.

Standard attention complexity:

$$O(T^2),$$

where $T$ is sequence length.

Long-context systems therefore use:

| Method | Idea |
| --- | --- |
| FlashAttention | Memory-efficient kernels |
| Sliding-window attention | Local attention regions |
| Sparse attention | Ignore many token pairs |
| Linear attention | Approximate softmax attention |
| Paged attention | Efficient KV cache management |

FlashAttention became especially important because it reduces memory traffic while preserving exact attention computation.
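
As a sketch, PyTorch's `scaled_dot_product_attention` may dispatch to memory-efficient or FlashAttention-style kernels when the inputs allow, and a sliding-window pattern can be expressed as a boolean mask (an arbitrary mask can force a fallback to a slower path, so real systems often use dedicated kernels instead).

```python
import torch
import torch.nn.functional as F

def sliding_window_mask(seq_len, window):
    """True where attention is allowed: causal, and at most `window` positions back."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)

q = k = v = torch.randn(1, 8, 1024, 64)           # (batch, heads, seq, head_dim)
mask = sliding_window_mask(1024, window=128)

out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```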

Paged Attention

Large language model serving often suffers from KV cache fragmentation.

Paged attention organizes KV memory into blocks or pages, similar to virtual memory systems.

Benefits include:

| Benefit | Explanation |
| --- | --- |
| Better memory utilization | Reduced fragmentation |
| Flexible request scheduling | Easier dynamic batching |
| Efficient cache reuse | Improved serving throughput |

Paged attention systems became important for large-scale multi-user inference servers.

CPU, GPU, and Accelerator Inference

Inference hardware varies widely.

| Hardware | Strength |
| --- | --- |
| CPU | Flexibility, low-volume serving |
| GPU | High throughput |
| TPU | Large-scale serving |
| Edge accelerators | Low power |
| Mobile NPUs | On-device inference |

The optimal deployment depends on:

  • latency requirements
  • throughput requirements
  • memory constraints
  • deployment environment
  • cost targets

A small quantized model may run efficiently on a phone. A frontier language model may require dozens of GPUs for interactive serving.

Edge Inference

Edge inference runs models close to the user:

  • phones
  • browsers
  • robots
  • embedded devices
  • autonomous systems

Advantages:

| Benefit | Reason |
| --- | --- |
| Lower latency | No network round-trip |
| Better privacy | Data stays local |
| Offline capability | No server required |

Constraints:

| Constraint | Problem |
| --- | --- |
| Limited memory | Small devices |
| Power consumption | Battery limits |
| Thermal limits | Sustained compute restrictions |

Edge systems therefore rely heavily on:

  • quantization
  • pruning
  • compact architectures
  • hardware-specific optimization

Serving Systems

Modern inference serving systems coordinate:

  • batching
  • scheduling
  • memory management
  • caching
  • load balancing
  • request routing

Common serving frameworks include:

| Framework | Use |
| --- | --- |
| TorchServe | PyTorch deployment |
| Triton Inference Server | Multi-model serving |
| vLLM | Efficient LLM serving |
| TensorRT-LLM | NVIDIA optimized LLM inference |
| Ray Serve | Distributed serving |

Serving infrastructure often becomes a major engineering domain separate from model training.

Cost as the Main Constraint

At scale, inference cost dominates deployment economics.

A model serving millions of users may process enormous token volumes daily.

Key cost drivers include:

| Driver | Impact |
| --- | --- |
| Parameter count | Memory and compute |
| Context length | Attention cost |
| Output length | Sequential decoding cost |
| Concurrent users | KV cache memory |
| Precision | Hardware efficiency |

Inference optimization therefore directly affects commercial viability.

The Central Tradeoff

Inference optimization balances:

$$\text{quality} \leftrightarrow \text{latency} \leftrightarrow \text{throughput} \leftrightarrow \text{memory} \leftrightarrow \text{cost}.$$

Improving one dimension often worsens another.

Examples:

| Optimization | Possible downside |
| --- | --- |
| Lower precision | Accuracy degradation |
| Larger batches | Higher latency |
| Longer context | Higher memory use |
| Smaller models | Reduced capability |
| Aggressive caching | More memory consumption |

Inference engineering is therefore largely an optimization problem under hardware and economic constraints.

From Models to Systems

A trained neural network is only one part of a production AI system.

Real-world deployment also requires:

  • runtime compilers
  • schedulers
  • distributed caches
  • request routers
  • memory managers
  • observability systems
  • autoscaling infrastructure

As model size increased, inference optimization evolved from a minor deployment detail into one of the central engineering problems in modern AI systems.