# Inference Optimization

Training produces model parameters. Inference uses those parameters to generate predictions.

Inference optimization studies how to make model execution faster, cheaper, smaller, and more memory-efficient while preserving acceptable output quality.

For small models, naive inference may be sufficient. For foundation models, inference often becomes more expensive than training because deployed systems may serve millions or billions of requests.

A language model trained once may perform inference continuously for years.

Inference systems therefore optimize:

- latency
- throughput
- memory usage
- energy efficiency
- hardware utilization
- serving cost

### Training Versus Inference

Training and inference have different computational characteristics.

| Property | Training | Inference |
|---|---|---|
| Forward pass | Yes | Yes |
| Backward pass | Yes | No |
| Gradient storage | Required | Not needed |
| Optimizer state | Required | Not needed |
| Numerical precision | Often mixed precision | Often lower precision |
| Latency sensitivity | Usually low | Often critical |
| Throughput focus | Tokens/images per second | Requests per second |

Inference, especially autoregressive decoding in large transformers, is usually constrained by memory bandwidth rather than by compute.

### Inference Workloads

Inference workloads vary substantially.

| Workload | Example |
|---|---|
| Batch inference | Offline embedding generation |
| Real-time inference | Chat applications |
| Streaming generation | Token-by-token LLM decoding |
| Edge inference | Mobile or embedded devices |
| Interactive multimodal systems | Vision-language assistants |

Different workloads require different optimizations.

For example:

| Goal | Important metric |
|---|---|
| Real-time chatbot | Low latency |
| Embedding pipeline | High throughput |
| Mobile model | Low memory and energy |
| Datacenter serving | Cost efficiency |

### Autoregressive Decoding

Large language models usually generate tokens autoregressively.

Given previous tokens:

$$
[t_1, t_2, \ldots, t_n],
$$

the model predicts:

$$
p(t_{n+1}\mid t_{\le n}).
$$

Then the next token is appended and the process repeats.

This sequential dependency limits parallelism because token $t_{n+1}$ must be generated before predicting $t_{n+2}$.

Training parallelizes across sequence positions. Inference can do so only while processing the prompt (the prefill phase); generating new tokens (the decode phase) remains sequential.

Autoregressive decoding is therefore a major inference bottleneck.
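
A minimal sketch of this loop, assuming a hypothetical `model` callable that maps a token sequence to a logits vector for the next position:

```python
import numpy as np

def greedy_decode(model, prompt_tokens, max_new_tokens, eos_id):
    """Greedy autoregressive decoding: one forward pass per generated token."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = model(tokens)               # forward pass over the sequence
        next_token = int(np.argmax(logits))  # greedy choice; sampling is also common
        tokens.append(next_token)            # sequential dependency: no parallelism
        if next_token == eos_id:
            break
    return tokens
```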

### KV Cache

Recomputing attention keys and values for every earlier token at each generation step would make transformer inference extremely inefficient.

Suppose a sequence has length $T$. Naively recomputing all attention states at every generation step would repeatedly process earlier tokens.

Instead, inference systems use a key-value cache, usually called a KV cache.

For each transformer layer:

| Stored tensor | Meaning |
|---|---|
| Keys | Attention key projections |
| Values | Attention value projections |

At generation step $t$, only the newest token requires new computation. Earlier keys and values are reused.

Without KV caching, generation cost grows roughly as:

$$
O(T^2)
$$

per generated token.

With caching, only new attention interactions are computed.

KV caching is essential for efficient transformer serving.
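
A single-layer, single-head sketch of the idea, assuming hypothetical projection matrices `w_q`, `w_k`, and `w_v`:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class SingleHeadKVCache:
    """Sketch of a KV cache for one layer and one attention head."""

    def __init__(self, w_q, w_k, w_v):
        self.w_q, self.w_k, self.w_v = w_q, w_k, w_v  # projection matrices
        self.keys, self.values = [], []               # one cached entry per past token

    def step(self, x_new):
        # Project only the newest token; all earlier K/V come from the cache.
        q = x_new @ self.w_q
        self.keys.append(x_new @ self.w_k)
        self.values.append(x_new @ self.w_v)
        K, V = np.stack(self.keys), np.stack(self.values)  # (t, d) each
        scores = softmax(K @ q / np.sqrt(q.shape[-1]))     # attention over t positions
        return scores @ V                                  # output for newest position
```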

### Memory Cost of KV Caches

KV caches consume substantial memory.

Approximate KV cache memory:

$$
\text{memory}
\propto
L \times T \times H \times D,
$$

where:

| Symbol | Meaning |
|---|---|
| $L$ | Number of layers |
| $T$ | Sequence length |
| $H$ | Attention heads |
| $D$ | Head dimension |

The proportionality hides a factor of two (both keys and values are stored), the bytes per element, and the batch size. Long-context inference therefore becomes memory-intensive.

Example pressures include:

- many concurrent users
- long conversations
- retrieval-augmented prompts
- large batch serving

Modern inference systems often spend more memory on KV caches than on model parameters.
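
A worked example under assumed, purely illustrative model dimensions (fp16 cache, one sequence):

```python
# KV cache size for one sequence, using assumed illustrative dimensions.
L, T, H, D = 32, 8192, 32, 128     # layers, sequence length, heads, head dim
bytes_per_elem = 2                  # fp16
kv_factor = 2                       # both keys and values are stored

cache_bytes = L * T * H * D * kv_factor * bytes_per_elem
print(f"{cache_bytes / 2**30:.1f} GiB per sequence")  # 4.0 GiB
```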

### Quantization

Quantization reduces numerical precision to lower memory and compute cost.

Instead of storing parameters in fp16 or fp32, systems may use:

| Format | Bits |
|---|---:|
| fp32 | 32 |
| fp16 | 16 |
| bf16 | 16 |
| int8 | 8 |
| int4 | 4 |

A quantized parameter is reconstructed with the affine approximation:

$$
W \approx s(q - z),
$$

where:

| Symbol | Meaning |
|---|---|
| $q$ | Quantized integer |
| $s$ | Scale |
| $z$ | Zero point |

Quantization reduces:

- memory footprint
- bandwidth usage
- inference latency

A 4-bit model may require roughly one-quarter the parameter memory of a 16-bit model.
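
A minimal numpy sketch of this affine scheme, quantizing a tensor to 8 bits and reconstructing it (int8 here; 4-bit schemes add bit packing and per-group scales):

```python
import numpy as np

def quantize(w, bits=8):
    """Asymmetric quantization: W ~= s * (q - z) over the unsigned integer range."""
    qmax = 2**bits - 1
    w_min, w_max = float(w.min()), float(w.max())
    s = (w_max - w_min) / qmax                # scale
    z = round(-w_min / s)                     # zero point
    q = np.clip(np.round(w / s + z), 0, qmax).astype(np.uint8)
    return q, s, z

def dequantize(q, s, z):
    return s * (q.astype(np.float32) - z)

w = np.random.randn(4, 4).astype(np.float32)
q, s, z = quantize(w)
print(np.abs(w - dequantize(q, s, z)).max())  # small reconstruction error
```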

### Quantization Tradeoffs

Quantization introduces approximation error.

Tradeoffs include:

| Advantage | Cost |
|---|---|
| Lower memory | Lower numerical precision |
| Faster inference | Possible accuracy degradation |
| Larger batch serving | More implementation complexity |

Some layers are more sensitive than others.

Common approaches include:

| Method | Idea |
|---|---|
| Post-training quantization | Quantize after training |
| Quantization-aware training | Simulate quantization during training |
| Mixed-precision quantization | Different layers use different precision |

Modern language models can often tolerate surprisingly aggressive quantization.

### Weight-Only Quantization

In many transformer deployments, particularly at small batch sizes and moderate context lengths, weights dominate memory usage and bandwidth.

Weight-only quantization stores weights in lower precision while keeping activations in higher precision.

Example:

| Tensor type | Precision |
|---|---|
| Weights | int4 |
| Activations | fp16 |
| KV cache | fp16 |

This approach is attractive because it simplifies implementation while greatly reducing parameter memory.
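
A sketch of a weight-only quantized linear layer, using int8 with a single symmetric scale for simplicity (production int4 kernels add packing and per-group scales):

```python
import numpy as np

class WeightOnlyLinear:
    """Weight-only quantized linear layer sketch: int8 weights, float activations."""

    def __init__(self, w_fp):
        self.scale = float(np.abs(w_fp).max()) / 127.0   # symmetric per-tensor scale
        self.q = np.round(w_fp / self.scale).astype(np.int8)

    def __call__(self, x):
        # Dequantize weights at compute time; activations stay in float.
        return x @ (self.q.astype(np.float32) * self.scale)
```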

### Activation Quantization

Activation quantization reduces precision of intermediate tensors during inference.

This further reduces memory and bandwidth, but activations are often more sensitive than weights.

Challenges include:

- outlier activations
- varying tensor distributions
- dynamic ranges changing during inference

Activation quantization is especially difficult for large transformers, whose activations tend to contain extreme outlier channels.

### Operator Fusion

Modern neural networks contain many small operations:

- matrix multiplication
- bias addition
- normalization
- activation functions

Naively executing each operation separately creates overhead from:

- kernel launches
- memory reads and writes
- synchronization

Operator fusion combines multiple operations into one kernel.

Example:

$$
y = \text{GELU}(Wx + b)
$$

may be fused into a single kernel instead of three separate operations:

1. matrix multiplication
2. bias addition
3. activation

Fusion improves:

| Benefit | Reason |
|---|---|
| Throughput | Less overhead |
| Memory efficiency | Fewer intermediate tensors |
| Cache locality | Better reuse |

Inference compilers rely heavily on fusion.
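
A conceptual sketch of the dataflow being merged; real fusion happens inside compiled GPU kernels, not in Python, so this only illustrates which intermediates disappear:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def unfused(x, W, b):
    t1 = x @ W       # intermediate tensor written to memory
    t2 = t1 + b      # second intermediate, second pass over the data
    return gelu(t2)  # third pass

def fused(x, W, b):
    # Conceptually one kernel: bias and GELU are applied in the matmul
    # epilogue before results leave registers.
    return gelu(x @ W + b)
```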

### Compilation and Graph Optimization

Eager execution is flexible but may introduce overhead.

Inference systems often convert models into optimized computation graphs.

Common graph optimizations include:

| Optimization | Purpose |
|---|---|
| Operator fusion | Reduce overhead |
| Constant folding | Precompute constants |
| Dead code elimination | Remove unused operations |
| Kernel selection | Choose optimized implementations |
| Layout optimization | Improve memory access |

Common inference runtimes include:

| Runtime | Use |
|---|---|
| TorchScript | PyTorch graph execution |
| TensorRT | NVIDIA inference optimization |
| ONNX Runtime | Portable graph execution |
| TVM | Compiler optimization |
| XLA | Accelerated graph compilation |
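
As one concrete example, PyTorch 2.x exposes graph capture and kernel fusion through `torch.compile`:

```python
import torch

# A small module compiled into an optimized graph (PyTorch 2.x API).
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.GELU(),
    torch.nn.Linear(512, 512),
).eval()

compiled = torch.compile(model)      # traces, fuses, and selects kernels

with torch.inference_mode():         # disables autograd bookkeeping
    y = compiled(torch.randn(8, 512))
```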

### Batch Inference

Inference systems often combine requests into batches.

Instead of processing one example:

$$
x_1,
$$

the system processes:

$$
[x_1, x_2, \ldots, x_B].
$$

Batching improves hardware utilization because GPUs are optimized for large tensor operations.

Advantages:

| Benefit | Reason |
|---|---|
| Higher throughput | Better GPU occupancy |
| Better amortization | Shared kernel overhead |
| Improved efficiency | Larger matrix multiplications |

Disadvantages:

| Problem | Explanation |
|---|---|
| Higher latency | Requests wait for batching |
| Uneven sequence lengths | Padding inefficiency |
| Scheduling complexity | Dynamic request arrival |

Serving systems must balance throughput against latency.
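
A sketch of packing variable-length requests into one padded batch, which also shows where the padding waste comes from:

```python
import numpy as np

def pad_batch(requests, pad_id=0):
    """Pack variable-length token lists into one (B, T_max) array plus a mask."""
    max_len = max(len(r) for r in requests)
    batch = np.full((len(requests), max_len), pad_id, dtype=np.int64)
    mask = np.zeros((len(requests), max_len), dtype=bool)
    for i, r in enumerate(requests):
        batch[i, :len(r)] = r
        mask[i, :len(r)] = True
    return batch, mask

batch, mask = pad_batch([[5, 7, 9], [3], [8, 2, 4, 6, 1]])
print(batch.shape, mask.sum() / mask.size)  # (3, 5), only 0.6 of slots useful
```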

### Continuous Batching

Traditional batching waits for a full batch before execution.

Continuous batching dynamically inserts and removes requests during generation.

This is especially important for LLM serving because different requests finish at different times.

Example:

| Request | Length |
|---|---:|
| A | 20 tokens |
| B | 300 tokens |
| C | 50 tokens |

Without continuous batching, short requests may wait behind long requests.

Continuous batching keeps the GPU busy while minimizing wasted slots.

Modern LLM serving systems heavily rely on this technique.
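
A toy scheduling loop, assuming a hypothetical `decode_step` that advances every active request by one token and returns the requests that just finished (requests here can be simple IDs):

```python
from collections import deque

def serve(queue: deque, decode_step, max_batch=8):
    """Toy continuous-batching loop: freed slots are refilled every step."""
    active, done = [], []
    while queue or active:
        # Admit waiting requests immediately instead of waiting for a full batch.
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        finished = decode_step(active)      # one decoding step for the whole batch
        done.extend(finished)
        active = [r for r in active if r not in finished]
    return done
```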

### Speculative Decoding

Autoregressive generation is sequential and slow.

Speculative decoding accelerates generation using a smaller draft model.

Workflow:

1. small model predicts several candidate tokens
2. large model verifies them
3. accepted tokens are committed
4. rejected tokens are recomputed

If the draft model predicts correctly often enough, throughput increases substantially.

The idea exploits the fact that verifying $k$ drafted tokens takes a single parallel forward pass of the large model, whereas generating them one by one would take $k$ sequential passes.
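
A toy sketch with greedy acceptance, assuming hypothetical `draft_model` and `target_model` callables that each return the next token for a sequence; production systems instead use a rejection-sampling rule that preserves the target model's distribution:

```python
def speculative_step(draft_model, target_model, tokens, k=4):
    """One speculative decoding step with simplified greedy acceptance."""
    # 1. Draft k candidate tokens cheaply with the small model.
    draft = list(tokens)
    for _ in range(k):
        draft.append(draft_model(draft))

    # 2. Verify: in a real system the large model scores all k positions in
    #    one parallel pass; the per-position call here is a stand-in.
    accepted = list(tokens)
    for i in range(len(tokens), len(draft)):
        target_token = target_model(accepted)
        accepted.append(target_token)   # 3. commit the large model's token
        if target_token != draft[i]:
            break                       # 4. first disagreement ends the step
    return accepted
```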

### Mixture-of-Experts Inference

Mixture-of-experts models activate only part of the network for each token.

Instead of computing all experts:

$$
f(x) =
\sum_{i=1}^{N}
g_i(x)E_i(x),
$$

only a subset is selected.

This reduces computation per token while increasing total parameter count.

However, MoE inference introduces routing complexity:

- token dispatch
- load balancing
- expert communication

Efficient MoE serving is therefore partly a systems problem.
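
A one-token numpy sketch of top-$k$ routing matching the formula above, where `experts` is a list of callables $E_i$ and `gate_w` produces the gate scores $g_i(x)$:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_forward(x, gate_w, experts, k=2):
    """Top-k mixture-of-experts forward pass for a single token (sketch)."""
    scores = softmax(x @ gate_w)               # g_i(x) over all N experts
    top = np.argsort(scores)[-k:]              # indices of the k largest gates
    weights = scores[top] / scores[top].sum()  # renormalize over selected experts
    # Only the k selected experts execute, so per-token compute stays small
    # while total parameter count grows with N.
    return sum(w * experts[i](x) for w, i in zip(weights, top))
```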

### Attention Optimization

Attention becomes expensive for long contexts.

Standard attention complexity:

$$
O(T^2),
$$

where $T$ is sequence length.

Long-context systems therefore use:

| Method | Idea |
|---|---|
| FlashAttention | Memory-efficient kernels |
| Sliding-window attention | Local attention regions |
| Sparse attention | Ignore many token pairs |
| Linear attention | Approximate softmax attention |
| Paged attention | Efficient KV cache management |

FlashAttention became especially important because it reduces memory traffic while preserving exact attention computation.
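
As an illustration of one of these methods, a sliding-window mask restricts each position to a local region, cutting attention cost from $O(T^2)$ to roughly $O(Tw)$ for window size $w$:

```python
import numpy as np

def sliding_window_mask(T, window):
    """Causal sliding-window mask: position i attends to [i - window + 1, i]."""
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    return (j <= i) & (j > i - window)

print(sliding_window_mask(5, 2).astype(int))  # banded lower-triangular pattern
```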

### Paged Attention

Large language model serving often suffers from KV cache fragmentation.

Paged attention organizes KV memory into blocks or pages, similar to virtual memory systems.

Benefits include:

| Benefit | Explanation |
|---|---|
| Better memory utilization | Reduced fragmentation |
| Flexible request scheduling | Easier dynamic batching |
| Efficient cache reuse | Improved serving throughput |

Paged attention systems became important for large-scale multi-user inference servers.
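
A toy sketch of the bookkeeping only, in the spirit of vLLM's PagedAttention; real systems also manage the KV tensors inside each block and can share blocks between sequences:

```python
class PagedKVAllocator:
    """Toy paged KV allocator: fixed-size blocks plus per-sequence block tables."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # pool of free physical blocks
        self.tables = {}                      # seq_id -> list of block ids

    def append_token(self, seq_id, length):
        # `length` is the token count before this append: a new block is
        # needed only at block boundaries, so memory grows in pages rather
        # than one large contiguous region.
        if length % self.block_size == 0:
            self.tables.setdefault(seq_id, []).append(self.free.pop())

    def release(self, seq_id):
        # Finished sequences return whole blocks to the pool: no fragmentation.
        self.free.extend(self.tables.pop(seq_id, []))
```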

### CPU, GPU, and Accelerator Inference

Inference hardware varies widely.

| Hardware | Strength |
|---|---|
| CPU | Flexibility, low-volume serving |
| GPU | High throughput |
| TPU | Large-scale serving |
| Edge accelerators | Low power |
| Mobile NPUs | On-device inference |

The optimal deployment depends on:

- latency requirements
- throughput requirements
- memory constraints
- deployment environment
- cost targets

A small quantized model may run efficiently on a phone. A frontier language model may require dozens of GPUs for interactive serving.

### Edge Inference

Edge inference runs models close to the user:

- phones
- browsers
- robots
- embedded devices
- autonomous systems

Advantages:

| Benefit | Reason |
|---|---|
| Lower latency | No network round-trip |
| Better privacy | Data stays local |
| Offline capability | No server required |

Constraints:

| Constraint | Problem |
|---|---|
| Limited memory | Small devices |
| Power consumption | Battery limits |
| Thermal limits | Sustained compute restrictions |

Edge systems therefore rely heavily on:

- quantization
- pruning
- compact architectures
- hardware-specific optimization

### Serving Systems

Modern inference serving systems coordinate:

- batching
- scheduling
- memory management
- caching
- load balancing
- request routing

Common serving frameworks include:

| Framework | Use |
|---|---|
| TorchServe | PyTorch deployment |
| Triton Inference Server | Multi-model serving |
| vLLM | Efficient LLM serving |
| TensorRT-LLM | NVIDIA optimized LLM inference |
| Ray Serve | Distributed serving |

Serving infrastructure often becomes a major engineering domain separate from model training.

### Cost as the Main Constraint

At scale, inference cost dominates deployment economics.

A model serving millions of users may process enormous token volumes daily.

Key cost drivers include:

| Driver | Impact |
|---|---|
| Parameter count | Memory and compute |
| Context length | Attention cost |
| Output length | Sequential decoding cost |
| Concurrent users | KV cache memory |
| Precision | Hardware efficiency |

Inference optimization therefore directly affects commercial viability.

### The Central Tradeoff

Inference optimization balances:

$$
\text{quality}
\leftrightarrow
\text{latency}
\leftrightarrow
\text{throughput}
\leftrightarrow
\text{memory}
\leftrightarrow
\text{cost}.
$$

Improving one dimension often worsens another.

Examples:

| Optimization | Possible downside |
|---|---|
| Lower precision | Accuracy degradation |
| Larger batches | Higher latency |
| Longer context | Higher memory use |
| Smaller models | Reduced capability |
| Aggressive caching | More memory consumption |

Inference engineering is therefore largely an optimization problem under hardware and economic constraints.

### From Models to Systems

A trained neural network is only one part of a production AI system.

Real-world deployment also requires:

- runtime compilers
- schedulers
- distributed caches
- request routers
- memory managers
- observability systems
- autoscaling infrastructure

As model size increased, inference optimization evolved from a minor deployment detail into one of the central engineering problems in modern AI systems.

