# Efficient AI Systems

Modern deep learning systems are constrained by compute, memory, bandwidth, latency, and energy. As models become larger, efficiency becomes a central engineering problem rather than a secondary optimization.

An efficient AI system maximizes useful capability per unit of resource. The resource may be GPU hours, memory capacity, power consumption, inference latency, network bandwidth, storage size, or monetary cost.

Efficiency matters at every scale. A mobile vision model must run under strict power limits. A cloud inference system must serve millions of requests at low latency. A frontier training run must keep thousands of accelerators fully utilized for weeks without wasting compute.

The goal of efficient AI is therefore broader than speed alone. A system is efficient when it achieves the required quality while minimizing operational cost and resource usage.

### Sources of Computational Cost

Deep learning workloads consume resources in several ways.

| Resource | Typical bottleneck |
|---|---|
| Compute | Matrix multiplication and attention |
| Memory | Activations, optimizer states, parameters |
| Bandwidth | GPU-to-GPU communication |
| Storage | Datasets and checkpoints |
| Latency | Sequential operations and decoding |
| Energy | Accelerator utilization and cooling |

In modern transformers, the dominant operations are often matrix multiplications:

$$
Y = XW.
$$

Large language models repeatedly apply linear projections, attention layers, normalization layers, and feedforward networks across many layers and tokens.

The total training cost grows roughly with:

$$
\text{compute} \propto \text{parameters} \times \text{tokens}.
$$
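As a back-of-envelope sketch (the model and dataset sizes here are hypothetical, and the factor of six FLOPs per parameter per token is a common approximation rather than an exact count):

```python
# rough training-cost estimate using the common ~6 FLOPs per
# parameter per token rule of thumb; numbers are hypothetical
params = 7e9    # a 7B-parameter model
tokens = 2e12   # a 2T-token training set

train_flops = 6 * params * tokens
print(f"~{train_flops:.1e} training FLOPs")  # ~8.4e+22
```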

Inference cost also scales with context length and decoding steps.

For autoregressive generation, inference is especially expensive because tokens are generated sequentially:

$$
p(x_t \mid x_{<t}).
$$

Each token depends on all previous tokens. This prevents full parallelization during decoding.

### Hardware Utilization

Theoretical GPU throughput is rarely achieved in practice. Real systems frequently fall short because the accelerator sits idle or runs small, inefficient operations.

Common causes include:

| Problem | Effect |
|---|---|
| Small batch sizes | Low arithmetic intensity |
| Slow data loading | GPU starvation |
| Excessive synchronization | Idle accelerators |
| Memory fragmentation | Reduced usable memory |
| Inefficient kernels | Lower throughput |
| Python overhead | CPU bottlenecks |
| Poor communication overlap | Network stalls |

Efficient systems maximize accelerator occupancy. The GPU should spend most of its time executing large tensor operations rather than waiting for data or synchronization.

A training step contains several phases:

1. Load batch  
2. Transfer tensors to accelerator  
3. Execute forward pass  
4. Compute loss  
5. Execute backward pass  
6. Synchronize gradients  
7. Update parameters  

If any stage becomes slow, the entire pipeline slows.
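A minimal sketch of keeping the accelerator fed, assuming placeholder `dataset`, `model`, `optimizer`, and `criterion` objects: worker processes prepare batches while the GPU computes, and pinned memory enables asynchronous host-to-device copies.

```python
import torch
from torch.utils.data import DataLoader

# worker processes prepare batches while the GPU computes;
# pinned memory enables asynchronous host-to-device copies
loader = DataLoader(dataset, batch_size=256,
                    num_workers=4, pin_memory=True)

for x, y in loader:
    x = x.cuda(non_blocking=True)   # overlaps with prior compute
    y = y.cuda(non_blocking=True)

    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```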

### Arithmetic Intensity

Arithmetic intensity measures the ratio between computation and memory access.

$$
\text{arithmetic intensity} =
\frac{\text{operations}}{\text{bytes moved}}
$$

Modern accelerators are extremely fast at arithmetic but comparatively slow at memory access, so operations that reuse loaded data tend to run faster.

Matrix multiplication has high arithmetic intensity because many multiply-add operations reuse the same matrix blocks.

Elementwise operations often have lower intensity because they move large amounts of data while performing little computation per byte.
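A quick comparison makes the gap concrete. In FP32, a square matrix multiplication performs $2n^3$ operations while moving roughly $12n^2$ bytes, whereas an elementwise add performs one operation per element while moving twelve bytes:

```python
# arithmetic intensity of a square FP32 matmul vs. an elementwise add
n = 4096

matmul_flops = 2 * n**3             # multiply-adds
matmul_bytes = 4 * (3 * n * n)      # read A, read B, write C
print(matmul_flops / matmul_bytes)  # ~683 FLOPs per byte

add_flops = n * n                   # one add per element
add_bytes = 4 * (3 * n * n)         # read x, read y, write z
print(add_flops / add_bytes)        # ~0.08 FLOPs per byte
```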

Efficient deep learning systems therefore prefer:

- large matrix multiplications
- fused operations
- batched computation
- contiguous memory layouts
- minimized tensor movement

### Batch Processing

Batching is one of the simplest efficiency techniques.

Instead of processing one example at a time, we process many examples simultaneously:

$$
X \in \mathbb{R}^{B \times d}
$$

where $B$ is the batch size.

Large batches improve hardware utilization because matrix operations become larger and more parallel.

In PyTorch:

```python id="tq6h8v"
import torch

# one large batched matmul: (1024, 4096) @ (4096, 8192)
x = torch.randn(1024, 4096, device="cuda")
w = torch.randn(4096, 8192, device="cuda")

y = x @ w
```

This large matrix multiplication uses the GPU efficiently.

However, extremely large batches can reduce optimization quality. Training may become unstable or generalize poorly. Practical systems therefore balance statistical efficiency and hardware efficiency.

### Mixed Precision Training

Modern accelerators support reduced-precision arithmetic such as FP16 and BF16.

Traditional training stores values in 32-bit floating point; the common formats compare as follows:

| Format | Bits |
|---|---:|
| FP32 | 32 |
| FP16 | 16 |
| BF16 | 16 |

Lower precision reduces memory usage and increases throughput.

Mixed precision training keeps some operations in higher precision while using lower precision for most tensor computations.

Benefits include:

| Benefit | Result |
|---|---|
| Smaller tensors | Lower memory usage |
| Faster tensor cores | Higher throughput |
| Larger batches | Better utilization |
| Reduced bandwidth | Faster communication |

In PyTorch:

```python id="m3o2aw"
import torch

scaler = torch.cuda.amp.GradScaler()

for x, y in loader:
    optimizer.zero_grad()

    # run the forward pass in reduced precision where safe
    with torch.cuda.amp.autocast():
        pred = model(x)
        loss = criterion(pred, y)

    # scale the loss to avoid FP16 gradient underflow,
    # then unscale before the optimizer step
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```

Mixed precision is now standard in large-scale training.

### Memory Bottlenecks

Memory is often the main limitation in large models.

Training memory includes:

| Component | Memory usage |
|---|---|
| Parameters | Model weights |
| Gradients | Backpropagation |
| Activations | Intermediate tensors |
| Optimizer states | Momentum, variance estimates |
| Temporary buffers | Kernel workspace |

For Adam-like optimizers, optimizer states alone may require multiple copies of each parameter tensor.

Suppose a model has $N$ parameters. Adam may require approximately:

- parameters
- gradients
- first moments
- second moments

Together these amount to roughly

$$
4N
$$

stored values before activations are included.

Large transformers therefore often become memory-bound before they become compute-bound.
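A rough estimate for a hypothetical 7B-parameter model with FP32 optimizer states, ignoring activations:

```python
# memory for parameters, gradients, and two Adam moments in FP32
params = 7e9
bytes_per_value = 4
copies = 4  # weights + gradients + first moments + second moments

total_gb = params * bytes_per_value * copies / 1e9
print(f"~{total_gb:.0f} GB before activations")  # ~112 GB
```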

### Gradient Checkpointing

Gradient checkpointing reduces activation memory.

Normally, backpropagation stores intermediate activations during the forward pass. These activations are reused during gradient computation.

Checkpointing stores only selected activations and recomputes others during backpropagation.

Tradeoff:

| Method | Memory | Compute |
|---|---|---|
| Standard training | High | Lower |
| Checkpointing | Lower | Higher |

This exchanges additional computation for reduced memory usage.

In PyTorch:

```python id="4g8f1u"
import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Linear(4096, 4096).cuda()
x = torch.randn(64, 4096, device="cuda", requires_grad=True)

def block(x):
    return layer(x)

# activations inside `block` are recomputed during backward instead of stored
y = checkpoint(block, x, use_reentrant=False)
```

Checkpointing enables larger models and longer sequences on fixed hardware.

### Operator Fusion

Many neural network operations are small and memory-bound. Launching a separate kernel for each one wastes bandwidth and adds scheduling overhead.

Operator fusion combines multiple operations into one kernel.

For example:

$$
y = \text{GELU}(xW + b)
$$

Instead of:

1. matrix multiplication  
2. bias addition  
3. activation  

a fused kernel performs them together.

Benefits include:

- fewer memory reads
- fewer memory writes
- fewer kernel launches
- improved cache reuse

Modern compilers and runtimes perform automatic fusion.

Examples include:

| System | Fusion support |
|---|---|
| TorchInductor | Kernel fusion |
| XLA | Graph optimization |
| TensorRT | Inference fusion |
| Triton | Custom fused kernels |

Fusion is especially important for inference systems where latency matters.
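In PyTorch, `torch.compile` routes eager code through TorchInductor, which can fuse chains of operations like the example above into fewer kernels. A minimal sketch:

```python
import torch

def gelu_mlp(x, w, b):
    # matmul, bias add, and activation: candidates for fusion
    return torch.nn.functional.gelu(x @ w + b)

fused = torch.compile(gelu_mlp)

x = torch.randn(1024, 4096, device="cuda")
w = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, device="cuda")
y = fused(x, w, b)
```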

### Quantization

Quantization reduces numerical precision by mapping values to compact integer formats.

Common formats include:

| Format | Bits |
|---|---:|
| FP32 | 32 |
| FP16 | 16 |
| INT8 | 8 |
| INT4 | 4 |

A quantized model stores weights and activations using fewer bits.
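A minimal sketch of the underlying idea for symmetric INT8 weight quantization (illustrative only; real libraries use calibrated, often per-channel scales):

```python
import torch

w = torch.randn(4096, 4096)

# map floats into the int8 range with a single scale factor
scale = w.abs().max() / 127.0
w_int8 = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)

# dequantize on the fly when the weight is used
w_hat = w_int8.float() * scale
print((w - w_hat).abs().max())  # small quantization error
```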

Benefits:

| Benefit | Result |
|---|---|
| Smaller model size | Reduced storage |
| Lower memory bandwidth | Faster inference |
| Better cache efficiency | Lower latency |
| Lower energy usage | Cheaper deployment |

Quantization may slightly reduce accuracy, especially at aggressive precision levels.

Two major approaches exist:

| Method | Description |
|---|---|
| Post-training quantization | Convert trained model afterward |
| Quantization-aware training | Simulate quantization during training |

In PyTorch:

```python id="qphd1y"
import torch

# dynamic post-training quantization: weights stored as int8,
# activations quantized on the fly at inference time
model_int8 = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},
    dtype=torch.qint8
)
```

Large language models often use 8-bit or 4-bit inference to reduce deployment cost.

### Pruning and Sparsity

Pruning removes less important parameters.

Suppose a parameter tensor contains many near-zero values. These values may contribute little to model behavior.

Pruning sets some parameters to zero:

$$
W_{ij} = 0.
$$

This creates sparse tensors.

Types of sparsity include:

| Type | Description |
|---|---|
| Unstructured sparsity | Arbitrary zero entries |
| Structured sparsity | Remove rows, columns, or blocks |
| Dynamic sparsity | Sparse patterns change during training |

Sparse models can reduce memory and computation, but hardware support matters. Dense matrix multiplication is highly optimized. Sparse acceleration is beneficial only when sparsity is sufficiently structured and supported by kernels.
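PyTorch ships utilities for unstructured magnitude pruning. A minimal sketch, assuming a simple linear layer:

```python
import torch
import torch.nn.utils.prune as prune

layer = torch.nn.Linear(4096, 4096)

# zero out the 50% of weights with smallest absolute value
prune.l1_unstructured(layer, name="weight", amount=0.5)

sparsity = (layer.weight == 0).float().mean()
print(f"sparsity: {sparsity:.2f}")  # ~0.50
```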

### Knowledge Distillation

Distillation transfers knowledge from a large model to a smaller model.

The large model is called the teacher. The smaller model is called the student.

Instead of training only on hard labels, the student learns from teacher outputs:

$$
p_{\text{teacher}}(y \mid x).
$$

Soft targets contain richer information about class relationships.
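A minimal sketch of a temperature-scaled distillation loss, assuming teacher and student logits are available (the $T^2$ factor keeps gradient magnitudes comparable across temperatures):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # soften both distributions with temperature T
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    # KL divergence between teacher and student, rescaled by T^2
    return F.kl_div(log_p_student, p_teacher,
                    reduction="batchmean") * T * T
```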

Benefits include:

- smaller inference models
- lower latency
- lower memory usage
- reduced deployment cost

Distillation is common in mobile systems, search ranking, speech models, and edge AI.

### Efficient Attention Mechanisms

Self-attention has quadratic complexity:

$$
\text{cost} \propto T^2
$$

where $T$ is sequence length.

This becomes expensive for long contexts.

Several efficient attention methods reduce this cost:

| Method | Idea |
|---|---|
| Sparse attention | Attend only to selected positions |
| Sliding-window attention | Local neighborhoods |
| Linear attention | Kernel approximations |
| FlashAttention | IO-aware implementation |
| Retrieval attention | External memory lookup |

FlashAttention is especially important because it improves memory efficiency without changing the mathematical result.

Instead of storing large intermediate attention matrices, it reorganizes computation to reduce memory movement.
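In PyTorch, `torch.nn.functional.scaled_dot_product_attention` dispatches to fused, memory-efficient kernels (including FlashAttention-style implementations) when hardware and tensor shapes allow:

```python
import torch
import torch.nn.functional as F

# (batch, heads, sequence, head_dim)
q = torch.randn(8, 16, 2048, 64, device="cuda", dtype=torch.float16)
k = torch.randn(8, 16, 2048, 64, device="cuda", dtype=torch.float16)
v = torch.randn(8, 16, 2048, 64, device="cuda", dtype=torch.float16)

# the full T x T attention matrix is never materialized in memory
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```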

### Efficient Architectures

Architectural design strongly affects efficiency.

Efficient architectures aim to maximize quality per FLOP.

Examples include:

| Architecture | Efficiency strategy |
|---|---|
| MobileNet | Depthwise separable convolutions |
| EfficientNet | Compound scaling |
| ConvNeXt | Simplified convolution design |
| Mamba-style models | State-space sequence modeling |
| Mixture-of-Experts | Sparse activation |
| Tiny transformers | Reduced parameter counts |

Depthwise separable convolution reduces computation dramatically.

A standard convolution cost is approximately:

$$
K^2 C_{\text{in}} C_{\text{out}} HW.
$$

Depthwise separable convolution decomposes this into a depthwise stage and a pointwise stage with approximate cost

$$
K^2 C_{\text{in}} HW + C_{\text{in}} C_{\text{out}} HW,
$$

reducing compute and parameters by a factor of roughly $1/C_{\text{out}} + 1/K^2$.
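A minimal sketch of the decomposition in PyTorch, using a grouped convolution for the depthwise stage and a 1x1 convolution for the pointwise stage:

```python
import torch
import torch.nn as nn

c_in, c_out, k = 128, 256, 3

# depthwise: one k x k filter per input channel (groups=c_in)
depthwise = nn.Conv2d(c_in, c_in, k, padding=1, groups=c_in)
# pointwise: 1x1 convolution mixes channels
pointwise = nn.Conv2d(c_in, c_out, 1)

x = torch.randn(1, c_in, 56, 56)
y = pointwise(depthwise(x))  # same shape as a standard k x k conv
```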

Efficient architectures are especially important for:

- mobile devices
- embedded systems
- robotics
- edge inference
- large-scale serving

### Sparse Expert Models

Mixture-of-Experts (MoE) models improve efficiency through conditional computation.

Instead of activating all parameters for every token, the system activates only selected expert subnetworks.

Suppose there are $E$ experts, but only $k$ are used per token:

$$
k \ll E.
$$

This allows very large total parameter counts while keeping compute manageable.
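A minimal sketch of top-$k$ routing, using a dense per-expert loop for clarity (the expert width and routing scheme here are illustrative; production systems dispatch tokens to experts in grouped batches):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model, num_experts, k):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        weights, idx = self.router(x).topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)  # renormalize top-k scores
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e  # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```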

Benefits:

| Benefit | Effect |
|---|---|
| Larger capacity | More specialized representations |
| Lower active compute | Faster scaling |
| Sparse activation | Better parameter efficiency |

Challenges include:

- load balancing
- routing instability
- communication overhead
- expert collapse

MoE systems are widely used in large-scale language models.

### Distributed Efficiency

Distributed training introduces communication costs.

Suppose gradients must be synchronized across GPUs:

$$
\nabla W =
\frac{1}{n}
\sum_{i=1}^{n}
\nabla W_i.
$$
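A minimal sketch of this synchronization with `torch.distributed`, assuming an initialized process group and a placeholder `model` (`DistributedDataParallel` performs the same all-reduce automatically, with bucketing and overlap):

```python
import torch.distributed as dist

# after loss.backward(): average gradients across all workers
for p in model.parameters():
    if p.grad is not None:
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= dist.get_world_size()
```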

Communication may become slower than computation.

Efficient distributed systems therefore overlap:

- computation
- communication
- data loading

Important techniques include:

| Technique | Purpose |
|---|---|
| Gradient bucketing | Reduce synchronization overhead |
| Overlap communication | Hide latency |
| Pipeline parallelism | Split layers across devices |
| Tensor parallelism | Split large operations |
| ZeRO optimization | Partition optimizer state |

Distributed efficiency determines whether scaling remains economical.

### Inference Optimization

Inference systems have different priorities from training systems.

Training optimizes throughput. Inference often optimizes:

- latency
- throughput
- memory usage
- serving cost

Autoregressive decoding is particularly expensive because tokens are generated sequentially.

Optimization techniques include:

| Technique | Purpose |
|---|---|
| KV caching | Reuse previous attention states |
| Speculative decoding | Reduce decoding latency |
| Quantized inference | Lower memory bandwidth |
| Continuous batching | Improve throughput |
| TensorRT compilation | Accelerate execution |
| Dynamic batching | Group requests efficiently |

KV caching stores previous key and value tensors so they do not need to be recomputed for every generated token.
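A minimal sketch of a per-layer cache, assuming tensors shaped `(batch, heads, sequence, head_dim)`:

```python
import torch

class KVCache:
    """Accumulates keys and values so each decoding step only
    computes projections for the newest token."""

    def __init__(self):
        self.k = None
        self.v = None

    def append(self, k_new, v_new):
        if self.k is None:
            self.k, self.v = k_new, v_new
        else:
            # grow along the sequence dimension
            self.k = torch.cat([self.k, k_new], dim=2)
            self.v = torch.cat([self.v, v_new], dim=2)
        return self.k, self.v
```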

### Energy Efficiency

Large-scale AI systems consume significant energy.

Training frontier models may require:

- large GPU clusters
- cooling systems
- high-bandwidth networking
- continuous power delivery

Energy efficiency is therefore a scientific and economic concern.

Energy usage depends on:

| Factor | Effect |
|---|---|
| Hardware efficiency | FLOPs per watt |
| Utilization | Idle hardware wastes power |
| Precision format | Lower precision reduces energy |
| Memory movement | Often more expensive than arithmetic |
| Cooling systems | Datacenter overhead |

Reducing unnecessary data movement is especially important because memory access may consume more energy than arithmetic itself.

Efficient AI systems therefore optimize both algorithms and physical infrastructure.

### Profiling and Measurement

Efficiency work requires measurement.

Important metrics include:

| Metric | Meaning |
|---|---|
| Throughput | Samples or tokens per second |
| Latency | Time per request |
| GPU utilization | Fraction of active compute |
| Memory usage | Peak allocation |
| FLOPs | Floating-point operations |
| Bandwidth | Data transfer rate |
| Energy | Power consumption |

PyTorch provides profiling tools:

```python id="4z3blv"
import torch

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA
    ]
) as prof:
    y = model(x)

# summarize operators by total GPU time
print(prof.key_averages().table(sort_by="cuda_time_total"))
```

Profiling often reveals unexpected bottlenecks such as synchronization, memory copies, or inefficient kernels.

### Efficiency Tradeoffs

Efficiency always involves tradeoffs.

| Tradeoff | Example |
|---|---|
| Compute vs memory | Gradient checkpointing |
| Precision vs accuracy | Quantization |
| Latency vs throughput | Dynamic batching |
| Capacity vs activation cost | Mixture-of-Experts |
| Parallelism vs communication | Distributed training |
| Model quality vs deployment cost | Distillation |

There is no universally optimal system. The best design depends on constraints.

A mobile device prioritizes latency and energy. A research cluster prioritizes throughput. A cloud inference service prioritizes cost per request.

Efficient AI engineering is therefore an optimization problem over many interacting variables.

### Summary

Efficient AI systems maximize useful capability while minimizing resource usage. Modern deep learning efficiency depends on algorithms, architectures, hardware, compilers, distributed systems, and deployment infrastructure.

Key techniques include:

- batching
- mixed precision
- checkpointing
- operator fusion
- quantization
- sparsity
- distillation
- efficient attention
- distributed optimization
- inference acceleration

As models continue to scale, efficiency becomes increasingly important. The future of deep learning depends not only on larger models, but also on better systems that use compute, memory, bandwidth, and energy more intelligently.

