Conceptual Questions
Explain the difference between parameter scaling, data scaling, and compute scaling.
Why do scaling laws often follow approximate power-law behavior?
What is compute-optimal training? Why can a smaller model trained on more data outperform a larger model trained on less data?
Explain the difference between training-time scaling and inference-time scaling.
Why is attention complexity quadratic in sequence length for standard transformers?
What are the main bottlenecks in large-scale AI systems?
Compare FP32, FP16, BF16, and INT8 computation. What tradeoffs exist between precision and efficiency?
Explain the purpose of:
- quantization
- pruning
- distillation
- gradient checkpointing
Why is memory movement often more expensive than arithmetic computation?
What is operator fusion? Why does it improve performance?
Explain the sim-to-real gap in robotics.
Compare imitation learning and reinforcement learning.
Why are world models important for embodied agents?
What are affordances? Give three examples from robotics.
Why is uncertainty estimation critical in scientific deep learning?
Explain the difference between correlation and causation.
Why are distribution shifts dangerous in deployed systems?
What is catastrophic forgetting?
Why is interpretability difficult in large neural networks?
Explain why benchmark accuracy alone is insufficient for evaluating advanced AI systems.
Mathematical Exercises
1. Parameter Scaling
Suppose validation loss follows a power law of the form
$$L(N) = a N^{-\alpha} + c,$$
where $N$ is the parameter count in billions and $a$, $\alpha$, $c$ are positive constants.
Compute the loss for a few representative parameter counts (for example, $N = 1$, $10$, and $100$).
What happens to the loss as $N \to \infty$?
2. Compute Scaling
Suppose compute scales as
$$C \propto N \cdot D,$$
where:
- $N$ = parameters
- $D$ = training tokens
If the compute budget doubles, describe three possible scaling strategies.
3. Attention Complexity
A transformer processes sequences of length $n$.
Standard attention requires on the order of $n^2$ operations, because the attention matrix has $n \times n$ entries.
Compare the relative attention cost as the sequence length grows.
How much larger is the attention matrix when the sequence length increases from 2048 to 8192?
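A back-of-the-envelope check, assuming the attention cost grows with the square of the sequence length:

$$\frac{\text{cost}(8192)}{\text{cost}(2048)} \approx \left(\frac{8192}{2048}\right)^{2} = 4^{2} = 16$$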
4. Mixed Precision Memory Savings
A model has 8 billion parameters.
Estimate parameter memory usage in:
- FP32
- FP16
Now assume training with Adam stores, for each parameter:
- the parameter value
- its gradient
- the Adam first moment
- the Adam second moment
Estimate the total training memory usage.
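A rough worked estimate, taking 1 GB as $10^9$ bytes and, for the training total, assuming all four quantities are kept in FP32 (actual mixed-precision recipes differ):

$$\text{FP32 parameters: } 8\times10^{9} \times 4\ \text{bytes} \approx 32\ \text{GB}, \qquad \text{FP16 parameters: } 8\times10^{9} \times 2\ \text{bytes} \approx 16\ \text{GB}$$

$$\text{Training total: } 8\times10^{9} \times (4 + 4 + 4 + 4)\ \text{bytes} \approx 128\ \text{GB}$$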
5. Quantization
Suppose a model requires 40 GB in FP16.
Estimate storage size after:
- INT8 quantization
- INT4 quantization
Ignore metadata overhead.
PyTorch Exercises
1. Count Parameters
Write a PyTorch function that counts:
- total parameters
- trainable parameters
for any model.
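A minimal sketch of such a helper, assuming any standard torch.nn.Module; the small Sequential model below is only for demonstration:

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> tuple[int, int]:
    """Return (total, trainable) parameter counts for a model."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

# Example usage on a toy model
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
total, trainable = count_parameters(model)
print(f"total={total:,}  trainable={trainable:,}")
```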
2. Mixed Precision Training
Modify a standard training loop to use:
- autocast
- GradScaler
Measure memory usage before and after.
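One way the modified loop might look; model, loader, optimizer, and criterion are assumed to already exist, and a CUDA device is assumed available:

```python
import torch

scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid FP16 gradient underflow

for inputs, targets in loader:  # `loader`, `model`, `optimizer`, `criterion` assumed defined
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        outputs = model(inputs)
        loss = criterion(outputs, targets)
    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.step(optimizer)         # unscales gradients, then steps the optimizer
    scaler.update()

print(f"peak memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```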
3. Gradient Checkpointing
Apply gradient checkpointing to a transformer block and compare:
- peak memory usage
- training speed
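A minimal sketch using torch.utils.checkpoint; the transformer block and tensor sizes below are illustrative, and peak memory can be compared with torch.cuda.max_memory_allocated() when running on a GPU:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
x = torch.randn(8, 1024, 512, requires_grad=True)

# Without checkpointing: all intermediate activations are kept for backward.
y = block(x)

# With checkpointing: activations are discarded in the forward pass and
# recomputed during backward, trading extra compute for lower peak memory.
y_ckpt = checkpoint(block, x, use_reentrant=False)

loss = y_ckpt.sum()
loss.backward()
```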
4. Profiling
Use torch.profiler to identify:
- slow operations
- memory bottlenecks
- synchronization overhead
in a training script.
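A starting point with torch.profiler; train_step here is a hypothetical stand-in for one iteration of your training loop:

```python
import torch
from torch.profiler import profile, ProfilerActivity

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

# `train_step` is a placeholder for one iteration of your training loop.
with profile(activities=activities, profile_memory=True, record_shapes=True) as prof:
    for _ in range(10):
        train_step()

# Sort by self time to surface slow kernels; profile_memory=True exposes allocation hotspots.
sort_key = "self_cuda_time_total" if torch.cuda.is_available() else "self_cpu_time_total"
print(prof.key_averages().table(sort_by=sort_key, row_limit=20))
```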
5. Quantized Inference
Convert a pretrained model to INT8 and measure:
- inference latency
- model size
- accuracy degradation
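One simple route is dynamic INT8 quantization of the linear layers; the model below is a placeholder for your pretrained float model on CPU:

```python
import torch
import torch.nn as nn

# `model` stands in for any pretrained float model running on CPU.
model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 10)).eval()

# Dynamic quantization stores Linear weights in INT8; activations are quantized at runtime.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 768)
with torch.no_grad():
    baseline = model(x)
    quantized_out = quantized(x)
print("max abs difference:", (baseline - quantized_out).abs().max().item())
```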
Research Exercises
Read a recent scaling-law paper and summarize:
- the scaling variables
- the fitted equation
- the experimental setup
- the limitations
Compare two efficient transformer architectures.
Study a robotics benchmark and identify:
- sensor inputs
- action space
- evaluation metrics
- failure modes
Investigate one scientific deep learning system such as:
- protein folding
- weather forecasting
- molecular generation
- neural operators
Analyze one open problem in AI safety or alignment and explain why it remains difficult.
Open-Ended Projects
Project 1. Scaling Experiment
Train language models of several sizes and fit a scaling curve of the form $L(N) = a N^{-\alpha} + c$.
Plot:
- parameter count vs validation loss
- compute vs validation loss
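A sketch of the curve-fitting step, assuming you have already collected (parameter count, validation loss) pairs; the measurements below are placeholders for your own runs:

```python
import numpy as np
from scipy.optimize import curve_fit

# Placeholder (parameter count, validation loss) measurements -- replace with your results.
params = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
losses = np.array([4.1, 3.8, 3.5, 3.3, 3.1])

def power_law(n, a, alpha, c):
    """Saturating power law L(N) = a * N^(-alpha) + c."""
    return a * n ** (-alpha) + c

(a, alpha, c), _ = curve_fit(power_law, params, losses, p0=[10.0, 0.1, 2.0], maxfev=10000)
print(f"L(N) ≈ {a:.2f} * N^(-{alpha:.3f}) + {c:.2f}")
```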
Project 2. Efficient Inference System
Build a text-generation service with:
- quantized inference
- KV caching
- batching
- latency measurement
Measure throughput and memory usage.
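A small helper for the latency piece; generate_fn is a hypothetical callable that produces one batch of completions, with quantization, KV caching, and batching handled inside it:

```python
import statistics
import time

import torch

def measure_latency(generate_fn, n_warmup: int = 3, n_runs: int = 20):
    """Time a generation callable; returns (mean_ms, p95_ms)."""
    for _ in range(n_warmup):  # warm up kernels and caches before timing
        generate_fn()
    times = []
    for _ in range(n_runs):
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # exclude queued-but-unfinished GPU work
        start = time.perf_counter()
        generate_fn()
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        times.append((time.perf_counter() - start) * 1000)
    return statistics.mean(times), statistics.quantiles(times, n=20)[-1]
```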
Project 3. Physics-Informed Neural Network
Train a PINN to solve a partial differential equation such as:
- heat equation
- wave equation
- Burgers’ equation
Compare learned and analytical solutions.
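A compact PINN sketch for the 1D heat equation $u_t = \alpha u_{xx}$ with $u(x,0)=\sin(\pi x)$ and zero boundary conditions; the network size, sampling, and loss weighting are illustrative choices:

```python
import torch
import torch.nn as nn

alpha = 0.1  # diffusivity for u_t = alpha * u_xx on x in [0, 1], t in [0, 1]
net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

def pde_residual(x, t):
    """PDE residual u_t - alpha * u_xx at sampled interior points."""
    x.requires_grad_(True)
    t.requires_grad_(True)
    u = net(torch.cat([x, t], dim=1))
    u_t = torch.autograd.grad(u, t, torch.ones_like(u), create_graph=True)[0]
    u_x = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x, x, torch.ones_like(u_x), create_graph=True)[0]
    return u_t - alpha * u_xx

for step in range(2000):
    # Interior residual loss
    x, t = torch.rand(256, 1), torch.rand(256, 1)
    loss_pde = pde_residual(x, t).pow(2).mean()
    # Initial condition u(x, 0) = sin(pi x)
    x0 = torch.rand(64, 1)
    loss_ic = (net(torch.cat([x0, torch.zeros_like(x0)], dim=1)) - torch.sin(torch.pi * x0)).pow(2).mean()
    # Boundary condition u(0, t) = u(1, t) = 0
    xb = torch.cat([torch.zeros(32, 1), torch.ones(32, 1)])
    tb = torch.rand(64, 1)
    loss_bc = net(torch.cat([xb, tb], dim=1)).pow(2).mean()
    loss = loss_pde + loss_ic + loss_bc
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

For this setup the analytical solution is $u(x,t) = \sin(\pi x)\, e^{-\alpha \pi^2 t}$, which can be evaluated on a grid for the comparison.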
Project 4. Robot Policy Learning
Train a robot policy in simulation using:
- imitation learning
- reinforcement learning
Evaluate sim-to-real transfer performance if hardware is available.
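A behavior-cloning (imitation learning) starting point; the dimensions and demonstration data below are placeholders for observations and actions logged from an expert policy in your simulator:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

obs_dim, act_dim = 24, 7  # placeholder dimensions for a manipulation task

# Placeholder demonstrations -- replace with (observation, action) pairs from an expert.
demo_obs = torch.randn(10_000, obs_dim)
demo_act = torch.randn(10_000, act_dim)
loader = DataLoader(TensorDataset(demo_obs, demo_act), batch_size=256, shuffle=True)

policy = nn.Sequential(
    nn.Linear(obs_dim, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, act_dim),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

for epoch in range(10):
    for obs, act in loader:
        loss = nn.functional.mse_loss(policy(obs), act)  # regress expert actions
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```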
Project 5. Long-Context Evaluation
Evaluate transformer performance as context length increases.
Measure:
- memory usage
- inference latency
- retrieval accuracy
- degradation under long sequences
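A small measurement sketch that uses scaled dot-product attention as a stand-in for a full model; the head counts and sequence lengths are illustrative:

```python
import time

import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
n_heads, head_dim = 16, 64

for seq_len in [1024, 2048, 4096, 8192]:
    q = k = v = torch.randn(1, n_heads, seq_len, head_dim, device=device, dtype=dtype)
    if device == "cuda":
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
    start = time.perf_counter()
    out = F.scaled_dot_product_attention(q, k, v)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed_ms = (time.perf_counter() - start) * 1000
    peak_gb = torch.cuda.max_memory_allocated() / 1e9 if device == "cuda" else float("nan")
    print(f"seq_len={seq_len:5d}  latency={elapsed_ms:8.2f} ms  peak_mem={peak_gb:.2f} GB")
```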
Chapter Summary
This chapter explored future directions in deep learning:
- scaling laws
- efficient AI systems
- scientific deep learning
- robotics and embodied AI
- open research problems
Modern AI research increasingly combines:
- large-scale optimization
- multimodal learning
- reasoning systems
- scientific computation
- interaction with tools and environments
Future progress will depend not only on larger models, but also on deeper understanding, more efficient systems, stronger evaluation methods, and safer deployment practices.