# Exercises

### Conceptual Questions

1. Explain the difference between parameter scaling, data scaling, and compute scaling.

2. Why do scaling laws often follow approximate power-law behavior?

3. What is compute-optimal training? Why can a smaller model trained on more data outperform a larger model trained on less data?

4. Explain the difference between training-time scaling and inference-time scaling.

5. Why is attention complexity quadratic in sequence length for standard transformers?

6. What are the main bottlenecks in large-scale AI systems?

7. Compare FP32, FP16, BF16, and INT8 computation. What tradeoffs exist between precision and efficiency?

8. Explain the purpose of:
   - quantization
   - pruning
   - distillation
   - gradient checkpointing

9. Why is memory movement often more expensive than arithmetic computation?

10. What is operator fusion? Why does it improve performance?

11. Explain the sim-to-real gap in robotics.

12. Compare imitation learning and reinforcement learning.

13. Why are world models important for embodied agents?

14. What are affordances? Give three examples from robotics.

15. Why is uncertainty estimation critical in scientific deep learning?

16. Explain the difference between correlation and causation.

17. Why are distribution shifts dangerous in deployed systems?

18. What is catastrophic forgetting?

19. Why is interpretability difficult in large neural networks?

20. Explain why benchmark accuracy alone is insufficient for evaluating advanced AI systems.

---

### Mathematical Exercises

#### 1. Parameter Scaling

Suppose validation loss follows:

$$
L(N) = 1.5N^{-0.08} + 1.2
$$

where $N$ is the parameter count in billions.

1. Compute the loss for:
   - $N = 1$
   - $N = 10$
   - $N = 100$

2. What happens to the loss as $N \rightarrow \infty$?
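To check your hand calculations, the loss curve can be evaluated numerically (a minimal sketch; the constants come directly from the formula above):

```python
# Evaluate L(N) = 1.5 * N**(-0.08) + 1.2, with N in billions of parameters.
def loss(n_billion: float) -> float:
    return 1.5 * n_billion ** -0.08 + 1.2

for n in (1, 10, 100):
    print(f"N = {n:>3}B  ->  L = {loss(n):.3f}")

# As N grows, the power-law term shrinks toward zero, so the loss
# approaches the irreducible floor of 1.2.
```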

---

#### 2. Compute Scaling

Suppose compute scales as:

$$
C \propto ND
$$

where:
- $N$ = parameters
- $D$ = training tokens

If compute budget doubles, describe three possible scaling strategies.
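As a starting point, three candidate allocations of a doubled budget can be enumerated directly from $C \propto ND$ (a sketch in arbitrary units; the $\sqrt{2}$ split is one balanced choice, not the only one):

```python
import math

# Three ways to spend a doubled compute budget C' = 2C, given C ∝ N * D.
N, D = 1.0, 1.0  # current parameter count and token count (arbitrary units)
strategies = {
    "double parameters": (2 * N, D),
    "double data":       (N, 2 * D),
    "scale both":        (math.sqrt(2) * N, math.sqrt(2) * D),
}
for name, (n, d) in strategies.items():
    print(f"{name:>17}: N*D = {n * d:.2f}x baseline")
```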

---

#### 3. Attention Complexity

A transformer processes sequences of length $T$.

Standard attention requires:

$$
O(T^2)
$$

operations.

1. Compare the relative attention cost for:
   - $T = 512$
   - $T = 2048$
   - $T = 8192$

2. How much larger is the attention matrix when sequence length increases from 2048 to 8192?
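The relative costs follow directly from the quadratic term; a quick numerical check:

```python
# Relative cost of O(T^2) attention, normalized to T = 512.
base = 512
for T in (512, 2048, 8192):
    print(f"T = {T:>4}: {(T / base) ** 2:>5.0f}x the cost at T = 512")

# Going from 2048 to 8192 grows the T x T attention matrix
# by a factor of (8192 / 2048) ** 2 = 16.
```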

---

#### 4. Mixed Precision Memory Savings

A model has 8 billion parameters.

1. Estimate parameter memory usage in:
   - FP32
   - FP16

2. Assume FP32 training stores four values per parameter:
   - the parameter itself
   - its gradient
   - the Adam first moment
   - the Adam second moment

   Estimate total training memory usage.
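The estimate reduces to bytes-per-parameter arithmetic (a sketch that takes 1 GB = 10^9 bytes and ignores activations, buffers, and framework overhead):

```python
params = 8e9  # 8 billion parameters

fp32_gb = params * 4 / 1e9  # 4 bytes per FP32 value
fp16_gb = params * 2 / 1e9  # 2 bytes per FP16 value
print(f"parameters, FP32: {fp32_gb:.0f} GB")
print(f"parameters, FP16: {fp16_gb:.0f} GB")

# FP32 training with Adam stores four FP32 values per parameter:
# the parameter, its gradient, and the two Adam moments.
train_gb = params * 4 * 4 / 1e9
print(f"FP32 Adam training state: {train_gb:.0f} GB")
```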

---

#### 5. Quantization

Suppose a model requires 40 GB in FP16.

Estimate storage size after:
- INT8 quantization
- INT4 quantization

Ignore metadata overhead.
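Since storage scales linearly with bits per weight, the estimate is a one-line ratio (ignoring metadata overhead, as stated):

```python
fp16_gb = 40
# Size scales with bits per weight relative to FP16's 16 bits.
for name, bits in (("INT8", 8), ("INT4", 4)):
    print(f"{name}: {fp16_gb * bits / 16:.0f} GB")
```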

---

### PyTorch Exercises

#### 1. Count Parameters

Write a PyTorch function that counts:
- total parameters
- trainable parameters

for any model.
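One possible solution sketch, using `Parameter.numel()` and the `requires_grad` flag (the example model and the frozen layer are illustrative choices):

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> tuple[int, int]:
    """Return (total, trainable) parameter counts for any module."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

# Example: a small MLP with its last layer frozen.
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 1))
for p in model[2].parameters():
    p.requires_grad = False

total, trainable = count_parameters(model)
print(total, trainable)
```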

---

#### 2. Mixed Precision Training

Modify a standard training loop to use:
- `autocast`
- `GradScaler`

Measure memory usage before and after.
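A minimal, device-portable skeleton for the modified loop (the toy model, sizes, and learning rate are placeholders; on CPU this falls back to BF16 autocast and disables the scaler, since loss scaling only matters for FP16):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
# FP16 + loss scaling on GPU; BF16 autocast (no scaler needed) on CPU.
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1)).to(device)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(16, 32, device=device)
y = torch.randn(16, 1, device=device)

for _ in range(3):
    opt.zero_grad()
    with torch.autocast(device_type=device, dtype=amp_dtype):
        loss = nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()  # scaling guards against FP16 gradient underflow
    scaler.step(opt)
    scaler.update()

# On GPU, compare torch.cuda.max_memory_allocated() with and without autocast
# to measure the memory saving.
print(f"final loss: {loss.item():.4f}")
```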

---

#### 3. Gradient Checkpointing

Apply gradient checkpointing to a transformer block and compare:
- peak memory usage
- training speed

---

#### 4. Profiling

Use `torch.profiler` to identify:
- slow operations
- memory bottlenecks
- synchronization overhead

in a training script.
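A minimal profiling harness to adapt to your own training script (CPU-only here; add `ProfilerActivity.CUDA` when profiling on GPU):

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 128))
x = torch.randn(32, 128)

with profile(activities=[ProfilerActivity.CPU], profile_memory=True) as prof:
    for _ in range(5):
        model(x).sum().backward()

# Sorting by total CPU time surfaces the slowest operators; the memory
# columns point to allocation hot spots.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```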

---

#### 5. Quantized Inference

Convert a pretrained model to INT8 and measure:
- inference latency
- model size
- accuracy degradation

---

### Research Exercises

1. Read a recent scaling-law paper and summarize:
   - the scaling variables
   - the fitted equation
   - the experimental setup
   - the limitations

2. Compare two efficient transformer architectures.

3. Study a robotics benchmark and identify:
   - sensor inputs
   - action space
   - evaluation metrics
   - failure modes

4. Investigate one scientific deep learning system such as:
   - protein folding
   - weather forecasting
   - molecular generation
   - neural operators

5. Analyze one open problem in AI safety or alignment and explain why it remains difficult.

---

### Open-Ended Projects

#### Project 1. Scaling Experiment

Train language models of several sizes and fit a scaling curve:

$$
L(N) \approx aN^{-b} + c
$$

Plot:
- parameter count vs validation loss
- compute vs validation loss
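Once you have measured losses, the curve can be fitted with nonlinear least squares (a sketch assuming SciPy is available; the synthetic losses below stand in for your measurements):

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(n, a, b, c):
    return a * n ** (-b) + c

# Synthetic data standing in for measured validation losses.
n = np.array([0.1, 0.3, 1.0, 3.0, 10.0])   # parameter counts, billions
measured = 1.5 * n ** (-0.08) + 1.2         # replace with your measurements

(a, b, c), _ = curve_fit(scaling_law, n, measured, p0=(1.0, 0.1, 1.0))
print(f"a = {a:.2f}, b = {b:.3f}, c = {c:.2f}")
```

Plotting `n` against `measured` on log-log axes (after subtracting the fitted `c`) should then give an approximately straight line with slope `-b`.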

---

#### Project 2. Efficient Inference System

Build a text-generation service with:
- quantized inference
- KV caching
- batching
- latency measurement

Measure throughput and memory usage.

---

#### Project 3. Physics-Informed Neural Network

Train a PINN to solve a partial differential equation such as:
- heat equation
- wave equation
- Burgers’ equation

Compare learned and analytical solutions.
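A minimal sketch of the core PINN machinery for the 1D heat equation $u_t = \alpha u_{xx}$: building the PDE residual with autograd. The architecture, $\alpha$, and step count are placeholders, and a full solution must also add initial- and boundary-condition loss terms:

```python
import torch
import torch.nn as nn

# The network maps (x, t) -> u; the PDE residual is built with autograd.
net = nn.Sequential(
    nn.Linear(2, 32), nn.Tanh(),
    nn.Linear(32, 32), nn.Tanh(),
    nn.Linear(32, 1),
)
alpha = 0.1  # thermal diffusivity (illustrative value)

def pde_residual(x, t):
    x = x.requires_grad_(True)
    t = t.requires_grad_(True)
    u = net(torch.cat([x, t], dim=1))
    u_t = torch.autograd.grad(u.sum(), t, create_graph=True)[0]
    u_x = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x.sum(), x, create_graph=True)[0]
    return u_t - alpha * u_xx  # zero where the PDE is satisfied

# Random interior collocation points on [0, 1] x [0, 1].
x = torch.rand(64, 1)
t = torch.rand(64, 1)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(5):
    opt.zero_grad()
    loss = pde_residual(x, t).pow(2).mean()
    loss.backward()
    opt.step()
print(f"PDE residual loss: {loss.item():.4f}")
```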

---

#### Project 4. Robot Policy Learning

Train a robot policy in simulation using:
- imitation learning
- reinforcement learning

Evaluate sim-to-real transfer performance if hardware is available.

---

#### Project 5. Long-Context Evaluation

Evaluate transformer performance as context length increases.

Measure:
- memory usage
- inference latency
- retrieval accuracy
- degradation under long sequences

---

### Chapter Summary

This chapter explored future directions in deep learning:

- scaling laws
- efficient AI systems
- scientific deep learning
- robotics and embodied AI
- open research problems

Modern AI research increasingly combines:
- large-scale optimization
- multimodal learning
- reasoning systems
- scientific computation
- interaction with tools and environments

Future progress will depend not only on larger models, but also on deeper understanding, more efficient systems, stronger evaluation methods, and safer deployment practices.

