Conceptual Questions
Explain the difference between parameter scaling, data scaling, and compute scaling.
Why do scaling laws often follow approximate power-law behavior?
What is compute-optimal training? Why can a smaller model trained on more data outperform a larger model trained on less data?
Explain the difference between training-time scaling and inference-time scaling.
Why is attention complexity quadratic in sequence length for standard transformers?
What are the main bottlenecks in large-scale AI systems?
Compare FP32, FP16, BF16, and INT8 computation. What tradeoffs exist between precision and efficiency?
Explain the purpose of:
- quantization
- pruning
- distillation
- gradient checkpointing
Why is memory movement often more expensive than arithmetic computation?
What is operator fusion? Why does it improve performance?
Explain the sim-to-real gap in robotics.
Compare imitation learning and reinforcement learning.
Why are world models important for embodied agents?
What are affordances? Give three examples from robotics.
Why is uncertainty estimation critical in scientific deep learning?
Explain the difference between correlation and causation.
Why are distribution shifts dangerous in deployed systems?
What is catastrophic forgetting?
Why is interpretability difficult in large neural networks?
Explain why benchmark accuracy alone is insufficient for evaluating advanced AI systems.
Mathematical Exercises
1. Parameter Scaling
Suppose validation loss follows a power law of the form
$$L(N) = a N^{-\alpha} + c,$$
where $N$ is the parameter count in billions and $a$, $\alpha$, $c$ are positive constants.
Compute the loss for a few representative parameter counts (for example, $N = 1$, $10$, and $100$).
What happens to the loss as $N \to \infty$?
2. Compute Scaling
Suppose compute scales as
$$C \propto N \cdot D,$$
where:
- $N$ = parameters
- $D$ = training tokens
If the compute budget doubles, describe three possible scaling strategies.
3. Attention Complexity
A transformer processes sequences of length $n$.
Standard attention requires on the order of $n^2$ operations, because the attention matrix has $n \times n$ entries.
Compare the relative attention cost as the sequence length grows.
How much larger is the attention matrix when the sequence length increases from 2048 to 8192?
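A back-of-the-envelope check, assuming the attention cost grows with the square of the sequence length:

$$\frac{\text{cost}(8192)}{\text{cost}(2048)} \approx \left(\frac{8192}{2048}\right)^{2} = 4^{2} = 16$$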
4. Mixed Precision Memory Savings
A model has 8 billion parameters.
Estimate parameter memory usage in:
- FP32
- FP16
Now assume training with Adam stores, for each parameter:
- the parameter value
- its gradient
- the Adam first moment
- the Adam second moment
Estimate the total training memory usage.
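A rough worked estimate, taking 1 GB as $10^9$ bytes and, for the training total, assuming all four quantities are kept in FP32 (actual mixed-precision recipes differ):

$$\text{FP32 parameters: } 8\times10^{9} \times 4\ \text{bytes} \approx 32\ \text{GB}, \qquad \text{FP16 parameters: } 8\times10^{9} \times 2\ \text{bytes} \approx 16\ \text{GB}$$

$$\text{Training total: } 8\times10^{9} \times (4 + 4 + 4 + 4)\ \text{bytes} \approx 128\ \text{GB}$$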
5. Quantization
Suppose a model requires 40 GB in FP16.
Estimate storage size after:
- INT8 quantization
- INT4 quantization
Ignore metadata overhead.
PyTorch Exercises
1. Count Parameters
Write a PyTorch function that counts:
- total parameters
- trainable parameters
for any model.
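A minimal sketch of such a helper, assuming any standard torch.nn.Module; the small Sequential model below is only for demonstration:

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> tuple[int, int]:
    """Return (total, trainable) parameter counts for a model."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

# Example usage on a toy model
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
total, trainable = count_parameters(model)
print(f"total={total:,}  trainable={trainable:,}")
```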
2. Mixed Precision Training
Modify a standard training loop to use:
- autocast
- GradScaler
Measure memory usage before and after.
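One way the modified loop might look; model, loader, optimizer, and criterion are assumed to already exist, and a CUDA device is assumed available:

```python
import torch

scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid FP16 gradient underflow

for inputs, targets in loader:  # `loader`, `model`, `optimizer`, `criterion` assumed defined
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        outputs = model(inputs)
        loss = criterion(outputs, targets)
    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.step(optimizer)         # unscales gradients, then steps the optimizer
    scaler.update()

print(f"peak memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```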
3. Gradient Checkpointing
Apply gradient checkpointing to a transformer block and compare:
- peak memory usage
- training speed
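A minimal sketch using torch.utils.checkpoint; the transformer block and tensor sizes below are illustrative, and peak memory can be compared with torch.cuda.max_memory_allocated() when running on a GPU:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
x = torch.randn(8, 1024, 512, requires_grad=True)

# Without checkpointing: all intermediate activations are kept for backward.
y = block(x)

# With checkpointing: activations are discarded in the forward pass and
# recomputed during backward, trading extra compute for lower peak memory.
y_ckpt = checkpoint(block, x, use_reentrant=False)

loss = y_ckpt.sum()
loss.backward()
```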
4. Profiling
Use torch.profiler to identify:
- slow operations
- memory bottlenecks
- synchronization overhead
in a training script.
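A starting point with torch.profiler; train_step here is a hypothetical stand-in for one iteration of your training loop:

```python
import torch
from torch.profiler import profile, ProfilerActivity

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

# `train_step` is a placeholder for one iteration of your training loop.
with profile(activities=activities, profile_memory=True, record_shapes=True) as prof:
    for _ in range(10):
        train_step()

# Sort by self time to surface slow kernels; profile_memory=True exposes allocation hotspots.
sort_key = "self_cuda_time_total" if torch.cuda.is_available() else "self_cpu_time_total"
print(prof.key_averages().table(sort_by=sort_key, row_limit=20))
```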
5. Quantized Inference
Convert a pretrained model to INT8 and measure:
- inference latency
- model size
- accuracy degradation
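One simple route is dynamic INT8 quantization of the linear layers; the model below is a placeholder for your pretrained float model on CPU:

```python
import torch
import torch.nn as nn

# `model` stands in for any pretrained float model running on CPU.
model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 10)).eval()

# Dynamic quantization stores Linear weights in INT8; activations are quantized at runtime.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 768)
with torch.no_grad():
    baseline = model(x)
    quantized_out = quantized(x)
print("max abs difference:", (baseline - quantized_out).abs().max().item())
```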
Research Exercises
Read a recent scaling-law paper and summarize:
- the scaling variables
- the fitted equation
- the experimental setup
- the limitations
Compare two efficient transformer architectures.
Study a robotics benchmark and identify:
- sensor inputs
- action space
- evaluation metrics
- failure modes
Investigate one scientific deep learning system such as:
- protein folding
- weather forecasting
- molecular generation
- neural operators
Analyze one open problem in AI safety or alignment and explain why it remains difficult.
Open-Ended Projects
Project 1. Scaling Experiment
Train language models of several sizes and fit a scaling curve of the form $L(N) = a N^{-\alpha} + c$.
Plot:
- parameter count vs validation loss
- compute vs validation loss
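A sketch of the curve-fitting step, assuming you have already collected (parameter count, validation loss) pairs; the measurements below are placeholders for your own runs:

```python
import numpy as np
from scipy.optimize import curve_fit

# Placeholder (parameter count, validation loss) measurements -- replace with your results.
params = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
losses = np.array([4.1, 3.8, 3.5, 3.3, 3.1])

def power_law(n, a, alpha, c):
    """Saturating power law L(N) = a * N^(-alpha) + c."""
    return a * n ** (-alpha) + c

(a, alpha, c), _ = curve_fit(power_law, params, losses, p0=[10.0, 0.1, 2.0], maxfev=10000)
print(f"L(N) ≈ {a:.2f} * N^(-{alpha:.3f}) + {c:.2f}")
```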
Project 2. Efficient Inference System
Build a text-generation service with:
- quantized inference
- KV caching
- batching
- latency measurement
Measure throughput and memory usage.
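A small helper for the latency piece; generate_fn is a hypothetical callable that produces one batch of completions, with quantization, KV caching, and batching handled inside it:

```python
import statistics
import time

import torch

def measure_latency(generate_fn, n_warmup: int = 3, n_runs: int = 20):
    """Time a generation callable; returns (mean_ms, p95_ms)."""
    for _ in range(n_warmup):  # warm up kernels and caches before timing
        generate_fn()
    times = []
    for _ in range(n_runs):
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # exclude queued-but-unfinished GPU work
        start = time.perf_counter()
        generate_fn()
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        times.append((time.perf_counter() - start) * 1000)
    return statistics.mean(times), statistics.quantiles(times, n=20)[-1]
```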
Project 3. Physics-Informed Neural Network
Train a PINN to solve a partial differential equation such as:
- heat equation
- wave equation
- Burgers’ equation
Compare learned and analytical solutions.
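A compact PINN sketch for the 1D heat equation $u_t = \alpha u_{xx}$ with $u(x,0)=\sin(\pi x)$ and zero boundary conditions; the network size, sampling, and loss weighting are illustrative choices:

```python
import torch
import torch.nn as nn

alpha = 0.1  # diffusivity for u_t = alpha * u_xx on x in [0, 1], t in [0, 1]
net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

def pde_residual(x, t):
    """PDE residual u_t - alpha * u_xx at sampled interior points."""
    x.requires_grad_(True)
    t.requires_grad_(True)
    u = net(torch.cat([x, t], dim=1))
    u_t = torch.autograd.grad(u, t, torch.ones_like(u), create_graph=True)[0]
    u_x = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x, x, torch.ones_like(u_x), create_graph=True)[0]
    return u_t - alpha * u_xx

for step in range(2000):
    # Interior residual loss
    x, t = torch.rand(256, 1), torch.rand(256, 1)
    loss_pde = pde_residual(x, t).pow(2).mean()
    # Initial condition u(x, 0) = sin(pi x)
    x0 = torch.rand(64, 1)
    loss_ic = (net(torch.cat([x0, torch.zeros_like(x0)], dim=1)) - torch.sin(torch.pi * x0)).pow(2).mean()
    # Boundary condition u(0, t) = u(1, t) = 0
    xb = torch.cat([torch.zeros(32, 1), torch.ones(32, 1)])
    tb = torch.rand(64, 1)
    loss_bc = net(torch.cat([xb, tb], dim=1)).pow(2).mean()
    loss = loss_pde + loss_ic + loss_bc
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

For this setup the analytical solution is $u(x,t) = \sin(\pi x)\, e^{-\alpha \pi^2 t}$, which can be evaluated on a grid for the comparison.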
Project 4. Robot Policy Learning
Train a robot policy in simulation using:
- imitation learning
- reinforcement learning
Evaluate sim-to-real transfer performance if hardware is available.
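A behavior-cloning (imitation learning) starting point; the dimensions and demonstration data below are placeholders for observations and actions logged from an expert policy in your simulator:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

obs_dim, act_dim = 24, 7  # placeholder dimensions for a manipulation task

# Placeholder demonstrations -- replace with (observation, action) pairs from an expert.
demo_obs = torch.randn(10_000, obs_dim)
demo_act = torch.randn(10_000, act_dim)
loader = DataLoader(TensorDataset(demo_obs, demo_act), batch_size=256, shuffle=True)

policy = nn.Sequential(
    nn.Linear(obs_dim, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, act_dim),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

for epoch in range(10):
    for obs, act in loader:
        loss = nn.functional.mse_loss(policy(obs), act)  # regress expert actions
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```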
Project 5. Long-Context Evaluation
Evaluate transformer performance as context length increases.
Measure:
- memory usage
- inference latency
- retrieval accuracy
- degradation under long sequences
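A small measurement sketch that uses scaled dot-product attention as a stand-in for a full model; the head counts and sequence lengths are illustrative:

```python
import time

import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
n_heads, head_dim = 16, 64

for seq_len in [1024, 2048, 4096, 8192]:
    q = k = v = torch.randn(1, n_heads, seq_len, head_dim, device=device, dtype=dtype)
    if device == "cuda":
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
    start = time.perf_counter()
    out = F.scaled_dot_product_attention(q, k, v)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed_ms = (time.perf_counter() - start) * 1000
    peak_gb = torch.cuda.max_memory_allocated() / 1e9 if device == "cuda" else float("nan")
    print(f"seq_len={seq_len:5d}  latency={elapsed_ms:8.2f} ms  peak_mem={peak_gb:.2f} GB")
```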
Chapter Summary
This chapter explored future directions in deep learning:
- scaling laws
- efficient AI systems
- scientific deep learning
- robotics and embodied AI
- open research problems
Modern AI research increasingly combines:
- large-scale optimization
- multimodal learning
- reasoning systems
- scientific computation
- interaction with tools and environments
Future progress will depend not only on larger models, but also on deeper understanding, more efficient systems, stronger evaluation methods, and safer deployment practices.