# TensorFlow Autograd

TensorFlow Autograd refers to TensorFlow’s automatic differentiation system, mainly exposed through `tf.GradientTape`. It is a reverse-mode AD system designed for tensor programs, neural network training, and differentiable numerical computation.

TensorFlow began with static computational graphs. A user built a graph first, then executed it inside a runtime. Later TensorFlow 2 made eager execution the default. The modern interface records operations dynamically during execution, then differentiates the recorded computation.

## From Static Graphs to GradientTape

In early TensorFlow, a program looked like graph construction:

```python
# TensorFlow 1.x style: build a graph first, then run it in a session.
# (In TensorFlow 2 this style requires tf.compat.v1 with eager execution disabled.)
x = tf.placeholder(tf.float32)
y = x * x + tf.sin(x)
dy_dx = tf.gradients(y, x)
```

The derivative was another graph. Execution happened later inside a session.

In TensorFlow 2, the common style is eager:

```python
import tensorflow as tf

x = tf.Variable(2.0)

with tf.GradientTape() as tape:
    y = x * x + tf.sin(x)

dy_dx = tape.gradient(y, x)
```

The tape records tensor operations executed inside the `with` block. When `gradient` is called, TensorFlow traverses the recorded operation graph backward and applies registered gradient rules.

## Reverse Mode for Tensor Programs

TensorFlow’s AD system is built around vector-Jacobian products. For an operation

$$
y = f(x),
$$

the backward rule receives an upstream cotangent

$$
\bar y
$$

and returns

$$
\bar x = \bar y J_f(x).
$$

This convention avoids materializing full Jacobian matrices. For neural networks, the scalar loss depends on millions or billions of parameters. Reverse mode computes all parameter gradients with cost proportional to a small multiple of the forward computation.

A dense layer illustrates the pattern:

```python
# Illustrative shapes; in practice x, W, b, and target come from the model and data.
x = tf.random.normal([32, 10])
W = tf.Variable(tf.random.normal([10, 4]))
b = tf.Variable(tf.zeros([4]))
target = tf.random.normal([32, 4])

with tf.GradientTape() as tape:
    y = x @ W + b
    loss = tf.reduce_mean((y - target) ** 2)

dW, db = tape.gradient(loss, [W, b])
```

The system does not build the full Jacobian of `loss` with respect to `W`. It propagates adjoints backward through matrix multiplication, broadcasting, subtraction, squaring, and reduction.

## Gradient Registry

TensorFlow associates primitive operations with gradient definitions. Each operation has a forward implementation and a backward rule.

For multiplication:

$$
z = x y
$$

the backward rule is:

$$
\bar x \mathrel{+}= \bar z y,
\qquad
\bar y \mathrel{+}= \bar z x.
$$

For matrix multiplication:

$$
Y = XW
$$

the backward rules are:

$$
\bar X = \bar Y W^T,
\qquad
\bar W = X^T \bar Y.
$$

The user writes tensor code. TensorFlow supplies derivative rules for its operation library. This is different from ADIFOR or Tapenade, which transform general Fortran or C source. TensorFlow differentiates a graph of known tensor operations.
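
These rules can be checked directly against the tape. The following sketch uses arbitrary illustrative shapes and a scalar sum as the target, so the upstream cotangent is a tensor of ones:

```python
import tensorflow as tf

X = tf.random.normal([3, 4])
W = tf.random.normal([4, 2])

with tf.GradientTape() as tape:
    tape.watch(X)
    tape.watch(W)
    Y = X @ W
    s = tf.reduce_sum(Y)   # upstream cotangent of Y is all ones

dX, dW = tape.gradient(s, [X, W])
Ybar = tf.ones_like(Y)
print(tf.reduce_max(tf.abs(dX - Ybar @ tf.transpose(W))))   # ~0
print(tf.reduce_max(tf.abs(dW - tf.transpose(X) @ Ybar)))   # ~0
```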

## Persistent and Nested Tapes

By default, a `GradientTape` releases its recorded operations after a single call to `gradient`. A persistent tape can compute multiple gradients from the same recorded computation:

```python
with tf.GradientTape(persistent=True) as tape:
    y = x * x
    z = y + tf.sin(x)

dy_dx = tape.gradient(y, x)
dz_dx = tape.gradient(z, x)
del tape  # release the recorded state once the gradients are no longer needed
```

Nested tapes allow higher-order derivatives:

```python
with tf.GradientTape() as outer:
    with tf.GradientTape() as inner:
        y = x ** 3
    # First derivative, computed inside the outer tape so it is itself recorded.
    dy_dx = inner.gradient(y, x)

d2y_dx2 = outer.gradient(dy_dx, x)  # second derivative: 6 * x
```

This works because gradient computation itself can be recorded as a differentiable computation, subject to operation support and memory limits.

## Variables, Watching, and State

TensorFlow automatically watches trainable `tf.Variable` objects used inside a tape. Plain tensors are not watched unless requested:

```python
x = tf.constant(2.0)

with tf.GradientTape() as tape:
    tape.watch(x)
    y = x * x

dy_dx = tape.gradient(y, x)
```

This distinction matters: variables represent model parameters, while plain tensors represent data unless they are explicitly marked as differentiable inputs.

Stateful mutation requires care. TensorFlow tracks operations, not arbitrary Python side effects. Assignments to variables can participate in computation, but Python-level mutation outside TensorFlow operations cannot be differentiated in the same way.
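
A sketch of this behavior, with values chosen only for illustration: the tape differentiates with respect to the value a variable held when it was read, and the assignment itself contributes no gradient to its inputs.

```python
v = tf.Variable(1.0)
u = tf.Variable(2.0)

with tf.GradientTape() as tape:
    v.assign_add(u)      # state update; the tape only sees the resulting value
    y = v * v            # reads v = 3.0

dv, du = tape.gradient(y, [v, u])
print(dv)  # 6.0: computed from the value of v read inside the tape
print(du)  # None: the assignment itself is not differentiated through
```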

## Graph Compilation with `tf.function`

TensorFlow can trace Python functions into graph form using `tf.function`:

```python
# Assumes model, loss_fn, and optimizer (e.g. Keras objects) are defined elsewhere.
@tf.function
def train_step(x, target):
    with tf.GradientTape() as tape:
        y = model(x)
        loss = loss_fn(target, y)

    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```

Here the user writes eager-style Python, but TensorFlow traces TensorFlow operations into a graph. The graph can be optimized, compiled, distributed, and executed efficiently.

This creates a layered system:

| Level | Role |
|---|---|
| Python execution | defines the computation |
| TensorFlow graph | records differentiable tensor operations |
| GradientTape | builds reverse derivative computation |
| Runtime/compiler | optimizes and executes graph kernels |

This architecture is powerful but can be subtle. Python control flow, TensorFlow control flow, tensor shapes, tracing cache keys, and side effects interact.

## Custom Gradients

TensorFlow allows users to override or define gradient behavior with `tf.custom_gradient`.

```python
@tf.custom_gradient
def clipped_square(x):
    y = x * x

    def grad(dy):
        # Replace the true derivative 2x with a clipped version.
        return dy * tf.clip_by_value(2 * x, -1.0, 1.0)

    return y, grad
```

Custom gradients are useful when the mathematical derivative is unstable, expensive, unavailable, or intentionally replaced. Examples include numerical stabilization, straight-through estimators, differentiating through approximate algorithms, and wrapping external kernels.
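
For instance, a straight-through estimator can be sketched as follows (an illustrative pattern, not a library API): the forward pass applies a hard sign function whose true derivative is zero almost everywhere, while the backward pass passes the upstream gradient through unchanged.

```python
@tf.custom_gradient
def binarize(x):
    y = tf.sign(x)          # forward: hard binarization, derivative is zero a.e.

    def grad(dy):
        return dy           # backward: pretend the op was the identity

    return y, grad
```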

Custom gradients are also a liability. They can silently change the optimization problem. A custom rule must be tested as carefully as normal numerical code.
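
One inexpensive check, assuming the `clipped_square` definition above is in scope, is to compare the tape gradient against a central finite difference at a point where the clipping is inactive:

```python
x = tf.constant(0.3)   # |2x| < 1, so the clipped rule equals the true derivative

with tf.GradientTape() as tape:
    tape.watch(x)
    y = clipped_square(x)

analytic = tape.gradient(y, x)

eps = 1e-4
numeric = (clipped_square(x + eps) - clipped_square(x - eps)) / (2 * eps)
print(float(analytic), float(numeric))   # both close to 0.6
```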

## Strengths

TensorFlow Autograd is strong for tensor-heavy workloads. It integrates reverse-mode AD with a large operation library, GPU and TPU kernels, distributed execution, and neural network training infrastructure.

It avoids many hard problems of whole-language AD by differentiating TensorFlow operations rather than arbitrary Python. This gives the system a controlled primitive set with known gradient rules.

It also supports production deployment. The same graph machinery used for differentiation can be connected to serialization, serving, compiler optimization, and hardware placement.

## Limitations

TensorFlow’s AD system differentiates TensorFlow computations, not all Python programs. Pure Python control flow, list mutation, object mutation, I/O, and non-TensorFlow numerical libraries sit outside the differentiable graph unless wrapped explicitly.

The tape can consume substantial memory because reverse mode requires saved forward values. Large models, long sequences, and unrolled computations need gradient checkpointing or recomputation.
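
A minimal sketch of recomputation using the `tf.recompute_grad` decorator, with pure-tensor weights and arbitrary shapes chosen for illustration:

```python
import tensorflow as tf

w1 = tf.random.normal([512, 512])
w2 = tf.random.normal([512, 512])

@tf.recompute_grad
def block(h):
    # Activations produced inside this block are recomputed during the
    # backward pass instead of being kept alive by the tape.
    h = tf.nn.relu(h @ w1)
    return tf.nn.relu(h @ w2)

x = tf.random.normal([64, 512])

with tf.GradientTape() as tape:
    tape.watch(x)
    loss = tf.reduce_sum(block(x))

dx = tape.gradient(loss, x)
```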

Gradient behavior depends on registered operation rules. Operations without a registered gradient cannot be differentiated without a custom definition, and targets that are not connected to a requested source yield `None` gradients. Numerically unstable primitives can produce unstable gradients even when the forward computation appears acceptable.
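
A small illustration of the disconnected case, with arbitrary values:

```python
a = tf.Variable(1.0)
b = tf.Variable(2.0)

with tf.GradientTape(persistent=True) as tape:
    y = a * a   # b never enters the computation

print(tape.gradient(y, b))   # None: b is not connected to y
print(tape.gradient(y, b, unconnected_gradients=tf.UnconnectedGradients.ZERO))  # 0.0
del tape
```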

`tf.function` introduces tracing semantics. A function may be retraced for different input signatures, and Python effects usually happen during tracing rather than every graph execution. This can surprise users who expect ordinary Python execution semantics.
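
A small sketch of the tracing behavior: the `print` call is a Python side effect, so it fires only when a new trace is built.

```python
@tf.function
def square(x):
    print("tracing")           # runs while tracing, not on every graph execution
    return x * x

square(tf.constant(1.0))       # builds a trace: prints "tracing"
square(tf.constant(2.0))       # same tensor signature: reuses the trace, no print
square(3.0)                    # Python scalar argument: triggers another trace
```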

## Historical Role

TensorFlow Autograd represents a major shift from differentiating legacy scientific programs to differentiating tensor computation graphs. ADIFOR and Tapenade treat the program as source code to transform. TensorFlow treats the program as a graph of tensor operations.

This design matches deep learning. Neural networks are naturally expressed as compositions of tensor primitives. Reverse-mode AD over such graphs gives the computational structure needed for backpropagation, hardware acceleration, and large-scale optimization.

TensorFlow made automatic differentiation a routine part of software engineering for machine learning systems. Its contribution was less about inventing reverse mode and more about integrating reverse mode with tensor runtimes, accelerators, model libraries, and production deployment.

