# Parallelism

Automatic differentiation is usually described as a transformation of programs or computational graphs. In real systems, it is also a parallel execution problem. Large differentiable programs contain millions or billions of operations, and modern hardware exposes thousands of concurrent execution lanes. Efficient AD must therefore expose, schedule, and control parallel work.

Parallelism affects both performance and numerical behavior. It changes operation ordering, memory movement, reduction structure, and sometimes even the derivative values observed in finite precision.

### Forms of Parallelism

Differentiable systems use several kinds of parallelism.

| Form | Unit of parallel work |
|---|---|
| Data parallelism | Independent examples or minibatches |
| Tensor parallelism | Slices of a tensor operation |
| Pipeline parallelism | Groups of layers or program stages |
| Operator parallelism | Independent graph nodes |
| Reduction parallelism | Summation or aggregation trees |
| Task parallelism | Independent functions or subgraphs |
| Distributed parallelism | Work across machines |

Automatic differentiation must preserve dependency order while exploiting independent work.

### Parallelism in Forward Mode

Forward mode propagates primal and tangent values together.

For each operation:

$$
z = f(x),
$$

forward mode computes:

$$
\dot{z} = J_f(x)\dot{x}.
$$

Many tangent evaluations can be parallelized over seed directions.

If we want a full Jacobian:

$$
J_f(x) \in \mathbb{R}^{m \times n},
$$

forward mode can compute columns independently by seeding basis vectors:

$$
e_1, e_2, \dots, e_n.
$$

This gives natural parallelism across input dimensions.

The drawback is cost. For large $n$, computing the full Jacobian this way requires one forward pass per input dimension unless the passes are vectorized over the seeds.
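
As a minimal illustration, the sketch below hand-codes a JVP for a small function and builds the Jacobian column by column from basis seeds. The function `f` and its tangent rule are illustrative choices; the point is that the $n$ seed evaluations are independent, so a real system can batch or parallelize them.

```python
import numpy as np

def f(x):
    # Illustrative primal function R^3 -> R^2.
    return np.array([x[0] * x[1], np.sin(x[2])])

def jvp(x, v):
    # Hand-written forward-mode tangent: J_f(x) @ v for the f above.
    return np.array([v[0] * x[1] + x[0] * v[1], np.cos(x[2]) * v[2]])

x = np.array([1.0, 2.0, 0.5])
n = x.size

# Each Jacobian column comes from one basis-vector seed e_i. The n JVP calls
# are independent, so they could run in parallel (or be vectorized over seeds).
columns = [jvp(x, np.eye(n)[:, i]) for i in range(n)]
J = np.stack(columns, axis=1)
print(J)  # the 2 x 3 Jacobian of f at x
```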

### Parallelism in Reverse Mode

Reverse mode propagates adjoints backward.

For a scalar output:

$$
y = f(x),
$$

reverse mode computes:

$$
\nabla_x y
$$

in roughly one backward traversal.

This is highly efficient for scalar losses, but the backward pass contains many reductions.

When a variable influences many downstream operations, its adjoint is:

$$
\bar{x} = \sum_i \bar{v}_i \frac{\partial v_i}{\partial x}.
$$

Those contributions can be computed in parallel, but they must be accumulated safely.
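A minimal sketch with made-up numbers for three downstream uses of $x$: each product $\bar{v}_i \, \partial v_i / \partial x$ is independent work, and only the final sum requires coordination.

```python
import numpy as np

# Illustrative adjoints and local partials for three downstream uses of x.
v_bar = np.array([0.5, -1.0, 2.0])     # upstream adjoints v_bar_i
dv_dx = np.array([3.0, 0.25, -0.5])    # local partials dv_i/dx

# Each product is independent and could run on a different thread or device;
# only the reduction into x_bar must be coordinated (or made atomic).
contributions = v_bar * dv_dx
x_bar = contributions.sum()
print(x_bar)   # 0.5*3.0 + (-1.0)*0.25 + 2.0*(-0.5) = 0.25
```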

### Graph-Level Parallelism

The nodes of a computational graph form a partially ordered set under the dependency relation.

If two nodes have no dependency relation, they may execute concurrently.

Example:

$$
a = f(x),
$$

$$
b = g(y),
$$

$$
c = a + b.
$$

The computations of $a$ and $b$ are independent. They can run in parallel.

The same applies in reverse mode. Independent branches can propagate adjoints concurrently until they merge.
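
A minimal sketch using Python threads and illustrative `f` and `g`: the two branches are submitted concurrently, and the merge point waits for both.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def f(x):
    return math.exp(x)   # a = f(x); illustrative choice

def g(y):
    return math.sin(y)   # b = g(y); illustrative choice

x, y = 0.3, 1.2

# a and b share no dependency, so they are submitted concurrently; only
# c = a + b waits for both. (The scheduling pattern is the point here, not
# actual speedup, which the Python GIL would limit for such tiny tasks.)
with ThreadPoolExecutor(max_workers=2) as pool:
    fut_a = pool.submit(f, x)
    fut_b = pool.submit(g, y)
    c = fut_a.result() + fut_b.result()
print(c)
```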

### Dependency Constraints

Parallel execution must respect dependencies.

An operation can run only when all required inputs are available.

In reverse mode, a node can run only after all adjoint contributions from its users are available, or it must use atomic accumulation.

This creates scheduling choices:

| Strategy | Description |
|---|---|
| Wait for all contributions | Deterministic, structured |
| Atomic accumulation | More flexible, often nondeterministic |
| Reduction buffers | Explicitly aggregate contributions |
| Graph coloring | Avoid conflicting writes |
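
As a sketch of the first strategy (wait for all contributions), the snippet below tracks a pending-contribution counter per node for the small graph $a = f(x)$, $b = g(y)$, $c = a + b$ used earlier: a node's backward step is scheduled only once every one of its users has delivered its adjoint.

```python
from collections import deque

# Reverse-mode scheduling sketch for c = a + b, a = f(x), b = g(y).
inputs = {"c": ["a", "b"], "a": ["x"], "b": ["y"], "x": [], "y": []}
users = {"x": ["a"], "y": ["b"], "a": ["c"], "b": ["c"], "c": []}

# A node's backward may run only after all of its users have contributed.
pending = {n: len(users[n]) for n in users}
ready = deque(n for n, p in pending.items() if p == 0)  # only the output c

order = []
while ready:
    node = ready.popleft()
    order.append(node)                 # run backward of `node` here
    for inp in inputs[node]:           # deliver adjoint contributions
        pending[inp] -= 1
        if pending[inp] == 0:
            ready.append(inp)

print(order)  # ['c', 'a', 'b', 'x', 'y']: a valid reverse topological order
```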

### Parallel Reductions

Reductions are central to AD.

Examples include:

- summations such as $\sum_i x_i$,
- matrix multiplication,
- convolution,
- attention,
- gradient aggregation.

A reduction can be represented as a tree:

$$
((x_1+x_2)+(x_3+x_4))+\cdots
$$

Tree reductions are parallel and fast.

But they change floating point behavior because addition is not associative.

Thus parallelism affects reproducibility.
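
The sketch below contrasts a left-to-right sum with a pairwise (tree) sum over the same float32 values; the data is random and illustrative, and the two results typically differ in the low-order bits even though they are mathematically equal.

```python
import numpy as np

def tree_sum(vals):
    # Pairwise reduction: the shape a parallel tree reduction follows.
    vals = list(vals)
    while len(vals) > 1:
        nxt = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2 == 1:
            nxt.append(vals[-1])
        vals = nxt
    return vals[0]

x = np.random.default_rng(0).standard_normal(100_000).astype(np.float32)

left_to_right = np.float32(0.0)
for v in x:                      # sequential left-to-right accumulation
    left_to_right += v

# Addition is not associative in floating point, so the two reduction shapes
# usually disagree in the last bits.
print(left_to_right, tree_sum(x))
```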

### Data Parallel Training

Data parallelism is the simplest large-scale training strategy.

Each worker holds a copy of the model and processes a different minibatch shard.

For worker $k$, compute:

$$
g_k = \nabla_\theta L_k(\theta).
$$

Then aggregate:

$$
g = \frac{1}{K}\sum_{k=1}^K g_k.
$$

All workers then apply the same update.

Data parallelism is simple because the derivative computation for each minibatch shard is independent until the gradient reduction step.
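
A toy data-parallel step, assuming a linear least-squares loss and three shards (all names and sizes are illustrative): the per-shard gradients are independent, and only the averaging step plays the role of the all-reduce.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(4)                      # replicated parameters

def shard_gradient(w, X, y):
    # Gradient of the mean squared error on one worker's shard.
    return 2.0 * X.T @ (X @ w - y) / len(y)

shards = [(rng.standard_normal((8, 4)), rng.standard_normal(8))
          for _ in range(3)]                    # three workers' data shards

grads = [shard_gradient(w, X, y) for X, y in shards]   # independent, parallelizable
g = np.mean(grads, axis=0)                             # the synchronizing all-reduce
w = w - 0.1 * g                                        # identical update everywhere
```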

### Communication Cost

In distributed data parallelism, gradient synchronization can dominate runtime.

For parameter vector $\theta$, each step communicates approximately:

$$
O(|\theta|)
$$

gradient values.

As models grow, communication becomes a bottleneck.

Systems use:

- all-reduce,
- gradient bucketing (sketched below),
- overlap of compute and communication,
- compression,
- sharding.
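
As a sketch of the bucketing idea (the bucket size and the `all_reduce` stand-in are illustrative): gradients become ready one tensor at a time during the backward pass, and each bucket is communicated as soon as it fills, so communication overlaps the remaining backward compute.

```python
import numpy as np

bucket_capacity = 1000            # elements per bucket (illustrative)
bucket, filled = [], 0

def all_reduce(tensors):
    # Stand-in for a real collective; here it only reports what would be sent.
    print("all-reduce", sum(t.size for t in tensors), "elements")

grad_sizes = [300, 700, 400, 900, 200]   # per-parameter gradient sizes,
for size in grad_sizes:                  # in the order backward produces them
    bucket.append(np.zeros(size))
    filled += size
    if filled >= bucket_capacity:
        all_reduce(bucket)               # launch while backward continues
        bucket, filled = [], 0
if bucket:
    all_reduce(bucket)                   # flush the final partial bucket
```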

### Tensor Parallelism

Tensor parallelism splits individual tensor operations across devices.

For matrix multiplication:

$$
C = AB,
$$

we may split $A$, $B$, or $C$ across devices.

The backward pass must mirror these partitions.

For this product, the adjoints are:

$$
\bar{A} = \bar{C}B^T,
$$

$$
\bar{B} = A^T\bar{C}.
$$

The required communication pattern depends on how $A$, $B$, and $C$ were partitioned.
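
A numeric sketch with $B$ (and hence $C$) column-partitioned across two hypothetical devices: the shard adjoints $\bar{B}_k$ stay local, while the partial $\bar{A}$ terms must be summed across shards, which is the all-reduce in a real system.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))
B = rng.standard_normal((6, 8))
C_bar = rng.standard_normal((4, 8))        # upstream adjoint of C

# Column-partition B (and therefore C) across two hypothetical devices.
B0, B1 = np.hsplit(B, 2)
Cb0, Cb1 = np.hsplit(C_bar, 2)

# Local backward work on each shard.
B0_bar, B1_bar = A.T @ Cb0, A.T @ Cb1      # stays local to each device
A_bar_partial = [Cb0 @ B0.T, Cb1 @ B1.T]   # each device holds a partial A_bar

# The adjoint of A needs a sum over shards: the all-reduce in a real system.
A_bar = sum(A_bar_partial)

# Check against the unpartitioned formulas A_bar = C_bar B^T, B_bar = A^T C_bar.
assert np.allclose(A_bar, C_bar @ B.T)
assert np.allclose(np.hstack([B0_bar, B1_bar]), A.T @ C_bar)
```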

### Pipeline Parallelism

Pipeline parallelism splits a model into stages.

Stage 1 computes early layers.

Stage 2 computes later layers.

Microbatches flow through the pipeline.

The backward pass flows in reverse.

Pipeline execution improves device utilization but introduces scheduling complexity.

It also increases memory pressure because activations must be retained across pipeline boundaries.
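
A minimal sketch of the forward fill phase of a synchronous pipeline schedule, assuming $S$ stages and $M$ microbatches: microbatch $m$ runs on stage $s$ at step $s + m$, so after a short ramp-up different microbatches occupy different stages at the same time, and the backward pass mirrors this schedule in reverse.

```python
# Forward fill of a synchronous pipeline: stage s processes microbatch m at
# step s + m, so every stage has work each step once the pipeline is full.
S, M = 3, 4  # stages, microbatches (illustrative sizes)
for t in range(S + M - 1):
    active = [(s, t - s) for s in range(S) if 0 <= t - s < M]
    print(f"step {t}: " +
          ", ".join(f"stage {s} runs microbatch {m}" for s, m in active))
```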

### Operator Fusion

Parallelism does not always mean more kernels.

Sometimes performance improves by reducing the number of kernel launches and the intermediate memory traffic between them.

Consider:

$$
y = \operatorname{ReLU}(Wx + b).
$$

A naive system launches separate kernels for:

- matrix multiplication,
- bias addition,
- ReLU.

A fused implementation may combine operations to reduce memory traffic.

Fusion changes the execution graph but should preserve the mathematical derivative. In floating point arithmetic, results may differ slightly due to changed rounding.
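
A small sketch of the equivalence (names illustrative): the unfused version materializes two intermediates, the fused expression does not, and any difference in a real fused kernel would come only from changed rounding.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3)).astype(np.float32)
x = rng.standard_normal(3).astype(np.float32)
b = rng.standard_normal(4).astype(np.float32)

# Unfused: three "kernels", each writing an intermediate to memory.
t1 = W @ x
t2 = t1 + b
y_unfused = np.maximum(t2, 0.0)

# Fused: one expression, no stored intermediates (here only conceptually fused).
y_fused = np.maximum(W @ x + b, 0.0)

# Mathematically identical; a real fused kernel may round intermediates
# differently, so results can differ in the last bits.
assert np.allclose(y_unfused, y_fused)
```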

### Parallel AD and Memory

Parallelism often increases memory pressure.

More concurrent operations require more live tensors.

Pipeline parallelism stores activations across stages.

Tensor parallelism stores communication buffers.

Data parallelism replicates model state.

Thus parallel AD design balances:

| Objective | Pressure |
|---|---|
| More concurrency | More live memory |
| Less memory | Less available parallel work |
| Less communication | More recomputation |
| More fusion | Less scheduling flexibility |

### Race Conditions

Reverse mode is prone to write conflicts.

If multiple downstream nodes contribute to the same adjoint $\bar{x}$, parallel execution may create races.

Correct implementations must use:

- atomic adds,
- locks,
- deterministic reduction buffers,
- staged reductions,
- or graph scheduling that avoids conflicts.

Atomic adds are common on GPUs, but they often produce nondeterministic accumulation order.
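
The gather/scatter case shows why: if several uses feed gradient back into the same slot, a plain indexed write drops contributions, and a true scatter-add (the role atomics play on a GPU) is needed. A small NumPy sketch with illustrative indices:

```python
import numpy as np

# Adjoint of a gather: several uses of x[idx] feed gradient back into the
# same slots of x_bar, so the accumulation must be a true scatter-add.
idx = np.array([0, 2, 2, 2])          # index 2 receives three contributions
contrib = np.array([1.0, 0.5, 0.5, 0.5])

wrong = np.zeros(4)
wrong[idx] += contrib                 # repeated indices collapse: contributions lost
print(wrong)                          # [1.  0.  0.5 0. ]

x_bar = np.zeros(4)
np.add.at(x_bar, idx, contrib)        # buffered scatter-add keeps every contribution
print(x_bar)                          # [1.  0.  1.5 0. ]
```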

### Deterministic Parallel AD

Deterministic parallel AD is possible but expensive.

It requires fixed reduction trees and controlled scheduling.

For example, instead of accumulating atomically into $\bar{x}$, the system stores all contributions separately and reduces them in a fixed order.

This improves reproducibility but increases memory and runtime.
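
A minimal sketch of that fixed-order scheme (contributor count and values are illustrative): each contributor owns a slot in a buffer, arrival order does not matter, and one fixed-order loop performs the reduction, so the rounded result is reproducible.

```python
import numpy as np

rng = np.random.default_rng(0)
num_contributors, dim = 4, 3
contributions = rng.standard_normal((num_contributors, dim))

# Every contributor writes into its own slot; arrival order is arbitrary.
buffer = np.zeros_like(contributions)
for k in rng.permutation(num_contributors):
    buffer[k] = contributions[k]

# The reduction itself runs in a fixed order (slot 0, 1, 2, ...), so the
# floating-point result is identical run to run, unlike atomic accumulation.
x_bar = np.zeros(dim)
for k in range(num_contributors):
    x_bar += buffer[k]
print(x_bar)
```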

### Work Stealing and Dynamic Scheduling

Dynamic graphs often use work queues.

When a node becomes ready, it is scheduled.

Work stealing improves utilization, especially for irregular graphs.

But it makes execution order less predictable.

This matters for floating point reproducibility and debugging.

### Parallelism in Sparse Graphs

Sparse computations expose irregular parallelism.

Examples:

- sparse matrix multiplication,
- graph neural networks,
- mixture-of-experts routing,
- sparse attention,
- differentiable indexing.

Sparse AD must manage:

- irregular memory access,
- load imbalance,
- atomic gradient accumulation,
- dynamic output shapes.

Parallel sparse differentiation is usually harder than dense tensor differentiation.

### Parallelism in Higher-Order AD

Higher-order AD increases parallel opportunities and costs.

For Hessian-vector products, multiple vector products may run concurrently.

For full Hessians, rows or columns may be computed in parallel.

But memory grows quickly, and nested derivative graphs complicate scheduling.

### Bulk-Synchronous Execution

Many distributed AD systems use bulk-synchronous execution.

Each iteration proceeds in phases:

1. forward pass,
2. backward pass,
3. gradient synchronization,
4. optimizer update.

This structure is simple and reproducible enough for many workloads.

The drawback is synchronization overhead. Slow workers delay all others.

### Asynchronous Execution

Asynchronous systems reduce waiting.

Workers may compute gradients using stale parameters:

$$
g_k = \nabla L(\theta_{t-\tau_k}).
$$

This improves throughput but changes optimization dynamics.

Staleness can destabilize training.

AD remains locally correct for each worker's parameters, but the global optimization process becomes asynchronous.

### Automatic Differentiation Across Devices

When a graph spans devices, the AD system must differentiate through communication operations.

Examples:

| Operation | Backward behavior |
|---|---|
| Broadcast | Reduce gradients back to source |
| Scatter | Gather gradients |
| Gather | Scatter gradients |
| All-reduce | All-reduce gradients |
| Send/receive | Reverse communication |

Communication becomes part of the computational graph.

### Differentiating Collectives

Distributed collectives have adjoints.

For example, if forward uses broadcast:

$$
y_i = x
$$

for each worker $i$, then backward sums contributions:

$$
\bar{x} = \sum_i \bar{y}_i.
$$

If forward uses sum all-reduce, backward is also a sum all-reduce.

Correct distributed AD requires these communication adjoints to be explicit.
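
A small sketch of the broadcast case with illustrative values: forward copies $x$ to every worker, and the backward pass reduces the per-worker adjoints $\bar{y}_i$ back to the source by summation.

```python
import numpy as np

# Forward: broadcast sends the same x to every worker (y_i = x).
x = np.array([1.0, -2.0, 0.5])
num_workers = 4
y = [x.copy() for _ in range(num_workers)]

# Backward: each worker holds its own adjoint y_bar_i; the adjoint of the
# broadcast is their sum, reduced back to the source.
rng = np.random.default_rng(0)
y_bar = [rng.standard_normal(x.shape) for _ in range(num_workers)]
x_bar = np.sum(y_bar, axis=0)
print(x_bar)
```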

### Load Balancing

Parallel AD performance depends on balanced work.

Imbalance occurs when:

- tensor sizes vary,
- graph branches have unequal cost,
- sparse routing sends too much work to one device,
- pipeline stages have unequal runtime.

A poorly balanced system may use many devices while achieving low effective utilization.

### Granularity

Task granularity matters.

Very small operations have high scheduling overhead.

Very large operations reduce flexibility.

Compilers often fuse small operations and tile large operations.

The goal is to expose enough parallelism without drowning the runtime in scheduling overhead.

### Parallelism and Numerical Semantics

Parallelism changes numerical semantics in practical systems.

Even when the mathematical program is unchanged, the floating point program may differ.

Sources include:

| Parallel mechanism | Numerical effect |
|---|---|
| Reduction tree | Different rounding |
| Atomic update order | Nondeterministic sums |
| Kernel fusion | Different intermediate rounding |
| Tiling | Different accumulation order |
| Communication partitioning | Different aggregation order |

For high-performance AD, numerical semantics and scheduling cannot be fully separated.

### Core Idea

Parallelism is essential for scalable automatic differentiation, but it changes both systems behavior and numerical behavior. Forward and reverse derivatives expose different parallel structures. Reverse mode in particular requires careful handling of adjoint accumulation, reductions, memory pressure, and communication. Efficient AD systems therefore combine graph scheduling, kernel fusion, distributed collectives, deterministic reduction strategies, and memory-aware execution planning.

