
Parallelism


Automatic differentiation is usually described as a transformation of programs or computational graphs. In real systems, it is also a parallel execution problem. Large differentiable programs contain millions or billions of operations, and modern hardware exposes thousands of concurrent execution lanes. Efficient AD must therefore expose, schedule, and control parallel work.

Parallelism affects both performance and numerical behavior. It changes operation ordering, memory movement, reduction structure, and sometimes even the derivative values observed in finite precision.

Forms of Parallelism

Differentiable systems use several kinds of parallelism.

| Form | Unit of parallel work |
| --- | --- |
| Data parallelism | Independent examples or minibatches |
| Tensor parallelism | Slices of a tensor operation |
| Pipeline parallelism | Groups of layers or program stages |
| Operator parallelism | Independent graph nodes |
| Reduction parallelism | Summation or aggregation trees |
| Task parallelism | Independent functions or subgraphs |
| Distributed parallelism | Work across machines |

Automatic differentiation must preserve dependency order while exploiting independent work.

Parallelism in Forward Mode

Forward mode propagates primal and tangent values together.

For each operation:

z = f(x),

forward mode computes:

\dot{z} = J_f(x)\,\dot{x}.

Many tangent evaluations can be parallelized over seed directions.

If we want a full Jacobian:

J_f(x) \in \mathbb{R}^{m \times n},

forward mode can compute columns independently by seeding basis vectors:

e_1, e_2, \dots, e_n.

This gives natural parallelism across input dimensions.
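As a toy sketch in plain Python (not any framework's API), a dual-number type makes this concrete: each Jacobian column comes from one seeded tangent pass, and the passes share no state, so they can be mapped over a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

class Dual:
    """Forward-mode dual number: carries a value and one tangent."""
    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan
    def __add__(self, other):
        return Dual(self.val + other.val, self.tan + other.tan)
    def __mul__(self, other):
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.tan * other.val + self.val * other.tan)

def f(x, y):
    # f : R^2 -> R^2, so J_f is 2x2.
    return [x * y, x + y]

def jacobian_column(args, j):
    # Seed basis vector e_j: tangent 1 on input j, 0 elsewhere.
    seeded = [Dual(v, 1.0 if i == j else 0.0) for i, v in enumerate(args)]
    return [out.tan for out in f(*seeded)]

# The columns are independent, so they can be evaluated in parallel.
with ThreadPoolExecutor() as pool:
    cols = list(pool.map(lambda j: jacobian_column([2.0, 3.0], j), range(2)))

print(cols)  # [[3.0, 1.0], [2.0, 1.0]], the columns of J_f at (2, 3)
```

Each column requires a full evaluation of `f`, which is exactly the cost issue discussed next.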

The drawback is cost. For large n, computing the full Jacobian in forward mode requires n seeded passes unless the tangent evaluations are vectorized.

Parallelism in Reverse Mode

Reverse mode propagates adjoints backward.

For a scalar output:

y = f(x),

reverse mode computes:

\nabla_x y

in roughly one backward traversal.

This is highly efficient for scalar losses, but the backward pass contains many reductions.

When a variable influences many downstream operations, its adjoint is:

\bar{x} = \sum_i \bar{v}_i \frac{\partial v_i}{\partial x}.

Those contributions can be computed in parallel, but they must be accumulated safely.
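A minimal tape illustrates this accumulation; the `Var`, `mul`, and `backward` names here are invented for the sketch, not drawn from a real library.

```python
class Var:
    """Tiny reverse-mode node: value, parents, and accumulated adjoint."""
    def __init__(self, val, parents=()):
        self.val = val
        self.parents = parents      # pairs (parent, d(self)/d(parent))
        self.adjoint = 0.0

def mul(a, b):
    return Var(a.val * b.val, ((a, b.val), (b, a.val)))

def add(a, b):
    return Var(a.val + b.val, ((a, 1.0), (b, 1.0)))

def backward(out):
    # Reverse topological order guarantees a node's adjoint is complete
    # before it is propagated to its parents.
    order, seen = [], set()
    def visit(v):
        if id(v) not in seen:
            seen.add(id(v))
            for p, _ in v.parents:
                visit(p)
            order.append(v)
    visit(out)
    out.adjoint = 1.0
    for v in reversed(order):
        for parent, partial in v.parents:
            # One term of the sum over downstream uses of `parent`.
            parent.adjoint += v.adjoint * partial

x = Var(3.0)
y = add(mul(x, x), x)     # y = x^2 + x; x feeds two downstream nodes
backward(y)
print(x.adjoint)          # 7.0 = 2*3 + 1
```

The `+=` on the adjoint is the accumulation that must be made safe under parallel execution.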

Graph-Level Parallelism

A computational graph is a partially ordered set.

If two nodes have no dependency relation, they may execute concurrently.

Example:

a = f(x), \qquad b = g(y), \qquad c = a + b.

The computations of a and b are independent. They can run in parallel.

The same applies in reverse mode. Independent branches can propagate adjoints concurrently until they merge.
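With Python's standard thread pool as a stand-in scheduler, the example above looks like this; the sleeps are placeholders for real work.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def f(x):
    time.sleep(0.05)     # stand-in for real computation
    return x * x

def g(y):
    time.sleep(0.05)
    return y + 1.0

# a = f(x) and b = g(y) share no dependency edge, so a scheduler may
# run them concurrently; c = a + b must wait for both results.
with ThreadPoolExecutor() as pool:
    fut_a = pool.submit(f, 3.0)
    fut_b = pool.submit(g, 4.0)
    c = fut_a.result() + fut_b.result()

print(c)  # 14.0
```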

Dependency Constraints

Parallel execution must respect dependencies.

An operation can run only when all required inputs are available.

In reverse mode, a node can run only after all adjoint contributions from its users are available, or it must use atomic accumulation.

This creates scheduling choices:

| Strategy | Description |
| --- | --- |
| Wait for all contributions | Deterministic, structured |
| Atomic accumulation | More flexible, often nondeterministic |
| Reduction buffers | Explicitly aggregate contributions |
| Graph coloring | Avoid conflicting writes |

Parallel Reductions

Reductions are central to AD.

Examples include:

  • summation \sum_i x_i,
  • matrix multiplication,
  • convolution,
  • attention,
  • gradient aggregation.

A reduction can be represented as a tree:

((x_1+x_2)+(x_3+x_4))+\cdots

Tree reductions are parallel and fast.

But they change floating point behavior because addition is not associative.

Thus parallelism affects reproducibility.
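A two-line experiment shows the effect; the values are chosen so that the sequential and tree orders round differently in double precision.

```python
xs = [1e16, 1.0, -1e16, 1.0]

# Left-to-right (sequential) accumulation.
seq = 0.0
for x in xs:
    seq += x            # ((1e16 + 1) + -1e16) + 1; the 1.0 is absorbed

# Balanced tree reduction over the same values.
tree = (xs[0] + xs[1]) + (xs[2] + xs[3])

print(seq, tree)  # 1.0 0.0 -- same data, different rounding
```

Both results are "correct" floating-point sums; they simply correspond to different association orders.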

Data Parallel Training

Data parallelism is the simplest large-scale training strategy.

Each worker holds a copy of the model and processes a different minibatch shard.

For worker k, compute:

g_k = \nabla_\theta L_k(\theta).

Then aggregate:

g = \frac{1}{K}\sum_{k=1}^{K} g_k.

All workers then apply the same update.

Data parallelism is simple because the derivative computation for each minibatch shard is independent until the gradient reduction step.
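A single-process sketch of the scheme, using a hypothetical per-shard loss L_k(θ) = mean over the shard of (θx − y)²; with equal shard sizes, the averaged shard gradients equal the full-batch gradient.

```python
def shard_grad(theta, shard):
    # Gradient of the mean squared error on one shard.
    return sum(2.0 * x * (theta * x - y) for x, y in shard) / len(shard)

data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 9.0)]
shards = [data[:2], data[2:]]          # one shard per "worker"
theta = 0.5

# Each worker differentiates independently...
grads = [shard_grad(theta, s) for s in shards]
# ...and the only coupling is the final reduction g = (1/K) sum_k g_k.
g = sum(grads) / len(grads)

full = shard_grad(theta, data)         # gradient on the full batch
print(g, full)                         # equal here: -22.0 -22.0
```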

Communication Cost

In distributed data parallelism, gradient synchronization can dominate runtime.

For a parameter vector \theta, each step communicates approximately:

O(|\theta|)

gradient values.

As models grow, communication becomes a bottleneck.

Systems use:

  • all-reduce,
  • gradient bucketing,
  • overlap of compute and communication,
  • compression,
  • sharding.

Tensor Parallelism

Tensor parallelism splits individual tensor operations across devices.

For matrix multiplication:

C = AB,

we may split A, B, or C across devices.

The backward pass must mirror these partitions.

For example, if:

C = AB,

then:

\bar{A} = \bar{C} B^T, \qquad \bar{B} = A^T \bar{C}.

The required communication pattern depends on how A, B, and C were partitioned.
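As a sketch with two simulated "devices" and plain-Python matrix helpers: row-partitioning A (and therefore C and C̄) keeps the Ā computation local to each device, while B̄ requires a cross-shard sum, which a real system would implement as an all-reduce.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C_bar = [[1.0, 0.0], [0.0, 1.0]]        # upstream adjoint of C = A B

# Row-partition A (and hence C and C_bar) across two "devices".
A_shards = [A[:1], A[1:]]
C_bar_shards = [C_bar[:1], C_bar[1:]]

# A_bar is row-partitioned the same way: A_bar_k = C_bar_k B^T, local.
A_bar_shards = [matmul(cb, transpose(B)) for cb in C_bar_shards]

# B_bar = A^T C_bar needs a reduction: each device holds a partial
# A_k^T C_bar_k, and the partials are summed across devices.
partials = [matmul(transpose(a), cb)
            for a, cb in zip(A_shards, C_bar_shards)]
B_bar = [[sum(p[i][j] for p in partials) for j in range(2)]
         for i in range(2)]

print(A_bar_shards)  # [[[5.0, 7.0]], [[6.0, 8.0]]] = rows of B^T
print(B_bar)         # [[1.0, 3.0], [2.0, 4.0]] = A^T (since C_bar = I)
```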

Pipeline Parallelism

Pipeline parallelism splits a model into stages.

Stage 1 computes early layers.

Stage 2 computes later layers.

Microbatches flow through the pipeline.

The backward pass flows in reverse.

Pipeline execution improves device utilization but introduces scheduling complexity.

It also increases memory pressure because activations must be retained across pipeline boundaries.

Operator Fusion

Parallelism does not always mean more kernels.

Sometimes performance improves by reducing parallel boundaries.

Consider:

y = \operatorname{ReLU}(Wx + b).

A naive system launches separate kernels for:

  • matrix multiplication,
  • bias addition,
  • ReLU.

A fused implementation may combine operations to reduce memory traffic.

Fusion changes the execution graph but should preserve the mathematical derivative. In floating point arithmetic, results may differ slightly due to changed rounding.
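A scalar-loop sketch of both forms of the same computation; in this small example the fused and unfused versions round identically, though on real hardware fused kernels may produce slightly different intermediates.

```python
def relu(v):
    return v if v > 0.0 else 0.0

W = [[1.0, -2.0], [0.5, 1.5]]
x = [3.0, 1.0]
b = [-2.0, 0.0]

# Unfused: three passes, each materializing an intermediate vector.
t1 = [sum(w * xi for w, xi in zip(row, x)) for row in W]   # W x
t2 = [ti + bi for ti, bi in zip(t1, b)]                    # + b
y_unfused = [relu(ti) for ti in t2]                        # ReLU

# Fused: one pass per output element, no stored intermediates.
y_fused = [relu(sum(w * xi for w, xi in zip(row, x)) + bi)
           for row, bi in zip(W, b)]

print(y_unfused, y_fused)  # [0.0, 3.0] [0.0, 3.0]
```

The fused form touches each output once and never writes `t1` or `t2` to memory, which is where the performance win comes from.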

Parallel AD and Memory

Parallelism often increases memory pressure.

More concurrent operations require more live tensors.

Pipeline parallelism stores activations across stages.

Tensor parallelism stores communication buffers.

Data parallelism replicates model state.

Thus parallel AD design balances:

| Objective | Pressure |
| --- | --- |
| More concurrency | More live memory |
| Less memory | Less available parallel work |
| Less communication | More recomputation |
| More fusion | Less scheduling flexibility |

Race Conditions

Reverse mode is prone to write conflicts.

If multiple downstream nodes contribute to the same adjoint:

\bar{x},

parallel execution may create races.

Correct implementations must use:

  • atomic adds,
  • locks,
  • deterministic reduction buffers,
  • staged reductions,
  • or graph scheduling that avoids conflicts.

Atomic adds are common on GPUs, but they often produce nondeterministic accumulation order.

Deterministic Parallel AD

Deterministic parallel AD is possible but expensive.

It requires fixed reduction trees and controlled scheduling.

For example, instead of atomic accumulation into:

\bar{x},

the system stores all contributions separately and reduces them in a fixed order.

This improves reproducibility but increases memory and runtime.
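The contrast can be simulated in a few lines: `accumulate` models atomic adds landing in an arbitrary order, while `deterministic_sum` reduces per-contribution slots in a fixed pairwise tree. The function names are illustrative.

```python
# Per-user contributions to an adjoint, chosen so order changes rounding.
contribs = [1e16, 1.0, -1e16, 1.0]

def accumulate(values, order):
    # Models atomic adds: contributions land in whatever order
    # the threads happen to finish.
    acc = 0.0
    for i in order:
        acc += values[i]
    return acc

def deterministic_sum(values):
    # Reduction buffer: each contribution has a fixed slot, and the
    # pairwise reduction always runs in the same order.
    buf = list(values)
    while len(buf) > 1:
        buf = [buf[i] + buf[i + 1] if i + 1 < len(buf) else buf[i]
               for i in range(0, len(buf), 2)]
    return buf[0]

print(accumulate(contribs, [0, 1, 2, 3]))  # 1.0
print(accumulate(contribs, [1, 3, 0, 2]))  # 2.0 -- order-dependent
print(deterministic_sum(contribs))         # 0.0, every run
```

The deterministic version pays for reproducibility with an extra buffer of one slot per contribution.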

Work Stealing and Dynamic Scheduling

Dynamic graphs often use work queues.

When a node becomes ready, it is scheduled.

Work stealing improves utilization, especially for irregular graphs.

But it makes execution order less predictable.

This matters for floating point reproducibility and debugging.

Parallelism in Sparse Graphs

Sparse computations expose irregular parallelism.

Examples:

  • sparse matrix multiplication,
  • graph neural networks,
  • mixture-of-experts routing,
  • sparse attention,
  • differentiable indexing.

Sparse AD must manage:

  • irregular memory access,
  • load imbalance,
  • atomic gradient accumulation,
  • dynamic output shapes.

Parallel sparse differentiation is usually harder than dense tensor differentiation.

Parallelism in Higher-Order AD

Higher-order AD increases parallel opportunities and costs.

For Hessian-vector products, multiple vector products may run concurrently.

For full Hessians, rows or columns may be computed in parallel.

But memory grows quickly, and nested derivative graphs complicate scheduling.

Bulk-Synchronous Execution

Many distributed AD systems use bulk-synchronous execution.

Each iteration proceeds in phases:

  1. forward pass,
  2. backward pass,
  3. gradient synchronization,
  4. optimizer update.

This structure is simple and reproducible enough for many workloads.

The drawback is synchronization overhead. Slow workers delay all others.
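A toy single-process version of the four phases, using a quadratic stand-in loss L(θ) = mean over y of (θ − y)²; the shard layout and step size are made up for illustration.

```python
def bsp_step(theta, shards, lr=0.1):
    # Phases 1-2: each worker runs forward + backward on its shard.
    grads = [sum(2.0 * (theta - y) for y in shard) / len(shard)
             for shard in shards]
    # Phase 3: gradient synchronization (an all-reduce in a real system).
    g = sum(grads) / len(grads)
    # Phase 4: every replica applies the identical update.
    return theta - lr * g

shards = [[1.0, 2.0], [3.0, 4.0]]
theta = 0.0
for _ in range(100):
    theta = bsp_step(theta, shards)
print(theta)   # converges toward 2.5, the minimizer of the mean loss
```

Because phase 3 is a barrier, the loop's per-step time is set by the slowest worker.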

Asynchronous Execution

Asynchronous systems reduce waiting.

Workers may compute gradients using stale parameters:

g_k = \nabla L(\theta_{t-\tau_k}).

This improves throughput but changes optimization dynamics.

Staleness can destabilize training.

AD remains locally correct for each worker’s parameters, but the global optimization process becomes asynchronous.
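A small simulation of stale-gradient descent on L(θ) = (θ − 5)² shows the dynamics; the step size and staleness values are chosen for illustration, and with this particular combination the stale run oscillates and diverges.

```python
def grad(theta):
    # Gradient of the toy loss L(theta) = (theta - 5)^2.
    return 2.0 * (theta - 5.0)

def train(steps, staleness, lr=0.4):
    history = [0.0]                              # theta_0
    for t in range(steps):
        stale = history[max(0, t - staleness)]   # worker reads theta_{t-tau}
        history.append(history[-1] - lr * grad(stale))
    return history[-1]

print(train(50, staleness=0))   # converges to 5
print(train(50, staleness=3))   # same per-step AD, stale reads: unstable
```

Each gradient is exactly correct for the parameters it was evaluated at; instability comes entirely from applying it to newer parameters.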

Automatic Differentiation Across Devices

When a graph spans devices, the AD system must differentiate through communication operations.

Examples:

| Operation | Backward behavior |
| --- | --- |
| Broadcast | Reduce gradients back to source |
| Scatter | Gather gradients |
| Gather | Scatter gradients |
| All-reduce | All-reduce gradients |
| Send/receive | Reverse communication |

Communication becomes part of the computational graph.

Differentiating Collectives

Distributed collectives have adjoints.

For example, if forward uses broadcast:

y_i = x

for each worker i, then backward sums contributions:

\bar{x} = \sum_i \bar{y}_i.

If forward uses sum all-reduce, backward is also a sum all-reduce.

Correct distributed AD requires these communication adjoints to be explicit.
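These two adjoint rules can be written down directly; the function names below are illustrative single-process stand-ins, not a real collective library's API.

```python
def broadcast(x, num_workers):
    # Forward broadcast: every worker receives a copy of x.
    return [x] * num_workers

def broadcast_adjoint(y_bars):
    # Backward of broadcast: sum the per-worker output adjoints.
    return sum(y_bars)

x = 2.0
ys = broadcast(x, 4)                  # y_i = x on each worker
# Suppose worker i's loss term gives output adjoint y_bar_i = i.
y_bars = [float(i) for i in range(4)]
x_bar = broadcast_adjoint(y_bars)
print(x_bar)                          # 6.0 = sum of the y_bar_i

def all_reduce(vals):
    # A sum all-reduce is self-adjoint: the backward pass is also
    # a sum all-reduce over the output adjoints.
    s = sum(vals)
    return [s] * len(vals)

print(all_reduce([1.0, 2.0, 3.0]))    # [6.0, 6.0, 6.0]
```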

Load Balancing

Parallel AD performance depends on balanced work.

Imbalance occurs when:

  • tensor sizes vary,
  • graph branches have unequal cost,
  • sparse routing sends too much work to one device,
  • pipeline stages have unequal runtime.

A poorly balanced system may use many devices while achieving low effective utilization.

Granularity

Task granularity matters.

Very small operations have high scheduling overhead.

Very large operations reduce flexibility.

Compilers often fuse small operations and tile large operations.

The goal is to expose enough parallelism without drowning the runtime in scheduling overhead.

Parallelism and Numerical Semantics

Parallelism changes numerical semantics in practical systems.

Even when the mathematical program is unchanged, the floating point program may differ.

Sources include:

| Parallel mechanism | Numerical effect |
| --- | --- |
| Reduction tree | Different rounding |
| Atomic update order | Nondeterministic sums |
| Kernel fusion | Different intermediate rounding |
| Tiling | Different accumulation order |
| Communication partitioning | Different aggregation order |

For high-performance AD, numerical semantics and scheduling cannot be fully separated.

Core Idea

Parallelism is essential for scalable automatic differentiation, but it changes both systems behavior and numerical behavior. Forward and reverse derivatives expose different parallel structures. Reverse mode in particular requires careful handling of adjoint accumulation, reductions, memory pressure, and communication. Efficient AD systems therefore combine graph scheduling, kernel fusion, distributed collectives, deterministic reduction strategies, and memory-aware execution planning.