# Audio-Visual Learning

Audio-visual learning studies models that jointly process sound and visual information. The goal is to learn representations that combine what is seen with what is heard.

Humans naturally integrate multiple sensory streams. When watching someone speak, we combine lip motion, facial expression, and sound. When observing a moving car, we associate engine noise with visual motion. Audio-visual models attempt to learn similar correspondences.

The core challenge is multimodal alignment across time. Visual events and audio events are often correlated but may have different temporal resolutions, noise properties, and ambiguities.

### Audio and Video as Tensors

Audio and video are both represented as tensors.

A video batch is commonly stored as

$$
V \in \mathbb{R}^{B \times T \times C \times H \times W},
$$

where:

| Symbol | Meaning |
|---|---|
| $B$ | Batch size |
| $T$ | Number of frames |
| $C$ | Channels |
| $H$ | Height |
| $W$ | Width |

An audio waveform is often stored as

$$
A \in \mathbb{R}^{B \times C_a \times L},
$$

where $C_a$ is the number of audio channels and $L$ is the waveform length in samples.

Instead of raw waveforms, many systems use spectrograms. A spectrogram converts sound into a time-frequency representation:

$$
S \in \mathbb{R}^{B \times F \times T_a},
$$

where $F$ is the number of frequency bins and $T_a$ is the audio time dimension.

In PyTorch:

```python id="qqf4gf"
import torch

# Batch of 8 video clips: 16 frames, 3 channels, 224x224 pixels
video = torch.randn(8, 16, 3, 224, 224)

# Batch of 8 spectrograms: 128 frequency bins, 400 audio timesteps
spectrogram = torch.randn(8, 128, 400)

print(video.shape)        # torch.Size([8, 16, 3, 224, 224])
print(spectrogram.shape)  # torch.Size([8, 128, 400])
```

The video tensor contains 8 clips with 16 frames each. The spectrogram tensor contains 128 frequency bins across 400 audio timesteps.
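
In practice, a spectrogram is computed from the waveform rather than drawn at random. A minimal sketch using `torchaudio` follows; the 16 kHz sample rate, FFT size, and hop length are illustrative assumptions rather than values from this chapter.

```python
import torch
import torchaudio

# One second of fake mono audio at an assumed 16 kHz sample rate: (B, C_a, L)
waveform = torch.randn(8, 1, 16000)

# Mel spectrogram with 128 mel bins; n_fft and hop_length are illustrative choices
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=1024,
    hop_length=160,
    n_mels=128,
)

spec = to_mel(waveform)  # (B, C_a, F, T_a)
spec = spec.squeeze(1)   # drop the channel dimension for mono audio -> (B, F, T_a)
print(spec.shape)        # torch.Size([8, 128, 101])
```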

### Why Audio and Vision Complement Each Other

Vision and sound contain overlapping but incomplete information.

| Visual information | Audio information |
|---|---|
| Shape | Pitch |
| Motion | Rhythm |
| Spatial layout | Tone |
| Texture | Loudness |
| Appearance | Speech content |
| Gesture | Environmental sound |

Some events are visually ambiguous but acoustically clear. Others are acoustically noisy but visually obvious.

For example:

| Scenario | Helpful modality |
|---|---|
| Lip reading in noise | Vision |
| Off-screen sound source | Audio |
| Silent gestures | Vision |
| Speaker identity | Both |
| Music performance | Both |

A multimodal system can therefore outperform single-modality systems.

### Learning Cross-Modal Correspondence

The most important principle in audio-visual learning is correspondence learning. The model learns that synchronized audio and video belong together.

Suppose a dataset contains video clips $v_i$ and audio clips $a_i$. A model learns encoders:

$$
z_v = f_{\theta}(v),
\quad
z_a = g_{\phi}(a).
$$

The embeddings are trained so that synchronized pairs are similar:

$$
s(v,a) =
\frac{z_v^\top z_a}
{\|z_v\|\|z_a\|}.
$$
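
For a single clip pair, this cosine similarity can be computed directly. A minimal sketch; the 512-dimensional embeddings are illustrative:

```python
import torch
import torch.nn.functional as F

z_v = torch.randn(512)  # video embedding (illustrative dimension)
z_a = torch.randn(512)  # audio embedding

# Cosine similarity: dot product of the L2-normalized vectors
s = F.cosine_similarity(z_v, z_a, dim=0)
print(s)  # scalar in [-1, 1]
```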

Contrastive learning is widely used. Positive pairs are synchronized audio-video clips. Negative pairs come from unrelated clips.

The model learns semantic alignment without requiring labels.

For example, a model may learn:

- barking sounds correspond to dogs
- piano sounds correspond to keyboards
- explosions correspond to bright flashes
- speech corresponds to moving lips

### Contrastive Audio-Visual Training

Suppose we process a batch of synchronized video and audio clips.

The video encoder produces

$$
Z_v \in \mathbb{R}^{B \times d},
$$

and the audio encoder produces

$$
Z_a \in \mathbb{R}^{B \times d}.
$$

The similarity matrix is

$$
S = Z_v Z_a^\top.
$$

The diagonal elements correspond to matching pairs.

Training minimizes a contrastive objective:

```python id="ubgyd2"
import torch
import torch.nn.functional as F

# Encode each modality and L2-normalize so dot products are cosine similarities
video_emb = F.normalize(video_encoder(video), dim=-1)
audio_emb = F.normalize(audio_encoder(audio), dim=-1)

# B x B similarity matrix; the diagonal entries are the matching pairs
# (many systems also divide the logits by a learnable temperature)
logits = video_emb @ audio_emb.T

labels = torch.arange(logits.size(0), device=logits.device)

# Symmetric contrastive loss: video-to-audio and audio-to-video retrieval
loss_v2a = F.cross_entropy(logits, labels)
loss_a2v = F.cross_entropy(logits.T, labels)

loss = (loss_v2a + loss_a2v) / 2
```

This objective teaches the model to align sound and vision in embedding space.

### Temporal Modeling

Audio and video are sequential signals. Time therefore becomes central.

A static image contains spatial structure. Video and audio contain both spatial and temporal structure.

A model must capture:

| Structure type | Example |
|---|---|
| Short-term motion | Hand movement |
| Long-term motion | Human activity |
| Audio rhythm | Music beat |
| Temporal synchronization | Lip motion with speech |

Several architectures are used.

| Architecture | Purpose |
|---|---|
| 3D CNNs | Spatiotemporal convolutions |
| Temporal transformers | Long-range sequence modeling |
| Recurrent models | Sequential state tracking |
| Audio-video attention | Cross-modal fusion |

A transformer-based model may process video frames and audio patches jointly as token sequences.

### Audio Features

Raw audio is difficult to process directly because waveforms are long and sampled at high rates.

Most systems transform waveforms into spectral representations.

A spectrogram is computed using the short-time Fourier transform:

$$
X(\tau, \omega) =
\sum_{n=-\infty}^{\infty}
x[n] w[n-\tau]
e^{-j\omega n}.
$$

genui{"math_block_widget_always_prefetch_v2":{"content":"X(\\tau, \\omega)=\\sum_{n=-\\infty}^{\\infty} x[n] w[n-\\tau] e^{-j\\omega n}"}}

This converts a waveform into a representation indexed by time and frequency.
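
A hedged sketch of this computation with `torch.stft`; the window length and hop size are illustrative assumptions:

```python
import torch

# Batch of waveforms: (B, L), e.g. one second at an assumed 16 kHz
waveform = torch.randn(8, 16000)

# Short-time Fourier transform with a 400-sample Hann window and 160-sample hop
window = torch.hann_window(400)
stft = torch.stft(
    waveform,
    n_fft=400,
    hop_length=160,
    window=window,
    return_complex=True,
)  # (B, F, T_a) with F = n_fft // 2 + 1 = 201 frequency bins

# Magnitude spectrogram: energy per time-frequency bin
spectrogram = stft.abs()
print(spectrogram.shape)  # torch.Size([8, 201, 101])
```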

Common audio representations include:

| Representation | Description |
|---|---|
| Waveform | Raw audio signal |
| Spectrogram | Time-frequency energy |
| Mel spectrogram | Frequency compressed to perceptual scale |
| MFCC | Compact speech features |
| Learned audio tokens | Transformer embeddings |

Modern multimodal systems increasingly learn directly from raw or lightly processed audio.

### Cross-Modal Attention

Cross-modal attention allows one modality to attend to another.

Suppose video features are

$$
H_v \in \mathbb{R}^{N_v \times d},
$$

and audio features are

$$
H_a \in \mathbb{R}^{N_a \times d}.
$$

Audio-conditioned visual attention may use:

$$
Q = H_a W_Q,
\quad
K = H_v W_K,
\quad
V = H_v W_V.
$$

The attention output becomes

$$
\text{Attention}(Q,K,V) =
\text{softmax}
\left(
\frac{QK^\top}{\sqrt{d}}
\right)V.
$$

This lets audio queries select relevant visual regions.
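
A minimal sketch of this pattern with `nn.MultiheadAttention`; the token counts and dimensions are illustrative:

```python
import torch
import torch.nn as nn

B, N_v, N_a, d = 8, 196, 50, 256  # illustrative sizes

video_tokens = torch.randn(B, N_v, d)  # H_v: visual tokens
audio_tokens = torch.randn(B, N_a, d)  # H_a: audio tokens

# Audio queries attend over visual keys and values
cross_attn = nn.MultiheadAttention(embed_dim=d, num_heads=8, batch_first=True)

out, attn_weights = cross_attn(
    query=audio_tokens,  # Q = H_a W_Q
    key=video_tokens,    # K = H_v W_K
    value=video_tokens,  # V = H_v W_V
)

print(out.shape)           # torch.Size([8, 50, 256]): one output per audio token
print(attn_weights.shape)  # torch.Size([8, 50, 196]): attention over visual tokens
```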

For example:

- speech attends to mouth movement
- drum sounds attend to drumstick motion
- engine sounds attend to vehicles

Cross-attention creates multimodal grounding between streams.

### Self-Supervised Audio-Visual Learning

Large audio-visual datasets are difficult to label manually. Self-supervised learning therefore plays a major role.

Common pretraining tasks include:

| Task | Goal |
|---|---|
| Synchronization prediction | Determine whether audio and video match |
| Masked prediction | Predict missing frames or audio regions |
| Contrastive alignment | Match corresponding clips |
| Temporal ordering | Predict sequence order |
| Future prediction | Predict future audio or frames |

For example, a synchronization task may ask:

```text id="d0s0we"
Does this speech audio match this lip movement?
```

A model trained on this task often learns strong representations without explicit labels.
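
A minimal sketch of such a synchronization head, assuming clip-level embeddings are already available: concatenate them and predict whether the pair is in sync.

```python
import torch
import torch.nn as nn

embed_dim = 256  # illustrative

# Binary classifier over concatenated video and audio embeddings
sync_head = nn.Sequential(
    nn.Linear(2 * embed_dim, embed_dim),
    nn.ReLU(),
    nn.Linear(embed_dim, 1),  # logit: high = synchronized, low = shifted or mismatched
)

video_emb = torch.randn(8, embed_dim)
audio_emb = torch.randn(8, embed_dim)

logits = sync_head(torch.cat([video_emb, audio_emb], dim=-1))
# Train with BCEWithLogitsLoss against labels marking real vs. temporally shifted pairs
```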

### Multimodal Fusion Strategies

Fusion combines modalities into a shared representation.

Three major strategies exist.

| Fusion type | Description |
|---|---|
| Early fusion | Combine raw or low-level features |
| Mid-level fusion | Combine intermediate embeddings |
| Late fusion | Combine predictions |

Early fusion captures fine interactions but is computationally expensive. Late fusion is simpler but may miss important cross-modal structure.

Modern transformer systems usually perform mid-level fusion using attention layers.
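
A minimal sketch of mid-level fusion, shown here with a small MLP rather than attention; the feature dimensions are illustrative:

```python
import torch
import torch.nn as nn

class MidFusion(nn.Module):
    def __init__(self, video_dim=512, audio_dim=512, hidden_dim=256):
        super().__init__()
        # Project each modality's intermediate features to a shared width
        self.video_proj = nn.Linear(video_dim, hidden_dim)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        # Fuse the concatenated embeddings
        self.fusion = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
        )

    def forward(self, video_feat, audio_feat):
        fused = torch.cat(
            [self.video_proj(video_feat), self.audio_proj(audio_feat)],
            dim=-1,
        )
        return self.fusion(fused)

fusion = MidFusion()
joint = fusion(torch.randn(8, 512), torch.randn(8, 512))
print(joint.shape)  # torch.Size([8, 256])
```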

### Audio-Visual Generation

Generative multimodal systems can synthesize one modality from another.

Examples include:

| Task | Input | Output |
|---|---|---|
| Video dubbing | Video | Speech |
| Talking head generation | Audio | Face animation |
| Foley generation | Silent video | Sound effects |
| Music-conditioned animation | Music | Motion |
| Video captioning | Video | Text |

An audio-conditioned video generator may model:

$$
p(v \mid a).
$$

A video-conditioned audio generator may model:

$$
p(a \mid v).
$$

Diffusion models are increasingly used for these tasks because they generate high-quality temporal outputs.

### Audio-Visual Transformers

Modern systems frequently tokenize both modalities.

For example:

| Modality | Tokens |
|---|---|
| Video | Patch embeddings |
| Audio | Spectrogram patches |

The tokens are concatenated:

$$
X =
[x_1^{(v)},\ldots,x_n^{(v)},
x_1^{(a)},\ldots,x_m^{(a)}].
$$

A transformer processes the combined sequence.
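
A hedged sketch of this joint processing, assuming both modalities have already been embedded into $d$-dimensional tokens; the token counts and model sizes are illustrative:

```python
import torch
import torch.nn as nn

B, n_video, n_audio, d = 8, 196, 50, 256  # illustrative sizes

video_tokens = torch.randn(B, n_video, d)  # video patch embeddings
audio_tokens = torch.randn(B, n_audio, d)  # spectrogram patch embeddings

# Learnable modality embeddings mark which tokens are video and which are audio
modality_emb = nn.Parameter(torch.zeros(2, d))
video_tokens = video_tokens + modality_emb[0]
audio_tokens = audio_tokens + modality_emb[1]

# Concatenate into one sequence and run a standard transformer encoder over it
tokens = torch.cat([video_tokens, audio_tokens], dim=1)  # (B, n_video + n_audio, d)

encoder_layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)

fused = encoder(tokens)
print(fused.shape)  # torch.Size([8, 246, 256])
```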

Self-attention can then discover:

- temporal synchronization
- semantic correspondence
- motion-sound relations
- scene context

This unified token view has become dominant in foundation models.

### PyTorch Example

A simplified multimodal encoder:

```python id="2pd67p"
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioVisualModel(nn.Module):
    """Wraps a video and an audio encoder and projects both into a shared embedding space."""

    def __init__(self, video_encoder, audio_encoder, embed_dim):
        super().__init__()

        self.video_encoder = video_encoder
        self.audio_encoder = audio_encoder

        # Assumes both encoders output 512-dimensional features
        self.video_proj = nn.Linear(512, embed_dim)
        self.audio_proj = nn.Linear(512, embed_dim)

    def forward(self, video, audio):
        video_feat = self.video_encoder(video)
        audio_feat = self.audio_encoder(audio)

        # L2-normalize so that dot products are cosine similarities
        video_emb = F.normalize(
            self.video_proj(video_feat),
            dim=-1,
        )

        audio_emb = F.normalize(
            self.audio_proj(audio_feat),
            dim=-1,
        )

        return video_emb, audio_emb
```

Training:

```python id="x0w4g5"
video_emb, audio_emb = model(video, audio)

# B x B similarity matrix; matching clips lie on the diagonal
logits = video_emb @ audio_emb.T
labels = torch.arange(logits.size(0), device=logits.device)

# Symmetric contrastive loss over both retrieval directions
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
```

This architecture resembles modern multimodal contrastive systems.
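
A usage sketch with toy placeholder encoders that simply pool and project their inputs to 512 features; real systems would use a video backbone and an audio backbone here.

```python
import torch
import torch.nn as nn

class ToyVideoEncoder(nn.Module):
    """Placeholder: average over time and space, then project channels to 512 features."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(3, 512)

    def forward(self, x):                        # (B, T, C, H, W)
        return self.proj(x.mean(dim=(1, 3, 4)))  # (B, 512)

class ToyAudioEncoder(nn.Module):
    """Placeholder: average over time, then project frequency bins to 512 features."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(128, 512)

    def forward(self, x):                  # (B, F, T_a)
        return self.proj(x.mean(dim=-1))   # (B, 512)

model = AudioVisualModel(ToyVideoEncoder(), ToyAudioEncoder(), embed_dim=256)

video = torch.randn(8, 16, 3, 224, 224)
audio = torch.randn(8, 128, 400)

video_emb, audio_emb = model(video, audio)
print(video_emb.shape, audio_emb.shape)  # torch.Size([8, 256]) torch.Size([8, 256])
```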

### Applications

Audio-visual learning supports many applications.

| Application | Description |
|---|---|
| Video understanding | Action recognition and event detection |
| Speech enhancement | Using lip motion to improve speech |
| Multimodal assistants | Combined sound and vision reasoning |
| Robotics | Environmental perception |
| Autonomous driving | Audio and camera fusion |
| Healthcare | Medical audiovisual monitoring |
| Human-computer interaction | Gesture and speech integration |

Many embodied AI systems depend on multimodal sensing because real environments contain both visual and acoustic signals.

### Challenges

Audio-visual learning remains difficult.

Major challenges include:

| Challenge | Description |
|---|---|
| Temporal misalignment | Audio and video may drift |
| Noise | Background sounds and motion blur |
| Scale mismatch | Audio and video have different rates |
| Missing modalities | One modality may be absent |
| Long sequences | Video is computationally expensive |
| Dataset bias | Correlations may be spurious |

For example, a model may incorrectly associate applause with stage lighting because both often appear together.

Robust multimodal systems must learn causal structure rather than shallow correlation.

### Summary

Audio-visual learning combines sound and visual information into unified representations. The central ideas are temporal modeling, multimodal alignment, cross-attention, and contrastive learning.

Modern systems encode video and audio into token sequences, align them through embedding objectives, and fuse them using transformers. These systems support retrieval, generation, speech understanding, robotics, multimodal assistants, and embodied AI.

In PyTorch, audio-visual learning reduces to tensorized multimodal pipelines: encode each modality, align embeddings, fuse representations, and optimize contrastive or generative objectives across time.

