# Speech Recognition Systems

Speech recognition maps an acoustic signal to a text sequence. The input is a continuous audio signal. The output is a sequence of discrete symbols: characters, subword tokens, words, or phonemes.

A speech recognition system receives a waveform

$$
x = (x_1, x_2, \ldots, x_N)
$$

and predicts a token sequence

$$
y = (y_1, y_2, \ldots, y_T).
$$

The input length $N$ is usually much larger than the output length $T$. A few seconds of audio may contain tens of thousands of waveform samples but only a few words.

The goal is to model

$$
p(y \mid x).
$$

This places speech recognition in the same general family as machine translation: both are sequence-to-sequence problems. The difference is that the source sequence is audio rather than text.

### The Speech Recognition Pipeline

A classical speech recognition system consisted of several separate components: acoustic features, an acoustic model, a pronunciation dictionary, a language model, and a decoder. Modern neural systems often merge most of these into a single trainable model.

A typical neural speech recognition pipeline has four stages:

| Stage | Input | Output |
|---|---|---|
| Audio loading | File or microphone stream | Waveform |
| Feature extraction | Waveform | Acoustic features |
| Encoder | Acoustic features | Hidden acoustic states |
| Decoder | Hidden states | Text tokens |

Some modern systems skip handcrafted features and operate closer to raw waveforms, but time-frequency features remain common.

### Waveforms

A waveform is a one-dimensional signal sampled at a fixed rate. If the sampling rate is 16 kHz, then one second of audio contains 16,000 samples.

In PyTorch, a batch of mono waveforms may have shape

```python
[B, N]
```

where $B$ is batch size and $N$ is the number of audio samples.

For stereo audio, the shape may be

```python
[B, C, N]
```

where $C=2$ for left and right channels.

Speech recognition systems usually convert stereo to mono and resample audio to a fixed sampling rate, such as 16 kHz.
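
With `torchaudio`, this preprocessing takes a few lines. A minimal sketch, where `"speech.wav"` is a hypothetical input file:

```python
import torchaudio
import torchaudio.transforms as T

waveform, sample_rate = torchaudio.load("speech.wav")    # [C, N]

# Convert stereo to mono by averaging the channels.
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)        # [1, N]

# Resample to the target rate if necessary.
target_rate = 16000
if sample_rate != target_rate:
    waveform = T.Resample(sample_rate, target_rate)(waveform)
```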

### Spectrograms and Mel Features

Raw waveforms are long and difficult to model directly. A common preprocessing step converts waveforms into spectrograms.

A spectrogram represents how frequency content changes over time. It is computed by applying the short-time Fourier transform over small windows of the signal.

The result is a time-frequency matrix:

$$
X \in \mathbb{R}^{F \times T},
$$

where $F$ is the number of frequency bins and $T$ is the number of time frames.

Mel spectrograms compress the frequency axis using a scale designed to approximate human auditory perception. Log-mel spectrograms are especially common:

$$
X = \log(\text{MelSpectrogram}(x) + \epsilon).
$$

For a batch, the tensor may have shape

```python
[B, T, F]
```

or

```python
[B, F, T]
```

depending on the library and model.

### Minimal Feature Extraction in PyTorch

With `torchaudio`, a log-mel feature extractor can be built as a module.

```python
import torch
import torch.nn as nn
import torchaudio.transforms as T

class LogMelFeatures(nn.Module):
    def __init__(
        self,
        sample_rate=16000,
        n_fft=400,
        hop_length=160,
        n_mels=80,
    ):
        super().__init__()

        self.mel = T.MelSpectrogram(
            sample_rate=sample_rate,
            n_fft=n_fft,
            hop_length=hop_length,
            n_mels=n_mels,
        )

    def forward(self, waveform):
        # waveform: [B, N]
        features = self.mel(waveform)          # [B, F, T]
        features = torch.log(features + 1e-6)
        features = features.transpose(1, 2)    # [B, T, F]
        return features
```

Here each frame summarizes a short time window of the waveform. With a hop length of 160 at 16 kHz, adjacent frames are separated by 10 milliseconds.
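
A quick shape check for the module above, assuming one-second clips at 16 kHz:

```python
extractor = LogMelFeatures()

waveform = torch.randn(2, 16000)    # two one-second clips at 16 kHz
features = extractor(waveform)

print(features.shape)               # torch.Size([2, 101, 80])
```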

### Acoustic Encoders

The encoder converts acoustic features into hidden states.

If the input feature tensor is

$$
X \in \mathbb{R}^{B \times T \times F},
$$

then the encoder produces

$$
H \in \mathbb{R}^{B \times T' \times D}.
$$

The time length $T'$ may be smaller than $T$ because many speech models downsample the time axis.

Common acoustic encoders include recurrent networks, convolutional networks, transformer encoders, conformers, and wav2vec-style self-supervised encoders.

The encoder must model both local acoustic patterns and long-range linguistic context. Local patterns distinguish sounds such as vowels and consonants. Long-range context helps resolve ambiguous audio.
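
As an illustration of time-axis downsampling, here is a minimal sketch using strided convolutions; the dimensions are assumptions, not a specific published configuration:

```python
import torch
import torch.nn as nn

class ConvSubsampler(nn.Module):
    """Two stride-2 convolutions reduce the time axis to roughly T / 4."""

    def __init__(self, feature_dim=80, hidden_dim=256):
        super().__init__()

        self.conv = nn.Sequential(
            nn.Conv1d(feature_dim, hidden_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )

    def forward(self, features):
        # features: [B, T, F]; Conv1d expects [B, F, T]
        x = features.transpose(1, 2)
        x = self.conv(x)                # [B, D, T'] with T' about T / 4
        return x.transpose(1, 2)        # [B, T', D]
```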

### Connectionist Temporal Classification

A major challenge in speech recognition is alignment. We usually have audio and transcript pairs, but we often lack exact timestamps for each output token.

Connectionist Temporal Classification, or CTC, solves this by summing over possible alignments between acoustic frames and output tokens.

The model emits a distribution over tokens at each acoustic frame. The vocabulary includes a special blank token. A frame-level path might look like this:

```text
blank h h e e blank l l blank l o o
```

CTC collapses repeated tokens and then removes blanks. The blank between the two `l` runs keeps them from merging into a single `l`:

```text
hello
```

Formally, CTC defines

$$
p(y \mid x) =
\sum_{\pi \in \mathcal{A}(y)}
p(\pi \mid x),
$$

where $\mathcal{A}(y)$ is the set of all frame-level paths that collapse to the output sequence $y$.

This lets the model train from audio-transcript pairs without frame-level alignment labels.

### CTC Loss in PyTorch

PyTorch provides `nn.CTCLoss`.

The logits for CTC usually have shape

```python
[T, B, V]
```

where $T$ is acoustic time length, $B$ is batch size, and $V$ is vocabulary size including blank.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T_steps = 120   # number of acoustic frames
B = 4           # batch size
V = 32          # vocabulary size, including the blank at index 0

logits = torch.randn(T_steps, B, V)       # stand-in for encoder output
log_probs = F.log_softmax(logits, dim=-1)

# Targets exclude the blank index, so they are drawn from [1, V).
targets = torch.randint(1, V, (B, 20), dtype=torch.long)

input_lengths = torch.full((B,), T_steps, dtype=torch.long)
target_lengths = torch.full((B,), 20, dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

loss = ctc_loss(
    log_probs,
    targets,
    input_lengths,
    target_lengths,
)
```

The blank token is conventionally placed at index 0, which is the PyTorch default. Targets must not contain the blank token.

### CTC Decoding

The simplest CTC decoder takes the most likely token at each frame, then collapses repeats and removes blanks.

```python
def ctc_greedy_decode(token_ids, blank_id=0):
    output = []
    prev = None

    for token_id in token_ids:
        token_id = int(token_id)

        if token_id != blank_id and token_id != prev:
            output.append(token_id)

        prev = token_id

    return output
```

If the frame predictions are

```text
0 5 5 0 7 7 0 7 0
```

and `0` is blank, the decoded output is

```text
5 7 7
```

The repeated `5` and first repeated `7` are collapsed. The separated `7` after a blank remains because blanks can separate repeated output symbols.
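
Reusing `log_probs` from the loss example above, a batch can be decoded by taking the frame-wise argmax and collapsing each utterance separately:

```python
# log_probs: [T, B, V]
best_paths = log_probs.argmax(dim=-1)    # [T, B]

decoded = [
    ctc_greedy_decode(best_paths[:, b]) for b in range(best_paths.shape[1])
]
```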

### Encoder-Decoder Speech Recognition

CTC is simple and effective, but it assumes output tokens are conditionally independent given the acoustic input. Encoder-decoder models instead model the output token sequence autoregressively:

$$
p(y \mid x) =
\prod_{t=1}^{T}
p(y_t \mid y_{<t}, x).
$$

The encoder reads acoustic features. The decoder generates text tokens with attention over encoder states.

This is similar to neural machine translation, except the encoder input is acoustic rather than textual.

A transformer speech recognizer may use:

| Component | Role |
|---|---|
| Acoustic frontend | Converts waveform to features |
| Encoder | Builds contextual acoustic states |
| Decoder self-attention | Models previous output tokens |
| Cross-attention | Reads acoustic states |
| Output projection | Predicts next text token |
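
A minimal sketch of such a decoder built from PyTorch's transformer layers; the dimensions and vocabulary size are assumptions, and positional encodings are omitted for brevity:

```python
import torch
import torch.nn as nn

class AttentionDecoder(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, n_heads=4, n_layers=2):
        super().__init__()

        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(
            d_model=d_model,
            nhead=n_heads,
            batch_first=True,
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.output = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, encoder_states):
        # tokens: [B, T_out], encoder_states: [B, T', D]
        x = self.embed(tokens)

        # Causal mask: each position attends only to earlier positions.
        T_out = tokens.shape[1]
        mask = torch.triu(
            torch.full((T_out, T_out), float("-inf"), device=tokens.device),
            diagonal=1,
        )

        states = self.decoder(x, encoder_states, tgt_mask=mask)
        return self.output(states)      # [B, T_out, V]
```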

### Hybrid CTC and Attention Models

Many speech recognition models combine CTC with an encoder-decoder loss.

The total loss is

$$
\mathcal{L} =
\lambda \mathcal{L}_{\text{CTC}}
+
(1-\lambda)\mathcal{L}_{\text{attn}}.
$$

CTC encourages monotonic alignment between audio and text. The attention decoder improves language modeling and sequence consistency.

This hybrid objective is useful because speech recognition has a mostly monotonic structure: the transcript usually follows the same order as the audio.
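
In code, the hybrid objective is a weighted sum of the two losses. A self-contained sketch with random stand-in tensors; a real model would also shift the attention targets to include start and end tokens, which is omitted here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, T_out, V = 4, 20, 32

# Stand-ins for real model outputs.
log_probs = F.log_softmax(torch.randn(120, B, V), dim=-1)   # CTC: [T, B, V]
decoder_logits = torch.randn(B, T_out, V)                   # attention: [B, T_out, V]

targets = torch.randint(1, V, (B, T_out), dtype=torch.long)
input_lengths = torch.full((B,), 120, dtype=torch.long)
target_lengths = torch.full((B,), T_out, dtype=torch.long)

lam = 0.3   # hypothetical interpolation weight

loss_ctc = nn.CTCLoss(blank=0, zero_infinity=True)(
    log_probs, targets, input_lengths, target_lengths
)
loss_attn = F.cross_entropy(
    decoder_logits.reshape(-1, V),
    targets.reshape(-1),
)
loss = lam * loss_ctc + (1 - lam) * loss_attn
```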

### Conformer Encoders

The conformer is a common architecture for speech recognition. It combines transformer self-attention with convolution.

Self-attention captures long-range dependencies. Convolution captures local acoustic patterns. This combination fits speech well because speech contains both local phonetic structure and long-range linguistic context.

A conformer block usually contains feedforward layers, multi-head self-attention, convolution, normalization, and residual connections.

The general pattern is:

```text
feedforward
self-attention
convolution
feedforward
normalization
```

The exact implementation varies.
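
A simplified sketch of one block following this pattern, with half-step feedforward modules and a depthwise convolution; real conformer implementations add gating, batch normalization, and relative positional attention:

```python
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4, conv_kernel=31):
        super().__init__()

        self.ff1 = self._feedforward(d_model)
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(d_model)
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, d_model, conv_kernel,
                      padding=conv_kernel // 2, groups=d_model),  # depthwise
            nn.SiLU(),
            nn.Conv1d(d_model, d_model, 1),                       # pointwise
        )
        self.ff2 = self._feedforward(d_model)
        self.final_norm = nn.LayerNorm(d_model)

    @staticmethod
    def _feedforward(d_model):
        return nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, 4 * d_model),
            nn.SiLU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # x: [B, T, D], with a residual connection around every module
        x = x + 0.5 * self.ff1(x)
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        c = self.conv_norm(x).transpose(1, 2)        # [B, D, T] for Conv1d
        x = x + self.conv(c).transpose(1, 2)
        x = x + 0.5 * self.ff2(x)
        return self.final_norm(x)
```

The two half-weighted feedforward modules sandwich the attention and convolution modules, matching the pattern listed above.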

### Self-Supervised Speech Encoders

Large speech models often pretrain on unlabeled audio. Examples of self-supervised speech objectives include contrastive prediction, masked acoustic modeling, and discrete unit prediction.

The motivation is simple: transcribed speech is expensive, but raw audio is abundant.

A self-supervised encoder learns useful acoustic representations before supervised fine-tuning. During fine-tuning, the model learns to map these representations to text.

This approach is especially useful for low-resource languages and domains with limited labeled speech.

### Language Models in Speech Recognition

Speech recognition benefits from language modeling. Acoustic evidence may be ambiguous. A language model helps prefer more plausible text.

For example, the sounds for two phrases may be similar:

```text
recognize speech
wreck a nice beach
```

A language model can help choose the phrase that fits the context.

There are several ways to use language models:

| Method | Description |
|---|---|
| Shallow fusion | Add language model score during decoding |
| Rescoring | Generate candidates, then rerank with a language model |
| Internal decoder LM | Use the sequence decoder as the language model |
| Prompt conditioning | Condition a foundation model on task context |

In shallow fusion, the decoding score may be

$$
\text{score}(y) =
\log p_{\text{asr}}(y \mid x)
+
\alpha \log p_{\text{lm}}(y)
+
\beta |y|.
$$

Here $\alpha$ controls language model strength and $\beta$ is a length penalty.
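
As a sketch, this score is applied to each candidate hypothesis during beam search; the default values for `alpha` and `beta` here are hypothetical:

```python
def shallow_fusion_score(log_p_asr, log_p_lm, length, alpha=0.5, beta=1.0):
    # Acoustic score plus weighted language model score plus length bonus.
    return log_p_asr + alpha * log_p_lm + beta * length
```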

### Streaming Recognition

Offline speech recognition can use the whole audio signal before producing text. Streaming recognition must output partial transcripts while audio is still arriving.

Streaming systems have stricter latency constraints. They cannot use unlimited future context.

Common streaming methods include chunked attention, recurrent encoders, monotonic attention, transducers, and limited-context conformers.

A streaming model must balance accuracy and latency. More context usually improves recognition, but it delays output.

### Transducer Models

The Recurrent Neural Network Transducer, often called RNN-T, is another common speech recognition architecture.

A transducer has three parts:

| Component | Role |
|---|---|
| Encoder | Processes acoustic frames |
| Prediction network | Processes previous output tokens |
| Joint network | Combines acoustic and token context |

Unlike in CTC, each token prediction depends on the previous output tokens. And unlike a standard attention decoder, a transducer supports streaming more naturally.

The model defines probabilities over alignments between acoustic frames and output tokens, including blank emissions.
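
The joint network can be sketched as a broadcast combination of every encoder frame with every prediction-network state; the dimensions here are assumptions:

```python
import torch
import torch.nn as nn

class JointNetwork(nn.Module):
    def __init__(self, enc_dim=256, pred_dim=256, joint_dim=256, vocab_size=1000):
        super().__init__()

        self.enc_proj = nn.Linear(enc_dim, joint_dim)
        self.pred_proj = nn.Linear(pred_dim, joint_dim)
        self.output = nn.Linear(joint_dim, vocab_size + 1)   # +1 for blank

    def forward(self, enc_states, pred_states):
        # enc_states: [B, T, enc_dim], pred_states: [B, U, pred_dim]
        enc = self.enc_proj(enc_states).unsqueeze(2)     # [B, T, 1, J]
        pred = self.pred_proj(pred_states).unsqueeze(1)  # [B, 1, U, J]
        joint = torch.tanh(enc + pred)                   # [B, T, U, J]
        return self.output(joint)                        # [B, T, U, V + 1]
```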

### Evaluation Metrics

Speech recognition is commonly evaluated with word error rate, or WER.

WER is computed from substitutions, deletions, and insertions:

$$
\text{WER} =
\frac{S + D + I}{N}.
$$

Here $S$ is the number of substituted words, $D$ is the number of deleted words, $I$ is the number of inserted words, and $N$ is the number of words in the reference transcript.
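
Since $S + D + I$ is the word-level Levenshtein distance between hypothesis and reference, WER can be computed with a standard edit-distance table. A minimal sketch:

```python
def word_error_rate(reference, hypothesis):
    ref = reference.split()
    hyp = hypothesis.split()

    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution
            )

    return dp[len(ref)][len(hyp)] / len(ref)

word_error_rate("the cat sat", "the cat sat down")   # one insertion: 1/3
```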

Character error rate, or CER, uses characters instead of words. CER is useful for languages without whitespace word boundaries and for fine-grained analysis.

### Common Speech Recognition Errors

Speech recognition errors often come from background noise, overlapping speakers, accents, domain-specific vocabulary, poor microphones, reverberation, and code switching.

Models may confuse similar-sounding words. They may omit short function words. They may mistranscribe names, numbers, and rare technical terms. They may fail when speech contains multiple languages in one utterance.

A practical system often needs normalization rules for punctuation, capitalization, numbers, timestamps, and speaker labels.

### Minimal ASR Model Skeleton

A simple CTC-based speech recognizer can be sketched as follows.

```python
import torch
import torch.nn as nn

class CTCASR(nn.Module):
    def __init__(self, feature_dim, hidden_dim, vocab_size):
        super().__init__()

        self.encoder = nn.LSTM(
            input_size=feature_dim,
            hidden_size=hidden_dim,
            num_layers=3,
            batch_first=True,
            bidirectional=True,
        )

        self.output = nn.Linear(hidden_dim * 2, vocab_size)

    def forward(self, features):
        # features: [B, T, F]
        states, _ = self.encoder(features)
        logits = self.output(states)        # [B, T, V]
        return logits.transpose(0, 1)       # [T, B, V]
```

Training combines feature extraction, the acoustic encoder, and CTC loss:

```python
features = feature_extractor(waveforms)      # [B, T, F]
logits = model(features)                     # [T, B, V]
log_probs = logits.log_softmax(dim=-1)

# input_lengths must count acoustic frames after feature extraction,
# not raw waveform samples.
loss = ctc_loss(
    log_probs,
    targets,
    input_lengths,
    target_lengths,
)
```

This model is deliberately small. Strong production systems use convolutional subsampling, transformer or conformer encoders, normalization, augmentation, large-scale pretraining, and tuned decoding.

### Data Augmentation for Speech

Speech models benefit from augmentation because real audio varies widely.

Common augmentations include additive noise, speed perturbation, time masking, frequency masking, reverberation, random gain, and codec simulation.

SpecAugment is widely used for spectrogram features. It masks random time spans and frequency bands. The model must learn robust representations even when parts of the acoustic input are missing.
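
With `torchaudio`, SpecAugment-style masking can be sketched with the built-in masking transforms, which operate on `[B, F, T]` spectrograms; the mask sizes here are assumptions:

```python
import torch
import torchaudio.transforms as T

augment = torch.nn.Sequential(
    T.FrequencyMasking(freq_mask_param=15),   # mask up to 15 mel bands
    T.TimeMasking(time_mask_param=35),        # mask up to 35 frames
)

features = torch.randn(4, 80, 300)            # [B, F, T]
augmented = augment(features)
```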

Augmentation improves robustness to recording conditions and reduces overfitting.

### Summary

Speech recognition is a sequence-to-sequence problem from audio to text. The input is usually a waveform or log-mel spectrogram. The output is a sequence of text tokens.

CTC models train without frame-level alignments by summing over possible audio-token alignments. Encoder-decoder models generate text autoregressively using attention over acoustic states. Hybrid CTC-attention models combine alignment strength with language-modeling capacity. Transducer models are especially useful for streaming recognition.

In PyTorch, building a speech recognition system requires careful handling of tensor shapes, audio feature extraction, padding masks, sequence lengths, CTC or cross-entropy losses, and decoding procedures.

