Vision-Language Models

A vision-language model learns a joint representation of images and text. Its purpose is to connect visual information with natural language so that a model can compare, retrieve, caption, answer questions about, or generate images from text.

Traditional computer vision models map an image to a fixed label, such as cat, car, or tumor. Traditional language models operate only on tokens. A vision-language model combines both modalities. It must process pixels and words, then place them into a shared computational space.

The central problem is alignment. The model must learn that an image of a dog on a beach and the sentence “a dog running on sand near the ocean” describe the same underlying scene.

The Basic Setting

A vision-language dataset usually consists of image-text pairs:

(x_i, t_i)

where x_i is an image and t_i is a text description, caption, question, answer, or label.

The image is first converted into a tensor. In PyTorch convention, a batch of images often has shape

X \in \mathbb{R}^{B \times C \times H \times W}.

The text is tokenized into integer IDs:

T \in \mathbb{N}^{B \times L},

where B is the batch size and L is the sequence length.
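As a quick shape check, with made-up example sizes (the batch size, image resolution, sequence length, and vocabulary size here are all hypothetical):

```python
import torch

B, C, H, W = 4, 3, 224, 224   # batch, channels, height, width
L = 16                        # token sequence length
vocab_size = 1000             # hypothetical vocabulary size

images = torch.randn(B, C, H, W)               # X: float image batch
tokens = torch.randint(0, vocab_size, (B, L))  # T: integer token IDs

print(images.shape)  # torch.Size([4, 3, 224, 224])
print(tokens.shape)  # torch.Size([4, 16])
```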

A vision-language model contains at least two components:

  1. A vision encoder, which maps images into visual embeddings.
  2. A text encoder or language model, which maps token sequences into text embeddings or generated text.

The simplest form is:

z_x = f_{\theta}(x), \quad z_t = g_{\phi}(t).

Here z_x is the image embedding and z_t is the text embedding.

If the two embeddings live in the same vector space, the model can compare them using cosine similarity:

s(x,t) = \frac{z_x^\top z_t}{\|z_x\|\,\|z_t\|}.

A high similarity means the image and text are semantically related.
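In code, cosine similarity is just a dot product of L2-normalized vectors. The embeddings below are random stand-ins for a trained model's outputs:

```python
import torch
import torch.nn.functional as F

z_x = torch.randn(512)  # image embedding (random stand-in)
z_t = torch.randn(512)  # text embedding (random stand-in)

# Explicit formula: dot product divided by the product of the norms
s_manual = (z_x @ z_t) / (z_x.norm() * z_t.norm())

# Built-in equivalent
s = F.cosine_similarity(z_x, z_t, dim=0)

assert torch.allclose(s, s_manual, atol=1e-6)
assert -1.0 <= s.item() <= 1.0  # cosine similarity is bounded
```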

Contrastive Vision-Language Learning

A common training method is contrastive learning. The model receives a batch of matching image-text pairs. Each image should be close to its correct text and far from other texts in the batch.

Suppose a batch contains B image-text pairs. The model computes an image embedding matrix

Z_x \in \mathbb{R}^{B \times d}

and a text embedding matrix

Z_t \in \mathbb{R}^{B \times d}.

After normalization, the similarity matrix is

S = Z_x Z_t^\top.

The entry S_{ij} measures the similarity between image i and text j. The diagonal entries are correct pairs. The off-diagonal entries are incorrect pairs.

The model is trained so that S_{ii} becomes large and S_{ij} becomes small for i \neq j.

In simplified PyTorch form:

import torch
import torch.nn.functional as F

image_emb = F.normalize(image_encoder(images), dim=-1)
text_emb = F.normalize(text_encoder(tokens), dim=-1)

logits = image_emb @ text_emb.T
labels = torch.arange(logits.size(0), device=logits.device)

loss_i2t = F.cross_entropy(logits, labels)
loss_t2i = F.cross_entropy(logits.T, labels)

loss = (loss_i2t + loss_t2i) / 2

This objective supports image-to-text retrieval and text-to-image retrieval. It also enables zero-shot classification. To classify an image, we write candidate labels as text prompts, encode them, and choose the text with the highest similarity to the image.

For example:

"a photo of a cat"
"a photo of a dog"
"a photo of a car"

The model compares the image embedding against each prompt embedding.
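A zero-shot classification sketch, assuming a trained dual encoder. Here `encode_image` and `encode_text` are hypothetical stand-ins that return random normalized embeddings; a real system would use learned encoders and a tokenizer:

```python
import torch
import torch.nn.functional as F

d = 64  # embedding dimension (example)

# Hypothetical stand-ins for a trained model's encoders
def encode_image(image):
    return F.normalize(torch.randn(1, d), dim=-1)

def encode_text(prompts):
    return F.normalize(torch.randn(len(prompts), d), dim=-1)

prompts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

image_emb = encode_image(None)       # (1, d)
text_emb = encode_text(prompts)      # (3, d)

similarity = image_emb @ text_emb.T  # (1, 3): one score per prompt
pred = similarity.argmax(dim=-1).item()
print(prompts[pred])                 # the best-matching prompt
```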

Captioning Models

Image captioning models generate text from images. Instead of only comparing image and text embeddings, they condition a language decoder on visual features.

A typical captioning model has:

  1. A vision encoder.
  2. A projection layer.
  3. A transformer decoder.

The vision encoder produces visual tokens:

V \in \mathbb{R}^{B \times N \times d_v}.

A projection maps them into the language model dimension:

H_v = V W_p.

The decoder then predicts text tokens autoregressively:

p(t_1,\ldots,t_L \mid x) = \prod_{k=1}^{L} p(t_k \mid t_{<k}, x).

During training, the model receives the correct previous tokens and learns to predict the next token. During inference, it generates one token at a time.

# Teacher forcing: the decoder sees the correct previous tokens
visual_tokens = vision_encoder(images)
visual_tokens = projection(visual_tokens)

logits = text_decoder(
    input_ids=caption_tokens[:, :-1],    # inputs: all tokens except the last
    encoder_hidden_states=visual_tokens, # condition on the image
)

# Targets are the inputs shifted left by one position
loss = F.cross_entropy(
    logits.reshape(-1, vocab_size),
    caption_tokens[:, 1:].reshape(-1),
)
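At inference time the decoder generates one token at a time, feeding each prediction back in. A minimal greedy-decoding sketch; `toy_step` is a stand-in for the real decoder conditioned on visual tokens:

```python
import torch

def greedy_decode(step_fn, bos_id, eos_id, max_len=20):
    """Generate token IDs one at a time, always taking the argmax.

    step_fn(ids) -> logits over the vocabulary for the next token;
    in a real model this wraps the decoder and the visual tokens.
    """
    ids = [bos_id]
    for _ in range(max_len):
        logits = step_fn(torch.tensor(ids))
        next_id = int(logits.argmax())
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids

# Toy step function: predicts token = (sequence length so far), capped at 3
def toy_step(ids):
    logits = torch.zeros(10)
    logits[min(len(ids), 3)] = 1.0
    return logits

print(greedy_decode(toy_step, bos_id=0, eos_id=3))  # → [0, 1, 2, 3]
```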

Captioning requires more than recognition. The model must decide which objects, attributes, relations, and actions are important enough to mention.

Visual Question Answering

In visual question answering, the input is an image and a question. The output is an answer.

Example:

Image: a table with two cups and a plate
Question: How many cups are on the table?
Answer: two

The model must combine visual perception with language understanding. It must parse the question, locate relevant visual evidence, and produce a response.

A common architecture encodes the image into visual tokens and the question into text tokens. A transformer then performs cross-modal attention between them.

The model can be trained as classification when answers come from a fixed vocabulary:

p(y \mid x, q)

or as generation when answers are free-form:

p(a_1,\ldots,a_L \mid x, q).

Generative VQA is more flexible, but harder to evaluate. Classification VQA is easier to score, but cannot naturally produce long explanations.

Cross-Attention Between Vision and Language

Cross-attention is the main mechanism used to fuse visual and textual information.

In self-attention, queries, keys, and values come from the same sequence. In cross-attention, the query comes from one modality and the keys and values come from another.

For example, text tokens may attend to image tokens:

Q = H_t W_Q, \quad K = H_v W_K, \quad V = H_v W_V.

The attention output is

\text{Attention}(Q,K,V) = \text{softmax} \left( \frac{QK^\top}{\sqrt{d}} \right) V.

This lets each text token select relevant parts of the image. A token such as “red” may attend to colored regions. A token such as “dog” may attend to animal-shaped regions.
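A single-head sketch of this cross-attention update (multi-head splitting and output projections are omitted for brevity):

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Single-head cross-attention: text queries attend to image keys/values."""

    def __init__(self, d):
        super().__init__()
        self.wq = nn.Linear(d, d, bias=False)
        self.wk = nn.Linear(d, d, bias=False)
        self.wv = nn.Linear(d, d, bias=False)
        self.scale = d ** -0.5

    def forward(self, h_text, h_image):
        q = self.wq(h_text)    # (B, L_t, d): queries from text
        k = self.wk(h_image)   # (B, L_v, d): keys from image
        v = self.wv(h_image)   # (B, L_v, d): values from image
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v        # (B, L_t, d): one output per text token

h_text = torch.randn(2, 5, 32)     # 5 text tokens
h_image = torch.randn(2, 196, 32)  # 196 visual tokens
out = CrossAttention(32)(h_text, h_image)
print(out.shape)  # torch.Size([2, 5, 32])
```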

Cross-attention is expensive when the image has many tokens. For this reason, many systems compress visual information before giving it to the language model.

Visual Tokens

Vision-language models often treat images as sequences of visual tokens. This follows the transformer view of computation: everything becomes a sequence.

There are several common choices.

| Visual token type | Description |
| --- | --- |
| Patch tokens | Image is split into fixed-size patches |
| Region tokens | Object detector proposes regions |
| Grid tokens | CNN feature map locations become tokens |
| Latent tokens | Learned queries compress visual features |
| Video tokens | Spatiotemporal patches represent video |

For a vision transformer, an image of size 224 \times 224 with patch size 16 \times 16 gives

14 \times 14 = 196

patch tokens.

If each token has hidden dimension d, the encoded image has shape

V \in \mathbb{R}^{B \times 196 \times d}.

These tokens can be passed to a multimodal transformer, a projection layer, or a language model.
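The patch-token arithmetic can be checked directly with `unfold` (a learned linear embedding to dimension d would normally follow this step):

```python
import torch

B, C, H, W, P = 2, 3, 224, 224, 16   # patch size P = 16
n = H // P                           # 14 patches per side

images = torch.randn(B, C, H, W)

# Cut into non-overlapping P x P patches, then flatten each patch
patches = images.unfold(2, P, P).unfold(3, P, P)  # (B, C, 14, 14, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5)       # (B, 14, 14, C, P, P)
patches = patches.reshape(B, n * n, C * P * P)    # (B, 196, 768)

print(patches.shape)  # torch.Size([2, 196, 768])
```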

Common Architectures

Vision-language models differ in how tightly they combine image and text processing.

| Architecture | Description | Common use |
| --- | --- | --- |
| Dual encoder | Separate image and text encoders, compared by similarity | Retrieval, zero-shot classification |
| Encoder-decoder | Vision encoder conditions a text decoder | Captioning, VQA |
| Cross-attention model | Text and image interact through attention layers | Grounded understanding |
| Multimodal LLM | Visual tokens are fed into a language model | Chat, reasoning, tool use |
| Diffusion-conditioned model | Text controls image generation | Text-to-image generation |

A dual encoder is efficient because image and text embeddings can be precomputed. This makes it useful for search.

A multimodal LLM is more flexible because it can generate long answers and follow instructions. It is usually more expensive at inference time.

Training Objectives

Vision-language systems often combine several objectives.

Contrastive loss aligns image and text embeddings. Captioning loss trains generation. Masked modeling reconstructs missing image or text parts. Matching loss predicts whether an image and text belong together. Instruction tuning teaches the model to answer user requests.

| Objective | What it teaches |
| --- | --- |
| Contrastive learning | Global image-text alignment |
| Captioning | Text generation from visual input |
| Image-text matching | Pairwise compatibility |
| Masked language modeling | Language understanding with visual context |
| Masked image modeling | Visual representation learning |
| Instruction tuning | Response behavior and task following |

A strong model often uses staged training. First, it learns broad alignment from many image-text pairs. Then it learns higher-level tasks such as captioning, question answering, and instruction following.

PyTorch Skeleton

A minimal dual-encoder vision-language model can be written as follows:

import torch
import torch.nn as nn
import torch.nn.functional as F

class VisionLanguageDualEncoder(nn.Module):
    def __init__(self, vision_encoder, text_encoder, vision_dim, text_dim, embed_dim):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.text_encoder = text_encoder

        self.image_proj = nn.Linear(vision_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)

        # Learnable temperature, stored in log space and exponentiated in forward()
        self.logit_scale = nn.Parameter(torch.tensor(1.0))

    def encode_image(self, images):
        h = self.vision_encoder(images)
        z = self.image_proj(h)
        return F.normalize(z, dim=-1)

    def encode_text(self, tokens):
        h = self.text_encoder(tokens)
        z = self.text_proj(h)
        return F.normalize(z, dim=-1)

    def forward(self, images, tokens):
        image_emb = self.encode_image(images)
        text_emb = self.encode_text(tokens)

        scale = self.logit_scale.exp()
        logits = scale * image_emb @ text_emb.T

        return logits

Training step:

def contrastive_step(model, images, tokens):
    logits = model(images, tokens)
    labels = torch.arange(logits.size(0), device=logits.device)

    loss_image = F.cross_entropy(logits, labels)
    loss_text = F.cross_entropy(logits.T, labels)

    return (loss_image + loss_text) / 2

This model is small compared with production systems, but it contains the core idea: image and text are encoded into a shared embedding space, then trained by matching correct pairs.

Evaluation

Vision-language models are evaluated according to the task.

For retrieval, the model ranks texts for an image or images for a text. Common metrics include recall at k, median rank, and mean reciprocal rank.
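Recall at k can be computed directly from the batch similarity matrix, assuming matching pairs lie on the diagonal (the toy scores below are made up for illustration):

```python
import torch

def recall_at_k(similarity, k):
    """Fraction of queries whose correct match (the diagonal entry)
    appears among the top-k retrieved items. Rows are queries."""
    B = similarity.size(0)
    topk = similarity.topk(k, dim=1).indices   # (B, k) retrieved indices
    targets = torch.arange(B).unsqueeze(1)     # correct index per row
    hits = (topk == targets).any(dim=1)
    return hits.float().mean().item()

# Toy similarity matrix: queries 0 and 1 rank their match first,
# query 2 ranks its match second
sim = torch.tensor([[0.9, 0.1, 0.0],
                    [0.2, 0.8, 0.1],
                    [0.7, 0.1, 0.6]])

print(round(recall_at_k(sim, 1), 3))  # → 0.667
print(recall_at_k(sim, 2))            # → 1.0
```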

For captioning, metrics compare generated captions against reference captions. These metrics are imperfect because many valid captions may describe the same image.

For VQA, accuracy is used when answers are short and standardized. For open-ended answers, evaluation may require semantic matching or human judgment.

For multimodal chat models, evaluation is harder. The model may need to describe images, reason about diagrams, compare objects, read text inside images, and avoid hallucinating unsupported details.

Failure Modes

Vision-language models have several characteristic failure modes.

They may hallucinate objects that are not present in the image. They may miss small objects. They may rely on dataset bias rather than visual evidence. They may struggle with counting, spatial relations, fine-grained attributes, and text embedded in images. They may also overfit to common caption patterns.

For example, if many training captions say “a man riding a horse,” the model may predict a horse even when the image contains a donkey or mule. This is a semantic prior overriding visual evidence.

A reliable model should express uncertainty when the image lacks enough evidence. It should separate what is visible from what is inferred.

Summary

Vision-language models connect images and text through shared representations, cross-attention, or multimodal generation. The simplest systems learn an embedding space where matching images and captions are close. More advanced systems condition language models on visual tokens and can answer questions, describe scenes, follow instructions, and interact with tools.

The key theoretical ideas are representation alignment, contrastive learning, cross-modal attention, autoregressive generation, and multimodal grounding. In PyTorch, these ideas reduce to a small number of tensor operations: image encoding, text encoding, projection, normalization, attention, similarity computation, and cross-entropy training.