Vision-Language Models

A vision-language model learns a joint representation of images and text. Its purpose is to connect visual information with natural language so that a model can compare, retrieve, caption, answer questions about, or generate images from text.

Traditional computer vision models map an image to a fixed label, such as cat, car, or tumor. Traditional language models operate only on tokens. A vision-language model combines both modalities. It must process pixels and words, then place them into a shared computational space.

The central problem is alignment. The model must learn that an image of a dog on a beach and the sentence “a dog running on sand near the ocean” describe the same underlying scene.

The Basic Setting

A vision-language dataset usually consists of image-text pairs:

(x_i, t_i)

where x_i is an image and t_i is a text description, caption, question, answer, or label.

The image is first converted into a tensor. In PyTorch convention, a batch of images often has shape

X \in \mathbb{R}^{B \times C \times H \times W}.

The text is tokenized into integer IDs:

T \in \mathbb{N}^{B \times L},

where B is the batch size and L is the sequence length.
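As a quick shape check, with made-up example sizes (the batch size, image resolution, sequence length, and vocabulary size here are all hypothetical):

```python
import torch

B, C, H, W = 4, 3, 224, 224   # batch, channels, height, width
L = 16                        # token sequence length
vocab_size = 1000             # hypothetical vocabulary size

images = torch.randn(B, C, H, W)               # X: float image batch
tokens = torch.randint(0, vocab_size, (B, L))  # T: integer token IDs

print(images.shape)  # torch.Size([4, 3, 224, 224])
print(tokens.shape)  # torch.Size([4, 16])
```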

A vision-language model contains at least two components:

  1. A vision encoder, which maps images into visual embeddings.
  2. A text encoder or language model, which maps token sequences into text embeddings or generated text.

The simplest form is:

z_x = f_{\theta}(x), \quad z_t = g_{\phi}(t).

Here z_x is the image embedding and z_t is the text embedding.

If the two embeddings live in the same vector space, the model can compare them using cosine similarity:

s(x,t) = \frac{z_x^\top z_t}{\|z_x\|\,\|z_t\|}.

A high similarity means the image and text are semantically related.
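In code, cosine similarity is just a dot product of L2-normalized vectors. The embeddings below are random stand-ins for a trained model's outputs:

```python
import torch
import torch.nn.functional as F

z_x = torch.randn(512)  # image embedding (random stand-in)
z_t = torch.randn(512)  # text embedding (random stand-in)

# Explicit formula: dot product divided by the product of the norms
s_manual = (z_x @ z_t) / (z_x.norm() * z_t.norm())

# Built-in equivalent
s = F.cosine_similarity(z_x, z_t, dim=0)

assert torch.allclose(s, s_manual, atol=1e-6)
assert -1.0 <= s.item() <= 1.0  # cosine similarity is bounded
```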

Contrastive Vision-Language Learning

A common training method is contrastive learning. The model receives a batch of matching image-text pairs. Each image should be close to its correct text and far from other texts in the batch.

Suppose a batch contains B image-text pairs. The model computes an image embedding matrix

Z_x \in \mathbb{R}^{B \times d}

and a text embedding matrix

Z_t \in \mathbb{R}^{B \times d}.

After normalization, the similarity matrix is

S = Z_x Z_t^\top.

The entry S_{ij} measures the similarity between image i and text j. The diagonal entries are correct pairs. The off-diagonal entries are incorrect pairs.

The model is trained so that S_{ii} becomes large and S_{ij} becomes small for i \neq j.

In simplified PyTorch form:

import torch
import torch.nn.functional as F

image_emb = F.normalize(image_encoder(images), dim=-1)
text_emb = F.normalize(text_encoder(tokens), dim=-1)

logits = image_emb @ text_emb.T
labels = torch.arange(logits.size(0), device=logits.device)

loss_i2t = F.cross_entropy(logits, labels)
loss_t2i = F.cross_entropy(logits.T, labels)

loss = (loss_i2t + loss_t2i) / 2

This objective supports image-to-text retrieval and text-to-image retrieval. It also enables zero-shot classification. To classify an image, we write candidate labels as text prompts, encode them, and choose the text with the highest similarity to the image.

For example:

"a photo of a cat"
"a photo of a dog"
"a photo of a car"

The model compares the image embedding against each prompt embedding.
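A zero-shot classification sketch, assuming a trained dual encoder. Here `encode_image` and `encode_text` are hypothetical stand-ins that return random normalized embeddings; a real system would use learned encoders and a tokenizer:

```python
import torch
import torch.nn.functional as F

d = 64  # embedding dimension (example)

# Hypothetical stand-ins for a trained model's encoders
def encode_image(image):
    return F.normalize(torch.randn(1, d), dim=-1)

def encode_text(prompts):
    return F.normalize(torch.randn(len(prompts), d), dim=-1)

prompts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

image_emb = encode_image(None)       # (1, d)
text_emb = encode_text(prompts)      # (3, d)

similarity = image_emb @ text_emb.T  # (1, 3): one score per prompt
pred = similarity.argmax(dim=-1).item()
print(prompts[pred])                 # the best-matching prompt
```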

Captioning Models

Image captioning models generate text from images. Instead of only comparing image and text embeddings, they condition a language decoder on visual features.

A typical captioning model has:

  1. A vision encoder.
  2. A projection layer.
  3. A transformer decoder.

The vision encoder produces visual tokens:

V \in \mathbb{R}^{B \times N \times d_v}.

A projection maps them into the language model dimension:

H_v = V W_p.

The decoder then predicts text tokens autoregressively:

p(t_1,\ldots,t_L \mid x) = \prod_{k=1}^{L} p(t_k \mid t_{<k}, x).

During training, the model receives the correct previous tokens and learns to predict the next token. During inference, it generates one token at a time.

# Teacher forcing: the decoder sees the correct previous tokens
visual_tokens = vision_encoder(images)
visual_tokens = projection(visual_tokens)

logits = text_decoder(
    input_ids=caption_tokens[:, :-1],    # inputs: all tokens except the last
    encoder_hidden_states=visual_tokens, # condition on the image
)

# Targets are the inputs shifted left by one position
loss = F.cross_entropy(
    logits.reshape(-1, vocab_size),
    caption_tokens[:, 1:].reshape(-1),
)
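At inference time the decoder generates one token at a time, feeding each prediction back in. A minimal greedy-decoding sketch; `toy_step` is a stand-in for the real decoder conditioned on visual tokens:

```python
import torch

def greedy_decode(step_fn, bos_id, eos_id, max_len=20):
    """Generate token IDs one at a time, always taking the argmax.

    step_fn(ids) -> logits over the vocabulary for the next token;
    in a real model this wraps the decoder and the visual tokens.
    """
    ids = [bos_id]
    for _ in range(max_len):
        logits = step_fn(torch.tensor(ids))
        next_id = int(logits.argmax())
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids

# Toy step function: predicts token = (sequence length so far), capped at 3
def toy_step(ids):
    logits = torch.zeros(10)
    logits[min(len(ids), 3)] = 1.0
    return logits

print(greedy_decode(toy_step, bos_id=0, eos_id=3))  # → [0, 1, 2, 3]
```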

Captioning requires more than recognition. The model must decide which objects, attributes, relations, and actions are important enough to mention.

Visual Question Answering

In visual question answering, the input is an image and a question. The output is an answer.

Example:

Image: a table with two cups and a plate
Question: How many cups are on the table?
Answer: two

The model must combine visual perception with language understanding. It must parse the question, locate relevant visual evidence, and produce a response.

A common architecture encodes the image into visual tokens and the question into text tokens. A transformer then performs cross-modal attention between them.

The model can be trained as classification when answers come from a fixed vocabulary:

p(y \mid x, q)

or as generation when answers are free-form:

p(a_1,\ldots,a_L \mid x, q).

Generative VQA is more flexible, but harder to evaluate. Classification VQA is easier to score, but cannot naturally produce long explanations.

Cross-Attention Between Vision and Language

Cross-attention is the main mechanism used to fuse visual and textual information.

In self-attention, queries, keys, and values come from the same sequence. In cross-attention, the query comes from one modality and the keys and values come from another.

For example, text tokens may attend to image tokens:

Q = H_t W_Q, \quad K = H_v W_K, \quad V = H_v W_V.

The attention output is

\text{Attention}(Q,K,V) = \text{softmax} \left( \frac{QK^\top}{\sqrt{d}} \right) V.

This lets each text token select relevant parts of the image. A token such as “red” may attend to colored regions. A token such as “dog” may attend to animal-shaped regions.
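A single-head sketch of this cross-attention update (multi-head splitting and output projections are omitted for brevity):

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Single-head cross-attention: text queries attend to image keys/values."""

    def __init__(self, d):
        super().__init__()
        self.wq = nn.Linear(d, d, bias=False)
        self.wk = nn.Linear(d, d, bias=False)
        self.wv = nn.Linear(d, d, bias=False)
        self.scale = d ** -0.5

    def forward(self, h_text, h_image):
        q = self.wq(h_text)    # (B, L_t, d): queries from text
        k = self.wk(h_image)   # (B, L_v, d): keys from image
        v = self.wv(h_image)   # (B, L_v, d): values from image
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v        # (B, L_t, d): one output per text token

h_text = torch.randn(2, 5, 32)     # 5 text tokens
h_image = torch.randn(2, 196, 32)  # 196 visual tokens
out = CrossAttention(32)(h_text, h_image)
print(out.shape)  # torch.Size([2, 5, 32])
```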

Cross-attention is expensive when the image has many tokens. For this reason, many systems compress visual information before giving it to the language model.

Visual Tokens

Vision-language models often treat images as sequences of visual tokens. This follows the transformer view of computation: everything becomes a sequence.

There are several common choices.

| Visual token type | Description |
| --- | --- |
| Patch tokens | Image is split into fixed-size patches |
| Region tokens | Object detector proposes regions |
| Grid tokens | CNN feature map locations become tokens |
| Latent tokens | Learned queries compress visual features |
| Video tokens | Spatiotemporal patches represent video |

For a vision transformer, an image of size 224 \times 224 with patch size 16 \times 16 gives

14 \times 14 = 196

patch tokens.

If each token has hidden dimension d, the encoded image has shape

V \in \mathbb{R}^{B \times 196 \times d}.

These tokens can be passed to a multimodal transformer, a projection layer, or a language model.
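The patch-token arithmetic can be checked directly with `unfold` (a learned linear embedding to dimension d would normally follow this step):

```python
import torch

B, C, H, W, P = 2, 3, 224, 224, 16   # patch size P = 16
n = H // P                           # 14 patches per side

images = torch.randn(B, C, H, W)

# Cut into non-overlapping P x P patches, then flatten each patch
patches = images.unfold(2, P, P).unfold(3, P, P)  # (B, C, 14, 14, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5)       # (B, 14, 14, C, P, P)
patches = patches.reshape(B, n * n, C * P * P)    # (B, 196, 768)

print(patches.shape)  # torch.Size([2, 196, 768])
```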

Common Architectures

Vision-language models differ in how tightly they combine image and text processing.

| Architecture | Description | Common use |
| --- | --- | --- |
| Dual encoder | Separate image and text encoders, compared by similarity | Retrieval, zero-shot classification |
| Encoder-decoder | Vision encoder conditions a text decoder | Captioning, VQA |
| Cross-attention model | Text and image interact through attention layers | Grounded understanding |
| Multimodal LLM | Visual tokens are fed into a language model | Chat, reasoning, tool use |
| Diffusion-conditioned model | Text controls image generation | Text-to-image generation |

A dual encoder is efficient because image and text embeddings can be precomputed. This makes it useful for search.

A multimodal LLM is more flexible because it can generate long answers and follow instructions. It is usually more expensive at inference time.

Training Objectives

Vision-language systems often combine several objectives.

Contrastive loss aligns image and text embeddings. Captioning loss trains generation. Masked modeling reconstructs missing image or text parts. Matching loss predicts whether an image and text belong together. Instruction tuning teaches the model to answer user requests.

| Objective | What it teaches |
| --- | --- |
| Contrastive learning | Global image-text alignment |
| Captioning | Text generation from visual input |
| Image-text matching | Pairwise compatibility |
| Masked language modeling | Language understanding with visual context |
| Masked image modeling | Visual representation learning |
| Instruction tuning | Response behavior and task following |

A strong model often uses staged training. First, it learns broad alignment from many image-text pairs. Then it learns higher-level tasks such as captioning, question answering, and instruction following.

PyTorch Skeleton

A minimal dual-encoder vision-language model can be written as follows:

import torch
import torch.nn as nn
import torch.nn.functional as F

class VisionLanguageDualEncoder(nn.Module):
    def __init__(self, vision_encoder, text_encoder, vision_dim, text_dim, embed_dim):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.text_encoder = text_encoder

        self.image_proj = nn.Linear(vision_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)

        # Learnable temperature, stored in log space and exponentiated in forward()
        self.logit_scale = nn.Parameter(torch.tensor(1.0))

    def encode_image(self, images):
        h = self.vision_encoder(images)
        z = self.image_proj(h)
        return F.normalize(z, dim=-1)

    def encode_text(self, tokens):
        h = self.text_encoder(tokens)
        z = self.text_proj(h)
        return F.normalize(z, dim=-1)

    def forward(self, images, tokens):
        image_emb = self.encode_image(images)
        text_emb = self.encode_text(tokens)

        scale = self.logit_scale.exp()
        logits = scale * image_emb @ text_emb.T

        return logits

Training step:

def contrastive_step(model, images, tokens):
    logits = model(images, tokens)
    labels = torch.arange(logits.size(0), device=logits.device)

    loss_image = F.cross_entropy(logits, labels)
    loss_text = F.cross_entropy(logits.T, labels)

    return (loss_image + loss_text) / 2

This model is small compared with production systems, but it contains the core idea: image and text are encoded into a shared embedding space, then trained by matching correct pairs.

Evaluation

Vision-language models are evaluated according to the task.

For retrieval, the model ranks texts for an image or images for a text. Common metrics include recall at k, median rank, and mean reciprocal rank.
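Recall at k can be computed directly from the batch similarity matrix, assuming matching pairs lie on the diagonal (the toy scores below are made up for illustration):

```python
import torch

def recall_at_k(similarity, k):
    """Fraction of queries whose correct match (the diagonal entry)
    appears among the top-k retrieved items. Rows are queries."""
    B = similarity.size(0)
    topk = similarity.topk(k, dim=1).indices   # (B, k) retrieved indices
    targets = torch.arange(B).unsqueeze(1)     # correct index per row
    hits = (topk == targets).any(dim=1)
    return hits.float().mean().item()

# Toy similarity matrix: queries 0 and 1 rank their match first,
# query 2 ranks its match second
sim = torch.tensor([[0.9, 0.1, 0.0],
                    [0.2, 0.8, 0.1],
                    [0.7, 0.1, 0.6]])

print(round(recall_at_k(sim, 1), 3))  # → 0.667
print(recall_at_k(sim, 2))            # → 1.0
```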

For captioning, metrics compare generated captions against reference captions. These metrics are imperfect because many valid captions may describe the same image.

For VQA, accuracy is used when answers are short and standardized. For open-ended answers, evaluation may require semantic matching or human judgment.

For multimodal chat models, evaluation is harder. The model may need to describe images, reason about diagrams, compare objects, read text inside images, and avoid hallucinating unsupported details.

Failure Modes

Vision-language models have several characteristic failure modes.

They may hallucinate objects that are not present in the image. They may miss small objects. They may rely on dataset bias rather than visual evidence. They may struggle with counting, spatial relations, fine-grained attributes, and text embedded in images. They may also overfit to common caption patterns.

For example, if many training captions say “a man riding a horse,” the model may predict a horse even when the image contains a donkey or mule. This is a semantic prior overriding visual evidence.

A reliable model should express uncertainty when the image lacks enough evidence. It should separate what is visible from what is inferred.

Summary

Vision-language models connect images and text through shared representations, cross-attention, or multimodal generation. The simplest systems learn an embedding space where matching images and captions are close. More advanced systems condition language models on visual tokens and can answer questions, describe scenes, follow instructions, and interact with tools.

The key theoretical ideas are representation alignment, contrastive learning, cross-modal attention, autoregressive generation, and multimodal grounding. In PyTorch, these ideas reduce to a small number of tensor operations: image encoding, text encoding, projection, normalization, attention, similarity computation, and cross-entropy training.