A vision-language model learns a joint representation of images and text. Its purpose is to connect visual information with natural language so that a single model can compare images and text, retrieve one from the other, caption images, answer questions about them, or generate images from text.
Traditional computer vision models map an image to a label from a fixed set, such as cat, car, or tumor. Traditional language models operate only on tokens. A vision-language model combines both modalities. It must process pixels and words, then place them into a shared computational space.
The central problem is alignment. The model must learn that an image of a dog on a beach and the sentence “a dog running on sand near the ocean” describe the same underlying scene.
The Basic Setting
A vision-language dataset usually consists of image-text pairs:

$$\mathcal{D} = \{(x_i, t_i)\}_{i=1}^{N},$$

where $x_i$ is an image and $t_i$ is a text description, caption, question, answer, or label.

The image is first converted into a tensor. In PyTorch convention, a batch of images often has shape

$$(B, 3, H, W).$$

The text is tokenized into integer IDs with shape

$$(B, L),$$

where $B$ is batch size and $L$ is sequence length.
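For instance, a toy batch with arbitrary sizes (batch of 4, 224x224 RGB images, a vocabulary of 30,000 tokens, sequence length 32) might look like this:

```python
import torch

images = torch.randn(4, 3, 224, 224)        # (B, 3, H, W) image batch
tokens = torch.randint(0, 30_000, (4, 32))  # (B, L) integer token IDs
print(images.shape, tokens.shape)
```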
A vision-language model contains at least two components:
- A vision encoder, which maps images into visual embeddings.
- A text encoder or language model, which maps token sequences into text embeddings or generated text.
The simplest form is:

$$v = f_{\text{image}}(x), \qquad u = f_{\text{text}}(t).$$

Here $v$ is the image embedding and $u$ is the text embedding.

If the two embeddings live in the same vector space, the model can compare them using cosine similarity:

$$\text{sim}(v, u) = \frac{v \cdot u}{\lVert v \rVert \, \lVert u \rVert}.$$

A high similarity means the image and text are semantically related.
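In PyTorch, this comparison is a single call; the vectors below are placeholders standing in for encoder outputs:

```python
import torch
import torch.nn.functional as F

v = torch.randn(512)  # placeholder image embedding
u = torch.randn(512)  # placeholder text embedding

similarity = F.cosine_similarity(v, u, dim=0)
print(similarity.item())
```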
Contrastive Vision-Language Learning
A common training method is contrastive learning. The model receives a batch of matching image-text pairs. Each image should be close to its correct text and far from other texts in the batch.
Suppose a batch contains $N$ image-text pairs. The model computes an image embedding matrix

$$V \in \mathbb{R}^{N \times d}$$

and a text embedding matrix

$$U \in \mathbb{R}^{N \times d}.$$

After normalization, the similarity matrix is

$$S = V U^{\top} \in \mathbb{R}^{N \times N}.$$

The entry $S_{ij}$ measures the similarity between image $i$ and text $j$. The diagonal entries $S_{ii}$ are correct pairs. The off-diagonal entries are incorrect pairs.
The model is trained so that $S_{ii}$ becomes large and $S_{ij}$ becomes small for $i \neq j$.
In simplified PyTorch form:
```python
import torch
import torch.nn.functional as F

# `image_encoder` and `text_encoder` are assumed to return one embedding per example.
image_emb = F.normalize(image_encoder(images), dim=-1)
text_emb = F.normalize(text_encoder(tokens), dim=-1)

# Entry (i, j) of the logits compares image i with text j.
logits = image_emb @ text_emb.T
labels = torch.arange(logits.size(0), device=logits.device)

# Symmetric cross-entropy: each image must match its own text, and vice versa.
loss_i2t = F.cross_entropy(logits, labels)
loss_t2i = F.cross_entropy(logits.T, labels)
loss = (loss_i2t + loss_t2i) / 2
```

This objective supports image-to-text retrieval and text-to-image retrieval. It also enables zero-shot classification. To classify an image, we write candidate labels as text prompts, encode them, and choose the text with the highest similarity to the image.
For example:
"a photo of a cat"
"a photo of a dog"
"a photo of a car"The model compares the image embedding against each prompt embedding.
Captioning Models
Image captioning models generate text from images. Instead of only comparing image and text embeddings, they condition a language decoder on visual features.
A typical captioning model has:
- A vision encoder.
- A projection layer.
- A transformer decoder.
The vision encoder produces visual tokens:

$$Z = f_{\text{vision}}(x) \in \mathbb{R}^{M \times d_v}.$$

A projection maps them into the language model dimension:

$$\tilde{Z} = Z W \in \mathbb{R}^{M \times d_{\text{text}}}.$$

The decoder then predicts text tokens autoregressively:

$$p(y_1, \dots, y_T \mid x) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, \tilde{Z}).$$
During training, the model receives the correct previous tokens and learns to predict the next token. During inference, it generates one token at a time.
```python
# `vision_encoder`, `projection`, and `text_decoder` are assumed components;
# the decoder attends to the projected visual tokens through cross-attention.
visual_tokens = vision_encoder(images)
visual_tokens = projection(visual_tokens)

# Teacher forcing: feed tokens 0..T-1 and predict tokens 1..T.
logits = text_decoder(
    input_ids=caption_tokens[:, :-1],
    encoder_hidden_states=visual_tokens,
)
loss = F.cross_entropy(
    logits.reshape(-1, vocab_size),
    caption_tokens[:, 1:].reshape(-1),
)
```

Captioning requires more than recognition. The model must decide which objects, attributes, relations, and actions are important enough to mention.
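At inference time, a greedy decoding loop reuses the same components; `bos_id` and `max_len` are assumed tokenizer and generation settings, and end-of-sequence handling is omitted for brevity:

```python
import torch

visual_tokens = projection(vision_encoder(images))
generated = torch.full((images.size(0), 1), bos_id, dtype=torch.long)

with torch.no_grad():
    for _ in range(max_len):
        logits = text_decoder(
            input_ids=generated,
            encoder_hidden_states=visual_tokens,
        )
        # Take the most likely next token for each example and append it.
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=1)
```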
Visual Question Answering
In visual question answering, the input is an image and a question. The output is an answer.
Example:
```
Image: a table with two cups and a plate
Question: How many cups are on the table?
Answer: two
```

The model must combine visual perception with language understanding. It must parse the question, locate relevant visual evidence, and produce a response.
A common architecture encodes the image into visual tokens and the question into text tokens. A transformer then performs cross-modal attention between them.
The model can be trained as classification when answers come from a fixed vocabulary:

$$\hat{a} = \arg\max_{a \in \mathcal{A}} \, p(a \mid x, q),$$

or as generation when answers are free-form:

$$p(a_1, \dots, a_T \mid x, q) = \prod_{t=1}^{T} p(a_t \mid a_{<t}, x, q).$$
Generative VQA is more flexible, but harder to evaluate. Classification VQA is easier to score, but cannot naturally produce long explanations.
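A sketch of the classification variant, assuming a fusion module `multimodal_encoder` that returns one pooled vector per example and a fixed answer vocabulary of size `num_answers`:

```python
import torch.nn as nn

class VQAClassifier(nn.Module):
    def __init__(self, multimodal_encoder, hidden_dim, num_answers):
        super().__init__()
        # `multimodal_encoder` is an assumed cross-modal fusion module.
        self.encoder = multimodal_encoder
        self.answer_head = nn.Linear(hidden_dim, num_answers)

    def forward(self, visual_tokens, question_tokens):
        fused = self.encoder(visual_tokens, question_tokens)  # (B, hidden_dim)
        return self.answer_head(fused)                        # logits over answers
```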
Cross-Attention Between Vision and Language
Cross-attention is the main mechanism used to fuse visual and textual information.
In self-attention, queries, keys, and values come from the same sequence. In cross-attention, the query comes from one modality and the keys and values come from another.
For example, text tokens may attend to image tokens:

$$Q = X_{\text{text}} W_Q, \qquad K = X_{\text{image}} W_K, \qquad V = X_{\text{image}} W_V.$$

The attention output is

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V.$$
This lets each text token select relevant parts of the image. A token such as “red” may attend to colored regions. A token such as “dog” may attend to animal-shaped regions.
Cross-attention is expensive when the image has many tokens. For this reason, many systems compress visual information before giving it to the language model.
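A minimal sketch of text-to-image cross-attention with PyTorch's built-in attention module; the dimensions here are arbitrary placeholders:

```python
import torch
import torch.nn as nn

d_model = 256
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

text_tokens = torch.randn(2, 20, d_model)    # (B, text length, d_model)
image_tokens = torch.randn(2, 196, d_model)  # (B, visual tokens, d_model)

# Queries come from the text; keys and values come from the image.
fused, attn_weights = cross_attn(query=text_tokens, key=image_tokens, value=image_tokens)
print(fused.shape)  # (2, 20, 256): each text token now carries visual context
```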
Visual Tokens
Vision-language models often treat images as sequences of visual tokens. This follows the transformer view of computation: everything becomes a sequence.
There are several common choices.
| Visual token type | Description |
|---|---|
| Patch tokens | Image is split into fixed-size patches |
| Region tokens | Object detector proposes regions |
| Grid tokens | CNN feature map locations become tokens |
| Latent tokens | Learned queries compress visual features |
| Video tokens | Spatiotemporal patches represent video |
For a vision transformer, an image of size $H \times W$ with patch size $P$ gives

$$N = \frac{H W}{P^2}$$

patch tokens.

If each token has hidden dimension $d$, the encoded image has shape

$$(N, d).$$
These tokens can be passed to a multimodal transformer, a projection layer, or a language model.
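For example, with a 224x224 input and 16x16 patches:

```python
# Number of patch tokens for a 224x224 image with 16x16 patches.
H, W, P = 224, 224, 16
num_patches = (H // P) * (W // P)  # 14 * 14 = 196
d = 768                            # hidden dimension of a typical vision transformer
print(num_patches, (num_patches, d))  # 196 tokens, each of dimension 768
```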
Common Architectures
Vision-language models differ in how tightly they combine image and text processing.
| Architecture | Description | Common use |
|---|---|---|
| Dual encoder | Separate image and text encoders, compared by similarity | Retrieval, zero-shot classification |
| Encoder-decoder | Vision encoder conditions a text decoder | Captioning, VQA |
| Cross-attention model | Text and image interact through attention layers | Grounded understanding |
| Multimodal LLM | Visual tokens are fed into a language model | Chat, reasoning, tool use |
| Diffusion-conditioned model | Text controls image generation | Text-to-image generation |
A dual encoder is efficient because image and text embeddings can be precomputed. This makes it useful for search.
A multimodal LLM is more flexible because it can generate long answers and follow instructions. It is usually more expensive at inference time.
Training Objectives
Vision-language systems often combine several objectives.
Contrastive loss aligns image and text embeddings. Captioning loss trains generation. Masked modeling reconstructs missing image or text parts. Matching loss predicts whether an image and text belong together. Instruction tuning teaches the model to answer user requests.
| Objective | What it teaches |
|---|---|
| Contrastive learning | Global image-text alignment |
| Captioning | Text generation from visual input |
| Image-text matching | Pairwise compatibility |
| Masked language modeling | Language understanding with visual context |
| Masked image modeling | Visual representation learning |
| Instruction tuning | Response behavior and task following |
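Of these objectives, image-text matching has not yet appeared in code. A minimal sketch, assuming fused features from a cross-modal encoder (the placeholder tensors below stand in for real model outputs):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

match_head = nn.Linear(512, 2)            # predicts: matched vs. mismatched pair
fused_features = torch.randn(8, 512)      # (B, d) placeholder fused image-text features
match_labels = torch.randint(0, 2, (8,))  # 1 = correct pair, 0 = mismatched pair

itm_logits = match_head(fused_features)
itm_loss = F.cross_entropy(itm_logits, match_labels)
```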
A strong model often uses staged training. First, it learns broad alignment from many image-text pairs. Then it learns higher-level tasks such as captioning, question answering, and instruction following.
PyTorch Skeleton
A minimal dual-encoder vision-language model can be written as follows:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisionLanguageDualEncoder(nn.Module):
    def __init__(self, vision_encoder, text_encoder, vision_dim, text_dim, embed_dim):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.text_encoder = text_encoder
        # Project both modalities into a shared embedding space.
        self.image_proj = nn.Linear(vision_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)
        # Learnable temperature applied to the similarity logits.
        self.logit_scale = nn.Parameter(torch.tensor(1.0))

    def encode_image(self, images):
        h = self.vision_encoder(images)
        z = self.image_proj(h)
        return F.normalize(z, dim=-1)

    def encode_text(self, tokens):
        h = self.text_encoder(tokens)
        z = self.text_proj(h)
        return F.normalize(z, dim=-1)

    def forward(self, images, tokens):
        image_emb = self.encode_image(images)
        text_emb = self.encode_text(tokens)
        scale = self.logit_scale.exp()
        logits = scale * image_emb @ text_emb.T
        return logits
```

Training step:
```python
def contrastive_step(model, images, tokens):
    logits = model(images, tokens)
    labels = torch.arange(logits.size(0), device=logits.device)
    # Symmetric loss over the image-to-text and text-to-image directions.
    loss_image = F.cross_entropy(logits, labels)
    loss_text = F.cross_entropy(logits.T, labels)
    return (loss_image + loss_text) / 2
```

This model is small compared with production systems, but it contains the core idea: image and text are encoded into a shared embedding space, then trained by matching correct pairs.
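As a usage sketch, the encoders below are tiny placeholder modules chosen only so the example runs end to end:

```python
# Tiny placeholder encoders, just large enough to exercise the skeleton.
vision_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256))
text_encoder = nn.Sequential(nn.Embedding(1000, 128), nn.Flatten(), nn.Linear(16 * 128, 256))

model = VisionLanguageDualEncoder(
    vision_encoder, text_encoder, vision_dim=256, text_dim=256, embed_dim=128
)

images = torch.randn(8, 3, 32, 32)        # toy image batch
tokens = torch.randint(0, 1000, (8, 16))  # toy token batch
loss = contrastive_step(model, images, tokens)
loss.backward()
```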
Evaluation
Vision-language models are evaluated according to the task.
For retrieval, the model ranks texts for an image or images for a text. Common metrics include recall at $K$, median rank, and mean reciprocal rank.
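A minimal sketch of recall at $K$ for image-to-text retrieval, assuming row $i$ of each normalized embedding matrix corresponds to the same image-text pair:

```python
import torch

def recall_at_k(image_emb, text_emb, k=5):
    sims = image_emb @ text_emb.T                       # (N, N) similarity matrix
    topk = sims.topk(k, dim=-1).indices                 # top-k text indices per image
    targets = torch.arange(sims.size(0)).unsqueeze(-1)  # correct index for each image
    hits = (topk == targets).any(dim=-1).float()
    return hits.mean().item()
```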
For captioning, metrics compare generated captions against reference captions. These metrics are imperfect because many valid captions may describe the same image.
For VQA, accuracy is used when answers are short and standardized. For open-ended answers, evaluation may require semantic matching or human judgment.
For multimodal chat models, evaluation is harder. The model may need to describe images, reason about diagrams, compare objects, read text inside images, and avoid hallucinating unsupported details.
Failure Modes
Vision-language models have several characteristic failure modes.
They may hallucinate objects that are not present in the image. They may miss small objects. They may rely on dataset bias rather than visual evidence. They may struggle with counting, spatial relations, fine-grained attributes, and text embedded in images. They may also overfit to common caption patterns.
For example, if many training captions say “a man riding a horse,” the model may predict a horse even when the image contains a donkey or mule. This is a semantic prior overriding visual evidence.
A reliable model should express uncertainty when the image lacks enough evidence. It should separate what is visible from what is inferred.
Summary
Vision-language models connect images and text through shared representations, cross-attention, or multimodal generation. The simplest systems learn an embedding space where matching images and captions are close. More advanced systems condition language models on visual tokens and can answer questions, describe scenes, follow instructions, and interact with tools.
The key theoretical ideas are representation alignment, contrastive learning, cross-modal attention, autoregressive generation, and multimodal grounding. In PyTorch, these ideas reduce to a small number of tensor operations: image encoding, text encoding, projection, normalization, attention, similarity computation, and cross-entropy training.