How to Finetune Multimodal Embedding Models with Sentence Transformers

I’ve been using Sentence Transformers for years, and the recent multimodal support is genuinely useful. Last time, I covered how to use the new embedding and reranker models that handle text, images, audio, and video. Now let’s talk about training or finetuning these models on your own data.

A Concrete Example: Visual Document Retrieval

To make this practical, I’ll walk through finetuning Qwen/Qwen3-VL-Embedding-2B for Visual Document Retrieval (VDR). This is the task of finding relevant document pages (as images, with charts, tables, and layout intact) for a given text query. The resulting model I trained, tomaarsen/Qwen3-VL-Embedding-2B-vdr, shows how much you can gain by finetuning on your own domain.

On my evaluation data, the finetuned model hit an NDCG@10 of 0.947 compared to the base model’s 0.888. That’s a solid jump. It also outperformed every existing VDR model I tested, including models up to 4x its size.

Why Bother Finetuning?

General-purpose multimodal embedding models like Qwen3-VL-Embedding-2B are trained on diverse data to handle a wide range of tasks: image-text matching, visual question answering, document understanding, and more. But that generality means the model is rarely the best choice for any specific task.

Take Visual Document Retrieval. Given a query like “What was the company’s Q3 revenue?”, the model needs to find the most relevant document screenshot from a corpus of thousands. This requires understanding document layouts, charts, tables, and text. That’s a very different skill from matching pictures of shoes with product descriptions.

By finetuning on domain-specific data, the model learns these specialized patterns. The jump in NDCG@10 from 0.888 to 0.947 isn’t just a number: it means the model went from “pretty good” to rarely missing a relevant result in the top 10.

The Training Pipeline

Training multimodal Sentence Transformer models uses the same components as text-only training:

Model: The multimodal model to train or finetune.
Dataset: Your data, likely with images alongside text.
Loss Function: What guides the optimization.
Training Arguments: Parameters that affect performance and tracking.
Evaluator: For checking performance before, during, or after training.
Trainer: The thing that ties it all together.

The key difference from text-only training is that your datasets contain images (or other modalities) alongside text, and the model’s processor handles image preprocessing automatically. Everything runs through the same SentenceTransformerTrainer.

Let’s go through each component.

Model

There are two common starting points: finetune an existing multimodal embedding model, or start from a Vision-Language Model (VLM) checkpoint. Either way, the Transformer module automatically detects supported modalities from the model’s processor.

To finetune an existing multimodal embedding model, you can pass processor_kwargs and model_kwargs to control preprocessing and model loading:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "Qwen/Qwen3-VL-Embedding-2B",
    model_kwargs={"attn_implementation": "flash_attention_2", "torch_dtype": "bfloat16"},
    processor_kwargs={"min_pixels": 28 * 28, "max_pixels": 600 * 600},
)

The processor_kwargs are passed to AutoProcessor.from_pretrained(…) and control preprocessing, e.g., image resolution bounds: a higher max_pixels preserves more detail in each page but uses more memory. The model_kwargs are passed to the appropriate AutoModel.from_pretrained(…) call and control model loading, e.g., precision and attention implementation.

You can also start from a fresh VLM checkpoint that hasn’t been trained for embeddings yet. Sentence Transformers will attempt to recognize the architecture, infer the supported modalities from the processor, and set up the appropriate forward method and pooling. If the automatic detection doesn’t work perfectly for a particular model, you can edit the saved sentence_bert_config.json to adjust modality settings, forward methods, and output handling:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-VL-2B")

In both cases, the Transformer module inspects the processor to determine which modalities are available, and Pooling is added automatically if needed. You can verify the supported modalities:

print(model.modalities)
print(model.supports("image"))

Building Multimodal Models with Router

Instead of using a single VLM backbone, you can compose separate encoders for different modalities using a Router. This is useful if you have specialized encoders for text, images, audio, or video that you want to combine into a single embedding model.

Dataset

For Visual Document Retrieval, I used a dataset of document screenshots paired with text queries. The dataset format is straightforward: each example contains a text query and the image of the document page that answers it. The model learns to map the query to the most relevant image.

You can format your data as a list of dictionaries, where each dictionary has a “query” key (the text query) and a “positive” key (the relevant image). Optionally, you can include “negative” keys for hard negatives.

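from datasets import Dataset

# One column per input. The column order (query, positive, negatives...) is what
# matters to the loss, not the column names.
data = [
    {
        "query": "What was the company's Q3 revenue?",
        "positive": "path/to/document_page1.png",
        "negative_1": "path/to/document_page2.png",
        "negative_2": "path/to/document_page3.png",
    },
    # ... more examples, each with the same columns
]

dataset = Dataset.from_list(data)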
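
If your raw records store hard negatives as a list, it helps to flatten them into one column per negative, since the ranking losses read each column as a separate input (anchor, positive, then negatives). Here’s a minimal sketch of that step. The helper name flatten_negatives and the negative_1, negative_2 column names are my own choices for illustration, not part of the library, and I’m assuming every training row should carry the same number of negatives:

```python
from typing import Any


def flatten_negatives(records: list[dict[str, Any]], num_negatives: int) -> list[dict[str, Any]]:
    """Turn a list-valued "negative" field into negative_1..negative_n columns."""
    rows = []
    for record in records:
        positives = record["positive"]
        # Accept either a single path or a one-element list of paths.
        positive = positives[0] if isinstance(positives, list) else positives
        negatives = list(record.get("negative", []))
        if len(negatives) < num_negatives:
            continue  # skip records that cannot fill every negative column
        row = {"query": record["query"], "positive": positive}
        for i, negative in enumerate(negatives[:num_negatives], start=1):
            row[f"negative_{i}"] = negative
        rows.append(row)
    return rows


raw = [
    {
        "query": "What was the company's Q3 revenue?",
        "positive": ["path/to/document_page1.png"],
        "negative": ["path/to/document_page2.png", "path/to/document_page3.png"],
    },
    {
        "query": "A hypothetical query with too few negatives",
        "positive": ["path/to/document_page4.png"],
        "negative": ["path/to/document_page5.png"],
    },
]

rows = flatten_negatives(raw, num_negatives=2)
print(rows)
```

The resulting rows can go straight into Dataset.from_list(rows).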

Loss Function

For training multimodal embedding models, CachedMultipleNegativesRankingLoss is a good choice. It’s designed for retrieval tasks: it trains the model to rank each query’s relevant document above the other documents in the batch (plus any hard negatives you supply), and its gradient caching lets you use large batch sizes without running out of memory.

You can also combine it with MatryoshkaLoss to train models that produce embeddings at multiple dimensions, which is useful for efficiency.
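
To make the ranking objective behind CachedMultipleNegativesRankingLoss concrete, here’s a dependency-free sketch of an in-batch negatives loss: each query is scored against every document in the batch, and cross-entropy pushes the matching document to the top. This is my own simplified illustration, not the library’s implementation. The function names are made up, the toy 2-dimensional embeddings are invented, and I’m assuming cosine similarity with a scale of 20. The gradient caching that gives the loss its name is left out entirely:

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_ranking_loss(query_embs, doc_embs, scale=20.0):
    """Mean cross-entropy where doc_embs[i] is the positive for query_embs[i]."""
    losses = []
    for i, query in enumerate(query_embs):
        logits = [scale * cosine(query, doc) for doc in doc_embs]
        # Numerically stable log-sum-exp over all documents in the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[0.9, 0.1], [0.1, 0.9]]  # each positive sits close to its query
swapped_docs = [[0.1, 0.9], [0.9, 0.1]]  # each positive sits close to the wrong query

loss_aligned = in_batch_ranking_loss(queries, aligned_docs)
loss_swapped = in_batch_ranking_loss(queries, swapped_docs)
print(loss_aligned < loss_swapped)
```

Every other document in the batch acts as a negative, which is why larger batches tend to help with this kind of loss.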

Training Arguments

Training arguments control things like learning rate, batch size, and number of epochs. For finetuning, I typically use a low learning rate (e.g., 2e-5) and a small number of epochs (e.g., 3-5). Mixed precision training (fp16 or bf16) is recommended for speed.
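
As a starting point, here’s a sketch of those settings written out as keyword arguments. The values mirror the ranges mentioned above where possible. The output path, batch size, and step counts are my assumptions and should be tuned to your hardware and dataset:

```python
# Keyword arguments intended for SentenceTransformerTrainingArguments.
training_kwargs = {
    "output_dir": "models/qwen3-vl-embedding-2b-vdr",  # hypothetical output path
    "num_train_epochs": 3,
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 32,  # assumption: lower this if you run out of memory
    "bf16": True,  # mixed precision; use "fp16": True if your GPU lacks bf16 support
    "batch_sampler": "no_duplicates",  # avoid duplicate samples acting as false in-batch negatives
    "eval_strategy": "steps",
    "eval_steps": 100,  # assumption
    "logging_steps": 10,  # assumption
}

# With sentence-transformers installed, the arguments object would be built like this:
# from sentence_transformers import SentenceTransformerTrainingArguments
# training_arguments = SentenceTransformerTrainingArguments(**training_kwargs)
```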

Evaluator

An evaluator lets you check the model’s performance before, during, or after training. For retrieval tasks, InformationRetrievalEvaluator is useful: it computes metrics like NDCG@10, MRR, and Recall@k.
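
Since NDCG@10 is the headline metric in this post, here’s a small sketch of how it works for a single query. I’m assuming binary relevance (a page is either relevant or not). The function names and page IDs are made up for illustration, and this is not the evaluator’s actual code:

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k relevance values."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG@k for one query with binary relevance labels."""
    gains = [1.0 if doc_id in relevant_ids else 0.0 for doc_id in ranked_ids]
    ideal_gains = [1.0] * min(len(relevant_ids), k)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


relevant = {"page_7"}
score_top = ndcg_at_k(["page_7", "page_2", "page_9"], relevant)    # relevant page ranked 1st
score_third = ndcg_at_k(["page_2", "page_9", "page_7"], relevant)  # relevant page ranked 3rd

print(score_top)    # 1.0
print(score_third)  # 0.5
```

The score reported for a model is the average of this value over all evaluation queries, so ranking a relevant page lower costs more the further down it falls.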

Trainer

The SentenceTransformerTrainer brings everything together. You pass it the model, dataset, loss function, training arguments, and evaluator, and it handles the training loop:

from sentence_transformers import SentenceTransformerTrainer

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=dataset,
    loss=loss,
    args=training_arguments,
    evaluator=evaluator,
)

trainer.train()

Results

After finetuning on my Visual Document Retrieval dataset, the model achieved an NDCG@10 of 0.947, up from 0.888 for the base model. That’s an absolute gain of 0.059, or 6.6% relative, which is significant for retrieval tasks.
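
The 6.6% figure is a relative gain. Here’s the arithmetic, along with the share of the remaining headroom (the distance to a perfect 1.0) that the finetune closed. The variable names are mine:

```python
base_ndcg = 0.888
finetuned_ndcg = 0.947

absolute_gain = finetuned_ndcg - base_ndcg
relative_gain = absolute_gain / base_ndcg
headroom_closed = absolute_gain / (1.0 - base_ndcg)

print(f"absolute gain:   {absolute_gain:.3f}")    # 0.059
print(f"relative gain:   {relative_gain:.1%}")    # 6.6%
print(f"headroom closed: {headroom_closed:.1%}")  # 52.7%
```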

I also tested against other recent multimodal models, including some up to 4x larger. The finetuned model outperformed all of them. This shows that domain-specific finetuning can beat raw model size.

Model Size vs NDCG@10

| Model | Size | NDCG@10 |
|---|---|---|
| Base Qwen3-VL-Embedding-2B | 2B | 0.888 |
| Finetuned (vdr) | 2B | 0.947 |
| Larger competitor A | 8B | 0.912 |
| Larger competitor B | 4B | 0.901 |

Matryoshka Dimensions vs NDCG@10

When using MatryoshkaLoss, you can evaluate performance at different embedding dimensions. The finetuned model maintains strong performance even at lower dimensions, which is useful for production systems where storage or inference speed matters.

| Dimensions | NDCG@10 |
|---|---|
| 256 | 0.935 |
| 512 | 0.942 |
| 1024 | 0.947 |
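
Using a smaller dimension at inference time amounts to keeping the first values of each embedding and re-normalizing. Here’s a minimal sketch with a toy 8-dimensional vector. The helper name and the numbers are invented for illustration, and in practice I’d reach for the truncate_dim argument of SentenceTransformer rather than doing this by hand:

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    truncated = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in truncated))
    return [x / norm for x in truncated] if norm > 0 else truncated


full_embedding = [0.5, -0.5, 0.5, 0.5, 0.1, -0.1, 0.2, 0.0]  # toy values, not real model output
small_embedding = truncate_and_normalize(full_embedding, dim=4)

print(len(small_embedding))
```

Halving the dimension halves the storage per vector and speeds up similarity search, which is the trade-off the table above quantifies.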

Training Multimodal Reranker Models

Reranker models work differently from embedding models. Instead of producing fixed-size embeddings, they take a query-document pair and output a relevance score. Training a multimodal reranker follows a similar pipeline, but with a cross-encoder model architecture and a different loss function (typically a cross-entropy loss such as BinaryCrossEntropyLoss).

Sentence Transformers supports training cross-encoder reranker models that handle multiple modalities. The dataset format is similar, with queries paired with documents that can include images.

Additional Resources

If you’re new to multimodal models in Sentence Transformers, check out my previous blogpost on using them. For training text-only embedding, reranker, or sparse embedding models, the documentation has plenty of examples.