Multimodal Embeddings and Rerankers Just Got a Lot Easier with Sentence Transformers v5.4

I’ve been using Sentence Transformers for years, mostly for text embeddings and reranking in RAG pipelines. It’s been a solid workhorse. But the v5.4 update finally adds something I’ve been waiting for: proper multimodal support. You can now encode and compare texts, images, audio, and video using the same familiar model.encode() API.

Let me show you what changed and why this actually matters beyond the usual hype.

What Multimodal Actually Means Here

Traditional embedding models take text and spit out a vector. That’s it. Multimodal embedding models take text, images, audio, or video and map them all into a shared embedding space. So you can compare a text query against image documents, find video clips matching a description, or build RAG pipelines that work across modalities.
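
To make "shared embedding space" concrete, here is a minimal sketch using made-up 3-dimensional vectors in place of real embeddings (actual models produce far larger vectors). The vectors and file names are illustrative assumptions, not model output. The point is only that once text and images are vectors in the same space, comparing them is ordinary cosine similarity.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, invented for illustration only.
text_query = [0.9, 0.1, 0.0]  # pretend this encodes "a photo of a car"
image_embeddings = {
    "car.jpg": [0.8, 0.2, 0.1],
    "bee.jpg": [0.1, 0.9, 0.2],
}

# Score every image against the text query and keep the closest one.
scores = {name: cosine_similarity(text_query, vec)
          for name, vec in image_embeddings.items()}
best_match = max(scores, key=scores.get)
print(best_match, scores)
```

A real model replaces the hand-written vectors; the comparison step stays the same.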

The same goes for rerankers. Normal Cross Encoders score relevance between two texts. Multimodal rerankers can score pairs where one or both elements are images, combined text-image documents, or other modalities.

This isn’t just a nice-to-have. If you’ve ever tried to build a visual document retrieval system or a cross-modal search, you know how painful it was to hack something together with separate models. Now it’s one call.

Getting Started

Installation is straightforward, but you need to pull in the right extras:

pip install -U "sentence-transformers[image]"
pip install -U "sentence-transformers[audio]"
pip install -U "sentence-transformers[video]"

Or all at once:

pip install -U "sentence-transformers[image,video,train]"

One thing to keep in mind: VLM-based models like Qwen3-VL-2B need roughly 8 GB of VRAM, and the 8B variants want about 20 GB. If you don’t have a local GPU, Google Colab or a cloud GPU service is your friend. On CPU, these models crawl, so stick to text-only or CLIP models for CPU inference.
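
If you want to encode that rule of thumb in a script, a small helper along these lines could work. This is a sketch under assumptions: `pick_model` is a hypothetical name, the thresholds are just the rough VRAM figures from this post, the `Qwen/` prefix on the 8B model ID is assumed by analogy with the 2B one, and the CLIP checkpoint is one example of a CPU-friendly fallback.

```python
def pick_model(vram_gb):
    """Return a model name that should fit in the given amount of VRAM (in GB).

    Thresholds follow the rough figures quoted in this post.
    Pass 0 for CPU-only machines.
    """
    if vram_gb >= 20:
        return "Qwen/Qwen3-VL-Embedding-8B"  # assumed Hub ID for the 8B variant
    if vram_gb >= 8:
        return "Qwen/Qwen3-VL-Embedding-2B"
    # Assumption: a small CLIP checkpoint as the CPU-friendly fallback.
    return "sentence-transformers/clip-ViT-B-32"

# With PyTorch installed, total VRAM on the first GPU could be read like this:
#   import torch
#   if torch.cuda.is_available():
#       vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
#   else:
#       vram_gb = 0
print(pick_model(24), pick_model(12), pick_model(0))
```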

Using Multimodal Embedding Models

Loading a model is the same as always:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")

The model auto-detects which modalities it supports. No extra config needed.

Encoding images is where it gets interesting. You can pass URLs, local file paths, or PIL Image objects:

img_embeddings = model.encode([
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
])
print(img_embeddings.shape)

Cross-modal similarity works because everything lives in the same space:

text_embeddings = model.encode([
    "A green car parked in front of a yellow building",
    "A red car driving on a highway",
    "A bee on a pink flower",
    "A wasp on a wooden table",
])

similarities = model.similarity(text_embeddings, img_embeddings)
print(similarities)

As expected, “A green car parked in front of a yellow building” matches the car image (0.51), and “A bee on a pink flower” matches the bee image (0.67). The hard negatives get lower scores.

You might notice those scores aren’t close to 1.0. That’s the modality gap: embeddings from different modalities tend to cluster in separate regions of the space. Cross-modal similarities are typically lower than text-to-text ones, but the relative ordering holds, so retrieval still works fine.
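
In practice this means ranking by score rather than filtering by an absolute cutoff. The sketch below illustrates that with a hand-written similarity matrix: 0.51 and 0.67 are the scores quoted above, and every other number is a hypothetical placeholder for the lower hard-negative scores. `best_text_per_image` is a helper name made up for this example.

```python
texts = [
    "A green car parked in front of a yellow building",
    "A red car driving on a highway",
    "A bee on a pink flower",
    "A wasp on a wooden table",
]
images = ["car.jpg", "bee.jpg"]

# Rows are texts, columns are images. Only 0.51 and 0.67 come from the
# run described above; the remaining values are invented placeholders.
similarities = [
    [0.51, 0.10],
    [0.30, 0.08],
    [0.12, 0.67],
    [0.09, 0.35],
]

def best_text_per_image(similarities, texts, images):
    """For each image (column), return the text with the highest score."""
    best = {}
    for col, image in enumerate(images):
        column = [row[col] for row in similarities]
        best[image] = texts[column.index(max(column))]
    return best

# A cutoff tuned for text-to-text scores would reject every pair here...
above_threshold = [s for row in similarities for s in row if s >= 0.8]
# ...while ranking within each column still picks the right caption.
best = best_text_per_image(similarities, texts, images)
print(above_threshold, best)
```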

For retrieval tasks, use encode_query() and encode_document() instead of plain encode(). Many models prepend different instruction prompts depending on whether the input is a query or a document, and these methods handle that automatically.
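
Here is a minimal sketch of how those two methods might fit into a retrieval helper. `retrieve` is a hypothetical function name, and it assumes the model object exposes `encode_query()`, `encode_document()`, and `similarity()` as described in this post. The actual model call is left commented out because it downloads several gigabytes of weights.

```python
def retrieve(model, query, documents, top_k=3):
    """Rank documents for one query; returns (document, score) pairs, best first.

    `model` is expected to be a SentenceTransformer-like object exposing
    encode_query(), encode_document(), and similarity().
    """
    query_embeddings = model.encode_query([query])          # one row per query
    document_embeddings = model.encode_document(documents)  # one row per document
    # similarity() returns a (n_queries, n_documents) matrix; take the only row.
    scores = model.similarity(query_embeddings, document_embeddings)[0]
    ranked = sorted(
        zip(documents, (float(s) for s in scores)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:top_k]

# Usage (commented out because it downloads the model):
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("Qwen/Qwen3-VL-Embedding-2B")
# print(retrieve(model, "A bee on a pink flower", [
#     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg",
#     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
# ]))
```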

Multimodal Reranker Models

Rerankers work similarly but score relevance instead of producing embeddings:

from sentence_transformers import CrossEncoder

model = CrossEncoder("Qwen/Qwen3-VL-Reranker-2B")

scores = model.predict([
    ("A green car", "https://.../car.jpg"),
    ("A red car", "https://.../car.jpg"),
    ("A bee", "https://.../bee.jpg"),
])
print(scores)

The first pair scores highest, as expected.
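
In a RAG pipeline you usually score one query against several candidates and then sort them. A minimal sketch of that pattern, assuming only the `predict()` call shown above (`rerank` is a hypothetical helper name, not part of the library):

```python
def rerank(model, query, candidates):
    """Score each (query, candidate) pair and return candidates sorted best first.

    `model` is expected to be a CrossEncoder-like object whose predict()
    takes a list of pairs and returns one score per pair.
    """
    pairs = [(query, candidate) for candidate in candidates]
    scores = model.predict(pairs)
    return sorted(
        zip(candidates, (float(s) for s in scores)),
        key=lambda item: item[1],
        reverse=True,
    )

# Usage (commented out because it downloads the model):
# from sentence_transformers import CrossEncoder
# model = CrossEncoder("Qwen/Qwen3-VL-Reranker-2B")
# for candidate, score in rerank(model, "A bee", list_of_image_urls):
#     print(f"{score:.3f}  {candidate}")
```

The typical setup is to retrieve a shortlist with the embedding model first and pass only that shortlist to the reranker, since scoring every pair is much slower than comparing precomputed embeddings.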

Supported Input Types

The library accepts a wide range of inputs:

  • Text: strings
  • Images: URLs, local paths, PIL Images, numpy arrays, torch tensors, base64 strings
  • Audio: URLs, local paths, numpy arrays, torch tensors
  • Video: URLs, local paths, numpy arrays, torch tensors (video frames)

You can mix and match within a single batch. The library handles conversion internally.

Supported Models

As of now, the following models work out of the box:

  • Qwen3-VL-Embedding-2B and Qwen3-VL-Reranker-2B
  • Qwen3-VL-Embedding-8B and Qwen3-VL-Reranker-8B
  • CLIP variants (text+image)
  • SigLIP variants

More models are being integrated. Check the Hugging Face model hub for the latest.

Practical Considerations

I’ve been testing this for a visual document retrieval use case. The setup is refreshingly simple compared to the multi-model pipelines I used before. But there are a few things to watch out for:

  • VRAM usage is real. The 2B models are manageable on a consumer GPU, but the 8B variants need serious hardware.
  • Modality gap is not a bug. Don’t expect cross-modal scores to match text-text scores. The relative ordering is what matters.
  • Batch size matters. Large images or video frames eat memory fast. Start small and scale up.
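
On the batch size point, one alternative to tuning by hand is to start from a guess and halve it whenever the GPU runs out of memory. The helper below is a sketch of that idea, not something the library provides: `encode_with_backoff` is a hypothetical name, and it assumes that out-of-memory failures surface as a `RuntimeError` whose message contains "out of memory", which is how PyTorch's CUDA errors typically look.

```python
def encode_with_backoff(model, inputs, batch_size=8):
    """Call model.encode(), halving batch_size whenever memory runs out."""
    while True:
        try:
            return model.encode(inputs, batch_size=batch_size)
        except RuntimeError as error:
            # Re-raise anything that is not an out-of-memory error,
            # and give up once the batch size cannot shrink further.
            if "out of memory" not in str(error).lower() or batch_size == 1:
                raise
            batch_size //= 2
            print(f"Out of memory, retrying with batch_size={batch_size}")
```

For large images or video, a starting value in the single digits is a cautious choice.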

If you’re building a multimodal RAG pipeline, this is the easiest path I’ve seen so far. The API is identical to what you already know, and the results are solid.

For training your own multimodal models, there’s a companion blog post that covers finetuning. But for most use cases, the pretrained models will get you far.
