Sentence Transformers Now Supports Multimodal Embedding Model Finetuning

Finetuned 2B model achieves NDCG@10 of 0.947 on VDR, outperforming models up to 4x its size

신하영 · Wednesday, April 15, 2026, 15:00 · 5 min read

Training and Finetuning Multimodal Embedding & Reranker Models with Sentence Transformers

Summary

• Sentence Transformers officially supports finetuning of multimodal embedding models.
• A finetuned 2B model achieved NDCG@10 of 0.947 on VDR, outperforming models up to 4x its size.
• Full compatibility with the existing text-only training pipeline significantly lowers the entry barrier.

Beyond Text: The Era of Training Embedding Models on Images and Documents

Sentence Transformers, Hugging Face's Python library, now officially supports training and finetuning multimodal embedding and reranker models. In a blog post published on April 16, 2026, developer Tom Aarsen detailed the full pipeline for finetuning multimodal models, which can handle text, images, audio, and video, on domain-specific data. In his experiment, finetuning Qwen/Qwen3-VL-Embedding-2B for the Visual Document Retrieval (VDR) task raised NDCG@10 from 0.888 to 0.947, surpassing every other VDR model tested, including models up to four times larger.

Why Finetuning Matters

General-purpose multimodal embedding models perform reasonably well across diverse tasks such as image-text matching, visual question answering, and document understanding. But generality rarely translates to optimal performance in a specific domain. In VDR, for instance, given a text query like "What was the company's Q3 revenue?", a model must pick the most relevant document page image out of thousands of candidates, a task that requires deep understanding of tables, charts, and document layouts. Finetuning is the primary mechanism for teaching models these domain-specific patterns.
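The retrieval step described above can be pictured as a nearest-neighbor search over embeddings. The following is a minimal sketch, assuming the query and every page image have already been embedded into the same vector space; the page IDs, the three-dimensional toy vectors, and the helper names are hypothetical and stand in for real model output.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_pages(query_embedding, page_embeddings, k=10):
    """Rank pages by similarity to the query and keep the best k."""
    scored = [
        (page_id, cosine(query_embedding, embedding))
        for page_id, embedding in page_embeddings.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy stand-ins for embeddings of document page images (hypothetical values).
pages = {
    "report_p12": [0.9, 0.1, 0.0],
    "report_p03": [0.1, 0.9, 0.1],
    "report_p27": [0.7, 0.3, 0.2],
}
# Toy stand-in for the embedding of a text query.
query = [1.0, 0.0, 0.1]

ranking = top_k_pages(query, pages, k=2)
for page_id, score in ranking:
    print(page_id, round(score, 3))
```

In a real system the exhaustive loop would usually be replaced by a vector index, but the ranking principle is the same: finetuning changes the embeddings so that the right page ends up closer to the query.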

Aarsen demonstrated this with concrete numbers. The finetuned model tomaarsen/Qwen3-VL-Embedding-2B-vdr recorded an NDCG@10 of 0.947 on the evaluation dataset, surpassing the base model (0.888) and every other VDR model tested — including those with four times the parameters.
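For readers unfamiliar with the metric, here is a minimal sketch of how NDCG@10 is commonly computed for a single query. It assumes linear gain and that the relevance list covers every judged page for that query; the evaluation code used in the blog post may differ in such details, and the function names are illustrative.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain: relevance at rank r is divided by log2(r + 1)."""
    return sum(
        rel / math.log2(position + 2)  # position is 0-based, so rank = position + 1
        for position, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the model's ranking divided by the DCG of the ideal ranking."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant page was judged for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# One relevant page, returned first: a perfect score.
print(ndcg_at_k([1, 0, 0]))
# The same page returned second: the score drops to 1 / log2(3).
print(ndcg_at_k([0, 1, 0]))
```

A benchmark score such as 0.947 is the average of this per-query value over all evaluation queries, so it rewards placing the correct page near the top of the list rather than merely somewhere in the top ten.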

What Changed

Item	Before (Text-Only)	This Update (Multimodal)	Change
Supported Modalities	Text	Text, Image, Audio, Video	+3 modalities
Training Pipeline	SentenceTransformerTrainer	Same (SentenceTransformerTrainer)	Consistent API
Dataset Format	Text pairs	Text + Image mixed	Auto image preprocessing
Loss Functions	Various	CachedMultipleNegativesRankingLoss, MatryoshkaLoss	Same options
VDR Evaluation (NDCG@10)	—	0.947 (base: 0.888)	+0.059 (5.9 percentage points; about 6.6% relative)

The core design principle of this update is full compatibility with the existing text-only training pipeline. Developers use the same SentenceTransformerTrainer; adding a new modality is as simple as including images in the dataset, with the model's processor handling preprocessing automatically. Training components remain identical: Model, Dataset, Loss Function, Training Arguments, Evaluator, and Trainer.
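The components listed above could be wired together roughly as follows. This is a sketch rather than the blog post's actual script: the class names are existing Sentence Transformers classes, but the hyperparameters, the Matryoshka dimensions, the output directory, and the assumption that the dataset holds text queries alongside page images are all illustrative. The evaluator is omitted for brevity.

```python
MODEL_NAME = "Qwen/Qwen3-VL-Embedding-2B"  # base model named in the article
MATRYOSHKA_DIMS = [1024, 512, 256, 128]    # hypothetical dimensions, not from the article


def build_trainer(train_dataset, output_dir="models/vdr-finetune"):
    """Assemble model, loss, training arguments, and trainer.

    Assumes `train_dataset` is a Hugging Face `datasets.Dataset` whose columns
    pair a text query with a page image, and that the installed library version
    includes the multimodal support described in the article.
    """
    # Imported lazily so the sketch can be read or imported without the library.
    from sentence_transformers import (
        SentenceTransformer,
        SentenceTransformerTrainer,
        SentenceTransformerTrainingArguments,
    )
    from sentence_transformers.losses import (
        CachedMultipleNegativesRankingLoss,
        MatryoshkaLoss,
    )

    model = SentenceTransformer(MODEL_NAME)

    # In-batch negatives with gradient caching, wrapped so that several
    # truncated embedding sizes are optimized at once.
    base_loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=4)
    loss = MatryoshkaLoss(model, base_loss, matryoshka_dims=MATRYOSHKA_DIMS)

    # All values below are placeholders to tune for the hardware at hand.
    args = SentenceTransformerTrainingArguments(
        output_dir=output_dir,
        num_train_epochs=1,
        per_device_train_batch_size=32,
        learning_rate=2e-5,
        bf16=True,  # requires a GPU with bfloat16 support
    )

    return SentenceTransformerTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        loss=loss,
    )
```

Calling `build_trainer(dataset).train()` would then start finetuning. Nothing in the sketch is specific to images, which is the compatibility point the article makes: the same objects are used for text-only training.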

Supported loss functions include CachedMultipleNegativesRankingLoss for efficient large-batch training and MatryoshkaLoss for simultaneously optimizing multiple embedding dimensions. The Matryoshka approach allows flexible use of embeddings at various dimensions from a single model, enabling trade-off adjustments between storage and retrieval speed.
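The storage trade-off can be illustrated with a small sketch. It assumes a model trained with a Matryoshka-style objective, so that the leading dimensions of each embedding remain useful on their own; the eight-dimensional toy vector and the helper names are hypothetical.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale to unit length for cosine search."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        return head  # nothing to rescale
    return [x / norm for x in head]


def index_size_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for an index of float32 vectors, ignoring index overhead."""
    return num_vectors * dim * bytes_per_value


# Toy "full" embedding; a real one would have hundreds or thousands of values.
full = [0.5, 0.5, 0.5, 0.5, 0.1, -0.1, 0.2, -0.2]
short = truncate_and_normalize(full, 4)
print(short)

# Halving the dimension halves the raw storage for the same number of pages.
print(index_size_bytes(1_000_000, 512), index_size_bytes(1_000_000, 256))
```

Smaller vectors also make each similarity computation cheaper, so the dimension can be chosen per deployment without retraining, at the cost of some retrieval accuracy.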

[Expert Analysis] Reshaping the Multimodal Retrieval Landscape

This update is likely to bring substantive change to the multimodal retrieval ecosystem beyond a simple library feature addition.

First, accessibility. Previously, finetuning multimodal embedding models required complex custom training code. Sentence Transformers' standardized pipeline dramatically lowers that barrier, enabling startups and small research teams to build domain-specialized models from their own data.

Second, the update challenges assumptions about model size versus performance. A 2B-parameter model outperforming 8B-scale models through domain finetuning supports the "specialization over scale" direction. For enterprises where multimodal document retrieval accuracy is a critical bottleneck in Retrieval-Augmented Generation (RAG) pipelines, finetuning a small model is likely to emerge as a viable alternative to deploying a larger one.

Third, VDR application expansion is expected. Industries handling documents rich with layout and visual information — financial reports, legal documents, medical imaging — are likely to accelerate adoption of VDR-based retrieval systems. High-value verticals such as enterprise document search, compliance review, and patent retrieval stand out as prime candidates.

That said, two challenges remain: multimodal model training demands substantial GPU memory and compute resources, and securing high-quality domain training data remains the pivotal variable for performance.


