
NVIDIA Releases Multilingual OCR Model Built with Synthetic Data

Nemotron OCR v2, trained on 12 million synthetic images, cuts non-English recognition error rates by up to 94%

한서진 · 5 min read
Summary
  • NVIDIA released Nemotron OCR v2, trained on 12 million synthetic images across six languages.
  • Non-English NED error rates dropped from 0.56–0.92 to 0.035–0.069, an improvement of up to 94%.
  • The model processes 34.7 pages per second on a single A100 GPU; both the dataset and model are open-source.

NVIDIA Unveils 'Nemotron OCR v2' Multilingual OCR Model

NVIDIA has released Nemotron OCR v2, a multilingual Optical Character Recognition (OCR) model built on synthetic data. Trained on 12 million synthetic images spanning six languages, the model achieves 34.7 pages per second on a single A100 GPU. Normalized Edit Distance (NED) scores for non-English languages improved dramatically from 0.56–0.92 to 0.035–0.069. The dataset is available at nvidia/OCR-Synthetic-Multilingual-v1 and the model at nvidia/nemotron-ocr-v2 on Hugging Face.
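For readers unfamiliar with the metric, Normalized Edit Distance can be sketched in plain Python. This is the standard Levenshtein-based formulation (edit distance divided by the longer string's length, so 0.0 is a perfect match and 1.0 is total disagreement); whether NVIDIA's evaluation uses exactly this normalization is an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    # classic dynamic-programming Levenshtein distance, O(len(a) * len(b))
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def ned(pred: str, ref: str) -> float:
    # normalized edit distance: 0.0 = exact match, 1.0 = nothing in common
    if not pred and not ref:
        return 0.0
    return edit_distance(pred, ref) / max(len(pred), len(ref))
```

Under this definition, the reported drop from 0.56–0.92 to 0.035–0.069 means the surviving character-level error mass shrank by roughly an order of magnitude.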

Why It Matters: Synthetic Data Breaks the OCR Data Bottleneck

The central barrier to OCR model development has always been data. High-quality training requires image-text pairs annotated with precise bounding boxes at the word, line, and paragraph level, along with reading order information. Doing this manually at millions-of-images scale is neither economically nor practically feasible.
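Concretely, the word-, line-, and paragraph-level annotation with reading order described above might look like the following minimal sketch. The field names and nesting are illustrative, not NVIDIA's actual schema; the point is that reading order falls naturally out of ordered lists once every box is known.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Box:
    # axis-aligned bounding box in pixel coordinates
    x: int
    y: int
    w: int
    h: int

@dataclass
class Word:
    text: str
    box: Box

@dataclass
class Line:
    words: List[Word]   # stored in reading order
    box: Box

@dataclass
class Paragraph:
    lines: List[Line]   # stored in reading order
    box: Box

def paragraph_text(p: Paragraph) -> str:
    # reading order is implicit in the list structure
    return "\n".join(" ".join(w.text for w in ln.words) for ln in p.lines)

w1 = Word("Hello", Box(10, 10, 60, 20))
w2 = Word("world", Box(75, 10, 60, 20))
para = Paragraph([Line([w1, w2], Box(10, 10, 125, 20))], Box(10, 10, 125, 20))
```

Producing this structure by hand for millions of pages is the bottleneck the article describes; a synthetic renderer emits it as a free byproduct.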

Existing benchmark datasets like ICDAR and Total-Text offer clean labels but are limited in scale—typically tens of thousands of images skewed toward English and Chinese. Web-scraped PDFs provide volume, but their text layers are often incomplete or contaminated with low-quality OCR outputs.

Synthetic data resolves both limitations simultaneously. By rendering text onto images programmatically, every bounding box, transcription, and reading order relationship is known exactly. The key challenge is realism—sufficient diversity across fonts, colors, backgrounds, layouts, and augmentations is required for the model to generalize to real-world documents.
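The randomization side of such a pipeline can be sketched as a per-sample configuration draw. The font list, value ranges, and augmentation set below are invented for illustration; the idea is simply that every rendered page samples widely across appearance factors so the model cannot overfit to any one style.

```python
import random

# hypothetical font pool; a real pipeline would enumerate installed font files
FONTS = ["NotoSans", "NotoSerifJP", "NanumGothic"]

def sample_render_config(rng: random.Random) -> dict:
    """Sample one synthetic-page rendering configuration.

    Wide randomization over fonts, colors, backgrounds, and degradations
    is what lets a synthetic-only model generalize to real documents.
    """
    return {
        "font": rng.choice(FONTS),
        "font_size": rng.randint(10, 48),
        "text_color": tuple(rng.randint(0, 80) for _ in range(3)),    # darkish text
        "bg_color": tuple(rng.randint(180, 255) for _ in range(3)),   # lightish page
        "rotation_deg": rng.uniform(-3.0, 3.0),
        "gaussian_noise_std": rng.uniform(0.0, 0.05),
        "jpeg_quality": rng.randint(40, 95),
    }

rng = random.Random(0)
cfg = sample_render_config(rng)
```

A renderer then draws the source text under `cfg` and records the exact box of every glyph it places, yielding perfect labels by construction.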

What Changed: v1 vs. v2

| Item | Nemotron OCR v1 | Nemotron OCR v2 | Change |
|---|---|---|---|
| Language support | English-centric | 6 languages (EN, JA, KO, RU, ZH, etc.) | Expanded to multilingual |
| Character set | 855 characters | 14,244 characters | CJK + Cyrillic included |
| Training data | Limited | 12M synthetic images | Large-scale synthetic |
| Non-English NED | 0.56–0.92 | 0.035–0.069 | Up to 94% improvement |
| Throughput | N/A | 34.7 pages/sec (1× A100) | Shared backbone architecture |
| Architecture | Independent modules | Shared backbone for detection, recognition, relational | Redundant compute eliminated |

The shift from v1 to v2 was fundamentally about solving a data problem, not an architecture problem. NVIDIA's team first attempted expanding the character set to 14,244 without corresponding training data—the gains were marginal. The model could theoretically output the right characters but had never learned what they looked like.

Historical Thread: OCR Meets Synthetic Data

Synthetic data use in Document AI gained momentum in the mid-2010s. SynthText (Gupta et al., 2016), from Oxford's Visual Geometry Group, pioneered scene-text synthesis for detection tasks, with the approach later extending to document understanding. NAVER's SynthDoG (2022) introduced a multilingual document-image synthesis pipeline that drew wide attention, though achieving real-world accuracy with synthetic data alone remained difficult at the time.

NVIDIA's release demonstrates that when rendering engine diversity and randomization are sufficiently high, synthetic-only training can produce practically viable multilingual OCR. The explosion of Large Language Models (LLMs) has accelerated this trend—as pipelines feeding extracted document text into LLMs became standard, OCR quality began determining the fate of entire downstream workflows, making multilingual accuracy critical.

[Expert Analysis] Implications and Outlook

Notably, NVIDIA released not just the model but the pipeline itself. The team states the synthetic data pipeline is designed to extend to any language for which fonts and source text exist—a meaningful reduction in barriers for researchers working with lower-resource languages.

On the speed side, 34.7 pages per second on a single A100 is practically viable for enterprise-scale batch document processing. The shared backbone architecture—where detection, recognition, and relational models reuse features—enables this throughput by eliminating redundant computation.
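The shared-backbone idea can be illustrated with a toy sketch. The class and function names here are invented, and the real components are neural networks rather than string-returning stubs; what the sketch shows is the structural point that features are computed once per page and consumed by all three heads.

```python
class SharedBackbone:
    """Stand-in for a visual encoder whose features all heads reuse."""
    def __init__(self):
        self.calls = 0  # track how often the expensive encoder runs

    def features(self, page: str) -> str:
        self.calls += 1
        return f"feat({page})"  # placeholder for a feature map

# three lightweight heads consuming the same features
def detect(feat):
    return ["box1", "box2"]               # text regions

def recognize(feat, boxes):
    return {b: "text" for b in boxes}     # transcription per region

def relate(feat, boxes):
    return list(boxes)                    # reading-order relations

def process(backbone: SharedBackbone, page: str):
    feat = backbone.features(page)        # computed exactly once
    boxes = detect(feat)
    texts = recognize(feat, boxes)
    order = relate(feat, boxes)
    return texts, order

bb = SharedBackbone()
texts, order = process(bb, "page-1")
```

With independent modules, the encoder would run three times per page; sharing it once is the redundancy elimination the table above refers to.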

Limitations remain. Handwriting, heavily degraded historical documents, and specialized domain terminology represent distributions difficult to cover adequately with synthetic data. NED improvements aside, real-world performance on specific business document types warrants further domain-specific evaluation.

Nemotron OCR v2 is likely to see broad adoption in enterprise document processing, RAG (Retrieval-Augmented Generation) preprocessing pipelines, and multilingual digital archive construction. Whether the open-source release catalyzes community-driven expansion to additional languages will be worth watching.

