
NVIDIA Releases Multilingual OCR Model Built with Synthetic Data

Nemotron OCR v2, trained on 12 million synthetic images, cuts non-English recognition error rates by up to 94%

한서진 · 5 min read
Summary
  • NVIDIA released Nemotron OCR v2, trained on 12 million synthetic images across six languages.
  • Non-English NED error rates dropped from 0.56–0.92 to 0.035–0.069, an improvement of up to 94%.
  • The model processes 34.7 pages per second on a single A100 GPU; both the dataset and model are open-source.

NVIDIA Unveils 'Nemotron OCR v2' Multilingual OCR Model

NVIDIA has released Nemotron OCR v2, a multilingual Optical Character Recognition (OCR) model built on synthetic data. Trained on 12 million synthetic images spanning six languages, the model achieves 34.7 pages per second on a single A100 GPU. Normalized Edit Distance (NED) scores for non-English languages improved dramatically from 0.56–0.92 to 0.035–0.069. The dataset is available at nvidia/OCR-Synthetic-Multilingual-v1 and the model at nvidia/nemotron-ocr-v2 on Hugging Face.
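For readers unfamiliar with the metric, Normalized Edit Distance can be sketched in plain Python. This is the standard Levenshtein-based formulation (edit distance divided by the longer string's length, so 0.0 is a perfect match and 1.0 is total disagreement); whether NVIDIA's evaluation uses exactly this normalization is an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    # classic dynamic-programming Levenshtein distance, O(len(a) * len(b))
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def ned(pred: str, ref: str) -> float:
    # normalized edit distance: 0.0 = exact match, 1.0 = nothing in common
    if not pred and not ref:
        return 0.0
    return edit_distance(pred, ref) / max(len(pred), len(ref))
```

Under this definition, the reported drop from 0.56–0.92 to 0.035–0.069 means the surviving character-level error mass shrank by roughly an order of magnitude.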

Why It Matters: Synthetic Data Breaks the OCR Data Bottleneck

The central barrier to OCR model development has always been data. High-quality training requires image-text pairs annotated with precise bounding boxes at the word, line, and paragraph level, along with reading order information. Doing this manually at millions-of-images scale is neither economically nor practically feasible.
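Concretely, the word-, line-, and paragraph-level annotation with reading order described above might look like the following minimal sketch. The field names and nesting are illustrative, not NVIDIA's actual schema; the point is that reading order falls naturally out of ordered lists once every box is known.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Box:
    # axis-aligned bounding box in pixel coordinates
    x: int
    y: int
    w: int
    h: int

@dataclass
class Word:
    text: str
    box: Box

@dataclass
class Line:
    words: List[Word]   # stored in reading order
    box: Box

@dataclass
class Paragraph:
    lines: List[Line]   # stored in reading order
    box: Box

def paragraph_text(p: Paragraph) -> str:
    # reading order is implicit in the list structure
    return "\n".join(" ".join(w.text for w in ln.words) for ln in p.lines)

w1 = Word("Hello", Box(10, 10, 60, 20))
w2 = Word("world", Box(75, 10, 60, 20))
para = Paragraph([Line([w1, w2], Box(10, 10, 125, 20))], Box(10, 10, 125, 20))
```

Producing this structure by hand for millions of pages is the bottleneck the article describes; a synthetic renderer emits it as a free byproduct.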

Existing benchmark datasets like ICDAR and Total-Text offer clean labels but are limited in scale—typically tens of thousands of images skewed toward English and Chinese. Web-scraped PDFs provide volume, but their text layers are often incomplete or contaminated with low-quality OCR outputs.

Synthetic data resolves both limitations simultaneously. By rendering text onto images programmatically, every bounding box, transcription, and reading order relationship is known exactly. The key challenge is realism—sufficient diversity across fonts, colors, backgrounds, layouts, and augmentations is required for the model to generalize to real-world documents.
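The randomization side of such a pipeline can be sketched as a per-sample configuration draw. The font list, value ranges, and augmentation set below are invented for illustration; the idea is simply that every rendered page samples widely across appearance factors so the model cannot overfit to any one style.

```python
import random

# hypothetical font pool; a real pipeline would enumerate installed font files
FONTS = ["NotoSans", "NotoSerifJP", "NanumGothic"]

def sample_render_config(rng: random.Random) -> dict:
    """Sample one synthetic-page rendering configuration.

    Wide randomization over fonts, colors, backgrounds, and degradations
    is what lets a synthetic-only model generalize to real documents.
    """
    return {
        "font": rng.choice(FONTS),
        "font_size": rng.randint(10, 48),
        "text_color": tuple(rng.randint(0, 80) for _ in range(3)),    # darkish text
        "bg_color": tuple(rng.randint(180, 255) for _ in range(3)),   # lightish page
        "rotation_deg": rng.uniform(-3.0, 3.0),
        "gaussian_noise_std": rng.uniform(0.0, 0.05),
        "jpeg_quality": rng.randint(40, 95),
    }

rng = random.Random(0)
cfg = sample_render_config(rng)
```

A renderer then draws the source text under `cfg` and records the exact box of every glyph it places, yielding perfect labels by construction.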

What Changed: v1 vs. v2

| Item | Nemotron OCR v1 | Nemotron OCR v2 | Change |
|---|---|---|---|
| Language support | English-centric | 6 languages (EN, JA, KO, RU, ZH, etc.) | Expanded to multilingual |
| Character set | 855 characters | 14,244 characters | CJK + Cyrillic included |
| Training data | Limited | 12M synthetic images | Large-scale synthetic |
| Non-English NED | 0.56–0.92 | 0.035–0.069 | Up to 94% improvement |
| Throughput | N/A | 34.7 pages/sec (1× A100) | Shared backbone architecture |
| Architecture | Independent modules | Shared backbone for detection, recognition, relational | Redundant compute eliminated |

The shift from v1 to v2 was fundamentally about solving a data problem, not an architecture problem. NVIDIA's team first attempted expanding the character set to 14,244 without corresponding training data—the gains were marginal. The model could theoretically output the right characters but had never learned what they looked like.

Historical Thread: OCR Meets Synthetic Data

Synthetic data use in Document AI gained momentum in the mid-2010s. SynthText (Gupta et al., 2016), from Oxford's Visual Geometry Group, pioneered scene-text synthesis for detection tasks, with the approach later extending to document understanding. NAVER's SynthDoG (2022) introduced a multilingual document-image synthesis pipeline that drew wide attention, though achieving real-world accuracy with synthetic data alone remained difficult at the time.

NVIDIA's release demonstrates that when rendering engine diversity and randomization are sufficiently high, synthetic-only training can produce practically viable multilingual OCR. The explosion of Large Language Models (LLMs) has accelerated this trend—as pipelines feeding extracted document text into LLMs became standard, OCR quality began determining the fate of entire downstream workflows, making multilingual accuracy critical.

[Expert Analysis] Implications and Outlook

Notably, NVIDIA released not just the model but the pipeline itself. The team states the synthetic data pipeline is designed to extend to any language for which fonts and source text exist—a meaningful reduction in barriers for researchers working with lower-resource languages.

On the speed side, 34.7 pages per second on a single A100 is practically viable for enterprise-scale batch document processing. The shared backbone architecture—where detection, recognition, and relational models reuse features—enables this throughput by eliminating redundant computation.
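The shared-backbone idea can be illustrated with a toy sketch. The class and function names here are invented, and the real components are neural networks rather than string-returning stubs; what the sketch shows is the structural point that features are computed once per page and consumed by all three heads.

```python
class SharedBackbone:
    """Stand-in for a visual encoder whose features all heads reuse."""
    def __init__(self):
        self.calls = 0  # track how often the expensive encoder runs

    def features(self, page: str) -> str:
        self.calls += 1
        return f"feat({page})"  # placeholder for a feature map

# three lightweight heads consuming the same features
def detect(feat):
    return ["box1", "box2"]               # text regions

def recognize(feat, boxes):
    return {b: "text" for b in boxes}     # transcription per region

def relate(feat, boxes):
    return list(boxes)                    # reading-order relations

def process(backbone: SharedBackbone, page: str):
    feat = backbone.features(page)        # computed exactly once
    boxes = detect(feat)
    texts = recognize(feat, boxes)
    order = relate(feat, boxes)
    return texts, order

bb = SharedBackbone()
texts, order = process(bb, "page-1")
```

With independent modules, the encoder would run three times per page; sharing it once is the redundancy elimination the table above refers to.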

Limitations remain. Handwriting, heavily degraded historical documents, and specialized domain terminology represent distributions difficult to cover adequately with synthetic data. NED improvements aside, real-world performance on specific business document types warrants further domain-specific evaluation.

Nemotron OCR v2 is likely to see broad adoption in enterprise document processing, RAG (Retrieval-Augmented Generation) preprocessing pipelines, and multilingual digital archive construction. Whether the open-source release catalyzes community-driven expansion to additional languages will be worth watching.

