OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models

Yufeng Zhong*, Lei Chen*, Xuanle Zhao*, Wenkang Han, Liming Zheng, Jing Huang,
Deyang Jiang, Yilin Cao, Lin Ma $^{\dagger}$, Zhixiong Zeng $^{\dagger}$

Meituan

*Equal Contribution, $^{\dagger}$ Corresponding Authors


The development of large vision-language models drives the demand for managing and applying massive amounts of multimodal data, making OCR technology, which extracts information from visual images, increasingly important. However, existing OCR methods primarily focus on recognizing text elements from images or scanned documents (Text-centric OCR), neglecting the identification of visual elements from visually information-dense image sources (Vision-centric OCR), such as charts, web pages, and scientific plots. In reality, these visually information-dense images are widespread on the internet and have significant real-world application value, for example in data visualization and web page analysis. In this technical report, we propose OCRVerse, the first holistic OCR method that unifies text-centric and vision-centric OCR in an end-to-end manner. To this end, we construct a comprehensive data engineering pipeline covering a wide range of text-centric documents, such as newspapers, magazines, and books, as well as vision-centric rendered composites, including charts, web pages, and scientific plots. Moreover, we propose a two-stage SFT-RL multi-domain training method for OCRVerse. The SFT stage directly mixes cross-domain data to establish initial domain knowledge, while the RL stage designs personalized reward strategies for the characteristics of each domain. Since different domains require different output formats and expected outputs, the RL stage provides the flexibility to customize reward signals per domain, thereby improving cross-domain fusion and avoiding data conflicts. Experimental results demonstrate the effectiveness of OCRVerse, which achieves competitive results across text-centric and vision-centric data types, even comparable to large-scale open-source and closed-source models.

📢 News and Updates

  • [Jan 30, 2026] 📄 Preprint released on arXiv.
  • [Jan 30, 2026] 🤗 We released the OCRVerse model weights on Hugging Face.
  • [Nov 3, 2025] 🤗 We released the OCRVerse-code model weights on Hugging Face.
  • [Oct 27, 2025] 🤗 We released the OCRVerse-text model weights on Hugging Face.

💡 Benchmarking

Performance comparison of OCRVerse on text-centric OCR tasks (top row) and vision-centric OCR tasks (bottom row). Since existing OCR methods primarily focus on text-centric scenarios, we compare against both specialized OCR models and general-purpose models for text-centric benchmarks, while comparing only against general-purpose models for vision-centric benchmarks.

📚 Dataset Sources

OCRVerse encompasses both text-centric and vision-centric data types, comprehensively supporting the data requirements of holistic OCR. The text-centric data types cover nine document scenarios: natural scenes, books, magazines, papers, reports, slides, exam papers, notes, and newspapers, which encompass high-frequency text scenarios in daily life and meet essential OCR needs. The vision-centric data types comprise six specialized scenarios: charts, webpages, icons, geometry, circuits, and molecules, which focus on professional structured content and address gaps not covered by text-centric categories.

📥 Data Processing

Our training dataset is constructed through a systematic multi-stage pipeline that integrates both text-centric and vision-centric data sources to ensure comprehensive coverage and high quality. The data processing workflow encompasses data collection, cleaning, and annotation generation (including self-annotation).

🤖 Pipeline

We propose a two-stage SFT-RL multi-domain training method. In the SFT stage, we aim to establish foundational cross-domain knowledge by directly mixing data from all domains. This approach enables the model to learn diverse visual patterns and output formats across both text-centric and vision-centric scenarios, thereby building a unified representation space. In the RL stage, we aim to resolve domain-specific conflicts and optimize personalized performance for each domain. Since different domains require various output formats and quality expectations, we design customized reward strategies tailored to the characteristics of each domain, such as structural accuracy for tables and semantic fidelity for charts. This flexible reward mechanism effectively improves cross-domain fusion while avoiding data conflicts that typically arise from naive multi-task learning.
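To make the reward design concrete, below is a minimal, hypothetical sketch of per-domain reward routing. The difflib-based scorers merely stand in for the actual domain metrics described above (e.g. edit distance for documents, TEDS-style structural scores for tables, render-and-compare checks for charts) and are not the released training code.

from difflib import SequenceMatcher

# Hypothetical sketch: route each rollout to a domain-specific reward so that
# every domain keeps its own output format and quality criterion.
# The difflib-based scorers are placeholders for the real metrics.

def document_reward(pred: str, ref: str) -> float:
    # Stand-in for (1 - normalized edit distance) on the Markdown output.
    return SequenceMatcher(None, pred, ref).ratio()

def table_reward(pred_html: str, ref_html: str) -> float:
    # Stand-in for a structural-accuracy score such as TEDS on the HTML table.
    return SequenceMatcher(None, pred_html, ref_html).ratio()

def chart_reward(pred_code: str, ref_code: str) -> float:
    # Stand-in for executing the predicted code and comparing the rendered figure.
    return SequenceMatcher(None, pred_code, ref_code).ratio()

REWARD_FNS = {"document": document_reward, "table": table_reward, "chart": chart_reward}

def compute_reward(domain: str, prediction: str, reference: str) -> float:
    # Dispatch to the reward function registered for this sample's domain.
    return REWARD_FNS[domain](prediction, reference)

print(compute_reward("table", "<table><tr><td>1</td></tr></table>",
                     "<table><tr><td>1</td></tr></table>"))  # 1.0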

📊 Performance

Text-Centric Evaluation

Benchmark

We adopt OmniDocBench v1.5 as the primary testbed to comprehensively evaluate the document parsing capabilities of OCRVerse. This benchmark serves as a rigorous standard to validate model robustness and generalization across real-world applications. OmniDocBench v1.5 is an expanded dataset comprising 1,355 document pages, enriched with 374 additional pages compared to the previous version. It features a balanced distribution of bilingual content in both Chinese and English and covers nine diverse document types—including academic papers, textbooks, financial reports, and exam papers. By incorporating varied layout structures (from single- to multi-column) and rich elements such as text, mathematical formulas, and structured tables, the benchmark enables a thorough assessment of OCRVerse in handling complex parsing scenarios.

Metrics

End-to-end evaluation assesses the model's accuracy in parsing PDF page content, using the model's Markdown output for the entire page as the prediction. The Overall metric is calculated as:

$$ \text{Overall} = \frac{(1-\text{Text Edit Distance}) \times 100 + \text{Table TEDS} +\text{Formula CDM}}{3} $$
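For reference, a minimal sketch of this combination, assuming the text edit distance lies in [0, 1] and Table TEDS / Formula CDM are reported on a 0-100 scale as in the table below:

def overall_score(text_edit_distance: float, table_teds: float, formula_cdm: float) -> float:
    # Combine the three component metrics into the Overall score.
    return ((1 - text_edit_distance) * 100 + table_teds + formula_cdm) / 3

# Plugging in OCRVerse's reported numbers reproduces its Overall value:
print(round(overall_score(0.052, 85.77, 87.13), 2))  # 89.23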

Results

| Model Type | Methods | Release Date | Parameters | Overall↑ | Text Edit↓ | Formula CDM↑ | Table TEDS↑ | Table TEDS-S↑ | Reading Order Edit↓ |
|---|---|---|---|---|---|---|---|---|---|
| Pipeline Tools | Marker-1.8.2 | 2025 | - | 71.30 | 0.206 | 76.66 | 57.88 | 71.17 | 0.250 |
| | Mineru2-pipeline | 2025 | - | 75.51 | 0.209 | 76.55 | 70.90 | 79.11 | 0.225 |
| | PP-StructureV3 | 2024 | - | 86.73 | 0.073 | 85.79 | 81.68 | 89.48 | 0.073 |
| General VLMs | GPT-4o | 2024 | - | 75.02 | 0.217 | 79.70 | 67.07 | 76.09 | 0.148 |
| | InternVL3-76B | 2025 | 76B | 80.33 | 0.131 | 83.42 | 70.64 | 77.74 | 0.113 |
| | InternVL3.5-241B | 2025 | 241B | 82.67 | 0.142 | 87.23 | 75.00 | 81.28 | 0.125 |
| | Qwen2.5-VL-72B | 2025 | 72B | 87.02 | 0.094 | 88.27 | 82.15 | 86.22 | 0.102 |
| | Gemini-2.5 Pro | 2025 | - | 88.03 | 0.075 | 85.82 | 85.71 | 90.29 | 0.097 |
| Specialized VLMs | Dolphin | 2025.05 | 322M | 74.67 | 0.125 | 67.85 | 68.70 | 77.77 | 0.124 |
| | MinerU2-VLM | 2025.06 | 0.9B | 85.56 | 0.078 | 80.95 | 83.54 | 87.66 | 0.086 |
| | MonkeyOCR-pro-1.2B | 2025.07 | 1.9B | 86.96 | 0.084 | 85.02 | 84.24 | 89.02 | 0.130 |
| | MonkeyOCR-3B | 2025.06 | 3.7B | 87.13 | 0.075 | 87.45 | 81.39 | 85.92 | 0.129 |
| | MonkeyOCR-pro-3B | 2025.07 | 3.7B | 88.85 | 0.075 | 87.25 | 86.78 | 90.63 | 0.128 |
| | MinerU2.5 | 2025.09 | 1.2B | 90.67 | 0.047 | 88.46 | 88.22 | 92.38 | 0.044 |
| | PaddleOCR-VL | 2025.10 | 0.9B | 92.56 | 0.035 | 91.43 | 89.76 | 93.52 | 0.043 |
| | OCRFlux-3B | 2025.06 | 3B | 74.82 | 0.193 | 68.03 | 75.75 | 80.23 | 0.202 |
| | Mistral OCR | 2025.03 | - | 78.83 | 0.164 | 82.84 | 70.03 | 78.04 | 0.144 |
| | POINTS-Reader | 2025.08 | 3B | 80.98 | 0.134 | 79.20 | 77.13 | 81.66 | 0.145 |
| | olmOCR-7B | 2025.02 | 7B | 81.79 | 0.096 | 86.04 | 68.92 | 74.77 | 0.121 |
| | Nanonets-OCR-s | 2025.06 | 3B | 85.59 | 0.093 | 85.90 | 80.14 | 85.57 | 0.108 |
| | Deepseek-OCR | 2025.10 | 3B | 87.01 | 0.073 | 83.37 | 84.97 | 88.80 | 0.086 |
| | dots.ocr | 2025.07 | 3B | 88.41 | 0.048 | 83.22 | 86.78 | 90.62 | 0.053 |
| | FD-RL | 2025.11 | 4B | 90.41 | 0.049 | 88.67 | 87.35 | 92.10 | 0.055 |
| | HunyuanOCR | 2025.11 | 4B | 94.10 | 0.042 | 94.73 | 91.81 | - | - |
| | Deepseek-OCR 2 | 2026.01 | 1B | 91.09 | 0.048 | 90.31 | 87.75 | 92.06 | 0.057 |
| | OCRVerse (Ours) | 2026.01 | 4B | 89.23 | 0.052 | 87.13 | 85.77 | 90.35 | 0.068 |

Vision-Centric Evaluation

Benchmarks

To comprehensively evaluate the vision-centric OCR capabilities of our proposed OCRVerse, we conducted extensive experiments across five diverse structured vision-centric benchmarks. These benchmarks cover a wide spectrum of visual domains, assessing the model's ability to translate visually dense information into executable code or structured representations. Specifically, the evaluation tasks include: (1) ChartMimic for direct chart-to-code generation; (2) Design2Code, which evaluates the precision of reproducing web layouts in the web-to-HTML task; (3) UniSVG, assessing the generation of scalable vector graphics in the image-to-SVG task; (4) Image2Struct, testing the conversion of scientific documents and formulas in the image-to-LaTeX task; (5) ChemDraw, focusing on the recognition of chemical structures in the molecule-to-code task.

Metrics

For all benchmarks, we strictly adhere to the evaluation protocols officially defined in their respective GitHub repositories or original papers. For ChartMimic, we report the code execution success rate and the average of low-level metrics (Text, Layout, Type, and Color Scores). For high-level evaluation, we employ the GPT-4o Score. Regarding UniSVG, we present the low-level score, computed as the average of SSIM and (1 - LPIPS), alongside the high-level CLIP similarity. For Design2Code, we report both the CLIP similarity (high-level) and the element-matching scores (low-level) proposed by the benchmark authors. For Image2Struct, we evaluate using the earth mover's similarity (EMS) and the rendering success rate. Finally, for ChemDraw, we report the code execution success rate and the Tanimoto similarity.
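As a small illustration of how the UniSVG low-level score is aggregated (the SSIM and LPIPS values themselves would come from standard implementations such as scikit-image and the lpips package; the helper below only shows the averaging step, and the example values are made up):

def unisvg_low_level(ssim: float, lpips: float) -> float:
    # Average of SSIM and (1 - LPIPS); both inputs are assumed to lie in [0, 1].
    return (ssim + (1.0 - lpips)) / 2.0

print(unisvg_low_level(ssim=0.80, lpips=0.25))  # 0.775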

Results

(ChartMimic = ChartMimic_direct_v2, UniSVG = UniSVG-ISVGEN, Image2LaTeX = Image2Latex_plot.)

| Model | Parameters | ChartMimic Exec.Rate | ChartMimic Low-Level | ChartMimic High-Level | UniSVG Low-Level | UniSVG High-Level | UniSVG Score | Design2Code Low-Level | Design2Code High-Level | Image2LaTeX Ren.Succ. | Image2LaTeX EMS | ChemDraw Exec.Rate | ChemDraw Tani.Sim. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Closed-Source Models* | | | | | | | | | | | | | |
| Gemini-2.5-Pro | - | 97.3 | 88.7 | 83.8 | 53.6 | 80.3 | 69.6 | 90.8 | 91.4 | 74.3 | 52.5 | 77.3 | 2.8 |
| Claude-4.5-Sonnet | - | 97.8 | 89.6 | 82.9 | 61.0 | 83.4 | 74.6 | 90.4 | 90.8 | 72.7 | 50.2 | 95.3 | 41.7 |
| GPT-5 | - | 94.8 | 81.9 | 78.3 | 60.8 | 88.3 | 77.3 | 90.6 | 91.0 | 78.7 | 57.4 | 93.8 | 52.1 |
| *Open-Source Models* | | | | | | | | | | | | | |
| Qwen2.5-VL-7B | 7B | 68.7 | 42.2 | 40.1 | 47.5 | 73.8 | 63.3 | 83.4 | 87.6 | 42.7 | 25.5 | 21.1 | 11.7 |
| InternVL3-8B | 8B | 63.3 | 43.8 | 46.1 | 54.5 | 77.4 | 68.2 | 85.3 | 87.6 | 57.7 | 38.6 | 42.2 | 6.2 |
| Qwen3-VL-8B | 8B | 78.3 | 62.5 | 67.8 | 53.0 | 77.0 | 67.4 | 85.5 | 87.2 | 47.7 | 33.0 | 78.9 | 41.2 |
| InternVL3.5-8B | 8B | 66.7 | 46.0 | 48.3 | 55.0 | 78.0 | 68.6 | 85.8 | 87.3 | 58.3 | 40.5 | 49.2 | 7.8 |
| InternVL3-14B | 14B | 72.3 | 51.3 | 54.1 | 51.4 | 75.5 | 65.8 | 85.8 | 87.5 | 73.3 | 52.2 | 71.1 | 40.2 |
| InternVL3.5-14B | 14B | 73.2 | 52.8 | 55.4 | 52.0 | 75.0 | 65.9 | 86.1 | 87.8 | 73.0 | 50.2 | 71.9 | 39.3 |
| Qwen3-VL-32B | 32B | 83.0 | 66.9 | 77.5 | 68.0 | 86.0 | 78.8 | 88.6 | 89.8 | 75.7 | 53.3 | 37.5 | 48.8 |
| InternVL3.5-38B | 38B | 79.0 | 60.0 | 71.8 | 51.9 | 77.3 | 67.1 | 87.8 | 88.4 | 72.6 | 49.5 | 55.5 | 31.4 |
| Qwen2.5-VL-72B | 72B | 88.5 | 72.7 | 79.1 | 47.7 | 76.0 | 64.7 | 86.9 | 88.7 | 62.0 | 41.7 | 75.8 | 28.0 |
| OCRVerse (Ours) | 4B | 84.8 | 72.2 | 75.4 | 63.2 | 85.2 | 76.3 | 85.7 | 87.4 | 88.7 | 63.1 | 89.1 | 54.7 |

🔍 Usage Example

Inference

Text-Centric Task

Below is a simple example of how to use OCRVerse for document parsing tasks.

Please first install transformers using the following command:

pip install "transformers>=4.57.0"

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
import torch

# Load model
model_path = 'DocTron/OCRVerse'
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    dtype="auto", 
    device_map="cuda",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Prepare input with image and text
image_path = "./assets/text_centric_test.jpg"
# We recommend using the following prompt for better performance, since it is used throughout the training process.
prompt = "Extract the main content from the document in the image, keeping the original structure. Convert all formulas to LaTeX and all tables to HTML."

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ]
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages, 
    tokenize=True, 
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=8192, do_sample=False)

generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])

# $$
# r = \frac{\alpha}{\beta} \sin \beta (\sigma_1 \pm \sigma_2)
# $$

Vision-Centric Task

Below is a simple example of how to use OCRVerse for vision-centric tasks. We also recommend utilizing SGLang for inference.

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
import torch

# Load model
model_path = 'DocTron/OCRVerse'
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    dtype="auto", 
    device_map="cuda",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Prepare input with image and text
image_path = "./assets/vision_centric_test.png"
prompt = "You are an expert Python developer who specializes in writing matplotlib code based on a given picture. I found a very nice picture in a STEM paper, but there is no corresponding source code available. I need your help to generate the Python code that can reproduce the picture based on the picture I provide.\nNote that it is necessary to use figsize=(7.0, 5.0) to set the image size to match the original size.\nNow, please give me the matplotlib code that reproduces the picture below."

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ]
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages, 
    tokenize=True, 
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=4096, do_sample=False)

generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])

Example script for launching an SGLang server:

CUDA_VISIBLE_DEVICES=0,1,2,3 \
python -m sglang.launch_server \
--model-path DocTron/OCRVerse-code \
--host 0.0.0.0 \
--dist-init-addr 127.0.0.1:10002 \
--tp 4 \
--port 6002
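
Once the server is running, it can be queried through SGLang's OpenAI-compatible API. Below is a minimal client sketch; the host, port, model name, and image path are assumptions chosen to match the launch script above:

import base64
from openai import OpenAI

# Query the SGLang server started above via its OpenAI-compatible endpoint.
client = OpenAI(base_url="http://127.0.0.1:6002/v1", api_key="EMPTY")

with open("./assets/vision_centric_test.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="DocTron/OCRVerse-code",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": "Please give me the matplotlib code that reproduces the picture."},
        ],
    }],
    max_tokens=4096,
    temperature=0.0,
)
print(response.choices[0].message.content)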

Fine-tuning

If you want to continue training based on our model, you can use LLaMA-Factory. For installation and usage of LLaMA-Factory, please refer to its official documentation. A reference fine-tuning script with pre-specified parameters is provided below:

PROJECT_DIR=/path/to/llama_factory
cd ${PROJECT_DIR}

# Set parameters
GPUS_PER_NODE=8                  # Number of GPUs per node
NNODES=1                         # Total number of nodes
NODE_RANK=0                      # Rank of the current node (starts from 0)
MASTER_ADDR=localhost            # IP address of the master node
MASTER_PORT=12345                # Port for communication between nodes

MODEL_DIR=/path/to/ocrverse_text_model  # Path to the pre-trained OCRVerse model
DATA=/name/of/your/dataset               # Name/path of your custom dataset
OUTPUT_DIR=/path/to/output              # Directory to save fine-tuned results

# Llama Factory-based fine-tuning script
torchrun --nproc_per_node="${GPUS_PER_NODE}" --nnodes="${NNODES}" --node_rank="${NODE_RANK}" --master_addr="${MASTER_ADDR}" --master_port="${MASTER_PORT}" \
    src/train.py \
    --model_name_or_path "$MODEL_DIR" \
    --stage sft \
    --do_train True \
    --finetuning_type full \
    --dataset "$DATA" \
    --template qwen3_vl_nothink \
    --cutoff_len 8192 \
    --preprocessing_num_workers 128 \
    --preprocessing_batch_size 256 \
    --dataloader_num_workers 128 \
    --output_dir "$OUTPUT_DIR" \
    --logging_steps 1 \
    --save_steps 5000 \
    --plot_loss True \
    --save_only_model False \
    --report_to none \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --learning_rate 1e-5 \
    --num_train_epochs 1 \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.1 \
    --bf16 True

📌 Acknowledgement

We sincerely appreciate LLaMA-Factory and EasyR1 for providing the reference training frameworks.

📖 Citation

If you find this project useful, please feel free to leave a star and cite our paper:

@article{zhong2026ocrverse,
  title={OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models},
  author={Zhong, Yufeng and Chen, Lei and Zhao, Xuanle and Han, Wenkang and Zheng, Liming and Huang, Jing and Jiang, Deyang and Cao, Yilin and Ma, Lin and Zeng, Zhixiong},
  journal={arXiv preprint arXiv:2601.21639},
  year={2026}
}
