Yufeng Zhong*, Lei Chen*, Xuanle Zhao*, Wenkang Han, Liming Zheng, Jing Huang,
Deyang Jiang, Yilin Cao, Lin Ma
Meituan
*Equal Contribution
The development of large vision-language models drives the demand for managing and applying massive amounts of multimodal data, making OCR technology, which extracts information from visual images, increasingly popular. However, existing OCR methods primarily focus on recognizing text elements from images or scanned documents (text-centric OCR), neglecting the identification of visual elements in visually information-dense image sources (vision-centric OCR), such as charts, web pages, and scientific plots. In reality, these visually information-dense images are widespread on the internet and carry significant real-world application value, for example in data visualization and web page analysis. In this technical report, we propose OCRVerse, the first holistic OCR model that unifies text-centric and vision-centric OCR in an end-to-end manner. To this end, we construct a comprehensive data engineering pipeline covering a wide range of text-centric documents, such as newspapers, magazines, and books, as well as vision-centric rendered composites, including charts, web pages, and scientific plots. Moreover, we propose a two-stage SFT-RL multi-domain training method for OCRVerse: SFT directly mixes cross-domain data to establish initial domain knowledge, while RL designs personalized reward strategies for the characteristics of each domain. Specifically, since different domains require different output formats and expected outputs, the RL stage provides sufficient flexibility to customize reward signals for each domain, thereby improving cross-domain fusion and avoiding data conflicts. Experimental results demonstrate the effectiveness of OCRVerse, which achieves competitive results across text-centric and vision-centric data types, comparable even to large-scale open-source and closed-source models.
- [Jan 30, 2026] 📄 Preprint released on arXiv.
- [Jan 30, 2026] 🤗 We upload the OCRVerse model weights to HuggingFace.
- [Nov 3, 2025] 🤗 We upload the OCRVerse-code model weights to HuggingFace.
- [Oct 27, 2025] 🤗 We upload the OCRVerse-text model weights to HuggingFace.
Performance comparison of OCRVerse on text-centric OCR tasks (top row) and vision-centric OCR tasks (bottom row). Since existing OCR methods primarily focus on text-centric scenarios, we compare against both specialized OCR models and general-purpose models for text-centric benchmarks, while comparing only against general-purpose models for vision-centric benchmarks.
OCRVerse encompasses both text-centric and vision-centric data types, comprehensively supporting the data requirements of holistic OCR. The text-centric data types cover nine document scenarios: natural scenes, books, magazines, papers, reports, slides, exam papers, notes, and newspapers, which encompass high-frequency text scenarios in daily life and meet essential OCR needs. The vision-centric data types comprise six specialized scenarios: charts, webpages, icons, geometry, circuits, and molecules, which focus on professional structured content and address gaps not covered by text-centric categories.
Our training dataset is constructed through a systematic multi-stage pipeline that integrates both text-centric and vision-centric data sources to ensure comprehensive coverage and high quality. The data processing workflow encompasses data collection, cleaning, and annotation generation (including self-annotation).
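As a toy illustration of rendering-based self-annotation for vision-centric data (the actual OCRVerse pipeline is considerably more involved), a chart training pair can be produced by treating plotting code as the label and its rendered output as the input image; the file names below are hypothetical:

import matplotlib
matplotlib.use("Agg")  # headless rendering

# Hypothetical sample: the code string is the supervision target and its rendering is the input image.
chart_code = """
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(4, 3))
ax.bar(["A", "B", "C"], [3, 7, 5], color="steelblue")
ax.set_title("Synthetic bar chart")
fig.savefig("chart_sample.png", dpi=150)
"""

exec(chart_code)  # renders chart_sample.png
sample = {"image": "chart_sample.png", "target_code": chart_code.strip()}
print(sample["image"])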
We propose a two-stage SFT-RL multi-domain training method. In the SFT stage, we aim to establish foundational cross-domain knowledge by directly mixing data from all domains. This approach enables the model to learn diverse visual patterns and output formats across both text-centric and vision-centric scenarios, thereby building a unified representation space. In the RL stage, we aim to resolve domain-specific conflicts and optimize personalized performance for each domain. Since different domains require various output formats and quality expectations, we design customized reward strategies tailored to the characteristics of each domain, such as structural accuracy for tables and semantic fidelity for charts. This flexible reward mechanism effectively improves cross-domain fusion while avoiding data conflicts that typically arise from naive multi-task learning.
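The snippet below is a minimal sketch (not the actual OCRVerse implementation) of how per-domain reward routing in the RL stage could look; the domain names and reward functions are hypothetical placeholders:

from typing import Callable, Dict

def text_reward(pred: str, ref: str) -> float:
    # Placeholder: e.g., 1 minus a normalized edit distance against the reference Markdown.
    return float(pred.strip() == ref.strip())

def table_reward(pred: str, ref: str) -> float:
    # Placeholder: e.g., a TEDS-style structural similarity over HTML tables.
    return float(pred.strip() == ref.strip())

def chart_reward(pred: str, ref: str) -> float:
    # Placeholder: e.g., execute the predicted plotting code, render it, and compare with the reference image.
    return 0.0

# Each domain keeps its own reward signal, so different output formats do not compete with each other.
REWARD_FNS: Dict[str, Callable[[str, str], float]] = {
    "text": text_reward,
    "table": table_reward,
    "chart": chart_reward,
}

def compute_reward(domain: str, prediction: str, reference: str) -> float:
    return REWARD_FNS[domain](prediction, reference)

print(compute_reward("text", "hello", "hello"))  # 1.0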
We adopt OmniDocBench v1.5 as the primary testbed to comprehensively evaluate the document parsing capabilities of OCRVerse. This benchmark serves as a rigorous standard to validate model robustness and generalization across real-world applications. OmniDocBench v1.5 is an expanded dataset comprising 1,355 document pages, enriched with 374 additional pages compared to the previous version. It features a balanced distribution of bilingual content in both Chinese and English and covers nine diverse document types—including academic papers, textbooks, financial reports, and exam papers. By incorporating varied layout structures (from single- to multi-column) and rich elements such as text, mathematical formulas, and structured tables, the benchmark enables a thorough assessment of OCRVerse in handling complex parsing scenarios.
End-to-end evaluation assesses the model's accuracy in parsing PDF page content. The evaluation uses the model's Markdown output for the entire PDF page as the prediction. The Overall metric is computed by aggregating the per-element scores for text, formulas, tables, and reading order reported below.
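As a rough reference for the Edit↓ columns, the text and reading-order scores are based on length-normalized edit distances; below is a minimal sketch of such a metric, with the caveat that OmniDocBench's official implementation may differ in normalization and preprocessing:

def normalized_edit_distance(pred: str, ref: str) -> float:
    # Levenshtein distance divided by the length of the longer string (lower is better).
    m, n = len(pred), len(ref)
    if max(m, n) == 0:
        return 0.0
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == ref[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return prev[n] / max(m, n)

print(normalized_edit_distance("Holistic OCR", "Holstic OCR"))  # ≈ 0.083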
| Model Type | Methods | Release Date | End-to-End | Parameters | Overall↑ | Text Edit↓ | Formula CDM↑ | Table TEDS↑ | Table TEDS-S↑ | Reading Order Edit↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| Pipeline Tools | Marker-1.8.2 | 2025 | ❌ | - | 71.30 | 0.206 | 76.66 | 57.88 | 71.17 | 0.250 |
| | Mineru2-pipeline | 2025 | ❌ | - | 75.51 | 0.209 | 76.55 | 70.90 | 79.11 | 0.225 |
| | PP-StructureV3 | 2024 | ❌ | - | 86.73 | 0.073 | 85.79 | 81.68 | 89.48 | 0.073 |
| General VLMs | GPT-4o | 2024 | ✅ | - | 75.02 | 0.217 | 79.70 | 67.07 | 76.09 | 0.148 |
| | InternVL3-76B | 2025 | ✅ | 76B | 80.33 | 0.131 | 83.42 | 70.64 | 77.74 | 0.113 |
| | InternVL3.5-241B | 2025 | ✅ | 241B | 82.67 | 0.142 | 87.23 | 75.00 | 81.28 | 0.125 |
| | Qwen2.5-VL-72B | 2025 | ✅ | 72B | 87.02 | 0.094 | 88.27 | 82.15 | 86.22 | 0.102 |
| | Gemini-2.5 Pro | 2025 | ✅ | - | 88.03 | 0.075 | 85.82 | 85.71 | 90.29 | 0.097 |
| Specialized VLMs | Dolphin | 2025.05 | ❌ | 322M | 74.67 | 0.125 | 67.85 | 68.70 | 77.77 | 0.124 |
| | MinerU2-VLM | 2025.06 | ❌ | 0.9B | 85.56 | 0.078 | 80.95 | 83.54 | 87.66 | 0.086 |
| | MonkeyOCR-pro-1.2B | 2025.07 | ❌ | 1.9B | 86.96 | 0.084 | 85.02 | 84.24 | 89.02 | 0.130 |
| | MonkeyOCR-3B | 2025.06 | ❌ | 3.7B | 87.13 | 0.075 | 87.45 | 81.39 | 85.92 | 0.129 |
| | MonkeyOCR-pro-3B | 2025.07 | ❌ | 3.7B | 88.85 | 0.075 | 87.25 | 86.78 | 90.63 | 0.128 |
| | MinerU2.5 | 2025.09 | ❌ | 1.2B | 90.67 | 0.047 | 88.46 | 88.22 | 92.38 | 0.044 |
| | PaddleOCR-VL | 2025.10 | ❌ | 0.9B | 92.56 | 0.035 | 91.43 | 89.76 | 93.52 | 0.043 |
| | OCRFlux-3B | 2025.06 | ✅ | 3B | 74.82 | 0.193 | 68.03 | 75.75 | 80.23 | 0.202 |
| | Mistral OCR | 2025.03 | ✅ | - | 78.83 | 0.164 | 82.84 | 70.03 | 78.04 | 0.144 |
| | POINTS-Reader | 2025.08 | ✅ | 3B | 80.98 | 0.134 | 79.20 | 77.13 | 81.66 | 0.145 |
| | olmOCR-7B | 2025.02 | ✅ | 7B | 81.79 | 0.096 | 86.04 | 68.92 | 74.77 | 0.121 |
| | Nanonets-OCR-s | 2025.06 | ✅ | 3B | 85.59 | 0.093 | 85.90 | 80.14 | 85.57 | 0.108 |
| | Deepseek-OCR | 2025.10 | ✅ | 3B | 87.01 | 0.073 | 83.37 | 84.97 | 88.80 | 0.086 |
| | dots.ocr | 2025.07 | ✅ | 3B | 88.41 | 0.048 | 83.22 | 86.78 | 90.62 | 0.053 |
| | FD-RL | 2025.11 | ✅ | 4B | 90.41 | 0.049 | 88.67 | 87.35 | 92.10 | 0.055 |
| | HunyuanOCR | 2025.11 | ✅ | 4B | 94.10 | 0.042 | 94.73 | 91.81 | - | - |
| | Deepseek-OCR 2 | 2026.01 | ✅ | 1B | 91.09 | 0.048 | 90.31 | 87.75 | 92.06 | 0.057 |
| | OCRVerse (Ours) | 2026.01 | ✅ | 4B | 89.23 | 0.052 | 87.13 | 85.77 | 90.35 | 0.068 |
To comprehensively evaluate the vision-centric OCR capabilities of our proposed OCRVerse, we conducted extensive experiments across five diverse structured vision-centric benchmarks. These benchmarks cover a wide spectrum of visual domains, assessing the model's ability to translate visually dense information into executable code or structured representations. Specifically, the evaluation tasks include: (1) ChartMimic for direct chart-to-code generation; (2) Design2Code, which evaluates the precision of reproducing web layouts in the web-to-HTML task; (3) UniSVG, assessing the generation of scalable vector graphics in the image-to-SVG task; (4) Image2Struct, testing the conversion of scientific documents and formulas in the image-to-LaTeX task; (5) ChemDraw, focusing on the recognition of chemical structures in the molecule-to-code task.
For all benchmarks, we strictly adhere to the evaluation protocols officially defined in their respective GitHub repositories or original papers. For ChartMimic, we report the code execution success rate and the average of low-level metrics (Text, Layout, Type, and Color Scores). For high-level evaluation, we employ the GPT-4o Score. Regarding UniSVG, we present the low-level score, computed as the average of SSIM and (1 - LPIPS), alongside the high-level CLIP similarity. For Design2Code, we report both the CLIP similarity (high-level) and the element-matching scores (low-level) proposed by the benchmark authors. For Image2Struct, we evaluate using the earth mover's similarity (EMS) and the rendering success rate. Finally, for ChemDraw, we report the code execution success rate and the Tanimoto similarity.
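For illustration, the UniSVG low-level score described above can be approximated as follows; the image loading, resizing, and LPIPS backbone used here are our own assumptions, and the benchmark's official preprocessing may differ:

import numpy as np
import torch
import lpips  # pip install lpips
from PIL import Image
from skimage.metrics import structural_similarity as ssim

lpips_fn = lpips.LPIPS(net="alex")  # perceptual distance, lower means more similar

def unisvg_low_level(pred_path: str, ref_path: str) -> float:
    # Average of SSIM and (1 - LPIPS) between the rendered prediction and the reference image.
    pred = Image.open(pred_path).convert("RGB")
    ref = Image.open(ref_path).convert("RGB").resize(pred.size)
    pred_np, ref_np = np.array(pred), np.array(ref)
    s = ssim(pred_np, ref_np, channel_axis=-1, data_range=255)
    # LPIPS expects NCHW float tensors scaled to [-1, 1]
    def to_tensor(a):
        return torch.from_numpy(a).permute(2, 0, 1).unsqueeze(0).float() / 127.5 - 1.0
    with torch.no_grad():
        d = lpips_fn(to_tensor(pred_np), to_tensor(ref_np)).item()
    return 0.5 * (s + (1.0 - d))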
| Model | Parameters | ChartMimic_direct_v2 Exec. Rate | ChartMimic_direct_v2 Low-Level | ChartMimic_direct_v2 High-Level | UniSVG-ISVGEN Low-Level | UniSVG-ISVGEN High-Level | UniSVG-ISVGEN Score | Design2Code Low-Level | Design2Code High-Level | Image2Latex_plot Ren. Succ. | Image2Latex_plot EMS | ChemDraw Exec. Rate | ChemDraw Tani. Sim. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Closed-Source Models | | | | | | | | | | | | | |
| Gemini-2.5-Pro | - | 97.3 | 88.7 | 83.8 | 53.6 | 80.3 | 69.6 | 90.8 | 91.4 | 74.3 | 52.5 | 77.3 | 2.8 |
| Claude-4.5-Sonnet | - | 97.8 | 89.6 | 82.9 | 61.0 | 83.4 | 74.6 | 90.4 | 90.8 | 72.7 | 50.2 | 95.3 | 41.7 |
| GPT-5 | - | 94.8 | 81.9 | 78.3 | 60.8 | 88.3 | 77.3 | 90.6 | 91.0 | 78.7 | 57.4 | 93.8 | 52.1 |
| Open-Source Models | | | | | | | | | | | | | |
| Qwen2.5-VL-7B | 7B | 68.7 | 42.2 | 40.1 | 47.5 | 73.8 | 63.3 | 83.4 | 87.6 | 42.7 | 25.5 | 21.1 | 11.7 |
| InternVL3-8B | 8B | 63.3 | 43.8 | 46.1 | 54.5 | 77.4 | 68.2 | 85.3 | 87.6 | 57.7 | 38.6 | 42.2 | 6.2 |
| Qwen3-VL-8B | 8B | 78.3 | 62.5 | 67.8 | 53.0 | 77.0 | 67.4 | 85.5 | 87.2 | 47.7 | 33.0 | 78.9 | 41.2 |
| InternVL3.5-8B | 8B | 66.7 | 46.0 | 48.3 | 55.0 | 78.0 | 68.6 | 85.8 | 87.3 | 58.3 | 40.5 | 49.2 | 7.8 |
| InternVL3-14B | 14B | 72.3 | 51.3 | 54.1 | 51.4 | 75.5 | 65.8 | 85.8 | 87.5 | 73.3 | 52.2 | 71.1 | 40.2 |
| InternVL3.5-14B | 14B | 73.2 | 52.8 | 55.4 | 52.0 | 75.0 | 65.9 | 86.1 | 87.8 | 73.0 | 50.2 | 71.9 | 39.3 |
| Qwen3-VL-32B | 32B | 83.0 | 66.9 | 77.5 | 68.0 | 86.0 | 78.8 | 88.6 | 89.8 | 75.7 | 53.3 | 37.5 | 48.8 |
| InternVL3.5-38B | 38B | 79.0 | 60.0 | 71.8 | 51.9 | 77.3 | 67.1 | 87.8 | 88.4 | 72.6 | 49.5 | 55.5 | 31.4 |
| Qwen2.5-VL-72B | 72B | 88.5 | 72.7 | 79.1 | 47.7 | 76.0 | 64.7 | 86.9 | 88.7 | 62.0 | 41.7 | 75.8 | 28.0 |
| OCRVerse (Ours) | 4B | 84.8 | 72.2 | 75.4 | 63.2 | 85.2 | 76.3 | 85.7 | 87.4 | 88.7 | 63.1 | 89.1 | 54.7 |
Below is a simple example of how to use OCRVerse for document parsing tasks.
Please first install transformers using the following command:
pip install "transformers>=4.57.0"from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
import torch
# Load model
model_path = 'DocTron/OCRVerse'
model = Qwen3VLForConditionalGeneration.from_pretrained(
model_path,
dtype="auto",
device_map="cuda",
trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
# Prepare input with image and text
image_path = "./assets/text_centric_test.jpg"
# We recommend using the following prompt for better performance, since it is used throughout the training process.
prompt = "Extract the main content from the document in the image, keeping the original structure. Convert all formulas to LaTeX and all tables to HTML."
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": image_path},
{"type": "text", "text": prompt},
]
}
]
# Preparation for inference
inputs = processor.apply_chat_template(
messages,
tokenize=True,
add_generation_prompt=True,
return_dict=True,
return_tensors="pt"
)
inputs = inputs.to(model.device)
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=8192, do_sample=False)
generated_ids = [
output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.tokenizer.batch_decode(
generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
# $$
# r = \frac{\alpha}{\beta} \sin \beta (\sigma_1 \pm \sigma_2)
# $$

Below is a simple example of how to use OCRVerse for vision-centric tasks. We also recommend utilizing SGLang for inference.
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
import torch
# Load model
model_path = 'DocTron/OCRVerse'
model = Qwen3VLForConditionalGeneration.from_pretrained(
model_path,
dtype="auto",
device_map="cuda",
trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
# Prepare input with image and text
image_path = "./assets/vision_centric_test.png"
prompt = "You are an expert Python developer who specializes in writing matplotlib code based on a given picture. I found a very nice picture in a STEM paper, but there is no corresponding source code available. I need your help to generate the Python code that can reproduce the picture based on the picture I provide.\nNote that it is necessary to use figsize=(7.0, 5.0) to set the image size to match the original size.\nNow, please give me the matplotlib code that reproduces the picture below."
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": image_path},
{"type": "text", "text": prompt},
]
}
]
# Preparation for inference
inputs = processor.apply_chat_template(
messages,
tokenize=True,
add_generation_prompt=True,
return_dict=True,
return_tensors="pt"
)
inputs = inputs.to(model.device)
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
generated_ids = [
output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.tokenizer.batch_decode(
generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])

Example script for launching the SGLang server:
CUDA_VISIBLE_DEVICES=0,1,2,3 \
python -m sglang.launch_server \
--model-path DocTron/OCRVerse-code \
--host 0.0.0.0 \
--dist-init-addr 127.0.0.1:10002 \
--tp 4 \
--port 6002

If you want to continue training based on our model, you can use LLaMA-Factory. For installation and usage of LLaMA-Factory, please refer to its official documentation. A reference fine-tuning script with pre-specified parameters is provided below:
PROJECT_DIR=/path/to/llama_factory
cd ${PROJECT_DIR}
# Set parameters
GPUS_PER_NODE=8 # Number of GPUs per node
NNODES=1 # Total number of nodes
NODE_RANK=0 # Rank of the current node (starts from 0)
MASTER_ADDR=localhost # IP address of the master node
MASTER_PORT=12345 # Port for communication between nodes
MODEL_DIR=/path/to/ocrverse_text_model # Path to the pre-trained OCRVerse model
DATA=/name/of/your/dataset # Name/path of your custom dataset
OUTPUT_DIR=/path/to/output # Directory to save fine-tuned results
# Llama Factory-based fine-tuning script
torchrun --nproc_per_node="${GPUS_PER_NODE}" --nnodes="${NNODES}" --node_rank="${NODE_RANK}" --master_addr="${MASTER_ADDR}" --master_port="${MASTER_PORT}" \
src/train.py \
--model_name_or_path "$MODEL_DIR" \
--stage sft \
--do_train True \
--finetuning_type full \
--dataset "$DATA" \
--template qwen3_vl_nothink \
--cutoff_len 8192 \
--preprocessing_num_workers 128 \
--preprocessing_batch_size 256 \
--dataloader_num_workers 128 \
--output_dir "$OUTPUT_DIR" \
--logging_steps 1 \
--save_steps 5000 \
--plot_loss True \
--save_only_model False \
--report_to none \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 1 \
--learning_rate 1e-5 \
--num_train_epochs 1 \
--lr_scheduler_type cosine \
--warmup_ratio 0.1 \
--bf16 True

We sincerely appreciate LLaMA-Factory and EasyR1 for providing the reference training frameworks.
If you find this project useful, please feel free to leave a star and cite our paper:
@article{zhong2026ocrverse,
title={OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models},
author={Zhong, Yufeng and Chen, Lei and Zhao, Xuanle and Han, Wenkang and Zheng, Liming and Huang, Jing and Jiang, Deyang and Cao, Yilin and Ma, Lin and Zeng, Zhixiong},
journal={arXiv preprint arXiv:2601.21639},
year={2026}
}