OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models

Yufeng Zhong*, Lei Chen*, Xuanle Zhao*, Wenkang Han, Liming Zheng, Jing Huang,
Deyang Jiang, Yilin Cao, Lin Ma $^{\dagger}$, Zhixiong Zeng $^{\dagger}$

Meituan

*Equal Contribution, $^{\dagger}$ Corresponding Authors


The development of large vision-language models drives the demand for managing and applying massive amounts of multimodal data, making OCR technology, which extracts information from visual images, increasingly important. However, existing OCR methods primarily focus on recognizing text elements from images or scanned documents (Text-centric OCR), neglecting the identification of visual elements from visually information-dense image sources (Vision-centric OCR), such as charts, web pages, and scientific plots. In reality, these visually information-dense images are widespread on the internet and have significant real-world application value, for example in data visualization and web page analysis. In this technical report, we propose OCRVerse, the first holistic OCR method that unifies text-centric and vision-centric OCR in an end-to-end manner. To this end, we construct a comprehensive data engineering pipeline covering a wide range of text-centric documents, such as newspapers, magazines, and books, as well as vision-centric rendered composites, including charts, web pages, and scientific plots. Moreover, we propose a two-stage SFT-RL multi-domain training method for OCRVerse. The SFT stage directly mixes cross-domain data to establish initial domain knowledge, while the RL stage designs personalized reward strategies for the characteristics of each domain. Since different domains require different output formats and expected outputs, the RL stage provides the flexibility to customize reward signals per domain, thereby improving cross-domain fusion and avoiding data conflicts. Experimental results demonstrate the effectiveness of OCRVerse, which achieves competitive results across text-centric and vision-centric data types, even comparable to large-scale open-source and closed-source models.

📢 News and Updates

  • [Jan 30, 2026] 📄 Preprint released on arXiv.
  • [Jan 30, 2026] 🤗 We released the OCRVerse model weights on Hugging Face.
  • [Nov 3, 2025] 🤗 We released the OCRVerse-code model weights on Hugging Face.
  • [Oct 27, 2025] 🤗 We released the OCRVerse-text model weights on Hugging Face.

💡 Benchmarking

Performance comparison of OCRVerse on text-centric OCR tasks (top row) and vision-centric OCR tasks (bottom row). Since existing OCR methods primarily focus on text-centric scenarios, we compare against both specialized OCR models and general-purpose models for text-centric benchmarks, while comparing only against general-purpose models for vision-centric benchmarks.

📚 Dataset Sources

OCRVerse encompasses both text-centric and vision-centric data types, comprehensively supporting the data requirements of holistic OCR. The text-centric data types cover nine document scenarios: natural scenes, books, magazines, papers, reports, slides, exam papers, notes, and newspapers, which encompass high-frequency text scenarios in daily life and meet essential OCR needs. The vision-centric data types comprise six specialized scenarios: charts, webpages, icons, geometry, circuits, and molecules, which focus on professional structured content and address gaps not covered by text-centric categories.

📥 Data Processing

Our training dataset is constructed through a systematic multi-stage pipeline that integrates both text-centric and vision-centric data sources to ensure comprehensive coverage and high quality. The data processing workflow encompasses data collection, cleaning, and annotation generation (including self-annotation).

🤖 Pipeline

We propose a two-stage SFT-RL multi-domain training method. In the SFT stage, we aim to establish foundational cross-domain knowledge by directly mixing data from all domains. This approach enables the model to learn diverse visual patterns and output formats across both text-centric and vision-centric scenarios, thereby building a unified representation space. In the RL stage, we aim to resolve domain-specific conflicts and optimize personalized performance for each domain. Since different domains require various output formats and quality expectations, we design customized reward strategies tailored to the characteristics of each domain, such as structural accuracy for tables and semantic fidelity for charts. This flexible reward mechanism effectively improves cross-domain fusion while avoiding data conflicts that typically arise from naive multi-task learning.
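To make the reward design concrete, below is a minimal, hypothetical sketch of per-domain reward routing. The difflib-based scorers merely stand in for the actual domain metrics described above (e.g. edit distance for documents, TEDS-style structural scores for tables, render-and-compare checks for charts) and are not the released training code.

from difflib import SequenceMatcher

# Hypothetical sketch: route each rollout to a domain-specific reward so that
# every domain keeps its own output format and quality criterion.
# The difflib-based scorers are placeholders for the real metrics.

def document_reward(pred: str, ref: str) -> float:
    # Stand-in for (1 - normalized edit distance) on the Markdown output.
    return SequenceMatcher(None, pred, ref).ratio()

def table_reward(pred_html: str, ref_html: str) -> float:
    # Stand-in for a structural-accuracy score such as TEDS on the HTML table.
    return SequenceMatcher(None, pred_html, ref_html).ratio()

def chart_reward(pred_code: str, ref_code: str) -> float:
    # Stand-in for executing the predicted code and comparing the rendered figure.
    return SequenceMatcher(None, pred_code, ref_code).ratio()

REWARD_FNS = {"document": document_reward, "table": table_reward, "chart": chart_reward}

def compute_reward(domain: str, prediction: str, reference: str) -> float:
    # Dispatch to the reward function registered for this sample's domain.
    return REWARD_FNS[domain](prediction, reference)

print(compute_reward("table", "<table><tr><td>1</td></tr></table>",
                     "<table><tr><td>1</td></tr></table>"))  # 1.0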

📊 Performance

Text-Centric Evaluation

Benchmark

We adopt OmniDocBench v1.5 as the primary testbed to comprehensively evaluate the document parsing capabilities of OCRVerse. This benchmark serves as a rigorous standard to validate model robustness and generalization across real-world applications. OmniDocBench v1.5 is an expanded dataset comprising 1,355 document pages, enriched with 374 additional pages compared to the previous version. It features a balanced distribution of bilingual content in both Chinese and English and covers nine diverse document types—including academic papers, textbooks, financial reports, and exam papers. By incorporating varied layout structures (from single- to multi-column) and rich elements such as text, mathematical formulas, and structured tables, the benchmark enables a thorough assessment of OCRVerse in handling complex parsing scenarios.

Metrics

End-to-end evaluation assesses the model's accuracy in parsing PDF page content, using the model's Markdown output for the entire page as the prediction. The Overall metric is calculated as:

$$ \text{Overall} = \frac{(1-\text{Text Edit Distance}) \times 100 + \text{Table TEDS} +\text{Formula CDM}}{3} $$
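For reference, a minimal sketch of this combination, assuming the text edit distance lies in [0, 1] and Table TEDS / Formula CDM are reported on a 0-100 scale as in the table below:

def overall_score(text_edit_distance: float, table_teds: float, formula_cdm: float) -> float:
    # Combine the three component metrics into the Overall score.
    return ((1 - text_edit_distance) * 100 + table_teds + formula_cdm) / 3

# Plugging in OCRVerse's reported numbers reproduces its Overall value:
print(round(overall_score(0.052, 85.77, 87.13), 2))  # 89.23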

Results

| Model Type | Methods | Release Date | Parameters | Overall↑ | Text Edit↓ | Formula CDM↑ | Table TEDS↑ | Table TEDS-S↑ | Reading Order Edit↓ |
|---|---|---|---|---|---|---|---|---|---|
| Pipeline Tools | Marker-1.8.2 | 2025 | - | 71.30 | 0.206 | 76.66 | 57.88 | 71.17 | 0.250 |
| | Mineru2-pipeline | 2025 | - | 75.51 | 0.209 | 76.55 | 70.90 | 79.11 | 0.225 |
| | PP-StructureV3 | 2024 | - | 86.73 | 0.073 | 85.79 | 81.68 | 89.48 | 0.073 |
| General VLMs | GPT-4o | 2024 | - | 75.02 | 0.217 | 79.70 | 67.07 | 76.09 | 0.148 |
| | InternVL3-76B | 2025 | 76B | 80.33 | 0.131 | 83.42 | 70.64 | 77.74 | 0.113 |
| | InternVL3.5-241B | 2025 | 241B | 82.67 | 0.142 | 87.23 | 75.00 | 81.28 | 0.125 |
| | Qwen2.5-VL-72B | 2025 | 72B | 87.02 | 0.094 | 88.27 | 82.15 | 86.22 | 0.102 |
| | Gemini-2.5 Pro | 2025 | - | 88.03 | 0.075 | 85.82 | 85.71 | 90.29 | 0.097 |
| Specialized VLMs | Dolphin | 2025.05 | 322M | 74.67 | 0.125 | 67.85 | 68.70 | 77.77 | 0.124 |
| | MinerU2-VLM | 2025.06 | 0.9B | 85.56 | 0.078 | 80.95 | 83.54 | 87.66 | 0.086 |
| | MonkeyOCR-pro-1.2B | 2025.07 | 1.9B | 86.96 | 0.084 | 85.02 | 84.24 | 89.02 | 0.130 |
| | MonkeyOCR-3B | 2025.06 | 3.7B | 87.13 | 0.075 | 87.45 | 81.39 | 85.92 | 0.129 |
| | MonkeyOCR-pro-3B | 2025.07 | 3.7B | 88.85 | 0.075 | 87.25 | 86.78 | 90.63 | 0.128 |
| | MinerU2.5 | 2025.09 | 1.2B | 90.67 | 0.047 | 88.46 | 88.22 | 92.38 | 0.044 |
| | PaddleOCR-VL | 2025.10 | 0.9B | 92.56 | 0.035 | 91.43 | 89.76 | 93.52 | 0.043 |
| | OCRFlux-3B | 2025.06 | 3B | 74.82 | 0.193 | 68.03 | 75.75 | 80.23 | 0.202 |
| | Mistral OCR | 2025.03 | - | 78.83 | 0.164 | 82.84 | 70.03 | 78.04 | 0.144 |
| | POINTS-Reader | 2025.08 | 3B | 80.98 | 0.134 | 79.20 | 77.13 | 81.66 | 0.145 |
| | olmOCR-7B | 2025.02 | 7B | 81.79 | 0.096 | 86.04 | 68.92 | 74.77 | 0.121 |
| | Nanonets-OCR-s | 2025.06 | 3B | 85.59 | 0.093 | 85.90 | 80.14 | 85.57 | 0.108 |
| | Deepseek-OCR | 2025.10 | 3B | 87.01 | 0.073 | 83.37 | 84.97 | 88.80 | 0.086 |
| | dots.ocr | 2025.07 | 3B | 88.41 | 0.048 | 83.22 | 86.78 | 90.62 | 0.053 |
| | FD-RL | 2025.11 | 4B | 90.41 | 0.049 | 88.67 | 87.35 | 92.10 | 0.055 |
| | HunyuanOCR | 2025.11 | 4B | 94.10 | 0.042 | 94.73 | 91.81 | - | - |
| | Deepseek-OCR 2 | 2026.01 | 1B | 91.09 | 0.048 | 90.31 | 87.75 | 92.06 | 0.057 |
| | OCRVerse (Ours) | 2026.01 | 4B | 89.23 | 0.052 | 87.13 | 85.77 | 90.35 | 0.068 |

Vision-Centric Evaluation

Benchmarks

To comprehensively evaluate the vision-centric OCR capabilities of our proposed OCRVerse, we conducted extensive experiments across five diverse structured vision-centric benchmarks. These benchmarks cover a wide spectrum of visual domains, assessing the model's ability to translate visually dense information into executable code or structured representations. Specifically, the evaluation tasks include: (1) ChartMimic for direct chart-to-code generation; (2) Design2Code, which evaluates the precision of reproducing web layouts in the web-to-HTML task; (3) UniSVG, assessing the generation of scalable vector graphics in the image-to-SVG task; (4) Image2Struct, testing the conversion of scientific documents and formulas in the image-to-LaTeX task; (5) ChemDraw, focusing on the recognition of chemical structures in the molecule-to-code task.

Metrics

For all benchmarks, we strictly adhere to the evaluation protocols officially defined in their respective GitHub repositories or original papers. For ChartMimic, we report the code execution success rate and the average of low-level metrics (Text, Layout, Type, and Color Scores). For high-level evaluation, we employ the GPT-4o Score. Regarding UniSVG, we present the low-level score, computed as the average of SSIM and (1 - LPIPS), alongside the high-level CLIP similarity. For Design2Code, we report both the CLIP similarity (high-level) and the element-matching scores (low-level) proposed by the benchmark authors. For Image2Struct, we evaluate using the earth mover's similarity (EMS) and the rendering success rate. Finally, for ChemDraw, we report the code execution success rate and the Tanimoto similarity.
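As a small illustration of how the UniSVG low-level score is aggregated (the SSIM and LPIPS values themselves would come from standard implementations such as scikit-image and the lpips package; the helper below only shows the averaging step, and the example values are made up):

def unisvg_low_level(ssim: float, lpips: float) -> float:
    # Average of SSIM and (1 - LPIPS); both inputs are assumed to lie in [0, 1].
    return (ssim + (1.0 - lpips)) / 2.0

print(unisvg_low_level(ssim=0.80, lpips=0.25))  # 0.775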

Results

(ChartMimic = ChartMimic_direct_v2, UniSVG = UniSVG-ISVGEN, Image2LaTeX = Image2Latex_plot.)

| Model | Parameters | ChartMimic Exec.Rate | ChartMimic Low-Level | ChartMimic High-Level | UniSVG Low-Level | UniSVG High-Level | UniSVG Score | Design2Code Low-Level | Design2Code High-Level | Image2LaTeX Ren.Succ. | Image2LaTeX EMS | ChemDraw Exec.Rate | ChemDraw Tani.Sim. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Closed-Source Models* | | | | | | | | | | | | | |
| Gemini-2.5-Pro | - | 97.3 | 88.7 | 83.8 | 53.6 | 80.3 | 69.6 | 90.8 | 91.4 | 74.3 | 52.5 | 77.3 | 2.8 |
| Claude-4.5-Sonnet | - | 97.8 | 89.6 | 82.9 | 61.0 | 83.4 | 74.6 | 90.4 | 90.8 | 72.7 | 50.2 | 95.3 | 41.7 |
| GPT-5 | - | 94.8 | 81.9 | 78.3 | 60.8 | 88.3 | 77.3 | 90.6 | 91.0 | 78.7 | 57.4 | 93.8 | 52.1 |
| *Open-Source Models* | | | | | | | | | | | | | |
| Qwen2.5-VL-7B | 7B | 68.7 | 42.2 | 40.1 | 47.5 | 73.8 | 63.3 | 83.4 | 87.6 | 42.7 | 25.5 | 21.1 | 11.7 |
| InternVL3-8B | 8B | 63.3 | 43.8 | 46.1 | 54.5 | 77.4 | 68.2 | 85.3 | 87.6 | 57.7 | 38.6 | 42.2 | 6.2 |
| Qwen3-VL-8B | 8B | 78.3 | 62.5 | 67.8 | 53.0 | 77.0 | 67.4 | 85.5 | 87.2 | 47.7 | 33.0 | 78.9 | 41.2 |
| InternVL3.5-8B | 8B | 66.7 | 46.0 | 48.3 | 55.0 | 78.0 | 68.6 | 85.8 | 87.3 | 58.3 | 40.5 | 49.2 | 7.8 |
| InternVL3-14B | 14B | 72.3 | 51.3 | 54.1 | 51.4 | 75.5 | 65.8 | 85.8 | 87.5 | 73.3 | 52.2 | 71.1 | 40.2 |
| InternVL3.5-14B | 14B | 73.2 | 52.8 | 55.4 | 52.0 | 75.0 | 65.9 | 86.1 | 87.8 | 73.0 | 50.2 | 71.9 | 39.3 |
| Qwen3-VL-32B | 32B | 83.0 | 66.9 | 77.5 | 68.0 | 86.0 | 78.8 | 88.6 | 89.8 | 75.7 | 53.3 | 37.5 | 48.8 |
| InternVL3.5-38B | 38B | 79.0 | 60.0 | 71.8 | 51.9 | 77.3 | 67.1 | 87.8 | 88.4 | 72.6 | 49.5 | 55.5 | 31.4 |
| Qwen2.5-VL-72B | 72B | 88.5 | 72.7 | 79.1 | 47.7 | 76.0 | 64.7 | 86.9 | 88.7 | 62.0 | 41.7 | 75.8 | 28.0 |
| OCRVerse (Ours) | 4B | 84.8 | 72.2 | 75.4 | 63.2 | 85.2 | 76.3 | 85.7 | 87.4 | 88.7 | 63.1 | 89.1 | 54.7 |

🔍 Usage Example

Inference

Text-Centric Task

Below is a simple example of how to use OCRVerse for document parsing tasks.

Please first install transformers using the following command:

pip install "transformers>=4.57.0"

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
import torch

# Load model
model_path = 'DocTron/OCRVerse'
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    dtype="auto", 
    device_map="cuda",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Prepare input with image and text
image_path = "./assets/text_centric_test.jpg"
# We recommend using the following prompt for better performance, since it is used throughout the training process.
prompt = "Extract the main content from the document in the image, keeping the original structure. Convert all formulas to LaTeX and all tables to HTML."

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ]
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages, 
    tokenize=True, 
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=8192, do_sample=False)

generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])

# $$
# r = \frac{\alpha}{\beta} \sin \beta (\sigma_1 \pm \sigma_2)
# $$

Vision-Centric Task

Below is a simple example of how to use OCRVerse for vision-centric tasks. We also recommend utilizing SGLang for inference.

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
import torch

# Load model
model_path = 'DocTron/OCRVerse'
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    dtype="auto", 
    device_map="cuda",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Prepare input with image and text
image_path = "./assets/vision_centric_test.png"
prompt = "You are an expert Python developer who specializes in writing matplotlib code based on a given picture. I found a very nice picture in a STEM paper, but there is no corresponding source code available. I need your help to generate the Python code that can reproduce the picture based on the picture I provide.\nNote that it is necessary to use figsize=(7.0, 5.0) to set the image size to match the original size.\nNow, please give me the matplotlib code that reproduces the picture below."

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": prompt},
        ]
    }
]

# Preparation for inference
inputs = processor.apply_chat_template(
    messages, 
    tokenize=True, 
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=4096, do_sample=False)

generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])

Example script for launching an SGLang server:

CUDA_VISIBLE_DEVICES=0,1,2,3 \
python -m sglang.launch_server \
--model-path DocTron/OCRVerse-code \
--host 0.0.0.0 \
--dist-init-addr 127.0.0.1:10002 \
--tp 4 \
--port 6002
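
Once the server is running, it can be queried through SGLang's OpenAI-compatible API. Below is a minimal client sketch; the host, port, model name, and image path are assumptions chosen to match the launch script above:

import base64
from openai import OpenAI

# Query the SGLang server started above via its OpenAI-compatible endpoint.
client = OpenAI(base_url="http://127.0.0.1:6002/v1", api_key="EMPTY")

with open("./assets/vision_centric_test.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="DocTron/OCRVerse-code",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": "Please give me the matplotlib code that reproduces the picture."},
        ],
    }],
    max_tokens=4096,
    temperature=0.0,
)
print(response.choices[0].message.content)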

Fine-tuning

If you want to continue training based on our model, you can use LLaMA-Factory. For installation and usage of LLaMA-Factory, please refer to its official documentation. A reference fine-tuning script with pre-specified parameters is provided below:

PROJECT_DIR=/path/to/llama_factory
cd ${PROJECT_DIR}

# Set parameters
GPUS_PER_NODE=8                  # Number of GPUs per node
NNODES=1                         # Total number of nodes
NODE_RANK=0                      # Rank of the current node (starts from 0)
MASTER_ADDR=localhost            # IP address of the master node
MASTER_PORT=12345                # Port for communication between nodes

MODEL_DIR=/path/to/ocrverse_text_model  # Path to the pre-trained OCRVerse model
DATA=/name/of/your/dataset               # Name/path of your custom dataset
OUTPUT_DIR=/path/to/output              # Directory to save fine-tuned results

# Llama Factory-based fine-tuning script
torchrun --nproc_per_node="${GPUS_PER_NODE}" --nnodes="${NNODES}" --node_rank="${NODE_RANK}" --master_addr="${MASTER_ADDR}" --master_port="${MASTER_PORT}" \
    src/train.py \
    --model_name_or_path "$MODEL_DIR" \
    --stage sft \
    --do_train True \
    --finetuning_type full \
    --dataset "$DATA" \
    --template qwen3_vl_nothink \
    --cutoff_len 8192 \
    --preprocessing_num_workers 128 \
    --preprocessing_batch_size 256 \
    --dataloader_num_workers 128 \
    --output_dir "$OUTPUT_DIR" \
    --logging_steps 1 \
    --save_steps 5000 \
    --plot_loss True \
    --save_only_model False \
    --report_to none \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --learning_rate 1e-5 \
    --num_train_epochs 1 \
    --lr_scheduler_type cosine \
    --warmup_ratio 0.1 \
    --bf16 True

📌 Acknowledgement

We sincerely appreciate LLaMA-Factory and EasyR1 for providing the reference training frameworks.

📖 Citation

If you find this project useful, please feel free to leave a star and cite our paper:

@article{zhong2026ocrverse,
  title={OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models},
  author={Zhong, Yufeng and Chen, Lei and Zhao, Xuanle and Han, Wenkang and Zheng, Liming and Huang, Jing and Jiang, Deyang and Cao, Yilin and Ma, Lin and Zeng, Zhixiong},
  journal={arXiv preprint arXiv:2601.21639},
  year={2026}
}
