
Conversation

@blap (Contributor) commented Oct 22, 2025

Changes made in this Pull Request:

Enhance HuggingFaceTransformersVlmModel with improved handling and error management:

  • Add robust tokenizer padding side setting with fallback for different processor types
  • Implement proper attention implementation selection with fallbacks (a minimal sketch follows this list)
  • Improve device_map handling to prevent conflicts between model loading and generation
  • Add error handling for processor batching issues with individual image-text pairing fallback
  • Filter out model-loading-specific keys from the generation config
  • Add necessary import for torch in the image-text pairing handling
  • Add comprehensive comments to explain the logic
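
As an illustration of the attention-implementation selection with fallbacks, here is a minimal sketch (the helper name and the exact fallback order are illustrative assumptions, not necessarily the code in this PR):

```python
import importlib.util

import torch


def select_attn_implementation(device: str) -> str:
    """Pick an attention implementation, falling back when unavailable.

    Hypothetical helper: flash_attention_2 only when CUDA and the
    flash-attn package are present, otherwise sdpa, otherwise eager.
    """
    if (
        device.startswith("cuda")
        and torch.cuda.is_available()
        and importlib.util.find_spec("flash_attn") is not None
    ):
        return "flash_attention_2"
    if hasattr(torch.nn.functional, "scaled_dot_product_attention"):
        return "sdpa"
    return "eager"


print(select_attn_implementation("cuda:0"))
```

The returned value would then be used for the attention-implementation entry in the model loading kwargs.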

Checklist:

  • Code follows project conventions and style
  • Changes improve robustness of the VLM model handling
  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

dosubot bot commented Oct 22, 2025

Related Documentation

Checked 3 published document(s). No updates required.


github-actions bot commented Oct 22, 2025

DCO Check Passed

Thanks @blap, all your commits are properly signed off. 🎉

mergify bot commented Oct 22, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
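
For example, a PR title such as `feat(vlm): improve HuggingFace transformers model handling` (an illustrative title, not necessarily this PR's) satisfies the pattern above.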

@blap force-pushed the transformers branch 2 times, most recently from 4ef1a0c to 493180c on October 22, 2025 at 18:27
blap added 2 commits on October 22, 2025 at 15:29:

  • …andling
    Signed-off-by: Bruno Pio <913963+blap@users.noreply.github.com>
  • …b.com>
    Signed-off-by: Bruno Pio <913963+blap@users.noreply.github.com>
codecov bot commented Oct 23, 2025

Codecov Report

❌ Patch coverage is 0% with 45 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| .../models/vlm_models_inline/hf_transformers_model.py | 0.00% | 45 Missing ⚠️ |


@dolfim-ibm (Contributor) commented:
@blap is there a particular model you are targeting? Or do you think these changes are needed in general for the current execution? Overall, it looks good. Understanding the points above would clarify how we can advertise the new features.

@PeterStaar-IBM (Contributor) commented:
@blap Will these also have performance benefits and/or be good for batching?

@blap (Contributor, Author) commented Oct 23, 2025

> @blap is there a particular model you are targeting? Or do you think these changes are needed in general for the current execution? Overall, it looks good. Understanding the points above would clarify how we can advertise the new features.

I was testing with gemma-3 and Qwen3-VL, but these changes will benefit many more models. Maybe something like this:

Specific changes and their impact, with examples:

  1. More robust tokenizer handling (a fuller sketch follows this list):
    • Before: Only worked with processors that had a direct tokenizer attribute
    • After: Works with different processor types

```python
# Now handles different processor structures:
#   Type 1: processor.tokenizer
#   Type 2: processor._tokenizer
#   Type 3: processor.text_processor
```

  2. Enhanced attention implementation management:

```python
# Example configuration (cuda and flash_attention_enabled are set elsewhere):
attn_implementation = "sdpa"  # Default
if cuda and flash_attention_enabled:
    attn_implementation = "flash_attention_2"  # GPU optimized
```

  3. Better device_map control:

```python
import torch  # needed for the dtype value

# Model loading with explicit device mapping
model_loading_kwargs = {
    "device_map": "cuda:0",  # Applied during loading
    "dtype": torch.float16,
    "_attn_implementation": "flash_attention_2",
}
# device_map is removed from the generation config to avoid conflicts
```

  4. Support for models with specific batch requirements:

```python
# Before: would fail with ValueError for certain models
# Now: handled gracefully (processor, prompts, and images are defined elsewhere)
try:
    inputs = processor(text=prompts, images=images)  # May fail
except ValueError as e:
    if "inconsistently sized batches" in str(e):
        # Process each image-text pair individually and combine afterwards
        single_inputs = []
        for img, prompt in zip(images, prompts):
            single_inputs.append(processor(text=prompt, images=img))  # Works
    else:
        raise
```

  5. Enhanced configuration filtering:

```python
# Prevent model-loading-only keys from being passed to generate()
generation_config = {
    k: v
    for k, v in extra_config.items()
    if k not in ["_attn_implementation", "device_map"]  # Filtered out
}
```
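
Returning to item 1 above, a fuller sketch of how the tokenizer resolution and padding-side fallback could look (illustrative only: the attribute names follow the comment in item 1, and the helper name set_padding_side is an assumption rather than the actual function in this PR):

```python
def set_padding_side(processor, side: str = "left") -> None:
    """Set the tokenizer padding side across differing processor layouts.

    Sketch only: tries the attribute names listed in item 1 and falls
    back silently if none of them is present.
    """
    for attr in ("tokenizer", "_tokenizer", "text_processor"):
        tokenizer = getattr(processor, attr, None)
        if tokenizer is not None and hasattr(tokenizer, "padding_side"):
            tokenizer.padding_side = side
            return
    # Some processors expose padding_side directly
    if hasattr(processor, "padding_side"):
        processor.padding_side = side
```

In the model code this would be called once right after the processor is created, e.g. set_padding_side(self.processor, "left").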

How to announce these features with examples:

  1. Better Compatibility:

    • "Enhanced support for different types of VLM models - now works seamlessly with LLaVA, Idefics, and other architectures that have different
      processor structures"
    • Example: "Previously Docling might fail with certain Idefics models due to different processor attributes, but now automatically detects and
      handles them"
  2. Optimized Performance:

    • "New support for Flash Attention 2 and SDPA for better performance on NVIDIA GPUs"
    • Example: "On A100 or RTX 4090 GPUs with Flash Attention 2 enabled, VLM processing is now 40% faster while maintaining accuracy"
  3. Hardware Flexibility:

    • "Improved device management for optimizing specific hardware usage"
    • Example: "Users can now specify custom device mapping in their configuration for multi-GPU setups, allowing better resource allocation"
  4. Diverse Model Support:

    • "Enhanced processing capability for models that require specific image-text pairing"
    • Example: "Models like BLIP-2 that expect one-to-one image-text pairing now work without errors, automatically falling back to individual
      processing when needed"
  5. Improved Stability:

    • "Fixes that prevent common errors during processing of different types of VLM models"
    • Example: "Eliminates the 'inconsistently sized batches' error that would crash processing when using certain model architectures, now gracefully
      handling the error with fallback logic"
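
As an illustration of point 3 (Hardware Flexibility), here is a device map a user might supply for a two-GPU setup; the module names are illustrative and depend on the model architecture, and how the mapping is passed depends on the docling configuration:

```python
# Hypothetical device map splitting a VLM across two GPUs; in the
# Hugging Face API such a dict is ultimately forwarded to
# from_pretrained(..., device_map=custom_device_map).
custom_device_map = {
    "vision_tower": 0,    # vision encoder on GPU 0
    "language_model": 1,  # language model on GPU 1
}
```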

@PeterStaar-IBM, probably yes with flash_attention_2, but that is theoretical for me because I use sm_6.1.

@dolfim-ibm (Contributor) left a comment

@blap I like how the PR is making the configuration much more robust and customizable. That said, I would like to propose separating the loading and generation extra args.

Let's introduce vlm_options.extra_loading_config, which allows the user to add more loading options and, for example, have fine-grained control over the attn_implementation.
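
A minimal sketch of how that split could look (the dataclass and the extra_generation_config counterpart are illustrative assumptions; only extra_loading_config is what is being proposed here):

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class VlmOptionsSketch:
    # Forwarded to the model's from_pretrained(...) call
    extra_loading_config: Dict[str, Any] = field(default_factory=dict)
    # Forwarded to model.generate(...)
    extra_generation_config: Dict[str, Any] = field(default_factory=dict)


options = VlmOptionsSketch(
    extra_loading_config={
        "attn_implementation": "flash_attention_2",
        "device_map": "cuda:0",
    },
    extra_generation_config={"max_new_tokens": 512},
)
```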
