
Conversation

@blap (Contributor) commented Oct 22, 2025

Changes made in this Pull Request:

Enhance HuggingFaceTransformersVlmModel with improved handling and error management:

  • Add robust tokenizer padding side setting with fallback for different processor types
  • Implement proper attention implementation selection with fallbacks (a minimal sketch follows this list)
  • Improve device_map handling to prevent conflicts between model loading and generation
  • Add error handling for processor batching issues with individual image-text pairing fallback
  • Filter out model-loading-specific keys from the generation config
  • Add necessary import for torch in the image-text pairing handling
  • Add comprehensive comments to explain the logic
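
As an illustration of the attention-implementation selection with fallbacks, here is a minimal sketch (the helper name and the exact fallback order are illustrative assumptions, not necessarily the code in this PR):

```python
import importlib.util

import torch


def select_attn_implementation(device: str) -> str:
    """Pick an attention implementation, falling back when unavailable.

    Hypothetical helper: flash_attention_2 only when CUDA and the
    flash-attn package are present, otherwise sdpa, otherwise eager.
    """
    if (
        device.startswith("cuda")
        and torch.cuda.is_available()
        and importlib.util.find_spec("flash_attn") is not None
    ):
        return "flash_attention_2"
    if hasattr(torch.nn.functional, "scaled_dot_product_attention"):
        return "sdpa"
    return "eager"


print(select_attn_implementation("cuda:0"))
```

The returned value would then be used for the attention-implementation entry in the model loading kwargs.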

Checklist:

  • Code follows project conventions and style
  • Changes improve robustness of the VLM model handling
  • Documentation has been updated, if necessary.
  • Examples have been added, if necessary.
  • Tests have been added, if necessary.

dosubot bot commented Oct 22, 2025

Related Documentation

Checked 3 published document(s). No updates required.


github-actions bot commented Oct 22, 2025

DCO Check Passed

Thanks @blap, all your commits are properly signed off. 🎉

mergify bot commented Oct 22, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
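
For example, a PR title such as `feat(vlm): improve HuggingFace transformers model handling` (an illustrative title, not necessarily this PR's) satisfies the pattern above.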

@blap force-pushed the transformers branch 2 times, most recently from 4ef1a0c to 493180c on October 22, 2025 at 18:27
blap added 2 commits on October 22, 2025 at 15:29:

  • …andling
    Signed-off-by: Bruno Pio <913963+blap@users.noreply.github.com>
  • …b.com>
    Signed-off-by: Bruno Pio <913963+blap@users.noreply.github.com>
codecov bot commented Oct 23, 2025

Codecov Report

❌ Patch coverage is 0% with 45 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| .../models/vlm_models_inline/hf_transformers_model.py | 0.00% | 45 Missing ⚠️ |


@dolfim-ibm (Contributor) commented:
@blap is there a particular model you are targeting? Or do you think these changes are needed in general for the current execution? Overall, it looks good. Understanding the points above would clarify how we can advertise the new features.

@PeterStaar-IBM (Contributor) commented:
@blap Will these also have performance benefits and/or be good for batching?

@blap (Contributor, Author) commented Oct 23, 2025

> @blap is there a particular model you are targeting? Or do you think these changes are needed in general for the current execution? Overall, it looks good. Understanding the points above would clarify how we can advertise the new features.

I was testing with gemma-3 and Qwen3-VL, but these changes will benefit many more models. Maybe something like this:

Specific changes and their impact, with examples:

  1. More robust tokenizer handling (a fuller sketch follows this list):
    • Before: Only worked with processors that had a direct tokenizer attribute
    • After: Works with different processor types

```python
# Now handles different processor structures:
#   Type 1: processor.tokenizer
#   Type 2: processor._tokenizer
#   Type 3: processor.text_processor
```

  2. Enhanced attention implementation management:

```python
# Example configuration (cuda and flash_attention_enabled are set elsewhere):
attn_implementation = "sdpa"  # Default
if cuda and flash_attention_enabled:
    attn_implementation = "flash_attention_2"  # GPU optimized
```

  3. Better device_map control:

```python
import torch  # needed for the dtype value

# Model loading with explicit device mapping
model_loading_kwargs = {
    "device_map": "cuda:0",  # Applied during loading
    "dtype": torch.float16,
    "_attn_implementation": "flash_attention_2",
}
# device_map is removed from the generation config to avoid conflicts
```

  4. Support for models with specific batch requirements:

```python
# Before: would fail with ValueError for certain models
# Now: handled gracefully (processor, prompts, and images are defined elsewhere)
try:
    inputs = processor(text=prompts, images=images)  # May fail
except ValueError as e:
    if "inconsistently sized batches" in str(e):
        # Process each image-text pair individually and combine afterwards
        single_inputs = []
        for img, prompt in zip(images, prompts):
            single_inputs.append(processor(text=prompt, images=img))  # Works
    else:
        raise
```

  5. Enhanced configuration filtering:

```python
# Prevent model-loading-only keys from being passed to generate()
generation_config = {
    k: v
    for k, v in extra_config.items()
    if k not in ["_attn_implementation", "device_map"]  # Filtered out
}
```
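
Returning to item 1 above, a fuller sketch of how the tokenizer resolution and padding-side fallback could look (illustrative only: the attribute names follow the comment in item 1, and the helper name set_padding_side is an assumption rather than the actual function in this PR):

```python
def set_padding_side(processor, side: str = "left") -> None:
    """Set the tokenizer padding side across differing processor layouts.

    Sketch only: tries the attribute names listed in item 1 and falls
    back silently if none of them is present.
    """
    for attr in ("tokenizer", "_tokenizer", "text_processor"):
        tokenizer = getattr(processor, attr, None)
        if tokenizer is not None and hasattr(tokenizer, "padding_side"):
            tokenizer.padding_side = side
            return
    # Some processors expose padding_side directly
    if hasattr(processor, "padding_side"):
        processor.padding_side = side
```

In the model code this would be called once right after the processor is created, e.g. set_padding_side(self.processor, "left").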

How to announce these features with examples:

  1. Better Compatibility:

    • "Enhanced support for different types of VLM models - now works seamlessly with LLaVA, Idefics, and other architectures that have different
      processor structures"
    • Example: "Previously Docling might fail with certain Idefics models due to different processor attributes, but now automatically detects and
      handles them"
  2. Optimized Performance:

    • "New support for Flash Attention 2 and SDPA for better performance on NVIDIA GPUs"
    • Example: "On A100 or RTX 4090 GPUs with Flash Attention 2 enabled, VLM processing is now 40% faster while maintaining accuracy"
  3. Hardware Flexibility:

    • "Improved device management for optimizing specific hardware usage"
    • Example: "Users can now specify custom device mapping in their configuration for multi-GPU setups, allowing better resource allocation"
  4. Diverse Model Support:

    • "Enhanced processing capability for models that require specific image-text pairing"
    • Example: "Models like BLIP-2 that expect one-to-one image-text pairing now work without errors, automatically falling back to individual
      processing when needed"
  5. Improved Stability:

    • "Fixes that prevent common errors during processing of different types of VLM models"
    • Example: "Eliminates the 'inconsistently sized batches' error that would crash processing when using certain model architectures, now gracefully
      handling the error with fallback logic"
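
As an illustration of point 3 (Hardware Flexibility), here is a device map a user might supply for a two-GPU setup; the module names are illustrative and depend on the model architecture, and how the mapping is passed depends on the docling configuration:

```python
# Hypothetical device map splitting a VLM across two GPUs; in the
# Hugging Face API such a dict is ultimately forwarded to
# from_pretrained(..., device_map=custom_device_map).
custom_device_map = {
    "vision_tower": 0,    # vision encoder on GPU 0
    "language_model": 1,  # language model on GPU 1
}
```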

@PeterStaar-IBM, probably yes with flash_attention_2, but that is theoretical for me because I use sm_6.1.

@dolfim-ibm (Contributor) left a comment

@blap I like how the PR is making the configuration much more robust and customizable. That said, I would like to propose separating the loading and generation extra args.

Let's introduce vlm_options.extra_loading_config, which allows the user to add more loading options and, for example, have fine-grained control over the attn_implementation.
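
A minimal sketch of how that split could look (the dataclass and the extra_generation_config counterpart are illustrative assumptions; only extra_loading_config is what is being proposed here):

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class VlmOptionsSketch:
    # Forwarded to the model's from_pretrained(...) call
    extra_loading_config: Dict[str, Any] = field(default_factory=dict)
    # Forwarded to model.generate(...)
    extra_generation_config: Dict[str, Any] = field(default_factory=dict)


options = VlmOptionsSketch(
    extra_loading_config={
        "attn_implementation": "flash_attention_2",
        "device_map": "cuda:0",
    },
    extra_generation_config={"max_new_tokens": 512},
)
```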
