
Outputs of Idefics2 are unrelated to the images in the latest versions of Transformers #32271

@HugoLaurencon

Description


System Info

  • transformers version: 4.43.2
  • Platform: Linux-5.15.0-1049-aws-x86_64-with-glibc2.17
  • Python version: 3.8.18
  • Huggingface_hub version: 0.23.2
  • Safetensors version: 0.4.1
  • Accelerate version: 0.26.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.1+cu118 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:

Who can help?

@amyeroberts
@ArthurZucker

Just filing an official issue so that other users can see it.
Some people didn't understand why they were getting much lower scores with Idefics2: it's a silent bug, and the model still produces generations related to the text prompt, so at first it's hard to notice anything is wrong.
I also flagged it prominently in the model cards, e.g. https://huggingface.co/HuggingFaceM4/idefics2-8b

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq


processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b", torch_dtype=torch.bfloat16, device_map="auto", attn_implementation="flash_attention_2"
)


# Create inputs
path_image = "/fsx/hugo/wow_images/cv_0.png"
image = Image.open(path_image)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What's written on this image?"},
        ]
    },
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}


# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)

Pick any image; this code will give different outputs depending on the version of Transformers.
It gives correct outputs with version 4.40, and outputs that seem unrelated to the image with newer versions.
This is almost certainly because of the new caching strategy implemented in Transformers, which modeling_idefics2.py doesn't take into account.
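For anyone scripting evaluations across environments, a version guard can flag the affected range up front. This is a minimal sketch under the assumption that every release newer than the 4.40 line is affected; the report only confirms correct behavior on 4.40 and incorrect behavior on 4.43.2, and `is_affected` is a hypothetical helper, not part of Transformers:

```python
# Hypothetical guard (not part of Transformers): flag versions where this
# Idefics2 regression may be present. Assumes "major.minor[.patch]" version
# strings and that any release newer than the 4.40 line is affected.

def is_affected(version: str) -> bool:
    """Return True for Transformers releases newer than the 4.40 line."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) > (4, 40)

print(is_affected("4.40.0"))  # the line reported to work correctly
print(is_affected("4.43.2"))  # the version from the system info above
```

If the guard fires, pinning back to a 4.40 release should restore the correct generations per this report, until a fix lands in the Idefics2 modeling code.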

Expected behavior

With the newer versions, we should get the same output as with version 4.40.
