Description
System Info
- transformers version: 4.43.2
- Platform: Linux-5.15.0-1049-aws-x86_64-with-glibc2.17
- Python version: 3.8.18
- Huggingface_hub version: 0.23.2
- Safetensors version: 0.4.1
- Accelerate version: 0.26.1
- Accelerate config: not found
- PyTorch version (GPU?): 2.0.1+cu118 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using distributed or parallel set-up in script?:
Who can help?
Just filing an official issue so that other users can see it.
Some users didn't understand why they were getting much lower scores with Idefics2: it's a silent bug, and since the model still produces generations related to the text prompt, it's hard to notice at first that anything is wrong.
I also flagged it prominently in the model cards, e.g. https://huggingface.co/HuggingFaceM4/idefics2-8b
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b", torch_dtype=torch.bfloat16, device_map="auto", attn_implementation="flash_attention_2"
)

# Create inputs
path_image = "/fsx/hugo/wow_images/cv_0.png"
image = Image.open(path_image)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What's written on this image?"},
        ]
    },
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)
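As an aside, attn_implementation="flash_attention_2" requires a CUDA GPU and the flash-attn package; on a CPU-only machine like the one in the environment report above, a variant that drops it (my assumption for reproducibility, not part of the original repro) would be:

# CPU-friendly variant (assumption: flash-attn is unavailable without CUDA):
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b", torch_dtype=torch.bfloat16, device_map="auto"
)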
Pick any image; this code will give different outputs depending on the version of Transformers.
It gives correct outputs with version 4.40, and outputs that seem unrelated to the image with newer versions.
This is most likely because of the new caching strategy implemented in Transformers, which modeling_idefics2.py doesn't take into account.
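A quick way to test that hypothesis (a diagnostic sketch continuing from the repro script above, not a fix): disable the KV cache during generation. If the cache refactor is the culprit, this should restore the 4.40-style outputs, at the cost of much slower decoding.

import transformers
print(transformers.__version__)  # 4.43.2 in this report

# Diagnostic: bypass the cache entirely; use_cache is a standard generate()
# flag, but the cache hypothesis itself is unconfirmed.
generated_ids = model.generate(**inputs, max_new_tokens=500, use_cache=False)
print(processor.batch_decode(generated_ids, skip_special_tokens=True))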
Expected behavior
With newer versions, we should get the same output as with version 4.40.
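Until this is fixed, a possible workaround sketch (assuming 4.40.x is indeed the last unaffected release, per the observation above) is to pin transformers below 4.41 and guard evaluation scripts accordingly:

# Workaround sketch (assumption: 4.40.x is the last unaffected release):
#   pip install "transformers<4.41"
import transformers
from packaging import version

assert version.parse(transformers.__version__) < version.parse("4.41.0"), (
    "Idefics2 generations degrade silently on newer transformers; pin to 4.40.x"
)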