
Outputs of Idefics2 are unrelated to the images in the latest versions of Transformers #32271

@HugoLaurencon

Description


System Info

  • transformers version: 4.43.2
  • Platform: Linux-5.15.0-1049-aws-x86_64-with-glibc2.17
  • Python version: 3.8.18
  • Huggingface_hub version: 0.23.2
  • Safetensors version: 0.4.1
  • Accelerate version: 0.26.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.1+cu118 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:

Who can help?

@amyeroberts
@ArthurZucker

Just filing an official issue so that other users can see it.
Some people didn't understand why they were getting much lower scores with Idefics2: it's a silent bug, and the model still produces generations related to the text prompt, so at first it's hard to notice anything is wrong.
I also flagged it prominently in the model cards, e.g. https://huggingface.co/HuggingFaceM4/idefics2-8b

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq


processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b", torch_dtype=torch.bfloat16, device_map="auto", attn_implementation="flash_attention_2"
)


# Create inputs
path_image = "/fsx/hugo/wow_images/cv_0.png"
image = Image.open(path_image)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What's written on this image?"},
        ]
    },
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}


# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)

Pick any image; this code will give different outputs depending on the version of Transformers.
It gives correct outputs with version 4.40, and outputs that seem unrelated to the image with newer versions.
This is almost certainly because of the new caching strategy implemented in Transformers, which modeling_idefics2.py doesn't take into account.
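For anyone scripting evaluations across environments, a version guard can flag the affected range up front. This is a minimal sketch under the assumption that every release newer than the 4.40 line is affected; the report only confirms correct behavior on 4.40 and incorrect behavior on 4.43.2, and `is_affected` is a hypothetical helper, not part of Transformers:

```python
# Hypothetical guard (not part of Transformers): flag versions where this
# Idefics2 regression may be present. Assumes "major.minor[.patch]" version
# strings and that any release newer than the 4.40 line is affected.

def is_affected(version: str) -> bool:
    """Return True for Transformers releases newer than the 4.40 line."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) > (4, 40)

print(is_affected("4.40.0"))  # the line reported to work correctly
print(is_affected("4.43.2"))  # the version from the system info above
```

If the guard fires, pinning back to a 4.40 release should restore the correct generations per this report, until a fix lands in the Idefics2 modeling code.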

Expected behavior

With the newer versions, we should get the same output as with version 4.40.
