Unable to export Phi-4-multimodal-instruct using automatic-speech-recognition task #1595

@Fede2782

Description

I tried multiple times to export the "Phi-4-multimodal-instruct" model to OpenVINO IR without success. My goal is to use "optimum-intel" to quantize the model in order to be able to run it on Intel NPU. According to this OpenVINO Documentation page, I should be able to use the "Phi-4-multimodal-instruct" on my Intel Core Ultra NPU.

[Screenshot: OpenVINO documentation page listing Phi-4-multimodal-instruct as supported]

I tried following the official instructions for preparing models for NPU execution (https://docs.openvino.ai/2025/openvino-workflow-generative/inference-with-genai/inference-with-genai-on-npu.html), but they didn't work. I did use the INT4-CW quantization method, as those instructions require.

I tried a clean Python 3.12 environment with "optimum-intel" installed following the instructions on the PyPI page, as well as the OpenVINO phi-4-multimodal notebook, but neither produced a valid output.

I am now using a custom environment, which allowed me to produce a valid export using the "image-text-to-text" task, yet the "automatic-speech-recognition" task still fails. Here are the packages I'm using in a Python 3.12 venv:

requirements.txt
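For reference, the environment was set up roughly as follows. The package pins are in the attached requirements.txt; the paths assume Windows, since the error trace below comes from a Windows machine.

```shell
# Create and activate a clean Python 3.12 virtual environment (Windows paths assumed).
python -m venv .venv
.venv\Scripts\activate

# Install the pinned packages from the attached requirements file.
pip install -r requirements.txt
```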

Downloading the model first allowed me to convert it successfully by specifying the task "image-text-to-text": optimum-cli export openvino -m Phi-4-multimodal-instruct --weight-format int4 --sym --ratio 1.0 --group-size -1 --trust-remote-code --task image-text-to-text Phi-4-multimodal-instruct. Because I was using transformers>=4.50, I had to apply a patch as officially specified here. The resulting model worked fine with the VLMPipeline and LLMPipeline classes of OpenVINO GenAI.
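For comparison, here are the two invocations side by side: the one that worked and the one that fails. The flags are identical apart from --task; the second output directory name is just an example I chose, not something from the documentation.

```shell
# Works: export for the image-text-to-text task (INT4-CW, NPU-oriented flags).
optimum-cli export openvino -m Phi-4-multimodal-instruct \
    --weight-format int4 --sym --ratio 1.0 --group-size -1 \
    --trust-remote-code --task image-text-to-text Phi-4-multimodal-instruct

# Fails: same flags, automatic-speech-recognition task (also the inferred default).
# Output directory name "Phi-4-multimodal-instruct-asr" is arbitrary.
optimum-cli export openvino -m Phi-4-multimodal-instruct \
    --weight-format int4 --sym --ratio 1.0 --group-size -1 \
    --trust-remote-code --task automatic-speech-recognition Phi-4-multimodal-instruct-asr
```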

Here comes the issue

According to optimum-intel, the "automatic-speech-recognition" task should also be supported for this model (ValueError: Asked to export a phi4mm model for the task text-to-text, but the Optimum OpenVINO exporter only supports the tasks image-text-to-text, automatic-speech-recognition for phi4mm. Please use a supported task), but I couldn't get it working. The inferred default task is "automatic-speech-recognition", so exporting the model with Optimum without any task parameter fails the same way. Every attempt ended with the following error:

RuntimeError: Exception from src\inference\src\cpp\core.cpp:97: Check 'util::directory_exists(path) || util::file_exists(path)' failed at src\frontends\common\src\frontend.cpp:117: FrontEnd API failed with GeneralFailure: ir: Could not open the file: "C:\Users\<username>\AppData\Local\Temp\tmp7p952plq\openvino_encoder_model.xml"
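The failure suggests that the export step never actually wrote the encoder submodel, so the subsequent load from the temp directory fails. A minimal sketch of the sanity check I used after exporting, assuming the output directory name from my command above; the file name comes straight from the error message, and any other submodel file names are unknown to me, so only the encoder IR is checked:

```python
from pathlib import Path


def check_export(output_dir: str) -> list[str]:
    """Return the expected OpenVINO IR files missing from an export directory."""
    # openvino_encoder_model.xml is the file the RuntimeError above complains about;
    # the other phi4mm submodel names are not listed here because I don't know them.
    expected = ["openvino_encoder_model.xml"]
    out = Path(output_dir)
    return [name for name in expected if not (out / name).exists()]


# Example: pointing this at the export output directory shows whether
# the encoder IR was ever produced.
missing = check_export("Phi-4-multimodal-instruct")
```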

I also tried removing all the NPU-specific parameters, leaving only the OpenVINO export config, but the result was the same, so I decided to open this issue.
