Unable to export Phi-4-multimodal-instruct using automatic-speech-recognition task #1595

@Fede2782

Description

I tried multiple times to export the "Phi-4-multimodal-instruct" model to OpenVINO IR without success. My goal is to use "optimum-intel" to quantize the model in order to be able to run it on Intel NPU. According to this OpenVINO Documentation page, I should be able to use the "Phi-4-multimodal-instruct" on my Intel Core Ultra NPU.

[Screenshot: OpenVINO documentation page listing Phi-4-multimodal-instruct as supported]

I tried following the official instructions for preparing models for NPU execution (https://docs.openvino.ai/2025/openvino-workflow-generative/inference-with-genai/inference-with-genai-on-npu.html), but they didn't work. I did use the INT4-CW quantization method, as those instructions require.

I tried a clean Python 3.12 environment with "optimum-intel" installed following the instructions on the PyPI page, as well as the OpenVINO phi-4-multimodal notebook, but neither produced a valid output.

I am now using a custom environment, which allowed me to produce a valid export using the "image-text-to-text" task, yet the "automatic-speech-recognition" task still fails. Here are the packages I'm using in a Python 3.12 venv:

requirements.txt
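For reference, the environment was set up roughly as follows. The package pins are in the attached requirements.txt; the paths assume Windows, since the error trace below comes from a Windows machine.

```shell
# Create and activate a clean Python 3.12 virtual environment (Windows paths assumed).
python -m venv .venv
.venv\Scripts\activate

# Install the pinned packages from the attached requirements file.
pip install -r requirements.txt
```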

Downloading the model first allowed me to convert it successfully by specifying the task "image-text-to-text": optimum-cli export openvino -m Phi-4-multimodal-instruct --weight-format int4 --sym --ratio 1.0 --group-size -1 --trust-remote-code --task image-text-to-text Phi-4-multimodal-instruct. Because I was using transformers>=4.50, I had to apply a patch as officially specified here. The resulting model worked fine with the VLMPipeline and LLMPipeline classes of OpenVINO GenAI.
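For comparison, here are the two invocations side by side: the one that worked and the one that fails. The flags are identical apart from --task; the second output directory name is just an example I chose, not something from the documentation.

```shell
# Works: export for the image-text-to-text task (INT4-CW, NPU-oriented flags).
optimum-cli export openvino -m Phi-4-multimodal-instruct \
    --weight-format int4 --sym --ratio 1.0 --group-size -1 \
    --trust-remote-code --task image-text-to-text Phi-4-multimodal-instruct

# Fails: same flags, automatic-speech-recognition task (also the inferred default).
# Output directory name "Phi-4-multimodal-instruct-asr" is arbitrary.
optimum-cli export openvino -m Phi-4-multimodal-instruct \
    --weight-format int4 --sym --ratio 1.0 --group-size -1 \
    --trust-remote-code --task automatic-speech-recognition Phi-4-multimodal-instruct-asr
```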

Here comes the issue

According to optimum-intel, the "automatic-speech-recognition" task should also be supported for this model (ValueError: Asked to export a phi4mm model for the task text-to-text, but the Optimum OpenVINO exporter only supports the tasks image-text-to-text, automatic-speech-recognition for phi4mm. Please use a supported task), but I couldn't get it working. The inferred default task is "automatic-speech-recognition", so exporting the model with Optimum without any task parameter fails the same way. Every attempt ended with the following error:

RuntimeError: Exception from src\inference\src\cpp\core.cpp:97: Check 'util::directory_exists(path) || util::file_exists(path)' failed at src\frontends\common\src\frontend.cpp:117: FrontEnd API failed with GeneralFailure: ir: Could not open the file: "C:\Users\<username>\AppData\Local\Temp\tmp7p952plq\openvino_encoder_model.xml"
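The failure suggests that the export step never actually wrote the encoder submodel, so the subsequent load from the temp directory fails. A minimal sketch of the sanity check I used after exporting, assuming the output directory name from my command above; the file name comes straight from the error message, and any other submodel file names are unknown to me, so only the encoder IR is checked:

```python
from pathlib import Path


def check_export(output_dir: str) -> list[str]:
    """Return the expected OpenVINO IR files missing from an export directory."""
    # openvino_encoder_model.xml is the file the RuntimeError above complains about;
    # the other phi4mm submodel names are not listed here because I don't know them.
    expected = ["openvino_encoder_model.xml"]
    out = Path(output_dir)
    return [name for name in expected if not (out / name).exists()]


# Example: pointing this at the export output directory shows whether
# the encoder IR was ever produced.
missing = check_export("Phi-4-multimodal-instruct")
```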

I also tried removing all the NPU-specific parameters, leaving only the OpenVINO export config, but the result was the same, so I decided to open this issue.
