Fix GLM-4.5V-FP8 numerical issue #22949
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Code Review
This pull request fixes a numerical issue with the GLM-4.5V-FP8 model by correcting the prefix for the QKV projection layer during quantization. The change is conditional on the presence of a quantization configuration, which correctly handles the pre-quantized FP8 model. However, this approach is too broad and could unintentionally affect on-the-fly quantization for other models. I've provided a suggestion to make the condition more specific to FP8 models to prevent potential regressions.
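For illustration, a minimal sketch of what the suggested narrower condition could look like, assuming the quantization config object exposes `get_name()` as vLLM quantization configs do; `remap_qkv_prefix` is a hypothetical helper, not the literal suggested diff:

```python
def remap_qkv_prefix(attn_prefix: str, quant_config) -> str:
    """Remap the vision QKV prefix only for FP8-style checkpoints."""
    # Note: the discussion below points out that GLM-4.5V-FP8 actually
    # reports quant_method == "compressed-tensors", so which method names
    # to gate on is exactly what this thread debates.
    is_fp8_checkpoint = (
        quant_config is not None and quant_config.get_name() == "fp8"
    )
    return f"{attn_prefix}.qkv_proj" if is_fp8_checkpoint else f"{attn_prefix}.qkv"
```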
@vladmihailescu has imported this pull request. If you are a Meta employee, you can view this in D80315216.
any update?
Force-pushed from df736bd to bbfc684.
Force-pushed from bbfc684 to 7f49211.
Oh, I just noticed we don't have a GLM-4.5V example in examples/offline_inference/vision_language.py or examples/offline_inference/vision_language_multi_image.py. Let's consolidate this example into those two.
Done. Updated examples/offline_inference/vision_language.py and examples/offline_inference/vision_language_multi_image.py to include GLM-4.5V / GLM-4.5V-FP8, and removed the additional script.
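For reference, a minimal standalone sketch of the kind of offline-inference usage those examples cover. The image URL, context length, and parallelism settings are illustrative placeholders; the real example files use the repository's shared helpers:

```python
from vllm import LLM, SamplingParams

# Placeholder image URL for illustration only.
IMAGE_URL = "https://example.com/cherry_blossom.jpg"

llm = LLM(
    model="zai-org/GLM-4.5V-FP8",   # or "zai-org/GLM-4.5V" for BF16
    max_model_len=8192,
    limit_mm_per_prompt={"image": 1},
    tensor_parallel_size=8,         # adjust to your hardware; the model is large
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": IMAGE_URL}},
        {"type": "text", "text": "Describe this image."},
    ],
}]

outputs = llm.chat(messages, SamplingParams(temperature=0.0, max_tokens=128))
print(outputs[0].outputs[0].text)
```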
Can we also check quant_config's quantize_method to see whether it's FP8 quantization? It could be another type of quantization.
I think the issue we are fixing is the mismatch between the quant_config["ignore"] layer names and the runtime layer names, and it is not tied to a specific type of quantization (e.g. in https://huggingface.co/zai-org/GLM-4.5V-FP8/blob/main/config.json, quant_config["quant_method"] == "compressed-tensors").
If Z.ai were to publish a new checkpoint with a different quant_method, I would assume quant_config["ignore"] might still be the same as it is right now, and we would still need this fix? cc @zRzRzRzRzRzRzR
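For context, a quick sketch (assuming huggingface_hub is available) that inspects the published config referenced above and prints the quant_method and the ignore entries in question:

```python
import json

from huggingface_hub import hf_hub_download

# Fetch the published GLM-4.5V-FP8 config linked in the comment above.
cfg_path = hf_hub_download("zai-org/GLM-4.5V-FP8", "config.json")
with open(cfg_path) as f:
    quant_cfg = json.load(f)["quantization_config"]

print(quant_cfg["quant_method"])  # "compressed-tensors"
# The ignore entries use the "attn.qkv_proj" naming that this PR remaps to.
print([name for name in quant_cfg["ignore"] if "attn.qkv" in name][:3])
```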
houseroad left a comment
Thanks for the fix, please address the comments.
Signed-off-by: qizixi <qizixi@meta.com>
Force-pushed from 7f49211 to fdc3038.
Signed-off-by: qizixi <qizixi@meta.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Signed-off-by: qizixi <qizixi@meta.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Signed-off-by: Duncan Moss <djm.moss@gmail.com>
Signed-off-by: qizixi <qizixi@meta.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Signed-off-by: Xiao Yu <xiao.yu@amd.com>
Purpose
Fix a numerical regression when running GLM-4.5V-FP8 with vLLM. The issue is that, for the vision attention QKV projection, the model config wants to turn off quantization by adding "visual.blocks.1.attn.qkv_proj" to the ignore list in https://huggingface.co/zai-org/GLM-4.5V-FP8/blob/main/config.json. However, at runtime the prefix is actually "visual.blocks.1.attn.qkv", so the ignore entries never match and the layers get quantized anyway.
This PR applies a short-term fix that changes the prefix so that quantization is correctly turned off for those layers.
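A hedged sketch of the shape of the change, assuming the vision attention module builds its QKV layer via vLLM's QKVParallelLinear; the function and parameter names here are illustrative rather than the exact merged diff:

```python
from vllm.model_executor.layers.linear import QKVParallelLinear


def build_vision_qkv(prefix: str, embed_dim: int, num_heads: int,
                     head_dim: int, quant_config=None) -> QKVParallelLinear:
    # The GLM-4.5V-FP8 checkpoint ignores "visual.blocks.N.attn.qkv_proj",
    # but the module was registered under "...attn.qkv". Remap the prefix
    # when a quantization config is present so the ignore list matches and
    # these layers stay unquantized.
    qkv_prefix = f"{prefix}.qkv_proj" if quant_config is not None else f"{prefix}.qkv"
    return QKVParallelLinear(
        hidden_size=embed_dim,
        head_size=head_dim,
        total_num_heads=num_heads,
        total_num_kv_heads=num_heads,
        bias=True,
        quant_config=quant_config,
        prefix=qkv_prefix,
    )
```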
Test Plan
Test Result
Remaining Issue
Output quality for FP8 text-only still seems lower than the BF16 version, but it is not garbage like the vision output was.
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.