[Mxfp4] Add a way to save with a quantization method #40176
ArthurZucker merged 36 commits into main

Conversation
run-slow: mxfp4

This comment contains run-slow, running the specified jobs: models: []
Review comment on a leftover debug print:

```python
    triton_weight_tensor.storage.data, requires_grad=False
)
print("New module: ", list(module.state_dict().items()))
```

yes, not ready yet 😉
run-slow: gpt_oss, mxfp4

This comment contains run-slow, running the specified jobs: models: ['models/gpt_oss']
Review comment on `quantize_to_mxfp4`, where the `swizzle_mxfp4(w, w_scale)` call is dropped in favor of the pure-torch downcast:

```python
def quantize_to_mxfp4(w, triton_kernels_hub):
    downcast_to_mxfp_torch = triton_kernels_hub.numerics_details.mxfp.downcast_to_mxfp_torch
    w, w_scale = downcast_to_mxfp_torch(w.to(torch.bfloat16), torch.uint8, axis=1)
```

- we need the torch version
- swizzle is done at loading time already, so duplicating it fails
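To make the split concrete, here is a minimal sketch assuming the helper names visible in the diff (`downcast_to_mxfp_torch`, `swizzle_mxfp4`); the load-side wrapper is illustrative, not the PR's actual code:

```python
import torch

def quantize_to_mxfp4(w, triton_kernels_hub):
    # Save path: pure-torch downcast to mxfp4, deliberately without
    # swizzling (per the review bullets: torch version, swizzle at load).
    downcast_to_mxfp_torch = triton_kernels_hub.numerics_details.mxfp.downcast_to_mxfp_torch
    w, w_scale = downcast_to_mxfp_torch(w.to(torch.bfloat16), torch.uint8, axis=1)
    return w, w_scale

def prepare_for_kernels(w, w_scale):
    # Load path (hypothetical wrapper): swizzle exactly once so the triton
    # kernels can consume the tensors; swizzling already-swizzled data fails.
    return swizzle_mxfp4(w, w_scale)
```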
SunMarc left a comment

Thanks for adding this! This looks quite good. I was thinking it would be better if we could do the following instead of allowing users to quantize the model in save_pretrained, since that would add more complexity:

```python
model = GptOssForCausalLM.from_pretrained(
    model_name,
    quantization_config=Mxfp4Config(swizzle=False),
)
model.save_pretrained(...)
```

If the user didn't set swizzle=False when quantizing the model for saving, we can just raise an error for that. WDYT?
BTW, right now if a user tries to quantize the model in the following way, we can't use it at all, as the weights are not swizzled.
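The snippet referenced by "the following way" did not survive extraction; below is a hedged reconstruction based on the surrounding discussion (on-the-fly quantization of an unquantized checkpoint, which per the diff above produces unswizzled weights; `bf16_model_name` is illustrative):

```python
from transformers import GptOssForCausalLM, Mxfp4Config

# Best guess at the lost snippet: quantize an unquantized checkpoint on
# the fly. Since quantize_to_mxfp4 does not swizzle, the resulting
# weights cannot be consumed by the triton kernels.
model = GptOssForCausalLM.from_pretrained(
    bf16_model_name,                    # illustrative: an unquantized checkpoint
    quantization_config=Mxfp4Config(),  # quantize on the fly
)
```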
ArthurZucker left a comment

As discussed offline, we really need a way to save_pretrained without having to use this swizzle setting; let's think about how to cover all cases and simplify, please!
ArthurZucker left a comment

LGTM, thanks for iterating!
[For maintainers] Suggested jobs to run (before merge): run-slow: gpt_oss, mxfp4
What does this PR do?

Allows saving gpt_oss after it has been trained. You can also save an mxfp4 model.
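For reference, a minimal usage sketch of the enabled flow, assuming `Mxfp4Config(dequantize=True)` from the existing mxfp4 integration (the model id and output path are illustrative):

```python
from transformers import GptOssForCausalLM, Mxfp4Config

# Load gpt_oss dequantized to bf16 so it can be fine-tuned.
model = GptOssForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    quantization_config=Mxfp4Config(dequantize=True),
)

# ... fine-tune the model ...

# With this PR, the trained model can be written back to disk.
model.save_pretrained("gpt-oss-finetuned")
```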