
Conversation

@frank-wei (Contributor) commented Sep 3, 2025

Summary:
On GB200, the MoE MXFP4 weight transpose takes quite a long time when the gpt-oss model is loaded. Add a cache for the weight-transpose (permute) indices so that the per-expert weight transpose time is reduced.
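For context, a minimal sketch of the caching idea, assuming hypothetical names (`_PERMUTE_INDEX_CACHE`, `get_permute_indices`, `compute_fn`) rather than the actual identifiers in this PR: the expensive permutation-index computation runs once per distinct weight shape and device, and every expert with that shape reuses the cached indices, so the cost no longer scales with the number of experts.

```
import torch

# Hypothetical module-level cache keyed by (shape, device); the names here
# are illustrative only, not the identifiers used in this PR.
_PERMUTE_INDEX_CACHE: dict = {}


def get_permute_indices(shape, device, compute_fn):
    """Return permute indices for a weight of `shape` on `device`, computing
    them at most once per (shape, device) pair and reusing the cached result
    for every expert that shares the same shape."""
    key = (tuple(shape), str(device))
    if key not in _PERMUTE_INDEX_CACHE:
        # Computing the permutation is the slow step; with the cache it runs
        # once per distinct shape instead of once per expert weight tensor.
        _PERMUTE_INDEX_CACHE[key] = compute_fn(shape).to(device)
    return _PERMUTE_INDEX_CACHE[key]


if __name__ == "__main__":
    def compute_fn(shape):
        # Stand-in for the real (expensive) permute-index computation.
        return torch.randperm(shape[0] * shape[1]).reshape(shape)

    w = torch.zeros(4, 8, dtype=torch.uint8)
    for _ in range(32):  # 32 "experts" with identical weight shapes
        idx = get_permute_indices(w.shape, w.device, compute_fn)
        _ = w.view(-1)[idx.view(-1)].view(w.shape)  # apply the permutation
```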

20b:
Before: Model loading took 94 sec

(EngineCore_0 pid=3397977) INFO 09-01 19:27:08 [default_loader.py:267] Loading weights took 2.83 seconds
(EngineCore_0 pid=3397977) INFO 09-01 19:28:41 [gpu_model_runner.py:1977] Model loading took 14.1643 GiB and 94.110470 seconds

After: Model loading took 5.9 sec

(EngineCore_0 pid=3005216) INFO 09-02 16:54:43 [default_loader.py:267] Loading weights took 2.54 seconds
(EngineCore_0 pid=3005216) INFO 09-02 16:54:47 [gpu_model_runner.py:1977] Model loading took 14.1693 GiB and 5.918206 seconds

120b:
Loading time verification:
Before, P1928776629
E2E predictor warm-up takes: 17:28:53 ~ 17:39:59 = 11 min 6 sec

Model loading takes 568.133048 seconds

(EngineCore_0 pid=344869) INFO 09-02 17:29:45 [default_loader.py:267] Loading weights took 8.25 seconds
(EngineCore_0 pid=344869) INFO 09-02 17:39:05 [gpu_model_runner.py:1977] Model loading took 68.7019 GiB and 568.133048 seconds

After, P1928762318
E2E predictor warm-up takes: 17:26:12 ~ 17:28:15 = 2 min 3 sec

Model loading takes 15.083996 seconds

(EngineCore_0 pid=156514) INFO 09-02 17:27:05 [default_loader.py:267] Loading weights took 9.18 seconds
(EngineCore_0 pid=156514) INFO 09-02 17:27:12 [gpu_model_runner.py:1977] Model loading took 68.7093 GiB and 15.083996 seconds

Accuracy verification:

aime25 medium: P1928806083
[{'eval_name': 'aime25', 'model_name': 'gpt-oss-120b-medium_temp1.0_20250902_175112', 'metric': 0.7875}]

aime25 high: P1928898566
[{'eval_name': 'aime25', 'model_name': 'gpt-oss-120b-high_temp1.0_20250902_180141', 'metric': 0.9}]

Test Plan:
Compared the transposed weights before and after the change; they match. [link]
python test_eq.py

import torch

[g1w, g1s, g1b] = torch.load("/tmp/gemm1_wei.pt")
[g1w2, g1s2, g1b2] = torch.load("/tmp/gemm1_wei2.pt")

for i in range(len(g1w)):
    print(i)
    print(torch.equal(g1w[i], g1w2[i]))
    print(torch.equal(g1s[i], g1s2[i]))
    print(torch.equal(g1b[i], g1b2[i]))

[g2w, g2s, g2b] = torch.load("/tmp/gemm2_wei.pt")
[g2w2, g2s2, g2b2] = torch.load("/tmp/gemm2_wei2.pt")

for i in range(len(g2w)):
    print(i)
    print(torch.equal(g2w[i], g2w2[i]))
    print(torch.equal(g2s[i], g2s2[i]))
    print(torch.equal(g2b[i], g2b2[i]))

Rollback Plan:

Differential Revision: D81544286

@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D81544286


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a caching mechanism for permutation indices to accelerate weight loading in MoE layers, which is a valuable optimization that demonstrates significant performance gains. The overall implementation is solid, but I've identified a critical bug where an incorrect device is used for a tensor operation, which could lead to runtime errors or incorrect behavior.

Comment on lines 424 to 411
critical

There appears to be a typo on line 426. The device should be w2_weight_scale.device instead of w13_weight_scale.device. While this might work if both tensors are on the same device, it is safer and more correct to use the device of the tensor being processed to avoid potential runtime errors or incorrect behavior.

Suggested change

Before:
    w2_weight_scale[i]
    .view(torch.uint8)[
        permute_sf_indices.to(w13_weight_scale.device)
    ]
    .contiguous()

After:
    w2_weight_scale[i]
    .view(torch.uint8)[
        permute_sf_indices.to(w2_weight_scale.device)
    ]
    .contiguous()

@mergify

mergify bot commented Sep 4, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @frank-wei.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


yeqcharlotte pushed a commit to yeqcharlotte/vllm that referenced this pull request Sep 6, 2025
@22quinn added the performance, quantization, moe, and gpt-oss labels Sep 9, 2025
@yeqcharlotte (Collaborator) commented Sep 9, 2025

thanks for the change! this is huge!

could you update your PR title and replace internal pastebin with gist? :)

cc: @houseroad @mgoin @LucasWilkinson

@frank-wei changed the title from "reduce the weight loading time" to "[Misc] Reduce the gpt-oss model loading time" Sep 9, 2025
@frank-wei (Contributor, Author)

Thanks @yeqcharlotte and @22quinn for the review. I have updated this PR as suggested.

@jwfromm commented Sep 9, 2025

Great fix! Just curious, do we know why this issue is so much more noticeable on GB200 than other GPUs? It seems like this improvement is backend agnostic.

@frank-wei (Contributor, Author)

> Great fix! Just curious, do we know why this issue is so much more noticeable on GB200 than other GPUs? It seems like this improvement is backend agnostic.

@jwfromm, the issue came up while enabling MXFP4 for the MoE weights. AFAIK, only Blackwell supports this format.

@22quinn added the ready label Sep 9, 2025
@22quinn changed the title from "[Misc] Reduce the gpt-oss model loading time" to "[gpt-oss] Cache permute indices for faster MXFP4 MoE layer loading" Sep 10, 2025
@22quinn enabled auto-merge (squash) September 10, 2025 02:09
@22quinn merged commit 0efdb5c into vllm-project:main Sep 10, 2025
46 checks passed
skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025