[Model loader]: support multi-thread model weight loading #23928
Conversation
Code Review
This pull request introduces multi-threaded model weight loading to accelerate model initialization. The implementation uses ThreadPoolExecutor to load safetensor and PyTorch binary files in parallel. While this can provide significant speedups for I/O-bound loading, I've identified a potential high-severity issue with the safetensors implementation. The multi-threaded loader for safetensors changes the loading strategy from a memory-efficient streaming approach (safe_open) to one that loads entire file shards into memory (load_file). This can lead to excessive memory consumption and potential OOM errors for large models, which counteracts one of the key benefits of the safetensors format.
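To make the concern concrete, here is a minimal sketch (not the PR's code, and the shard filename is hypothetical) contrasting the two safetensors loading styles being discussed:

```python
from safetensors import safe_open
from safetensors.torch import load_file

shard = "model-00001-of-00005.safetensors"  # hypothetical shard name

# Streaming via safe_open: the file is memory-mapped and each tensor is
# materialized only when requested, so peak memory stays close to the
# size of a single tensor.
with safe_open(shard, framework="pt", device="cpu") as f:
    for name in f.keys():
        tensor = f.get_tensor(name)
        # ... copy `tensor` into the model, then let it be freed ...

# Eager via load_file: every tensor in the shard is materialized into a
# dict up front, so peak memory grows to roughly the full shard size.
state_dict = load_file(shard, device="cpu")
```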
simon-mo
left a comment
Good enhancement! Please:
- Take a look at the Gemini comment.
- Is there a downside to doing this by default?
- It seems the default loader mmaps the weights file, so there won't be excessive RAM usage, whereas this approach first copies the weights to CPU? I think this is indeed a behavior change and can cause thrashing (the model weights file might already be in the disk page cache).
Thanks for the review. I will add further tests to observe the impact on RAM, as well as the status of the weight files in the page cache.
@BraveY do you clear the Linux page cache in your testing? I'm wondering if the numbers you get are from reading from the page cache vs the NVMe drives, and whether multi-threading helps in both cases.
Yes, I dropped the page cache before starting the tests. The test results are both from the NVMe drives. In scenarios where data is read directly from disk, multi-threading can significantly improve disk bandwidth, thereby reducing model loading time. With the weights already in the page cache, it can still slightly reduce loading time.
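For anyone reproducing this, a minimal sketch of dropping the Linux page cache between runs (standard Linux procfs interface, not part of this PR; requires root):

```python
import subprocess

def drop_page_cache() -> None:
    # Flush dirty pages to disk, then ask the kernel to drop the clean
    # page cache, dentries, and inodes
    # (equivalent to `echo 3 > /proc/sys/vm/drop_caches`).
    subprocess.run(["sync"], check=True)
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")

if __name__ == "__main__":
    drop_page_cache()
```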
This pull request has merge conflicts that must be resolved before it can be merged.
…ct#23928) Signed-off-by: Yang Kaiyong <yangkaiyong.yky@antgroup.com> Signed-off-by: Simon Mo <simon.mo@hey.com> Co-authored-by: Simon Mo <simon.mo@hey.com>
This PR has broken the test_tensorizer tests (e.g. https://buildkite.com/vllm/ci/builds/29967/steps/canvas?sid=01992ee2-d80a-4abb-83bd-c2f14f10f811).
I will fix this issue immediately.
Already fixed by #24545.
I'm a bit confused why the Tensorizer test passed in the original CI run for this PR...
Upon closer inspection, the old code was executed, which resulted in a passing test case. If you look at the test logs for the Tensorizer Test (https://buildkite.com/vllm/ci/builds/29824/steps/canvas?sid=01992a3e-d822-4dde-979b-f0bed2541d69#01992a3e-d902-4cdf-89c9-a4a5a055df90/165-329), they indicate that the code actually executed was still the old version. So it seems there are certain paths in Python that are not getting recompiled in the current CI model.
@BraveY May I ask how to run the feature?
Full command.
Yes, if you want to test without the weights in the page cache.
Purpose
This pull request introduces support for multi-threaded model weight loading in the `DefaultModelLoader`, allowing faster initialization of models by parallelizing the loading of safetensor and pt/bin weight files. It also adds configuration options to control threading behavior and validates extra config keys. The code was adapted from [SGLang's PR](sgl-project/sglang#7277).
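A minimal sketch of the general pattern, assuming the iterator name described in this PR; the real implementation's file grouping, error handling, and the pt/bin variant are simplified away:

```python
from collections.abc import Generator
from concurrent.futures import ThreadPoolExecutor

import torch
from safetensors.torch import load_file


def multi_thread_safetensors_weights_iterator(
    hf_weights_files: list[str],
    max_workers: int = 8,
) -> Generator[tuple[str, torch.Tensor], None, None]:
    """Load safetensors shards in parallel threads and yield (name, tensor) pairs."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Each worker eagerly loads one whole shard into CPU memory.
        futures = [
            executor.submit(load_file, f, device="cpu") for f in hf_weights_files
        ]
        # Consume shard by shard; later shards keep loading in the
        # background while earlier tensors are copied into the model.
        for future in futures:
            yield from future.result().items()
```

Note that this parallel path materializes whole shards with `load_file`, which is exactly the memory trade-off raised in the review comments above.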
Multi-threaded weight loading support
Added `multi_thread_safetensors_weights_iterator` and `multi_thread_pt_weights_iterator` functions to `weight_utils.py`, enabling parallel loading of safetensor and pt/bin files using Python's `ThreadPoolExecutor`. [1] [2]
Updated `DefaultModelLoader` to use these new iterators when the `enable_multithread_load` extra config is set, with a configurable `num_threads` (default 8). [1] [2] [3] [4]
Configuration validation and error handling
Extra config keys passed to the `DefaultModelLoader` are now validated, and an error is raised if unexpected keys are provided, to avoid misconfiguration. Only `enable_multithread_load` and `num_threads` are allowed.
Dependency updates
Imported `concurrent.futures` in `weight_utils.py` to support multi-threading for weight loading.
Test Plan
I tested the default loading logic and the multi-threaded loading logic on a single NVMe SSD.
The loaded model is QwQ-32B, and the model size is 62 GB.
Test cmd:
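As a hedged illustration only (not the author's exact command), enabling the feature through the offline Python API might look like the following, assuming the `model_loader_extra_config` engine argument accepts the keys added by this PR and using a hypothetical local model path:

```python
from vllm import LLM

llm = LLM(
    model="/models/QwQ-32B",  # hypothetical local path to the benchmarked model
    model_loader_extra_config={
        "enable_multithread_load": True,
        "num_threads": 8,
    },
)
```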
Test Result
Single NVMe SSD:
Test results of DeepSeek-R1-0528 safetensor loading on an NVMe SSD (PCIe Gen4, Speed 16 GT/s, Width x4) can be found in this PR. This acceleration effect should also apply to vLLM, because the loading logic of SGLang and vLLM is similar.

Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.