
Conversation

@BraveY (Contributor) commented Aug 29, 2025

Purpose

This pull request introduces support for multi-threaded model weight loading in the DefaultModelLoader, allowing faster initialization of models by parallelizing the loading of safetensor and pt/bin weight files. It also adds configuration options to control threading behavior and validates extra config keys.

The code was adapted from Sglang's PR sgl-project/sglang#7277.

Multi-threaded weight loading support

  • Added multi_thread_safetensors_weights_iterator and multi_thread_pt_weights_iterator functions to weight_utils.py, enabling parallel loading of safetensors and pt/bin files using Python's ThreadPoolExecutor.

  • Updated DefaultModelLoader to use these new iterators when the enable_multithread_load extra config is set, with a configurable num_threads (default 8); see the sketch after this list.
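
For illustration, here is a minimal sketch of the safetensors pattern, assuming safetensors.torch.load_file and in-order consumption of futures (it mirrors, but is not, the exact code merged here):

import concurrent.futures
from collections.abc import Generator

import torch
from safetensors.torch import load_file

def multi_thread_safetensors_weights_iterator(
    hf_weights_files: list[str],
    num_threads: int = 8,
) -> Generator[tuple[str, torch.Tensor], None, None]:
    # Submit every shard to the pool up front; load_file reads a whole
    # .safetensors shard into a dict of CPU tensors.
    with concurrent.futures.ThreadPoolExecutor(max_workers=num_threads) as executor:
        futures = [executor.submit(load_file, f) for f in hf_weights_files]
        for future in futures:
            # Blocks on this shard while the remaining shards keep
            # loading on the pool's other threads.
            yield from future.result().items()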

Configuration validation and error handling

  • Added validation for extra config keys in DefaultModelLoader, raising an error if unexpected keys are provided, to avoid misconfiguration. Only enable_multithread_load and num_threads are allowed; a sketch of this check follows.
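
A minimal sketch of what such a check can look like (the allowed key names come from this PR; the helper name itself is hypothetical):

def _validate_extra_config(extra_config: dict) -> None:
    # Reject unknown keys early so a typo such as "num_thread" fails
    # loudly instead of being silently ignored.
    allowed = {"enable_multithread_load", "num_threads"}
    unexpected = extra_config.keys() - allowed
    if unexpected:
        raise ValueError(
            f"Unexpected extra config keys: {sorted(unexpected)}. "
            f"Allowed keys: {sorted(allowed)}")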

Dependency updates

  • Imported concurrent.futures in weight_utils.py to support multi-threading for weight loading.
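
This import powers both iterators; for completeness, a sketch of the pt/bin variant under the same assumptions (again, not the exact merged code):

import concurrent.futures

import torch

def multi_thread_pt_weights_iterator(hf_weights_files: list[str],
                                     num_threads: int = 8):
    def _load(path: str):
        # map_location="cpu" keeps tensors off the GPU while loading;
        # weights_only=True avoids unpickling arbitrary objects.
        return torch.load(path, map_location="cpu", weights_only=True)

    with concurrent.futures.ThreadPoolExecutor(max_workers=num_threads) as executor:
        futures = [executor.submit(_load, f) for f in hf_weights_files]
        for future in futures:
            yield from future.result().items()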

Test Plan

I tested the default loading logic and the multi-threaded loading logic on a single NVMe SSD.
The model is QwQ-32B, which is 62 GB on disk.
Test cmd:

python -m vllm.entrypoints.openai.api_server \
    --model=/data/model/QwQ-32B \
    --trust-remote-code \
    --served-model-name auto \
    --distributed-executor-backend=mp \
    --port 8006 \
    -tp=8 \
    --enforce-eager \
    --max-num-seqs 192 \
    --max-model-len 32768 \
    --max-num-batched-tokens 32768 \
    --block-size 128 \
    --gpu-memory-utilization 0.9 \
    --enable-prompt-tokens-details \
    --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'

Test Result

Single NVMe SSD:

Model Loader                              Load Time (s)   Average Bandwidth (GB/s)
Multi-thread loading disabled (before)    20              3.1
Multi-thread loading enabled (after)      14              4.42

Weight loading speedup: 1.43x.
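
As a sanity check, the bandwidth column is consistent with the model size: 62 GB / 20 s ≈ 3.1 GB/s and 62 GB / 14 s ≈ 4.4 GB/s.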

Test results for DeepSeek-R1-0528 safetensors loading on an NVMe SSD over PCIe Gen4 (16 GT/s, x4 width) can be found in the Sglang PR linked above. The same acceleration should apply to vLLM, because the loading logic of Sglang and vLLM is similar.

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@gemini-code-assist bot left a comment


Code Review

This pull request introduces multi-threaded model weight loading to accelerate model initialization. The implementation uses ThreadPoolExecutor to load safetensor and PyTorch binary files in parallel. While this can provide significant speedups for I/O-bound loading, I've identified a potential high-severity issue with the safetensors implementation. The multi-threaded loader for safetensors changes the loading strategy from a memory-efficient streaming approach (safe_open) to one that loads entire file shards into memory (load_file). This can lead to excessive memory consumption and potential OOM errors for large models, which counteracts one of the key benefits of the safetensors format.
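
For reference, the memory-efficient streaming pattern this review refers to looks roughly like the following sketch of the safe_open API (not vLLM's exact code):

from safetensors import safe_open

def streaming_safetensors_weights_iterator(hf_weights_files: list[str]):
    # safe_open mmaps each shard and materializes one tensor at a time,
    # instead of loading the entire file into memory as load_file does.
    for file in hf_weights_files:
        with safe_open(file, framework="pt", device="cpu") as f:
            for name in f.keys():
                yield name, f.get_tensor(name)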

@simon-mo (Collaborator) left a comment


Good enhancement! Please:

  • Take a look at the Gemini comment.
  • Is there a downside to doing this by default?
  • It seems that while the default loader mmaps the weights file (so there won't be excessive RAM usage), this approach first copies the weights to CPU? I think this is indeed a behavior change and can cause thrashing (the model weights file might already be in the disk page cache).

@BraveY (Contributor, Author) commented Sep 1, 2025

Good enhancement! Please:

  • Take a look at the Gemini comment.
  • Is there a downside to doing this by default?
  • It seems that while the default loader mmaps the weights file (so there won't be excessive RAM usage), this approach first copies the weights to CPU? I think this is indeed a behavior change and can cause thrashing (the model weights file might already be in the disk page cache).

Thanks for the review. I will add further tests to observe the impact on RAM, as well as the status of the weight files in the page cache.

@BraveY (Contributor, Author) commented Sep 1, 2025

Good enhancement! Please:

  • Take a look at the Gemini comment.
  • Is there a downside to doing this by default?
  • It seems that while the default loader mmaps the weights file (so there won't be excessive RAM usage), this approach first copies the weights to CPU? I think this is indeed a behavior change and can cause thrashing (the model weights file might already be in the disk page cache).

  • load_file still relies on safe_open, which uses mmap under the hood, just like the default loader.
  • In my tests loading the QwQ-32B model from the page cache, memory usage peaked at around 58 GB, consistent with the behavior of the default loader. No additional memory overhead was observed.
  • I haven't identified any negative side effects so far. Tests show that loading weights from the page cache with multi-threading is still slightly faster than the default read path, by about 0.2 seconds for the QwQ-32B model.

@BraveY force-pushed the comm/multi-thread-loading branch from 3a0d348 to 7b82800 on September 2, 2025 02:51
@lionelvillard (Contributor) commented:

@BraveY, do you clear the Linux page cache in your testing? I'm wondering whether the numbers you get are from reading from the page cache vs. the NVMe drives, and whether multi-threading helps in both cases.

@BraveY commented Sep 3, 2025

@BraveY, do you clear the Linux page cache in your testing? I'm wondering whether the numbers you get are from reading from the page cache vs. the NVMe drives, and whether multi-threading helps in both cases.

Yes, I dropped the page cache before starting the tests, so both results are from the NVMe drives. When reading directly from disk, multi-threading significantly improves disk bandwidth (a single reader thread usually cannot keep enough requests in flight to saturate a modern NVMe drive), thereby reducing model loading time. When the weights are already in the page cache, it slightly reduces loading time.

@BraveY force-pushed the comm/multi-thread-loading branch 2 times, most recently from 39f02f9 to 82d780d on September 4, 2025 04:28
@BraveY force-pushed the comm/multi-thread-loading branch from 82d780d to 3dd6f55 on September 4, 2025 06:00
@simon-mo added the ready label Sep 4, 2025
@mergify bot commented Sep 8, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @BraveY.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify bot added the needs-rebase label Sep 8, 2025
@simon-mo requested a review from 22quinn as a code owner September 8, 2025 16:52
@simon-mo enabled auto-merge (squash) September 8, 2025 16:52
@mergify bot removed the needs-rebase label Sep 8, 2025
@simon-mo merged commit 43d9ad0 into vllm-project:main Sep 8, 2025
38 checks passed
eicherseiji pushed a commit to eicherseiji/vllm that referenced this pull request Sep 9, 2025
@pwschuurman (Contributor) commented:

This PR has broken the test_tensorizer tests (e.g. https://buildkite.com/vllm/ci/builds/29967/steps/canvas?sid=01992ee2-d80a-4abb-83bd-c2f14f10f811).

@BraveY commented Sep 10, 2025

This PR has broken the test_tensorizer tests (e.g. https://buildkite.com/vllm/ci/builds/29967/steps/canvas?sid=01992ee2-d80a-4abb-83bd-c2f14f10f811).

I will fix this issue immediately.

@BraveY commented Sep 10, 2025

This PR has broken the test_tensorizer tests (e.g. https://buildkite.com/vllm/ci/builds/29967/steps/canvas?sid=01992ee2-d80a-4abb-83bd-c2f14f10f811).

I will fix this issue immediately.

Already fixed by #24545.

@njhill (Member) commented Sep 10, 2025

I'm a bit confused why the Tensorizer test passed in the original CI run for this PR...

@pwschuurman (Contributor) commented Sep 10, 2025

I'm a bit confused why the Tensorizer test passed in the original CI run for this PR...

Upon closer inspection, the old code was executed, which is why the test case passed. See the test logs for the Tensorizer test: https://buildkite.com/vllm/ci/builds/29824/steps/canvas?sid=01992a3e-d822-4dde-979b-f0bed2541d69#01992a3e-d902-4cdf-89c9-a4a5a055df90/165-329

[2025-09-08T16:56:46Z] (EngineCore_0 pid=481) DEBUG 09-08 09:56:46 [base_loader.py:47] Loading weights on cuda ...
[2025-09-08T16:56:46Z] (EngineCore_0 pid=481) INFO 09-08 09:56:46 [weight_utils.py:346] Using model weights format ['*.safetensors', '*.bin', '*.pt']
Loading pt checkpoint shards: 100% 1/1 [00:00<00:00,  1.57it/s]
[2025-09-08T16:56:47Z] (EngineCore_0 pid=481) INFO 09-08 09:56:47 [default_loader.py:237] Loading weights took 0.64 seconds
[2025-09-08T16:56:48Z] (EngineCore_0 pid=481) INFO 09-08 09:56:48 [gpu_model_runner.py:2232] Model loading took 0.2389 GiB and 1.171350 

This indicates that the executed default_loader.py emitted the "Loading weights took" log line from line 237. However, if we look at the diff of #23928, this log message should have been emitted from line 266: https://github.com/vllm-project/vllm/pull/23928/files#diff-ebfc523d6559bb3f30b9c05ae0f912120c22ddf17ae67f2a0a93e356245c0508L237

So it seems there are certain code paths in Python that are not getting recompiled in the current CI setup.
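
(One plausible mechanism, not confirmed here: by default Python validates cached .pyc files against the source file's mtime and size, so if a CI image carries updated sources whose timestamps are unchanged, the stale bytecode in __pycache__ can be reused instead of being recompiled.)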

skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025
@CSWYF3634076 (Contributor) commented:

@BraveY May I ask how to run this feature?

@BraveY commented Sep 19, 2025

@BraveY May I ask how to run this feature?

Add this option: --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'

Full command:

python -m vllm.entrypoints.openai.api_server \
    --model=/data/model/QwQ-32B \
    --trust-remote-code \
    --served-model-name auto \
    --distributed-executor-backend=mp \
    --port 8006 \
    -tp=8 \
    --enforce-eager \
    --max-num-seqs 192 \
    --max-model-len 32768 \
    --max-num-batched-tokens 32768 \
    --block-size 128 \
    --gpu-memory-utilization 0.9 \
    --enable-prompt-tokens-details \
    --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'

@CSWYF3634076 (Contributor) commented:

@BraveY May I ask how to run this feature?

Add this option: --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'

Do I need to use the following command to clear the page cache before testing?
echo 1 > /proc/sys/vm/drop_caches

@BraveY commented Sep 19, 2025

@BraveY May I ask how to run this feature?

Add this option: --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'

Do I need to use the following command to clear the page cache before testing?
echo 1 > /proc/sys/vm/drop_caches

Yes, if you want to test without the weights in the page cache. I used echo 1 > /proc/sys/vm/drop_caches to clear the cache.
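
(As a general note, drop_caches only evicts clean pages, so running sync first helps ensure dirty pages are written back and the cache is fully cleared.)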

FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025