
Conversation

Collaborator

@bangshengtang bangshengtang commented Sep 4, 2025

Summary:
Break down the main execute_model() function into multiple parts so that we can get a better latency breakdown in profiles:

  • preprocess
  • forward
  • postprocess
  • bookkeep (includes sync)
  • draft (if spec decoding is enabled)

This is meant to be a refactor plus some new function scopes; there is no functional difference.
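
For illustration, here is a minimal sketch (assumed names and stub bodies, not the actual vLLM code) of the structure described above: execute_model() delegates to per-phase helpers, each wrapped in a torch.profiler.record_function scope so that preprocess, forward, postprocess, bookkeep, and draft show up as separate spans in a trace.

```python
# Minimal sketch of the per-phase profiling-scope idea. All class, method, and
# attribute names below are illustrative stubs, not the real vLLM model runner.
import torch
from torch.profiler import record_function


class ModelRunnerSketch:
    def __init__(self, spec_decoding_enabled: bool = False):
        self.spec_decoding_enabled = spec_decoding_enabled

    def _preprocess(self, scheduler_output):   # build model inputs
        return {"input_ids": torch.zeros(1, dtype=torch.long)}

    def _forward(self, model_input):           # run the model
        return model_input["input_ids"].float()

    def _postprocess(self, hidden_states):     # sample next tokens
        return hidden_states.argmax(dim=-1)

    def _bookkeep(self, sampled):              # sync + build outputs
        return sampled.tolist()

    def _draft(self, sampled):                 # propose speculative draft tokens
        return [sampled]

    def execute_model(self, scheduler_output):
        # Each phase gets its own named scope, so profiler traces show a
        # per-phase latency breakdown instead of one monolithic span.
        with record_function("Preprocess"):
            model_input = self._preprocess(scheduler_output)
        with record_function("Forward"):
            hidden_states = self._forward(model_input)
        with record_function("Postprocess"):
            sampled = self._postprocess(hidden_states)
        with record_function("Bookkeep"):
            output = self._bookkeep(sampled)
        if self.spec_decoding_enabled:
            with record_function("Draft"):
                self._draft(sampled)
        return output
```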

Test Plan:
CI

run the vLLM predictor runbook and take a trace
(trace screenshot attached)

Differential Revision: D81009244


github-actions bot commented Sep 4, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small but essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D81009244

@mergify mergify bot added the v1 label Sep 4, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the execute_model function by breaking it down into smaller, more focused sub-functions, each wrapped in a profiling scope. This is a great improvement for code readability, maintainability, and performance profiling. The refactoring appears to be logically sound. I've identified a couple of issues with incorrect type hints in the new function signatures that should be addressed to ensure code correctness from a static analysis perspective.

Comment on lines 1461 to 1471
Contributor


high

The return type hint for _preprocess is incorrect. It specifies 9 elements in the tuple, but the function returns 8. Additionally, several of the types are mismatched with the actual returned values (e.g., the 3rd element is Optional[torch.Tensor] but hinted as int). This should be corrected to match the returned tuple for type consistency and to help static analysis tools.

    ) -> tuple[
        int,
        int,
        Optional[torch.Tensor],
        Optional[torch.Tensor],
        Optional[torch.Tensor],
        torch.Tensor,
        Optional[IntermediateTensors],
        dict[str, Any],
    ]:

Contributor


high

The type hint for logprobs_lists in the return tuple is LogprobsLists, but the value can be None if sampler_output.logprobs_tensors is None. The type hint should be Optional[LogprobsLists] to accurately reflect this.

        Optional[LogprobsLists],


Collaborator

@houseroad houseroad left a comment


I like this move. It will make the profiling results easier to parse. As a by-product, it also breaks a long function into smaller pieces.

Collaborator


later we can move this into something like _post_processing.

vllm/v1/utils.py Outdated
Comment on lines +364 to +368
Collaborator


Is there a reason not to turn it on by default? How much is the overhead, given it's just an nvtx context?

Collaborator


For use cases targeting extreme perf, we would like to avoid any additional overhead, although this one should be light.
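
For context, a minimal sketch (with a hypothetical env var name, not the exact vLLM code) of the opt-in pattern being discussed: the helper returns a real profiler scope only when profiling scopes are enabled, and a no-op nullcontext otherwise, so the default path pays essentially nothing.

```python
# Sketch of an opt-in profiling scope; the env var name here is illustrative.
import contextlib
import os

from torch.profiler import record_function

ENABLE_PROFILING_SCOPES = os.environ.get("VLLM_ENABLE_PROFILING_SCOPES", "0") == "1"


def record_function_or_nullcontext(name: str):
    """Return a named profiler scope when enabled, otherwise a no-op context."""
    if ENABLE_PROFILING_SCOPES:
        return record_function(name)
    return contextlib.nullcontext()


# Usage: wraps a phase only when profiling scopes are turned on.
with record_function_or_nullcontext("Preprocess"):
    pass  # phase body goes here
```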

@WoosukKwon WoosukKwon added the ready label (ONLY add when PR is ready to merge/full CI is needed) Sep 6, 2025

mergify bot commented Sep 6, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @bangshengtang.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

… scopes (vllm-project#24265)

Summary:

Break down the main execute_model() function into multiple parts so that we can get a better latency breakdown in profiles:
- preprocess
- forward
- postprocess
- bookkeep (includes sync)
- draft (if spec decoding is enabled)

This is meant to be a refactor plus some new function scopes; there is no functional difference.

Test Plan:
CI

run the vLLM predictor runbook and take a trace
{F1981500506}

Reviewed By: houseroad, frank-wei

Differential Revision: D81009244

@houseroad houseroad enabled auto-merge (squash) September 6, 2025 20:59
@WoosukKwon WoosukKwon disabled auto-merge September 6, 2025 21:02
@WoosukKwon WoosukKwon merged commit 848562b into vllm-project:main Sep 6, 2025
37 of 38 checks passed
eicherseiji pushed a commit to eicherseiji/vllm that referenced this pull request Sep 9, 2025
… scopes (vllm-project#24265)

Co-authored-by: Bangsheng Tang <bangsheng@meta.com>
skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025
… scopes (vllm-project#24265)

Co-authored-by: Bangsheng Tang <bangsheng@meta.com>
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
… scopes (vllm-project#24265)

Co-authored-by: Bangsheng Tang <bangsheng@meta.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
… scopes (vllm-project#24265)

Co-authored-by: Bangsheng Tang <bangsheng@meta.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
… scopes (vllm-project#24265)

Co-authored-by: Bangsheng Tang <bangsheng@meta.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>

Labels

ready (ONLY add when PR is ready to merge/full CI is needed), v1


5 participants