fix cuda graph #22721
Conversation
Signed-off-by: fsx950223 <fsx950223@outlook.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Code Review
This pull request refactors the CUDA graph support flag for the AiterFlashAttention backend. While the change to use the AttentionCGSupport enum is a good step towards standardization, I've identified a critical issue in the underlying implementation. The backend is marked as always supporting CUDA graphs, but the forward pass for prefill scenarios appears to be buggy, calling two attention kernels sequentially and overwriting the output. This makes the prefill path incorrect. I've recommended adjusting the support level to PURE_DECODE_ONLY to accurately reflect the backend's current state and prevent capturing faulty logic in CUDA graphs until the implementation is fixed.
attn_cudagraph_support: ClassVar[
    AttentionCGSupport] = AttentionCGSupport.ALWAYS
Setting attn_cudagraph_support to ALWAYS seems incorrect given the current implementation of AiterFlashAttentionImpl.forward.
In the forward method, for prefill scenarios (where max_seqlen_q > 1), both torch.ops.vllm.flash_attn_varlen_func and torch.ops.aiter.paged_attention_v1 are executed sequentially. The second call will overwrite the results of the first, which appears to be a bug that makes the prefill path produce incorrect results.
Given that the decode path (max_seqlen_q <= 1) only calls paged_attention_v1 and might be correct, attn_cudagraph_support should be set to AttentionCGSupport.PURE_DECODE_ONLY. This would accurately reflect that only decode operations can be safely captured in a CUDA graph. Setting it to ALWAYS is misleading and could lead to capturing incorrect logic in CUDA graphs for prefill/mixed batches.
I suggest changing this to PURE_DECODE_ONLY until the prefill logic is fixed.
Suggested change:
-    attn_cudagraph_support: ClassVar[
-        AttentionCGSupport] = AttentionCGSupport.ALWAYS
+    attn_cudagraph_support: ClassVar[
+        AttentionCGSupport] = AttentionCGSupport.PURE_DECODE_ONLY
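To make the concern concrete, here is a minimal, self-contained sketch of the overwrite pattern described in the comment above. It is illustrative only and not the actual AiterFlashAttentionImpl.forward: the stand-in functions replace the real torch.ops.vllm.flash_attn_varlen_func and torch.ops.aiter.paged_attention_v1 kernels, and the real control flow may differ in detail.

```python
# Minimal sketch (not the actual vLLM source) of the described bug: the
# prefill result is computed first, then clobbered by the unconditional
# decode kernel call that follows it.
import torch


def varlen_prefill_attention(query: torch.Tensor) -> torch.Tensor:
    # Stand-in for the varlen flash-attention prefill kernel.
    return query * 2.0


def paged_decode_attention(query: torch.Tensor) -> torch.Tensor:
    # Stand-in for the paged-attention decode kernel.
    return query * 3.0


def forward_sketch(query: torch.Tensor, max_seqlen_q: int) -> torch.Tensor:
    output = torch.empty_like(query)
    if max_seqlen_q > 1:
        # Prefill / mixed batch: the prefill kernel writes its result ...
        output = varlen_prefill_attention(query)
    # ... but the decode kernel also runs and overwrites `output`, so only
    # the pure-decode path (max_seqlen_q <= 1) produces the intended result.
    output = paged_decode_attention(query)
    return output


if __name__ == "__main__":
    q = torch.ones(4, 8)
    # For a prefill-shaped call the result comes from the decode stand-in,
    # which is why capturing this path in a CUDA graph would bake in the
    # wrong computation.
    print(forward_sketch(q, max_seqlen_q=16)[0, 0].item())  # 3.0, not 2.0
```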
@fsx950223
This attn_cudagraph_support has been renamed to cudagraph_support in PR #20059 (merged 4 days ago into main).
The full_cuda_graph flag is deprecated; full CUDA graph capture is now controlled through --compilation-config '{"cudagraph_mode": "FULL"}'.
PR #20059 introduces multiple modes:
vllm/vllm/config/compilation.py
Lines 226 to 261 in f856c33
cudagraph_mode: Optional[CUDAGraphMode] = None
"""
The mode of the cudagraph.
- NONE, no cudagraph capture.
- PIECEWISE. (v1 default)
- FULL.
- FULL_DECODE_ONLY.
- FULL_AND_PIECEWISE.
PIECEWISE mode builds a piecewise cudagraph only, keeping the
cudagraph-incompatible ops (i.e. some attention ops) outside the cudagraph
for general flexibility. This is the default mode.
FULL mode: Capture full cudagraph for all batches. Can be good for small
models or workloads with small prompts; not supported by many backends.
Generally for performance FULL_AND_PIECEWISE is better.
FULL_DECODE_ONLY mode: Capture full cudagraph for decode batches only.
Mixed prefill-decode batches are run without cudagraphs. Can be good for
decode instances in a P/D setup where prefill is not as important, so we
can save some memory.
FULL_AND_PIECEWISE mode: Capture full cudagraph for decode batches and
piecewise cudagraph for prefill and mixed prefill-decode batches.
This is likely the most performant mode for most models.
Currently, the cudagraph mode is only used for the v1 engine.
Note that the cudagraph logic is generally orthogonal to the
compilation logic. While piecewise cudagraphs require piecewise
compilation (level=PIECEWISE and non-empty splitting_ops), full
cudagraphs are supported with and without compilation.
Warning: This flag is new and subject to change; in addition,
more modes may be added.
"""
Signed-off-by: fsx950223 <fsx950223@outlook.com>
Additional Tests:

Server command:

#!/bin/bash
rm -rf /root/.cache/vllm
MODEL=Qwen/Qwen3-8B
VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=1 \
vllm serve $MODEL \
  --tensor-parallel-size 8 \
  --max-num-seqs 1024 \
  --kv-cache-dtype fp8 \
  --max-num-batched-tokens 32768 \
  --disable-log-requests \
  --compilation-config '{"cudagraph_mode": "FULL"}' \
  --trust-remote-code

Server log:

Client command:

#!/bin/bash
lm_eval \
  --model local-completions \
  --tasks gsm8k \
  --model_args model=Qwen/Qwen3-8B,base_url=http://127.0.0.1:8000/v1/completions \
  --batch_size 100 \
  > lmeval_server-Qwen_Qwen3-8B-aiter-v1-mha-fullgraphmode_FULL_PR.log 2>&1

lm eval score of
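As a quick sanity check before the full lm_eval run, the served endpoint can be queried directly. The sketch below assumes the openai Python client is installed and reuses the server address and model name from the commands above; it is not part of the original test plan.

```python
# Hedged sketch: hit the OpenAI-compatible endpoint exposed by `vllm serve`
# above to confirm the server responds before launching lm_eval.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")
resp = client.completions.create(
    model="Qwen/Qwen3-8B",
    prompt="Janet has 3 apples and buys 2 more. How many apples does she have?",
    max_tokens=32,
)
print(resp.choices[0].text)
```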
Signed-off-by: fsx950223 <fsx950223@outlook.com>
Signed-off-by: fsx950223 <fsx950223@outlook.com> Signed-off-by: Duncan Moss <djm.moss@gmail.com>
Signed-off-by: fsx950223 <fsx950223@outlook.com> Signed-off-by: Xiao Yu <xiao.yu@amd.com>
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.

Purpose
Test Plan
Test Result
(Optional) Documentation Update