
Conversation

@drisspg
Contributor

@drisspg drisspg commented Jul 22, 2025

Purpose

Improve FlexAttention performance by adding a custom block-mask metadata builder for the common case.
Also updates to the newer metadata-passing APIs.
Co-authored by Horace.
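To make the idea concrete, here is a minimal sketch (my illustration, not the builder added in this PR) of the difference between letting create_block_mask discover the block structure and filling the BlockMask metadata in directly for a plain causal layout. The block size, sequence length, and variable names are assumptions for the example.

# Illustrative sketch only (not the PR's actual builder): for the common causal
# layout the block-level sparsity is known analytically, so the BlockMask metadata
# can be filled in directly instead of having create_block_mask evaluate the
# mask_mod over the full (Q, KV) grid on every step.
import torch
from torch.nn.attention.flex_attention import BlockMask, create_block_mask

BLOCK = 128                     # FlexAttention's default sparse block size
S = 1024                        # hypothetical sequence length for the example
n_blocks = S // BLOCK
device = "cuda" if torch.cuda.is_available() else "cpu"

def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

# Generic path: scans the whole mask to discover the block structure.
generic = create_block_mask(causal, B=None, H=None, Q_LEN=S, KV_LEN=S, device=device)

# Direct path: for causal attention, query block i attends to kv blocks 0..i,
# so kv_num_blocks / kv_indices can be written down without touching the mask.
kv_num_blocks = torch.arange(1, n_blocks + 1, dtype=torch.int32, device=device)[None, None]
kv_indices = (torch.arange(n_blocks, dtype=torch.int32, device=device)
              .expand(n_blocks, n_blocks)[None, None].contiguous())
direct = BlockMask.from_kv_blocks(kv_num_blocks, kv_indices,
                                  BLOCK_SIZE=BLOCK, mask_mod=causal)
print(generic.shape, direct.shape)   # both describe an S x S causal mask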

Test Plan

pytest tests/kernels/test_flex_attention.py

Here are my perf numbers from a sweep of vLLM, using this script:
https://gist.github.com/drisspg/c983e853ba8e9d999ae429783cde3c2f

| Batch | Total Tokens | Avg Tokens/Prompt | Token Range | Flex Input (tok/s) | Flash Attention Input (tok/s) | Input Speedup | Flex Output (tok/s) | Flash Attention Output (tok/s) | Output Speedup |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 288 | 9 | 1-17 | 1,396.83 | 1,582.32 | 1.13x | 2,483.16 | 2,812.96 | 1.13x |
| 2 | 928 | 29 | 14-60 | 624.17 | 4,565.11 | 7.31x | 341.68 | 2,503.88 | 7.33x |
| 3 | 5,554 | 173 | 50-693 | 3,230.28 | 17,809.32 | 5.51x | 286.15 | 1,593.65 | 5.57x |
| 4 | 12,271 | 383 | 218-729 | 7,235.03 | 26,763.38 | 3.70x | 255.89 | 1,018.53 | 3.98x |
| 5 | 29,094 | 909 | 425-3055 | 14,132.62 | 34,035.30 | 2.41x | 231.71 | 558.01 | 2.41x |

Two sources of slowdown:

FBURL for trace: https://fburl.com/3bcgf2d9

Processing batch 1/1 (batch size: 32)
 Total prefill tokens: 8407
 Average tokens per prompt: 262
 Token range: 1 - 788
 generating 16 output tokens

FBURL Flash Trace: https://fburl.com/0k1vpaor

  1. Much slower kernel:
    Flex is performing pretty miserably here (flex decode disabled):
    420 us compared to 20 - 25 us for Flash
[screenshots: Flex vs Flash kernel trace timings]
  2. Extra overhead from prepping metadata:
    We are launching some GPU kernels, but they are all CPU bound; CUDA graphs are kind of a mixed bag, since the nonzero call in the direct build path breaks things (see the small illustration below).

Comparing to Flash: 3.68 ms between decode steps for Flex vs 2.3 ms for Flash.
[screenshot: decode-step timing trace]
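For reference, a tiny standalone illustration (my sketch, not vLLM code) of why a data-dependent op like torch.nonzero is awkward here: its output shape depends on tensor values, so it forces a device-to-host sync, which does not fit the fixed-allocation capture/replay model of CUDA graphs.

# Sketch only: shows the implicit synchronization that torch.nonzero forces.
# A CUDA graph records fixed kernel launches and allocations, so an op whose
# output size is only known after reading device memory cannot be cleanly
# captured and replayed.
import torch

if torch.cuda.is_available():
    torch.cuda.set_sync_debug_mode("warn")   # warn whenever an op forces a host/device sync
    flags = torch.tensor([0, 1, 0, 1, 1], device="cuda")
    idx = torch.nonzero(flags)               # output shape = (num_nonzero, 1) -> implicit sync
    print(idx.shape)
    torch.cuda.set_sync_debug_mode("default")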

Note

I tried to run the same sweep before this PR but kept getting IMAs (illegal memory accesses).

Online Bench

python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model Qwen/Qwen3-8B \
  --endpoint /v1/completions \
  --dataset-name sharegpt \
  --dataset-path /home/drisspg/meta/my_scripts/data/ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 10 \
  --seed 42

Flash:

============ Serving Benchmark Result ============
Successful requests:                     300       
Benchmark duration (s):                  15.17     
Total input tokens:                      65492     
Total generated tokens:                  63454     
Request throughput (req/s):              19.78     
Output token throughput (tok/s):         4184.23   
Total Token throughput (tok/s):          8502.84   
---------------Time to First Token----------------
Mean TTFT (ms):                          1112.37   
Median TTFT (ms):                        1044.79   
P99 TTFT (ms):                           1902.98   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          37.71     
Median TPOT (ms):                        20.94     
P99 TPOT (ms):                           228.39    
---------------Inter-token Latency----------------
Mean ITL (ms):                           18.57     
Median ITL (ms):                         14.86     
P99 ITL (ms):                            227.82    

Flex:

============ Serving Benchmark Result ============
Successful requests:                     300       
Benchmark duration (s):                  37.89     
Total input tokens:                      65492     
Total generated tokens:                  63454     
Request throughput (req/s):              7.92      
Output token throughput (tok/s):         1674.69   
Total Token throughput (tok/s):          3403.16   
---------------Time to First Token----------------
Mean TTFT (ms):                          281.99    
Median TTFT (ms):                        289.92    
P99 TTFT (ms):                           321.96    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          46.91     
Median TPOT (ms):                        42.37     
P99 TPOT (ms):                           69.20     
---------------Inter-token Latency----------------
Mean ITL (ms):                           42.00     
Median ITL (ms):                         41.16     
P99 ITL (ms):                            60.25     
==================================================

Lm Eval

HF_HUB_DISABLE_XET=1 VLLM_ATTENTION_BACKEND=FLEX_ATTENTION lm_eval \
  --model vllm \
  --model_args '{
    "pretrained": "meta-llama/Meta-Llama-3-8B-Instruct",
    "gpu_memory_utilization": 0.8
  }' \
  --tasks gsm8k --batch_size auto

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.7582 | ± 0.0118 |
| gsm8k | 3 | strict-match | 5 | exact_match | 0.7597 | ± 0.0118 |

w/ Flash backend
limit: None, num_fewshot: None, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.7513 | ± 0.0119 |
| gsm8k | 3 | strict-match | 5 | exact_match | 0.7528 | ± 0.0119 |

cc @LucasWilkinson

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@drisspg drisspg marked this pull request as draft July 22, 2025 23:30
@mergify mergify bot added the v1 label Jul 22, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces significant performance improvements to FlexAttention by implementing a more efficient method for building the block mask. The changes are well-structured and include new helper functions for tensor manipulation. However, I've identified a critical correctness issue where __post_init__ incorrectly returns a value, and a high-severity issue regarding a hardcoded block size that limits the applicability of this backend. Addressing these points will improve the robustness and correctness of the implementation.

@drisspg drisspg force-pushed the updates-to-flex branch 2 times, most recently from 8896590 to 1bcd0a9 Compare July 22, 2025 23:34
@drisspg drisspg force-pushed the updates-to-flex branch 6 times, most recently from 20ab73c to 7fe7ae2 Compare July 23, 2025 21:47
@drisspg drisspg marked this pull request as ready for review July 24, 2025 01:02
@drisspg
Contributor Author

drisspg commented Jul 24, 2025

Running Flex Test:

>           assert flex_text == default_text, (
                f"FlexAttention output doesn't match default for: {prompt!r}\n"
                f"FlexAttention: {flex_text!r}\n"
                f"Default: {default_text!r}")
E           AssertionError: FlexAttention output doesn't match default for: 'Hello, my name is'
E             FlexAttention: ' John. I am a 16 year old boy. I am a student at a high school. I am a bit of a loner. I have'
E             Default: ' John. I am a 20-year-old student at the University of California, Berkeley. I am a senior in my major of Computer Science. I am'
E           assert ' John. I am ...loner. I have' == ' John. I am ...Science. I am'
E             
E             -  John. I am a 20-year-old student at the University of California, Berkeley. I am a senior in my major of Computer Science. I am
E             +  John. I am a 16 year old boy. I am a student at a high school. I am a bit of a loner. I have

Would be curious if people have better ideas on more robust testing here
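One option might be comparing teacher-forced prompt logprobs within a tolerance rather than exact greedy text. The sketch below is illustrative only (not part of this PR); flex_llm and ref_llm are hypothetical LLM instances created with the respective attention backends.

# Sketch of a tolerance-based check: score the *same* prompt tokens under both
# backends (teacher forcing), so the comparison is insensitive to a single
# sampled token flipping the rest of the continuation.
from vllm import SamplingParams

def assert_backends_close(flex_llm, ref_llm, prompts, atol=0.25):
    params = SamplingParams(temperature=0.0, max_tokens=1, prompt_logprobs=0)
    for flex_out, ref_out in zip(flex_llm.generate(prompts, params),
                                 ref_llm.generate(prompts, params)):
        for i, (f_step, r_step) in enumerate(zip(flex_out.prompt_logprobs,
                                                 ref_out.prompt_logprobs)):
            if f_step is None or r_step is None:   # first prompt token has no logprob
                continue
            f_lp = next(iter(f_step.values())).logprob
            r_lp = next(iter(r_step.values())).logprob
            assert abs(f_lp - r_lp) < atol, f"prompt position {i}: {f_lp} vs {r_lp}"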

@drisspg drisspg force-pushed the updates-to-flex branch 2 times, most recently from dbeff4d to 6d84dc5 Compare August 13, 2025 17:44
@drisspg
Contributor Author

drisspg commented Aug 13, 2025

@LucasWilkinson Are the failures related?

@LucasWilkinson
Collaborator

@LucasWilkinson Are the failures related?

I don't think so, but we're holding off on force merges till we can get the CI green (hopefully today), so I'd just wait and rebase after that.

@mergify

mergify bot commented Aug 20, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @drisspg.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 20, 2025
Signed-off-by: drisspg <drisspguessous@gmail.com>
@mergify mergify bot removed the needs-rebase label Aug 25, 2025
@zou3519 zou3519 merged commit e0329ed into vllm-project:main Aug 25, 2025
38 checks passed
@Muennighoff
Contributor

Muennighoff commented Aug 27, 2025

@drisspg can you share your env info?

When running the below with Python 3.10.18 on H100s I get the error at the bottom. Interestingly, it works fine when doing [text] * 2 or just a single item, so maybe it's some indexing issue for larger batch sizes?

Edit: Reduced the script to just the below:

# git clone https://github.com/vllm-project/vllm.git
# cd vllm
# pip install uv
# VLLM_USE_PRECOMPILED=1 uv pip install --editable .

import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLEX_ATTENTION"
from vllm import LLM, SamplingParams
model = LLM("Qwen/Qwen2-7B-Instruct")
output = model.generate(["Hi"] * 4)
print(output)
Log


...

(EngineCore_0 pid=542941) ERROR 08-27 00:35:20 [core.py:710]     return compiled_fn(new_inputs)
(EngineCore_0 pid=542941) ERROR 08-27 00:35:20 [core.py:710]   File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_inductor/cudagraph_trees.py", line 371, in deferred_cudagraphify
(EngineCore_0 pid=542941) ERROR 08-27 00:35:20 [core.py:710]     return fn(inputs)
(EngineCore_0 pid=542941) ERROR 08-27 00:35:20 [core.py:710]   File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_inductor/utils.py", line 2404, in run
(EngineCore_0 pid=542941) ERROR 08-27 00:35:20 [core.py:710]     return model(new_inputs)
(EngineCore_0 pid=542941) ERROR 08-27 00:35:20 [core.py:710]   File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_inductor/cudagraph_trees.py", line 1997, in run
(EngineCore_0 pid=542941) ERROR 08-27 00:35:20 [core.py:710]     out = self._run(new_inputs, function_id)
(EngineCore_0 pid=542941) ERROR 08-27 00:35:20 [core.py:710]   File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_inductor/cudagraph_trees.py", line 2175, in _run
(EngineCore_0 pid=542941) ERROR 08-27 00:35:20 [core.py:710]     out = self.record_function(new_inputs, function_id)
(EngineCore_0 pid=542941) ERROR 08-27 00:35:20 [core.py:710]   File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_inductor/cudagraph_trees.py", line 2230, in record_function
(EngineCore_0 pid=542941) ERROR 08-27 00:35:20 [core.py:710]     torch.cuda.synchronize()
(EngineCore_0 pid=542941) ERROR 08-27 00:35:20 [core.py:710]   File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/cuda/__init__.py", line 1040, in synchronize
(EngineCore_0 pid=542941) ERROR 08-27 00:35:20 [core.py:710]     return torch._C._cuda_synchronize()
(EngineCore_0 pid=542941) ERROR 08-27 00:35:20 [core.py:710] RuntimeError: CUDA error: an illegal memory access was encountered
(EngineCore_0 pid=542941) ERROR 08-27 00:35:20 [core.py:710] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_0 pid=542941) ERROR 08-27 00:35:20 [core.py:710] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_0 pid=542941) ERROR 08-27 00:35:20 [core.py:710] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(EngineCore_0 pid=542941) ERROR 08-27 00:35:20 [core.py:710]
Traceback (most recent call last):
  File "/home/muennighoff/s2/generate_simple.py", line 26, in <module>
    output = model.generate(
  File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/entrypoints/llm.py", line 388, in generate
    outputs = self._run_engine(use_tqdm=use_tqdm)
  File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/entrypoints/llm.py", line 1448, in _run_engine
    step_outputs = self.llm_engine.step()
  File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/v1/engine/llm_engine.py", line 241, in step
    outputs = self.engine_core.get_output()
  File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/v1/engine/core_client.py", line 668, in get_output
    raise self._format_exception(outputs) from None
vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(EngineCore_0 pid=542941) Process EngineCore_0:
(EngineCore_0 pid=542941) Traceback (most recent call last):
(EngineCore_0 pid=542941)   File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_0 pid=542941)     self.run()
(EngineCore_0 pid=542941)   File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/multiprocessing/process.py", line 108, in run
(EngineCore_0 pid=542941)   File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/v1/engine/core.py", line 712, in run_engine_core
(EngineCore_0 pid=542941)     raise e
(EngineCore_0 pid=542941)   File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/v1/engine/core.py", line 701, in run_engine_core
(EngineCore_0 pid=542941)     engine_core.run_busy_loop()
(EngineCore_0 pid=542941)   File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/v1/engine/core.py", line 728, in run_busy_loop
(EngineCore_0 pid=542941)     self._process_engine_step()
(EngineCore_0 pid=542941)   File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/v1/engine/core.py", line 753, in _process_engine_step
(EngineCore_0 pid=542941)     outputs, model_executed = self.step_fn()
(EngineCore_0 pid=542941)   File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/v1/engine/core.py", line 289, in step
(EngineCore_0 pid=542941)     model_output = self.execute_model_with_error_logging(
(EngineCore_0 pid=542941)   File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/v1/engine/core.py", line 275, in execute_model_with_error_logging
(EngineCore_0 pid=542941)     raise err
(EngineCore_0 pid=542941)   File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/v1/engine/core.py", line 266, in execute_model_with_error_logging
(EngineCore_0 pid=542941)     return model_fn(scheduler_output)
(EngineCore_0 pid=542941)   File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/v1/executor/abstract.py", line 95, in execute_model
(EngineCore_0 pid=542941)     output = self.collective_rpc("execute_model",
(EngineCore_0 pid=542941)   File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
(EngineCore_0 pid=542941)     answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_0 pid=542941)   File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/utils/__init__.py", line 3031, in run_method
(EngineCore_0 pid=542941)     return func(*args, **kwargs)
(EngineCore_0 pid=542941)   File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(EngineCore_0 pid=542941)     return func(*args, **kwargs)
(EngineCore_0 pid=542941)   File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/v1/worker/gpu_worker.py", line 362, in execute_model
(EngineCore_0 pid=542941)     output = self.model_runner.execute_model(scheduler_output,
(EngineCore_0 pid=542941)   File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(EngineCore_0 pid=542941)     return func(*args, **kwargs)
(EngineCore_0 pid=542941)   File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/v1/worker/gpu_model_runner.py", line 1488, in execute_model
(EngineCore_0 pid=542941)     max_query_len) = self._prepare_inputs(scheduler_output)
(EngineCore_0 pid=542941)   File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/v1/worker/gpu_model_runner.py", line 880, in _prepare_inputs
(EngineCore_0 pid=542941)     attn_metadata_i = (builder.build(
(EngineCore_0 pid=542941)   File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/v1/attention/backends/flex_attention.py", line 577, in build
(EngineCore_0 pid=542941)     out = FlexAttentionMetadata(
(EngineCore_0 pid=542941)   File "<string>", line 32, in __init__
(EngineCore_0 pid=542941)   File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/v1/attention/backends/flex_attention.py", line 511, in __post_init__
(EngineCore_0 pid=542941)     self.block_mask = self.build_block_mask()
(EngineCore_0 pid=542941)   File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/v1/attention/backends/flex_attention.py", line 481, in build_block_mask
(EngineCore_0 pid=542941)     return create_block_mask_compiled(
(EngineCore_0 pid=542941) File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 655, in _fn
(EngineCore_0 pid=542941) return fn(*args, **kwargs)
(EngineCore_0 pid=542941) File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/attention/flex_attention.py", line 824, in create_block_mask
(EngineCore_0 pid=542941) def create_block_mask(
(EngineCore_0 pid=542941) File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
(EngineCore_0 pid=542941) return fn(*args, **kwargs)
(EngineCore_0 pid=542941) File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 1209, in forward
(EngineCore_0 pid=542941) return compiled_fn(full_args)
(EngineCore_0 pid=542941) File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 328, in runtime_wrapper
(EngineCore_0 pid=542941) all_outs = call_func_at_runtime_with_args(
(EngineCore_0 pid=542941) File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/utils.py", line 126, in call_func_at_runtime_with_args
(EngineCore_0 pid=542941) out = normalize_as_list(f(args))
(EngineCore_0 pid=542941) File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 689, in inner_fn
(EngineCore_0 pid=542941) outs = compiled_fn(args)
(EngineCore_0 pid=542941) File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 495, in wrapper
(EngineCore_0 pid=542941) return compiled_fn(runtime_args)
(EngineCore_0 pid=542941) File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_inductor/output_code.py", line 460, in __call__
(EngineCore_0 pid=542941) return self.current_callable(inputs)
(EngineCore_0 pid=542941) File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 1372, in run
(EngineCore_0 pid=542941) return compiled_fn(new_inputs)
(EngineCore_0 pid=542941) File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_inductor/cudagraph_trees.py", line 371, in deferred_cudagraphify
(EngineCore_0 pid=542941) return fn(inputs)
(EngineCore_0 pid=542941) File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_inductor/utils.py", line 2404, in run
(EngineCore_0 pid=542941) return model(new_inputs)
(EngineCore_0 pid=542941) File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_inductor/cudagraph_trees.py", line 1997, in run
(EngineCore_0 pid=542941) out = self._run(new_inputs, function_id)
(EngineCore_0 pid=542941) File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_inductor/cudagraph_trees.py", line 2175, in _run
(EngineCore_0 pid=542941) out = self.record_function(new_inputs, function_id)
(EngineCore_0 pid=542941) File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_inductor/cudagraph_trees.py", line 2230, in record_function
(EngineCore_0 pid=542941) torch.cuda.synchronize()
(EngineCore_0 pid=542941) File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/cuda/__init__.py", line 1040, in synchronize
(EngineCore_0 pid=542941) return torch._C._cuda_synchronize()
(EngineCore_0 pid=542941) RuntimeError: CUDA error: an illegal memory access was encountered
(EngineCore_0 pid=542941) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_0 pid=542941) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_0 pid=542941) Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(EngineCore_0 pid=542941)
Processed prompts: 0%| | 0/4 [00:06<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
Signed-off-by: drisspg <drisspguessous@gmail.com>
@huydhn huydhn mentioned this pull request Aug 28, 2025
10 tasks
xiao-llm pushed a commit to xiao-llm/vllm that referenced this pull request Aug 28, 2025
Signed-off-by: drisspg <drisspguessous@gmail.com>
Signed-off-by: Xiao Yu <xiao.yu@amd.com>
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
Signed-off-by: drisspg <drisspguessous@gmail.com>
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Sep 3, 2025
Signed-off-by: drisspg <drisspguessous@gmail.com>
ekagra-ranjan pushed a commit to ekagra-ranjan/vllm that referenced this pull request Sep 4, 2025
Signed-off-by: drisspg <drisspguessous@gmail.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
@zongy17

zongy17 commented Sep 19, 2025

Hi @drisspg, I am wondering if Flex can be used with pipeline parallelism? It seems that setting --pipeline-parallel-size to more than one incurs errors.

FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: drisspg <drisspguessous@gmail.com>