
Conversation

@drisspg
Contributor

@drisspg drisspg commented Jul 22, 2025

Purpose

Improve FlexAttention performance by adding a custom block-mask metadata builder for the common case.
Also updates to the newer metadata-passing APIs.
Co-authored by Horace.
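To make the idea concrete, here is a minimal sketch (my illustration, not the builder added in this PR) of the difference between letting create_block_mask discover the block structure and filling the BlockMask metadata in directly for a plain causal layout. The block size, sequence length, and variable names are assumptions for the example.

# Illustrative sketch only (not the PR's actual builder): for the common causal
# layout the block-level sparsity is known analytically, so the BlockMask metadata
# can be filled in directly instead of having create_block_mask evaluate the
# mask_mod over the full (Q, KV) grid on every step.
import torch
from torch.nn.attention.flex_attention import BlockMask, create_block_mask

BLOCK = 128                     # FlexAttention's default sparse block size
S = 1024                        # hypothetical sequence length for the example
n_blocks = S // BLOCK
device = "cuda" if torch.cuda.is_available() else "cpu"

def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

# Generic path: scans the whole mask to discover the block structure.
generic = create_block_mask(causal, B=None, H=None, Q_LEN=S, KV_LEN=S, device=device)

# Direct path: for causal attention, query block i attends to kv blocks 0..i,
# so kv_num_blocks / kv_indices can be written down without touching the mask.
kv_num_blocks = torch.arange(1, n_blocks + 1, dtype=torch.int32, device=device)[None, None]
kv_indices = (torch.arange(n_blocks, dtype=torch.int32, device=device)
              .expand(n_blocks, n_blocks)[None, None].contiguous())
direct = BlockMask.from_kv_blocks(kv_num_blocks, kv_indices,
                                  BLOCK_SIZE=BLOCK, mask_mod=causal)
print(generic.shape, direct.shape)   # both describe an S x S causal mask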

Test Plan

pytest tests/kernels/test_flex_attention.py

Here are my perf numbers from a sweep of vLLM, using this script:
https://gist.github.com/drisspg/c983e853ba8e9d999ae429783cde3c2f

| Batch | Total Tokens | Avg Tokens/Prompt | Token Range | Flex Input (tok/s) | Flash Attention Input (tok/s) | Input Speedup | Flex Output (tok/s) | Flash Attention Output (tok/s) | Output Speedup |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 288 | 9 | 1-17 | 1,396.83 | 1,582.32 | 1.13x | 2,483.16 | 2,812.96 | 1.13x |
| 2 | 928 | 29 | 14-60 | 624.17 | 4,565.11 | 7.31x | 341.68 | 2,503.88 | 7.33x |
| 3 | 5,554 | 173 | 50-693 | 3,230.28 | 17,809.32 | 5.51x | 286.15 | 1,593.65 | 5.57x |
| 4 | 12,271 | 383 | 218-729 | 7,235.03 | 26,763.38 | 3.70x | 255.89 | 1,018.53 | 3.98x |
| 5 | 29,094 | 909 | 425-3055 | 14,132.62 | 34,035.30 | 2.41x | 231.71 | 558.01 | 2.41x |

Two sources of slowdown:

FBURL for trace: https://fburl.com/3bcgf2d9

Processing batch 1/1 (batch size: 32)
 Total prefill tokens: 8407
 Average tokens per prompt: 262
 Token range: 1 - 788
 generating 16 output tokens

FBURL Flash Trace: https://fburl.com/0k1vpaor

  1. Much slower kernel:
    Flex is performing pretty miserably here (flex decode disabled):
    420 us compared to 20 - 25 us for Flash
[screenshots: Flex vs Flash kernel trace timings]
  2. Extra overhead from prepping metadata:
    We are launching some GPU kernels, but they are all CPU bound; CUDA graphs are kind of a mixed bag, since the nonzero call in the direct build path breaks things (see the small illustration below).

Comparing to Flash: 3.68 ms between decode steps for Flex vs 2.3 ms for Flash.
[screenshot: decode-step timing trace]
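For reference, a tiny standalone illustration (my sketch, not vLLM code) of why a data-dependent op like torch.nonzero is awkward here: its output shape depends on tensor values, so it forces a device-to-host sync, which does not fit the fixed-allocation capture/replay model of CUDA graphs.

# Sketch only: shows the implicit synchronization that torch.nonzero forces.
# A CUDA graph records fixed kernel launches and allocations, so an op whose
# output size is only known after reading device memory cannot be cleanly
# captured and replayed.
import torch

if torch.cuda.is_available():
    torch.cuda.set_sync_debug_mode("warn")   # warn whenever an op forces a host/device sync
    flags = torch.tensor([0, 1, 0, 1, 1], device="cuda")
    idx = torch.nonzero(flags)               # output shape = (num_nonzero, 1) -> implicit sync
    print(idx.shape)
    torch.cuda.set_sync_debug_mode("default")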

Note

I tried to run the same sweep before this PR but kept getting IMAs (illegal memory accesses).

Online Bench

python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model Qwen/Qwen3-8B \
  --endpoint /v1/completions \
  --dataset-name sharegpt \
  --dataset-path /home/drisspg/meta/my_scripts/data/ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 10 \
  --seed 42

Flash:

============ Serving Benchmark Result ============
Successful requests:                     300       
Benchmark duration (s):                  15.17     
Total input tokens:                      65492     
Total generated tokens:                  63454     
Request throughput (req/s):              19.78     
Output token throughput (tok/s):         4184.23   
Total Token throughput (tok/s):          8502.84   
---------------Time to First Token----------------
Mean TTFT (ms):                          1112.37   
Median TTFT (ms):                        1044.79   
P99 TTFT (ms):                           1902.98   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          37.71     
Median TPOT (ms):                        20.94     
P99 TPOT (ms):                           228.39    
---------------Inter-token Latency----------------
Mean ITL (ms):                           18.57     
Median ITL (ms):                         14.86     
P99 ITL (ms):                            227.82    

Flex:

============ Serving Benchmark Result ============
Successful requests:                     300       
Benchmark duration (s):                  37.89     
Total input tokens:                      65492     
Total generated tokens:                  63454     
Request throughput (req/s):              7.92      
Output token throughput (tok/s):         1674.69   
Total Token throughput (tok/s):          3403.16   
---------------Time to First Token----------------
Mean TTFT (ms):                          281.99    
Median TTFT (ms):                        289.92    
P99 TTFT (ms):                           321.96    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          46.91     
Median TPOT (ms):                        42.37     
P99 TPOT (ms):                           69.20     
---------------Inter-token Latency----------------
Mean ITL (ms):                           42.00     
Median ITL (ms):                         41.16     
P99 ITL (ms):                            60.25     
==================================================

Lm Eval

HF_HUB_DISABLE_XET=1 VLLM_ATTENTION_BACKEND=FLEX_ATTENTION lm_eval \
  --model vllm \
  --model_args '{
    "pretrained": "meta-llama/Meta-Llama-3-8B-Instruct",
    "gpu_memory_utilization": 0.8
  }' \
  --tasks gsm8k --batch_size auto

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.7582 | ± 0.0118 |
| gsm8k | 3 | strict-match | 5 | exact_match | 0.7597 | ± 0.0118 |

w/ Flash backend
limit: None, num_fewshot: None, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.7513 | ± 0.0119 |
| gsm8k | 3 | strict-match | 5 | exact_match | 0.7528 | ± 0.0119 |

cc @LucasWilkinson

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@drisspg drisspg marked this pull request as draft July 22, 2025 23:30
@mergify mergify bot added the v1 label Jul 22, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces significant performance improvements to FlexAttention by implementing a more efficient method for building the block mask. The changes are well-structured and include new helper functions for tensor manipulation. However, I've identified a critical correctness issue where __post_init__ incorrectly returns a value, and a high-severity issue regarding a hardcoded block size that limits the applicability of this backend. Addressing these points will improve the robustness and correctness of the implementation.

@drisspg drisspg force-pushed the updates-to-flex branch 2 times, most recently from 8896590 to 1bcd0a9 Compare July 22, 2025 23:34
@drisspg drisspg force-pushed the updates-to-flex branch 6 times, most recently from 20ab73c to 7fe7ae2 Compare July 23, 2025 21:47
@drisspg drisspg marked this pull request as ready for review July 24, 2025 01:02
@drisspg
Contributor Author

drisspg commented Jul 24, 2025

Running Flex Test:

>           assert flex_text == default_text, (
                f"FlexAttention output doesn't match default for: {prompt!r}\n"
                f"FlexAttention: {flex_text!r}\n"
                f"Default: {default_text!r}")
E           AssertionError: FlexAttention output doesn't match default for: 'Hello, my name is'
E             FlexAttention: ' John. I am a 16 year old boy. I am a student at a high school. I am a bit of a loner. I have'
E             Default: ' John. I am a 20-year-old student at the University of California, Berkeley. I am a senior in my major of Computer Science. I am'
E           assert ' John. I am ...loner. I have' == ' John. I am ...Science. I am'
E             
E             -  John. I am a 20-year-old student at the University of California, Berkeley. I am a senior in my major of Computer Science. I am
E             +  John. I am a 16 year old boy. I am a student at a high school. I am a bit of a loner. I have

Would be curious if people have better ideas on more robust testing here
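One option might be comparing teacher-forced prompt logprobs within a tolerance rather than exact greedy text. The sketch below is illustrative only (not part of this PR); flex_llm and ref_llm are hypothetical LLM instances created with the respective attention backends.

# Sketch of a tolerance-based check: score the *same* prompt tokens under both
# backends (teacher forcing), so the comparison is insensitive to a single
# sampled token flipping the rest of the continuation.
from vllm import SamplingParams

def assert_backends_close(flex_llm, ref_llm, prompts, atol=0.25):
    params = SamplingParams(temperature=0.0, max_tokens=1, prompt_logprobs=0)
    for flex_out, ref_out in zip(flex_llm.generate(prompts, params),
                                 ref_llm.generate(prompts, params)):
        for i, (f_step, r_step) in enumerate(zip(flex_out.prompt_logprobs,
                                                 ref_out.prompt_logprobs)):
            if f_step is None or r_step is None:   # first prompt token has no logprob
                continue
            f_lp = next(iter(f_step.values())).logprob
            r_lp = next(iter(r_step.values())).logprob
            assert abs(f_lp - r_lp) < atol, f"prompt position {i}: {f_lp} vs {r_lp}"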

@drisspg drisspg force-pushed the updates-to-flex branch 2 times, most recently from dbeff4d to 6d84dc5 Compare August 13, 2025 17:44
@drisspg
Contributor Author

drisspg commented Aug 13, 2025

@LucasWilkinson Are the failures related?

@LucasWilkinson
Collaborator

@LucasWilkinson Are the failures related?

I don't think so, but we're holding off on force merges till we can get the CI green (hopefully today), so I'd just wait and rebase after that.

@mergify

mergify bot commented Aug 20, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @drisspg.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 20, 2025
Signed-off-by: drisspg <drisspguessous@gmail.com>
@mergify mergify bot removed the needs-rebase label Aug 25, 2025
@zou3519 zou3519 merged commit e0329ed into vllm-project:main Aug 25, 2025
38 checks passed
@Muennighoff
Contributor

Muennighoff commented Aug 27, 2025

@drisspg can you share your env info?

When running the below with Python 3.10.18 on H100s I get the error at the bottom. Interestingly, it works fine when doing [text] * 2 or just a single item, so maybe it's some indexing issue for larger batch sizes?

Edit: Reduced the script to just the below:

# git clone https://github.com/vllm-project/vllm.git
# cd vllm
# pip install uv
# VLLM_USE_PRECOMPILED=1 uv pip install --editable .

import os
os.environ["VLLM_ATTENTION_BACKEND"] = "FLEX_ATTENTION"
from vllm import LLM, SamplingParams
model = LLM("Qwen/Qwen2-7B-Instruct")
output = model.generate(["Hi"] * 4)
print(output)
Log


...

(EngineCore_0 pid=542941) ERROR 08-27 00:35:20 [core.py:710]     return compiled_fn(new_inputs)
(EngineCore_0 pid=542941) ERROR 08-27 00:35:20 [core.py:710]   File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_inductor/cudagraph_trees.py", line 371, in deferred_cudagraphify
(EngineCore_0 pid=542941) ERROR 08-27 00:35:20 [core.py:710]     return fn(inputs)
(EngineCore_0 pid=542941) ERROR 08-27 00:35:20 [core.py:710]   File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_inductor/utils.py", line 2404, in run
(EngineCore_0 pid=542941) ERROR 08-27 00:35:20 [core.py:710]     return model(new_inputs)
(EngineCore_0 pid=542941) ERROR 08-27 00:35:20 [core.py:710]   File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_inductor/cudagraph_trees.py", line 1997, in run
(EngineCore_0 pid=542941) ERROR 08-27 00:35:20 [core.py:710]     out = self._run(new_inputs, function_id)
(EngineCore_0 pid=542941) ERROR 08-27 00:35:20 [core.py:710]   File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_inductor/cudagraph_trees.py", line 2175, in _run
(EngineCore_0 pid=542941) ERROR 08-27 00:35:20 [core.py:710]     out = self.record_function(new_inputs, function_id)
(EngineCore_0 pid=542941) ERROR 08-27 00:35:20 [core.py:710]   File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_inductor/cudagraph_trees.py", line 2230, in record_function
(EngineCore_0 pid=542941) ERROR 08-27 00:35:20 [core.py:710]     torch.cuda.synchronize()
(EngineCore_0 pid=542941) ERROR 08-27 00:35:20 [core.py:710]   File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/cuda/__init__.py", line 1040, in synchronize
(EngineCore_0 pid=542941) ERROR 08-27 00:35:20 [core.py:710]     return torch._C._cuda_synchronize()
(EngineCore_0 pid=542941) ERROR 08-27 00:35:20 [core.py:710] RuntimeError: CUDA error: an illegal memory access was encountered
(EngineCore_0 pid=542941) ERROR 08-27 00:35:20 [core.py:710] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_0 pid=542941) ERROR 08-27 00:35:20 [core.py:710] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_0 pid=542941) ERROR 08-27 00:35:20 [core.py:710] Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(EngineCore_0 pid=542941) ERROR 08-27 00:35:20 [core.py:710]
Traceback (most recent call last):
  File "/home/muennighoff/s2/generate_simple.py", line 26, in <module>
    output = model.generate(
  File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/entrypoints/llm.py", line 388, in generate
    outputs = self._run_engine(use_tqdm=use_tqdm)
  File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/entrypoints/llm.py", line 1448, in _run_engine
    step_outputs = self.llm_engine.step()
  File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/v1/engine/llm_engine.py", line 241, in step
    outputs = self.engine_core.get_output()
  File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/v1/engine/core_client.py", line 668, in get_output
    raise self._format_exception(outputs) from None
vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(EngineCore_0 pid=542941) Process EngineCore_0:
(EngineCore_0 pid=542941) Traceback (most recent call last):
(EngineCore_0 pid=542941)   File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_0 pid=542941)     self.run()
(EngineCore_0 pid=542941)   File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/multiprocessing/process.py", line 108, in run
(EngineCore_0 pid=542941)   File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/v1/engine/core.py", line 712, in run_engine_core
(EngineCore_0 pid=542941)     raise e
(EngineCore_0 pid=542941)   File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/v1/engine/core.py", line 701, in run_engine_core
(EngineCore_0 pid=542941)     engine_core.run_busy_loop()
(EngineCore_0 pid=542941)   File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/v1/engine/core.py", line 728, in run_busy_loop
(EngineCore_0 pid=542941)     self._process_engine_step()
(EngineCore_0 pid=542941)   File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/v1/engine/core.py", line 753, in _process_engine_step
(EngineCore_0 pid=542941)     outputs, model_executed = self.step_fn()
(EngineCore_0 pid=542941)   File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/v1/engine/core.py", line 289, in step
(EngineCore_0 pid=542941)     model_output = self.execute_model_with_error_logging(
(EngineCore_0 pid=542941)   File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/v1/engine/core.py", line 275, in execute_model_with_error_logging
(EngineCore_0 pid=542941)     raise err
(EngineCore_0 pid=542941)   File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/v1/engine/core.py", line 266, in execute_model_with_error_logging
(EngineCore_0 pid=542941)     return model_fn(scheduler_output)
(EngineCore_0 pid=542941)   File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/v1/executor/abstract.py", line 95, in execute_model
(EngineCore_0 pid=542941)     output = self.collective_rpc("execute_model",
(EngineCore_0 pid=542941)   File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
(EngineCore_0 pid=542941)     answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_0 pid=542941)   File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/utils/__init__.py", line 3031, in run_method
(EngineCore_0 pid=542941)     return func(*args, **kwargs)
(EngineCore_0 pid=542941)   File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(EngineCore_0 pid=542941)     return func(*args, **kwargs)
(EngineCore_0 pid=542941)   File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/v1/worker/gpu_worker.py", line 362, in execute_model
(EngineCore_0 pid=542941)     output = self.model_runner.execute_model(scheduler_output,
(EngineCore_0 pid=542941)   File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(EngineCore_0 pid=542941)     return func(*args, **kwargs)
(EngineCore_0 pid=542941)   File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/v1/worker/gpu_model_runner.py", line 1488, in execute_model
(EngineCore_0 pid=542941)     max_query_len) = self._prepare_inputs(scheduler_output)
(EngineCore_0 pid=542941)   File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/v1/worker/gpu_model_runner.py", line 880, in _prepare_inputs
(EngineCore_0 pid=542941)     attn_metadata_i = (builder.build(
(EngineCore_0 pid=542941)   File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/v1/attention/backends/flex_attention.py", line 577, in build
(EngineCore_0 pid=542941)     out = FlexAttentionMetadata(
(EngineCore_0 pid=542941)   File "<string>", line 32, in __init__
(EngineCore_0 pid=542941)   File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/v1/attention/backends/flex_attention.py", line 511, in __post_init__
(EngineCore_0 pid=542941)     self.block_mask = self.build_block_mask()
(EngineCore_0 pid=542941)   File "/opt/hpcaas/.mounts/fs-0301404b74c8d22fd/home/muennighoff/vllm/vllm/v1/attention/backends/flex_attention.py", line 481, in build_block_mask
(EngineCore_0 pid=542941)     return create_block_mask_compiled(
(EngineCore_0 pid=542941) File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 655, in _fn
(EngineCore_0 pid=542941) return fn(*args, **kwargs)
(EngineCore_0 pid=542941) File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/nn/attention/flex_attention.py", line 824, in create_block_mask
(EngineCore_0 pid=542941) def create_block_mask(
(EngineCore_0 pid=542941) File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
(EngineCore_0 pid=542941) return fn(*args, **kwargs)
(EngineCore_0 pid=542941) File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py", line 1209, in forward
(EngineCore_0 pid=542941) return compiled_fn(full_args)
(EngineCore_0 pid=542941) File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 328, in runtime_wrapper
(EngineCore_0 pid=542941) all_outs = call_func_at_runtime_with_args(
(EngineCore_0 pid=542941) File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/utils.py", line 126, in call_func_at_runtime_with_args
(EngineCore_0 pid=542941) out = normalize_as_list(f(args))
(EngineCore_0 pid=542941) File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 689, in inner_fn
(EngineCore_0 pid=542941) outs = compiled_fn(args)
(EngineCore_0 pid=542941) File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 495, in wrapper
(EngineCore_0 pid=542941) return compiled_fn(runtime_args)
(EngineCore_0 pid=542941) File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_inductor/output_code.py", line 460, in __call__
(EngineCore_0 pid=542941) return self.current_callable(inputs)
(EngineCore_0 pid=542941) File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_inductor/compile_fx.py", line 1372, in run
(EngineCore_0 pid=542941) return compiled_fn(new_inputs)
(EngineCore_0 pid=542941) File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_inductor/cudagraph_trees.py", line 371, in deferred_cudagraphify
(EngineCore_0 pid=542941) return fn(inputs)
(EngineCore_0 pid=542941) File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_inductor/utils.py", line 2404, in run
(EngineCore_0 pid=542941) return model(new_inputs)
(EngineCore_0 pid=542941) File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_inductor/cudagraph_trees.py", line 1997, in run
(EngineCore_0 pid=542941) out = self._run(new_inputs, function_id)
(EngineCore_0 pid=542941) File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_inductor/cudagraph_trees.py", line 2175, in _run
(EngineCore_0 pid=542941) out = self.record_function(new_inputs, function_id)
(EngineCore_0 pid=542941) File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/_inductor/cudagraph_trees.py", line 2230, in record_function
(EngineCore_0 pid=542941) torch.cuda.synchronize()
(EngineCore_0 pid=542941) File "/home/muennighoff/miniconda3/envs/vllm/lib/python3.10/site-packages/torch/cuda/__init__.py", line 1040, in synchronize
(EngineCore_0 pid=542941) return torch._C._cuda_synchronize()
(EngineCore_0 pid=542941) RuntimeError: CUDA error: an illegal memory access was encountered
(EngineCore_0 pid=542941) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_0 pid=542941) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_0 pid=542941) Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
(EngineCore_0 pid=542941)
Processed prompts: 0%| | 0/4 [00:06<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
Signed-off-by: drisspg <drisspguessous@gmail.com>
@huydhn huydhn mentioned this pull request Aug 28, 2025
10 tasks
xiao-llm pushed a commit to xiao-llm/vllm that referenced this pull request Aug 28, 2025
Signed-off-by: drisspg <drisspguessous@gmail.com>
Signed-off-by: Xiao Yu <xiao.yu@amd.com>
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
Signed-off-by: drisspg <drisspguessous@gmail.com>
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Sep 3, 2025
Signed-off-by: drisspg <drisspguessous@gmail.com>
ekagra-ranjan pushed a commit to ekagra-ranjan/vllm that referenced this pull request Sep 4, 2025
Signed-off-by: drisspg <drisspguessous@gmail.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
@zongy17

zongy17 commented Sep 19, 2025

Hi @drisspg, I am wondering if Flex can be used with pipeline parallelism? It seems that setting --pipeline-parallel-size to more than one incurs errors.

FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: drisspg <drisspguessous@gmail.com>