[gpt-oss] raise error for flashinfer backend without trtllm #24482
Conversation
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Code Review
This pull request adds checks to prevent using attention sinks with the FlashInfer backend when TRT-LLM is not available, as this configuration is unsupported. My feedback focuses on improving the robustness of these checks by replacing assert statements with explicit NotImplementedError exceptions. This ensures that the checks are not bypassed when assertions are disabled, providing more reliable error handling for unsupported configurations.
raise NotImplementedError(
    "FlashInfer backend currently does not support attention "
    "sinks, please use trtllm on blackwell or flash attention on "
    "earlier GPUs.")
Possibly coordinate with #24470. We actually see an accuracy regression with 0.2.14, which is fixed in >=0.3.0.
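If the coordination ends up depending on the installed FlashInfer version, a guard along these lines could encode the >=0.3.0 requirement mentioned above; the distribution name `flashinfer-python` and the threshold are assumptions, not an existing vLLM helper.

```python
from importlib.metadata import PackageNotFoundError, version

from packaging.version import Version

# Assumption: 0.3.0 is the first release with the accuracy fix, per the
# comment above; adjust once the exact requirement is settled.
MIN_FLASHINFER = Version("0.3.0")


def flashinfer_is_new_enough() -> bool:
    # "flashinfer-python" is an assumed distribution name; the package may
    # be published under a different name in some environments.
    try:
        return Version(version("flashinfer-python")) >= MIN_FLASHINFER
    except PackageNotFoundError:
        return False
```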
LucasWilkinson left a comment:
LGTM; assuming this isn't addressed by: https://github.com/vllm-project/vllm/pull/24482/files#r2332475619
@LucasWilkinson I'm sure it is not addressed by https://github.com/vllm-project/vllm/pull/24482/files#r2332475619. We need to pass the sink to the flashinfer kernels and I'm waiting for an interface update about this.
…ject#24482) Signed-off-by: Chen Zhang <zhangch99@outlook.com>
## 📌 Description

This PR adds a wrapper class around the JiT implementation of `AttentionSink` (GPT-OSS style) for the FlashInfer backend `BatchPrefillWithPagedKVCacheWrapper`. The user should not be aware of the JiT args or backend selections.

## 🔍 Related Issues

This should back up vllm-project/vllm#24482

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

@yzh119 The current `use_sliding_window` implementation in both FA2 (https://github.com/flashinfer-ai/flashinfer/blob/bc29697ba20b7e6bdb728ded98f04788e16ee021/include/flashinfer/attention/variants.cuh#L89) and FA3 (https://github.com/flashinfer-ai/flashinfer/blob/bc29697ba20b7e6bdb728ded98f04788e16ee021/include/flashinfer/attention/hopper/utils.cuh#L40) does not consider `causal=False`, where qo_idx has no effect on the boundary of the sliding window. Some of the unit tests in `test_attention_sink.py` may fail due to this.

cc @heheda12345

---------

Co-authored-by: happierpig <zhaoyilong217@sjtu.edn.cn>
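For readers unfamiliar with GPT-OSS-style attention sinks, here is a minimal PyTorch reference of the math the kernel has to reproduce, assuming the common formulation in which a learnable per-head sink logit joins the softmax normalization and its probability mass is then discarded (masking and sliding windows omitted for brevity):

```python
import torch


def sdpa_with_sink(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                   sink: torch.Tensor) -> torch.Tensor:
    """Reference attention with a per-head sink logit.

    q, k, v: [num_heads, seq_len, head_dim]; sink: [num_heads].
    The sink participates in the softmax denominator but contributes no
    value, so it only drains probability mass from the real tokens.
    """
    scale = q.shape[-1] ** -0.5
    logits = torch.einsum("hqd,hkd->hqk", q, k) * scale            # [H, Q, K]
    sink_col = sink[:, None, None].expand(-1, logits.shape[1], 1)  # [H, Q, 1]
    probs = torch.softmax(torch.cat([logits, sink_col], dim=-1), dim=-1)
    probs = probs[..., :-1]                                        # drop sink
    return torch.einsum("hqk,hkd->hqd", probs, v)
```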
…ject#24482) Signed-off-by: Chen Zhang <zhangch99@outlook.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Purpose
The attention sink is not integrated into the FlashInfer backend yet, so raise an explicit error for this unsupported configuration.
Test Plan
VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve openai/gpt-oss-20b on Hopper
Test Result
Raises the newly added error.