
Conversation

Collaborator

@heheda12345 heheda12345 commented Sep 9, 2025

Purpose

Attention sinks are not yet integrated into the FlashInfer backend. Raise an error when this configuration is requested.

Test Plan

`VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve openai/gpt-oss-20b` on Hopper

Test Result

The newly added error is raised.

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds checks to prevent using attention sinks with the FlashInfer backend when TRT-LLM is not available, since that configuration is unsupported. My feedback focuses on making these checks more robust by replacing assert statements with explicit NotImplementedError exceptions, so the checks still fire when assertions are disabled and unsupported configurations get a reliable error.
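To illustrate the assert-versus-exception point, a minimal sketch (my own, not the PR's actual code; the helper name and arguments are made up):

```python
def check_sinks_supported(sinks, can_use_trtllm: bool) -> None:
    # An `assert sinks is None or can_use_trtllm` would be stripped when
    # Python runs with -O, silently letting the unsupported configuration
    # through. An explicit exception cannot be optimized away:
    if sinks is not None and not can_use_trtllm:
        raise NotImplementedError(
            "Attention sinks are not supported by the FlashInfer backend "
            "without TRT-LLM attention.")
```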

Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Comment on lines +418 to +421
raise NotImplementedError(
"FlashInfer backend currently does not support attention "
"sinks, please use trtllm on blackwell or flash attention on "
"earlier GPUs.")
Collaborator


Possibly coordinate with #24470. We actually see an accuracy regression with 0.2.14, which is fixed in >=0.3.0.

Collaborator

@LucasWilkinson LucasWilkinson left a comment


@heheda12345
Collaborator Author

@LucasWilkinson I'm sure it is not addressed by https://github.com/vllm-project/vllm/pull/24482/files#r2332475619. We need to pass the sinks to the FlashInfer kernels, and I'm waiting for an interface update for that.
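For reference, a minimal non-kernel sketch of the GPT-OSS-style sink formulation this refers to (my illustration, not vLLM or FlashInfer code): a learned per-head sink logit joins the softmax, which is why the sink needs to reach the FlashInfer kernels.

```python
import torch


def sink_attention_reference(q, k, v, sink_logit, scale):
    """Reference sketch of GPT-OSS-style attention sinks: a learned
    per-head scalar joins the softmax, so the weights over real tokens
    sum to less than one.

    Shapes: q [h, q_len, d]; k, v [h, kv_len, d]; sink_logit [h].
    """
    scores = torch.einsum("hqd,hkd->hqk", q, k) * scale           # [h, q, kv]
    sink = sink_logit.view(-1, 1, 1).expand(-1, scores.shape[1], 1)
    logits = torch.cat([scores, sink], dim=-1)                    # extra column
    probs = torch.softmax(logits, dim=-1)[..., :-1]               # drop sink
    return torch.einsum("hqk,hkd->hqd", probs, v)
```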

@heheda12345 heheda12345 added the ready label (ONLY add when PR is ready to merge/full CI is needed) Sep 9, 2025
@LucasWilkinson LucasWilkinson enabled auto-merge (squash) September 9, 2025 16:37
@simon-mo simon-mo merged commit b5e383c into vllm-project:main Sep 10, 2025
35 of 40 checks passed
skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025
yzh119 pushed a commit to flashinfer-ai/flashinfer that referenced this pull request Sep 15, 2025

## 📌 Description
This PR adds a wrapper class around the JIT implementation of
`AttentionSink` (GPT-OSS style) for the FlashInfer backend's
`BatchPrefillWithPagedKVCacheWrapper`. The user should not need to be aware
of the JIT args or backend selection.
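
As an illustration of that design goal (a facade that keeps JIT arguments and backend selection internal), a rough sketch; class, method, and argument names here are illustrative, not FlashInfer's actual new API:

```python
from typing import Any, Protocol


class _InnerWrapper(Protocol):
    # Stand-in for the paged-KV prefill wrapper plus the JIT sink variant;
    # only the calls the facade forwards are modeled.
    def plan(self, *args: Any, **kwargs: Any) -> None: ...
    def run(self, q: Any, paged_kv_cache: Any, *extra: Any) -> Any: ...


class AttentionSinkPrefillFacade:
    """Illustrative facade: callers keep a plan()/run() flow while JIT
    arguments and backend selection stay internal."""

    def __init__(self, inner: _InnerWrapper, sinks: Any) -> None:
        self._inner = inner    # real code would build the JIT variant here
        self._sinks = sinks    # per-head sink logits, stored once

    def plan(self, *args: Any, **kwargs: Any) -> None:
        # Planning metadata passes straight through; nothing sink-specific
        # leaks to the caller.
        self._inner.plan(*args, **kwargs)

    def run(self, q: Any, paged_kv_cache: Any) -> Any:
        # Hypothetical: the sink logits are injected as an extra kernel
        # argument without the caller ever seeing JIT details.
        return self._inner.run(q, paged_kv_cache, self._sinks)
```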

## 🔍 Related Issues
This backs up vllm-project/vllm#24482.

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes
@yzh119 The current `use_sliding_window` implementations in both
FA2 (https://github.com/flashinfer-ai/flashinfer/blob/bc29697ba20b7e6bdb728ded98f04788e16ee021/include/flashinfer/attention/variants.cuh#L89)
and
FA3 (https://github.com/flashinfer-ai/flashinfer/blob/bc29697ba20b7e6bdb728ded98f04788e16ee021/include/flashinfer/attention/hopper/utils.cuh#L40)
do not consider `causal=False`, where `qo_idx` has no effect on the boundary
of the sliding window. Some of the unit tests in `test_attention_sink.py` may
fail because of this.
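
To make the concern concrete, a toy boundary sketch (my illustration, not the linked FA2/FA3 code; the non-causal case assumes the window is anchored to the end of the KV sequence):

```python
def causal_window_allows(qo_idx: int, kv_idx: int, window: int) -> bool:
    # Causal + sliding window: query qo_idx attends to the last `window`
    # keys at or before its own position, so the boundary moves with qo_idx.
    return qo_idx - window < kv_idx <= qo_idx


def non_causal_window_allows(kv_idx: int, kv_len: int, window: int) -> bool:
    # causal=False: every query sees the same last `window` keys; qo_idx
    # should not shift the boundary, which is the case the linked
    # implementations do not yet handle.
    return kv_idx >= kv_len - window
```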

cc @heheda12345

---------

Co-authored-by: happierpig <zhaoyilong217@sjtu.edn.cn>
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
…ject#24482)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
…ject#24482)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>

Labels

gpt-oss (Related to GPT-OSS models), ready (ONLY add when PR is ready to merge/full CI is needed), v1
