[gpt-oss] raise error for flashinfer backend without trtllm #24482
Conversation
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Code Review
This pull request adds checks to prevent using attention sinks with the FlashInfer backend when TRT-LLM is not available, as this configuration is unsupported. My feedback focuses on improving the robustness of these checks by replacing assert statements with explicit NotImplementedError exceptions. This ensures that the checks are not bypassed when assertions are disabled, providing more reliable error handling for unsupported configurations.
raise NotImplementedError(
    "FlashInfer backend currently does not support attention "
    "sinks, please use trtllm on blackwell or flash attention on "
    "earlier GPUs.")
Possibly coordinate with #24470. We actually see an accuracy regression with 0.2.14, which is fixed in >=0.3.0.
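If the coordination ends up depending on the installed FlashInfer version, a guard along these lines could encode the >=0.3.0 requirement mentioned above; the distribution name `flashinfer-python` and the threshold are assumptions, not an existing vLLM helper.

```python
from importlib.metadata import PackageNotFoundError, version

from packaging.version import Version

# Assumption: 0.3.0 is the first release with the accuracy fix, per the
# comment above; adjust once the exact requirement is settled.
MIN_FLASHINFER = Version("0.3.0")


def flashinfer_is_new_enough() -> bool:
    # "flashinfer-python" is an assumed distribution name; the package may
    # be published under a different name in some environments.
    try:
        return Version(version("flashinfer-python")) >= MIN_FLASHINFER
    except PackageNotFoundError:
        return False
```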
LucasWilkinson left a comment:
LGTM; assuming this isn't addressed by: https://github.com/vllm-project/vllm/pull/24482/files#r2332475619
@LucasWilkinson I'm sure it is not addressed by https://github.com/vllm-project/vllm/pull/24482/files#r2332475619. We need to pass the sink to the flashinfer kernels and I'm waiting for an interface update about this.
…ject#24482) Signed-off-by: Chen Zhang <zhangch99@outlook.com>
## 📌 Description

This PR adds a wrapper class around the JiT implementation of `AttentionSink` (GPT-OSS style) for the FlashInfer backend `BatchPrefillWithPagedKVCacheWrapper`. The user should not be aware of the JiT args or backend selections.

## 🔍 Related Issues

This should back up vllm-project/vllm#24482

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

@yzh119 The current `use_sliding_window` implementation in both FA2 (https://github.com/flashinfer-ai/flashinfer/blob/bc29697ba20b7e6bdb728ded98f04788e16ee021/include/flashinfer/attention/variants.cuh#L89) and FA3 (https://github.com/flashinfer-ai/flashinfer/blob/bc29697ba20b7e6bdb728ded98f04788e16ee021/include/flashinfer/attention/hopper/utils.cuh#L40) does not consider `causal=False`, where qo_idx has no effect on the boundary of the sliding window. Some of the unit tests in `test_attention_sink.py` may fail due to this.

cc @heheda12345

---------

Co-authored-by: happierpig <zhaoyilong217@sjtu.edn.cn>
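For readers unfamiliar with GPT-OSS-style attention sinks, here is a minimal PyTorch reference of the math the kernel has to reproduce, assuming the common formulation in which a learnable per-head sink logit joins the softmax normalization and its probability mass is then discarded (masking and sliding windows omitted for brevity):

```python
import torch


def sdpa_with_sink(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                   sink: torch.Tensor) -> torch.Tensor:
    """Reference attention with a per-head sink logit.

    q, k, v: [num_heads, seq_len, head_dim]; sink: [num_heads].
    The sink participates in the softmax denominator but contributes no
    value, so it only drains probability mass from the real tokens.
    """
    scale = q.shape[-1] ** -0.5
    logits = torch.einsum("hqd,hkd->hqk", q, k) * scale            # [H, Q, K]
    sink_col = sink[:, None, None].expand(-1, logits.shape[1], 1)  # [H, Q, 1]
    probs = torch.softmax(torch.cat([logits, sink_col], dim=-1), dim=-1)
    probs = probs[..., :-1]                                        # drop sink
    return torch.einsum("hqk,hkd->hqd", probs, v)
```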
…ject#24482) Signed-off-by: Chen Zhang <zhangch99@outlook.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Purpose
The attention sink is not integrated into the FlashInfer backend yet, so raise an explicit error for this unsupported configuration.
Test Plan
VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve openai/gpt-oss-20b on Hopper
Test Result
Raises the newly added error.