[V1][Spec Decode] Handle draft tokens beyond max_model_len #16087
WoosukKwon merged 12 commits into main
Conversation
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Instead, only a small, essential subset of CI tests runs automatically to quickly catch errors. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
comaniac left a comment:
Overall LGTM. Approving first to unblock the PR. Meanwhile, it would be good to have a unit test for this. Also, do we know the overhead of the introduced ops (e.g., torch.where)?
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
vllm/v1/spec_decode/eagle.py (outdated):
    attn_metadata.slot_mapping = torch.where(
        exceeds_max_model_len,
        PADDING_SLOT_ID,
        attn_metadata.slot_mapping,
    )
ekagra-ranjan left a comment:
My understanding is that torch.where will allocate an intermediate tensor and then assign it.
Is it possible to use attn_metadata.slot_mapping[exceeds_max_model_len] = PADDING_SLOT_ID so that it's an in-place operation?
@ekagra-ranjan Thanks for the suggestion. I changed it to masked_fill_, which is also an in-place operation.
Overall, I think the performance impact will be small since the tensors here are small (shape of [batch_size]).
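For illustration, here is a minimal, self-contained sketch (not the PR's actual code; PADDING_SLOT_ID's value and the tensor contents are made up) contrasting the out-of-place torch.where with the in-place masked_fill_ on a [batch_size]-shaped slot mapping:

```python
import torch

PADDING_SLOT_ID = -1  # stand-in value; the real sentinel lives in vLLM

# Toy [batch_size] tensors standing in for attn_metadata.slot_mapping
# and the mask of requests whose drafts would run past max_model_len.
slot_mapping = torch.tensor([10, 11, 12, 13], dtype=torch.int64)
exceeds_max_model_len = torch.tensor([False, True, False, True])

# Out-of-place: torch.where materializes a new [batch_size] tensor.
padded = torch.where(
    exceeds_max_model_len,
    torch.tensor(PADDING_SLOT_ID, dtype=torch.int64),
    slot_mapping,
)

# In-place: masked_fill_ overwrites the masked entries directly,
# avoiding the intermediate allocation.
slot_mapping.masked_fill_(exceeds_max_model_len, PADDING_SLOT_ID)

assert torch.equal(slot_mapping, padded)
```

Either way the tensors are only [batch_size] elements, so, as noted above, the difference is unlikely to be measurable.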
    # out-of-range access during the model execution. The draft tokens
    # generated with this adjustment should be ignored.
Can you please point me to the logic of ignoring such draft tokens in this PR?
Good question. The scheduler handles it: see vllm/v1/core/sched/scheduler.py, lines 188 to 193 (at af7462b).
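For readers who don't want to chase the permalink: the idea is that the scheduler never schedules a request past max_model_len, so draft tokens proposed beyond the limit are simply dropped. A hypothetical sketch of that kind of cap (function and variable names are illustrative, not the actual scheduler code):

```python
def cap_new_tokens(num_computed_tokens: int,
                   num_draft_tokens: int,
                   max_model_len: int) -> int:
    """Illustrative only: limit the tokens scheduled for a request so
    that computed + scheduled never exceeds max_model_len. Draft tokens
    that would fall beyond the limit are not scheduled, i.e. ignored."""
    budget = max_model_len - num_computed_tokens
    return min(num_draft_tokens, max(budget, 0))

# Example: 2 draft tokens proposed, but only 1 slot left before the limit.
assert cap_new_tokens(2047, 2, 2048) == 1
```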
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
@comaniac Good point. Added a test. Also, I replaced torch.where with the in-place masked_fill_.
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
…ect#16087) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Signed-off-by: Frieda (Jingying) Huang <jingyingfhuang@gmail.com>
…ect#16087) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
…ect#16087) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
…ect#16087) Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>
Implements item 4, "Handle the edge cases like when the draft model generates beyond max_pos_embeddings," from #15901.
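Putting the pieces of the diff together, the overall technique is roughly: clamp the positions of draft tokens that would land at or beyond max_model_len so position-dependent ops (e.g. rotary embeddings) see valid indices, redirect their KV-cache writes to a padding slot, and let the scheduler discard the resulting drafts. A hedged sketch of that idea (the clamp target and sentinel value are assumptions, not the PR's exact choices):

```python
import torch

PADDING_SLOT_ID = -1  # assumed sentinel; a scratch slot no request reads

def handle_drafts_beyond_limit(positions: torch.Tensor,
                               slot_mapping: torch.Tensor,
                               max_model_len: int) -> torch.Tensor:
    """Sketch of the PR's idea: for draft tokens whose position would
    exceed max_model_len, (1) clamp the position to stay in range and
    (2) point the KV-cache write at a padding slot so no live cache
    entry is clobbered. Returns the mask so callers can ignore the
    drafts generated for these tokens."""
    exceeds_max_model_len = positions >= max_model_len
    positions.masked_fill_(exceeds_max_model_len, max_model_len - 1)
    slot_mapping.masked_fill_(exceeds_max_model_len, PADDING_SLOT_ID)
    return exceeds_max_model_len
```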