[Attention][Platform] Refactor MLA to support Custom Op #23332
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; instead, only a reduced set of checks runs. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 …
Warning: Gemini encountered an error creating the review. You can try again by commenting /gemini review.
/gemini review
Code Review
This pull request refactors Multi-Head Latent Attention (MLA) into a CustomOp to improve extensibility for different hardware platforms. This is a positive change for modularity. However, the current implementation has some critical issues in the new MultiHeadLatentAttention class. Specifically, it is missing initialization for key attributes, which will cause runtime errors. Additionally, it incorrectly uses the total number of heads instead of the number of local heads in tensor-parallel contexts, which will lead to shape mismatches. I've provided detailed comments and suggestions to address these issues.
/gemini review
Code Review
This pull request refactors the Multi-Head Latent Attention (MLA) implementation into a CustomOp to better support different hardware backends and custom fused operators. A new file vllm/model_executor/layers/mla.py is introduced, which contains the MultiHeadLatentAttention custom op and an MLAModules dataclass to group related modules. The DeepseekV2MLAAttention class in deepseek_v2.py is updated to use this new custom op, simplifying its implementation. The changes are well-structured and align with the goal of improving modularity and extensibility. My main feedback is to add assertions in the new MultiHeadLatentAttention class to ensure that required modules are provided, which will improve robustness and prevent potential runtime errors.
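For readers following along, here is a condensed, hypothetical sketch of the shape described in the review above. Only the names MLAModules, MultiHeadLatentAttention, DeepseekV2MLAAttention, the file vllm/model_executor/layers/mla.py, and the assertion suggestion come from this thread; the field list and signatures are illustrative, not the actual PR code.

```python
# Illustrative sketch of the structure reviewed above; field names and
# signatures are assumptions, not copied from the PR.
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class MLAModules:
    # Groups the sub-modules that DeepseekV2MLAAttention previously wired up
    # inline; the concrete field list here is illustrative, not exhaustive.
    kv_a_layernorm: Optional[torch.nn.Module] = None
    rotary_emb: Optional[torch.nn.Module] = None
    o_proj: Optional[torch.nn.Module] = None


class MultiHeadLatentAttention(torch.nn.Module):  # a CustomOp in the real PR
    def __init__(self, mla_modules: MLAModules) -> None:
        super().__init__()
        # Per the review feedback: assert required modules are provided so a
        # misconfigured caller fails at construction time, not in forward().
        assert mla_modules.rotary_emb is not None, "rotary_emb is required"
        assert mla_modules.o_proj is not None, "o_proj is required"
        self.rotary_emb = mla_modules.rotary_emb
        self.o_proj = mla_modules.o_proj
```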
Force-pushed from 742db74 to 08296ea.
I'm not sure if this is the best approach. The CustomOp abstraction is meant as a simple abstraction to dispatch between torch implementations and custom/CUDA/platform kernels. For something as complex as MLA we should not use this.
Could you describe the fusion you want to do in more detail? Could this fusion in vllm-ascend instead be performed using a custom torch.compile pass?
EDIT: If you want to extract the MLA code from the model code, I think that's a great idea. But the CustomOp abstraction seems like the wrong tool for the job.
Thank you for the picture. Honestly, this is a common issue with attention: there can be a lot of ops hidden from torch.compile. There are two possible approaches:
Both approaches would require modifying the unified_attention op, and I'm not sure which is better, or whether either is feasible at all. Before we go down this rabbit hole, is Inductor supported on vllm-ascend? What about CUDA graphs?
Thanks for your attention. Currently, vLLM-Ascend does not support Inductor. We have enabled a graph mode whose functionality is very similar to CUDA Graph. At present, fusion passes in vLLM-Ascend cannot be performed through the Inductor-based mechanism implemented in vLLM; so far we have only performed kernel fusion through …
To make sure I correctly understand the second approach you mentioned, let me briefly describe the implementation. We would add a static method to the CustomOp class to check whether a given op name has already been registered. Then, in the abstracted MLA layer, we check whether the mla_outer op has been registered and, if so, call the forward_outer method. Does this align with what you had in mind? @ProExpertProg
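A minimal sketch of this idea: the names mla_outer and forward_outer follow the comment above, while the registry shape, is_registered, and the signatures are illustrative rather than the actual vLLM API.

```python
# Sketch only; not the real CustomOp implementation.
import torch


class CustomOp(torch.nn.Module):
    op_registry: dict[str, type] = {}  # populated when platforms register ops

    @staticmethod
    def is_registered(name: str) -> bool:
        # Proposed static method: has some platform registered `name`?
        return name in CustomOp.op_registry


class MultiHeadLatentAttention(CustomOp):
    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        if CustomOp.is_registered("mla_outer"):
            # A platform (e.g. vllm-ascend) supplied a fused outer impl.
            return self.forward_outer(hidden_states)
        return self.forward_native(hidden_states)

    def forward_native(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states  # placeholder for the default torch path

    def forward_outer(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states  # placeholder for the platform-registered path
```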
Not exactly, but it would still require Inductor fusions anyway. Does vllm-ascend support (or use) Dynamo at all? What about AotDispatcher? Because we could do passes without Inductor if AotDispatcher is used (so that a normalized, functional IR is produced). How are the CUDAGraph-equivalent graphs handled?
vllm-ascend supports and uses Dynamo. We only implement our own piecewise backend to receive and process the FX graph. Our graph mode also works through a capture-and-replay mechanism, just like CUDA Graph.
After chatting with @ProExpertProg, I think we agree that making layers pluggable (especially MLA, for the reasons you outlined) by different HW backends without the need for torch.compile is desirable; however, CustomOp might not be the right mechanism for this (per @ProExpertProg's comments). I'll be honest, I don't know that much about how the HW plugin infrastructure works, so maybe @youkaichao can weigh in here.
Yeah, to elaborate, I think a custom pass mechanism to perform fusion would be good in vllm-ascend, because plugging layers and fusing manually inside them will quickly become unmaintainable due to duplicated logic. Manual fusion would also suffer from the same problems that motivated vLLM's custom passes, and fusions across layers would still be difficult. However, I understand that might be too large an undertaking for this case. I still believe …
Thank you for your input! I completely agree that we need a more lightweight PluggableLayer to support custom model layers; this would significantly enhance the extensibility of the vLLM project through plugins. However, implementing PluggableLayer effectively requires more detailed design work, so we will propose an RFC to describe and further discuss this direction. For the current MLA-related issues, I believe we can address them with the Custom Op approach for now.
@LucasWilkinson Let me give a brief elaboration of how the hardware plugin mechanism works:
0. Register platform: vllm serve starts, loads the plugin, and registers the current platform.
1. Register worker: vLLM resolves the platform's worker class (Line 3697 in 308fa28).
2. Init worker: vLLM initializes the worker, which in turn initializes the platform's worker (vllm/worker/worker_base.py, Line 556 in 308fa28).
3. Register custom ops: the hardware plugin's worker (such as vllm-ascend's) calls … to register its own ops.
After all the steps above, the ops are replaced by the hardware plugin's own ops. Actually, the current custom-op mechanism is relatively flexible and supports registration at the forward, layer, and op levels. So, from the perspective of a hardware plugin, I think custom ops are an acceptable way to go; that was also the original intention of #19164.
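To illustrate step 3 only, here is a rough, self-contained sketch of the pattern. All names (OP_REGISTRY, AscendRMSNorm, register_platform_ops) are hypothetical stand-ins, not actual vLLM or vllm-ascend code.

```python
# Illustrative only: the registry and classes below are hypothetical. The
# point is the shape of step 3: at worker init, the plugin maps an op name to
# its own implementation, and layers constructed later resolve that name.
import torch

OP_REGISTRY: dict[str, type] = {}  # stand-in for the custom-op registry


class AscendRMSNorm(torch.nn.Module):  # hypothetical plugin-side override
    def __init__(self, eps: float = 1e-6) -> None:
        super().__init__()
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # In a real plugin this would call a fused NPU kernel; here we just
        # compute RMSNorm with plain torch ops as a placeholder.
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)


def register_platform_ops() -> None:
    # Hypothetical hook called from the plugin's worker during init (step 3):
    # after this runs, layers built later pick up the plugin's class.
    OP_REGISTRY["rms_norm"] = AscendRMSNorm
```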
I guess both paths (the frontend pluggable-layer abstraction and the backend graph fusion) can be useful. The former is direct and simpler, but we have to maintain compatibility of the layer semantics; therefore, the supported layers had better be stable and well-defined, e.g. MLA and fused MoE. The latter is more flexible and general, but it depends on the …
From vllm-ascend, we are trying to enable some graph fusion passes by extending the vLLM core, but the compilation backend currently seems bound to the "inductor" backend, which not every HW backend supports. In vllm-ascend the Inductor backend is not yet enabled, and we are adding direct FX-based graph fusion. For now we have to monkey-patch to plug in our own passes (see the rmsnorm+quant fusion PR from …). In summary, it would be great if …
Would love to get your inputs. @ProExpertProg
PS: For "2", perhaps we can manually apply the AotDispatch passes and invoke those vLLM "inductor" passes from inside the custom compiler backend. Not sure if that would work well, though.
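As a rough illustration of the direct FX-based fusion idea mentioned above, here is a minimal sketch using plain PyTorch APIs; the pass is a no-op placeholder and none of this reflects vLLM's or vllm-ascend's actual pass infrastructure.

```python
import torch
from torch.fx import GraphModule


def fuse_pass(gm: GraphModule) -> GraphModule:
    # Placeholder for a real pattern-matching pass (e.g. rmsnorm + quant):
    # walk gm.graph, match the node pattern, replace it with a fused op.
    gm.recompile()
    return gm


def fx_backend(gm: GraphModule, example_inputs):
    # Custom torch.compile backend: run our own FX passes, then return the
    # (eager) GraphModule directly instead of handing it to Inductor.
    return fuse_pass(gm).forward


compiled = torch.compile(torch.nn.Linear(8, 8), backend=fx_backend)
print(compiled(torch.randn(2, 8)).shape)
```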
@Yikun Thanks for the description! I think the main downside of using …
I am not up to date on the intentions behind CustomOp or the HW plugins, so I will defer to the experts for final approval (cc @ProExpertProg @youkaichao). Left a more MLA-related comment (happy to review the MLA-related portions, although there's not much change there, haha).
Force-pushed from addc5f9 to c30506d.
@whx-sjtu is this done?
Yes, I think we can run CI and merge this PR.
Force-pushed from 84aa004 to 3f6e479.
I reviewed this again, and I also think it's ready to go.
Force-pushed from 5d61664 to 7f45a92.
Force-pushed from 7f45a92 to 88e34dd.
Signed-off-by: whx-sjtu <2952154980@qq.com>
Force-pushed from 88e34dd to b0fb75f.
If there are still CI issues here, we can force-merge this.

Purpose
MLA (Multi-Head Latent Attention) is a complex layer with significant optimization potential, and different hardware backends may require distinct optimization approaches—such as operator fusion, multi-streaming, etc.
The current abstraction in vLLM struggles to accommodate scenarios that require integrating large-grained fused operators from different platforms. For example, on the Ascend platform we need to integrate a fused operator that combines rms_norm, rope, and concat_and_cache_mla. However, in the current community implementation, rms_norm and rope are called within DeepSeek's modeling code, so without modifying the model itself it is impossible to integrate such custom fused operators. Therefore, we hope to enable MLA as a Custom Op to support customizable extensions for various platforms.
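To make the motivation concrete, here is a minimal, hypothetical sketch of the difference between the default path and a platform path with one fused kernel; the function and method names are illustrative, not the actual Ascend operator or the PR's code.

```python
import torch


def fused_norm_rope_cache(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for a single platform kernel covering rms_norm + rope +
    # concat_and_cache_mla; here it just normalizes as a placeholder.
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + 1e-6)


class MLASketch(torch.nn.Module):
    def forward_native(self, x: torch.Tensor) -> torch.Tensor:
        # Default path: the prologue runs as separate ops inside the layer.
        x = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + 1e-6)  # rms_norm
        # ... rope and concat_and_cache_mla would follow here ...
        return x

    def forward_oot(self, x: torch.Tensor) -> torch.Tensor:
        # Platform path: the whole prologue becomes one fused kernel call.
        return fused_norm_rope_cache(x)
```

With MLA exposed as a Custom Op, a platform can provide the fused path without touching the DeepSeek modeling code.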
Additionally, considering that other models might also benefit from MLA optimizations—for instance, we recently observed community demand for integrating MLA into models like Qwen—decoupling the MLA layer from modeling would better serve such use cases and facilitate broader adoption.
Test Plan
No new tests need to be added.
Test Result
All existing tests should pass.
(Optional) Documentation Update
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.