[Feat] Add csrc/ascend NPU custom ops for GSA #729

Merged
ygwpz merged 28 commits into ModelEngine-Group:develop from leideng:gsa_ops_v2
Feb 6, 2026

Conversation

leideng (Contributor) commented Feb 3, 2026

Purpose

Merge all new Ascend NPU custom ops in csrc/ascend into the develop branch. These ops enable GSA on NPU devices by providing:

  • npu_hamming_dist_top_k — Hamming-distance-based top-K for important KV selection (GQA and MLA variants).
  • npu_reshape_and_cache_bnsd — Reshape-and-cache for BNSD (batch × num_heads × seq × dim) layout on NPU.

The implementation follows the vLLM-Ascend build system and is integrated into an independent Python package, ucm_custom_ops. Usage is as follows:

import torch
import ucm_custom_ops  # importing this registers the ops under torch.ops._C_ucm

torch.ops._C_ucm.npu_reshape_and_cache_bnsd(...)
torch.ops._C_ucm.npu_hamming_dist_top_k(...)

Modifications

New files were added under ucm/sparse/gsa_on_device/csrc/ascend and test/sparse/gsa:

| Area | Description |
| --- | --- |
| Torch bindings | torch_binding.cpp, torch_binding_meta.cpp — register both ops for PrivateUse1 (NPU) with meta implementations for shape inference and graph capture (see the sketch below). |
| Hamming dist top-K | hamming_dist_top_k/ — full op_host (tiling, split, proto) and op_kernel implementation. |
| Reshape and cache BNSD | reshape_and_cache_bnsd/ — op_host and op_kernel for BNSD reshape-and-cache. |
| Tests | test/sparse/gsa/test_reshape_graph.py — test script for op reshape_and_cache_bnsd. |
| Tests | test/sparse/gsa/test_hamming_gqa.py — test script for op hamming_dist_top_k in GQA mode. |
| Tests | test/sparse/gsa/test_hamming_mla.py — test script for op hamming_dist_top_k in MLA mode. |
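
On the meta registration: a meta implementation only computes output shapes and dtypes, which is what lets graph capture and compilation trace the ops without touching an NPU. The actual registration here is done in C++ (torch_binding_meta.cpp); the snippet below is a hypothetical Python-level analogue for the top-K op, and the (batch, topN) int32 output shape is an illustrative assumption, not the op's documented contract.

import torch

# Hypothetical Python-level analogue of the C++ meta registration in
# torch_binding_meta.cpp; assumes the op schema is already defined in the
# _C_ucm namespace by the compiled extension.
lib = torch.library.Library("_C_ucm", "IMPL")

def npu_hamming_dist_top_k_meta(hashq, hashk_cache, hashk_cache_rope,
                                top_n, seq_len, *optional_args):
    # Shape inference only: the assumed output is an int32 index tensor of
    # shape (batch, topN); no Hamming distances are computed here.
    batch = hashq.shape[0]
    return hashq.new_empty((batch, top_n), dtype=torch.int32)

lib.impl("npu_hamming_dist_top_k", npu_hamming_dist_top_k_meta, "Meta")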

NPU op APIs (summary)

  • npu_hamming_dist_top_k
    (hashq, hashkCache, hashkCacheRope, topN, seqLen, chunk_size?, max_seq_len?, sink?, recent?, support_offload?, key_block_table?, mask?, indices?) -> Tensor

  • npu_reshape_and_cache_bnsd
    (hashq, hashkCache, slot_mapping, seq_len, hashk_cache_out) -> Tensor
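
A minimal call sketch for the top-K op follows; every shape, dtype, and value here is an illustrative assumption (the authoritative contracts live in the op_host proto definitions):

import torch
import ucm_custom_ops  # registers torch.ops._C_ucm.*

# Illustrative placeholders only: real shapes and dtypes depend on the GSA
# configuration and cache layout.
hashq = torch.randint(0, 256, (4, 8, 16), dtype=torch.uint8, device="npu")
hashk_cache = torch.randint(0, 256, (1024, 8, 16), dtype=torch.uint8, device="npu")
hashk_cache_rope = torch.randint(0, 256, (1024, 8, 16), dtype=torch.uint8, device="npu")
seq_len = torch.tensor([512, 256, 640, 128], dtype=torch.int32, device="npu")

# Positional arguments follow the summary signature above; the optional
# arguments (chunk_size, max_seq_len, ...) are omitted.
indices = torch.ops._C_ucm.npu_hamming_dist_top_k(
    hashq, hashk_cache, hashk_cache_rope, 32, seq_len)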

Test

  • Unit / integration tests (eager and graph):
    • test/gsa/test_reshape_graph.py: test_reshape_and_cache_bnsd, test_reshape_and_cache_bnsd_graph
    • test/gsa/test_hamming_gqa.py: test_hamming_dist_top_k_graph and the eager path
    • test/gsa/test_hamming_mla.py: test_hamming_dist_top_k_mla_eager, test_hamming_dist_top_k_mla_graph
  • Build: from the repo root, bash csrc/ascend/build_aclnn.sh builds the custom op library and install_python_package.sh installs the wheel. You must also run source csrc/ascend/_ucm_ops_custom/vendors/ucm/bin/set_env.bash so that import ucm_custom_ops and torch.ops._C_ucm.* work on NPU; a quick registration check is sketched below.
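
As the registration check referenced above, here is a minimal sketch (not part of the test suite) that fails fast if the build or environment setup is wrong:

import torch
import ucm_custom_ops  # importing this loads the library and registers the ops

# If build_aclnn.sh, the wheel install, or set_env.bash was skipped, these
# attribute lookups fail immediately instead of erroring later at call time.
assert hasattr(torch.ops._C_ucm, "npu_reshape_and_cache_bnsd")
assert hasattr(torch.ops._C_ucm, "npu_hamming_dist_top_k")
print("ucm_custom_ops registered OK")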

Screenshots and logs from testing both ops are attached:

[screenshots: successful runs of both ops]

test_reshape_graph_successful.log
test_hamming_mla_successful.log
test_hamming_gqa_successful.log

In addition, I have run offline inference with the new NPU ops, which was successful.

[screenshots: offline inference results]

gsaondevice_02051759.log

@ygwpz ygwpz merged commit 183a263 into ModelEngine-Group:develop Feb 6, 2026
11 of 12 checks passed