
Conversation

@YangKai0616
Contributor

@YangKai0616 YangKai0616 commented Oct 30, 2025

What does this PR do?

The basic functionality of flash_attn2 for XPU is now implemented in kernels-community/flash-attn2, so this PR adds the corresponding call path to transformers.

Main contributions:

  1. Added flash_attn2 support for XPU;
  2. Enabled flash_attn2 UTs on XPU.

Now we can use attn_implementation="flash_attention_2" or attn_implementation="kernels-community/flash-attn" on XPU to invoke this feature.
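For reference, a minimal usage sketch (not part of the PR diff). The model id is a placeholder, and it assumes a PyTorch build with XPU support and a model whose attention layers support FA2:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-fa2-capable-model"  # placeholder model id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,                # FA2 expects fp16/bf16
    attn_implementation="flash_attention_2",  # or "kernels-community/flash-attn"
).to("xpu")

inputs = tokenizer("Hello from XPU!", return_tensors="pt").to("xpu")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```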

Note (CI):

  1. Some tests, such as XXX::test_flash_attn_2_inference_equivalence and XXX::test_flash_attn_2_equivalence, sometimes pass and sometimes fail due to non-determinism in the flash-attn2 kernel implementation. I have observed this behaviour on both CUDA (A100) and XPU; a reproduction sketch follows this list.
  2. For the test case tests/models/kosmos2/test_modeling_kosmos2.py::Kosmos2ModelTest::test_eager_matches_fa2_generate, CUDA triggers RuntimeError: cu_seqlens_q must have shape (batch_size + 1), while XPU aborts directly with Aborted (core dumped). This appears to be caused by limited robustness in the underlying data-reading function, but the root cause is an issue in the test case itself.
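A hedged reproduction sketch, assuming a transformers checkout and a supported device (CUDA or XPU); the llama test module and the -k filter are illustrative, not the exact tests used in CI:

```python
import os
import pytest

# Slow tests (including the FA2 equivalence tests) are skipped unless RUN_SLOW is set.
os.environ["RUN_SLOW"] = "1"

# Rerun the same FA2 equivalence tests a few times; with the non-determinism
# described above, individual runs may alternate between passing and failing.
for _ in range(5):
    pytest.main([
        "tests/models/llama/test_modeling_llama.py",  # illustrative model dir
        "-k", "flash_attn_2_inference_equivalence",
    ])
```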

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: bamba, diffllama, ernie4_5_moe, esm, glm4, gpt2, jamba, jetmoe, kosmos2_5, m2m_100, ministral, mixtral, modernbert, seed_oss, zamba, zamba2

@YangKai0616
Contributor Author

Once huggingface/kernels-community/pull/59 has been merged and the binary files have been uploaded to kernels-community/flash-attn2, we can begin the review. Cc @yao-matrix.
