CPU flash attention with mask #112381
Conversation
This PR needs a
So we expect PyTorch to have two different flash attention APIs. I think I should commit the changes after #110546 merges; is that okay?
For the support of attention mask, I suppose that using Maybe it's better to implement cc @jgong5
Agreed with @Valentine233. Overloading the meaning of
Hi @drisspg, do you have any opinion on the API to support CPU flash attention with mask?
Hey, so like @jgong5 said, I think it doesn't really make sense to add a new backend if we are going to be modifying the existing kernel code. If it makes it easier, we can decouple the signature of the kernel from CPU and CUDA. We should probably register a new function and call it "sdpa_fused_attention" or something, where we only register a CPU backend. I think this would also clean up the meta registration for SDPA, as we won't be forcing the CPU code to abide by the constraints of the CUDA code.
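For illustration only, here is a rough Python-level sketch (not part of this PR) of registering an operator with only a CPU implementation via torch.library. The name "sdpa_fused_attention" is taken from the comment above; the library name "mylib", the schema, and the delegating implementation are assumptions.

```python
import torch
import torch.nn.functional as F
from torch.library import Library

# Hypothetical library and schema, for illustration only; a real kernel
# would be registered from ATen rather than delegating back to SDPA.
lib = Library("mylib", "DEF")
lib.define(
    "sdpa_fused_attention(Tensor query, Tensor key, Tensor value, "
    "Tensor? attn_mask=None, float dropout_p=0.0, bool is_causal=False) -> Tensor"
)

def sdpa_fused_attention_cpu(query, key, value, attn_mask=None,
                             dropout_p=0.0, is_causal=False):
    # Placeholder CPU implementation: forward to the public SDPA entry point.
    return F.scaled_dot_product_attention(
        query, key, value, attn_mask=attn_mask,
        dropout_p=dropout_p, is_causal=is_causal)

# Register an implementation for the CPU dispatch key only.
lib.impl("sdpa_fused_attention", sdpa_fused_attention_cpu, "CPU")

q = k = v = torch.randn(2, 4, 8, 16)
out = torch.ops.mylib.sdpa_fused_attention(q, k, v)
```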
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as
Closing in favor of: #115913
Fixes #112380
As described in the issue, this PR adds mask support to the flash attention implementation on CPU.
First, the mask is supported by using cum_seq_[q/kv], so the code for flashAttentionKernel is modified.
Second, a user-facing interface needs to be added, so I choose to overload the _scaled_dot_product_flash_attention function.
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10
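Not part of the PR itself, but a minimal sketch of the behavior a masked CPU path is expected to match, written against the public scaled_dot_product_attention API; the tensor shapes and the additive mask below are arbitrary choices for illustration.

```python
import math
import torch
import torch.nn.functional as F

# Query/key/value laid out as (batch, heads, seq_len, head_dim).
q = torch.randn(2, 4, 16, 32)
k = torch.randn(2, 4, 16, 32)
v = torch.randn(2, 4, 16, 32)

# Additive attention mask, broadcast over heads; block the last 8 key positions.
mask = torch.zeros(2, 1, 16, 16)
mask[..., 8:] = float("-inf")

out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

# Reference computation: softmax(QK^T / sqrt(d) + mask) V.
scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1)) + mask
ref = torch.softmax(scores, dim=-1) @ v
print(torch.allclose(out, ref, atol=1e-5))
```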