
Conversation

Contributor

@drisspg drisspg commented Oct 19, 2022

Summary

Use the private _scaled_dot_product_attention to support _native_multi_head_attention. SDPA provides access to fused kernels when certain conditions are met, enabling a speedup for MHA.
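
For context, here is a minimal sketch of what the change amounts to, written against the now-public torch.nn.functional.scaled_dot_product_attention rather than the private at::_scaled_dot_product_attention this PR calls from C++; the packed qkv projection, helper name, and shapes are illustrative assumptions, not the attention.cpp code.

```python
import torch
import torch.nn.functional as F

def mha_via_sdpa(x, qkv_weight, out_weight, num_heads):
    # x: (batch, seq, embed_dim); qkv_weight: (3 * embed_dim, embed_dim)
    B, S, D = x.shape
    head_dim = D // num_heads
    # One packed projection produces q, k, v in a single matmul.
    q, k, v = F.linear(x, qkv_weight).chunk(3, dim=-1)
    # Split heads: (B, S, D) -> (B, num_heads, S, head_dim).
    q, k, v = (t.view(B, S, num_heads, head_dim).transpose(1, 2) for t in (q, k, v))
    # Fused kernels (FlashAttention / memory-efficient) are selected here when eligible.
    y = F.scaled_dot_product_attention(q, k, v)
    y = y.transpose(1, 2).reshape(B, S, D)
    return F.linear(y, out_weight)
```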

cc @cpuhrsch @jbschlosser @bhosmer @mikaylagawarecki


pytorch-bot bot commented Oct 19, 2022

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/87312

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 1 Pending

As of commit 100c3c4:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.


const auto dim_per_head = D / num_head;

if (query.is_same(key) && key.is_same(value) && !need_weights) {
Contributor

@cpuhrsch cpuhrsch Oct 19, 2022


By using qkv_projection below, we might be able to broaden the applicability of _scaled_dot_product_attention to just the !need_weights case.
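
A rough sketch of the two gates being compared, in Python pseudocode rather than the attention.cpp logic (can_use_fused_sdpa is a made-up name for illustration):

```python
def can_use_fused_sdpa(query, key, value, need_weights):
    # Current condition: self-attention (q, k, v are the same tensor) and no weights requested.
    strict = query is key and key is value and not need_weights
    # Suggested broadening: once q, k, v are projected separately (qkv_projection),
    # the fused path could apply whenever attention weights are not requested.
    broadened = not need_weights
    return strict, broadened
```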

@drisspg drisspg added module: nestedtensor NestedTensor tag see issue #25032 ciflow/trunk Trigger trunk jobs on your pull request labels Oct 19, 2022
@drisspg drisspg force-pushed the Use_scaled_dot_product_attention_within_attention.cpp branch from dec4bf4 to 596e945 Compare October 20, 2022 16:07
Comment on lines 402 to 410
chunks[0] = (chunks[0].view({x_size_0, -1, num_head, sdp_dim_per_head}))
.transpose(1, 2);
chunks[1] = (chunks[1].view({x_size_0, -1, num_head, sdp_dim_per_head}))
.transpose(1, 2);
chunks[2] = (chunks[2].view({x_size_0, -1, num_head, sdp_dim_per_head}))
.transpose(1, 2);

auto y = at::_scaled_dot_product_attention(
chunks[0], chunks[1], chunks[2], mask, 0.0, need_weights, false);
Member


Note that the last implementation from @danthe3rd allows passing the tensors without having to reshape anything, avoiding a call to contiguous both before and after the operator. This can provide significant speedups, by the way.
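
A small self-contained illustration of the cost being referred to (shapes are arbitrary):

```python
import torch

# Splitting heads via view + transpose yields a non-contiguous tensor, so a kernel
# that requires contiguous inputs forces an extra copy before the call (and again
# after the inverse transpose on the way out).
x = torch.randn(2, 16, 8 * 64)                # (batch, seq, num_heads * head_dim)
heads = x.view(2, 16, 8, 64).transpose(1, 2)  # (batch, num_heads, seq, head_dim)
print(heads.is_contiguous())                  # False -> .contiguous() would copy
```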

@facebook-github-bot
Contributor

@drisspg has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@drisspg drisspg force-pushed the Use_scaled_dot_product_attention_within_attention.cpp branch 2 times, most recently from 5f42406 to d64515c Compare October 25, 2022 21:57
Contributor Author


We could also error out here

Contributor


I think that's a good idea for now if there's no risk we'll take this path from attention.cpp

@drisspg drisspg force-pushed the Use_scaled_dot_product_attention_within_attention.cpp branch from e5a3ce9 to 7f206e1 Compare October 25, 2022 22:40
#if BETTER_TRANSFORMER_USE_FLASH_ATTENTION
}
#endif
x = std::get<0>(at::_native_multi_head_attention(
Contributor Author


I think this needs to be here, before the avoided dispatch. _native_multi_head_attention previously worked for all tensor subclasses, so we now go through the dispatcher to dispatch correctly to CPU or CUDA.

Contributor

@danthe3rd danthe3rd left a comment


I'm wondering if we need some more tests for this specific behavior (e.g. a nested tensor where one component has seq_len = 1).
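
As a sketch of the suggested case, here is how such a test input could be built with the public torch.nested API; the real test would push it through _native_multi_head_attention / the fused SDPA path, which is not reproduced here, so shapes and names are illustrative.

```python
import torch

embed_dim = 16
seq_lens = [1, 5, 7]  # note the length-1 entry
x = torch.nested.nested_tensor(
    [torch.randn(s, embed_dim) for s in seq_lens])

# Each component keeps its own sequence length, including the degenerate 1.
assert [t.size(0) for t in x.unbind()] == seq_lens
```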

tensor_stride_ptr[(i - 1) * tensor_stride_0];
// TODO: When 0 seq_len nested tensors are allowed we need to guard against this
int64_t previous_numel = tensor_size_ptr[(i - 1) * tensor_stride_0] * tensor_stride_ptr[(i - 1) * tensor_stride_0];
int64_t current_offset_constant = (tensor_offsets[i] - tensor_offsets[i - 1]) / previous_numel;
Contributor


Should we add a check that (tensor_offsets[i] - tensor_offsets[i - 1]) % previous_numel == 0 before dividing here - and also before the loop?

Contributor Author


https://github.com/drisspg/pytorch/blob/029ab31c0602683821ebb6d9ce78be1fa70770f7/aten/src/ATen/native/transformers/cuda/sdp_utils.h#L65

Yes, we very much do, but I spoke to Christian and this PR essentially neuters the SDP inclusion in native MHA; a follow-up PR will then expand the scope.

@drisspg drisspg force-pushed the Use_scaled_dot_product_attention_within_attention.cpp branch from 029ab31 to 6c6a8ac Compare October 26, 2022 17:02
@pytorch pytorch deleted a comment from cpuhrsch Oct 26, 2022
@facebook-github-bot
Contributor

@drisspg has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.


@drisspg drisspg force-pushed the Use_scaled_dot_product_attention_within_attention.cpp branch from 4c9c905 to 18dc15c Compare October 27, 2022 22:30
@drisspg drisspg force-pushed the Use_scaled_dot_product_attention_within_attention.cpp branch from 619f511 to 0aef791 Compare October 28, 2022 17:25
need_weights=need_weights,
average_attn_weights=average_attn_weights,
)
def test_native_multihead_self_attention(self, device, dtype, use_nt, need_weights, average_attn_weights, use_padding=False, pad_all=False
Contributor Author


This test is currently failing for the cases where it can run the fused SDP, i.e. need_weights = False.

math_ref_test = math_ref_test.to(dtype=torch.float32).contiguous()
math_ref_lp_test = math_ref_lp_test.to(dtype=torch.float32).contiguous()

self.assertEqual(math_ref_test, math_ref_lp_test, atol=4e-1, rtol=4e-1)
Contributor Author


This comparison is between math_ref run in fp32 and math_ref run in fp16, and is used to define a reasonable epsilon for fp16-to-fp32 comparisons.

The second assert compares fused_sdp_fp16 against math_ref_32 and ensures that it is within the same bounds as the math_ref comparison.

Also note that we are scaling the uniform distribution up to the range (-10, 10).
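
A hedged sketch of that tolerance scheme, with illustrative shapes and without forcing any particular SDPA backend (so the low-precision call only stands in for the fused path); on CPU it falls back to bfloat16 since fp16 matmuls may be unsupported there.

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
lp_dtype = torch.float16 if device == "cuda" else torch.bfloat16

# Scale the uniform distribution up to the range (-10, 10).
q = k = v = torch.rand(2, 4, 32, 16, device=device) * 20 - 10

math_ref = F.scaled_dot_product_attention(q, k, v)                                  # fp32 reference
math_ref_lp = F.scaled_dot_product_attention(*(t.to(lp_dtype) for t in (q, k, v)))  # low-precision reference
fused_lp = F.scaled_dot_product_attention(*(t.to(lp_dtype) for t in (q, k, v)))     # stand-in for the fused path

# 1) Low-precision reference vs fp32 reference defines a reasonable epsilon.
torch.testing.assert_close(math_ref_lp.float(), math_ref, atol=4e-1, rtol=4e-1)
# 2) The (fused) low-precision result must land within the same bounds.
torch.testing.assert_close(fused_lp.float(), math_ref, atol=4e-1, rtol=4e-1)
```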

@facebook-github-bot
Contributor

@drisspg has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.


@facebook-github-bot
Contributor

@drisspg has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@drisspg drisspg force-pushed the Use_scaled_dot_product_attention_within_attention.cpp branch from 101e478 to 100c3c4 Compare October 28, 2022 23:42
@facebook-github-bot
Contributor

@drisspg has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Contributor Author

drisspg commented Oct 31, 2022

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@github-actions
Contributor

Hey @drisspg.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Nov 5, 2022
# Summary
Use the private _scaled_dot_product_attention to support _native_multi_head_attention. SDPA provides access to fused kernels when certain conditions are met, enabling a speedup for MHA.

cc @cpuhrsch @jbschlosser @bhosmer @mikaylagawarecki
Pull Request resolved: pytorch#87312
Approved by: https://github.com/cpuhrsch
kulinseth pushed a commit to kulinseth/pytorch that referenced this pull request Dec 10, 2022
# Summary
Use the private _scaled_dot_product_attention to support _native_multi_head_attention. SDPA provides access to fused kernels when certain conditions are met, enabling a speedup for MHA.

cc @cpuhrsch @jbschlosser @bhosmer @mikaylagawarecki
Pull Request resolved: pytorch#87312
Approved by: https://github.com/cpuhrsch
@drisspg drisspg changed the title Use scaled_dot_product_attention within attention.cpp [SDPA] Use scaled_dot_product_attention within attention.cpp Jan 10, 2023