Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/164502
Note: Links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit 59331ad with merge base ffe3cb2. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Pull Request resolved: pytorch#164502
Approved by: https://github.com/v0i0, https://github.com/drisspg
@pytorchbot revert -m 'Sorry for reverting your change, but the doctests failure is legit' -c weird
Due to a macOS outage we had during the weekend,
@pytorchbot successfully started a revert job. Check the current status here.
@liangel-02 your PR has been successfully reverted.
This reverts commit 3681312. Reverted #164502 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the doctests failure is legit ([comment](#164502 (comment)))
@pytorchbot revert -m "Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable" -c autorevert This PR is attributed to have caused regression in:
Please investigate and fix the issues.
@pytorchbot successfully started a revert job. Check the current status here.
Reverting PR 164502 failed. Reason: list index out of range. Details for Dev Infra team: raised by workflow job.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Pull Request resolved: pytorch#164502
Approved by: https://github.com/v0i0, https://github.com/drisspg
Summary
Today, the only way to have variable sequence length support in PyTorch attention is through nested tensors here. We also want to add an explicit lower-level API that provides variable sequence length support without padding/masking in SDPA.
This PR builds out `varlen_attn`, the public API that users can call for the forward method, and `_varlen_attn`, the private API that calls into the Flash Attention/cuDNN backend.
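To make the intended usage concrete, here is a minimal sketch of how a packed variable-length call is typically shaped (FlashAttention-style packed tensors plus cumulative sequence lengths). The import path, argument names, and argument order are assumptions for illustration, not the exact signature introduced by this PR.

```python
# Illustrative sketch only: the varlen_attn call shape below is assumed,
# not the exact API added in this PR. The packing and cu_seqlens parts are standard.
import torch

# Three sequences of different lengths packed into one tensor of shape
# (total_tokens, num_heads, head_dim), with no padding tokens at all.
seq_lens = torch.tensor([128, 512, 64])
cu_seqlens = torch.zeros(len(seq_lens) + 1, dtype=torch.int32)
cu_seqlens[1:] = torch.cumsum(seq_lens, dim=0)  # offsets: [0, 128, 640, 704]

total_tokens, num_heads, head_dim = int(seq_lens.sum()), 16, 64
q = torch.randn(total_tokens, num_heads, head_dim, device="cuda", dtype=torch.bfloat16)
k = torch.randn_like(q)
v = torch.randn_like(q)
cu_seqlens = cu_seqlens.to("cuda")

# Assumed call shape: cumulative offsets for q and k plus the max sequence
# length, so the kernel can process each sequence without padding or masking.
# out = torch.nn.attention.varlen_attn(
#     q, k, v, cu_seqlens, cu_seqlens,
#     max_q=int(seq_lens.max()), max_k=int(seq_lens.max()),
#     is_causal=False,
# )
```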
Benchmarking

To benchmark, we compare runtime and TFLOPs against the current SDPA approach with padding.
Settings:
- 1 H100 machine
- `batch_size=8`, `max_seq_len=2048`, `embed_dim=1024`, `num_heads=16`
- dtype `torch.bfloat16`
- `is_causal=False`
- for variable length, sequence lengths are random multiples of 64 up to `max_seq_len`
- 100 runs

|  | Variable Length API | SDPA |
|--------|--------------------|----------|
| Runtime | 0.21750560760498047 ms | 0.43171775817871094 ms |
| TFLOPs | 231.812 | 320.840 |

The sparsity is 0.453, which matches the speedup we see from varlen (approximately 50%). TFLOPs are roughly comparable, with SDPA's figure slightly higher, likely because its total FLOP count scales with the full padded sequence length and it carries some additional overhead.
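For reference, a small sketch (under the settings above; variable names are illustrative, not taken from the benchmark script) of how the variable-length batch and its sparsity can be derived:

```python
# Illustrative sketch of the benchmark's sequence-length setup (names assumed).
import torch

torch.manual_seed(0)
batch_size, max_seq_len = 8, 2048

# Random sequence lengths that are multiples of 64, up to max_seq_len.
seq_lens = torch.randint(1, max_seq_len // 64 + 1, (batch_size,)) * 64

# Sparsity = fraction of positions a dense (batch_size, max_seq_len) layout
# would spend on padding; varlen's runtime scales roughly with 1 - sparsity.
sparsity = 1.0 - seq_lens.sum().item() / (batch_size * max_seq_len)
print(f"seq_lens={seq_lens.tolist()}, sparsity={sparsity:.3f}")
```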
Testing
Run `python test/test_varlen_attention.py` for unit tests, where we verify basic functionality and confirm numerical parity between varlen outputs and SDPA.
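Roughly, the parity check has this shape (an assumed sketch of the test structure, not the actual test code): each packed varlen output segment, recovered via the cumulative-offset boundaries, is compared against dense SDPA run on that sequence's valid tokens.

```python
# Assumed structure of the numerical-parity check (illustrative only).
import torch
import torch.nn.functional as F

def sdpa_reference(q, k, v, seq_lens):
    """Run dense SDPA on each sequence's valid tokens, given (B, H, S, D) inputs."""
    refs = []
    for b, n in enumerate(seq_lens.tolist()):
        refs.append(
            F.scaled_dot_product_attention(
                q[b, :, :n], k[b, :, :n], v[b, :, :n], is_causal=False
            )
        )
    return refs

# Each packed varlen output segment (sliced out with cu_seqlens offsets) would
# then be checked against the matching reference slice, e.g. with
# torch.testing.assert_close(varlen_segment, ref_segment, rtol=1e-2, atol=1e-2).
```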
Next steps

Next steps from this PR (higher in the stack) include registering the private API `_varlen_attn` as a custom op, implementing backward support, and enabling cuDNN with correct numerics.

Stack from ghstack (oldest at bottom):
(This stack builds on top of #162326)