
[Flex] Changing how bwd configs are setup and updating default b200 config #163318

Closed
drisspg wants to merge 7 commits into gh/drisspg/201/base from gh/drisspg/201/head

drisspg (Contributor) commented Sep 19, 2025

Stack from ghstack (oldest at bottom):

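Up to a 4.27x perf boost on the backward pass (bfloat16), with a geometric mean speedup of 1.9954x across 60 comparisons.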

🔝 Top 5 Performance Differences (by absolute %):
shape: (5, 7)
┌───────────┬────────────────┬────────────────────────────────┬───────────────────┬─────────────────────────────┬─────────────────────────────────┬────────────┐
│ attn_type ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D)          ┆ TFlops BWD (base) ┆ TFlops BWD (better_configs) ┆ better_configs_speedup_over_ba… ┆ pct_delta  │
│ ---       ┆ ---            ┆ ---                            ┆ ---               ┆ ---                         ┆ ---                             ┆ ---        │
│ str       ┆ str            ┆ str                            ┆ f64               ┆ f64                         ┆ f64                             ┆ f64        │
╞═══════════╪════════════════╪════════════════════════════════╪═══════════════════╪═════════════════════════════╪═════════════════════════════════╪════════════╡
│ noop      ┆ torch.bfloat16 ┆ (4, 16, 32768, 16, 32768, 128) ┆ 124.775035        ┆ 532.580435                  ┆ 4.268325                        ┆ 326.832527 │
│ noop      ┆ torch.bfloat16 ┆ (4, 16, 16384, 16, 16384, 128) ┆ 124.494557        ┆ 519.798488                  ┆ 4.175271                        ┆ 317.527078 │
│ causal    ┆ torch.bfloat16 ┆ (4, 16, 32768, 16, 32768, 128) ┆ 123.984189        ┆ 512.877391                  ┆ 4.136635                        ┆ 313.663544 │
│ noop      ┆ torch.bfloat16 ┆ (4, 16, 8192, 16, 8192, 128)   ┆ 122.827725        ┆ 496.195958                  ┆ 4.039772                        ┆ 303.977164 │
│ causal    ┆ torch.bfloat16 ┆ (4, 16, 16384, 16, 16384, 128) ┆ 123.826738        ┆ 484.244647                  ┆ 3.910663                        ┆ 291.066303 │
└───────────┴────────────────┴────────────────────────────────┴───────────────────┴─────────────────────────────┴─────────────────────────────────┴────────────┘

🔺 Top 5 Cases Where better_configs (change) is Faster than base (baseline):
shape: (5, 7)
┌───────────┬────────────────┬────────────────────────────────┬───────────────────┬─────────────────────────────┬─────────────────────────────────┬────────────┐
│ attn_type ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D)          ┆ TFlops BWD (base) ┆ TFlops BWD (better_configs) ┆ better_configs_speedup_over_ba… ┆ pct_delta  │
│ ---       ┆ ---            ┆ ---                            ┆ ---               ┆ ---                         ┆ ---                             ┆ ---        │
│ str       ┆ str            ┆ str                            ┆ f64               ┆ f64                         ┆ f64                             ┆ f64        │
╞═══════════╪════════════════╪════════════════════════════════╪═══════════════════╪═════════════════════════════╪═════════════════════════════════╪════════════╡
│ noop      ┆ torch.bfloat16 ┆ (4, 16, 32768, 16, 32768, 128) ┆ 124.775035        ┆ 532.580435                  ┆ 4.268325                        ┆ 326.832527 │
│ noop      ┆ torch.bfloat16 ┆ (4, 16, 16384, 16, 16384, 128) ┆ 124.494557        ┆ 519.798488                  ┆ 4.175271                        ┆ 317.527078 │
│ causal    ┆ torch.bfloat16 ┆ (4, 16, 32768, 16, 32768, 128) ┆ 123.984189        ┆ 512.877391                  ┆ 4.136635                        ┆ 313.663544 │
│ noop      ┆ torch.bfloat16 ┆ (4, 16, 8192, 16, 8192, 128)   ┆ 122.827725        ┆ 496.195958                  ┆ 4.039772                        ┆ 303.977164 │
│ causal    ┆ torch.bfloat16 ┆ (4, 16, 16384, 16, 16384, 128) ┆ 123.826738        ┆ 484.244647                  ┆ 3.910663                        ┆ 291.066303 │
└───────────┴────────────────┴────────────────────────────────┴───────────────────┴─────────────────────────────┴─────────────────────────────────┴────────────┘

🔻 Top 5 Cases Where better_configs (change) is Slower than base (baseline):
shape: (5, 7)
┌───────────────┬────────────────┬───────────────────────────────┬───────────────────┬─────────────────────────────┬─────────────────────────────────┬───────────┐
│ attn_type     ┆ dtype          ┆ shape(B,Hq,M,Hkv,N,D)         ┆ TFlops BWD (base) ┆ TFlops BWD (better_configs) ┆ better_configs_speedup_over_ba… ┆ pct_delta │
│ ---           ┆ ---            ┆ ---                           ┆ ---               ┆ ---                         ┆ ---                             ┆ ---       │
│ str           ┆ str            ┆ str                           ┆ f64               ┆ f64                         ┆ f64                             ┆ f64       │
╞═══════════════╪════════════════╪═══════════════════════════════╪═══════════════════╪═════════════════════════════╪═════════════════════════════════╪═══════════╡
│ document_mask ┆ torch.bfloat16 ┆ (4, 16, 8192, 16, 8192, 128)  ┆ 267.502004        ┆ 250.728732                  ┆ 0.937297                        ┆ -6.270335 │
│ document_mask ┆ torch.bfloat16 ┆ (4, 16, 8192, 4, 8192, 128)   ┆ 248.510516        ┆ 235.210874                  ┆ 0.946483                        ┆ -5.351742 │
│ document_mask ┆ torch.bfloat16 ┆ (4, 16, 16384, 4, 16384, 128) ┆ 282.856295        ┆ 271.806926                  ┆ 0.960936                        ┆ -3.906354 │
│ document_mask ┆ torch.bfloat16 ┆ (4, 16, 8192, 16, 8192, 64)   ┆ 282.212695        ┆ 280.519092                  ┆ 0.993999                        ┆ -0.600116 │
│ document_mask ┆ torch.bfloat16 ┆ (4, 16, 32768, 4, 32768, 128) ┆ 295.864073        ┆ 294.477894                  ┆ 0.995315                        ┆ -0.468519 │
└───────────────┴────────────────┴───────────────────────────────┴───────────────────┴─────────────────────────────┴─────────────────────────────────┴───────────┘
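
The speedup and pct_delta columns appear to follow directly from the two TFlops columns. A minimal sketch, assuming speedup = TFlops(better_configs) / TFlops(base) and pct_delta = (speedup - 1) * 100; the helper name `speedup_and_pct_delta` is hypothetical and not part of the benchmark script:

```python
# Values copied from two rows of the tables above.
FASTEST_ROW = (124.775035, 532.580435)  # noop, (4, 16, 32768, 16, 32768, 128)
SLOWEST_ROW = (267.502004, 250.728732)  # document_mask, (4, 16, 8192, 16, 8192, 128)


def speedup_and_pct_delta(base_tflops: float, change_tflops: float) -> tuple[float, float]:
    """Return (speedup, pct_delta) for one benchmark row.

    Assumption: speedup is the ratio of achieved TFlops, and pct_delta is
    that ratio expressed as a percentage change relative to the baseline.
    """
    speedup = change_tflops / base_tflops
    pct_delta = (speedup - 1.0) * 100.0
    return speedup, pct_delta


fast_speedup, fast_pct = speedup_and_pct_delta(*FASTEST_ROW)
slow_speedup, slow_pct = speedup_and_pct_delta(*SLOWEST_ROW)
print(f"fastest row: {fast_speedup:.4f}x ({fast_pct:+.2f}%)")
print(f"slowest row: {slow_speedup:.4f}x ({slow_pct:+.2f}%)")
```

Under that assumption the recomputed values agree with the table's speedup and pct_delta columns up to the rounding of the printed TFlops figures.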

📊 Performance Summary:
============================================================
Baseline: base
Change:   better_configs
Geometric Mean Speedup (change over baseline): 1.9954x
Geometric Mean % Change: +99.54%
Median Speedup (change over baseline): 2.1590x
Speedup Std Dev: 0.9800
Valid Comparisons: 60/60
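
For reference, a rough sketch of how summary statistics like these could be computed from per-shape speedups. It uses only the ten rows shown in the tables above (the five fastest and five slowest of the 60 comparisons), so it is not expected to reproduce the 1.9954x figure. The function name `summarize_speedups` is hypothetical, and the actual benchmark script may compute these differently.

```python
import statistics

# Speedups copied from the ten rows visible above (extremes only, not all 60).
VISIBLE_SPEEDUPS = [
    4.268325, 4.175271, 4.136635, 4.039772, 3.910663,  # five fastest
    0.937297, 0.946483, 0.960936, 0.993999, 0.995315,  # five slowest
]


def summarize_speedups(speedups: list[float]) -> dict[str, float]:
    """Geometric mean, its percent change, and the median of the speedups.

    The geometric mean is the natural average for ratios: a 2x win and a
    0.5x loss cancel to exactly 1.0x, which an arithmetic mean would not do.
    """
    geomean = statistics.geometric_mean(speedups)
    return {
        "geomean_speedup": geomean,
        "geomean_pct_change": (geomean - 1.0) * 100.0,
        "median_speedup": statistics.median(speedups),
    }


summary = summarize_speedups(VISIBLE_SPEEDUPS)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")

# The reported percent change is consistent with the reported geometric mean:
reported_pct_change = (1.9954 - 1.0) * 100.0
print(f"reported geomean as pct change: {reported_pct_change:+.2f}%")
```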

cc @jerryzh168 @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @Chillee @yanboliang @BoyuanFeng

[ghstack-poisoned]
pytorch-bot bot commented Sep 19, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/163318

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit e12de62 with merge base cd4303a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

[ghstack-poisoned]
drisspg added a commit that referenced this pull request Sep 19, 2025
drisspg added the labels "module: performance" (Issues related to performance, either of kernel code or framework glue) and "release notes: nn" (release notes category) Sep 19, 2025
drisspg added the labels "ciflow/trunk" (Trigger trunk jobs on your pull request) and "ciflow/rocm-mi300" (Trigger "default" config CI on ROCm MI300) Sep 19, 2025
drisspg (Contributor Author) commented Sep 19, 2025

@pytorchbot merge

pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
…onfig (pytorch#163318)

Pull Request resolved: pytorch#163318
Approved by: https://github.com/BoyuanFeng
cleonard530 pushed a commit to cleonard530/pytorch that referenced this pull request Sep 22, 2025
…onfig (pytorch#163318)

dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025
…onfig (pytorch#163318)

@github-actions github-actions bot deleted the gh/drisspg/201/head branch October 20, 2025 02:18

Labels

ciflow/inductor
ciflow/rocm-mi300 (Trigger "default" config CI on ROCm MI300)
ciflow/trunk (Trigger trunk jobs on your pull request)
Merged
module: flex attention
module: inductor
module: performance (Issues related to performance, either of kernel code or framework glue)
release notes: nn (release notes category)
