[Pytorch] Improve conversion to bfloat16 on aarch64/NEON #166958

Closed
Nicoshev wants to merge 1 commit into pytorch:main from Nicoshev:export-D86207189

@Nicoshev Nicoshev commented Nov 4, 2025

Summary:
Autovectorization of casts to bfloat16_t is broken in clang 17 through 20 and fixed in clang 21.

We are adding vectorized workaround code, which improves conversion speed from smaller int data types.
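
As a rough illustration of what an integer-to-bfloat16 conversion involves, here is a scalar sketch in Python. It is not the NEON code from this PR. The helper names `float32_bits`, `to_bfloat16_bits`, and `bfloat16_bits_to_float` are hypothetical, and the sketch assumes the usual scheme of widening to float32 and then keeping the upper 16 bits with round-to-nearest-even.

```python
import struct


def float32_bits(value: float) -> int:
    """Return the IEEE-754 binary32 bit pattern of value as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def to_bfloat16_bits(value: int) -> int:
    """Convert an integer to bfloat16 bits via float32 (assumed rounding: nearest even)."""
    bits = float32_bits(float(value))
    # Adding 0x7FFF plus the lowest kept bit makes exact ties round to an even mantissa.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bfloat16_bits_to_float(bits16: int) -> float:
    """Widen bfloat16 bits back to a Python float by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", bits16 << 16))[0]


for sample in (1, 255, 257, 259, -128):
    converted = bfloat16_bits_to_float(to_bfloat16_bits(sample))
    print(f"{sample} -> {converted}")
```

bfloat16 carries 8 significant bits, so every uint8 and int8 value converts exactly, while wider integers such as 257 have to be rounded.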

We've observed the following performance improvements when compiling with clang-19 and targeting armv9a+sve2:

before:

uint8->bfloat16_t ===> 319.433us
int8->bfloat16_t ===> 320.216us
int16->bfloat16_t ===> 326.899us
int32->bfloat16_t ===> 327.925us

after:

uint8->bfloat16_t ===> 185.189us -----> 72% higher throughput
int8->bfloat16_t ===> 169.790us -----> 89% higher throughput
int16->bfloat16_t ===> 180.744us -----> 81% higher throughput
int32->bfloat16_t ===> 185.129us -----> 77% higher throughput
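
The throughput percentages can be reproduced from the timings above. The sketch below assumes each benchmark run does the same amount of work before and after, so throughput scales with the inverse of the elapsed time; `throughput_gain_percent` is a hypothetical helper name.

```python
def throughput_gain_percent(before_us: float, after_us: float) -> float:
    """Percent increase in throughput when the same work takes after_us instead of before_us."""
    return (before_us / after_us - 1.0) * 100.0


# (before, after) timings in microseconds, copied from the PR description.
timings_us = {
    "uint8": (319.433, 185.189),
    "int8": (320.216, 169.790),
    "int16": (326.899, 180.744),
    "int32": (327.925, 185.129),
}

for dtype, (before, after) in timings_us.items():
    gain = throughput_gain_percent(before, after)
    print(f"{dtype}->bfloat16_t: {gain:.0f}% higher throughput")
```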

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Differential Revision: D86207189

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @aditew01 @snadampal @milpuz01 @nikhil-arm @fadara01

pytorch-bot bot commented Nov 4, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/166958

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit 444d752 with merge base d273422:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Nov 4, 2025
meta-codesync bot commented Nov 4, 2025

@Nicoshev has exported this pull request. If you are a Meta employee, you can view the originating Diff in D86207189.

@Nicoshev Nicoshev requested a review from mcfi November 4, 2025 16:17
@Nicoshev Nicoshev added the module: arm, ciflow/trunk, ciflow/linux-aarch64, and release notes: cpu (aarch64) labels Nov 4, 2025
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Nov 4, 2025

Summary:

Conversion by casting to bfloat16_t is broken in clang-[17, 20], fixed in clang-21.

We are adding a workaround vectorized code, which improves conversion speed from smaller int data types.

We've observed the following performance improvements, when compiling with clang-19 and targeting armv9a+sve2:

before:

uint8->bfloat16_t  ===> 319.433us
int8->bfloat16_t  ===> 320.216us
int16->bfloat16_t  ===> 326.899us
int32->bfloat16_t  ===> 327.925us

after:

uint8->bfloat16_t  ===> 185.189us  -----> 72% higher throughput
int8->bfloat16_t  ===> 169.790us  -----> 89% higher throughput
int16->bfloat16_t  ===> 180.744us  -----> 81% higher throughput
int32->bfloat16_t  ===> 185.129us  -----> 77% higher throughput

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Differential Revision: D86207189
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Nov 4, 2025
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Nov 4, 2025
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Nov 4, 2025
@Nicoshev Nicoshev force-pushed the export-D86207189 branch 2 times, most recently from 893bb57 to f563627 Compare November 4, 2025 17:21
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Nov 4, 2025
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Nov 4, 2025
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Nov 4, 2025
@Nicoshev Nicoshev force-pushed the export-D86207189 branch 2 times, most recently from a2ceafe to c98755c Compare November 4, 2025 20:46
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Nov 4, 2025

Reviewed By: mcfi

Differential Revision: D86207189
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Nov 4, 2025
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Nov 5, 2025
@Nicoshev Nicoshev force-pushed the export-D86207189 branch 2 times, most recently from 8639998 to c734960 Compare November 5, 2025 14:00
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Nov 5, 2025
@malfet malfet left a comment

Please add performance benchmark

pytorch-bot bot pushed a commit that referenced this pull request Nov 11, 2025
@Nicoshev Nicoshev requested a review from malfet November 11, 2025 18:39
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Nov 11, 2025
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Nov 11, 2025
pytorch-bot bot pushed a commit that referenced this pull request Nov 12, 2025
Nicoshev added a commit to Nicoshev/pytorch that referenced this pull request Nov 13, 2025
@facebook-github-bot (Contributor)

@pytorchbot merge

(Initiating merge automatically since Phabricator Diff has merged)

pytorch-bot bot commented Nov 14, 2025

This PR has pending changes requested. Please address the comments and update the PR before merging.

@Nicoshev (Contributor, Author)

@pytorchbot merge -f "Benchmarks added in a separate PR. CI failure was on an x86 build compiled by gcc, it does not execute added path.

pytorch-bot bot commented Nov 14, 2025

❌ 🤖 pytorchbot command failed:

Got EOF while in a quoted string
Try `@pytorchbot --help` for more info.
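
The failed command above opens a double quote and never closes it. The sketch below shows how a shell-style tokenizer reacts to that, using Python's standard `shlex` module. It assumes the bot splits arguments in a broadly similar way, which this thread does not confirm. `parses_cleanly` is a hypothetical helper, and both command strings are shortened for illustration.

```python
import shlex


def parses_cleanly(command: str) -> bool:
    """Return True if a shell-style tokenizer accepts the command string."""
    try:
        shlex.split(command)
        return True
    except ValueError:
        # shlex raises ValueError when a quoted string is never closed.
        return False


# Shortened stand-ins for the two merge commands in this thread.
unterminated = 'merge -f "Benchmarks added in a separate PR.'
terminated = 'merge -f "Benchs added in PR167099."'

print(parses_cleanly(unterminated))
print(parses_cleanly(terminated))
```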

@Nicoshev (Contributor, Author)

@pytorchbot merge -f "Benchs added in PR167099. CI fails on an x86 build compiled by gcc, it doesn't execute new path"

@pytorchmergebot (Collaborator)

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

Silv3S pushed a commit to Silv3S/pytorch that referenced this pull request Nov 18, 2025

Pull Request resolved: pytorch#166958
Approved by: https://github.com/mcfi
Labels

ciflow/linux-aarch64 (linux aarch64 CI workflow)
ciflow/trunk (Trigger trunk jobs on your pull request)
fb-exported
Merged
meta-exported
module: arm (Related to ARM architectures builds of PyTorch. Includes Apple M1)
module: cpu (CPU specific problem (e.g., perf, algorithm))
release notes: cpu (aarch64) (release notes category for aarch64, arm, etc.)

5 participants