[ROCm] fix miopen batchnorm changing output format by amdfaa · Pull Request #162112 · pytorch/pytorch

amdfaa · 2025-09-04T00:17:55Z

The MIOpen batchnorm integration always produced output in the default contiguous memory format, even when the input was channels-last. This PR fixes that and unskips a number of related unit tests.
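To illustrate what "changing output format" means here, the following is a minimal pure-Python sketch of how the two layouts differ in strides for a 4D tensor with logical shape (N, C, H, W). The helper names are hypothetical and are not part of the PR. In PyTorch itself the equivalent check would normally be `tensor.is_contiguous(memory_format=torch.channels_last)`.

```python
def contiguous_strides(shape):
    """Element strides for the default contiguous (NCHW) layout."""
    n, c, h, w = shape
    return (c * h * w, h * w, w, 1)


def channels_last_strides(shape):
    """Element strides for channels-last storage (NHWC in memory, NCHW logical shape)."""
    n, c, h, w = shape
    return (h * w * c, 1, w * c, c)


def is_channels_last(shape, strides):
    # Simplified check: exact stride match. It ignores the ambiguity that
    # arises when some dimensions have size 1.
    return tuple(strides) == channels_last_strides(shape)


shape = (2, 3, 4, 5)
input_strides = channels_last_strides(shape)

# Behavior described in the PR: output falls back to the contiguous layout.
output_before_fix = contiguous_strides(shape)
# Assumed intended behavior: output keeps the layout of the input.
output_after_fix = input_strides

print(is_channels_last(shape, output_before_fix))
print(is_channels_last(shape, output_after_fix))
```

The sketch only models stride bookkeeping; it does not call MIOpen or reproduce the actual fix.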

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

pytorch-bot · 2025-09-04T00:17:59Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162112

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (9 Unrelated Failures)

As of commit 7ba20fe with merge base 4840a1a:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

pull / linux-jammy-py3.10-clang12 / test (default, 2, 5, linux.4xlarge) (gh) (disabled by #136125, #137026, #137027)
inductor/test_extension_backend.py::ExtensionBackendTests::test_open_device_registration
pull / linux-jammy-py3.10-clang12 / test (dynamo_wrapped, 3, 3, linux.2xlarge) (gh) (disabled by #80338)
test_autograd.py::TestAutograd::test_lobpcg
pull / linux-jammy-py3.10-clang18-asan / test (default, 4, 7, linux.4xlarge) (gh) (disabled by #136125, #137026, #137027)
inductor/test_extension_backend.py::ExtensionBackendTests::test_open_device_registration
pull / linux-jammy-py3.10-gcc11 / test (default, 4, 5, linux.2xlarge) (gh) (disabled by #136125, #137026, #137027)
inductor/test_extension_backend.py::ExtensionBackendTests::test_open_device_registration
pull / linux-jammy-py3.10-gcc11 / test (distributed, 2, 2, linux.2xlarge) (gh) (disabled by #123294)
distributed/tensor/test_dtensor_ops.py::TestDTensorOpsCPU::test_dtensor_op_db_nn_functional_multi_head_attention_forward_cpu_float32
pull / linux-jammy-py3.13-clang12 / test (default, 3, 5, linux.4xlarge) (gh) (disabled by #136125, #137026, #137027)
inductor/test_extension_backend.py::ExtensionBackendTests::test_open_device_registration
pull / linux-jammy-py3.13-clang12 / test (dynamo_wrapped, 3, 3, linux.2xlarge) (gh) (disabled by #80338)
test_autograd.py::TestAutograd::test_lobpcg
rocm / linux-jammy-rocm-py3.10 / test (default, 1, 6, linux.rocm.gpu.2) (gh) (disabled by #98259)
test_type_hints.py::TestTypeHints::test_doc_examples
rocm / linux-jammy-rocm-py3.10 / test (default, 4, 6, linux.rocm.gpu.2) (gh) (disabled by #136125, #137026, #137027)
inductor/test_extension_backend.py::ExtensionBackendTests::test_open_device_registration

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot · 2025-09-04T16:58:47Z

To add the ciflow label ciflow/rocm-mi300 please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

pytorch-bot · 2025-09-04T16:58:57Z

To add the ciflow label ciflow/rocm please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

jithunnair-amd · 2025-09-11T03:03:43Z

2 unit tests failing:

The following tests failed consistently: ['test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_cudnn_convolution_add_relu_cuda_float16', 'test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_cudnn_convolution_relu_cuda_float16']

jithunnair-amd · 2025-09-11T03:11:21Z

test/test_nn.py


-            if self._testMethodName == "test_batchnorm_3D_train_NCHW_vs_native_mixed_float16" \
-                    and _get_torch_rocm_version() < (7, 0):
-                self.skipTest("3D float16 NCHW train failed on ROCm<7.0")


@dnikolaev-amd @jeffdaily Why remove the ROCm 7.0 condition? Does this test fail even on ROCm 7.0?

dnikolaev-amd · Sep 11, 2025 (edited)

@jithunnair-amd this test failed on ROCm 7.0+

AssertionError: Tensor-likes are not close!
Mismatched elements: 1 / 8 (12.5%)
Greatest absolute difference: 0.021703720092773438 at index (4,) (up to 1e-05 allowed)
Greatest relative difference: 0.0013534484896808863 at index (4,) (up to 1.3e-06 allowed)

Looks like another native BN accuracy issue:

test_batchnorm_3D_train_NCHW_vs_native_mixed_float16 - failed

test_batchnorm_3D_train_NCHW_vs_cpu_mixed_float16 - passed
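To put the numbers in the assertion message in perspective, here is a small sketch of the usual closeness criterion, `|actual - expected| <= atol + rtol * |expected|`. It assumes the "allowed" values in the message are `atol = 1e-05` and `rtol = 1.3e-06`, and that the reported relative difference is the absolute difference divided by `|expected|`. Both are assumptions about how the message was produced, not something stated in the thread.

```python
def allowed_difference(expected_magnitude, rtol, atol):
    """Largest absolute difference accepted by the closeness criterion."""
    return atol + rtol * expected_magnitude


# Values copied from the assertion message above.
abs_diff = 0.021703720092773438
rel_diff = 0.0013534484896808863
rtol, atol = 1.3e-06, 1e-05

# Assumption: rel_diff = abs_diff / |expected|, so |expected| can be recovered.
expected_magnitude = abs_diff / rel_diff
allowed = allowed_difference(expected_magnitude, rtol, atol)

print(f"implied |expected|: {expected_magnitude:.4f}")
print(f"allowed difference: {allowed:.3e}")
print(f"within tolerance:   {abs_diff <= allowed}")
```

Under these assumptions the single mismatched element is off by far more than the tolerance permits, which is consistent with the comment describing it as an accuracy issue rather than rounding noise.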

jeffdaily · 2025-09-11T19:35:54Z

@pytorchbot merge -f "unrelated rocm failures, unrelated other failures, linter is green"

pytorchmergebot · 2025-09-11T19:37:29Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

cherry pick of pytorch#162112

It was found that the integration of miopen batchnorm was causing the output to always be in default contig memory format even when the input was channels last. This also unskips a number of related unit tests.

Pull Request resolved: pytorch#162112
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Dmitry Nikolaev <dmitry.nikolaev@amd.com>
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>


cherry pick of pytorch#162112

Fixes #SWDEV-567460

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Dmitry Nikolaev <dmitry.nikolaev@amd.com>
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>

[ROCm] fix miopen batchnorm changing output format

01f1ffa

amdfaa requested review from jeffdaily and jithunnair-amd as code owners September 4, 2025 00:17

pytorch-bot bot added the "module: rocm" (AMD GPU support for Pytorch) and "release notes: nn" (release notes category) labels Sep 4, 2025

jeffdaily approved these changes Sep 4, 2025


pytorchbot added the open source label Sep 4, 2025

jeffdaily added 2 commits September 4, 2025 15:13

avoid redundant NHWC-NCHW-NHWC conversions for MiopenBatchNormBackward

af0dc52

lint

33baeb6

jeffdaily requested review from albanD and soulitzer as code owners September 4, 2025 15:14

jeffdaily added the "ciflow/rocm-mi300" label (Trigger "default" config CI on ROCm MI300) Sep 4, 2025

albanD removed their request for review September 4, 2025 16:54

additional batchnorm tests

bd51d87

pytorch-bot bot removed the "ciflow/rocm-mi300" label Sep 4, 2025

jeffdaily added the "ciflow/rocm" (Trigger "default" config CI on ROCm) and "ciflow/rocm-mi300" labels Sep 4, 2025

pytorch-bot bot removed the "ciflow/rocm-mi300" label Sep 4, 2025

pytorch-bot bot removed the "ciflow/rocm" label Sep 4, 2025

jeffdaily added the "release notes: rocm" (mandatorylabel), "ciflow/rocm", and "ciflow/rocm-mi300" labels Sep 4, 2025


pytorch-bot bot removed the "ciflow/rocm" and "ciflow/rocm-mi300" labels Sep 4, 2025

unskip tests

9bccd27

pytorch-bot bot removed the "ciflow/rocm-mi300" label Sep 9, 2025

jeffdaily added the "ciflow/trunk" (Trigger trunk jobs on your pull request), "ciflow/rocm", and "ciflow/rocm-mi300" labels Sep 9, 2025

typo

7ba20fe

pytorch-bot bot removed the "ciflow/trunk", "ciflow/rocm", and "ciflow/rocm-mi300" labels Sep 11, 2025

jithunnair-amd reviewed Sep 11, 2025


jeffdaily added the "ciflow/trunk", "ciflow/rocm", and "ciflow/rocm-mi300" labels Sep 11, 2025

pytorchmergebot added the merging label Sep 11, 2025

pytorchmergebot closed this in d65ffde Sep 11, 2025

pytorchmergebot added Merged and removed merging labels Sep 11, 2025

jeffdaily added a commit to ROCm/pytorch that referenced this pull request Sep 11, 2025

[release/2.8] fix miopen batchnorm changing output format

d3985e1

cherry pick of pytorch#162112

jeffdaily added a commit to ROCm/pytorch that referenced this pull request Oct 15, 2025

[release/2.8] fix miopen batchnorm changing output format (#2602)

336f231

cherry pick of pytorch#162112

jerrymannil pushed a commit to ROCm/pytorch that referenced this pull request Nov 19, 2025

[release/2.9] fix miopen batchnorm changing output format

bd4bf5b

cherry pick of pytorch#162112

jerrymannil mentioned this pull request Nov 19, 2025

[release/2.9] fix miopen batchnorm changing output format ROCm/pytorch#2813

Merged

amdfaa wants to merge 10 commits into pytorch:main from ROCm:rocm_miopen_batchnorm_fix

amdfaa commented Sep 4, 2025 • edited by jeffdaily

6 participants
