[ROCm] fix miopen batchnorm changing output format #162112

Closed
amdfaa wants to merge 10 commits into pytorch:main from ROCm:rocm_miopen_batchnorm_fix
amdfaa (Contributor) commented Sep 4, 2025

The MIOpen batchnorm integration always returned its output in the default contiguous memory format, even when the input was channels-last. This PR fixes that and unskips a number of related unit tests.
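
To make the layout distinction concrete, here is a minimal pure-Python sketch of the two stride patterns involved. It is an illustration only, not the MIOpen or PyTorch code touched by this PR. The helper names are hypothetical, and the sketch ignores edge cases such as size-1 dimensions, where the two layouts can coincide.

```python
def contiguous_strides(shape):
    """Strides (in elements) of a default row-major (NCHW) contiguous tensor."""
    strides = []
    step = 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def channels_last_strides(shape):
    """Strides (in elements) of a 4-D tensor stored channels-last (NHWC in memory)."""
    n, c, h, w = shape
    return (h * w * c, 1, w * c, c)


def layout_of(shape, strides):
    """Classify a 4-D stride tuple (hypothetical helper, for illustration)."""
    if tuple(strides) == channels_last_strides(shape):
        return "channels_last"
    if tuple(strides) == contiguous_strides(shape):
        return "contiguous"
    return "other"


shape = (2, 3, 4, 5)  # N, C, H, W
input_strides = channels_last_strides(shape)

# Behaviour described as the bug: the output comes back in the default layout.
buggy_output_strides = contiguous_strides(shape)
# Behaviour expected after the fix: the output keeps the input's layout.
fixed_output_strides = input_strides

print(layout_of(shape, input_strides))
print(layout_of(shape, buggy_output_strides))
print(layout_of(shape, fixed_output_strides))
```

In this sketch the channels-last input has a stride of 1 on the channel dimension, while the default layout has a stride of 1 on the width dimension, which is the difference a memory-format check on the output would pick up.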

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

pytorch-bot bot commented Sep 4, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/162112

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (9 Unrelated Failures)

As of commit 7ba20fe with merge base 4840a1a (image):

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added module: rocm AMD GPU support for Pytorch release notes: nn release notes category labels Sep 4, 2025
@jeffdaily jeffdaily added the ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 label Sep 4, 2025
@albanD albanD removed their request for review September 4, 2025 16:54
@pytorch-bot pytorch-bot bot removed the ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 label Sep 4, 2025
@jeffdaily jeffdaily added ciflow/rocm Trigger "default" config CI on ROCm ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 labels Sep 4, 2025

pytorch-bot bot commented Sep 4, 2025

To add the ciflow label ciflow/rocm-mi300 please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed the ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 label Sep 4, 2025

pytorch-bot bot commented Sep 4, 2025

To add the ciflow label ciflow/rocm please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed the ciflow/rocm Trigger "default" config CI on ROCm label Sep 4, 2025
@jeffdaily jeffdaily added release notes: rocm mandatorylabel ciflow/rocm Trigger "default" config CI on ROCm ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 labels Sep 4, 2025
@pytorch-bot pytorch-bot bot removed ciflow/rocm Trigger "default" config CI on ROCm ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 labels Sep 4, 2025
@pytorch-bot pytorch-bot bot removed the ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 label Sep 9, 2025
@jeffdaily jeffdaily added ciflow/trunk Trigger trunk jobs on your pull request ciflow/rocm Trigger "default" config CI on ROCm ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 labels Sep 9, 2025
jithunnair-amd (Collaborator) commented:

2 unit tests failing:

The following tests failed consistently: ['test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_cudnn_convolution_add_relu_cuda_float16', 'test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_cudnn_convolution_relu_cuda_float16']

@pytorch-bot pytorch-bot bot removed ciflow/trunk Trigger trunk jobs on your pull request ciflow/rocm Trigger "default" config CI on ROCm ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 labels Sep 11, 2025

if self._testMethodName == "test_batchnorm_3D_train_NCHW_vs_native_mixed_float16" \
        and _get_torch_rocm_version() < (7, 0):
    self.skipTest("3D float16 NCHW train failed on ROCm<7.0")

@dnikolaev-amd @jeffdaily Why remove the ROCm7.0 condition? Does this test fail even on ROCm7.0?
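
For readers unfamiliar with the gate in question, the following sketch shows how a `< (7, 0)` comparison behaves, assuming the version is a tuple of integers (the comparison in the snippet suggests this, but the return type of `_get_torch_rocm_version()` is not shown here). The function name `skipped_by_version_gate` is hypothetical.

```python
def skipped_by_version_gate(rocm_version):
    """Mirror of the `< (7, 0)` condition: True means the test would be skipped."""
    # Python compares tuples lexicographically, element by element.
    return rocm_version < (7, 0)


for version in [(6, 4), (7, 0), (7, 1)]:
    print(version, skipped_by_version_gate(version))
```

Under that assumption the gate skips the test only on versions below 7.0, so the test still runs on ROCm 7.0 and later.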

dnikolaev-amd (Contributor) commented Sep 11, 2025

@jithunnair-amd Yes, this test also fails on ROCm 7.0+:

AssertionError: Tensor-likes are not close!

Mismatched elements: 1 / 8 (12.5%)
Greatest absolute difference: 0.021703720092773438 at index (4,) (up to 1e-05 allowed)
Greatest relative difference: 0.0013534484896808863 at index (4,) (up to 1.3e-06 allowed)

This looks like another accuracy issue in native batchnorm, since only the comparison against native fails:

  • test_batchnorm_3D_train_NCHW_vs_native_mixed_float16 - failed
  • test_batchnorm_3D_train_NCHW_vs_cpu_mixed_float16 - passed
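
As a rough illustration of why the reported difference trips the check, here is a minimal sketch of the closeness criterion documented for `torch.testing.assert_close`, which is `|actual - expected| <= atol + rtol * |expected|`. The tolerances are the ones quoted in the error above. The sample values are hypothetical, chosen only to be of a similar magnitude to the reported differences, and the looser tolerances in the second call are likewise made up for contrast.

```python
def is_close(actual, expected, rtol, atol):
    """Elementwise closeness check: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Hypothetical values: an absolute difference of about 0.0217 on a value near 16.
expected = 16.0
actual = expected + 0.0217

# Tolerances quoted in the assertion message above.
print(is_close(actual, expected, rtol=1.3e-6, atol=1e-5))

# Hypothetical looser tolerances, for contrast only.
print(is_close(actual, expected, rtol=1e-2, atol=1e-2))
```

With the quoted tolerances the allowed deviation for a value of this size is on the order of 1e-5, far below the observed difference, so the first check fails while the second passes.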

@jeffdaily jeffdaily added ciflow/trunk Trigger trunk jobs on your pull request ciflow/rocm Trigger "default" config CI on ROCm ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 labels Sep 11, 2025
jeffdaily (Collaborator) commented:

@pytorchbot merge -f "unrelated rocm failures, unrelated other failures, linter is green"

pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

jeffdaily added a commit to ROCm/pytorch that referenced this pull request Sep 11, 2025
markc-614 pushed a commit to markc-614/pytorch that referenced this pull request Sep 17, 2025
It was found that the integration of miopen batchnorm was causing the output to always be in default contig memory format even when the input was channels last.  This also unskips a number of related unit tests.

Pull Request resolved: pytorch#162112
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Dmitry Nikolaev <dmitry.nikolaev@amd.com>
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
mansiag05 pushed a commit to mansiag05/pytorch that referenced this pull request Sep 22, 2025
cleonard530 pushed a commit to cleonard530/pytorch that referenced this pull request Sep 22, 2025
dsashidh pushed a commit to dsashidh/pytorch that referenced this pull request Sep 26, 2025
jeffdaily added a commit to ROCm/pytorch that referenced this pull request Oct 15, 2025
jerrymannil pushed a commit to ROCm/pytorch that referenced this pull request Nov 19, 2025
jerrymannil pushed a commit to ROCm/pytorch that referenced this pull request Nov 19, 2025
jerrymannil added a commit to ROCm/pytorch that referenced this pull request Nov 19, 2025
cherry pick of pytorch#162112

Fixes #SWDEV-567460

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Dmitry Nikolaev <dmitry.nikolaev@amd.com>
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>

Labels

  • ciflow/rocm (Trigger "default" config CI on ROCm)
  • ciflow/rocm-mi300 (Trigger "default" config CI on ROCm MI300)
  • ciflow/trunk (Trigger trunk jobs on your pull request)
  • keep-going (Don't stop on first failure, keep running tests until the end)
  • Merged
  • module: rocm (AMD GPU support for Pytorch)
  • open source
  • release notes: nn (release notes category)
  • release notes: rocm (mandatorylabel)

6 participants