I have a standard FashionMNIST training script that runs well at full precision (fp32). I've run it on Nvidia devices and, more recently, on the Radeon 8060S (GMKTec Evo-X2). I implemented mixed precision, trying both bfloat16 and float16. (FYI: torch.cuda.amp is now deprecated. A warning says to use torch.amp.[...] and pass device_type.)
This speeds things up quite a bit on my Nvidia devices, so I tried it on my Radeon recently. It runs, but it is actually slower than the fp32 version, and I'm not quite sure why. This is on Ubuntu 24.04.3 with ROCm 7 after following the installs at:
What's more confusing is that the PC came with Windows natively. I had originally used an unofficial PyTorch build with ROCm from TheRock: https://github.com/scottt/rocm-TheRock/releases/tag/v6.5.0rc-pytorch-gfx110x This was pretty good and sped things up similarly to Nvidia when using mixed precision. However, being an unofficial release, it had a number of bugs. I'm wondering if anyone else has had this problem. If so, are there any red flags to watch for, namely compatibility issues that can arise with certain torch or ROCm versions, or even Ubuntu versions (is 22.04 better)? Would appreciate any tips you may have!
Replies: 1 comment
I solved this problem by installing a nightly PyTorch build specific to my gfx:
Enjoy your new ROCm PyTorch on Windows (experimental) ※(^o^)/※
1. Install ROCm if you need it (substitute your gfx target if gfx1151 is not yours; on Linux you can find it with `rocminfo | grep gfx`):

   ```shell
   pip install --index-url https://rocm.nightlies.amd.com/v2/gfx1151/ "rocm[libraries,devel]"
   ```

2. Install the nightly torch build supporting the 8060S's gfx:

   ```shell
   pip install --index-url https://rocm.nightlies.amd.com/v2/gfx1151/ --pre torch torchaudio torchvision
   ```

3. Verify:

   ```shell
   python -c "import torch; print(torch.cuda.is_available())"
   ```

   It might say CUDA here, but if it prints True it means the ROCm-compiled torch build was correctly installed.
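Beyond `torch.cuda.is_available()`, a slightly stronger sanity check from Python, assuming nothing beyond standard torch attributes (`torch.version.hip` is populated only on ROCm builds, so it distinguishes a ROCm wheel from a CUDA or CPU one):

```python
import torch

print(torch.__version__)   # ROCm nightlies typically carry a +rocm suffix
print(torch.version.hip)   # None on CUDA/CPU builds, a HIP version string on ROCm builds
if torch.cuda.is_available():
    # should name the Radeon GPU on a working ROCm install
    print(torch.cuda.get_device_name(0))
```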