Closed
Labels
ci: sev (critical failure affecting PyTorch CI)
module: rocm (AMD GPU support for Pytorch)
Description
NOTE: Remember to label this issue with "ci: sev"
Current Status
preemptive
Error looks like
ROCm PyTorch jobs will take a long time to queue because a subset of ROCm PyTorch nodes is undergoing upgrades.
Incident timeline (all times pacific)
28th Sept 2024 3:06 PM - 30th Sept 2024 ~12:30 PM
User impact
Queued jobs will take a long time to be picked up by runners.
Root cause
ROCm PyTorch nodes are undergoing ROCm upgrades.
Mitigation
Completing the upgrade.
Prevention/followups
N/A
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd
Status
Done