PyTorch Testing Nodes Undergoing ROCm 6.2.1 Upgrades #136928

@amdfaa

Description

NOTE: Remember to label this issue with "ci: sev"

Current Status

preemptive

Error looks like

ROCm PyTorch jobs will queue for a long time because a subset of the ROCm PyTorch CI nodes is undergoing upgrades.

Incident timeline (all times pacific)

28 Sept 2024, 3:06 PM to 30 Sept 2024, ~12:30 PM

User impact

Queued jobs will take a long time to be picked up by runners.

Root cause

ROCm PyTorch nodes are undergoing ROCm 6.2.1 upgrades.

Mitigation

Completing the ROCm 6.2.1 upgrade on the affected nodes.
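As a sanity check once a node is back in the pool, one way to confirm it is on the expected ROCm build is to inspect the HIP version reported by the installed PyTorch wheel. This is a minimal sketch, assuming a ROCm build of torch is installed on the runner; the exact version string to expect depends on the wheel that was rolled out with the upgrade.

```python
# Minimal post-upgrade sanity check for a ROCm runner (assumes a ROCm
# build of PyTorch is installed). Prints the HIP/ROCm version the wheel
# was built against and the first visible GPU, if any.
import torch

print("torch:", torch.__version__)
print("HIP (ROCm) version:", torch.version.hip)  # None on CUDA/CPU-only builds

if torch.cuda.is_available():  # ROCm devices are exposed through the cuda API
    print("device:", torch.cuda.get_device_name(0))
else:
    print("no ROCm device visible to this process")
```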

Prevention/followups

N/A

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

Labels

ci: sev (critical failure affecting PyTorch CI), module: rocm (AMD GPU support for PyTorch)