
rocm jobs are consistently queuing for 1h+ during working hours #73039

Description

@suo

We observe jobs consistently waiting in 1h+ queues during US working hours. You can see the current state of the queue here: https://metrics.pytorch.org/?orgId=1&refresh=5m&viewPanel=85. At time of writing (2 PM Pacific on a Thursday), it looked like:
[screenshot of the queue metrics panel showing ROCm queue times]

Queuing is bad because it increases time-to-signal for developers on their PRs. A 1h+ queue is especially bad because the full ROCm test suite already takes hours to complete, and while a job is queued we can't even get started. We have two options to reduce queuing:

  • Increase supply by growing the worker pool. This would involve spinning up more ROCm runners.
  • Decrease demand by reducing the amount of work we do in CI. For example, we could move ROCm to trunk-only (with the ability to manually run ROCm jobs on your PRs with CIFlow), or restrict ROCm workflows to commits that touch specific directories (see the sketch after this list).
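To make the demand-side option concrete, here is a minimal sketch of what the workflow triggers could look like, assuming GitHub Actions path filters and an opt-in label; the paths, label name, and runner labels below are illustrative assumptions, not the actual PyTorch configuration (the real workflows are generated):

```yaml
# Illustrative sketch only: run ROCm jobs on every trunk push, but on PRs
# only when ROCm-related files change.
name: rocm-tests

on:
  push:
    branches: [master]             # always run on trunk commits
  pull_request:
    paths:                         # on PRs, run only when these example paths change
      - 'tools/amd_build/**'
      - 'cmake/public/LoadHIP.cmake'

jobs:
  rocm-test:
    runs-on: [self-hosted, rocm]   # hypothetical self-hosted runner labels
    # For the "trunk-only with manual opt-in" variant, one could instead keep a
    # plain pull_request trigger and gate the job on an opt-in label, e.g.:
    #   if: >
    #     github.event_name == 'push' ||
    #     contains(github.event.pull_request.labels.*.name, 'ciflow/rocm')
    steps:
      - uses: actions/checkout@v2
      - run: echo "run the ROCm test suite here"   # placeholder for the real test entrypoint
```

Either variant keeps full ROCm coverage on trunk while cutting per-PR demand on the runner pool.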

My preference would be to increase supply if possible, since then we don't have to add any custom logic to the CI.

cc @ezyang @gchanan @zou3519 @jeffdaily @sunway513 @jithunnair-amd @ROCmSupport @KyleCZH @seemethere @malfet @pytorch/pytorch-dev-infra

Metadata

Labels

high priority, module: ci (Related to continuous integration), module: rocm (AMD GPU support for PyTorch), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Status: Done
