
rocm jobs are consistently queuing for 1h+ during working hours #73039

Description

@suo

We observe jobs consistently waiting in 1h+ queues during US working hours. You can see the current state of the queue here: https://metrics.pytorch.org/?orgId=1&refresh=5m&viewPanel=85. At time of writing (2 PM Pacific on a Thursday), it looked like:
[screenshot of the queue metrics panel showing ROCm queue times]

Queuing is bad because it increases time-to-signal for developers on their PRs. A 1h+ queue is especially bad because the full ROCm test suite already takes hours to complete, and while a job is queued we can't even get started. We have two options to reduce queuing:

  • Increase supply by growing the worker pool. This would involve spinning up more ROCm runners.
  • Decrease demand by reducing the amount of work we do in CI. For example, we could move ROCm to trunk-only (with the ability to manually run ROCm jobs on your PRs with CIFlow), or restrict ROCm workflows to commits that touch specific directories (see the sketch after this list).
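To make the demand-side option concrete, here is a minimal sketch of what the workflow triggers could look like, assuming GitHub Actions path filters and an opt-in label; the paths, label name, and runner labels below are illustrative assumptions, not the actual PyTorch configuration (the real workflows are generated):

```yaml
# Illustrative sketch only: run ROCm jobs on every trunk push, but on PRs
# only when ROCm-related files change.
name: rocm-tests

on:
  push:
    branches: [master]             # always run on trunk commits
  pull_request:
    paths:                         # on PRs, run only when these example paths change
      - 'tools/amd_build/**'
      - 'cmake/public/LoadHIP.cmake'

jobs:
  rocm-test:
    runs-on: [self-hosted, rocm]   # hypothetical self-hosted runner labels
    # For the "trunk-only with manual opt-in" variant, one could instead keep a
    # plain pull_request trigger and gate the job on an opt-in label, e.g.:
    #   if: >
    #     github.event_name == 'push' ||
    #     contains(github.event.pull_request.labels.*.name, 'ciflow/rocm')
    steps:
      - uses: actions/checkout@v2
      - run: echo "run the ROCm test suite here"   # placeholder for the real test entrypoint
```

Either variant keeps full ROCm coverage on trunk while cutting per-PR demand on the runner pool.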

My preference would be to increase supply if possible, since then we don't have to add any custom logic to the CI.

cc @ezyang @gchanan @zou3519 @jeffdaily @sunway513 @jithunnair-amd @ROCmSupport @KyleCZH @seemethere @malfet @pytorch/pytorch-dev-infra

Metadata

Labels

high priority, module: ci (Related to continuous integration), module: rocm (AMD GPU support for PyTorch), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Status: Done
