
Conversation

@suo (Member) commented May 20, 2022

Stack from ghstack:

This makes the rocm jobs run on master-only. We've been battling queue
times for a few months now
(#73039). So far we have tried
or investigated:

  1. Moving distributed builds to master
  2. Moving distributed builds to periodic
  3. Only running rocm on a specific set of paths
  4. Running multiple jobs on a single rocm host.

Unfortunately, we haven't been able to reduce queuing times to good
levels. As a result, ROCm jobs are the "weightiest" job in PR CI, with
an average TTS of 3.3h (see https://hud.pytorch.org/metrics, panel name
"Job time-to-signal, all branches").

There are two things we haven't tried so far:

  1. Running "smoke tests" only on PR
  2. Switching rocm builds to master

Since #2 is easiest, let's give it a try. For now, the policy would be
the same as what we do for other capacity-constrained configurations
(Win and Mac)—run on master only, but revert if there is a breakage
introduced.

[skip ci]
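For context, here is a minimal sketch of what a master-only gate can look like in a GitHub Actions workflow. It is illustrative only, not the literal diff in this PR: the workflow name, job name, runner label, and script name are assumptions.

```yaml
# Illustrative sketch, not the actual workflow file changed by this PR.
name: rocm

on:
  push:
    branches:
      - master          # run ROCm build/test on pushes to master only
  # (the pull_request trigger is dropped so PRs no longer queue ROCm jobs)

jobs:
  rocm-build-and-test:                  # hypothetical job name
    # Alternative to changing the triggers: keep them and guard the job itself.
    if: github.ref == 'refs/heads/master'
    runs-on: linux.rocm.gpu             # assumed self-hosted ROCm runner label
    steps:
      - uses: actions/checkout@v2
      - name: Build and test
        run: ./build_and_test_rocm.sh   # placeholder for the real build/test steps
```

Either form has the same effect: ROCm capacity is spent only on master commits, and a breakage that lands is handled by revert rather than blocking every PR.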

@pytorch-bot added the module: rocm (AMD GPU support for Pytorch) label May 20, 2022
@facebook-github-bot (Contributor) commented May 20, 2022

✅ No Failures (0 Pending)

As of commit e9429ad (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚

This comment was automatically generated by Dr. CI. Please report bugs/suggestions to the (internal) Dr. CI Users group.

suo added a commit that referenced this pull request May 20, 2022
ghstack-source-id: e5faeb3
Pull Request resolved: #77989
@suo (Member, Author) commented May 20, 2022

cc @jeffdaily

@suo requested a review from @malfet May 20, 2022 18:44
@malfet (Contributor) left a comment


LGTM, although we need to investigate what can be done to meet queueing expectations

@suo (Member, Author) commented May 20, 2022

Heads up @jeffdaily @jithunnair-amd I'm planning to merge this at the end of the day today, so if we have any other ideas about reducing queueing times that should prompt us to reconsider this change, please raise them before then. Thanks!

@suo (Member, Author) commented May 21, 2022

@pytorchbot merge -f

@suo (Member, Author) commented May 21, 2022

alright let's give it a shot

facebook-github-bot pushed a commit that referenced this pull request May 24, 2022
Pull Request resolved: #77989

Approved by: https://github.com/malfet, https://github.com/janeyx99

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/1d845253d82c16a79c0737087f524a0896985a4c

Reviewed By: seemethere

Differential Revision: D36603002

Pulled By: seemethere

fbshipit-source-id: fac619553e6d7819e1a58154570edf69f79bbcef
@facebook-github-bot deleted the gh/suo/521/head branch May 24, 2022 14:17
swang392 pushed a commit that referenced this pull request May 25, 2022
Pull Request resolved: #77989

Approved by: https://github.com/malfet, https://github.com/janeyx99

Labels

cla signed, Merged, module: rocm (AMD GPU support for Pytorch)


6 participants