[ci] move rocm jobs from pull to trunk workflow #77989
This makes the ROCm jobs run on master only. We've been battling queue times for a few months now (#73039). So far we have tried or investigated:

1. Moving distributed builds to master
2. Moving distributed builds to periodic
3. Only running ROCm on a specific set of paths
4. Running multiple jobs on a single ROCm host

Unfortunately, we haven't been able to reduce queuing times to acceptable levels. As a result, ROCm jobs are the "weightiest" jobs in PR CI, with an average time-to-signal (TTS) of 3.3h (see https://hud.pytorch.org/metrics, panel name "Job time-to-signal, all branches").

There are two things we haven't tried so far:

1. Running only "smoke tests" on PRs
2. Switching ROCm builds to master

Since the second option is the easiest, let's give it a try. For now, the policy would be the same as for other capacity-constrained configurations (Win and Mac): run on master only, but revert if a breakage is introduced.

[skip ci]
[ghstack-poisoned]
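For illustration only, the master-only policy described above could be modeled roughly as follows. This is a minimal sketch, not the actual workflow configuration: the names `CAPACITY_CONSTRAINED`, `Trigger`, and `should_run` are hypothetical, and the real change is made in the CI workflow definitions rather than in Python.

```python
from dataclasses import dataclass

# Hypothetical platform labels, chosen for this sketch only.
# They are not identifiers from the PyTorch CI configuration.
CAPACITY_CONSTRAINED = {"rocm", "win", "mac"}


@dataclass(frozen=True)
class Trigger:
    event: str   # "pull_request" or "push"
    branch: str  # branch the event targets, e.g. "master"


def should_run(platform: str, trigger: Trigger) -> bool:
    """Return True if a job for `platform` should run for this trigger."""
    on_master_push = trigger.event == "push" and trigger.branch == "master"
    if platform in CAPACITY_CONSTRAINED:
        # Capacity-constrained jobs skip PRs and run only on pushes to master.
        return on_master_push
    # All other jobs run on PRs as well as on pushes to master.
    return trigger.event == "pull_request" or on_master_push
```

Under this model a ROCm breakage is first seen on master rather than on the PR, which is why the policy pairs master-only runs with reverting the offending commit.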
✅ No Failures (0 Pending) as of commit e9429ad (more details on the Dr. CI page). Looks good so far! There are no failures yet. This comment was automatically generated by Dr. CI.
ghstack-source-id: e5faeb3
Pull Request resolved: #77989
cc @jeffdaily
malfet left a comment
LGTM, although we need to investigate what can be done to meet queueing expectations.
Heads up @jeffdaily @jithunnair-amd I'm planning to merge this at the end of the day today, so if we have any other ideas about reducing queueing times that should prompt us to reconsider this change, please raise them before then. Thanks!

@pytorchbot merge -f

alright let's give it a shot
Pull Request resolved: #77989
Approved by: https://github.com/malfet, https://github.com/janeyx99
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/1d845253d82c16a79c0737087f524a0896985a4c
Reviewed By: seemethere
Differential Revision: D36603002
Pulled By: seemethere
fbshipit-source-id: fac619553e6d7819e1a58154570edf69f79bbcef