
Conversation

@suo (Member) commented May 20, 2022

Stack from ghstack:

This makes the rocm jobs run on master-only. We've been battling queue
times for a few months now
(#73039). So far we have tried
or investigated:

  1. Moving distributed builds to master
  2. Moving distributed builds to periodic
  3. Only running rocm on a specific set of paths
  4. Running multiple jobs on a single rocm host.

Unfortunately, we haven't been able to reduce queuing times to good
levels. As a result, ROCm jobs are the "weightiest" job in PR CI, with
an average TTS of 3.3h (see https://hud.pytorch.org/metrics, panel name
"Job time-to-signal, all branches").

There are two things we haven't tried so far:

  1. Running "smoke tests" only on PR
  2. Switching rocm builds to master

Since #2 is easiest, let's give it a try. For now, the policy would be
the same as what we do for other capacity-constrained configurations
(Win and Mac)—run on master only, but revert if there is a breakage
introduced.

[skip ci]
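For context, here is a minimal sketch of what a master-only gate can look like in a GitHub Actions workflow. It is illustrative only, not the literal diff in this PR: the workflow name, job name, runner label, and script name are assumptions.

```yaml
# Illustrative sketch, not the actual workflow file changed by this PR.
name: rocm

on:
  push:
    branches:
      - master          # run ROCm build/test on pushes to master only
  # (the pull_request trigger is dropped so PRs no longer queue ROCm jobs)

jobs:
  rocm-build-and-test:                  # hypothetical job name
    # Alternative to changing the triggers: keep them and guard the job itself.
    if: github.ref == 'refs/heads/master'
    runs-on: linux.rocm.gpu             # assumed self-hosted ROCm runner label
    steps:
      - uses: actions/checkout@v2
      - name: Build and test
        run: ./build_and_test_rocm.sh   # placeholder for the real build/test steps
```

Either form has the same effect: ROCm capacity is spent only on master commits, and a breakage that lands is handled by revert rather than blocking every PR.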

@pytorch-bot added the module: rocm (AMD GPU support for Pytorch) label May 20, 2022
@facebook-github-bot (Contributor) commented May 20, 2022

✅ No Failures (0 Pending)

As of commit e9429ad (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚

This comment was automatically generated by Dr. CI. Please report bugs/suggestions to the (internal) Dr. CI Users group.

suo added a commit that referenced this pull request May 20, 2022
ghstack-source-id: e5faeb3
Pull Request resolved: #77989
@suo (Member, Author) commented May 20, 2022

cc @jeffdaily

@suo requested a review from @malfet May 20, 2022 18:44
@malfet (Contributor) left a comment


LGTM, although we need to investigate what can be done to meet queueing expectations

@suo (Member, Author) commented May 20, 2022

Heads up @jeffdaily @jithunnair-amd I'm planning to merge this at the end of the day today, so if we have any other ideas about reducing queueing times that should prompt us to reconsider this change, please raise them before then. Thanks!

@suo (Member, Author) commented May 21, 2022

@pytorchbot merge -f

@suo (Member, Author) commented May 21, 2022

alright let's give it a shot

facebook-github-bot pushed a commit that referenced this pull request May 24, 2022
Pull Request resolved: #77989

Approved by: https://github.com/malfet, https://github.com/janeyx99

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/1d845253d82c16a79c0737087f524a0896985a4c

Reviewed By: seemethere

Differential Revision: D36603002

Pulled By: seemethere

fbshipit-source-id: fac619553e6d7819e1a58154570edf69f79bbcef
@facebook-github-bot deleted the gh/suo/521/head branch May 24, 2022 14:17
swang392 pushed a commit that referenced this pull request May 25, 2022
Pull Request resolved: #77989

Approved by: https://github.com/malfet, https://github.com/janeyx99

Labels

cla signed, Merged, module: rocm (AMD GPU support for Pytorch)


6 participants