Make the CUTLASS swizzle options configurable and default to 2. #146088
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146088
Note: links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: there is 1 currently active SEV; check whether your PR is affected.
❌ 1 New Failure: as of commit f8d8bc3 with merge base 354fe48, one job has failed.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
From the diff under review:

```python
# This is mainly used to reduce test time in CI.
cutlass_max_profiling_configs: Optional[int] = None

# The L2 swizzle values to consider when profiling CUTLASS configs in max_autotune.
```
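The option added directly below that comment is truncated in this extract; as a sketch, assuming the name and default implied by the PR title rather than shown verbatim in the diff, the line plausibly reads:

```python
# Assumed reconstruction (name and [2] default inferred from the PR title,
# "Make the CUTLASS swizzle options configurable and default to 2"):
cutlass_max_profiling_swizzle_options: list[int] = [2]
```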
probably also mention what are good values to put in here
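(For illustration only, a minimal sketch of overriding the knob at runtime, assuming it lands next to cutlass_max_profiling_configs under the cuda namespace of torch._inductor.config:)

```python
import torch._inductor.config as inductor_config

# Trade longer autotuning for potentially better kernels by profiling more
# L2 swizzle values; the candidates discussed in this thread are 1, 2, 4, 8.
inductor_config.cuda.cutlass_max_profiling_swizzle_options = [1, 2, 4, 8]
```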
Chillee left a comment:
I'm not sure I quite understand this PR? Is it just to reduce compilation time?
@Chillee, yeah. For one data point, see: https://fburl.com/workplace/gx0zim0l
Profiling all four swizzle values also 4x's the number of configs, and that multiplier cannot be controlled by cutlass_max_profiling_configs. For example, even if you set cutlass_max_profiling_configs = 10, you will still be autotuning 40 configs.
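(A minimal sketch of that arithmetic, using the numbers from the comment above:)

```python
# Every candidate config is profiled once per swizzle value, so the swizzle
# list multiplies the autotuning work, and cutlass_max_profiling_configs
# does not cap the product.
max_profiling_configs = 10          # cutlass_max_profiling_configs
swizzle_options = [1, 2, 4, 8]      # profiling all four values
total_autotuned = max_profiling_configs * len(swizzle_options)
assert total_autotuned == 40        # 40 configs end up being autotuned
```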
@Chillee @masnesral can you land this just to recover the flaky test signal? Even if we want the default to be [1, 2, 4, 8], as long as it is configurable, we can fix that in the test.
I'll land as-is. I only chose '2' because @henrylhtsang suggested that in offline discussion. If someone can give insights on the "best" default, I'll gladly change it.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Merge failed. Reason: 1 job has failed: trunk / linux-focal-rocm6.3-py3.10 / test (distributed, 1, 1, linux.rocm.gpu.4). Raised by workflow job.
@pytorchbot merge -i
Merge started. Your change will be merged while ignoring the following check: trunk / linux-focal-rocm6.3-py3.10 / test (distributed, 1, 1, linux.rocm.gpu.4). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Stack from ghstack (oldest at bottom):
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @amjames @desertfire @chauhang @aakhundov