[CI] Retry flaky fp8 cutlass mla tests #24536
Signed-off-by: Nick Hill <nhill@redhat.com>
Code Review
This pull request marks a flaky fp8 test to be retried on failure. While this is a pragmatic fix for CI stability, I've raised a concern about using flaky markers as they can sometimes mask underlying non-deterministic bugs. I've suggested an alternative approach of adjusting the test's tolerance, which would provide a more robust and deterministic solution.
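To make the trade-off concrete, here is a minimal stdlib-only sketch of what retry-on-failure does. It is not the change in this PR: `retry_on_failure`, `flaky_check`, and `attempts` are hypothetical names invented for illustration, and the "fails only on the first attempt" behavior is a stand-in for a nondeterministic failure.

```python
import functools


def retry_on_failure(reruns):
    """Hypothetical helper: rerun a check up to `reruns` extra times on AssertionError."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            last_error = None
            for _ in range(reruns + 1):
                try:
                    return fn(*args, **kwargs)
                except AssertionError as err:
                    # Remember the failure and try again.
                    last_error = err
            # Every attempt failed, so surface the last failure.
            raise last_error
        return wrapper
    return decorator


attempts = {"count": 0}


@retry_on_failure(reruns=2)
def flaky_check():
    attempts["count"] += 1
    # Stand-in for a nondeterministic failure: only the first attempt fails.
    assert attempts["count"] > 1, "intermittent failure"
    return "passed"


result = flaky_check()
```

The check reports success even though one attempt failed, which is the masking effect the review is concerned about: the intermittent failure still happens, it just no longer fails the run.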
yewentao256
left a comment
Thanks for the work! Do you think we could first update the unit test as Gemini suggests, and then consider the retry?
Thanks @yewentao256 ... In this case I'm not sure adjusting the tolerance would work, since it looks like it may be some kind of overflow/underflow issue. Perhaps that's something in itself that should be investigated/fixed though? cc @MatthewBonanni
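As a rough illustration of why a looser tolerance may not help here: if an overflow produces `inf` or `NaN` (an assumption, since the thread does not confirm what values the failing test actually sees), no finite tolerance makes the comparison pass. The sketch below uses Python's `math.isclose`; `within_tolerance` is a hypothetical helper, not the comparison used in the test.

```python
import math


def within_tolerance(actual, expected, rel_tol, abs_tol):
    """Hypothetical helper wrapping math.isclose for a single scalar comparison."""
    return math.isclose(actual, expected, rel_tol=rel_tol, abs_tol=abs_tol)


expected = 1.0

# An ordinary rounding error is absorbed by a modest relative tolerance.
small_error = within_tolerance(1.05, expected, rel_tol=0.1, abs_tol=0.0)

# An overflowed or invalid result is rejected however large the tolerance is.
overflowed = within_tolerance(math.inf, expected, rel_tol=1e6, abs_tol=1e6)
invalid = within_tolerance(math.nan, expected, rel_tol=1e6, abs_tol=1e6)
```

Here `small_error` is `True`, while `overflowed` and `invalid` are both `False`, so widening the tolerance would only help if the failures are ordinary precision differences.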
@njhill Yeah, I'm still not sure what's driving this and haven't been able to reproduce it locally. As far as I can tell, it only shows up with
Thanks @MatthewBonanni. Is there an issue tracking this? Maybe we can merge this in the meantime to insulate CI for other PRs.
ywang96
left a comment
I think this is reasonable
Signed-off-by: Nick Hill <nhill@redhat.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Such as https://buildkite.com/vllm/ci/builds/29987#01992f55-5a98-42e8-9589-751e26e35165.