Skip to content

Conversation

@njhill
Copy link
Member

@njhill njhill commented Sep 9, 2025

Signed-off-by: Nick Hill <nhill@redhat.com>
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request marks a flaky fp8 test to be retried on failure. While this is a pragmatic fix for CI stability, I've raised a concern about using flaky markers as they can sometimes mask underlying non-deterministic bugs. I've suggested an alternative approach of adjusting the test's tolerance, which would provide a more robust and deterministic solution.

Copy link
Member

@yewentao256 yewentao256 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work! Do you think we can update the unit test as Gemini suggests first then considering retry?

@njhill
Copy link
Member Author

njhill commented Sep 9, 2025

Thanks @yewentao256 ... In this case I'm not sure adjusting the tolerance would work since it looks like it may be some kind of overflow/underflow issue:

>           assert cos_diff < 1e-4
E           assert 1.0 < 0.0001
>               assert cos_diff < 1e-4
E               assert nan < 0.0001

Perhaps that's something in itself that should be investigated/fixed though? cc @MatthewBonanni

@MatthewBonanni
Copy link
Contributor

MatthewBonanni commented Sep 9, 2025

@njhill Yeah, I'm still not sure what's driving this and haven't been able to reproduce it locally. As far as I can tell, it only shows up with VLLM_USE_PRECOMPILED=1 and only on CI, so it's tricky. Working on it though

@njhill
Copy link
Member Author

njhill commented Sep 9, 2025

Thanks @MatthewBonanni. Is there an issue tracking this? Maybe we can merge this in the meantime to insulate CI for other PRs.

Copy link
Member

@ywang96 ywang96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is reasonable

@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 10, 2025
@njhill njhill merged commit 7e7db04 into vllm-project:main Sep 10, 2025
25 checks passed
@njhill njhill deleted the retry-blackwell-flake branch September 10, 2025 03:33
@MatthewBonanni
Copy link
Contributor

@njhill I just created #24590 to track it. Thanks for doing this, it sounds good to me as a start

skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025
Signed-off-by: Nick Hill <nhill@redhat.com>
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: Nick Hill <nhill@redhat.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci/build ready ONLY add when PR is ready to merge/full CI is needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants