[Perf] Tune configs for triton block fp8 gemm H100/H200 #23748
Signed-off-by: mgoin <mgoin64@gmail.com>
Code Review
This pull request introduces performance tuning configurations for the FP8 block-wise GEMM kernel on H100/H200 GPUs, specifically for matrix shapes found in the DeepSeek-V3 model. It includes a new benchmark script to compare the performance of the w8a8-block-fp8 kernel against the standard bfloat16 GEMM, along with numerous new and updated JSON files containing the tuned kernel parameters. The changes appear to be correct and well-aligned with the goal of improving performance. The new benchmark script is well-implemented, and the configuration files are consistent with the output of a tuning process. I have not identified any issues of high or critical severity.
yewentao256 left a comment
LGTM, thanks for the work!
…#23748) Signed-off-by: mgoin <mgoin64@gmail.com>
Purpose
Retune the triton fp8 block dense gemm configs for modern triton. Also adds a simple benchmark script that doesn't tune.
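To make the role of the tuned configs concrete, here is a minimal sketch of how a per-shape JSON file could be consumed at run time. It assumes each file maps a tuned M value to a dictionary of Triton launch parameters and that an untuned M falls back to the nearest tuned one. The key names, the values, and the helpers `load_configs` and `pick_config` are illustrative assumptions, not taken from this PR.

```python
import json

# Hypothetical content of one tuned-config file for a single (N, K) shape.
# Keys are M values stored as JSON strings; the numbers are illustrative only.
EXAMPLE_JSON = """
{
  "1":    {"BLOCK_SIZE_M": 16,  "BLOCK_SIZE_N": 64,  "BLOCK_SIZE_K": 128,
           "GROUP_SIZE_M": 1,  "num_warps": 4, "num_stages": 3},
  "64":   {"BLOCK_SIZE_M": 64,  "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128,
           "GROUP_SIZE_M": 16, "num_warps": 4, "num_stages": 3},
  "4096": {"BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128,
           "GROUP_SIZE_M": 32, "num_warps": 8, "num_stages": 4}
}
"""


def load_configs(text):
    """Parse the JSON text and convert the string M keys to integers."""
    return {int(m): cfg for m, cfg in json.loads(text).items()}


def pick_config(configs, m):
    """Return the config whose tuned M is closest to the requested M."""
    nearest_m = min(configs, key=lambda tuned_m: abs(tuned_m - m))
    return configs[nearest_m]


configs = load_configs(EXAMPLE_JSON)
small = pick_config(configs, 3)      # closest tuned M is 1
medium = pick_config(configs, 48)    # closest tuned M is 64
print(small["BLOCK_SIZE_M"], medium["BLOCK_SIZE_M"])
```

Under this assumed scheme, retuning only changes the values stored per M, so callers are unaffected, and gains at small M show up wherever decode-sized batches hit the kernel.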
Mostly improves performance for smaller M, but crucially gives an improvement for N=576,K=7168.
Test Plan
Test Result
H100
H200