[kernel] Support W4A8 on Hopper #23198
Conversation
CMakeLists.txt (outdated)
The build condition is the same as machete's; I can also merge them together if that is preferred.
I think having them separate for now is fine 👍 it keeps the CMakeLists more compartmentalized.
This change says not to ignore the activation config even if the format is not in the activation types. Are there tests I can refer to, or scenarios that rely on this behavior? I can special-case it further if needed.
LucasWilkinson left a comment:

Amazing work! Thank you for the clean integration following existing abstractions; it's very much appreciated 😄
Hey, thanks for the PR. I just tried it with the following scheme and got a KeyError:

(EngineCore_DP0 pid=30408) File "/teamspace/studios/this_studio/vllm/vllm/v1/worker/gpu_worker.py", line 213, in load_model

Any tips on what one would need to change to make it work for this W4A8 quantization scheme?
Purpose
Add support in vLLM for a CUTLASS-based W4A8 kernel on Hopper (see CUTLASS example 55), which uses a LUT trick to bypass the int4 -> bf16 -> fp8 conversion in the GEMM mainloop. This improves compute-bound performance and allows W4A8 to approach peak FP8 throughput while still maintaining the fast decoding speed of W4A16.
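Conceptually, the LUT trick replaces a per-element conversion chain with a 16-entry table lookup. A minimal Python sketch of the idea (illustration only, not the kernel code; the real kernel performs the lookup in registers on fp8 bit patterns):

```python
import torch

# Two's-complement int4: patterns 0..7 encode 0..7, patterns 8..15 encode -8..-1.
LUT = torch.tensor([0., 1., 2., 3., 4., 5., 6., 7.,
                    -8., -7., -6., -5., -4., -3., -2., -1.])

def dequant_int4_via_lut(nibbles: torch.Tensor) -> torch.Tensor:
    """nibbles: tensor of unpacked 4-bit patterns (values 0..15)."""
    # Every possible pattern is resolved by a single table lookup, so no
    # int4 -> bf16 -> fp8 conversion chain is needed per element.
    return LUT[nibbles.to(torch.int64)]
```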
The kernel performs the computation

out = ((w_q * w_s) @ a) * s_a * s_c

where:

- `out` is the output matrix with type `bf16`
- `w_q` is the packed quantized weight matrix with type `int4`
- `scale_type` is `fp8 e4m3`
- `w_s` is the packed scales with type `fp8 e4m3` and group size 128
- `a` is the (dynamically quantized) activations with type `fp8 e4m3`
- `s_a` is the per-token activation scales with type `fp32`
- `s_c` is the per-channel scales with type `fp32`

and the per-token/per-channel scaling is done in the epilogue. Note that zero points, activation reordering, and smaller group sizes are not supported yet.
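For clarity, a minimal PyTorch reference of this computation. This is a sketch only: the function name is hypothetical, and the shapes, layouts, and exact scale application in the real kernel differ (it runs the GEMM in fp8 and applies the scaling in the epilogue).

```python
import torch

def w4a8_reference(w_q, w_s, a, s_a, s_c, group_size=128):
    """Sketch of out = ((w_q * w_s) @ a^T) * s_a * s_c.

    Assumed shapes:
      w_q : [N, K]   signed int4 weights, held here as int8 in [-8, 7]
      w_s : [N, K // group_size]  fp8 e4m3 group scales
      a   : [M, K]   dynamically quantized fp8 e4m3 activations
      s_a : [M]      per-token activation scales, fp32
      s_c : [N]      per-channel scales, fp32
    Returns a [M, N] bf16 output.
    """
    # Dequantize: broadcast each group scale across its group_size columns.
    w = w_q.float() * w_s.float().repeat_interleave(group_size, dim=1)
    acc = a.float() @ w.t()                                   # [M, N], fp32 accumulate
    out = acc * s_a.float()[:, None] * s_c.float()[None, :]   # per-token / per-channel
    return out.to(torch.bfloat16)
```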
There are additional requirements on the layout/encoding of scales and weights, which are handled by two helper routines, `cutlass_pack_scale_fp8` and `cutlass_encode_and_reorder_int4b`. The original weights are also expected to be encoded as signed int4, which is notably different from the commonly used `int4b8` (though we can losslessly convert between the two - more on that in the Test Plan section).
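As a hedged sketch of that lossless conversion, assuming `int4b8` refers to the unsigned encoding with a bias of 8 (stored = value + 8) used by common W4A16 checkpoints:

```python
import torch

def int4b8_to_signed(stored: torch.Tensor) -> torch.Tensor:
    """stored: unpacked int4b8 values in [0, 15] (any integer dtype)."""
    return stored.to(torch.int8) - 8                       # -> signed values in [-8, 7]

def signed_to_int4b8(values: torch.Tensor) -> torch.Tensor:
    """values: signed int4 values in [-8, 7] (any integer dtype)."""
    return (values.to(torch.int16) + 8).to(torch.uint8)    # -> stored values in [0, 15]
```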
Kernel

The main file is `w4a8_mm_entry.cu`, which implements the new kernel. The heuristic used in `mm_dispatch` was distilled from a sweep over various tile/cluster shapes and problem shapes taken from open-source models like Llama 8B/70B/405B. In aggregate, this heuristic achieves perf within ~1-2% of the best config for each problem shape tested.

The new torch ops are registered along with their `fake` variants for cudagraph.
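For illustration, a `fake` variant only describes output shapes/dtypes so the op can be traced without launching the kernel. Below is a generic sketch using the plain `torch.library` API with a hypothetical op name; vLLM's actual registration mechanism, op names, and signatures differ.

```python
import torch

@torch.library.custom_op("demo::w4a8_gemm", mutates_args=())
def w4a8_gemm(a: torch.Tensor, w_q: torch.Tensor, w_s: torch.Tensor,
              s_a: torch.Tensor, s_c: torch.Tensor) -> torch.Tensor:
    raise NotImplementedError("the real op dispatches to the CUDA kernel")

@w4a8_gemm.register_fake
def _(a, w_q, w_s, s_a, s_c):
    # Only shape/dtype propagation is needed for tracing: [M, K] x [N, K] -> [M, N] bf16.
    return a.new_empty((a.shape[0], s_c.shape[0]), dtype=torch.bfloat16)
```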
vLLM Frontend

An example quantization config which will trigger W4A8:

Basically, the weights are pack-quantized (8 4-bit values packed into an int32), the group size is 128, and the activations are quantized to 8 bits (fp8 e4m3) with dynamic scaling. Both weight and activation quantization are symmetric.
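A config with these properties, written with compressed-tensors-style field names, might look roughly like the following (an illustrative sketch shown as a Python dict, not copied from this PR):

```python
w4a8_quantization_config = {
    "format": "pack-quantized",            # 8 x 4-bit values packed into an int32
    "config_groups": {
        "group_0": {
            "targets": ["Linear"],
            "weights": {
                "num_bits": 4,
                "type": "int",
                "symmetric": True,
                "strategy": "group",
                "group_size": 128,
            },
            "input_activations": {
                "num_bits": 8,
                "type": "float",           # fp8 e4m3
                "symmetric": True,
                "strategy": "token",
                "dynamic": True,
            },
        }
    },
}
```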
`_is_fp8_w4a8_sm90` checks that the config/device are compatible with W4A8 and returns the `CompressedTensorsW4A8Fp8` scheme. `vllm/model_executor/layers/quantization/kernels/mixed_precision/cutlass.py` implements `CutlassW4A8LinearKernel`, which wraps the per-token activation quant + W4A8 op and calls the pre-processing routines.
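The per-token dynamic fp8 quantization being wrapped can be sketched as follows (a simplified version; vLLM uses its own fused quant op):

```python
import torch

def per_token_fp8_quant(x: torch.Tensor):
    """Sketch of dynamic per-token fp8 e4m3 activation quantization.

    x: [num_tokens, K] activations in fp16/bf16/fp32.
    Returns (x_fp8, s_a) where s_a is the per-token scale in fp32.
    """
    fp8_max = torch.finfo(torch.float8_e4m3fn).max                  # 448.0
    amax = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12).float()
    s_a = amax / fp8_max                                            # [num_tokens, 1]
    x_fp8 = (x.float() / s_a).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return x_fp8, s_a
```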
Test Plan

Kernel

- `pytest tests/kernels/quantization/test_cutlass_w4a8.py` - tests kernel correctness + cudagraph. Note that we use fp8 with fast accumulate for the reference computation, as suggested in CUTLASS upstream.
- perf benchmark against W4A16 (machete), fp16, and CUTLASS fp8
E2E
- baseline: W4A16 checkpoint for CohereLabs/c4ai-command-a-03-2025 (111B dense model)
- generate the W4A8 checkpoint by:
- e2e perf: run the serving benchmark
- e2e quality: run lm-eval gsm8k as a sanity check

More detailed perf/quality evals are pending.
Test Result
`pytest tests/kernels/quantization/test_cutlass_w4a8.py` - pass

kernel perf benchmarks
serving benchmark settings:
gsm8k
mmlu_pro
TODOs