
Conversation

@CedricHwong commented on Dec 19, 2025

Fixes #15467

Motivation

ModelOpt-exported HuggingFace checkpoints encode the FP8 flavor via hf_quant_config.json
(quantization.quant_algo). We’ve seen two ModelOpt FP8 variants show up in real PTQ workflows and serving
deployments on Hopper (H200):

  • FP8_PER_CHANNEL_PER_TOKEN (fp8_pc_pt)
  • FP8_PB_WO (fp8_pb_wo)

SGLang’s existing ModelOpt path did not consistently recognize/route these variants, which could cause load-time failures or wrong quant method selection. This PR makes those variants first-class and keeps the matching strict
to avoid accidentally treating other FP8 formats as ModelOpt.
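
For context, a minimal sketch of how such a checkpoint identifies its flavor (only the quantization.quant_algo field comes from the checkpoint format described above; the helper name and everything else here is illustrative, not SGLang's actual loader code):

    import json
    from pathlib import Path

    def read_modelopt_quant_algo(checkpoint_dir: str):
        """Return the quant_algo string from hf_quant_config.json
        (e.g. "FP8_PER_CHANNEL_PER_TOKEN" or "FP8_PB_WO"), or None if absent."""
        cfg_path = Path(checkpoint_dir) / "hf_quant_config.json"
        if not cfg_path.is_file():
            return None
        with cfg_path.open() as f:
            cfg = json.load(f)
        # ModelOpt exports nest the algorithm under "quantization".
        return cfg.get("quantization", {}).get("quant_algo")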

Modifications

  • Quantization:
    • Add explicit support for ModelOpt HF FP8 variants:
      • FP8_PER_CHANNEL_PER_TOKEN (per-channel weight scale, per-token activation scaling)
      • FP8_PB_WO (blockwise FP8 weight-only)
    • Normalize quant_algo parsing (case-insensitive) and error out on unknown FP8 algos instead of accepting any FP8* string (a minimal sketch follows this list).
  • Tests:
    • Add unit coverage for config parsing + strict algo matching/normalization without requiring real checkpoints in CI.
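
A minimal sketch of the strict, case-insensitive matching described above (the helper name, the contents of the supported set beyond the two new variants, and the exact error type are illustrative, not necessarily the code in this PR):

    # Illustrative sketch; names need not match the actual SGLang implementation.
    _SUPPORTED_MODELOPT_FP8_ALGOS = {
        "FP8",                        # existing per-tensor path
        "FP8_PER_CHANNEL_PER_TOKEN",  # fp8_pc_pt
        "FP8_PB_WO",                  # fp8_pb_wo (blockwise weight-only)
    }

    def normalize_modelopt_fp8_algo(quant_algo: str) -> str:
        algo = quant_algo.strip().upper()
        if algo not in _SUPPORTED_MODELOPT_FP8_ALGOS:
            # Fail loudly instead of treating any "FP8*" string as ModelOpt.
            raise ValueError(f"Unsupported ModelOpt FP8 quant_algo: {quant_algo!r}")
        return algo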

Accuracy Tests

Unit tests:

  • PYTHONPATH=python pytest -q test/srt/test_modelopt_loader.py -k "TestModelOptFp8ConfigVariants or test_engine_with_modelopt_quant_cli_argument"
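
As an illustration of what that coverage exercises (the config class, import path, and raised exception below are assumptions modeled on SGLang's existing ModelOpt quantization config, not the literal contents of the test file):

    import pytest

    # Assumed module path for the ModelOpt FP8 config.
    from sglang.srt.layers.quantization.modelopt_quant import ModelOptFp8Config

    def test_rejects_unknown_fp8_algo():
        hf_quant_config = {"quantization": {"quant_algo": "FP8_SOMETHING_ELSE"}}
        with pytest.raises(ValueError):
            ModelOptFp8Config.from_config(hf_quant_config)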

Benchmarking and Profiling

Not included. This change primarily enables and dispatches support for additional checkpoint formats; performance evaluation can be done as a follow-up once merged (I can provide H200 numbers if desired).

Checklist

  • Format your code according to the Format code with pre-commit.
  • Add unit tests according to the Run and add unit tests.
  • Update documentation according to Write documentations.
  • Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
  • Follow the SGLang code style guidance.
  • Work with maintainers to merge your PR.

The github-actions bot added the documentation (Improvements or additions to documentation) and quant (LLM Quantization) labels on Dec 19, 2025.
@gemini-code-assist bot commented (Contributor)

Summary of Changes

Hello @CedricHwong, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances SGLang's ModelOpt integration by introducing explicit support for two prevalent HuggingFace FP8 quantization variants: FP8_PER_CHANNEL_PER_TOKEN and FP8_PB_WO. These variants are crucial for real-world PTQ workflows and optimal performance on NVIDIA Hopper GPUs. The changes ensure that SGLang correctly identifies and routes these specific quantization formats, thereby preventing potential load-time failures or misapplication of quantization methods. The update also includes robust error handling for unknown FP8 algorithms, comprehensive documentation, and dedicated unit tests to validate the new parsing logic.

Highlights

  • New ModelOpt FP8 Variants Support: Explicit support has been added for two new ModelOpt HuggingFace FP8 quantization variants: FP8_PER_CHANNEL_PER_TOKEN (per-channel weight scale, per-token activation scaling) and FP8_PB_WO (2D blockwise FP8 weight-only).
  • Strict quant_algo Parsing: The parsing of quant_algo values is now case-insensitive and enforces strict matching. The system will now error out if an unknown FP8 algorithm is encountered, preventing incorrect quantization method selection.
  • Updated Documentation: The documentation has been updated to include details on the newly supported quant_algo values for ModelOpt FP8 and provides a new tip for warming up FP8_PB_WO checkpoints using sglang.compile_deep_gemm.
  • New Linear Methods and Parameter Types: Two new linear methods, ModelOptFp8PcPtLinearMethod and ModelOptFp8PbWoLinearMethod, have been introduced to handle the specific characteristics of these new FP8 variants. Additionally, BlockQuantScaleParameter and ChannelQuantScaleParameter were added for managing different scaling types (see the dispatch sketch after this list).
  • Enhanced Unit Test Coverage: New unit tests have been added to cover the configuration parsing and strict algorithm matching for the new FP8 variants, ensuring robustness without requiring actual checkpoints in CI.
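
A rough sketch of how that routing could look, using the class names from the highlights above (the surrounding structure and the assumed import path are illustrative, not the actual SGLang code, which may dispatch inside the config's own method selection):

    # Assumed module path; names are taken from the highlights above.
    from sglang.srt.layers.quantization.modelopt_quant import (
        ModelOptFp8LinearMethod,
        ModelOptFp8PbWoLinearMethod,
        ModelOptFp8PcPtLinearMethod,
    )

    def linear_method_for_algo(config, algo: str):
        if algo == "FP8_PER_CHANNEL_PER_TOKEN":
            return ModelOptFp8PcPtLinearMethod(config)  # per-channel weight, per-token activation
        if algo == "FP8_PB_WO":
            return ModelOptFp8PbWoLinearMethod(config)  # 2D blockwise FP8 weight-only
        return ModelOptFp8LinearMethod(config)          # existing per-tensor path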

@gemini-code-assist bot left a comment (Contributor)

Code Review

This pull request adds support for two new Nvidia ModelOpt FP8 quantization variants, fp8_pc_pt and fp8_pb_wo. The changes are well-implemented, making the quantization logic more robust and explicit by strictly checking for supported algorithms. The introduction of new LinearMethod classes for each variant is a clean design choice. The PR also includes corresponding unit tests and documentation updates, which is excellent. I have one minor suggestion to improve the clarity of the documentation.

Signed-off-by: CedricHuang <cedrichgw@gmail.com>

Development

Successfully merging this pull request may close these issues.

[Feature] Support NVIDIA ModelOpt-exported HF FP8 checkpoints (FP8_PER_CHANNEL_PER_TOKEN / FP8_PB_WO) in SGLang
