
Conversation

@CedricHwong commented on Dec 19, 2025

Fixes #15467

Motivation

ModelOpt-exported HuggingFace checkpoints encode the FP8 flavor via hf_quant_config.json
(quantization.quant_algo). We’ve seen two ModelOpt FP8 variants show up in real PTQ workflows and serving
deployments on Hopper (H200):

  • FP8_PER_CHANNEL_PER_TOKEN (fp8_pc_pt)
  • FP8_PB_WO (fp8_pb_wo)

SGLang’s existing ModelOpt path did not consistently recognize/route these variants, which could cause load-time failures or wrong quant method selection. This PR makes those variants first-class and keeps the matching strict
to avoid accidentally treating other FP8 formats as ModelOpt.
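
For context, a minimal sketch of how such a checkpoint identifies its flavor (only the quantization.quant_algo field comes from the checkpoint format described above; the helper name and everything else here is illustrative, not SGLang's actual loader code):

    import json
    from pathlib import Path

    def read_modelopt_quant_algo(checkpoint_dir: str):
        """Return the quant_algo string from hf_quant_config.json
        (e.g. "FP8_PER_CHANNEL_PER_TOKEN" or "FP8_PB_WO"), or None if absent."""
        cfg_path = Path(checkpoint_dir) / "hf_quant_config.json"
        if not cfg_path.is_file():
            return None
        with cfg_path.open() as f:
            cfg = json.load(f)
        # ModelOpt exports nest the algorithm under "quantization".
        return cfg.get("quantization", {}).get("quant_algo")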

Modifications

  • Quantization:
    • Add explicit support for ModelOpt HF FP8 variants:
      • FP8_PER_CHANNEL_PER_TOKEN (per-channel weight scale, per-token activation scaling)
      • FP8_PB_WO (blockwise FP8 weight-only)
    • Normalize quant_algo parsing (case-insensitive) and error out on unknown FP8 algos instead of accepting any FP8* string (a minimal sketch follows this list).
  • Tests:
    • Add unit coverage for config parsing + strict algo matching/normalization without requiring real checkpoints in CI.
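
A minimal sketch of the strict, case-insensitive matching described above (the helper name, the contents of the supported set beyond the two new variants, and the exact error type are illustrative, not necessarily the code in this PR):

    # Illustrative sketch; names need not match the actual SGLang implementation.
    _SUPPORTED_MODELOPT_FP8_ALGOS = {
        "FP8",                        # existing per-tensor path
        "FP8_PER_CHANNEL_PER_TOKEN",  # fp8_pc_pt
        "FP8_PB_WO",                  # fp8_pb_wo (blockwise weight-only)
    }

    def normalize_modelopt_fp8_algo(quant_algo: str) -> str:
        algo = quant_algo.strip().upper()
        if algo not in _SUPPORTED_MODELOPT_FP8_ALGOS:
            # Fail loudly instead of treating any "FP8*" string as ModelOpt.
            raise ValueError(f"Unsupported ModelOpt FP8 quant_algo: {quant_algo!r}")
        return algo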

Accuracy Tests

Unit tests:

  • PYTHONPATH=python pytest -q test/srt/test_modelopt_loader.py -k "TestModelOptFp8ConfigVariants or test_engine_with_modelopt_quant_cli_argument"
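
As an illustration of what that coverage exercises (the config class, import path, and raised exception below are assumptions modeled on SGLang's existing ModelOpt quantization config, not the literal contents of the test file):

    import pytest

    # Assumed module path for the ModelOpt FP8 config.
    from sglang.srt.layers.quantization.modelopt_quant import ModelOptFp8Config

    def test_rejects_unknown_fp8_algo():
        hf_quant_config = {"quantization": {"quant_algo": "FP8_SOMETHING_ELSE"}}
        with pytest.raises(ValueError):
            ModelOptFp8Config.from_config(hf_quant_config)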

Benchmarking and Profiling

Not included. This change primarily enables and dispatches support for additional checkpoint formats; performance evaluation can be done as a follow-up once merged (I can provide H200 numbers if desired).

Checklist

  • Format your code according to the Format code with pre-commit.
  • Add unit tests according to the Run and add unit tests.
  • Update documentation according to Write documentations.
  • Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
  • Follow the SGLang code style guidance.
  • Work with maintainers to merge your PR.

The github-actions bot added the documentation (Improvements or additions to documentation) and quant (LLM Quantization) labels on Dec 19, 2025.
@gemini-code-assist bot commented (Contributor)

Summary of Changes

Hello @CedricHwong, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances SGLang's ModelOpt integration by introducing explicit support for two prevalent HuggingFace FP8 quantization variants: FP8_PER_CHANNEL_PER_TOKEN and FP8_PB_WO. These variants are crucial for real-world PTQ workflows and optimal performance on NVIDIA Hopper GPUs. The changes ensure that SGLang correctly identifies and routes these specific quantization formats, thereby preventing potential load-time failures or misapplication of quantization methods. The update also includes robust error handling for unknown FP8 algorithms, comprehensive documentation, and dedicated unit tests to validate the new parsing logic.

Highlights

  • New ModelOpt FP8 Variants Support: Explicit support has been added for two new ModelOpt HuggingFace FP8 quantization variants: FP8_PER_CHANNEL_PER_TOKEN (per-channel weight scale, per-token activation scaling) and FP8_PB_WO (2D blockwise FP8 weight-only).
  • Strict quant_algo Parsing: The parsing of quant_algo values is now case-insensitive and enforces strict matching. The system will now error out if an unknown FP8 algorithm is encountered, preventing incorrect quantization method selection.
  • Updated Documentation: The documentation has been updated to include details on the newly supported quant_algo values for ModelOpt FP8 and provides a new tip for warming up FP8_PB_WO checkpoints using sglang.compile_deep_gemm.
  • New Linear Methods and Parameter Types: Two new linear methods, ModelOptFp8PcPtLinearMethod and ModelOptFp8PbWoLinearMethod, have been introduced to handle the specific characteristics of these new FP8 variants. Additionally, BlockQuantScaleParameter and ChannelQuantScaleParameter were added for managing different scaling types (see the dispatch sketch after this list).
  • Enhanced Unit Test Coverage: New unit tests have been added to cover the configuration parsing and strict algorithm matching for the new FP8 variants, ensuring robustness without requiring actual checkpoints in CI.
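
A rough sketch of how that routing could look, using the class names from the highlights above (the surrounding structure and the assumed import path are illustrative, not the actual SGLang code, which may dispatch inside the config's own method selection):

    # Assumed module path; names are taken from the highlights above.
    from sglang.srt.layers.quantization.modelopt_quant import (
        ModelOptFp8LinearMethod,
        ModelOptFp8PbWoLinearMethod,
        ModelOptFp8PcPtLinearMethod,
    )

    def linear_method_for_algo(config, algo: str):
        if algo == "FP8_PER_CHANNEL_PER_TOKEN":
            return ModelOptFp8PcPtLinearMethod(config)  # per-channel weight, per-token activation
        if algo == "FP8_PB_WO":
            return ModelOptFp8PbWoLinearMethod(config)  # 2D blockwise FP8 weight-only
        return ModelOptFp8LinearMethod(config)          # existing per-tensor path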

@gemini-code-assist bot left a comment (Contributor)

Code Review

This pull request adds support for two new Nvidia ModelOpt FP8 quantization variants, fp8_pc_pt and fp8_pb_wo. The changes are well-implemented, making the quantization logic more robust and explicit by strictly checking for supported algorithms. The introduction of new LinearMethod classes for each variant is a clean design choice. The PR also includes corresponding unit tests and documentation updates, which is excellent. I have one minor suggestion to improve the clarity of the documentation.

Signed-off-by: CedricHuang <cedrichgw@gmail.com>

Development

Successfully merging this pull request may close these issues.

[Feature] Support NVIDIA ModelOpt-exported HF FP8 checkpoints (FP8_PER_CHANNEL_PER_TOKEN / FP8_PB_WO) in SGLang
