
[NV] dsr1 fp4 b200 trt agg mtp update#642

Open
camiloamoreno wants to merge 4 commits into main from
nv/dsr1-fp4-b200-trt-agg-mtp-260204

Conversation


@camiloamoreno camiloamoreno commented Feb 5, 2026

This PR contains the following updates:

  • Update to the latest TRTLLM 1.2 release container, the recent rc6.post3
  • Fine-tune the choice of parallelism in nvidia-master (reduce overlapping TP8/TP4)
  • Enable the piecewise CUDA graph optimization in specific cases

Near the top of the benchmark script (L26-35), we enable specific optimizations, mainly differentiating between cases with and without DP attention, including the choice of MTP aggressiveness.
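That per-case branching can be sketched roughly as below. This is an illustration, not the script's actual code: DP_ATTENTION and MTP_NEXTN are assumed variable names, and the draft-token counts are placeholder values.

```shell
#!/bin/sh
# Hypothetical sketch: pick MTP aggressiveness (number of draft tokens)
# depending on whether DP attention is enabled. Values are illustrative.
DP_ATTENTION=${DP_ATTENTION:-false}
if [ "$DP_ATTENTION" = "true" ]; then
  MTP_NEXTN=1   # speculate conservatively when DP attention is on
else
  MTP_NEXTN=3   # speculate more aggressively otherwise
fi
echo "MTP draft tokens: $MTP_NEXTN"
```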

As in the non-MTP FP4 agg version, we use piecewise CUDA graphs (https://nvidia.github.io/TensorRT-LLM/features/torch_compile_and_piecewise_cuda_graph.html), which let some components execute through CUDA graphs while others run eagerly, keeping the benefit at lower overhead. We use the formula from the documentation to generate a capture_num_tokens list based on MAX_NUM_TOKENS.
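A sketch of how such a capture list can be generated in shell; the 256-token step, default MAX_NUM_TOKENS, and the tr/sed joining are assumptions rather than the script's exact code. The boundary handling matches what the review notes: MAX_NUM_TOKENS is appended only when it is not already a multiple of the step, avoiding a duplicate.

```shell
#!/bin/sh
# Sketch: build a comma-separated capture_num_tokens list in fixed steps
# up to MAX_NUM_TOKENS, appending MAX_NUM_TOKENS itself only when it is
# not already a multiple of the step (avoids a duplicate at the boundary).
MAX_NUM_TOKENS=${MAX_NUM_TOKENS:-3000}
STEP=256
CAPTURE_NUM_TOKENS=$(seq "$STEP" "$STEP" "$MAX_NUM_TOKENS" | tr '\n' ',' | sed 's/,$//')
if [ $((MAX_NUM_TOKENS % STEP)) -ne 0 ]; then
  CAPTURE_NUM_TOKENS="${CAPTURE_NUM_TOKENS},${MAX_NUM_TOKENS}"
fi
echo "$CAPTURE_NUM_TOKENS"
```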

"cuda graph max batch size" is optimized to match CONC as a natural limit, but reduced to batch_size/4 when DP_ATTENTION is enabled.

@camiloamoreno camiloamoreno requested a review from a team as a code owner February 5, 2026 01:35
@camiloamoreno camiloamoreno marked this pull request as draft February 5, 2026 04:52
@camiloamoreno camiloamoreno changed the title [WIP] [NV] dsr1 fp4 b200 trt agg mtp update [NV] dsr1 fp4 b200 trt agg mtp update Feb 5, 2026
@camiloamoreno camiloamoreno marked this pull request as ready for review February 5, 2026 23:49

claude bot commented Feb 5, 2026

Claude finished @camiloamoreno's task in 1m 52s.


PR Review: [NV] dsr1 fp4 b200 trt agg mtp update

  • Read changed files and gather context
  • Validate master config (model prefixes, image accessibility)
  • Validate perf-changelog presence
  • Validate benchmark script (code style, MTP requirements)
  • Check for bugs and logic errors
  • Post review findings

LGTM — no blocking issues found.

Validation checklist:

  • Model prefix: dsr1 — valid
  • Image: nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc6.post3 — public NGC image, valid
  • perf-changelog.yaml: Updated with matching PR link and description
  • --use-chat-template: Present in benchmark client (line 132)
  • Server launch formatting: Arguments on separate lines with \ continuations
  • Piecewise CUDA graphs logic: Capture token list generation correctly implements the documented formula; seq/modulo logic correctly avoids duplicates at the boundary
  • MAX_BATCH_SIZE: Correctly bounded to CONC/4 with DP attention, CONC otherwise; consistently used in both the YAML config and the server launch args


@kedarpotdar-nv kedarpotdar-nv left a comment


LGTM. This should get picked up after the current GPTOSS run is complete.
