
[NV] dsr1 fp8 b200 trt agg mtp update #632

Open

camiloamoreno wants to merge 13 commits into main from nv/dsr1-fp8-b200-trt-agg-mtp-260203

Conversation


camiloamoreno (Collaborator) commented Feb 5, 2026

  • Update to the latest TRTLLM 1.2 release container.
  • Fine-tune the choice of parallelism in the nvidia master (use TP only for most points; see the sketch below).
  • Enable the latest optimizations offered by TRTLLM.

For most of the tests we switch to the TRTLLM backend for best performance.
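As a rough illustration of the TP-only, TRTLLM-backend shape of these points, here is a minimal sketch of the TRT-LLM YAML options involved. The key names follow TRT-LLM's public LLM API config, but the exact keys and values used in this PR are assumptions:

```yaml
# Illustrative sketch only -- not this PR's actual config.
tensor_parallel_size: 8   # "TP only for most points"; the degree is assumed
moe_config:
  backend: TRTLLM         # TRTLLM MoE backend, used for most tests
```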

As in the non-MTP FP8 agg config, we use Piecewise CUDA Graphs (https://nvidia.github.io/TensorRT-LLM/features/torch_compile_and_piecewise_cuda_graph.html), which let some components execute through CUDA graphs while others run eagerly, capturing most of the benefit at lower overhead. We prepare the YAML configuration as per the documentation, including a "capture_num_tokens" list derived in part from MAX_NUM_TOKENS. We still exclude a few narrow-concurrency scenarios for performance reasons; we are working to improve this and will update the config once that is done.
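A sketch of what the piecewise-CUDA-graph section of that YAML might look like, following the linked documentation; the token buckets below are placeholder values, not the ones shipped here:

```yaml
# Illustrative sketch only -- token values are placeholders.
torch_compile_config:
  enable_piecewise_cuda_graph: true
  # Token-count buckets captured as CUDA graphs; derived partly from
  # MAX_NUM_TOKENS so the largest bucket matches the scheduler limit.
  capture_num_tokens: [64, 128, 256, 512, 1024, 2048]
```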

For some of the higher-concurrency points we use data-parallel attention through the DEEPGEMM MOE backend. This backend requires a few optimizations that differ from the TRTLLM backend, as can be seen in lines 33-43. In particular, the ENABLE_CONFIGURABLE_MOE flag lets DEEPGEMM use the MOE backend from the latest 1.3 code tree for its improved communication performance.
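A hedged sketch of the kind of settings this paragraph describes; the keys mirror TRT-LLM's public config surface, but their exact names and values for this PR are assumptions:

```yaml
# Illustrative sketch only -- not the PR's exact config.
enable_attention_dp: true   # data-parallel attention for high-concurrency points
moe_config:
  backend: DEEPGEMM         # DeepGEMM MoE kernels instead of TRTLLM
# ENABLE_CONFIGURABLE_MOE=1 is set in the environment (per the PR text)
# so DEEPGEMM picks up the configurable MOE backend from the 1.3 tree.
```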

@Oseltamivir (Collaborator)

Thanks for the PR; please append the modifications to perf-changelog.yaml for a sweep.

camiloamoreno changed the title from "[WIP] [NV] dsr1 fp8 b200 trt agg mtp update" to "[NV] dsr1 fp8 b200 trt agg mtp update" on Feb 5, 2026