
Commit e3a4628

Merge branch 'main' into jiangs/1.1.0rc4/fix_sm89_fp8bmm

2 parents 6eed0cc + dd9627d


47 files changed: +670 -293 lines

cpp/include/tensorrt_llm/batch_manager/logitsPostProcessor.h

Lines changed: 1 addition & 1 deletion
@@ -47,7 +47,7 @@ class LogitsPostProcessor : Algorithm

     bool operator()(DecoderInputBuffers& inputBuffers, bool replicateLogitsPostProcessor,
         runtime::WorldConfig const& worldConfig, CudaStreamPtr const& stream,
-        std::optional<LogitsPostProcessorBatched> logitsPostProcessorBatched = std::nullopt) const;
+        std::optional<LogitsPostProcessorBatched> const& logitsPostProcessorBatched = std::nullopt) const;
 };

 } // namespace tensorrt_llm::batch_manager

cpp/tensorrt_llm/batch_manager/logitsPostProcessor.cpp

Lines changed: 1 addition & 1 deletion
@@ -34,7 +34,7 @@ using SizeType32 = tensorrt_llm::runtime::SizeType32;

 bool LogitsPostProcessor::operator()(DecoderInputBuffers& inputBuffers, bool replicateLogitsPostProcessor,
     tr::WorldConfig const& worldConfig, CudaStreamPtr const& stream,
-    std::optional<LogitsPostProcessorBatched> logitsPostProcessorBatched) const
+    std::optional<LogitsPostProcessorBatched> const& logitsPostProcessorBatched) const
 {
     TLLM_LOG_TRACE("%s start", __PRETTY_FUNCTION__);
     NVTX3_SCOPED_RANGE(LogitsPostProcessor);

docs/source/commands/trtllm-serve/trtllm-serve.rst

Lines changed: 34 additions & 30 deletions
@@ -201,56 +201,60 @@ Metrics Endpoint

 .. note::

-    This endpoint is beta maturity.
+    The metrics endpoint for the default PyTorch backend are in beta and are not as comprehensive as those for the TensorRT backend.

-    The statistics for the PyTorch backend are beta and not as comprehensive as those for the TensorRT backend.
+    Some fields, such as CPU memory usage, are not yet available for the PyTorch backend.

-    Some fields, such as CPU memory usage, are not available for the PyTorch backend.
+    Enabling ``enable_iter_perf_stats`` in the PyTorch backend can slightly impact performance, depending on the serving configuration.

-    Enabling ``enable_iter_perf_stats`` in the PyTorch backend can impact performance slightly, depending on the serving configuration.
+The ``/metrics`` endpoint provides runtime iteration statistics such as GPU memory usage and KV cache details.

-The ``/metrics`` endpoint provides runtime-iteration statistics such as GPU memory use and inflight-batching details.
-For the TensorRT backend, these statistics are enabled by default.
-However, for the PyTorch backend, you must explicitly enable iteration statistics logging by setting the `enable_iter_perf_stats` field in a YAML configuration file as shown in the following example:
+For the default PyTorch backend, iteration statistics logging is enabled by setting the ``enable_iter_perf_stats`` field in a YAML file:

 .. code-block:: yaml

-    # extra-llm-api-config.yml
-    pytorch_backend_config:
-      enable_iter_perf_stats: true
+    # extra_llm_config.yaml
+    enable_iter_perf_stats: true

-Then start the server and specify the ``--extra_llm_api_options`` argument with the path to the YAML file as shown in the following example:
+Start the server and specify the ``--extra_llm_api_options`` argument with the path to the YAML file:

 .. code-block:: bash

-    trtllm-serve <model> \
-      --extra_llm_api_options <path-to-extra-llm-api-config.yml> \
-      [--tp_size <tp> --pp_size <pp> --ep_size <ep> --host <host> --port <port>]
+    trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --extra_llm_api_options extra_llm_config.yaml

-After at least one inference request is sent to the server, you can fetch the runtime-iteration statistics by polling the `/metrics` endpoint:
+After sending at least one inference request to the server, you can fetch runtime iteration statistics by polling the ``/metrics`` endpoint.
+Since the statistics are stored in an internal queue and removed once retrieved, it's recommended to poll the endpoint shortly after each request and store the results if needed.

 .. code-block:: bash

-    curl -X GET http://<host>:<port>/metrics
+    curl -X GET http://localhost:8000/metrics

-*Example Output*
+Example output:

 .. code-block:: json

-    [
-      {
-        "gpuMemUsage": 56401920000,
-        "inflightBatchingStats": {
+    [
+      {
+        "gpuMemUsage": 76665782272,
+        "iter": 154,
+        "iterLatencyMS": 7.00688362121582,
+        "kvCacheStats": {
+          "allocNewBlocks": 3126,
+          "allocTotalBlocks": 3126,
+          "cacheHitRate": 0.00128,
+          "freeNumBlocks": 101253,
+          "maxNumBlocks": 101256,
+          "missedBlocks": 3121,
+          "reusedBlocks": 4,
+          "tokensPerBlock": 32,
+          "usedNumBlocks": 3
+        },
+        "numActiveRequests": 1
         ...
-      },
-        "iter": 1,
-        "iterLatencyMS": 16.505143404006958,
-        "kvCacheStats": {
-          ...
-        },
-        "newActiveRequestsQueueLatencyMS": 0.0007503032684326172
-      }
-    ]
+      }
+    ]
+

 Syntax
 ------
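For reference (not part of this diff), a minimal Python sketch of the polling pattern described in the documentation change above; it assumes a `trtllm-serve` instance is already listening on `localhost:8000` with the OpenAI-compatible `/v1/completions` route, and that the `requests` package is installed. Model name and prompt are placeholders.

```python
# Minimal sketch: send one completion request, then immediately poll /metrics.
# Assumes trtllm-serve is already running at localhost:8000 with
# enable_iter_perf_stats turned on; model name and prompt are placeholders.
import requests

BASE = "http://localhost:8000"

# Trigger at least one inference so iteration statistics are produced.
requests.post(
    f"{BASE}/v1/completions",
    json={
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "prompt": "Hello, my name is",
        "max_tokens": 16,
    },
    timeout=60,
).raise_for_status()

# Stats sit in an internal queue and are dropped once retrieved,
# so fetch them right after the request and keep whatever you need.
stats = requests.get(f"{BASE}/metrics", timeout=10).json()
for iteration in stats:
    print(iteration.get("iter"), iteration.get("iterLatencyMS"))
```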

docs/source/deployment-guide/quick-start-recipe-for-gpt-oss-on-trtllm.md

Lines changed: 22 additions & 0 deletions
@@ -234,6 +234,28 @@ TODO: Use Chat Compeletions API / Responses API as the example after the PR is m
 We use OpenAI's official evaluation tool to test the model's accuracy. For more information see [https://github.com/openai/gpt-oss/tree/main/gpt_oss/evals](gpt-oss-eval).
 With the added support of Chat Completions and Responses API in `trtllm-serve,` `gpt_oss.evals` works directly without any modifications.

+You need to set `enable_attention_dp`, `tp_size`, `ep_size`, `max_batch_size` and `max_num_tokens` when launching the trtllm server and set `reasoning-effort` when launching evaluation in gpt-oss. Below are some reference configurations for accuracy evaluation on B200.
+
+| **reasoning-effort** | **parallel configuration** | **max_batch_size** | **max_num_tokens** |
+|:--------------------:|:--------------------------:|:------------------:|:------------------:|
+| low/medium           | DEP8 / DEP4                | 128                | 32768              |
+| high                 | DEP8 / DEP4                | 2                  | 133120             |
+| low/medium           | TP8 / TP4                  | 1024               | 32768              |
+| high                 | TP8 / TP4                  | 720                | 133120             |
+
+Below is an example command for evaluating the accuracy of gpt-oss-120b with low and medium reasoning-effort on GPQA and AIME2025.
+
+```shell
+# execute this command in gpt-oss
+python -m gpt_oss.evals \
+    --sampler chat_completions \
+    --eval gpqa,aime25 \
+    --model gpt-oss-120b \
+    --reasoning-effort low,medium
+```
+
+
 ## Benchmarking Performance

 To benchmark the performance of your TensorRT-LLM server you can leverage the built-in `benchmark_serving.py` script. To do this first creating a wrapper `bench.sh` script.
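For reference (not part of this diff), a hypothetical Python launch sketch for the low/medium, DEP8 row of the table above. The CLI flags mirror `trtllm-serve` usage shown elsewhere in this commit, and routing `enable_attention_dp` through the extra LLM API YAML file is an assumption to verify against your version.

```python
# Hypothetical launch helper for the "low/medium, DEP8" configuration above;
# flag names follow trtllm-serve usage in this commit, but verify against your install.
import subprocess
from pathlib import Path

# Assumption: enable_attention_dp is supplied via the extra LLM API YAML file.
extra_cfg = Path("extra_llm_config.yaml")
extra_cfg.write_text("enable_attention_dp: true\n")

subprocess.run(
    [
        "trtllm-serve", "openai/gpt-oss-120b",
        "--tp_size", "8",
        "--ep_size", "8",
        "--max_batch_size", "128",
        "--max_num_tokens", "32768",
        "--extra_llm_api_options", str(extra_cfg),
    ],
    check=True,
)
```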

docs/source/legacy/tensorrt_quickstart.md

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 # LLM API with TensorRT Engine
 A simple inference example with TinyLlama using the LLM API:

-```{literalinclude} ../../examples/llm-api/_tensorrt_engine/quickstart_example.py
+```{literalinclude} ../../../examples/llm-api/_tensorrt_engine/quickstart_example.py
 :language: python
 :linenos:
 ```

examples/llm-api/_tensorrt_engine/quickstart_example.py

Lines changed: 8 additions & 2 deletions
@@ -1,11 +1,17 @@
-from tensorrt_llm import LLM, SamplingParams
+from tensorrt_llm import BuildConfig, SamplingParams
+from tensorrt_llm._tensorrt_engine import LLM  # NOTE the change


 def main():

+    build_config = BuildConfig()
+    build_config.max_batch_size = 256
+    build_config.max_num_tokens = 1024
+
     # Model could accept HF model name, a path to local HF model,
     # or TensorRT Model Optimizer's quantized checkpoints like nvidia/Llama-3.1-8B-Instruct-FP8 on HF.
-    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
+    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
+              build_config=build_config)

     # Sample prompts.
     prompts = [
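For reference (not part of this diff), a minimal end-to-end sketch of how the updated example typically continues, following the standard LLM API pattern; the prompts and sampling settings below are illustrative only.

```python
# Illustrative sketch built on the imports and BuildConfig shown in the hunk above.
from tensorrt_llm import BuildConfig, SamplingParams
from tensorrt_llm._tensorrt_engine import LLM

build_config = BuildConfig()
build_config.max_batch_size = 256
build_config.max_num_tokens = 1024

# Build a TensorRT engine for TinyLlama with the capped batch/token limits.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
          build_config=build_config)

# Placeholder prompts and sampling settings.
prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Generate a completion for every prompt and print it.
for output in llm.generate(prompts, sampling_params):
    print(f"Prompt: {output.prompt!r}, Generated: {output.outputs[0].text!r}")
```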

examples/llm-api/llm_mgmn_trtllm_bench.sh

Lines changed: 1 addition & 0 deletions
@@ -76,6 +76,7 @@ srun -l \

     # This is optional
     cat > /tmp/pytorch_extra_args.txt << EOF
+cuda_graph_config: null
 print_iter_log: true
 enable_attention_dp: false
 EOF

jenkins/L0_Test.groovy

Lines changed: 5 additions & 5 deletions
@@ -364,7 +364,7 @@ def runLLMTestlistOnSlurm(pipeline, platform, testList, config=VANILLA_CONFIG, p
             // Wait 10 minutes to check status of the node again
             sleep(time: 10, unit: 'MINUTES')
             // Avoid the node being stuck in the held state.
-            Utils.exec(pipeline, Utils.sshUserCmd(remote, "\"scontrol release ${slurmJobID} || true\""))
+            Utils.exec(pipeline, script: Utils.sshUserCmd(remote, "\"scontrol release ${slurmJobID} || true\""), numRetries: 3)
             counter++
         }
     }
@@ -1805,7 +1805,7 @@ def runLLMBuild(pipeline, cpu_arch, reinstall_dependencies=false, wheel_path="",
         if (env.alternativeTRT) {
             trtllm_utils.replaceWithAlternativeTRT(env.alternativeTRT, cpver)
         }
-        buildArgs = "--clean"
+        buildArgs = "--clean --nixl_root /opt/nvidia/nvda_nixl"
         if (cpu_arch == AARCH64_TRIPLE) {
             buildArgs += " -a '90-real;100-real;120-real'"
         }
@@ -2040,9 +2040,9 @@ def launchTestJobs(pipeline, testFilter)
         "DGX_H200-4_GPUs-TensorRT-Post-Merge-1": ["dgx-h200-x4", "l0_dgx_h200", 1, 3, 4],
         "DGX_H200-4_GPUs-TensorRT-Post-Merge-2": ["dgx-h200-x4", "l0_dgx_h200", 2, 3, 4],
         "DGX_H200-4_GPUs-TensorRT-Post-Merge-3": ["dgx-h200-x4", "l0_dgx_h200", 3, 3, 4],
-        //"RTXPro6000-Pytorch-Post-Merge-1": ["rtx-pro-6000", "l0_rtx_pro_6000", 1, 1],
-        //"RTXPro6000-4_GPUs-Pytorch-Post-Merge-1": ["rtx-pro-6000-x4", "l0_rtx_pro_6000", 1, 2, 4],
-        //"RTXPro6000-4_GPUs-Pytorch-Post-Merge-2": ["rtx-pro-6000-x4", "l0_rtx_pro_6000", 2, 2, 4],
+        "RTXPro6000-PyTorch-Post-Merge-1": ["rtx-pro-6000", "l0_rtx_pro_6000", 1, 1],
+        "RTXPro6000-4_GPUs-PyTorch-Post-Merge-1": ["rtx-pro-6000-x4", "l0_rtx_pro_6000", 1, 2, 4],
+        "RTXPro6000-4_GPUs-PyTorch-Post-Merge-2": ["rtx-pro-6000-x4", "l0_rtx_pro_6000", 2, 2, 4],
     ]

     parallelJobs = x86TestConfigs.collectEntries{key, values -> [key, [createKubernetesPodConfig(LLM_DOCKER_IMAGE, values[0], "amd64", values[4] ?: 1, key.contains("Perf")), {

tensorrt_llm/_torch/attention_backend/flashinfer.py

Lines changed: 2 additions & 1 deletion
@@ -170,7 +170,8 @@ def __post_init__(self) -> None:
     def create_cuda_graph_metadata(self,
                                    max_batch_size: int,
                                    sub_cross_metadata: bool = False,
-                                   max_draft_tokens: int = 0) -> Self:
+                                   max_draft_tokens: int = 0,
+                                   buffers=None) -> Self:
         metadata = super().create_cuda_graph_metadata(max_batch_size,
                                                       sub_cross_metadata,
                                                       max_draft_tokens)

tensorrt_llm/_torch/attention_backend/interface.py

Lines changed: 4 additions & 1 deletion
@@ -140,6 +140,7 @@ class AttentionMetadata:

     # This buffer is currently only used for TrtllmAttentionMetadata.
     cache_indirection: Optional[torch.Tensor] = None
+    cuda_graph_buffers: dict[str, list[torch.Tensor]] = None

     _saved_tensors: Dict[str, torch.Tensor] = field(init=False,
                                                     default_factory=dict)
@@ -288,7 +289,8 @@ def prepare(self):
     def create_cuda_graph_metadata(self,
                                    max_batch_size: int,
                                    sub_cross_metadata: bool = False,
-                                   max_draft_tokens: int = 0) -> Self:
+                                   max_draft_tokens: int = 0,
+                                   buffers=None) -> Self:
         """
         Creates metadata for CUDA graph execution.
         CUDA graphs require to use pre-allocated buffers for all tensors in fields.
@@ -300,6 +302,7 @@ def create_cuda_graph_metadata(self,

         cuda_graph_metadata = copy.copy(self)
         cuda_graph_metadata.is_cuda_graph = True
+        cuda_graph_metadata.cuda_graph_buffers = buffers
         if self.has_cross_sub_metadata:
             cuda_graph_metadata.cross = cuda_graph_metadata.cross.create_cuda_graph_metadata(
                 max_batch_size, True)
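To illustrate the `buffers` parameter threaded through the two attention-backend files above, a minimal caller-side sketch; the `metadata` object, buffer key, and tensor shape are hypothetical, and only the `buffers=` argument and the `cuda_graph_buffers` field come from this commit.

```python
# Hypothetical caller-side sketch; `metadata` stands in for an already-constructed
# AttentionMetadata subclass instance, and the buffer key/shape are illustrative only.
import torch

# Pre-allocate tensors once so repeated CUDA graph captures reuse the same storage.
shared_buffers: dict[str, list[torch.Tensor]] = {
    "kv_lens_cuda": [torch.zeros(256, dtype=torch.int32, device="cuda")],
}

graph_metadata = metadata.create_cuda_graph_metadata(
    max_batch_size=256,
    buffers=shared_buffers,  # new optional argument introduced by this change
)
# The shallow-copied metadata now carries the shared buffer pool.
assert graph_metadata.cuda_graph_buffers is shared_buffers
```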
