[Ascend]adapt enable-profile-cuda-graph for NPU #12617
Conversation
# Conflicts:
#	python/sglang/srt/model_executor/npu_graph_runner.py
Summary of Changes (gemini-code-assist): This pull request introduces comprehensive profiling capabilities for Ascend NPUs during CUDA graph capture.
Code Review
This pull request successfully adapts the enable-profile-cuda-graph feature for Ascend NPU by refactoring the profiling logic in CudaGraphRunner and implementing an NPU-specific version in NPUGraphRunner. The code is well-structured, using inheritance to cleanly separate the CUDA and NPU implementations. I have one minor suggestion to simplify the directory creation logic.
```python
if not Path(output_dir).exists():
    Path(output_dir).mkdir(parents=True, exist_ok=True)
```
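Presumably the simplification is to drop the redundant `exists()` check, since `mkdir(parents=True, exist_ok=True)` already tolerates an existing directory; a minimal sketch of the suggested form:

```python
from pathlib import Path

# mkdir(exist_ok=True) is a no-op when the directory already exists,
# so no separate exists() check is needed.
Path(output_dir).mkdir(parents=True, exist_ok=True)
```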
Self-verification:
[2025-10-24 15:28:05] Load weight end. type=Qwen3ForCausalLM, dtype=torch.bfloat16, avail mem=63.11 GB, mem usage=15.44 GB.
[2025-10-24 15:28:05] Using KV cache dtype: torch.bfloat16
[2025-10-24 15:28:05] KV Cache is allocated. #tokens: 365177, K size: 25.07 GB, V size: 25.07 GB
[2025-10-24 15:28:05] Memory pool end. avail mem=12.26 GB
[2025-10-24 15:28:05] Capture cuda graph begin. This can take up to several minutes. avail mem=12.16 GB
[2025-10-24 15:28:05] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256]
Capturing batches (bs=256 avail_mem=11.93 GB): 0%| | 0/36 [00:00<?, ?it/s][2025-10-24 15:28:05] MOE_A2A_BACKEND is not initialized, using default backend
Capturing batches (bs=1 avail_mem=11.47 GB): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 36/36 [00:04<00:00, 8.12it/s]
[2025-10-24 15:29:36] Sorted by CUDA Time:
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ --------------------------------------------------------------------------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls Input Shapes
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ --------------------------------------------------------------------------------
void cutlass::device_kernel<flash::enable_sm90_or_la... 0.00% 0.000us 0.00% 0.000us 0.000us 94.192ms 13.97% 94.192ms 36.340us 2592 []
nvjet_tst_192x128_64x5_1x2_h_bz_coopB_TNT 0.00% 0.000us 0.00% 0.000us 0.000us 28.623ms 4.25% 28.623ms 79.508us 360 []
void flashinfer::norm::RMSNormKernel<8u, __nv_bfloat... 0.00% 0.000us 0.00% 0.000us 0.000us 27.285ms 4.05% 27.285ms 5.191us 5256 []
void at::native::index_elementwise_kernel<128, 4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 24.492ms 3.63% 24.492ms 4.725us 5184 []
nvjet_tst_192x8_64x8_2x1_v_bz_TNT 0.00% 0.000us 0.00% 0.000us 0.000us 20.263ms 3.01% 20.263ms 70.357us 288 []
void flashinfer::norm::FusedAddRMSNormKernel<8u, __n... 0.00% 0.000us 0.00% 0.000us 0.000us 19.175ms 2.84% 19.175ms 3.699us 5184 []
cudaStreamIsCapturing 0.44% 11.692ms 0.44% 11.692ms 0.374us 17.814ms 2.64% 17.814ms 0.569us 31291 []
void at::native::elementwise_kernel<128, 4, at::nati... 0.00% 0.000us 0.00% 0.000us 0.000us 17.299ms 2.57% 17.299ms 3.432us 5040 []
nvjet_tst_64x8_64x16_2x1_v_bz_splitK_TNT 0.00% 0.000us 0.00% 0.000us 0.000us 14.706ms 2.18% 14.706ms 25.531us 576 []
nvjet_tst_128x64_64x8_1x2_h_bz_TNT 0.00% 0.000us 0.00% 0.000us 0.000us 14.400ms 2.14% 14.400ms 40.000us 360 []
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ --------------------------------------------------------------------------------
Self CPU time total: 2.656s
Self CUDA time total: 674.143ms
Sorted by CPU Time:
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ --------------------------------------------------------------------------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls Input Shapes
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ --------------------------------------------------------------------------------
aten::empty 10.32% 273.991ms 11.11% 294.971ms 9.461us 0.000us 0.00% 0.000us 0.000us 31178 [[], [], [], [], [], []]
cudaLaunchKernel 4.98% 132.193ms 8.36% 222.079ms 7.232us 0.000us 0.00% 786.117us 0.026us 30708 []
cudaGraphInstantiateWithFlags 5.13% 136.214ms 5.13% 136.214ms 3.784ms 1.536us 0.00% 1.536us 0.043us 36 []
Runtime Triggered Module Loading 3.98% 105.758ms 3.98% 105.758ms 3.777ms 664.006us 0.10% 664.006us 23.714us 28 []
aten::linear 0.01% 274.377us 3.59% 95.268ms 882.112us 0.000us 0.00% 1.732ms 16.042us 108 [[256, 4096], [6144, 4096], []]
aten::matmul 0.01% 164.902us 3.56% 94.547ms 875.437us 0.000us 0.00% 1.732ms 16.042us 108 [[256, 4096], [4096, 6144]]
aten::mm 3.52% 93.382ms 3.55% 94.382ms 873.911us 1.704ms 0.25% 1.732ms 16.042us 108 [[256, 4096], [4096, 6144]]
cudaLaunchKernelExC 3.17% 84.267ms 3.30% 87.593ms 3.738us 0.000us 0.00% 233.379us 0.010us 23436 []
cuLaunchKernelEx 2.51% 66.674ms 2.75% 73.141ms 4.671us 0.000us 0.00% 12.793ms 0.817us 15660 []
aten::empty_strided 2.64% 70.074ms 2.71% 72.037ms 9.096us 0.000us 0.00% 0.000us 0.000us 7920 [[], [], [], [], [], []]
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ --------------------------------------------------------------------------------
Self CPU time total: 2.656s
Self CUDA time total: 674.143ms
Memory Usage is saved to cuda_graph_runner_memory_usage.pickle
[2025-10-24 15:29:36] Capture cuda graph end. Time elapsed: 91.34 s. mem usage=0.70 GB. avail mem=11.46 GB.
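As a side note, the memory-usage file mentioned in the log above can be inspected offline; a minimal sketch, assuming only that it is a standard pickle (its internal structure is whatever the runner dumps):

```python
import pickle

# Load the memory snapshot dumped during graph capture and print it
# to inspect its structure.
with open("cuda_graph_runner_memory_usage.pickle", "rb") as f:
    memory_usage = pickle.load(f)
print(memory_usage)
```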
Accuracy tests
1. without enable-profile-cuda-graph
2. with enable-profile-cuda-graph
Performance tests
1. without enable-profile-cuda-graph
[2025-10-24 10:25:59] INFO: Started server process [195407]
[2025-10-24 10:25:59] INFO: Waiting for application startup.
[2025-10-24 10:25:59] INFO: Application startup complete.
[2025-10-24 10:25:59] INFO: Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2025-10-24 10:26:00] INFO: 127.0.0.1:54490 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-10-24 10:26:01] Prefill batch. #new-seq: 1, #new-token: 128, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/_ge_concrete_graph/fx2ge_converter.py:997: UserWarning: When enable frozen_parameter, Parameters will be considered frozen.Please make sure that the Parameters data address remain the same throughout the program runtime.
warnings.warn(f'When enable frozen_parameter, Parameters will be considered frozen.'
[2025-10-24 10:26:03] INFO: 127.0.0.1:57606 - "GET /v1/models HTTP/1.1" 200 OK
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/_ge_concrete_graph/fx2ge_converter.py:997: UserWarning: When enable frozen_parameter, Parameters will be considered frozen.Please make sure that the Parameters data address remain the same throughout the program runtime.
warnings.warn(f'When enable frozen_parameter, Parameters will be considered frozen.'
[2025-10-24 10:26:04] INFO: 127.0.0.1:57608 - "GET /get_server_info HTTP/1.1" 200 OK
[2025-10-24 10:26:04] INFO: 127.0.0.1:54496 - "POST /generate HTTP/1.1" 200 OK
[2025-10-24 10:26:04] The server is fired up and ready to roll!
======== Warmup Begin ========
[2025-10-24 10:26:04] Cache flushed successfully!
[2025-10-24 10:26:04] INFO: 127.0.0.1:57618 - "POST /flush_cache HTTP/1.1" 200 OK
#Input tokens: 16384
#Output tokens: 256
[2025-10-24 10:26:15] INFO: 127.0.0.1:57802 - "POST /generate HTTP/1.1" 200 OK
[2025-10-24 10:26:15] Prefill batch. #new-seq: 8, #new-token: 8192, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 8,
[2025-10-24 10:26:15] Prefill batch. #new-seq: 8, #new-token: 8192, #cached-token: 0, token usage: 0.01, #running-req: 8, #queue-req: 0,
[2025-10-24 10:26:17] INFO: 127.0.0.1:57814 - "GET /get_server_info HTTP/1.1" 200 OK
batch size: 16
input_len: 1024
output_len: 16
latency: 1.36 s
ttft: 1.06 s
last generation throughput: 0.00 tok/s
input throughput: 15519.54 tok/s
output throughput: 841.40 tok/s
======== Warmup End ========
[2025-10-24 10:26:17] Cache flushed successfully!
[2025-10-24 10:26:17] INFO: 127.0.0.1:57822 - "POST /flush_cache HTTP/1.1" 200 OK
#Input tokens: 4096
#Output tokens: 512
[2025-10-24 10:26:27] INFO: 127.0.0.1:36938 - "POST /generate HTTP/1.1" 200 OK
[2025-10-24 10:26:27] Prefill batch. #new-seq: 16, #new-token: 4096, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-10-24 10:26:28] INFO: 127.0.0.1:36948 - "GET /get_server_info HTTP/1.1" 200 OK
batch size: 16
input_len: 256
output_len: 32
latency: 0.90 s
ttft: 0.31 s
last generation throughput: 0.00 tok/s
input throughput: 13385.51 tok/s
output throughput: 857.19 tok/s
Results are saved to result.jsonl
2. with enable-profile-cuda-graph
[2025-10-24 10:28:34] max_total_num_tokens=621440, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=8192, context_len=32768, available_gpu_mem=11.58 GB
[2025-10-24 10:28:35] INFO: Started server process [197342]
[2025-10-24 10:28:35] INFO: Waiting for application startup.
[2025-10-24 10:28:35] INFO: Application startup complete.
[2025-10-24 10:28:35] INFO: Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2025-10-24 10:28:36] INFO: 127.0.0.1:57348 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-10-24 10:28:36] Prefill batch. #new-seq: 1, #new-token: 128, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-10-24 10:28:36] INFO: 127.0.0.1:57374 - "GET /v1/models HTTP/1.1" 200 OK
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/_ge_concrete_graph/fx2ge_converter.py:997: UserWarning: When enable frozen_parameter, Parameters will be considered frozen.Please make sure that the Parameters data address remain the same throughout the program runtime.
warnings.warn(f'When enable frozen_parameter, Parameters will be considered frozen.'
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/_ge_concrete_graph/fx2ge_converter.py:997: UserWarning: When enable frozen_parameter, Parameters will be considered frozen.Please make sure that the Parameters data address remain the same throughout the program runtime.
warnings.warn(f'When enable frozen_parameter, Parameters will be considered frozen.'
[2025-10-24 10:28:40] INFO: 127.0.0.1:57390 - "GET /get_server_info HTTP/1.1" 200 OK
[2025-10-24 10:28:40] INFO: 127.0.0.1:57360 - "POST /generate HTTP/1.1" 200 OK
[2025-10-24 10:28:40] The server is fired up and ready to roll!
======== Warmup Begin ========
[2025-10-24 10:28:40] Cache flushed successfully!
[2025-10-24 10:28:40] INFO: 127.0.0.1:57400 - "POST /flush_cache HTTP/1.1" 200 OK
[2025-10-24 10:28:46] [INFO] [197851] profiler.py: CANN profiling data parsed in a total time of 0:00:12.050036
#Input tokens: 16384
#Output tokens: 256
[2025-10-24 10:28:54] INFO: 127.0.0.1:42580 - "POST /generate HTTP/1.1" 200 OK
[2025-10-24 10:28:54] Prefill batch. #new-seq: 8, #new-token: 8192, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 8,
[2025-10-24 10:28:54] Prefill batch. #new-seq: 8, #new-token: 8192, #cached-token: 0, token usage: 0.01, #running-req: 8, #queue-req: 0,
[2025-10-24 10:28:56] INFO: 127.0.0.1:42586 - "GET /get_server_info HTTP/1.1" 200 OK
batch size: 16
input_len: 1024
output_len: 16
latency: 1.35 s
ttft: 1.06 s
last generation throughput: 0.00 tok/s
input throughput: 15460.33 tok/s
output throughput: 884.04 tok/s
======== Warmup End ========
[2025-10-24 10:28:56] Cache flushed successfully!
[2025-10-24 10:28:56] INFO: 127.0.0.1:42596 - "POST /flush_cache HTTP/1.1" 200 OK
#Input tokens: 4096
#Output tokens: 512
[2025-10-24 10:29:08] INFO: 127.0.0.1:40078 - "POST /generate HTTP/1.1" 200 OK
[2025-10-24 10:29:08] Prefill batch. #new-seq: 16, #new-token: 4096, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-10-24 10:29:09] INFO: 127.0.0.1:40084 - "GET /get_server_info HTTP/1.1" 200 OK
batch size: 16
input_len: 256
output_len: 32
latency: 0.87 s
ttft: 0.32 s
last generation throughput: 0.00 tok/s
input throughput: 12775.82 tok/s
output throughput: 931.68 tok/s
Results are saved to result.jsonl
Motivation
Adapt the enable-profile-cuda-graph feature for Ascend NPU to enable profiling data collection during graph capture, migrating from the PyTorch Profiler to the Ascend PyTorch Profiler (torch_npu.profiler).
Notice: on GPU the profiling results are echoed directly to the console, while on NPU they currently have to be saved to a file and analyzed offline. It is expected that by December 30 the NPU output will match the GPU output.
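For context, the GPU-side tables in the self-verification log above have the shape of torch.profiler `key_averages().table()` output; a minimal sketch of how such summaries are echoed (`run_graph_capture` is a hypothetical stand-in, not the actual SGLang capture loop):

```python
import torch
from torch.profiler import ProfilerActivity, profile


def run_graph_capture():
    # Hypothetical stand-in for the real graph-capture workload.
    a = torch.randn(256, 4096, device="cuda")
    b = torch.randn(4096, 4096, device="cuda")
    torch.mm(a, b)


with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    run_graph_capture()

# Echo summaries in the same style as the "Sorted by CUDA Time" /
# "Sorted by CPU Time" tables in the log above.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```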
Modifications
Override the capture function in NPUGraphRunner to incorporate torch_npu.profiler collection during graph capture.
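A minimal sketch of the shape of that override, assuming a flag attribute and an output directory (the real names and signature live in python/sglang/srt/model_executor/npu_graph_runner.py; torch_npu.profiler is the Ascend PyTorch Profiler):

```python
import torch_npu

from sglang.srt.model_executor.cuda_graph_runner import CudaGraphRunner


class NPUGraphRunner(CudaGraphRunner):
    def capture(self):
        # Assumed flag name, set when --enable-profile-cuda-graph is passed.
        if not getattr(self, "enable_profile_cuda_graph", False):
            return super().capture()

        # Unlike the GPU path, the Ascend profiler writes its results to
        # disk for offline analysis (see the Notice above); the output
        # directory here is an assumption.
        with torch_npu.profiler.profile(
            activities=[
                torch_npu.profiler.ProfilerActivity.CPU,
                torch_npu.profiler.ProfilerActivity.NPU,
            ],
            on_trace_ready=torch_npu.profiler.tensorboard_trace_handler(
                "./profile_cuda_graph"
            ),
        ):
            super().capture()
```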
Accuracy Tests
Please refer to the detailed test results below.
Benchmarking and Profiling
Please refer to the detailed test results below.
Checklist