[Ascend]adapt enable-profile-cuda-graph for NPU #12617
Conversation
# Conflicts:
#	python/sglang/srt/model_executor/npu_graph_runner.py
Summary of Changes (gemini-code-assist): This pull request introduces comprehensive profiling capabilities for Ascend NPUs during CUDA graph capture.
Code Review
This pull request successfully adapts the enable-profile-cuda-graph feature for Ascend NPU by refactoring the profiling logic in CudaGraphRunner and implementing an NPU-specific version in NPUGraphRunner. The code is well-structured, using inheritance to cleanly separate the CUDA and NPU implementations. I have one minor suggestion to simplify the directory creation logic.
```python
if not Path(output_dir).exists():
    Path(output_dir).mkdir(parents=True, exist_ok=True)
```
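Presumably the simplification is to drop the redundant `exists()` check, since `mkdir(parents=True, exist_ok=True)` already tolerates an existing directory; a minimal sketch of the suggested form:

```python
from pathlib import Path

# mkdir(exist_ok=True) is a no-op when the directory already exists,
# so no separate exists() check is needed.
Path(output_dir).mkdir(parents=True, exist_ok=True)
```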
Self-verification:
[2025-10-24 15:28:05] Load weight end. type=Qwen3ForCausalLM, dtype=torch.bfloat16, avail mem=63.11 GB, mem usage=15.44 GB.
[2025-10-24 15:28:05] Using KV cache dtype: torch.bfloat16
[2025-10-24 15:28:05] KV Cache is allocated. #tokens: 365177, K size: 25.07 GB, V size: 25.07 GB
[2025-10-24 15:28:05] Memory pool end. avail mem=12.26 GB
[2025-10-24 15:28:05] Capture cuda graph begin. This can take up to several minutes. avail mem=12.16 GB
[2025-10-24 15:28:05] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256]
Capturing batches (bs=256 avail_mem=11.93 GB): 0%| | 0/36 [00:00<?, ?it/s][2025-10-24 15:28:05] MOE_A2A_BACKEND is not initialized, using default backend
Capturing batches (bs=1 avail_mem=11.47 GB): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 36/36 [00:04<00:00, 8.12it/s]
[2025-10-24 15:29:36] Sorted by CUDA Time:
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ --------------------------------------------------------------------------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls Input Shapes
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ --------------------------------------------------------------------------------
void cutlass::device_kernel<flash::enable_sm90_or_la... 0.00% 0.000us 0.00% 0.000us 0.000us 94.192ms 13.97% 94.192ms 36.340us 2592 []
nvjet_tst_192x128_64x5_1x2_h_bz_coopB_TNT 0.00% 0.000us 0.00% 0.000us 0.000us 28.623ms 4.25% 28.623ms 79.508us 360 []
void flashinfer::norm::RMSNormKernel<8u, __nv_bfloat... 0.00% 0.000us 0.00% 0.000us 0.000us 27.285ms 4.05% 27.285ms 5.191us 5256 []
void at::native::index_elementwise_kernel<128, 4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 24.492ms 3.63% 24.492ms 4.725us 5184 []
nvjet_tst_192x8_64x8_2x1_v_bz_TNT 0.00% 0.000us 0.00% 0.000us 0.000us 20.263ms 3.01% 20.263ms 70.357us 288 []
void flashinfer::norm::FusedAddRMSNormKernel<8u, __n... 0.00% 0.000us 0.00% 0.000us 0.000us 19.175ms 2.84% 19.175ms 3.699us 5184 []
cudaStreamIsCapturing 0.44% 11.692ms 0.44% 11.692ms 0.374us 17.814ms 2.64% 17.814ms 0.569us 31291 []
void at::native::elementwise_kernel<128, 4, at::nati... 0.00% 0.000us 0.00% 0.000us 0.000us 17.299ms 2.57% 17.299ms 3.432us 5040 []
nvjet_tst_64x8_64x16_2x1_v_bz_splitK_TNT 0.00% 0.000us 0.00% 0.000us 0.000us 14.706ms 2.18% 14.706ms 25.531us 576 []
nvjet_tst_128x64_64x8_1x2_h_bz_TNT 0.00% 0.000us 0.00% 0.000us 0.000us 14.400ms 2.14% 14.400ms 40.000us 360 []
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ --------------------------------------------------------------------------------
Self CPU time total: 2.656s
Self CUDA time total: 674.143ms
Sorted by CPU Time:
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ --------------------------------------------------------------------------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls Input Shapes
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ --------------------------------------------------------------------------------
aten::empty 10.32% 273.991ms 11.11% 294.971ms 9.461us 0.000us 0.00% 0.000us 0.000us 31178 [[], [], [], [], [], []]
cudaLaunchKernel 4.98% 132.193ms 8.36% 222.079ms 7.232us 0.000us 0.00% 786.117us 0.026us 30708 []
cudaGraphInstantiateWithFlags 5.13% 136.214ms 5.13% 136.214ms 3.784ms 1.536us 0.00% 1.536us 0.043us 36 []
Runtime Triggered Module Loading 3.98% 105.758ms 3.98% 105.758ms 3.777ms 664.006us 0.10% 664.006us 23.714us 28 []
aten::linear 0.01% 274.377us 3.59% 95.268ms 882.112us 0.000us 0.00% 1.732ms 16.042us 108 [[256, 4096], [6144, 4096], []]
aten::matmul 0.01% 164.902us 3.56% 94.547ms 875.437us 0.000us 0.00% 1.732ms 16.042us 108 [[256, 4096], [4096, 6144]]
aten::mm 3.52% 93.382ms 3.55% 94.382ms 873.911us 1.704ms 0.25% 1.732ms 16.042us 108 [[256, 4096], [4096, 6144]]
cudaLaunchKernelExC 3.17% 84.267ms 3.30% 87.593ms 3.738us 0.000us 0.00% 233.379us 0.010us 23436 []
cuLaunchKernelEx 2.51% 66.674ms 2.75% 73.141ms 4.671us 0.000us 0.00% 12.793ms 0.817us 15660 []
aten::empty_strided 2.64% 70.074ms 2.71% 72.037ms 9.096us 0.000us 0.00% 0.000us 0.000us 7920 [[], [], [], [], [], []]
------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ --------------------------------------------------------------------------------
Self CPU time total: 2.656s
Self CUDA time total: 674.143ms
Memory Usage is saved to cuda_graph_runner_memory_usage.pickle
[2025-10-24 15:29:36] Capture cuda graph end. Time elapsed: 91.34 s. mem usage=0.70 GB. avail mem=11.46 GB.
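As a side note, the memory-usage file mentioned in the log above can be inspected offline; a minimal sketch, assuming only that it is a standard pickle (its internal structure is whatever the runner dumps):

```python
import pickle

# Load the memory snapshot dumped during graph capture and print it
# to inspect its structure.
with open("cuda_graph_runner_memory_usage.pickle", "rb") as f:
    memory_usage = pickle.load(f)
print(memory_usage)
```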
Accuracy tests
1. without enable-profile-cuda-graph
2. with enable-profile-cuda-graph
Performance tests
1. without enable-profile-cuda-graph
[2025-10-24 10:25:59] INFO: Started server process [195407]
[2025-10-24 10:25:59] INFO: Waiting for application startup.
[2025-10-24 10:25:59] INFO: Application startup complete.
[2025-10-24 10:25:59] INFO: Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2025-10-24 10:26:00] INFO: 127.0.0.1:54490 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-10-24 10:26:01] Prefill batch. #new-seq: 1, #new-token: 128, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/_ge_concrete_graph/fx2ge_converter.py:997: UserWarning: When enable frozen_parameter, Parameters will be considered frozen.Please make sure that the Parameters data address remain the same throughout the program runtime.
warnings.warn(f'When enable frozen_parameter, Parameters will be considered frozen.'
[2025-10-24 10:26:03] INFO: 127.0.0.1:57606 - "GET /v1/models HTTP/1.1" 200 OK
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/_ge_concrete_graph/fx2ge_converter.py:997: UserWarning: When enable frozen_parameter, Parameters will be considered frozen.Please make sure that the Parameters data address remain the same throughout the program runtime.
warnings.warn(f'When enable frozen_parameter, Parameters will be considered frozen.'
[2025-10-24 10:26:04] INFO: 127.0.0.1:57608 - "GET /get_server_info HTTP/1.1" 200 OK
[2025-10-24 10:26:04] INFO: 127.0.0.1:54496 - "POST /generate HTTP/1.1" 200 OK
[2025-10-24 10:26:04] The server is fired up and ready to roll!
======== Warmup Begin ========
[2025-10-24 10:26:04] Cache flushed successfully!
[2025-10-24 10:26:04] INFO: 127.0.0.1:57618 - "POST /flush_cache HTTP/1.1" 200 OK
#Input tokens: 16384
#Output tokens: 256
[2025-10-24 10:26:15] INFO: 127.0.0.1:57802 - "POST /generate HTTP/1.1" 200 OK
[2025-10-24 10:26:15] Prefill batch. #new-seq: 8, #new-token: 8192, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 8,
[2025-10-24 10:26:15] Prefill batch. #new-seq: 8, #new-token: 8192, #cached-token: 0, token usage: 0.01, #running-req: 8, #queue-req: 0,
[2025-10-24 10:26:17] INFO: 127.0.0.1:57814 - "GET /get_server_info HTTP/1.1" 200 OK
batch size: 16
input_len: 1024
output_len: 16
latency: 1.36 s
ttft: 1.06 s
last generation throughput: 0.00 tok/s
input throughput: 15519.54 tok/s
output throughput: 841.40 tok/s
======== Warmup End ========
[2025-10-24 10:26:17] Cache flushed successfully!
[2025-10-24 10:26:17] INFO: 127.0.0.1:57822 - "POST /flush_cache HTTP/1.1" 200 OK
#Input tokens: 4096
#Output tokens: 512
[2025-10-24 10:26:27] INFO: 127.0.0.1:36938 - "POST /generate HTTP/1.1" 200 OK
[2025-10-24 10:26:27] Prefill batch. #new-seq: 16, #new-token: 4096, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-10-24 10:26:28] INFO: 127.0.0.1:36948 - "GET /get_server_info HTTP/1.1" 200 OK
batch size: 16
input_len: 256
output_len: 32
latency: 0.90 s
ttft: 0.31 s
last generation throughput: 0.00 tok/s
input throughput: 13385.51 tok/s
output throughput: 857.19 tok/s
Results are saved to result.jsonl
2. with enable-profile-cuda-graph
[2025-10-24 10:28:34] max_total_num_tokens=621440, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=8192, context_len=32768, available_gpu_mem=11.58 GB
[2025-10-24 10:28:35] INFO: Started server process [197342]
[2025-10-24 10:28:35] INFO: Waiting for application startup.
[2025-10-24 10:28:35] INFO: Application startup complete.
[2025-10-24 10:28:35] INFO: Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2025-10-24 10:28:36] INFO: 127.0.0.1:57348 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-10-24 10:28:36] Prefill batch. #new-seq: 1, #new-token: 128, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-10-24 10:28:36] INFO: 127.0.0.1:57374 - "GET /v1/models HTTP/1.1" 200 OK
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/_ge_concrete_graph/fx2ge_converter.py:997: UserWarning: When enable frozen_parameter, Parameters will be considered frozen.Please make sure that the Parameters data address remain the same throughout the program runtime.
warnings.warn(f'When enable frozen_parameter, Parameters will be considered frozen.'
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/_ge_concrete_graph/fx2ge_converter.py:997: UserWarning: When enable frozen_parameter, Parameters will be considered frozen.Please make sure that the Parameters data address remain the same throughout the program runtime.
warnings.warn(f'When enable frozen_parameter, Parameters will be considered frozen.'
[2025-10-24 10:28:40] INFO: 127.0.0.1:57390 - "GET /get_server_info HTTP/1.1" 200 OK
[2025-10-24 10:28:40] INFO: 127.0.0.1:57360 - "POST /generate HTTP/1.1" 200 OK
[2025-10-24 10:28:40] The server is fired up and ready to roll!
======== Warmup Begin ========
[2025-10-24 10:28:40] Cache flushed successfully!
[2025-10-24 10:28:40] INFO: 127.0.0.1:57400 - "POST /flush_cache HTTP/1.1" 200 OK
[2025-10-24 10:28:46] [INFO] [197851] profiler.py: CANN profiling data parsed in a total time of 0:00:12.050036
#Input tokens: 16384
#Output tokens: 256
[2025-10-24 10:28:54] INFO: 127.0.0.1:42580 - "POST /generate HTTP/1.1" 200 OK
[2025-10-24 10:28:54] Prefill batch. #new-seq: 8, #new-token: 8192, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 8,
[2025-10-24 10:28:54] Prefill batch. #new-seq: 8, #new-token: 8192, #cached-token: 0, token usage: 0.01, #running-req: 8, #queue-req: 0,
[2025-10-24 10:28:56] INFO: 127.0.0.1:42586 - "GET /get_server_info HTTP/1.1" 200 OK
batch size: 16
input_len: 1024
output_len: 16
latency: 1.35 s
ttft: 1.06 s
last generation throughput: 0.00 tok/s
input throughput: 15460.33 tok/s
output throughput: 884.04 tok/s
======== Warmup End ========
[2025-10-24 10:28:56] Cache flushed successfully!
[2025-10-24 10:28:56] INFO: 127.0.0.1:42596 - "POST /flush_cache HTTP/1.1" 200 OK
#Input tokens: 4096
#Output tokens: 512
[2025-10-24 10:29:08] INFO: 127.0.0.1:40078 - "POST /generate HTTP/1.1" 200 OK
[2025-10-24 10:29:08] Prefill batch. #new-seq: 16, #new-token: 4096, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-10-24 10:29:09] INFO: 127.0.0.1:40084 - "GET /get_server_info HTTP/1.1" 200 OK
batch size: 16
input_len: 256
output_len: 32
latency: 0.87 s
ttft: 0.32 s
last generation throughput: 0.00 tok/s
input throughput: 12775.82 tok/s
output throughput: 931.68 tok/s
Results are saved to result.jsonl
Motivation
Adapt the enable-profile-cuda-graph feature for Ascend NPU to enable profiling data collection during graph capture, migrating from the PyTorch Profiler to the Ascend PyTorch Profiler (torch_npu.profiler).
Notice: on GPU the profiling results are echoed directly to the console, while on NPU they currently have to be saved to a file and analyzed offline. It is expected that by December 30 the NPU output will match the GPU output.
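For context, the GPU-side tables in the self-verification log above have the shape of torch.profiler `key_averages().table()` output; a minimal sketch of how such summaries are echoed (`run_graph_capture` is a hypothetical stand-in, not the actual SGLang capture loop):

```python
import torch
from torch.profiler import ProfilerActivity, profile


def run_graph_capture():
    # Hypothetical stand-in for the real graph-capture workload.
    a = torch.randn(256, 4096, device="cuda")
    b = torch.randn(4096, 4096, device="cuda")
    torch.mm(a, b)


with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    run_graph_capture()

# Echo summaries in the same style as the "Sorted by CUDA Time" /
# "Sorted by CPU Time" tables in the log above.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```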
Modifications
Override the capture function in NPUGraphRunner to incorporate torch_npu.profiler collection during graph capture.
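A minimal sketch of the shape of that override, assuming a flag attribute and an output directory (the real names and signature live in python/sglang/srt/model_executor/npu_graph_runner.py; torch_npu.profiler is the Ascend PyTorch Profiler):

```python
import torch_npu

from sglang.srt.model_executor.cuda_graph_runner import CudaGraphRunner


class NPUGraphRunner(CudaGraphRunner):
    def capture(self):
        # Assumed flag name, set when --enable-profile-cuda-graph is passed.
        if not getattr(self, "enable_profile_cuda_graph", False):
            return super().capture()

        # Unlike the GPU path, the Ascend profiler writes its results to
        # disk for offline analysis (see the Notice above); the output
        # directory here is an assumption.
        with torch_npu.profiler.profile(
            activities=[
                torch_npu.profiler.ProfilerActivity.CPU,
                torch_npu.profiler.ProfilerActivity.NPU,
            ],
            on_trace_ready=torch_npu.profiler.tensorboard_trace_handler(
                "./profile_cuda_graph"
            ),
        ):
            super().capture()
```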
Accuracy Tests
Please refer to the detailed test results below.
Benchmarking and Profiling
Please refer to the detailed test results below.
Checklist