
Conversation

@ping1jing2 (Collaborator)

Motivation

Adapt the enable-profile-cuda-graph feature for Ascend NPU so that profiling data can be collected during graph capture, migrating from the PyTorch Profiler to the Ascend PyTorch Profiler (torch_npu.profiler).

Notice: On the GPU, the profiling results are printed directly to the log. On the NPU, the results currently have to be saved to a file and analyzed afterwards. It is estimated that by December 30 the NPU output will be the same as the GPU output.

Modifications

Override the capture method in NPUGraphRunner to collect torch_npu.profiler data during graph capture.
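
As a rough illustration (not the PR's actual code), here is a minimal sketch of wrapping the capture loop in torch_npu.profiler. Apart from the NPUGraphRunner/CudaGraphRunner names and the torch_npu.profiler APIs, the import path, attribute names, and output directory below are assumptions:

    # Illustrative sketch only; names other than torch_npu.profiler APIs are assumed.
    from pathlib import Path

    import torch_npu

    from sglang.srt.model_executor.cuda_graph_runner import CudaGraphRunner  # assumed import path


    class NPUGraphRunner(CudaGraphRunner):
        def capture(self):
            if not self.enable_profile_cuda_graph:  # assumed attribute backing the CLI flag
                return super().capture()

            output_dir = "./graph_capture_profile"  # hypothetical trace directory
            Path(output_dir).mkdir(parents=True, exist_ok=True)

            with torch_npu.profiler.profile(
                activities=[
                    torch_npu.profiler.ProfilerActivity.CPU,
                    torch_npu.profiler.ProfilerActivity.NPU,
                ],
                # NPU traces are written to disk and parsed after capture,
                # rather than being printed like the CUDA profiler tables.
                on_trace_ready=torch_npu.profiler.tensorboard_trace_handler(output_dir),
                record_shapes=True,
                profile_memory=True,
                with_stack=True,
            ):
                super().capture()  # run the regular graph-capture loop under the profiler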

Accuracy Tests

Please refer to the detailed test results below.

Benchmarking and Profiling

Please refer to the detailed test results below.

Checklist

@gemini-code-assist (Contributor)

Summary of Changes

Hello @ping1jing2, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces comprehensive profiling capabilities for Ascend NPUs during CUDA graph capture. It integrates torch_npu.profiler to enable detailed performance and memory analysis on NPU devices, ensuring that developers can gain insights into the execution of their models. The changes also involve refactoring existing profiling logic into reusable methods and adapting the output handling to suit the NPU environment, with a clear roadmap for future output consistency.

Highlights

  • NPU Profiling Integration: The enable-profile-cuda-graph feature is adapted to support Ascend NPUs, allowing profiling during graph capture on these devices.
  • PyTorch Profiler Migration: The profiling mechanism for NPUs now utilizes torch_npu.profiler, replacing the standard PyTorch Profiler for NPU-specific operations.
  • Modular Profiling Logic: Common profiling initialization and post-processing steps in CudaGraphRunner have been refactored into dedicated private methods (_init_profile_context_and_memory_record and _post_process_after_profile) for better organization and reusability; a sketch follows this list.
  • NPU-Specific Output Handling: Unlike GPU profiling which outputs directly, NPU profiling results are now saved to a file for later analysis, with a future update planned to unify the output method.
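
To make the refactoring in the third bullet concrete, here is a rough sketch of how capture() might use the two extracted helpers. Only the method names _init_profile_context_and_memory_record and _post_process_after_profile come from this PR; the flag attribute and the capture-loop helper are illustrative assumptions:

    import contextlib


    class CudaGraphRunner:
        def capture(self):
            # Build the profiler context and start memory-history recording only when
            # --enable-profile-cuda-graph is set; otherwise use a no-op context.
            profile_context = (
                self._init_profile_context_and_memory_record()
                if self.enable_profile_cuda_graph  # assumed attribute backing the CLI flag
                else contextlib.nullcontext()
            )

            with profile_context as prof:
                self._capture_all_batch_sizes()  # hypothetical name for the existing capture loop

            if self.enable_profile_cuda_graph:
                # Print the tables sorted by CUDA/CPU time and save the memory-usage pickle.
                self._post_process_after_profile(prof)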

@gemini-code-assist (Contributor) left a comment


Code Review

This pull request successfully adapts the enable-profile-cuda-graph feature for Ascend NPU by refactoring the profiling logic in CudaGraphRunner and implementing an NPU-specific version in NPUGraphRunner. The code is well-structured, using inheritance to cleanly separate the CUDA and NPU implementations. I have one minor suggestion to simplify the directory creation logic.

Comment on lines +76 to +77
if not Path(output_dir).exists():
Path(output_dir).mkdir(parents=True, exist_ok=True)

Severity: medium

The check if not Path(output_dir).exists() is redundant because Path.mkdir(parents=True, exist_ok=True) already handles the case where the directory exists without raising an error. You can simplify the code by removing the if statement.

        Path(output_dir).mkdir(parents=True, exist_ok=True)
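
For reference, mkdir(parents=True, exist_ok=True) is idempotent, so the simplified call is safe whether or not the directory already exists (the path below is just an example):

    from pathlib import Path

    output_dir = "/tmp/graph_capture_profile"  # example path
    Path(output_dir).mkdir(parents=True, exist_ok=True)  # creates missing parents
    Path(output_dir).mkdir(parents=True, exist_ok=True)  # second call is a no-op, no FileExistsError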

@ping1jing2 marked this pull request as draft on November 4, 2025 at 09:20
@ping1jing2 (Collaborator, Author)

Self-verification:

  • Hardware: H800

  • Test CMD:

python3 -m sglang.launch_server --model-path /root/models/Qwen/Qwen3-8B --host 0.0.0.0 --port 8000 --enable-profile-cuda-graph

  • Logs
[2025-10-24 15:28:05] Load weight end. type=Qwen3ForCausalLM, dtype=torch.bfloat16, avail mem=63.11 GB, mem usage=15.44 GB.
[2025-10-24 15:28:05] Using KV cache dtype: torch.bfloat16
[2025-10-24 15:28:05] KV Cache is allocated. #tokens: 365177, K size: 25.07 GB, V size: 25.07 GB
[2025-10-24 15:28:05] Memory pool end. avail mem=12.26 GB
[2025-10-24 15:28:05] Capture cuda graph begin. This can take up to several minutes. avail mem=12.16 GB
[2025-10-24 15:28:05] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256]
Capturing batches (bs=256 avail_mem=11.93 GB):   0%|                                                                                                                           | 0/36 [00:00<?, ?it/s][2025-10-24 15:28:05] MOE_A2A_BACKEND is not initialized, using default backend
Capturing batches (bs=1 avail_mem=11.47 GB): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 36/36 [00:04<00:00,  8.12it/s]


[2025-10-24 15:29:36] Sorted by CUDA Time:

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls                                                                      Input Shapes
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  -------------------------------------------------------------------------------- 
              void cutlass::device_kernel<flash::enable_sm90_or_la...         0.00%       0.000us         0.00%       0.000us       0.000us      94.192ms        13.97%      94.192ms      36.340us          2592                                                                                []
              nvjet_tst_192x128_64x5_1x2_h_bz_coopB_TNT         0.00%       0.000us         0.00%       0.000us       0.000us      28.623ms         4.25%      28.623ms      79.508us           360                                                                                []
              void flashinfer::norm::RMSNormKernel<8u, __nv_bfloat...         0.00%       0.000us         0.00%       0.000us       0.000us      27.285ms         4.05%      27.285ms       5.191us          5256                                                                                []
              void at::native::index_elementwise_kernel<128, 4, at...         0.00%       0.000us         0.00%       0.000us       0.000us      24.492ms         3.63%      24.492ms       4.725us          5184                                                                                []
              nvjet_tst_192x8_64x8_2x1_v_bz_TNT         0.00%       0.000us         0.00%       0.000us       0.000us      20.263ms         3.01%      20.263ms      70.357us           288                                                                                []
              void flashinfer::norm::FusedAddRMSNormKernel<8u, __n...         0.00%       0.000us         0.00%       0.000us       0.000us      19.175ms         2.84%      19.175ms       3.699us          5184                                                                                []
              cudaStreamIsCapturing         0.44%      11.692ms         0.44%      11.692ms       0.374us      17.814ms         2.64%      17.814ms       0.569us         31291                                                                                []
              void at::native::elementwise_kernel<128, 4, at::nati...         0.00%       0.000us         0.00%       0.000us       0.000us      17.299ms         2.57%      17.299ms       3.432us          5040                                                                                []
              nvjet_tst_64x8_64x16_2x1_v_bz_splitK_TNT         0.00%       0.000us         0.00%       0.000us       0.000us      14.706ms         2.18%      14.706ms      25.531us           576                                                                                []
              nvjet_tst_128x64_64x8_1x2_h_bz_TNT         0.00%       0.000us         0.00%       0.000us       0.000us      14.400ms         2.14%      14.400ms      40.000us           360                                                                                []
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------

Self CPU time total: 2.656s
Self CUDA time total: 674.143ms


Sorted by CPU Time:

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls                                                                      Input Shapes
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------
                                            aten::empty        10.32%     273.991ms        11.11%     294.971ms       9.461us       0.000us         0.00%       0.000us       0.000us         31178                                                          [[], [], [], [], [], []]
                                       cudaLaunchKernel         4.98%     132.193ms         8.36%     222.079ms       7.232us       0.000us         0.00%     786.117us       0.026us         30708                                                                                []
                          cudaGraphInstantiateWithFlags         5.13%     136.214ms         5.13%     136.214ms       3.784ms       1.536us         0.00%       1.536us       0.043us            36                                                                                []
                       Runtime Triggered Module Loading         3.98%     105.758ms         3.98%     105.758ms       3.777ms     664.006us         0.10%     664.006us      23.714us            28                                                                                []
                                           aten::linear         0.01%     274.377us         3.59%      95.268ms     882.112us       0.000us         0.00%       1.732ms      16.042us           108                                                   [[256, 4096], [6144, 4096], []]
                                           aten::matmul         0.01%     164.902us         3.56%      94.547ms     875.437us       0.000us         0.00%       1.732ms      16.042us           108                                                       [[256, 4096], [4096, 6144]]
                                               aten::mm         3.52%      93.382ms         3.55%      94.382ms     873.911us       1.704ms         0.25%       1.732ms      16.042us           108                                                       [[256, 4096], [4096, 6144]]
                                    cudaLaunchKernelExC         3.17%      84.267ms         3.30%      87.593ms       3.738us       0.000us         0.00%     233.379us       0.010us         23436                                                                                []
                                       cuLaunchKernelEx         2.51%      66.674ms         2.75%      73.141ms       4.671us       0.000us         0.00%      12.793ms       0.817us         15660                                                                                []
                                    aten::empty_strided         2.64%      70.074ms         2.71%      72.037ms       9.096us       0.000us         0.00%       0.000us       0.000us          7920                                                          [[], [], [], [], [], []]
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  --------------------------------------------------------------------------------

Self CPU time total: 2.656s
Self CUDA time total: 674.143ms


Memory Usage is saved to cuda_graph_runner_memory_usage.pickle

[2025-10-24 15:29:36] Capture cuda graph end. Time elapsed: 91.34 s. mem usage=0.70 GB. avail mem=11.46 GB.
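
Side note: the memory-usage pickle mentioned above can be inspected offline. A minimal sketch, assuming it is an ordinary pickled Python object whose exact structure is not documented here:

    import pickle
    from pprint import pprint

    with open("cuda_graph_runner_memory_usage.pickle", "rb") as f:
        memory_usage = pickle.load(f)

    print(type(memory_usage))  # check what the runner actually dumped
    pprint(memory_usage)       # full dump; slice or filter for large objects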

@ping1jing2 (Collaborator, Author)

Accuracy tests

1. without enable-profile-cuda-graph

  • Launch a server
    python3 -m sglang.launch_server --model-path /root/models/Qwen/Qwen2-7B-Instruct --device npu --attention-backend ascend
  • Evaluate
    python3 -m sglang.test.few_shot_gsm8k --num-questions 200 --parallel 16
  • LOG
    100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:47<00:00, 4.18it/s]
    Accuracy: 0.855
    Invalid: 0.000
    Latency: 48.017 s
    Output throughput: 585.901 token/s

2. with enable-profile-cuda-graph

  • Launch a server
    python3 -m sglang.launch_server --model-path /root/models/Qwen/Qwen2-7B-Instruct --device npu --attention-backend ascend --enable-profile-cuda-graph

  • Evaluate
    python3 -m sglang.test.few_shot_gsm8k --num-questions 200 --parallel 16

  • LOG
    100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:45<00:00, 4.41it/s]
    Accuracy: 0.855
    Invalid: 0.000
    Latency: 45.474 s
    Output throughput: 618.664 token/s

  • Server Log
    [2025-10-24 09:38:48] Load weight end. type=Qwen2ForCausalLM, dtype=torch.bfloat16, avail mem=46.21 GB, mem usage=14.34 GB.
    [2025-10-24 09:38:48] Using KV cache dtype: torch.bfloat16
    [2025-10-24 09:38:48] KV Cache is allocated. #tokens: 621440, K size: 16.60 GB, V size: 16.60 GB
    [2025-10-24 09:38:48] Memory pool end. avail mem=12.01 GB
    [2025-10-24 09:38:48] Capture cuda graph begin. This can take up to several minutes. avail mem=12.01 GB
    [2025-10-24 09:38:48] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256]
    [2025-10-24 09:38:48] Profiling starts for graph capture for NPU. Traces will be saved to: /root/datasets/sglang_profiler_with_graph_capture/1023/graph_capture_profile
    Capturing batches (bs=1 avail_mem=11.58 GB): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 36/36 [00:10<00:00, 3.59it/s]
    [2025-10-24 09:38:59] [WARNING] [185276] profiler.py: Incorrect schedule: Stop profiler while current state is RECORD which may result in incomplete parsed data.
    [2025-10-24 09:39:00] Capture cuda graph end. Time elapsed: 11.50 s. mem usage=0.43 GB. avail mem=11.58 GB.
    [2025-10-24 09:39:00] [INFO] [185724] profiler.py: Start parsing profiling data: /root/datasets/sglang_profiler_with_graph_capture/1023/graph_capture_profile/e80c194d0e61_185276_20251024093849161_ascend_pt
    /usr/local/python3.11.13/lib/python3.11/site-packages/sglang/srt/utils/common.py:1165: UserWarning: The given NumPy array is not writable, and PyTorch does not support non-writable tensors. This means writing to this tensor will result in undefined behavior. You may want to copy the array to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:203.)
    tensor_data = torch.ByteTensor(
    [2025-10-24 09:39:00] max_total_num_tokens=621440, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=8192, context_len=32768, available_gpu_mem=11.58 GB
    [2025-10-24 09:39:01] INFO: Started server process [184926]
    [2025-10-24 09:39:01] INFO: Waiting for application startup.
    [2025-10-24 09:39:01] INFO: Application startup complete.
    [2025-10-24 09:39:01] INFO: Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
    [2025-10-24 09:39:02] INFO: 127.0.0.1:44806 - "GET /get_model_info HTTP/1.1" 200 OK
    [2025-10-24 09:39:02] Prefill batch. #new-seq: 1, #new-token: 128, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
    /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/_ge_concrete_graph/fx2ge_converter.py:997: UserWarning: When enable frozen_parameter, Parameters will be considered frozen.Please make sure that the Parameters data address remain the same throughout the program runtime.
    warnings.warn(f'When enable frozen_parameter, Parameters will be considered frozen.'
    /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/_ge_concrete_graph/fx2ge_converter.py:997: UserWarning: When enable frozen_parameter, Parameters will be considered frozen.Please make sure that the Parameters data address remain the same throughout the program runtime.
    warnings.warn(f'When enable frozen_parameter, Parameters will be considered frozen.'
    [2025-10-24 09:39:06] INFO: 127.0.0.1:44814 - "POST /generate HTTP/1.1" 200 OK
    [2025-10-24 09:39:06] The server is fired up and ready to roll!
    [2025-10-24 09:39:12] [INFO] [185766] profiler.py: CANN profiling data parsed in a total time of 0:00:12.048350
    [2025-10-24 09:39:37] [INFO] [185724] profiler.py: All profiling data parsed in a total time of 0:00:38.506339
    [2025-10-24 09:40:38] INFO: 127.0.0.1:39402 - "GET /get_model_info HTTP/1.1" 200 OK
    [2025-10-24 09:40:38] Prefill batch. #new-seq: 1, #new-token: 896, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
    [2025-10-24 09:40:38] INFO: 127.0.0.1:39404 - "POST /generate HTTP/1.1" 200 OK

@ping1jing2 (Collaborator, Author)

Performance tests

1. without enable-profile-cuda-graph

  • CMD
    python3 -m sglang.bench_one_batch_server --device npu --attention-backend ascend --enable-profile-cuda-graph --model-path /root/models/Qwen/Qwen2-7B-Instruct --batch-size 32 --input-len 256 --output-len 32

  • Log

[2025-10-24 10:25:59] INFO:     Started server process [195407]
[2025-10-24 10:25:59] INFO:     Waiting for application startup.
[2025-10-24 10:25:59] INFO:     Application startup complete.
[2025-10-24 10:25:59] INFO:     Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2025-10-24 10:26:00] INFO:     127.0.0.1:54490 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-10-24 10:26:01] Prefill batch. #new-seq: 1, #new-token: 128, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/_ge_concrete_graph/fx2ge_converter.py:997: UserWarning: When enable frozen_parameter, Parameters will be considered frozen.Please make sure that the Parameters data address remain the same throughout the program runtime.
  warnings.warn(f'When enable frozen_parameter, Parameters will be considered frozen.'
[2025-10-24 10:26:03] INFO:     127.0.0.1:57606 - "GET /v1/models HTTP/1.1" 200 OK
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/_ge_concrete_graph/fx2ge_converter.py:997: UserWarning: When enable frozen_parameter, Parameters will be considered frozen.Please make sure that the Parameters data address remain the same throughout the program runtime.
  warnings.warn(f'When enable frozen_parameter, Parameters will be considered frozen.'
[2025-10-24 10:26:04] INFO:     127.0.0.1:57608 - "GET /get_server_info HTTP/1.1" 200 OK
[2025-10-24 10:26:04] INFO:     127.0.0.1:54496 - "POST /generate HTTP/1.1" 200 OK
[2025-10-24 10:26:04] The server is fired up and ready to roll!
======== Warmup Begin ========
[2025-10-24 10:26:04] Cache flushed successfully!
[2025-10-24 10:26:04] INFO:     127.0.0.1:57618 - "POST /flush_cache HTTP/1.1" 200 OK
#Input tokens: 16384
#Output tokens: 256
[2025-10-24 10:26:15] INFO:     127.0.0.1:57802 - "POST /generate HTTP/1.1" 200 OK
[2025-10-24 10:26:15] Prefill batch. #new-seq: 8, #new-token: 8192, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 8,
[2025-10-24 10:26:15] Prefill batch. #new-seq: 8, #new-token: 8192, #cached-token: 0, token usage: 0.01, #running-req: 8, #queue-req: 0,
[2025-10-24 10:26:17] INFO:     127.0.0.1:57814 - "GET /get_server_info HTTP/1.1" 200 OK
batch size: 16
input_len: 1024
output_len: 16
latency: 1.36 s
ttft: 1.06 s
last generation throughput: 0.00 tok/s
input throughput: 15519.54 tok/s
output throughput: 841.40 tok/s
======== Warmup End   ========

[2025-10-24 10:26:17] Cache flushed successfully!
[2025-10-24 10:26:17] INFO:     127.0.0.1:57822 - "POST /flush_cache HTTP/1.1" 200 OK
#Input tokens: 4096
#Output tokens: 512
[2025-10-24 10:26:27] INFO:     127.0.0.1:36938 - "POST /generate HTTP/1.1" 200 OK
[2025-10-24 10:26:27] Prefill batch. #new-seq: 16, #new-token: 4096, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-10-24 10:26:28] INFO:     127.0.0.1:36948 - "GET /get_server_info HTTP/1.1" 200 OK
batch size: 16
input_len: 256
output_len: 32
latency: 0.90 s
ttft: 0.31 s
last generation throughput: 0.00 tok/s
input throughput: 13385.51 tok/s
output throughput: 857.19 tok/s

Results are saved to result.jsonl

2. with enable-profile-cuda-graph

  • CMD
    python3 -m sglang.bench_one_batch_server --device npu --attention-backend ascend --enable-profile-cuda-graph --model-path /root/models/Qwen/Qwen2-7B-Instruct --batch-size 16 --input-len 256 --output-len 32

  • Log

[2025-10-24 10:28:34] max_total_num_tokens=621440, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=8192, context_len=32768, available_gpu_mem=11.58 GB
[2025-10-24 10:28:35] INFO:     Started server process [197342]
[2025-10-24 10:28:35] INFO:     Waiting for application startup.
[2025-10-24 10:28:35] INFO:     Application startup complete.
[2025-10-24 10:28:35] INFO:     Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2025-10-24 10:28:36] INFO:     127.0.0.1:57348 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-10-24 10:28:36] Prefill batch. #new-seq: 1, #new-token: 128, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-10-24 10:28:36] INFO:     127.0.0.1:57374 - "GET /v1/models HTTP/1.1" 200 OK
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/_ge_concrete_graph/fx2ge_converter.py:997: UserWarning: When enable frozen_parameter, Parameters will be considered frozen.Please make sure that the Parameters data address remain the same throughout the program runtime.
  warnings.warn(f'When enable frozen_parameter, Parameters will be considered frozen.'
/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/dynamo/torchair/_ge_concrete_graph/fx2ge_converter.py:997: UserWarning: When enable frozen_parameter, Parameters will be considered frozen.Please make sure that the Parameters data address remain the same throughout the program runtime.
  warnings.warn(f'When enable frozen_parameter, Parameters will be considered frozen.'
[2025-10-24 10:28:40] INFO:     127.0.0.1:57390 - "GET /get_server_info HTTP/1.1" 200 OK
[2025-10-24 10:28:40] INFO:     127.0.0.1:57360 - "POST /generate HTTP/1.1" 200 OK
[2025-10-24 10:28:40] The server is fired up and ready to roll!
======== Warmup Begin ========
[2025-10-24 10:28:40] Cache flushed successfully!
[2025-10-24 10:28:40] INFO:     127.0.0.1:57400 - "POST /flush_cache HTTP/1.1" 200 OK
[2025-10-24 10:28:46] [INFO] [197851] profiler.py: CANN profiling data parsed in a total time of 0:00:12.050036
#Input tokens: 16384
#Output tokens: 256
[2025-10-24 10:28:54] INFO:     127.0.0.1:42580 - "POST /generate HTTP/1.1" 200 OK
[2025-10-24 10:28:54] Prefill batch. #new-seq: 8, #new-token: 8192, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 8,
[2025-10-24 10:28:54] Prefill batch. #new-seq: 8, #new-token: 8192, #cached-token: 0, token usage: 0.01, #running-req: 8, #queue-req: 0,
[2025-10-24 10:28:56] INFO:     127.0.0.1:42586 - "GET /get_server_info HTTP/1.1" 200 OK
batch size: 16
input_len: 1024
output_len: 16
latency: 1.35 s
ttft: 1.06 s
last generation throughput: 0.00 tok/s
input throughput: 15460.33 tok/s
output throughput: 884.04 tok/s
======== Warmup End   ========

[2025-10-24 10:28:56] Cache flushed successfully!
[2025-10-24 10:28:56] INFO:     127.0.0.1:42596 - "POST /flush_cache HTTP/1.1" 200 OK
#Input tokens: 4096
#Output tokens: 512
[2025-10-24 10:29:08] INFO:     127.0.0.1:40078 - "POST /generate HTTP/1.1" 200 OK
[2025-10-24 10:29:08] Prefill batch. #new-seq: 16, #new-token: 4096, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-10-24 10:29:09] INFO:     127.0.0.1:40084 - "GET /get_server_info HTTP/1.1" 200 OK
batch size: 16
input_len: 256
output_len: 32
latency: 0.87 s
ttft: 0.32 s
last generation throughput: 0.00 tok/s
input throughput: 12775.82 tok/s
output throughput: 931.68 tok/s

Results are saved to result.jsonl
