Here, we follow a sequence of steps to understand the Jacobi code better, each time using a specific profiler tool because we seek a particular type of information. We have at our disposal three types of tools:
- rocprofv3, which can help get GPU hotspots, traces or counters
- rocprof-sys, which can help us get GPU and CPU traces
- rocprof-compute, which can help us understand GPU kernel performance
The Jacobi example can be found in this repo at https://github.com/gsitaram/HPCTrainingExamples/tree/main/HIP/jacobi. It uses HIP for offloading compute to GPUs and MPI for halo exchanges. Let's assume that we do not yet know the characteristics of the application or the limiters of its hotspots.
First, set up your environment with a recent version of ROCm and the profiling tools. Since rocprofiler-systems is not yet installed with rocm/6.3.1 on Frontier, we will install it ourselves into our home directory using an installer script.
module load rocm/6.3.1
module load rocprofiler-compute/3.0.0
wget https://github.com/ROCm/rocprofiler-systems/releases/download/rocm-6.3.3/rocprofiler-systems-0.1.2-opensuse-15.6-ROCm-60300-PAPI-OMPT-Python3.sh
chmod +x rocprofiler-systems-0.1.2-opensuse-15.6-ROCm-60300-PAPI-OMPT-Python3.sh
mkdir -p ${HOME}/rocprofiler-systems
./rocprofiler-systems-0.1.2-opensuse-15.6-ROCm-60300-PAPI-OMPT-Python3.sh --exclude-subdir --prefix=${HOME}/rocprofiler-systems
source ${HOME}/rocprofiler-systems/share/rocprofiler-systems/setup-env.sh
Also, for ease of use, store your Frontier project name in an environment variable.
export PROJ=<proj>
Check if you can clone and build the Jacobi example. On Frontier, run the commands:
cd ${HOME}
git clone git@github.com:amd/HPCTrainingExamples.git
cd HPCTrainingExamples/HIP/jacobi
make -f Makefile.cray
srun -N1 -n1 -c7 -A ${PROJ} -t 02:00 ./Jacobi_hip -g 1 1
That should show an output similar to the following:
Topology size: 1 x 1
Local domain size (current node): 4096 x 4096
Global domain size (all nodes): 4096 x 4096
Rank 0 selecting device 0 on host frontier04031
Starting Jacobi run.
Iteration: 0 - Residual: 0.022108
Iteration: 100 - Residual: 0.000625
Iteration: 200 - Residual: 0.000371
Iteration: 300 - Residual: 0.000274
Iteration: 400 - Residual: 0.000221
Iteration: 500 - Residual: 0.000187
Iteration: 600 - Residual: 0.000163
Iteration: 700 - Residual: 0.000145
Iteration: 800 - Residual: 0.000131
Iteration: 900 - Residual: 0.000120
Iteration: 1000 - Residual: 0.000111
Stopped after 1000 iterations with residue 0.000111
Total Jacobi run time: 1.3129 sec.
Measured lattice updates: 12.78 GLU/s (total), 12.78 GLU/s (per process)
Measured FLOPS: 217.24 GFLOPS (total), 217.24 GFLOPS (per process)
Measured device bandwidth: 1.23 TB/s (total), 1.23 TB/s (per process)
That was a successful run. Now, we can try running a job with 2 processes. This time,
add the Slurm option --gpu-bind=closest to ensure that each process gets a
different GPU device, the one closest to the CPU core that it runs on. Notice
that we increased the number of processes with -n2 and changed the Jacobi process
grid to 2 x 1 with -g 2 1.
srun -N1 -n2 -c7 --gpu-bind=closest -A ${PROJ} -t 02:00 ./Jacobi_hip -g 2 1
To understand whether we spend most of the application runtime on the GPU or on the host,
getting the GPU kernel hotspots using rocprofv3 is a quick method. Using the total
time spent in the most expensive kernel, we can calculate how much of the application
time is spent running GPU compute kernels. We will use the single rank run for this
experiment. Notice that we added the tool and its options,
rocprofv3 --kernel-trace --stats --
before calling the application in the srun command. You can do similar experiments
to get HIP API traces using --hip-trace --stats, memory copy stats using
--memory-copy-trace --stats, or a combination of these.
srun -N1 -n1 -c7 -A ${PROJ} -t 02:00 rocprofv3 --kernel-trace --stats -- ./Jacobi_hip -g 1 1
Of the output files, there will be one called XXXXX_kernel_stats.csv. A cat of that
file should show a list of GPU kernel hot spots as seen below:
"Name","Calls","TotalDurationNs","AverageNs","Percentage","MinNs","MaxNs","StdDev"
"JacobiIterationKernel(int, double, double, double const*, double const*, double*, double*)",1000,517647434,517647.434000,42.56,510404,527365,2907.325924
"NormKernel1(int, double, double, double const*, double*)",1001,412893767,412481.285714,33.95,401603,423364,2862.057355
"LocalLaplacianKernel(int, int, int, double, double, double const*, double*)",1000,269619091,269619.091000,22.17,263842,281762,1763.747497
"HaloLaplacianKernel(int, int, int, double, double, double const*, double const*, double*)",1000,13295466,13295.466000,1.09,12320,15360,316.811361
"NormKernel2(int, double const*, double*)",1001,2869948,2867.080919,0.2360,2720,3840,135.996465
Here, we see that JacobiIterationKernel is the most expensive kernel on the MI250X GCD
that this job ran on. Its total duration of 517 ms accounts for 42.56% of the time
spent in all GPU kernels, so the kernels together took about 517 / 0.4256 = 1214.8 ms.
Given the total elapsed time of 1.2545 seconds for this run, we can quickly
conclude that about 97% of the elapsed time was spent running GPU kernels.
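The same estimate can be made without relying on the rounded percentage by summing the TotalDurationNs column directly. The following is a minimal sketch rather than part of the original example: it copies the sample rows shown above into a temporary CSV (in practice, point stats_csv at your own XXXXX_kernel_stats.csv), and the elapsed time of 1.2545 s is the value quoted in the text. Because the kernel names themselves contain commas, the script indexes the numeric fields from the end of each line.

```shell
#!/bin/sh
# Sketch: estimate total GPU kernel time and its share of the elapsed time.
# The rows below are copied from the sample output; replace stats_csv with
# the path to your own *_kernel_stats.csv file.
stats_csv=$(mktemp)
cat > "$stats_csv" <<'EOF'
"Name","Calls","TotalDurationNs","AverageNs","Percentage","MinNs","MaxNs","StdDev"
"JacobiIterationKernel(int, double, double, double const*, double const*, double*, double*)",1000,517647434,517647.434000,42.56,510404,527365,2907.325924
"NormKernel1(int, double, double, double const*, double*)",1001,412893767,412481.285714,33.95,401603,423364,2862.057355
"LocalLaplacianKernel(int, int, int, double, double, double const*, double*)",1000,269619091,269619.091000,22.17,263842,281762,1763.747497
"HaloLaplacianKernel(int, int, int, double, double, double const*, double const*, double*)",1000,13295466,13295.466000,1.09,12320,15360,316.811361
"NormKernel2(int, double const*, double*)",1001,2869948,2867.080919,0.2360,2720,3840,135.996465
EOF

elapsed_s=1.2545   # total elapsed time reported by the application (seconds)

# Kernel names contain commas, so count fields from the right:
# TotalDurationNs is the 6th field from the end, i.e. $(NF-5).
gpu_ms=$(awk -F',' 'NR > 1 { sum += $(NF-5) } END { printf "%.1f", sum / 1e6 }' "$stats_csv")
gpu_pct=$(awk -F',' -v s="$elapsed_s" \
    'NR > 1 { sum += $(NF-5) } END { printf "%.1f", 100 * sum / (s * 1e9) }' "$stats_csv")

echo "GPU kernel time: ${gpu_ms} ms (${gpu_pct}% of ${elapsed_s} s elapsed)"
rm -f "$stats_csv"
```

For the sample data this gives roughly 1216 ms of kernel time, which is again about 97% of the elapsed time.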
To avoid seeing all the kernel arguments in the hotspot list and make it more readable,
use the --truncate-kernels option to rocprofv3.
srun -N1 -n1 -c7 -A ${PROJ} -t 02:00 rocprofv3 --kernel-trace --stats --truncate-kernels -- ./Jacobi_hip -g 1 1
To truly understand the overhead induced by halo exchanges, we should do this experiment when we run multiple ranks and collect hotspots per process. This exercise will be left to the reader.
In order to get a more holistic view of the application, use rocprof-sys to collect
a trace. Run with one process first, and then progress to running with multiple processes.
Before we start using rocprof-sys tools, it is best to create a runtime config file.
Then, in order to get a less cluttered trace, edit some options to turn off CPU frequency
sampling for all CPU cores. These config options can be edited directly in the file
you create, or via environment variables as shown below:
rocprof-sys-avail -G ${HOME}/.rocprofsys.cfg
export ROCPROFSYS_CONFIG_FILE=${HOME}/.rocprofsys.cfg
export ROCPROFSYS_SAMPLING_CPUS=none
If you know which GPU device this process is going to run on, then you can turn on sampling for the characteristics of that GPU device only. For instance, if we are going to run on GPU 0, then we can do something like the following:
export ROCR_VISIBLE_DEVICES=0
export ROCPROFSYS_SAMPLING_GPUS=0
Now, instrument the code and collect a trace.
rocprof-sys-instrument -o ./Jacobi_hip.inst -- ./Jacobi_hip
srun -N1 -n1 -c7 -A ${PROJ} -t 02:00 rocprof-sys-run -- ./Jacobi_hip.inst -g 1 1
This should output a .proto file. Copy that over to your laptop to view with Perfetto UI,
https://ui.perfetto.dev.
When you run with multiple ranks, it may be best to use the Slurm option --gpu-bind=closest and sample all GPUs because we don't know which GPUs the processes are going to run on.
export ROCPROFSYS_SAMPLING_GPUS=all
srun -N1 -n2 -c7 --gpu-bind=closest -A ${PROJ} -t 02:00 rocprof-sys-run -- ./Jacobi_hip.inst -g 2 1
You will observe that a .proto file is created for each rank. You can simply concatenate
those .proto files to create a merged trace for viewing in Perfetto.
cat rocprofsys-Jacobi_hip.inst-output/<timestamp>/perfetto-trace-*.proto > merged.proto
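If several runs have left more than one timestamped directory behind, a small helper can pick the most recent one before merging. This is only a sketch: it assumes the output layout used in the command above and that the timestamp directory names sort chronologically. The function name merge_latest_traces is made up for illustration.

```shell
#!/bin/sh
# Sketch: merge the per-rank Perfetto traces from the most recent run.
# Assumes timestamped subdirectories whose names sort chronologically.
merge_latest_traces() {
    out_root=$1      # e.g. rocprofsys-Jacobi_hip.inst-output
    merged=$2        # e.g. merged.proto

    # Pick the last directory in sorted order (the newest timestamp).
    latest=$(ls -d "$out_root"/*/ 2>/dev/null | sort | tail -n 1)
    if [ -z "$latest" ]; then
        echo "no run directories found under $out_root" >&2
        return 1
    fi

    # Concatenate every per-rank trace of that run into a single file.
    cat "$latest"perfetto-trace-*.proto > "$merged"
}
```

A call such as `merge_latest_traces rocprofsys-Jacobi_hip.inst-output merged.proto` would then produce the file to load into Perfetto.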
Use rocprof-compute to first get a roofline plot. This roofline plot can help us
understand whether the kernels are memory bound, compute bound or latency bound.
srun -N1 -n1 -c7 -A ${PROJ} -t 02:00 rocprof-compute profile -n roofline --roof-only --device 0 --kernel-names -- ./Jacobi_hip -g 1 1
The above command results in a few PDF files being created in the
workloads/roofline/MI200 directory. The file empirRoof_gpu-0_fp32_fp64.pdf
contains the roofline plot itself, and the file kernelName_legend.pdf contains
the legend for the plot. When you view the roofline
plot, you will observe that all kernels are either memory bound or latency bound.
Next, collect kernel performance metrics. You will notice that this command runs your application multiple times to collect different batches of hardware counters. For this reason, we recommend running with only 1 rank if possible and increasing the time limit for this run.
srun -N1 -n1 -c7 -A ${PROJ} -t 10:00 rocprof-compute profile -n test --no-roof -- ./Jacobi_hip -g 1 1
Next, analyze and look at kernel stats to get an idea of either the kernel ID or
the dispatch ID to analyze further.
All hardware counter values are saved in the workloads directory workloads/test/MI200.
Supply this path to the analyze command as shown below.
srun -N1 -n1 -c7 -A ${PROJ} -t 02:00 rocprof-compute analyze -p workloads/test/MI200 --list-stats >& stats_output.log
You should see something like the following:
--------------------------------------------------------------------------------
Detected Kernels (sorted descending by duration)
╒════╤════════════════════════════════════════════════════════════════════════════════════════════════════════╕
│    │ Kernel_Name                                                                                            │
╞════╪════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│  0 │ JacobiIterationKernel(int, double, double, double const*, double const*, double*, double*) [clone .kd] │
├────┼────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│  1 │ NormKernel1(int, double, double, double const*, double*) [clone .kd]                                   │
├────┼────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│  2 │ LocalLaplacianKernel(int, int, int, double, double, double const*, double*) [clone .kd]                │
├────┼────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│  3 │ HaloLaplacianKernel(int, int, int, double, double, double const*, double const*, double*) [clone .kd]  │
├────┼────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│  4 │ NormKernel2(int, double const*, double*) [clone .kd]                                                   │
├────┼────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│  5 │ __amd_rocclr_fillBufferAligned.kd                                                                      │
╘════╧════════════════════════════════════════════════════════════════════════════════════════════════════════╛
--------------------------------------------------------------------------------
Dispatch list
╒════╤═════════════╤════════════════════════════════════════════════════════════════════════════════════════════════════════╤════════╕
│    │ Dispatch_ID │ Kernel_Name                                                                                            │ GPU_ID │
╞════╪═════════════╪════════════════════════════════════════════════════════════════════════════════════════════════════════╪════════╡
│  0 │           0 │ __amd_rocclr_fillBufferAligned.kd                                                                      │      4 │
├────┼─────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────┤
│  1 │           1 │ NormKernel1(int, double, double, double const*, double*) [clone .kd]                                   │      4 │
├────┼─────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────┤
│  2 │           2 │ NormKernel2(int, double const*, double*) [clone .kd]                                                   │      4 │
├────┼─────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────┤
│  3 │           3 │ LocalLaplacianKernel(int, int, int, double, double, double const*, double*) [clone .kd]                │      4 │
├────┼─────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────┤
│  4 │           4 │ HaloLaplacianKernel(int, int, int, double, double, double const*, double const*, double*) [clone .kd]  │      4 │
├────┼─────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────┤
│  5 │           5 │ JacobiIterationKernel(int, double, double, double const*, double const*, double*, double*) [clone .kd] │      4 │
├────┼─────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────┤
│  6 │           6 │ NormKernel1(int, double, double, double const*, double*) [clone .kd]                                   │      4 │
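Since stats_output.log can be long, the dispatch ID of a kernel can also be looked up with a short filter instead of by eye. The sketch below assumes the log uses the box-drawing character │ as its column separator; the sample rows are abbreviated stand-ins for the real output, and first_dispatch_id is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: print the Dispatch_ID of the first dispatch whose kernel name
# contains a given string. Dispatch rows have four columns, so splitting on
# the separator yields at least six fields; kernel-list rows yield fewer.
first_dispatch_id() {
    pattern=$1
    logfile=$2
    awk -F'│' -v pat="$pattern" '
        NF >= 6 && index($4, pat) > 0 {
            gsub(/ /, "", $3)   # strip the padding around the ID
            print $3
            exit
        }' "$logfile"
}

# Abbreviated sample: one kernel-list row followed by three dispatch rows.
sample_log=$(mktemp)
cat > "$sample_log" <<'EOF'
│  0 │ JacobiIterationKernel(...) [clone .kd] │
│  4 │           4 │ HaloLaplacianKernel(...) [clone .kd]   │      4 │
│  5 │           5 │ JacobiIterationKernel(...) [clone .kd] │      4 │
│  6 │           6 │ NormKernel1(...) [clone .kd]           │      4 │
EOF

dispatch_id=$(first_dispatch_id JacobiIterationKernel "$sample_log")
echo "First JacobiIterationKernel dispatch: ${dispatch_id}"
rm -f "$sample_log"
```

The value found this way can be passed to the -d option of the analyze command.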
We are interested in the JacobiIterationKernel, so let's analyze just dispatch number 5.
srun -N1 -n1 -c7 -A ${PROJ} -t 02:00 rocprof-compute analyze -p workloads/test/MI200 -d 5 >& dispatch5_output.log
This output log file now contains all the metrics that should help you understand this invocation's Speed Of Light (SOL), HBM Read Bandwidth, whether the wavefront launches were limited by any resources such as registers or shared memory, and other things such as your launch parameters and instruction mix in the kernel.
Exploring this output is left as an exercise for the reader. Some snapshots of this output are shown below as a teaser.
0. Top Stats
0.1 Top Kernels
╒════╤══════════════════════════════════════════╤═════════╤═══════════╤════════════╤══════════════╤════════╕
│    │ Kernel_Name                              │   Count │   Sum(ns) │   Mean(ns) │   Median(ns) │    Pct │
╞════╪══════════════════════════════════════════╪═════════╪═══════════╪════════════╪══════════════╪════════╡
│  0 │ JacobiIterationKernel(int, double, doubl │    1.00 │ 503043.00 │  503043.00 │    503043.00 │ 100.00 │
│    │ e, double const*, double const*, double* │         │           │            │              │        │
│    │ , double*) [clone .kd]                   │         │           │            │              │        │
╘════╧══════════════════════════════════════════╧═════════╧═══════════╧════════════╧══════════════╧════════╛
0.2 Dispatch List
╒════╤═══════════════╤══════════════════════════════════════════════════════════════════════════════════╤══════════╕
│    │   Dispatch_ID │ Kernel_Name                                                                      │   GPU_ID │
╞════╪═══════════════╪══════════════════════════════════════════════════════════════════════════════════╪══════════╡
│  0 │             5 │ JacobiIterationKernel(int, double, double, double const*, double const*, double* │        4 │
│    │               │ , double*) [clone .kd]                                                           │          │
╘════╧═══════════════╧══════════════════════════════════════════════════════════════════════════════════╧══════════╛
SOL info:
├─────────────┼───────────────────────────┼─────────┼──────────────────┼──────────┼───────────────┤
│ 2.1.13      │ VALU Active Threads       │ 64.0    │ Threads          │ 64.0     │ 100.0         │
├─────────────┼───────────────────────────┼─────────┼──────────────────┼──────────┼───────────────┤
│ 2.1.14      │ IPC                       │ 0.22    │ Instr/cycle      │ 5.0      │ 4.33          │
├─────────────┼───────────────────────────┼─────────┼──────────────────┼──────────┼───────────────┤
│ 2.1.15      │ Wavefront Occupancy       │ 2102.73 │ Wavefronts       │ 3520.0   │ 59.74         │
├─────────────┼───────────────────────────┼─────────┼──────────────────┼──────────┼───────────────┤
│ 2.1.16      │ Theoretical LDS Bandwidth │ 0.0     │ Gb/s             │ 23936.0  │ 0.0           │
├─────────────┼───────────────────────────┼─────────┼──────────────────┼──────────┼───────────────┤
│ 2.1.17      │ LDS Bank Conflicts/Access │         │ Conflicts/access │ 32.0     │               │
├─────────────┼───────────────────────────┼─────────┼──────────────────┼──────────┼───────────────┤
│ 2.1.18      │ vL1D Cache Hit Rate       │ 50.0    │ Pct              │ 100.0    │ 50.0          │
├─────────────┼───────────────────────────┼─────────┼──────────────────┼──────────┼───────────────┤
│ 2.1.19      │ vL1D Cache BW             │ 2668.12 │ Gb/s             │ 11968.0  │ 22.29         │
├─────────────┼───────────────────────────┼─────────┼──────────────────┼──────────┼───────────────┤
│ 2.1.20      │ L2 Cache Hit Rate         │ 49.03   │ Pct              │ 100.0    │ 49.03         │
├─────────────┼───────────────────────────┼─────────┼──────────────────┼──────────┼───────────────┤
│ 2.1.21      │ L2 Cache BW               │ 2094.37 │ Gb/s             │ 3481.6   │ 60.16         │
├─────────────┼───────────────────────────┼─────────┼──────────────────┼──────────┼───────────────┤
│ 2.1.22      │ L2-Fabric Read BW         │ 800.45  │ Gb/s             │ 1638.4   │ 48.86         │
├─────────────┼───────────────────────────┼─────────┼──────────────────┼──────────┼───────────────┤
│ 2.1.23      │ L2-Fabric Write BW        │ 528.23  │ Gb/s             │ 1638.4   │ 32.24         │
├─────────────┼───────────────────────────┼─────────┼──────────────────┼──────────┼───────────────┤
Occupancy limiters:
├─────────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.4       │ Insufficient SIMD Waveslots            │ 0.00  │ 0.00  │ 0.00  │ Pct    │
├─────────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.5       │ Insufficient SIMD VGPRs                │ 0.00  │ 0.00  │ 0.00  │ Pct    │
├─────────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.6       │ Insufficient SIMD SGPRs                │ 0.00  │ 0.00  │ 0.00  │ Pct    │
├─────────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.7       │ Insufficient CU LDS                    │ 0.00  │ 0.00  │ 0.00  │ Pct    │
├─────────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
Wavefront launch stats:
7.1 Wavefront Launch Stats
╒═════════════╤═════════════════════╤═════════════╤═════════════╤═════════════╤════════════════╕
│ Metric_ID   │ Metric              │ Avg         │ Min         │ Max         │ Unit           │
╞═════════════╪═════════════════════╪═════════════╪═════════════╪═════════════╪════════════════╡
│ 7.1.0       │ Grid Size           │ 16777216.00 │ 16777216.00 │ 16777216.00 │ Work items     │
├─────────────┼─────────────────────┼─────────────┼─────────────┼─────────────┼────────────────┤
│ 7.1.1       │ Workgroup Size      │ 512.00      │ 512.00      │ 512.00      │ Work items     │
├─────────────┼─────────────────────┼─────────────┼─────────────┼─────────────┼────────────────┤
│ 7.1.2       │ Total Wavefronts    │ 0.00        │ 0.00        │ 0.00        │ Wavefronts     │
├─────────────┼─────────────────────┼─────────────┼─────────────┼─────────────┼────────────────┤
│ 7.1.3       │ Saved Wavefronts    │ 0.00        │ 0.00        │ 0.00        │ Wavefronts     │
├─────────────┼─────────────────────┼─────────────┼─────────────┼─────────────┼────────────────┤
│ 7.1.4       │ Restored Wavefronts │ 0.00        │ 0.00        │ 0.00        │ Wavefronts     │
├─────────────┼─────────────────────┼─────────────┼─────────────┼─────────────┼────────────────┤
│ 7.1.5       │ VGPRs               │ 32.00       │ 32.00       │ 32.00       │ Registers      │
├─────────────┼─────────────────────┼─────────────┼─────────────┼─────────────┼────────────────┤
│ 7.1.6       │ AGPRs               │ 0.00        │ 0.00        │ 0.00        │ Registers      │
├─────────────┼─────────────────────┼─────────────┼─────────────┼─────────────┼────────────────┤
│ 7.1.7       │ SGPRs               │ 32.00       │ 32.00       │ 32.00       │ Registers      │
├─────────────┼─────────────────────┼─────────────┼─────────────┼─────────────┼────────────────┤
│ 7.1.8       │ LDS Allocation      │ 0.00        │ 0.00        │ 0.00        │ Bytes          │
├─────────────┼─────────────────────┼─────────────┼─────────────┼─────────────┼────────────────┤
│ 7.1.9       │ Scratch Allocation  │ 0.00        │ 0.00        │ 0.00        │ Bytes/workitem │
╘═════════════╧═════════════════════╧═════════════╧═════════════╧═════════════╧════════════════╛
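The launch parameters in this table can be sanity-checked with a little arithmetic. The sketch below uses only values from the output above: the grid size and workgroup size from table 7.1, and a wavefront size of 64, which is an assumption taken from the peak value of the VALU Active Threads row in the SOL snippet. The variable names are illustrative.

```shell
#!/bin/sh
# Sketch: derive workgroup and wavefront counts from the launch stats.
grid_size=16777216       # 7.1.0 Grid Size (work items) = 4096 x 4096
workgroup_size=512       # 7.1.1 Workgroup Size (work items)
wave_size=64             # assumed wavefront width (VALU Active Threads peak)

workgroups=$((grid_size / workgroup_size))
waves_per_workgroup=$((workgroup_size / wave_size))
total_waves=$((workgroups * waves_per_workgroup))

echo "workgroups:          ${workgroups}"
echo "waves per workgroup: ${waves_per_workgroup}"
echo "total wavefronts:    ${total_waves}"
```

The grid size equals the 4096 x 4096 local domain, which suggests one work item per lattice point. The Total Wavefronts row reads 0.00 in this capture, so the derived wavefront count is an estimate rather than a measured value.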
Wavefront runtime stats:
7.2 Wavefront Runtime Stats
╒═════════════╤════════════════════════════╤═══════════╤═══════════╤═══════════╤═════════════════╕
│ Metric_ID   │ Metric                     │ Avg       │ Min       │ Max       │ Unit            │
╞═════════════╪════════════════════════════╪═══════════╪═══════════╪═══════════╪═════════════════╡
│ 7.2.0       │ Kernel Time (Nanosec)      │ 503043.00 │ 503043.00 │ 503043.00 │ Ns              │
├─────────────┼────────────────────────────┼───────────┼───────────┼───────────┼─────────────────┤
│ 7.2.1       │ Kernel Time (Cycles)       │ 910207.00 │ 910207.00 │ 910207.00 │ Cycle           │
├─────────────┼────────────────────────────┼───────────┼───────────┼───────────┼─────────────────┤
│ 7.2.2       │ Instructions per wavefront │ 73.00     │ 73.00     │ 73.00     │ Instr/wavefront │
├─────────────┼────────────────────────────┼───────────┼───────────┼───────────┼─────────────────┤
│ 7.2.3       │ Wave Cycles                │ 7273.37   │ 7273.37   │ 7273.37   │ Cycles per wave │
├─────────────┼────────────────────────────┼───────────┼───────────┼───────────┼─────────────────┤
│ 7.2.4       │ Dependency Wait Cycles     │ 5526.10   │ 5526.10   │ 5526.10   │ Cycles per wave │
├─────────────┼────────────────────────────┼───────────┼───────────┼───────────┼─────────────────┤
│ 7.2.5       │ Issue Wait Cycles          │ 1925.98   │ 1925.98   │ 1925.98   │ Cycles per wave │
├─────────────┼────────────────────────────┼───────────┼───────────┼───────────┼─────────────────┤
│ 7.2.6       │ Active Cycles              │ 312.00    │ 312.00    │ 312.00    │ Cycles per wave │
├─────────────┼────────────────────────────┼───────────┼───────────┼───────────┼─────────────────┤
│ 7.2.7       │ Wavefront Occupancy        │ 2102.73   │ 2102.73   │ 2102.73   │ Wavefronts      │
╘═════════════╧════════════════════════════╧═══════════╧═══════════╧═══════════╧═════════════════╛
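One way to read table 7.2 is to express the wait and active cycles as fractions of the total cycles per wave. This is a rough sketch using the numbers above. The three categories sum to more than the total wave cycles, so they evidently overlap and the percentages are indicative only. The helper name pct is made up for illustration.

```shell
#!/bin/sh
# Sketch: share of per-wave cycles spent waiting versus actively issuing.
wave_cycles=7273.37        # 7.2.3 Wave Cycles
dependency_wait=5526.10    # 7.2.4 Dependency Wait Cycles
issue_wait=1925.98         # 7.2.5 Issue Wait Cycles
active=312.00              # 7.2.6 Active Cycles

# Percentage of $1 relative to $2, printed with one decimal place.
pct() {
    awk -v part="$1" -v whole="$2" 'BEGIN { printf "%.1f", 100 * part / whole }'
}

dep_pct=$(pct "$dependency_wait" "$wave_cycles")
issue_pct=$(pct "$issue_wait" "$wave_cycles")
active_pct=$(pct "$active" "$wave_cycles")

echo "dependency wait: ${dep_pct}% of wave cycles"
echo "issue wait:      ${issue_pct}% of wave cycles"
echo "active:          ${active_pct}% of wave cycles"
```

For this dispatch, each wave spends roughly three quarters of its cycles waiting on dependencies and only a few percent actively executing, which fits a kernel that is not limited by compute.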
Instruction mix in kernel:
├─────────────┼──────────┼───────┼───────┼───────┼────────────────┤
│ 10.1.0      │ VALU     │ 54.00 │ 54.00 │ 54.00 │ Instr per wave │
├─────────────┼──────────┼───────┼───────┼───────┼────────────────┤
│ 10.1.1      │ VMEM     │ 5.00  │ 5.00  │ 5.00  │ Instr per wave │
├─────────────┼──────────┼───────┼───────┼───────┼────────────────┤
│ 10.1.2      │ LDS      │ 0.00  │ 0.00  │ 0.00  │ Instr per wave │
├─────────────┼──────────┼───────┼───────┼───────┼────────────────┤
│ 10.1.3      │ MFMA     │ 0.00  │ 0.00  │ 0.00  │ Instr per wave │
├─────────────┼──────────┼───────┼───────┼───────┼────────────────┤
│ 10.1.4      │ SALU     │ 4.00  │ 4.00  │ 4.00  │ Instr per wave │
├─────────────┼──────────┼───────┼───────┼───────┼────────────────┤
│ 10.1.5      │ SMEM     │ 4.00  │ 4.00  │ 4.00  │ Instr per wave │
├─────────────┼──────────┼───────┼───────┼───────┼────────────────┤
VALU FLOPs:
11.3 Arithmetic Operations
╒═════════════╤═══════════════╤═════════╤═════════╤═════════╤══════════════╕
│ Metric_ID   │ Metric        │ Avg     │ Min     │ Max     │ Unit         │
╞═════════════╪═══════════════╪═════════╪═════════╪═════════╪══════════════╡
│ 11.3.0      │ FLOPs (Total) │ 3008.00 │ 3008.00 │ 3008.00 │ Ops per wave │
├─────────────┼───────────────┼─────────┼─────────┼─────────┼──────────────┤
│ 11.3.1      │ IOPs (Total)  │ 704.00  │ 704.00  │ 704.00  │ Ops per wave │
├─────────────┼───────────────┼─────────┼─────────┼─────────┼──────────────┤
│ 11.3.2      │ F16 OPs       │ 0.00    │ 0.00    │ 0.00    │ Ops per wave │
├─────────────┼───────────────┼─────────┼─────────┼─────────┼──────────────┤
│ 11.3.3      │ BF16 OPs      │ 0.00    │ 0.00    │ 0.00    │ Ops per wave │
├─────────────┼───────────────┼─────────┼─────────┼─────────┼──────────────┤
│ 11.3.4      │ F32 OPs       │ 0.00    │ 0.00    │ 0.00    │ Ops per wave │
├─────────────┼───────────────┼─────────┼─────────┼─────────┼──────────────┤
│ 11.3.5      │ F64 OPs       │ 3008.00 │ 3008.00 │ 3008.00 │ Ops per wave │
├─────────────┼───────────────┼─────────┼─────────┼─────────┼──────────────┤
│ 11.3.6      │ INT8 OPs      │ 0.00    │ 0.00    │ 0.00    │ Ops per wave │
╘═════════════╧═══════════════╧═════════╧═════════╧═════════╧══════════════╛
Traffic to HBM:
17.2 L2 - Fabric Transactions
╒═════════════╤═══════════════════════════════════╤═════════╤═════════╤═════════╤════════════════╕
│ Metric_ID   │ Metric                            │ Avg     │ Min     │ Max     │ Unit           │
╞═════════════╪═══════════════════════════════════╪═════════╪═════════╪═════════╪════════════════╡
│ 17.2.0      │ Read BW                           │ 1536.03 │ 1536.03 │ 1536.03 │ Bytes per wave │
├─────────────┼───────────────────────────────────┼─────────┼─────────┼─────────┼────────────────┤
│ 17.2.1      │ HBM Read Traffic                  │ 100.0   │ 100.0   │ 100.0   │ Pct            │
├─────────────┼───────────────────────────────────┼─────────┼─────────┼─────────┼────────────────┤
│ 17.2.2      │ Remote Read Traffic               │ 0.0     │ 0.0     │ 0.0     │ Pct            │
├─────────────┼───────────────────────────────────┼─────────┼─────────┼─────────┼────────────────┤
│ 17.2.3      │ Uncached Read Traffic             │ 0.0     │ 0.0     │ 0.0     │ Pct            │
├─────────────┼───────────────────────────────────┼─────────┼─────────┼─────────┼────────────────┤
│ 17.2.4      │ Write and Atomic BW               │ 1013.66 │ 1013.66 │ 1013.66 │ Bytes per wave │
├─────────────┼───────────────────────────────────┼─────────┼─────────┼─────────┼────────────────┤
│ 17.2.5      │ HBM Write and Atomic Traffic      │ 100.0   │ 100.0   │ 100.0   │ Pct            │
├─────────────┼───────────────────────────────────┼─────────┼─────────┼─────────┼────────────────┤
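The per-wave figures from tables 11.3 and 17.2 are enough for a back-of-the-envelope arithmetic intensity, the quantity on the horizontal axis of a roofline plot. The following is a minimal sketch, assuming that all L2-Fabric traffic goes to HBM (as the 100% HBM read and write rows indicate) and that reads and writes can simply be added together.

```shell
#!/bin/sh
# Sketch: FLOPs per byte of HBM traffic for the analyzed dispatch.
flops_per_wave=3008.00         # 11.3.0 FLOPs (Total), ops per wave
read_bytes_per_wave=1536.03    # 17.2.0 Read BW, bytes per wave
write_bytes_per_wave=1013.66   # 17.2.4 Write and Atomic BW, bytes per wave

intensity=$(awk -v f="$flops_per_wave" -v r="$read_bytes_per_wave" \
    -v w="$write_bytes_per_wave" 'BEGIN { printf "%.2f", f / (r + w) }')

echo "Arithmetic intensity: ${intensity} FLOPs/byte"
```

At a little over one FLOP per byte moved, the kernel does very little arithmetic per byte of memory traffic, which is consistent with the memory-bound behavior seen in the roofline plot.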
Happy optimizing!