This repository was archived by the owner on Jan 12, 2026. It is now read-only.
Merged
3 changes: 3 additions & 0 deletions docs/sources/ext_links.txt
@@ -14,3 +14,6 @@
.. _Data Parallel Extension for Numpy*: https://intelpython.github.io/dpnp/
.. _IEEE 754-2019 Standard for Floating-Point Arithmetic: https://standards.ieee.org/ieee/754/6210/
.. _David Goldberg, What every computer scientist should know about floating-point arithmetic: https://www.itu.dk/~sestoft/bachelor/IEEE754_article.pdf
.. _Intel oneAPI Base Toolkit: https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html
.. _Intel VTune Profiler: https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html
.. _Intel Advisor: https://www.intel.com/content/www/us/en/developer/tools/oneapi/advisor.html
1 change: 0 additions & 1 deletion docs/sources/index.rst
@@ -31,4 +31,3 @@ Table of Contents
programming_dpep
examples
useful_links

102 changes: 99 additions & 3 deletions docs/sources/programming_dpep.rst
@@ -126,9 +126,105 @@ there are some situations when you will need to use dpctl advanced capabilities:

Debugging and profiling Data Parallel Extensions for Python
***********************************************************
`Intel oneAPI Base Toolkit`_ provides two tools that help programmers analyze performance issues in programs
that use **Data Parallel Extensions for Python**: `Intel VTune Profiler`_ and `Intel Advisor`_.

Intel VTune Profiler examines various performance aspects of a program, such as its most time-consuming parts,
the efficiency of offloaded code, and the impact of the memory subsystem.

Intel Advisor provides insight into the performance of offloaded code relative to the peak compute performance
and memory bandwidth of the hardware.

Next, we detail the steps involved in using Intel VTune Profiler and Intel Advisor with
heterogeneous programs that use **Data Parallel Extensions for Python**.

Profiling with Intel VTune Profiler
-----------------------------------

.. |copy| unicode:: U+000A9

.. |trade| unicode:: U+2122

Intel |copy| VTune |trade| Profiler provides two mechanisms, called *GPU offload* and *GPU hotspots*, to profile heterogeneous programs
targeted to GPUs.

The *GPU offload* analysis profiles the entire application (both GPU and host code) and helps to identify
if the application is CPU or GPU bound. It provides information on the proportion of the execution time spent
in GPU execution. It also provides information about various hotspots in the program. The key goal of the *GPU offload*
analysis is to identify the parts of the program that can benefit from offloading to GPUs.

The *GPU hotspots* analysis focuses on the performance of GPU-offloaded code.
It provides insights into the parallelism in the GPU kernel, the efficiency of the kernel, SIMD utilization,
and memory latency. It also provides performance data about synchronization operations such as GPU barriers and
atomic operations.

The following commands run the two Intel VTune Profiler analyses on programs written
using **Data Parallel Extensions for Python**.

.. code-block:: console
:caption: **GPU Offload**

> vtune -collect gpu-offload -r <output_dir> -- python <script>.py <args>

.. code-block:: console
:caption: **GPU Hotspots**

> vtune -collect gpu-hotspots -r <output_dir> -- python <script>.py <args>
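Both analyses run on an unmodified Python script. As a concrete target, the following is a hypothetical
example script (all names are illustrative). It is shown with NumPy, whose API
**Data Parallel Extensions for Python** mirrors; in a real offload scenario you would replace the import
with ``import dpnp as np`` so the array operations execute on a SYCL device.

.. code-block:: python
   :caption: **Example script to profile (hypothetical)**

   # Illustrative workload; swap the import for "import dpnp as np" to
   # execute the array operations on a SYCL device.
   import numpy as np

   def pairwise_distance(x):
       """Matrix of pairwise Euclidean distances between the rows of x."""
       diff = x[:, None, :] - x[None, :, :]
       return np.sqrt((diff * diff).sum(axis=-1))

   if __name__ == "__main__":
       rng = np.random.default_rng(0)
       points = rng.random((512, 3))
       d = pairwise_distance(points)
       print(d.shape)

Passing such a script as ``<script>.py`` to either command above produces a result directory that can then
be opened for analysis.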

Intel VTune Profiler performs dynamic binary analysis on a given program to obtain insights into various
performance characteristics. It runs on unmodified binaries, with no extra requirements for program compilation.
After collecting the data using the above commands, the Intel VTune Profiler GUI can be used to view the various
performance characteristics. In addition to the GUI, it provides mechanisms to generate reports through
the command line and to set up a web server for post-processing the data.
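For example, a text summary of a collected result can be printed from the command line, where the result
directory is the one passed to ``-r`` during collection:

.. code-block:: console
   :caption: **Summary Report**

   > vtune -report summary -r <output_dir>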

Further details on viewing Intel VTune Profiler output along with other use-cases can be found in the
`Intel VTune Profiler User Guide <https://www.intel.com/content/www/us/en/develop/documentation/vtune-help/top.html>`_.

Profiling with Intel Advisor
----------------------------

The primary goal of Intel |copy| Advisor is to help programmers make targeted optimizations by identifying
appropriate kernels and characterizing the performance-limiting factors. Intel Advisor provides mechanisms
to analyze the performance of GPU kernels against the hardware roof-line performance. It reports the maximum
achievable performance under the given hardware conditions and helps identify the best
kernels for optimization. Further, it helps the programmer characterize whether a GPU kernel is bound by
compute capacity or by memory bandwidth.

The following instructions are used to generate GPU roof-line performance graphs using Intel Advisor.

.. code-block:: console
:caption: **Collect Roofline**

> advisor --collect=roofline --profile-gpu --project-dir=<output_dir> --search-dir src:r=<search_dir> -- <executable> <args>

This command collects the GPU roof-line data from executing the application written using
**Data Parallel Extensions for Python**.

The next command generates the roof-line graph as an HTML file in the output directory.

.. code-block:: console
:caption: **Generate Roofline HTML-File**

> advisor --report=roofline --gpu --project-dir=<output_dir> --report-output=<output_dir>/roofline_gpu.html

.. todo::
   Insert high-resolution image illustrating Advisor html report

The above figure shows an example roof-line graph generated using Intel Advisor.
The X-axis in the graph represents arithmetic intensity and the Y-axis represents performance in GFLOPS.
The horizontal lines parallel to the X-axis represent the peak compute capacity of the given hardware.
The diagonal lines represent the peak memory bandwidth of different layers of the memory hierarchy.
The red dot corresponds to the executed GPU kernel. The graph shows the performance of the kernel relative
to the peak compute capacity and memory bandwidth. It also shows whether the GPU kernel is memory or compute
bound, depending on which roof-line limits the kernel.
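The relationship the graph visualizes can also be stated numerically: attainable performance is the minimum
of the peak compute capacity and the product of arithmetic intensity and memory bandwidth. The following is a
minimal sketch of this model; the hardware numbers are hypothetical and chosen only for illustration.

.. code-block:: python
   :caption: **Roof-line model sketch (hypothetical numbers)**

   # Roof-line model: attainable GFLOP/s = min(peak compute, intensity * bandwidth).
   # The hardware numbers below are hypothetical, not measurements of any device.
   PEAK_GFLOPS = 1000.0   # peak compute capacity, GFLOP/s
   PEAK_BW_GBS = 100.0    # peak memory bandwidth, GB/s

   def attainable_gflops(intensity):
       """Attainable performance for a kernel with the given arithmetic
       intensity (FLOP per byte of memory traffic)."""
       return min(PEAK_GFLOPS, intensity * PEAK_BW_GBS)

   def is_memory_bound(intensity):
       """True when the kernel sits on the bandwidth roof, i.e. its
       intensity is below the ridge point PEAK_GFLOPS / PEAK_BW_GBS."""
       return intensity < PEAK_GFLOPS / PEAK_BW_GBS

   print(attainable_gflops(2.0))   # 200.0 -> limited by bandwidth
   print(is_memory_bound(2.0))     # True  -> ridge point is 10 FLOP/byte

A kernel whose dot lies on a diagonal bandwidth roof is memory bound; one under the horizontal compute roof is
compute bound.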

For further details on Intel Advisor and its extended capabilities, refer to the
`Intel Advisor User Guide <https://www.intel.com/content/www/us/en/develop/documentation/advisor-user-guide/top.html>`_.


.. todo::
   Document debugging section

Writing robust numerical codes for heterogeneous computing
**********************************************************
@@ -185,8 +281,8 @@
arithmetic, please refer to `IEEE 754-2019 Standard for Floating-Point Arithmetic`_ and
`David Goldberg, What every computer scientist should know about floating-point arithmetic`_.


Switching between single and double precision
---------------------------------------------

1. Implement your code to switch easily between single and double precision in a controlled fashion.
For example, implement a utility function or introduce a constant that selects ``dtype`` for
7 changes: 5 additions & 2 deletions docs/sources/useful_links.rst
@@ -17,6 +17,11 @@ Useful links
* - `Data Parallel Control`_
- Documentation on how to manage data and devices, how to interchange data between different tensor implementations,
and how to write data parallel extensions
* - `Intel VTune Profiler`_
- Performance profiler supporting analysis of bottlenecks from the function level down to low-level instructions.
Supports Python and Numba
* - `Intel Advisor`_
- Analyzes native and Python codes and provides advice for better composition of heterogeneous algorithms
* - `Python* Array API Standard`_
- Standard for writing portable Numpy-like codes targeting different hardware vendors and frameworks
operating with tensor data
@@ -26,8 +31,6 @@
- Free e-book how to program data parallel devices using Data Parallel C++
* - `OpenCl*`_
- OpenCl* Standard for heterogeneous programming
* - `Data Parallel Extension for Numpy*`_
- Documentation for programming NumPy-like codes on data parallel devices
* - `IEEE 754-2019 Standard for Floating-Point Arithmetic`_
- Standard for floating-point arithmetic, essential for writing robust numerical codes
* - `David Goldberg, What every computer scientist should know about floating-point arithmetic`_
Expand Down