IntelPython · samaid · Nov 18, 2022 · Nov 18, 2022
diff --git a/docs/sources/examples.rst b/docs/sources/examples.rst
@@ -3,3 +3,33 @@
 
 List of examples
 ================
+
+.. literalinclude:: ../../examples/01-hello_dpnp.py
+   :language: python
+   :lines: 27-
+   :caption: Your first NumPy code running on GPU
+   :name: examples_01_hello_dpnp
+
+.. literalinclude:: ../../examples/02-dpnp_device.py
+   :language: python
+   :lines: 27-
+   :caption: Select device type while creating array
+   :name: examples_02_dpnp_device
+
+.. literalinclude:: ../../examples/03-dpnp2numba-dpex.py
+   :language: python
+   :lines: 27-
+   :caption: Compile dpnp code with numba-dpex
+   :name: examples_03_dpnp2numba_dpex
+
+Benchmarks
+**********
+
+.. todo::
+   Provide instructions for dpbench
+
+Jupyter* Notebooks
+******************
+
+.. todo::
+   Provide instructions for Jupyter Notebook samples illustrating Data Parallel Extensions for Python
diff --git a/docs/sources/ext_links.txt b/docs/sources/ext_links.txt
@@ -12,3 +12,5 @@
 .. _SYCL*: https://www.khronos.org/sycl/
 .. _Data Parallel Control: https://intelpython.github.io/dpctl/latest/index.html
 .. _Data Parallel Extension for Numpy*: https://intelpython.github.io/dpnp/
+.. _IEEE 754-2019 Standard for Floating-Point Arithmetic: https://standards.ieee.org/ieee/754/6210/
+.. _David Goldberg, What every computer scientist should know about floating-point arithmetic: https://www.itu.dk/~sestoft/bachelor/IEEE754_article.pdf
diff --git a/docs/sources/programming_dpep.rst b/docs/sources/programming_dpep.rst
@@ -71,3 +71,128 @@ It takes just a few lines to modify your CPU `Numba*`_ script to run on GPU.
    :caption: Compile dpnp code with numba-dpex
    :name: ex_03_dpnp2numba_dpex
 
+In this example we implement a custom function ``sum_it()`` that takes an array as input. We compile it with
+`Data Parallel Extension for Numba*`_. Being a just-in-time compiler, Numba derives the queue from the input argument ``x``,
+which is associated with the default device (``"gpu"`` on systems with an integrated or discrete GPU), and
+dynamically compiles the kernel submitted to that queue. The result resides as a 0-dimensional array on the device
+associated with the queue, and on exit from the offload kernel it is assigned to the tensor ``y``.
+
+The ``parallel=True`` setting in ``@njit`` is essential to enable generation of data parallel kernels.
+Please also note that we use ``fastmath=True`` in the ``@njit`` decorator. This setting tells the compiler
+that you accept NOT preserving the order of floating-point operations,
+which enables generation of instructions (such as SIMD) for greater performance.
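
Why the order of floating-point operations matters can be seen without any GPU. The following plain Python sketch (it does not use ``numba-dpex``; the values are chosen only for illustration) sums the same three numbers in two different orders, mimicking the kind of reordering a compiler may apply when the original order does not have to be preserved:

```python
# Floating-point addition is not associative, so the order of a reduction matters.
values = [1e16, 1.0, -1e16]

# Left-to-right order: 1e16 + 1.0 rounds back to 1e16, so the 1.0 is lost.
sequential = (values[0] + values[1]) + values[2]

# A reordered reduction cancels the two large terms first and keeps the 1.0.
reordered = (values[0] + values[2]) + values[1]

print(sequential)  # 0.0
print(reordered)   # 1.0
```

Both results are valid roundings of some evaluation order, so relaxing the ordering trades bit-for-bit reproducibility for speed.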
+
+Data Parallel Control - dpctl
+*****************************
+
+Both ``dpnp`` and ``numba-dpex`` provide enough API versatility for programming data parallel devices, but
+there are some situations when you will need the advanced capabilities of ``dpctl``:
+
+1. **Advanced device management.** Both ``dpnp`` and ``numba-dpex`` support Numpy array creation routines
+   with additional parameters that specify the device on which the data is allocated and the type of memory to be used
+   (``"device"``, ``"host"``, or ``"shared"``). However, if you need some more advanced device and data management
+   capabilities you will also need to import ``dpctl`` in addition to ``dpnp`` and/or ``numba-dpex``.
+
+   One frequent use of ``dpctl`` is to query the list of devices present on the system and the available driver
+   backends (such as ``"opencl"``, ``"level_zero"``, ``"cuda"``, etc.).
+
+   Another frequent use is the creation of additional queues for the purpose of profiling or choosing an out-of-order
+   execution of offload kernels.
+
+2. **Cross-platform development using Python Array API standard.** If you’re a Python developer
+   writing Numpy-like codes and targeting different hardware vendors and different tensor implementations,
+   then the `Python* Array API Standard`_ is a good choice for writing portable Numpy-like code.
+   ``dpctl.tensor`` implements the `Python* Array API Standard`_ for `SYCL*`_ devices. Accompanied by the
+   respective SYCL device drivers from different vendors, ``dpctl.tensor`` becomes a portable solution
+   for writing numerical codes for any SYCL device.
+
+   For example, some Python communities, such as the
+   `Scikit-Learn* community <https://github.com/scikit-learn/scikit-learn/issues/22352>`_, are already establishing
+   a path for having algorithms (re-)implemented using the `Python* Array API Standard`_.
+   This is a reliable path for extending their capabilities beyond CPU-only execution or a single GPU vendor.
+
+3. **Zero-copy data exchange between tensor implementations.** Certain Python projects may have their own tensor
+   implementations that do not rely on ``dpctl.tensor`` or ``dpnp.ndarray`` tensors. Can users still exchange data
+   between these tensors without copying it back and forth through the host?
+   The `Python* Array API Standard`_ specifies a protocol for zero-copy data exchange
+   between tensors through ``dlpack``. As an implementation of the `Python* Array API Standard`_,
+   ``dpctl`` provides the ``dpctl.tensor.from_dlpack()`` function, which returns a zero-copy view of another tensor.
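
As a minimal sketch of the device query mentioned in the first item, the snippet below lists the devices that ``dpctl`` can see. The helper name ``list_devices`` is hypothetical, and the import is guarded as an assumption so that the sketch also runs (returning an empty list) on a machine where ``dpctl`` is not installed:

```python
try:
    import dpctl  # optional here; without it the sketch reports no devices
except ImportError:
    dpctl = None


def list_devices():
    """Return (backend, device type, name) for each SYCL device dpctl can see."""
    if dpctl is None:
        return []
    return [
        (str(device.backend), str(device.device_type), device.name)
        for device in dpctl.get_devices()
    ]


for backend, device_type, name in list_devices():
    print(backend, device_type, name)
```

The output depends entirely on the drivers installed on the system, so no particular listing should be expected.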
+
+
+Debugging and profiling Data Parallel Extensions for Python
+***********************************************************
+
+.. todo::
+   Document debugging and profiling section
+
+Writing robust numerical codes for heterogeneous computing
+**********************************************************
+
+The default primitive type (``dtype``) in `Numpy*`_ is double precision (``float64``), which is supported by
+the majority of modern CPUs. When it comes to programming GPUs, and especially specialized accelerators,
+the set of supported primitive data types may be limited. For example, certain GPUs may not support
+double precision or half precision. **Data Parallel Extensions for Python** select the default ``dtype`` depending on
+the device’s default type in accordance with the Python Array API Standard. It can be either ``float64`` or ``float32``.
+This means that, unlike traditional `Numpy*`_ programming on a CPU, heterogeneous computing requires
+careful management of hardware peculiarities to keep the Python script portable and robust on any device.
+
+The following hints help make numerical code portable and robust.
+
+Sensitivity to floating-point errors
+------------------------------------
+
+Floating-point arithmetic has finite precision, which implies that only a tiny fraction of real numbers can be
+represented exactly. Almost every floating-point operation
+induces a rounding error because its result cannot be represented exactly as a floating-point number.
+The `IEEE 754-2019 Standard for Floating-Point Arithmetic`_ sets the upper bound for rounding errors in each
+arithmetic operation to 0.5 *ulp* (unit in the last place), meaning that each arithmetic operation must be
+accurate to the last bit of the floating-point mantissa, which is on the order of :math:`10^{-16}` in double
+precision and :math:`10^{-7}` in single precision.
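
The 0.5 *ulp* bound can be checked for a single addition in plain Python. This is only an illustrative sketch: it assumes Python 3.9 or newer for ``math.ulp`` and uses exact rational arithmetic from the standard ``fractions`` module to measure the rounding error of ``0.1 + 0.2``:

```python
import math
import sys
from fractions import Fraction

a, b = 0.1, 0.2
computed = a + b

# Fraction converts a double exactly, so this is the true sum of the stored operands.
exact = Fraction(a) + Fraction(b)
rounding_error = abs(Fraction(computed) - exact)
half_ulp = Fraction(math.ulp(computed)) / 2

print(computed == 0.3)             # False: the sum is not the double closest to 0.3
print(rounding_error <= half_ulp)  # True: the error stays within 0.5 ulp
print(sys.float_info.epsilon)      # spacing of doubles just above 1.0 (2**-52)
print(2.0 ** -23)                  # the same spacing in single precision
```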
+
+In robust numerical codes these errors tend to accumulate slowly so that single precision is enough to
+calculate the result accurate to 3-5 decimal digits.
+
+However, there is a situation known as *catastrophic cancellation*, when small accumulated errors
+result in a significant (or even a complete) loss of accuracy. Catastrophic cancellation happens
+when two close floating-point numbers with small rounding errors are subtracted. As a result, the relative error
+of the difference grows with the number of identical leading digits that cancel:
+
+.. image:: ./_images/fp-cancellation.png
+    :scale: 50%
+    :align: center
+    :alt: Floating-Point Cancellation
+
+In the above example, the green digits are accurate, while a few trailing digits in red are inaccurate due to
+induced errors. As a result of the subtraction, only one accurate digit remains.
+
+Situations with catastrophic cancellation must be handled carefully. An example where catastrophic
+cancellation happens naturally is numerical differentiation, where two close numbers are subtracted
+to approximate the derivative:
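
The same effect can be reproduced in a few lines of plain Python. In this sketch (the value of ``x`` is chosen only for illustration) a small number is added to 1.0 and then subtracted again; the sum is accurate to about 16 digits, yet the difference is not:

```python
x = 1e-15
shifted = 1.0 + x          # rounded to the nearest double near 1.0
recovered = shifted - 1.0  # the subtraction itself introduces no new error

relative_error = abs(recovered - x) / x
print(recovered)
print(relative_error)
```

The rounding error committed in the addition is tiny relative to ``shifted``, but it amounts to roughly 11% of ``recovered``, because the leading digits of the two operands cancel.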
+
+.. math::
+
+   df/dx \approx \frac{f(x+\delta) - f(x-\delta)}{2\delta}
+
+The smaller :math:`\delta` is, the more severe the catastrophic cancellation. At the same time, a bigger :math:`\delta`
+results in a bigger approximation error. Books on numerical computing and floating-point arithmetic discuss
+a variety of techniques to keep catastrophic cancellation under control. For more details about floating-point
+arithmetic please refer to `IEEE 754-2019 Standard for Floating-Point Arithmetic`_ and the article by
+`David Goldberg, What every computer scientist should know about floating-point arithmetic`_.
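
The trade-off in the choice of :math:`\delta` can be observed with a short experiment. The sketch below is an illustration only: it applies the central difference formula to ``sin`` at ``x = 1`` (an arbitrary choice) in double precision and compares the result with the exact derivative ``cos(1)``:

```python
import math


def central_difference(f, x, delta):
    """Approximate f'(x) as (f(x + delta) - f(x - delta)) / (2 * delta)."""
    return (f(x + delta) - f(x - delta)) / (2.0 * delta)


x = 1.0
exact = math.cos(x)  # derivative of sin at x

for exponent in range(1, 16, 2):
    delta = 10.0 ** -exponent
    error = abs(central_difference(math.sin, x, delta) - exact)
    print(f"delta=1e-{exponent:02d}  abs error={error:.3e}")
```

The error first shrinks as :math:`\delta` decreases, because the approximation error dominates, and then grows again for very small :math:`\delta`, where cancellation dominates.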
+
+
+Switch between single and double precision
+------------------------------------------
+
+1. Implement your code to switch easily between single and double precision in a controlled fashion.
+   For example, implement a utility function or introduce a constant that selects ``dtype`` for
+   the rest of the `Numpy*`_ code.
+
+2. Run your code on a representative set of inputs in both single and double precision.
+   Observe how sensitive the computed results are to switching between single and double precision.
+   If results remain identical to 3-5 digits for different inputs, it is a good sign that your code
+   is not sensitive to floating-point errors.
+
+3. Write your code with catastrophic cancellations in mind. These blocks of code will require special
+   care such as the use of extended precision or other techniques to control cancellations.
+   It is likely that this part of the code will require a hardware-specific implementation.
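
The steps above can be sketched in plain Python. Everything here is an assumption made for illustration: the names ``ROUNDERS``, ``naive_variance`` and ``agree`` are hypothetical, and single precision is emulated by rounding every intermediate result through the standard ``struct`` module rather than by using a real ``float32`` array type:

```python
import struct


def to_float32(value):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]


# One switch that selects the working precision for the rest of the code.
ROUNDERS = {"float32": to_float32, "float64": float}


def naive_variance(values, dtype):
    """Population variance computed as E[x^2] - E[x]^2 (prone to cancellation)."""
    rnd = ROUNDERS[dtype]
    n = rnd(float(len(values)))
    total = rnd(0.0)
    total_sq = rnd(0.0)
    for v in values:
        v = rnd(v)
        total = rnd(total + v)
        total_sq = rnd(total_sq + rnd(v * v))
    mean = rnd(total / n)
    return rnd(rnd(total_sq / n) - rnd(mean * mean))


def agree(a, b, digits=3):
    """True if a and b match to roughly the given number of significant digits."""
    return abs(a - b) <= 10.0 ** (-digits) * max(abs(a), abs(b))


centered = [0.0, 1.0, 2.0]
shifted = [10000.0, 10001.0, 10002.0]
for name, data in (("centered", centered), ("shifted", shifted)):
    v32 = naive_variance(data, "float32")
    v64 = naive_variance(data, "float64")
    print(name, v32, v64, agree(v32, v64))
```

Both inputs have the same variance, but only the centered data gives matching results in the two precisions. The shifted data triggers cancellation in single precision, which is the kind of sensitivity the comparison in step 2 is meant to reveal.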
+
diff --git a/docs/sources/useful_links.rst b/docs/sources/useful_links.rst
@@ -3,3 +3,43 @@
 
 Useful links
 ============
+
+.. list-table:: **Companion documentation**
+   :widths: 70 200
+   :header-rows: 1
+
+   * - Document
+     - Description
+   * - `Data Parallel Extension for Numpy*`_
+     - Documentation for programming NumPy-like codes on data parallel devices
+   * - `Data Parallel Extension for Numba*`_
+     - Documentation for programming Numba codes on data parallel devices the same way as you program Numba on CPU
+   * - `Data Parallel Control`_
+     - Documentation on how to manage data and devices, how to interchange data between different tensor implementations,
+       and how to write data parallel extensions
+   * - `Python* Array API Standard`_
+     - Standard for writing portable Numpy-like codes targeting different hardware vendors and frameworks
+       operating with tensor data
+   * - `SYCL*`_
+     - Standard for writing C++-like codes for heterogeneous computing
+   * - `DPC++`_
+     - Free e-book on how to program data parallel devices using Data Parallel C++
+   * - `OpenCl*`_
+     - OpenCL* standard for heterogeneous programming
+   * - `IEEE 754-2019 Standard for Floating-Point Arithmetic`_
+     - Standard for floating-point arithmetic, essential for writing robust numerical codes
+   * - `David Goldberg, What every computer scientist should know about floating-point arithmetic`_
+     - Scientific paper important for understanding how to write robust numerical code
+   * - `Numpy*`_
+     - Documentation for Numpy - foundational CPU library for array programming. Used in conjunction with
+       `Data Parallel Extension for Numpy*`_.
+   * - `Numba*`_
+     - Documentation for Numba - Just-In-Time compiler for Numpy-like codes. Used in conjunction with
+       `Data Parallel Extension for Numba*`_.
+
+
+To-Do
+=====
+.. todolist::
diff --git a/examples/03-dpnp2numba-dpex.py b/examples/03-dpnp2numba-dpex.py
@@ -25,17 +25,20 @@
 # *****************************************************************************
 
 import dpnp as np
-from numba_dpex import njit
-
+from numba_dpex import njit
 
 @njit(parallel=True, fastmath=True)
-def sum(x):
+def sum_it(x):
     return np.sum(x)
 
 
-x = np.empty(3)
+x = None
 try:
     x = np.asarray([1, 2, 3], device="gpu")
 except:
     print("GPU device is not available")
 
+y = sum_it(x)
+
+print(y.shape)  # Must be 0-dimensional array
+print(y)  # Expect 6