ARROW-3233: [Python] Add prose documentation for CUDA support

pitrou · pitrou · commit bcfacaafcb18 · 2019-01-09T23:12:31.000+01:00
It will be harder to add generated API docs without requiring CUDA support on the machine building the docs. Author: Antoine Pitrou <antoine@python.org> Closes apache#3359 from pitrou/ARROW-3233-pyarrow-cuda-doc and squashes the following commits: 40b63f0 <Antoine Pitrou> ARROW-3233: Add prose documentation for CUDA support
diff --git a/docs/source/python/cuda.rst b/docs/source/python/cuda.rst
@@ -0,0 +1,159 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. currentmodule:: pyarrow.cuda
+
+CUDA Integration
+================
+
+Arrow is not limited to CPU buffers (located in the computer's main memory,
+also named "host memory").  It also has provisions for accessing buffers
+located on a CUDA-capable GPU device (in "device memory").
+
+.. note::
+   This functionality is optional and must have been enabled at build time.
+   If this is not done by your package manager, you might have to build Arrow
+   yourself.
+
+CUDA Contexts
+-------------
+
+A CUDA context represents access to a particular CUDA-capable device.
+For example, this is creating a CUDA context accessing CUDA device number 0::
+
+   >>> from pyarrow import cuda
+   >>> ctx = cuda.Context(0)
+   >>>
+
+CUDA Buffers
+------------
+
+A CUDA buffer can be created by copying data from host memory to the memory
+of a CUDA device, using the :meth:`Context.buffer_from_data` method.
+The source data can be any Python buffer-like object, including Arrow buffers::
+
+   >>> import numpy as np
+   >>> arr = np.arange(4, dtype=np.int32)
+   >>> arr.nbytes
+   16
+   >>> cuda_buf = ctx.buffer_from_data(arr)
+   >>> type(cuda_buf)
+   pyarrow._cuda.CudaBuffer
+   >>> cuda_buf.size     # The buffer's size in bytes
+   16
+   >>> cuda_buf.address  # The buffer's address in device memory
+   30088364544
+   >>> cuda_buf.context.device_number
+   0
+
+Conversely, you can copy back a CUDA buffer to device memory, getting a regular
+CPU buffer::
+
+   >>> buf = cuda_buf.copy_to_host()
+   >>> type(buf)
+   pyarrow.lib.Buffer
+   >>> np.frombuffer(buf, dtype=np.int32)
+   array([0, 1, 2, 3], dtype=int32)
+
+.. warning::
+   Many Arrow functions expect a CPU buffer but will not check the buffer's
+   actual type.  You will get a crash if you pass a CUDA buffer to such a
+   function::
+
+      >>> pa.py_buffer(b"x" * 16).equals(cuda_buf)
+      Segmentation fault
+
+Numba Integration
+-----------------
+
+There is not much you can do directly with Arrow CUDA buffers from Python,
+but they support interoperation with `Numba <https://numba.pydata.org/>`_,
+a JIT compiler which can turn Python code into optimized CUDA kernels.
+
+Arrow to Numba
+~~~~~~~~~~~~~~
+
+First let's define a Numba CUDA kernel operating on an ``int32`` array.  Here,
+we will simply increment each array element (assuming the array is writable)::
+
+   import numba.cuda
+
+   @numba.cuda.jit
+   def increment_by_one(an_array):
+       pos = numba.cuda.grid(1)
+       if pos < an_array.size:
+           an_array[pos] += 1
+
+Then we need to wrap our CUDA buffer into a Numba "device array" with the right
+array metadata (shape, strides and datatype).  This is necessary so that Numba
+can identify the array's characteristics and compile the kernel with the
+appropriate type declarations.
+
+In this case the metadata can simply be got from the original Numpy array.
+Note the GPU data isn't copied, just pointed to::
+
+   >>> from numba.cuda.cudadrv.devicearray import DeviceNDArray
+   >>> device_arr = DeviceNDArray(arr.shape, arr.strides, arr.dtype, gpu_data=cuda_buf.to_numba())
+
+(ideally we could have defined an Arrow array in CPU memory, copied it to CUDA
+memory without losing type information, and then invoked the Numba kernel on it
+without constructing the DeviceNDArray by hand; this is not yet possible)
+
+Finally we can run the Numba CUDA kernel on the Numba device array (here
+with a 16x16 grid size)::
+
+   >>> increment_by_one[16, 16](device_arr)
+
+And the results can be checked by copying back the CUDA buffer to CPU memory::
+
+   >>> np.frombuffer(cuda_buf.copy_to_host(), dtype=np.int32)
+   array([1, 2, 3, 4], dtype=int32)
+
+Numba to Arrow
+~~~~~~~~~~~~~~
+
+Conversely, a Numba-created device array can be viewed as an Arrow CUDA buffer,
+using the :meth:`CudaBuffer.from_numba` factory method.
+
+For the sake of example, let's first create a Numba device array::
+
+   >>> arr = np.arange(10, 14, dtype=np.int32)
+   >>> arr
+   array([10, 11, 12, 13], dtype=int32)
+   >>> device_arr = numba.cuda.to_device(arr)
+
+Then we can create a CUDA buffer pointing the device array's memory.
+We don't need to pass a CUDA context explicitly this time: the appropriate
+CUDA context is automatically retrieved and adapted from the Numba object.
+
+::
+
+   >>> cuda_buf = cuda.CudaBuffer.from_numba(device_arr.gpu_data)
+   >>> cuda_buf.size
+   16
+   >>> cuda_buf.address
+   30088364032
+   >>> cuda_buf.context.device_number
+   0
+
+Of course, we can copy the CUDA buffer back to host memory::
+
+   >>> np.frombuffer(cuda_buf.copy_to_host(), dtype=np.int32)
+   array([10, 11, 12, 13], dtype=int32)
+
+.. seealso::
+   Documentation for Numba's `CUDA support <https://numba.pydata.org/numba-doc/latest/cuda/index.html>`_.
diff --git a/docs/source/python/index.rst b/docs/source/python/index.rst
@@ -43,6 +43,7 @@ files into Arrow structures.
    pandas
    csv
    parquet
+   cuda
    extending
    api
    development
diff --git a/docs/source/python/memory.rst b/docs/source/python/memory.rst
@@ -109,6 +109,9 @@ the buffer is garbaged-collected, all of the memory is freed:
    buf = None
    pa.total_allocated_bytes()
 
+.. seealso::
+   On-GPU buffers using Arrow's optional :doc:`CUDA integration <cuda>`.
+
 
 Input and Output
 ================