Write Python for Speed
Yung-Yu Chen
(yyc@sciwork.dev)
October 2023
speed is the king
リニア中央新幹線 (Chuo Shinkansen)
https://linear-chuo-shinkansen.jr-central.co.jp/about/design/
500 km/h high-speed rail by 2038: "can't wait!" 🙂
what we are talking about is unparalleled speed
numerical calculation needs speed
[Figure: grid points covering the computing domain (and solution); u = 0 on three boundaries and u(y) = sin(πy) on the fourth]

Equation for domain interior:

\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0 \quad (0 < x < 1; \; 0 < y < 1)

Boundary conditions:

u(0, y) = 0, \quad u(1, y) = \sin(\pi y) \quad (0 \le y \le 1)
u(x, 0) = 0, \quad u(x, 1) = 0 \quad (0 \le x \le 1)
u^{n+1}(x_i, y_j) = \frac{u^n(x_{i+1}, y_j) + u^n(x_{i-1}, y_j) + u^n(x_i, y_{j+1}) + u^n(x_i, y_{j-1})}{4}
Use the point-Jacobi method to march
the Laplace equation
Python is slow
for it in range(1, nx-1):
    for jt in range(1, nx-1):
        un[it,jt] = (u[it+1,jt] + u[it-1,jt] + u[it,jt+1] + u[it,jt-1]) / 4
un[1:nx-1,1:nx-1] = (u[2:nx,1:nx-1] + u[0:nx-2,1:nx-1] +
u[1:nx-1,2:nx] + u[1:nx-1,0:nx-2]) / 4
for (size_t it=1; it<nx-1; ++it)
{
for (size_t jt=1; jt<nx-1; ++jt)
{
un(it,jt) = (u(it+1,jt) + u(it-1,jt) + u(it,jt+1) + u(it,jt-1)) / 4;
}
}
Point-Jacobi method timings:
  Python nested loop: 4.797s (1x)
  Numpy array:        0.055s (87x)
  C++ nested loop:    0.025s (192x)
u^{n+1}(x_i, y_j) = \frac{u^n(x_{i+1}, y_j) + u^n(x_{i-1}, y_j) + u^n(x_i, y_{j+1}) + u^n(x_i, y_{j-1})}{4}
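The loop and slice snippets above omit their setup. A minimal runnable sketch of the numpy variant, assuming a 1001×1001 grid (the slides do not state the resolution used for the timings):

import time
import numpy as np

nx = 1001  # assumed grid size
# Boundary conditions: u = 0 on three sides, u = sin(pi*y) on the fourth.
u = np.zeros((nx, nx), dtype='float64')
u[nx-1, :] = np.sin(np.linspace(0, np.pi, nx))

un = u.copy()
t0 = time.perf_counter()
for _ in range(100):  # 100 Jacobi steps are enough for a timing comparison
    un[1:nx-1, 1:nx-1] = (u[2:nx, 1:nx-1] + u[0:nx-2, 1:nx-1] +
                          u[1:nx-1, 2:nx] + u[1:nx-1, 0:nx-2]) / 4
    u, un = un, u  # swap buffers instead of copying
print('100 numpy Jacobi steps: {:.3f}s'.format(time.perf_counter() - t0))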
example: hyperbolic PDEs
numerical simulations of conservation laws:
\frac{\partial u}{\partial t} + \sum_{k=1}^{3} \frac{\partial F^{(k)}(u)}{\partial x_k} = 0
use case: stress waves in anisotropic solids
use case: compressible flows
development procedure

[Diagram: observe → theorize → simple setup → analytical solution → numerical analysis → prototype → software package → mass production run; crossing the "wall of complexity" between prototype and package takes a lot of code development]

Analytical solution for the example problem:

u(x, y) = \frac{\sinh(\pi x)}{\sinh(\pi)} \sin(\pi y)

Always keep the problem (math or physics) in mind, and always make the code run.
hybrid architecture
• everyone wants the simplicity of Python
• HPC programmers included
• but we must use C++ for speed
• to be honest, machine code
• write Python to drive the C++ code / the machine
machine determines speed
void calc_distance(
size_t const n
, double const * x
, double const * y
, double * r)
{
for (size_t i = 0 ; i < n ; ++i)
{
r[i] = std::sqrt(x[i]*x[i] + y[i]*y[i]);
}
}
vmovupd ymm0, ymmword [rsi + r9*8]
vmulpd ymm0, ymm0, ymm0
vmovupd ymm1, ymmword [rdx + r9*8]
vmulpd ymm1, ymm1, ymm1
vaddpd ymm0, ymm0, ymm1
vsqrtpd ymm0, ymm0
movupd xmm0, xmmword [rsi + r8*8]
mulpd xmm0, xmm0
movupd xmm1, xmmword [rdx + r8*8]
mulpd xmm1, xmm1
addpd xmm1, xmm0
sqrtpd xmm0, xmm1
AVX: 256-bit-wide vectorization
SSE: 128-bit-wide vectorization
C++ code
array: HPC fundamental
[Diagram: three regularly allocated buffers; a regularly accessing algorithm streams through them in order, while a randomly accessing algorithm jumps across them]
quoted from SemiAnalysis
https://www.semianalysis.com/p/apple-m2-die-shot-and-architecture
die shots of popular chips
the array is the data structure for cache optimization, data parallelism, SIMD, GPU: all the techniques of high-performance computing (HPC)
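The cost of irregular access can be seen from Python alone. A small sketch (numbers will vary by machine; the point is the gap between streaming and gathering):

import time
import numpy as np

n = 10_000_000
a = np.arange(n, dtype='float64')
idx = np.random.default_rng().permutation(n)  # a random access pattern

t0 = time.perf_counter()
s_regular = a.sum()          # streams through the buffer in order
t1 = time.perf_counter()
s_random = a[idx].sum()      # gathers elements in random order
t2 = time.perf_counter()
print('regular: {:.3f}s random: {:.3f}s'.format(t1 - t0, t2 - t1))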
array for zero-copy across Python and C++
[Diagram: a Python app and a C++ app share one memory buffer (a11 a12 ⋯ a1n a21 ⋯ am1 ⋯ amn) across the language boundary, with an ndarray on the Python side and a container on the C++ side]

👍 bottom (C++) - up (Python) design: the C++ container manages the buffer; the ndarray accesses it.

👎 top (Python) - down (C++) design: the ndarray manages the buffer; the C++ container accesses it. A Python-controlled lifecycle is bad for a long-running process, whereas C++ allows fine-grained control of all resources.
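The zero-copy idea can be demonstrated on the Python side alone; here the bytearray merely stands in for a buffer owned elsewhere (e.g., by C++ code):

import numpy as np

buf = bytearray(8 * 6)                   # stand-in for a C++-owned buffer
a = np.frombuffer(buf, dtype='float64')  # zero-copy view of the buffer
a[:] = 1.0                               # writes go straight to the buffer
b = np.frombuffer(buf, dtype='float64')  # another view, still no copy
assert np.shares_memory(a, b)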
quick prototype in Python:
something new and fragile
• thread pool for I/O? certainly Python
• we certainly do not want to write complicated C++ for a prototype
• 64 threads on 32-core servers? consider moving it to C++
• data-parallel code on top of the thread pool? time to go with the TBB (Threading Building Blocks) C++ library
from _thread import allocate_lock, start_new_thread

class ThreadPool(object):
    """Python prototype for I/O thread pool"""
    def __init__(self, nthread):
        # Worker callback
        self.func = None
        # Placeholders for managing data
        self.__threadids = [None] * nthread
        self.__threads = [None] * nthread
        self.__returns = [None] * nthread
        # Initialize thread managing data
        for it in range(nthread):
            mlck = allocate_lock(); mlck.acquire()
            wlck = allocate_lock(); wlck.acquire()
            tdata = [mlck, wlck, None, None]
            self.__threads[it] = tdata
            tid = start_new_thread(self.eventloop, (tdata,))
            self.__threadids[it] = tid
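The slide shows only the constructor; the eventloop method it starts is not shown. A hypothetical sketch of what each worker's loop might look like (an assumption, not the author's code):

    def eventloop(self, tdata):
        # tdata = [master lock, worker lock, arguments, return slot]
        mlck, wlck = tdata[0], tdata[1]
        while True:
            mlck.acquire()                  # block until work is handed over
            tdata[3] = self.func(tdata[2])  # run the worker callback
            wlck.release()                  # signal the master: work is done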
quick prototype in Python:
something complex
[Figure: grid point (i, j) and its neighbors in the difference equation]
import numpy as np

def solve_python_loop():
    u = uoriginal.copy()  # Input from outer scope
    un = u.copy()  # Create the buffer for the next time step
    converged = False
    step = 0
    # Outer loop.
    while not converged:
        step += 1
        # Inner loops. One for x and the other for y.
        for it in range(1, nx-1):
            for jt in range(1, nx-1):
                un[it,jt] = (u[it+1,jt] + u[it-1,jt] + u[it,jt+1] + u[it,jt-1]) / 4
        norm = np.abs(un-u).max()
        u[...] = un[...]
        converged = True if norm < 1.e-5 else False
    return u, step, norm
python nested loop implementing the point-Jacobi method for the difference equation

u(x_i, y_j) = \frac{u(x_{i+1}, y_j) + u(x_{i-1}, y_j) + u(x_i, y_{j+1}) + u(x_i, y_{j-1})}{4}

marched as the point-Jacobi iteration

u^{n+1}(x_i, y_j) = \frac{u^n(x_{i+1}, y_j) + u^n(x_{i-1}, y_j) + u^n(x_i, y_{j+1}) + u^n(x_i, y_{j-1})}{4}
production: many data many ways
fit scattered points to a polynomial:

f(x) = a_3 x^3 + a_2 x^2 + a_1 x + a_0
test many (like 1,000+) such data sets:
import numpy as np
from matplotlib import pyplot as plt

def plot_poly_fitted(i):
    slct = (xdata>=i)&(xdata<(i+1))
    sub_x = xdata[slct]
    sub_y = ydata[slct]
    poly = data_prep.fit_poly(sub_x, sub_y, 3)
    print(poly)
    poly = np.poly1d(poly)
    xp = np.linspace(sub_x.min(), sub_x.max(), 100)
    plt.plot(sub_x, sub_y, '.', xp, poly(xp), '-')
plot_poly_fitted(10)
(yeah, the data points seem to be too random to be represented by a polynomial)
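The snippets assume xdata and ydata already exist in the outer scope. A hypothetical setup that makes them concrete enough to run the examples (the real data is not given in the slides):

import numpy as np

rng = np.random.default_rng()
# 1,000 groups of scattered points; group i holds the points with x in [i, i+1).
xdata = np.sort(rng.uniform(0.0, 1000.0, size=1_000_000))
ydata = rng.normal(size=xdata.size)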
// The rank of the linear map is (order+1).
modmesh::SimpleArray<double> matrix(std::vector<size_t>{order+1, order+1});
// Use the x coordinates to build the linear map for least-square
// regression.
for (size_t it=0; it<order+1; ++it)
{
for (size_t jt=0; jt<order+1; ++jt)
{
double & val = matrix(it, jt);
val = 0;
for (size_t kt=start; kt<stop; ++kt)
{
val += pow(xarr[kt], it+jt);
}
}
}
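The moment matrix the C++ double loop assembles is the normal-equation matrix of a least-square polynomial fit. A numpy sketch of a compatible single-group helper (an assumption about what data_prep.fit_poly does; its real implementation is not shown):

import numpy as np

def fit_poly_normal(sub_x, sub_y, order):
    # Columns of the Vandermonde matrix are x^0, x^1, ..., x^order, so
    # (vand.T @ vand)[i, j] = sum_k x_k**(i+j), matching the C++ loop above.
    vand = np.vander(sub_x, order + 1, increasing=True)
    matrix = vand.T @ vand
    rhs = vand.T @ sub_y
    return np.linalg.solve(matrix, rhs)  # coefficients, lowest order first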
>>> with Timer():
>>> # Do the calculation for the 1000 groups of points.
>>> polygroup = np.empty((1000, 3), dtype='float64')
>>> for i in range(1000):
>>> # Use numpy to build the point group.
>>> slct = (xdata>=i)&(xdata<(i+1))
>>> sub_x = xdata[slct]
>>> sub_y = ydata[slct]
>>> polygroup[i,:] = data_prep.fit_poly(sub_x, sub_y, 2)
>>> with Timer():
>>> # Using numpy to build the point groups takes a lot of time.
>>> data_groups = []
>>> for i in range(1000):
>>> slct = (xdata>=i)&(xdata<(i+1))
>>> data_groups.append((xdata[slct], ydata[slct]))
>>> with Timer():
>>> # Fitting helper runtime is much less than building the point groups.
>>> polygroup = np.empty((1000, 3), dtype='float64')
>>> for it, (sub_x, sub_y) in enumerate(data_groups):
>>> polygroup[it,:] = data_prep.fit_poly(sub_x, sub_y, 2)
Wall time: 1.49671 s
Wall time: 1.24653 s
Wall time: 0.215859 s
prepare data
fit polynomials
problem of speedup
>>> with Timer():
>>> # Using numpy to build the point groups takes a lot of time.
>>> data_groups = []
>>> for i in range(1000):
>>> slct = (xdata>=i)&(xdata<(i+1))
>>> data_groups.append((xdata[slct], ydata[slct]))
Wall time: 1.24653 s
>>> with Timer():
>>> # Fitting helper runtime is much less than building the point groups.
>>> polygroup = np.empty((1000, 3), dtype='float64')
>>> for it, (sub_x, sub_y) in enumerate(data_groups):
>>> polygroup[it,:] = data_prep.fit_poly(sub_x, sub_y, 2)
Wall time: 0.215859 s
>>> with Timer():
>>> rbatch = data_prep.fit_polys(xdata, ydata, 2)
Wall time: 0.21058 s
/**
* This function calculates the least-square regression of multiple sets of
* point clouds to the corresponding polynomial functions of a given order.
*/
modmesh::SimpleArray<double> fit_polys
(
modmesh::SimpleArray<double> const & xarr
, modmesh::SimpleArray<double> const & yarr
, size_t order
)
{
size_t xmin = std::floor(*std::min_element(xarr.begin(), xarr.end()));
size_t xmax = std::ceil(*std::max_element(xarr.begin(), xarr.end()));
size_t ninterval = xmax - xmin;
modmesh::SimpleArray<double> lhs(std::vector<size_t>{ninterval, order+1});
std::fill(lhs.begin(), lhs.end(), 0); // sentinel.
size_t start=0;
for (size_t it=0; it<xmax; ++it)
{
// NOTE: We take advantage of the intrinsic features of the input data
// to determine the grouping. This is ad hoc and hard to maintain. We
// play this trick to demonstrate a hackish way of performing numerical
// calculation.
size_t stop;
for (stop=start; stop<xarr.size(); ++stop)
{
if (xarr[stop]>=it+1) { break; }
}
// Use the single polynomial helper function.
auto sub_lhs = fit_poly(xarr, yarr, start, stop, order);
for (size_t jt=0; jt<order+1; ++jt)
{
lhs(it, jt) = sub_lhs[jt];
}
start = stop;
}
return lhs;
}
prepare data and loop in Python:
prepare data and loop in C++:
C++ wins so much, but we lose flexibility!
// NOTE: We take advantage of the intrinsic features of the input data
// to determine the grouping. This is ad hoc and hard to maintain. We
// play this trick to demonstrate a hackish way of performing numerical
// calculation.
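One way to keep the grouping in Python without the 1,000 boolean masks is a single vectorized split, e.g. with np.searchsorted (a hedged alternative, not from the slides; it assumes xdata is sorted, which the C++ grouping loop assumes as well):

import numpy as np

# Group boundaries for the value ranges [0, 1), [1, 2), ..., [999, 1000).
edges = np.searchsorted(xdata, np.arange(1001))
data_groups = [(xdata[s:e], ydata[s:e])
               for s, e in zip(edges[:-1], edges[1:])]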
data object: flexible and fast
public:
std::vector<Data> inner1(size_t start, size_t len)
{
std::vector<Data> ret;
ret.reserve(len);
inner2(start, len, ret);
return ret;
}
private:
void inner2(size_t start, size_t len,
std::vector<Data> & ret)
{
for (size_t it=0; it < len; ++it)
{
Data data(start+it);
ret.emplace_back(std::move(data));
}
}
public:
void outer(size_t len)
{
result.reserve(len*(len+1)/2);
for (size_t it=0; it < len; ++it)
{
// The output argument passed into
// the private helper is a private
// member datum.
inner2(result.size(), it+1, result);
}
}
Called when consumers want the sub-operation one by one; it also makes the code more testable.
Called when batch operation is
demanded.
struct Accumulator
{
public:
std::vector<Data> result;
};
Caller does not see this private
helper that takes an output argument.
use public data members only for simple purposes
class to pack data and logic
data object: use case
class StaticMesh
{
// shape data
uint8_t m_ndim = 0;
uint_type m_nnode = 0; ///< Number of nodes (interior).
uint_type m_nface = 0; ///< Number of faces (interior).
uint_type m_ncell = 0; ///< Number of cells (interior).
uint_type m_nbound = 0; ///< Number of boundary faces.
uint_type m_ngstnode = 0; ///< Number of ghost nodes.
uint_type m_ngstface = 0; ///< Number of ghost faces.
uint_type m_ngstcell = 0; ///< Number of ghost cells.
// geometry arrays
MM_DECL_StaticMesh_ARRAY(real_type, ndcrd);
MM_DECL_StaticMesh_ARRAY(real_type, fccnd);
MM_DECL_StaticMesh_ARRAY(real_type, fcnml);
MM_DECL_StaticMesh_ARRAY(real_type, fcara);
MM_DECL_StaticMesh_ARRAY(real_type, clcnd);
MM_DECL_StaticMesh_ARRAY(real_type, clvol);
// meta arrays
MM_DECL_StaticMesh_ARRAY(int_type, fctpn);
MM_DECL_StaticMesh_ARRAY(int_type, cltpn);
MM_DECL_StaticMesh_ARRAY(int_type, clgrp);
// connectivity arrays
MM_DECL_StaticMesh_ARRAY(int_type, fcnds);
MM_DECL_StaticMesh_ARRAY(int_type, fccls);
MM_DECL_StaticMesh_ARRAY(int_type, clnds);
MM_DECL_StaticMesh_ARRAY(int_type, clfcs);
MM_DECL_StaticMesh_ARRAY(int_type, ednds);
}; /* end class StaticMesh */
#define MM_DECL_StaticMesh_ARRAY(TYPE, NAME) \
public: \
    SimpleArray<TYPE> const & NAME() const { return m_##NAME; } \
    SimpleArray<TYPE> & NAME() { return m_##NAME; } \
    template <typename... Args> \
    TYPE const & NAME(Args... args) const { return m_##NAME(args...); } \
    template <typename... Args> \
    TYPE & NAME(Args... args) { return m_##NAME(args...); } \
\
private: \
    SimpleArray<TYPE> m_##NAME
namespace py = pybind11;
(*this)
// shape data
.def_property_readonly("ndim", &wrapped_type::ndim)
.def_property_readonly("nnode", &wrapped_type::nnode)
.def_property_readonly("nface", &wrapped_type::nface)
.def_property_readonly("ncell", &wrapped_type::ncell)
.def_property_readonly("nbound", &wrapped_type::nbound)
.def_property_readonly("ngstnode", &wrapped_type::ngstnode)
.def_property_readonly("ngstface", &wrapped_type::ngstface)
.def_property_readonly("ngstcell", &wrapped_type::ngstcell)
#define MM_DECL_ARRAY(NAME) \
    .expose_SimpleArray( \
        #NAME, \
        [](wrapped_type & self) -> decltype(auto) \
        { return self.NAME(); })
(*this)
// geometry arrays
MM_DECL_ARRAY(ndcrd)
MM_DECL_ARRAY(fccnd)
MM_DECL_ARRAY(fcnml)
MM_DECL_ARRAY(fcara)
MM_DECL_ARRAY(clcnd)
MM_DECL_ARRAY(clvol)
// meta arrays
MM_DECL_ARRAY(fctpn)
MM_DECL_ARRAY(cltpn)
MM_DECL_ARRAY(clgrp)
// connectivity arrays
MM_DECL_ARRAY(fcnds)
MM_DECL_ARRAY(fccls)
MM_DECL_ARRAY(clnds)
MM_DECL_ARRAY(clfcs)
MM_DECL_ARRAY(ednds);
#undef MM_DECL_ARRAY
# Construct the data object
mh = modmesh.StaticMesh(
ndim=3, nnode=4, nface=4, ncell=1)
# Set the data
mh.ndcrd.ndarray[:, :] = \
    (0, 0, 0), (0, 1, 0), (-1, 1, 0), (0, 1, 1)
mh.cltpn.ndarray[:] = modmesh.StaticMesh.TETRAHEDRON
mh.clnds.ndarray[:, :5] = [(4, 0, 1, 2, 3)]
# Calculate internals from the input data
# to build up the object
mh.build_interior()
np.testing.assert_almost_equal(
mh.fccnd,
[[-0.3333333, 0.6666667, 0. ],
[ 0. , 0.6666667, 0.3333333],
[-0.3333333, 0.6666667, 0.3333333],
[-0.3333333, 1. , 0.3333333]])
mesh shape information
data arrays
data arrays will be very large: gigabytes in memory
use data in python
C++ library pybind11 wrapper
test in Python
def test_2d_trivial_triangles(self):
    mh = modmesh.StaticMesh(ndim=2, nnode=4, nface=0, ncell=3)
    mh.ndcrd.ndarray[:, :] = (0, 0), (-1, -1), (1, -1), (0, 1)
    mh.cltpn.ndarray[:] = modmesh.StaticMesh.TRIANGLE
    mh.clnds.ndarray[:, :4] = (3, 0, 1, 2), (3, 0, 2, 3), (3, 0, 3, 1)
    self._check_shape(mh, ndim=2, nnode=4, nface=0, ncell=3,
                      nbound=0, ngstnode=0, ngstface=0, ngstcell=0,
                      nedge=0)
    # Test build interior data.
    mh.build_interior(_do_metric=False, _build_edge=False)
    self._check_shape(mh, ndim=2, nnode=4, nface=6, ncell=3,
                      nbound=0, ngstnode=0, ngstface=0, ngstcell=0,
                      nedge=0)
    mh.build_interior()  # _do_metric=True, _build_edge=True
    self._check_shape(mh, ndim=2, nnode=4, nface=6, ncell=3,
                      nbound=0, ngstnode=0, ngstface=0, ngstcell=0,
                      nedge=6)
    np.testing.assert_almost_equal(
        mh.fccnd,
        [[-0.5, -0.5], [0.0, -1.0], [0.5, -0.5],
         [0.5, 0.0], [0.0, 0.5], [-0.5, 0.0]])
    np.testing.assert_almost_equal(
        mh.fcnml,
        [[-0.7071068, 0.7071068], [0.0, -1.0], [0.7071068, 0.7071068],
         [0.8944272, 0.4472136], [-1.0, -0.0], [-0.8944272, 0.4472136]])
    np.testing.assert_almost_equal(
        mh.fcara, [1.4142136, 2.0, 1.4142136, 2.236068, 1.0, 2.236068])
    np.testing.assert_almost_equal(
        mh.clcnd, [[0.0, -0.6666667], [0.3333333, 0.0], [-0.3333333, 0.0]])
    np.testing.assert_almost_equal(
        mh.clvol, [1.0, 0.5, 0.5])
namespace py = pybind11;
(*this)
// shape data
.def_property_readonly("ndim", &wrapped_type::ndim)
.def_property_readonly("nnode", &wrapped_type::nnode)
.def_property_readonly("nface", &wrapped_type::nface)
.def_property_readonly("ncell", &wrapped_type::ncell)
.def_property_readonly("nbound", &wrapped_type::nbound)
.def_property_readonly("ngstnode", &wrapped_type::ngstnode)
.def_property_readonly("ngstface", &wrapped_type::ngstface)
.def_property_readonly("ngstcell", &wrapped_type::ngstcell)
#define MM_DECL_ARRAY(NAME) .expose_SimpleArray( \
    #NAME, \
    [](wrapped_type & self) -> decltype(auto){ return self.NAME(); })
(*this)
// geometry arrays
MM_DECL_ARRAY(ndcrd)
MM_DECL_ARRAY(fccnd)
MM_DECL_ARRAY(fcnml)
MM_DECL_ARRAY(fcara)
MM_DECL_ARRAY(clcnd)
MM_DECL_ARRAY(clvol)
// meta arrays
MM_DECL_ARRAY(fctpn)
MM_DECL_ARRAY(cltpn)
MM_DECL_ARRAY(clgrp)
// connectivity arrays
MM_DECL_ARRAY(fcnds)
MM_DECL_ARRAY(fccls)
MM_DECL_ARRAY(clnds)
MM_DECL_ARRAY(clfcs)
MM_DECL_ARRAY(ednds);
#undef MM_DECL_ARRAY
wrapping to Python enables fast testing; development may be done by writing tests!
# Construct the data object
mh = modmesh.StaticMesh(
ndim=3, nnode=4, nface=4, ncell=1)
# Set the data
mh.ndcrd.ndarray[:, :] = \
    (0, 0, 0), (0, 1, 0), (-1, 1, 0), (0, 1, 1)
mh.cltpn.ndarray[:] = modmesh.StaticMesh.TETRAHEDRON
mh.clnds.ndarray[:, :5] = [(4, 0, 1, 2, 3)]
# Calculate internals from the input data
# to build up the object
mh.build_interior()
np.testing.assert_almost_equal(
mh.fccnd,
[[-0.3333333, 0.6666667, 0. ],
[ 0. , 0.6666667, 0.3333333],
[-0.3333333, 0.6666667, 0.3333333],
[-0.3333333, 1. , 0.3333333]])
development flow

[Diagram: two development flows.
Python-centric: Python prototype → write C++ → test C++ → wrap to Python → test Python → Python app.
C++-centric: C++ data object → wrap to Python → test Python → Python app.
You need to be familiar with C++ 🙁, but end up happy with Python 🙂: gain from the pain.]
many C++ helper classes:
• more features
• better performance
note: a lot of low-level data-structure work happens here
develop SimpleArray in C++
SimpleArray vs std::vector:
• SimpleArray is fixed size 👍: memory is allocated only on construction. std::vector is variable size 👎: the buffer may be invalidated, and memory allocation (reallocation) happens implicitly.
• SimpleArray provides multi-dimensional access 👍 through operator(); std::vector provides only one-dimensional access 👎 through operator[].
[Diagram: the C++ container manages, and the ndarray accesses, the memory buffer; make the container yourself rather than getting it for free from the STL]
buffer ownership
ownership: know when and where to allocate and deallocate memory (and "resources")
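numpy has the same notion on the Python side: each ndarray records whether it owns its buffer. A small illustration (not from the slides):

import numpy as np

a = np.zeros(10)
b = a[::2]                  # a view: no new buffer is allocated
print(a.flags.owndata)      # True: a owns its buffer
print(b.flags.owndata)      # False: b borrows a's buffer
print(b.base is a)          # True: b keeps a alive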
buffer ownership

[Diagram sequence over five slides: SimpleArray a holds metadata and a buffer pointer to its data buffer; copying a into SimpleArray b is traced step by step until both arrays hold buffer pointers]
buffer interface and numpy ndarray
template <typename S>
std::enable_if_t<is_simple_array_v<S>, pybind11::array>
to_ndarray(S && sarr)
{
namespace py = pybind11;
using T = typename std::remove_reference_t<S>::value_type;
std::vector<size_t> const shape(sarr.shape().begin(),
sarr.shape().end());
std::vector<size_t> stride(sarr.stride().begin(),
sarr.stride().end());
for (size_t & v : stride) { v *= sarr.itemsize(); }
return py::array(
/* Numpy dtype */
py::detail::npy_format_descriptor<T>::dtype(),
/* Buffer dimensions */
shape,
/* Strides (in bytes) for each index */
stride,
/* Pointer to buffer */
sarr.data(),
/* Owning Python object */
py::cast(sarr.buffer().shared_from_this()));
}
template <typename T>
static SimpleArray<T> makeSimpleArray(
pybind11::array_t<T> & ndarr)
{
typename SimpleArray<T>::shape_type shape;
for (ssize_t i = 0; i < ndarr.ndim(); ++i)
{
shape.push_back(ndarr.shape(i));
}
std::shared_ptr<ConcreteBuffer> const buffer =
ConcreteBuffer::construct(
ndarr.nbytes(),
ndarr.mutable_data(),
std::make_unique<ConcreteBufferNdarrayRemover>(
ndarr));
return SimpleArray<T>(shape, buffer);
}
take the SimpleArray buffer to make an ndarray; take the ndarray buffer to make a SimpleArray
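On the Python side the zero-copy property can be checked directly (a sketch reusing the modmesh names from the earlier slides; it assumes the wrapped SimpleArray exposes .ndarray as shown there):

import numpy as np
import modmesh

mh = modmesh.StaticMesh(ndim=3, nnode=4, nface=4, ncell=1)
arr = mh.ndcrd.ndarray      # a view into the C++-owned buffer
arr[:, :] = 0.0             # writes land directly in C++ memory
assert np.shares_memory(arr, mh.ndcrd.ndarray)  # repeated access, same buffer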
conclusions
• Python-C++ hybrid system: utmost speed and flexibility
• Python for prototyping and testing
• use data objects to organize code in C++
• design zero-copy interfaces
• what you get: efficient, data-parallelizable code fully driven by Python