Mathematics - Computer Science
Solving large sparse linear systems on the GPU
Bruno Lévy
Inria - ParMA
Laboratoire de Mathématiques d'Orsay
Atelier AM2I: Scientific Computing, Scaling Up
16 Dec 2024
Outline
1. Motivations – a case study
2. What’s a GPU and how to program it
3. On the testbench …
Motivations – a case study
1
Mysteries in the sky
There is more mass than what we observe
Vera Rubin - 1962
Type Ia supernovae: "standard candles"
Perlmutter, Riess: the expansion of the Universe is accelerating.
Mysteries in the sky
- There seems to be more matter than what we observe … "dark matter" (but we do not know what it is)
- The big-bang is big-banging faster than we thought … "dark energy" (but we do not know what it is)
The inverse problem
[figures: initial condition (homogeneous), redshift acquisition survey]
3.350 billion haloes
The inverse problem – Benamou-Brenier thm
[figure: initial condition (homogeneous) → redshift acquisition survey, via the transport map T(x)]
[Frisch, Matarrese, Mohayaee, Sobolevski 2002 (Nature)]
[Brenier, Frisch, Hénon, Loeper, Matarrese, Mohayaee, Sobolevski 2003]
The inverse problem – Benamou-Brenier thm
Minimize the action
A(ρ,v) = (t2 − t1) ∫_{t1}^{t2} ∫_V ρ(x,t) || v(x,t) ||² dx dt
s.t. ρ(t1,·) = ρ1 ; ρ(t2,·) = ρ2 ; ∂ρ/∂t = −div(ρv)

By the Benamou-Brenier theorem, this is equivalent to optimal transport:
Minimize C(T) = ∫_V || x − T(x) ||² ρ1(x) dx
s.t. T is measure-preserving
Semi-discrete optimal transport
Maximize the Kantorovich dual K(ψ):

(DMK)   sup over ψ ∈ Ψc of
K(ψ) = Σj ∫_{Lagψ(yj)} ( || x − yj ||² − ψ(yj) ) dμ + Σj ψ(yj) νj

where Lagψ(yj) = { x : || x − yj ||² − ψ(yj) < || x − yj' ||² − ψ(yj') for all j' ≠ j }
is the Laguerre diagram of the yj's (with the L2 cost || x − y ||² used here: a power diagram),
ψ(yj) is the weight of yj in the power diagram, and νj is the prescribed mass at yj.
ψ is determined by the weight vector [ ψ(y1) ψ(y2) … ψ(ym) ]
Semi-discrete optimal transport
[Kitagawa, Mérigot, Thibert 2019, JEMS]
[L 2015, M2AN]
[L 2021, JCP]
[Nikakhtar, Sheth, L, Mohayaee 2022, PRL]
[von Hausegger, L, Mohayaee 2021, PRL]
[L, Ray, Mérigot, Leclerc, JCP (pend. rev.)]
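The link between the dual (DMK) and the linear systems of the following slides is Newton's method: K is concave, and each Newton step solves a sparse linear system. A summary of the standard derivation, in LaTeX, following [Kitagawa, Mérigot, Thibert 2019]; signs depend on conventions, so treat this as a sketch:

% Gradient of K: the mass discrepancy of each Laguerre cell
\frac{\partial K}{\partial \psi_j} = \nu_j - \mu\big(\mathrm{Lag}_\psi(y_j)\big)

% Newton step: solve  H\,\delta\psi = -\nabla K(\psi)  where, for j \neq k,
H_{jk} = \frac{\mu_{d-1}\big(\mathrm{Lag}_\psi(y_j)\cap\mathrm{Lag}_\psi(y_k)\big)}{2\,\lVert y_j - y_k \rVert},
\qquad H_{jj} = -\sum_{k\neq j} H_{jk}

% -H is a weighted graph Laplacian supported on adjacent Laguerre cells:
% hence the "classical P1 Laplacian" matrix of the next slides.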
Matrix of the system: the classical P1 Laplacian
[figure: Laguerre diagram; adjacent cells i and j give the nonzero entries]
In 3D: 16 NNZs per row on average
N = 100 million points
Matrix: 25.6 GB (10⁸ rows × 16 NNZs/row × 16 bytes per stored entry, e.g. an 8-byte value plus an 8-byte index)
Conjugate gradient
- Operations on vectors (BLAS)
- Sparse matrix-vector product (SPMV)
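This decomposition is exactly the structure of a CG implementation: a handful of BLAS-1 operations plus one SPMV per iteration. Below, a minimal CPU sketch (not the talk's implementation; the GPU version replaces each helper with a kernel or a cuBLAS/cuSPARSE call):

/* Minimal unpreconditioned conjugate gradient on a CSR matrix (CPU sketch). */
#include <stdlib.h>
#include <math.h>

/* y <- A*x, A given in CSR format (rowptr, colind, val) -- the SPMV */
static void spmv(int n, const int *rowptr, const int *colind,
                 const double *val, const double *x, double *y) {
    for (int i = 0; i < n; ++i) {
        double s = 0.0;
        for (int k = rowptr[i]; k < rowptr[i+1]; ++k) s += val[k] * x[colind[k]];
        y[i] = s;
    }
}

/* BLAS-1: dot product and axpy */
static double dot(int n, const double *x, const double *y) {
    double s = 0.0;
    for (int i = 0; i < n; ++i) s += x[i] * y[i];
    return s;
}
static void axpy(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; ++i) y[i] += a * x[i];
}

/* Solve A x = b starting from the given x; returns the iteration count. */
int cg(int n, const int *rowptr, const int *colind, const double *val,
       const double *b, double *x, double tol, int maxit) {
    double *r  = (double*)malloc(n * sizeof(double));
    double *p  = (double*)malloc(n * sizeof(double));
    double *Ap = (double*)malloc(n * sizeof(double));
    spmv(n, rowptr, colind, val, x, Ap);          /* r = b - A*x */
    for (int i = 0; i < n; ++i) { r[i] = b[i] - Ap[i]; p[i] = r[i]; }
    double rr = dot(n, r, r);
    int it = 0;
    while (it < maxit && sqrt(rr) > tol) {
        spmv(n, rowptr, colind, val, p, Ap);      /* one SPMV ...   */
        double alpha = rr / dot(n, p, Ap);        /* ... and a few  */
        axpy(n,  alpha, p,  x);                   /* BLAS-1 calls   */
        axpy(n, -alpha, Ap, r);                   /* per iteration  */
        double rr_new = dot(n, r, r);
        for (int i = 0; i < n; ++i) p[i] = r[i] + (rr_new / rr) * p[i];
        rr = rr_new;
        ++it;
    }
    free(r); free(p); free(Ap);
    return it;
}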
What’s a GPU …
And how to program it ?
2
CPUs
[annotated die shots: AMD Zen3 processor (cache, cores); AMD core floorplan]
What's inside a core? Compute units ("payload") vs. control logic and caches ("thrust").
How to increase the payload / thrust ratio?
GPUs
A “tapestry” of (mostly) compute cores (“payload”) and memory (cache hierarchy)
Nvidia GA100 annotated die shot
GPUs – how to program

                    OpenCL             Cuda               OpenACC           CuBlas, CuSparse
Portability         Portable           Nvidia only        Portable          Nvidia only
Performance         Fast               Fastest            Not so fast       Fastest
Programming model   Special language   Special language   C++ directives    C API
GPUs – how to program

                    pros                    cons
OpenCL              + Portable              - Perf, own lang. and toolkit
Cuda                + Fast, versatile       - Own lang. and toolkit
OpenACC             + Easy, in language     - Slow
CuBlas, CuSparse    + Fast, C API, easy     - Specialized for lin. alg.
GPUs – CUDA - saxpy:  Y ← a*X + Y
"Kernel": a function run in parallel on the GPU.
Difference with multi-core programming: (very) fine-grained parallelism.
Steps:
1. Initialize data and copy it to GPU RAM
2. Launch the computation on the GPU
3. Copy the data from GPU memory back to main memory
4. Do something with the result
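The code on the slides is shown as screenshots; here is a minimal, self-contained version of the same pattern (standard CUDA saxpy, not a verbatim copy of the talk's listing):

#include <cuda_runtime.h>
#include <stdio.h>

// "Kernel": each GPU thread updates one vector entry (fine-grained parallelism)
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main(void) {
    const int n = 1 << 20;
    float *x = (float*)malloc(n * sizeof(float));
    float *y = (float*)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // 1. Initialize data and copy to GPU RAM
    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    cudaMemcpy(d_x, x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, n * sizeof(float), cudaMemcpyHostToDevice);

    // 2. Launch computation on the GPU: one thread per entry
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);

    // 3. Copy the data from GPU memory back to main memory
    cudaMemcpy(y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);

    // 4. Do something with the result
    printf("y[0] = %f\n", y[0]);   // expect 2*1 + 2 = 4

    cudaFree(d_x); cudaFree(d_y); free(x); free(y);
    return 0;
}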
GPUs – CUDA - saxpy
- Vector operations (BLAS)
[s/d]axpy Y ← a*X + Y
[s/d]dot a ← X.Y
[s/d]scal X ← a*X
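With cuBLAS, each of these is a single call. A sketch (the vectors d_x, d_y are assumed to be already in GPU memory; error checking omitted):

#include <cublas_v2.h>

void blas1(int n, float a, float *d_x, float *d_y) {
    cublasHandle_t h;
    cublasCreate(&h);
    cublasSaxpy(h, n, &a, d_x, 1, d_y, 1);          // Y <- a*X + Y
    float d;
    cublasSdot(h, n, d_x, 1, d_y, 1, &d);           // a <- X . Y
    cublasSscal(h, n, &a, d_x, 1);                  // X <- a*X
    cublasDestroy(h);
}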
GPUs – CUDA - spmv:  Y ← M*X
- Vector operations (BLAS)
- Sparse matrix-vector product
[Buatois, Caumon, L 2009] "Concurrent Number Cruncher"
OpenNL (github, part of geogram), CUDA backend
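For reference, the simplest CUDA SPMV kernel: one thread per row of a CSR matrix. This is a generic sketch; the Concurrent Number Cruncher uses more elaborate (block-compressed) storage for better memory behavior, which is not shown here:

__global__ void spmv_csr(int n_rows, const int *rowptr, const int *colind,
                         const double *val, const double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_rows) {                        // one thread per matrix row
        double s = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)
            s += val[k] * x[colind[k]];      // gather from x (irregular access)
        y[i] = s;                            // Y <- M*X
    }
}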
GPUs – how to program
With CuBlas / CuSparse, the two workhorse operations of the solver are single library calls:
Y ← a*X + Y (BLAS axpy, CuBlas)
Y ← α*A*X + β*Y (sparse matrix-vector product, CuSparse)
Subtleties: where is the data?
On the testbench…
3
On the testbench …
Conjugate gradient – multithreaded CPU (on my laptop)
Optimal transport, Early Universe Reconstruction, 2M haloes
[screenshots: CPU solver timings]
On the testbench …
Conjugate gradient – GPU (on my laptop)
Optimal transport, Early Universe Reconstruction, 2M haloes
Linear solve: 4x faster than on the small CPU
[screenshots: GPU solver timings]
On the testbench …
Conjugate gradient – GPU (Nvidia A100)
Optimal transport, Early Universe Reconstruction, 2M haloes
Linear solve: up to 50x faster than on CPU
[screenshots: GPU A100 solver timings]
On the testbench … scaling up !!
130 M haloes … we need to upgrade !!!
- Hardware side: 4x Nvidia A100
- Algorithmic side: algebraic multigrid preconditioner
- Software side: AMGCL [Demidov] + custom backend for multi-GPU (OpenNL/geogram), object-oriented C (see the sketch below):
  - BLAS abstraction layer
  - Sparse matrix abstraction layer
  - Matrix assembly helper
https://github.com/BrunoLevy/geogram
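As a rough idea of what the software side looks like: a minimal AMGCL setup with its built-in (CPU) backend, following AMGCL's documented API. The talk's custom multi-GPU OpenNL backend would be plugged in at the Backend typedef and is not shown here:

#include <amgcl/make_solver.hpp>
#include <amgcl/amg.hpp>
#include <amgcl/backend/builtin.hpp>
#include <amgcl/coarsening/smoothed_aggregation.hpp>
#include <amgcl/relaxation/spai0.hpp>
#include <amgcl/solver/cg.hpp>
#include <amgcl/adapter/crs_tuple.hpp>
#include <tuple>
#include <vector>

typedef amgcl::backend::builtin<double> Backend;   // swap for a GPU backend

typedef amgcl::make_solver<
    amgcl::amg<Backend,                            // preconditioner: AMG ...
               amgcl::coarsening::smoothed_aggregation,
               amgcl::relaxation::spai0>,
    amgcl::solver::cg<Backend>                     // ... around a CG solver
> Solver;

size_t solve_amg(int n,
                 const std::vector<int>    &ptr,   // CSR row pointers
                 const std::vector<int>    &col,   // CSR column indices
                 const std::vector<double> &val,   // CSR values
                 const std::vector<double> &rhs,
                 std::vector<double>       &x) {
    Solver solve(std::tie(n, ptr, col, val));      // AMG hierarchy built here
    size_t iters; double error;
    std::tie(iters, error) = solve(rhs, x);
    return iters;
}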
On the testbench …
Unified memory can do the work for you …
… but it is (in general) faster to transfer memory explicitly.
[screenshots: timings with unified memory vs. explicit transfers, GPU A100 x4]
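The two strategies side by side, as a sketch reusing the saxpy kernel from the earlier listing (timings depend on the access pattern, but a solver that reuses its data over many iterations usually favors explicit transfers):

#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y); // kernel from the saxpy sketch above

void with_unified_memory(int n, float a) {
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));     // one pointer, visible to CPU and GPU
    for (int i = 0; i < n; ++i) x[i] = 1.0f;      // touched on the CPU ...
    saxpy<<<(n + 255) / 256, 256>>>(n, a, x, x);  // ... pages migrate to the GPU on demand
    cudaDeviceSynchronize();
    cudaFree(x);
}

void with_explicit_transfer(int n, float a, const float *host_x) {
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemcpy(d_x, host_x, n * sizeof(float), cudaMemcpyHostToDevice); // copied once
    for (int iter = 0; iter < 1000; ++iter)       // data then stays on the GPU
        saxpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_x);
    cudaDeviceSynchronize();
    cudaFree(d_x);
}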
On the testbench …
Linear solve takes 25 min on the 4x A100 (instead of 53 min on the multithreaded CPU).
[screenshots: CPU vs. GPU A100 x4 timings]
On the testbench …
Construction of the algebraic multigrid preconditioner is done on the CPU, and is now the lion's share of the time!
[screenshots: timings, GPU A100 x4 vs. CPU EPYC 9754 x2]
Coming next …
- Construction of the preconditioner on the GPU too.
- Laguerre diagram on the GPU? Possible, but harder … [Ray, Basselin, Alonso, Sokolov, L, Lefebvre]
References on Cosmology and OT
Nature, 2002, Frisch, Matarrese, Mohayaee, Sobolevski
MNRAS, 2003, Brenier, Frisch, Hénon, Loeper, Matarrese, Mohayaee, Sobolevski
Geometric and Functional Analysis, 2004, Brenier
Confluentes Mathematici, 2011, Brenier
Analysis & PDE, 2023, Ambrosio, Baradat, Brenier
JEMS, 2019, Kitagawa, Mérigot, Thibert
ESAIM: Mathematical Modelling and Numerical Analysis (M2AN), 2015, L
Monthly Notices of the Royal Astronomical Society, 2021, L, Mohayaee, von Hausegger
Physical Review Letters, 2021, von Hausegger, L, Mohayaee
Journal of Computational Physics, 2022, L
Physical Review Letters, 2022, Nikakhtar, Sheth, L, Mohayaee
Physical Review D, 2023, Nikakhtar, Padmanabhan, L, Sheth, Mohayaee
Physical Review D, 2024, Nikakhtar, Padmanabhan, L, Sheth, Mohayaee
Physical Review D, 2024, L, Brenier, Mohayaee
Journal of Computational Physics, L, Ray, Mérigot, Leclerc (pending revision)
