Porting and optimizing UniFrac for GPUs
Reducing microbiome analysis runtimes by orders of magnitude
Igor Sfiligoi (isfiligoi@sdsc.edu), Daniel McDonald (danielmcdonald@ucsd.edu), Rob Knight (robknight@ucsd.edu)
University of California San Diego - La Jolla, CA, USA
This work was partially funded by US National Science Foundation (NSF) grants OAC-1826967, OAC-1541349 and CNS-1730158, and by US National Institutes of Health (NIH) grant DP1-AT010885.
Abstract
UniFrac is a commonly used metric in microbiome research for comparing microbiome profiles to one another
(“beta diversity”). The recently implemented Striped UniFrac added the capability to split the problem into many
independent subproblems, and exhibits near linear scaling on multiple nodes, but does not scale linearly with the
number of CPU cores on a single node, indicating memory-bound compute.
Massively parallel algorithms, especially memory-heavy ones, are natural candidates for GPU compute. In this
poster we describe the steps undertaken in porting and optimizing Striped UniFrac to GPUs. We chose to do the porting
using OpenACC, since it allows for a single code base for both GPU and CPU compute. The choice proved to be
wise, as the optimizations aimed at speeding up execution on GPUs helped CPU execution, too.
The GPUs have proven to be an excellent resource for this type of problem, reducing runtimes from days to hours.
// Original CPU implementation: accumulate each stripe of the distance matrix,
// manually unrolled by a factor of 4 (only the first two elements shown).
for(unsigned int stripe = start; stripe < stop; stripe++) {
  dm_stripe = dm_stripes[stripe];
  for(unsigned int j = 0; j < n_samples / 4; j++) {
    int k = j * 4;
    double u1 = emb[k];
    double u2 = emb[k+1];
    double v1 = emb[k + stripe + 1];
    double v2 = emb[k + stripe + 2];
    …
    dm_stripe[k]   += (u1-v1)*length;
    dm_stripe[k+1] += (u2-v2)*length;
  }
}
// First GPU port: decorate with an OpenACC pragma and index into a single
// unified buffer (dm_stripes_buf) instead of per-stripe pointers.
#pragma acc parallel loop collapse(2) present(emb,dm_stripes_buf)
for(unsigned int stripe = start; stripe < stop; stripe++) {
  for(unsigned int j = 0; j < n_samples / 4; j++) {
    int idx = (stripe-start_idx)*n_samples;
    double *dm_stripe = dm_stripes_buf+idx;
    int k = j * 4;
    double u1 = emb[k];
    double u2 = emb[k+1];
    double v1 = emb[k + stripe + 1];
    double v2 = emb[k + stripe + 2];
    …
    dm_stripe[k]   += (u1-v1)*length;
    dm_stripe[k+1] += (u2-v2)*length;
  }
}
// Cluster the reads: accumulate all embedded vectors for a given k in a
// register (my_stripe) and write the result back once; the manual 4x unroll
// is gone and length is now an array indexed per embedded vector.
#pragma acc parallel loop collapse(2) present(emb,dm_stripes_buf,length)
for(unsigned int stripe = start; stripe < stop; stripe++) {
  for(unsigned int k = 0; k < n_samples; k++) {
    …
    double my_stripe = dm_stripe[k];
    #pragma acc loop seq
    for (unsigned int e = 0; e < filled_embs; e++) {
      uint64_t offset = n_samples*e;
      double u = emb[offset+k];
      double v = emb[offset+k+stripe+1];
      my_stripe += (u-v)*length[e];
    }
    …
    dm_stripe[k] = my_stripe;
  }
}
// Reorder and tile the loops: process step_size samples at a time to
// maximize cache reuse; collapse(3) exposes all three levels to the GPU.
#pragma acc parallel loop collapse(3) present(emb,dm_stripes_buf,length)
for(unsigned int sk = 0; sk < sample_steps; sk++) {
  for(unsigned int stripe = start; stripe < stop; stripe++) {
    for(unsigned int ik = 0; ik < step_size; ik++) {
      unsigned int k = sk*step_size + ik;
      …
      double my_stripe = dm_stripe[k];
      #pragma acc loop seq
      for (unsigned int e = 0; e < filled_embs; e++) {
        uint64_t offset = n_samples*e;
        double u = emb[offset+k];
        double v = emb[offset+k+stripe+1];
        my_stripe += (u-v)*length[e];
      }
      …
      dm_stripe[k] = my_stripe;
    }
  }
}
Most time is spent in a small area of the code, which is implemented as a set of tight loops.
OpenACC makes it trivial to port to GPU compute: just decorate the loops with a pragma. A minor refactoring was needed, though, to replace the per-stripe pointers with a single unified buffer.
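The present(emb,dm_stripes_buf) clause above assumes both buffers have already been placed on the GPU. Below is a minimal sketch of how that could be done with OpenACC unstructured data directives; the helper names and signatures are illustrative assumptions, not the actual UniFrac code.

#include <cstdint>

// Hypothetical helper: allocate one contiguous buffer for all stripes and make
// both buffers resident on the device, so the present(...) clauses find them.
double *setup_device_buffers(double *emb,
                             unsigned int n_stripes,
                             unsigned int n_samples,
                             unsigned int filled_embs) {
    uint64_t buf_len = (uint64_t)n_stripes * n_samples;
    uint64_t emb_len = (uint64_t)filled_embs * n_samples;
    // One unified buffer replaces the per-stripe dm_stripes[stripe] pointers;
    // stripe s starts at offset (s - start_idx) * n_samples.
    double *dm_stripes_buf = new double[buf_len]();   // zero-initialized
    // Copy both buffers to the device (dm_stripes_buf is all zeros here).
    #pragma acc enter data copyin(emb[0:emb_len], dm_stripes_buf[0:buf_len])
    return dm_stripes_buf;
}

// Hypothetical counterpart: bring the accumulated stripes back to the host
// and release the device copies once the compute loops are done.
void teardown_device_buffers(double *emb, double *dm_stripes_buf,
                             unsigned int n_stripes,
                             unsigned int n_samples,
                             unsigned int filled_embs) {
    uint64_t buf_len = (uint64_t)n_stripes * n_samples;
    uint64_t emb_len = (uint64_t)filled_embs * n_samples;
    #pragma acc exit data copyout(dm_stripes_buf[0:buf_len]) delete(emb[0:emb_len])
}

Allocating one contiguous dm_stripes_buf, rather than an array of per-stripe pointers, is what makes a single data directive sufficient here.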
Memory writes are much more expensive than memory reads, so cluster the reads and minimize the writes. Also undo the manual unrolls: they were optimal for the CPU, but bad for the GPU.
Reorder the loops to maximize cache reuse.
Measuring runtime computing UniFrac on the EMP dataset:
Intel Xeon E5-2680 v4 CPU (using all 14 cores): 800 minutes (13 hours)
NVIDIA Tesla V100 (using all 84 SMs): 92 minutes (1.5 hours)
NVIDIA Tesla V100 (using all 84 SMs): 33 minutes
NVIDIA Tesla V100 (using all 84 SMs): 12 minutes
UniFrac was originally designed and has always been implemented using fp64 math operations. The higher-precision floating point math was used to maximize the reliability of the results.
Can we use fp32?
On mobile and gaming GPUs, fp64 compute is 32x slower than fp32 compute.
We can!
We compared the results of the compute using the fp32-enabled and the fp64-only code on the same EMP input, and observed near-identical results (Mantel R² = 0.99999, p < 0.001, comparing the pairwise distances in the two matrices).
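For context, the Mantel test compares two distance matrices by correlating their corresponding entries, with significance assessed by permuting the rows and columns of one matrix. The sketch below shows just the correlation part, assuming both matrices are stored as flat, row-major n-by-n arrays; it is not the analysis code behind the numbers above.

#include <cmath>
#include <cstdint>

// Pearson correlation between the upper triangles of two n x n distance
// matrices stored row-major; squaring the result gives the Mantel R².
// The full Mantel test additionally permutes one matrix many times to
// obtain a p-value, which is omitted here.
double mantel_r(const double *d1, const double *d2, uint64_t n) {
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    uint64_t m = 0;
    for (uint64_t i = 0; i < n; i++) {
        for (uint64_t j = i + 1; j < n; j++) {
            double x = d1[i * n + j], y = d2[i * n + j];
            sx += x; sy += y;
            sxx += x * x; syy += y * y; sxy += x * y;
            m++;
        }
    }
    double cov = sxy - sx * sy / m;
    double vx = sxx - sx * sx / m;
    double vy = syy - sy * sy / m;
    return cov / std::sqrt(vx * vy);
}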
         E5-2680 v4 CPU     GPU    GPU     GPU     GPU    GPU
         Original    New    V100   2080TI  1080TI  1080   Mobile 1050
fp64     800         193    12     59      77      99     213
fp32     -           190    9.5    19      31      36     64

Runtime computing UniFrac on the EMP dataset (single chip, in minutes)
Per chip            128x CPU E5-2680 v4   128x GPU   4x GPU   16x GPU   16x GPU
(in minutes)        Original    New       V100       V100     2080TI    1080TI
fp64                415         97        14         29       184       252
fp32                -           91        12         20       32        82

Aggregated          128x CPU E5-2680 v4   128x GPU   4x GPU   16x GPU   16x GPU
(in chip hours)     Original    New       V100       V100     2080TI    1080TI
fp64                890         207       30         1.9      49        67
fp32                -           194       26         1.3      8.5       22

Runtime computing UniFrac on a dataset containing 113,721 samples (using multiple chips)
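The fp32 and fp64 rows in the tables above can come from a single code base. One way to support both precisions without duplicating the loops is to template the kernel over the floating-point type, so the same OpenACC loop nest compiles for either precision. A minimal sketch under that assumption, mirroring the collapse(2) loop shown earlier with hypothetical names (not the actual UniFrac implementation):

#include <cstdint>

// Same stripe-accumulation kernel as above, parameterized on the precision.
// Instantiating with float gives the fp32 path, with double the fp64 path.
template <class TFloat>
void accumulate_stripes(TFloat *dm_stripes_buf, const TFloat *emb,
                        const TFloat *length,
                        unsigned int start, unsigned int stop,
                        unsigned int start_idx,
                        unsigned int n_samples, unsigned int filled_embs) {
    #pragma acc parallel loop collapse(2) present(emb,dm_stripes_buf,length)
    for (unsigned int stripe = start; stripe < stop; stripe++) {
        for (unsigned int k = 0; k < n_samples; k++) {
            TFloat *dm_stripe = dm_stripes_buf +
                                (uint64_t)(stripe - start_idx) * n_samples;
            TFloat my_stripe = dm_stripe[k];
            #pragma acc loop seq
            for (unsigned int e = 0; e < filled_embs; e++) {
                uint64_t offset = (uint64_t)n_samples * e;
                TFloat u = emb[offset + k];
                TFloat v = emb[offset + k + stripe + 1];
                my_stripe += (u - v) * length[e];
            }
            dm_stripe[k] = my_stripe;
        }
    }
}

// Explicit instantiations: the fp64 and fp32 builds of the same code.
template void accumulate_stripes<double>(double *, const double *, const double *,
                                         unsigned int, unsigned int, unsigned int,
                                         unsigned int, unsigned int);
template void accumulate_stripes<float>(float *, const float *, const float *,
                                        unsigned int, unsigned int, unsigned int,
                                        unsigned int, unsigned int);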
Conclusions
Our work now allows many microbiome analyses which were
previously relegated to large compute clusters to be performed with
much lower resource requirements. Even the largest datasets currently
envisaged could be processed in reasonable time with just a handful of
server-class GPUs, while smaller but still sizable datasets like the
EMP can be processed even on GPU-enabled workstations.
We also explored the use of lower-precision floating point math to
effectively exploit consumer-grade GPUs, which are typical in
desktop and laptop setups. We conclude that fp32 math yields nearly identical results to fp64 and should be adequate for the vast majority of studies, making GPU-enabled personal devices, even laptops, sufficient for this otherwise rate-limiting step for many researchers.
(Figure: an ordination plot of unweighted UniFrac distances over 113,721 samples)
Xeon E5-2680 v4 CPU (using all 14 cores): 193 minutes (~3 hours) with the optimized code.

Properly aligning the memory buffers and picking the right values for the grouping parameters can make a 5x difference in speed.
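A sketch of what aligning the buffers and choosing the grouping parameter could look like in practice; the 64-byte alignment and the step_size value are illustrative assumptions, not the settings used for the measurements above.

#include <cstdint>
#include <cstdlib>

// Hypothetical tuning knob for the collapse(3) loop above: the number of
// samples handled per innermost block. The best value depends on the device
// and is worth benchmarking.
constexpr unsigned int step_size = 16;   // assumed value, tune per device

// Allocate the unified stripe buffer aligned to a cache-line boundary so that
// consecutive k indices within a block fall into as few lines as possible.
inline double *alloc_stripes(unsigned int n_stripes, unsigned int n_samples) {
    uint64_t bytes = (uint64_t)n_stripes * n_samples * sizeof(double);
    bytes = (bytes + 63) / 64 * 64;              // round up to a 64-byte multiple
    return static_cast<double *>(std::aligned_alloc(64, bytes));
}

// The outer grouping loop then covers all samples in step_size chunks:
//   sample_steps = (n_samples + step_size - 1) / step_size;
// with a guard inside for k = sk*step_size + ik >= n_samples.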
