ParallelProcessing.md

Parallel Computing with R

library(knitr)
library(raster)
library(rasterVis)
library(dplyr)
library(ggplot2)

## New Packages
library(foreach)
library(doParallel)
library(arm)
library(coda)
library(ggmcmc)
library(fields)
library(snow)

If you don't have the packages above, install them in the package manager or by running install.packages("doParallel").

Introduction

Serial Computing

Most (legacy) software is written for serial computation:

Problem broken into discrete set of instructions
Instructions executed sequentially on a single processor

Figure from here

Parallel computation

Parallel computing is the simultaneous use of multiple compute resources:

Problem divided into discrete parts that can be solved concurrently
Instructions from each part execute simultaneously on different processors
An overall control/coordination mechanism is employed

Figure from here

Flynn's taxonomy

A classification of computer architectures (Flynn, 1972)

Single Instruction, Single Data (SISD)
- No parallelization
Single Instruction, Multiple Data (SIMD)
- Run the same code/analysis on different datasets
- Examples:
  - different species in species distribution model
  - same species under different climates
  - different MCMC chains from a Bayesian Model
Multiple Instruction, Single Data (MISD)
- Run different code/analyses on the same data
- Examples:
  - One species, multiple models
Multiple Instruction, Multiple Data streams (MIMD)
- Run different code/analyses on different data
- Examples:
  - Different species & different models

Figure from here

Our focus: Single Instruction, Multiple Data (SIMD)

Parallel functions within an R script
- starts on single processor
- runs looped elements on multiple 'slave' processors
- returns results of all iterations to the original instance
- foreach, multicore, plyr, raster
Alternative: run many separate instances of R in parallel with Rscript
- need another operation to combine the results
- preferable for long, complex jobs
- NOT planning to discuss in this session

R Packages

There are many R packages for parallelization, check out the CRAN Task View on High-Performance and Parallel Computing for an overview. For example:

Rmpi: Built on MPI (Message Passing Interface), a de facto standard in parallel computing.
snow: Simple Network of Workstations can use several standards (PVM, MPI, NWS)
parallel Built in R package (since v2.14.0).

Foreach Package

In this session we'll focus on the foreach package, which has numerous advantages including:

intuitive for() loop-like syntax
flexibility of choosing a parallel 'backend' for laptops through to supercomputers (using multicore, parallel, snow, Rmpi, etc.)
nice options for combining output from parallelized jobs

Documentation for foreach:

Foreach backends

doParallel best for use on multicore machines (uses fork on linux/mac and snow on windows).
doMPI: Interface to MPI (Message-Passing Interface)
doSNOW: Simple Network of Workstations

Simple examples

Sequential `for()` loop

x=vector()
for(i in 1:3) x[i]=i^2
x

Sequential `foreach()` loop

x <- foreach(i=1:3) %do% i^2
x

Note that x is a list with one element for each iterator variable (i). You can also specify a function to use to combine the outputs with .combine. Let's concatenate the results into a vector with c.

Sequential `foreach()` loop with `.combine`

x <- foreach(i=1:3,.combine='c') %do% i^2
x

Tells foreach() to first calculate each iteration, then .combine them with a c(...)

Sequential `foreach()` loop with `.combine`

x <- foreach(i=1:3,.combine='rbind') %do% i^2
x

Your Turn

Write a foreach() loop that:

generates 10 sets of 100 random values from a normal distribution (rnorm())
column-binds (cbind) them.

x <- foreach(i=1:10,.combine='cbind') %do% rnorm(100)
head(x)%>%kable()
dim(x)

Parallel `foreach()` loop

So far we've only used %do% which only uses a single processor.

Before running foreach() in parallel, you have to register a parallel backend with one of the do functions such as doParallel(). On most multicore systems, the easiest backend is typically doParallel(). On linux and mac, it uses fork system call and on Windows machines it uses snow backend. The nice thing is it chooses automatically for the system.

# register specified number of workers
registerDoParallel(3)
# or, reserve all all available cores 
#registerDoParallel()		
# check how many cores (workers) are registered
getDoParWorkers()

NOTE It may be a good idea to use n-1 cores for processing (so you can still use your computer to do other things while the analysis is running)

To run in parallel, simply change the %do% to %dopar%. Wasn't that easy?

## run the loop
x <- foreach(i=1:3, .combine='c') %dopar% i^2
x

A slightly more complicated example

In this section we will:

Generate data with known parameters
Fit a set of regression models using subsets of the complete dataset (e.g. bootstrapping)
Compare processing times for sequential vs. parallel execution

For more on motivations to bootstrap, see here.

Make up some data:

n <- 100000              # number of data points
x1 <- rnorm (n)          # make up x1 covariate
b0 <- 1.8                # set intercept (beta0)
b1 <- -1.5                # set beta1
p = invlogit(b0+b1*x1)
y <- rbinom (n, 1, p)  # simulate data with noise
data=cbind.data.frame(y=y,x1=x1,p=p)

Let's look at the data:

kable(head(data),row.names = F,digits = 2)

ggplot(data,aes(y=x1,x=as.factor(y)))+
  geom_boxplot()+
  coord_flip()+
  geom_line(aes(x=p+1,y=x1),col="red",size=2,alpha=.5)+
  xlab("Binary Response")+
  ylab("Covariate")

Sampling from a dataset with `sample_n()`

size=5
sample_n(data,size,replace=T)

Simple Generalized Linear Model

This is the formal definition of the model we'll use:

$y_i \sim Bernoulli(p_i)$

$logit(p_i) = \beta_0 + \beta_1 X_i$

The index $i$ identifies each grid cell (data point). $\beta_0$ - $\beta_1$ are model coefficients (intercept and slope), and $y_i$ is the simulated observation from cell $i$.

trials = 10000
tsize = 100

  ptime <- system.time({
  result <- foreach(i=1:trials,
                    .combine = rbind.data.frame) %dopar% {
  tdata=sample_n(data,tsize,replace=TRUE)
  M1=glm(y ~ x1, data=tdata, family=binomial(link="logit"))
  ## return parameter estimates
  cbind.data.frame(trial=i,t(coefficients(M1)))
    }
  })
ptime

Look at results object containing slope and aspect from subsampled models. There is one row per sample with columns for the estimated intercept and slope for that sample.

kable(head(result),digits = 2)

ggplot(dplyr::select(result,everything(),Intercept=contains("Intercept")))+
  geom_density(aes(x=Intercept),fill="black",alpha=.2)+
  geom_vline(aes(xintercept=b0),size=2)+
  geom_density(aes(x=x1),fill="red",alpha=.2)+
  geom_vline(aes(xintercept=b1),col="red",size=2)+
  xlim(c(-5,5))+
  ylab("Parameter Value")+
  xlab("Density")

So we were able to perform 10^{4} separate model fits in seconds. Let's see how long it would have taken in sequence.

  stime <- system.time({
  result <- foreach(i=1:trials,
                    .combine = rbind.data.frame) %do% {
  tdata=sample_n(data,tsize,replace=TRUE)
  M1=glm(y ~ x1, data=tdata,family=binomial(link="logit"))
  ## return parameter estimates
  cbind.data.frame(trial=i,t(coefficients(M1)))
    }
  })
stime

So we were able to run 10^{4} separate model fits in seconds when using 3 CPUs and seconds on one CPU. That's X faster for this simple example.

Your Turn

Generate some random as follows:

n <- 10000              # number of data points
x1 <- rnorm (n)          # make up x1 covariate
b0 <- 25                # set intercept (beta0)
b1 <- -15                # set beta1
y <- rnorm (n, b0+b1*x1,10)  # simulate data with noise
data2=cbind.data.frame(y=y,x1=x1)

Write a parallel foreach() loop that:

selects a sample (as above) with sample_n()
fits a 'normal' linear model with lm()
Compiles the coefficient of determination (R^2) from each model

Hint: see ?summary.lm.

trials = 100
tsize = 100

  result <- foreach(i=1:trials,.combine = rbind.data.frame) %dopar% {
  tdata=sample_n(data2,tsize,replace=TRUE)
  M1=lm(y ~ x1, data=tdata)
  cbind.data.frame(trial=i,
                   t(coefficients(M1)),
                   r2=summary(M1)$r.squared,
                   aic=AIC(M1))
  }


ggplot(data2,aes(y=y,x=x1))+
  geom_point(col="grey")+
  geom_abline(data=dplyr::select(result,everything(),Intercept=contains("Intercept")),
              aes(intercept=Intercept,slope=x1),alpha=.5)

Writing data to disk

For long-running processes, you may want to consider writing results to disk as-you-go rather than waiting until the end in case of a problem (power failure, single job failure, etc.).

## assign target directory
td=tempdir()

  foreach(i=1:trials,
                    .combine = rbind.data.frame) %dopar% {
  tdata=sample_n(data,tsize,replace=TRUE)
  M1=glm(y ~ x1, data=tdata, family=binomial(link="logit"))
  ## return parameter estimates
  results=cbind.data.frame(trial=i,t(coefficients(M1)))
  
  ## write results to disk
  file=paste0(td,"/results_",i,".csv")
  write.csv(results,file=file)
  return(NULL)
    }

That will save the result of each subprocess to disk (be careful about duplicated file names!):

list.files(td,pattern="results")%>%head()

Other useful `foreach` parameters

.inorder (true/false) results combined in the same order that they were submitted?
.errorhandling (stop/remove/pass)
.packages packages to made available to sub-processes
.export variables to export to sub-processes

Spatial example

In this section we will:

Generate some spatial data
Tile the region to facilitate processing the data in parallel.
Perform a moving window mean for the full area
Compare processing times for sequential vs. parallel execution

Generate Spatial Data

A function to generate raster object with spatial autocorrelation.

simrast=function(nx=60,ny=60,theta=10,seed=1234){
      ## create a random raster with some spatial structure
      ## Theta is the scale of an exponential decay function.  
      ## This controls degree of autocorrelation, 
      ## values close to 1 are close to random while values near nx/4 have high autocorrelation
     r=raster(nrows=ny, ncols=nx,vals=1,xmn=-nx/2, xmx=nx/2, ymn=-ny/2, ymx=ny/2)
      names(r)="z"
      # Simulate a Gaussian random field with an exponential covariance function
      set.seed(seed)  #set a seed so everyone's maps are the same
      grid=list(x=seq(xmin(r),xmax(r)-1,by=res(r)[1]),y=seq(ymin(r),ymax(r)-1,res(r)[2]))
      obj<-Exp.image.cov(grid=grid, theta=theta, setup=TRUE)
      look<- sim.rf( obj)      
      values(r)=t(look)*10
      return(r)
      }

Generate a raster using simrast.

r=simrast(nx=3000,ny=1000,theta = 100)
r

Plot the raster showing the grid.

gplot(r)+
  geom_raster(aes(fill = value))+ 
  scale_fill_gradient(low = 'white', high = 'blue')+
  coord_equal()+ylab("Y")+xlab("X")

"Tile" the region

To parallelize spatial data, you often need to tile the data and process each tile separately. Here is a function that will take a bounding box, tile size and generate a tiling system. If given an overlap term, it will also add buffers to the tiles to reduce/eliminate edge effects, though this depends on what algorithm/model you are using.

tilebuilder=function(raster,size=10,overlap=NULL){
  ## get raster extents
  xmin=xmin(raster)
  xmax=xmax(raster)
  ymin=ymin(raster)
  ymax=ymax(raster)
  xmins=c(seq(xmin,xmax-size,by=size))
  ymins=c(seq(ymin,ymax-size,by=size))
  exts=expand.grid(xmin=xmins,ymin=ymins)
  exts$ymax=exts$ymin+size
  exts$xmax=exts$xmin+size
  if(!is.null(overlap)){
  #if overlapped tiles are requested, create new columns with buffered extents
    exts$yminb=exts$ymin
    exts$xminb=exts$xmin
    exts$ymaxb=exts$ymax
    exts$xmaxb=exts$xmax
    
    t1=(exts$ymin-overlap)>=ymin
    exts$yminb[t1]=exts$ymin[t1]-overlap
    t2=exts$xmin-overlap>=xmin
    exts$xminb[t2]=exts$xmin[t2]-overlap    
    t3=exts$ymax+overlap<=ymax
    exts$ymaxb[t3]=exts$ymax[t3]+overlap
    t4=exts$xmax+overlap<=xmax
    exts$xmaxb[t4]=exts$xmax[t4]+overlap  
  }
  exts$tile=1:nrow(exts)
  return(exts)
}

Generate a tiling system for that raster. Here will use only three tiles (feel free to play with this).

jobs=tilebuilder(r,size=1000,overlap=80)
kable(jobs,row.names = F,digits = 2)

Plot the raster showing the grid.

ggplot(jobs)+
  geom_raster(aes(x=coordinates(r)[,1],y=coordinates(r)[,2],fill = values(r)))+ 
  scale_fill_gradient(low = 'white', high = 'blue')+
  geom_rect(mapping=aes(xmin=xmin,xmax=xmax,ymin=ymin,ymax=ymax),
            fill="transparent",lty="dashed",col="darkgreen")+
  geom_rect(aes(xmin=xminb,xmax=xmaxb,ymin=yminb,ymax=ymaxb),
            fill="transparent",col="black")+
  geom_text(aes(x=(xminb+xmax)/2,y=(yminb+ymax)/2,label=tile),size=10)+
  coord_equal()+ylab("Y")+xlab("X")

Run a simple spatial analysis: `focal` moving window

Use the focal function from the raster package to calculate a 3x3 moving window mean over the raster.

stime2=system.time({
  r_focal1=focal(r,w=matrix(1,101,101),mean,pad=T)
  })
stime2

## plot it
gplot(r_focal1)+
  geom_raster(aes(fill = value))+ 
  scale_fill_gradient(low = 'white', high = 'blue')+
  coord_equal()+ylab("Y")+xlab("X")

That works great (and pretty fast) for this little example, but as the data (or the size of the window) get larger, it can become prohibitive.

Repeat the analysis, but parallelize using the tile system.

First write a function that breaks up the original raster, computes the focal mean, then puts it back together. You could also put this directly in the foreach() loop.

focal_par=function(i,raster,jobs,w=matrix(1,101,101)){
  ## identify which row in jobs to process
  t_ext=jobs[i,]
  ## crop original raster to (buffered) tile
  r2=crop(raster,extent(t_ext$xminb,t_ext$xmaxb,t_ext$yminb,t_ext$ymaxb))
  ## run moving window mean over tile
  rf=focal(r2,w=w,mean,pad=T)
  ## crop to tile
  rf2=crop(rf,extent(t_ext$xmin,t_ext$xmax,t_ext$ymin,t_ext$ymax))
  ## return the object - could also write the file to disk and aggregate later outside of foreach()
  return(rf2)
}

Run the parallelized version.

registerDoParallel(3)  	

ptime2=system.time({
  r_focal=foreach(i=1:nrow(jobs),.combine=merge,.packages=c("raster")) %dopar% focal_par(i,r,jobs)
  })

Are the outputs the same?

identical(r_focal,r_focal1)

So we were able to process the data in seconds when using 3 CPUs and seconds on one CPU. That's X faster for this simple example.

Parallelized Raster functions

Some functions in raster package also easy to parallelize.

ncores=2
beginCluster(ncores)

fn=function(x) x^3

system.time(fn(r))
system.time(clusterR(r, fn, verbose=T))

endCluster()

Does not work with:

merge
crop
mosaic
(dis)aggregate
resample
projectRaster
focal
distance
buffer
direction

High Performance Computers (HPC)

aka supercomputers, for example, check out the University at Buffalo HPC

Working on a cluster can be quite different from a laptop/workstation. The most important difference is the existence of scheduler that manages large numbers of individual tasks.

SLURM and R

You typically don't run the script interactively, so you need to edit your script to 'behave' like a normal #! (linux command line) script. This is easy with getopt package.

cat(paste("
          library(getopt)
          ## get options
          opta <- getopt(
              matrix(c(
                  'date', 'd', 1, 'character'
              ), ncol=4, byrow=TRUE))
          ## extract value
          date=as.Date(opta$date) 
          
          ## Now your script using date as an input
          print(date+1)
          q(\"no\")
          "
          ),file=paste("script.R",sep=""))

Then you can run this script from the command line like this:

Rscript script.R --date 2013-11-05

You will need the complete path if Rscript is not on your system path. For example, on OS X, it might be at /Library/Frameworks/R.framework/Versions/3.2/Resources/Rscript.

Or even from within R like this:

system("Rscript script.R --date 2013-11-05",intern=T)

Driving cluster from R

Possible to drive the cluster from within R via SLURM. First, define the jobs and write that file to disk:

script="script.R"
dates=seq(as.Date("2000-01-01"),as.Date("2000-12-31"),by=60)
pjobs=data.frame(jobs=paste(script,"--date",dates))

write.table(pjobs,                     
  file="process.txt",
  row.names=F,col.names=F,quote=F)

This table has one row per task:

pjobs

Now identify other parameters for SLURM.

### Set up submission script
nodes=2
walltime=5

Write the SLURM script

More information here

### write SLURM script to disk from R

cat(paste("#!/bin/sh
#SBATCH --partition=general-compute
#SBATCH --time=00:",walltime,":00
#SBATCH --nodes=",nodes,"
#SBATCH --ntasks-per-node=8
#SBATCH --constraint=IB
#SBATCH --mem=300
# Memory per node specification is in MB. It is optional. 
# The default limit is 3000MB per core.
#SBATCH --job-name=\"date_test\"
#SBATCH --output=date_test-srun.out
#SBATCH --mail-user=adamw@buffalo.edu
#SBATCH --mail-type=ALL
##SBATCH --requeue
#Specifies that the job will be requeued after a node failure.
#The default is that the job will not be requeued.

## Load necessary modules
module load openmpi/gcc-4.8.3/1.8.4   
module load R

IDIR=~
WORKLIST=$IDIR/process.txt
EXE=Rscript
LOGSTDOUT=$IDIR/log/stdout
LOGSTDERR=$IDIR/log/stderr
          
### use mpiexec to parallelize across lines in process.txt
mpiexec -np $CORES xargs -a $WORKLIST -p $EXE 1> $LOGSTDOUT 2> $LOGSTDERR
",sep=""),file=paste("slurm_script.txt",sep=""))

Now we have a list of jobs and a qsub script that points at those jobs with the necessary PBS settings.

## run it!
system("sbatch slurm_script.txt")
## Check status with squeue
system("squeue -u adamw")

For more detailed information about the UB HPC, see the CCR Userguide.

Summary

Each task should involve computationally-intensive work. If the tasks are very small, it can take longer to run in parallel.

Choose your method

Run from master process (e.g. foreach)
- easier to implement and collect results
- fragile (one failure can kill it and lose results)
- clumsy for big jobs
Run as separate R processes
- see getopt library
- safer for big jobs: each job completely independent
- update job list to re-run incomplete submissions
- compatible with slurm / cluster computing
- forces you to have a clean processing script

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parallel Computing with R

Introduction

Serial Computing

Parallel computation

Flynn's taxonomy

Our focus: Single Instruction, Multiple Data (SIMD)

R Packages

Foreach Package

Documentation for foreach:

Foreach backends

Simple examples

Sequential `for()` loop

Sequential `foreach()` loop

Sequential `foreach()` loop with `.combine`

Sequential `foreach()` loop with `.combine`

Your Turn

Parallel `foreach()` loop

A slightly more complicated example

Sampling from a dataset with `sample_n()`

Simple Generalized Linear Model

Your Turn

Writing data to disk

Other useful `foreach` parameters

Spatial example

Generate Spatial Data

"Tile" the region

Run a simple spatial analysis: `focal` moving window

Repeat the analysis, but parallelize using the tile system.

Parallelized Raster functions

High Performance Computers (HPC)

SLURM and R

Driving cluster from R

Write the SLURM script

Summary

Choose your method

Further Reading

FilesExpand file tree

ParallelProcessing.md

Latest commit

History

ParallelProcessing.md

File metadata and controls

Parallel Computing with R

Introduction

Serial Computing

Parallel computation

Flynn's taxonomy

Our focus: Single Instruction, Multiple Data (SIMD)

R Packages

Foreach Package

Documentation for foreach:

Foreach backends

Simple examples

Sequential for() loop

Sequential foreach() loop

Sequential foreach() loop with .combine

Sequential foreach() loop with .combine

Your Turn

Parallel foreach() loop

A slightly more complicated example

Sampling from a dataset with sample_n()

Simple Generalized Linear Model

Your Turn

Writing data to disk

Other useful foreach parameters

Spatial example

Generate Spatial Data

"Tile" the region

Run a simple spatial analysis: focal moving window

Repeat the analysis, but parallelize using the tile system.

Parallelized Raster functions

High Performance Computers (HPC)

SLURM and R

Driving cluster from R

Write the SLURM script

Summary

Choose your method

Further Reading

Sequential `for()` loop

Sequential `foreach()` loop

Sequential `foreach()` loop with `.combine`

Sequential `foreach()` loop with `.combine`

Parallel `foreach()` loop

Sampling from a dataset with `sample_n()`

Other useful `foreach` parameters

Run a simple spatial analysis: `focal` moving window