Solution Patterns for Parallel
Programming
CS4532 Concurrent Programming
Dilum Bandara
Dilum.Bandara@uom.lk
Some slides adapted from Dr. Srinath Perera
Outline
 Designing parallel algorithms
 Solution patterns for parallelism
 Loop Parallel
 Fork/Join
 Divide & Conquer
 Pipeline
 Asynchronous Agents
 Producer/Consumer
 Load balancing
2
Building a Solution by Composition
 We often solve a problem by reducing it to a
composition of known problems
 Finding the way to Habarana?
 Sorting 1 million integers
 Can we solve this with Mutex & Semaphores?
 Mutex for mutual exclusion
 Semaphores for signaling
 There is another level
3
Designing Parallel Algorithms
 Parallel algorithm design is not easily reduced to
simple recipes
 A parallel version of a serial algorithm is not
necessarily optimal
 Good algorithms require creativity
 Goal
 Suggest a framework within which parallel algorithm
design can be explored
 Develop intuition as to what constitutes a good
parallel algorithm
4
Methodical Design
 Partitioning &
communication focus
on concurrency &
scalability
 Agglomeration &
mapping focus on
locality & other
performance issues
5
Source: www.drdobbs.com/parallel/designing-parallel-
algorithms-part-1/223100878
Methodical Design (Cont.)
1. Partitioning
 Decompose computation/data into small tasks/chunks
 Focus on recognizing opportunities for parallel
execution
 Practical issues such as no of CPUs are ignored
2. Communication
 Determine communication required to coordinate task
execution
 Define communication structures & algorithms
6
Methodical Design (Cont.)
3. Agglomeration
 Defined task & communication structures are
evaluated with respect to
 Performance requirements
 Implementation costs
 If necessary, tasks are combined into larger tasks to
 Improve performance
 Reduce development costs
7
Source: www.drdobbs.com/architecture-and-design/designing-
parallel-algorithms-part-3/223500075
Methodical Design (Cont.)
4. Mapping
 Each task is assigned to a processor while attempting
to satisfy competing goals of
 Maximizing processor utilization
 Minimizing communication costs
 Static mapping
 At design/compile time
 Dynamic mapping
 At runtime by load-balancing algorithms
8
Parallel Algorithm Design Issues
 Efficiency
 Scalability
 Partitioning computations
 Domain decomposition – based on data
 Functional decomposition – based on computation
 Locality
 Spatial & temporal
 Synchronous & asynchronous communication
 Agglomeration to reduce communication
 Load-balancing
9
3 Ways to Parallelize
1. By Data
 Partition data & give it to different threads
2. By Task
 Partition task into smaller tasks & give it to different
threads
3. By Order
 Partition task into steps & give them to different threads
10
By Data
 Use SPMD model
 When data can be processed locally with few
dependencies on other data
 Patterns
 Loop parallel, embarrassingly parallel
 Large data units – underutilization
 Small data units – thrashing
 Chunk layout
 Based on dependencies & caching
 Example – Processing geographical data
11
By Task
 Task Parallel, Divide & Conquer
 Too many tasks – thrashing
 Too few tasks – underutilization
 Dependencies among tasks
 Removable
 Code transformations
 Separable
 Accumulation operations (average, sum, count) – see the reduction sketch below
 Extrema (max, min)
 Read only, Read/Write
12
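A separable accumulation (e.g., a sum used for an average) can be handled by giving each thread a private copy of the accumulator & combining the copies at the end. A minimal OpenMP sketch, assuming placeholder data & compiling with -fopenmp:

#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N];
    for (int i = 0; i < N; i++) a[i] = i * 0.5;   /* placeholder data */

    double sum = 0.0;
    /* reduction(+:sum): each thread accumulates into its own private
       copy of sum; OpenMP adds the copies together when the loop ends */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("mean = %f\n", sum / N);
    return 0;
}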
By Order
 Pipeline & Asynchronous Agents
 Dependencies
 Temporal – before/after
 Same time
 None
13
Load Balancing
 Some threads will be busy while others are idle
 Counter by distributing load equally
 When the cost of the problem is well understood, this is possible
 e.g., matrix multiplication, known tree walk
 Some other problems are not that simple
 Hard to predict how the workload will be distributed → use
dynamic load balancing
 But this requires communication between threads/tasks
 2 methods for dynamic load balancing
 Task queues
 Work stealing
14
Task Queues
 Multiple instances of task queues (producer/consumer)
 Threads come to the task queue after finishing a
task & grab the next task
 Typically run with a thread pool with a fixed no of
threads (see the sketch below)
15
Source: http://blog.zenika.com
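A minimal Pthreads sketch of the idea: a fixed pool of worker threads repeatedly returns to one shared queue of task IDs protected by a mutex. Names such as task_queue_t, pop_task, & do_work are illustrative only; a real queue that is filled concurrently would also need a condition variable so idle workers can block.

#include <pthread.h>
#include <stdio.h>

#define NUM_TASKS   32
#define NUM_WORKERS 4

/* A trivial shared queue of task IDs, protected by a mutex. */
typedef struct {
    int next;                 /* next task ID to hand out */
    int total;                /* total number of tasks    */
    pthread_mutex_t lock;
} task_queue_t;

static task_queue_t queue = { 0, NUM_TASKS, PTHREAD_MUTEX_INITIALIZER };

/* Pop the next task ID, or -1 when the queue is empty. */
static int pop_task(task_queue_t *q) {
    pthread_mutex_lock(&q->lock);
    int id = (q->next < q->total) ? q->next++ : -1;
    pthread_mutex_unlock(&q->lock);
    return id;
}

static void do_work(int task_id) {
    printf("ran task %d\n", task_id);   /* placeholder for the real task body */
}

/* Each worker returns to the queue after finishing a task. */
static void *worker(void *arg) {
    (void)arg;
    int id;
    while ((id = pop_task(&queue)) != -1)
        do_work(id);
    return NULL;
}

int main(void) {
    pthread_t pool[NUM_WORKERS];
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_create(&pool[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(pool[i], NULL);
    return 0;
}

Because workers pull a new task only when they finish the previous one, the load balances itself even if some tasks take much longer than others.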
Work Stealing
 Every thread has a work/task queue
 When 1 thread runs out of work, it goes to another
thread’s task queue & “steals” work
16
Source: http://karlsenchoi.blogspot.com
Efficiency = Maximizing Parallelism?
 Usually it is 2 things
 Run algorithm in MAX no of threads with minimal
communication/waiting
 When size of the problem grows, algorithm can handle
it by adding new resources
 It’s done through the right architecture + tuning
 There is no clear way to do it
 Just like “design patterns” for OOP, people have
identified parallel programming patterns
17
Solution Patterns for Parallelism
 Loop Parallel
 Fork/Join
 Divide and Conquer
 Producer/Consumer
 Pipeline
 Asynchronous Agents
18
Loop Parallel
 If each iteration in a loop depends only on that
iteration’s results + read-only data, each iteration
can run in a different thread
 As it’s based on data, also called data parallelism
int[] A = ...; int[] B = ...; int[] C = ...;
for (int i = 0; i < N; i++) {
    C[i] = F(A[i], B[i]);
}
19
Which of These are Loop Parallel?
int[] A = ...; int[] B = ...; int[] C = ...;
for (int i = 1; i < N; i++) {
    C[i] = F(A[i], B[i-1]);
}
int[] A = ...; int[] B = ...; int[] C = ...;
for (int i = 1; i < N; i++) {
    C[i] = F(A[i], C[i-1]);
}
20
Implementing Loop Parallel
 OpenMP example (a minimal sketch follows)
21
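A minimal OpenMP sketch of the earlier C[i] = F(A[i], B[i]) loop, assuming F is a pure element-wise placeholder function & compiling with -fopenmp:

#include <stdio.h>

#define N 1000

/* placeholder element-wise function with no side effects */
static double F(double a, double b) { return a * b + 1.0; }

int main(void) {
    static double A[N], B[N], C[N];
    for (int i = 0; i < N; i++) { A[i] = i; B[i] = 2 * i; }

    /* Each iteration writes only its own C[i] and reads only
       read-only A & B, so iterations can run on different threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        C[i] = F(A[i], B[i]);

    printf("C[1] = %f\n", C[1]);
    return 0;
}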
Fork/Join
 Fork job into smaller tasks (independent if
possible), perform them, & join them
 Examples
 Calculate the mean across an array
 Tree walk
 How to partition?
 By Data, e.g., SPMD
 By Task, e.g., MPSD
22
Source: http://en.wikipedia.org/wiki/Fork%E2%80%93join_model
Fork/Join (Cont.)
 Size of work unit
 Small units – thrashing
 Big units – imbalance
 Balancing load among threads
 Static allocation
 If data/task is completely known
 E.g., matrix addition
 Dynamic allocation (tree walks)
 Task queues
 Work Stealing
23
Implementing Fork/Join
 Pthreads (see the sketch below)
 OpenMP
24
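A Pthreads sketch of the array-mean example: the main thread forks one thread per data chunk, joins them, & combines the partial sums. The thread count, array size, & chunk split are arbitrary choices for illustration.

#include <pthread.h>
#include <stdio.h>

#define N           1000000
#define NUM_THREADS 4

static double data[N];

typedef struct { int start, end; double partial; } chunk_t;

/* Each forked thread sums its own chunk of the array. */
static void *sum_chunk(void *arg) {
    chunk_t *c = (chunk_t *)arg;
    c->partial = 0.0;
    for (int i = c->start; i < c->end; i++)
        c->partial += data[i];
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) data[i] = i;   /* placeholder data */

    pthread_t threads[NUM_THREADS];
    chunk_t chunks[NUM_THREADS];
    int per = N / NUM_THREADS;

    /* Fork: one thread per chunk. */
    for (int t = 0; t < NUM_THREADS; t++) {
        chunks[t].start = t * per;
        chunks[t].end   = (t == NUM_THREADS - 1) ? N : (t + 1) * per;
        pthread_create(&threads[t], NULL, sum_chunk, &chunks[t]);
    }

    /* Join: wait for all threads, then combine partial results. */
    double total = 0.0;
    for (int t = 0; t < NUM_THREADS; t++) {
        pthread_join(threads[t], NULL);
        total += chunks[t].partial;
    }
    printf("mean = %f\n", total / N);
    return 0;
}

A static split like this assumes all chunks cost roughly the same; for irregular work such as tree walks, the task-queue or work-stealing schemes above balance better.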
Divide & Conquer
 Break problem into recursive sub-problems &
assign them to different threads
 Examples
 Quick sort
 Search for a value in a tree
 Calculating Fibonacci Sequence
 Sub-problems often fork again, leading to an execution tree
 Recursion
 May or may not have a join step
 Deep tree – thrashing
 Shallow tree – underutilization
25
Divide & Conquer – Fibonacci
Sequence
Source: Introduction to Algorithms (3rd Edition) by Cormen, Leiserson, Rivest & Stein
26
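One way to map the recursion tree in the figure onto threads is OpenMP tasks; a rough sketch, where the cutoff of 20 is an arbitrary guard against a tree that is too deep (too many tiny tasks). Compile with -fopenmp.

#include <stdio.h>

/* Divide & conquer: each branch may run as a separate task.
   Below the cutoff we fall back to plain recursion to avoid thrashing. */
static long fib(int n) {
    if (n < 2) return n;
    if (n < 20)                       /* cutoff: stop dividing, just recurse */
        return fib(n - 1) + fib(n - 2);
    long x, y;
    #pragma omp task shared(x)
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);
    #pragma omp taskwait              /* join step: wait for both sub-problems */
    return x + y;
}

int main(void) {
    long result;
    #pragma omp parallel
    #pragma omp single                /* one thread starts the recursion */
    result = fib(35);
    printf("fib(35) = %ld\n", result);
    return 0;
}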
Producer Consumer
 This pattern is often used, as it helps
dynamically balance workload
 E.g., crawling the Web
 Place new links in a queue so other threads can pick them up
27
Source: http://vichargrave.com/multithreaded-work-queue-in-c/
Pipeline
 Break a task into small steps (which may have
dependencies) & assign execution of steps to
different threads
 Example
 Read file, sort file, & write to file
 Work is handed off from step to step
 Each task doesn’t gain, but if there are many
instances of the task, we get better throughput
 Gains come from tuning
 Example – if read/write is slow but sort is fast, we can
add more threads to read/write & fewer threads to sort
28
Pipeline (Cont.)
 Long pipeline – high throughput
 Short pipeline – low latency
 Passing data from one stage to another
 Message passing
 Shared queues (see the sketch below)
29
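A Pthreads sketch of two stages connected by a shared queue: a “read” stage hands items to a “process” stage through a small bounded buffer guarded by a mutex & condition variables. The queue size, item count, & -1 sentinel are arbitrary.

#include <pthread.h>
#include <stdio.h>

#define QSIZE 8
#define ITEMS 20
#define DONE  -1            /* sentinel marking end of the stream */

/* Bounded ring buffer shared between two pipeline stages. */
typedef struct {
    int buf[QSIZE];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_full, not_empty;
} queue_t;

static queue_t q = { .lock = PTHREAD_MUTEX_INITIALIZER,
                     .not_full = PTHREAD_COND_INITIALIZER,
                     .not_empty = PTHREAD_COND_INITIALIZER };

static void put(queue_t *s, int v) {
    pthread_mutex_lock(&s->lock);
    while (s->count == QSIZE)                    /* wait if next stage is behind */
        pthread_cond_wait(&s->not_full, &s->lock);
    s->buf[s->tail] = v;
    s->tail = (s->tail + 1) % QSIZE;
    s->count++;
    pthread_cond_signal(&s->not_empty);
    pthread_mutex_unlock(&s->lock);
}

static int get(queue_t *s) {
    pthread_mutex_lock(&s->lock);
    while (s->count == 0)                        /* wait if previous stage is behind */
        pthread_cond_wait(&s->not_empty, &s->lock);
    int v = s->buf[s->head];
    s->head = (s->head + 1) % QSIZE;
    s->count--;
    pthread_cond_signal(&s->not_full);
    pthread_mutex_unlock(&s->lock);
    return v;
}

/* Stage 1: "read" items and hand them to the next stage. */
static void *read_stage(void *arg) {
    (void)arg;
    for (int i = 0; i < ITEMS; i++) put(&q, i);
    put(&q, DONE);
    return NULL;
}

/* Stage 2: "process" items until the sentinel arrives. */
static void *process_stage(void *arg) {
    (void)arg;
    int v;
    while ((v = get(&q)) != DONE)
        printf("processed %d\n", v);
    return NULL;
}

int main(void) {
    pthread_t s1, s2;
    pthread_create(&s1, NULL, read_stage, NULL);
    pthread_create(&s2, NULL, process_stage, NULL);
    pthread_join(s1, NULL);
    pthread_join(s2, NULL);
    return 0;
}

Adding more threads to a slow stage (here, more reader or writer threads) is what tunes throughput without touching the fast stages.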
Asynchronous Agents
 Here task is done by a set of agents
 Working in P2P fashion
 No clear structure
 They talk to each other via asynchronous messages
 Example – Detecting storms using weather data
 Many agents, each knows some aspects about storms
 Weather events are sent to them, which in turn fire
other events, leading to detection
30
Source: http://blogs.msdn.com/