Solution Patterns for Parallel
Programming
CS4532 Concurrent Programming
Dilum Bandara
Dilum.Bandara@uom.lk
Some slides adapted from Dr. Srinath Perera
Outline
 Designing parallel algorithms
 Solution patterns for parallelism
 Loop Parallel
 Fork/Join
 Divide & Conquer
 Pipeline
 Asynchronous Agents
 Producer/Consumer
 Load balancing
2
Building a Solution by Composition
 We often solve a problem by reducing it to a
composition of known problems
 Finding the way to Habarana?
 Sorting 1 million integers
 Can we solve this with Mutex & Semaphores?
 Mutex for mutual exclusion
 Semaphores for signaling
 There is another level
3
Designing Parallel Algorithms
 Parallel algorithm design is not easily reduced to
simple recipes
 A parallel version of a serial algorithm is not
necessarily optimal
 Good algorithms require creativity
 Goal
 Suggest a framework within which parallel algorithm
design can be explored
 Develop intuition as to what constitutes a good
parallel algorithm
4
Methodical Design
 Partitioning &
communication focus
on concurrency &
scalability
 Agglomeration &
mapping focus on
locality & other
performance issues
5
Source: www.drdobbs.com/parallel/designing-parallel-
algorithms-part-1/223100878
Methodical Design (Cont.)
1. Partitioning
 Decompose computation/data into small tasks/chunks
 Focus on recognizing opportunities for parallel
execution
 Practical issues such as no of CPUs are ignored
2. Communication
 Determine communication required to coordinate task
execution
 Define communication structures & algorithms
6
Methodical Design (Cont.)
3. Agglomeration
 Defined task & communication structures are
evaluated with respect to
 Performance requirements
 Implementation costs
 If necessary, tasks are combined into larger tasks to
 Improve performance
 Reduce development costs
7
Source: www.drdobbs.com/architecture-and-design/designing-
parallel-algorithms-part-3/223500075
Methodical Design (Cont.)
4. Mapping
 Each task is assigned to a processor while attempting
to satisfy competing goals of
 Maximizing processor utilization
 Minimizing communication costs
 Static mapping
 At design/compile time
 Dynamic mapping
 At runtime by load-balancing algorithms
8
Parallel Algorithm Design Issues
 Efficiency
 Scalability
 Partitioning computations
 Domain decomposition – based on data
 Functional decomposition – based on computation
 Locality
 Spatial & temporal
 Synchronous & asynchronous communication
 Agglomeration to reduce communication
 Load-balancing
9
3 Ways to Parallelize
1. By Data
 Partition data & give it to different threads
2. By Task
 Partition task into smaller tasks & give it to different
threads
3. By Order
 Partition task into steps & give them to different threads
10
By Data
 Use SPMD model
 When data can be processed locally with few
dependencies on other data
 Patterns
 Loop parallel, embarrassingly parallel
 Large data units – underutilization
 Small data units – thrashing
 Chunk layout
 Based on dependencies & caching
 Example – Processing geographical data
11
By Task
 Task Parallel, Divide & Conquer
 Too many tasks – thrashing
 Too few tasks – underutilization
 Dependencies among tasks
 Removable
 Code transformations
 Separable
 Accumulation operations (average, sum, count) – see the reduction sketch below
 Extrema (max, min)
 Read only, Read/Write
12
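A separable accumulation (e.g., a sum used for an average) can be handled by giving each thread a private copy of the accumulator & combining the copies at the end. A minimal OpenMP sketch, assuming placeholder data & compiling with -fopenmp:

#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N];
    for (int i = 0; i < N; i++) a[i] = i * 0.5;   /* placeholder data */

    double sum = 0.0;
    /* reduction(+:sum): each thread accumulates into its own private
       copy of sum; OpenMP adds the copies together when the loop ends */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("mean = %f\n", sum / N);
    return 0;
}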
By Order
 Pipeline & Asynchronous Agents
 Dependencies
 Temporal – before/after
 Same time
 None
13
Load Balancing
 Some threads will be busy while others are idle
 Counter by distributing load equally
 When the cost of the problem is well understood, this is possible
 e.g., matrix multiplication, known tree walk
 Some other problems are not that simple
 Hard to predict how the workload will be distributed → use
dynamic load balancing
 But this requires communication between threads/tasks
 2 methods for dynamic load balancing
 Task queues
 Work stealing
14
Task Queues
 Multiple instances of task queues (producer/consumer)
 Threads come to the task queue after finishing a
task & grab the next task
 Typically run with a thread pool with a fixed no of
threads (see the sketch below)
15
Source: http://blog.zenika.com
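A minimal Pthreads sketch of the idea: a fixed pool of worker threads repeatedly returns to one shared queue of task IDs protected by a mutex. Names such as task_queue_t, pop_task, & do_work are illustrative only; a real queue that is filled concurrently would also need a condition variable so idle workers can block.

#include <pthread.h>
#include <stdio.h>

#define NUM_TASKS   32
#define NUM_WORKERS 4

/* A trivial shared queue of task IDs, protected by a mutex. */
typedef struct {
    int next;                 /* next task ID to hand out */
    int total;                /* total number of tasks    */
    pthread_mutex_t lock;
} task_queue_t;

static task_queue_t queue = { 0, NUM_TASKS, PTHREAD_MUTEX_INITIALIZER };

/* Pop the next task ID, or -1 when the queue is empty. */
static int pop_task(task_queue_t *q) {
    pthread_mutex_lock(&q->lock);
    int id = (q->next < q->total) ? q->next++ : -1;
    pthread_mutex_unlock(&q->lock);
    return id;
}

static void do_work(int task_id) {
    printf("ran task %d\n", task_id);   /* placeholder for the real task body */
}

/* Each worker returns to the queue after finishing a task. */
static void *worker(void *arg) {
    (void)arg;
    int id;
    while ((id = pop_task(&queue)) != -1)
        do_work(id);
    return NULL;
}

int main(void) {
    pthread_t pool[NUM_WORKERS];
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_create(&pool[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(pool[i], NULL);
    return 0;
}

Because workers pull a new task only when they finish the previous one, the load balances itself even if some tasks take much longer than others.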
Work Stealing
 Every thread has a work/task queue
 When 1 thread runs out of work, it goes to another
thread’s task queue & “steals” work
16
Source: http://karlsenchoi.blogspot.com
Efficiency = Maximizing Parallelism?
 Usually it is 2 things
 Run algorithm in MAX no of threads with minimal
communication/waiting
 When size of the problem grows, algorithm can handle
it by adding new resources
 It’s done through the right architecture + tuning
 There is no clear way to do it
 Just like “design patterns” for OOP, people have
identified parallel programming patterns
17
Solution Patterns for Parallelism
 Loop Parallel
 Fork/Join
 Divide and Conquer
 Producer/Consumer
 Pipeline
 Asynchronous Agents
18
Loop Parallel
 If each iteration in a loop depends only on that
iteration’s results + read-only data, each iteration
can run in a different thread
 As it’s based on data, also called data parallelism
int[] A = ...; int[] B = ...; int[] C = ...;
for (int i = 0; i < N; i++) {
    C[i] = F(A[i], B[i]);
}
19
Which of These are Loop Parallel?
int[] A = ...; int[] B = ...; int[] C = ...;
for (int i = 1; i < N; i++) {
    C[i] = F(A[i], B[i-1]);
}
int[] A = ...; int[] B = ...; int[] C = ...;
for (int i = 1; i < N; i++) {
    C[i] = F(A[i], C[i-1]);
}
20
Implementing Loop Parallel
 OpenMP example (a minimal sketch follows)
21
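A minimal OpenMP sketch of the earlier C[i] = F(A[i], B[i]) loop, assuming F is a pure element-wise placeholder function & compiling with -fopenmp:

#include <stdio.h>

#define N 1000

/* placeholder element-wise function with no side effects */
static double F(double a, double b) { return a * b + 1.0; }

int main(void) {
    static double A[N], B[N], C[N];
    for (int i = 0; i < N; i++) { A[i] = i; B[i] = 2 * i; }

    /* Each iteration writes only its own C[i] and reads only
       read-only A & B, so iterations can run on different threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        C[i] = F(A[i], B[i]);

    printf("C[1] = %f\n", C[1]);
    return 0;
}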
Fork/Join
 Fork job into smaller tasks (independent if
possible), perform them, & join them
 Examples
 Calculate the mean across an array
 Tree walk
 How to partition?
 By Data, e.g., SPMD
 By Task, e.g., MPSD
22
Source: http://en.wikipedia.org/wiki/Fork%E2%80%93join_model
Fork/Join (Cont.)
 Size of work unit
 Small units – thrashing
 Big units – imbalance
 Balancing load among threads
 Static allocation
 If data/task is completely known
 E.g., matrix addition
 Dynamic allocation (tree walks)
 Task queues
 Work Stealing
23
Implementing Fork/Join
 Pthreads (see the sketch below)
 OpenMP
24
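A Pthreads sketch of the array-mean example: the main thread forks one thread per data chunk, joins them, & combines the partial sums. The thread count, array size, & chunk split are arbitrary choices for illustration.

#include <pthread.h>
#include <stdio.h>

#define N           1000000
#define NUM_THREADS 4

static double data[N];

typedef struct { int start, end; double partial; } chunk_t;

/* Each forked thread sums its own chunk of the array. */
static void *sum_chunk(void *arg) {
    chunk_t *c = (chunk_t *)arg;
    c->partial = 0.0;
    for (int i = c->start; i < c->end; i++)
        c->partial += data[i];
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) data[i] = i;   /* placeholder data */

    pthread_t threads[NUM_THREADS];
    chunk_t chunks[NUM_THREADS];
    int per = N / NUM_THREADS;

    /* Fork: one thread per chunk. */
    for (int t = 0; t < NUM_THREADS; t++) {
        chunks[t].start = t * per;
        chunks[t].end   = (t == NUM_THREADS - 1) ? N : (t + 1) * per;
        pthread_create(&threads[t], NULL, sum_chunk, &chunks[t]);
    }

    /* Join: wait for all threads, then combine partial results. */
    double total = 0.0;
    for (int t = 0; t < NUM_THREADS; t++) {
        pthread_join(threads[t], NULL);
        total += chunks[t].partial;
    }
    printf("mean = %f\n", total / N);
    return 0;
}

A static split like this assumes all chunks cost roughly the same; for irregular work such as tree walks, the task-queue or work-stealing schemes above balance better.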
Divide & Conquer
 Break problem into recursive sub-problems &
assign them to different threads
 Examples
 Quick sort
 Search for a value in a tree
 Calculating Fibonacci Sequence
 Sub-problems often fork again, leading to an execution tree
 Recursion
 May or may not have a join step
 Deep tree – thrashing
 Shallow tree – underutilization
25
Divide & Conquer – Fibonacci
Sequence
Source: Introduction to Algorithms (3rd Edition) by Cormen, Leiserson, Rivest & Stein
26
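One way to map the recursion tree in the figure onto threads is OpenMP tasks; a rough sketch, where the cutoff of 20 is an arbitrary guard against a tree that is too deep (too many tiny tasks). Compile with -fopenmp.

#include <stdio.h>

/* Divide & conquer: each branch may run as a separate task.
   Below the cutoff we fall back to plain recursion to avoid thrashing. */
static long fib(int n) {
    if (n < 2) return n;
    if (n < 20)                       /* cutoff: stop dividing, just recurse */
        return fib(n - 1) + fib(n - 2);
    long x, y;
    #pragma omp task shared(x)
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);
    #pragma omp taskwait              /* join step: wait for both sub-problems */
    return x + y;
}

int main(void) {
    long result;
    #pragma omp parallel
    #pragma omp single                /* one thread starts the recursion */
    result = fib(35);
    printf("fib(35) = %ld\n", result);
    return 0;
}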
Producer Consumer
 This pattern is often used, as it helps
dynamically balance workload
 E.g., crawling the Web
 Place new links in a queue so other threads can pick them up
27
Source: http://vichargrave.com/multithreaded-work-queue-in-c/
Pipeline
 Break a task into small steps (which may have
dependencies) & assign execution of steps to
different threads
 Example
 Read file, sort file, & write to file
 Work is handed off from step to step
 Each task doesn’t gain, but if there are many
instances of the task, we get better throughput
 Gains come from tuning
 Example – if read/write is slow but sort is fast, we can
add more threads to read/write & fewer threads to sort
28
Pipeline (Cont.)
 Long pipeline – high throughput
 Short pipeline – low latency
 Passing data from one stage to another
 Message passing
 Shared queues (see the sketch below)
29
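A Pthreads sketch of two stages connected by a shared queue: a “read” stage hands items to a “process” stage through a small bounded buffer guarded by a mutex & condition variables. The queue size, item count, & -1 sentinel are arbitrary.

#include <pthread.h>
#include <stdio.h>

#define QSIZE 8
#define ITEMS 20
#define DONE  -1            /* sentinel marking end of the stream */

/* Bounded ring buffer shared between two pipeline stages. */
typedef struct {
    int buf[QSIZE];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_full, not_empty;
} queue_t;

static queue_t q = { .lock = PTHREAD_MUTEX_INITIALIZER,
                     .not_full = PTHREAD_COND_INITIALIZER,
                     .not_empty = PTHREAD_COND_INITIALIZER };

static void put(queue_t *s, int v) {
    pthread_mutex_lock(&s->lock);
    while (s->count == QSIZE)                    /* wait if next stage is behind */
        pthread_cond_wait(&s->not_full, &s->lock);
    s->buf[s->tail] = v;
    s->tail = (s->tail + 1) % QSIZE;
    s->count++;
    pthread_cond_signal(&s->not_empty);
    pthread_mutex_unlock(&s->lock);
}

static int get(queue_t *s) {
    pthread_mutex_lock(&s->lock);
    while (s->count == 0)                        /* wait if previous stage is behind */
        pthread_cond_wait(&s->not_empty, &s->lock);
    int v = s->buf[s->head];
    s->head = (s->head + 1) % QSIZE;
    s->count--;
    pthread_cond_signal(&s->not_full);
    pthread_mutex_unlock(&s->lock);
    return v;
}

/* Stage 1: "read" items and hand them to the next stage. */
static void *read_stage(void *arg) {
    (void)arg;
    for (int i = 0; i < ITEMS; i++) put(&q, i);
    put(&q, DONE);
    return NULL;
}

/* Stage 2: "process" items until the sentinel arrives. */
static void *process_stage(void *arg) {
    (void)arg;
    int v;
    while ((v = get(&q)) != DONE)
        printf("processed %d\n", v);
    return NULL;
}

int main(void) {
    pthread_t s1, s2;
    pthread_create(&s1, NULL, read_stage, NULL);
    pthread_create(&s2, NULL, process_stage, NULL);
    pthread_join(s1, NULL);
    pthread_join(s2, NULL);
    return 0;
}

Adding more threads to a slow stage (here, more reader or writer threads) is what tunes throughput without touching the fast stages.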
Asynchronous Agents
 Here task is done by a set of agents
 Working in P2P fashion
 No clear structure
 They talk to each other via asynchronous messages
 Example – Detecting storms using weather data
 Many agents, each knows some aspects about storms
 Weather events are sent to them, which in turn fire
other events, leading to detection
30
Source: http://blogs.msdn.com/