Big Graph Analytics Systems
Da Yan
The Chinese University of Hong Kong
The University of Alabama at Birmingham
Yingyi Bu
Couchbase, Inc.
Yuanyuan Tian
IBM Almaden Research Center
Amol Deshpande
University of Maryland
James Cheng
The Chinese University of Hong Kong
Motivations
Big Graphs Are Everywhere
2
Big Graph Systems
General-Purpose Graph Analytics
Programming Language
»Java, C/C++, Scala, Python …
»Domain-Specific Language (DSL)
3
Big Graph Systems
Programming Model
»Think Like a Vertex
• Message passing
• Shared Memory Abstraction
»Matrix Algebra
»Think Like a Graph
»Datalog
4
Big Graph Systems
Other Features
»Execution Mode: Sync or Async ?
»Environment: Single-Machine or Distributed ?
»Support for Topology Mutation
»Out-of-Core Support
»Support for Temporal Dynamics
»Data-Intensive or Computation-Intensive ?
5
Tutorial Outline
Message Passing Systems
Shared Memory Abstraction
Single-Machine Systems
Matrix-Based Systems
Temporal Graph Systems
DBMS-Based Systems
Subgraph-Based Systems
6
Vertex-Centric
Hardware-Related
Computation-Intensive
Tutorial Outline
Message Passing Systems
Shared Memory Abstraction
Single-Machine Systems
Matrix-Based Systems
Temporal Graph Systems
DBMS-Based Systems
Subgraph-Based Systems
7
Message Passing Systems
8
Google’s Pregel [SIGMOD’10]
»Think like a vertex
»Message passing
»Iterative
• Superstep
Message Passing Systems
9
Google’s Pregel [SIGMOD’10]
»Vertex Partitioning
[Figure: a 9-vertex example graph (vertices 0-8) hash-partitioned over three machines M0, M1, M2; each machine holds its vertices together with their adjacency lists.]
Message Passing Systems
10
Google’s Pregel [SIGMOD’10]
»Programming Interface
• u.compute(msgs)
• u.send_msg(v, msg)
• get_superstep_number()
• u.vote_to_halt()
Called inside u.compute(msgs)
Message Passing Systems
11
Google’s Pregel [SIGMOD’10]
»Vertex States
• Active / inactive
• Reactivated by messages
»Stop Condition
• All vertices halted, and
• No pending messages
Message Passing Systems
12
Google’s Pregel [SIGMOD’10]
»Hash-Min: Connected Components
[Figure: superstep 1 of Hash-Min on a 9-vertex example graph; every vertex sends its own ID to its neighbors.]
Message Passing Systems
13
Google’s Pregel [SIGMOD’10]
»Hash-Min: Connected Components
[Figure: superstep 2; each vertex updates its value to the minimum ID received and forwards the new minimum.]
Message Passing Systems
14
Google’s Pregel [SIGMOD’10]
»Hash-Min: Connected Components
[Figure: superstep 3; all vertices of the component have converged to the minimum ID 0.]
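To make the API above concrete, here is a minimal Hash-Min sketch in Java against a hypothetical Pregel-style Vertex base class (method names mirror the interface listed earlier, not Giraph's or Pregel's exact signatures):

public class HashMinVertex extends Vertex<Long, Long, Long> {   // <id, value, message>; hypothetical base class
    @Override
    public void compute(Iterable<Long> msgs) {
        if (getSuperstepNumber() == 1) {          // first superstep
            setValue(getId());                    // start with own vertex ID
            sendMsgToAllNeighbors(getValue());    // broadcast it
        } else {
            long min = getValue();
            for (long m : msgs) min = Math.min(min, m);
            if (min < getValue()) {               // smaller ID seen: update and re-broadcast
                setValue(min);
                sendMsgToAllNeighbors(min);
            }
        }
        voteToHalt();                             // halted vertices are reactivated by incoming messages
    }
}

Once no vertex value changes, no messages are sent, all vertices stay halted, and the job terminates by the stop condition above.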
Message Passing Systems
15
Practical Pregel Algorithm (PPA) [PVLDB’14]
»First cost model for Pregel algorithm design
»PPAs for fundamental graph problems
• Breadth-first search
• List ranking
• Spanning tree
• Euler tour
• Pre/post-order traversal
• Connected components
• Bi-connected components
• Strongly connected components
• ...
Message Passing Systems
16
Practical Pregel Algorithm (PPA) [PVLDB’14]
»Linear cost per superstep
• O(|V| + |E|) message number
• O(|V| + |E|) computation time
• O(|V| + |E|) memory space
»Logarithmic number of supersteps
• O(log |V|) supersteps
O(log|V|) = O(log|E|)
How about load balancing?
Message Passing Systems
17
Balanced PPA (BPPA) [PVLDB’14]
»din(v): in-degree of v
»dout(v): out-degree of v
»Linear cost per superstep
• O(din(v) + dout(v)) message number
• O(din(v) + dout(v)) computation time
• O(din(v) + dout(v)) memory space
»Logarithmic number of supersteps
Message Passing Systems
18
BPPA Example: List Ranking [PVLDB’14]
»A basic operation of Euler tour technique
»Linked list where each element v has
• Value val(v)
• Predecessor pred(v)
»Element at the head has pred(v) = NULL
[Figure: a 5-element linked list v1, v2, v3, v4, v5 with val(v) = 1 for every element; the head's pred is NULL.]
Toy Example: val(v) = 1 for all v
Message Passing Systems
19
BPPA Example: List Ranking [PVLDB’14]
»Compute sum(v) for each element v
• Summing val(v) and values of all predecessors
»Why can't TeraSort solve this?
[Figure: the desired output, sum(v) = 1, 2, 3, 4, 5 for v1 ... v5.]
Message Passing Systems
20
BPPA Example: List Ranking [PVLDB’14]
»Pointer jumping / path doubling
• sum(v) ← sum(v) + sum(pred(v))
• pred(v) ← pred(pred(v))
[Figure: initial state, sum(v) = 1 for all v, with the original pred pointers.]
As long as pred(v) ≠ NULL
Message Passing Systems
21
BPPA Example: List Ranking [PVLDB’14]
»Pointer jumping / path doubling
• sum(v) ← sum(v) + sum(pred(v))
• pred(v) ← pred(pred(v))
[Figure: after one round of pointer jumping, the sum values become 1, 2, 2, 2, 2 and each pred pointer skips one element.]
Message Passing Systems
22
BPPA Example: List Ranking [PVLDB’14]
»Pointer jumping / path doubling
• sum(v) ← sum(v) + sum(pred(v))
• pred(v) ← pred(pred(v))
[Figure: after two rounds, the sum values become 1, 2, 3, 4, 4 and the pred pointers skip ahead by up to four elements.]
Message Passing Systems
23
BPPA Example: List Ranking [PVLDB’14]
»Pointer jumping / path doubling
• sum(v) ← sum(v) + sum(pred(v))
• pred(v) ← pred(pred(v))
[Figure: after three rounds, the sum values are 1, 2, 3, 4, 5 and every pred is NULL; the list ranking is complete.]
O(log |V|) supersteps
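A minimal shared-memory sketch of one pointer-jumping round (Java, with plain arrays as illustrative state); in BPPA each such round is realized with request-respond style messaging across one or two supersteps:

static final int NIL = -1;                          // stands for NULL pred
static void pointerJumpingRound(long[] sum, int[] pred) {
    long[] newSum = sum.clone();
    int[] newPred = pred.clone();
    for (int v = 0; v < pred.length; v++) {
        if (pred[v] != NIL) {
            newSum[v] = sum[v] + sum[pred[v]];      // sum(v) <- sum(v) + sum(pred(v))
            newPred[v] = pred[pred[v]];             // pred(v) <- pred(pred(v))
        }
    }
    System.arraycopy(newSum, 0, sum, 0, sum.length);
    System.arraycopy(newPred, 0, pred, 0, pred.length);
}
// Repeating the round until every pred is NIL takes O(log |V|) iterations.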
Message Passing Systems
24
Optimizations in
Communication Mechanism
Message Passing Systems
25
Apache Giraph
»Superstep splitting: reduce memory consumption
»Only effective when compute(.) is distributive
[Animation over several frames: vertex v receives a message of value 1 from each of u1-u6. The superstep is split into two sub-supersteps, each delivering and aggregating only half of the messages (a partial sum of 3 at v), so only half of the messages need to be buffered in memory at a time; the partial results combine to the final value 6 at v.]
Message Passing Systems
34
Pregel+ [WWW’15]
»Vertex Mirroring
»Request-Respond Paradigm
Message Passing Systems
35
Pregel+ [WWW’15]
»Vertex Mirroring
[Figure: vertices u1 ... ui on machine M1, v1 ... vj on M2, and w1 ... wk on M3, with many edges crossing machine boundaries.]
Message Passing Systems
36
Pregel+ [WWW’15]
»Vertex Mirroring
[Figure: the same partitioning with mirrors of u_i created on M2 and M3; u_i ships its value once to each mirror, and each mirror forwards it to the local neighbors.]
Message Passing Systems
37
Pregel+ [WWW’15]
»Vertex Mirroring: Create mirror for u4?
[Figure: machine M1 holds u1-u4 and M2 holds v1-v4; u1, u2, u3 each link to v1 and v2, while u4 links to v1, v2, v3, v4. Should a mirror be created for u4 on M2?]
Message Passing Systems
38
Pregel+ [WWW’15]
»Vertex Mirroring vs. Message Combining
[Figure: with message combining, the messages from u1-u4 destined for the same target machine are combined into a single value a(u1) + a(u2) + a(u3) + a(u4) before being sent to M2.]
Message Passing Systems
39
Pregel+ [WWW’15]
»Vertex Mirroring vs. Message Combining
[Figure: with a mirror for u4 on M2, the messages of u1-u3 are still combined into a(u1) + a(u2) + a(u3), while u4 ships a(u4) to its mirror only once; the mirror then delivers it locally to v1-v4.]
Message Passing Systems
40
Pregel+ [WWW’15]
»Vertex Mirroring: Only mirror high-degree vertices
»Choice of degree threshold τ
• M machines, n vertices, m edges
• Average degree: deg_avg = m / n
• Optimal τ is M · exp{deg_avg / M}
Message Passing Systems
41
Pregel+ [WWW’15]
» Request-Respond Paradigm
[Figure: vertices v1-v4 on machine M2 each need attribute a(u) of vertex u on M1; with plain message passing, each of them sends its own request (<v1> ... <v4>) to u.]
Message Passing Systems
42
Pregel+ [WWW’15]
» Request-Respond Paradigm
[Figure: u then answers each request separately, so a(u) is shipped to M2 four times.]
Message Passing Systems
43
Pregel+ [WWW’15]
»A vertex v can request attribute a(u) in superstep i
» a(u) will be available in superstep (i + 1)
Message Passing Systems
44
[Figure: with the request-respond paradigm, machine M2 issues a single "request u" message, and M1 responds once with u | D[u], which is then shared by all requesters on M2.]
Pregel+ [WWW’15]
»A vertex v can request attribute a(u) in superstep i
» a(u) will be available in superstep (i + 1)
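An illustrative compute() sketch using the request-respond pattern (Java; the request/response method names here are hypothetical, not Pregel+'s actual C++ API):

public void compute(Iterable<Long> msgs) {
    if (getSuperstepNumber() == 1) {
        for (long u : getNeighbors()) request(u);          // ask for a(u) in superstep i
    } else if (getSuperstepNumber() == 2) {
        for (long u : getNeighbors()) {
            double aU = getRespondedValue(u);              // a(u) is available in superstep i + 1
            // ... use a(u); the system sends only one request and one response per (machine, u) pair
        }
        voteToHalt();
    }
}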
Message Passing Systems
45
Load Balancing
Message Passing Systems
46
Vertex Migration
»WindCatch [ICDE’13]
• Runtime improved by 31.5% for PageRank (best)
• 2% for shortest path computation
• 9% for maximal matching
»Stanford’s GPS [SSDBM’13]
»Mizan [EuroSys’13]
• Hash-based and METIS partitioning: no improvement
• Range-based partitioning: around 40% improvement
Message Passing Systems
Dynamic Concurrency Control
»PAGE [TKDE’15]
• Better partitioning → slower ?
47
Message Passing Systems
Dynamic Concurrency Control
»PAGE [TKDE’15]
• Message generation
• Local message processing
• Remote message processing
48
Message Passing Systems
Dynamic Concurrency Control
»PAGE [TKDE’15]
• Monitors speeds of the 3 operations
• Dynamically adjusts number of threads for the 3 operations
• Criteria
- Speed of message processing = speed of incoming messages
- Thread numbers for local & remote message processing are
proportional to speed of local & remote message processing
49
Message Passing Systems
50
Out-of-Core Support
java.lang.OutOfMemoryError:
Java heap space
26 cases reported by Giraph-users
mailing list during 08/2013~08/2014!
Message Passing Systems
51
Pregelix [PVLDB’15]
»Transparent out-of-core support
»Physical flexibility (Environment)
»Software simplicity (Implementation)
Hyracks
Dataflow Engine
Message Passing Systems
52
Pregelix [PVLDB’15]
Message Passing Systems
53
Pregelix [PVLDB’15]
Message Passing Systems
54
GraphD
»Commodity hardware affordable to small startups and ordinary researchers
• Desktop PCs
• Gigabit Ethernet switch
»Features of a common cluster
• Limited memory space
• Disk streaming bandwidth >> network bandwidth
» Each worker stores and streams edges and messages on local
disks
» Cost of buffering msgs on disks is hidden by msg transmission
Message Passing Systems
55
Fault Tolerance
Message Passing Systems
56
Coordinated Checkpointing of Pregel
»Every δ supersteps
»Recovery from machine failure:
• Standby machine
• Repartitioning among survivors
An illustration with δ = 5
Message Passing Systems
57
Coordinated Checkpointing of Pregel
[Figure: workers W1-W3 write a periodic checkpoint (vertex states, edge changes, shuffled messages) to HDFS and then proceed through supersteps 4-7; a failure occurs at W1 around superstep 6.]
Message Passing Systems
58
Coordinated Checkpointing of Pregel
[Figure: for recovery, all workers load the latest checkpoint from HDFS and re-execute the supersteps since that checkpoint.]
Message Passing Systems
59
Chandy-Lamport Snapshot [TOCS’85]
»Uncoordinated checkpointing (e.g., for async exec)
»For message-passing systems
»FIFO channels
[Animation: processes u and v, connected by FIFO channels, both start with value 5. u records its local snapshot u : 5; messages then change the states to 4, and v later records v : 4. The combined snapshot (u : 5, v : 4) is inconsistent because the message exchanged between the two recording points is not captured.]
Message Passing Systems
63
Chandy-Lamport Snapshot [TOCS’85]
»Solution: broadcast a checkpoint request right after checkpointing one's own state
[Figure: u records u : 5 and immediately broadcasts a checkpoint request (REQ); on receiving REQ, v records v : 5 before processing any later messages, so the snapshot (u : 5, v : 5) is consistent.]
Message Passing Systems
64
Recovery by Message-Logging [PVLDB’14]
»Each worker logs its msgs to local disks
• Negligible overhead, cost hidden
»Survivor
• No re-computation during recovery
• Forward logged msgs to replacing workers
»Replacing worker
• Re-compute from latest checkpoint
• Only send msgs to replacing workers
Message Passing Systems
65
Recovery by Message-Logging [PVLDB’14]
[Figure: in every superstep, each worker logs its outgoing messages to local disk; W1 fails at superstep 6.]
Message Passing Systems
66
Recovery by Message-Logging [PVLDB’14]
[Figure: a standby machine replaces W1 and re-computes from the latest checkpoint, while the surviving workers W2 and W3 simply forward their logged messages instead of recomputing.]
Message Passing Systems
67
Block-Centric Computation Model
Message Passing Systems
68
Block-Centric Computation
»Main Idea
• A block refers to a connected subgraph
• Messages are exchanged among blocks
• Serial in-memory algorithm within a block
Message Passing Systems
69
Block-Centric Computation
»Motivation: graph characteristics adverse to Pregel
• Large graph diameter
• High average vertex degree
Message Passing Systems
70
Block-Centric Computation
»Benefits
• Less communication workload
• Fewer supersteps
• Fewer computing units
Message Passing Systems
71
Giraph++ [PVLDB’13]
» Pioneering: think like a graph
» METIS-style vertex partitioning
» Partition.compute(.)
» Boundary vertex values sync-ed at superstep barrier
» Internal vertex values can be updated anytime
Message Passing Systems
72
Blogel [PVLDB’14]
» API: vertex.compute(.) + block.compute(.)
»A block can have its own fields
»A block/vertex can send msgs to another block/vertex
»Example: Hash-Min
• Construct a block-level graph: compute an adjacency list for each block
• Propagate min block ID among blocks
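A minimal block-level Hash-Min sketch against a hypothetical Block class that mirrors the vertex API (Java rendering; Blogel's actual API is C++):

public void compute(Iterable<Long> msgs) {          // block.compute(.)
    long min = (getSuperstepNumber() == 1) ? getBlockId() : getValue();
    for (long m : msgs) min = Math.min(min, m);
    if (getSuperstepNumber() == 1 || min < getValue()) {
        setValue(min);                               // smallest block ID seen so far
        sendMsgToAllNeighborBlocks(min);             // uses the block-level adjacency list
    }
    voteToHalt();                                    // reactivated by incoming messages
}

Because the block-level graph is much smaller than the original graph, far fewer supersteps and messages are needed, which is what the Friendster numbers below illustrate.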
Message Passing Systems
73
Blogel [PVLDB’14]
»Performance on the Friendster social network (65.6 M vertices, 3.6 B edges), Blogel vs. Pregel+:
• Computing time: 2.52 vs. 120.24
• Total messages: 19 million vs. 7,227 million
• Supersteps: 5 vs. 30
Message Passing Systems
74
Blogel [PVLDB’14]
»Web graph: URL-based partitioning
»Spatial networks: 2D partitioning
»General graphs: graph Voronoi diagram partitioning
Blogel [PVLDB’14]
» Graph Voronoi Diagram (GVD) partitioning
75
[Figure: three seed vertices; a vertex v is 2 hops from the red seed, 3 hops from the green seed, and 5 hops from the blue seed, so v is assigned to the red seed's block.]
Message Passing Systems
Blogel [PVLDB’14]
»Sample seed vertices with probability p
76
Message Passing Systems
Blogel [PVLDB’14]
»Sample seed vertices with probability p
77
Message Passing Systems
Blogel [PVLDB’14]
»Sample seed vertices with probability p
»Compute GVD grouping
• Vertex-centric multi-source BFS
78
Message Passing Systems
Blogel [PVLDB’14]
79
State after Seed Sampling
Message Passing Systems
Blogel [PVLDB’14]
80
Superstep 1
Message Passing Systems
Blogel [PVLDB’14]
81
Superstep 2
Message Passing Systems
Blogel [PVLDB’14]
82
Superstep 3
Message Passing Systems
Blogel [PVLDB’14]
»Sample seed vertices with probability p
»Compute GVD grouping
»Postprocessing
83
Message Passing Systems
Blogel [PVLDB’14]
»Sample seed vertices with probability p
»Compute GVD grouping
»Postprocessing
• For very large blocks, resample with a larger p and repeat
84
Message Passing Systems
Blogel [PVLDB’14]
»Sample seed vertices with probability p
»Compute GVD grouping
»Postprocessing
• For very large blocks, resample with a larger p and repeat
• For tiny components, find them using Hash-Min at last
85
Message Passing Systems
GVD Partitioning Performance
86
[Figure: GVD partitioning time (loading + partitioning + dumping): WebUK 2026.65, Friendster 505.85, BTC 186.89, LiveJournal 105.48, USA Road 75.88, Euro Road 70.68.]
Message Passing Systems
87
Asynchronous Computation Model
Maiter [TPDS’14]
» For algos where vertex values converge asymmetrically
» Delta-based accumulative iterative computation
(DAIC)
88
Message Passing Systems
Maiter [TPDS’14]
» For algos where vertex values converge asymmetrically
» Delta-based accumulative iterative computation
(DAIC)
» Strict transformation from Pregel API to DAIC
formulation
»Delta may serve as priority score
»Natural for block-centric frameworks
89
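A minimal delta-based PageRank sketch in the DAIC spirit (plain Java; the constants are illustrative, and asynchronous scheduling, termination checks, and distribution are omitted; this is not Maiter's actual API):

class DeltaPageRank {
    final double[] value, delta;
    final int[][] outNbrs;
    DeltaPageRank(int n, int[][] outNbrs) {
        this.value = new double[n];
        this.delta = new double[n];
        this.outNbrs = outNbrs;
        java.util.Arrays.fill(delta, 0.15);           // illustrative initial delta (teleport term)
    }
    void update(int v) {                              // may be scheduled by delta magnitude (priority)
        double d = delta[v];
        delta[v] = 0.0;
        value[v] += d;                                // accumulate the delta into the value
        if (outNbrs[v].length == 0) return;
        double send = 0.85 * d / outNbrs[v].length;
        for (int w : outNbrs[v]) delta[w] += send;    // propagate only the change
    }
}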
Message Passing Systems
90
Vertex-Centric Query Processing
Quegel [PVLDB’16]
» On-demand answering of light-workload graph queries
• Only a portion of the whole graph gets accessed
» Option 1: to process queries one job after another
• Network underutilization, too many barriers
• High startup overhead (e.g., graph loading)
91
Message Passing Systems
Quegel [PVLDB’16]
» On-demand answering of light-workload graph queries
• Only a portion of the whole graph gets accessed
» Option 2: to process a batch of queries in one job
• Programming complexity
• Straggler problem
92
Message Passing Systems
Quegel [PVLDB’16]
»Execution model: superstep-sharing
• Each iteration is called a super-round
• In a super-round, every query proceeds by one superstep
93
Message Passing Systems
[Figure: queries q1-q4 arrive at different times; in each super-round, every active query executes one superstep, so the supersteps 1-4 of different queries are interleaved along the time axis over super-rounds 1-7.]
Quegel [PVLDB’16]
»Benefits
• Messages of multiple queries transmitted in one batch
• One synchronization barrier for each super-round
• Better load balancing
94
Message Passing Systems
[Figure: with individual per-query synchronization, workers 1 and 2 hit a barrier for every query; with superstep-sharing, one barrier per super-round suffices and worker idle time shrinks.]
Quegel [PVLDB’16]
»API is similar to Pregel
»The system does more:
• Q-data: superstep number, control information, …
• V-data: adjacency list, vertex/edge labels
• VQ-data: vertex state in the evaluation of each query
95
Message Passing Systems
Quegel [PVLDB’16]
»Create a VQ-data of v for q only when q touches v
»Garbage collection of Q-data andVQ-data
»Distributed indexing
96
Message Passing Systems
Tutorial Outline
Message Passing Systems
Shared Memory Abstraction
Single-Machine Systems
Matrix-Based Systems
Temporal Graph Systems
DBMS-Based Systems
Subgraph-Based Systems
97
Shared-Mem Abstraction
98
GraphLab: Single Machine
(UAI 2010)
Distributed GraphLab
(PVLDB 2012)
PowerGraph
(OSDI 2012)
Shared-Mem Abstraction
Distributed GraphLab [PVLDB’12]
»Scope of vertex v
99
[Figure: a chain u - v - w; the scope of v covers Dv, the adjacent edge data D(u,v) and D(v,w), and the neighboring vertex data Du and Dw, i.e., all that v can access.]
Shared-Mem Abstraction
Distributed GraphLab [PVLDB’12]
» Async exec mode: for asymmetric convergence
• Scheduler, serializability
» API: v.update()
• Access & update data in v’s scope
• Add neighbors to scheduler
100
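A minimal PageRank-style v.update() sketch in the spirit of the scope model (plain Java with arrays and an explicit scheduler queue; GraphLab's real API is C++ and additionally enforces the chosen consistency/serializability level):

// uses java.util.Deque
static void update(int v, double[] rank, int[][] inNbrs, int[][] outNbrs,
                   int[] outDeg, java.util.Deque<Integer> scheduler, double eps) {
    double sum = 0.0;
    for (int u : inNbrs[v]) sum += rank[u] / outDeg[u];   // read neighbor data within v's scope
    double newRank = 0.15 + 0.85 * sum;
    double change = Math.abs(newRank - rank[v]);
    rank[v] = newRank;                                    // update the data owned by v
    if (change > eps)
        for (int w : outNbrs[v]) scheduler.add(w);        // add neighbors to the scheduler
}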
Shared-Mem Abstraction
Distributed GraphLab [PVLDB’12]
» Vertices partitioned among machines
» For edge (u, v), scopes of u and v overlap
• Du, Dv and D(u, v)
• Replicated if u and v are on different machines
» Ghosts: overlapped boundary data
• Value-sync by a versioning system
» Memory space problem
• Ghost data may be replicated up to {# of machines} times
101
Shared-Mem Abstraction
PowerGraph [OSDI’12]
» API: Gather-Apply-Scatter (GAS)
• PageRank: out-degree = 2 for all in-neighbors
102
[Animation: GAS evaluation of PageRank at a vertex v whose in-neighbors each have value 1 and out-degree 2. Gather collects 1/2 from each in-neighbor, apply updates v's value to the sum 1.5, and since the change Δ = 0.5 > ϵ, scatter re-activates v's out-neighbors.]
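A minimal GAS decomposition of PageRank matching the walkthrough above (plain Java; PowerGraph's real API is C++ and runs gather in parallel over the mirrors of high-degree vertices):

static double gather(double nbrRank, int nbrOutDeg) { return nbrRank / nbrOutDeg; }  // per in-edge
static double sum(double a, double b)               { return a + b; }                 // commutative, associative combine
static double apply(double acc)                     { return acc; }                   // the toy example omits damping; real PageRank would use 0.15 + 0.85 * acc
static boolean scatter(double oldRank, double newRank, double eps) {
    return Math.abs(newRank - oldRank) > eps;       // true -> re-activate the out-neighbor
}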
Shared-Mem Abstraction
PowerGraph [OSDI’12]
»Edge Partitioning
»Goals:
• Load balancing
• Minimize vertex replicas
– Cost of value sync
– Cost of memory space
107
Shared-Mem Abstraction
PowerGraph [OSDI’12]
»Greedy Edge Placement
108
[Animation: greedy placement of edge (u, v) across workers W1-W6 with current workloads 100-105; the edge is assigned to a worker that minimizes new vertex replicas, breaking ties by the lowest workload (when neither endpoint has replicas yet, i.e., both replica sets are ∅, the least-loaded worker is chosen).]
Shared-Mem Abstraction
111
Single-Machine Out-of-Core Systems
Shared-Mem Abstraction
Shared-Mem + Single-Machine
»Out-of-core execution, disk/SSD-based
• GraphChi [OSDI’12]
• X-Stream [SOSP’13]
• VENUS [ICDE’14]
• …
»Vertices are numbered 1, …, n; cut into P intervals
112
[Figure: vertex IDs 1 ... n divided into interval(1), interval(2), ..., interval(P).]
Shared-Mem Abstraction
GraphChi [OSDI’12]
»Programming Model
• Edge scope of v
113
u v w
Du Dv Dw
D(u,v) D(v,w)
…………
…………
Shared-Mem Abstraction
GraphChi [OSDI’12]
»Programming Model
• Scatter & gather values along adjacent edges
114
u v w
Dv
D(u,v) D(v,w)
…………
…………
Shared-Mem Abstraction
GraphChi [OSDI’12]
»Load vertices of each interval, along with adjacent
edges for in-mem processing
»Write updated vertex/edge values back to disk
»Challenges
• Sequential IO
• Consistency: store each edge value only once on disk
115
Shared-Mem Abstraction
GraphChi [OSDI’12]
»Disk shards: shard(i)
• Vertices in interval(i)
• Their incoming edges, sorted by source_ID
116
[Figure: each interval(i) has a corresponding shard(i) on disk.]
Shared-Mem Abstraction
GraphChi [OSDI’12]
»Parallel Sliding Windows (PSW)
117
[Figure: four shards for the vertex intervals 1..100, 101..200, 201..300, 301..400; the in-edges inside each shard are sorted by src_id.]
Shared-Mem Abstraction
GraphChi [OSDI’12]
»Parallel Sliding Windows (PSW)
118
[Animation: to process the vertices of interval 1 (1..100), shard 1 is loaded into memory together with a sliding window over shards 2-4 that covers the out-edges of vertices 1..100; the windows then slide forward to process interval 2 (101..200), and so on.]
Shared-Mem Abstraction
GraphChi [OSDI’12]
»Each vertex & edge value is read & written at least once per iteration
120
Shared-Mem Abstraction
X-Stream [SOSP’13]
»Edge-scope GAS programming model
»Streams a completely unordered list of edges
121
Shared-Mem Abstraction
X-Stream [SOSP’13]
»Simple case: all vertex states are memory-resident
»Pass 1: edge-centric scattering
• (u, v): value(u) => <v, value(u, v)>
»Pass 2: edge-centric gathering
• <v, value(u, v)> => value(v)
122
update
aggregate
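A minimal two-pass edge-centric sketch for this memory-resident case (plain Java; the arrays and the additive combine are illustrative, and X-Stream actually streams the edge list from disk):

// uses java.util.List
static void scatterPass(int[] src, int[] dst, double[] state, java.util.List<double[]> updates) {
    for (int e = 0; e < src.length; e++)                        // edges streamed in arbitrary order
        updates.add(new double[]{dst[e], state[src[e]]});       // (u, v): value(u) => <v, ...>
}
static void gatherPass(java.util.List<double[]> updates, double[] state) {
    for (double[] u : updates)                                  // updates streamed in arbitrary order
        state[(int) u[0]] += u[1];                              // aggregate into value(v)
}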
Shared-Mem Abstraction
X-Stream [SOSP’13]
»Out-of-Core Engine
• P vertex partitions with vertex states only
• P edge partitions, partitioned by source vertices
• Each pass loads a vertex partition, streams corresponding
edge partition (or update partition)
123
[Figure: vertex intervals over IDs 1 ... n (larger than in GraphChi) whose vertex states fit into memory; the edge partitions and the P update files generated by Pass 1 scattering are streamed from disk.]
Shared-Mem Abstraction
X-Stream [SOSP’13]
»Out-of-Core Engine
• Pass 1: edge-centric scattering
– (u, v): value(u) => [v, value(u, v)]
• Pass 2: edge-centric gathering
– [v, value(u, v)] => value(v)
124
[Figure: Pass 1 appends each update to the update file of v's partition; Pass 2 streams the update file of the corresponding vertex partition.]
Shared-Mem Abstraction
X-Stream [SOSP’13]
»Scale out: Chaos [SOSP’15]
• Requires 40 GigE
• Slow with GigE
»Weakness: sparse computation
125
Shared-Mem Abstraction
VENUS [ICDE’14]
»Programming model
• Value scope of v
126
[Figure: the value scope of v: v reads its adjacent edge data and neighboring vertex values but updates only its own vertex value Dv.]
Shared-Mem Abstraction
VENUS [ICDE’14]
»Assume static topology
• Separate read-only edge data and mutable vertex states
»g-shard(i): incoming edge lists of vertices in interval(i)
»v-shard(i): srcs & dsts of edges in g-shard(i)
»All g-shards are concatenated for streaming
127
[Figure: vertex IDs 1 ... n cut into intervals. Sources in g-shard(i) may not belong to interval(i); vertices in a v-shard are ordered by ID; destinations of interval(i) may be sources of other intervals.]
Shared-Mem Abstraction
VENUS [ICDE’14]
»To process interval(i)
• Load v-shard(i)
• Stream g-shard(i), update in-memory v-shard(i)
• Update every other v-shard by a sequential write
128
[Figure: the destination vertices of g-shard(i) all lie in interval(i).]
Shared-Mem Abstraction
VENUS [ICDE’14]
» Avoid writing O(|E|) edge values to disk
» O(|E|) edge values are read once
» O(|V|) vertex values may be read/written multiple times
129
Tutorial Outline
Message Passing Systems
Shared Memory Abstraction
Single-Machine Systems
Matrix-Based Systems
Temporal Graph Systems
DBMS-Based Systems
Subgraph-Based Systems
130
Single-Machine Systems
Categories
»Shared-mem out-of-core (GraphChi, X-Stream,VENUS)
»Matrix-based (to be discussed later)
»SSD-based
»In-mem multi-core
»GPU-based
131
Single-Machine Systems
132
SSD-Based Systems
Single-Machine Systems
SSD-Based Systems
»Async random IO
• Many flash chips, each with multiple dies
»Callback function
»Pipelined for high throughput
133
Single-Machine Systems
TurboGraph [KDD’13]
»Vertices ordered by ID, stored in pages
134
Single-Machine Systems
TurboGraph [KDD’13]
135
Single-Machine Systems
TurboGraph [KDD’13]
136
Read order for positions in a page
Single-Machine Systems
TurboGraph [KDD’13]
137
Record for v6: in Page p3, Position 1
Single-Machine Systems
TurboGraph [KDD’13]
138
In-mem page table: vertex ID -> location on SSD
1-hop neighborhood queries: outperforms GraphChi by up to 10^4 times
Single-Machine Systems
TurboGraph [KDD’13]
139
Special treatment for adj-list larger than a page
Single-Machine Systems
TurboGraph [KDD’13]
»Pin-and-slide execution model
»Concurrently process vertices of pinned pages
»Do not wait for completion of IO requests
»Page unpinned as soon as processed
140
Single-Machine Systems
FlashGraph [FAST’15]
»Semi-external memory
• Edge lists on SSDs
»On top of SAFS, an SSD file system
• High-throughput async I/Os over SSD array
• Edge lists stored in one (logical) file on SSD
141
Single-Machine Systems
FlashGraph [FAST’15]
»Only access requested edge lists
»Merge same-page / adjacent-page requests into one
sequential access
»Vertex-centric API
»Message passing among threads
142
Single-Machine Systems
143
In-Memory Multi-Core Frameworks
Single-Machine Systems
In-Memory Parallel Frameworks
»Programming simplicity
• Green-Marl, Ligra, GRACE
»Full utilization of all cores in a machine
• GRACE, Galois
144
Single-Machine Systems
Green-Marl [ASPLOS’12]
»Domain-specific language (DSL)
• High-level language constructs
• Expose data-level parallelism
»DSL → C++ program
»Initially single-machine, now supported by GPS
145
Single-Machine Systems
Green-Marl [ASPLOS’12]
»Parallel For
»Parallel BFS
»Reductions (e.g., SUM, MIN, AND)
»Deferred assignment (<=)
• Effective only at the end of the binding iteration
146
Single-Machine Systems
Ligra [PPoPP’13]
»VertexSet-centric API: edgeMap, vertexMap
»Example: BFS
• Ui+1←edgeMap(Ui, F, C)
147
u
v
Ui
Vertices for next iteration
Single-Machine Systems
Ligra [PPoPP’13]
»VertexSet-centric API: edgeMap, vertexMap
»Example: BFS
• Ui+1←edgeMap(Ui, F, C)
148
u
v
Ui
C(v) = parent[v] is NULL?
Yes
Single-Machine Systems
Ligra [PPoPP’13]
»VertexSet-centric API: edgeMap, vertexMap
»Example: BFS
• Ui+1←edgeMap(Ui, F, C)
149
u
v
Ui
F(u, v):
parent[v] ← u
v added to Ui+1
Single-Machine Systems
Ligra [PPoPP’13]
»Mode switch based on vertex sparseness |Ui|
• When | Ui | is large
150
u
v
Ui
w
C(w) called 3 times
Single-Machine Systems
Ligra [PPoPP’13]
»Mode switch based on vertex sparseness |Ui|
• When | Ui | is large
151
u
v
Ui
w
For each v with C(v) true, call F(u, v) for every in-neighbor u ∈ Ui
Early pruning: for BFS, stop after the first successful F(u, v)
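A minimal sparse-mode edgeMap + BFS sketch following the Ui+1 = edgeMap(Ui, F, C) formulation (plain Java; Ligra itself is C++ and switches to the dense, in-neighbor-driven mode sketched above when |Ui| is large):

// uses java.util.*
static java.util.List<Integer> edgeMapSparse(int[][] outNbrs, java.util.List<Integer> frontier, int[] parent) {
    java.util.List<Integer> next = new java.util.ArrayList<>();
    for (int u : frontier)
        for (int v : outNbrs[u])
            if (parent[v] == -1) {     // C(v): v not yet visited
                parent[v] = u;         // F(u, v): record parent
                next.add(v);           // v joins U_{i+1}
            }
    return next;
}
// BFS driver: fill parent with -1, set parent[root] = root, frontier = [root],
// and repeat frontier = edgeMapSparse(outNbrs, frontier, parent) until it is empty.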
Single-Machine Systems
GRACE [PVLDB’13]
»Vertex-centric API, block-centric execution
• Inner-block computation: vertex-centric computation with
an inner-block scheduler
»Reduce data access to computation ratio
• Many vertex-centric algos are computationally-light
• CPU cache locality: every block fits in cache
152
Single-Machine Systems
Galois [SOSP’13]
»Amorphous data-parallelism (ADP)
• Speculative execution: fully use extra CPU resources
153
[Figure: the neighborhoods of active elements u and v overlap at w.]
Single-Machine Systems
Galois [SOSP’13]
»Amorphous data-parallelism (ADP)
• Speculative execution: fully use extra CPU resources
154
[Figure: when the speculative activities at u and v conflict at w, one of them is rolled back.]
Single-Machine Systems
Galois [SOSP’13]
»Amorphous data-parallelism (ADP)
• Speculative execution: fully use extra CPU resources
»Machine-topology-aware scheduler
• Try to fetch tasks local to the current core first
155
Single-Machine Systems
156
GPU-Based Systems
Single-Machine Systems
GPU Architecture
»Array of streaming multiprocessors (SMs)
»Single instruction, multiple threads (SIMT)
»Different control flows
• Execute all flows
• Masking
»Memory cache hierarchy
157
Small path divergence
Coalesced memory accesses
Single-Machine Systems
GPU Architecture
»Warp: 32 threads, basic unit for scheduling
»SM: 48 warps
• Two streaming processors (SPs)
• Warp scheduler: two warps executed at a time
»Thread block / CTA (cooperative thread array)
• 6 warps
• Kernel call → grid of CTAs
• CTAs are distributed to SMs with available resources
158
Single-Machine Systems
Medusa [TPDS’14]
»BSP model of Pregel
»Fine-grained API: Edge-Message-Vertex (EMV)
• Large parallelism, small path divergence
»Pre-allocates an array for buffering messages
• Coalesced memory accesses: incoming msgs for each vertex are stored consecutively
• Write positions of msgs do not conflict
159
Single-Machine Systems
CuSha [HPDC’14]
»Apply the shard organization of GraphChi
»Each shard processed by one CTA
»Window concatenation
160
[Figure: GraphChi-style shards (in-edges sorted by src_id) for the vertex intervals 1..100, 101..200, 201..300, 301..400; writing back the sliding windows of the shards yields imbalanced workloads across CTAs.]
Single-Machine Systems
CuSha [HPDC’14]
»Apply the shard organization of GraphChi
»Each shard processed by one CTA
»Window concatenation
161
[Figure: windows are concatenated so that threads in a CTA may cross window boundaries; the concatenated windows keep pointers to the actual edge locations in the shards.]
Tutorial Outline
Message Passing Systems
Shared Memory Abstraction
Single-Machine Systems
Matrix-Based Systems
Temporal Graph Systems
DBMS-Based Systems
Subgraph-Based Systems
162
Matrix-Based Systems
163
Categories
»Single-machine systems
• Vertex-centric API
• Matrix operations in the backend
»Distributed frameworks
• (Generalized) matrix-vector multiplication
• Matrix algebra
Matrix-Based Systems
164
Matrix-Vector Multiplication
»Example: PageRank
[Matrix form: the matrix whose k-th row is Out-AdjacencyList(v_k), multiplied by the vector (PR_i(v1), ..., PR_i(v4)), yields the next vector (PR_{i+1}(v1), ..., PR_{i+1}(v4)).]
Matrix-Based Systems
165
Generalized Matrix-Vector Multiplication
»Example: HashMin
[Generalized matrix form: the 0/1 adjacency matrix (row k = 0/1-AdjacencyList(v_k)) "multiplied" by the vector (min_i(v1), ..., min_i(v4)) yields (min_{i+1}(v1), ..., min_{i+1}(v4)), where Add is replaced by Min and the result is assigned only when it is smaller.]
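A minimal sketch of one generalized matrix-vector step for Hash-Min, where multiplication selects a neighbor's value, addition becomes min, and assignment happens only when the value shrinks (plain Java; a dense boolean matrix stands in for the sparse adjacency structure):

static void hashMinStep(boolean[][] adj, long[] cur, long[] next) {
    for (int i = 0; i < cur.length; i++) {
        long m = cur[i];
        for (int j = 0; j < cur.length; j++)
            if (adj[i][j]) m = Math.min(m, cur[j]);   // Add replaced by Min over the "products"
        next[i] = (m < cur[i]) ? m : cur[i];          // assign only when smaller
    }
}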
Matrix-Based Systems
166
Single-Machine Systems
with Vertex-Centric API
Matrix-Based Systems
GraphTwist [PVLDB’15]
»Multi-level graph partitioning
• Right granularity for in-memory processing
• Balance workloads among computing threads
167
[Figures: the graph is viewed as a src × dst × edge-weight cube in which entry (u, v) holds w(u, v); it is partitioned at multiple granularities: slices, stripes, dices, and vertex cuts.]
Matrix-Based Systems
GraphTwist [PVLDB’15]
»Multi-level graph partitioning
• Right granularity for in-memory processing
• Balance workloads among computing threads
»Fast Randomized Approximation
• Prune statistically insignificant vertices/edges
• E.g., PageRank computation only using high-weight edges
• Unbiased estimator: sampling slices/cuts according to
Frobenius norm
172
Matrix-Based Systems
GridGraph [ATC’15]
»Grid representation for reducing IO
173
Matrix-Based Systems
GridGraph [ATC’15]
»Grid representation for reducing IO
»Streaming-apply API
• Streaming edges of a block (Ii, Ij)
• Aggregate value to v ∈ Ij
174
Matrix-Based Systems
GridGraph [ATC’15]
»Illustration: column-by-column evaluation
175
[Animation: edges are stored in a P × P grid of blocks. For each column j, the destination vertex chunk is created in memory, the source vertex chunks are loaded while the corresponding edge blocks (I_i, I_j) are streamed, and the updated destination chunk is saved back to disk before moving to the next column.]
Matrix-Based Systems
GridGraph [ATC’15]
»Read O(P|V|) data of vertex chunks
»Write O(|V|) data of vertex chunks (not O(|E|)!)
»Stream O(|E|) data of edge blocks
• Edge blocks are appended into one large file for streaming
• Block boundaries recorded to trigger the pin/unpin of a
vertex chunk
183
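A minimal column-oriented streaming-apply sketch over the grid layout (plain Java, with in-memory arrays standing in for the on-disk vertex chunks and edge blocks; contribution() is the user-supplied per-edge function):

static void processColumn(int j, int[][][] blockSrc, int[][][] blockDst,
                          double[][] srcChunks, double[] dstChunk) {
    for (int i = 0; i < blockSrc.length; i++) {            // every edge block (I_i, I_j) in column j
        int[] srcs = blockSrc[i][j], dsts = blockDst[i][j];
        for (int e = 0; e < srcs.length; e++)              // edges of the block are streamed sequentially
            dstChunk[dsts[e]] += contribution(srcChunks[i][srcs[e]]);
    }
    // the destination chunk (O(|V|/P) values) is written back once per column, not once per edge
}
static double contribution(double srcValue) { return srcValue; }   // placeholder per-edge function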
Matrix-Based Systems
184
Distributed Frameworks
with Matrix Algebra
Distributed Systems with Matrix-
Based Interfaces
• PEGASUS (CMU, 2009)
• GBase (CMU & IBM, 2011)
• SystemML (IBM, 2011)
185
Commonality:
• Matrix-based programming interface to the users
• Rely on MapReduce for execution.
PEGASUS
• Open source: http://www.cs.cmu.edu/~pegasus
• Publications: ICDM’09, KAIS’10.
• Intuition: many graph computations can be modeled by a generalized form of matrix-vector multiplication.
v' = M × v
PageRank: v' = (0.85 · A^T + 0.15 · U) × v
186
PEGASUS Programming Interface: GIM-V
Three Primitives:
1) combine2(mi,j , vj ) : combine mi,j and vj into xi,j
2) combineAlli (xi,1 , ..., xi,n ) : combine all the results from
combine2() for node i into vi '
3) assign(vi , vi ' ) : decide how to update vi with vi '
Iterative: Operation applied till algorithm-specific convergence
criterion is met.
PageRank Example
v' = (0.85 · A^T + 0.15 · U) × v
combine2(a_{i,j}, v_j) = 0.85 · a_{i,j} · v_j
combineAll_i(x_{i,1}, ..., x_{i,n}) = 0.15 / n + Σ_{j=1..n} x_{i,j}
assign(v_i, v_i') = v_i'
188
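The three GIM-V primitives for this PageRank formulation, written as a minimal Java sketch (PEGASUS itself runs them as Hadoop MapReduce stages):

static double combine2(double mij, double vj)  { return 0.85 * mij * vj; }
static double combineAll(double[] x, int n) {   // x holds the combine2 results for row i
    double s = 0.15 / n;
    for (double xij : x) s += xij;
    return s;
}
static double assign(double vi, double viNew)  { return viNew; }   // simply overwrite the old value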
Execution Model
Iterations of a 2-stage algorithm (each stage is a MR job)
• Input: Edge andVector file
• Edge line : (idsrc , iddst , mval) -> cell adjacency Matrix M
• Vector line: (id, vval) -> element inVectorV
• Stage 1: performs combine2() on columns of iddst of M with
rows of id ofV
• Stage 2: combines all partial results and assigns new vector
-> old vector
189
Optimizations
• Block Multiplication
• Clustered Edges
190
• Diagonal Block Iteration for
connected component detection
* Figures are copied from Kang et al ICDM’09
GBASE
• Part of the IBM System GToolkit
• http://systemg.research.ibm.com
• Publications: SIGKDD’11,VLDBJ’12.
• PEGASUS vs GBASE:
• Common:
• Matrix-vector multiplication as the core operation
• Division of a matrix into blocks
• Clustering nodes to form homogenous blocks
• Different:
191
               | PEGASUS            | GBASE
Queries        | global             | targeted & global
User Interface | customizable APIs  | built-in algorithms
Storage        | normal files       | compression, special placement
Block Size     | square blocks      | rectangular blocks
Block Compression and Placement
• Block Formation
• Partition nodes using clustering algorithms e.g. Metis
• Compressed block encoding
• source and destination partition ID p and q;
• the set of sources and the set of destinations
• the payload, the bit string of subgraph G(p,q)
• The payload is compressed using zip compression or gap Elias-γ encoding.
• Block Placement
• Grid placement to minimize the number of input HDFS
files to answer queries
192
* Figure is copied from Kang et al SIGKDD’11
Built-In Algorithms in GBASE
• Select grids containing the blocks relevant to the queries
• Derive the incidence matrix from the original adjacency
matrix as required
193
* Figure is copied from Kang et al SIGKDD’11
SystemML
• Apache Open source: https://systemml.apache.org
• Publications: ICDE’11, ICDE’12, VLDB’14, Data Engineering Bulletin’14, ICDE’15, SIGMOD’15, PPoPP’15, VLDB’16.
• Comparison to PEGASUS and GBASE
• Core: General linear algebra and math operations (beyond just matrix-
vector multiplication)
• Designed for machine learning in general
• User Interface: A high-level language with similar syntax as R
• Declarative approach to graph processing with cost-based and rule-based
optimization
• Run on multiple platforms including MapReduce, Spark and single node.
194
SystemML – Declarative Machine Learning
Analytics language for data scientists
(“The SQL for analytics”)
» Algorithms expressed in a declarative,
high-level language with R-like syntax
» Productivity of data scientists
» Language embeddings for
• Solutions development
• Tools
Compiler
» Cost-based optimizer to generate
execution plans and to parallelize
• based on data characteristics
• based on cluster and machine characteristics
» Physical operators for in-memory single node and
cluster execution
Performance & Scalability
195
SystemML Architecture Overview
196
Language (DML)
• R- like syntax
• Rich set of statistical functions
• User-defined & external function
• Parsing
• Statement blocks & statements
• Program Analysis, type inference, dead code elimination
High-Level Operator (HOP) Component
• Represent dataflow in DAGs of operations on matrices, scalars
• Choosing from alternative execution plans based on memory and
cost estimates: operator ordering & selection; hybrid plans
Low-Level Operator (LOP) Component
• Low-level physical execution plan (LOPDags) over key-value pairs
• “Piggybacking” operations into a minimal number of MapReduce jobs
Runtime
• Hybrid Runtime
• CP: single machine operations & orchestrate MR jobs
• MR: generic Map-Reduce jobs & operations
• SP: Spark Jobs
• Numerically stable operators
• Dense / sparse matrix representation
• Multi-Level buffer pool (caching) to evict in-memory objects
• Dynamic Recompilation for initial unknowns
[Figure: SystemML architecture: APIs (command line, JMLC, Spark MLContext, Spark ML) feed the parser/language layer, the high-level (HOP) and low-level (LOP) operator components of the compiler with cost-based optimizations and a recompiler, and a hybrid runtime comprising the control program, CP/MR/Spark instructions, generic MR jobs, a buffer pool, the ParFor optimizer/runtime, the MatrixBlock library (single/multi-threaded), and memory/FS/DFS IO.]
Pros and Cons of Matrix-Based Graph
Systems
Pros:
- Intuitive for analytic users familiar with linear algebra
- E.g. SystemML provides a high-level language familiar to a lot of analysts
Cons:
- PEGASUS and GBASE require an expensive clustering of nodes as a
preprocessing step.
- Not all graph algorithms can be expressed using linear algebra
- Unnecessary computation compared to vertex-centric model
197
Tutorial Outline
Message Passing Systems
Shared Memory Abstraction
Single-Machine Systems
Matrix-Based Systems
Temporal Graph Systems
DBMS-Based Systems
Subgraph-Based Systems
198
Temporal and Streaming Graph Analytics
• Motivation: Real world graphs often evolve
over time.
• Two body of work:
• Real-time analysis on streaming graph data
• E.g. Calculate each vertex’s current PageRank
• Temporal analysis over historical traces of graphs
• E.g. Analyzing the change of each vertex’s PageRank
for a given time range
199
Common Features for All Systems
• Temporal Graph: a continuous stream of graph updates
• Graph update: addition or deletion of vertex/edge, or the update of the attribute associated with
node/edge.
• Most systems separate graph updates from graph computation.
• Graph computation is only performed on a sequence of successive static views of the temporal
graph
• A graph snapshot is most commonly used static view
• Using existing static graph programming APIs for temporal graphs
• Incremental graph computation
• Leverage significant overlap of successive
static views
• Use ending vertex and edge states at time t
as the starting states at time t+1
• Not applicable to all algorithms
200
Static view 1 Static view 2 Static view 3
Overview
• Real-time Streaming Graph Systems
• Kineograph (distributed, Microsoft, 2012)
• TIDE (distributed, IBM, 2015)
• Historical Graph Systems
• Chronos (distributed, Microsoft, 2014)
• DeltaGraph (distributed, University of Maryland, 2013)
• LLAMA (single-node, Harvard University & Oracle, 2015)
201
Kineograph
• Publication: Cheng et al Eurosys’12
• Target query: continuously deliver analytics results
on static snapshots of a dynamic graph periodically
• Two layers:
• Storage layer: continuously applies updates to a dynamic graph
• Computation layer: performs graph computation on a graph
snapshot
202
Kineograph Architecture Overview
• Graph is stored in a key/value
store among graph nodes
• Ingest nodes are the front end
of incoming graph updates
• Snapshooter uses an epoch
commit protocol to produce
snapshots
• Progress table keeps track of
the process by ingest nodes
203
* Figure is copied from Cheng et al Eurosys’12
Epoch Commit Protocol
204
* Figure is copied from Cheng et al Eurosys’12
Graph Computation
• Apply vertex-based GAS computation model on
snapshots of a dynamic graph
• Supports both push and pull models for inter-vertex
communication.
205
* Figure is copied from Cheng et al Eurosys’12
TIDE
• Publication: Xie et al ICDE’15
• Target query: continuously deliver analytics results
on a dynamic graph
• Model social interactions as a dynamic interaction
graph
• New interactions (edges) continuously added
• Probabilistic edge decay (PED) model to produce
static views of dynamic graphs
206
Static Views of Temporal Graph
207
[Figure: under the sliding-window model, e.g., the relationship between a and b is forgotten once it falls outside the window.]
Sliding Window Model
 Consider recent graph data within a small time window
 Problem: Abruptly forgets past data (no continuity)
Snapshot Model
 Consider all graph data seen so far
 Problem: Does not emphasize recent data (no recency)
Probabilistic Edge Decay Model
208
Key Idea: Temporally Biased Sampling
 Sample data items according to a probability
that decreases over time
 Sample contains a relatively high proportion of
recent interactions
Probabilistic View of an Edge’s Role
 All edges have chance to be considered
(continuity)
 Outdated edges are less likely to be used
(recency)
 Can systematically trade off recency and
continuity
 Can use existing static-graph algorithms
Create N sample graphs
Discretized Time + Exponential Decay
Typically reduces Monte Carlo
variability
Maintaining Sample Graphs inTIDE
209
Naïve Approach: Whenever a new batch of data comes in
 Generate N sampled graphs
 Run graph algorithm on each sample
Idea #1: Exploit overlaps at successive time points
 Subsample old edges of G_t^(i)
– Selection probability = p independently for each edge
 Then add new edges
 Theorem: G_{t+1}^(i) has the correct marginal probability
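A minimal sketch of Idea #1: evolving one sample graph from time t to t + 1 by subsampling the old edges with probability p and then adding the new batch (plain Java; the value of p is supplied by the PED model's decay parameters):

// uses java.util.*
static <E> java.util.Set<E> advanceSample(java.util.Set<E> sampleT,
                                          java.util.Collection<E> newBatch,
                                          double p, java.util.Random rng) {
    java.util.Set<E> next = new java.util.HashSet<>();
    for (E edge : sampleT)
        if (rng.nextDouble() < p) next.add(edge);   // keep each old edge independently with probability p
    next.addAll(newBatch);                          // newly arrived edges always enter the sample
    return next;                                    // marginal inclusion probabilities stay correct
}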
Maintaining Sample Graphs, Continued
210
Idea #2: Exploit overlap between sample graphs at each time point
 With high probability, more than 50% of edges overlap
 So maintain aggregate graph
[Figure: the sample graphs G_t^(1), G_t^(2), G_t^(3) overlap heavily, so their edges are stored once in an aggregate graph, with each edge tagged by the samples it belongs to (e.g., {1,2}, {1,3}).]
Memory requirements (batch size = 𝑴)
 Snapshot model: continuously increasing memory requirement
 PED model: bounded memory requirement
– # Edges stored by storing graphs separately: 𝑂(𝑀𝑁)
– # Edges stored by aggregate graph: 𝑂(𝑀 log 𝑁)
Bulk Graph Execution Model
211
Iterative Graph processing (Pregel, GraphLab, Trinity, GRACE, …)
• User-defined compute () function on each vertex v changes v + adjacent edges
• Changes propagated to other vertices via message passing or scheduled updates
Key idea in TIDE:
Bulk execution: Compute results for multiple sample graphs simultaneously
 Partition N sample graphs into bulk sets with s sample graphs each
 Execute algorithm on the aggregate graph of each bulk set (partial aggregate graph)
Benefits
 Same interface: users still think the
computation is applied on one
graph
 Amortize overheads of extracting &
loading from aggregate graph
 Better memory locality (vertex
operations)
 Similar message values & similar state values -> opportunities for compression (>2x speedup with LZF)
Overview
• Real-time Streaming Graph Systems
• Kineograph (distributed, Microsoft, 2012)
• TIDE (distributed, IBM, 2015)
• Historical Graph Systems
• Chronos (distributed, Microsoft, 2014)
• DeltaGraph (distributed, University of Maryland, 2013)
• LLAMA (single-node, Harvard University & Oracle, 2015)
212
Chronos
• Publication: Han et al Eurosys’14
• Target query: graph computation on the sequence of static
snapshots of a temporal graph within a time range
• E.g analyzing the change of each vertex’s PageRank for a given time
range
• Naïve approach: applying graph computation on each
snapshot separately
• Chronos: exploit the time locality of temporal graphs
213
Structure Locality vs. Time Locality
• Structure locality
• States of neighboring vertices in the same snapshot are laid out close to each other
• Time locality (preferred in Chronos)
• States of a vertex (or an edge) in consecutive snapshots are stored together
214
* Figures are copied from Han et al EuroSys’14
Chronos Design
• In-memory graph layout
• Data of a vertex/edge in consecutive snapshots are placed together
• Locality-aware batch scheduling (LABS)
• Batch processing of a vertex across all the snapshots
• Batch information propagation to a neighbor vertex across snapshots
• Incremental Computation
• Use the results on 1st snapshot to batch compute on the remaining
snapshots
• Use the results on the intersection graph to batch compute on all snapshots
• On-disk graph layout
• Organized in snapshot groups
• Stored as the first snapshot followed by the updates in the remaining snapshots in this group.
215
DeltaGraph
• Publication: Khurana et al ICDE’13, EDBT’16
• Target query: access past states of the graphs and
perform static graph analysis
• E.g., study the evolution of centrality measures, density, conductance, etc.
• Two major components:
• Temporal Graph Index (TGI)
• Temporal Graph Analytics Framework (TAF)
216
Temporal Graph Index
218
• Partitioned delta and partitioned
eventlist for scalability
• Version chain for nodes
• Sorted list of references to a
node
• Graph primitives
• Snapshot retrieval
• Node’s history
• K-hop neighborhood
• Neighborhood evolution
Temporal Graph Analytics Framework
• Node-centric graph extraction and analytical logic
• Primary operand: Set of Nodes (SoN) refers to a collection of
temporal nodes
• Operations
• Extract: Timeslice, Select, Filter, etc.
• Compute: NodeCompute, NodeComputeTemporal, etc.
• Analyze: Compare, Evolution, other aggregates
219
LLAMA
• Publication: Macko et al ICDE’15
• Target query: perform various whole graph analysis
on consistent views
• A single machine system that stores and
incrementally updates an evolving graph in multi-
version representations
• LLAMA provides a general purpose programming
model instead of vertex- or edge- centric models
220
Multi-Version CSR Representation
• Augment the compact read-only CSR (compressed sparse
row) representation to support mutability and persistence.
• Large multi-versioned array (LAMA) with a software copy-on-write
technique for snapshotting
221
* Figure is copied from Macko et al ICDE’15
Tutorial Outline
Message Passing Systems
Shared Memory Abstraction
Single-Machine Systems
Matrix-Based Systems
Temporal Graph Systems
DBMS-Based Systems
Subgraph-Based Systems
222
DBMS-Style Graph Systems
[Figure: a DBMS-style graph system stack: graph algorithms expressed via Datalog, SQL, or Pregel/GAS APIs run on top of a query optimizer, a data-parallel query execution engine, and a storage engine; example systems include SociaLite/Myria, REX, GraphX/Pregelix, Naiad, Pregel, and Vertexica.]
Reason #1
Expressiveness
»Transitive closure
»All pair shortest paths
Vertex-centric API?
public class AllPairShortestPaths extends Vertex<VLongWritable, DoubleWritable, FloatWritable, DoubleWritable> {
    private Map<VLongWritable, DoubleWritable> distances = new HashMap<>();
    @Override
    public void compute(Iterator<DoubleWritable> msgIterator) {
        .......
    }
}
Reason #2
Easy OPS – Unified logs, tooling, configuration…!
Reason #3
Efficient Resource Utilization and
Robustness
~30 similar threads on
Giraph-users mailing list
during the year 2015!
“I’m trying to run the sample connected
components algorithm on a large data
set on a cluster, but I get a
‘java.lang.OutOfMemoryError: Java heap
space’ error.”
Reason #4
One-size fits-all?
Physical flexibility and adaptivity
»PageRank, SSSP, CC, Triangle Counting
»Web graph, social network, RDF graph
»An 8-machine cluster of cheap machines at a school vs. 200 beefy machines at an enterprise data center
What’s graph analytics?
304 Million Monthly Active Users
500 Million Tweets Per Day!
200 Billion Tweets Per Year!
TwitterMsg(
tweetid: int64,
user: string,
sender_location: point,
send_time: datetime,
reply_to: int64,
retweet_from: int64,
referred_topics: array<string>,
message_text: string
);
Reason #5
Easy Data Science
INSERT OVERWRITE TABLE MsgGraph
SELECT T.tweetid, 1.0/10000000000.0,
       CASE WHEN T.reply_to >= 0 THEN array(T.reply_to)
            ELSE array(T.retweet_from)
       END
FROM TwitterMsg AS T
WHERE T.reply_to >= 0
   OR T.retweet_from >= 0;

[Pipeline: the extracted MsgGraph table is written to HDFS, a Giraph PageRank job runs over it, and its Result table on HDFS feeds the second query.]

SELECT R.user, SUM(R.rank) AS influence
FROM Result R, TwitterMsg TM
WHERE R.vertexid = TM.tweetid
GROUP BY R.user
ORDER BY influence DESC
LIMIT 50;

MsgGraph(
  vertexid: int64,
  value: double,
  edges: array<int64>
);
Result(
  vertexid: int64,
  rank: double
);
Reason #6
Software Simplicity
Network management
Pregel
GraphLab
Giraph
......
Message delivery
Memory management
Task scheduling
Vertex/Message
internal format
#1 Expressiveness
Path(u, v, min(d)) :- Edge(u, v, d);
                   :- Path(u, w, d1), Edge(w, v, d2), d = d1 + d2
TC(u, u) :- Edge(u, _)
TC(v, v) :- Edge(_, v)
TC(u, v) :- TC(u, w), Edge(w, v), u != v
Recursive Query!
»SociaLite (VLDB’13)
»Myria (VLDB’15)
»DeALS (ICDE’15)
IDB
EDB
#2 Easy OPS
Converged Platforms!
»GraphX, on Apache Spark (OSDI’14)
»Gelly, on Apache Flink (FOSDEM’15)
#3 Efficient Resource
Utilization and Robustness
Leverage MPP query execution engine!
»Pregelix (VLDB’14)
[Figure: in Pregelix, the Pregel state is stored as relations; in each superstep the Msg relation is joined with the Vertex relation on vid and the join result is fed to the compute() UDF.]
Relation Schema
Vertex (vid, halt, value, edges)
Msg (vid, payload)
GS (halt, aggregate, superstep)
#4 Efficient Resource
Utilization and Robustness
[Figure: Pregelix transparently supports both in-memory and out-of-core execution.]
#4 Physical Flexibility
Flexible processing for the Pregel semantics
»Storage: row vs. column, in-place vs. LSM, etc.
• Vertexica (VLDB’14)
• Vertica (IEEE BigData’15)
• Pregelix (VLDB’14)
»Query plan, join algorithms, group-by algorithms,
etc.
• Pregelix (VLDB’14)
• GraphX (OSDI’14)
• Myria (VLDB’15)
»Execution model: synchronous vs. asynchronous
• Myria (VLDB’15)
#4 Physical Flexibility
Vertica: column store vs. row store (IEEE BigData’15)
#4 Physical Flexibility
[Figure: two alternative Pregelix query plans for one superstep: (1) an index left outer join of Msg_i(M) with Vertex_i(V) on M.vid = V.vid followed by a UDF call to compute() where V.halt = false or M.payload != NULL; (2) an index full outer join plus a merge (choose()) that additionally maintains the active-vertex set Vid_i for the next superstep.]
Pregelix, different query plans
#4 Physical Flexibility
[Figure: the choice of physical plan can make up to a 15x performance difference; Pregelix supports both in-memory and out-of-core execution.]
#4 Physical Flexibility
Myria: synchronous vs. asynchronous (VLDB’15)
»Least Common Ancestor
#4 Physical Flexibility
Myria: synchronous vs. asynchronous (VLDB’15)
»Connected Components
#5 Easy Data Science
Integrated Programming Abstractions
»REX (VLDB’12)
»AsterData (VLDB’14)
SELECT R.user, SUM(R.rank) AS influence
FROM PageRank( (
       SELECT T.tweetid AS vertexid, 1.0/… AS value, … AS edges
       FROM TwitterMsg AS T
       WHERE T.reply_to >= 0
          OR T.retweet_from >= 0
     ), …… ) AS R,
     TwitterMsg AS TM
WHERE R.vertexid = TM.tweetid
GROUP BY R.user
ORDER BY influence DESC
LIMIT 50;
#6 Software Simplicity
Engineering cost is Expensive!
System Lines of source code (excluding
test code and comments)
Giraph 32,197
GraphX 2,500
Pregelix 8,514
Tutorial Outline
Message Passing Systems
Shared Memory Abstraction
Single-Machine Systems
Matrix-Based Systems
Temporal Graph Systems
DBMS-Based Systems
Subgraph-Based Systems
243
Graph analytics/network science tasks are too varied
» Centrality analysis; evolution models; community detection
» Link prediction; belief propagation; recommendations
» Motif counting; frequent subgraph mining; influence analysis
» Outlier detection; graph algorithms like matching, max-flow
» An active area of research in itself…
Graph Analysis Tasks
Counting network motifs
[Figure: example motifs: feed-forward loop, feedback loop, bi-parallel motif.]
Identify social circles in a user’s ego network
[Figure: an ego network partitioned into circles such as high-school friends, family members, office colleagues, college friends, friends in the CS department, and friends in the database lab.]
Vertex-centric framework
» Works well for some applications
• PageRank, Connected Components, …
• Some machine learning algorithms can be mapped to it
» However, the framework is very restrictive
• Most analysis tasks or algorithms cannot be written easily
• Simple tasks like counting neighborhood properties infeasible
• Fundamentally: Not easy to decompose analysis tasks into vertex-level,
independent local computations
Alternatives?
» Galois, Ligra, GreenMarl: Not sufficiently high-level
» Some others (e.g., Socialite) restrictive for different reasons
Limitations ofVertex-Centric Framework
Example: Local Clustering Coefficient
[Figure: a node n with four neighbors 1-4; LCC counts the edges that exist among these neighbors.]
A measure of local density around a node:
LCC(n) = # edges in 1-hop neighborhood/max # edges possible
Compute() at Node n:
Need to count the no. of edges between neighbors
But does not have access to that information
Option 1: Each node transmits its list of
neighbors to its neighbors
Huge memory consumption
Option 2: Allow access to neighbors’ state
Neighbors may not be local
What about computations that require
2-hop information?
Example: Frequent Subgraph Mining
Goal: Find all (labeled) subgraphs that appear sufficiently frequently
No easy way to map this to the vertex-centric framework
- Need ability to construct subgraphs of the graph incrementally
- Can construct partial subgraphs and pass them around
- Very high memory consumption, and duplication of state
- Need ability to count the number of occurrences of each subgraph
- Analogous to “reduce()” but with subgraphs as keys
- Some vertex-centric frameworks support such functionality for
aggregation, but only in a centralized fashion
Similar challenges for problems like: finding all cliques, motif counting
Major Systems
NScale:
»Subgraph-centric API that generalizes the vertex-centric API
»The user compute() function has access to “subgraphs”
rather than “vertices”
»Graph distributed across a cluster of machines analogous
to distributed vertex-centric frameworks
Arabesque:
»Fundamentally different programming model aimed at
frequent subgraph mining, motif counting, etc.
»Key assumption:
• The graph fits in the memory of a single machine in the cluster,
• .. but the intermediate results might not
An end-to-end distributed graph programming framework
Users/application programs specify:
» Neighborhoods or subgraphs of interest
» A kernel computation to operate upon those subgraphs
Framework:
» Extracts the relevant subgraphs from underlying data and loads
in memory
» Execution engine: Executes user computation on materialized
subgraphs
» Communication: Shared state/message passing
Implementations on Hadoop MapReduce as well as Apache Spark
NScale
NScale: LCC Computation Walkthrough
NScale programming model
[Figure: an example graph with vertices 1-12 stored as underlying graph data on HDFS.]
Subgraph extraction query:
Compute(LCC) on Extract(
  {Node.color=orange}   ← query-vertex predicate
  {k=1}                 ← neighborhood size
  {Node.color=white}    ← neighborhood vertex predicate
  {Edge.type=solid}     ← neighborhood edge predicate
)
NScale programming model
[Figure: the same example graph (vertices 1-12) on HDFS.]
Specifying Computation: BluePrints API
Program cannot be executed as is in vertex-centric programming frameworks.
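As a rough illustration of what the subgraph-centric compute() gets to do, here is a minimal LCC sketch over an extracted 1-hop neighborhood (plain Java, with an adjacency map standing in for the BluePrints graph object):

// uses java.util.*
static double lcc(java.util.Map<Integer, java.util.Set<Integer>> egoNet, int n) {
    java.util.List<Integer> nbrs = new java.util.ArrayList<>(egoNet.get(n));
    int d = nbrs.size();
    if (d < 2) return 0.0;
    int links = 0;
    for (int i = 0; i < d; i++)
        for (int j = i + 1; j < d; j++)
            if (egoNet.get(nbrs.get(i)).contains(nbrs.get(j))) links++;   // edge between two neighbors
    return 2.0 * links / (d * (d - 1));   // observed edges / maximum possible edges
}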
NScale: LCC Computation Walkthrough
GEP: Graph extraction and packing
[Figure: the HDFS graph data is fed to a MapReduce-based GEP phase: subgraph extraction, a cost-based optimizer that performs set bin packing, and MR2 map/reduce tasks that produce the node-to-bin mapping consumed by the execution engines.]
NScale: LCC Computation Walkthrough
GEP: Graph extraction and packing
[Figure: subgraph extraction over the HDFS graph data with MapReduce (Apache Yarn) produces the extracted subgraphs SG-1 ... SG-4.]
NScale: LCC Computation Walkthrough
GEP: Graph extraction and packing
[Figure: the cost-based optimizer then decides data representation and placement, and the extracted subgraphs are packed into bins held in distributed memory.]
NScale: LCC Computation Walkthrough
GEP: Graph extraction and packing
[Figure: the subgraphs held in distributed memory are processed by the distributed execution engine; a node master on each machine runs the user computation on its local subgraphs.]
Distributed execution of user computation
NScale: LCC Computation Walkthrough
Experimental Evaluation
Personalized Page Rank on 2-Hop Neighborhood
(CE in Node-Secs; Mem = Cluster Mem in GB; DNC = did not complete; OOM = out of memory)
Dataset      #Source Vertices   NScale CE   NScale Mem   Giraph CE   Giraph Mem   GraphLab CE   GraphLab Mem   GraphX CE   GraphX Mem
EU Email     3200               52          3.35         782         17.10        710           28.87          9975        85.50
NotreDame    3500               119         9.56         1058        31.76        870           70.54          50595       95.00
GoogleWeb    4150               464         21.52        10482       64.16        1080          108.28         DNC         -
WikiTalk     12000              3343        79.43        DNC         OOM          DNC           OOM            DNC         -
LiveJournal  20000              4286        84.94        DNC         OOM          DNC           OOM            DNC         -
Orkut        20000              4691        93.07        DNC         OOM          DNC           OOM            DNC         -
Local Clustering Coefficient
Dataset      NScale CE   NScale Mem   Giraph CE   Giraph Mem   GraphLab CE   GraphLab Mem   GraphX CE   GraphX Mem
EU Email     377         9.00         1150        26.17        365           20.10          225         4.95
NotreDame    620         19.07        1564        30.14        550           21.40          340         9.75
GoogleWeb    658         25.82        2024        35.35        600           33.50          1485        21.92
WikiTalk     726         24.16        DNC         OOM          1125          37.22          1860        32.00
LiveJournal  1800        50.00        DNC         OOM          5500          128.62         4515        84.00
Orkut        2000        62.00        DNC         OOM          DNC           OOM            20175       125.00
NScaleSpark: NScale on Spark
Building the GEP phase:
[Figure: the input graph data flows through a chain of RDD transformations t1, t2, ..., tn (RDD 1 → RDD 2 → ... → RDD n) that perform subgraph extraction and bin packing]
Executing user computation:
[Figure: the final RDD contains graph objects G1, ..., Gn; each graph object groups subgraphs (SG1, ..., SG5) together using the bin packing algorithm; a map transformation over this Spark RDD of graph objects transparently instantiates a node-master execution engine instance that runs the user computation]
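The "map transformation" box can be read as follows: once GEP has produced an RDD of packed graph objects, a single Spark transformation runs the user kernel on every subgraph inside each object. The sketch below is not NScaleSpark's actual code; GraphObject and Subgraph are hypothetical placeholder types, and only the standard JavaRDD.flatMap call (Spark 2.x signature, which expects an Iterator) is assumed from Spark.
```java
// Sketch of running a per-subgraph kernel via one transformation over packed graph objects.
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;
import java.io.Serializable;
import java.util.List;

public class NScaleOnSparkSketch {
    // Hypothetical stand-in for one extracted subgraph plus its user kernel.
    public interface Subgraph extends Serializable {
        long queryVertexId();
        double lcc();                       // the user computation applied to this subgraph
    }

    // Hypothetical stand-in for a bin of subgraphs produced by the GEP phase.
    public static class GraphObject implements Serializable {
        public List<Subgraph> subgraphs;
    }

    // One task per graph object: iterate its subgraphs and apply the kernel locally.
    public static JavaRDD<Tuple2<Long, Double>> runKernel(JavaRDD<GraphObject> packed) {
        return packed.flatMap(g ->
                g.subgraphs.stream()
                        .map(sg -> new Tuple2<Long, Double>(sg.queryVertexId(), sg.lcc()))
                        .iterator());
    }
}
```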
Arabesque
“Think-like-an-embedding” paradigm
User specifies what types of embeddings to construct, and whether they are grown edge-at-a-time or vertex-at-a-time
User provides functions to filter and process partial embeddings
Arabesque responsibilities: graph exploration, load balancing, aggregation (isomorphism), automorphism detection
User responsibilities: Filter, Process
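As an illustration of how little the user writes, here is a sketch of what the filter/process pair might look like for a clique-finding task: filter() keeps extending an embedding only while it is still a clique, and process() emits embeddings that reach the desired size. The Embedding interface and the size threshold are simplified assumptions, not Arabesque's exact API; exploration, load balancing, and automorphism detection stay with the framework.
```java
// Sketch of the user-side filter/process contract for finding cliques.
import java.util.List;

public class CliqueTask {
    public interface Embedding {                       // hypothetical stand-in
        List<Long> vertices();
        int numEdges();
    }

    // Extend an embedding only while it remains a clique.
    public static boolean filter(Embedding e) {
        int n = e.vertices().size();
        return e.numEdges() == n * (n - 1) / 2;
    }

    // Called on embeddings that passed the filter; emit cliques of an assumed minimum size.
    public static void process(Embedding e) {
        if (e.vertices().size() >= 3) {
            System.out.println("clique: " + e.vertices());
        }
    }
}
```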
Arabesque: Evaluation
On a single thread, performance is comparable to centralized implementations
Drastically more scalable to large graphs and clusters
Conclusion & Future Direction
262
End-to-End Richer Big Graph Analytics
»Keyword search (Elasticsearch)
»Graph query (Neo4j)
»Graph analytics (Giraph)
»Machine learning (Spark, TensorFlow)
»SQL query (Hive, Impala, Spark SQL, etc.)
»Stream processing (Flink, Spark Streaming, etc.)
»JSON processing (AsterixDB, Drill, etc.)
Converged programming abstractions and platforms?
Conclusion & Future Direction
Frameworks for computation-intensive jobs
High-speed network for data-intensive jobs
New hardware support
263
264
Thanks!
Big Graph Analytics Systems (Sigmod16 Tutorial)

  • 1.
    Big Graph AnalyticsSystems DaYan The Chinese University of Hong Kong The Univeristy of Alabama at Birmingham Yingyi Bu Couchbase, Inc. Yuanyuan Tian IBM Research Almaden Center Amol Deshpande University of Maryland James Cheng The Chinese University of Hong Kong
  • 2.
  • 3.
    Big Graph Systems General-PurposeGraph Analytics Programming Language »Java, C/C++, Scala, Python … »Domain-Specific Language (DSL) 3
  • 4.
    Big Graph Systems ProgrammingModel »Think Like aVertex • Message passing • Shared MemoryAbstraction »Matrix Algebra »Think Like a Graph »Datalog 4
  • 5.
    Big Graph Systems OtherFeatures »Execution Mode: Sync or Async ? »Environment: Single-Machine or Distributed ? »Support for Topology Mutation »Out-of-Core Support »Support forTemporal Dynamics »Data-Intensive or Computation-Intensive ? 5
  • 6.
    Tutorial Outline Message PassingSystems Shared Memory Abstraction Single-Machine Systems Matrix-Based Systems Temporal Graph Systems DBMS-Based Systems Subgraph-Based Systems 6 Vertex-Centric Hardware-Related Computation-Intensive
  • 7.
    Tutorial Outline Message PassingSystems Shared Memory Abstraction Single-Machine Systems Matrix-Based Systems Temporal Graph Systems DBMS-Based Systems Subgraph-Based Systems 7
  • 8.
    Message Passing Systems 8 Google’sPregel [SIGMOD’10] »Think like a vertex »Message passing »Iterative • Superstep
  • 9.
    Message Passing Systems 9 Google’sPregel [SIGMOD’10] »Vertex Partitioning 0 1 2 3 4 5 6 7 8 0 1 3 1 0 2 3 2 1 3 4 7 3 0 1 2 7 4 2 5 7 5 4 6 6 5 8 7 2 3 4 8 8 6 7 M0 M1 M2
  • 10.
    Message Passing Systems 10 Google’sPregel [SIGMOD’10] »Programming Interface • u.compute(msgs) • u.send_msg(v, msg) • get_superstep_number() • u.vote_to_halt() Called inside u.compute(msgs)
  • 11.
    Message Passing Systems 11 Google’sPregel [SIGMOD’10] »Vertex States • Active / inactive • Reactivated by messages »Stop Condition • All vertices halted, and • No pending messages
  • 12.
    Message Passing Systems 12 Google’sPregel [SIGMOD’10] »Hash-Min: Connected Components 7 0 1 2 3 4 5 67 8 0 6 85 2 4 1 3 Superstep 1
  • 13.
    Message Passing Systems 13 Google’sPregel [SIGMOD’10] »Hash-Min: Connected Components 5 0 1 2 3 4 5 67 8 0 0 60 0 2 0 1 Superstep 2
  • 14.
    Message Passing Systems 14 Google’sPregel [SIGMOD’10] »Hash-Min: Connected Components 0 0 1 2 3 4 5 67 8 0 0 00 0 0 0 0 Superstep 3
  • 15.
    Message Passing Systems 15 PracticalPregel Algorithm (PPA) [PVLDB’14] »First cost model for Pregel algorithm design »PPAs for fundamental graph problems • Breadth-first search • List ranking • Spanning tree • Euler tour • Pre/post-order traversal • Connected components • Bi-connected components • Strongly connected components • ...
  • 16.
    Message Passing Systems 16 PracticalPregel Algorithm (PPA) [PVLDB’14] »Linear cost per superstep • O(|V| + |E|) message number • O(|V| + |E|) computation time • O(|V| + |E|) memory space »Logarithm number of supersteps • O(log |V|) supersteps O(log|V|) = O(log|E|) How about load balancing?
  • 17.
    Message Passing Systems 17 BalancedPPA (BPPA) [PVLDB’14] »din(v): in-degree of v »dout(v): out-degree of v »Linear cost per superstep • O(din(v) + dout(v)) message number • O(din(v) + dout(v)) computation time • O(din(v) + dout(v)) memory space »Logarithm number of supersteps
  • 18.
    Message Passing Systems 18 BPPAExample: List Ranking [PVLDB’14] »A basic operation of Euler tour technique »Linked list where each element v has • Value val(v) • Predecessor pred(v) »Element at the head has pred(v) = NULL 11111NULL v1 v2 v3 v4 v5 Toy Example: val(v) = 1 for all v
  • 19.
    Message Passing Systems 19 BPPAExample: List Ranking [PVLDB’14] »Compute sum(v) for each element v • Summing val(v) and values of all predecessors »WhyTeraSort cannot work? 54321NULL v1 v2 v3 v4 v5
  • 20.
    Message Passing Systems 20 BPPAExample: List Ranking [PVLDB’14] »Pointer jumping / path doubling • sum(v) ← sum(v) + sum(pred(v)) • pred(v) ← pred(pred(v)) 11111NULL v1 v2 v3 v4 v5 As long as pred(v) ≠ NULL
  • 21.
    Message Passing Systems 21 BPPAExample: List Ranking [PVLDB’14] »Pointer jumping / path doubling • sum(v) ← sum(v) + sum(pred(v)) • pred(v) ← pred(pred(v)) 11111NULL 22221NULL v1 v2 v3 v4 v5
  • 22.
    Message Passing Systems 22 BPPAExample: List Ranking [PVLDB’14] »Pointer jumping / path doubling • sum(v) ← sum(v) + sum(pred(v)) • pred(v) ← pred(pred(v)) NULL 22221NULL 44321NULL v1 v2 v3 v4 v5 11111
  • 23.
    Message Passing Systems 23 BPPAExample: List Ranking [PVLDB’14] »Pointer jumping / path doubling • sum(v) ← sum(v) + sum(pred(v)) • pred(v) ← pred(pred(v)) NULL 22221NULL 44321NULL 54321NULL v1 v2 v3 v4 v5 11111 O(log |V|) supersteps
  • 24.
    Message Passing Systems 24 Optimizationsin Communication Mechanism
  • 25.
    Message Passing Systems 25 ApacheGiraph »Superstep splitting: reduce memory consumption »Only effective when compute(.) is distributive u1 u2 u3 u4 u5 u6 v 0 1 1 1 1 1 1
  • 26.
    Message Passing Systems 26 ApacheGiraph »Superstep splitting: reduce memory consumption »Only effective when compute(.) is distributive u1 u2 u3 u4 u5 u6 v 0 6
  • 27.
    Message Passing Systems 27 ApacheGiraph »Superstep splitting: reduce memory consumption »Only effective when compute(.) is distributive u1 u2 u3 u4 u5 u6 v 6
  • 28.
    Message Passing Systems 28 ApacheGiraph »Superstep splitting: reduce memory consumption »Only effective when compute(.) is distributive u1 u2 u3 u4 u5 u6 v 0 1 1 1
  • 29.
    Message Passing Systems 29 ApacheGiraph »Superstep splitting: reduce memory consumption »Only effective when compute(.) is distributive u1 u2 u3 u4 u5 u6 v 0 3
  • 30.
    Message Passing Systems 30 ApacheGiraph »Superstep splitting: reduce memory consumption »Only effective when compute(.) is distributive u1 u2 u3 u4 u5 u6 v 3
  • 31.
    Message Passing Systems 31 ApacheGiraph »Superstep splitting: reduce memory consumption »Only effective when compute(.) is distributive u1 u2 u3 u4 u5 u6 v 3 1 1 1
  • 32.
    Message Passing Systems 32 ApacheGiraph »Superstep splitting: reduce memory consumption »Only effective when compute(.) is distributive u1 u2 u3 u4 u5 u6 v 3 3
  • 33.
    Message Passing Systems 33 ApacheGiraph »Superstep splitting: reduce memory consumption »Only effective when compute(.) is distributive u1 u2 u3 u4 u5 u6 v 6
  • 34.
    Message Passing Systems 34 Pregel+[WWW’15] »Vertex Mirroring »Request-Respond Paradigm
  • 35.
    Message Passing Systems 35 Pregel+[WWW’15] »Vertex Mirroring M3 w1 w2 wk …… M2 v1 v2 vj …… M1 u1 u2 ui …… … …
  • 36.
    Message Passing Systems 36 Pregel+[WWW’15] »Vertex Mirroring M3 w1 w2 wk …… M2 v1 v2 vj …… M1 u1 u2 ui …… uiui … …
  • 37.
    Message Passing Systems 37 Pregel+[WWW’15] »Vertex Mirroring: Create mirror for u4? M1 u1 u4 … v1 v2 v4v1 v2 v3 u2 v1 v2 u3 v1 v2 M2 v1 v4 … v2 v3
  • 38.
    Message Passing Systems 38 Pregel+[WWW’15] »Vertex Mirroring v.s. Message Combining M1 u1 u4 … v1 v2 v4v1 v2 v3 u2 v1 v2 u3 v1 v2 M1 u1 u4 … u2 u3 M2 v1 v4 … v2 v3 a(u1) + a(u2) + a(u3) + a(u4)
  • 39.
    Message Passing Systems 39 Pregel+[WWW’15] »Vertex Mirroring v.s. Message Combining M1 u1 u4 … v1 v2 v4v1 v2 v3 u2 v1 v2 u3 v1 v2 M1 u1 u4 … u2 u3 M2 v1 v4 … v2 v3 u4 a(u1) + a(u2) + a(u3) a(u4)
  • 40.
    Message Passing Systems 40 Pregel+[WWW’15] »Vertex Mirroring: Only mirror high-degree vertices »Choice of degree threshold τ • M machines, n vertices, m edges • Average degree: degavg = m / n • Optimal τ is M · exp{degavg / M}
  • 41.
    Message Passing Systems 41 Pregel+[WWW’15] » Request-Respond Paradigm v1 v4 v2 v3 u M1 a(u) M2 <v1> <v2> <v3> <v4>
  • 42.
    Message Passing Systems 42 Pregel+[WWW’15] » Request-Respond Paradigm v1 v4 v2 v3 u M1 a(u) M2 a(u) a(u) a(u) a(u)
  • 43.
    Message Passing Systems 43 Pregel+[WWW’15] »A vertex v can request attribute a(u) in superstep i » a(u) will be available in superstep (i + 1)
  • 44.
    Message Passing Systems 44 v1 v4 v2 v3 u M1 D[u] M2 requestu u | D[u] Pregel+ [WWW’15] »A vertex v can request attribute a(u) in superstep I » a(u) will be available in superstep (i + 1)
  • 45.
  • 46.
    Message Passing Systems 46 VertexMigration »WindCatch [ICDE’13] • Runtime improved by 31.5% for PageRank (best) • 2% for shortest path computation • 9% for maximal matching »Stanford’s GPS [SSDBM’13] »Mizan [EuroSys’13] • Hash-based and METIS partitioning: no improvement • Range-based partitioning: around 40% improvement
  • 47.
    Message Passing Systems DynamicConcurrency Control »PAGE [TKDE’15] • Better partitioning → slower ? 47
  • 48.
    Message Passing Systems DynamicConcurrency Control »PAGE [TKDE’15] • Message generation • Local message processing • Remote message processing 48
  • 49.
    Message Passing Systems DynamicConcurrency Control »PAGE [TKDE’15] • Monitors speeds of the 3 operations • Dynamically adjusts number of threads for the 3 operations • Criteria - Speed of message processing = speed of incoming messages - Thread numbers for local & remote message processing are proportional to speed of local & remote message processing 49
  • 50.
    Message Passing Systems 50 Out-of-CoreSupport java.lang.OutOfMemoryError: Java heap space 26 cases reported by Giraph-users mailing list during 08/2013~08/2014!
  • 51.
    Message Passing Systems 51 Pregelix[PVLDB’15] »Transparent out-of-core support »Physical flexibility (Environment) »Software simplicity (Implementation) Hyracks Dataflow Engine
  • 52.
  • 53.
  • 54.
    Message Passing Systems 54 GraphD »Hardwarefor small startups and average researchers • Desktop PCs • Gigabit Ethernet switch »Features of a common cluster • Limited memory space • Disk streaming bandwidth >> network bandwidth » Each worker stores and streams edges and messages on local disks » Cost of buffering msgs on disks hidden inside msg transmission
  • 55.
  • 56.
    Message Passing Systems 56 CoordinatedCheckpointing of Pregel »Every δ supersteps »Recovery from machine failure: • Standby machine • Repartitioning among survivors An illustration with δ = 5
  • 57.
    Message Passing Systems 57 CoordinatedCheckpointing of Pregel W1 W2 W3 … … … Superstep 4 W1 W2 W3 5 W2 W3 6 W1 W2 W3 7 Failure occurs W1 Write checkpoint to HDFS Vertex states, edge changes, shuffled messages
  • 58.
    Message Passing Systems 58 CoordinatedCheckpointing of Pregel W1 W2 W3 … … … Superstep 4 W1 W2 W3 5 W1 W2 W3 6 W1 W2 W3 7 Load checkpoint from HDFS
  • 59.
    Message Passing Systems 59 Chandy-LamportSnapshot [TOCS’85] »Uncoordinated checkpointing (e.g., for async exec) »For message-passing systems »FIFO channels u v 5 5 u : 5
  • 60.
    Message Passing Systems 60 Chandy-LamportSnapshot [TOCS’85] »Uncoordinated checkpointing (e.g., for async exec) »For message-passing systems »FIFO channels u v u : 5 4 4 5
  • 61.
    Message Passing Systems 61 Chandy-LamportSnapshot [TOCS’85] »Uncoordinated checkpointing (e.g., for async exec) »For message-passing systems »FIFO channels u v u : 5 4 4
  • 62.
    Message Passing Systems 62 Chandy-LamportSnapshot [TOCS’85] »Uncoordinated checkpointing (e.g., for async exec) »For message-passing systems »FIFO channels u v u : 5 v : 4 4 4
  • 63.
    Message Passing Systems 63 Chandy-LamportSnapshot [TOCS’85] »Solution: bcast checkpoint request right after checkpointed u v 5 5 u : 5 REQ v : 5
  • 64.
    Message Passing Systems 64 Recoveryby Message-Logging [PVLDB’14] »Each worker logs its msgs to local disks • Negligible overhead, cost hidden »Survivor • No re-computaton during recovery • Forward logged msgs to replacing workers »Replacing worker • Re-compute from latest checkpoint • Only send msgs to replacing workers
  • 65.
    Message Passing Systems 65 Recoveryby Message-Logging [PVLDB’14] W1 W2 W3 … … … Superstep 4 W1 W2 W3 5 W2 W3 6 W1 W2 W3 7 Failure occurs W1 Log msgsLog msgsLog msgs Log msgsLog msgsLog msgs
  • 66.
    Message Passing Systems 66 Recoveryby Message-Logging [PVLDB’14] W1 W2 W3 … … … Superstep 4 W1 W2 W3 5 W1 W2 W3 6 W1 W2 W3 7 Standby Machine Load checkpoint
  • 67.
  • 68.
    Message Passing Systems 68 Block-CentricComputation »Main Idea • A block refers to a connected subgraph • Messages exchange among blocks • Serial in-memory algorithm within a block
  • 69.
    Message Passing Systems 69 Block-CentricComputation »Motivation: graph characteristics adverse to Pregel • Large graph diameter • High average vertex degree
  • 70.
    Message Passing Systems 70 Block-CentricComputation »Benefits • Less communication workload • Less number of supersteps • Less number of computing units
  • 71.
    Message Passing Systems 71 Giraph++[PVLDB’13] » Pioneering: think like a graph » METIS-style vertex partitioning » Partition.compute(.) » Boundary vertex values sync-ed at superstep barrier » Internal vertex values can be updated anytime
  • 72.
    Message Passing Systems 72 Blogel[PVLDB’14] » API: vertex.compute(.) + block.compute(.) »A block can have its own fields »A block/vertex can send msgs to another block/vertex »Example: Hash-Min • Construct block-level graph: to compute an adjacency list for each block • Propagate min block ID among blocks
  • 73.
    Message Passing Systems 73 Blogel[PVLDB’14] »Performance on Friendster social network with 65.6 M vertices and 3.6 B edges 1 10 100 1000 2.52 120.24 ComputingTime Blogel Pregel+ 1 100 10,000 19 7,227 MILLION Total Msg # Blogel Pregel+ 0 10 20 30 5 30 Superstep # Blogel Pregel+
  • 74.
    Message Passing Systems 74 Blogel[PVLDB’14] »Web graph: URL-based partitioning »Spatial networks: 2D partitioning »General graphs: graphVoronoi diagram partitioning
  • 75.
    Blogel [PVLDB’14] » GraphVoronoiDiagram (GVD) partitioning 75 Three seeds v is 2 hops from red seed v is 3 hops from green seed v is 5 hops from blue seedv Message Passing Systems
  • 76.
    Blogel [PVLDB’14] »Sample seedvertices with probability p 76 Message Passing Systems
  • 77.
    Blogel [PVLDB’14] »Sample seedvertices with probability p 77 Message Passing Systems
  • 78.
    Blogel [PVLDB’14] »Sample seedvertices with probability p »Compute GVD grouping • Vertex-centric multi-source BFS 78 Message Passing Systems
  • 79.
    Blogel [PVLDB’14] 79State afterSeed Sampling Message Passing Systems
  • 80.
  • 81.
  • 82.
  • 83.
    Blogel [PVLDB’14] »Sample seedvertices with probability p »Compute GVD grouping »Postprocessing 83 Message Passing Systems
  • 84.
    Blogel [PVLDB’14] »Sample seedvertices with probability p »Compute GVD grouping »Postprocessing • For very large blocks, resample with a larger p and repeat 84 Message Passing Systems
  • 85.
    Blogel [PVLDB’14] »Sample seedvertices with probability p »Compute GVD grouping »Postprocessing • For very large blocks, resample with a larger p and repeat • For tiny components, find them using Hash-Min at last 85 Message Passing Systems
  • 86.
    GVD Partitioning Performance 86 2026.65 505.85 186.89 105.4875.88 70.68 0 500 1000 1500 2000 2500 3000 WebUK Friendster BTC LiveJournal USA Road Euro Road Loading Partitioning Dumping Message Passing Systems
  • 87.
  • 88.
    Maiter [TPDS’14] » Foralgos where vertex values converge asymmetrically » Delta-based accumulative iterative computation (DAIC) 88 Message Passing Systems v1 v2 v3 v4
  • 89.
    Maiter [TPDS’14] » Foralgos where vertex values converge asymmetrically » Delta-based accumulative iterative computation (DAIC) » Strict transformation from Pregel API to DAIC formulation »Delta may serve as priority score »Natural for block-centric frameworks 89 Message Passing Systems
  • 90.
  • 91.
    Quegel [PVLDB’16] » On-demandanswering of light-workload graph queries • Only a portion of the whole graph gets accessed » Option 1: to process queries one job after another • Network underutilization, too many barriers • High startup overhead (e.g., graph loading) 91 Message Passing Systems
  • 92.
    Quegel [PVLDB’16] » On-demandanswering of light-workload graph queries • Only a portion of the whole graph gets accessed » Option 2: to process a batch of queries in one job • Programming complexity • Straggler problem 92 Message Passing Systems
  • 93.
    Quegel [PVLDB’16] »Execution model:superstep-sharing • Each iteration is called a super-round • In a super-round, every query proceeds by one superstep 93 Message Passing Systems Super–Round # 1 q1 2 3 4 1 2 3 4 q3q2 q4 Time Queries 5 6 q1 q2 q3 q4 7 1 2 3 4 1 2 3 4 1 2 3 4
  • 94.
    Quegel [PVLDB’16] »Benefits • Messagesof multiple queries transmitted in one batch • One synchronization barrier for each super-round • Better load balancing 94 Message Passing Systems Worker 1 Worker 2 time sync sync sync Individual Synchronization Superstep-Sharing
  • 95.
    Quegel [PVLDB’16] »API issimilar to Pregel »The system does more: • Q-data: superstep number, control information, … • V-data: adjacency list, vertex/edge labels • VQ-data: vertex state in the evaluation of each query 95 Message Passing Systems
  • 96.
    Quegel [PVLDB’16] »Create aVQ-dataof v for q, only when q touches v »Garbage collection of Q-data andVQ-data »Distributed indexing 96 Message Passing Systems
  • 97.
    Tutorial Outline Message PassingSystems Shared Memory Abstraction Single-Machine Systems Matrix-Based Systems Temporal Graph Systems DBMS-Based Systems Subgraph-Based Systems 97
  • 98.
    Shared-Mem Abstraction 98 Single Machine (UAI2010) Distributed GraphLab (PVLDB 2012) PowerGraph (OSDI 2012)
  • 99.
    Shared-Mem Abstraction Distributed GraphLab[PVLDB’12] »Scope of vertex v 99 u v w Du Dv Dw D(u,v) D(v,w) ………… ………… All that v can access
  • 100.
    Shared-Mem Abstraction Distributed GraphLab[PVLDB’12] » Async exec mode: for asymmetric convergence • Scheduler, serializability » API:v.update() • Access & update data in v’s scope • Add neighbors to scheduler 100
  • 101.
    Shared-Mem Abstraction Distributed GraphLab[PVLDB’12] » Vertices partitioned among machines » For edge (u, v), scopes of u and v overlap • Du, Dv and D(u, v) • Replicated if u and v are on different machines » Ghosts: overlapped boundary data • Value-sync by a versioning system » Memory space problem • x {# of machines} 101
  • 102.
    Shared-Mem Abstraction PowerGraph [OSDI’12] »API: Gather-Apply-Scatter (GAS) • PageRank: out-degree = 2 for all in-neighbors 102 1 1 1 1 1 1 1 0
  • 103.
    Shared-Mem Abstraction PowerGraph [OSDI’12] »API: Gather-Apply-Scatter (GAS) • PageRank: out-degree = 2 for all in-neighbors 103 1 1 1 1 1 1 1 1/2 0 1/2 1/2
  • 104.
    Shared-Mem Abstraction PowerGraph [OSDI’12] »API: Gather-Apply-Scatter (GAS) • PageRank: out-degree = 2 for all in-neighbors 104 1 1 1 1 1 1 1 1.5
  • 105.
    Shared-Mem Abstraction PowerGraph [OSDI’12] »API: Gather-Apply-Scatter (GAS) • PageRank: out-degree = 2 for all in-neighbors 105 1 1 1 1.5 1 1 1 0 Δ = 0.5 > ϵ
  • 106.
    Shared-Mem Abstraction PowerGraph [OSDI’12] »API: Gather-Apply-Scatter (GAS) • PageRank: out-degree = 2 for all in-neighbors 106 1 1 1 1.5 1 1 1 0 activated activated activated
  • 107.
    Shared-Mem Abstraction PowerGraph [OSDI’12] »EdgePartitioning »Goals: • Loading balancing • Minimize vertex replicas – Cost of value sync – Cost of memory space 107
  • 108.
    Shared-Mem Abstraction PowerGraph [OSDI’12] »GreedyEdge Placement 108 u v W1 W2 W3 W4 W5 W6 Workload 100 101 102 103 104 105
  • 109.
    Shared-Mem Abstraction PowerGraph [OSDI’12] »GreedyEdge Placement 109 u v W1 W2 W3 W4 W5 W6 Workload 100 101 102 103 104 105
  • 110.
    Shared-Mem Abstraction PowerGraph [OSDI’12] »GreedyEdge Placement 110 u v W1 W2 W3 W4 W5 W6 Workload 100 101 102 103 104 105 ∅ ∅
  • 111.
  • 112.
    Shared-Mem Abstraction Shared-Mem +Single-Machine »Out-of-core execution, disk/SSD-based • GraphChi [OSDI’12] • X-Stream [SOSP’13] • VENUS [ICDE’14] • … »Vertices are numbered 1, …, n; cut into P intervals 112 interval(2) interval(P) 1 nv1 v2 interval(1)
  • 113.
    Shared-Mem Abstraction GraphChi [OSDI’12] »ProgrammingModel • Edge scope of v 113 u v w Du Dv Dw D(u,v) D(v,w) ………… …………
  • 114.
    Shared-Mem Abstraction GraphChi [OSDI’12] »ProgrammingModel • Scatter & gather values along adjacent edges 114 u v w Dv D(u,v) D(v,w) ………… …………
  • 115.
    Shared-Mem Abstraction GraphChi [OSDI’12] »Loadvertices of each interval, along with adjacent edges for in-mem processing »Write updated vertex/edge values back to disk »Challenges • Sequential IO • Consistency: store each edge value only once on disk 115 interval(2) interval(P) 1 nv1 v2 interval(1)
  • 116.
    Shared-Mem Abstraction GraphChi [OSDI’12] »Diskshards: shard(i) • Vertices in interval(i) • Their incoming edges, sorted by source_ID 116 interval(2) interval(P) 1 nv1 v2 interval(1) shard(P)shard(2)shard(1)
  • 117.
    Shared-Mem Abstraction GraphChi [OSDI’12] »ParallelSlidingWindows (PSW) 117 Shard 1 in-edgessortedby src_id Vertices 1..100 Vertices 101..200 Vertices 201..300 Vertices 301..400 Shard 2 Shard 3 Shard 4Shard 1
  • 118.
    Shared-Mem Abstraction GraphChi [OSDI’12] »ParallelSlidingWindows (PSW) 118 Shard 1 in-edgessortedby src_id Vertices 1..100 Vertices 101..200 Vertices 201..300 Vertices 301..400 Shard 2 Shard 3 Shard 4Shard 1 100 100 100 1 1 1 1 Out-Edges Vertices & In-Edges 100
  • 119.
    Shared-Mem Abstraction GraphChi [OSDI’12] »ParallelSlidingWindows (PSW) 119 Shard 1 in-edgessortedby src_id Vertices 1..100 Vertices 101..200 Vertices 201..300 Vertices 301..400 Shard 2 Shard 3 Shard 4Shard 1 1 1 1 1 100 100 100 200 Vertices & In-Edges 200 200 Out-Edges 100 200
  • 120.
    Shared-Mem Abstraction GraphChi [OSDI’12] »Eachvertex & edge value is read & written for at least once in an iteration 120
  • 121.
    Shared-Mem Abstraction X-Stream [SOSP’13] »Edge-scopeGAS programming model »Streams a completely unordered list of edges 121
  • 122.
    Shared-Mem Abstraction X-Stream [SOSP’13] »Simplecase: all vertex states are memory-resident »Pass 1: edge-centric scattering • (u, v): value(u) => <v, value(u, v)> »Pass 2: edge-centric gathering • <v, value(u, v)> => value(v) 122 update aggregate
  • 123.
    Shared-Mem Abstraction X-Stream [SOSP’13] »Out-of-CoreEngine • P vertex partitions with vertex states only • P edge partitions, partitioned by source vertices • Each pass loads a vertex partition, streams corresponding edge partition (or update partition) 123 interval(2) interval(P) 1 nv1 v2 interval(1) Fit into memory Larger than in GraphChi Streamed on disk P update files generated by Pass 1 scattering
  • 124.
    Shared-Mem Abstraction X-Stream [SOSP’13] »Out-of-CoreEngine • Pass 1: edge-centric scattering – (u, v): value(u) => [v, value(u, v)] • Pass 2: edge-centric scattering – [v, value(u, v)] => value(v) 124 interval(2) interval(P) 1 nv1 v2 interval(1) Append to update file for partition of v Streamed from update file for the corresponding vertex partition
  • 125.
    Shared-Mem Abstraction X-Stream [SOSP’13] »Scaleout: Chaos [SOSP’15] • Requires 40 GigE • Slow with GigE »Weakness: sparse computation 125
  • 126.
    Shared-Mem Abstraction VENUS [ICDE’14] »Programmingmodel • Value scope of v 126 u v w Du Dv Dw D(u,v) D(v,w) ………… …………
  • 127.
    Shared-Mem Abstraction VENUS [ICDE’14] »Assumestatic topology • Separate read-only edge data and mutable vertex states »g-shard(i): incoming edge lists of vertices in interval(i) »v-shard(i): srcs & dsts of edges in g-shard(i) »All g-shards are concatenated for streaming 127 interval(2) interval(P) 1 nv1 v2 interval(1) Sources may not be in interval(i) Vertices in a v-shard are ordered by ID
  • 128.
    Dsts of interval(i)may be srcs of other intervals Shared-Mem Abstraction VENUS [ICDE’14] »To process interval(i) • Load v-shard(i) • Stream g-shard(i), update in-memory v-shard(i) • Update every other v-shard by a sequential write 128 interval(2) interval(P) 1 nv1 v2 interval(1) Dst vertices are in interval(i)
  • 129.
    Shared-Mem Abstraction VENUS [ICDE’14] »Avoid writing O(|E|) edge values to disk » O(|E|) edge values are read once » O(|V|) may be read/written for multiple times 129 interval(2) interval(P) 1 nv1 v2 interval(1)
  • 130.
    Tutorial Outline Message PassingSystems Shared Memory Abstraction Single-Machine Systems Matrix-Based Systems Temporal Graph Systems DBMS-Based Systems Subgraph-Based Systems 130
  • 131.
    Single-Machine Systems Categories »Shared-mem out-of-core(GraphChi, X-Stream,VENUS) »Matrix-based (to be discussed later) »SSD-based »In-mem multi-core »GPU-based 131
  • 132.
  • 133.
    Single-Machine Systems SSD-Based Systems »Asyncrandom IO • Many flash chips, each with multiple dies »Callback function »Pipelined for high throughput 133
  • 134.
  • 135.
  • 136.
  • 137.
  • 138.
    Single-Machine Systems TurboGraph [KDD’13] 138 In-mempage table: vertex ID -> location on SSD 1-hop neighborhood: outperform GraphChi by 104
  • 139.
    Single-Machine Systems TurboGraph [KDD’13] 139 Specialtreatment for adj-list larger than a page
  • 140.
    Single-Machine Systems TurboGraph [KDD’13] »Pin-and-slideexecution model »Concurrently process vertices of pinned pages »Do not wait for completion of IO requests »Page unpinned as soon as processed 140
  • 141.
    Single-Machine Systems FlashGraph [FAST’15] »Semi-externalmemory • Edge lists on SSDs »On top of SAFS, an SSD file system • High-throughput async I/Os over SSD array • Edge lists stored in one (logical) file on SSD 141
  • 142.
    Single-Machine Systems FlashGraph [FAST’15] »Onlyaccess requested edge lists »Merge same-page / adjacent-page requests into one sequential access »Vertex-centricAPI »Message passing among threads 142
  • 143.
  • 144.
    Single-Machine Systems In-Memory ParallelFrameworks »Programming simplicity • Green-Marl, Ligra, GRACE »Full utilization of all cores in a machine • GRACE, Galois 144
  • 145.
    Single-Machine Systems Green-Marl [ASPLOS’12] »Domain-specificlanguage (DSL) • High-level language constructs • Expose data-level parallelism »DSL → C++ program »Initially single-machine, now supported by GPS 145
  • 146.
    Single-Machine Systems Green-Marl [ASPLOS’12] »ParallelFor »Parallel BFS »Reductions (e.g., SUM, MIN, AND) »Deferred assignment (<=) • Effective only at the end of the binding iteration 146
  • 147.
    Single-Machine Systems Ligra [PPoPP’13] »VertexSet-centricAPI: edgeMap, vertexMap »Example: BFS • Ui+1←edgeMap(Ui, F, C) 147 u v Ui Vertices for next iteration
  • 148.
    Single-Machine Systems Ligra [PPoPP’13] »VertexSet-centricAPI: edgeMap, vertexMap »Example: BFS • Ui+1←edgeMap(Ui, F, C) 148 u v Ui C(v) = parent[v] is NULL? Yes
  • 149.
    Single-Machine Systems Ligra [PPoPP’13] »VertexSet-centricAPI: edgeMap, vertexMap »Example: BFS • Ui+1←edgeMap(Ui, F, C) 149 u v Ui F(u, v): parent[v] ← u v added to Ui+1
  • 150.
    Single-Machine Systems Ligra [PPoPP’13] »Modeswitch based on vertex sparseness |Ui| • When | Ui | is large 150 u v Ui w C(w) called 3 times
  • 151.
    Single-Machine Systems Ligra [PPoPP’13] »Modeswitch based on vertex sparseness |Ui| • When | Ui | is large 151 u v Ui w if C(v) is true Call F(u, v) for every in-neighbor in U Early pruning: just the first one for BFS
  • 152.
    Single-Machine Systems GRACE [PVLDB’13] »Vertex-centricAPI,block-centric execution • Inner-block computation: vertex-centric computation with an inner-block scheduler »Reduce data access to computation ratio • Many vertex-centric algos are computationally-light • CPU cache locality: every block fits in cache 152
  • 153.
    Single-Machine Systems Galois [SOSP’13] »Amorphousdata-parallelism (ADP) • Speculative execution: fully use extra CPU resources 153 v’s neighborhoodu’s neighborhood u vw
  • 154.
    Single-Machine Systems Galois [SOSP’13] »Amorphousdata-parallelism (ADP) • Speculative execution: fully use extra CPU resources 154 v’s neighborhoodu’s neighborhood u vw Rollback
  • 155.
    Single-Machine Systems Galois [SOSP’13] »Amorphousdata-parallelism (ADP) • Speculative execution: fully use extra CPU resources »Machine-topology-aware scheduler • Try to fetch tasks local to the current core first 155
  • 156.
  • 157.
    Single-Machine Systems GPU Architecture »Arrayof streaming multiprocessors (SMs) »Single instruction, multiple threads (SIMT) »Different control flows • Execute all flows • Masking »Memory cache hierarchy 157 Small path divergence Coalesced memory accesses
  • 158.
    Single-Machine Systems GPU Architecture »Warp:32 threads, basic unit for scheduling »SM: 48 warps • Two streaming processors (SPs) • Warp scheduler: two warps executed at a time »Thread block / CTA (cooperative thread array) • 6 warps • Kernel call → grid of CTAs • CTAs are distributed to SMs with available resources 158
  • 159.
    Single-Machine Systems Medusa [TPDS’14] »BPSmodel of Pregel »Fine-grained API: Edge-Message-Vertex (EMV) • Large parallelism, small path divergence »Pre-allocates an array for buffering messages • Coalesced memory accesses: incoming msgs for each vertex is consecutive • Write positions of msgs do not conflict 159
  • 160.
    Single-Machine Systems CuSha [HPDC’14] »Applythe shard organization of GraphChi »Each shard processed by one CTA »Window concatenation 160 Window write-back: imbalanced workload Shard 1 n-edgessortedbysrc_id Vertices 1..100 Vertices 101..200 Vertices 201..300 Vertices 301..400 Shard 2 Shard 3 Shard 4Shard 1 1 1 1 1 100 100 100 200 200 200 100 200
  • 161.
    Single-Machine Systems CuSha [HPDC’14] »Applythe shard organization of GraphChi »Each shard processed by one CTA »Window concatenation 161 Threads in a CTA may cross window boundaries Pointers to actual locations in shards Window write-back: imbalanced workload
  • 162.
    Tutorial Outline Message PassingSystems Shared Memory Abstraction Single-Machine Systems Matrix-Based Systems Temporal Graph Systems DBMS-Based Systems Subgraph-Based Systems 162
  • 163.
    Matrix-Based Systems 163 Categories »Single-machine systems •Vertex-centric API • Matrix operations in the backend »Distributed frameworks • (Generalized) matrix-vector multiplication • Matrix algebra
  • 164.
    Matrix-Based Systems 164 Matrix-Vector Multiplication »Example:PageRank PRi(v1) PRi(v2) PRi(v3) PRi(v4) × = Pri+1(v1) PRi+1 (v2) PRi+1 (v3) PRi+1 (v4) Out-AdjacencyList(v1) Out-AdjacencyList(v2) Out-AdjacencyList(v3) Out-AdjacencyList(v4)
  • 165.
    Matrix-Based Systems 165 Generalized Matrix-VectorMultiplication »Example: HashMin mini(v1) mini(v2) mini(v3) mini(v4) × = mini+1(v1) mini+1 (v2) mini+1 (v3) mini+1 (v4) 0/1-AdjacencyList(v1) 0/1-AdjacencyList(v2) 0/1-AdjacencyList(v3) 0/1-AdjacencyList(v4) Add → Min Assign only when smaller
  • 166.
  • 167.
    Matrix-Based Systems GraphTwist [PVLDB’15] »Multi-levelgraph partitioning • Right granularity for in-memory processing • Balance workloads among computing threads 1671 n src dst 1 n u v w(u, v) edge-weight
  • 168.
    Matrix-Based Systems GraphTwist [PVLDB’15] »Multi-levelgraph partitioning • Right granularity for in-memory processing • Balance workloads among computing threads 1681 n src dst 1 n edge-weight slice
  • 169.
    Matrix-Based Systems GraphTwist [PVLDB’15] »Multi-levelgraph partitioning • Right granularity for in-memory processing • Balance workloads among computing threads 1691 n src dst 1 n edge-weight stripe
  • 170.
    Matrix-Based Systems GraphTwist [PVLDB’15] »Multi-levelgraph partitioning • Right granularity for in-memory processing • Balance workloads among computing threads 1701 n src dst 1 n edge-weight dice
  • 171.
    Matrix-Based Systems GraphTwist [PVLDB’15] »Multi-levelgraph partitioning • Right granularity for in-memory processing • Balance workloads among computing threads 1711 n src dst 1 n edge-weight u vertex cut
  • 172.
    Matrix-Based Systems GraphTwist [PVLDB’15] »Multi-levelgraph partitioning • Right granularity for in-memory processing • Balance workloads among computing threads »Fast Randomized Approximation • Prune statistically insignificant vertices/edges • E.g., PageRank computation only using high-weight edges • Unbiased estimator: sampling slices/cuts according to Frobenius norm 172
  • 173.
    Matrix-Based Systems GridGraph [ATC’15] »Gridrepresentation for reducing IO 173
  • 174.
    Matrix-Based Systems GridGraph [ATC’15] »Gridrepresentation for reducing IO »Streaming-apply API • Streaming edges of a block (Ii, Ij) • Aggregate value to v ∈ Ij 174
  • 175.
  • 176.
    Matrix-Based Systems GridGraph [ATC’15] »Illustration:column-by-column evaluation 176 Create in-mem Load
  • 177.
  • 178.
  • 179.
    Matrix-Based Systems GridGraph [ATC’15] »Illustration:column-by-column evaluation 179 Create in-mem Load
  • 180.
  • 181.
  • 182.
  • 183.
    Matrix-Based Systems GridGraph [ATC’15] »ReadO(P|V|) data of vertex chunks »Write O(|V|) data of vertex chunks (not O(|E|)!) »Stream O(|E|) data of edge blocks • Edge blocks are appended into one large file for streaming • Block boundaries recorded to trigger the pin/unpin of a vertex chunk 183
  • 184.
  • 185.
    Distributed Systems withMatrix- Based Interfaces • PEGASUS (CMU, 2009) • GBase (CMU & IBM, 2011) • SystemML (IBM, 2011) 185 Commonality: • Matrix-based programming interface to the users • Rely on MapReduce for execution.
  • 186.
    PEGASUS • Open source:http://www.cs.cmu.edu/~pegasus • Publications: ICDM’09,KAIS’10. • Intuition: many graph computation can be modeled by a generalized form of matrix-vector multiplication. 𝑣′ = 𝑀 × 𝑣 PageRank: 𝑣′ = 0.85 ∙ 𝐴 𝑇 + 0.15 ∙ 𝑈 × 𝑣 186
  • 187.
    PEGASUS Programming Interface:GIM-V Three Primitives: 1) combine2(mi,j , vj ) : combine mi,j and vj into xi,j 2) combineAlli (xi,1 , ..., xi,n ) : combine all the results from combine2() for node i into vi ' 3) assign(vi , vi ' ) : decide how to update vi with vi ' Iterative: Operation applied till algorithm-specific convergence criterion is met.
  • 188.
    PageRank Example 𝑣′ =0.85 ∙ 𝐴 𝑇 + 0.15 ∙ 𝑈 × 𝑣 𝒄𝒐𝒎𝒃𝒊𝒏𝒆𝟐 𝑎𝑖,𝑗, 𝑣𝑗 = 0.85 ∙ 𝑎𝑖,𝑗 ∙ 𝑣𝑗 𝒄𝒐𝒎𝒃𝒊𝒏𝒆𝑨𝒍𝒍𝒊 𝑥𝑖,1, … , 𝑥𝑖,𝑛 = 0.15 𝑛 + 𝑖=1 𝑛 𝑥𝑖,𝑗 𝒂𝒔𝒔𝒊𝒈𝒏 𝑣𝑗, 𝑣𝑗′ = 𝑣𝑗′ 188
  • 189.
    Execution Model Iterations ofa 2-stage algorithm (each stage is a MR job) • Input: Edge andVector file • Edge line : (idsrc , iddst , mval) -> cell adjacency Matrix M • Vector line: (id, vval) -> element inVectorV • Stage 1: performs combine2() on columns of iddst of M with rows of id ofV • Stage 2: combines all partial results and assigns new vector -> old vector 189
  • 190.
    Optimizations • Block Multiplication •Clustered Edges 190 • Diagonal Block Iteration for connected component detection * Figures are copied from Kang et al ICDM’09
  • 191.
    GBASE • Part ofthe IBM System GToolkit • http://systemg.research.ibm.com • Publications: SIGKDD’11,VLDBJ’12. • PEGASUS vs GBASE: • Common: • Matrix-vector multiplication as the core operation • Division of a matrix into blocks • Clustering nodes to form homogenous blocks • Different: 191 PEGASUS GBASE Queries global targeted & global User Interface customizableAPIs build-in algorithms Storage normal files compression, special placement Block Size Square blocks Rectangular blocks
  • 192.
    Block Compression andPlacement • Block Formation • Partition nodes using clustering algorithms e.g. Metis • Compressed block encoding • source and destination partition ID p and q; • the set of sources and the set of destinations • the payload, the bit string of subgraph G(p,q) • The payload is compressed using zip compression or gap Elias-γ encoding. • Block Placement • Grid placement to minimize the number of input HDFS files to answer queries 192* Figure is copied from Kang et al SIGKDD’11
  • 193.
    Built-In Algorithms inGBASE • Select grids containing the blocks relevant to the queries • Derive the incidence matrix from the original adjacency matrix as required 193* Figure is copied from Kang et al SIGKDD’11
  • 194.
    SystemML • Apache Opensource: https://systemml.apache.org • Publications: ICDE’11, ICDE’12,VLDB’14, Data Engineering Bulletin’14, ICDE’15, SIGMOD’15, PPOPP’15,VLDB16. • Comparison to PEGASUS and GBASE • Core: General linear algebra and math operations (beyond just matrix- vector multiplication) • Designed for machine learning in general • User Interface: A high-level language with similar syntax as R • Declarative approach to graph processing with cost-based and rule-based optimization • Run on multiple platforms including MapReduce, Spark and single node. 194
  • 195.
    SystemML – DeclarativeMachine Learning Analytics language for data scientists (“The SQL for analytics”) » Algorithms expressed in a declarative, high-level language with R-like syntax » Productivity of data scientists » Language embeddings for • Solutions development • Tools Compiler » Cost-based optimizer to generate execution plans and to parallelize • based on data characteristics • based on cluster and machine characteristics » Physical operators for in-memory single node and cluster execution Performance & Scalability 195
  • 196.
    SystemML Architecture Overview 196 Language(DML) • R- like syntax • Rich set of statistical functions • User-defined & external function • Parsing • Statement blocks & statements • Program Analysis, type inference, dead code elimination High-Level Operator (HOP) Component • Represent dataflow in DAGs of operations on matrices, scalars • Choosing from alternative execution plans based on memory and cost estimates: operator ordering & selection; hybrid plans Low-Level Operator (LOP) Component • Low-level physical execution plan (LOPDags) over key-value pairs • “Piggybacking” operations into minimal number Map-Reduce jobs Runtime • Hybrid Runtime • CP: single machine operations & orchestrate MR jobs • MR: generic Map-Reduce jobs & operations • SP: Spark Jobs • Numerically stable operators • Dense / sparse matrix representation • Multi-Level buffer pool (caching) to evict in-memory objects • Dynamic Recompilation for initial unknowns Command Line JMLC Spark MLContext Spark ML APIs High-Level Operators Parser/Language Low-Level Operators Compiler Runtime Control Program Runtime Program Buffer Pool ParFor Optimizer/ Runtime MR InstSpark Inst CP Inst Recompiler Cost-based optimizations DFS IOMem/FS IO Generic MR Jobs MatrixBlock Library (single/multi-threaded)
  • 197.
    Pros and Consof Matrix-Based Graph Systems Pros: - Intuitive for analytic users familiar with linear algebra - E.g. SystemML provides a high-level language familiar to a lot of analysts Cons: - PEGASUS and GBASE require an expensive clustering of nodes as a preprocessing step. - Not all graph algorithms can be expressed using linear algebra - Unnecessary computation compared to vertex-centric model 197
  • 198.
    Tutorial Outline Message PassingSystems Shared Memory Abstraction Single-Machine Systems Matrix-Based Systems Temporal Graph Systems DBMS-Based Systems Subgraph-Based Systems 198
  • 199.
    Temporal and StreamingGraph Analytics • Motivation: Real world graphs often evolve over time. • Two body of work: • Real-time analysis on streaming graph data • E.g. Calculate each vertex’s current PageRank • Temporal analysis over historical traces of graphs • E.g. Analyzing the change of each vertex’s PageRank for a given time range 199
  • 200.
    Common Features forAll Systems • Temporal Graph: a continuous stream of graph updates • Graph update: addition or deletion of vertex/edge, or the update of the attribute associated with node/edge. • Most systems separate graph updates from graph computation. • Graph computation is only performed on a sequence of successive static views of the temporal graph • A graph snapshot is most commonly used static view • Using existing static graph programmingAPIs for temporal graph • Incremental graph computation • Leverage significant overlap of successive static views • Use ending vertex and edge states at time t as the starting states at time t+1 • Not applicable to all algorithms 200 Static view 1 Static view 2 Static view 3
  • 201.
    Overview • Real-time StreamingGraph Systems • Kineograph (distributed, Microsoft, 2012) • TIDE (distributed, IBM, 2015) • Historical Graph Systems • Chronos (distributed, Microsoft, 2014) • DeltaGraph (distributed, University of Maryland, 2013) • LLAMM (single-node, Harvard University & Oracle, 2015) 201
  • 202.
    Kineograph • Publication: Chenget al Eurosys’12 • Target query: continuously deliver analytics results on static snapshots of a dynamic graph periodically • Two layers: • Storage layer: continuously applies updates to a dynamic graph • Computation layer: performs graph computation on a graph snapshot 202
  • 203.
    Kineograph Architecture Overview •Graph is stored in a key/value store among graph nodes • Ingest nodes are the front end of incoming graph updates • Snapshooter uses an epoch commit protocol to produce snapshots • Progress table keeps track of the process by ingest nodes 203* Figure is copied from Cheng et al Eurosys’12
  • 204.
    Epoch Commit Protocol 204*Figure is copied from Cheng et al Eurosys’12
  • 205.
    Graph Computation • ApplyVertex-basedGAS computation model on snapshots of a dynamic graph • Supports both push and pull models for inter-vertex communication. 205* Figure is copied from Cheng et al Eurosys’12
  • 206.
    TIDE • Publication: Xieet al ICDE’15 • Target query: continuously deliver analytics results on a dynamic graph • Model social interactions as a dynamic interaction graph • New interactions (edges) continuously added • Probabilistic edge decay (PED) model to produce static views of dynamic graphs 206
  • 207.
    StaticViews ofTemporal Graph 207 E.g.,relationship between a and b is forgottena b a b Sliding Window Model  Consider recent graph data within a small time window  Problem: Abruptly forgets past data (no continuity) Snapshot Model  Consider all graph data seen so far  Problem: Does not emphasize recent data (no recency)
  • 208.
    Probabilistic Edge DecayModel 208 Key Idea: Temporally Biased Sampling  Sample data items according to a probability that decreases over time  Sample contains a relatively high proportion of recent interactions Probabilistic View of an Edge’s Role  All edges have chance to be considered (continuity)  Outdated edges are less likely to be used (recency)  Can systematically trade off recency and continuity  Can use existing static-graph algorithms Create N sample graphs Discretized Time + Exponential Decay Typically reduces Monte Carlo variability
  • 209.
    Maintaining Sample GraphsinTIDE 209 Naïve Approach: Whenever a new batch of data comes in  Generate N sampled graphs  Run graph algorithm on each sample Idea #1: Exploit overlaps at successive time points  Subsample old edges of 𝐺𝑡 𝑖 – Selection probability = 𝑝 independently for each edge  Then add new edges  Theorem: 𝐺𝑡+1 has correct marginal probability 𝐺𝑡 𝑖 𝐺𝑡+1 𝑖
  • 210.
    Maintaining Sample Graphs,Continued 210 Idea #2: Exploit overlap between sample graphs at each time point  With high probability, more than 50% of edges overlap  So maintain aggregate graph 𝐺𝑡 1 𝐺𝑡 2 𝐺𝑡 3 𝐺𝑡 1,2 1,3 Memory requirements (batch size = 𝑴)  Snapshot model: continuously increasing memory requirement  PED model: bounded memory requirement – # Edges stored by storing graphs separately: 𝑂(𝑀𝑁) – # Edges stored by aggregate graph: 𝑂(𝑀 log 𝑁)
  • 211.
    Bulk Graph ExecutionModel 211 Iterative Graph processing (Pregel, GraphLab, Trinity, GRACE, …) • User-defined compute () function on each vertex v changes v + adjacent edges • Changes propagated to other vertices via message passing or scheduled updates Key idea in TIDE: Bulk execution: Compute results for multiple sample graphs simultaneously  Partition N sample graphs into bulk sets with s sample graphs each  Execute algorithm on aggregate graph of each bulk set (partial aggregate graph) Benefits  Same interface: users still think the computation is applied on one graph  Amortize overheads of extracting & loading from aggregate graph  Better memory locality (vertex operations)  Similar message values & similar state values  opportunities for compression (>2x speedup w. LZF)
  • 212.
    Overview • Real-time StreamingGraph Systems • Kineograph (distributed, Microsoft, 2012) • TIDE (distributed, IBM, 2015) • Historical Graph Systems • Chronos (distributed, Microsoft, 2014) • DeltaGraph (distributed, University of Maryland, 2013) • LLAMM (single-node, Harvard University & Oracle, 2015) 212
  • 213.
    Chronos • Publication: Hanet al Eurosys’14 • Target query: graph computation on the sequence of static snapshots of a temporal graph within a time range • E.g analyzing the change of each vertex’s PageRank for a given time range • Naïve approach: applying graph computation on each snapshot separately • Chronos: exploit the time locality of temporal graphs 213
  • 214.
    Structure Locality vsTimeLocality • Structure locality • States of neighboring vertices in the same snapshot are laid out close to each • Time locality (preferred in Chronos) • States of a vertex (or an edge) in consecutive snapshots are stored together 214* Figures are copied from Han et al EuroSys’14
  • 215.
    Chronos Design • In-memorygraph layout • Data of a vertex/edge in consecutive snapshots are placed together • Locality-aware batch scheduling (LABS) • Batch processing of a vertex across all the snapshorts • Batch information propagation to a neighbor vertex across snapshots • Incremental Computation • Use the results on 1st snapshot to batch compute on the remaining snapshots • Use the results on the insersection graph to batch compute on all snapshots • On-disk graph layout • Organized in snapshot groups • Stored as the first snapshot followed by the updates in the remaining snapshots in this group. 215
  • 216.
    DeltaGraph • Publication: Khuranaet al ICDE’13, EDBT’16 • Target query: access past states of the graphs and perform static graph analysis • E.g study the evolution of centrality measures, density, conductance, etc • Two major components: • Temporal Graph Index (TGI) • Temporal Graph Analytics Framework (TAF) 216
  • 217.
    DeltaGraph • Publication: Khuranaet al ICDE’13, EDBT’16 • Target query: access past states of the graphs and perform static graph analysis • E.g study the evolution of centrality measures, density, conductance, etc • Two major components: • Temporal Graph Index (TGI) • Temporal Graph Analytics Framework (TAF) 217
  • 218.
    Temporal Graph Index 218 •Partitioned delta and partitioned eventlist for scalability • Version chain for nodes • Sorted list of references to a node • Graph primitives • Snapshot retrieval • Node’s history • K-hop neighborhood • Neighborhood evolution
  • 219.
    Temporal Graph AnalyticsFramework • Node-centric graph extraction and analytical logic • Primary operand: Set of Nodes (SoN) refers to a collection of temporal nodes • Operations • Extract:Timeslice, Select, Filter, etc. • Compute: NodeCompute, NodeComputeTemporal, etc. • Analyze: Compare, Evolution, other aggregates 219
  • 220.
    LLAMA • Publication: Mackoet al ICDE’15 • Target query: perform various whole graph analysis on consistent views • A single machine system that stores and incrementally updates an evolving graph in multi- version representations • LLAMA provides a general purpose programming model instead of vertex- or edge- centric models 220
  • 221.
    Multi-Version CSR Representation •Augment the compact read-only CSR (compressed sparse row) representation to support mutability and persistence. • Large multi-versioned array (LAMA) with a software copy-on-write technique for snapshotting 221* Figure is copied from Macko et al ICDE’15
  • 222.
    Tutorial Outline Message PassingSystems Shared Memory Abstraction Single-Machine Systems Matrix-Based Systems Temporal Graph Systems DBMS-Based Systems Subgraph-Based Systems 222
  • 223.
    DBMS-Style Graph Systems Data-parallelQuery Execution Engine Query Optimizer Datalog SQL Pregel/GAS/... Graph Algorithms Storage Engine SociaLite/Myria REX GraphX/Pregelix Naiad Pregel Vertexica
  • 224.
    Reason #1 Expressiveness »Transitive closure »Allpair shortest paths Vertex-centric API? public class AllPairShortestPaths extendsVertex<VLongWritable, DoubleWritable, FloatWritable, DoubleWritable> { private Map<VLongWritable, DoubleWritable> distances = new HashMap<>(); @Override public void compute(Iterator<DoubleWritable> msgIterator) { ....... } }
  • 225.
    Reason #2 Easy OPS– Unified logs, tooling, configuration…!
  • 226.
    Reason #3 Efficient ResourceUtilization and Robustness ~30 similar threads on Giraph-users mailing list during the year 2015! “I’m trying to run the sample connected components algorithm on a large data set on a cluster, but I get a ‘java.lang.OutOfMemoryError: Java heap space’ error.”
  • 227.
    Reason #4 One-size fits-all? Physicalflexibility and adaptivity »PageRank, SSSP, CC,TriangleCounting »Web graph, social network, RDF graph »8 cheap machine school cluster, 200 beefy machine at an enterprise data center
  • 228.
    What’s graph analytics? 304Million Monthly Active Users 500 Million Tweets Per Day! 200 Billion Tweets Per Year!
  • 229.
    TwitterMsg( tweetid: int64, user: string, sender_location:point, send_time: datetime, reply_to: int64, retweet_from: int64, referred_topics: array<string>, message_text: string ); Reason #5 Easy Data Science INSERT OVERWRITE TABLE MsgGraph SELECT T.tweetid, 1.0/10000000000.0, CASE WHENT.reply_to >=0 RETURN array(T.reply_to) ELSE RETURN array(T.forward_from) END CASE FROMTwitterMsg AST WHERET.reply_to>=0 ORT.retweet_from>=0 SELECT R.user, SUM(R.rank)AS influence FROM Result R,TwitterMsgTM WHERE R.vertexid=TM.tweetid GROUP BY R.user ORDER BY influence DESC LIMIT 50; Giraph PageRank Job HDFS HDFS HDFS MsgGraph( vertexid: int64, value: double edges: array<int64> ); Result( vertexid: int64, rank: double );
  • 230.
    Reason #6 Software Simplicity Networkmanagement Pregel GraphLab Giraph ...... Message delivery Memory management Task scheduling Vertex/Message internal format
  • 231.
    #1 Expressiveness Path(u, v,min(d)) :- Edge(u, v, d); :- Path(u, w, d1), Edge(w, v, d2), d=d1+d2 TC(u, u) :- Edge(u, _) TC(v, v) :- Edge(_, v) TC(u, v) :-TC(u, w), Edge(w, v), u!=v Recursive Query! »SociaLite (VLDB’13) »Myria (VLDB’15) »DeALS (ICDE’15) IDB EDB
  • 232.
    #2 Easy OPS ConvergedPlatforms! »GraphX, on Apache Spark (OSDI’15) »Gelly, on Apache Flink (FOSDEM’15)
  • 233.
    #3 Efficient Resource Utilizationand Robustness Leverage MPP query execution engine! »Pregelix (VLDB’14) 1.0 vid edges vid payload vid=vid 2 4 halt false false value 2.0 1.0 (3,1.0),(4,1.0) (1,1.0) 2 4 3.0 Msg Vertex 5 1 3.0 1.0 1 false 3.0 (3,1.0),(4,1.0) 3 false 3.0 (2,1.0),(3,1.0) 3 vid edges 1 halt false false value 3.0 3.0 (3,1.0),(4,1.0) (2,1.0),(3,1.0) msg NULL 1.0 5 1.0 NULL NULL NULL 2 false 2.0 (3,1.0),(4,1.0)3.0 4 false 1.0 (1,1.0)3.0 Relation Schema Vertex Msg GS (vid, halt, value, edges) (vid, payload) (halt, aggregate, superstep)
  • 234.
    #4 Efficient Resource Utilizationand Robustness In-memory Out-of-core In-memory Out-of-core Pregelix
  • 235.
    #4 Physical Flexibility Flexibleprocessing for the Pregel semantics »Storage, rowVs. column, in-placeVs. LSM, etc. • Vertexica (VLDB’14) • Vertica (IEEE BigData’15) • Pregelix (VLDB’14) »Query plan, join algorithms, group-by algorithms, etc. • Pregelix (VLDB’14) • GraphX (OSDI’15) • Myria (VLDB’15) »Execution model, synchronousVs. asynchronous • Myria (VLDB’15)
  • 236.
    #4 Physical Flexibility Vertica,column storeVs. row store (IEEE BigData’15)
  • 237.
    #4 Physical Flexibility IndexLeft Outer Join UDF Call (compute) M.vid=V.vid Vertexi(V) Msgi(M) (V.halt = false || M.paylod != NULL) UDF Call (compute) Vertexi(V)Msgi(M) … Vidi(I) … Vidi+1 (halt = false) Index Full Outer Join Merge (choose()) M.vid=I.vid M.vid=V.vid Pregelix, different query plans
  • 238.
  • 239.
    #4 Physical Flexibility Myria,synchronousVs. Asynchronous (VLDB’15) »Least Common Ancestor
  • 240.
    #4 Physical Flexibility Myria,synchronousVs. Asynchronous (VLDB’15) »ConnectedComponents
  • 241.
    #5 Easy DataScience Integrated Programming Abstractions »REX (VLDB’12) »AsterData (VLDB’14) SELECT R.user, SUM(R.rank)AS influence FROM PageRank( ( SELECTT.tweetid AS vertexid, 1.0/… AS value, … AS edges FROMTwitterMsgAST WHERET.reply_to>=0 ORT.retweet_from>=0 ), ……) AS R, TwitterMsg ASTM WHERE R.vertexid=TM.tweetid GROUP BY R.user ORDER BY influence DESC LIMIT 50;
  • 242.
    #6 Software Simplicity Engineeringcost is Expensive! System Lines of source code (excluding test code and comments) Giraph 32,197 GraphX 2,500 Pregelix 8,514
  • 243.
    Tutorial Outline Message PassingSystems Shared Memory Abstraction Single-Machine Systems Matrix-Based Systems Temporal Graph Systems DBMS-Based Systems Subgraph-Based Systems 243
  • 244.
    Graph analytics/network sciencetasks too varied » Centrality analysis; evolution models; community detection » Link prediction; belief propagation; recommendations » Motif counting; frequent subgraph mining; influence analysis » Outlier detection; graph algorithms like matching, max-flow » An active area of research in itself… Graph Analysis Tasks Counting network motifs Feed-fwd Loop Feed- back Loop Bi-parallel Motif High school friends Family members Office Colleagues Friends College friendsFriends in database lab in CS dept Friends in CS dept Work place friends Identify Social circles in a user’s ego network
  • 245.
    Vertex-centric framework » Workswell for some applications • Pagerank,Connected Components, … • Some machine learning algorithms can be mapped to it » However, the framework is very restrictive • Most analysis tasks or algorithms cannot be written easily • Simple tasks like counting neighborhood properties infeasible • Fundamentally: Not easy to decompose analysis tasks into vertex-level, independent local computations Alternatives? » Galois, Ligra, GreenMarl: Not sufficiently high-level » Some others (e.g., Socialite) restrictive for different reasons Limitations ofVertex-Centric Framework
  • 246.
    Example: Local ClusteringCoefficient 1 2 4 3 A measure of local density around a node: LCC(n) = # edges in 1-hop neighborhood/max # edges possible Compute() at Node n: Need to count the no. of edges between neighbors But does not have access to that information Option 1: Each node transmits its list of neighbors to its neighbors Huge memory consumption Option 2: Allow access to neighbors’ state Neighbors may not be local What about computations that require 2-hop information?
  • 247.
    Example: Frequent Subgraph Mining. Goal: find all (labeled) subgraphs that appear sufficiently frequently. No easy way to map this to the vertex-centric framework: - Need the ability to construct subgraphs of the graph incrementally - Can construct partial subgraphs and pass them around, but this causes very high memory consumption and duplication of state - Need the ability to count the number of occurrences of each subgraph - Analogous to “reduce()” but with subgraphs as keys - Some vertex-centric frameworks support such functionality for aggregation, but only in a centralized fashion. Similar challenges arise for problems like finding all cliques and motif counting.
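    The "reduce() with subgraphs as keys" step can be pictured with the Python sketch below. This is my own simplification: real systems use proper graph canonicalization and anti-monotonic support measures, whereas this key merely sorts labeled edges.

        from collections import defaultdict

        def count_patterns(embeddings, labels, min_support):
            # embeddings: iterable of edge sets, e.g. [{(1, 2), (2, 3)}, ...]
            # labels: {vertex: label}; returns patterns seen >= min_support times
            counts = defaultdict(int)
            for emb in embeddings:
                key = tuple(sorted(tuple(sorted((labels[u], labels[v])))
                                   for u, v in emb))   # crude "canonical" pattern key
                counts[key] += 1
            return {k: c for k, c in counts.items() if c >= min_support}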
  • 248.
    Major Systems. NScale: »Subgraph-centric API that generalizes the vertex-centric API »The user compute() function has access to “subgraphs” rather than “vertices” »Graph distributed across a cluster of machines, analogous to distributed vertex-centric frameworks. Arabesque: »Fundamentally different programming model aimed at frequent subgraph mining, motif counting, etc. »Key assumption: • The graph fits in the memory of a single machine in the cluster, • … but the intermediate results might not
  • 249.
    NScale: an end-to-end distributed graph programming framework. Users/application programs specify: » Neighborhoods or subgraphs of interest » A kernel computation to operate upon those subgraphs. Framework: » Extracts the relevant subgraphs from the underlying data and loads them in memory » Execution engine: executes the user computation on the materialized subgraphs » Communication: shared state/message passing. Implemented on Hadoop MapReduce as well as Apache Spark.
  • 250.
    NScale: LCC Computation Walkthrough. NScale programming model. [Figure: underlying graph data on HDFS, vertices 1-12.] Subgraph extraction query: Compute(LCC) on Extract( {Node.color=orange} [query-vertex predicate], {k=1} [neighborhood size], {Node.color=white} [neighborhood vertex predicate], {Edge.type=solid} [neighborhood edge predicate] )
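    Operationally, the extraction query above can be read as the following Python sketch (an illustration of the semantics, not NScale's API; all names and parameters here are my own): pick every vertex matching the query-vertex predicate, take its k-hop neighborhood, and keep only the neighbors and edges that satisfy the remaining predicates.

        def extract(adj, vattrs, eattrs, k, qv_pred, v_pred, e_pred):
            # adj: {v: set(neighbors)}; vattrs: {v: {...}}; eattrs: {(u, v): {...}}
            # (assumes eattrs holds both (u, v) and (v, u) for undirected edges)
            subgraphs = {}
            for q in adj:
                if not qv_pred(vattrs[q]):                     # query-vertex predicate
                    continue
                frontier, seen = {q}, {q}
                for _ in range(k):                             # neighborhood size k
                    frontier = {v for u in frontier for v in adj[u]
                                if v not in seen
                                and v_pred(vattrs[v])          # neighborhood vertex predicate
                                and e_pred(eattrs[(u, v)])}    # neighborhood edge predicate
                    seen |= frontier
                subgraphs[q] = seen                            # vertex set of q's subgraph
            return subgraphs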
  • 251.
    NScale: LCC Computation Walkthrough. Specifying computation: BluePrints API. [Figure: underlying graph data on HDFS, vertices 1-12.] The program cannot be executed as is in vertex-centric programming frameworks.
  • 252.
    NScale: LCC Computation Walkthrough. GEP: graph extraction and packing. [Figure: underlying graph data on HDFS feeds a MapReduce job for subgraph extraction; a cost-based optimizer performs set bin packing to produce a node-to-bin mapping; MR2 map tasks route subgraphs to reducers 1..N, each running an execution-engine instance.]
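    The set-bin-packing step above can be approximated by the first-fit-decreasing sketch below. This is my own simplification: the real cost-based optimizer also reasons about overlap between the extracted subgraphs, which plain first-fit-decreasing ignores.

        def pack_subgraphs(sizes, capacity):
            # sizes: {subgraph_id: estimated memory}; returns bins as lists of ids
            bins, loads = [], []
            for sg, size in sorted(sizes.items(), key=lambda kv: -kv[1]):
                for i, load in enumerate(loads):
                    if load + size <= capacity:          # first bin that still fits
                        bins[i].append(sg)
                        loads[i] += size
                        break
                else:
                    bins.append([sg])                    # open a new bin
                    loads.append(size)
            return bins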
  • 253.
    NScale: LCC Computation Walkthrough. GEP: graph extraction and packing. [Figure: graph extraction and loading via MapReduce (Apache YARN); subgraph extraction produces extracted subgraphs SG-1 to SG-4 from the underlying graph data on HDFS.]
  • 254.
    NScale: LCC Computation Walkthrough. GEP: graph extraction and packing. [Figure: graph extraction and loading via MapReduce (Apache YARN), subgraph extraction, and a cost-based optimizer for data representation & placement; the packed subgraphs are instantiated in distributed memory.]
  • 255.
    NScale: LCC Computation Walkthrough. [Figure: the full pipeline: GEP (graph extraction and packing) followed by the distributed execution engine; node masters drive the distributed execution of the user computation over subgraphs held in distributed memory.]
  • 256.
    Experimental Evaluation (CE in node-secs; cluster memory in GB; DNC = did not complete; OOM = out of memory)
    Personalized PageRank on 2-Hop Neighborhood
    Dataset      #Source Vertices | NScale CE / Mem | Giraph CE / Mem | GraphLab CE / Mem | GraphX CE / Mem
    EU Email     3200             | 52 / 3.35       | 782 / 17.10     | 710 / 28.87       | 9975 / 85.50
    NotreDame    3500             | 119 / 9.56      | 1058 / 31.76    | 870 / 70.54       | 50595 / 95.00
    GoogleWeb    4150             | 464 / 21.52     | 10482 / 64.16   | 1080 / 108.28     | DNC / -
    WikiTalk     12000            | 3343 / 79.43    | DNC / OOM       | DNC / OOM         | DNC / -
    LiveJournal  20000            | 4286 / 84.94    | DNC / OOM       | DNC / OOM         | DNC / -
    Orkut        20000            | 4691 / 93.07    | DNC / OOM       | DNC / OOM         | DNC / -
    Local Clustering Coefficient
    Dataset      | NScale CE / Mem | Giraph CE / Mem | GraphLab CE / Mem | GraphX CE / Mem
    EU Email     | 377 / 9.00      | 1150 / 26.17    | 365 / 20.10       | 225 / 4.95
    NotreDame    | 620 / 19.07     | 1564 / 30.14    | 550 / 21.40       | 340 / 9.75
    GoogleWeb    | 658 / 25.82     | 2024 / 35.35    | 600 / 33.50       | 1485 / 21.92
    WikiTalk     | 726 / 24.16     | DNC / OOM       | 1125 / 37.22      | 1860 / 32.00
    LiveJournal  | 1800 / 50.00    | DNC / OOM       | 5500 / 128.62     | 4515 / 84.00
    Orkut        | 2000 / 62.00    | DNC / OOM       | DNC / OOM         | 20175 / 125.00
  • 257.
    NScaleSpark: NScale on Spark. [Figure: the GEP phase is built as a series of RDD transformations t1..tn over the input graph data (RDD 1 .. RDD n) for subgraph extraction and bin packing; the resulting Spark RDD contains graph objects G1..Gn, each grouping subgraphs SG1..SGn via the bin-packing algorithm; a map transformation transparently instantiates a distributed execution-engine instance (node master) per graph object to execute the user computation.]
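    A hedged PySpark sketch of the pipeline in this figure (NScaleSpark itself is implemented in Scala; the HDFS paths, the crude 1-hop "subgraphs", and the helper functions below are placeholders of my own):

        from pyspark import SparkContext

        def parse_edge(line):                      # "u v" -> (u, v)
            u, v = line.split()
            return (u, v)

        def bin_of(subgraph):                      # stand-in for the bin-packing result
            query_vertex, _ = subgraph
            return hash(query_vertex) % 8          # pretend there are 8 bins

        def run_engine(partition):                 # stand-in execution-engine instance
            for b, subgraphs in partition:         # one packed graph object per bin
                yield (b, sum(1 for _ in subgraphs))

        sc = SparkContext(appName="NScaleSparkSketch")
        edges = sc.textFile("hdfs:///graph/edges").map(parse_edge)
        ego = edges.groupByKey()                   # crude 1-hop neighborhoods (v, nbrs)
        packed = ego.map(lambda sg: (bin_of(sg), sg)).groupByKey()
        packed.mapPartitions(run_engine).saveAsTextFile("hdfs:///out/lcc")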
  • 258.
    Arabesque: “Think-like-an-embedding” paradigm. The user specifies what types of embeddings to construct, whether they grow edge-at-a-time or vertex-at-a-time, and provides functions to filter and process partial embeddings. Arabesque responsibilities: graph exploration, load balancing, aggregation (isomorphism), automorphism detection. User responsibilities: filter, process.
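    A naive Python sketch of this division of labor (Arabesque's real API is in Java, and its exploration, automorphism handling, and aggregation are far more sophisticated; everything below, including the vertex-ordering trick, is my own simplification). The "system" side grows embeddings vertex-at-a-time; the "user" side supplies filter and process, here instantiated for triangle (3-clique) listing.

        from itertools import combinations

        def expand(adj, emb):                        # "system" side: grow by one vertex
            cand = {v for u in emb for v in adj[u]} - set(emb)
            return [emb + (v,) for v in cand if v > max(emb)]   # crude duplicate pruning

        def explore(adj, size, fltr, process):       # "system" side: exploration steps
            frontier = [(v,) for v in adj]
            for _ in range(size - 1):
                frontier = [e for emb in frontier
                              for e in expand(adj, emb) if fltr(adj, e)]
            for emb in frontier:
                process(emb)

        # "User" side: the filter keeps only cliques, process collects them.
        triangles = []
        is_clique = lambda adj, e: all(v in adj[u] for u, v in combinations(e, 2))
        explore({1: {2, 3}, 2: {1, 3}, 3: {1, 2}}, 3, is_clique, triangles.append)
        # triangles == [(1, 2, 3)]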
  • 259.
  • 260.
  • 261.
    Arabesque: Evaluation. Comparable to centralized implementations for a single thread; drastically more scalable to large graphs and clusters.
  • 262.
    Conclusion & Future Direction 262 End-to-End Richer Big Graph Analytics »Keyword search (Elastic Search) »Graph query (Neo4J) »Graph analytics (Giraph) »Machine learning (Spark, TensorFlow) »SQL query (Hive, Impala, SparkSQL, etc.) »Stream processing (Flink, Spark Streaming, etc.) »JSON processing (AsterixDB, Drill, etc.) Converged programming abstractions and platforms?
  • 263.
    Conclusion & Future Direction »Frameworks for computation-intensive jobs »High-speed network for data-intensive jobs »New hardware support 263
  • 264.

Editor's Notes

  • #245 Add an example of how this would be done in Pregel and NScale.
  • #246 Add an example of how this would be done in Pregel and NScale.
  • #258 GEP: implemented as a series of RDD transformations starting from the raw input graph; subgraph extraction and bin packing implemented in Scala. Phase 2: instantiating the subgraphs in distributed memory; subgraph structural information is joined with partitioning info for grouping; the NScale graph library is ported to NSpark; a Spark RDD of graph objects is built, where each graph object contains subgraphs grouped together using the bin-packing algorithm. Each instantiation uses a master-worker architecture over the graph RDD.
  • #262 At the architecture level, Arabesque runs on top of Hadoop. During the execution of an exploration step, all workers execute the model we have just described with the input embeddings of size n that were passed to them from the previous step. This execution is done in parallel across all threads of a worker. At the end of the execution, the resulting embeddings of size n + 1 are shuffled between the workers so as to reduce the imbalance that might be caused by highly expandable embeddings (usually those containing vertices with high degrees).