Big Graph Analytics Systems
Da Yan
The Chinese University of Hong Kong
The University of Alabama at Birmingham
Yingyi Bu
Couchbase, Inc.
Yuanyuan Tian
IBM Almaden Research Center
Amol Deshpande
University of Maryland
James Cheng
The Chinese University of Hong Kong
Motivations
Big Graphs Are Everywhere
2
Big Graph Systems
General-Purpose Graph Analytics
Programming Language
»Java, C/C++, Scala, Python …
»Domain-Specific Language (DSL)
3
Big Graph Systems
Programming Model
»Think Like a Vertex
• Message passing
• Shared Memory Abstraction
»Matrix Algebra
»Think Like a Graph
»Datalog
4
Big Graph Systems
Other Features
»Execution Mode: Sync or Async ?
»Environment: Single-Machine or Distributed ?
»Support for Topology Mutation
»Out-of-Core Support
»Support for Temporal Dynamics
»Data-Intensive or Computation-Intensive ?
5
Tutorial Outline
Message Passing Systems
Shared Memory Abstraction
Single-Machine Systems
Matrix-Based Systems
Temporal Graph Systems
DBMS-Based Systems
Subgraph-Based Systems
6
Vertex-Centric
Hardware-Related
Computation-Intensive
Tutorial Outline
Message Passing Systems
Shared Memory Abstraction
Single-Machine Systems
Matrix-Based Systems
Temporal Graph Systems
DBMS-Based Systems
Subgraph-Based Systems
7
Message Passing Systems
8
Google’s Pregel [SIGMOD’10]
»Think like a vertex
»Message passing
»Iterative
• Superstep
Message Passing Systems
9
Google’s Pregel [SIGMOD’10]
»Vertex Partitioning
[Figure: a 9-vertex example graph (vertices 0-8) hash-partitioned over three machines M0, M1, M2; each machine holds its vertices together with their adjacency lists.]
Message Passing Systems
10
Google’s Pregel [SIGMOD’10]
»Programming Interface
• u.compute(msgs)
• u.send_msg(v, msg)
• get_superstep_number()
• u.vote_to_halt()
Called inside u.compute(msgs)
Message Passing Systems
11
Google’s Pregel [SIGMOD’10]
»Vertex States
• Active / inactive
• Reactivated by messages
»Stop Condition
• All vertices halted, and
• No pending messages
Message Passing Systems
12
Google’s Pregel [SIGMOD’10]
»Hash-Min: Connected Components
[Figure: superstep 1 of Hash-Min on a 9-vertex example graph; every vertex sends its own ID to its neighbors.]
Message Passing Systems
13
Google’s Pregel [SIGMOD’10]
»Hash-Min: Connected Components
[Figure: superstep 2; each vertex updates its value to the minimum ID received and forwards the new minimum.]
Message Passing Systems
14
Google’s Pregel [SIGMOD’10]
»Hash-Min: Connected Components
[Figure: superstep 3; all vertices of the component have converged to the minimum ID 0.]
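To make the API above concrete, here is a minimal Hash-Min sketch in Java against a hypothetical Pregel-style Vertex base class (method names mirror the interface listed earlier, not Giraph's or Pregel's exact signatures):

public class HashMinVertex extends Vertex<Long, Long, Long> {   // <id, value, message>; hypothetical base class
    @Override
    public void compute(Iterable<Long> msgs) {
        if (getSuperstepNumber() == 1) {          // first superstep
            setValue(getId());                    // start with own vertex ID
            sendMsgToAllNeighbors(getValue());    // broadcast it
        } else {
            long min = getValue();
            for (long m : msgs) min = Math.min(min, m);
            if (min < getValue()) {               // smaller ID seen: update and re-broadcast
                setValue(min);
                sendMsgToAllNeighbors(min);
            }
        }
        voteToHalt();                             // halted vertices are reactivated by incoming messages
    }
}

Once no vertex value changes, no messages are sent, all vertices stay halted, and the job terminates by the stop condition above.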
Message Passing Systems
15
Practical Pregel Algorithm (PPA) [PVLDB’14]
»First cost model for Pregel algorithm design
»PPAs for fundamental graph problems
• Breadth-first search
• List ranking
• Spanning tree
• Euler tour
• Pre/post-order traversal
• Connected components
• Bi-connected components
• Strongly connected components
• ...
Message Passing Systems
16
Practical Pregel Algorithm (PPA) [PVLDB’14]
»Linear cost per superstep
• O(|V| + |E|) message number
• O(|V| + |E|) computation time
• O(|V| + |E|) memory space
»Logarithmic number of supersteps
• O(log |V|) supersteps
O(log|V|) = O(log|E|)
How about load balancing?
Message Passing Systems
17
Balanced PPA (BPPA) [PVLDB’14]
»din(v): in-degree of v
»dout(v): out-degree of v
»Linear cost per superstep
• O(din(v) + dout(v)) message number
• O(din(v) + dout(v)) computation time
• O(din(v) + dout(v)) memory space
»Logarithmic number of supersteps
Message Passing Systems
18
BPPA Example: List Ranking [PVLDB’14]
»A basic operation of Euler tour technique
»Linked list where each element v has
• Value val(v)
• Predecessor pred(v)
»Element at the head has pred(v) = NULL
[Figure: a 5-element linked list v1, v2, v3, v4, v5 with val(v) = 1 for every element; the head's pred is NULL.]
Toy Example: val(v) = 1 for all v
Message Passing Systems
19
BPPA Example: List Ranking [PVLDB’14]
»Compute sum(v) for each element v
• Summing val(v) and values of all predecessors
»Why can't TeraSort solve this?
[Figure: the desired output, sum(v) = 1, 2, 3, 4, 5 for v1 ... v5.]
Message Passing Systems
20
BPPA Example: List Ranking [PVLDB’14]
»Pointer jumping / path doubling
• sum(v) ← sum(v) + sum(pred(v))
• pred(v) ← pred(pred(v))
[Figure: initial state, sum(v) = 1 for all v, with the original pred pointers.]
As long as pred(v) ≠ NULL
Message Passing Systems
21
BPPA Example: List Ranking [PVLDB’14]
»Pointer jumping / path doubling
• sum(v) ← sum(v) + sum(pred(v))
• pred(v) ← pred(pred(v))
[Figure: after one round of pointer jumping, the sum values become 1, 2, 2, 2, 2 and each pred pointer skips one element.]
Message Passing Systems
22
BPPA Example: List Ranking [PVLDB’14]
»Pointer jumping / path doubling
• sum(v) ← sum(v) + sum(pred(v))
• pred(v) ← pred(pred(v))
[Figure: after two rounds, the sum values become 1, 2, 3, 4, 4 and the pred pointers skip ahead by up to four elements.]
Message Passing Systems
23
BPPA Example: List Ranking [PVLDB’14]
»Pointer jumping / path doubling
• sum(v) ← sum(v) + sum(pred(v))
• pred(v) ← pred(pred(v))
[Figure: after three rounds, the sum values are 1, 2, 3, 4, 5 and every pred is NULL; the list ranking is complete.]
O(log |V|) supersteps
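A minimal shared-memory sketch of one pointer-jumping round (Java, with plain arrays as illustrative state); in BPPA each such round is realized with request-respond style messaging across one or two supersteps:

static final int NIL = -1;                          // stands for NULL pred
static void pointerJumpingRound(long[] sum, int[] pred) {
    long[] newSum = sum.clone();
    int[] newPred = pred.clone();
    for (int v = 0; v < pred.length; v++) {
        if (pred[v] != NIL) {
            newSum[v] = sum[v] + sum[pred[v]];      // sum(v) <- sum(v) + sum(pred(v))
            newPred[v] = pred[pred[v]];             // pred(v) <- pred(pred(v))
        }
    }
    System.arraycopy(newSum, 0, sum, 0, sum.length);
    System.arraycopy(newPred, 0, pred, 0, pred.length);
}
// Repeating the round until every pred is NIL takes O(log |V|) iterations.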
Message Passing Systems
24
Optimizations in
Communication Mechanism
Message Passing Systems
25
Apache Giraph
»Superstep splitting: reduce memory consumption
»Only effective when compute(.) is distributive
[Animation over several frames: vertex v receives a message of value 1 from each of u1-u6. The superstep is split into two sub-supersteps, each delivering and aggregating only half of the messages (a partial sum of 3 at v), so only half of the messages need to be buffered in memory at a time; the partial results combine to the final value 6 at v.]
Message Passing Systems
34
Pregel+ [WWW’15]
»Vertex Mirroring
»Request-Respond Paradigm
Message Passing Systems
35
Pregel+ [WWW’15]
»Vertex Mirroring
[Figure: vertices u1 ... ui on machine M1, v1 ... vj on M2, and w1 ... wk on M3, with many edges crossing machine boundaries.]
Message Passing Systems
36
Pregel+ [WWW’15]
»Vertex Mirroring
[Figure: the same partitioning with mirrors of u_i created on M2 and M3; u_i ships its value once to each mirror, and each mirror forwards it to the local neighbors.]
Message Passing Systems
37
Pregel+ [WWW’15]
»Vertex Mirroring: Create mirror for u4?
[Figure: machine M1 holds u1-u4 and M2 holds v1-v4; u1, u2, u3 each link to v1 and v2, while u4 links to v1, v2, v3, v4. Should a mirror be created for u4 on M2?]
Message Passing Systems
38
Pregel+ [WWW’15]
»Vertex Mirroring vs. Message Combining
[Figure: with message combining, the messages from u1-u4 destined for the same target machine are combined into a single value a(u1) + a(u2) + a(u3) + a(u4) before being sent to M2.]
Message Passing Systems
39
Pregel+ [WWW’15]
»Vertex Mirroring vs. Message Combining
[Figure: with a mirror for u4 on M2, the messages of u1-u3 are still combined into a(u1) + a(u2) + a(u3), while u4 ships a(u4) to its mirror only once; the mirror then delivers it locally to v1-v4.]
Message Passing Systems
40
Pregel+ [WWW’15]
»Vertex Mirroring: Only mirror high-degree vertices
»Choice of degree threshold τ
• M machines, n vertices, m edges
• Average degree: deg_avg = m / n
• Optimal τ is M · exp{deg_avg / M}
Message Passing Systems
41
Pregel+ [WWW’15]
» Request-Respond Paradigm
[Figure: vertices v1-v4 on machine M2 each need attribute a(u) of vertex u on M1; with plain message passing, each of them sends its own request (<v1> ... <v4>) to u.]
Message Passing Systems
42
Pregel+ [WWW’15]
» Request-Respond Paradigm
[Figure: u then answers each request separately, so a(u) is shipped to M2 four times.]
Message Passing Systems
43
Pregel+ [WWW’15]
»A vertex v can request attribute a(u) in superstep i
» a(u) will be available in superstep (i + 1)
Message Passing Systems
44
[Figure: with the request-respond paradigm, machine M2 issues a single "request u" message, and M1 responds once with u | D[u], which is then shared by all requesters on M2.]
Pregel+ [WWW’15]
»A vertex v can request attribute a(u) in superstep i
» a(u) will be available in superstep (i + 1)
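An illustrative compute() sketch using the request-respond pattern (Java; the request/response method names here are hypothetical, not Pregel+'s actual C++ API):

public void compute(Iterable<Long> msgs) {
    if (getSuperstepNumber() == 1) {
        for (long u : getNeighbors()) request(u);          // ask for a(u) in superstep i
    } else if (getSuperstepNumber() == 2) {
        for (long u : getNeighbors()) {
            double aU = getRespondedValue(u);              // a(u) is available in superstep i + 1
            // ... use a(u); the system sends only one request and one response per (machine, u) pair
        }
        voteToHalt();
    }
}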
Message Passing Systems
45
Load Balancing
Message Passing Systems
46
Vertex Migration
»WindCatch [ICDE’13]
• Runtime improved by 31.5% for PageRank (best)
• 2% for shortest path computation
• 9% for maximal matching
»Stanford’s GPS [SSDBM’13]
»Mizan [EuroSys’13]
• Hash-based and METIS partitioning: no improvement
• Range-based partitioning: around 40% improvement
Message Passing Systems
Dynamic Concurrency Control
»PAGE [TKDE’15]
• Better partitioning → slower ?
47
Message Passing Systems
Dynamic Concurrency Control
»PAGE [TKDE’15]
• Message generation
• Local message processing
• Remote message processing
48
Message Passing Systems
Dynamic Concurrency Control
»PAGE [TKDE’15]
• Monitors speeds of the 3 operations
• Dynamically adjusts number of threads for the 3 operations
• Criteria
- Speed of message processing = speed of incoming messages
- Thread numbers for local & remote message processing are
proportional to speed of local & remote message processing
49
Message Passing Systems
50
Out-of-Core Support
java.lang.OutOfMemoryError:
Java heap space
26 cases reported by Giraph-users
mailing list during 08/2013~08/2014!
Message Passing Systems
51
Pregelix [PVLDB’15]
»Transparent out-of-core support
»Physical flexibility (Environment)
»Software simplicity (Implementation)
Hyracks
Dataflow Engine
Message Passing Systems
52
Pregelix [PVLDB’15]
Message Passing Systems
53
Pregelix [PVLDB’15]
Message Passing Systems
54
GraphD
»Commodity hardware affordable to small startups and ordinary researchers
• Desktop PCs
• Gigabit Ethernet switch
»Features of a common cluster
• Limited memory space
• Disk streaming bandwidth >> network bandwidth
» Each worker stores and streams edges and messages on local
disks
» Cost of buffering msgs on disks is hidden by msg transmission
Message Passing Systems
55
Fault Tolerance
Message Passing Systems
56
Coordinated Checkpointing of Pregel
»Every δ supersteps
»Recovery from machine failure:
• Standby machine
• Repartitioning among survivors
An illustration with δ = 5
Message Passing Systems
57
Coordinated Checkpointing of Pregel
[Figure: workers W1-W3 write a periodic checkpoint (vertex states, edge changes, shuffled messages) to HDFS and then proceed through supersteps 4-7; a failure occurs at W1 around superstep 6.]
Message Passing Systems
58
Coordinated Checkpointing of Pregel
[Figure: for recovery, all workers load the latest checkpoint from HDFS and re-execute the supersteps since that checkpoint.]
Message Passing Systems
59
Chandy-Lamport Snapshot [TOCS’85]
»Uncoordinated checkpointing (e.g., for async exec)
»For message-passing systems
»FIFO channels
[Animation: processes u and v, connected by FIFO channels, both start with value 5. u records its local snapshot u : 5; messages then change the states to 4, and v later records v : 4. The combined snapshot (u : 5, v : 4) is inconsistent because the message exchanged between the two recording points is not captured.]
Message Passing Systems
63
Chandy-Lamport Snapshot [TOCS’85]
»Solution: broadcast a checkpoint request right after checkpointing one's own state
[Figure: u records u : 5 and immediately broadcasts a checkpoint request (REQ); on receiving REQ, v records v : 5 before processing any later messages, so the snapshot (u : 5, v : 5) is consistent.]
Message Passing Systems
64
Recovery by Message-Logging [PVLDB’14]
»Each worker logs its msgs to local disks
• Negligible overhead, cost hidden
»Survivor
• No re-computation during recovery
• Forward logged msgs to replacing workers
»Replacing worker
• Re-compute from latest checkpoint
• Only send msgs to replacing workers
Message Passing Systems
65
Recovery by Message-Logging [PVLDB’14]
[Figure: in every superstep, each worker logs its outgoing messages to local disk; W1 fails at superstep 6.]
Message Passing Systems
66
Recovery by Message-Logging [PVLDB’14]
[Figure: a standby machine replaces W1 and re-computes from the latest checkpoint, while the surviving workers W2 and W3 simply forward their logged messages instead of recomputing.]
Message Passing Systems
67
Block-Centric Computation Model
Message Passing Systems
68
Block-Centric Computation
»Main Idea
• A block refers to a connected subgraph
• Messages are exchanged among blocks
• Serial in-memory algorithm within a block
Message Passing Systems
69
Block-Centric Computation
»Motivation: graph characteristics adverse to Pregel
• Large graph diameter
• High average vertex degree
Message Passing Systems
70
Block-Centric Computation
»Benefits
• Less communication workload
• Fewer supersteps
• Fewer computing units
Message Passing Systems
71
Giraph++ [PVLDB’13]
» Pioneering: think like a graph
» METIS-style vertex partitioning
» Partition.compute(.)
» Boundary vertex values sync-ed at superstep barrier
» Internal vertex values can be updated anytime
Message Passing Systems
72
Blogel [PVLDB’14]
» API: vertex.compute(.) + block.compute(.)
»A block can have its own fields
»A block/vertex can send msgs to another block/vertex
»Example: Hash-Min
• Construct a block-level graph: compute an adjacency list for each block
• Propagate min block ID among blocks
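A minimal block-level Hash-Min sketch against a hypothetical Block class that mirrors the vertex API (Java rendering; Blogel's actual API is C++):

public void compute(Iterable<Long> msgs) {          // block.compute(.)
    long min = (getSuperstepNumber() == 1) ? getBlockId() : getValue();
    for (long m : msgs) min = Math.min(min, m);
    if (getSuperstepNumber() == 1 || min < getValue()) {
        setValue(min);                               // smallest block ID seen so far
        sendMsgToAllNeighborBlocks(min);             // uses the block-level adjacency list
    }
    voteToHalt();                                    // reactivated by incoming messages
}

Because the block-level graph is much smaller than the original graph, far fewer supersteps and messages are needed, which is what the Friendster numbers below illustrate.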
Message Passing Systems
73
Blogel [PVLDB’14]
»Performance on the Friendster social network (65.6 M vertices, 3.6 B edges), Blogel vs. Pregel+:
• Computing time: 2.52 vs. 120.24
• Total messages: 19 million vs. 7,227 million
• Supersteps: 5 vs. 30
Message Passing Systems
74
Blogel [PVLDB’14]
»Web graph: URL-based partitioning
»Spatial networks: 2D partitioning
»General graphs: graph Voronoi diagram partitioning
Blogel [PVLDB’14]
» Graph Voronoi Diagram (GVD) partitioning
75
[Figure: three seed vertices; a vertex v is 2 hops from the red seed, 3 hops from the green seed, and 5 hops from the blue seed, so v is assigned to the red seed's block.]
Message Passing Systems
Blogel [PVLDB’14]
»Sample seed vertices with probability p
76
Message Passing Systems
Blogel [PVLDB’14]
»Sample seed vertices with probability p
77
Message Passing Systems
Blogel [PVLDB’14]
»Sample seed vertices with probability p
»Compute GVD grouping
• Vertex-centric multi-source BFS
78
Message Passing Systems
Blogel [PVLDB’14]
79
State after Seed Sampling
Message Passing Systems
Blogel [PVLDB’14]
80
Superstep 1
Message Passing Systems
Blogel [PVLDB’14]
81
Superstep 2
Message Passing Systems
Blogel [PVLDB’14]
82
Superstep 3
Message Passing Systems
Blogel [PVLDB’14]
»Sample seed vertices with probability p
»Compute GVD grouping
»Postprocessing
83
Message Passing Systems
Blogel [PVLDB’14]
»Sample seed vertices with probability p
»Compute GVD grouping
»Postprocessing
• For very large blocks, resample with a larger p and repeat
84
Message Passing Systems
Blogel [PVLDB’14]
»Sample seed vertices with probability p
»Compute GVD grouping
»Postprocessing
• For very large blocks, resample with a larger p and repeat
• For tiny components, find them using Hash-Min at last
85
Message Passing Systems
GVD Partitioning Performance
86
[Figure: GVD partitioning time (loading + partitioning + dumping): WebUK 2026.65, Friendster 505.85, BTC 186.89, LiveJournal 105.48, USA Road 75.88, Euro Road 70.68.]
Message Passing Systems
87
Asynchronous Computation Model
Maiter [TPDS’14]
» For algos where vertex values converge asymmetrically
» Delta-based accumulative iterative computation
(DAIC)
88
Message Passing Systems
Maiter [TPDS’14]
» For algos where vertex values converge asymmetrically
» Delta-based accumulative iterative computation
(DAIC)
» Strict transformation from Pregel API to DAIC
formulation
»Delta may serve as priority score
»Natural for block-centric frameworks
89
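A minimal delta-based PageRank sketch in the DAIC spirit (plain Java; the constants are illustrative, and asynchronous scheduling, termination checks, and distribution are omitted; this is not Maiter's actual API):

class DeltaPageRank {
    final double[] value, delta;
    final int[][] outNbrs;
    DeltaPageRank(int n, int[][] outNbrs) {
        this.value = new double[n];
        this.delta = new double[n];
        this.outNbrs = outNbrs;
        java.util.Arrays.fill(delta, 0.15);           // illustrative initial delta (teleport term)
    }
    void update(int v) {                              // may be scheduled by delta magnitude (priority)
        double d = delta[v];
        delta[v] = 0.0;
        value[v] += d;                                // accumulate the delta into the value
        if (outNbrs[v].length == 0) return;
        double send = 0.85 * d / outNbrs[v].length;
        for (int w : outNbrs[v]) delta[w] += send;    // propagate only the change
    }
}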
Message Passing Systems
90
Vertex-Centric Query Processing
Quegel [PVLDB’16]
» On-demand answering of light-workload graph queries
• Only a portion of the whole graph gets accessed
» Option 1: to process queries one job after another
• Network underutilization, too many barriers
• High startup overhead (e.g., graph loading)
91
Message Passing Systems
Quegel [PVLDB’16]
» On-demand answering of light-workload graph queries
• Only a portion of the whole graph gets accessed
» Option 2: to process a batch of queries in one job
• Programming complexity
• Straggler problem
92
Message Passing Systems
Quegel [PVLDB’16]
»Execution model: superstep-sharing
• Each iteration is called a super-round
• In a super-round, every query proceeds by one superstep
93
Message Passing Systems
[Figure: queries q1-q4 arrive at different times; in each super-round, every active query executes one superstep, so the supersteps 1-4 of different queries are interleaved along the time axis over super-rounds 1-7.]
Quegel [PVLDB’16]
»Benefits
• Messages of multiple queries transmitted in one batch
• One synchronization barrier for each super-round
• Better load balancing
94
Message Passing Systems
[Figure: with individual per-query synchronization, workers 1 and 2 hit a barrier for every query; with superstep-sharing, one barrier per super-round suffices and worker idle time shrinks.]
Quegel [PVLDB’16]
»API is similar to Pregel
»The system does more:
• Q-data: superstep number, control information, …
• V-data: adjacency list, vertex/edge labels
• VQ-data: vertex state in the evaluation of each query
95
Message Passing Systems
Quegel [PVLDB’16]
»Create a VQ-data of v for q only when q touches v
»Garbage collection of Q-data andVQ-data
»Distributed indexing
96
Message Passing Systems
Tutorial Outline
Message Passing Systems
Shared Memory Abstraction
Single-Machine Systems
Matrix-Based Systems
Temporal Graph Systems
DBMS-Based Systems
Subgraph-Based Systems
97
Shared-Mem Abstraction
98
GraphLab: Single Machine
(UAI 2010)
Distributed GraphLab
(PVLDB 2012)
PowerGraph
(OSDI 2012)
Shared-Mem Abstraction
Distributed GraphLab [PVLDB’12]
»Scope of vertex v
99
[Figure: a chain u - v - w; the scope of v covers Dv, the adjacent edge data D(u,v) and D(v,w), and the neighboring vertex data Du and Dw, i.e., all that v can access.]
Shared-Mem Abstraction
Distributed GraphLab [PVLDB’12]
» Async exec mode: for asymmetric convergence
• Scheduler, serializability
» API: v.update()
• Access & update data in v’s scope
• Add neighbors to scheduler
100
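A minimal PageRank-style v.update() sketch in the spirit of the scope model (plain Java with arrays and an explicit scheduler queue; GraphLab's real API is C++ and additionally enforces the chosen consistency/serializability level):

// uses java.util.Deque
static void update(int v, double[] rank, int[][] inNbrs, int[][] outNbrs,
                   int[] outDeg, java.util.Deque<Integer> scheduler, double eps) {
    double sum = 0.0;
    for (int u : inNbrs[v]) sum += rank[u] / outDeg[u];   // read neighbor data within v's scope
    double newRank = 0.15 + 0.85 * sum;
    double change = Math.abs(newRank - rank[v]);
    rank[v] = newRank;                                    // update the data owned by v
    if (change > eps)
        for (int w : outNbrs[v]) scheduler.add(w);        // add neighbors to the scheduler
}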
Shared-Mem Abstraction
Distributed GraphLab [PVLDB’12]
» Vertices partitioned among machines
» For edge (u, v), scopes of u and v overlap
• Du, Dv and D(u, v)
• Replicated if u and v are on different machines
» Ghosts: overlapped boundary data
• Value-sync by a versioning system
» Memory space problem
• Ghost data may be replicated up to {# of machines} times
101
Shared-Mem Abstraction
PowerGraph [OSDI’12]
» API: Gather-Apply-Scatter (GAS)
• PageRank: out-degree = 2 for all in-neighbors
102
[Animation: GAS evaluation of PageRank at a vertex v whose in-neighbors each have value 1 and out-degree 2. Gather collects 1/2 from each in-neighbor, apply updates v's value to the sum 1.5, and since the change Δ = 0.5 > ϵ, scatter re-activates v's out-neighbors.]
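A minimal GAS decomposition of PageRank matching the walkthrough above (plain Java; PowerGraph's real API is C++ and runs gather in parallel over the mirrors of high-degree vertices):

static double gather(double nbrRank, int nbrOutDeg) { return nbrRank / nbrOutDeg; }  // per in-edge
static double sum(double a, double b)               { return a + b; }                 // commutative, associative combine
static double apply(double acc)                     { return acc; }                   // the toy example omits damping; real PageRank would use 0.15 + 0.85 * acc
static boolean scatter(double oldRank, double newRank, double eps) {
    return Math.abs(newRank - oldRank) > eps;       // true -> re-activate the out-neighbor
}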
Shared-Mem Abstraction
PowerGraph [OSDI’12]
»Edge Partitioning
»Goals:
• Load balancing
• Minimize vertex replicas
– Cost of value sync
– Cost of memory space
107
Shared-Mem Abstraction
PowerGraph [OSDI’12]
»Greedy Edge Placement
108
[Animation: greedy placement of edge (u, v) across workers W1-W6 with current workloads 100-105; the edge is assigned to a worker that minimizes new vertex replicas, breaking ties by the lowest workload (when neither endpoint has replicas yet, i.e., both replica sets are ∅, the least-loaded worker is chosen).]
Shared-Mem Abstraction
111
Single-Machine Out-of-Core Systems
Shared-Mem Abstraction
Shared-Mem + Single-Machine
»Out-of-core execution, disk/SSD-based
• GraphChi [OSDI’12]
• X-Stream [SOSP’13]
• VENUS [ICDE’14]
• …
»Vertices are numbered 1, …, n; cut into P intervals
112
[Figure: vertex IDs 1 ... n divided into interval(1), interval(2), ..., interval(P).]
Shared-Mem Abstraction
GraphChi [OSDI’12]
»Programming Model
• Edge scope of v
113
u v w
Du Dv Dw
D(u,v) D(v,w)
…………
…………
Shared-Mem Abstraction
GraphChi [OSDI’12]
»Programming Model
• Scatter & gather values along adjacent edges
114
u v w
Dv
D(u,v) D(v,w)
…………
…………
Shared-Mem Abstraction
GraphChi [OSDI’12]
»Load vertices of each interval, along with adjacent
edges for in-mem processing
»Write updated vertex/edge values back to disk
»Challenges
• Sequential IO
• Consistency: store each edge value only once on disk
115
Shared-Mem Abstraction
GraphChi [OSDI’12]
»Disk shards: shard(i)
• Vertices in interval(i)
• Their incoming edges, sorted by source_ID
116
[Figure: each interval(i) has a corresponding shard(i) on disk.]
Shared-Mem Abstraction
GraphChi [OSDI’12]
»Parallel Sliding Windows (PSW)
117
[Figure: four shards for the vertex intervals 1..100, 101..200, 201..300, 301..400; the in-edges inside each shard are sorted by src_id.]
Shared-Mem Abstraction
GraphChi [OSDI’12]
»Parallel Sliding Windows (PSW)
118
[Animation: to process the vertices of interval 1 (1..100), shard 1 is loaded into memory together with a sliding window over shards 2-4 that covers the out-edges of vertices 1..100; the windows then slide forward to process interval 2 (101..200), and so on.]
Shared-Mem Abstraction
GraphChi [OSDI’12]
»Each vertex & edge value is read & written at least once per iteration
120
Shared-Mem Abstraction
X-Stream [SOSP’13]
»Edge-scope GAS programming model
»Streams a completely unordered list of edges
121
Shared-Mem Abstraction
X-Stream [SOSP’13]
»Simple case: all vertex states are memory-resident
»Pass 1: edge-centric scattering
• (u, v): value(u) => <v, value(u, v)>
»Pass 2: edge-centric gathering
• <v, value(u, v)> => value(v)
122
update
aggregate
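A minimal two-pass edge-centric sketch for this memory-resident case (plain Java; the arrays and the additive combine are illustrative, and X-Stream actually streams the edge list from disk):

// uses java.util.List
static void scatterPass(int[] src, int[] dst, double[] state, java.util.List<double[]> updates) {
    for (int e = 0; e < src.length; e++)                        // edges streamed in arbitrary order
        updates.add(new double[]{dst[e], state[src[e]]});       // (u, v): value(u) => <v, ...>
}
static void gatherPass(java.util.List<double[]> updates, double[] state) {
    for (double[] u : updates)                                  // updates streamed in arbitrary order
        state[(int) u[0]] += u[1];                              // aggregate into value(v)
}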
Shared-Mem Abstraction
X-Stream [SOSP’13]
»Out-of-Core Engine
• P vertex partitions with vertex states only
• P edge partitions, partitioned by source vertices
• Each pass loads a vertex partition, streams corresponding
edge partition (or update partition)
123
[Figure: vertex intervals over IDs 1 ... n (larger than in GraphChi) whose vertex states fit into memory; the edge partitions and the P update files generated by Pass 1 scattering are streamed from disk.]
Shared-Mem Abstraction
X-Stream [SOSP’13]
»Out-of-Core Engine
• Pass 1: edge-centric scattering
– (u, v): value(u) => [v, value(u, v)]
• Pass 2: edge-centric gathering
– [v, value(u, v)] => value(v)
124
[Figure: Pass 1 appends each update to the update file of v's partition; Pass 2 streams the update file of the corresponding vertex partition.]
Shared-Mem Abstraction
X-Stream [SOSP’13]
»Scale out: Chaos [SOSP’15]
• Requires 40 GigE
• Slow with GigE
»Weakness: sparse computation
125
Shared-Mem Abstraction
VENUS [ICDE’14]
»Programming model
• Value scope of v
126
[Figure: the value scope of v: v reads its adjacent edge data and neighboring vertex values but updates only its own vertex value Dv.]
Shared-Mem Abstraction
VENUS [ICDE’14]
»Assume static topology
• Separate read-only edge data and mutable vertex states
»g-shard(i): incoming edge lists of vertices in interval(i)
»v-shard(i): srcs & dsts of edges in g-shard(i)
»All g-shards are concatenated for streaming
127
[Figure: vertex IDs 1 ... n cut into intervals. Sources in g-shard(i) may not belong to interval(i); vertices in a v-shard are ordered by ID; destinations of interval(i) may be sources of other intervals.]
Shared-Mem Abstraction
VENUS [ICDE’14]
»To process interval(i)
• Load v-shard(i)
• Stream g-shard(i), update in-memory v-shard(i)
• Update every other v-shard by a sequential write
128
[Figure: the destination vertices of g-shard(i) all lie in interval(i).]
Shared-Mem Abstraction
VENUS [ICDE’14]
» Avoid writing O(|E|) edge values to disk
» O(|E|) edge values are read once
» O(|V|) vertex values may be read/written multiple times
129
Tutorial Outline
Message Passing Systems
Shared Memory Abstraction
Single-Machine Systems
Matrix-Based Systems
Temporal Graph Systems
DBMS-Based Systems
Subgraph-Based Systems
130
Single-Machine Systems
Categories
»Shared-mem out-of-core (GraphChi, X-Stream,VENUS)
»Matrix-based (to be discussed later)
»SSD-based
»In-mem multi-core
»GPU-based
131
Single-Machine Systems
132
SSD-Based Systems
Single-Machine Systems
SSD-Based Systems
»Async random IO
• Many flash chips, each with multiple dies
»Callback function
»Pipelined for high throughput
133
Single-Machine Systems
TurboGraph [KDD’13]
»Vertices ordered by ID, stored in pages
134
Single-Machine Systems
TurboGraph [KDD’13]
135
Single-Machine Systems
TurboGraph [KDD’13]
136
Read order for positions in a page
Single-Machine Systems
TurboGraph [KDD’13]
137
Record for v6: in Page p3, Position 1
Single-Machine Systems
TurboGraph [KDD’13]
138
In-mem page table: vertex ID -> location on SSD
1-hop neighborhood queries: outperforms GraphChi by up to 10^4 times
Single-Machine Systems
TurboGraph [KDD’13]
139
Special treatment for adj-list larger than a page
Single-Machine Systems
TurboGraph [KDD’13]
»Pin-and-slide execution model
»Concurrently process vertices of pinned pages
»Do not wait for completion of IO requests
»Page unpinned as soon as processed
140
Single-Machine Systems
FlashGraph [FAST’15]
»Semi-external memory
• Edge lists on SSDs
»On top of SAFS, an SSD file system
• High-throughput async I/Os over SSD array
• Edge lists stored in one (logical) file on SSD
141
Single-Machine Systems
FlashGraph [FAST’15]
»Only access requested edge lists
»Merge same-page / adjacent-page requests into one
sequential access
»Vertex-centric API
»Message passing among threads
142
Single-Machine Systems
143
In-Memory Multi-Core Frameworks
Single-Machine Systems
In-Memory Parallel Frameworks
»Programming simplicity
• Green-Marl, Ligra, GRACE
»Full utilization of all cores in a machine
• GRACE, Galois
144
Single-Machine Systems
Green-Marl [ASPLOS’12]
»Domain-specific language (DSL)
• High-level language constructs
• Expose data-level parallelism
»DSL → C++ program
»Initially single-machine, now supported by GPS
145
Single-Machine Systems
Green-Marl [ASPLOS’12]
»Parallel For
»Parallel BFS
»Reductions (e.g., SUM, MIN, AND)
»Deferred assignment (<=)
• Effective only at the end of the binding iteration
146
Single-Machine Systems
Ligra [PPoPP’13]
»VertexSet-centric API: edgeMap, vertexMap
»Example: BFS
• Ui+1←edgeMap(Ui, F, C)
147
u
v
Ui
Vertices for next iteration
Single-Machine Systems
Ligra [PPoPP’13]
»VertexSet-centric API: edgeMap, vertexMap
»Example: BFS
• Ui+1←edgeMap(Ui, F, C)
148
u
v
Ui
C(v) = parent[v] is NULL?
Yes
Single-Machine Systems
Ligra [PPoPP’13]
»VertexSet-centric API: edgeMap, vertexMap
»Example: BFS
• Ui+1←edgeMap(Ui, F, C)
149
u
v
Ui
F(u, v):
parent[v] ← u
v added to Ui+1
Single-Machine Systems
Ligra [PPoPP’13]
»Mode switch based on vertex sparseness |Ui|
• When | Ui | is large
150
u
v
Ui
w
C(w) called 3 times
Single-Machine Systems
Ligra [PPoPP’13]
»Mode switch based on vertex sparseness |Ui|
• When | Ui | is large
151
u
v
Ui
w
For each v with C(v) true, call F(u, v) for every in-neighbor u ∈ Ui
Early pruning: for BFS, stop after the first successful F(u, v)
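A minimal sparse-mode edgeMap + BFS sketch following the Ui+1 = edgeMap(Ui, F, C) formulation (plain Java; Ligra itself is C++ and switches to the dense, in-neighbor-driven mode sketched above when |Ui| is large):

// uses java.util.*
static java.util.List<Integer> edgeMapSparse(int[][] outNbrs, java.util.List<Integer> frontier, int[] parent) {
    java.util.List<Integer> next = new java.util.ArrayList<>();
    for (int u : frontier)
        for (int v : outNbrs[u])
            if (parent[v] == -1) {     // C(v): v not yet visited
                parent[v] = u;         // F(u, v): record parent
                next.add(v);           // v joins U_{i+1}
            }
    return next;
}
// BFS driver: fill parent with -1, set parent[root] = root, frontier = [root],
// and repeat frontier = edgeMapSparse(outNbrs, frontier, parent) until it is empty.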
Single-Machine Systems
GRACE [PVLDB’13]
»Vertex-centric API, block-centric execution
• Inner-block computation: vertex-centric computation with
an inner-block scheduler
»Reduce data access to computation ratio
• Many vertex-centric algos are computationally-light
• CPU cache locality: every block fits in cache
152
Single-Machine Systems
Galois [SOSP’13]
»Amorphous data-parallelism (ADP)
• Speculative execution: fully use extra CPU resources
153
[Figure: the neighborhoods of active elements u and v overlap at w.]
Single-Machine Systems
Galois [SOSP’13]
»Amorphous data-parallelism (ADP)
• Speculative execution: fully use extra CPU resources
154
[Figure: when the speculative activities at u and v conflict at w, one of them is rolled back.]
Single-Machine Systems
Galois [SOSP’13]
»Amorphous data-parallelism (ADP)
• Speculative execution: fully use extra CPU resources
»Machine-topology-aware scheduler
• Try to fetch tasks local to the current core first
155
Single-Machine Systems
156
GPU-Based Systems
Single-Machine Systems
GPU Architecture
»Array of streaming multiprocessors (SMs)
»Single instruction, multiple threads (SIMT)
»Different control flows
• Execute all flows
• Masking
»Memory cache hierarchy
157
Small path divergence
Coalesced memory accesses
Single-Machine Systems
GPU Architecture
»Warp: 32 threads, basic unit for scheduling
»SM: 48 warps
• Two streaming processors (SPs)
• Warp scheduler: two warps executed at a time
»Thread block / CTA (cooperative thread array)
• 6 warps
• Kernel call → grid of CTAs
• CTAs are distributed to SMs with available resources
158
Single-Machine Systems
Medusa [TPDS’14]
»BSP model of Pregel
»Fine-grained API: Edge-Message-Vertex (EMV)
• Large parallelism, small path divergence
»Pre-allocates an array for buffering messages
• Coalesced memory accesses: incoming msgs for each vertex are stored consecutively
• Write positions of msgs do not conflict
159
Single-Machine Systems
CuSha [HPDC’14]
»Apply the shard organization of GraphChi
»Each shard processed by one CTA
»Window concatenation
160
[Figure: GraphChi-style shards (in-edges sorted by src_id) for the vertex intervals 1..100, 101..200, 201..300, 301..400; writing back the sliding windows of the shards yields imbalanced workloads across CTAs.]
Single-Machine Systems
CuSha [HPDC’14]
»Apply the shard organization of GraphChi
»Each shard processed by one CTA
»Window concatenation
161
[Figure: windows are concatenated so that threads in a CTA may cross window boundaries; the concatenated windows keep pointers to the actual edge locations in the shards.]
Tutorial Outline
Message Passing Systems
Shared Memory Abstraction
Single-Machine Systems
Matrix-Based Systems
Temporal Graph Systems
DBMS-Based Systems
Subgraph-Based Systems
162
Matrix-Based Systems
163
Categories
»Single-machine systems
• Vertex-centric API
• Matrix operations in the backend
»Distributed frameworks
• (Generalized) matrix-vector multiplication
• Matrix algebra
Matrix-Based Systems
164
Matrix-Vector Multiplication
»Example: PageRank
[Matrix form: the matrix whose k-th row is Out-AdjacencyList(v_k), multiplied by the vector (PR_i(v1), ..., PR_i(v4)), yields the next vector (PR_{i+1}(v1), ..., PR_{i+1}(v4)).]
Matrix-Based Systems
165
Generalized Matrix-Vector Multiplication
»Example: HashMin
[Generalized matrix form: the 0/1 adjacency matrix (row k = 0/1-AdjacencyList(v_k)) "multiplied" by the vector (min_i(v1), ..., min_i(v4)) yields (min_{i+1}(v1), ..., min_{i+1}(v4)), where Add is replaced by Min and the result is assigned only when it is smaller.]
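A minimal sketch of one generalized matrix-vector step for Hash-Min, where multiplication selects a neighbor's value, addition becomes min, and assignment happens only when the value shrinks (plain Java; a dense boolean matrix stands in for the sparse adjacency structure):

static void hashMinStep(boolean[][] adj, long[] cur, long[] next) {
    for (int i = 0; i < cur.length; i++) {
        long m = cur[i];
        for (int j = 0; j < cur.length; j++)
            if (adj[i][j]) m = Math.min(m, cur[j]);   // Add replaced by Min over the "products"
        next[i] = (m < cur[i]) ? m : cur[i];          // assign only when smaller
    }
}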
Matrix-Based Systems
166
Single-Machine Systems
with Vertex-Centric API
Matrix-Based Systems
GraphTwist [PVLDB’15]
»Multi-level graph partitioning
• Right granularity for in-memory processing
• Balance workloads among computing threads
167
[Figures: the graph is viewed as a src × dst × edge-weight cube in which entry (u, v) holds w(u, v); it is partitioned at multiple granularities: slices, stripes, dices, and vertex cuts.]
Matrix-Based Systems
GraphTwist [PVLDB’15]
»Multi-level graph partitioning
• Right granularity for in-memory processing
• Balance workloads among computing threads
»Fast Randomized Approximation
• Prune statistically insignificant vertices/edges
• E.g., PageRank computation only using high-weight edges
• Unbiased estimator: sampling slices/cuts according to
Frobenius norm
172
Matrix-Based Systems
GridGraph [ATC’15]
»Grid representation for reducing IO
173
Matrix-Based Systems
GridGraph [ATC’15]
»Grid representation for reducing IO
»Streaming-apply API
• Streaming edges of a block (Ii, Ij)
• Aggregate value to v ∈ Ij
174
Matrix-Based Systems
GridGraph [ATC’15]
»Illustration: column-by-column evaluation
175
[Animation: edges are stored in a P × P grid of blocks. For each column j, the destination vertex chunk is created in memory, the source vertex chunks are loaded while the corresponding edge blocks (I_i, I_j) are streamed, and the updated destination chunk is saved back to disk before moving to the next column.]
Matrix-Based Systems
GridGraph [ATC’15]
»Read O(P|V|) data of vertex chunks
»Write O(|V|) data of vertex chunks (not O(|E|)!)
»Stream O(|E|) data of edge blocks
• Edge blocks are appended into one large file for streaming
• Block boundaries recorded to trigger the pin/unpin of a
vertex chunk
183
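A minimal column-oriented streaming-apply sketch over the grid layout (plain Java, with in-memory arrays standing in for the on-disk vertex chunks and edge blocks; contribution() is the user-supplied per-edge function):

static void processColumn(int j, int[][][] blockSrc, int[][][] blockDst,
                          double[][] srcChunks, double[] dstChunk) {
    for (int i = 0; i < blockSrc.length; i++) {            // every edge block (I_i, I_j) in column j
        int[] srcs = blockSrc[i][j], dsts = blockDst[i][j];
        for (int e = 0; e < srcs.length; e++)              // edges of the block are streamed sequentially
            dstChunk[dsts[e]] += contribution(srcChunks[i][srcs[e]]);
    }
    // the destination chunk (O(|V|/P) values) is written back once per column, not once per edge
}
static double contribution(double srcValue) { return srcValue; }   // placeholder per-edge function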
Matrix-Based Systems
184
Distributed Frameworks
with Matrix Algebra
Distributed Systems with Matrix-
Based Interfaces
• PEGASUS (CMU, 2009)
• GBase (CMU & IBM, 2011)
• SystemML (IBM, 2011)
185
Commonality:
• Matrix-based programming interface to the users
• Rely on MapReduce for execution.
PEGASUS
• Open source: http://www.cs.cmu.edu/~pegasus
• Publications: ICDM’09, KAIS’10.
• Intuition: many graph computations can be modeled by a generalized form of matrix-vector multiplication.
v' = M × v
PageRank: v' = (0.85 · A^T + 0.15 · U) × v
186
PEGASUS Programming Interface: GIM-V
Three Primitives:
1) combine2(mi,j , vj ) : combine mi,j and vj into xi,j
2) combineAlli (xi,1 , ..., xi,n ) : combine all the results from
combine2() for node i into vi '
3) assign(vi , vi ' ) : decide how to update vi with vi '
Iterative: Operation applied till algorithm-specific convergence
criterion is met.
PageRank Example
v' = (0.85 · A^T + 0.15 · U) × v
combine2(a_{i,j}, v_j) = 0.85 · a_{i,j} · v_j
combineAll_i(x_{i,1}, ..., x_{i,n}) = 0.15 / n + Σ_{j=1..n} x_{i,j}
assign(v_i, v_i') = v_i'
188
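The three GIM-V primitives for this PageRank formulation, written as a minimal Java sketch (PEGASUS itself runs them as Hadoop MapReduce stages):

static double combine2(double mij, double vj)  { return 0.85 * mij * vj; }
static double combineAll(double[] x, int n) {   // x holds the combine2 results for row i
    double s = 0.15 / n;
    for (double xij : x) s += xij;
    return s;
}
static double assign(double vi, double viNew)  { return viNew; }   // simply overwrite the old value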
Execution Model
Iterations of a 2-stage algorithm (each stage is a MR job)
• Input: Edge andVector file
• Edge line : (idsrc , iddst , mval) -> cell adjacency Matrix M
• Vector line: (id, vval) -> element inVectorV
• Stage 1: performs combine2() on columns of iddst of M with
rows of id ofV
• Stage 2: combines all partial results and assigns new vector
-> old vector
189
Optimizations
• Block Multiplication
• Clustered Edges
190
• Diagonal Block Iteration for
connected component detection
* Figures are copied from Kang et al ICDM’09
GBASE
• Part of the IBM System GToolkit
• http://systemg.research.ibm.com
• Publications: SIGKDD’11,VLDBJ’12.
• PEGASUS vs GBASE:
• Common:
• Matrix-vector multiplication as the core operation
• Division of a matrix into blocks
• Clustering nodes to form homogenous blocks
• Different:
191
               | PEGASUS            | GBASE
Queries        | global             | targeted & global
User Interface | customizable APIs  | built-in algorithms
Storage        | normal files       | compression, special placement
Block Size     | square blocks      | rectangular blocks
Block Compression and Placement
• Block Formation
• Partition nodes using clustering algorithms e.g. Metis
• Compressed block encoding
• source and destination partition ID p and q;
• the set of sources and the set of destinations
• the payload, the bit string of subgraph G(p,q)
• The payload is compressed using zip compression or gap Elias-γ encoding.
• Block Placement
• Grid placement to minimize the number of input HDFS
files to answer queries
192
* Figure is copied from Kang et al SIGKDD’11
Built-In Algorithms in GBASE
• Select grids containing the blocks relevant to the queries
• Derive the incidence matrix from the original adjacency
matrix as required
193
* Figure is copied from Kang et al SIGKDD’11
SystemML
• Apache Open source: https://systemml.apache.org
• Publications: ICDE’11, ICDE’12, VLDB’14, Data Engineering Bulletin’14, ICDE’15, SIGMOD’15, PPoPP’15, VLDB’16.
• Comparison to PEGASUS and GBASE
• Core: General linear algebra and math operations (beyond just matrix-
vector multiplication)
• Designed for machine learning in general
• User Interface: A high-level language with similar syntax as R
• Declarative approach to graph processing with cost-based and rule-based
optimization
• Run on multiple platforms including MapReduce, Spark and single node.
194
SystemML – Declarative Machine Learning
Analytics language for data scientists
(“The SQL for analytics”)
» Algorithms expressed in a declarative,
high-level language with R-like syntax
» Productivity of data scientists
» Language embeddings for
• Solutions development
• Tools
Compiler
» Cost-based optimizer to generate
execution plans and to parallelize
• based on data characteristics
• based on cluster and machine characteristics
» Physical operators for in-memory single node and
cluster execution
Performance & Scalability
195
SystemML Architecture Overview
196
Language (DML)
• R- like syntax
• Rich set of statistical functions
• User-defined & external function
• Parsing
• Statement blocks & statements
• Program Analysis, type inference, dead code elimination
High-Level Operator (HOP) Component
• Represent dataflow in DAGs of operations on matrices, scalars
• Choosing from alternative execution plans based on memory and
cost estimates: operator ordering & selection; hybrid plans
Low-Level Operator (LOP) Component
• Low-level physical execution plan (LOPDags) over key-value pairs
• “Piggybacking” operations into a minimal number of MapReduce jobs
Runtime
• Hybrid Runtime
• CP: single machine operations & orchestrate MR jobs
• MR: generic Map-Reduce jobs & operations
• SP: Spark Jobs
• Numerically stable operators
• Dense / sparse matrix representation
• Multi-Level buffer pool (caching) to evict in-memory objects
• Dynamic Recompilation for initial unknowns
[Figure: SystemML architecture: APIs (command line, JMLC, Spark MLContext, Spark ML) feed the parser/language layer, the high-level (HOP) and low-level (LOP) operator components of the compiler with cost-based optimizations and a recompiler, and a hybrid runtime comprising the control program, CP/MR/Spark instructions, generic MR jobs, a buffer pool, the ParFor optimizer/runtime, the MatrixBlock library (single/multi-threaded), and memory/FS/DFS IO.]
Pros and Cons of Matrix-Based Graph
Systems
Pros:
- Intuitive for analytic users familiar with linear algebra
- E.g. SystemML provides a high-level language familiar to a lot of analysts
Cons:
- PEGASUS and GBASE require an expensive clustering of nodes as a
preprocessing step.
- Not all graph algorithms can be expressed using linear algebra
- Unnecessary computation compared to vertex-centric model
197
Tutorial Outline
Message Passing Systems
Shared Memory Abstraction
Single-Machine Systems
Matrix-Based Systems
Temporal Graph Systems
DBMS-Based Systems
Subgraph-Based Systems
198
Temporal and Streaming Graph Analytics
• Motivation: Real world graphs often evolve
over time.
• Two body of work:
• Real-time analysis on streaming graph data
• E.g. Calculate each vertex’s current PageRank
• Temporal analysis over historical traces of graphs
• E.g. Analyzing the change of each vertex’s PageRank
for a given time range
199
Common Features for All Systems
• Temporal Graph: a continuous stream of graph updates
• Graph update: addition or deletion of vertex/edge, or the update of the attribute associated with
node/edge.
• Most systems separate graph updates from graph computation.
• Graph computation is only performed on a sequence of successive static views of the temporal
graph
• A graph snapshot is most commonly used static view
• Using existing static graph programming APIs for temporal graphs
• Incremental graph computation
• Leverage significant overlap of successive
static views
• Use ending vertex and edge states at time t
as the starting states at time t+1
• Not applicable to all algorithms
200
Static view 1 Static view 2 Static view 3
Overview
• Real-time Streaming Graph Systems
• Kineograph (distributed, Microsoft, 2012)
• TIDE (distributed, IBM, 2015)
• Historical Graph Systems
• Chronos (distributed, Microsoft, 2014)
• DeltaGraph (distributed, University of Maryland, 2013)
• LLAMA (single-node, Harvard University & Oracle, 2015)
201
Kineograph
• Publication: Cheng et al Eurosys’12
• Target query: continuously deliver analytics results
on static snapshots of a dynamic graph periodically
• Two layers:
• Storage layer: continuously applies updates to a dynamic graph
• Computation layer: performs graph computation on a graph
snapshot
202
Kineograph Architecture Overview
• Graph is stored in a key/value
store among graph nodes
• Ingest nodes are the front end
of incoming graph updates
• Snapshooter uses an epoch
commit protocol to produce
snapshots
• Progress table keeps track of
the process by ingest nodes
203
* Figure is copied from Cheng et al Eurosys’12
Epoch Commit Protocol
204
* Figure is copied from Cheng et al Eurosys’12
Graph Computation
• Apply vertex-based GAS computation model on
snapshots of a dynamic graph
• Supports both push and pull models for inter-vertex
communication.
205
* Figure is copied from Cheng et al Eurosys’12
TIDE
• Publication: Xie et al ICDE’15
• Target query: continuously deliver analytics results
on a dynamic graph
• Model social interactions as a dynamic interaction
graph
• New interactions (edges) continuously added
• Probabilistic edge decay (PED) model to produce
static views of dynamic graphs
206
Static Views of Temporal Graph
207
[Figure: under the sliding-window model, e.g., the relationship between a and b is forgotten once it falls outside the window.]
Sliding Window Model
 Consider recent graph data within a small time window
 Problem: Abruptly forgets past data (no continuity)
Snapshot Model
 Consider all graph data seen so far
 Problem: Does not emphasize recent data (no recency)
Probabilistic Edge Decay Model
208
Key Idea: Temporally Biased Sampling
 Sample data items according to a probability
that decreases over time
 Sample contains a relatively high proportion of
recent interactions
Probabilistic View of an Edge’s Role
 All edges have chance to be considered
(continuity)
 Outdated edges are less likely to be used
(recency)
 Can systematically trade off recency and
continuity
 Can use existing static-graph algorithms
Create N sample graphs
Discretized Time + Exponential Decay
Typically reduces Monte Carlo
variability
Maintaining Sample Graphs inTIDE
209
Naïve Approach: Whenever a new batch of data comes in
 Generate N sampled graphs
 Run graph algorithm on each sample
Idea #1: Exploit overlaps at successive time points
 Subsample old edges of G_t^(i)
– Selection probability = p independently for each edge
 Then add new edges
 Theorem: G_{t+1}^(i) has the correct marginal probability
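A minimal sketch of Idea #1: evolving one sample graph from time t to t + 1 by subsampling the old edges with probability p and then adding the new batch (plain Java; the value of p is supplied by the PED model's decay parameters):

// uses java.util.*
static <E> java.util.Set<E> advanceSample(java.util.Set<E> sampleT,
                                          java.util.Collection<E> newBatch,
                                          double p, java.util.Random rng) {
    java.util.Set<E> next = new java.util.HashSet<>();
    for (E edge : sampleT)
        if (rng.nextDouble() < p) next.add(edge);   // keep each old edge independently with probability p
    next.addAll(newBatch);                          // newly arrived edges always enter the sample
    return next;                                    // marginal inclusion probabilities stay correct
}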
Maintaining Sample Graphs, Continued
210
Idea #2: Exploit overlap between sample graphs at each time point
 With high probability, more than 50% of edges overlap
 So maintain aggregate graph
[Figure: the sample graphs G_t^(1), G_t^(2), G_t^(3) overlap heavily, so their edges are stored once in an aggregate graph, with each edge tagged by the samples it belongs to (e.g., {1,2}, {1,3}).]
Memory requirements (batch size = 𝑴)
 Snapshot model: continuously increasing memory requirement
 PED model: bounded memory requirement
– # Edges stored by storing graphs separately: 𝑂(𝑀𝑁)
– # Edges stored by aggregate graph: 𝑂(𝑀 log 𝑁)
Bulk Graph Execution Model
211
Iterative Graph processing (Pregel, GraphLab, Trinity, GRACE, …)
• User-defined compute () function on each vertex v changes v + adjacent edges
• Changes propagated to other vertices via message passing or scheduled updates
Key idea in TIDE:
Bulk execution: Compute results for multiple sample graphs simultaneously
 Partition N sample graphs into bulk sets with s sample graphs each
 Execute algorithm on the aggregate graph of each bulk set (partial aggregate graph)
Benefits
 Same interface: users still think the
computation is applied on one
graph
 Amortize overheads of extracting &
loading from aggregate graph
 Better memory locality (vertex
operations)
 Similar message values & similar state values -> opportunities for compression (>2x speedup with LZF)
Overview
• Real-time Streaming Graph Systems
• Kineograph (distributed, Microsoft, 2012)
• TIDE (distributed, IBM, 2015)
• Historical Graph Systems
• Chronos (distributed, Microsoft, 2014)
• DeltaGraph (distributed, University of Maryland, 2013)
• LLAMA (single-node, Harvard University & Oracle, 2015)
212
Chronos
• Publication: Han et al Eurosys’14
• Target query: graph computation on the sequence of static
snapshots of a temporal graph within a time range
• E.g analyzing the change of each vertex’s PageRank for a given time
range
• Naïve approach: applying graph computation on each
snapshot separately
• Chronos: exploit the time locality of temporal graphs
213
Structure Locality vs. Time Locality
• Structure locality
• States of neighboring vertices in the same snapshot are laid out close to each other
• Time locality (preferred in Chronos)
• States of a vertex (or an edge) in consecutive snapshots are stored together
214
* Figures are copied from Han et al EuroSys’14
Chronos Design
• In-memory graph layout
• Data of a vertex/edge in consecutive snapshots are placed together
• Locality-aware batch scheduling (LABS)
• Batch processing of a vertex across all the snapshots
• Batch information propagation to a neighbor vertex across snapshots
• Incremental Computation
• Use the results on 1st snapshot to batch compute on the remaining
snapshots
• Use the results on the intersection graph to batch compute on all snapshots
• On-disk graph layout
• Organized in snapshot groups
• Stored as the first snapshot followed by the updates in the remaining snapshots in this group.
215
DeltaGraph
• Publication: Khurana et al ICDE’13, EDBT’16
• Target query: access past states of the graphs and
perform static graph analysis
• E.g., study the evolution of centrality measures, density, conductance, etc.
• Two major components:
• Temporal Graph Index (TGI)
• Temporal Graph Analytics Framework (TAF)
216
Temporal Graph Index
218
• Partitioned delta and partitioned
eventlist for scalability
• Version chain for nodes
• Sorted list of references to a
node
• Graph primitives
• Snapshot retrieval
• Node’s history
• K-hop neighborhood
• Neighborhood evolution
Temporal Graph Analytics Framework
• Node-centric graph extraction and analytical logic
• Primary operand: Set of Nodes (SoN) refers to a collection of
temporal nodes
• Operations
• Extract: Timeslice, Select, Filter, etc.
• Compute: NodeCompute, NodeComputeTemporal, etc.
• Analyze: Compare, Evolution, other aggregates
219
LLAMA
• Publication: Macko et al ICDE’15
• Target query: perform various whole graph analysis
on consistent views
• A single machine system that stores and
incrementally updates an evolving graph in multi-
version representations
• LLAMA provides a general purpose programming
model instead of vertex- or edge- centric models
220
Multi-Version CSR Representation
• Augment the compact read-only CSR (compressed sparse
row) representation to support mutability and persistence.
• Large multi-versioned array (LAMA) with a software copy-on-write
technique for snapshotting
221
* Figure is copied from Macko et al ICDE’15
Tutorial Outline
Message Passing Systems
Shared Memory Abstraction
Single-Machine Systems
Matrix-Based Systems
Temporal Graph Systems
DBMS-Based Systems
Subgraph-Based Systems
222
DBMS-Style Graph Systems
[Figure: a DBMS-style graph system stack: graph algorithms expressed via Datalog, SQL, or Pregel/GAS APIs run on top of a query optimizer, a data-parallel query execution engine, and a storage engine; example systems include SociaLite/Myria, REX, GraphX/Pregelix, Naiad, Pregel, and Vertexica.]
Reason #1
Expressiveness
»Transitive closure
»All pair shortest paths
Vertex-centric API?
public class AllPairShortestPaths extends Vertex<VLongWritable, DoubleWritable, FloatWritable, DoubleWritable> {
    private Map<VLongWritable, DoubleWritable> distances = new HashMap<>();
    @Override
    public void compute(Iterator<DoubleWritable> msgIterator) {
        .......
    }
}
Reason #2
Easy OPS – Unified logs, tooling, configuration…!
Reason #3
Efficient Resource Utilization and
Robustness
~30 similar threads on
Giraph-users mailing list
during the year 2015!
“I’m trying to run the sample connected
components algorithm on a large data
set on a cluster, but I get a
‘java.lang.OutOfMemoryError: Java heap
space’ error.”
Reason #4
One-size fits-all?
Physical flexibility and adaptivity
»PageRank, SSSP, CC, Triangle Counting
»Web graph, social network, RDF graph
»An 8-machine cluster of cheap machines at a school vs. 200 beefy machines at an enterprise data center
What’s graph analytics?
304 Million Monthly Active Users
500 Million Tweets Per Day!
200 Billion Tweets Per Year!
TwitterMsg(
tweetid: int64,
user: string,
sender_location: point,
send_time: datetime,
reply_to: int64,
retweet_from: int64,
referred_topics: array<string>,
message_text: string
);
Reason #5
Easy Data Science
INSERT OVERWRITE TABLE MsgGraph
SELECT T.tweetid, 1.0/10000000000.0,
       CASE WHEN T.reply_to >= 0 THEN array(T.reply_to)
            ELSE array(T.retweet_from)
       END
FROM TwitterMsg AS T
WHERE T.reply_to >= 0
   OR T.retweet_from >= 0;

[Pipeline: the extracted MsgGraph table is written to HDFS, a Giraph PageRank job runs over it, and its Result table on HDFS feeds the second query.]

SELECT R.user, SUM(R.rank) AS influence
FROM Result R, TwitterMsg TM
WHERE R.vertexid = TM.tweetid
GROUP BY R.user
ORDER BY influence DESC
LIMIT 50;

MsgGraph(
  vertexid: int64,
  value: double,
  edges: array<int64>
);
Result(
  vertexid: int64,
  rank: double
);
Reason #6
Software Simplicity
Network management
Pregel
GraphLab
Giraph
......
Message delivery
Memory management
Task scheduling
Vertex/Message
internal format
#1 Expressiveness
Path(u, v, min(d)) :- Edge(u, v, d);
                   :- Path(u, w, d1), Edge(w, v, d2), d = d1 + d2
TC(u, u) :- Edge(u, _)
TC(v, v) :- Edge(_, v)
TC(u, v) :- TC(u, w), Edge(w, v), u != v
Recursive Query!
»SociaLite (VLDB’13)
»Myria (VLDB’15)
»DeALS (ICDE’15)
IDB
EDB
#2 Easy OPS
Converged Platforms!
»GraphX, on Apache Spark (OSDI’14)
»Gelly, on Apache Flink (FOSDEM’15)
#3 Efficient Resource
Utilization and Robustness
Leverage MPP query execution engine!
»Pregelix (VLDB’14)
[Figure: in Pregelix, the Pregel state is stored as relations; in each superstep the Msg relation is joined with the Vertex relation on vid and the join result is fed to the compute() UDF.]
Relation Schema
Vertex (vid, halt, value, edges)
Msg (vid, payload)
GS (halt, aggregate, superstep)
#4 Efficient Resource
Utilization and Robustness
[Figure: Pregelix transparently supports both in-memory and out-of-core execution.]
#4 Physical Flexibility
Flexible processing for the Pregel semantics
»Storage: row vs. column, in-place vs. LSM, etc.
• Vertexica (VLDB’14)
• Vertica (IEEE BigData’15)
• Pregelix (VLDB’14)
»Query plan, join algorithms, group-by algorithms,
etc.
• Pregelix (VLDB’14)
• GraphX (OSDI’14)
• Myria (VLDB’15)
»Execution model: synchronous vs. asynchronous
• Myria (VLDB’15)
#4 Physical Flexibility
Vertica: column store vs. row store (IEEE BigData’15)
#4 Physical Flexibility
[Figure: two alternative Pregelix query plans for one superstep: (1) an index left outer join of Msg_i(M) with Vertex_i(V) on M.vid = V.vid followed by a UDF call to compute() where V.halt = false or M.payload != NULL; (2) an index full outer join plus a merge (choose()) that additionally maintains the active-vertex set Vid_i for the next superstep.]
Pregelix, different query plans
#4 Physical Flexibility
[Figure: the choice of physical plan can make up to a 15x performance difference; Pregelix supports both in-memory and out-of-core execution.]
#4 Physical Flexibility
Myria: synchronous vs. asynchronous (VLDB’15)
»Least Common Ancestor
#4 Physical Flexibility
Myria: synchronous vs. asynchronous (VLDB’15)
»Connected Components
#5 Easy Data Science
Integrated Programming Abstractions
»REX (VLDB’12)
»AsterData (VLDB’14)
SELECT R.user, SUM(R.rank) AS influence
FROM PageRank( (
       SELECT T.tweetid AS vertexid, 1.0/… AS value, … AS edges
       FROM TwitterMsg AS T
       WHERE T.reply_to >= 0
          OR T.retweet_from >= 0
     ), …… ) AS R,
     TwitterMsg AS TM
WHERE R.vertexid = TM.tweetid
GROUP BY R.user
ORDER BY influence DESC
LIMIT 50;
#6 Software Simplicity
Engineering cost is Expensive!
System Lines of source code (excluding
test code and comments)
Giraph 32,197
GraphX 2,500
Pregelix 8,514
Tutorial Outline
Message Passing Systems
Shared Memory Abstraction
Single-Machine Systems
Matrix-Based Systems
Temporal Graph Systems
DBMS-Based Systems
Subgraph-Based Systems
243
Graph analytics/network science tasks are too varied
» Centrality analysis; evolution models; community detection
» Link prediction; belief propagation; recommendations
» Motif counting; frequent subgraph mining; influence analysis
» Outlier detection; graph algorithms like matching, max-flow
» An active area of research in itself…
Graph Analysis Tasks
Counting network motifs
[Figure: example motifs: feed-forward loop, feedback loop, bi-parallel motif.]
Identify social circles in a user’s ego network
[Figure: an ego network partitioned into circles such as high-school friends, family members, office colleagues, college friends, friends in the CS department, and friends in the database lab.]
Vertex-centric framework
» Works well for some applications
• PageRank, Connected Components, …
• Some machine learning algorithms can be mapped to it
» However, the framework is very restrictive
• Most analysis tasks or algorithms cannot be written easily
• Simple tasks like counting neighborhood properties infeasible
• Fundamentally: Not easy to decompose analysis tasks into vertex-level,
independent local computations
Alternatives?
» Galois, Ligra, GreenMarl: Not sufficiently high-level
» Some others (e.g., Socialite) restrictive for different reasons
Limitations ofVertex-Centric Framework
Example: Local Clustering Coefficient
[Figure: a node n with four neighbors 1-4; LCC counts the edges that exist among these neighbors.]
A measure of local density around a node:
LCC(n) = # edges in 1-hop neighborhood/max # edges possible
Compute() at Node n:
Need to count the no. of edges between neighbors
But does not have access to that information
Option 1: Each node transmits its list of
neighbors to its neighbors
Huge memory consumption
Option 2: Allow access to neighbors’ state
Neighbors may not be local
What about computations that require
2-hop information?
Example: Frequent Subgraph Mining
Goal: Find all (labeled) subgraphs that appear sufficiently frequently
No easy way to map this to the vertex-centric framework
- Need ability to construct subgraphs of the graph incrementally
- Can construct partial subgraphs and pass them around
- Very high memory consumption, and duplication of state
- Need ability to count the number of occurrences of each subgraph
- Analogous to “reduce()” but with subgraphs as keys
- Some vertex-centric frameworks support such functionality for
aggregation, but only in a centralized fashion
Similar challenges for problems like: finding all cliques, motif counting
Major Systems
NScale:
»Subgraph-centric API that generalizes the vertex-centric API
»The user compute() function has access to “subgraphs”
rather than “vertices”
»Graph distributed across a cluster of machines analogous
to distributed vertex-centric frameworks
Arabesque:
»Fundamentally different programming model aimed at
frequent subgraph mining, motif counting, etc.
»Key assumption:
• The graph fits in the memory of a single machine in the cluster,
• .. but the intermediate results might not
An end-to-end distributed graph programming framework
Users/application programs specify:
» Neighborhoods or subgraphs of interest
» A kernel computation to operate upon those subgraphs
Framework:
» Extracts the relevant subgraphs from underlying data and loads
in memory
» Execution engine: Executes user computation on materialized
subgraphs
» Communication: Shared state/message passing
Implementations on Hadoop MapReduce as well as Apache Spark
NScale
NScale: LCC Computation Walkthrough
NScale programming model
[Figure: an example graph with vertices 1-12 stored as underlying graph data on HDFS.]
Subgraph extraction query:
Compute(LCC) on Extract(
  {Node.color=orange}   ← query-vertex predicate
  {k=1}                 ← neighborhood size
  {Node.color=white}    ← neighborhood vertex predicate
  {Edge.type=solid}     ← neighborhood edge predicate
)
NScale programming model
[Figure: the same example graph (vertices 1-12) on HDFS.]
Specifying Computation: BluePrints API
Program cannot be executed as is in vertex-centric programming frameworks.
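As a rough illustration of what the subgraph-centric compute() gets to do, here is a minimal LCC sketch over an extracted 1-hop neighborhood (plain Java, with an adjacency map standing in for the BluePrints graph object):

// uses java.util.*
static double lcc(java.util.Map<Integer, java.util.Set<Integer>> egoNet, int n) {
    java.util.List<Integer> nbrs = new java.util.ArrayList<>(egoNet.get(n));
    int d = nbrs.size();
    if (d < 2) return 0.0;
    int links = 0;
    for (int i = 0; i < d; i++)
        for (int j = i + 1; j < d; j++)
            if (egoNet.get(nbrs.get(i)).contains(nbrs.get(j))) links++;   // edge between two neighbors
    return 2.0 * links / (d * (d - 1));   // observed edges / maximum possible edges
}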
NScale: LCC Computation Walkthrough
GEP: Graph extraction and packing
[Figure: the HDFS graph data is fed to a MapReduce-based GEP phase: subgraph extraction, a cost-based optimizer that performs set bin packing, and MR2 map/reduce tasks that produce the node-to-bin mapping consumed by the execution engines.]
NScale: LCC Computation Walkthrough
GEP: Graph extraction and packing
[Figure: subgraph extraction over the HDFS graph data with MapReduce (Apache Yarn) produces the extracted subgraphs SG-1 ... SG-4.]
NScale: LCC Computation Walkthrough
GEP: Graph extraction and packing
[Figure: the cost-based optimizer then decides data representation and placement, and the extracted subgraphs are packed into bins held in distributed memory.]
NScale: LCC Computation Walkthrough
GEP: Graph extraction and packing
[Figure: the subgraphs held in distributed memory are processed by the distributed execution engine; a node master on each machine runs the user computation on its local subgraphs.]
Distributed execution of user computation
NScale: LCC Computation Walkthrough
Experimental Evaluation
Personalized Page Rank on 2-Hop Neighborhood
(CE in Node-Secs; Mem = Cluster Mem in GB; DNC = did not complete; OOM = out of memory)
Dataset      #Source Vertices   NScale CE   NScale Mem   Giraph CE   Giraph Mem   GraphLab CE   GraphLab Mem   GraphX CE   GraphX Mem
EU Email     3200               52          3.35         782         17.10        710           28.87          9975        85.50
NotreDame    3500               119         9.56         1058        31.76        870           70.54          50595       95.00
GoogleWeb    4150               464         21.52        10482       64.16        1080          108.28         DNC         -
WikiTalk     12000              3343        79.43        DNC         OOM          DNC           OOM            DNC         -
LiveJournal  20000              4286        84.94        DNC         OOM          DNC           OOM            DNC         -
Orkut        20000              4691        93.07        DNC         OOM          DNC           OOM            DNC         -
Local Clustering Coefficient
Dataset      NScale CE   NScale Mem   Giraph CE   Giraph Mem   GraphLab CE   GraphLab Mem   GraphX CE   GraphX Mem
EU Email     377         9.00         1150        26.17        365           20.10          225         4.95
NotreDame    620         19.07        1564        30.14        550           21.40          340         9.75
GoogleWeb    658         25.82        2024        35.35        600           33.50          1485        21.92
WikiTalk     726         24.16        DNC         OOM          1125          37.22          1860        32.00
LiveJournal  1800        50.00        DNC         OOM          5500          128.62         4515        84.00
Orkut        2000        62.00        DNC         OOM          DNC           OOM            20175       125.00
NScaleSpark: NScale on Spark
Building the GEP phase:
[Figure: the input graph data flows through a chain of RDD transformations t1, t2, ..., tn (RDD 1 → RDD 2 → ... → RDD n) that perform subgraph extraction and bin packing]
Executing user computation:
[Figure: the final RDD contains graph objects G1, ..., Gn; each graph object groups subgraphs (SG1, ..., SG5) together using the bin packing algorithm; a map transformation over this Spark RDD of graph objects transparently instantiates a node-master execution engine instance that runs the user computation]
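The "map transformation" box can be read as follows: once GEP has produced an RDD of packed graph objects, a single Spark transformation runs the user kernel on every subgraph inside each object. The sketch below is not NScaleSpark's actual code; GraphObject and Subgraph are hypothetical placeholder types, and only the standard JavaRDD.flatMap call (Spark 2.x signature, which expects an Iterator) is assumed from Spark.
```java
// Sketch of running a per-subgraph kernel via one transformation over packed graph objects.
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;
import java.io.Serializable;
import java.util.List;

public class NScaleOnSparkSketch {
    // Hypothetical stand-in for one extracted subgraph plus its user kernel.
    public interface Subgraph extends Serializable {
        long queryVertexId();
        double lcc();                       // the user computation applied to this subgraph
    }

    // Hypothetical stand-in for a bin of subgraphs produced by the GEP phase.
    public static class GraphObject implements Serializable {
        public List<Subgraph> subgraphs;
    }

    // One task per graph object: iterate its subgraphs and apply the kernel locally.
    public static JavaRDD<Tuple2<Long, Double>> runKernel(JavaRDD<GraphObject> packed) {
        return packed.flatMap(g ->
                g.subgraphs.stream()
                        .map(sg -> new Tuple2<Long, Double>(sg.queryVertexId(), sg.lcc()))
                        .iterator());
    }
}
```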
Arabesque
“Think-like-an-embedding” paradigm
User specifies what types of embeddings to construct, and whether they are grown edge-at-a-time or vertex-at-a-time
User provides functions to filter and process partial embeddings
Arabesque responsibilities: graph exploration, load balancing, aggregation (isomorphism), automorphism detection
User responsibilities: Filter, Process
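As an illustration of how little the user writes, here is a sketch of what the filter/process pair might look like for a clique-finding task: filter() keeps extending an embedding only while it is still a clique, and process() emits embeddings that reach the desired size. The Embedding interface and the size threshold are simplified assumptions, not Arabesque's exact API; exploration, load balancing, and automorphism detection stay with the framework.
```java
// Sketch of the user-side filter/process contract for finding cliques.
import java.util.List;

public class CliqueTask {
    public interface Embedding {                       // hypothetical stand-in
        List<Long> vertices();
        int numEdges();
    }

    // Extend an embedding only while it remains a clique.
    public static boolean filter(Embedding e) {
        int n = e.vertices().size();
        return e.numEdges() == n * (n - 1) / 2;
    }

    // Called on embeddings that passed the filter; emit cliques of an assumed minimum size.
    public static void process(Embedding e) {
        if (e.vertices().size() >= 3) {
            System.out.println("clique: " + e.vertices());
        }
    }
}
```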
Arabesque: Evaluation
On a single thread, performance is comparable to centralized implementations
Drastically more scalable to large graphs and clusters
Conclusion & Future Direction
262
End-to-End Richer Big Graph Analytics
»Keyword search (Elasticsearch)
»Graph query (Neo4j)
»Graph analytics (Giraph)
»Machine learning (Spark, TensorFlow)
»SQL query (Hive, Impala, Spark SQL, etc.)
»Stream processing (Flink, Spark Streaming, etc.)
»JSON processing (AsterixDB, Drill, etc.)
Converged programming abstractions and platforms?
Conclusion & Future Direction
Frameworks for computation-intensive jobs
High-speed network for data-intensive jobs
New hardware support
263
264
Thanks!
Big Graph Analytics Systems (Sigmod16 Tutorial)

  • 1.
    Big Graph AnalyticsSystems DaYan The Chinese University of Hong Kong The Univeristy of Alabama at Birmingham Yingyi Bu Couchbase, Inc. Yuanyuan Tian IBM Research Almaden Center Amol Deshpande University of Maryland James Cheng The Chinese University of Hong Kong
  • 2.
  • 3.
    Big Graph Systems General-PurposeGraph Analytics Programming Language »Java, C/C++, Scala, Python … »Domain-Specific Language (DSL) 3
  • 4.
    Big Graph Systems ProgrammingModel »Think Like aVertex • Message passing • Shared MemoryAbstraction »Matrix Algebra »Think Like a Graph »Datalog 4
  • 5.
    Big Graph Systems OtherFeatures »Execution Mode: Sync or Async ? »Environment: Single-Machine or Distributed ? »Support for Topology Mutation »Out-of-Core Support »Support forTemporal Dynamics »Data-Intensive or Computation-Intensive ? 5
  • 6.
    Tutorial Outline Message PassingSystems Shared Memory Abstraction Single-Machine Systems Matrix-Based Systems Temporal Graph Systems DBMS-Based Systems Subgraph-Based Systems 6 Vertex-Centric Hardware-Related Computation-Intensive
  • 7.
    Tutorial Outline Message PassingSystems Shared Memory Abstraction Single-Machine Systems Matrix-Based Systems Temporal Graph Systems DBMS-Based Systems Subgraph-Based Systems 7
  • 8.
    Message Passing Systems 8 Google’sPregel [SIGMOD’10] »Think like a vertex »Message passing »Iterative • Superstep
  • 9.
    Message Passing Systems 9 Google’sPregel [SIGMOD’10] »Vertex Partitioning 0 1 2 3 4 5 6 7 8 0 1 3 1 0 2 3 2 1 3 4 7 3 0 1 2 7 4 2 5 7 5 4 6 6 5 8 7 2 3 4 8 8 6 7 M0 M1 M2
  • 10.
    Message Passing Systems 10 Google’sPregel [SIGMOD’10] »Programming Interface • u.compute(msgs) • u.send_msg(v, msg) • get_superstep_number() • u.vote_to_halt() Called inside u.compute(msgs)
  • 11.
    Message Passing Systems 11 Google’sPregel [SIGMOD’10] »Vertex States • Active / inactive • Reactivated by messages »Stop Condition • All vertices halted, and • No pending messages
  • 12.
    Message Passing Systems 12 Google’sPregel [SIGMOD’10] »Hash-Min: Connected Components 7 0 1 2 3 4 5 67 8 0 6 85 2 4 1 3 Superstep 1
  • 13.
    Message Passing Systems 13 Google’sPregel [SIGMOD’10] »Hash-Min: Connected Components 5 0 1 2 3 4 5 67 8 0 0 60 0 2 0 1 Superstep 2
  • 14.
    Message Passing Systems 14 Google’sPregel [SIGMOD’10] »Hash-Min: Connected Components 0 0 1 2 3 4 5 67 8 0 0 00 0 0 0 0 Superstep 3
  • 15.
    Message Passing Systems 15 PracticalPregel Algorithm (PPA) [PVLDB’14] »First cost model for Pregel algorithm design »PPAs for fundamental graph problems • Breadth-first search • List ranking • Spanning tree • Euler tour • Pre/post-order traversal • Connected components • Bi-connected components • Strongly connected components • ...
  • 16.
    Message Passing Systems 16 PracticalPregel Algorithm (PPA) [PVLDB’14] »Linear cost per superstep • O(|V| + |E|) message number • O(|V| + |E|) computation time • O(|V| + |E|) memory space »Logarithm number of supersteps • O(log |V|) supersteps O(log|V|) = O(log|E|) How about load balancing?
  • 17.
    Message Passing Systems 17 BalancedPPA (BPPA) [PVLDB’14] »din(v): in-degree of v »dout(v): out-degree of v »Linear cost per superstep • O(din(v) + dout(v)) message number • O(din(v) + dout(v)) computation time • O(din(v) + dout(v)) memory space »Logarithm number of supersteps
  • 18.
    Message Passing Systems 18 BPPAExample: List Ranking [PVLDB’14] »A basic operation of Euler tour technique »Linked list where each element v has • Value val(v) • Predecessor pred(v) »Element at the head has pred(v) = NULL 11111NULL v1 v2 v3 v4 v5 Toy Example: val(v) = 1 for all v
  • 19.
    Message Passing Systems 19 BPPAExample: List Ranking [PVLDB’14] »Compute sum(v) for each element v • Summing val(v) and values of all predecessors »WhyTeraSort cannot work? 54321NULL v1 v2 v3 v4 v5
  • 20.
    Message Passing Systems 20 BPPAExample: List Ranking [PVLDB’14] »Pointer jumping / path doubling • sum(v) ← sum(v) + sum(pred(v)) • pred(v) ← pred(pred(v)) 11111NULL v1 v2 v3 v4 v5 As long as pred(v) ≠ NULL
  • 21.
    Message Passing Systems 21 BPPAExample: List Ranking [PVLDB’14] »Pointer jumping / path doubling • sum(v) ← sum(v) + sum(pred(v)) • pred(v) ← pred(pred(v)) 11111NULL 22221NULL v1 v2 v3 v4 v5
  • 22.
    Message Passing Systems 22 BPPAExample: List Ranking [PVLDB’14] »Pointer jumping / path doubling • sum(v) ← sum(v) + sum(pred(v)) • pred(v) ← pred(pred(v)) NULL 22221NULL 44321NULL v1 v2 v3 v4 v5 11111
  • 23.
    Message Passing Systems 23 BPPAExample: List Ranking [PVLDB’14] »Pointer jumping / path doubling • sum(v) ← sum(v) + sum(pred(v)) • pred(v) ← pred(pred(v)) NULL 22221NULL 44321NULL 54321NULL v1 v2 v3 v4 v5 11111 O(log |V|) supersteps
  • 24.
    Message Passing Systems 24 Optimizationsin Communication Mechanism
  • 25.
    Message Passing Systems 25 ApacheGiraph »Superstep splitting: reduce memory consumption »Only effective when compute(.) is distributive u1 u2 u3 u4 u5 u6 v 0 1 1 1 1 1 1
  • 26.
    Message Passing Systems 26 ApacheGiraph »Superstep splitting: reduce memory consumption »Only effective when compute(.) is distributive u1 u2 u3 u4 u5 u6 v 0 6
  • 27.
    Message Passing Systems 27 ApacheGiraph »Superstep splitting: reduce memory consumption »Only effective when compute(.) is distributive u1 u2 u3 u4 u5 u6 v 6
  • 28.
    Message Passing Systems 28 ApacheGiraph »Superstep splitting: reduce memory consumption »Only effective when compute(.) is distributive u1 u2 u3 u4 u5 u6 v 0 1 1 1
  • 29.
    Message Passing Systems 29 ApacheGiraph »Superstep splitting: reduce memory consumption »Only effective when compute(.) is distributive u1 u2 u3 u4 u5 u6 v 0 3
  • 30.
    Message Passing Systems 30 ApacheGiraph »Superstep splitting: reduce memory consumption »Only effective when compute(.) is distributive u1 u2 u3 u4 u5 u6 v 3
  • 31.
    Message Passing Systems 31 ApacheGiraph »Superstep splitting: reduce memory consumption »Only effective when compute(.) is distributive u1 u2 u3 u4 u5 u6 v 3 1 1 1
  • 32.
    Message Passing Systems 32 ApacheGiraph »Superstep splitting: reduce memory consumption »Only effective when compute(.) is distributive u1 u2 u3 u4 u5 u6 v 3 3
  • 33.
    Message Passing Systems 33 ApacheGiraph »Superstep splitting: reduce memory consumption »Only effective when compute(.) is distributive u1 u2 u3 u4 u5 u6 v 6
  • 34.
    Message Passing Systems 34 Pregel+[WWW’15] »Vertex Mirroring »Request-Respond Paradigm
  • 35.
    Message Passing Systems 35 Pregel+[WWW’15] »Vertex Mirroring M3 w1 w2 wk …… M2 v1 v2 vj …… M1 u1 u2 ui …… … …
  • 36.
    Message Passing Systems 36 Pregel+[WWW’15] »Vertex Mirroring M3 w1 w2 wk …… M2 v1 v2 vj …… M1 u1 u2 ui …… uiui … …
  • 37.
    Message Passing Systems 37 Pregel+[WWW’15] »Vertex Mirroring: Create mirror for u4? M1 u1 u4 … v1 v2 v4v1 v2 v3 u2 v1 v2 u3 v1 v2 M2 v1 v4 … v2 v3
  • 38.
    Message Passing Systems 38 Pregel+[WWW’15] »Vertex Mirroring v.s. Message Combining M1 u1 u4 … v1 v2 v4v1 v2 v3 u2 v1 v2 u3 v1 v2 M1 u1 u4 … u2 u3 M2 v1 v4 … v2 v3 a(u1) + a(u2) + a(u3) + a(u4)
  • 39.
    Message Passing Systems 39 Pregel+[WWW’15] »Vertex Mirroring v.s. Message Combining M1 u1 u4 … v1 v2 v4v1 v2 v3 u2 v1 v2 u3 v1 v2 M1 u1 u4 … u2 u3 M2 v1 v4 … v2 v3 u4 a(u1) + a(u2) + a(u3) a(u4)
  • 40.
    Message Passing Systems 40 Pregel+[WWW’15] »Vertex Mirroring: Only mirror high-degree vertices »Choice of degree threshold τ • M machines, n vertices, m edges • Average degree: degavg = m / n • Optimal τ is M · exp{degavg / M}
  • 41.
    Message Passing Systems 41 Pregel+[WWW’15] » Request-Respond Paradigm v1 v4 v2 v3 u M1 a(u) M2 <v1> <v2> <v3> <v4>
  • 42.
    Message Passing Systems 42 Pregel+[WWW’15] » Request-Respond Paradigm v1 v4 v2 v3 u M1 a(u) M2 a(u) a(u) a(u) a(u)
  • 43.
    Message Passing Systems 43 Pregel+[WWW’15] »A vertex v can request attribute a(u) in superstep i » a(u) will be available in superstep (i + 1)
  • 44.
    Message Passing Systems 44 v1 v4 v2 v3 u M1 D[u] M2 requestu u | D[u] Pregel+ [WWW’15] »A vertex v can request attribute a(u) in superstep I » a(u) will be available in superstep (i + 1)
  • 45.
  • 46.
    Message Passing Systems 46 VertexMigration »WindCatch [ICDE’13] • Runtime improved by 31.5% for PageRank (best) • 2% for shortest path computation • 9% for maximal matching »Stanford’s GPS [SSDBM’13] »Mizan [EuroSys’13] • Hash-based and METIS partitioning: no improvement • Range-based partitioning: around 40% improvement
  • 47.
    Message Passing Systems DynamicConcurrency Control »PAGE [TKDE’15] • Better partitioning → slower ? 47
  • 48.
    Message Passing Systems DynamicConcurrency Control »PAGE [TKDE’15] • Message generation • Local message processing • Remote message processing 48
  • 49.
    Message Passing Systems DynamicConcurrency Control »PAGE [TKDE’15] • Monitors speeds of the 3 operations • Dynamically adjusts number of threads for the 3 operations • Criteria - Speed of message processing = speed of incoming messages - Thread numbers for local & remote message processing are proportional to speed of local & remote message processing 49
  • 50.
    Message Passing Systems 50 Out-of-CoreSupport java.lang.OutOfMemoryError: Java heap space 26 cases reported by Giraph-users mailing list during 08/2013~08/2014!
  • 51.
    Message Passing Systems 51 Pregelix[PVLDB’15] »Transparent out-of-core support »Physical flexibility (Environment) »Software simplicity (Implementation) Hyracks Dataflow Engine
  • 52.
  • 53.
  • 54.
    Message Passing Systems 54 GraphD »Hardwarefor small startups and average researchers • Desktop PCs • Gigabit Ethernet switch »Features of a common cluster • Limited memory space • Disk streaming bandwidth >> network bandwidth » Each worker stores and streams edges and messages on local disks » Cost of buffering msgs on disks hidden inside msg transmission
  • 55.
  • 56.
    Message Passing Systems 56 CoordinatedCheckpointing of Pregel »Every δ supersteps »Recovery from machine failure: • Standby machine • Repartitioning among survivors An illustration with δ = 5
  • 57.
    Message Passing Systems 57 CoordinatedCheckpointing of Pregel W1 W2 W3 … … … Superstep 4 W1 W2 W3 5 W2 W3 6 W1 W2 W3 7 Failure occurs W1 Write checkpoint to HDFS Vertex states, edge changes, shuffled messages
  • 58.
    Message Passing Systems 58 CoordinatedCheckpointing of Pregel W1 W2 W3 … … … Superstep 4 W1 W2 W3 5 W1 W2 W3 6 W1 W2 W3 7 Load checkpoint from HDFS
  • 59.
    Message Passing Systems 59 Chandy-LamportSnapshot [TOCS’85] »Uncoordinated checkpointing (e.g., for async exec) »For message-passing systems »FIFO channels u v 5 5 u : 5
  • 60.
    Message Passing Systems 60 Chandy-LamportSnapshot [TOCS’85] »Uncoordinated checkpointing (e.g., for async exec) »For message-passing systems »FIFO channels u v u : 5 4 4 5
  • 61.
    Message Passing Systems 61 Chandy-LamportSnapshot [TOCS’85] »Uncoordinated checkpointing (e.g., for async exec) »For message-passing systems »FIFO channels u v u : 5 4 4
  • 62.
    Message Passing Systems 62 Chandy-LamportSnapshot [TOCS’85] »Uncoordinated checkpointing (e.g., for async exec) »For message-passing systems »FIFO channels u v u : 5 v : 4 4 4
  • 63.
    Message Passing Systems 63 Chandy-LamportSnapshot [TOCS’85] »Solution: bcast checkpoint request right after checkpointed u v 5 5 u : 5 REQ v : 5
  • 64.
    Message Passing Systems 64 Recoveryby Message-Logging [PVLDB’14] »Each worker logs its msgs to local disks • Negligible overhead, cost hidden »Survivor • No re-computaton during recovery • Forward logged msgs to replacing workers »Replacing worker • Re-compute from latest checkpoint • Only send msgs to replacing workers
  • 65.
    Message Passing Systems 65 Recoveryby Message-Logging [PVLDB’14] W1 W2 W3 … … … Superstep 4 W1 W2 W3 5 W2 W3 6 W1 W2 W3 7 Failure occurs W1 Log msgsLog msgsLog msgs Log msgsLog msgsLog msgs
  • 66.
    Message Passing Systems 66 Recoveryby Message-Logging [PVLDB’14] W1 W2 W3 … … … Superstep 4 W1 W2 W3 5 W1 W2 W3 6 W1 W2 W3 7 Standby Machine Load checkpoint
  • 67.
  • 68.
    Message Passing Systems 68 Block-CentricComputation »Main Idea • A block refers to a connected subgraph • Messages exchange among blocks • Serial in-memory algorithm within a block
  • 69.
    Message Passing Systems 69 Block-CentricComputation »Motivation: graph characteristics adverse to Pregel • Large graph diameter • High average vertex degree
  • 70.
    Message Passing Systems 70 Block-CentricComputation »Benefits • Less communication workload • Less number of supersteps • Less number of computing units
  • 71.
    Message Passing Systems 71 Giraph++[PVLDB’13] » Pioneering: think like a graph » METIS-style vertex partitioning » Partition.compute(.) » Boundary vertex values sync-ed at superstep barrier » Internal vertex values can be updated anytime
  • 72.
    Message Passing Systems 72 Blogel[PVLDB’14] » API: vertex.compute(.) + block.compute(.) »A block can have its own fields »A block/vertex can send msgs to another block/vertex »Example: Hash-Min • Construct block-level graph: to compute an adjacency list for each block • Propagate min block ID among blocks
  • 73.
    Message Passing Systems 73 Blogel[PVLDB’14] »Performance on Friendster social network with 65.6 M vertices and 3.6 B edges 1 10 100 1000 2.52 120.24 ComputingTime Blogel Pregel+ 1 100 10,000 19 7,227 MILLION Total Msg # Blogel Pregel+ 0 10 20 30 5 30 Superstep # Blogel Pregel+
  • 74.
    Message Passing Systems 74 Blogel[PVLDB’14] »Web graph: URL-based partitioning »Spatial networks: 2D partitioning »General graphs: graphVoronoi diagram partitioning
  • 75.
    Blogel [PVLDB’14] » GraphVoronoiDiagram (GVD) partitioning 75 Three seeds v is 2 hops from red seed v is 3 hops from green seed v is 5 hops from blue seedv Message Passing Systems
  • 76.
    Blogel [PVLDB’14] »Sample seedvertices with probability p 76 Message Passing Systems
  • 77.
    Blogel [PVLDB’14] »Sample seedvertices with probability p 77 Message Passing Systems
  • 78.
    Blogel [PVLDB’14] »Sample seedvertices with probability p »Compute GVD grouping • Vertex-centric multi-source BFS 78 Message Passing Systems
  • 79.
    Blogel [PVLDB’14] 79State afterSeed Sampling Message Passing Systems
  • 80.
  • 81.
  • 82.
  • 83.
    Blogel [PVLDB’14] »Sample seedvertices with probability p »Compute GVD grouping »Postprocessing 83 Message Passing Systems
  • 84.
    Blogel [PVLDB’14] »Sample seedvertices with probability p »Compute GVD grouping »Postprocessing • For very large blocks, resample with a larger p and repeat 84 Message Passing Systems
  • 85.
    Blogel [PVLDB’14] »Sample seedvertices with probability p »Compute GVD grouping »Postprocessing • For very large blocks, resample with a larger p and repeat • For tiny components, find them using Hash-Min at last 85 Message Passing Systems
  • 86.
    GVD Partitioning Performance 86 2026.65 505.85 186.89 105.4875.88 70.68 0 500 1000 1500 2000 2500 3000 WebUK Friendster BTC LiveJournal USA Road Euro Road Loading Partitioning Dumping Message Passing Systems
  • 87.
  • 88.
    Maiter [TPDS’14] » Foralgos where vertex values converge asymmetrically » Delta-based accumulative iterative computation (DAIC) 88 Message Passing Systems v1 v2 v3 v4
  • 89.
    Maiter [TPDS’14] » Foralgos where vertex values converge asymmetrically » Delta-based accumulative iterative computation (DAIC) » Strict transformation from Pregel API to DAIC formulation »Delta may serve as priority score »Natural for block-centric frameworks 89 Message Passing Systems
  • 90.
  • 91.
    Quegel [PVLDB’16] » On-demandanswering of light-workload graph queries • Only a portion of the whole graph gets accessed » Option 1: to process queries one job after another • Network underutilization, too many barriers • High startup overhead (e.g., graph loading) 91 Message Passing Systems
  • 92.
    Quegel [PVLDB’16] » On-demandanswering of light-workload graph queries • Only a portion of the whole graph gets accessed » Option 2: to process a batch of queries in one job • Programming complexity • Straggler problem 92 Message Passing Systems
  • 93.
    Quegel [PVLDB’16] »Execution model:superstep-sharing • Each iteration is called a super-round • In a super-round, every query proceeds by one superstep 93 Message Passing Systems Super–Round # 1 q1 2 3 4 1 2 3 4 q3q2 q4 Time Queries 5 6 q1 q2 q3 q4 7 1 2 3 4 1 2 3 4 1 2 3 4
  • 94.
    Quegel [PVLDB’16] »Benefits • Messagesof multiple queries transmitted in one batch • One synchronization barrier for each super-round • Better load balancing 94 Message Passing Systems Worker 1 Worker 2 time sync sync sync Individual Synchronization Superstep-Sharing
  • 95.
    Quegel [PVLDB’16] »API issimilar to Pregel »The system does more: • Q-data: superstep number, control information, … • V-data: adjacency list, vertex/edge labels • VQ-data: vertex state in the evaluation of each query 95 Message Passing Systems
  • 96.
    Quegel [PVLDB’16] »Create aVQ-dataof v for q, only when q touches v »Garbage collection of Q-data andVQ-data »Distributed indexing 96 Message Passing Systems
  • 97.
    Tutorial Outline Message PassingSystems Shared Memory Abstraction Single-Machine Systems Matrix-Based Systems Temporal Graph Systems DBMS-Based Systems Subgraph-Based Systems 97
  • 98.
    Shared-Mem Abstraction 98 Single Machine (UAI2010) Distributed GraphLab (PVLDB 2012) PowerGraph (OSDI 2012)
  • 99.
    Shared-Mem Abstraction Distributed GraphLab[PVLDB’12] »Scope of vertex v 99 u v w Du Dv Dw D(u,v) D(v,w) ………… ………… All that v can access
  • 100.
    Shared-Mem Abstraction Distributed GraphLab[PVLDB’12] » Async exec mode: for asymmetric convergence • Scheduler, serializability » API:v.update() • Access & update data in v’s scope • Add neighbors to scheduler 100
  • 101.
    Shared-Mem Abstraction Distributed GraphLab[PVLDB’12] » Vertices partitioned among machines » For edge (u, v), scopes of u and v overlap • Du, Dv and D(u, v) • Replicated if u and v are on different machines » Ghosts: overlapped boundary data • Value-sync by a versioning system » Memory space problem • x {# of machines} 101
  • 102.
    Shared-Mem Abstraction PowerGraph [OSDI’12] »API: Gather-Apply-Scatter (GAS) • PageRank: out-degree = 2 for all in-neighbors 102 1 1 1 1 1 1 1 0
  • 103.
    Shared-Mem Abstraction PowerGraph [OSDI’12] »API: Gather-Apply-Scatter (GAS) • PageRank: out-degree = 2 for all in-neighbors 103 1 1 1 1 1 1 1 1/2 0 1/2 1/2
  • 104.
    Shared-Mem Abstraction PowerGraph [OSDI’12] »API: Gather-Apply-Scatter (GAS) • PageRank: out-degree = 2 for all in-neighbors 104 1 1 1 1 1 1 1 1.5
  • 105.
    Shared-Mem Abstraction PowerGraph [OSDI’12] »API: Gather-Apply-Scatter (GAS) • PageRank: out-degree = 2 for all in-neighbors 105 1 1 1 1.5 1 1 1 0 Δ = 0.5 > ϵ
  • 106.
    Shared-Mem Abstraction PowerGraph [OSDI’12] »API: Gather-Apply-Scatter (GAS) • PageRank: out-degree = 2 for all in-neighbors 106 1 1 1 1.5 1 1 1 0 activated activated activated
  • 107.
    Shared-Mem Abstraction PowerGraph [OSDI’12] »EdgePartitioning »Goals: • Loading balancing • Minimize vertex replicas – Cost of value sync – Cost of memory space 107
  • 108.
    Shared-Mem Abstraction PowerGraph [OSDI’12] »GreedyEdge Placement 108 u v W1 W2 W3 W4 W5 W6 Workload 100 101 102 103 104 105
  • 109.
    Shared-Mem Abstraction PowerGraph [OSDI’12] »GreedyEdge Placement 109 u v W1 W2 W3 W4 W5 W6 Workload 100 101 102 103 104 105
  • 110.
    Shared-Mem Abstraction PowerGraph [OSDI’12] »GreedyEdge Placement 110 u v W1 W2 W3 W4 W5 W6 Workload 100 101 102 103 104 105 ∅ ∅
  • 111.
  • 112.
    Shared-Mem Abstraction Shared-Mem +Single-Machine »Out-of-core execution, disk/SSD-based • GraphChi [OSDI’12] • X-Stream [SOSP’13] • VENUS [ICDE’14] • … »Vertices are numbered 1, …, n; cut into P intervals 112 interval(2) interval(P) 1 nv1 v2 interval(1)
  • 113.
    Shared-Mem Abstraction GraphChi [OSDI’12] »ProgrammingModel • Edge scope of v 113 u v w Du Dv Dw D(u,v) D(v,w) ………… …………
  • 114.
    Shared-Mem Abstraction GraphChi [OSDI’12] »ProgrammingModel • Scatter & gather values along adjacent edges 114 u v w Dv D(u,v) D(v,w) ………… …………
  • 115.
    Shared-Mem Abstraction GraphChi [OSDI’12] »Loadvertices of each interval, along with adjacent edges for in-mem processing »Write updated vertex/edge values back to disk »Challenges • Sequential IO • Consistency: store each edge value only once on disk 115 interval(2) interval(P) 1 nv1 v2 interval(1)
  • 116.
    Shared-Mem Abstraction GraphChi [OSDI’12] »Diskshards: shard(i) • Vertices in interval(i) • Their incoming edges, sorted by source_ID 116 interval(2) interval(P) 1 nv1 v2 interval(1) shard(P)shard(2)shard(1)
  • 117.
    Shared-Mem Abstraction GraphChi [OSDI’12] »ParallelSlidingWindows (PSW) 117 Shard 1 in-edgessortedby src_id Vertices 1..100 Vertices 101..200 Vertices 201..300 Vertices 301..400 Shard 2 Shard 3 Shard 4Shard 1
  • 118.
    Shared-Mem Abstraction GraphChi [OSDI’12] »ParallelSlidingWindows (PSW) 118 Shard 1 in-edgessortedby src_id Vertices 1..100 Vertices 101..200 Vertices 201..300 Vertices 301..400 Shard 2 Shard 3 Shard 4Shard 1 100 100 100 1 1 1 1 Out-Edges Vertices & In-Edges 100
  • 119.
    Shared-Mem Abstraction GraphChi [OSDI’12] »ParallelSlidingWindows (PSW) 119 Shard 1 in-edgessortedby src_id Vertices 1..100 Vertices 101..200 Vertices 201..300 Vertices 301..400 Shard 2 Shard 3 Shard 4Shard 1 1 1 1 1 100 100 100 200 Vertices & In-Edges 200 200 Out-Edges 100 200
  • 120.
    Shared-Mem Abstraction GraphChi [OSDI’12] »Eachvertex & edge value is read & written for at least once in an iteration 120
  • 121.
    Shared-Mem Abstraction X-Stream [SOSP’13] »Edge-scopeGAS programming model »Streams a completely unordered list of edges 121
  • 122.
    Shared-Mem Abstraction X-Stream [SOSP’13] »Simplecase: all vertex states are memory-resident »Pass 1: edge-centric scattering • (u, v): value(u) => <v, value(u, v)> »Pass 2: edge-centric gathering • <v, value(u, v)> => value(v) 122 update aggregate
  • 123.
    Shared-Mem Abstraction X-Stream [SOSP’13] »Out-of-CoreEngine • P vertex partitions with vertex states only • P edge partitions, partitioned by source vertices • Each pass loads a vertex partition, streams corresponding edge partition (or update partition) 123 interval(2) interval(P) 1 nv1 v2 interval(1) Fit into memory Larger than in GraphChi Streamed on disk P update files generated by Pass 1 scattering
  • 124.
    Shared-Mem Abstraction X-Stream [SOSP’13] »Out-of-CoreEngine • Pass 1: edge-centric scattering – (u, v): value(u) => [v, value(u, v)] • Pass 2: edge-centric scattering – [v, value(u, v)] => value(v) 124 interval(2) interval(P) 1 nv1 v2 interval(1) Append to update file for partition of v Streamed from update file for the corresponding vertex partition
  • 125.
    Shared-Mem Abstraction X-Stream [SOSP’13] »Scaleout: Chaos [SOSP’15] • Requires 40 GigE • Slow with GigE »Weakness: sparse computation 125
  • 126.
    Shared-Mem Abstraction VENUS [ICDE’14] »Programmingmodel • Value scope of v 126 u v w Du Dv Dw D(u,v) D(v,w) ………… …………
  • 127.
    Shared-Mem Abstraction VENUS [ICDE’14] »Assumestatic topology • Separate read-only edge data and mutable vertex states »g-shard(i): incoming edge lists of vertices in interval(i) »v-shard(i): srcs & dsts of edges in g-shard(i) »All g-shards are concatenated for streaming 127 interval(2) interval(P) 1 nv1 v2 interval(1) Sources may not be in interval(i) Vertices in a v-shard are ordered by ID
  • 128.
    Dsts of interval(i)may be srcs of other intervals Shared-Mem Abstraction VENUS [ICDE’14] »To process interval(i) • Load v-shard(i) • Stream g-shard(i), update in-memory v-shard(i) • Update every other v-shard by a sequential write 128 interval(2) interval(P) 1 nv1 v2 interval(1) Dst vertices are in interval(i)
  • 129.
    Shared-Mem Abstraction VENUS [ICDE’14] »Avoid writing O(|E|) edge values to disk » O(|E|) edge values are read once » O(|V|) may be read/written for multiple times 129 interval(2) interval(P) 1 nv1 v2 interval(1)
  • 130.
    Tutorial Outline Message PassingSystems Shared Memory Abstraction Single-Machine Systems Matrix-Based Systems Temporal Graph Systems DBMS-Based Systems Subgraph-Based Systems 130
  • 131.
    Single-Machine Systems Categories »Shared-mem out-of-core(GraphChi, X-Stream,VENUS) »Matrix-based (to be discussed later) »SSD-based »In-mem multi-core »GPU-based 131
  • 132.
  • 133.
    Single-Machine Systems SSD-Based Systems »Asyncrandom IO • Many flash chips, each with multiple dies »Callback function »Pipelined for high throughput 133
  • 134.
  • 135.
  • 136.
  • 137.
  • 138.
    Single-Machine Systems TurboGraph [KDD’13] 138 In-mempage table: vertex ID -> location on SSD 1-hop neighborhood: outperform GraphChi by 104
  • 139.
    Single-Machine Systems TurboGraph [KDD’13] 139 Specialtreatment for adj-list larger than a page
  • 140.
    Single-Machine Systems TurboGraph [KDD’13] »Pin-and-slideexecution model »Concurrently process vertices of pinned pages »Do not wait for completion of IO requests »Page unpinned as soon as processed 140
  • 141.
    Single-Machine Systems FlashGraph [FAST’15] »Semi-externalmemory • Edge lists on SSDs »On top of SAFS, an SSD file system • High-throughput async I/Os over SSD array • Edge lists stored in one (logical) file on SSD 141
  • 142.
    Single-Machine Systems FlashGraph [FAST’15] »Onlyaccess requested edge lists »Merge same-page / adjacent-page requests into one sequential access »Vertex-centricAPI »Message passing among threads 142
  • 143.
  • 144.
    Single-Machine Systems In-Memory ParallelFrameworks »Programming simplicity • Green-Marl, Ligra, GRACE »Full utilization of all cores in a machine • GRACE, Galois 144
  • 145.
    Single-Machine Systems Green-Marl [ASPLOS’12] »Domain-specificlanguage (DSL) • High-level language constructs • Expose data-level parallelism »DSL → C++ program »Initially single-machine, now supported by GPS 145
  • 146.
    Single-Machine Systems Green-Marl [ASPLOS’12] »ParallelFor »Parallel BFS »Reductions (e.g., SUM, MIN, AND) »Deferred assignment (<=) • Effective only at the end of the binding iteration 146
  • 147.
    Single-Machine Systems Ligra [PPoPP’13] »VertexSet-centricAPI: edgeMap, vertexMap »Example: BFS • Ui+1←edgeMap(Ui, F, C) 147 u v Ui Vertices for next iteration
  • 148.
    Single-Machine Systems Ligra [PPoPP’13] »VertexSet-centricAPI: edgeMap, vertexMap »Example: BFS • Ui+1←edgeMap(Ui, F, C) 148 u v Ui C(v) = parent[v] is NULL? Yes
  • 149.
    Single-Machine Systems Ligra [PPoPP’13] »VertexSet-centricAPI: edgeMap, vertexMap »Example: BFS • Ui+1←edgeMap(Ui, F, C) 149 u v Ui F(u, v): parent[v] ← u v added to Ui+1
  • 150.
    Single-Machine Systems Ligra [PPoPP’13] »Modeswitch based on vertex sparseness |Ui| • When | Ui | is large 150 u v Ui w C(w) called 3 times
  • 151.
    Single-Machine Systems Ligra [PPoPP’13] »Modeswitch based on vertex sparseness |Ui| • When | Ui | is large 151 u v Ui w if C(v) is true Call F(u, v) for every in-neighbor in U Early pruning: just the first one for BFS
  • 152.
    Single-Machine Systems GRACE [PVLDB’13] »Vertex-centricAPI,block-centric execution • Inner-block computation: vertex-centric computation with an inner-block scheduler »Reduce data access to computation ratio • Many vertex-centric algos are computationally-light • CPU cache locality: every block fits in cache 152
  • 153.
    Single-Machine Systems Galois [SOSP’13] »Amorphousdata-parallelism (ADP) • Speculative execution: fully use extra CPU resources 153 v’s neighborhoodu’s neighborhood u vw
  • 154.
    Single-Machine Systems Galois [SOSP’13] »Amorphousdata-parallelism (ADP) • Speculative execution: fully use extra CPU resources 154 v’s neighborhoodu’s neighborhood u vw Rollback
  • 155.
    Single-Machine Systems Galois [SOSP’13] »Amorphousdata-parallelism (ADP) • Speculative execution: fully use extra CPU resources »Machine-topology-aware scheduler • Try to fetch tasks local to the current core first 155
  • 156.
  • 157.
    Single-Machine Systems GPU Architecture »Arrayof streaming multiprocessors (SMs) »Single instruction, multiple threads (SIMT) »Different control flows • Execute all flows • Masking »Memory cache hierarchy 157 Small path divergence Coalesced memory accesses
  • 158.
    Single-Machine Systems GPU Architecture »Warp:32 threads, basic unit for scheduling »SM: 48 warps • Two streaming processors (SPs) • Warp scheduler: two warps executed at a time »Thread block / CTA (cooperative thread array) • 6 warps • Kernel call → grid of CTAs • CTAs are distributed to SMs with available resources 158
  • 159.
    Single-Machine Systems Medusa [TPDS’14] »BPSmodel of Pregel »Fine-grained API: Edge-Message-Vertex (EMV) • Large parallelism, small path divergence »Pre-allocates an array for buffering messages • Coalesced memory accesses: incoming msgs for each vertex is consecutive • Write positions of msgs do not conflict 159
  • 160.
    Single-Machine Systems CuSha [HPDC’14] »Applythe shard organization of GraphChi »Each shard processed by one CTA »Window concatenation 160 Window write-back: imbalanced workload Shard 1 n-edgessortedbysrc_id Vertices 1..100 Vertices 101..200 Vertices 201..300 Vertices 301..400 Shard 2 Shard 3 Shard 4Shard 1 1 1 1 1 100 100 100 200 200 200 100 200
  • 161.
    Single-Machine Systems CuSha [HPDC’14] »Applythe shard organization of GraphChi »Each shard processed by one CTA »Window concatenation 161 Threads in a CTA may cross window boundaries Pointers to actual locations in shards Window write-back: imbalanced workload
  • 162.
    Tutorial Outline Message PassingSystems Shared Memory Abstraction Single-Machine Systems Matrix-Based Systems Temporal Graph Systems DBMS-Based Systems Subgraph-Based Systems 162
  • 163.
    Matrix-Based Systems 163 Categories »Single-machine systems •Vertex-centric API • Matrix operations in the backend »Distributed frameworks • (Generalized) matrix-vector multiplication • Matrix algebra
  • 164.
    Matrix-Based Systems 164 Matrix-Vector Multiplication »Example:PageRank PRi(v1) PRi(v2) PRi(v3) PRi(v4) × = Pri+1(v1) PRi+1 (v2) PRi+1 (v3) PRi+1 (v4) Out-AdjacencyList(v1) Out-AdjacencyList(v2) Out-AdjacencyList(v3) Out-AdjacencyList(v4)
  • 165.
    Matrix-Based Systems 165 Generalized Matrix-VectorMultiplication »Example: HashMin mini(v1) mini(v2) mini(v3) mini(v4) × = mini+1(v1) mini+1 (v2) mini+1 (v3) mini+1 (v4) 0/1-AdjacencyList(v1) 0/1-AdjacencyList(v2) 0/1-AdjacencyList(v3) 0/1-AdjacencyList(v4) Add → Min Assign only when smaller
  • 166.
  • 167.
    Matrix-Based Systems GraphTwist [PVLDB’15] »Multi-levelgraph partitioning • Right granularity for in-memory processing • Balance workloads among computing threads 1671 n src dst 1 n u v w(u, v) edge-weight
  • 168.
    Matrix-Based Systems GraphTwist [PVLDB’15] »Multi-levelgraph partitioning • Right granularity for in-memory processing • Balance workloads among computing threads 1681 n src dst 1 n edge-weight slice
  • 169.
    Matrix-Based Systems GraphTwist [PVLDB’15] »Multi-levelgraph partitioning • Right granularity for in-memory processing • Balance workloads among computing threads 1691 n src dst 1 n edge-weight stripe
  • 170.
    Matrix-Based Systems GraphTwist [PVLDB’15] »Multi-levelgraph partitioning • Right granularity for in-memory processing • Balance workloads among computing threads 1701 n src dst 1 n edge-weight dice
  • 171.
    Matrix-Based Systems GraphTwist [PVLDB’15] »Multi-levelgraph partitioning • Right granularity for in-memory processing • Balance workloads among computing threads 1711 n src dst 1 n edge-weight u vertex cut
  • 172.
    Matrix-Based Systems GraphTwist [PVLDB’15] »Multi-levelgraph partitioning • Right granularity for in-memory processing • Balance workloads among computing threads »Fast Randomized Approximation • Prune statistically insignificant vertices/edges • E.g., PageRank computation only using high-weight edges • Unbiased estimator: sampling slices/cuts according to Frobenius norm 172
  • 173.
    Matrix-Based Systems GridGraph [ATC’15] »Gridrepresentation for reducing IO 173
  • 174.
    Matrix-Based Systems GridGraph [ATC’15] »Gridrepresentation for reducing IO »Streaming-apply API • Streaming edges of a block (Ii, Ij) • Aggregate value to v ∈ Ij 174
  • 175.
  • 176.
    Matrix-Based Systems GridGraph [ATC’15] »Illustration:column-by-column evaluation 176 Create in-mem Load
  • 177.
  • 178.
  • 179.
    Matrix-Based Systems GridGraph [ATC’15] »Illustration:column-by-column evaluation 179 Create in-mem Load
  • 180.
  • 181.
  • 182.
  • 183.
    Matrix-Based Systems GridGraph [ATC’15] »ReadO(P|V|) data of vertex chunks »Write O(|V|) data of vertex chunks (not O(|E|)!) »Stream O(|E|) data of edge blocks • Edge blocks are appended into one large file for streaming • Block boundaries recorded to trigger the pin/unpin of a vertex chunk 183
  • 184.
  • 185.
    Distributed Systems withMatrix- Based Interfaces • PEGASUS (CMU, 2009) • GBase (CMU & IBM, 2011) • SystemML (IBM, 2011) 185 Commonality: • Matrix-based programming interface to the users • Rely on MapReduce for execution.
  • 186.
    PEGASUS • Open source:http://www.cs.cmu.edu/~pegasus • Publications: ICDM’09,KAIS’10. • Intuition: many graph computation can be modeled by a generalized form of matrix-vector multiplication. 𝑣′ = 𝑀 × 𝑣 PageRank: 𝑣′ = 0.85 ∙ 𝐴 𝑇 + 0.15 ∙ 𝑈 × 𝑣 186
  • 187.
    PEGASUS Programming Interface:GIM-V Three Primitives: 1) combine2(mi,j , vj ) : combine mi,j and vj into xi,j 2) combineAlli (xi,1 , ..., xi,n ) : combine all the results from combine2() for node i into vi ' 3) assign(vi , vi ' ) : decide how to update vi with vi ' Iterative: Operation applied till algorithm-specific convergence criterion is met.
  • 188.
    PageRank Example 𝑣′ =0.85 ∙ 𝐴 𝑇 + 0.15 ∙ 𝑈 × 𝑣 𝒄𝒐𝒎𝒃𝒊𝒏𝒆𝟐 𝑎𝑖,𝑗, 𝑣𝑗 = 0.85 ∙ 𝑎𝑖,𝑗 ∙ 𝑣𝑗 𝒄𝒐𝒎𝒃𝒊𝒏𝒆𝑨𝒍𝒍𝒊 𝑥𝑖,1, … , 𝑥𝑖,𝑛 = 0.15 𝑛 + 𝑖=1 𝑛 𝑥𝑖,𝑗 𝒂𝒔𝒔𝒊𝒈𝒏 𝑣𝑗, 𝑣𝑗′ = 𝑣𝑗′ 188
  • 189.
    Execution Model Iterations ofa 2-stage algorithm (each stage is a MR job) • Input: Edge andVector file • Edge line : (idsrc , iddst , mval) -> cell adjacency Matrix M • Vector line: (id, vval) -> element inVectorV • Stage 1: performs combine2() on columns of iddst of M with rows of id ofV • Stage 2: combines all partial results and assigns new vector -> old vector 189
  • 190.
    Optimizations • Block Multiplication •Clustered Edges 190 • Diagonal Block Iteration for connected component detection * Figures are copied from Kang et al ICDM’09
  • 191.
    GBASE • Part ofthe IBM System GToolkit • http://systemg.research.ibm.com • Publications: SIGKDD’11,VLDBJ’12. • PEGASUS vs GBASE: • Common: • Matrix-vector multiplication as the core operation • Division of a matrix into blocks • Clustering nodes to form homogenous blocks • Different: 191 PEGASUS GBASE Queries global targeted & global User Interface customizableAPIs build-in algorithms Storage normal files compression, special placement Block Size Square blocks Rectangular blocks
  • 192.
    Block Compression andPlacement • Block Formation • Partition nodes using clustering algorithms e.g. Metis • Compressed block encoding • source and destination partition ID p and q; • the set of sources and the set of destinations • the payload, the bit string of subgraph G(p,q) • The payload is compressed using zip compression or gap Elias-γ encoding. • Block Placement • Grid placement to minimize the number of input HDFS files to answer queries 192* Figure is copied from Kang et al SIGKDD’11
  • 193.
    Built-In Algorithms inGBASE • Select grids containing the blocks relevant to the queries • Derive the incidence matrix from the original adjacency matrix as required 193* Figure is copied from Kang et al SIGKDD’11
  • 194.
    SystemML • Apache Opensource: https://systemml.apache.org • Publications: ICDE’11, ICDE’12,VLDB’14, Data Engineering Bulletin’14, ICDE’15, SIGMOD’15, PPOPP’15,VLDB16. • Comparison to PEGASUS and GBASE • Core: General linear algebra and math operations (beyond just matrix- vector multiplication) • Designed for machine learning in general • User Interface: A high-level language with similar syntax as R • Declarative approach to graph processing with cost-based and rule-based optimization • Run on multiple platforms including MapReduce, Spark and single node. 194
  • 195.
    SystemML – DeclarativeMachine Learning Analytics language for data scientists (“The SQL for analytics”) » Algorithms expressed in a declarative, high-level language with R-like syntax » Productivity of data scientists » Language embeddings for • Solutions development • Tools Compiler » Cost-based optimizer to generate execution plans and to parallelize • based on data characteristics • based on cluster and machine characteristics » Physical operators for in-memory single node and cluster execution Performance & Scalability 195
  • 196.
    SystemML Architecture Overview 196 Language(DML) • R- like syntax • Rich set of statistical functions • User-defined & external function • Parsing • Statement blocks & statements • Program Analysis, type inference, dead code elimination High-Level Operator (HOP) Component • Represent dataflow in DAGs of operations on matrices, scalars • Choosing from alternative execution plans based on memory and cost estimates: operator ordering & selection; hybrid plans Low-Level Operator (LOP) Component • Low-level physical execution plan (LOPDags) over key-value pairs • “Piggybacking” operations into minimal number Map-Reduce jobs Runtime • Hybrid Runtime • CP: single machine operations & orchestrate MR jobs • MR: generic Map-Reduce jobs & operations • SP: Spark Jobs • Numerically stable operators • Dense / sparse matrix representation • Multi-Level buffer pool (caching) to evict in-memory objects • Dynamic Recompilation for initial unknowns Command Line JMLC Spark MLContext Spark ML APIs High-Level Operators Parser/Language Low-Level Operators Compiler Runtime Control Program Runtime Program Buffer Pool ParFor Optimizer/ Runtime MR InstSpark Inst CP Inst Recompiler Cost-based optimizations DFS IOMem/FS IO Generic MR Jobs MatrixBlock Library (single/multi-threaded)
  • 197.
    Pros and Consof Matrix-Based Graph Systems Pros: - Intuitive for analytic users familiar with linear algebra - E.g. SystemML provides a high-level language familiar to a lot of analysts Cons: - PEGASUS and GBASE require an expensive clustering of nodes as a preprocessing step. - Not all graph algorithms can be expressed using linear algebra - Unnecessary computation compared to vertex-centric model 197
  • 198.
    Tutorial Outline Message PassingSystems Shared Memory Abstraction Single-Machine Systems Matrix-Based Systems Temporal Graph Systems DBMS-Based Systems Subgraph-Based Systems 198
  • 199.
    Temporal and StreamingGraph Analytics • Motivation: Real world graphs often evolve over time. • Two body of work: • Real-time analysis on streaming graph data • E.g. Calculate each vertex’s current PageRank • Temporal analysis over historical traces of graphs • E.g. Analyzing the change of each vertex’s PageRank for a given time range 199
  • 200.
    Common Features forAll Systems • Temporal Graph: a continuous stream of graph updates • Graph update: addition or deletion of vertex/edge, or the update of the attribute associated with node/edge. • Most systems separate graph updates from graph computation. • Graph computation is only performed on a sequence of successive static views of the temporal graph • A graph snapshot is most commonly used static view • Using existing static graph programmingAPIs for temporal graph • Incremental graph computation • Leverage significant overlap of successive static views • Use ending vertex and edge states at time t as the starting states at time t+1 • Not applicable to all algorithms 200 Static view 1 Static view 2 Static view 3
  • 201.
    Overview • Real-time StreamingGraph Systems • Kineograph (distributed, Microsoft, 2012) • TIDE (distributed, IBM, 2015) • Historical Graph Systems • Chronos (distributed, Microsoft, 2014) • DeltaGraph (distributed, University of Maryland, 2013) • LLAMM (single-node, Harvard University & Oracle, 2015) 201
  • 202.
    Kineograph • Publication: Chenget al Eurosys’12 • Target query: continuously deliver analytics results on static snapshots of a dynamic graph periodically • Two layers: • Storage layer: continuously applies updates to a dynamic graph • Computation layer: performs graph computation on a graph snapshot 202
  • 203.
    Kineograph Architecture Overview •Graph is stored in a key/value store among graph nodes • Ingest nodes are the front end of incoming graph updates • Snapshooter uses an epoch commit protocol to produce snapshots • Progress table keeps track of the process by ingest nodes 203* Figure is copied from Cheng et al Eurosys’12
  • 204.
    Epoch Commit Protocol 204*Figure is copied from Cheng et al Eurosys’12
  • 205.
    Graph Computation • ApplyVertex-basedGAS computation model on snapshots of a dynamic graph • Supports both push and pull models for inter-vertex communication. 205* Figure is copied from Cheng et al Eurosys’12
  • 206.
    TIDE • Publication: Xieet al ICDE’15 • Target query: continuously deliver analytics results on a dynamic graph • Model social interactions as a dynamic interaction graph • New interactions (edges) continuously added • Probabilistic edge decay (PED) model to produce static views of dynamic graphs 206
  • 207.
    StaticViews ofTemporal Graph 207 E.g.,relationship between a and b is forgottena b a b Sliding Window Model  Consider recent graph data within a small time window  Problem: Abruptly forgets past data (no continuity) Snapshot Model  Consider all graph data seen so far  Problem: Does not emphasize recent data (no recency)
  • 208.
    Probabilistic Edge DecayModel 208 Key Idea: Temporally Biased Sampling  Sample data items according to a probability that decreases over time  Sample contains a relatively high proportion of recent interactions Probabilistic View of an Edge’s Role  All edges have chance to be considered (continuity)  Outdated edges are less likely to be used (recency)  Can systematically trade off recency and continuity  Can use existing static-graph algorithms Create N sample graphs Discretized Time + Exponential Decay Typically reduces Monte Carlo variability
  • 209.
    Maintaining Sample GraphsinTIDE 209 Naïve Approach: Whenever a new batch of data comes in  Generate N sampled graphs  Run graph algorithm on each sample Idea #1: Exploit overlaps at successive time points  Subsample old edges of 𝐺𝑡 𝑖 – Selection probability = 𝑝 independently for each edge  Then add new edges  Theorem: 𝐺𝑡+1 has correct marginal probability 𝐺𝑡 𝑖 𝐺𝑡+1 𝑖
  • 210.
    Maintaining Sample Graphs,Continued 210 Idea #2: Exploit overlap between sample graphs at each time point  With high probability, more than 50% of edges overlap  So maintain aggregate graph 𝐺𝑡 1 𝐺𝑡 2 𝐺𝑡 3 𝐺𝑡 1,2 1,3 Memory requirements (batch size = 𝑴)  Snapshot model: continuously increasing memory requirement  PED model: bounded memory requirement – # Edges stored by storing graphs separately: 𝑂(𝑀𝑁) – # Edges stored by aggregate graph: 𝑂(𝑀 log 𝑁)
  • 211.
    Bulk Graph ExecutionModel 211 Iterative Graph processing (Pregel, GraphLab, Trinity, GRACE, …) • User-defined compute () function on each vertex v changes v + adjacent edges • Changes propagated to other vertices via message passing or scheduled updates Key idea in TIDE: Bulk execution: Compute results for multiple sample graphs simultaneously  Partition N sample graphs into bulk sets with s sample graphs each  Execute algorithm on aggregate graph of each bulk set (partial aggregate graph) Benefits  Same interface: users still think the computation is applied on one graph  Amortize overheads of extracting & loading from aggregate graph  Better memory locality (vertex operations)  Similar message values & similar state values  opportunities for compression (>2x speedup w. LZF)
  • 212.
    Overview • Real-time StreamingGraph Systems • Kineograph (distributed, Microsoft, 2012) • TIDE (distributed, IBM, 2015) • Historical Graph Systems • Chronos (distributed, Microsoft, 2014) • DeltaGraph (distributed, University of Maryland, 2013) • LLAMM (single-node, Harvard University & Oracle, 2015) 212
  • 213.
    Chronos • Publication: Hanet al Eurosys’14 • Target query: graph computation on the sequence of static snapshots of a temporal graph within a time range • E.g analyzing the change of each vertex’s PageRank for a given time range • Naïve approach: applying graph computation on each snapshot separately • Chronos: exploit the time locality of temporal graphs 213
  • 214.
    Structure Locality vsTimeLocality • Structure locality • States of neighboring vertices in the same snapshot are laid out close to each • Time locality (preferred in Chronos) • States of a vertex (or an edge) in consecutive snapshots are stored together 214* Figures are copied from Han et al EuroSys’14
  • 215.
    Chronos Design • In-memorygraph layout • Data of a vertex/edge in consecutive snapshots are placed together • Locality-aware batch scheduling (LABS) • Batch processing of a vertex across all the snapshorts • Batch information propagation to a neighbor vertex across snapshots • Incremental Computation • Use the results on 1st snapshot to batch compute on the remaining snapshots • Use the results on the insersection graph to batch compute on all snapshots • On-disk graph layout • Organized in snapshot groups • Stored as the first snapshot followed by the updates in the remaining snapshots in this group. 215
  • 216.
    DeltaGraph • Publication: Khuranaet al ICDE’13, EDBT’16 • Target query: access past states of the graphs and perform static graph analysis • E.g study the evolution of centrality measures, density, conductance, etc • Two major components: • Temporal Graph Index (TGI) • Temporal Graph Analytics Framework (TAF) 216
  • 217.
    DeltaGraph • Publication: Khuranaet al ICDE’13, EDBT’16 • Target query: access past states of the graphs and perform static graph analysis • E.g study the evolution of centrality measures, density, conductance, etc • Two major components: • Temporal Graph Index (TGI) • Temporal Graph Analytics Framework (TAF) 217
  • 218.
    Temporal Graph Index 218 •Partitioned delta and partitioned eventlist for scalability • Version chain for nodes • Sorted list of references to a node • Graph primitives • Snapshot retrieval • Node’s history • K-hop neighborhood • Neighborhood evolution
  • 219.
    Temporal Graph AnalyticsFramework • Node-centric graph extraction and analytical logic • Primary operand: Set of Nodes (SoN) refers to a collection of temporal nodes • Operations • Extract:Timeslice, Select, Filter, etc. • Compute: NodeCompute, NodeComputeTemporal, etc. • Analyze: Compare, Evolution, other aggregates 219
  • 220.
    LLAMA • Publication: Mackoet al ICDE’15 • Target query: perform various whole graph analysis on consistent views • A single machine system that stores and incrementally updates an evolving graph in multi- version representations • LLAMA provides a general purpose programming model instead of vertex- or edge- centric models 220
  • 221.
    Multi-Version CSR Representation •Augment the compact read-only CSR (compressed sparse row) representation to support mutability and persistence. • Large multi-versioned array (LAMA) with a software copy-on-write technique for snapshotting 221* Figure is copied from Macko et al ICDE’15
  • 222.
    Tutorial Outline Message PassingSystems Shared Memory Abstraction Single-Machine Systems Matrix-Based Systems Temporal Graph Systems DBMS-Based Systems Subgraph-Based Systems 222
  • 223.
    DBMS-Style Graph Systems Data-parallelQuery Execution Engine Query Optimizer Datalog SQL Pregel/GAS/... Graph Algorithms Storage Engine SociaLite/Myria REX GraphX/Pregelix Naiad Pregel Vertexica
  • 224.
    Reason #1 Expressiveness »Transitive closure »Allpair shortest paths Vertex-centric API? public class AllPairShortestPaths extendsVertex<VLongWritable, DoubleWritable, FloatWritable, DoubleWritable> { private Map<VLongWritable, DoubleWritable> distances = new HashMap<>(); @Override public void compute(Iterator<DoubleWritable> msgIterator) { ....... } }
  • 225.
    Reason #2 Easy OPS– Unified logs, tooling, configuration…!
  • 226.
    Reason #3 Efficient ResourceUtilization and Robustness ~30 similar threads on Giraph-users mailing list during the year 2015! “I’m trying to run the sample connected components algorithm on a large data set on a cluster, but I get a ‘java.lang.OutOfMemoryError: Java heap space’ error.”
  • 227.
    Reason #4 One-size fits-all? Physicalflexibility and adaptivity »PageRank, SSSP, CC,TriangleCounting »Web graph, social network, RDF graph »8 cheap machine school cluster, 200 beefy machine at an enterprise data center
  • 228.
    What’s graph analytics? 304Million Monthly Active Users 500 Million Tweets Per Day! 200 Billion Tweets Per Year!
  • 229.
    TwitterMsg( tweetid: int64, user: string, sender_location:point, send_time: datetime, reply_to: int64, retweet_from: int64, referred_topics: array<string>, message_text: string ); Reason #5 Easy Data Science INSERT OVERWRITE TABLE MsgGraph SELECT T.tweetid, 1.0/10000000000.0, CASE WHENT.reply_to >=0 RETURN array(T.reply_to) ELSE RETURN array(T.forward_from) END CASE FROMTwitterMsg AST WHERET.reply_to>=0 ORT.retweet_from>=0 SELECT R.user, SUM(R.rank)AS influence FROM Result R,TwitterMsgTM WHERE R.vertexid=TM.tweetid GROUP BY R.user ORDER BY influence DESC LIMIT 50; Giraph PageRank Job HDFS HDFS HDFS MsgGraph( vertexid: int64, value: double edges: array<int64> ); Result( vertexid: int64, rank: double );
  • 230.
    Reason #6 Software Simplicity Networkmanagement Pregel GraphLab Giraph ...... Message delivery Memory management Task scheduling Vertex/Message internal format
  • 231.
    #1 Expressiveness Path(u, v,min(d)) :- Edge(u, v, d); :- Path(u, w, d1), Edge(w, v, d2), d=d1+d2 TC(u, u) :- Edge(u, _) TC(v, v) :- Edge(_, v) TC(u, v) :-TC(u, w), Edge(w, v), u!=v Recursive Query! »SociaLite (VLDB’13) »Myria (VLDB’15) »DeALS (ICDE’15) IDB EDB
  • 232.
    #2 Easy OPS ConvergedPlatforms! »GraphX, on Apache Spark (OSDI’15) »Gelly, on Apache Flink (FOSDEM’15)
  • 233.
    #3 Efficient Resource Utilizationand Robustness Leverage MPP query execution engine! »Pregelix (VLDB’14) 1.0 vid edges vid payload vid=vid 2 4 halt false false value 2.0 1.0 (3,1.0),(4,1.0) (1,1.0) 2 4 3.0 Msg Vertex 5 1 3.0 1.0 1 false 3.0 (3,1.0),(4,1.0) 3 false 3.0 (2,1.0),(3,1.0) 3 vid edges 1 halt false false value 3.0 3.0 (3,1.0),(4,1.0) (2,1.0),(3,1.0) msg NULL 1.0 5 1.0 NULL NULL NULL 2 false 2.0 (3,1.0),(4,1.0)3.0 4 false 1.0 (1,1.0)3.0 Relation Schema Vertex Msg GS (vid, halt, value, edges) (vid, payload) (halt, aggregate, superstep)
  • 234.
    #4 Efficient Resource Utilizationand Robustness In-memory Out-of-core In-memory Out-of-core Pregelix
  • 235.
    #4 Physical Flexibility Flexibleprocessing for the Pregel semantics »Storage, rowVs. column, in-placeVs. LSM, etc. • Vertexica (VLDB’14) • Vertica (IEEE BigData’15) • Pregelix (VLDB’14) »Query plan, join algorithms, group-by algorithms, etc. • Pregelix (VLDB’14) • GraphX (OSDI’15) • Myria (VLDB’15) »Execution model, synchronousVs. asynchronous • Myria (VLDB’15)
  • 236.
    #4 Physical Flexibility Vertica,column storeVs. row store (IEEE BigData’15)
  • 237.
    #4 Physical Flexibility IndexLeft Outer Join UDF Call (compute) M.vid=V.vid Vertexi(V) Msgi(M) (V.halt = false || M.paylod != NULL) UDF Call (compute) Vertexi(V)Msgi(M) … Vidi(I) … Vidi+1 (halt = false) Index Full Outer Join Merge (choose()) M.vid=I.vid M.vid=V.vid Pregelix, different query plans
  • 238.
  • 239.
    #4 Physical Flexibility Myria,synchronousVs. Asynchronous (VLDB’15) »Least Common Ancestor
  • 240.
    #4 Physical Flexibility Myria,synchronousVs. Asynchronous (VLDB’15) »ConnectedComponents
  • 241.
    #5 Easy DataScience Integrated Programming Abstractions »REX (VLDB’12) »AsterData (VLDB’14) SELECT R.user, SUM(R.rank)AS influence FROM PageRank( ( SELECTT.tweetid AS vertexid, 1.0/… AS value, … AS edges FROMTwitterMsgAST WHERET.reply_to>=0 ORT.retweet_from>=0 ), ……) AS R, TwitterMsg ASTM WHERE R.vertexid=TM.tweetid GROUP BY R.user ORDER BY influence DESC LIMIT 50;
  • 242.
    #6 Software Simplicity Engineeringcost is Expensive! System Lines of source code (excluding test code and comments) Giraph 32,197 GraphX 2,500 Pregelix 8,514
  • 243.
    Tutorial Outline Message PassingSystems Shared Memory Abstraction Single-Machine Systems Matrix-Based Systems Temporal Graph Systems DBMS-Based Systems Subgraph-Based Systems 243
  • 244.
    Graph analytics/network sciencetasks too varied » Centrality analysis; evolution models; community detection » Link prediction; belief propagation; recommendations » Motif counting; frequent subgraph mining; influence analysis » Outlier detection; graph algorithms like matching, max-flow » An active area of research in itself… Graph Analysis Tasks Counting network motifs Feed-fwd Loop Feed- back Loop Bi-parallel Motif High school friends Family members Office Colleagues Friends College friendsFriends in database lab in CS dept Friends in CS dept Work place friends Identify Social circles in a user’s ego network
  • 245.
    Vertex-centric framework » Workswell for some applications • Pagerank,Connected Components, … • Some machine learning algorithms can be mapped to it » However, the framework is very restrictive • Most analysis tasks or algorithms cannot be written easily • Simple tasks like counting neighborhood properties infeasible • Fundamentally: Not easy to decompose analysis tasks into vertex-level, independent local computations Alternatives? » Galois, Ligra, GreenMarl: Not sufficiently high-level » Some others (e.g., Socialite) restrictive for different reasons Limitations ofVertex-Centric Framework
  • 246.
    Example: Local ClusteringCoefficient 1 2 4 3 A measure of local density around a node: LCC(n) = # edges in 1-hop neighborhood/max # edges possible Compute() at Node n: Need to count the no. of edges between neighbors But does not have access to that information Option 1: Each node transmits its list of neighbors to its neighbors Huge memory consumption Option 2: Allow access to neighbors’ state Neighbors may not be local What about computations that require 2-hop information?
  • 247.
    Example: Frequent Subgraph Mining. Goal: find all (labeled) subgraphs that appear sufficiently frequently. No easy way to map this to the vertex-centric framework: - Need the ability to construct subgraphs of the graph incrementally - Can construct partial subgraphs and pass them around, but this causes very high memory consumption and duplication of state - Need the ability to count the number of occurrences of each subgraph - Analogous to “reduce()” but with subgraphs as keys - Some vertex-centric frameworks support such functionality for aggregation, but only in a centralized fashion. Similar challenges arise for problems like finding all cliques and motif counting.
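    The "reduce() with subgraphs as keys" step can be pictured with the Python sketch below. This is my own simplification: real systems use proper graph canonicalization and anti-monotonic support measures, whereas this key merely sorts labeled edges.

        from collections import defaultdict

        def count_patterns(embeddings, labels, min_support):
            # embeddings: iterable of edge sets, e.g. [{(1, 2), (2, 3)}, ...]
            # labels: {vertex: label}; returns patterns seen >= min_support times
            counts = defaultdict(int)
            for emb in embeddings:
                key = tuple(sorted(tuple(sorted((labels[u], labels[v])))
                                   for u, v in emb))   # crude "canonical" pattern key
                counts[key] += 1
            return {k: c for k, c in counts.items() if c >= min_support}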
  • 248.
    Major Systems. NScale: »Subgraph-centric API that generalizes the vertex-centric API »The user compute() function has access to “subgraphs” rather than “vertices” »Graph distributed across a cluster of machines, analogous to distributed vertex-centric frameworks. Arabesque: »Fundamentally different programming model aimed at frequent subgraph mining, motif counting, etc. »Key assumption: • The graph fits in the memory of a single machine in the cluster, • … but the intermediate results might not
  • 249.
    NScale: an end-to-end distributed graph programming framework. Users/application programs specify: » Neighborhoods or subgraphs of interest » A kernel computation to operate upon those subgraphs. Framework: » Extracts the relevant subgraphs from the underlying data and loads them in memory » Execution engine: executes the user computation on the materialized subgraphs » Communication: shared state/message passing. Implemented on Hadoop MapReduce as well as Apache Spark.
  • 250.
    NScale: LCC Computation Walkthrough. NScale programming model. [Figure: underlying graph data on HDFS, vertices 1-12.] Subgraph extraction query: Compute(LCC) on Extract( {Node.color=orange} [query-vertex predicate], {k=1} [neighborhood size], {Node.color=white} [neighborhood vertex predicate], {Edge.type=solid} [neighborhood edge predicate] )
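    Operationally, the extraction query above can be read as the following Python sketch (an illustration of the semantics, not NScale's API; all names and parameters here are my own): pick every vertex matching the query-vertex predicate, take its k-hop neighborhood, and keep only the neighbors and edges that satisfy the remaining predicates.

        def extract(adj, vattrs, eattrs, k, qv_pred, v_pred, e_pred):
            # adj: {v: set(neighbors)}; vattrs: {v: {...}}; eattrs: {(u, v): {...}}
            # (assumes eattrs holds both (u, v) and (v, u) for undirected edges)
            subgraphs = {}
            for q in adj:
                if not qv_pred(vattrs[q]):                     # query-vertex predicate
                    continue
                frontier, seen = {q}, {q}
                for _ in range(k):                             # neighborhood size k
                    frontier = {v for u in frontier for v in adj[u]
                                if v not in seen
                                and v_pred(vattrs[v])          # neighborhood vertex predicate
                                and e_pred(eattrs[(u, v)])}    # neighborhood edge predicate
                    seen |= frontier
                subgraphs[q] = seen                            # vertex set of q's subgraph
            return subgraphs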
  • 251.
    NScale: LCC Computation Walkthrough. Specifying computation: BluePrints API. [Figure: underlying graph data on HDFS, vertices 1-12.] The program cannot be executed as is in vertex-centric programming frameworks.
  • 252.
    NScale: LCC Computation Walkthrough. GEP: graph extraction and packing. [Figure: underlying graph data on HDFS feeds a MapReduce job for subgraph extraction; a cost-based optimizer performs set bin packing to produce a node-to-bin mapping; MR2 map tasks route subgraphs to reducers 1..N, each running an execution-engine instance.]
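    The set-bin-packing step above can be approximated by the first-fit-decreasing sketch below. This is my own simplification: the real cost-based optimizer also reasons about overlap between the extracted subgraphs, which plain first-fit-decreasing ignores.

        def pack_subgraphs(sizes, capacity):
            # sizes: {subgraph_id: estimated memory}; returns bins as lists of ids
            bins, loads = [], []
            for sg, size in sorted(sizes.items(), key=lambda kv: -kv[1]):
                for i, load in enumerate(loads):
                    if load + size <= capacity:          # first bin that still fits
                        bins[i].append(sg)
                        loads[i] += size
                        break
                else:
                    bins.append([sg])                    # open a new bin
                    loads.append(size)
            return bins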
  • 253.
    NScale: LCC Computation Walkthrough. GEP: graph extraction and packing. [Figure: graph extraction and loading via MapReduce (Apache YARN); subgraph extraction produces extracted subgraphs SG-1 to SG-4 from the underlying graph data on HDFS.]
  • 254.
    NScale: LCC Computation Walkthrough. GEP: graph extraction and packing. [Figure: graph extraction and loading via MapReduce (Apache YARN), subgraph extraction, and a cost-based optimizer for data representation & placement; the packed subgraphs are instantiated in distributed memory.]
  • 255.
    NScale: LCC Computation Walkthrough. [Figure: the full pipeline: GEP (graph extraction and packing) followed by the distributed execution engine; node masters drive the distributed execution of the user computation over subgraphs held in distributed memory.]
  • 256.
    Experimental Evaluation (CE in node-secs; cluster memory in GB; DNC = did not complete; OOM = out of memory)
    Personalized PageRank on 2-Hop Neighborhood
    Dataset      #Source Vertices | NScale CE / Mem | Giraph CE / Mem | GraphLab CE / Mem | GraphX CE / Mem
    EU Email     3200             | 52 / 3.35       | 782 / 17.10     | 710 / 28.87       | 9975 / 85.50
    NotreDame    3500             | 119 / 9.56      | 1058 / 31.76    | 870 / 70.54       | 50595 / 95.00
    GoogleWeb    4150             | 464 / 21.52     | 10482 / 64.16   | 1080 / 108.28     | DNC / -
    WikiTalk     12000            | 3343 / 79.43    | DNC / OOM       | DNC / OOM         | DNC / -
    LiveJournal  20000            | 4286 / 84.94    | DNC / OOM       | DNC / OOM         | DNC / -
    Orkut        20000            | 4691 / 93.07    | DNC / OOM       | DNC / OOM         | DNC / -
    Local Clustering Coefficient
    Dataset      | NScale CE / Mem | Giraph CE / Mem | GraphLab CE / Mem | GraphX CE / Mem
    EU Email     | 377 / 9.00      | 1150 / 26.17    | 365 / 20.10       | 225 / 4.95
    NotreDame    | 620 / 19.07     | 1564 / 30.14    | 550 / 21.40       | 340 / 9.75
    GoogleWeb    | 658 / 25.82     | 2024 / 35.35    | 600 / 33.50       | 1485 / 21.92
    WikiTalk     | 726 / 24.16     | DNC / OOM       | 1125 / 37.22      | 1860 / 32.00
    LiveJournal  | 1800 / 50.00    | DNC / OOM       | 5500 / 128.62     | 4515 / 84.00
    Orkut        | 2000 / 62.00    | DNC / OOM       | DNC / OOM         | 20175 / 125.00
  • 257.
    NScaleSpark: NScale on Spark. [Figure: the GEP phase is built as a series of RDD transformations t1..tn over the input graph data (RDD 1 .. RDD n) for subgraph extraction and bin packing; the resulting Spark RDD contains graph objects G1..Gn, each grouping subgraphs SG1..SGn via the bin-packing algorithm; a map transformation transparently instantiates a distributed execution-engine instance (node master) per graph object to execute the user computation.]
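    A hedged PySpark sketch of the pipeline in this figure (NScaleSpark itself is implemented in Scala; the HDFS paths, the crude 1-hop "subgraphs", and the helper functions below are placeholders of my own):

        from pyspark import SparkContext

        def parse_edge(line):                      # "u v" -> (u, v)
            u, v = line.split()
            return (u, v)

        def bin_of(subgraph):                      # stand-in for the bin-packing result
            query_vertex, _ = subgraph
            return hash(query_vertex) % 8          # pretend there are 8 bins

        def run_engine(partition):                 # stand-in execution-engine instance
            for b, subgraphs in partition:         # one packed graph object per bin
                yield (b, sum(1 for _ in subgraphs))

        sc = SparkContext(appName="NScaleSparkSketch")
        edges = sc.textFile("hdfs:///graph/edges").map(parse_edge)
        ego = edges.groupByKey()                   # crude 1-hop neighborhoods (v, nbrs)
        packed = ego.map(lambda sg: (bin_of(sg), sg)).groupByKey()
        packed.mapPartitions(run_engine).saveAsTextFile("hdfs:///out/lcc")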
  • 258.
    Arabesque: “Think-like-an-embedding” paradigm. The user specifies what types of embeddings to construct, whether they grow edge-at-a-time or vertex-at-a-time, and provides functions to filter and process partial embeddings. Arabesque responsibilities: graph exploration, load balancing, aggregation (isomorphism), automorphism detection. User responsibilities: filter, process.
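    A naive Python sketch of this division of labor (Arabesque's real API is in Java, and its exploration, automorphism handling, and aggregation are far more sophisticated; everything below, including the vertex-ordering trick, is my own simplification). The "system" side grows embeddings vertex-at-a-time; the "user" side supplies filter and process, here instantiated for triangle (3-clique) listing.

        from itertools import combinations

        def expand(adj, emb):                        # "system" side: grow by one vertex
            cand = {v for u in emb for v in adj[u]} - set(emb)
            return [emb + (v,) for v in cand if v > max(emb)]   # crude duplicate pruning

        def explore(adj, size, fltr, process):       # "system" side: exploration steps
            frontier = [(v,) for v in adj]
            for _ in range(size - 1):
                frontier = [e for emb in frontier
                              for e in expand(adj, emb) if fltr(adj, e)]
            for emb in frontier:
                process(emb)

        # "User" side: the filter keeps only cliques, process collects them.
        triangles = []
        is_clique = lambda adj, e: all(v in adj[u] for u, v in combinations(e, 2))
        explore({1: {2, 3}, 2: {1, 3}, 3: {1, 2}}, 3, is_clique, triangles.append)
        # triangles == [(1, 2, 3)]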
  • 259.
  • 260.
  • 261.
    Arabesque: Evaluation. Comparable to centralized implementations for a single thread; drastically more scalable to large graphs and clusters.
  • 262.
    Conclusion & Future Direction 262 End-to-End Richer Big Graph Analytics »Keyword search (Elastic Search) »Graph query (Neo4J) »Graph analytics (Giraph) »Machine learning (Spark, TensorFlow) »SQL query (Hive, Impala, SparkSQL, etc.) »Stream processing (Flink, Spark Streaming, etc.) »JSON processing (AsterixDB, Drill, etc.) Converged programming abstractions and platforms?
  • 263.
    Conclusion & Future Direction »Frameworks for computation-intensive jobs »High-speed network for data-intensive jobs »New hardware support 263
  • 264.

Editor's Notes

  • #245 Add an example of how this would be done in Pregel and NScale.
  • #246 Add an example of how this would be done in Pregel and NScale.
  • #258 GEP: implemented as a series of RDD transformations starting from the raw input graph; subgraph extraction and bin packing implemented in Scala. Phase 2: instantiating the subgraphs in distributed memory; subgraph structural information is joined with partitioning info for grouping; the NScale graph library is ported to NSpark; a Spark RDD of graph objects is built, where each graph object contains subgraphs grouped together using the bin-packing algorithm. Each instantiation uses a master-worker architecture over the graph RDD.
  • #262 At the architecture level, Arabesque runs on top of Hadoop. During the execution of an exploration step, all workers execute the model we have just described with the input embeddings of size n that were passed to them from the previous step. This execution is done in parallel across all threads of a worker. At the end of the execution, the resulting embeddings of size n + 1 are shuffled between the workers so as to reduce the imbalance that might be caused by highly expandable embeddings (usually those containing vertices with high degrees).