What makes graph queries difficult?
Gábor Szárnyas
szarnyas@mit.bme.hu
Budapest Neo4j Meetup – 2019/06/25
With contributions from Petra Várhegyi and Bálint Hegyi
The property graph data model
SIMPLE GRAPH
[Figure: a graph with five nodes, labeled A to E]
5 people
• Many of them know each other
This is a simple graph.
Algorithms:
• breadth-first search
• depth-first search
• PageRank
• connected components
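As an illustration of the first algorithm listed, here is a minimal breadth-first search sketch in Python. The edge list is hypothetical (the slide's actual edges are not recoverable), and the function names are made up for this example.

```python
from collections import deque

# Hypothetical "knows" edges among the five people; not taken from the slide.
EDGES = [("A", "B"), ("A", "D"), ("B", "C"), ("B", "E"), ("D", "E")]


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of node pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def bfs_levels(adj, start):
    """Return the hop distance from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in sorted(adj[node]):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


ADJ = build_adjacency(EDGES)
LEVELS = bfs_levels(ADJ, "A")
print(LEVELS)
```

Only the structure of the graph is used here: no weights, types, or properties are needed.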
ADD EDGE WEIGHTS
[Figure: the same five-node graph with edge weights 10, 8, 4, 2, 2, 9, 1, 6]
5 people
• Weight: communication cost
This is a weighted graph.
Algorithms:
• shortest path algorithms
• max-flow
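A shortest path computation on a weighted graph could be sketched as follows, using Dijkstra's algorithm. The eight weights are the ones visible on the slide, but which edge carries which weight is an assumption made for this example.

```python
import heapq

# The weights appear on the slide; their assignment to edges is hypothetical.
WEIGHTED_EDGES = [
    ("A", "B", 10), ("A", "D", 8), ("A", "C", 4), ("B", "C", 2),
    ("B", "E", 2), ("C", "D", 9), ("C", "E", 1), ("D", "E", 6),
]


def build_weighted_adjacency(edges):
    """Build an undirected adjacency map of (neighbor, weight) pairs."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    return adj


def dijkstra(adj, source):
    """Return the minimum total communication cost from source to each node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, weight in adj[node]:
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


COSTS = dijkstra(build_weighted_adjacency(WEIGHTED_EDGES), "A")
print(COSTS)
```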
ADD EDGE TYPES
[Figure: the five-node graph with two kinds of edges]
5 people
• Business partners
• Friends
Multiple edge types but only a single node type.
This is an edge-typed graph.
ADD NODE AND EDGE TYPES
[Figure: five person nodes (A to E) and six comment nodes (c1 to c6)]
5 people
• Business partners
• Friends
6 comments
• Replying to another comment
• Authored by a given person
This is a typed graph.
ADD PROPERTIES
[Figure: five person nodes (A to E) and six comment nodes (c1 to c6), annotated with properties]
5 people – name, age
• Business partners
• Friends – since
6 comments – content, date
• Replying to another comment
• Authored by a given person
This is a property graph.
Similar to object-oriented data.
Example properties shown in the figure:
• Person – name: “Alice”, age: 25
• Person – name: “Bob”, age: 26
• Person – name: “Erin”, age: 30
• Person – name: “Dan”, age: 47
• Friends – since: 2014
• Comment – content: “I totally agree”, date: 2017-02-02
• Comment – content: “Great”, date: 2017-02-03
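To make the comparison with object-oriented data concrete, a property graph can be sketched with two small Python classes. The class names, the label and type strings, and the choice of which nodes the edges connect are assumptions for illustration; only the property values come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: str
    label: str                      # node type, e.g. "Person" or "Comment"
    props: dict = field(default_factory=dict)


@dataclass
class Edge:
    src: str
    dst: str
    type: str                       # edge type, e.g. "FRIENDS"
    props: dict = field(default_factory=dict)


NODES = [
    Node("A", "Person", {"name": "Alice", "age": 25}),
    Node("B", "Person", {"name": "Bob", "age": 26}),
    Node("c1", "Comment", {"content": "Great", "date": "2017-02-03"}),
]

# Endpoints are hypothetical; the slide does not say who is friends with whom.
EDGES = [
    Edge("A", "B", "FRIENDS", {"since": 2014}),
    Edge("c1", "B", "AUTHORED_BY"),
]


def nodes_by_label(label):
    """Return all nodes of the given type."""
    return [n for n in NODES if n.label == label]


print([n.props["name"] for n in nodes_by_label("Person")])
```

Both nodes and edges carry a type and a free-form property map, which is what distinguishes this model from the typed graph of the previous slide.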
Graph processing: Queries and analytics
GRAPH QUERIES: LOCAL
[Figure: the property graph from the “ADD PROPERTIES” slide]
Local graph query:
Return “Dan” and his comments.
Well-researched topic.
Typical execution times are low.
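The local query can be sketched in Python as a lookup of one node followed by a single hop to its neighbors. Which comments “Dan” authored is not recoverable from the slide, so the `AUTHORED_BY` mapping below is hypothetical, as are the node IDs.

```python
PEOPLE = {
    "A": {"name": "Alice", "age": 25},
    "B": {"name": "Bob", "age": 26},
    "D": {"name": "Dan", "age": 47},
    "E": {"name": "Erin", "age": 30},
}
COMMENTS = {
    "c1": {"content": "I totally agree", "date": "2017-02-02"},
    "c2": {"content": "Great", "date": "2017-02-03"},
}
# Hypothetical authorship edges (comment id -> person id).
AUTHORED_BY = {"c1": "D", "c2": "D"}

# Reverse index so that a person's comments are one lookup away.
COMMENTS_BY_AUTHOR = {}
for comment_id, person_id in AUTHORED_BY.items():
    COMMENTS_BY_AUTHOR.setdefault(person_id, []).append(comment_id)


def person_with_comments(name):
    """Return the person's properties and the comments they authored."""
    for person_id, props in PEOPLE.items():
        if props["name"] == name:
            comment_ids = COMMENTS_BY_AUTHOR.get(person_id, [])
            return props, [COMMENTS[c] for c in comment_ids]
    return None


DAN, DAN_COMMENTS = person_with_comments("Dan")
print(DAN, DAN_COMMENTS)
```

Once the start node is found, the query touches only its immediate neighborhood, which is why execution times tend to be low regardless of the size of the graph.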
GRAPH QUERIES: GLOBAL
[Figure: the property graph from the “ADD PROPERTIES” slide]
Global graph query:
Find people who had no interaction
with “Cecil” through any comments,
neither replying nor receiving a reply.
The result is “Alice”.
Typical execution times are high.
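A sketch of the global query in Python follows. The authorship and reply edges are invented so that the result matches the one stated on the slide; the slide's actual edges are not recoverable, and mapping node C to “Cecil” is also an assumption.

```python
NAMES = {"A": "Alice", "B": "Bob", "C": "Cecil", "D": "Dan", "E": "Erin"}

# Hypothetical edges: comment -> author, and reply -> parent comment.
AUTHOR = {"c1": "C", "c2": "B", "c3": "D", "c4": "E", "c5": "C", "c6": "A"}
REPLY_OF = {"c2": "c1", "c3": "c1", "c5": "c4", "c6": "c2"}


def no_interaction_with(name):
    """People who neither replied to nor received a reply from `name`."""
    targets = {pid for pid, n in NAMES.items() if n == name}
    interacted = set()
    # Every reply edge in the graph has to be inspected.
    for reply, parent in REPLY_OF.items():
        replier, receiver = AUTHOR[reply], AUTHOR[parent]
        if replier in targets:
            interacted.add(receiver)
        if receiver in targets:
            interacted.add(replier)
    return sorted(
        NAMES[pid] for pid in NAMES
        if pid not in targets and pid not in interacted
    )


RESULT = no_interaction_with("Cecil")
print(RESULT)
```

Unlike the local query, the negative condition cannot be answered from one neighborhood: the whole set of people and reply edges must be considered, which is one reason execution times are typically high.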
GRAPH ANALYTICS: NETWORK SCIENCE
• Studies the structure of graphs
• Pioneered by Albert-László Barabási et al.
• Degree distributions, clustering coefficient, etc.
LOCAL CLUSTERING COEFFICIENT
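[Figure: the five-node graph, A to E]
LCC(v) = \frac{\text{number of triangles containing } v}{\text{number of wedges centered at } v}
[Plot: empirical cumulative distribution function of LCC; x-axis: LCC (0, 0.33, 0.66, 1), y-axis: 0.0 to 1.0]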
The empirical cumulative distribution
function does not present much useful
information in this case.
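The local clustering coefficient can be computed per node as the fraction of neighbor pairs that are themselves connected. The sketch below uses a hypothetical edge list, since the edges of the slide's graph are not recoverable.

```python
from itertools import combinations

# Hypothetical edges among the five people; not taken from the slide.
EDGES = [
    ("A", "B"), ("A", "C"), ("B", "C"), ("B", "E"),
    ("C", "D"), ("C", "E"), ("D", "E"),
]


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of node pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def lcc(adj, v):
    """Closed neighbor pairs of v divided by all neighbor pairs of v."""
    neighbors = sorted(adj[v])
    k = len(neighbors)
    if k < 2:
        return 0.0  # no wedge is centered at v
    closed = sum(1 for u, w in combinations(neighbors, 2) if w in adj[u])
    return closed / (k * (k - 1) / 2)


ADJ = build_adjacency(EDGES)
LCC = {v: lcc(ADJ, v) for v in sorted(ADJ)}
print(LCC)
```

With only a handful of distinct values, the distribution of LCC over such a small graph carries little information, which matches the remark about the cumulative distribution function.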
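TYPED CLUSTERING COEFFICIENT
[Figure: TCC(v) defined pictorially as a ratio at node v, summing one typed triangle pattern per combination of edge types]
[Plot: empirical cumulative distribution function of TCC; x-axis: TCC (0, 0.33, 0.66, 1), y-axis: 0.0 to 1.0]
More information
High combinatorial complexity:
• t types → t × (t − 1) triangles
• O(t^2) steps
[Figure: the five-node graph, A to E, with typed edges]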
TYPED CLUSTERING COEFFICIENT
[Figure: the five-node graph, A to E, with three edge types; TCC(v) is shown as a ratio that sums six typed triangle patterns]
• Business partners
• Friends
• Family member
3 types → 6 triangles
Petra Várhegyi:
Multidimensional Graph Analytics,
Master’s thesis, 2018
F. Battiston et al.:
Structural measures for multiplex networks,
Physical Review E, 2014
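One way to read the typed clustering coefficient is sketched below. It assumes the slide's TCC corresponds to the first multiplex clustering coefficient of the cited Battiston et al. paper: a wedge at v whose two edges share one type counts as closed when its endpoints are connected by an edge of a different type. The definition used in the cited thesis may differ, and the typed edge list is hypothetical.

```python
from itertools import combinations

# Hypothetical typed edges; not taken from the slide.
TYPED_EDGES = [
    ("A", "B", "friend"), ("A", "C", "friend"), ("B", "C", "business"),
    ("B", "E", "friend"), ("C", "E", "friend"),
    ("C", "D", "business"), ("D", "E", "business"),
]


def build_typed_adjacency(edges):
    """Map edge type -> node -> set of neighbors via that type."""
    adj = {}
    for u, v, t in edges:
        adj.setdefault(t, {}).setdefault(u, set()).add(v)
        adj.setdefault(t, {}).setdefault(v, set()).add(u)
    return adj


def tcc(adj, v):
    """Typed triangles at v over (t - 1) times the same-type wedges at v."""
    types = sorted(adj)
    closed = 0
    wedges = 0
    for a in types:                       # type of the two edges at v
        neighbors = sorted(adj[a].get(v, set()))
        for u, w in combinations(neighbors, 2):
            wedges += 1
            for b in types:               # type of the closing edge
                if b != a and w in adj[b].get(u, set()):
                    closed += 1
    if wedges == 0 or len(types) < 2:
        return 0.0
    return closed / ((len(types) - 1) * wedges)


TYPED_ADJ = build_typed_adjacency(TYPED_EDGES)
TCC = {v: tcc(TYPED_ADJ, v) for v in "ABCDE"}
print(TCC)
```

The two nested loops over types enumerate the t × (t − 1) ordered pairs of distinct types, which is where the quadratic growth in the number of triangle patterns comes from.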
GRAPH PROCESSING TECHNIQUES AND LANGUAGES
[Chart: techniques plotted by level of detail (x-axis: structure, +weights, +types, +properties) against estimated evaluation time (y-axis)]
• Algorithms shown: BFS, PageRank, Dijkstra, Floyd, Ford-Fulkerson, local clustering coeff., typed clustering coeff.
• Queries shown: local queries, global queries
• Tools shown: Neo4j Graph Algorithms library, Neo4j Graph Database
Graph processing tools and challenges
GRAPH PROCESSING CHALLENGES / STRUCTURE
the “curse of connectedness”
data structures contemporary computer architectures are
good at processing are linear and simple hierarchical
structures, such as Lists, Stacks, or Trees
a massive amount of random data access is required […]
poor performance since the CPU cache is not in effect for
most of the time. […] parallelism is difficult to extract
because of the unstructured nature of graphs.
B. Shao, Y. Li, H. Wang, H. Xia (Microsoft Research):
Trinity Graph Engine and its Applications,
IEEE Data Engineering Bulletin, 2017
Key points: connectedness, computer architectures, caching and parallelization
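The random access pattern described in the quote can be made visible with a small sketch. It stores a hypothetical graph in compressed sparse row (CSR) form, runs a breadth-first traversal, and records how far apart consecutive node accesses are. The graph and all names are made up for illustration.

```python
# Hypothetical undirected graph on nodes 0..9.
EDGES = [(0, 7), (0, 3), (3, 9), (7, 1), (1, 8), (9, 2), (2, 6), (8, 4), (4, 5)]
NUM_NODES = 10


def to_csr(num_nodes, edges):
    """Return (offsets, targets) arrays of a CSR adjacency structure."""
    degree = [0] * num_nodes
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    offsets = [0] * (num_nodes + 1)
    for i in range(num_nodes):
        offsets[i + 1] = offsets[i] + degree[i]
    targets = [0] * offsets[-1]
    cursor = offsets[:-1]              # next free slot per node (a copy)
    for u, v in edges:
        targets[cursor[u]] = v
        cursor[u] += 1
        targets[cursor[v]] = u
        cursor[v] += 1
    return offsets, targets


def bfs_access_order(offsets, targets, start):
    """Return nodes in the order a BFS first touches them."""
    visited = [False] * (len(offsets) - 1)
    visited[start] = True
    order, frontier = [start], [start]
    while frontier:
        next_frontier = []
        for u in frontier:
            for i in range(offsets[u], offsets[u + 1]):
                v = targets[i]
                if not visited[v]:
                    visited[v] = True
                    order.append(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return order


OFFSETS, TARGETS = to_csr(NUM_NODES, EDGES)
ORDER = bfs_access_order(OFFSETS, TARGETS, 0)
STRIDES = [abs(b - a) for a, b in zip(ORDER, ORDER[1:])]
print(ORDER, STRIDES)
```

A scan over a list would touch positions with a stride of 1 throughout. The traversal instead jumps between distant positions, which is the kind of access pattern that CPU caches handle poorly.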
GRAPH PROCESSING CHALLENGES / PROPERTIES
existing graph query methods […] focus on the topological
structure of graphs and few have considered attributed graphs.
applications of large graph databases would involve querying the
graph data (attributes) in addition to the graph topology.
answering queries that involve predicates on the attributes of
the graphs in addition to the topological structure […] makes
evaluation and optimization more complex.
S. Sakr, S. Elnikety, Y. He (Microsoft Research):
G-SPARQL: A Hybrid Engine for Querying Large Attributed Graphs,
CIKM 2012
Key points: topology, properties, complex optimization
GRAPH PROCESSING TOOLS
[Figure: tool logos grouped into graph queries and graph analytics; tools named include Gelly, LynxKite, and the Neo4j Graph Algorithms library]
Currently, there is a strong distinction between graph query
and analytical tools – this might change in the future.
János Szendi-Varga (GraphAware):
Graph Technology Landscape 2019
Benchmarks:
Defining a common understanding
TRANSACTION PROCESSING PERFORMANCE COUNCIL (1988–)
Many standard specifications
for benchmarking certain
aspects of relational DBs
LINKED DATA BENCHMARK COUNCIL (2012–)
LDBC is a non-profit organization dedicated to establishing benchmarks,
benchmark practices and benchmark results for graph data management
software.
LDBC’s Social Network Benchmark is an industrial and academic initiative,
formed by principal actors in the field of graph-like data management.
LDBC SOCIAL NETWORK BENCHMARK
Complex graph schema
14 node types, many edge types
Subgraphs
• Network of persons
• Arbitrary depth trees
  o Comments
  o TagClasses
• Fixed depth trees
  o City < Country < Continent
LDBC INTERACTIVE Q3
Friends and friends of friends that have been to countries X and Y
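The one-line description of this query can be sketched in Python as a two-hop neighborhood expansion followed by a filter. This follows only the slide's summary, not the full LDBC specification of the query. The `KNOWS` edges, the `VISITED` mapping, and the person IDs are all hypothetical.

```python
# Hypothetical friendship edges and visited countries per person.
KNOWS = [("p1", "p2"), ("p2", "p3"), ("p3", "p4"), ("p1", "p5")]
VISITED = {
    "p2": {"X"},
    "p3": {"X", "Y"},
    "p4": {"X", "Y"},
    "p5": {"X", "Y"},
}


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of person pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def friends_within_two_hops(adj, start):
    """Friends and friends of friends of start, excluding start itself."""
    direct = adj.get(start, set())
    second = set()
    for friend in direct:
        second |= adj.get(friend, set())
    return (direct | second) - {start}


def q3_sketch(adj, visited, start, x, y):
    """People within two hops of start who have been to both x and y."""
    candidates = friends_within_two_hops(adj, start)
    return sorted(p for p in candidates if {x, y} <= visited.get(p, set()))


Q3_RESULT = q3_sketch(build_adjacency(KNOWS), VISITED, "p1", "X", "Y")
print(Q3_RESULT)
```

Here `p4` has visited both countries but is three hops away, so it is not part of the result.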
LDBC INTERACTIVE Q14
Trusted connection paths
BI WORKLOAD
[Figure: grid of the queries of the BI workload, numbered 1 to 25]
GraphBLAS:
A unified theory built on linear algebra
THE GRAPHBLAS APPROACH
[Figure: two parallel stacks, bottom to top. Numerical: HW architecture, BLAS, LINPACK/LAPACK, numerical applications. Graph: HW architecture, GraphBLAS, LAGraph, graph analytical applications. Each stack is annotated “Separation of concerns”.]
S. McMillan: Research review @ CMU, 2015
Graph algorithms on future architectures
• GraphBLAS is an effort to define standard building blocks for graph algorithms in the language of linear algebra
• 1979: BLAS (Basic Linear Algebra Subprograms)
• 2013: GraphBLAS
• Key idea: separation of concerns
Roles: graph algorithm implementers, hardware vendors, HPC experts
Tim Mattson et al.: LAGraph,
GrAPL @ IPDPS 2019
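To illustrate what “graph algorithms in the language of linear algebra” can mean, here is a plain Python sketch of breadth-first search written as repeated vector-matrix products over the boolean (OR, AND) semiring. It does not use the GraphBLAS API; the helper names and the edge list are assumptions made for this example.

```python
NODES = ["A", "B", "C", "D", "E"]
# Hypothetical edges; not taken from the slides.
EDGES = [("A", "B"), ("A", "D"), ("B", "C"), ("B", "E"), ("D", "E")]


def adjacency_matrix(nodes, edges):
    """Build a symmetric boolean adjacency matrix."""
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    matrix = [[False] * n for _ in range(n)]
    for u, v in edges:
        matrix[index[u]][index[v]] = True
        matrix[index[v]][index[u]] = True
    return matrix


def vxm_or_and(vector, matrix):
    """Vector-matrix product where OR replaces + and AND replaces *."""
    n = len(vector)
    return [any(vector[i] and matrix[i][j] for i in range(n)) for j in range(n)]


def bfs_levels(matrix, source):
    """Compute BFS levels by advancing a boolean frontier vector."""
    n = len(matrix)
    levels = [None] * n
    frontier = [i == source for i in range(n)]
    depth = 0
    while any(frontier):
        for i in range(n):
            if frontier[i]:
                levels[i] = depth
        reached = vxm_or_and(frontier, matrix)
        # Mask out nodes that already have a level.
        frontier = [reached[j] and levels[j] is None for j in range(n)]
        depth += 1
    return levels


LEVELS = bfs_levels(adjacency_matrix(NODES, EDGES), 0)
print(dict(zip(NODES, LEVELS)))
```

The algorithm is expressed only in terms of the product and the mask. How these two operations are executed on a given machine is left to whoever implements them, which is the separation of concerns the slide refers to.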
PARALLELIZATION ON SKEWED DISTRIBUTIONS
Using multiple processing units requires load balancing.
Very difficult to implement for real graphs.
This work is in progress and improvements are expected.
Gábor Szárnyas: Multiplex graph analytics
with GraphBLAS, FOSDEM 2019
Bálint Hegyi: Benchmarking scalable graph
query techniques, Master’s thesis, 2019
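Why a skewed degree distribution makes load balancing hard can be shown with a toy calculation. The degree sequence below is hypothetical, work per node is assumed to be proportional to its degree, and both partitioning functions are made up for this sketch.

```python
# Hypothetical skewed degree sequence: one hub and many low-degree nodes.
DEGREES = [1000, 3, 2, 2, 1, 1, 1, 1]


def split_by_vertex_count(degrees, parts):
    """Give each part the same number of nodes; return the work per part."""
    size = -(-len(degrees) // parts)   # ceiling division
    return [sum(degrees[i:i + size]) for i in range(0, len(degrees), size)]


def split_greedy_by_degree(degrees, parts):
    """Assign nodes, largest degree first, to the currently lightest part."""
    loads = [0] * parts
    for d in sorted(degrees, reverse=True):
        loads[loads.index(min(loads))] += d
    return loads


BY_COUNT = split_by_vertex_count(DEGREES, 2)
GREEDY = split_greedy_by_degree(DEGREES, 2)
print(BY_COUNT, GREEDY)
```

Even the degree-aware assignment leaves one part with almost all of the work, because a single hub cannot be divided when whole nodes are the unit of distribution. Balancing such graphs would require splitting the work of individual nodes, which is considerably harder to implement.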
Summary
SUMMARY: CHALLENGES IN GRAPH PROCESSING
No consensus on a unifying theory:
• Relational algebra?
• Linear algebra?
Performance:
• Many random access operations
• Difficult to cache
• Difficult to parallelize
• Handling properties introduces even more complexity
Many open research and implementation challenges.
CONTRIBUTIONS IN MY PHD DISSERTATION
[Figure: diagram relating the contributions P1, P2, and P3 to the fields of database research, high-performance computing, network science, object-oriented SW engineering, and the semantic web]
Gábor Szárnyas:
Query, Analysis, and Benchmarking Techniques for Evolving Property Graphs of Software Systems,
PhD dissertation, 2019
