GraphFrames is a package for graph processing and analytics at scale. It is built on top of Apache Spark and relies on the DataFrame abstraction. It provides built-in, easy-to-use distributed graph algorithms as well as flexible APIs such as Pregel and AggregateMessages for custom graph processing. Users can write highly expressive queries by combining the DataFrame API with a new API for network motif finding. Users also benefit from DataFrame performance optimizations within the Spark SQL engine. GraphFrames works in Java, Scala, and Python.
Popular use cases in which GraphFrames is almost irreplaceable include, but are not limited to:
- Compliance analytics with a scalable shortest paths algorithm and motif analysis;
- Anti-fraud with scalable cycle detection in large networks;
- Identity resolution on the scale of billions with highly efficient connected components;
- Search result ranking with a distributed, Pregel-based PageRank;
- Clustering huge graphs with Label Propagation and Power Iteration Clustering;
- Building knowledge graph systems with the Property Graph Model.
- Installation
- Creating Graphs
- Basic Graph Manipulations
- Centrality Metrics
- Motif finding
- Traversals and Connectivity
- Community Detection
- Scala API
- Python API
- Apache Spark compatibility
Now you can create a GraphFrame as follows.
from pyspark.sql import SparkSession
from graphframes import GraphFrame
spark = SparkSession.builder.getOrCreate()
nodes = [
    (1, "Alice", 30),
    (2, "Bob", 25),
    (3, "Charlie", 35),
]
nodes_df = spark.createDataFrame(nodes, ["id", "name", "age"])
edges = [
    (1, 2, "friend"),
    (2, 1, "friend"),
    (2, 3, "friend"),
    (3, 2, "enemy"),  # eek!
]
edges_df = spark.createDataFrame(edges, ["src", "dst", "relationship"])
g = GraphFrame(nodes_df, edges_df)
Now let's run some graph algorithms at scale!
g.inDegrees.show()
# +---+--------+
# | id|inDegree|
# +---+--------+
# |  2|       2|
# |  1|       1|
# |  3|       1|
# +---+--------+
g.outDegrees.show()
# +---+---------+
# | id|outDegree|
# +---+---------+
# |  1|        1|
# |  2|        2|
# |  3|        1|
# +---+---------+
g.degrees.show()
# +---+------+
# | id|degree|
# +---+------+
# |  1|     2|
# |  2|     4|
# |  3|     2|
# +---+------+
g2 = g.pageRank(resetProbability=0.15, tol=0.01)
g2.vertices.show()
# +---+-------+---+------------------+
# | id|   name|age|          pagerank|
# +---+-------+---+------------------+
# |  1|  Alice| 30|0.7758750474847483|
# |  2|    Bob| 25|1.4482499050305027|
# |  3|Charlie| 35|0.7758750474847483|
# +---+-------+---+------------------+
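To show what resetProbability and tol control, here is a minimal plain-Python sketch of PageRank by power iteration on the same three-node graph. It is an illustration only, not the GraphFrames implementation: the `pagerank` helper, its `max_iter` parameter, and the choice to scale ranks so they sum to the number of nodes are assumptions made for this example.

```python
# Plain-Python illustration only; this is NOT how GraphFrames computes PageRank.
nodes = [1, 2, 3]
edges = [(1, 2), (2, 1), (2, 3), (3, 2)]  # same (src, dst) pairs as edges_df


def pagerank(nodes, edges, reset_probability=0.15, tol=1e-6, max_iter=100):
    """Hypothetical helper: power iteration with ranks that sum to len(nodes)."""
    out_degree = {n: 0 for n in nodes}
    for src, _ in edges:
        out_degree[src] += 1

    ranks = {n: 1.0 for n in nodes}
    for _ in range(max_iter):
        # Each node splits its current rank evenly across its outgoing edges.
        incoming = {n: 0.0 for n in nodes}
        for src, dst in edges:
            incoming[dst] += ranks[src] / out_degree[src]

        # A random surfer restarts with probability reset_probability.
        new_ranks = {
            n: reset_probability + (1 - reset_probability) * incoming[n]
            for n in nodes
        }

        # Stop once no rank moved by more than tol in this iteration.
        converged = max(abs(new_ranks[n] - ranks[n]) for n in nodes) < tol
        ranks = new_ranks
        if converged:
            break
    return ranks


ranks = pagerank(nodes, edges)
for node in nodes:
    print(node, round(ranks[node], 3))
```

With this tight tolerance the ranks settle near 0.770 for nodes 1 and 3 and 1.459 for node 2, close to the values printed above; the small gap is presumably due to the looser tol=0.01 used in that run.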
# GraphFrames' most used feature...
# Connected components can do big data entity resolution on billions or even trillions of records!
# First connect records with a similarity metric, then run connectedComponents.
# This gives you groups of identical records, which you then link by same_as edges or merge into list-based master records.
spark.sparkContext.setCheckpointDir("/tmp/graphframes-example-connected-components") # required by GraphFrames.connectedComponents
g.connectedComponents().show()
# +---+-------+---+---------+
# | id|   name|age|component|
# +---+-------+---+---------+
# |  1|  Alice| 30|        1|
# |  2|    Bob| 25|        1|
# |  3|Charlie| 35|        1|
# +---+-------+---+---------+
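The entity-resolution recipe described in the comments above can be sketched on a single machine in plain Python. Everything here is hypothetical and for illustration only: the sample records, the `similar` metric, and the union-find helper stand in for a real similarity join and for connectedComponents, which is what you would use at scale.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records; in practice these would be rows of a DataFrame.
records = {
    1: "alice@example.com",
    2: "Alice@Example.com ",
    3: "bob@example.com",
    4: "BOB@example.com",
    5: "carol@example.com",
}


def similar(a, b):
    """Toy similarity metric (an assumption): equal after trim and lowercase."""
    return a.strip().lower() == b.strip().lower()


# Step 1: connect records that the similarity metric treats as the same entity.
edges = [
    (i, j)
    for i, j in combinations(sorted(records), 2)
    if similar(records[i], records[j])
]

# Step 2: connected components via union-find.
# The smallest id in each group ends up as the component label.
parent = {i: i for i in records}


def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
        x = parent[x]
    return x


for i, j in edges:
    root_i, root_j = find(i), find(j)
    if root_i != root_j:
        parent[max(root_i, root_j)] = min(root_i, root_j)

# Step 3: group record ids by component to form list-based master records.
groups = defaultdict(list)
for i in sorted(records):
    groups[find(i)].append(i)

master_records = dict(groups)
print(master_records)
```

The all-pairs comparison in step 1 is quadratic and only works for toy inputs; it is used here purely to keep the sketch short.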
# Find frenemies with network motif finding! See how graph and relational queries are combined?
(
    g.find("(a)-[e]->(b); (b)-[e2]->(a)")
    .filter("e.relationship = 'friend' and e2.relationship = 'enemy'")
    .show()
)
# These are paths, which you can aggregate and count to find complex patterns.
# +------------+--------------+----------------+-------------+
# |           a|             e|               b|           e2|
# +------------+--------------+----------------+-------------+
# |{2, Bob, 25}|{2, 3, friend}|{3, Charlie, 35}|{3, 2, enemy}|
# +------------+--------------+----------------+-------------+
To learn more about GraphFrames, check out these resources:
These resources are provided by the community:
- Introducing GraphFrames
- GraphFrames Google Group
- #graphframes Discord Channel on GraphGeeks
- Graph Operations in Apache Spark Using GraphFrames
- Executing Graph Algorithms with GraphFrames on Databricks
- On-Time Flight Performance with GraphFrames for Apache Spark
- Sustainability in Aluminum Production
- A top level overview of GraphFrames internals
- GraphFrames: An Integrated API for Mixing Graph and Relational Queries, Dave et al. 2016.
GraphFrames was created as a collaborative effort among UC Berkeley, MIT, Databricks, and the open source community. It is currently maintained by a group of individual contributors.
See the contribution guide and the local development setup walkthrough for step-by-step instructions on preparing your environment, running tests, and submitting changes.
See the release notes.
