GraphFrames: graph algorithms at scale

GraphFrames is a package for graph processing and analytics at scale. It is built on top of Apache Spark and relies on the DataFrame abstraction. It provides built-in, easy-to-use distributed graph algorithms as well as flexible APIs, such as Pregel and AggregateMessages, for custom graph processing. Users can write highly expressive queries by combining the DataFrame API with a dedicated API for network motif finding. Users also benefit from DataFrame performance optimizations within the Spark SQL engine. GraphFrames works in Java, Scala, and Python.

GraphFrames use cases

There are some popular use cases where GraphFrames is almost irreplaceable, including, but not limited to:

Compliance analytics with a scalable shortest paths algorithm and motif analysis;
Anti-fraud with scalable cycle detection in large networks;
Identity resolution at the scale of billions of records with highly efficient connected components;
Search result ranking with a distributed, Pregel-based PageRank;
Clustering huge graphs with Label Propagation and Power Iteration Clustering;
Building knowledge graph systems with the Property Graph Model.
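
As a rough illustration of what "Pregel-based" means in the PageRank use case above, the sketch below runs vertex-centric PageRank on a single machine in plain Python. It is not GraphFrames code and not how the library implements the algorithm; the function name `pregel_pagerank`, its parameters, and the tiny edge list are assumptions made for the example. Each superstep has a message phase (every vertex sends its rank share along its out-edges) and an update phase (every vertex combines its inbox with the reset probability).

```python
from collections import defaultdict


def pregel_pagerank(vertices, edges, reset_probability=0.15, tol=1e-6, max_supersteps=100):
    """Single-machine sketch of vertex-centric (Pregel-style) PageRank.

    Hypothetical helper for illustration only; not part of GraphFrames.
    """
    out_degree = defaultdict(int)
    for src, _ in edges:
        out_degree[src] += 1

    # Every vertex starts with rank 1.0, so the ranks sum to the vertex count.
    ranks = {v: 1.0 for v in vertices}
    for _ in range(max_supersteps):
        # Message phase: each edge carries the source's rank share to the destination.
        inbox = defaultdict(float)
        for src, dst in edges:
            inbox[dst] += ranks[src] / out_degree[src]

        # Update phase: combine received messages with the reset probability.
        new_ranks = {
            v: reset_probability + (1.0 - reset_probability) * inbox[v]
            for v in vertices
        }

        # Stop once no vertex moved by more than the tolerance.
        converged = max(abs(new_ranks[v] - ranks[v]) for v in vertices) < tol
        ranks = new_ranks
        if converged:
            break
    return ranks


# Assumed toy graph: vertex 2 is linked from both 1 and 3.
edges = [(1, 2), (2, 1), (2, 3), (3, 2)]
ranks = pregel_pagerank([1, 2, 3], edges)
print(ranks)
```

In a distributed engine the same two phases run in parallel across partitions. For brevity, this sketch does not redistribute rank from vertices that have no outgoing edges.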

Documentation

Quick Start

You can create a GraphFrame as follows.

from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.getOrCreate()

# The vertex DataFrame must have an "id" column.
nodes = [
    (1, "Alice", 30),
    (2, "Bob", 25),
    (3, "Charlie", 35)
]
nodes_df = spark.createDataFrame(nodes, ["id", "name", "age"])

# The edge DataFrame must have "src" and "dst" columns that refer to vertex ids.
edges = [
    (1, 2, "friend"),
    (2, 1, "friend"),
    (2, 3, "friend"),
    (3, 2, "enemy")  # eek!
]
edges_df = spark.createDataFrame(edges, ["src", "dst", "relationship"])

g = GraphFrame(nodes_df, edges_df)

Now let's run some graph algorithms at scale!

g.inDegrees.show()

# +---+--------+
# | id|inDegree|
# +---+--------+
# |  2|       2|
# |  1|       1|
# |  3|       1|
# +---+--------+

g.outDegrees.show()

# +---+---------+
# | id|outDegree|
# +---+---------+
# |  1|        1|
# |  2|        2|
# |  3|        1|
# +---+---------+

g.degrees.show()

# +---+------+
# | id|degree|
# +---+------+
# |  1|     2|
# |  2|     4|
# |  3|     2|
# +---+------+

g2 = g.pageRank(resetProbability=0.15, tol=0.01)
g2.vertices.show()

# +---+-------+---+------------------+
# | id|   name|age|          pagerank|
# +---+-------+---+------------------+
# |  1|  Alice| 30|0.7758750474847483|
# |  2|    Bob| 25|1.4482499050305027|
# |  3|Charlie| 35|0.7758750474847483|
# +---+-------+---+------------------+

# GraphFrames' most used feature...
# Connected components can do big data entity resolution on billions or even trillions of records!
# First connect records with a similarity metric, then run connectedComponents.
# This gives you groups of identical records, which you then link by same_as edges or merge into list-based master records.
spark.sparkContext.setCheckpointDir("/tmp/graphframes-example-connected-components")  # required by GraphFrames.connectedComponents
g.connectedComponents().show()

# +---+-------+---+---------+
# | id|   name|age|component|
# +---+-------+---+---------+
# |  1|  Alice| 30|        1|
# |  2|    Bob| 25|        1|
# |  3|Charlie| 35|        1|
# +---+-------+---+---------+

# Find frenemies with network motif finding! See how graph and relational queries are combined?
(
    g.find("(a)-[e]->(b); (b)-[e2]->(a)")
    .filter("e.relationship = 'friend' and e2.relationship = 'enemy'")
    .show()
)

# These are paths, which you can aggregate and count to find complex patterns.
# +------------+--------------+----------------+-------------+
# |           a|             e|               b|           e2|
# +------------+--------------+----------------+-------------+
# |{2, Bob, 25}|{2, 3, friend}|{3, Charlie, 35}|{3, 2, enemy}|
# +------------+--------------+----------------+-------------+
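
The entity resolution workflow described in the comments above (link similar records, find connected components, merge each group) can be sketched on one machine with a union-find structure. This is an illustration under stated assumptions, not the distributed algorithm GraphFrames runs: the helper `connected_components`, the sample records, and the similarity pairs are all hypothetical.

```python
def connected_components(record_ids, similar_pairs):
    """Union-find sketch: group records that are linked by similarity edges.

    Hypothetical helper for illustration only; not part of GraphFrames.
    """
    parent = {r: r for r in record_ids}

    def find(x):
        # Walk to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller id as the component label.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return {r: find(r) for r in record_ids}


# Assumed sample data: records 1 and 2 share an email, 2 and 3 share a phone number.
records = {
    1: "A. Smith <asmith@example.com>",
    2: "Alice Smith <asmith@example.com>, 555-0100",
    3: "Alice S., 555-0100",
    4: "Bob Jones <bjones@example.com>",
}
similar_pairs = [(1, 2), (2, 3)]

components = connected_components(records.keys(), similar_pairs)

# Merge each component into a list-based master record.
master_records = {}
for record_id, component in components.items():
    master_records.setdefault(component, []).append(record_id)
print(master_records)
```

Records 1 and 3 are never compared directly, yet they end up in the same group because both are linked to record 2. That transitive grouping is what the connected components step contributes to entity resolution.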

Learn GraphFrames

To learn more about GraphFrames, check out these resources:

GraphFrames tutorials

GraphFrames Network Motif Finding Tutorial

Community Resources

These resources are provided by the community:

GraphFrames Internals

Contributing

GraphFrames was created as a collaborative effort among UC Berkeley, MIT, Databricks, and the open source community. GraphFrames is currently maintained by a group of individual contributors.

See the contribution guide and the local development setup walkthrough for step-by-step instructions on preparing your environment, running tests, and submitting changes.

Releases

See release notes.
