Adding Node Types to GraphFrames
The purpose of this document is to reason through the addition of node types to GraphFrames in order to better handle labeled property graph (LPG) data.
As of version 0.8.4 there is no distinction between types of node in GraphFrames. There is support for different edge types using the relationship field.
- Node required columns: id
- Edge required columns: src, dst, relationship
While it is possible to treat any property of a node as its type, and to use that property in features like network motifs, there are limitations when dealing with multiple types in GraphFrames.
As described in the Motif Finding Tutorial, to represent a labeled property graph (LPG) for motif finding it is necessary to create all fields in all node types and then union the result. There is no utility that does this for you; it is up to the user to figure this out, and many will be confused and will simply avoid GraphFrames. The manual merge looks like this:
from typing import List, Tuple

import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql import DataFrame

# a, b, ... are the per-type node DataFrames to merge.
# Collect every (column name, schema field) pair that appears in any of them.
all_cols: List[Tuple[str, T.StructField]] = list(
    set(
        list(zip(a.columns, a.schema))
        + list(zip(b.columns, b.schema))
        # ...
    )
)
all_column_names: List[str] = sorted([x[0] for x in all_cols])


def add_missing_columns(df: DataFrame, all_cols: List[Tuple[str, T.StructField]]) -> DataFrame:
    """Add any missing columns from any DataFrame among several we want to merge."""
    for col_name, schema_field in all_cols:
        if col_name not in df.columns:
            df = df.withColumn(col_name, F.lit(None).cast(schema_field.dataType))
    return df


# Now apply this function to each of your DataFrames to get a consistent schema
a = add_missing_columns(a, all_cols).select(all_column_names)
b = add_missing_columns(b, all_cols).select(all_column_names)
...

# Ensure we got the property merge right...
assert (
    set(a.columns)
    == set(b.columns)
    # ...
)

The addition of an [optional or required] type field to vertices would work much like the relationship field for edges.
- Node required columns: id, type
- Edge required columns: src, dst, relationship
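To make the "[optional or required]" choice concrete, here is a minimal sketch of the column check such a change might imply. It is plain Python over lists of column names (in practice these would come from `df.columns`). The names `NODE_REQUIRED`, `EDGE_REQUIRED`, `missing_columns`, `check_graph_columns` and the `type_required` flag are all hypothetical and not part of GraphFrames.

```python
from typing import Iterable, List

# Required columns under the proposal; "type" is the new vertex column.
NODE_REQUIRED = ("id", "type")
EDGE_REQUIRED = ("src", "dst", "relationship")


def missing_columns(columns: Iterable[str], required: Iterable[str]) -> List[str]:
    """Return the required column names absent from `columns`, in declared order."""
    present = set(columns)
    return [name for name in required if name not in present]


def check_graph_columns(
    node_columns: Iterable[str],
    edge_columns: Iterable[str],
    type_required: bool = True,
) -> None:
    """Raise ValueError if the node or edge columns lack a required field.

    With type_required=False, "type" is treated as optional on nodes
    (hypothetical flag, mirroring the open question in the text).
    """
    node_required = NODE_REQUIRED if type_required else ("id",)
    problems = {
        "nodes": missing_columns(node_columns, node_required),
        "edges": missing_columns(edge_columns, EDGE_REQUIRED),
    }
    problems = {kind: cols for kind, cols in problems.items() if cols}
    if problems:
        raise ValueError(f"missing required columns: {problems}")
```

Making the field optional keeps existing untyped graphs valid, while making it required lets every type-aware utility assume the column is present.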
Once nodes and edges both have types, there are useful utilities we can build:
Type-aware degree functions compute the degree of a node partitioned by the relationship types of its edges or by the types of its neighboring nodes, and return these counts in a MapType column. It might be useful to compute values for ALL edge relationships or node types and fill missing types with zeros. This method is recommended in the literature as a replacement for triangle counts in clustering coefficients for highly connected graphs.
Example usage:
degree_maps = g.typeDegree(on="relationship")
degree_maps.printSchema()
root
 |-- id: string (nullable = true)
 |-- typeDegrees: map (nullable = true)
 |    |-- key: string
 |    |-- value: long
degree_maps.show(truncate=False)
+-------+---------------------------------------------+
|id     |typeDegrees                                  |
+-------+---------------------------------------------+
|<uuid1>|{friend -> 3, enemy -> 2, acquaintance -> 0} |
|<uuid2>|{friend -> 0, enemy -> 0, acquaintance -> 25}|
+-------+---------------------------------------------+
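
To pin down the intended semantics of a type-aware degree, the following plain-Python sketch computes the same kind of per-node map over an in-memory edge list. It is a reference illustration only, not a Spark implementation and not an existing GraphFrames API. The function name `type_degree`, its parameters, and the choice to count each edge toward both endpoints are assumptions made for this example.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Mapping, Optional, Tuple

Edge = Tuple[str, str, str]  # (src, dst, relationship)


def type_degree(
    node_ids: Iterable[str],
    edges: Iterable[Edge],
    on: str = "relationship",
    node_types: Optional[Mapping[str, str]] = None,
) -> Dict[str, Dict[str, int]]:
    """Per-node degree, split by edge relationship or by neighbor node type.

    Assumption: every edge counts once toward each of its endpoints.
    Keys with no matching edges are filled with zero.
    """
    if on not in ("relationship", "type"):
        raise ValueError("on must be 'relationship' or 'type'")
    if on == "type" and node_types is None:
        raise ValueError("node_types is required when on='type'")

    edges = list(edges)
    counts: Dict[str, Counter] = defaultdict(Counter)
    for src, dst, relationship in edges:
        if on == "relationship":
            counts[src][relationship] += 1
            counts[dst][relationship] += 1
        else:
            # Each endpoint is credited with the type of the node at the other end.
            counts[src][node_types[dst]] += 1
            counts[dst][node_types[src]] += 1

    # The full key set, so that every node reports every key.
    if on == "relationship":
        all_keys = sorted({rel for _, _, rel in edges})
    else:
        all_keys = sorted(set(node_types.values()))

    return {n: {k: counts[n][k] for k in all_keys} for n in node_ids}


nodes = ["a", "b", "c", "d"]
edges = [("a", "b", "friend"), ("a", "c", "friend"), ("b", "c", "enemy")]
types = {"a": "person", "b": "person", "c": "company", "d": "company"}

by_relationship = type_degree(nodes, edges, on="relationship")
# by_relationship["a"] == {"enemy": 0, "friend": 2}
# by_relationship["d"] == {"enemy": 0, "friend": 0}

by_neighbor_type = type_degree(nodes, edges, on="type", node_types=types)
# by_neighbor_type["c"] == {"company": 0, "person": 2}
```

Zero-filling gives every node the same key set, which is convenient if the maps are later expanded into fixed-width feature vectors.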