
Adding Node Types to GraphFrames

Russell Jurney edited this page Apr 17, 2025 · 6 revisions

The purpose of this document is to reason through the addition of node types to GraphFrames in order to better handle labeled property graph (LPG) data.

GraphFrames without Types

As of version 0.8.4, GraphFrames does not distinguish between node types. Different edge types are supported through the relationship field.

  • Node required columns: id
  • Edge required columns: src, dst (relationship is the conventional column for edge types, but GraphFrames does not require it)

While any property of a node can serve as its type, including in features such as network motifs, GraphFrames has limitations when a graph contains multiple node types.

Merging Node Types

As described in the Motif Finding Tutorial, representing a labeled property graph (LPG) for motif finding requires adding every field from every node type to each node DataFrame and then taking the union of the results. No utility does this for you; users must work it out themselves, and many will be confused enough to avoid GraphFrames altogether.

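from typing import List, Tuple

import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql import DataFrame

# Collect the (column name, schema field) pairs of every node DataFrame (a, b, ...)
all_cols: List[Tuple[str, T.StructField]] = list(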
    set(
        list(zip(a.columns, a.schema))
        + list(zip(b.columns, b.schema))
        ...
    )
)
all_column_names: List[str] = sorted([x[0] for x in all_cols])


def add_missing_columns(df: DataFrame, all_cols: List[Tuple[str, T.StructField]]) -> DataFrame:
    """Add any missing columns from any DataFrame among several we want to merge."""
    for col_name, schema_field in all_cols:
        if col_name not in df.columns:
            df = df.withColumn(col_name, F.lit(None).cast(schema_field.dataType))
    return df


# Now apply this function to each of your DataFrames to get a consistent schema
a = add_missing_columns(a, all_cols).select(all_column_names)
b = add_missing_columns(b, all_cols).select(all_column_names)
...

# Ensure we got the property merge right...
assert (
    set(a.columns)
    == set(b.columns)
    ...
)
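The logic of this merge can be illustrated without Spark. The following is a minimal plain-Python sketch, not GraphFrames code: it treats each node type as a list of dict rows, fills missing properties with None, and tags each row with a type label. The function name merge_node_types and the sample rows are hypothetical and exist only for illustration.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def merge_node_types(tables: Dict[str, List[Row]]) -> List[Row]:
    """Union the rows of several node types into one list with a shared schema."""
    # Every property name used by any node type, sorted for a stable column order
    all_columns = sorted(
        {col for rows in tables.values() for row in rows for col in row}
    )
    merged: List[Row] = []
    for node_type, rows in tables.items():
        for row in rows:
            # Properties this node type lacks become None, mirroring F.lit(None)
            full: Row = {col: row.get(col) for col in all_columns}
            # Assumption: the label is stored in a column named "type"
            full["type"] = node_type
            merged.append(full)
    return merged


# Hypothetical sample data: two node types with different properties
people = [{"id": "p1", "name": "Ada"}]
companies = [{"id": "c1", "industry": "software"}]

nodes = merge_node_types({"Person": people, "Company": companies})
for node in nodes:
    print(node)
```

In Spark 3.1 and later, DataFrame.unionByName(other, allowMissingColumns=True) performs the same null-filling union directly on DataFrames, which may remove the need for a helper like add_missing_columns.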

GraphFrames with Types

The addition of an [optional or required] type field to vertices would work much like relationships for edges.

  • Node required columns: id, type
  • Edge required columns: src, dst, relationship

Type Utilities

Once nodes and edges both have types, there are useful utilities we can build:

GraphFrame.typeDegree(), GraphFrame.typeInDegree() and GraphFrame.typeOutDegree()

These type-aware degree functions compute the degree of a node partitioned by the relationship types of its edges or by the types of its neighboring nodes, and return the counts in a MapType column. It might be useful to compute values for ALL edge relationships or node types and fill missing types with zeros. This method is recommended in the literature as a replacement for triangle counts when computing clustering coefficients on highly connected graphs.

Example usage:

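# typeDegree() is the proposed API; keep the DataFrame rather than the None returned by show()
degree_maps = g.typeDegree(on="relationship")
degree_maps.printSchema()
degree_maps.show(truncate=False)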

root
 |-- id: string (nullable = true)
 |-- typeDegrees: map (nullable = true)
 |    |-- key: string
 |    |-- value: long

+-------+---------------------------------------------+
|id     |typeDegrees                                  |
+-------+---------------------------------------------+
|<uuid1>|{friend -> 3, enemy -> 2, acquaintance -> 0} |
|<uuid2>|{friend -> 0, enemy -> 0, acquaintance -> 25}|
+-------+---------------------------------------------+
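
The intended semantics of typeDegree(on="relationship") can be sketched in plain Python. This is only an illustration of the counting and zero-filling described above, not the proposed Spark implementation. The function name type_degree, the fill_zeros parameter, and the sample edges are assumptions made for this sketch.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Edge = Tuple[str, str, str]  # (src, dst, relationship)


def type_degree(
    edges: Iterable[Edge], fill_zeros: bool = True
) -> Dict[str, Dict[str, int]]:
    """Count each node's incident edges, partitioned by relationship type."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    relationships = set()
    for src, dst, rel in edges:
        relationships.add(rel)
        # Total degree counts both endpoints, so a self-loop contributes 2
        counts[src][rel] += 1
        counts[dst][rel] += 1
    result: Dict[str, Dict[str, int]] = {}
    for node, counter in counts.items():
        if fill_zeros:
            # Report every relationship seen in the graph, using 0 where absent
            result[node] = {rel: counter[rel] for rel in sorted(relationships)}
        else:
            result[node] = dict(counter)
    return result


# Hypothetical sample edges
edges = [
    ("a", "b", "friend"),
    ("a", "c", "friend"),
    ("b", "a", "enemy"),
    ("c", "d", "acquaintance"),
]

degrees = type_degree(edges)
for node_id, degree_map in sorted(degrees.items()):
    print(node_id, degree_map)
```

Counting only the src side or only the dst side of each edge would give the typeOutDegree() and typeInDegree() variants, respectively.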