
bug: cannot make graphframes work with databricks-connect on databricks runtime 17.3 #782

@JoshuaBWT

Description

Describe the bug

Hi team, GraphFrames does not seem to work with databricks-connect on Databricks Runtime 17.3 (Spark 4.0).

To Reproduce

Steps to reproduce the behavior:

  1. Set up a Databricks cluster on Databricks Runtime 17.3.
  2. Install io.graphframes:graphframes-spark4_2.13:0.10.1 on the cluster from Maven.
  3. Build the dbx-flavored Spark Connect assembly jar with ./build/sbt connect/assembly -Dvendor.name=dbx -Dscala.version=2.13.16 -Dspark.version=4.0.0 and install it on the cluster.
  4. Run a small PySpark script locally via databricks-connect (the same code works when run in a notebook on Databricks, i.e. without going through databricks-connect):

from databricks.connect import DatabricksSession  # type: ignore
from graphframes import GraphFrame

spark = DatabricksSession.builder.getOrCreate()

nodes = [(1, "Alice", 30), (2, "Bob", 25), (3, "Charlie", 35)]
nodes_df = spark.createDataFrame(nodes, ["id", "name", "age"])

edges = [
    (1, 2, "friend"),
    (2, 1, "friend"),
    (2, 3, "friend"),
    (3, 2, "enemy"),  # eek!
]
edges_df = spark.createDataFrame(edges, ["src", "dst", "relationship"])

edges_df.show()

g = GraphFrame(nodes_df, edges_df)

g.connectedComponents().show()

Expected behavior

A completed connectedComponents run via Databricks Connect.

System information:

Databricks cluster

  • Databricks Runtime: 17.3.x-scala2.13
  • Operating System: Ubuntu 24.04.2 LTS
  • Java: Zulu17.58+21-CA
  • Scala: 2.13.16
  • Python: 3.12.3
  • Delta Lake: 4.0.0
  • Spark: 4.0.0
  • GraphFrames: io.graphframes:graphframes-spark4_2.13:0.10.1

Local system

  • Java: openjdk 17.0.11 2024-04-16 LTS
  • Python: 3.12.12
  • databricks-connect: 17.3
  • databricks-sdk: 0.64
  • graphframes-py: 0.10.1

Component

  • Scala Core Internal
  • Scala API
  • Spark Connect Plugin
  • PySpark Classic
  • PySpark Connect

Additional context

I get two errors coming from gRPC/protobuf. The first one always appears on the first run of the test code after the cluster starts:

	status = StatusCode.UNKNOWN
	details = "grpc_shaded/com/google/protobuf/RuntimeVersion$RuntimeDomain"
	debug_error_string = "UNKNOWN:Error received from peer  {grpc_status:2, grpc_message:"grpc_shaded/com/google/protobuf/RuntimeVersion$RuntimeDomain"}"
java.lang.NoClassDefFoundError: grpc_shaded/com/google/protobuf/RuntimeVersion$RuntimeDomain
	at org.graphframes.connect.proto.GraphFramesAPI.<clinit>(GraphFramesAPI.java:22)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:569)
	at grpc_shaded.com.google.protobuf.Internal.getDefaultInstance(Internal.java:353)
	at grpc_shaded.com.google.protobuf.Any.is(Any.java:85)
	at org.apache.spark.sql.graphframes.GraphFramesConnect.transform(GraphFramesConnect.scala:16)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.$anonfun$transformRelationPlugin$1(SparkConnectPlanner.scala:375)
	at scala.collection.Iterator$$anon$9.next(Iterator.scala:584)
	at scala.collection.IterableOnceOps.find(IterableOnce.scala:677)
	at scala.collection.IterableOnceOps.find$(IterableOnce.scala:674)
	at scala.collection.AbstractIterable.find(Iterable.scala:935)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformRelationPlugin(SparkConnectPlanner.scala:378)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.$anonfun$transformRelation$1(SparkConnectPlanner.scala:343)
	at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$usePlanCache$8(SessionHolder.scala:743)
	at org.apache.spark.sql.connect.service.SessionHolder.measureSubtreeRelationNodes(SessionHolder.scala:759)
	at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$usePlanCache$6(SessionHolder.scala:742)
	at scala.Option.getOrElse(Option.scala:201)
	at org.apache.spark.sql.connect.service.SessionHolder.usePlanCache(SessionHolder.scala:740)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformRelation(SparkConnectPlanner.scala:229)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformRelation(SparkConnectPlanner.scala:215)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformShowString(SparkConnectPlanner.scala:433)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.$anonfun$transformRelation$1(SparkConnectPlanner.scala:232)
	at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$usePlanCache$8(SessionHolder.scala:743)
	at org.apache.spark.sql.connect.service.SessionHolder.measureSubtreeRelationNodes(SessionHolder.scala:759)
	at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$usePlanCache$6(SessionHolder.scala:742)
	at scala.Option.getOrElse(Option.scala:201)
	at org.apache.spark.sql.connect.service.SessionHolder.usePlanCache(SessionHolder.scala:740)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformRelation(SparkConnectPlanner.scala:229)
	at org.apache.spark.sql.connect.execution.SparkConnectPlanExecution.handlePlan(SparkConnectPlanExecution.scala:96)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.handlePlan(ExecuteThreadRunner.scala:385)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.$anonfun$executeInternal$1(ExecuteThreadRunner.scala:291)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.$anonfun$executeInternal$1$adapted(ExecuteThreadRunner.scala:247)
	at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$withSession$2(SessionHolder.scala:536)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:860)
	at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$withSession$1(SessionHolder.scala:536)
	at org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:97)
	at org.apache.spark.sql.artifact.ArtifactManager.$anonfun$withResources$1(ArtifactManager.scala:124)
	at org.apache.spark.sql.artifact.ArtifactManager.withClassLoaderIfNeeded(ArtifactManager.scala:118)
	at org.apache.spark.sql.artifact.ArtifactManager.withResources(ArtifactManager.scala:123)
	at org.apache.spark.sql.connect.service.SessionHolder.withSession(SessionHolder.scala:535)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.executeInternal(ExecuteThreadRunner.scala:247)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.$anonfun$execute$1(ExecuteThreadRunner.scala:141)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
	at com.databricks.spark.connect.service.UtilizationMetrics.recordActiveQueries(UtilizationMetrics.scala:43)
	at com.databricks.spark.connect.service.UtilizationMetrics.recordActiveQueries$(UtilizationMetrics.scala:40)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.recordActiveQueries(ExecuteThreadRunner.scala:53)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.org$apache$spark$sql$connect$execution$ExecuteThreadRunner$$execute(ExecuteThreadRunner.scala:139)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner$ExecutionThread.$anonfun$run$2(ExecuteThreadRunner.scala:595)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
	at com.databricks.unity.UCSEphemeralState$Handle.runWith(UCSEphemeralState.scala:51)
	at com.databricks.unity.HandleImpl.runWith(UCSHandle.scala:104)
	at com.databricks.unity.HandleImpl.$anonfun$runWithAndClose$1(UCSHandle.scala:109)
	at scala.util.Using$.resource(Using.scala:296)
	at com.databricks.unity.HandleImpl.runWithAndClose(UCSHandle.scala:108)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner$ExecutionThread.run(ExecuteThreadRunner.scala:595)
Caused by: java.lang.ClassNotFoundException: grpc_shaded.com.google.protobuf.RuntimeVersion$RuntimeDomain
	at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:592)
	at com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader.loadClass(ClassLoaders.scala:152)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
	... 57 more
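
If I read the trace correctly, the missing class grpc_shaded/com/google/protobuf/RuntimeVersion$RuntimeDomain is the relocated form of com.google.protobuf.RuntimeVersion, which only exists in protobuf-java 4.x runtimes: protoc 4.x gencode references it from its static initializer. So the GraphFramesAPI gencode appears to have been generated against protobuf 4.x while the cluster's shaded runtime is an older 3.x one. Here is a minimal sketch to check what the assembly jar actually bundles (the jar path is my assumption; point it at your connect/assembly output):

# Sketch: look for the shaded protobuf RuntimeVersion class in the assembly jar.
# The jar path below is hypothetical; adjust it to your local build output.
import zipfile

jar_path = "connect/target/scala-2.13/graphframes-connect-assembly-0.10.1.jar"

with zipfile.ZipFile(jar_path) as jar:
    names = jar.namelist()
    # True only if the bundled protobuf runtime ships RuntimeVersion (4.x).
    print(any("protobuf/RuntimeVersion" in n for n in names))
    # Show which shaded protobuf RuntimeVersion classes, if any, exist.
    for n in sorted(names):
        if "grpc_shaded/com/google/protobuf/RuntimeVersion" in n:
            print(n)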

The second one appears on every subsequent run of the same code:

	status = StatusCode.UNKNOWN
	details = "Could not initialize class org.graphframes.connect.proto.GraphFramesAPI"
	debug_error_string = "UNKNOWN:Error received from peer  {grpc_status:2, grpc_message:"Could not initialize class org.graphframes.connect.proto.GraphFramesAPI"}"
java.lang.NoClassDefFoundError: Could not initialize class org.graphframes.connect.proto.GraphFramesAPI
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:569)
	at grpc_shaded.com.google.protobuf.Internal.getDefaultInstance(Internal.java:353)
	at grpc_shaded.com.google.protobuf.Any.is(Any.java:85)
	at org.apache.spark.sql.graphframes.GraphFramesConnect.transform(GraphFramesConnect.scala:16)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.$anonfun$transformRelationPlugin$1(SparkConnectPlanner.scala:375)
	at scala.collection.Iterator$$anon$9.next(Iterator.scala:584)
	at scala.collection.IterableOnceOps.find(IterableOnce.scala:677)
	at scala.collection.IterableOnceOps.find$(IterableOnce.scala:674)
	at scala.collection.AbstractIterable.find(Iterable.scala:935)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformRelationPlugin(SparkConnectPlanner.scala:378)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.$anonfun$transformRelation$1(SparkConnectPlanner.scala:343)
	at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$usePlanCache$8(SessionHolder.scala:743)
	at org.apache.spark.sql.connect.service.SessionHolder.measureSubtreeRelationNodes(SessionHolder.scala:759)
	at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$usePlanCache$6(SessionHolder.scala:742)
	at scala.Option.getOrElse(Option.scala:201)
	at org.apache.spark.sql.connect.service.SessionHolder.usePlanCache(SessionHolder.scala:740)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformRelation(SparkConnectPlanner.scala:229)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformRelation(SparkConnectPlanner.scala:215)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformShowString(SparkConnectPlanner.scala:433)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.$anonfun$transformRelation$1(SparkConnectPlanner.scala:232)
	at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$usePlanCache$8(SessionHolder.scala:743)
	at org.apache.spark.sql.connect.service.SessionHolder.measureSubtreeRelationNodes(SessionHolder.scala:759)
	at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$usePlanCache$6(SessionHolder.scala:742)
	at scala.Option.getOrElse(Option.scala:201)
	at org.apache.spark.sql.connect.service.SessionHolder.usePlanCache(SessionHolder.scala:740)
	at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformRelation(SparkConnectPlanner.scala:229)
	at org.apache.spark.sql.connect.execution.SparkConnectPlanExecution.handlePlan(SparkConnectPlanExecution.scala:96)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.handlePlan(ExecuteThreadRunner.scala:385)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.$anonfun$executeInternal$1(ExecuteThreadRunner.scala:291)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.$anonfun$executeInternal$1$adapted(ExecuteThreadRunner.scala:247)
	at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$withSession$2(SessionHolder.scala:536)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:860)
	at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$withSession$1(SessionHolder.scala:536)
	at org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:97)
	at org.apache.spark.sql.artifact.ArtifactManager.$anonfun$withResources$1(ArtifactManager.scala:124)
	at org.apache.spark.sql.artifact.ArtifactManager.withClassLoaderIfNeeded(ArtifactManager.scala:118)
	at org.apache.spark.sql.artifact.ArtifactManager.withResources(ArtifactManager.scala:123)
	at org.apache.spark.sql.connect.service.SessionHolder.withSession(SessionHolder.scala:535)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.executeInternal(ExecuteThreadRunner.scala:247)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.$anonfun$execute$1(ExecuteThreadRunner.scala:141)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
	at com.databricks.spark.connect.service.UtilizationMetrics.recordActiveQueries(UtilizationMetrics.scala:43)
	at com.databricks.spark.connect.service.UtilizationMetrics.recordActiveQueries$(UtilizationMetrics.scala:40)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.recordActiveQueries(ExecuteThreadRunner.scala:53)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.org$apache$spark$sql$connect$execution$ExecuteThreadRunner$$execute(ExecuteThreadRunner.scala:139)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner$ExecutionThread.$anonfun$run$2(ExecuteThreadRunner.scala:595)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
	at com.databricks.unity.UCSEphemeralState$Handle.runWith(UCSEphemeralState.scala:51)
	at com.databricks.unity.HandleImpl.runWith(UCSHandle.scala:104)
	at com.databricks.unity.HandleImpl.$anonfun$runWithAndClose$1(UCSHandle.scala:109)
	at scala.util.Using$.resource(Using.scala:296)
	at com.databricks.unity.HandleImpl.runWithAndClose(UCSHandle.scala:108)
	at org.apache.spark.sql.connect.execution.ExecuteThreadRunner$ExecutionThread.run(ExecuteThreadRunner.scala:595)
Caused by: java.lang.ExceptionInInitializerError: Exception java.lang.NoClassDefFoundError: grpc_shaded/com/google/protobuf/RuntimeVersion$RuntimeDomain [in thread "SparkConnectExecuteThread_opId=63e2cdea-e873-41be-ac29-32fb5d8b5882"]
	at org.graphframes.connect.proto.GraphFramesAPI.<clinit>(GraphFramesAPI.java:22)
	... 56 more

I tried changing a lot of different parameters, including manually setting the protoc version in build.sbt (l. 34), but had no success with any 3.x or later version.
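
Relatedly, a quick way to confirm which protoc produced the gencode is to grep the generated GraphFramesAPI.java (named in the trace above) for the version banner that recent protoc releases stamp into generated files, and for any RuntimeVersion reference, which to my knowledge only protobuf 4.x gencode emits. The search root below is a guess based on a typical sbt-protoc layout:

from pathlib import Path

# Hypothetical search root; sbt-protoc normally writes generated sources
# under target/**/src_managed. Adjust if your layout differs.
gen = next(Path("connect/target").rglob("GraphFramesAPI.java"), None)

if gen is None:
    print("GraphFramesAPI.java not found under connect/target")
else:
    for line in gen.read_text().splitlines():
        # Recent protoc adds a "// Protobuf Java Version: x.y.z" banner;
        # only 4.x gencode references RuntimeVersion in its initializer.
        if "Protobuf Java Version" in line or "RuntimeVersion" in line:
            print(line.strip())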

Are you planning on creating a PR?

  • I'm willing to make a pull-request

Thanks a lot for the support and for everything that you are doing!
Joshua
