Description
Describe the bug
Hi team, graphframes does not seem to work with databricks-connect on Databricks Runtime 17.3 (Spark 4.0).
To Reproduce
Steps to reproduce the behavior:
- Set up a Databricks cluster using Databricks Runtime 17.3
- Install io.graphframes:graphframes-spark4_2.13:0.10.1 on the cluster from Maven
- Build the dbx Spark Connect jar and install it on the cluster:

  ```shell
  ./build/sbt connect/assembly -Dvendor.name=dbx -Dscala.version=2.13.16 -Dspark.version=4.0.0
  ```

- Run a small PySpark snippet locally through databricks-connect (the same code works when run directly in a Databricks notebook, without going through databricks-connect):
```python
from databricks.connect import (  # type: ignore
    DatabricksEnv,
    DatabricksSession,
)
from graphframes import GraphFrame

spark = DatabricksSession.builder.getOrCreate()

nodes = [(1, "Alice", 30), (2, "Bob", 25), (3, "Charlie", 35)]
nodes_df = spark.createDataFrame(nodes, ["id", "name", "age"])

edges = [
    (1, 2, "friend"),
    (2, 1, "friend"),
    (2, 3, "friend"),
    (3, 2, "enemy"),  # eek!
]
edges_df = spark.createDataFrame(edges, ["src", "dst", "relationship"])
edges_df.show()

g = GraphFrame(nodes_df, edges_df)
g.connectedComponents().show()
```
Expected behavior
A completed connectedComponents run through databricks-connect.
System [please complete the following information]:
Databricks cluster:
- Databricks Runtime: 17.3.x-scala2.13
- Operating System: Ubuntu 24.04.2 LTS
- Java: Zulu17.58+21-CA
- Scala: 2.13.16
- Python: 3.12.3
- Delta Lake: 4.0.0
- Spark: 4.0.0
- graphframes: io.graphframes:graphframes-spark4_2.13:0.10.1
Local system:
- Java: openjdk 17.0.11 2024-04-16 LTS
- Python: 3.12.12
- databricks-connect: 17.3
- databricks-sdk: 0.64
- graphframes-py: 0.10.1
Component
- Scala Core Internal
- Scala API
- Spark Connect Plugin
- PySpark Classic
- PySpark Connect
Additional context
I get two errors coming from grpc/protobuf. The first one always appears on the first run of the test code after the cluster has started:
status = StatusCode.UNKNOWN
details = "grpc_shaded/com/google/protobuf/RuntimeVersion$RuntimeDomain"
debug_error_string = "UNKNOWN:Error received from peer {grpc_status:2, grpc_message:"grpc_shaded/com/google/protobuf/RuntimeVersion$RuntimeDomain"}"
java.lang.NoClassDefFoundError: grpc_shaded/com/google/protobuf/RuntimeVersion$RuntimeDomain
at org.graphframes.connect.proto.GraphFramesAPI.<clinit>(GraphFramesAPI.java:22)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:569)
at grpc_shaded.com.google.protobuf.Internal.getDefaultInstance(Internal.java:353)
at grpc_shaded.com.google.protobuf.Any.is(Any.java:85)
at org.apache.spark.sql.graphframes.GraphFramesConnect.transform(GraphFramesConnect.scala:16)
at org.apache.spark.sql.connect.planner.SparkConnectPlanner.$anonfun$transformRelationPlugin$1(SparkConnectPlanner.scala:375)
at scala.collection.Iterator$$anon$9.next(Iterator.scala:584)
at scala.collection.IterableOnceOps.find(IterableOnce.scala:677)
at scala.collection.IterableOnceOps.find$(IterableOnce.scala:674)
at scala.collection.AbstractIterable.find(Iterable.scala:935)
at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformRelationPlugin(SparkConnectPlanner.scala:378)
at org.apache.spark.sql.connect.planner.SparkConnectPlanner.$anonfun$transformRelation$1(SparkConnectPlanner.scala:343)
at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$usePlanCache$8(SessionHolder.scala:743)
at org.apache.spark.sql.connect.service.SessionHolder.measureSubtreeRelationNodes(SessionHolder.scala:759)
at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$usePlanCache$6(SessionHolder.scala:742)
at scala.Option.getOrElse(Option.scala:201)
at org.apache.spark.sql.connect.service.SessionHolder.usePlanCache(SessionHolder.scala:740)
at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformRelation(SparkConnectPlanner.scala:229)
at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformRelation(SparkConnectPlanner.scala:215)
at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformShowString(SparkConnectPlanner.scala:433)
at org.apache.spark.sql.connect.planner.SparkConnectPlanner.$anonfun$transformRelation$1(SparkConnectPlanner.scala:232)
at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$usePlanCache$8(SessionHolder.scala:743)
at org.apache.spark.sql.connect.service.SessionHolder.measureSubtreeRelationNodes(SessionHolder.scala:759)
at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$usePlanCache$6(SessionHolder.scala:742)
at scala.Option.getOrElse(Option.scala:201)
at org.apache.spark.sql.connect.service.SessionHolder.usePlanCache(SessionHolder.scala:740)
at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformRelation(SparkConnectPlanner.scala:229)
at org.apache.spark.sql.connect.execution.SparkConnectPlanExecution.handlePlan(SparkConnectPlanExecution.scala:96)
at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.handlePlan(ExecuteThreadRunner.scala:385)
at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.$anonfun$executeInternal$1(ExecuteThreadRunner.scala:291)
at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.$anonfun$executeInternal$1$adapted(ExecuteThreadRunner.scala:247)
at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$withSession$2(SessionHolder.scala:536)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:860)
at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$withSession$1(SessionHolder.scala:536)
at org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:97)
at org.apache.spark.sql.artifact.ArtifactManager.$anonfun$withResources$1(ArtifactManager.scala:124)
at org.apache.spark.sql.artifact.ArtifactManager.withClassLoaderIfNeeded(ArtifactManager.scala:118)
at org.apache.spark.sql.artifact.ArtifactManager.withResources(ArtifactManager.scala:123)
at org.apache.spark.sql.connect.service.SessionHolder.withSession(SessionHolder.scala:535)
at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.executeInternal(ExecuteThreadRunner.scala:247)
at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.$anonfun$execute$1(ExecuteThreadRunner.scala:141)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
at com.databricks.spark.connect.service.UtilizationMetrics.recordActiveQueries(UtilizationMetrics.scala:43)
at com.databricks.spark.connect.service.UtilizationMetrics.recordActiveQueries$(UtilizationMetrics.scala:40)
at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.recordActiveQueries(ExecuteThreadRunner.scala:53)
at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.org$apache$spark$sql$connect$execution$ExecuteThreadRunner$$execute(ExecuteThreadRunner.scala:139)
at org.apache.spark.sql.connect.execution.ExecuteThreadRunner$ExecutionThread.$anonfun$run$2(ExecuteThreadRunner.scala:595)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
at com.databricks.unity.UCSEphemeralState$Handle.runWith(UCSEphemeralState.scala:51)
at com.databricks.unity.HandleImpl.runWith(UCSHandle.scala:104)
at com.databricks.unity.HandleImpl.$anonfun$runWithAndClose$1(UCSHandle.scala:109)
at scala.util.Using$.resource(Using.scala:296)
at com.databricks.unity.HandleImpl.runWithAndClose(UCSHandle.scala:108)
at org.apache.spark.sql.connect.execution.ExecuteThreadRunner$ExecutionThread.run(ExecuteThreadRunner.scala:595)
Caused by: java.lang.ClassNotFoundException: grpc_shaded.com.google.protobuf.RuntimeVersion$RuntimeDomain
at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:592)
at com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader.loadClass(ClassLoaders.scala:152)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
... 57 more
The second error appears on every subsequent run of the same code (presumably because the JVM remembers the failed class initialization of GraphFramesAPI):
status = StatusCode.UNKNOWN
details = "Could not initialize class org.graphframes.connect.proto.GraphFramesAPI"
debug_error_string = "UNKNOWN:Error received from peer {grpc_status:2, grpc_message:"Could not initialize class org.graphframes.connect.proto.GraphFramesAPI"}"
java.lang.NoClassDefFoundError: Could not initialize class org.graphframes.connect.proto.GraphFramesAPI
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:569)
at grpc_shaded.com.google.protobuf.Internal.getDefaultInstance(Internal.java:353)
at grpc_shaded.com.google.protobuf.Any.is(Any.java:85)
at org.apache.spark.sql.graphframes.GraphFramesConnect.transform(GraphFramesConnect.scala:16)
at org.apache.spark.sql.connect.planner.SparkConnectPlanner.$anonfun$transformRelationPlugin$1(SparkConnectPlanner.scala:375)
at scala.collection.Iterator$$anon$9.next(Iterator.scala:584)
at scala.collection.IterableOnceOps.find(IterableOnce.scala:677)
at scala.collection.IterableOnceOps.find$(IterableOnce.scala:674)
at scala.collection.AbstractIterable.find(Iterable.scala:935)
at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformRelationPlugin(SparkConnectPlanner.scala:378)
at org.apache.spark.sql.connect.planner.SparkConnectPlanner.$anonfun$transformRelation$1(SparkConnectPlanner.scala:343)
at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$usePlanCache$8(SessionHolder.scala:743)
at org.apache.spark.sql.connect.service.SessionHolder.measureSubtreeRelationNodes(SessionHolder.scala:759)
at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$usePlanCache$6(SessionHolder.scala:742)
at scala.Option.getOrElse(Option.scala:201)
at org.apache.spark.sql.connect.service.SessionHolder.usePlanCache(SessionHolder.scala:740)
at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformRelation(SparkConnectPlanner.scala:229)
at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformRelation(SparkConnectPlanner.scala:215)
at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformShowString(SparkConnectPlanner.scala:433)
at org.apache.spark.sql.connect.planner.SparkConnectPlanner.$anonfun$transformRelation$1(SparkConnectPlanner.scala:232)
at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$usePlanCache$8(SessionHolder.scala:743)
at org.apache.spark.sql.connect.service.SessionHolder.measureSubtreeRelationNodes(SessionHolder.scala:759)
at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$usePlanCache$6(SessionHolder.scala:742)
at scala.Option.getOrElse(Option.scala:201)
at org.apache.spark.sql.connect.service.SessionHolder.usePlanCache(SessionHolder.scala:740)
at org.apache.spark.sql.connect.planner.SparkConnectPlanner.transformRelation(SparkConnectPlanner.scala:229)
at org.apache.spark.sql.connect.execution.SparkConnectPlanExecution.handlePlan(SparkConnectPlanExecution.scala:96)
at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.handlePlan(ExecuteThreadRunner.scala:385)
at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.$anonfun$executeInternal$1(ExecuteThreadRunner.scala:291)
at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.$anonfun$executeInternal$1$adapted(ExecuteThreadRunner.scala:247)
at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$withSession$2(SessionHolder.scala:536)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:860)
at org.apache.spark.sql.connect.service.SessionHolder.$anonfun$withSession$1(SessionHolder.scala:536)
at org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:97)
at org.apache.spark.sql.artifact.ArtifactManager.$anonfun$withResources$1(ArtifactManager.scala:124)
at org.apache.spark.sql.artifact.ArtifactManager.withClassLoaderIfNeeded(ArtifactManager.scala:118)
at org.apache.spark.sql.artifact.ArtifactManager.withResources(ArtifactManager.scala:123)
at org.apache.spark.sql.connect.service.SessionHolder.withSession(SessionHolder.scala:535)
at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.executeInternal(ExecuteThreadRunner.scala:247)
at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.$anonfun$execute$1(ExecuteThreadRunner.scala:141)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
at com.databricks.spark.connect.service.UtilizationMetrics.recordActiveQueries(UtilizationMetrics.scala:43)
at com.databricks.spark.connect.service.UtilizationMetrics.recordActiveQueries$(UtilizationMetrics.scala:40)
at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.recordActiveQueries(ExecuteThreadRunner.scala:53)
at org.apache.spark.sql.connect.execution.ExecuteThreadRunner.org$apache$spark$sql$connect$execution$ExecuteThreadRunner$$execute(ExecuteThreadRunner.scala:139)
at org.apache.spark.sql.connect.execution.ExecuteThreadRunner$ExecutionThread.$anonfun$run$2(ExecuteThreadRunner.scala:595)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
at com.databricks.unity.UCSEphemeralState$Handle.runWith(UCSEphemeralState.scala:51)
at com.databricks.unity.HandleImpl.runWith(UCSHandle.scala:104)
at com.databricks.unity.HandleImpl.$anonfun$runWithAndClose$1(UCSHandle.scala:109)
at scala.util.Using$.resource(Using.scala:296)
at com.databricks.unity.HandleImpl.runWithAndClose(UCSHandle.scala:108)
at org.apache.spark.sql.connect.execution.ExecuteThreadRunner$ExecutionThread.run(ExecuteThreadRunner.scala:595)
Caused by: java.lang.ExceptionInInitializerError: Exception java.lang.NoClassDefFoundError: grpc_shaded/com/google/protobuf/RuntimeVersion$RuntimeDomain [in thread "SparkConnectExecuteThread_opId=63e2cdea-e873-41be-ac29-32fb5d8b5882"]
at org.graphframes.connect.proto.GraphFramesAPI.<clinit>(GraphFramesAPI.java:22)
... 56 more
I tried changing a lot of different parameters, including manually setting the protoc version in build.sbt (line 34), without much success with any 3.x or later version.
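For anyone debugging the classpath side: the missing `grpc_shaded/com/google/protobuf/RuntimeVersion` class suggests the plugin's generated protobuf code was produced by a protoc 4.x (whose generated classes reference `RuntimeVersion`, introduced in protobuf-java 4.26), while the shaded protobuf runtime on the cluster is an older 3.x that predates it. A minimal sketch to check which shaded protobuf classes a given assembly jar actually ships; the jar path in the comment is a placeholder, not the real build output location:

```python
import zipfile

def find_entries(jar_path, needle: str) -> list:
    """List jar entries whose path contains `needle` (jars are plain zip files)."""
    with zipfile.ZipFile(jar_path) as jar:
        return sorted(e for e in jar.namelist() if needle in e)

# Hypothetical path -- adjust to your actual assembly output.
# An empty result would mean the shaded runtime predates protobuf 4.26,
# which is when com.google.protobuf.RuntimeVersion was introduced.
# print(find_entries("connect/target/scala-2.13/graphframes-connect-assembly.jar",
#                    "grpc_shaded/com/google/protobuf/RuntimeVersion"))
```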
Are you planning on creating a PR?
- I'm willing to make a pull-request
Thanks a lot for the support and for everything you are doing!
Joshua