1,673 questions
0
votes
1
answer
71
views
Java Spark mapPartitions retry causing duplicate inserts into BigQuery on task failure
I have a Dataproc Java Spark job that processes a dataset in partitions and inserts rows into BigQuery in batches.
The code uses rdd.mapPartitions with an iterator, splitting rows into batches (e.g., 100 rows ...
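Spark may re-run a whole partition when a task fails, so any batches flushed before the failure are inserted again on the retry. One common mitigation (a sketch, not the asker's code; `batches` and `row_insert_id` are made-up helper names) is to attach a deterministic id to each row so retried inserts can be de-duplicated downstream, e.g. via BigQuery's `insertId` on the legacy streaming API or a later `MERGE`:

```python
import hashlib
import itertools
import json

def batches(rows, size=100):
    """Yield successive lists of `size` rows from an iterator,
    mirroring the chunking done inside mapPartitions."""
    it = iter(rows)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk

def row_insert_id(row):
    """Deterministic id derived from row content: identical when a
    failed task is retried, so the sink can collapse duplicates."""
    payload = json.dumps(row, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

Because the id depends only on row content, a retried task produces identical ids and the duplicate inserts can be dropped on the BigQuery side.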
1
vote
1
answer
121
views
How to authenticate with Service Account in dataproc cluster for duckdb connection to BigQuery
I'm trying to authenticate with an already pre-signed-in service account (SA) in a Dataproc cluster.
I'm configuring a DuckDB connection with the BigQuery extension and I can't seem to reuse the ...
1
vote
1
answer
72
views
DataprocSparkSession package in python error - "RuntimeError: Error while creating Dataproc Session"
I am using the code below to create a Dataproc Spark Session to run a job:
from google.cloud.dataproc_spark_connect import DataprocSparkSession
from google.cloud.dataproc_v1 import Session
session = Session(...
0
votes
0
answers
64
views
What connector can be used for Google Cloud pubsub along with Cloud dataproc ( Spark 3.5.x )
I am using Spark 3.5.x and would like to use the readStream() API for Structured Streaming in Java.
I don't see any Pub/Sub connector available. Couldn't try Pub/Sub Lite because it is deprecated ...
0
votes
1
answer
72
views
ModuleNotFoundError in GCP after trying to submit a job
New to GCP, I am trying to submit a job inside Dataproc with a .py file and an attached pythonproject.zip file (it is a project), but I am getting the error below: ModuleNotFoundError: No module ...
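A frequent cause of `ModuleNotFoundError` with `--py-files` is the zip's internal layout: the package directory (containing `__init__.py`) must sit at the zip root, not nested under an extra top-level folder. A stdlib-only sketch (`mypkg` and `greet` are made-up names) that reproduces how Python imports from such a zip, which is essentially what Spark sets up on the executors:

```python
import os
import sys
import tempfile
import zipfile

# Build a zip with a proper package layout (__init__.py at the zip
# root) and confirm it is importable, as Spark does after --py-files.
tmp = tempfile.mkdtemp()
zip_path = os.path.join(tmp, "pythonproject.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("mypkg/__init__.py", "")
    zf.writestr("mypkg/util.py", "def greet():\n    return 'hello'\n")

sys.path.insert(0, zip_path)
from mypkg.util import greet
print(greet())  # hello
```

If `zipfile.ZipFile(zip_path).namelist()` shows an extra leading folder (e.g. `pythonproject/mypkg/...`), the import fails the same way it does on the cluster.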
3
votes
1
answer
153
views
Facing BigQuery write failure after upgrading Spark and Dataproc: "Schema mismatch: corresponding field path to Parquet column has 0 repeated fields"
We are currently migrating from Spark 2.4 to Spark 3.5 (and Dataproc 1 to 2), and our workflows are failing with the following error:
Caused by: com.google.cloud.spark.bigquery.repackaged....
1
vote
0
answers
58
views
Google Cloud Dataproc Cluster Creation Fails: "Failed to validate permissions for default service account"
Despite the Default Compute Engine Service Account having the necessary roles and being explicitly specified in my cluster creation command, I am still encountering the "Failed to validate ...
2
votes
1
answer
209
views
Spark memory error in thread spark-listener-group-eventLog
I have a pyspark application which is using Graphframes to compute connected components on a DataFrame.
The edges DataFrame I generate has 2.7M records.
When I run the code it is slow, but slowly ...
1
vote
0
answers
75
views
Out of memory for a smaller dataset
I have a PySpark job reading an input volume of just ~50-55 GB of Parquet data from a Delta table. The job uses n2-highmem-4 GCP VMs and 1-15 workers with autoscaling. Each worker VM is of type n2-highmem-...
1
vote
2
answers
96
views
How do you run Python Hadoop Jobs on Dataproc?
I am trying to run my Python code for a Hadoop job on Dataproc. I have a mapper.py and a reducer.py file. I am running this command in the terminal:
gcloud dataproc jobs submit hadoop \
--cluster=my-...
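Hadoop Streaming runs `mapper.py` and `reducer.py` as plain processes that read stdin and write stdout; on Dataproc the job is submitted against the Hadoop Streaming jar shipped with the image (typically under `/usr/lib/hadoop-mapreduce/`, though the exact path is image-dependent). A stdlib sketch of the word-count contract such scripts implement (`map_lines` and `reduce_pairs` are illustrative names, not Hadoop API):

```python
from collections import Counter

def map_lines(lines):
    """Mapper contract: emit one tab-separated 'word<TAB>1' pair per token."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reduce_pairs(pairs):
    """Reducer contract: sum the counts for each key."""
    counts = Counter()
    for pair in pairs:
        word, n = pair.split("\t")
        counts[word] += int(n)
    return dict(counts)

pairs = list(map_lines(["a b a", "b"]))
print(reduce_pairs(pairs))  # {'a': 2, 'b': 2}
```

In a real streaming job, the mapper reads `sys.stdin` and prints the pairs; Hadoop sorts by key between the two stages so each reducer sees its keys grouped.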
2
votes
0
answers
58
views
Why does Spark raise an IOException while running an aggregation on a streaming DataFrame in Dataproc 2.2
I am trying to migrate a job running on Dataproc 2.1 images (Spark 3.3, Python 3.10) to Dataproc 2.2 images (Spark 3.5, Python 3.11).
However, I encounter an error on one of my queries. After further ...
1
vote
0
answers
45
views
Partial records being read in PySpark through Dataproc
I have a Google Dataproc job that reads a CSV file from Google Cloud Storage, containing the following headers
Content-type : application/octet-stream
Content-encoding : gzip
FileName: gs://...
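When an object is uploaded with `Content-Encoding: gzip` but a plain `.csv` name, the Hadoop GCS connector typically reads the raw stored bytes, so Spark's CSV reader receives compressed data and produces partial or garbled records. A quick stdlib check, assuming you can fetch the first bytes of the object:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"

def looks_gzipped(head: bytes) -> bool:
    """True if the first bytes carry the gzip magic number."""
    return head[:2] == GZIP_MAGIC

payload = gzip.compress(b"id,name\n1,alice\n")
print(looks_gzipped(payload))        # True
print(looks_gzipped(b"id,name\n"))   # False
```

If the check is positive, renaming the object to `*.csv.gz` (or re-uploading without `Content-Encoding`) lets Spark pick the gzip codec from the extension.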
2
votes
1
answer
134
views
Default behavior of spark3.5.1 when writing Numeric/Bignumeric to BigQuery
As per documentation of spark 3.5.1 using latest spark-bigquery-connector, Spark Decimal(38,0) should be written as Numeric in BigQuery.
https://github.com/GoogleCloudDataproc/spark-bigquery-connector?...
1
vote
2
answers
181
views
How to pass arguments from GCP Workflows into Dataproc
I'm using GCP Workflows to define steps for a data engineering project. The input of the workflow consists of multiple parameters, which are provided through the Workflows API.
I defined a GCP ...
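On the receiving side, whatever Workflows puts in the job's `args` field arrives as ordinary `sys.argv` entries in the PySpark script, so plain argparse handles it. A minimal sketch (`--input` and `--run-date` are hypothetical parameter names):

```python
import argparse

def parse_args(argv):
    """Parse the argv list that Workflows passes through the job's args field."""
    p = argparse.ArgumentParser()
    p.add_argument("--input", required=True)
    p.add_argument("--run-date", required=True)
    return p.parse_args(argv)

# Simulate what the job receives when Workflows supplies these args:
args = parse_args(["--input", "gs://bucket/path", "--run-date", "2024-01-01"])
print(args.input, args.run_date)
```

In the workflow definition the same values go into the Dataproc job body as a list of strings, one list element per flag and per value.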
1
vote
0
answers
119
views
error "google.api_core.exceptions.InvalidArgument: 400 Cluster name is required" while trying to use airflow DataprocSubmitJobOperator
I'm trying to use the Dataproc submit job operator from Airflow (https://airflow.apache.org/docs/apache-airflow-providers-google/stable/_api/airflow/providers/google/cloud/operators/dataproc/index....
3
votes
2
answers
458
views
GoogleHadoopOutputStream: hflush(): No-op due to rate limit: Increase in class A operation for gcs bucket
We are running our Spark ingestion jobs, which process multiple files in batches.
We read CSV or TSV files in batches, create a dataframe, and do some transformations before loading it into BigQuery ...
1
vote
1
answer
288
views
Dataproc PySpark Job Fails with BigQuery Connector Issue - java.util.ServiceConfigurationError
I'm trying to run a PySpark job on Google Cloud Dataproc that reads data from BigQuery, processes it, and writes it back. However, the job keeps failing with the following error:
java.util....
1
vote
0
answers
46
views
Dataproc Serverless Failure on Rerun
I have written a Spark job to read from a Kafka topic, do some processing, and dump the data in Avro format to GCS.
I am deploying this Java application on Dataproc Serverless using the TriggerOnce mode so ...
1
vote
1
answer
108
views
java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbc.SQLServerDriver while running JAR on google Dataproc cluster
I have the driver dependency in the POM.xml and I am using the Maven Shade plugin to create an uber JAR. I do see the driver dependency correctly listed in the JAR file. The JAR runs fine in IntelliJ, but on ...
1
vote
0
answers
45
views
Dataproc Hive Job - OutOfMemoryError: Java heap space
I have a Dataproc cluster; we are running an INSERT OVERWRITE query through the Hive CLI, which fails with OutOfMemoryError: Java heap space.
We adjusted memory configurations for reducers and Tez tasks, ...
1
vote
0
answers
81
views
Google Cloud Data Fusion streaming pipeline and Spark jobs with empty rows
I have a Google Cloud Data Fusion streaming pipeline that receives data from Google Pub/Sub. Micro-batching is performed every 5 seconds. Since data doesn’t always arrive consistently, I see many ...
1
vote
3
answers
670
views
Caused by: java.lang.IllegalStateException: This connector was made for Scala null, it was not meant to run on Scala 2.12
The code the error points to is as follows:
String strJsonContent = SessionContext
.getSparkSession()
.read()
.json(filePath)
.toJSON()
.first();
And I'm using Maven to build the package without ...
1
vote
1
answer
256
views
Creating Dataproc Cluster with public-ip-address using DataprocCreateClusterOperator in Airflow
I am trying to create a Dataproc Cluster in my GCP project within an Airflow DAG using the DataprocCreateClusterOperator. I am using the ClusterGenerator to generate the config for the cluster. ...
1
vote
1
answer
415
views
How can I debug an InactiveRpcError with status INVALID_ARGUMENT when submitting a serverless batch job?
When submitting a dataproc serverless batch request, we have been getting errors like:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode....
2
votes
1
answer
312
views
Spark The deserializer is not supported: need a(n) "ARRAY" field but got "MAP<STRING, STRING>"
Recently we migrated to Dataproc image version 2.2, along with Scala 2.12.18 and Spark 3.5.
package test
import org.apache.spark.sql.SparkSession
import test.Model._
...
2
votes
0
answers
303
views
UDF method with return type parameter is deprecated since Spark 3.0.0: how to return a struct type from a UDF function
We are upgrading the GCP Dataproc cluster to the 2.2-debian12 image, with
Spark version 3.5.0
Scala version 2.12.18
but with these versions one major change is that the udf method with a return type parameter is ...
2
votes
2
answers
540
views
Error upgrading Dataproc Serverless version from 2.1 to 2.2
I have changed the version of a Dataproc Serverless from 2.1 to 2.2 and now when I run it I get the following error:
Exception in thread "main" java.util.ServiceConfigurationError: org....
2
votes
0
answers
166
views
How to reduce GCS A and B operations in a Spark Structured Streaming pipeline in Dataproc?
I'm running a data pipeline where a NiFi on-premise flow writes JSON files in streaming to a GCS bucket. I have 5 tables, each with their own path, generating around 140k objects per day. The bucket ...
1
vote
1
answer
432
views
Accessing BigQuery Dataset from a Different GCP Project Using PySpark on Dataproc
I am working with BigQuery, Dataproc, Workflows, and Cloud Storage in Google Cloud using Python.
I have two GCP projects:
gcp-project1: contains the BigQuery dataset gcp-project1.my_dataset.my_table
...
1
vote
0
answers
55
views
How do I correctly configure the staging and temp buckets in DataprocCreateClusterOperator?
I'm trying to find out how to set the temp and staging buckets on the Dataproc operator. I've searched all over the internet and didn't find a good answer.
import pendulum
from datetime import timedelta
...
2
votes
1
answer
76
views
Spark on Dataproc: Slow Data Insertion into BigQuery for Large Datasets (~30M Records)
I have a Scala Spark job running on Google Cloud Dataproc that sources and writes data to Google BigQuery (BQ) tables. The code works fine for smaller datasets, but when processing larger volumes (e.g....
1
vote
1
answer
435
views
PySpark performance problems when writing to a table in BigQuery
I am new to the world of PySpark and am experiencing serious performance problems when writing data from a dataframe to a table in BigQuery. I have tried everything I have read, recommendations, using ...
2
votes
0
answers
112
views
Delta compatibility problem with PySpark in Dataproc GKE env
I've created a Dataproc cluster using GKE and a custom image with PySpark 3.5.0, but can't get it to work with Delta.
The custom image docker file is this:
FROM us-central1-docker.pkg.dev/cloud-...
2
votes
0
answers
32
views
How can I install html5lib on a dataproc cluster
I have a Dataproc pipeline with which I do web scraping and store data in GCP.
The task setting is something like this:
create_dataproc_cluster = DataprocCreateClusterOperator(
task_id='...
6
votes
0
answers
3k
views
spark_catalog requires a single-part namespace in dbt python incremental model
Description:
Using the dbt functionality that allows one to create a python model, I created a model that reads from some BigQuery table, performs some calculations and writes back to BigQuery.
It ...
2
votes
1
answer
618
views
How to resolve: java.lang.NoSuchMethodError: 'scala.collection.Seq org.apache.spark.sql.types.StructType.toAttributes()'
Running a simple ETL PySpark job on Dataproc 2.2 with the job property spark.jars.packages set to io.delta:delta-core_2.12:2.4.0. Other settings are left at their defaults. I have the following config:
conf = (
...
1
vote
0
answers
44
views
Create external Hive table for multiline JSON
I am trying to create a Hive table for a given multiline JSON, but the actual result does not match the expected result.
Sample JSON file:
{
"name": "Adil Abro",
"...
1
vote
1
answer
79
views
Not able to set the Spark log properties programmatically via Python
I was trying to suppress Spark logging by specifying my own log4j.properties file.
gcloud dataproc jobs submit spark \
--cluster test-dataproc-cluster \
--region europe-north1 \
--files gs://...
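One thing worth checking before anything else: Spark 3.2 and earlier read a log4j 1.x `log4j.properties`, while Spark 3.3+ (Dataproc 2.1/2.2 images) switched to Log4j 2 and expect a `log4j2.properties` instead, so a supplied `log4j.properties` can be silently ignored. For the log4j 1.x case, a minimal sketch that drops everything below WARN:

```properties
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```

The file shipped via `--files` is then referenced with `-Dlog4j.configuration=file:log4j.properties` in `spark.driver.extraJavaOptions` (and the executor equivalent); this is an assumption about the asker's setup, not something shown in the question.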
3
votes
0
answers
169
views
Bigtable Read and Write using DataProc with compute engine results in Key not found
I am experimenting with reading and writing data in Cloud Bigtable using Dataproc on Compute Engine and a PySpark job with the spark-bigtable-connector. I got an example from the spark-bigtable repo and ...
1
vote
1
answer
144
views
Not able to connect Flink to Kafka
Getting the error below
Caused by: java.lang.NoClassDefFoundError: org/apache/kafka/clients/admin/AdminClient
while connecting Flink to Kafka.
I am using Flink 1.17 and using
flink-sql-connector-kafka-1....
2
votes
0
answers
51
views
How to measure total execution time in Bigtable from Spark in Scala
I am attempting to measure the total execution time from Spark to Bigtable. However, when I wrap the following code around the Bigtable-related function, it consistently shows only 0.5 seconds, ...
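A constant ~0.5 seconds usually means the timer wraps only the lazy transformation: Spark builds the plan immediately but does no Bigtable I/O until an action (count, write, ...) runs, so the real cost lands outside the timed block. The effect reproduces with a plain Python generator (`slow_rows` is illustrative, not Spark API):

```python
import time

def slow_rows(n):
    """Simulates a lazy Spark transformation: cost is paid on consumption."""
    for i in range(n):
        time.sleep(0.001)
        yield i

t0 = time.perf_counter()
rows = slow_rows(200)                 # lazy: nothing has executed yet
lazy_elapsed = time.perf_counter() - t0

t0 = time.perf_counter()
total = sum(1 for _ in rows)          # the "action": real work happens here
forced_elapsed = time.perf_counter() - t0

print(total)                          # 200
print(lazy_elapsed < forced_elapsed)  # True
```

The fix on the Spark side is to put the action itself (e.g. a count or the write) inside the timed region, not just the transformation chain.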
1
vote
1
answer
119
views
Composer v2.6.6 - job completing successfully, task exited with Negsignal.SIGKILL
We have Composer 2.6.6 (Airflow 2.5.3) and a job, VANI-UEBA3, which runs on Dataproc Serverless Batches ... the job runs through fine (as shown in the Dataproc Serverless UI),
but the Composer UI ...
1
vote
1
answer
319
views
Trigger rule "one_success" doesn't work for "DataprocCreateClusterOperator"
I have a case where one of my operators, DataprocCreateClusterOperator, never triggers, as if "all_success" were still set for it. It runs fine if it's the very first task, but I don't ...
2
votes
0
answers
144
views
Disabling YARN logs - Dataproc cluster
I am trying to reduce the Class A operations on a GCS bucket that is configured to store YARN and Spark history logs.
This is costing us a lot. I disabled Spark logs by editing the spark-defaults....
1
vote
0
answers
111
views
Selecting a Dataproc Cluster Size with autoscaling ON
I am new to GCP and probably have a very basic question. We are running our PySpark jobs on a Dataproc ephemeral cluster with the autoscaling property on for the cluster. In our code we have used ...
1
vote
0
answers
172
views
Lifecycle policies on Dataproc staging bucket subfolders
We have a Dataproc cluster staging bucket in which all the Spark job logs are stored.
eu-digi-pipe-dataproc-stage/google-cloud-dataproc-metainfo/d0decf20-21fd-4536-bbc4-5a4f829e49bf/jobs/google-...
1
vote
0
answers
43
views
How to parse a huge amount of json files effectively in pyspark
I'm trying to parse around 100 GB of small JSON files using PySpark. The files are stored in a Google Cloud bucket and come zipped: *.jsonl.gz
How can I do it effectively?
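Spark's JSON reader decompresses `.gz` line-delimited files transparently, so `spark.read.json("gs://bucket/*.jsonl.gz")` works; with ~100 GB of tiny files the bottleneck is usually file listing and per-file task overhead, so packing many files into each partition helps. The per-file work each task ends up doing reduces to this stdlib logic (`parse_jsonl_gz` is an illustrative name):

```python
import gzip
import io
import json

def parse_jsonl_gz(blob: bytes):
    """Decompress one *.jsonl.gz blob and parse it line by line."""
    with gzip.open(io.BytesIO(blob), mode="rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

blob = gzip.compress(b'{"a": 1}\n{"a": 2}\n')
print(parse_jsonl_gz(blob))  # [{'a': 1}, {'a': 2}]
```

Gzip files are not splittable, so each file is one unit of work regardless; the win comes from fewer, larger tasks rather than from parallelism within a file.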
1
vote
0
answers
175
views
bigquery write_truncate option doesn't work in spark bigquery connector
I'm trying to overwrite a BigQuery table using the WRITE_TRUNCATE option with the Spark BigQuery connector. I have verified that the target table is updated as the Last modified timestamp changes ...
2
votes
0
answers
211
views
Dataproc Serverless Batch Spark Web UI Not Displaying Executors Logs
I want to run a Spark job using the JDBCToBigQuery template in Dataproc Batch on GCP.
The job runs successfully, but there are no executor logs (stdout, stderr) on the Executors tab in the Web UI like ...
1
vote
0
answers
98
views
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainDoubleDictionary
I am trying to read multiple Parquet files from GCS using a Dataproc Spark job.
df = sc.read.option("mergeSchema", "true").parquet(remote_path)
The above code throws error saying:-...