0 votes · 1 answer · 71 views
I have a Dataproc Java Spark job that processes a dataset in partitions and inserts rows in batches. The code uses rdd.mapPartitions with an iterator, splitting rows into batches (e.g., 100 rows ...
asked by Aakash Shrivastav
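A batching iterator of the kind this question describes can be sketched in plain Python; only the batching logic is shown, and the `insert_batch` call in the comment is a hypothetical stand-in for the user's own write function:

```python
from itertools import islice

def batched(rows, batch_size=100):
    """Yield lists of up to batch_size items from an iterator
    without materializing the whole partition in memory."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Inside a Spark job this would typically be applied as:
#   rdd.mapPartitions(lambda part: (insert_batch(b) for b in batched(part, 100)))
# where insert_batch is the user's own (hypothetical) insert function.
print([len(b) for b in batched(range(250), 100)])  # [100, 100, 50]
```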
1 vote · 1 answer · 121 views
I'm trying to authenticate with an already signed-in service account (SA) in a Dataproc cluster. I'm configuring a DuckDB connection with the BigQuery extension and can't seem to reuse the ...
asked by Aleksander Lipka

1 vote · 1 answer · 72 views
I am using the code below to create a Dataproc Spark session to run a job: from google.cloud.dataproc_spark_connect import DataprocSparkSession from google.cloud.dataproc_v1 import Session session = Session(...
asked by Siddiq Syed

0 votes · 0 answers · 64 views
I am using Spark 3.5.x and would like to use the readStream() API for structured streaming in Java. I don't see any Pub/Sub connector available, and couldn't try Pub/Sub Lite because it is deprecated ...
asked by Sunil

0 votes · 1 answer · 72 views
New to GCP, I am trying to submit a job to Dataproc with a .py file and an attached pythonproject.zip file (it is a project), but I am getting the error below: ModuleNotFoundError: No module ...
asked by SofiaNiki
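A frequent cause of that ModuleNotFoundError is an extra top-level folder inside the zip, so the module path in the archive does not match the import statement. The sketch below (all file and module names invented) shows how Python imports from a zip placed on sys.path, which is essentially what Spark does on each worker for files passed via --py-files:

```python
import importlib
import os
import sys
import tempfile
import zipfile

# Build a tiny project zip the way --py-files expects it: modules at the
# archive root, not nested under a project directory.
tmp = tempfile.mkdtemp()
zip_path = os.path.join(tmp, "pythonproject.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("mymodule.py", "ANSWER = 42\n")

sys.path.insert(0, zip_path)           # Spark adds the zip to sys.path on workers
mymodule = importlib.import_module("mymodule")
print(mymodule.ANSWER)                 # 42
```

If the zip instead contained `pythonproject/mymodule.py`, the import would have to be `pythonproject.mymodule` (and the folder would need an `__init__.py` on older Python versions).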
3 votes · 1 answer · 153 views
We are currently migrating from Spark 2.4 to Spark 3.5 (and Dataproc 1 to 2), and our workflows are failing with the following error: Caused by: com.google.cloud.spark.bigquery.repackaged....
asked by Anshul Dubey

1 vote · 0 answers · 58 views
Despite the default Compute Engine service account having the necessary roles and being explicitly specified in my cluster creation command, I am still encountering the "Failed to validate ...
asked by Lê Văn Đức

2 votes · 1 answer · 209 views
I have a PySpark application that uses GraphFrames to compute connected components on a DataFrame. The edges DataFrame I generate has 2.7M records. When I run the code it is slow, but slowly ...
asked by Jesus Diaz Rivero
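For scale comparison, connected components is the classic union-find computation; GraphFrames computes it distributively (roughly, with iterative large-star/small-star steps), but a minimal single-machine pure-Python version of the same result looks like:

```python
def connected_components(edges):
    """Union-find over an edge list; returns {node: component_root}."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra                 # union the two components
    return {n: find(n) for n in parent}

comps = connected_components([("a", "b"), ("b", "c"), ("x", "y")])
print(comps["a"] == comps["c"], comps["a"] == comps["x"])  # True False
```

At 2.7M edges this fits comfortably in memory on one machine, which is sometimes worth benchmarking against the distributed run.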
1 vote · 0 answers · 75 views
I have a PySpark job reading an input volume of just ~50-55 GB of Parquet data from a Delta table. The job uses an n2-highmem-4 GCP VM and 1-15 workers with autoscaling. Each worker VM is of type n2-highmem-...
asked by user16798185

1 vote · 2 answers · 96 views
I am trying to run my Python code for a Hadoop job on Dataproc. I have a mapper.py and a reducer.py file. I am running this command in the terminal: gcloud dataproc jobs submit hadoop \ --cluster=my-...
asked by The Beast
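Hadoop Streaming runs mapper.py and reducer.py as pipes: the mapper emits tab-separated key/value lines, Hadoop sorts them, and the reducer receives them grouped by key. A word-count sketch of that contract, testable without a cluster (the shuffle between stages is simulated with sorted()):

```python
import itertools

def mapper(lines):
    """Streaming-style mapper: emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(pairs):
    """Streaming-style reducer: input arrives sorted by key; sum per key."""
    for word, group in itertools.groupby(pairs, key=lambda kv: kv.split("\t")[0]):
        total = sum(int(kv.split("\t")[1]) for kv in group)
        yield word + "\t" + str(total)

mapped = sorted(mapper(["b a", "a"]))   # stands in for Hadoop's sort phase
print(list(reducer(mapped)))            # ['a\t2', 'b\t1']
```

On Dataproc, streaming jobs are typically submitted with the hadoop-streaming jar plus `-mapper`/`-reducer` arguments; the exact jar path depends on the image.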
2 votes · 0 answers · 58 views
I am trying to migrate a job running on Dataproc 2.1 images (Spark 3.3, Python 3.10) to Dataproc 2.2 images (Spark 3.5, Python 3.11). However, I encounter an error on one of my queries. After further ...
asked by AlexisBRENON

1 vote · 0 answers · 45 views
I have a Google Dataproc job that reads a CSV file from Google Cloud Storage with the following headers: Content-Type: application/octet-stream, Content-Encoding: gzip, FileName: gs://...
asked by Bob

2 votes · 1 answer · 134 views
As per the documentation for Spark 3.5.1 using the latest spark-bigquery-connector, Spark Decimal(38,0) should be written as NUMERIC in BigQuery. https://github.com/GoogleCloudDataproc/spark-bigquery-connector?...
asked by Abhilash

1 vote · 2 answers · 181 views
I'm using GCP Workflows to define the steps of a data engineering project. The workflow's input consists of multiple parameters, which are provided through the workflow API. I defined a GCP ...
asked by 54m

1 vote · 0 answers · 119 views
I'm trying to use the Dataproc submit job operator from Airflow (https://airflow.apache.org/docs/apache-airflow-providers-google/stable/_api/airflow/providers/google/cloud/operators/dataproc/index....
asked by Abhijit Aravind
3 votes · 2 answers · 458 views
We are running Spark ingestion jobs that process multiple files in batches. We read CSV or TSV files in batches, create a DataFrame, and do some transformations before loading it into BigQuery ...
asked by Vikrant Singh Rana

1 vote · 1 answer · 288 views
I'm trying to run a PySpark job on Google Cloud Dataproc that reads data from BigQuery, processes it, and writes it back. However, the job keeps failing with the following error: java.util....
asked by Shima K

1 vote · 0 answers · 46 views
I have written a Spark job to read from a Kafka topic, do some processing, and dump the data in Avro format to GCS. I am deploying this Java application on Dataproc Serverless using the Trigger.Once mode, so ...
asked by Ravi Jain

1 vote · 1 answer · 108 views
I have the driver dependency in the pom.xml and I am using the Maven Shade plugin to create an uber JAR. I do see the driver dependency correctly listed in the JAR file. The JAR runs fine in IntelliJ, but on ...
asked by xOneOne

1 vote · 0 answers · 45 views
I have a Dataproc cluster where we run an INSERT OVERWRITE query through the Hive CLI, and it fails with OutOfMemoryError: Java heap space. We adjusted memory configurations for reducers and Tez tasks, ...
asked by Parmeet Singh

1 vote · 0 answers · 81 views
I have a Google Cloud Data Fusion streaming pipeline that receives data from Google Pub/Sub. Micro-batching is performed every 5 seconds. Since data doesn't always arrive consistently, I see many ...
asked by alexanoid

1 vote · 3 answers · 670 views
The code that raises the error is as follows: String strJsonContent = SessionContext.getSparkSession().read().json(filePath).toJSON().first(); And I'm using Maven to build the package without ...
asked by Delevin Zhong
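Besides packaging, one thing worth ruling out with code like that: Spark's read().json() expects JSON Lines by default, i.e. one complete object per line. A pure-Python analogue of .toJSON().first() against such a file (file path and contents invented for illustration):

```python
import json
import os
import tempfile

# Write a two-record JSON Lines file, the layout Spark's JSON reader expects.
path = os.path.join(tempfile.mkdtemp(), "data.json")
with open(path, "w") as f:
    f.write('{"id": 1}\n{"id": 2}\n')

with open(path) as f:
    first = json.loads(next(f))   # analogue of .toJSON().first()
print(first["id"])                # 1
```

A single pretty-printed multiline object would need Spark's `multiLine` read option instead.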
1 vote · 1 answer · 256 views
I am trying to create a Dataproc cluster in my GCP project within an Airflow DAG using the DataprocCreateClusterOperator. I am using the ClusterGenerator to generate the config for the cluster. ...
asked by Mads

1 vote · 1 answer · 415 views
When submitting a Dataproc Serverless batch request, we have been getting errors like: grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with: status = StatusCode....
asked by Jacob Promisel

2 votes · 1 answer · 312 views
Recently we migrated to the Dataproc 2.2 image, with Scala 2.12.18 and Spark 3.5. package test import org.apache.spark.sql.SparkSession import test.Model._ ...
asked by Vikrant Singh Rana

2 votes · 0 answers · 303 views
We are upgrading the GCP Dataproc cluster to the 2.2-debian12 image, with Spark 3.5.0 and Scala 2.12.18, but with these versions one major change is that the udf method with a return-type parameter is ...
asked by chanchal ahuja

2 votes · 2 answers · 540 views
I have changed the version of a Dataproc Serverless batch from 2.1 to 2.2, and now when I run it I get the following error: Exception in thread "main" java.util.ServiceConfigurationError: org....
asked by Chaos

2 votes · 0 answers · 166 views
I'm running a data pipeline where an on-premises NiFi flow streams JSON files to a GCS bucket. I have 5 tables, each with its own path, generating around 140k objects per day. The bucket ...
asked by Puredepatata
1 vote · 1 answer · 432 views
I am working with BigQuery, Dataproc, Workflows, and Cloud Storage in Google Cloud using Python. I have two GCP projects: gcp-project1 contains the BigQuery dataset gcp-project1.my_dataset.my_table ...
asked by Henry Xiloj Herrera

1 vote · 0 answers · 55 views
I'm trying to find out how to set the temp and staging buckets on the Dataproc operator. I've searched all over the internet and didn't find a good answer. import pendulum from datetime import timedelta ...
asked by GuilhermeMP

2 votes · 1 answer · 76 views
I have a Scala Spark job running on Google Cloud Dataproc that sources and writes data to Google BigQuery (BQ) tables. The code works fine for smaller datasets, but when processing larger volumes (e.g....
asked by Sekar Ramu

1 vote · 1 answer · 435 views
I am new to the world of PySpark and am experiencing serious performance problems when writing data from a DataFrame to a table in BigQuery. I have tried everything I have read: recommendations, using ...
asked by aleretgub

2 votes · 0 answers · 112 views
I've created a Dataproc cluster using GKE and a custom image with PySpark 3.5.0, but I can't get it to work with Delta. The custom image's Dockerfile is this: FROM us-central1-docker.pkg.dev/cloud-...
asked by Pedro

2 votes · 0 answers · 32 views
I have a Dataproc pipeline with which I do web scraping and store data in GCP. The task setup is something like this: create_dataproc_cluster = DataprocCreateClusterOperator( task_id='...
asked by Sara
6 votes · 0 answers · 3k views
Description: Using the dbt functionality that allows one to create a Python model, I created a model that reads from a BigQuery table, performs some calculations, and writes back to BigQuery. It ...
asked by Carlos Veríssimo

2 votes · 1 answer · 618 views
I am running a simple ETL PySpark job on Dataproc 2.2 with the job property spark.jars.packages set to io.delta:delta-core_2.12:2.4.0. Other settings are left at their defaults. I have the following config: conf = ( ...
asked by dbkoop

1 vote · 0 answers · 44 views
I am trying to create a Hive table for a given multiline JSON file, but the actual result does not match the expected result. Sample JSON file: { "name": "Adil Abro", "...
asked by yac
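Hive's JSON SerDes typically expect one complete JSON document per line, which is why a pretty-printed multiline file parses into unexpected results. A common preprocessing step is to compact each record to a single line; a sketch (the "age" field below is invented to complete the truncated sample):

```python
import json

# A pretty-printed record like the sample file's; only "name" comes from
# the question, "age" is a made-up field to round out the example.
multiline = '''{
  "name": "Adil Abro",
  "age": 30
}'''

# Compact the document so each record occupies exactly one line,
# the shape Hive's line-oriented JSON SerDes expect.
one_line = json.dumps(json.loads(multiline), separators=(",", ":"))
print(one_line)   # {"name":"Adil Abro","age":30}
```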
1 vote · 1 answer · 79 views
I was trying to suppress the Spark logging by specifying my own log4j.properties file: gcloud dataproc jobs submit spark \ --cluster test-dataproc-cluster \ --region europe-north1 \ --files gs://...
asked by Vikrant Singh Rana

3 votes · 0 answers · 169 views
I am experimenting with reading and writing data in Cloud Bigtable using Dataproc compute and a PySpark job with the spark-bigtable-connector. I got an example from the spark-bigtable repo and ...
asked by Suga Raj

1 vote · 1 answer · 144 views
Getting the error below: Caused by: java.lang.NoClassDefFoundError: org/apache/kafka/clients/admin/AdminClient while connecting Flink to Kafka. I am using Flink 1.17 and flink-sql-connector-kafka-1....
asked by Om Prakash

2 votes · 0 answers · 51 views
I am attempting to measure the total execution time from Spark to Bigtable. However, when I wrap the following code around the Bigtable-related function, it consistently shows only 0.5 seconds, ...
asked by Kuengaer
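A likely explanation for the constant 0.5 s is Spark's laziness: timing a transformation measures only plan construction, while the real work happens when an action forces it. The effect is reproducible in plain Python with a generator standing in for the lazy pipeline:

```python
import time

def slow_numbers(n):
    """A lazy pipeline: nothing executes until the result is consumed,
    just as Spark transformations run only when an action triggers them."""
    for i in range(n):
        time.sleep(0.01)   # stand-in for per-record work
        yield i

start = time.perf_counter()
lazy = slow_numbers(50)                  # builds the "plan" only
build_time = time.perf_counter() - start

start = time.perf_counter()
total = sum(lazy)                        # forcing the work, like a Spark action
run_time = time.perf_counter() - start

print(build_time < run_time, total)      # True 1225
```

For Spark-to-Bigtable timing, the timer should wrap the action (the write) itself, not the code that assembles the DataFrame.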
1 vote · 1 answer · 119 views
We have Composer 2.6.6 (Airflow 2.5.3) and a job, VANI-UEBA3, which runs on Dataproc Serverless Batches ... the job runs through fine (as shown on the Dataproc Serverless UI), but the Composer UI ...
asked by Karan Alang

1 vote · 1 answer · 319 views
I have a case where one of my operators, a DataprocCreateClusterOperator, never triggers, as if "all_success" were still set for it. It runs fine if it's the very first task, but I don't ...
asked by Aleksander Lipka

2 votes · 0 answers · 144 views
I am trying to reduce the Class A operations on a GCS bucket that is configured to store YARN and Spark history logs. This is costing us a lot. I disabled Spark logs by editing the spark-defaults....
asked by Vikrant Singh Rana

1 vote · 0 answers · 111 views
I am new to GCP and probably have a very basic question. We are running our PySpark jobs on an ephemeral Dataproc cluster with the autoscaling property enabled. In our code we have used ...
asked by Kaushik Ghosh

1 vote · 0 answers · 172 views
We have a Dataproc cluster staging bucket where all the Spark job logs are stored: eu-digi-pipe-dataproc-stage/google-cloud-dataproc-metainfo/d0decf20-21fd-4536-bbc4-5a4f829e49bf/jobs/google-...
asked by Vikrant Singh Rana

1 vote · 0 answers · 43 views
I'm trying to parse around 100 GB of small JSON files using PySpark. The files are stored in a Google Cloud bucket and come zipped: *.jsonl.gz. How can I do this efficiently?
asked by Aleksander Lipka
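Spark's JSON reader decompresses gzip transparently (e.g. reading a `gs://bucket/*.jsonl.gz` glob), but gzip is not splittable, so each .jsonl.gz file becomes a single task and parallelism comes from the file count. The decode itself is just gzip plus JSON Lines, shown here locally with the standard library (file path and contents invented):

```python
import gzip
import json
import os
import tempfile

# Write a small *.jsonl.gz file shaped like the ones in the bucket.
path = os.path.join(tempfile.mkdtemp(), "part-0.jsonl.gz")
with gzip.open(path, "wt") as f:
    f.write('{"id": 1}\n{"id": 2}\n')

# Read it back: gzip text mode yields one JSON document per line.
with gzip.open(path, "rt") as f:
    records = [json.loads(line) for line in f]
print(len(records), records[0]["id"])   # 2 1
```

With ~100 GB across many small files, the usual levers are enough executors to cover the file count and a repartition after the read, since the per-file tasks cannot be split further.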
1 vote · 0 answers · 175 views
I'm trying to overwrite a BigQuery table using the WRITE_TRUNCATE option with the Spark BigQuery connector. I have verified that the target table is updated, as the Last modified timestamp changes ...
asked by lang

2 votes · 0 answers · 211 views
I want to run a Spark job using the JDBCToBigQuery template in Dataproc Batch on GCP. The job runs successfully, but there are no executor logs (stdout, stderr) on the Executors tab in the Web UI like ...
asked by lang

1 vote · 0 answers · 98 views
I am trying to read multiple Parquet files from GCS using a Dataproc Spark job: df = sc.read.option("mergeSchema", "true").parquet(remote_path) The above code throws an error saying: ...
asked by ak1234
