82,559 questions
0
votes
0
answers
66
views
Spark SQL MERGE/INSERT on Iceberg Recomputes Upstream Join Instead of Reusing Cached DataFrame (MEMORY_AND_DISK)
Spark SQL + Iceberg: MERGE and INSERT appear to ignore cached DataFrame and re-scan source
I am trying to optimize an SCD2 flow in Spark SQL (Python API) using a cached intermediate DataFrame.
...
8
votes
0
answers
620
views
Upgraded to IntelliJ IDEA 2026.1, Gradle fails to sync or build
class org.jetbrains.plugins.gradle.tooling.serialization.internal.adapter.InternalIdeaModule cannot be cast to class org.gradle.tooling.model.ProjectModel (org.jetbrains.plugins.gradle.tooling....
1
vote
1
answer
70
views
Was MemoryStream moved or changed in Spark 4.0?
I tried to upgrade my project from Spark 3.5 to Spark 4.0. In the process, I ran into this issue in our unit tests.
error: cannot find symbol
import org.apache.spark.sql.execution.streaming....
3
votes
1
answer
99
views
How to fix a cast invalid input error in Spark?
I'm trying to create and display a Spark DataFrame in Databricks. This is what my code looks like:
df = spark.sql(f'''
SELECT A.CUSTOMER_ID
FROM TABLE_1 A
INNER JOIN
...
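A common cause of a failing cast on Databricks is ANSI mode (`spark.sql.ansi.enabled`), under which malformed input raises `CAST_INVALID_INPUT` instead of producing NULL. A minimal sketch of the usual workaround, reusing the table and column names from the excerpt and assuming the failing cast is on `CUSTOMER_ID` (an assumption, since the query is truncated):

```sql
-- try_cast (available since Spark 3.2) returns NULL on malformed
-- input instead of raising CAST_INVALID_INPUT under ANSI mode.
SELECT try_cast(A.CUSTOMER_ID AS BIGINT) AS customer_id
FROM TABLE_1 A
```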
Advice
1
vote
2
replies
48
views
What is Spark doing when one record is bigger than partition size?
I have a project where I ingest large JSON files (100-300 MB per file), where one JSON file is one record. I had issues processing them until I increased spark.sql.files.maxPartitionBytes. I received Out of ...
-2
votes
0
answers
66
views
PySpark script hangs after job completion — ThreadPoolExecutor + Py4J daemon threads never terminate
Environment
Spark: 3.3.2 (Cloudera parcel SPARK3-3.3.2.3.3.7191000.0-78-1.p0.56279928)
Python: 3.10
Py4J: 0.10.9.5
Deployment: YARN
OS: Linux
Problem
I have a PySpark script that uses concurrent....
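Independent of Spark, a frequent cause of this hang is the ThreadPoolExecutor itself: its worker threads are non-daemon, so the interpreter waits on them unless the pool is shut down. A minimal sketch, assuming the script can wrap its pool in a context manager (`run_tasks` and the lambdas are hypothetical stand-ins for the asker's code):

```python
from concurrent.futures import ThreadPoolExecutor

def run_tasks(tasks):
    # The context manager calls shutdown(wait=True) on exit, so no
    # non-daemon worker threads survive to keep the process alive.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(lambda f: f(), tasks))

results = run_tasks([lambda: 1, lambda: 2])
```

If the pool cannot be scoped with `with`, an explicit `pool.shutdown(wait=True)` before the script exits has the same effect.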
Advice
0
votes
1
replies
68
views
How to perform asynchronous LLM inference on Kafka streams using Apache Spark, and handle high-throughput RAG ingestion?
I’m working on a streaming pipeline where data is coming from a Kafka topic, and I want to integrate LLM-based processing and RAG ingestion. I’m running into architectural challenges around latency ...
2
votes
1
answer
61
views
Spark JSON infer schema
I have a JSON file like this:
{ "id": 1, "str": "a string", "d": "1996-11-20" }
I want Spark (version 4.0.1) to infer the schema and make column d a ...
0
votes
1
answer
30
views
How to change the spark-submit command in the IntelliJ Spark plugin
In IntelliJ I am trying to set up my Spark plugin.
On my host I execute my code using
/<spark-home>/bin/spark3-submit .......
When I set up the Spark plugin in IntelliJ, the generated command looks ...
0
votes
1
answer
71
views
Java Spark mapPartitions retry causing duplicate inserts into BigQuery on task failure
I have a Dataproc Java Spark job that processes a dataset in partitions and inserts rows in batches.
The code uses rdd.mapPartitions with an iterator, splitting rows into batches (e.g., 100 rows ...
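Since Spark retries failed tasks, any side-effecting insert inside mapPartitions can run more than once. A common mitigation (sketched here in Python for brevity; the names are hypothetical, not the asker's code) is to derive a deterministic key from the partition id and batch index, so a retried attempt regenerates identical keys and the sink can deduplicate, e.g. via BigQuery `insertId` or a MERGE on the key:

```python
import hashlib

def batch_key(job_id: str, partition_id: int, batch_index: int) -> str:
    # Same inputs always yield the same key, so a retried task
    # attempt re-produces identical keys for identical batches.
    raw = f"{job_id}:{partition_id}:{batch_index}".encode()
    return hashlib.sha256(raw).hexdigest()
```

The sink-side dedup step is what actually prevents duplicates; the key only makes retries recognizable.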
0
votes
1
answer
57
views
Spark 4.1.1 on AKS + Cosmos DB Cassandra API: ClassNotFound without connector, ClosedConnectionException with spark-cassandra-connector_2.13-3.5.1
We are upgrading a Spark job running on AKS (Kubernetes) from Spark 3.5.3 to Spark 4.1.1.
Current working setup (Spark 3.5.3):
Connector: com.datastax.spark:spark-cassandra-connector-assembly_2.12:3....
Advice
0
votes
2
replies
63
views
Does Spark Catalyst Optimize Across Actions?
Given the following scenario: DataFrame A, B, and C. B and C are retrieved from storage and operated on and joined to A = A*. Then a filter is applied to A* and written to one location, another filter ...
Best practices
0
votes
4
replies
115
views
Recomputation of common stages across multiple actionless branches in Spark
My team has been working on a Spark process that reads one (of many) tables and performs divergent steps on subsets of that table (after some preprocessing). The common trunk looks like this:
Read & ...
Advice
2
votes
1
replies
99
views
How to decide number of partitions using repartition vs coalesce in Apache Spark for optimization?
I am trying to understand how to properly use repartition and coalesce in Apache Spark, especially for performance optimization.
From my understanding:
repartition can increase or ...
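The behavioral difference behind that rule of thumb can be sketched without a cluster: coalesce only combines existing partitions (no shuffle, so it can only reduce the count), while repartition redistributes every row individually (a full shuffle, which can increase or decrease the count). A toy model in plain Python (an illustration only, not Spark's actual placement logic, which also considers executor locality):

```python
def coalesce(partitions, n):
    # Combine whole partitions into n buckets; no individual row
    # is split away from its original partition (no "shuffle").
    out = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        out[i % n].extend(part)
    return out

def repartition(rows, n):
    # Redistribute every row one by one (a full "shuffle"),
    # which can grow or shrink the partition count.
    out = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out
```

The practical upshot matches the question's intuition: coalesce is cheaper when shrinking partition counts, repartition is required to grow them or to rebalance skew.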
Advice
0
votes
7
replies
106
views
Java heap space during a long dataset handling cycle
I have a long dataset handling cycle. On each iteration I map dataset rows (mapPartitions), then calculate some statistics (foreachPartition), until a certain condition is met.
It looks ...
0
votes
2
answers
80
views
How to truly stream a 1GB+ XLSX file in Spark using Aspose LightCells without loading full XML into memory?
I am trying to process a very large XLSX file (~1GB compressed). After unzipping, the internal xl/worksheets/sheet1.xml is about 2.4GB.
I am using Spark (DataSource V2) and Aspose Cells ...
1
vote
1
answer
89
views
Registering Partition Information to Glue Iceberg Tables
I am creating Glue Iceberg tables using Spark on EMR. After creation, I also write a few records to the table. However, when I do this, Spark does not register any partition information in Glue table ...
Best practices
0
votes
8
replies
110
views
How do I speed up df.write from Spark to IBM DB2
Currently I am fetching live data from Kafka (which produces 45k rows per second) into Spark, converting those rows into a DataFrame (the data coming in from Kafka is in JSON format) and trying to ...
0
votes
1
answer
68
views
CONVERT TO DELTA fails to merge file schema
This is in Azure Databricks.
I have a directory of Parquet files in Azure Data Lake Storage that I want to convert to a Delta Lake table. I run this:
CONVERT TO DELTA parquet.`abfss://container@...
0
votes
0
answers
168
views
How to read/write in minio using spark?
I've built a Spark and MinIO Docker container with the config below:
services:
spark:
build: .
command: sleep infinity
container_name: spark
volumes:
- ./spark_scripts:/opt/...
Advice
1
vote
4
replies
63
views
Attempting to add a module-info.java to spark-core_2.13 of Apache Spark: how to make Scala accept the module declaration and gather its classes?
I want to make progress on introducing Java modules into Apache Spark. They are needed for:
sorting out and avoiding conflicts between dependencies;
allowing Spark modules to be integrated ...
4
votes
0
answers
127
views
Spark Parquet timestamp min/max statistics
I have a table in Iceberg as below:
spark.sql("""
CREATE OR REPLACE TABLE my_db.my_table (
serverTime TIMESTAMP,
id LONG,
...
)
...
0
votes
1
answer
95
views
PySpark .show() fails with “Python worker exited unexpectedly” on Windows (Python 3.14)
I am facing a PySpark error on Windows while calling .show() on a DataFrame. The job fails with a Python worker crash.
Environment
OS: Windows 10
Spark: Apache Spark (PySpark)
IDE: VS Code
...
Advice
0
votes
0
replies
25
views
Interplay between spark's decommissioning and PVC re-use functions
I'm currently evaluating using PVCs as ephemeral storage in Spark in order to reduce costs and enable running on spot instances. As of Spark 3.5, it's possible to use either the decommissioning ...
0
votes
0
answers
40
views
KubernetesExecutor, Airflow 3, SparkSubmitOperator with pod_override fails with JSON validation error
I'm trying to figure out how to successfully run a DAG with SparkSubmitOperator on Airflow 3.1.5.
I have a wrapper which sets pod config:
self.executor_config = {
"pod_override": ...
0
votes
0
answers
64
views
Can't SELECT anything in an AWS Glue Data Catalog view due to invalid view text: <REDACTED VIEW TEXT>
I created a Glue view through a Glue job like this:
CREATE OR REPLACE PROTECTED MULTI DIALECT VIEW risk_models_output.vw_behavior_special_limit_score
SECURITY DEFINER AS
[query ...
3
votes
0
answers
76
views
Cannot attach VS Code to Spark container: Permission denied creating .vscode-server in /nonexistent
I'm getting what I feel should be a super simple error, but I'm having a tough time figuring out what I'm doing wrong. I want to open a running container in VS Code, but I keep getting a permission ...
6
votes
0
answers
139
views
How to make spark reuse python workers where we have done some costly init set up?
I'm trying to optimize execution of pandas UDFs in PySpark. When the UDF starts, I do some costly initialization, like loading an ML model. This is a one-time operation and I don't want to do this ...
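Spark reuses Python workers by default (`spark.python.worker.reuse` is true), so state cached at module level in the UDF's module survives across tasks on the same worker process. A minimal sketch of the lazy-initialization pattern, with `load_model` as a hypothetical stand-in for the costly step:

```python
_model = None  # module-level cache, one per Python worker process

def get_model():
    # Initialized at most once per worker; subsequent tasks running
    # on the same reused worker get the cached object back.
    global _model
    if _model is None:
        _model = load_model()
    return _model

def load_model():
    # Placeholder for the expensive load (e.g. deserializing an ML model).
    return {"weights": [0.1, 0.2]}
```

The UDF then calls `get_model()` per batch instead of loading at call time; each worker pays the cost once rather than once per task.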
0
votes
1
answer
78
views
Spark job fails with UnsafeExternalSorter OOM when using groupBy + collect_list + sort – how to optimize?
How to replace groupBy + collect_list + array_sort with a more memory-efficient approach in Spark SQL?
I have a Spark (Java) batch job that processes large telecom event data.
The job is failing with `...
0
votes
0
answers
58
views
How to display a greater number of completed jobs in Databricks' Spark UI?
To improve the performance of a Databricks workflow, I need to analyse the Spark UI. However, my workflow has 1295 jobs and the Spark UI on Databricks only shows 904 jobs, as you can see in the following ...
0
votes
0
answers
93
views
Ensure two queries in a Spark declarative pipeline process the same rows when using the availableNow trigger
I'm using Spark declarative pipelines in Databricks. My pipeline runs in triggered mode. My understanding is that in triggered mode, the streaming uses the availableNow=True option to process all data ...
1
vote
1
answer
120
views
scala.MatchError: TimestampNTZType (of class org.apache.spark.sql.types.TimestampNTZType$)
We are using Kafka to get Avro data with the help of a schema registry. After upgrading from Spark 2.4 to 3.3.2, the Kafka consumer is failing with this error:
scala.MatchError: TimestampNTZType (of class ...
0
votes
1
answer
104
views
Spark Declarative Pipelines (SDP) – TABLE_OR_VIEW_NOT_FOUND for upstream table even though it is defined
I am trying to learn Spark Declarative Pipelines (Spark 4.0 / pyspark.pipelines) locally using the spark-pipelines CLI.
I have a simple Bronze → Silver → Gold pipeline, but I keep getting:
pyspark....
Best practices
1
vote
6
replies
80
views
How to run a PySpark UDF separately over DataFrame groups
Grouping a Pyspark dataframe, applying time series analysis UDF to each group
SOLVED See below
I have a Pyspark process which takes a time-series dataframe for a site and calculates/adds features ...
0
votes
1
answer
67
views
Why are 2 tables bucketed by col1 and joined on (col1, col2) shuffled?
// Enable all bucketing optimizations
spark.conf.set("spark.sql.requireAllClusterKeysForDistribution", "false")
spark.conf.set("spark.sql.sources.bucketing.enabled" ...
2
votes
0
answers
100
views
'JavaPackage' object is not callable error when trying to getOrCreate() local spark session
I have set up a small Xubuntu machine with intention of making it my single-node playaround Spark cluster. The cluster seems to be set up correctly - I can access the WebUI at port 8080, it shows a ...
0
votes
0
answers
83
views
spark flatMapToPair reaching "no space left on device" due to large duplication of entries
First, my question is not about increasing disk space to avoid the "no space left" error, but about understanding what Spark does, and hopefully how to improve my code.
In short, here is the pseudo code:
JavaRDD<...
1
vote
2
answers
115
views
Difference between org.apache.hadoop.io.compress.CompressionCodec and org.apache.spark.io.CompressionCodec
I want to use compression in big data processing, but there are two compression codecs.
Does anyone know the difference?
Advice
0
votes
4
replies
236
views
Use RSA key snowflake connection options instead of Password
I want to connect to a Snowflake database from a Databricks notebook. I have an RSA key (.pem file) and I don't want to use a traditional method like username and password, as it is not as secure as ...
0
votes
1
answer
128
views
Does Databricks Spark SQL evaluate all CASE branches for UDFs?
I'm using Databricks SQL and have SQL UDFs for GeoIP / ISP lookups.
Each UDF branches on IPv4 vs IPv6 using a CASE expression like:
CASE
WHEN ip_address LIKE '%:%:%' THEN -- IPv6 path
...
...
1
vote
0
answers
170
views
Warning and performance issues when scanning delta tables
Why do I get multiple warnings WARN delta_kernel::engine::default::json] read_json receiver end of channel dropped before sending completed when scanning a Delta table (pl.scan_delta(temp_path)) that ...
1
vote
1
answer
75
views
How to detect Spark application failure in SparkListener when no jobs are executed?
I have a class that extends SparkListener and has access to SparkContext. I'm wondering if there is any way to check in onApplicationEnd whether the Spark application stopped because of an error or ...
0
votes
0
answers
56
views
How to dynamically cast columns in a dbt-spark custom materialization to resolve UNION ALL schema mismatch?
I am working on a custom materialization in dbt using the dbt-spark adapter (writing to Delta tables on S3). The goal is to handle a hybrid SCD Type 1 and Type 2 strategy.
The Logic I compare the ...
2
votes
0
answers
82
views
How to log a model in MLflow using Spark Connect
I have the following setup:
Kubernetes cluster with Spark Connect 4.0.1 and
MLflow tracking server 3.5.0
The MLflow tracking server should serve all artifacts and is configured this way:
--backend-store-...
0
votes
1
answer
85
views
Handle corrupted files in spark load()
I have a Spark job that runs daily to load data from S3.
The data consists of thousands of gzip files. However, in some cases there are one or two corrupted files in S3, and they cause the whole ...
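One commonly suggested knob for this scenario is Spark's `spark.sql.files.ignoreCorruptFiles` setting, which skips files that fail to read instead of failing the whole job. A sketch of the config, with the caveats that it silently drops data and that corruption detected mid-stream in a gzip file may still surface depending on the codec and Spark version:

```
# spark-defaults.conf (or spark.conf.set(...) at session level)
spark.sql.files.ignoreCorruptFiles=true
```

Whether silent skipping is acceptable is a pipeline-level decision; some teams instead pre-validate the gzip files and quarantine the broken ones.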
-1
votes
2
answers
72
views
Connectivity issues in standalone Spark 4.0
In an Azure VM, I have installed standalone Spark 4.0. On the same VM I have Python 3.11 with Jupyter deployed. In my notebook I submitted the following program:
from pyspark.sql import SparkSession
...
1
vote
1
answer
314
views
PicklingError: Could not serialize object: RecursionError in pyspark code in Jupyter Notebook
I am very new to Spark (I have just started learning it), and I have encountered a recursion error in very simple code.
Background:
Spark Version 3.5.7
Java Version 11.0.29 (Eclipse ...
5
votes
2
answers
899
views
Spark with Delta Lake and S3A: NumberFormatException "60s" and request for working Docker image/config
I’m trying to create a Delta Lake table in MinIO using Spark 4.0.0 inside a Docker container. I’ve added the required JARs:
delta-spark_2.13-4.0.0.jar
delta-storage-4.0.0.jar
hadoop-aws-3.3.6.jar
aws-...
0
votes
0
answers
36
views
Large variation in spark runtimes
Long story short, my team was hired to take on some legacy code and it was running around 5ish hours. We began making some minor changes that shouldn't have affected the runtimes in any significant ...
2
votes
2
answers
133
views
Spark-Redis write loses rows when writing large DataFrame to Redis
I’m experiencing data loss when writing a large DataFrame to Redis using the Spark-Redis connector.
Details:
I have a DataFrame with millions of rows.
Writing to Redis works correctly for small ...