82,559 questions
0
votes
0
answers
66
views
Spark SQL MERGE/INSERT on Iceberg Recomputes Upstream Join Instead of Reusing Cached DataFrame (MEMORY_AND_DISK)
Spark SQL + Iceberg: MERGE and INSERT appear to ignore cached DataFrame and re-scan source
I am trying to optimize an SCD2 flow in Spark SQL (Python API) using a cached intermediate DataFrame.
...
8
votes
0
answers
620
views
Upgraded to IntelliJ IDEA 2026.1, Gradle fails to sync or build
class org.jetbrains.plugins.gradle.tooling.serialization.internal.adapter.InternalIdeaModule cannot be cast to class org.gradle.tooling.model.ProjectModel (org.jetbrains.plugins.gradle.tooling....
1
vote
1
answer
70
views
Was MemoryStream moved or changed in Spark 4.0?
I tried to upgrade my project from Spark 3.5 to Spark 4.0. In the process, I ran into this issue in our unit tests.
error: cannot find symbol
import org.apache.spark.sql.execution.streaming....
3
votes
1
answer
99
views
How to fix a cast invalid input error in Spark?
I'm trying to create and display a Spark DataFrame in Databricks. This is what my code looks like:
df = spark.sql(f'''
SELECT A.CUSTOMER_ID
FROM TABLE_1 A
INNER JOIN
...
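A common cause of a failing cast on Databricks is ANSI mode (`spark.sql.ansi.enabled`), under which malformed input raises `CAST_INVALID_INPUT` instead of producing NULL. A minimal sketch of the usual workaround, reusing the table and column names from the excerpt and assuming the failing cast is on `CUSTOMER_ID` (an assumption, since the query is truncated):

```sql
-- try_cast (available since Spark 3.2) returns NULL on malformed
-- input instead of raising CAST_INVALID_INPUT under ANSI mode.
SELECT try_cast(A.CUSTOMER_ID AS BIGINT) AS customer_id
FROM TABLE_1 A
```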
Advice
1
vote
2
replies
48
views
What is Spark doing when one record is bigger than partition size?
I have a project where I ingest large JSON files (100-300 MB per file), where one JSON file is one record. I had issues processing them until I increased spark.sql.files.maxPartitionBytes. I received Out of ...
-2
votes
0
answers
66
views
PySpark script hangs after job completion — ThreadPoolExecutor + Py4J daemon threads never terminate
Environment
Spark: 3.3.2 (Cloudera parcel SPARK3-3.3.2.3.3.7191000.0-78-1.p0.56279928)
Python: 3.10
Py4J: 0.10.9.5
Deployment: YARN
OS: Linux
Problem
I have a PySpark script that uses concurrent....
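Independent of Spark, a frequent cause of this hang is the ThreadPoolExecutor itself: its worker threads are non-daemon, so the interpreter waits on them unless the pool is shut down. A minimal sketch, assuming the script can wrap its pool in a context manager (`run_tasks` and the lambdas are hypothetical stand-ins for the asker's code):

```python
from concurrent.futures import ThreadPoolExecutor

def run_tasks(tasks):
    # The context manager calls shutdown(wait=True) on exit, so no
    # non-daemon worker threads survive to keep the process alive.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(lambda f: f(), tasks))

results = run_tasks([lambda: 1, lambda: 2])
```

If the pool cannot be scoped with `with`, an explicit `pool.shutdown(wait=True)` before the script exits has the same effect.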
Advice
0
votes
1
replies
68
views
How to perform asynchronous LLM inference on Kafka streams using Apache Spark, and handle high-throughput RAG ingestion?
I’m working on a streaming pipeline where data is coming from a Kafka topic, and I want to integrate LLM-based processing and RAG ingestion. I’m running into architectural challenges around latency ...
2
votes
1
answer
61
views
Spark JSON infer schema
I have a JSON file like this:
{ "id": 1, "str": "a string", "d": "1996-11-20" }
I want Spark (version 4.0.1) to infer the schema and make column d a ...
0
votes
1
answer
30
views
How to change the spark-submit command in the IntelliJ Spark plugin
In IntelliJ I am trying to set up my Spark plugin.
On my host I execute my code using
/<spark-home>/bin/spark3-submit .......
When I set up the Spark plugin in IntelliJ, the generated command looks ...
0
votes
1
answer
71
views
Java Spark mapPartitions retry causing duplicate inserts into BigQuery on task failure
I have a Dataproc Java Spark job that processes a dataset in partitions and inserts rows in batches.
The code uses rdd.mapPartitions with an iterator, splitting rows into batches (e.g., 100 rows ...
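Since Spark retries failed tasks, any side-effecting insert inside mapPartitions can run more than once. A common mitigation (sketched here in Python for brevity; the names are hypothetical, not the asker's code) is to derive a deterministic key from the partition id and batch index, so a retried attempt regenerates identical keys and the sink can deduplicate, e.g. via BigQuery `insertId` or a MERGE on the key:

```python
import hashlib

def batch_key(job_id: str, partition_id: int, batch_index: int) -> str:
    # Same inputs always yield the same key, so a retried task
    # attempt re-produces identical keys for identical batches.
    raw = f"{job_id}:{partition_id}:{batch_index}".encode()
    return hashlib.sha256(raw).hexdigest()
```

The sink-side dedup step is what actually prevents duplicates; the key only makes retries recognizable.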
0
votes
1
answer
57
views
Spark 4.1.1 on AKS + Cosmos DB Cassandra API: ClassNotFound without connector, ClosedConnectionException with spark-cassandra-connector_2.13-3.5.1
We are upgrading a Spark job running on AKS (Kubernetes) from Spark 3.5.3 to Spark 4.1.1.
Current working setup (Spark 3.5.3):
Connector: com.datastax.spark:spark-cassandra-connector-assembly_2.12:3....
Advice
0
votes
2
replies
63
views
Does Spark Catalyst Optimize Across Actions?
Given the following scenario: DataFrame A, B, and C. B and C are retrieved from storage and operated on and joined to A = A*. Then a filter is applied to A* and written to one location, another filter ...
Best practices
0
votes
4
replies
115
views
Recomputation of common stages across multiple actionless branches in Spark
My team has been working on a Spark process that reads one (of many) tables and performs divergent steps on subsets of that table (after some preprocessing). The common trunk looks like this:
Read & ...
Advice
2
votes
1
replies
99
views
How to decide number of partitions using repartition vs coalesce in Apache Spark for optimization?
I am trying to understand how to properly use repartition and coalesce in Apache Spark, especially for performance optimization.
From my understanding:
repartition can increase or ...
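The behavioral difference behind that rule of thumb can be sketched without a cluster: coalesce only combines existing partitions (no shuffle, so it can only reduce the count), while repartition redistributes every row individually (a full shuffle, which can increase or decrease the count). A toy model in plain Python (an illustration only, not Spark's actual placement logic, which also considers executor locality):

```python
def coalesce(partitions, n):
    # Combine whole partitions into n buckets; no individual row
    # is split away from its original partition (no "shuffle").
    out = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        out[i % n].extend(part)
    return out

def repartition(rows, n):
    # Redistribute every row one by one (a full "shuffle"),
    # which can grow or shrink the partition count.
    out = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out
```

The practical upshot matches the question's intuition: coalesce is cheaper when shrinking partition counts, repartition is required to grow them or to rebalance skew.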
Advice
0
votes
7
replies
106
views
Java heap space during a long dataset handling cycle
I have a long dataset handling cycle. On each iteration I map dataset rows (mapPartitions), then calculate some statistics (foreachPartition), until a certain condition is met.
It looks ...
0
votes
2
answers
80
views
How to truly stream a 1GB+ XLSX file in Spark using Aspose LightCells without loading full XML into memory?
I am trying to process a very large XLSX file (~1GB compressed). After unzipping, the internal xl/worksheets/sheet1.xml is about 2.4GB.
I am using Spark (DataSource V2) and Aspose Cells ...
1
vote
1
answer
89
views
Registering Partition Information to Glue Iceberg Tables
I am creating Glue Iceberg tables using Spark on EMR. After creation, I also write a few records to the table. However, when I do this, Spark does not register any partition information in Glue table ...
Best practices
0
votes
8
replies
110
views
How do I speed up df.write from Spark to IBM DB2
Currently I am fetching live data from Kafka (which produces 45k rows per second) into Spark, converting those rows into a DataFrame (the data coming in from Kafka is in JSON format) and trying to ...
0
votes
1
answer
68
views
CONVERT TO DELTA fails to merge file schema
This is in Azure Databricks.
I have a directory of Parquet files in Azure Data Lake Storage that I want to convert to a Delta Lake table. I run this:
CONVERT TO DELTA parquet.`abfss://container@...
0
votes
0
answers
168
views
How to read/write in minio using spark?
I've built a Spark and MinIO Docker container with the config below:
services:
spark:
build: .
command: sleep infinity
container_name: spark
volumes:
- ./spark_scripts:/opt/...
Advice
1
vote
4
replies
63
views
Attempting to add a module-info.java to spark-core_2.13 of Apache Spark: how to make Scala accept the module declaration and gather its classes?
I want to make progress on introducing Java modules into Apache Spark. They are needed for:
sorting out and avoiding conflicts between dependencies;
allowing Spark modules to be integrated ...
4
votes
0
answers
127
views
Spark Parquet timestamp min/max statistics
I have a table in Iceberg as below:
spark.sql("""
CREATE OR REPLACE TABLE my_db.my_table (
serverTime TIMESTAMP,
id LONG,
...
)
...
0
votes
1
answer
95
views
PySpark .show() fails with “Python worker exited unexpectedly” on Windows (Python 3.14)
I am facing a PySpark error on Windows while calling .show() on a DataFrame. The job fails with a Python worker crash.
Environment
OS: Windows 10
Spark: Apache Spark (PySpark)
IDE: VS Code
...
Advice
0
votes
0
replies
25
views
Interplay between spark's decommissioning and PVC re-use functions
I'm currently evaluating using PVCs as ephemeral storage in Spark in order to reduce costs and enable running on spot instances. As of Spark 3.5, it's possible to use either the decommissioning ...
0
votes
0
answers
40
views
KubernetesExecutor, Airflow 3, SparkSubmitOperator with pod_override fails with JSON validation error
I'm trying to figure out how to successfully run a DAG with SparkSubmitOperator on Airflow 3.1.5.
I have a wrapper which sets pod config:
self.executor_config = {
"pod_override": ...
0
votes
0
answers
64
views
Can't SELECT anything in an AWS Glue Data Catalog view due to invalid view text: <REDACTED VIEW TEXT>
I created a Glue view through a Glue job like this:
CREATE OR REPLACE PROTECTED MULTI DIALECT VIEW risk_models_output.vw_behavior_special_limit_score
SECURITY DEFINER AS
[query ...
3
votes
0
answers
76
views
Cannot attach VS Code to Spark container: Permission denied creating .vscode-server in /nonexistent
I'm getting what I feel should be a super simple error, but I'm having a tough time figuring out what I'm doing wrong. I want to open a running container in VS Code, but I keep getting a permission ...
6
votes
0
answers
139
views
How to make spark reuse python workers where we have done some costly init set up?
I'm trying to optimize execution of pandas UDFs in PySpark. When the UDF starts, I do some costly initialization, like loading an ML model. This is a one-time operation and I don't want to do this ...
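Spark reuses Python workers by default (`spark.python.worker.reuse` is true), so state cached at module level in the UDF's module survives across tasks on the same worker process. A minimal sketch of the lazy-initialization pattern, with `load_model` as a hypothetical stand-in for the costly step:

```python
_model = None  # module-level cache, one per Python worker process

def get_model():
    # Initialized at most once per worker; subsequent tasks running
    # on the same reused worker get the cached object back.
    global _model
    if _model is None:
        _model = load_model()
    return _model

def load_model():
    # Placeholder for the expensive load (e.g. deserializing an ML model).
    return {"weights": [0.1, 0.2]}
```

The UDF then calls `get_model()` per batch instead of loading at call time; each worker pays the cost once rather than once per task.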
0
votes
1
answer
78
views
Spark job fails with UnsafeExternalSorter OOM when using groupBy + collect_list + sort – how to optimize?
How to replace groupBy + collect_list + array_sort with a more memory-efficient approach in Spark SQL?
I have a Spark (Java) batch job that processes large telecom event data.
The job is failing with `...
0
votes
0
answers
58
views
How to display a greater number of completed jobs in Databricks' Spark UI?
To improve the performance of a Databricks workflow, I need to analyse the Spark UI. However, my workflow has 1295 jobs and the Spark UI on Databricks only shows 904 jobs, as you can see in the following ...
0
votes
0
answers
93
views
Ensure two queries in a Spark declarative pipeline process the same rows when using the availableNow trigger
I'm using Spark declarative pipelines in Databricks. My pipeline runs in triggered mode. My understanding is that in triggered mode, the streaming uses the availableNow=True option to process all data ...
1
vote
1
answer
120
views
scala.MatchError: TimestampNTZType (of class org.apache.spark.sql.types.TimestampNTZType$)
We are using Kafka to get Avro data with the help of a schema registry. After upgrading from Spark 2.4 to 3.3.2, the Kafka consumer is failing with this error:
scala.MatchError: TimestampNTZType (of class ...
0
votes
1
answer
104
views
Spark Declarative Pipelines (SDP) – TABLE_OR_VIEW_NOT_FOUND for upstream table even though it is defined
I am trying to learn Spark Declarative Pipelines (Spark 4.0 / pyspark.pipelines) locally using the spark-pipelines CLI.
I have a simple Bronze → Silver → Gold pipeline, but I keep getting:
pyspark....
Best practices
1
vote
6
replies
80
views
How to run a PySpark UDF separately over DataFrame groups
Grouping a Pyspark dataframe, applying time series analysis UDF to each group
SOLVED See below
I have a Pyspark process which takes a time-series dataframe for a site and calculates/adds features ...
0
votes
1
answer
67
views
Why are 2 tables bucketed by col1 and joined on (col1, col2) shuffled?
// Enable all bucketing optimizations
spark.conf.set("spark.sql.requireAllClusterKeysForDistribution", "false")
spark.conf.set("spark.sql.sources.bucketing.enabled" ...
2
votes
0
answers
100
views
'JavaPackage' object is not callable error when trying to getOrCreate() local spark session
I have set up a small Xubuntu machine with intention of making it my single-node playaround Spark cluster. The cluster seems to be set up correctly - I can access the WebUI at port 8080, it shows a ...
0
votes
0
answers
83
views
spark flatMapToPair reaching "no space left on device" due to large duplication of entries
First, my question is not about increasing disk space to avoid the "no space left" error, but about understanding what Spark does, and hopefully how to improve my code.
In short, here is the pseudo code:
JavaRDD<...
1
vote
2
answers
115
views
Difference between org.apache.hadoop.io.compress.CompressionCodec and org.apache.spark.io.CompressionCodec
I want to use compression in big data processing, but there are two compression codecs.
Does anyone know the difference?
Advice
0
votes
4
replies
236
views
Use RSA key snowflake connection options instead of Password
I want to connect to a Snowflake database from a Databricks notebook. I have an RSA key (.pem file) and I don't want to use a traditional method like username and password, as it is not as secure as ...
0
votes
1
answer
128
views
Does Databricks Spark SQL evaluate all CASE branches for UDFs?
I'm using Databricks SQL and have SQL UDFs for GeoIP / ISP lookups.
Each UDF branches on IPv4 vs IPv6 using a CASE expression like:
CASE
WHEN ip_address LIKE '%:%:%' THEN -- IPv6 path
...
...
1
vote
0
answers
170
views
Warning and performance issues when scanning delta tables
Why do I get multiple warnings WARN delta_kernel::engine::default::json] read_json receiver end of channel dropped before sending completed when scanning a Delta table (pl.scan_delta(temp_path)) that ...
1
vote
1
answer
75
views
How to detect Spark application failure in SparkListener when no jobs are executed?
I have a class that extends SparkListener and has access to SparkContext. I'm wondering if there is any way to check in onApplicationEnd whether the Spark application stopped because of an error or ...
0
votes
0
answers
56
views
How to dynamically cast columns in a dbt-spark custom materialization to resolve UNION ALL schema mismatch?
I am working on a custom materialization in dbt using the dbt-spark adapter (writing to Delta tables on S3). The goal is to handle a hybrid SCD Type 1 and Type 2 strategy.
The Logic I compare the ...
2
votes
0
answers
82
views
How to log a model in MLflow using Spark Connect
I have the following setup:
Kubernetes cluster with Spark Connect 4.0.1 and
MLflow tracking server 3.5.0
The MLflow tracking server should serve all artifacts and is configured this way:
--backend-store-...
0
votes
1
answer
85
views
Handle corrupted files in spark load()
I have a Spark job that runs daily to load data from S3.
The data consists of thousands of gzip files. However, in some cases there are one or two corrupted files in S3, and they cause the whole ...
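One commonly suggested knob for this scenario is Spark's `spark.sql.files.ignoreCorruptFiles` setting, which skips files that fail to read instead of failing the whole job. A sketch of the config, with the caveats that it silently drops data and that corruption detected mid-stream in a gzip file may still surface depending on the codec and Spark version:

```
# spark-defaults.conf (or spark.conf.set(...) at session level)
spark.sql.files.ignoreCorruptFiles=true
```

Whether silent skipping is acceptable is a pipeline-level decision; some teams instead pre-validate the gzip files and quarantine the broken ones.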
-1
votes
2
answers
72
views
Connectivity issues in standalone Spark 4.0
In an Azure VM, I have installed standalone Spark 4.0. On the same VM I have Python 3.11 with Jupyter deployed. In my notebook I submitted the following program:
from pyspark.sql import SparkSession
...
1
vote
1
answer
314
views
PicklingError: Could not serialize object: RecursionError in pyspark code in Jupyter Notebook
I am very new to Spark (I have just started learning it), and I have encountered a recursion error in very simple code.
Background:
Spark Version 3.5.7
Java Version 11.0.29 (Eclipse ...
5
votes
2
answers
899
views
Spark with Delta Lake and S3A: NumberFormatException "60s" and request for working Docker image/config
I’m trying to create a Delta Lake table in MinIO using Spark 4.0.0 inside a Docker container. I’ve added the required JARs:
delta-spark_2.13-4.0.0.jar
delta-storage-4.0.0.jar
hadoop-aws-3.3.6.jar
aws-...
0
votes
0
answers
36
views
Large variation in spark runtimes
Long story short, my team was hired to take on some legacy code and it was running around 5ish hours. We began making some minor changes that shouldn't have affected the runtimes in any significant ...
2
votes
2
answers
133
views
Spark-Redis write loses rows when writing large DataFrame to Redis
I’m experiencing data loss when writing a large DataFrame to Redis using the Spark-Redis connector.
Details:
I have a DataFrame with millions of rows.
Writing to Redis works correctly for small ...