0 votes
0 answers
66 views

Spark SQL + Iceberg: MERGE and INSERT appear to ignore cached DataFrame and re-scan source. I am trying to optimize an SCD2 flow in Spark SQL (Python API) using a cached intermediate DataFrame. ...
fabrik5k
8 votes
0 answers
620 views

class org.jetbrains.plugins.gradle.tooling.serialization.internal.adapter.InternalIdeaModule cannot be cast to class org.gradle.tooling.model.ProjectModel (org.jetbrains.plugins.gradle.tooling....
Musa Baloyi
1 vote
1 answer
70 views

I tried to upgrade my project from Spark 3.5 to Spark 4.0. In the process, I ran into this issue in our unit tests. error: cannot find symbol import org.apache.spark.sql.execution.streaming....
Tarek Eid
3 votes
1 answer
99 views

I'm trying to create & display a Spark DataFrame in Databricks. This is what my code looks like: df = spark.sql(f''' SELECT A.CUSTOMER_ID FROM TABLE_1 A INNER JOIN ...
SRJCoding • 543
Advice
1 vote
2 replies
48 views

I have a project where I ingest large JSON files (100-300 MB per file), where one JSON document is one record. I had issues processing them until I increased spark.sql.files.maxPartitionBytes. I received Out of ...
sageroe42
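The entry above mentions raising spark.sql.files.maxPartitionBytes and then hitting out-of-memory errors. A minimal spark-defaults.conf sketch of the trade-off involved (the option names are real Spark settings; the values are purely illustrative and would need tuning to the actual workload):

```properties
# Larger read partitions so a 100-300 MB whole-file JSON record is not
# split across tasks (default is 128 MB). Illustrative value: 512 MB.
spark.sql.files.maxPartitionBytes=536870912
# Raising partition size without raising task memory is a common OOM
# cause; give executors headroom to materialize one large record.
spark.executor.memory=8g
```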
-2 votes
0 answers
66 views

Environment Spark: 3.3.2 (Cloudera parcel SPARK3-3.3.2.3.3.7191000.0-78-1.p0.56279928) Python: 3.10 Py4J: 0.10.9.5 Deployment: YARN OS: Linux Problem: I have a PySpark script that uses concurrent....
NoName_acc
Advice
0 votes
1 replies
68 views

I’m working on a streaming pipeline where data is coming from a Kafka topic, and I want to integrate LLM-based processing and RAG ingestion. I’m running into architectural challenges around latency ...
Arpan • 993
2 votes
1 answer
61 views

I have a JSON file like this: { "id": 1, "str": "a string", "d": "1996-11-20" } I want Spark (version 4.0.1) to infer the schema and make column d a ...
hage • 6,253
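A workaround an answer to the entry above might suggest (a sketch only, using the column names from the snippet): skip schema inference entirely and pass an explicit DDL schema so that column d is read as DATE:

```python
# Column names mirror the question's JSON; the read call itself is shown
# as a comment because it needs a live SparkSession, and the path is a
# hypothetical placeholder.
schema_ddl = "id BIGINT, str STRING, d DATE"
# df = spark.read.schema(schema_ddl).json("data.json")
# df.schema["d"].dataType would then be DateType
```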
0 votes
1 answer
30 views

In IntelliJ I am trying to set up my Spark plugin. On my host I execute my code using /<spark-home>/bin/spark3-submit ....... When I set up the Spark plugin in my IntelliJ, the generated command looks ...
Ravi Kumar
0 votes
1 answer
71 views

I have a Dataproc Java Spark job that processes a dataset in partitions and inserts rows in batches. The code uses rdd.mapPartitions with an iterator, splitting rows into batches (e.g., 100 rows ...
Aakash Shrivastav
0 votes
1 answer
57 views

We are upgrading a Spark job running on AKS (Kubernetes) from Spark 3.5.3 to Spark 4.1.1. Current working setup (Spark 3.5.3): Connector: com.datastax.spark:spark-cassandra-connector-assembly_2.12:3....
akshay kadam
Advice
0 votes
2 replies
63 views

Given the following scenario: DataFrame A, B, and C. B and C are retrieved from storage and operated on and joined to A = A*. Then a filter is applied to A* and written to one location, another filter ...
Sergei I.
Best practices
0 votes
4 replies
115 views

My team has been working on a Spark process that reads one (of many) tables and performs divergent steps on subsets of that table (after some preprocessing). The common trunk looks like this: Read & ...
xandor19
Advice
2 votes
1 replies
99 views

How to decide: I am trying to understand how to properly use repartition and coalesce in Apache Spark, especially for performance optimization. From my understanding: repartition can increase or ...
Test User123
Advice
0 votes
7 replies
106 views

I have a long dataset-handling loop. On each iteration of this loop I map dataset rows (mapPartitions), then calculate some statistics (foreachPartition), until a certain condition is met. It looks ...
0 votes
2 answers
80 views

I am trying to process a very large XLSX file (~1GB compressed). After unzipping, the internal xl/worksheets/sheet1.xml is about 2.4GB. I am using Spark (DataSource V2) and Aspose Cells ...
raviston Thanasekar
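For XML of the size described above, answers typically steer toward streaming parsers rather than loading the whole document. A self-contained sketch with Python's stdlib iterparse (the sample bytes are a hypothetical, heavily simplified stand-in for spreadsheetML, not Aspose or the asker's actual file):

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for xl/worksheets/sheet1.xml content.
SAMPLE = b"""<worksheet><sheetData>
<row r="1"><c><v>1</v></c><c><v>2</v></c></row>
<row r="2"><c><v>3</v></c><c><v>4</v></c></row>
</sheetData></worksheet>"""

def iter_rows(fileobj):
    """Yield one row of cell values at a time, never holding the full tree."""
    for _, elem in ET.iterparse(fileobj, events=("end",)):
        if elem.tag == "row":
            yield [v.text for v in elem.iter("v")]
            elem.clear()  # release the finished subtree so memory stays flat

rows = list(iter_rows(io.BytesIO(SAMPLE)))
```

The same pattern works on a multi-gigabyte sheet because each `<row>` subtree is discarded as soon as it is emitted.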
1 vote
1 answer
89 views

I am creating Glue Iceberg tables using Spark on EMR. After creation, I also write a few records to the table. However, when I do this, Spark does not register any partition information in Glue table ...
shiva • 2,801
Best practices
0 votes
8 replies
110 views

Currently I am fetching live data from Kafka (which produces 45k rows per second) into Spark, converting those rows into a dataframe (as the data coming in from Kafka is in JSON format), and trying to ...
Edvin Guromin
0 votes
1 answer
68 views

This is in Azure Databricks. I have a directory of Parquet files in Azure Data Lake Storage that I want to convert to a Delta Lake table. I run this: CONVERT TO DELTA parquet.`abfss://container@...
ardaar • 1,304
0 votes
0 answers
168 views

I've built a Spark and MinIO Docker container with the config below: services: spark: build: . command: sleep infinity container_name: spark volumes: - ./spark_scripts:/opt/...
vyangesh
Advice
1 vote
4 replies
63 views

I want to make progress on introducing Java modules into Apache Spark. They are needed for: sorting out and avoiding conflicts between dependencies, and allowing Spark modules to be integrated ...
Marc Le Bihan
4 votes
0 answers
127 views

I have a table in Iceberg as below: spark.sql(""" CREATE OR REPLACE TABLE my_db.my_table ( serverTime TIMESTAMP, id LONG, ... ) ...
0 votes
1 answer
95 views

I am facing a PySpark error on Windows while calling .show() on a DataFrame. The job fails with a Python worker crash. Environment OS: Windows 10 Spark: Apache Spark (PySpark) IDE: VS Code ...
Deepika Goyal
Advice
0 votes
0 replies
25 views

I'm currently evaluating using PVCs as ephemeral storage in Spark in order to reduce costs and enable running on spot instances. As of Spark 3.5, it's possible to use either the decommissioning ...
Dzeri96 • 462
0 votes
0 answers
40 views

I'm trying to figure out how to successfully run dag with SparkSubmitOperator on Airflow 3.1.5, I have a wrapper which sets pod config: self.executor_config = { "pod_override": ...
twierdzenie twierdzenie
0 votes
0 answers
64 views

I created a Glue view through a Glue job like this: CREATE OR REPLACE PROTECTED MULTI DIALECT VIEW risk_models_output.vw_behavior_special_limit_score SECURITY DEFINER AS [query ...
Paloma Raissa
3 votes
0 answers
76 views

I'm getting what I feel should be a super simple error, but I'm having a tough time figuring out what I'm doing wrong. I want to open a running container in VS Code, but I keep getting a permission ...
ChristianRRL
6 votes
0 answers
139 views

I'm trying to optimize execution of pandas UDFs in PySpark. When I start the UDF, I do some costly initialization, like loading an ML model. This is a one-time operation and I don't want to do this ...
Srinivas Kumar
0 votes
1 answer
78 views

How to replace groupBy + collect_list + array_sort with a more memory-efficient approach in Spark SQL? I have a Spark (Java) batch job that processes large telecom event data. The job is failing with `...
Thịnh Nguyễn
0 votes
0 answers
58 views

To improve the performance of a Databricks workflow, I need to analyse the Spark UI. However, my workflow has 1295 jobs and the Spark UI on Databricks only shows 904 jobs, as you can see in the following ...
Vincent Doba • 5,188
0 votes
0 answers
93 views

I'm using Spark declarative pipelines in Databricks. My pipeline runs in triggered mode. My understanding is that in triggered mode, the streaming uses the availableNow=True option to process all data ...
Rob Fisher • 1,025
1 vote
1 answer
120 views

We are using Kafka to get Avro data with the help of Schema Registry. After upgrading to Spark 3.3.2 from 2.4, the Kafka consumer is failing with the error scala.MatchError: TimestampNTZType (of class ...
Khilesh Chauhan
0 votes
1 answer
104 views

I am trying to learn Spark Declarative Pipelines (Spark 4.0 / pyspark.pipelines) locally using the spark-pipelines CLI. I have a simple Bronze → Silver → Gold pipeline, but I keep getting: pyspark....
AChaudhury
Best practices
1 vote
6 replies
80 views

Grouping a PySpark dataframe and applying a time-series analysis UDF to each group (SOLVED, see below). I have a PySpark process which takes a time-series dataframe for a site and calculates/adds features ...
Jernau • 93
0 votes
1 answer
67 views

// Enable all bucketing optimizations spark.conf.set("spark.sql.requireAllClusterKeysForDistribution", "false") spark.conf.set("spark.sql.sources.bucketing.enabled"...
user2417458
2 votes
0 answers
100 views

I have set up a small Xubuntu machine with the intention of making it my single-node play-around Spark cluster. The cluster seems to be set up correctly - I can access the WebUI at port 8080, it shows a ...
Paweł Sopel
0 votes
0 answers
83 views

First, my question is not about increasing disk space to avoid the no-space-left error, but about understanding what Spark does, and hopefully how to improve my code. In short, here is the pseudo-code: JavaRDD<...
Juh_ • 15.9k
1 vote
2 answers
115 views

I want to use compression in big-data processing, but there are two compression codecs. Does anyone know the difference?
Angle Tom • 1,150
Advice
0 votes
4 replies
236 views

I want to connect to a Snowflake database from a Databricks notebook. I have an RSA key (.pem file) and I don't want to use a traditional method like username and password, as it is not as secure as ...
Prafulla
0 votes
1 answer
128 views

I'm using Databricks SQL and have SQL UDFs for GeoIP / ISP lookups. Each UDF branches on IPv4 vs IPv6 using a CASE expression like: CASE WHEN ip_address LIKE '%:%:%' THEN -- IPv6 path ... ...
YJCMS • 3
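A side note an answer might add to the entry above (a Python sketch, not Databricks SQL, purely to illustrate that standard libraries classify IPv4 vs IPv6 literals more robustly than a LIKE pattern):

```python
import ipaddress

def ip_family(addr: str) -> int:
    """Return 4 or 6 for a valid IP literal; raises ValueError otherwise."""
    return ipaddress.ip_address(addr).version
```

In Databricks SQL itself the branching would stay inside the CASE expression; this only shows the classification idea.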
1 vote
0 answers
170 views

Why do I get multiple warnings WARN delta_kernel::engine::default::json] read_json receiver end of channel dropped before sending completed when scanning (pl.scan_delta(temp_path)) a Delta table that ...
gaut • 6,068
1 vote
1 answer
75 views

I have a class that extends SparkListener and has access to SparkContext. I'm wondering if there is any way to check in onApplicationEnd whether the Spark application stopped because of an error or ...
tnazarew
0 votes
0 answers
56 views

I am working on a custom materialization in dbt using the dbt-spark adapter (writing to Delta tables on S3). The goal is to handle a hybrid SCD Type 1 and Type 2 strategy. The Logic I compare the ...
HoanggLB2k2
2 votes
0 answers
82 views

I have the following setup: Kubernetes cluster with Spark Connect 4.0.1 and MLflow tracking server 3.5.0 MLFlow tracking server should serve all artifacts and is configured this way: --backend-store-...
hage • 6,253
0 votes
1 answer
85 views

I have a Spark job that runs daily to load data from S3. The data consist of thousands of gzip files. However, in some cases, there are one or two corrupted files in S3, and that causes the whole ...
Nakeuh • 1,933
-1 votes
2 answers
72 views

In Azure VM, I have installed standalone Spark 4.0. On the same VM I have Python 3.11 with Jupyter deployed. In my notebook I submitted the following program: from pyspark.sql import SparkSession ...
Ziggy • 41
1 vote
1 answer
314 views

I am very new to Spark (specifically, I have just started learning it), and I have encountered a recursion error in very simple code. Background: Spark Version 3.5.7 Java Version 11.0.29 (Eclipse ...
GINzzZ100
5 votes
2 answers
899 views

I’m trying to create a Delta Lake table in MinIO using Spark 4.0.0 inside a Docker container. I’ve added the required JARs: delta-spark_2.13-4.0.0.jar delta-storage-4.0.0.jar hadoop-aws-3.3.6.jar aws-...
Tutu ツ • 165
0 votes
0 answers
36 views

Long story short, my team was hired to take on some legacy code, and it was running for around five hours. We began making some minor changes that shouldn't have affected the runtimes in any significant ...
Ben Fuqua
2 votes
2 answers
133 views

I’m experiencing data loss when writing a large DataFrame to Redis using the Spark-Redis connector. Details: I have a DataFrame with millions of rows. Writing to Redis works correctly for small ...
gianfranco de siena
