43 questions
0
votes
3
answers
114
views
PySpark groupBy().applyInPandas() fails with INVALID_PANDAS_UDF despite correct signature and schema for GROUPED_MAP
NOTE: This question has many related questions on Stack Overflow, but I was unable to find my answer in any of them.
I'm attempting to parallelize Prophet time series model training across multiple ...
0
votes
0
answers
315
views
pandas_udf causing Python worker to crash in PySpark on macOS with M3 chip
I'm working with PySpark and trying to use a pandas_udf on my macOS system with an M3 chip. My environment is Python 3.10 running from a virtual environment. The code runs fine until I introduce the ...
0
votes
2
answers
107
views
pandas udf RuntimeError: Number of columns of the returned pandas.DataFrame doesn't match specified schema
I'm running a pandas udf as follows:
def some_udf(df, keys=IDS_COLS, cols_to_keep=COLS_TO_KEEP):
    INT_ID_COLUMN = '__iid'
    df_keys = df.select(keys).distinct().withColumn(INT_ID_COLUMN, F....
0
votes
1
answer
49
views
Create time based features in Pyspark
I have a feature table Pyspark DF that gets created every day through a pipeline. Now the ask is to create time based features for each feature where each t-1 till t-30 (t=time) features captures the ...
0
votes
2
answers
45
views
pyspark transformation affecting multiple columns
TL;DR: I'm trying to write a UDF that transforms a PySpark DataFrame, where the input is a DataFrame and the output is the same DataFrame with just a few columns mapped to different values.
In ...
0
votes
0
answers
208
views
Parallelize different scenarios for pandas UDF
I have created a pandas UDF (df->df) for a scenario, which takes care of running in parallel over the partitions found within the provided data; this is fine. However, it was requested to have ...
1
vote
2
answers
3k
views
Pyspark "TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array"
On Databricks, I have a streaming pipeline where the bronze source and silver target are in Delta format. I have a pandas UDF that uses the requests_cache library to retrieve something from a URL (...
0
votes
1
answer
491
views
How to reduce the execution time of multiple models' inference on a large dataset in pyspark?
I have a PySpark DataFrame with a huge number of rows (80 million to 100 million). I am running model inference on it to obtain the model score (probability) for each row, as in the code below:
import ...
0
votes
1
answer
194
views
Pyspark Error due to data type in pandas_udf
I'm trying to write a filter_words function as a pandas_udf.
Here are the functions I am using:
@udf_annotator(returnType=ArrayType(StructType([StructField("position", IntegerType(), True),
...
0
votes
1
answer
824
views
(Spark 3.3.2 OpenJDK19 PySpark Pandas_UDF Python3.10 Ubuntu22.04 Dockerized) Test Script producing TypeError: 'JavaPackage' object is not callable
I've created a docker container that installs Ubuntu 22.04, Python 3.10, Spark 3.3.2, Hadoop 3, Scala 13, and Open JDK 19.
I'm currently using it as a test environment before deploying code in AWS.
This ...
3
votes
1
answer
290
views
Pyspark Pandas-Vectorized UDFs
I am trying to convert this udf into this pandas udf, in order to avoid creating two pandas udfs.
Convert this:
@udf("string")
def splitEmailUDF(email: str, position: int) -> str:
...
2
votes
1
answer
591
views
Use Pandas UDF to calculate Cosine Similarity of two vectors in PySpark
I want to calculate the cosine similarity of 2 vectors using Pandas UDF. I implemented it with Spark UDF, which works fine with the following script.
import numpy as np
from pyspark.sql.functions ...
0
votes
1
answer
139
views
pyspark pandas udf not able to return any object
I am moving my code from pandas to PySpark for an NLP task. I have figured out how to apply tokenization (using the Keras built-in library) via a pandas UDF. However, I also want to return the fitted ...
2
votes
0
answers
558
views
How to use applyInPandas inside a class method with pyspark
I have a class with a native python function (performing some imputations on a pd df) that will be used on grouped data with applyInPandas (https://spark.apache.org/docs/3.1.2/api/python/reference/api/...
0
votes
1
answer
1k
views
pyspark with pandas udf giving java.io.EOFException while writing to CSV
PySpark code using pandas UDF functions works fine with df.limit(20).collect() and when writing 20 records to CSV. But when I try to write 100 records to CSV, it fails with a java.io.EOFException error. ...
1
vote
1
answer
569
views
python udf iterator -> iterator giving outputted more rows error
I have a DataFrame with a text column CALL_TRANSCRIPT (string format) and a pii_allmethods column (array of strings). I'm trying to search Call_Transcripts for the strings in the array and mask them using a PySpark pandas UDF. ...
1
vote
0
answers
109
views
Trying to parallelize hyperparameter tuning using pandas udf, but no success
I've been trying to parallelize hyperparameter tuning for my prophet model for around 100 combinations of hyperparameters saved in the dataframe params_df.
I want to parallelize the hyperparameter ...
0
votes
1
answer
106
views
Pyspark Pandas UDF Series operation on Array column
I have a dataframe like this
data_df = spark.createDataFrame([([1,2,3],'val1'),([4,5,6],'val2')],['col1','col2'])
Col1. Col2
[1,2,3] val1
[4,5,6] val2
I want to get the minimum value from the ...
0
votes
0
answers
142
views
How to convert Python nested loops into a pandas UDF
I'm quite new to PySpark and not a skilled Python engineer, and I'm trying to understand how to apply pandas UDFs to my case. I have developed an ArimaX model, which for each "id" performs 4 outlook ...
1
vote
1
answer
230
views
Pyspark - Pandas UDF using Cosine Similarity - Setting an array element with a sequence
Here is my schema:
root
|-- embedding_init: array (nullable = true)
| |-- element: double (containsNull = true)
|-- embeddings: array (nullable = false)
| |-- element: array (containsNull = ...
1
vote
1
answer
221
views
Azure Databricks: PythonException: 'RuntimeError: The length of output in Scalar iterator pandas UDF should be the same with the input's;
Env : Azure Databricks
Cluster : 11.3 LTS (includes Apache Spark 3.3.0, Scala 2.12)
I have a pandas_udf that works for 4 rows, but when I try it with more than 4 rows I get the error below.
PythonException: '...
0
votes
1
answer
108
views
pandas udf into column in array type
My assignment is to store the following into an array type column:
def sample_udf(df:SparkDataFrame):
    device_issues = []
    if (df['altitude'] == 0):
        return "alt"
    elif (df['...
1
vote
1
answer
626
views
Error in pandas_udf with the vector expected 1, got 2
I'm trying to get the country name with latitude and longitude as input, so I used the Nominatim API. When I pass it as a UDF it works, but when I try to use pandas_udf I get the following error:
An ...
0
votes
1
answer
947
views
Correct type hints for PandasUDFType.GROUPED_AGG that returns an array of doubles
I am using a Grouped Agg Pandas UDF to average the values of an array column element-wise (aka mean pooling). I keep getting the following warning and have not been able to find the correct type hints ...
1
vote
2
answers
518
views
Using pandas udf without looping in pyspark
Suppose I have a big Spark DataFrame, and I don't know how many columns it has.
(The solution has to be in PySpark using a pandas UDF, not a different approach.)
I want to perform an action on all columns. So it'...
0
votes
0
answers
81
views
Pyspark PandasUDF: One pd.Series element per Dataframe row
I work with a couple of PySpark UDFs which slow down my code, hence I want to transform some of them to pandas UDFs. One UDF takes a list of strings as argument (which comes from another column of the ...
0
votes
1
answer
834
views
Pandas UDF Structfield return
I am trying to return a StructField from a Pandas UDF in Pyspark used with aggregation with the following function signature:
def parcel_to_polygon(geom:pd.Series,entity_ids:pd.Series) -> Tuple[int,...
3
votes
0
answers
476
views
Spark Apply In Pandas - How it works and how to tune
I have millions of sentences I want to encode with a model from sentence transformers (which is a pytorch model). https://www.sbert.net/
I am planning to use pyspark and an apply in pandas function. ...
1
vote
1
answer
1k
views
Converting apply from pandas to a pandas_udf
How can I convert the following sample code to a pandas_udf:
def calculate_courses_final_df(this_row):
    # some code that applies to each row of the data
df_contracts_courses.apply(lambda x: ...
0
votes
0
answers
184
views
separating dates and getting all permutations of products in Pandas UDF
I am trying to get a permutation of all possible couples of dates using a pandas_udf. As I understand the dataframe has to be grouped to be sent to a pandas_udf so I am adding an ID and grouping by ...
1
vote
0
answers
326
views
How to use a @pandas_udf function inside a class with pyspark?
I am trying to use one of the Hugging Face models with ML flow. My input is a pyspark DataFrame.
The issue is that MLflow doesn't directly support Hugging Face models, so I need to use the pyfunc flavor to ...
1
vote
1
answer
566
views
Pandas UDF with dictionary lookup and conditionals
I want to use pandas_udf in PySpark for certain transformations and calculations on a column, and it seems that pandas UDFs can't be written exactly like normal UDFs.
An example function looks something ...
1
vote
0
answers
281
views
pyspark calculate custom metric on grouped data
I have a large dataframe (40 billion rows+) which can be grouped by key. I want to apply a custom calculation on a few fields of each group and derive a single value for that group, e.g., the dataframe below ...
3
votes
3
answers
5k
views
Geopandas convert crs
I have created a geopandas dataframe with 50 million records which contain latitude and longitude in CRS 3857, and I want to convert them to 4326. Since the dataset is huge, geopandas is unable to convert ...
2
votes
2
answers
236
views
Apply wordninja.split() using pandas_udf
I have a dataframe df with the column sld of type string which includes some consecutive characters with no space/delimiter. One of the libraries that can be used to split is wordninja:
E.g. wordninja....
1
vote
0
answers
457
views
Pyspark error - Invalid argument, not a string or column while implementing inside pandas_udf
This code works fine outside the pandas_udf, but I get this error when trying to implement the same logic inside the UDF. To avoid conflicts between pyspark and python function names, I have explicitly ...
1
vote
1
answer
358
views
Iterating through a DataFrame using Pandas UDF and outputting a dataframe
I have a piece of code that I want to translate into a Pandas UDF in PySpark but I'm having a bit of trouble understanding whether or not you can use conditional statements.
def is_pass_in(df):
    x =...
1
vote
1
answer
784
views
PySpark: Pandas UDF for scipy statistical transformations
I'm trying to create a column of standardized values (z-scores) for a column x on a Spark dataframe, but I am missing something because none of it is working.
Here's my example:
import pandas as pd
from pyspark....
0
votes
1
answer
1k
views
Databricks notebook runs faster when triggered manually compared to when run as a job
I don't know if this question has been covered earlier, but here it goes - I have a notebook that I can run manually using the 'Run' button in the notebook or as a job.
The runtime for running the ...
0
votes
1
answer
646
views
Dividing a set of columns by its average in Pyspark
I have to divide a set of columns in a pyspark.sql.dataframe by their respective column averages, but I am not able to find a correct way to do it. Below are some sample data and my present code.
Input ...
0
votes
1
answer
764
views
pyspark SparseVectors dataframe columns .dot product or any other vectors type column computation using @udf or @pandas_udf
I'm trying to compute the .dot product between 2 columns of a given dataframe.
SparseVectors already has this ability in Spark, so I'm trying to do this in an easy and scalable way without converting to ...
3
votes
2
answers
1k
views
Parallelize MLflow Project runs with Pandas UDF on Azure Databricks Spark
I'm trying to parallelize the training of multiple time-series using Spark on Azure Databricks.
Other than training, I would like to log metrics and models using MLflow.
The structure of the code is ...
1
vote
1
answer
2k
views
PySpark UDF to Pandas UDF for string columns
I have a UDF that is slow for large datasets, and I'm trying to improve execution time and scalability by leveraging pandas_udfs. All my searching and the official documentation focus more on scalar and ...