0 votes · 3 answers · 114 views
NOTE: This question has many related questions on Stack Overflow, but I was unable to get my answer from any of them. I'm attempting to parallelize Prophet time series model training across multiple ...
— Arnab Sinha

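A common pattern for the Prophet question above is to group the long-format data by series and train one model per group with applyInPandas. A minimal sketch, assuming the prophet package and hypothetical columns series_id, ds, and y:

    import pandas as pd
    from prophet import Prophet

    # One forecast row per (series_id, ds).
    schema = "series_id string, ds timestamp, yhat double"

    def train_and_forecast(pdf: pd.DataFrame) -> pd.DataFrame:
        # Each call receives every row for one series_id.
        m = Prophet()
        m.fit(pdf[["ds", "y"]])
        future = m.make_future_dataframe(periods=30)
        fcst = m.predict(future)[["ds", "yhat"]]
        fcst["series_id"] = pdf["series_id"].iloc[0]
        return fcst[["series_id", "ds", "yhat"]]

    forecasts = df.groupBy("series_id").applyInPandas(train_and_forecast, schema)
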
0 votes · 0 answers · 315 views
I'm working with PySpark and trying to use a pandas_udf on my macOS system with an M3 chip. My environment is Python 3.10 running from a virtual environment. The code runs fine until I introduce the ...
— Md. Moniruzzaman

0 votes · 2 answers · 107 views
I'm running a pandas UDF as follows: def some_udf(df, keys=IDS_COLS, cols_to_keep=COLS_TO_KEEP): INT_ID_COLUMN = '__iid' df_keys = df.select(keys).distinct().withColumn(INT_ID_COLUMN, F....
— Lior T (137)

0 votes · 1 answer · 49 views
I have a feature table PySpark DataFrame that gets created every day through a pipeline. Now the ask is to create time-based features, where each t-1 through t-30 (t = time) feature captures the ...
— Neethu Paul

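Rolling features like these usually need only built-in window functions, no pandas UDF at all. A minimal sketch, assuming hypothetical columns entity_id, dt, and feature_value:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    w = Window.partitionBy("entity_id").orderBy("dt")

    # One lagged column per offset t-1 .. t-30.
    lagged = df.select(
        "*",
        *[F.lag("feature_value", i).over(w).alias(f"feature_value_t_minus_{i}")
          for i in range(1, 31)],
    )
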
0 votes · 2 answers · 45 views
TL;DR: I'm trying to write a UDF that transforms a PySpark dataframe, where the input is a dataframe and the output is the same dataframe, just with a few columns mapped to different values. In ...
— Lior T (137)

0 votes · 0 answers · 208 views
I have created a pandas UDF (df -> df) for a scenario where it takes care of running in parallel over the partitions found within the provided data - this is fine. However, it was requested to have ...
— KubaS (21)

1 vote · 2 answers · 3k views
On Databricks, I have a streaming pipeline where the bronze source and silver target are in Delta format. I have a pandas UDF that uses the requests_cache library to retrieve something from a URL (...
— gamezone25

0 votes · 1 answer · 491 views
I have a PySpark dataframe with a huge number of rows (80-100 million). I am running model inference on it to obtain the model score (probability) for each row, like the code below: import ...
— krishna kaushik

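At 80-100 million rows, the iterator variant of pandas_udf is the usual scoring pattern: the model loads once per Python worker instead of once per batch. A minimal sketch, assuming a joblib-pickled scikit-learn-style model at a hypothetical path and two hypothetical feature columns:

    from typing import Iterator, Tuple
    import joblib
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("double")
    def score(batches: Iterator[Tuple[pd.Series, pd.Series]]) -> Iterator[pd.Series]:
        # Loaded once per executor Python worker, reused across batches.
        model = joblib.load("/dbfs/models/my_model.pkl")  # hypothetical path
        for f1, f2 in batches:
            X = pd.concat([f1, f2], axis=1)
            yield pd.Series(model.predict_proba(X.to_numpy())[:, 1])

    scored = df.withColumn("probability", score("feature_1", "feature_2"))
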
0 votes · 1 answer · 194 views
I'm trying to write a filter_words function as a pandas_udf. Here are the functions I am using: @udf_annotator(returnType=ArrayType(StructType([StructField("position", IntegerType(), True), ...
— Rory (383)

0 votes · 1 answer · 824 views
I've created a Docker container that installs Ubuntu 22.04, Python 3.10, Spark 3.3.2, Hadoop 3, Scala 2.13, and OpenJDK 19. I'm currently using it as a test environment before deploying code in AWS. This ...
— Bamu (11)

3 votes · 1 answer · 290 views
I am trying to convert this UDF into a pandas UDF, in order to avoid creating two pandas UDFs. Convert this: @udf("string") def splitEmailUDF(email: str, position: int) -> str: ...
— Susy84 (144)

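One hedged way to convert the row-at-a-time splitEmailUDF above is a single pandas UDF over whole Series, with position passed in as a literal column. A sketch, assuming the function splits on "@" and returns the part at the given index:

    import pandas as pd
    from pyspark.sql import functions as F
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("string")
    def split_email(email: pd.Series, position: pd.Series) -> pd.Series:
        # Assumes position arrives as a constant literal column.
        pos = int(position.iloc[0]) if len(position) else 0
        return email.str.split("@").str[pos]

    df = df.withColumn("email_part", split_email(F.col("email"), F.lit(0)))
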
2 votes · 1 answer · 591 views
I want to calculate the cosine similarity of two vectors using a pandas UDF. I implemented it with a Spark UDF, which works fine with the following script: import numpy as np from pyspark.sql.functions ...
— Haritha Thilakarathne

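A hedged pandas-UDF version of the cosine similarity above, assuming the two vectors arrive as array<double> columns (Arrow does not transfer ml VectorUDT columns) with hypothetical names vec_a and vec_b:

    import numpy as np
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("double")
    def cosine_sim(a: pd.Series, b: pd.Series) -> pd.Series:
        # Stack the per-row arrays into 2-D matrices, then row-wise cosine.
        A = np.stack(a.to_numpy())
        B = np.stack(b.to_numpy())
        num = (A * B).sum(axis=1)
        den = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
        return pd.Series(num / den)

    df = df.withColumn("cos_sim", cosine_sim("vec_a", "vec_b"))
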
0 votes · 1 answer · 139 views
I am moving my code from pandas to PySpark for an NLP task. I have figured out how to apply tokenization (using the Keras built-in library) via a pandas UDF. However, I also want to return the fitted ...
— Abdul Wahab

2 votes · 0 answers · 558 views
I have a class with a native Python function (performing some imputations on a pandas df) that will be used on grouped data with applyInPandas (https://spark.apache.org/docs/3.1.2/api/python/reference/api/...
— Feary (37)

0 votes · 1 answer · 1k views
My PySpark code using pandas UDF functions works fine with df.limit(20).collect() and writing 20 records to CSV. But when I try to write 100 records to CSV, it fails with a java.io.EOFException error. ...
— Mohan Rayapuvari

1 vote · 1 answer · 569 views
I have a dataframe with a text column CALL_TRANSCRIPT (string format) and a pii_allmethods column (array of strings). I'm trying to search the transcripts for the strings in the array and mask them using a PySpark pandas UDF. ...
— Mohan Rayapuvari

1 vote · 0 answers · 109 views
I've been trying to parallelize hyperparameter tuning for my Prophet model for around 100 combinations of hyperparameters saved in the dataframe params_df. I want to parallelize the hyperparameter ...
— Pranav Gupta

0 votes · 1 answer · 106 views
I have a dataframe like this:
data_df = spark.createDataFrame([([1,2,3],'val1'),([4,5,6],'val2')],['col1','col2'])
col1       col2
[1,2,3]    val1
[4,5,6]    val2
I want to get the minimum value from the ...
— lserlohn (6,256)

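The array minimum itself doesn't need a UDF: Spark 2.4+ ships F.array_min. A pandas-UDF equivalent is sketched below for comparison:

    import pandas as pd
    from pyspark.sql import functions as F
    from pyspark.sql.functions import pandas_udf

    # Preferred: built-in, no Python worker involved.
    data_df = data_df.withColumn("col1_min", F.array_min("col1"))

    # pandas-UDF version of the same thing.
    @pandas_udf("int")
    def arr_min(arrs: pd.Series) -> pd.Series:
        # Each element of `arrs` is the whole array from one row.
        return arrs.map(lambda a: int(min(a)))
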
0 votes · 0 answers · 142 views
I'm quite new to PySpark and not a skilled Python engineer, and I'm trying to understand how to apply a pandas UDF to my case. I have developed an ARIMAX model which, for each "id", performs 4 outlook ...
— KubaS (21)

1 vote · 1 answer · 230 views
Here is my schema:
root
 |-- embedding_init: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- embeddings: array (nullable = false)
 |    |-- element: array (containsNull = ...
— madst (93)

1 vote · 1 answer · 221 views
Env: Azure Databricks. Cluster: 11.3 LTS (includes Apache Spark 3.3.0, Scala 2.12). I have a pandas_udf that works for 4 rows, but when I try it with more than 4 rows I get the error below. PythonException: '...
— Ancil Pa

0 votes · 1 answer · 108 views
My assignment is to store the following into an array-type column: def sample_udf(df: SparkDataFrame): device_issues = [] if (df['altitude'] == 0): return "alt" elif (df['...
— Jilinnie Park

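Note that the snippet above returns a single string at the first match rather than accumulating every issue. A hedged rewrite as a pandas UDF that collects all matching flags per row into an array column; the speed column and its flag are hypothetical additions for illustration:

    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("array<string>")
    def device_issues(altitude: pd.Series, speed: pd.Series) -> pd.Series:
        def issues_for(alt, spd):
            issues = []
            if alt == 0:
                issues.append("alt")    # flag from the original snippet
            if spd == 0:
                issues.append("speed")  # hypothetical second check
            return issues
        return pd.Series([issues_for(a, s) for a, s in zip(altitude, speed)])

    df = df.withColumn("device_issues", device_issues("altitude", "speed"))
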
1 vote · 1 answer · 626 views
I'm trying to get the country name with latitude and longitude as input, so I used the Nominatim API; when I apply it as a UDF it works, but when I try to use pandas_udf I get the following error: An ...
— BryC (97)

0 votes · 1 answer · 947 views
I am using a Grouped Agg Pandas UDF to average the values of an array column element-wise (aka mean pooling). I keep getting the following warning and have not been able to find the correct type hints ...
— David (2,767)

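For element-wise mean pooling as described above, a grouped-aggregate pandas UDF reduces each group's Series of arrays to a single array. A sketch, assuming Spark 3.x, an array<double> column, and hypothetical column names doc_id and embedding:

    import numpy as np
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("array<double>")
    def mean_pool(embedding: pd.Series) -> list:
        # The Series -> scalar type hint marks this as a grouped-agg UDF;
        # the "scalar" here is one pooled array per group.
        return np.mean(np.stack(embedding.to_numpy()), axis=0).tolist()

    pooled = df.groupBy("doc_id").agg(mean_pool("embedding").alias("mean_embedding"))
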
1 vote · 2 answers · 518 views
Suppose I have a big Spark dataframe and I don't know how many columns it has. (The solution has to be in PySpark using a pandas UDF, not a different approach.) I want to perform an action on all columns. So it'...
— Barushkish

0 votes · 0 answers · 81 views
I work with a couple of PySpark UDFs which slow down my code, hence I want to convert some of them to pandas UDFs. One UDF takes a list of strings as an argument (which comes from another column of the ...
— Moritz (603)

0 votes · 1 answer · 834 views
I am trying to return a StructField from a pandas UDF in PySpark used with aggregation, with the following function signature: def parcel_to_polygon(geom: pd.Series, entity_ids: pd.Series) -> Tuple[int,...
— Tarique (711)

3 votes · 0 answers · 476 views
I have millions of sentences I want to encode with a model from Sentence Transformers (which is a PyTorch model, https://www.sbert.net/). I am planning to use PySpark and an applyInPandas function. ...
— B_Miner (1,832)

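For bulk encoding like this, mapInPandas lets each Python worker load the model once and encode whole Arrow batches. A minimal sketch, assuming the sentence-transformers package, a hypothetical text column, and the public all-MiniLM-L6-v2 checkpoint:

    from typing import Iterator
    import pandas as pd
    from sentence_transformers import SentenceTransformer

    def encode_batches(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
        # Load once per worker, reuse for every batch.
        model = SentenceTransformer("all-MiniLM-L6-v2")
        for pdf in batches:
            pdf["embedding"] = model.encode(pdf["text"].tolist()).tolist()
            yield pdf

    encoded = df.mapInPandas(encode_batches, schema="text string, embedding array<float>")
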
1 vote · 1 answer · 1k views
How can I convert the following sample code to a pandas_udf? def calculate_courses_final_df(this_row): some code that applies to each row of the data df_contracts_courses.apply(lambda x: ...
— Matt (185)

0 votes · 0 answers · 184 views
I am trying to get a permutation of all possible pairs of dates using a pandas_udf. As I understand it, the dataframe has to be grouped to be sent to a pandas_udf, so I am adding an ID and grouping by ...
— Matt (185)

1 vote · 0 answers · 326 views
I am trying to use one of the Hugging Face models with MLflow. My input is a PySpark DataFrame. The issue is that MLflow doesn't directly support Hugging Face models, so I need to use the pyfunc flavor to ...
— Anna_v (11)

1 vote · 1 answer · 566 views
I want to use pandas_udf in PySpark for certain transformations and calculations on a column, and it seems that pandas UDFs can't be written exactly like normal UDFs. An example function looks something ...
— Tarique (711)

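The difference alluded to above, side by side: a normal UDF receives one value per call, while a pandas UDF receives a whole pd.Series per Arrow batch and must return a Series of the same length. A sketch with a hypothetical add-one transformation:

    import pandas as pd
    from pyspark.sql.functions import pandas_udf, udf

    @udf("long")
    def plus_one_udf(x: int) -> int:
        # Called once per row.
        return x + 1

    @pandas_udf("long")
    def plus_one_pandas(x: pd.Series) -> pd.Series:
        # Called once per batch; the body must stay vectorized.
        return x + 1
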
1 vote · 0 answers · 281 views
I have a large dataframe (40 billion+ rows) which can be grouped by key. I want to apply a custom calculation to a few fields of each group and derive a single value for that group. E.g., for the dataframe below ...
— user14297339

3 votes · 3 answers · 5k views
I have created a GeoPandas dataframe with 50 million records which contain latitude/longitude in CRS 3857, and I want to convert them to 4326. Since the dataset is huge, GeoPandas is unable to convert ...
— code_bug (415)

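For the reprojection itself, pyproj's Transformer works on whole numpy arrays, which pairs naturally with a pandas UDF and avoids GeoPandas entirely for plain points. A sketch, assuming hypothetical x and y columns holding EPSG:3857 coordinates:

    import pandas as pd
    from pyproj import Transformer
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("double")
    def to_lon(x: pd.Series, y: pd.Series) -> pd.Series:
        # always_xy=True keeps (x, y) / (lon, lat) ordering on both sides.
        t = Transformer.from_crs(3857, 4326, always_xy=True)
        lon, _ = t.transform(x.to_numpy(), y.to_numpy())
        return pd.Series(lon)

    @pandas_udf("double")
    def to_lat(x: pd.Series, y: pd.Series) -> pd.Series:
        t = Transformer.from_crs(3857, 4326, always_xy=True)
        _, lat = t.transform(x.to_numpy(), y.to_numpy())
        return pd.Series(lat)

    df = df.withColumn("lon", to_lon("x", "y")).withColumn("lat", to_lat("x", "y"))

A single UDF returning a struct of lon and lat would avoid transforming twice; two scalar UDFs are shown only to keep the sketch short.
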
2 votes · 2 answers · 236 views
I have a dataframe df with a column sld of type string which includes some consecutive characters with no space/delimiter. One of the libraries that can be used to split them is wordninja. E.g. wordninja....
— Elm662 (673)

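wordninja.split has no vectorized form, so inside a pandas UDF it still runs per element; the gain over a row UDF is mostly reduced serialization overhead. A hedged sketch using the sld column from the question:

    import pandas as pd
    import wordninja
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("array<string>")
    def split_words(sld: pd.Series) -> pd.Series:
        # e.g. wordninja.split("derekanderson") -> ["derek", "anderson"]
        return sld.map(wordninja.split)

    df = df.withColumn("sld_words", split_words("sld"))
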
1 vote · 0 answers · 457 views
This code works fine outside the pandas_udf, but I get this error when trying to implement the same code inside the UDF. To avoid conflicts between PySpark and Python function names, I have explicitly ...
— user22 (153)

1 vote · 1 answer · 358 views
I have a piece of code that I want to translate into a pandas UDF in PySpark, but I'm having a bit of trouble understanding whether or not you can use conditional statements. def is_pass_in(df): x =...
— AndronikMk

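On conditionals: Python's if needs a single truth value, so inside a pandas UDF the test becomes element-wise, e.g. with numpy.where. A hedged sketch with a hypothetical threshold rule:

    import numpy as np
    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    @pandas_udf("string")
    def is_pass(x: pd.Series) -> pd.Series:
        # Element-wise conditional; a plain `if x > 50:` would raise
        # "The truth value of a Series is ambiguous".
        return pd.Series(np.where(x > 50, "pass", "fail"))
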
1 vote · 1 answer · 784 views
I'm trying to create a column of standardized (z-score) values of a column x in a Spark dataframe, but I'm missing something because none of it is working. Here's my example: import pandas as pd from pyspark....
— user3771195

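For a global z-score, built-in aggregates are simpler than any UDF, since a grouped-agg pandas UDF returns one value per group, not per row. A minimal sketch with built-ins, assuming the column is named x:

    from pyspark.sql import functions as F

    stats = df.agg(F.mean("x").alias("mu"), F.stddev("x").alias("sigma"))
    standardized = (
        df.crossJoin(stats)
          .withColumn("x_z", (F.col("x") - F.col("mu")) / F.col("sigma"))
          .drop("mu", "sigma")
    )
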
0 votes · 1 answer · 1k views
I don't know if this question has been covered earlier, but here goes: I have a notebook that I can run manually using the 'Run' button in the notebook or as a job. The runtime for running the ...
— Vidisha Kanodia

0 votes · 1 answer · 646 views
I have to divide a set of columns in a pyspark.sql.dataframe by their respective column averages, but I am not able to find a correct way to do it. Below are sample data and my present code. Input ...
— Deb (541)

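Dividing each column by its own average needs only one aggregation pass plus a select; no UDF required. A sketch, assuming a hypothetical list of numeric column names:

    from pyspark.sql import functions as F

    num_cols = ["a", "b", "c"]  # hypothetical numeric columns

    # One pass to collect all the averages as a dict.
    avgs = df.agg(*[F.avg(c).alias(c) for c in num_cols]).first().asDict()

    scaled = df.select(*[(F.col(c) / F.lit(avgs[c])).alias(c) for c in num_cols])
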
0 votes · 1 answer · 764 views
I'm trying to compute the dot product between two columns of a given dataframe. SparseVector already has this ability in Spark, so I'm trying to do it in an easy and scalable way without converting to ...
— n1tk (2,550)

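One caveat on the dot product above: Arrow cannot transfer VectorUDT columns, so a pandas UDF never sees pyspark.ml SparseVectors directly; a row-at-a-time UDF calling SparseVector.dot is the usual fallback. A hedged sketch with hypothetical column names:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    # SparseVector (pyspark.ml.linalg) exposes .dot(); VectorUDT is not
    # Arrow-compatible, so this stays a plain UDF rather than a pandas UDF.
    dot_udf = udf(lambda v1, v2: float(v1.dot(v2)), DoubleType())

    df = df.withColumn("dot", dot_udf("vec_a", "vec_b"))
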
3 votes · 2 answers · 1k views
I'm trying to parallelize the training of multiple time series using Spark on Azure Databricks. Other than training, I would like to log metrics and models using MLflow. The structure of the code is ...
— Matteo Zantedeschi

1 vote · 1 answer · 2k views
I have a UDF that is slow for large datasets, and I'm trying to improve execution time and scalability by leveraging pandas_udfs, but all my searching and the official documentation focus more on scalar and ...
— n1tk (2,550)
