138 questions
0 votes · 1 answer · 252 views
Not able to run Spark code due to an issue on my local machine
I am facing the error below while running the given piece of Spark code in my local PyCharm Community Edition, and the Spark session is not getting created.
I have set up all my local environment ...
0 votes · 2 answers · 72 views
PySpark dataframe to Excel email attachment with sheet name
I'm unable to send a PySpark DataFrame as an Excel attachment.
I can do it easily with a CSV file using the code below:
email.add_attachment(df.toPandas().to_csv(index=False).encode('utf-8')
, maintype='...
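One approach, sketched under the assumption that df is the PySpark DataFrame and email is an email.message.EmailMessage (as the CSV snippet suggests), is to write the converted pandas frame to an in-memory Excel file and attach the raw bytes:

import io

# Write the DataFrame to an in-memory .xlsx file (requires openpyxl on the driver).
buffer = io.BytesIO()
df.toPandas().to_excel(buffer, sheet_name="report", index=False, engine="openpyxl")

# Attach the bytes with the xlsx MIME type; the filename is illustrative.
email.add_attachment(
    buffer.getvalue(),
    maintype="application",
    subtype="vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    filename="report.xlsx",
)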
0 votes · 2 answers · 122 views
For each row in a dataframe, how to extract elements from an array?
I'm working with a third-party dataset that includes location data. I'm trying to extract the longitude and latitude coordinates from the location column. As stated in their doc:
The location column ...
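A minimal sketch, assuming the location column is an array with longitude at index 0 and latitude at index 1 (the actual layout may differ from this guess):

from pyspark.sql import functions as F

# Pull individual elements of the array column into their own columns.
df = df.withColumn("longitude", F.col("location").getItem(0)) \
       .withColumn("latitude", F.col("location").getItem(1))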
0 votes · 1 answer · 53 views
Compare two PySpark DataFrames and append the results side by side
I have two PySpark DataFrames and need to compare them column-wise, appending the result next to the original columns.
DF1:
Claim_number  Claim_Status
1001          Closed
1002          In Progress
1003          open
Df2:
...
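A minimal sketch, assuming both frames share the Claim_number key and a Claim_Status column to compare (DF2's layout is cut off, so it is assumed to mirror DF1):

from pyspark.sql import functions as F

# Join on the key, keep both status columns side by side, and flag mismatches.
result = (
    df1.alias("a")
    .join(df2.alias("b"), on="Claim_number", how="outer")
    .select(
        "Claim_number",
        F.col("a.Claim_Status").alias("Claim_Status_df1"),
        F.col("b.Claim_Status").alias("Claim_Status_df2"),
        (F.col("a.Claim_Status") == F.col("b.Claim_Status")).alias("status_match"),
    )
)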
0 votes · 1 answer · 64 views
PySpark can't find existing file in Blob storage
I want to open Excel files in Azure Databricks that reside in ADLS Gen2 with this code:
#%pip install openpyxl pandas
import pandas as pd
display(dbutils.fs.ls("/mnt/myMnt"))
path = "/mnt/...
1 vote · 0 answers · 65 views
Optimize or Eliminate UDF
I have multiple UDFs in a codebase I inherited. Is there any way to remove them and implement the logic without UDFs? I'm running on 1.3B rows, so every bit helps.
I considered using apply on a function, but ...
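The UDF bodies are not shown, so purely as an illustration of the general rewrite: a Python UDF that only does string cleanup can usually be replaced with built-in column functions, which avoids per-row Python serialization (the "raw" column name is hypothetical):

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Python UDF version: every row is shipped to a Python worker.
clean_udf = F.udf(lambda s: s.strip().upper() if s else None, StringType())
df_slow = df.withColumn("clean", clean_udf("raw"))

# Native version: stays in the JVM and is optimized by Catalyst.
df_fast = df.withColumn("clean", F.upper(F.trim(F.col("raw"))))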
1 vote · 0 answers · 69 views
Resample on pandas API on Spark
I'm trying to do resampling using the pandas API on Spark (Databricks Runtime 15.4, Spark 3.5.0):
import numpy as np
from datetime import datetime
import pandas as pd
import pyspark.pandas as ps
dates ...
1 vote · 0 answers · 43 views
Why is pyspark.pandas.frame.DataFrame showing index_col warnings?
Here is its source code.
PySpark has a wrapper layer that allows using the pandas API on Spark. Pandas computes on a single processing unit and relies on indices. Spark, however, does not use the concept ...
0 votes · 0 answers · 34 views
Pandas on Spark API Date Operations
I am using the pandas API on Spark for some data preprocessing files which were initially in pandas. I am seeing that the date operations are very slow and some are not compatible at all. For example, I cannot ...
0 votes · 1 answer · 147 views
Databricks pyspark pandas error with numpy
I am getting the following error when using pyspark pandas:
PandasNotImplementedError: The method pd.Series.__iter__() is not implemented. If you want to collect your data as an NumPy array, use '...
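This error typically appears when something tries to iterate a pandas-on-Spark Series (for example list(series) or a NumPy call that loops over it). A minimal sketch of the usual workaround, collecting explicitly instead of iterating:

import pyspark.pandas as ps

s = ps.Series([1, 2, 3])
# list(s) would raise PandasNotImplementedError; collect explicitly instead:
values = s.to_numpy()  # brings the data to the driver as a NumPy array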
0 votes · 0 answers · 178 views
Pandas on Spark Resample: "Rule Code Not Supported" & "TypeError: Type datetime64[us] was not understood"
I'm trying to do something straightforward in Pandas -- take some time series data and resample it to a minute. But I'm running into a variety of issues from Spark (I'm new to PySpark so be kind ;) )
...
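For context, a minimal sketch of minute-level resampling with the pandas API on Spark (Spark 3.4+), assuming a frame with a datetime column ts and a numeric column value; the column names are illustrative and the set of supported rule aliases is narrower than in plain pandas:

import pandas as pd
import pyspark.pandas as ps

pdf = pd.DataFrame(
    {"ts": pd.date_range("2024-01-01", periods=10, freq="20s"), "value": range(10)}
)
psdf = ps.from_pandas(pdf).set_index("ts")

# Resample to 1-minute buckets and take the mean of each bucket.
per_minute = psdf.resample("1min").mean()
print(per_minute.to_pandas())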
1 vote · 0 answers · 36 views
PySpark Deciling UDF Not Giving Output & Taking a Lot of Time to Run
So basically I have created a PySpark function named segmentation that performs a cumulative sum calculation, handles outliers, finds the maximum cumulative sum, calculates a decile, and updates a ...
0 votes · 1 answer · 324 views
PySpark regex to get value between a string and hyphen
I am trying to extract the number between the string "line_number:" and a hyphen, and I am struggling to come up with a regex/substring for this in PySpark.
Below is my input data in a column called "...
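A minimal sketch using regexp_extract, assuming the text lives in a column named raw (the real column name is cut off in the excerpt):

from pyspark.sql import functions as F

# Capture the digits that appear between "line_number:" and the next hyphen.
df = df.withColumn(
    "line_number",
    F.regexp_extract(F.col("raw"), r"line_number:\s*(\d+)\s*-", 1),
)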
0 votes · 2 answers · 84 views
pyspark.pandas: Converting float64 column to TimedeltaIndex
I want to convert a numeric column which resembles a timedelta in seconds to a ps.TimedeltaIndex (for the purpose of later resampling the dataset).
import pyspark.pandas as ps
df = ps.DataFrame({&...
0 votes · 1 answer · 997 views
Python: Clear pyspark dataframe
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
from pyspark.storagelevel import StorageLevel
spark = SparkSession.builder.appName('TEST').config('spark.ui.port','...
1 vote · 1 answer · 565 views
How to group by percentile distributions for every variable in a dataset and output the mean/median in pyspark
I asked a fairly similar yet different question and got a good response here:
Groupby and percentage distributions pyspark equivalent of given pandas code
I am not sure how to tailor the modification ...
0 votes · 1 answer · 848 views
Why does reading an Excel file not work with Crealytics version spark-excel_2.12-3.5.0_0.20.1?
I was able to read the Excel file data using the Crealytics library spark-excel_2.12-3.4.1_0.19.0, but I could not execute the same code with the latest version spark-excel_2.12-3.5.0_0.20.1.
I ...
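For reference, a typical spark-excel read looks like the sketch below; the path and options are illustrative and the version-specific failure itself is not reproduced here:

# Hypothetical path; header and dataAddress follow the spark-excel documentation.
df = (
    spark.read.format("com.crealytics.spark.excel")
    .option("header", "true")
    .option("dataAddress", "'Sheet1'!A1")
    .load("/mnt/data/report.xlsx")
)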
0 votes · 1 answer · 123 views
In Azure Databricks Gen2, I am trying to modify the value of a column in a pandas dataframe. My code works fine in Gen1, but in Gen2 it throws an error
data_2['col1'] = np.where((df1.year.astype(int) == 2021) & (df1.col1_y.notna()), df1.col1_y, data_2.col1)
This is my original working code in Gen1, but it is giving the following error in Gen2.
...
0 votes · 0 answers · 126 views
Using PySpark Pandas to read in a filename with a space in it
Trying to read this file:
abfss://[email protected]/perils_database/Bushfire AUSTRALIA - PERILS Industry Exposure Database 2023.xlsx
Using this Python code:
import pyspark....
0 votes · 1 answer · 2k views
Read in sheet names only from Excel using pyspark.pandas
I have about 30 Excel files that I want to read into Spark dataframes, probably using pyspark.pandas.
I am trying to read them like this:
import pyspark.pandas as ps
my_files = ps.DataFrame(dbutils.fs....
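pyspark.pandas does not expose a sheet-name listing directly; one common workaround, sketched under the assumption that the workbooks are reachable through a /dbfs/ local path and openpyxl is installed on the driver, is to let plain pandas inspect each workbook and only then read the sheets you actually need:

import pandas as pd

# Hypothetical mounted path; repeat for each of the ~30 workbooks.
path = "/dbfs/mnt/myMnt/workbook1.xlsx"
sheet_names = pd.ExcelFile(path, engine="openpyxl").sheet_names
print(sheet_names)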
1 vote · 1 answer · 84 views
How to groupby and then aggregate on multiple columns
I am using pandas on Spark. I need to group by A and B and then aggregate to return a list of maps where keys are C and values are D.
Sample input:
A B C D
0 7 ...
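A minimal sketch of the aggregation using plain PySpark column functions (the question uses the pandas API on Spark, but the same frame is reachable via to_spark(); the column names A-D are taken from the excerpt):

from pyspark.sql import functions as F

# One map {C: D} per row, collected into a list per (A, B) group.
result = (
    df.groupBy("A", "B")
    .agg(F.collect_list(F.create_map(F.col("C"), F.col("D"))).alias("c_to_d"))
)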
2 votes · 1 answer · 53 views
PySpark: Groupby within groups and display sum in separate fields based on certain values
I have a dataframe which contains Company name, EmpId, Bonus and Salary.
COMPANY  EMPID  BONUS  SALARY
APPLE    1234   No     5
APPLE    1235   No     7
GOOGLE   6786   Yes    6
GOOGLE   6787   No     5
GOOGLE   6788   No     6
TARGET   9091   ...
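A minimal sketch of per-company conditional sums with groupBy plus when, assuming the goal is separate salary totals for bonus and non-bonus employees (the expected output is cut off from the excerpt):

from pyspark.sql import functions as F

result = df.groupBy("COMPANY").agg(
    F.sum(F.when(F.col("BONUS") == "Yes", F.col("SALARY")).otherwise(0)).alias("salary_with_bonus"),
    F.sum(F.when(F.col("BONUS") == "No", F.col("SALARY")).otherwise(0)).alias("salary_without_bonus"),
)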
-1 votes · 1 answer · 122 views
PySpark: Find if a value is present in another dataframe
I have dataframes like below. I tried the join and isin functions but am not getting the expected output. Not sure what I am missing. Appreciate it if someone can help. Thanks.
DF1:
Name  Grade
Tom   A
...
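A minimal sketch that flags whether each Name in DF1 appears in a second dataframe, assuming the lookup frame also has a Name column (its layout is cut off in the excerpt):

from pyspark.sql import functions as F

flags = df2.select("Name").distinct().withColumn("present_in_df2", F.lit(True))
result = (
    df1.join(flags, on="Name", how="left")
    .fillna(False, subset=["present_in_df2"])
)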
1 vote · 0 answers · 88 views
The Pandas-on-Spark 'apply' returns incorrect results
To solve equations of the nature below with pandas-on-Spark, I created a class named "DiffEqns".
x[k+1] = v[k] * y[k] + x[k]
y[k+1] = 0.01 * z[k] + y[k]
where k ranges from 0 to N-1, x[...
0 votes · 1 answer · 75 views
Alternatives to tolist() for pyspark pandas (pandas API)
We have code that handles customer orders from a retail website; the volume of data is pretty large and is getting bigger as days go by. The tricky part is that this data is in a different ...
0 votes · 1 answer · 52 views
How to partition and get only the latest records in a Spark dataframe
I have a Spark dataframe df:
vehicle_coalesce  vehicleNumber  productionNumber  pin   checkDate
V123              V123           P123              null  27/08/2023 01:03
P123              null           ...
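A minimal sketch of the usual window-function pattern, assuming the grouping key is vehicle_coalesce and that checkDate parses with the format shown in the sample:

from pyspark.sql import functions as F, Window

w = Window.partitionBy("vehicle_coalesce").orderBy(
    F.to_timestamp("checkDate", "dd/MM/yyyy HH:mm").desc()
)
latest = (
    df.withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)   # keep only the newest record per vehicle
    .drop("rn")
)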
0 votes · 1 answer · 59 views
How to pick only the latest records based on checkDate using PySpark
I have a Spark dataframe:
vehicle_coalesce  vehicleNumber  productionNumber  pin   checkDate
V123              V123           P123              null  27/08/2023 01:03
P123              null           ...
0 votes · 2 answers · 364 views
PySpark: calculate new rows based on previous rows from current and multiple other columns
I have an Excel sheet with a formula that I need to convert into PySpark code,
considering columns A, B, C, D, E, F, G, H and I, where columns F, G, H and I have fixed random numeric values.
Column A has ...
0 votes · 1 answer · 39 views
PySpark: making a new column lookup_l that contains a list whose elements are values from other columns of the same row
All values are strings. The columns are: first_name_l, first_name_r, last_name_l, last_name_r, dob_l, dob_r, city_l, city_r, average_score, matched_columns, lookup_l_list, lookup_r_list. A sample row begins: robert, robert, null, allen, 1971-06-24, 1971-...
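A minimal sketch that builds such a list column with array(), using the *_l columns named in the excerpt:

from pyspark.sql import functions as F

df = df.withColumn(
    "lookup_l",
    F.array("first_name_l", "last_name_l", "dob_l", "city_l"),
)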
0 votes · 1 answer · 316 views
Solving a system of multi-variable equations using PySpark on Databricks [closed]
Any suggestions, help, or references are most welcome for the problem statement below. I am performing big-data analysis on data that is currently stored on Azure. The actual implementation is ...
0 votes · 1 answer · 414 views
How to parallelize work in PySpark over chunks of a dataset when each chunk needs to be a pandas df
I have a question about the best way to implement the following.
I have an LGBM model on my driver. I need to run this model against a very large dataset that is distributed over the executors.
In ...
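A common pattern for this is mapInPandas with the model broadcast to the executors. A minimal sketch, assuming the fitted model can be pickled and that feature_cols names the input columns (both are assumptions, not from the question):

import pandas as pd

# Broadcast the fitted LightGBM model from the driver to the executors.
bc_model = spark.sparkContext.broadcast(model)

def predict_chunk(batches):
    # Each `pdf` is a pandas DataFrame holding one Arrow batch of rows.
    for pdf in batches:
        pdf["prediction"] = bc_model.value.predict(pdf[feature_cols])
        yield pdf

result = df.mapInPandas(predict_chunk, schema=df.schema.add("prediction", "double"))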
0 votes · 2 answers · 46 views
PySpark on Jupyter Notebook: a dataframe of two rows can't be converted to a pandas dataframe. Why?
This is the PySpark dataframe,
and the schema of the dataframe. Just two rows.
Then I want to convert it to a pandas dataframe,
but it is suspended at stage 3. No result, and no information about the ...
0 votes · 2 answers · 613 views
Manipulating multiple sum() values in a PySpark pivot table
I'm having a little difficulty further manipulating a PySpark pivot table to give me a reduced result. My data is a little more complex than the example below, but it's the best example I can come ...
1 vote · 1 answer · 67 views
Get the median of a column based on weights from another column [duplicate]
I have a data frame like this:
col1  col2
100   3
200   2
300   4
400   1
Now I want the median of col1 such that the col2 values act as weights for each col1 value, like ...
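One simple way to get an integer-weighted median, sketched for positive integer weights only: repeat each row col2 times with explode and then take the (approximate) median of the expanded column:

from pyspark.sql import functions as F

weighted = df.withColumn("dup", F.explode(F.sequence(F.lit(1), F.col("col2"))))
median = weighted.select(F.percentile_approx("col1", 0.5).alias("weighted_median"))
median.show()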
1 vote · 1 answer · 2k views
Conversion from Spark to Pandas using pandas_api and toPandas
df = spark.table("data").limit(100)
df = df.toPandas()
This conversion using .toPandas works just fine as df.limit is just a few rows. If I get rid of limit and do toPandas on the whole df, ...
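For reference, a short sketch of the two conversions: pandas_api() keeps the data distributed as a pandas-on-Spark frame, while toPandas() collects everything to the driver and can exhaust driver memory on the full table:

df = spark.table("data")

psdf = df.pandas_api()            # distributed pandas-on-Spark DataFrame (Spark 3.2+)
pdf = df.limit(100).toPandas()    # in-memory pandas DataFrame on the driver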
1 vote · 1 answer · 579 views
Spark ML models not able to deploy on Databricks inference
I'm trying to deploy Spark models (sparkxgbregressor, rfregressor) in Databricks. Is model inferencing available only for scikit-learn models? If yes, is there any other way to deploy Spark models ...
0 votes · 1 answer · 547 views
Cast String field to datetime64[ns] in a parquet file using pandas-on-spark
My input is a parquet file which I need to recast as below:
df=spark.read.parquet("input.parquet")
psdf=df.to_pandas_on_spark()
psdf['reCasted'] = psdf['col1'].astype('float64')
psdf['reCasted'...
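A minimal sketch of the datetime cast itself, assuming a string column named date_str with a known format (the real column name and format are cut off from the excerpt):

import pyspark.pandas as ps

# Parse the string column to datetime64[ns]; the format string is illustrative.
psdf["reCasted_dt"] = ps.to_datetime(psdf["date_str"], format="%Y-%m-%d %H:%M:%S")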
0 votes · 1 answer · 237 views
Writing a PySpark dataframe as parquet with partitionBy becomes very slow
I have a PySpark dataframe that goes through multiple groupby/pivot-style transformations. After applying all of the mentioned transformations and obtaining the final dataframe, writing the df back as parquet by ...
0 votes · 2 answers · 350 views
Create new columns with running count based on categorical column value counts in PySpark
Suppose a given dataframe:
Model  Color
Car    Red
Car    Red
Car    Blue
Truck  Red
Truck  Blue
Truck  Yellow
SUV    Blue
SUV    Blue
Car    Blue
Car    Yellow
I want to add color columns that keep a count of each color ...
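A minimal sketch of per-color running counts. Spark dataframes have no inherent row order, so an ordering column is materialized first; the row_id column and the hard-coded color list are assumptions, not from the question:

from pyspark.sql import functions as F, Window

# Materialize a row order (illustrative only; a real timestamp/id column is better).
df = df.withColumn("row_id", F.monotonically_increasing_id())
w = Window.orderBy("row_id").rowsBetween(Window.unboundedPreceding, Window.currentRow)

for color in ["Red", "Blue", "Yellow"]:
    df = df.withColumn(
        f"{color.lower()}_count",
        F.sum(F.when(F.col("Color") == color, 1).otherwise(0)).over(w),
    )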
0 votes · 1 answer · 894 views
How can I merge multiple part files into a single file in Databricks?
I am trying to merge multiple part files into a single file. It iterates over all the files in the staging folder, and the schema is the same. We are converting the part files to .Tab files. Files are generated based on ...
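The usual pattern, sketched with assumed paths, is to coalesce to a single partition before writing; Spark still emits one part-xxxx file inside the output folder, which can then be renamed or moved with dbutils:

(
    spark.read.option("header", "true").csv("/mnt/staging/")   # hypothetical input path
    .coalesce(1)                                               # force a single output file
    .write.mode("overwrite")
    .option("header", "true")
    .option("sep", "\t")                                       # tab-separated output
    .csv("/mnt/output/merged")
)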
0 votes · 1 answer · 194 views
PySpark error due to data type in pandas_udf
I'm trying to write a filter_words function as a pandas_udf.
Here are the functions I am using:
@udf_annotator(returnType=ArrayType(StructType([StructField("position", IntegerType(), True),
...
0 votes · 1 answer · 162 views
TypeError in pySpark UDF functions
I've got this function:
def ead(lista):
    ind_mmff, isdebala, isfubala, k1, k2, ead = lista
    try:
        isdebala = float(isdebala)
        isfubala = float(isfubala)
        k1 = float(k1)
...
-1 votes · 1 answer · 785 views
Is there any efficient way to store streaming data from different stock exchanges in Python besides Parquet files while using the CCXT library?
What is the best way to store streaming data from different stock exchanges in order to minimise data size?
Right now I'm using the CCXT library in Python, and in order to get the current order book ...
0 votes · 1 answer · 370 views
Retrieve the non-null value from a PySpark dataframe row and store it in a new column
I have a PySpark dataframe whose column names are unique IDs generated by the UUID library, so I cannot query using column names. Each row in this PySpark dataframe has 1 "non null value&...
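A minimal sketch using coalesce across all columns, which works without knowing the UUID column names in advance (it assumes at most one non-null value per row, as the excerpt states):

from pyspark.sql import functions as F

# coalesce returns the first non-null value scanning the columns left to right.
df = df.withColumn("non_null_value", F.coalesce(*[F.col(c) for c in df.columns]))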
2 votes · 1 answer · 337 views
How do I run a function that applies regex iteratively in the pandas-on-Spark API?
I am using pandas-on-Spark in combination with regex to remove some abbreviations from a column in a dataframe. In pandas this all works fine, but I have been tasked with migrating this code to a production ...
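A minimal sketch of applying several regex substitutions to a pandas-on-Spark string column; the abbreviation map and column name are made up for illustration:

import pyspark.pandas as ps

abbreviations = {r"\bdept\b": "department", r"\bmgr\b": "manager"}  # illustrative

psdf = ps.DataFrame({"text": ["dept of sales", "mgr on duty"]})
for pattern, full in abbreviations.items():
    psdf["text"] = psdf["text"].str.replace(pattern, full, regex=True)
print(psdf.to_pandas())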
0 votes · 1 answer · 89 views
Pandas on Spark apply() seems to be reshaping columns
Can anybody explain the following behavior?
import pyspark.pandas as ps
loan_information = ps.read_sql_query([blah])
loan_information.shape
#748834, 84
loan_information.apply(lambda col: col.shape)
...
1 vote · 1 answer · 52 views
I want to add to a date in a loop 13 times using PySpark
Please help me solve this issue, as I am still new to Python/PySpark.
I want to loop 13 times over the same column, adding dates in multiples of 7 days.
I have a master table like this:
id
...
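A minimal sketch that generates the 13 weekly offsets without an explicit Python loop, assuming a start-date column named start_date (the real table layout is cut off in the excerpt):

from pyspark.sql import functions as F

# One row per 7-day step: 7, 14, ..., 91 days after start_date.
result = (
    df.withColumn("step", F.explode(F.sequence(F.lit(1), F.lit(13))))
    .withColumn("next_date", F.expr("date_add(start_date, step * 7)"))
)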
4 votes · 1 answer · 1k views
Pandas-on-Spark throwing java.lang.StackOverflowError
I am using pandas-on-Spark in combination with regex to remove some abbreviations from a column in a dataframe. In pandas this all works fine, but I have been tasked with migrating this code to a production ...
0 votes · 1 answer · 319 views
Pandas-on-Spark API throws a NotImplementedError even though the functionality should be implemented
I am facing a weird issue with pandas-on-Spark. I am trying to use regex to replace abbreviations with their full counterparts. The function I am using is the following (simplified a bit):
def ...
0 votes · 1 answer · 1k views
How to remove quotes from a column in a PySpark dataframe?
I have a CSV file in which I am getting double quotes in a column. While reading and writing, I have to remove those quotes. Please guide me on how I can do it.
Example:
df:
col1
"xyznm""...