138 questions
0 votes · 1 answer · 252 views
Not able to run Spark code due to an issue on my local machine
I am facing the error below while running the given piece of Spark code in my local PyCharm Community Edition, and the Spark session is not getting created.
I have set up all my local environment ...
0 votes · 2 answers · 72 views
PySpark dataframe to Excel email attachment with sheet name
I'm unable to send a PySpark DataFrame as an Excel attachment.
I can do it easily with a CSV file using the code below:
email.add_attachment(df.toPandas().to_csv(index=False).encode('utf-8')
, maintype='...
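One approach, sketched under the assumption that df is the PySpark DataFrame and email is an email.message.EmailMessage (as the CSV snippet suggests), is to write the converted pandas frame to an in-memory Excel file and attach the raw bytes:

import io

# Write the DataFrame to an in-memory .xlsx file (requires openpyxl on the driver).
buffer = io.BytesIO()
df.toPandas().to_excel(buffer, sheet_name="report", index=False, engine="openpyxl")

# Attach the bytes with the xlsx MIME type; the filename is illustrative.
email.add_attachment(
    buffer.getvalue(),
    maintype="application",
    subtype="vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    filename="report.xlsx",
)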
0 votes · 2 answers · 122 views
For each row in a dataframe, how to extract elements from an array?
I'm working with a third-party dataset that includes location data. I'm trying to extract the longitude and latitude coordinates from the location column. As stated in their doc:
The location column ...
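A minimal sketch, assuming the location column is an array with longitude at index 0 and latitude at index 1 (the actual layout may differ from this guess):

from pyspark.sql import functions as F

# Pull individual elements of the array column into their own columns.
df = df.withColumn("longitude", F.col("location").getItem(0)) \
       .withColumn("latitude", F.col("location").getItem(1))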
0 votes · 1 answer · 53 views
Compare two PySpark DataFrames and append the results side by side
I have two PySpark DataFrames and need to compare them column-wise, appending the result next to the original columns.
DF1:
Claim_number  Claim_Status
1001          Closed
1002          In Progress
1003          open
Df2:
...
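A minimal sketch, assuming both frames share the Claim_number key and a Claim_Status column to compare (DF2's layout is cut off, so it is assumed to mirror DF1):

from pyspark.sql import functions as F

# Join on the key, keep both status columns side by side, and flag mismatches.
result = (
    df1.alias("a")
    .join(df2.alias("b"), on="Claim_number", how="outer")
    .select(
        "Claim_number",
        F.col("a.Claim_Status").alias("Claim_Status_df1"),
        F.col("b.Claim_Status").alias("Claim_Status_df2"),
        (F.col("a.Claim_Status") == F.col("b.Claim_Status")).alias("status_match"),
    )
)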
0 votes · 1 answer · 64 views
PySpark can't find existing file in Blob storage
I want to open Excel files in Azure Databricks that reside in ADLS Gen2 with this code:
#%pip install openpyxl pandas
import pandas as pd
display(dbutils.fs.ls("/mnt/myMnt"))
path = "/mnt/...
1 vote · 0 answers · 65 views
Optimize or Eliminate UDF
I have multiple UDFs in a codebase I inherited. Is there any way to remove them and implement the logic without UDFs? I'm running on 1.3B rows, so every bit helps.
I considered using apply on a function, but ...
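The UDF bodies are not shown, so purely as an illustration of the general rewrite: a Python UDF that only does string cleanup can usually be replaced with built-in column functions, which avoids per-row Python serialization (the "raw" column name is hypothetical):

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Python UDF version: every row is shipped to a Python worker.
clean_udf = F.udf(lambda s: s.strip().upper() if s else None, StringType())
df_slow = df.withColumn("clean", clean_udf("raw"))

# Native version: stays in the JVM and is optimized by Catalyst.
df_fast = df.withColumn("clean", F.upper(F.trim(F.col("raw"))))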
1 vote · 0 answers · 69 views
Resample on pandas API on Spark
I'm trying to do resampling using the pandas API on Spark (Databricks Runtime 15.4, Spark 3.5.0):
import numpy as np
from datetime import datetime
import pandas as pd
import pyspark.pandas as ps
dates ...
1 vote · 0 answers · 43 views
Why is pyspark.pandas.frame.DataFrame showing index_col warnings?
Here is its source code.
PySpark has a wrapper layer that allows using the pandas API on Spark. Pandas computes on a single processing unit and relies on indices. Spark, however, does not use the concept ...
0 votes · 0 answers · 34 views
Pandas on Spark API Date Operations
I am using the pandas API on Spark for some data preprocessing files which were initially in pandas. I am seeing that the date operations are very slow and some are not compatible at all. For example, I cannot ...
0 votes · 1 answer · 147 views
Databricks pyspark pandas error with numpy
I am getting the following error when using pyspark pandas:
PandasNotImplementedError: The method pd.Series.__iter__() is not implemented. If you want to collect your data as an NumPy array, use '...
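This error typically appears when something tries to iterate a pandas-on-Spark Series (for example list(series) or a NumPy call that loops over it). A minimal sketch of the usual workaround, collecting explicitly instead of iterating:

import pyspark.pandas as ps

s = ps.Series([1, 2, 3])
# list(s) would raise PandasNotImplementedError; collect explicitly instead:
values = s.to_numpy()  # brings the data to the driver as a NumPy array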
0 votes · 0 answers · 178 views
Pandas on Spark Resample: "Rule Code Not Supported" & "TypeError: Type datetime64[us] was not understood"
I'm trying to do something straightforward in Pandas -- take some time series data and resample it to a minute. But I'm running into a variety of issues from Spark (I'm new to PySpark so be kind ;) )
...
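For context, a minimal sketch of minute-level resampling with the pandas API on Spark (Spark 3.4+), assuming a frame with a datetime column ts and a numeric column value; the column names are illustrative and the set of supported rule aliases is narrower than in plain pandas:

import pandas as pd
import pyspark.pandas as ps

pdf = pd.DataFrame(
    {"ts": pd.date_range("2024-01-01", periods=10, freq="20s"), "value": range(10)}
)
psdf = ps.from_pandas(pdf).set_index("ts")

# Resample to 1-minute buckets and take the mean of each bucket.
per_minute = psdf.resample("1min").mean()
print(per_minute.to_pandas())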
1 vote · 0 answers · 36 views
PySpark Deciling UDF Not Giving Output & Taking a Lot of Time to Run
So basically I have created a PySpark function named segmentation that performs a cumulative sum calculation, handles outliers, finds the maximum cumulative sum, calculates a decile, and updates a ...
0 votes · 1 answer · 324 views
PySpark regex to get value between a string and hyphen
I am trying to extract the number between the string "line_number:" and a hyphen, and I am struggling to come up with a regex/substring for this in PySpark.
Below is my input data in a column called "...
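A minimal sketch using regexp_extract, assuming the text lives in a column named raw (the real column name is cut off in the excerpt):

from pyspark.sql import functions as F

# Capture the digits that appear between "line_number:" and the next hyphen.
df = df.withColumn(
    "line_number",
    F.regexp_extract(F.col("raw"), r"line_number:\s*(\d+)\s*-", 1),
)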
0 votes · 2 answers · 84 views
pyspark.pandas: Converting float64 column to TimedeltaIndex
I want to convert a numeric column which resembles a timedelta in seconds to a ps.TimedeltaIndex (for the purpose of later resampling the dataset).
import pyspark.pandas as ps
df = ps.DataFrame({&...
0 votes · 1 answer · 997 views
Python: Clear pyspark dataframe
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
from pyspark.storagelevel import StorageLevel
spark = SparkSession.builder.appName('TEST').config('spark.ui.port','...
1 vote · 1 answer · 565 views
How to group by percentile distributions for every variable in a dataset and output the mean/median in pyspark
I asked a fairly similar yet different question and got a good response here:
Groupby and percentage distributions pyspark equivalent of given pandas code
I am not sure how to tailor the modification ...
0 votes · 1 answer · 848 views
Why does reading an Excel file not work with Crealytics version spark-excel_2.12-3.5.0_0.20.1?
I was able to read the Excel file data using the Crealytics library spark-excel_2.12-3.4.1_0.19.0, but I could not execute the same code with the latest version spark-excel_2.12-3.5.0_0.20.1.
I ...
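For reference, a typical spark-excel read looks like the sketch below; the path and options are illustrative and the version-specific failure itself is not reproduced here:

# Hypothetical path; header and dataAddress follow the spark-excel documentation.
df = (
    spark.read.format("com.crealytics.spark.excel")
    .option("header", "true")
    .option("dataAddress", "'Sheet1'!A1")
    .load("/mnt/data/report.xlsx")
)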
0 votes · 1 answer · 123 views
In Azure Databricks Gen2, I am trying to modify the value of a column in a pandas dataframe. My code works fine in Gen1, but in Gen2 it throws an error
data_2['col1'] = np.where((df1.year.astype(int) == 2021) & (df1.col1_y.notna()), df1.col1_y, data_2.col1)
This is my original working code in Gen1, but it is giving the following error in Gen2.
...
0 votes · 0 answers · 126 views
Using PySpark Pandas to read in a filename with a space in it
Trying to read this file:
abfss://[email protected]/perils_database/Bushfire AUSTRALIA - PERILS Industry Exposure Database 2023.xlsx
Using this Python code:
import pyspark....
0 votes · 1 answer · 2k views
Read in sheet names only from Excel using pyspark.pandas
I have about 30 Excel files that I want to read into Spark dataframes, probably using pyspark.pandas.
I am trying to read them like this:
import pyspark.pandas as ps
my_files = ps.DataFrame(dbutils.fs....
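pyspark.pandas does not expose a sheet-name listing directly; one common workaround, sketched under the assumption that the workbooks are reachable through a /dbfs/ local path and openpyxl is installed on the driver, is to let plain pandas inspect each workbook and only then read the sheets you actually need:

import pandas as pd

# Hypothetical mounted path; repeat for each of the ~30 workbooks.
path = "/dbfs/mnt/myMnt/workbook1.xlsx"
sheet_names = pd.ExcelFile(path, engine="openpyxl").sheet_names
print(sheet_names)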
1 vote · 1 answer · 84 views
How to groupby and then aggregate on multiple columns
I am using pandas on Spark. I need to group by A and B and then aggregate to return a list of maps where keys are C and values are D.
Sample input:
A B C D
0 7 ...
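A minimal sketch of the aggregation using plain PySpark column functions (the question uses the pandas API on Spark, but the same frame is reachable via to_spark(); the column names A-D are taken from the excerpt):

from pyspark.sql import functions as F

# One map {C: D} per row, collected into a list per (A, B) group.
result = (
    df.groupBy("A", "B")
    .agg(F.collect_list(F.create_map(F.col("C"), F.col("D"))).alias("c_to_d"))
)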
2 votes · 1 answer · 53 views
PySpark: Groupby within groups and display sum in separate fields based on certain values
I have a dataframe which contains Company name, EmpId, Bonus and Salary.
COMPANY  EMPID  BONUS  SALARY
APPLE    1234   No     5
APPLE    1235   No     7
GOOGLE   6786   Yes    6
GOOGLE   6787   No     5
GOOGLE   6788   No     6
TARGET   9091   ...
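A minimal sketch of per-company conditional sums with groupBy plus when, assuming the goal is separate salary totals for bonus and non-bonus employees (the expected output is cut off from the excerpt):

from pyspark.sql import functions as F

result = df.groupBy("COMPANY").agg(
    F.sum(F.when(F.col("BONUS") == "Yes", F.col("SALARY")).otherwise(0)).alias("salary_with_bonus"),
    F.sum(F.when(F.col("BONUS") == "No", F.col("SALARY")).otherwise(0)).alias("salary_without_bonus"),
)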
-1 votes · 1 answer · 122 views
PySpark: Find if a value is present in another dataframe
I have dataframes like below. I tried the join and isin functions but am not getting the expected output. Not sure what I am missing. Appreciate it if someone can help. Thanks.
DF1:
Name  Grade
Tom   A
...
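A minimal sketch that flags whether each Name in DF1 appears in a second dataframe, assuming the lookup frame also has a Name column (its layout is cut off in the excerpt):

from pyspark.sql import functions as F

flags = df2.select("Name").distinct().withColumn("present_in_df2", F.lit(True))
result = (
    df1.join(flags, on="Name", how="left")
    .fillna(False, subset=["present_in_df2"])
)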
1 vote · 0 answers · 88 views
The Pandas-on-Spark 'apply' returns incorrect results
To solve equations of the nature below with pandas-on-Spark, I created a class named "DiffEqns".
x[k+1] = v[k] * y[k] + x[k]
y[k+1] = 0.01 * z[k] + y[k]
where k ranges from 0 to N-1, x[...
0 votes · 1 answer · 75 views
Alternatives to tolist() for pyspark pandas (pandas API)
We have code that handles customer orders from a retail website; the volume of data is pretty large and is getting bigger as days go by. The tricky part is that this data is in a different ...
0 votes · 1 answer · 52 views
How to partition and get only the latest records in a Spark dataframe
I have a Spark dataframe df:
vehicle_coalesce  vehicleNumber  productionNumber  pin   checkDate
V123              V123           P123              null  27/08/2023 01:03
P123              null           ...
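A minimal sketch of the usual window-function pattern, assuming the grouping key is vehicle_coalesce and that checkDate parses with the format shown in the sample:

from pyspark.sql import functions as F, Window

w = Window.partitionBy("vehicle_coalesce").orderBy(
    F.to_timestamp("checkDate", "dd/MM/yyyy HH:mm").desc()
)
latest = (
    df.withColumn("rn", F.row_number().over(w))
    .filter(F.col("rn") == 1)   # keep only the newest record per vehicle
    .drop("rn")
)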
0 votes · 1 answer · 59 views
How to pick only the latest records based on checkDate using PySpark
I have a Spark dataframe:
vehicle_coalesce  vehicleNumber  productionNumber  pin   checkDate
V123              V123           P123              null  27/08/2023 01:03
P123              null           ...
0 votes · 2 answers · 364 views
PySpark: calculate new rows based on previous rows from current and multiple other columns
I have an Excel sheet with a formula that I need to convert into PySpark code,
considering columns A, B, C, D, E, F, G, H and I, where columns F, G, H and I have fixed random numeric values.
Column A has ...
0 votes · 1 answer · 39 views
PySpark: making a new column lookup_l that contains a list whose elements are values from other columns of the same row
All values are strings. The columns are: first_name_l, first_name_r, last_name_l, last_name_r, dob_l, dob_r, city_l, city_r, average_score, matched_columns, lookup_l_list, lookup_r_list. A sample row begins: robert, robert, null, allen, 1971-06-24, 1971-...
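A minimal sketch that builds such a list column with array(), using the *_l columns named in the excerpt:

from pyspark.sql import functions as F

df = df.withColumn(
    "lookup_l",
    F.array("first_name_l", "last_name_l", "dob_l", "city_l"),
)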
0 votes · 1 answer · 316 views
Solving a system of multi-variable equations using PySpark on Databricks [closed]
Any suggestions, help, or references are most welcome for the problem statement below. I am performing big-data analysis on data that is currently stored on Azure. The actual implementation is ...
0 votes · 1 answer · 414 views
How to parallelize work in PySpark over chunks of a dataset when each chunk needs to be a pandas df
I have a question about the best way to implement the following.
I have an LGBM model on my driver. I need to run this model against a very large dataset that is distributed over the executors.
In ...
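A common pattern for this is mapInPandas with the model broadcast to the executors. A minimal sketch, assuming the fitted model can be pickled and that feature_cols names the input columns (both are assumptions, not from the question):

import pandas as pd

# Broadcast the fitted LightGBM model from the driver to the executors.
bc_model = spark.sparkContext.broadcast(model)

def predict_chunk(batches):
    # Each `pdf` is a pandas DataFrame holding one Arrow batch of rows.
    for pdf in batches:
        pdf["prediction"] = bc_model.value.predict(pdf[feature_cols])
        yield pdf

result = df.mapInPandas(predict_chunk, schema=df.schema.add("prediction", "double"))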
0 votes · 2 answers · 46 views
PySpark on Jupyter Notebook: a dataframe of two rows can't be converted to a pandas dataframe. Why?
This is the PySpark dataframe,
and the schema of the dataframe. Just two rows.
Then I want to convert it to a pandas dataframe,
but it is suspended at stage 3. No result, and no information about the ...
0 votes · 2 answers · 613 views
Manipulating multiple sum() values in a PySpark pivot table
I'm having a little difficulty further manipulating a PySpark pivot table to give me a reduced result. My data is a little more complex than the example below, but it's the best example I can come ...
1 vote · 1 answer · 67 views
Get the median of a column based on weights from another column [duplicate]
I have a data frame like this:
col1  col2
100   3
200   2
300   4
400   1
Now I want the median of col1 such that the col2 values act as weights for each col1 value, like ...
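One simple way to get an integer-weighted median, sketched for positive integer weights only: repeat each row col2 times with explode and then take the (approximate) median of the expanded column:

from pyspark.sql import functions as F

weighted = df.withColumn("dup", F.explode(F.sequence(F.lit(1), F.col("col2"))))
median = weighted.select(F.percentile_approx("col1", 0.5).alias("weighted_median"))
median.show()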
1 vote · 1 answer · 2k views
Conversion from Spark to Pandas using pandas_api and toPandas
df = spark.table("data").limit(100)
df = df.toPandas()
This conversion using .toPandas works just fine as df.limit is just a few rows. If I get rid of limit and do toPandas on the whole df, ...
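For reference, a short sketch of the two conversions: pandas_api() keeps the data distributed as a pandas-on-Spark frame, while toPandas() collects everything to the driver and can exhaust driver memory on the full table:

df = spark.table("data")

psdf = df.pandas_api()            # distributed pandas-on-Spark DataFrame (Spark 3.2+)
pdf = df.limit(100).toPandas()    # in-memory pandas DataFrame on the driver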
1 vote · 1 answer · 579 views
Spark ML models not able to deploy on Databricks inference
I'm trying to deploy Spark models (sparkxgbregressor, rfregressor) in Databricks. Is model inferencing available only for scikit-learn models? If yes, is there any other way to deploy Spark models ...
0 votes · 1 answer · 547 views
Cast String field to datetime64[ns] in a parquet file using pandas-on-spark
My input is a parquet file which I need to recast as below:
df=spark.read.parquet("input.parquet")
psdf=df.to_pandas_on_spark()
psdf['reCasted'] = psdf['col1'].astype('float64')
psdf['reCasted'...
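A minimal sketch of the datetime cast itself, assuming a string column named date_str with a known format (the real column name and format are cut off from the excerpt):

import pyspark.pandas as ps

# Parse the string column to datetime64[ns]; the format string is illustrative.
psdf["reCasted_dt"] = ps.to_datetime(psdf["date_str"], format="%Y-%m-%d %H:%M:%S")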
0 votes · 1 answer · 237 views
Writing a PySpark dataframe as parquet with partitionBy becomes very slow
I have a PySpark dataframe that goes through multiple groupby/pivot-style transformations. After applying all of the mentioned transformations and obtaining the final dataframe, writing the df back as parquet by ...
0 votes · 2 answers · 350 views
Create new columns with running count based on categorical column value counts in PySpark
Suppose a given dataframe:
Model  Color
Car    Red
Car    Red
Car    Blue
Truck  Red
Truck  Blue
Truck  Yellow
SUV    Blue
SUV    Blue
Car    Blue
Car    Yellow
I want to add color columns that keep a count of each color ...
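A minimal sketch of per-color running counts. Spark dataframes have no inherent row order, so an ordering column is materialized first; the row_id column and the hard-coded color list are assumptions, not from the question:

from pyspark.sql import functions as F, Window

# Materialize a row order (illustrative only; a real timestamp/id column is better).
df = df.withColumn("row_id", F.monotonically_increasing_id())
w = Window.orderBy("row_id").rowsBetween(Window.unboundedPreceding, Window.currentRow)

for color in ["Red", "Blue", "Yellow"]:
    df = df.withColumn(
        f"{color.lower()}_count",
        F.sum(F.when(F.col("Color") == color, 1).otherwise(0)).over(w),
    )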
0 votes · 1 answer · 894 views
How can I merge multiple part files into a single file in Databricks?
I am trying to merge multiple part files into a single file. It iterates over all the files in the staging folder, and the schema is the same. We are converting the part files to .Tab files. Files are generated based on ...
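The usual pattern, sketched with assumed paths, is to coalesce to a single partition before writing; Spark still emits one part-xxxx file inside the output folder, which can then be renamed or moved with dbutils:

(
    spark.read.option("header", "true").csv("/mnt/staging/")   # hypothetical input path
    .coalesce(1)                                               # force a single output file
    .write.mode("overwrite")
    .option("header", "true")
    .option("sep", "\t")                                       # tab-separated output
    .csv("/mnt/output/merged")
)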
0 votes · 1 answer · 194 views
PySpark error due to data type in pandas_udf
I'm trying to write a filter_words function as a pandas_udf.
Here are the functions I am using:
@udf_annotator(returnType=ArrayType(StructType([StructField("position", IntegerType(), True),
...
0 votes · 1 answer · 162 views
TypeError in pySpark UDF functions
I've got this function:
def ead(lista):
    ind_mmff, isdebala, isfubala, k1, k2, ead = lista
    try:
        isdebala = float(isdebala)
        isfubala = float(isfubala)
        k1 = float(k1)
...
-1 votes · 1 answer · 785 views
Is there any efficient way to store streaming data from different stock exchanges in Python besides Parquet files while using the CCXT library?
What is the best way to store streaming data from different stock exchanges in order to minimise data size?
Right now I'm using the CCXT library in Python, and in order to get the current order book ...
0 votes · 1 answer · 370 views
Retrieve the non-null value from a PySpark dataframe row and store it in a new column
I have a PySpark dataframe whose column names are unique IDs generated by the UUID library, so I cannot query using column names. Each row in this PySpark dataframe has 1 "non null value&...
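A minimal sketch using coalesce across all columns, which works without knowing the UUID column names in advance (it assumes at most one non-null value per row, as the excerpt states):

from pyspark.sql import functions as F

# coalesce returns the first non-null value scanning the columns left to right.
df = df.withColumn("non_null_value", F.coalesce(*[F.col(c) for c in df.columns]))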
2 votes · 1 answer · 337 views
How do I run a function that applies regex iteratively in the pandas-on-Spark API?
I am using pandas-on-Spark in combination with regex to remove some abbreviations from a column in a dataframe. In pandas this all works fine, but I have been tasked with migrating this code to a production ...
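A minimal sketch of applying several regex substitutions to a pandas-on-Spark string column; the abbreviation map and column name are made up for illustration:

import pyspark.pandas as ps

abbreviations = {r"\bdept\b": "department", r"\bmgr\b": "manager"}  # illustrative

psdf = ps.DataFrame({"text": ["dept of sales", "mgr on duty"]})
for pattern, full in abbreviations.items():
    psdf["text"] = psdf["text"].str.replace(pattern, full, regex=True)
print(psdf.to_pandas())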
0 votes · 1 answer · 89 views
Pandas on Spark apply() seems to be reshaping columns
Can anybody explain the following behavior?
import pyspark.pandas as ps
loan_information = ps.read_sql_query([blah])
loan_information.shape
#748834, 84
loan_information.apply(lambda col: col.shape)
...
1 vote · 1 answer · 52 views
I want to add to a date in a loop 13 times using PySpark
Please help me solve this issue, as I am still new to Python/PySpark.
I want to loop 13 times over the same column, adding dates in multiples of 7 days.
I have a master table like this:
id
...
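A minimal sketch that generates the 13 weekly offsets without an explicit Python loop, assuming a start-date column named start_date (the real table layout is cut off in the excerpt):

from pyspark.sql import functions as F

# One row per 7-day step: 7, 14, ..., 91 days after start_date.
result = (
    df.withColumn("step", F.explode(F.sequence(F.lit(1), F.lit(13))))
    .withColumn("next_date", F.expr("date_add(start_date, step * 7)"))
)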
4 votes · 1 answer · 1k views
Pandas-on-Spark throwing java.lang.StackOverflowError
I am using pandas-on-Spark in combination with regex to remove some abbreviations from a column in a dataframe. In pandas this all works fine, but I have been tasked with migrating this code to a production ...
0 votes · 1 answer · 319 views
Pandas-on-Spark API throws a NotImplementedError even though the functionality should be implemented
I am facing a weird issue with pandas-on-Spark. I am trying to use regex to replace abbreviations with their full counterparts. The function I am using is the following (simplified a bit):
def ...
0 votes · 1 answer · 1k views
How to remove quotes from a column in a PySpark dataframe?
I have a CSV file in which I am getting double quotes in a column. While reading and writing, I have to remove those quotes. Please guide me on how I can do it.
Example:
df:
col1
"xyznm""...