I am using the pandas-on-Spark API for some data preprocessing scripts that were originally written in pandas. I am seeing that date operations are very slow, and some are not compatible at all. For example, I cannot do this:
df[time_col] + pd.Timedelta(1, unit='D')
Instead, I had to write the operation below:
df[time_col].apply(lambda x: x + timedelta(days=1))
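For reference, a minimal reproducible version of what I am doing looks roughly like this (a sketch with made-up data and column names; pyspark.pandas is assumed to be available):

import pandas as pd
from datetime import timedelta
import pyspark.pandas as ps

# Small example frame; "event_time" stands in for my real time column
pdf = pd.DataFrame({"event_time": pd.date_range("2023-01-01", periods=3, freq="D")})
df = ps.from_pandas(pdf)
time_col = "event_time"

# The plain-pandas style that does not work for me on pandas-on-Spark:
# df[time_col] + pd.Timedelta(1, unit="D")

# The workaround that does work, but applies a Python function per element:
df[time_col] = df[time_col].apply(lambda x: x + timedelta(days=1))
print(df.head())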
Is there another way to get date_add-style operations here? And why is pandas on Spark so slow under the hood?
I have tried the plain PySpark code, which uses an interval expression and runs fast.
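Concretely, the PySpark version I tried looks roughly like this (again a sketch; sdf and the column name are placeholders), and it stays a native Spark column expression instead of a Python UDF:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# sdf stands in for my real Spark DataFrame with a timestamp column
sdf = spark.createDataFrame([("2023-01-01 00:00:00",)], ["event_time"])
sdf = sdf.withColumn("event_time", F.to_timestamp("event_time"))

# Native column expressions, no Python UDF involved:
sdf = sdf.withColumn("event_time_plus_1d", F.col("event_time") + F.expr("INTERVAL 1 DAY"))
# or, if only the date part matters:
# sdf = sdf.withColumn("event_date_plus_1d", F.date_add(F.col("event_time"), 1))
sdf.show(truncate=False)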