
I am using the pandas API on Spark for some data-preprocessing files that were originally written in pandas. I am seeing that the date operations are very slow, and some are not compatible at all. For example, I cannot do this:

df[time_col] + pd.Timedelta(1, unit='D')

Instead, I had to write the operation below:

from datetime import timedelta

df[time_col].apply(lambda x: x + timedelta(days=1))

Is there any other way I can use a date_add-style operation? And why would pandas on Spark be so slow under the hood?

I have tried the equivalent PySpark code, which uses an interval operation and works fast.
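The interval-based PySpark version I am comparing against looks roughly like this (the DataFrame and column names are placeholders, not my actual data):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder DataFrame with a single timestamp column named "time_col".
sdf = spark.createDataFrame([("2024-01-01 00:00:00",)], ["time_col"])
sdf = sdf.withColumn("time_col", F.to_timestamp("time_col"))

# Shift the timestamp by one day using Spark's interval arithmetic;
# this stays inside the JVM and avoids a per-row Python lambda.
sdf = sdf.withColumn("time_col", F.col("time_col") + F.expr("INTERVAL 1 DAY"))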

  • Why can't you do df[time_col] + pd.Timedelta(1, unit='D')? Do you get a wrong result? Do you get an error message? You have to show it in the question (not in comments). We can't run your code, see your computer, or read your mind, so you have to show all the details in the question. It would also be simpler if you created a minimal reproducible example with sample data in code, so we could simply copy and run it. Commented Jun 11, 2024 at 1:02
  • In the question you could also show the "PySpark code which has the interval operation", so we could run it and compare the timings. Commented Jun 11, 2024 at 1:03
  • If PySpark works faster, then maybe you should convert the pandas DataFrame to PySpark, do the calculations there, and convert the result back to a pandas DataFrame (a rough sketch of that round trip is shown after these comments). Commented Jun 11, 2024 at 1:06
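A minimal sketch of that round trip, assuming a small pandas DataFrame with a timestamp column named "ts" (all names and data here are made up for illustration):

import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Made-up pandas input for illustration.
pdf = pd.DataFrame({"ts": pd.date_range("2024-01-01", periods=3, freq="D")})

# pandas -> Spark, do the date arithmetic in Spark, then Spark -> pandas.
sdf = spark.createDataFrame(pdf)
sdf = sdf.withColumn("ts", F.col("ts") + F.expr("INTERVAL 1 DAY"))
pdf_shifted = sdf.toPandas()
print(pdf_shifted)

The conversions themselves have a cost (especially toPandas(), which collects everything to the driver), so this only pays off when the Spark-side computation is the dominant part.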
