
Working with loops in pandas is inefficient for data of this size, and I'm new to PySpark, so I would like to ask for ideas and solutions to complete the two tasks illustrated below.

I have a dataset consisting of 700K rows like this:

Date format: MM-DD-YYYY

ID DATE REF_DATE A B C
1 01-31-2020 10-31-2019 3000 0.05 320
2 02-28-2020 10-31-2019 2750 0.04 300
3 03-31-2020 10-31-2019 2600 0.03 270

Tasks/steps:

1. Generate new rows for each ID if the condition is satisfied

Condition: if DATE > REF_DATE, add one more row with the DATE value decreased by one month, repeating until the newly added DATE equals REF_DATE, while keeping the other columns' values constant. This is the expected output, which in reality will have about 30 million rows (a PySpark sketch follows the table):

ID DATE REF_DATE A B C
1 01-31-2020 10-31-2019 3000 0.05 320
1 12-31-2019 10-31-2019 3000 0.05 320
1 11-30-2019 10-31-2019 3000 0.05 320
1 10-31-2019 10-31-2019 3000 0.05 320
2 02-28-2020 10-31-2019 2750 0.04 300
2 01-31-2020 10-31-2019 2750 0.04 300
2 12-31-2019 10-31-2019 2750 0.04 300
2 11-30-2019 10-31-2019 2750 0.04 300
2 10-31-2019 10-31-2019 2750 0.04 300
3 03-31-2020 10-31-2019 2600 0.03 270
3 02-28-2020 10-31-2019 2600 0.03 270
3 01-31-2020 10-31-2019 2600 0.03 270
3 12-31-2019 10-31-2019 2600 0.03 270
3 11-30-2019 10-31-2019 2600 0.03 270
3 10-31-2019 10-31-2019 2600 0.03 270
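
For task 1, here is a minimal PySpark sketch, assuming Spark 3.x (the session setup and variable names are illustrative): compute the number of whole months between DATE and REF_DATE, explode an integer sequence 0..n_months over it, and step DATE back with add_months, which keeps month-end dates aligned to month ends (2020-01-31 minus one month gives 2019-12-31, then 2019-11-30, then 2019-10-31).

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as sf  # aliased to avoid clashing with a column named F

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [(1, "01-31-2020", "10-31-2019", 3000, 0.05, 320),
         (2, "02-28-2020", "10-31-2019", 2750, 0.04, 300),
         (3, "03-31-2020", "10-31-2019", 2600, 0.03, 270)],
        ["ID", "DATE", "REF_DATE", "A", "B", "C"],
    )

    exploded = (
        df.withColumn("DATE", sf.to_date("DATE", "MM-dd-yyyy"))
          .withColumn("REF_DATE", sf.to_date("REF_DATE", "MM-dd-yyyy"))
          # Whole months between the two dates; integral here because both are month-ends.
          .withColumn("n_months", sf.months_between("DATE", "REF_DATE").cast("int"))
          # One row per offset 0, 1, ..., n_months.
          .withColumn("offset", sf.explode(sf.sequence(sf.lit(0), sf.col("n_months"))))
          # add_months preserves end-of-month alignment when stepping back.
          .withColumn("DATE", sf.expr("add_months(DATE, -offset)"))
          .drop("n_months", "offset")
    )

This avoids any driver-side loop; the explode runs as a single distributed transformation.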

2. Create & compute 3 new columns named D, E, F as follows

IF it's the first observation within the group (by ID):
    E = A * B
    F = C - E
    D = A - F
IF it's the second or a later row within that group:
    E = lag(D) * B
    F = C - E
    D = lag(D) - F

lag(D) here means the value of D from the previous row: to process row 2 we use D from row 1, to process row 3 we use D from row 2, and so on. A loop-free sketch of this step follows the expected output below.

The expected output will look like the following. I show only the first ID as an example; the other IDs are processed in the same manner.

ID DATE REF_DATE A B C D E F
1 01-31-2020 10-31-2019 3000 0.05 320 2830 150 170
1 12-31-2019 10-31-2019 3000 0.05 320 2651.5 141.5 178.5
1 11-30-2019 10-31-2019 3000 0.05 320 2464.08 132.6 187.4
1 10-31-2019 10-31-2019 3000 0.05 320 2267.3 123.2 196.8
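
Because A, B, and C stay constant within each ID group, this recursion can be solved in closed form instead of being evaluated row by row. Substituting E and F into the update gives D = lag(D) * (1 + B) - C, with D = A * (1 + B) - C on the first row, so for the n-th row of a group D_n = A * (1 + B)^n - C * ((1 + B)^n - 1) / B. A sketch under those assumptions (and assuming B != 0), building on the exploded frame from the task-1 sketch:

    from pyspark.sql import Window

    w = Window.partitionBy("ID").orderBy(sf.col("DATE").desc())

    def d_after(n):
        # Closed form of D after n steps: A*(1+B)^n - C*((1+B)^n - 1)/B (B != 0).
        growth = sf.pow(1 + sf.col("B"), n)
        return sf.col("A") * growth - sf.col("C") * (growth - 1) / sf.col("B")

    result = (
        exploded
        # n = 1 for the most recent DATE of each ID, 2 for the next, ...
        .withColumn("n", sf.row_number().over(w))
        .withColumn("D", d_after(sf.col("n")))
        # E_n = D_{n-1} * B, where D_0 = A on the first row.
        .withColumn("E", d_after(sf.col("n") - 1) * sf.col("B"))
        .withColumn("F", sf.col("C") - sf.col("E"))
        .drop("n")
    )

For ID 1 this reproduces D = 2830, 2651.5, 2464.08, 2267.28 down the four rows, matching the table above up to rounding.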

P.S.: I need to process and store this whole dataset for other calculations later. Is there any way to do this without loops? Thank you!

I tried using pandas and multithreading, but now I need to do it in PySpark.
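
If A, B, or C could ever vary within a group, the closed form above no longer applies and the recursion has to be run sequentially within each group. Coming from pandas, one fallback is groupBy(...).applyInPandas (Spark 3.0+), which applies an ordinary pandas function to each ID group while Spark parallelizes across groups; the function name and output schema below are illustrative:

    import pandas as pd

    def compute_def(pdf: pd.DataFrame) -> pd.DataFrame:
        # One ID group at a time, most recent DATE first.
        pdf = pdf.sort_values("DATE", ascending=False).reset_index(drop=True)
        d_prev = None
        d_vals, e_vals, f_vals = [], [], []
        for row in pdf.itertuples():
            base = row.A if d_prev is None else d_prev  # first row starts from A
            e = base * row.B
            f = row.C - e
            d_prev = base - f
            d_vals.append(d_prev)
            e_vals.append(e)
            f_vals.append(f)
        return pdf.assign(D=d_vals, E=e_vals, F=f_vals)

    result = exploded.groupBy("ID").applyInPandas(
        compute_def,
        schema="ID long, DATE date, REF_DATE date, A long, B double, C long, "
               "D double, E double, F double",
    )

The loop here runs only over the rows of a single group on an executor, not on the driver, so the 30 million rows are still processed in parallel.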

  • Can you explain the expected output? The given output seems incorrect according to your logic, since for the first row the value of E should be 3000 * 0.05 = 150. Commented Oct 24, 2023 at 17:05
  • Also, the values of D in your expected output seem incorrect. Commented Oct 24, 2023 at 17:26
  • You are right that the output was incorrect, as I copied it from Excel and one line was missing. Thanks for pointing that out. I have just edited it. Commented Oct 25, 2023 at 14:36
