I have a PySpark feature table DataFrame that is produced every day by a pipeline. The ask is to create time-based features for each existing feature, where each of the t-1 to t-30 columns (t = time) captures that feature's value from the corresponding previous day.
For example:
Input Table for June 1
| COL A | Count_n |
|---|---|
| 'A' | 10 |
| 'B' | 12 |
So consider June 1 as the start date.
My feature table with time-based features would look like:
| COL A | Count_n | count_n_t-1 | count_n_t-2 | count_n_t-n | count_n_t-30 |
|---|---|---|---|---|---|
| 'A' | 10 | 0 | 0 | 0 | 0 |
| 'B' | 12 | 0 | 0 | 0 | 0 |
Since this is the start date, I have initialised all lag features to 0.
Next day's input data - June 2:
| COL A | Count_n |
|---|---|
| 'A' | 17 |
| 'B' | 15 |
Now the feature table with time-based features on June 2 would look like:
| COL A | Count_n | count_n_t-1 | count_n_t-2 | count_n_t-n | count_n_t-30 |
|---|---|---|---|---|---|
| 'A' | 17 | 10 | 0 | 0 | 0 |
| 'B' | 15 | 12 | 0 | 0 | 0 |
Similarly, for June 3:
Input data
| COL A | Count_n |
|---|---|
| 'A' | 21 |
| 'B' | 35 |
Feature table:
| COL A | Count_n | count_n_t-1 | count_n_t-2 | count_n_t-n | count_n_t-30 |
|---|---|---|---|---|---|
| 'A' | 21 | 17 | 10 | 0 | 0 |
| 'B' | 35 | 15 | 12 | 0 | 0 |
As you can observe, we are shifting the features within each group (COL A) based on the previous day's values. There will be similar t-n features for other fields as well, but t-1 to t-30 is the fixed set of lags we are creating.
Can someone suggest an approach for doing this in the most efficient manner using PySpark?
NOTE: Please let me know if the explanation of the problem is unclear; I will try to clarify it.
thanks
I have not yet started implementing this, but my initial idea was to first join the current date's table with the previous day's table on COL A to get the t-1 to t-n features, and then to group by COL A and apply a pandas UDF: `df.groupby('A').apply(custom_udf_function)`.
Inside this UDF I am having a bit of difficulty writing a correct approach.