
I have a feature table (a PySpark DataFrame) that gets created every day through a pipeline. The ask is to create time-based features, where each of the t-1 to t-30 (t = time) features captures the value from that many days earlier.

For example:

Input Table for June 1

COL A   Count_n
'A'     10
'B'     12

So consider the start date to be June 1.

My feature table with the time-based features would then look like:

COL A   Count_n   count_n_t-1   count_n_t-2   count_n_t-n   count_n_t-30
'A'     10        0             0             0             0
'B'     12        0             0             0             0

Since this is the start date, I have initialised all the lag features to 0.

Next day Input Data - June 2

COL A   Count_n
'A'     17
'B'     15

On June 2, the feature table with the time-based features would look like:

COL A   Count_n   count_n_t-1   count_n_t-2   count_n_t-n   count_n_t-30
'A'     17        10            0             0             0
'B'     15        12            0             0             0

Similarly for June 3

Input Data

COL A   Count_n
'A'     21
'B'     35

Feature table:

COL A   Count_n   count_n_t-1   count_n_t-2   count_n_t-n   count_n_t-30
'A'     21        17            10            0             0
'B'     35        15            12            0             0

As you can observe, we are shifting the features within each group (COL A) based on the previous days' values. There will be similar t-n features for other fields as well, but t-1 through t-30 is the constant set of features we are creating.

Can someone suggest an approach for doing this in the most efficient manner using PySpark?

NOTE: Please let me know if the explanation of the problem is unclear; I will try to clarify it.

thanks

I have not yet started working on this, but my initial idea was to first join the current day's table with the previous day's table on COL A to get the t-1 through t-n features, and then to group by COL A and apply a pandas UDF: df.groupby('A').apply(custom_udf_function).

Inside this UDF I am having a bit of difficulty writing a correct approach.
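For the per-group function itself: with groupby(...).applyInPandas (the current form of the grouped-map pandas UDF), each group arrives as a plain pandas DataFrame, so the shifting reduces to pandas shift. A sketch, assuming each group carries its full daily history with a snapshot_date column (all names here are illustrative, not from the question):

```python
import pandas as pd

def add_lag_features(pdf: pd.DataFrame, n_lags: int = 30) -> pd.DataFrame:
    # pdf holds every daily row for one COL A value; sort so that
    # shift(i) really means "the value from i days earlier".
    pdf = pdf.sort_values("snapshot_date").reset_index(drop=True)
    for i in range(1, n_lags + 1):
        # Rows with no history that far back become 0, as in the example.
        pdf[f"count_n_t-{i}"] = pdf["Count_n"].shift(i).fillna(0).astype(int)
    return pdf

# Illustrative history for group 'A' (June 1-3).
group = pd.DataFrame({
    "COL A": ["A", "A", "A"],
    "snapshot_date": ["2021-06-01", "2021-06-02", "2021-06-03"],
    "Count_n": [10, 17, 21],
})
print(add_lag_features(group).iloc[-1][["Count_n", "count_n_t-1", "count_n_t-2"]])
```

In Spark this would be wired up as history_df.groupby('COL A').applyInPandas(add_lag_features, schema=...), where the schema lists the original columns plus the 30 new lag columns.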

1 Answer


Your problem doesn't seem to need any aggregation. If I understand correctly, you have a separate DataFrame for every day and you know how to get them by name; then you only need a join on COL A (a left join, so keys missing from an older day still survive and can be filled with 0):

from datetime import date, timedelta

from pyspark.sql import functions as F

session = ...  # your SparkSession

def get_df_name(date_of_df):
    # Logic for getting your df names based on date
    ...

df = ...  # get your initial (current-day) df
tables = set(table.name for table in session.catalog.listTables())

for i in range(1, 31):
    date_of_df = date.today() - timedelta(days=i)
    df_name = get_df_name(date_of_df)
    if df_name in tables:
        right_df = session.table(df_name)
        right_df = right_df.withColumnRenamed('Count_n', f'count_n_t-{i}')
        # Left join, so keys missing from the older snapshot survive.
        df = df.join(right_df, 'COL A', 'left')
    else:
        # No snapshot for that day: the whole lag column is 0.
        df = df.withColumn(f'count_n_t-{i}', F.lit(0))

# Keys absent from some snapshots produce nulls after the left joins.
df = df.fillna(0)
