
I have a feature table (a PySpark DataFrame) that gets created every day through a pipeline. The ask is to create time-based features, where each of the t-1 to t-30 (t = time) features captures the value from that many days earlier.

For example:

Input Table for June 1

COL A   Count_n
'A'     10
'B'     12

So consider the start date to be June 1.

My feature table with the time-based features would then look like:

COL A   Count_n   count_n_t-1   count_n_t-2   count_n_t-n   count_n_t-30
'A'     10        0             0             0             0
'B'     12        0             0             0             0

Since this is the start date, I have initialised all the lag features to 0.

Next day Input Data - June 2

COL A   Count_n
'A'     17
'B'     15

On June 2, the feature table with the time-based features would look like:

COL A   Count_n   count_n_t-1   count_n_t-2   count_n_t-n   count_n_t-30
'A'     17        10            0             0             0
'B'     15        12            0             0             0

Similarly for June 3

Input Data

COL A   Count_n
'A'     21
'B'     35

Feature table:

COL A   Count_n   count_n_t-1   count_n_t-2   count_n_t-n   count_n_t-30
'A'     21        17            10            0             0
'B'     35        15            12            0             0

As you can observe, we are shifting the features within each group (COL A) based on the previous days' values. There will be similar t-n features for other fields as well, but t-1 through t-30 is the constant set of features we are creating.

Can someone suggest an approach for doing this in the most efficient manner using PySpark?

NOTE: Please let me know if the explanation of the problem is unclear; I will try to clarify it.

thanks

I have not yet started working on this, but my initial idea was to first join the current day's table with the previous day's table on COL A to get the t-1 through t-n features, and then to group by COL A and apply a pandas UDF: df.groupby('A').apply(custom_udf_function).

Inside this UDF I am having a bit of difficulty writing a correct approach.
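For the per-group function itself: with groupby(...).applyInPandas (the current form of the grouped-map pandas UDF), each group arrives as a plain pandas DataFrame, so the shifting reduces to pandas shift. A sketch, assuming each group carries its full daily history with a snapshot_date column (all names here are illustrative, not from the question):

```python
import pandas as pd

def add_lag_features(pdf: pd.DataFrame, n_lags: int = 30) -> pd.DataFrame:
    # pdf holds every daily row for one COL A value; sort so that
    # shift(i) really means "the value from i days earlier".
    pdf = pdf.sort_values("snapshot_date").reset_index(drop=True)
    for i in range(1, n_lags + 1):
        # Rows with no history that far back become 0, as in the example.
        pdf[f"count_n_t-{i}"] = pdf["Count_n"].shift(i).fillna(0).astype(int)
    return pdf

# Illustrative history for group 'A' (June 1-3).
group = pd.DataFrame({
    "COL A": ["A", "A", "A"],
    "snapshot_date": ["2021-06-01", "2021-06-02", "2021-06-03"],
    "Count_n": [10, 17, 21],
})
print(add_lag_features(group).iloc[-1][["Count_n", "count_n_t-1", "count_n_t-2"]])
```

In Spark this would be wired up as history_df.groupby('COL A').applyInPandas(add_lag_features, schema=...), where the schema lists the original columns plus the 30 new lag columns.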

1 Answer


Your problem doesn't seem to need any aggregation. If I understand correctly, you have a separate DataFrame for every day and you know how to get them by name; then you only need a join on COL A (a left join, so keys missing from an older day still survive and can be filled with 0):

from datetime import date, timedelta

from pyspark.sql import functions as F

session = ...  # your SparkSession

def get_df_name(date_of_df):
    # Logic for getting your df names based on date
    ...

df = ...  # get your initial (current-day) df
tables = set(table.name for table in session.catalog.listTables())

for i in range(1, 31):
    date_of_df = date.today() - timedelta(days=i)
    df_name = get_df_name(date_of_df)
    if df_name in tables:
        right_df = session.table(df_name)
        right_df = right_df.withColumnRenamed('Count_n', f'count_n_t-{i}')
        # Left join, so keys missing from the older snapshot survive.
        df = df.join(right_df, 'COL A', 'left')
    else:
        # No snapshot for that day: the whole lag column is 0.
        df = df.withColumn(f'count_n_t-{i}', F.lit(0))

# Keys absent from some snapshots produce nulls after the left joins.
df = df.fillna(0)
