TL;DR: I’m trying to write a UDF that transforms a PySpark DataFrame: the input is a DataFrame and the output is the same DataFrame, just with a few columns mapped to different values.
In detail: I trained an XGBoost model using the target encoding package category_encoders (version 2.5.1). I have no problem using it when predicting on a small pandas DataFrame that fits in memory; I just do the following:
from category_encoders import TargetEncoder

te = TargetEncoder(cols=cat_cols)
te.fit(X_train, y_train)  # the encoder must be fit on the training data before it can transform

X_test = te.transform(X_test)
preds = model.predict_proba(X_test)[:, 1]
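(For context, model is the XGBoost classifier I trained earlier on the encoded training data; a minimal sketch, with the hyperparameters omitted and X_train/y_train standing in for my training set:)

import xgboost as xgb

X_train_enc = te.transform(X_train)  # same fitted encoder as above
model = xgb.XGBClassifier()
model.fit(X_train_enc, y_train)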
Now I have much bigger data that doesn’t fit in memory, so I would like to run inference with PySpark. The following code works for me:
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf('float')
def predict_pandas_udf(*cols, names=features):
    # Spark hands the columns over as pd.Series named "_0", "_1", ...,
    # so restore the real feature names before the encoder sees them
    X = pd.concat(cols, axis=1)
    old_names = ["_" + str(x) for x in range(len(names))]
    X.rename(columns=dict(zip(old_names, names)), inplace=True)
    X = te.transform(X)
    return pd.Series(model.predict_proba(X)[:, 1])

df = df.withColumn('score', predict_pandas_udf(*df[features]))
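(Side note: unpacking df[features] directly works for me, but an equivalent and maybe clearer call is to select each column explicitly:)

df = df.withColumn('score', predict_pandas_udf(*[df[c] for c in features]))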
But when I want to inspect the target encoding itself, i.e. get the transformed feature columns back instead of a score, I get confused. I’m trying something like this, but I can’t figure out how to make it work:
@pandas_udf('float')
def target_encoding_udf(df, features):
    X = df[features]
    X = te.transform(X)
    return X

df = df.transform(target_encoding_udf, features)
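From reading the docs, I suspect mapInPandas is closer to what I need, since the function there receives and returns whole pandas DataFrames. Here is my rough sketch (same te, features, and cat_cols as above; I’m not sure I’m declaring the output schema correctly, since the encoded columns change from string to double):

from pyspark.sql.types import DoubleType, StructField, StructType

# same schema as df, except the target-encoded columns become doubles
out_schema = StructType([
    StructField(f.name, DoubleType() if f.name in cat_cols else f.dataType)
    for f in df.schema.fields
])

def target_encode(iterator):
    for pdf in iterator:  # pdf is a pandas DataFrame chunk of df
        pdf[cat_cols] = te.transform(pdf[features])[cat_cols]
        yield pdf

df_encoded = df.mapInPandas(target_encode, schema=out_schema)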
Any help would be appreciated!