
TL;DR: I’m trying to write a UDF that transforms a PySpark DataFrame: the input is a DataFrame and the output is the same DataFrame, with a few columns mapped to different values.

In detail: I trained an XGBoost model using the target-encoding package `category_encoders==2.5.1`. I have no problem using it when I’m predicting on a small pandas DataFrame that fits in memory. I just do the following:

from category_encoders import TargetEncoder

te = TargetEncoder(cols=cat_cols)
te.fit(X_train, y_train)  # the encoder must be fitted before it can transform
X_test = te.transform(X_test)
preds = model.predict_proba(X_test)[:, 1]

Now I have much bigger data that doesn’t fit in memory, so I would like to run inference using PySpark. The following code works for me:

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf('float')
def predict_pandas_udf(*cols, names=features):
    # the columns arrive as unnamed Series; rebuild the DataFrame and restore names
    X = pd.concat(cols, axis=1)
    old_names = ["_" + str(x) for x in range(len(names))]
    X.rename(columns=dict(zip(old_names, names)), inplace=True)
    X = te.transform(X)  # te is the fitted TargetEncoder
    return pd.Series(model.predict_proba(X)[:, 1])

df = df.withColumn('score', predict_pandas_udf(*df[features]))

But when I try to isolate the target-encoding step on its own, I get confused. I’m trying something like this, but I can’t figure out how to make it work:

@pandas_udf('float')
def target_encoding_udf(df, features):
    X = df[features]
    X = te.transform(X)
    return X


df = df.transform(target_encoding_udf, features)

Any help would be appreciated!

2 Answers


TL;DR: You need to use a grouped map pandas UDF to get predictions for your rows and then join the result back to the main df. To do so you will need a unique-value column in your df to use as the join key.

In detail:

from pyspark.sql import functions as F

INT_ID_COLUMN = '__iid'
pred_column_name = 'pred'

# add a unique id column to use as the join key
df = df.withColumn(INT_ID_COLUMN, F.monotonically_increasing_id())

# output schema of the UDF: the id column plus a float prediction column
pred_df_schema = df \
        .select(INT_ID_COLUMN) \
        .withColumn(pred_column_name, F.lit(.0)) \
        .schema

def predict_func(pdf):
    model = get_model()  # get your model
    features = ...       # get your model's necessary features
    data = pdf[features].astype(float)

    pdf[pred_column_name] = model.predict_proba(data)[:, 1]

    return pdf[[INT_ID_COLUMN, pred_column_name]]

predict_udf = F.pandas_udf(
        predict_func,
        returnType=pred_df_schema,
        functionType=F.PandasUDFType.GROUPED_MAP
    )

# run the UDF once per Spark partition, producing (id, prediction) rows
prediction_df = df.select(INT_ID_COLUMN, *features) \
        .groupby(F.spark_partition_id()) \
        .apply(predict_udf)

# join the predictions back onto the original rows by the id column
df = df.join(prediction_df, on=[INT_ID_COLUMN], how='left').drop(INT_ID_COLUMN)

Also, turn on Arrow usage in your Spark settings, since pandas UDFs rely on it for efficient data transfer.
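For reference, enabling Arrow looks like this on a live `SparkSession` (the first key is the Spark 3.x name; the commented-out one is its Spark 2.x predecessor):

```python
# Enable Arrow-based columnar data transfers between the JVM and Python workers.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# On Spark 2.x the equivalent key was:
# spark.conf.set("spark.sql.execution.arrow.enabled", "true")
```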


1 Comment

Thanks a lot! I guess I wasn't clear enough about my intent, but thanks to your guidance I found the solution :)

Based on busfighter's answer, I was able to come up with this solution:

from pyspark.sql import functions as F
import pandas as pd

def transform_target_encoding(df, features, te):
    INT_ID_COLUMN = '__iid'

    # add a unique id column to use as the join key
    df = df.withColumn(INT_ID_COLUMN, F.monotonically_increasing_id())

    # keep the non-feature columns aside; they are joined back at the end
    residual_cols = [c for c in df.columns if c not in features]
    residual_df = df.select(INT_ID_COLUMN, *residual_cols)

    # output schema: the id column plus the encoded feature columns, cast to float
    pred_df_schema = (df.select(INT_ID_COLUMN, *features)
                        .withColumn('feature1', F.col('feature1').cast('float'))
                        .withColumn('feature2', F.col('feature2').cast('float'))
                        # ... cast the remaining feature columns the same way
                        .schema)

    def target_encoding_func(pdf):
        data = pdf[features]
        keys_cols = [c for c in pdf.columns if c not in features]
        keys = pdf[keys_cols]
        X = te.transform(data)  # te is the fitted TargetEncoder
        X = pd.concat([X, keys], axis=1)
        return X

    encoding_udf = F.pandas_udf(
        target_encoding_func,
        returnType=pred_df_schema,
        functionType=F.PandasUDFType.GROUPED_MAP
    )

    # encode the features once per Spark partition
    prediction_df = df.select(INT_ID_COLUMN, *features) \
        .groupby(F.spark_partition_id()) \
        .apply(encoding_udf)

    # reattach the untouched columns by the id key
    new_df = residual_df.join(prediction_df, on=INT_ID_COLUMN).drop(INT_ID_COLUMN)

    return new_df
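As a side note: on Spark 3.x the `PandasUDFType.GROUPED_MAP` flavor of `pandas_udf` is deprecated, and the same per-partition encoding step can be expressed with `DataFrame.groupby(...).applyInPandas`. A minimal sketch under the same assumptions as above (`te` is a fitted encoder with a pandas-style `transform` method, and `features` / `pred_df_schema` are defined as in the answer; `encode_partition` is a hypothetical helper name):

```python
import pandas as pd

def encode_partition(pdf, features, encoder):
    """Target-encode the feature columns of one pandas partition,
    keeping the remaining (key) columns unchanged."""
    keys = pdf[[c for c in pdf.columns if c not in features]]
    encoded = encoder.transform(pdf[features])
    return pd.concat([keys.reset_index(drop=True),
                      encoded.reset_index(drop=True)], axis=1)

# On Spark 3.x this replaces the GROUPED_MAP pandas_udf above:
# encoded_df = (df.select(INT_ID_COLUMN, *features)
#                 .groupby(F.spark_partition_id())
#                 .applyInPandas(lambda pdf: encode_partition(pdf, features, te),
#                                schema=pred_df_schema))
```

The pandas helper is independent of Spark, so it can be unit-tested locally before wiring it into `applyInPandas`.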

