TL;DR: I’m trying to write a UDF that transforms a PySpark DataFrame: the input is a DataFrame and the output is the same DataFrame, just with a few columns mapped to different values.
In detail: I trained an XGBoost model using the target encoding package category_encoders (version 2.5.1). I have no problem using it when predicting on a small pandas DataFrame that fits in memory; I just do the following:
from category_encoders import TargetEncoder

te = TargetEncoder(cols=cat_cols)
te.fit(X_train, y_train)  # the encoder must be fit on the training data before it can transform

X_test = te.transform(X_test)
preds = model.predict_proba(X_test)[:, 1]
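(For context, model is the XGBoost classifier I trained earlier on the encoded training data; a minimal sketch, with the hyperparameters omitted and X_train/y_train standing in for my training set:)

import xgboost as xgb

X_train_enc = te.transform(X_train)  # same fitted encoder as above
model = xgb.XGBClassifier()
model.fit(X_train_enc, y_train)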
Now I have much bigger data that doesn’t fit in memory, so I would like to run inference with PySpark. The following code works for me:
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf('float')
def predict_pandas_udf(*cols, names=features):
    # Spark hands the columns over as pd.Series named "_0", "_1", ...,
    # so restore the real feature names before the encoder sees them
    X = pd.concat(cols, axis=1)
    old_names = ["_" + str(x) for x in range(len(names))]
    X.rename(columns=dict(zip(old_names, names)), inplace=True)
    X = te.transform(X)
    return pd.Series(model.predict_proba(X)[:, 1])

df = df.withColumn('score', predict_pandas_udf(*df[features]))
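(Side note: unpacking df[features] directly works for me, but an equivalent and maybe clearer call is to select each column explicitly:)

df = df.withColumn('score', predict_pandas_udf(*[df[c] for c in features]))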
But when I want to inspect the target encoding itself, i.e. get the transformed feature columns back instead of a score, I get confused. I’m trying something like this, but I can’t figure out how to make it work:
@pandas_udf('float')
def target_encoding_udf(df, features):
    X = df[features]
    X = te.transform(X)
    return X

df = df.transform(target_encoding_udf, features)
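From reading the docs, I suspect mapInPandas is closer to what I need, since the function there receives and returns whole pandas DataFrames. Here is my rough sketch (same te, features, and cat_cols as above; I’m not sure I’m declaring the output schema correctly, since the encoded columns change from string to double):

from pyspark.sql.types import DoubleType, StructField, StructType

# same schema as df, except the target-encoded columns become doubles
out_schema = StructType([
    StructField(f.name, DoubleType() if f.name in cat_cols else f.dataType)
    for f in df.schema.fields
])

def target_encode(iterator):
    for pdf in iterator:  # pdf is a pandas DataFrame chunk of df
        pdf[cat_cols] = te.transform(pdf[features])[cat_cols]
        yield pdf

df_encoded = df.mapInPandas(target_encode, schema=out_schema)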
Any help would be appreciated!