The documentation page for pandas_udf in the PySpark documentation has the following paragraph:

The user-defined functions do not support conditional expressions or short circuiting in boolean expressions and it ends up with being executed all internally. If the functions can fail on special rows, the workaround is to incorporate the condition into the functions.

Can somebody explain to me what this means? It seems to be saying that the UDF does not support conditional statements (if/else blocks), and then suggests that the workaround is to include the if/else condition in the function body. This does not make sense to me. Please help.

1 Answer

I read something similar in Learning Spark: Lightning-Fast Data Analytics.

In Chapter 5 (User-Defined Functions), it discusses evaluation order and null checking in Spark SQL.

If your UDF can fail when dealing with NULL values, it's best to move the null-handling logic inside the UDF itself, just as the quote you provided says.

Here's the reasoning behind it:

Spark SQL (this includes the DataFrame and Dataset APIs) does not guarantee the order of evaluation of subexpressions. For example, the following query does not guarantee that the s IS NOT NULL clause is executed prior to strlen(s):

spark.sql("SELECT s FROM test1 WHERE s IS NOT NULL AND strlen(s) > 1")
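
For concreteness, strlen in that query is a user-defined function. Here is a minimal sketch of how a naive (null-unsafe) version might be registered; the test1 table comes from the query above, while the SparkSession setup and the lambda body are assumptions for illustration, not code from the answer:

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

# Naive UDF: len(None) raises a TypeError, so any row where s is NULL
# can crash the query if Spark happens to evaluate strlen(s) before
# the s IS NOT NULL clause.
spark.udf.register("strlen", lambda s: len(s), IntegerType())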

Therefore, to perform proper null checking, it is recommended that you make the UDF itself null-aware and do the null check inside the UDF.
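
As a sketch of that workaround, continuing the session from the snippet above (the strlen_safe names and the -1 sentinel are my own illustrative choices), including a pandas_udf variant since that is what the original question asks about:

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import IntegerType

# Null-aware plain UDF: the null check lives inside the function,
# so it is safe no matter which subexpression Spark evaluates first.
spark.udf.register(
    "strlen_safe",
    lambda s: len(s) if s is not None else -1,
    IntegerType(),
)
spark.sql("SELECT s FROM test1 WHERE strlen_safe(s) > 1")

# The same idea as a pandas_udf: incorporate the condition into the
# function body instead of relying on WHERE-clause short-circuiting.
@pandas_udf("int")
def strlen_safe_pd(s: pd.Series) -> pd.Series:
    return s.map(lambda x: len(x) if pd.notna(x) else -1)

# A pandas UDF can also be registered for use in SQL queries:
spark.udf.register("strlen_safe_pd", strlen_safe_pd)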
