The documentation page for pandas_udf in the PySpark documentation has the following paragraph:

The user-defined functions do not support conditional expressions or short circuiting in boolean expressions and it ends up with being executed all internally. If the functions can fail on special rows, the workaround is to incorporate the condition into the functions.

Can somebody explain to me what this means? It seems to be saying that the UDF does not support conditional statements (if/else blocks), and then suggests that the workaround is to include the if/else condition in the function body. This does not make sense to me. Please help.

1 Answer

I read something similar in Learning Spark: Lightning-Fast Data Analytics.

In Chapter 5 (User-Defined Functions), it discusses evaluation order and null checking in Spark SQL.

If your UDF can fail when dealing with NULL values, it's best to move the null-handling logic inside the UDF itself, just as the quote you provided says.

Here's the reasoning behind it:

Spark SQL (this includes the DataFrame and Dataset APIs) does not guarantee the order of evaluation of subexpressions. For example, the following query does not guarantee that the s IS NOT NULL clause is executed prior to strlen(s):

spark.sql("SELECT s FROM test1 WHERE s IS NOT NULL AND strlen(s) > 1")
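
For concreteness, strlen in that query is a user-defined function. Here is a minimal sketch of how a naive (null-unsafe) version might be registered; the test1 table comes from the query above, while the SparkSession setup and the lambda body are assumptions for illustration, not code from the answer:

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

# Naive UDF: len(None) raises a TypeError, so any row where s is NULL
# can crash the query if Spark happens to evaluate strlen(s) before
# the s IS NOT NULL clause.
spark.udf.register("strlen", lambda s: len(s), IntegerType())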

Therefore, to perform proper null checking, it is recommended that you make the UDF itself null-aware and do the null check inside the UDF.
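
As a sketch of that workaround, continuing the session from the snippet above (the strlen_safe names and the -1 sentinel are my own illustrative choices), including a pandas_udf variant since that is what the original question asks about:

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import IntegerType

# Null-aware plain UDF: the null check lives inside the function,
# so it is safe no matter which subexpression Spark evaluates first.
spark.udf.register(
    "strlen_safe",
    lambda s: len(s) if s is not None else -1,
    IntegerType(),
)
spark.sql("SELECT s FROM test1 WHERE strlen_safe(s) > 1")

# The same idea as a pandas_udf: incorporate the condition into the
# function body instead of relying on WHERE-clause short-circuiting.
@pandas_udf("int")
def strlen_safe_pd(s: pd.Series) -> pd.Series:
    return s.map(lambda x: len(x) if pd.notna(x) else -1)

# A pandas UDF can also be registered for use in SQL queries:
spark.udf.register("strlen_safe_pd", strlen_safe_pd)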
