How to use pandas UDF functionality in PySpark

Question

I have a Spark DataFrame with two columns that looks like this:

+-------------------------------------------------------------+------------------------------------+
|docId                                                        |id                                  |
+-------------------------------------------------------------+------------------------------------+
|DYSDG6-RTB-91d663dd-949e-45da-94dd-e604b6050cb5-1537142434000|91d663dd-949e-45da-94dd-e604b6050cb5|
|VAVLS7-RTB-8e2c1917-0d6b-419b-a59e-cd4acc255bb7-1537142445000|8e2c1917-0d6b-419b-a59e-cd4acc255bb7|
|VAVLS7-RTB-c818dcde-7a68-4c1e-9cc4-c841660732d2-1537146854000|c818dcde-7a68-4c1e-9cc4-c841660732d2|
|IW2BYL-RTB-E9727F7D-D1BA-479C-9D3A-931F87E78B0A-1537146572000|E9727F7D-D1BA-479C-9D3A-931F87E78B0A|
|DYSDG6-RTB-f50f79e9-3ec3-4bd8-8e53-f62c3f80bcb0-1537146220000|f50f79e9-3ec3-4bd8-8e53-f62c3f80bcb0|
+-------------------------------------------------------------+------------------------------------+

I have a function that converts the id column into an Ascii85 (base85) encoded string:

def convert_id(id):
    import base64 as bs
    id_str = str(id).replace("-", "") 
    return str(bs.a85encode(bytearray.fromhex(id_str)))[2:-1]
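For illustration, here is a standalone sketch of what this function does to one of the ids above, using only the standard library. The name `convert_id_decoded` is hypothetical and not part of the original code. It decodes the bytes with `.decode("ascii")` instead of slicing the `repr`. The two variants can differ when the encoded text contains a backslash, because `str(bytes)` escapes it.

```python
import base64


def convert_id(id):
    # Same logic as the function above: strip hyphens, hex-decode, Ascii85-encode,
    # then slice the b'...' wrapper off the bytes repr.
    id_str = str(id).replace("-", "")
    return str(base64.a85encode(bytearray.fromhex(id_str)))[2:-1]


def convert_id_decoded(id):
    # Hypothetical variant (an assumption, not the original): decode the bytes
    # directly, so no repr escaping ends up in the result.
    id_str = str(id).replace("-", "")
    return base64.a85encode(bytes.fromhex(id_str)).decode("ascii")


sample = "91d663dd-949e-45da-94dd-e604b6050cb5"
encoded = convert_id_decoded(sample)

# A 16-byte id becomes 20 Ascii85 characters, and decoding restores the hex digits.
restored = base64.a85decode(encoded).hex()
print(encoded, len(encoded), restored)
```

If the output has to match the original function exactly, keep the `str(...)[2:-1]` form.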

I want to apply this transformation with a pandas UDF, which is reported to be faster than a normal UDF.

How can I achieve this? TIA.

The UDF definition stays the same, provided the input and output are specified as pandas Series of strings.
– pissall, Commented Sep 19, 2018 at 9:12

Nitesh Gupta · Accepted Answer · 2018-09-19 09:09:02Z

3

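A simple function achieves this:

import base64 as bs
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(returnType=StringType())
def convert_id(id):
    # id arrives as a pandas Series of strings; encode each value and return a Series
    return id.map(lambda x: str(bs.a85encode(bytearray.fromhex(str(x).replace("-", ""))))[2:-1])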
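As a rough sketch of how such a UDF might be applied to the DataFrame from the question: the helper names `encode_id` and `add_encoded_column` and the output column name `encodedId` are assumptions made for illustration. The per-value encoder uses `.decode("ascii")` rather than the `str(...)[2:-1]` slice, which is a swap on my part and can give different text when the encoding contains a backslash. The PySpark imports sit inside the helper so the encoding logic can be tried without a Spark installation.

```python
import base64


def encode_id(value):
    """Ascii85-encode one UUID-like string (hyphens removed, then hex-decoded)."""
    raw = bytes.fromhex(str(value).replace("-", ""))
    return base64.a85encode(raw).decode("ascii")


def add_encoded_column(df, src="id", dst="encodedId"):
    """Hypothetical helper: return df with an extra column of encoded ids."""
    # Imported lazily so this module can be loaded without pyspark installed.
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import StringType

    @pandas_udf(returnType=StringType())
    def convert_id(ids):
        # ids arrives as a pandas Series; return a Series of the same length.
        return ids.map(encode_id)

    return df.withColumn(dst, convert_id(df[src]))
```

With a DataFrame `df` like the one in the question, `add_encoded_column(df).show(truncate=False)` should display the new column next to `docId` and `id`. Pandas UDFs also need PyArrow available to Spark.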


4 Comments

pissall Over a year ago

Is this optimizing on speed?

Nitesh Gupta Over a year ago

@pissall: Yes, this improves speed. From what I have read, it is typically about 3.7x faster than a generic UDF.

pissall Over a year ago

Can you tell me from your own experience?

Nitesh Gupta Over a year ago

@pissall: There is a significant improvement in speed. A normal UDF took ~15 minutes for 1 million rows, while the pandas UDF took ~5 minutes.
