I have a Spark DataFrame with two columns that looks like this:

+-------------------------------------------------------------+------------------------------------+
|docId                                                        |id                                  |
+-------------------------------------------------------------+------------------------------------+
|DYSDG6-RTB-91d663dd-949e-45da-94dd-e604b6050cb5-1537142434000|91d663dd-949e-45da-94dd-e604b6050cb5|
|VAVLS7-RTB-8e2c1917-0d6b-419b-a59e-cd4acc255bb7-1537142445000|8e2c1917-0d6b-419b-a59e-cd4acc255bb7|
|VAVLS7-RTB-c818dcde-7a68-4c1e-9cc4-c841660732d2-1537146854000|c818dcde-7a68-4c1e-9cc4-c841660732d2|
|IW2BYL-RTB-E9727F7D-D1BA-479C-9D3A-931F87E78B0A-1537146572000|E9727F7D-D1BA-479C-9D3A-931F87E78B0A|
|DYSDG6-RTB-f50f79e9-3ec3-4bd8-8e53-f62c3f80bcb0-1537146220000|f50f79e9-3ec3-4bd8-8e53-f62c3f80bcb0|
+-------------------------------------------------------------+------------------------------------+

I have a function that converts the id column into an Ascii85 (base 85) encoded string:

def convert_id(id):
    import base64 as bs
    id_str = str(id).replace("-", "") 
    return str(bs.a85encode(bytearray.fromhex(id_str)))[2:-1]
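
For reference, here is a minimal standard-library sketch of the same encoding, assuming the ids are always well-formed UUID strings. It swaps the `str(...)[2:-1]` slice for `.decode("ascii")`, which is a change from the function above: slicing the `repr` of a bytes object doubles any backslash, and the backslash is part of the Ascii85 alphabet, so the two approaches can give different strings. The round trip through `uuid.UUID` is only there to show that the encoding is reversible.

```python
import base64
import uuid


def convert_id(id_value):
    # Strip the hyphens, parse the 32 hex digits into 16 bytes,
    # then Ascii85-encode and decode the result to a plain str.
    raw = bytes.fromhex(str(id_value).replace("-", ""))
    return base64.a85encode(raw).decode("ascii")


encoded = convert_id("91d663dd-949e-45da-94dd-e604b6050cb5")

# Decoding the Ascii85 text gives back the original 16 bytes.
restored = uuid.UUID(bytes=base64.a85decode(encoded))

# Why the repr slice is risky: a backslash in the bytes is escaped by repr.
sliced = str(b"a\\b")[2:-1]          # 4 characters: a, \, \, b
decoded = b"a\\b".decode("ascii")    # 3 characters: a, \, b
```

Each 4-byte group becomes 5 characters, so a 16-byte UUID normally encodes to 20 characters (an all-zero group is abbreviated to `z` by `a85encode`).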

I want to apply this transformation with a pandas UDF, which is reported to be faster than a regular UDF.

How can I achieve this? Thanks in advance.

  • The UDF definition is the same, except that the input and the output are each a pandas Series of strings. Commented Sep 19, 2018 at 9:12

1 Answer

A simple function is enough to achieve this:

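import base64 as bs
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(returnType=StringType())
def convert_id(id):
    # id is a pandas Series of UUID strings; encode each element
    return id.map(lambda x: str(bs.a85encode(bytearray.fromhex(str(x).replace("-", ""))))[2:-1])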
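
As a usage sketch, the Series-to-Series logic can be kept in a plain pandas function and wrapped for Spark separately, which makes it testable without a cluster. The names `encode_ids`, `add_encoded_column` and `encoded_id` are hypothetical, the DataFrame is assumed to have an `id` column as in the question, and `.decode("ascii")` is used here in place of the `str(...)[2:-1]` slice, so the output can differ whenever the encoded text contains a backslash.

```python
import base64

import pandas as pd


def encode_ids(ids: pd.Series) -> pd.Series:
    # Called once per batch: takes a pandas Series of UUID strings and
    # returns a Series of the same length with the Ascii85 text.
    return ids.map(
        lambda x: base64.a85encode(
            bytes.fromhex(str(x).replace("-", ""))
        ).decode("ascii")
    )


def add_encoded_column(df):
    # Hypothetical helper: `df` is assumed to be a Spark DataFrame with an
    # "id" column. pyspark is imported here so that encode_ids can be
    # exercised with pandas alone.
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import StringType

    convert_id = pandas_udf(encode_ids, returnType=StringType())
    return df.withColumn("encoded_id", convert_id(df["id"]))
```

Calling `add_encoded_column(df)` would return the DataFrame with one extra string column; the pandas function itself can be checked locally with `encode_ids(pd.Series([...]))`.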

4 Comments

Does this improve speed?
@pissall: Yes, this improves speed. From what I have read, pandas UDFs are typically about 3.7x faster than generic UDFs.
Can you speak from experience?
@pissall: There is a significant improvement in speed. The normal UDF took ~15 min for 1 million rows, while the pandas UDF took ~5 min.
