
Can anybody explain the following behavior?

import pyspark.pandas as ps

loan_information = ps.read_sql_query([blah])

loan_information.shape
# (748834, 84)

loan_information.apply(lambda col: col.shape)
# Each column comes back in 75 batches: the first 74 have 10000 rows, the last has 8834.
# The sizes still sum to 748834, but this hardly seems like desirable behavior.

My guess is that batches of size 10000 are being fed to the executors, but again, this seems like pretty undesirable behavior.
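(For what it's worth, 10000 is also Spark's default Arrow record-batch size. A quick way to inspect that setting, assuming a live SparkSession; that this setting is what drives the batching here is my guess, not something I've confirmed:)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Arrow-based transfer hands rows to Python workers in record batches;
# the default cap is 10000 rows per batch, matching the sizes seen above
print(spark.conf.get("spark.sql.execution.arrow.maxRecordsPerBatch"))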

  • what is your desirable behavior? Commented May 9, 2023 at 16:16

1 Answer


The documentation is quite clear:

when axis is 0 or ‘index’, the func is unable to access to the whole input series. pandas-on-Spark internally splits the input series into multiple batches and calls func with each batch multiple times. Therefore, operations such as global aggregations are impossible. See the example below.

.apply is for non-aggregation functions; if you want aggregate-type behavior, use something like .aggregate instead.
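Here's a minimal sketch of both behaviors (psdf and length are illustrative names, modeled on the example in the docs):

import pyspark.pandas as ps

psdf = ps.DataFrame({"a": range(25000)})

# apply(axis=0) hands each column to func in internal batches, so len()
# reports the batch sizes rather than the length of the whole column
def length(s) -> int:
    return len(s)

psdf.apply(length, axis=0)

# For a true global aggregation, use an aggregation method instead:
psdf.a.count()              # 25000
psdf.agg({"a": "count"})    # DataFrame.aggregate works the same way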


1 Comment

Thanks Bert! I didn't see that in the docs.
