
So suppose I have a big Spark DataFrame and I don't know how many columns it has.

(The solution has to be in PySpark using a pandas UDF, not a different approach.)

I want to perform an action on all columns, so it's OK to loop over the columns inside the UDF, but I don't want to loop through the rows. I want it to act on each whole column at once.

I couldn't find anywhere on the internet how this could be done.

Suppose I have this DataFrame:

A   B    C
5   3    2
1   7    0

Now I want to send it to a pandas UDF to get the sum of each row:

Sum 
 10
  8

The number of columns is not known.

I can do it inside the UDF by looping one row at a time, but I don't want that. I want it to act on all rows without looping, while looping through the columns is allowed if needed.

One option I tried is combining all columns into a single array column:

ARR
[5,3,2]
[1,7,0]

But even here it doesn't work for me without looping: I send this column to the UDF, and inside it I still need to loop through its rows and sum the values of each list-row (a sketch of what I mean is below).
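For reference, this is roughly the kind of UDF I have now (the names are just illustrative); it still loops row by row inside, which is exactly the loop I want to eliminate:

import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import LongType

# df is the example DataFrame above with columns A, B, C (all numeric)
@F.pandas_udf(LongType())
def sum_arr(arr: pd.Series) -> pd.Series:
    # arr is a pandas Series where each element is the array for one row
    return pd.Series([sum(values) for values in arr])  # the per-row Python loop I want to avoid

df.withColumn("Sum", sum_arr(F.array(*df.columns))).show()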

It would be nice if I could separate out each column and act on the whole column at once.

How do I act on a whole column at once, without looping through the rows?

If I loop through the rows, I guess it's no better than a regular Python UDF.

  • Will you ever know the columns that you want to sum? Will it be all the columns in the DataFrame? Commented Nov 22, 2022 at 7:29
  • @iambdot Yes, suppose all columns of the df. Commented Nov 22, 2022 at 7:42
  • Did my answer help, or do you need more assistance? Commented Nov 22, 2022 at 21:49
  • Sorry, missed it. Hmm, actually I believe your answer is nice, but I can't understand it; I can't follow the logic of the UDF, so some expansion on it would help. I actually don't need to sum up each row, I just gave that as an example. I need a general way of looping over columns while not looping over rows, working on the whole column at once. Even if your answer works for sum, I can't see how I'd do other tasks which involve working on the full row. Commented Nov 25, 2022 at 0:33

2 Answers


I wouldn't go to pandas UDFs; only resort to UDFs when it can't be done in native PySpark. Anyway, code for both is below.

from typing import Iterator

import pandas as pd
from pyspark.sql.functions import array, expr, lit

# Grab some delay columns from the Databricks sample dataset (colRegex selects
# every column ending in ArrDelay/DepDelay without naming them explicitly)
df = spark.read.load('/databricks-datasets/asa/small/small.csv', header=True, format='csv')
sf = df.select(df.colRegex("`.*rrDelay$|.*pDelay$`"))
# sf.show()

# Or build a small example DataFrame directly
columns = ["id", "ArrDelay", "DepDelay"]
data = [("a", 81.0, 3),
        ("b", 36.2, 5),
        ("c", 12.0, 5),
        ("d", 81.0, 5),
        ("e", 36.3, 5),
        ("f", 12.0, 5),
        ("g", 111.7, 5)]

sf = spark.createDataFrame(data=data, schema=columns)
sf.show()

# Option 1: the higher-order aggregate function, no UDF at all
new = (sf.withColumn('sums', array(*['ArrDelay', 'DepDelay']))  # create an array of values per row from the desired columns
         .withColumn('sums', expr("aggregate(sums, cast(0 as double), (c, i) -> c + i)"))  # use aggregate to sum the array
      )
new.show()
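(Side note: on Spark 3.1+ the same higher-order aggregate is also exposed in the Python API, so a sketch without the SQL expr string would be:)

from pyspark.sql import functions as F

# Same idea via pyspark.sql.functions.aggregate (Spark >= 3.1): fold the
# per-row array into a running double sum, still without any UDF.
(sf.withColumn('sums', F.array('ArrDelay', 'DepDelay'))
   .withColumn('sums', F.aggregate('sums', F.lit(0.0), lambda acc, x: acc + x))
   .show())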


# Option 2: a pandas UDF via mapInPandas
sch = sf.withColumn('v', lit(90.087654623)).schema  # output schema: the input columns plus the new 'v' column

def sum_s(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # each pdf is a whole pandas DataFrame (one Arrow batch), so the row-wise
    # sum is vectorised across the numeric columns with no Python row loop
    for pdf in iterator:
        yield pdf.assign(v=pdf.sum(axis=1, numeric_only=True))

sf.mapInPandas(sum_s, schema=sch).show()
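To expand on the logic: mapInPandas hands the function an iterator of pandas DataFrames (one per Arrow batch), so inside it every column is a full pandas Series that you can work on vectorised. If the real task is "loop over columns, never over rows", a sketch of that pattern (the per-column logic here is just a placeholder) could look like this:

from typing import Iterator
import pandas as pd

def per_column(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    for pdf in iterator:
        # looping over columns is fine; each pdf[c] is an entire pandas Series,
        # so the work per column is vectorised and there is no row loop
        total = pd.Series(0.0, index=pdf.index)
        for c in pdf.columns:
            if pd.api.types.is_numeric_dtype(pdf[c]):
                total = total + pdf[c]
        yield pdf.assign(v=total)

sf.mapInPandas(per_column, schema=sch).show()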

1 Comment

I replied to a comment you made above

Here's a simple way to do it:

from functools import reduce
from pyspark.sql.functions import col

df = spark.createDataFrame(
    [
        (5, 3, 2),
        (1, 7, 0),
    ],
    ["A", "B", "C"],
)

cols = df.columns
# fold every column into a single expression: col("A") + col("B") + col("C") + ...
calculate_sum = reduce(lambda a, x: a + x, map(col, cols))

df = df.withColumn("sum", calculate_sum)

df.show()

output:

+---+---+---+---+
|  A|  B|  C|sum|
+---+---+---+---+
|  5|  3|  2| 10|
|  1|  7|  0|  8|
+---+---+---+---+
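For what it's worth, the reduce call just folds all the column names into one Spark column expression (roughly col("A") + col("B") + col("C")), which Spark evaluates natively without any Python UDF. You can inspect the expression it builds:

# calculate_sum is a single Spark Column expression, printed as something like
# Column<'((A + B) + C)'> depending on the Spark version
print(calculate_sum)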

1 Comment

I am not sure if you read the question, but I emphasized throughout that the desired solution has to be a pandas UDF.
