
So suppose I have a big Spark DataFrame and I don't know how many columns it has.

(The solution has to be in PySpark using a pandas UDF, not a different approach.)

I want to perform an action on all columns, so it's OK to loop over the columns inside the UDF, but I don't want to loop through the rows. I want it to act on each whole column at once.

I couldn't find anywhere on the internet how this could be done.

Suppose I have this DataFrame:

A   B    C
5   3    2
1   7    0

Now I want to send it to a pandas UDF to get the sum of each row:

Sum 
 10
  8

The number of columns is not known.

I can do it inside the UDF by looping one row at a time, but I don't want that. I want it to act on all rows without looping, while looping through the columns is allowed if needed.

One option I tried is combining all columns into a single array column:

ARR
[5,3,2]
[1,7,0]

But even here it doesn't work for me without looping: I send this column to the UDF, and inside it I still need to loop through its rows and sum the values of each list-row (a sketch of what I mean is below).
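For reference, this is roughly the kind of UDF I have now (the names are just illustrative); it still loops row by row inside, which is exactly the loop I want to eliminate:

import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import LongType

# df is the example DataFrame above with columns A, B, C (all numeric)
@F.pandas_udf(LongType())
def sum_arr(arr: pd.Series) -> pd.Series:
    # arr is a pandas Series where each element is the array for one row
    return pd.Series([sum(values) for values in arr])  # the per-row Python loop I want to avoid

df.withColumn("Sum", sum_arr(F.array(*df.columns))).show()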

It would be nice if I could separate out each column and act on the whole column at once.

How do I act on a whole column at once, without looping through the rows?

If I loop through the rows, I guess it's no better than a regular Python UDF.

  • Will you ever know the columns that you want to sum? Will it be all the columns in the DataFrame? Commented Nov 22, 2022 at 7:29
  • @iambdot Yes, suppose all columns of the df. Commented Nov 22, 2022 at 7:42
  • Did my answer help, or do you need more assistance? Commented Nov 22, 2022 at 21:49
  • Sorry, missed it. Hmm, actually I believe your answer is nice, but I can't understand it; I can't follow the logic of the UDF, so some expansion on it would help. I actually don't need to sum up each row, I just gave that as an example. I need a general way of looping over columns while not looping over rows, working on the whole column at once. Even if your answer works for sum, I can't see how I'd do other tasks which involve working on the full row. Commented Nov 25, 2022 at 0:33

2 Answers


I wouldn't go to pandas UDFs; only resort to UDFs when it can't be done in native PySpark. Anyway, code for both is below.

from typing import Iterator

import pandas as pd
from pyspark.sql.functions import array, expr, lit

# Grab some delay columns from the Databricks sample dataset (colRegex selects
# every column ending in ArrDelay/DepDelay without naming them explicitly)
df = spark.read.load('/databricks-datasets/asa/small/small.csv', header=True, format='csv')
sf = df.select(df.colRegex("`.*rrDelay$|.*pDelay$`"))
# sf.show()

# Or build a small example DataFrame directly
columns = ["id", "ArrDelay", "DepDelay"]
data = [("a", 81.0, 3),
        ("b", 36.2, 5),
        ("c", 12.0, 5),
        ("d", 81.0, 5),
        ("e", 36.3, 5),
        ("f", 12.0, 5),
        ("g", 111.7, 5)]

sf = spark.createDataFrame(data=data, schema=columns)
sf.show()

# Option 1: the higher-order aggregate function, no UDF at all
new = (sf.withColumn('sums', array(*['ArrDelay', 'DepDelay']))  # create an array of values per row from the desired columns
         .withColumn('sums', expr("aggregate(sums, cast(0 as double), (c, i) -> c + i)"))  # use aggregate to sum the array
      )
new.show()
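(Side note: on Spark 3.1+ the same higher-order aggregate is also exposed in the Python API, so a sketch without the SQL expr string would be:)

from pyspark.sql import functions as F

# Same idea via pyspark.sql.functions.aggregate (Spark >= 3.1): fold the
# per-row array into a running double sum, still without any UDF.
(sf.withColumn('sums', F.array('ArrDelay', 'DepDelay'))
   .withColumn('sums', F.aggregate('sums', F.lit(0.0), lambda acc, x: acc + x))
   .show())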


# Option 2: a pandas UDF via mapInPandas
sch = sf.withColumn('v', lit(90.087654623)).schema  # output schema: the input columns plus the new 'v' column

def sum_s(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # each pdf is a whole pandas DataFrame (one Arrow batch), so the row-wise
    # sum is vectorised across the numeric columns with no Python row loop
    for pdf in iterator:
        yield pdf.assign(v=pdf.sum(axis=1, numeric_only=True))

sf.mapInPandas(sum_s, schema=sch).show()
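To expand on the logic: mapInPandas hands the function an iterator of pandas DataFrames (one per Arrow batch), so inside it every column is a full pandas Series that you can work on vectorised. If the real task is "loop over columns, never over rows", a sketch of that pattern (the per-column logic here is just a placeholder) could look like this:

from typing import Iterator
import pandas as pd

def per_column(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    for pdf in iterator:
        # looping over columns is fine; each pdf[c] is an entire pandas Series,
        # so the work per column is vectorised and there is no row loop
        total = pd.Series(0.0, index=pdf.index)
        for c in pdf.columns:
            if pd.api.types.is_numeric_dtype(pdf[c]):
                total = total + pdf[c]
        yield pdf.assign(v=total)

sf.mapInPandas(per_column, schema=sch).show()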

1 Comment

I replied to a comment you made above

Here's a simple way to do it:

from functools import reduce
from pyspark.sql.functions import col

df = spark.createDataFrame(
    [
        (5, 3, 2),
        (1, 7, 0),
    ],
    ["A", "B", "C"],
)

cols = df.columns
# fold every column into a single expression: col("A") + col("B") + col("C") + ...
calculate_sum = reduce(lambda a, x: a + x, map(col, cols))

df = df.withColumn("sum", calculate_sum)

df.show()

output:

+---+---+---+---+
|  A|  B|  C|sum|
+---+---+---+---+
|  5|  3|  2| 10|
|  1|  7|  0|  8|
+---+---+---+---+
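For what it's worth, the reduce call just folds all the column names into one Spark column expression (roughly col("A") + col("B") + col("C")), which Spark evaluates natively without any Python UDF. You can inspect the expression it builds:

# calculate_sum is a single Spark Column expression, printed as something like
# Column<'((A + B) + C)'> depending on the Spark version
print(calculate_sum)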

1 Comment

I am not sure if you read the question, but I emphasized throughout that the desired solution has to be a pandas UDF.
