0

I have to divide a set of columns in a pyspark.sql.dataframe by their respective column average but I am not able to find an correct way to do it. Below is a sample data and my present code.

Input Data

columns = ["Col1", "Col2", "Col3","Name"]
data = [("1","40", "56" , "john jones"),
    ("2", "45", "55", "tracey smith"),
    ("3", "33", "23", "amy sanders")]

df = spark.createDataFrame(data=data,schema=columns)

Col1    Col2    Col3    Name
 1      40      56  john jones
 2      45      55  tracey smith
 3      33      23  amy sanders

Expected Output

Col1    Col2    Col3    Name
 0.5    1.02    1.25    john jones
 1      1.14    1.23    tracey smith
 1.5    0.84    0.51    amy sanders

Function as of now. Not working:

#function to divide few columns by the column average and overwrite the column

def avg_scaling(df):

  #List of columns which have to be scaled by their average
  col_list = ['col1', 'col2', 'col3']

  for i in col_list:
   df = df.withcolumn(i, col(i)/df.select(f.avg(df[i])))

   return df

new_df = avg_scaling(df)
0

1 Answer 1

1

You can make use of a Window here partitioned on a pusedo column and run average on that window.

The code goes like this,

columns = ["Col1", "Col2", "Col3","Name"]
data = [("1","40", "56" , "john jones"),
    ("2", "45", "55", "tracey smith"),
    ("3", "33", "23", "amy sanders")]

df = spark.createDataFrame(data=data,schema=columns)

df.show()

+----+----+----+------------+
|Col1|Col2|Col3|        Name|
+----+----+----+------------+
|   1|  40|  56|  john jones|
|   2|  45|  55|tracey smith|
|   3|  33|  23| amy sanders|
+----+----+----+------------+


from pyspark.sql import Window

def avg_scaling(df, cols_to_scale):
    
    w = Window.partitionBy(F.lit(1))
    for col in cols_to_scale:
        df = df.withColumn(f"{col}", F.round(F.col(col) / F.avg(col).over(w), 2))
    return df

new_df = avg_scaling(df, ["Col1", 'Col2', 'Col3'])
new_df.show()


+----+----+----+------------+
|Col1|Col2|Col3|        Name|
+----+----+----+------------+
| 0.5|1.02|1.25|  john jones|
| 1.5|0.84|0.51| amy sanders|
| 1.0|1.14|1.23|tracey smith|
+----+----+----+------------+
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.