0

I am working with pyspark below is my dataframe

col_list   | col_1| col_2 |...|col_n
col_1,col2 | x    | y     |...|n
col_2,col_n| a    | b     |...|z

I want concat columns which are present in my col_list column values

Expected DF:

col_list   | col_1| col_2 |...|col_n | concat_cols
col_1,col2 | x    | y     |...|n     | [col_1:x,col_2:y]
col_2,col_n| a    | b     |...|z     | [col_2:b,col_n:z]

Trying with below approaches:

  1. Creating udf to pass col_list and create expression to create array col like array(df['col_1'],df['col_2']) , but this will create column with string , not sure how to execute expression

  2. Using create map passing col_list in loop and create column, need to check on this.

1 Answer 1

2

You can use some map methods:

import pyspark.sql.functions as F

df2 = df.withColumn(
    'col_map', 
    F.map_from_arrays(
        F.array([F.lit(c) for c in df.columns[1:]]), 
        F.array(df.columns[1:])
    )
).withColumn(
    'concat_cols', 
    F.expr("""
        map_from_arrays(
            split(col_list, ','), 
            transform(split(col_list, ','), x -> col_map[x])
        )
    """)
).drop('col_map')

df2.show(truncate=False)
+-----------+-----+-----+-----+------------------------+
|col_list   |col_1|col_2|col_n|concat_cols             |
+-----------+-----+-----+-----+------------------------+
|col_1,col_2|x    |y    |n    |[col_1 -> x, col_2 -> y]|
|col_2,col_n|a    |b    |z    |[col_2 -> b, col_n -> z]|
+-----------+-----+-----+-----+------------------------+

If you prefer a string representation rather than a map type column, you can do

import pyspark.sql.functions as F

df2 = df.withColumn(
    'col_map', 
    F.map_from_arrays(
        F.array([F.lit(c) for c in df.columns[1:]]), 
        F.array(df.columns[1:])
    )
).withColumn(
    'concat_cols', 
    F.expr("""
        concat_ws(',', 
            transform(split(col_list, ','), x -> concat_ws(':', x, col_map[x]))
        )
    """)
).drop('col_map')

df2.show(truncate=False)
+-----------+-----+-----+-----+---------------+
|col_list   |col_1|col_2|col_n|concat_cols    |
+-----------+-----+-----+-----+---------------+
|col_1,col_2|x    |y    |n    |col_1:x,col_2:y|
|col_2,col_n|a    |b    |z    |col_2:b,col_n:z|
+-----------+-----+-----+-----+---------------+
Sign up to request clarification or add additional context in comments.

6 Comments

i am getting [col_1 -> x, col_2 -> ] in concat_cols
did you have spaces in your column names? that may cause issues
there are chances, happening to few columns
Then you need to make sure to remove spaces. See stackoverflow.com/questions/57318519/…
getting spaces for some records in col_lis causing issue, fixed by chaining col_map[x] to col_map[trim(x)]
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.