
I have a CSV file in which one column contains double quotes. I need to remove those quotes while reading and writing. Please guide me on how I can do it.

Example:

df:

col1
"xyznm""cxvb"

I want the output below:

col1
xyznm""cxvb

I have written the code below for this:

df = spark.read.format("com.databricks.spark.csv").option("delimiter", "|").options(header='true', escape='"').load("my_path")

df.show()

df.write.format('com.databricks.spark.csv').mode('overwrite').save(r"path", sep="|", escape='"', header='True', nullValue=None)

1 Answer


One possible workaround is to remove the leading and trailing quotes after reading in your CSV.

Let's say you load this df:

df = spark.createDataFrame(["\"xyznm\"\"cxvb\"","1\"1\"","\"13"], "string").toDF("col1")

+-------------+
|         col1|
+-------------+
|"xyznm""cxvb"|
|         1"1"|
|          "13|
+-------------+

Then you can use the following regex to remove outer quotes:

from pyspark.sql import functions as F

df.select(F.regexp_replace('col1', '^"+|"+$', '').alias('col1')).show()

+-----------+
|       col1|
+-----------+
|xyznm""cxvb|
|        1"1|
|         13|
+-----------+
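Spark's regexp_replace uses Java regex, but the pattern `^"+|"+$` (one or more quotes anchored at the start, or one or more anchored at the end) behaves the same way in Python's re module, so you can sanity-check it locally without a Spark session. A minimal sketch, using the same sample values as above:

```python
import re

# Same pattern as in the regexp_replace call: strip any run of double
# quotes at the start or end of the string; embedded quotes are untouched
# because the alternatives are anchored with ^ and $.
pattern = r'^"+|"+$'

samples = ['"xyznm""cxvb"', '1"1"', '"13']
cleaned = [re.sub(pattern, '', s) for s in samples]
print(cleaned)  # ['xyznm""cxvb', '1"1', '13']
```

Note that the doubled quotes in the middle of `xyznm""cxvb` survive, which matches the desired output in the question.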