I have a DataFrame called source_dataframe which is referenced in multiple places in my PySpark code. I was planning to cache source_dataframe so that the cached data would be reused instead of re-reading the same data multiple times.
from pyspark.sql.functions import col, concat, lit, trim, when

source_dataframe = spark.read.format("delta").table("schema.table_name")
source_dataframe = source_dataframe.filter(condition).select(list_of_columns_needed)

# Planning to call source_dataframe.cache() here, so that the cached data can be
# reused by the multiple references below.

# usage-1 of source_dataframe
columns_to_be_renamed = ["col_1", "col_2", "col_3"]
for c in columns_to_be_renamed:
    source_dataframe = source_dataframe.withColumn(
        c,
        when(trim(col(c)) == "", None)
        .otherwise(concat(lit(c), lit("_"), trim(col(c))))
    )

# usage-2 and other references of source_dataframe
...
In the for loop where I transform the values of the columns listed in columns_to_be_renamed, I need to keep the same name source_dataframe. If I instead assign to new_dataframe, as below, only the last column's values get updated, because each iteration starts again from source_dataframe and overwrites the previous iteration's result.
columns_to_be_renamed = ["col_1", "col_2", "col_3"]
for c in columns_to_be_renamed:
    new_dataframe = source_dataframe.withColumn(
        c,
        when(trim(col(c)) == "", None)
        .otherwise(concat(lit(c), lit("_"), trim(col(c))))
    )
Given this, should I cache source_dataframe immediately after reading and filtering it, or after the for loop? My concern is that, because the same name is reused after the read and inside the for loop, caching immediately after reading might cause the references after the loop to point at the uncached DataFrame produced by the loop rather than at the cached one.
Call source_dataframe.cache(), then assign new_dataframe = source_dataframe before the for loop, and then loop over new_dataframe = new_dataframe.withColumn(....).
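A minimal sketch of that suggestion, assuming spark, condition, and list_of_columns_needed are defined as in the question; the name transformed_dataframe is my own placeholder:

from pyspark.sql.functions import col, concat, lit, trim, when

# Read and filter once, then cache; everything derived from source_dataframe
# below reuses the cached data instead of re-reading the Delta table.
source_dataframe = spark.read.format("delta").table("schema.table_name")
source_dataframe = source_dataframe.filter(condition).select(list_of_columns_needed)
source_dataframe.cache()

# Build the transformed columns on a separate name, so source_dataframe itself
# keeps pointing at the cached DataFrame for the other usages.
transformed_dataframe = source_dataframe
columns_to_be_renamed = ["col_1", "col_2", "col_3"]
for c in columns_to_be_renamed:
    # Reassign the same variable each iteration so every column transformation
    # builds on the previous one instead of starting over from the source.
    transformed_dataframe = transformed_dataframe.withColumn(
        c,
        when(trim(col(c)) == "", None)
        .otherwise(concat(lit(c), lit("_"), trim(col(c))))
    )

# usage-2 and the other references can keep using source_dataframe (cached),
# while usage-1 reads from transformed_dataframe, which is built on top of it.

Note that cache() is lazy, so nothing is materialized until one of these DataFrames is first evaluated; because transformed_dataframe is derived from the cached source_dataframe, its lineage still reads from the cache rather than re-scanning the table.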