
I have a DataFrame called source_dataframe that is referenced in multiple places in my PySpark code. I was planning to cache source_dataframe so that the cached data is used instead of re-reading the same data multiple times.

from pyspark.sql.functions import col, concat, lit, trim, when

source_dataframe = spark.read.format("delta").table("schema.table_name")
source_dataframe = source_dataframe.filter(condition).select(list_of_columns_needed)

# Planning to call source_dataframe.cache() here, so that the cached data is reused by the multiple references below.

# usage-1 of source_dataframe
columns_to_be_renamed = ["col_1","col_2","col_3"]
for c in columns_to_be_renamed:
    source_dataframe = source_dataframe.withColumn(c, 
        when(trim(col(c)) == "", None)
        .otherwise(concat(lit(c), lit("_"), trim(col(c))))
    )

# usage-2 and other references of source_dataframe
...

In the for loop, where I transform the values of the columns listed in columns_to_be_renamed, I need to keep the same name source_dataframe. If I instead assign to a new name such as new_dataframe, as below, only the last column's values end up updated, because each iteration overwrites the result of the previous one.

columns_to_be_renamed = ["col_1","col_2","col_3"]
for c in columns_to_be_renamed:
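    # Bug: each iteration starts from source_dataframe again, so only the
    # last column's transformation survives in new_dataframe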
    new_dataframe = source_dataframe.withColumn(c, 
        when(trim(col(c)) == "", None)
        .otherwise(concat(lit(c), lit("_"), trim(col(c))))
    )

Given this, should I cache source_dataframe immediately after reading it, or only after the for loop? My concern is that, since the same name source_dataframe is reused both after the read and inside the for loop, caching immediately after reading might cause references after the loop to point at the uncached DataFrame produced by the loop rather than the cached one.

3 Comments

  • You should cache first. I assume you want the final DataFrame with all the column values renamed, so you should assign that final DataFrame before the for loop: call sd.cache(), set nd = sd before the loop, and then loop over nd = nd.withColumn(...) (see the sketch after these comments). Commented Apr 8, 2024 at 16:22
  • Thanks @user238607. Would there be any performance/memory impact if we do nd = sd instead of reassigning source_dataframe directly, as I did above? Commented Apr 8, 2024 at 16:33
  • You want nd to be updated throughout the for loop, so it should be both the input and the output inside the loop. Commented Apr 8, 2024 at 17:00
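
A minimal sketch of the pattern that comment describes, using its sd/nd shorthand (condition and list_of_columns_needed are placeholders carried over from the question):

from pyspark.sql.functions import col, concat, lit, trim, when

sd = spark.read.format("delta").table("schema.table_name")
sd = sd.filter(condition).select(list_of_columns_needed)
sd.cache()  # mark the base DataFrame for caching (materialized on the first action)

nd = sd  # start the transformed lineage from the cached base
for c in ["col_1", "col_2", "col_3"]:
    # nd is both the input and the output, so each column's change accumulates
    nd = nd.withColumn(c,
        when(trim(col(c)) == "", None)
        .otherwise(concat(lit(c), lit("_"), trim(col(c)))))

# nd now has all three columns transformed; sd still refers to the cached base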

1 Answer


caching immediately after reading might cause references after the loop to point at the uncached DataFrame produced by the loop rather than the cached one.

No. Calling .cache() marks the DataFrame's underlying data for caching, and every DataFrame you derive from it afterwards (including the reassignments inside the for loop) is built on top of that cached relation, so you can put the .cache() right at the beginning. You can always verify the state of a DataFrame with .explain(), which displays its execution plan. In your case:

from pyspark.sql.functions import col, concat, lit, trim, when

source_dataframe = spark.read.format("delta").table("schema.table_name").cache()

# first loop: reassigns source_dataframe, so all three transformations accumulate
columns_to_be_renamed = ["col_1", "col_2", "col_3"]
for c in columns_to_be_renamed:
    source_dataframe = source_dataframe.withColumn(c,
                                                   when(trim(col(c)) == "", None)
                                                   .otherwise(concat(lit(c), lit("_"), trim(col(c))))
                                                   )

# second loop: mirrors the question's new_dataframe variant
new_dataframe = None
for c in columns_to_be_renamed:
    new_dataframe = source_dataframe.withColumn(c,
                                                when(trim(col(c)) == "", None)
                                                .otherwise(concat(lit(c), lit("_"), trim(col(c))))
                                                )
new_dataframe.explain()

This will display the following plan:

== Physical Plan ==
*(1) Project [CASE WHEN (trim(col_1#17, None) = ) THEN null ELSE concat(col_1, _, trim(col_1#17, None)) END AS col_1#38, CASE WHEN (trim(col_2#18, None) = ) THEN null ELSE concat(col_2, _, trim(col_2#18, None)) END AS col_2#43, CASE WHEN (trim(CASE WHEN (trim(col_3#19, None) = ) THEN null ELSE concat(col_3, _, trim(col_3#19, None)) END, None) = ) THEN null ELSE concat(col_3, _, trim(CASE WHEN (trim(col_3#19, None) = ) THEN null ELSE concat(col_3, _, trim(col_3#19, None)) END, None)) END AS col_3#59]
+- InMemoryTableScan [col_1#17, col_2#18, col_3#19]
      +- InMemoryRelation [col_1#17, col_2#18, col_3#19], StorageLevel(disk, memory, deserialized, 1 replicas)
            +- FileScan csv [col_1#17,col_2#18,col_3#19] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/home/../resso..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<col_1:string,col_2:string,col_3:string>

As you can see, the plan contains the InMemoryTableScan operator, which means the cached data is being read. (Two side notes: this plan was produced from a CSV source rather than the question's Delta table, but the caching behavior is the same; and col_3 appears transformed twice, as a nested CASE WHEN, because the second loop's final iteration re-applies the transformation to a source_dataframe that the first loop had already transformed.)
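
Also note that cache() is lazy: nothing is materialized in memory until the first action (a count(), show(), etc.) runs. Besides reading the plan, you can check the cache status programmatically with PySpark's is_cached and storageLevel DataFrame attributes; a small sketch, using an illustrative name base_df (is_cached reports on the exact DataFrame it is called on, not on DataFrames derived from it):

base_df = spark.read.format("delta").table("schema.table_name").cache()
base_df.count()                # first action populates the cache
print(base_df.is_cached)       # True once .cache() has been called on this DataFrame
print(base_df.storageLevel)    # the level used by the cache (MEMORY_AND_DISK by default)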


2 Comments

  • If we include and run source_dataframe.show(truncate=False) between the two loops (just before new_dataframe = None), we can see that source_dataframe prints the renamed column values from the first for loop. This indicates that the second loop does not read the cached data (i.e. the base data as-is); instead it transforms the DataFrame produced by the first loop (which may itself be built from the cached data). Is this a fair understanding? I'm looking for a way to use only the cached DataFrame in all references, so should I assign new_dataframe = source_dataframe, as mentioned by user16798185? (See the sketch after these comments.)
  • I'm not sure I fully understood what you posted, but what user238607 mentioned is correct: "cache first" is always better.
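
To address that follow-up concretely, here is a sketch (the names base_df, renamed_df, and other_usage_df are illustrative; condition and list_of_columns_needed come from the question) in which every usage branches directly off the cached base, so no usage sees another usage's transformations:

from pyspark.sql.functions import col, concat, lit, trim, when

base_df = (spark.read.format("delta").table("schema.table_name")
           .filter(condition)
           .select(list_of_columns_needed))
base_df.cache()  # every branch below is built on this cached relation

# usage 1: the value-renaming branch, derived from the cached base
renamed_df = base_df
for c in ["col_1", "col_2", "col_3"]:
    renamed_df = renamed_df.withColumn(c,
        when(trim(col(c)) == "", None)
        .otherwise(concat(lit(c), lit("_"), trim(col(c)))))

# usage 2: an independent branch that still sees the original cached values
# (base_df.show() here would print the raw, untransformed data)
other_usage_df = base_df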
