[ConnectedComponents] Memory leak with unpersisted DataFrames in the last round #552
Merged: SemyonSinchenko merged 5 commits into graphframes:master from SauronShepherd:cc-caching on Mar 28, 2025.
Commits:
- 3d0dcf4 Do not orphan out of scope persisted dataframes in ConnectedComponent… (james-willis)
- a1414c1 Remove counts until proven necessary (SauronShepherd)
- 23c46d5 Remove counts until proven necessary (SauronShepherd)
- a85ae41 Merge branch 'master' into cc-caching (SauronShepherd)
- 788378a Fix count issues (SauronShepherd)
I believe `persist` is lazy and does not offer an eager flag. Will this code actually wind up using the cached DataFrames if we don't cache the output df before we unpersist the child DataFrames?
The DataFrame only needs to be persisted when an action is executed, in order to reuse those previous calculations.

Calculations are performed in the last round. Once the loop ends, another transformation is applied to that DataFrame and it is then cached (but no new calculations have been performed, because no action is involved).

Nothing changes, except that the persisted DataFrame is now the one the method returns, instead of persisting the previous DataFrame, applying the last transformations, and returning the resulting DataFrame to the user. This way the user can unpersist the returned DataFrame.
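The laziness being debated here can be shown with a minimal sketch (hypothetical DataFrames, assuming a local SparkSession; this is not the ConnectedComponents code itself):

```scala
import org.apache.spark.sql.SparkSession

// persist() only marks a DataFrame for caching; nothing is materialized
// until an action runs.
val spark = SparkSession.builder().master("local[*]").appName("persist-is-lazy").getOrCreate()
import spark.implicits._

val child  = Seq(1L, 2L, 3L).toDF("id").persist()            // lazy: no data cached yet
val output = child.selectExpr("id * 2 AS doubled").persist() // also lazy

child.unpersist() // child was never materialized, so there is nothing to reuse
output.count()    // the first action runs here and recomputes child's plan from scratch
```

If an action such as `count` had run between `persist` and `unpersist`, the cache would have been populated and reused; without one, the marks are effectively no-ops.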
Here is the diff with and without the `count` call; removing the `count` call causes a cache miss: https://www.diffchecker.com/i57B411V/
Where do you see a cache miss? I'm debugging the "single vertex" unit test, and there is one DataFrame cached and an InMemoryTableScan in the plan:
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- InMemoryTableScan [id#1136L, vattr#1137, gender#1138, component#1133L]
+- InMemoryRelation [id#1136L, vattr#1137, gender#1138, component#1133L], StorageLevel(disk, memory, deserialized, 1 replicas)
+- LocalTableScan [id#1136L, vattr#1137, gender#1138, component#1133L]
Sorry if I am just struggling to understand, but I think the `count` is necessary. If you want the output DataFrame to leverage the persisted child DataFrames in its query plan, you need to call an action on the output DataFrame before `unpersist` is called on those children. Without the `count` call, you will not use the cached versions of the child DataFrames when caching the output DataFrame.

I don't agree.

`cache` and `unpersist` are lazy in Spark, so the DataFrame is only marked for caching; it is not actually cached until some action is called. Without the `count` call, the action will always come after the child query plans have been unpersisted, so they will be recalculated by the engine. This defeats the purpose of those `persist` calls. I tried to add a test for this in my PR:
graphframes/src/test/scala/org/graphframes/lib/ConnectedComponentsSuite.scala, line 256 in d3bbb00
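The pattern being argued for can be sketched as follows (hypothetical method and DataFrames, not the actual ConnectedComponents code): materialize the result with an action while its cached inputs are still alive, and only then release them.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

// Sketch: finish an iterative algorithm without losing the benefit of the
// intermediate caches.
def finishRound(lastRound: DataFrame, intermediates: Seq[DataFrame]): DataFrame = {
  val output = lastRound.persist(StorageLevel.MEMORY_AND_DISK)
  output.count()                       // action: fills output's cache, reading the cached intermediates
  intermediates.foreach(_.unpersist()) // safe now: output is served from its own cache
  output
}
```

If the `count` is dropped, the `unpersist` calls run before any action, so the first user action recomputes everything from the last checkpoint.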
I believe this is an edge case: Spark is optimizing away the second child of the join because `ee` is an empty LocalRelation.
I believe the chain graph test is more representative because there are edges in the table. There you will see only the top-level InMemoryRelation when the `count` call is removed, and 16 when it is in place.
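One way to check this kind of claim programmatically, rather than by eyeballing `explain` output, is to count the in-memory scans in the physical plan. A sketch (InMemoryTableScanExec is Spark's physical operator for reading a cached relation; note that with AQE enabled the top-level plan is an AdaptiveSparkPlanExec, so the scans may only be fully visible after an action has run):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.columnar.InMemoryTableScanExec

// Count how many cached relations a DataFrame's physical plan reads from.
def inMemoryScans(df: DataFrame): Int =
  df.queryExecution.executedPlan.collect { case s: InMemoryTableScanExec => s }.size
```

Comparing this number with and without the `count` call would make the cache-miss argument concrete for the chain graph test.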
+1 to the `count` being necessary. I think it might be the case that the counts inside the loop aren't needed, as other actions like `_calcMinNbrSum` will trigger the DataFrame to cache. But in this case, at the end, since everything is being unpersisted, `output` will be completely recalculated from the last checkpoint when the user does something with it, with none of the intermediate caching.
Well, it's simple: let's test it with a large dataset and then see whether it takes longer or not.