[#159] Fix non-deterministic ID assignment #195
Codecov Report
@@ Coverage Diff @@
## master #195 +/- ##
=========================================
+ Coverage 86.78% 88.29% +1.51%
=========================================
Files 23 23
Lines 757 743 -14
Branches 59 59
=========================================
- Hits 657 656 -1
+ Misses 100 87 -13
Continue to review full report at Codecov.
```scala
      .persist(StorageLevel.MEMORY_AND_DISK)
    vertices.select(col(ID), nestAsCol(vertices, ATTR))
      .join(withLongIds, ID)
      .select(LONG_ID, ID, ATTR)
```
I wonder if it is worth the extra effort to optimize for the newer Spark releases. If the non-uniqueness of monotonically_increasing_id only affects earlier versions, perhaps we should create a wrapper for it instead of penalizing all versions with an extra repartition?
The issue is not monotonically_increasing_id but the input DataFrame: its record ordering is not deterministic, so even with a correct monotonically_increasing_id implementation we would not get a correct result.
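To make that concrete, here is a minimal sketch of a deterministic assignment, using placeholder column names ("id", "long_id") rather than the PR's actual constants: hash-repartitioning by the ID and sorting within partitions makes the row layout depend only on the ID values, after which monotonically_increasing_id yields stable results.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, monotonically_increasing_id}

// Sketch only: "id" and "long_id" are placeholder column names.
def withLongIds(vertices: DataFrame): DataFrame =
  vertices.select(col("id")).distinct()
    // Hash-repartition by ID: a row's partition now depends only on its
    // ID value, not on the non-deterministic input ordering.
    .repartition(col("id"))
    // Pin the row order inside each partition.
    .sortWithinPartitions(col("id"))
    // With partitioning and order fixed, the generated IDs are stable.
    .withColumn("long_id", monotonically_increasing_id())
```

The join shown in the diff above then attaches these stable long IDs back to the full attribute rows.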
```scala
assertComponents(components0, expected)
assert(!isFromCheckpoint(components0),
  "The result shouldn't depend on checkpoint data if checkpointing is disabled.")
if (isLaterVersion("2.0")) {
```
Could you leave an inline comment explaining why we skipped this for 1.6?
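For illustration, the requested inline comment might look like the sketch below; the stated rationale is an assumption about pre-2.0 checkpoint behavior, not something confirmed in this thread.

```scala
// NOTE (assumed rationale, to be confirmed): checkpointing support differs
// before Spark 2.0, so this checkpoint-dependent assertion runs on 2.0+ only.
if (isLaterVersion("2.0")) {
```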
We used a Spark cluster to run two benchmarks, Connected Components and PageRank (one iteration), once for the current development version (with PR-195) and once for the previous release version.
[benchmark result tables not recovered from the original images]
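For reference, a comparison like this can be timed with a small harness along the following lines; this is a hedged sketch, not the actual script used here, and it assumes a pre-loaded GraphFrame supplied by the caller.

```scala
import org.graphframes.GraphFrame

// Hypothetical benchmark sketch; `graph` is a caller-supplied GraphFrame.
def benchmark(graph: GraphFrame): Unit = {
  def time[A](label: String)(block: => A): A = {
    val start = System.nanoTime()
    val result = block
    println(s"$label took ${(System.nanoTime() - start) / 1e9} s")
    result
  }
  time("Connected Components") {
    graph.connectedComponents.run().count() // count() forces evaluation
  }
  time("PageRank, one iteration") {
    graph.pageRank.maxIter(1).run().vertices.count()
  }
}
```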
@phi-dbq Thanks for running the performance tests. Now I'm more comfortable with the trade-off: seconds vs. correctness.
This is a follow-up task from [#189].
It removes SQLHelpers.zipWithUniqueId, which is no longer needed.
We will run scalability tests and address any potential issues.
In the end, we will make a bug-fix release based on the changes in this PR.
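For context, helpers of that name typically wrap RDD.zipWithUniqueId roughly as sketched below; the signature and output column name are assumptions, not the removed implementation.

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Sketch of a typical zipWithUniqueId helper (assumed shape). The IDs it
// produces depend on partition layout and row order, the same kind of
// non-determinism this PR removes, which is why the helper can be dropped.
def zipWithUniqueId(df: DataFrame): DataFrame = {
  val withIds = df.rdd.zipWithUniqueId().map { case (row, id) =>
    Row.fromSeq(row.toSeq :+ id)
  }
  val schema = StructType(
    df.schema.fields :+ StructField("uniqueId", LongType, nullable = false))
  df.sparkSession.createDataFrame(withIds, schema)
}
```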