[#159] Fix non-deterministic ID assignment #195
Codecov Report
@@ Coverage Diff @@
## master #195 +/- ##
=========================================
+ Coverage 86.78% 88.29% +1.51%
=========================================
Files 23 23
Lines 757 743 -14
Branches 59 59
=========================================
- Hits 657 656 -1
+ Misses 100 87 -13
Continue to review full report at Codecov.
```scala
      .persist(StorageLevel.MEMORY_AND_DISK)
    vertices.select(col(ID), nestAsCol(vertices, ATTR))
      .join(withLongIds, ID)
      .select(LONG_ID, ID, ATTR)
```
I wonder if it is worth the extra effort to optimize for the newer Spark releases. If the non-uniqueness of monotonically_increasing_id only affects earlier versions, perhaps we should create a wrapper for it instead of penalizing all versions with an extra repartition?
The issue is not monotonically_increasing_id but the input DataFrame: its record ordering is not deterministic, so even with a correct monotonically_increasing_id implementation we would not get a correct result.
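To make that concrete, here is a minimal sketch of a deterministic assignment, using placeholder column names ("id", "long_id") rather than the PR's actual constants: hash-repartitioning by the ID and sorting within partitions makes the row layout depend only on the ID values, after which monotonically_increasing_id yields stable results.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, monotonically_increasing_id}

// Sketch only: "id" and "long_id" are placeholder column names.
def withLongIds(vertices: DataFrame): DataFrame =
  vertices.select(col("id")).distinct()
    // Hash-repartition by ID: a row's partition now depends only on its
    // ID value, not on the non-deterministic input ordering.
    .repartition(col("id"))
    // Pin the row order inside each partition.
    .sortWithinPartitions(col("id"))
    // With partitioning and order fixed, the generated IDs are stable.
    .withColumn("long_id", monotonically_increasing_id())
```

The join shown in the diff above then attaches these stable long IDs back to the full attribute rows.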
```scala
assertComponents(components0, expected)
assert(!isFromCheckpoint(components0),
  "The result shouldn't depend on checkpoint data if checkpointing is disabled.")
if (isLaterVersion("2.0")) {
```
Could you leave an inline comment explaining why we skipped this for 1.6?
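For illustration, the requested inline comment might look like the sketch below; the stated rationale is an assumption about pre-2.0 checkpoint behavior, not something confirmed in this thread.

```scala
// NOTE (assumed rationale, to be confirmed): checkpointing support differs
// before Spark 2.0, so this checkpoint-dependent assertion runs on 2.0+ only.
if (isLaterVersion("2.0")) {
```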
We used a Spark cluster to run two benchmarks, Connected Components and PageRank (one iteration), once for the current development version (with PR-195) and once for the previous release version.
[benchmark result tables not recovered from the original images]
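For reference, a comparison like this can be timed with a small harness along the following lines; this is a hedged sketch, not the actual script used here, and it assumes a pre-loaded GraphFrame supplied by the caller.

```scala
import org.graphframes.GraphFrame

// Hypothetical benchmark sketch; `graph` is a caller-supplied GraphFrame.
def benchmark(graph: GraphFrame): Unit = {
  def time[A](label: String)(block: => A): A = {
    val start = System.nanoTime()
    val result = block
    println(s"$label took ${(System.nanoTime() - start) / 1e9} s")
    result
  }
  time("Connected Components") {
    graph.connectedComponents.run().count() // count() forces evaluation
  }
  time("PageRank, one iteration") {
    graph.pageRank.maxIter(1).run().vertices.count()
  }
}
```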
@phi-dbq Thanks for running the performance tests. Now I'm more comfortable with the trade-off: seconds vs. correctness.
This is a follow-up task from [#189].
It removes SQLHelpers.zipWithUniqueId, which is no longer needed.
We will run scalability tests and address any potential issues.
In the end, we will make a bug-fix release based on the changes in this PR.
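For context, helpers of that name typically wrap RDD.zipWithUniqueId roughly as sketched below; the signature and output column name are assumptions, not the removed implementation.

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Sketch of a typical zipWithUniqueId helper (assumed shape). The IDs it
// produces depend on partition layout and row order, the same kind of
// non-determinism this PR removes, which is why the helper can be dropped.
def zipWithUniqueId(df: DataFrame): DataFrame = {
  val withIds = df.rdd.zipWithUniqueId().map { case (row, id) =>
    Row.fromSeq(row.toSeq :+ id)
  }
  val schema = StructType(
    df.schema.fields :+ StructField("uniqueId", LongType, nullable = false))
  df.sparkSession.createDataFrame(withIds, schema)
}
```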