rjurney commented on Mar 13, 2026
- New, big, fancy, super, duper Pregel tutorial
- Moved Stack Exchange data content from Network Motif Finding Tutorial into Data Setup tutorial. Refer to it from both motif and Pregel tutorials.
- Point at new tutorial(s) from list of tutorials.
- New network motif and Pregel tutorial Jupyter notebooks
- Some other minor changes...
…s.txt and split out requirements-dev.txt. Version bumps.
…ney/build-upgrades
…ney/build-upgrades
@SemyonSinchenko I updated https://github.com/rjurney/graphframes/tree/rjurney/pypi-tutorials fixing all the issues save one I asked you about, but it isn't updating this PR or running tests... I guess it is slack today.
Okay, there it goes!
The Jupyter notebooks in python/graphframes/tutorials/notebooks/ reference images via relative paths like ../img/... which resolve to python/graphframes/tutorials/img/. However, the actual images live in docs/src/img/. Add a symlink python/graphframes/tutorials/img -> ../../../docs/src/img so all 15 image references in the notebooks (Network_Motif_Finding.ipynb and Pregel.ipynb) resolve correctly without duplicating files. Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
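A minimal sketch of the symlink described in this commit, assuming it is created from a script rather than with `ln -s`. The helper name `link_tutorial_images` and its `repo_root` parameter are hypothetical; only the two paths come from the commit message.

```python
import os
from pathlib import Path


def link_tutorial_images(repo_root: Path) -> Path:
    """Create python/graphframes/tutorials/img -> ../../../docs/src/img."""
    link = repo_root / "python" / "graphframes" / "tutorials" / "img"
    # A relative target is resolved from the directory that contains the link.
    target = os.path.join("..", "..", "..", "docs", "src", "img")
    # Replace a stale symlink or a regular file committed by mistake.
    if link.is_symlink() or link.is_file():
        link.unlink()
    os.symlink(target, link, target_is_directory=True)
    return link
```

Checking `link.is_symlink()` afterwards is a quick way to tell a real symlink from a regular file that merely contains the target path.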
Commit ec1b084 accidentally removed the connected components diagram from the Connected Components section of the Pregel tutorial. The SVG file exists at docs/src/img/pregel-diagrams/pregel-connected-components.svg and should be shown like all other example sections in the tutorial. The original commit was mislabeled as removing a 'Mermaid diagram' but actually removed a rendered SVG <img> tag. Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
The connected-components SVG renders with only 2 nodes and 0 edges (broken Mermaid render), making it useless in the tutorial. Reverting the prior mistaken restore; the image reference should remain absent from the Connected Components section. Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…syntax Chained undirected edges (A --- B --- C) fail to render in LR layout with some Mermaid versions, producing a broken SVG with only 2 nodes. Fix by using explicit two-node edge lines instead. Also restore the connected-components diagram in the docs tutorial (04-pregel-tutorial.md) which was previously removed due to the broken SVG. With the correct rendering (18 nodes, 3 supersteps), the diagram is now suitable for the tutorial. Update generate_diagrams.py to use the fixed Mermaid syntax with a comment explaining the workaround. Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
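The workaround above (explicit two-node edge lines instead of a chain) can be sketched as a small text transform. This is an illustration only, not the code in generate_diagrams.py; `expand_chained_edges` is a hypothetical helper and it assumes node names do not themselves contain the connector.

```python
def expand_chained_edges(line: str, connector: str = "---") -> list[str]:
    """Split a chained Mermaid edge into explicit two-node edges."""
    nodes = [part.strip() for part in line.split(connector)]
    if len(nodes) < 3:
        # Already a single edge (or not an edge at all): leave it alone.
        return [line.strip()]
    # Pair each node with its successor: A-B, B-C, ...
    return [f"{a} {connector} {b}" for a, b in zip(nodes, nodes[1:])]
```

For example, `expand_chained_edges("A --- B --- C")` yields the two lines `A --- B` and `B --- C`.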
In Jupyter notebooks and docs, <figcaption> rendered as unstyled block text indistinguishable from surrounding headings/paragraphs.
- Add a <style> block at the top of each notebook's first markdown cell to globally apply caption styling (smaller, italic, grey, centered) via the figcaption CSS selector
- Add equivalent inline style attributes to every <figcaption> in the Laika docs markdown files (04-pregel-tutorial.md and 02-motif-tutorial.md) for the static site build
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
These SVG files were accidentally generated by running generate_diagrams.py from the python/ working directory (causing relative output path to resolve as python/docs/src/img/ instead of docs/src/img/). Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
Replace relative 'docs/src/img/...' path with one resolved from __file__ so the script writes to the correct location regardless of cwd. Also includes minor SVG whitespace normalization from mmdc 0.4.1 re-render (stroke-width:3px vs stroke-width: 3px — functionally identical). Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
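A sketch of resolving the output directory from the script location, assuming the script lives at python/graphframes/tutorials/generate_diagrams.py as listed in the review summary. The function name `diagram_output_dir` is hypothetical.

```python
from pathlib import Path


def diagram_output_dir(script_path: Path) -> Path:
    """Return <repo>/docs/src/img for a script under python/graphframes/tutorials/."""
    # parents[0] = tutorials, [1] = graphframes, [2] = python, [3] = repo root
    repo_root = script_path.resolve().parents[3]
    return repo_root / "docs" / "src" / "img"


# Inside the script itself this would be used as:
#     OUTPUT_DIR = diagram_output_dir(Path(__file__))
```

Because the path is anchored on the script file, running from `python/` or from the repository root gives the same result.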
The previous approach used per-element inline style= attributes on the 15 figcaptions across the two tutorial docs files. That only covered existing files; any future doc page with a figcaption would be unstyled.
Replace with a proper Laika-level fix:
- Add docs/src/helium/custom.css with the figcaption rule
- Wire it into the Helium theme via .site.internalCSS() in LaikaCustoms.scala (required in Laika 1.0+; the old automatic CSS directory scanning is gone)
- Strip the now-redundant inline style= attributes from both tutorial docs
The <style> block in the notebooks (Jupyter rendering) is unchanged. It remains the correct mechanism for notebooks since they do not go through Laika.
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
…s/97K edges
Every mention of the ~130K nodes / ~97K edges figures now names the specific Stack Exchange dataset (stats.meta.stackexchange.com) so readers know exactly which archive produces those numbers. Updated in:
- docs/src/03-tutorials/04-pregel-tutorial.md
- docs/src/03-tutorials/02-motif-tutorial.md
- python/graphframes/tutorials/notebooks/Pregel.ipynb (cell 2)
- python/graphframes/tutorials/notebooks/Network_Motif_Finding.ipynb (cell 0)
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
<center> is a deprecated element that centers relative to its nearest
containing block. In Helium's layout that block spans the full page
(content column + both sidebars), so figures appeared shifted right.
Fix by:
- Adding figure { margin: 1em 0; text-align: center; } and
figure img { max-width: 100%; height: auto; } to custom.css and
the notebooks' <style> block — centering is now done via CSS
within the content column's own containing block
- Removing all 30 <center>/</center> wrappers from the two tutorial
docs and both notebooks; plain <figure> blocks are sufficient
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
Rename files to match requested order:
1. GraphFrames Tutorials (01-tutorials.md, unchanged)
2. Stack Exchange Data Setup (03-data-setup.md -> 02-data-setup.md)
3. Motif Tutorial (02-motif-tutorial.md -> 03-motif-tutorial.md)
4. Pregel Tutorial (04-pregel-tutorial.md, unchanged)
Update all cross-references in 9 files (docs + notebooks).
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #809 +/- ##
==========================================
- Coverage 84.94% 81.21% -3.74%
==========================================
Files 68 77 +9
Lines 3507 4263 +756
Branches 453 488 +35
==========================================
+ Hits 2979 3462 +483
- Misses 528 801 +273
Stripping <center> left <figure> indented 4 spaces. Markdown (and Laika / Jupyter's marked.js) treats 4-space-indented lines as indented code blocks, so all figures were rendering as raw HTML text in a dark code box instead of as images. Remove the 4-space indent from every <figure>...</figure> block in the two tutorial docs (03-motif-tutorial.md: 3 blocks, 04-pregel-tutorial.md: 12 blocks) and both notebooks (3 + 12 blocks). The first motif tutorial figure is restored to the original 4-node-directed-graphlets.png (the academic Aparicio et al. reference diagram); no Mermaid SVG replaces it. Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
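A sketch of how the indentation fix could be automated, assuming each figure is a self-contained `<figure>...</figure>` block whose opening tag starts an indented line. `dedent_figures` is a hypothetical helper, not something in the PR.

```python
import re
import textwrap

# An indented <figure> line, everything up to the matching </figure>, to end of line.
FIGURE_BLOCK = re.compile(
    r"^[ \t]+<figure>.*?</figure>[ \t]*$", re.DOTALL | re.MULTILINE
)


def dedent_figures(markdown: str) -> str:
    """Strip the common leading indentation from every indented <figure> block."""
    return FIGURE_BLOCK.sub(lambda m: textwrap.dedent(m.group(0)), markdown)
```

Only the common margin is removed, so any extra indentation of nested tags inside the figure is preserved while the opening `<figure>` moves to column 0.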
…play order
Renaming files for nav order breaks SEO (URL changes). Revert filenames back to their originals and instead use Laika's navigationOrder in directory.conf to control the display order independently of filenames:
1. GraphFrames Tutorials (01-tutorials.md)
2. Stack Exchange Data Setup (03-data-setup.md)
3. Motif Tutorial (02-motif-tutorial.md)
4. Pregel Tutorial (04-pregel-tutorial.md)
Also revert all cross-references in 9 files back to the original names.
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
The previous diagram showed 18 nodes across 3 supersteps in a flat LR
layout with no visual separation, making it hard to follow.
New strategy: three side-by-side panels (graph TB lays disconnected
subgraphs left-to-right) showing the minimum-label wave advancing
one hop per superstep across two separate components.
- Two components: {1,2,3} (3-node chain) and {4,5} (2-node pair)
- Node labels use vertex:componentLabel notation (e.g. 2:1 = vertex 2
now carries component label 1)
- Left panel: Superstep 0 — each vertex starts with its own ID
- Middle panel: Superstep 1 — minimum label has advanced one hop
- Right panel: Converged — all vertices share the minimum ID of
their component (1 or 4 respectively)
- Width bumped to 900px; caption updated to explain the notation
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a white background rect to rendered Mermaid SVGs so they display correctly on dark-themed pages. Change figcaption color from grey to white for readability. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
Adds a new Pregel tutorial (plus supporting scripts/assets) and reorganizes Stack Exchange dataset setup so both motif and Pregel tutorials share the same data flow.
Changes:
- Introduces a new Pregel tutorial script and adds a diagrams generator to produce Mermaid-based SVG assets.
- Refactors motif tutorial code to accept a `--data-dir` argument and updates docs to point to a shared “Data Setup” tutorial.
- Updates Python packaging metadata (deps/optional extras/dev tools) and documentation build configuration (Laika CSS hook, tutorial navigation ordering).
Reviewed changes
Copilot reviewed 24 out of 43 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| python/tests/test_graphframes.py | Formatting-only tweaks to existing tests. |
| python/pyproject.toml | Adds project deps/optional extras and bumps formatter/linter/test tool versions. |
| python/graphframes/tutorials/pregel.py | New end-to-end Pregel tutorial script with 7 examples. |
| python/graphframes/tutorials/motif.py | Refactors motif tutorial into a Click CLI with --data-dir. |
| python/graphframes/tutorials/img | Adds a path placeholder intended to reference docs images. |
| python/graphframes/tutorials/generate_diagrams.py | New script to render Mermaid diagrams into SVG assets. |
| python/graphframes/tutorials/download.py | Defaults tutorial data directory to package-relative path; updates CLI help text. |
| python/graphframes/console.py | Lazily registers the tutorial CLI command to avoid hard deps at import time. |
| python/graphframes/connect/proto/graphframes_pb2.py | Formatting change in generated protobuf module. |
| python/graphframes/connect/graphframes_client.py | Formatting changes in proto plan building logic. |
| python/docs/underscores.py | Formatting/quoting updates for Sphinx hook. |
| python/docs/epytext.py | Formatting/quoting updates for Sphinx hook. |
| python/docs/conf.py | Formatting/quoting updates for Sphinx config. |
| python/dev/build_jar.py | Minor formatting simplification. |
| project/LaikaCustoms.scala | Adds fallback benchmark config and injects custom Helium CSS. |
| docs/src/img/pregel-diagrams/pregel-shortest-paths.svg | Adds generated Pregel tutorial diagram asset. |
| docs/src/img/pregel-diagrams/pregel-reputation-propagation.svg | Adds generated Pregel tutorial diagram asset. |
| docs/src/img/pregel-diagrams/pregel-pagerank-iterations.svg | Adds generated Pregel tutorial diagram asset. |
| docs/src/img/pregel-diagrams/pregel-in-degree-pregel.svg | Adds generated Pregel tutorial diagram asset. |
| docs/src/img/pregel-diagrams/pregel-in-degree-am.svg | Adds generated Pregel tutorial diagram asset. |
| docs/src/img/pregel-diagrams/pregel-debug-trace.svg | Adds generated Pregel tutorial diagram asset. |
| docs/src/img/pregel-diagrams/pregel-bsp-model.svg | Adds generated Pregel tutorial diagram asset. |
| docs/src/img/motif-diagrams/motif-g5-stackexchange.svg | Adds generated motif tutorial diagram asset (currently broken). |
| docs/src/img/motif-diagrams/motif-g4-stackexchange.svg | Adds generated motif tutorial diagram asset. |
| docs/src/img/motif-diagrams/motif-g4-g5-triangles.svg | Adds generated motif tutorial diagram asset. |
| docs/src/img/motif-diagrams/motif-g30-stackexchange.svg | Adds generated motif tutorial diagram asset. |
| docs/src/img/motif-diagrams/motif-g30-opposed-3path.svg | Adds generated motif tutorial diagram asset. |
| docs/src/helium/custom.css | Adds CSS for figures/figcaptions in docs output. |
| docs/src/03-tutorials/directory.conf | Ensures a stable tutorials navigation order. |
| docs/src/03-tutorials/03-data-setup.md | New shared tutorial for Stack Exchange dataset download + Parquet conversion. |
| docs/src/03-tutorials/02-motif-tutorial.md | Removes embedded data setup steps and links to shared data setup tutorial. |
| docs/src/03-tutorials/01-tutorials.md | Adds links to the new Data Setup and Pregel tutorials. |
Comments suppressed due to low confidence (8)
python/pyproject.toml:1
- Several pinned tool/library versions appear to be non-existent (or at least newer than what package indexes are likely to provide), which will break installs/CI resolution (e.g., black 25.12.0 + required-version, pytest 9.x, isort 7.x, py7zr 1.1.0, click 8.3.1, pre-commit 4.5.1). Please verify these versions exist on PyPI and, if not, pin to released versions (ideally aligned with the repo’s supported Python/Spark matrix).
python/pyproject.toml:1
- Several pinned tool/library versions appear to be non-existent (or at least newer than what package indexes are likely to provide), which will break installs/CI resolution (e.g., black 25.12.0 + required-version, pytest 9.x, isort 7.x, py7zr 1.1.0, click 8.3.1, pre-commit 4.5.1). Please verify these versions exist on PyPI and, if not, pin to released versions (ideally aligned with the repo’s supported Python/Spark matrix).
python/graphframes/tutorials/download.py:1
- Defaulting downloads to a package-relative directory can fail in common installations where site-packages is read-only (and it can pollute the installed distribution even when writable). Prefer a user-writable default (e.g., XDG cache dir / ~/.cache, or a directory under the current working directory) and keep `--data-dir` as the escape hatch.
python/graphframes/tutorials/download.py:1
- Defaulting downloads to a package-relative directory can fail in common installations where site-packages is read-only (and it can pollute the installed distribution even when writable). Prefer a user-writable default (e.g., XDG cache dir / ~/.cache, or a directory under the current working directory) and keep `--data-dir` as the escape hatch.
python/graphframes/tutorials/img:1
- This looks like an attempted symlink to the docs image directory, but committing it as a regular file named `img` won’t behave like a directory and may break code/docs that expect `python/graphframes/tutorials/img/` to exist. If a symlink is intended, add it as a real symlink in git; otherwise create an actual directory and manage image assets via packaging configuration (e.g., include package data) or adjust references to point at the docs path.
python/graphframes/tutorials/pregel.py:1
- This authority propagation accumulates values across iterations in a way that re-sends User->Answer messages on the second iteration and then adds them again (`msg + authority`), which will inflate/double-count authorities (answers can receive from users again on iter 2). If the goal is a strict two-hop propagation (Users → Answers → Questions), consider splitting into two Pregel runs (stage 1 on User→Answer edges, stage 2 on Answer→Question edges) or change the update/send logic so each hop doesn’t re-accumulate previous-hop contributions.
python/graphframes/tutorials/pregel.py:1
- Using `F.coalesce(Pregel.msg(), F.lit(""))` causes `concat_ws` to include an empty-string element on the first iteration, producing traces like `' <- A'` instead of `'A'`. Use a conditional that returns just `id` when `Pregel.msg()` is null (or ensure the first element is null, not an empty string) so the initial trace format is correct.
python/graphframes/tutorials/generate_diagrams.py:1
- The docstring claims mmdc is PhantomJS-based; Mermaid tooling has generally moved away from PhantomJS. Please update this comment to reflect the actual runtime dependency used by the `mmdc`/converter you’re invoking (or remove the implementation detail) to avoid misleading setup/troubleshooting guidance.
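For the download.py comments above, one possible shape of a user-writable default is sketched here. The function name `default_data_dir` and the `graphframes/tutorials/data` subpath are assumptions for illustration; the PR does not specify them.

```python
import os
from pathlib import Path


def default_data_dir(app_name: str = "graphframes") -> Path:
    """Pick a user-writable default directory for tutorial downloads."""
    # Honour XDG_CACHE_HOME when set and non-empty, otherwise fall back to ~/.cache.
    base = os.environ.get("XDG_CACHE_HOME") or str(Path.home() / ".cache")
    return Path(base) / app_name / "tutorials" / "data"
```

A `--data-dir` option would still override this default, so the cache location only matters when the user passes nothing.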
| @@ -0,0 +1,930 @@ | |||
| # Pregel Tutorial | |||
|
|
|||
| This tutorial covers GraphFrames' @:pydoc(graphframes.lib.Pregel) API for developing scalable, iterative graph algorithms using **Apache Spark 4.0**. We will implement progressively complex algorithms — from simple degree counting to path-tracing algorithms — using the same Stack Exchange knowledge graph from the [Motif Finding Tutorial](02-motif-tutorial.md). | |||
Let's switch to Spark / PySpark 4.1
| <figure> | ||
| <img src="../img/Pregel-Compute-Dataflow.png" width="650px" alt="Pregel BSP Compute Dataflow" /> | ||
| <figcaption><a href="http://stanford.edu/~rezab/dao/">CME 323: Distributed Algorithms and Optimization, Reza Zadeh, Databricks and Stanford</a></figcaption> | ||
| </figure> |
|
|
||
| ## Why Pregel? | ||
|
|
||
| You might wonder: when do I need Pregel instead of GraphFrames' built-in algorithms like `pageRank()` or `connectedComponents()`? |
This is a very confusing question. `connectedComponents` (algo graphx) and `pageRank` are the same Pregel under the hood. I would like to rephrase this and explain that Pregel is like a top-level API, with the built-ins being ready-to-use implementations on top of it. Can we maybe start by explaining that 2/3 of the built-ins in GraphFrames are Pregel?
|
|
||
| **Automatic optimization**: GraphFrames analyzes your message expressions to determine if the destination vertex state is actually needed. If your messages only reference `Pregel.src()` and `Pregel.edge()` columns (not `Pregel.dst()`), the implementation skips the second join entirely — a significant performance optimization for algorithms like PageRank. | ||
|
|
||
| Understanding this implementation helps you write better Pregel algorithms. For example: |
I would like to mention another important thing here: early stopping. There is an additional optimization (skipping messages from non-active vertices), and there are a few convergence options:
- iterations-based (a simple for i in range loop)
- all messages are null (optional, non-free!)
- voting (optional, non-free!)
The point is that options 2 and 3 can improve convergence and performance if applied in the right place, but can degrade performance if applied when not required...
That is the relevant part of docs: https://graphframes.io/04-user-guide/10-pregel.html#termination-conditions
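To make the difference between the iterations-based and the "all messages are null" conditions concrete, here is a toy single-machine loop in plain Python. It is not GraphFrames' implementation and uses none of its API; `connected_components` and its `early_stop` flag are hypothetical names. The graph is the two-component example from the diagram commit above ({1,2,3} and {4,5}).

```python
def connected_components(edges, vertices, max_iter=20, early_stop=True):
    """Minimum-label propagation in a toy BSP loop.

    Returns (labels, supersteps_run).
    """
    labels = {v: v for v in vertices}
    neighbours = {v: set() for v in vertices}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    supersteps = 0
    for _ in range(max_iter):
        supersteps += 1
        # Send phase: a message exists only if it would lower the receiver's label.
        inbox = {}
        for v in vertices:
            for n in neighbours[v]:
                if labels[v] < labels[n]:
                    inbox[n] = min(inbox.get(n, labels[v]), labels[v])
        # "All messages are null": nothing left to propagate, so stop.
        if early_stop and not inbox:
            break
        # Update phase: each vertex keeps the smallest label it has seen.
        for v, msg in inbox.items():
            labels[v] = min(labels[v], msg)
    return labels, supersteps


edges = [(1, 2), (2, 3), (4, 5)]
vertices = [1, 2, 3, 4, 5]
stopped = connected_components(edges, vertices, early_stop=True)
fixed = connected_components(edges, vertices, early_stop=False)
```

Both calls return the same labels, but `stopped` runs 3 supersteps (two that change labels, one that finds nothing to send) while `fixed` runs all 20. In this toy the emptiness check is free; in a distributed setting it is the part the comment above calls non-free.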
| +---------+-----+ | ||
| ``` | ||
|
|
||
| Most nodes have zero in-degree (they are source-only nodes like Votes that cast votes but don't receive edges). The distribution follows a power law, which is typical for real-world networks. The power law means that a few nodes have very high in-degree (popular questions with hundreds of votes, prolific users with thousands of badges) while the vast majority have low in-degree. |
1) To explain the power law, it would be nice to plot the distribution on a log-log scale (where it becomes a straight line).
2) This is important for Pregel: a power law is a performance killer. Due to the BSP model, all the vertices wait until the supernode finishes its work... Maybe let's mention it here? Users should always keep the degree distribution in mind.
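A small numeric sketch of the first point, using a synthetic distribution rather than the tutorial's data: for count(k) = C * k^(-alpha), log(count) is linear in log(k) with slope -alpha. The helper `loglog_slope` and the constants below are made up for illustration.

```python
import math


def loglog_slope(degrees, counts):
    """Least-squares slope of log(count) against log(degree)."""
    xs = [math.log(k) for k in degrees]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic power law: count(k) = 10000 * k^-2.5 (not the Stack Exchange data)
degrees = list(range(1, 101))
counts = [10000 * k ** -2.5 for k in degrees]
slope = loglog_slope(degrees, counts)
```

Here `slope` recovers the exponent, about -2.5. Zero-degree vertices have to be excluded before taking logs, which matters for a graph where most nodes have in-degree zero.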
|
|
||
| The convergence speed of connected components depends on the **diameter** of the graph — the longest shortest path between any two connected vertices. In the worst case (a linear chain of N vertices), it takes N-1 supersteps. In practice, real-world graphs have small diameters due to the [small-world property](https://en.wikipedia.org/wiki/Small-world_network), so convergence is fast. | ||
|
|
||
| The early stopping optimization is crucial here. Without it, Pregel would run all 20 iterations even after convergence. With it, the algorithm halts as soon as the minimum labels stop propagating — which often happens in 5-10 iterations for real-world social graphs. |
Again, early stopping is not working in your code! Either update it or remove this confusing text.
|
|
||
| The early stopping optimization is crucial here. Without it, Pregel would run all 20 iterations even after convergence. With it, the algorithm halts as soon as the minimum labels stop propagating — which often happens in 5-10 iterations for real-world social graphs. | ||
|
|
||
| You can verify convergence speed by adding some logging. Run with a few different `maxIter` values and compare the results — you'll find the same component labels regardless of whether you set `maxIter` to 10, 20, or 100, as long as it's high enough. |
Did you really try it? Judging from the code, you will always run `maxIter` iterations...
| <figure> | ||
| <img src="../img/pregel-diagrams/pregel-shortest-paths.svg" width="700px" alt="Shortest paths propagation" /> | ||
| <figcaption>Shortest paths from source A: distances propagate outward, one hop per superstep</figcaption> | ||
| </figure> |
|
|
||
| **Conditional messaging**: The `F.when(Pregel.src("distance") < F.lit(INF), ...)` guard ensures that only vertices with finite distances send messages. Vertices that haven't been reached yet don't waste computation sending infinity + 1. `null` messages are automatically filtered by Pregel. | ||
|
|
||
| **Bidirectional again**: We send messages in both directions to treat the graph as undirected for reachability. If you want directed shortest paths, use only `sendMsgToDst`. |
The same story here: duplicating edges will allow you to apply an optimization and skip the dst state
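The idea in this comment can be illustrated with a plain-Python toy, assuming the same `INF = 999999` sentinel the tutorial text uses. Edges are duplicated in the reverse direction up front, so every message flows src to dst and is built from the source state alone. `shortest_paths` is a hypothetical function, not the tutorial's code or the GraphFrames API.

```python
INF = 999999  # same sentinel value the tutorial text describes


def shortest_paths(edges, vertices, source, max_iter=20):
    """Hop-count shortest paths on an undirected view of a directed edge list."""
    # Duplicate every edge in reverse so messages only ever go src -> dst.
    directed = set(edges) | {(b, a) for a, b in edges}
    dist = {v: (0 if v == source else INF) for v in vertices}
    for _ in range(max_iter):
        inbox = {}
        for src, dst in directed:
            # The message depends on the source state only; dst state is not read.
            # Unreached vertices (distance INF) send nothing.
            if dist[src] < INF:
                candidate = dist[src] + 1
                inbox[dst] = min(inbox.get(dst, candidate), candidate)
        changed = False
        for v, msg in inbox.items():
            if msg < dist[v]:
                dist[v] = msg
                changed = True
        if not changed:
            break
    return dist


result = shortest_paths(
    [("A", "B"), ("B", "C"), ("D", "C")], ["A", "B", "C", "D", "E"], "A"
)
```

In `result`, D is reached at distance 3 through the reversed D->C edge, and the isolated vertex E keeps the `INF` sentinel. The cost of this approach is a doubled edge list, traded against never needing the destination state when building messages.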
|
|
||
| The `INF` value we use (999999) is a practical choice. A more mathematically pure implementation would use `None`/null, but nulls in Spark column expressions require more careful handling with `coalesce`. The large integer approach is simpler and works correctly as long as your graph has fewer than 999,999 vertices in any path — which is true for virtually all real-world graphs. | ||
|
|
||
| ## Advanced: Memory Optimization |
I would like to add a trick here about skipping the dst state, because it skips one of the two biggest joins.

