doc: Pregel Tutorial by rjurney · Pull Request #809 · graphframes/graphframes

rjurney · 2026-03-13T06:47:19Z

New, big, fancy, super, duper Pregel tutorial
Moved the Stack Exchange data content from the Network Motif Finding Tutorial into the Data Setup tutorial, which both the motif and Pregel tutorials now reference.
Point at new tutorial(s) from list of tutorials.
New network motif and Pregel tutorial Jupyter notebooks
Some other minor changes...

…s.txt and split out requirements-dev.txt. Version bumps.

…ney/build-upgrades

rjurney · 2026-03-14T02:58:35Z

@SemyonSinchenko I updated https://github.com/rjurney/graphframes/tree/rjurney/pypi-tutorials fixing all the issues save one I asked you about, but it isn't updating this PR or running tests... I guess it is slack today.

rjurney · 2026-03-14T02:58:51Z

Okay, there it goes!

The Jupyter notebooks in python/graphframes/tutorials/notebooks/ reference images via relative paths like ../img/... which resolve to python/graphframes/tutorials/img/. However, the actual images live in docs/src/img/. Add a symlink python/graphframes/tutorials/img -> ../../../docs/src/img so all 15 image references in the notebooks (Network_Motif_Finding.ipynb and Pregel.ipynb) resolve correctly without duplicating files. Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
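To reproduce the link from Python rather than the shell, a minimal sketch follows. The helper name `link_tutorial_images` and its `repo_root` parameter are hypothetical; only the link location and target come from the commit message above.

```python
import os
from pathlib import Path


def link_tutorial_images(repo_root: Path) -> Path:
    """Create python/graphframes/tutorials/img as a relative symlink to docs/src/img.

    Hypothetical helper for illustration; not part of the GraphFrames codebase.
    """
    link = repo_root / "python" / "graphframes" / "tutorials" / "img"
    # The target is interpreted relative to the directory that contains the link,
    # so three ".." hops climb from tutorials/ back to the repository root.
    target = os.path.join("..", "..", "..", "docs", "src", "img")
    if not link.is_symlink():
        os.symlink(target, link, target_is_directory=True)
    return link
```

With the link in place, a notebook reference such as ../img/... resolves into docs/src/img/ without copying any files.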

Commit ec1b084 accidentally removed the connected components diagram from the Connected Components section of the Pregel tutorial. The SVG file exists at docs/src/img/pregel-diagrams/pregel-connected-components.svg and should be shown like all other example sections in the tutorial. The original commit was mislabeled as removing a 'Mermaid diagram' but actually removed a rendered SVG <img> tag. Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

The connected-components SVG renders with only 2 nodes and 0 edges (broken Mermaid render), making it useless in the tutorial. Reverting the prior mistaken restore; the image reference should remain absent from the Connected Components section. Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

…syntax
Chained undirected edges (A --- B --- C) fail to render in LR layout with some Mermaid versions, producing a broken SVG with only 2 nodes. Fix by using explicit two-node edge lines instead.
Also restore the connected-components diagram in the docs tutorial (04-pregel-tutorial.md), which was previously removed due to the broken SVG. With the correct rendering (18 nodes, 3 supersteps), the diagram is now suitable for the tutorial.
Update generate_diagrams.py to use the fixed Mermaid syntax with a comment explaining the workaround.
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
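The workaround can be sketched as a small generator that emits one two-node edge per line. The function name `mermaid_undirected` is hypothetical and is not taken from generate_diagrams.py; it only illustrates the explicit-edge style the commit describes.

```python
from typing import Iterable, Tuple


def mermaid_undirected(edges: Iterable[Tuple[str, str]], direction: str = "LR") -> str:
    """Render undirected edges as Mermaid source, one two-node edge per line."""
    lines = [f"graph {direction}"]
    for src, dst in edges:
        # Emit "A --- B" and "B --- C" separately instead of chaining "A --- B --- C".
        lines.append(f"    {src} --- {dst}")
    return "\n".join(lines)


print(mermaid_undirected([("A", "B"), ("B", "C")]))
```

For the two edges above, the output is a `graph LR` header followed by the lines `A --- B` and `B --- C`.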

In Jupyter notebooks and docs, <figcaption> rendered as unstyled block text indistinguishable from surrounding headings/paragraphs.
- Add a <style> block at the top of each notebook's first markdown cell to globally apply caption styling (smaller, italic, grey, centered) via the figcaption CSS selector
- Add equivalent inline style attributes to every <figcaption> in the Laika docs markdown files (04-pregel-tutorial.md and 02-motif-tutorial.md) for the static site build
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

These SVG files were accidentally generated by running generate_diagrams.py from the python/ working directory (causing relative output path to resolve as python/docs/src/img/ instead of docs/src/img/). Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Replace relative 'docs/src/img/...' path with one resolved from __file__ so the script writes to the correct location regardless of cwd. Also includes minor SVG whitespace normalization from mmdc 0.4.1 re-render (stroke-width:3px vs stroke-width: 3px — functionally identical). Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
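A minimal sketch of resolving the output directory from the script location, assuming the script sits at python/graphframes/tutorials/generate_diagrams.py as listed in this PR. The helper name `diagram_output_dir` is hypothetical; the actual code in the commit may differ.

```python
from pathlib import Path


def diagram_output_dir(script_path: str) -> Path:
    """Return <repo>/docs/src/img for a script at <repo>/python/graphframes/tutorials/."""
    script = Path(script_path).resolve()
    # parents[0] = tutorials/, [1] = graphframes/, [2] = python/, [3] = repository root
    repo_root = script.parents[3]
    return repo_root / "docs" / "src" / "img"
```

Inside the script this would be called as `diagram_output_dir(__file__)`, so the result no longer depends on the current working directory.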

The previous approach used per-element inline style= attributes on the 15 figcaptions across the two tutorial docs files. That only covered existing files; any future doc page with a figcaption would be unstyled. Replace with a proper Laika-level fix:
- Add docs/src/helium/custom.css with the figcaption rule
- Wire it into the Helium theme via .site.internalCSS() in LaikaCustoms.scala (required in Laika 1.0+; the old automatic CSS directory scanning is gone)
- Strip the now-redundant inline style= attributes from both tutorial docs
The <style> block in the notebooks (Jupyter rendering) is unchanged; it remains the correct mechanism for notebooks since they do not go through Laika.
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

…s/97K edges
Every mention of the ~130K nodes / ~97K edges figures now names the specific Stack Exchange dataset (stats.meta.stackexchange.com) so readers know exactly which archive produces those numbers. Updated in:
- docs/src/03-tutorials/04-pregel-tutorial.md
- docs/src/03-tutorials/02-motif-tutorial.md
- python/graphframes/tutorials/notebooks/Pregel.ipynb (cell 2)
- python/graphframes/tutorials/notebooks/Network_Motif_Finding.ipynb (cell 0)
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

<center> is a deprecated element that centers relative to its nearest containing block. In Helium's layout that block spans the full page (content column + both sidebars), so figures appeared shifted right. Fix by:
- Adding figure { margin: 1em 0; text-align: center; } and figure img { max-width: 100%; height: auto; } to custom.css and the notebooks' <style> block, so centering is now done via CSS within the content column's own containing block
- Removing all 30 <center>/</center> wrappers from the two tutorial docs and both notebooks; plain <figure> blocks are sufficient
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

Rename files to match requested order:
1. GraphFrames Tutorials (01-tutorials.md, unchanged)
2. Stack Exchange Data Setup (03-data-setup.md -> 02-data-setup.md)
3. Motif Tutorial (02-motif-tutorial.md -> 03-motif-tutorial.md)
4. Pregel Tutorial (04-pregel-tutorial.md, unchanged)
Update all cross-references in 9 files (docs + notebooks).
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

codecov-commenter · 2026-03-15T23:05:46Z


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 81.21%. Comparing base (f1db6f4) to head (957fb9c).
⚠️ Report is 5 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #809      +/-   ##
==========================================
- Coverage   84.94%   81.21%   -3.74%     
==========================================
  Files          68       77       +9     
  Lines        3507     4263     +756     
  Branches      453      488      +35     
==========================================
+ Hits         2979     3462     +483     
- Misses        528      801     +273


Stripping <center> left each <figure> indented 4 spaces. Markdown (including Laika and Jupyter's marked.js) treats 4-space-indented lines as indented code blocks, so all figures were rendering as raw HTML text in a dark code box instead of as images. Remove the 4-space indent from every <figure>...</figure> block in the two tutorial docs (03-motif-tutorial.md: 3 blocks, 04-pregel-tutorial.md: 12 blocks) and both notebooks (3 + 12 blocks). The first motif tutorial figure is restored to the original 4-node-directed-graphlets.png (the academic Aparicio et al. reference diagram); no Mermaid SVG replaces it.
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
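The indent removal could be scripted along the following lines. This is a sketch only: the function name `dedent_figure_blocks` is hypothetical, and it strips all leading whitespace inside a figure block, not exactly four spaces.

```python
def dedent_figure_blocks(markdown: str) -> str:
    """Strip leading whitespace from every line of each <figure>...</figure> block."""
    out = []
    inside = False
    for line in markdown.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("<figure"):
            inside = True
        # Lines outside a figure block are passed through untouched.
        out.append(stripped if inside else line)
        if inside and "</figure>" in stripped:
            inside = False
    return "\n".join(out)
```

Indented lines outside a figure block are left alone, so genuine indented code samples in the tutorials would keep rendering as code.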

…play order
Renaming files for nav order breaks SEO (URL changes). Revert filenames back to their originals and instead use Laika's navigationOrder in directory.conf to control the display order independently of filenames:
1. GraphFrames Tutorials (01-tutorials.md)
2. Stack Exchange Data Setup (03-data-setup.md)
3. Motif Tutorial (02-motif-tutorial.md)
4. Pregel Tutorial (04-pregel-tutorial.md)
Also revert all cross-references in 9 files back to the original names.
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>

The previous diagram showed 18 nodes across 3 supersteps in a flat LR layout with no visual separation, making it hard to follow. New strategy: three side-by-side panels (graph TB lays disconnected subgraphs left-to-right) showing the minimum-label wave advancing one hop per superstep across two separate components.
- Two components: {1,2,3} (3-node chain) and {4,5} (2-node pair)
- Node labels use vertex:componentLabel notation (e.g. 2:1 = vertex 2 now carries component label 1)
- Left panel: Superstep 0, where each vertex starts with its own ID
- Middle panel: Superstep 1, where the minimum label has advanced one hop
- Right panel: Converged, where all vertices share the minimum ID of their component (1 or 4 respectively)
- Width bumped to 900px; caption updated to explain the notation
Co-authored-by: Russell Jurney <rjurney@users.noreply.github.com>
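The three panels can be reproduced with a plain-Python simulation of minimum-label propagation on the same two components. This is an illustrative sketch, not the GraphFrames Pregel API and not the tutorial's code; the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def superstep(labels: Dict[int, int], edges: List[Tuple[int, int]]) -> Dict[int, int]:
    """One synchronous superstep: each vertex keeps the minimum of its own
    label and the labels its neighbours held at the start of the step."""
    new_labels = dict(labels)
    for a, b in edges:
        # Undirected edge: labels flow in both directions.
        new_labels[a] = min(new_labels[a], labels[b])
        new_labels[b] = min(new_labels[b], labels[a])
    return new_labels


def connected_components(
    vertices: List[int], edges: List[Tuple[int, int]]
) -> List[Dict[int, int]]:
    """Return the label assignment after every superstep until nothing changes."""
    history = [{v: v for v in vertices}]  # superstep 0: each vertex holds its own ID
    while True:
        nxt = superstep(history[-1], edges)
        if nxt == history[-1]:
            return history
        history.append(nxt)


history = connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
for step, labels in enumerate(history):
    # Print in the diagram's vertex:componentLabel notation.
    print(step, " ".join(f"{v}:{c}" for v, c in sorted(labels.items())))
```

The run yields three label assignments, matching the left, middle, and right panels: own IDs, then 3:2 after one hop, then 3:1 once the chain has converged.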

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add a white background rect to rendered Mermaid SVGs so they display correctly on dark-themed pages. Change figcaption color from grey to white for readability. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

docs/src/03-tutorials/03-data-setup.md

docs/src/helium/custom.css

project/LaikaCustoms.scala

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot

Pull request overview

Adds a new Pregel tutorial (plus supporting scripts/assets) and reorganizes Stack Exchange dataset setup so both motif and Pregel tutorials share the same data flow.

Changes:

Introduces a new Pregel tutorial script and adds a diagrams generator to produce Mermaid-based SVG assets.
Refactors motif tutorial code to accept a --data-dir argument and updates docs to point to a shared “Data Setup” tutorial.
Updates Python packaging metadata (deps/optional extras/dev tools) and documentation build configuration (Laika CSS hook, tutorial navigation ordering).

Reviewed changes

Copilot reviewed 24 out of 43 changed files in this pull request and generated 1 comment.


File	Description
python/tests/test_graphframes.py	Formatting-only tweaks to existing tests.
python/pyproject.toml	Adds project deps/optional extras and bumps formatter/linter/test tool versions.
python/graphframes/tutorials/pregel.py	New end-to-end Pregel tutorial script with 7 examples.
python/graphframes/tutorials/motif.py	Refactors motif tutorial into a Click CLI with `--data-dir`.
python/graphframes/tutorials/img	Adds a path placeholder intended to reference docs images.
python/graphframes/tutorials/generate_diagrams.py	New script to render Mermaid diagrams into SVG assets.
python/graphframes/tutorials/download.py	Defaults tutorial data directory to package-relative path; updates CLI help text.
python/graphframes/console.py	Lazily registers the tutorial CLI command to avoid hard deps at import time.
python/graphframes/connect/proto/graphframes_pb2.py	Formatting change in generated protobuf module.
python/graphframes/connect/graphframes_client.py	Formatting changes in proto plan building logic.
python/docs/underscores.py	Formatting/quoting updates for Sphinx hook.
python/docs/epytext.py	Formatting/quoting updates for Sphinx hook.
python/docs/conf.py	Formatting/quoting updates for Sphinx config.
python/dev/build_jar.py	Minor formatting simplification.
project/LaikaCustoms.scala	Adds fallback benchmark config and injects custom Helium CSS.
docs/src/img/pregel-diagrams/pregel-shortest-paths.svg	Adds generated Pregel tutorial diagram asset.
docs/src/img/pregel-diagrams/pregel-reputation-propagation.svg	Adds generated Pregel tutorial diagram asset.
docs/src/img/pregel-diagrams/pregel-pagerank-iterations.svg	Adds generated Pregel tutorial diagram asset.
docs/src/img/pregel-diagrams/pregel-in-degree-pregel.svg	Adds generated Pregel tutorial diagram asset.
docs/src/img/pregel-diagrams/pregel-in-degree-am.svg	Adds generated Pregel tutorial diagram asset.
docs/src/img/pregel-diagrams/pregel-debug-trace.svg	Adds generated Pregel tutorial diagram asset.
docs/src/img/pregel-diagrams/pregel-bsp-model.svg	Adds generated Pregel tutorial diagram asset.
docs/src/img/motif-diagrams/motif-g5-stackexchange.svg	Adds generated motif tutorial diagram asset (currently broken).
docs/src/img/motif-diagrams/motif-g4-stackexchange.svg	Adds generated motif tutorial diagram asset.
docs/src/img/motif-diagrams/motif-g4-g5-triangles.svg	Adds generated motif tutorial diagram asset.
docs/src/img/motif-diagrams/motif-g30-stackexchange.svg	Adds generated motif tutorial diagram asset.
docs/src/img/motif-diagrams/motif-g30-opposed-3path.svg	Adds generated motif tutorial diagram asset.
docs/src/helium/custom.css	Adds CSS for figures/figcaptions in docs output.
docs/src/03-tutorials/directory.conf	Ensures a stable tutorials navigation order.
docs/src/03-tutorials/03-data-setup.md	New shared tutorial for Stack Exchange dataset download + Parquet conversion.
docs/src/03-tutorials/02-motif-tutorial.md	Removes embedded data setup steps and links to shared data setup tutorial.
docs/src/03-tutorials/01-tutorials.md	Adds links to the new Data Setup and Pregel tutorials.

Comments suppressed due to low confidence (8)

python/pyproject.toml:1

Several pinned tool/library versions appear to be non-existent (or at least newer than what package indexes are likely to provide), which will break installs/CI resolution (e.g., black 25.12.0 + required-version, pytest 9.x, isort 7.x, py7zr 1.1.0, click 8.3.1, pre-commit 4.5.1). Please verify these versions exist on PyPI and, if not, pin to released versions (ideally aligned with the repo’s supported Python/Spark matrix).
python/pyproject.toml:1
Several pinned tool/library versions appear to be non-existent (or at least newer than what package indexes are likely to provide), which will break installs/CI resolution (e.g., black 25.12.0 + required-version, pytest 9.x, isort 7.x, py7zr 1.1.0, click 8.3.1, pre-commit 4.5.1). Please verify these versions exist on PyPI and, if not, pin to released versions (ideally aligned with the repo’s supported Python/Spark matrix).
python/graphframes/tutorials/download.py:1
Defaulting downloads to a package-relative directory can fail in common installations where site-packages is read-only (and it can pollute the installed distribution even when writable). Prefer a user-writable default (e.g., XDG cache dir / ~/.cache, or a directory under the current working directory) and keep --data-dir as the escape hatch.
python/graphframes/tutorials/download.py:1
Defaulting downloads to a package-relative directory can fail in common installations where site-packages is read-only (and it can pollute the installed distribution even when writable). Prefer a user-writable default (e.g., XDG cache dir / ~/.cache, or a directory under the current working directory) and keep --data-dir as the escape hatch.
python/graphframes/tutorials/img:1
This looks like an attempted symlink to the docs image directory, but committing it as a regular file named img won’t behave like a directory and may break code/docs that expect python/graphframes/tutorials/img/ to exist. If a symlink is intended, add it as a real symlink in git; otherwise create an actual directory and manage image assets via packaging configuration (e.g., include package data) or adjust references to point at the docs path.
python/graphframes/tutorials/pregel.py:1
This authority propagation accumulates values across iterations in a way that re-sends User->Answer messages on the second iteration and then adds them again (msg + authority), which will inflate/double-count authorities (answers can receive from users again on iter 2). If the goal is a strict two-hop propagation (Users → Answers → Questions), consider splitting into two Pregel runs (stage 1 on User→Answer edges, stage 2 on Answer→Question edges) or change the update/send logic so each hop doesn’t re-accumulate previous-hop contributions.
python/graphframes/tutorials/pregel.py:1
Using F.coalesce(Pregel.msg(), F.lit("")) causes concat_ws to include an empty-string element on the first iteration, producing traces like ' <- A' instead of 'A'. Use a conditional that returns just id when Pregel.msg() is null (or ensure the first element is null, not an empty string) so the initial trace format is correct.
python/graphframes/tutorials/generate_diagrams.py:1
The docstring claims mmdc is PhantomJS-based; Mermaid tooling has generally moved away from PhantomJS. Please update this comment to reflect the actual runtime dependency used by the mmdc/converter you’re invoking (or remove the implementation detail) to avoid misleading setup/troubleshooting guidance.
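The trace-format comment on pregel.py above can be illustrated without Spark. The sketch below emulates the relevant concat_ws behaviour (nulls are skipped, empty strings are kept) in plain Python; the function names are hypothetical, and the argument order (message first, vertex id second) is an assumption inferred from the ' <- A' example.

```python
from typing import Optional


def concat_ws(sep: str, *parts: Optional[str]) -> str:
    """Plain-Python stand-in for Spark's concat_ws: None is skipped, "" is kept."""
    return sep.join(p for p in parts if p is not None)


def trace_with_coalesce(msg: Optional[str], vertex_id: str) -> str:
    # Mirrors coalesce(msg, lit("")): a missing message becomes an empty string.
    return concat_ws(" <- ", msg if msg is not None else "", vertex_id)


def trace_with_null(msg: Optional[str], vertex_id: str) -> str:
    # Leaves the missing message as None so concat_ws drops it entirely.
    return concat_ws(" <- ", msg, vertex_id)


print(repr(trace_with_coalesce(None, "A")))
print(repr(trace_with_null(None, "A")))
```

The first call prints ' <- A' and the second prints 'A', which is the difference the review comment describes for the first iteration.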


docs/src/helium/custom.css

SemyonSinchenko · 2026-03-28T08:12:24Z