Add automated docs & notebooks freshness + normalization checks #3228

Merged
deruyter92 merged 43 commits into main from cy/automated-docs&nb-report
Mar 20, 2026

Collaborator

@C-Achard C-Achard commented Mar 5, 2026

Summary / Purpose

This PR introduces a safe-by-default automation tool to improve release confidence by continuously tracking the “freshness” and validity of our docs and notebooks. Specifically, it:

  • Scans docs (docs/**/*.md) and notebooks (examples/COLAB/**/*.ipynb, examples/JUPYTER/**/*.ipynb, and docs/**/*.ipynb)
  • Captures two complementary signals:
    • last_git_updated: computed from git history (most recent commit touching the file)
    • last_verified: human-controlled “verified correct/working” date (missing treated as a warning initially)
  • Validates notebooks using nbformat.validate and detects when a notebook is not normalized to our canonical write format (warns with notebook_not_normalized)
  • Produces machine-readable and human-readable output reports to make drift visible and actionable

The intent is to reduce regressions and staleness, without blocking development or rewriting content in CI by default.
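
As an illustration of the last_git_updated signal, here is a minimal sketch of how the date of the most recent commit touching a file could be read from git. The helper names git_log_args and last_git_updated are hypothetical and not taken from the tool; only the git flags themselves (-1, --format=%cI, an explicit HEAD ref, and the -- path separator) are standard.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def git_log_args(path: str, ref: str = "HEAD") -> List[str]:
    """Build the git command for the newest commit touching `path` (hypothetical helper)."""
    # %cI prints the committer date in strict ISO 8601 format.
    # The explicit ref and the "--" separator keep git from confusing paths with revisions.
    return ["git", "log", "-1", "--format=%cI", ref, "--", path]


def last_git_updated(path: str, repo_root: Path) -> Optional[str]:
    """Return the ISO date string of the last commit touching `path`, or None."""
    try:
        proc = subprocess.run(
            git_log_args(path),
            cwd=repo_root,
            capture_output=True,
            text=True,
            check=False,
        )
    except OSError:
        # git is not installed or repo_root does not exist.
        return None
    out = proc.stdout.strip()
    return out if proc.returncode == 0 and out else None
```

In a shallow clone the reported date can be wrong, since history beyond the fetched depth is not available.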

Related

For PRs from other linting efforts, see:

Design Overview

Safety-first approach (default read-only)

The tool is designed to be non-invasive in CI:

  • report and check are read-only modes intended for PR CI and scheduled jobs.
  • update --write is an explicit opt-in mode intended for maintainers (local runs or dedicated maintenance PRs).

Minimal and predictable notebook handling

For notebooks:

  • Use nbformat for read/validate/write (notebooks are “notebook-native” objects).
  • If writing, restrict updates to top-level notebook metadata only under the metadata.deeplabcut namespace (no cell/output changes).
  • Detect formatting drift by comparing on-disk content with nbformat.writes(..., indent=2): if mismatch → warning notebook_not_normalized
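
The drift check above boils down to "re-serialize and compare". The sketch below shows the idea using only the standard library: json.dumps is a stand-in for the nbformat writer, so the exact canonical form here is an assumption and will not match nbformat's output byte for byte. The names canonical_text and is_normalized are hypothetical.

```python
import json


def canonical_text(nb: dict) -> str:
    """Serialize a notebook dict in one fixed format (stand-in for the nbformat writer)."""
    return json.dumps(nb, indent=2, sort_keys=True, ensure_ascii=False) + "\n"


def is_normalized(raw_text: str) -> bool:
    """True when the on-disk text already equals its canonical re-serialization."""
    try:
        nb = json.loads(raw_text)
    except json.JSONDecodeError:
        # Unparseable content cannot be normalized; a real tool would report an error here.
        return False
    return raw_text == canonical_text(nb)
```

A notebook that fails this comparison would only get the notebook_not_normalized warning; nothing is rewritten in read-only modes.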

Schemas & config

Uses Pydantic v2 models with an explicit schema_version to keep report/config evolvable.
Behavior is controlled via a YAML config (docs_and_notebooks_report_config.yml) including include/exclude globs and policy thresholds.
Policy “ratcheting” is supported through allowlists:
require_metadata, require_recent_verification, and require_notebook_normalized (all empty by default)

This will let us set priority targets and deadlines that must not go out-of-date for CI to pass.
Specifics of implementation and separation of concerns for PRs will have to be decided as well.

What’s Included

New tool

  • .github/tools/docs_and_notebooks_check.py
    • report: generate JSON/MD report
    • check: report + policy enforcement (currently allowlist-only)
    • update --write: update last_git_updated (and optionally verification fields) — explicit opt-in
  • Notebook validation via nbformat.validate
  • Notebook canonical formatting drift detection (notebook_not_normalized warning)

New configuration

  • .github/tools/docs_and_notebooks_report_config.yml
  • Includes scan patterns for examples/COLAB, examples/JUPYTER, and docs
  • Excludes _build, build, .ipynb_checkpoints
  • Sets warning thresholds to 365 days
  • Treats missing last_verified as a warning (until consensus-driven tiering/ratcheting)
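
To make the threshold behavior concrete, here is a small sketch of how the two date signals could be turned into warnings with a 365-day cutoff. The function name and the warning codes (content_stale, missing_last_verified, verification_stale) are illustrative assumptions, not necessarily the codes the tool emits.

```python
from datetime import date, timedelta
from typing import List, Optional


def freshness_warnings(
    last_git_updated: Optional[date],
    last_verified: Optional[date],
    today: date,
    max_age_days: int = 365,
) -> List[str]:
    """Return warning codes for one file (codes are hypothetical)."""
    warnings: List[str] = []
    cutoff = today - timedelta(days=max_age_days)
    if last_git_updated is not None and last_git_updated < cutoff:
        warnings.append("content_stale")
    if last_verified is None:
        # Missing verification is only a warning in the permissive phase.
        warnings.append("missing_last_verified")
    elif last_verified < cutoff:
        warnings.append("verification_stale")
    return warnings
```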

CI integration (standalone job)

Adds a dedicated workflow/job that runs:

  • python .github/tools/docs_and_notebooks_check.py report
  • python .github/tools/docs_and_notebooks_check.py check

Uploads the output artifacts (docs_nb_checks.json / .md) for easy review
Uses fetch-depth: 0 so git dates are accurate

Repo maintenance tools

  • .gitignore updated to ignore generated outputs:
    • **/tmp/docs_notebooks_status/

Non-goals

  • No automatic rewriting of notebooks/docs in PR CI (read-only by default).
  • No tier auto-assignment: tier exists in schema but is intentionally left unset unless the project agrees on definitions.
  • Not trying to guarantee that every notebook executes end-to-end in CI (separate, costlier dimension).

How To Use

  • See included README
  • See new pre-commit hook

Pre-commit / Developer Workflow (optional but recommended)

This PR includes a pre-commit hook so contributors get fast feedback when editing notebooks/docs.
The hook should run the script in check mode on touched files.

Policy & Ratcheting Plan (Future-proofing)

This PR intentionally starts permissive and allows checks to become stricter over time without breaking existing workflows:

  • Phase 1 (now): warn only (no CI failures) for missing metadata / missing last_verified / notebook normalization drift.
  • Phase 2: populate allowlists for “high priority” docs/notebooks:
    • require_metadata: must have DLC metadata/frontmatter
    • require_recent_verification: must have recent last_verified
    • require_notebook_normalized: must match canonical nbformat writes
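
The ratcheting idea can be sketched as follows: a warning only becomes a violation when the file is listed in the matching allowlist, so empty allowlists never fail. This is a simplified illustration that uses a dataclass in place of the Pydantic models, assumes allowlist entries are exact relative paths, and uses hypothetical warning codes for the first two rules.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Policy:
    """Allowlists named as in the config; all empty by default."""
    require_metadata: Set[str] = field(default_factory=set)
    require_recent_verification: Set[str] = field(default_factory=set)
    require_notebook_normalized: Set[str] = field(default_factory=set)


def violations(policy: Policy, warnings_by_path: Dict[str, List[str]]) -> List[str]:
    """Promote a warning to a violation only for allowlisted paths."""
    rules = {
        "missing_metadata": policy.require_metadata,  # hypothetical code
        "verification_stale": policy.require_recent_verification,  # hypothetical code
        "notebook_not_normalized": policy.require_notebook_normalized,
    }
    found: List[str] = []
    for path, codes in sorted(warnings_by_path.items()):
        for code in codes:
            if path in rules.get(code, set()):
                found.append(f"{path}: {code}")
    return found
```

With this shape, moving from Phase 1 to Phase 2 is a config-only change: adding a path to an allowlist is what turns its warnings into failures.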

Testing / Verification

  • Ran report locally and inspected generated docs_nb_checks.md
  • Ran check locally (no unexpected failures)
  • Verified notebook validation works (nbformat errors appear under errors)
  • Confirmed notebook_not_normalized warning appears for non-canonical notebooks
  • CI workflow runs successfully and uploads artifacts

Future Actions & Extensions (Ideas)

  • Notebook tiering / ownership metadata (once consensus exists).
  • Integrate with release process: require verification for key docs/notebooks for RCs / stable releases.

Reviewer Notes / Expected Diff Characteristics

If update --write (or notebook normalization) is run, some notebooks may show larger diffs due to nbformat canonical formatting.
After the initial normalization, diffs should remain small and predictable, similar to other linting in pre-commit hooks.
The tool is designed to avoid touching notebook cells/outputs; only metadata is modified.

C-Achard added 6 commits March 5, 2026 11:08
Introduce a new CLI tool (.github/tools/docs_and_notebooks_check.py) to scan notebooks and Markdown docs for staleness and verification metadata under the 'deeplabcut' namespace. Adds a default YAML config (.github/tools/docs_and_notebooks_report_config.yml), a README for the tool (.github/tools/docs_and_notebooks_tool_README.md), and an output ignore entry in .gitignore. The tool uses pydantic schemas, computes last_git_updated from git history, reads/writes notebook top-level metadata and Markdown frontmatter (idempotent updates), and supports report/check/update modes. Outputs machine- and human-readable reports (nb_docs_status.json / .md). Requires pydantic and PyYAML; designed to be safe-by-default for CI (read-only unless --write is passed).
Introduce a GitHub Actions workflow to scan docs and notebooks for staleness. The workflow runs on push and PRs to main, checks out full git history, uses Python 3.12, installs pydantic and pyyaml, and runs a read-only staleness report and an optional policy check using .github/tools/docs_and_notebooks_check.py with tools/staleness_config.yml. Results (JSON/MD) are uploaded as the staleness-report artifact. Workflow is limited to content read permissions and has a 10-minute timeout.
Rename OUTPUT_FILENAME from 'nb_docs_status' to 'docs_nb_checks' and use it for the default --out-dir (tmp/docs_nb_checks). Update the README to show the check command as a fenced code block and clarify allowlist behavior. Update .gitignore to ignore the new tmp/docs_nb_checks path.
Ensure DLC metadata is JSON-serializable by converting date/datetime fields to ISO strings and preserving exclude_none behavior. Uses pydantic v2 API (model_dump(mode="json", exclude_none=True)) and falls back to pydantic v1 via json.loads(meta.json(...)). Adds a docstring and clarifying comments. This prevents json.dumps from failing when writing .ipynb files and keeps compatibility across pydantic versions.
Update docs-and-notebooks tool to use nbformat for reading/writing notebooks, validate .ipynb files, and detect whether notebooks are normalized. Add notebook_is_normalized helper and ensure write_ipynb_meta uses nbformat.writes/validate. Introduce a new policy field require_notebook_normalized (and add it to the report config defaults) and enforce it to emit violations when notebooks are not normalized. Also update CI job to install nbformat and pin pydantic, and update the script header notes to list the new dependency. These changes let CI detect invalid or non-normalized notebooks and reduce formatting churn when normalizing files.
Add a local pre-commit hook 'dlc-docs-notebooks-check' that runs .github/tools/docs_and_notebooks_check.py to check DLC docs and notebooks for staleness, validate nbformat, and perform normalization. The hook targets Jupyter and Markdown files, passes filenames to the script, and declares additional dependencies (pydantic>=2,<3, pyyaml, nbformat>=5).
@C-Achard C-Achard added this to the Centralize linting update milestone Mar 5, 2026
@C-Achard C-Achard self-assigned this Mar 5, 2026
@C-Achard C-Achard added enhancement New feature or request COLAB Jupyter related to jupyter notebooks documentation documentation updates/comments CI Related to CI/CD jobs and automated testing labels Mar 5, 2026
@C-Achard C-Achard requested a review from Copilot March 5, 2026 10:58
Contributor

Copilot AI left a comment

Pull request overview

Adds an automated “docs + notebooks freshness” scanning tool and wires it into CI/pre-commit to surface stale or invalid content (git-updated date, human “last_verified”, nbformat validation, and notebook canonical-format drift).

Changes:

  • Introduces .github/tools/docs_and_notebooks_check.py + YAML config to scan docs/notebooks, emit JSON/MD reports, and (optionally) update metadata in-place.
  • Adds a GitHub Actions workflow to run report + check and upload report artifacts.
  • Adds an (optional) pre-commit hook and ignores generated report output under tmp/.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 8 comments.

Show a summary per file
| File | Description |
| --- | --- |
| .pre-commit-config.yaml | Adds local hook intended to run the checks on touched Markdown/notebooks. |
| .gitignore | Ignores generated report output directory under tmp/. |
| .github/workflows/docs_and_notebooks_checks.yml | Adds CI job to run report/check and upload artifacts. |
| .github/tools/docs_and_notebooks_tool_README.md | Documents the tool’s purpose, metadata locations, and usage. |
| .github/tools/docs_and_notebooks_report_config.yml | Defines scan include/exclude globs and policy thresholds/allowlists. |
| .github/tools/docs_and_notebooks_check.py | Implements scanning/reporting/checking/updating logic using git + nbformat + pydantic. |
Comments suppressed due to low confidence (1)

.github/workflows/docs_and_notebooks_checks.yml:43

  • The comment says check mode “will fail only once you populate allowlists”, but docs_and_notebooks_check.py also returns non-zero if any notebooks produce errors (e.g., nbformat validation failures). If the intent is non-blocking until allowlists are set, either adjust the script’s exit-code behavior or update this comment to reflect that errors can already fail the job.
      # Optional: run check mode (will fail only once you populate allowlists in config)
      - name: Run staleness policy check (optional gate)
        run: |
          python .github/tools/docs_and_notebooks_check.py check \
          --config .github/tools/docs_and_notebooks_report_config.yml --out-dir tmp/docs_nb_checks

C-Achard added 4 commits March 5, 2026 14:38
Use docs_and_notebooks_report_config.yml as the default config and resolve it relative to the script. Rename machine/human report outputs to docs_nb_checks.{json,md}. Add an optional --targets argument to the report and check subcommands; scan_files now accepts a targets list and filters scanned paths to only those targets. Make --config default a string path and adjust error-exit logic so parsing errors don't cause a non-zero exit in report mode. Minor doc/formatting tweaks.
Update references and examples in .github/tools/docs_and_notebooks_tool_README.md: change config reference to .github/tools/docs_and_notebooks_report_config.yml, update report output paths to tmp/docs_nb_checks/..., simplify the example 'check' command, and replace usages of tools/staleness.py with .github/tools/docs_and_notebooks_check.py in the update/example commands. Also tidy the 'Writes' section formatting.
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 6 changed files in this pull request and generated 5 comments.


Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
C-Achard added 2 commits March 6, 2026 11:38
Rename .github/tools/docs_and_notebooks_* to tools/ and update references. Updated workflow (.github/workflows/docs_and_notebooks_checks.yml) and pre-commit config to call tools/docs_and_notebooks_check.py and use tools/docs_and_notebooks_report_config.yml, updated the tool script's internal docs and the README paths, and tweaked the workflow name to "Docs & notebooks freshness and formatting checks".
Update GitHub Actions workflow to use actions/checkout@v6 and modify the editable install command used in the test step. Replaces "pip install -e .[dev]" with "pip install -e . --group dev" before running pytest, aligning the workflow with the newer checkout action and the revised dependency installation syntax.
Update CI workflows and developer tooling: bump actions/checkout and setup-python usages (checkout@v6, setup-python@v6), upgrade peaceiris/actions-gh-pages to v4, and update codespell workflow. Simplify python-package workflow to install dev extras once using `--group dev` and remove the duplicate install. Adjust pre-commit config to pass args to the name-tests-test hook. Fix dependency declarations in pyproject.toml (move/clean up pydantic and nbformat entries, remove duplicate pydantic line). Reformat tests/tools/docs_and_notebooks_checks/test_check_contracts.py for readability (multi-line calls, string quote consistency) and simplify one assertion.
Only set last_metadata_updated when an actual file write will occur (not unconditionally). For notebooks and markdown files the metadata stamping and merge now happen only when write is performed; when not writing, the in-memory record.meta is still updated so warnings and reports remain accurate. Rename the summary/report field and related text from "git_stale" to "content_stale" and adjust wording from "last_git_updated" to "last_git_touched / last_content_updated". A new import (from curses import meta) was also added.
Build the desired metadata without mutating last_metadata_updated and use a base "desired_base" when comparing/merging. Only set meta.last_metadata_updated (and produce the final merged metadata) if an actual file write will occur. Apply the same logic to both notebooks and markdown frontmatter, simplify variable names, and remove redundant in-branch rec.meta assignments so the record is consistently updated once at the end.
Drop an unused/errant `from curses import meta` import and remove a redundant pre-metadata call to `write_ipynb_meta`. Notebooks are now written once after merging/updating DLC metadata (comment updated accordingly), reducing unnecessary I/O and avoiding potential name conflicts.
Pass an explicit "HEAD" ref to git log calls and add --fixed-strings to the grep invocation. This prevents Git from misinterpreting the path or commit selector ordering and ensures the META_COMMIT_MARKER is treated as a literal string (not a regex). Changes applied to git_last_touched and git_last_content_updated to make commit/date lookups more robust.
Add a reusable _git_log_date helper to build git log args and return parsed dates, and make git_last_touched delegate to it. Improve _parse_git_iso_date to handle plain ISO dates and trailing Z timezone markers. Update git_last_content_updated to call the new helper with grep/invert-grep args to skip META_COMMIT_MARKER and fall back to raw last-touched date when needed.
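
For reference, the date-parsing behavior described in the last commit (accepting plain ISO dates and a trailing Z) could look roughly like the sketch below. It is an assumption about the approach, not the tool's actual _parse_git_iso_date; the function name is changed to make that clear.

```python
from datetime import date, datetime
from typing import Optional


def parse_git_iso_date(text: str) -> Optional[date]:
    """Parse git's ISO 8601 output into a date; return None when it cannot be parsed."""
    value = text.strip()
    if not value:
        return None
    # datetime.fromisoformat only accepts a trailing "Z" on newer Python versions,
    # so rewrite it as an explicit UTC offset first.
    if value.endswith(("Z", "z")):
        value = value[:-1] + "+00:00"
    try:
        # Accepts both "YYYY-MM-DD" and full timestamps with an offset.
        return datetime.fromisoformat(value).date()
    except ValueError:
        return None
```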
@C-Achard C-Achard marked this pull request as ready for review March 11, 2026 12:04
@C-Achard C-Achard requested a review from deruyter92 March 11, 2026 12:04
Split the hook command into a simple entry plus explicit args and replace types_or with a files regex to more precisely target docs, examples (JUPYTER/COLAB) and tools .md/.ipynb files. This improves pre-commit handling of the CLI arguments and ensures the hook only runs on relevant files; pass_filenames and additional_dependencies remain unchanged.
Clarify README language around metadata commits and notebook normalization: explicitly note git correctly reports rewritten files as “updated now”, require ack flag text capitalization, add a warning about future changes to the metadata-commit marker, rephrase allowlists line to indicate they are currently empty, promote the notebook note to IMPORTANT and reword for clarity, and make minor troubleshooting wording/capitalization tweaks.
Comment on lines +974 to +976
if args.cmd in {"report", "check"}:
    records = scan_files(repo_root, cfg, targets=getattr(args, "targets", None))

Collaborator

Does this actually fail when parsing errors are found? Am I understanding correctly that in check mode the script exits with success even when rec.errors are found? Is that the intended behaviour?

Collaborator Author

Yes, I clarified the exact behavior a bit and added a "strict" mode that fails on any scan error.
Let me know what you think!

Update docs_and_notebooks_check.py to continue reporting scan/parsing errors without failing checks by default and now provide an opt-in strict mode. Key changes:

- Clarify behavior in top-level docs and Markdown output: scan errors are reported as non-fatal by default and labeled as "scan errors" in summaries.
- Add a policy config option fail_on_scan_errors (default false) and a CLI --strict-mode flag which forces check to fail on scan/parsing errors.
- Introduce collect_scan_issues helper to aggregate scan errors/warnings and print a brief summary to console.
- Adjust check command help text and check flow to respect the combined strict-mode/config setting; when strict, scan errors cause non-zero exit.
- Rename suggested commit message constant to SUGGESTED_TAGGED_COMMIT and update printed suggestions.
- Minor UX improvements: more explanatory messages after report generation and limited preview of scan errors.

All changes are confined to tools/docs_and_notebooks_check.py and focus on behavior and messaging around scan error handling and strictness.
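
To summarize the exit-code behavior described above, here is a minimal sketch. The function check_exit_code is hypothetical; strict_mode mirrors the --strict-mode flag and fail_on_scan_errors mirrors the config option, with either one being enough to make scan errors fatal.

```python
def check_exit_code(
    violations: int,
    scan_errors: int,
    *,
    strict_mode: bool = False,
    fail_on_scan_errors: bool = False,
) -> int:
    """Return the process exit code for check mode (illustrative only)."""
    if violations > 0:
        # Policy violations always fail the check.
        return 1
    if scan_errors > 0 and (strict_mode or fail_on_scan_errors):
        # Scan/parsing errors fail only when strictness is requested.
        return 1
    return 0
```

By default this keeps scan errors visible in the report while leaving the job green, which matches the non-blocking intent of the first phase.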
Enforce pydantic v2 APIs and tighten frontmatter parsing/validation and error reporting. Changes include:

- Require pydantic>=2 and nbformat>=5 in docs.
- read_md_frontmatter now returns an optional error string and reports unterminated or non-mapping frontmatter.
- Propagate frontmatter parse errors into scan/update flows (mark invalid_metadata and add explicit error codes).
- read_ipynb_meta returns the raw DLC metadata and preserves presence flag; notebook metadata handling clarified.
- Use model_dump(mode='json') / model_validate everywhere (remove pydantic v1 fallbacks).
- Simplify meta_to_jsonable to assume pydantic v2 output.
- Add future-date validation for last_content_updated/last_metadata_updated/last_verified and report errors.
- Improve enforcement logic to treat invalid metadata separately from missing metadata and to avoid false passes.
- Fail-safe handling in update mode when frontmatter is invalid; adjust report header and JSON output path writing.
- Minor whitespace/cleanup and reordering of strict_mode evaluation.

Overall this makes metadata parsing more strict, yields clearer diagnostics for frontmatter issues, and migrates code paths to pydantic v2.
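
A rough sketch of the three-value return described for read_md_frontmatter is shown below. It only handles the --- delimiters and the unterminated case; YAML parsing and the non-mapping check are left out to keep it dependency-free. The function name, the raw-text return type, and the error code frontmatter_unterminated are assumptions for illustration.

```python
from typing import Optional, Tuple


def split_frontmatter(text: str) -> Tuple[Optional[str], str, Optional[str]]:
    """Return (frontmatter_text, body, error) for a Markdown document."""
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].strip() != "---":
        # No frontmatter at all is not an error.
        return None, text, None
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "".join(lines[1:i]), "".join(lines[i + 1:]), None
    # Opening delimiter without a closing one: report it and leave the text untouched.
    return None, text, "frontmatter_unterminated"
```

Returning the error as a value rather than raising lets the scan continue and record the problem per file.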
Adapt tests to recent API changes: replace SUGGESTED_META_COMMIT_MESSAGE with SUGGESTED_TAGGED_COMMIT, and update calls to read_md_frontmatter to unpack a third return value (fm, body, _). Keeps tests aligned with the tool's renamed constant and modified function signature.
@C-Achard C-Achard requested review from AlexEMG and MMathisLab March 19, 2026 12:40
Member

@MMathisLab MMathisLab left a comment

lgtm!

@deruyter92 deruyter92 merged commit 9c74d49 into main Mar 20, 2026
17 checks passed
@deruyter92 deruyter92 deleted the cy/automated-docs&nb-report branch March 20, 2026 16:11

Labels

CI Related to CI/CD jobs and automated testing COLAB documentation documentation updates/comments enhancement New feature or request Jupyter related to jupyter notebooks

4 participants