fix(auto): shift scoring to quality-vs-junk model with stable dedupe (#348)
gildesmarais wants to merge 2 commits into master
Conversation
The branch was force-pushed from 21d83b7 to c8dc602.
Pull request overview
This PR updates the auto-source HTML scraping pipeline to rank candidates using a “quality vs junk” scoring model, introduces stable destination-based deduplication, and centralizes link/path heuristics to better suppress navigation/taxonomy/CTA noise across scrapers.
Changes:
- Add `LinkHeuristics` to normalize destinations and classify utility/taxonomy/vanity/recommended links consistently across scrapers.
- Update `SemanticHtml` to score (quality − junk), dedupe by destination (preferring stronger entries), and stably rank results.
- Extend specs to cover contamination control, ranking order, and dedupe behavior in both `SemanticHtml` and fallback `Html` scraping.
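The scoring-and-dedupe model described above can be sketched roughly as follows. Note this is a hypothetical illustration, not the PR's actual API: `Candidate` and `rank` are made-up names. Each candidate gets a score of quality minus junk, duplicates sharing a destination keep only the stronger entry, and the original index serves as a stable tiebreaker.

```ruby
# Hypothetical sketch of the quality-vs-junk ranking model; the real
# SemanticHtml implementation differs in detail.
Candidate = Struct.new(:url, :quality, :junk, :index) do
  def score = quality - junk # quality-vs-junk score
end

def rank(candidates)
  # Dedupe by destination, preferring the stronger (higher-scoring) entry.
  strongest = candidates.group_by(&:url).map { |_, dupes| dupes.max_by(&:score) }
  # Rank by score descending; the original index keeps ties stable.
  strongest.sort_by { |c| [-c.score, c.index] }
end
```

The stable tiebreaker matters because `sort_by` alone is not guaranteed stable in Ruby; including the index in the sort key makes the ordering deterministic across runs.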
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `lib/html2rss/auto_source/scraper/link_heuristics.rb` | New shared destination/text classification helper used by multiple scrapers. |
| `lib/html2rss/auto_source/scraper/semantic_html.rb` | Adds scoring model, destination dedupe, and stable ranking to semantic scraping. |
| `lib/html2rss/auto_source/scraper/semantic_html/anchor_selector.rb` | Switches anchor filtering/normalization to shared `LinkHeuristics` facts. |
| `lib/html2rss/auto_source/scraper/html.rb` | Filters fallback anchors using `LinkHeuristics` to reduce taxonomy/utility noise. |
| `spec/lib/html2rss/auto_source/scraper/semantic_html_ranking_spec.rb` | Adds extensive ranking/contamination/dedupe regression coverage for `SemanticHtml`. |
| `spec/lib/html2rss/auto_source/scraper/html_spec.rb` | Adds regression coverage for reduced noise in fallback HTML extraction. |
From `lib/html2rss/auto_source/scraper/link_heuristics.rb`:

```ruby
# frozen_string_literal: true

module Html2rss
  class AutoSource
    module Scraper
      ##
      # Shared link-level heuristics used by scraper-local selection and
      # scoring. This keeps normalization and route/text classification
      # consistent without moving scraper policy into higher orchestration.
      class LinkHeuristics # rubocop:disable Metrics/ClassLength
```
`LinkHeuristics` builds several `Set`s via `to_set`, but this file doesn't `require 'set'`. Since `to_set` is only defined when the stdlib `set` is loaded, this can raise `NoMethodError` depending on load order. Add an explicit `require 'set'` near the top of this file (or otherwise guarantee `set` is loaded before Zeitwerk eager-loads this class).
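A minimal illustration of the fix: `Enumerable#to_set` comes from the stdlib `set` library, so constants built with `to_set` only work once it is loaded. The constant name below is a hypothetical stand-in for the kind of lookup table `LinkHeuristics` builds, not the file's actual contents. (On Ruby 3.2+ `Set` is available by default, so the require mainly protects older Rubies and unusual load orders.)

```ruby
# frozen_string_literal: true

require 'set' # guarantees Enumerable#to_set regardless of load order

# UTILITY_TEXTS is a hypothetical stand-in for the kind of constant
# LinkHeuristics builds: a frozen Set for O(1) membership checks.
UTILITY_TEXTS = %w[login signup share print subscribe].to_set.freeze
```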
```ruby
        high_confidence_junk_path:,
        high_confidence_utility_destination:
      )
    rescue ArgumentError
```
`destination_facts` rescues `ArgumentError`, but `Html2rss::Url.from_relative` can raise `Addressable::URI::InvalidURIError` (see other call sites that rescue both). As written, a single malformed href could bubble up and abort scraping. Rescue `Addressable::URI::InvalidURIError` here as well (and consider broadening to the specific errors `Url.from_relative` can raise).

```diff
-rescue ArgumentError
+rescue ArgumentError, Addressable::URI::InvalidURIError
```
`Html2rss::Url.from_relative` should not expose `Addressable::*` errors to this code. Fence it and raise an appropriate URL-local error instead.

Remove Addressable from the code base everywhere except this gem's `Url` class.
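One way to fence the parser as suggested is to translate third-party exceptions into a gem-local error at the `Url` boundary, so callers never rescue parser-specific classes. The sketch below uses stdlib `URI` standing in for Addressable, and `InvalidUrlError` is a hypothetical name; it illustrates the pattern, not the gem's actual implementation.

```ruby
require 'uri'

module Html2rss
  class Url
    # Hypothetical gem-local error; callers rescue this instead of
    # parser-specific exception classes.
    class InvalidUrlError < StandardError; end

    # Sketch of a fenced from_relative: stdlib URI stands in for
    # Addressable here, but the translation pattern is the same.
    def self.from_relative(href, base)
      URI.join(base, href).to_s
    rescue URI::InvalidURIError, ArgumentError => e
      raise InvalidUrlError, "cannot resolve #{href.inspect}: #{e.message}"
    end
  end
end
```

With this shape, swapping the underlying parser later only touches the `Url` class, not every scraper's rescue clauses.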
```ruby
      end

      def article_signal_count(container, publish_signal:, descriptive_signal:, content_signal:)
        [article_container?(container), publish_signal, descriptive_signal, content_signal].count(true)
      end
```
`publish_marker?` returns a Nokogiri node (truthy), but `article_signal_count` uses `.count(true)`, which only counts elements equal to `true` (not truthy). This means the publish signal is never counted, making `weak_article_candidate` too aggressive and potentially hard-dropping real posts. Coerce signals to booleans (e.g. `!!publish_signal`) or count truthy values (`count(&:itself)`).

```diff
-[article_container?(container), publish_signal, descriptive_signal, content_signal].count(true)
+[article_container?(container), publish_signal, descriptive_signal, content_signal].count(&:itself)
```
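The difference is easy to see in isolation, using a plain object to stand in for the truthy Nokogiri node:

```ruby
# A truthy non-boolean, like the node publish_marker? returns.
node = Object.new

signals = [true, node, false, true]

# Array#count(arg) compares with ==, so the node is not counted.
equal_true = signals.count(true)     # => 2

# count(&:itself) counts every truthy element, including the node.
truthy = signals.count(&:itself)     # => 3
```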
From `lib/html2rss/auto_source/scraper/semantic_html/anchor_selector.rb`:

```ruby
      # scraping can reason about link intent instead of DOM order. It favors
      # heading-aligned article links and suppresses utility links, duplicate
      # destinations, and weak textless affordances.
      class AnchorSelector # rubocop:disable Metrics/ClassLength
```

Suggested change:

```diff
-class AnchorSelector # rubocop:disable Metrics/ClassLength
+class AnchorSelector
```
From `lib/html2rss/auto_source/scraper/html.rb`:

```ruby
      # This scraper is broader and noisier than `SemanticHtml`, so it acts as a
      # fallback for pages without stronger semantic signals.
      class Html # rubocop:disable Metrics/ClassLength
```

Suggested change:

```diff
-class Html # rubocop:disable Metrics/ClassLength
+class Html
```
From `lib/html2rss/auto_source/scraper/link_heuristics.rb`:

```ruby
      # Shared link-level heuristics used by scraper-local selection and
      # scoring. This keeps normalization and route/text classification
      # consistent without moving scraper policy into higher orchestration.
      class LinkHeuristics # rubocop:disable Metrics/ClassLength
```

Suggested change:

```diff
-class LinkHeuristics # rubocop:disable Metrics/ClassLength
+class LinkHeuristics
```