
fix(auto): shift scoring to quality-vs-junk model with stable dedupe #348

Draft
gildesmarais wants to merge 2 commits into master from feat/post-scoring

Conversation

@gildesmarais
Member

No description provided.

Contributor

Copilot AI left a comment

Pull request overview

This PR updates the auto-source HTML scraping pipeline to rank candidates using a “quality vs junk” scoring model, introduces stable destination-based deduplication, and centralizes link/path heuristics to better suppress navigation/taxonomy/CTA noise across scrapers.

Changes:

  • Add LinkHeuristics to normalize destinations and classify utility/taxonomy/vanity/recommended links consistently across scrapers.
  • Update SemanticHtml to score (quality − junk), dedupe by destination (preferring stronger entries), and stably rank results.
  • Extend specs to cover contamination control, ranking order, and dedupe behavior in both SemanticHtml and fallback Html scraping.
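The "score quality minus junk, dedupe by destination preferring stronger entries, rank stably" behavior can be sketched roughly as follows. This is a hypothetical illustration, not the gem's actual API; `Candidate` and `rank` are invented names:

```ruby
# Hypothetical stand-in for a scraped article candidate.
Candidate = Struct.new(:url, :quality, :junk, keyword_init: true) do
  def score
    quality - junk # quality-vs-junk scoring model
  end
end

def rank(candidates)
  candidates
    .group_by(&:url)                             # dedupe by destination...
    .map { |_url, dupes| dupes.max_by(&:score) } # ...preferring the stronger entry
    .each_with_index                             # remember input position
    .sort_by { |cand, idx| [-cand.score, idx] }  # stable: ties keep input order
    .map(&:first)
end
```

Grouping before sorting makes the dedupe independent of DOM order, and the secondary index key keeps equally scored candidates in their original order.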

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
lib/html2rss/auto_source/scraper/link_heuristics.rb New shared destination/text classification helper used by multiple scrapers.
lib/html2rss/auto_source/scraper/semantic_html.rb Adds scoring model, destination dedupe, and stable ranking to semantic scraping.
lib/html2rss/auto_source/scraper/semantic_html/anchor_selector.rb Switches anchor filtering/normalization to shared LinkHeuristics facts.
lib/html2rss/auto_source/scraper/html.rb Filters fallback anchors using LinkHeuristics to reduce taxonomy/utility noise.
spec/lib/html2rss/auto_source/scraper/semantic_html_ranking_spec.rb Adds extensive ranking/contamination/dedupe regression coverage for SemanticHtml.
spec/lib/html2rss/auto_source/scraper/html_spec.rb Adds regression coverage for reduced noise in fallback HTML extraction.

Comment on lines +1 to +10
# frozen_string_literal: true

module Html2rss
  class AutoSource
    module Scraper
      ##
      # Shared link-level heuristics used by scraper-local selection and
      # scoring. This keeps normalization and route/text classification
      # consistent without moving scraper policy into higher orchestration.
      class LinkHeuristics # rubocop:disable Metrics/ClassLength
Copilot AI Mar 28, 2026

LinkHeuristics builds several Sets via to_set, but this file doesn't require 'set'. Since to_set is only defined when the stdlib set is loaded, this can raise NoMethodError depending on load order. Add an explicit require 'set' near the top of this file (or otherwise guarantee set is loaded before Zeitwerk eager loads this class).

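The fix the review asks for is a one-line require ahead of the first `to_set` call. A minimal sketch (`UTILITY_PATH_SEGMENTS` is a hypothetical constant, not one from the gem):

```ruby
require 'set' # guarantee Enumerable#to_set is defined regardless of load order

# Hypothetical constant mirroring the kind of Set LinkHeuristics builds.
UTILITY_PATH_SEGMENTS = %w[login signup share print].to_set.freeze

UTILITY_PATH_SEGMENTS.include?('login') # => true
```

On Ruby 3.2+ the `Set` constant is available without the require, but an explicit `require 'set'` stays correct on all supported versions and removes the load-order dependency entirely.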
  high_confidence_junk_path:,
  high_confidence_utility_destination:
)
rescue ArgumentError
Copy link

Copilot AI Mar 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

destination_facts rescues ArgumentError, but Html2rss::Url.from_relative can raise Addressable::URI::InvalidURIError (see other call sites that rescue both). As written, a single malformed href could bubble up and abort scraping. Rescue Addressable::URI::InvalidURIError here as well (and consider broadening to the specific errors Url.from_relative can raise).

Suggested change
- rescue ArgumentError
+ rescue ArgumentError, Addressable::URI::InvalidURIError

Member Author

Html2rss::Url.from_relative should not expose Addressable::* errors to this code. Fence it and raise an appropriate URL-local error.

Nuke Addressable from the code base except for this gem's Url class.
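The fencing pattern the author describes can be sketched like this. It is an illustration of the idea only: `InvalidUrlError` is a hypothetical name, and the stdlib `URI` stands in for Addressable, which the real gem wraps:

```ruby
require 'uri'

module Html2rss
  class Url
    # Gem-local error, so callers never rescue third-party error classes.
    class InvalidUrlError < StandardError; end

    # Sketch: resolve a possibly relative href against a base URL and
    # re-raise any parser error as the gem-local one.
    def self.from_relative(href, base)
      URI.join(base, href).to_s
    rescue URI::Error, ArgumentError => e
      raise InvalidUrlError, "invalid href #{href.inspect}: #{e.message}"
    end
  end
end
```

With this fence in place, call sites such as `destination_facts` only ever need `rescue Html2rss::Url::InvalidUrlError`, and Addressable stays an implementation detail of the Url class.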

end

def article_signal_count(container, publish_signal:, descriptive_signal:, content_signal:)
  [article_container?(container), publish_signal, descriptive_signal, content_signal].count(true)
Copilot AI Mar 28, 2026

publish_marker? returns a Nokogiri node (truthy), but article_signal_count uses .count(true), which only counts elements equal to true (not truthy). This means the publish signal is never counted, making weak_article_candidate too aggressive and potentially hard-dropping real posts. Coerce signals to booleans (e.g., !!publish_signal) or count truthy values (count(&:itself)).

Suggested change
- [article_container?(container), publish_signal, descriptive_signal, content_signal].count(true)
+ [article_container?(container), publish_signal, descriptive_signal, content_signal].count(&:itself)

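The distinction the review points out is easy to demonstrate in isolation: `Array#count(true)` compares with `==`, so a truthy-but-not-`true` value (such as a Nokogiri node) is never counted, while `count(&:itself)` counts every truthy element:

```ruby
# Stand-in for the truthy Nokogiri node that publish_marker? returns.
publish_signal = Object.new

signals = [true, publish_signal, false, nil]

COUNT_BY_EQUALITY = signals.count(true)     # matches only the literal `true`
COUNT_BY_TRUTHY   = signals.count(&:itself) # matches every truthy value
```

Here `COUNT_BY_EQUALITY` is 1 and `COUNT_BY_TRUTHY` is 2, which is exactly the gap that makes the weak-candidate check over-aggressive.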
# scraping can reason about link intent instead of DOM order. It favors
# heading-aligned article links and suppresses utility links, duplicate
# destinations, and weak textless affordances.
class AnchorSelector # rubocop:disable Metrics/ClassLength
Member Author

Suggested change
- class AnchorSelector # rubocop:disable Metrics/ClassLength
+ class AnchorSelector

# This scraper is broader and noisier than `SemanticHtml`, so it acts as a
# fallback for pages without stronger semantic signals.
- class Html
+ class Html # rubocop:disable Metrics/ClassLength
Member Author

Suggested change
- class Html # rubocop:disable Metrics/ClassLength
+ class Html

# Shared link-level heuristics used by scraper-local selection and
# scoring. This keeps normalization and route/text classification
# consistent without moving scraper policy into higher orchestration.
class LinkHeuristics # rubocop:disable Metrics/ClassLength
Member Author

Suggested change
- class LinkHeuristics # rubocop:disable Metrics/ClassLength
+ class LinkHeuristics
