fix(auto): shift scoring to quality-vs-junk model with stable dedupe (#348)
gildesmarais wants to merge 2 commits into master
Conversation
The branch was force-pushed from 21d83b7 to c8dc602.
Pull request overview
This PR updates the auto-source HTML scraping pipeline to rank candidates using a “quality vs junk” scoring model, introduces stable destination-based deduplication, and centralizes link/path heuristics to better suppress navigation/taxonomy/CTA noise across scrapers.
Changes:
- Add `LinkHeuristics` to normalize destinations and classify utility/taxonomy/vanity/recommended links consistently across scrapers.
- Update `SemanticHtml` to score (quality − junk), dedupe by destination (preferring stronger entries), and stably rank results.
- Extend specs to cover contamination control, ranking order, and dedupe behavior in both `SemanticHtml` and fallback `Html` scraping.
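The scoring-and-dedupe model described above can be sketched roughly as follows. Note this is a hypothetical illustration, not the PR's actual API: `Candidate` and `rank` are made-up names. Each candidate gets a score of quality minus junk, duplicates sharing a destination keep only the stronger entry, and the original index serves as a stable tiebreaker.

```ruby
# Hypothetical sketch of the quality-vs-junk ranking model; the real
# SemanticHtml implementation differs in detail.
Candidate = Struct.new(:url, :quality, :junk, :index) do
  def score = quality - junk # quality-vs-junk score
end

def rank(candidates)
  # Dedupe by destination, preferring the stronger (higher-scoring) entry.
  strongest = candidates.group_by(&:url).map { |_, dupes| dupes.max_by(&:score) }
  # Rank by score descending; the original index keeps ties stable.
  strongest.sort_by { |c| [-c.score, c.index] }
end
```

The stable tiebreaker matters because `sort_by` alone is not guaranteed stable in Ruby; including the index in the sort key makes the ordering deterministic across runs.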
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `lib/html2rss/auto_source/scraper/link_heuristics.rb` | New shared destination/text classification helper used by multiple scrapers. |
| `lib/html2rss/auto_source/scraper/semantic_html.rb` | Adds scoring model, destination dedupe, and stable ranking to semantic scraping. |
| `lib/html2rss/auto_source/scraper/semantic_html/anchor_selector.rb` | Switches anchor filtering/normalization to shared `LinkHeuristics` facts. |
| `lib/html2rss/auto_source/scraper/html.rb` | Filters fallback anchors using `LinkHeuristics` to reduce taxonomy/utility noise. |
| `spec/lib/html2rss/auto_source/scraper/semantic_html_ranking_spec.rb` | Adds extensive ranking/contamination/dedupe regression coverage for `SemanticHtml`. |
| `spec/lib/html2rss/auto_source/scraper/html_spec.rb` | Adds regression coverage for reduced noise in fallback HTML extraction. |
From `lib/html2rss/auto_source/scraper/link_heuristics.rb`:

```ruby
# frozen_string_literal: true

module Html2rss
  class AutoSource
    module Scraper
      ##
      # Shared link-level heuristics used by scraper-local selection and
      # scoring. This keeps normalization and route/text classification
      # consistent without moving scraper policy into higher orchestration.
      class LinkHeuristics # rubocop:disable Metrics/ClassLength
```
`LinkHeuristics` builds several `Set`s via `to_set`, but this file doesn't `require 'set'`. Since `to_set` is only defined when the stdlib `set` is loaded, this can raise `NoMethodError` depending on load order. Add an explicit `require 'set'` near the top of this file (or otherwise guarantee `set` is loaded before Zeitwerk eager-loads this class).
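A minimal illustration of the fix: `Enumerable#to_set` comes from the stdlib `set` library, so constants built with `to_set` only work once it is loaded. The constant name below is a hypothetical stand-in for the kind of lookup table `LinkHeuristics` builds, not the file's actual contents. (On Ruby 3.2+ `Set` is available by default, so the require mainly protects older Rubies and unusual load orders.)

```ruby
# frozen_string_literal: true

require 'set' # guarantees Enumerable#to_set regardless of load order

# UTILITY_TEXTS is a hypothetical stand-in for the kind of constant
# LinkHeuristics builds: a frozen Set for O(1) membership checks.
UTILITY_TEXTS = %w[login signup share print subscribe].to_set.freeze
```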
```ruby
        high_confidence_junk_path:,
        high_confidence_utility_destination:
      )
    rescue ArgumentError
```
`destination_facts` rescues `ArgumentError`, but `Html2rss::Url.from_relative` can raise `Addressable::URI::InvalidURIError` (see other call sites that rescue both). As written, a single malformed href could bubble up and abort scraping. Rescue `Addressable::URI::InvalidURIError` here as well (and consider broadening to the specific errors `Url.from_relative` can raise).

```diff
-rescue ArgumentError
+rescue ArgumentError, Addressable::URI::InvalidURIError
```
`Html2rss::Url.from_relative` should not expose `Addressable::*` errors to this code. Fence it and raise an appropriate URL-local error instead.

Remove Addressable from the code base everywhere except this gem's `Url` class.
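One way to fence the parser as suggested is to translate third-party exceptions into a gem-local error at the `Url` boundary, so callers never rescue parser-specific classes. The sketch below uses stdlib `URI` standing in for Addressable, and `InvalidUrlError` is a hypothetical name; it illustrates the pattern, not the gem's actual implementation.

```ruby
require 'uri'

module Html2rss
  class Url
    # Hypothetical gem-local error; callers rescue this instead of
    # parser-specific exception classes.
    class InvalidUrlError < StandardError; end

    # Sketch of a fenced from_relative: stdlib URI stands in for
    # Addressable here, but the translation pattern is the same.
    def self.from_relative(href, base)
      URI.join(base, href).to_s
    rescue URI::InvalidURIError, ArgumentError => e
      raise InvalidUrlError, "cannot resolve #{href.inspect}: #{e.message}"
    end
  end
end
```

With this shape, swapping the underlying parser later only touches the `Url` class, not every scraper's rescue clauses.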
```ruby
      end

      def article_signal_count(container, publish_signal:, descriptive_signal:, content_signal:)
        [article_container?(container), publish_signal, descriptive_signal, content_signal].count(true)
      end
```
`publish_marker?` returns a Nokogiri node (truthy), but `article_signal_count` uses `.count(true)`, which only counts elements equal to `true` (not truthy). This means the publish signal is never counted, making `weak_article_candidate` too aggressive and potentially hard-dropping real posts. Coerce signals to booleans (e.g. `!!publish_signal`) or count truthy values (`count(&:itself)`).

```diff
-[article_container?(container), publish_signal, descriptive_signal, content_signal].count(true)
+[article_container?(container), publish_signal, descriptive_signal, content_signal].count(&:itself)
```
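The difference is easy to see in isolation, using a plain object to stand in for the truthy Nokogiri node:

```ruby
# A truthy non-boolean, like the node publish_marker? returns.
node = Object.new

signals = [true, node, false, true]

# Array#count(arg) compares with ==, so the node is not counted.
equal_true = signals.count(true)     # => 2

# count(&:itself) counts every truthy element, including the node.
truthy = signals.count(&:itself)     # => 3
```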
From `lib/html2rss/auto_source/scraper/semantic_html/anchor_selector.rb`:

```ruby
      # scraping can reason about link intent instead of DOM order. It favors
      # heading-aligned article links and suppresses utility links, duplicate
      # destinations, and weak textless affordances.
      class AnchorSelector # rubocop:disable Metrics/ClassLength
```

Suggested change:

```diff
-class AnchorSelector # rubocop:disable Metrics/ClassLength
+class AnchorSelector
```
From `lib/html2rss/auto_source/scraper/html.rb`:

```ruby
      # This scraper is broader and noisier than `SemanticHtml`, so it acts as a
      # fallback for pages without stronger semantic signals.
      class Html # rubocop:disable Metrics/ClassLength
```

Suggested change:

```diff
-class Html # rubocop:disable Metrics/ClassLength
+class Html
```
From `lib/html2rss/auto_source/scraper/link_heuristics.rb`:

```ruby
      # Shared link-level heuristics used by scraper-local selection and
      # scoring. This keeps normalization and route/text classification
      # consistent without moving scraper policy into higher orchestration.
      class LinkHeuristics # rubocop:disable Metrics/ClassLength
```

Suggested change:

```diff
-class LinkHeuristics # rubocop:disable Metrics/ClassLength
+class LinkHeuristics
```