Skip to content

ROX-34929: Dedupe Scanner V4 Packages and Vulns#20996

Open
dcaravel wants to merge 7 commits into
masterfrom
dc/dedupe-vulns
Open

ROX-34929: Dedupe Scanner V4 Packages and Vulns#20996
dcaravel wants to merge 7 commits into
masterfrom
dc/dedupe-vulns

Conversation

@dcaravel

@dcaravel dcaravel commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Description

Adds logic to dedupe packages and vulnerabilities in Scanner V4 results.

The deduping logic is disabled by default because it will be backported, to enable set ROX_SCANNER_V4_DEDUPE: true.

Deduping scenarios addressed

  1. Duplicate RHCC packages (one per kind: source, binary, ancestry)
    • Only one is needed.
  2. Duplicate vulns where only the advisory (RHSA) differs.
    • A single CVE may be fixed by multiple RHSA's, each RHSA would yield a 'duplicate' - now only the 'latest' is kept.
  3. Different perspectives for the same vuln (ie: GHSA vs. GO, PYSEC vs. GHSA, etc.)
    • Logic added to pick the most complete / actionable entry

Placement

Originally the logic was added to mappers.go, was later moved to convert.go as it was more readable and allowed for a 'rolling' approach when modifying the vuln details. In mappers.go, as vulns were combined new 'synthetic' vulns had to be created and added to the vulnerabilities map to not break other packages that may not reference the same set of duplicates (unnecessary complexity).

Logic

Comparing by severity or CVSS was not enough, OSV/Red Hat have priority over NVD and in various scenarios the 'lower severity' needed to be top level.

To address known duplicate scenarios and keep data actionable, fields were grouped as 'fix' related and 'scoring' related (loose), the grouped fields are kept together such that the 'fix' info may come from duplicate A and the scoring from duplicate B.

Scoring fields Fix fields
cvss advisory
cvssV2 fixed by
cvssV3 fix avail timestamp
scoreVersion
link
published_on
last_modified
severity
summary
Deduping examples from real images

Scenario 1: Duplicate RHSAs — same CVE, different advisories

Image: registry.redhat.io/compliance/openshift-compliance-openscap-rhel8@sha256:b64926c3cd92381b31f31f28ca01f05c52bc0fbbc581941ba6f0da5b3b142d36
Package: expat 2.5.0-3.el9_5.1 (rpm)

Scoring: equal — both have identical CVSS, severity, and summary.
Fix: B wins — newer advisory, newer fixed-by version, later fix timestamp.

Field Dupe A Dupe B Winner
CVE CVE-2024-8176 CVE-2024-8176
Advisory RHSA-2025:3531 RHSA-2025:7444 B (newer)
Fixed by 0:2.5.0-3.el9_5.3 0:2.5.0-5.el9_6 B (newer)
Fix timestamp 2025-04-02 2025-05-13 B (newer)
CVSS / severity 7.5 / MODERATE 7.5 / MODERATE equal

Scenario 2: Different perspectives — different CVSS scores

Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:29e3563d5fc11a54ffbf7b28694f1cc3975a72e24035fca4afa44d7937e932c7
Package: github.com/docker/docker v27.5.1+incompatible (Go binary)

Scoring: A wins — has 2 CVSS metric sources vs 1.
Fix: equal — neither has an advisory or fix version.

Field Dupe A Dupe B Winner
CVE CVE-2026-33997 CVE-2026-33997
CVSS 6.8 8.1 A (more sources)
CVSS vector AV:N/AC:H/... AV:N/AC:L/... A
Severity MODERATE IMPORTANT A
Link osv.dev nvd.nist.gov A
Published 2026-03-27 2026-04-02 A (earlier)
CVSS metrics 2 sources 1 source A

Scenario 3: Different perspectives — one has fix version

Same image as Scenario 2
Package: github.com/docker/docker v27.5.1+incompatible (Go binary)

Scoring: A wins — higher CVSS (8.8 vs 7.8), 2 metric sources vs 1.
Fix: A wins — has fix version 29.3.1; B has none.

Field Dupe A Dupe B Winner
CVE CVE-2026-34040 CVE-2026-34040
CVSS 8.8 7.8 A
Fixed by 29.3.1 empty A
Severity IMPORTANT IMPORTANT equal
Link osv.dev nvd.nist.gov A
CVSS metrics 2 sources 1 source A

Scenario 4: Different perspectives — one has no scoring at all

Same image as Scenario 2
Package: github.com/go-jose/go-jose/v3 v3.0.4 (Go binary)

Scoring: A wins — has CVSS, severity, and metric source; B has none.
Fix: equal — both have 3.0.5 as the fix version.

Field Dupe A Dupe B Winner
CVE CVE-2026-34986 CVE-2026-34986
CVSS 7.5 empty A
CVSS v3 full vector empty A
Severity IMPORTANT UNKNOWN A
Fixed by 3.0.5 3.0.5 equal
CVSS metrics 1 source empty A

Scenario 5: Different perspectives — different CVSS severity

Same image as Scenario 2
Package: setuptools 53.0.0 (Python)

Scoring: A wins — higher CVSS (7.5 vs 5.9), 2 metric sources vs 1.
Fix: equal — both have 65.5.1 as the fix version.

Field Dupe A Dupe B Winner
CVE CVE-2022-40897 CVE-2022-40897
CVSS 7.5 5.9 A
CVSS vector AV:N/AC:L/... AV:N/AC:H/... A
Severity IMPORTANT MODERATE A
Link osv.dev nvd.nist.gov A
Fixed by 65.5.1 65.5.1 equal
CVSS metrics 2 sources 1 source A

User-facing documentation

Testing and quality

  • the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

  • added unit tests
  • modified existing tests

How I validated my change

unit tests + CI + manually

Using a custom dupe detector and two different infra clusters, one with the fix the other without, analyzed each active image for duplicates.

The cluster without the fix:

Summary
  Images scanned:          96
  Total components:        26758
  Total CVE entries:       7656
  Duplicate CVEs:          436
  Identical components:    370

  If duplicates removed (estimate):
    CVEs:        7656 -> 7220  (5.7% reduction)
    Components:  26758 -> 26388  (1.4% reduction)
Summary
  Images scanned:          96
  Total components:        26387
  Total CVE entries:       7220
  Duplicate CVEs:          0
  Identical components:    0

@openshift-ci

openshift-ci Bot commented Jun 5, 2026

Copy link
Copy Markdown

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai

coderabbitai Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Worried about impact? Review this PR in Change Stack to explore blast radius before you approve or request changes.

Review Change Stack

📝 Walkthrough

Walkthrough

This PR adds a new ScannerV4Dedupe feature flag and implements deduplication logic for Scanner V4 scan results. When enabled, the feature filters redundant packages and merges duplicate CVE entries within each package using deterministic advisory ordering and fix-selection rules.

Changes

Scanner V4 Deduplication

Layer / File(s) Summary
Feature flag and helper infrastructure
pkg/features/list.go, pkg/scanners/scannerv4/convert.go
ScannerV4Dedupe feature flag is registered and new imports (cmp) and helpers (yearNumPattern, parseYearNum, compareStorageAdvisories, v4BaseScore, mergeFixFields, mergeScoringFields) are added to support advisory parsing and deterministic comparison.
Component deduplication
pkg/scanners/scannerv4/convert.go
componentsWithLayerMap now filters out "ancestry" packages and skips "source" packages already referenced by binary packages when the feature flag is enabled.
Vulnerability deduplication and merging
pkg/scanners/scannerv4/convert.go, pkg/scanners/scannerv4/vm_convert.go
vulnerabilities function signature updated to accept pkgFixedByVersion and implements CVE-name deduplication: duplicate vulnerabilities merge into a single entry via fix-field and scoring-field selection based on advisory ordering, fixed-version presence, and CVSS heuristics. Call sites updated to pass the new parameters.
Test coverage
pkg/scanners/scannerv4/convert_test.go
TestComponents extended with dedupe parameter and RHCC test cases; new test functions TestVulnerabilities_DedupByCVEName and TestCompareStorageAdvisories validate deduplication behavior and advisory ordering.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 43.75% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ⚠️ Warning Required PR description sections are incomplete or marked as not done; validation section contains only placeholder text. Complete all checklist items (CHANGELOG.md, documentation PR, testing and quality, automated testing, validation). Replace 'change me!' placeholders with actual validation details or explanations for skipping.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: deduplication of Scanner V4 packages and vulnerabilities, which is the primary objective of the PR.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch dc/dedupe-vulns

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@dcaravel

dcaravel commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/scanners/scannerv4/convert.go`:
- Around line 64-71: The loop that builds referencedSourceIDs when
features.ScannerV4Dedupe.Enabled() currently collects source IDs from every
package kind and must be restricted to only binary packages to avoid dropping
valid source+ancestry components; update the population logic so it only adds
srcID when pkg.GetKind() indicates a binary (check the enum/value used for
binary packages in this codebase) before calling
referencedSourceIDs.Add(pkg.GetSource().GetId()), replicate the same change for
the other analogous block (the block mentioned at lines 75-83), and add a
regression test that exercises a report containing only a source and an ancestry
package (no binary) to assert both components are retained.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: f90ad52b-bbfe-43eb-ab17-c3ab7fc086cf

📥 Commits

Reviewing files that changed from the base of the PR and between 6caac0c and c6ba88e.

📒 Files selected for processing (4)
  • pkg/features/list.go
  • pkg/scanners/scannerv4/convert.go
  • pkg/scanners/scannerv4/convert_test.go
  • pkg/scanners/scannerv4/vm_convert.go

Comment thread pkg/scanners/scannerv4/convert.go Outdated
@github-actions

github-actions Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

🚀 Build Images Ready

Images are ready for commit 15d1419. To use with deploy scripts:

export MAIN_IMAGE_TAG=4.12.x-116-g15d1419620

@dcaravel

dcaravel commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

/test all

@dcaravel dcaravel marked this pull request as ready for review June 9, 2026 02:02
@dcaravel dcaravel requested review from a team as code owners June 9, 2026 02:02

@vikin91 vikin91 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Despite not being in the Scanner-core, I gave it a selective look. My human eyes didn't catch much, but I threw it at AI and there are two things that you may want to check:

  1. Question:

does this layer intentionally rely on PackageVulnerabilities already being advisory-deduped upstream when ROX_SCANNER_V4_RED_HAT_CVES=false?

I noticed vulnerabilities() now merges by ccVuln.GetName(). In isolation that would collapse two entries like:

- RHSA-2024:10775 -> ...CVE-2024-24789...
- RHSA-2024:10775 -> ...CVE-2024-24790...

into one result.

I think toProtoV4PackageVulnerabilitiesMap() / dedupeAdvisories() currently makes that unreachable, so this may be fine. Just wanted to confirm that this function is deliberately depending on that upstream invariant.

  1. Potential issue:

I think mergeFixFields() may regress remediation data in a real way, not just reshuffle metadata. Could we add a regression test for the case where a newer advisory has no FixedInVersion, but an older duplicate does?
Something like:

t.Run("newer advisory does not erase existing fix version", func(t *testing.T) {
	vulnMap := map[string]*v4.VulnerabilityReport_Vulnerability{
		"a": {
			Id: "a", Name: "CVE-2024-1",
			Advisory:       &v4.VulnerabilityReport_Advisory{Name: "RHSA-2024:100"},
			FixedInVersion: "1.0.0",
		},
		"b": {
			Id: "b", Name: "CVE-2024-1",
			Advisory: &v4.VulnerabilityReport_Advisory{Name: "RHSA-2024:200"},
		},
	}

	got := vulnerabilities(vulnMap, []string{"a", "b"}, "", "")
	require.Len(t, got, 1)

	// Prefer the newer advisory metadata, but do not lose known fixability.
	assert.Equal(t, "RHSA-2024:200", got[0].GetAdvisory().GetName())
	assert.Equal(t, "1.0.0", got[0].GetFixedBy())
})

This would help answer whether losing fixability here is really intended.

@dcaravel

dcaravel commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author
  1. Question:

does this layer intentionally rely on PackageVulnerabilities already being advisory-deduped upstream when ROX_SCANNER_V4_RED_HAT_CVES=false?

I noticed vulnerabilities() now merges by ccVuln.GetName(). In isolation that would collapse two entries like:

- RHSA-2024:10775 -> ...CVE-2024-24789...
- RHSA-2024:10775 -> ...CVE-2024-24790...

into one result.

I think toProtoV4PackageVulnerabilitiesMap() / dedupeAdvisories() currently makes that unreachable, so this may be fine. Just wanted to confirm that this function is deliberately depending on that upstream invariant.

That's correct, by the time the the information gets to Central there should only be a single RHSA, and those RHSAs do not map to a single CVE, they generally represent many CVEs - the behavior in dedupeAdvisories addresses that.

Regardless - ROX_SCANNER_V4_RED_HAT_CVES was to be removed when the dedicated advisory field was added (ROX-26672) but never was. I wouldn't put too much focus on that path, but either way it should behave OK.

  1. Potential issue:

I think mergeFixFields() may regress remediation data in a real way, not just reshuffle metadata. Could we add a regression test for the case where a newer advisory has no FixedInVersion, but an older duplicate does? Something like:

t.Run("newer advisory does not erase existing fix version", func(t *testing.T) {
	vulnMap := map[string]*v4.VulnerabilityReport_Vulnerability{
		"a": {
			Id: "a", Name: "CVE-2024-1",
			Advisory:       &v4.VulnerabilityReport_Advisory{Name: "RHSA-2024:100"},
			FixedInVersion: "1.0.0",
		},
		"b": {
			Id: "b", Name: "CVE-2024-1",
			Advisory: &v4.VulnerabilityReport_Advisory{Name: "RHSA-2024:200"},
		},
	}

	got := vulnerabilities(vulnMap, []string{"a", "b"}, "", "")
	require.Len(t, got, 1)

	// Prefer the newer advisory metadata, but do not lose known fixability.
	assert.Equal(t, "RHSA-2024:200", got[0].GetAdvisory().GetName())
	assert.Equal(t, "1.0.0", got[0].GetFixedBy())
})

This would help answer whether losing fixability here is really intended.

That is the opposite of the existing test case:

t.Run("advisory wins with no fix version", func(t *testing.T) {
		vulnMap := map[string]*v4.VulnerabilityReport_Vulnerability{
			"a": {
				Id: "a", Name: "CVE-2024-1",
				Advisory:       &v4.VulnerabilityReport_Advisory{Name: "RHSA-2024:100"},
				FixedInVersion: "1.0.0",
			},
			"b": {
				Id: "b", Name: "CVE-2024-1",
				Advisory: &v4.VulnerabilityReport_Advisory{Name: "RHSA-2024:200"},
			},
		}
		got := vulnerabilities(vulnMap, []string{"a", "b"}, "", "")
		require.Len(t, got, 1)
		// "b" wins on advisory — fix version is empty from "b".
		assert.Equal(t, "RHSA-2024:200", got[0].GetAdvisory().GetName())
		assert.Equal(t, "", got[0].GetFixedBy())
	})

This scenario shouldn't occur - an RHSA describes a 'fix', any versions prior to the version mentioned are considered vulnerable, an RHSA wouldn't be published for a package it doesn't fix.

But lets say that can happen - if we surface to the user that this vulnerability is fixed in 1.0.0, but a later RHSA says the package is no longer fixed (not possible), that user will still be affected by the vuln in that case. I could argue that 'no fix' is the more accurate conclusion (which is why I created that test / expected outcome)

If we feel that we should carry the fix version forward in this case, I'm OK changing it - either way should not occur in practice.

@dcaravel dcaravel requested a review from vikin91 June 9, 2026 13:04
@vikin91

vikin91 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Thank you for answering! The answers make sense!
I can mark the PR green, but because I am not 100% in the topic I would suggest that maybe someone else should give this a look too.

@vikin91 vikin91 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving as in my comment (suggesting that second pair of eyes give that a look, as I may lack some assumptions).

@jvdm jvdm left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be simpler to buildEmbeddedVulnerability for the dedup candidate first, then have the merge functions compare two *storage.EmbeddedVulnerability values? The merge functions re-implement parts of buildEmbeddedVulnerability() (advisory(), vulnDataSource(), link()...) and nothing guarantees these two paths stay in sync. There might be bugs there.

Comment thread pkg/scanners/scannerv4/convert.go Outdated
Comment thread pkg/scanners/scannerv4/convert.go Outdated
Comment thread pkg/scanners/scannerv4/convert.go
Comment thread pkg/scanners/scannerv4/convert.go Outdated
Comment thread pkg/scanners/scannerv4/convert.go Outdated
@dcaravel dcaravel requested a review from jvdm June 10, 2026 22:17
@openshift-ci

openshift-ci Bot commented Jun 11, 2026

Copy link
Copy Markdown

@dcaravel: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/ocp-4-22-operator-e2e-tests 15d1419 link false /test ocp-4-22-operator-e2e-tests

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants