ROX-34929: Dedupe Scanner V4 Packages and Vulns by dcaravel · Pull Request #20996 · stackrox/stackrox

dcaravel · 2026-06-05T23:40:33Z

Description

Adds logic to dedupe packages and vulnerabilities in Scanner V4 results.

The deduping logic is disabled by default because it will be backported, to enable set ROX_SCANNER_V4_DEDUPE: true.

Deduping scenarios addressed

Duplicate RHCC packages (one per kind: source, binary, ancestry)
- Only one is needed.
Duplicate vulns where only the advisory (RHSA) differs.
- A single CVE may be fixed by multiple RHSA's, each RHSA would yield a 'duplicate' - now only the 'latest' is kept.
Different perspectives for the same vuln (ie: GHSA vs. GO, PYSEC vs. GHSA, etc.)
- Logic added to pick the most complete / actionable entry

Placement

Originally the logic was added to mappers.go, was later moved to convert.go as it was more readable and allowed for a 'rolling' approach when modifying the vuln details. In mappers.go, as vulns were combined new 'synthetic' vulns had to be created and added to the vulnerabilities map to not break other packages that may not reference the same set of duplicates (unnecessary complexity).

Logic

Comparing by severity or CVSS was not enough, OSV/Red Hat have priority over NVD and in various scenarios the 'lower severity' needed to be top level.

To address known duplicate scenarios and keep data actionable, fields were grouped as 'fix' related and 'scoring' related (loose), the grouped fields are kept together such that the 'fix' info may come from duplicate A and the scoring from duplicate B.

Scoring fields	Fix fields
cvss	advisory
cvssV2	fixed by
cvssV3	fix avail timestamp
scoreVersion
link
published_on
last_modified
severity
summary

Deduping examples from real images

Scenario 1: Duplicate RHSAs — same CVE, different advisories

Image: registry.redhat.io/compliance/openshift-compliance-openscap-rhel8@sha256:b64926c3cd92381b31f31f28ca01f05c52bc0fbbc581941ba6f0da5b3b142d36
Package: expat 2.5.0-3.el9_5.1 (rpm)

Scoring: equal — both have identical CVSS, severity, and summary.
Fix: B wins — newer advisory, newer fixed-by version, later fix timestamp.

Field	Dupe A	Dupe B	Winner
CVE	CVE-2024-8176	CVE-2024-8176	—
Advisory	RHSA-2025:3531	RHSA-2025:7444	B (newer)
Fixed by	`0:2.5.0-3.el9_5.3`	`0:2.5.0-5.el9_6`	B (newer)
Fix timestamp	2025-04-02	2025-05-13	B (newer)
CVSS / severity	7.5 / MODERATE	7.5 / MODERATE	equal

Scenario 2: Different perspectives — different CVSS scores

Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:29e3563d5fc11a54ffbf7b28694f1cc3975a72e24035fca4afa44d7937e932c7
Package: github.com/docker/docker v27.5.1+incompatible (Go binary)

Scoring: A wins — has 2 CVSS metric sources vs 1.
Fix: equal — neither has an advisory or fix version.

Field	Dupe A	Dupe B	Winner
CVE	CVE-2026-33997	CVE-2026-33997	—
CVSS	6.8	8.1	A (more sources)
CVSS vector	`AV:N/AC:H/...`	`AV:N/AC:L/...`	A
Severity	MODERATE	IMPORTANT	A
Link	osv.dev	nvd.nist.gov	A
Published	2026-03-27	2026-04-02	A (earlier)
CVSS metrics	2 sources	1 source	A

Scenario 3: Different perspectives — one has fix version

Same image as Scenario 2
Package: github.com/docker/docker v27.5.1+incompatible (Go binary)

Scoring: A wins — higher CVSS (8.8 vs 7.8), 2 metric sources vs 1.
Fix: A wins — has fix version 29.3.1; B has none.

Field	Dupe A	Dupe B	Winner
CVE	CVE-2026-34040	CVE-2026-34040	—
CVSS	8.8	7.8	A
Fixed by	`29.3.1`	empty	A
Severity	IMPORTANT	IMPORTANT	equal
Link	osv.dev	nvd.nist.gov	A
CVSS metrics	2 sources	1 source	A

Scenario 4: Different perspectives — one has no scoring at all

Same image as Scenario 2
Package: github.com/go-jose/go-jose/v3 v3.0.4 (Go binary)

Scoring: A wins — has CVSS, severity, and metric source; B has none.
Fix: equal — both have 3.0.5 as the fix version.

Field	Dupe A	Dupe B	Winner
CVE	CVE-2026-34986	CVE-2026-34986	—
CVSS	7.5	empty	A
CVSS v3	full vector	empty	A
Severity	IMPORTANT	UNKNOWN	A
Fixed by	`3.0.5`	`3.0.5`	equal
CVSS metrics	1 source	empty	A

Scenario 5: Different perspectives — different CVSS severity

Same image as Scenario 2
Package: setuptools 53.0.0 (Python)

Scoring: A wins — higher CVSS (7.5 vs 5.9), 2 metric sources vs 1.
Fix: equal — both have 65.5.1 as the fix version.

Field	Dupe A	Dupe B	Winner
CVE	CVE-2022-40897	CVE-2022-40897	—
CVSS	7.5	5.9	A
CVSS vector	`AV:N/AC:L/...`	`AV:N/AC:H/...`	A
Severity	IMPORTANT	MODERATE	A
Link	osv.dev	nvd.nist.gov	A
Fixed by	`65.5.1`	`65.5.1`	equal
CVSS metrics	2 sources	1 source	A

User-facing documentation

CHANGELOG.md is updated OR update is not needed
documentation PR is created and is linked above OR is not needed

Testing and quality

the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
CI results are inspected

Automated testing

added unit tests
modified existing tests

How I validated my change

unit tests + CI + manually

Using a custom dupe detector and two different infra clusters, one with the fix the other without, analyzed each active image for duplicates.

The cluster without the fix:

Summary
  Images scanned:          96
  Total components:        26758
  Total CVE entries:       7656
  Duplicate CVEs:          436
  Identical components:    370

  If duplicates removed (estimate):
    CVEs:        7656 -> 7220  (5.7% reduction)
    Components:  26758 -> 26388  (1.4% reduction)

Summary
  Images scanned:          96
  Total components:        26387
  Total CVE entries:       7220
  Duplicate CVEs:          0
  Identical components:    0

openshift-ci · 2026-06-05T23:40:37Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

coderabbitai · 2026-06-05T23:40:42Z

Worried about impact? Review this PR in Change Stack to explore blast radius before you approve or request changes.

📝 Walkthrough

Walkthrough

This PR adds a new ScannerV4Dedupe feature flag and implements deduplication logic for Scanner V4 scan results. When enabled, the feature filters redundant packages and merges duplicate CVE entries within each package using deterministic advisory ordering and fix-selection rules.

Changes

Scanner V4 Deduplication

Layer / File(s)	Summary
Feature flag and helper infrastructure `pkg/features/list.go`, `pkg/scanners/scannerv4/convert.go`	`ScannerV4Dedupe` feature flag is registered and new imports (`cmp`) and helpers (`yearNumPattern`, `parseYearNum`, `compareStorageAdvisories`, `v4BaseScore`, `mergeFixFields`, `mergeScoringFields`) are added to support advisory parsing and deterministic comparison.
Component deduplication `pkg/scanners/scannerv4/convert.go`	`componentsWithLayerMap` now filters out "ancestry" packages and skips "source" packages already referenced by binary packages when the feature flag is enabled.
Vulnerability deduplication and merging `pkg/scanners/scannerv4/convert.go`, `pkg/scanners/scannerv4/vm_convert.go`	`vulnerabilities` function signature updated to accept `pkgFixedByVersion` and implements CVE-name deduplication: duplicate vulnerabilities merge into a single entry via fix-field and scoring-field selection based on advisory ordering, fixed-version presence, and CVSS heuristics. Call sites updated to pass the new parameters.
Test coverage `pkg/scanners/scannerv4/convert_test.go`	`TestComponents` extended with dedupe parameter and RHCC test cases; new test functions `TestVulnerabilities_DedupByCVEName` and `TestCompareStorageAdvisories` validate deduplication behavior and advisory ordering.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 43.75% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check	⚠️ Warning	Required PR description sections are incomplete or marked as not done; validation section contains only placeholder text.	Complete all checklist items (CHANGELOG.md, documentation PR, testing and quality, automated testing, validation). Replace 'change me!' placeholders with actual validation details or explanations for skipping.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely summarizes the main change: deduplication of Scanner V4 packages and vulnerabilities, which is the primary objective of the PR.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch dc/dedupe-vulns

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

dcaravel · 2026-06-05T23:42:55Z

@coderabbitai review

coderabbitai · 2026-06-05T23:43:00Z

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/scanners/scannerv4/convert.go`:
- Around line 64-71: The loop that builds referencedSourceIDs when
features.ScannerV4Dedupe.Enabled() currently collects source IDs from every
package kind and must be restricted to only binary packages to avoid dropping
valid source+ancestry components; update the population logic so it only adds
srcID when pkg.GetKind() indicates a binary (check the enum/value used for
binary packages in this codebase) before calling
referencedSourceIDs.Add(pkg.GetSource().GetId()), replicate the same change for
the other analogous block (the block mentioned at lines 75-83), and add a
regression test that exercises a report containing only a source and an ancestry
package (no binary) to assert both components are retained.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: f90ad52b-bbfe-43eb-ab17-c3ab7fc086cf

📥 Commits

Reviewing files that changed from the base of the PR and between 6caac0c and c6ba88e.

📒 Files selected for processing (4)

pkg/features/list.go
pkg/scanners/scannerv4/convert.go
pkg/scanners/scannerv4/convert_test.go
pkg/scanners/scannerv4/vm_convert.go

github-actions · 2026-06-05T23:48:21Z

🚀 Build Images Ready

Images are ready for commit 15d1419. To use with deploy scripts:

export MAIN_IMAGE_TAG=4.12.x-116-g15d1419620

dcaravel · 2026-06-09T01:34:36Z

/test all

vikin91

Despite not being in the Scanner-core, I gave it a selective look. My human eyes didn't catch much, but I threw it at AI and there are two things that you may want to check:

Question:

does this layer intentionally rely on PackageVulnerabilities already being advisory-deduped upstream when ROX_SCANNER_V4_RED_HAT_CVES=false?

I noticed vulnerabilities() now merges by ccVuln.GetName(). In isolation that would collapse two entries like:

- RHSA-2024:10775 -> ...CVE-2024-24789...
- RHSA-2024:10775 -> ...CVE-2024-24790...

into one result.

I think toProtoV4PackageVulnerabilitiesMap() / dedupeAdvisories() currently makes that unreachable, so this may be fine. Just wanted to confirm that this function is deliberately depending on that upstream invariant.

Potential issue:

I think mergeFixFields() may regress remediation data in a real way, not just reshuffle metadata. Could we add a regression test for the case where a newer advisory has no FixedInVersion, but an older duplicate does?
Something like:

t.Run("newer advisory does not erase existing fix version", func(t *testing.T) {
	vulnMap := map[string]*v4.VulnerabilityReport_Vulnerability{
		"a": {
			Id: "a", Name: "CVE-2024-1",
			Advisory:       &v4.VulnerabilityReport_Advisory{Name: "RHSA-2024:100"},
			FixedInVersion: "1.0.0",
		},
		"b": {
			Id: "b", Name: "CVE-2024-1",
			Advisory: &v4.VulnerabilityReport_Advisory{Name: "RHSA-2024:200"},
		},
	}

	got := vulnerabilities(vulnMap, []string{"a", "b"}, "", "")
	require.Len(t, got, 1)

	// Prefer the newer advisory metadata, but do not lose known fixability.
	assert.Equal(t, "RHSA-2024:200", got[0].GetAdvisory().GetName())
	assert.Equal(t, "1.0.0", got[0].GetFixedBy())
})

This would help answer whether losing fixability here is really intended.

dcaravel · 2026-06-09T13:03:03Z

Question:

does this layer intentionally rely on PackageVulnerabilities already being advisory-deduped upstream when ROX_SCANNER_V4_RED_HAT_CVES=false?

I noticed vulnerabilities() now merges by ccVuln.GetName(). In isolation that would collapse two entries like:
- RHSA-2024:10775 -> ...CVE-2024-24789...
- RHSA-2024:10775 -> ...CVE-2024-24790...
into one result.

I think toProtoV4PackageVulnerabilitiesMap() / dedupeAdvisories() currently makes that unreachable, so this may be fine. Just wanted to confirm that this function is deliberately depending on that upstream invariant.

That's correct, by the time the the information gets to Central there should only be a single RHSA, and those RHSAs do not map to a single CVE, they generally represent many CVEs - the behavior in dedupeAdvisories addresses that.

Regardless - ROX_SCANNER_V4_RED_HAT_CVES was to be removed when the dedicated advisory field was added (ROX-26672) but never was. I wouldn't put too much focus on that path, but either way it should behave OK.

Potential issue:

I think mergeFixFields() may regress remediation data in a real way, not just reshuffle metadata. Could we add a regression test for the case where a newer advisory has no FixedInVersion, but an older duplicate does? Something like:

t.Run("newer advisory does not erase existing fix version", func(t *testing.T) {
	vulnMap := map[string]*v4.VulnerabilityReport_Vulnerability{
		"a": {
			Id: "a", Name: "CVE-2024-1",
			Advisory:       &v4.VulnerabilityReport_Advisory{Name: "RHSA-2024:100"},
			FixedInVersion: "1.0.0",
		},
		"b": {
			Id: "b", Name: "CVE-2024-1",
			Advisory: &v4.VulnerabilityReport_Advisory{Name: "RHSA-2024:200"},
		},
	}

	got := vulnerabilities(vulnMap, []string{"a", "b"}, "", "")
	require.Len(t, got, 1)

	// Prefer the newer advisory metadata, but do not lose known fixability.
	assert.Equal(t, "RHSA-2024:200", got[0].GetAdvisory().GetName())
	assert.Equal(t, "1.0.0", got[0].GetFixedBy())
})

This would help answer whether losing fixability here is really intended.

That is the opposite of the existing test case:

t.Run("advisory wins with no fix version", func(t *testing.T) {
		vulnMap := map[string]*v4.VulnerabilityReport_Vulnerability{
			"a": {
				Id: "a", Name: "CVE-2024-1",
				Advisory:       &v4.VulnerabilityReport_Advisory{Name: "RHSA-2024:100"},
				FixedInVersion: "1.0.0",
			},
			"b": {
				Id: "b", Name: "CVE-2024-1",
				Advisory: &v4.VulnerabilityReport_Advisory{Name: "RHSA-2024:200"},
			},
		}
		got := vulnerabilities(vulnMap, []string{"a", "b"}, "", "")
		require.Len(t, got, 1)
		// "b" wins on advisory — fix version is empty from "b".
		assert.Equal(t, "RHSA-2024:200", got[0].GetAdvisory().GetName())
		assert.Equal(t, "", got[0].GetFixedBy())
	})

This scenario shouldn't occur - an RHSA describes a 'fix', any versions prior to the version mentioned are considered vulnerable, an RHSA wouldn't be published for a package it doesn't fix.

But lets say that can happen - if we surface to the user that this vulnerability is fixed in 1.0.0, but a later RHSA says the package is no longer fixed (not possible), that user will still be affected by the vuln in that case. I could argue that 'no fix' is the more accurate conclusion (which is why I created that test / expected outcome)

If we feel that we should carry the fix version forward in this case, I'm OK changing it - either way should not occur in practice.

vikin91 · 2026-06-09T13:35:09Z

Thank you for answering! The answers make sense!
I can mark the PR green, but because I am not 100% in the topic I would suggest that maybe someone else should give this a look too.

vikin91

Approving as in my comment (suggesting that second pair of eyes give that a look, as I may lack some assumptions).

jvdm

Would it be simpler to buildEmbeddedVulnerability for the dedup candidate first, then have the merge functions compare two *storage.EmbeddedVulnerability values? The merge functions re-implement parts of buildEmbeddedVulnerability() (advisory(), vulnDataSource(), link()...) and nothing guarantees these two paths stay in sync. There might be bugs there.

openshift-ci · 2026-06-11T01:58:39Z

@dcaravel: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/ocp-4-22-operator-e2e-tests	`15d1419`	link	false	`/test ocp-4-22-operator-e2e-tests`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci Bot added the do-not-merge/work-in-progress label Jun 5, 2026

dcaravel force-pushed the dc/dedupe-vulns branch from b30926a to c6ba88e Compare June 5, 2026 23:41

coderabbitai Bot reviewed Jun 5, 2026

View reviewed changes

Comment thread pkg/scanners/scannerv4/convert.go Outdated

dcaravel force-pushed the dc/dedupe-vulns branch from c6ba88e to 4be2452 Compare June 9, 2026 00:42

baseline dedupe

664a9b3

dcaravel force-pushed the dc/dedupe-vulns branch from 4be2452 to 664a9b3 Compare June 9, 2026 01:02

minor cleanup

ed16587

dcaravel mentioned this pull request Jun 9, 2026

ROX-34929: Dedupe Scanner V4 Packages and Vulns (backport) #21021

Draft

dcaravel marked this pull request as ready for review June 9, 2026 02:02

dcaravel requested review from a team as code owners June 9, 2026 02:02

openshift-ci Bot removed the do-not-merge/work-in-progress label Jun 9, 2026

vikin91 reviewed Jun 9, 2026

View reviewed changes

dcaravel requested a review from vikin91 June 9, 2026 13:04

vikin91 approved these changes Jun 9, 2026

View reviewed changes

jvdm reviewed Jun 10, 2026

View reviewed changes

Comment thread pkg/scanners/scannerv4/convert.go Outdated

Comment thread pkg/scanners/scannerv4/convert.go Outdated

Comment thread pkg/scanners/scannerv4/convert.go

Comment thread pkg/scanners/scannerv4/convert.go Outdated

Comment thread pkg/scanners/scannerv4/convert.go Outdated

dcaravel added 4 commits June 10, 2026 16:12

convert candidate vuln to embedded before compare

0d7fbd2

add debug log to version compare

4ddcb20

compare advisories by fixed timestamp instead of sequence

692b7b6

update filter comment

09f6578

dcaravel requested a review from jvdm June 10, 2026 22:17

use getters

15d1419

jvdm approved these changes Jun 11, 2026

View reviewed changes

Conversation

dcaravel commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Deduping scenarios addressed

Placement

Logic

Scenario 1: Duplicate RHSAs — same CVE, different advisories

Scenario 2: Different perspectives — different CVSS scores

Scenario 3: Different perspectives — one has fix version

Scenario 4: Different perspectives — one has no scoring at all

Scenario 5: Different perspectives — different CVSS severity

User-facing documentation

Testing and quality

Automated testing

How I validated my change

Uh oh!

openshift-ci Bot commented Jun 5, 2026

Uh oh!

coderabbitai Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (2 warnings)

Uh oh!

dcaravel commented Jun 5, 2026

Uh oh!

coderabbitai Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

github-actions Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🚀 Build Images Ready

Uh oh!

dcaravel commented Jun 9, 2026

Uh oh!

vikin91 left a comment

Choose a reason for hiding this comment

Uh oh!

dcaravel commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

vikin91 commented Jun 9, 2026

Uh oh!

vikin91 left a comment

Choose a reason for hiding this comment

Uh oh!

jvdm left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

openshift-ci Bot commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

dcaravel commented Jun 5, 2026 •

edited

Loading

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading

coderabbitai Bot commented Jun 5, 2026 •

edited

Loading

github-actions Bot commented Jun 5, 2026 •

edited

Loading

dcaravel commented Jun 9, 2026 •

edited

Loading