ROX-35012: Fix V1 compliance operator startup panic and convoy by dashrews78 · Pull Request #21058 · stackrox/stackrox

dashrews78 · 2026-06-10T11:30:47Z

Description

Central panics on startup with "context deadline exceeded / iterating
over rows" when many clusters reconnect simultaneously. Root cause:
NewManager did a Walk of all compliance profiles at startup, calling
addProfileNoLock for each — which itself did another full Walk. This
O(N²) pattern combined with concurrent sensor AddProfile calls created
a lock convoy on registryLock that exhausted the cursor timeout.
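
To make the quadratic cost concrete, here is a minimal sketch of the nested-Walk shape described above. The `store` type, `addProfile`, and `startupWalk` are hypothetical stand-ins, not the real datastore or manager code, and locking is omitted entirely. The sketch only counts how many rows are touched when an outer Walk calls a helper that performs its own full Walk.

```go
package main

import "fmt"

// store is a hypothetical stand-in for the profile datastore.
type store struct {
	profiles []string
	visits   int // total rows touched across all Walk calls
}

// Walk visits every stored profile once.
func (s *store) Walk(fn func(name string)) {
	for _, p := range s.profiles {
		s.visits++
		fn(p)
	}
}

// addProfile mimics a helper that scans the whole table for each profile it adds.
func addProfile(s *store, registry map[string]bool, name string) {
	s.Walk(func(other string) {
		if other == name {
			registry[name] = true
		}
	})
}

// startupWalk mimics the removed startup loop: an outer Walk whose callback
// triggers another full Walk, giving N + N*N row visits for N profiles.
func startupWalk(s *store) map[string]bool {
	registry := make(map[string]bool)
	s.Walk(func(name string) {
		addProfile(s, registry, name)
	})
	return registry
}

func main() {
	s := &store{}
	for i := 0; i < 100; i++ {
		s.profiles = append(s.profiles, fmt.Sprintf("profile-%d", i))
	}
	registry := startupWalk(s)
	fmt.Println(len(registry), s.visits) // prints "100 10100"
}
```

With 100 profiles the sketch touches 10,100 rows to register 100 entries. In the real system each of those scans also holds a database cursor open, which is where the timeout pressure comes from.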

Fix:

- Remove the startup Walk in NewManager. Sensors always re-send all
  compliance operator data on reconnect (V1 compliance types skip
  deduping), so the registry populates naturally.
- Throttle concurrent profile and rule pipeline operations via semaphore
  (default 5, configurable via ROX_COMPLIANCE_V1_MAX_CONCURRENCY) to
  prevent DB connection pool exhaustion during mass sensor reconnects.

Partially generated by AI.

Co-Authored-By: Claude Opus 4.6 (1M context) noreply@anthropic.com

User-facing documentation

CHANGELOG.md is updated OR update is not needed
documentation PR is created and is linked above OR is not needed

Testing and quality

the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
CI results are inspected

Automated testing

How I validated my change

Added unit tests and benchmarks. Brought up a cluster and confirmed that the in-memory map is populated correctly as profiles are added, by exercising the deprecated UI, which is the only consumer of the in-memory map.
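
One way a unit test might check that a semaphore really caps concurrency is to record the peak number of goroutines inside the gated section. The sketch below is an assumption about the general approach, not a copy of the tests in this PR. `peakConcurrency` is a hypothetical helper, and it uses the standard library `sync` package for self-containment, whereas code in this repository would import `github.com/stackrox/rox/pkg/sync` instead.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// peakConcurrency starts `workers` goroutines gated by a semaphore of size
// `limit` and returns the largest number seen inside the gated section at once.
func peakConcurrency(workers, limit int) int64 {
	sem := make(chan struct{}, limit)
	var inFlight, peak int64
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot
			defer func() { <-sem }() // release it on exit
			cur := atomic.AddInt64(&inFlight, 1)
			// Record a new maximum with a compare-and-swap loop.
			for {
				old := atomic.LoadInt64(&peak)
				if cur <= old || atomic.CompareAndSwapInt64(&peak, old, cur) {
					break
				}
			}
			atomic.AddInt64(&inFlight, -1)
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	fmt.Println(peakConcurrency(50, 5) <= 5) // prints "true"
}
```

The exact peak varies from run to run, so the assertion is on the upper bound only.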

openshift-ci · 2026-06-10T11:30:51Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

coderabbitai · 2026-06-10T11:30:56Z

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

dashrews78 · 2026-06-10T11:31:28Z

This change is part of the following stack:

- ROX-35012: Fix V1 compliance operator startup panic and convoy #21058 ◀
  - ROX-35013: Replace full-table Walks with targeted WalkByQuery in compliance operator manager #21059

Change managed by git-spice.

github-actions · 2026-06-10T11:39:13Z

🚀 Build Images Ready

Images are ready for commit 026a0da. To use with deploy scripts:

export MAIN_IMAGE_TAG=4.12.x-148-g026a0da031

dashrews78 · 2026-06-10T18:27:40Z

/test gke-qa-e2e-tests

dashrews78 · 2026-06-10T19:17:12Z

/test gke-upgrade-tests

openshift-ci · 2026-06-10T20:30:26Z

@dashrews78: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/ocp-4-12-qa-e2e-tests | `e30417e` | link | false | `/test ocp-4-12-qa-e2e-tests` |

openshift-ci Bot added the do-not-merge/work-in-progress label Jun 10, 2026

github-actions Bot added the area/central label Jun 10, 2026

dashrews78 mentioned this pull request Jun 10, 2026

ROX-35013: Replace full-table Walks with targeted WalkByQuery in compliance operator manager #21059

Draft

9 tasks

dashrews78 changed the title ~~ROX-35XXX: Fix V1 compliance operator startup panic and convoy~~ ROX-35012: Fix V1 compliance operator startup panic and convoy Jun 10, 2026

dashrews78 marked this pull request as ready for review June 10, 2026 15:37

openshift-ci Bot removed the do-not-merge/work-in-progress label Jun 10, 2026

dashrews78 added the backport release-4.10 and backport release-4.11 labels Jun 10, 2026

AlexVulaj reviewed Jun 10, 2026

Comment thread central/complianceoperator/pipelines/complianceoperatorprofiles/pipeline.go

AlexVulaj reviewed Jun 10, 2026

Comment thread central/complianceoperator/manager/manager.go

AlexVulaj approved these changes Jun 10, 2026

dashrews78 and others added 4 commits June 10, 2026 14:34

ROX-35XXX: Fix roxvet import violation in convoy test

4c5d318

Use github.com/stackrox/rox/pkg/sync instead of stdlib sync. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ROX-35XXX: Add semaphore throttle tests for profile and rule pipelines

65ec22b

Tests verify the semaphore limits concurrent AddProfile/AddRule calls and that cancelled contexts are respected when the semaphore is full. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ROX-35XXX: Fix goimports grouping in convoy test

e30417e

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

dashrews78 force-pushed the dashrews/compliance-v1-startup branch from 5cc3639 to e30417e Compare June 10, 2026 18:34

dashrews78 merged commit 026a0da into master Jun 10, 2026
176 of 200 checks passed

dashrews78 deleted the dashrews/compliance-v1-startup branch June 10, 2026 22:11

This was referenced Jun 10, 2026

ROX-35012: Fix V1 compliance operator startup panic and convoy #21067

Open

ROX-35012: Fix V1 compliance operator startup panic and convoy #21068

Open
