ROX-35012: Fix V1 compliance operator startup panic and convoy#21058

Merged
dashrews78 merged 4 commits into master from dashrews/compliance-v1-startup
Jun 10, 2026
@dashrews78

@dashrews78 dashrews78 commented Jun 10, 2026


Description

Central panics on startup with "context deadline exceeded / iterating
over rows" when many clusters reconnect simultaneously. Root cause:
NewManager did a Walk of all compliance profiles at startup, calling
addProfileNoLock for each — which itself did another full Walk. This
O(N²) pattern combined with concurrent sensor AddProfile calls created
a lock convoy on registryLock that exhausted the cursor timeout.

Fix:

  • Remove the startup Walk in NewManager. Sensors always re-send all
    compliance operator data on reconnect (V1 compliance types skip
    deduping), so the registry populates naturally.
  • Throttle concurrent profile and rule pipeline operations via semaphore
    (default 5, configurable via ROX_COMPLIANCE_V1_MAX_CONCURRENCY) to
    prevent DB connection pool exhaustion during mass sensor reconnects.

Partially generated by AI.

Co-Authored-By: Claude Opus 4.6 (1M context) noreply@anthropic.com

User-facing documentation

Testing and quality

  • the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

  • added unit tests
  • added e2e tests
  • added regression tests
  • added compatibility tests
  • modified existing tests

How I validated my change

Added unit tests and benchmarks. Fired up a cluster and confirmed that the in-memory map is populated correctly as profiles are added, by exercising the deprecated UI, which is the only place that uses the in-memory map.

@openshift-ci

openshift-ci Bot commented Jun 10, 2026


Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai

coderabbitai Bot commented Jun 10, 2026


Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

@dashrews78

dashrews78 commented Jun 10, 2026

@github-actions

github-actions Bot commented Jun 10, 2026


🚀 Build Images Ready

Images are ready for commit 026a0da. To use with deploy scripts:

export MAIN_IMAGE_TAG=4.12.x-148-g026a0da031

@dashrews78 dashrews78 changed the title ROX-35XXX: Fix V1 compliance operator startup panic and convoy ROX-35012: Fix V1 compliance operator startup panic and convoy Jun 10, 2026
@dashrews78 dashrews78 marked this pull request as ready for review June 10, 2026 15:37
@dashrews78 dashrews78 added the backport release-4.10 and backport release-4.11 labels Jun 10, 2026
Comment thread central/complianceoperator/manager/manager.go
@dashrews78


/test gke-qa-e2e-tests

dashrews78 and others added 4 commits June 10, 2026 14:34
Use github.com/stackrox/rox/pkg/sync instead of stdlib sync.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests verify the semaphore limits concurrent AddProfile/AddRule calls
and that cancelled contexts are respected when the semaphore is full.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dashrews78 dashrews78 force-pushed the dashrews/compliance-v1-startup branch from 5cc3639 to e30417e Compare June 10, 2026 18:34
@dashrews78


/test gke-upgrade-tests

@openshift-ci

openshift-ci Bot commented Jun 10, 2026


@dashrews78: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/ocp-4-12-qa-e2e-tests | e30417e | link | false | `/test ocp-4-12-qa-e2e-tests` |

@dashrews78 dashrews78 merged commit 026a0da into master Jun 10, 2026
176 of 200 checks passed
@dashrews78 dashrews78 deleted the dashrews/compliance-v1-startup branch June 10, 2026 22:11
Labels

area/central, backport release-4.10, backport release-4.11
