ROX-35012: Fix V1 compliance operator startup panic and convoy #21058
Build images are ready for commit 026a0da. To use with deploy scripts: export MAIN_IMAGE_TAG=4.12.x-148-g026a0da031
/test gke-qa-e2e-tests
Central panics on startup with "context deadline exceeded / iterating over rows" when many clusters reconnect simultaneously.

Root cause: NewManager did a Walk of all compliance profiles at startup, calling addProfileNoLock for each, which itself did another full Walk. This O(N²) pattern combined with concurrent sensor AddProfile calls created a lock convoy on registryLock that exhausted the cursor timeout.

Fix:
- Remove the startup Walk in NewManager. Sensors always re-send all compliance operator data on reconnect (V1 compliance types skip deduping), so the registry populates naturally.
- Throttle concurrent profile and rule pipeline operations via semaphore (default 5, configurable via ROX_COMPLIANCE_V1_MAX_CONCURRENCY) to prevent DB connection pool exhaustion during mass sensor reconnects.

Partially generated by AI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use github.com/stackrox/rox/pkg/sync instead of stdlib sync. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests verify the semaphore limits concurrent AddProfile/AddRule calls and that cancelled contexts are respected when the semaphore is full. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/test gke-upgrade-tests
Description
Central panics on startup with "context deadline exceeded / iterating
over rows" when many clusters reconnect simultaneously. Root cause:
NewManager did a Walk of all compliance profiles at startup, calling
addProfileNoLock for each — which itself did another full Walk. This
O(N²) pattern combined with concurrent sensor AddProfile calls created
a lock convoy on registryLock that exhausted the cursor timeout.
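
To make the quadratic cost concrete, here is a simplified model of the removed startup path. It is only a sketch: `store`, `walk`, `addProfile`, and `startupWalk` are hypothetical stand-ins, not the actual datastore or manager code, and the inner walk does nothing except count the rows it reads.

```go
package main

// store is a hypothetical stand-in for the profile datastore.
// visits counts every row read by any walk.
type store struct {
	profiles []string
	visits   int
}

// walk calls fn once per stored profile and counts each row read.
func (s *store) walk(fn func(name string)) {
	for _, p := range s.profiles {
		s.visits++
		fn(p)
	}
}

// addProfile models a per-profile registration that itself walks
// the whole store, as the description says addProfileNoLock did.
func (s *store) addProfile(name string) {
	s.walk(func(string) {})
}

// startupWalk models the removed NewManager behaviour: one outer walk
// that calls addProfile for every row. It returns the total rows read.
func startupWalk(s *store) int {
	s.visits = 0
	s.walk(func(name string) { s.addProfile(name) })
	return s.visits
}
```

In this model, N profiles cost N reads for the outer walk plus N reads per profile for the inner walks, so N + N² rows in total, all before any sensor traffic competes for the same lock.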
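Fix:
- Remove the startup Walk in NewManager. Sensors always re-send all
  compliance operator data on reconnect (V1 compliance types skip
  deduping), so the registry populates naturally.
- Throttle concurrent profile and rule pipeline operations via semaphore
  (default 5, configurable via ROX_COMPLIANCE_V1_MAX_CONCURRENCY) to
  prevent DB connection pool exhaustion during mass sensor reconnects.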
Partially generated by AI.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
User-facing documentation
Testing and quality
Automated testing
How I validated my change
Added unit tests and benchmarks. Fired up a cluster and confirmed that the in-memory map is populated correctly as profiles are added, by exercising the deprecated UI, which is the only place that uses the in-memory map.