ROX-35013: Replace full-table Walks with targeted WalkByQuery in compliance operator manager#21059

Draft
dashrews78 wants to merge 5 commits into master from dashrews/compliance-v1-query-not-walk

Conversation

@dashrews78

Contributor

Added search tags to ComplianceOperatorProfile name and cluster_id
fields, enabled search category 201 on the profile store, and replaced
3 of 4 full-table Walk calls in the manager with WalkByQuery filtered
by profile name or cluster ID.

addProfileNoLock: Walk all profiles → WalkByQuery(name = X), filter
cluster in callback. Reduces O(N) table scan to indexed query.
DeleteProfile: Walk all profiles → WalkByQuery(name = X), same pattern.
GetMachineConfigs: Walk all profiles → WalkByQuery(clusterId = X).
findProfilesWithRuleNoLock: Unchanged (searches inside repeated field).
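
A minimal sketch of the "query by name, filter cluster in callback" pattern described above, using a toy in-memory store. The `Store`, `Profile`, `WalkByName`, and `matchesInCluster` names are hypothetical and are not the StackRox datastore API; the map stands in for the database index that the new search tags enable, and the data sizes are arbitrary.

```go
package main

import "fmt"

// Profile is a hypothetical, simplified stand-in for a stored profile.
// Only the two fields that gained search tags are modeled.
type Profile struct {
	ID        string
	Name      string
	ClusterID string
}

// Store is a toy in-memory store. byName plays the role of a database index.
type Store struct {
	rows   []Profile
	byName map[string][]int // profile name -> positions in rows
}

func NewStore(rows []Profile) *Store {
	s := &Store{rows: rows, byName: make(map[string][]int)}
	for i, p := range rows {
		s.byName[p.Name] = append(s.byName[p.Name], i)
	}
	return s
}

// Walk visits every row and returns the number of rows visited.
func (s *Store) Walk(fn func(Profile)) int {
	for _, p := range s.rows {
		fn(p)
	}
	return len(s.rows)
}

// WalkByName visits only the rows whose name matches, via the index,
// and returns the number of rows visited.
func (s *Store) WalkByName(name string, fn func(Profile)) int {
	idx := s.byName[name]
	for _, i := range idx {
		fn(s.rows[i])
	}
	return len(idx)
}

// matchesInCluster queries by name and filters by cluster in the callback.
func matchesInCluster(s *Store, name, clusterID string) (matches, visited int) {
	visited = s.WalkByName(name, func(p Profile) {
		if p.ClusterID == clusterID {
			matches++
		}
	})
	return matches, visited
}

// buildStore creates the same set of profile names in every cluster.
func buildStore(clusters, profilesPerCluster int) *Store {
	var rows []Profile
	for c := 0; c < clusters; c++ {
		for p := 0; p < profilesPerCluster; p++ {
			rows = append(rows, Profile{
				ID:        fmt.Sprintf("c%d-p%d", c, p),
				Name:      fmt.Sprintf("profile-%d", p),
				ClusterID: fmt.Sprintf("cluster-%d", c),
			})
		}
	}
	return NewStore(rows)
}

func main() {
	s := buildStore(4, 50) // 200 rows in total
	full := s.Walk(func(Profile) {})
	matches, visited := matchesInCluster(s, "profile-7", "cluster-3")
	fmt.Println(full, visited, matches) // prints "200 4 1"
}
```

In this toy setup the full walk touches all 200 rows, while the name-filtered walk touches only the 4 rows that share the name (one per cluster) before the callback narrows them to a single cluster.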

Benchmark (20 clusters, 200 profiles concurrent):
master: 234ms, 316MB, 3.39M allocs
WalkByQuery: 103ms, 40MB, 449K allocs
Improvement: 2.3x faster, 8x less memory, 7.5x fewer allocations

Partially generated by AI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Description

change me!

User-facing documentation

Testing and quality

  • the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

  • added unit tests
  • added e2e tests
  • added regression tests
  • added compatibility tests
  • modified existing tests

How I validated my change

change me!

@dashrews78

dashrews78 commented Jun 10, 2026

Contributor Author

@openshift-ci

openshift-ci Bot commented Jun 10, 2026


Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Contributor

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: a10a9035-bc1c-4d6f-a6d0-edd95743b561

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions Bot commented Jun 10, 2026

Contributor

🚀 Build Images Ready

Images are ready for commit d668412. To use with deploy scripts:

export MAIN_IMAGE_TAG=4.12.x-147-gd668412785

@dashrews78 dashrews78 changed the title from "ROX-35XXX: Replace full-table Walks with targeted WalkByQuery in compliance operator manager" to "ROX-35013: Replace full-table Walks with targeted WalkByQuery in compliance operator manager" Jun 10, 2026
dashrews78 and others added 5 commits June 10, 2026 14:34
Central panics on startup with "context deadline exceeded / iterating
over rows" when many clusters reconnect simultaneously. Root cause:
NewManager did a Walk of all compliance profiles at startup, calling
addProfileNoLock for each — which itself did another full Walk. This
O(N²) pattern combined with concurrent sensor AddProfile calls created
a lock convoy on registryLock that exhausted the cursor timeout.
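
To make the O(N²) cost concrete, here is a rough sketch that only counts row visits. It assumes, as described above, one startup callback per profile that in turn performs a full walk; `rowsVisitedNested` is a hypothetical helper and does not model locking or the cursor timeout.

```go
package main

import "fmt"

// rowsVisitedNested counts the rows touched when a startup walk over n
// profiles triggers another full walk of all n profiles per callback.
func rowsVisitedNested(n int) int {
	visited := 0
	for i := 0; i < n; i++ { // startup walk: one callback per profile
		visited++
		for j := 0; j < n; j++ { // full walk inside each callback
			visited++
		}
	}
	return visited // n + n*n
}

func main() {
	for _, n := range []int{10, 100, 1000} {
		fmt.Println(n, rowsVisitedNested(n))
	}
}
```

With 1,000 profiles this comes to roughly a million row visits at startup, before any concurrent sensor traffic is taken into account.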

Fix:
- Remove the startup Walk in NewManager. Sensors always re-send all
  compliance operator data on reconnect (V1 compliance types skip
  deduping), so the registry populates naturally.
- Throttle concurrent profile and rule pipeline operations via semaphore
  (default 5, configurable via ROX_COMPLIANCE_V1_MAX_CONCURRENCY) to
  prevent DB connection pool exhaustion during mass sensor reconnects.

Partially generated by AI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Use github.com/stackrox/rox/pkg/sync instead of stdlib sync.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Tests verify the semaphore limits concurrent AddProfile/AddRule calls
and that cancelled contexts are respected when the semaphore is full.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…liance operator manager

Added search tags to ComplianceOperatorProfile name and cluster_id
fields, enabled search category 201 on the profile store, and replaced
3 of 4 full-table Walk calls in the manager with WalkByQuery filtered
by profile name or cluster ID.

addProfileNoLock: Walk all profiles → WalkByQuery(name = X), filter
  cluster in callback. Reduces O(N) table scan to indexed query.
DeleteProfile: Walk all profiles → WalkByQuery(name = X), same pattern.
GetMachineConfigs: Walk all profiles → WalkByQuery(clusterId = X).
findProfilesWithRuleNoLock: Unchanged (searches inside repeated field).

Benchmark (20 clusters, 200 profiles concurrent):
  master:         234ms, 316MB, 3.39M allocs
  WalkByQuery:    103ms,  40MB, 449K allocs
  Improvement:    2.3x faster, 8x less memory, 7.5x fewer allocations

Partially generated by AI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@dashrews78 dashrews78 force-pushed the dashrews/compliance-v1-startup branch from 5cc3639 to e30417e Compare June 10, 2026 18:34
@dashrews78 dashrews78 force-pushed the dashrews/compliance-v1-query-not-walk branch from bc3fa1d to d668412 Compare June 10, 2026 18:35
Base automatically changed from dashrews/compliance-v1-startup to master June 10, 2026 22:11