Skip to content

ROX-33339: Optimize Sensor reprocessor gRPC and AC image cache churn#19840

Open
clickboo wants to merge 2 commits intomasterfrom
boo-central-refresh-ttl-invalidate
Open

ROX-33339: Optimize Sensor reprocessor gRPC and AC image cache churn#19840
clickboo wants to merge 2 commits intomasterfrom
boo-central-refresh-ttl-invalidate

Conversation

@clickboo
Copy link
Copy Markdown
Contributor

@clickboo clickboo commented Apr 6, 2026

Description

Optimizes Central's short-circuit reprocessor path to reduce unnecessary UpdatedImage message overhead and prevent full Admission Controller cache purges.

Previously, Central sent UpdatedImage for every image during short circuit reprocessing regardless of whether the scan data actually changed. This caused unnecessary proto serialization/deserialization overhead and, combined with the subsequent FlushCache, purged the entire admission controller cache even when most images were unchanged.

What changed (short-circuit path only):

Central's reprocessor now distinguishes between changed and unchanged images during the short-circuit reprocessing cycle:

  • Changed images (imageUpdated=true): Central sends UpdatedImage as before. Sensor's detector accumulates the image key.
  • Unchanged images (imageUpdated=false): Central skips UpdatedImage entirely and batches the image key for a single RefreshImageCacheTTL message per cluster, keeping Sensor's cache warm without the deserialization cost of full image data.

At the end of the cycle, Central sends ReprocessDeployments with skip_cache_flush=true. Sensor's detector then sends a single batched InvalidateImageCache to the Admission Controller for only the changed images, instead of flushing the entire admission controller cache. This keeps the Admission controller cache warm for unchanged images, avoiding unnecessary re-fetches on subsequent review requests.

Central -> Sensor gRPC message and cache operation comparison (short-circuit path, per cluster)

N = total images reprocessed, K = images with changed scan data, D = deployments with cached images in admission controller

Metric Before PR After PR
UpdatedImage messages (Central→Sensor) N K
RefreshImageCacheTTL (Central→Sensor) 1 (batched, N−K keys)
ReprocessDeployments (Central→Sensor) 1 1 (SkipCacheFlush=true)
Total Central→Sensor messages N + 1 K + 2
Proto serializations (full image data) N K
Sensor imageCache.Add (replace data) N K
Sensor imageCache.Touch (TTL only) N − K
Sensor deployment re-detections N K
AC image cache Full purge Targeted invalidation (K entries)
AC cache entries preserved 0 N − K
AC re-fetch load (post-purge) D (all cached images) K (only changed images)
Messages saved N − K − 1

Example: 500 images, 3 changed, 200 of those images in a review request burst to admission controller → Before: 501 messages, 500 deployment re-detections, full admission controller cache purge, 200 re-fetches. After: 5 messages, 3 deployment re-detections, 3 admission controller entries invalidated, 3 re-fetches.

Sensor detection during reassess (no image changes)

  • Master: each deployment is re-evaluated N+1 times — once per ResolveDeploymentsByImages for each of its images (triggered by N UpdatedImage messages), plus once from ResolveAllDeployments (triggered by ReprocessDeployments). All but the final pass are redundant.
  • This PR: each deployment is re-evaluated exactly once — from ResolveAllDeployments only. No UpdatedImage messages are sent for unchanged images, eliminating N redundant per-image re-detection passes. Policy evaluation outcomes are identical.

Key wins:

  • Message reduction: N + 1 → K + 2 (typically K ≪ N, saving nearly N messages per cycle)
  • Serialization savings: only K full image protos serialized/deserialized instead of N
  • Sensor cache churn eliminated: N − K unnecessary cache replacements become lightweight TTL touches
  • Deployment re-detection reduced: N → K; avoids N − K redundant ResolveDeploymentsByImages calls.
  • Admission controller cache kept warm: N − K entries preserved instead of full purge
  • Admission controller re-fetch load reduced: D → K; avoids a burst of image re-fetches from Sensor/Central for unchanged images after every reprocessing cycle

Periodic reprocessor path is unchanged in behavior: The periodic path (default 4h interval) continues to send UpdatedImage for every image and ReprocessDeployments with skip_cache_flush=false (the proto default). Sensor receives skip_cache_flush=false and performs a full FlushCache() on the admission controller cache. The admissionCacheNeedsFlush guard was removed since Central never sends ReprocessDeployments without preceding images (early return on zero results), making the flag always true when reached.

Version compatibility:

Central Sensor Behavior
New New (with TargetedImageCacheInvalidation capability) Short-circuit path uses RefreshImageCacheTTL for unchanged images, batched targeted invalidation for changed images
New Old (without capability) Central always sends UpdatedImage (capability gate in sendReprocessingMessages). skip_cache_flush defaults to false in proto, so old Sensor calls FlushCache(). Identical to pre-change behavior
Old New Old Central never sends RefreshImageCacheTTL or sets skip_cache_flush. New Sensor sees skip_cache_flush=false and calls FlushCache(). Identical to pre-change behavior

No breaking changes in any version combination.

User-facing documentation

Testing and quality

  • the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

  • added unit tests
  • added e2e tests
  • added regression tests
  • added compatibility tests
  • modified existing tests

How I validated my change

Testing

Validated end-to-end using bench-reassess.sh (available in #19427) against a live cluster, comparing a master build to this PR's build. The script automates:

  1. Burst 1 — sends N deployment creates (--dry-run=server) to warm AC/Sensor caches
  2. Snapshot pre-reassess — scrapes Prometheus metrics from AC, Sensor, and Central
  3. Trigger reassess — calls POST /v1/policies/reassess, waits for Central's reprocessor to complete by tailing logs for the completion marker
  4. Snapshot post-reassess — scrapes metrics again to measure reassess-cycle impact
  5. Burst 2 — repeats the burst to measure cache survival after reassess
  6. Snapshot post-burst2 — final metric scrape
  7. Report — prints deltas for key metrics across all three components

Run: BURST_SIZE=100 UNIQUE_PCT=25 PARALLEL=50 ./tools/admission-control/bench-reassess.sh

Results (100 deployments, 20 unique images):

Metric Master PR Change
Sensor component_process_message (reassess) 97 2 -98%
Sensor deployment_processed (reassess) 343 142 -59%
AC image_fetch_total (burst 2) 51 2 -96%
AC cache_hit_rate (burst 2) ~46% 98% cache survives reassess
AC review_duration avg (burst 2) 1.034s 0.034s 30x faster

A detailed description and analysis of the test results can be found here.

@clickboo
Copy link
Copy Markdown
Contributor Author

clickboo commented Apr 6, 2026

sourcery-ai[bot]

This comment was marked as off-topic.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 6, 2026

🚀 Build Images Ready

Images are ready for commit 5b5eb15. To use with deploy scripts:

export MAIN_IMAGE_TAG=4.11.x-609-g5b5eb15237

@codecov
Copy link
Copy Markdown

codecov bot commented Apr 6, 2026

Codecov Report

❌ Patch coverage is 76.62338% with 36 lines in your changes missing coverage. Please review.
✅ Project coverage is 49.56%. Comparing base (9d35506) to head (5b5eb15).
⚠️ Report is 14 commits behind head on master.

Files with missing lines Patch % Lines
central/reprocessor/reprocessor.go 85.92% 15 Missing and 4 partials ⚠️
sensor/common/detector/detector.go 0.00% 16 Missing ⚠️
sensor/kubernetes/eventpipeline/pipeline_impl.go 66.66% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #19840      +/-   ##
==========================================
- Coverage   49.61%   49.56%   -0.05%     
==========================================
  Files        2765     2764       -1     
  Lines      208541   208411     -130     
==========================================
- Hits       103464   103306     -158     
- Misses      97401    97450      +49     
+ Partials     7676     7655      -21     
Flag Coverage Δ
go-unit-tests 49.56% <76.62%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@clickboo clickboo force-pushed the boo-sensor-refresh-invalidate-reprocessor branch from a5bd2d1 to de678cb Compare April 7, 2026 09:29
@clickboo clickboo force-pushed the boo-central-refresh-ttl-invalidate branch from eb958d5 to 348a8b6 Compare April 7, 2026 09:29
@clickboo clickboo force-pushed the boo-sensor-refresh-invalidate-reprocessor branch 2 times, most recently from b1142f8 to f71c251 Compare April 7, 2026 12:00
@clickboo clickboo changed the title ROX-33339: Optimized sensor and admission controller image cache operations ROX-33339: Image cache optimizations in sensor and admission controller Apr 7, 2026
@clickboo clickboo changed the title ROX-33339: Image cache optimizations in sensor and admission controller ROX-33339: Sensor & Admission Controller image cache optimizations Apr 7, 2026
@clickboo clickboo changed the title ROX-33339: Sensor & Admission Controller image cache optimizations ROX-33339: Sensor & Admission Cntrl image cache optimizations Apr 7, 2026
@clickboo clickboo changed the title ROX-33339: Sensor & Admission Cntrl image cache optimizations ROX-33339: Sensor & Admission Controller image cache optimizations Apr 7, 2026
@clickboo clickboo changed the title ROX-33339: Sensor & Admission Controller image cache optimizations ROX-33339: Optimize Sensor reprocessor gRPC and AC cache churn Apr 7, 2026
@clickboo clickboo changed the title ROX-33339: Optimize Sensor reprocessor gRPC and AC cache churn ROX-33339: Optimize Sensor reprocessor gRPC and AC image cache churn Apr 7, 2026
Base automatically changed from boo-sensor-refresh-invalidate-reprocessor to master April 8, 2026 18:48
@clickboo clickboo force-pushed the boo-central-refresh-ttl-invalidate branch 2 times, most recently from 4953faf to 915f7d7 Compare April 8, 2026 19:44
@clickboo clickboo force-pushed the boo-central-refresh-ttl-invalidate branch from 915f7d7 to 7361076 Compare April 8, 2026 19:50
@coderabbitai

This comment was marked as low quality.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
sensor/kubernetes/eventpipeline/pipeline_test.go (1)

178-180: Assert the forwarded ReprocessDeployments payload.

This only proves the call happened. A nil or zero-valued message would still satisfy gomock.Any(), so the new skip_cache_flush contract is not actually covered here.

Suggested test tightening
-			s.detector.EXPECT().ProcessReprocessDeployments(gomock.Any()).Times(1).Do(func(_ *central.ReprocessDeployments) {
+			s.detector.EXPECT().ProcessReprocessDeployments(gomock.Any()).Times(1).Do(func(msg *central.ReprocessDeployments) {
+				protoassert.Equal(s.T(), msgFromCentral.GetReprocessDeployments(), msg)
 				defer messageReceived.Done()
 			})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sensor/kubernetes/eventpipeline/pipeline_test.go` around lines 178 - 180, The
test currently uses gomock.Any() in
s.detector.EXPECT().ProcessReprocessDeployments(...) so it only asserts the call
occurred; change the expectation to assert the actual ReprocessDeployments
payload (for example use
gomock.AssignableToTypeOf(&central.ReprocessDeployments{}) or gomock.Eq with a
constructed central.ReprocessDeployments) and in the Do(func(arg
*central.ReprocessDeployments) { ... }) verify the skip_cache_flush field (e.g.,
arg.SkipCacheFlush == true/false as appropriate) before calling
messageReceived.Done(); this ensures the new skip_cache_flush contract is
covered while keeping the same expectation target (ProcessReprocessDeployments)
and teardown.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@central/reprocessor/reprocessor.go`:
- Around line 688-724: The sendRefreshImageCacheTTL currently sends all
ImageKeys for a cluster in one message which can exceed per-message size limits;
update loopImpl.sendRefreshImageCacheTTL to split the keys slice from
state.refreshTTLKeys into bounded batches (e.g. maxKeysPerMsg constant) and send
one central.MsgToSensor_RefreshImageCacheTtl per batch using l.injectMessage;
iterate over batches for each clusterID, handle errors the same way (store
clusterID in state.skipClusterIDs and stop sending further batches for that
cluster), and log per-batch sends (use state.refreshTTLKeys, l.injectMessage,
central.RefreshImageCacheTTL, and state.skipClusterIDs to locate the code to
change).

In `@sensor/common/detector/detector.go`:
- Around line 363-366: The code currently rebuilds central.ImageKey from
storage.Image and appends it to d.updatedImageKeys, losing ImageIdV2; instead,
preserve and append the original central.ImageKey that came with the
updated-image event so ImageIdV2 is retained for InvalidateImageCache. Locate
where d.updatedImageKeys is modified (the append of central.ImageKey in the
updated-image path) and change it to use the incoming/original central.ImageKey
object from the update payload (do not reconstruct from storage.Image), ensuring
ImageIdV2 is carried through to InvalidateImageCache.

In `@ui/apps/platform/src/Containers/Policies/Wizard/Step4/PolicyScopeForm.tsx`:
- Around line 266-278: The exclusion deployment scope block always renders
PolicyScopeCardLegacy and passes props (cardTitle, showTooltips) it doesn't
accept; update it to mirror the inclusion pattern: import ExclusionScopeCard and
conditionally render ExclusionScopeCard when isNewScopingEnabled, falling back
to PolicyScopeCardLegacy otherwise, wiring the same props/handlers (e.g., name
`excludedDeploymentScopes[${index}]`, clusters, onDelete ->
deleteExclusionDeploymentScope, hasAuditLogEventSource) so the new UI receives
the correct component and legacy keeps the old behavior.

---

Nitpick comments:
In `@sensor/kubernetes/eventpipeline/pipeline_test.go`:
- Around line 178-180: The test currently uses gomock.Any() in
s.detector.EXPECT().ProcessReprocessDeployments(...) so it only asserts the call
occurred; change the expectation to assert the actual ReprocessDeployments
payload (for example use
gomock.AssignableToTypeOf(&central.ReprocessDeployments{}) or gomock.Eq with a
constructed central.ReprocessDeployments) and in the Do(func(arg
*central.ReprocessDeployments) { ... }) verify the skip_cache_flush field (e.g.,
arg.SkipCacheFlush == true/false as appropriate) before calling
messageReceived.Done(); this ensures the new skip_cache_flush contract is
covered while keeping the same expectation target (ProcessReprocessDeployments)
and teardown.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 76e57661-0383-4b2e-bf57-d416989b78ac

📥 Commits

Reviewing files that changed from the base of the PR and between 39d15cc and 915f7d7.

⛔ Files ignored due to path filters (2)
  • generated/internalapi/central/sensor_iservice.pb.go is excluded by !**/*.pb.go, !**/generated/**
  • generated/internalapi/central/sensor_iservice_vtproto.pb.go is excluded by !**/*.pb.go, !**/generated/**
📒 Files selected for processing (10)
  • central/reprocessor/reprocessor.go
  • central/reprocessor/reprocessor_unit_test.go
  • central/sensor/service/service_impl.go
  • pkg/images/enricher/enricher.go
  • proto/internalapi/central/sensor_iservice.proto
  • sensor/common/detector/detector.go
  • sensor/common/detector/mocks/detector.go
  • sensor/kubernetes/eventpipeline/pipeline_impl.go
  • sensor/kubernetes/eventpipeline/pipeline_test.go
  • ui/apps/platform/src/Containers/Policies/Wizard/Step4/PolicyScopeForm.tsx
💤 Files with no reviewable changes (1)
  • central/sensor/service/service_impl.go

@clickboo clickboo force-pushed the boo-central-refresh-ttl-invalidate branch 3 times, most recently from 2a2b7e9 to 27260bd Compare April 9, 2026 08:57
@clickboo clickboo force-pushed the boo-central-refresh-ttl-invalidate branch from 27260bd to 31a924a Compare April 9, 2026 12:10
@clickboo clickboo marked this pull request as ready for review April 9, 2026 12:10
@clickboo clickboo requested review from a team as code owners April 9, 2026 12:10
sourcery-ai[bot]

This comment was marked as low quality.

@stackrox stackrox deleted a comment from openshift-ci bot Apr 9, 2026
@stackrox stackrox deleted a comment from openshift-ci bot Apr 9, 2026
@sachaudh sachaudh removed the area/ui label Apr 10, 2026
@sachaudh
Copy link
Copy Markdown
Contributor

FYI, removed the area/ui label. The PR was on our radar, but I didn't see any UI changes here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants