ROX-33339: Optimize Sensor reprocessor gRPC and AC image cache churn#19840
ROX-33339: Optimize Sensor reprocessor gRPC and AC image cache churn#19840
Conversation
|
This change is part of the following stack: Change managed by git-spice. |
🚀 Build Images ReadyImages are ready for commit 5b5eb15. To use with deploy scripts: export MAIN_IMAGE_TAG=4.11.x-609-g5b5eb15237 |
633958e to
0ea06a1
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #19840 +/- ##
==========================================
- Coverage 49.61% 49.56% -0.05%
==========================================
Files 2765 2764 -1
Lines 208541 208411 -130
==========================================
- Hits 103464 103306 -158
- Misses 97401 97450 +49
+ Partials 7676 7655 -21
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
a5bd2d1 to
de678cb
Compare
eb958d5 to
348a8b6
Compare
b1142f8 to
f71c251
Compare
4953faf to
915f7d7
Compare
915f7d7 to
7361076
Compare
This comment was marked as low quality.
This comment was marked as low quality.
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
sensor/kubernetes/eventpipeline/pipeline_test.go (1)
178-180: Assert the forwardedReprocessDeploymentspayload.This only proves the call happened. A
nilor zero-valued message would still satisfygomock.Any(), so the newskip_cache_flushcontract is not actually covered here.Suggested test tightening
- s.detector.EXPECT().ProcessReprocessDeployments(gomock.Any()).Times(1).Do(func(_ *central.ReprocessDeployments) { + s.detector.EXPECT().ProcessReprocessDeployments(gomock.Any()).Times(1).Do(func(msg *central.ReprocessDeployments) { + protoassert.Equal(s.T(), msgFromCentral.GetReprocessDeployments(), msg) defer messageReceived.Done() })🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sensor/kubernetes/eventpipeline/pipeline_test.go` around lines 178 - 180, The test currently uses gomock.Any() in s.detector.EXPECT().ProcessReprocessDeployments(...) so it only asserts the call occurred; change the expectation to assert the actual ReprocessDeployments payload (for example use gomock.AssignableToTypeOf(¢ral.ReprocessDeployments{}) or gomock.Eq with a constructed central.ReprocessDeployments) and in the Do(func(arg *central.ReprocessDeployments) { ... }) verify the skip_cache_flush field (e.g., arg.SkipCacheFlush == true/false as appropriate) before calling messageReceived.Done(); this ensures the new skip_cache_flush contract is covered while keeping the same expectation target (ProcessReprocessDeployments) and teardown.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@central/reprocessor/reprocessor.go`:
- Around line 688-724: The sendRefreshImageCacheTTL currently sends all
ImageKeys for a cluster in one message which can exceed per-message size limits;
update loopImpl.sendRefreshImageCacheTTL to split the keys slice from
state.refreshTTLKeys into bounded batches (e.g. maxKeysPerMsg constant) and send
one central.MsgToSensor_RefreshImageCacheTtl per batch using l.injectMessage;
iterate over batches for each clusterID, handle errors the same way (store
clusterID in state.skipClusterIDs and stop sending further batches for that
cluster), and log per-batch sends (use state.refreshTTLKeys, l.injectMessage,
central.RefreshImageCacheTTL, and state.skipClusterIDs to locate the code to
change).
In `@sensor/common/detector/detector.go`:
- Around line 363-366: The code currently rebuilds central.ImageKey from
storage.Image and appends it to d.updatedImageKeys, losing ImageIdV2; instead,
preserve and append the original central.ImageKey that came with the
updated-image event so ImageIdV2 is retained for InvalidateImageCache. Locate
where d.updatedImageKeys is modified (the append of central.ImageKey in the
updated-image path) and change it to use the incoming/original central.ImageKey
object from the update payload (do not reconstruct from storage.Image), ensuring
ImageIdV2 is carried through to InvalidateImageCache.
In `@ui/apps/platform/src/Containers/Policies/Wizard/Step4/PolicyScopeForm.tsx`:
- Around line 266-278: The exclusion deployment scope block always renders
PolicyScopeCardLegacy and passes props (cardTitle, showTooltips) it doesn't
accept; update it to mirror the inclusion pattern: import ExclusionScopeCard and
conditionally render ExclusionScopeCard when isNewScopingEnabled, falling back
to PolicyScopeCardLegacy otherwise, wiring the same props/handlers (e.g., name
`excludedDeploymentScopes[${index}]`, clusters, onDelete ->
deleteExclusionDeploymentScope, hasAuditLogEventSource) so the new UI receives
the correct component and legacy keeps the old behavior.
---
Nitpick comments:
In `@sensor/kubernetes/eventpipeline/pipeline_test.go`:
- Around line 178-180: The test currently uses gomock.Any() in
s.detector.EXPECT().ProcessReprocessDeployments(...) so it only asserts the call
occurred; change the expectation to assert the actual ReprocessDeployments
payload (for example use
gomock.AssignableToTypeOf(¢ral.ReprocessDeployments{}) or gomock.Eq with a
constructed central.ReprocessDeployments) and in the Do(func(arg
*central.ReprocessDeployments) { ... }) verify the skip_cache_flush field (e.g.,
arg.SkipCacheFlush == true/false as appropriate) before calling
messageReceived.Done(); this ensures the new skip_cache_flush contract is
covered while keeping the same expectation target (ProcessReprocessDeployments)
and teardown.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 76e57661-0383-4b2e-bf57-d416989b78ac
⛔ Files ignored due to path filters (2)
generated/internalapi/central/sensor_iservice.pb.gois excluded by!**/*.pb.go,!**/generated/**generated/internalapi/central/sensor_iservice_vtproto.pb.gois excluded by!**/*.pb.go,!**/generated/**
📒 Files selected for processing (10)
central/reprocessor/reprocessor.gocentral/reprocessor/reprocessor_unit_test.gocentral/sensor/service/service_impl.gopkg/images/enricher/enricher.goproto/internalapi/central/sensor_iservice.protosensor/common/detector/detector.gosensor/common/detector/mocks/detector.gosensor/kubernetes/eventpipeline/pipeline_impl.gosensor/kubernetes/eventpipeline/pipeline_test.goui/apps/platform/src/Containers/Policies/Wizard/Step4/PolicyScopeForm.tsx
💤 Files with no reviewable changes (1)
- central/sensor/service/service_impl.go
ui/apps/platform/src/Containers/Policies/Wizard/Step4/PolicyScopeForm.tsx
Outdated
Show resolved
Hide resolved
2a2b7e9 to
27260bd
Compare
27260bd to
31a924a
Compare
|
FYI, removed the |
Description
Optimizes Central's short-circuit reprocessor path to reduce unnecessary
UpdatedImagemessage overhead and prevent full Admission Controller cache purges.Previously, Central sent
UpdatedImagefor every image during short circuit reprocessing regardless of whether the scan data actually changed. This caused unnecessary proto serialization/deserialization overhead and, combined with the subsequentFlushCache, purged the entire admission controller cache even when most images were unchanged.What changed (short-circuit path only):
Central's reprocessor now distinguishes between changed and unchanged images during the short-circuit reprocessing cycle:
imageUpdated=true): Central sendsUpdatedImageas before. Sensor's detector accumulates the image key.imageUpdated=false): Central skipsUpdatedImageentirely and batches the image key for a singleRefreshImageCacheTTLmessage per cluster, keeping Sensor's cache warm without the deserialization cost of full image data.At the end of the cycle, Central sends
ReprocessDeploymentswithskip_cache_flush=true. Sensor's detector then sends a single batchedInvalidateImageCacheto the Admission Controller for only the changed images, instead of flushing the entire admission controller cache. This keeps the Admission controller cache warm for unchanged images, avoiding unnecessary re-fetches on subsequent review requests.Central -> Sensor gRPC message and cache operation comparison (short-circuit path, per cluster)
N = total images reprocessed, K = images with changed scan data, D = deployments with cached images in admission controller
UpdatedImagemessages (Central→Sensor)RefreshImageCacheTTL(Central→Sensor)ReprocessDeployments(Central→Sensor)SkipCacheFlush=true)imageCache.Add(replace data)imageCache.Touch(TTL only)Example: 500 images, 3 changed, 200 of those images in a review request burst to admission controller → Before: 501 messages, 500 deployment re-detections, full admission controller cache purge, 200 re-fetches. After: 5 messages, 3 deployment re-detections, 3 admission controller entries invalidated, 3 re-fetches.
Sensor detection during reassess (no image changes)
ResolveDeploymentsByImagesfor each of its images (triggered by NUpdatedImagemessages), plus once fromResolveAllDeployments(triggered byReprocessDeployments). All but the final pass are redundant.ResolveAllDeploymentsonly. NoUpdatedImagemessages are sent for unchanged images, eliminating N redundant per-image re-detection passes. Policy evaluation outcomes are identical.Key wins:
ResolveDeploymentsByImagescalls.Periodic reprocessor path is unchanged in behavior: The periodic path (default 4h interval) continues to send
UpdatedImagefor every image andReprocessDeploymentswithskip_cache_flush=false(the proto default). Sensor receivesskip_cache_flush=falseand performs a fullFlushCache()on the admission controller cache. TheadmissionCacheNeedsFlushguard was removed since Central never sendsReprocessDeploymentswithout preceding images (early return on zero results), making the flag always true when reached.Version compatibility:
TargetedImageCacheInvalidationcapability)RefreshImageCacheTTLfor unchanged images, batched targeted invalidation for changed imagesUpdatedImage(capability gate insendReprocessingMessages).skip_cache_flushdefaults tofalsein proto, so old Sensor callsFlushCache(). Identical to pre-change behaviorRefreshImageCacheTTLor setsskip_cache_flush. New Sensor seesskip_cache_flush=falseand callsFlushCache(). Identical to pre-change behaviorNo breaking changes in any version combination.
User-facing documentation
Testing and quality
Automated testing
How I validated my change
Testing
Validated end-to-end using
bench-reassess.sh(available in #19427) against a live cluster, comparing a master build to this PR's build. The script automates:--dry-run=server) to warm AC/Sensor cachesPOST /v1/policies/reassess, waits for Central's reprocessor to complete by tailing logs for the completion markerRun:
BURST_SIZE=100 UNIQUE_PCT=25 PARALLEL=50 ./tools/admission-control/bench-reassess.shResults (100 deployments, 20 unique images):
component_process_message(reassess)deployment_processed(reassess)image_fetch_total(burst 2)cache_hit_rate(burst 2)review_durationavg (burst 2)A detailed description and analysis of the test results can be found here.