ROX-33555: Refactor UMH to per-resource tracking, wire node ACK flow #19319
Conversation
This change is part of the following stack. Change managed by git-spice.
Skipping CI for Draft Pull Request.
Images are ready for the commit at 6492540. To use with deploy scripts, first |
Codecov Report
❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #19319 +/- ##
==========================================
+ Coverage 49.61% 49.63% +0.02%
==========================================
Files 2765 2765
Lines 208541 208700 +159
==========================================
+ Hits 103459 103580 +121
- Misses 97405 97441 +36
- Partials 7677 7679 +2
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
7faf2a3 to d136021 (Compare)
d136021 to 0ea378c (Compare)
Hey - I've found 4 issues and left some high-level feedback:
- The new `UnconfirmedMessageHandlerImpl` docstring says it uses exponential backoff, but `calculateNextInterval` implements linear backoff (`(retry+1)*baseInterval`); consider either updating the comment or changing the logic to match the intended strategy.
- In `TestOnACKCallback`, `ackedResources` is appended to from multiple callback goroutines without synchronization, which is a data race under the race detector; consider guarding it with a mutex or using a channel to collect results.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The new `UnconfirmedMessageHandlerImpl` docstring says it uses exponential backoff, but `calculateNextInterval` implements linear backoff (`(retry+1)*baseInterval`); consider either updating the comment or changing the logic to match the intended strategy.
- In `TestOnACKCallback`, `ackedResources` is appended to from multiple callback goroutines without synchronization, which is a data race under the race detector; consider guarding it with a mutex or using a channel to collect results.
## Individual Comments
### Comment 1
<location path="pkg/retry/handler/unconfirmed_message_handler.go" line_range="27-29" />
<code_context>
+// Each resource has independent retry state with exponential backoff.
</code_context>
<issue_to_address>
**suggestion:** Backoff description says exponential but implementation is linear; align naming/behavior.
This comment promises exponential backoff, but the implementation uses linear backoff (`(retry + 1) * baseInterval`), matching the previous behavior. Please either change the wording to “linear backoff” or implement true exponential backoff (e.g., `baseInterval * (1 << retry)` with appropriate caps) to avoid confusing future readers.
```suggestion
// UnconfirmedMessageHandlerImpl handles ACK/NACK messages for multiple resources.
// Each resource has independent retry state with linear backoff.
type UnconfirmedMessageHandlerImpl struct {
```
</issue_to_address>
### Comment 2
<location path="pkg/retry/handler/unconfirmed_message_handler.go" line_range="203-217" />
<code_context>
+ // Schedule next retry
+ h.resetTimer(resourceID, state, nextInterval)
+
+ // Signal retry (non-blocking); if the channel is full we log and drop the signal.
+ select {
+ case <-h.ctx.Done():
+ return
+ case h.retryCommandCh <- resourceID:
+ default:
+ // If the channel is full or the consumer goroutine is busy, we assume that currently a
+ // transimssion of a message is in progress. This means, that we do not need to enqueue
+ // another message immediately afterwards. Thus, we drop the retry signal and log a warning.
</code_context>
<issue_to_address>
**suggestion:** Clarify behavior of the unbuffered retry channel and the "full" terminology.
Because `retryCommandCh` is unbuffered, "channel full" isn't quite accurate here: the `default` branch is taken when there is no receiver ready at the instant of the send, not due to queued signals. If this drop-when-no-receiver behavior is what you want, consider wording it in terms of "no receiver ready". If you instead want to allow a small backlog of retries, a buffered channel would more directly express that intent than a non-blocking send on an unbuffered channel.
```suggestion
// Schedule next retry
h.resetTimer(resourceID, state, nextInterval)
// Signal retry (non-blocking); if no receiver is ready on the unbuffered channel, we drop the signal.
select {
case <-h.ctx.Done():
return
case h.retryCommandCh <- resourceID:
default:
// If no receiver is ready or the consumer goroutine is busy, we assume that a transmission
// of a message is currently in progress. This means we do not need to enqueue another retry
// signal immediately afterwards. Thus, we drop the retry signal and log a warning.
log.Warnf("[%s] No receiver ready on retry channel, dropping retry signal for %s", h.handlerName, resourceID)
}
}
```
</issue_to_address>
### Comment 3
<location path="pkg/retry/handler/unconfirmed_message_handler.go" line_range="157-166" />
<code_context>
+ // HandleNACK is currently a no-op and has the same behavior as not reciving any [N]ACK message.
</code_context>
<issue_to_address>
**nitpick (typo):** Fix small typos in comments to improve readability.
Please fix the typos: "reciving" → "receiving" and "transimssion" → "transmission".
Suggested implementation:
```golang
// HandleNACK is currently a no-op and has the same behavior as not receiving any [N]ACK message.
```
And in the retry-signal comment, replace "transimssion" with:
```golang
transmission
```
</issue_to_address>
### Comment 4
<location path="pkg/retry/handler/unconfirmed_message_handler.go" line_range="42" />
<code_context>
+ onACK func(resourceID string)
+ ctx context.Context
+
+ // cleanupDone signals when cleanup is complete
+ cleanupDone concurrency.Stopper
}
</code_context>
<issue_to_address>
**issue (complexity):** Consider simplifying the handler’s lifecycle and retry signaling by removing the Stopper abstraction and making the retry channel’s coalescing behavior explicit.
You can trim a fair bit of complexity without losing any of the new functionality:
### 1. Avoid `concurrency.Stopper` if `Stopped()` isn’t needed
If nothing outside this package is actually calling `Stopped()`, you can drop the `Stopper` in favor of the existing `ctx` + channel close. That removes a whole lifecycle abstraction:
```go
type UnconfirmedMessageHandlerImpl struct {
handlerName string
baseInterval time.Duration
resources map[string]*resourceState
mu sync.Mutex
retryCommandCh chan string
onACK func(resourceID string)
ctx context.Context
}
// NewUnconfirmedMessageHandler creates a new handler for per-resource ACK/NACK tracking.
// It can be stopped by canceling the context.
func NewUnconfirmedMessageHandler(ctx context.Context, handlerName string, baseInterval time.Duration) *UnconfirmedMessageHandlerImpl {
h := &UnconfirmedMessageHandlerImpl{
handlerName: handlerName,
baseInterval: baseInterval,
resources: make(map[string]*resourceState),
retryCommandCh: make(chan string),
ctx: ctx,
}
go func() {
<-ctx.Done()
h.mu.Lock()
defer h.mu.Unlock()
for _, state := range h.resources {
if state.timer != nil {
state.timer.Stop()
}
}
close(h.retryCommandCh)
}()
return h
}
```
If you *do* need to expose a “stopped” signal, you can derive it from the context instead of a separate `Stopper`:
```go
// Stopped exposes the underlying context as a read‑only signal.
func (h *UnconfirmedMessageHandlerImpl) Stopped() <-chan struct{} {
return h.ctx.Done()
}
```
This keeps lifecycle signaling but removes the extra `Stopper`/`ReadOnlyErrorSignal` layering.
### 2. Make the retry channel semantics explicit
Right now the non‑blocking send on an unbuffered channel is subtle and the log message (“channel full”) is misleading. If the intent is “at most one pending retry, and drop additional ones while the consumer is busy”, you can make that policy explicit with a small buffer and coalescing flag:
```go
type resourceState struct {
retry int32
numUnackedSendings int32
timer *time.Timer
retryPending bool // new: whether a retry signal is already pending
}
```
```go
func NewUnconfirmedMessageHandler(ctx context.Context, handlerName string, baseInterval time.Duration) *UnconfirmedMessageHandlerImpl {
h := &UnconfirmedMessageHandlerImpl{
handlerName: handlerName,
baseInterval: baseInterval,
resources: make(map[string]*resourceState),
retryCommandCh: make(chan string, 1), // small buffer
ctx: ctx,
}
// cleanup goroutine as above ...
return h
}
```
And in `onTimerFired`:
```go
func (h *UnconfirmedMessageHandlerImpl) onTimerFired(resourceID string) {
h.mu.Lock()
defer h.mu.Unlock()
if h.ctx.Err() != nil {
return
}
state, exists := h.resources[resourceID]
if !exists || state.numUnackedSendings == 0 {
return
}
state.retry++
nextInterval := h.calculateNextInterval(state.retry)
log.Infof("[%s] Resource %s has %d unacked messages, suggesting retry %d (next in %s)",
h.handlerName, resourceID, state.numUnackedSendings, state.retry, nextInterval)
h.resetTimer(resourceID, state, nextInterval)
// Coalesce retry signals: only enqueue if none is pending.
if !state.retryPending {
select {
case <-h.ctx.Done():
return
case h.retryCommandCh <- resourceID:
state.retryPending = true
default:
// buffer full: keep retryPending true, don't log "full" since this is expected coalescing
}
}
}
```
Then clear `retryPending` when the consumer actually processes the retry (where you receive from `RetryCommand()`):
```go
// in the consumer, after handling a retry for resourceID:
handler.ClearRetryPending(resourceID)
```
with a simple method:
```go
func (h *UnconfirmedMessageHandlerImpl) ClearRetryPending(resourceID string) {
h.mu.Lock()
defer h.mu.Unlock()
if state, ok := h.resources[resourceID]; ok {
state.retryPending = false
}
}
```
This preserves the “don’t enqueue infinite retries while one is in flight” behavior, but makes the intent obvious and removes the misleading “channel full” warning.
</issue_to_address>
d3322c9 to 0085a80 (Compare)
Extracted from `piotr/ROX-32316-new-retry-logic` and adapted to split a large PR into smaller chunks. Rewrites UnconfirmedMessageHandler from single-resource to multi-resource with per-resource independent retry state and exponential backoff. Updates the compliance node inventory and node index scan loops to use the new resourceID-based API, handle closed retry channels, and process the new ComplianceACK message type from Sensor 4.10+ alongside the legacy NodeInventoryACK. VM_INDEX_REPORT ACK handling is stubbed for a follow-up. Split from the full ROX-32316 branch to keep the change reviewable. AI-assisted: code was extracted and adapted from the feature branch by AI, the source feature branch was written by the author, reviewed and verified by the author.
Aligns the unconfirmed-message handler docs/comments with actual linear backoff and unbuffered retry-channel behavior, fixes typos, and removes a callback data race in UMH tests. Also updates compliance_ack_test to the current ACK API so style/lint/unit checks stop failing on interface/method drift.
6492540 to 950ed7c (Compare)
🚀 Build Images Ready
Images are ready for commit c400789. To use with deploy scripts: export MAIN_IMAGE_TAG=4.11.x-614-gc400789f56
📝 Walkthrough
Refactors retry handling to per-resource timers emitting resourceIDs; updates UnconfirmedMessageHandler API to be resource-aware; threads resourceIDs through compliance ACK/NACK handling, splits node caches, replaces legacy ACK dispatcher with
Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Compliance as Compliance Module
    participant Handler as UnconfirmedMessageHandler
    participant Timers as Per-Resource Timers
    participant RetryCh as Retry Command Channel
    Compliance->>Handler: ObserveSending(resourceID)
    Handler->>Timers: create/update timer for resourceID
    Timers->>Handler: timer fires for resourceID
    Handler->>RetryCh: emit resourceID
    RetryCh->>Compliance: receive resourceID (RetryCommand)
    Compliance->>Handler: HandleACK(resourceID) / HandleNACK(resourceID)
    Handler->>Timers: stop/delete timer for resourceID (on ACK)
    Handler->>Compliance: invoke OnACK(resourceID) if registered
    Compliance->>Handler: Stopped()
    Handler->>RetryCh: close retry channel on shutdown
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@compliance/compliance.go`:
- Around line 228-239: The retry logic erroneously resends the shared c.cache
for node-index retries; update the code to maintain separate cached payloads
(e.g., add c.nodeIndexCache and c.nodeInventoryCache or a per-message-type retry
store) and have the runNodeIndex() retry branch read/send c.nodeIndexCache
(while runNodeInventory() uses c.nodeInventoryCache). Ensure these caches are
populated where runNodeIndex() (previously lines ~283-288) and the inventory
scan set their respective cache fields, and protect each cache with its own
mutex or use channel-based ownership to avoid cross-loop races when
writing/reading the caches.
In `@pkg/retry/handler/unconfirmed_message_handler.go`:
- Around line 119-121: The initial timer is being armed with h.baseInterval
directly causing immediate retries when baseInterval is non-positive; change the
call in the first-unacked path to use the same normalized interval as the later
logic (i.e., compute interval = h.baseInterval if >0 else defaultBaseInterval or
call the existing normalization helper) and pass that interval into
h.resetTimer(resourceID, state, interval); apply the same normalization to the
other resetTimer invocation(s) in the retry/backoff branch (the code around the
existing normalization at lines 219-229) so all timers use defaultBaseInterval
when baseInterval is non-positive.
- Around line 36-38: The single-slot string channel retryCommandCh coalesces
across all resources; change the design to a notification channel plus a
per-resource pending set: replace retryCommandCh chan string with retryNotifyCh
chan struct{} (buffer 1) and add a pendingRetries map[string]bool, then in the
timer firing path (the coalescing logic around the existing branch at Lines
~207-215) mark pendingRetries[resourceID]=true and do a non-blocking send to
retryNotifyCh; in the consumer/worker loop, on receiving from retryNotifyCh
iterate the pendingRetries keys, clear each entry and handle retries for that
specific resource (instead of relying on a single queued string), so retries are
coalesced per-resource rather than across all resources.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: a691a124-6090-4039-a05d-fc7c9a53193d
📒 Files selected for processing (6)
compliance/compliance.go, compliance/compliance_ack_test.go, compliance/compliance_test.go, compliance/node/interfaces.go, pkg/retry/handler/unconfirmed_message_handler.go, pkg/retry/handler/unconfirmed_message_handler_test.go
@coderabbitai recheck
✅ Actions performed: Full review triggered.
Actionable comments posted: 1
🧹 Nitpick comments (2)
compliance/compliance_test.go (1)
32-48: Assert the routed `resourceID` in this suite. The mock drops the `resourceID`, so these tests only prove that some ACK/NACK handler was called. A regression that routes `""` or the wrong key into `HandleACK`/`HandleNACK` would still pass, even though per-resource routing is the behavior this PR is introducing. Please record the seen IDs and assert they match `nodeResourceID`. As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
Also applies to: 57-117
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@compliance/compliance_test.go` around lines 32 - 48: the mockUnconfirmedMessageHandler currently discards the resourceID in HandleACK and HandleNACK so tests only verify a handler was invoked; modify the mock (mockUnconfirmedMessageHandler) to record the received resourceID(s) (e.g., append to a slice or set field) in HandleACK(resourceID string) and HandleNACK(resourceID string), expose a method or field to retrieve them, and update the tests to assert the recorded IDs equal the expected nodeResourceID; also apply the same change for any other mock methods in this file that swallow the id (see ObserveSending/OnACK areas between lines 57-117) so all assertions verify the routed resourceID rather than just call counts.
pkg/retry/handler/unconfirmed_message_handler_test.go (1)
17-115: Move the retry timing table off wall-clock sleeps. These cases still wait on real time for up to ~9 seconds and race a background counting goroutine, so they remain slow and occasionally flaky under CI load. Since this file now uses `testing/synctest` for the newer timer-sensitive paths, porting this table as well would make the retry/backoff coverage deterministic and keep test latency down. As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/retry/handler/unconfirmed_message_handler_test.go` around lines 17 - 115, The test TestWithRetryable relies on real time sleeps and races with a background goroutine; replace wall-clock waits with a deterministic fake clock from testing/synctest and advance it to trigger retries. Modify the test to create a synctest.Clock, inject it into the UnconfirmedMessageHandler (e.g., add or use a constructor like NewUnconfirmedMessageHandlerWithClock(ctx, "test", baseDuration, clock) or add a clock option), then schedule sends/acks/nacks by advancing the fake clock (instead of goroutines using time.After) and drive umh.RetryCommand by advancing the clock to expected retry instants; update uses of ObserveSending, HandleACK and HandleNACK to be called synchronously at the appropriate clock advances and assert the counter deterministically.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/retry/handler/unconfirmed_message_handler.go`:
- Around line 146-173: HandleACK currently removes the resource from h.resources
but does not cancel queued retry commands, so the dispatcher can still publish
stale retries; modify HandleACK (in UnconfirmedMessageHandlerImpl) to also
remove/cancel any pending entries for resourceID from pendingRetries and close
or mark related commands so they are not sent, and update the dispatcher logic
that reads from retryCommandCh to revalidate the resource's liveness (check
h.resources or a dedicated isLive(resourceID) method) before publishing a retry;
ensure cancellation happens while holding h.mu and that any callbacks (h.onACK)
are still invoked outside the lock.
---
Nitpick comments:
In `@compliance/compliance_test.go`:
- Around line 32-48: The mockUnconfirmedMessageHandler currently discards the
resourceID in HandleACK and HandleNACK so tests only verify a handler was
invoked; modify the mock (mockUnconfirmedMessageHandler) to record the received
resourceID(s) (e.g., append to a slice or set field) in HandleACK(resourceID
string) and HandleNACK(resourceID string), expose a method or field to retrieve
them, and update the tests to assert the recorded IDs equal the expected
nodeResourceID; also apply the same change for any other mock methods in this
file that swallow the id (see ObserveSending/OnACK areas between lines 57-117)
so all assertions verify the routed resourceID rather than just call counts.
In `@pkg/retry/handler/unconfirmed_message_handler_test.go`:
- Around line 17-115: The test TestWithRetryable relies on real time sleeps and
races with a background goroutine; replace wall-clock waits with a deterministic
fake clock from testing/synctest and advance it to trigger retries. Modify the
test to create a synctest.Clock, inject it into the UnconfirmedMessageHandler
(e.g., add or use a constructor like NewUnconfirmedMessageHandlerWithClock(ctx,
"test", baseDuration, clock) or add a clock option), then schedule
sends/acks/nacks by advancing the fake clock (instead of goroutines using
time.After) and drive umh.RetryCommand by advancing the clock to expected retry
instants; update uses of ObserveSending, HandleACK and HandleNACK to be called
synchronously at the appropriate clock advances and assert the counter
deterministically.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 908902b8-3a8a-4e8b-9e4a-71f6c425fb27
📒 Files selected for processing (6)
compliance/compliance.go, compliance/compliance_ack_test.go, compliance/compliance_test.go, compliance/node/interfaces.go, pkg/retry/handler/unconfirmed_message_handler.go, pkg/retry/handler/unconfirmed_message_handler_test.go
HandleACK removed the resource from h.resources but left it in pendingRetries, so the dispatcher could still emit retries for already-acknowledged resources. Two race windows existed: 1. Resource in pendingRetries when ACK arrives — dispatcher drains it. 2. Resource already in dispatcher's local slice after takePendingRetries. Fix: HandleACK now also clears pendingRetries, and the dispatcher validates resource liveness via isResourceActive before sending. AI-generated: implementation and regression tests. Human-reviewed: race analysis, code correctness, style conformance.
@coderabbitai check again
✅ Actions performed: Full review triggered.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@compliance/compliance.go`:
- Around line 39-43: The code currently keys UMH sends with the synthetic
constant nodeResourceID ("this-node") instead of the actual resource id from
incoming ACK/NACKs; update the send/tracking logic to use ack.GetResourceId()
(or the real node name passed into the send) as the map key and ensure that the
resource_id is forwarded into HandleACK and HandleNACK so they validate and
match the exact resource before clearing retry state; replace uses of
nodeResourceID in send/track operations and in the ACK/NACK handling paths
(including calls to HandleACK/HandleNACK) so stale/mismatched ACKs cannot
acknowledge the wrong message.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 9b4bad7c-3a88-4186-9185-4a99e092cd7c
📒 Files selected for processing (6)
compliance/compliance.go, compliance/compliance_ack_test.go, compliance/compliance_test.go, compliance/node/interfaces.go, pkg/retry/handler/unconfirmed_message_handler.go, pkg/retry/handler/unconfirmed_message_handler_test.go
Description
Refactors `UnconfirmedMessageHandler` (UMH) from single-resource to per-resource tracking and wires the node inventory / node index ACK flow through the new `ComplianceACK` message type.
Problem: The existing UMH tracked a single global retry state, which doesn't work for the VM scanning use case where hundreds of independent resources (VMs) connect to a single Compliance container and need independent ACK/retry tracking. Before, there was always exactly one node per compliance container, so there was no need to track resourceIDs.
Solution:
- Tracks per-resource state (`map[string]*resourceState`), each with independent retry counter, unacked count, and timer.
- Handles the new `ComplianceACK` message from Sensor 4.10+.

AI-assisted: code was extracted and adapted from a larger feature branch by AI (selective file copying and `compliance.go` splitting), reviewed and verified by the author.
User-facing documentation
Testing and quality
Automated testing
How I validated my change
Logs: