ROX-33555: Refactor UMH to per-resource tracking, wire node ACK flow #19319
Conversation
This change is part of the following stack. Change managed by git-spice.
Skipping CI for Draft Pull Request.
Images are ready for the commit at 6492540. To use with deploy scripts, first |
Codecov Report
❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #19319 +/- ##
==========================================
+ Coverage 49.61% 49.63% +0.02%
==========================================
Files 2765 2765
Lines 208541 208700 +159
==========================================
+ Hits 103459 103580 +121
- Misses 97405 97441 +36
- Partials 7677 7679 +2
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
7faf2a3 to d136021 (Compare)
d136021 to 0ea378c (Compare)
Hey - I've found 4 issues and left some high-level feedback:
- The new `UnconfirmedMessageHandlerImpl` docstring says it uses exponential backoff, but `calculateNextInterval` implements linear backoff (`(retry+1)*baseInterval`); consider either updating the comment or changing the logic to match the intended strategy.
- In `TestOnACKCallback`, `ackedResources` is appended to from multiple callback goroutines without synchronization, which is a data race under the race detector; consider guarding it with a mutex or using a channel to collect results.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The new `UnconfirmedMessageHandlerImpl` docstring says it uses exponential backoff, but `calculateNextInterval` implements linear backoff (`(retry+1)*baseInterval`); consider either updating the comment or changing the logic to match the intended strategy.
- In `TestOnACKCallback`, `ackedResources` is appended to from multiple callback goroutines without synchronization, which is a data race under the race detector; consider guarding it with a mutex or using a channel to collect results.
## Individual Comments
### Comment 1
<location path="pkg/retry/handler/unconfirmed_message_handler.go" line_range="27-29" />
<code_context>
+// Each resource has independent retry state with exponential backoff.
</code_context>
<issue_to_address>
**suggestion:** Backoff description says exponential but implementation is linear; align naming/behavior.
This comment promises exponential backoff, but the implementation uses linear backoff (`(retry + 1) * baseInterval`), matching the previous behavior. Please either change the wording to “linear backoff” or implement true exponential backoff (e.g., `baseInterval * (1 << retry)` with appropriate caps) to avoid confusing future readers.
```suggestion
// UnconfirmedMessageHandlerImpl handles ACK/NACK messages for multiple resources.
// Each resource has independent retry state with linear backoff.
type UnconfirmedMessageHandlerImpl struct {
```
</issue_to_address>
### Comment 2
<location path="pkg/retry/handler/unconfirmed_message_handler.go" line_range="203-217" />
<code_context>
+ // Schedule next retry
+ h.resetTimer(resourceID, state, nextInterval)
+
+ // Signal retry (non-blocking); if the channel is full we log and drop the signal.
+ select {
+ case <-h.ctx.Done():
+ return
+ case h.retryCommandCh <- resourceID:
+ default:
+ // If the channel is full or the consumer goroutine is busy, we assume that currently a
+ // transimssion of a message is in progress. This means, that we do not need to enqueue
+ // another message immediately afterwards. Thus, we drop the retry signal and log a warning.
</code_context>
<issue_to_address>
**suggestion:** Clarify behavior of the unbuffered retry channel and the "full" terminology.
Because `retryCommandCh` is unbuffered, "channel full" isn't quite accurate here: the `default` branch is taken when there is no receiver ready at the instant of the send, not due to queued signals. If this drop-when-no-receiver behavior is what you want, consider wording it in terms of "no receiver ready". If you instead want to allow a small backlog of retries, a buffered channel would more directly express that intent than a non-blocking send on an unbuffered channel.
```suggestion
// Schedule next retry
h.resetTimer(resourceID, state, nextInterval)
// Signal retry (non-blocking); if no receiver is ready on the unbuffered channel, we drop the signal.
select {
case <-h.ctx.Done():
return
case h.retryCommandCh <- resourceID:
default:
// If no receiver is ready or the consumer goroutine is busy, we assume that a transmission
// of a message is currently in progress. This means we do not need to enqueue another retry
// signal immediately afterwards. Thus, we drop the retry signal and log a warning.
log.Warnf("[%s] No receiver ready on retry channel, dropping retry signal for %s", h.handlerName, resourceID)
}
}
```
</issue_to_address>
### Comment 3
<location path="pkg/retry/handler/unconfirmed_message_handler.go" line_range="157-166" />
<code_context>
+ // HandleNACK is currently a no-op and has the same behavior as not reciving any [N]ACK message.
</code_context>
<issue_to_address>
**nitpick (typo):** Fix small typos in comments to improve readability.
Please fix the typos: "reciving" → "receiving" and "transimssion" → "transmission".
Suggested implementation:
```golang
// HandleNACK is currently a no-op and has the same behavior as not receiving any [N]ACK message.
```
And in the retry-signal comment, replace "transimssion" with:
```golang
transmission
```
</issue_to_address>
### Comment 4
<location path="pkg/retry/handler/unconfirmed_message_handler.go" line_range="42" />
<code_context>
+ onACK func(resourceID string)
+ ctx context.Context
+
+ // cleanupDone signals when cleanup is complete
+ cleanupDone concurrency.Stopper
}
</code_context>
<issue_to_address>
**issue (complexity):** Consider simplifying the handler’s lifecycle and retry signaling by removing the Stopper abstraction and making the retry channel’s coalescing behavior explicit.
You can trim a fair bit of complexity without losing any of the new functionality:
### 1. Avoid `concurrency.Stopper` if `Stopped()` isn’t needed
If nothing outside this package is actually calling `Stopped()`, you can drop the `Stopper` in favor of the existing `ctx` + channel close. That removes a whole lifecycle abstraction:
```go
type UnconfirmedMessageHandlerImpl struct {
handlerName string
baseInterval time.Duration
resources map[string]*resourceState
mu sync.Mutex
retryCommandCh chan string
onACK func(resourceID string)
ctx context.Context
}
// NewUnconfirmedMessageHandler creates a new handler for per-resource ACK/NACK tracking.
// It can be stopped by canceling the context.
func NewUnconfirmedMessageHandler(ctx context.Context, handlerName string, baseInterval time.Duration) *UnconfirmedMessageHandlerImpl {
h := &UnconfirmedMessageHandlerImpl{
handlerName: handlerName,
baseInterval: baseInterval,
resources: make(map[string]*resourceState),
retryCommandCh: make(chan string),
ctx: ctx,
}
go func() {
<-ctx.Done()
h.mu.Lock()
defer h.mu.Unlock()
for _, state := range h.resources {
if state.timer != nil {
state.timer.Stop()
}
}
close(h.retryCommandCh)
}()
return h
}
```
If you *do* need to expose a “stopped” signal, you can derive it from the context instead of a separate `Stopper`:
```go
// Stopped exposes the underlying context as a read‑only signal.
func (h *UnconfirmedMessageHandlerImpl) Stopped() <-chan struct{} {
return h.ctx.Done()
}
```
This keeps lifecycle signaling but removes the extra `Stopper`/`ReadOnlyErrorSignal` layering.
### 2. Make the retry channel semantics explicit
Right now the non‑blocking send on an unbuffered channel is subtle and the log message (“channel full”) is misleading. If the intent is “at most one pending retry, and drop additional ones while the consumer is busy”, you can make that policy explicit with a small buffer and coalescing flag:
```go
type resourceState struct {
retry int32
numUnackedSendings int32
timer *time.Timer
retryPending bool // new: whether a retry signal is already pending
}
```
```go
func NewUnconfirmedMessageHandler(ctx context.Context, handlerName string, baseInterval time.Duration) *UnconfirmedMessageHandlerImpl {
h := &UnconfirmedMessageHandlerImpl{
handlerName: handlerName,
baseInterval: baseInterval,
resources: make(map[string]*resourceState),
retryCommandCh: make(chan string, 1), // small buffer
ctx: ctx,
}
// cleanup goroutine as above ...
return h
}
```
And in `onTimerFired`:
```go
func (h *UnconfirmedMessageHandlerImpl) onTimerFired(resourceID string) {
h.mu.Lock()
defer h.mu.Unlock()
if h.ctx.Err() != nil {
return
}
state, exists := h.resources[resourceID]
if !exists || state.numUnackedSendings == 0 {
return
}
state.retry++
nextInterval := h.calculateNextInterval(state.retry)
log.Infof("[%s] Resource %s has %d unacked messages, suggesting retry %d (next in %s)",
h.handlerName, resourceID, state.numUnackedSendings, state.retry, nextInterval)
h.resetTimer(resourceID, state, nextInterval)
// Coalesce retry signals: only enqueue if none is pending.
if !state.retryPending {
select {
case <-h.ctx.Done():
return
case h.retryCommandCh <- resourceID:
state.retryPending = true
default:
// buffer full: keep retryPending true, don't log "full" since this is expected coalescing
}
}
}
```
Then clear `retryPending` when the consumer actually processes the retry (where you receive from `RetryCommand()`):
```go
// in the consumer, after handling a retry for resourceID:
handler.ClearRetryPending(resourceID)
```
with a simple method:
```go
func (h *UnconfirmedMessageHandlerImpl) ClearRetryPending(resourceID string) {
h.mu.Lock()
defer h.mu.Unlock()
if state, ok := h.resources[resourceID]; ok {
state.retryPending = false
}
}
```
This preserves the “don’t enqueue infinite retries while one is in flight” behavior, but makes the intent obvious and removes the misleading “channel full” warning.
</issue_to_address>
d3322c9 to 0085a80 (Compare)
Extracted from `piotr/ROX-32316-new-retry-logic` and adapted to split a large PR into smaller chunks. Rewrites UnconfirmedMessageHandler from single-resource to multi-resource with per-resource independent retry state and exponential backoff. Updates the compliance node inventory and node index scan loops to use the new resourceID-based API, handle closed retry channels, and process the new ComplianceACK message type from Sensor 4.10+ alongside the legacy NodeInventoryACK. VM_INDEX_REPORT ACK handling is stubbed for a follow-up. Split from the full ROX-32316 branch to keep the change reviewable. AI-assisted: code was extracted and adapted from the feature branch by AI, the source feature branch was written by the author, reviewed and verified by the author.
Aligns the unconfirmed-message handler docs/comments with actual linear backoff and unbuffered retry-channel behavior, fixes typos, and removes a callback data race in UMH tests. Also updates compliance_ack_test to the current ACK API so style/lint/unit checks stop failing on interface/method drift.
6492540 to 950ed7c (Compare)
🚀 Build Images Ready
Images are ready for commit c400789. To use with deploy scripts: export MAIN_IMAGE_TAG=4.11.x-614-gc400789f56
📝 Walkthrough
Refactors retry handling to per-resource timers emitting resourceIDs; updates UnconfirmedMessageHandler API to be resource-aware; threads resourceIDs through compliance ACK/NACK handling, splits node caches, replaces legacy ACK dispatcher with
Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Compliance as Compliance Module
    participant Handler as UnconfirmedMessageHandler
    participant Timers as Per-Resource Timers
    participant RetryCh as Retry Command Channel
    Compliance->>Handler: ObserveSending(resourceID)
    Handler->>Timers: create/update timer for resourceID
    Timers->>Handler: timer fires for resourceID
    Handler->>RetryCh: emit resourceID
    RetryCh->>Compliance: receive resourceID (RetryCommand)
    Compliance->>Handler: HandleACK(resourceID) / HandleNACK(resourceID)
    Handler->>Timers: stop/delete timer for resourceID (on ACK)
    Handler->>Compliance: invoke OnACK(resourceID) if registered
    Compliance->>Handler: Stopped()
    Handler->>RetryCh: close retry channel on shutdown
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@compliance/compliance.go`:
- Around line 228-239: The retry logic erroneously resends the shared c.cache
for node-index retries; update the code to maintain separate cached payloads
(e.g., add c.nodeIndexCache and c.nodeInventoryCache or a per-message-type retry
store) and have the runNodeIndex() retry branch read/send c.nodeIndexCache
(while runNodeInventory() uses c.nodeInventoryCache). Ensure these caches are
populated where runNodeIndex() (previously lines ~283-288) and the inventory
scan set their respective cache fields, and protect each cache with its own
mutex or use channel-based ownership to avoid cross-loop races when
writing/reading the caches.
In `@pkg/retry/handler/unconfirmed_message_handler.go`:
- Around line 119-121: The initial timer is being armed with h.baseInterval
directly causing immediate retries when baseInterval is non-positive; change the
call in the first-unacked path to use the same normalized interval as the later
logic (i.e., compute interval = h.baseInterval if >0 else defaultBaseInterval or
call the existing normalization helper) and pass that interval into
h.resetTimer(resourceID, state, interval); apply the same normalization to the
other resetTimer invocation(s) in the retry/backoff branch (the code around the
existing normalization at lines 219-229) so all timers use defaultBaseInterval
when baseInterval is non-positive.
- Around line 36-38: The single-slot string channel retryCommandCh coalesces
across all resources; change the design to a notification channel plus a
per-resource pending set: replace retryCommandCh chan string with retryNotifyCh
chan struct{} (buffer 1) and add a pendingRetries map[string]bool, then in the
timer firing path (the coalescing logic around the existing branch at Lines
~207-215) mark pendingRetries[resourceID]=true and do a non-blocking send to
retryNotifyCh; in the consumer/worker loop, on receiving from retryNotifyCh
iterate the pendingRetries keys, clear each entry and handle retries for that
specific resource (instead of relying on a single queued string), so retries are
coalesced per-resource rather than across all resources.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: a691a124-6090-4039-a05d-fc7c9a53193d
📒 Files selected for processing (6)
compliance/compliance.go, compliance/compliance_ack_test.go, compliance/compliance_test.go, compliance/node/interfaces.go, pkg/retry/handler/unconfirmed_message_handler.go, pkg/retry/handler/unconfirmed_message_handler_test.go
@coderabbitai recheck
✅ Actions performed: Full review triggered.
Actionable comments posted: 1
🧹 Nitpick comments (2)
compliance/compliance_test.go (1)
32-48: Assert the routed `resourceID` in this suite. The mock drops the `resourceID`, so these tests only prove that some ACK/NACK handler was called. A regression that routes `""` or the wrong key into `HandleACK`/`HandleNACK` would still pass, even though per-resource routing is the behavior this PR is introducing. Please record the seen IDs and assert they match `nodeResourceID`. As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
Also applies to: 57-117
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@compliance/compliance_test.go` around lines 32 - 48: the mockUnconfirmedMessageHandler currently discards the resourceID in HandleACK and HandleNACK so tests only verify a handler was invoked; modify the mock (mockUnconfirmedMessageHandler) to record the received resourceID(s) (e.g., append to a slice or set field) in HandleACK(resourceID string) and HandleNACK(resourceID string), expose a method or field to retrieve them, and update the tests to assert the recorded IDs equal the expected nodeResourceID; also apply the same change for any other mock methods in this file that swallow the id (see ObserveSending/OnACK areas between lines 57-117) so all assertions verify the routed resourceID rather than just call counts.
pkg/retry/handler/unconfirmed_message_handler_test.go (1)
17-115: Move the retry timing table off wall-clock sleeps. These cases still wait on real time for up to ~9 seconds and race a background counting goroutine, so they remain slow and occasionally flaky under CI load. Since this file now uses `testing/synctest` for the newer timer-sensitive paths, porting this table as well would make the retry/backoff coverage deterministic and keep test latency down. As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/retry/handler/unconfirmed_message_handler_test.go` around lines 17 - 115, The test TestWithRetryable relies on real time sleeps and races with a background goroutine; replace wall-clock waits with a deterministic fake clock from testing/synctest and advance it to trigger retries. Modify the test to create a synctest.Clock, inject it into the UnconfirmedMessageHandler (e.g., add or use a constructor like NewUnconfirmedMessageHandlerWithClock(ctx, "test", baseDuration, clock) or add a clock option), then schedule sends/acks/nacks by advancing the fake clock (instead of goroutines using time.After) and drive umh.RetryCommand by advancing the clock to expected retry instants; update uses of ObserveSending, HandleACK and HandleNACK to be called synchronously at the appropriate clock advances and assert the counter deterministically.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/retry/handler/unconfirmed_message_handler.go`:
- Around line 146-173: HandleACK currently removes the resource from h.resources
but does not cancel queued retry commands, so the dispatcher can still publish
stale retries; modify HandleACK (in UnconfirmedMessageHandlerImpl) to also
remove/cancel any pending entries for resourceID from pendingRetries and close
or mark related commands so they are not sent, and update the dispatcher logic
that reads from retryCommandCh to revalidate the resource's liveness (check
h.resources or a dedicated isLive(resourceID) method) before publishing a retry;
ensure cancellation happens while holding h.mu and that any callbacks (h.onACK)
are still invoked outside the lock.
---
Nitpick comments:
In `@compliance/compliance_test.go`:
- Around line 32-48: The mockUnconfirmedMessageHandler currently discards the
resourceID in HandleACK and HandleNACK so tests only verify a handler was
invoked; modify the mock (mockUnconfirmedMessageHandler) to record the received
resourceID(s) (e.g., append to a slice or set field) in HandleACK(resourceID
string) and HandleNACK(resourceID string), expose a method or field to retrieve
them, and update the tests to assert the recorded IDs equal the expected
nodeResourceID; also apply the same change for any other mock methods in this
file that swallow the id (see ObserveSending/OnACK areas between lines 57-117)
so all assertions verify the routed resourceID rather than just call counts.
In `@pkg/retry/handler/unconfirmed_message_handler_test.go`:
- Around line 17-115: The test TestWithRetryable relies on real time sleeps and
races with a background goroutine; replace wall-clock waits with a deterministic
fake clock from testing/synctest and advance it to trigger retries. Modify the
test to create a synctest.Clock, inject it into the UnconfirmedMessageHandler
(e.g., add or use a constructor like NewUnconfirmedMessageHandlerWithClock(ctx,
"test", baseDuration, clock) or add a clock option), then schedule
sends/acks/nacks by advancing the fake clock (instead of goroutines using
time.After) and drive umh.RetryCommand by advancing the clock to expected retry
instants; update uses of ObserveSending, HandleACK and HandleNACK to be called
synchronously at the appropriate clock advances and assert the counter
deterministically.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 908902b8-3a8a-4e8b-9e4a-71f6c425fb27
📒 Files selected for processing (6)
compliance/compliance.go, compliance/compliance_ack_test.go, compliance/compliance_test.go, compliance/node/interfaces.go, pkg/retry/handler/unconfirmed_message_handler.go, pkg/retry/handler/unconfirmed_message_handler_test.go
HandleACK removed the resource from h.resources but left it in pendingRetries, so the dispatcher could still emit retries for already-acknowledged resources. Two race windows existed: 1. Resource in pendingRetries when ACK arrives — dispatcher drains it. 2. Resource already in dispatcher's local slice after takePendingRetries. Fix: HandleACK now also clears pendingRetries, and the dispatcher validates resource liveness via isResourceActive before sending. AI-generated: implementation and regression tests. Human-reviewed: race analysis, code correctness, style conformance.
@coderabbitai check again
✅ Actions performed: Full review triggered.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@compliance/compliance.go`:
- Around line 39-43: The code currently keys UMH sends with the synthetic
constant nodeResourceID ("this-node") instead of the actual resource id from
incoming ACK/NACKs; update the send/tracking logic to use ack.GetResourceId()
(or the real node name passed into the send) as the map key and ensure that the
resource_id is forwarded into HandleACK and HandleNACK so they validate and
match the exact resource before clearing retry state; replace uses of
nodeResourceID in send/track operations and in the ACK/NACK handling paths
(including calls to HandleACK/HandleNACK) so stale/mismatched ACKs cannot
acknowledge the wrong message.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Pro
Run ID: 9b4bad7c-3a88-4186-9185-4a99e092cd7c
📒 Files selected for processing (6)
compliance/compliance.go, compliance/compliance_ack_test.go, compliance/compliance_test.go, compliance/node/interfaces.go, pkg/retry/handler/unconfirmed_message_handler.go, pkg/retry/handler/unconfirmed_message_handler_test.go
Description
Refactors `UnconfirmedMessageHandler` (UMH) from single-resource to per-resource tracking and wires the node inventory / node index ACK flow through the new `ComplianceACK` message type.
Problem: The existing UMH tracked a single global retry state, which doesn't work for the VM scanning use case where hundreds of independent resources (VMs) connect to a single Compliance container and need independent ACK/retry tracking. Before, there was always exactly one node per compliance container, so there was no need to track resourceIDs.
Solution:
- Tracks per-resource state (`map[string]*resourceState`), each with independent retry counter, unacked count, and timer.
- Handles the new `ComplianceACK` message from Sensor 4.10+.

AI-assisted: code was extracted and adapted from a larger feature branch by AI (selective file copying and `compliance.go` splitting), reviewed and verified by the author.
User-facing documentation
Testing and quality
Automated testing
How I validated my change
Logs: