
ROX-33555: Refactor UMH to per-resource tracking, wire node ACK flow#19319

Open
vikin91 wants to merge 8 commits into master from piotr/ROX-32316-umh-node-ack

Conversation

@vikin91
Contributor

@vikin91 vikin91 commented Mar 6, 2026

Description

Refactors UnconfirmedMessageHandler (UMH) from single-resource to per-resource tracking and wires
the node inventory / node index ACK flow through the new ComplianceACK message type.

⚠️ This change currently applies only to Node scanning ACKs and is a prefactor required for handling VM scanning ACKs.

Problem: The existing UMH tracked a single global retry state, which doesn't work for the VM scanning use case where hundreds of independent resources (VMs) connect to a single Compliance container and need independent ACK/retry tracking.

Before there was always exactly one node per compliance container, so there was no need to track resourceIDs.

Solution:

  • Rewrote UMH to maintain per-resource state (map[string]*resourceState), each with independent retry counter, unacked count, and timer.
  • The compliance component handles the new ComplianceACK message from Sensor 4.10+.
  • Node scanning: the scan loops now use the new resourceID-based API, with all related changes.
  • VM scanning is stubbed with a TODO for a follow-up PR.

AI-assisted: code was extracted and adapted from a larger feature branch by AI
(selective file copying and compliance.go splitting), reviewed and verified by the author.

User-facing documentation

Testing and quality

  • the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

  • added unit tests
  • added e2e tests
  • added regression tests
  • added compatibility tests
  • modified existing tests

How I validated my change

  • Unit tests locally and in CI
  • Ran on a cluster and confirmed the following:
    1. Happy path for V4 node scanning works correctly
    2. When Central is down, the NACKs arrive and retries are executed
    3. When Central is back up, the first ACK arrives and the UMH state is reset.

Logs:

➜ kubectl -n stackrox logs pod/collector-m8vlc -c compliance | grep -E "Node Index v4|IndexReport|index.*ACK|suggesting retry|Received ACK|Observing send"

main: 2026/04/10 08:10:00.967747 compliance.go:127: Info: Node Index v4 enabled

pkg/retry/handler: 2026/04/10 08:12:08.358100 unconfirmed_message_handler.go:132: Debug: [node-index] Observing send for resource this-node (unacked: 1)
pkg/retry/handler: 2026/04/10 08:12:09.708366 unconfirmed_message_handler.go:161: Debug: [node-index] Received ACK for resource this-node
pkg/retry/handler: 2026/04/10 08:12:09.708392 unconfirmed_message_handler.go:163: Debug: [node-index] Received ACK for unknown resource this-node

# ^^ note two types of ACK arriving - the old one and the new one.

pkg/retry/handler: 2026/04/10 08:14:00.451260 unconfirmed_message_handler.go:132: Debug: [node-index] Observing send for resource this-node (unacked: 1)

# ^^ typical happy path, sending node scan, unacked=1, receiving ACK, sending again, unacked=1 ---> unacked messages do not accumulate.

# Now scaling down Central to 0

pkg/retry/handler: 2026/04/10 08:28:01.820878 unconfirmed_message_handler.go:132: Debug: [node-index] Observing send for resource this-node (unacked: 1)
main: 2026/04/10 08:28:01.830983 compliance.go:445: Info: Node index NACK received: Central unreachable
pkg/retry/handler: 2026/04/10 08:28:01.831009 unconfirmed_message_handler.go:183: Debug: [node-index] Received NACK for resource this-node. Message will be resent.
pkg/retry/handler: 2026/04/10 08:28:31.821269 unconfirmed_message_handler.go:223: Info: [node-index] Resource this-node has 1 unacked messages, suggesting retry 1 (next in 1m0s)
main: 2026/04/10 08:28:31.831460 compliance.go:445: Info: Node index NACK received: Central unreachable
pkg/retry/handler: 2026/04/10 08:28:31.831471 unconfirmed_message_handler.go:183: Debug: [node-index] Received NACK for resource this-node. Message will be resent.

# ^^ receiving two NACKs for the same resource


pkg/retry/handler: 2026/04/10 08:29:31.821959 unconfirmed_message_handler.go:223: Info: [node-index] Resource this-node has 1 unacked messages, suggesting retry 2 (next in 1m30s)
main: 2026/04/10 08:29:31.831533 compliance.go:445: Info: Node index NACK received: Central unreachable
pkg/retry/handler: 2026/04/10 08:29:31.831545 unconfirmed_message_handler.go:183: Debug: [node-index] Received NACK for resource this-node. Message will be resent.
pkg/retry/handler: 2026/04/10 08:30:03.269202 unconfirmed_message_handler.go:132: Debug: [node-index] Observing send for resource this-node (unacked: 2)
main: 2026/04/10 08:30:03.279712 compliance.go:445: Info: Node index NACK received: Central unreachable
pkg/retry/handler: 2026/04/10 08:30:03.279722 unconfirmed_message_handler.go:183: Debug: [node-index] Received NACK for resource this-node. Message will be resent.

# ^^ despite duplicated NACKs, no retry storm. Both types of NACK are assigned to the same resource and the number of retries does not grow exponentially

# (some time passes)
# scaling Central up

pkg/retry/handler: 2026/04/10 09:00:19.131251 unconfirmed_message_handler.go:132: Debug: [node-index] Observing send for resource this-node (unacked: 17)
main: 2026/04/10 09:00:19.140671 compliance.go:445: Info: Node index NACK received: Central unreachable
pkg/retry/handler: 2026/04/10 09:00:19.140686 unconfirmed_message_handler.go:183: Debug: [node-index] Received NACK for resource this-node. Message will be resent.
pkg/retry/handler: 2026/04/10 09:01:01.827845 unconfirmed_message_handler.go:223: Info: [node-index] Resource this-node has 17 unacked messages, suggesting retry 11 (next in 6m0s)
main: 2026/04/10 09:01:01.837200 compliance.go:445: Info: Node index NACK received: Central unreachable
pkg/retry/handler: 2026/04/10 09:01:01.837213 unconfirmed_message_handler.go:183: Debug: [node-index] Received NACK for resource this-node. Message will be resent.
pkg/retry/handler: 2026/04/10 09:02:18.483360 unconfirmed_message_handler.go:132: Debug: [node-index] Observing send for resource this-node (unacked: 18)

# ^^ per "(next in 6m0s)", we should still wait 6 minutes before sending a new message to Central.
# HOWEVER, for this test, I configured the rescan to happen every minute, so
# the next send of the message succeeds and the ACK arrives (see below)

pkg/retry/handler: 2026/04/10 09:02:36.222017 unconfirmed_message_handler.go:161: Debug: [node-index] Received ACK for resource this-node
pkg/retry/handler: 2026/04/10 09:02:36.222092 unconfirmed_message_handler.go:163: Debug: [node-index] Received ACK for unknown resource this-node
pkg/retry/handler: 2026/04/10 09:04:10.165101 unconfirmed_message_handler.go:132: Debug: [node-index] Observing send for resource this-node (unacked: 1)
pkg/retry/handler: 2026/04/10 09:04:12.381723 unconfirmed_message_handler.go:161: Debug: [node-index] Received ACK for resource this-node
pkg/retry/handler: 2026/04/10 09:04:12.381760 unconfirmed_message_handler.go:163: Debug: [node-index] Received ACK for unknown resource this-node
pkg/retry/handler: 2026/04/10 09:06:19.663367 unconfirmed_message_handler.go:132: Debug: [node-index] Observing send for resource this-node (unacked: 1)
pkg/retry/handler: 2026/04/10 09:06:25.003832 unconfirmed_message_handler.go:161: Debug: [node-index] Received ACK for resource this-node
pkg/retry/handler: 2026/04/10 09:06:25.003856 unconfirmed_message_handler.go:163: Debug: [node-index] Received ACK for unknown resource this-node

@vikin91
Contributor Author

vikin91 commented Mar 6, 2026

@openshift-ci

openshift-ci bot commented Mar 6, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@vikin91 vikin91 changed the title ROX-32316: Refactor UMH to per-resource tracking and wire node ACK flow ROX-32316: Refactor UMH to per-resource tracking, wire node ACK flow Mar 6, 2026
@rhacs-bot
Contributor

rhacs-bot commented Mar 6, 2026

Images are ready for the commit at 6492540.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.11.x-386-g649254019a.

@codecov

codecov bot commented Mar 6, 2026

Codecov Report

❌ Patch coverage is 73.49398% with 66 lines in your changes missing coverage. Please review.
✅ Project coverage is 49.63%. Comparing base (ab1e8a2) to head (c400789).
⚠️ Report is 11 commits behind head on master.

Files with missing lines Patch % Lines
compliance/compliance.go 34.17% 52 Missing ⚠️
pkg/retry/handler/unconfirmed_message_handler.go 91.76% 11 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #19319      +/-   ##
==========================================
+ Coverage   49.61%   49.63%   +0.02%     
==========================================
  Files        2765     2765              
  Lines      208541   208700     +159     
==========================================
+ Hits       103459   103580     +121     
- Misses      97405    97441      +36     
- Partials     7677     7679       +2     
Flag Coverage Δ
go-unit-tests 49.63% <73.49%> (+0.02%) ⬆️



@vikin91 vikin91 changed the title ROX-32316: Refactor UMH to per-resource tracking, wire node ACK flow ROX-32848: Refactor UMH to per-resource tracking, wire node ACK flow Mar 6, 2026
@vikin91 vikin91 force-pushed the piotr/ROX-32316-umh-node-ack branch from 7faf2a3 to d136021 Compare March 9, 2026 10:05
@vikin91 vikin91 changed the title ROX-32848: Refactor UMH to per-resource tracking, wire node ACK flow ROX-33555: Refactor UMH to per-resource tracking, wire node ACK flow Mar 11, 2026
@vikin91 vikin91 force-pushed the piotr/ROX-32316-umh-node-ack branch from d136021 to 0ea378c Compare March 11, 2026 09:47
@vikin91 vikin91 marked this pull request as ready for review March 13, 2026 15:46
Contributor

@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 4 issues, and left some high level feedback:

  • The new UnconfirmedMessageHandlerImpl docstring says it uses exponential backoff, but calculateNextInterval implements linear backoff ((retry+1)*baseInterval); consider either updating the comment or changing the logic to match the intended strategy.
  • In TestOnACKCallback, ackedResources is appended to from multiple callback goroutines without synchronization, which is a data race under the race detector; consider guarding it with a mutex or using a channel to collect results.
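The race the reviewer describes is the classic unsynchronized-append problem. A minimal sketch of the mutex-guarded fix (standalone illustration, not the test file's actual code; `collectACKs` and the resource IDs are made up for the example):

```go
package main

import (
	"fmt"
	"sync"
)

// collectACKs invokes an onACK callback from one goroutine per resource and
// guards the shared slice with a mutex, so `go test -race` stays clean.
func collectACKs(ids []string) []string {
	var (
		mu    sync.Mutex
		acked []string
		wg    sync.WaitGroup
	)
	onACK := func(resourceID string) {
		mu.Lock()
		defer mu.Unlock()
		acked = append(acked, resourceID)
	}
	for _, id := range ids {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			onACK(id)
		}(id)
	}
	wg.Wait() // all callbacks done; safe to read acked afterwards
	return acked
}

func main() {
	fmt.Println(len(collectACKs([]string{"vm-1", "vm-2", "vm-3"}))) // prints "3"
}
```

The alternative the reviewer mentions, collecting results over a channel, works equally well and avoids the explicit lock.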
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The new `UnconfirmedMessageHandlerImpl` docstring says it uses exponential backoff, but `calculateNextInterval` implements linear backoff (`(retry+1)*baseInterval`); consider either updating the comment or changing the logic to match the intended strategy.
- In `TestOnACKCallback`, `ackedResources` is appended to from multiple callback goroutines without synchronization, which is a data race under the race detector; consider guarding it with a mutex or using a channel to collect results.

## Individual Comments

### Comment 1
<location path="pkg/retry/handler/unconfirmed_message_handler.go" line_range="27-29" />
<code_context>
+// Each resource has independent retry state with exponential backoff.
</code_context>
<issue_to_address>
**suggestion:** Backoff description says exponential but implementation is linear; align naming/behavior.

This comment promises exponential backoff, but the implementation uses linear backoff (`(retry + 1) * baseInterval`), matching the previous behavior. Please either change the wording to “linear backoff” or implement true exponential backoff (e.g., `baseInterval * (1 << retry)` with appropriate caps) to avoid confusing future readers.

```suggestion
 // UnconfirmedMessageHandlerImpl handles ACK/NACK messages for multiple resources.
 // Each resource has independent retry state with linear backoff.
 type UnconfirmedMessageHandlerImpl struct {
```
</issue_to_address>

### Comment 2
<location path="pkg/retry/handler/unconfirmed_message_handler.go" line_range="203-217" />
<code_context>
+	// Schedule next retry
+	h.resetTimer(resourceID, state, nextInterval)
+
+	// Signal retry (non-blocking); if the channel is full we log and drop the signal.
+	select {
+	case <-h.ctx.Done():
+		return
+	case h.retryCommandCh <- resourceID:
+	default:
+		// If the channel is full or the consumer goroutine is busy, we assume that currently a
+		// transimssion of a message is in progress. This means, that we do not need to enqueue
+		// another message immediately afterwards. Thus, we drop the retry signal and log a warning.
</code_context>
<issue_to_address>
**suggestion:** Clarify behavior of the unbuffered retry channel and the "full" terminology.

Because `retryCommandCh` is unbuffered, "channel full" isn't quite accurate here: the `default` branch is taken when there is no receiver ready at the instant of the send, not due to queued signals. If this drop-when-no-receiver behavior is what you want, consider wording it in terms of "no receiver ready". If you instead want to allow a small backlog of retries, a buffered channel would more directly express that intent than a non-blocking send on an unbuffered channel.

```suggestion
	// Schedule next retry
	h.resetTimer(resourceID, state, nextInterval)

	// Signal retry (non-blocking); if no receiver is ready on the unbuffered channel, we drop the signal.
	select {
	case <-h.ctx.Done():
		return
	case h.retryCommandCh <- resourceID:
	default:
		// If no receiver is ready or the consumer goroutine is busy, we assume that a transmission
		// of a message is currently in progress. This means we do not need to enqueue another retry
		// signal immediately afterwards. Thus, we drop the retry signal and log a warning.
		log.Warnf("[%s] No receiver ready on retry channel, dropping retry signal for %s", h.handlerName, resourceID)
	}
}
```
</issue_to_address>
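The unbuffered-versus-buffered distinction the reviewer draws can be demonstrated in a few lines (a standalone sketch, not the handler's code; `trySend` is a helper invented for the example):

```go
package main

import "fmt"

// trySend mirrors the non-blocking send in the handler: it succeeds only if
// the channel can accept the value right now, otherwise the signal is dropped.
func trySend(ch chan string, v string) bool {
	select {
	case ch <- v:
		return true
	default:
		return false
	}
}

func main() {
	unbuffered := make(chan string) // no receiver ready -> send is dropped
	fmt.Println(trySend(unbuffered, "retry")) // false

	buffered := make(chan string, 1) // one signal can queue without a receiver
	fmt.Println(trySend(buffered, "retry"))   // true
	fmt.Println(trySend(buffered, "retry"))   // false: buffer now full
}
```

On an unbuffered channel the `default` branch fires whenever no receiver is blocked at that instant, which is why "channel full" is misleading wording; a buffer of one expresses "allow exactly one pending retry" directly.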

### Comment 3
<location path="pkg/retry/handler/unconfirmed_message_handler.go" line_range="157-166" />
<code_context>
+	// HandleNACK is currently a no-op and has the same behavior as not reciving any [N]ACK message.
</code_context>
<issue_to_address>
**nitpick (typo):** Fix small typos in comments to improve readability.

Please fix the typos: "reciving" → "receiving" and "transimssion" → "transmission".

Suggested implementation:

```golang
	// HandleNACK is currently a no-op and has the same behavior as not receiving any [N]ACK message.

```

```golang
transmission

```
</issue_to_address>

### Comment 4
<location path="pkg/retry/handler/unconfirmed_message_handler.go" line_range="42" />
<code_context>
+	onACK func(resourceID string)
+	ctx   context.Context
+
+	// cleanupDone signals when cleanup is complete
+	cleanupDone concurrency.Stopper
 }
</code_context>
<issue_to_address>
**issue (complexity):** Consider simplifying the handler’s lifecycle and retry signaling by removing the Stopper abstraction and making the retry channel’s coalescing behavior explicit.

You can trim a fair bit of complexity without losing any of the new functionality:

### 1. Avoid `concurrency.Stopper` if `Stopped()` isn’t needed

If nothing outside this package is actually calling `Stopped()`, you can drop the `Stopper` in favor of the existing `ctx` + channel close. That removes a whole lifecycle abstraction:

```go
type UnconfirmedMessageHandlerImpl struct {
    handlerName    string
    baseInterval   time.Duration
    resources      map[string]*resourceState
    mu             sync.Mutex
    retryCommandCh chan string
    onACK          func(resourceID string)
    ctx            context.Context
}

// NewUnconfirmedMessageHandler creates a new handler for per-resource ACK/NACK tracking.
// It can be stopped by canceling the context.
func NewUnconfirmedMessageHandler(ctx context.Context, handlerName string, baseInterval time.Duration) *UnconfirmedMessageHandlerImpl {
    h := &UnconfirmedMessageHandlerImpl{
        handlerName:    handlerName,
        baseInterval:   baseInterval,
        resources:      make(map[string]*resourceState),
        retryCommandCh: make(chan string),
        ctx:            ctx,
    }

    go func() {
        <-ctx.Done()
        h.mu.Lock()
        defer h.mu.Unlock()
        for _, state := range h.resources {
            if state.timer != nil {
                state.timer.Stop()
            }
        }
        close(h.retryCommandCh)
    }()

    return h
}
```

If you *do* need to expose a “stopped” signal, you can derive it from the context instead of a separate `Stopper`:

```go
// Stopped exposes the underlying context as a read‑only signal.
func (h *UnconfirmedMessageHandlerImpl) Stopped() <-chan struct{} {
    return h.ctx.Done()
}
```

This keeps lifecycle signaling but removes the extra `Stopper`/`ReadOnlyErrorSignal` layering.

### 2. Make the retry channel semantics explicit

Right now the non‑blocking send on an unbuffered channel is subtle and the log message (“channel full”) is misleading. If the intent is “at most one pending retry, and drop additional ones while the consumer is busy”, you can make that policy explicit with a small buffer and coalescing flag:

```go
type resourceState struct {
    retry              int32
    numUnackedSendings int32
    timer              *time.Timer
    retryPending       bool // new: whether a retry signal is already pending
}
```

```go
func NewUnconfirmedMessageHandler(ctx context.Context, handlerName string, baseInterval time.Duration) *UnconfirmedMessageHandlerImpl {
    h := &UnconfirmedMessageHandlerImpl{
        handlerName:    handlerName,
        baseInterval:   baseInterval,
        resources:      make(map[string]*resourceState),
        retryCommandCh: make(chan string, 1), // small buffer
        ctx:            ctx,
    }
    // cleanup goroutine as above ...
    return h
}
```

And in `onTimerFired`:

```go
func (h *UnconfirmedMessageHandlerImpl) onTimerFired(resourceID string) {
    h.mu.Lock()
    defer h.mu.Unlock()

    if h.ctx.Err() != nil {
        return
    }

    state, exists := h.resources[resourceID]
    if !exists || state.numUnackedSendings == 0 {
        return
    }

    state.retry++
    nextInterval := h.calculateNextInterval(state.retry)

    log.Infof("[%s] Resource %s has %d unacked messages, suggesting retry %d (next in %s)",
        h.handlerName, resourceID, state.numUnackedSendings, state.retry, nextInterval)

    h.resetTimer(resourceID, state, nextInterval)

    // Coalesce retry signals: only enqueue if none is pending.
    if !state.retryPending {
        select {
        case <-h.ctx.Done():
            return
        case h.retryCommandCh <- resourceID:
            state.retryPending = true
        default:
            // buffer full: keep retryPending true, don't log "full" since this is expected coalescing
        }
    }
}
```

Then clear `retryPending` when the consumer actually processes the retry (where you receive from `RetryCommand()`):

```go
// in the consumer, after handling a retry for resourceID:
handler.ClearRetryPending(resourceID)
```

with a simple method:

```go
func (h *UnconfirmedMessageHandlerImpl) ClearRetryPending(resourceID string) {
    h.mu.Lock()
    defer h.mu.Unlock()
    if state, ok := h.resources[resourceID]; ok {
        state.retryPending = false
    }
}
```

This preserves the “don’t enqueue infinite retries while one is in flight” behavior, but makes the intent obvious and removes the misleading “channel full” warning.
</issue_to_address>


@vikin91 vikin91 force-pushed the piotr/ROX-32316-umh-node-ack branch 2 times, most recently from d3322c9 to 0085a80 Compare March 19, 2026 11:37
vikin91 added 5 commits April 9, 2026 10:50
Extracted from `piotr/ROX-32316-new-retry-logic` and adapted to split
a large PR into smaller chunks.

Rewrites UnconfirmedMessageHandler from single-resource to multi-resource
with per-resource independent retry state and exponential backoff. Updates
the compliance node inventory and node index scan loops to use the new
resourceID-based API, handle closed retry channels, and process the new
ComplianceACK message type from Sensor 4.10+ alongside the legacy
NodeInventoryACK. VM_INDEX_REPORT ACK handling is stubbed for a follow-up.

Split from the full ROX-32316 branch to keep the change reviewable.

AI-assisted: code was extracted and adapted from the feature branch by AI,
the source feature branch was written by the author,
reviewed and verified by the author.
Aligns the unconfirmed-message handler docs/comments with actual linear backoff and unbuffered retry-channel behavior, fixes typos, and removes a callback data race in UMH tests. Also updates compliance_ack_test to the current ACK API so style/lint/unit checks stop failing on interface/method drift.
@vikin91 vikin91 force-pushed the piotr/ROX-32316-umh-node-ack branch from 6492540 to 950ed7c Compare April 9, 2026 08:50
@github-actions
Contributor

github-actions bot commented Apr 9, 2026

🚀 Build Images Ready

Images are ready for commit c400789. To use with deploy scripts:

export MAIN_IMAGE_TAG=4.11.x-614-gc400789f56

@coderabbitai
Contributor

coderabbitai bot commented Apr 9, 2026

📝 Walkthrough

Walkthrough

Refactors retry handling to per-resource timers emitting resourceIDs; updates UnconfirmedMessageHandler API to be resource-aware; threads resourceIDs through compliance ACK/NACK handling, splits node caches, replaces legacy ACK dispatcher with handleComplianceACK, and adjusts tests/mocks to the new interfaces and behavior.

Changes

Cohort / File(s) Summary
Retry Handler Core
pkg/retry/handler/unconfirmed_message_handler.go
Replaced global ticker with per-resource timers/state map; RetryCommand() now emits string resourceIDs; added linear backoff, OnACK callback, Stopped() signal, per-resource ObserveSending/HandleACK/HandleNACK semantics, dispatcher/coalescing logic, and updated shutdown/cleanup.
Retry Handler Tests
pkg/retry/handler/unconfirmed_message_handler_test.go
Converted suite tests to standalone subtests; updated to resource-specific interactions; added tests for multi-resource routing, ACK callback, shutdown behavior, non-positive base interval, non-blocking timer fire, dispatcher edge cases, and deterministic timing via testing/synctest.
Interface Changes
compliance/node/interfaces.go
Updated UnconfirmedMessageHandler interface: ObserveSending, HandleACK, HandleNACK accept resourceID string; RetryCommand() returns <-chan string; added OnACK(func(resourceID string)) and Stopped() concurrency.ReadOnlyErrorSignal.
Compliance Logic
compliance/compliance.go
Introduced nodeResourceID constant; scan loops read resourceID from RetryCommand() and exit on channel close; split single cache into nodeInventoryCache and nodeIndexCache with mutexes and getters/setters; store/clone messages per-resource; replaced legacy dispatcher with handleComplianceACK plus handleNodeInventoryACK/handleNodeIndexACK that call HandleACK(nodeResourceID)/HandleNACK(nodeResourceID) and log NACK reasons.
Compliance Tests & Mocks
compliance/compliance_ack_test.go, compliance/compliance_test.go
Updated fakeUMH/mocks to new method signatures (HandleACK/HandleNACK/ObserveSending(resourceID), RetryCommand() <-chan string, OnACK, Stopped()); added TestCompliance suite and tests for handleComplianceACK including nil-ACK case.

Sequence Diagram(s)

sequenceDiagram
    participant Compliance as Compliance Module
    participant Handler as UnconfirmedMessageHandler
    participant Timers as Per-Resource Timers
    participant RetryCh as Retry Command Channel

    Compliance->>Handler: ObserveSending(resourceID)
    Handler->>Timers: create/update timer for resourceID
    Timers->>Handler: timer fires for resourceID
    Handler->>RetryCh: emit resourceID
    RetryCh->>Compliance: receive resourceID (RetryCommand)
    Compliance->>Handler: HandleACK(resourceID) / HandleNACK(resourceID)
    Handler->>Timers: stop/delete timer for resourceID (on ACK)
    Handler->>Compliance: invoke OnACK(resourceID) if registered
    Compliance->>Handler: Stopped()
    Handler->>RetryCh: close retry channel on shutdown

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 26.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly describes the main refactoring work: UMH per-resource tracking and node ACK flow wiring, which aligns with the primary changes in the changeset.
Description check ✅ Passed The pull request description is comprehensive and follows the template structure with clear sections covering objectives, user-facing documentation status, testing and quality checklist items with checkmarks, and detailed validation approach.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@compliance/compliance.go`:
- Around line 228-239: The retry logic erroneously resends the shared c.cache
for node-index retries; update the code to maintain separate cached payloads
(e.g., add c.nodeIndexCache and c.nodeInventoryCache or a per-message-type retry
store) and have the runNodeIndex() retry branch read/send c.nodeIndexCache
(while runNodeInventory() uses c.nodeInventoryCache). Ensure these caches are
populated where runNodeIndex() (previously lines ~283-288) and the inventory
scan set their respective cache fields, and protect each cache with its own
mutex or use channel-based ownership to avoid cross-loop races when
writing/reading the caches.

In `@pkg/retry/handler/unconfirmed_message_handler.go`:
- Around line 119-121: The initial timer is being armed with h.baseInterval
directly causing immediate retries when baseInterval is non-positive; change the
call in the first-unacked path to use the same normalized interval as the later
logic (i.e., compute interval = h.baseInterval if >0 else defaultBaseInterval or
call the existing normalization helper) and pass that interval into
h.resetTimer(resourceID, state, interval); apply the same normalization to the
other resetTimer invocation(s) in the retry/backoff branch (the code around the
existing normalization at lines 219-229) so all timers use defaultBaseInterval
when baseInterval is non-positive.
- Around line 36-38: The single-slot string channel retryCommandCh coalesces
across all resources; change the design to a notification channel plus a
per-resource pending set: replace retryCommandCh chan string with retryNotifyCh
chan struct{} (buffer 1) and add a pendingRetries map[string]bool, then in the
timer firing path (the coalescing logic around the existing branch at Lines
~207-215) mark pendingRetries[resourceID]=true and do a non-blocking send to
retryNotifyCh; in the consumer/worker loop, on receiving from retryNotifyCh
iterate the pendingRetries keys, clear each entry and handle retries for that
specific resource (instead of relying on a single queued string), so retries are
coalesced per-resource rather than across all resources.
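The notification-channel-plus-pending-set design suggested above can be sketched as follows (a standalone illustration under the comment's assumptions; `pendingSet`, `signal`, and `drain` are names invented for the example):

```go
package main

import (
	"fmt"
	"sync"
)

// pendingSet coalesces retry signals per resource: a buffered notification
// channel wakes the consumer, and the map records which resources need work.
type pendingSet struct {
	mu       sync.Mutex
	pending  map[string]bool
	notifyCh chan struct{}
}

func newPendingSet() *pendingSet {
	return &pendingSet{
		pending:  make(map[string]bool),
		notifyCh: make(chan struct{}, 1),
	}
}

// signal marks resourceID as needing a retry and nudges the consumer.
func (p *pendingSet) signal(resourceID string) {
	p.mu.Lock()
	p.pending[resourceID] = true
	p.mu.Unlock()
	select {
	case p.notifyCh <- struct{}{}:
	default: // a wake-up is already queued; coalesce
	}
}

// drain returns and clears all currently pending resource IDs.
func (p *pendingSet) drain() []string {
	p.mu.Lock()
	defer p.mu.Unlock()
	ids := make([]string, 0, len(p.pending))
	for id := range p.pending {
		ids = append(ids, id)
	}
	p.pending = make(map[string]bool)
	return ids
}

func main() {
	p := newPendingSet()
	p.signal("vm-1")
	p.signal("vm-1") // duplicate signals for one resource coalesce
	p.signal("vm-2")
	<-p.notifyCh
	fmt.Println(len(p.drain())) // prints "2": vm-1 and vm-2, each once
}
```

Because coalescing happens in the map rather than in a single queued string, a burst of timer fires for one resource can no longer starve or mask retries for another.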

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: a691a124-6090-4039-a05d-fc7c9a53193d

📥 Commits

Reviewing files that changed from the base of the PR and between ab1e8a2 and 950ed7c.

📒 Files selected for processing (6)
  • compliance/compliance.go
  • compliance/compliance_ack_test.go
  • compliance/compliance_test.go
  • compliance/node/interfaces.go
  • pkg/retry/handler/unconfirmed_message_handler.go
  • pkg/retry/handler/unconfirmed_message_handler_test.go

@vikin91
Contributor Author

vikin91 commented Apr 9, 2026

@coderabbitai recheck

@coderabbitai
Contributor

coderabbitai bot commented Apr 9, 2026

@vikin91 I'll recheck the PR and provide an updated review shortly.

✅ Actions performed

Full review triggered.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
compliance/compliance_test.go (1)

32-48: Assert the routed resourceID in this suite.

The mock drops the resourceID, so these tests only prove that some ACK/NACK handler was called. A regression that routes "" or the wrong key into HandleACK/HandleNACK would still pass, even though per-resource routing is the behavior this PR is introducing. Please record the seen IDs and assert they match nodeResourceID.

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

Also applies to: 57-117

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@compliance/compliance_test.go` around lines 32 - 48, The
mockUnconfirmedMessageHandler currently discards the resourceID in HandleACK and
HandleNACK so tests only verify a handler was invoked; modify the mock
(mockUnconfirmedMessageHandler) to record the received resourceID(s) (e.g.,
append to a slice or set field) in HandleACK(resourceID string) and
HandleNACK(resourceID string), expose a method or field to retrieve them, and
update the tests to assert the recorded IDs equal the expected nodeResourceID;
also apply the same change for any other mock methods in this file that swallow
the id (see ObserveSending/OnACK areas between lines 57-117) so all assertions
verify the routed resourceID rather than just call counts.
pkg/retry/handler/unconfirmed_message_handler_test.go (1)

17-115: Move the retry timing table off wall-clock sleeps.

These cases still wait on real time for up to ~9 seconds and race a background counting goroutine, so they remain slow and occasionally flaky under CI load. Since this file now uses testing/synctest for the newer timer-sensitive paths, porting this table as well would make the retry/backoff coverage deterministic and keep test latency down.

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/retry/handler/unconfirmed_message_handler_test.go` around lines 17 - 115,
The test TestWithRetryable relies on real time sleeps and races with a
background goroutine; replace wall-clock waits with a deterministic fake clock
from testing/synctest and advance it to trigger retries. Modify the test to
create a synctest.Clock, inject it into the UnconfirmedMessageHandler (e.g., add
or use a constructor like NewUnconfirmedMessageHandlerWithClock(ctx, "test",
baseDuration, clock) or add a clock option), then schedule sends/acks/nacks by
advancing the fake clock (instead of goroutines using time.After) and drive
umh.RetryCommand by advancing the clock to expected retry instants; update uses
of ObserveSending, HandleACK and HandleNACK to be called synchronously at the
appropriate clock advances and assert the counter deterministically.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/retry/handler/unconfirmed_message_handler.go`:
- Around line 146-173: HandleACK currently removes the resource from h.resources
but does not cancel queued retry commands, so the dispatcher can still publish
stale retries; modify HandleACK (in UnconfirmedMessageHandlerImpl) to also
remove/cancel any pending entries for resourceID from pendingRetries and close
or mark related commands so they are not sent, and update the dispatcher logic
that reads from retryCommandCh to revalidate the resource's liveness (check
h.resources or a dedicated isLive(resourceID) method) before publishing a retry;
ensure cancellation happens while holding h.mu and that any callbacks (h.onACK)
are still invoked outside the lock.

---

Nitpick comments:
In `@compliance/compliance_test.go`:
- Around line 32-48: The mockUnconfirmedMessageHandler currently discards the
resourceID in HandleACK and HandleNACK so tests only verify a handler was
invoked; modify the mock (mockUnconfirmedMessageHandler) to record the received
resourceID(s) (e.g., append to a slice or set field) in HandleACK(resourceID
string) and HandleNACK(resourceID string), expose a method or field to retrieve
them, and update the tests to assert the recorded IDs equal the expected
nodeResourceID; also apply the same change for any other mock methods in this
file that swallow the id (see ObserveSending/OnACK areas between lines 57-117)
so all assertions verify the routed resourceID rather than just call counts.

In `@pkg/retry/handler/unconfirmed_message_handler_test.go`:
- Around line 17-115: The test TestWithRetryable relies on real time sleeps and
races with a background goroutine; replace wall-clock waits with a deterministic
fake clock from testing/synctest and advance it to trigger retries. Modify the
test to create a synctest.Clock, inject it into the UnconfirmedMessageHandler
(e.g., add or use a constructor like NewUnconfirmedMessageHandlerWithClock(ctx,
"test", baseDuration, clock) or add a clock option), then schedule
sends/acks/nacks by advancing the fake clock (instead of goroutines using
time.After) and drive umh.RetryCommand by advancing the clock to expected retry
instants; update uses of ObserveSending, HandleACK and HandleNACK to be called
synchronously at the appropriate clock advances and assert the counter
deterministically.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 908902b8-3a8a-4e8b-9e4a-71f6c425fb27

📥 Commits

Reviewing files that changed from the base of the PR and between 9d35506 and 1f291af.

📒 Files selected for processing (6)
  • compliance/compliance.go
  • compliance/compliance_ack_test.go
  • compliance/compliance_test.go
  • compliance/node/interfaces.go
  • pkg/retry/handler/unconfirmed_message_handler.go
  • pkg/retry/handler/unconfirmed_message_handler_test.go

HandleACK removed the resource from h.resources but left it in
pendingRetries, so the dispatcher could still emit retries for
already-acknowledged resources. Two race windows existed:

1. Resource in pendingRetries when ACK arrives — dispatcher drains it.
2. Resource already in dispatcher's local slice after takePendingRetries.

Fix: HandleACK now also clears pendingRetries, and the dispatcher
validates resource liveness via isResourceActive before sending.

AI-generated: implementation and regression tests.
Human-reviewed: race analysis, code correctness, style conformance.
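The two-part fix described above can be sketched as follows. Field and method names mirror the commit message (`pendingRetries`, `isResourceActive`) but are illustrative, not the actual `pkg/retry/handler` implementation: `HandleACK` closes window 1 by clearing the pending entry, and the dispatcher closes window 2 by revalidating liveness on its already-drained local slice.

```go
package main

import (
	"fmt"
	"sync"
)

// handler is a minimal sketch of the fixed flow; not the real UMH.
type handler struct {
	mu             sync.Mutex
	resources      map[string]struct{}
	pendingRetries map[string]struct{}
}

// HandleACK drops the resource AND its queued retry, closing race window 1.
func (h *handler) HandleACK(id string) {
	h.mu.Lock()
	delete(h.resources, id)
	delete(h.pendingRetries, id)
	h.mu.Unlock()
}

func (h *handler) isResourceActive(id string) bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	_, ok := h.resources[id]
	return ok
}

// dispatch revalidates liveness after draining, closing race window 2:
// an ACK may land after the retry was copied into the local slice.
func (h *handler) dispatch(drained []string, send func(string)) {
	for _, id := range drained {
		if h.isResourceActive(id) {
			send(id)
		}
	}
}

func main() {
	h := &handler{
		resources:      map[string]struct{}{"vm-1": {}, "vm-2": {}},
		pendingRetries: map[string]struct{}{"vm-1": {}, "vm-2": {}},
	}
	drained := []string{"vm-1", "vm-2"} // dispatcher's local copy
	h.HandleACK("vm-1")                 // ACK arrives after the drain
	var sent []string
	h.dispatch(drained, func(id string) { sent = append(sent, id) })
	fmt.Println(sent) // [vm-2]: the acknowledged vm-1 retry is suppressed
}
```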
@vikin91
Contributor Author

vikin91 commented Apr 9, 2026

@coderabbitai check again

@coderabbitai
Contributor

coderabbitai bot commented Apr 9, 2026

@vikin91 Sure! I'll redo the full review of the PR and provide an updated analysis shortly.

✅ Actions performed

Full review triggered.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@compliance/compliance.go`:
- Around line 39-43: The code currently keys UMH sends with the synthetic
constant nodeResourceID ("this-node") instead of the actual resource id from
incoming ACK/NACKs; update the send/tracking logic to use ack.GetResourceId()
(or the real node name passed into the send) as the map key and ensure that the
resource_id is forwarded into HandleACK and HandleNACK so they validate and
match the exact resource before clearing retry state; replace uses of
nodeResourceID in send/track operations and in the ACK/NACK handling paths
(including calls to HandleACK/HandleNACK) so stale/mismatched ACKs cannot
acknowledge the wrong message.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: 9b4bad7c-3a88-4186-9185-4a99e092cd7c

📥 Commits

Reviewing files that changed from the base of the PR and between ae1aa9f and c400789.

📒 Files selected for processing (6)
  • compliance/compliance.go
  • compliance/compliance_ack_test.go
  • compliance/compliance_test.go
  • compliance/node/interfaces.go
  • pkg/retry/handler/unconfirmed_message_handler.go
  • pkg/retry/handler/unconfirmed_message_handler_test.go

@vikin91 vikin91 requested a review from rhybrillou April 10, 2026 09:17