ROX-32848: ACK-based retry logic for VM index reports#18489
Conversation
This change is part of the following stack: Change managed by git-spice.
Skipping CI for Draft Pull Request.
Hey - I've found 7 issues and left some high-level feedback:
- In UnconfirmedMessageHandlerImpl, the cleanup goroutine stops timers and then closes retryCommandCh, but time.AfterFunc callbacks may already be executing when Stop() returns false; consider tracking active timer callbacks (e.g., via a WaitGroup or the existing Stopper) and ensuring they have finished before closing the channel to avoid potential sends on a closed channel.
- The VMIndexReportSendAttempts metric uses the label value "retry" for all non-success outcomes even though the sender no longer performs retries; consider renaming this label to something like "error" or splitting out result states to better reflect the current behavior.
## Individual Comments
### Comment 1
<location> `pkg/retry/handler/unconfirmed_message_handler.go:124-133` </location>
<code_context>
- s.retry.Store(0)
- s.numUnackedSendings.Store(0)
+// HandleACK is called when an ACK is received for a resource.
+func (h *UnconfirmedMessageHandlerImpl) HandleACK(resourceID string) {
+ concurrency.WithLock(&h.mu, func() {
+ // Check if handler is stopped before any operations
+ if h.ctx.Err() != nil {
+ return
+ }
+
+ state, exists := h.resources[resourceID]
+ if exists {
+ if state.timer != nil {
+ state.timer.Stop()
+ }
+ state.retry = 0
+ state.numUnackedSendings = 0
+ log.Debugf("[%s] Received ACK for resource %s", h.handlerName, resourceID)
+ } else {
+ log.Debugf("[%s] Received ACK for unknown resource %s", h.handlerName, resourceID)
+ }
+ })
+
+ // Invoke callback outside lock
+ if h.onACK != nil {
+ h.onACK(resourceID)
+ }
</code_context>
<issue_to_address>
**issue (bug_risk):** Callback access in HandleACK can race with OnACK registration
`HandleACK` reads `h.onACK` outside `WithLock`, while `OnACK` writes it under the same mutex, which can trigger a race under the race detector. To avoid this, read the callback under the lock into a local (e.g. `cb := h.onACK`), release the lock, then call `cb(resourceID)` if `cb != nil`. This preserves invoking the callback outside the critical section without racy access.
</issue_to_address>
### Comment 2
<location> `compliance/virtualmachines/relay/relay.go:97-103` </location>
<code_context>
case <-ctx.Done():
return ctx.Err()
+
+ case vsockID, ok := <-r.umh.RetryCommand():
+ if !ok {
+ log.Warn("UMH retry channel closed; stopping relay")
+ return errors.New("UMH retry channel closed unexpectedly")
+ }
+ // Relay no longer stores reports; rely on agent resubmission.
+ log.Debugf("Retry requested for VSOCK %s, ignoring (no cached report retained)", vsockID)
+
case report := <-reportChan:
</code_context>
<issue_to_address>
**suggestion:** Treating closed UMH retry channel as an error makes normal shutdown noisy
Since `RetryCommand()` is closed on context cancellation, this path likely represents normal shutdown rather than an error. Instead of logging a warning and returning an error when `ok == false`, consider checking `ctx.Err()` and treating it as a graceful exit, or simply exiting the loop without returning an error to avoid noisy shutdown logs.
```suggestion
case vsockID, ok := <-r.umh.RetryCommand():
if !ok {
// Retry channel is closed; if the context is done, treat as graceful shutdown.
if err := ctx.Err(); err != nil {
log.Debug("UMH retry channel closed due to context cancellation; stopping relay")
return err
}
// Context is still active: this is unexpected.
log.Warn("UMH retry channel closed unexpectedly; stopping relay")
return errors.New("UMH retry channel closed unexpectedly while context active")
}
// Relay no longer stores reports; rely on agent resubmission.
log.Debugf("Retry requested for VSOCK %s, ignoring (no cached report retained)", vsockID)
```
</issue_to_address>
### Comment 3
<location> `compliance/virtualmachines/relay/sender/index_report_sender.go:58-64` </location>
<code_context>
+ result := "success"
</code_context>
<issue_to_address>
**suggestion:** Metric label value "retry" is misleading now that retry logic was removed
Since we no longer retry, `"retry"` effectively means "error" here. Please either rename the label value (e.g. to `"error"` and adjust the metric help text) or reintroduce explicit retry attempts if you need to distinguish them.
Suggested implementation:
```golang
result := "success"
if err != nil {
result = "error"
}
```
You should also:
1. Find the metric definition that uses this `result` label (likely in the same package or a common metrics file).
2. Update the label documentation / help text to replace any mention of “retry” with “error”, and ensure any dashboards/alerts that depend on the old value are updated accordingly.
</issue_to_address>
### Comment 4
<location> `pkg/retry/handler/unconfirmed_message_handler_test.go:172` </location>
<code_context>
+ suite.Equal([]string{"resource-ack-1", "resource-ack-2", "resource-ack-3"}, ackedResources)
+}
+
+func (suite *UnconfirmedMessageHandlerTestSuite) TestShutdown() {
+ ctx, cancel := context.WithCancel(context.Background())
+ baseInterval := 50 * time.Millisecond
</code_context>
<issue_to_address>
**suggestion (testing):** Consider a test for the dropped-retry case when retryCommand channel is saturated
`onTimerFired` uses a non-blocking `select` on `retryCommandCh` with a `default` that logs and drops when the channel is full, but no test currently covers this path.
Please add a test that:
- Uses a very short base interval.
- Prevents reads from `RetryCommand()` so the channel becomes full (e.g., by using a buffered channel or otherwise forcing saturation deterministically).
- Verifies that the send on `retryCommandCh` is non-blocking and that timers continue to be scheduled.
This will codify the current “drop-on-full” behavior and protect against regressions that might accidentally make the timer goroutine block.
Suggested implementation:
```golang
// ACK should invoke the callback
```
```golang
func TestRetryCommandChannelDropOnFull(t *testing.T) {
// This test verifies that when retryCommandCh is full, onTimerFired does not
// block and instead drops the retry, allowing timers to continue to be scheduled.
// Use a very short base interval so we can observe multiple timer firings quickly.
baseInterval := 5 * time.Millisecond
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
umh := NewUnconfirmedMessageHandler(ctx, "drop-on-full-test", baseInterval)
// IMPORTANT: we intentionally do NOT read from RetryCommand() in this test.
// This allows the internal retryCommandCh to become saturated.
// Schedule a resource for retry. Depending on your real API, you might need to
// change this to whatever method you use to register a failed/unconfirmed message.
const resourceID = "retry-resource-1"
// If your real API does not have HandleNACK, replace this call with the
// appropriate method that schedules a retry (e.g., HandleFailure, TrackUnconfirmed, etc.).
if h, ok := interface{}(umh).(interface{ HandleNACK(string) }); ok {
h.HandleNACK(resourceID)
} else {
t.Skip("UnconfirmedMessageHandler does not expose HandleNACK; update the test to use the appropriate retry-scheduling API")
}
// Allow enough time for:
// - the initial retry to be scheduled
// - additional timer firings to attempt more retries (which should be dropped once the channel is full)
deadline := time.After(200 * time.Millisecond)
// Because we don't know the internal buffer size, we treat any successful send of at
// least one retry as evidence that the timer fired and the goroutine is running.
// We then assert that the test completes within the deadline, which implies that
// the goroutine didn't block on a full channel.
retryCh := umh.RetryCommand()
var sawAtLeastOneRetry bool
loop:
for {
select {
case <-deadline:
// If we reach the deadline, it's evidence that the goroutine is not blocked:
// the test completed and the context wasn't canceled due to a deadlock.
break loop
case <-time.After(10 * time.Millisecond):
// Periodic check that the context is still alive.
select {
case <-ctx.Done():
t.Fatalf("context canceled unexpectedly; possible deadlock in retry handler: %v", ctx.Err())
default:
}
case <-retryCh:
// We saw at least one retry command, which means:
// - timers are firing
// - the send into retryCommandCh succeeded at least once (non-blocking)
sawAtLeastOneRetry = true
// Note: this case may receive again on later loop iterations, so the
// channel is not guaranteed to stay saturated; the deadline-based
// completion of the test is what asserts that the timer goroutine
// never blocks on a full channel.
}
}
if !sawAtLeastOneRetry {
t.Fatalf("expected to observe at least one retry command, but none were received")
}
}
```
You will likely need to adapt this test to your actual handler API:
1. If `HandleNACK(string)` is not the correct way to schedule a retry:
- Replace the reflective `HandleNACK` usage with the appropriate method (e.g., `HandleFailure`, `ScheduleRetry`, etc.).
- The key is to ensure at least one resource is in the retry set so that timer firings attempt to send on `retryCommandCh`.
2. If `RetryCommand()` has a different name or signature:
- Replace `umh.RetryCommand()` with the actual method that exposes the retry command channel.
3. If you have direct access to the underlying channel or its capacity:
- You can strengthen the assertions by:
- Checking `len(ch) == cap(ch)` to confirm saturation.
- Optionally, counting timer firings via an injected clock or test hook, if available.
4. If your project uses the testify suite pattern exclusively:
- Convert `TestRetryCommandChannelDropOnFull` into a method on `UnconfirmedMessageHandlerTestSuite`:
`func (suite *UnconfirmedMessageHandlerTestSuite) TestRetryCommandChannelDropOnFull() { ... }`
- Replace `t.Fatalf` and skips with `suite.T().Fatalf` and `suite.T().Skip`, or the corresponding testify assertions.
</issue_to_address>
### Comment 5
<location> `compliance/virtualmachines/relay/relay_test.go:215` </location>
<code_context>
}
+// TestRelay_RateLimiting verifies that rate limiting works
+func (s *relayTestSuite) TestRelay_RateLimiting() {
+ // Create mock sender
+ done := concurrency.NewSignal()
</code_context>
<issue_to_address>
**suggestion (testing):** Rate limiting tests do not cover stale vs non-stale ACK behavior and metrics
The current test only validates the basic leaky-bucket behavior (`maxReportsPerMinute=1`). The implementation also has ACK-aware semantics driven by `staleAckThreshold` and emits different metrics/logs depending on whether the last ACK is stale/missing vs recent.
To fully cover this logic, please add tests that:
- Exercise a rate-limited VM with no ACK or a very old ACK and assert the `stale_ack` path (e.g., via the Prometheus counter or a test hook).
- Exercise a rate-limited VM with a recent ACK (e.g., via `umh.HandleACK` / `markAcked`) and assert the `normal` path when subsequent reports are dropped.
This ensures both code paths and their metrics are tested.
Suggested implementation:
```golang
// TestRelay_RateLimiting verifies that basic leaky-bucket rate limiting works
// (stale vs non-stale ACK behavior is covered in dedicated tests below).
```
To fully implement your review suggestion, additional code needs to be added to `compliance/virtualmachines/relay/relay_test.go`. The following outlines the concrete tests and assertions that should be introduced; you’ll need to adapt counter names / helper types to match the existing implementation:
1. **Add a test for the stale-ACK path, e.g. `TestRelay_RateLimiting_StaleAck`:**
- **Arrange:**
- Configure a `mockIndexReportStream` with multiple reports for the *same* `VsockCid` (e.g., `"100"`), just like `TestRelay_RateLimiting`.
- Configure a `mockIndexReportSender` that expects only the first report to be sent successfully (remaining ones should be dropped by rate limiting).
- Create a `mockUMH` (or similar) that allows you to control the last ACK timestamp per VM. For the tested `VsockCid`, do **not** set any ACK (or explicitly set `lastAckTime` far in the past: older than `staleAckThreshold`).
- Instantiate the relay with `maxReportsPerMinute = 1` and a reasonably small `staleAckThreshold` (e.g. `10 * time.Minute`), using `New(mockIndexReportStream, mockIndexReportSender, mockUMH, 1, staleAckThreshold)`.
- Reset the Prometheus counter(s) that track rate limiting decisions, e.g. something like:
```go
relayReportsRateLimited.Reset()
```
or whatever metric(s) your implementation uses.
- **Act:**
- Run the relay in a test context until the stream is exhausted (the pattern should match how other relay tests are currently starting/running the relay).
- **Assert:**
- Verify that:
- `mockIndexReportSender` saw exactly one send (the first report).
- The rate-limiting metric for **stale ACK** was incremented for the subsequent dropped reports, e.g.:
```go
staleAckCount := testutil.ToFloat64(relayReportsRateLimited.WithLabelValues("stale_ack"))
require.GreaterOrEqual(t, staleAckCount, 1.0)
```
Adjust metric name and label to match your real code (e.g. `"reason", "stale_ack"` or similar).
- If you have structured logging hooks or testing logger, also assert that the correct `stale_ack` log path is taken (by counting log entries or checking structured fields).
2. **Add a test for the recent-ACK (normal) path, e.g. `TestRelay_RateLimiting_RecentAck`:**
- **Arrange:**
- Use a similar setup as in the stale-ACK test: same `VsockCid`, same `mockIndexReportStream` with multiple reports, and `maxReportsPerMinute = 1`.
- For `mockUMH`, simulate a **recent ACK** for that `VsockCid`:
- Either call the real ACK handling API (`umh.HandleACK(...)`) with a timestamp `now` (or a fake clock at `now`), or directly call the helper the implementation uses (e.g. `markAcked(cid, time.Now())`) if it’s exposed.
- Ensure the stored ACK time is **newer** than `now - staleAckThreshold`.
- Reset/zero the same Prometheus metric(s) as in the stale test.
- **Act:**
- Run the relay with the same test harness.
- **Assert:**
- Confirm:
- `mockIndexReportSender` again only sends the first report (rate limit still applies).
- The **normal**/non-stale rate-limiting metric is incremented (e.g. `"reason" = "normal"` or `"reason" = "recent_ack"`), while the `stale_ack` metric remains zero:
```go
normalCount := testutil.ToFloat64(relayReportsRateLimited.WithLabelValues("normal"))
staleCount := testutil.ToFloat64(relayReportsRateLimited.WithLabelValues("stale_ack"))
require.GreaterOrEqual(t, normalCount, 1.0)
require.Equal(t, 0.0, staleCount)
```
Again, adjust metric and label names to your actual definitions.
- Optionally assert the log path used for these drops corresponds to the non-stale case.
3. **Mock / helper expectations:**
- If `mockUMH` does not currently support inspecting or setting ACK timestamps:
- Extend it with fields and methods to simulate and expose ACK state, matching the production interface (e.g., `GetLastACK(vmID string) time.Time`, or implement whatever interface is used by `New`).
- Ensure the relay uses the mock’s state to decide whether an ACK is stale.
- Ensure tests are using the same Prometheus registry / metrics helpers as the rest of the test suite (often via `prometheus.NewRegistry()` and `testutil.CollectAndCount` or `testutil.ToFloat64`).
These additional tests should live in the same `relay_test.go` file, likely next to `TestRelay_RateLimiting`, and follow the same `relayTestSuite` pattern (`func (s *relayTestSuite) ...`) and test helpers already used elsewhere in this test suite.
</issue_to_address>
### Comment 6
<location> `compliance/compliance_test.go:49` </location>
<code_context>
+ // no-op for test mock
+}
+
+func (s *ComplianceTestSuite) TestHandleComplianceACK() {
+ cases := map[string]struct {
+ ack *sensor.MsgToCompliance_ComplianceACK
</code_context>
<issue_to_address>
**suggestion (testing):** Compliance ACK tests could include unknown type/action cases to document behavior
Consider extending `TestHandleComplianceACK` with table entries for:
- An unknown `MessageType` (not covered by the switch) to assert that no UMH is invoked and the function does not panic.
- An unknown `Action` for a known `MessageType` to assert that handlers are not called.
This will lock in the current "ignore but log" behavior for unsupported ACKs and clarify expectations if new enum values are added in the proto later.
Suggested implementation:
```golang
expectedVMIndexACKs int
expectedVMIndexNACKs int
}{
"unknown message type is ignored": {
ack: &sensor.MsgToCompliance_ComplianceACK{
// Use a value outside the known enum range to simulate an unknown type.
MessageType: sensor.MsgToCompliance_ComplianceACK_MessageType(9999),
Action: sensor.MsgToCompliance_ComplianceACK_ACK,
},
// All expected counts remain zero to assert that no handler is invoked.
},
"unknown action for known message type is ignored": {
ack: &sensor.MsgToCompliance_ComplianceACK{
// Use a valid, known message type so only the action is unknown.
MessageType: sensor.MsgToCompliance_ComplianceACK_NODE_INVENTORY,
// Use a value outside the known enum range to simulate an unknown action.
Action: sensor.MsgToCompliance_ComplianceACK_Action(9999),
},
// All expected counts remain zero to assert that no handler is invoked.
},
"should handle NODE_INVENTORY ACK": {
ack: &sensor.MsgToCompliance_ComplianceACK{
Action: sensor.MsgToCompliance_ComplianceACK_ACK,
```
This patch assumes:
1. The `MessageType` enum type is `sensor.MsgToCompliance_ComplianceACK_MessageType` and includes a `MessageType` field on `MsgToCompliance_ComplianceACK`.
2. The `Action` enum type is `sensor.MsgToCompliance_ComplianceACK_Action` and `NODE_INVENTORY` is a valid `MessageType` value.
If the actual enum or field names differ (e.g. `MsgType` instead of `MessageType`, or different enum type names / constants), adjust the inserted code to match the real proto-generated identifiers. The test harness should already be asserting per-case expected counts against handler invocations; by leaving all expectations at their zero values, these new cases will enforce the "ignore but log" behavior for unknown ACKs.
</issue_to_address>
### Comment 7
<location> `pkg/retry/handler/unconfirmed_message_handler.go:166` </location>
<code_context>
+ return state
+}
+
+// resetTimer sets up the retry timer for a resource.
+func (h *UnconfirmedMessageHandlerImpl) resetTimer(resourceID string, state *resourceState, interval time.Duration) {
+ if state.timer != nil {
</code_context>
<issue_to_address>
**issue (complexity):** Consider encapsulating timer/backoff logic within resourceState and unifying locking patterns to make the handler’s behavior easier to follow.
You can keep the per-resource functionality but reduce complexity by (1) encapsulating the timer/backoff behavior in `resourceState` and (2) standardizing locking and context checks.
### 1) Encapsulate timer/backoff in `resourceState`
Right now `resetTimer`, `onTimerFired`, and `calculateNextInterval` sit on the handler and all operate on `resourceState`. That spreads the mental model across multiple functions.
You can move the logic into `resourceState` with small helpers, and keep `UnconfirmedMessageHandlerImpl` focused on map management and signaling:
```go
type resourceState struct {
retry int32
numUnackedSendings int32
timer *time.Timer
}
func (s *resourceState) stopTimer() {
if s.timer != nil {
s.timer.Stop()
}
}
func (s *resourceState) scheduleNext(h *UnconfirmedMessageHandlerImpl, resourceID string, interval time.Duration) {
s.stopTimer()
s.timer = time.AfterFunc(interval, func() {
h.onTimerFired(resourceID)
})
}
func (s *resourceState) nextInterval(base time.Duration) time.Duration {
if base <= 0 {
return defaultBaseInterval
}
s.retry++
next := time.Duration(s.retry+1) * base
if next <= 0 {
return defaultBaseInterval
}
return next
}
```
Then the handler methods become simpler and don’t need to know about retry math or timer lifecycle:
```go
func (h *UnconfirmedMessageHandlerImpl) ObserveSending(resourceID string) {
if h.ctx.Err() != nil {
return
}
h.mu.Lock()
defer h.mu.Unlock()
state := h.getOrCreateStateNoLock(resourceID)
state.numUnackedSendings++
log.Debugf("[%s] Observing send for resource %s (unacked: %d)",
h.handlerName, resourceID, state.numUnackedSendings)
if state.numUnackedSendings > 1 {
return
}
state.retry = 0
state.scheduleNext(h, resourceID, h.baseInterval)
}
func (h *UnconfirmedMessageHandlerImpl) onTimerFired(resourceID string) {
h.mu.Lock()
defer h.mu.Unlock()
if h.ctx.Err() != nil {
return
}
state, exists := h.resources[resourceID]
if !exists || state.numUnackedSendings == 0 {
return
}
nextInterval := state.nextInterval(h.baseInterval)
log.Infof("[%s] Resource %s has %d unacked messages, suggesting retry %d (next in %s)",
h.handlerName, resourceID, state.numUnackedSendings, state.retry, nextInterval)
state.scheduleNext(h, resourceID, nextInterval)
select {
case <-h.ctx.Done():
return
case h.retryCommandCh <- resourceID:
default:
log.Warnf("[%s] Retry channel full, dropping retry signal for %s", h.handlerName, resourceID)
}
}
```
This collapses the “4 interdependent pieces” into a single cohesive responsibility on `resourceState`, which makes the handler easier to read.
### 2) Standardize locking and callback invocation
You currently mix `concurrency.WithLock` and manual `Lock`/`Unlock`, and `HandleACK` uses `WithLock` while others don’t. Pick one style and stick to it.
For example, if you stick to manual locking, `HandleACK` can follow the same pattern as `ObserveSending` and also centralize the “capture callback then call after unlock” pattern:
```go
func (h *UnconfirmedMessageHandlerImpl) HandleACK(resourceID string) {
if h.ctx.Err() != nil {
return
}
h.mu.Lock()
state, exists := h.resources[resourceID]
if exists {
state.stopTimer()
state.retry = 0
state.numUnackedSendings = 0
log.Debugf("[%s] Received ACK for resource %s", h.handlerName, resourceID)
} else {
log.Debugf("[%s] Received ACK for unknown resource %s", h.handlerName, resourceID)
}
callback := h.onACK
h.mu.Unlock()
// Invoke callback outside lock
if callback != nil {
callback(resourceID)
}
}
```
You can factor the “capture and invoke callback outside lock” into a tiny helper if you have more callbacks later, but even this change alone makes locking patterns consistent and easier to reason about.
</issue_to_address>
Images are ready for the commit at 8d89717. To use with deploy scripts, first |
Draft new VM index report retry logic
Description
change me!
User-facing documentation
Testing and quality
Automated testing
How I validated my change
change me!