ROX-32848: ACK-based retry logic for VM index reports#18489
Conversation
This change is part of the following stack: Change managed by git-spice.
Skipping CI for Draft Pull Request.
Hey - I've found 7 issues and left some high-level feedback:
- In UnconfirmedMessageHandlerImpl, the cleanup goroutine stops timers and then closes retryCommandCh, but time.AfterFunc callbacks may already be executing when Stop() returns false; consider tracking active timer callbacks (e.g., via a WaitGroup or the existing Stopper) and ensuring they have finished before closing the channel to avoid potential sends on a closed channel.
- The VMIndexReportSendAttempts metric uses the label value "retry" for all non-success outcomes even though the sender no longer performs retries; consider renaming this label to something like "error" or splitting out result states to better reflect the current behavior.
## Individual Comments
### Comment 1
<location> `pkg/retry/handler/unconfirmed_message_handler.go:124-133` </location>
<code_context>
- s.retry.Store(0)
- s.numUnackedSendings.Store(0)
+// HandleACK is called when an ACK is received for a resource.
+func (h *UnconfirmedMessageHandlerImpl) HandleACK(resourceID string) {
+ concurrency.WithLock(&h.mu, func() {
+ // Check if handler is stopped before any operations
+ if h.ctx.Err() != nil {
+ return
+ }
+
+ state, exists := h.resources[resourceID]
+ if exists {
+ if state.timer != nil {
+ state.timer.Stop()
+ }
+ state.retry = 0
+ state.numUnackedSendings = 0
+ log.Debugf("[%s] Received ACK for resource %s", h.handlerName, resourceID)
+ } else {
+ log.Debugf("[%s] Received ACK for unknown resource %s", h.handlerName, resourceID)
+ }
+ })
+
+ // Invoke callback outside lock
+ if h.onACK != nil {
+ h.onACK(resourceID)
+ }
</code_context>
<issue_to_address>
**issue (bug_risk):** Callback access in HandleACK can race with OnACK registration
`HandleACK` reads `h.onACK` outside `WithLock`, while `OnACK` writes it under the same mutex, which can trigger a race under the race detector. To avoid this, read the callback under the lock into a local (e.g. `cb := h.onACK`), release the lock, then call `cb(resourceID)` if `cb != nil`. This preserves invoking the callback outside the critical section without racy access.
</issue_to_address>
### Comment 2
<location> `compliance/virtualmachines/relay/relay.go:97-103` </location>
<code_context>
case <-ctx.Done():
return ctx.Err()
+
+ case vsockID, ok := <-r.umh.RetryCommand():
+ if !ok {
+ log.Warn("UMH retry channel closed; stopping relay")
+ return errors.New("UMH retry channel closed unexpectedly")
+ }
+ // Relay no longer stores reports; rely on agent resubmission.
+ log.Debugf("Retry requested for VSOCK %s, ignoring (no cached report retained)", vsockID)
+
case report := <-reportChan:
</code_context>
<issue_to_address>
**suggestion:** Treating closed UMH retry channel as an error makes normal shutdown noisy
Since `RetryCommand()` is closed on context cancellation, this path likely represents normal shutdown rather than an error. Instead of logging a warning and returning an error when `ok == false`, consider checking `ctx.Err()` and treating it as a graceful exit, or simply exiting the loop without returning an error to avoid noisy shutdown logs.
```suggestion
case vsockID, ok := <-r.umh.RetryCommand():
if !ok {
// Retry channel is closed; if the context is done, treat as graceful shutdown.
if err := ctx.Err(); err != nil {
log.Debug("UMH retry channel closed due to context cancellation; stopping relay")
return err
}
// Context is still active: this is unexpected.
log.Warn("UMH retry channel closed unexpectedly; stopping relay")
return errors.New("UMH retry channel closed unexpectedly while context active")
}
// Relay no longer stores reports; rely on agent resubmission.
log.Debugf("Retry requested for VSOCK %s, ignoring (no cached report retained)", vsockID)
```
</issue_to_address>
### Comment 3
<location> `compliance/virtualmachines/relay/sender/index_report_sender.go:58-64` </location>
<code_context>
+ result := "success"
</code_context>
<issue_to_address>
**suggestion:** Metric label value "retry" is misleading now that retry logic was removed
Since we no longer retry, `"retry"` effectively means "error" here. Please either rename the label value (e.g. to `"error"` and adjust the metric help text) or reintroduce explicit retry attempts if you need to distinguish them.
Suggested implementation:
```golang
result := "success"
if err != nil {
result = "error"
}
```
You should also:
1. Find the metric definition that uses this `result` label (likely in the same package or a common metrics file).
2. Update the label documentation / help text to replace any mention of “retry” with “error”, and ensure any dashboards/alerts that depend on the old value are updated accordingly.
</issue_to_address>
### Comment 4
<location> `pkg/retry/handler/unconfirmed_message_handler_test.go:172` </location>
<code_context>
+ suite.Equal([]string{"resource-ack-1", "resource-ack-2", "resource-ack-3"}, ackedResources)
+}
+
+func (suite *UnconfirmedMessageHandlerTestSuite) TestShutdown() {
+ ctx, cancel := context.WithCancel(context.Background())
+ baseInterval := 50 * time.Millisecond
</code_context>
<issue_to_address>
**suggestion (testing):** Consider a test for the dropped-retry case when retryCommand channel is saturated
`onTimerFired` uses a non-blocking `select` on `retryCommandCh` with a `default` that logs and drops when the channel is full, but no test currently covers this path.
Please add a test that:
- Uses a very short base interval.
- Prevents reads from `RetryCommand()` so the channel becomes full (e.g., by using a buffered channel or otherwise forcing saturation deterministically).
- Verifies that the send on `retryCommandCh` is non-blocking and that timers continue to be scheduled.
This will codify the current “drop-on-full” behavior and protect against regressions that might accidentally make the timer goroutine block.
Suggested implementation:
```golang
// ACK should invoke the callback
```
```golang
func TestRetryCommandChannelDropOnFull(t *testing.T) {
// This test verifies that when retryCommandCh is full, onTimerFired does not
// block and instead drops the retry, allowing timers to continue to be scheduled.
// Use a very short base interval so we can observe multiple timer firings quickly.
baseInterval := 5 * time.Millisecond
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
umh := NewUnconfirmedMessageHandler(ctx, "drop-on-full-test", baseInterval)
// IMPORTANT: we intentionally do NOT read from RetryCommand() in this test.
// This allows the internal retryCommandCh to become saturated.
// Schedule a resource for retry. Depending on your real API, you might need to
// change this to whatever method you use to register a failed/unconfirmed message.
const resourceID = "retry-resource-1"
// If your real API does not have HandleNACK, replace this call with the
// appropriate method that schedules a retry (e.g., HandleFailure, TrackUnconfirmed, etc.).
if h, ok := interface{}(umh).(interface{ HandleNACK(string) }); ok {
h.HandleNACK(resourceID)
} else {
t.Skip("UnconfirmedMessageHandler does not expose HandleNACK; update the test to use the appropriate retry-scheduling API")
}
// Allow enough time for:
// - the initial retry to be scheduled
// - additional timer firings to attempt more retries (which should be dropped once the channel is full)
deadline := time.After(200 * time.Millisecond)
// Because we don't know the internal buffer size, we treat any successful send of at
// least one retry as evidence that the timer fired and the goroutine is running.
// We then assert that the test completes within the deadline, which implies that
// the goroutine didn't block on a full channel.
retryCh := umh.RetryCommand()
var sawAtLeastOneRetry bool
loop:
for {
select {
case <-deadline:
// If we reach the deadline, it's evidence that the goroutine is not blocked:
// the test completed and the context wasn't canceled due to a deadlock.
break loop
case <-time.After(10 * time.Millisecond):
// Periodic check that the context is still alive.
select {
case <-ctx.Done():
t.Fatalf("context canceled unexpectedly; possible deadlock in retry handler: %v", ctx.Err())
default:
}
case <-retryCh:
// We saw at least one retry command, which means:
// - timers are firing
// - the send into retryCommandCh succeeded at least once (non-blocking)
sawAtLeastOneRetry = true
// Note: this case may receive again on later loop iterations, so the
// channel is not guaranteed to stay saturated; the deadline-based
// completion of the test is what asserts that the timer goroutine
// never blocks on a full channel.
}
}
if !sawAtLeastOneRetry {
t.Fatalf("expected to observe at least one retry command, but none were received")
}
}
```
You will likely need to adapt this test to your actual handler API:
1. If `HandleNACK(string)` is not the correct way to schedule a retry:
- Replace the reflective `HandleNACK` usage with the appropriate method (e.g., `HandleFailure`, `ScheduleRetry`, etc.).
- The key is to ensure at least one resource is in the retry set so that timer firings attempt to send on `retryCommandCh`.
2. If `RetryCommand()` has a different name or signature:
- Replace `umh.RetryCommand()` with the actual method that exposes the retry command channel.
3. If you have direct access to the underlying channel or its capacity:
- You can strengthen the assertions by:
- Checking `len(ch) == cap(ch)` to confirm saturation.
- Optionally, counting timer firings via an injected clock or test hook, if available.
4. If your project uses the testify suite pattern exclusively:
- Convert `TestRetryCommandChannelDropOnFull` into a method on `UnconfirmedMessageHandlerTestSuite`:
`func (suite *UnconfirmedMessageHandlerTestSuite) TestRetryCommandChannelDropOnFull() { ... }`
- Replace `t.Fatalf` and skips with `suite.T().Fatalf` and `suite.T().Skip`, or the corresponding testify assertions.
</issue_to_address>
### Comment 5
<location> `compliance/virtualmachines/relay/relay_test.go:215` </location>
<code_context>
}
+// TestRelay_RateLimiting verifies that rate limiting works
+func (s *relayTestSuite) TestRelay_RateLimiting() {
+ // Create mock sender
+ done := concurrency.NewSignal()
</code_context>
<issue_to_address>
**suggestion (testing):** Rate limiting tests do not cover stale vs non-stale ACK behavior and metrics
The current test only validates the basic leaky-bucket behavior (`maxReportsPerMinute=1`). The implementation also has ACK-aware semantics driven by `staleAckThreshold` and emits different metrics/logs depending on whether the last ACK is stale/missing vs recent.
To fully cover this logic, please add tests that:
- Exercise a rate-limited VM with no ACK or a very old ACK and assert the `stale_ack` path (e.g., via the Prometheus counter or a test hook).
- Exercise a rate-limited VM with a recent ACK (e.g., via `umh.HandleACK` / `markAcked`) and assert the `normal` path when subsequent reports are dropped.
This ensures both code paths and their metrics are tested.
Suggested implementation:
```golang
// TestRelay_RateLimiting verifies that basic leaky-bucket rate limiting works
// (stale vs non-stale ACK behavior is covered in dedicated tests below).
```
To fully implement your review suggestion, additional code needs to be added to `compliance/virtualmachines/relay/relay_test.go`. The following outlines the concrete tests and assertions that should be introduced; you’ll need to adapt counter names / helper types to match the existing implementation:
1. **Add a test for the stale-ACK path, e.g. `TestRelay_RateLimiting_StaleAck`:**
- **Arrange:**
- Configure a `mockIndexReportStream` with multiple reports for the *same* `VsockCid` (e.g., `"100"`), just like `TestRelay_RateLimiting`.
- Configure a `mockIndexReportSender` that expects only the first report to be sent successfully (remaining ones should be dropped by rate limiting).
- Create a `mockUMH` (or similar) that allows you to control the last ACK timestamp per VM. For the tested `VsockCid`, do **not** set any ACK (or explicitly set `lastAckTime` far in the past: older than `staleAckThreshold`).
- Instantiate the relay with `maxReportsPerMinute = 1` and a reasonably small `staleAckThreshold` (e.g. `10 * time.Minute`), using `New(mockIndexReportStream, mockIndexReportSender, mockUMH, 1, staleAckThreshold)`.
- Reset the Prometheus counter(s) that track rate limiting decisions, e.g. something like:
```go
relayReportsRateLimited.Reset()
```
or whatever metric(s) your implementation uses.
- **Act:**
- Run the relay in a test context until the stream is exhausted (the pattern should match how other relay tests are currently starting/running the relay).
- **Assert:**
- Verify that:
- `mockIndexReportSender` saw exactly one send (the first report).
- The rate-limiting metric for **stale ACK** was incremented for the subsequent dropped reports, e.g.:
```go
staleAckCount := testutil.ToFloat64(relayReportsRateLimited.WithLabelValues("stale_ack"))
require.GreaterOrEqual(t, staleAckCount, 1.0)
```
Adjust metric name and label to match your real code (e.g. `"reason", "stale_ack"` or similar).
- If you have structured logging hooks or testing logger, also assert that the correct `stale_ack` log path is taken (by counting log entries or checking structured fields).
2. **Add a test for the recent-ACK (normal) path, e.g. `TestRelay_RateLimiting_RecentAck`:**
- **Arrange:**
- Use a similar setup as in the stale-ACK test: same `VsockCid`, same `mockIndexReportStream` with multiple reports, and `maxReportsPerMinute = 1`.
- For `mockUMH`, simulate a **recent ACK** for that `VsockCid`:
- Either call the real ACK handling API (`umh.HandleACK(...)`) with a timestamp `now` (or a fake clock at `now`), or directly call the helper the implementation uses (e.g. `markAcked(cid, time.Now())`) if it’s exposed.
- Ensure the stored ACK time is **newer** than `now - staleAckThreshold`.
- Reset/zero the same Prometheus metric(s) as in the stale test.
- **Act:**
- Run the relay with the same test harness.
- **Assert:**
- Confirm:
- `mockIndexReportSender` again only sends the first report (rate limit still applies).
- The **normal**/non-stale rate-limiting metric is incremented (e.g. `"reason" = "normal"` or `"reason" = "recent_ack"`), while the `stale_ack` metric remains zero:
```go
normalCount := testutil.ToFloat64(relayReportsRateLimited.WithLabelValues("normal"))
staleCount := testutil.ToFloat64(relayReportsRateLimited.WithLabelValues("stale_ack"))
require.GreaterOrEqual(t, normalCount, 1.0)
require.Equal(t, 0.0, staleCount)
```
Again, adjust metric and label names to your actual definitions.
- Optionally assert the log path used for these drops corresponds to the non-stale case.
3. **Mock / helper expectations:**
- If `mockUMH` does not currently support inspecting or setting ACK timestamps:
- Extend it with fields and methods to simulate and expose ACK state, matching the production interface (e.g., `GetLastACK(vmID string) time.Time`, or implement whatever interface is used by `New`).
- Ensure the relay uses the mock’s state to decide whether an ACK is stale.
- Ensure tests are using the same Prometheus registry / metrics helpers as the rest of the test suite (often via `prometheus.NewRegistry()` and `testutil.CollectAndCount` or `testutil.ToFloat64`).
These additional tests should live in the same `relay_test.go` file, likely next to `TestRelay_RateLimiting`, and follow the same `relayTestSuite` pattern (`func (s *relayTestSuite) ...`) and test helpers already used elsewhere in this test suite.
</issue_to_address>
### Comment 6
<location> `compliance/compliance_test.go:49` </location>
<code_context>
+ // no-op for test mock
+}
+
+func (s *ComplianceTestSuite) TestHandleComplianceACK() {
+ cases := map[string]struct {
+ ack *sensor.MsgToCompliance_ComplianceACK
</code_context>
<issue_to_address>
**suggestion (testing):** Compliance ACK tests could include unknown type/action cases to document behavior
Consider extending `TestHandleComplianceACK` with table entries for:
- An unknown `MessageType` (not covered by the switch) to assert that no UMH is invoked and the function does not panic.
- An unknown `Action` for a known `MessageType` to assert that handlers are not called.
This will lock in the current "ignore but log" behavior for unsupported ACKs and clarify expectations if new enum values are added in the proto later.
Suggested implementation:
```golang
expectedVMIndexACKs int
expectedVMIndexNACKs int
}{
"unknown message type is ignored": {
ack: &sensor.MsgToCompliance_ComplianceACK{
// Use a value outside the known enum range to simulate an unknown type.
MessageType: sensor.MsgToCompliance_ComplianceACK_MessageType(9999),
Action: sensor.MsgToCompliance_ComplianceACK_ACK,
},
// All expected counts remain zero to assert that no handler is invoked.
},
"unknown action for known message type is ignored": {
ack: &sensor.MsgToCompliance_ComplianceACK{
// Use a valid, known message type so only the action is unknown.
MessageType: sensor.MsgToCompliance_ComplianceACK_NODE_INVENTORY,
// Use a value outside the known enum range to simulate an unknown action.
Action: sensor.MsgToCompliance_ComplianceACK_Action(9999),
},
// All expected counts remain zero to assert that no handler is invoked.
},
"should handle NODE_INVENTORY ACK": {
ack: &sensor.MsgToCompliance_ComplianceACK{
Action: sensor.MsgToCompliance_ComplianceACK_ACK,
```
This patch assumes:
1. The `MessageType` enum type is `sensor.MsgToCompliance_ComplianceACK_MessageType` and includes a `MessageType` field on `MsgToCompliance_ComplianceACK`.
2. The `Action` enum type is `sensor.MsgToCompliance_ComplianceACK_Action` and `NODE_INVENTORY` is a valid `MessageType` value.
If the actual enum or field names differ (e.g. `MsgType` instead of `MessageType`, or different enum type names / constants), adjust the inserted code to match the real proto-generated identifiers. The test harness should already be asserting per-case expected counts against handler invocations; by leaving all expectations at their zero values, these new cases will enforce the "ignore but log" behavior for unknown ACKs.
</issue_to_address>
### Comment 7
<location> `pkg/retry/handler/unconfirmed_message_handler.go:166` </location>
<code_context>
+ return state
+}
+
+// resetTimer sets up the retry timer for a resource.
+func (h *UnconfirmedMessageHandlerImpl) resetTimer(resourceID string, state *resourceState, interval time.Duration) {
+ if state.timer != nil {
</code_context>
<issue_to_address>
**issue (complexity):** Consider encapsulating timer/backoff logic within resourceState and unifying locking patterns to make the handler’s behavior easier to follow.
You can keep the per-resource functionality but reduce complexity by (1) encapsulating the timer/backoff behavior in `resourceState` and (2) standardizing locking and context checks.
### 1) Encapsulate timer/backoff in `resourceState`
Right now `resetTimer`, `onTimerFired`, and `calculateNextInterval` sit on the handler and all operate on `resourceState`. That spreads the mental model across multiple functions.
You can move the logic into `resourceState` with small helpers, and keep `UnconfirmedMessageHandlerImpl` focused on map management and signaling:
```go
type resourceState struct {
retry int32
numUnackedSendings int32
timer *time.Timer
}
func (s *resourceState) stopTimer() {
if s.timer != nil {
s.timer.Stop()
}
}
func (s *resourceState) scheduleNext(h *UnconfirmedMessageHandlerImpl, resourceID string, interval time.Duration) {
s.stopTimer()
s.timer = time.AfterFunc(interval, func() {
h.onTimerFired(resourceID)
})
}
func (s *resourceState) nextInterval(base time.Duration) time.Duration {
if base <= 0 {
return defaultBaseInterval
}
s.retry++
next := time.Duration(s.retry+1) * base
if next <= 0 {
return defaultBaseInterval
}
return next
}
```
Then the handler methods become simpler and don’t need to know about retry math or timer lifecycle:
```go
func (h *UnconfirmedMessageHandlerImpl) ObserveSending(resourceID string) {
if h.ctx.Err() != nil {
return
}
h.mu.Lock()
defer h.mu.Unlock()
state := h.getOrCreateStateNoLock(resourceID)
state.numUnackedSendings++
log.Debugf("[%s] Observing send for resource %s (unacked: %d)",
h.handlerName, resourceID, state.numUnackedSendings)
if state.numUnackedSendings > 1 {
return
}
state.retry = 0
state.scheduleNext(h, resourceID, h.baseInterval)
}
func (h *UnconfirmedMessageHandlerImpl) onTimerFired(resourceID string) {
h.mu.Lock()
defer h.mu.Unlock()
if h.ctx.Err() != nil {
return
}
state, exists := h.resources[resourceID]
if !exists || state.numUnackedSendings == 0 {
return
}
nextInterval := state.nextInterval(h.baseInterval)
log.Infof("[%s] Resource %s has %d unacked messages, suggesting retry %d (next in %s)",
h.handlerName, resourceID, state.numUnackedSendings, state.retry, nextInterval)
state.scheduleNext(h, resourceID, nextInterval)
select {
case <-h.ctx.Done():
return
case h.retryCommandCh <- resourceID:
default:
log.Warnf("[%s] Retry channel full, dropping retry signal for %s", h.handlerName, resourceID)
}
}
```
This collapses the “4 interdependent pieces” into a single cohesive responsibility on `resourceState`, which makes the handler easier to read.
### 2) Standardize locking and callback invocation
You currently mix `concurrency.WithLock` and manual `Lock`/`Unlock`, and `HandleACK` uses `WithLock` while others don’t. Pick one style and stick to it.
For example, if you stick to manual locking, `HandleACK` can follow the same pattern as `ObserveSending` and also centralize the “capture callback then call after unlock” pattern:
```go
func (h *UnconfirmedMessageHandlerImpl) HandleACK(resourceID string) {
if h.ctx.Err() != nil {
return
}
h.mu.Lock()
state, exists := h.resources[resourceID]
if exists {
state.stopTimer()
state.retry = 0
state.numUnackedSendings = 0
log.Debugf("[%s] Received ACK for resource %s", h.handlerName, resourceID)
} else {
log.Debugf("[%s] Received ACK for unknown resource %s", h.handlerName, resourceID)
}
callback := h.onACK
h.mu.Unlock()
// Invoke callback outside lock
if callback != nil {
callback(resourceID)
}
}
```
You can factor the “capture and invoke callback outside lock” into a tiny helper if you have more callbacks later, but even this change alone makes locking patterns consistent and easier to reason about.
</issue_to_address>
Images are ready for the commit at 8d89717. To use with deploy scripts, first |
Draft new VM index report retry logic
Description
change me!
User-facing documentation
Testing and quality
Automated testing
How I validated my change
change me!