fix(tailnet): retry after transport dial timeouts (#22977) (cherry-pick/v2.31)#22992
Merged
f0ssel merged 1 commit intorelease/2.31from Mar 13, 2026
Merged
fix(tailnet): retry after transport dial timeouts (#22977) (cherry-pick/v2.31)#22992f0ssel merged 1 commit intorelease/2.31from
f0ssel merged 1 commit intorelease/2.31from
Conversation
_Generated with mux but reviewed by a human_ This PR fixes a bug where Coder Desktop could stop retrying connections to coderd after a prolonged network interruption. When that happened, the client would no longer recoordinate or receive workspace updates, even after connectivity returned. This is likely the long-standing “stale connection” issue that has been reported both internally and by customers. In practice, it would cause all Coder Desktop workspaces to appear yellow or red in the UI and become unreachable. The underlying behavior matches the reports: peers are removed after 15 minutes without a handshake. So if network connectivity is lost for that long, the client must recoordinate to recover. This bug prevented that recoordination from happening. For that reason, I’m marking this as: Closes coder/coder-desktop-macos#227 ## Problem The tailnet controller owns a long-lived retry loop in `Controller.Run`. That loop already had an important graceful-shutdown guard added in [`ba21ba87`](ba21ba8) to prevent a phantom redial after cancellation: ```go if c.ctx.Err() != nil { return } ``` That guard was correct. It made controller lifetime depend on the controller's own context rather than on retry timing races. But the post-dial error path had since grown a broader terminal check: ```go if xerrors.Is(err, context.Canceled) || xerrors.Is(err, context.DeadlineExceeded) { return } ``` That turns out to be too broad for desktop reconnects. A dial attempt can fail with a wrapped `context.DeadlineExceeded` even while the controller's own context is still live. ## Why that happens The workspace tailnet dialer uses the SDK HTTP client, which inherits `http.DefaultTransport`. That transport uses a `net.Dialer` with a 30s `Timeout`. Go implements that timeout by creating an internal deadline-bound sub-context for the TCP connect. So during a control-plane blackhole, the reconnect path can look like this: - the existing control-plane connection dies - `Controller.Run` re-enters the retry path - the next websocket/TCP dial hangs against unreachable coderd - `net.Dialer` times out the connect after ~30s - the returned error unwraps to `context.DeadlineExceeded` - `Controller.Run` treats that as terminal and returns - the retry goroutine exits forever even though `c.ctx` is still alive At that point the data plane can remain partially alive, the desktop app can still look online, and unblocking coderd does nothing because the process is no longer trying to redial. ## How this was found We reproduced the issue in the macOS vpn-daemon process with temporary diagnostics, blackholed coderd with `pfctl`, and captured multiple goroutine dumps while the daemon was wedged. Those dumps showed: - `manageGracefulTimeout` was still blocked on `<-c.ctx.Done()`, proving the controller context was not canceled - the `Controller.Run` retry goroutine was missing from later dumps - control-plane consumers stayed idle longer over time - once coderd became reachable again the daemon still did not dial it That narrowed the failure from "slow retry" to "retry loop exited", and tracing the dial path back through `http.DefaultTransport` and `net.Dialer` explained why a transport timeout was being mistaken for controller shutdown. In my testing with coderd blocked, as expected, I did retain a connection to the workspace agent. I suspect the scenarios where connection to the agent are lost is because we can't retry coordination. ## Fix Keep the graceful-shutdown guard from [`ba21ba87`](ba21ba8) exactly as-is, but narrow the post-dial exit condition so it keys off the controller's own context instead of the error unwrap chain. Before: ```go if xerrors.Is(err, context.Canceled) || xerrors.Is(err, context.DeadlineExceeded) { return } ``` After: ```go if c.ctx.Err() != nil { return } ``` ## Why this is the right behavior This preserves the original graceful-shutdown invariant from [`ba21ba87`](ba21ba8) while restoring retryability for transient transport failures: - if `c.ctx` is canceled before dialing, the pre-dial guard still prevents a phantom redial - if `c.ctx` is canceled during a dial attempt, the error path still exits cleanly because `c.ctx.Err()` is non-nil - if a live controller hits a wrapped transport timeout, the loop no longer dies and instead retries as intended In other words, controller state remains the only authoritative signal for loop shutdown. ## Non-regression coverage This also preserves the earlier flaky-test fix from [`ba21ba87`](ba21ba8): - `pipeDialer` still returns errors instead of asserting from background goroutines - `TestController_Disconnects` still waits for `uut.Closed()` before the test exits On top of that, this change adds focused controller tests that assert: - a wrapped `net.OpError(context.DeadlineExceeded)` under a live controller causes another dial attempt instead of closing the controller - cancellation still shuts the controller down without an extra redial ## Validation After blocking TCP connections to coderd for 20 minutes to force the retry path, unblocking coderd allowed the daemon to recover on its own without toggling Coder Connect.
f0ssel
approved these changes
Mar 13, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Backport of #22977 to 2.31