fix(tailnet): retry after transport dial timeouts (#22977) (cherry-pick/v2.31) by ethanndickson · Pull Request #22992 · coder/coder

ethanndickson · 2026-03-12T13:20:16Z

Backport of #22977 to 2.31

_Generated with mux but reviewed by a human_ This PR fixes a bug where Coder Desktop could stop retrying connections to coderd after a prolonged network interruption. When that happened, the client would no longer recoordinate or receive workspace updates, even after connectivity returned. This is likely the long-standing “stale connection” issue that has been reported both internally and by customers. In practice, it would cause all Coder Desktop workspaces to appear yellow or red in the UI and become unreachable. The underlying behavior matches the reports: peers are removed after 15 minutes without a handshake. So if network connectivity is lost for that long, the client must recoordinate to recover. This bug prevented that recoordination from happening. For that reason, I’m marking this as: Closes coder/coder-desktop-macos#227 ## Problem The tailnet controller owns a long-lived retry loop in `Controller.Run`. That loop already had an important graceful-shutdown guard added in [`ba21ba87`](ba21ba8) to prevent a phantom redial after cancellation: ```go if c.ctx.Err() != nil { return } ``` That guard was correct. It made controller lifetime depend on the controller's own context rather than on retry timing races. But the post-dial error path had since grown a broader terminal check: ```go if xerrors.Is(err, context.Canceled) || xerrors.Is(err, context.DeadlineExceeded) { return } ``` That turns out to be too broad for desktop reconnects. A dial attempt can fail with a wrapped `context.DeadlineExceeded` even while the controller's own context is still live. ## Why that happens The workspace tailnet dialer uses the SDK HTTP client, which inherits `http.DefaultTransport`. That transport uses a `net.Dialer` with a 30s `Timeout`. Go implements that timeout by creating an internal deadline-bound sub-context for the TCP connect. So during a control-plane blackhole, the reconnect path can look like this: - the existing control-plane connection dies - `Controller.Run` re-enters the retry path - the next websocket/TCP dial hangs against unreachable coderd - `net.Dialer` times out the connect after ~30s - the returned error unwraps to `context.DeadlineExceeded` - `Controller.Run` treats that as terminal and returns - the retry goroutine exits forever even though `c.ctx` is still alive At that point the data plane can remain partially alive, the desktop app can still look online, and unblocking coderd does nothing because the process is no longer trying to redial. ## How this was found We reproduced the issue in the macOS vpn-daemon process with temporary diagnostics, blackholed coderd with `pfctl`, and captured multiple goroutine dumps while the daemon was wedged. Those dumps showed: - `manageGracefulTimeout` was still blocked on `<-c.ctx.Done()`, proving the controller context was not canceled - the `Controller.Run` retry goroutine was missing from later dumps - control-plane consumers stayed idle longer over time - once coderd became reachable again the daemon still did not dial it That narrowed the failure from "slow retry" to "retry loop exited", and tracing the dial path back through `http.DefaultTransport` and `net.Dialer` explained why a transport timeout was being mistaken for controller shutdown. In my testing with coderd blocked, as expected, I did retain a connection to the workspace agent. I suspect the scenarios where connection to the agent are lost is because we can't retry coordination. ## Fix Keep the graceful-shutdown guard from [`ba21ba87`](ba21ba8) exactly as-is, but narrow the post-dial exit condition so it keys off the controller's own context instead of the error unwrap chain. Before: ```go if xerrors.Is(err, context.Canceled) || xerrors.Is(err, context.DeadlineExceeded) { return } ``` After: ```go if c.ctx.Err() != nil { return } ``` ## Why this is the right behavior This preserves the original graceful-shutdown invariant from [`ba21ba87`](ba21ba8) while restoring retryability for transient transport failures: - if `c.ctx` is canceled before dialing, the pre-dial guard still prevents a phantom redial - if `c.ctx` is canceled during a dial attempt, the error path still exits cleanly because `c.ctx.Err()` is non-nil - if a live controller hits a wrapped transport timeout, the loop no longer dies and instead retries as intended In other words, controller state remains the only authoritative signal for loop shutdown. ## Non-regression coverage This also preserves the earlier flaky-test fix from [`ba21ba87`](ba21ba8): - `pipeDialer` still returns errors instead of asserting from background goroutines - `TestController_Disconnects` still waits for `uut.Closed()` before the test exits On top of that, this change adds focused controller tests that assert: - a wrapped `net.OpError(context.DeadlineExceeded)` under a live controller causes another dial attempt instead of closing the controller - cancellation still shuts the controller down without an extra redial ## Validation After blocking TCP connections to coderd for 20 minutes to force the retry path, unblocking coderd allowed the daemon to recover on its own without toggling Coder Connect.

ethanndickson requested a review from matifali March 12, 2026 13:20

github-actions bot assigned ethanndickson Mar 12, 2026

matifali requested a review from f0ssel March 12, 2026 14:09

matifali added the cherry-pick/v2.31 Needs to be cherry-picked to the 2.31 release branch label Mar 12, 2026

f0ssel approved these changes Mar 13, 2026

View reviewed changes

f0ssel merged commit 1a774ab into release/2.31 Mar 13, 2026
31 checks passed

f0ssel deleted the ethan/backport-coder-desktop-fix branch March 13, 2026 18:26

github-actions bot locked and limited conversation to collaborators Mar 13, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(tailnet): retry after transport dial timeouts (#22977) (cherry-pick/v2.31)#22992

fix(tailnet): retry after transport dial timeouts (#22977) (cherry-pick/v2.31)#22992
f0ssel merged 1 commit intorelease/2.31from
ethan/backport-coder-desktop-fix

ethanndickson commented Mar 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

ethanndickson commented Mar 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants