Handle MOVED error pointing to same endpoint.#3003
Handle MOVED error pointing to same endpoint.#3003barshaul wants to merge 5 commits intoStackExchange:mainfrom
Conversation
…on before retrying the request.
948a601 to
9c32b44
Compare
| { | ||
| if (bridge is null) | ||
| { | ||
| // already toast | ||
| } | ||
| else if (bridge.Multiplexer.TryResend(hashSlot, message, endpoint, isMoved)) | ||
| // MOVED to same endpoint detected. | ||
| // This occurs when Redis/Valkey servers are behind DNS records, load balancers, or proxies. | ||
| // The MOVED error signals that the client should reconnect to allow the DNS/proxy/load balancer | ||
| // to route the connection to a different underlying server host, then retry the command. | ||
| bridge?.TryConnect(null)?.Dispose(); |
There was a problem hiding this comment.
This seems like it relies on timing between the connection being detected as closed, and the parser - but the parser is IIRC "inline" here, i.e. we haven't finished reading yet. I need to think very carefully here about whether this is reliable ... I"m not sure either way, and I invite your input!
There was a problem hiding this comment.
additional concern: from memory (@NickCraver may remember more), DNS resolution in .NET has a habit of being cached for the process duration, unless explicit steps are taken - and I'm not seeing any explicit steps. I'm concerned that this may result in instant reconnection on the old cached (in-proc) DNS entry; definitely something to check
There was a problem hiding this comment.
@philon-msft tells me that the DNS part might be safer than I recall - will discuss, but: might not be a problem
There was a problem hiding this comment.
@mgravell regarding your first concern: I added a _needsReconnect flag to the bridge so that we only mark the connection as requiring a reconnect. This ensures that a retried command is queued rather than buffered to the old connection, deferring the actual reconnection to the reader loop in the case of a MOVED-to-same-endpoint scenario.
There was a problem hiding this comment.
re: the DNS caching concern:
I created a simple test using a Route53 DNS record:
- Connect to host A (via the Route53 DNS hostname), SET a key, verify it's stored
- Change the DNS record to point to a different host B
- Create a new connection (via same hostname), verify the key doesn't exist
The test passed - the second connection correctly resolved to the new host. I don't see DNS caching issues. Additionally, as I mentioned in #2990, if DNS caching were an issue with SE.Redis clients, we at ElastiCache would have already encountered it with our customers - our managed clusters logic relies heavily on DNS changes for failover and scaling operations.
…D-to-same-endpoint
Handle MOVED error pointing to same endpoint
Problem
When Redis/Valkey servers are deployed behind DNS records, load balancers, or proxies (as is common in managed environments), a MOVED error may be returned with a target node endpoint that is the same endpoint from which the error originated. Before the proposed change, when StackExchange.Redis received a MOVED error pointing to the same endpoint, it would fail the request and propagate the MOVED exception back to the application.
Issue ref: #2990
Solution
When a MOVED error points to the same endpoint, the client now triggers a reconnection before retrying the command. This allows the DNS record, proxy, or load balancer to route the new connection to a different underlying server host, enabling the retry to succeed.
Why proactive reconnection is necessary
The server closes the connection on its end immediately after sending the MOVED-to-same-endpoint error. However, due to the multiplexed nature of StackExchange.Redis connections, the library wouldn't notice the disconnection when it retries the request—all Write and Flush operations return successfully even though the underlying connection is broken. The disconnection is only detected when attempting to read the response.
At that point, it isn't safe to retry on the connection failure: we don't know if the request was actually sent to the server, so the connection error is raised back to the application, which also cannot determine if it's safe to retry. By initiating a reconnect immediately following the MOVED-to-same-endpoint error, we ensure the retry occurs on a fresh connection to a new server host.
Changes
Core fix:
ResultProcessor.cs: Detect when MOVED endpoint matches the current server endpoint and trigger reconnection by disposing the existing connection.Tests:
MovedToSameEndpointTests.cs: Integration test verifying the reconnect-and-retry behaviorMovedTestServer.cs: Test server helper that simulates MOVED responses pointing to the same endpoint