Skip to content

Handle MOVED error pointing to same endpoint.#3003

Open
barshaul wants to merge 5 commits intoStackExchange:mainfrom
barshaul:moved_same_endpoint
Open

Handle MOVED error pointing to same endpoint.#3003
barshaul wants to merge 5 commits intoStackExchange:mainfrom
barshaul:moved_same_endpoint

Conversation

@barshaul
Copy link

@barshaul barshaul commented Feb 3, 2026

Handle MOVED error pointing to same endpoint

Problem

When Redis/Valkey servers are deployed behind DNS records, load balancers, or proxies (as is common in managed environments), a MOVED error may be returned with a target node endpoint that is the same endpoint from which the error originated. Before the proposed change, when StackExchange.Redis received a MOVED error pointing to the same endpoint, it would fail the request and propagate the MOVED exception back to the application.

Issue ref: #2990

Solution

When a MOVED error points to the same endpoint, the client now triggers a reconnection before retrying the command. This allows the DNS record, proxy, or load balancer to route the new connection to a different underlying server host, enabling the retry to succeed.

Why proactive reconnection is necessary

The server closes the connection on its end immediately after sending the MOVED-to-same-endpoint error. However, due to the multiplexed nature of StackExchange.Redis connections, the library wouldn't notice the disconnection when it retries the request—all Write and Flush operations return successfully even though the underlying connection is broken. The disconnection is only detected when attempting to read the response.

At that point, it isn't safe to retry on the connection failure: we don't know if the request was actually sent to the server, so the connection error is raised back to the application, which also cannot determine if it's safe to retry. By initiating a reconnect immediately following the MOVED-to-same-endpoint error, we ensure the retry occurs on a fresh connection to a new server host.

Changes

Core fix:

  • ResultProcessor.cs: Detect when MOVED endpoint matches the current server endpoint and trigger reconnection by disposing the existing connection.

Tests:

  • MovedToSameEndpointTests.cs: Integration test verifying the reconnect-and-retry behavior
  • MovedTestServer.cs: Test server helper that simulates MOVED responses pointing to the same endpoint

@barshaul barshaul force-pushed the moved_same_endpoint branch from 948a601 to 9c32b44 Compare February 3, 2026 13:35
@barshaul barshaul marked this pull request as ready for review February 3, 2026 13:50
Comment on lines 265 to 270
{
if (bridge is null)
{
// already toast
}
else if (bridge.Multiplexer.TryResend(hashSlot, message, endpoint, isMoved))
// MOVED to same endpoint detected.
// This occurs when Redis/Valkey servers are behind DNS records, load balancers, or proxies.
// The MOVED error signals that the client should reconnect to allow the DNS/proxy/load balancer
// to route the connection to a different underlying server host, then retry the command.
bridge?.TryConnect(null)?.Dispose();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like it relies on timing between the connection being detected as closed, and the parser - but the parser is IIRC "inline" here, i.e. we haven't finished reading yet. I need to think very carefully here about whether this is reliable ... I"m not sure either way, and I invite your input!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

additional concern: from memory (@NickCraver may remember more), DNS resolution in .NET has a habit of being cached for the process duration, unless explicit steps are taken - and I'm not seeing any explicit steps. I'm concerned that this may result in instant reconnection on the old cached (in-proc) DNS entry; definitely something to check

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@philon-msft tells me that the DNS part might be safer than I recall - will discuss, but: might not be a problem

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mgravell regarding your first concern: I added a _needsReconnect flag to the bridge so that we only mark the connection as requiring a reconnect. This ensures that a retried command is queued rather than buffered to the old connection, deferring the actual reconnection to the reader loop in the case of a MOVED-to-same-endpoint scenario.

Copy link
Author

@barshaul barshaul Feb 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: the DNS caching concern:

I created a simple test using a Route53 DNS record:

  1. Connect to host A (via the Route53 DNS hostname), SET a key, verify it's stored
  2. Change the DNS record to point to a different host B
  3. Create a new connection (via same hostname), verify the key doesn't exist

The test passed - the second connection correctly resolved to the new host. I don't see DNS caching issues. Additionally, as I mentioned in #2990, if DNS caching were an issue with SE.Redis clients, we at ElastiCache would have already encountered it with our customers - our managed clusters logic relies heavily on DNS changes for failover and scaling operations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants