Description
We are currently using Npgsql with a multi-host configuration for PostgreSQL failover (no load balancing). While testing failure scenarios, we observed behavior that can introduce significant latency for user requests.
It appears that the current host recheck mechanism may cause user requests to block on connection attempts to a previously failed host, even though a healthy host is available.
Configuration
Example connection string:
Host=server1,server2;
load_balance_hosts=false;
Host Recheck Seconds=5;
Timeout=2;
target_session_attrs=primary;
Maximum Pool Size=5;
Scenario assumptions:
- server1 = primary host
- server2 = failover host
- no load balancing
- application uses the default connection pooling
Observed behavior
- Application initially connects to server1.
- server1 becomes unavailable.
- A connection attempt fails and server1 is marked as down.
- The driver starts using server2 successfully.
- After HostRecheckSeconds (e.g. 5 seconds), the driver tries server1 again.
If server1 is still unavailable, multiple concurrent requests attempt to connect to it again.
Because the connection attempt waits for the configured Timeout, these requests block for the full timeout duration before falling back to server2.
Example timeline:
Hosts: server1, server2
HostRecheckSeconds = 5
Timeout = 2
If server1 is still down:
- every 5 seconds the driver retries server1
- concurrent connection attempts wait up to 2 seconds
- only afterwards do they fall back to server2
This results in user requests experiencing ~2 seconds of latency, even though server2 is fully healthy and could respond immediately.
With higher concurrency (e.g. a larger Maximum Pool Size), many requests may be delayed simultaneously.
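To make the timeline concrete, here is a small toy model of the behavior described above. It is not Npgsql code, and several details are assumptions on my part: the function name `simulate`, the idea that the host is re-marked as down only once the first failing attempt times out, and that the recheck interval is counted from that moment. Requests that arrive while that first attempt is still in flight are assumed to try server1 as well.

```python
from typing import Iterable, List


def simulate(arrivals: Iterable[float],
             timeout: float = 2.0,
             recheck: float = 5.0) -> List[float]:
    """Added latency per request while server1 stays down (toy model)."""
    # server1 is skipped for requests arriving in [down_from, down_until).
    down_from = down_until = float("-inf")
    latencies = []
    for t in sorted(arrivals):
        if down_from <= t < down_until:
            # server1 is marked down: go straight to server2, no added latency.
            latencies.append(0.0)
            continue
        # server1 is not (yet) marked down: this request waits for the timeout.
        latencies.append(timeout)
        if t >= down_until:
            # First attempt after the recheck window expired. Assumption:
            # the host is marked down again when this attempt fails.
            down_from = t + timeout
            down_until = down_from + recheck
    return latencies


if __name__ == "__main__":
    print(simulate([0, 0.5, 1.0, 3, 6, 7, 7.5, 10]))
```

With the values from this issue (Timeout=2, Host Recheck Seconds=5), the requests arriving at 0, 0.5 and 1.0 each wait 2 seconds, the ones at 3 and 6 are served immediately, and the ones at 7 and 7.5 wait 2 seconds again once the recheck window has expired. Under these assumptions, every recheck stalls all requests that arrive during the timeout window, not just one.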
Expected behavior
From an application perspective, a preferable strategy would be:
- once a host is marked as down, continue using the healthy host
- periodically probe the failed host in the background
- avoid blocking user requests on retry attempts to the failed host
This would prevent unnecessary latency spikes when one host remains unavailable.
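The strategy above could look roughly like the following sketch. It is only an illustration of the idea, not a proposal for Npgsql's actual design: `FailoverSelector`, `pick_host`, `probe_once` and the `probe` callback are hypothetical names, and the health check itself is left to the caller.

```python
import threading
from typing import Callable


class FailoverSelector:
    """Hypothetical selector: requests never probe a host marked as down."""

    def __init__(self, primary: str, fallback: str,
                 probe: Callable[[str], bool]):
        self.primary, self.fallback = primary, fallback
        self._probe = probe        # blocking health check, run off the request path
        self._primary_up = True
        self._lock = threading.Lock()

    def mark_down(self) -> None:
        """Called when a user connection attempt to the primary fails."""
        with self._lock:
            self._primary_up = False

    def pick_host(self) -> str:
        """Non-blocking: only reads the last known state."""
        with self._lock:
            return self.primary if self._primary_up else self.fallback

    def probe_once(self) -> None:
        """One background probe; only this call may wait for the timeout."""
        ok = self._probe(self.primary)  # may block, so keep it outside the lock
        with self._lock:
            self._primary_up = ok

    def start(self, interval: float, stop: threading.Event) -> threading.Thread:
        """Run probe_once every `interval` seconds until `stop` is set."""
        def loop() -> None:
            while not stop.wait(interval):
                self.probe_once()
        thread = threading.Thread(target=loop, daemon=True)
        thread.start()
        return thread
```

The point of the split is that `pick_host` never waits on the network: the only code path that can block for the full Timeout is `probe_once`, which runs on the background thread. User requests keep going to server2 until a probe reports that server1 is reachable again.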
Question
Is the current behavior intentional?
If so, is there a recommended way to avoid user-facing delays when using multi-host failover without load balancing?
Alternatively, would it make sense to introduce an option to probe failed hosts in the background rather than during user connection attempts?
Environment
- Npgsql version: 8.0.6
- .NET SDK version: 8.0.418