feat(snownet): close idle connections after 5min #5576
Note that until we deploy this to the gateway, there will be a lot of warning logs, because the gateway doesn't have the idle timeout yet and thus will keep testing the connectivity via STUN until it times out. Once it is deployed, both sides should close the connection at pretty much the same time (modulo 1 RTT or so) and thus not trigger this many warnings. Additionally, I am demoting this log in #5584.
Within `snownet`, `connlib`'s connectivity library, we use ICE to set up a UDP "connection" between a client and a gateway. UDP is an unreliable transport, meaning the only way we can detect that the connection is broken is for both parties to constantly send messages and acknowledgements back and forth. ICE uses STUN binding requests for this.

In the default configuration of `str0m`, a STUN binding request is sent every 3s, and we tolerate at most 9 missing responses before we consider the connection broken. As these responses go missing, `str0m` halves this interval, which results in a total ICE timeout of around 17 seconds. We already tweak these values by reducing the number of requests to 8 and setting the interval to 1.5s. This results in a total ICE timeout of ~10s, which effectively means that there is at most a 10s lag between the connection breaking and us considering it broken, at which point new packets arriving at the TUN interface can trigger the setup of a new connection with the gateway.

Lowering these timeouts improves the user experience in case of a broken connection because the user doesn't have to wait as long before they can access their resources again. The downside of lowering these timeouts is that we generate a lot of background noise. This is especially bad on mobile devices because it prevents the CPU from going to sleep, so simply being signed into Firezone will drain your battery even if you don't use it.

Note that this doesn't apply at all if the client application on top detects a network change. In that case, we hard-reset all connections and instantly create new ones.

We attempted to fix this in #5576 by closing idle connections after 5 minutes. This, however, created new problems such as #6778. The original problem here is that we send too many STUN messages as soon as a connection is established.
Simply increasing the timeout is not an option because it would make the user experience really bad in case the connection actually drops for reasons that the client app can't detect.

In this patch, we attempt to solve this in a different way: detecting a broken connection is only critical if the user is actively using the tunnel (i.e. sending traffic). If there is no traffic, it doesn't matter if we need longer to detect a broken connection. The user won't notice because their phone is probably in their pocket or something.

With this patch, we now implement the following behaviour:

- A connection is considered idle after 10s of no application traffic.
- On idle connections, we send a STUN request every 60s.
- On idle connections, we wait for at most 4 missing responses before considering the connection broken.
- Every connection will perform a client-initiated WireGuard keep-alive every 25s, unless there is application traffic.

These values have been chosen while considering the following sources:

1. [RFC4787, REQ-5](https://www.rfc-editor.org/rfc/rfc4787.html#section-12) requires NATs to keep UDP NAT mappings alive for at least 2 minutes.
2. [`conntrack`](https://www.kernel.org/doc/Documentation/networking/nf_conntrack-sysctl.rst) adopts this requirement via the `nf_conntrack_udp_timeout_stream` configuration.
3. 25s is the default keep-alive of the WireGuard kernel module.

In theory, the WireGuard keep-alive itself should be good enough to keep all NAT bindings alive. In practice, missed keep-alives are not exposed by boringtun (the WireGuard implementation we rely on) and thus we need the additional STUN keep-alives to detect broken connections. We set those somewhat conservatively to 60s.
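
The idle behaviour described above can be sketched roughly as follows. This is a minimal illustration and not the actual `snownet` code: `IdleTracker`, `StunPolicy`, and the constant names are hypothetical, and only the numeric values (10s, 60s, 4, 1.5s, 8) are taken from the description.

```rust
use std::time::{Duration, Instant};

// Values from the description above; every name here is hypothetical.
const IDLE_AFTER: Duration = Duration::from_secs(10);
const ACTIVE_STUN_INTERVAL: Duration = Duration::from_millis(1500);
const ACTIVE_MAX_MISSED: u32 = 8;
const IDLE_STUN_INTERVAL: Duration = Duration::from_secs(60);
const IDLE_MAX_MISSED: u32 = 4;

/// How often to send STUN requests and how many missing responses to tolerate.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct StunPolicy {
    pub interval: Duration,
    pub max_missed: u32,
}

/// Tracks the last time application traffic was seen on a connection.
pub struct IdleTracker {
    last_app_traffic: Instant,
}

impl IdleTracker {
    pub fn new(now: Instant) -> Self {
        Self { last_app_traffic: now }
    }

    /// Call for every application packet. Keep-alives must not call this.
    pub fn on_app_traffic(&mut self, now: Instant) {
        self.last_app_traffic = now;
    }

    /// A connection is idle after `IDLE_AFTER` without application traffic.
    pub fn is_idle(&self, now: Instant) -> bool {
        now.saturating_duration_since(self.last_app_traffic) >= IDLE_AFTER
    }

    /// Relaxed parameters while idle, the usual ones as soon as traffic resumes.
    pub fn stun_policy(&self, now: Instant) -> StunPolicy {
        if self.is_idle(now) {
            StunPolicy { interval: IDLE_STUN_INTERVAL, max_missed: IDLE_MAX_MISSED }
        } else {
            StunPolicy { interval: ACTIVE_STUN_INTERVAL, max_missed: ACTIVE_MAX_MISSED }
        }
    }
}
```

Passing `now` explicitly instead of calling `Instant::now()` inside the tracker keeps the sketch deterministic and easy to test. Whether the real implementation does the same is an assumption.
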

As soon as the user triggers new application traffic, these values are reverted to their defaults. Even if the connection died just before the user starts using it again, we will know within the usual 10s because we are sending STUN requests more often.

Note that existing gateways still implement the "close idle connections after 5 minutes" behaviour. Customers will need to upgrade to a new gateway version to fully benefit from these new always-on, low-power connections.

Resolves: #6778.

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
We define a connection as idle if we haven't sent or received any packets in the last 5 minutes. From `snownet`'s perspective, keep-alives sent by upper layers (like TCP keep-alives) must be honored and thus outgoing as well as incoming packets are accounted for. If the underlying connection breaks, we will hit an ICE timeout, which is an implementation detail of `snownet`. The packets tracked here are IP packets that the user wants to send / receive via the tunnel. Similarly, WireGuard's keep-alives do not update these timestamps and thus don't mark a connection as non-idle.
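
As a rough sketch of this definition (not the code in this PR), one could track a timestamp per direction and update it only for tunnelled IP packets. `Activity`, `Packet`, and `MAX_IDLE` are hypothetical names; only the 5-minute threshold and the rule that WireGuard keep-alives are ignored come from the text above.

```rust
use std::time::{Duration, Instant};

/// Idle threshold from the description above; the name is hypothetical.
const MAX_IDLE: Duration = Duration::from_secs(5 * 60);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Packet {
    /// An IP packet the user sends or receives via the tunnel.
    /// Upper-layer keep-alives (e.g. TCP keep-alives) fall into this category.
    Tunnelled,
    /// A WireGuard keep-alive, which must not mark the connection as non-idle.
    WireGuardKeepAlive,
}

/// Per-connection timestamps of the last tunnelled packet in each direction.
pub struct Activity {
    last_outgoing: Instant,
    last_incoming: Instant,
}

impl Activity {
    pub fn new(now: Instant) -> Self {
        Self { last_outgoing: now, last_incoming: now }
    }

    pub fn on_outgoing(&mut self, packet: Packet, now: Instant) {
        if packet == Packet::Tunnelled {
            self.last_outgoing = now;
        }
    }

    pub fn on_incoming(&mut self, packet: Packet, now: Instant) {
        if packet == Packet::Tunnelled {
            self.last_incoming = now;
        }
    }

    /// Idle means no tunnelled packet in either direction for `MAX_IDLE`.
    pub fn is_idle(&self, now: Instant) -> bool {
        let last = self.last_outgoing.max(self.last_incoming);
        now.saturating_duration_since(last) >= MAX_IDLE
    }
}
```

Taking the later of the two timestamps means traffic in either direction is enough to keep the connection open, which matches the requirement that both outgoing and incoming packets are accounted for.
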