feat(snownet): close idle connections after 5min #5576

thomaseizinger · 2024-06-27T01:51:06Z

We define a connection as idle if we haven't sent or received any packets in the last 5 minutes. From snownet's perspective, keep-alives sent by upper layers (like TCP keep-alives) must be honored and thus outgoing as well as incoming packets are accounted for.

If the underlying connection breaks, we will hit an ICE timeout which is an implementation detail of snownet. The packets tracked here are IP packets that the user wants to send / receive via the tunnel. Similarly, wireguard's keep-alives do not update these timestamps and thus don't mark a connection as non-idle.

vercel · 2024-06-27T01:51:10Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment

Name	Status	Preview	Comments	Updated (UTC)
firezone	⬜️ Ignored (Inspect)	Visit Preview		Jun 27, 2024 8:11am

github-actions · 2024-06-27T01:53:24Z

Terraform Cloud Plan Output

Plan: 15 to add, 23 to change, 0 to destroy.

Terraform Cloud Plan

github-actions · 2024-06-27T02:29:35Z

Performance Test Results

TCP

Test Name	Received/s	Sent/s	Retransmits
direct-tcp-client2server	232.1 MiB (-3%)	233.5 MiB (-3%)	330 (+7%)
direct-tcp-server2client	237.9 MiB (-2%)	239.0 MiB (-2%)	357 (+495%)
relayed-tcp-client2server	237.7 MiB (+1%)	238.7 MiB (+1%)	236 (-42%)
relayed-tcp-server2client	241.6 MiB (+1%)	243.2 MiB (+1%)	610 (-38%)

UDP

Test Name	Total/s	Jitter	Lost
direct-udp-client2server	500.0 MiB (-0%)	0.03ms (+32%)	46.87% (+13%)
direct-udp-server2client	500.0 MiB (+0%)	0.02ms (-7%)	22.61% (-7%)
relayed-udp-client2server	500.0 MiB (-0%)	0.08ms (+106%)	54.48% (+0%)
relayed-udp-server2client	500.0 MiB (+0%)	0.06ms (+477%)	32.73% (-5%)

rust/connlib/snownet/tests/lib.rs

thomaseizinger · 2024-06-27T05:47:54Z

Note that until we deploy this to the gateway, there will be a lot of logs like:

2024-06-27T05:38:12.626969Z  WARN firezone_tunnel::client: Failed to decapsulate incoming packet: Packet is a STUN message but no agent handled it; num_agents = 0 local=192.168.188.71:37973 from=35.243.234.232:52487 num_bytes=88
2024-06-27T05:38:13.101331Z  WARN firezone_tunnel::client: Failed to decapsulate incoming packet: Packet is a STUN message but no agent handled it; num_agents = 0 local=192.168.188.71:37973 from=155.138.222.187:40528 num_bytes=88
2024-06-27T05:38:13.105499Z  WARN firezone_tunnel::client: Failed to decapsulate incoming packet: Packet is a STUN message but no agent handled it; num_agents = 0 local=192.168.188.71:37973 from=35.236.8.36:55009 num_bytes=88
2024-06-27T05:38:13.107265Z  WARN firezone_tunnel::client: Failed to decapsulate incoming packet: Packet is a STUN message but no agent handled it; num_agents = 0 local=192.168.188.71:37973 from=34.94.252.87:63692 num_bytes=88
2024-06-27T05:38:13.111175Z  WARN firezone_tunnel::client: Failed to decapsulate incoming packet: Packet is a STUN message but no agent handled it; num_agents = 0 local=192.168.188.71:37973 from=34.172.251.49:62968 num_bytes=88

Because the gateway doesn't have the idle timeout yet and thus will keep testing the connectivity via STUN until it times out. Once it is deployed, this should happen pretty much at the same time (modulo 1 RTT or so) and thus not trigger this many warnings.

Additionally, I am demoting this log in #5584.

Within `snownet` - `connlib`'s connectivity library - we use ICE to set up a UDP "connection" between a client and a gateway. UDP is an unreliable transport, meaning the only way how can detect that the connection is broken is for both parties to constantly send messages and acknowledgements back and forth. ICE uses STUN binding requests for this. In the default configuration of `str0m`, a STUN binding is sent every 3s, and we tolerate at most 9 missing responses before we consider the connection broken. As these responses go missing, `str0m` halves this interval, which results in a total ICE timeout of around 17 seconds. We already tweak these values by reducing the number of requests to 8 and setting the interval to 1.5s. This results in a total ICE timeout of ~10s which effectively means that there is at most a 10s lag between the connection breaking and us considering it broken at which point new packets arriving at the TUN interface can trigger the setup of a new connection with the gateway. Lowering these timeouts improves the user experience in case of a broken connection because the user doesn't have to wait as long before they can access their resources again. The downside of lowering these timeouts is that we generate a lot of background noise. Especially on mobile devices, this is bad because it prevents the CPU from going to sleep and thus simply being signed into Firezone will drain your battery, even if you don't use it. Note that this doesn't apply at all if the client application on top detects a network change. In that case, we hard-reset all connections and instantly create new ones. We attempted to fix this in #5576 by closing idle connections after 5 minutes. This however created new problems such as #6778. The original problem here is that we send too many STUN messages as soon as a connection is established. Simply increasing the timeout is not an option because it would make the user experience really bad in case the connection actually drops for reasons that the client app can't detect. In this patch, we attempt to solve this in a different way: Detecting a broken connection is only critical if the user is actively using the tunnel (i.e. sending traffic). If there is no traffic, it doesn't matter if we need longer to detect a broken connection. The user won't notice because their phone is probably in their pocket or something. With this patch, we now implement the following behaviour: - A connection is considered idle after 10s of no application traffic. - On idle connections, we send a STUN requests every 60s - On idle connections, we wait for at most 4 missing responses before considering the connection broken. - Every connection will perform a client-initiated WireGuard keep-alive every 25s, unless there is application traffic. These values have been chosen while considering the following sources: 1. [RFC4787, REQ-5](https://www.rfc-editor.org/rfc/rfc4787.html#section-12) requires NATs to keep UDP NAT mappings alive for at least 2 minutes. 2. [`conntrack`](https://www.kernel.org/doc/Documentation/networking/nf_conntrack-sysctl.rst) adopts this requirement via the `nf_conntrack_udp_timeout_stream` configuration. 3. 25s is the default keep-alive of the WireGuard kernel module. In theory the WireGuard keep-alive itself should be good enough to keep all NAT bindings alive. In practice, missed keep-alives are not exposed by boringtun (the WireGuard implementation we rely on) and thus we need the additional STUN keep-alives to detect broken connections. We set those somewhat conservatively to 60s. As soon as the user triggers new application traffic, these values are reverted back to their defaults, meaning even if the connection died just before the user is starting to use it again, we will know within the usual 10s because we are triggering new STUN requests more often. Note that existing gateways still implement the "close idle connections after 5 minutes" behaviour. Customers will need to upgrade to a new gateway version to fully benefit from these new always-on, low-power connections. Resolves: #6778. --------- Signed-off-by: Thomas Eizinger <thomas@eizinger.io>

thomaseizinger added 3 commits June 27, 2024 11:11

Don't return list from snownet

02457d2

Add test for closing idle connection

7e0617b

Clear idle connections after 5min

e357c55

thomaseizinger requested review from conectado and jamilbk June 27, 2024 01:51

thomaseizinger added this to the clients-1.1.0 milestone Jun 27, 2024

thomaseizinger mentioned this pull request Jun 27, 2024

Tracking issue for battery & bandwidth optimisations #5503

Closed

jamilbk reviewed Jun 27, 2024

View reviewed changes

rust/connlib/snownet/tests/lib.rs Show resolved Hide resolved

Also close connection on the gateway side

4f47c37

jamilbk approved these changes Jun 27, 2024

View reviewed changes

thomaseizinger modified the milestones: clients-1.1.0, gateway-1.1.1 Jun 27, 2024

conectado approved these changes Jun 27, 2024

View reviewed changes

Base automatically changed from refactor/snownet/batch-closed-connections to main June 27, 2024 08:02

Merge branch 'main' into feat/snownet/close-idle-connections

98db0c8

jamilbk enabled auto-merge June 27, 2024 08:06

Fix error from merge conflict

487ef04

jamilbk added this pull request to the merge queue Jun 27, 2024

Merged via the queue into main with commit b6420ea Jun 27, 2024

jamilbk deleted the feat/snownet/close-idle-connections branch June 27, 2024 08:43

thomaseizinger mentioned this pull request Sep 27, 2024

feat(connlib): always-on, low-power connections #6845

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(snownet): close idle connections after 5min #5576

feat(snownet): close idle connections after 5min #5576

Uh oh!

thomaseizinger commented Jun 27, 2024

Uh oh!

vercel bot commented Jun 27, 2024 •

edited

Loading

Uh oh!

github-actions bot commented Jun 27, 2024 •

edited

Loading

Uh oh!

github-actions bot commented Jun 27, 2024 •

edited

Loading

Uh oh!

Uh oh!

thomaseizinger commented Jun 27, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

feat(snownet): close idle connections after 5min #5576

feat(snownet): close idle connections after 5min #5576

Uh oh!

Conversation

thomaseizinger commented Jun 27, 2024

Uh oh!

vercel bot commented Jun 27, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Jun 27, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Terraform Cloud Plan Output

Uh oh!

github-actions bot commented Jun 27, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Performance Test Results

TCP

UDP

Uh oh!

Uh oh!

thomaseizinger commented Jun 27, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

vercel bot commented Jun 27, 2024 •

edited

Loading

github-actions bot commented Jun 27, 2024 •

edited

Loading

github-actions bot commented Jun 27, 2024 •

edited

Loading