Skip to content

Conversation

@thomaseizinger
Copy link
Member

We define a connection as idle if we haven't sent or received any packets in the last 5 minutes. From snownet's perspective, keep-alives sent by upper layers (like TCP keep-alives) must be honored and thus outgoing as well as incoming packets are accounted for.

If the underlying connection breaks, we will hit an ICE timeout which is an implementation detail of snownet. The packets tracked here are IP packets that the user wants to send / receive via the tunnel. Similarly, wireguard's keep-alives do not update these timestamps and thus don't mark a connection as non-idle.

@vercel
Copy link

vercel bot commented Jun 27, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment
Name Status Preview Comments Updated (UTC)
firezone ⬜️ Ignored (Inspect) Visit Preview Jun 27, 2024 8:11am

@github-actions
Copy link

github-actions bot commented Jun 27, 2024

Terraform Cloud Plan Output

Plan: 15 to add, 23 to change, 0 to destroy.

Terraform Cloud Plan

@github-actions
Copy link

github-actions bot commented Jun 27, 2024

Performance Test Results

TCP

Test Name Received/s Sent/s Retransmits
direct-tcp-client2server 232.1 MiB (-3%) 233.5 MiB (-3%) 330 (+7%)
direct-tcp-server2client 237.9 MiB (-2%) 239.0 MiB (-2%) 357 (+495%)
relayed-tcp-client2server 237.7 MiB (+1%) 238.7 MiB (+1%) 236 (-42%)
relayed-tcp-server2client 241.6 MiB (+1%) 243.2 MiB (+1%) 610 (-38%)

UDP

Test Name Total/s Jitter Lost
direct-udp-client2server 500.0 MiB (-0%) 0.03ms (+32%) 46.87% (+13%)
direct-udp-server2client 500.0 MiB (+0%) 0.02ms (-7%) 22.61% (-7%)
relayed-udp-client2server 500.0 MiB (-0%) 0.08ms (+106%) 54.48% (+0%)
relayed-udp-server2client 500.0 MiB (+0%) 0.06ms (+477%) 32.73% (-5%)

@thomaseizinger
Copy link
Member Author

Note that until we deploy this to the gateway, there will be a lot of logs like:

2024-06-27T05:38:12.626969Z  WARN firezone_tunnel::client: Failed to decapsulate incoming packet: Packet is a STUN message but no agent handled it; num_agents = 0 local=192.168.188.71:37973 from=35.243.234.232:52487 num_bytes=88
2024-06-27T05:38:13.101331Z  WARN firezone_tunnel::client: Failed to decapsulate incoming packet: Packet is a STUN message but no agent handled it; num_agents = 0 local=192.168.188.71:37973 from=155.138.222.187:40528 num_bytes=88
2024-06-27T05:38:13.105499Z  WARN firezone_tunnel::client: Failed to decapsulate incoming packet: Packet is a STUN message but no agent handled it; num_agents = 0 local=192.168.188.71:37973 from=35.236.8.36:55009 num_bytes=88
2024-06-27T05:38:13.107265Z  WARN firezone_tunnel::client: Failed to decapsulate incoming packet: Packet is a STUN message but no agent handled it; num_agents = 0 local=192.168.188.71:37973 from=34.94.252.87:63692 num_bytes=88
2024-06-27T05:38:13.111175Z  WARN firezone_tunnel::client: Failed to decapsulate incoming packet: Packet is a STUN message but no agent handled it; num_agents = 0 local=192.168.188.71:37973 from=34.172.251.49:62968 num_bytes=88

Because the gateway doesn't have the idle timeout yet and thus will keep testing the connectivity via STUN until it times out. Once it is deployed, this should happen pretty much at the same time (modulo 1 RTT or so) and thus not trigger this many warnings.

Additionally, I am demoting this log in #5584.

Base automatically changed from refactor/snownet/batch-closed-connections to main June 27, 2024 08:02
@jamilbk jamilbk enabled auto-merge June 27, 2024 08:06
@jamilbk jamilbk added this pull request to the merge queue Jun 27, 2024
Merged via the queue into main with commit b6420ea Jun 27, 2024
@jamilbk jamilbk deleted the feat/snownet/close-idle-connections branch June 27, 2024 08:43
github-merge-queue bot pushed a commit that referenced this pull request Sep 27, 2024
Within `snownet` - `connlib`'s connectivity library - we use ICE to set
up a UDP "connection" between a client and a gateway. UDP is an
unreliable transport, meaning the only way how can detect that the
connection is broken is for both parties to constantly send messages and
acknowledgements back and forth. ICE uses STUN binding requests for
this.

In the default configuration of `str0m`, a STUN binding is sent every
3s, and we tolerate at most 9 missing responses before we consider the
connection broken. As these responses go missing, `str0m` halves this
interval, which results in a total ICE timeout of around 17 seconds. We
already tweak these values by reducing the number of requests to 8 and
setting the interval to 1.5s. This results in a total ICE timeout of
~10s which effectively means that there is at most a 10s lag between the
connection breaking and us considering it broken at which point new
packets arriving at the TUN interface can trigger the setup of a new
connection with the gateway.

Lowering these timeouts improves the user experience in case of a broken
connection because the user doesn't have to wait as long before they can
access their resources again. The downside of lowering these timeouts is
that we generate a lot of background noise. Especially on mobile
devices, this is bad because it prevents the CPU from going to sleep and
thus simply being signed into Firezone will drain your battery, even if
you don't use it.

Note that this doesn't apply at all if the client application on top
detects a network change. In that case, we hard-reset all connections
and instantly create new ones.

We attempted to fix this in #5576 by closing idle connections after 5
minutes. This however created new problems such as #6778.

The original problem here is that we send too many STUN messages as soon
as a connection is established. Simply increasing the timeout is not an
option because it would make the user experience really bad in case the
connection actually drops for reasons that the client app can't detect.

In this patch, we attempt to solve this in a different way: Detecting a
broken connection is only critical if the user is actively using the
tunnel (i.e. sending traffic). If there is no traffic, it doesn't matter
if we need longer to detect a broken connection. The user won't notice
because their phone is probably in their pocket or something.

With this patch, we now implement the following behaviour:

- A connection is considered idle after 10s of no application traffic.
- On idle connections, we send a STUN requests every 60s
- On idle connections, we wait for at most 4 missing responses before
considering the connection broken.
- Every connection will perform a client-initiated WireGuard keep-alive
every 25s, unless there is application traffic.

These values have been chosen while considering the following sources:

1. [RFC4787,
REQ-5](https://www.rfc-editor.org/rfc/rfc4787.html#section-12) requires
NATs to keep UDP NAT mappings alive for at least 2 minutes.
2.
[`conntrack`](https://www.kernel.org/doc/Documentation/networking/nf_conntrack-sysctl.rst)
adopts this requirement via the `nf_conntrack_udp_timeout_stream`
configuration.
3. 25s is the default keep-alive of the WireGuard kernel module.

In theory the WireGuard keep-alive itself should be good enough to keep
all NAT bindings alive. In practice, missed keep-alives are not exposed
by boringtun (the WireGuard implementation we rely on) and thus we need
the additional STUN keep-alives to detect broken connections. We set
those somewhat conservatively to 60s.

As soon as the user triggers new application traffic, these values are
reverted back to their defaults, meaning even if the connection died
just before the user is starting to use it again, we will know within
the usual 10s because we are triggering new STUN requests more often.

Note that existing gateways still implement the "close idle connections
after 5 minutes" behaviour. Customers will need to upgrade to a new
gateway version to fully benefit from these new always-on, low-power
connections.

Resolves: #6778.

---------

Signed-off-by: Thomas Eizinger <thomas@eizinger.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants