Description
Hi java-buildpack team,
As far as I can tell, the Tomcat default timeout for keep-alives is 60 seconds:
- keepAliveTimeout: The number of milliseconds this Connector will wait for another HTTP request before closing the connection. The default value is to use the value that has been set for the connectionTimeout attribute. Use a value of -1 to indicate no (i.e. infinite) timeout.
- connectionTimeout: The number of milliseconds this Connector will wait, after accepting a connection, for the request URI line to be presented. Use a value of -1 to indicate no (i.e. infinite) timeout. The default value is 60000 (i.e. 60 seconds) but note that the standard server.xml that ships with Tomcat sets this to 20000 (i.e. 20 seconds). Unless disableUploadTimeout is set to false, this timeout will also be used when reading the request body (if any).
However, when keep-alive backend connections are enabled in the gorouter, the gorouter will keep idle backend connections open for 90 seconds.
Any time connections are kept open using HTTP keep-alive, it's important that the client always closes the connection before the server does. If the server closes the connection first, there is a risk of a race condition in which the client sends a request on a pre-established connection that arrives just after the server has closed it.
This means that anyone using Tomcat in its default configuration on a Cloud Foundry installation with keep-alive connections enabled will see occasional 502s caused by this race condition (the clue for diagnosing it is that it happens exactly 60 seconds after the last successful request on the connection).
(I'm not sure whether this is better raised as a gorouter issue or a java-buildpack issue, but in any case the default values should work well together, whatever they happen to be.)
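
If the fix ends up on the Tomcat side, the relevant knob is the connector's keepAliveTimeout. As a rough illustration, here is a minimal embedded-Tomcat sketch that raises it above the gorouter's 90-second backend keep-alive; the 120000 ms value and the port are assumptions, not recommended defaults, and in a buildpack deployment the equivalent change would go on the Connector element in the server.xml the buildpack provides.

```java
import org.apache.catalina.LifecycleException;
import org.apache.catalina.connector.Connector;
import org.apache.catalina.startup.Tomcat;

public class KeepAliveTimeoutExample {
    public static void main(String[] args) throws LifecycleException {
        Tomcat tomcat = new Tomcat();
        tomcat.setPort(8080); // illustrative port

        Connector connector = tomcat.getConnector();
        // keepAliveTimeout defaults to connectionTimeout; setting it
        // explicitly keeps idle connections open longer than the router's
        // 90-second backend keep-alive, so the router closes first.
        connector.setProperty("keepAliveTimeout", "120000");

        tomcat.start();
        tomcat.getServer().await();
    }
}
```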
Supporting information on why it's important for the client to close connections before the server does, for any L7 client-server connection using keep-alives:
https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/config-idle-timeout.html
To ensure that the load balancer is responsible for closing the connections to your instance, make sure that the value you set for the HTTP keep-alive time is greater than the idle timeout setting configured for your load balancer.
https://docs.pivotal.io/application-service/2-8/operating/frontend-idle-timeout.html
In general, set the value higher than your load balancer’s back end idle timeout to avoid the race condition where the load balancer sends a request before it discovers that the Gorouter or HAProxy has closed the connection.
https://blog.percy.io/tuning-nginx-behind-google-cloud-platform-http-s-load-balancer-305982ddb340
This causes the load balancer to be the side that closes idle connections, rather than nginx, which fixes the race condition!
Let’s look from the “outside” (the client) to the “inside” (the server in the deployment). As you travel further in, the keep-alive timeouts should be configured to be longer. That is, the outermost layer (in this case, the client) should have the shortest backend keep-alive timeout, and as you go in, the keep-alive timeouts should get progressively longer in relation to the corresponding backend or frontend idle timeout.
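
To make that ordering concrete, here is a trivial sketch with assumed values; only the gorouter's 90-second backend keep-alive comes from this issue, and the other numbers are purely illustrative:

```java
public class TimeoutOrdering {
    // Illustrative values in milliseconds, not product defaults.
    static final long CLIENT_OR_LB_IDLE_MS   = 60_000;  // outermost layer
    static final long GOROUTER_KEEP_ALIVE_MS = 90_000;  // middle layer
    static final long TOMCAT_KEEP_ALIVE_MS   = 120_000; // innermost layer

    public static void main(String[] args) {
        // The outer layer must always give up first, so the chain reads:
        // outermost < middle < innermost.
        boolean ordered = CLIENT_OR_LB_IDLE_MS < GOROUTER_KEEP_ALIVE_MS
                && GOROUTER_KEEP_ALIVE_MS < TOMCAT_KEEP_ALIVE_MS;
        System.out.println("Timeouts ordered outside-in: " + ordered);
    }
}
```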
Consider Golang’s HTTP client, which we believe to be somewhat robust to this type of client-side problem. The low-level HTTP transport package used by the default client implements a background loop that checks whether a connection in its idle pool has been closed by the server before reusing it for a request. This leaves a small window for a race condition in which the server closes the connection between the check and the new request; if the client loses this race, it retries the request on a new connection.
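
For a rough Java-side analogue (not the Go implementation described above), Apache HttpClient 4.x lets the pooling connection manager re-validate connections that have been idle before handing them out; the 1-second window and the example URL below are assumptions, not recommendations:

```java
import java.io.IOException;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class StaleConnectionCheckExample {
    public static void main(String[] args) throws IOException {
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        // Re-check pooled connections that have been idle for more than
        // 1 second before reusing them, shrinking (but not eliminating)
        // the window in which the server can close the socket first.
        cm.setValidateAfterInactivity(1_000);

        try (CloseableHttpClient client = HttpClients.custom()
                .setConnectionManager(cm)
                .build()) {
            try (CloseableHttpResponse response =
                    client.execute(new HttpGet("https://example.org/"))) {
                System.out.println(response.getStatusLine());
            }
        }
    }
}
```

As with Go, this check only narrows the race window; the robust fix is still for the server-side keep-alive timeout to be longer than the idle timeout of whatever sits in front of it.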