Add support for TCP_CORK (experimental) #6366

Merged
headius merged 2 commits into jruby:master from headius:tcp_cork
Sep 18, 2020

Conversation

@headius
Member

@headius headius commented Aug 26, 2020

On Linux, there are issues when benchmarking network requests with very small packet sizes due to the delayed ACK problem. With delayed ACK, the ACK for a received packet is frequently held back until the server has real data to send, so that it can piggyback on that data packet. This avoids sending a separate ACK for every packet and improves performance on a chatty connection. However, in some cases this causes a packet to wait up to 40ms for an ACK, which causes small benchmarks of servers like Puma to run very slowly with very little CPU usage.

It is possible to specify TCP_NODELAY to disable this behavior, but setting it only on the server side did not appear to be enough; at least in our case, it did not solve the issue. I believe it needs to be set on both the client and the server, or at least on the client, which is not possible when using a third-party benchmarking tool.
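
For reference, setting TCP_NODELAY on the server side from Ruby looks like this (a minimal sketch; Socket::TCP_NODELAY is a standard constant in Ruby's socket library, and the port number is arbitrary):

  require "socket"

  server = TCPServer.new(9292)
  client = server.accept
  # Disable Nagle's algorithm on the accepted connection so that small
  # writes go out immediately instead of being coalesced.
  client.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)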

It is also possible to specify TCP_QUICKACK to disable this behavior, but this flag is cleared from the socket during many normal operations and must be re-set periodically. It may also need to be set on both ends, though I have not confirmed this. It helped in a quick test case, but it does not seem to be a feasible fix.

Instead of these options, servers like Puma may use the Linux-specific TCP_CORK mode, which "bottles up" data packets until they are too large to buffer, then sends them together. Puma implements this flag in the following code:

https://github.com/puma/puma/blob/b981e97b81c6dc288081cee5ca6a2a603c047e78/lib/puma/server.rb#L101-L124

Puma calls these methods before and after handling a request to "cork" the data and then release it, which improves the latency of small request/response cycles.

https://github.com/puma/puma/blob/b981e97b81c6dc288081cee5ca6a2a603c047e78/lib/puma/server.rb#L642-L772
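
For illustration, here is a minimal sketch of that cork/uncork pattern (an approximation of the linked Puma code, not a copy of it; on Linux TCP_CORK has the raw value 3, and Ruby defines Socket::TCP_CORK only where the platform supports it):

  require "socket"

  # Fall back to the raw Linux value when the constant is missing,
  # as it was on JRuby before this patch.
  TCP_CORK = Socket.const_defined?(:TCP_CORK) ? Socket::TCP_CORK : 3

  def cork(sock)
    sock.setsockopt(Socket::IPPROTO_TCP, TCP_CORK, 1)
  end

  def uncork(sock)
    # Setting the option back to 0 releases any buffered, unfilled packet.
    sock.setsockopt(Socket::IPPROTO_TCP, TCP_CORK, 0)
  end

The idea is to call cork before writing the response and uncork after, so the kernel sends the whole response in as few packets as possible.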

A longer explanation of TCP_CORK and the delayed ACK problem can be read here: https://baus.net/on-tcp_cork/

This patch adds experimental support for calling the native setsockopt against a Java socket, so that we can pass through the TCP_CORK option, which is not supported by the JDK. Puma will need an additional patch to stop using RUBY_PLATFORM to detect Linux, since that constant is always "java" for us.
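
In Ruby terms, the Linux check would have to look something like this (a sketch; the actual Puma change is in the patch referenced below):

  require "rbconfig"

  # RUBY_PLATFORM is always "java" on JRuby, so consult RbConfig's
  # host_os value, which reflects the underlying operating system.
  linux = RbConfig::CONFIG["host_os"].include?("linux")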

@headius headius added this to the JRuby 9.3.0.0 milestone Aug 26, 2020
headius added a commit to headius/puma that referenced this pull request Aug 26, 2020
JRuby has used "java" for RUBY_PLATFORM for 15 years, so the best
way to detect the host OS is to use RbConfig's "host_os" value.
This allows JRuby to support the TCP_CORK hack and fix benchmarks
of very small request/response cycles.

See jruby/jruby#6366 for a more detailed discussion of TCP_CORK.
@headius
Member Author

headius commented Aug 26, 2020

See puma/puma#2345 for the Puma patch.

On Linux, there are issues when benchmarking network requests with
very small packet sizes due to the delayed ACK problem. With
delayed ACK, the ACK for a received packet is frequently held back
until the server has real data to send, so that it can piggyback
on that data packet. This avoids sending a separate ACK for every
packet and improves performance on a chatty connection. However,
in some cases this causes a packet to wait up to 40ms for an ACK,
which causes small benchmarks of servers like Puma to run very
slowly with very little CPU usage.

It is possible to specify TCP_NODELAY to disable this behavior,
but this flag will be removed from the socket for many normal
operations and must be set periodically on both the client side
and the server side, which makes it a nonstarter when the client
is out of our control.

Instead, servers like Puma may use the Linux-specific TCP_CORK
mode, which "bottles up" data packets until they are too large to
buffer, then sends them together. Puma implements this flag in
the following code:

https://github.com/puma/puma/blob/b981e97b81c6dc288081cee5ca6a2a603c047e78/lib/puma/server.rb#L101-L124

Puma calls these methods before and after handling a request to
"cork" the data and then release it, which improves the latency of
small request/response cycles.

https://github.com/puma/puma/blob/b981e97b81c6dc288081cee5ca6a2a603c047e78/lib/puma/server.rb#L642-L772

A longer explanation of TCP_CORK and the delayed ACK problem can
be read here: https://baus.net/on-tcp_cork/

This patch adds experimental support for calling the native
setsockopt against a Java socket, so that we can pass through the
TCP_CORK option, which is not supported by the JDK. Puma will need
an additional patch to stop using RUBY_PLATFORM to detect Linux,
since that constant is always "java" for us.
@jrudolph

Someone told me about this ticket unrelated to JRuby. I wonder if there's a reference for this part:

this flag will be removed from the socket for many normal operations and must be set periodically on both the client side and the server side

I thought it was somewhat well known that setting TCP_NODELAY fixes delayed ACKs on the JVM (that's what we've been using in Akka/spray for a long time), and it's new to me that it wouldn't work persistently. Or maybe this isn't about disabling Nagle's algorithm altogether, but about more fine-grained control over when to apply it?

@headius
Member Author

headius commented Sep 1, 2020

@jrudolph When I researched this issue it was over a year ago, so my memory might be failing me...

The statement I made about it needing to be set periodically was incorrect, I believe. The flag I was thinking of was TCP_QUICKACK, which temporarily sets the socket to do quick ACKs. It must be re-set periodically, since processing within the TCP stack will restore delayed ACK behavior.

See https://tools.ietf.org/html/draft-stenberg-httpbis-tcp-03, which states:

Unlike disabling Nagle's Algorithm, disabling Delayed ACKs on Linux
is not a one-time operation: processing within the TCP stack can
cause Delayed ACKs to be re-enabled. As a result, to use
"TCP_QUICKACK" effectively requires setting and unsetting the socket
option during the life of the connection.
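
In Ruby terms, that would mean re-arming the option around every read, something like this (a sketch; TCP_QUICKACK is Linux-specific with raw value 12, and read_with_quickack is an illustrative name, not an existing API):

  require "socket"

  TCP_QUICKACK = Socket.const_defined?(:TCP_QUICKACK) ? Socket::TCP_QUICKACK : 12

  # Re-set TCP_QUICKACK before each read, since the TCP stack quietly
  # restores delayed ACK behavior during normal processing.
  def read_with_quickack(sock, len = 16_384)
    sock.setsockopt(Socket::IPPROTO_TCP, TCP_QUICKACK, 1)
    sock.readpartial(len)
  end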

However, simply setting TCP_NODELAY on the JRuby server side did not seem to help. I believe it needs to be set on both the client and the server (or at least on the client) to fully avoid the delayed ACK behavior. In my case, I was using third-party benchmark clients (e.g. ab, wrk), so I did not have the option to disable Nagle's algorithm on the client side.

I could be wrong, but I'm pretty sure I was not able to fix this benchmarking issue solely by setting TCP_NODELAY on the server side. To be honest, I hope I'm wrong, because there's no TCP_CORK support on the JDK!

@headius
Member Author

headius commented Sep 1, 2020

@jrudolph I have updated the description to better reflect my understanding of TCP_NODELAY and TCP_QUICKACK.

@headius
Member Author

headius commented Sep 18, 2020

I have managed to confirm this is working on a Linux VM.

Running the roda benchmark from https://github.com/CaDs/ruby_benchmarks

Master (best of ten):

Running 10s test @ http://localhost:9292
  1 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     6.91ms   12.60ms  48.48ms   82.89%
    Req/Sec     1.41k   335.80     2.68k    74.00%
  14088 requests in 10.03s, 1.32MB read
Requests/sec:   1405.20
Transfer/sec:    134.51KB

TCP_CORK branch:

Running 10s test @ http://localhost:9292
  1 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   352.12us  404.98us  16.45ms   97.34%
    Req/Sec    12.30k   568.19    13.26k    85.00%
  122488 requests in 10.00s, 11.45MB read
Requests/sec:  12245.00
Transfer/sec:      1.14MB

I did not see as much idleness without TCP_CORK as I used to, but those earlier tests were run on a non-VM machine with more cores.

This still requires the Puma patch I provided as puma/puma#2345 to enable TCP_CORK support on JRuby, plus additional patches to JRuby to bind e.g. the TCP_INFO constant.

@headius headius merged commit d04c818 into jruby:master Sep 18, 2020
@headius headius deleted the tcp_cork branch September 18, 2020 18:20
@jrudolph

Thanks for the explanation (and sorry for dropping this thread for quite a while).

The statement I made about it needing to be set periodically was incorrect, I believe. The flag I was thinking of was TCP_QUICKACK, which temporarily sets the socket to do quick ACKs. It must be re-set periodically, since processing within the TCP stack will restore delayed ACK behavior.

Thanks for clarifying!

However, simply setting TCP_NODELAY on the JRuby server side did not seem to help. I believe it needs to be set on both the client and the server (or at least on the client) to fully avoid the delayed ACK behavior. In my case, I was using third-party benchmark clients (e.g. ab, wrk), so I did not have the option to disable Nagle's algorithm on the client side.

wrk sets TCP_NODELAY: https://github.com/wg/wrk/blob/7594a95186ebdfa7cb35477a8a811f84e2a31b62/src/wrk.c#L251

and I think it's common practice to set TCP_NODELAY for servers (much less sure about TCP_CORK).

I could be wrong, but I'm pretty sure I was not able to fix this benchmarking issue solely by setting TCP_NODELAY on the server side. To be honest, I hope I'm wrong, because there's no TCP_CORK support on the JDK!

It would be interesting to know what's actually going on. I guess looking at tcpdumps would reveal something, but maybe the pattern of how the socket is fed with data is also related.

But anyway, if you have found a working solution, that's great as well.

@headius
Member Author

headius commented Sep 21, 2020

I managed to dig up an old gist I wrote while diagnosing this. I monitored packets using tcpdump and also tried setting TCP_NODELAY, but ACKs were still being delayed.

https://gist.github.com/headius/d8e0468bceebccd7d3c951774845bf07

I also refreshed my research memory a bit today, and although there's still a lot of fuzzy documentation out there, I did see from several references that TCP_NODELAY does not fix or disable delayed ACKs. Even with Nagle's algorithm disabled, a TCP stack (in this case, Linux's) may still delay sending an ACK, which may then stall further communication.

The way I understand it:

  • Nagle's algorithm attempts to combine multiple writes into a larger packet, to avoid sending a separate packet for every small write. It can be disabled with TCP_NODELAY.
  • Delayed ACK delays sending an ACK for each packet in case there are more ACKs to send. It can be disabled with TCP_QUICKACK.
  • Disabling Nagle's sends more packets, potentially one per write, but it does not change the behavior of delayed ACKs (see the sketch after this list).
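
A compact recap in Ruby (a sketch; sock stands for any connected TCPSocket, and the raw Linux values 12 and 3 are used where Ruby may not define the constants):

  # Disable Nagle's algorithm: small writes go out immediately.
  sock.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)
  # TCP_QUICKACK (12): ACK immediately, until the stack re-enables delays.
  sock.setsockopt(Socket::IPPROTO_TCP, 12, 1)
  # TCP_CORK (3): hold all output back until uncorked by setting it to 0.
  sock.setsockopt(Socket::IPPROTO_TCP, 3, 1)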

As a result, even with Nagle's disabled, we still hit the 40ms delayed ACK timer, stalling small HTTP benchmarks.

FWIW, TCP_CORK has the opposite effect of disabling Nagle's and disabling delayed ACK: it buffers EVERYTHING (up to the packet size limit) until you flip the switch and uncork, basically giving you manual control over when any unfilled packet is released.

Some links that I should have included here but never did:

And there are gobs of question posts on Stack Overflow and friends about this problem, several of which also state that disabling Nagle's does not disable delayed ACK.

@headius headius modified the milestones: JRuby 9.3.0.0, JRuby 9.2.18.0 Jun 8, 2021