Add support for TCP_CORK (experimental) #6366

Merged
headius merged 2 commits into jruby:master from headius:tcp_cork
Sep 18, 2020

Conversation

@headius
Member

@headius headius commented Aug 26, 2020

On Linux, there are issues when benchmarking network requests with very small packet sizes due to the delayed ACK problem. With delayed ACK, the ACK for a received packet is frequently held back until the server has real data to send, so that it can piggyback on that data packet. This avoids sending a separate ACK for every packet and improves performance on a chatty connection. However, in some cases this causes a packet to wait up to 40ms for an ACK, which causes small benchmarks of servers like Puma to run very slowly with very little CPU usage.

It is possible to specify TCP_NODELAY to disable this behavior, but setting it only on the server side did not appear to be enough; at least in our case, it did not solve the issue. I believe it needs to be set on both the client and the server, or at least on the client, which is not possible when using a third-party benchmarking tool.
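
For reference, setting TCP_NODELAY on the server side from Ruby looks like this (a minimal sketch; Socket::TCP_NODELAY is a standard constant in Ruby's socket library, and the port number is arbitrary):

  require "socket"

  server = TCPServer.new(9292)
  client = server.accept
  # Disable Nagle's algorithm on the accepted connection so that small
  # writes go out immediately instead of being coalesced.
  client.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)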

It is also possible to specify TCP_QUICKACK to disable this behavior, but this flag is cleared from the socket during many normal operations and must be re-set periodically. It may also need to be set on both ends, though I have not confirmed this. It helped in a quick test case, but it does not seem to be a feasible fix.

Instead of these options, servers like Puma may use the Linux-specific TCP_CORK mode, which "bottles up" data packets until they are too large to buffer, then sends them together. Puma implements this flag in the following code:

https://github.com/puma/puma/blob/b981e97b81c6dc288081cee5ca6a2a603c047e78/lib/puma/server.rb#L101-L124

Puma calls these methods before and after handling a request to "cork" the data and then release it, which improves the latency of small request/response cycles.

https://github.com/puma/puma/blob/b981e97b81c6dc288081cee5ca6a2a603c047e78/lib/puma/server.rb#L642-L772
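
For illustration, here is a minimal sketch of that cork/uncork pattern (an approximation of the linked Puma code, not a copy of it; on Linux TCP_CORK has the raw value 3, and Ruby defines Socket::TCP_CORK only where the platform supports it):

  require "socket"

  # Fall back to the raw Linux value when the constant is missing,
  # as it was on JRuby before this patch.
  TCP_CORK = Socket.const_defined?(:TCP_CORK) ? Socket::TCP_CORK : 3

  def cork(sock)
    sock.setsockopt(Socket::IPPROTO_TCP, TCP_CORK, 1)
  end

  def uncork(sock)
    # Setting the option back to 0 releases any buffered, unfilled packet.
    sock.setsockopt(Socket::IPPROTO_TCP, TCP_CORK, 0)
  end

The idea is to call cork before writing the response and uncork after, so the kernel sends the whole response in as few packets as possible.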

A longer explanation of TCP_CORK and the delayed ACK problem can be read here: https://baus.net/on-tcp_cork/

This patch adds experimental support for calling the native setsockopt against a Java socket, so that we can pass through the TCP_CORK option, which is not supported by the JDK. Puma will need an additional patch to stop using RUBY_PLATFORM to detect Linux, since that constant is always "java" for us.
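
In Ruby terms, the Linux check would have to look something like this (a sketch; the actual Puma change is in the patch referenced below):

  require "rbconfig"

  # RUBY_PLATFORM is always "java" on JRuby, so consult RbConfig's
  # host_os value, which reflects the underlying operating system.
  linux = RbConfig::CONFIG["host_os"].include?("linux")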

@headius headius added this to the JRuby 9.3.0.0 milestone Aug 26, 2020
headius added a commit to headius/puma that referenced this pull request Aug 26, 2020
JRuby has used "java" for RUBY_PLATFORM for 15 years, so the best
way to detect the host OS is to use RbConfig's "host_os" value.
This allows JRuby to support the TCP_CORK hack and fix benchmarks
of very small request/response cycles.

See jruby/jruby#6366 for a more detailed discussion of TCP_CORK.
@headius
Member Author

headius commented Aug 26, 2020

See puma/puma#2345 for the Puma patch.

On Linux, there are issues when benchmarking network requests with
very small packet sizes due to the delayed ACK problem. With
delayed ACK, the ACK for a received packet is frequently held back
until the server has real data to send, so that it can piggyback
on that data packet. This avoids sending a separate ACK for every
packet and improves performance on a chatty connection. However,
in some cases this causes a packet to wait up to 40ms for an ACK,
which causes small benchmarks of servers like Puma to run very
slowly with very little CPU usage.

It is possible to specify TCP_NODELAY to disable this behavior,
but this flag will be removed from the socket for many normal
operations and must be set periodically on both the client side
and the server side, which makes it a nonstarter when the client
is out of our control.

Instead, servers like Puma may use the Linux-specific TCP_CORK
mode, which "bottles up" data packets until they are too large to
buffer, then sends them together. Puma implements this flag in
the following code:

https://github.com/puma/puma/blob/b981e97b81c6dc288081cee5ca6a2a603c047e78/lib/puma/server.rb#L101-L124

Puma calls these methods before and after handling a request to
"cork" the data and then release it, which improves the latency of
small request/response cycles.

https://github.com/puma/puma/blob/b981e97b81c6dc288081cee5ca6a2a603c047e78/lib/puma/server.rb#L642-L772

A longer explanation of TCP_CORK and the delayed ACK problem can
be read here: https://baus.net/on-tcp_cork/

This patch adds experimental support for calling the native
setsockopt against a Java socket, so that we can pass through the
TCP_CORK option, which is not supported by the JDK. Puma will need
an additional patch to stop using RUBY_PLATFORM to detect Linux,
since that constant is always "java" for us.
@jrudolph

Someone told me about this ticket unrelated to JRuby. I wonder if there's a reference for this part:

this flag will be removed from the socket for many normal operations and must be set periodically on both the client side and the server side

I thought it was somewhat well known that setting TCP_NODELAY fixes delayed ACKs on the JVM (that's what we've been using in Akka/spray for a long time), and it's new to me that it wouldn't work persistently. Or maybe this isn't about disabling Nagle's algorithm altogether, but about more fine-grained control over when to apply it?

@headius
Member Author

headius commented Sep 1, 2020

@jrudolph When I researched this issue it was over a year ago, so my memory might be failing me...

The statement I made about it needing to be set periodically was incorrect, I believe. The flag I was thinking of was TCP_QUICKACK, which temporarily sets the socket to do quick ACKs. It must be re-set periodically, since processing within the TCP stack will restore delayed ACK behavior.

See https://tools.ietf.org/html/draft-stenberg-httpbis-tcp-03, which states:

Unlike disabling Nagle's Algorithm, disabling Delayed ACKs on Linux
is not a one-time operation: processing within the TCP stack can
cause Delayed ACKs to be re-enabled. As a result, to use
"TCP_QUICKACK" effectively requires setting and unsetting the socket
option during the life of the connection.
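
In Ruby terms, that would mean re-arming the option around every read, something like this (a sketch; TCP_QUICKACK is Linux-specific with raw value 12, and read_with_quickack is an illustrative name, not an existing API):

  require "socket"

  TCP_QUICKACK = Socket.const_defined?(:TCP_QUICKACK) ? Socket::TCP_QUICKACK : 12

  # Re-set TCP_QUICKACK before each read, since the TCP stack quietly
  # restores delayed ACK behavior during normal processing.
  def read_with_quickack(sock, len = 16_384)
    sock.setsockopt(Socket::IPPROTO_TCP, TCP_QUICKACK, 1)
    sock.readpartial(len)
  end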

However, simply setting TCP_NODELAY on the JRuby server side did not seem to help. I believe it needs to be set on both the client and the server (or at least on the client) to fully avoid the delayed ACK behavior. In my case, I was using third-party benchmark clients (e.g. ab, wrk), so I did not have the option to disable Nagle's algorithm on the client side.

I could be wrong, but I'm pretty sure I was not able to fix this benchmarking issue solely by setting TCP_NODELAY on the server side. To be honest, I hope I'm wrong, because there's no TCP_CORK support on the JDK!

@headius
Member Author

headius commented Sep 1, 2020

@jrudolph I have updated the description to better reflect my understanding of TCP_NODELAY and TCP_QUICKACK.

@headius
Member Author

headius commented Sep 18, 2020

I have managed to confirm this is working on a Linux VM.

Running the roda benchmark from https://github.com/CaDs/ruby_benchmarks

Master (best of ten):

Running 10s test @ http://localhost:9292
  1 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     6.91ms   12.60ms  48.48ms   82.89%
    Req/Sec     1.41k   335.80     2.68k    74.00%
  14088 requests in 10.03s, 1.32MB read
Requests/sec:   1405.20
Transfer/sec:    134.51KB

TCP_CORK branch:

Running 10s test @ http://localhost:9292
  1 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   352.12us  404.98us  16.45ms   97.34%
    Req/Sec    12.30k   568.19    13.26k    85.00%
  122488 requests in 10.00s, 11.45MB read
Requests/sec:  12245.00
Transfer/sec:      1.14MB

I did not see as much idleness without TCP_CORK as I used to, but those earlier tests were run on a non-VM machine with more cores.

This still requires the Puma patch I provided as puma/puma#2345 to enable TCP_CORK support on JRuby, plus additional patches to JRuby to bind e.g. the TCP_INFO constant.

@headius headius merged commit d04c818 into jruby:master Sep 18, 2020
@headius headius deleted the tcp_cork branch September 18, 2020 18:20
@jrudolph

Thanks for the explanation (and sorry for dropping this thread for quite a while).

The statement I made about it needing to be set periodically was incorrect, I believe. The flag I was thinking of was TCP_QUICKACK, which temporarily sets the socket to do quick ACKs. It must be re-set periodically, since processing within the TCP stack will restore delayed ACK behavior.

Thanks for clarifying!

However, simply setting TCP_NODELAY on the JRuby server side did not seem to help. I believe it needs to be set on both the client and the server (or at least on the client) to fully avoid the delayed ACK behavior. In my case, I was using third-party benchmark clients (e.g. ab, wrk), so I did not have the option to disable Nagle's algorithm on the client side.

wrk sets TCP_NODELAY: https://github.com/wg/wrk/blob/7594a95186ebdfa7cb35477a8a811f84e2a31b62/src/wrk.c#L251

and I think it's common practice to set TCP_NODELAY for servers (much less sure about TCP_CORK).

I could be wrong, but I'm pretty sure I was not able to fix this benchmarking issue solely by setting TCP_NODELAY on the server side. To be honest, I hope I'm wrong, because there's no TCP_CORK support on the JDK!

It would be interesting to know what's actually going on. I guess looking at tcpdumps would reveal something, but maybe the pattern of how the socket is fed with data is also related.

But anyway, if you have found a working solution, that's great as well.

@headius
Member Author

headius commented Sep 21, 2020

I managed to dig up an old gist I wrote while diagnosing this. I monitored packets using tcpdump and also tried setting TCP_NODELAY, but ACKs were still being delayed.

https://gist.github.com/headius/d8e0468bceebccd7d3c951774845bf07

I also refreshed my research memory a bit today, and although there's still a lot of fuzzy documentation out there, I did see from several references that TCP_NODELAY does not fix or disable delayed ACKs. Even with Nagle's algorithm disabled, a TCP stack (in this case, Linux's) may still delay sending an ACK, which may then stall further communication.

The way I understand it:

  • Nagle's algorithm attempts to combine multiple writes into a larger packet, to avoid sending a separate packet for every small write. It can be disabled with TCP_NODELAY.
  • Delayed ACK delays sending an ACK for each packet in case there are more ACKs to send. It can be disabled with TCP_QUICKACK.
  • Disabling Nagle's sends more packets, potentially one per write, but it does not change the behavior of delayed ACKs (see the sketch after this list).
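
A compact recap in Ruby (a sketch; sock stands for any connected TCPSocket, and the raw Linux values 12 and 3 are used where Ruby may not define the constants):

  # Disable Nagle's algorithm: small writes go out immediately.
  sock.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)
  # TCP_QUICKACK (12): ACK immediately, until the stack re-enables delays.
  sock.setsockopt(Socket::IPPROTO_TCP, 12, 1)
  # TCP_CORK (3): hold all output back until uncorked by setting it to 0.
  sock.setsockopt(Socket::IPPROTO_TCP, 3, 1)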

As a result, even with Nagle's disabled, we still hit the 40ms delayed ACK timer, stalling small HTTP benchmarks.

FWIW, TCP_CORK has the opposite effect of disabling Nagle's and disabling delayed ACK: it buffers EVERYTHING (up to the packet size limit) until you flip the switch and uncork, basically giving you manual control over when any unfilled packet is released.

Some links that I should have included here but never did:

And there are gobs of question posts on Stack Overflow and friends about this problem, several of which also state that disabling Nagle's does not disable delayed ACK.

@headius headius modified the milestones: JRuby 9.3.0.0, JRuby 9.2.18.0 Jun 8, 2021