Add support for TCP_CORK (experimental) #6366
JRuby has used "java" for RUBY_PLATFORM for 15 years, so the best way to detect the host OS is to use RbConfig's "host_os" value. This allows JRuby to support the TCP_CORK hack and fix slow benchmarks of very small request/response cycles. See jruby/jruby#6366 for a more detailed discussion of TCP_CORK.
See puma/puma#2345 for the Puma patch.
On Linux, there are issues when benchmarking network requests with very small packet sizes due to the delayed ACK problem. In this mode, the ACKs sent to acknowledge packets are frequently delayed until there's real data to be sent by the server, and then piggyback on those data packets. This avoids sending a separate ACK for every packet and improves performance on a chatty connection. However, in some cases this causes packets to wait up to 40ms for an ACK, which makes small benchmarks of servers like Puma run very slowly with very little CPU usage.

It is possible to specify TCP_NODELAY to disable this behavior, but this flag will be removed from the socket by many normal operations and must be set periodically on both the client side and the server side, which makes it a nonstarter when the client is out of our control. Instead, servers like Puma may use the Linux-specific TCP_CORK mode, which "bottles" data packets until they are too large to buffer, then sends them together. Puma implements this flag in the following code:

https://github.com/puma/puma/blob/b981e97b81c6dc288081cee5ca6a2a603c047e78/lib/puma/server.rb#L101-L124

Puma calls these methods before and after handling a request, to "cork" the data and then release it, which improves latency of small request/response cycles.

https://github.com/puma/puma/blob/b981e97b81c6dc288081cee5ca6a2a603c047e78/lib/puma/server.rb#L642-L772

A longer explanation of TCP_CORK and the delayed ACK problem can be read here: https://baus.net/on-tcp_cork/

This patch adds experimental support for using the native setsockopt against a Java socket, so that we can pass through the TCP_CORK option not supported by the JDK. Puma will need an additional patch to not use RUBY_PLATFORM to detect Linux, since that constant is always "java" for us.
Someone pointed me to this ticket for reasons unrelated to JRuby. I wonder if there's a reference for this part:

I thought it was somewhat well known that setting TCP_NODELAY fixes delayed ACKs on the JVM (so that's what we've been using in Akka/spray for a long time), and it's new to me that it wouldn't work persistently. Or maybe this isn't about disabling Nagle's algorithm altogether, but about more fine-grained control over when to apply it?
@jrudolph When I researched this issue it was over a year ago, so my memory might be failing me... My statement about the flag needing to be set periodically was incorrect, I believe. The flag I was thinking of was TCP_QUICKACK; see https://tools.ietf.org/html/draft-stenberg-httpbis-tcp-03, which notes that the option is cleared again by normal socket operations and must be re-set periodically.

I could be wrong, but I'm pretty sure I was not able to fix this benchmarking issue solely by setting TCP_NODELAY.
@jrudolph I have updated the description to better reflect my understanding of TCP_NODELAY and TCP_QUICKACK.
I have managed to confirm this is working on a Linux VM, running the roda benchmark from https://github.com/CaDs/ruby_benchmarks (best of ten runs, comparing master against the TCP_CORK branch). I did not see as much idleness without TCP_CORK as I used to, but those earlier tests were on a non-VM with more cores. This still requires the patch to enable TCP_CORK support on JRuby that I provided as puma/puma#2345, plus additional patches to JRuby to bind e.g. the TCP_INFO constant.
Thanks for the explanation (and sorry for dropping this thread for quite a while).
Thanks for clarifying!
wrk sets TCP_NODELAY: https://github.com/wg/wrk/blob/7594a95186ebdfa7cb35477a8a811f84e2a31b62/src/wrk.c#L251 and I think it's common practice for benchmarking clients to set TCP_NODELAY.

Would be interesting to know what's actually going on. I guess looking at tcpdumps would reveal something, but maybe the pattern of how the socket is fed with data is also related. But anyway, if you have found a working solution, that's great as well.
I managed to dig up an old gist I wrote while diagnosing this. I monitored packets using tcpdump, and also tried setting TCP_NODELAY, but there were still ACKs being delayed. https://gist.github.com/headius/d8e0468bceebccd7d3c951774845bf07

I also refreshed my research memory a bit today, and although there's still a lot of fuzzy documentation out there, I did see from several references that TCP_NODELAY does not fix or disable delayed ACKs. Even with Nagle's algorithm disabled, a TCP stack (in this case, Linux's) may still delay sending an ACK, which then may stall further communication. The way I understand it:

- Nagle's algorithm is a sender-side optimization: it holds back small outgoing segments, and TCP_NODELAY disables only that.
- Delayed ACK is a receiver-side optimization: the receiving stack may hold an ACK (up to 40ms on Linux) hoping to piggyback it on outgoing data, and nothing the sender sets changes the peer's ACK timing.

As a result, even with Nagle's disabled, we still hit the 40ms delayed ACK timer, stalling small HTTP benchmarks. FWIW, TCP_CORK works in the opposite direction from disabling Nagle's or disabling delayed ACK: it buffers EVERYTHING (up to the packet size limit) until you flip the switch and uncork, basically giving you manual control over when any unfilled packet is released. Some links that I should have included here but never did:
And there are gobs of question posts on Stack Overflow and friends about this problem, several of which also state that disabling Nagle's does not disable delayed ACK.
On Linux, there are issues when benchmarking network requests with very small packet sizes due to the delayed ACK problem. In this mode, the ACKs sent to acknowledge packets are frequently delayed until there's real data to be sent by the server, and then piggyback on those data packets. This avoids sending a separate ACK for every packet and improves performance on a chatty connection. However, in some cases this causes packets to wait up to 40ms for an ACK, which makes small benchmarks of servers like Puma run very slowly with very little CPU usage.
It is possible to specify TCP_NODELAY to disable this behavior, but setting it only on the server side did not appear to be enough; I believe it needs to be set on both ends, or at least on the client side, which is not possible when the client is a third-party benchmarking tool.
It is also possible to specify TCP_QUICKACK to disable this behavior, but this flag is removed from the socket by many normal operations and must be re-set periodically. It may also need to be set on both ends, but I have not confirmed this. It helped in a quick test case, but does not seem to be a feasible fix.
Instead of these options, servers like Puma may use the Linux-specific TCP_CORK mode, which "bottles" data packets until they are too large to buffer, then sends them together. Puma implements this flag in the following code:
https://github.com/puma/puma/blob/b981e97b81c6dc288081cee5ca6a2a603c047e78/lib/puma/server.rb#L101-L124
Puma calls these methods before and after handling a request, to "cork" the data and then release it, which improves latency of small request/response cycles.
https://github.com/puma/puma/blob/b981e97b81c6dc288081cee5ca6a2a603c047e78/lib/puma/server.rb#L642-L772
A longer explanation of TCP_CORK and the delayed ACK problem can be read here: https://baus.net/on-tcp_cork/
This patch adds experimental support for using the native setsockopt against a Java socket, so that we can pass through the TCP_CORK option not supported by the JDK. Puma will need an additional patch to not use RUBY_PLATFORM to detect Linux, since that constant is always "java" for us.