I don't buy the reasoning for never needing Nagle anymore. Sure, telnet isn't a thing today, but I bet there are still plenty of apps which do the equivalent of:
write(fd, "Host: ")
write(fd, hostname)
write(fd, "\r\n")
write(fd, "Content-type: ")
etc...
This may not be 40x overhead, but it'd still be 5x or so.
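For reference, the coalesced version of that snippet might look something like the sketch below. It's hypothetical (it assumes fd is a connected socket and that hostname and content_type are NUL-terminated strings), but the point is one userspace buffer, one syscall, and one segment:

/* Sketch: build the header into one buffer and issue a single write(),
 * instead of one syscall (and potentially one packet) per field. */
#include <stdio.h>
#include <unistd.h>

static int send_headers(int fd, const char *hostname, const char *content_type)
{
    char buf[1024];
    int n = snprintf(buf, sizeof buf,
                     "Host: %s\r\n"
                     "Content-Type: %s\r\n"
                     "\r\n",
                     hostname, content_type);
    if (n < 0 || (size_t)n >= sizeof buf)
        return -1;                      /* header didn't fit */
    return write(fd, buf, (size_t)n) == n ? 0 : -1;
}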
Fix the apps. Nobody expects magical perf if you do that when writing to files, even though the OS also has its own buffers. There is no reason to expect otherwise when writing to a socket, and Nagle doesn't save you from the syscall overhead anyway.
We write to files line-by-line or even character-by-character and expect the library or OS to "magically" buffer it into fast file writes. Same with memory. We expect multiple small mallocs to be smartly coalesced by the platform.
If you expect a POSIX-y OS to buffer write(2) calls, you're sadly misguided. Whether or not that happens depends on the nature of the device or file you're writing to.
OTOH, if you're using fwrite(3), as you likely should be for actual file I/O, then your expectation is entirely reasonable.
Similarly with memory. If you expect brk(2) to handle multiple small allocations "sensibly" you're going to be disappointed. If you use malloc(3) then your expectation is entirely reasonable.
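To make the contrast concrete, here's a rough sketch of my own (assuming an ordinary disk file and default stdio buffering): the first loop is one syscall per line, while fwrite(3) accumulates data in a userspace buffer and only occasionally calls write(2) underneath.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Raw write(2): one syscall per line, no userspace buffering. */
    int fd = open("raw.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    for (int i = 0; i < 1000; i++)
        write(fd, "line\n", 5);
    close(fd);

    /* fwrite(3): stdio buffers in userspace and issues far fewer write(2)s. */
    FILE *fp = fopen("buffered.txt", "w");
    for (int i = 0; i < 1000; i++)
        fwrite("line\n", 1, 5, fp);
    fclose(fp);                         /* flushes the stdio buffer */
    return 0;
}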
Whether buffering is part of POSIX or not is beside the point. Any modern OS you'll find will buffer write calls in one way or another. Similarly with memory: Linux waits until an access page-faults before actually backing your allocation with physical pages. My point is that various forms of buffering are everywhere, and in practice we rely on them a whole lot.
This is simply not true as a general rule. It depends on the nature of the file descriptor. Yes, if the file descriptor refers to the file system, it will in all likelihood be buffered by the OS (not with O_DIRECT, however). But on "any modern OS", file descriptors can refer to things that are not files, and the buffering situation there will vary from case to case.
You're right, Linux does not buffer writes to file descriptors for which buffering has no performance benefit...
Yes, your libraries should fix that. The OS (as in the kernel) should not try to do any abstraction.
Alas, kernels really like to offer abstractions.
True to a degree. But that is a singular platform wholly controlled by the OS.
Once you put packets out into the world you're in a shared space.
I assume every conceivable variation of the argument has been made both for and against Nagle's at this point, but it essentially revolves around a shared networking resource and what policy is in place for fair use.
Nagle's fixes a particular case but interferes overall. If you fix the "particular case" app, the issue goes away.
Nagle doesn't save the derpy side from syscall overhead, but it would save the other side.
It's not just apps doing this stuff, it also lives in system libraries. I'm still mad at the Android HTTPS library for sending chunked uploads as so many tinygrams. I don't remember exactly, but I think it's reasonable packetization for the data chunk (if it picked a reasonable size anyway), then one packet for \r\n, one for the size, and another for another \r\n. There's no reason for that, but it doesn't hurt the client enough that I can convince them to avoid the system library so they can fix it and the server can manage more throughput. Ugh. (It might be that it was just the TLS packetization that was this bogus and the TCP packetization was fine; it's been a while.)
If you take a pcap for some specific issue, there are always so many of these other terrible things in there. </rant>
Those are the apps that are quickly written and don't care if they unnecessarily congest the network. The ones that are properly maintained can set TCP_NODELAY. Seems like a reasonable default to me.
I would love to fix the apps, can you point me to the github repo with all the code written in the last 30 years so I can get started?
Everybody expects magical perf if you do that when writing files. We have RAM buffers and write caches for a reason, even on fast SSDs. We expect it so much that macOS doesn't flush to disk even when you call fsync() (files get flushed to the disk's write buffer instead).
There's some overhead to calling write() in a loop, but it's certainly not as bad as it would be if every call to write() actually made the data traverse whatever output stream you call it on.
I agree that such code should be fixed, but I'm having a hard time persuading developers to fix their code. Many of them don't know what a syscall is, how making a syscall triggers sending an IP packet, how a library call translates to a syscall, etc. Worse, they don't want to know this; they write, say, Java code (or some other high-level language) and argue that the libraries/JDK/kernel should handle all the 'low level' stuff.
To get optimal performance for request-response protocols like HTTP, one should send the full request, including the request line, all headers, and the POST body, with a single write syscall (unless the POST body is large and it makes sense to write it in chunks). Unfortunately not all HTTP libraries work this way, and a library user can't fix this problem without switching libraries, which is 1. not always easy and 2. it's not widely known which libraries are efficient and which are not. Even if you have your own HTTP library it's not always trivial to fix: e.g. in Java, a way to fix this while keeping the code readable and idiomatic is to wrap the socket in a BufferedOutputStream, which adds one more memory-to-memory copy for all the data you send, on top of the at least one memory-to-memory copy you already have without a buffered stream; so it's not an obvious performance win for an application that already saturates memory bandwidth.
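The C analogue of that buffered-stream workaround, just as a sketch (POSIX-only, since it relies on fdopen(3) accepting a socket fd; sock and hostname are assumed to exist):

#include <stdio.h>

static int send_request_buffered(int sock, const char *hostname)
{
    /* Wrap the socket in a stdio stream so small writes are coalesced in
     * userspace; this is the extra memory-to-memory copy mentioned above. */
    FILE *out = fdopen(sock, "w");
    if (out == NULL)
        return -1;
    setvbuf(out, NULL, _IOFBF, 8192);   /* fully buffered, 8 KiB */

    fputs("POST /upload HTTP/1.1\r\n", out);
    fprintf(out, "Host: %s\r\n", hostname);
    fputs("Content-Length: 5\r\n\r\nhello", out);

    /* Everything above leaves in a single write(2). In real code you'd keep
     * `out` around for the life of the connection instead of leaking it. */
    return fflush(out) == 0 ? 0 : -1;
}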
We actually have similar behavior when writing to files: contents are buffered in the page cache and written to disk later in batches, unless the user explicitly calls "sync".
Apps can always misbehave, you never know what people implement, and you don't always have source code to patch. I don't think the role of the OS is to let the apps do whatever they wish, but it should give them the possibility of doing it if it's needed. So I'd rather say: if you know you're doing things properly and you're latency-sensitive, just set TCP_NODELAY on all your sockets and you're fine, and nobody will blame you for doing it.
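For reference, setting it is a one-liner; a minimal sketch, assuming fd is an already-connected TCP socket:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Disable Nagle's algorithm on a TCP socket. Returns 0 on success. */
static int set_nodelay(int fd)
{
    int one = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one);
}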
The comment about telnet had me wondering what openssh does, and it sets TCP_NODELAY on every connection, even for interactive sessions. (Confirmed by both reading the code and observing behaviour in 'strace').
Especially for interactive sessions, it absolutely should! :)
Ironic since Nagle's Algorithm (which TCP_NODELAY disables) was invented for interactive sessions.
It's hard to imagine interactive sessions making more than the tiniest of blips on a modern network.
Isn't video calling an interactive session?
I think that's more like two independent byte streams. You want low latency, but what is transferred doesn't really affect what the other side sends back; you just constantly want to push the next frame.
Thanks, that makes sense!
It's interesting that it's very much an interactive experience for the end-user. But for the logic of the computer, it's not interactive at all.
You can make the contrast even stronger: if both video streams are transmitted over UDP, you don't even need to send ACKs etc., so each stream is truly one-directional from a technical point of view.
Then compare that to transferring a file via TCP. For the user this is as one-directional and non-interactive as it gets, but the computers constantly talk back and forth.
Video calls indeed almost always use UDP. TCP retransmission isn't really useful since by the time a retransmitted packet arrives it's too old to display. Worse, a single lost packet will block a TCP stream. Sometimes TCP is the only way to get through a firewall, but the experience is bad if there's any packet loss at all.
VC systems do constantly send back packet loss statistics and adjust the video quality to avoid saturating a link. Any buffering in routers along the way will add delay, so you want to keep the bitrate low enough to keep buffers empty.
Does this matter? Yes, there's a lot of waste. But you also have a 1Gbps link. Every second that you don't use the full 1Gbps is also waste, right?
This is why I always pad out the end of my html files with a megabyte of whitespace. A half empty pipe is a half wasted pipe.
Just be sure HTTP Compression is off though, or you're still half-wasting the pipe.
Better to just dump randomized uncompressible data into html comments.
I am finally starting to understand some of these OpenOffice/LibreOffice commit messages like https://github.com/LibreOffice/core/commit/a0b6744d3d77
We shouldn't penalize the internet at large because some developers write terrible code.
Isn't that how SMTP works, though?
No?
I imagine the write calls show up pretty easily as a bottleneck in a flamegraph.
They don't. Maybe if you're really good you notice the higher overhead, but you expect to be spending time writing to the network. The actual impact shows up when bandwidth consumption is way up due to packet and TCP headers, which won't show on a flamegraph that easily.
The discussion here mostly seems to miss the point. The argument is to change the default, not to eliminate the behavior altogether.
Shouldn't autocorking help with this even without Nagle?
TCP_CORK handles this better than Nagle, though.
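Roughly like this (Linux-specific sketch, reusing the hypothetical fd/hostname from the snippet upthread): cork, do the small writes, uncork, and the kernel coalesces them into full-sized segments.

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static void send_host_header_corked(int fd, const char *hostname)
{
    int on = 1, off = 0;

    setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof on);
    write(fd, "Host: ", 6);
    write(fd, hostname, strlen(hostname));
    write(fd, "\r\n", 2);               /* still three syscalls... */
    setsockopt(fd, IPPROTO_TCP, TCP_CORK, &off, sizeof off);   /* ...but one segment */
}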
Marc addresses that: “That’s going to make some “write every byte” code slower than it would otherwise be, but those applications should be fixed anyway if we care about efficiency.”
Ah yeah I fixed this exact bug in net-http in Ruby core a decade ago.
Even if you do nothing 'fancy' like Nagle, corking, or building up the complete buffer in userspace before writing, etc., at the very least the above should be using a vectored write (writev()).
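i.e. something along these lines (sketch, again reusing the hypothetical fd/hostname from upthread): one syscall, one segment, and no copy into a contiguous buffer first.

#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

static ssize_t send_host_header_vectored(int fd, const char *hostname)
{
    /* Gather the three pieces into a single writev() call. */
    struct iovec iov[3] = {
        { .iov_base = "Host: ",         .iov_len = 6 },
        { .iov_base = (void *)hostname, .iov_len = strlen(hostname) },
        { .iov_base = "\r\n",           .iov_len = 2 },
    };
    return writev(fd, iov, 3);
}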
And they really shouldn't do this. Even disregarding the network aspect of it, this is still bad for performance because syscalls are kinda expensive.
I don't think that's actually super common anymore when you consider that, with asynchronous I/O, the only sane way to do it is to put the data into a buffer rather than blocking on every small write(2).
Then consider that asynchronous I/O is usually necessary both on the server (otherwise you don't scale well) and on the client (because blocking on network calls is a terrible experience, especially in today's world of frequent network changes, falling out of network range, etc.).
Shouldn’t that go through some buffer? Unless you fflush() between each write?
Those aren't the ones you debug, so they won't be seen by OP. Those are the ones you don't need to debug because Nagle saves you.