
Linux Pipes Are Slow

0xbadcafebee
46 replies
19h18m

Calling Linux pipes "slow" is like calling a Toyota Corolla "slow". It's fast enough for all but the most extreme use cases. Are you racing cars? In a sport where speed is more important than technique? Then get a faster car. Otherwise stick to the Corolla.

AkBKukU
15 replies
16h39m

I have a project that uses a proprietary SDK for decoding raw video. I output the decoded data as pure RGBA in a way FFmpeg can read through a pipe to re-encode the video to a standard codec. FFmpeg can't include the non-free SDK in their source, and it would be wildly impractical to store the pure RGBA in a file. So pipes are the only way to do it; there are valid reasons to use high-throughput pipes.
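
For illustration, the producer side of such a pipeline can be sketched like this (the resolution, frame rate, and the decoder call are placeholders, not the actual SDK):

  /* producer.c -- write raw RGBA frames to stdout for FFmpeg to re-encode.
     Hypothetical usage: ./producer | ffmpeg -f rawvideo -pix_fmt rgba \
       -s 1920x1080 -r 30 -i - -c:v libx264 out.mp4 */
  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define W 1920
  #define H 1080

  int main(void) {
      uint8_t *frame = malloc((size_t)W * H * 4);  /* one RGBA frame */
      if (!frame) return 1;
      for (int i = 0; i < 300; i++) {              /* e.g. 10 s at 30 fps */
          /* decode_next_frame(frame); <- stand-in for the proprietary SDK */
          if (fwrite(frame, 4, (size_t)W * H, stdout) != (size_t)W * H)
              return 1;                            /* FFmpeg closed the pipe */
      }
      free(frame);
      return 0;
  }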

CyberDildonics
5 replies
14h50m

So pipes are the only way to do it

Let's not get carried away. You can use ffmpeg as a library and encode buffers in a few dozen lines of C++.
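
A hedged sketch of that flow with FFmpeg's C API (error handling elided; exact fields and calls vary across FFmpeg versions):

  /* Encode in-memory frames with libavcodec and write packets to `out`. */
  #include <libavcodec/avcodec.h>
  #include <stdio.h>

  void encode_frames(FILE *out, int w, int h, int nframes) {
      const AVCodec *codec = avcodec_find_encoder(AV_CODEC_ID_H264);
      AVCodecContext *ctx = avcodec_alloc_context3(codec);
      ctx->width = w;
      ctx->height = h;
      ctx->time_base = (AVRational){1, 30};
      ctx->pix_fmt = AV_PIX_FMT_YUV420P;
      avcodec_open2(ctx, codec, NULL);

      AVFrame *frame = av_frame_alloc();
      frame->format = ctx->pix_fmt;
      frame->width = w;
      frame->height = h;
      av_frame_get_buffer(frame, 0);

      AVPacket *pkt = av_packet_alloc();
      for (int i = 0; i < nframes; i++) {
          /* fill frame->data[] with the pixels to encode */
          frame->pts = i;
          avcodec_send_frame(ctx, frame);
          while (avcodec_receive_packet(ctx, pkt) == 0) {
              fwrite(pkt->data, 1, (size_t)pkt->size, out);
              av_packet_unref(pkt);
          }
      }
      avcodec_send_frame(ctx, NULL);               /* flush the encoder */
      while (avcodec_receive_packet(ctx, pkt) == 0) {
          fwrite(pkt->data, 1, (size_t)pkt->size, out);
          av_packet_unref(pkt);
      }
      av_packet_free(&pkt);
      av_frame_free(&frame);
      avcodec_free_context(&ctx);
  }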

whiterknight
1 replies
4h34m

And you go from having a well defined modular interface that’s flexible at runtime to a binary dependency.

CyberDildonics
0 replies
19m

You have the dependency either way, but if you use the library you can have one big executable with no external dependencies and it can actually be fast.

If there wasn't a problem to solve they wouldn't have said anything. If you want something different you have to do something different.

Almondsetat
1 replies
7h25m

ffmpeg's library is notorious for being a complete and utter mess

CyberDildonics
0 replies
4h0m

It worked extremely well when I did something almost exactly like this. I gave it buffers of pixels in memory and it spit out compressed video.

quietbritishjim
0 replies
7h1m

The parent comment mentioned license incompatibility, which I guess would still apply if they used ffmpeg as a library.

Sesse__
3 replies
9h13m

At some point, I had a similar issue (though not related to licensing), and it turned out it was faster to do a high-bitrate H.264-encode of the stream before sending it over the FFmpeg socket than sending the raw RGBA data, even over localhost… (There was some minimal quality loss, of course, but it was completely irrelevant in the big picture.)

jraph
2 replies
7h45m

There was some minimal quality loss, of course, but it was completely irrelevant in the big picture

But then the solutions are not comparable anymore, are they? Would a lossless codec instead have improved speed?

chupasaurus
0 replies
7h34m

H.264 has a lossless mode.

Sesse__
0 replies
4h52m

No, because I had hardware H.264 encoder support. :-) (The decoding in FFmpeg on the other side was still software. But it was seemingly much cheaper to do a H.264 software decode.)

whartung
2 replies
16h22m

What about domain sockets?

It's clumsier, to be sure, but if performance is your goal, the socket should be faster.

ptx
0 replies
9h40m

Why should sockets be faster?

AkBKukU
0 replies
16h14m

It looks like FFmpeg does support reading from sockets natively[1], I didn't know that. That might be a better solution in this case, I'll have to look into some C code for writing my output to a socket to try that some time.

[1] https://ffmpeg.org/ffmpeg-protocols.html#unix
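
For what it's worth, the writer side can be a small sketch like this (the socket path is a placeholder; FFmpeg would be started first, listening on the socket per the protocol options linked above):

  /* Connect to a Unix domain socket and return an fd to stream frames into. */
  #include <string.h>
  #include <sys/socket.h>
  #include <sys/un.h>
  #include <unistd.h>

  int open_video_socket(const char *path) {
      int fd = socket(AF_UNIX, SOCK_STREAM, 0);
      if (fd < 0) return -1;
      struct sockaddr_un addr = { .sun_family = AF_UNIX };
      strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
      if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
          close(fd);
          return -1;
      }
      return fd;  /* then write() decoded RGBA frames to fd */
  }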

ploxiln
0 replies
11h33m

What percentage of CPU time is used by the pipe in this scenario? If pipes were 10x faster, would you really notice any difference in wall-clock-time or overall-cpu-usage, while this decoding SDK is generating the raw data and ffmpeg is processing it? Are these video processing steps anywhere near memory copy speeds?

jcelerier
0 replies
14h34m

Why not just store the output of the proprietary codec in an AVFrame that you'd pass to libavcodec in your own code?

Someone
13 replies
12h39m

This isn’t code in some project that will run only a few billion times in its lifetime; it is used frequently on millions, if not billions, of computers.

Because of that, it is economical to spend lots of time optimizing it, even if it only makes the code marginally more efficient.

samastur
10 replies
6h47m

That's not how economics works.

If 100 million people each save 1 cent because of your work, you saved 1 million in total, but in practice nobody is observably better off.

whiterknight
5 replies
4h26m

You’re describing the outcome of one individual person. Money is just a tool for allocating resources. Saving 1 million of resources is a good thing.

wang_li
4 replies
3h46m

It's a meaningless thing if it's 1 million resources divided into 1 million actors who have no ability to leverage a short term gain of 1 resource. It's short term because the number of computers that are 100% busy 100% of the time is zero. A pipe throughput improvement means nothing if the computer isn't waiting on pipes a lot.

carlhjerpe
3 replies
3h32m

Eventually everyone ends up at a power plant, there's an insane amount of people living in the European grid. If an optimization ends up saving a couple tonnes of CO2 per year it is hard to not call it a good thing.

https://en.m.wikipedia.org/wiki/Synchronous_grid_of_Continen...

wang_li
2 replies
3h4m

A couple tons spread across 400 million people with a per capita emission of 5 tons per year is in the noise. If we're at the point of trying to hyper optimize there are far more meaningful targets than pipe throughput.

sqeaky
0 replies
2h43m

You are arguing against the concept of "division of labor".

You are a few logical layers removed, but fundamentally that is at the heart of this. It isn't just about what you think can or can't be leveraged. Reducing waste in a centralized fashion is excellent because it will enable other waste to be reduced in a self-reinforcing cycle, as long as experts in their domain keep getting the benefits of other experts. The chip experts make better instructions, so the library experts make better software libs; they add their 2% and now it is more than 4%, so the application experts can have 4% more throughput and buy 4% fewer servers, or spend way more than 4% less on optimizing, or whatever, and add their 2% optimization and now we are at more than 6%, and the end users can do their business slightly better, and so on in a chain that is all of society. Sometimes those gains are muted. Sometimes that speed turns into error checking, power saving, or more throughput, with everyone trying to do their best to do more with less.

carlhjerpe
0 replies
9m

Absolutely, if your focus is saving emissions, don't optimize pipes. But if you optimize an interface people use, it's a good thing either way, right?

h0p3
1 replies
5h49m

There are people whose lives are improved by having an extra cent to spend. Seriously. It is measurable, observable, and real. It might not have a serious impact on the vast majority of people, but there are people who have very, very little money or have found themselves on a tipping point that small; pinching pennies alters their utility outcomes.

InDubioProRubio
0 replies
4h4m

https://xkcd.com/951/

Also, if you micro-optimize and that becomes your whole focus and ability to focus, your business is unable to innovate, aka traverse the economic landscape and find new rich gradients and sources of "economic food", making you a dinosaur in a pit, doomed to eternally cannibalize whatever other creatures descend into the pit, and highly dependent on the pit not closing up for good.

sebstefan
0 replies
6h33m

So? It doesn't need to be visible to be worth optimizing?

azulster
0 replies
3h18m

not if it costs 200 million in man-hours to optimize

hi-v-rocknroll
1 replies
2h38m

Citation needed.

Pipes aren't used everywhere in production in hot paths. That just doesn't happen.

ibern
0 replies
1h32m

A lot of bioinformatics code relies very heavily on pipes.

paulannesley
3 replies
6h40m

Sometimes the best answer really is a faster Corolla!

https://www.toyota.com/grcorolla/

(These machines have amazing engineering and performance, and their entire existence is a hack to work around rules making it unviable to bring the intended GR Yaris to the US market. Maybe just enough eng/perf/hack/market relevance to HN folk to warrant my lighthearted reply. Also, the company president is still on the tools.)

2OEH8eoCRo0
2 replies
3h58m

There's no replacement for displacement.

Sohcahtoa82
1 replies
1h4m

Apparently there is, because that car only has a 1.6L 3-cylinder engine and yet produces a whopping 300 horsepower.

2OEH8eoCRo0
0 replies
52m

When? In the RPM sweet spot after waiting an eternity for the turbos to spool? There's always a catch.

bastawhiz
2 replies
46m

I'm not sure that logic makes sense. Making a thing that's used ubiquitously a few percent faster is absolutely a worthwhile investment of effort. Individual operations might not be very much faster, but it's (in aggregate) a ton of electricity and time globally.

0xbadcafebee
1 replies
36m

That's what's called premature optimization. Everywhere in our lives we do inefficient things. Despite the inefficiency, we gain something else: ease of use or access, simplicity, lower cost, more time, etc. The world and life as we know it is just a series of tradeoffs. Often, optimization before it's necessary actually creates more drawbacks than benefits. When it's easy and has a huge benefit, or is necessary, then definitely optimize. It may be hard to accept this as a general principle, but in practice (mostly in hindsight) it becomes very apparent.

Donald Knuth thinks the same: https://en.wikipedia.org/wiki/Program_optimization#When_to_o...

bastawhiz
0 replies
19m

It's definitionally not premature optimization. Pipes exist (and have existed for decades). This is just "optimization". "Premature" means it's too soon to optimize. When is it no longer too soon? In another few decades? When Linux takes another half of Windows usage?

The tradeoffs you're discussing are considerations. Is it worth making a ubiquitous thing faster at the expense of some complexity? At some point that answer is "yes", but that point is absolutely not "when it's easy and has a huge benefit". The most important optimizations you personally benefit from were not easy, nor did they have a huge benefit. They were hard-won and generally small, but they compound on other optimizations.

I'll also note that the Knuth quote you reference says exactly this:

Yet we should not pass up our opportunities in that critical 3%

ploxiln
1 replies
11h53m

Indeed. In the author's case, the slow pipe is moving data at 17 GB/s, which is over 130 Gbit/s.

I've used pipes for a lot of stuff over 10+ years, and never noticed being limited by the speed of the pipe, I'm almost certain to be limited by tar, gzip, find, grep, nc ... (even though these also tend to be pretty fast for what they do).

crabbone
0 replies
6h38m

I had two cases in my practice where pipes were slow. Both related to developing a filesystem.

1. Logging. At first our tools for reading the logs from a filesystem management program were using pipes, but they would be overwhelmed quickly (even before it would overwhelm pagers and further down the line). We had to write our own pager and give up on using pipes.

2. Storage again, but a different problem: we had a setup where we deployed SPDK to manage the iSCSI frontend duties, and our component to manage the actual storage process. It was very important that the communication between these two components be as fast and as memory-efficient as possible. The slowness of pipes also comes from the fact that they have to copy memory. We had to extend SPDK to make it communicate with our component through shared memory instead.

So, yeah, pipes are unlikely to be the bottleneck of many applications, but definitely not all.

jiehong
1 replies
6h44m

Replace “Linux pipes” by “Electron apps”, and people would not agree.

Also, why leave performance on the table by default? Just because “it should be enough for most people I can think of”?

Add Tesla motors to a Toyota Corolla and now you’ve got a sportier car by default.

azulster
0 replies
3h15m

electron apps are an optimization all by themselves.

it's not optimizing application footprint or speed. it's optimizing the resources and speed of development and deployment

Ultimatt
1 replies
11h58m

A better analogy is that it's like a society that uses steam trains attempting to industrially compete with a society that uses bullet trains (a literally similar factor of improvement). The UK built its last steam train for national use in 1960; four years later the Shinkansen was in use in Japan. Which of those two nations has a strong international industrial base in 2024?

billfruit
0 replies
11h1m

Well, the Mallard's top speed was very close to that of the first-generation Shinkansen 0 series trains.

tacone
0 replies
3m

Wait, it depends on what you're doing. Each command in a pipeline also runs in a subshell, so pipes are a big no-no when used inside a loop.

Suppose you're looping over the lines of stdout and need to use sed, cut, and so on: using pipes will slow things down considerably (and the startup time of sed and cut will make things worse).

Using bash/zsh string interpolation would be much faster.

qsantos
0 replies
5h42m

To be frank, this is more of a pretext to understand what pipes and vmsplice do exactly.

mort96
0 replies
6h25m

I mean why waste CPU time moving data between buffers when you could get the same semantics and programming model without wasting that CPU time?

jheriko
18 replies
21h13m

just never use pipes. they are some weird archaism that needs to die :P

the only time i've used them is due to external constraints. they are just not useful.

duped
7 replies
14h19m

I agree with this but with a much more nuanced take: avoid pipes if either reader or writer expects to do async i/o and you don't own both the reader and writer.

In fact if you ever set O_NONBLOCK on a pipe you need to be damn sure both the reader and writer expect non-blocking i/o because you'll get heisenbugs under heavy i/o when either the reader/writer outpace each other and one expects blocking i/o. When's the last time you checked the error code of `printf` and put it in a retry loop?
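
For reference, such a retry loop might look like this sketch (a hypothetical helper that waits out EAGAIN on a non-blocking descriptor):

  /* Write all of buf to a non-blocking fd, waiting out EAGAIN with poll(). */
  #include <errno.h>
  #include <poll.h>
  #include <unistd.h>

  int write_all(int fd, const char *buf, size_t len) {
      while (len > 0) {
          ssize_t n = write(fd, buf, len);
          if (n >= 0) {
              buf += n;
              len -= (size_t)n;
          } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
              struct pollfd p = { .fd = fd, .events = POLLOUT };
              poll(&p, 1, -1);   /* block until the pipe drains */
          } else if (errno != EINTR) {
              return -1;         /* real error, e.g. EPIPE */
          }
      }
      return 0;
  }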

teo_zero
5 replies
12h53m

Genuine question: why does printf need a retry loop when using pipes?

duped
4 replies
11h39m

It doesn't; that's why no one does it.

But for pipes, what it means is that if whoever is reading or writing the pipe expects non-blocking semantics, the other end needs to agree. And if they don't, you'll eventually get an error because the reader or writer outpaced the other, and almost no program handles errors for stdin or stdout.

caf
2 replies
5h39m

Making the read side non-blocking doesn't affect the write side, and vice-versa.

duped
1 replies
5h32m

That is not true for pipes.

caf
0 replies
4h59m

It is, at least on Linux for ordinary pipe(2) pipes.

I just wrote up a test to be sure: in the process with the read side, set it to non-blocking with fcntl(p, F_SETFL, O_NONBLOCK) then go to sleep for a long period. Dump a bunch of data into the writing side with the other process: the write() call blocks once the pipe is full as you would expect.
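
In code, roughly (a reconstruction of the described test, not the original):

  /* The read side sets O_NONBLOCK and sleeps; the writer still blocks
     once the pipe buffer fills, showing the two ends are independent. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void) {
      int p[2];
      pipe(p);
      if (fork() == 0) {                     /* child: read side */
          close(p[1]);
          fcntl(p[0], F_SETFL, O_NONBLOCK);  /* non-blocking read end */
          sleep(60);                         /* never actually read */
          _exit(0);
      }
      close(p[0]);                           /* parent: write side */
      char buf[4096];
      memset(buf, 'x', sizeof(buf));
      for (int i = 0; ; i++) {
          write(p[1], buf, sizeof(buf));     /* blocks when the pipe is full */
          fprintf(stderr, "wrote %d KiB\n", (i + 1) * 4);
      }
  }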

teo_zero
0 replies
10h21m

But even writing to a file doesn't guarantee non-blocking semantics. I still don't get what is special about pipes.

epcoa
0 replies
1h54m

This isn't right. O_NONBLOCK doesn't mean the pipe doesn't stall; it just means you get an immediate errno instead of blocking on the syscall in the kernel, and it is specific to the file description, of which a pipe has two independent ones. Setting O_NONBLOCK on the writer does not affect the reader. If it did, this would break a ton of common use cases where pipelined programs are designed to not even know what is on the other side.

Not sure what printf has to do with it; it isn't designed to be used with a non-blocking writer (but that only concerns one side). How will the reader being non-blocking change the semantics of the writer? It doesn't.

You can't set O_NONBLOCK on a pipe fd you expect to use with stdio, but that isn't unique to pipes. Whether the reader is O_NONBLOCK will not affect you if you're pushing the writer with printf/stdio.

(This is also a reason why I balk a bit when people refer to O_NONBLOCK as "async IO"; it isn't the same, and it leads to this confusion.)

henearkr
5 replies
21h0m

Pipes are extremely useful. But I guess it just depends on your use case. I do a lot of scripting.

If you dislike their (relative) slowness, it's open source, you can participate in making them faster.

And I'm sure that after this HN post we'll see some patches and merge requests.

hnlmorg
3 replies
20h31m

The very thing that makes pipes useful is what also makes them slow. I don't think there is much we can do to fix that without breaking POSIX compatibility entirely.

Personally I think there's much worse ugliness in POSIX than pipes. For example, I've just spent the last couple of days debugging a number of bugs in a shell's job control code (`fg`, `bg`, `jobs`, etc).

But despite its warts, I'm still grateful we have something like POSIX to build against.

effie
2 replies
16h35m

What possible bugs can there be in those? They are quite simple to use and work as expected.

khafra
0 replies
11h13m

They work as expected on Redhat and Debian. "POSIX" leaves open a lot of possibility for less-well-tested systems. They could be writing shellscripts on Minix or HelenOS.

hnlmorg
0 replies
8h47m

I'm talking about shell implementation not shell usage.

To implement job control, there are several signals you need to be aware of:

- SIGTSTP (what the TTY sends if it receives ^Z)

- SIGSTOP (what a shell sends to a process to suspend it)

- SIGCONT (what a shell sends to a process to resume it)

- SIGCHLD (what the shell needs to listen for to see there is a change in state for a child process -- this is also sometimes referred to as SIGCLD)

- SIGTTIN (received if a background process reads from stdin)

- SIGTTOU (received if a background process cannot write to stdout or set its modes)

Some of these signals are received by the shell, some are by the process. Some are sent from the shell and others from the kernel.

SIGCHLD isn't just raised for when a child process goes into suspend, it can be raised for a few different changes of state. So if you receive SIGCHLD you then need to inspect your children (of course you don't know what child has triggered SIGCHLD because signals don't contain metadata) to see if any of them have changed their state in any way. Which is "fun"....

And all of this only works if you manage to fork your children with special flags to set their PGID (not PID, another meta ID which represents what process group they belong to), and send magic syscalls to keep passing ownership of the TTY (if you don't tell the kernel which process owns the TTY, ie is in the foreground, then either your child process and/or your shell will crash due to permission issues).
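
The skeleton of that foreground dance looks roughly like this (a condensed sketch along the lines of the glibc manual's example, not my shell's actual code):

  /* Run cmd in its own process group, hand it the TTY, wait for it to
     exit or suspend, then take the TTY back. Error handling omitted. */
  #include <signal.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int run_foreground_job(char *const cmd[]) {
      signal(SIGTTOU, SIG_IGN);              /* real shells do this once at startup */
      pid_t shell_pgid = getpgrp();
      pid_t child = fork();
      if (child == 0) {
          setpgid(0, 0);                     /* child: own process group */
          tcsetpgrp(STDIN_FILENO, getpid()); /* take the foreground */
          signal(SIGTSTP, SIG_DFL);          /* restore default job-control signals */
          signal(SIGTTOU, SIG_DFL);
          execvp(cmd[0], cmd);
          _exit(127);
      }
      setpgid(child, child);                 /* also set in the parent: avoids a race */
      tcsetpgrp(STDIN_FILENO, child);        /* hand the TTY to the child */
      int status;
      waitpid(child, &status, WUNTRACED);    /* returns on exit *or* suspend */
      tcsetpgrp(STDIN_FILENO, shell_pgid);   /* the shell takes the TTY back */
      return WIFSTOPPED(status);             /* 1 = suspended: register it as a job */
  }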

None of this is 100% portable (see footnote [1]) and all of this also depends on well behaving applications not catching signals themselves and doing something funky with them.

The bug I've got is that Helix editor is one of those applications doing something non-standard with SIGTSTP and assuming anything that breaks as a result is a parent process which doesn't support job control. Except my shell does support job control and still crashes as a result of Helix's non-standard implementation.

In fairness to Helix, my shell does also implement job control in a non-standard way because I wanted to add some wrappers around signals and TTYs to make the terminal experience a little more comfortable than it is with POSIX-compliant shells like Bash. But because job control (and signals and TTYs in general) are so archaic, the result is that there are always going to be edge case bugs with applications (like Helix) that have implemented things a little differently themselves too.

So they're definitely not easy to use and can break in unexpected ways if even just one application doesn't implement things in expected ways.

[1] By the way, this is all ignoring subtle problems that different implementations of PTYs (eg terminal emulators, terminal multiplexors, etc) and different POSIX kernels can introduce too. And those can be a nightmare to track down and debug!

noloblo
0 replies
20h37m

+1 yes, pipes are what make shell scripting quite useful and allow for easy composition of the different unix shell utilities

akira2501
1 replies
20h29m

just never use pipes.

vmsplice doesn't work with every type of file descriptor. Eschewing some technology entirely because it seems archaic or because it makes writing "the fastest X software" seem harder is just sloppy engineering.

they are just not useful.

Then you have not written enough software yet to discover how they are useful.

gpderetta
0 replies
18h54m

Most importantly, the fast FizzBuzz toy vmsplices into /dev/null.

Nothing ever touches those pages on the consumer side and they can be refused immediately.

If you actually want a functional program using vmsplice, with a real consumer, things get hairy very quickly.

w0m
0 replies
20h28m

You can replace a 10k line python or shell script with a single creative line of pipes/xargs/etc on the cli.

It's incredibly valuable on the day to day.

hagbard_c
0 replies
19h41m

That's like telling a builder never to use nails but turn to adhesives instead. He will look at his hammer and his nails as well as a stack of 2x4s, grin and in no time slap together a box into which he will stuff you with a bottle of glue with the advice to now go and play while the grown-ups take care of business.

Sure, you could build that box with glue and clamps and ample time; sure, it would look neater and weigh less than the version that's currently holding you imprisoned, and if done right it will even be stronger. But it takes more time and effort, as well as those glue clamps and other specialised tools to create perfectly matching surfaces, while the builder just wielded that hammer and those nails and is now building yet another utilitarian piece of work with the same hammer and nails.

Sometimes all you need is a hammer and some nails. Or pipes.

koverstreet
11 replies
19h23m

One of my side projects is intended to address this: https://lwn.net/Articles/976836/

The idea is a syscall for getting a ringbuffer for any supported file descriptor, including pipes - and for pipes, if both ends support using the ringbuffer they'll map the same ringbuffer: zero copy IO, potentially without calling into the kernel at all.

Would love to find collaborators for this one :)

wakawaka28
4 replies
18h46m

Buffering is there for a reason and this approach will lead to weird failure modes and fragility in scripts. The core issue is that any stream producer might go slower than any given consumer. Even a momentary hiccup will totally mess up the pipe unless there is adequate buffering, and the amount needed is system-dependent.

foota
1 replies
18h16m

Maybe I misunderstand, but if the ring buffer is full isn't it ok for the sender to just block?

mort96
0 replies
10h27m

Yeah, and if the ring buffer is empty it's okay for the receiver to just block... exactly as happens today with pipes

hackernudes
0 replies
18h10m

I think the OP's proposal has buffering.

It is different from a pipe - instead of using read/write to copy data from/to a kernel buffer, it gives user space a mapped buffer object and they need to take care to use it properly (using atomic operations on the head/tail and such).

If you own the code for the reader and writer, it's like using shared memory for a buffer. The proposal is about standardizing an interface.

Spivak
0 replies
18h17m

What makes this any different than other buffer implementations that have a max size? Buffer fills, writes block. What failure mode are you worried about that can't occur with pipes which are also bounded?

messe
2 replies
8h58m

and for pipes, if both ends support using the ringbuffer they'll map the same ringbuffer

Is there planned to be a standardized way to signal to the other end of the pipe that ring buffers are supported, so this could be handled transparently in libc? If not, I don't really see what advantage it gets you compared to shared memory + a futex for synchronization—for pipes that is.

immibis
1 replies
7h8m

Presumably the same interface still works if the other side is using read/write.

koverstreet
0 replies
2h9m

correct

caf
1 replies
5h20m

Presumably ringbuffer_wait() can also be signalled through making it 'readable' in poll()?

koverstreet
0 replies
3h10m

yes, I believe that's already implemented; the more interesting thing I still need to do is make futex() work with the head and tail pointers.

phafu
0 replies
2h53m

At least for user space usage, I'm not sure a new kernel thing is needed. Quite a while ago I implemented a user space (single producer / single consumer) ring buffer, which uses an eventfd to mimic pipe behavior and functionality quite closely (i.e. being able to sleep & poll for ring-buffer full/empty situations), but otherwise operates locklessly and without syscall overhead.
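
The core of that pattern might look like this sketch (single producer / single consumer, C11 atomics; the eventfd wakeup path is elided and capacity is assumed to be a power of two):

  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stddef.h>

  /* Lockless SPSC byte ring buffer, e.g. placed in shared memory. An
     eventfd (not shown) is signalled on empty->nonempty / full->nonfull
     transitions so the other side can sleep in poll(). */
  struct ring {
      _Atomic size_t head;  /* written by the producer */
      _Atomic size_t tail;  /* written by the consumer */
      size_t mask;          /* capacity - 1, capacity a power of two */
      unsigned char buf[];
  };

  static bool ring_push(struct ring *r, const unsigned char *data, size_t len) {
      size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
      size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
      if (r->mask + 1 - (head - tail) < len)
          return false;                      /* full: sleep on the eventfd */
      for (size_t i = 0; i < len; i++)       /* byte-wise for wraparound clarity */
          r->buf[(head + i) & r->mask] = data[i];
      atomic_store_explicit(&r->head, head + len, memory_order_release);
      return true;
  }

  static bool ring_pop(struct ring *r, unsigned char *data, size_t len) {
      size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
      size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
      if (head - tail < len)
          return false;                      /* not enough data yet */
      for (size_t i = 0; i < len; i++)
          data[i] = r->buf[(tail + i) & r->mask];
      atomic_store_explicit(&r->tail, tail + len, memory_order_release);
      return true;
  }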

nitwit005
6 replies
20h25m

Just about every form of IPC is "slow". You have decided to pay a performance cost for safety.

marcosdumay
3 replies
16h55m

You shouldn't have to pay that much. Pipes give you almost nothing, so they should cost almost nothing.

Specifically, there aren't many reasons for your fastest IPC to be slower than a long function call.

nitwit005
2 replies
15h34m

If you don't think pipes offer much, don't use them.

Saying "long function call" doesn't mean much since a function can take infinitely long.

marcosdumay
1 replies
15h7m

A long-distance function call that invalidates everything in your cache.

saagarjha
0 replies
28m

…which is quite expensive.

brigade
1 replies
14h45m

Pipes don’t exist for safety, they exist as an optimization to pass data between existing programs.

PaulDavisThe1st
0 replies
14h27m

NOT writing and reading to and from a file stored on a drive is not, in this context, an optimization, but a significantly freeing conceptual shift that completely transforms how a class of users conducts themselves when using the computer.

JoshTriplett
6 replies
19h0m

This is a side note to the main point being made, but on modern CPUs, "rep movsb" is just as fast as the fastest vectorized version, because the CPU knows to accelerate it. The name of the kernel function "copy_user_enhanced_fast_string" hints at this: the CPU features are ERMS ("Enhanced REP MOVSB/STOSB", which makes "rep movsb" faster for anything above a certain length threshold) and FSRM ("Fast Short REP MOV", which makes "rep movsb" faster for shorter moves too).
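
On such CPUs the whole copy routine can collapse to this (an illustrative x86-64 GCC/Clang sketch, not the kernel's actual code):

  #include <stddef.h>

  /* memcpy via "rep movsb": with ERMS/FSRM the microcode runs this at
     full speed, no SIMD loop needed. */
  static void *movsb_copy(void *dst, const void *src, size_t n) {
      void *ret = dst;
      __asm__ volatile("rep movsb"
                       : "+D"(dst), "+S"(src), "+c"(n)
                       :
                       : "memory");
      return ret;
  }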

Lockal
2 replies
18h10m

This is not the full truth, "rep movsb" is fast until another threshold, after which either normal or non-temporal store is faster.

All thresholds are described in https://codebrowser.dev/glibc/glibc/sysdeps/x86_64/multiarch...

And they are not final, i.e. Noah Goldstein still updates them every year.

jeffbee
0 replies
16h58m

Which of these is "faster" depends greatly on whether you have the very rare memcpy-only workload, or whether your program actually does something useful. Many people believe, often with good evidence, that the most important thing is for memcpy to occupy as few instruction cache lines as is practical, instead of being something that branches all over kilobytes of machine code. For comparison, see the x86 implementations in LLVM libc.

https://github.com/llvm/llvm-project/blob/main/libc/src/stri...

adrian_b
0 replies
12h48m

It depends on the CPU. There is no good reason for "rep movsb" to be slower at any big enough data size.

On a Zen 3 CPU, "rep movsb" becomes faster than or the same as anything else above a length slightly greater than 2 kB.

However there is a range of multi-megabyte lengths, which correspond roughly with sizes below the L3 cache but exceeding the L2 cache, where for some weird reason "rep movsb" becomes slower than SIMD non-temporal stores.

At lengths exceeding the L3 size, "rep movsb" becomes again the fastest copy method.

The Intel CPUs have different behaviors.

koverstreet
1 replies
17h59m

I'm still waiting for rep movsb and rep stosb to be fast enough to delete my simple C loop versions, for short memcpys.

adrian_b
0 replies
12h32m

It is likely that on recent CPUs they are always faster than C loop versions.

On my Zen 3 CPU, for lengths of 2 kB or smaller it is possible to copy faster than with "rep movsb", but by using SIMD instructions (or equivalently the builtin "memcpy" provided by most C compilers), not with a C loop (unless the compiler recognizes the C loop and replaces it with the builtin memcpy, which is what some compilers will do at high optimization levels).

jeffbee
0 replies
18h38m

Also worth noting that Linux has changed the way it uses ERMS and FSRM in x86 copy multiple times since kernel 6.1 used in the article. As a data-dote, my machine that has FSRM and ERMS — surprisingly, the latter is not implied by the former — hits 17GB/s using plain old pipes and a 32KiB buffer on Linux 6.8

stabbles
4 replies
9h5m

A bold claim for a blog that takes about 20 seconds to load.

yas_hmaheshwari
2 replies
8h56m

This post has gone to the top of hacker news, so I think we should give him some slack

Looks like an amazing article, and so much to learn on what happens under the hood

ben-schaaf
1 replies
3h16m

HN generates ~20k page views over the course of a day with a peak of 2k/h: https://harrisonbroadbent.com/blog/hacker-news-traffic-spike.... At ~1MB per page load - not sure how accurate this is, I don't think it fully loaded - this static blogpost requires 0.55MB/s to meet demand. An original Raspberry Pi B (10 Mbps ethernet) on the average French mobile internet connection (8 Mbps) provides double that.

I don't mean this as a slight to anyone, I just want to point out the HN "hug of death" can be trivially handled by a single cheap VPS without even breaking a sweat.

qsantos
0 replies
2h24m

Totally agree, my server should definitely be able to handle the load. But this is a WordPress install, which is definitely doing too much work for what it is when just serving the pages. I plan to improve on this!

wvh
0 replies
7h38m

I believe that when it's a .fr, they call it nonchalance...

nyanpasu64
3 replies
13h51m

How do you gather profiling information for kernel function calls from a user program?

ismaildonmez
1 replies
7h20m

Could you clarify how you are testing the speed of the first example, where you are not writing anything to stdout? Thanks.

qsantos
0 replies
5h45m

For the first Rust program, where I just write to memory, I just use the time utility when running the program from zsh. Then, I divide the number of bytes written by the number of seconds elapsed. That's why it's not an infinite loop ;)

cowsaymoo
3 replies
12h8m

What is the library used to profile the program?

throw12390
1 replies
6h57m

`pv --discard` is faster by 8% (on my system).

  % pv </dev/zero >/dev/null
  54.0GiB/s

  % pv </dev/zero --discard
  58.7GiB/s

IWeldMelons
0 replies
3h50m

Which is suspiciously close to the speed of DDR4.

Borg3
3 replies
10h16m

Haha. When I read the title I smiled. Linux pipes slow? Moook.. Now try Cygwin pipes. That's what I call slow!

Anyway, nice article, it's good to know what's going on under the hood.

MaxBarraclough
2 replies
9h54m

I'd assumed Cygwin pipes are just Windows pipes, is that not the case?

Borg3
0 replies
4h39m

It's not that easy. Yeah, they are, but there is a lot of POSIX-like glue inside so they work correctly with select() and other alarms. The code is very complicated.

But still, kudos to the Cygwin developers for creating Cygwin :) Great work, even though it has some issues.

rwmj
2 replies
6h1m

Be interesting to see a version using io_uring, which I think would let you pre-share buffers with the kernel avoiding some copies, and avoid syscall overhead (though the latter seems negligible here).
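
Something along these lines, perhaps (a liburing sketch with a registered buffer; the helper is a hypothetical illustration, untested):

  #include <liburing.h>
  #include <sys/uio.h>

  /* Write one pre-registered buffer to fd (e.g. a pipe) via io_uring.
     Registering the buffer up front lets the kernel skip per-call pinning. */
  int uring_write_fixed(int fd, char *buf, unsigned len) {
      struct io_uring ring;
      if (io_uring_queue_init(8, &ring, 0) < 0) return -1;

      struct iovec iov = { .iov_base = buf, .iov_len = len };
      io_uring_register_buffers(&ring, &iov, 1);  /* pre-share with the kernel */

      struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
      io_uring_prep_write_fixed(sqe, fd, buf, len, 0, 0); /* offset 0, buf index 0 */
      io_uring_submit(&ring);

      struct io_uring_cqe *cqe;
      io_uring_wait_cqe(&ring, &cqe);
      int res = cqe->res;                         /* bytes written, or -errno */
      io_uring_cqe_seen(&ring, cqe);
      io_uring_queue_exit(&ring);
      return res;
  }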

qsantos
1 replies
5h50m

That sounds like a good idea!

rwmj
0 replies
5h23m

I'm not claiming it'll be faster! Additionally io_uring has its own set of challenges, such as whether it's better to allocate one ring per core or one ring per application (shared by some or all cores). Pre-sharing buffers has trade-offs too, particularly in application complexity [alignment, you have to be careful not to reuse a buffer before it is consumed] versus the efficiency of zero copy.

mparnisari
1 replies
2h58m

I get PR_CONNECT_RESET_ERROR when trying to open the page

qsantos
0 replies
2h25m

My server struggles a bit with the load on the WordPress site. You should be fine just reloading. I will make sure to improve things for the next time!

jeremyscanvic
1 replies
1h20m

Great post! I didn't know about vmsplice(2). I'm glad to see a former ENSL student here as well!

qsantos
0 replies
12m

Hey!

goodpoint
1 replies
8h16m

Excellent article even if, to be honest, the title is clickbait.

chmaynard
0 replies
6h55m

Agreed. Titles that don't use quantifiers are almost always misleading at best.

fatcunt
1 replies
7h0m

I do not know why the JMP is not just a RET, however.

This is caused by the CONFIG_RETHUNK option. In the disassembly from objdump you are seeing the result of RET being replaced with JMP __x86_return_thunk.

https://github.com/torvalds/linux/blob/v6.1/arch/x86/include...

https://github.com/torvalds/linux/blob/v6.1/arch/x86/lib/ret...

The NOP instructions at the beginning and at the end of the function allow ftrace to insert tracing instructions when needed.

These are from the ASM_CLAC and ASM_STAC macros, which make space for the CLAC and STAC instructions (both of them three bytes in length, same as the number of NOPs) to be filled in at runtime if X86_FEATURE_SMAP is detected.

https://github.com/torvalds/linux/blob/v6.1/arch/x86/include...

https://github.com/torvalds/linux/blob/v6.1/arch/x86/include...

https://github.com/torvalds/linux/blob/v6.1/arch/x86/kernel/...

qsantos
0 replies
5h47m

Thanks a lot for the information! I was not quite sure what to look for in this case. I have added a note in the article.

Narishma
0 replies
3h35m

That is only the case in specific Intel CPU models.

RevEng
1 replies
19h30m

I didn't quite grasp why the original splice has to be so slow. They pointed out what made it slower than vmsplice - in particular allocating buffers and using scalar instructions - but why is this necessary? Why couldn't splice just be reimplemented as vmsplice? I'm sure there is a good reason, but I've missed it.

Izkata
0 replies
16h35m

Why couldn't splice just be reimplemented as vmsplice?

A possible answer that's currently just below your comment: https://news.ycombinator.com/item?id=41351870

vmsplice doesn't work with every type of file descriptor.

up2isomorphism
0 replies
3h9m

Someone tasted bread and thought it was not sweet enough, which is fine. But calling the bread bland is funny, because it is not meant to taste sweet.

sixthDot
0 replies
8h9m

I do not know why the JMP is not just a RET, however.

The jump seems to be generated by the expansion of the `ASM_CLAC` macro, which is supposed to change the EFLAGS register ([1], [2]). However, in this case the expansion looks like it does nothing (maybe because of the target?). I'd be interested to know more about that. Call to the wild.

[1]: https://github.com/torvalds/linux/blob/master/arch/x86/inclu...

[2]: https://stackoverflow.com/a/60579385

qsantos
0 replies
11h36m

I am again getting the hug of death from Hacker News. The situation is better than last time thanks to caching WordPress pages, but loading the page can still take a few seconds, so bear with me!

jvanderbot
0 replies
2h34m

Although SSE2 is always available on x86-64, I also disabled the cpuid bit for SSE2 and SSE to see if it could nudge glibc into using scalar registers to copy data. I immediately got a kernel panic. Ah, well.

I think you need to recompile your compiler, or disable those explicitly via link / cc flags. Compilers are fairly hard to coax into, or dissuade from, emitting SIMD instructions, IMHO.

faizshah
0 replies
2h32m

This is a really cool post and that is a massive amount of throughput.

In my experience in data engineering, it's very unlikely you can exceed 500 MB/s of throughput in your business logic, as most libraries you're using are not optimized to that degree (SIMD etc.). That being said, I think it's a good technique to try out.

I’m trying to think of other applications this could be useful for. Maybe video workflows?

djaouen
0 replies
20h2m

So is Python, but I'm still gonna use it lol

arendtio
0 replies
6h16m

I know pipes primarily from shell scripts. Are they being used in other contexts as extensively, too? Like C or Rust programs?