
Rust std fs slower than Python? No, it's hardware

quietbritishjim
53 replies
1d3h

I'm a bit confused about the premise. This is not comparing pure Python code against some native (C or Rust) code. It's comparing one Python wrapper around native code (Python's file read method) against another Python wrapper around some native code (OpenDAL). OK it's still interesting that there's a difference in performance, but it's very odd to describe it as "slower than Python". Did they expect that the Python standard library is all written in pure Python? On the contrary, I would expect the implementations of functions in Python's standard library to be native and, individually, highly optimised.

I'm not surprised the conclusion had something to do with the way that native code works. Admittedly I was surprised at the specific answer - still a very interesting article despite the confusing start.

Edit: The conclusion also took me a couple of attempts to parse. There's a heading "C is slower than Python with specified offset". To me, as a native English speaker, this reads as "C is slower (than Python) with specified offset" i.e. it sounds like they took the C code, specified the same offset as Python, and then it's still slower than Python. But it's the opposite: once the offset from Python was also specified in the C code, the C code was then faster. Still very interesting once I got what they were saying though.

qd011
32 replies
23h45m

I don't understand why Python gets shit for being a slow language when it's slow but no credit for being fast when it's fast just because "it's not really Python".

If I write Python and my code is fast, to me that sounds like Python is fast, I couldn't care less whether it's because the implementation is in another language or for some other reason.

kbenson
8 replies
23h2m

Because for any nontrivial case you would expect python+compiled library and associated marshaling of data to be slower than that library in its native implementation without any interop/marshaling required.

When you see an interpreted language faster than a compiled one, it's worth looking at why, because most of the time it's because there's some hidden issue causing the other to be slow (which could just be a different and much worse implementation).

Put another way, you can do a lot to make a Honda Civic very fast, but when you hear one goes up against a Ferrari and wins your first thoughts should be about what the test was, how the Civic was modified, and if the Ferrari had problems or the test wasn't to its strengths at all. If you just think "yeah, I love Civics, that's awesome" then you're not thinking critically enough about it.

lmm
4 replies
14h50m

Because for any nontrivial case you would expect python+compiled library and associated marshaling of data to be slower than that library in its native implementation without any interop/marshaling required.

When you see an interpreted language faster than a compiled one, it's worth looking at why, because most of the time it's because there's some hidden issue causing the other to be slow (which could just be a different and much worse implementation).

On the contrary, the compiled languages tend to only be faster in trivial benchmarks. In real-world systems the Python-based systems tend to be faster because they haven't had to spend so long twiddling which integers they're using and debugging crashes and memory leaks, and got to spend more time on the problem.

kbenson
2 replies
14h43m

I don't doubt that can happen, but I'm also highly doubtful that it's the norm for large, established, mature projects with lots of attention, such as popular libraries and the standard library of popular languages. As time spent on the project increases, I suspect that any gain an interpreted language has over an (efficient) compiled one not only gets smaller, but eventually reverses in most cases.

So, like in most things, the details can sometimes matter quite a bit.

lmm
1 replies
14h5m

I don't doubt that can happen, but I'm also highly doubtful that it's the norm for large, established, mature projects with lots of attention, such as popular libraries and the standard library of popular languages.

Code that has lots of attention is different, certainly, but it's also the exception rather than the rule; the last figure I saw was that 90% of code is internal business applications that are never even made publicly available in any form, much less subject to outside code review or contributions.

As time spent on the project increases, I suspect that any gain an interpreted language has over an (efficient) compiled one not only gets smaller, but eventually reverses in most cases.

In terms of the limit of an efficient implementation (which certainly something like Python is nowhere near), I've seen it argued both ways; with something like K the argument is that a tiny interpreter that sits in L1 and takes its instructions in a very compact form ends up saving you more memory bandwidth (compared to what you'd have to compile those tiny interpreter instructions into if you wanted them to execute "directly") than it costs.

JonChesterfield
0 replies
6h27m

a tiny interpreter that sits in L1 and takes its instructions in a very compact form ends up saving you more memory bandwidth

There's a paper on this you might like. https://www.researchgate.net/publication/2749121_When_are_By...

I think there's something to the idea of keeping the program in the instruction cache by deliberately executing parts of it via interpreted bytecode. There should be an optimum around zero instruction cache misses, either from keeping everything resident, or from deliberately paging instructions in and out as control flow in the program changes which parts are live.

There are complicated tradeoffs between code specialisation and size. Translating some back and forth between machine code and bytecode adds another dimension to that.

I fear it's either the domain of extremely specialised handwritten code - luajit's interpreter is the canonical example - or of the sufficiently smart compiler. In this case a very smart compiler.

JonChesterfield
0 replies
6h39m

On the contrary, the compiled languages tend to only be faster in trivial benchmarks. In real-world systems the Python-based systems tend to be faster because they haven't had to spend so long twiddling which integers they're using and debugging crashes and memory leaks, and got to spend more time on the problem.

This is an interesting premise.

Python in particular gets an absolute kicking for being slow. Hence all the libraries written in C or C++ then wrapped in a python interface. Also why "python was faster than rust at anything" is headline worthy.

I note your claim is that python systems in general tend to be faster (outside of trivial benchmarks, whatever the scope of that is). Can you cite any single example where this is the case?

Attummm
1 replies
22h37m

In this case, Python's code (opening and loading the content of a file) operates almost fully within its C runtime.

The C components initiate the system call and manage the file pointer, which loads the data from the disk into a pyobj string.

Therefore, it isn't so much Python itself that is being tested, but rather Python's underlying C runtime.

kbenson
0 replies
22h25m

Yep, and the next logical question when both implementations are for the most part bare metal (compiled and low-level), is why is there a large difference? Is it a matter of implementation/algorithm, inefficiency, or a bug somewhere? In this case, that search turned up a hardware issue that should be addressed, which is why it's so useful to examine these things.

heavyset_go
0 replies
18h11m

If you're staying within Python and its C-extensions, there is no marshalling, you're dealing with raw PyObjects that are exposed to the interpreter.

afdbcreid
8 replies
23h25m

Usually, yes, but when it's a bug in the hardware, it's not really that Python is fast, more like that CPython developers were lucky enough to not have the bug.

munch117
7 replies
22h46m

How do you know that it's luck?

cozzyd
5 replies
21h39m

Because the offset is entirely due to space for the PyObject header.

munch117
4 replies
19h48m

The PyObject header is a target for optimisation. Performance regressions are likely to be noticed, and if a different header layout is faster, then it's entirely possible that it will be used for purely empirical reasons. Trying different options and picking the best performing one is not luck, even if you can't explain why it's the best performing.

saagarjha
2 replies
17h35m

You can expect the Python developers to look very closely at any benchmark that significantly benefits from adding random padding to the object header. Performance isn’t just trying a bunch of random things and picking whatever works the best, it’s critical to understand why so you know that the improvement is not a fluke. Especially since it is very easy to introduce bias and significantly perturb the results if you don’t understand what’s going on.

munch117
1 replies
9h34m

We're not talking about random changes. We're talking about paying attention to the measured performance of changes made for other reasons.

Just like in this article. The author measured, wondered, investigated, experimented, and finally, after a lot of hard work, made the C/Rust programs faster. You wouldn't call that luck, would you? If there had been a similar performance regression in CPython, then a benchmark could have picked up on it, and the CPython developers would then have done the same.

saagarjha
0 replies
5h37m

You can look at the history of PyObject yourself: https://github.com/python/cpython/commits/main/Include/objec.... None of these changes were done because of weird CPU errata that meant that making the header bigger was a performance win. That isn't to say that the developers wouldn't be interested in such effects, or be able to detect them, but the fact that the object header happens to be large enough to avoid the performance bug isn't because of careful testing but because that's what they ended up with for other reasons, long before Zen 3 was ever released. If it so happened that Python was affected because the offset needed to avoid a penalty was 0x50 or something then I am sure they would take it up with AMD rather than being content to increase the size of their header for no reason.

cozzyd
0 replies
19h18m

I suspect any size other than 0 would lead to this.

But the Zen3/4 were developed far, far after the PyObject header...

adgjlsfhk1
0 replies
21h36m

because the offset here is a result of python's reference counting which dates ~20 years before zen3

IshKebab
5 replies
20h48m

Because when people talk about Python performance they're talking about the performance of Python code itself, not C/Rust code that it's wrapping.

Pretty much any language can wrap C/Rust code.

Why does it matter?

1. Having to split your code across 2 languages via FFI is a huge pain.

2. You are still writing some Python. There's plenty of code that is pure Python. That code is slow.

munch117
4 replies
19h44m

Of course in this case there's no FFI involved - the open function is built-in. It's as pure-Python as it can get.

jwueller
2 replies
15h22m

How is it pure Python if it delegates all of the actual work to the Kernel?

munch117
1 replies
9h55m

All I/O delegates to the kernel, eventually.

It's pure Python in that there's no cffi, no ctypes, no Cython, no C extensions of any kind.

int_19h
0 replies
6h45m

It's pretty hard to draw this line in Python because all built-in types and functions are effectively C extensions, just compiled directly into the interpreter.

Conversely, you can have pure C code just using PyObjects (this is effectively what Cython does), with the Python bytecode interpreter completely out of the picture. But the perf improvement is nowhere near what people naively expect from compiled code, usually.

IshKebab
0 replies
19h9m

Not sure I agree there, but anyway in this case the performance had nothing to do with Python being a slow or fast language.

benrutter
1 replies
23h22m

I wonder if it's because we're sometimes talking at cross purposes.

For me, coding is almost exclusively using python libraries like numpy to call out to other languages like c or FORTRAN. It feels silly to say I'm not coding in Python to me.

On the other hand, if you're writing those libraries, coding to you is mostly writing FORTRAN and c optimizations. It probably feels silly to say you're coding in Python just because that's where your code is called from.

zare_st
0 replies
16h34m

There is a version of BASIC, a QuickBasic clone called QB64, that is lightning fast because it transpiles to C++. By your reasoning, a programmer should think that BASIC is fast because he only writes BASIC and does not care about the environment details?

It's actually the opposite: a Python programmer should know how to offload most of the work out of Python into C, or use the libraries that do so. He should not be oblivious to the fact that any decent Python performance is due to shrinking down the ratio of actual Python instructions vs native instructions.

analog31
1 replies
17h59m

I think the confusion comes from people not having a good understanding of what an interpreted programming language does, and what actual portion of time is spent in high versus low level code. I've always assumed that most of my programs amount to a bit of glue thrown in between system calls.

Also, when we talk about "faster" and "slower," it's not clear what the order of magnitude is.

Maybe an analysis of actual code execution would shed more light than a simplistic explanation that the Python interpreter is written in C. I don't think the BASIC interpreter in my first computer was written in BASIC.

zare_st
0 replies
16h28m

Agreed. The speed of a language is inversely proportional to the number of CPU instructions emitted to do something meaningful, e.g. solve a problem. Not whether it can target system calls without overhead and move memory around freely. That's a given.

rafaelmn
0 replies
22h54m

But you will care if that "python" breaks - you get to drop down to C/C++ and debugging native code. Likewise for adding features or understanding the implementation. Not to mention having to deal with native build tooling and platform specific stuff.

It's completely fair to say that's not python because it isn't - any language out there can FFI to C and it has the same problems mentioned above.

paulddraper
0 replies
23h39m

Yeah, it's weird.

p5a0u9l
0 replies
13h1m

I constantly get low key shade for choosing to build everything in Python. It’s really interesting to me. People can’t break out of thinking, “oh, you wrote a script for that?”. Actually, no, it’s software, not a script.

99% of my use cases are easily, maintainably solved with good, modern Python. The Python execution is almost never the bottleneck in my workflows. It’s disk or network I/O.

I’m not against building better languages and ecosystems, and compiled languages are clearly appropriate/required in many workflows, but the language parochialism gets old. I just want to build shit that works and get stuff done.

insanitybit
0 replies
20h15m

I don't understand why Python gets shit for being a slow language when it's slow but no credit for being fast when it's fast just because "it's not really Python".

What's there to understand? When it's fast it's not really Python, it's C. C is fast. Python can call out to C. You don't have to care that the implementation is in another language, but it is.

crabbone
14 replies
1d

individually, highly optimised.

Now why would you expect that?

What happened to OP is pure chance. CPython's C code doesn't even care about const-consistency. It's flush with dynamic memory allocations, a bunch of helper / convenience calls... Even stuff like arithmetic does dynamic memory allocation...

Normally, you don't expect CPython to perform well, not if you have any experience working with it. Whenever you want to improve performance you want to sidestep all the functionality available there.

Also, while Python doesn't have a standard library, since it doesn't have a standard... the library that's distributed with it is mostly written in Python. Of course, some of it comes written in C, but there's also a sizable fraction of that C code that's essentially Python code translated mechanically into C (a good example of this is Python's binary search implementation which was originally written in Python, and later translated into C using Python's C API).

What one would expect is that functionality that is simple to map to operating system functionality has a relatively thin wrapper. I.e. reading files wouldn't require much in terms of binding code because, essentially, it goes straight into the system interface.

codr7
13 replies
23h24m

Have you ever attempted to write a scripting language that performs better?

I have, several, and it's far from trivial.

The basics are seriously optimized for typical use cases, take a look at the source code for the dict type.

crabbone
8 replies
22h31m

Have you ever attempted to write a scripting language that performs better?

No, because "scripting language" is not a thing.

But, if we are talking about implementing languages, then I worked with many language implementations. The most comparable one that I know fairly well, inside-and-out would be the AVM, i.e. the ActionScript Virtual Machine. It's not well-written either unfortunately.

I've looked at implementations of Lua, Emacs Lisp and Erlang at different times and to various degrees. I'm also somewhat familiar with SBCL and ECL on the implementation side. There are different things the authors looked for in these implementations. For example, SBCL emphasizes performance, where ECL emphasizes simplicity and interop with C.

If I had to grade language implementations I've seen, Erlang would absolutely take the cake. It's a very thoughtful and disciplined program whose authors went to great lengths to design and implement it. CPython is on the lower end of such programs. It's anarchic, very unevenly implemented; you run into comments testifying to the author not knowing what they are doing, what their predecessor did, nor what to do next. Sometimes the code is written from that perspective as well: if the author somehow manages to drive themselves into a corner and doesn't know what the reference count is anymore, they'll just hammer it until, they hope, all references are dead (well, maybe).

It's the code style that, unfortunately, I associate with proprietary projects where deadlines and cost dictate the quality, where concurrency problems are solved with sleeps, and if that doesn't work, then the sleep delay is doubled. It's not because I specifically hate code being proprietary, but because I meet that kind of code in my day job more than I meet it in hobby open-source projects.

take a look at the source code for the dict type.

I wrote a Protobuf parser in C with the intention of exposing its bindings to Python. Dictionaries were a natural choice for the hash-map Protobuf elements. I benchmarked my implementation against C++ (Google's) implementation only to discover that std::map wins against Python's dictionary by a landslide.

Maybe Python's dict isn't as bad as most of the rest of the interpreter, but being the best of the worst still doesn't make it good.

codr7
5 replies
21h12m

Except it is, because everyone sort of knows what it means: an interpreted language that prioritizes convenience over performance; Perl/Python/Ruby/Lua/PHP/etc.

SBCL is definitely a different beast.

I would expect Emacs Lisp & Lua to be more similar.

Erlang had plenty more funding and stricter requirements.

C++'s std::map has most likely gotten even more attention than Python's dict, but I'm not sure from your comment if you're including Python's VM dispatch in that comparison.

What are you trying to prove here?

JonChesterfield
3 replies
6h16m

(std::map is famously rubbish, to the extent that a common code review fix is to replace it with std::unordered_map. Map is a tree, unordered map is a linked-list-collision hashtable. Both are something of a performance embarrassment for C++. So std::map outperforming a given hashtable is a strongly negative judgement)

codr7
1 replies
1h0m

It's ordered and predictable, which is far from rubbish.

In most cases std::unordered_map will be faster, but hashtables have nasty edge cases and are usually more expensive to create.

I can pretty much guarantee it's been optimized to hell and back.

JonChesterfield
0 replies
51m

Unordered map sure hasn't been. There are algorithmic performance guarantees in the standard that force the linked list of buckets implementation despite that being slower than alternatives. Maybe the libc++ map<> is a very perfect rbtree, but I doubt that's the state of the art in ordered containers either.

crabbone
0 replies
11m

If I may hazard a guess (I don't know why the original code used it), Python dictionaries are also ordered (and there's no option in Python's library to have them not ordered). Maybe they wanted to match Python's behavior.

crabbone
0 replies
13m

interpreted language

There is no such thing as an interpreted language. A language implementation can be called an interpreter to emphasize the reliance on a rich existing library, but there's no real line here that can divide languages into two non-ambiguous categories. So... is C an "interpreted language"? -- well, in a certain light it is, since it calls into libc for a lot of functionality, therefore libc can be thought of as its interpreter. Similarly, machine code is often said to be interpreted by the CPU, when it translates it to microcode and so on.

prioritizes convenience over performance

This has nothing to do with scripting. When the word "scripting" is used, it's about the ability to automate another program, and record this automation as a "script". Again, this is not an absolute metric that can divide all languages or their implementations into scripting and not-scripting. When the word "scripting" is used properly it is used to emphasize the fact that a particular program is amenable to automation by means of writing other programs, possibly in another language.

Here are some fun examples to consider. For example, MSBuild, a program written in C# AFAIK, can be scripted in C# to compile C# programs! qBittorrent, a program written in C++, can be scripted using any language that has Selenium bindings because qBittorrent uses Qt for the GUI stuff and Qt can be automated using Selenium. Adobe Photoshop (used to be, not sure about now) can be scripted in JavaScript.

To give you some examples which make your claim ridiculously wrong: Forth used to be used in the Solaris bootloader to automate the kernel loading process, i.e. it was used as a scripting language for that purpose, yet most mature Forth implementations aim for the same performance bracket as C. You'd also be hard-pressed to find a lot of people who think that Forth is a very convenient language... (I do believe it's fine, but there may be another five or so people who believe it too).

---

Basically, your ideas about programming language taxonomies are all wrong and broken... sorry. Not only did you misapply the labels, you don't even have any good labels to begin with.

Anyways,

What are you trying to prove here?

Where's here? Do you mean the original comment or the one that mentions std::map?

If the former: I'm trying to prove that CPython is a dumpster fire of a program. That is based on many years of working with it and quite extensive knowledge of its internals, of which I have already provided examples.

If it is the latter: the parent claimed something about how optimized Python's dictionary is, and I showed that it has a very long way to go to be in the category of good performers. I.e. optimizing something, no matter how much, doesn't mean that it works well.

I don't know what you mean by Python's VM dispatch in this context. I already explained that I used the Python C API for dictionaries, namely this: https://docs.python.org/3/c-api/dict.html . It's very easy to find equivalent functionality in std::map.

int_19h
1 replies
6h39m

For starters, since everything in Python is a pass-by-ref object, dicts store pointers to values, which then have to be allocated on the heap and refcounted, whereas std::map can store values directly. But this is the consequence of a very-high-level object model used by CPython, not its dict implementation that has to adapt to that.
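
A rough Rust analogy of the difference described above (purely illustrative; not CPython's or libstdc++'s actual code): one map stores values inline, the other keeps every value behind a separate refcounted heap allocation.

    use std::collections::HashMap;
    use std::rc::Rc;

    fn main() {
        // Values stored directly in the table, as std::map<int64_t, int64_t> would.
        let mut inline: HashMap<i64, i64> = HashMap::new();
        inline.insert(1, 42);

        // Every value lives in its own refcounted heap allocation, which is
        // roughly the shape of a dict full of PyObject* entries.
        let mut boxed: HashMap<i64, Rc<i64>> = HashMap::new();
        boxed.insert(1, Rc::new(42));

        assert_eq!(inline[&1], *boxed[&1]);
    }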

crabbone
0 replies
34m

You are confusing how something works right now with how it could work in principle. In this you very much resemble CPython developers: they never attempt optimizations that go beyond what the Python C API can offer. This is very limiting (and this is why all sorts of Python JIT compilers can in many circumstances beat CPython by a lot).

The evidence of how absurd your claim is sits right in front of you: Google's implementation of Protobuf uses std::map for dictionaries, and these dictionaries are exposed to Python. But, following your argument, this... shouldn't be possible?

To better understand the difference: Python dictionary stores references to Python objects, but it doesn't have to. It could, for example, take Python strings and use C character arrays for storage, and then upon querying the dictionary convert them back to Python str objects. Similarly with integers for example etc.

Why is this not done -- I don't know. Knowing how many other things are done in Python, I'd suspect that this isn't done because nobody bothered to do it. It also feels too hard and too unrewarding to patch a single class of objects, even one as popular as dictionaries. If you go for this kind of optimization, you want it to be applied systematically and uniformly to all the code... and that's, I guess, how Cython came to be, for example.

svieira
1 replies
23h6m

Raymond Hettinger's talk Modern Python Dictionaries: A confluence of a dozen great ideas is an awesome "history of how we got these optimizations" and a walk through why they are so effective - https://www.youtube.com/watch?v=npw4s1QTmPg

codr7
0 replies
22h58m

Yeah, I had a nice chat with Raymond Hettinger at a Pycon in Birmingham/UK back in the days (had no idea who he was at the time). He seemed like a dedicated and intelligent person, I'm sure we can thank him for some of that.

wahern
0 replies
18h26m

The basics are seriously optimized for typical use cases, take a look at the source code for the dict type

Python is well micro-optimized, but the broader architecture of the language and especially the CPython implementation did not put much concern into performance, even for a dynamically typed scripting language. For example, in CPython values of built-in types are still allocated as regular objects and passed by reference; this is atrocious for performance and no amount of micro optimization will suffice to completely bridge the performance gap for tasks which stress this aspect of CPython. By contrast, primitive types in Lua (including PUC Lua, the reference, non-JIT implementation) and JavaScript are passed around internally as scalar values, and the languages were designed with this in mind.

Perl is similar to Python in this regard--the language constructs and type systems weren't designed for high primitive operation throughput. Rather, performance considerations were focused on higher level, functional tasks. For example, Perl string objects were designed to support fast concatenation and copy-on-write references, optimizations which pay huge dividends for the tasks for which Perl became popular. Perl can often seem ridiculously fast for naive string munging compared to even compiled languages, yet few people care to defend Perl as a performant language per se.

rowanG077
0 replies
16h42m

Have you ever attempted to write a scripting language that performs better?

Way to miss the mark. The point is precisely that Python is slow and one of the causes is that it is a scripting language. Stomping your foot and essentially saying "You couldn't do any better" helps no one and is counterproductive.

lambda
1 replies
18h52m

I'm a bit confused by why you are confused.

It's surprising that something as simple as reading a file is slower in the Rust standard library than in the Python standard library. Even knowing that a Python standard library call like this is written in C, you'd still expect the Rust standard library call to be of a similar speed; so you'd expect either that you're using it wrong, or that the Rust standard library has some weird behavior.

In this case, it turns out that neither were the case; there's just a weird hardware performance cliff based on the exact alignment of an allocation on particular hardware.

So, yeah, I'd expect a filesystem read to be pretty well optimized in Python, but I'd expect the same in Rust, so it's surprising that the latter was so much slower, and especially surprising that it turned out to be hardware and allocator dependent.

quietbritishjim
0 replies
8h12m

It's just the spin of it that threw me off. You're right: "why is a C implementation so much faster than a widely used Rust implementation" is a valid and interesting question. But phrasing it as "why is a Python function faster than a Rust function", when it's clearly not the comparison at all, distracts from the real question.

fl0ki
1 replies
19h44m

The premise is that any time you say "Python [...] faster than Rust [...]" you get page views even if it's not true. People have noticed after the last few dozen times something like this was posted.

p5a0u9l
0 replies
12h55m

This is the answer. The thread is chasing various smart-people opinions about languages, interpreters, system calls. We got tricked into click bait title and are using the opportunity to rehash our favorite topics and biases.

On the other hand… so what? It’s kind of fun.

xuanwo
0 replies
1d3h

Thanks for the comments. I have fixed the headers :)

a1o
28 replies
1d2h

Rust developers might consider switching to jemallocator for improved performance

I am curious if this is something that everyone can do to get free performance or if there are caveats. Can C codebases benefit from this too? Is this performance that is simply left on table currently?

Pop_-
11 replies
1d2h

Switching to a non-default allocator does not always bring a performance boost. It really depends on your workload, which requires profiling and benchmarking. But C/C++/Rust and other lower-level languages should all at least be able to choose from these allocators. One caveat is binary size: a custom allocator does add more bytes to the executable.
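
For the Rust side, trying one of these allocators is a small, reversible change. A minimal sketch, assuming the `tikv-jemallocator` crate (the maintained fork of the `jemallocator` crate the article mentions) has been added to Cargo.toml:

    // Route every heap allocation in the program through jemalloc
    // instead of the system allocator.
    use tikv_jemallocator::Jemalloc;

    #[global_allocator]
    static GLOBAL: Jemalloc = Jemalloc;

    fn main() {
        // The rest of the program is unchanged; only the allocator differs,
        // so it's easy to benchmark with and without this attribute.
        let data = vec![0u8; 4 * 1024 * 1024];
        println!("allocated {} bytes", data.len());
    }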

vlovich123
9 replies
1d

I don’t know why people still look to jemalloc. Mimalloc outperforms the standard allocator on nearly every single benchmark. Glibc’s allocator & jemalloc both are long in the tooth & don’t actually perform as well as state of the art allocators. I wish Rust would switch to mimalloc or the latest tcmalloc (not the one in gperftools).

masklinn
8 replies
23h54m

I wish Rust would switch to mimalloc or the latest tcmalloc (not the one in gperftools).

That's nonsensical. Rust uses the system allocators for reliability, compatibility, binary bloat, maintenance burden, ..., not because they're good (they were not when Rust switched away from jemalloc, and they aren't now).

If you want to use mimalloc in your rust programs, you can just set it as global allocator same as jemalloc, that takes all of three lines: https://github.com/purpleprotocol/mimalloc_rust#usage
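
For reference, those three lines look roughly like this, per the linked README:

    use mimalloc::MiMalloc;

    #[global_allocator]
    static GLOBAL: MiMalloc = MiMalloc;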

If you want the rust compiler to link against mimalloc rather than jemalloc, feel free to test it out and open an issue, but maybe take a gander at the previous attempt: https://github.com/rust-lang/rust/pull/103944 which died for the exact same reason the one before that (https://github.com/rust-lang/rust/pull/92249) did: unacceptable regression of max-rss.

vlovich123
7 replies
23h43m

I know it’s easy to change but the arguments for using glibc’s allocator are less clear to me:

1. Reliability - how is an alternate allocator less reliable? Seems like a FUD-based argument. Unless by reliability you mean performance in which case yes - jemalloc isn’t reliably faster than standard allocators, but mimalloc is.

2. Compatibility - again sounds like a FUD argument. How is compatibility reduced by swapping out the allocator? You don’t even have to do it on all systems if you want. Glibc is just unequivocally bad.

3. Binary bloat - This one is maybe an OK argument although I don’t know what size difference we’re talking about for mimalloc. Also, most people aren’t writing hello world applications so the default should probably be for a good allocator. I’d also note that having a dependency of the std runtime on glibc in the first place likely bloats your binary more than the specific allocator selected.

4. Maintenance burden - I don’t really buy this argument. In both cases you’re relying on a 3rd party to maintain the code.

masklinn
6 replies
23h19m

I know it’s easy to change but the arguments for using glibc’s allocator are less clear to me:

You can find them at the original motivation for removing jemalloc, 7 years ago: https://github.com/rust-lang/rust/issues/36963

Also it's not "glibc's allocator", it's the system allocator. If you're unhappy with glibc's, get that replaced.

1. Reliability - how is an alternate allocator less reliable?

Jemalloc had to be disabled on various platforms and architectures, there is no reason to think mimalloc or tcmalloc are any different.

The system allocator, while shit, is always there and functional, the project does not have to curate its availability across platforms.

2. Compatibility - again sounds like a FUD argument. How is compatibility reduced by swapping out the allocator?

It makes interactions with anything which does use the system allocator worse, and almost certainly fails to interact correctly with some of the more specialised system facilities (e.g. malloc.conf) or tooling (in rust, jemalloc as shipped did not work with valgrind).

Also, most people aren’t writing hello world applications

Most people aren't writing applications bound on allocation throughput either

so the default should probably be for a good allocator.

Probably not, no.

I’d also note that having a dependency of the std runtime on glibc in the first place likely bloats your binary more than the specific allocator selected.

That makes no sense whatsoever. The libc is the system's and dynamically linked. And changing allocator does not magically unlink it.

4. Maintenance burden - I don’t really buy this argument.

It doesn't matter that you don't buy it. Having to ship, resync, debug, and curate (cf (1)) an allocator is a maintenance burden. With a system allocator, all the project does is ensure it calls the system allocators correctly, the rest is out of its purview.

vlovich123
4 replies
22h57m

The reason the reliability & compatibility arguments don’t make sense to me is that jemalloc is still in use for rustc (again - not sure why they haven’t switched to mimalloc) which has all the same platform requirements as the standard library. There’s also no reason an alternate allocator can’t be used on Linux specifically because glibc’s allocator is just bad full stop.

It makes interactions with anything which does use the system allocator worse

That’s a really niche argument. Most people are not doing any of that and malloc.conf is only for people who are tuning the glibc allocator which is a silly thing to do when mimalloc will outperform whatever tuning you do (yes - glibc really is that bad).

or tooling (in rust, jemalloc as shipped did not work with valgrind)

That’s a fair argument, but it’s not an unsolvable one.

Most people aren’t writing applications bound on allocation throughput either

You’d be surprised at how big an impact the allocator can make even when you don’t think you’re bound on allocations. There’s also all sorts of other things beyond allocation throughput & glibc sucks at all of them (e.g. freeing memory, behavior in multithreaded programs, fragmentation etc etc).

The libc is the system’s and dynamically linked. And changing allocator does not magically unlink it

I meant that the dependency on libc at all in the standard library bloats the size of a statically linked executable.

josephg
3 replies
21h23m

jemalloc is still in use for rustc (again - not sure why they haven’t switched to mimalloc)

Performance of rustc matters a lot! If the rust compiler runs faster when using mimalloc, please benchmark & submit a patch to the compiler.

masklinn
1 replies
19h33m

I literally linked two attempts to use mimalloc in rustc just a few comments upthread.

josephg
0 replies
11h50m

Ah - my mistake; I somehow misread your comment. Pity about the RSS regression.

Personally I have plenty of RAM and I'd happily use more in exchange for a faster compile. It's much cheaper to buy more RAM than a faster CPU, but I certainly understand the choice.

With compilers I sometimes wonder if it wouldn't be better to just switch to an arena allocator for the whole compilation job. But it wouldn't surprise me if LLVM allocates way more memory than you'd expect.

vlovich123
0 replies
20h17m

Any links to instructions on how to run said benchmarks?

saagarjha
0 replies
17h12m

Not to mention that by using the system allocator you get all sorts of things “for free” that the system developers provide for you, wrt observability and standard tooling. This is especially true if the OS and the allocator are shipped by one group rather than being developed independently.

charcircuit
0 replies
23h49m

I've never not gotten increased performance by swapping out the allocator.

nh2
9 replies
1d1h

Be aware `jemalloc` will make you suffer the observability issues of `MADV_FREE`. `htop` will no longer show the truth about how much memory is in use.

* https://github.com/jemalloc/jemalloc/issues/387#issuecomment...

* https://gitlab.haskell.org/ghc/ghc/-/issues/17411

Apparently now `jemalloc` will call `MADV_DONTNEED` 10 seconds after `MADV_FREE`: https://github.com/JuliaLang/julia/issues/51086#issuecomment...

So while this "fixes" the issue, it'll introduce a confusing time delay between you freeing the memory and you observing that in `htop`.

But according to https://jemalloc.net/jemalloc.3.html you can set `opt.muzzy_decay_ms = 0` to remove the delay.

Still, the musl author has some reservations against making `jemalloc` the default:

https://www.openwall.com/lists/musl/2018/04/23/2

It's got serious bloat problems, problems with undermining ASLR, and is optimized pretty much only for being as fast as possible without caring how much memory you use.

With the above-mentioned tunables, this should be mitigated to some extent, but the general "theme" (focusing on e.g. performance vs memory usage) will likely still mean "it's a tradeoff" or "it's no tradeoff, but only if you set tunables to what you need".

the8472
2 replies
21h25m

Aiming to please people who panic about their RSS numbers seems... misguided? It seems like worrying about RAM being "used" as file cache[0].

If you want to gauge whether your system is memory-limited look at the PSI metrics instead.

[0] https://www.linuxatemyram.com/

nh2
1 replies
6h13m

Those are not the same.

You can see cache usage in htop; it has a different colour.

With MADV_FREE, it looks like the process is still using the memory.

That sucks: If you have some server that's slow, you want to SSH into a server and see how much memory each process takes. That's a basic, and good, observability workflow. Memory leaks exist, and tools should show them easily.

The point of RES is to show resident memory, not something else.

If you change htop to show the correct memory, that'd fix the issue of course.

the8472
0 replies
3h6m

Well, RES is resident in physical memory. It just is marked so that the kernel can reclaim it when it needs to. But until then it's resident. If you want to track leaks you need a resident-and-in-use metric, which may be more difficult to come by (probably requires scanning smaps?).

It's a case of people using the subtly wrong metrics and then trying to optimize tools chasing that metric rather than improving their metrics. That's what I'm calling misguided.

masklinn
1 replies
1d

The musl remark is funny, because jemalloc's use of pretty fine-grained arenas sometimes leads to better memory utilisation through reduced fragmentation. For instance Aerospike couldn't fit in available memory under (admittedly old) glibc, and jemalloc fixed the issue: http://highscalability.com/blog/2015/3/17/in-memory-computin...

And this is not a one-off: https://hackernoon.com/reducing-rails-memory-use-on-amazon-l... https://engineering.linkedin.com/blog/2021/taming-memory-fra...

jemalloc also has extensive observability / debugging capabilities, which can provide a useful global view of the system, it's been used to debug memleaks in JNI-bridge code: https://www.evanjones.ca/java-native-leak-bug.html https://technology.blog.gov.uk/2015/12/11/using-jemalloc-to-...

nh2
0 replies
6h27m

Yes, almost everybody who looks at memory usage in production will eventually discover glibc's memory fragmentation issues. This is how I learned about this topic.

Setting the env var MALLOC_MMAP_THRESHOLD_=65536 usually solves these problems instantaneously.

Most programmers seem to not bother to understand what is going on (thus arriving at the above solution) but follow "we switched to jemalloc and it fixes the issue".

(I have no opinion yet on whether jemalloc is better or worse than glibc malloc. Both have tunables, and will create problematic corner cases if the tunables are not set accordingly. The fact that jemalloc has /more/ tunables, and more observability / debugging features, seems like a pro point for those that read the documentation. For users that "just want low memory usage", both libraries' defaults look bad, and the musl attitude seems like the best default, since OOM will cause a crash vs just having the program be some percent slower.)

singron
0 replies
1d

Note that glibc has a similar problem in multithreaded contexts. It strands unused memory in thread-local pools, which grows your memory usage over time like a memory leak. We got lower memory usage that didn't grow over time by switching to jemalloc.

Example of this: https://github.com/prestodb/presto/issues/8993

saagarjha
0 replies
17h20m

Not that I would recommend using jemalloc by default but it’s definitely going to be better than musl’s allocator ;)

dralley
0 replies
22h59m
a1o
0 replies
1d1h

Thank you! That was very thorough! I will be reading the links. :)

secondcoming
0 replies
23h24m

You can override the allocator for any app via LD_PRELOAD

saagarjha
0 replies
17h16m

Performance is not a one-dimensional scale where programs go from “slow” to “fast”, because there are always other factors at play. jemalloc can be the right fit for some applications but for others another choice might be faster, but it also might be that the choice is slower but better matches their goals (less dirty memory, better observability, certain security guarantees, …)

nicoburns
0 replies
1d2h

I think it's pretty much free performance that's being left on the table. There's a slight cost to binary size. And it may not perform better in absolutely all circumstances (but it will in almost all).

Rust used to use jemalloc by default but switched as people found this surprising as the default.

kragen
0 replies
23h41m

basically that's why jason wrote it in the first place, but other allocators have caught up since then to some extent. so jemalloc might make your c either slower or faster, you'll have to test to know. it's pretty reliable at being close to the best choice

does tend to use more ram tho

kelnos
0 replies
11h39m

Rust used to use jemalloc as the default, but went back to using the system malloc back in 2018-ish[0]. Since Rust now has the GlobalAlloc trait (and the #[global_allocator] attribute), apps can use jemalloc as their allocator if they want. Not sure if there's a way for users to override via LD_PRELOAD or something, though.

It turns out jemalloc isn't always best for every workload and use case. While the system allocator is often far from perfect, it at least has been widely tested as a general-purpose allocator.

[0] https://github.com/rust-lang/rust/issues/36963

TillE
0 replies
1d

jemalloc and mimalloc are very popular in C and C++ software, yes. There are few drawbacks, and it's really easy to benchmark different allocators against each other in your particular use case.

the8472
21 replies
23h35m

There are two dedicated CPU feature flags to indicate that REP STOS/MOV are fast and usable as a short instruction sequence for memset/memcpy. Having to hand-roll optimized routines for each new CPU generation has been an ongoing pain for decades.

And yet here we are again. Shouldn't this be part of some timing testsuite of CPU vendors by now?

giancarlostoro
18 replies
23h31m

So correct me if I am wrong, but does this mean you need to compile two executables for a specific build? Or is it just that you need to compile it for specific hardware? Wondering what the fix would be - some sort of runtime check?

the8472
6 replies
22h12m

The sibling comments mention the hardware specific dynamic linking in glibc that's used for function calls. But if your compiler inlines memcpy (usually for short, fixed-sized copies) into the binary then yes you'll have to compile it for a specific CPU to get optimal performance. But that's true for all target-dependent optimizations.

More broadly compatible routines will still work on newer CPUs, they just won't yield the best performance.

It still would be nice if such central routines could just be compiled to the REP-prefixed instructions and would deliver (near-)optimal performance so we could stop worrying about that particular part.

saagarjha
3 replies
17h26m

They are, glibc already has an ERMS code path for memcpy.

the8472
2 replies
7h42m

Did you miss the part about inlined calls? I was saying that if it weren't for these issues we could have them turn into a single instruction without worrying about perf.

saagarjha
0 replies
5h46m

Generally the compiler will not inline a memcpy if it doesn't know the size it is dealing with.

gpderetta
0 replies
7h8m

In theory the compiler could always statically inline a memcpy to a (possibly nop-padded) rep mov. Then the dynamic linker could dynamically patch the instructions to a call to an out-of-line function if rep mov is known not to be optimal on the actual CPU the code is running on. The reverse (patching a call to memmove with a rep mov) is also possible.

dzaima
1 replies
5h49m

Some quick searching suggests that FSRM is used for at least 128 bytes or so (ERMS for ≥~2048 bytes for reference); in base x86 (i.e. SSE2) that's 8 loads & 8 stores, ~62 bytes of code. At that point calling into a library function isn't too unreasonable (at the very least it could utilize AVX and cut that in half to 4 loads + 4 stores, though at the cost of function call overhead & some (likely correctly-predicted) branches).

the8472
0 replies
2h59m

https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/...

Suggests that it should be usable for even shorter copies. And that's really my point. We should have One True memcpy instruction sequence that we use everywhere and stop worrying. And yet...

ww520
4 replies
23h6m

Since the CPU instructions are the same, instruction patching at startup or install time can be used. Just patch in the correct instructions for the respective hardware.

saagarjha
3 replies
17h27m

This is generally a bad idea because it requires code modification, which has security implications. Most implementations will bring in multiple implementations and select the right one at startup (amortizing the indirect call into something like the GOT which already exists).

JonChesterfield
2 replies
6h53m

There are security hazards around writable + executable code. They don't apply to patching before execution (e.g. the install step) since nothing needs to be executed at that point. I don't think the security concerns apply during load time either - what does it matter if the text section is edited before it gets marked read-only&executable? It just means you're running a slightly different program, exactly as if it was edited during install.

In the memcpy case, where the library call is probably in a dynamically linked library anyway, it's particularly trivial to bind to one of N implementations of memcpy at load time. That only patches code if library calls are usually implemented that way.

Patching .text does tend to mess up using the same shared pages across multiple executables though which is a shame, and somewhat argues for install time specialisation.

saagarjha
1 replies
5h45m

On certain platforms, it would break code signatures if they are tied to the pages the code is on.

andrekandre
0 replies
4h40m

  > On certain platforms, it would break code signatures
macos?

fweimer
3 replies
23h17m

The exact nature of the fix is unclear at present.

During dynamic linking, glibc picks a memcpy implementation which seems most appropriate for the current machine. We have about 13 different implementations just for x86-64. We could add another one for current(ish) AMD CPUs, select a different existing implementation for them, or change the default for a configurable cutover point in a parameterized implementation.
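
The general shape of that selection, sketched in Rust rather than glibc's actual ifunc machinery (assumes an x86_64 target; `copy_tuned` and the AVX2 check are stand-ins for the real variants and feature tests):

    type CopyFn = fn(&mut [u8], &[u8]);

    fn copy_portable(dst: &mut [u8], src: &[u8]) {
        dst.copy_from_slice(src);
    }

    // Stand-in for a microarchitecture-tuned routine; a real variant would
    // differ in instruction selection, not in behaviour.
    fn copy_tuned(dst: &mut [u8], src: &[u8]) {
        dst.copy_from_slice(src);
    }

    fn select_copy() -> CopyFn {
        // glibc performs an analogous check once, at dynamic-link time, and
        // binds memcpy to the chosen variant; here the check runs on every
        // call unless the result is cached.
        if is_x86_feature_detected!("avx2") {
            copy_tuned
        } else {
            copy_portable
        }
    }

    fn main() {
        let src = [1u8; 64];
        let mut dst = [0u8; 64];
        select_copy()(&mut dst, &src);
        assert_eq!(dst, src);
    }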

saagarjha
2 replies
17h32m

This code is in the kernel, so dynamic linking and glibc is not really relevant.

JonChesterfield
1 replies
7h7m

It's the same design space. If glibc has ended up with 13 versions for x64, the kernel probably has a similar number. It's an argument by analogy for how much of an annoyance this is.

saagarjha
0 replies
5h44m

No, because the kernel has particular code size and ISA restrictions.

immibis
0 replies
23h18m

glibc has the ability to dynamically link a different version of a function based on the CPU.

dralley
0 replies
23h17m

Glibc supports runtime selection of different optimized paths, yes. There was a recent discussion about a security vulnerability in that feature (discussion https://news.ycombinator.com/item?id=37756357), but in essence this is exactly the kind of thing it's useful for.

mike_hock
0 replies
17h0m

You'd think the CPU vendor knows their CPU best. If there's a faster "software" implementation, why doesn't REP MOVS at least do the same thing in microcode?

gpderetta
0 replies
7h12m

I'm completely making stuff up here, but I wonder if this is the effect of some last minute (or even post-release, via ucode update) bug fix, where page aligned fast rep movs had issues or were subject to some attack and got disabled.

drtgh
21 replies
1d3h

Rust std fs slower than Python!? No, it's hardware!

...

Python features three memory domains, each representing different allocation strategies and optimized for various purposes.

...

Rust is slower than Python only on my machine.

if one library performs wildly better than the other in the same test, on the same hardware, how can that not be a software-related problem? sounds like a contradiction.

Maybe it should be considered a coding issue and/or a missing feature? IMHO it would be expected that Rust's std library performs well without making all the users circumvent the issue manually.

The article is well investigated, so I assume the author just wanted to show that the problem exists without creating controversy, because otherwise I cannot understand it.

Pop_-
18 replies
1d3h

The root cause is AMD's bad support for rep movsb (which is a hardware problem). However, Python by default reads into a buffer at a small offset, while lower-level languages (Rust and C) do not, which is why Python seems to perform better than C/Rust. It "accidentally" avoided the hardware problem.
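
A minimal sketch of that mitigation in Rust (a hypothetical helper, not the article's or OpenDAL's actual code): read the file into a buffer whose contents start a few dozen bytes past the start of the allocation, which is roughly what CPython ends up doing by accident because of its object header.

    use std::fs::File;
    use std::io::Read;

    // Read `path` into a buffer, placing the contents `offset` bytes past the
    // start of the allocation (CPython's bytes objects effectively use 0x20).
    // Callers have to slice off the first `offset` bytes themselves.
    fn read_with_offset(path: &str, offset: usize) -> std::io::Result<Vec<u8>> {
        let len = std::fs::metadata(path)?.len() as usize;
        let mut buf = vec![0u8; offset + len];
        File::open(path)?.read_exact(&mut buf[offset..])?;
        Ok(buf)
    }

    fn main() -> std::io::Result<()> {
        let path = std::env::args().nth(1).expect("usage: read_with_offset <file>");
        let buf = read_with_offset(&path, 0x20)?;
        println!("read {} bytes at offset 0x20", buf.len() - 0x20);
        Ok(())
    }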

CoastalCoder
9 replies
1d3h

I'm not sure it makes sense to pin this only on AMD.

Whenever you're writing performance-critical software, you need to consider the relevant combinations of hardware + software + workload + configuration.

Sometimes a problem can be created or fixed by adjusting any one / some subset of those details.

hobofan
7 replies
1d3h

If that's a bug that only happens with AMD CPUs, I think that's totally fair.

If we start adding in exceptions at the top of the software stack for individuals failures of specific CPUs/vendors, that seems like a strong regression from where we are today in terms of ergonomics of writing performance-critical software. We can't be writing individual code for each N x M x O x P combination of hardware + software + workload + configuration (even if you can narrow down the "relevant" ones).

jpc0
3 replies
1d2h

We can't be writing individual code for each N x M x O x P combination of hardware + software + workload + configuration

That is kind of exactly what you would do when optimising for popular platforms.

If this error occurs on an AMD CPU used by half your users, is your response to your users going to be "just buy a different CPU", or are you going to fix it in code and ship a "performance improvement on XYZ platform" update?

jacoblambda
0 replies
1d1h

Nobody said "just buy a different CPU" anywhere in this discussion or the article. And they are pinning the root cause on AMD which is completely fair because they are the source of the issue.

Given that the fix is within the memory allocator, there is already a relatively trivial fix for users who really need it (recompile with jemalloc as the global memory allocator).

For everyone else, it's probably better to wait until AMD reports back with an analysis from their side and either recommends an "official" mitigation or pushes out a microcode update.

hobofan
0 replies
17h37m

Yeah, but even if you'd take this on as your responsibility (while it should really be the CPU vendor fixing it), you would like to resolve it much lower in the stack, like the Rust compiler/standard library or LLVM, and not individually in any Rust library that happens to stumble upon that problem.

ansible
0 replies
1d1h

The fix is that AMD needs to develop, test and deploy a microcode update for their affected CPUs, and then the problem is truly fixed for everyone, not just the people who have detected the issue and tried to mitigate it.

richardwhiuk
1 replies
1d1h

You are going to be disappointed when you find out there's lots of architecture and CPU specific code in software libraries and the kernel.

hobofan
0 replies
17h34m

That's completely fine in kernels and low-level libraries, but if I find that in a library as high-level as opendal, I'll definitely mark it down as a code smell.

pmontra
0 replies
1d

Well, if Excel were running at half the speed (or at half the speed of LibreOffice Calc!) on half of the machines around here, somebody at Redmond would notice, find the hardware bug and work around it.

I guess that in most big companies it suffices that there is a problem with their own software running on the laptop of a C* manager or of somebody close to there. When I was working for a mobile operator the antennas the network division cared about most were the ones close to the home of the CEO. If he could make his test calls with no problems they had the time to fix the problems of the rest of the network in all the country.

Pop_-
0 replies
1d3h

It's a known issue for AMD and has been tested by multiple people, and by the data provided by the author. It's fair to pin this problem to AMD.

meneer_oke
4 replies
1d2h

It doesn't just seem faster - "seem" would imply that it isn't actually the case. It currently is faster on that setup.

But since the Python runtime is written in C, the issue can't be Python vs C.

TylerE
2 replies
1d1h

C is a very wide target. There are plenty of things that one can do “in C” that no human would ever write. For instance, the C code generated by languages like nim and zig that essentially use C as a sort of IR.

meneer_oke
1 replies
1d1h

That is true, with C a lot is possible.

However, Python by default reads into a buffer at a small offset, while lower-level languages (Rust and C)

Yet if the runtime is made with C, then that statement is incorrect.

bilkow
0 replies
21h36m

By going through that line of thought, you could also argue that the slow path in the C and Rust versions is actually implemented in C, as memcpy is in glibc. Hence, Python being faster than Rust would also mean in this case that Python is faster than C.

The point is not that one language is faster than another. The point is that the default way to implement something in a language ended up being surprisingly faster when compared to other languages in this specific scenario due to a performance issue in the hardware.

In other words: on this specific hardware, the default way to do this in Python is faster than the default way to do this in C and Rust. That can be true, as Python does not use C in the default way, it adds an offset! You can change your implementation in any of those languages to make it faster, in this case by just adding an offset, so it doesn't mean that "Python is faster than C or Rust in general".

topaz0
0 replies
23h48m

It's obviously not Python vs C -- the time difference turns out to be in kernel code (the system call) and not user code at all, and the post explicitly constructs a C program that doesn't have the slowdown by adding a memory offset. It just turns up by default in a comparison of Python vs C code because Python reads have a memory offset by default (for completely unrelated reasons) and analogous C reads don't. In principle you could also construct Python code that does see this slowdown; it would just be much less likely to show up at random. So the Python vs C comparison is a total red herring here; it just happened to be what the author noticed and used as a hook to understand the problem.

formerly_proven
1 replies
1d2h

That extra 0x20 (32 byte) offset is the size of the PyBytes object header for anyone wondering; 64 bits each for type object pointer, reference count, base pointer and item count.
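
To make the arithmetic concrete, here is a rough Rust stand-in for that 32-byte header (field names are my approximation of CPython's bytesobject.h, not taken from the article; treat it as a sketch):

    // Four 8-byte fields on a 64-bit build add up to 0x20, the offset the
    // article observed before the bytes payload. Names and order approximate.
    #[allow(dead_code)]
    #[repr(C)]
    struct PyBytesHeaderSketch {
        ob_refcnt: isize,   // reference count
        ob_type: *const (), // pointer to the type object
        ob_size: isize,     // number of items (bytes stored)
        ob_shash: isize,    // cached hash value
    }

    fn main() {
        assert_eq!(std::mem::size_of::<PyBytesHeaderSketch>(), 0x20);
        println!("payload starts {:#x} bytes into the object",
                 std::mem::size_of::<PyBytesHeaderSketch>());
    }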

mrweasel
0 replies
23h41m

Thank you, because I was wondering if some Python developer found the same issue and decided to just implement the offset. It makes much more sense that it just happens to work out that way in Python.

magicalhippo
0 replies
1d1h

I recall when Pentium was introduced we were told to avoid rep and write a carefully tuned loop ourselves. To go really fast one could use the FPU to do the loads and stores.

Not too long ago I read in Intel's optimization guidelines that rep was now faster again and should be used.

Seems most of these things need to be benchmarked on the target CPU, as they change "all the time". I've sped up plenty of code by just replacing hand-crafted assembly with functionally equivalent high-level code.

Of course, so-slow-it's-bad is different; however, a runtime-determined implementation choice would avoid that as well.

mwcampbell
1 replies
1d3h

Years ago, Rust's standard library used jemalloc. That decision substantially increased the minimum executable size, though. I didn't publicly complain about it back then (as far as I can recall), but perhaps others did. So the Rust library team switched to using the OS's allocator by default.

Maybe using an alternative allocator only solves the problem by accident and there's another way to solve it intentionally; I don't yet fully understand the problem. My point is that using a different allocator by default was already tried.

saghm
0 replies
23h44m

I didn't publicly complain about it back then (as far as I can recall), but perhaps others did. So the Rust library team switched to using the OS's allocator by default.

I've honestly never worked in a domain where binary size ever really mattered beyond maybe invoking `strip` on a binary before deploying it, so I try to keep an open mind. That said, this has always been a topic of discussion around Rust[0], and while I obviously don't have anything against binary sizes being smaller, bugs like this do make me wonder about huge changes like switching the default allocator where we can't really test all of the potential side effects; next time, the unintended consequences might not be worth the tradeoff.

[0]: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...

pmontra
9 replies
1d1h

However, mmap has other uses too. It's commonly used to allocate large regions of memory for applications.

Slack is allocating 1132 GB of virtual memory on my laptop right now. I don't know if they are using mmap but that's 1100 GB more than the physical memory.

Waterluvian
5 replies
1d1h

I’m not sure allocations mean anything practical anymore. I recall OSX allocating ridiculous amounts of virtual memory to stuff but never found OSX or the software to ever feel slow and pagey.

dietrichepp
4 replies
1d1h

The way I describe mmap these days is to say it allocates address space. This can sometimes be a clearer way of describing it, since the physical memory will only get allocated once you use the memory (maybe never).
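
A minimal Linux sketch of that idea using the libc crate (sizes and flags are illustrative): reserve a huge range of address space up front; physical pages only appear when you actually touch them.

    // Reserve 1 TiB of address space; only the single page we write to ever
    // gets backed by physical memory. Linux-only sketch using the libc crate.
    fn main() {
        unsafe {
            let len = 1usize << 40; // 1 TiB of virtual address space
            let p = libc::mmap(
                std::ptr::null_mut(),
                len,
                libc::PROT_READ | libc::PROT_WRITE,
                libc::MAP_PRIVATE | libc::MAP_ANONYMOUS | libc::MAP_NORESERVE,
                -1,
                0,
            );
            assert_ne!(p, libc::MAP_FAILED);
            *(p as *mut u8) = 42; // this write faults in exactly one physical page
            libc::munmap(p, len);
        }
    }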

byteknight
3 replies
1d1h

But is it not still limited by the total RAM + page/swap file size?

aseipp
1 replies
1d1h

Maybe I'm misunderstanding you but: no, you can allocate terabytes of address space on modern 64-bit Linux on a machine with only 8GB of RAM with overcommit. Try it; you can allocate 2^46 bytes of space (~= 70TB) today, with no problem. There is no limit to the allocation space in an overcommit system; there is only a limit to the actual working set, which is very different.

j16sdiz
0 replies
22h56m

You can do it without overcommit -- you can just back the mmap with a file.

wbkang
0 replies
1d1h

I don't think so, but it's difficult to find an actual reference. For sure it does overcommit like crazy. Here's an output from my mac:

% ps aux | sort -k5 -rh | head -1

xxxxxxxx 88273 1.2 0.9 1597482768 316064 ?? S 4:07PM 35:09.71 /Applications/Slack.app/Contents/Frameworks/Slack Helper (Renderer).app/...

Since ps displays vsz column in KiB, 1597482768 corresponds to 1TB+.

aseipp
1 replies
1d1h

That is Chromium doing it, and yes, it is using mmap to create a very large, (almost certainly) contiguous range of memory. Many runtimes do this, because it's useful (on 64-bit systems) to create a ridiculously large virtually mapped address space and then only commit small parts of it over time as needed, because it makes memory allocation simpler in several ways; notably it means you don't have to worry about allocating new address spaces when simply allocating memory, and it means answering things like "Is this a heap object?" is easier.

rasz
0 replies
22h24m

dolphin emulator has recent example of this: https://dolphin-emu.org/blog/2023/11/25/dolphin-progress-rep...

seems its not without perils on Windows:

"In an ideal world, that would be all we have to say about the new solution. But for Windows users, there's a special quirk. On most operating systems, we can use a special flag to signal that we don't really care if the system has 32 GiB of real memory. Unfortunately, Windows has no convenient way to do this. Dolphin still works fine on Windows computers that have less than 32 GiB of RAM, but if Windows is set to automatically manage the size of the page file, which is the case by default, starting any game in Dolphin will cause the page file to balloon in size. Dolphin isn't actually writing to all this newly allocated space in the page file, so there are no concerns about performance or disk lifetime. Also, Windows won't try to grow the page file beyond the amount of available disk space, and the page file shrinks back to its previous size when you close Dolphin, so for the most part there are no real consequences... "

Pop_-
0 replies
1d1h

I don't know why but this really makes me laugh

Pop_-
8 replies
1d4h

Disclaimer: The title has been changed to "Rust std fs slower than Python!? No, it's hardware!" to avoid clickbait. However, I'm not able to fix the title on HN.

3cats-in-a-coat
5 replies
1d2h

What's the TLDR on how... hardware performs differently on two software runtimes?

lynndotpy
2 replies
1d2h

One of the very first things in the article is a TLDR section that points you to the conclusion.

In conclusion, the issue isn't software-related. Python outperforms C/Rust due to an AMD CPU bug.

j16sdiz
1 replies
1d2h

It is software-related. It's just that the CPU performs badly on some software instruction sequences.

xuanwo
0 replies
1d2h

FSRM is a CPU feature embedded in the microcode (in this instance, amd-ucode) that software such as glibc cannot interact with. I refer to it as hardware because I consider microcode a part of the hardware.

pornel
1 replies
1d2h

AMD's implementation of the `rep movsb` instruction is surprisingly slow when addresses are page-aligned. Python's allocator happens to add a 16-byte offset that avoids the hardware quirk/bug.
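
If you want to poke at this yourself, here is a rough Rust rendition of the kind of experiment the article runs (file path, buffer size and iteration count are placeholders; the gap only shows up on affected CPUs): read the same file into a page-aligned destination and into one nudged by a small offset.

    use std::alloc::{alloc, dealloc, Layout};
    use std::fs::File;
    use std::io::Read;
    use std::slice;
    use std::time::Instant;

    fn main() -> std::io::Result<()> {
        const LEN: usize = 4 * 1024 * 1024; // placeholder size
        const OFFSET: usize = 0x20;         // roughly what Python's bytes object adds

        // Page-aligned backing buffer, so the off == 0 case really is page-aligned.
        let layout = Layout::from_size_align(LEN + OFFSET, 4096).unwrap();
        let base = unsafe { alloc(layout) };
        assert!(!base.is_null());

        for &off in &[0usize, OFFSET] {
            let dst = unsafe { slice::from_raw_parts_mut(base.add(off), LEN) };
            let start = Instant::now();
            for _ in 0..100 {
                let mut f = File::open("/tmp/testfile")?; // placeholder path
                f.read_exact(&mut dst[..])?;
            }
            println!("destination offset {:#x}: {:?}", off, start.elapsed());
        }

        unsafe { dealloc(base, layout) };
        Ok(())
    }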

sound1
0 replies
23h48m

thank you, upvoted!

sharperguy
0 replies
1d2h

"Works on contingency? No, money down!"

pvg
0 replies
1d1h

you can mail hn@ycombinator.com and they can change it for you to whatever.

sgift
5 replies
1d4h

Either the author changed the headline to something less clickbaity in the meantime or you edited it for clickbait, Pop_- (in that case: shame on you) - current headline: "Rust std fs slower than Python!? No, it's hardware!"

xuanwo
2 replies
1d4h

Sorry for the clickbaity title; I have changed it based on others' advice.

thechao
0 replies
1d3h

I disagree that it's clickbait-y. Diving down from Python bindings to ucode is ... not how things usually go. Doubly so, since Python is a very mature runtime, and I'd be inclined to believe they've dug up file-reading Kung Fu not available to the Average Joe.

jll29
0 replies
17h34m

Thanks for this unexpected, thriller-like read.

I'm impressed by your perseverance, how you follow through with your investigation to the lowest (hardware) level.

epage
0 replies
1d4h

Based on the /r/rust thread, the author seemed to change the headline based on feedback to make it less clickbait-y

Pop_-
0 replies
1d4h

The author has updated the title and also contacted me, but unfortunately I'm no longer able to update it here.

londons_explore
5 replies
1d2h

So the obvious thing to do... Send a patch to change the "copy_user_generic" kernel method to use a different memory copying implementation when the CPU is detected to be a bad one and the memory alignment is one that triggers the slowness bug...

p3n1s
3 replies
1d1h

Not obvious. If it can be corrected with microcode, it seems better to have people use updated microcode rather than litter the kernel with fixes for what are effectively patchable problems.

The accepted fix would not be trivial for anyone not already experienced with the kernel. But more importantly, it isn't obvious what the right way to enable the workaround is. The best approach is probably to measure at boot time; otherwise, how do you know which models and steppings are affected?
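
As a userspace illustration of that kind of gating (a kernel patch would use its own cpufeature tables instead), here's a crude Linux-only sketch that checks the vendor and the fsrm flag via /proc/cpuinfo:

    use std::fs;

    // Crude runtime check: is this an AMD part that advertises FSRM? A real
    // mitigation would also key on family/model/stepping, or better, measure.
    fn main() {
        let info = fs::read_to_string("/proc/cpuinfo").unwrap_or_default();
        let is_amd = info.contains("AuthenticAMD");
        let has_fsrm = info
            .lines()
            .find(|l| l.starts_with("flags"))
            .map_or(false, |l| l.split_whitespace().any(|f| f == "fsrm"));
        println!("AMD: {}, FSRM advertised: {}", is_amd, has_fsrm);
    }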

londons_explore
2 replies
1d1h

I don't think AMD does microcode updates for performance issues, do they? I thought those were strictly for correctness or security issues.

If the vendor won't patch it, then a workaround is the next best thing. There shouldn't be many places to patch - that's why all the copying code is in just a handful of functions.

prirun
0 replies
1d

If AMD has a performance issue and doesn't fix it, AMD should pay the negative publicity costs rather than kernel and library authors adding exceptions. IMHO.

p3n1s
0 replies
1d1h

A significant performance degradation due to normal use of the instruction (FSRM), not otherwise documented, is a correctness problem. Especially considering that the workaround is to avoid using the CPU feature in many cases. People pay for this CPU feature; now they need kernel tooling to warn them when they fall back to some slower workaround because of an alignment issue way up the stack.

saagarjha
0 replies
17h23m

It’s not a trivial fix. Besides the fix likely being in microcode (where AMD figures out why aliasing is broken for addresses that are close to page-aligned), even a software mitigation would be complex because the kernel cannot actually use the vector instructions that are typically used for the fallback path when ERMS is not available.

diamondlovesyou
5 replies
1d

AMD's string store is not like Intel's. Generally, you don't want to use it until you are past the CPU's L2 size (L3 is a victim cache), making ~2k WAY too small. Once past that point, it's profitable to use the string store, which should run at "DRAM speed". But it has a high startup cost, hence 256-bit vector loads/stores should be used until that threshold is met.

js2
2 replies
23h42m

Isn't the high startup cost what FSRM is intended to solve?

With the new Zen3 CPUs, Fast Short REP MOV (FSRM) is finally added to AMD’s CPU functions analog to Intel’s X86_FEATURE_FSRM. Intel had already introduced this in 2017 with the Ice Lake Client microarchitecture. But now AMD is obviously using this feature to increase the performance of REP MOVSB for short and very short operations. This improvement applies to Intel for string lengths between 1 and 128 bytes and one can assume that AMD’s implementation will look the same for compatibility reasons.

https://www.igorslab.de/en/cracks-on-the-core-3-yet-the-5-gh...

diamondlovesyou
1 replies
22h36m

Fast is relative here. These are microcoded instructions, which are generally terrible for latency: microcoded instructions don't get branch prediction benefits, nor OoO benefits (they lock the FE/scheduler while running). Small memcpy/moves are always latency bound, hence even if the HW supports "fast" rep store, you're better off not using them. L2 is wicked fast, and these copies are linear, so prediction will be good.

Note that for rep store to be better it must overcome the cost of the initial latency and then catch up to the 32-byte vector copies, which, yes, generally don't reach DRAM speed, but they aren't that bad either. Thus for small copies... just don't use the string store.

All this is not even considering non-temporal loads/stores; many larger copies would see better perf by not trashing the L2 cache, since the destination or source is often not inspected right after. String stores don't have a non-temporal option, so this has to be done with vectors.
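
For the curious, here's a sketch of what a non-temporal (cache-bypassing) copy looks like with AVX streaming stores; it's illustrative rather than a tuned memcpy, and it assumes x86_64 with AVX, a 32-byte-aligned destination, and a length that's a multiple of 32.

    // x86_64-only sketch; streaming stores bypass the cache hierarchy so a big
    // one-off copy doesn't evict useful data from L2/L3.
    #[target_feature(enable = "avx")]
    unsafe fn copy_nontemporal(dst: *mut u8, src: *const u8, len: usize) {
        use core::arch::x86_64::{__m256i, _mm256_loadu_si256, _mm256_stream_si256, _mm_sfence};
        let mut i = 0;
        while i < len {
            let v = _mm256_loadu_si256(src.add(i) as *const __m256i);
            _mm256_stream_si256(dst.add(i) as *mut __m256i, v);
            i += 32;
        }
        _mm_sfence(); // order the streaming stores before returning
    }

    fn main() {
        if is_x86_feature_detected!("avx") {
            let src = vec![0xAAu8; 4096];
            let mut dst = vec![0u8; 4096 + 32];
            let off = dst.as_ptr().align_offset(32); // find a 32-byte-aligned start
            unsafe { copy_nontemporal(dst.as_mut_ptr().add(off), src.as_ptr(), 4096) };
        }
    }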

js2
0 replies
20h27m

I'm not sure that your comment is responsive to the original post.

FSRM is fast on Intel, even with single byte strings. AMD claims to support FSRM with recent CPUs but performs poorly on small strings, so code which Just Works on Intel has a performance regression when running on AMD.

Now here you're saying `REP MOVSB` shouldn't be used on AMD with small strings. In that case, AMD CPUs shouldn't advertise FSRM. As long as they're advertising it, it shouldn't perform worse than the alternative.

https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/2030515

https://sourceware.org/bugzilla/show_bug.cgi?id=30994

I'm not a CPU expert so perhaps I'm misinterpreting you and we're talking past each other. If so, please clarify.

rasz
1 replies
23h49m

Or you leave it as is, forcing AMD to fix their shit. "Fast string mode" has been strongly hinted at as _the_ optimal way since over 30 years ago with the Pentium Pro, reinforced over 10 years ago with ERMSB and again 4 years ago with FSRM. AMD, get with the program.

saagarjha
0 replies
17h8m

rep movsb might have been fast at one point but it definitely was not for a few decades in the middle, where vector stores were the fastest way to implement memcpy. Intel decided that they should probably make it fast again and they have slowly made it competitive with the extensions you’ve mentioned. But for processors that don’t support it, using rep movsb is going to be slow and probably not something you’d want to pick unless you have weird constraints (binary size?)

codedokode
4 replies
21h41m

Why is there a need to move memory? Can't the hardware DMA data into non-page-aligned memory? Or does Linux not want to load non-aligned data?

wmf
3 replies
20h51m

The Linux page cache keeps data page-aligned, so if you want the data at an unaligned address, Linux will copy it.

codedokode
2 replies
20h40m

What if I don't want to use cache?

wmf
0 replies
20h7m

You can use O_DIRECT although that also forces alignment IIRC.
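
For reference, a sketch of opening a file with O_DIRECT from Rust (Linux-only; the path is a placeholder, and real O_DIRECT I/O also needs block-aligned buffers, offsets and lengths):

    use std::fs::OpenOptions;
    use std::os::unix::fs::OpenOptionsExt;

    // Bypass the page cache entirely; the kernel then transfers straight into
    // the user buffer, which must satisfy the filesystem's alignment rules.
    fn main() -> std::io::Result<()> {
        let file = OpenOptions::new()
            .read(true)
            .custom_flags(libc::O_DIRECT)
            .open("/tmp/testfile")?; // placeholder path
        // Reads from `file` would need a suitably aligned buffer, e.g. one
        // obtained via std::alloc with 512- or 4096-byte alignment.
        drop(file);
        Ok(())
    }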

tedunangst
0 replies
20h23m

Pull out some RAM sticks.

Aissen
3 replies
1d3h

Associated glibc bug (Zen 4 though): https://sourceware.org/bugzilla/show_bug.cgi?id=30994

Arnavion
1 replies
1d

The bug is also about Zen 3, and even mentions the 5900X (the article author's CPU).

nabakin
0 replies
20h23m

If you read the bug tracker, a comment mentions this affects Zen 3 and Zen 4

fweimer
0 replies
23h16m

fsniper
2 replies
1d2h

The article itself is a great read and it has fascinating info related to this issue.

However I am more interested/concerned about another part. How the issue is reported/recorded and how the communications are handled.

Reporting is done over Discord, which is a proprietary environment that is not indexed or searchable, and will not be archived.

Communications and deliberations are done over Discord and Telegram, the latter of which is probably worse than Discord in this context.

This blog post and the GitHub repository are the lingering remains of them. If Xuanwo had not blogged this, it would have been lost in the timeline.

Isn't this fascinating?

upsuper
0 replies
4h36m

Yes, they are proprietary, which is not great. But I don't buy the allegation that they are not indexed or searchable. There are very few IMs that provide a built-in, publicly accessible log that is indexed or searchable by default. Does every IRC server come with a public log? What about Matrix groups? How do discussions there not get lost in the timeline?

You can provide public logs of them not because they are not proprietary, but because they have an API that allows logging. Telegram also has such an API, and FWIW our discussion group does have a searchable log that you can access here: https://luoxu-web.vercel.app/#g=1264662201 It is not publicly indexable more out of privacy concerns, again not because the platform is proprietary.

jll29
0 replies
17h40m

Reporting is done over discord, which is a proprietary environment which is not indexed, or searchable. Will not be archived.

That's why I don't accept the response "but there's Discord now" whenever I moan about USENET's demise. Back in the days before its demise, every post was nicely searchable via DejaNews (later Google).

We need to get back to open standards for important communications (e.g. all open source projects that are important to the Internet/WWW stack and core programming and libraries).

amluto
2 replies
1d2h

I sent this to the right people.

saagarjha
1 replies
17h7m

(…at AMD?)

amluto
0 replies
16h59m

At AMD.

Pesthuf
2 replies
1d4h

Clickbait headline, but the article is great!

saghm
0 replies
1d

I think there might be a range of where people draw the line between reasonable headlines and clickbait, because I tend to think of clickbait as something where the "answer" to some question is intentionally left out to try to bait people into clicking. For this article, something I'd consider clickbait would be something like "Rust std fs is slower than Python?" without the answer after. More commonly, the headline isn't phrased directly as a question, but instead of saying something like "So-and-so musician loves burritos", it will leave out the main detail and say something like "The meal so-and-so eats before every concert", which is trying to get you to click and have to read through lots of extraneous prose just to find the word "burritos".

Having a hook to get people to want to read the article is reasonable in my opinion; after all, if you could fit every detail in the size of a headline, you wouldn't need an article at all! Clickbait inverts this by _only_ having enough substance that you could fit all the info in the headline, but instead it leaves out the one detail that's interesting and then pads it with fluff that you're forced to click and read through if you want the answer.

joshfee
0 replies
1d

Surprisingly I think this usage of clickbait is totally reasonable because it matches the author's initial thoughts/experiences of "what?! this can't be right..."

royjacobs
1 replies
1d4h

I was prepared to read the article and scoff at the author's misuse of std::fs. However, the article is a delightful succession of rabbit holes and mysteries. Well written and very interesting!

bri3d
0 replies
1d1h

This was such a good article! The debugging was smart (writing test programs to peel each layer off), the conclusion was fascinating and unexpected, and the writing was clear and easy to follow.

lxe
1 replies
23h30m

So Python isn't affected by the bug because pymalloc performs better on buggy CPUs than jemalloc or malloc?

js2
0 replies
18h21m

It has nothing to do with pymalloc's performance per se.

Rather, the performance issue only occurs when using `rep movsb` on AMD CPUs with certain page/data alignment.

Pymalloc just happens to be using page/data alignment that makes `rep movsb` happy while Rust's default allocator is using alignments that just happen to make `rep movsb` sad.
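
A quick way to see where a buffer lands relative to a page boundary (a toy check; exact placement depends on the allocator, the allocation size, and the platform):

    // Print the low bits of a freshly allocated buffer's address. With glibc's
    // malloc, a large Vec often starts 0x10 past a page boundary, while the
    // payload of a Python bytes object starts 0x20 past the object's start.
    fn main() {
        let buf = vec![0u8; 4 * 1024 * 1024];
        println!("buffer address mod 4096 = {:#x}", buf.as_ptr() as usize & 0xfff);
    }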

lxe
1 replies
23h33m

I wonder what other things we can improve by removing Spectre mitigations and tuning hugepages, syscall latency, and core affinity.

saagarjha
0 replies
17h6m

Mitigations did not have a meaningful performance impact here.

exxos
1 replies
1d3h

It's the hardware. Of course Rust remains the fastest and safest language and you must rewrite your applications in Rust.

dang
0 replies
22h35m

You've been posting like this so frequently as to cross into abusing the forum, so I've banned the account.

If you don't want to be banned, you're welcome to email hn@ycombinator.com and give us reason to believe that you'll follow the rules in the future. They're here: https://news.ycombinator.com/newsguidelines.html.

explodingwaffle
1 replies
1d2h

Anyone else feeling the frequency illusion with rep movsb?

(https://lock.cmpxchg8b.com/reptar.html)

saagarjha
0 replies
17h4m

This is unrelated.

eigenform
1 replies
20h25m

would be lovely if ${cpu_vendor} would document exactly how FSRM/ERMS/etc are implemented and what the expected behavior is

saagarjha
0 replies
17h7m

It is documented; this is a performance bug.

titaniumtown
0 replies
1d1h

Extremely well written article! Very surprising outcome.

jokethrowaway
0 replies
23h18m

Clickbait title but interesting article.

This has nothing to do with python or rust

iampims
0 replies
1d4h

Most interesting article I've read this week. Excellent write-up.

fulafel
0 replies
53m

A related thing from the times when it was common for memory layout artifacts to have a high impact on software performance: https://en.wikipedia.org/wiki/Cache_coloring

forrestthewoods
0 replies
1d

Delightful article. Thank you, author, for sharing! I felt like I experienced every shocking twist and surprise in your journey, as if I was right there with you all along.

darkwater
0 replies
1d

Totally unrelated but: this post talks about the bug being first discovered in OpenDAL [1], which seems to be an Apache (Incubator) project to add an abstraction layer for storage over several types of storage backend. What's the point/use case of such an abstraction? Anybody using it?

[1] https://opendal.apache.org/

comonoid
0 replies
1d1h

jemalloc was Rust's default allocator till 2018.

https://internals.rust-lang.org/t/jemalloc-was-just-removed-...