
Garbage collection for systems programmers (2023)

pron
86 replies
1d2h

Except in the special case where all memory can be easily handled in arenas, good tracing GCs have long ago surpassed manual memory management in throughput, and more recently their latency impact has become more than acceptable for the vast majority of applications (OpenJDK's ZGC has typical pause times measured in double/triple-digit microseconds, and the worst case rarely exceeds 1ms for a reasonable allocation rate -- the pauses are in the same ballpark as OS-induced ones). The only real and significant tradeoff is in memory footprint, and outside of specialty niches (where arenas just work for everything and worst-case latency is in the low-microseconds range) that is the only high-order question: is my application running in a memory-constrained environment (or is it really worth sacrificing other things to keep RAM consumption down), or not?

flohofwoe
35 replies
1d2h

Except in the special case ...

IME it's the other way around: per-object individual lifetimes are the rare special case; in most real-world code there will be many related objects with the same or very similar lifetimes. In such code, tracking individual object lifetimes is overkill (in the end, memory management is all about lifetimes, and fewer individual lifetimes is always better than many, because it means less work in both manual and automatic memory management).

Not having to think about object lifetimes is just very convenient; that's why GC languages became successful despite the significant under-the-hood complexity of a good garbage collector.

zozbot234
25 replies
23h58m

IME it's the other way around: per-object individual lifetimes are the rare special case

It depends on your application domain. But in most cases where objects have "individual lifetimes" you can still use reference counting, which has lower latency and memory overhead than tracing GC and interacts well with manual memory management. Tracing GC can then be "plugged in" for very specific cases, preferably using a high performance concurrent implementation much like https://github.com/chc4/samsara (for Rust) or https://github.com/pebal/sgcl (for C++).

pron
19 replies
22h39m

Reference counting does not have lower latency than a tracing GC nor is it more generally predictable. In terms of performance, it is usually worse than a modern tracing GC by most performance metrics. Its benefits are, indeed, lower memory footprint and that it can achieve reasonable latency even using a crude implementation. In other words, you'd reach for refcounting either if you mostly care about footprint or if you don't have a lot of resources to invest in implementing the GC.

Modern concurrent tracing has very little impact on latency. First, compared to refcounting, the barriers are rarely invoked. Second, they compact the heap concurrently in the background, and like all mark-compact GCs, their amortized cost drops to zero as the RAM increases (the work is linear with the size of the working set, which is roughly constant for a given program, but it needs to be done less frequently the more RAM you have). That's why high-performance languages that rely heavily on a GC -- Java, C#, Go -- prefer tracing.
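
A hedged back-of-envelope for that last claim (the symbols here are mine, not from the thread): let L be the live set, H the heap size, A the allocation rate, and c the roughly constant cost of tracing/compacting one byte of live data. Then, approximately:

  cost per GC cycle    ~ c * L
  cycles per second    ~ A / (H - L)
  amortized CPU cost   ~ c * L * A / (H - L)

which falls toward zero as H grows while L and A stay fixed -- the "trade RAM for less GC work" behaviour described above.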

bluGill
10 replies
21h17m

When the reference count can only be zero or one, reference counting has different performance characteristics than when it can be higher, and it is much easier to reason about. This covers most cases.

rbehrends
4 replies
20h18m

What happens if a large std::unordered_map<std::string, std::string> has its destructor called?

The maximum number of references is a red herring. While capping the refcount at 1 allows you to elide the actual reference count and makes pointer assignment cheaper, it does not materially affect the primary source of latency in a reference-counting implementation, namely cascading deletions.
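
To make that concrete, here is a minimal C++ sketch (illustrative, not a benchmark): whether the owner is a unique_ptr or a shared_ptr whose count just dropped to zero, the entire teardown of the map -- every node, key and value -- happens at that one point, on the thread that released the last reference.

  #include <chrono>
  #include <iostream>
  #include <memory>
  #include <string>
  #include <unordered_map>

  int main() {
    // Build a large map; every entry owns two heap-allocated strings.
    auto map = std::make_unique<std::unordered_map<std::string, std::string>>();
    for (int i = 0; i < 1000000; ++i)
      map->emplace("key-" + std::to_string(i), std::string(64, 'x'));

    // Releasing the last (and only) reference triggers the cascade:
    // millions of frees run right here, inline, before reset() returns.
    auto start = std::chrono::steady_clock::now();
    map.reset();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start).count();
    std::cout << "teardown took " << ms << " ms\n";
  }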

bluGill
2 replies
20h4m

Don't do that. Good data design and architecture are always needed.

rbehrends
1 replies
15h47m

1. Such data structures (or more generally, std::map<std::string, std::vector<std::string>> or something like that) are the natural way to represent e.g. dictionaries. So, you absolutely often need to do that, and "don't do that" doesn't help here.

2. This general issue extends to virtually all collections. The idea that you should avoid large collections is not a practical solution to real world problems.

3. An alternative solution would be lazy destruction, but that comes with its own issues, such as a really bad worst-case memory overhead or making RC sufficiently more complex that it's not really a win over tracing GC anymore [1].

[1] https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&d...
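
For what it's worth, a crude version of the "lazy destruction" idea from point 3 is easy to sketch in C++ (the names here are made up for illustration): retire the big object into a queue and let something off the hot path destroy it later. The cost is exactly the worst-case memory overhead mentioned above -- dead data lingers until the queue is drained.

  #include <deque>
  #include <memory>
  #include <mutex>
  #include <utility>

  // Retired objects are type-erased as shared_ptr<void>; the control block
  // remembers the real destructor, so popping one runs the full cascade.
  class DeferredReclaimer {
  public:
    template <typename T>
    void retire(T obj) {
      std::lock_guard<std::mutex> lock(mutex_);
      retired_.push_back(std::make_shared<T>(std::move(obj)));
    }

    // Destroy at most one retired object; call this off the hot path
    // (e.g. from a background thread or between requests).
    bool drain_one() {
      std::shared_ptr<void> victim;
      {
        std::lock_guard<std::mutex> lock(mutex_);
        if (retired_.empty()) return false;
        victim = std::move(retired_.front());
        retired_.pop_front();
      }
      return true;  // the cascade runs as 'victim' goes out of scope here
    }

  private:
    std::mutex mutex_;
    std::deque<std::shared_ptr<void>> retired_;
  };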

pkolaczk
0 replies
4h54m

It’s not an issue in practice because, unlike with a tracing GC, it doesn’t affect the other threads. So it doesn’t really count as a pause.

tomp
0 replies
6h2m

Delaying cascading deletion is much easier to implement than a high-performance tracing GC.

pron
2 replies
20h5m

And yet, the reason tracing GCs are chosen by virtually all high-performance languages that heavily rely on GC is that they've been found to be faster in practice for the common workloads.

One of the reasons why your intuition is not so straightforward is that a tracing GC needs to do no work whatsoever when the number of references is zero. One of the common ways to teach the basics of GC is by looking at tracing and refcounting as duals: refcounting works to free objects, while tracing works to keep them alive. If you think in terms of what work needs to be done to promptly determine when an object becomes garbage, then you're already not thinking in terms of tracing, because tracing never actually needs to learn when an object becomes garbage (this isn't strictly true when they do reference processing, but that's another story).

Or, if you want to think about it another way, in a tracing collector there are already only two cases no matter how many pointers there are to an object: reachable or not, i.e. the same one and zero as in your case, only there isn't even a need to ever set the counter to zero.
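
To illustrate the duality with a toy example (entirely schematic, not how any production collector is written): the mark phase below only ever touches objects it can reach from the roots. Garbage is never visited and never "discovered" individually; it's simply whatever is left unmarked.

  #include <vector>

  struct Object {
    bool marked = false;
    std::vector<Object*> fields;  // outgoing references
  };

  // Work is proportional to the number of reachable objects and edges;
  // an object with zero incoming pointers costs the collector nothing.
  void mark(const std::vector<Object*>& roots) {
    std::vector<Object*> worklist(roots.begin(), roots.end());
    while (!worklist.empty()) {
      Object* obj = worklist.back();
      worklist.pop_back();
      if (obj == nullptr || obj->marked) continue;
      obj->marked = true;
      for (Object* child : obj->fields) worklist.push_back(child);
    }
  }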

In principle, tracing and refcounting can be quite similar in their behaviour (https://www.cs.cornell.edu/courses/cs6120/2019fa/blog/unifie...), but in practice most refcounting GCs in industry use are crude and don't match the performance of the tracing GCs in common use, which are quite sophisticated.

tomp
1 replies
6h4m

> And yet, the reason tracing GCs are chosen by virtually all high-performance languages that heavily rely on GC is that they've been found to be faster in practice for the common workloads.

I disagree.

IMO the reason is simplicity: most languages don’t want to burden programmers with lifetimes and/or soft references.

Rust and Swift, both high-performance languages targeting specific niches, chose differently. Both largely for reasons of predictability, lower latency, lower memory use (which you admitted above), as well as less memory thrashing.

pron
0 replies
4h40m

IMO the reason is simplicity: most languages don’t want to burden programmers with lifetimes and/or soft references.

I'm talking about languages that already use a GC as the primary heap management mechanism. Any kind of GC algorithm, whether tracing or refcounting, frees developers from thinking about lifetimes, but languages that primarily rely on GC and care about performance almost always pick tracing.

The only languages that choose a refcounting GC are either those that care little about performance (Python) or those that don't rely heavily on GC (Rust).

Both largely for reasons of predictability, lower latency, lower memory use

Refcounting offers worse latency, worse predictability [1] (compared to modern low-latency tracing GCs, like ZGC), and worse throughput; it does offer a significantly lower memory footprint. The reason Rust chose refcounting is primarily because it doesn't rely on the GC heavily and so it might as well use a crude GC (and enjoy the lower effort of implementing it as well as the lower footprint), as a crude refcounting GC is usually better than a crude tracing GC.

Furthermore, a sophisticated tracing GC is not a great fit for languages like Rust or C++ because it requires more elaborate code generation and some "magical" mutator-thread behaviour, whereas low-level languages want the generated code to be as close to the source code as possible. This isn't so much predictability of performance, but predictability of the generated code and the behaviour of mutators (in terms of lowering the surprise of "why is my thread doing _that_ now?" when observed at a low level). For example, with ZGC, a mutator thread may find itself moving an object when accessing it; this is surprising at a low level -- which may not be a good fit for a low-level language -- even though it ends up offering better performance predictability overall. On the other hand, a mutator thread doing work to free memory with a refcounting GC when it drops a reference may offer worse latency and worse predictability overall, but it's not as surprising at a lower level, in the sense that it's easy to understand why it's doing that.

As for Swift, I think it picked a refcounting algorithm because it cares about memory footprint (most Swift applications run in memory-constrained environments) and because some unique language features allow it to reduce the cost of refcounting sufficiently for a simple refcounting GC to be acceptable.

[1]: Again, not in principle (as both refcounting and tracing can converge to a similar behaviour) but as practiced by most refcounting GCs in industry use.

dmurray
1 replies
20h52m

This sounds intuitively true. So...what if GC languages could introduce an optional annotation for an object to say that only one reference to it can exist, and use that as a hint to the GC?

I don't see how this could be implemented in a safe and performant way - either you check for existing references at runtime, or you risk some use-after-free bug. But perhaps your project is already in a GC language and you're happy with that, but just want to optimise GC for this critical component. And we already have the concept of "unsafe" code blocks in many languages.

Does anything like this exist? I Googled "smart pointers in Java" but just got a bunch of mid-quality answers where people explained that I'm stupid for even asking this and they're unnecessary because Java manages its own memory. But hasn't someone smarter thought of this?

kaba0
0 replies
3h14m

This sounds intuitively true. So...what if GC languages could introduce an optional annotation for an object to say that only one reference to it can exist, and use that as a hint to the GC

Tracing GCs fundamentally look at what is still reachable, while ref counting tracks deadness. They are actually the “yin-and-yang” of each other, with different tradeoffs, tracing being more practical on current hardware/usage pattern.

There is no practical difference in whether a tracing GC can reach an object from multiple places or just a single one - but there might be some other, analogous feature in tracing corresponding to a 0-or-1-only refcount.

lll-o-lll
6 replies
22h20m

nor is it more generally predictable

Can you explain what you mean here? This does not match my experience or intuition.

pron
5 replies
22h3m

In theory, refcounting and tracing can behave similarly [1], but assuming we're speaking about their implementations in the industry (rather elaborate tracing GCs; rather crude refcounting GCs) then a refcounting GC would do some work as soon as the refcount for a particular object drops to zero. When exactly that happens and how much work there is to do (and sometimes, what the fragmentation effect on heap is) are local properties that are not very predictable. In contrast, the amount of work a tracing GC needs to do when compacting is directly proportional to the program's working set, and the frequency with which it needs to do that work is proportional to the allocation rate -- both of which are fairly regular and predictable global properties for a given program. For stop-the-world GCs there was also the matter of the highly unpredictable exact timing of a large STW pause, but modern concurrent GCs don't collect anything in STW pauses anymore. They only have very short (sub-1ms), constant-time pauses and no surprise throttling as long as the allocation rate is within an acceptable range. So all in all, you pay a fairly fixed tax on the CPU (and if you want to pay less, just add more RAM) and see virtually no impact on latency.

[1]: https://www.cs.cornell.edu/courses/cs6120/2019fa/blog/unifie...

lll-o-lll
3 replies
21h40m

Thanks for responding. My experience with tracing GC at scale is exclusively in the realm of .Net, and RC exclusively with C++ smart pointers. That matches your “sophisticated vs crude” contrast.

The experience with .Net is that GC impact was difficult to profile and correct, and also “lumpy”, although that may have been before GC tuning. GC would dominate performance profiling in heavy async code, but these days can be corrected by value tasks and other zero alloc methods.

For C++ style ref counting, the impact was a continuous % load and simple to profile (and therefore improve). Although here, the ref counting needed to be stripped from the hot paths.

The biggest issue I’ve hit between the two modes though, is how they behave when hitting memory limits. Tracing GC appears to have an order of magnitude perf hit when memory becomes scarce, while ref counting does not suffer in this way. This is enough for me to personally dislike tracing GC, as that failure state is particularly problematic.

neonsunset
2 replies
21h30m

When you hit memory limits, .NET's GC implementation will perform much more frequent, invasive and aggressive collections, including LOH compaction to reduce the memory watermark, which leads to greater GC pauses, though this is rarely seen in such a bad way on modern versions with e.g. SRV GC.

The most scalable way to address this is usually to just allocate less and use ValueTasks with pooling where applicable (frequent asynchronous yields). I'm certain that if you built a .NET 8-based solution you would see user-written code dominate the heap allocation profile, as most hot internal paths of async utilize said state machine box pooling + ValueTask<T> [0] and may be entirely allocation-free.

[0] Example: https://github.com/dotnet/runtime/blob/cc7bf831f02cad241547e...

lll-o-lll
1 replies
20h32m

When you hit memory limits, .NET's GC implementation will perform much more frequent, invasive and aggressive collections, including LOH compaction to reduce the memory watermark, which leads to greater GC pauses, though this is rarely seen in such a bad way on modern versions with e.g. SRV GC.

The trouble with server GC mode is that then there is no natural back pressure. If the processing is not CPU bound, then memory allocation can grow unbounded. This is not something that happens with RC as, again, the GC performance hit is inlined with task processing. The service may not be capable of as much throughput, but it doesn’t take out the entire server either.

The most scalable way to address this is usually to just allocate less and use ValueTasks with pooling where applicable (frequent asynchronous yields). I'm certain that if you built a .NET 8-based solution you would see user-written code dominate the heap allocation profile, as most hot internal paths of async utilize said state machine box pooling + ValueTask<T> [0] and may be entirely allocation-free.

Absolutely; I think it’s relatively simple to write servers that scale using modern .net; the memory allocation foot-guns when dealing with asynchronous code are now well understood, and tooling is good. I am compressing ~15 years of experiences in that previous post.

It’s probably the case that a tracing GC is the better choice for most modern applications, excepting memory constrained devices (like phones), and so long as you design with memory in mind.

neonsunset
0 replies
20h11m

Ah, I see where you are coming from.

You are correct, sustained load heap size of SRV GC has been a known pain point that had been particularly exacerbated after beefy Windows Server hosts fell out of fashion and got replaced by 512Mi Linux containers.

There has been work on this in each release, throughout .NET Core 2.1 and 3.1 and then .NET 5, 6, 7 and 8, to make it play nicer with systems with more constrained memory limits.

The two major features that address this are Regions[0] (.NET 6/7) and DATAS[1] (.NET 8). The former is enabled by default everywhere except macOS, and the latter is available opt-in either via the env var DOTNET_GCDynamicAdaptationMode=1 or the MSBuild property GarbageCollectionAdaptationMode: 1 (see more in [1]).

The latter has been shown to significantly reduce the sustained (or, especially, idling) heap size for some particularly problematic workloads (but not all; sometimes you just have a lot of live objects). I definitely recommend giving it a try if this is still relevant to you.

TLDR of what DATAS does is dynamic heap count scaling and much smarter heap up/downsizing depending on allocation rate/frequency and anticipated throughput impact of adjusting those.

[0] https://itnext.io/how-segments-and-regions-differ-in-decommi... / https://devblogs.microsoft.com/dotnet/put-a-dpad-on-that-gc/

[1] https://maoni0.medium.com/dynamically-adapting-to-applicatio...

cesarb
0 replies
4h35m

but assuming we're speaking about their implementations in the industry (rather elaborate tracing GCs; rather crude refcounting GCs)

That might be because a tracing GC needs to be rather elaborate to get good performance and avoid bad behaviors.

When exactly that happens and how much work there is to do (and sometimes, what the fragmentation effect on heap is) are local properties

It being a local property is a good thing. It means it won't affect or be affected by how the rest of the program is behaving.

carry_bit
0 replies
4h53m

The state-of-the-art in refcounting[1] greatly improves the barrier situation over a naïve implementation: no read barrier, and the write barrier only uses atomics when mutating an (apparently) unmodified field in an old generation object.

[1] https://dl.acm.org/doi/pdf/10.1145/3519939.3523440

bsder
4 replies
17h38m

Reference counting is neither lower latency nor lower memory overhead compared to basically everything else.

Reference counting requires atomics on every single object. Per object atomics are quite unfriendly to modern microprocessors.

Either there is very little contention, in which case you're paying for lots of atomics you don't need, or there is high contention, in which case the atomics are blowing out your cache lines repeatedly.

In addition, managing reference counts practically requires RAII semantics and all of the baggage that goes along with that. Doing reference counting in C, for example, is extremely error prone.

kccqzy
2 replies
16h51m

The reason Rust has both Arc and Rc types for reference counting is precisely because most of the time when you need reference counting, you do not need thread safety. This is something I think about all the time in C++ when I use std::shared_ptr: it gives me thread safety using atomics but I don't need it.

More languages should distinguish between thread safe reference counting and single-threaded reference counting.
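
Hedged sketch of what that looks like in C++ (LocalRc is a made-up name, and this is deliberately minimal: no weak references, no move support, not thread-safe): a single-threaded refcounted handle in the spirit of Rust's Rc, where the count is a plain size_t, so increments and decrements compile to ordinary loads and stores rather than the atomic read-modify-write operations std::shared_ptr pays for.

  #include <cstddef>
  #include <utility>

  template <typename T>
  class LocalRc {
  public:
    explicit LocalRc(T value) : ctrl_(new Ctrl{1, std::move(value)}) {}
    LocalRc(const LocalRc& other) : ctrl_(other.ctrl_) { ++ctrl_->count; }
    LocalRc& operator=(const LocalRc& other) {
      Ctrl* incoming = other.ctrl_;
      ++incoming->count;  // take the new reference first...
      release();          // ...so dropping the old one is safe even on self-assignment
      ctrl_ = incoming;
      return *this;
    }
    ~LocalRc() { release(); }

    T& operator*() const { return ctrl_->value; }
    T* operator->() const { return &ctrl_->value; }

  private:
    struct Ctrl {
      std::size_t count;  // plain integer: no atomics, hence single-threaded only
      T value;
    };
    void release() {
      if (--ctrl_->count == 0) delete ctrl_;
    }
    Ctrl* ctrl_;
  };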

bsder
1 replies
14h27m

The reason Rust has both Arc and Rc types for reference counting is precisely because most of the time when you need reference counting, you do not need thread safety.

Hmmm, I'm curious about your use cases as this is almost precisely opposite my experience. Normally, I regard Rc as a footgun as I've always found that I'm shortly going to have to change everything to Arc.

jandrewrogers
0 replies
1h58m

I almost never use atomic reference counters, when I use them at all. A major point of using thread-per-core software architectures is that it eliminates almost all atomics even in highly concurrent contexts that may require reference counting to manage resource lifetimes.

In these cases, your concurrency primitive is essentially a coroutine, either with or without its own stack. You may have thousands of these executing concurrently similar to as if they were multithreaded and using locks or atomics, but the execution is always on a single thread so locks are unnecessary and atomics can be elided entirely. In the rare cases where atomics are used, there are almost always at most two threads accessing them.

Basically, the only place you see atomics are the very thin and lightly accessed interfaces between threads, which are about resource control rather than true thread concurrency.

I have used these types of architectures on diverse high-scale server apps. The places I’ve had to use atomic reference counters were always extremely limited and largely outside the hot path.

pron
6 replies
1d2h

That's not at all a rare special case in most server applications.

One way to see that is to consider how much of a program's working set is in threads' stacks vs the heap. If most of the working set is in the heap, there are usually some non-trivial object lifetimes involved (i.e. cases where lifetimes can be encapsulated and abstracted away from client code). Yes, sometimes all of these can be taken care of by arenas (and there are languages, like Zig, that strive to be very arena-friendly), but that -- i.e. the case where all objects are easily arena-able -- is not the more common scenario.

flohofwoe
5 replies
1d2h

Depends on the type of server I guess. I can easily imagine a situation where all allocations of a request are handled through a simple bump allocator, and once the request is done, the bump allocator is reset instead of 'freeing' each individual allocation.
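
Something like the following sketch (the request-handling names are hypothetical; C++17's std::pmr::monotonic_buffer_resource is the off-the-shelf version of the same idea): everything a request allocates comes out of one buffer, and "freeing" it all is a single pointer reset.

  #include <cstddef>
  #include <cstdint>
  #include <new>
  #include <vector>

  // Note that no destructors are run on reset, so this suits
  // trivially-destructible per-request data.
  class BumpArena {
  public:
    explicit BumpArena(std::size_t capacity) : buffer_(capacity) {}

    void* allocate(std::size_t size,
                   std::size_t align = alignof(std::max_align_t)) {
      std::size_t aligned = (offset_ + align - 1) & ~(align - 1);  // align: power of two
      if (aligned + size > buffer_.size()) throw std::bad_alloc{};
      offset_ = aligned + size;
      return buffer_.data() + aligned;
    }

    // "Free" everything the request allocated, in O(1).
    void reset() { offset_ = 0; }

  private:
    std::vector<std::uint8_t> buffer_;
    std::size_t offset_ = 0;
  };

  // Hypothetical usage: one arena per worker thread, reset between requests.
  //   handle_request(request, arena);
  //   arena.reset();  // no per-object frees, no destructor cascade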

jayd16
2 replies
23h8m

How did we go from "this is the most common case" to "I can imagine it?" Sure there are cases where a request can be handled by the stack but the point is that the more complex case is extremely common.

Any kind of background/asynchronous work spawned from a request and your plans are shot.

thefaux
1 replies
19h16m

Yes and that is yet another case for encapsulation.

For me, it is an antipattern for a short lived task to be directly creating long lived resources. Async work ideally is scheduled using message passing to the task manager which may have its own memory management strategy (heck, the background task may well be written in an entirely different language).

I just feel that it is very common to have large portions of the application that fit the model of read input/write output, free all intermediate data upon completion. Due to a lack of encapsulation, however, we put unnecessary pressure on the allocator or garbage collector by mixing the short and long lived parts of our applications.

pron
0 replies
4h23m

we put unnecessary pressure on the allocator or garbage collector by mixing the short and long lived parts of our applications.

That is not how modern tracing GCs work, though. First, modern tracing GCs don't do any work to actively free any dead object (they don't even need to know about them). They work to compact the live ones. Second, modern tracing GCs automatically separate short-lived from long-lived objects and preserve them in different ways.

While it's true that in principle a programmer could find a technique to optimally manage memory in a given program and that even a good GC would never be quite as optimal, in practice a GC would do a better job than you could unless you spend an inordinate amount of time figuring out the optimal strategy; furthermore, the optimal strategy has to contend with non-local effects, i.e. a change in one subroutine might have a global effect on what the optimal global strategy is.

So as a first approximation, a good tracing GC will give you better memory management (performance-wise) per unit of programmer effort, and if you want even better performance -- give the GC more RAM. You're trading off RAM for less work on your part.

ncruces
0 replies
23h38m

That “simple bump allocator” is basically allocating everything on the thread's stack, as the GP mentioned.

It's what you put on the heap that has complex lifetimes. Sometimes you can fix that with an arena. If you can't, you probably can't figure out when the last reference to it dies either.

cogman10
0 replies
23h43m

This would only work for fairly trivial applications. The moment you start adding things like http clients or databases you have to start considering having connection pools with lifetimes that don't strictly match a request lifecycle.

Not saying such an application doesn't exist; it certainly does.

cratermoon
1 replies
1d2h

In generational GCs there are two or more allocation regions. New objects are put in the "younger" generation, which is garbage collected separately from the other generations. This sort of resolves the issue of tracking individual object lifetimes by having all the short-lived objects subject to rapid GC. This means most of the effort of tracking lifetimes is reserved for the fewer long-lived objects.

lll-o-lll
0 replies
22h17m

It’s “medium term” objects that cause all the trouble. Async operations. That’s why .NET invested heavily in zero-alloc tasks, as the GC would kill scaling in async-heavy code.

Thaxll
27 replies
1d2h

Pauses are kind of solved but the CPU usage for the GC is still pretty high.

You're still at the mercy of unpredictable tail latency and other corner cases.

pron
25 replies
1d2h

That goes for manual memory management -- and certainly languages with a reference-counting GC, like Rust -- as well. The main difference is by far footprint overhead.

flohofwoe
16 replies
1d2h

I think the important thing to understand is that reference counting isn't any better (and is often worse) than "regular" garbage collection.

The point of manual memory management is to come up with problem-specific strategies to avoid or at least reduce dynamic memory allocation, not to insert manual release/free calls for individual objects ;)

pron
15 replies
1d2h

Reference counting is regular garbage collection. The two broad classes of GC algorithms are tracing and refcounting, and while they can converge to similar behaviour, usually the former optimises for throughput while the latter optimises for memory footprint; latency is similar these days.

flohofwoe
14 replies
1d1h

Reference counting is regular garbage collection.

...while I agree, for many C++ and Rust coders statements like this are pure heresy ;)

cesarb
11 replies
1d

...while I agree, for many C++ and Rust coders statements like this are pure heresy ;)

It's a matter of definitions. For many people, "garbage collection" refers only to tracing GC, and reference counting is a separate category. In my experience, that's the common usage; insisting that "reference counting is formally (in some paper from the last century) also defined as a form of GC" will not magically change the opinions "many C++ and Rust coders" have about tracing GC. In fact, I'd say that insisting on this nomenclature point only weakens the whole argument; tracing GC should stand on its own merits, and not depend on some nomenclature equivalence to be accepted (if quibbling about nomenclature is your strongest argument, your arguments are weak).

pron
10 replies
23h52m

There's no need to "change opinions". People who work on GCs know that reference counting and tracing are the two general GC strategies. The only people who don't think of refcounting as a GC are people who simply aren't familiar with GCs and how they work. If they also think refcounting has lower latencies (let alone higher throughput) than tracing, then they're also just wrong. No one needs to "insist" on the GC nomenclature. You're either familiar with it or you're not (and since most people are not, they commonly make mistakes on the subject). Also, given that tracing GCs are used by ~90% of the market, they hardly require justification anymore; they've won by a large margin over the application space (which constitutes most of software). However, it's nice to occasionally educate those unfamiliar with the subject on GC algorithms and nomenclature.

ngrilly
5 replies
23h18m

Also, given that tracing GCs are used by ~90% of the market, they hardly require justification anymore; they've won by a large margin over the application space (which constitutes most of software).

Tracing GCs have clearly proven themselves and are everywhere (JVM, CLR, Go, Dart, OCaml, etc.) but we can't ignore that the Apple ecosystem (Swift) is using ARC. That's a significant share of the "market". Python and Ruby also use reference counting, but I don't think anyone is considering them state-of-the-art GC.

yxhuvud
1 replies
4h34m

Ruby does not use reference counting. It has a generational mark-and-sweep GC that is incremental and uses compaction. I doubt it would be state of the art, but it is not too bad nowadays.

ngrilly
0 replies
3h40m

My mistake. Thanks for correcting.

pron
1 replies
23h6m

You're right, I should have said "languages where GC is the primary means of managing heap memory are used by 90% of the market" rather than focused on a specific algorithm.

ngrilly
0 replies
23h0m

Yes, this is quite fascinating how GC replaced manual memory management in most apps over the last 20~30 years.

zozbot234
0 replies
23h14m

Except that obligate ARC à la Swift has even lower throughput than obligate tracing GC. It's the worst possible choice unless you really care about low latency and deterministic freeing of resources (and even then, using RAII for common tree-like allocation patterns like Rust does will perform better).

dasyatidprime
3 replies
22h0m

I have to wonder whether some of this is semantic drift over time or context. My recollection since undergrad (a few decades ago) involves treating “garbage collection” as referring to tracing garbage collection, and “reference counting” as a separate mechanism. There is still a term for the category including both, only that term is not “garbage collection” but “automatic memory management”. But what I see nowadays is closer to what you describe.

zozbot234
1 replies
21h57m

Automatic memory management is more general than that, it also includes stack allocation.

dasyatidprime
0 replies
21h43m

I agree; I meant “including” in the non-restrictive sense, not “including only”. Stack allocation is a special case where the lifetimes are arranged in a convenient way—see also escape analysis in languages where stack allocation isn't explicitly supported at the language level but can be added by the compiler.

yxhuvud
0 replies
4h40m

If anything, the drift is towards GC meaning only tracing, as it is so dominant in the languages that are normally considered to have a GC. But before C++ (via Boost) introduced shared pointers, and Swift introduced ARC, I'd expect the separation to have basically not existed.

pjmlp
0 replies
1d1h

They should read some CS literature, the kind that is used to write the compilers they rely on. :)

PaulDavisThe1st
0 replies
1d

From TFA:

Tools to automate the “actually freeing the memory” part, like lifetimes in Rust and RAII in C++, don’t solve these problems. They absolutely aid correctness, something else you should care deeply about, but they do nothing to simplify all this machinery.

cesarb
5 replies
1d

and certainly languages with a reference-counting GC, like Rust

It's a mistake to say that the Rust language has reference counting. There's a pair of reference-counting wrapper types in its standard library (Rc and Arc), or you can roll your own, but there's no special support for these types in the language, and their use is optional. Most of the time, you won't be using these reference-counting types. Box (the generic heap-allocated or "boxed" type) doesn't use reference counting. String (and its several specialized variants) doesn't use it. Vec (the generic heap-allocated array type) doesn't use it. HashMap, HashSet, BTreeMap, BTreeSet, none of them use reference counting. And so on. You can write a lot of Rust code without using reference counting even once.

What the Rust language has is just C++-style RAII: when a value goes out of scope, if it implements Drop, its Drop::drop is called.

cogman10
2 replies
23h37m

It's a mistake to say that the Rust language has reference counting.

Having these types in the standard library is the language having those types.

Perhaps it's not integrated to the level that a language like Swift is. However, I think it's reasonable to say the language supports Rc when the standard library supports it. I'd say the same thing about C++ with `shared_ptr`.

Otherwise you end up in weird pedantic notions about what a language has or does not have. Does C have a heap? Well, technically no since malloc and free are just function calls in the standard library and you can write valid C programs without calling those functions.

cesarb
1 replies
22h59m

Having these types in the standard library is the language having those types.

It depends on whether you consider the standard library an indivisible part of the language or not. For Rust, it's clearly not the case, since you have the #![no_std] mode in which only a subset of the standard library is available, and this subset does not include these reference counted wrapper types (or any heap-allocated type at all).

Perhaps it's not integrated to the level that a language like Swift is. However, I think it's reasonable to say the language supports Rc when the standard library supports it. I'd say the same thing about C++ with `shared_ptr`.

It's one thing to say a language "supports reference counting", which only means you can use reference counting with it, and another thing to say "[...] languages with a reference-counting GC", which implies that the language uses a GC for everything, and that GC is a reference-counting GC.

Does C have a heap? Well, technically no since malloc and free are just function calls in the standard library and you can write valid C programs without calling those functions.

It's actually the same thing: C can run on either a "hosted" environment or a "freestanding" environment, and on the latter, most of the standard library is not available, including malloc and free. So C does not necessarily have a heap when running on a freestanding environment.

arcticbull
0 replies
21h47m

It's not part of std exactly, it's part of alloc. It's re-exported by std.

It would still be available in a #![no_std] environment using `extern crate alloc`.

This crate generally abstracts over the concept of allocation too, so relying on it doesn't require you to also have an allocator - it just requires someone at some point specify one with #[global_allocator]

kaba0
1 replies
23h23m

and their use is optional

It is not, if you have objects with dynamic lifetimes, and allocating them for the whole duration of the program is not an option.

Sure, their use can be much less than in a managed language that can only do automatic memory management, but RC is objectively worse than tracing GC from most perspectives, except for the fact that it doesn't need runtime support and has a slightly lower memory overhead.

arandomusername
0 replies
18h41m

*shared* objects with dynamic lifetimes.

With properly architected code, the cases where you need to use Rc are extremely rare.

mdavidn
0 replies
1d1h

Rust is more similar to C++, in that the compiler inserts calls to free as variables exit scope. Runtime reference counting is limited to those objects wrapped with Rc or Arc.

I agree with pron’s larger point. GC is fine for most applications. It’s just factually inaccurate to compare Rust’s memory management with languages like Python and PHP.

MForster
0 replies
1d1h

Rust by default doesn't do reference counting.

You can opt into reference counting with `std::rc::Rc`. (You can even opt into mark-and-sweep GC using the `gc` crate, but this isn't done much...).

adgjlsfhk1
0 replies
1d

The CPU usage of manual memory management is also pretty high. It's just more evenly distributed throughout the program, making it harder to observe.

znpy
17 replies
1d2h

(OpenJDK's ZGC has typical pause times measured in double/triple-digit microseconds, and the worst case rarely exceeds 1ms for a reasonable allocation rate -- the pauses are in the same ballpark as OS-induced ones)

We've been benchmarking ZGC and Shenandoah at work, and the p100 pause time is usually below 500us (microseconds). ZGC seems to be performing a bit better, as it seems to perform fewer pauses than Shenandoah (hence doing more work per pause).

We still have to run tests-in-production, but so far it seems that GC pauses are largely a solved issue when using ZGC (and Generational ZGC since Java21).

Galanwe
10 replies
1d2h

We've been benchmarking ZGC and Shenandoah at work, and the p100 pause time is usually below 500us

so far it seems that GC pauses are largely a solved issue when using ZGC

What? 500us is abysmally slow.

Most projects I work on have a latency budget of less than 10us, the average being 2us. That is the budget for the whole wire-in/wire-out for a packet. Even for less latency-sensitive workloads, 500us is a no-go for most networking applications.

pron
9 replies
1d2h

We're talking about occasional hiccups, not an average-case response-latency overhead. You can't get worst-case latency of 2-10us with a non-realtime kernel. Even a page fault could take longer than that.

Galanwe
8 replies
1d1h

You can't get worst-case latency of 2-10us with a non-realtime kernel. Even a page fault could take longer than that.

You obviously can, and this has nothing to do with the kernel being real-time or not.

There is no situation I can think of where a page fault should occur on a properly set up system running production networking software, meaning no swap, huge TLB, and proper memory management.

pron
7 replies
1d

If you can think of "no situation" where a server may incur a page fault, forced preemption, or need to perform any I/O to a service/database, then I hope you at least recognise that your world in no way represents the state of server software at large because none of these things is true for the vast majority of server software.

In a former life I worked on some safety-critical onboard avionics software for an ultrasonic platform, and 2us was around the upper-limit worst-case latency (i.e. you'll kill someone if you miss that deadline); still, it's not the kind of requirements the vast majority of software finds itself under.

When working over the internet, some of the very best services are at >10ms ping latency anyway, where a 500us hiccup is imperceptible.

Galanwe
6 replies
22h8m

If you can think of "no situation" where a server may incur a page fault, forced preemption, or need to perform any I/O to a service/database, then I hope you at least recognise that your world in no way represents the state of server software at large

I won't deny that the majority of software out there is not latency sensitive, but the context of this article is specifically targeting the software that is _not_ using garbage collection, arguing that GC is undeservedly overlooked. OP even adds that GC is a "solved problem" because some GC implementation has a 500us worst-case latency.

My point is that the article author, and OP, are mistaken. Because if you are in the category of "I write server-side software without GC" (e.g. C/C++), then 500us is horribly wrong.

Your point being that 500us is fine for most software out there is surely true, but not relevant, because if that is the case, you're probably not using C/C++, thus this article is not targeting you.

In _my world_ as you phrase it, traffic is unshaped. You need to be able to support line rate, otherwise packets are dropped, and hell breaks loose.

rossjudson
2 replies
13h17m

Your "properly set up system" is apparently doing nothing other than running your single job. The vast majority of real-world systems have to deal with antagonists.

All of the characteristics you mention are true on production systems used in large scale fleets...and yet "bumps" happen...because there's never one thing happening. It's all the things, and it's always changing.

I'm gonna guess you do finance. A six microsecond fiber oneway is a thing of beauty. There are certain technical luxuries associated with that domain, and rarely any requirement to exhibit high utilization rates...or deal with antagonists running on the same hardware.

Galanwe
1 replies
10h16m

Finance is one such use case, but there are a lot more, and those are the use cases of people not using GC, which is why I find this article (and the comment saying 500us is a solved problem) pedantic.

I wrote code profilers for instance, which also need perfectly predictable latency. I worked on L2 and L3 networking applications (bridging, routing, forwarding) that need line rate support. People working on audio sampling, or codecs have the same constraints, etc.

There's a whole world of applications where 500us is ridiculously slow. The article takes the OS as an example, but if my OS had 500us random latency spikes, I would be horrified.

pebal
0 replies
9h37m

The article doesn't say anything about acceptable pause times. GC can be completely pause-free.

naasking
2 replies
13h16m

because if that is the case, you're probably not using C/C++

The point is that this claim is more wrong than it should be, e.g. C/C++ is still used more than it should be partly because these GC myths persist; hence the article.

Galanwe
1 replies
11h1m

I think at the end we're debating if the glass is half full or half empty.

I claim that people are not ignorant and if they use C/C++, they are aware of what a GC implies, and cannot afford it.

The article claims that people are somehow wrongly misled into thinking they _need_ C/C++ while a GC'ed language would be alright.

I don't think people are dumb. I think that, given the choice, any sane person would pick Python or similar to write an application, and that assuming they don't only because they don't know better is pedantic.

naasking
0 replies
3h52m

It's a mistake to conclude that people make rational choices based on the premise that they aren't dumb. Counterexamples abound.

It's literally true that people who have never programmed in a language with GC don't know what it's like, but they absorb folklore about its difficulties or unsuitability from places like HN. Except such content almost always comes from extremes that aren't applicable to most development, thus producing an unintentionally skewed picture.

Articles exactly like this one are needed for balance, and to encourage more experimentation with safer choices.

pron
5 replies
1d2h

FYI, ZGC doesn't perform any collection work in the stop-the-world pauses. They are only required to get all mutator threads to atomically observe the increment of the "GC epoch". All actual work, both marking and compaction, is done by GC threads running concurrently with the mutator threads, or by the mutator threads themselves as they run. It is only when the allocation rate is very, very high that an allocating mutator thread will be paused ("throttled") for a significant amount of time, to allow the GC to catch up by freeing up memory (and if you hit these cases, then you might be better off using a throughput-oriented collector).

hawk_
3 replies
23h46m

Strangely, in our workloads we have noticed that generational ZGC latencies are better than G1's at the ~99th-99.9th percentile or below, but worse at percentiles above that. The allocation rate is moderate.

kaba0
1 replies
23h12m

Above that you might easily get measurement artifacts, like the OS swapping out your process once, or the like.

hawk_
0 replies
22h38m

Yes - but it's quite consistent. If it was 'noise' it wouldn't be.

loeg
0 replies
14h26m

It can be easy to game metrics like this -- trading off p100 for p99.9 or p99. E.g., defer all the expensive work to 1 in every 10,000 operations.

znpy
0 replies
23h23m

FYI, ZGC doesn't perform any collection work in the stop-the-world pauses.

Yeah, I know, but on a certain level I don’t care what it does or does not do.

I care about my application not having latency spikes due to stop the world pauses :)

hyperpape
1 replies
1d1h

Where are the measurements comparing throughput of tracing GCs and manual memory management? I'm aware of how incredibly hard this area is to measure, but it's a shame that the state of mainstream discussion is "most people just assume GC implies slow, but then again a handful of people say it's not."

pron
0 replies
1d

Given that the no-GC-by-default market is ~10% of the global software market [1], with no signs of a shift in either direction over the past couple of decades -- which sounds about right to me (considering the percentage of programs that need to run in memory-constrained environments or must have precise control over memory) -- it seems that the number of those who may benefit significantly from a different choice is small, and so it doesn't look like anyone wants or needs to be convinced of anything. "GC languages" already command ~90% of the market and have little to gain from such a small market of potential converts, and the others aren't trying or succeeding in increasing their market share, so who cares, given the small stakes?

[1]: https://www.devjobsscanner.com/blog/top-8-most-demanded-prog...

mtzet
0 replies
9h52m

special case where all memory can be easily handled in arenas

That seems to be an unfair bar to set. If _most_ objects are easily allocated by an arena, then that still removes most of the need for GC.

I like Jai's thesis that there are four types of memory allocations, from most common to least common:

1. Extremely short lived. Can be allocated on the function stack.

2. Short lived + well-defined lifetime (per frame/request). Can be allocated in a memory arena.

3. Long lived + well-defined owner. Can be managed by a subsystem-specific pool.

4. Long lived + unclear owner. Needs a dynamic memory management approach.

If you want to make the claim that tracing GCs surpass manual memory management in general, you should compare against a system written with this in mind, not one that calls malloc/free all over the place. I guess it might be fairer to compare tracing GC with modern C++/Rust practices.
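
For case 3 above, a subsystem-specific pool can be as simple as this hedged sketch (not from the post; just the standard free-list idea): slots are recycled within the owning subsystem, so the global allocator is never touched on the hot path.

  #include <memory>
  #include <new>
  #include <utility>
  #include <vector>

  template <typename T>
  class Pool {
  public:
    template <typename... Args>
    T* acquire(Args&&... args) {
      if (free_.empty()) {                       // grow only when needed
        storage_.push_back(std::make_unique<Slot>());
        free_.push_back(storage_.back().get());
      }
      Slot* slot = free_.back();
      free_.pop_back();
      return new (slot->bytes) T(std::forward<Args>(args)...);  // placement new into the slot
    }

    void release(T* obj) {
      obj->~T();                                       // run the destructor...
      free_.push_back(reinterpret_cast<Slot*>(obj));   // ...and recycle the slot
    }

  private:
    struct Slot { alignas(T) unsigned char bytes[sizeof(T)]; };
    std::vector<std::unique_ptr<Slot>> storage_;  // owns the memory, freed at shutdown
    std::vector<Slot*> free_;                     // currently unused slots
  };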

I agree that for most systems, it's probably much more _practical_ to rely on tracing GC, but that's a very different statement.

epr
0 replies
2h7m

Except in the special case where all memory can be easily handled in arenas, good tracing GCs have long ago surpassed manual memory management in throughput

Citation needed

magicalhippo
27 replies
1d3h

A point that seems to get lost in many of the pro-GC articles, this one included from what I can see, is that memory is just one kind of resource.

Correct code, especially in systems programming, will need to manage external resources as well, be it file handles, sockets or whatnot. GC only solves the application memory part, thus doesn't help at all for handling those external resources. In fact it can make it much more complicated; just look at what it takes to correctly implement a non-trivial IDisposable in .NET.

Other approaches like RAII or reference counting makes it much easier in my experience to handle both memory and external resources in a unified way, thus making it easier to write correct code and reason about it.

That said, I'm not blatantly against GC's. It's a tool and it has some pros and cons, like everything else. The "manual GC" RCU approach mentioned in the article is interesting for certain tasks.

samatman
10 replies
1d2h

GC only solves the application memory part, thus doesn't help at all for handling those external resources.

This is far from entirely true. Most languages with a garbage collector have finalizers, which will clean that sort of thing up. Generally, and unlike memory allocation, one can call those finalizers from within user code, so as not to have to rely on the GC running.

The distinction between reference counting and garbage collection is an artificial one. Reference counting is an approach to garbage collection, one with different tradeoffs from more concept-central algorithms for GC. I agree with you that it's a more unified approach to resource management: finalizers require user attention in a way that ordinary allocation doesn't, so yes, system resources get treated differently. I don't see that as a slam-dunk against them, however.

o11c
6 replies
1d1h

One problem with finalizers, try-finally, try-with-resources, Python-`with`, etc. is that they don't actually guarantee the code will be called even in common cases.

In particular, any function that returns a (possibly newly-constructed) open file handle has not yet informed the caller that it's actually a thing to keep track of, unlike with destructors which are active immediately once the object's lifetime begins, and only rely on `move` being relatively atomic.

kaba0
4 replies
22h51m

How is a try-with-resources not guaranteed? At least in Java, it is guaranteed to run - plus, you can get your closing/destructor logic wrong in RAII languages as well.

o11c
2 replies
19h53m

I mean, it's possible to write bad code in C++. But at least it's possible to write good code too. C++ provides a careful sequence of guarantees that is hard to get wrong and is easily compiler-enforced:

* if a given subobject (field or base class)'s constructor runs, its destructor is guaranteed to run.

* in particular, in low-level ownership classes, you can arrange for field initializers to be `noexcept`, so you get to run their dtor, regardless of subsequent manipulation of the fields - just be sure to not assume invariants from the fully-constructed main object case. In most classes, deferring all ownership logic to the fields is simpler - rule of zero beats rule of 5. And if you do add manual try-catch during construction it will actually work.

Languages other than C++ generally fail in several ways:

* Allow exceptions to be thrown by the runtime for reasons unrelated to the code being executed

* No support for subobjects, forcing all objects to be allocated separately, and thus the runtime not knowing it needs to poke them, since finalizers are only applied to objects, not pointers. In particular, Python's `contextlib.ExitStack` can narrow the window in which leaks are possible, but not eliminate it.

Rust does slightly better than C++ by enforcing trivial moves, at the cost of making many useful programs impossible to write.

worik
1 replies
19h39m

Ouch!

Rust does slightly better than C++ by enforcing trivial moves, at the cost of making many useful programs impossible to write.

What "useful" programs fit a useful definition of "many"?

It makes self referential looped data structures hard to write (double linked lists, trees with pointers to parents).

No loss. Great improvement to have fewer of those

It makes arbitrary allocation hard. Enforces behaviors. Makes a programmer think very hard.

"useful programs impossible to write. " No it does not. It ,ages a lot of bad ideas hard to implement, makes nothing "impossible "

o11c
0 replies
18h7m

Good luck implementing "peer pointers" in Rust. Zero allocation involved, trivial to do safely in C++.

  class A
  {
    /* can be NULL if detached */
    B *peer;
    /* other fields generally exist in one of the classes. */

    /* useful objects are typically created in pairs, but can be constructed detached if needed */
    /* move ctor and swap keep the (value) objects pointing at each other */
    /* move assignment and dtor call detach() on the overwritten/expiring object */
    /* Often separate "detach and cancel" and "detach without canceling" are useful */
    void detach();
    /* other functions specific to the peerage */
  };
  class B
  {
    A *peer;
    /* same API as `class A`, but usually most of it is only used via one owner. */
    /* In my experience, generally one class's instance goes in some collection of centralized objects like an event loop, and the other is used as a handle to it. */
  };
I guess we could always use the usual Rust solution of "screw ownership and efficiency, just shove everything in an `Rc<RefCell<>>`".

cesarb
0 replies
22h33m

How is a try-with-resources not guaranteed? At least in Java, it is guaranteed to run

There's a subtle detail you have to be careful with, however: if you try to allocate memory or call a method between allocating your resource and actually doing the try-with-resources, you might get an OutOfMemoryError or a StackOverflowError, and your resource will leak. That is, if you do something like:

  try (MyHolder holder = new MyHolder(allocateResource())) { ... }
You can have a leak if the memory allocation in the "new MyHolder()" fails, or if there's not enough space in the stack to call the MyHolder constructor.

plus, you can get your closing/destructor logic wrong in RAII languages as well.

For instance, in C++ you can accidentally put your resource in a temporary which is immediately destructed at the end of the line, when you wanted it to last until the end of the enclosing scope. Rust makes it harder for this to happen, but it's still possible.
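
The classic C++ instance of that pitfall (sketched here with std::lock_guard, though the same applies to any RAII resource): an unnamed temporary is destroyed at the end of the full expression, not at the end of the scope.

  #include <mutex>

  std::mutex m;
  int shared_counter = 0;

  void buggy() {
    // Unnamed temporary: the guard is destroyed at the end of this statement,
    // so the mutex is already unlocked by the next line.
    std::lock_guard<std::mutex>{m};
    ++shared_counter;  // not actually protected
  }

  void correct() {
    // Named object: destroyed at the end of the enclosing scope, as intended.
    std::lock_guard<std::mutex> guard{m};
    ++shared_counter;  // protected until 'guard' goes out of scope
  }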

pjmlp
0 replies
21h26m

You use the same approach C and C++ take to all of their design flaws, and reach for a static analysis tool that covers such cases.

zozbot234
0 replies
23h51m

Finalizers will be non-deterministic when called as part of GC. One advantage of reference counting is that it is deterministic (and yes, this means that sometimes freeing the last reference that's keeping a bunch of objects around will lead to a spike of extra deallocations. Guess what, that's what determinism is. It's implied by the need to free resources promptly.)

cesarb
0 replies
1d

Most languages with a garbage collector have finalizers, which will clean that sort of thing up.

Using finalizers for cleanup of non-memory resources is bad because they will only be called when there's memory pressure. If you have used all of your non-memory resource, but there's still plenty of free memory, the allocation of that resource will fail; if you instead force a garbage collection at that point, not only will you cause a pause, but also you will be collecting memory while there's still plenty of it available.

PhilipRoman
0 replies
22h46m

Trusting the GC with freeing externally observable resources will bring you some very painful bugs.

On that note, it's 2024 and we still cannot unmap a mmapped file from Java in a timely manner. I really hope they do something about it.

zwieback
4 replies
1d3h

So true. Moving from C++ to mostly C#, I'm liking memory management by GC but hating tracking file handles, sockets, etc. RAII is something I really appreciated.

pjmlp
2 replies
1d1h

Use a Roslyn analyzer that errors when a using is forgotten for an IDisposable type, for example.

By the way, in modern .NET, using makes use of structural typing: any class or struct with a Dispose() method can be used, with no need to implement the interface, and it can also be retrofitted via extension methods.

osigurdson
1 replies
23h19m

This is strange advice. Why not just implement the IDisposable interface like normal C# code? Using extension methods for this is strange as well. Doing this even somewhat correctly would mean creating another class that implements IDisposable which can only work with public members of the original. In general best not to do weird stuff for no reason imo.

pjmlp
0 replies
11h15m

Because you don't own the type?

Using another class or struct adds to memory usage, and it turns out some C and C++ folks are itchy about such solutions, not to mention the added issue of passing wrapper classes/structs around.

neonsunset
0 replies
1d2h

(SafeFileHandle is internally reference counted, but yes, worst case when you forget to .Dispose it, it waits in the finalizer queue for the GC to trigger the finalizer thread to process the items in it)

pjmlp
3 replies
1d1h

Yet a point that tends to get lost when criticising GC languages is that most of them have features for deterministic resource management that many keep failing to learn.

- Some do have RAII

- Some do offer keywords

- Some do arena like management, lambdas with implicit management

- Some have a little help from the type system

- Some do a bit of everything listed above

Additionally, just like system developers have to rely on static analysers, the static analysers for those languages can also provide validation when something is forgotten, when the type system alone isn't enough.

magicalhippo
2 replies
20h20m

My point though is that by moving memory management into "don't care" territory while still, obviously, requiring explicit handling of other resources, it's easier to forget or miss when you need to explicitly handle something.

When instantiating objects in C#, say, I need to check the documentation or the source code to see if the type implements IDisposable to know whether I need to handle it. Let's say that for a given class X in library Y, it does not. So I don't worry; I just instantiate it and don't do anything about the cleanup because the GC will handle it.

Later, the implementation of X changes and IDisposable is added to it. I now have to change my code, and not doing so could lead to serious production issues. Yet the compiler happily compiles my code without any warning.

Sure some static analyzers might catch it, but they're not perfect, and you need to run them. A stock Visual Studio 2022 will not complain about the above scenario for example.

In my experience it's much less error prone to unify memory and external resource management. If instead I had been using a different language which has a more uniform resource handling, the above change in class X would likely not have been a big deal.

Also, my code will already be written with resource handling in mind. It can be non-trivial having to change a hierarchy of classes just because a dependency deep down suddenly had IDisposable added to it.

I guess what I'm trying to say is that just focusing on memory management, that is, to GC or not to GC, is taking a myopic view of things. I feel it's like arguing about what kind of pickle to have on your burger without considering the other ingredients. Sure, the pickle is a crucial ingredient, but there's a lot more to it.

zvrba
0 replies
9h59m

In my experience it's much less error prone to unify memory and external resource management.

Until threads and async enter the scene.

pjmlp
0 replies
11h6m

Just like a stock C or C++ compiler won't complain about the endless possibility of getting things wrong.

Or to pick on IDisposable, you can repeat exactly everything you said regarding actually providing a destructor, properly implemented, taking into account class hierarchies, multiple inheritance, heap allocation, and being exception safe.

Someone has to write those destructors.

worik
1 replies
19h49m

Yes. But...

GC only solves the application memory part,

Except it does not. "Solve" that is.

It helps, yes it does

But as I have learnt the hard way (silly way too, to be truthful) GC will not help unless you actually delete all references to the unused memory

Perhaps GCs have developed magic moo cow properties in the twenty-five years since I made that discovery, but I think the point remains

GC is very helpful, but it does not stop resource leaks

erik_seaberg
0 replies
16h30m

If unreachable objects point at each other, that's only a problem for refcounting, not for tracing (their pointers aren't followed).

yawaramin
0 replies
14h14m

RAII is of course great but any language with a GC and proper exception handling can handle resources safely. Eg look at Java's try-with-resources statement which guarantees that the resource will be disposed safely if an exception is raised: https://docs.oracle.com/javase/tutorial/essential/exceptions...

You can build up quite resilient and resource-safe systems using these basic building blocks.

tialaramex
0 replies
1d2h

Yes, and the "Memory Safety" arguments apply to the other resources too. For example Rust eventually grew I/O safety, so your File Handles (in Unix, OwnedFd) or just straight up Handles (in Windows, OwnedHandle) are owned objects, not just integers like the number 4

At the surface this looks like it's just about avoiding dumb mistakes like using arithmetic on handles or mistakenly using reserved values as sentinels -- but the ownership model also means anything tricky we're doing with handles has explicit ownership, transparent to future maintainers.

rbehrends
0 replies
14h47m

First, I have no desire to handle both memory and external resources in a unified way, because memory management and resource management have different needs.

Memory is not just "one kind of resource", it's a very specific type of resource that if it has to be managed manually, inherently creates crosscutting concerns. And memory allocation is pervasive, often implicit in other language constructs. Garbage collectors get to cheat here, because they have a global view that ignores module boundaries and information hiding.

The classical example is that introducing a caching mechanism usually introduces API breaks. Where a function normally returns a pointer/reference/unique pointer and makes the caller responsible for freeing memory (whether through convention such as in C or enforced/automated through language mechanisms such as in Rust), the moment you cache it, you need to return a reference-counted pointer, because now the memory can only be freed if both the caller and the cache don't use it anymore. And that change from a non-reference-counted pointer to a reference-counted pointer is a breaking API change.

There are plenty more situations where manual memory management interacts poorly with modularity, such as filter() style functions, or the various complications that arise from closures capturing their local environment.

Conversely, it is absolutely possible to have pretty straightforward resource management with guaranteed and predictable lifetimes in a GCed language (though, alas, there's a lack of direct language support for that).

The general approach is as follows: Each resource's constructor takes an explicit or implicit owner argument (implicit being the current scope, whether or not it is defined through a language construct). You can also transfer a resource to a different owner (reparenting) [1].

Owners of resources can be lifetime managers such as scopes (though those do not need to correspond to lexical scopes and are more like transactions), managers that have more complex lifetime logic (such as a pool of resources), or objects that are themselves owned (e.g. if you have resources dependent on other resources). When the lifetime of the owner finishes, it calls a dispose function on all owned objects.

Because an owner is required in order to construct such a resource object, by virtue of its constructor requiring it (unlike a C# using clause or Java's try-with-resources), it is impossible to accidentally create such a resource without a controlled lifetime [2].

Note that this is not the equivalent to RAII. You can have a number of non-owning references to such resource objects, essentially the equivalent of a weak pointer. In my experience, this is generally a good thing, because you do not want to have a hidden pointer secretly extending the lifetime of a potentially expensive resource. I prefer resource lifetimes to be explicit and to get an error if they are used past their intended lifetime.

[1] Note that this is conceptually similar to talloc. https://talloc.samba.org/talloc/doc/html/index.html

[2] Obviously, it is still possible in any language to do the equivalent of a raw fopen() call, but that's not something that RAII can fix, either.
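
A minimal sketch of the owner-argument pattern described above, written in C++ only for concreteness (Scope, Disposable and FileResource are made-up names; in a GCed language the memory of the resource object itself would simply be left to the collector):

    #include <cstdio>
    #include <stdexcept>
    #include <vector>

    struct Disposable {
        virtual void dispose() = 0;
        virtual ~Disposable() = default;
    };

    class Scope {                              // a lifetime manager / owner
        std::vector<Disposable*> owned;
    public:
        void adopt(Disposable* r) { owned.push_back(r); }
        ~Scope() {                             // end of lifetime: dispose in reverse order
            for (auto it = owned.rbegin(); it != owned.rend(); ++it) (*it)->dispose();
        }
    };

    class FileResource : public Disposable {
        std::FILE* f;
    public:
        FileResource(Scope& owner, const char* path) : f(std::fopen(path, "r")) {
            owner.adopt(this);                 // cannot be constructed without an owner
        }
        void dispose() override { if (f) { std::fclose(f); f = nullptr; } }
        std::FILE* handle() {                  // use past the intended lifetime is a loud error
            if (!f) throw std::runtime_error("resource used past its lifetime");
            return f;
        }
    };

Other references to the resource behave like weak pointers: once the owner has disposed it, handle() fails loudly instead of silently extending the lifetime.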

pron
0 replies
1d2h

There is a big difference between memory and other resources, and that is that memory -- like processing -- is fundamental for any computation; not for nothing do most theoretical models of computation assume unbounded memory. Very few languages require manual allocation of processing -- even though that is done in some software like OS kernels and hard realtime applications -- and for similar reasons automatic memory management is very useful for abstracting computation, i.e. not leaking a subroutine's memory-management details to its clients, just as this is rarely done for CPU use.

So while every non-trivial computation involves some non-constant amount of processing and memory, I/O is usually done at the edges of the system. Management of I/O is very important, of course, but it's not as central to the notion of computation (and therefore to abstracting computation) as processing and memory.

pebal
0 replies
17h58m

Having garbage collection (GC) in your toolkit doesn't mean you're limited to choosing between GC and deterministic destruction. You can integrate GC, stack allocation, and manual memory management within the same application. It's possible to leverage GC in a manner similar to how `shared_ptr` is used, providing both automated cleanup and precise control over object lifetime. For a practical example, consider SGCL: https://github.com/pebal/sgcl

celrod
22 replies
1d2h

The RCU use case is convincing, but my experience with GCs in other situations has been poor. To me, this reads more like an argument for bespoke memory management solutions being able to yield the best performance (I agree!), which is a totally different case from the more general static lifetimes generally outperforming dynamic lifetimes (especially when a tracing step is needed to determine liveness).

Lies people believe... Calling free() gives the memory back to the OS.

I believe calling `free()` gives the memory back to the allocator, which is much better than giving it to the OS; syscalls are slow. Perhaps not immediately; mimalloc only makes frees available to future `malloc`s periodically.

Trying a simple benchmark where I allocate and then immediately `free` 800 bytes, 1 million times, and counting the number of unique pointers I get:

    glibc's malloc: 1
    jemalloc: 1
    mimalloc: 4
    Julia's garbage collector: 62767
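
A minimal sketch of that measurement for the malloc/free side (the Julia number comes from allocating through its GC instead):

    #include <cstdio>
    #include <cstdlib>
    #include <unordered_set>

    int main() {
        std::unordered_set<void*> unique_ptrs;
        for (int i = 0; i < 1000000; ++i) {
            void* p = std::malloc(800);   // allocate 800 bytes...
            unique_ptrs.insert(p);        // ...record the address...
            std::free(p);                 // ...and immediately free it
        }
        std::printf("unique pointers: %zu\n", unique_ptrs.size());
    }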

62767, at about 48 MiB, isn't that bad, but it still blows out my computer's L3 cache. Using a GC basically guarantees every new allocation is from RAM, rather than cache. This kills performance of any heavily allocating code; we don't care only about how fast memory management can work, but how quickly we can work with what it gives us. I gave a benchmark in Julia showcasing this: https://discourse.julialang.org/t/blog-post-rust-vs-julia-in...

Malloc/free gives you a chance at staying hot, if your actual working memory is small enough.

Allocators like mimalloc are also designed (like the compacting GC) to have successive allocations be close together. The 4 unique pointers I got from mimalloc were 896 bytes apart.

My opinions might be less sour if I had more experience with compacting GCs, but I think GCs are just a vastly more complicated solution to the problem of safe memory management than something like Rust's borrow checker. Given that the complexity is foisted on the compiler and runtime developers, that's normally not so bad for users, and an acceptable tradeoff when writing code that isn't performance sensitive. Similarly, RAII with static lifetimes is also a reasonable tradeoff for code not important enough for more bespoke approaches. The articles example is evidently one of those deserving a more bespoke solution.

chc4
6 replies
1d2h

It's really not enough to just say that a GC gave you more pointers = it has worse cache locality. Compacting GC almost always has better cache utilization than malloc, because heap fragmentation over long-running programs will waste TLB cache entries and slack space between objects. A bump allocator from a compacting GC will give you a new pointer for each allocation because free doesn't reclaim the memory... but those allocations will be sequential, and if you are in the case where you are churning through your heap and only ever touch the most recent object, they will always be in cache still. Benchmarking the knock-on effects of allocators and GCs is insanely difficult and I'm very skeptical of basically any synthetic benchmarks like this.

convolvatron
1 replies
1d1h

compaction really does help runtimes a lot. but I'm not sure how much it really has to do with line level locality. in general we don't try to batch related objects together except in a coarse form by generation.

I think the measurable benefit comes from page level savings, both reducing the number of trips to the kernel to get zeroed pages, and from reduced pressure on the tlb.

but I have definitely seen numbers like 20% on some workloads for turning on compaction

tsimionescu
0 replies
9h8m

in general we don't try to batch related objects together except in a coarse form by generation.

It greatly depends on the GC algorithm, right? Copying collectors naturally bunch up related objects (though of course, in which order is less well defined, if one object has multiple pointers to others).

celrod
1 replies
1d1h

FWIW, that synthetic benchmark was reflective of some real world code we were deploying. Using malloc/free for one function led to something like a 2x performance improvement of the whole program.

I think it's important to differentiate between malloc implementations/algorithms, just like it's important to differentiate between GCs. E.g., mimalloc "shards" size classes into pages, with separate free lists per page. This way, subsequent allocations are all from the same page. Freeing does not free eagerly; only if the entire page is freed, or if we hit a new allocation and the page is empty, then it can hit a periodic slow path to do deferred work. https://www.microsoft.com/en-us/research/uploads/prod/2019/0...

Good malloc implementations can also employ techniques to avoid fragmentation. It's unfortunate that the defaults are bad.

But I confess, compacting GCs and profiling the effects of heap fragmentation (especially over time in long running programs) are both things I lack experience in. Microbenchmarks are unlikely to capture that accurately.

yxhuvud
0 replies
4h55m

If we are talking about long-running multithreaded processes, then libc malloc has issues that aren't even fragmentation. There are many workloads where it seems to forget to return whole empty pages and just accumulates allocated areas despite them being totally empty.

hedora
0 replies
1d1h

I think the fact that it is complicated to reason about is precisely why systems developers don’t trust GC’s.

It’s far easier to write a threadsafe bump (slab) allocator than to locate and diagnose the code that someone wrote two years ago and, as of last week, started causing the GC to blow up the cache, contend on a lock, fragment the heap, add unbounded pauses, burn an extra cpu, etc, etc.

(Though, at this point, most mallocs are so good that the slab allocator loses anyway and there’s no need to bother.)

flohofwoe
0 replies
7h58m

Compacting GC almost always has better cache utilization than malloc

...only if you put each object into its own heap allocation, which is a dumb thing to do in manual memory management. Typically you'd group many related objects of the same type tightly packed into the same allocation and loop over them in sequence to make proper use of prefetching.

The programmer will always have a better idea how objects are going to be used compared to an automatic memory management solution, and even with a GC it makes a lot of sense to use arrays of value types for better memory layout control.
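
A minimal sketch of the contrast being described (Particle is a made-up type): one tightly packed allocation iterated in sequence versus one heap allocation per object reached through pointers.

    #include <memory>
    #include <vector>

    struct Particle { float x, y, z, vx, vy, vz; };

    // One contiguous allocation; sequential, prefetch-friendly access.
    float sum_packed(const std::vector<Particle>& ps) {
        float s = 0;
        for (const Particle& p : ps) s += p.x;
        return s;
    }

    // One heap allocation per object; every element is a pointer chase.
    float sum_boxed(const std::vector<std::unique_ptr<Particle>>& ps) {
        float s = 0;
        for (const auto& p : ps) s += p->x;
        return s;
    }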

louthy
4 replies
1d

This makes no sense to me. In a generational GC gen-0 is more likely than not to be cached — and that’s where the ‘churn’ is. Outside of that, any longer lived allocations are by definition not easy to control cache-wise.

Locality is one of the big wins for GCs, the only issue I’m aware of is the ‘stop the world’ mark/sweep (yes, I know modern GCs have a background thread — but you still get stop-the-world events afaik)

fweimer
3 replies
9h17m

Modern collectors have stop-the-world pause times in the millisecond range (aiming for less), even for very large heaps. Allocating threads may also incur allocation stalls, also in the millisecond range. However, these collectors need additional barriers in the application, and the concurrently running collector competes with the application for resources even if the application is not paused. Meaningful comparisons are difficult.

pkolaczk
1 replies
8h47m

The problem is not only how long the pause takes but also the fact it pauses all the things. In manual memory management even if you have to spend some time in allocation / deallocation, it affects only the allocating thread. A thread that doesn’t allocate doesn’t pause.

fweimer
0 replies
1h9m

With current concurrent collectors, the stop-the-world pauses are so short that they approach the pause times processes encounter for other reasons on untuned systems.

But I meant something else: it's not always clear whether low-pause collectors are a win. A totally made-up example: is it worthwhile to add 2ms of processing time to every request to avoid a 20% chance, every 7 to 10 minutes, of a garbage collection that delays all requests currently in flight by 40ms?

tuna74
0 replies
3h46m

How much more memory bandwidth does that require?

arcticbull
4 replies
22h12m

The post explains why this works in the RCU context, why it sucks in general, and then just writes it off and ignores it.

At this point, some folks fire back with non-arguments about how this isn’t “real” garbage collection. Like, uh, because you manually mark the garbage!

Yeah. People's concerns are that the process of figuring out what memory is no longer used is inefficient and non-deterministic relative to simply telling the allocator when you're done with a resource. I've never met someone who's been concerned with deferring deallocation.

Sure traversing the whole live set is rare and we've spent 30 years tweaking GC algorithms to make them better, and now they're practically sentient. However this statement either willfully or unintentionally writes off the thing people actually have an issue with. If you run into GC issues in your services you have to bring in a shaman to tweak things here and there hoping it sends the angry spirits back to the shadow realm.

If you're just marking the garbage and being notified when it's no longer used, that entire process is gone.

Yes, it can be very fast to allocate memory in a GC. This ignores the cost of marking and compaction, which actually needs to be amortized in to get a fair comparison.

The other big issue people have with GC is that in general it requires significantly more memory than manual memory management to achieve equivalent performance. And you have to have a bunch of extra CPU power to throw at redundantly checking if things are still referenced over and over. And you have to be okay with a bunch of extra memory copies for optimistic compaction.

Finally the post criticizes manual memory management (Rust's Arc/Rc) as being necessary when you have unclear lifetimes - but ignores that you basically build the exact same infrastructure in GC'd languages to close external resources as you can't rely on finalizers ever being called.

Anyways this has been beaten to death for the last 20-30 years and this article doesn't seem to bring anything new to the table besides ignoring the legitimate concerns of a GC using memes, which is fine because memes are fun IMO.

The correct answer is exactly what you say - there is no general correct answer. You use the tool appropriate to the job to meet the design constraints of the system.

rerdavies
3 replies
15h31m

I've never met someone who's been concerned with deferring deallocation.

Realtime audio processing (instruments and effects), where malloc/free can never be called on the realtime audio thread. Deallocations have to be deferred.

If it matters for realtime audio, I cannot imagine that it would not matter for twitch games as well.

In non-GC languages the audio thread can be kept running even when changing audio plugins, as long as allocations and deallocations are not performed on the audio thread.

Realtime audio synthesis is possible in .net, but no memory allocations can occur anywhere in the entire realtime audio process. It is possible to write allocation-free message queues between a UI process and a realtime process. If allocations do occur in the realtime audio process, the realtime thread has to be suspended at some point in order to grab object references on the stack of the realtime thread -- a process that can take 2ms or more in .net GCs (which is more than enough to cause audio dropouts).[1]
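
A minimal sketch (not production code, names made up) of the kind of deferred-deallocation scheme described above: the realtime thread never frees anything itself, it hands retired objects to a cleanup thread over a pre-sized single-producer/single-consumer queue.

    #include <array>
    #include <atomic>
    #include <cstddef>

    struct RetireQueue {                        // single producer (audio thread), single consumer
        static constexpr std::size_t N = 1024;  // fixed capacity: no allocation after setup
        std::array<void*, N> slots{};
        std::array<void (*)(void*), N> deleters{};
        std::atomic<std::size_t> head{0}, tail{0};

        // Called on the realtime thread: O(1), lock-free, allocation-free.
        template <class T>
        bool retire(T* p) {
            std::size_t t = tail.load(std::memory_order_relaxed);
            if (t - head.load(std::memory_order_acquire) == N) return false;  // queue full
            slots[t % N] = p;
            deleters[t % N] = [](void* q) { delete static_cast<T*>(q); };
            tail.store(t + 1, std::memory_order_release);
            return true;
        }

        // Called on the cleanup thread: this is where delete actually runs.
        void drain() {
            std::size_t h = head.load(std::memory_order_relaxed);
            while (h != tail.load(std::memory_order_acquire)) {
                deleters[h % N](slots[h % N]);
                head.store(++h, std::memory_order_release);
            }
        }
    };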

pclmulqdq
1 replies
15h17m

Many of the systems I have seen that need to be deterministic and fast but still need to allocate will use pool allocators or slab allocators. The edge cases mean that malloc/free is not in the conversation. I suppose that pre-allocating objects and deferring deallocation would also work, but that seems roughly equivalent to pool allocation.

The null allocator is the fastest allocator, though!
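
A minimal sketch of a fixed-size pool allocator of the sort mentioned above (made-up names, single-threaded, no growth; callers placement-new into the returned storage):

    #include <cstddef>
    #include <vector>

    template <class T>
    class Pool {
        union Slot { Slot* next; alignas(T) unsigned char storage[sizeof(T)]; };
        std::vector<Slot> slots;            // all memory grabbed up front
        Slot* free_list = nullptr;
    public:
        explicit Pool(std::size_t n) : slots(n) {
            for (auto& s : slots) {         // thread every slot onto the free list
                s.next = free_list;
                free_list = &s;
            }
        }
        void* alloc() {                     // O(1), no syscalls, no global malloc
            if (!free_list) return nullptr;
            Slot* s = free_list;
            free_list = s->next;
            return s->storage;
        }
        void free(void* p) {                // O(1): push the slot back on the list
            Slot* s = reinterpret_cast<Slot*>(p);
            s->next = free_list;
            free_list = s;
        }
    };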

rerdavies
0 replies
14h32m

Yes. Lots of pool allocators as well. :-)

The principal application is making changes to audio plugin chains without stopping the real-time audio thread. The new chain has to be pre-allocated, and the old chain has to be sent off-thread to get deallocated on a deferred basis. You can't pool-allocate a VST or an LV2 audio plugin.

Data subscriptions (Audio VU data and control port values) also use a similar deferred deallocation scheme. Doing so allows use of mutexes and shared ptrs in dependent non-realtime data structures.

ninkendo
0 replies
4h39m

> I've never met someone who's been concerned with deferring deallocation.

Realtime audio processing (instruments and effects), where malloc/free can never be called on the realtime audio thread. Deallocations have to be deferred.

I think you misunderstood GP, they are saying “I’ve never met someone who thinks deferred deallocation is a bad thing”, ie. you’re agreeing with them.

MichaelMoser123
1 replies
23h56m

I believe calling `free()` gives the memory back to the allocator, which is much better than giving it to the OS

Having to deal with memory fragmentation in long running servers is no fun at all, especially internal fragmentation of pages maintained by slab allocators.

this is not a very common problem, but it is a hard one to deal with.

pkolaczk
0 replies
8h50m

Fragmentation rarely wastes more than 20% memory and 50% is extremely unlikely. But with tracing GC you're often wasting 4-5x right from the start. Maybe it's not called fragmentation, but the room needed for the GC to run efficiently is also waste. Plus all the headers in the objects to keep the marking flags also add up.

osigurdson
0 replies
20h26m

> My opinions might be less sour if I had more experience with compacting GCs

I have quite a bit of experience with the GC in .NET. For projects that deal with large data structures, the GC is something that you are always thinking about even though its behavior is conceptually transparent. I think I would ultimately prefer a more explicit approach.

kazinator
0 replies
20h51m

free() gives back memory to your local POSIX. :)

darby_eight
0 replies
1d

If cache usage is that major of a concern, arena allocation works just as well as it does with manual memory allocation. Thankfully there aren't too many areas where garbage collection has to compete with such conveniently contrived examples.

teleforce
9 replies
1d2h

For promising modern and parallel GC techniques, please check out MaPLe (MPL) with its novel automatic parallelism management. It won a distinguished paper award at POPL 2024 and the ACM SIGPLAN dissertation award in 2023 by proposing these two main things [1], [2]:

a) Provably efficient parallel garbage collection based on disentanglement

b) Provably efficient automatic granularity control

[1] MaPLe (MPL):

https://github.com/MPLLang/mpl

[2] Automatic Parallelism Management:

https://dl.acm.org/doi/10.1145/3632880

amelius
2 replies
21h3m

How does this compare against recent efforts in OCaml to support multicore parallelism?

zogrodea
0 replies
18h21m

One of the people who helped optimise the multi core implementation for OCaml said it was the way to go, but that was in 2020. Don’t know where things are now. https://news.ycombinator.com/item?id=23776609

shwestrick
0 replies
1h46m

The two systems have very different tradeoffs.

A few things in particular:

  * Separate compilation vs whole-program compilation. OCaml uses separate compilation and therefore has a very constrained heap object model which makes it possible to get polymorphism across separately compiled modules. In contrast, MaPLe uses whole-program compilation and therefore is able to monomorphize and optimize much more aggressively. Whole-program compilation can be slow for large projects.

  * The multicore OCaml effort was driven by backwards compatibility, especially in terms of performance -- they wanted to ensure that the performance of existing sequential OCaml code would be completely unaffected by the new run-time system.  In contrast, MaPLe focuses on efficiency and scalability for parallel code.

  * Multicore OCaml will let you implement your own scheduler, as a library, on top of coarse-grained threads. In contrast, MaPLe comes with a built-in scheduler, and it's not easy to change it.
We did a comparison with multicore OCaml in https://dl.acm.org/doi/10.1145/3591284, and found that MaPLe can be significantly faster, but that comes with all of the tradeoffs above. And, it's a cross-language comparison, so take it with a grain of salt. Our comparison in particular emphasized similar source code, but typically, fast code in OCaml just looks different from fast code in MaPLe. For example, in OCaml, you often need to manually unbox certain data structures to get better memory efficiency (but MaPLe will often do this for you, automatically).

By the way, there's a recent effort---led by the compiler team at Jane Street---to make it possible to get automatic data unboxing in OCaml! Some more info here: https://www.janestreet.com/tech-talks/unboxed-types-for-ocam...

bool3max
1 replies
23h55m

What does "provably efficient" mean?

shwestrick
0 replies
2h8m

I'm one of the authors of this work -- I can explain a little.

"Provably efficient" means that the language provides worst-case performance guarantees.

For example in the "Automatic Parallelism Management" paper (https://dl.acm.org/doi/10.1145/3632880), we develop a compiler and run-time system that can execute extremely fine-grained parallel code without losing performance. (Concretely, imagine tiny tasks of around only 10-100 instructions each.)

The key idea is to make sure that any task which is *too tiny* is executed sequentially instead of in parallel. To make this happen, we use a scheduler that runs in the background during execution. It is the scheduler's job to decide on-the-fly which tasks should be sequentialized and which tasks should be "promoted" into actual threads that can run in parallel. Intuitively, each promotion incurs a cost, but also exposes parallelism.

In the paper, we present our scheduler and prove a worst-case performance bound. We specifically show that the total overhead of promotion will be at most a small constant factor (e.g., 1% overhead), and also that the theoretical amount of parallelism is unaffected, asymptotically.

All of this is implemented in MaPLe (https://github.com/mpllang/mpl) and you can go play with it now!

zogrodea
0 replies
23h12m

Standard ML and the community around it has been pretty impressive as far as contributions to memory management literature goes.

There is of course the paper you linked, and there's also the MLKit which was among the first users, and one of the pioneers, of region-based memory management.

prisenco
0 replies
12h51m

Question for people who are more qualified: How applicable is this to other languages? Could this approach significantly speed up garbage collection in Go for example?

Or do we run into design issues with existing languages?

pjmlp
0 replies
1d

Great material, thanks for sharing.

bugbuddy
0 replies
1d2h

Nice links. Thanks for posting these.

netbioserror
9 replies
1d2h

I use Nim in production. It defaults to RC. The biggest benefit of runtime automatic memory management that rarely gets mentioned: You can easily eliminate almost all memory semantics from a typical server-side program. My code is 99.9% business logic, with the 0.1% being a couple procedures which interface to a C library.

Hardcore manual memory people seem to have completely misaligned priorities to real-world concerns. Maintainability, safety, productivity, and iteration speed seem to be bottom-of-list to the egotistic holy grail of being able to say they write their own memory routines. If they're not in an embedded context, I'm really unsure why anyone paying them to write code should care what they have to say. They're not talking about increasing robustness and reducing costs, so what the hell are they even rambling about? I've had too many of these debates in person, face-to-face, with people who can't match my ability to deliver.

The vast majority of us are not smarter than the compiler, and are not smarter than automatic MM routines. Those who are write compilers and GCs. They don't work on the same programs we all write, just memory managed. It's the worst-kept secret that the remaining contexts where manual management is used often have the worst-maintained spaghetti codebases, leading to disasters, whistleblown scandals, and hush-hush covered catastrophes waiting in the wings while people get the hell out of dodge before they blow. It's all duct tape and prayers. Having had to inspect and translate procedures from a deliberately obfuscated spaghetti C codebase, my position on this is going to be hard to budge. Experience is an unbeatable Bayesian prior.

EatFlamingDeath
4 replies
1d2h

How's Nim in production? I really enjoyed working with the language and I've been bitten too many times by Python's package management. Do you think it's suitable for scripting too?

netbioserror
2 replies
1d1h

I use it in a context it's extremely well suited for: As a CLI executable invoked by other server-side scripts. It's a program that does lots of mathematical and statistical data processing, along with HTML report generation. Like I said, I use the default RC and don't even think about memory. The type system is very simple; there is only one nominal string type, for example, instead of the 12 in Rust or C++, and it's a fleshed-out string type that can be mutable or immutable. I also take big advantage of its preference for functional-style referential transparency with only local mutations. I have only a single module that could be construed as "OO". Oh, and the module system works exactly like you probably intuit a module system should, so you probably already know how to use it.

If you want to script with Nim, it's actually quite nice; it has a "run" command which compiles and runs without depositing artifacts in the run directory, and has an incredibly robust standard library, comparable to that of Python.

I have no experience making a live always-running Nim application, so I can't speak to that. But in the context I use it, it's incredible.

cb321
0 replies
1d

I have no experience making a live always-running Nim application, so I can't speak to that. But in the context I use it, it's incredible.

I have done this several ways quite successfully. Firstly, instead of "%cpu" I tend to measure "10s..100s of parts per billion" CPU from a like ~100 lines of code "cron library" that I use instead of system cron demons: https://github.com/c-blake/cron -- seeing 40 ppb on one box and 73 ppb on another at the moment.

Another example might be https://github.com/c-blake/bu/blob/main/doc/dirq.md which is a kind of ad hoc demon to monitor however many directories on Linux with inotify and then launch user-commands against dropped in files. This can be a sort of Plan 9 "plumb" style thing. E.g., one of the directories I monitor is a browser download directory. So, one click to save and then boom - a potential cascade of activity. (EDIT: This is clocking in at about 9572177ns/(48*3600s) =~ 55 parts per billion for me right now.)

As a final example, I got annoyed with the unwanted complexity of modern syslog junk. So, I whipped up https://github.com/c-blake/kslog which just looking it up for this post, according to /proc/PID/schedstat has only accumulated about 27357590/(24*3600+24*60) =~ 311 parts per billion of 1 CPU on a 4-core system since boot about 24h:24m ago. This in about 25% of the lines of Nim code that busybox spends on less useful C code (to your point of "almost all the code is the actual logic, not other junk").

Also, while it is not as full-featured as stdlib `string`, there is a zero-copy `cligen/mslice.MSlice` type (https://github.com/c-blake/cligen/blob/master/cligen/mslice....) that does have basic features like splitting and number parsing. There are probably others out in the Nimbleverse. If view types ever migrate from experimental status to relied-upon that might become less interesting.

Since you do a lot of statistics, you might also appreciate the fmtUncertain* subsystem of https://github.com/c-blake/cligen/blob/master/cligen/strUt.n... which allows you to retain only meaningful number of digits and also https://github.com/c-blake/adix/blob/master/adix/mvstat.nim which has efficient moving quantiles via a Fenwick Tree logarithmic histogram.

EatFlamingDeath
0 replies
3h52m

Thanks for the detailed response :) Sounds like I'm gonna be experimenting with Nim in the next few days. Creating CLI executables with Python is always a pain; you cannot properly create a self-contained executable without going through hoops.

netbioserror
0 replies
1d

I should also mention: Be careful analyzing forum or example Nim code. A LOT of Nim programmers are coming from the C or C++ side of things and trying to use it like C or C++. You can write most Nim programs essentially like Python. Keep it clean, keep it simple. Use iterators, use map(), use universal function call syntax, use the full semantic power of sequences (very comparable to Python lists).

zozbot234
1 replies
23h48m

Most systems programming languages don't have fully manual memory management these days - they use RAII. Manual deallocation is possible, but not used often.

yawaramin
0 replies
14h10m

Amusing segfault in 10 lines of Modern C++ with RAII:

    #include <iostream>
    #include <memory>

    int main(int argc, char* argv[]) {
      auto s = std::make_unique<std::string>("hello");
      std::unique_ptr<std::string> s2 = std::move(s);
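      // after this move, s is null; the dereference below is undefined behaviour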

      std::cout << *s << std::endl;
      return 0;
    }

CyberDildonics
1 replies
22h3m

Hardcore manual memory people seem to have completely misaligned priorities to real-world concerns. Maintainability, safety, productivity, and iteration speed seem to be bottom-of-list to the egotistic holy grail be being able to say they write their own memory routines.

I think you've gotten very worked up over a theoretical boogieman person that doesn't exist. Most natively compiled programs are written in C++ with Rust as a distant second and both put a lot of effort into the selling point that you don't have to manually manage memory the same way you do with C and can do with ownership. I've never seen anyone like you're describing.

netbioserror
0 replies
21h30m

I have met people like that. Sorry.

spintin
6 replies
1d3h

Or you just use static atomic variables no?

I mean it won't solve your race conditions but it also won't crash.

And it will be VERY fast because parallelism without cache misses!

I wish there was a Java with C struct JNI access.

neonsunset
1 replies
1d2h

It's called C# (supporting plain C structs is just a small part of its low-level features)

pjmlp
0 replies
1d1h

Or D, available as official fronted language on GCC and LLVM projects, :)

spintin
0 replies
7h22m

Hm, ok I have Arrays of 64 byte atomic Structs defined in C and I want to iterate over them in Java, is that possible with the Memory API?

yvdriess
0 replies
1d3h

Saying this as a big hardware atomics enjoyer: Static atomic variables are not fast, are not parallel, and definitely cause the worst kind of cache misses.

Atomics are also atomic only on that one single operation, such as manipulating a single number in a struct or swapping out a pointer from one version of an object to another. To atomically expose a set of changes to a datastructure, you either use a lock or swap the old for the new, which is the driving example in the article.
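
A minimal sketch of the "swap the old for the new" pattern with a single atomic pointer (Config is a made-up type). Note that it deliberately never frees the old version: knowing when the last reader is done with it is exactly the reclamation problem the article is about.

    #include <atomic>
    #include <string>

    struct Config { std::string endpoint; int timeout_ms; };

    std::atomic<const Config*> current{new Config{"localhost", 100}};

    const Config* read_config() {            // readers: a single atomic load
        return current.load(std::memory_order_acquire);
    }

    void publish(Config next) {              // writer: build the new version, then swap
        const Config* old = current.exchange(new Config(std::move(next)),
                                             std::memory_order_acq_rel);
        (void)old;                           // cannot safely delete yet: a reader may still hold it
    }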

bugbuddy
0 replies
1d3h

This is false. Depending on the CPU arch, updating a shared static atomic variable may cause cache invalidation. How fast depends on how hot and tight your critical sections are.

mwkaufma
6 replies
1d

(1) the pivot from rcu to general purpose tracing gcs is bait-and-switch.

(2) Manual memory management is more than just malloc/free calls -- it's about layout (e.g. struct-of-arrays, inlining, implicit offsets, packing, etc)

titzer
3 replies
19h23m

For (2) Virgil has several features that allow you to lay out memory with various levels of control. I assume you mean "array of structs", and you can do that with arrays of tuples, which will naturally be flattened and normalized based on the target (i.e. will be array-of-structs on native targets). You can define byte-exact layouts[1] (mostly for interfacing with other software and parsing binary formats), unbox ADTs, and soon you can even control the exact encoding of ADTs.

Virgil is GC'd.

[1] https://github.com/titzer/virgil/blob/master/doc/tutorial/La...

mwkaufma
1 replies
18h37m

Skimming [1], Virgil Layouts resemble ArrayBuffers in JavaScript -- a boxed view of some native span of memory. If I'm reading that right, it's basically an escape hatch from the GC for "mixed" environments -- the buffer itself is owned by a GC'd proxy, but it doesn't e.g. contain GC references internally.

That's useful for lots of low-level interfacing (e.g. communicating with serial interfaces or device drivers), but one couldn't build an "allocator" in a Layout for GC'd objects the way, e.g., in native code you can make a "bump allocator" for arbitrary structs which are "free"d with just a pointer reset (pretty important, e.g., in game engines, which is my field).

titzer
0 replies
18h28m

The last part is correct; layouts are not GC'd objects. They are views over byte arrays (or mmap'd memory, or unsafe regions like the execution stack). The first part is partially incorrect; layouts are not "boxed"--they are represented underneath as a pair of a (potentially null) GC object and a pointer-sized offset into that object. So with that representation you can have a reference to an off-heap layout as well.

Layouts are a little underpowered right now, owing mostly to the conservatism in that they could always be aliased by a byte array or any other alias, thus their bytes are chaos-bits that cannot be trusted. Adding the ability to have pointers between layouts is another level; for that, I am envisioning reads of pointers requiring a second capability which is the expected region into which the pointers lie.

But all of the layout stuff is fundamentally not about performance, it's about interfacing with external software and hardware. In Virgil the expectation that you should use the good, type- and memory-safe constructs like classes, arrays, tuples, ADTs, etc. These are plenty efficient and getting new annotations to control their representation in a more fine-grained way.

im3w1l
0 replies
11h15m

I think struct-of-arrays was a quite deliberate wording. It stands out as a strategy that is at odds with object orientation and is sometimes more efficient.

kaba0
1 replies
23h3m

I disagree with 2 being manual memory management. There is definitely a lack of control in contemporary managed languages (though it’s also far from perfect in low-level languages) for memory layout, but there are definitely ways to affect it.

cb321
0 replies
21h32m

I agree with your disagreement. As a case in point, Nim has packed bitfields but various choices in automatic memory management. As a concrete example, a spell-check custom-data store uses them here: https://github.com/c-blake/suggest/blob/04e313f8f8d3adf4cb55... (there the memory is in a memory mapped file and so that code has to manage the space itself.. so, maybe not the perfect example).

But I also agree there tends to be a correlation in PLang design between avoiding low-level memory layout and avoiding manual memory management. But it's just a tendency, not fundamental.

samatman
4 replies
1d2h

This skirts around the edge of an observation which I want to dive into, which is that modern user OSes (anything which isn't a specialized RTOS) have built-in garbage collection. We just don't call it that: we just call it memory management. What do we call languages with a built in GC? Memory-managed languages!

You see this in a lot of older "top-to-bottom" C programs: they allocate, they clean up system resources (using longjmp to one label), but they don't bother with free. When the program exits the OS gets all that memory back anyway, so why bother?

There's a missed opportunity here, to have an OS with a garbage collector with less isolation from programs, one which handles resources more like a language's runtime GC. It will probably stay missed, because in the typical GCed language, the GC is intricately built into practically every line of the runtime, so it's not really practical to make a distribution of that language for one OS which hands that control over to the operating system.

But it's a pity, because there's a lot of room to improve some of the chronic problems we see from this artificial isolation of program-level memory management from OS level.

ninkendo
1 replies
20h29m

The OS only knows it can free memory when your process exits (same as file handles and other resources.) If your process is designed to exit once it's done its job, you can use the OS as a garbage collector.

Having an operating system know when memory is unused within your running program is not something that has ever existed though (except for perhaps some esoteric research OS's.) I wouldn't say we're missing an opportunity, because the thing we're missing doesn't exist in any meaningful sense. On the other hand, a programming methodology that uses very simple, short-lived programs is a totally legitimate way to do things... it's how CLI tools and the scripting languages that script them work, it's how web servers used to work (with CGI/etc), and it's a perfectly reasonable approach even today.

samatman
0 replies
3h51m

I wouldn't say we're missing an opportunity, because the thing we're missing doesn't exist in any meaningful sense

That's a not-uncommon sort of missed opportunity! But I believe Inferno does this: the Dis VM, which runs the entire (hosted) operating system, has a GC which is reused by Limbo, the systems language. But it didn't catch on.

There's no technical reason for the division between a process GC and an OS GC, and it would be nice to see what could be done by eliminating it. There are good practical reasons it hasn't happened, but it's hard for me to imagine it making anything worse, and it's easy to see things which could improve, including safe(r) shared-memory between processes.

kaba0
0 replies
22h44m

In Java, the Epsilon GC is just that.

flohofwoe
0 replies
1d2h

You see this in a lot of older "top-to-bottom" C programs: they allocate, they clean up system resources (using longjmp to one label), but they don't bother with free. When the program exits the OS gets all that memory back anyway, so why bother?

You don't need to bother with releasing other types of resources either though (files, sockets, threads, ...), since the operating system will take care of that on process exit (unless you are on AmigaOS). The only reason to free memory is to recycle that memory in other allocations in long running applications without grabbing new memory from the OS. For one-shot command line tools it's usually not needed.

SmartHypercube
4 replies
18h30m

Using RCU as the example to motivate GC is interesting. It is essentially transferring the responsibility of freeing from the writer to the last reader, who cannot be determined at compile time. It makes a lot of sense.

But this makes me think: if I want more performance, should I further transfer the freeing from the reader to a dedicated batch process? The readers only update a mark / write into a queue or something. Every once in a while, a batch process collects all the garbage and compacts. This way the readers don't have random additional overheads.

Lies people believe about memory management

The programmer knows the best times to pause for memory management.

In my experience, there are many programs in which the programmer does know the best times to pause for memory management. For example, in games and crypto trading programs, I want to classify all computations into two priorities. I need to do an update / render a frame / compute a trading action in every time period. If it finishes before the deadline, I have nothing to do now and would like to collect some garbage. If the high-priority thing is using all the available time, for example, when the market is very active, I don't care too much about collecting garbage and would like to defer everything that is not strictly necessary to as late as possible.
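
A minimal sketch of that two-priority idea (made-up names): retired objects go into a deferred list, and the frame/tick loop only frees them with whatever time is left before the deadline.

    #include <chrono>
    #include <functional>
    #include <vector>

    using Clock = std::chrono::steady_clock;

    std::vector<std::function<void()>> deferred;   // pending deallocations / cleanup work

    void frame(Clock::time_point deadline) {
        // 1. High-priority work first: update, render, compute the trading action, ...

        // 2. Spend any slack before the deadline on cleanup; if the budget is gone,
        //    the leftovers simply wait for a quieter frame.
        while (!deferred.empty() && Clock::now() < deadline) {
            deferred.back()();                     // run one deferred cleanup task
            deferred.pop_back();
        }
    }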

filleduchaos
3 replies
18h10m

The very next line after the portion of the article you clipped is "Sometimes there are obvious answers—like on the loading screen of a video game."

SmartHypercube
2 replies
17h21m

I'm not talking about the loading screen of a video game.

zelphirkalt
1 replies
7h53m

You are saying for example in games. The article is saying for example in games in a loading screen. You are saying not in loading screens. Maybe you should give more specific examples then, instead of leaving people guessing what you mean?

SmartHypercube
0 replies
7h11m

For example, in games and crypto trading programs, I want to classify all computations into two priorities. I need to do an update / render a frame / compute a trading action in every time period.

Sorry if that is not clear enough. The very next sentence in my comment gave "do an update" and "render a frame" as examples about games. I'm talking about the typical game loop. In almost all games the program has to do an update and render a frame every 1/30s or 1/60s. Missing the deadline degrades user experience a lot. And this is constantly happening and applies to almost all games. I mention these examples because I believe they are completely different from loading screens, which only happen on occasion. These examples make the batch, low-priority garbage collection I was thinking of necessary. Loading screens do not.

keybored
3 replies
1d3h

The article motivates RCU and then does a u-turn and starts making a general argument for general-purpose GC. Not quite a trojan horse but a bit whiplash provoking.

o11c
2 replies
1d1h

I definitely wouldn't call the RCU thing a GC, since at no point is the object garbage. It is in one of 3 states, and changes between them as quickly as possible:

* active

* obsolete but alive for old readers

* deallocated

Note that depending on how you write your code, it may be possible to reuse an "obsolete-but-alive" object for a "new" allocation safely, though I haven't analyzed performance for this in full.

As usual for GC discussions, it is very handwavy about when you have to "fall back to `shared_ptr`/Arc", when in fact avoiding refcounts (either because you can prove you already have ownership (which does have implications for tail calls, but you shouldn't use those anyway), or by avoiding indirection entirely) is the whole point of serious refcount-based systems. Doing nothing at all is obviously better than GC's "do something, sometime".

nemetroid
0 replies
16h25m

It’s not entirely clear from the article, but the part about rcu_defer() is not just a ”what if?” segue into GC, it’s how RCU is typically used.

naasking
0 replies
12h57m

RCU objects are garbage when they're held by old readers that will no longer utilize them in the remaining code path before they exit. This code path is virtually always of non-zero length, because ensuring they are deallocated in the absolute shortest path is an NP-class problem for sure. I don't see why the fact that this code path might sometimes be longer in GC would preclude RCU from being GC.

bckr
2 replies
1d3h

Off topic but what is going on in the picture of the leaking pipe? I don’t think it’s AI but there’s a third arm in there that I don’t understand. There’s at least two guys in the picture but I can’t assign one of the arms to either of them.

mindv0rtex
0 replies
1d2h

It's definitely AI generated, that picture makes no sense.

deutschepost
0 replies
1d2h

It's three people. The guy on the right is leaning over another one under the water. It's probably a picture from a flood training simulator.

Edit: https://commons.wikimedia.org/wiki/File:US_Navy_040308-N-000...

The picture was uploaded in 2009. Downvote if you want, but there is no AI at work here.

HippoBaro
2 replies
22h49m

For the kind of software I write there are two cases: (1) the hot path for which I will always have custom allocators and avoid allocations and (2) everything else.

For (1) GC or not it doesn’t make a difference, I’ll opt-out. For (2) GC is really convenient and correct.

leapis
1 replies
20h53m

Agreed - I come from a Java/C++ shop where we tried to tackle this dichotomy with interop but it ended up causing more problems than it solved. A lot of the work that Java has done with modern garbage collectors is impressive, but even they admit (indirectly, via Valhalla) that no/low-alloc code has its place.

astrobe_
0 replies
10h17m

no/low-alloc code has it's place

... Which is pretty large in the embedded firmware field. However that's not systems programming but system (singular) programming.

musicale
1 replies
21h46m

The tricky part is identifying which systems programmers can be garbage collected and when.

atum47
1 replies
1d1h

I once enabled garbage collection for this software I was writing and it collected everything.

ervine
0 replies
1d1h

Sprayed spot remover on my dog. Next day he was gone.

worik
0 replies
19h26m

Not mentioned in this article is one thing that goes very well with GC is async/await

I am an async/await hater for personal, idiosyncratic style reasons that I will not bore you with

I use it a lot in TypeScript/JavaScript. Have done so in Dart. Works as it should

I have used it in Rust. IMO it is a disaster there. Shoehorning the sort of memory management required to use async/await with a multi-threaded runtime is a hellscape

https://doc.rust-lang.org/std/pin/index.html

mgaunard
0 replies
1d2h

I do a lot of different types of systems programming.

The only times I actually use GC are to manage resources other than memory.

dblohm7
0 replies
2h2m

I really like the way the author explained RCU!

bsdpufferfish
0 replies
23h59m

Why are they always trying to sell these things to audiences who are not interested?

WhereIsTheTruth
0 replies
4h26m

Memory allocation strategy is task specific, not program specific

PaulDavisThe1st
0 replies
1d

I quoted this in another comment here, but just to highlight one of the best couple of sentences in TFA:

Tools to automate the “actually freeing the memory” part, like lifetimes in Rust and RAII in C++, don’t solve these problems. They absolutely aid correctness, something else you should care deeply about, but they do nothing to simplify all this machinery.