Spice: Fine-grained parallelism with sub-nanosecond overhead in Zig

nirushiv
24 replies
17h44m

I haven't read through the code in detail, but I can tell you “sub-nanosecond overhead” is misleading and marketing fluff. At first glance, the measure seems to be some convoluted “time per thing” where the number of threads is far, far smaller than the number of “thing”s.

mananaysiempre
7 replies
17h25m

That is the ecological niche of Rayon (cited) as well, isn't it? You need to process a lot of things (thousands to millions), you want to parallelize that processing as much as possible (a couple dozen cores, tops), and you want to not get killed by scheduling overhead. So you account for the per-thing overhead.

CyberDildonics
6 replies
16h35m

Anything can have its overhead amortized, but claiming that amortization as your actual overhead is just a lie.

hansvm
5 replies
14h38m

Most engineers aren't precise with throughput vs latency. Ideally you should report both figures (and anything else salient in performance-sensitive spaces), but it's less a lie and more an extremely commonplace mode of thinking and speaking.

Moreover, I think that mode of thought comes from the fact that most programming problems don't have hard latency bounds, so throughput dominates the conversation. If I'm spending 10us on average while handling a 10ms soft deadline, every single component can easily be occasionally 100x more expensive (latency) without me caring, and if it buys me another 1us on average (throughput) then I'll save gobs of money in compute.

CyberDildonics
2 replies
14h10m

> Most engineers aren't precise with throughput vs latency.

Anyone making claims about performance should know the difference.

This isn't even about either, it's lying about overhead by not counting it correctly.

If someone asks what your cable bill is and you say it's only $2.50 a month because you have 32 TVs, no one is going to say that makes sense.

hansvm
1 replies
12h1m

It's not "lying" to use the words most likely to correctly get your point across to your target audience. Maybe they could have communicated better (for a seemingly dead project, IMO they put enough time in regardless), but it's not lying.

> Anyone making claims about performance should know the difference.

They probably do know the difference. Knowing the difference isn't the thing you're quibbling with.

> If someone asks what your cable bill is and you say it's only $2.50 a month because you have 32 TVs, no one is going to say that makes sense.

Sure...because when asking about your cable bill it's unambiguous that you want the total number of dollars. The whole reason there's any issue at all here is that some of their language is ambiguous if you don't consider the target audience and don't analyze their charts or read the descriptions of those charts. Picking an analogy which resonates precisely because of the lack of ambiguity doesn't say a whole lot about the actual problem at hand.

CyberDildonics
0 replies
2h12m

The whole point of overhead is that it is a base constant on top of whatever you do. The whole point of overhead is to eventually be amortized in some way. When someone asks what the overhead is, it is a lie to try to factor amortization in because it's implied that it will be done somewhere anyway.

Dylan16807
1 replies
13h53m

If you're trying to split a sub-second task across multiple threads, then latency is probably your main concern.

hansvm
0 replies
12h26m

Kind of, but not in a way that matters for this scheduler (at least, I posit, not usually). If you have a single task with a wall-clock time in the 10-1000ms range and want to make it "faster," yes, you probably want to improve the latency. However, the latency introduced by this experimental library is negligible on the timescales being considered. With that in mind, a throughput improvement is a better description of the sorts of engineering efforts you need to accomplish your goals.

From a slightly different perspective, one thing the library does on top of handling small workloads well is handling _finely grained_ workloads. That's super important from a usability point of view, since you can express the parallelism in the most natural way and let the library handle the fact that it's hard to schedule that task (e.g., most schedulers suck as soon as you're talking about work items on the order of a single cache line). Just being able to have a convenient abstraction when writing a long-running job is also a fantastic feature. In that case, we would still care about throughput, not latency.

Separately, though definitely more rarely, there are jobs which are sub-second but where it's hard to batch many of them at once. If you're forced to execute many and care about the time till completion (latency, again at a much longer timescale), being able to make each one much faster is a big deal.

I really think that first paragraph is a commonplace occurrence though. Something important is slow (measured in "human" blinking timescales), so you throw a better scheduler and some parallelism at it and start work on the next ticket. Ripgrep, at some level (it has lots of other technical accomplishments; I don't want this comment to give the mistaken impression that I'm demeaning the project), is useful precisely because it throws parallelism at sub-second problems.

judofyr
5 replies
11h52m

Author here.

I knew that some people would react negatively to the term, but I can assure you the intention is for you to get a better understanding of exactly how and when you should use Spice and Rayon. I would recommend reading the benchmark document: https://github.com/judofyr/spice/blob/main/bench/README.md.

What people typically do when comparing parallel code is to compare only the sequential/baseline version with a parallel version running on all threads (16). Let's use the numbers for Rayon that I got for the 100M case:

- Sequential version: 7.48 ns.

- Rayon: 1.64 ns.

Then they go "For this problem Rayon showed a 4.5x speed-up, but uses 16 threads. Oh no, this is a bad fit." That's very true, but you don't learn anything from that. How can I apply this knowledge to other types of problems?

However, if you run the same benchmark on a varying number of threads you learn something more interesting: The scheduler in Rayon is actually pretty good at giving work to separate threads, but the overall work execution mechanism has a ~15 ns overhead. Despite this being an utterly useless program we've learnt something that we can apply later on: Our smallest unit of work should probably be a bit bigger than ~7 ns before we reach for Rayon. (Unless it's more important for us to reduce overall latency at the cost of the throughput of the whole system.)
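
As a rough back-of-the-envelope check on those two numbers (treating all of the extra CPU time as overhead and ignoring scaling losses and contention):

    extra CPU time per node ≈ 1.64 ns × 16 threads − 7.48 ns ≈ 19 ns

which is in the same ballpark as the ~15 ns figure you get from the runs at lower thread counts.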

In comparison, if you read the Rayon documentation they will not attempt to give you any number. They just say "Conceptually, calling join() is similar to spawning two threads, one executing each of the two closures. However, the implementation is quite different and incurs very low overhead": https://docs.rs/rayon/latest/rayon/fn.join.html.
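
For reference, join() usage looks roughly like this. This is only a minimal sketch of a recursive tree sum, not the actual benchmark code, and the Node type here is made up for illustration:

    // with rayon = "1" in Cargo.toml

    struct Node {
        value: u64,
        left: Option<Box<Node>>,
        right: Option<Box<Node>>,
    }

    fn sum(node: &Node) -> u64 {
        // rayon::join may run the two closures on different worker threads;
        // the cost of that machinery is the per-node overhead being discussed.
        let (l, r) = rayon::join(
            || node.left.as_deref().map_or(0, sum),
            || node.right.as_deref().map_or(0, sum),
        );
        node.value + l + r
    }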

(Also: If I wanted to be misleading I would say "Spice is twice as fast as Rayon since it gets 10x speed-up compared to 4.5x speed-up")

mgaunard
1 replies
9h39m

You can just divide the speed-up by the number of cores, and that gives you the parallelization efficiency.
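
For example, plugging in the ~11x speed-up on 16 threads mentioned elsewhere in this thread:

    parallelization efficiency = speedup / cores ≈ 11 / 16 ≈ 0.69, i.e. ~69%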

I've seen systems that can achieve 99% efficiency on thousands of nodes for real, useful applications that involve non-trivial synchronization. Now that is an impressive feat.

Sure, there is probably some extra latency to get everything running, but for a sufficiently long program run, that is all irrelevant.

aidenn0
0 replies
2h54m

> You can just divide the speed-up by the number of cores, and that gives you the parallelization efficiency.

The linked benchmark document does this. 82% for 4 cores on Spice with a small workload, and 69% for 16 cores on Spice with a large workload. Compared to about 25% for Rayon on 4 cores with a small workload and 88% for Rayon on 16 cores with a large workload.

> Sure, there is probably some extra latency to get everything running, but for a sufficiently long program run, that is all irrelevant.

The entire point of the linked benchmark README.md is to deal with insufficiently long program runs. Spice is an attempt to allow parallelization of very small amounts of work by decreasing fixed overhead. Perhaps such a thing is not useful but that doesn't prevent it from being interesting.

littlestymaar
1 replies
10h48m

Thanks for the answer, this part is particularly interesting indeed:

> Despite this being an utterly useless program we've learnt something that we can apply later on: Our smallest unit of work should probably be a bit bigger than ~7 ns before we reach for Rayon.

That's a very interesting project.

The big limitation I see with the current approach is that the usability of the library is much worth than what Rayon offers.

The true magic of Rayon is that you just replace `iter()` with `par_iter()` in your code and voilà! Now you have parallel execution. But yes, it has some overhead, so maybe Rayon could try and implement this kind of scheduling as an alternative so that people can pick what works best for their use case.
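
For concreteness (a toy example, not code from either project):

    use rayon::prelude::*;

    fn sum_of_squares(xs: &[u64]) -> u64 {
        // Sequential: xs.iter().map(|x| x * x).sum()
        // Parallel:   swap iter() for par_iter() and you're done.
        xs.par_iter().map(|x| x * x).sum()
    }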

littlestymaar
0 replies
2h9m

Too late to edit so I'll put it here:

> the usability of the library is much worse than what Rayon

I'm a little bit ashamed to see that this fairly upvoted comment of mine has such a stupid English mistake in it…

imtringued
0 replies
4h26m

Adding more cores doesn't change the time per operation. Your graphs are grossly wrong. What you should have done is drop the nanoseconds and just take the total execution time. Whenever you're writing 1.64ns, you should have written 164ms.

The overhead should be measured as a percentage versus a theoretical baseline such as perfect linear speedup. You haven't shown the ideal scenario for each core count, so how are we supposed to know how much overhead there really is?

The single-core scenario is 363ms and linear speedup for 32 cores gives us 11.3ms. Your benchmark says you needed 38ms. This means you achieved 31% of the theoretical performance of the CPU, which, mind you, is pretty good, especially since nobody has benchmarked what is actually possible while loading all cores by running 32 single-threaded copies of the original program, but you're advertising a meaningless "sub nanosecond" measure here.

jnordwick
5 replies
14h21m

Yesterday, he posted on Reddit and I expressed some concern with the benchmarks. The benchmarks claim 0.36 ns of overhead per call, but that only covers the computing function. There is a second thread running that does the scheduling, which the overhead numbers don't include. It seems pretty clear he's running on a hyperthreaded 8-core machine (so 16 threads); I was guessing 3 GHz, so that's literally a single cycle of overhead.

Each extra thread adds more overhead from lock contention. At 16 threads the overhead is up to 3.6 ns (so 10 times more). I'm guessing, but that would mean the 0.36 ns of overhead included an uncontended lock? That's not possible. There's some other weirdness going on in the benchmark data too. So either I'm not understanding what he's actually timing or maybe there is a bug in the benchmark code.

Also, if you multiply all the values out, I think he's timing in milliseconds (when the runtimes are calculated and converted to millis, they come out as whole numbers). Don't most benchmarking tools have better precision than that? Maybe he's just using `time prog` and the data is just really dirty. Or maybe he's just choosing really, really bad metrics that are totally useless for this (this is probably correct, just not sure if there are more issues).

Veedrac
2 replies
12h55m

I don't understand why it's unbelievable. They walk through the logic required to perform scheduling and dispatch; it looks incredibly simple and extremely well suited to hiding on an out-of-order core. The benchmark is summing down a binary tree, which is going to give you a lot of space to hide what's at most a handful of instructions. There's obviously no lock contention because, like, look at the algorithm: what locks are contending on what?

AFAICT this is just a really cool and really practical algorithm for microthreads with near zero cost per semantic thread.

chc4
1 replies
11h49m

> There's obviously no lock

What? The threadpool has a shared mutex accessed by each worker thread, which is used for pushing work on heartbeat and for dequeuing work. https://github.com/judofyr/spice/blob/167deba6e4d319f96d9d67...

"Adding more threads to the system will not make your program any slower" is an insane claim to be making for any lightweight task scheduling system, and trivially not true looking at this: if you have 1000 threads as workers and degenerate tree program than each worker will take turns serializing themselves through the mutex. Things like these are massive red flags to anyone reading the README.

Veedrac
0 replies
11h8m

The README covers this. Its argument is persuasive. If your point is that the constant is badly tuned for theoretical 1000-core machines that don't exist, I'm not sure I care. A 100 ns stall at most every 100 µs becoming more likely as you approach multiple hundreds of cores is hardly a disaster. In the context of the comment I replied to, the difference between 8 and 16 workers is literally zero, as the wakeups are spaced so the locks will never conflict.

Actually, if you did have a 32k core machine somehow with magical sufficiently-uniform memory for microthreading to be sensible for it, I think it's not even hard to extend the algorithm to work with that. Just put the workers on a 3D torus and only share orthogonally. It means you don't have perfect work sharing, but I'm also pretty sure it doesn't matter.

imtringued
1 replies
11h51m

The lock is acquired every 100 microseconds and it is expected that only one thread accesses it at a time.
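
In sketch form, the idea is roughly this (a paraphrase of the mechanism as described upthread, not Spice's actual code):

    use std::collections::VecDeque;
    use std::sync::atomic::{AtomicBool, Ordering};
    use std::sync::{Arc, Mutex};
    use std::thread;
    use std::time::Duration;

    type Task = Box<dyn FnOnce() + Send>;

    // Each worker owns a cheap, unsynchronized local queue and only touches
    // the shared, mutex-protected queue when its heartbeat flag has been set.
    struct Worker {
        heartbeat: Arc<AtomicBool>,         // flipped by a timer every ~100us
        local: VecDeque<Task>,              // thread-local work, no locking
        shared: Arc<Mutex<VecDeque<Task>>>, // work promoted for idle workers
    }

    impl Worker {
        // Called at every potential fork point. The common case is a single
        // relaxed atomic operation, which is what the per-call overhead measures.
        fn maybe_share(&mut self) {
            if self.heartbeat.swap(false, Ordering::Relaxed) {
                // Rare path, at most once per heartbeat: promote the oldest
                // local task so another worker can pick it up.
                if let Some(task) = self.local.pop_front() {
                    self.shared.lock().unwrap().push_back(task);
                }
            }
        }
    }

    // Timer thread; with heartbeats staggered across workers, at most one
    // worker hits the mutex at any given moment.
    fn start_heartbeat(flag: Arc<AtomicBool>) {
        thread::spawn(move || loop {
            thread::sleep(Duration::from_micros(100));
            flag.store(true, Ordering::Relaxed);
        });
    }

So the contention question is about how often that rare path is taken, not about the cost of the hot path.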

jnordwick
0 replies
9h31m

I know. But I have no other explanation for the latency increase that is seen as the number of threads increases. Do you have a better theory?

x-complexity
2 replies
17h7m

> I can tell you “sub-nanosecond overhead” is misleading and marketing fluff

If and only if (1-thread Spice - non-parallelized baseline) > 1 ns, and their tests back up the claim that it isn't.

https://github.com/judofyr/spice/tree/main/bench

adrian_b
1 replies
11h35m

At your link it also says:

"Spice shows subpar scalability: The speed-up of using 16 threads was merely ~11x"

If that is true, then "Spice" is suitable only for small tasks, which complete in at most milliseconds and can benefit from its low overhead, while for any bigger tasks something better must be used.

judofyr
0 replies
11h13m

Author here.

I’d maybe phrase it as “Spice is not optimal” instead of “Spice is not suited for”, but yes, that’s the conclusion for this benchmark.

I’m hoping/assuming that for a more typical case (more CPU work being done) Spice will scale better, but I haven’t done the benchmark yet.

audunw
0 replies
4h48m

Did you read the README at all? I thought it was extremely precise about what exactly is meant by the claim in the title. Not much room for misunderstanding.

The title is totally fine. There is no title with zero room for misinterpretation. All I took from it is that it's a library with extremely low latency by some measure... and then I went to the README to see exactly what that measure was. Very straightforward.

jd3
6 replies
17h31m

Want to state off the bat that this project is awesome and huge kudos to the author for spending their time, attention, and energy 1) working diligently to get this working at all and 2) sharing it with the broader HN community, who are generally known to be hyper-critical to a pedantic degree and/or overly pessimistic (cough the initial Docker project Show HN thread cough).

I also really appreciate that the author recognizes the limits of their own project, which preemptively addresses most of the usual snark.

> Lack of tests: Spice contains a lot of gnarly concurrent code, but has zero testing coverage. This would have to be improved before Spice can be responsibly used for critical tasks.

Testing correctness of execution for critical tasks is one thing, but I would expect a library which implements "gnarly concurrent code" to at least have regression tests. What guarantee does an end-user have that functionality which works today won't break tomorrow due to a subtle yet nefarious regression?

SQLite has 590 times as much test code and test scripts as it does raw C source code [0]; this fact, along with its stability and portability, is one of the numerous reasons why it has proliferated to become the de facto embedded database used across the planet. While we're comparing apples to oranges in this contrived example, the general point still stands: regression tests beget stability and confidence in a project.

In epics where I work, if we _must_ defer baseline regression tests, we usually create a follow-up ticket inside the same epic to at least write them before the feature/epic launches.

[0]: https://www.sqlite.org/testing.html

stareatgoats
1 replies
5h52m

> (cough the initial Docker project Show HN thread cough)

Docker was largely met with enthusiasm here when it was launched. I believe you must be referring to how Dropbox was received, which was famously negative at first.

pverghese
1 replies
17h29m

You are welcome to add it. This is a proof of concept

jd3
0 replies
17h24m

> Spice is primarily a research project. Read along to learn more about it, but if you're considering using it in production you should be aware of its many limitations.

Ah, I missed that upon first read. In that case, that caveat/limitation is definitely justified.

dsp_person
0 replies
17h16m

> sharing it with the broader HN community

Note that I posted this, but I am not the author

dagw
0 replies
5h46m

> HN community, who are generally known to be hyper-critical to a pedantic degree and/or overly pessimistic (cough the initial Docker project Show HN thread cough)

So you made me dig up the announcement, and contrary to your recollection, it is almost universally positive.

https://news.ycombinator.com/item?id=5408002

raggi
2 replies
17h20m

cooperative scheduling is the basis for so many patterns with great metrics :)

nine_k
1 replies
17h5m

But it's not very cooperative, as in tasks yielding to each other. They mostly cooperate by letting some tasks be given to other threads, and not all the time, but only once per heartbeat. The scheduling happens rarely, so its amortized cost is low.

raggi
0 replies
15h20m

Sure, but that's all details, and there are other cooperative scheduling models which aren't excessively eager. This isn't to be a downer on this implementation, or other implementations, but just to point generally at the advantages of cooperation, of which there are many, along with one particular implicit and major downside: starvation when the cooperation contract is not adhered to.

shwestrick
1 replies
4h48m

For those curious, this implementation is based on a recent line of research called "heartbeat scheduling" which amortizes the overheads of creating parallelism, essentially accomplishing a kind of dynamic automatic granularity control.

Related papers:

(2018) Heartbeat Scheduling: Provable Efficiency for Nested Parallelism. https://www.andrew.cmu.edu/user/mrainey/papers/heartbeat.pdf

(2021) Task Parallel Assembly Language for Uncompromising Parallelism. https://users.cs.northwestern.edu/~simonec/files/Research/pa...

(2024) Compiling Loop-Based Nested Parallelism for Irregular Workloads. https://users.cs.northwestern.edu/~simonec/files/Research/pa...

(2024) Automatic Parallelism Management. https://www.cs.cmu.edu/~swestric/24/popl24-par-manage.pdf

judofyr
0 replies
3h14m

Oh, this is super interesting. I was only aware of the first two while writing Spice. I'll definitely look into the last two as well. Thanks for sharing!

gyrovagueGeist
0 replies
13h56m

This is neat and links to some great papers. I wish the comparison were with OpenMP tasks, though; I've heard Rayon has a reputation for being a bit slow.

geertj
0 replies
4h3m

Per the description, this uses busy waiting in the workers to get to nanosecond-level latencies. I wonder if anyone has a perspective on how realistic busy waiting is in large applications with tens of thousands of tasks? Maybe it works if the tasks are async (i.e. not thread-based) so that you only have N waiters, where N is the size of the executor's thread pool? In any case, the energy consumption of such an architecture would be higher.

Related: I've been interested for a while in whether there's a faster way for a producer of work to wake up a consumer without resorting to busy waiting, possibly by running the consumer in the producer's time slice.

Also related: I've wondered if it's possible to have a user-space FUTEX_WAKE operation that would halve the typical penalty of waking up a consumer (to just the consumer).

assafe
0 replies
11h50m

This is great!

akovaski
0 replies
15h29m

I'm not terribly familiar with this space, but I do like the concurrency model presented here.

I think the README here is very well written, and I have a good idea of what's going on just from reading it, but there are a few areas where I'm left scratching my head. Thankfully the code is fairly easy to read.