One nice thing about this (and the new offerings from AMD) is that they will be using the "Open Accelerator Module (OAM)" interface, which standardizes the connector used to mount the modules on baseboards, similar to Nvidia's SXM modules, which attach to their baseboards via MegArray connectors.
With Nvidia, the SXM connection pinouts have always been held proprietary and confidential. For example, P100's and V100's have standard PCI-e lanes connected to one of the two sides of their MegArray connectors, and if you know that pinout you could literally build PCI-e cards with SXM2/3 connectors to repurpose those now obsolete chips (this has been done by one person).
There are thousands, maybe tens of thousands, of P100's you could pick up for literally <$50 apiece these days, which technically gives you more Tflops/$ than anything on the market. But they are useless, because the interface was never made open, it has not been openly reverse engineered, and the OEM baseboards (mainly Dell and Supermicro) are still hideously expensive outside China.
I'm one of those people who finds 'retro-supercomputing' a cool hobby, so open interfaces like OAM mean these devices may actually have a life with hobbyists in 8~10 years, instead of going straight to the bins due to secret interfaces and obfuscated backplane specifications.
The price is low because they’re useless (except for replacing dead cards in a DGX). If there were a $40 PCIe AIC-to-SXM adapter, the price would go up a lot.
Very cool hobby. It’s also unfortunate how stringent e-waste rules lead to so much perfectly fine hardware being scrapped, and how the remainder is typically pulled apart to the board / module level for spares. That makes it very unlikely to stumble over more-or-less complete systems.
I'm not sure the prices would go up that much. What would anyone buy that card for?
Yes, it has a decent memory bandwidth (~750 GB/s) and it runs CUDA. But it only has 16 GB and doesn't support tensor cores or low precision floats. It's in a weird place.
Scientific computing would buy it up like hot cakes.
Only if the specific workload needs FP64 (4.5 Tflop/s), the 9 Tflop/s for FP32 can be had for cheap with Turing or Ampere consumer cards.
Still, your point stands. It's crazy how that 2016 GPU has two thirds the FP32 power of this new 2024 unobtanium card and infinitely more FP64.
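If anyone wants to play with the numbers, here's a rough back-of-the-envelope in Python using the figures from this thread; the prices are my own ballpark used-market assumptions, not quotes:

    # Rough Tflops-per-dollar comparison; spec numbers from this thread,
    # prices are ballpark used-market guesses.
    cards = {
        #              (FP32 Tflop/s, FP64 Tflop/s, assumed price USD)
        "P100 (SXM2)": (9.0,  4.5,   50),
        "RTX 3090":    (35.6, 0.56, 700),  # consumer FP64 is roughly 1:64
    }
    for name, (fp32, fp64, usd) in cards.items():
        print(f"{name:12s} {fp32/usd:7.3f} FP32 Tflops/$  {fp64/usd:7.4f} FP64 Tflops/$")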
Somewhat off topic:
Is there a similar "magic value card" for low memory (2GB?) 8-bit LLMs?
Since memory is the expensive bit, surely there are low cost low memory models?
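To put rough numbers on what "2 GB at 8-bit" buys you (the model shape below is a made-up but typical small-model configuration, purely illustrative):

    # Back-of-the-envelope VRAM budget for an 8-bit quantized LLM.
    params = 1.5e9                 # ~1.5B-parameter model (assumption)
    weights_gb = params * 1 / 1e9  # 1 byte per parameter at 8-bit
    # KV cache: 2 (K and V) * layers * kv_heads * head_dim * context * 1 byte (8-bit cache)
    kv_gb = 2 * 28 * 2 * 128 * 4096 / 1e9
    print(f"weights ~{weights_gb:.2f} GB, KV cache ~{kv_gb:.2f} GB")  # ~1.5 GB + ~0.06 GB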
I believe that's what Tenstorrent is aiming for.
The main offer of Tenstorrent goes into server racks and is designed to form clusters.
Standalone cards are more like dev kits.
(I’ve been tracking Tenstorrent for 3+ years and currently have a Grayskull in my ML test rig together with a 3090)
IDK, is it really that much more powerful than the P40, which is already fairly cheap?
The P100 has amazing double precision (FP64) flops (due to a 1:2 FP64:FP32 ratio that got nixed on later consumer cards) and higher memory bandwidth, which made it a real standout GPU for scientific computing applications: Computational Fluid Dynamics, etc.
The P40 was aimed at the cloud image and video processing market, I think, hence the GDDR RAM instead of HBM, so it got more VRAM but at much lower bandwidth.
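If you have one plugged in, a quick-and-dirty way to see that FP64:FP32 ratio for yourself (unscientific sketch, assumes a CUDA build of PyTorch):

    import time
    import torch

    def tflops(dtype, n=4096, iters=20):
        # Time n x n matmuls and convert to Tflop/s (2*n^3 flops per matmul).
        a = torch.randn(n, n, device="cuda", dtype=dtype)
        b = torch.randn(n, n, device="cuda", dtype=dtype)
        torch.cuda.synchronize()
        t0 = time.time()
        for _ in range(iters):
            a @ b
        torch.cuda.synchronize()
        return 2 * n**3 * iters / (time.time() - t0) / 1e12

    print("FP32:", round(tflops(torch.float32), 1), "Tflop/s")
    print("FP64:", round(tflops(torch.float64), 1), "Tflop/s")  # ~half of FP32 on a P100, far worse on consumer cards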
Well, the P40 has 24 GB of VRAM, which makes it the perfect hobbyist card for an LLM, assuming you can keep it cool.
The PCI-e P100 has 16 GB of VRAM and won’t go below 160 dollars. Prices for these things would pick up if you could put them in some sort of PCIe adapter.
Pascal series are cheap because they are CUDA compute capability 6.0 and lack Tensor Cores. Volta (7.0) was the first to have Tensor Cores and in many cases is the bare minimum for modern/current stacks.
See FlashAttention, Triton, etc. as core enabling libraries. Not to mention all of the custom CUDA kernels all over the place. Take all of this and then stack layers on top of them...
Unfortunately there is the famous "GPU poor vs GPU rich" divide. Pascal puts you at "GPU destitute" (regardless of how much VRAM you assemble), and outside of implementations like llama.cpp, which go to incredible and impressive lengths to support these old archs, you will very quickly run into show-stopping issues that make you wish you had just handed over the money for >= 7.0.
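For anyone checking whether a used card clears that bar, something along these lines works (assumes a CUDA build of PyTorch):

    import torch

    major, minor = torch.cuda.get_device_capability()
    print(f"compute capability {major}.{minor}")
    if (major, minor) < (7, 0):
        # Pre-Volta: no tensor cores; much of the modern stack
        # (FlashAttention, Triton kernels, etc.) assumes 7.0 or newer.
        print("expect show-stoppers outside llama.cpp-style fallbacks")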
I support any use of old hardware, but this kind of reminds me of my "ancient" X5690, which has impressive performance (relatively speaking) but always bites me because it doesn't have AVX.
Hey that’s not fair, the X5690 is VERY efficient… at heating a home in the winter time.
Easier said than done. I've got a dual X5690 at home in Kiev, Ukraine and I just couldn't find anything to run on it 24x7. And it doesn't produce much heat idling. I mean at all.
Run BOINC maybe? [0]
[0]: https://boinc.berkeley.edu/
Makes little sense to actually run anything on an X5690 power-wise.
All the sane and rational people are rooting for you here in the U.S. I’m sorry our government is garbage and aid hasn’t been coming through as expected. Hopefully Ukraine can stick it to that chicken-fucker in the Kremlin and retake Crimea too.
I didn’t have an X5690 because the TDP was too high for my server’s heatsinks, but I had 90W variants of the same generation. To me, two at idle produced noticeable heat, though not as much as four idling in a PowerEdge R910 did. The R910 idled at around 300W.
There’s always Folding@Home if you don’t mind the electric bill. Plex is another option. I know a guy running a massive Plex server that was on Westmere/Nehalem Xeons until I gave him my R720 with Haswell Xeons.
It looks pathetic indeed. Makes many people question: if THAT'S democracy, then maybe it's not worth fighting for.
The same could be said about russian people (sane and rational ones). But what do both people have in common? The answer is: currently both nations are helpless to change what their government does.
I know. We all truly know and greatly appreciate that. There would be no Ukraine if not American weapons and help.
Makes little sense power-wise.
This is all very true for machine-learning research tasks, where, yes, if you want that latest PyTorch library function to work you need to be on the latest ML stack.
But my work/fun is in CFD. One of the main codes I use for work was written when Pascal was the primary supported architecture. The same goes for other HPC stuff that can run via OpenCL and is still plenty compatible. Things compiled back then will still run today; it's not a moving target the way ML has been.
Exactly. Demand for FP64 is significantly lower than for ML/AI.
Pascal isn’t incredibly cheap by comparison because it’s some secret hack. It’s cheap by comparison because most of the market (AI/ML) doesn’t want it. Speaking of which…
At the risk of “No True Scotsman”, what qualifies as HPC gets interesting, but just today I was at a Top500 site that was talking about their Volta system not being worth the power, which is relevant to the parent comment but still problematic for reasons.
I mentioned llama.cpp because the /r/locallama crowd, etc. has actually driven up the cost of used Pascal hardware by treating it as a path to cheap VRAM for their very, very narrow use cases.
If we’re talking about getting a little FP64 for CFD that’s one thing. ML/AI is another. HPC is yet another.
The SXM2 interface is actually publicly documented! There is an Open Compute spec for an 8-way baseboard. You can find the pinouts there.
Upon further review... I think any actual baseboard schematics/pinouts touching the Nvidia hardware directly are indeed kept behind some sort of NDA or OEM license agreement, and are specifically left out of the Open Compute Project documents for the JBOG rigs.
I think this is literally the impetus for the OAM spec, which makes the pinout open and shareable. Up until then, they had to keep the actual baseboard designs out of the public because that part was still controlled Nvidia IP.
Hmm interesting, I was linked to an OCP dropbox with a version that did have the connector pinouts. Maybe something someone shouldn’t have posted then…
I could find the OCP accelerator spec, but it looks like an open-source reimplementation, not actual SXM2. That said, the photos of SXM2-to-PCIe adapters I could find look almost entirely passive, so I don't think all hope is lost either.
It would be a shame if such a thing were to fall off the back of a truck as they say
Couldn't someone just buy one of those Chinese SXM2-to-PCIe adapter boards and test continuity to get the pinouts? I have one; it would take like 10 minutes.
I had read their documents, such as the spec for the Big Basin JBOG, where everything is documented except the actual pinouts on the baseboard. Everything leading up to it and away from it is there, but the actual MegArray pinout connecting to a single P100/V100 I never found.
But maybe there was more I missed. I'll take another look.
I really like this side of AMD. There's a strategic call somewhere high up to bias towards collaboration with other companies. Sharing the fabric specifications with Broadcom was an amazing thing to see. It's not out of the question that we'll see single chips with chiplets made by different companies attached together.
Well, let's not forget, AMD is AMD because they reverse-engineered Intel chips...
IBM didn't want to rely solely on Intel when introducing the PC, so it forced Intel to license its arch to a second manufacturer, which turned out to be AMD. It's not like AMD stole it. The math coprocessor was in turn invented by AMD (Am9511, Am9512) and licensed by Intel (8231, 8232).
Also AMD64
Maybe they feel threatened by ARM on mobile and Intel on desktop / server. Companies that think they're first try to monopolize. Companies that think they're second try to cooperate.
Why don't they sell used P100 DGX/HGX servers as a unit? Are those bare P100s only so cheap precisely because they're useless?
I have a theory some big cloud provider moved a ton of racks from SXM2 P100's to SXM2 V100's (those were a thing) and thus orphaned an absolute ton of P100's without their baseboards.
Or the salvage operations just stripped the racks, kept the small stuff, and e-wasted the racks because they figured that was a more efficient use of their storage space and easier to sell, without thinking it through.
A ton of Nvidia GPUs fry their memory over time and need to be scrapped. Look up Nvidia (A100/H100) row remapping failures.
As “humble” as NVIDIA’s CEO appears to be, NVIDIA the company (which he’s been running this whole time) made decision after decision with the simple intention of killing off its competition (ATI/AMD). GameWorks is my favorite example: essentially, if you wanted a video game to look as good as possible, you needed an NVIDIA GPU. The same games played on AMD GPUs just didn’t look as good.
Now that video gaming is secondary (tertiary?) to Nvidia’s revenue stream, they could give a shit which brand gamers prefer. It’s small time now. All that matters is who companies are buying their GPUs from for AI stuff. Break down that CUDA wall and it’s open-season. I wonder how they plan to stave that off. It’s only a matter of time before people get tired of writing C++ code to interface with CUDA.
You don't need to use C++ to interface with CUDA or even write it.
A while ago NVIDIA and the GraalVM team demoed grCUDA, which makes it easy to share memory with CUDA kernels and invoke them from any managed language that runs on GraalVM (which includes JIT-compiled Python). Because it's integrated with the compiler, the invocation overhead is low:
https://developer.nvidia.com/blog/grcuda-a-polyglot-language...
And TornadoVM lets you write kernels in JVM langs that are compiled through to CUDA:
https://www.tornadovm.org
There are similar technologies for other languages/runtimes too. So I don't think that will cause NVIDIA to lose ground.
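As one concrete, minimal example of the kind of thing I mean: driving a CUDA kernel straight from Python with Numba, no C++ anywhere (a sketch, assuming numba and a CUDA toolkit are installed):

    import numpy as np
    from numba import cuda

    @cuda.jit
    def scale(out, x, a):
        # Each thread handles one element.
        i = cuda.grid(1)
        if i < x.size:
            out[i] = a * x[i]

    x = np.arange(1_000_000, dtype=np.float32)
    out = np.zeros_like(x)
    threads = 256
    blocks = (x.size + threads - 1) // threads
    scale[blocks, threads](out, x, 2.0)  # Numba handles the host<->device copies
    assert out[3] == 6.0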
So these alternatives exist, yes, but are they “production ready”? In other words, are they being used? My opinion is that while you can use another language, most companies, for one reason or another, are still using C++. I just don’t really know what the reason(s) are.
I think about other areas in tech where you can use whatever language, but it isn’t practical to do so. I can write a backend API server in Swift… or, perhaps more relevant, I can use AMD’s ROCm to do… anything.
Best Tflops/$ is actually the 4090, then the 3090. Also the L4.