One nice thing about this (and the new offerings from AMD) is that they will be using the "Open Accelerator Module (OAM)" interface, which standardizes the connector used to mount the modules on baseboards, similar to Nvidia's SXM modules, which attach to their baseboards via MegArray connectors.
With Nvidia, the SXM connection pinouts have always been held proprietary and confidential. For example, P100's and V100's have standard PCI-e lanes connected to one of the two sides of their MegArray connectors, and if you know that pinout you could literally build PCI-e cards with SXM2/3 connectors to repurpose those now obsolete chips (this has been done by one person).
There are thousands, maybe tens of thousands, of P100's you could pick up for literally <$50 apiece these days, which technically gives you more Tflops/$ than anything on the market. But they are useless, because the interface was never made open, it has not been openly reverse engineered, and the OEM baseboards (mainly Dell and Supermicro) are still hideously expensive outside China.
I'm one of those people who finds 'retro-supercomputing' a cool hobby, so open interfaces like OAM mean these devices may actually have a life with hobbyists in 8~10 years, instead of going straight to the bins due to secret interfaces and obfuscated backplane specifications.
The price is low because they’re useless (except for replacing dead cards in a DGX). If there were a $40 PCIe AIC-to-SXM adapter, the price would go up a lot.
Very cool hobby. It’s also unfortunate how stringent e-waste rules lead to so much perfectly fine hardware being scrapped, and how the remainder is typically pulled apart to the board / module level for spares. That makes it very unlikely to stumble over more-or-less complete systems.
I'm not sure the prices would go up that much. What would anyone buy that card for?
Yes, it has a decent memory bandwidth (~750 GB/s) and it runs CUDA. But it only has 16 GB and doesn't support tensor cores or low precision floats. It's in a weird place.
Scientific computing would buy it up like hot cakes.
Only if the specific workload needs FP64 (4.5 Tflop/s), the 9 Tflop/s for FP32 can be had for cheap with Turing or Ampere consumer cards.
Still, your point stands. It's crazy how that 2016 GPU has two thirds the FP32 power of this new 2024 unobtanium card and infinitely more FP64.
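If anyone wants to play with the numbers, here's a rough back-of-the-envelope in Python using the figures from this thread; the prices are my own ballpark used-market assumptions, not quotes:

    # Rough Tflops-per-dollar comparison; spec numbers from this thread,
    # prices are ballpark used-market guesses.
    cards = {
        #              (FP32 Tflop/s, FP64 Tflop/s, assumed price USD)
        "P100 (SXM2)": (9.0,  4.5,   50),
        "RTX 3090":    (35.6, 0.56, 700),  # consumer FP64 is roughly 1:64
    }
    for name, (fp32, fp64, usd) in cards.items():
        print(f"{name:12s} {fp32/usd:7.3f} FP32 Tflops/$  {fp64/usd:7.4f} FP64 Tflops/$")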
Somewhat off topic:
Is there a similar "magic value card" for low memory (2GB?) 8-bit LLMs?
Since memory is the expensive bit, surely there are low cost low memory models?
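To put rough numbers on what "2 GB at 8-bit" buys you (the model shape below is a made-up but typical small-model configuration, purely illustrative):

    # Back-of-the-envelope VRAM budget for an 8-bit quantized LLM.
    params = 1.5e9                 # ~1.5B-parameter model (assumption)
    weights_gb = params * 1 / 1e9  # 1 byte per parameter at 8-bit
    # KV cache: 2 (K and V) * layers * kv_heads * head_dim * context * 1 byte (8-bit cache)
    kv_gb = 2 * 28 * 2 * 128 * 4096 / 1e9
    print(f"weights ~{weights_gb:.2f} GB, KV cache ~{kv_gb:.2f} GB")  # ~1.5 GB + ~0.06 GB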
I believe that's what Tenstorrent is aiming for.
The main offer of Tenstorrent goes into server racks and is designed to form clusters.
Standalone cards are more like dev kits.
(I’ve been tracking Tenstorrent for 3+ years and currently have a Grayskull in my ML test rig together with a 3090)
IDK, is it really that much more powerful than the P40, which is already fairly cheap?
The P100 has amazing double precision (FP64) flops (due to a 1:2 FP64:FP32 ratio that got nixed on later consumer cards) and higher memory bandwidth, which made it a real standout GPU for scientific computing applications: Computational Fluid Dynamics, etc.
The P40 was aimed at the cloud image and video processing market, I think, hence the GDDR RAM instead of HBM, so it got more VRAM but at much lower bandwidth.
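If you have one plugged in, a quick-and-dirty way to see that FP64:FP32 ratio for yourself (unscientific sketch, assumes a CUDA build of PyTorch):

    import time
    import torch

    def tflops(dtype, n=4096, iters=20):
        # Time n x n matmuls and convert to Tflop/s (2*n^3 flops per matmul).
        a = torch.randn(n, n, device="cuda", dtype=dtype)
        b = torch.randn(n, n, device="cuda", dtype=dtype)
        torch.cuda.synchronize()
        t0 = time.time()
        for _ in range(iters):
            a @ b
        torch.cuda.synchronize()
        return 2 * n**3 * iters / (time.time() - t0) / 1e12

    print("FP32:", round(tflops(torch.float32), 1), "Tflop/s")
    print("FP64:", round(tflops(torch.float64), 1), "Tflop/s")  # ~half of FP32 on a P100, far worse on consumer cards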
Well, the P40 has 24 GB of VRAM, which makes it the perfect hobbyist card for an LLM, assuming you can keep it cool.
The PCI-e P100 has 16 GB of VRAM and won’t go below 160 dollars. Prices for these things would pick up if you could put them in some sort of PCIe adapter.
Pascal series are cheap because they are CUDA compute capability 6.0 and lack Tensor Cores. Volta (7.0) was the first to have Tensor Cores and in many cases is the bare minimum for modern/current stacks.
See FlashAttention, Triton, etc. as core enabling libraries. Not to mention all of the custom CUDA kernels all over the place. Take all of this and then stack layers on top of them...
Unfortunately there is the famous "GPU poor vs GPU rich" divide. Pascal puts you at "GPU destitute" (regardless of how much VRAM you assemble), and outside of implementations like llama.cpp, which go to incredible and impressive lengths to support these old archs, you will very quickly run into show-stopping issues that make you wish you had just handed over the money for >= 7.0.
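For anyone checking whether a used card clears that bar, something along these lines works (assumes a CUDA build of PyTorch):

    import torch

    major, minor = torch.cuda.get_device_capability()
    print(f"compute capability {major}.{minor}")
    if (major, minor) < (7, 0):
        # Pre-Volta: no tensor cores; much of the modern stack
        # (FlashAttention, Triton kernels, etc.) assumes 7.0 or newer.
        print("expect show-stoppers outside llama.cpp-style fallbacks")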
I support any use of old hardware, but this kind of reminds me of my "ancient" X5690, which has impressive performance (relatively speaking) but always bites me because it doesn't have AVX.
Hey that’s not fair, the X5690 is VERY efficient… at heating a home in the winter time.
Easier said than done. I've got a dual X5690 at home in Kiev, Ukraine and I just couldn't find anything to run on it 24x7. And it doesn't produce much heat idling. I mean at all.
Run BOINC maybe? [0]
[0]: https://boinc.berkeley.edu/
Makes little sense to actually run anything on an X5690 power-wise.
All the sane and rational people are rooting for you here in the U.S. I’m sorry our government is garbage and aid hasn’t been coming through as expected. Hopefully Ukraine can stick it to that chicken-fucker in the Kremlin and retake Crimea too.
I didn’t have an X5690 because the TDP was too high for my server’s heatsinks, but I had 90W variants of the same generation. To me, two at idle produced noticeable heat, though not as much as four idling in a PowerEdge R910 did. The R910 idled at around 300W.
There’s always Folding@Home if you don’t mind the electric bill. Plex is another option. I know a guy running a massive Plex server that was on Westmere/Nehalem Xeons until I gave him my R720 with Haswell Xeons.
It looks pathetic indeed. Makes many people question: if THAT'S democracy, then maybe it's not worth fighting for.
The same could be said about russian people (sane and rational ones). But what do both people have in common? The answer is: currently both nations are helpless to change what their government does.
I know. We all truly know and greatly appreciate that. There would be no Ukraine if not American weapons and help.
Makes little sense power-wise.
This is all very true for machine-learning research tasks, where, yes, if you want that latest PyTorch library function to work you need to be on the latest ML stack.
But my work/fun is in CFD. One of the main codes I use for work was written when Pascal was the primary supported architecture. The same goes for other HPC stuff that can run via OpenCL and is still plenty compatible. Things compiled back then will still run today; it's not a moving target the way ML has been.
Exactly. Demand for FP64 is significantly lower than for ML/AI.
Pascal isn’t incredibly cheap by comparison because it’s some secret hack. It’s cheap by comparison because most of the market (AI/ML) doesn’t want it. Speaking of which…
At the risk of “No True Scotsman”, what qualifies as HPC gets interesting, but just today I was at a Top500 site that was talking about their Volta system not being worth the power, which is relevant to the parent comment but still problematic for reasons.
I mentioned llama.cpp because the /r/locallama crowd, etc. has actually driven up the cost of used Pascal hardware by treating it as a path to cheap VRAM for their very, very narrow use cases.
If we’re talking about getting a little FP64 for CFD that’s one thing. ML/AI is another. HPC is yet another.
The SXM2 interface is actually publicly documented! There is an Open Compute spec for an 8-way baseboard. You can find the pinouts there.
Upon further review... I think any actual baseboard schematics/pinouts touching the Nvidia hardware directly are indeed kept behind some sort of NDA or OEM license agreement, and are specifically left out of the Open Compute Project documents for the JBOG rigs.
I think this is literally the impetus for the OAM spec, which makes the pinout open and shareable. Up until then, they had to keep the actual baseboard designs out of the public because that part was still controlled Nvidia IP.
Hmm interesting, I was linked to an OCP dropbox with a version that did have the connector pinouts. Maybe something someone shouldn’t have posted then…
I could find the OCP accelerator spec, but it looks like an open-source reimplementation, not actual SXM2. That said, the photos of SXM2-to-PCIe adapters I could find look almost entirely passive, so I don't think all hope is lost either.
It would be a shame if such a thing were to fall off the back of a truck as they say
Couldn't someone just buy one of those Chinese SXM2-to-PCIe adapter boards and test continuity to get the pinouts? I have one; it would take like 10 minutes.
I had read their documents, such as the spec for the Big Basin JBOG, where everything is documented except the actual pinouts on the baseboard. Everything leading up to it and away from it is there, but the actual MegArray pinout connecting to a single P100/V100 I never found.
But maybe there was more I missed. I'll take another look.
I really like this side of AMD. There's a strategic call somewhere high up to bias towards collaboration with other companies. Sharing the fabric specifications with Broadcom was an amazing thing to see. It's not out of the question that we'll see single chips with chiplets made by different companies attached together.
Well, let's not forget, AMD is AMD because they reverse-engineered Intel chips...
IBM didn't want to rely solely on Intel when introducing the PC, so it forced Intel to license its arch to a second manufacturer, which turned out to be AMD. It's not like AMD stole it. The math coprocessor was in turn invented by AMD (Am9511, Am9512) and licensed by Intel (8231, 8232).
Also AMD64
Maybe they feel threatened by ARM on mobile and Intel on desktop / server. Companies that think they're first try to monopolize. Companies that think they're second try to cooperate.
Why don't they sell used P100 DGX/HGX servers as a unit? Are those bare P100s only so cheap precisely because they're useless?
I have a theory some big cloud provider moved a ton of racks from SXM2 P100's to SXM2 V100's (those were a thing) and thus orphaned an absolute ton of P100's without their baseboards.
Or the salvage operations just stripped the racks, kept the small stuff, and e-wasted the racks because they figured that was a more efficient use of their storage space and easier to sell, without thinking it through.
A ton of Nvidia GPUs fry their memory over time and need to be scrapped. Look up Nvidia (A100/H100) row remapping failures.
As “humble” as NVIDIA’s CEO appears to be, NVIDIA the company (which he’s been running this whole time) made decision after decision with the simple intention of killing off its competition (ATI/AMD). GameWorks is my favorite example: essentially, if you wanted a video game to look as good as possible, you needed an NVIDIA GPU. The same games played on AMD GPUs just didn’t look as good.
Now that video gaming is secondary (tertiary?) to Nvidia’s revenue stream, they could give a shit which brand gamers prefer. It’s small time now. All that matters is who companies are buying their GPUs from for AI stuff. Break down that CUDA wall and it’s open-season. I wonder how they plan to stave that off. It’s only a matter of time before people get tired of writing C++ code to interface with CUDA.
You don't need to use C++ to interface with CUDA or even write it.
A while ago NVIDIA and the GraalVM team demoed grCUDA, which makes it easy to share memory with CUDA kernels and invoke them from any managed language that runs on GraalVM (which includes JIT-compiled Python). Because it's integrated with the compiler, the invocation overhead is low:
https://developer.nvidia.com/blog/grcuda-a-polyglot-language...
And TornadoVM lets you write kernels in JVM langs that are compiled through to CUDA:
https://www.tornadovm.org
There are similar technologies for other languages/runtimes too. So I don't think that will cause NVIDIA to lose ground.
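As one concrete, minimal example of the kind of thing I mean: driving a CUDA kernel straight from Python with Numba, no C++ anywhere (a sketch, assuming numba and a CUDA toolkit are installed):

    import numpy as np
    from numba import cuda

    @cuda.jit
    def scale(out, x, a):
        # Each thread handles one element.
        i = cuda.grid(1)
        if i < x.size:
            out[i] = a * x[i]

    x = np.arange(1_000_000, dtype=np.float32)
    out = np.zeros_like(x)
    threads = 256
    blocks = (x.size + threads - 1) // threads
    scale[blocks, threads](out, x, 2.0)  # Numba handles the host<->device copies
    assert out[3] == 6.0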
So these alternatives exist, yes, but are they “production ready”? In other words, are they being used? My opinion is that while you can use another language, most companies, for one reason or another, are still using C++. I just don’t really know what the reason(s) are.
I think about other areas in tech where you can use whatever language, but it isn’t practical to do so. I can write a backend API server in Swift… or, perhaps more relevant, I can use AMD’s ROCm to do… anything.
Best Tflops/$ is actually the 4090, then the 3090. Also the L4.