
Run CUDA, unmodified, on AMD GPUs

modeless
153 replies
23h8m

A lot of people think AMD should support these translation layers but I think it's a bad idea. CUDA is not designed to be vendor agnostic and Nvidia can make things arbitrarily difficult both technically and legally. For example I think it would be against the license agreement of cuDNN or cuBLAS to run them on this. So those and other Nvidia libraries would become part of the API boundary that AMD would need to reimplement and support.

Chasing bug-for-bug compatibility is a fool's errand. The important users of CUDA are open source. AMD can implement support directly in the upstream projects like pytorch or llama.cpp. And once support is there it can be maintained by the community.

eslaught
49 replies
22h17m

Are you aware of HIP? It's officially supported and, for code that avoids obscure features of CUDA like inline PTX, it's pretty much a find-and-replace to get a working build:

https://github.com/ROCm/HIP

Don't believe me? Include this at the top of your CUDA code, build with hipcc, and see what happens:

https://gitlab.com/StanfordLegion/legion/-/blob/master/runti...

It's incomplete because I'm lazy but you can see most things are just a single #ifdef away in the implementation.
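
To make the single-#ifdef point concrete, here is a toy sketch of what such a compatibility header can look like (an illustration of the idea only, not the Legion header linked above; the hip* names are the standard HIP runtime equivalents):

    // Hypothetical CUDA->HIP mapping header (illustrative sketch only).
    // Built with hipcc for AMD, the CUDA runtime calls below are spelled as
    // their HIP equivalents; built with nvcc, the normal CUDA runtime is used.
    #if defined(__HIP_PLATFORM_AMD__)
      #include <hip/hip_runtime.h>
      #define cudaError_t              hipError_t
      #define cudaSuccess              hipSuccess
      #define cudaMalloc               hipMalloc
      #define cudaFree                 hipFree
      #define cudaMemcpy               hipMemcpy
      #define cudaMemcpyHostToDevice   hipMemcpyHostToDevice
      #define cudaMemcpyDeviceToHost   hipMemcpyDeviceToHost
      #define cudaStream_t             hipStream_t
      #define cudaDeviceSynchronize    hipDeviceSynchronize
    #else
      #include <cuda_runtime.h>
    #endif

Kernel syntax itself (__global__, <<<>>> launches, threadIdx and friends) is accepted by hipcc as-is, which is why so much code really is a find-and-replace away.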

currymj
41 replies
22h9m

if you're talking about building anything, that is already too hard for ML researchers.

you have to be able to pip install something and just have it work, reasonably fast, without crashing, and also it has to not interfere with 100 other weird poorly maintained ML library dependencies.

bootsmann
29 replies
21h39m

Don’t most orgs that are deep enough to run custom CUDA kernels have dedicated engineers for this stuff? I can’t imagine a person who can write raw CUDA not being able to handle things more difficult than pip install.

gaogao
27 replies
20h50m

Engineers who are really, really good at CUDA are worth their weight in gold, so there are more projects for them than they have time for. Worth their weight in gold isn't figurative here – the one I know has a ski house more expensive than 180 lbs of gold (~$5,320,814).

bbkane
14 replies
20h37m

Would you (or your friend) be able to drop any good CUDA learning resources? I'd like to be worth my weight in gold...

throwaway81523
12 replies
11h48m

A working knowledge of C++, plus a bit of online reading about CUDA and the NVidia GPU architecture, plus studying the LCZero chess engine source code (the CUDA neural net part, I mean) seems like enough to get started. I did that and felt like I could contribute to that code, at least at a newbie level, given the hardware and build tools. At least in the pre-NNUE era, the code was pretty readable. I didn't pursue it though.

Of course becoming "really good" is a lot different and like anything else, it presumably takes a lot of callused fingertips (from typing) to get there.
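
For a sense of scale, the usual first CUDA program is just a kernel plus a launch. Something like the standard SAXPY warm-up below (not from LCZero, just the usual textbook exercise) should build with nvcc and run on any CUDA-capable card:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Each thread computes one element of y = a*x + y.
    __global__ void saxpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;
        float *x, *y;
        cudaMallocManaged(&x, n * sizeof(float));  // unified memory keeps the example short
        cudaMallocManaged(&y, n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);  // one thread per element
        cudaDeviceSynchronize();

        printf("y[0] = %f\n", y[0]);  // expect 4.0
        cudaFree(x);
        cudaFree(y);
        return 0;
    }

The gap between that and a competitive GEMM or attention kernel (shared-memory tiling, tensor cores, occupancy tuning) is where the "really good" part comes in.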

mosselman
5 replies
10h34m

The real challenge is probably getting your hands on a 4090 for a price you can pay before you are worth your weight in gold. Because an arm and a limb in gold is quite a lot.

ahepp
3 replies
5h6m

I was looking into this recently and it seems like the cheapest AWS instance with a CUDA GPU is something on the order of $1/hr. It looks like an H100 instance might be $15/hr (although I’m not sure if I’m looking at a monthly price).

So yeah it’s not ideal if you’re on a budget, but it seems like there are some solutions that don’t involve massive capex.

throwaway81523
2 replies
4h47m

Look on vast.ai instead of AWS, you can rent machines with older GPUs dirt cheap. I don't see how they even cover the electricity bills. A 4090 machine starts at about $.25/hour, though I didn't examine the configuration.

A new 4090 costs around $1800 (https://www.centralcomputer.com/asus-tuf-rtx4090-o24g-gaming...) and that's probably affordable to AWS users. I see a 2080Ti on Craigslist for $300 (https://sfbay.craigslist.org/scz/sop/d/aptos-nvidia-geforce-...) though used GPUs are possibly thrashed by bitcoin mining. I don't have a suitable host machine, unfortunately.

dotancohen
0 replies
2h13m

Thrashed? What type of damage could a mostly-solid state device suffer? Fan problems? Worn PCi connectors? Deteriorating Arctic Ice from repeated heat cycling?

SonOfLilit
0 replies
1h56m

replying to sibling @dotancohen, they melt, and they suffer from thermal expansion and compression

throwaway81523
0 replies
9h25m

You don't really need a 4090. An older board is plenty. The software is basically the same. I fooled around with what I think was a 1080 on Paperspace for something like 50 cents an hour, but it was mostly with some Pytorch models rather than CUDA directly.

8n4vidtmkvmk
4 replies
10h50m

Does this pay more than $500k/yr? I already know C++, could be tempted to learn CUDA.

throwaway81523
3 replies
9h30m

I kinda doubt it. Nobody paid me to do that though. I was just interested in LCZero. To get that $500k/year, I think you need up to date ML understanding and not just CUDA. CUDA is just another programming language while ML is a big area of active research. You could watch some of the fast.ai ML videos and then enter some Kaggle competitions if you want to go that route.

almostgotcaught
2 replies
5h53m

You're wrong. The people building the models don't write CUDA kernels. The people optimizing the models write CUDA kernels. And you don't need to know a bunch of ML bs to optimize kernels. Source: I optimize GPU kernels. I don't make 500k but I'm not that far from it.

throwaway81523
0 replies
5h34m

Heh I'm in the wrong business then. Interesting. Used to be that game programmers spent lots of time optimizing non-ML CUDA code. They didn't make anything like 500k at that time. I wonder what the ML industry has done to game development, or for that matter to scientific programming. Wow.

HarHarVeryFunny
0 replies
3h33m

How much performance difference is there between writing a kernel in a high level language/framework like PyTorch (torch.compile) or Triton, and hand optimizing? Are you writing kernels in PTX?

What's your opinion on the future of writing optimized GPU code/kernels - how long before compilers are as good or better than (most) humans writing hand-optimized PTX?

robotnikman
0 replies
1h40m

Are there any certifications or other ways to prove your knowledge to employers in order to get your foot in the door?

iftheshoefitss
0 replies
10h26m

Oh bro forget gold, I'd like to be worth my weight in paper lmao

Willish42
4 replies
14h47m

The fact that "worth their weight in gold" typically means single-digit millions is fascinating to me (though I doubt I'll be able to get there myself, maybe someday). I looked it up though and I think this is undercounting the current value of gold per ounce/lb/etc.

5320814 / 180 / 16 = ~1847.5

Per https://www.apmex.com/gold-price and https://goldprice.org/, current value is north of $2400 / oz. It was around $1800 in 2020. That growth for _gold_ of all things (up 71% in the last 5 years) is crazy to me.

It's worth noting that anyone with a ski house that expensive probably has a net worth well over twice the price of that ski house. I guess it's time to start learning CUDA!

boulos
1 replies
14h25m

Note: gold uses troy ounces, so adjust by ~10%. It's easier to just use grams or kilograms :).
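
For anyone redoing the math, the adjustment (assuming avoirdupois pounds for body weight) works out roughly as:

180 lb x ~14.58 troy oz per lb ≈ 2625 troy oz
5320814 / 2625 ≈ $2027 per troy oz (roughly the spot price when that figure would have been computed)
2625 troy oz x $2400 ≈ $6.3M at the price quoted above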

Willish42
0 replies
1h28m

Thanks, I'm a bit new to this entire concept. Do troy lbs also exist, or is that just a term when measuring ounces?

atwrk
1 replies
10h20m

> That growth for _gold_ of all things (up 71% in the last 5 years) is crazy to me.

For comparison: the S&P 500 grew about the same during that period (more than 100% from Jan 2019, about 70% from Dec 2019), so the higher price of gold did not outperform the growth of the general (financial) economy.

dash2
0 replies
5h56m

But that's still surprising performance, because the S&P generates income and pays dividends. Its increase reflects (at least, is supposed to!) expectations of future higher income. Gold doesn't even bear interest....

eigenvalue
2 replies
20h21m

That’s pretty funny. Good test of value across the millennia. I wonder if the best aqueduct engineers during the peak of Ancient Rome’s power had villas worth their body weight in gold.

Winse
1 replies
19h4m

Lol. For once being overweight may come with some advantages here.

necovek
0 replies
12h28m

Or disadvantages: you may be as rich as your skinny neighbour, but they are the only ones worth their weight in gold ;)

iftheshoefitss
1 replies
10h28m

What do people study to figure out CUDA? I’m studying to get my GED and hope to go to school one day

paulmd
0 replies
13m

Computer science. This is a grad level topic probably.

Nvidia literally wrote most of the textbooks in this field and you’d probably be taught using one of these anyway:

https://developer.nvidia.com/cuda-books-archive

“GPGPU Gems” is another “cookbook” sort of textbook that might be helpful starting out but you’ll want a good understanding of the SIMT model etc.

amelius
1 replies
7h58m

Just wait until someone trains an ML model that can translate any CUDA code into something more portable like HIP.

GP says it is just some #ifdefs in most cases, so an LLM should be able to do it, right?

phkahler
0 replies
5h4m

> Don’t most orgs that are deep enough to run custom CUDA kernels have dedicated engineers for this stuff? I can’t imagine a person who can write raw CUDA not being able to handle things more difficult than pip install.

This seems to be a fairly common problem with software. The people who create software regularly deal with complex tool chains, dependency management, configuration files, and so on. As a result they think that if a solution "exists" everything is fine. Need to edit a config file for your particular setup? No problem. The thing is, I have been programming stuff for decades and I really hate having to do that stuff and will avoid tools that make me do it. I have my own problems to solve, and don't want to deal with figuring out tools no matter how "simple" the author thinks that is to do.

A huge part of the reason commercial software exists today is probably because open source projects don't take things to this extreme. I look at some things that qualify as products and think they're really simplistic, but they take care of some minutiae that regular people are willing to pay for so they don't have to learn or deal with it. The same can be true for developers and ML researchers or whatever.

ezekiel68
4 replies
19h20m

> if you're talking about building anything, that is already too hard for ML researchers.

I don't think so. I agree it is too hard for the ML researchers at the companies which will have their rear ends handed to them by the other companies whose ML researchers can be bothered to follow a blog post and prompt ChatGPT to resolve error messages.

jokethrowaway
2 replies
17h40m

a lot of ML researchers stay pretty high level and reinstall conda when things stop working

and rightly so, they have more complicated issues to tackle

It's on developers to provide better infrastructure and solve these challenges

LtWorf
1 replies
14h17m

Not rightly. It'd be faster in the long term to address the issues.

bayindirh
0 replies
12h34m

Currently nobody thinks that long term. They just reinstall, that’s it.

currymj
0 replies
3h47m

I'm not really talking about companies here for the most part, I'm talking about academic ML researchers (or industry researchers whose role is primarily academic-style research). In companies there is more incentive for good software engineering practices.

I'm also speaking from personal experience: I once had to hand-write my own CUDA kernels (on official NVIDIA cards, not even this weird translation layer): it was useful and I figured it out, but everything was constantly breaking at first.

It was a drag on productivity and more importantly, it made it too difficult for other people to run my code (which means they are less likely to cite my work).

eslaught
1 replies
20h35m

If your point is that HIP is not a zero-effort porting solution, that is correct. HIP is a low-effort solution, not a zero effort solution. It targets users who already use and know CUDA, and minimizes the changes that are required from pre-existing CUDA code.

In the case of these abstraction layers, then it would be the responsibility of the abstraction maintainers (or AMD) to port them. Obviously, someone who does not even use CUDA would not use HIP either.

To be honest, I have a hard time believing that a truly zero-effort solution exists. Especially one that gets high performance. Once you start talking about the full stack, there are too many potholes and sharp edges to believe that it will really work. So I am highly skeptical of the original article. Not that I wouldn't want to be proved wrong. But what they're claiming to do is a big lift, even taking HIP as a starting point.

The easiest, fastest (for end users), highest-performance solution for ML will come when the ecosystem integrates it natively. HIP would be a way to get there faster, but it will take nonzero effort from CUDA-proficient engineers to get there.

currymj
0 replies
3h43m

I agree completely with your last point.

As other commenters have pointed out, this is probably a good solution for HPC jobs where everyone is using C++ or Fortran anyway and you frequently write your own CUDA kernels.

From time to time I run into a decision maker who understandably wants to believe that AMD cards are now "ready" to be used for deep learning, and points to things like the fact that HIP mostly works pretty well. I was kind of reacting against that.

klik99
0 replies
16h45m

God this explains so much about my last month, working with tensorflow lite and libtorch in C++

jchw
0 replies
21h35m

The target audience of interoperability technology is whoever is building, though. Ideally, interoperability technology can help software that supports only NVIDIA GPUs today go on to quickly add baseline support for Intel and AMD GPUs tomorrow.

(and for one data point, I believe Blender is actively using HIP for AMD GPU support in Cycles.)

elashri
0 replies
17h34m

As someone doing a lot of work with CUDA in a big research organization, there are few of us. If you are working with CUDA, then you are not the type of person who waits for something that just works, like you describe. CUDA itself is a battle with poorly documented stuff.

Agingcoder
0 replies
21h23m

Their target is HPC users, not ML researchers. I can understand why this would be valuable to this particular crowd.

jph00
4 replies
19h40m

Inline PTX is hardly an obscure feature. It's pretty widely used in practice, at least in the AI space.
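
For anyone who hasn't seen it, inline PTX is just an asm() block inside a CUDA C++ kernel. A small illustrative (not performance-critical) example reads the %laneid special register:

    // Illustrative inline PTX in CUDA C++: read the calling thread's lane id
    // straight from the %laneid special register. Constructs like this bypass
    // the portable C++ surface, which is why a translation layer either has to
    // understand PTX or the code has to avoid it.
    __device__ unsigned int lane_id() {
        unsigned int id;
        asm volatile("mov.u32 %0, %%laneid;" : "=r"(id));
        return id;
    }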

saagarjha
3 replies
17h46m

Yeah, a lot of the newer accelerators are not even available without using inline PTX assembly. Even the ones that are have weird shapes that are not amenable to high-performance work.

HarHarVeryFunny
2 replies
4h14m

Are you saying that the latest NVIDIA nvcc doesn't support the latest NVIDIA devices?

adrian_b
1 replies
3h27m

For any compiler, "supporting" a certain CPU or GPU only means that it can generate correct translated code with that CPU or GPU as the execution target.

It does not mean that the compiler is able to generate code that has optimal performance, when that can be achieved by using certain instructions without a direct equivalent in a high-level language.

No compiler that supports the Intel-AMD ISA knows how to use all the instructions available in this ISA.

HarHarVeryFunny
0 replies
3h10m

Sure, but I'm not sure if that is what the parent poster was saying (that nvcc generates poor quality PTX for newer devices).

It's been a while since I looked at CUDA, but it used to be that NVIDIA were continually extending cuDNN to add support for kernels needed by SOTA models, and I assume these kernels were all hand optimized.

I'm curious what kind of models people are writing where not only is there no optimized cuDNN support, but also solutions like Triton or torch.compile, and even hand-optimized CUDA C kernels, are too slow. Are hand-written PTX kernels really that common?

pjmlp
0 replies
13h18m

How does it run CUDA Fortran?

blitzar
33 replies
22h27m

It would be good if AMD did something, anything.

Support this, reimplement that, support upstream efforts, don't really care. Any of those would cost a couple of million and be worth a trillion dollars to AMD shareholders.

chatmasta
16 replies
18h50m

Is it weird how the comments here are blaming AMD and not Nvidia? Sure, the obvious argument is that Nvidia has no practical motivation to build an open platform. But there are counterexamples that suggest otherwise (Android). And there is a compelling argument that long term, their proprietary firmware layer will become an insufficient moat to their hardware dominance.

Who’s the root cause? The company with the dominant platform that refuses to open it up, or the competitor who can’t catch up because they’re running so far behind? Even if AMD made their own version of CUDA that was better in every way, it still wouldn’t gain adoption because CUDA has become the standard. No matter what they do, they’ll need to have a compatibility layer. And in that case maybe it makes sense for them to invest in the best one that emerges from the community.

roenxi
7 replies
17h38m

> Is it weird how the comments here are blaming AMD and not Nvidia?

Not even a little bit. It simply isn't Nvidia's job to provide competitive alternatives to Nvidia. Competing is something AMD must take responsibility for.

The only reason CUDA is such a big talking point is because AMD tripped over their own feet supporting accelerated BLAS on AMD GPUs. Realistically it probably is hard to implement (AMD have a lot of competent people on staff) but Nvidia hasn't done anything unfair apart from execute so well that they make all the alternatives look bad.

jkmcf
6 replies
16h10m

I agree with you, but replace NVIDIA with Apple. What would the EU say?

LtWorf
5 replies
14h14m

I don't think nvidia bans anyone from running code on their devices.

kbolino
1 replies
4h31m

This seems to be more about certain devices (consumer-grade GPUs) in certain settings (data centers), though I do question how enforceable it actually is. My guess is that it can only apply when you try to get discounts from bulk-ordering GPUs.

Also, was there any followup to this story? It seems a bit unnecessary because nVidia has already neutered consumer cards for many/most data center purposes by not using ECC and by providing so few FP64 units that double precision FLOPS is barely better than CPU SIMD.

paulmd
0 replies
2m

it’s also not really a thing anymore because of the open kernel driver… at that point it’s just MIT licensed.

the userland they paid to develop is still proprietary though, but idk if the libraries have the no-datacenter restrictions on them.

paulmd
0 replies
4m

so terrible that vendors can enforce these proprietary licenses on software they paid to develop /s

lmm
2 replies
17h49m

> Is it weird how the comments here are blaming AMD and not Nvidia?

Nvidia has put in the legwork and are reaping the rewards. They've worked closely with the people who are actually using their stuff, funding development and giving loads of support to researchers, teachers and so on, for probably a decade now. Why should they give all that away?

> But there are counterexamples that suggest otherwise (Android).

How is Android a counterexample? Google makes no money off of it, nor does anyone else. Google keeps Android open so that Apple can't move everyone onto their ad platform, so it's worth it for them as a strategic move, but Nvidia has no such motive.

> Even if AMD made their own version of CUDA that was better in every way, it still wouldn’t gain adoption because CUDA has become the standard.

Maybe. But again, that's because NVidia has been putting in the work to make something better for a decade or more. The best time for AMD to start actually trying was 10 years ago; the second-best time is today.

Zambyte
1 replies
3h8m

> Google makes no money off of it, nor does anyone else

Google makes no money off of Android? That seems like a really weird claim to make. Do you really think Google would be anywhere near as valuable of a company if iOS had all of the market share that the data vacuum that is Android has? I can't imagine that being the case.

Google makes a boatload off of Android, just like AMD would if they supported open GPGPU efforts aggressively.

michaelt
0 replies
1h8m

Google gave away the software platform - Android - to hardware vendors for free, vendors compete making the hardware into cheap, low-margin commodity items, and google makes boatloads of money from ads, tracking and the app store.

nvidia could give away the software platform - CUDA - to hardware vendors for free, making the hardware into cheap, low-margin commodity items. But how would they make boatloads of money when there's nowhere to put ads, tracking or an app store?

whywhywhywhy
0 replies
6h48m

> Is it weird how the comments here are blaming AMD and not Nvidia?

Because it IS AMD/Apple/etc.'s fault for the position they're in right now. CUDA showed where the world was heading and where the gains in compute would be made well over a decade ago now.

They even had OpenCL, didn't put the right amount of effort into it, all the talent found CUDA easier to work with so built there. Then what did AMD, Apple do? Double down and try and make something better and compete? Nah they fragmented and went their own way, AMD with what feels like a fraction of the effort even Apple put in.

From the actions of the other teams in the game it's not hard to imagine a world without CUDA being a world where this tech is running at a fraction of its potential.

pjmlp
0 replies
13h13m

Khronos, AMD and Intel have had 15 years to make something out of OpenCL that could rival CUDA.

Instead they managed 15 years of disappointment: a standard stuck in C99 that adopted C++ and a polyglot bytecode too late to matter, and never produced an ecosystem of IDE tooling and GPU libraries.

Naturally CUDA became the standard, when NVIDIA provided what the GPU community cared about.

nemothekid
0 replies
14h30m

> Is it weird how the comments here are blaming AMD and not Nvidia?

It's not. Even as it is, I do not trust HIP or ROCm to be a viable alternative to CUDA. George Hotz did plenty of work trying to port various ML architectures to AMD and was met with countless driver bugs. The problem isn't that Nvidia won't build an open platform - the problem is that AMD won't invest in a competitive platform. 99% of ML engineers do not write CUDA. For the vast majority of workloads, there are probably 20 engineers at Meta who write the CUDA backend for PyTorch that every other engineer uses. Meta could hire another 20 engineers to support whatever AMD has (they did, and it's not as robust as CUDA).

Even if CUDA was open - do you expect Nvidia to also write drivers for AMD? I don't believe 3rd parties will get anywhere writing "compatibility layers" because AMD's own GPUs aren't optimized or tested for CUDA-like workloads.

immibis
0 replies
6h35m

It's always been on the straggler to catch up by cheating. That's just how the world works - even in open source. If AMD supported CUDA, it would have a bigger market share. That's a fact. Nvidia doesn't want that. That's a fact. But when Reddit started, it just scraped feeds from Digg, and when Facebook started, it let you link your MySpace credentials and scraped your MySpace account. Adversarial interoperability is nothing new.

cogman10
0 replies
2h34m

Funnily, who I blame the most for there not being real competition to CUDA is Apple. As of late, Apple has been really pushing for vendor lock-in APIs rather than adopting open standards. The end result is you can get AMD and Intel on board with some standard which is ultimately torpedoed by Apple. (See Apple departing from and rejecting everything that comes from the Khronos group.)

With the number of devs that use Apple silicon nowadays, I have to think that their support for Khronos initiatives like SYCL and OpenCL would have significantly accelerated progress and adoption in both.

We need an open standard that isn't just AMD specific to be successful in toppling CUDA.

slashdave
13 replies
20h54m

ROCm counts as "something"

curt15
12 replies
20h17m

Pretty much any modern NVIDIA GPU supports CUDA. You don't have to buy a datacenter-class unit to get your feet wet with CUDA programming. ROCm will count as "something" when the same is true for AMD GPUs.

mappu
3 replies
17h33m

AMD's definition of "support" I think is different than what people expect, and pretty misleading - ROCm itself will run on almost anything, back as far as the RX 400/500 series:

https://en.wikipedia.org/wiki/ROCm#:~:text=GCN%205%20%2D%20V...

Stable Diffusion ran fine for me on RX 570 and RX 6600XT with nothing but distro packages.

slavik81
0 replies
30m

There are out-of-bounds writes in the BLAS libraries for gfx803 GPUs (such as the RX 570). That hardware might work fine for your use case, but there's a lot of failures in the test suites.

I agree with your general sentiment, but I wouldn't recommend pre-Vega GPUs for use with ROCm. Stick to gfx900 and newer, if you can.

imtringued
0 replies
10h12m

I don't buy it. Even running things like llama.cpp on my RX 570 via Vulkan crashes the entire system.

Nab443
0 replies
9h31m

The last time I checked, I was stuck with a pretty old kernel if I wanted to have the latest version of ROCm available for my RX 470. It's compatible at some point in time, but not kept compatible with recent kernels.

muxr
2 replies
19h59m

I don't think AMD needs to support 5+ year old GPUs personally. And all the recent generations are already practically supported.

AMD only claims support for a select few GPUs, but in my testing I find all the GPUs work fine if the architecture is supported. I've tested rx6600, rx6700xt for example and even though they aren't officially supported, they work fine on ROCm.

imtringued
0 replies
10h14m

I'm on a 5+ year old GPU, because I don't trust AMD to offer a compelling GPU that actually works. An RX 570 is good enough for the little gaming I do. It mostly acts as an oversized iGPU that has good Linux drivers, but since AMD is not supporting ROCm on this GPU, there is no need to hurry on upgrading to a better GPU or to get my feet wet running things locally on the GPU like Stable Diffusion, LLMs, etc.

Dylan16807
0 replies
16h15m

> 5+ year old GPUs

AMD had a big architecture switchover exactly 5 years ago, and the full launch wasn't over until 4.5 years ago. I think that generation should have full support. Especially because it's not like they're cutting support now. They didn't support it at launch, and they didn't support it after 1, 2, 3, 4 years either.

The other way to look at things, I'd say that for a mid to high tier GPU to be obsolete based on performance, the replacement model needs to be over twice as fast. 7700XT is just over 50% faster than 5700XT.

jacoblambda
1 replies
19h19m

ROCm supports current gen consumer GPUs officially and a decent chunk of recent gen consumer GPUs unofficially. Not all of them of course, but a decent chunk.

It's not ideal but I'm pretty sure CUDA didn't support everything from day 1. And ROCm is part of AMD's vendor portion of the Windows AI stack, so from the upcoming generation onward basically anything that outputs video should support ROCm.

slashdave
0 replies
34m

AMD should focus their efforts on competitive hardware offerings, because that is where the need and the money is. Sorry, I don't think the hobbyist should be a priority.

bavell
0 replies
2h47m

Huh? I've been running ROCm for SD and LLMs for over a year and a half on my puny consumer 6750 XT - not even latest gen.

oezi
1 replies
21h18m

A couple of million doesn't get you anything in corporate land

spacebanana7
0 replies
20h18m

A couple dozen billion for a 10% chance of becoming NVIDIA competitive is worth it, looking at the stock prices.

Const-me
18 replies
22h22m

> Nvidia can make things arbitrarily difficult both technically and legally

Pretty sure APIs are not copyrightable, e.g. https://www.law.cornell.edu/supremecourt/text/18-956

> against the license agreement of cuDNN or cuBLAS to run them on this

They don’t run either of them, they instead implement an equivalent API on top of something else. Here’s a quote: “Open-source wrapper libraries providing the "CUDA-X" APIs by delegating to the corresponding ROCm libraries. This is how libraries such as cuBLAS and cuSOLVER are handled.”
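
As a rough illustration of that wrapper approach (hand-waved: the real cuBLAS entry points carry their own handle and enum types and status codes, all of which a real shim would have to map instead of reusing the rocBLAS types as done here), the shape of such a shim is roughly:

    // Hypothetical sketch of a "CUDA-X" shim: expose a cuBLAS-shaped entry point
    // but delegate the work to the corresponding ROCm library (rocBLAS).
    #include <rocblas/rocblas.h>

    extern "C" int cublasSgemm(rocblas_handle handle,
                               rocblas_operation transa, rocblas_operation transb,
                               int m, int n, int k,
                               const float* alpha,
                               const float* A, int lda,
                               const float* B, int ldb,
                               const float* beta,
                               float* C, int ldc) {
        // Both libraries follow the column-major BLAS convention, so for this
        // particular call the arguments forward one-to-one.
        return (int)rocblas_sgemm(handle, transa, transb, m, n, k,
                                  alpha, A, lda, B, ldb, beta, C, ldc);
    }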

dralley
16 replies
22h7m

I believe it was decided that they are copyrightable but that using them for compatibility purposes is fair use.

kbolino
15 replies
21h45m

No, it's stranger than that: SCOTUS did not rule on copyrightability of APIs at all, but simply ruled that even if they are copyrightable, what Google did (completely reimplement Sun/Oracle's public API) was still fair use.

mrandish
9 replies
21h31m

It would have been nice to get a clear SCOTUS precedent on this. On the other hand, I also value a SCOTUS which rules minimally and narrowly by default (I also appreciate SCOTUS' return to stricter constitutional grounding in the past decade).

hobs
8 replies
20h40m

Incredibly loud laughing from the lawyers whose study of law is being thrown around willy nilly because of all the unprecedented joke decisions they are making right now.

kbolino
7 replies
20h31m

We are stuck between a rock and a hard place politically. The real decisions should be coming from Congress not the courts. However, Congress is too disorganized and disconnected to answer the important questions, leaving the courts to either muddle along or else become semi-dictatorial. In most countries, this would cause a constitutional crisis, but the modern U.S. system seems to be a little too resilient to such otherwise concerning signals.

hobs
6 replies
19h40m

We're far past a constitutional crisis, and the courts taking power nobody wanted to give to them (who wasn't interested in a unitary executive at least) isn't a good solution.

kbolino
5 replies
19h25m

What constitutional crisis has occurred that hasn't been resolved?

Constitutional crises involve fundamental breaks in the working of government that bring two or more of its elements into direct conflict that can't be reconciled through the normal means. The last of these by my accounting was over desegregation, which was resolved with the President ordering the Army to force the recalcitrant states to comply. Before that was a showdown between the New Deal Congress and the Supreme Court, which the former won by credibly threatening to pack the latter (which is IMO a much less severe crisis but still more substantial than anything happening today). However, that was almost a century ago, and Congress has not been that coherent lately.

ted_dunning
4 replies
17h38m

I would think the latest one where SCOTUS ruled that the president was a king except in matters where the SCOTUS decides they aren't counts as a constitutional crisis.

FeepingCreature
3 replies
16h56m

Constitutional crises are not a matter of opinion but of occurrence, arising from an actual power conflict between arms of the government that is caused by a conflicted reading of the constitutional text. Basically, if the system just ticks on, it's not a constitutional crisis.

If "I think this is a very bad decision" was cause for a constitutional crisis, any state with more than three digit population would be in constitutional crisis perpetually.

jolux
2 replies
14h38m

> Constitutional crises are not a matter of opinion but of occurrence, arising from an actual power conflict between arms of the government that is caused by a conflicted reading of the constitutional text. Basically, if the system just ticks on, it's not a constitutional crisis.

This happened as recently as 2021-01-06; strong evidence that the military subverted the president to call the National Guard into Washington DC and secure the electoral count.

kbolino
0 replies
5h31m

That's close. Both the excessively long lame duck period (2 months for Congress and 2.5 months for the President) and disunity between the President and the rest of the executive branch have also been fodder for crises in the past (Marbury v Madison, Andrew Johnson's impeachment).

hnfong
0 replies
12h57m

If Trump didn't back down it could have definitely been a constitutional crisis.

I'd say it was narrowly averted though.

not2b
4 replies
19h27m

That is how the SC used to work: they would decide cases on the narrowest possible grounds. If they don't have to decide a tough question, but they can finesse it with something simpler, good enough. More recently they have been willing to tear up decades of established law on a regular basis.

hnfong
3 replies
13h1m

"Used to work"... this was 2021.

And generally courts/judges just choose the scope of their legal opinions based on how far reaching they want the legal principles to apply.

IMHO, copyright-ability of APIs is so far away from their political agenda that they probably just decided to leave the issue on a cliffhanger...

immibis
2 replies
5h32m

Yes, "used to". Now, in 2024, the same supreme court has decided that presidents have immunity in all official acts, from stealing documents, up to and including assassination attempts on their opponents. This is a radical shift in how the court operates.

kbolino
0 replies
5h27m

This "opponent assassination" hypothetical gets bandied about a lot but I have not seen any evidence that any court considers that to be an "official act". Official acts are constrained to legitimate exercises of constitutional authority and are not merely anything a President (or especially, an ex-President) does.

jpadkins
0 replies
5h10m

the only thing radical is the opinions of people you are listening to if you believe SCOTUS enabled legally sanctioned assassinations. It was political hyperbole based on nothing, and it worked (with you). Think for yourself.

consf
0 replies
7h24m

You're correct! Fair Use Doctrine

apatheticonion
13 replies
15h50m

Agreed. Rather than making CUDA the standard, AMD should push/drive an open standard that can be run on any hardware.

We have seen this succeed multiple times: FreeSync vs GSync, DLSS vs FSR, (not AMD but) Vulkan vs DirectX & Metal.

All of the big tech companies are obsessed with ring-fencing developers behind the thin veil of "innovation" - where really it's just good for business (I swear it should be regulated because it's really bad for consumers).

A CUDA translation layer is okay for now but it does risk CUDA becoming the standard API. Personally, I am comfortable with waiting on an open standard to take over - ROCm has serviced my needs pretty well so far.

Just wish GPU sharing with VMs was as easy as CPU sharing.

pjmlp
3 replies
13h6m

Vulkan only matters on Android (from version 10 onwards) and GNU/Linux.

Zero impact on Switch, Playstation, XBox, Windows, macOS, iOS, iPadOS, Vision OS.

ChoGGi
2 replies
11h40m

"Windows"

dxvk-gplasync is a game changer for dx9-11 shader stutter.

pjmlp
1 replies
10h14m

Sure, for the 2% of folks that enjoy Windows games, written against DirectX, on the Linux Steam store.

Which Android studios can't even be bothered to target with their NDK engines, based on GL ES and Vulkan.

ChoGGi
0 replies
4h57m

I'm on Windows 11; if I see anything other than DX12 in my Afterburner overlay, I use it.

Even if there's no shader stutter, Vulkan tends to use less juice than DX.

amy-petrik-214
3 replies
15h40m

We actually also saw this historically with OpenGL. OpenGL comes from an ancient company whispered about by the elderly programmers (30+ years old) known as SGI. Originally it was CLOSED SOURCE and SGI called it "SGI-GL", for a computer codenamed IRIS which was cool looking with bright popping color plastic and a faux granite keyboard. Good guy SGI open sourced SGI-GL to become what we call "OpenGL" (get it, now it's open), and then it stuck.

That's all to say NVIDIA could pull an SGI and open their stuff, but they're going more Sony style and trying to monopolize. Oh, and SGI also wrote another ancient lore library known as "STL" or the "SGI Template Library", which is like the original Boost template metaprogramming granddaddy.

usr1106
1 replies
12h3m

Nice story, but is it correct? Wikipedia says STL was first implemented by HP and later by the same authors at SGI.

adrian_b
0 replies
3h15m

STL started even earlier, obviously without using the name "STL", as a library of generic algorithms for the programming language Ada (David R. Musser & Alexander A. Stepanov, 1987).

adrian_b
0 replies
13h5m

Also the XFS file system.

naasking
0 replies
6h19m

> AMD should push/drive an open standard that can be run on any hardware.

AMD has always been notoriously bad at the software side, and they frequently abandon their projects when they're almost usable, so I won't hold my breath.

imtringued
0 replies
10h9m

AMD shouldn't push on anything. They have the wrong incentives. They should just make sure that software runs on their GPUs and nothing else.

Karol Herbst is working on Rusticl, which is mesa's latest OpenCL implementation and will pave the way for other things such as SYCL.

gjulianm
0 replies
10h26m

OpenCL was released in 2009. AMD has had plenty of time to push and drive that standard. But OpenCL had a worse experience than CUDA, and AMD wasn't up to the task in terms of hardware, so it made no real sense to go for OpenCL.

consf
0 replies
7h14m

A strategic and forward-thinking approach

ChoGGi
0 replies
11h42m

"We have seen this succeed multiple times: FreeSync vs GSync, DLSS vs FSR, (not AMD but) Vulkan vs DirectX & Metal."

I'll definitely agree with you on Sync and Vulkan, but DLSS and XeSS are both better than FSR.

https://youtube.com/watch?v=el70HE6rXV4

amelius
7 replies
22h21m

Like supporting x86 was a bad idea as well?

karolist
4 replies
22h18m

Was there a large entity steering the x86 spec alone with a huge feature lead against their competition, free to steer the spec in any way they choose? Also, hardware is not open-source software; you get big players onboard and they will be able to implement the spec they want every gen, while software has more moving parts and unaligned parties involved.

cherryteastain
3 replies
22h8m

> Was there a large entity steering the x86 spec alone with a huge feature lead against their competition, free to steer the spec in any way they choose?

Ever heard of Intel?

karolist
2 replies
21h55m

I hadn't considered that angle. Is your point that Intel was the creator of x86, but software chose to support it, and then AMD had nothing else to do but play catch-up in x86 support to be part of the software target market? If so and factual (I've no idea), fair point, I didn't know.

marshray
1 replies
19h33m

It was exactly the same instruction set.

C compilers didn't offer an "AMD" CPU target* until AMD came out with the "AMD64" instruction set. Today we call this "x86_64" or "x64".

* Feel free to point out some custom multimedia vector extensions for Athlons or something, but the point remains.

gmokki
0 replies
6h29m

And Intel named its licensed implementation of AMD64 "IA-32e", just to make it clear to everyone that it is based on the 32-bit Intel architecture with an extension. Luckily they dropped that name a few years later.

modeless
1 replies
21h37m

Before starting, AMD signed an agreement with Intel that gave them an explicit license to x86. And x86 was a whole lot smaller and simpler back then in 1982. A completely different and incomparable situation.

nostrademons
0 replies
21h0m

Technically it was after starting - AMD was founded in 1969 as a second-sourcer for Fairchild and National Semiconductor, and had reverse-engineered the 8080 by 1975 and acquired a formal license to it by 1976.

The 1982 deal you speak of was actually pretty interesting: as a condition of the x86's use in the IBM PC, IBM requested a second source for x86 chips. AMD was that source, and so they cross-licensed the x86 in 1982 to allow the IBM PC project to proceed forward. This makes the Intel/AMD deal even more important for both companies: the PC market would never have developed without the cross-licensing, which would've been bad for all companies involved. This gave Intel an ongoing stake in AMD's success at least until the PC market consolidated on the x86 standard.

fngjdflmdflg
4 replies
22h38m

Nvidia can make things arbitrarily difficult both technically and legally.

I disagree. AMD can simply not implement those APIs, similar to how game emulators implement the most used APIs first and sometimes never bother implementing obscure ones. It would only matter that NVIDIA added e.g. patented APIs to CUDA if those APIs were useful. In which case AMD should have a way to do them anyway. Unless NVIDIA comes up with a new patented API which is both useful and impossible to implement in any other way, which would be bad for AMD in any event. On the other hand, if AMD starts supporting CUDA and people start using AMD cards, then developers will be hesitant to use APIs that only work on NVIDIA cards. Right now they are losing billions of dollars on this. Then again they barely seem capable of supporting ROCm on their cards, much less CUDA.

You have a fair point in terms of cuDNN and cuBLAS but I don't know that that kind of ToS is actually binding.

selimnairb
3 replies
19h2m

Patented API? I thought Google v. Oracle settled this? Making an implementation of an API spec is fair use, is it not?

fngjdflmdflg
2 replies
17h12m

My understanding is that Google v. Oracle only applies to copyright.

nl
1 replies
15h12m

Well you can't patent an API so....

fngjdflmdflg
0 replies
13h8m

You can patent the implementation. You can't patent the API name DecodeH265Video() but you can still sue someone for implementing that function correctly.

dietr1ch
4 replies
22h27m

How's this situation different than the one around Java, Sun/Oracle and Google?

dboreham
3 replies
21h57m

The judge might not be a coder next time.

viraptor
2 replies
21h51m

US law is highly dependent on precedent. The Google-Oracle case has fortunately set one, so anything following it won't start from scratch; we may not need a coder judge next time.

jjk166
0 replies
21h10m

Google-Oracle side stepped the issue of API copyrightability by saying Google's particular implementation would fall under fair use. Whether APIs are copyrightable remains an open question.

dylan604
0 replies
21h8m

Until you get an activist court

viraptor
2 replies
21h55m

Isn't cuDNN a much better case for reimplementing than CUDA? It has much more choice in how things actually happen and cuDNN itself chooses different implementations at runtime + does fusing. It seems way more generic and the reimplementation would allow using the best AMD-targeted kernel rather than one the original has.

ckitching
1 replies
21h14m

AMD have "MIOpen" which is basically cuDNN-for-AMD. Ish.

mmis1000
0 replies
12h6m

And that thing has been left unreleased on Windows for almost a whole year for unknown reasons, even though there is activity on GitHub and build fixes land frequently. There is just no .exe or .msi for you to download. In fact, ROCm for Linux is on the major 6 release (which includes MIOpen), but somehow Windows is still on major 5 (which doesn't have MIOpen), and has been for almost a whole year.

It almost makes me wonder: is there a shady deal somewhere asking AMD to never release the SDK for Windows, to keep the price of Nvidia cards high? Why do they keep developing these without releasing them at all?

Wowfunhappy
2 replies
20h29m

> CUDA is not designed to be vendor agnostic and Nvidia can make things arbitrarily difficult [...] technically.

(Let's put the legal questions aside for a moment.)

nVidia changes GPU architectures every generation / few generations, right? How does CUDA work across those—and how can it have forwards compatibility in the future—if it's not designed to be technologically agnostic?

saagarjha
0 replies
17h44m

PTX is meant to be portable across GPU microarchitectures. That said, Nvidia owns the entire spec, so they can just keep adding new instructions that their GPUs now support but AMD GPUs don't.

andy_ppp
0 replies
20h26m

One way is to make sure the hardware team does certain things to support an easy transition to new architectures; we have seen this with Apple Silicon, for example!

rjurney
1 replies
17h9m

Not having a layer like this has left AMD completely out of the AI game that has made NVDA the world's most valuable company.

ChoGGi
0 replies
11h37m

Self-inflicted wounds hurt the most.

raxxorraxor
1 replies
7h30m

I really hope they will do what you suggested. With some innovative product placement, GPUs with a lot of memory for example, they could dethrone nvidia if it doesn't change strategy.

That said, easier said than done. You need very specialized developers to build a CUDA equivalent and have people start using it. AMD could do it with a more open development process leveraging the open source community. I believe this will happen at some point anyway by AMD or someone else. The market just gets more attractive by the day and at some point the high entry barrier will not matter much.

So why should AMD skimp on their ambitions here? This would be a most sensible investment, few risks and high gains if successful.

consf
0 replies
7h26m

This expanding market provides AMD with a lucrative opportunity indeed

koolala
1 replies
19h58m

CUDA v1... CUDA v2... CUDA v... CUDA isn't commonly associated with a version number...

neutrinobro
0 replies
17h6m

Cries in OpenCL

magic_hamster
0 replies
14h34m

CUDA is the juice that built Nvidia in the AI space and allowed them to charge crazy money for their hardware. Being able to run CUDA on cost-effective AMD hardware could be a big leap forward, allowing more people to do research and breaking away from Nvidia's stranglehold over VRAM. Nvidia will never open source their own platform unless their hand is forced. I think we all should support this endeavor and contribute where possible.

consf
0 replies
7h28m

The legal, technical and strategic challenges make it a less attractive option

anigbrowl
0 replies
20h38m

Given AMDs prior lack of interest I'll take whatever options there are. My daily driver has a Vega 10 GPU and it's been quite frustrating not to be able to easily leverage it for doing basic ML tasks, to the point that I've been looking at buying an external nvidia GPU instead just to try out some of the popular Python libraries.

Sparkyte
0 replies
8h43m

That is why an open standard should be made, so it isn't locked to a particular piece of hardware and instead allows modular support for different hardware to interface with supported drivers.

DeepYogurt
0 replies
23h5m

Ya, honestly better to leave that to third parties who can dedicate themselves to it and maybe offer support or whatever. Let AMD work on good first party support first.

acheong08
42 replies
23h10m

Impressive if true. Unfortunately not open source and scarce on exact details on how it works

Edit: not sure why I just sort of expect projects to be open source or at least source available these days.

TaylorAlexander
19 replies
21h54m

Makes sense to expect this kind of thing to be open source. The whole point of providing improved compatibility is to make people’s lives easier, and open source is usually an important feature to ensure wide compatibility. It also means projects can live on after the creators move to other things, people can submit patches for important features or bug fixes, and generally makes the system much more useful.

dylan604
18 replies
21h2m

I don't find it wrong for someone to attempt to make money back on their time and experience of doing the work. I don't mind people that offer that back as open source either. However, I do have a problem with people expecting everything to be open/free, especially those that then go on a crusade chastising those that do try to make money.

TaylorAlexander
17 replies
20h12m

I'm really trying to keep this about the engineering features of a system rather than moral judgments. Open source systems are simply more flexible and adaptable than proprietary systems, which have their own benefits. In today's world, the engineering value of open source systems is becoming so important that people are looking for other ways to provide for the developers creating these systems. It can be surprising when a project creator builds something in an area that is usually all open source, but they choose a proprietary path. Just look at the problems created by NVIDIA for their use of proprietary software in CUDA and their GPUs. This software is an attempt to fix issues created by proprietary software with another piece of proprietary software, which is if nothing else an interesting decision.

dylan604
16 replies
19h34m

UNIX wasn't free. Windows wasn't free. It wasn't until some knucklehead came along and did something abnormal and gave away their thing. Bakers don't give away their goods. Mechanics don't typically repair things for free. Builders don't build things for free. Gas stations don't give away gas.

Why do we think all software should be free, and then think that those that don't give it away are the abnormal ones?

talldayo
6 replies
18h48m

> Why do we think all software should be free

Why do people return Windows laptops when they have to pay for a Windows License Activation? Because every single OEM pays for it; you don't expect to buy Windows because it is a failed B2C business model. Nobody wants it. Same goes for proprietary UNIX, and people wish it was the case for Nvidia drivers. I own CUDA hardware and lament the fact that cross-industry GPGPU died so FAANG could sell licensed AI SDKs. The only thing stopping AI from being "free" is the limitations OEMs impose on their hardware.

> that those that don't give it away are the abnormal ones?

They are. Admit it; the internet is the new normal, if your software isn't as "free" as opening a website, you're weird. If I have to pay to access your little forum, I won't use it. If I have to buy your app to see what it's like, I'll never know what you're offering. Part of what makes Nvidia's business model so successful is that they do "give away" CUDA to anyone that owns their hardware. There is no developer fee or mandatory licensing cost, it is plug-and-play with the hardware. Same goes for OpenAI, they'd have never succeeded if you had to buy "the ChatGPT App" from your App Store.

dylan604
5 replies
18h13m

> Why do people return Windows laptops when they have to pay for a Windows License Activation?

The internet echo chamber strikes again. Exactly how many people are actually doing this? Not many, and those that are all hangout together. The rest of the world just blindly goes about their day using Windows while surfing the web using Chrome. Sometimes, it's a good thing to get outside your bubble. It's a big world out there, and not everybody sees the world as you do

talldayo
4 replies
17h49m

> The rest of the world just blindly goes about their day using Windows while surfing the web using Chrome.

Paying for Windows? I think you missed my point. If your computer doesn't ship with an OS, paid or otherwise, people think it's a glitch. The average consumer will sooner return their laptop before they buy a license of Windows, create an Install Media from their old device and flash the new hardware with a purchased license. They'll get a Chromebook instead, people don't buy Windows today.

The internet has conditioned the majority of modern technology users to reject and habitually avoid non-free experiences. Ad-enabled free platforms and their pervasive success is all the evidence you need. Commercial software as it existed 20 or 30 years ago is a dead business. Free reigns supreme.

dylan604
2 replies
15h32m

Who/where/how does someone buy a laptop without an OS? I'm just not able to follow down this hypothetical path that you are insisting on blazing

hamilyon2
1 replies
11h48m

That is kind of his point. You don't; Windows is bundled with the laptop. It's not that I agree with his points. Windows, for example, isn't open source in the remotest sense.

dylan604
0 replies
4h52m

Dell offers laptops with a version of Linux preinstalled and supports them. System76, Lenovo, Purism as well to name a few. Apple also sells laptops without Windows on them. There are actually quite a few options that do this. If you don't want Windows, we have options now. Yes, historically, it was Windows or Apple's OS, but that's no longer true and not recognizing that just makes you look like you're pushing a false narrative on the situation for what purpose only you know.

alt227
0 replies
11h4m

> Commercial software as it existed 20 or 30 years ago is a dead business. Free reigns supreme.

What nonsense. Go into any business and you will find every single piece of software they use is bought and paid for with bells on. The 'Free World' you speak of is only there to get you, an individual, used to using the software so that businesses are made to purchase it. In the old days we called this 'demo' or 'shareware'. Now its 'free' or 'personal' tier subscription.

Go and ask any designer if their copy of Adobe Creative Cloud, 3D Studio Max, or AutoCAD is free. Ask any office worker about Microsoft Office (including Teams and SharePoint etc.) or even Google Docs for business. The majority of developers are running paid versions of JetBrains. Running an online shop? Chances are you are paying for Shopify software, or something like Zoho to manage your customers and orders.

'Free' as you put it is very much only in the online individual consumer world, a very small part of the software world.

The commercial software market is more alive and expensive than it has ever been.

dTal
6 replies
17h59m

Because software is information. It is closer to a scientific paper than a loaf of bread, and I do expect those to be free. I do not expect scientists to work for free, but the marginal cost of copying their output is 0 and the social benefit is huge.

Free software, like open science, clearly has something going for it pragmatically. The developer hours put into it have paid for themselves magnitudes of times over. Megacorps hire people to work on free software. If you can't see the value, that's a you problem.

voidUpdate
3 replies
10h12m

If all software was free and made no money, how could developers pay their bills?

einpoklum
1 replies
6h57m

Most software is free and makes no money - and that has always been the case. There are some very popular and widely-used non-free systems, but most software isn't that, and its developers still pay the bills.

This is somewhat analogous to music or books/literature. Most composers and performers and authors make no money from people copying and sharing their works. Some pay the bills working professionally for entities who want their product enough to pay for it; some do other things in life. Some indeed give up their work on music because they can't afford to not do more gainful work. And still, neither music nor books go away as copying them gets closer to being free.

voidUpdate
0 replies
6h47m

If my current employer can't make any money from the code we write, then it would collapse faster than a soufflé taken out of the oven too early, and I would be out of a job

TaylorAlexander
0 replies
9h3m

Free software is so important to society that I believe the most reasonable solution is to provide for all people without their need to work for survival. Automate as much as possible such that work is not compulsory, and enough people simply want something to do (and possibly additional pay depending on how the system is arranged) that everything that needs to get done by people does get done.

For now that is fiction, but so is "if all software was free". I do think though that both would lead to a faster rate of innovation in society versus one where critical information is withheld from society to pay someone's rent and food bills.

acuozzo
1 replies
13h34m

> the social benefit is huge

It will be interesting to see if this is the case in the long run, assuming "huge" has a positive connotation in your post, of course.

If AGI comes to pass and it winds up being a net negative for humanity, then the ethics of any practice which involves freely distributing information that can be endlessly copied for very little cost must be reevaluated.

TaylorAlexander
0 replies
9h2m

> If AGI comes to pass

Increasingly, I am not putting much weight in any predictions about whether this will happen in the way we think it will, or what it could possibly mean. We might as well be talking about the rapture.

napoleongl
0 replies
11h14m

OTOH, recipes and drawings are commonly available for free. So if you can do it yourself, the cake and the engine repair are free. But if you need support, then you can get someone to bake or build for you.

TaylorAlexander
0 replies
18h3m

Bakers don't give away their goods. Mechanics don't typically repair things for free. Builders don't build things for free. Gas stations don't give away gas.

These all share the property that they are scarce physical goods or services. Software is not scarce (though of course the labor to create it is), so this is a really bad comparison.

And again I did not say it should or should not be free, I said there are engineering benefits to open source software and more and more people recognize those benefits and choose to make things free because they see the value and are willing to recognize the tradeoffs. I never said what "should" be done. "Should" is kind of a nonsense term when used in this way as it hides a lot of assumptions, so I generally do not use it, and notably did not use it in my comment. I want to point out the peculiarity in your rather strong response to a word and concept I never used. I think you are having an argument with imagined people, not a discussion with me.

And for what it is worth, I am a robotics engineer and I am designing a completely open source solar powered farming robot designed to be made in a small shop in any city in the world (see my profile), funded by a wealthy robotics entrepreneur who recognizes the value in making this technology available to people all over the world.

So I am one of those engineers making this choice, and not someone just asking for things without doing the same with my own work. Everything I produce is open source, including personal projects and even my personal writing.

dheera
9 replies
21h10m

Also, can I even buy an AMD GPU? I don't see a "buy now" button or a PCIe version anywhere here

https://www.amd.com/en/products/accelerators/instinct/mi300/...

Another big AMD fuckup in my opinion. Nobody is going to drop millions on these things without being able to test them out first.

First rule of sales: If you have something for sale, take my money.

nwiswell
8 replies
20h2m

I don't see a "buy now" button or a PCIe version anywhere here

"Buy now" buttons and online shopping carts are not generally how organizations looking to spend serious money on AI buy their hardware.

They have a long list of server hardware partners, and odds are you'd already have an existing relationship with one or more of them, and they'd provide a quote.

They even go one step further and show off some of their partners' solutions:

https://www.amd.com/en/graphics/servers-instinct-deep-learni...

FWIW I believe Supermicro and Exxact actually do have web-based shopping carts these days, so maybe you could skip the quotation and buy directly if you were so motivated? Seems kind of weird at this price point.

https://www.exxactcorp.com/Exxact-TS4-185328443-E185328443

dheera
7 replies
19h35m

... and that's why AMD is losing.

They could break the trend and offer a "buy now" button instead of offering quotes and coffee chats. It's very likely that will kickstart the software snowball with early adopters.

Nobody is going to drop millions on an unproven platform.

Seems kind of weird at this price point.

Yeah that $234K server is too much for people to do a trial. It has 8xMI300X GPUs along with a bunch of other shit.

Give me a single MI300X GPU in PCIe form factor for $20K and I'd very seriously consider. I'm sure there are many people who would help adapt the ecosystem if they were truly available.

nwiswell
5 replies
18h45m

Why would you be looking to dip your toe into the AMD ecosystem for the first time using an MI300X? It doesn't make any sense. It's not entry level hardware.

dheera
4 replies
18h42m

To help fix the ecosystem. It's way more affordable than Nvidia.

I'm not looking for entry level hardware.

nwiswell
3 replies
18h39m

Yes, that's why you'd choose AMD, I'm saying that you don't enter the ecosystem for the first time by purchasing the absolute cutting edge hardware.

As far as I'm aware you can't simply buy an Nvidia B200 PCIe card over the counter, either.

dheera
2 replies
18h30m

I'm not looking to enter the ecosystem, I'm already deep in it and want to fix the AMD problem so that I can build big projects around it and undercut everyone who's using Nvidia.

You can purchase H100 and A100 PCIe cards over the counter. They're great for compiling CUDA code, testing code before you launch a multi-node job into a cluster, and for running evaluations.

AMD has nothing of the sort, and it's hurting them.

I cannot blow 250K on an SMCI server, nor do I have the electricity setup for it. I can blow 20K on a PCIe GPU and start contributing to the ecosystem, or maybe prove out an idea on one GPU before trying to raise millions from a VC to build a more cost-effective datacenter that actually works.

shaklee3
0 replies
13h49m

A 20k GPU will be passively cooled and you'll need a real server for that. Even the old MI210 another poster sent is passive.

latchkey
0 replies
12h24m

Give me a single MI300X GPU in PCIe form factor for $20K and I'd very seriously consider. I'm sure there are many people who would help adapt the ecosystem if they were truly available.

I know this isn't what you're looking for entirely, but my business, Hot Aisle, is working on making MI300x available for rental. Our pricing isn't too crazy given that the GPU has 192GB and one week minimum isn't too bad. We will add on-demand hourly pricing as soon as we technically can.

I'm also pushing hard on Dell and AMD to pre-purchase developer credits on our hardware, that we can then give away to people who want to "kick the tires".

https://hotaisle.xyz/pricing/

msond
8 replies
22h26m

We're going to be publishing more details on later blog posts and documentation about how this works and how we've built it.

Yes, we're not open source, however our license is very permissive. It's both in the software distribution and viewable online at https://docs.scale-lang.com/licensing/

breck
7 replies
21h14m

How about trying _Early_ Source?

It's open source with a long delay, but paying users get the latest updates.

Make the git repo from "today - N years" open source, where N is something like 1 or 2.

That way, students can learn on old versions, and when they grow into professionals they can pay for access to the cutting-edge builds.

Win win win win

( https://breckyunits.com/earlySource.html)

msond
6 replies
20h31m

We're still thinking about our approach but this is a nice suggestion, thank you.

I'm curious, for what reasons are you interested in the source code yourself?

breck
2 replies
20h5m

I'm curious, for what reasons are you interested in the source code yourself?

I am the founder/editor of PLDB. So I try to do my best to help people "build the next great programming language".

We clone the git repos of over 1,000 compilers and interpreters and use cloc to determine what languages the people who are building languages are using. The people who build languages are obviously the experts, so as they go, so goes the world.

We call this measurement "Foundation Score". A Foundation Score of 100 means 100 other languages use this language somehow in their primary implementation.

It is utterly dominated by open source languages, and the disparity is only getting more extreme.

You can see for yourself here:

https://pldb.io/lists/explorer.html#columns=rank~name~id~app...

Some that might have become irrelevant have gained a second wind after going open source.

But some keep falling further behind.

I look at Mathematica, a very powerful and amazing language, and it makes me sad to see so few other language designers using it, and the reason is that it's closed source. So they are not doing so hot, and that's a language from one of our world's smartest and most prolific thinkers, one that's been around for decades.

I don't see a way for a new language to catch on nowadays that is not open source.

msond
1 replies
19h7m

Very interesting, thank you for sharing!

We do believe in open source software and we do want to move the GPGPU market away from fully closed languages. The future is open for discussion but regardless, the status quo at the moment is a proprietary and dominant implementation which only supports a single vendor.

I don't see a way for a new language to catch on nowadays that is not open source.

I do note that CUDA is itself closed source -- while there's an open source implementation in the LLVM project, it is not as bleeding edge as NVIDIA's own.

breck
0 replies
18h10m

I do note that CUDA is itself closed source

And this is a good point. However, it also has a 17 year head start, and many of those years were spent developing before people realized what a huge market there was.

All it will take is one committed genius to create an open source alternative to CUDA to dethrone it.

But they would have to have some Mojo (hint hint) to pull that off.

mindcrime
0 replies
20h16m

I'm not the person you replied to, and I can't speak for them. But I can say that for myself, and a not small number of other people, it's an ideological issue. I simply do not use software that isn't F/OSS - to the greatest extent that that is possible. For me, I might use a VERY small amount of non F/OSS stuff, but it's very hard to get me to adopt something new if it isn't.

Now should you make business decisions based on that? Probably not. But while I don't claim to be a representative sample, I am pretty sure the number of people who share my beliefs in this regard is substantially "non zero". shrug

idonotknowwhy
0 replies
19h6m

I'm a big fan of opensource for most things but if what you've got actually works, you could probably earn big money selling it. The biggest companies in the world are building / using this sort of thing.

Imagine the shift of capital if, for example, Intel GPUs suddenly had the same ML software compatibility as Nvidia.

atq2119
0 replies
20h10m

Not GP, but guaranteed source availability means users can fix issues themselves in the future if the original provider goes belly-up.

ipsum2
1 replies
22h35m

They're using Docusaurus[1] for their website, which is most commonly used with open source projects.

https://docusaurus.io/docs

tempaccount420
0 replies
22h35m

They might be hoping to be acquired by AMD

ladberg
16 replies
22h49m

I don't really see how any code that depends heavily on the underlying hardware can "just work" on AMD. Most serious CUDA code is aware of register file and shared memory sizes, wgmma instructions, optimal tensor core memory & register layouts, tensor memory accelerator instructions, etc...

Presumably that stuff doesn't "just work" but they don't want to mention it?

lmeyerov
13 replies
22h9m

Sort of

A lot of our hw-aware bits are parameterized, where we fill in constants based on the available hw. Porting those is doable, same as we do whenever new Nvidia architectures come out.

But yeah, we have tricky bits that inline PTX, and... those will be more annoying to redo.
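
For concreteness, a minimal sketch of that parameterization pattern (an illustration, not the actual codebase): query the device properties at startup and derive tiling/launch constants from them instead of hard-coding one architecture's numbers.

```
#include <cuda_runtime.h>

// Illustrative sketch: pick kernel launch parameters from whatever hardware is
// present rather than baking in one architecture's numbers.
struct KernelConfig {
    int threads_per_block;
    size_t smem_bytes;
};

static KernelConfig pick_config(int device) {
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, device);

    KernelConfig cfg{};
    // A block size that is a multiple of the warp/wavefront width the driver
    // reports, and a shared-memory budget that leaves headroom for occupancy.
    cfg.threads_per_block = 8 * prop.warpSize;
    cfg.smem_bytes = prop.sharedMemPerBlock / 2;
    return cfg;
}
```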

Retr0id
12 replies
21h55m

SCALE accepts CUDA programs as-is. [...] This is true even if your program uses inline PTX asm
lmeyerov
11 replies
21h50m

Oh, that will be interesting to understand, as PTX gets more into trickier hw-arch-specific phenomena that different brands disagree on, like memory models. Neat!

lmeyerov
10 replies
21h41m

Looks like the PTX translation is via another project ZLUDA, though how they bridge the differences in memory/consistency/etc models safely remains unclear to me...

ckitching
9 replies
21h9m

Hi! Spectral engineer here!

SCALE does not use any part of ZLUDA. We have modified the clang frontend to convert inline PTX asm block to LLVM IR.

To put it in a less compiler-engineer-y way: for any given block of PTX, there exists a hypothetical sequence of C++/CUDA code you could have written to achieve the same effect, but on AMD (perhaps using funky __builtin_... functions if the code includes shuffles/ballots/other-weird-gpu-stuff). Our compiler effectively converts the PTX into that hypothetical C++.

Regarding memory consistency etc.: NVIDIA document the "CUDA memory consistency model" extremely thoroughly, and likewise, the consistency guarantees for PTX. It is therefore sufficient to ensure that we use operations at least as synchronising as those called for in the documented semantics of the language (be it CUDA or PTX, for each operation).

Differing consistency _between architectures_ is the AMDGPU backend's problem.

lmeyerov
4 replies
17h42m

Ah I was reading the 'deeper dive' section on my phone and missed it was a comparison, not a warning, thank you

I'm curious how something like this example would translate:

===

Mapping lower-level ptx patterns to higher-level AMD constructs like __ballot, and knowing it's safe

```

  #ifdef INLINEPTX
  inline uint ptx_thread_vote(float rSq, float rCritSq) {
      uint result = 0;
      asm("{\n\t"
           ".reg .pred cond, out;\n\t"
           "setp.ge.f32 cond, %1, %2;\n\t"
           "vote.sync.all.pred out, cond, 0xffffffff;\n\t"
           "selp.u32 %0, 1, 0, out;\n\t"
           "}\n\t"
           : "=r"(result)
           : "f"(rSq), "f"(rCritSq));
      return result;
  }
  #endif
```

===

Again, I'm guessing there might be an equiv simpler program involving AMD's __ballot, but I'm unsure of the true equivalence wrt safety, and it seems like a tricky rewrite as it needs to (afaict) decompile to recover the higher-level abstraction. Normally it's easier to compile down or sideways (translate), and it's not clear to me these primitives are 1:1 for safely doing so.

===

FWIW, this is all pretty cool. We stay away from PTX -- most of our app code is higher-level, whether RAPIDS (GPU dataframes, GPU ML, etc libs), minimal cuda, and minimal opencl, with only small traces of inline ptx. So more realistically, if we had the motivation, we'd likely explore just #ifdef'ing it with something predictable.
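
For illustration, the "#ifdef with something predictable" route could look like the sketch below. The non-PTX branch is an assumption (the portable warp-vote intrinsics in CUDA and HIP), not code from the codebase discussed here.

```
#ifdef INLINEPTX
// ... the asm-based ptx_thread_vote from the snippet above ...
#else
// Sketch: same semantics via the portable warp-vote intrinsics.
inline __device__ uint ptx_thread_vote(float rSq, float rCritSq) {
#if defined(__HIP_PLATFORM_AMD__)
    // HIP's vote takes no mask and covers the whole wavefront.
    return __all(rSq >= rCritSq) ? 1u : 0u;
#else
    // CUDA's intrinsic equivalent of the setp / vote.sync.all / selp sequence.
    return __all_sync(0xffffffff, rSq >= rCritSq) ? 1u : 0u;
#endif
}
#endif
```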

ckitching
3 replies
16h50m

I compiled your function with SCALE for gfx1030:

        .p2align        2                               ; -- Begin function _Z15ptx_thread_voteff
        .type   _Z15ptx_thread_voteff,@function
  _Z15ptx_thread_voteff:                  ; @_Z15ptx_thread_voteff
  ; %bb.0:                                ; %entry
        s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
        s_waitcnt_vscnt null, 0x0
        v_cmp_ge_f32_e32 vcc_lo, v0, v1
        s_cmp_eq_u32 vcc_lo, -1
        s_cselect_b32 s4, -1, 0
        v_cndmask_b32_e64 v0, 0, 1, s4
        s_setpc_b64 s[30:31]
  .Lfunc_end1:
        .size   _Z15ptx_thread_voteff, .Lfunc_end1-_Z15ptx_thread_voteff
                                        ; -- End function


What were the safety concerns you had? This code seems to be something like `return __all_sync(rSq >= rCritSq) ? 1 : 0`, right?

lmeyerov
2 replies
16h26m

It's supposed to be waiting for all threads to vote

I'm not familiar with AMD enough to know if additional synchronization is needed. ChatGPT recommended adding barriers beyond what that gave, but again, I'm not familiar with AMD commands.

ckitching
1 replies
16h12m

Indeed, no extra synchronisation is needed here due to the nature of the hardware (threads in a warp can't get out of sync with each other).

Even on NVIDIA, you could've written this without the asm, as discussed above!

lmeyerov
0 replies
13h11m

Yeah I think, after this snippet was written, cuda added __all_sync as an intrinsic. The divergent code before this was plain-ish cuda, and this snippet ensures they wait on the comparison vote before recurring.

So in the AMD version, the compiler correctly realized the synchronization was on the comparison, so adds the AMD version right before it. That seems like a straightforward transform here.

It'd be interesting to understand the comparison of what Nvidia primitives map vs what doesn't. The above is a fairly simple barrier. We avoided PTX as much as we could and wrote it as simply as we could, I'd expect most of our PTX to port for similar reasons. The story is a bit diff for libraries we call. E.g., cudf probably has little compute-tier ptx directly, but will call nvidia libs, and use weird IO bits like cufile / gpu direct storage.

ladberg
3 replies
19h56m

Just to check here, if you're given something like the following PTX:

  wgmma.mma_async.sync.aligned.m64n256k16.f32.bf16.bf16
Do you reverse it back into C++ that does the corresponding FMAs manually instead of using tensor hardware? Or are you able to convert it into a series of __builtin_amdgcn_mfma_CDFmt_MxNxKABFmt instructions that emulate the same behavior?

ckitching
2 replies
18h21m

Rather awkwardly, you've asked about an instruction that isn't currently implemented. :D Support for wmma and friends is in development.

But in general the answer to your question is yes: we use AMD-specific builtins where available/efficient to make things work. Otherwise many things would be unrepresentable, not just slow!

saagarjha
1 replies
17h42m

What do you do when a builtin doesn't exist?

ckitching
0 replies
16h47m

Add one: it's trivial to add a compiler builtin to carry the instruction from the frontend to the backend if an instruction exists and the backend knows about it.

If there's no instruction, either, you can write a C++ function to replicate the behaviour and codegen a call to it. Since the PTX blocks are expanded during initial IR generation, it all inlines nicely by the end. Of course, such software emulation is potentially suboptimal (depends on the situation).
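
As a rough illustration of that emulation route (a sketch, not SCALE's actual runtime code): a PTX op with no direct AMD counterpart, say the warp-wide integer reduction redux.sync.add.u32, could be replaced with an inlinable helper built from shuffles.

```
// Sketch of the kind of helper a compiler could codegen a call to when the
// target has no matching instruction: a full-warp integer add reduction.
// Full-warp participation is assumed; wavefront-width handling is glossed over.
__device__ inline unsigned emulated_redux_add(unsigned v) {
    // Butterfly reduction: after log2(warpSize) steps every lane holds the sum.
    for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
        v += __shfl_xor_sync(0xffffffffu, v, offset);
    }
    return v;
}
```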

consf
0 replies
7h11m

It involves significant challenges

Moldoteck
0 replies
11h6m

It's speculation, but I think it's similar with processors: nobody guarantees the code will run exactly the way you wrote it. You may want to use some specific register, but if the processor thinks another register can fulfill the task, it'll use that while still telling you that your code executed as expected. Maybe AMD's GPU can sufficiently simulate the behavior of Nvidia hardware so that higher abstractions are unaware that something different is happening under the hood.

resters
15 replies
22h44m

The main cause of Nvidia's crazy valuation is AMD's unwillingness to invest in making its GPUs as useful as Nvidia's for ML.

Maybe AMD fears antitrust action, or maybe there is something about its underlying hardware approach that would limit competitiveness, but the company seems to have left billions of dollars on the table during the crypto mining GPU demand spike and now during the AI boom demand spike.

1024core
5 replies
22h11m

I think this could be down to cultural differences; AMD's software department has been underfunded and doing poorly for a long time now.

Rumor is that ML engineers (which AMD really needs) are expensive, and AMD doesn't want to give them more money than the rest of the SWEs it has (for fear of pissing off the existing SWEs). So AMD is caught in a bind: it can't pay to get top MLE talent, and it can't just sit by and watch NVDA eat its lunch.

karolist
2 replies
21h51m

I find this strange to believe. Every big company has levels, unless your existing L7+ IC is below market, you can just pull L7+ salaried ML engineers with some secret signing bonus like literally everyone else.

Der_Einzige
1 replies
20h11m

The dirty secret in the tech industry is that AMD, Intel, IBM, and historically Nvidia/Oracle (this changed post-2022) were the 2nd-3rd tier tech companies. Staffed heavily by the rejects of the FAANGs, their people were still happy to have their 100-200K in their MCOL areas, but no free food and a much more boring work culture. Intel's "great place to work" corporate propaganda was known as "great place to leetcode" while I worked there, as Intel was always seen as a stepping stone before you "made it" in a FAANG.

Culturally, none of these companies were happy to pay anyone except the tip-top "distinguished" engineers more than 300K. AMD seems to be stuck in this mentality, just as IBM is.

quotemstr
0 replies
3h56m

AMD seems to be stuck in this mentality, just as IBM is.

And that's why creative destruction is essential for technological progress. It's common for organizations to get stuck in stable-but-suboptimal social equilibria: everyone knows there's a problem but nobody can fix it. The only way out is to make a new organization and let the old one die.

xboxnolifes
0 replies
18h19m

So AMD is caught in a bind: can't pay to get top MLE talent and can't just sit by and watch NVDA eat its lunch.

This isn't being caught in a bind. This is, if true, just making a poor decision. Nothing is really preventing them from paying more for specialized work.

mepian
0 replies
22h4m

AMD recently acquired Silo AI.

DaoVeles
0 replies
16h49m

So nothing has changed since the era of ATI.

ClassyJacket
3 replies
19h39m

I like to watch YouTube retrospectives on old failed tech companies - LGR has some good ones.

When I think of AMD ignoring machine learning, I can't help imagine a future YouTuber's voiceover explaining how this caused their downfall.

There's a tendency sometimes to think "they know what they're doing, they must have good reasons". And sometimes that's right, and sometimes that's wrong. Perhaps there's some great technical, legal, or economic reason I'm just not aware of. But when you actually look into these things, it's surprising how often the answer is indeed just shortsightedness.

They could end up like BlackBerry, Blockbuster, Nokia, and Kodak. I guess it's not quite as severe, since they will still have a market in games and therefore may well continue to exist, but it will still be looked back on as a colossal mistake.

Same with Toyota ignoring electric cars.

I'm not an investor, but I still have stakes in the sense that Nvidia has no significant competition in the machine learning space, and that sucks. GPU prices are sky high and there's nobody else to turn to if there's something about Nvidia you just don't like or if they decide to screw us.

robertlagrant
0 replies
5h35m

There's also just the idea of endeavour - Nvidia tried something, and it worked. Businesses (or rather their shareholders) take risks with their capital sometimes, and it doesn't always work. But in this case it did.

hedora
0 replies
3h53m

In fairness to AMD, they bet on crypto, and nvidia bet on AI. Crypto was the right short term bet.

Also, ignoring is a strong word: I’m staring at a little << $1000, silent 53 watt mini-PC with an AMD SoC. It has an NPU comparable to an M1. In a few months, with the ryzen 9000 series, NPUs for devices of its class will bump from 16 tops to 50 tops.

I’m pretty sure the linux taint bit is off, and everything just worked out of the box.

_boffin_
0 replies
5h19m

If you haven't heard of this book, you might like it: Dealers of Lightning.

gukov
1 replies
17h33m

The companies' CEOs are related. My conspiracy theory is that they don't want to step on each other's toes. Not sure if that works with fiduciary duty, though.

dist-epoch
0 replies
22h11m

There are stories from credible sources that AMD software engineers had to buy AMD GPUs with their own money to use in CI machines.

pixelpoet
11 replies
23h11m

Isn't this a bit legally dubious, like zluda?

janice1999
10 replies
23h2m

It's advertised as a "clean room" re-implementation. What part would be illegal?

ekelsen
5 replies
22h46m

If they had to reverse engineer any compiled code to do this, I think that would be against licenses they had to agree to?

At least grounds for suing and starting an extensive discovery process and possibly a costly injunction...

RockRobotRock
3 replies
22h11m

Isn't that exactly what a "clean room" approach avoids?

ekelsen
2 replies
21h22m

oh definitely. But if I was NVIDIA I'd want to verify that in court after discovery rather than relying on their claim on a website.

RockRobotRock
1 replies
19h16m

good point

ekelsen
0 replies
15h15m

FWIW, I think this is really great work and I wish only the best for scale. Super impressed.

msond
0 replies
22h23m

We have not reverse engineered any compiled code in the process of developing SCALE.

It was clean-room implemented purely from the API surface and by trial-and-error with open CUDA code.

mkl
1 replies
21h58m

So add a cheap NVidia card alongside grunty AMD ones, and check for its existence. It doesn't seem to say it needs to run on NVidia GPUs.

Keyframe
0 replies
21h34m

Heh, true. On the other hand, I bet companies are eager to challenge the wrath of a $3T company for a promise of "maybe it'll work, not all of it but at least it'll run worse, at least for now".

JonChesterfield
0 replies
19h25m

I don't think the terms of the Nvidia SDK can restrict running software without said SDK. Nvidia's libraries don't seem to be involved here. Their hardware isn't involved either. It's just some ascii in a bunch of text files being hacked around with before running on someone else's hardware.

jarbus
8 replies
23h8m

Really, really, really curious as to how they managed to pull this off, if their project works as well as they claim it does. If stuff as complex as paged/flash attention can "just work", this is really cool.

Straw
3 replies
23h5m

My understanding from chatting with them is that tensor core operations aren't supported yet, so FlashAttention likely won't work. I think its on their to-do list though!

Nvidia actually has more and more capable matrix multiplication units, so even with a translation layer I wouldn't expect the same performance until AMD produces better ML cards.

Additionally, these kernels usually have high sensitivity to cache and smem sizes, so they might need to be retuned.

Der_Einzige
2 replies
20h7m

So the only part that anyone actually cares about, as usual, is not supported. Same story as it was in 2012 with AMD vs Nvidia (and likely much before that too!). The more things change, the more they stay the same.

Straw
1 replies
18h29m

People did GPGPU computing long before GPUs. Simply look at the list of tested, supported projects on their docs page!

Straw
0 replies
1h4m

[EDIT] long before deep learning!

JonChesterfield
3 replies
22h13m

Cuda is a programming language. You implement it like any other. The docs are a bit sparse but not awful. Targeting amdgpu is probably about as difficult as targeting x64, mostly changes the compiler runtime.

The inline PTX implementation is notable for being even more annoying to deal with than the cuda, but it's just bytes in / different bytes out. No magic.

ckitching
2 replies
20h52m

[I work on SCALE]

CUDA has a couple of extra problems beyond just any other programming language:

- CUDA is more than a language: it's a giant library (for both CPU and GPU) for interacting with the GPU, and for writing the GPU code. This needed reimplementing. At least for the device-side stuff we can implement it in CUDA, so when we add support for other GPU vendors the code can (mostly) just be recompiled and work there :D.
- CUDA (the language) is not actually specified. It is, informally, "whatever nvcc does". This differs significantly from what Clang's CUDA support does (which is ultimately what the HIP compiler is derived from).

PTX is indeed vastly annoying.

JonChesterfield
1 replies
20h23m

The openmp device runtime library was originally written in cuda. I ported that to hip for amdgpu, discovered the upstream hip compiler wasn't quite as solid as advertised, then ported it to openmp with some compiler intrinsics. The languages are all essentially C++ syntax with some spurious noise obfuscating llvm IR. The libc effort has gone with freestanding c++ based on that experience and we've now mostly fixed the ways that goes wrong.

You might also find raw c++ for device libraries saner to deal with than cuda. In particular you don't need to jury rig the thing to not spuriously embed the GPU code in x64 elf objects and/or pull the binaries apart. Though if you're feeding the same device libraries to nvcc with #ifdef around the divergence your hands are tied.

ckitching
0 replies
19h39m

You might also find raw c++ for device libraries saner to deal with than cuda.

Actually, we just compile all the device libraries to LLVM bitcode and be done with it. Then we can write them using all the clang-dialect, not-nvcc-emulating, C++23 we feel like, and it'll still work when someone imports them into their c++98 CUDA project from hell. :D

ur-whale
6 replies
22h25m

If this actually works (remains to be seen), I can only say:

   1) Kudos
   2) Finally !

anthonix1
4 replies
20h56m

I just tried it with llm.c ... seems to be missing quite a few key components such as cublaslt, bfloat16 support, nvtx3, compiler flags such as -t

And it's linked against an old release of ROCm.

So unclear to me how it is supposed to be an improvement over something like hipify

ckitching
3 replies
20h40m

Greetings, I work on SCALE.

It appears we implemented `--threads` but not `-t` for the compiler flag. Oops. In either case, the flag has no effect at present, since fatbinary support is still in development, and that's the only part of the process that could conceivably be parallelised.

That said: clang (and hence the SCALE compiler) tends to compile CUDA much faster than nvcc does, so this lack of the parallelism feature is less problematic than it might at first seem.

NVTX support (if you want more than just "no-ops to make the code compile") requires cooperation with the authors of profilers etc., which has not so far been available.

bfloat16 is not properly supported by AMD anyway: the hardware doesn't do it, and HIP's implementation just lies and does the math in `float`. For that reason we haven't prioritised putting together the API.

cublasLt is a fair cop. We've got a ticket :D.

anthonix1
2 replies
19h33m

Hi, why do you believe that bfloat16 is not supported? Can you please provide some references (specifically the part about the hardware "doesn't do it")?

For the hardware you are focussing on (gfx11), the reference manual [2] and the list of LLVM gfx11 instructions supported [1] describe the bfloat16 vdot & WMMA operations, and these are in fact implemented and working in various software such as composable kernels and rocBLAS, which I have used (and can guarantee they are not simply being run as float). I've also used these in the AMD fork of llm.c [3]

Outside of gfx11, I have also used bfloat16 in CDNA2 & 3 devices, and they are working and being supported.

Regarding cublasLt, what is your plan for support there? Pass everything through to hipblasLt (hipify style) or something else?

Cheers, -A

[1] https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX11.html [2] https://www.amd.com/content/dam/amd/en/documents/radeon-tech... [3] http://github.com/anthonix/llm.c

ckitching
1 replies
18h27m

Hi, why do you believe that bfloat16 is not supported?

Apologies, I appear to be talking nonsense. I conflated bfloat16 with nvidia's other wacky floating point formats. This is probably my cue to stop answering reddit/HN comments and go to bed. :D

So: ahem: bfloat16 support is basically just missing the fairly boring header.

Regarding cublasLt, what is your plan for support there? Pass everything through to hipblasLt (hipify style) or something else?

Pretty much that, yes. Not much point reimplementing all the math libraries when AMD is doing that part of the legwork already.

anthonix1
0 replies
17h37m

OK, so in the case of llm.c, if you're just including the HIP headers, using hipblasLt, etc, what would be the benefit of using scale instead of hipify?

gedy
0 replies
21h22m

or: 1) CUDAs

JonChesterfield
6 replies
22h26m

This is technically feasible so might be the real thing. Parsing inline ptx and mapping that onto amdgpu would be a huge pain.

Working from cuda source that doesn't use inline ptx to target amdgpu is roughly regex find and replace to get hip, which has implemented pretty much the same functionality.

Some of the details would be dubious, e.g. the atomic models probably don't match, and volta has a different instruction pointer model, but it could all be done correctly.

Amd won't do this. Cuda isn't a very nice thing in general and the legal team would have kittens. But other people totally could.

ckitching
5 replies
21h5m

[I work on SCALE]

Mapping inline ptx to AMD machine code would indeed suck. Converting it to LLVM IR right at the start of compilation (when the initial IR is being generated) is much simpler, since it then gets "compiled forward" with the rest of the code. It's as if you wrote C++/intrinsics/whatever instead.

Note that nvcc accepts a different dialect of C++ from clang (and hence hipcc), so there is in fact more that separates CUDA from hip (at the language level) than just find/replace. We discuss this a little in [the manual](https://docs.scale-lang.com/manual/dialects/)

Handling differences between the atomic models is, indeed, "fun". But since CUDA is a programming language with documented semantics for its memory consistency (and so is PTX) it is entirely possible to arrange for the compiler to "play by NVIDIA's rules".

JonChesterfield
4 replies
20h46m

Huh. Inline assembly is strongly associated in my mind with writing things that can't be represented in LLVM IR, but in the specific case of PTX - you can only write things that ptxas understands, and that probably rules out wide classes of horrendous behaviour. Raw bytes being used for instructions and for data, ad hoc self modifying code and so forth.

I believe nvcc is roughly an antique clang build hacked out of all recognition. I remember it rejecting templates with 'I' as the type name and working when changing to 'T', nonsense like that. The HIP language probably corresponds pretty closely to clang's cuda implementation in terms of semantics (a lot of the control flow in clang treats them identically), but I don't believe an exact match to nvcc was considered particularly necessary for the clang -x cuda work.

The ptx to llvm IR approach is clever. I think upstream would be game for that, feel free to tag me on reviews if you want to get that divergence out of your local codebase.

saagarjha
1 replies
17h34m

nvcc is nowhere near that bad these days; it supports most C++ code directly (for example, I've written kernels that include headers like <span> or <algorithm> and they work just fine).

ckitching
0 replies
16h45m

NVCC is doing much better than before in terms of "broken C++". There was indeed a time when lots of modern C++ just didn't work.

Nowadays the issues are more subtle and nasty. Subtle differences in overload resolution. Subtle differences in lambda handling. Enough to break code in "spicy" ways when you try to port it over.

ckitching
1 replies
16h43m

I certainly would not attempt this feat with x86 `asm` blocks :D. PTX is indeed very pedestrian: it's more like IR than machine code, really. All the usual "machine-level craziness" that would otherwise make this impossible is just unrepresentable in PTX (though you do run into cases of "oopsie, AMD don't have hardware for this so we have to do something insane").

JonChesterfield
0 replies
4h13m

It's a beautiful answer to a deeply annoying language feature. I absolutely love it. Yes, inline asm containing PTX definitely should be burned off at the compiler front end, regardless of whether it ultimately codegens as PTX or something else.

I've spawned a thread on the llvm board asking if anyone else wants that as a feature https://discourse.llvm.org/t/fexpand-inline-ptx-as-a-feature... in the upstream. That doesn't feel great - you've done something clever in a proprietary compiler and I'm suggesting upstream reimplement it - so I hope that doesn't cause you any distress. AMD is relatively unlikely to greenlight me writing it so it's probably just more marketing unless other people are keen to parse asm in string literals.

cheptsov
5 replies
22h44m

Sounds really awesome. Any chance someone can say whether this also works inside a Docker container?

ckitching
1 replies
20h49m

It works exactly as well as other AMDGPU-related software (HIP etc.) works inside Docker.

There are some delightful AMD driver issues that make certain models of GPU intermittently freeze the kernel when used from docker. That was great fun when building SCALE's CI system :D.

cheptsov
0 replies
5h41m

Would love to give it a try! Thanks for answering my question.

cheptsov
0 replies
4h56m

Thank you! This link is very helpful.

cheptsov
0 replies
5h43m

Wow, somebody doesn’t like Docker enough to downvote my question.

juujian
4 replies
23h10m

I don't understand how AMD has messed up so badly that I feel like celebrating a project like this. Features of my laptop are just physically there but not usable, particularly in Linux. So frustrating.

jeroenhd
1 replies
20h54m

AMD hardware works fine, the problem is that the major research projects everyone copies are all developed specifically for Nvidia.

Now AMD is spinning up CUDA compatibility layer after CUDA compatibility layer. It's like trying to beat Windows by building another ReactOS/Wine. It's an approach doomed to fail unless AMD somehow manages to gain vastly more resources than the competition.

Apple's NPU may not be very powerful, but many models have been altered specifically to run on them, making their NPUs vastly more useful than most equivalently powerful iGPUs. AMD doesn't have that just yet, they're always catching up.

It'll be interesting to see what Qualcomm will do to get developers to make use of their NPUs on the new laptop chips.

JonChesterfield
0 replies
20h18m

Interesting analogy. The last few programs from the windows world I tried to run were flawless under wine and abjectly failed under windows 11.

djbusby
0 replies
22h53m

Same boat, AMD CPU but nothing else. I feel like a moderate improvement in their FOSS support and drivers would open up new hardware revenue - to say nothing of the AI channel.

ActorNightly
0 replies
22h4m

I don't know if I would call it a mess-up. AMD still has a massive market in server chips, and their ARM stuff is on the horizon. We all assume that graphics cards are the way forward for ML, which may not be the case in the future.

Nvidia were just ahead in this particular category due to CUDA, so AMD may have just let them run with it for now.

pjmlp
3 replies
13h19m

This targets CUDA C++, not CUDA the NVIDIA infrastructure for C, C++, Fortran, and anything else targeting PTX.

ckitching
2 replies
6h25m

The CUDA C APIs are supported as much in C as in C++ using SCALE!

CUDA Fortran is not currently supported by SCALE since we haven't seen much use of it "in the wild" to push it up our priority list.
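
A minimal example of the kind of plain-C use of the CUDA runtime API being referred to (an illustrative sketch, not a SCALE test case):

```
/* Illustrative sketch: the CUDA runtime API used from plain C, no C++ involved. */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    float *buf = NULL;
    cudaError_t err = cudaMalloc((void **)&buf, 1024 * sizeof(float));
    if (err != cudaSuccess) {
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemset(buf, 0, 1024 * sizeof(float));
    cudaFree(buf);
    return 0;
}
```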

anon291
1 replies
1h6m

It doesn't matter though. NVIDIA distributes tons of libraries built atop CUDA that you cannot distribute or use on AMD chips legally. Cutlass, CuBLAS, NCCL, etc.

tama_sala
0 replies
48m

Correct, which is one of the main moats Nvidia has when it comes to training.

m3kw9
3 replies
23h7m

This isn't a solution for pros because it will always be playing catch-up, and Nvidia can always add things to make it difficult. This is like emulation.

ok123456
0 replies
21h21m

It's not emulation. It's a compiler.

dboreham
0 replies
21h45m

Pros will end up overruled by bean counters if it works.

bachmeier
0 replies
21h56m

it will always play catch up

That's not important if the goal is to run existing CUDA code on AMD GPUs. All you have to do is write portable CUDA code in the future regardless of what Nvidia does if you want to keep writing CUDA.

I don't know the economics here, but if the AMD provides a significant cost saving, companies are going to make it work.

Nvidia can always add things to make it difficult

Sounds like Microsoft embedding the browser in the OS. It's hard to see how doing something like that wouldn't trigger an antitrust case.

gizajob
3 replies
22h59m

Is Nvidia not likely to sue or otherwise bork this into non-existence?

chx
1 replies
22h52m

Sue over what...?

gizajob
0 replies
21h59m

Whatever IP-related issues they'd want to sue over. Sorry, I don't know specifically what this would infringe, but I'm sure expensive legal brains could come up with something.

CoastalCoder
0 replies
22h50m

I wonder if nVidia's current anti-trust woes would make them reluctant to go that route at the moment.

ashvardanian
3 replies
22h17m

It’s great that there is a page about current limitations [1], but I am afraid that what most people describe as “CUDA” is a small subset of the real CUDA functionality. Would be great to have a comparison table for advanced features like warp shuffles, atomics, DPX, TMA, MMA, etc. Ideally a table, mapping every PTX instruction to a direct RDNA counterpart or a list of instructions used to emulate it.

[1]: https://docs.scale-lang.com/manual/differences/

ckitching
2 replies
20h57m

You're right that most people only use a small subset of CUDA: we prioritised support for features based on what was needed for various open-source projects, as a way to try to capture the most common things first.

A complete API comparison table is coming soon, I believe. :D

In a nutshell:

- DPX: Yes.
- Shuffles: Yes. Including the PTX versions, with all their weird/wacky/insane arguments.
- Atomics: Yes, except the 128-bit atomics nvidia added very recently.
- MMA: In development, though of course we can't fix the fact that nvidia's hardware in this area is just better than AMD's, so don't expect performance to be as good in all cases.
- TMA: On the same branch as MMA, though it'll just be using AMD's async copy instructions.

mapping every PTX instruction to a direct RDNA counterpart or a list of instructions used to emulate it.

We plan to publish a compatibility table of which instructions are supported, but a list of the instructions used to produce each PTX instruction is not in general meaningful. The inline PTX handler works by converting the PTX block to LLVM IR at the start of compilation (at the same time the rest of your code gets turned into IR), so it then "compiles forward" with the rest of the program. As a result, the actual instructions chosen vary on a case-by-case basis due to the whims of the optimiser. This design in principle produces better performance than a hypothetical solution that turned PTX asm into AMD asm, because it conveniently eliminates the optimisation barrier an asm block typically represents. Care, of course, is taken to handle the wacky memory consistency concerns that this implies!

We're documenting which ones are expected to perform worse than on NVIDIA, though!

einpoklum
0 replies
5h17m

You're right that most people only use a small subset of cuda

This is true first and foremost for the host-side API. From my StackOverflow and NVIDIA forums experience - I'm often the first and only person to ask about any number of nooks and crannies of the CUDA Driver API, with issues which nobody seems to have stumbled onto before; or at least, not stumbled and written anything in public about it.

yieldcrv
2 replies
21h37m

the real question here is whether anybody has gotten cheap, easily available AMD GPUs to run their AI workloads, and if we can predict more people will do so

anthonix1
0 replies
19h20m

I ported Karpathy's llm.c repo to AMD devices [1], and have trained GPT2 from scratch with 10B tokens of fineweb-edu on a 4x 7900XTX machine in just a few hours (about $2 worth of electricity) [2].

I've also trained the larger GPT2-XL model from scratch on bigger CDNA machines.

Works fine.

[1] https://github.com/anthonix/llm.c [2] https://x.com/zealandic1

JonChesterfield
0 replies
19h23m

Microsoft have their production models running on amdgpu. I doubt it was easy but it's pretty compelling as an existence proof

spfd
2 replies
17h46m

Very impressive!

But I can't help but think: if something like this can be done to this extent, I wonder what went wrong/why it's a struggle for OpenCL to unify the two fragmented communities. While this is very practical and has a significant impact for people who develop GPGPU/AI applications, for the heterogeneous computing community as a whole, relying on/promoting a proprietary interface/API/language to become THE interface for working with different GPUs sounds like bad news.

Can someone educate me on why OpenCL seems to be out of the picture in the comments/any of the recent discussions related to this topic?

vedranm
0 replies
13h22m

If you are going the "open standard" route, SYCL is much more modern than OpenCL and also nicer to work with.

JonChesterfield
0 replies
3h2m

Opencl gives you the subset of capability that a lot of different companies were confident they could implement. That subset turns out to be intensely annoying to program in - it's just the compiler saying no over and over again.

Or you can compile as freestanding c++ with clang extensions and it works much like a CPU does. Or you can compile as cuda or openmp and most stuff you write actually turns into code, not a semantic error.

Currently cuda holds lead position but it should lose that place because it's horrible to work in (and to a lesser extent because more than one company knows how to make a GPU). Openmp is an interesting alternative - need to be a little careful to get fast code out but lots of things work somewhat intuitively.

Personally, I think raw C++ is going to win out and the many heterogeneous languages will ultimately be dropped as basically a bad idea. But time will tell. Opencl looks very DoA.

sakras
2 replies
22h49m

One question I always have about these sorts of translation layers is how they deal with the different warp sizes. I'd imagine a lot of CUDA code relies on 32-wide warps, while as far as I know AMD tends to have 64-wide warps. Is there some sort of emulation that needs to happen?

mpreda
0 replies
22h35m

The older AMD GCN had a 64-wide wavefront, but the newer AMD "RDNA" GPUs support both 64- and 32-wide wavefronts, and this is configurable at runtime. It appears the narrower wavefronts are better suited for games in general.

Not sure what is the situation with "CDNA", which is the compute-oriented evolution of "GCN", i.e. whether CDNA is 64-wavefront only or dual like RDNA.
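
One concrete example of the 32-wide assumption that bites here is the ballot mask width. A hedged sketch (illustrative, not taken from any project in this thread) of writing it wavefront-agnostically:

```
// Illustrative sketch: count the lanes for which a predicate holds, without
// assuming the ballot mask fits in 32 bits.
__device__ inline int count_active_lanes(bool pred) {
#if defined(__HIP_PLATFORM_AMD__)
    // HIP's __ballot returns a 64-bit mask, covering wave64 as well as wave32.
    unsigned long long mask = __ballot(pred);
    return __popcll(mask);
#else
    // CUDA's warp is 32 wide, so a 32-bit mask and popcount suffice.
    unsigned mask = __ballot_sync(0xffffffffu, pred);
    return __popc(mask);
#endif
}
```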

arjvik
2 replies
23h12m

Who is this Spectral Compute, and where can we see more about them?

JonChesterfield
0 replies
22h17m

The branch free regex engine is an interesting idea. I would have said that can't be implemented in finite code.

Compile to DFA by repeatedly differentiating then unroll the machine? You'd still have back edges for the repeating sections.

shmerl
1 replies
22h44m

Compiler isn't open source? That feels like DOA in this day and age. There is ZLUDA already which is open.

If they plan to open it up, it can be something useful to add to options of breaking CUDA lock-in.

uyzstvqs
0 replies
21h4m

ZLUDA is pretty good, except that it lacks cuDNN which makes most PyTorch projects just not work. Not sure if this project does cover that? That could be a game changer, otherwise yeah ZLUDA is the better open-source option.

qwerty456127
1 replies
21h44m

gfx1030, gfx1100, gfx1010, gfx1101, gfx900...

How do I find out which one I have?

adzm
1 replies
23h11m

I'd love to see some benchmarks but this is something the market has been yearning for.

msond
0 replies
23h0m

We're putting together benchmarks to publish at a later time, and we've asked some independent third parties to work on their own additionally.

uptownfunk
0 replies
13h7m

Very clearly the business motive makes sense: go after Nvidia's GPU monopoly. Can someone help a layperson understand the pitfalls here that might prevent this from being an intelligent venture?

tallmed
0 replies
35m

I wonder if this thing has anything in common with ZLUDA; it's permissively licensed after all.

stuaxo
0 replies
5h35m

What's the licensing? Will I be able to run this as a hobbyist for free software?

rjurney
0 replies
17h10m

If it's efficient, this is very good for competition.

paulmist
0 replies
22h36m

Doesn't seem to mention CDNA?

nabogh
0 replies
18h31m

I've written a bit of CUDA before. If I want to go pretty bare-bones, what's the equivalent setup for writing code for my AMD card?

localfirst
0 replies
17h30m

SCALE does not require the CUDA program or its build system to be modified.

how big of a deal is this?

joe_the_user
0 replies
23h1m

This sounds fabulous. I look forward to AMD being drawn kicking and screaming into direct competition with Nvidia.

galaxyLogic
0 replies
21h41m

Companies selling CUDA software should no doubt adopt this tool

ekelsen
0 replies
15h9m

A major component of many CUDA programs these days involves NCCL and high bandwidth intra-node communication.

Does NCCL just work? If not, what would be involved in getting it to work?

einpoklum
0 replies
7h35m

At my workplace, we were reluctant to make the choice between writing OpenCL and being AMD-compliant but missing out on CUDA features and tooling, and writing CUDA and being vendor-locked.

Our jerry-rigged solution for now is writing kernels that are the same source for both OpenCL and CUDA, with a few macros doing a bit of adaptation (e.g. the syntax for constructing a struct). This requires no special library or complicated runtime work - but it does have the downside of forcing our code to be C'ish rather than C++'ish, which is quite annoying if you want to write anything that's templated.
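
For a sense of what such adaptation macros can look like (an illustrative sketch, not the actual macros from the gpu-kernel-runner repository linked below): the same kernel source compiles either as CUDA or as OpenCL C, with a thin macro layer papering over the qualifier and indexing differences.

```
// Illustrative only -- not the real macros from gpu-kernel-runner.
#if defined(__CUDACC__)
  #define KERNEL       extern "C" __global__
  #define GLOBAL_MEM   /* CUDA pointers need no address-space qualifier */
  #define GLOBAL_ID_X  (blockIdx.x * blockDim.x + threadIdx.x)
#else /* OpenCL C */
  #define KERNEL       __kernel
  #define GLOBAL_MEM   __global
  #define GLOBAL_ID_X  get_global_id(0)
#endif

// Note the C'ish style: no templates, so the same source satisfies both front ends.
KERNEL void scale_array(GLOBAL_MEM float *data, float factor, unsigned n) {
    unsigned i = (unsigned)GLOBAL_ID_X;
    if (i < n) {
        data[i] *= factor;
    }
}
```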

Note that all of this regards device-side, not host-side, code. For the host-side, I would like, at some point, to take the modern-C++ CUDA API wrappers (https://github.com/eyalroz/cuda-api-wrappers/) and derive from them something which supports CUDA, OpenCL and maybe HIP/ROCm. Unfortunately, I don't have the free time to do this on my own, so if anyone is interested in collaborating on something like that, please drop me a line.

-----

You can find the OpenCL-that-is-also-CUDA mechanism at:

https://github.com/eyalroz/gpu-kernel-runner/blob/main/kerne...

and

https://github.com/eyalroz/gpu-kernel-runner/blob/main/kerne...

(the files are provided alongside a tool for testing, profiling and debugging individual kernels outside of their respective applications.)

deliveryboyman
0 replies
23h9m

Would like to see benchmarks for the applications in the test suite.

E.g., how does Cycles compare on AMD vs Nvidia?

dagmx
0 replies
23h12m

Has anyone tried this and knows how well it works? It definitely sounds very compelling

Straw
0 replies
23h9m

I worked for spectral compute a few years ago. Very smart and capable technical team.

At the time, not only did they target AMD (with less compatibility than they have now), but also outperformed the default LLVM ptx backend, and even NVCC, when compiling for Nvidia GPUs!

EGreg
0 replies
24m

Does it translate to OpenCL?

This sounds like the DirectX vs OpenGL debate when I was younger lol

EGreg
0 replies
19h56m

But the question is, can it also run SHUDA and WUDA?