As another commenter said, it's CUDA. Intel and AMD and whoever can turn out chips reasonably fast, but nobody gets that it's the software and ecosystem. You have to out-compete the ecosystem. You can pick up a used MI100 that performs almost like an A100 on eBay for 5x less money, for example. Why is it 5x less? Because the software incompatibilities mean you'll spend a ton of time getting it to work compared to an Nvidia GPU.
Google is barely limping along with its XLA interface to PyTorch providing researchers a decent compatibility path. Same with Intel.
Any company in this space should basically set up a giant test suite of, IDK, every model on Hugging Face and just start brute-force fixing the issues. Then maybe they can sell some chips!
Intel is basically doing the same shit they always do here, announcing some open initiative and then doing literally the bare minimum to support it. 99% chance OpenVINO goes nowhere. OpenAI's Triton already seems more popular; at least I've heard it referenced a lot more than OpenVINO.
The funny thing to me is that so much of the "AI software ecosystem" is just PyTorch. You don't need to develop some new framework and make it popular. You don't need to support a zillion end libraries. Just literally support PyTorch.
If PyTorch worked fine on Intel GPUs, a lot of people would be happy to switch.
But you can't support PyTorch without a proper foundation in place. They don't need to support a zillion _end_ libraries, sure, but they do need at least a very good set of standard libraries, equivalents of cuBLAS, cuRAND, etc.
And they don't. My work recently had me working with rocRAND (ROCm's answer to cuRAND). It was frankly pretty bad: the design, the performance (50% slower in places that don't make any sense, because generating random numbers is not exactly that complicated), and the documentation (God, it was awful).
Now, that's a small slice of the larger pie. But imagine if this trend continues for other libraries.
Generating random numbers is a bit complicated! I wrote some of the samplers in PyTorch (probably replaced by now), and the underlying pseudo-random algorithms that work correctly in parallel are not exactly easy... running the same PRNG with the same seed on all your cores will produce the same result on every core, which is probably NOT what you want from your API.
But, to be honest, it's not that hard either. I'm surprised their API is 2x slower, Philox is 10 years old now and I don't think there's a licensing fee?
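For reference, the usual fix looks something like this - a minimal CUDA sketch using cuRAND's device-side Philox state (the kernel and its parameters are illustrative, not from any particular codebase): every thread shares one seed but gets its own subsequence, so each draws from an independent stream.

```
#include <curand_kernel.h>

// Same seed everywhere, but a distinct subsequence per thread, so the
// streams are statistically independent instead of identical copies.
__global__ void fill_uniform(float* out, int n, unsigned long long seed) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    curandStatePhilox4_32_10_t state;
    curand_init(seed, /*subsequence=*/tid, /*offset=*/0, &state);
    out[tid] = curand_uniform(&state);  // uniform float in (0, 1]
}
```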
I wonder if the next generation chips are going to just have a dedicated hardware RNG per-core if that's an issue?
Why bother?
It's not the generation that matters so much, it's the gathering of entropy, which comes from peripherals and isn't possible to generate on-die.
If you don't need cryptographically secure randomness, you still want the entropy for generating the seeds per thread/die/chip.
It absolutely is possible to generate entropy on-die, assuming you actually want entropy and not just a unique value that gets XORed with the seed, so you can still have repeatable seeds.
Pretty much every chip has an RNG, which can be as simple as a single free-running oscillator you sample.
Every chip may have some sort of noise to sample, but they are nowhere near good sources of entropy.
Entropy is not a binary thing (you either have it or don't), it's a spectrum and entropy gathered on-die is poor entropy.
Look, I concede that my knowledge on this subject is a bit dated, but the last time I checked there were no good sources of entropy on-die for any chip in wide use. All cryptographically secure RNGs depend on a peripheral to grab noise from the environment to mix into the entropy pool.
A free-running oscillator is a very poor source of entropy.
For non-cryptographic applications, a PRNG like xorshift reseeded by a few bits from an oscillator might be enough.
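As a sketch of what that could look like (purely illustrative, not from the thread): Marsaglia's xorshift64 as the cheap generator, with the hardware bits only folded in when picking the seed.

```
#include <stdint.h>

// Marsaglia's xorshift64: statistically fine for many non-crypto uses,
// as long as the state is never zero.
static inline uint64_t xorshift64(uint64_t* state) {
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

// Hypothetical seeding step: fold a few oscillator/TRNG bits into a
// nonzero seed once, then run the deterministic generator from there.
static inline uint64_t seed_from_oscillator_bits(uint64_t hw_bits) {
    return hw_bits | 1ull;  // guarantees a nonzero state
}
```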
As I understand it, the reason they don't use on-chip RNGs by themselves isn't due to lack of entropy, it's because people don't trust them not to put a backdoor on the chips or to have some kind of bug.
Intel has https://en.m.wikipedia.org/wiki/RDRAND but almost all chips seem to now have some kind of RNG.
I know! I just wrote a whole paper and published a library on this!
But really, perhaps not as much as many from outside might think. The core of a Philox implementation can be around 50 lines of C++ [1]; with all the bells and whistles, maybe around 300-400. That implementation's performance equals cuRAND's, and sometimes even surpasses it! (The API is designed to avoid maintaining any RNG state in device memory, something cuRAND forces you to do.)
You're right. The solution here is to use multiple generator objects, one per thread, ensuring each produces a statistically independent random stream. Some good algorithms (Philox, for example) allow you to use any set of unique values as seeds for your threads (e.g. the thread id).
[1] https://github.com/msu-sparta/OpenRAND/blob/main/include/ope...
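For a rough sense of that "~50 lines" claim, here is a sketch of the Philox-4x32-10 core written from memory of the Random123 paper - the constants and output ordering should be double-checked against a reference implementation (e.g. the OpenRAND code in [1]) before relying on it.

```
#include <stdint.h>

struct philox4x32 { uint32_t v[4]; };

// One Philox-4x32 round: two 32x32 -> 64-bit multiplies, then mix the
// high/low halves with the other counter words and the keys.
static inline philox4x32 philox_round(philox4x32 c, uint32_t k0, uint32_t k1) {
    uint64_t p0 = 0xD2511F53ull * c.v[0];
    uint64_t p1 = 0xCD9E8D57ull * c.v[2];
    philox4x32 out;
    out.v[0] = (uint32_t)(p1 >> 32) ^ c.v[1] ^ k0;
    out.v[1] = (uint32_t)p1;
    out.v[2] = (uint32_t)(p0 >> 32) ^ c.v[3] ^ k1;
    out.v[3] = (uint32_t)p0;
    return out;
}

// Ten rounds with a Weyl-sequence key bump between them. The counter is
// the position in the stream, the key selects the stream: no RNG state
// ever has to live in device memory.
static inline philox4x32 philox4x32_10(philox4x32 ctr, uint32_t k0, uint32_t k1) {
    for (int r = 0; r < 10; ++r) {
        ctr = philox_round(ctr, k0, k1);
        k0 += 0x9E3779B9u;
        k1 += 0xBB67AE85u;
    }
    return ctr;
}
```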
Cool! I'll have a look-see. I've got my own experiments in this space.
For GPGPU, the better approach is a counter-based RNG (CBRNG) like Random123.
https://github.com/DEShawResearch/random123
if you accept the principles of encryption, then the bits of the output of crypt(key, message) should be totally uncorrelated to the output of crypt(key, message+1). and this requires no state other than knowing the key and the position in the sequence.
the direct-port analogy is that you have an array of CuRand generators, generator index G is equivalent to key G, and you have a fixed start offset for the particular simulation.
moreover, you can then define the key in relation to your actual data. the mental shift from what you're talking about is that in this model, a PRNG isn't something that belongs to the executing thread. every element can get its own PRNG and keystream. And if you use a contextually-meaningful value for the element key, then you already "know" the key from your existing data. And this significantly improves determinism of the simulation etc because PRNG output is tied to the simulation state, not which thread it happens to be scheduled on.
(note that the property of cryptographic non-correlation is NOT guaranteed across keystreams - (key, counter) is NOT guaranteed to be uncorrelated to (key+1, counter), because that's not how encryption usually is used. with a decent crypto, it should still be very good, but, it's not guaranteed to be attack-resistant/etc. so notionally if you use a different key index for every element, element N isn't guaranteed to be uncorrelated to element N+1 at the same place in the keystream. If this is really important then maybe you want to pass your array indexes through a key-spreading function etc.)
there are several benefits to doing it like this. first off obviously you get a keystream for each element of interest. but also there is no real state per-thread either - the key can be determined by looking at the element, but generating a new value doesn't change the key/keystream. so there is nothing to store and update, and you can have arbitrary numbers of generators used at any given time. Also, since this computation is purely mathematical/"pure function", it doesn't really consume any memory-bandwidth to speak of, and since computation time is usually not the limiting element in GPGPU simulations this effectively makes RNG usage "free". my experience is that this increases performance vs CuRand, even while using less VRAM, even just directly porting the "1 execution thread = 1 generator" idiom.
Also, by storing "epoch numbers" (each iteration of the sim, etc), or calculating this based on predictions of PRNG consumption ("each iteration uses at most 16 random numbers"), you can fast-forward or rewind the PRNG to arbitrary times, and you can use this to lookahead or lookback on previous events from the keystream, meaning it serves as a massively potent form of compression as well. Why store data in memory and use up your precious VRAM, when you could simply recompute it on-demand from the original part of the original keystream used to generate it in the first place? (assuming proper "object ownership" of events ofc!) And this actually is pretty much free in performance terms, since it's a "pure function" based on the function parameters, and the GPGPU almost certainly has an excess of computation available.
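A hedged sketch of that idiom in CUDA (the names and the fixed draws-per-epoch convention are mine, and a splitmix64-style finalizer stands in for a real CBRNG like Philox/Threefry just to keep it short): the key comes from the element, the counter from (epoch, draw index), and no generator state is stored anywhere.

```
#include <stdint.h>

// Stateless mixer standing in for a proper counter-based RNG.
__host__ __device__ static inline uint64_t mix64(uint64_t x) {
    x += 0x9E3779B97F4A7C15ull;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ull;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBull;
    return x ^ (x >> 31);
}

// Draw number `drawIdx` for `elementKey` at `epoch`, as a float in [0, 1).
// A pure function of its arguments: rewindable to any past epoch, and it
// costs zero VRAM for generator state.
__device__ inline float element_uniform(uint64_t elementKey,
                                        uint64_t epoch,
                                        uint64_t drawIdx) {
    // Fixed per-epoch draw budget makes every past/future draw addressable.
    const uint64_t MAX_DRAWS_PER_EPOCH = 16;
    uint64_t counter = epoch * MAX_DRAWS_PER_EPOCH + drawIdx;
    uint64_t bits = mix64(elementKey ^ mix64(counter));
    return (bits >> 40) * (1.0f / 16777216.0f);  // top 24 bits -> [0, 1)
}

__global__ void jitter_positions(float* pos, int n, uint64_t epoch) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // The element index is the key: the stream follows the data,
        // not whichever thread happens to be scheduled on it.
        pos[i] += element_uniform((uint64_t)i, epoch, /*drawIdx=*/0) - 0.5f;
    }
}
```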
--
In the extreme case, you should theoretically be able to "walk" huge parts of the keystream and find specific events you need, even if there is no other reference to what happened at that particular time in the past. Like why not just walk through parts of the keystream until you find the event that matches your target criteria? Remember since this is basically pure math, it's generated on-demand by mathing it out, it's pretty much free, and computation is cheap compared to cache/memory or memoizing.
(ie this is a weird form of "inverted-index searching", analogous to Elastic/Solr's transformers and how this allows a large number of individual transformers (which do their own searching/indexing for each query, which will be generally unindexable operations like fulltext etc) to listen to a single IO stream as blocks are broadcast from the disk in big sequential streaming batches. Instead of SSD batch reads you'd be aiming for computation batch reads from a long range within a keystream. And this is supposition, but I think you can also trade back and forth between generator space and index hit rate by pinning certain bits in the output, right?)
--
Anyway, I don't know how much that maps to your particular use-case, but that's the best advice I can give. Procedural generation using a rewindable, element-specific keystream is a very potent form of compression, and very cheap. But even if all you are doing is avoiding having to store a bunch of CuRand instances in VRAM... that's still an enormous win, even if you directly port your existing application to simply use the globalThreadIdx like it was a CuRand stateful instance being loaded/saved back to VRAM. Like I said, my experience is that because you're changing mutation to computation, this runs faster and also uses less VRAM; it is both smaller and better, and probably also statistically better randomness (especially if you choose the "hard" algorithms instead of the "optimized" versions, like Threefish instead of Threefry etc). The bit distribution patterns of cryptographic algorithms are something that a lot of people pay very, very close attention to; you are turning a science-toy implementation into a gatling gun there simply by modeling your task and the RNG slightly differently.
That is the reason why you shouldn't do the "just download random numbers" thing, as a sibling comment mentions (probably a joke) - that consumes VRAM, or at least system memory (and PCIe bandwidth). And you know what's usually way more available as a resource in most GPGPU applications than VRAM or PCIe bandwidth? Pure ALU/FPU computation time.
buddy, everyone has random numbers, they come with the fucking xbox. ;)
thinking this through a little bit, you are launching a series of gradient-descent work tasks, right? taskId is your counter value, weightIdx is your key value (RNG stream). That's how I'd port that. Ideally you want to define some maximum PRNG usage for each stage of the program, which allows you to establish fixed offsets from the epoch value for a given event. Divide your keystream in whatever advantageous way, based on (highly-compressible) epoch counters and event offsets from that value.
in practice, assuming a gradient-descent event needs a lot of random numbers, having one keystream for a single GD event might be too much and that's where key-spreading comes in. if you take the "weightIdx W at GradientDescentIdx G" as the key, you can have a whole global keystream-space for that descent stage. And the key-spreading-function lets you go between your composite key and a practical one.
https://en.wikipedia.org/wiki/Key_derivation_function
(again, like threefry, there is notionally no need for this to be cryptographically secure in most cases, as long as it spreads in ways that your CBRNG crypto algorithm can tolerate without bit-correlation. there is no need to do 2 million rounds here either etc. You should actually pick reasonable parameters here for fast performance, but good enough keyspreading for your needs.)
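A small sketch of that key-spreading step under the same kind of assumptions (the index names are hypothetical, and the splitmix64 finalizer is just one cheap choice of spreading function): pack the composite "weightIdx W at GradientDescentIdx G" into 64 bits, spread it, and use the result as the CBRNG key.

```
#include <stdint.h>

// Cheap, non-cryptographic spreading function (splitmix64 finalizer):
// enough to decorrelate adjacent composite keys without "2 million rounds".
__host__ __device__ static inline uint64_t spread_key(uint64_t x) {
    x += 0x9E3779B97F4A7C15ull;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ull;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBull;
    return x ^ (x >> 31);
}

// One keystream per (gradient-descent step, weight) pair.
__host__ __device__ static inline uint64_t stream_key(uint32_t gdIdx,
                                                      uint32_t weightIdx) {
    uint64_t composite = ((uint64_t)gdIdx << 32) | weightIdx;
    return spread_key(composite);
}
```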
I've been out of this for a long time, and I've been told I'm out of date before; GPGPUs might not behave exactly this way anymore, so please take it in the spirit it's offered. I can't guarantee this is right, but I specifically gazed into the abyss of the CuRand situation a decade ago and this was what I managed to come up with. I do feel your pain on the stateful RNG situation: managing state per-execution-thread is awful and destroys simulation reproducibility, and managing a PRNG context for each possible element is often infeasible. What a waste of VRAM and bandwidth and mutation/cache etc.
And I think that cryptographic/pseudo-cryptographic PRNG models are frankly just a much better horse to hook your wagon to than scientific/academic ones, even apart from all the other advantages. Like there's just not any way mersenne twister or w/e is better than threefish, sorry academia
--
edit: Real-world sim programs are usually very low-intensity and have effectively unlimited amounts of compute to spare, they just ride on bandwidth (sort/search or sort/prefix-scan/search algorithms with global scope building blocks often work well).
And tbh that's why tensor is so amazing: it's super effective at math intensity and computational focus, and that's what GPUs do well, augmented by things like sparse models etc. Make your random not-math task into dense or sparse (but optimized) GPGPU math, and you get a solution (a reasonable optimum) to an intractable problem in realtime. The experienced salesman usually finds a reasonable optimum, but we pay him in GEMM/BLAS/Tensor compute time instead of dollars.
Sort/search or sort/prefix-sum/search often works really well in deterministic programs too. Do you ever have a "myGroup[groupIdx].addObj(objIdx)" stage? That's a sort and a prefix-sum operation right there, and both of those ops run super well on GPGPU.
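A hedged Thrust sketch of that pattern (the container names are made up): sort the object indices by group key, then reduce-by-key and an exclusive prefix sum give you each group's offset and size - the same result as a serial addObj loop, built from GPU-friendly primitives.

```
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>
#include <thrust/iterator/constant_iterator.h>

// Equivalent of "for each obj: myGroup[groupOf[obj]].addObj(obj)".
void build_groups(thrust::device_vector<int>& group_of,  // group key per object
                  thrust::device_vector<int>& obj_idx)   // object indices
{
    // 1. Sort object indices by group key: each group becomes contiguous.
    thrust::sort_by_key(group_of.begin(), group_of.end(), obj_idx.begin());

    // 2. Count members per group by reducing a stream of 1s by key.
    thrust::device_vector<int> group_ids(group_of.size());
    thrust::device_vector<int> group_counts(group_of.size());
    auto ends = thrust::reduce_by_key(group_of.begin(), group_of.end(),
                                      thrust::constant_iterator<int>(1),
                                      group_ids.begin(), group_counts.begin());
    size_t num_groups = ends.first - group_ids.begin();

    // 3. Exclusive prefix sum over the counts = start offset of each group
    //    inside the sorted obj_idx array.
    thrust::device_vector<int> group_offsets(num_groups);
    thrust::exclusive_scan(group_counts.begin(),
                           group_counts.begin() + num_groups,
                           group_offsets.begin());

    // obj_idx is now grouped; (group_ids, group_offsets, group_counts)
    // describe each group's slice of it.
}
```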
Folks also underestimate how complex these libraries are. There are dozens of projects to make BLAS alternatives which give up after ~3-6 months when they realize that this project will take years to be successful.
How does that work? Why not pick up where the previous team left off instead of everyone starting new ones? Or are they all targeting different backends and hardware?
It's a compiler problem and there is no money in compilers [1]. If someone made an intermediate representation for AI graphs and then wrote a compiler from that intermediate format into whatever backend was the deployment target then they might be able to charge money for support and bug fixes but that would be it. It's not the kind of business anyone wants to be in so there is no good intermediate format and compiler that is platform agnostic.
1: https://tinygrad.org/
JAX is a compiler.
So are TensorFlow and PyTorch. All AI/ML frameworks have to translate high-level tensor programs into executable artifacts for the given hardware and they're all given away for free because there is no way to make money with them. It's all open source and free. So the big tech companies subsidize the compilers because they want hardware to be the moat. It's why the running joke is that I need $80B to build AGI. The software is cheap/free, the hardware costs money.
You can't bench implementations of random numbers against each other purely on execution speed.
A better algorithm (better statistical properties) will be slower.
Yeah. In this instance, I was talking about the same algorithm (Philox); the difference is purely in implementation.
I have the fastest random number generator in the world. And it works in parallel too!
https://i.stack.imgur.com/gFZCK.jpg
If you haven't already, please consider filing issues on the rocrand GitHub repo for the problems you encountered. The rocrand library is being actively developed and your feedback would be valuable for guiding improvements.
Appreciate it, will do.
Instead of generating pseudorandom numbers you can just download files of them.
https://archive.random.org/
Or you could just re-use the same number; no one can prove it is not random.
https://xkcd.com/221/
I honestly don't see why it's so hard. On my project we wrote our own GEMM kernels from scratch so llama.cpp didn't need to depend on cuBLAS anymore. It only took a few days and a few hundred lines of code. We had to trade away 5% performance.
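For a sense of what that looks like (a generic naive sketch, not the actual kernels): one output element per thread already removes the cuBLAS dependency; the remaining few hundred lines and the last few percent of performance go into shared-memory tiling and handling quantized formats.

```
// Naive single-precision GEMM: C = A * B, row-major, one output element
// per thread. Correct, but well short of cuBLAS without tiling.
__global__ void sgemm_naive(int M, int N, int K,
                            const float* A, const float* B, float* C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}

// Launch example: 16x16 threads per block, grid covering the MxN output.
//   dim3 block(16, 16);
//   dim3 grid((N + 15) / 16, (M + 15) / 16);
//   sgemm_naive<<<grid, block>>>(M, N, K, dA, dB, dC);
```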
For a given set of kernels, and a limited set of architectures, the problem is relatively easy.
But covering all the important kernels across all the crazy architectures out there, with relatively good performance and numerical accuracy... much harder.
Just to point out it does, kind of: https://github.com/intel/intel-extension-for-pytorch
I've asked before if they'll merge it back into PyTorch main and include it in the CI, not sure if they've done that yet.
In this case I think the biggest bottleneck is just that they don't have a fast enough card that can compete with having a 3090 or an A100. And Gaudi is stuck on a different software platform which doesn't seem as flexible as an A100.
They could compete on RAM, if the software was there. Just having a low-cost alternative to the 4060 Ti would allow them to break into the student/hobbyist/open source market.
I tried the A770, but returned it. Half the stuff does not work. They have the CPU side and GPU development on different branches (GPU seems to be ~6 months behind CPU) and often you have to compile it yourself (if you want torchvision or torchaudio). It's also currently on PyTorch 2.0.1, so somewhat lagging, and it doesn't have most of the performance analysis software available. You also do need to modify your PyTorch code, often more than just replacing cuda with xpu as the device. They are also doing all development internally, then pushing intermittently to public. A lot of this would not be as bad if there were a better idea of the feature timeline, or if they made their CI public. (Trying to build it myself involved an extremely hacky bash script that inevitably failed halfway through.)
The amount of VRAM is the absolute killer USP for the current large AI model hobbyist segment. Something that had just as much VRAM as a 3090 but at half the speed and half the price would sell like hot cakes.
You are describing the eBay market for used Nvidia Tesla cards. The K80, P40, or M40 are widely available and sell for ~$100 with 24GB of VRAM. The M10 even has 32GB! The problem for AI hobbyists is that it won't take long to realize how many APIs use the "optical flow" pathways and the like, so on Nvidia they'll only run at acceptable speeds on RTX hardware, assuming they run at all. CUDA versions are pinned to hardware to some extent.
Yep. I have a fleet of P40s that are good at what they do (Whisper ASR primarily) but anything even remotely new... nah. fp16 support is missing so you need P100 cards, and usually that means you are accepting 16GB of VRAM rather than 24GB.
Still some cool hardware.
For us hobbyists used 3090 or new 7900xtx seem to be the way. But even then you still need to build a machine with 3 or 4 of these GPUs to get enough VRAM to play with big models.
For sure - our prod machine has 6x RTX 3090s on some old cirrascale hardware. But P40s are still good for last-gen models. Just nothing new unfortunately.
Out of these three, only the P40 is worth the effort to get running relative to the capabilities it offers. That's also before considering that, beyond software hacks and configuration tweaks, those cards require specialised cooling shrouds for adequate cooling in tower-style cases.
If your time or personal energy is worth >$0, these cards work out to much more than $100. And you can't even file the time burnt on getting them to run as any kind of transferable experience.
That's not to say I don't recommend getting them - I have a P4 family card and will get at least one more, but I'm not kidding myself that the use isn't very limited.
The K80 is literally worth < $0 given how ridiculous it is to configure and use.
P40s are definitely the best value along with P100s. I still think there is a good amount of value to be extracted from both of those cards, especially if you are interested in using Whisper ASR, video transcoding, or CUDA models that were relevant before LLMs (a time many people have forgotten apparently).
This is pretty disappointing to hear. I’m really surprised they can’t even get a clean build script for users, let alone integrate into the regular Pytorch releases.
oh no, I just bought a refurbed A770 16GB for tinkering with GPGPU lol. It was $220, return?
OneAPI isn't bad for PyTorch, the performance isn't there yet but you can tell it's an extremely top priority for Intel.
Intel has to do it by themselves. NVIDIA just lets Meta/OpenAI/Google engineers do it for them. Such a handicapped fight.
It wasn’t always like this. Nvidia did the initial heavy lifting to get cuda off the ground to a point where other people could use it.
That's because CUDA is a clear, well-functioning library and Intel has no equivalent. It makes any "you just have to get Pytorch working" a little less plausible.
But this is the thing. Speaking as someone who dabbles in this area rather than any kind of expert, it's baffling to me that people like Intel are making press releases and public statements rather than (I don't know) putting in the frikkin work to make performance of the one library that people actually use decent.
You have a massive organization full of gazillions of engineers many of whom are really excellent. Before you open your mouth in public and say something is a priority, deploy a lot of them against this and manifest that priority by actually doing the thing that is necessary so people can use your stuff.
It’s really hard to take them seriously when they haven’t (yet) done that.
You know how it works. The same busybodies who are putting out these useless noise releases are the ones who squandered Intel's lead, and now they're patting themselves on the back for figuring out that with this they'll again be on top for sure!
There was a post on HN a few months ago about how Nvidia's CEO still has meetings with engineers in the trenches. Contrast that with what we know of Intel, which is not much good, and a lot of bad. (That they are notoriously not-well-paying, because they were riding on their name recognition.)
PyTorch includes some Vulkan compat already (though mostly tested on Android, not on desktop/server platforms), and they're sort of planning to work on OpenCL 3.0 compat, which would in turn lead to broad-based hardware support via Mesa's RustiCL driver.
(They don't advertise this as "support" because they have higher standards for what that term means. PyTorch includes a zillion different "operators" and some of them might still be unimplemented. Besides, performance is still lacking compared to CUDA, ROCm or Metal on leading hardware - so it's only useful for toy models.)
This is a big reason why AMD did this deal with PyTorch...
https://pytorch.org/blog/experience-power-pytorch-2.0/
It's not just Intel. Open initiatives and consortiums (the phase two of the same) are always the losers ganging up, hoping it will give them the leg up they don't have. If you're older you'll have seen this play out over and over in the industry - the history of Unix vs. Windows NT from the 1990s was full of actions like this, networking is going through it again for the nth time (this time with Ultra Ethernet), and so on. OpenGL was probably the most successful approach, barely worked, and didn't help any of the players who were not on the road to victory already. UNIX 95 didn't work, UNIX 98 didn't work, etc.
You're just listing the ones that didn't knock it out of the park.
TCP/IP completely displaced IPX to the point that most people don't even remember what it was. Nobody uses WINS anymore, even Microsoft uses DNS. It's rare to find an operating system that doesn't implement the POSIX API.
The past is littered with the corpses of proprietary technologies displaced by open standards. Because customers don't actually want vendor-locked technology. They tolerate it when it's the only viable alternative, but make the open option good and the proprietary one will be on its way out.
Except that it has barely improved beyond CLI and daemons, still thinks terminals are the only hardware, and everything else that matters isn't part of it - not even more modern networking protocols, which aren't exposed in socket configurations.
Its purpose was to create compatibility between Unix vendors so developers could write software compatible with different flavors. The primary market for Unix vendors is servers, which to this day are still about CLI and daemons, and POSIX systems continue to have dominant market share in that market.
Arguably the dominant APIs in the server space are the cloud APIs not POSIX.
Many of which are also open, like OpenStack or K8s, or have third party implementations, like Ceph implementing the Amazon S3 API.
Also all reimplementations of proprietary technology.
The S3 API is a really good example of the “OSS only becomes dominant when development slows down” principle. As a friend of mine who has had to support a lot of local blob storage says, “On the gates of hell are emblazoned — S3 compatible.”
That's generally where open standards come from. You document an existing technology and then get independent implementations.
Unix was proprietary technology. POSIX is an open standard.
Even when the standard comes at the same time as the first implementation, it's usually because the first implementer wrote the standard -- there has to be one implementation before there are two.
Nah, POSIX on servers is only relevant enough for language runtimes and compilers, which then use their own package managers and cloud APIs for everything else.
Alongside a cloud shell, which yeah, we now have a VT100 running on a browser window.
There is a reason why there are USENIX papers on the loss of POSIX relevance.
Those things are a different level of abstraction. The cloud API is making POSIX system calls under the hood, which would allow the implementation of the cloud API to be ported to different POSIX-compatible systems (if anybody cared to).
The main reason POSIX is less relevant is that everybody is using Linux and the point of POSIX was to create compatibility between all the different versions of proprietary Unix that have since fallen out of use.
Hardly, as many recent TCP/IP features aren't fully exposed by the POSIX socket API, but rather by OS-specific APIs.
Security and asynchronous servers are also hardly implemented with raw POSIX APIs.
POSIX also doesn't have anything to say about hypervisors, containers, kubernetes infrastructure, or unikernels.
Nor does it say anything about infrastructure written in Go, Java, .NET, Rust, C++.
Maybe it's POSIX that is holding new developments back, because people think it's good enough. It's not the '60s anymore. I would have expected totally new paradigms in 2023 if you had asked me 23 years ago. Even the NT kernel seems more modern.
While POSIX was state of the art when it was invented, it shouldn't be today.
Lots of research was thrown in the recycle bin because "Hey, we have POSIX, why reinvent the wheel?", to the point that nobody wants to do operating systems research today, because they don't want their hard work to get thrown into the same recycle bin.
I think that the people who invented POSIX were innovators, and had they lived today, they would have come up with a totally new paradigm, more fit to today's needs and knowledge.
I think OP's point was less that the tech didn't work (e.g. OpenGL was fantastically successful) and that it didn't produce a good outcome for the "loser" companies that supported it.
The point of the open standard is to untether your prospective customers from the incumbent. That means nobody is going to monopolize that technology anymore, but that works out fine when it's not the thing you're trying to sell -- AMD and Intel aren't trying to sell software libraries, they're trying to sell GPUs.
And this strategy regularly works out for companies. It's Commoditize Your Complement.
If you're Intel you support Linux and other open source software so you can sell hardware that competes with vertically integrated vendors like DEC. This has gone very well for Intel -- proprietary RISC server architectures are basically dead, and Linux dominates much of the server market in which case they don't have to share their margins with Microsoft. The main survivor is IBM, which is another company that has embraced open standards. It might have also worked out for Sun but they failed to make competitive hardware, which is not optional.
We see this all over the place. Google's most successful "messaging service" is Gmail, using standard SMTP. It's rare to the point of notability for a modern internet service to use all proprietary networking protocols instead of standard HTTP and TCP and DNS, but many of them are extremely successful.
And some others are barely scraping by, but they exist, which they wouldn't if there wasn't a standard they could use instead of a proprietary system they were locked out of.
That is the point.
But FWIW the incumbent adopts it and dominates anyway. (Though you now are technically "untethered.")
That's assuming the incumbent's advantage isn't rooted in the lock-in.
If ML was suddenly untethered from CUDA, now you're competing on hardware. Intel would still have mediocre GPUs, but AMD's are competitive, and Intel's could be in the near future if they execute competently.
The open standard doesn't automatically give you the win, but it puts you in the ring.
And either of them have the potential to gain an advantage over Nvidia by integrating GPUs with their x86_64 CPUs, e.g. so the CPU and GPU can share memory, avoiding copying over PCIe and giving the CPU direct access to HBM. They could even put a cut down but compatible version of the technology in every commodity PC by default, giving them a huge installed base of hardware that encourages developers to target it.
If the software side no longer mattered, I would expect all three vendors would magically start competing on available RAM. A slower card with double today's RAM would absolutely sell.
Absolutely, SQL analytics people (like me) have been itching for a viable GPU for analytics for years now. The price/performance just isn't there yet because there's such a bias towards high compute and low memory.
Open source generally wins once the state of the art has stopped moving. When a field is still experiencing rapid change closed source solutions generally do better than open source ones. Until we somehow figure out a relatively static set of requirements for running and training LLMs I wouldn’t expect any open source solution to win.
That doesn't really make sense. Pretty much all LLMs are trained in PyTorch, which is open source. LLMs only reached the state they're in now because many academic conferences insisted that paper submissions have open source code attached. So much of the ML/AI ecosystem is open source. Pretty much only CUDA is not open source.
What stops Intel from making their own CUDA and plugging it into PyTorch?
CUDA is huge, and Nvidia spent a ton optimizing it for a lot of "dead end" use cases. There have been experiments with CUDA translation layers with decent performance [1]. There are two things that most projects hit:
1. The CUDA API is huge; I'm sure Intel/AMD will focus on what they need to implement PyTorch and ignore every other use case, ensuring that CUDA always has a leg up in any new frontier.
2. Nvidia actually cares about developer experience. The most prominent example is geohot with tinygrad, where AMD examples didn't even work or had glaring compiler bugs. You will find Nvidia engineers in GitHub issues for CUDA projects. Intel/AMD haven't made that level of investment, and that's important because GPUs tend to be more fickle than CPUs.
[1] https://github.com/vosen/ZLUDA
The same shit as always, patents and copyright.
Windows still uses WINS and NetBIOS when DNS is unavailable.
In a business-class network, even one running Windows, if DNS breaks "everything" is going to break.
You didn't have a choice when it came to protocols for the Internet; it's TCP/IP and DNS or you don't get to play. Everyone was running dual stack to support their LAN and Internet and you had no choice with one of them. So, everything went TCP/IP and reduced overall complexity.
This right here. Until Intel (and/or AMD) get serious about the software side and actually invest the money CUDA isn't going anywhere. Intel will make noises about various initiatives in that direction and then a quarter or two later they'll make big cuts in those divisions. They need to make a multi-year commitment and do some serious hiring (and they'll need to raise their salaries to market rates to do this) if they want to play in the CUDA space.
Literally, every single announcement (and action) coming out of AMD these days is that they are serious about the software. I don't see any reason at this point to doubt them.
The larger issue is that they need to fix the access to their high end GPUs. You can't rent a MI250... or even a MI300x (yet, I'm working on that myself!). But that said, you can't rent an H100 either... there are none available.
<removed>
What _7il4 removed were these two comments:
"AMD is not serious, and neither is Intel for that matter. Their software are piles of proprietary garbage fires. They may say they are serious but literally nothing indicates they are."
"Yes, and ROCm also doesn't work on anything non-AMD. In fact it doesn't even work on all recent AMD gpus. T"
It's not too polite to repost what people removed. Errors on the internet shouldn't haunt people forever.
However, my experience is that the comments about AMD are spot-on, with the exception of the word "proprietary."
Intel hasn't gotten serious yet, and has a good track record in other domains (compilers, numerical libraries, etc.). They've been flailing for a while, but I'm curious if they'll come up with something okay.
Normally, I wouldn't do that, but this time I felt like both comments were intentionally inflammatory and then the context for my response was lost.
"literally nothing" is also wrong given that they just had a large press announcement on Dec 6th (yt as part of my response below), where they spent 2 hours saying (and showing) they are serious.
The second comment was made and then immediately deleted, in a way that was to send me a message directly. It is what irked me enough to post their comments back.
I deleted it because I was in an extremely bad mood and later realized it was simply wrong of me to post it and vent my unrelated frustration in those comments. I think it's in extremely bad taste to repost what I wrote when I made the clear choice to delete it.
You're right. I apologize. I've emailed dang to ask him to remove this whole thread.
(Posting here because I don't have an email address for you.)
I reassigned your comments in this thread to a random user ID, so it's as if you had used a throwaway account to post them and there's no link to your main account. I also updated the reference to your username in another comment. Does that work for you?
If you didn't say it, the rocks would cry out. Personally, I'll believe AMD is maybe serious about Rocm if they make it a year without breaking their Debian repository.
IMO the comment deletion system handles deleting your own comment wrong - it should grey out the comment and strikethrough it, and label it "comment disavowed" or something with the username removed, but it shouldn't actually delete the comment.
Deleting the comment damages the history, and makes the comment chain hard to follow.
I'm not sure history should be sacred. In the world of science, this is important, but otherwise, we didn't used to live in a universe where every embarrassing thing we did in middle school would haunt us in our old age.
I feel bad for the younger generation. Privacy is important, and more so than "the history."
Long ago, under a different account, I emailed dang about removing some comment content that became too identifying only years after the comments in question. He did so and was very gracious about it. dang, you're cool af and you make HN a great place!
You're being very obsequious about having to personally get approval from a moderator to do on your behalf something every other forum lets you do on your own by default.
Plus, dang does a good job.
Not a perfect job -- everyone screws up once in a while -- but this forum works better than most.
As a long-time user of Intel's scientific compiler/accelerator stack, I'm not sure I'd call it a "good track record". Once you get all their libs working they tend to be fairly well optimized, but they're always a huge hassle to install, configure, and distribute. And when I say a hassle to install, I'm talking about hours to run their installers.
They have a track record of taking open projects, adding proprietary extensions, and then requiring those extensions to work with other tools. This sounds fine, but they are very slow to update the base libs, or never do. From version to version they'll muck with deep dependencies; sometimes they'll even ship different rules on different platforms (I dare you to try to statically link OpenMP in a recent version of oneAPI targeting Windows). If you ship a few tools (let's say A and B) that use the same dynamic lib, it's a royal pain to make sure they don't conflict with each other if you update software A but not B. Ranting about consumer junk aside, their cluster-focused tooling on Linux tends to be quite good, especially compared to AMD's.
How’s SYCL proprietary exactly?
ROCm is open source.
and: https://www.youtube.com/watch?v=pVl25BbczLI
I have a bridge to sell.
It is way, way better in the last year or so. Perfectly reasonable cards for inference if you actually understand the stack and know how to use it. Is nVidia faster? Sure, but at twice the price for 20-30% gains. If that makes sense for you, keep paying the tax.
Not just tax, but fighting with centralization on a single provider and subsequent unavailability.
They're having to announce it so much because people are rightly sceptical. Talk is cheap, and their software has sucked for years. Have they given concrete proof of their commitment, e.g. they've spent X dollars or hired Y people to work on it (or big names Z and W)?
Agreed. Time will tell.
MI300x and ROCm 6 and their support of projects like Pytorch, are all good steps in the right direction. HuggingFace now supports ROCm.
AMD made a presentation on their AI software strategy at Microsoft Ignite two weeks ago. Worth a watch for the slides and live demo
https://youtu.be/7jqZBTduhAQ?t=61
Nvidia is probably ten times more scared of this guy https://github.com/ggerganov than Intel or AMD.
Can you expand on this? This is my first time seeing this guy’s work
He’s the main developer of Llama.cpp, which allows you to run a wide range of open-weights models on a wide range of non-NVIDIA processors.
But it's all inference, and most of Nvidia's moat is in training afaik.
People have really bizarre overdramatic misunderstandings of llama.cpp because they used it a few times to cook their laptop. This one really got me giggling though.
I am integrating llama.cpp into my application. I just went through one of their text generation examples line-by-line and converted it into my own class.
This is a leading-edge software library that provides a huge boost for non-Nvidia hardware in terms of inference capability with quantized models.
If you don't understand that, then you have missed an important development in the space of machine learning.
Jeez. Lol.
At length:
- yes, local inference is good. I can't say this strongly enough: llama.cpp is a fraction of a fraction of local inference.
- avoid talking down to people and histrionics. It's a hot field, you're in it, but like all of us always, you're still learning. When faced with a contradiction, check your premises, then share them.
There is an example of training https://github.com/ggerganov/llama.cpp/tree/1f0bccb27929e261...
But that's absolutely false about the Nvidia moat being only training. Llama.cpp makes it far more practical to run inference on a variety of devices, including ones with or without Nvidia hardware.
Really? Because I’m confident Intel knows exactly what it’s about. Have you looked at their contributions to Linux and open source in general? They employ thousands of software developers.
Which is why intel left nvidia and everyone else in the dust with CUDA, which they developed.
Oh, wait…
What exactly do you think Intel was going to write CUDA for? Their GPU products have been on the market less than 2 years, and they're still trying to get their arms wrapped around drivers.
Them understanding what was coming doesn't mean they have a magic wand to instantly have a fully competitive product. You can't write a CUDA competitor until you've gotten the framework laid. The fact they invested so heavily in their GPUs makes it pretty obvious they weren't caught off-guard, but catching up takes time. Sometimes you can't just throw more bodies at the problem...
That’s my point though - they’re catching up, not leading, which rather implies that they absolutely missed a beat, and don’t therefore understand where the market is going before it goes there.
“Nobody gets it’s the software”
If they didn’t get it’s the software, they wouldn’t catch up.
You didn’t say they were late to the party, you said they don’t understand what needs to happen. My point is they understand exactly what needs to happen they just didn’t have the technology to even start to tackle the problem until recently.
Both AMD and Intel (and Qualcomm to some degree) just don't seem to get how you beat NVIDIA.
If they want to grab a piece of NVIDIA's pie, they do NOT need to build something better than an H100 right away. There are a million consumers who are happy with a 4090 or 4080 or even 3080 and would love something that's equally capable at half the price, and moreover actually available for purchase from Amazon/NewEgg/wherever, without a "call for pricing" button. AMD and Intel are much better at making their chips available for purchase than NVIDIA. But that's not enough.
What they DO need to do to take a piece of NVIDIA's pie is to build "intelcc", "amdcc", and "qualcommcc" that accept the EXACT SAME code that people feed to "nvcc" so that it compiles as-is, with not a single function prototype being different, no questions asked, and works on the target hardware. It needs to just be a drop-in replacement for CUDA.
When that is done, recompiling PyTorch and everything else to use other chips will be trivial.
That's not going to work because each GPU has a different internal architecture and aligning the way data is fed with how stream processors operate is different for each architecture (stuff like how memory buffers are organized/aligned/paged etc.). AMD is very incompatible to Nvidia at the lowest level so things/approaches that are fast on Nvidia can be 10x slower on AMD and vice versa.
That's fine but the job of software is to abstract that out. The code should at least compile, even if it is 10x less efficient and if multiple awkward sets of instructions need to be used instead of one.
If Pytorch can be recompiled for AMD overnight with zero effort (only `ln -s amdcc nvcc`, `ln -s /usr/local/cuda /usr/local/amda`) they will gain some footing against Nvidia.
That's what HIP is though: a recompiler for CUDA code. It's not good enough to go from PTX assembly to amdgpu yet.
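Roughly, it works at the source level: hipify rewrites the CUDA runtime calls and you rebuild with hipcc, while the kernel body and the launch syntax stay the same. A hedged sketch (the translations noted in comments are the typical hipify output, not an exhaustive mapping):

```
#include <cuda_runtime.h>   // hipify: <hip/hip_runtime.h>

// The __global__ kernel and the <<<grid, block>>> launch survive unchanged.
__global__ void scale(float* x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

void scale_on_gpu(float* host, int n, float a) {
    float* dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));                 // hipMalloc
    cudaMemcpy(dev, host, n * sizeof(float),
               cudaMemcpyHostToDevice);                  // hipMemcpy
    scale<<<(n + 255) / 256, 256>>>(dev, n, a);
    cudaMemcpy(host, dev, n * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dev);                                       // hipFree
}
```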
That's simple, just do what Nvidia did. Be better than the competition.
These guys need to do ye olde embrace-and-extend maneuver. What Wine did, what the Javas of the world did. A translation or implementation layer for the CUDA driver API and CUDA runtime API that offers compatibility and speed. I see no way around it at this point, for now.
They could do that. That would eliminate Nvidia's monopoly. AMD has made gestures in that direction with HIP. But they ultimately don't want to do that - HIP support is half-assed and inconsistent. AMD creates and abandons a variety of APIs. So the conclusion is the other chip makers whine about Nvidia's monopoly but don't want to end it - they just want to maneuver their way to smaller monopolies of some sort or other.
You're right. At this stage CUDA is de facto what the standard is around. Just like in ISA wars x86 was. Doesn't matter if you have POWER whatever when everything's on the other thing. I get why not though, it would drag the battle onto Nvidia's home turf. At least it would be a battle though.
Seriously, why don't they just dedicate a group to creating the best pytorch backend possible? Proving it there will gain researcher traction and prove that their hardware is worth porting the other stuff over to.
You can't "just" do stuff like this. You need the right guy and big corps have no clue who is capable.
Wasn't Intel the biggest supporter of OpenCV? I don't know of any open source project heavily supported by Nvidia.
Is Mojo trying to solve this?
https://www.youtube.com/watch?v=SEwTjZvy8vw
huh, I didn't even know openvino supported anything but CPUs! TIL
I agree. That’s what MS didn’t understand with cloud and Linux at first.
There is more than just a hardware layer to adoption.
CUDA is a platform, is an ecosystem, is also some sort of attitude. It won't go away. Companies have invested a lot into it.
They still don't get that those who are serious about hardware must make their own software.
Leverage LLMs to port the SW