I love how many python to native/gpu code projects there are now. It's nice to see a lot of competition in the space. An alternative to this one could be Taichi Lang [0] it can use your gpu through Vulkan so you don't have to own Nvidia hardware. Numba [1] is another alternative that's very popular. I'm still waiting on a Python project that compiles to pure C (unlike Cython [2] which is hard to port) so you can write homebrew games or other embedded applications.
I really like how nvidia started doing more normal open source and not locking stuff behind a login to their website. It makes it so much easier now that you can just pip install all the cuda stuff for torch and other libraries without authenticating and downloading from websites and other nonsense. I guess they realized that it was dramatically reducing the engagement with their work. If it’s open source anyway then you should make it as accessible as possible.
I would argue that this isn't "normal open source", though it is indeed not locked behind a login on their website. The license (1) feels very much proprietary, even if the source code is available.
Agreed, especially given this
"2.7 You may not use the Software for the purpose of developing competing products or technologies or assist a third party in such activities."
vs
"California’s public policy provides that every contract that restrains anyone from engaging in a lawful profession, trade, or business of any kind is, to that extent, void, except under limited statutory exceptions."
https://leginfo.legislature.ca.gov/faces/billNavClient.xhtml...
(Owner/Partner who sold business, may voluntarily agree to a noncompete, (which is now federally https://www.ftc.gov/legal-library/browse/rules/noncompete-ru... banned) is the only exception I found).
I'm not a lawyer. Any lawyers around? Could the 2nd provision invalidate the 1st, or not?
You're free to engage in a lawful profession, just not using that Software for it. "to that extent" is not there merely for show.
Hey, that's a real argument, and it makes sense. Thank you for helping to clarify this topic.
Question: why would NVIDIA, makers of general intelligence, which seems to compete with everyone, publish code for software nobody can use without breaking NVIDIA rules? Wouldn't it be better for everyone if they just kept that code private?
ah, just found this license here for another NVIDIA product released today https://developer.download.nvidia.com/licenses/nvidia-open-m... this is way better
“To that extent” in this context means that the remainder of the contract stays valid. The interpretation you state is not only incorrect, it would be toothless to introduce it because people could simply work around it by adding that clause and arguing that it constitutes a sufficient exception under the law.
As for why the “to the extent” phrasing exists, consider an example: an employment contract consists of two clauses, A: that prevents the employee from disclosing confidential customer data to third parties, and B: a non-compete clause (which does come under the same provision mentioned by grandparent). If the employer ever sues an employee for violation of A, they shouldn’t be allowed to argue that they aren’t subject to it because of clause B.
FYI, the FTC noncompete rule does not go into effect until September, and it specifically carves out an exception to the rule for existing noncompetes for senior executives
It being on GitHub doesn't mean it's open-source.
https://github.com/NVIDIA/warp?tab=License-1-ov-file#readme
Looks more "source available" to me.
That's what open-source means. The source code is open for reading. It has nothing to do with licensing. You can have any type of license on top of that based on your business needs.
No. That's not how it works. It's great that they're making source available but if I can't modify and distribute it, it's not open.
That may be your definition, but that's not everyone's definition. Wikipedia, for example, says:
Open-source software (OSS) is computer software that is released under a license in which the copyright holder grants users the rights to use, study, change, and distribute the software and its source code to anyone and for any purpose.
No. The Open Source Initiative maintains the definition, which is accepted internationally by multiple government agencies.
This isn't open source. It's the equivalent of the headers being available for a dylib, just that they happen to be a Python API.
Most of the magic is behind closed source components, and it’s posted with a fairly restrictive license.
And people say nvida doesn't have a moat.
donno. it's clear they have no effective moat. they still try hard, making themselves quite customer-hostile.
CUDA is a moat. What's ineffective about it?
In a similar fashion, you'll see that JAX has frontend code being open-sourced, while device-related code is distributed as binaries. For example, if you're on Google's TPU, you'll see libtpu.so, and on macOS, you'll see pjrt_plugin_metal_1.x.dylib.
The main optimizations (scheduler, vectorizer, etc.) are hidden behind these shared libraries. If open-sourced, they might reveal hints about proprietary algorithms and provide clues to various hardware components, which could potentially be exploited.
Accessible, as long as you purchase their very contested hardware.
there has been a rise of "open"-access and freeware software/services in this space. see hugging face and certain models tied to accounts that accept some eula before downloading model weights, or weird wrapper code by ai library creators which makes it harder to run offline (ultralytics library comes to mind for instance).
i like the value they bring, but the trend is against the existing paradigm of how python ecosystem used to be.
it's not open source and can only be used with nvidia gpus (by license).
I was playing around with taichi a little bit for a project. Taichi lives in a similar space, but has more than an NVIDIA backend. But its development has stalled, so I’m considering switching to warp now.
It’s quite frustrating that there’s seemingly no long-lived framework that allows me to write simple numba-like kernels and try them out in NVIDIA GPUs and Apple GPUs. Even with taichi, the Metal backend was definitely B-tier or lower: Not offering 64 bit ints, and randomly crashing/not compiling stuff.
Here’s hoping that we’ll solve the GPU programming space in the next couple years, but after ~15 years or so of waiting, I’m no longer holding my breath.
I’ve been in love with Taichi for about a year now. Where’s the news source on development being stalled? It seemed like things were moving along at pace last summer and fall at least if I recall correctly.
Yeah this discussion is pretty interesting. I was wondering what was happening with development as well. Taichi is cool tech, but if I'm honest, it seems like they lacked direction on how to monetize it. For example they tried this "Taitopia" thing https://taitopia.design/ (which has already been EOL'ed).
IMO if they had focused from the beginning on ML similar to Mojo, they would be in a better place.
It developed out of a dissertation at MIT (honestly a pretty damn impressive one IMO) and it seems like without some sort of significant support from universities, foundations or corporations, it would be pretty difficult to “monetize” - these in turn require some sort of substantial industry adoption or dogfooding it in some other job/contracts, which is tough I assume.
Ha, interesting timing, last post 6 hours ago. Sounds like they are dogfooding it at least, which is good. And I would agree with the assessment that 1.x is fairly feature-complete, at least from my experience using it (scientific computing). And good to hear that they are planning on pushing support patches for e.g. Python 3.12, CUDA 12, etc.
There have only been 7 commits to master in the last 6 months, half of them purely changes to tests or documentation, so it kind of sounds like you're both right.
the problem with the GPGPU space is that everything except CUDA is so fractally broken that everything eventually converges to the NVIDIA stuff that actually works.
yes, the heterogeneous compute frameworks are largely broken, except for OneAPI, which does work, but only on CUDA. SPIR-V: works best on CUDA. OpenCL: works best on CUDA.
Even once you get past the topline "does it even attempt to support that", you'll find that AMD's runtimes are broken too. Their OpenCL runtime is buggy and has a bunch of paper features which don't work, and a bunch of AMD-specific behavior and bugs that aren't spec-compliant. So basically you have to have an AMD-specific codepath anyway to handle the bugs. Same for SPIR-V: the biggest thing they have working against them is that AMD's Vulkan Compute support is incomplete and buggy too.
https://devtalk.blender.org/t/was-gpu-support-just-outright-...
https://render.otoy.com/forum/viewtopic.php?f=7&t=75411 ("As of right now, the Vulkan drivers on AMD and Intel are not mature enough to compile (much less ship) Octane for Vulkan")
If you are going to all that effort anyway, why are you (a) targeting AMD at all, and (b) why don't you just use CUDA in the first place? So everyone writes more CUDA and nothing gets done. Cue some new whippersnapper who thinks they're gonna cure all AMD's software problems in a month, they bash into the brick wall, write blog post, becomes angry forums commenter, rinse and repeat.
And now you have another abandoned cross-platform project that basically only ever supported NVIDIA anyway.
Intel, bless their heart, is actually trying and their stuff largely does just work, supposedly, although I'm trying to get their linux runtime up and running on a Serpent Canyon NUC with A770m and am having a hell of a time. But supposedly it does work especially on windows (and I may just have to knuckle under and use windows, or put a pcie card in a server pc). But they just don't have the marketshare to make it stick.
AMD is stuck in this perpetual cycle of expecting anyone else but themselves to write the software, and then not even providing enough infrastructure to get people to the starting line, and then surprised-pikachu nothing works, and surprise-pikachu they never get any adoption. Why has nvidia done this!?!? /s
The other big exception is Metal, which both works and has an actual userbase. The reason they have Metal support for cycles and octane is because they contribute the code, that's really what needs to happen (and I think what Intel is doing - there's just a lot of work to come from zero). But of course Metal is apple-only, so really ideally you would have a layer that goes over the top...
yes, the heterogeneous compute frameworks are largely broken, except for OneAPI, which does work, but only on CUDA.
Intel, bless their heart, is actually trying and their stuff largely does just work, supposedly, .... But they just don't have the marketshare to make it stick.
Isn't OneAPI basically SYCL? And there are different SYCL runtimes that run on Intel, Cuda, and Rocm? So what's lost if you use oneAPI instead of Cuda, and run it on Nvidia GPUs? In an ideal world that should work about the same as Cuda, but can also be run on other hardware in the future, but since the world rarely is ideal I would welcome any insight into this. Is writing oneAPI code mainly for use on Nvidia a bad idea? Or how about AdaptiveCpp (previously hipSYCL/Open SYCL)?
I've been meaning to try some GPGPU programming again (I did some OpenCL 10+ years ago), but the landscape is pretty confusing and I agree that it's very tempting to just pick Cuda and get started with something that works, but if at all possible I would prefer something that is open and not locked to one hardware vendor.
OneAPI started as Data Parallel C++, and is SYCL with special Intel sauce on top.
But when it is made by Nvidia, that means it will not work on AMD GPUs. I do not consider this kind of thing a solution, but rather a vendor lock in.
Here’s hoping that we’ll solve the GPU programming space in the next couple years, but after ~15 years or so of waiting, I’m no longer holding my breath.
It feels like the ball is entirely in Apple's court. Well-designed and Open Source GPGPU libraries exist, even ones that Apple has supported in the past. Nvidia supports many of them, either through CUDA or as a native driver.
This is the library I've always wanted. Look at that Julia set. Gorgeous. Thanks for this. I'm sorry to hear about the dev issues. I wish I could help.
While this is really cool, I have to say..
import warp as wp
Can we please not copy this convention over from numpy? In the example script, you use 17 characters to write this just to save 18 characters later on in the script. Just import the warp names you use, or if you really want, plain "import warp", but don't rename imported libraries, please.
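For concreteness, the three options look like this (shown with the standard library's `statistics` module standing in for warp, so the snippet is self-contained):

```python
# Style 1: abbreviated alias -- the numpy-style convention in question.
import statistics as st

# Style 2: plain import, fully qualified names at the call site.
import statistics

# Style 3: import only the names you actually use.
from statistics import mean, median

data = [1.0, 2.0, 3.0, 4.0]

# All three styles reach the same function underneath.
assert st.mean(data) == statistics.mean(data) == mean(data) == 2.5
assert median(data) == 2.5
```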
Strongly agreed! This convention has even infected internal tooling at my company. Scripts end up with tons of cryptic three-letter names. It saves a couple of keystrokes but wastes engineering time to maintain.
The convention is a convention because the libraries are used so commonly. If you give anyone in scientific computing Python world something with “np” or “pd” then they know what that is. Doing something other than what is convention for those libraries wastes more time when people jump into a file because people have to work out now whether “array” is some bespoke type or the NumPy one they’re used to.
There is no way that "warp" is already such a household name that it's common enough to shorten it to "wp". Likewise, the libraries at OP's company are for sure not going to be common to anyone starting out at the company, and might still be confusing to anyone who has worked there for years but just hasn't had to use that specific library.
Pandas and Numpy are popular, sure. As is Tensorflow (often shortened to tf). But where do you draw the line, then? should the openai library be imported as oa? should flask be imported as fk? should requests be imported as rq?
It seems to happen mostly to libraries that are commonly used by one specific audience: scientists who are forced to use a programming language, and who think that 1-letter variables are good variable names, and who prefer using notebooks over scripts with functions.
Don't get me wrong, I'm glad that Python gets so much attention from the scientific community, but I feel that small little annoyances like this creep in because of it, too.
It doesn't really matter
import os as o
import sys as s
The more Python I write, the more I feel that "from foo import name1,name2,nameN" is The Way. Yes it's more verbose. Yes it loses any benefits of namespaces. However it encourages you to focus on your actual problem rather than hypothetical problems you might have someday, and the namespace clashes might have a positive unintended consequence of making you realize you don't actually need that other library after all.
the namespace clashes might have a positive unintended consequence of making you realize you don't actually need that other library after all.
Interesting. That is a good point. However, if I saw someone writing numpy.array() or pandas.read_csv(), my first reaction would be to think they were a beginner.
import warp as np
Now you can re-use your old code as-is!

This math is not adding up for me... isn't `import warp` necessary anyway? So it's only 6 more characters to write `as wp`. And anyway, to me the savings in cognitive load later, when I'm in the flow of coding, are worth it.
Gotta keep digging that CUDA moat as hard and as fast as possible.
Exactly. and that's why it's valued at $3T++, about 10x of AMD and Intel put together.
you mean because that's how you get to be a meme stock? yep. stock markets are casinos filled with know-nothing high-rollers and pension sheep.
How many meme stocks are TSMC customers?
doing business is hard these days. you can't just be a rich a*hole while working with people. you have to care about your image, and this is one of the ways of doing it: handing out free stuff, sort of selfless donations. but in fact this raises the bar and makes competitors' lives much harder. of course you can be rich and narrow-minded, like intel, but then it's hard to attract external developers and make them believe in your future. nvidia's stock rise is based on the vision; investors believe in it. meanwhile the other giants are dominated by career managers who know the procedures but are absolutely blind when it comes to technology evaluation. if someone comes to them with a great idea, they first evaluate how it fits into their plans. sometimes they have their own primitive vision, like at facebook, which proved to be... not that good. so all this sort of manager can do is look at what is _already_ successful and try to replicate it by throwing a lot of money at it. and that may not be enough. like intel, which still lags behind in GPUs.
Thats the part I don't get. When you are developing AI, how much code are you really running on GPUs? How bad would it be to write it for something else if you could get 10% more compute per dollar?
Those 10% are going to matter when you have an established business case and you can start optimizing. The AI space is not like that at all, nobody cares about losing money 10% faster. You can not risk a 100% slowdown running into issues with an exotic platform for a 10% speedup.
Why Python? I really don't understand this choice of language other than accessibility.
to me it goes beyond that. many leetcode grinders swear by specific data structures such as hashmaps, which python makes available as dictionaries.
behind the syntax, there is plenty of heavy lifting available for writing sophisticated code when need be. that surely helps with the network effect.
Huge ecosystem: numpy, pandas, matplotlib et al. for data science; pytorch, tensorflow, jax for ML; gradio, rerun for visualization; opencv, open3d for image/point cloud processing; pyside for GUIs; and others.
I think you answered your own question there. Python is very accessible, very popular, and already widely used for GPU-based things like machine learning.
Because that's where the vast majority of DS/ML is already, and they are too busy to learn something else.
Because accessibility.
Aren't warps already architectural elements of nvidia graphics cards? This name collision is going to muddy search results.
Aren't warps already architectural elements of nvidia graphics cards?
Architectural elements of _all_ graphics cards.
Unsure of how authoritative this is, but this article[0] seems to imply it's a matter of branding.
The efficiency of executing threads in groups, which is known as warps in NVIDIA and wavefronts in AMD, is crucial for maximizing core utilization.
[0] https://www.xda-developers.com/how-does-a-graphics-card-actu...
ROCm also refers to them as warps https://rocm.docs.amd.com/projects/HIP/en/latest/understand/... :
The threads are executed in groupings called warps. The amount of threads making up a warp is architecture dependent. On AMD GPUs the warp size is commonly 64 threads, except in RDNA architectures which can utilize a warp size of 32 or 64 respectively. The warp size of supported AMD GPUs is listed in the Accelerator and GPU hardware specifications. NVIDIA GPUs have a warp size of 32.
It actually kinda makes some sense when you realize that "warp" is a reference to warp threads in actual weaving: https://en.wikipedia.org/wiki/Warp_and_weft.
There is also already WARP in the graphics world:
https://learn.microsoft.com/en-us/windows/win32/direct3darti...
It's basically the software implementation of DirectX.
Warp is designed for spatial computing
What does this mean? I've mainly heard the term "spatial computing" in the context of the Vision Pro release. It doesn't seem like this was intended for AR/VR
As someone not in this space, I was immediately tripped up by this as well. Does spatial computing mean something else in this context?
main use case seems to be simulations in 2D, 3D or nD spaces. spaces -> spatial.
I really wish Python would stop being the go-to language for GPU orchestration and machine learning; having worked with it again recently for some proofs of concept, it's been a massive pain in the ass.
We should have by now a new language for AI systems, not just frameworks
seeing as every big corp chooses it for their libraries, I'd say it's a skill issue
GPU support requires a CUDA-capable NVIDIA GPU and driver (minimum GeForce GTX 9xx).
Very tactful from nvidia. I have a lovely AMD gpu and this library is worthless for it.
Err, it is nvidia. Why would they support AMD?
I've dredged through Julia, Numba, JAX, and Futhark, looking for a way to have good CPU performance in the absence of a GPU, and I'm not really happy with any of them. Especially given how many of them want you to lug LLVM along.
A recent simulation code, when pushed with gcc's openmp-simd, matched performance on a 13900K vs jax.jit on an RTX 4090. This case worked because the overall computation can be structured into pieces that fit in L1/L2 cache, but I had to spend a ton of time writing the C code, whereas jax.jit was too easy.
So I'd still like to see something like this but which really works for CPU as well.
Agreed, JAX is specialized for GPU computation -- I'd really like similar capabilities with more permissive constructs, maybe even co-effect tagging of pieces of code (which part goes on GPU, which part goes on CPU), etc.
I've thought about extending JAX with custom primitives and a custom lowering process to support constructs which work on CPU (but don't work on GPU) -- but if I did that, and wanted a nice programmable substrate -- I'd need to define my own version of abstract tracing (because necessarily, permissive CPU constructs might imply array type permissiveness like dynamic shapes, etc).
You start heading towards something that looks like Julia -- the problem (for my work) with Julia is that it doesn't support composable transformations like JAX does.
Julia + JAX might be offered as a solution -- but it's quite unsatisfying to me.
funny that nowadays some software is hardware-dependent
OpenCL seems like it's just obsolete
OpenCL has been obsolete for years, as Intel, AMD and Google never provided a proper development experience with good drivers.
The fact that OpenCL 3.0 is basically OpenCL 1.0 rebranded, as an acknowledgement of OpenCL 2.0's adoption failure, doesn't help either.
As someone who is not in the simulation and graphic space, what does this library bring that current libraries do not?
It overlaps a lot with the library Taichi, which Disney supports.
It's noteworthy that Taichi also supports AMD, MPI, and Kokkos.
Can anyone comment on how efficient the Warp code is compared to manually written / fine-tuned CUDA?
How is this different than taichi? Even the decorators look similar.
Does this compete at all with openAI's triton (which is sort of a higher level cuda without the vendor lock in)?
What's Taichi's take on NVIDIA's Warp?
Overall the biggest distinction as of now is that Taichi operates at a slightly higher level. E.g. implict loop parallelization, high level spatial data structures, direct interops with torch, etc.
We are trying to implement support for lower level programming styles to accommodate such things as native intrinsics, but we do think of those as more advanced optimization techniques, and at the same time we strive for easier entry and usage for beginners or people not so used to CUDA's programming model
How is this different than Triton?
Slightly related
What's this community's take on Triton? https://openai.com/index/triton/
Are there better alternatives?
This should be seen in light of the Great Differentiable Convergence™:
NeRFs backpropagating pixel colors into the volume, but also semantic information from the image label, embedded from an LLM reading a multimedia document.
Or something like this. Anyway, wanna buy an NVIDIA GPU ;)?
In case you haven't tried it yet, Pythran is an interesting one to play with: https://pythran.readthedocs.io
Also, not compiling to C but to native code still would be Mojo: https://www.modular.com/max/mojo
Does it really matter for performance? I see Python in these kinds of setups as an orchestrator of computing APIs/engines. For example, from Python you instruct them to compute the following list, etc. No hard computing happens in Python, so performance is not so much of an issue.
Marshaling is an issue as well as concurrency.
Simply copying a chunk of data between two libraries through Python is already painful. There is a so-called "buffer API" in Python, but it's very rare that Python users can actually take advantage of this feature. If anything in Python so much as looks at the data, that's not going to work, etc.
Similarly, concurrency. A lot of native libraries for Python are written with the expectation that nothing in Python really runs concurrently. And then you are presented with two bad options: try running in different threads (so that you don't have to copy data), but things will probably break because of races, or run in different processes, and spend most of the time copying data between them. Your interface to stuff like MPI is, again, only at the native level, or you will copy so much that the benefits of distributed computation might not outweigh the downsides of copying.
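To illustrate the zero-copy case that rarely gets used in practice: the buffer protocol does let two consumers share one block of memory, as in this stdlib-only sketch:

```python
import array

# array.array stores machine-typed values in one contiguous C buffer.
buf = array.array("d", [1.0, 2.0, 3.0, 4.0])

# memoryview accesses that buffer via the buffer protocol -- no copy.
view = memoryview(buf)

# Slicing a memoryview still doesn't copy the underlying data...
half = view[:2]
half[0] = 42.0

# ...so the write is visible through the original array.
assert buf[0] == 42.0
```

The moment pure-Python code needs to touch individual elements, though, every value gets boxed back into a Python object, which is exactly the pain described above.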
Do you think in a decade or so, most popular Python dependencies will work well enough no-GIL for multithreading to be a bit less terrible?
there’s already a PEP to address the GIL. python is going through its JVM optimization phase. it’s too popular and ubiquitous that improving the GIL and things like it are inevitable
I think we will get there in the end but it will be slow.
When I was doing performance stuff: Intel was our main platform and memory consistency was stronger there. We would try to write platform-agnostic multi-threading code (in our case, typically spin-locks), but without testing properly on the other platforms we would make mistakes and end up with race conditions, accessing unsynchronized data, etc.
I think Python will be the same deal. With Python having been GIL'd through most of its life cycle, bits and pieces won't work properly multi-threaded until we fix them.
Python is a bad joke that went too far.
wow, someone thinks python is bad? what year is it? 2003?
I believe it matters for startup time and memory usage. Once you've fully initialized the library and set it off, the entire operation happens without the Python interpreter's involvement, but that initial setup can still be important sometimes.
nuitka already does this
I would rather that Python catches up with Common Lisp tooling in JIT/AOT in the box, instead of compilation via C.
I'd kill for AOT compiled Python. 3.13 ships with a basic JIT compiler.
mypyc can compile a strictly-typed subset of Python AOT to native code, and as a bonus it can still interop with native Python libraries whose code wasn't compiled. It's slightly difficult to set up but I've used it in the past and it is a decent speedup. (plus mypy's strict type checking is sooo good)
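A minimal sketch of what such a module might look like (the filename and function are made up); it runs unchanged as plain Python, and running `mypyc fib.py` would compile it to a C extension:

```python
def fib(n: int) -> int:
    # Iterative Fibonacci. Under mypyc, the annotated ints become
    # unboxed native integers in the generated C, which is where
    # most of the speedup comes from.
    a: int = 0
    b: int = 1
    for _ in range(n):
        a, b = b, a + b
    return a

assert fib(10) == 55
```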
Why is it named after a type checking library?
Because it uses that library for type checking?
In 3.13 you need to compile Python yourself if you want to test the preview JIT.
Python is designed from the ground up to be hostile to efficient compiling. It has really poorly designed internals, and proudly exposes them to application code, in a documented way and everything.
Only a restricted subset of Python is efficiently compilable.
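One concrete example of those exposed internals: builtins are looked up at call time and can be rebound, so a compiler can't even inline a call to `len` without guarding against this:

```python
import builtins

def measure(x):
    # `len` is resolved globals -> builtins on every call,
    # not frozen at definition time.
    return len(x)

assert measure([1, 2, 3]) == 3

# Rebinding the builtin changes the behavior of an already-defined
# function -- legal, documented Python.
original_len = builtins.len
builtins.len = lambda x: 99
try:
    assert measure([1, 2, 3]) == 99
finally:
    builtins.len = original_len
```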
I would like to warn people away from taichi if possible. At least back in 1.7.0 there were some bugs in the code that made it very difficult to work with.
Do you have any more specifics about these limitations? I'm considering trying Taichi for a project because it seems to be GPU vendor agnostic (unlike CuPy).
I only dabbled in Taichi, but I find its magic has limitations. I took a provided example, just increased the length of the loop, and bam! it crashed the Windows driver. Obviously it ran out of memory, but I have no idea how to adjust except by experimenting with different values. Since it has information about the GPU and its memory, I thought it could automatically adjust the block size, but apparently not. There is a config command to fine-tune the for-loop parallelization, but the docs say we normally do not need to use it.
CuPy is also great – makes it trivial to port existing numerical code from NumPy/SciPy to CUDA, or to write code than can run either on CPU or on GPU.
I recently saw a 2-3 orders of magnitude speed-up of some physics code when I got a mid-range nVidia card and replaced a few NumPy and SciPy calls with CuPy.
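The port really can be as small as swapping which module you import. A hedged sketch (it falls back to NumPy, so the same lines also run without CuPy or a CUDA device):

```python
# Pick the array module at import time: CuPy if a CUDA device is
# usable, otherwise plain NumPy with the same API.
try:
    import cupy as xp
    xp.zeros(1)  # fail early if no CUDA device is present
except Exception:
    import numpy as xp

a = xp.linspace(0.0, 1.0, 1_000_000)
result = float((a * a).sum())  # ~ 1e6 * integral of x^2 over [0, 1]
```

The `float(...)` at the end forces the device-to-host copy; keeping intermediates on the GPU between calls is where the orders-of-magnitude wins come from.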
Don’t forget JAX! It’s my preferred library for “i want to write numpy but want it to run on gpu/tpu with auto diff etc”
From https://news.ycombinator.com/item?id=37686351 :
sympy#20516: "re-implementation of torch-lambdify" https://github.com/sympy/sympy/pull/20516
I'm not looking for an argument, but my knee jerk reaction to seeing 4 or 5 different answers to the question of getting python to C... Why not just learn C?
The python already exists. These efforts enable increasing performance without having to rewrite in a very different language.
I have dabbled in Cython, C and Rust via PyO3.
C is much cleaner and portable. Easy to use in Python directly.
https://nuitka.net/ ?
I’m a huge Taichi stan. So much easier and more elegant than numba. The support for data classes and data_oriented classes is excellent. Being able to define your own memory layouts is extremely cool. Great documentation. Really really recommend!