Great post. The ethernet section is especially interesting to me.
I'm building a cluster of 16x Dell XE9680's (128 AMD MI300x GPUs) [0], with 8x dual-port 200G Broadcom cards (running at 400G), all connected to a single Dell PowerSwitch Z9864F-ON, which should prevent any slowness. It will be connected over RoCEv2 [1].
We're going with ethernet because we believe in open standards, and few talk about the fact that the lead time on IB was last quoted to me at 50+ weeks. As kind of mentioned in the article, if you can't even deploy a cluster the speed of the network means less and less.
I can't wait to do some benchmarking on the system to see if we run into similar issues or not. Thankfully, we have a great Dell partnership, with full support, so I believe that we are well covered in terms of any potential issues.
Our datacenter is 100% green and low PUE and we are very proud of that as well. Hope to announce which one soon.
[0] https://hotaisle.xyz/compute/
[1] https://hotaisle.xyz/networking/
Cool, where can I read more about this? How do you power your DC?
Plenty of datacenters in the Pacific NW can claim to be "Green" because their power supply is entirely hydroelectric.
https://www.nwd.usace.army.mil/CRWM/CR-Dams/
Many of those areas also happen to have the lowest $-per-kWh electricity in North America; the only lower rates are available near a few hydroelectric dams in Quebec.
Why is green in scare-quotes?
People make the argument that if a giant datacenter is consuming 50% of some local hydro installation, everyone else in town is buying something else that is less green.
It opens up questions about grids and market efficiency, so your mileage may vary.
I don't think that's a cogent argument. It's akin to saying a vegan commune in a small town is buying up 50% of the vegan food, "forcing" others to buy meat products, and framing this to cast doubt on whether they are truly vegan. Consumers aren't in a position to solve supply problems.
Hydro power is a great thing; it was the first renewable energy that was available in meaningful quantities. However, great sites for hydro power are definitely limited. We will not suddenly find a great spot for a new huge dam. Imagine the only source of vegan B12 were some obscure plant that can only be grown on a tiny island. In this scenario, the possible extent of vegan consumption is fixed.
In the regions where it works (PNW, Quebec, etc) we could easily build more. The hurdles are regulatory. The regulation isn’t baseless - a dam will affect the local ecosystem adversely. But that’s a tradeoff we choose rather than a fundamental limitation.
The other tradeoff correlates with the amount of energy stored: the potential for catastrophic disaster in case of failure. As a society, living below a dam is not risk-free in the long run.
You've just painted a clearer picture than I did: the crux is that it is a supply-side problem.
That's a terrible analogy. Maybe if a biofuel plant used 50% of the vegan food to produce “green fuel”.
There's also discussion of environmental damage from damming rivers.
As anyone who has driven through the Columbia River Basin can tell you wind power is also abundant in Washington. The grid is very clean here but it’s certainly not purely hydro.
Previously, I had multiple data centers in Quincy, WA. Those were hydro green. It is an area that hosts a whole multitude of big hyperscaler companies.
I have no idea how they are powering it, but with the speed with which solar and battery prices are falling, and the slowness of getting a new big grid interconnection, I would not be surprised to see new data centers that are primarily powered by their own solar+batteries. Perhaps with a small, fast and cheap grid connection for small bits of backup.
If not this year, definitely in the 2030s.
Edit: for a much smaller scale version of this, here's a titanium plant doing this, instead of a data center. The nice thing about renewables is that they easily scale; if you can do it for 50MW you can do it for 500MW or 5GW with a linear increase in the resources. https://www.canarymedia.com/articles/clean-industry/in-a-fir...
For hundreds of MW?
https://www.visualcapitalist.com/cp/top-data-center-markets/
You can do that but they are the size of a mountain (literally: https://www.swissinfo.ch/eng/sci-tech/inside-switzerland-s-g... this is 900MW)
Why not? A typical value for land is 10 acres/MW, so 15-30 sq mi for 1-2 GW, which will average out to hundreds of megawatts over the course of 24 hours, even on cloudy days.
Including land costs, solar is the cheapest source of energy, at <$1/W, which is a tiny fraction of the cost of the rest of the data center equipment, and has a 30-50 year lifetime. For less than $5B you could have hundreds of megawatts of continuous solar power backed by 24+ hours (5+GWh) of batteries. Hydro storage really can't compete with batteries for this sort of application, at least for new storage capacity. Existing hydro certainly is great, just building new stuff is hard.
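For concreteness, here is the napkin math behind those figures as a quick sketch; the ~25% capacity factor is my own assumption, the rest are the numbers quoted above:

```python
# Back-of-envelope check of the figures above (all inputs are rough assumptions).
ACRES_PER_MW = 10          # typical land use for utility-scale solar
ACRES_PER_SQ_MI = 640
CAPACITY_FACTOR = 0.25     # assumed annual average; varies a lot by site
COST_PER_W = 1.0           # dollars per peak watt, including land (upper bound)

for peak_gw in (1, 2):
    peak_mw = peak_gw * 1000
    land_sq_mi = peak_mw * ACRES_PER_MW / ACRES_PER_SQ_MI
    avg_mw = peak_mw * CAPACITY_FACTOR          # average output over 24 hours
    cost_billion = peak_mw * 1e6 * COST_PER_W / 1e9
    print(f"{peak_gw} GW peak: ~{land_sq_mi:.0f} sq mi, "
          f"~{avg_mw:.0f} MW average, ~${cost_billion:.1f}B")
# 1 GW peak: ~16 sq mi, ~250 MW average, ~$1.0B
# 2 GW peak: ~31 sq mi, ~500 MW average, ~$2.0B
```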
And GW-scale solar installations are fairly commonplace. Far easier to procure than a matching number of H100s.
Maybe it's possible, I haven't seen it done yet. I guess there were better alternatives.
Well, I can't share numbers about datacenter MW sizes... the fact that I misread some of those numbers as per-datacenter MW is telling :0
In any case, Meta (not my employer) has 24k GPU clusters. In the densest (and less power-hungry) setup, Nvidia superpods have 4x DGX per rack, 8 GPUs per DGX (hyperscalers use HGX to build their stuff, but it's the same), and each rack uses ~40kW. That's 750 racks and 30MW of just ML; you need to add some 10-20% for supporting compute & storage and other DC infrastructure (cooling, etc.).
24k GPUs is likely one building, or even just one floor. Meta will likely have multiple clusters like that in the same POP.
That's in the ballpark of 100+MW per datacenter, as the starting point.
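Spelled out, the rack arithmetic above as a small sketch (rough figures from the comment; the 20% overhead is an assumed midpoint):

```python
# Rough per-cluster power math from the comment above (figures are approximate).
gpus = 24_000
gpus_per_rack = 4 * 8          # 4x DGX/HGX per rack, 8 GPUs each
kw_per_rack = 40

racks = gpus / gpus_per_rack                   # 750 racks
ml_mw = racks * kw_per_rack / 1000             # ~30 MW of ML alone
total_mw = ml_mw * 1.2                         # +20% for storage, cooling, etc. (assumed)
print(f"{racks:.0f} racks, ~{ml_mw:.0f} MW of ML, ~{total_mw:.0f} MW with overhead")
# A few clusters like this in one POP puts the site in the 100+ MW range.
```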
Oh I don't dispute your hundreds of MW at all. Other freely available information definitely supports that for the hyperscalers. Browsing recent construction projects, I see 250,000 sq ft projects from Apple, and 700,000+ sq ft from the hyperscalers, and typically consumption is 150-300W/sq ft, with the hyper-dense DGX systems at that or above.
There are lots of large-scale renewable power projects out there waiting to get onto the grid, stuck in long interconnection queues; more than 1TW of projects, last I heard. There are also lots of data centers wanting to get big power connections, enough that utilities are able to scare their regulatory bodies into making bad short-term decisions to try to support the new load centers.
Connecting the builders of these large projects directly to the new demand, and going outside the slow, corrupt, and inept utilities would solve a lot of problems. And you could still eventually get that big interconnection to the grid installed, and in the interim 3-5 years, power the data center mostly off-grid. Because that massive battery plus solar resource would eventually be a massive grid asset that could benefit everyone too, if the utilities weren't so slow.
I don't get how this would make sense.
Assuming no hydroelectric: at night with no wind, you'd need to draw the same hundreds of MW from batteries (or use gas/coal/nuclear, which defeats the purpose of ALSO using renewables on top of that 100 percent backup capacity).
Batteries with that capacity are still extremely expensive (and massive in volume), which would essentially mean your energy price is 5 to 10x higher (ballpark rough estimate) than non-renewable continuous sources.
Not to mention the huge amount of land needed/wasted (which costs money), the recycling problem (solar panels are not actually sustainable in the sense that they'll last forever or can be recycled), and so on.
The only company in the world for which I could see that setup MAYBE make sense business-wise is Tesla/xAI: they could relatively quickly and cheaply roll-out massive battery storage, data center and solar (for example). If only to be slightly faster and bigger at rolling out than their competitors it could make sense from a business perspective. But that's only because they can produce massive battery capacity at the lowest possible cost and quickest turn-around.
Maybe I'm missing something.
I think your talking points are mostly out of date or incorrect.
First, batteries are cheap today and being installed on grids at fantastic scales all the time. I suggested 5GWh of batteries above, which at that scale could probably be delivered at $300/kWh installed in a 2024 project. (Back in 2022, that figure was $481/kWh, and price competition has been brutal since then; see figure 3 [1].) At 6000 cycles of lifetime, the cost of delivering a stored kWh is only $0.05, less than typical transmission and distribution charges.
Second, solar is cheap, and that includes the land costs, at $0.04/kWh unsubsidized (slide 31 [2]). Land producing valuable electricity is not wasted; it's the exact opposite of waste. Solar is recyclable, as a simple web search will show. Further, to say that solar is somehow "not sustainable" is just bad propaganda on a massive scale.
There are 571GW of solar+battery projects seeking connection on the grid. [3] Very few of those projects are planning on using Tesla's storage. Now, all of those projects are going to have vastly smaller batteries, but scaling up the battery to cover 24-hour power is an easy design change, especially if it helps the project start generating money years faster than it would if it had to wait for an interconnection. A new data center could partner on site as the off-taker for one of these hundreds of proposed projects, if the site is close enough to the resources a data center needs. NC would be a likely site.
Well, I do not know if I have convinced you it's a good idea, but I have definitely gained a lot of conviction for myself that it's a fantastic business idea for both the power project and the data center... now if only I had serious skin in the game on either side so that I could benefit from it!
[1] https://www.nrel.gov/docs/fy23osti/85332.pdf
[2] https://emp.lbl.gov/sites/default/files/utility_scale_solar_...
[3] https://energyanalysis.lbl.gov/publications/queued-2024-edit...
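For what it's worth, the battery math above checks out on a napkin. A minimal sketch, using only the figures quoted (and ignoring round-trip losses and financing costs):

```python
# Sanity check of the battery figures cited above (inputs are the quoted rough numbers).
battery_gwh = 5
installed_cost_per_kwh = 300      # dollars, assumed 2024 large-project pricing
cycle_life = 6_000                # full charge/discharge cycles over the battery's life

capex_billion = battery_gwh * 1e6 * installed_cost_per_kwh / 1e9
cost_per_stored_kwh = installed_cost_per_kwh / cycle_life

print(f"{battery_gwh} GWh at ${installed_cost_per_kwh}/kWh ≈ ${capex_billion:.1f}B capex")
print(f"storage cost ≈ ${cost_per_stored_kwh:.2f} per delivered kWh")
# 5 GWh at $300/kWh ≈ $1.5B capex
# storage cost ≈ $0.05 per delivered kWh
```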
I don't know what the power consumption of an individual data center is, but your link talks about the consumption of all datacenters in individual states/countries.
A couple hundred megawatts is what I would expect from a data center, and what I had planned out before commenting, so that's on the mark. An H100 is very roughly a kW after adding in all the rest of the overhead, so a 200MW data center would have 200,000 H100s. That's a massive cluster, but not inconceivable.
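The same per-GPU arithmetic, as a quick sketch (the ~1kW all-in figure is the rough estimate above, not a measured number):

```python
# Rough check: how many accelerators fit in a 200 MW power budget
# if each one costs ~1 kW all-in (GPU + host + network + cooling overhead)?
facility_mw = 200
kw_per_gpu_all_in = 1.0   # assumed; an H100 SXM alone is ~0.7 kW TDP

gpus = facility_mw * 1000 / kw_per_gpu_all_in
print(f"~{gpus:,.0f} GPUs")   # ~200,000
```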
I don't think you could cover a data center with enough solar panels to power it
They can be to the side of the data center, too.
We haven't announced the dc yet, but will soon. Very well known. I'm actually pretty excited about it.
X?
waiting to see it posted on HN!
If you’re OK with used equipment, I find used Infiniband basically on par price-wise with used Ethernet gear. Mellanox stuff is super easy to test and you can run everything with a mainline kernel. MQM8700/MQM8790 switches can be had really cheap now used. Cables are dumb expensive... except a dime a dozen used. Or buy new, 100% compatible, from FS or the like. ConnectX-6 cards are quite cheap used thanks to HFT firms constantly upgrading to the latest, and many can be set to Infiniband OR Ethernet (I _think_ some of the dual-port cards support simultaneous use). I set up a cluster that right now has like 70 of these, at least half of them used. Have not had an issue with any of them yet (I botched a couple of firmware updates, but every time I was able to just reset the card and push firmware again). Every machine is connected to the IB fabric AND to at least 100G Ethernet.
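Not something the parent described doing, but for anyone curious, a rough sketch of flipping a ConnectX port between Infiniband and Ethernet, assuming the NVIDIA/Mellanox Firmware Tools (mst/mlxconfig) are installed; the device path is a placeholder, and the change only takes effect after a reboot or NIC reset:

```python
# Hypothetical sketch: toggle a ConnectX port between InfiniBand and Ethernet
# via mlxconfig (NVIDIA/Mellanox Firmware Tools). Device path is an example only.
import subprocess

DEVICE = "/dev/mst/mt4123_pciconf0"   # example path; list yours with `mst status`
LINK_TYPE = {"ib": "1", "eth": "2"}   # mlxconfig LINK_TYPE_P1 values

def set_port1_link_type(mode: str) -> None:
    # -y answers yes to the confirmation prompt; the new link type is queued
    # in firmware and applied on the next reboot/reset.
    subprocess.run(
        ["mlxconfig", "-y", "-d", DEVICE, "set", f"LINK_TYPE_P1={LINK_TYPE[mode]}"],
        check=True,
    )
    print("Link type queued; reboot or reset the NIC for it to take effect.")

if __name__ == "__main__":
    set_port1_link_type("eth")
```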
FWIW, used 100G Ethernet equipment is now cheap enough I’ve been upgrading my home network to be 100G. Cheaper than new consumer 10G equipment.
I'm building a business, not a home lab.
(before you continue to downvote me, read what I wrote below)
So? The difference in cost can be as much as 10x. If you're building a startup, that matters.
What matters is support contracts and uptime.
If you have a dozen customers on a server that cannot access things because of an issue, then as a startup, without a whole customer support department, you're literally screwed.
I've been on HN long enough to have seen plenty of companies get complaints after growing too quickly and not being able to handle the issues they run into.
I'm building this business in a way to de-risk things as much as possible: from getting the best equipment I can buy today, to support contracts, to the best data center, to just scaling with revenue growth. This isn't a cost issue, it is a long-term viability issue.
Home lab... certainly cut as many corners as you want. Cloud service provider building top supercomputers for rent... not so much. There is a reason why not a lot of people start to do this... it is extremely capital intensive. That is a huge moat, and getting the relationships and funding to do what I'm doing isn't easy; it took me over 5 years to get to this point of just getting started. I'm not going to blow it all by cutting corners on some used equipment.
I'm building this business in a way to de-risk things as much as possible
Then why did you go with AMD and not Nvidia? Are you not interested in AI/ML customers?
In my eyes, it is less risky with AMD. When you're rooting for the underdog, they have every incentive to help you. This isn't a battle to "win" all of AI or have one beat the other, I just need to create a nice profitable business that solves customer needs.
If I go with Nvidia, then I'm just another one of the 500 other companies doing exactly the same thing.
I'm a firm believer that there should not be a single company that controls all of the compute for AI. It would be like having Cisco be the only company that provides routers for the internet.
Additionally, we are not just AMD. We will run any compute that our customers want us to deploy for them. We are the capex/opex for businesses that don't want to put up the millions, or figure out and deal with all the domain specific details of deploying this level of compute. The only criteria we have is that it is the best-in-class available today for each accelerator. For example, I wouldn't deploy H100's because they are essentially old tech now.
Read these blog posts and tell me why you'd ask that question...
https://chipsandcheese.com/2024/06/25/testing-amds-giant-mi3...
https://www.nscale.com/blog/nscale-benchmarks-amd-mi300x-gpu...
OK, I just looked at the first blog post: “ROCm is nowhere near where it needs to be to truly compete with CUDA.”
That’s all I need to know as an AI/ML customer.
The latest generation generally isn't available used in much quantity. Used 100G equipment is cheap because it's almost ten years old.
Depends on your use case. For AI it’s not much good. Other applications don’t require as much internode bandwidth.
What do you mean not much good for AI? I guess it depends; 10-year-old equipment is not great, but plenty of 7-year-old stuff is excellent. SN2700 switches, for example, are EOL, but that doesn’t matter because you can run mainline Linux flawlessly if you enable switchdev (I’ve managed to run FreeBSD too). CX5 and CX6 cards are used everywhere still. I don’t have much experience with Broadcom gear, but I hear there are good options there too, though they tend to require more work to get the stack set up on Linux.
So am I, but that makes total sense in your case, providing cloud compute. Doing that without contracts and warranties sounds like a nightmare (I haven’t downvoted you at all; I’ve seen your comments in a few threads and am really interested in what you’re doing). Best of luck, especially on the AMD side. I see a lot of people being skeptical about that, but I think we are very close to them being competitive with Nvidia in this space. I’m pretty much entirely Nvidia at the moment, but I’d love to hack on some MI300X whenever I can get access.
Can you share some vendors and model names? I've been looking into upgrading my home lab from 10g to 100G, but the costs were absolutely astronomical.
I'm likely running a personal lab similar to many folks on HN - about 20-30 wired servers and a small rack of managed Unifi switches.
Appreciate you!
https://docs.nvidia.com/dgx-superpod/reference-architecture-...
NVIDIA's large GPU supercomputers have separate compute networking (between GPUs) and storage networking (storage to GPUs, or storage to SSD, SSD to GPUs with CPU assistance). This helps avoid networking issues, even more so if not using Infiniband.
From what I read here and on your website, you don't go that route. I haven't found the equivalent system level reference architecture for MI300x from AMD. I wonder if you have a link to a public document where AMD provides guidance about this choice?
It's interesting that the above "HPC reference architecture" shows a GPU-to-GPU Infiniband fabric, despite Nvidia also nominally pushing NVLink Switch (https://www.nvidia.com/en-us/data-center/nvlink/) for the HPC use-case.
How does NVLink work? Because I already know MPI and I’m not going to learn anything else, lol.
Edit: after googling it looks like OpenMPI has some NVLink support, so maybe it is OK.
There is "CUDA-aware MPI" which would let you RDMA from device to device. But the more modern way would be MPI for the host communication and their own library NCCL for the device communication. NCCL has similar collective functions a MPI but runs on the device which makes it much more efficient to integrate in the flow of your kernels. But you would still generally bootstrap your processes and data through MPI.
If you use UCX, it does all that automatically without you choosing.
The naming games on the libraries are rather entertaining... NCCL, RCCL... and oneAPI (oneCCL).
I use OpenMPI with no issues over multiple H100 nodes and A100 nodes, with multiple Infiniband 200G and Ethernet 100G/200G networks, and RDMA (though using Mellanox instead of Broadcom cards, but afaik Broadcom supports this just the same). Side note: make sure you compile nvidia_peermem correctly if you want GDRDMA to work :)
We have a separate OOB/east-west network which is 100G and would be used for external storage. We're spending an absurd amount of money on just cables.
It is documented on the website [0], but I do see that I did not document the actual cards for that, will add when I wake up tomorrow. The card is:
Broadcom 57504 Quad Port 10/25GbE, SFP28, OCP NIC 3.0
As far as I know, AMD doesn't really have the docs, it is Dell. Their team actively helped us design this whole cluster.
We haven't decided on which type storage we want to get yet. It'll really depend on customer demand and since we haven't deployed quite yet, we are punting that can down the road a bit. Our boxes do all have 122TB in them and we have some additional servers not listed as well with 122TB... so for now I think we can cobble something useful together.
[0] https://hotaisle.xyz/networking/
website updated with more details
What are you using these for? Providing compute for customers that want to train/infer? What is the level of interest and level of success customers are seeing using these services Hot Aisle offers?
We are a bare metal compute offering. They can be used for whatever people want to use them for (within legal limits, of course). Interest is much higher now that we've started to work with teams publishing benchmarks which show that H100's have a nice competitor [0].
I'll admit, it is still early days. We just finished up another free compute [1] two week stint with a benchmarking team. One thing we discovered is that saving checkpoints is slow AF. I'm guessing an issue with ROCm. Hopefully get that resolved soon. Now we are in the process of onboarding the next team.
One idea to help you: are you sure you need a virtual machine? Couldn't you boot the machines over PXE to solve the imaging problem?
Essentially you have a TFTP server that serves a Linux image to boot directly.
1 chassis, 8 GPUs.
We want to be able to break that chassis up into individual GPUs and allocate 1 GPU to 1 "machine". I previously PXE booted 20,000 individual PlayStation 5 diskless blades, and I'm not sure how PXE would solve this.
The only alternative right now is to do what Runpod (and AMD's AAC) are doing and use Docker containers. But that has the limitation of Docker-in-Docker, so people end up having to repackage everything. You also can't easily run different ROCm versions since that comes from the host, and if you have 8 people on a single chassis... it becomes a nightmare to manage.
We're just patiently waiting for AMD to fix the problem.
Got it, it's clear. I thought you had 1 GPU in one chassis in some cases.
No such thing with MI300x. They come 8 at a time on an OAM/UBB.
We just set up a small cluster of our own. We’re not using Infiniband, but it didn’t seem like it would be a 50-week lead time to set it up. Where did you get that number?
I hear a couple of issues with your comment... "small cluster", "we're not using Infiniband"
I'm only saying what was quoted to me. The cards are easy to get, it is the switches that are more difficult.
I saw plenty of used switches available. Are current gen switches necessary?
https://news.ycombinator.com/item?id=40951133
Nice! What's the benchmark standard these days? Still SuperBench, or do y'all have something in-house?
Re: lead time quote :O I guess I got spoiled working for one of the major cloud vendors. The thought of poor b2b vendor support never entered my risk matrix.
If you own your own cluster, the network bottleneck becomes less of a dollar cost I suppose, since you aren't being charged a premium to rent someone else's compute.
We don't run benchmarks ourselves. We donate the Ferrari's worth of expensive compute for others to do it. This is the most unbiased way I could think of to get useful real-world data to share.
The 3rd team just finished up and the 4th is getting started now. I've got 23 others in the wings.
https://hotaisle.xyz/free-compute-offer/
This seems like an excellent strategy in terms of marketing and building mindshare for the platform (as you mention elsewhere, it is an underdog).
Thank you for the kind words.
It seems the reliability, speed, and scalability drops with Ethernet are somewhat manageable.
To quote the article - From our tests, we found that Infiniband was systematically outperforming Ethernet interconnects in terms of speed. When using 16 nodes / 128 GPUs, the difference varied from 3% to 10% in terms of distributed training throughput[1]. The gap was widening as we were adding more nodes: Infiniband was scaling almost linearly, while other interconnects scaled less efficiently.
And then they do mention that the research team needs to debug unexplained failures on Ethernet that they’ve not seen on Infiniband. This actually can be the expensive part. Particularly if the failures are silent and cause numerical errors only.
A single switch should mitigate some of the throughput issues.
As for issues, this is why I have a full professional support contract with Dell and Advizex. If there are issues in the gear, they will step in to help out.
Especially on the switch, since it is a SPOF, we went with a 4-hour window.
I've been out of the loop for a while. Has AMD made anything similar to CUDA? Are COTS frameworks such as PyTorch and TensorFlow on par when running on AMD hardware? What makes investing in an AMD cluster/chips worthwhile?
ROCm is the framework. Both PyTorch and TensorFlow have versions that support it. I have no experience with it, so I cannot say how it works in practice.
Did you procure the servers directly from Dell or through a distribution partner?
Kind of both. We started talking to Dell first and then they introduced us to Advizex. We are now effectively partnered with both companies, which is fantastic as the Advizex team have ex-Dell people working directly with us. We are lucky to have two whole teams of super talented people helping us out on this journey.
Meta had success building an Ethernet cluster on Arista 7800 with Wedge400 and Minipack2 OCP rack switches: https://www.datacenterdynamics.com/en/news/meta-reveals-deta...
I actually have a call with Arista next week to learn more about their solutions. Especially those 7800's. They look awesome.
One "problem" we have right now is that our cluster cannot support more than 128 GPUs. If we wanted to scale with Dell, we'd have to buy 6x more Z9864F to add one more cluster, which is crazy expensive and complicated.
I want to see if Arista has something that can help us. That said, I also have to find a customer that wants more than 128 MI300x and that hasn't happened... yet.
Why Xeon instead of EPYC?
Sadly, Xeon is the only option Dell supports today. To get to market the fastest, they took their existing H100 chassis solution, swapped out the GPUs/baseboard, and called it a day.