Great post. The ethernet section is especially interesting to me.
I'm building a cluster of 16x Dell XE9680's (128 AMD MI300x GPUs) [0], with 8x dual-port 200G Broadcom cards (running at 400G), all connected to a single Dell PowerSwitch Z9864F-ON, which should prevent any slowness. It will be connected over RoCEv2 [1].
We're going with ethernet because we believe in open standards, and few talk about the fact that the lead time on IB was last quoted to me at 50+ weeks. As kind of mentioned in the article, if you can't even deploy a cluster the speed of the network means less and less.
I can't wait to do some benchmarking on the system to see if we run into similar issues or not. Thankfully, we have a great Dell partnership, with full support, so I believe that we are well covered in terms of any potential issues.
Our datacenter is 100% green and low PUE and we are very proud of that as well. Hope to announce which one soon.
[0] https://hotaisle.xyz/compute/
[1] https://hotaisle.xyz/networking/
Cool, where can I read more about this? How do you power your DC?
Plenty of datacenters in the Pacific NW can claim to be "Green" because their power supply is entirely hydroelectric.
https://www.nwd.usace.army.mil/CRWM/CR-Dams/
Many of those areas also happen to have the lowest $-per-kWh electricity in North America; the only lower rates are available near a few hydroelectric dams in Quebec.
Why is green in scare-quotes?
People make the argument that if a giant datacenter is consuming 50% of some local hydro installation, everyone else in town is buying something else that is less green.
It opens up questions about grids and market efficiency, so your mileage may vary.
I don't think that's a cogent argument. It's akin to saying a vegan commune in a small town is buying up 50% of the vegan food, "forcing" others to buy meat products, and framing this to cast doubt on whether they are truly vegan. Consumers aren't in a position to solve supply problems.
Hydro power is a great thing; it was the first renewable energy that was available in meaningful quantities. However, great sites for hydro power are definitely limited. We will not suddenly find a great spot for a new huge dam. Imagine the only source of vegan B12 were some obscure plant that can only be grown on a tiny island. In this scenario, the possible extent of vegan consumption is fixed.
In the regions where it works (PNW, Quebec, etc) we could easily build more. The hurdles are regulatory. The regulation isn’t baseless - a dam will affect the local ecosystem adversely. But that’s a tradeoff we choose rather than a fundamental limitation.
The other tradeoff correlates with the amount of energy stored: the potential for catastrophic disaster in case of failure. As a society, living below a dam is not risk-free in the long run.
You've just painted a clearer picture than I did: the crux is that it is a supply-side problem.
That's a terrible analogy. Maybe if a biofuel plant used 50% of the vegan food to produce “green fuel”.
There's also discussion of environmental damage from damming rivers.
As anyone who has driven through the Columbia River Basin can tell you wind power is also abundant in Washington. The grid is very clean here but it’s certainly not purely hydro.
Previously, I had multiple data centers in Quincy, WA. Those were hydro green. It is an area that hosts a whole multitude of big hyperscaler companies.
I have no idea how they are powering it, but with the speed with which solar and battery prices are falling, and the slowness of getting a new big grid interconnection, I would not be surprised to see new data centers that are primarily powered by their own solar+batteries. Perhaps with a small, fast and cheap grid connection for small bits of backup.
If not this year, definitely in the 2030s.
Edit: for a much smaller scale version of this, here's a titanium plant doing this, instead of a data center. The nice thing about renewables is that they easily scale; if you can do it for 50MW you can do it for 500MW or 5GW with a linear increase in the resources. https://www.canarymedia.com/articles/clean-industry/in-a-fir...
For hundreds of MW?
https://www.visualcapitalist.com/cp/top-data-center-markets/
You can do that but they are the size of a mountain (literally: https://www.swissinfo.ch/eng/sci-tech/inside-switzerland-s-g... this is 900MW)
Why not? A typical value for land is 10 acres/MW, so 15-30 sq mi for 1-2 GW, which will average out to hundreds of megawatts over the course of 24 hours, even on cloudy days.
Including land costs, solar is the cheapest source of energy, at <$1/W, which is a tiny fraction of the cost of the rest of the data center equipment, and has a 30-50 year lifetime. For less than $5B you could have hundreds of megawatts of continuous solar power backed by 24+ hours (5+GWh) of batteries. Hydro storage really can't compete with batteries for this sort of application, at least for new storage capacity. Existing hydro certainly is great, just building new stuff is hard.
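For concreteness, here is the napkin math behind those figures as a quick sketch; the ~25% capacity factor is my own assumption, the rest are the numbers quoted above:

```python
# Back-of-envelope check of the figures above (all inputs are rough assumptions).
ACRES_PER_MW = 10          # typical land use for utility-scale solar
ACRES_PER_SQ_MI = 640
CAPACITY_FACTOR = 0.25     # assumed annual average; varies a lot by site
COST_PER_W = 1.0           # dollars per peak watt, including land (upper bound)

for peak_gw in (1, 2):
    peak_mw = peak_gw * 1000
    land_sq_mi = peak_mw * ACRES_PER_MW / ACRES_PER_SQ_MI
    avg_mw = peak_mw * CAPACITY_FACTOR          # average output over 24 hours
    cost_billion = peak_mw * 1e6 * COST_PER_W / 1e9
    print(f"{peak_gw} GW peak: ~{land_sq_mi:.0f} sq mi, "
          f"~{avg_mw:.0f} MW average, ~${cost_billion:.1f}B")
# 1 GW peak: ~16 sq mi, ~250 MW average, ~$1.0B
# 2 GW peak: ~31 sq mi, ~500 MW average, ~$2.0B
```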
And GW-scale solar installations are fairly commonplace. Far easier to procure than a matching number of H100s.
Maybe it's possible, I haven't seen it done yet. I guess there were better alternatives.
Well, I can't share numbers about datacenter MW sizes... the fact that I misread some of those numbers as per-datacenter MW is telling :0
In any case, Meta (not my employer) has 24k GPU clusters. In the densest (and less power-hungry) setup, Nvidia superpods have 4x DGX per rack, 8 GPUs per DGX (hyperscalers use HGX to build their stuff, but it's the same), and each rack uses ~40kW. That's 750 racks and 30MW of just ML; you need to add some 10-20% for supporting compute & storage and other DC infrastructure (cooling, etc.).
24k GPUs is likely one building, or even just one floor. Meta will likely have multiple clusters like that in the same POP.
That's in the ballpark of 100+MW per datacenter, as the starting point.
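Spelled out, the rack arithmetic above as a small sketch (rough figures from the comment; the 20% overhead is an assumed midpoint):

```python
# Rough per-cluster power math from the comment above (figures are approximate).
gpus = 24_000
gpus_per_rack = 4 * 8          # 4x DGX/HGX per rack, 8 GPUs each
kw_per_rack = 40

racks = gpus / gpus_per_rack                   # 750 racks
ml_mw = racks * kw_per_rack / 1000             # ~30 MW of ML alone
total_mw = ml_mw * 1.2                         # +20% for storage, cooling, etc. (assumed)
print(f"{racks:.0f} racks, ~{ml_mw:.0f} MW of ML, ~{total_mw:.0f} MW with overhead")
# A few clusters like this in one POP puts the site in the 100+ MW range.
```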
Oh I don't dispute your hundreds of MW at all. Other freely available information definitely supports that for the hyperscalers. Browsing recent construction projects, I see 250,000 sq ft projects from Apple, and 700,000+ sq ft from the hyperscalers, and typically consumption is 150-300W/sq ft, with the hyper-dense DGX systems at that or above.
There are lots of large-scale renewable power projects out there waiting to get onto the grid, stuck in long interconnection queues; more than 1TW of projects, last I heard. There are also lots of data centers wanting to get big power connections, enough that utilities are able to scare their regulatory bodies into making bad short-term decisions to try to support the new load centers.
Connecting the builders of these large projects directly to the new demand, and going outside the slow, corrupt, and inept utilities would solve a lot of problems. And you could still eventually get that big interconnection to the grid installed, and in the interim 3-5 years, power the data center mostly off-grid. Because that massive battery plus solar resource would eventually be a massive grid asset that could benefit everyone too, if the utilities weren't so slow.
I don't get how this would make sense.
Assuming no hydroelectric: at night with no wind, you'd need to draw the same hundreds of MW from batteries (or use gas/coal/nuclear, which defeats the purpose of ALSO using renewables on top of that 100 percent backup capacity).
Batteries with that capacity are still extremely expensive (and massive in volume), which would essentially mean your energy price is 5 to 10x higher (ballpark rough estimate) than non-renewable continuous sources.
Not to mention the huge amount of land needed/wasted (which costs money), the recycling problem (solar panels are not actually sustainable in the sense that they'll last forever or can be recycled), and so on.
The only company in the world for which I could see that setup MAYBE make sense business-wise is Tesla/xAI: they could relatively quickly and cheaply roll-out massive battery storage, data center and solar (for example). If only to be slightly faster and bigger at rolling out than their competitors it could make sense from a business perspective. But that's only because they can produce massive battery capacity at the lowest possible cost and quickest turn-around.
Maybe I'm missing something.
I think your talking points are mostly out of date or incorrect.
First, batteries are cheap today and being installed on grids at fantastic scales all the time. I suggested 5GWh of batteries above, which at that scale could probably be delivered at $300/kWh installed in a 2024 project. (Back in 2022, that figure was $481/kWh, and price competition has been brutal since then; see figure 3 [1].) At 6000 cycles of lifetime, the cost of delivering a stored kWh is only $0.05, less than typical transmission and distribution charges.
Second, solar is cheap, and that includes the land costs, at $0.04/kWh unsubsidized (slide 31 [2]). Land producing valuable electricity is not wasted; it's the exact opposite of waste. Solar is recyclable, as a simple web search will show. Further, to say that solar is somehow "not sustainable" is just bad propaganda on a massive scale.
There are 571GW of solar+battery projects seeking connection on the grid. [3] Very few of those projects are planning on using Tesla's storage. Now, all of those projects are going to have vastly smaller batteries, but scaling up the battery to cover 24-hour power is an easy design change, especially if it helps the project start generating money years faster than it would if it had to wait for an interconnection. A new data center could partner on site as the off-taker for one of these hundreds of proposed projects, if the site is close enough to the resources a data center needs. NC would be a likely site.
Well, I do not know if I have convinced you it's a good idea, but I have definitely gained a lot of conviction for myself that it's a fantastic business idea for both the power project and the data center... now if only I had serious skin in the game on either side so that I could benefit from it!
[1] https://www.nrel.gov/docs/fy23osti/85332.pdf
[2] https://emp.lbl.gov/sites/default/files/utility_scale_solar_...
[3] https://energyanalysis.lbl.gov/publications/queued-2024-edit...
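For what it's worth, the battery math above checks out on a napkin. A minimal sketch, using only the figures quoted (and ignoring round-trip losses and financing costs):

```python
# Sanity check of the battery figures cited above (inputs are the quoted rough numbers).
battery_gwh = 5
installed_cost_per_kwh = 300      # dollars, assumed 2024 large-project pricing
cycle_life = 6_000                # full charge/discharge cycles over the battery's life

capex_billion = battery_gwh * 1e6 * installed_cost_per_kwh / 1e9
cost_per_stored_kwh = installed_cost_per_kwh / cycle_life

print(f"{battery_gwh} GWh at ${installed_cost_per_kwh}/kWh ≈ ${capex_billion:.1f}B capex")
print(f"storage cost ≈ ${cost_per_stored_kwh:.2f} per delivered kWh")
# 5 GWh at $300/kWh ≈ $1.5B capex
# storage cost ≈ $0.05 per delivered kWh
```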
I don't know what the power consumption of an individual data center is, but your link talks about the consumption of all datacenters in individual states/countries.
A couple hundred megawatts is what I would expect from a data center, and what I had planned out before commenting, so that's on the mark. An H100 is very roughly a kW after adding in all the rest of the overhead, so a 200MW data center would have 200,000 H100s. That's a massive cluster, but not inconceivable.
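The same per-GPU arithmetic, as a quick sketch (the ~1kW all-in figure is the rough estimate above, not a measured number):

```python
# Rough check: how many accelerators fit in a 200 MW power budget
# if each one costs ~1 kW all-in (GPU + host + network + cooling overhead)?
facility_mw = 200
kw_per_gpu_all_in = 1.0   # assumed; an H100 SXM alone is ~0.7 kW TDP

gpus = facility_mw * 1000 / kw_per_gpu_all_in
print(f"~{gpus:,.0f} GPUs")   # ~200,000
```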
I don't think you could cover a data center with enough solar panels to power it
They can be to the side of the data center, too.
We haven't announced the dc yet, but will soon. Very well known. I'm actually pretty excited about it.
X?
waiting to see it posted on HN!
If you’re OK with used equipment, I find used Infiniband basically on par price-wise with used Ethernet gear. Mellanox stuff is super easy to test and you can run everything with a mainline kernel. MQM8700/MQM8790 switches can be had really cheap now used. Cables are dumb expensive... except a dime a dozen used. Or buy new, 100% compatible, from FS or the like. ConnectX-6 cards are quite cheap used thanks to HFT firms constantly upgrading to the latest, and many can be set to Infiniband OR Ethernet (I _think_ some of the dual-port cards support simultaneous use). I set up a cluster that right now has like 70 of these, at least half of them used. Have not had an issue with any of them yet (I botched a couple of firmware updates, but every time I was able to just reset the card and push firmware again). Every machine is connected to the IB fabric AND to at least 100G Ethernet.
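Not something the parent described doing, but for anyone curious, a rough sketch of flipping a ConnectX port between Infiniband and Ethernet, assuming the NVIDIA/Mellanox Firmware Tools (mst/mlxconfig) are installed; the device path is a placeholder, and the change only takes effect after a reboot or NIC reset:

```python
# Hypothetical sketch: toggle a ConnectX port between InfiniBand and Ethernet
# via mlxconfig (NVIDIA/Mellanox Firmware Tools). Device path is an example only.
import subprocess

DEVICE = "/dev/mst/mt4123_pciconf0"   # example path; list yours with `mst status`
LINK_TYPE = {"ib": "1", "eth": "2"}   # mlxconfig LINK_TYPE_P1 values

def set_port1_link_type(mode: str) -> None:
    # -y answers yes to the confirmation prompt; the new link type is queued
    # in firmware and applied on the next reboot/reset.
    subprocess.run(
        ["mlxconfig", "-y", "-d", DEVICE, "set", f"LINK_TYPE_P1={LINK_TYPE[mode]}"],
        check=True,
    )
    print("Link type queued; reboot or reset the NIC for it to take effect.")

if __name__ == "__main__":
    set_port1_link_type("eth")
```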
FWIW, used 100G Ethernet equipment is now cheap enough I’ve been upgrading my home network to be 100G. Cheaper than new consumer 10G equipment.
I'm building a business, not a home lab.
(before you continue to downvote me, read what I wrote below)
So? The difference in cost can be as much as 10x. If you're building a startup, that matters.
What matters is support contracts and uptime.
If you have a dozen customers on a server that cannot access things because of an issue, then as a startup, without a whole customer support department, you're literally screwed.
I've been on HN long enough to have seen plenty of companies get complaints after growing too quickly and not being able to handle the issues they run into.
I'm building this business in a way to de-risk things as much as possible: from getting the best equipment I can buy today, to support contracts, to the best data center, to just scaling with revenue growth. This isn't a cost issue, it is a long-term viability issue.
Home lab... certainly cut as many corners as you want. Cloud service provider building top supercomputers for rent... not so much. There is a reason why not a lot of people start to do this... it is extremely capital intensive. That is a huge moat, and getting the relationships and funding to do what I'm doing isn't easy; it took me over 5 years to get to this point of just getting started. I'm not going to blow it all by cutting corners on some used equipment.
I'm building this business in a way to de-risk things as much as possible
Then why did you go with AMD and not Nvidia? Are you not interested in AI/ML customers?
In my eyes, it is less risky with AMD. When you're rooting for the underdog, they have every incentive to help you. This isn't a battle to "win" all of AI or have one beat the other, I just need to create a nice profitable business that solves customer needs.
If I go with Nvidia, then I'm just another one of the 500 other companies doing exactly the same thing.
I'm a firm believer that there should not be a single company that controls all of the compute for AI. It would be like having Cisco be the only company that provides routers for the internet.
Additionally, we are not just AMD. We will run any compute that our customers want us to deploy for them. We are the capex/opex for businesses that don't want to put up the millions, or figure out and deal with all the domain specific details of deploying this level of compute. The only criteria we have is that it is the best-in-class available today for each accelerator. For example, I wouldn't deploy H100's because they are essentially old tech now.
Read these blog posts and tell me why you'd ask that question...
https://chipsandcheese.com/2024/06/25/testing-amds-giant-mi3...
https://www.nscale.com/blog/nscale-benchmarks-amd-mi300x-gpu...
OK, I just looked at the first blog post: “ROCm is nowhere near where it needs to be to truly compete with CUDA.”
That’s all I need to know as an AI/ML customer.
The latest generation generally isn't available used in much quantity. Used 100G equipment is cheap because it's almost ten years old.
Depends on your use case. For AI it’s not much good. Other applications don’t require as much internode bandwidth.
What do you mean not much good for AI? I guess it depends; 10-year-old equipment is not great, but plenty of 7-year-old stuff is excellent. SN2700 switches, for example, are EOL, but that doesn’t matter because you can run mainline Linux flawlessly if you enable switchdev (I’ve managed to run FreeBSD too). CX5 and CX6 cards are used everywhere still. I don’t have much experience with Broadcom gear, but I hear there are good options there too, though they tend to require more work to get the stack set up on Linux.
So am I, but that makes total sense in your case, providing cloud compute. Doing that without contracts and warranties sounds like a nightmare (I haven’t downvoted you at all; I’ve seen your comments in a few threads and am really interested in what you’re doing). Best of luck, especially on the AMD side. I see a lot of people being skeptical about that, but I think we are very close to them being competitive with Nvidia in this space. I’m pretty much entirely Nvidia at the moment, but I’d love to hack on some MI300X whenever I can get access.
Can you share some vendors and model names? I've been looking into upgrading my home lab from 10g to 100G, but the costs were absolutely astronomical.
I'm likely running a personal lab similar to many folks on HN - about 20-30 wired servers and a small rack of managed Unifi switches.
Appreciate you!
https://docs.nvidia.com/dgx-superpod/reference-architecture-...
NVIDIA's large GPU supercomputers have separate compute networking (between GPUs) and storage networking (storage to GPUs, or storage to SSD, SSD to GPUs with CPU assistance). This helps avoid networking issues, even more so if not using Infiniband.
From what I read here and on your website, you don't go that route. I haven't found the equivalent system level reference architecture for MI300x from AMD. I wonder if you have a link to a public document where AMD provides guidance about this choice?
It's interesting that the above "HPC reference architecture" shows a GPU-to-GPU Infiniband fabric, despite Nvidia also nominally pushing NVLink Switch (https://www.nvidia.com/en-us/data-center/nvlink/) for the HPC use-case.
How does NVLink work? Because I already know MPI and I’m not going to learn anything else, lol.
Edit: after googling it looks like OpenMPI has some NVLink support, so maybe it is OK.
There is "CUDA-aware MPI" which would let you RDMA from device to device. But the more modern way would be MPI for the host communication and their own library NCCL for the device communication. NCCL has similar collective functions a MPI but runs on the device which makes it much more efficient to integrate in the flow of your kernels. But you would still generally bootstrap your processes and data through MPI.
If you use UCX, it does all that automatically without you choosing.
The naming games on the libraries are rather entertaining... NCCL, RCCL... and oneAPI (oneCCL).
I use OpenMPI with no issues over multiple H100 nodes and A100 nodes, with multiple Infiniband 200G and Ethernet 100G/200G networks, and RDMA (though using Mellanox instead of Broadcom cards, but afaik Broadcom supports this just the same). Side note: make sure you compile nvidia_peermem correctly if you want GDRDMA to work :)
We have a separate OOB/east-west network which is 100G and would be used for external storage. We're spending an absurd amount of money on just cables.
It is documented on the website [0], but I do see that I did not document the actual cards for that, will add when I wake up tomorrow. The card is:
Broadcom 57504 Quad Port 10/25GbE, SFP28, OCP NIC 3.0
As far as I know, AMD doesn't really have the docs, it is Dell. Their team actively helped us design this whole cluster.
We haven't decided on which type storage we want to get yet. It'll really depend on customer demand and since we haven't deployed quite yet, we are punting that can down the road a bit. Our boxes do all have 122TB in them and we have some additional servers not listed as well with 122TB... so for now I think we can cobble something useful together.
[0] https://hotaisle.xyz/networking/
website updated with more details
What are you using these for? Providing compute for customers that want to train/infer? What is the level of interest and level of success customers are seeing using these services Hot Aisle offers?
We are a bare metal compute offering. They can be used for whatever people want to use them for (within legal limits, of course). Interest is much higher now that we've started to work with teams publishing benchmarks which show that H100's have a nice competitor [0].
I'll admit, it is still early days. We just finished up another free compute [1] two week stint with a benchmarking team. One thing we discovered is that saving checkpoints is slow AF. I'm guessing an issue with ROCm. Hopefully get that resolved soon. Now we are in the process of onboarding the next team.
One idea to help you: are you sure you need a virtual machine? Couldn't you boot the machines over PXE to solve the imaging problem?
Essentially you have a TFTP server that serves a Linux image to boot directly.
1 chassis, 8 GPUs.
We want to be able to break that chassis up into individual GPUs and allocate 1 GPU to 1 "machine". I previously PXE booted 20,000 individual PlayStation 5 diskless blades, and I'm not sure how PXE would solve this.
The only alternative right now is to do what Runpod (and AMD's AAC) are doing and use Docker containers. But that has the limitation of Docker-in-Docker, so people end up having to repackage everything. You also can't easily run different ROCm versions since that comes from the host, and if you have 8 people on a single chassis... it becomes a nightmare to manage.
We're just patiently waiting for AMD to fix the problem.
Got it, it's clear. I thought you had 1 GPU in one chassis in some cases.
No such thing with MI300x. They come 8 at a time on an OAM/UBB.
We just set up a small cluster of our own. We’re not using Infiniband, but it didn’t seem like it would be a 50-week lead time to set it up. Where did you get that number?
I hear a couple of issues with your comment... "small cluster", "we're not using Infiniband"
I'm only saying what was quoted to me. The cards are easy to get, it is the switches that are more difficult.
I saw plenty of used switches available. Are current gen switches necessary?
https://news.ycombinator.com/item?id=40951133
Nice! What's the benchmark standard these days? Still SuperBench, or do y'all have something in-house?
Re: lead time quote :O I guess I got spoiled working for one of the major cloud vendors. The thought of poor b2b vendor support never entered my risk matrix.
If you own your own cluster, the network bottleneck becomes less of a dollar cost I suppose, since you aren't being charged a premium to rent someone else's compute.
We don't run benchmarks ourselves. We donate the Ferrari's worth of expensive compute for others to do it. This is the most unbiased way I could think of to get useful real-world data to share.
The 3rd team just finished up and the 4th is getting started now. I've got 23 others in the wings.
https://hotaisle.xyz/free-compute-offer/
This seems like an excellent strategy in terms of marketing and building mindshare for the platform (as you mention elsewhere, it is an underdog).
Thank you for the kind words.
It seems the reliability, speed, and scalability drops with Ethernet are somewhat manageable.
To quote the article - From our tests, we found that Infiniband was systematically outperforming Ethernet interconnects in terms of speed. When using 16 nodes / 128 GPUs, the difference varied from 3% to 10% in terms of distributed training throughput[1]. The gap was widening as we were adding more nodes: Infiniband was scaling almost linearly, while other interconnects scaled less efficiently.
And then they do mention that the research team needs to debug unexplained failures on Ethernet that they’ve not seen on Infiniband. This actually can be the expensive part. Particularly if the failures are silent and cause numerical errors only.
A single switch should mitigate some of the throughput issues.
As for issues, this is why I have a full professional support contract with Dell and Advizex. If there are issues in the gear, they will step in to help out.
Especially on the switch, since it is a SPOF, we went with a 4-hour window.
I've been out of the loop for a while. Has AMD made anything similar to CUDA? Are COTS frameworks such as PyTorch and TensorFlow on par when running on AMD hardware? What makes investing in an AMD cluster/chips worthwhile?
ROCm is the framework. Both PyTorch and TensorFlow have versions that support it. I have no experience with it, so I cannot say how it works in practice.
Did you procure the servers directly from Dell or through a distribution partner?
Kind of both. We started talking to Dell first and then they introduced us to Advizex. We are now effectively partnered with both companies, which is fantastic as the Advizex team have ex-Dell people working directly with us. We are lucky to have two whole teams of super talented people helping us out on this journey.
Meta had success building an Ethernet cluster on Arista 7800 with Wedge400 and Minipack2 OCP rack switches: https://www.datacenterdynamics.com/en/news/meta-reveals-deta...
I actually have a call with Arista next week to learn more about their solutions. Especially those 7800's. They look awesome.
One "problem" we have right now is that our cluster cannot support more than 128 GPUs. If we wanted to scale with Dell, we'd have to buy 6x more Z9864F to add one more cluster, which is crazy expensive and complicated.
I want to see if Arista has something that can help us. That said, I also have to find a customer that wants more than 128 MI300x and that hasn't happened... yet.
Why Xeon instead of EPYC?
Sadly, Xeon is the only option Dell supports today. To get to market the fastest, they took their existing H100 chassis solution, swapped out the GPUs/baseboard, and called it a day.