
The art of high performance computing

dist1ll
35 replies
1d2h

It's very interesting how abstracted away from hardware HPC sometimes looks. The books seem to revolve a lot around SPMD programming, algo & DS, task parallelism, synchronization etc., but say very little about computer architecture details like supercomputer memory subsystems, high-bandwidth interconnects like CXL, GPU architecture and so on. Are the abstractions and tooling already good enough that you don't need to worry about these details? I'm also curious if HPC practitioners have to fiddle a lot of black-box knobs to squeeze out performance?

crabbone
10 replies
1d1h

You'd be surprised how backwards and primitive the tools used in HPC actually are.

Take for instance the so-called workload managers, of which the most popular are Slurm, PBS, UGE, and LSF. Only Slurm is really open source, PBS has a community edition, and the rest are proprietary stuff executed in the best traditions of enterprise software: they lock you into pathetically bad tools and ancient, backwards tech with crappy or nonexistent documentation and inept tech support.

The interface between WLMs and the user who wants some resources is the submission of "jobs". These jobs can be interactive, but most often they are so-called "batch jobs". A batch job is usually defined as... a Unix shell script, where specially formatted comments are parsed as instructions to the WLM. In a world with dozens of configuration formats... they chose to do this: embed configuration in shell comments.

Debugging job failures is a nightmare, mostly because WLM software has really poor quality of execution. Pathetic error reporting. Idiotic defaults. Everything is so fragile it falls apart if you so much as look at it the wrong way. Working with it reminds me of the very early days of Linux, when things sometimes just wouldn't build, or would segfault right after you tried running them, and there wasn't much you could do besides spending days or weeks debugging just to get some basic functionality going.

When I have to deal with it, I feel like I'm in a steampunk movie. Some stuff is really advanced, and then you find out that this advanced stuff is propped up by some DIY retro nonsense you thought had died off decades ago. The advanced stuff is usually more on the hardware side, while the software mostly is not keeping up with it.

victotronics
1 replies
23h51m

You use a lot of scare quotes. Do you have any suggestions on how things could be different? You need batch jobs because the scheduler has to wait for resources to become available. It's kinda like Tetris in processor/time space. (In fact, that's my personal "proof" that workload scheduling is NP-complete: it's isomorphic to Tetris.)

And what's wrong with shell scripts? It's a lingua franca, generally accepted across scientific disciplines, cluster vendors, workload managers, .... Considering the complexity of some setups (copy data to node-local file systems; run multiple programs, post-process results, ... ) I don't see how you could set up things other than in some scripting language. And then unix shell scripts are not the worst idea.

Debugging failures: yeah. Too many levels where something can go wrong, and it can be a pain to debug. Still, your average cluster processes a few million jobs in its lifetime. If more than a microscopic portion of that would fail, computing centers would need way more personnel than they have.

crabbone
0 replies
39m

And what's wrong with shell scripts?

When used as configuration? Here are some things that are wrong:

* Configuration is forced into single lines, which makes writing long options inconvenient (for example, if you want Slurm with Pyxis and you need to specify the image name -- it will most likely not fit on the screen).

* Oh, and since we're mentioning Pyxis -- their image names have a pound sign in them, and now you also need to figure out how to escape it, because for some reason, used literally, it breaks the comment parser.

* No syntax highlighting (because it's all comments).

* No way to create more complex configuration, i.e. no way to have any types other than strings, no way to have variables, no way to have collections of things.

* No way to reuse configuration (you have to copy it from one job file to another). I honestly don't even know what happens if you try to source a job configuration file from another job configuration.

All in all, it's really hard to imagine a worse configuration format. It sounds like a solution from some code-golfing competition where the goal was to make it as bad as possible while still retaining some shreds of functionality.

convolvatron
1 replies
1d

HPC software is one area where we have arguably regressed in the last 30 years. Chapel is the only light I see in the darkness

trentnelson
0 replies
16h39m

Want to elaborate more on Chapel? I've recently been tasked with integrating Chapel into our system and it's quite interesting.

bee_rider
1 replies
1d

Having switched from LSF to slurm, I have to appreciate that the ecosystem is so bash-centric. Lots of re-use in the conversion. If I’d had to learn some kind of slurm-markup-language or slurmScript or find buttons in some SlurmWizard, it would have been a nightmare.

crabbone
0 replies
1d

Oh LSF... I don't know if you know this. LSF is perhaps the only system alive today that I know of that uses literal patches as a means of software distribution.

First time I saw it, I had a flashback to the times when I worked for HP. They were making some huge SAP knock-off, and that system was so labor-intensive to deploy that their QA process involved actual patches. As in, the pre-release QA cycle involved installing the system, validating it (which could take a few weeks), and if it wasn't considered DoD, the developers were given the final list of things they needed to fix, and those fixes had to be submitted as patches (sometimes, literal diffs that needed to be applied to the deployed system with the patch tool).

This is, I guess, how the "patch version component" came to be in the SemVer spec. It's kind of funny how lots of tools use this component today for completely unrelated purposes... but yeah, with LSF it feels like time is ticking at a different pace :)

StableAlkyne
1 replies
1d1h

Working with it reminds me of the very early days of Linux

The other cool thing about HPC is that it is one of the last areas where multi-user Unix is used! At least if you're using a university or NSF cluster, that is!

Only other place I really see multiple humans using the same machine is SDF or the Tildes

victotronics
0 replies
23h46m

It's saturday afternoon.

  [login1 ~:3] who | cut -d ' ' -f 1 | sort -u | wc -l
  41

romanows
0 replies
23h29m

I really like using Slurm, the documentation is great (https://slurm.schedmd.com) and the model is pretty straightforward, at least for the mostly-single-node jobs I used it for.

You can launch jobs via the command line, config in Bash comments, REST APIs, linking to their library, and I think a few other ways.

I found it pretty easy to set up and admin. Scaling in the cloud was way less developed when I used it, so I just hacked in a simple script that allowed scaling up and down based on the job queue size.

What do you like better and for what use-case? Mine was for a group of researchers training models, and the feature I desired most was an approximately fair distribution of resources (cores, GPU hours, etc.).

OPA100
0 replies
1d

I've dug deeply into LSF in the last few years and it's like a car crash - you can't look away. It feels like something that started in the early unix days but was developed into perhaps the late 90s, but in reality LSF was only started in the 90s (in academia). As far as I can tell development all but stopped when IBM acquired it some ten years ago.

bluedino
6 replies
23h35m

I started in HPC about 2 years ago on a ~500 node cluster at a Fortune 100 company. I was really just looking for a job where I was doing Linux 100% of the time, and it's been fun so far.

But it wasn't what I thought it would be. I guess I expected to be doing more performance oriented work, analyzing numbers and trying to get every last bit of performance out of the cluster. To be honest, they didn't even have any kind of monitoring running. I set some up, and it doesn't really get used. Once in a while we get questions from management about "how busy is the cluster", to justify budgets and that sort of thing.

Most of my 'optimization' work ends up being things like making sure people aren't (usually unknowingly) requesting 384 CPUs when their script only uses 16, testing software to see what number of CPUs it works with before you see degradation, etc. I've only had the Intel profiler open twice.

And I've found that most of the job is really just helping researchers and such with their work. Typically that means running either a commercial or open-source program and troubleshooting it, or taking some code written by another team on another cluster and getting it built and running on yours. Slogging through terrible Python code. Trying to get a C++ project built on a more modern cluster in a CentOS 7 environment.

It can be fun in a way. I've worked with different languages over the years so I enjoy trying to get things working, digging through crashes and stack traces. And working with such large machines, your sense of normal gets twisted when you're on a server with 'only' 128GB of RAM or 20TB of disk.

It's a little scary when you know the results of some of this stuff are being used in the real world, and the people running the simulations aren't even doing things right. Incorrect code, mixed-up source code, not using the data they think they are -- I once found a huge bug that had existed for 3 years. Doesn't that invalidate all the work you've done on this subject?

The one drawback I find is that a lot of HPC jobs want you to have a master's degree. Even to just run the cluster. It doesn't make sense to me: I'm not writing the software you're running, and we aren't running some state-of-the-art TOP500 cluster. We're just getting a bunch of machines networked together and running some code.

throwawaaarrgh
4 replies
23h14m

I always found that funny too. A business that needs a powerful computing solution can come up with some amazingly robust stuff, whereas science/research just buys a big mainframe and hopes it works.

s_Hogg
3 replies
18h17m

I was working in a company that had been spun out of a university until recently, and it was shocking how hopeless the researchers were. I've always been critical of how poor job security in academia is, but given how slapdash some of the crap you see is, you'd think even that is still too much. We basically had to reinvent their product from the ground up; awful.

danparsonson
1 replies
16h6m

This is probably a naive question but isn't that the point of having developers on staff? The researchers aren't coders and vice versa, so having researchers produce prototypes that are productized by engineers makes sense to me.

rramadass
0 replies
14h19m

Exactly! This is how it should be.

Researchers/Scientists with their hard earned PhDs should only concentrate on doing cutting-edge "researchy" stuff. It is hard enough that they should not be asked to learn all the intricacies/problems inherent in Software Development. That is the domain of a "Professional Software Engineer".

There is now in fact a new role called "Research Software Engineer": software developers working in research, developing code specific to researchers' needs - https://www.nature.com/articles/d41586-022-01516-2 and https://en.wikipedia.org/wiki/Research_software_engineering

m-ee
0 replies
14h42m

I've had very similar experiences working with former researchers, including at a university spinout. Mechanical rather than CS. It was perplexing how, given the quality of their work, they still carried the elitism that industry is mostly for people who can't hack it in academia. It would be unacceptable coming from a new-hire PD engineer at Apple, yet you're demanding respect because you used to lead a whole lab apparently producing rubbish?

justin66
0 replies
22h9m

The one drawback I find is that a lot of HPC jobs want you to have a master's degree.

Is it possible that pretty much any specialization, outside of the most common ones, engages in a lot of gatekeeping? I remember how difficult it appeared to be after I graduated to break into embedded systems (I never did). I persisted until I realized it doesn't even pay very well, comparatively.

dahart
2 replies
1d1h

There is a lot of abstraction, but knowing which abstraction to use still takes knowing a lot about the hardware.

I’m also curious if HPC practitioners have to fiddle a lot of black-box knobs to squeeze out performance?

In my experience with CUDA developers, yes, the Shmoo Plot (https://en.wikipedia.org/wiki/Shmoo_plot, sometimes called a 'wedge' in some industries) is one of the workhorses of everyday optimization. I'm not sure I'd call it black-box, though maybe the net effect is the same. It's really common to have educated guesses and to know what the knobs do and how they work, and still find big surprises when you measure. The first rule of optimization is: measure. I always think of Michael Abrash's first chapter in the "Black Book": "The Best Optimizer is Between Your Ears" http://twimgs.com/ddj/abrashblackbook/gpbb1.pdf. It is a fabulous snippet of the philosophy of high performance (even though it's PC-game centric and not about modern HPC).

Related to your point about abstraction, the heaviest knob-tuning should get done at the end of the optimization process, because as soon as you refactor or change anything, you have to do the knob tuning again. A minor change in register spills or cache access patterns can completely reset any fine-tuning of thread configuration or cache or shared memory size, etc. Despite this, some healthy amount of knob tuning is still done along the way to check, balance, and get an intuitive sense of the local perf space of the code. (Just noticed Abrash talks a little about why this is a good idea.)

squidgyhead
1 replies
21h46m

Could you explain how you use a shmoo plot for optimization? Do you just have a performance metric at each point in parameter space?

dahart
0 replies
18h58m

The shmoo plot is just the name for measuring something (such as perf) over a range of parameter space. The simplest and most straightforward application is to pick a parameter or two whose values you don't know, do the shmoo over the range of parameter space, and then set the knobs at whatever values give you the optimal measurement.

Usually though, you have to iterate. Doing shmoos along the way can help with understanding the effects of code changes, help understand how the hardware works, and it can sometimes help identify what code changes you might need to make. A simple abstract example might be I know what my theoretical peak bandwidth is, but my program only gets 30% of peak. I suspect it has to do with how many registers are used, and I have a knob to control it, so I turn the knob and plot all possible register settings, and find out that I can get 45% of peak with a different value. Now I know it was partially registers I was limited by, but I also know to look for something else too. Then I profile, examine the code, maybe refactor or adjust some things, hypothesize, test, and then shmoo again on a different knob or two if I suspect something else is the bottleneck.
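
To make that concrete, here's a minimal one-knob sketch (my own toy example, not a real workload): time the same routine over a range of block sizes, print the table, and read off where the knob should sit. Build with optimizations (e.g. g++ -O3) and redo the sweep whenever the code changes.

  // shmoo.cpp -- sweep one tuning knob (block size) and print a perf table.
  // The "kernel" here is just a blocked out-of-place matrix transpose.
  #include <chrono>
  #include <cstdio>
  #include <vector>

  static void transpose_blocked(const std::vector<double>& a, std::vector<double>& b,
                                int n, int bs) {
      for (int ii = 0; ii < n; ii += bs)
          for (int jj = 0; jj < n; jj += bs)
              for (int i = ii; i < ii + bs && i < n; ++i)
                  for (int j = jj; j < jj + bs && j < n; ++j)
                      b[j * n + i] = a[i * n + j];
  }

  int main() {
      const int n = 2048;                          // 2048 x 2048 doubles, ~32 MB per matrix
      std::vector<double> a(n * n, 1.0), b(n * n, 0.0);

      const int knobs[] = {4, 8, 16, 32, 64, 128, 256};
      std::printf("%8s %10s\n", "block", "GB/s");
      for (int bs : knobs) {
          auto t0 = std::chrono::steady_clock::now();
          transpose_blocked(a, b, n, bs);
          auto t1 = std::chrono::steady_clock::now();
          double s = std::chrono::duration<double>(t1 - t0).count();
          std::printf("%8d %10.2f\n", bs, 2.0 * n * n * sizeof(double) / s / 1e9);
      }
      std::printf("check: %g\n", b[0]);            // keep the result observable
  }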

bayindirh
2 replies
20h43m

HPC admin here, generally serving "long tail of science" researchers.

In today's x86_64 hardware, there's no "supercomputer memory subsystem". It's just a glorified NUMA system, and the biggest problem is putting the memory close to your core, i.e. keeping data local in your NUMA node to reduce latencies.

Your resource mapping is handled by your scheduler. It knows your hardware, hence it creates a cgroup that satisfies your needs as optimally as possible, stuffs your application into that cgroup, and runs it.

Currently the king of high-performance interconnects is InfiniBand, and it accelerates MPI at the fabric level. You can send messages, broadcast and reduce results like there's no tomorrow, because when the message arrives at you, it's already reduced. When you broadcast, you only send a single message, which is broadcast at the fabric layer. Multi-context IB cards have many queues, and more than one MPI job can run on the same node/card with queue/context isolation.
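
To make that concrete, a minimal sketch of the collectives in question (plain MPI C API; nothing InfiniBand-specific appears in the source -- whether the reduction/broadcast actually gets offloaded to the fabric depends on the MPI library and hardware underneath):

  // allreduce_bcast.cpp -- the collectives a fabric can accelerate: every rank
  // contributes, and each rank receives the already-finished result.
  #include <mpi.h>
  #include <cstdio>

  int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      double local = rank + 1.0, total = 0.0;
      // Sum of "local" across all ranks; arrives at every rank already reduced.
      MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

      double param = (rank == 0) ? 42.0 : 0.0;
      // One logical send from rank 0, fanned out by the library/fabric.
      MPI_Bcast(&param, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

      std::printf("rank %d/%d: total=%g param=%g\n", rank, size, total, param);
      MPI_Finalize();
  }

Build with mpicxx and run with something like mpirun -np 4 ./a.out.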

If you're using a framework for GPU work, the architecture & optimization is done at that level automatically (the framework developers generally do the hard work). NVIDIA's drivers are pure black magic, too, and handle some parts of the optimization. Inter-GPU connection is handled by a physical fabric, managed by the drivers and its own daemon.

If you're CPU bound, your libraries are generally hand-tuned by their vendors (Intel MKL, BLAS, Eigen, etc.). I personally used Eigen, and it has processor-specific hints and optimizations baked in.

The things you have to worry about are compiling your code for the correct architecture and making sure that the hardware you run on can satisfy your demands (i.e.: do not make too many random memory accesses, keep the prefetcher and branch predictor happy if you're trying to go "all-out fast" on the node, do not abuse disk access, etc.).

On the number-crunching side, the keys are keeping things independent (so they can be instruction-level parallelized/vectorized), making sure you're not doing unnecessary calculations, and not abusing MPI (reducing inter-node talk to only necessary chatter).
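
A tiny illustration of the "keep iterations independent, build for the right architecture" point (a generic sketch, not code from any real application):

  // axpy.cpp -- iterations are independent, so the compiler is free to vectorize.
  // Build for the machine you run on, e.g.:  g++ -O3 -march=native axpy.cpp
  #include <cstdio>
  #include <vector>

  void axpy(double a, const std::vector<double>& x, std::vector<double>& y) {
      const std::size_t n = x.size();
      for (std::size_t i = 0; i < n; ++i)
          y[i] += a * x[i];              // no loop-carried dependency -> SIMD-friendly
  }

  int main() {
      std::vector<double> x(1 << 20, 1.0), y(1 << 20, 2.0);
      axpy(3.0, x, y);
      std::printf("y[0] = %g\n", y[0]);  // 5
  }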

It's way easier said than done, but when you get the hang of it, thinking about these things becomes second nature -- if these kinds of things are your cup of tea.

dist1ll
1 replies
18h21m

Thanks for the thoughtful comment, pretty fascinating stuff.

In today's x86_64 hardware, there's no "supercomputer memory subsystem". It's just a glorified NUMA system, and the biggest problem is putting the memory close to your core, i.e. keeping data local in your NUMA node to reduce latencies.

I mean, memory topology varies greatly by uarch (doubly so between vendors). I can't take a routine tuned to Nehalem, run it on Haswell or Skylake, and expect it to stay competitive. More generally, different hardware has different bandwidth and latency ratios, which affects software design (e.g. software written for a commodity Dell box with PCIe cards probably won't translate to a Cray accelerator grid connected by HPE Slingshot). And then there are hardware-specific features like RNICs bypassing DRAM and writing RDMA messages directly into the receiver's cache. So I think that ccNUMA and data locality are not sufficient to reason about memory perf.

bayindirh
0 replies
2h34m

I mean, memory topology varies greatly by uarch...

You're absolutely right; this is why I said that if you're using libraries, this burden is generally handled by them. Compilers also do this, and handle it very well.

If you're writing your own routines, the best way is to read the arch docs, maybe some low-level sites like Chips and Cheese, do some synthetic benchmarks, and write your code in a semi-informed way.

After writing the code, a suite of cachegrind, callgrind and perf is in order. See if there are any other bottlenecks, and tune your code accordingly. Add hints for your compiler, if possible.

I was able to reach insane saturation levels with Eigen plus some hand-tuned code. For the next level I needed to change my matrix ordering, but it was already fast enough (30 minutes down to 45 seconds: a 40x speedup), so I left it there.
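
For what it's worth, the two knobs mentioned here look roughly like this in a stripped-down sketch (illustrative only, not my production code, and it assumes Eigen is on your include path): let Eigen's tuned kernels do the work, and pick the storage order explicitly.

  // eigen_order.cpp -- let the library supply the tuned kernels; storage order
  // is one of the layout decisions left to you.
  // Build: g++ -O3 -march=native -I/path/to/eigen eigen_order.cpp
  #include <Eigen/Dense>
  #include <cstdio>

  using ColMat = Eigen::MatrixXd;  // column-major, Eigen's default
  using RowMat = Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;

  int main() {
      const int n = 1024;
      ColMat A = ColMat::Random(n, n);
      RowMat B = RowMat::Random(n, n);

      ColMat C(n, n);
      C.noalias() = A * B;             // .noalias() skips the temporary for plain products

      std::printf("C(0,0) = %f\n", C(0, 0));
  }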

Sometimes there is no replacement for blood, sweat and tears in this thing.

I have never played with custom interconnects (Slingshot, etc.), yet, so I can't tell much.

atrettel
2 replies
1d2h

Yes and no.

MPI and OpenMP are the primary abstractions from the hardware in HPC, with MPI being an abstracted form of distributed-memory parallel computing and OpenMP being an abstracted form of shared-memory parallel computing. Many researchers write their codes purely using those, often both in the same code. When using those, you really do not need to worry about the architectural details most of the time.

Still, some researchers who like to further optimize things do in fact fiddle with a lot of small architectural details to increase performance further. For example, loop unrolling is pretty common and can get quite confusing in my opinion. I vaguely recall some stuff about trying to vectorize operations by preferring addition over multiplication due to the particular CPU architecture, but I do not think I've seen that in practice.

Preventing cache misses is another major one, where some codes are written so that the most needed information is stored in the CPU's cache rather than memory. Most codes only handle this by ensuring column-major order loops for array operations in Fortran or row-major order loops in C, but the concept can be extended further. If you know the cache size for your processors, you could hypothetically optimize some operations to keep all of the needed information inside the cache to minimize cache misses. I've never seen this in practice but it was actively discussed in the scientific computing course I took in 2013.
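
As a small illustration of the loop-ordering point (a toy sketch in C++ terms, row-major; in Fortran you would flip the loops): the same sum traverses memory either with unit stride or with stride n, and only the first keeps the cache and prefetcher happy.

  // loop_order.cpp -- same arithmetic, very different memory behavior.
  #include <cstdio>
  #include <vector>

  int main() {
      const std::size_t n = 4096;                   // 4096 x 4096 doubles, ~128 MB
      std::vector<double> a(n * n, 1.0);            // row-major: element (i,j) at a[i*n + j]

      double good = 0.0;                            // inner loop over j: unit stride
      for (std::size_t i = 0; i < n; ++i)
          for (std::size_t j = 0; j < n; ++j)
              good += a[i * n + j];

      double bad = 0.0;                             // inner loop over i: stride n
      for (std::size_t j = 0; j < n; ++j)
          for (std::size_t i = 0; i < n; ++i)
              bad += a[i * n + j];

      std::printf("good=%g bad=%g (same sums, very different runtimes)\n", good, bad);
  }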

The use of particular GPUs depends heavily on the problem being solved, with some being great on GPUs and others being too difficult. I'm not too knowledgeable about that, unfortunately.

bee_rider
1 replies
1d2h

Of course, not every problem can be solved by BLAS, but if you are doing linear algebra, the cache stuff should be mostly handled by BLAS.

I'm not sure how much multiplication vs addition matters on a modern chip. You can have a bazillion instructions in flight, after all, as long as they don't have any dependencies, so I'd go with whichever option shortens the data dependencies on the critical path. The computer will figure out where to park longer instructions if it needs to.

atrettel
0 replies
1d1h

You're right that the addition vs. multiplication issue likely does not matter on a modern chip. I just gave the example because it shows how the CPU architecture can affect how you write the code. I do not recall precisely when or where I heard the idea, but it was about a decade ago --- ages ago by computing standards.

efxhoy
1 replies
20h8m

I wrote scientific simulation software in academia for a few years. None of us writing the software had any formal software engineering training above what we’d pieced together ourselves from statistics courses. We wrote our simulations to run independently on many nodes and aggregated the results at the end, no use of any HPC features other than “run these 100 scripts on a node each please, thank you slurm”. That approach worked very well for our problem.

I'd bet a significant part of compute work on HPC clusters in academia works the same way. The only things we paid attention to were the number of cores on the node and preferring node-local storage over the shared volumes for caching. No MPI.

There are of course problems requiring “genuine” HPC clusters but ours could have run on any pile of workers with a job queue.

jltsiren
0 replies
7h26m

That's often the ideal case. Individual tasks are small enough to run on commodity hardware but large enough that you don't have an excessive number of them. That means you can write simple software without wasting effort on distributed computing.

I've seen similar things at the intersection of bioinformatics and genomics. Computers are getting bigger but the genomes aren't, and tasks that require distributed computing are getting rare.

mgaunard
0 replies
1d2h

Regardless of what you do, domain knowledge tends to be more valuable than purely technical skills.

Knowing more numerical analysis will probably get you further in HPC than knowledge of specific hardware architectures.

Ideally you want both, of course.

marcosdumay
0 replies
1d1h

It's not intuitive, but HPC is more about scalability than performance.

You won't be able to use a supercomputer at all without scalability, and it's the one topic that is specific to HPC. But, of course, those computers' time is quite expensive, so you'll want to optimize for performance too. It's just secondary.

jandrewrogers
0 replies
1d1h

For most HPC, you will not be able to maximize parallelism and throughput without intimate knowledge of the hardware architecture and its behavior. As a general principle, you want the topology of the software to match the topology of the hardware as closely as possible for optimal scaling behavior. Efficient HPC software is strongly influenced by the nature of the hardware.

When I wrote code for new HPC hardware, people were always surprised when I asked for the system hardware and architecture docs instead of the programming docs. But if you understood the hardware design, the correct way of designing software for it became obvious from first principles. The programming docs typically contained quite a few half-truths intended to make things seem misleadingly easier for developers than a proper understanding would suggest. In fact, some HPC platforms failed in large part because they consistently misrepresented what was required from developers to achieve maximum performance in order to appear "easy to use", and then failed to deliver the performance the silicon was capable of when you actually wrote software the way the marketing implied would be effective.

You can write HPC code on top of abstractions, and many people do, but the performance and scaling losses are often unavoidably an integer factor. As with most software, this was considered an acceptable loss in many cases if it allowed less capable software devs to design the code. HPC is like any other type of software in that most developers who notionally specialize in it struggle to produce consistently good results. Much of the expensive hardware used in HPC is there to mitigate the performance losses of worse software designs.

In HPC there are no shortcuts to actually understanding how the hardware works if you want maximum performance. Which is no different than regular software, in HPC the hardware systems are just bigger and more complex.

eslaught
0 replies
1d2h

No, the abstractions are not sufficient. We do care about these details, a lot.

Of course, not every application is optimized to the hilt. But if you want to so optimize an application, exactly the things you're talking about come into play.

So yes, I would expect every competent HPC practitioner to have a solid (if not necessarily intimate) grasp of hardware architecture.

bee_rider
0 replies
1d2h

I don’t think I do HPC (I only will use up to, say, 8 nodes at a time), but the impression I get is that they are already working on quite hard problems at the high-level, so they need to lean on good libraries for the low-level stuff, otherwise it is just too much.

MichaelZuo
0 replies
1d2h

Memory architecture and bandwidth are still very important, most of IBM's latest performance gains for both mainframes and POWER are reliant on some novel innovations there.

LASR
27 replies
22h18m

The hardware / datacenter side of this is equally fascinating.

I used to work in AWS, but on the software / services side of things. But now and then, we would crash some talks from the datacenter folks.

One key revelation for me was that increasing compute power in DCs is primarily a thermodynamics problem rather than a computing one. The nodes have become so dense that shipping power in and shipping heat out, with all kinds of redundancies, is an extremely hard problem. And it's not like you can perform a software update if you've discovered some inefficiencies.

This was ~10 years ago, so probably some things have changed.

What blows me away is that Amazon, starting out as an internet bookstore is at the cutting edge of solving thermodynamics problems.

cogman10
23 replies
19h29m

It always made me wonder why liquid cooling wasn't more of a thing for datacenters.

Water has a massive amount of thermal capacity and can quickly and in bulk be cooled to optimal temperatures. You'd probably still need fans and AC to dissipate heat of non-liquid cooled parts, but for the big energy items like CPUs and GPUs/compute engines, you could ship out huge amounts of heat fairly quickly and directly.

I guess the complexity and risk of a leak would be a problem, but for amazon sized data centers that doesn't seem like a major concern.

bayindirh
15 replies
18h51m

Because it’s complex. Even more complex than “engineered” air.

You need two circuits, and a CDU between them. Coolant needs maintaining. You add antifreeze, biocides, etc.

Air is brute force. It cools everything it touches. Liquid cooling is serialized in a node. Two sockets? Second will be hotter. HBA not making good contact? It’ll overheat.

You add extensive leak-detection subsystems, and the amount of coolant moving in your primary circuit becomes massive.

Currently you can remove 97% of the heat via liquid (including the PSUs), and it’s cheaper to do so than air, but it’s not “rails, screws, cables, power on”. Air cooled systems can be turned on in a week. Liquid cooled ones take a month.

However, using liquid is mandatory after some point. You can’t cool systems that dense and under that load with air. They’ll melt.

gopher_space
8 replies
17h12m

What's this all look like without an atmosphere?

eutropia
7 replies
16h40m

Worse, heat dissipation is a major constraint for spacecraft and satellites because you can only radiate heat away as infrared photons.

taneq
3 replies
15h27m

Doesn’t have to be infrared but yeah, space isn’t “cold” so much as it’s an insulator.

_a_a_a_
2 replies
5h36m

at ~2 kelvin I'd have thought you can radiate away a truckload of heat surely

zmgsabst
1 replies
4h41m

You can radiate easily.

But not convect. Hence why it’s much, much harder than removing heat on Earth.

_a_a_a_
0 replies
4h14m

Of course you're not convecting but if you are radiating from a hot body into an ambient two Kelvin then you are going to lose heat really, really fast. IIRC heat loss by black body radiation into its surroundings is proportional to the fourth power of the temperature difference between the body and surroundings (from memory, and going back a very long way, so maybe incorrect).

uticus
1 replies
14h6m

Amazing, considering how much heat travels from the Sun (and punches through the atmosphere) to the Earth's surface. I didn't realize there was that much of an insulating property.

pixelpoet
0 replies
7h56m

Besides the heat insulation, without the vacuum of space we would all be deafened by the sun's roar.

fecal_henge
0 replies
9h15m

You just need to radiate in the visible spectrum then the problem will be much reduced.

magicalhippo
4 replies
12h34m

Liquid cooling is serialized in a node. Two sockets?

I've seen tests done on heavy PC loops (i.e. multi-GPU), both high-flow and low-flow, as well as on car engines, in different coolant-flow configurations. The results from all of those are that the water doesn't rise meaningfully in temperature between components.

Unless I did my back-of-the-napkin math wrong, this seems reasonable. If you have a single 10mm ID pipe going through a 1U server and up to the next, then for a full 42U rack you have about 1.7kg of water going through the servers. If the flow rate is about 1s per server (so 42 seconds for the full rack) and each 1U server dumps 500W of energy into the water, there should be just a 3 degree C difference in the water temperature between the first and the last server.
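
For reference, the relation behind this kind of napkin estimate (just the heat balance for water as the coolant, nothing rack-specific):

  \Delta T = \frac{P}{\dot{m}\, c_p}, \qquad c_p(\text{water}) \approx 4.2\ \mathrm{kJ/(kg\,K)}

i.e. the steady-state temperature rise is the heat dumped into the loop divided by the mass flow rate times the specific heat.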

bayindirh
3 replies
11h45m

In our system every node gets inlet water at the same temperature via parallel piping, but when it’s in node, it goes through processors first, then RAM, then PCIe and disks. Delta T between two sockets is 5 degrees C, and the delta T between input and output is around 15-18 C depending on load.

menaerus
1 replies
9h41m

First, thanks for sharing these details. I find them fascinating because they are not so commonly read or heard about.

Delta T between two sockets is 5 degrees C

And secondly, ~5-10 degrees is what I see on my dual-socket workstation, and I have been wondering about this delta ever since the first day I started monitoring the temperatures. At first I thought the heat sink wasn't installed properly, but after reinstalling it the delta remained. Since I didn't notice any CPU throttling or anything, I figured it's "normal" and ignored it.

bayindirh
0 replies
9h35m

Hey, no worries. Using one is as fascinating as reading about it. It feels like a space shuttle: so different, yet so enjoyable.

I mean, water travels from one socket to the other, so one processor adds heat equal to 5 degrees C under nominal load. The second socket doesn't complain much, but this is an enormous amount of heat transferred at a very quick pace.

magicalhippo
0 replies
6h24m

Interesting. What's your flow rate and pipe size?

is_true
0 replies
4h47m

Is raising the air's water content (humidity) worth it? Humid air can "store" more heat.

I guess it could be bad past some %, but there's probably a point where it's worth it.

adev_
3 replies
18h55m

It always made me wonder why liquid cooling wasn't more of a thing for datacenters.

Liquid cooling is almost a de-facto standard in data centers in the HPC world. The machines at the top of the TOP500 are all liquid cooled. Not by choice, but due to physics constraints.

There is a big gap in power density between the HPC world and the usual datacenter-commodity-hardware world.

Commodity DCs are designed with the assumption that the average machine will run at a fraction of its maximum load. HPC systems, at the opposite end, are designed to operate safely at 100% load all the time.

In a previous company where I worked, we attempted to install a medium-size HPC cluster in a well-known commercial datacenter and network provider. Their sales rep almost fell off his chair when we announced the power requirements.

bayindirh
2 replies
18h49m

we attempted to install a medium-size HPC cluster in a well-known commercial datacenter and network provider. Their sales rep almost fell off his chair when we announced the power requirements.

Heh. We tried it too. They didn’t believe that a single node used their entire rack’s budget at first.

mrgaro
1 replies
8h32m

Sounds fascinating. Can you give any more details? What kind of nodes are they and how they differ from "traditional" DC hardware, say from Supermicro?

lhoff
0 replies
5h48m

The difference is GPUs. A normal dual-socket system serving a database or webserver uses around 200-300W under medium load. One of these [1], equipped with 10xA100, can easily use in the ballpark of 3kW under load. So we are talking 10x the power usage.

[1]https://www.supermicro.com/en/products/system/gpu/5u/sys-521...

victotronics
0 replies
18h53m

Immersion cooling is getting big. At the last Supercomputing conference I probably saw at least a dozen vendors of immersion cooling equipment. My datacenter has one cluster with liquid cooling caps over the sockets, and two immersed clusters. The latter two have basins of various degrees of sophistication under them for when they do spring a leak.

_kb
0 replies
2h3m

In contexts where there's a good chance of standardisation, I believe it is. Both OCP [0] and Open19 [1] have liquid cooling as part of the standard.

[0]: https://www.opencompute.org/projects/cooling-environments

[1]: https://gitlab.com/open19/v2-specification/-/blob/main/syste...

projectileboy
1 replies
20h13m

Seymour Cray used to say this all the way back in the 1970s: his biggest problems were associated with dissipating heat. For the Cray 2 he took an even more dramatic approach: "The Cray-2's unusual cooling scheme immersed dense stacks of circuit boards in a special non-conductive liquid called Fluorinert™" (https://www.computerhistory.org/revolution/supercomputers/10...)

logtempo
0 replies
17h10m

A few days ago I saw an article passing by, about chips hitting the kW floor.

cyrillite
0 replies
16h39m

Is there any good data on the scale of this problem or that can be used to visualise it?

What is the cutting edge of cooling tech like?

rlupi
10 replies
1d1h

I am interested in the more hardware-management side of HPC: how problems are detected, diagnosed, and mapped into actions such as reboot/reinstall/repair; how these actions are scheduled, and how that is optimized to provide the best level of service; how this is done when there are multiple objectives to optimize at once (e.g. node availability vs overall throughput); how different topologies and other constraints affect the above; and, in general, a system dynamics approach to these problems.

I haven't found many good sources for this kind of information. If you are aware of any, please cite them in a comment below.

CoastalCoder
4 replies
1d

This seemed like a big topic when I was interviewing with Meta and nVidia some months ago.

Meta had a few good YouTube videos about the problems of dealing with this many GPUs at scale.

keefle
3 replies
23h22m

Could you link me the YouTube videos/articles in question? It happens to be my research area and I'm interested in knowing how big companies such as meta deal with multi-GPU systems

CoastalCoder
1 replies
23h4m

I don't have them bookmarked anymore, but they may have been from this playlist: [0]

[0] https://www.youtube.com/playlist?list=PLBnLThDtSXOw_kePWy3CS...

keefle
0 replies
9h59m

Thank you for sharing! I'll hunt it down

mackid
0 replies
9h20m

Mark did a good video on ChatGPT infra.

[1]. https://techcommunity.microsoft.com/t5/microsoft-mechanics-b...

synergy20
1 replies
1d

check out openbmc project and DTMF association

timoteostewart
0 replies
23h3m

DMTF (not DTMF)

https://www.dmtf.org/

nyrikki
0 replies
23h26m

Assuming you are moving past just the typical nonblocking folded-Clos networks or Little's Law, and want to have a more engineering focus, "queuing theory" is one discipline you want to dig into.

Queuing theory seems trivial and easy as it is usually introduced, but it has many open questions.

As an example, performance metrics for a system with random arrival times and independent service times, with k servers (M/G/k), are still an open question.

https://www.sciencedirect.com/science/article/pii/S089571770...

There are actually lots of open problems in queuing theory that one wouldn't expect.
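
For contrast, the fully Markovian single-server queue from the introductory chapters has tidy closed forms (standard results):

  L = \lambda W \ \text{(Little's law)}, \qquad L_{M/M/1} = \frac{\rho}{1-\rho}, \quad \rho = \lambda/\mu < 1

It's once the service-time distribution is general and there are k servers that the clean answers run out.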

mackid
0 replies
13h8m

Mark Russinovich gives a good talk most years on the internals of Azure and the systems that run it. [1] is an example. Look for talks from other years as well.

Meta also publishes a number of papers/blogs/OSS projects on their engineering site [2]

James Hamilton of AWS gives a talk most years on their infrastructure. Worth watching multiple years [3].

[1] https://youtu.be/69PrhWQorEM?si=u7vh_Um6SQNoyeFH

[2] https://engineering.fb.com/category/data-center-engineering/

[3] https://youtu.be/AyOAjFNPAbA?si=nFRJVcQI4EiamC-O

cavisne
0 replies
22h8m

This paper from Microsoft [1] is the coolest thing I've seen in this space. Basically workload (deep learning in this case) level optimization to allow jobs to be resized and preempted.

[1] https://arxiv.org/pdf/2202.07848.pdf

mkoubaa
7 replies
1d3h

UT Austin really is a fantastic institution for HPC and computational methods.

bee_rider
6 replies
1d2h

Every BLAS you want to use has at least some connection to UT Austin’s TACC.

victotronics
2 replies
23h19m

Not quite. Every modern BLAS is (likely) based on Kazushige Goto's implementation, and he was indeed at TACC for a while. But probably the best open-source implementation, "BLIS", is from UT Austin, though not connected to TACC.

bee_rider
1 replies
23h12m

Oh really? I thought BLIS was from TACC. Oops, mea culpa.

RhysU
0 replies
22h49m

https://github.com/flame/blis/

Field et al, recent winners of the James H. Wilkinson Prize for Numerical Software.

Field and Goto both collaborated with Robert van de Geijn. Lots of TACC interaction in that broader team.

mgaunard
2 replies
1d2h

aren't the lapack people in tennessee?

bee_rider
1 replies
1d1h

Sort of like BLAS, LAPACK is more than just one implementation. Dongarra described what everybody should do from Tennessee, but other places implemented it elsewhere.

mgaunard
0 replies
19h8m

plasma and magma are also from there.

I'm not aware of any other significant lapack-related developments, but I might just not know about them.

jebarker
4 replies
21h7m

I'm interested in what people think of the approach to teaching C++ used here. Any particular drawbacks?

I'm a very experienced Python programmer with some C, C++ and CUDA doing application level research in HPC environments (ML/DL). I'd really like to level up my C++ skills and looking through book 3 it seems aimed exactly at the right level for me - doesn't move too slowly and teaches best practices (per the author) rather than trying to be comprehensive.

leopoldj
3 replies
18h36m

C++ programmer and educator here. This (volume 3) is well-organized, good beginner-level teaching material. You probably know most of it already.

I was looking for range-based for loop, std::array and std::span and happy to see that they are all there.

Because this book relates to HPC, I'd add a few things: Return Value Optimization, move semantics, and in the recursive function section a note about Tail Call Optimization.
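
For the first two of those, a minimal sketch (my own toy example, not from the book): returning a large object by value relies on RVO/move rather than a deep copy, and std::move turns an assignment into a buffer swap.

  // move_rvo.cpp -- why returning by value is fine, and what std::move buys you.
  #include <cstdio>
  #include <utility>
  #include <vector>

  std::vector<double> make_grid(std::size_t n) {
      std::vector<double> v(n, 0.0);
      return v;                          // RVO / move on return: no deep copy
  }

  int main() {
      std::vector<double> grid = make_grid(1'000'000);   // constructed straight into 'grid'

      std::vector<double> sink;
      sink = std::move(grid);            // steals the buffer: O(1), not an element-wise copy
                                         // 'grid' is left valid but unspecified (typically empty)
      std::printf("sink=%zu grid=%zu\n", sink.size(), grid.size());
  }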

As beginner-level material, I can highly recommend it.

jebarker
2 replies
17h6m

That's great - thank you. Assuming I work through this quickly, what resources would you recommend as a follow-on?

rramadass
0 replies
15h10m

I am not the person you asked the question to, but my recommendation would be:

1) Discovering Modern C++: An Intensive Course for Scientists, Engineers, and Programmers by Peter Gottschling - Not too thick and focuses on how to program in the language.

2) Software Architecture with C++: Design modern systems using effective architecture concepts, design patterns, and techniques with C++20 by Adrian Ostrowski et al. - Shows how to use C++ in the modern way/ecosystems i.e. with CI/CD, Microservices etc.

Optional but highly recommended:

a) Scientific and Engineering C++: An Introduction with Advanced Techniques and Examples by Barton & Nackman - Old pre-Modern C++ book which pioneered many of the techniques which have now become common. One of the best for learning C++ design.

leopoldj
0 replies
4h54m

Instead of giving you a list of books, I'll give you a list of topics to learn well. They are listed in a proper learning sequence; a small sketch touching a few of them follows the list.

- Modern object initialization using {} and ().

- std::string_view

- std::map

- std::stack

- Emplace addition of objects to containers like vector and map.

- Smart pointers (std::unique_ptr, std::shared_ptr and their ilk).

- Ranges library.

- Concurrency support library (std::async, std::future, std::thread, locks and the whole deal).
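
And the small, self-contained C++17 sketch mentioned above, touching a few of those items (a toy of my own, not from any book):

  // topics_demo.cpp -- string_view, std::map, emplace, and unique_ptr in one toy.
  #include <cstdio>
  #include <map>
  #include <memory>
  #include <string>
  #include <string_view>
  #include <utility>
  #include <vector>

  struct Job {
      std::string name;
      int nodes;
      Job(std::string n, int k) : name(std::move(n)), nodes(k) {}
  };

  // std::string_view: a cheap, non-owning view of a string.
  void report(std::string_view label, const Job& j) {
      std::printf("%.*s: %s on %d nodes\n",
                  static_cast<int>(label.size()), label.data(),
                  j.name.c_str(), j.nodes);
  }

  int main() {
      std::vector<Job> queue;
      queue.emplace_back("lattice", 64);                   // emplace: construct in place

      std::map<std::string, int> running;                  // std::map: ordered key -> value
      running.emplace("lattice", 64);

      auto owned = std::make_unique<Job>("postproc", 1);   // unique_ptr: sole ownership, no delete
      report("queued", queue.front());
      report("owned", *owned);
  }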

toddm
2 replies
23h47m

Kudos to Victor for assembling such a wonderful resource!

While I am not acquainted with him personally, I did my doctoral work at UT Austin in the 1990s and had the privilege of working with the resources (Cray Y-MP, IBM SP/2 Winterhawk, and mostly Lonestar, a host name which pointed to a Cray T3E at the time) maintained by TACC (one of my Ph.D. committee members is still on staff!) to complete my work (TACC was called HPCC and/or CHPC if I recall the acronyms correctly).

Back then, it was incumbent on the programmer to parallelize their code (in my case, using MPI on the Cray T3E in the UNICOS environment) and have some understanding of the hardware, if only because the field was still emergent and problems were solved by reading the gray Cray ring-binder and whichever copies of Gropp et al. we had on-hand. That and having a very knowledgeable contact as mentioned above :) of course helped...

victotronics
0 replies
23h41m

Lonestar, a host name which pointed to a Cray T3E

Lonestar5 was a Cray again. Currently Lonestar6 is an oil-immersion AMD Milan cluster with A100 GPUs. The times, they never stand still.

huitzitziltzin
0 replies
15h42m

Dealt with him via TACC for a big simulation I did and was grateful enough for his help to buy a paper copy of the first volume in the series. Very interesting though a bit outside of my area. I will look at the others and encourage anyone interested to check them out.

teleforce
2 replies
1d2h

Is there something wrong with the GitHub files? I cannot render any of the textbooks' PDF files.

https://github.com/VictorEijkhout/TheArtofHPC_pdfs/blob/main...

npalli
1 replies
1d2h

I think the files are too large to render in the github browser and they give an error. You can pick the 'download raw' option to download locally and read the file. Worked for me.

TimMeade
0 replies
23h35m

I just "git clone https://github.com/VictorEijkhout/TheArtofHPC_pdfs.git" on my local drive. Had it all in under a minute.

rramadass
0 replies
1d1h

Just amazed at how the author has created (and shared for free) such a comprehensive set of books including teaching C++ and Unix tools! There is something to learn for all Programmers (HPC specific or not) here.

Related: Jorg Arndt's "Matters Computational" book and FXT library - https://www.jjj.de/fxt/

justin66
0 replies
23h57m

There is some really good content here for any programmer.

And with volume 3, such a contrast: the author teaches C++17 and... Fortran2008.

guenthert
0 replies
7h1m

When I joined a small company supporting the engineers of a large car manufacturer's HPC, I was surprised to see so many in-house developed scripts around the scheduler (LSF). Only much later, when playing with a private miniature cluster running SLURM myself, did I notice that different versions of the scheduler software were generally incompatible with each other, i.e. one couldn't use one version inside the cluster and another on an external client machine. Hence the need for glue software to inject jobs into the scheduler from outside and retrieve the results later on (IMHO devaluing the scheduler).

I would have thought that after some 30 years of high-performance distributed computing, the requirements would be well known and at least the protocol for command and data exchange could be fixed. Apparently not so.

davidthewatson
0 replies
1d3h

I was asked to share a TA role on a graduate course in HPC a decade ago. I turned down the offer.

After a cursory glance, I can honestly say that if this book were available then, I'd have taken the opportunity.

The combination of what I perceive to be Knuth's framing of art, along with carpentry and the need to be a better devops person than your devops person is compelling.

Kudos to the author for such an achievement. UT Austin seems to have achieved in computer science what North Texas State did in music.

atrettel
0 replies
1d2h

I took a course on scientific computing in 2013. It was cross-listed under both the computer science and applied math departments. The issue is that the field is pretty broad overall and a lot of topics were covered in a cursory manner, including anything related to HPC and parallel programming in particular. I don't regret taking the course, but it was too broad for the applications I was pursuing.

I haven't looked at what courses are being offered in several years, but when I was a graduate student, I really would have benefited from a dedicated semester-long course on parallel computing, especially going into the weeds about particular algorithms and data structures in parallel and distributed computing. Those were handled in a super cursory manner in the scientific computing course I took, as if somehow you'd know precisely how to parallelize things the first time you try. I've since learned a lot of this stuff on my own and from colleagues over the years, as many people do in HPC, but books like these would have been invaluable as part of a dedicated semester-long course.