It's very interesting how abstracted away from hardware HPC sometimes looks. The books seem to revolve a lot around SPMD programming, algorithms & data structures, task parallelism, synchronization, etc., but say very little about computer architecture details like supercomputer memory subsystems, high-bandwidth interconnects like CXL, GPU architecture, and so on. Are the abstractions and tooling already good enough that you don't need to worry about these details? I'm also curious whether HPC practitioners have to fiddle with a lot of black-box knobs to squeeze out performance.
You'd be surprised how backwards and primitive the tools used in HPC actually are.
Take for instance the so-called workload managers, of which the most popular are Slurm, PBS, UGE and LSF. Only Slurm is really open source, PBS has a community edition, and the rest are proprietary stuff executed in the best traditions of enterprise software, which locks you into pathetically bad tools and ancient, backwards tech with crappy or nonexistent documentation and inept tech support.
The interface between WLMs and a user who wants some resources is to submit "jobs". These jobs can be interactive, but most often they are so-called "batch jobs". A batch job is usually defined as... a Unix shell script, where comments are parsed and interpreted as instructions to the WLM. In a world with dozens of configuration formats... they chose to do this: embed configuration into shell comments.
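For anyone who hasn't seen one, here's a minimal sketch of what a Slurm batch script looks like (the partition name, resource values, program and input file are all made up):

```
#!/bin/bash
# The #SBATCH lines below are ordinary shell comments,
# but the WLM parses them as job configuration.
#SBATCH --job-name=example
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --time=01:00:00
#SBATCH --partition=compute

# From here on it's just a shell script, run on the allocated nodes.
srun ./my_simulation input.dat
```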
Debugging job failures is a nightmare, mostly because WLM software has really poor quality of execution. Pathetic error reporting. Idiotic defaults. Everything is so fragile it falls apart if you so much as look at it the wrong way. Working with it reminds me of the very early days of Linux, when sometimes things just wouldn't build, or would segfault right after you tried running them, and there wasn't much you could do besides spending days or weeks debugging just to get some basic functionality going.
When I have to deal with it, I feel like I'm in a steampunk movie. Some stuff is really advanced, and then you find out that this advanced stuff is propped up by DIY retro nonsense you thought had died off decades ago. The advanced stuff is usually more on the hardware side, while the software mostly isn't keeping up with it.
You use a lot of scare quotes. Do you have any suggestions on how things could be different? You need batch jobs because the scheduler has to wait for resources to be available. It's kinda like Tetris in processor/time space. (In fact, that's my personal "proof" that workload scheduling is NP-complete: it's isomorphic to Tetris.)
And what's wrong with shell scripts? It's a lingua franca, generally accepted across scientific disciplines, cluster vendors, workload managers, and so on. Considering the complexity of some setups (copy data to node-local file systems, run multiple programs, post-process results, ...), I don't see how you could set things up other than in some scripting language. And for that, Unix shell scripts are not the worst idea.
Debugging failures: yeah. Too many levels where something can go wrong, and it can be a pain to debug. Still, your average cluster processes a few million jobs in its lifetime. If more than a microscopic portion of that would fail, computing centers would need way more personnel than they have.
When used as configuration? Here are some things that are wrong:
* Configuration forced into a single line makes long entries inconvenient to write (for example, if you want Slurm with Pyxis and you need to specify the image name, the line will most likely not fit on the screen).
* Oh, and since we're mentioning Pyxis: its image names have a pound sign in them, and now you also need to figure out how to escape it, because for some reason, used literally, it breaks the comment parser.
* No syntax highlighting (because it's all comments).
* No way to create more complex configuration, i.e. no way to have any types other than strings, no way to have variables, no way to have collections of things.
* No way to reuse configuration (you have to copy it from one job file to another). I honestly don't even know what happens if you try to source a job configuration file from another job configuration.
All in all, it's really hard to imagine a worse configuration format. This sounds like a solution from some sort of code-golfing competition where the goal was to make it as bad as possible while still retaining some shreds of functionality.
HPC software is one area where we have arguably regressed in the last 30 years. Chapel is the only light I see in the darkness.
Want to elaborate more on Chapel? I've recently been tasked with integrating Chapel into our system and it's quite interesting.
Having switched from LSF to slurm, I have to appreciate that the ecosystem is so bash-centric. Lots of re-use in the conversion. If I’d had to learn some kind of slurm-markup-language or slurmScript or find buttons in some SlurmWizard, it would have been a nightmare.
Oh LSF... I don't know if you know this. LSF is perhaps the only system alive today that I know of that uses literal patches as a means of software distribution.
First time I saw it, I had a flashback to my time at HP, when they were making some huge SAP knock-off, and that system was so labor-intensive to deploy that their QA process involved actual patches. As in, the pre-release QA cycle involved installing the system, validating it (which could take a few weeks), and if it wasn't considered DoD, the developers were given the final list of things they needed to fix, and those fixes had to be submitted as patches (sometimes literal diffs that needed to be applied to the deployed system with the patch tool).
This is, I guess, how the "patch" version component came to be in the SemVer spec. It's kind of funny how lots of tools use this component today for completely unrelated purposes... but yeah, at LSF it feels like time is ticking at a different pace :)
The other cool thing about HPC is it is one of the last areas where multi-user Unix is used! At least, if you're using a university or NSF cluster that is!
Only other place I really see multiple humans using the same machine is SDF or the Tildes
It's Saturday afternoon.
I really like using Slurm, the documentation is great (https://slurm.schedmd.com) and the model is pretty straightforward, at least for the mostly-single-node jobs I used it for.
You can launch jobs via the command line, config in Bash comments, REST APIs, linking against their library, and I think a few more ways.
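For example, the pure command-line route is as simple as this (resource values and the script name are invented for illustration):

```
# One-liner submission, no script comments involved
sbatch --job-name=train --nodes=1 --cpus-per-task=8 --gres=gpu:1 --time=04:00:00 train.sh

# Or grab an interactive shell on a compute node
srun --pty --cpus-per-task=4 --time=01:00:00 bash
```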
I found it pretty easy to setup and admin. Scaling in the cloud was way less developed when I used it, so I just hacked in a simple script that allowed scaling up and down based on the job queue size.
What do you like better and for what use-case? Mine was for a group of researchers training models, and the feature I desired most was an approximately fair distribution of resources (cores, GPU hours, etc.).
I've dug deeply into LSF in the last few years and it's like a car crash -- you can't look away. It feels like something that started in the early Unix days and was developed into perhaps the late 90s, when in reality LSF only started in the 90s (in academia). As far as I can tell, development all but stopped when IBM acquired it some ten years ago.
I started in HPC about 2 years ago on a ~500 node cluster at a Fortune 100 company. I was really just looking for a job where I was doing Linux 100% of the time, and it's been fun so far.
But it wasn't what I thought it would be. I guess I expected to be doing more performance oriented work, analyzing numbers and trying to get every last bit of performance out of the cluster. To be honest, they didn't even have any kind of monitoring running. I set some up, and it doesn't really get used. Once in a while we get questions from management about "how busy is the cluster", to justify budgets and that sort of thing.
Most of my 'optimization' work ends up being things like making sure people aren't (usually unknowingly) requesting 384 CPUs when their script only uses 16, testing software to see how many CPUs it scales to before performance degrades, etc. I've only had the Intel profiler open twice.
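Most of that detective work is just Slurm accounting queries, roughly along these lines (the job ID is made up):

```
# CPU/memory efficiency summary for a finished job (seff ships in Slurm's contribs)
seff 1234567

# Or pull the raw numbers: allocated CPUs vs. CPU time actually used
sacct -j 1234567 --format=JobID,AllocCPUS,TotalCPU,Elapsed,MaxRSS
```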
And I've found that most of the job is really just helping researchers and such with their work. Typically it's running a commercial or open-source program and troubleshooting it, or taking code written by another team on another cluster and getting it built and running on yours. Slogging through terrible Python code. Trying to get a C++ project from a more modern cluster built in a CentOS 7 environment.
It can be fun in a way. I've worked with different languages over the years so I enjoy trying to get things working, digging through crashes and stack traces. And working with such large machines, your sense of normal gets twisted when you're on a server with 'only' 128GB of RAM or 20TB of disk.
It's a little scary when you know the results of some of this stuff are being used in the real world, and the people running the simulations aren't even doing things right. Incorrect code, mixed-up source code, not using the data they think they are... I once found a huge bug that had existed for 3 years. Doesn't that invalidate all the work you've done on this subject?
The one drawback I find is that a lot of HPC jobs want you to have a master's degree, even just to run the cluster. Doesn't make sense to me: I'm not writing the software you're running, and we aren't running some state-of-the-art TOP500 cluster. We're just getting a bunch of machines networked together and running some code.
I always found that funny too. A business who needs a powerful computing solution can come up with some amazingly robust stuff, whereas science/research just buys a big mainframe and hopes it works.
Until recently I was working at a company that had been spun out of a university, and it was shocking how hopeless the researchers were. I've always been critical of how poor job security in academia is, but you'd think it's still too much given how slapdash some of the crap you see is. We basically had to reinvent their product from the ground up. Awful.
This is probably a naive question but isn't that the point of having developers on staff? The researchers aren't coders and vice versa, so having researchers produce prototypes that are productized by engineers makes sense to me.
Exactly! This is how it should be.
Researchers/Scientists with their hard earned PhDs should only concentrate on doing cutting-edge "researchy" stuff. It is hard enough that they should not be asked to learn all the intricacies/problems inherent in Software Development. That is the domain of a "Professional Software Engineer".
There is now in fact a new class of role called "Research Software Engineer": Software Developers working in research, developing code specific to its needs -- https://www.nature.com/articles/d41586-022-01516-2 and https://en.wikipedia.org/wiki/Research_software_engineering
I've had very similar experiences working with former researchers including at a university spinout. Mechanical rather than CS. It was perplexing how they still carried the elitism that industry was mostly for people who can't hack it in academia given the quality of their work. Would be unacceptable coming from a new hire PD engineer at Apple yet you're demanding respect because you used to lead a whole lab apparently producing rubbish?
Is it possible that pretty much any specialization, outside of the most common ones, engages in a lot of gatekeeping? I remember how difficult it appeared to be after I graduated to break into embedded systems (I never did). I persisted until I realized it doesn't even pay very well, comparatively.
There is a lot of abstraction, but knowing which abstraction to use still takes knowing a lot about the hardware.
In my experience with CUDA developers, yes, the Shmoo Plot (https://en.wikipedia.org/wiki/Shmoo_plot, sometimes called a 'wedge' in some industries) is one of the workhorses of everyday optimization. I'm not sure I'd call it black-box, though maybe the net effect is the same. It's really common to have educated guesses and to know what the knobs do and how they work, and still find big surprises when you measure. The first rule of optimization is: measure. I always think of Michael Abrash's first chapter in the "Black Book": "The Best Optimizer is Between Your Ears" http://twimgs.com/ddj/abrashblackbook/gpbb1.pdf. This is a fabulous snippet of the philosophy of high performance (even though it's PC-game centric and not about modern HPC).
Related to your point about abstraction, the heaviest knob-tuning should get done at the end of the optimization process, because as soon as you refactor or change anything, you have to do the knob tuning again. A minor change in register spills or cache access patterns can completely reset any fine-tuning of thread configuration or cache or shared memory size, etc. Despite this, some healthy amount of knob tuning is still done along the way to check, balance, and get an intuitive sense of the local perf space of the code. (Just noticed Abrash talks a little about why this is a good idea.)
Could you explain how you use a shmoo plot for optimization? Do you just have a performance metric at each point in parameter space?
The shmoo plot is just the name for measuring something (such as perf) over a range of parameter space. The simplest and most straightforward application is to pick a parameter or two whose values you're unsure of, do the shmoo over the range of parameter space, and then set the knobs to whatever values give you the optimal measurement.
Usually though, you have to iterate. Doing shmoos along the way can help with understanding the effects of code changes, help understand how the hardware works, and it can sometimes help identify what code changes you might need to make. A simple abstract example might be I know what my theoretical peak bandwidth is, but my program only gets 30% of peak. I suspect it has to do with how many registers are used, and I have a knob to control it, so I turn the knob and plot all possible register settings, and find out that I can get 45% of peak with a different value. Now I know it was partially registers I was limited by, but I also know to look for something else too. Then I profile, examine the code, maybe refactor or adjust some things, hypothesize, test, and then shmoo again on a different knob or two if I suspect something else is the bottleneck.
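Mechanically it can be as dumb as a loop around a build-and-measure cycle. A rough sketch of sweeping the CUDA register cap, matching the register example above (the kernel/binary names, flag values, and the assumption that the benchmark prints its own runtime are all invented):

```
# Rebuild with different per-thread register caps and record a timing for each
for regs in 32 40 48 56 64 72 80; do
  nvcc -O3 --maxrregcount=$regs kernel.cu -o bench
  ms=$(./bench)                      # assume bench prints its runtime
  echo "$regs,$ms" >> shmoo.csv      # plot this CSV and you have the shmoo
done
```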
HPC admin here, generally serving "long tail of science" researchers.
In today's x86_64 hardware there's no "supercomputer memory subsystem". It's just a glorified NUMA system, and the biggest problem is putting memory close to your cores, i.e. keeping data local to your NUMA node to reduce latencies.
Your resource mapping is handled by your scheduler. It knows your hardware, so it creates a cgroup that satisfies your request as optimally as possible, stuffs your application into that cgroup, and runs it.
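If you want to poke at that locality yourself on a node, something like this works (the binary name is just a placeholder):

```
# Show the node's NUMA layout: which cores and how much memory sit on each node
numactl --hardware

# Pin a run to NUMA node 0, for both CPUs and memory allocations
numactl --cpunodebind=0 --membind=0 ./my_app
```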
The current king of high-performance interconnects is InfiniBand, and it accelerates MPI at the fabric level. You can send messages, broadcast, and reduce results like there's no tomorrow, because by the time a message arrives at your node it's already reduced, and when you broadcast, you send a single message that gets fanned out at the fabric layer. Multi-context IB cards have many queues, so more than one MPI job can run on the same node/card with queue/context isolation.
If you're using a framework for GPU work, the architecture and optimization are handled at that level automatically (the framework developers generally do the hard work). NVIDIA's drivers are pure black magic, too, and handle some of the optimization themselves. Inter-GPU communication is handled by a physical fabric, managed by the drivers and its own daemon.
If you're CPU bound, your libraries are generally hand-tuned by their vendors (Intel MKL, BLAS, Eigen, etc.). I personally used Eigen, and it has processor-specific hints and optimizations baked in.
The things you have to worry about are compiling your code for the correct architecture and making sure the hardware you run on can satisfy your demands (i.e. don't make too many random memory accesses, keep the prefetcher and branch predictor happy if you're trying to go "all-out fast" on the node, don't abuse disk access, etc.).
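The "correct architecture" part is usually just a compiler flag away. For example with GCC (flags and file names chosen for illustration):

```
# Build for the exact microarchitecture of the machine doing the compiling
gcc -O3 -march=native -o solver solver.c

# Or target the compute nodes explicitly when the login node differs from them
gcc -O3 -march=skylake-avx512 -o solver solver.c
```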
On the number-crunching side, the keys are keeping things independent (so they can be instruction-level parallelized/vectorized), making sure you're not doing unnecessary calculations, and not abusing MPI (reducing inter-node talk to only the necessary chatter).
It's way easier said than done, but when you get the hang of it, thinking about these things becomes second nature -- if this kind of thing is your cup of tea.
Thanks for the thoughtful comment, pretty fascinating stuff.
I mean, memory topology varies greatly by uarch (doubly so between vendors). I can't take a routine tuned for Nehalem, run it on Haswell or Skylake, and expect it to stay competitive. More generally, different hardware has different bandwidth and latency ratios, which affects software design (e.g. software written for a commodity Dell box with PCIe cards probably won't translate to a Cray accelerator grid connected by HPE Slingshot). And then there are hardware-specific features like RNICs bypassing DRAM and writing RDMA messages directly into the receiver's cache. So I think ccNUMA and data locality are not sufficient to reason about memory perf.
You're absolutely right, and that's why I said that if you're using libraries, this burden is generally handled by them. Compilers also handle this very well.
If you're writing your own routines, the best way is to read the arch docs, maybe some low-level sites like Chips and Cheese, do some synthetic benchmarks, and write your code in a semi-informed way.
After writing the code, a suite of cachegrind, callgrind and perf is in order. See if there are any other bottlenecks, and tune your code accordingly. Add hints for your compiler, if possible.
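Concretely, something along these lines (the binary name is a placeholder):

```
# Hardware counters: IPC, cache misses, branch misses
perf stat -e cycles,instructions,cache-misses,branch-misses ./my_solver

# Simulated cache behaviour, annotated per source line
valgrind --tool=cachegrind ./my_solver
cg_annotate cachegrind.out.*

# Inclusive call-graph costs, to see where the time actually goes
valgrind --tool=callgrind ./my_solver
callgrind_annotate callgrind.out.*
```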
I was able to reach insane saturation levels with Eigen plus some hand-tuned code. For the next level I would have needed to change my matrix ordering, but it was already fast enough (from 30 minutes down to 45 seconds, a 40x speedup), so I left it there.
Sometimes there's no replacement for blood, sweat and tears in this kind of work.
I have never played with custom interconnects (Slingshot, etc.), yet, so I can't tell much.
Yes and no.
MPI and OpenMP are the primary abstractions from the hardware in HPC, with MPI being an abstracted form of distributed-memory parallel computing and OpenMP being an abstracted form of shared-memory parallel computing. Many researchers write their codes purely using those, often both in the same code. When using those, you really do not need to worry about the architectural details most of the time.
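As a rough illustration of how little of the machine leaks through: a hybrid MPI+OpenMP code is usually just launched with a rank/thread layout and the runtime handles the placement. A sketch of the Slurm launch side (the layout numbers and solver name are made up):

```
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=2    # MPI ranks per node
#SBATCH --cpus-per-task=16     # OpenMP threads per rank

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./hybrid_solver
```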
Still, some researchers who like to further optimize things do in fact fiddle with a lot of small architectural details to increase performance further. For example, loop unrolling is pretty common and can get quite confusing in my opinion. I vaguely recall some stuff about trying to vectorize operations by preferring addition over multiplication due to the particular CPU architecture, but I do not think I've seen that in practice.
Preventing cache misses is another major one, where some codes are written so that the most needed information is stored in the CPU's cache rather than memory. Most codes only handle this by ensuring column-major order loops for array operations in Fortran or row-major order loops in C, but the concept can be extended further. If you know the cache size for your processors, you could hypothetically optimize some operations to keep all of the needed information inside the cache to minimize cache misses. I've never seen this in practice but it was actively discussed in the scientific computing course I took in 2013.
The use of particular GPUs depends heavily on the problem being solved, with some being great on GPUs and others being too difficult. I'm not too knowledgeable about that, unfortunately.
Of course, not every problem can be solved by BLAS, but if you are doing linear algebra, the cache stuff should be mostly handled by BLAS.
I'm not sure how much multiplication vs. addition matters on a modern chip. You can have a bazillion instructions in flight, after all, as long as they don't have any dependencies, so I'd go with whichever option shortens the data dependencies on the critical path. The computer will figure out where to park longer instructions if it needs to.
You're right that the addition vs. multiplication issue likely does not matter on a modern chip. I just gave the example because it shows how the CPU architecture can affect how you write the code. I do not recall precisely when or where I heard the idea, but it was about a decade ago --- ages ago by computing standards.
I wrote scientific simulation software in academia for a few years. None of us writing the software had any formal software engineering training above what we’d pieced together ourselves from statistics courses. We wrote our simulations to run independently on many nodes and aggregated the results at the end, no use of any HPC features other than “run these 100 scripts on a node each please, thank you slurm”. That approach worked very well for our problem.
I’d bet a significant part of compute work on HPC clusters in academia works the same way. The only thing we paid attention to was number of cores on the node and preferring node local storage over the shared volumes for caching. No MPI.
There are of course problems requiring “genuine” HPC clusters but ours could have run on any pile of workers with a job queue.
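For what it's worth, that whole "run these 100 scripts on a node each" pattern usually fits in a Slurm job array. A sketch, with the script name, its flags, the counts and the paths all invented:

```
#!/bin/bash
#SBATCH --array=1-100
#SBATCH --nodes=1
#SBATCH --time=12:00:00

# Each array element is an independent job; only the index differs.
./run_simulation --seed "$SLURM_ARRAY_TASK_ID" \
    --out "/tmp/result_${SLURM_ARRAY_TASK_ID}.dat"

# Copy the result off node-local storage when done.
cp "/tmp/result_${SLURM_ARRAY_TASK_ID}.dat" "$HOME/results/"
```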
That's often the ideal case. Individual tasks are small enough to run on commodity hardware but large enough that you don't have an excessive number of them. That means you can write simple software without wasting effort on distributed computing.
I've seen similar things at the intersection of bioinformatics and genomics. Computers are getting bigger but the genomes aren't, and tasks that require distributed computing are getting rare.
Regardless of what you do, domain knowledge tends to be more valuable than purely technical skills.
Knowing more numerical analysis will probably get you further in HPC than knowledge of specific hardware architectures.
Ideally you want both, of course.
It's not intuitive, but HPC is more about scalability than performance.
You won't be able to use a supercomputer at all without scalability, and it's the one topic that is specific to it. But of course those computers' time is quite expensive, so you'll want to optimize for performance too. It's just secondary.
For most HPC, you will not be able to maximize parallelism and throughput without intimate knowledge of the hardware architecture and its behavior. As a general principle, you want the topology of the software to match the topology of the hardware as closely as possible for optimal scaling behavior. Efficient HPC software is strongly influenced by the nature of the hardware.
When I wrote code for new HPC hardware, people were always surprised when I asked for the system hardware and architecture docs instead of the programming docs. But if you understood the hardware design, the correct way of designing software for it became obvious from first principles. The programming docs typically contained quite a few half-truths intended to make things seem misleadingly easier for developers than a proper understanding would suggest. In fact, some HPC platforms failed in large part because they consistently misrepresented what was required from developers to achieve maximum performance in order to appear "easy to use", and then failed to deliver the performance the silicon was capable of when you actually wrote software the way the marketing implied would be effective.
You can write HPC code on top of abstractions, and many people do, but the performance and scaling losses are often an unavoidable integer factor. As with most software, this was considered an acceptable loss in many cases if it allowed less capable software devs to design the code. HPC is like any other type of software in that most developers who notionally specialize in it struggle to produce consistently good results. Much of the expensive hardware used in HPC is there to mitigate the performance losses of weaker software designs.
In HPC there are no shortcuts to actually understanding how the hardware works if you want maximum performance. Which is no different from regular software; in HPC the hardware systems are just bigger and more complex.
No, the abstractions are not sufficient. We do care about these details, a lot.
Of course, not every application is optimized to the hilt. But if you do want to optimize an application that way, exactly the things you're talking about come into play.
So yes, I would expect every competent HPC practitioner to have a solid (if not necessarily intimate) grasp of hardware architecture.
I don’t think I do HPC (I only will use up to, say, 8 nodes at a time), but the impression I get is that they are already working on quite hard problems at the high-level, so they need to lean on good libraries for the low-level stuff, otherwise it is just too much.
Memory architecture and bandwidth are still very important, most of IBM's latest performance gains for both mainframes and POWER are reliant on some novel innovations there.