I worked on a port of a Konami game from PSX to PC, that was 1999-2000 - the code (C) had lots of #ifdefs marking where inline assembly had been dropped in, with the original "C" code kept alongside. It seemed to be all done by one specific programmer, and what really saved us in porting the game was that originally kept "C" code. My MIPS knowledge was never as good as my x86.
So yes, it was the norm back then. My second job (1998) was on a team that was going to do some software for Intel for the then-upcoming Katmai processor (the Pentium III). It had all the new fancy SIMD instructions. The software was supposed to be something like a media composer - you slap images down, rotate them, etc., all in realtime using software rendering (GPUs were still relatively expensive).
I wrote a bilinear and bicubic texture mapper with marching squares for transparent areas. It was all in assembly, and I spent a lot of time optimizing it. Back then we used Intel's VTune, and it was super precise (for the processors of the day) - how instructions would pipeline, how many cycles each would (supposedly) take, the waits, etc. That helped a lot!
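For anyone who hasn't written one: the bilinear part, stripped of the fixed-point and assembly tricks, is roughly this (a from-memory C sketch, not the original code):

    /* Minimal bilinear sample from an 8-bit texture. u and v are texel
       coordinates; floats here for clarity, the real thing was fixed-point. */
    unsigned char bilinear_sample(const unsigned char *tex, int w, int h,
                                  float u, float v)
    {
        int x0 = (int)u, y0 = (int)v;
        int x1 = (x0 + 1 < w) ? x0 + 1 : x0;
        int y1 = (y0 + 1 < h) ? y0 + 1 : y0;
        float fx = u - x0, fy = v - y0;

        float a = tex[y0 * w + x0], b = tex[y0 * w + x1];
        float c = tex[y1 * w + x0], d = tex[y1 * w + x1];

        /* Lerp horizontally on both rows, then vertically between them. */
        float top = a + (b - a) * fx;
        float bot = c + (d - c) * fx;
        return (unsigned char)(top + (bot - top) * fy);
    }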
But the real lesson was that my manager and I - both claiming to be super good at assembly (after our recent achievements) - rewrote the sphere-mapping code for a game another team was writing in assembly, but alas our assembly (no Katmai instructions, though) was slower than what the compiler produced ;) - TBH, if we had done proper mipmapping and texture swizzling we would have fared better either way, but hey, demo coders were not always to be found, so they had to rely on regular programmers like us!
flipcode keeps a lot of good articles, with lots of good assembly for that - https://www.flipcode.com/archives/articles.shtml - there were even better materials from earlier years, but can't find them.
Turbo/Borland Pascal were so awesome because they made inline assembly much easier to use (somehow) than C/C++ did - though you had to know which registers you could and couldn't touch.
It was always so disappointing, spending hours coding up a tightly wound assembly language version of some inner loop that uses half the instructions generated by the C++ compiler, only to find that your slaved-over version is actually 5% slower. But OTOH... the thrill when it actually was faster!
This was back in the Pentium 4 era, where there were deep pipelines and oddities like some simple ALU instructions (ADD, SUB, etc.) taking 0.5 cycles(!), while others (ADC, SHR) took 4 cycles IIRC.
Is there a tool that could profile/predict ahead of time, so that one doesn't attempt to hand-write assembly before knowing for sure it will beat the compiled version?
There was Intel VTune, which I heard was good, though I haven't used it myself. One difficulty is that there are many non-obvious and hard-to-predict factors that interact to produce pipeline stalls. Instructions had specified throughputs and latencies (throughput being the number of cycles before another independent instruction of that type could be initiated; latency being the number of cycles before its output could be used by another instruction), but that was only part of the story. Was that memory read from L1 cache? L2? Main memory? Is this conditional branch predictable? Which of the several applicable execution units will this micro-op get sent to? There were also occasional performance cliffs (alternating memory reads that were exactly some particular power of 2 apart would alias in the cache, leading to worst-case cache behaviour; tight loops that did not begin on a 16-byte boundary would confuse the instruction prefetcher on some CPUs...)
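To make the latency/throughput distinction concrete, here's a hypothetical C example (not from the discussion above): a single-accumulator sum is limited by the add's latency, because every add waits on the previous one, while splitting the work across independent accumulators lets throughput dominate.

    /* Latency-bound: each add depends on the previous result, so the loop
       runs at roughly one add *latency* per element. */
    float sum_single(const float *a, int n)
    {
        float s = 0.0f;
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Throughput-bound: four independent dependency chains let the CPU
       overlap the adds, approaching one add *throughput* per element. */
    float sum_quad(const float *a, int n)
    {
        float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int i;
        for (i = 0; i + 3 < n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        for (; i < n; i++)      /* leftover elements */
            s0 += a[i];
        return (s0 + s1) + (s2 + s3);
    }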
I may be getting x86 CPU generations mixed up. But having wrestled with all this, I can certainly see the appeal of hand-optimising for older, simpler CPUs like the 6510 used in the C64, where things were a lot more deterministic.
VTune still exists and has been free for a few years now. A neat thing about VTune is that it has support for a few runtimes, so it understands for example CPython internals, to the point that stack traces can be a mixture of languages. That's something only now becoming available outside of VTune - Python 3.12, for instance, has some hooks for Linux perf.
A purely static, no-execution tool was IACA; you simply inserted some marker macros around a bit of code in the binary and IACA simulated how those instructions would/could be scheduled on a given core: https://stackoverflow.com/questions/26021337/what-is-iaca-an...
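Usage was pleasantly simple, if I remember it right: wrap the region of interest in the markers from iacaMarks.h, compile, and point the iaca binary at the object file (a rough sketch; the architecture name below is just an example):

    #include "iacaMarks.h"   /* ships with IACA; defines IACA_START / IACA_END */

    void scale(float *dst, const float *src, float k, int n)
    {
        for (int i = 0; i < n; i++) {
            IACA_START       /* marker bytes: IACA analyses the marked region as a loop body */
            dst[i] = src[i] * k;
        }
        IACA_END             /* end of marked region */
    }

    /* Then something like:  iaca -arch HSW scale.o  */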
Didn't know VTune is free now, nor had I ever heard of IACA which looks very nice (and would have saved me a lot of brow-sweat)! Thanks.
Note that there's also the open-source uiCA [0], which similarly predicts scheduling and overall throughput for a basic block. Their benchmarks claim it to be more accurate than IACA and other tools for newer Intel CPUs, but I wouldn't be qualified to judge those claims.
[0] https://uops.info/uiCA.html
For small pieces of code I would try to use a superoptimizer like souper.
https://github.com/google/souper
It looks like it only supports Linux and macOS - no Windows, and nothing else like mobile either.
It seems it has existed for ten years; I wonder which of its optimizations still haven't been picked up by recent compilers.
Compilers need to balance compilation speed with optimization. SMT solvers are right out for speed reasons.
How hard this is varies - from trivial, to very hard, to mostly data-dependent - with different architectures. llvm-mca might be of interest.
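If anyone wants to try llvm-mca, its region markers are just assembly comments, so you can emit them from C via inline asm and pipe the compiler's assembly output through the tool (a sketch; the CPU name and flags are only examples, and the markers can perturb optimization a bit):

    /* dot.c - analyse only the inner loop.
       Example invocation:
         clang -O2 -S -o - dot.c | llvm-mca -mcpu=skylake -timeline */
    float dot(const float *a, const float *b, int n)
    {
        float s = 0.0f;
        __asm volatile("# LLVM-MCA-BEGIN dot-loop");  /* region start marker */
        for (int i = 0; i < n; i++)
            s += a[i] * b[i];
        __asm volatile("# LLVM-MCA-END");             /* region end marker */
        return s;
    }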
One should be able to do a best-case calculation, mostly assuming caches hit and branch prediction gets the answer right. Register renaming manages to stay out of the way.
Getting more dubious, there is a statistical representation of program performance on unknown (or partially known) data. One might be able to estimate that usefully, though I haven't seen it done.
Are there any good resources on using mipmapping and swizzling effectively?
I was actually trying to find it - there were lots of .txt files published back then - and there was one about texture mapping from 1992... 1994? - and it explained swizzling and why it was efficient with caches.
You might be thinking of the article written by Pascal of Cubic Team. I think the archive is called pasroto.zip and it explained why rotozoomer performance would tank when the texture was rotated by 90 degrees (since you're stepping in the Y direction and blowing out your cache for every pixel). Really interesting at the time.
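For anyone who never wrote one, the inner loop makes the problem obvious (an illustrative sketch, not Pascal's code): per screen pixel, (u,v) steps by a rotated/scaled delta, so at 0 degrees the texel reads are sequential, while at 90 degrees every read lands a full texture row apart and each pixel touches a fresh cache line.

    /* Rotozoomer inner loop for one screen row, 256x256 texture. */
    void rotozoom_row(unsigned char *dst, const unsigned char *tex,
                      int width, float u, float v, float du, float dv)
    {
        for (int x = 0; x < width; x++) {
            int tu = (int)u & 255;      /* wrap U by masking */
            int tv = (int)v & 255;      /* wrap V by masking */
            dst[x] = tex[tv * 256 + tu];
            u += du;                    /* du = cos(angle) * zoom */
            v += dv;                    /* dv = sin(angle) * zoom */
        }
    }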
Now I remember - back then these texture mappers were called rotozoomers. Alex Champandard did an excellent series here ->
https://www.flipcode.com/archives/The_Art_of_Demomaking-Issu... - look for zip file with the source code.
I've also found a Go version (and there were 2 Java ones)
https://github.com/search?q=BlockTextureScreen
So it's not as neat as swizzling (from a quick look) - but it has essentially the same goal: keep pixels that have to be drawn together close in memory (e.g. in blocks). Mipmapping helps too, since then you don't "skip" too many pixels - you get both better quality and fewer cache misses.
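Roughly, the addressing change looks like this (an illustrative sketch with an 8x8 tile, not what that repo actually does):

    /* Row-major: neighbours in Y are a full row apart in memory. */
    static inline int linear_index(int x, int y, int tex_width)
    {
        return y * tex_width + x;
    }

    /* Blocked/tiled: the texture is stored as 8x8 tiles, so texels that
       are neighbours in both X and Y usually share a cache line. */
    enum { TILE = 8 };

    static inline int tiled_index(int x, int y, int tex_width)
    {
        int tiles_per_row = tex_width / TILE;
        int tile_x = x / TILE, in_x = x % TILE;
        int tile_y = y / TILE, in_y = y % TILE;
        return (tile_y * tiles_per_row + tile_x) * (TILE * TILE)
             + in_y * TILE + in_x;
    }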
fatmap.txt discusses tiled textures a bit https://hornet.org/code/3d/trifill/texmap/fatmap.txt
You worked on the port of Metal Gear Solid?! OH MY GOD THIS IS SO COOL!
Edit: Looking it up, you might have worked on more games than I thought. Very awesome nonetheless. https://www.mobygames.com/game/company:99/from:1998/platform...
Thanks!!! Yes, it was MGS - the deal with our studio was to do the port from PSX -> PC, and since Microsoft was the publisher, part of the deal was for Konami to port Age of Empires (1? 2?) to PSX.
It was Age of Empires 2, I had an EU copy. I bet there are some interesting stories regarding that port, too.
AFAIK the user interface of the Win9x version was drawn using the native Windows GDI library, and yet the PS2 version, using a completely different rendering framework and architecture, sported the very same appearance, font glitches and all. I wonder if they actually wrote an emulation layer around that.
The AoE2 PS2 version actually had half-decent USB keyboard and mouse support. Back then USB keyboards/mice were very uncommon (it was all PS/2), but when we tried it, it actually worked.
Imagine reading this as a layman
This was a nice read, thanks for writing.
Another Flipcode user!
I missed that other game forums of similar vintage are now gone.
The way you could do assembly in Borland products, and to a lesser extent in VC++, was great.
I always feel like vomiting when looking at the mess UNIX compilers have come up with for it: instead of inferring how registers are being used, you get constraint lists and the whole string-concatenation stuff.
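For anyone who hasn't seen it, this is the GCC-style extended asm being complained about: you spell out constraints and clobbers yourself instead of the compiler inferring register use, and the assembly lives in concatenated strings (trivial x86-64 example):

    #include <stdint.h>

    static inline uint64_t add_asm(uint64_t a, uint64_t b)
    {
        uint64_t r;
        __asm__(
            "mov %1, %0\n\t"   /* r = a */
            "add %2, %0"       /* r += b */
            : "=r"(r)          /* output: any general-purpose register */
            : "r"(a), "r"(b)   /* inputs */
            : "cc"             /* condition flags clobbered */
        );
        return r;
    }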