Original Age of Empires 2 dev talks about its usage of assembly code

I worked on port of a Konami game from PSX to PC, that was 1999-2000 - the code (C) had lots of #ifdefs like where inline assembly was placed, and the original "C" code was kept. It seemed all done by one specific programmer, and what really saved us (in porting) the game was the originallly kept "C" code. My mips knowledge was never as good as x86.

So yes, it was the norm back then. My second job (1998), was working on a team that was going to do some software for Intel for the upcoming back then Katmai processor (Pentium I was it). It had all the new fancy SIMD instructions. The software was supposed to be something like media composer - you slap images, rotate them, etc all in realtime using software rendering (GPUs were still relatively expensive).

I wrote a bilinear and bicubic texture mapper with marching squares for transparent areas. It was all in assembly, and I spent lots of time optimizing it. Back then we used Intel's VTune, and it was super-precise (for the processors back then) - how they are going to pipeline, how much (supposedly) instructions cycles would take, waits, etc. That helped a lot!

But the real lesson was, that me and my manager - both claiming to be super good at assembly (after our recent achievements), rewrote the sphere mapping code for a game another team was writting in assembly, but alas our assembly (but no katmai instructions) code was slower than what the compiler did ;) - TBH, if we did proper mipmapping and texture swizzling we would've fared both ways, but hey, demo coders were not always to be found so they had to rely on regular programmers like us!

flipcode keeps a lot of good articles, with lots of good assembly for that - https://www.flipcode.com/archives/articles.shtml - there were even better materials from earlier years, but can't find them.

Turbo/Borland Pascal were so awesome, because they allowed for very easy of inline assembly use (somehow) than C/C++ - though you had to know which registers you can touch or not.

It was always so disappointing, spending hours coding up a tightly wound assembly language version of some inner loop that uses half the instructions generated by the C++ compiler, only to find that your slaved-over version is actually 5% slower. But OTOH... the thrill when it actually was faster!

This was back in the Pentium 4 era, where there were deep pipelines and oddities like some simple ALU instructions (ADD, SUB, etc.) taking 0.5 cycles(!), while others (ADC, SHR) took 4 cycles IIRC.

uses half the instructions generated by the C++ compiler

is there a tool that could profile/predict ahead of time, so that one does not attempt to hand write assembly before knowing for sure it will beat the compiled version?

There was Intel VTune, which I heard was good, though I haven't used it myself. One difficulty is that there are many non-obvious and hard-to-predict factors that interact to produce pipeline stalls. Instructions had specified throughputs and latencies (throughout being the number of cycles before another independent instruction of that type could be initiated; latency being the number of cycles before its output could be used by another instruction), but that was only part of the story. Was that memory read from L1 cache? L2? Main memory? Is this conditional branch predictable? Which of the several applicable execution units will this micro-op get sent to? There were also occasional performance cliffs (alternating memory reads that were exactly some particular power of 2 apart would alias in the cache, leading to worst-case cache behaviour; tight loops that did not begin on a 16-byte boundary would confuse the instruction prefetcher on some CPUs...)

I may be getting x86 CPU generations mixed up. But having wrestled with all this, I can certainly see the appeal of hand-optimising for older, simpler CPUs like the 6510 used in the C64, where things were a lot more deterministic.

VTune still exists and is free since a few years. Neat thing with VTune is that it has support for a few runtimes, so it understands for example CPython internals to the point that stack-traces can be a mixture of languages. That's something only becoming available just now outside of VTune, like Python 3.12 has some hooks for Linux perf.

A purely pen-and-paper tool was IACA; you simply inserted some macros around a bit of code in the binary and IACA simulated how these would/could be scheduled on a given core: https://stackoverflow.com/questions/26021337/what-is-iaca-an...

Didn't know VTune is free now, nor had I ever heard of IACA which looks very nice (and would have saved me a lot of brow-sweat)! Thanks.

Note that there's also the open-source uiCA [0], which similarly predicts scheduling and overall throughput for a basic block. Their benchmarks claim it to be more accurate than IACA and other tools for newer Intel CPUs, but I wouldn't be qualified to judge those claims.

[0] https://uops.info/uiCA.html

For small pieces of code I would try to use a superoptimizer like souper.

https://github.com/google/souper

It looks like it only supports Linux and macOS - no Windows, but no other things too like mobile.

It seems it exists for ten years, I wonder what optimizations aren't still picked by the recent compilers.

Compilers need to balance compilation speed with optimization. SMT solvers are right out for speed reasons.

This varies from trivial to very hard to mostly data dependent with different architectures. llvm-mca might be of interest.

One should be able to do a best-case calculation, mostly assuming caches hit and branch prediction gets the answer right. Register renaming manages to stay out of the way.

Getting more dubious, there is a statistical representation of program performance on unknown (or partially known) data. One might be able to estimate that usefully, though I haven't seen it done.

Are there any good resources on using mipmapping and swizzling effectively?

I was actually trying to find it - there were lots of .txt files published back then - and there was one about texture mapping from 1992... 1994? - and it explained swizzling and why it was efficient with caches.

You might be thinking of the article written by Pascal of Cubic Team. I think the archive is called pasroto.zip and it explained why rotozoomer performance would tank when the texture was rotated by 90 degrees (since you're stepping in the Y direction and blowing out your cache for every pixel). Really interesting at the time.

Now I remember - back then these texture mappers were called roto zoomers, Alex Champandard did excellent series here ->

https://www.flipcode.com/archives/The_Art_of_Demomaking-Issu... - look for zip file with the source code.

I've also found Go version (and there were 2 java ones)

https://github.com/search?q=BlockTextureScreen

So it's not as neat as swizzling (quickly looking at it) - but essentially same goal - keep pixels that have to be drawn together at close (e.g. blocks). Mipmapping helps too, as then you don't "skip" too many pixels - and you both gain better quality and less cache misses.

fatmap.txt discusses tiled textures a bit https://hornet.org/code/3d/trifill/texmap/fatmap.txt

You worked on the port of Metal Gear Solid?! OH MY GOD THIS IS SO COOL!

Edit: Looking it up you might have worked on a larger amount of games than I thought. Very awesome nonetheless. https://www.mobygames.com/game/company:99/from:1998/platform...

Thanks!!! Yes it was MGS - the deal was with our studio to do the port from PSX -> PC and since Microsoft was the publisher, part of the deal was for Konami to port Age Of Empires (1? 2?) to PSX.

It was Age Of Empires 2, I had a EU copy. I bet there are some interesting stories regarding that port, too.

AFAIK the user interface for the win9x version was drawn using native GDI windows library, and yet the PS2 version, using a completely different rendering framework and architecture, sported the very same appearance, font glitches and all. I wonder if they actually wrote an emulation layer around that.

The AoE2 PS2 version actually had half-decent USB keyboard and mouse support. Back then USB keyboard/mice were very uncommon (it was all PS/2) but when we tried it, it actually worked.

mipmapping and texture swizzling

Imagine reading this as a layman

This was a nice read, thanks for writing.

Another Flipcode user!

I missed that other game forums of similar vintage are now gone.

The way to being able to do Assembly in Borland products, and to lesser extent on VC++, was great.

I always feel like vomiting when looking at the mess UNIX compilers have come up for it, instead of inferring how registers are being used, and the whole string concatenation stuff.

A key speedup technique AoE used was realized during discussions I had with iD software programmer and optimization guru Michael Abrash over lunch at Tia's Mexican Restaurant in Mesquite, TX.

How many freeform interactions like this did we lose because of the Internet's illusion of being connected?

We've gained many. Chatting in a discord is similar to this.

And IRC was even better at it.

I don't think so. I know everybody loves to say that IRC is better than [insert commercial chat application] but in this particular use case discord is superior to irc imo because of the voice/video chat features and greater convenience.

It's the structure/topology.

Discord has a million deserted "servers" with redundant general channels. Peak IRC had a dozen or so large networks that often bundled a lot of e.g. adjacent FOSS projects. The chances of running into someone interesting a la the article anecdote was just higher.

Discord is a massive net negative for chatting on the internet because of this flaw imho.

IRC’s biggest fault is no history. Everything else is secondary.

Yeah, bouncers exist, but it should have been baked into the spec at some point.

You mean, like Jabber did 10 years ago using 10x less resources?

Well, yes, but I also see their point (even if I'm not sure I agree with it): by being forced to meet IRL, you're also forced to make real contact and seem more likely to form strong relations, and it is off-the-record by default so you can share more things

Of course, the (massive) counterpoint is that you get to talk to way fewer people, particularly if they're more than half an hour traveling away. Quantity versus quality, but by having a lot of quantity through more diverse online interactions, you can find the conversations that have a lot of quality for you (overlapping fields of work, hobbies, or just a personality match).

Which is better? Probably something in the middle, where you hang out in chat rooms but are also conscious of the advantages of arranging to meet up. I do find it inspirational (too strong a word, but you get the idea) to hear of other times or cultures where things are done differently

NEVER FORGET WHAT THEY HAVE TAKEN FROM US. WE WERE ONCE A CULTURE. NOW WE ARE LOST, FOREVER. WHITHER OUR SENSE OF BEAUTY. (etc., etc., insert architecture pics to taste)

C'mon dude. The opportunity for people to talk to one another about this stuff is unimaginably better than it was back then. Like, here we are, right now, me telling you you're full of shit. What are the chances of us being able to do that in 1999?

(I'm sorry to be mean, but I remember 1999 very well and it was much harder to get good information about things, and discuss things with others interested in the same topics, than it is today. And it was already markedly better then than it was even 5 years prior to that!)

Do you want to have lunch at Tia's in Mesquite, Texas?

I do. There isn't a Tia's there anymore but there's one an hour north, still in the DFW area.

The fact that it was hard to get information meant the ones which did break out were infinitely better. The ones you could discuss things were very into the things you were discussing.

I think we've lost some and gained other.

People may say that Discord and similar will compensate, greatly, as the number of interactions can grow a lot. On the other hand, I don't think the experience is comparable to fully focusing on the person you're eating in person with.

People may say that Discord and similar will compensate, greatly, as the number of interactions can grow a lot.

Are Discord discussions indexed by any public search engines? What about communities that are invite-only (without much actual reason to be so)? What about community admins who decide to take their whole thing down, communities that break site-wide rules and get removed by site admins, etc? Does Discord Inc. make any commitments towards publishing discussions that have archival/historical value?

How much knowledge is already irrecoverably stuck in Slack's bit bucket, as people flocked away to the next walled-garden chat app?

I mean it's not like technical discussions over lunch at a Mexican restaurant we're indexable either.

With the rise of AIs-hoovering-up-everything-reachable-on-the-public-web I'm not sure I want every conversation indexed either.

Discord is a poor replacement. I think it allows a lot of broad connections without depth which sounds good on the surface but unfortunately real insights require a bit of digging but the conversation has already moved ahead in the chatroom. That's why old school forums were actually a better kind of a discussion board but discord being free and not requiring technical setup won.

Don't people still have lunch?

They do. Over their desks, while working or in a video call.

We get a 1 hour lunch break required by law and yet people still do this willingly.

That's probably very country specific, because I've not seen that happen and would also never do it myself.

I get what you're saying. There are a lot of tinier discords and chat rooms where people post technical points and other people chime in. There are even websites with these small places.

The challenge of course is finding your way there. They're not exactly discoverable, and unlike with a job, it's usually through some pretty odd connections that you end up there.

The use of assembly in the drawing core resulting in a ~10x sprite drawing speed improvement over the C++

Wow that's a chunky improvement over an already fast lang

This was in 1999. C++ compilers have come a long ways since then. While there are still opportunities for hand-written asm to go and order of magnitude faster than C++, they're mostly around manual vectorization where the auto-vectorizer fails.

Yes, younger devs grown up on the myth of C and C++ being always fast, have missed the days when inline Assembly was a higher count than pure C and C++ code.

I have seen applications for MS-DOS, effectively using C as a Macro Assembler, only the data structures and high level logic was C as if Macro Assembler macros.

Yes, younger devs grown up on the myth of C and C++ being always fast, have missed the days when inline Assembly was a higher count than pure C and C++ code.

And still is in VLC. (Okay, maybe not higher, but they do use a crapton of assembly in their decoders, and it does speed them up by a factor of 10 or so today.)

Do you mean the VLC media player? Just curious is there any reason modern compilers don't beat assemblies or is that simply a legacy issue?

Compilers beat hand written Assembly for the general use cases.

Now beating special use cases, like using vector instructions to parallel process video decoding streams is another matter.

It is no accident that after all the efforts improving Java and .NET JITs for auto-vectorization across various vendors, both platforms now expose SIMD intrinsics as well.

The choices and resulting codegen are fairly different. Only one of them works "properly" as of today. Though I'm open to be proven wrong once Panama vectors get stable in the Java land.

They will only be out of preview when Valhala ships, as per roadmap.

Then there is the whole issue when will they reach other implementations beyond OpenJDK, specially a very important alternative implementation running on many phones across the globe.

Nevertheless the need to explicitly being allowed to write vector code is there.

Thanks for the explanation. Aside from vectorization is there anything else that handwritten assembly could be better? Assuming on modern CPUs and modern compilers.

Sure, people who are good at assembly can often do register allocation and instruction selection better for small snippets of code. Or optimize based on guarantees the compiler can’t see or know about.

Video decoding has always been a prime example for SIMD stuff, however I wonder how much of that code VLC devs could wipeout, assuming hardware vídeo decoding being available everywhere.

I was there, 3000 years ago, when C was considered bloated and slow. Times are a changing.

Unfortunately much of those optimizations are thanks to taking UB to full throttle.

And when that happens SIMD intrinsics are often used rather than going straight into raw assembly.

Even intrinsics didn't even necessarily work well. MSVC, in particular, was really, really bad back then with SIMD intrinsics -- any use of MMX or SSE intrinsics in VC6 would result in more than two-thirds of the generated code being move instructions, with a single value sometimes moved two or three times between ALU instructions. It was trivial to beat the compiler with hand-written assembly. MMX intrinsics were never fixed and SSE intrinsics weren't fixed until VS2010.

For scalar code, it was more that the CPUs got better, as out-of-order execution starting with the Pentium Pro made instruction scheduling less important. The original Pentium CPU was an in-order design with two pipes where the second V pipe had significant restrictions, which was harder for compilers to deal with than the PPro/PII and its decoding pattern.

Functionality should have been provided by directx though

I think AOE was using DirectX, so I assume that was one level up the stack, ie figuring out which sprites are visible to which extent by walking different data structures and then just throwing stuff at directdraw.

DirectDraw wasn't really meant as a drawing toolkit; you _did_ have blits, but they were not hardware accelerated AFAIK and not nearly flexible enough for what the article suggests (mirror, stiple, probably other stuff). In general, what DirectDraw gave you was a rectangle you could draw pixels into, and a way to get those pixels efficiently to the screen. In other words, more like a clean abstraction over the display driver.

Those were also the days when, according to a famous interview with the same devs:

The general argument is that if you know you are going to need to release a patch, then you shouldn't be shipping the game in the first place.

This quote shall be summoned every time I open Steam and note there is yet another 22GB 'rebalancing patch' for a certain game.

Well, in certain online games the balancing can't just stay static. But update sizes really are quite insane. One Baldur's Gate 3 update had 100GB and of course that required 100GB additional free space. At that point it was easier to just reinstall the whole game.

The balancing in AoE2:DE can't stay static either - and the pathing seems to have a different bug every other patch. Still a fun game though.

You probably need to ignore the first comment thread to see the comment of interest.

Not any more; in any case, here's a link to the comment in question:

https://old.reddit.com/r/aoe2/comments/18ysttu/_/kgltqrg/

I love that Michael Abrash shows up in this story. Seems like anywhere graphical innovation happened, he was there.

For those interested, the competitive AoE2 scene is alive and well with one of its biggest tournaments, NAC5, going on now, an in-person LAN in Berlin.

https://liquipedia.net/ageofempires/Nili%27s_Apartment_Cup/5

https://www.twitch.tv/nili_aoe

https://www.youtube.com/channel/UCXeY7zz-1LsyZdnpFbffdqA

I'm reading a history of Borland and the author claims that the Turbo Pascal compiler was mostly written in assembly and was also used in Delphi 1.0. No one in Borland could make significant changes in the code so eventually they rewrote it for Delphi 2.0.

Not sure if all was true but very fascinating. I think there is a certain character in programmers who went through the fire and storm by writing softwares in assembly language for non trivial CPUs (like Pentium and up) that is unique.

A few years ago I wrote a naive SLP drawing routine and it was very slow, and left me wondering what exactly they did to make it usable in AOE2. So this makes a lot of sense to me.

Fun little format to mess with if you find the docs for it - surprisingly not that difficult to implement (there used to be a wiki with the spec on it, might be gone now?)

AoE2DE still ships with hand-written assembly, though not part of the game code itself. The executable is initially encrypted by an executable packer. It unpacks the game code at runtime - some functions even on-demand.

Not sure why they do this, but this even leaves all code as RWX (readable, writable and executable) which is highly insecure.

https://old.reddit.com/r/aoe2/comments/18ysttu/aoe_is_writte...

OOO maybe this is my chance to figure out why the newest "definitive addition" runs so much worse in Wine.