HN comments for: Box64 and RISC-V in 2024: What It Takes to Run the Witcher 3 on RISC-V

Manfred

34 replies

10h1m

2024-08-27 08:31:15 UTC

> At least in the context of x86 emulation, among all 3 architectures we support, RISC-V is the least expressive one.

RISC was explained to me as a reduced instruction set computer in computer science history classes, but I see a lot of articles and proposed new RISC-V profiles about "we just need a few more instructions to get feature parity".

I understand that RISC-V is just a convenient alternative to other platforms for most people, but does this also mean the RISC dream is dead?

flanked-evergl

18 replies

9h50m

2024-08-27 08:41:33 UTC

Is there a RISC dream? I think there is an efficiency "dream", there is a performance "dream", there is a cost "dream" — there are even low-complexity relative to cost, performance and efficiency "dreams" — but a RISC dream? Who cares more about RISC than cost, performance, efficiency and simplicity?

Joker_vD

10 replies

8h5m

2024-08-27 10:26:31 UTC

There was such a dream. It was about getting a mind-bogglingly simple CPU, putting caches into the now-empty place where all the control logic used to be, clocking it up the wazoo, and letting the software deal with load/branch delays, efficiently using all 64 registers, etc. That'll beat the hell out of those silly CISC architectures at performance, and at a fraction of the design and production costs!

This didn't work out, for two main reasons. First, just being able to turn clocks hella high is still not enough to get great performance: you really do want your CPU to be superscalar, out-of-order, and with a great branch predictor, if you need amazing performance. But when you do all that, the simplicity of RISC decoding stops mattering all that much, as Pentium II demonstrated when it equalled DEC Alpha on performance, while still having practically useful things like e.g. byte loads/stores. Yes, it's RISC-like instructions under the hood, but that's an implementation detail; there's no reason to expose it to the user in the ISA, just as you don't have to expose the branch delay slots in your ISA because it's a bad idea to do so: e.g. MIPS II added 1 additional pipeline stage, and now they needed two branch/load delay slots. Whoops! So they added interlocks anyway (MIPS originally stood for "Microprocessor without Interlocked Pipelined Stages", ha-ha) and got rid of the load delays; they still left 1 branch delay slot exposed due to backwards compatibility, and the circuitry required was arguably silly.

The second reason was that the software (or compilers, to be more precise) can't really deal very well with all that stuff from the first paragraph. That's what sank Itanium. That's why nobody makes CPUs with register windows any more. And static instruction scheduling in the compilers still can't beat dynamic instruction reordering.

panick21_

4 replies

4h12m

2024-08-27 14:19:47 UTC

> This didn't work out

... except it did.

You had literal students design chips that outperformed industry cores that took huge teams and huge investment.

Acorn had a team of just a few people build a core that outperformed an i460 with likely 1/100 of the investment. Not to mention the even more expensive VAX chips.

Can you imagine how fucking baffled the DEC engineers at the time were when their absurdly complex and absurdly expensive VAX chips were smoked by a bunch of first-time chip designers?

> as Pentium II demonstrated

That chip came out in 1997. The original RISC chip research happened in the early 80s or even earlier. It did work; it's just that x86 was bound to the PC market and Intel had the finances to have huge teams hammer away at the problem. x86 was able to overtake Alpha because DEC was not doing well and they couldn't invest the required amount.

> no reason to expose it to the user in the ISA

Except that hiding the implementation is costly.

If you give 2 equal teams the same amount of money, which results in a faster chip: a team that does a simple RISC instruction set, or a team that does a complex CISC instruction set and transforms that into an underlying simpler instruction set?

Now of course for Intel, they had backward compatibility so they had to do what they had to do. They were just lucky they were able to invest so much more than all the other competitors.

baq

1 replies

4h8m

2024-08-27 14:24:02 UTC

All fine except Itanium happened and it goes against everything you list out...?

pjc50

0 replies

3h15m

2024-08-27 15:16:54 UTC

Itanium was not in any sensible way RISC, it was "VLIW". That pushed a lot of needless complexity into compilers and didn't deliver the savings.

pjc50

0 replies

3h8m

2024-08-27 15:23:33 UTC

> You had literal students design chips that outperformed industry cores that took huge teams and huge investment

Everyone remember to thank our trans heroine Sophie Wilson (CBE).

Joker_vD

0 replies

3h27m

2024-08-27 15:04:44 UTC

> If you give 2 equal teams the same amount of money, what results in a faster chip.

Depends on the amount of money. If it's less than a certain amount, the RISC design will be faster. If it's above, both designs will perform about the same.

I mean, look at ARM: they too have to decode their instructions into micro-ops and cache those in their high-performance models. What RISC buys you is the ability to be competitive at the low end of the market, with simplistic implementations. That's why we won't ever see e.g. a stack-like machine (no exposed general-purpose registers, but with flexible addressing modes for the stack, even something like [SP+[SP+12]]; the stack is mirrored onto a hidden register file which is used as an "L0" cache, which neatly solves the problem that register windows were supposed to solve). Such a design can be made as fast as server-grade x86 or ARM, but only by throwing billions of dollars and several man-millennia at it; and if you try to do it cheaper and quicker, its performance would absolutely suck. That's why e.g. System/360 didn't make that design choice although IBM seriously considered it for half a year: they then found out that the low-end machines would be unacceptably slow, so they went with the "registers with base-plus-offset addressed memory" design.

vlovich123

3 replies

3h18m

2024-08-27 15:14:11 UTC

To add on to what the sibling said, ignoring that CISC chips have a separate frontend to break complex instructions down into an internal RISC-like instruction set and thus the difference is blurred, more RISC-like instruction sets do tend to win on performance and power for the main reason that the instruction set has a fixed width. This means that you can fetch a line of cache and, with 4 byte instructions, you could start decoding 32 instructions in parallel, whereas x86’s variable-length encoding makes it harder to keep the superscalar pipeline full (its decoder is significantly more complex to try to still extract parallelism, which further slows it down). This is a bit more complex on ARM (and maybe RISC-V?) where you have two widths, but even then in practice it’s easier to extract performance out of it because x86 instructions can be anywhere from 1-4 bytes (or 1-8? Can’t remember), which makes it hard to find instruction boundaries in parallel.

There’s a reason that Apple is whooping AMD and Intel on performance/watt and it’s not solely because they’re on a newer fab process (it’s also why AMD and Intel utterly failed to get mobile CPU variants of their chips off the ground).

Joker_vD

2 replies

2h56m

2024-08-27 15:35:56 UTC

x86 instruction lengths range from 1 to 15.

> a line of cache and 4 byte instructions you could start decoding 32 instructions in parallel

In practice, ARM processors decode up to 4 instructions in parallel; so do Intel and AMD.

adgjlsfhk1

1 replies

1h33m

2024-08-27 16:59:08 UTC

Apple's M1 chips are 8 wide, and AMD and Intel's newest chips are also doing fancier things than 4 wide.

vlovich123

0 replies

26m

2024-08-27 18:05:36 UTC

Any reading resources? I’d love to learn better the techniques they’re using to get better parallelism. The most obvious solution I can imagine is that they’d just try to brute force starting to execute every possible boundary and rely on it either decoding an invalid instruction or late latching the result until it got confirmed that it was a valid instruction boundary. Is that generally the technique or are they doing more than even that? The challenge with this technique of course is that you risk wasting energy & execution units on phantom stuff vs an architecture that didn’t have as much phantomness potential in the first place.
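
A toy Python sketch of the brute-force idea described above, not a description of how any shipping x86 decoder works: compute a length at every byte offset independently (the part hardware could do in parallel), then walk the chain from a known start to pick out the real boundaries. The `toy_length` encoding (low two bits of the first byte, plus one) is made up purely for illustration.

```python
def toy_length(code: bytes, offset: int) -> int:
    """Hypothetical encoding: length = (low 2 bits of the first byte) + 1."""
    return (code[offset] & 0b11) + 1


def find_boundaries(code: bytes, length_at=toy_length) -> list[int]:
    # Step 1: speculative length at *every* offset. Each entry is independent
    # of the others, so hardware could compute them all at once.
    lengths = [length_at(code, i) for i in range(len(code))]

    # Step 2: the serial part. Only offsets reachable from the known start
    # are real instruction boundaries; the rest was wasted "phantom" work.
    starts = []
    pc = 0
    while pc < len(code):
        starts.append(pc)
        pc += lengths[pc]
    return starts


code = bytes([0x00, 0x01, 0xAA, 0x02, 0xBB, 0xCC, 0x00])
print(find_boundaries(code))  # [0, 1, 3, 6]
```

Here the lengths computed at offsets 2, 4 and 5 are thrown away, which is the wasted energy the comment worries about.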

baq

0 replies

7h27m

2024-08-27 11:05:09 UTC

Great post as it is also directly applicable to invalidate the myth that the arm instruction set somehow makes the whole cpu better than analogous x86 silicon. It might be true and responsible for like 0.1% (guesstimate) of the total advantage; it's actually all RISC under the hood and both ISAs need decoders, x86 might need a slightly bigger one which amounts to accounting noise in terms of area.

c.f. https://chipsandcheese.com/2021/07/13/arm-or-x86-isa-doesnt-...

impossiblefork

6 replies

9h36m

2024-08-27 08:56:13 UTC

But we define the RISC dream as a dream that efficiency, performance and low-cost could be achieved by cores with very small instruction sets?

flanked-evergl

3 replies

9h24m

2024-08-27 09:07:19 UTC

If adding more instructions negatively impacted efficiency, performance, cost and complexity, nobody would do it.

foldr

1 replies

9h21m

2024-08-27 09:11:14 UTC

Probably true now, but in ye olde days, some instructions existed primarily to make assembly programming more convenient.

Assembly programming is a real pain in the RISCiest of RISC architectures, like SPARC. Here's an example from https://www.cs.clemson.edu/course/cpsc827/material/Code%20Ge...:

• All branches (including the one caused by CALL, below) take place after execution of the following instruction.

• The position immediately after a branch is the “delay slot” and the instruction found there is the “delay instruction”.

• If possible, place a useful instruction in the delay slot (one which can safely be done whether or not a conditional branch is taken).

• If not, place a NOP in the delay slot.

• Never place any other branch instruction in a delay slot.

• Do not use SET in a delay slot (only half of it is really there).

pjc50

0 replies

3h7m

2024-08-27 15:24:58 UTC

Delay slots were such a hack. ARM never needed them.

patmorgan23

0 replies

6h26m

2024-08-27 12:05:57 UTC

Only if decoder complexity/efficiency is your bottleneck

fanf2

1 replies

9h11m

2024-08-27 09:20:41 UTC

Not small instruction sets, simplified instruction sets. RISC’s main trick is to reduce the number of addressing modes (eg, no memory indirect instructions) and reduce the number of memory operands per instruction to 0 or 1. Use the instruction encoding space for more registers instead.

The surviving CISCs, x86 and z390 are the least CISCy CISCs. The surviving RISCs, arm and power, are the least RISCy RISCs.

RISC V is a weird throwback in some aspects of its instruction set design.

panick21_

0 replies

4h30m

2024-08-27 14:01:56 UTC

Let's be real, it's about business models. POWER was and is backed by IBM. ARM won on mobile. Does this mean POWER and ARM are better than MIPS, SPARC, PA-RISC, Am29000, i860? I don't think so.

gary_0

9 replies

9h23m

2024-08-27 09:09:09 UTC

As I've heard it explained, RISC in practise is less about "an absolutely minimalist instruction set" and more about "don't add any assembly programmer conveniences or other such cleverness, rely on compilers instead of frontend silicon when possible".

Although as I recall from reading the RISC-V spec, RISC-V was rather particular about not adding "combo" instructions when common instruction sequences can be fused by the frontend.

My (far from expert) impression of RISC-V's shortcomings versus x86/ARM is more that the specs were written starting with the very basic embedded-chip stuff, and then over time more application-cpu extensions were added. (The base RV32I spec doesn't even include integer multiplication.) Unfortunately they took a long time to get around to finishing the bikeshedding on bit-twiddling and simd/vector extensions, which resulted in the current functionality gaps we're talking about.

So I don't think those gaps are due to RISC fundamentalism; there's no such thing.

Suppafly

7 replies

2h6m

2024-08-27 16:25:33 UTC

> and more about "don't add any assembly programmer conveniences or other such cleverness, rely on compilers instead of frontend silicon when possible"

What are the advantages of that?

Closi

2 replies

56m

2024-08-27 17:35:17 UTC

Instructions can be completed in one clock cycle, which removes a lot of complexity compared to instructions that require multiple clock cycles.

Removed complexity means you can fit more stuff into the same amount of silicon, and have it be quicker with less power.

gary_0

1 replies

35m

2024-08-27 17:56:33 UTC

That's not exactly it; quite a few RISC-style instructions require multiple (sometimes many) clock cycles to complete, such as mul/div, floating point math, and branching instructions can often take more than one clock cycle as well, and then once you throw in pipelining, caches, MMUs, atomics... "one clock cycle" doesn't really mean a lot. Especially since more advanced CPUs will ideally retire multiple instructions per clock.

Sure, addition and moving bits between registers takes one clock cycle, but those kinds of instructions take one clock cycle on CISC as well. And very tiny RISC microcontrollers can take more than one cycle for adds and shifts if you're really stingy with the silicon.

(Memory operations will of course take multiple cycles too, but that's not the CPU's fault.)

Suppafly

0 replies

2024-08-27 18:26:15 UTC

> quite a few RISC-style instructions require multiple (sometimes many) clock cycles to complete, such as mul/div, floating point math

Which seems like stuff you want support for, but this is seemingly arguing against?

adgjlsfhk1

1 replies

1h40m

2024-08-27 16:51:30 UTC

complexity that the compiler removes doesn't have to be handled by the CPU at runtime

Suppafly

0 replies

2024-08-27 18:28:53 UTC

Sure but that's not necessarily at odds with "programmer conveniences or other such cleverness" is it?

Retr0id

1 replies

1h56m

2024-08-27 16:35:55 UTC

It shifts implementation complexity from hardware onto software. It's not an inherent advantage, but an extra compiler pass is generally cheaper than increased silicon die area, for example.

On a slight tangent, from a security perspective, if your silicon is "too clever" in a way that introduces security bugs, you're screwed. On the other hand, software can be patched.

flyingpenguin

0 replies

38m

2024-08-27 17:54:12 UTC

I honestly find the lack of compiler/interpreter complexity disheartening.

It often feels like as a community we don't have an interest in making better tools than those we started with.

Communicating with the compiler, and generating code with code, and getting information back from the compiler should all be standard things. In general they shouldn't be used, but if we also had better general access to profiling across our services, we could then have specialists within our teams break out the special tools and improve critical sections.

I understand that many of us work on projects with already absurd build times, but I feel that is a side effect of refusal to improve ci/cd/build tools in a similar way.

If you have ever worked on a modern TypeScript framework app, you'll understand what I mean. You can create decorators and macros talking to the TypeScript compiler and asking it to generate some extra JS or modify what it generates. And the whole framework sits there running partial re-builds and refreshing your browser for you.

It makes things like golang feel like they were made in the 80s.

Freaking golang... I get it, macros and decorators and generics are over-used. But I am making a library to standardize something across all 2,100 developers within my company... I need some meta-programming tools please.

Closi

0 replies

8h27m

2024-08-27 10:04:43 UTC

Put another way, "try to avoid instructions that can't be executed in a single clock cycle, as those introduce silicon complexity".

wang_li

0 replies

54m

2024-08-27 17:37:51 UTC

Beyond the most trivial of microcontrollers and experimental designs there are no RISC chips under the original understanding of RISC. The justification for RISC evaporated when we became able to put 1 million, 100 million, and so on, transistors on a chip. Now all the chips called "RISC" include vector, media, encryption, network, FPU, etc. instructions. Someone might want to argue that some elements of RISC designs (orthogonal instruction encoding, numerous registers, etc.) make a particular chip a RISC chip. But they really aren't instances of the literal concept of RISC.

To me, the whole RISC-V interest is all just marketing. As an end user I don't make my own chips and I can't think of any particular reason I should care whether a machine has RISC-V, ARM, x86, SPARC, or POWER. In the end my cost will be based on market scale and performance. The licensing cost of the design will not be passed on to me as a customer.

ahartmetz

0 replies

9h35m

2024-08-27 08:56:58 UTC

The explanation that I've seen is that it's "(reduced instruction) set computer" - simple instructions, not necessarily few.

WhyNotHugo

0 replies

8h15m

2024-08-27 10:16:46 UTC

In this particular context, they're trying to run code compiled for x86_64 on RISC-V. The need for "we just need a few more instructions to get feature parity" comes from trying to run code that is already compiled for an architecture with all those extra instructions.

In theory, if you compiled the original _source_ code for RISC-V, you'd get an entirely different binary and wouldn't need those specific instructions.

In practice, I doubt anyone is going to actually compile these games for RISC-V.

Symmetry

0 replies

7h5m

2024-08-27 11:26:53 UTC

In order to have an instruction set that a student can implement in a single-semester class you need to make simplifications like having all instructions have two inputs and one output. That also makes the lives of researchers experimenting on processor designs a lot simpler. But it does mean that some convenient instructions are off the table for getting to higher performance.

That's not the whole story: a simpler pipeline takes fewer engineering resources for teams going to a high performance design, so they can spend more time optimizing.

RISC is generally a philosophy of simplification but you can take it further or less far. MIPS is almost as simplified as RISC-V but ARM and POWER are more moderate in their simplifications and seem to have no trouble going toe to toe with x86 in high performance arenas.

But remember there are many niches for processors out there besides running applications. Embedded, accelerators, etc. In the specific niche of application cores I'm a bit pessimistic about RISC-V but from a broader view I think it has a lot of potential and will probably come to dominate at least a few commercial niches as well as being a wonderful teaching and research tool.

RiverCrochet

0 replies

38m

2024-08-27 17:54:11 UTC

The RISC dream was to simplify CPU design because most software was written using compilers and not direct assembly.

Characteristics of classical RISC:

- Most data manipulation instructions work only with registers.

- Memory instructions are generally load/store to registers only.

- That means you need lots of registers.

- Do your own stack because you have to manually manipulate it to pass parameters anyway. So no CALL/JSR instruction. Implement the stack yourself using some basic instructions that load/store to the instruction pointer register directly.

- Instruction encoding is predictable and each instruction is the same size.

- More than one RISC arch has a register that always reads 0 and can't be written. Used for setting things to 0.

This worked, but then the following made it less important:

- Out-of-order execution - generally the raw instruction stream is a declaration of a path to desired results, but isn't necessarily what the CPU is really doing. Things like speculative execution, branch prediction and register renaming are behind this.

- SIMD - basically a separate wide register space with instructions that work on all values within those wide registers.

So really OOO and SIMD took over.

justahuman74

23 replies

13h0m

2024-08-27 05:31:38 UTC

I hope they're able to get this ISA-level feedback to people at RVI

dmitrygr

18 replies

12h53m

2024-08-27 05:38:48 UTC

None of this is new. None of it.

In fact, bitfield extract is such an obvious oversight that it is my favourite example of how idiotic the RISCV ISA is (#2 is lack of sane addressing modes).

Some of the better RISCV designs, in fact, implement a custom instr to do this, eg: BEXTM in Hazard3: https://github.com/Wren6991/Hazard3/blob/stable/doc/hazard3....

renox

16 replies

12h19m

2024-08-27 06:12:55 UTC

Whoa, someone else who doesn't believe that the RISC-V ISA is 'perfect'! I'm curious: how have the discussions on the bitfield extract been going? Because it does really seem like an obvious oversight and something to add as a 'standard extension'.

What's your take on

1) unaligned 32bit instructions with the C extension?

2) lack of 'trap on overflow' for arithmetic instructions? MIPS had it..

dmitrygr

12 replies

12h7m

2024-08-27 06:24:21 UTC

1. aarch64 does this right. RISCV tries to be too many things at once, and predictably ends up sucking at everything. Fast big cores should just stick to fixed size instrs for faster decode. You always know where instrs start, and every cacheline has an integer number of instrs. Microcontroller cores can use compressed instrs, since it matters there, while trying to parallel-decode instrs does not matter there. Trying to have one arch cover it all is idiotic.

2. nobody uses it on mips either, so it is likely of no use.

loup-vaillant

9 replies

8h17m

2024-08-27 10:15:15 UTC

> Fast big cores should just stick to fixed size instrs for faster decode.

How much faster, though? RISC-V decode is not crazy like x86: you only need to look at the first byte to know how long the instruction is (the first two bits if you limit yourself to 16 and 32-bit instructions, 5 bits if you support 48-bit instructions, 6 bits if you support 64-bit instructions). Which means the serial part of the decoder is very, very small.
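
A minimal Python sketch of that length check, assuming the length-encoding scheme laid out in the RISC-V unprivileged spec (the 48-bit and 64-bit formats are not frozen there, so treat those branches as illustrative). The function name `rv_insn_length` is made up for this example.

```python
def rv_insn_length(parcel: int) -> int:
    """Return the instruction length in bytes, given its first 16-bit parcel."""
    if parcel & 0b11 != 0b11:
        return 2  # low two bits are not 11: compressed 16-bit instruction
    if parcel & 0b11100 != 0b11100:
        return 4  # bits [4:2] are not 111: standard 32-bit instruction
    if parcel & 0b100000 == 0:
        return 6  # low bits are 011111: 48-bit instruction
    if parcel & 0b1000000 == 0:
        return 8  # low bits are 0111111: 64-bit instruction
    raise ValueError("longer or reserved encoding")


print(rv_insn_length(0x4501))  # 2 (low bits 01)
print(rv_insn_length(0x0513))  # 4 (low bits 0010011)
```

Only the lowest few bits of the first byte are ever inspected, which is the "very small serial part" being argued here.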

The bigger complaint about variable-length instructions is potentially misaligned instructions, which do not play well with cache lines (a single instruction may start in one cache line and end in the next, making hardware a bit more hairy).

And there’s an advantage to compressed instructions even on big cores: less pressure on the instruction cache, and correspondingly fewer cache misses.

Thus, it’s not clear to me that fixed size instructions is the obvious way to go for big cores.

newpavlov

6 replies

7h57m

2024-08-27 10:34:18 UTC

Another argument against the C extension is that it uses a big chunk of the opcode space, which may be better used for other extensions with 32-bit instructions.

camel-cdr

5 replies

7h6m

2024-08-27 11:25:43 UTC

Are just 32-bit and naturally aligned 64 bit instruction a better path than fewer 32 bit, but 16/48/64 bit instructions?

I think it's quite unclear which one is better. 48-bit instructions have a lot of potential imo: they have better code density than naturally aligned 64-bit instructions, and they can encode more than 32-bit ones. (2/3 to 3/4 of 43-bits of encoding)

There are essentially two design philosophies:

1. 32-bit instructions, and 64 bit naturally aligned instructions

2. 16/32/48/64 bit instructions with 16 bit alignment

Implementation complexity is debatable, although it seems to somewhat favor option 1:

1: you need to crack instructions into uops, because your 32-bit instructions need to do more complex things

2: you need to find instruction starts, and handle decoding instructions that span across a cache line

How big the impact is relative to the entire design is quite unclear.

Finding instruction starts means you need to propagate a few bits over your entire decode width, but cracking also requires something similar. Consider that if you can handle 8 uops, then those can come from the first 4 instructions that are cracked into 2 uops each, or from 8 instructions that don't need to be cracked, and everything in between. With cracking, you have more freedom when you want to do it in the pipeline, but you still have to be able to handle it.

In the end, both need to decode across cachelines for performance, but one needs to deal with an instruction split across those cache lines. To me this sounds like it might impact verification complexity more than the actual implementation, but I'm not qualified enough to know.

If both options are suited for high performance implementations, then it's a question about tradeoffs and ISA evolution.

newpavlov

4 replies

4h55m

2024-08-27 13:36:18 UTC

There is also a middle ground of requiring to pad 16/48-bit sequences with 16-bit NOP to align them to 32 bits. I agree that at this time it's not clear whether the C extension is a good idea or not (same with the V extension).

sweetjuly

3 replies

3h20m

2024-08-27 15:12:11 UTC

The C extension authors did consider requiring alignment/padding to prevent the misaligned 32-bit instruction issues, but they specifically mention rejecting it since it ate up all the code size savings.

Dylan16807

2 replies

2h0m

2024-08-27 16:32:07 UTC

Did they specifically analyze doing alignment on a cache line basis?

adgjlsfhk1

1 replies

1h11m

2024-08-27 17:20:45 UTC

that seems really tough for compilers.

dmitrygr

0 replies

1h7m

2024-08-27 17:24:45 UTC

Not really. Most modern x86 compilers already align jump targets to cache line boundaries since this helps x86 a lot. So it is doable. If you compile each function into a section (common), then the linker can be told to align them to 64 or 128 bytes easily. Code size would grow (but tetris can be played to reduce this by packing functions)

inkyoto

1 replies

6h32m

2024-08-27 11:59:49 UTC

Frankly, there is no advantage to compressed instructions in a high performance CPU core as a misaligned instruction can span a memory page boundary, which will generate a memory fault, potentially a TLB flush, and, if the memory page is not resident in memory, will require an I/O operation. Which is much worse than crossing a cache line. It is a double whammy when both occur simultaneously.

One suggested solution has been filling in gaps with NOP's, but then the compiler would have to track the page alignment, which would not work anyway if a system supports pages of varying sizes (ordinary vs huge pages).

The best solution is perhaps to ignore compressed instructions when targeting high performance cores and confine their usage to where they belong: power efficient or low performance microcontrollers.

Dylan16807

0 replies

1h54m

2024-08-27 16:37:46 UTC

> One suggested solution has been filling in gaps with NOP's, but then the compiler would have to track the page alignment, which would not work anyway if a system supports pages of varying sizes (ordinary vs huge pages).

If it's in the linker then tracking pages sounds pretty doable.

You don't need to care about multiple page sizes. If you pad at the minimum page size, or even at 1KB boundaries, that's a minuscule number of NOPs.
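
A rough Python sketch of that padding pass, under the assumption that every instruction is 2 or 4 bytes and 2-byte aligned, so a single 2-byte NOP is always enough to push a straddling instruction onto the next boundary. `layout_with_padding` and its parameters are hypothetical names, not part of any real linker.

```python
NOP16 = 2  # size in bytes of a compressed NOP


def layout_with_padding(sizes, boundary=4096, start=0):
    """Assign offsets to instructions so that none straddles `boundary`.

    Returns (offsets, nop_count). Assumes sizes are 2 or 4 and that
    `start` is 2-byte aligned.
    """
    offsets = []
    nops = 0
    pc = start
    for size in sizes:
        last_byte = pc + size - 1
        if pc // boundary != last_byte // boundary:
            # The instruction would cross the boundary: pad with one NOP.
            pc += NOP16
            nops += 1
        offsets.append(pc)
        pc += size
    return offsets, nops


# Fifteen 4-byte instructions, one 2-byte, then a 4-byte one that would
# otherwise start at offset 62 and cross a 64-byte line.
offsets, nops = layout_with_padding([4] * 15 + [2, 4], boundary=64)
print(offsets[-1], nops)  # 64 1
```

At most one NOP can be needed per boundary, so with a 4096-byte boundary the overhead is bounded by 2 bytes per page.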

renox

0 replies

6h51m

2024-08-27 11:40:45 UTC

> 2. nobody uses it on mips either, so it is likely of no use.

Sure, but at the time Rust and Zig didn't exist; these two languages have a mode which detects integer overflow.

bonzini

0 replies

10h41m

2024-08-27 07:50:35 UTC

Fixed size instructions are not absolutely necessary, but keeping them naturally aligned is just better even if that means using C instructions a bit less often. It's especially messy that 32-bit instructions can span a page.

phkahler

1 replies

5h0m

2024-08-27 13:31:48 UTC

IMHO they made a mistake by not allowing immediate data to follow instructions. You could encode 8 bit constants within the opcode, but anything larger should be properly supported with immediate data. As for the C extension, I think that was also inferior because it was added afterward. I'd like to see a re-encoding of the entire ISA in about 10 years once things are really stable.

dmitrygr

0 replies

2h39m

2024-08-27 15:52:40 UTC

The main problem with what you’re saying is that none of the lessons learned are new. They were all well-known before this ISA was designed, so if the designers had any intention of learning from the past, they had every opportunity to do so.

newpavlov

0 replies

9h40m

2024-08-27 08:51:18 UTC

The handling of misaligned loads/stores in RISC-V can also be considered a disappointing point: https://github.com/riscv/riscv-isa-manual/issues/1611 It oozes with preferring the convenience of hardware developers and "flexibility" over making practical guarantees needed by software developers. It looks like the MIPS patent on misaligned load/store instructions has played its negative role. The patent expired in 2019, but it seems we are stuck with the current status quo nevertheless.

Findecanor

0 replies

7h45m

2024-08-27 10:46:51 UTC

Bitfield-extract is being discussed for a future extension. E.g. Qualcomm is pressing for it to be added.

In the meantime, it can be done as two shifts: left to the MSB, and then right, filling with zero or sign bits. There is at least one core in development (SpacemiT X100) that is supposed to be able to fuse those two into a single µop, and maybe some that already do.

However, I've also seen that one core (XiangShan Nanhu) is fusing pairs of RVI instructions into one in the B extension, to be able to run old binaries compiled for CPUs without B faster. Throwing hardware at the problem to avoid a recompile ... feels a bit backwards to me.
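
The two-shift extract mentioned above can be sketched in Python, modelling 64-bit registers with masking. The helper name `bitfield_extract` and its parameters are illustrative only; on RV64 the two steps would correspond to `slli` followed by `srli` (zero fill) or `srai` (sign fill).

```python
MASK64 = (1 << 64) - 1


def bitfield_extract(x: int, lsb: int, width: int, signed: bool = False) -> int:
    """Extract `width` bits of `x` starting at bit `lsb`, using two shifts."""
    # Shift 1 (slli): move the field's top bit up to bit 63.
    t = (x << (64 - (lsb + width))) & MASK64
    if signed and (t >> 63):
        # Reinterpret as a negative value so Python's >> acts like srai.
        t -= 1 << 64
    # Shift 2 (srli/srai): bring the field back down to bit 0.
    return t >> (64 - width)


print(hex(bitfield_extract(0xABCD, 4, 8)))          # 0xbc
print(bitfield_extract(0xF0, 4, 4, signed=True))    # -1
```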

camel-cdr

3 replies

12h31m

2024-08-27 06:01:10 UTC

The scalar efficiency SIG has already been discussing bitfield insert and extract instructions.

We figured out yesterday [1], that the example in the article can already be done in four risc-v instructions, it's just a bit trickier to come up with it:

    # a0 = rax, a1 = rbx
    slli t0, a1, 64-8
    rori a0, a0, 16
    add a0, a0, t0
    rori a0, a0, 64-16

[1] https://www.reddit.com/r/RISCV/comments/1f1mnxf/box64_and_ri...
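
To check the four-instruction sequence above, here is a small Python model of it, assuming 64-bit registers. By my reading it adds the low byte of rbx into bits 8..15 of rax (what x86 would write as `add ah, bl`) without letting a carry leak into the neighbouring bits. The helper names are made up for this sketch.

```python
MASK64 = (1 << 64) - 1


def slli(x: int, n: int) -> int:
    """Logical shift left on a 64-bit register."""
    return (x << n) & MASK64


def rori(x: int, n: int) -> int:
    """Rotate right on a 64-bit register."""
    n %= 64
    return ((x >> n) | (x << (64 - n))) & MASK64


def add_high_byte(rax: int, rbx: int) -> int:
    t0 = slli(rbx, 64 - 8)    # low byte of rbx now sits in bits 56..63
    a0 = rori(rax, 16)        # bits 8..15 of rax now sit in bits 56..63
    a0 = (a0 + t0) & MASK64   # the carry falls off the top of the register
    return rori(a0, 64 - 16)  # rotate everything back into place


print(hex(add_high_byte(0x1122334455667788, 0x01)))  # 0x1122334455667888
print(hex(add_high_byte(0x1122334455667788, 0xFF)))  # 0x1122334455667688
```

The second call shows the wrap-around: 0x77 + 0xFF becomes 0x76 and the rest of the register is untouched.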

bonzini

1 replies

11h5m

2024-08-27 07:26:43 UTC

Nice trick, in fact with 4 instructions it's as efficient as extract/insert and it works for all ADD/SUB/OR/XOR/CMP instructions (not for AND), except if the source is a high-byte register. However it's not really a problem if code generation is not great in this case: compilers in practice will not generate accesses to these registers, and while old 16-bit assembly code has lots of such accesses it's designed to run on processors that ran at 4-20 MHz.

Flag computation and conditional jumps is where the big optimization opportunities lie. Box64 uses a multi-pass decoder that computes liveness information for flags and then computes flags one by one. QEMU instead tries to store the original operands and computes flags lazily. Both approaches have advantages and disadvantages...
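
As a rough illustration of the lazy approach (a toy Python model, not QEMU's or Box64's actual code), the emulator can just remember the operands and result of the last ADD and derive individual flags only if something later reads them. The class and method names are hypothetical.

```python
class LazyAddFlags:
    """Remember the last ADD; compute x86-style flags only on demand."""

    def __init__(self, bits: int = 64):
        self.bits = bits
        self.mask = (1 << bits) - 1
        self.a = self.b = self.res = 0

    def add(self, a: int, b: int) -> int:
        # The hot path only stores values; no flag work happens here.
        self.a = a & self.mask
        self.b = b & self.mask
        self.res = (self.a + self.b) & self.mask
        return self.res

    def zf(self) -> bool:
        return self.res == 0

    def cf(self) -> bool:
        # An unsigned add carried out iff the result wrapped below an operand.
        return self.res < self.a

    def sf(self) -> bool:
        return bool(self.res >> (self.bits - 1))

    def of(self) -> bool:
        # Signed overflow: operands share a sign that the result lacks.
        sign = 1 << (self.bits - 1)
        return bool(~(self.a ^ self.b) & (self.a ^ self.res) & sign)


f = LazyAddFlags(bits=8)
f.add(0x7F, 0x01)
print(f.zf(), f.cf(), f.sf(), f.of())  # False False True True
```

If the next instruction overwrites the flags before anyone reads them, none of the flag methods ever run, which is where the savings would come from.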

ptitSeb

0 replies

5h4m

2024-08-27 13:28:09 UTC

Actually, Box64 can also store operands for later computation, depending on what comes next...

ksco

0 replies

2h27m

2024-08-27 16:04:20 UTC

Author here, we have adopted this approach as a fast path to box64: https://github.com/ptitSeb/box64/pull/1763, thank you very much!

jokoon

15 replies

4h52m

2024-08-27 13:39:19 UTC

Question for somebody who doesn't work in chips: what does a software engineer have to do differently when targeting software for RISC-V?

I would imagine that executable size increases, meaning it has to be aggressively optimized for cache locality?

I would imagine that some types of software are better suited for either CISC or RISC, like games, webservers?

dzaima

12 replies

4h42m

2024-08-27 13:50:05 UTC

RISC-V with the compressed instruction extension actually ends up smaller than x86-64 and ARM on average.

There's not much inherent that needs to change in software approach. Probably the biggest thing vs x86-64 is the availability of 32 registers (vs 16 on x86-64), allowing for more intermediate values before things start spilling to stack, which also applies to ARM (which too has 32 registers). But generally it doesn't matter unless you're micro-optimizing.

More micro-optimization things might include:

- The vector extension (aka V or RVV) isn't in the base rv64gc ISA, so you might not get SIMD optimizations depending on the target; whereas x86-64 and aarch64 have SSE2 and NEON (128-bit SIMD) in their base.

- Similarly, no popcount & count leading/trailing zeroes in base rv64gc (requires Zbb); base x86-64 doesn't have popcount, but does have clz/ctz. aarch64 has all.

- Less efficient branchless select, i.e. "a ? b : c"; takes ~4-5 instrs on base rv64gc, 3 with Zicond, but 1 on x86-64 and aarch64. Some hardware can also fuse a jump over a mv instruction to be effectively branchless, but that's even more target-specific.
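To illustrate the branchless-select point, here is a Python sketch that models the instruction semantics. The czero helpers follow the Zicond definitions (czero.eqz yields 0 when the condition register is zero, czero.nez yields 0 when it is nonzero); the base-ISA sequence shown is just one possible choice, not necessarily what a compiler emits.

```python
MASK64 = (1 << 64) - 1

def czero_eqz(rs1, rs2):
    """Models Zicond czero.eqz: 0 if rs2 == 0, otherwise rs1."""
    return 0 if rs2 == 0 else rs1

def czero_nez(rs1, rs2):
    """Models Zicond czero.nez: 0 if rs2 != 0, otherwise rs1."""
    return 0 if rs2 != 0 else rs1

def select_zicond(a, b, c):
    """a ? b : c in three instructions: czero.eqz, czero.nez, or."""
    t0 = czero_eqz(b, a)
    t1 = czero_nez(c, a)
    return t0 | t1

def select_base(a, b, c):
    """a ? b : c on base rv64gc in five instructions: snez, neg, xor, and, xor."""
    mask = -int(a != 0) & MASK64      # snez + neg: all ones if a != 0, else zero
    return c ^ ((b ^ c) & mask)       # xor, and, xor

print(select_zicond(1, 10, 20), select_base(0, 10, 20))
```

On x86-64 and aarch64 the same select is a single cmov or csel once the condition has been evaluated.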

RISC-V profiles kind of solve the first two issues (e.g. Android requires rva23, which requires rvv & Zbb & Zicond among other things) but if linux distros decide to target rva20/rv64gc then they're ~forever stuck without having those extensions in precompiled code that hasn't bothered with dynamic dispatch. Though this is a problem with x86-64 too (much less so with ARM as it doesn't have that many extensions; SVE is probably the biggest thing by far, and still not supported widely (i.e. Apple silicon doesn't)).

packetlost

11 replies

4h24m

2024-08-27 14:07:17 UTC

That seems like something the compiler would generally handle, no? Obviously that doesn't apply everywhere, but in the general case it should.

vlovich123

6 replies

3h34m

2024-08-27 14:57:52 UTC

Vector stuff is typically hand coded with intrinsics or assembly. Autovectorization has mixed results because there’s no way to request the compiler to promise that it vectorized the code.

But for an emulator like this, box64 has to pick how to emulate vectorized instructions on RISC-V (e.g. slowly using scalars, or by reimplementing them with native vector instructions). The challenge, of course, is that you typically don’t get good performance unless the emulator can actually rewrite the code on the fly: a 1:1 mapping is going to be suboptimal compared with noticing patterns of high-level operations and substituting a more optimized implementation that replaces a whole chunk of instructions at once, to account for implementation differences on the chip (e.g. you may have to emulate missing instructions, but a rewriter could skip emulation if there’s an alternate way to accomplish the same high-level computation).

The biggest challenge for something like this from a performance perspective will of course be translating the GPU stuff efficiently to hit the native driver code, given that RISC-V is likely relying on OSS GPU drivers (and maybe Wine to add another translation layer if the game is Windows-only).

packetlost

1 replies

2h59m

2024-08-27 15:32:21 UTC

Vector stuff is typically hand coded with intrinsics or assembly. Autovectorization has mixed results because there’s no way to request the compiler to promise that it vectorized the code.

Right, but most of the time those are architecture specific and RVV 1.0 is substantially different than say, NEON or SSE2, so you need to change it anyways. You also typically use specialized registers for those, not the general purpose registers. I'm not saying there isn't work to be done (especially for an application like this one, that is extremely performance sensitive), I'm saying that most applications won't have these problems or be so sensitive that register spills matter much if at all.

vlovich123

0 replies

21m

2024-08-27 18:10:41 UTC

I’m highlighting that the compiler doesn’t take care of vector code quite as automatically or as well as it does register allocation and instruction selection, which are slightly more solved problems. And it’s easy to imagine that a compiler will fail to optimize a piece of code as well on something that’s architecturally quite novel. RISC-V and ARM aren’t actually so dissimilar at a high level that completely different optimizations need to be written or even selectively weighted by architecture, but I imagine something like a Mill CPU might require quite a reimagining to get anything approaching optimal performance.

fngjdflmdflg

1 replies

55m

2024-08-27 17:36:19 UTC

I read somewhere that since floating point addition is not associative the compiler will not autovectorize because the order might change.
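A small Python demonstration of why reordering matters. This is a sketch, not a model of any particular compiler: the "two-lane" function just mimics how a vectorized reduction would accumulate alternating elements in separate lanes and combine them at the end.

```python
def sequential_sum(values):
    """Scalar loop: add strictly left to right."""
    total = 0.0
    for v in values:
        total += v
    return total

def two_lane_sum(values):
    """Vector-style reduction: two independent partial sums, combined at the end."""
    lanes = [0.0, 0.0]
    for i, v in enumerate(values):
        lanes[i % 2] += v
    return lanes[0] + lanes[1]

values = [1e16, 1.0, -1e16, 1.0]
print(sequential_sum(values))   # 1.0: the first 1.0 is absorbed by 1e16
print(two_lane_sum(values))     # 2.0: the large values cancel inside lane 0

print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))   # False
```

Both answers are valid roundings of some evaluation order, but they differ, so a compiler that must preserve the source semantics cannot reorder the additions without permission (flags such as -ffast-math or -fassociative-math grant it).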

vlovich123

0 replies

25m

2024-08-27 18:07:03 UTC

It’s somewhat more complicated than that (& presumes your hot path is floating point instead of integral), but that can be a consideration.

tormeh

0 replies

2h59m

2024-08-27 15:33:12 UTC

I'd assume it uses RADV, same as the Steam Deck. For most workloads that's faster than AMD's own driver. And yes, it uses Wine and DXVK. As far as the game is concerned it's running on a DirectX-capable x86 Windows machine. That's a lot of translation layers.

dzaima

0 replies

3h21m

2024-08-27 15:10:38 UTC

On clang, you can actually request that it gives a warning on missed vectorization of a given loop with "#pragma clang loop vectorize(enable)": https://godbolt.org/z/sP7drPqMT (and you can even make it an error).

There's even "#pragma clang loop vectorize(assume_safety)" to tell it that pointer aliasing won't be an issue (gcc has a similar "#pragma GCC ivdep"), which should get rid of most odd reasons for missed vectorization.

dzaima

3 replies

4h20m

2024-08-27 14:12:04 UTC

It's something that the compiler would handle, but can still moderately influence programming decisions, i.e. you can have a lot more temporary variables before things start slowing down due to spill stores/loads (esp. in, say, a loop with function calls, as more registers also means more non-volatile registers (i.e. those that are guaranteed to not change across function calls)). But, yes, very limited impact even then.

packetlost

2 replies

3h55m

2024-08-27 14:36:28 UTC

It's certainly something I would take into consideration when making a (language) runtime, but probably not at all during all but the most performance sensitive of applications. Certainly a difference, but far lower level than what most applications require

dzaima

1 replies

3h45m

2024-08-27 14:46:51 UTC

Yep. Unfortunately I am one to be making language runtimes :)

It's just the potentially most significant thing I could come up with at first. Though perhaps RVV not being in rva20/rv64gc is more significant.

packetlost

0 replies

2h56m

2024-08-27 15:36:12 UTC

Looks like an APL project? That's really cool!

cesarb

0 replies

3h35m

2024-08-27 14:56:58 UTC

Question for somebody who doesn't work in chips: what does a software engineer have to do differently when targeting software for RISC-V?

Most of the time, nothing; code correctly written in higher-level languages like C should work the same. The biggest difference, the weaker memory model, is something you also have on most non-x86 architectures like ARM (and your code shouldn't be depending on having a strong memory model in the first place).

I would imagine that executable size increases, meaning it has to be aggressively optimized for cache locality?

For historical reasons, executable code density on x86 is not that good, so the executable size won't increase as much as you'd expect; both RISC-V with its compressed instructions extension and 32-bit ARM with its Thumb extensions are fairly compact (there was an early RISC-V paper which did that code size comparison, if you want to find out more).

I would imagine that some types of software are better suited for either CISC or RISC, like games, webservers?

What matters most is not CISC vs RISC, but the presence and quality of things like vector instructions and cryptography extensions. Some kinds of software like video encoding and decoding heavily depend on vector instructions to have good performance, and things like full disk encryption or hashing can be helped by specialized instructions to accelerate specific algorithms like AES and SHA256.

Pet_Ant

0 replies

1h8m

2024-08-27 17:23:30 UTC

No, any ISA pretty much should be equally good for any type of workload. If you are doing assembly programming then it makes a difference but if you were doing something in Python or Unity it really isn’t going to matter.

This is more about being free of ARM’s patents and getting a fresh start using the lessons learned.

nolist_policy

9 replies

7h3m

2024-08-27 11:28:30 UTC

The x86 instruction set is very very big. According to rough statistics, the ARM64 backend implements more than 1,600 x86 instructions in total, while the RV64 backend implements about 1,000 instructions

This is just insane and gets us full-circle to why we want RISC-V.

patmorgan23

2 replies

6h31m

2024-08-27 12:00:38 UTC

Not really. RISC-V's benefits are not the "Reduced Instruction Set" part, it's the open ISA part. A small instruction set actually has several disadvantages. It means your binary is bigger because what was a single operation in x86 is now several in RISC-V, meaning more memory bandwidth and cache is taken up by instructions instead of data.

Modern CPUs are actually really good at decoding operations into micro-ops. And the flexibility of being able to implement a complex operation in microcode or silicon is essential for CPU designers.

Is there a bunch of legacy crap in x86? Yeah. Does getting rid of it dramatically increase the performance ceiling? Probably not.

The real benefit of RISC-V is anybody can use it. It's democratizing the ISA. No one has to pay a license to use it, they can just build their CPU design and go.

zozbot234

0 replies

6h14m

2024-08-27 12:18:04 UTC

Modern CPUs are actually really good at decoding operations into micro-ops.

The largest out-of-order CPUs are actually quite reliant on having high-performance decode that can be performed in parallel using multiple hardware units. Starting from a simplified instruction set with less legacy baggage can be an advantage in this context. RISC-V is also pretty unique among 64-bit RISC ISA's wrt. including compressed instructions support, which gives it code density comparable to x86 at a vastly improved simplicity of decode (For example, it only needs to read a few bits to determine which insns are 16-bit vs. 32-bit length).
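To show how little needs to be inspected, here is a Python sketch of the length rule from the RISC-V base encoding: the two lowest bits of the first 16-bit parcel distinguish compressed from standard instructions. The function names are made up, and encodings longer than 32 bits are deliberately left unhandled.

```python
def insn_length(first_parcel):
    """Instruction length in bytes, from the low bits of the first 16-bit parcel."""
    if (first_parcel & 0b11) != 0b11:
        return 2                      # compressed (C extension) instruction
    if (first_parcel & 0b11100) != 0b11100:
        return 4                      # standard 32-bit instruction
    raise ValueError("encodings longer than 32 bits are not handled in this sketch")

def split_stream(code):
    """Walk a little-endian instruction stream and return each instruction's length."""
    lengths = []
    offset = 0
    while offset < len(code):
        parcel = int.from_bytes(code[offset:offset + 2], "little")
        size = insn_length(parcel)
        lengths.append(size)
        offset += size
    return lengths

# c.nop (0x0001), addi x0, x0, 0 (0x00000013), c.li a0, 0 (0x4501)
print(split_stream(bytes.fromhex("0100130000000145")))
```

An x86 decoder, by contrast, has to work through prefixes, opcode bytes, ModRM and SIB before it knows where the next instruction starts.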

panick21_

0 replies

5h11m

2024-08-27 13:20:49 UTC

means your binary is bigger .... meaning more memory bandwidth and cache

Except this isn't actually true.

Does getting rid of it dramatically increase the performance ceiling? Probably not.

No but it dramatically DECREASES the amount of investment necessary to reach that ceiling.

Assume you have 2 teams, and each gets the same amount of money. Then ask them to make the highest-performing spec-compatible chip. Which team is going to win 99% of the time?

And the flexibility of being able to implement a complex operation in microcode, or silicon is essential for CPU designers.

You can add microcode to a RISC-V chip if you want, most people just don't want to.

The real benefit of RISC-V is anybody can use it.

That is true, but it's also just a much better instruction set than x86 -_-

eternauta3k

1 replies

5h25m

2024-08-27 13:07:04 UTC

If an insane instruction set gives us higher performance and makes CPU and compiler design more complex, this might be an acceptable trade-off.

panick21_

0 replies

5h5m

2024-08-27 13:26:59 UTC

But it doesn't.

It's simply about the amount of investment. x86 had 50 years of gigantic amounts of sustained investment. Intel outsold all the RISC vendors combined by like 100 to 1 because they owned the PC business.

When Apple started seriously investing in ARM, they were able to match or beat x86 laptops.

The same will be true for RISC-V.

aithrowaway1987

1 replies

6h1m

2024-08-27 12:30:47 UTC

I think the 1600 number is a coarse metric for this sort of thing. Keep in mind that these instructions are limited in the number of formal parameters they can take: e.g. 16 nominally distinct instructions can be more readily understood/memorized as one instruction with an implicit 4-bit flag. Obviously there's a ton of legacy cruft in Intel ISAs, along with questionable decisions, and I'm not trying to take away from the appeals of RISC (e.g. there are lots of outstanding compiler bugs around these "pseudoparameterized" instructions). But it's easy to look at "1600" and think "ridiculous bloat," when in reality it's somewhat coherent and systematic - and more to the point, clearly necessary for highly performance-sensitive work.
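One familiar instance of this pattern, chosen here purely as an illustration (the comment does not name a specific group), is the x86 short conditional jump: sixteen mnemonics whose one-byte opcodes 0x70 to 0x7F differ only in a 4-bit condition code. A Python sketch with a hypothetical decoder function:

```python
# Condition suffixes in opcode order; the low 4 bits of the opcode select one.
CONDITIONS = ["o", "no", "b", "ae", "e", "ne", "be", "a",
              "s", "ns", "p", "np", "l", "ge", "le", "g"]

def decode_short_jcc(opcode):
    """Return the mnemonic for a short Jcc opcode byte, or None if it is not one."""
    if not 0x70 <= opcode <= 0x7F:
        return None
    return "j" + CONDITIONS[opcode & 0xF]

print([decode_short_jcc(op) for op in (0x74, 0x75, 0x7C, 0x7F)])
```

Counted by mnemonic these are sixteen instructions; counted by decoding logic they are one instruction with a parameter.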

panick21_

0 replies

5h8m

2024-08-27 13:24:07 UTC

clearly necessary for highly performance-sensitive work

It's clearly necessary to have compatibility back to the 80s. It's clearly necessary to have 10 different generations of SIMD. It's clearly necessary to have multiple different floating point systems.

h_tbob

0 replies

5h23m

2024-08-27 13:08:22 UTC

I want somebody to make a GPT fine tune that specializes in converting instructions and writing tests. If you made it read all x86 docs a bunch and risc v docs, a lot of this could be automated.

ben-schaaf

0 replies

2h58m

2024-08-27 15:33:31 UTC

ARM64 has approximately 1300 instructions.

littlecranky67

6 replies

12h45m

2024-08-27 05:46:31 UTC

Article is a bit short on "the basics" - I assumed they used some kind of Wine port to run it. But it seems they implemented the x86_64 ISA on a RISC-V chip in some way - can anyone shed more light on how that is done?

anewhnaccount2

5 replies

12h38m

2024-08-27 05:53:55 UTC

The basics are here: https://box86.org/ It is an emulator but:

Because box86 uses the native versions of some “system” libraries, like libc, libm, SDL, and OpenGL, it’s easy to integrate and use with most applications, and performance can be surprisingly high in some cases.

Wine can also be compiled/run as native.

ThatPlayer

4 replies

9h10m

2024-08-27 09:21:27 UTC

Wine can also be compiled/run as native.

I'm not sure you can run Wine natively to run x86 Windows programs on RISC-V because Wine is not an emulator. There is an ARM port of Wine, but that can only run Windows ARM programs, not x86.

Instead box64 is running the x86_64 Wine https://github.com/ptitSeb/box64/blob/main/docs/X64WINE.md

gary_0

3 replies

8h42m

2024-08-27 09:49:54 UTC

It should be theoretically possible to build Wine so that it provides the x86_64 API while compiling it to ARM/RISCV. Your link doesn't make it clear if that's what's being done or not.

(Although I suspect providing the API of one architecture while building for another is far easier said than done. Toolchains tend to be uncooperative about such shenanigans, for starters.)

ThatPlayer

2 replies

8h6m

2024-08-27 10:25:43 UTC

Box64's documentation is just on installing the Wine x64 builds from winehq repos, because most arm repos aren't exactly hosting x64 software. It's even possible to run Steam with their x64 Proton running Windows games. At least on ARM, not sure about RISC-V.

Wine's own documentation says it requires an emulator: https://wiki.winehq.org/Emulation

As Wine Is Not an Emulator, all those applications can't run on other architectures with Wine alone.

Or do you mean provide the x86_64 Windows API as a native RISC-V/ARM to the emulator layer? That would require some deeper integration for the emulator, but that's what Box64/box86 already does with some Linux libraries: intercept the api calls and replace them with native libraries. Not sure if it does it for wine

gary_0

1 replies

7h8m

2024-08-27 11:23:36 UTC

but that's what Box64/box86 already does with some Linux libraries: intercept the api calls and replace them with native libraries. Not sure if it does it for wine

Yeah, that's what I meant. It's simple in principle, after all: turn an AMD64 call into an ARM/RISCV call and pass it to native code.
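In very rough terms, the idea can be sketched like this in Python. Everything here is hypothetical (the table, the register dictionary, the function names) and it is not how box64 is actually structured; it only shows the shape of the technique: when the guest calls a known library symbol, read the arguments from the guest's calling-convention registers, call the host's native function, and write the result back.

```python
import math

MASK64 = (1 << 64) - 1

# System V AMD64 passes the first integer arguments in these registers.
SYSV_INT_ARG_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]

# Hypothetical table: guest symbol name -> (native host function, argument count)
NATIVE_WRAPPERS = {
    "gcd": (math.gcd, 2),
    "abs": (abs, 1),
}

def call_wrapped(symbol, regs):
    """Try to satisfy a guest call natively; return False if no wrapper exists."""
    entry = NATIVE_WRAPPERS.get(symbol)
    if entry is None:
        return False                      # caller would fall back to emulating guest code
    func, nargs = entry
    args = [regs[name] for name in SYSV_INT_ARG_REGS[:nargs]]
    regs["rax"] = func(*args) & MASK64    # integer return value goes in rax
    return True

guest_regs = {name: 0 for name in SYSV_INT_ARG_REGS + ["rax"]}
guest_regs["rdi"], guest_regs["rsi"] = 12, 18
print(call_wrapped("gcd", guest_regs), guest_regs["rax"])
```

The hard part in practice is everything the sketch leaves out: structs whose layout differs between architectures, callbacks from native code back into guest code, and variadic functions.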

Doing that for Wine would be pretty tricky (way more surface area to cover, possible differences between certain Win32 arch-specific structs and so forth) so I bet that's not how it works out of the box, but I couldn't tell for sure by skimming through the box64 repo.

lmz

0 replies

5h7m

2024-08-27 13:24:32 UTC

As demonstrated by Microsoft themselves in Windows 11: https://learn.microsoft.com/en-us/windows/arm/arm64ec

int0x29

3 replies

12h19m

2024-08-27 06:12:57 UTC

That screenshot shows 31 gb of ram which is distinctly more than the mentioned dev board at max specs. Are they using something else here?

snvzz

0 replies

11h50m

2024-08-27 06:42:10 UTC

Pioneer, an older board.

Note that, today, one of the recent options with several, faster cores implementing RVA22 and RVV 1.0 is the better idea.

ptitSeb

0 replies

11h1m

2024-08-27 07:30:34 UTC

The milk-v pioneer comes with 128GB of RAM.

pengaru

0 replies

12h9m

2024-08-27 06:22:56 UTC

https://milkv.io/pioneer

victor_cl

2 replies

11h6m

2024-08-27 07:25:27 UTC

I remember learning RISC-V in Berkeley CS61C. Anyone from Berkeley?

jychang

1 replies

10h54m

2024-08-27 07:37:58 UTC

There's nobody from Berkeley on HN

victor_cl

0 replies

6h53m

2024-08-27 11:38:27 UTC

oh really, didn't know that. Me neither. That course was open-sourced.

theragra

2 replies

9h46m

2024-08-27 08:46:12 UTC

Reminded me how one famous Russian guy ran Atomic Heart on Elbrus 8S.

Elbrus has a native translator, though, and a pretty good one, afaik. Atomic Heart was kinda playable, 15-25 fps.

mrweasel

0 replies

9h13m

2024-08-27 09:19:04 UTC

This guy: https://www.youtube.com/watch?v=-0t-5NWk_1o

Beijinger

0 replies

2h24m

2024-08-27 16:07:47 UTC

Elbrus is/was RISC?-V?

mrlonglong

2 replies

8h15m

2024-08-27 10:17:14 UTC

Is this the 86Box? I found it fun reliving the time I got my Amstrad PC1512, I added two hard cards of 500MB and a 128k memory expansion to 640KB which made things a lot more fun. Back then I only had two 360KB floppies and added a 32MB hard card a few years later. I had Borland TurboPascal and Zortech C too. Fun times.

ptitSeb

1 replies

7h54m

2024-08-27 10:37:53 UTC

No, it's Box64, a completely different project.

(But I do remember the time I had an Amstrad PC1512 too :D )

mrlonglong

0 replies

7h34m

2024-08-27 10:58:06 UTC

It will be interesting to try out Box64 as soon as I get my hands on some suitable RISC-V hardware. I have played with RISC-V microcontrollers; they're quite nice to work with.

lyu07282

2 replies

9h46m

2024-08-27 08:46:11 UTC

Another technically impressive Witcher 3 feat was the Switch port, it ran really well. Goes to show how much can be done with optimization and how much resources are wasted on the PC purely by bad optimization.

zamadatix

0 replies

1h2m

2024-08-27 17:30:15 UTC

You too can run Witcher 3 equally well on a minimal PC if you're willing to set the render resolution to 720p (540p undocked), settings to below minimum, and call ~30 FPS "running well".

laserbeam

0 replies

8h39m

2024-08-27 09:53:08 UTC

And with using much lower quality textures and 3D models, therefore using much less RAM for assets. It's not an apples to apples comparison and you can't really make claims about bad optimization on PCs when the scope of what's shown on screen is vastly different.

brandonpelfrey

1 replies

13h6m

2024-08-27 05:25:51 UTC

Incredible result! This is a tremendous amount of work and does seem like RV is at its limits in some of these cases. The bit gather and scatter instructions should become an extension!

glitchc

0 replies

4h32m

2024-08-27 14:00:03 UTC

Would be useful to see test results on a game that relies more heavily on the graphics core than the CPU. Perhaps Divinity 2?

sylware

0 replies

4h32m

2024-08-27 14:00:13 UTC

lol, I am going the other way around.

Since the RISC-V ISA is worldwide royalty-free and more than nice, I am writing basic rv64 assembly, which I interpret on x86_64 hardware with a Linux kernel.

I did not push the envelope as far as having a "compiler", because this is really just a stopgap while waiting for seriously performant desktop-class (i.e. large) rv64 hardware implementations.

stuckinhell

0 replies

5h40m

2024-08-27 12:51:31 UTC

wow very impressive

sdwrj

0 replies

2h7m

2024-08-27 16:25:08 UTC

box64 is getting too advanced lol

high_na_euv

0 replies

8h53m

2024-08-27 09:39:00 UTC

Great game choice!

bee_rider

0 replies

2h59m

2024-08-27 15:32:44 UTC

I wonder if systems will ship at some point that are a handful of big RISC-V CPUs, and then a “GPU” implemented as a bunch of little RISC-V CPUs (with the appropriate vector stuff—actually, side-question, can classic vectors, instead of packed SIMD, be useful in a GPU?)

anthk

0 replies

6h45m

2024-08-27 11:46:35 UTC

I used to use GL4ES on the PocketCHIP. And I daily use it on a netbook to get more performance on some GL 2.1 games.

Thaxll

0 replies

4h39m

2024-08-27 13:52:45 UTC

Box86 is so good, I run x86-64 steam games ( servers ) on free Oracle instance ( ARM64 ) with it.

KingOfCoders

0 replies

6h2m

2024-08-27 12:30:15 UTC

"which allows games like Stardew Valley to run, but it is not enough for other more serious Linux games"

Hey! ;-)

Havoc

0 replies

6h21m

2024-08-27 12:11:00 UTC

15 fps in-game

Wow...that's substantially more than I would have guessed. Good times ahead for hardware

Beijinger

0 replies

2h21m

2024-08-27 16:10:36 UTC

Previously: https://news.ycombinator.com/item?id=19118642

And:

Milk-V Pioneer A 64-core, RISC-V motherboard and workstation for native development

https://www.crowdsupply.com/milk-v/milk-v-pioneer