
FuryGpu – Custom PCIe FPGA GPU

MalphasWats
22 replies
1d6h

It's incredible how influential Ben Eater's breadboard computer series has been in hobby electronics. I've been similarly inspired to try to design my own "retro" CPU.

I desperately want something as easy to plug into things as the 6502, but with jussst a little more capability - a few more registers, hardware division, that sort of thing. It's a really daunting task.

I always end up coming back to just using an MCU and being done with it, and then I hit the How To Generate Graphics problem.

MenhirMike
7 replies
1d4h

I was about to recommend the Parallax Propeller (the first one that's available in DIP format), but arguably, that one is way more complex to program for (and also significantly more powerful, and at that point you might as well look into an ESP32 and that is "just use an MCU" :))

And yeah, video output is a significant issue because of the required bandwidth for digital outputs (unless you're okay with composite or VGA outputs, I guess they can still be done with readily available chips?). The recent Commander X16 settled for an FPGA for this.

MalphasWats
6 replies
1d3h

I feel like the CX16 lost its way about a week after the project started and it suddenly became an expensive FPGA-based blob. But at the same time, I'm not sure what other option there is for a project like that.

I always got the impression that David sort of got railroaded by the other members of the team, who wanted to keep adding features and MOAR POWAH, and he didn't have a huge amount of choice because those features quickly grew beyond his own areas of knowledge.

erik
1 replies
21h59m

Modern retro computer designs run into the problem of generating a video signal. Ideally you'd have tile- and sprite-based rendering, and you'd like to support HDMI or at least VGA. But there are no modern parts that offer this, and building the functionality out of discrete components is impractical and unwieldy.

An FPGA is really just the right tool for solving the video problem. Or some projects do it with a microcontroller. But it's sort of too bad, as it kind of undercuts the spirit of the whole design. If your video processor is orders of magnitude more powerful than the rest of the computer, then one starts to ask why not just implement the entire computer inside the video processor?

MenhirMike
0 replies
21h14m

It's one of the funny things about the Raspberry Pi Pico W: the Infineon CYW4343 has an integrated ARM Cortex-M3 CPU, so the WiFi/BT chip is technically more advanced than the actual RP2040 (which is a Cortex-M0+), and it also has more built-in ROM/RAM than what's on the Pico board for the RP2040 to use.

And yeah, you can't really buy sprite-based video chips anymore, and you don't even have to worry about stuff like "Sprites per Scanline" because you can get a proper framebuffer for essentially free - but now you might as well go further and use one microprocessor to be the CPU, GPU, and FM Synthesizer Sound Chip and "just" add the logic to generate the actual video/audio signals.

MenhirMike
2 replies
1d3h

I think so too. It must have been a great learning experience for him, but for me, the idea of "The best C64-like computer that ever existed" died pretty quickly.

He also ran into a problem I hit when I tried something like that as well: sound chips. Building a system around a Yamaha FM synthesizer is perfect, but I also found that most of the chips out there are broken, fake, or both, and that no one makes them anymore. Which makes sense, because if you want a sound chip these days you use an AC97 or HD Audio codec and call it a day, but that goes against that spirit.

I think that the spirit of hobby electronics is really found in FPGAs these days instead of rarer and rarer DIP parts. Which is a bit sad, but I guess that's just the passage of time. I wonder if that's how some people felt in the 70s when CPUs replaced many distinct layouts, or if they rejoiced and embraced it instead.

I've given up trying to build a system on a breadboard and think that MiSTer is the modern equivalent of that.

dragontamer
1 replies
1d2h

I think that the spirit of hobby electronics is really found in FPGAs these days instead of rarer and rarer DIP parts. Which is a bit sad, but I guess that's just the passage of time. I wonder if that's how some people felt in the 70s when CPUs replaced many distinct layouts, or if they rejoiced and embraced it instead.

Microcontrollers have taken over. When 8kB-SRAM, 20MHz microcontrollers exist below 50 cents, at minuscule 25mm^2 chip sizes, drawing only 500uA of current... there's very little reason to use a collection of 30 chips to get equivalent functionality.

Except performance. If you need performance then bam, FPGA land comes in and Zynq just has too much performance at too low a cost (though not quite as low as the microcontroller gang).

----------

Hobby Electronics is great now. You have so many usable parts at very low costs. A lot of problems are "solved" yes, but that's a good thing. That means you can focus on solving your hobby problem rather than trying to invent a new display driver or something.

gnramires
0 replies
1d1h

Another advantage of hobby anything is that you can just do, and reinvent whatever you want. Sure, fast CPUs/MCUs exist now and can do whatever you want. But if you feel like reinventing the wheel just for the sake of it, no one will stop you![1]

I do think some people who fondly remember the user experience of those old machines might be better served by using modern machines (like a Raspberry Pi or even a standard PC) in a different way, instead of trying to use old hardware. That's the good old Turing machine universality (you can simulate practically any machine you like using newer hardware, if what you're interested in is software). You can even add artificial limitations like PICO-8 or TIC-80 does.

See also uxn:

https://100r.co/site/uxn.html

and (WIP) picotron:

https://www.lexaloffle.com/picotron.php

I think there's a general concept here of making 'operating environments' that are pleasant to work within (or have fun limitations), which I think are more practical than a dedicated operating system, optionally with dedicated hardware. Plus (unless you particularly want to!) you don't need to worry about all the complex parts of operating systems like network stacks, drivers and such.

[1] Maybe we should call that Hobby universality (or immortality?) :P If it's already been made/discovered, you can always make it again just for fun.

jsheard
6 replies
1d6h

I've been looking into graphics on MCUs and was disappointed to learn that the little "NeoChrom" GPU they're putting on newer STM32 parts is completely undocumented. Historically they have been good about not putting black boxes in their chips, but I guess it's probably an IP block they've licensed from a third party.

jsheard
0 replies
1d6h

You can do something similar on STM32 parts that have an LCD controller, which can be abused to drive a VGA DAC or a DVI encoder chip. The LCD controller at least is fully documented, but many of their parts pair that with a small GPU, which would be an advantage over the GPU-less RP2040... if there were any public documentation at all for the GPU :(

CarVac
0 replies
1d5h

I used "composite" (actually monochrome) video output software someone wrote on the RP2040 for an optional feature on the PhobGCC custom gamecube controller motherboard to allow easy calibration, configuration, and high-frequency input recording and graphing.

Pictures of the output here: https://github.com/PhobGCC/PhobGCC-doc/blob/main/For_Users/P...

unwind
1 replies
1d5h

Agreed. It is so, so, so very disappointing. I was deeply surprised (in a non-pleasant way) when I first opened up a Reference Manual for one of those chips and saw that the GPU chapter was, like, four pages. :(

nick__m
0 replies
1d2h

On the ST forum the company clearly said that they will only release documentation to selected partners. That's sad.

MrBuddyCasino
0 replies
11h46m

That sucks. There are other MCUs with 2D graphics peripherals, e.g. the NXP i.MX line.

verticalscaler
4 replies
1d5h

True, can't think of much else this popular.

He started posting videos again recently with some regularity after a lull. Audience is in the low hundreds of thousands. I assume fewer than 100k actually finish videos and fewer still do anything with it.

Hobby electronics seems surprisingly small in this era.

hedora
2 replies
1d3h

I wonder if there’s much overlap between people that watch YouTube to get deep technical content (instead of reading), and people that care about hobby electronics.

I'm having trouble wrapping my head around how/why you'd use YouTube to present analog electrical engineering formulas and pinout diagrams instead of using LaTeX or a diagram.

robinsonb5
0 replies
1d1h

I consider YouTube (or rather, video in general) a fantastic platform for showcasing something cool, demonstrating what it can do, and even demonstrating how to drive a piece of software - but for actual technical learning I loathe the video format - it's so hard to skim, re-read, pause, recap and digest at your own speed.

The best compromise seems to be webpages with readable technical info and animated video illustrations - such as the one posted here yesterday about how radio works.

jpc0
0 replies
22h10m

For some things, a lot of nuance is lost in just writing. The unknown unknowns.

There have been a lot of times where I'm showing someone new to my field something, and they stop me before I get to what I thought was the "educational" point and ask what I just did.

Video can portray that pretty well because the information is there for you to see; with a schematic or write-up, if the author didn't put it there, the information isn't there.

TillE
0 replies
22h55m

Even if you're not much of a tinkerer, Ben Eater's videos are massively helpful if you want to truly understand how computers work. As long as you come in knowing the rudiments of digital electronics, just watching his stuff is a whole education in 8-bit computer design. You won't quite learn how modern computers work with their fancy caches and pipelines and such, but it's a really strong foundation to build on.

I've built stuff with microcontrollers (partially aided by techniques learned here), but that was very purpose-driven and I'm not super interested in just messing around for fun.

bArray
0 replies
1d1h

Registers can be worked around by using the stack and/or memory. Division could always be implemented as a simple function. It's part of the fun of working at that level.
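
For illustration, here's what that "simple function" can look like: a minimal shift-and-subtract division routine, sketched in C (the same idea translates directly to assembly on a CPU with no divide instruction):

    #include <stdint.h>

    /* Shift-and-subtract (restoring) division: one iteration per result bit.
       The classic routine for a CPU without a divide instruction. Assumes divisor != 0. */
    static uint16_t div16(uint16_t dividend, uint16_t divisor, uint16_t *remainder)
    {
        uint16_t quotient = 0;
        uint32_t rem = 0;   /* wide enough that the shift below can't overflow */

        for (int i = 15; i >= 0; i--) {
            rem = (rem << 1) | ((dividend >> i) & 1);   /* bring down the next bit */
            if (rem >= divisor) {
                rem -= divisor;
                quotient |= (uint16_t)(1u << i);
            }
        }
        if (remainder)
            *remainder = (uint16_t)rem;
        return quotient;
    }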

Regarding graphics, just output over serial initially. Abstract the problem away until you are ready to deal with it. If you sneak up on an Arduino and make it scream, you can make it into a very basic VGA graphics card [1]. Even easier is an ESP32-to-VGA board (which also gives you keyboard and mouse) [2].

[1] https://www.instructables.com/Arduino-Basic-PC-With-VGA-Outp...

[2] https://www.aliexpress.us/item/1005006222846299.html

PfhorSlayer
0 replies
1d1h

Funny enough, that's exactly where this project started. After I built his 8 bit breadboard computer, I started looking into what might be involved in making something a bit more interesting. Can't do a whole lot of high-speed anything with discrete logic gates, so I figured learning what I could do with an FPGA would be far more interesting.

PfhorSlayer
17 replies
1d1h

So, this is my project! Was somewhat hoping to wait until there was a bit more content up on the site before it started doing the rounds, but here we are! :)

To answer what seems to be the most common question I get asked about this, I am intending on open-sourcing the entire stack (PCB schematic/layout, all the HDL, Windows WDDM drivers, API runtime drivers, and Quake ported to use the API) at some point, but there are a number of legal issues that need to be cleared (with respect to my job) and I need to decide the rest of the particulars (license, etc.) - this stuff is not what I do for a living, but it's tangentially-related enough that I need to cover my ass.

The first commit for this project was on August 22, 2021. It's been a bit over two and a half years I've been working on this, and while I didn't write anything up during that process, there are a fair number of videos in my YouTube FuryGpu playlist (https://www.youtube.com/playlist?list=PL4FPA1MeZF440A9CFfMJ7...) that can kind of give you an idea of how things progressed.

The next set of blog posts that are in the works concern the PCIe interface. It'll probably be a multi-part series starting at the PCB schematic/layout and moving through the FPGA design and ending with the Windows drivers. No timeline on when that'll be done, though. After having written just that post on how the Texture Units work, I've got even more respect for those that can write up technical stuff like that with any sort of timing consistency.

I'll answer the remaining questions in the threads where they were asked.

Thanks for the interest!

michaelt
10 replies
1d1h

Googling the Xilinx Zynq UltraScale+ it seems kinda expensive.

Of course plenty of hobbies let people spend thousands (or more) so there's nothing wrong with that if you've got the money. But is it the end target for your project? Or do you have ambitions to go beyond that?

0xcde4c3db
4 replies
21h17m

I've been told by several people that distributor pricing for FPGAs is ridiculously inflated compared to what direct customers pay, and considering that one can apparently get a dev board on AliExpress for about $110 [1] while Digikey lists the FPGA alone for about $1880 [2], I believe it (this example isn't an UltraScale chip, but it is significantly bigger than the usual low-end Zynq 7000 boards sold to undergrads and tinkerers).

[1] https://www.aliexpress.us/item/3256806069467487.html

[2] https://www.digikey.com/en/products/detail/amd/XC7K325T-1FFG...

bangaladore
2 replies
20h12m

I have some first- and second-hand experience with this, and you are correct. I'm not sure who benefits from this practice. It's anywhere from 5-25x cheaper in even small-ish quantities.

oasisaimlessly
1 replies
16h5m

What magnitude of a quantity is "small-ish"? How does a business go about becoming a "direct customer" / bypassing the distributors?

0xcde4c3db
0 replies
14h24m

I'm personally too far from those negotiations to offer any likely-pivotal insight (such as a concrete quantity), but my very rough understanding is that there's some critical volume beyond which a customer basically becomes "made" with the Xilinx/Altera sales channels via a financially significant design win, at which point sales engineers etc. all but have a blank check to do things like comp development boards, advance a tray of whatever device is relevant to the design, and so on.

Basically, as George Carlin put it, "it's a big club, and you ain't in it".

mips_r4300i
0 replies
4h51m

This is both true and false. While I work with Intel/Altera, Xilinx is basically the same.

That devboard is using recycled chips 100 percent. Their cost is almost nothing.

The Kintex-7 part in question can probably be bought in volume quantities for around $190. Think 100k EAU (estimated annual usage).

This kind of price break comes with volume and is common with many other kinds of silicon besides FPGAs. Some product lines have more pricing pressure than others. For example, very popular MCUs may not get as wide of a price break. Some manufacturers price more fairly to distributors, some allow very large discounts.

PfhorSlayer
3 replies
1d

Let's be clear here, this is a toy. Beyond being a fun project to work on that could maybe get my foot in the door were I ever to decide to change careers and move into hardware design, this is not going to change the GPU landscape or compete with any of the commercial players. What it might do is pave the way for others to do interesting things in this space. A board with all of the video hardware that you can plug into a computer with all the infrastructure available to play around with accelerating graphics could be a fun, if extremely niche, product. That would also require a *significant* time and money investment from me, and that's not something I necessarily want to deal with. When this is eventually open-sourced, those who really are interested could make their own boards.

One thing to note is that while the US+ line is generally quite expensive (the higher-end parts sit in the five-figure range for a one-off purchase! No one actually buying these is paying that price, but still!), the Kria SOMs are quite cheap in comparison. They've got a reasonably powerful Zynq US+ for about $400, or just $350ish for the dev boards (which do not expose some of the high-speed interfaces like PCIe). I'm starting to sound like a Xilinx shill given how many times I've re-stated this, but for anyone serious about getting into this kind of thing, those dev boards are an amazing deal.

belter
1 replies
21h44m

"...I'm doing a (free) operating system (just a hobby, won't be big and professional like gnu) for 386(486) AT clones..."

Rinzler89
0 replies
5h19m

Yeah, you're referring to the Linux kernel, but software is much cheaper to design, test, build, scale, and turn profitable than hardware, especially GPUs.

Open-source GPUs won't threaten Nvidia/AMD/Intel anytime soon, or ever. They're way too far ahead in the game and also backed by patents if any new player were to become a threat.

chrsw
0 replies
5h58m

could maybe get my foot in the door were I ever to decide to change careers and move into hardware design

With a project like this I think you're well past a "foot in the door".

kanetw
0 replies
1d

The Kria SOM in use here is like $300.

ruslan
2 replies
21h59m

How much does it depend on hard IP blocks? I mean, can it be ported to FPGAs from other vendors, like the Lattice ECP5? Did you implement PCIe in HDL or use a vendor-specific IP block? Please provide some resource utilization statistics. Thanks.

alexforencich
0 replies
46m

The GPU uses https://github.com/alexforencich/verilog-pcie + the Xilinx PCIe hard IP core. When using the device-independent DMA engine, that library supports both Xilinx and Intel FPGAs.

PfhorSlayer
0 replies
20h12m

Implementing PCIe in the fabric without using the hard IP would be foolish, and definitely not the kind of thing I'd enjoy spending my time on! The design makes extensive use of the DSP48E2 and various BRAM/URAM blocks available in the fabric. I don't have exact numbers off the top of my head, but roughly it's ~500 DSP units (primarily for multiplication), ~70k LUTs, ~135k FFs, and ~90 BRAMs. Porting it to a different device would be a pretty significant undertaking, but would not be impossible. Many of the DSP resources are inferred, but there is a lot of timing stuff that depends on the DSP48E2's behavior - multiple register stages following the multiplies, the inputs are sized appropriately for those specific DSP capabilities, etc.

pocak
1 replies
21h38m

In the post about the texture unit, that ROM table for mip level address offsets seems to use quite a bit of space. Have you considered making the mip base addresses a part of the texture spec instead?

PfhorSlayer
0 replies
19h24m

The problem with doing that is it would require significantly more space in that spec. At a minimum, one offset for each possible mip level. That data needs to be moved around the GPU internally quite a bit, crossing clock domains and everything else, and would require a ton of extra registers to keep track of. Putting it in a ROM is basically free - a pair of BRAM versus a ton of registers (and the associated timing considerations), the BRAM wins almost every time.
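
For anyone curious what such a ROM amounts to, here's a rough sketch in C of the idea (purely illustrative; the dimensions and layout are assumptions, not the actual FuryGpu table): because each mip's offset is fully determined by the base dimensions, a small table indexed by (log2 width, log2 height, level) can replace per-texture offset fields entirely.

    #include <stdint.h>

    /* Up to 1024x1024 base textures - an assumption for this sketch. */
    #define MAX_LOG2_DIM 10
    #define MAX_MIPS     (MAX_LOG2_DIM + 1)

    /* offset_rom[lw][lh][level] = texel offset of mip 'level' for a (1<<lw) x (1<<lh)
       texture. Entries past the last valid level of a given size are never read.
       Built once; in hardware this becomes a ROM initialised at synthesis time. */
    static uint32_t offset_rom[MAX_LOG2_DIM + 1][MAX_LOG2_DIM + 1][MAX_MIPS];

    static void build_mip_offset_rom(void)
    {
        for (int lw = 0; lw <= MAX_LOG2_DIM; lw++) {
            for (int lh = 0; lh <= MAX_LOG2_DIM; lh++) {
                uint32_t offset = 0;
                uint32_t w = 1u << lw, h = 1u << lh;
                for (int level = 0; level < MAX_MIPS; level++) {
                    offset_rom[lw][lh][level] = offset;  /* where this mip starts */
                    offset += w * h;                     /* next mip follows it */
                    if (w > 1) w >>= 1;
                    if (h > 1) h >>= 1;
                }
            }
        }
    }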

rustybolt
0 replies
1d1h

I have seen semi-regular updates from you on discord and it is awesome to see how far this project has come (and also a bit frustrating to see how relatively little progress I have made on my FPGA projects in the same time!). I was hoping you'd do a writeup, can't wait!

jamesu
16 replies
1d9h

Similarly there is this: https://github.com/ToNi3141/Rasterix

Would be neat if someone made an FPGA GPU which had a shader pipeline honestly.

actionfromafar
12 replies
1d8h

How good would a Ryzen with 32 cores be if it did just graphics?

immibis
9 replies
1d7h

Wasn't Intel Larrabee something like that? Get a bunch of dumb x86 cores together and tell them to do graphics?

actionfromafar
7 replies
1d7h

I'm so sad Larrabee or similar things never took off. No, it might not have benchmarked well against contemporary graphics cards, but I think these matrices of x86 cores could have been put to great use for cool things not necessarily related to graphics.

fancyfredbot
6 replies
1d6h

Intel launched Larrabee as Xeon Phi for non-graphics purposes. Turns out it wasn't especially good at those either. You can still pick one up on eBay today for not very much.

actionfromafar
2 replies
1d5h

That's where we have to agree to (potentially) disagree. I lament that these or similar designs didn't last longer in the market, so people could learn how to harness them.

Imagine for instance hard real time tasks, each one task running on its own separate core.

rjsw
0 replies
1d5h

I think Intel should have made more effort to get cheap Larrabee dev boards onto the market; they could have used chips that didn't run at full speed or had too many broken cores to sell at full price.

fancyfredbot
0 replies
10h13m

I think Intel has similar designs? The Xeon Phi had 60 cores, and their high-core-count CPUs have 56. The GPU Max 1550 has 128 low-power Xe cores.

bee_rider
1 replies
1d3h

Probably not aided by the fact that conventional Xeon core counts were sneaking up on them—not quite caught up, but anybody could see the trajectory—and offered a much more familiar environment.

actionfromafar
0 replies
22h37m

Yes, I agree. Still unfortunate. I think the concept was very promising. But Intel had no appetite for burning money on it to see where it would go in the long run.

Y_Y
0 replies
1d5h

The novelty of sshing into a PCI card is nice though. I remember trying to use them at an HPC cluster: all the convenience of wrangling GPUs, but at a fraction of the performance.

erik
0 replies
21h49m

Larrabee was mostly x86 cores, but it did have sampling/texturing hardware because it's way more efficient to do those particular things in the 3d pipeline with dedicated hardware.

tux3
0 replies
1d8h

You can run Crysis in software rendering on a high core count AMD CPU.

It's terrible use of the hardware and the performance is far from stellar, but you can!

danbruc
2 replies
1d4h

If you are going to that effort, you might also want a decent resolution. Say we aim for about one megapixel (720p) and 30 frames per second; then we have to calculate 27.7 megapixels per second. If you get your FPGA to run at 500 MHz, that gives you 18 clock cycles per pixel. So you would probably want something like 100 cores, keeping in mind that we also have to run vertex shaders. We also need quick access to a sizable amount of memory, and I am not sure whether one can get away with integer or fixed-point arithmetic, or whether floating-point arithmetic is pretty much necessary. Another complication I would expect is that it is probably much easier to build a long execution pipeline if you are implementing a fixed-function pipeline as compared to a programmable processor. Things like out-of-order execution are probably best offloaded to the compiler in order to keep the design simpler and more compact.

So my guess is that it would be quite challenging to implement a modern GPU in an affordable FPGA if you want more than a proof of concept.
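
A quick back-of-the-envelope version of that budget, using the figures above (a sketch, not a benchmark):

    #include <stdio.h>

    int main(void)
    {
        const double pixels_per_frame = 1280.0 * 720.0;  /* 720p, ~0.92 megapixel */
        const double fps              = 30.0;
        const double fpga_clock_hz    = 500e6;

        double pixels_per_second = pixels_per_frame * fps;            /* ~27.6 Mpixel/s */
        double cycles_per_pixel  = fpga_clock_hz / pixels_per_second; /* ~18 cycles */

        printf("%.1f Mpixel/s -> %.1f clock cycles per pixel\n",
               pixels_per_second / 1e6, cycles_per_pixel);
        return 0;
    }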

d_tr
0 replies
1d1h

There's a new board by Trenz with a Versal chip which can do 440 GFLOPS just with the DSP58 slices (the lowest speed grade) and it costs under 1000 Euros, but you also need to buy a Vivado license currently.

Cheaper boards are definitely possible since there are smaller parts in that family, but they need to offer support for some of them in the free version of Vivado...

PfhorSlayer
0 replies
1d

You've nailed the problem directly on the head. For hitting 60Hz in FuryGpu, I actually render at 640x360 and then pixel-double (well, pixel->quad) the output to the full 720p. Even with my GPU cores running at 400MHz and the texture units at 480MHz with fully fixed-function pipelines, it can still struggle to keep up at times.

I do not doubt that a shader core could be built, but I have reservations about the ability to run it fast enough or have as many of them as would be needed to get similar performance out of them. FuryGpu does its front-end (everything up through primitive assembly) in full fp32. Because that's just a simple fixed modelview-projection matrix transform it can be done relatively quickly, but having every single vertex/pixel able to run full fp32 shader instructions requires the ability to cover instruction latency with additional data sets - it gets complicated, fast!
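
For context, the fixed front-end transform being described - one modelview-projection multiply per vertex, then the perspective divide and viewport mapping - is roughly this, sketched in C (illustrative only, not the actual FuryGpu HDL; clipping and the usual Y flip are omitted):

    typedef struct { float x, y, z, w; } Vec4;
    typedef struct { float m[4][4]; } Mat4;   /* row-major modelview-projection matrix */

    /* Object space -> screen space: one 4x4 multiply, a perspective divide,
       and a viewport remap. This is the whole "vertex shader" a fixed-function
       front end needs. */
    static Vec4 transform_vertex(const Mat4 *mvp, Vec4 v, float vp_w, float vp_h)
    {
        Vec4 clip;
        clip.x = mvp->m[0][0]*v.x + mvp->m[0][1]*v.y + mvp->m[0][2]*v.z + mvp->m[0][3]*v.w;
        clip.y = mvp->m[1][0]*v.x + mvp->m[1][1]*v.y + mvp->m[1][2]*v.z + mvp->m[1][3]*v.w;
        clip.z = mvp->m[2][0]*v.x + mvp->m[2][1]*v.y + mvp->m[2][2]*v.z + mvp->m[2][3]*v.w;
        clip.w = mvp->m[3][0]*v.x + mvp->m[3][1]*v.y + mvp->m[3][2]*v.z + mvp->m[3][3]*v.w;

        float inv_w = 1.0f / clip.w;                        /* perspective divide */
        Vec4 screen;
        screen.x = (clip.x * inv_w * 0.5f + 0.5f) * vp_w;   /* NDC -> pixels */
        screen.y = (clip.y * inv_w * 0.5f + 0.5f) * vp_h;
        screen.z =  clip.z * inv_w;                         /* depth */
        screen.w =  inv_w;                /* kept for perspective-correct interpolation */
        return screen;
    }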

snvzz
11 replies
1d9h

Pipeline seems retro, but far better than nothing.

There's no open hardware GPU to speak of. Depending on license (can't find information?), this could be the first, and a starting point for more.

crote
5 replies
1d8h

It all depends on your definition of "open", of course. As far as I know there is no open-source toolchain for any remotely recent FPGA, so you're still stuck with proprietary (paid?) tooling to actually modify it. You're pretty much out of luck if you need more than an iCE40 UP5k.

snvzz
1 replies
1d5h

You're pretty much out of luck if you need more than an iCE40 UP5k.

Lattice ECP5 (which goes up to 85k LUT or so?) and Nexus have more than decent support.

Gowin FPGAs are supported via project apicula up to 20k LUT models. Some new models go above 200k LUT so there's hope there.

robinsonb5
0 replies
1d1h

Yeah I've used yosys / nextpnr on an ECP5-85 with great results - it's pretty mature and dependable now.

robinsonb5
0 replies
1d6h

There's been some interesting recent work to get the QMTech Kintex7-325 board (among others) supported under yosys/nextpnr - https://github.com/openXC7 It works well enough now to build a RISC-V SoC capable of running Linux.

monocasa
1 replies
1d6h

There's no open hardware GPU to speak of. Depending on license (can't find information?), this could be the first, and a starting point for more.

There's this, which is about the same kind of GPU:

https://github.com/asicguy/gplgpu

mips_r4300i
0 replies
1d2h

Number Nine's Ticket to Ride is a fixed-function GPU from the late 90s that was completely open-sourced under the GPL.

iAkashPaul
9 replies
1d8h

FPGAs for native FP4 will change the entire landscape

luma
2 replies
1d7h

How so?

iAkashPaul
0 replies
1d7h

Reduced memory requirements and dropping higher-precision IP blocks, for starters.

CamperBob2
0 replies
1d

4-bit values (or 6-bit values, nowadays) are interesting because they're small enough to address a single LUT, which is the lowest-level atomic element of an FPGA. That gives them major advantages in the timing and resource-usage departments.

jsheard
1 replies
1d6h

Very briefly, until someone makes an ASIC that does the same thing and FPGAs are relegated to niche use-cases once again.

FPGAs only make long-term sense in applications that are so low-volume that it's not worth spinning an ASIC for them.

iAkashPaul
0 replies
1d6h

Absolutely

Y_Y
1 replies
1d6h

Four-bit floats are not as useful as Nvidia would have you believe. Like structured sparsity, it's mainly a trick to make newer-gen cards look faster in the absence of an improvement in the underlying tech. If you're using it for NN inference you have to carefully tune the weights to get good accuracy, and it offers nothing over fixed-point.

imtringued
0 replies
6h59m

The actual problem is that nobody uses these low precision floats for training their models. When you do quantization you are merely compressing the weights to minimize memory usage and to use memory bandwidth more efficiently. You still have to run the model at the original precision for the calculations so nobody gives a damn about the low precision floats for now.

imtringued
0 replies
7h11m

How? NPUs are going to be included in every PC in 2025. The only differentiators will be how much SRAM and memory bandwidth you have or whether you use processing in memory or not. AMD is already shipping APUs with 16 TOPS or 4 TFLOPS (bfloat16) and that is more than enough for inference considering the limited memory bandwidth. Strix Halo will have around 12 TFLOPS (bfloat16) and four memory channels.

llama.cpp already supports 4-bit quantization. They unpack the quantized weights back to bfloat16 at runtime for better accuracy. The best use case for an FPGA I have seen so far was to pair it with SK Hynix's AI GDDR, and even that could be replaced by an even cheaper inference chip specializing in multi-board communication and as many memory channels as possible.
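
As a rough illustration of what that unpacking involves, here's a generic blocked 4-bit dequantization sketch in C (a made-up layout with a per-block scale; llama.cpp's actual quantization formats differ in detail):

    #include <stdint.h>
    #include <stddef.h>

    /* A block of 32 weights stored as 16 packed nibbles plus one float scale.
       Dequantization expands each 4-bit value back to a float for the matmul. */
    typedef struct {
        float   scale;
        uint8_t nibbles[16];   /* two 4-bit weights per byte */
    } BlockQ4;

    static void dequantize_block(const BlockQ4 *b, float *out /* 32 floats */)
    {
        for (size_t i = 0; i < 16; i++) {
            int lo = (b->nibbles[i] & 0x0F) - 8;   /* map 0..15 to -8..7 */
            int hi = (b->nibbles[i] >> 4)   - 8;
            out[2 * i]     = b->scale * (float)lo;
            out[2 * i + 1] = b->scale * (float)hi;
        }
    }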

blacklion
0 replies
1d4h

Entire landscape of open graphic chips?

Not every GPU should be used to train or infer so-called AI.

Please, stop, we need some hardware to put images on the screens.

spuz
6 replies
1d8h

This looks like an incredible achievement. I'd love to see some photos of the physical device. I'm also slightly confused about which FPGA module is being used. The blog mentions the Xilinx Kria SoMs, but if you follow the links to the specs of those modules, you see they have ARM SoCs rather than Xilinx FPGAs. The whole world of FPGAs is pretty unfamiliar to me, so maybe I'm missing something.

https://www.amd.com/en/products/system-on-modules/kria/k26/k...

crote
3 replies
1d8h

you see they have ARM SoCs rather than Xilinx FPGAs

It's a mixed chip: FPGA and traditional SoC glued together. This means you don't have a softcore MCU taking up precious FPGA resources just to do some basic management tasks.

chrsw
1 replies
1d6h

I didn't see any mention of what the software on the Zynq's ARM core is doing, which made me wonder why use Zynq at all.

PfhorSlayer
0 replies
23h52m

The hardened DisplayPort IP is connected to the ARM cores, and requires a significant amount of configuration and setup. FuryGpu's firmware primarily handles interfacing with that block: setting up descriptor sets to DMA video frame and audio data from memory (where the GPU has written it for video, or where the host has DMA'd it for audio), responding to requests to reconfigure things for different resolutions, etc. There's also a small command processor there that lets me do various things that building out hardware for doesn't make sense - moving memory around with the hardened DMA peripheral, setting up memory buffers used internally by the GPU, etc. If I ever need to expose a VGA interface in order to have motherboards treat this as a primary graphics output device during boot, I'd also be handling all of that in the firmware.

spuz
0 replies
1d8h

Ah, that makes sense. It's slightly ironic then that the ARM SoC includes a Mali GPU, which presumably easily outperforms what can be achieved with the FPGA.

chiral-anomaly
0 replies
1d6h

Xilinx doesn't mention the exact FPGA p/n used in the Kria SoMs. However according to their public specs they appear to match [1] the ZU3EG-UBVA530-2L and ZU5EV-SFVC784-2L devices, with the latter being the only one featuring PCIe support.

Designing and bringing up the FPGA board as described in the blog post is already a high bar to clear. I hope the author will at some point publish schematics and sources.

[1] https://docs.amd.com/v/u/en-US/zynq-ultrascale-plus-product-...

PfhorSlayer
0 replies
1d1h

You're in luck! https://imgur.com/a/BE0h9cZ

As mentioned in the rest of this thread, the Kria SoMs are FPGA fabric with hardened ARM cores running the show. Beyond just being what was available (for oh so cheap, the Kria devboards are like $350!), these devices also include things like hardened DisplayPort IP attached to the ARM cores allowing me to offload things like video output and audio to the firmware. A previous version of this project was running on a Zynq 7020, for which I needed to write my own HDMI stuff that, while not super complicated, takes up a fair amount of logic and also gets way more complex if it needs to be configurable.

codedokode
6 replies
1d6h

"UltraScale" in name assumes ultra price? FPGAs seem to be an expensive toy.

nxobject
2 replies
1d5h

It's worth mentioning that it's easy enough to find absurdly cheap (~$20) early-generation dev boards for Zynq FPGAs with embedded ARM cores on Aliexpress, shucked from obsolete Bitcoin miners [1]. Interfaces include SD, Ethernet, 3 banks of GPIO.

[1] https://github.com/xjtuecho/EBAZ4205

thrtythreeforty
1 replies
1d3h

Zynq is deeply annoying to work with, though. Unfortunately the hard ARM core bootloads the FPGA fabric, rather than the other way around (or having the option to initialize both separately). This means you have to muck with software on the target to update FPGA bitstreams.

CamperBob2
0 replies
1d

Isn't it mostly just boilerplate code that does the FPGA configuration, though?

varispeed
0 replies
1d6h

Ages ago I bought TinyFPGA, which is like £40 and I was able to synthesize RISC-V cpu on it. It was fun.

mattalex
0 replies
1d6h

Not in the grand scheme of things: you can get FPGA dev boards for $50 that are already usable for this type of thing (you can go even lower, but those aren't really usable for "CPU-like" operation and are closer to "a whole lot of logic gates in a single chip"). Of course the "industry grade" solutions pack significantly more of a punch, but they can also be had for <$500.

PfhorSlayer
0 replies
1d

In general, yes. However, the Kria series are amazingly good deals for what you get - a quite powerful Zynq US+ part and a dev board for like $350.

nxobject
5 replies
1d5h

I hope the author goes into some detail about how he implements the PCIe interface! I doubt I'll ever do hardware work at that level of sophistication, but for general cultural awareness I think it's worth looking under the hood of PCIe.

gorkish
2 replies
1d5h

The FPGA he is using has native PCIe, so usually all you get on this front is an interface to a vendor-proprietary IP block. The state of open interfaces in FPGA land is abysmal. I think the best I've seen fully open source is a gigabit MAC.

0xcde4c3db
0 replies
20h35m

There is an open-source DisplayPort transmitter [1] that apparently supports multiple 2.7 Gbps lanes (albeit using family-specific SERDES/differential transceiver blocks, but I doubt that's avoidable at these speeds). This isn't PCIe, but it's also surprisingly close to PCIe 1.0 (2.5 Gbps/lane, and IIRC they use the same 8b/10b code and scrambling algorithm).

[1] https://github.com/hamsternz/FPGA_DisplayPort

PfhorSlayer
0 replies
1d1h

Next blog post will be covering exactly that! Probably going to do a multi-part series - first one will be the PCB schematic/layout, then the FPGA interfaces and testing, followed by Windows drivers.

detuur
3 replies
1d1h

I can't believe that this is the closest we have to a compact, stand-alone GPU option. There's nothing like an M.2-format GPU out there. All I want is a stand-alone M.2 GPU with modest performance, something on the level of embedded GPUs like Intel UHD Graphics, AMD Radeon, or Qualcomm's Adreno.

I have an idea for a small embedded product which needs a lot of compute and networking, but only very modest graphical capabilities. The NXP Layerscape LX2160A [1] would be perfect, but I have to pass on it because it doesn't come with an embedded GPU. I just want a small GPU!

[1]: https://www.nxp.com/products/processors-and-microcontrollers...

magixx
0 replies
1d1h

What about the MXM GPUs that used to be found in gaming laptops? I know the standard is very niche and thus expensive ($400 for a used 3080M on eBay), but it does exist, and you could convert them to PCIe and thus M.2.

cpgxiii
0 replies
21h43m

There's at least one m.2 GPU based on the Silicon Motion SM750 controller made by Asrock Rack. Similar products exist for mPCIe form factor.

Performance is nowhere near a modern iGPU, because an iGPU has access to all of the system memory and caches and power budget, and a simple M.2 device has none of that. Even low-end PCIe GPUs (single-slot, half-length/half-height) struggle to outperform better iGPUs and really only make sense when you have to use them for basic display functionality.

KallDrexx
2 replies
1d3h

This is my dream!

For the last year I've been working on a 2D-focused GPU for I/O-constrained microcontrollers (https://github.com/KallDrexx/microgpu). I've been able to use it to get user interfaces on slow SPI machines rendering on large displays, and it's been fascinating to work on.

But seeing the limitations of processor pipelines, I've had the thought for a while that FPGAs could make this faster. I've recently gotten some low-end FPGAs to start learning, to try to turn my microgpu from an ESP32-based one into an FPGA-based one.

I don't know if I'll ever get to this level due to kids and free time constraints, but man, I would love to get even a hundredth of this level.

Chabsff
1 replies
1d2h

You probably know this already, but for anyone else curious about going down that road: for this type of use, it's definitely worth it to constrain yourself to FPGAs with dedicated high-bandwidth transceivers. A "basic" 1080p RGB signal at 60Hz requires some high-frequency signal processing that's really hard to contend with in pure FPGA-land.
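
To put numbers on that, here's a quick sketch using the standard 1080p60 timing (2200x1125 total raster including blanking):

    #include <stdio.h>

    int main(void)
    {
        /* Standard CEA-861 1080p60 timing: total raster including blanking. */
        const double h_total = 2200.0, v_total = 1125.0, refresh_hz = 60.0;

        double pixel_clock_hz = h_total * v_total * refresh_hz;   /* 148.5 MHz */
        double tmds_bit_rate  = pixel_clock_hz * 10.0;   /* TMDS encodes 8 bits as 10,
                                                            so ~1.485 Gbps per data lane */

        printf("pixel clock: %.1f MHz, TMDS per lane: %.3f Gbps\n",
               pixel_clock_hz / 1e6, tmds_bit_rate / 1e9);
        return 0;
    }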

KallDrexx
0 replies
1d1h

That's good to know, actually. I'm still very, very early in my FPGA journey (learning the basics), and I intend to start with standard 640x480 VGA before expanding.

notorandit
1 replies
1d4h

It needs to be very fancy to write text in light gray on white.

I am not sure your product will be a success.

I am sure your web design skills need a good overhaul.

nicolas_17
0 replies
11m

It's not a "product" that will be "sold" or has intention of being "successful" in a commercial sense.

gchadwick
1 replies
1d9h

Cool! I found the hello blog post here illuminating for understanding the creator's intentions: https://www.furygpu.com/blog/hello

As I read it, it's just a fun hobby project for them first and foremost, and it looks like they're intending to write a whole bunch more about how they built it.

It's certainly an impressive piece of work, in particular as they've got the full stack working: a Windows driver implementing a custom graphics API, and then Quake running on top of that. A shame they've not got DX/GL support, but I can certainly understand why they went the custom API route.

I wonder if they'll open source the design?

PfhorSlayer
0 replies
1d1h

I'm in the process of actually trying to work out what would be feasible performance-wise if I were to spend the considerable effort to add the features required for base D3D support. It's not looking good, unfortunately. Beyond just "shaders", there are a significant number of other requirements that even just the OS's window manager needs in order to function at all. It's all built on 20+ years of evolving tech, and for the normal players in this space (AMD, Nvidia, Intel, Imagination, etc.) it's always been an iterative process.

wpwpwpw
0 replies
1d7h

Excellent job. Would be amazing if this became an open source hardware project.

userbinator
0 replies
14h58m

Supporting hardware features equivalent to a high-end graphics card of the mid 1990s

I see no one else has asked this question yet, so I will: How VGA-compatible is it? Would I be able to e.g. plug it into any PC with a PCIe slot, boot to DOS and play DOOM with it?

sylware
0 replies
1d7h

Hopefully their hardware programming model goes with fully hardware circular command/interrupt buffers (even for GPU register programming).

That's how it's done on AMD GPUs; that said, I have no idea what the Nvidia hardware programming model is.

raphlinus
0 replies
1d1h

Very cool project, and I love to see more work in this space.

Something else to look at is the Vortex project from Georgia Tech [1]. Rather than recapitulating the fixed-function past of GPU design, I think it looks toward the future, as it's at heart a highly parallel computer, based on RISC-V with some extensions to handle GPU workloads better. The boards it runs on are a few thousand dollars, so it's not exactly hobbyist-friendly, but it certainly is more accessible than closed, proprietary development. There's a 2.0 release that just landed a few months ago.

[1]: https://vortex.cc.gatech.edu/

bobharris
0 replies
3h14m

Beyond amazing. I've dreamt of this. So inspiring. It reminds me of a lot of time I spent thinking about this: https://rcl.ece.iastate.edu/sites/default/files/papers/SteJo... I actually wrote to one of the professors asking for more info. Didn't get a reply. My dream EE class I never got to take.

bloatfish
0 replies
1d6h

This is insane! As a hobby hardware designer myself, I can imagine how much work must have gone into reaching this stage. Well done!

anon115
0 replies
1h17m

Can you run Valorant on it?

allanrbo
0 replies
1h43m

What an inspiring passion project! Very ambitious first Verilog project.