
Diffusion models from scratch, from a new theoretical perspective

ycy
14 replies
21h8m

Author here. When I tried to understand diffusion models, I realized that the code and math can be greatly simplified, which led me to write this blog post and diffusion library.

Happy to answer any questions.

godelski
5 replies
18h47m

As a researcher, there are a lot of diffusion blogs that I do not like. But I actually really do like this one! It does a great job of getting to the meat of things, showing some of the complexities (often missing elsewhere) without getting lost or distracted. I especially like the discussion of trajectories, since it motivates many things I think a lot of people struggle with (e.g. schedulers). That can be hard to write. Albeit not as complete, I think this is far more approachable than Song's or Lilian's blogs (great resources too, but for a different audience). Great job! I'm actually going to recommend this to others.

FWIW, a friend of mine (and not "a friend of mine") wrote this minimal diffusion implementation a while back that I've found useful; it's a bit more "full" w.r.t. DDPM. Mostly dropping it here since I saw others excited to see code, and it could provide good synergy: https://github.com/VSehwag/minimal-diffusion/

DinaCoder98
2 replies
17h23m

FWIW, a friend of mine (and not "a friend of mine")

What? Do you mean a coworker or something as compared to a friend?

lucubratory
1 replies
17h1m

They are clarifying that "a friend of mine" is not a euphemism for themselves, because it's more common to use it as a euphemism for yourself than it is to actually talk to strangers online about your friend's life, problems, opinions etc.

DinaCoder98
0 replies
9h44m

Oh, huh. I now recognize what you're referring to but I never would have realized that without being explicitly told so. Thank you.

ycy
1 replies
5h18m

Glad you liked it! Just curious, what else would you like to see to make the post more complete? I had focused on getting through the basics of diffusion training and sampling as concisely as possible, but perhaps in future posts I can delve deeper into other aspects of diffusion.

godelski
0 replies
27m

That's actually hard for me to answer. I'm afraid to taint it. Honestly, I'd be wary of touching it, especially with advice coming from someone already well versed in the domain. It's too easy to accidentally complexify things, and then you end up losing your target audience. But maybe the connection to VAEs could be made clearer? Maybe a slightly clearer or better introduction to the topic of optimization? But again, it's a hard line to walk, because you've successfully written something that is approachable to a novice. This is the same problem I suffer when trying to teach, haha. But I think the fact that you're asking and looking to improve is the biggest sign you'll continue to do well.

For later topics, I think there's a lack of discussion about things like score modeling, the difficulties of understanding latent representations (so many misunderstand the curse of dimensionality), and Schrödinger bridges. There's so much, lol.

The best advice I can give is to listen to anyone complaining. It's hard because you have to try to see beyond their words and find their intent (I think this also makes complaints more digestible and less emotionally draining/hurtful). Especially consider that a novice isn't going to have the right words to correctly express their difficulties parsing things, and it may just come out looking like anger when it's really frustration.

Just keep doing your best :)

thomasahle
1 replies
19h49m

Your `get_sigma_embeds(batches, sigma)` seems to not use its first input? Did you mean to broadcast sigma to shape (batches, 1)?
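(For readers following along: here is a minimal illustration of the broadcast the question is pointing at. This is a hypothetical sketch, not necessarily the library's actual implementation; it simply assumes the embedding is a sinusoidal function of log sigma.)

    import torch

    def get_sigma_embeds(batches, sigma, scale=0.5):
        # Hypothetical sketch: broadcast a scalar sigma to shape (batches, 1),
        # then embed log(sigma) as a sin/cos pair per batch element.
        sigma = sigma * torch.ones(batches, 1)   # uses the first argument to broadcast
        s = sigma.log() * scale
        return torch.cat([torch.sin(s), torch.cos(s)], dim=1)  # shape (batches, 2)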

ks2048
1 replies
16h20m

This looks great! How long does it take (on what hardware) to train the toy models? Such as the `fashion_mnist.py` example? Thanks.

ycy
0 replies
15h53m

The 2D toy models, such as the swissroll, take about 2 minutes to train on a CPU, whereas the FashionMNIST model takes a couple of hours on any modern GPU.

eutropia
1 replies
1h44m

In your example images at the very end, the momentum term seems to have a deleterious effect on the digital painting of the house (the door is missing in the gamma = 2.0 image). I would like to know more details about that example to build an intuition for the effect of your gradient-informed DDIM sampler.

As someone who's spent a little time experimenting with sampling procedures on Stable Diffusion, I was also hoping for a comparison to DDIM in terms of convergence time/steps. Is there a relationship between momentum, convergence, and error? (i.e., your momentum sampler at 16 steps is ~equivalent to DDIM at 20 steps ± %err_term)

thanks for the excellent post

ycy
0 replies
1h20m

It is hard to quantify the performance of samplers on individual images, but we have quantitative experiments on pretrained models in our paper (https://arxiv.org/pdf/2306.04848.pdf). Table 2 has a comparison to other samplers and Figure 9 plots the effects of the momentum term on convergence.

xchip
0 replies
20h48m

Your post is awesome and explains something nobody else did, thanks!

3abiton
0 replies
18h20m

I am curious whether any of these concepts derive somehow from physics principles, the same way neural networks are modeled after biological neural networks? Maybe you have some insights on that connection?

swyx
12 replies
21h37m

oh this has code! great stuff. diffusion papers are famous for a lot of equations (https://twitter.com/cto_junior/status/1766518604395155830) but code is much more legible (and precise?) for the rest of us. all theory papers should come with reference impl code.

i'd love an extension of this for the diffusion transformer, which drives Sora and other videogen models. maybe combine this post with https://jaykmody.com/blog/gpt-from-scratch/ and make the "diffusion transformer from scratch" intro

godelski
8 replies
18h35m

They're famous for lots of equations, but truth be told, most diffusion researchers I know have the exact same response. A lot of people repeat the same exact equations, and they are only there for review purposes.

On the other hand, if you want to really dig in, I'd suggest reading works by Kingma, Gao, Ricky Tian Qi Chen, and honestly any of Max Welling's students (Tomczak (was a postdoc), Hoogeboom, etc.), and of course the unsung hero Aapo Hyvärinen. Here's a taste: a Kingma & Gao work that's on the lighter side but relevant to the SD3 paper. The unfortunate part is that there's a lot of reliance on knowing and understanding prior works, which makes these less approachable, but honestly that's a hard critique to call meaningful (it's research, not educational work aimed at the public).

https://arxiv.org/abs/2303.00848

api
3 replies
17h38m

How long before more legible code or code-like constructs (or markup?) start to displace arcane mathematical notation?

It would take quite a long time, but I could see this slowly happening for reasons analogous to how the printing press favored alphabetic languages over pictorial ones. Computers teach us to read, write, and express ourselves linguistically, and are now, thanks to LLMs, capable of working much more fluently with language. Squiggles and lots of Greek letters seem like an anachronism.

godelski
2 replies
10h49m

How long before more legible code or code-like constructs (or markup?) start to displace arcane mathematical notation?

My best guess is never.

The problem is that math is a language itself. In fact, I'd more accurately call it a family of languages. People often confuse it for an inherent thing because most of the time we use it in applied ways for subjects like physics and the sciences. But clearly those people have never taken abstract algebra or developed their own toy algebra (or group, ring (or rng), or field). You might say they have an _ideal_istic notion of mathematics.

That said, I am absolutely sympathetic to the complexity of mathematics, and there is so much that is difficult to grasp. What's worse is that when this abstract gobbledygook finally clicks, it's so abundantly obvious that it deceives you into thinking you've known it all along. I think this contributes to the difficulty of teaching the subject, because those qualified have forgotten the struggles they themselves had (to be fair, human psychology plays a big role here too, because if we only remembered the struggles we'd have difficulty finding the passion and beauty).

But I think our best hope is to find a unified mathematics. The obsession people have with things like set theory and category theory is that you can see this interconnectedness. You get this deep feeling that you're looking at the same things manifesting in different ways. Remember that a lot of mathematics is converting your system into another system (we might call this an isomorphic mapping) such that your problem is easier to solve in the other system (think about how we might want to go from Cartesian coordinates to polar, but now abstract that concept out further than your friend who took a few tabs of acid). You zoom out, move around, and zoom back in, and maybe a few more times (but you have to keep track of your path).

Squiggles and lots of Greek letters seem like an anachronism.

Do you have another suggestion? Because I don't see how code, English, or literally any other thing that can be written down is anything more meaningful than "a bunch of squiggles." There are only so many simple and, importantly, distinguishable squiggles that we can write down. Luckily we have a tool to exactly determine this ;) The shitty part of mathematics is also the best part, in that you have to embrace the abstractness of it all. But of course human brains weren't designed for this. (Yes, there are some of us trying to figure out how to get machines to think with that extreme level of abstractness at its core. No, this is not how computers already "think.")

So yeah, I empathize with you. Quite a lot. There's so much more that I want to know about this thing too. But it is often seemingly impenetrable. All I can tell you is that it just looks that way and isn't. Everyone has the capacity, but finding the right way up this mountain (and I'm nowhere near Terry, and he's nowhere near the summit) can be just as hard as the climb itself. The best way is persistence, though. Just remember that you don't play video games because they are easy. Sometimes you gotta just trick yourself into pushing through. And some of the best advice I can give is revisiting old topics. So much of what you think you've known actually has a tremendous amount of depth to it. Coming back with fresh eyes and a better understanding of the larger picture can help you see all the beauty in it, even if this is sometimes fleeting.

I'm sorry to have vomited so much, but you know, Stockholm Syndrome is a bitch.

zrm
1 replies
8h50m

Remember that a lot of mathematics is converting your system into another system

Claude Shannon had something to say about this as I recall.

Do you have another suggestion? Because I don't see how code, English, or literally any other thing that can be written down is anything more meaningful than "a bunch of squiggles."

The difference is that the squiggles appear on your keyboard. If you use modern language to construct a variable name, the name can be suggestive of what it is to someone already fluent in natural language. Multi-character names also reduce collisions -- there are only so many Greek letters. And then the research paper doesn't show an inscrutable glyph as a JPEG that you can't even copy and paste into a search engine if you don't already know what it represents.

godelski
0 replies
38m

Shannon said a lot of things; you'll have to be a bit more specific to jog my memory.

Shannon would tell you that there can only be so many simple __distinguishable__ letters. It's not just Greek. It sounds like you're describing the current Chinese phenomenon, where people have an easier time reading (and it helps that it's text, including traditional characters), but typing can still be hard. Because that's the problem: there are only so many easy-to-distinguish squiggles you can put on a keyboard. I'm sure that while every key is distinguishable, you wouldn't want to operate a keyboard that looked like this [0] (it's actually a car decorated in keys, in case you want to see [1]).

There's a balance that's hard to strike. The reason many languages use alphabets is not the complexity of writing hieroglyphics, but that hieroglyphics, even when written well, can be difficult to distinguish and cumbersome. There's probably no universal language (sorry Chomsky), so there are no symbols you can put on those keys that are universally recognizable. If you're actually interested in this kind of topic, there are two things I suggest you look into. The first is Korean (specifically Hangul) [2], which has an interesting history relevant here, and the other is long-term nuclear waste warning messages [3,4]. The former was specifically designed to increase literacy and was extremely effective. The latter illustrates the difficulty of universal, innate communication, where we think certain things are absolutely obvious but they aren't.

The other problem with languages is just usage. There are many overloaded words, and it's not like we're at a shortage of combinations in the English language. Even combinations of short words! The word "cb" doesn't exist, nor "cbra"! This alone should make someone suspicious that the depth of the problem is much more than symbol complexity and combinatorics.

I'm sure Shannon would have something to say about the tyranny of encoding and complexity... Your suggestion is just the other end of the problem.

[0] https://www.ocregister.com/wp-content/uploads/migration/m1m/...

[1] https://www.ocregister.com/2012/03/30/keyboard-decorated-car...

[2] https://en.m.wikipedia.org/wiki/Hangul

[3] https://en.m.wikipedia.org/wiki/Long-term_nuclear_waste_warn...

[4] https://en.m.wikipedia.org/wiki/Long-term_nuclear_waste_warn...

swyx
1 replies
18h21m

yeah fair enough :) i am unfortunately not driven enough on diffusion models yet to need to dive into those but i hope a future seeker finds your references here.

godelski
0 replies
10h38m

That's fine! Not everyone needs to be. I think it is just a shame we pretend. Makes research convoluted and messy. As I said the other day, we should just allow people to pursue whatever they want and lay a framework where the incentives drive the clearest explanations possible.

https://news.ycombinator.com/item?id=39665427

catgary
1 replies
17h7m

I’d probably say Karras has written the best papers on how to think about diffusion models.

boppo1
0 replies
11h21m

Hopefully Garrett doesn't interfere with his hard work.

GaggiX
2 replies
21h32m

i'd love an extension of this for the diffusion transformer

All you need to do is replace the U-net with a transformer encoder (remove the embeddings, and project the image patches into vectors of size n_embd), and the diffusion process can remain the same.
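For concreteness, here is a minimal sketch of that swap in PyTorch (all names and sizes are made up, and it omits DiT details like adaLN-Zero conditioning): patchify the noisy image, project each patch to a token of size n_embd, run a plain transformer encoder, and project back to patches.

    import torch
    import torch.nn as nn

    class TinyDiT(nn.Module):
        # Minimal sketch, not the actual DiT/Sora architecture: a plain
        # transformer encoder over image patches standing in for the U-Net.
        def __init__(self, img_size=32, patch=4, channels=3, n_embd=256, depth=6, heads=8):
            super().__init__()
            self.patch, self.channels = patch, channels
            n_patches = (img_size // patch) ** 2
            patch_dim = channels * patch * patch
            self.proj_in = nn.Linear(patch_dim, n_embd)      # patch -> token
            self.pos = nn.Parameter(torch.zeros(1, n_patches, n_embd))
            self.sigma_mlp = nn.Sequential(nn.Linear(1, n_embd), nn.SiLU(),
                                           nn.Linear(n_embd, n_embd))
            layer = nn.TransformerEncoderLayer(n_embd, heads, 4 * n_embd,
                                               batch_first=True, norm_first=True)
            self.encoder = nn.TransformerEncoder(layer, depth)
            self.proj_out = nn.Linear(n_embd, patch_dim)     # token -> patch

        def forward(self, x, sigma):
            # x: (B, C, H, W) noisy image, sigma: (B, 1) noise level
            B, C, H, W = x.shape
            p = self.patch
            patches = (x.unfold(2, p, p).unfold(3, p, p)     # (B, C, H/p, W/p, p, p)
                        .permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p))
            tokens = self.proj_in(patches) + self.pos + self.sigma_mlp(sigma)[:, None, :]
            out = self.proj_out(self.encoder(tokens))
            h = H // p                                       # unpatchify
            return (out.reshape(B, h, h, C, p, p)
                       .permute(0, 3, 1, 4, 2, 5).reshape(B, C, H, W))

The training loop (add noise at level sigma, predict it, minimize the MSE) does not change; only the denoising network does.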

swyx
1 replies
20h28m

seems too simple. isn't there also a temporal dimension you need to encode?

GaggiX
0 replies
20h21m

For the conditioning and t there are different possibilities. For example, the unpooled text embeddings (if the model is conditioned on text) usually go into cross-attention, while the pooled text embedding plus t is used in adaLN blocks, like in StyleGAN (the first one). But there are many other strategies.
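As a rough illustration of the adaLN part only (one of many variants; DiT's adaLN-Zero additionally gates the residual branch with a zero-initialized scale), something like this sketch:

    import torch
    import torch.nn as nn

    class AdaLN(nn.Module):
        # Sketch: a pooled conditioning vector (e.g. pooled text embedding + t
        # embedding) predicts a per-channel scale and shift that modulate a
        # parameter-free LayerNorm over the tokens.
        def __init__(self, n_embd, cond_dim):
            super().__init__()
            self.norm = nn.LayerNorm(n_embd, elementwise_affine=False)
            self.to_scale_shift = nn.Linear(cond_dim, 2 * n_embd)

        def forward(self, x, cond):
            # x: (B, T, n_embd) tokens, cond: (B, cond_dim)
            scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
            return self.norm(x) * (1 + scale[:, None, :]) + shift[:, None, :]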

adamnemecek
4 replies
19h44m

All machine learning models are convolutions, mark my words.

sva_
3 replies
18h51m

I think you posted this a few times, maybe you want to elaborate on this point?

I have trouble seeing reinforcement learning as convolution, for example.

adamnemecek
2 replies
17h48m

It's a Fredholm equation which in turn is a convolutional equation.
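For anyone unfamiliar with the terms, here are the standard definitions (just the definitions, not a verdict on the claim): a Fredholm integral equation of the first kind has the form

    g(x) = \int K(x, t) \, f(t) \, dt

and when the kernel is translation-invariant, K(x, t) = k(x - t), the right-hand side is exactly the convolution (k * f)(x); for a general kernel it is not a convolution in the standard sense.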

carlthome
1 replies
17h4m

Next you're gonna say that polynomial multiplication is also just convolution of the arguments.

adamnemecek
0 replies
16h54m

It's not the standard convolution.

skybrian
1 replies
20h51m

This is a nice explanation of the theory. It seems to be dataset-independent. I'm wondering about the specifics of generating images.

For example, what is it about image generators that makes it hard for them to generate piano keyboards? It seems like some better representation of medium-distance constraints is needed to get alternating groups of two and three black notes.

Vecr
0 replies
20h22m

It's the finger problem: you've got to get the number, size, angle, position, etc. right every single time, or people can tell very fast. It's not like tree branches, where people won't notice if the splitting positions are "wrong".

ycy
0 replies
18h31m

Yes, these blog posts offer a different perspective on diffusion models from the "projection onto data" perspective described in this blog post. You can view them as different ways of interpreting the same training objective and sampling process. In our perspective, diffusion models are easier to train because instead of predicting the gradient of the _exact_ distance function, the training objective predicts the gradient of a _smoothed_ distance function. Sampling the diffusion model is akin to taking multiple approximate gradient steps.
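For readers who want that intuition as a loop, here is a rough sketch (variable names are illustrative only, not the library's actual API): the denoiser's prediction of the clean data point defines an approximate gradient of the smoothed distance function, and each sampling step moves along it to the next noise level.

    import torch

    def gradient_step_sampler(denoiser, sigmas, shape):
        # sigmas: decreasing noise levels, e.g. [10.0, ..., 0.01]
        x = sigmas[0] * torch.randn(shape)
        for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
            x0_hat = denoiser(x, sigma)              # approximate projection onto the data
            grad = (x - x0_hat) / sigma              # ~ gradient of the smoothed distance
            x = x + (sigma_next - sigma) * grad      # one (DDIM-like) gradient step
        return x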

To gain a deeper understanding of diffusion models, I encourage everyone to read all of these blog posts and learn about the different interpretations :)

strangecasts
0 replies
19h49m

Super interesting!

Immediately reminded of Iterative alpha-(de)Blending [1], which also sets out to build a conceptually simpler diffusion model and also arrives at formulating it as an approximate iterative projection process - I think this approach allows for more interesting experiments, like the denoiser error analysis, though.

[1] https://arxiv.org/pdf/2305.03486.pdf

porphyra
0 replies
16h6m

Another great post is also called Diffusion Models From Scratch: https://www.tonyduan.com/diffusion/index.html

That goes into a lot more mathematical detail but is also accompanied by a minimal, very easy to understand <500 line implementation.

dr_dshiv
0 replies
16h16m

Is part of the idea of diffusion that you get a huge amount of training data? Like, you get to contrast all of these randomly diffused images with the undiffused image?