
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision

edude03
15 replies
1d

How much is the flash attention algorithm tied to the hardware? For example, in this announcement they mention taking advantage of the async capabilities of the H100 GPUs, which I assume means you won't get those speedups on non-H-series cards. Two, the actual flash attention library requires CUDA, although the algorithm has apparently[0] been ported to Metal. I would imagine that if the algorithm were literally just a pure function it could be implemented for any GPU/ML framework?

[0]: https://github.com/philipturner/metal-flash-attention

kristjansson
5 replies
1d

FlashAttention's algorithmic improvement is mostly just splitting/combining the softmax part of attention, and is itself not totally novel. The overwhelming contribution is implementing that, and all its fiddly pieces, efficiently on Nvidia hardware.
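
For the curious, the split/combine trick is just the online-softmax identity: keep a running max and running sum over key blocks, and rescale the partial output whenever the max grows. A minimal NumPy sketch for a single query (illustrative of the algebra only, not FA's actual kernel; shapes are arbitrary):

    import numpy as np

    def attention_onepass(q, K, V, block=16):
        # softmax(q @ K.T / sqrt(d)) @ V, computed block-by-block over the keys,
        # carrying only a running max m, running sum l, and output accumulator o.
        d = q.shape[-1]
        m, l = -np.inf, 0.0
        o = np.zeros(V.shape[-1])
        for s in range(0, K.shape[0], block):
            scores = q @ K[s:s + block].T / np.sqrt(d)   # this block's logits
            m_new = max(m, scores.max())
            p = np.exp(scores - m_new)                   # block's unnormalized probs
            scale = np.exp(m - m_new)                    # rescale old stats to the new max
            l = l * scale + p.sum()
            o = o * scale + p @ V[s:s + block]
            m = m_new
        return o / l

    rng = np.random.default_rng(0)
    q, K, V = rng.standard_normal(64), rng.standard_normal((256, 64)), rng.standard_normal((256, 64))
    logits = q @ K.T / np.sqrt(64)
    ref = (np.exp(logits - logits.max()) / np.exp(logits - logits.max()).sum()) @ V
    assert np.allclose(attention_onepass(q, K, V), ref)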

refulgentis
3 replies
22h59m

Clarifying:

Given the question: "How much is the flash attention algorithm tied to the hardware?"

The answer is 0.

e.g. you can find generic flash attention recently added to llama.cpp and ONNX (MS needed it for Phi-3, which Recall needs).

As an aside, I have no direct knowledge on the novelty question, but IMHO it would devolve the way novelty arguments do in any field: there's always someone who can claim they did 80% of $X via $X-1, therefore $X is by and large not novel. Ad infinitum.

kristjansson
2 replies
20h25m

I think the right analogy for FA is high-quality cache-aware BLAS kernel implementations. The algorithm(s) is (are) clever and (as you note) completely independent of hardware. However, a hardware-naive implementation is approximately worthless. Most of the value of MKL, or Accelerate, or FA is in the careful matching of the parameters and implementation of the algorithm to the capabilities of the hardware it's going to run on.
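
As a toy version of that point: the arithmetic of a matmul never changes, but a blocked loop whose tile size is chosen to fit the cache (or, on a GPU, shared memory and registers) is where the performance lives. A NumPy sketch of the structure only (NumPy's @ already calls an optimized BLAS underneath, so this won't actually be faster; the block size is just the stand-in for the hardware-dependent knob):

    import numpy as np

    def matmul_blocked(A, B, block=64):
        # Same arithmetic as A @ B, iterated over block x block tiles so each tile
        # of A, B, and C stays resident in fast memory while it is reused.
        n, k = A.shape
        _, m = B.shape
        C = np.zeros((n, m))
        for i in range(0, n, block):
            for j in range(0, m, block):
                for p in range(0, k, block):
                    C[i:i+block, j:j+block] += A[i:i+block, p:p+block] @ B[p:p+block, j:j+block]
        return C

    A, B = np.random.randn(256, 384), np.random.randn(384, 128)
    assert np.allclose(matmul_blocked(A, B), A @ B)

Picking that block size per cache level (or per SM) is the same kind of work as FA picking tile sizes for Q/K/V blocks to fit in shared memory.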

I definitely don't mean to take away from Tri/FA by mentioning novelty - I'm just repeating from the paper, which refers back to algebraic aggregates[0] in its discussion of the tiled softmax.

[0]: https://web.stanford.edu/class/cs345d-01/rl/olap.pdf

kiratp
1 replies
13h44m

However, a hardware-naive implementation is approximately worthless.

This isn’t true when there is one vendor that’s 90% of the market and two, maybe three, generations of hardware to consider. Support the A100 and H100 and you are supporting most of the current market.

saagarjha
0 replies
11h42m

Supporting A100 and H100 is the opposite of being hardware naive, though.

namibj
0 replies
1d

To clarify further, flash attention explicitly targets a compute engine with: separate MMA and "scalar" vector execution units that allow post-processing the MMA outputs without involving memory bandwidth (though arithmetic intensity, especially the relative intensity of the MMA and "scalar" instructions, is a concern); a substantial amount of manually managed L1D$ to use as a sub-matrix accumulator; and a linear-in-context-length amount of "VRAM" that demands sensible arithmetic intensity to avoid becoming a bandwidth bottleneck (IIRC in the hundreds of FLOPs per byte once you count the scalar multiplies hiding in the MMA instructions).
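
Rough numbers for that arithmetic-intensity point, as a back-of-envelope sketch (the peak/bandwidth figures are my ballpark for an H100 SXM and the shapes are arbitrary; none of this is from the article):

    # One attention head, fp16 I/O, ignoring the Q/K/V/O projections.
    peak_flops = 989e12        # ~dense fp16 tensor-core peak, FLOP/s
    hbm_bw = 3.35e12           # ~HBM bandwidth, bytes/s
    print(f"machine balance ~ {peak_flops / hbm_bw:.0f} FLOP/byte")    # ~295: "in the hundreds"

    N, d, b = 4096, 128, 2     # sequence length, head dim, bytes per fp16 element
    flops = 4 * N * N * d                      # 2*N^2*d for Q@K^T plus 2*N^2*d for P@V
    fused_bytes = 4 * N * d * b                # read Q, K, V and write O once; no N x N matrix in HBM
    naive_bytes = fused_bytes + 4 * N * N * b  # plus writing and re-reading the N x N scores/probs
    print(f"fused intensity ~ {flops / fused_bytes:.0f} FLOP/byte")    # ~2048: compute-bound
    print(f"naive intensity ~ {flops / naive_bytes:.0f} FLOP/byte")    # ~62: bandwidth-bound

The fused version sits well above the balance point, which is exactly why keeping the N x N matrix out of "VRAM" is the whole game.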

This v3 with async might for once be so tied to Hopper that it's not trivially portable to another platform that has the mentioned hardware blocks (AFAIK every AMD GCN card that can do compute shaders would qualify, though they do lack a specialized MMA unit).

f_devd
3 replies
1d

How much is the flash attention algorithm tied to the hardware?

The original FA, almost none.

For the latest versions it depends on your abstraction. ThunderKittens[0] provides about the same speedup over FA2 (1.3x-2x) as the article, but is relatively universal across GPUs. For any new hardware there may be hardware-specific features that let it edge out more performance; usually vendors will adopt any new feature that seems to beat them, but you do get fragmented APIs/libraries (which is already true for CUDA).

[0]: https://hazyresearch.stanford.edu/blog/2024-05-12-tk

f_devd
0 replies
7h52m

If you read the article I linked, they show that it's entirely based on 16x16 matrices (or "tiles"), which is fairly standard across GPUs.

kristjansson
0 replies
20h15m

I mean they're building an API to abstract away some of the SKU-to-SKU differences, but the broader point cuts the other way, I think:

In fact, more broadly we believe we should really reorient our ideas of AI around what maps well onto the hardware. How big should a recurrent state be? As big can fit onto an SM. How dense should the compute be? No less so than what the hardware demands. An important future direction of this work for us is to use our learnings about the hardware to help us design the AI to match.

The value is in adapting the implementation (either manually at write-time or programmatically at run-time) to the specifics of the hardware.

Also, great line:

And we ask: if your matrix multiply is smaller than 16x16, are you sure what you’re doing is AI?

3abiton
2 replies
21h50m

To add to the discussion, from a practical perspective: AMD hardware totally sucks and has yet to get a proper flash-attention-2 implementation. ROCm is slowly moving toward usable, but it's not close to being comparable with CUDA.

LarsDu88
1 replies
11h39m

Why is it so hard to port FA2 to the MI300 Instinct?

vhiremath4
0 replies
21h54m

There are a bunch of good answers, but I wanted to succinctly say "practically, quite a bit". Here's a good little rabbit-hole example:

https://github.com/karpathy/nanoGPT/blob/master/model.py#L45

Karpathy's nanoGPT calls flash attention by checking whether torch.nn.functional.scaled_dot_product_attention exists.
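
Roughly, that check-and-dispatch pattern looks like this (a paraphrase of the idea, not the exact nanoGPT source):

    import torch
    import torch.nn.functional as F

    # Use the fused kernel only when this PyTorch build exposes it.
    flash = hasattr(F, "scaled_dot_product_attention")

    def attend(q, k, v, dropout_p=0.0):
        # q, k, v: (batch, n_heads, seq_len, head_dim)
        if flash:
            # PyTorch dispatches to a FlashAttention kernel on supported GPUs.
            return F.scaled_dot_product_attention(q, k, v, dropout_p=dropout_p, is_causal=True)
        # Fallback: materialize the full seq_len x seq_len attention matrix (exactly what FA avoids).
        att = (q @ k.transpose(-2, -1)) / (k.size(-1) ** 0.5)
        causal = torch.ones(q.size(-2), k.size(-2), dtype=torch.bool, device=q.device).tril()
        att = att.masked_fill(~causal, float("-inf")).softmax(dim=-1)
        return att @ v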

https://pytorch.org/docs/stable/generated/torch.nn.functiona...

Looking at the docs: in reality, most of the time you want this to call out to FA2, which optimizes the kernels on the device to split the ops over the softmax of the triangular (causal) matrix and to avoid shuttling unnecessary batches of floating-point numbers back and forth between the GPU's HBM and its on-chip SRAM.

https://arxiv.org/pdf/2307.08691

The FA2 paper frames itself almost entirely in terms of the hardware it runs on.

slashdave
0 replies
21h12m

Conceptually, just a bit, practically (in terms of implementation), a lot. The standard python implementation internally compiles a kernel for your specific hardware.

WanderPanda
9 replies
23h37m

Compiler folks: Is there any chance compilers will be able to find optimizations like FlashAttention on their own? Seems like TVM and tinygrad are working in that direction but I find it hard to believe that that would be feasible

rfoo
3 replies
23h30m

No. Think of it like a different algorithm. You just take the shape of the hardware into consideration when designing the algorithm instead of considering math only.

Seems like TVM

Fair enough; technically they're still about different things, but it's indeed very close. But

and tinygrad

?????? what gives you this impression?

dauertewigkeit
1 replies
23h25m

What's the distinction between what TVM does and FlashAttention type optimizations?

rfoo
0 replies
23h18m

There is more than layout / tile scheduling in FA. For example, to be able to fuse all of these together [0] at all, you first need to "decompose" the softmax to make it combinable, which requires maintaining some extra statistics. Not gonna repeat the math here, as the original FA paper is already very clear.

[0] So you can avoid materializing intermediate matrices and still be able to compute in blocks.

FL33TW00D
0 replies
3h27m

Geo has explicitly stated he wants to be able to find FA in the search space of algos eventually. Actually achieving that is another matter.

namibj
1 replies
23h10m

In theory, yes, it's "just" some algebraic properties of the math used that allow for substantial reordering, and then you'd add fairly regular polyhedral loop tiling. Just expensive to do, so you'll have to cache the effort.

The area of e-graph optimizers seems well suited to this, btw. It's not really deployed outside of some niche tooling, though, as it's a big paradigm shift in optimizer pass handling (e.g., it doesn't work well with classic call graphs, so control flow needs to be massively revamped to deploy e-graphs outside/across basic blocks and for loops (break and return not supported!)).

skyde
0 replies
2h7m

I just discovered e-graphs recently, and I have a good understanding of compilers from taking a compiler class at university.

I would like to understand why you say e-graphs would require control flow to be revamped.

Do you have anything I could read on it?

slashdave
0 replies
13h19m

Kinda tricky if you want to call higher level operators in a wrapped language like Python.

Lerc
0 replies
19h35m

This strikes me as an extremely difficult but not intractable problem.

I'm not sure what the state of the art in compiler optimisation is with regard to data positioning and targeting maximum processor usage.

There was a video on optimisation a while back that showed small optimisations causing speed increases that were insignificant compared to the speed variance induced by the memory layout that the optimisation (or even a random change) produced.

While that talk was more focused on getting a signal past the noise, the noise itself is an artifact of compilers being not particularly good at handling a much simpler form of the problem you describe.

CPU and memory architectures are complex when caches and access patterns impact upon speed.

When you add in GPU architectures to the mix I think you might be in fairly uncharted territory.

Maybe one day.

Of course, since we are in the field of AI, there is also the question of whether a sufficiently smart AI could do this. It depends on the value of "sufficient".

I would like to think that an extremely high level test for an AI model could be to give it something like micrograd and tell it to produce something with the same interface that outperforms torch.

We're not even in the ballpark of being able to do that yet, but it will be interesting when and if that happens.

latchkey
7 replies
13h38m

If anyone wants to port this over to ROCm / AMD MI300x, reach out to me: hello@hotaisle.xyz (we won't ever spam you).

Happy to donate the compute time for this work.

saagarjha
5 replies
11h37m

Not trying to be rude but what is the thinking behind this offer? Why would someone do this port for…free save for access to the hardware? What’s the upside for them?

latchkey
4 replies
11h29m

Not a rude question. I'm building public HPC super computers, currently focused on AMD hardware. The one I'm about to deploy is Top 150, which is a pretty good start.

The goal is to encourage a developer flywheel. The more developers working with AMD hardware, the more hardware that is needed, the more hardware I can justify buying, the bigger my super computers get.

Nvidia has been doing the flywheel for years and it has clearly worked. Why not do the same for AMD? As I said in another thread, anyone who thinks that there should be a single provider for all AI compute needs will be on the wrong side of history.

gitfan86
3 replies
6h44m

No one person or one company SHOULD have huge control over humanity, I agree.

But practically speaking this is a unique time in the history of technology because there are quick feedback loops that cause that flywheel you mentioned to be an insurmountable first-mover advantage.

latchkey
2 replies
6h29m

But practically speaking this is a unique time in the history of technology because there are quick feedback loops that cause that flywheel you mentioned to be an insurmountable first-mover advantage.

I'm staking my career and business on you being wrong about the insurmountable part. This is just the beginning of a long road, and I'm not the only one who believes this. My partnership with Dell, Advizex, and a huge soon-to-be-announced datacenter company isn't small beans.

Much like how I didn't know how the internet would look when I first joined in 1991. But what I can see very clearly, from my decades of experience in the tech field, is that history is repeating itself with what is happening in AI.

As I'm also prone to say... this isn't a football match where one team needs to "beat" the other. It really is enough to have multiple players in the market and nothing more than that. In fact, I'm more than happy to deploy any type of compute that my customers want me to deploy for them, including Nvidia.

Even Lamini, who were previously AMD-only, just announced [0] that they are partnering with Nvidia. Their software will run equally well on any system. Why? Because it builds a simple bridge from one platform to the next. Reminds me of the Java "write once, run anywhere" slogan. It actually worked pretty well.

[0] https://x.com/realsharonzhou/status/1811439958277927294

gitfan86
1 replies
6h6m

I'm not saying it is impossible for other companies to build good and profitable products. Google, AMD, Tesla all have good AI systems.

I'm saying NVDA uses their own chips to help build more chips, AND they are intricately involved in the buildout of the 100B data centers and in TSMC's roadmaps. That, combined with huge and still-growing profits, creates even more advantages over competitors.

Obviously this doesn't go on forever; NVDA will never have 100T of profit in a quarter. Years from now the feedback loops will hit diminishing returns and there will eventually be commodity AI systems.

latchkey
0 replies
5h59m

I did not use the word impossible. Nobody is arguing that Nvidia won't be the dominant player for a long time. That does not mean there isn't a good business in being in the game.

Years from now the feedback loops will have diminishing returns and there will be commodity AI systems eventually.

Maybe, but the cat is out of the bag. Before, it was a question of Moore's law and speed, but nobody talks about that anymore... all they talk about is that the need for raw compute (not even the fastest compute) is officially boundless.

JackYoustra
0 replies
13h26m

You're the AMD accelerator server company! Such cool work, hope someone takes you up :)

lxe
4 replies
1d

FlashAttention-3 is optimized for Hopper GPUs (e.g. H100).

How does FA3 fare for consumer GPUs such as 3090 and 4090?

moffkalast
2 replies
23h27m

The original flash attention (v1?) took like a year to get added to llama.cpp and only provides single digit percent VRAM savings for typical context lengths and practically no speed boost. Still nice to have, but man was this thing overhyped. I doubt v3 will do more than marginally better on the RTX 5000 series.

apsec112
1 replies
23h23m

On GPU, or on CPU/Metal? For the latter I'm not surprised, but that's because they have a totally different memory/cache hierarchy.

moffkalast
0 replies
21h58m

With CUDA offloading, I don't think it runs otherwise at all.

Der_Einzige
3 replies
1d

This is one of the most important improvements in all of AI, because it benefits most AI users by giving them access to more, faster, for the same hardware with little to no tradeoffs.

snovv_crash
2 replies
23h55m

...for all those users with H100s.

stale2002
0 replies
8h27m

Indeed.

Anyone who is doing anything important or at scale would be at least renting those, or even using an abstracted service that is on top of another service.

Those cost savings allow people to train things for cheaper, causing those cost savings to benefit almost everyone doing important stuff in the space.

rfoo
0 replies
23h32m

... which is currently the most cost-efficient and environment-friendly way to do LLM inference [0].

[0] Small footprint time: before B100 ships; for actually large language models; for prefill only; may cause cancer in California.

localfirst
2 replies
23h44m

spoiler: $xxx,xxx hardware required to run

sva_
0 replies
23h36m

$25k-$30k

aabhay
0 replies
22h51m

If you need to run it continuously for a year

andy_xor_andrew
2 replies
23h48m

hoping an expert can answer a few Qs I have :)

Is FlashAttention simply a drop-in replacement for the attention operation in an LLM? Can it be used anywhere that an "attention" operation is used? Or does a LLM need to be trained specially to use FA?

How does FA relate to attention strategies like GQA (grouped query attention) or sliding-window attention? Are they orthogonal concepts? Or you need a specific FA implementation for each strategy?

Recently llama.cpp added flash attention support - does this just mean they started consuming a flash attention-provided CUDA kernel or something?

lastly, in this post, they compare FlashAttention to Triton. I thought Triton was like an abstraction layer? Couldn't FA be implemented in Triton? I just don't really get what it means to say "FlashAttention vs. Triton".

zaptrem
0 replies
23h43m

Is FlashAttention simply a drop-in replacement for the attention operation in an LLM? Can it be used anywhere that an "attention" operation is used? Or does a LLM need to be trained specially to use FA?

Yes

How does FA relate to attention strategies like GQA (grouped query attention) or sliding-window attention? Are they orthogonal concepts? Or you need a specific FA implementation for each strategy?

Flash Attention is a way of calculating the Softmax(QK^T)V part of attention, whereas GQA is a way of calculating the Q, K, and V matrices. Sliding-window attention (I'm less sure about this; there are a bunch of windowed attention techniques) changes the attention mask (the thing that controls which queries can attend to which keys).
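
A small PyTorch sketch of that division of labor (the shapes and names are just for illustration): GQA only changes how many distinct K/V heads exist and how they're shared across query heads, and whatever kernel handles the Softmax(QK^T)V part, FlashAttention or otherwise, consumes the result unchanged.

    import torch
    import torch.nn.functional as F

    B, T, n_q_heads, n_kv_heads, hd = 2, 128, 8, 2, 64   # GQA: 8 query heads share 2 KV heads

    q = torch.randn(B, n_q_heads, T, hd)
    k = torch.randn(B, n_kv_heads, T, hd)
    v = torch.randn(B, n_kv_heads, T, hd)

    # GQA lives here: each group of 4 query heads reuses the same K/V head.
    k = k.repeat_interleave(n_q_heads // n_kv_heads, dim=1)
    v = v.repeat_interleave(n_q_heads // n_kv_heads, dim=1)

    # The attention kernel only ever sees an ordinary Softmax(QK^T)V problem.
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    print(out.shape)   # torch.Size([2, 8, 128, 64])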

Recently llama.cpp added flash attention support - does this just mean they started consuming a flash attention-provided CUDA kernel or something?

I don't use llama.cpp but that sounds about right.

lastly, in this post, they compare FlashAttention to Triton. I thought Triton was like an abstraction layer? Couldn't FA be implemented in Triton? I just don't really get what it means to say "FlashAttention vs. Triton".

They're talking about a previous Flash Attention implementation written in Triton.

apsec112
0 replies
23h40m

1) Pretty much, it's mathematically equivalent. The only software issues are things like managing dependency versions and data formats in-memory, but Flash Attention 2 is already built into HuggingFace and other popular libraries. Flash Attention 3 probably will be soon, although it requires an H100 GPU to run

2) Flash Attention 2 added support for GQA in past version updates:

https://github.com/Dao-AILab/flash-attention

3) They're comparing this implementation of Flash Attention (which is written in raw CUDA C++) to the Triton implementation of a similar algorithm (which is written in Triton): https://triton-lang.org/main/getting-started/tutorials/06-fu...

refibrillator
1 replies
13h35m

The code has a comment which seems to hint that Tri Dao was working on FA3 as early as April 2022, the month after Hopper/H100 was announced. I find it mildly curious that over two years elapsed before the code was released today. Perhaps it's because there are now better solutions in the pipeline?

Tri’s publication history has been leaning toward SSM and Mamba style architectures recently. Unlike Flash Attention which has quadratic time complexity wrt sequence length, these latest algorithms are subquadratic. Thus they do much less computation, instead of just doing it more efficiently a la Flash Attention.

Dao and Gu published a really long paper this year which demonstrated (among other things) how Mamba/SSM can be formulated such that it’s amenable to acceleration using the same hardware primitives that Transformers benefit from.

nyrikki
0 replies
4h5m

Until the strong exponential time hypothesis (SETH) is (dis-)proven, the quadratic cost is required, or you have to give something up. It's just the cost of exhaustive search.

As (dis-)proving SETH will resolve the P vs NP problem, I wouldn't hold my breath.

The question is if a particular use case can accept those costs.

ex3ndr
1 replies
21h53m

I am wondering why flash attention is like 5x slower with variable masking than without it. Lack of good masking support almost zeroes out the optimizations.

chillee
0 replies
11h53m

Where are you seeing these benchmarks?

LarsDu88
1 replies
11h32m

I was wondering... this post mentions that ops like sigmoid are very slow.

A lot of modern LLMs use activation functions built on sigmoid or softmax, like SiLU, Swish, and SoLU.

Does ReLU take less of a performance hit, and if so, maybe it'd be better to go back to good old ReLU?

eigenvalue
0 replies
18m

ReLU is literally just the identity clamped to zero below the origin (max(0, x)), so yes, it's much less computationally intensive than anything involving an exponential function. But I doubt you would get competitive results using such a simple activation.
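
For concreteness, the difference in work per element (definitions only, not a benchmark; SiLU here is the x * sigmoid(x) activation mentioned upthread):

    import torch

    x = torch.randn(1 << 20)

    relu = torch.clamp(x, min=0)        # max(0, x): a compare and a select, no transcendentals
    silu = x * torch.sigmoid(x)         # x / (1 + exp(-x)): needs an exp per element
    assert torch.allclose(relu, torch.relu(x))
    assert torch.allclose(silu, torch.nn.functional.silu(x))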

saagarjha
0 replies
11h45m

TMA (Tensor Memory Accelerator). This is a special hardware unit that accelerates the transfer of data between global memory and shared memory, taking care of all index calculation and out-of-bound predication. This frees up registers, which is a valuable resource to increase tile size and efficiency.

My understanding was that while it frees up registers it more importantly lets the hardware handle address generation, which can become a bottleneck as other operations around it become faster.