
Whisper: Nvidia RTX 4090 vs. M1 Pro with MLX

tiffanyh
36 replies
1d2h

Key to this article is understanding that it's leveraging the newly released Apple MLX framework, and that the code uses these Apple-specific optimizations.

https://news.ycombinator.com/item?id=38539153

modeless
26 replies
1d1h

Also, this is not comparing against an optimized Nvidia implementation. There are faster implementations of Whisper.

Edit: OK I took the bait. I downloaded the 10 minute file he used and ran it on my 4090 with insanely-fast-whisper, which took two commands to install. Using whisper-large-v3 the file is transcribed in less than eight seconds. Fifteen seconds if you include the model loading time before transcription starts (obviously this extra time does not depend on the length of the audio file).

That makes the 4090 somewhere between 6 and 12 times faster than Apple's best. It's also much cheaper than M2 Ultra if you already have a gaming PC to put it in, and still cheaper even if you buy a whole prebuilt PC with it.

This should not be surprising to people, but I see a lot of wishful thinking here from people who own high end Macs and want to believe they are good at everything. Yes, Apple's M-series chips are very impressive and the large RAM is great, but they are not competitive with Nvidia at the high end for ML.

swores
11 replies
23h37m

Would you be so kind as to link to a guide for your method or share it in a comment yourself?

I installed following the official docs and found it much, much slower, although I sadly don't have a 4090, instead a 3080 Ti 12GB (just big enough to load the large whisper model into GPU memory).

modeless
10 replies
23h35m

I'm running on Linux with a 13900k, 64 GB RAM, and I already have CUDA installed. Install commands directly from the README:

    pipx install insanely-fast-whisper

    pipx runpip insanely-fast-whisper install flash-attn --no-build-isolation

To transcribe the file:

    insanely-fast-whisper --flash True --file-name ~/Downloads/podcast_1652_was_jetzt_episode_1289963_update_warum_streiken_sie_schon_wieder_herr_zugchef.mp3 --language german --model-name openai/whisper-large-v3

The file can be downloaded at: https://adswizz.podigee-cdn.net/version/1702050198/media/pod...

I just ran it again and happened to get an even better time, under 7 seconds without loading and 13.08 seconds including loading. In case anyone is curious about the use of Flash Attention, I tried without it and transcription took under 10 seconds, 15.3 including loading.

swores
2 replies
22h23m

Thanks!

Another question that's only slightly related, but while we're here...

Using OAI's paid Whisper API, you can give a text prompt to a) set the tone/style of the transcription and b) teach it technical terms, names etc that it might not be familiar with and should expect in the audio to transcribe.

Am I correct that this isn't possible with any released versions of Whisper, or is there a way to do it on my machine that I'm not aware of?

modeless
1 replies
22h15m

You can definitely do this with the open source version. Many transcription implementations use it to maintain context between the max-30-second chunks Whisper natively supports.
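
For example, a minimal sketch with the openai-whisper Python package (the file name and prompt text are made up; initial_prompt is the relevant parameter):

    import whisper

    model = whisper.load_model("large-v3")
    result = model.transcribe(
        "interview.mp3",  # hypothetical input file
        language="en",
        # Bias the decoder toward domain terms and style for the first window;
        # later windows are conditioned on the previously transcribed text.
        initial_prompt="Kubernetes, Istio, Dr. Nguyen, quarterly SRE review",
    )
    print(result["text"])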

swores
0 replies
21h19m

I'll try to understand some of how stuff like faster-whisper works when I've got time over the weekend, but I fear it may be too complex for me...

I was rather hoping for a guide of just how to either adapt classic whisper usage or adapt one of the optimised ones like faster-whisper (which I've just set up in a docker container but that's used up all the time I've got for playing around right now) to take a text prompt with the audio file.

sundvor
2 replies
21h53m

Cheers, I've been wanting to do something else with my 4090 other than multi-monitor simulator gaming and quad-screen workstation work - and this will get me kicked off!

The 4090 is an absolute beast, runs extremely quiet and simply powers through everything. DCS pushes it to the limit, but the resulting experience is simply stunning. Mine's coupled to a 7800x3d which uses hardly any power at all, absolutely love it.

modeless
1 replies
21h47m

If you're looking for something easy to try out, try my early demo that hooks Whisper to an LLM and TTS so you can have a real time speech conversation with your local GPU that feels like talking to a person! It's way faster than ChatGPT: https://apps.microsoft.com/detail/9NC624PBFGB7

pixelpoet
0 replies
13h52m

This sounds awesome! Will check it out soon

owehrens
1 replies
22h38m

Thanks so much again. Got it working. 8 seconds. Nvidia is the king. Updated the blog post.

darkteflon
0 replies
21h46m

I think insanely-faster-whisper uses batching, so faster-whisper (which doesn’t) might be a fairer comparison for the purposes of your post.

owehrens
1 replies
23h1m

I just can't get it to work; it errors out with 'NotImplementedError: The model type whisper is not yet supported to be used with BetterTransformer.' Did you happen to run into this problem?

modeless
0 replies
22h59m

Sorry, I didn't encounter that error. It worked on the first try for me. I have wished many times that the ML community didn't settle on Python for this reason...

isodev
3 replies
1d1h

It also wasn’t optimised for Apple Silicon. Given how the different platforms performed in this test, the conclusions seem pretty solid.

jbellis
2 replies
1d1h

He is literally comparing whisper.cpp on the 4090 with an optimized-for-apple-silicon-by-apple-engineers version on the M1.

ETA: actually it's unclear from the article if the whisper optimizations were done by apple engineers, but it's definitely an optimized version.

rowanG077
0 replies
1d1h

I don't think whisper was optimized for apple silicon. Doesn't it just use MLX? I mean if using an API for a platform counts as specifically optimized then the Nvidia version is "optimized" as well since it's probably using CUDA.

isodev
0 replies
1d1h

Maybe I’m not seeing it right, but comparing the source of Apple’s Whisper to the Python Whisper, it seems there are minimal changes to redirect certain operations to MLX.

There is also cpp Whisper (https://github.com/ggerganov/whisper.cpp), which seems to have its own kind of optimizations for Apple Silicon - I don’t think this was the one used with Nvidia during the test.

darkteflon
3 replies
22h32m

Surely the fact that IFW uses batching makes it apples to oranges? The MLX-enabled version didn’t batch, did it? That fundamentally changes the nature of the operation. Wouldn’t the better comparison be faster-whisper?

modeless
2 replies
22h22m

I don't know exactly what the MLX version did, but you're probably right. I'd love to see the MLX side optimized to the max as well. I'm confident that it would not reach the performance of the 4090, but it might do better.

That said, for practical purposes, the ready availability of Nvidia-optimized versions of every ML system is a big advantage in itself.

darkteflon
1 replies
22h14m

Yeah, I think everyone knows that Nvidia is doing a cracker job. But it is good to just be specific about these benchmarks because numbers get thrown around and it turns out people are testing different things. The other thing is that Apple is extracting this performance on a laptop, at ~1/8 the power draw of the desktop Nvidia card.

In any event, it’s super cool to see such huge leaps just in the past year on how easy it is to run this stuff locally. Certainly looking very promising.

modeless
0 replies
22h10m

The M2 Ultra that got the best numbers that I was comparing to is not in a laptop. Regardless, you're probably right that the power consumption is significantly lower per unit time. However, is it lower per unit work done? It would be interesting to see a benchmark optimized for power. Nvidia's power consumption can usually be significantly reduced without much performance cost.
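
For anyone curious about trying that, a hedged sketch: nvidia-smi can cap the board power directly (the accepted range depends on the card and driver; 250 here is just an illustrative value):

    # Cap board power to 250 W for the benchmark run:
    sudo nvidia-smi -pl 250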

Also, the price difference between a prebuilt 4090 PC and a M2 Ultra Mac Studio can buy a lot of kilowatt hours.

owehrens
1 replies
23h36m

You mean 'https://github.com/Vaibhavs10/insanely-fast-whisper'? I didn't know about that until now. I've been running all of this for ~10 months and just had it running. Happy to try that out. The GPU is fully utilized using Whisper and PyTorch with CUDA and all. Thanks for the link.

m463
0 replies
18h29m

justinclift
1 replies
22h31m

Nvidia at the high end for ML.

Wouldn't the high end for Nvidia be their dedicated gear rather than a 4090?

modeless
0 replies
22h23m

Yes, an H100 would be faster still, and Grace Hopper perhaps even somewhat faster. But Apple doesn't have comparable datacenter-only products, so it's also interesting to see the comparison of consumer hardware. Also, the 4090 is cheaper than Apple's best, but the H100 costs more than both (if you are even allowed to buy it).

WhitneyLand
1 replies
22h55m

I’m afraid the article as well as your benchmarks can be misleading because there are a lot of different whisper implementations out there.

For example, the CTranslate2-optimized Whisper “implements a custom runtime that applies many performance optimization techniques such as weights quantization, layers fusion, batch reordering…”

Intuitively, I would agree with your conclusion about Apple’s M-Series being impressive for what they do but not generally competitive with Nvidia in ML.

Objectively however, I don’t see concluding much with what’s on offer here. Once you start changing libraries, kernels, transformer code, etc you end up with an apples to oranges comparison.

modeless
0 replies
22h53m

I think it's fair to compare the fastest available implementation on both platforms. I suspect that the MLX version can be optimized further. However, it will not close a 10x gap.

intrasight
8 replies
1d

Honest question: Why would I (or most users) care? If I have a Mac, I'm going to get the performance of that machine. If I have a gaming PC, I'll get the performance of that machine. If I have both, I'm still likely to use whichever AI is running on my daily driver.

shikon7
5 replies
1d

Suppose you’re a serious hobbyist wanting to buy the most powerful consumer device for deep learning. Then the choice between an RTX 4090 and an M3 Mac is probably quite interesting.

Personally, I already have an RTX 3090, so for me it’s interesting whether an M3 would be a noticeable upgrade (considering RAM, speed, library support).

light_hue_1
4 replies
1d

It's really not interesting. The Nvidia card is far superior.

The super optimized code for a network like this is one test case. General networks are another.

mac-mc
2 replies
23h18m

The "available VRAM" is the big difference. You get a lot more to use on an apple silicon machine for much cheaper than you get on an nvidia card on a $/GB basis, especially if you want to pass the 24GB threshold. You can't use NVIDIA consumer cards after that and there is a shortage of them.

The power:perf ratios seem about equal although, so it's really up to apple to release equivalently sized silicon on their desktop to really have this be a 1:1 comparison.

swores
1 replies
20h53m

I wonder how complicated it would be to make happen / how likely it is to happen, for the Apple way of sharing memory between CPU and GPU to become a thing on PCs. Could NVIDIA do it unilaterally or would it need to be in partnership with other companies?

Could anyone who understands this hardware better than me chime in on the complexities of bringing unified ram/vram to PC, or reasons it hasn't happened yet?

kalleboo
0 replies
11h18m

AMD built "Heterogeneous System Architecture" for their APUs to do the same thing, but since their APUs all ended up mostly targeting cheap/low-power/low-profile laptops and business machines, it was never actually utilized (from what I've been able to find, there's no public API to actually use this as designed and you still have to copy things from "system RAM" to "GPU RAM" even though the hardware was designed to support zero-copy, whereas Apple has made it all automatic when you use Metal)

mi_lk
0 replies
23h39m

The Nvidia card is far superior.

I mean, you're probably right, but MLX framework was released just a week ago so maybe we don't really know what it's capable of yet.

winwang
0 replies
1d

I have both. I drive both depending on context of my day and where I want to work. If I were interested in this, and the 4090 were faster, I'd just ssh into that machine from my Mac if I wanted to use my Mac -- and presumably, vice versa.

tomatotomato31
0 replies
1d

Curiosity for example.

Or having a better image of performance in your head when you buy new hardware.

It's just a blog article. The time and effort to make it and to consume it is not in the range of millions; there is no need for 'more'.

whywhywhywhy
30 replies
1d3h

I find these findings questionable unless Whisper was very poorly optimized in the way it was run on the 4090.

I have a 3090 and an M1 Max 32GB, and although I haven't tried Whisper, the inference difference on Llama and Stable Diffusion between the two is staggering, especially with Stable Diffusion, where SDXL takes about 9 seconds on the 3090 and about 1:10 on the M1 Max.

ps
7 replies
1d2h

I have a 4090 and an M1 Max 64GB. The 4090 is far superior on Llama 2.

astrodust
3 replies
1d2h

On models < 24GB presumably. "Faster" depends on the model size.

brucethemoose2
2 replies
1d2h

In this case, the 4090 is far more memory efficient thanks to ExLlamav2.

70B in particular is indeed a significant compromise on the 4090, but not as much as you'd think. 34B and down though, I think Nvidia is unquestionably king.

michaelt
1 replies
23h11m

Doesn't running 70B in 24GB need 2 bit quantisation?

I'm no expert, but to me that sounds like a recipe for bad performance. Does a 70B model in 2-bit really outperform a smaller-but-less-quantised model?

brucethemoose2
0 replies
15h14m

2.65bpw, on a totally empty 3090 (and I mean totally empty).

I would say 34B is the performance sweet spot, yeah. There was a long period where all we had in the 33B range was Llama v1, but now we have Yi and Codellama v2 (among others).

jb1991
2 replies
1d2h

But are you using the newly released Apple MLX optimizations?

ps
1 replies
1d1h

It's been approximately 2 months since I tested it, so probably not.

jb1991
0 replies
1d1h

But those optimizations are the subject of the article you are commenting on.

oceanplexian
6 replies
1d2h

The M1 Max has 400GB/s of memory bandwidth and a 4090 has 1TB/s; the M1 Max has 32 GPU cores and a 4090 has 16,000. The difference is more about how well the software is optimized for the hardware platform than any performance difference between the two, which are frankly not comparable in any way.

segfaultbuserr
2 replies
1d1h

M1 Max has 32 GPU cores and a 4090 has 16,000.

Apple M1 Max has 32 GPU cores, each core contains 16 Execution Units, each EU has 8 ALUs (also called shaders), so overall there are 4096 shaders. Nvidia RTX 4090 contains 12 Graphics Processing Clusters, each GPC has 12 Streaming Multi-Processors, and each SM has 128 ALUs, overall there are 18432 shaders.

A single shader is somewhat similar to a single lane of a vector ALU in a CPU. One can say that a single-core CPU with AVX-512 has 8 shaders, because it can process 8 FP64s at the same time. Calling them "cores" (as in "CUDA core") is extremely misleading, so "shader" became the common name for a GPU's ALU due to that. If Nvidia is in charge of marketing a 4-core x86-64 CPU, they would call it a CPU with 32 "AVX cores" because each core has 8-way SIMD.

jrk
1 replies
1d

Actually each of those x86 CPUs probably has at least two AVX FMA units, and can issue 16xFP32 FMAs per cycle – it’s at least “64 AVX cores”! :)

kimixa
0 replies
21h2m

Doesn't Zen 4 have 2x 256-bit FADD and 2x 256-bit FMA, double-pumping the ALU for AVX-512 ops (a good overview here [0])? If you count FADD as one flop and FMA as two, that's 48 "1-flop cores" per core.

I think it's got the same total FP ALU resources as zen3, and shows how register width and ALU resources can be completely decoupled.

[0] https://www.mersenneforum.org/showthread.php?p=614191

codedokode
1 replies
1d2h

I think the 4090 has 16,000 ALUs, not "cores" (let's call a component capable of executing instructions independently of others a "core"). And the M1 Max probably has more than 1 ALU in each core; otherwise it would resemble an ancient GPU.

rsynnott
0 replies
1d2h

Yeah; 'core' is a pretty meaningless term when it comes to GPUs, or at least it's meaningless outside the context of a particular architecture.

We may just be thankful that this particular bit of marketing never caught on for CPUs.

stonemetal12
0 replies
1d1h

Nvidia switched to marketing speak a long time ago when it came to the word "core". If we go with Nvidia's definition then M1 Max has 4096 cores, still behind the 4090, but the gap isn't as big as 32 to 16k.

woadwarrior01
4 replies
1d2h

You're taking benchmark numbers from a latent diffusion model's (SDXL) inference and extrapolating them to encoder-decoder transformer model's (Whisper) inference. These two model architectures have little in common (except perhaps the fact that Stable Diffusion models use a pre-trained text encoder from clip, which again is very different from an encoder-decoder transformer).

brucethemoose2
3 replies
1d2h

The point still stands though. Popular models tend to have massively hand-optimized Nvidia implementations.

Whisper is no exception: https://github.com/Vaibhavs10/insanely-fast-whisper

SDXL is actually an interesting exception for Nvidia because most users still tend to run it in PyTorch eager mode. There are super-optimized Nvidia implementations, like stable-fast, but their use is less common. Apple, on the other hand, took the odd step of hand-writing a Metal implementation themselves, at least for SD 1.5.

woadwarrior01
0 replies
23h35m

Although LDM inference and encoder-decoder / decoder-only LLM inference are both fundamentally autoregressive in nature, LLM inference is memory bound while LDM inference is compute bound. In that light, it makes sense that the difference between a 4090 and an M1 Pro isn't as pronounced as one would expect at first approximation.

Also, as you hint, whisper.cpp certainly isn't one of the fastest implementations of whisper inference out there. Perhaps a comparison between a pure PyTorch version running on the 4090 and an MLX version of Whisper running on the M1 Pro would be fairer. Or better yet, run the Whisper encoder on the ANE with CoreML and have the decoder running with Metal and Accelerate (which uses Apple's undocumented AMX ISA) using MLX, since MLX currently does not use the ANE. IIRC, whisper.cpp has a similar optimization on Apple hardware, where it optionally runs the encoder using CoreML and the decoder using Metal.

tgtweak
0 replies
1d2h

Modest 30x speedup

kkielhofner
0 replies
23h37m

This will determine who has a shot at actually being Nvidia competitive.

What I like to say is (generally speaking) other implementations like AMD (ROCm), Intel, Apple, etc are more-or-less at the “get it to work” stage. Due to their early lead and absolute market dominance Nvidia has been at the “wring every last penny of performance out of this” stage for years.

Efforts like this are a good step, but they still have a very long way to go to compete with multiple layers (throughout the stack) of insanely optimized Nvidia/CUDA implementations. Bonus points: nearly anything with Nvidia is a docker command away and just works on any chip they’ve made in the last half decade, from laptop to datacenter.

This can be seen (dramatically) with ROCm. I recently took the significant effort (again) to get an LLM to run on an AMD GPU. The AMD GPU is “cheaper” in initial cost, but when the dollar-equivalent (to within 10-30%) Nvidia GPU is 5-10x faster (or whatever), you’re not saving anything.

You’re already at a loss just getting it to work (random patches, version hacks, etc.) unless your time is free, and then the performance just isn’t even close, so the “value prop” of AMD currently doesn’t make any sense whatsoever. The advantage for Apple is you likely spent whatever for the machine anyway, and when you have it just sitting in front of you for a variety of tasks the value prop increases significantly.

kamranjon
2 replies
1d2h

There has been a ton of optimization of Whisper for Apple Silicon; whisper.cpp is a good example that takes advantage of this. Also, this article is specifically referencing the new Apple MLX framework, which I’m guessing your tests with Llama and Stable Diffusion weren’t utilizing.

sbrother
1 replies
1d2h

I assume people are working on bringing an MLX backend to llama.cpp... Any idea what the state of that project is?

tgtweak
0 replies
1d2h

https://github.com/ml-explore/mlx-examples

Several people are working on MLX-enabled backends for popular ML workloads, but it seems inference workloads see the most acceleration vs. generative/training.

KingOfCoders
1 replies
1d2h

"I haven't tried Whisper"

I haven't tried the hardware/software/framework/... of the article, but I have an opinion on this exact topic.

xxs
0 replies
1d1h

The topic is benchmarking some hardware and a specific implementation of some tool.

The provided context is an earlier version of the hardware where known implementations perform drastically differently - an order of magnitude differently.

That leaves the question why that specific tool exhibits the behavior described in the article.

tgtweak
0 replies
1d2h

Reading through some (admittedly very early) MLX docs, it seems that convolutions (as used heavily in GANs and particularly Stable Diffusion) are not really seeing meaningful uplifts on MLX at all, and in some cases are slower than on the CPU.

Not sure if this is a hardware limitation or just unoptimized MLX libraries, but I find it hard to believe they would have just ignored this very prominent use case. It's more likely that convolutions use high precision and much larger tile sets that require some expensive context switching when the entire transform can't fit in the GPU.

stefan_
0 replies
1d

Having used whisper a ton, there are versions of it that get one or two orders of magnitude better performance at the same quality while using less memory, for reasons I don't fully understand.

So I'd be very careful about your intuition on whisper performance unless it's literally the same software and same model (and even then the comparison isn't very meaningful, seeing how we want to optimize it for different platforms).

mv4
0 replies
20h45m

Thank you for sharing this data. I've just been debating between M2 Mac Studio Max and a 64GB i9 10900x with RTX 3090 for personal ML use. Glad I chose the 3090! Would love to learn more about your setup.

liuliu
0 replies
1d1h

Both of your SDXL numbers (3090 and M1 Max) should be faster (of course, it depends on how many steps). But the point stands: for SDXL, the 3090 should be 5x to 6x faster than the M1 Max and 2x to 2.5x faster than the M2 Ultra.

agloe_dreams
0 replies
1d2h

It's all really messy; I would assume that almost any model is poorly optimized to run on Apple Silicon as well.

etchalon
16 replies
1d2h

The shocking thing about these M series comparisons is never "the M series is as fast as the GIANT NVIDIA THING!", it's always "Man, the M series is 70% as fast with like 1/4 the power."

kllrnohj
10 replies
1d1h

It's not really that shocking. Power consumption is non-linear with respect to frequency, and you see this all the time in high-end CPU & GPU parts. Look at something like the eco modes on the Ryzen 7xxx for a great example. The 7950X stock pulls something like 260W on an all-core workload at 5.1GHz. Yet enable the 105W eco mode and that power consumption plummets to 160W at 4.8GHz. That means the last 300MHz of performance, which is borderline inconsequential (~6% performance loss), costs 100W. The 65W option then cuts that almost in half again, down to 88W (at 4GHz now), for a "mere" 20% reduction in performance. Or phrased differently, for 1/3rd the power the 7950X will give you 75% of the performance of a 7950X.

Matching performance while using less power is impressive. Using less power while also being slower, not so much.

nottorp
9 replies
1d1h

Or phrased differently, for 1/3rd the power the 7950X will give you 75% of the performance of a 7950X.

So where is the - presumably much cheaper - 65 W 7950X?

dotnet00
4 replies
1d1h

Why would it be much cheaper? The chips are intentionally clocked higher than the most efficient point because the point of the CPU is raw speed, not power consumption, especially since 7950x is a desktop chip.

His point is that M-CPUs being somewhat competitive but much more efficient is not as stunning since you're comparing a CPU tuned to be at the most efficient speed to a CPU tuned to be at the highest speed.

Similarly a 4090's power consumption drops dramatically if you underclock or undervolt even slightly, but what's the point? You're almost definitely buying a 4090 for its raw speed.

nottorp
3 replies
1d1h

Because they would be able to sell some rejects.

And because I don't want a software limit that may or may not work.

wtallis
0 replies
1d

AMD's chiplet-based design means they have plenty of other ways to make good use of parts that cannot hit the highest clock speeds. They have very little reason to do a 16-core low-clock part for their consumer desktop platform.

And your concerns about "a software limit that may or may not work" are completely at odds with how their power management works.

kllrnohj
0 replies
23h1m

That's not how binning works. Quality silicon is also the silicon that is more efficient. A flagship 65W part would be just as expensive as a result; it's the same-ish quality of parts.

dotnet00
0 replies
1d

It isn't a "software limit" beyond just being controlled by writing to certain CPU registers via software. It's very much a feature of the hardware, the same feature that allows for overclocking the chips.

coder543
3 replies
1d1h

Every 7950X offers a 65W mode. It’s not a separate SKU.

It’s a choice each user can make if they care more about efficiency. Tasks take longer to complete, but the total energy consumed for the completion of the task is dramatically less.

nottorp
2 replies
19h47m

You think? I think it's a choice that very few technical users even know about. And of those, 90% don't care about efficiency.

The 250 W space heater vacuum cleaner soundtrack mode should be opt in rather than opt out. Same for video cards.

coder543
1 replies
19h31m

Of course most users don’t pick the 65W option. They want maximum performance, and the cost of electricity is largely negligible to most people buying a 7950X.

AMD isn’t going to offer a huge discount for a 65W 7950X for the reasons discussed elsewhere: they don’t need to.

nottorp
0 replies
8h50m

and the cost of electricity is largely negligible to most people buying a 7950X.

“I can afford it” is still wasteful. Most people wouldn’t even notice the speed difference in 65 W mode. Even if they can afford the space heater mode.

lern_too_spel
4 replies
1d1h

As others have pointed out, it is nowhere near even 1/4 as fast.

etchalon
3 replies
1d1h

The result for a 10 Minute audio is 0:03:36.296329 (216 seconds). Compare that to 0:03:06.707770 (186 seconds) on my Nvidia 4090. The 2000 € GPU is still 30 seconds or ~ 16% faster. All graphics core where fully utilized during the run and I quit all programs, disabled desktop picture or similar for that run.

lern_too_spel
2 replies
21h38m

What other people have mentioned is that there are multiple implementations for Nvidia GPUs that are many times faster than whisper.cpp.

etchalon
1 replies
21h8m

And my comment was about M1 comparison articles, of which this was one, and which exhibited the property I mentioned.

lern_too_spel
0 replies
20h53m

It does this poorly. They compared the currently most optimized whisper implementation for M1 against an implementation that is far from the best currently available for the Nvidia GPU. They cannot make a claim of reaching 70% speed at 25% power usage.

Edit: The article has been updated to compare against a faster implementation for the Nvidia GPU and no longer makes that claim.

Flux159
11 replies
1d2h

How does this compare to insanely-fast-whisper though? https://github.com/Vaibhavs10/insanely-fast-whisper

I think that not using optimizations allows this to be a 1:1 comparison, but if the optimizations are not ported to MLX, then it would still be better to use a 4090.

Having looked at MLX recently, I think it's definitely going to get traction on Macs - and iOS when Swift bindings are released https://github.com/ml-explore/mlx/issues/15 (although there might be some C++20 compilation issue blocking right now).

brucethemoose2
8 replies
1d2h

This is the thing about Nvidia. Even if some hardware beats them in a benchmark, if it's a popular model, there will be some massively hand-optimized CUDA implementation that blows anything else out of the water.

There are some rare exceptions (like GPT-Fast on AMD thanks to PyTorch's hard work on torch.compile, and only in a narrow use case), but I can't think of a single one for Apple Silicon.

jeroenhd
3 replies
1d1h

The one being benchmarked here is heavily optimised for Apple Silicon. I think there are a few algorithms that Apple uses (like the one tagging faces on iPhones) that are heavily optimised for Apple's own hardware.

I think Apple's API would be as popular as CUDA if you could rent their chips at scale. They're quite efficient machines that don't need a lot of cooling, so I imagine the OPEX of keeping them running 24/7 in big cloud racks would be pretty low if they were optimised for server usage.

Apple seems to focus their efforts on bringing purpose-built LLMs to Apple machines. I can see why it makes sense (just like Google's attempts to bring Tensor cores to mobile) but there's not much practical use in this technology right now. Whisper is the first usable technology like this, but even my Android phone can live-transcribe spoken words into text as an accessibility feature. I don't think Apple can sell Whisper as a product to end users.

mac-mc
1 replies
23h13m

Apple would need to make rackmount versions of the machines with replaceable storage and maybe RAM, and would need to really beef up their headless management of the machines, before they start becoming competitive.

Otherwise you need a whole bunch of custom mac mini style racks and management software which really increases costs and lead times. If you don't believe me, look how expensive AWS macOS machines are compared to linux ones with equivalent performance.

poyu
0 replies
22h33m

They already make rack-mount Mac Pros. But yeah, they need to up their game on the management software.

jdminhbg
0 replies
1d

The one being benchmarked here is heavily optimised for Apple Silicon.

I don't think so, in the sense of a hand-optimized CUDA implementation. This just using the MLX API in the same way that you'd use CUDA via PyTorch or something.

rfoo
1 replies
1d1h

but I can't think of a single one for Apple Silicon.

The post here is exactly one for Apple Silicon. It compared a naive implementation in PyTorch, which may not even keep a 4090 busy (for smaller/not-that-compute-intensive models, having the entire computation driven by Python is... limiting, which is partly why torch.compile gives amazing improvements), to a purpose-optimized one for Apple Silicon (optimized for both CPU/GPU efficiency).

brucethemoose2
0 replies
15h13m

The pytorch performance is awful though. You'd have to be kinda crazy to not use an optimized implementation.

MBCook
1 replies
1d1h

I wouldn’t be surprised if a $2k top-of-the-line GPU is a match for/better than the built-in accelerator on a Mac. Even if the Mac was slightly faster, you could just stick multiple GPUs in a PC.

To me the news here is how well the Mac runs without needing that additional hardware/large power draw on this benchmark.

NorwegianDude
0 replies
12h27m

The power draw is not impressive here. Sure, it's low, but if you account for performance/W then the GPU is much more efficient.

claytonjy
0 replies
1d1h

To have a good comparison I think we'd need to run the insanely-fast-whisper code on a 4090. I bet it handily beats both the benchmarks in OP, though you'll need a much smaller batch size than 24.

You can beat these benchmarks on a CPU; 3-4x realtime is very slow for whisper these days!

chrisbrandow
0 replies
21h55m

He updated with insanely-fast-whisper.

atty
10 replies
23h0m

I think this is using the OpenAI Whisper repo? If they want a real comparison, they should be comparing MLX to faster-whisper or insanely-fast-whisper on the 4090. faster-whisper runs sequentially; insanely-fast-whisper batches the audio in 30-second intervals.

We use Whisper in production and these are our findings: we use faster-whisper because we find the quality is better when you include the previous segment text. Just for comparison, we find that faster-whisper is generally 4-5x faster than openai/whisper, and insanely-fast-whisper can be another 3-4x faster than faster-whisper.
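
For reference, a minimal faster-whisper sketch of that kind of setup (the model size, file name, and options are illustrative rather than our production config; condition_on_previous_text is the flag that feeds the previous segment back in):

    from faster_whisper import WhisperModel

    model = WhisperModel("large-v3", device="cuda", compute_type="float16")
    segments, info = model.transcribe(
        "audio.mp3",
        beam_size=5,
        # Reuse the previous segment's text as context for the next window.
        condition_on_previous_text=True,
    )
    for segment in segments:
        print(segment.start, segment.end, segment.text)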

youssefabdelm
5 replies
21h42m

Does insanely-fast-whisper use beam size of 5 or 1? And what is the speed comparison when set to 5?

Ideally it also exposes that parameter to the user.

Speed comparisons seem moot to me when quality is sacrificed; I'm working with very poor audio quality, so transcription quality matters.

busup
2 replies
11h48m

It's beam size 1. From my quick tests on a Colab T4, CTranslate2 (faster-whisper's backend) is about 30% faster with like for like settings. I decoded the audio, got mel features, split into 30s segments, and ran it batched (beam size 1, batch size 24, no temperature fallback passes). Takes a bit more effort than a cli utility but isn't too hard.

Side note: the insanely-fast-whisper README gives benchmarks as if on an A100, but only the FA2 lines were; the rest were on a T4, judging by the notebooks/history. Turing doesn't support FA2, so the gap should be smaller with it, but based on the distil-whisper paper CTranslate2 is probably still faster.

TensorRT-LLM might be faster but I haven't looked into it yet.
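
In case anyone wants to reproduce this, a rough sketch along the lines of the CTranslate2 Whisper example (the converted model path, language token, and 30-second chunking here are illustrative; a real run would also handle the final partial window and timestamps):

    import ctranslate2
    import librosa
    import transformers

    # Load audio at 16 kHz and split it into 30-second windows.
    audio, _ = librosa.load("audio.mp3", sr=16000, mono=True)
    chunk = 30 * 16000
    windows = [audio[i:i + chunk] for i in range(0, len(audio), chunk)]

    # Mel features for the whole batch: shape [batch, 80, 3000].
    processor = transformers.WhisperProcessor.from_pretrained("openai/whisper-large-v2")
    inputs = processor(windows, sampling_rate=16000, return_tensors="np")
    features = ctranslate2.StorageView.from_array(inputs.input_features)

    # Model previously converted with ct2-transformers-converter.
    model = ctranslate2.models.Whisper("whisper-large-v2-ct2", device="cuda")

    prompt = processor.tokenizer.convert_tokens_to_ids(
        ["<|startoftranscript|>", "<|de|>", "<|transcribe|>", "<|notimestamps|>"]
    )
    results = model.generate(features, [prompt] * len(windows), beam_size=1)
    text = " ".join(
        processor.decode(r.sequences_ids[0], skip_special_tokens=True) for r in results
    )
    print(text)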

sanchit-gandhi
1 replies
7h43m

Hugging Face Whisper (the backend to insanely-fast-whisper) now supports PyTorch SDPA attention with PyTorch>=2.1.1

It's enabled by default with the latest Transformers version, so just make sure you have:

* torch>=2.1.1

* transformers>=4.36.0

busup
0 replies
2h56m

Nice, thanks for your work on everything Whisper related. I tested it a couple of weeks ago, which largely matched the results in the insanely-fast-whisper notebook. The comparison was with BetterTransformer.

I just reran the notebook with 4.36.1 (minus the to_bettertransformer line) but it was slower (the batch size 24 section took 8 vs 5 min). Is there something I need to change? Going back to 4.35.2 gives the old numbers so the T4 instance seems fine.

atty
1 replies
20h59m

Our comparisons were a little while ago so I apologize I can’t remember if we used BS 1 or 5 - whichever we picked, we were consistent across models.

Insanely fast whisper (god I hate the name) is really a CLI around Transformers’ whisper pipeline, so you can just use that and use any of the settings Transformers exposes, which includes beam size.

We also deal with very poor audio, which is one of the reasons we went with faster whisper. However, we have identified failure modes in faster whisper that are only present because of the conditioning on the previous segment, so everything is really a trade off.

sanchit-gandhi
0 replies
7h57m

Indeed, insanely-fast-whisper supports beam-search with a small code modification to this code snippet: https://huggingface.co/openai/whisper-large-v3

Just call the pipeline with:

result = pipe(sample, generate_kwargs={"num_beams": 5})
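
For context, a minimal version of that kind of pipeline setup (the audio file name and the chunk/batch settings are placeholders; num_beams is the addition):

    import torch
    from transformers import pipeline

    pipe = pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-large-v3",
        torch_dtype=torch.float16,
        device="cuda:0",
    )

    result = pipe(
        "audio.mp3",
        chunk_length_s=30,
        batch_size=24,
        generate_kwargs={"num_beams": 5},
    )
    print(result["text"])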

moffkalast
1 replies
21h58m

Is insanely-fast-whisper fast enough to actually run on the CPU and still transcribe in realtime? I see that none of these are running quantized models; it's still fp16. Seems like there's more speed left to be found.

Edit: I see it doesn't yet support CPU inference, should be interesting once it's added.

atty
0 replies
20h55m

Insanely fast whisper is mainly taking advantage of a GPU’s parallelization capabilities by increasing the batch size from 1 to N. I doubt it would meaningfully improve CPU performance unless you’re finding that running whisper sequentially is leaving a lot of your CPU cores idle/underutilized. It may be more complicated if you have a matrix co-processor available, I’m really not sure.

PH95VuimJjqBqy
1 replies
21h38m

yeah well, I find that super-duper-insanely-fast-whisper is 3-4x faster than insanely-fast-whisper.

/s

atty
0 replies
21h7m

Yes I am not a fan of the naming either :)

mightytravels
4 replies
1d2h

Use this Whisper derivative repo instead - one hour of audio gets transcribed within a minute or less on most GPUs - https://github.com/Vaibhavs10/insanely-fast-whisper

thrdbndndn
2 replies
1d2h

Could someone elaborate on how this is accomplished and whether there is any quality disparity compared to the original?

Repos like https://github.com/SYSTRAN/faster-whisper make immediate sense as to why they're faster than the original implementation, and lots of others get there by lowering quantization precision etc. (with worse results).

But with this one it's not very clear how, especially considering it's even much faster.

mightytravels
0 replies
1d

From what I can see it is parallel batch processing - the default for that repo is 24. You can reduce the batch size, and if you use 1 it's as fast (or slow) as stock Whisper. Quality is the exact same (same large model used).
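
If I'm reading the README right, the batch size is exposed as a CLI flag, something like:

    insanely-fast-whisper --file-name audio.mp3 --batch-size 4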

lern_too_spel
0 replies
1d1h

The Acknowledgments section on the page that GP shared says it's using BetterTransformer. https://huggingface.co/docs/optimum/bettertransformer/overvi...

claytonjy
0 replies
22h11m

Anecdotally I've found ctranslate2 to be even faster than insanely-fast-whisper. On an L4, using ctranslate2 with a batch size as low as 4 beats all their benchmarks except the A100 with flash attention 2.

It's a shame faster-whisper never landed batch mode, as I think that's preventing folks from trying ctranslate2 more easily.

SlavikCA
4 replies
1d2h

It's easy to run Whisper on my Mac M1. But it's not using MLX out of the box.

I spent an hour or two trying to figure out what I need to install/configure to enable it to use MLX. I was getting cryptic Python errors, Torch errors... Gave up on it.

I rented a VM with a GPU and started Whisper on it within a few minutes.

xd1936
0 replies
1d1h

I've really enjoyed this macOS Whisper GUI[1]. It doesn't use MLX, but does use Metal.

1. https://goodsnooze.gumroad.com/l/macwhisper

tambourine_man
0 replies
1d1h

It was released last week. Give it a month or two.

jonnyreiss
0 replies
22h5m

I was able to get it running on MLX on my M2 Max machine within a couple minutes using their example: https://github.com/ml-explore/mlx-examples/tree/main/whisper

JCharante
0 replies
1d1h

Hmm, I've been using this product for whisper https://betterdictation.com/

bee_rider
3 replies
1d1h

Hmm… this is a dumb question, but the cookie pop up appears to be in German on this site. Does anyone know which button to press to say “maximally anti-tracking?”

layer8
2 replies
1d1h

If only we had a way to machine-translate text or to block such popups.

bee_rider
1 replies
1d1h

I think it is better not to block these sorts of pop-ups, they are part of the agreement to use the site after all.

Anyway the middle button is “refuse all” according to my phone, not sure how accurate the translation is or if they’ll shuffle the buttons for other people.

It is poor design to have what appear to be “accept” and “refuse” both in green.

layer8
0 replies
23h31m

According to GDPR, the site is not allowed to track you as long as you haven’t given your consent. There is no agreement until you actually agreed to something.

accidbuddy
2 replies
22h0m

About Whisper: does anyone know of a project (GitHub) for using the model in real time? I'm studying a new language, and it seems like a good opportunity to use it for learning pronunciation vs. the written word.

samx81
1 replies
15h18m

This one uses faster-whisper as the backend; I've tried it with the small model and the performance is good. https://github.com/collabora/WhisperLive

There is another one that uses Hugging Face's implementation, but I haven't tried it since my hardware doesn't support flash-attn 2: https://github.com/luweigen/whisper_streaming

accidbuddy
0 replies
1h40m

Thanks. I'll try.

LiamMcCalloway
2 replies
1d1h

I'll take this opportunity to ask for help: what's a good open source transcription and diarization app or workflow?

I looked at https://github.com/thomasmol/cog-whisper-diarization and https://about.transcribee.net/ (from the people behind Audapolis) but neither works that well -- crashes, etc.

Thank you!

mosselman
0 replies
20h42m

I would like to know the same.

It shouldn’t be so hard since many apps have this. But what is the most reliable way right now?

dvfjsdhgfv
0 replies
1d1h

I developed my own solution, pretty rudimentary - it divides the MP3s into chunks that Whisper is able to handle and then sends them one by one to the API to transcribe. Works as expected so far; it's just a couple of lines of Python code.
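
Roughly this shape, in case it helps anyone (a sketch using pydub and the OpenAI Python client; the chunk length and file names are placeholders):

    from openai import OpenAI
    from pydub import AudioSegment

    client = OpenAI()
    audio = AudioSegment.from_mp3("episode.mp3")
    chunk_ms = 10 * 60 * 1000  # 10-minute chunks to stay under the API's upload limit

    parts = []
    for start in range(0, len(audio), chunk_ms):
        # Export the slice to a temporary file and send it to the API.
        audio[start:start + chunk_ms].export("chunk.mp3", format="mp3")
        with open("chunk.mp3", "rb") as f:
            parts.append(
                client.audio.transcriptions.create(model="whisper-1", file=f).text
            )

    print(" ".join(parts))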

throwaw33333434
1 replies
1d1h

META: is the M3 Pro good enough to run Cyberpunk 2077 smoothly? Does the Max really make a difference?

ed_balls
0 replies
1d1h

The 14-inch M3 Max may overheat.

tgtweak
1 replies
1d2h

Does this translate to other models, or was Whisper cherry-picked due to its serial nature and integer math? Looking at https://github.com/ml-explore/mlx-examples/tree/main/stable_... seems to hint that this is the case:

At the time of writing this comparison convolutions are still some of the least optimized operations in MLX.

I think the main thing at play is the fact that you can have 64+ GB of very fast RAM directly coupled to the CPU/GPU, and the benefits of that from a latency/co-accessibility point of view.

These numbers are certainly impressive when you look at the power envelopes of these systems.

Worth considering/noting that the cost of an M3 Max system with the minimum RAM config is ~2x the price of a 4090...

densh
0 replies
1d1h

Apple silicon’s memory is fast only in comparison to consumer CPUs, which stagnated for ages with only 2 memory channels - fine in the 4-core era but making no sense at all with modern core counts. Memory scaling on GPUs is much better, even on the consumer front.

lars512
1 replies
1d

Is there a great speech generation model that runs on MacOS, to close the loop? Something more natural than the built in MacOS voices?

treprinum
0 replies
21h1m

You can try VALL-E; it takes around 5s to generate a sentence on a 3090 though.

DeathArrow
1 replies
1d2h

Ok, OpenAI will ditch Nvidia and buy macs instead. :)

baldeagle
0 replies
1d2h

Only if Sam Altman is appointed to the Apple board. ;)

theschwa
0 replies
1d2h

I feel like this is particularly interesting in light of their Vision Pro. Being able to run models in a power efficient manner may not mean much to everyone on a laptop, but it's a huge benefit for an already power hungry headset.

sim7c00
0 replies
1d

Looking at the comments, perhaps the article could be more aptly titled. The author does stress that these benchmarks, maybe better called test runs, are not of any scientific accuracy or worth, but simply demonstrate what is being tested. I think it's interesting, though, that Apple and 4090s are even compared in any way, since the devices are so vastly different. I'd expect the 4090 to be more powerful, but Apple-optimized code runs really quickly on Apple Silicon despite this seemingly obvious fact, and that I think is interesting. You don't need a 4090 to do things if you use the right libraries. Is that what I can take from it?

runjake
0 replies
1d

Anyone have overall benchmarks or qualified speculation on how an optimized implementation for a 4070 compares against the M series -- especially the M3 Max?

I'm trying to decide between the two. I figure the M3 Max would crush the 4070?

jauntywundrkind
0 replies
1d1h

I wonder how AMD's XDNA accelerator will fare.

They just shipped 1.0 of the Ryzen AI Software and SDK. Alleges ONNX, PyTorch, and Tensorflow support. https://www.anandtech.com/show/21178/amd-widens-availability...

Interestingly, the upcoming XDNA2 is supposedly going to boost generative performance a lot? "3x". I'd kind of assumed these sorts of devices would mainly be helping with inference. (I don't really know what characterizes the different workloads, just a naive grasp.)

iAkashPaul
0 replies
1d1h

There's a better parallel/batching approach that works on the 30s chunks, resulting in 40X. From HF, at https://github.com/Vaibhavs10/insanely-fast-whisper

This is again not native PyTorch, so there's still room for better RTFx numbers.

ex3ndr
0 replies
23h8m

So running on an M2 Ultra would beat the 4090 by 30%? (Since it has 2x the GPU cores.)

darknoon
0 replies
1d2h

It would be more interesting if PyTorch with the MPS backend were also included.

brcmthrowaway
0 replies
21h51m

Shocked that Apple hasn't released a high end compute chip competitive with NVIDIA

bcatanzaro
0 replies
1d2h

What precision is this running in? If 32-bit, it’s not using the tensor cores in the 4090.

atlas_hugged
0 replies
21h26m

TL;DR

If you compare Whisper on a Mac with a Mac-optimized build vs. on a PC with a non-optimized Nvidia build, the results are close! If an Nvidia-optimized build is compared, it's not even remotely close.

Pfft

I’ll be picking up a Mac but I’m well aware it’s not close to Nvidia at all. It’s just the best portable setup I can find that I can run completely offline.

Do people really need to make these disingenuous comparisons to validate their purchase?

If a Mac fits your overall use case better, get a Mac. If a PC with Nvidia is the better choice, get that. Why all these articles of "look, my choice wasn't that dumb"??

Lalabadie
0 replies
1d2h

There will be a lot of debate about which is the absolute best choice for X task, but what I love about this is the level of performance at such a low power consumption.

2lkj22kjoi
0 replies
19h58m

4090 -> 82 TFLOPS

M3 MAX GPU -> 10 TFLOPS

It is 8 times slower than 4090.

But yeah, you can claim that a bike has faster acceleration than a Ferrari because it can reach a speed of 1 km per hour sooner...