HN comments for: Llama 3-V: Matching GPT4-V with a 100x smaller model and 500 dollars

arnaudsm

26 replies

19h18m

2024-05-28 23:10:44 UTC

Tangential question : did anyone ever use GPT4-V in production in visual tasks? It's never consistent enough for me to be useful

serjester

12 replies

17h25m

2024-05-29 01:03:09 UTC

Don’t use it for anything OCR related that needs perfect accuracy. Stuff where some errors are ok, we’ve had great success. Depending on your budget, you can also run it multiple times to catch errors.

toomuchtodo

10 replies

17h1m

2024-05-29 01:27:34 UTC

How does it compare to Tesseract?

Edit: Thank you!

elanning

9 replies

16h45m

2024-05-29 01:42:59 UTC

I’ve done a lot of OCR work and tesseract is nearly a decade out of date at this point. It is not a serious technology for anything requiring good accuracy or minor complexity. From what I’ve seen, GPT-4V completely smokes tesseract, but then again, most modern OCR systems do. If you want fast and pretty powerful OCR, check out paddle. If you want slower but higher accuracy, check out transformer based models such as TrOCR.

authorfly

3 replies

7h27m

2024-05-29 11:00:56 UTC

Running PaddleOCR in production now, I would suggest contrasting Tesseract v4 and v5, since v5 is a lot better(but until recently has not been available on Linux) - PaddleOCR does still smoke it though, you are right (especially for concurrency and fairly easily just setting different workers to different GPUs for best concurrent batching).

cpursley

2 replies

6h29m

2024-05-29 11:58:58 UTC

How is Paddle on complex data tables? This is my biggest challenge at the moment.

authorfly

1 replies

5h23m

2024-05-29 13:04:46 UTC

What format? The entire data table in one image, or a PDF for example printed off with 8 pages where the user choose to only put the header on the first page etc? Or decent formatting, font size 8+ on an image with decent resolution? With the latter you are probably fine although you will need some manual implementation for parsing the output. You get bounding boxes at word level. One thing if I started nowadays I would do is use basic columns (x coordinates) to add '|' inbetween the outputs(including detecting empty span positions), keep items with similarish y coordinates together on lines, and put it into ChatGPT to format as desired, I suspect this would avoid misreading.

I would say PaddleOCR is good in general for tables - it's much better (in terms of recall rate) at recognising numerical digits / symbols than Tesseract although I notice it often misrecognises "l" in "Lullaby/ml/million" etc as "1" sometimes.

The cloud providers have better table extraction iff you can guarantee the same format each time for the document.

cpursley

0 replies

4h54m

2024-05-29 13:34:11 UTC

A wide variety of PDFs (both in length and content) that can have a variety of different tables, real estate related with a lot of financial content. And I need to be able to run on local models / software (no parsing as a service, no OpenAI, etc).

Here's just one example: https://www.totalflood.com/samples/residential.pdf (I struggle getting accurate data out of the Sales Comp section - basically all approaches mix up the properties.

Zuiii

2 replies

14h35m

2024-05-29 03:53:24 UTC

Tesseract's true value is being one apt-get command away (i.e. opensource). Does Debian host more modern OCR systems in their repos?

nunez

0 replies

13h18m

2024-05-29 05:10:36 UTC

Tesseract the tool is one apt-get away but the trained models are not, and I've found that they are a starting point, not a final destination. You still have to do more training on top of them for anything that isn't black text on a crisp white background.

elanning

0 replies

14h15m

2024-05-29 04:13:04 UTC

Big mistake on my part; I should clarify I fine-tuned both PaddleOCR and TrOCR on large amounts of data specific to my domain. I cannot speak on the best out of the box “ready to go” solutions (besides cloud ones, which were quite good with the right pre and post processing).

nh2

1 replies

14h53m

2024-05-29 03:35:25 UTC

See this for a comparison of PaddleOCR, TrOCR, and various cloud ones (note: on documents of typed and handwritten text):

https://news.ycombinator.com/item?id=32077375

authorfly

0 replies

7h29m

2024-05-29 10:58:55 UTC

Caveat that being from 2022, the Tesseract version used was almost certainly v4 (if Linux), rather than v5 which is much better (and widely available on Windows in 2022, but not Linux yet).

However Tesseract is quite behind still as you note, even with v5.

nomel

0 replies

16h40m

2024-05-29 01:48:44 UTC

you can also run it multiple times to catch errors.

Does this require a slight offset and/or rotation to the image, or just literal rerun, with seed seed/whatever giving a different result?

alexcnwy

4 replies

11h46m

2024-05-29 06:42:14 UTC

the coolest use case i've seen is this https://github.com/ddupont808/GPT-4V-Act

nucleative

3 replies

11h15m

2024-05-29 07:13:06 UTC

I feel like this is the beginning of the end for all captchas

udev4096

1 replies

10h37m

2024-05-29 07:50:49 UTC

Image based or any kind of visual captchas will never be extremely effective. I think we will see more of PoW captchas in the upcoming years (just like cloudflare's turnstile captcha)

fennecfoxy

0 replies

6h7m

2024-05-29 12:21:39 UTC

I'm not suer about that, can't you give GPT4 a math problem in an image already and have it solve it correctly most of the time?

And these haven't even been trained to defeat captchas/logic problem captchas yet, if it was fine tuned on the general pattern of them I imagine any form of captcha is bust.

jorvi

0 replies

5h39m

2024-05-29 12:48:56 UTC

Twitter / X has a very interesting captcha: you get to see 10 objects that have weird colors and are slightly deformed, and then you have to match them (1 at a time) with another row that has the same objects but seen from a different angle.

Of course eventually this will be defeated too, but for now it seems to work pretty well.

zacmps

3 replies

17h49m

2024-05-29 00:39:03 UTC

Nope, I tried it for graph and diagram understanding and it wasn't good enough. Planning to repeat the evaluation with 4o when I have time.

SparkyMcUnicorn

2 replies

17h36m

2024-05-29 00:52:24 UTC

I'm using 4o to convert visual diagrams into mermaid, and it's been almost perfectly accurate in my experience.

cuu508

1 replies

12h17m

2024-05-29 06:11:35 UTC

This is the out of the box thinking I love about HN. What do you do with the mermaid?

SparkyMcUnicorn

0 replies

3h51m

2024-05-29 14:37:12 UTC

The resulting mermaid is used for... more LLM processing. Converting to mermaid first is more cost-effective, consistent, and accurate for my purposes.

abrichr

3 replies

16h2m

2024-05-29 02:25:47 UTC

It's very reliable for GUI segment understanding; see e.g. https://github.com/OpenAdaptAI/OpenAdapt/pull/610 (scroll down to `gpt-4-vision-preview`).

amelius

2 replies

6h33m

2024-05-29 11:55:44 UTC

Can it be used for automatic annotation?

As in: you tell it that these and these parts should be masked such and such, and then it does that?

abrichr

1 replies

4h35m

2024-05-29 13:53:27 UTC

We have not had success with that unfortunately.

amelius

0 replies

4h23m

2024-05-29 14:05:17 UTC

Thank you, your comment will save me some trouble ;)

behnamoh

14 replies

20h18m

2024-05-28 22:10:16 UTC

This "matching gpt-4" catchy phrase has lost its meaning to me. Everytime an article like this pops up, I see marketing buzz and unrealistic results in practice.

Mo3

9 replies

20h14m

2024-05-28 22:14:34 UTC

Of course, it's nothing else. Who could possibly believe that OpenAI and others would dump billions into development and training and aren't smart enough to figure out they could also do it with $500.

nomel

2 replies

19h38m

2024-05-28 22:50:29 UTC

It's llama 3 training cost + their cost. Meta "kindly" covered the first $700M.

We add a vision encoder to Llama3 8B

lanceflt

1 replies

17h7m

2024-05-29 01:21:38 UTC

They didn't train the vision encoder either, it's unchanged SigLIP by Google.

qeternity

0 replies

6h33m

2024-05-29 11:55:14 UTC

“We finetuned billions of dollars of research by Google and Meta.”

KorematsuFredt

2 replies

19h41m

2024-05-28 22:46:50 UTC

You have clearly not read the article. $500 is the cost of fine tuning.

selcuka

1 replies

16h48m

2024-05-29 01:40:23 UTC

Fair enough. Is it now safe to say that OpenAI could have done with a 8B model + $500 of fine tuning instead of running a (much) larger model on their GPU cluster?

wrycoder

0 replies

16h37m

2024-05-29 01:51:18 UTC

Maybe they did

whimsicalism

0 replies

19h42m

2024-05-28 22:45:59 UTC

it would have been a lot cheaper for oai if they had access to llama3 in 2018

nickpsecurity

0 replies

19h25m

2024-05-28 23:03:29 UTC

While that may be true, the opposite has also happened to hundreds of companies in other areas:

https://news.ycombinator.com/item?id=39136472

Many companies also optimize for tools, like Python, that have boost productivity more than price/performance ratio. OpenAI had billions of other people's money. They might just keep using tools which worked before.

Lastly, there are tons of papers published on techniques that claim to reduce cost. Most of them aren't good. Their benchmarks aren't good. Even reviewing most of them is more time than a lot of AI researchers have. Those that make it to established communities usually have gotchas that come with the benefits. So, they could also simply miss a needle in a large haystack.

I think you're right that they'd be using whatever really worked with no loss in model performance. It's just that they might not for a number of reasons. The rational choice is for others to keep experimenting with those things in case they get a competitive advantage.

bilbo0s

0 replies

19h40m

2024-05-28 22:48:14 UTC

Who could possibly believe that OpenAI and others would dump billions into development and training and aren't smart enough to figure out they could also do it with $500.

People upvoting the post??

Not really sure? But PT Barnum said there's always a lot of them out there.

Pretty sure they mean fine tuning though?

But even that is total tripe.

These guys are snake oil salesmen. (Or Sylvester McMonkey McBean is behind it.)

mpalmer

1 replies

20h13m

2024-05-28 22:14:56 UTC

For me it's become a signal the person making the claim is unserious.

moffkalast

0 replies

8h3m

2024-05-29 10:25:01 UTC

If "beats GPT 4" is in the title it's almost a guarantee that it's a bold faced lie that includes benchmark overfitting.

The first time a model that actually matched GPT 4 launched (i.e. Command-R+) there was no mention of it at all. If your results speak for themselves, there's no need to shout.

nbk_2000

0 replies

16h45m

2024-05-29 01:43:08 UTC

Starting to sound like the "iPhone Killer" we've all heard about... for the past 15+ years

alfalfasprout

0 replies

17h41m

2024-05-29 00:46:46 UTC

Sadly the front page is often riddled with posts like these.

lanceflt

6 replies

20h38m

2024-05-28 21:49:46 UTC

- Llava is not the SOTA open VLM, InternVL-1.5 is https://huggingface.co/spaces/opencompass/open_vlm_leaderboa...

You need to compare the evals to strong open VLMs including this and CogVLM

- This is not "first-ever multimodal model built on top of Llama3", there's already a Llava on Llama3-8b https://huggingface.co/lmms-lab

valine

2 replies

19h39m

2024-05-28 22:49:31 UTC

Very curious how it performs on OCR tasks compared to InternVL. To be competitive at reading text you need tiling support, and InternVL does tiles exceptionally well.

hovering_nox

1 replies

11h14m

2024-05-29 07:14:44 UTC

I think CogVLM2 is even better than Intern at OCR (my usecase is extracting information from an invoice)

mkesper

0 replies

6h1m

2024-05-29 12:26:58 UTC

After some superficial testing I with bad quality scans you can find on kaggle I can not confirm that. CogVLM2 refuses to handle scans that InternVL-V1.5 still can comprehend.

yieldcrv

0 replies

18h24m

2024-05-29 00:04:37 UTC

I’m going to be saying First Ever AI something for the next 15 years for clout and capital, not going to be listening to anybody’s complicated ten step funnel if they’re not doing the obvious

gigel82

0 replies

20h24m

2024-05-28 22:03:45 UTC

Like InternVL, no llama.cpp support severely limits its applications. Close to GPT4v performance level and runnable locally on any machine (no need for a GPU) would be huge for the accessibility community.

abrichr

0 replies

15h56m

2024-05-29 02:32:08 UTC

Thank you for the link! Our initial testing suggests MiniCPM outperforms InternVL for GUI understanding: https://github.com/OpenAdaptAI/OpenAdapt/issues/637#issuecom...

(InternVL appears to hallucinate more.)

dcreater

6 replies

15h44m

2024-05-29 02:44:29 UTC

If I had a nickel for every outrageous "matches/beats GPT-x" claim, I'd have more money than the capital these projects raise from VC.

This absolutely is not the first Llama3 vision model. They even quote it's performance compared to Llava. Hard to take anything they say seriously with such obviously false claims

qeternity

4 replies

12h47m

2024-05-29 05:41:12 UTC

This absolutely is not the first Llama3 vision model. They even quote it's performance compared to Llava.

Although this is true, there have been earlier Llama3 based vision releases, none of the latest Llava releases are Llama3 based.

abbaselmas

3 replies

9h26m

2024-05-29 09:02:11 UTC

https://ollama.com/library/llava-llama3 llava-llama3

CGamesPlay

1 replies

8h8m

2024-05-29 10:19:59 UTC

This appears to be a Llava model which was then fine-tuned using outputs from Llama 3. If I understand correctly, that would make it Llama-2-based.

GaggiX

0 replies

7h57m

2024-05-29 10:31:21 UTC

fine-tuned using outputs from Llama 3.

Llama 3 outputs text and can only see text, this is a vision model.

that would make it Llama-2-based.

It's based on Llama 3, Llama 2 has nothing to do with it. They took Llama 3 Instruct and CLIP-ViT-Large-patch14-336, train the projection layer first and then later finetuned the Llama 3 checkpoint and train a LoRA for the ViT.

qeternity

0 replies

5h51m

2024-05-29 12:37:17 UTC

That is someone else who has just used the Llava name.

It is not by the original group who have published a series of models under the Llava name.

vixen99

0 replies

11h55m

2024-05-29 06:33:30 UTC

All models surely write 'its performance'.

KTibow

6 replies

20h13m

2024-05-28 22:15:38 UTC

Is there a reason Phi Vision is omitted?

cadence-

5 replies

19h55m

2024-05-28 22:33:14 UTC

Is there any place that currently hosts phi3 Vision and provides API access to it? I cannot run it on my local machine, unfortunately.

ai_what

3 replies

19h9m

2024-05-28 23:18:56 UTC

Nvidia has a 1000 free credits API for phi3 vision.

You only need an email address.

https://build.nvidia.com/microsoft/phi-3-vision-128k-instruc...

trog

1 replies

17h52m

2024-05-29 00:36:10 UTC

Was also looking for something like this - I can't find pricing listed anywhere for their API usage, only the free 1,000 credits - or am I completely misunderstanding how this works?

cpursley

0 replies

6h27m

2024-05-29 12:00:47 UTC

I can’t find the pricing either. I’m interested, the demo worked well.

cadence-

0 replies

18h49m

2024-05-28 23:39:28 UTC

Beautiful. Thank you.

KTibow

0 replies

18h41m

2024-05-28 23:47:44 UTC

Azure has a playground, I haven't tried to use it with an API though. https://ai.azure.com/explore/models/Phi-3-vision-128k-instru...

doctorpangloss

3 replies

20h52m

2024-05-28 21:36:03 UTC

Shouldn't CogAgent be in this comparison?

m00x

2 replies

20h30m

2024-05-28 21:58:25 UTC

CogVLM should be, not sure how CogAgent plays into this. This isn't an agent.

doctorpangloss

1 replies

19h57m

2024-05-28 22:31:24 UTC

You would use CogAgent in VQA mode. Why would someone downvote suggesting to test one of the most powerful multimodal LLMs? Because it doesn't have "V" in its name? CogAgent is improved on many tasks compared to CogVLM.

m00x

0 replies

15h31m

2024-05-29 02:57:23 UTC

I didn't downvote, only replied.

CogAgent is also CogVLM modified to handle documents and larger images. CogVLM is better for VQA.

vikrantrathore

1 replies

16h7m

2024-05-29 02:21:09 UTC

How does it compare with MiniCPM-Llama3-V 2.5 [0]? Based on what I see it seems much better than Llama 3-V on the benchmarks. Also it can directly be tried on Huggingface Spaces to check the performance [1]. It has the dataset, code and fine-tuning details with screenshots of it running on Xiaomi 14 pro. It has strong OCR performance and supports 30+ languages.

[0] https://github.com/OpenBMB/MiniCPM-V

[1] https://huggingface.co/spaces/openbmb/MiniCPM-Llama3-V-2_5

cpursley

0 replies

6h31m

2024-05-29 11:57:04 UTC

Woah, this actually did quite well on table data extraction. I wonder how this could be used for long documents. Maybe paired with some kind of hybrid rag approach.

yeldarb

0 replies

20h9m

2024-05-28 22:19:00 UTC

Don't see a license listed in the repo; presumably needs to be the same as Meta's Llama 3 license?

anais9

0 replies

6h27m

2024-05-29 12:00:56 UTC

Would love to see Ollama support for this - seems promising given my experience with LLaVA so far and would love to get some hands on head to head experience

Havoc

0 replies

18h50m

2024-05-28 23:38:36 UTC

Oh wow. I was expecting it to be the 70B one as base given those stats

2Gkashmiri

0 replies

15h26m

2024-05-29 03:02:09 UTC

Is there a local small llm that can OCR images or haabdwritten invoices ?

Traditional OCR do not handle multiple invoice formats or handwritten ones.

I would like to train one locally with as many invoices it wants