
How Does GPT-4o Encode Images?

ComputerGuru
19 replies
1d1h

We desperately need a modern open source replacement for Tesseract built on current SoTA ML tech. It is insane that we are resorting to LLMs for this purpose (which, aside from being the wrong tool and far too overpowered for the job, are also prone to hallucinations, have insanely expensive training and inference costs, etc.) because the "best" non-LLM solution is so bad it can't even correctly OCR monospaced hi-res scans of ASCII text with sufficient accuracy.

rvnx
5 replies
1d1h

One good self-hosted OCR is PaddleOCR, https://github.com/PaddlePaddle/PaddleOCR

Beats everything else, truly international and multi-lingual, including Chinese (as it is made in China)

ComputerGuru
2 replies
22h27m

Why is it "self-hosted" and not "library + desktop/cli app"? "Self-hosted" implies it needs a full web stack and RDBMS backend?

rvnx
1 replies
22h4m

It was just to show that you can run it locally, as opposed to the "cloud APIs" referred to in the thread. But you are right, the more correct term is "local".

ComputerGuru
0 replies
3h37m

Thanks. I had clicked the readme but I was on my phone and wasn’t able to translate it to English to see if it was a web app.

paul-tharun
0 replies
1d

It is insanely fast compared to alternatives and has really high accuracy even on new tasks without any training.

Their PaddleLayout models are also miles ahead compared to LayoutParser or TableTransformers in both inference speed and output quality

jakderrida
0 replies
22h5m

I think that's Baidu. I remember https://github.com/PaddlePaddle/ from when Ernie 3.0 was released, back before text encoder models were forgotten amid the progress of decoder-only ones.

orbital-decay
3 replies
1d

A good open source model for handwriting recognition is sorely missing as well.

ComputerGuru
1 replies
23h5m

The United States Postal Service probably has the best in the world, though its training probably restricts it to a subset of possible inputs. I wonder if it would be possible to get a senator or congressman to push for open sourcing it.

daemonologist
0 replies
15h3m

I believe the USPS system makes extensive use of knowledge of possible valid addresses so you're probably right that it wouldn't be generally applicable. Their _dataset_ must be glorious (and extremely confidential) though.

nine_k
0 replies
1d

Often in humans, too, depending on how bad the particular handwritten word is.

rfoo
2 replies
1d1h

There are certainly smaller and even better models for OCR.

But the whole "point" of LLMs (forget it, it's not AGI) is that you don't need to build lots of specialized models and cursed pipelines anymore to solve a definitely-in-reach-without-LLMs problem that your farmer neighbor wants to pay $500 for.

Before LLMs it wasn't going to get done, because it takes more than $500 worth of engineering hours. Now we just brute-force it. Sure, it takes more compute, but we get it done!

I guess your OCR dream is covered by this.

Zetaphor
1 replies
10h2m

> There are certainly smaller and even better models for OCR

Could you please list some? I am developing a tool that relies on OCR and everything I've found refers to tesseract as being the best choice

jaggirs
0 replies
7h40m

Surya OCR

asadm
2 replies
1d1h

Hmm, I haven't tried it, but does Apple's OCR API do better here? I.e., is it possible to do this with it?

rgovostes
1 replies
1d

The API: https://developer.apple.com/documentation/vision/recognizing...

In my experience it works remarkably well for features like scanning documents in Notes and in copying or translating text embedded in images in Safari.

It is not open source, but free to use locally. Someone has written a Python wrapper (apple-ocr) around it if you want to use it in other workflows. The model files might be in /System/Library/PrivateFrameworks/TextRecognition.framework if you wanted to port them to other platforms.

nexuist
0 replies
23h17m

I also wrote a Swift CLI that wraps over the Vision framework: https://github.com/nexuist/seev

Text extraction is included (including the ability to specify custom words not found in the dictionary) but there are also utilities for face detection, classification, etc.

EarlyOom
0 replies
15h48m

We're trying to do something similar with VLM-1 https://vlm-docs.nos.run/guides/guide-pdf-presentations. We've found that a lot of the peculiarities of LLMs for text parsing (hallucinations etc.) can be avoided with structured output that restricts everything to a known schema/output range while constraining the number of output tokens required.
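A minimal sketch of the idea (a generic pydantic schema with hypothetical field names, not VLM-1's actual API):

    # Minimal sketch of schema-constrained extraction (hypothetical fields,
    # not VLM-1's actual API). A fixed schema bounds both the output token
    # count and the space in which the model can hallucinate.
    from pydantic import BaseModel, Field

    class SlideSummary(BaseModel):
        title: str = Field(max_length=120)
        bullet_points: list[str] = Field(max_length=10)
        contains_table: bool

    def parse_model_output(raw_json: str) -> SlideSummary:
        # Anything outside the schema (extra prose, invented fields) fails validation.
        return SlideSummary.model_validate_json(raw_json)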

AndrewKemendo
0 replies
22h49m

Fully agree

Improving OCR would require innovation within CV - separate from transformer architectures and frankly I don’t expect much new work to happen here

simonw
13 replies
1d4h

Something I don't get is why OpenAI don't provide clear, comprehensive documentation as to how this actually works,

I get that there's competition from other providers now so they have an instinct to keep implementation details secret, but as someone building on their APIs this lack of documentation really holds me back.

To make good judgements about how to use this stuff I need to know how it works!

I had a hilarious bug a few weeks ago where I loaded in a single image representing multiple pages of a PDF and GPT-4 vision effectively hallucinated the contents of the document when asked to OCR it, presumably because the image was too big and was first resized to a point where the text was illegible: https://simonwillison.net/2024/Apr/17/ai-for-data-journalism...

If OpenAI had clear documentation about how their image handling works I could avoid those kinds of problems much more effectively.

nolok
3 replies
1d2h

The fact that it's so eager to hallucinate random things that sound plausible enough if you're not paying attention, without warning you or giving any error, should make people reconsider using it for "data journalism" or similar.

If you build your system and it "works", then how will you catch the one time out of X where it confidently provides false information that you happily use because it usually works?

TeMPOraL
1 replies
1d2h

> How will you catch the one time out of X where it confidently provides false information that you happily use because it usually works?

You don’t. You treat it like you would a human worker: set your process to detect or tolerate wrong output. If you can't, don't apply this tool to your work.

IanCal
0 replies
1d2h

This is true but misses a key fact: typical LLM errors are different from human errors. Not that they're worse or better, but you need to understand where and when they're more likely to make mistakes and how to manage that.

Onawa
3 replies
1d4h

I was trying to figure out this exact same issue. OCR on a PDF worked great, up until a certain point when it just started hallucinating like crazy. I was working on a whole pipeline to feed a PDF in one page at a time to get around this issue. Otherwise, the OCR works absolutely fantastically compared to all the other tools I've been trying lately, including OCRmyPDF (Tesseract), Surya OCR, and some of the models on the Visual LLM Leaderboard.

I've also seen some people recommend PaddleOCR, but I find their documentation lacking and I haven't gotten it working yet to evaluate.

Onawa
0 replies
1d3h

Funny enough, Simon Willison is the op of this comment thread lol.

infecto
0 replies
1d4h

For document text/table extraction, nothing beats the quality from the cloud providers. It can get costly, but the accuracy is much higher than what you will get from the OpenAI API.

resters
1 replies
21h33m

Is there documentation (is it possible?) on how to upload a PDF to gpt-4o using the API?

simonw
0 replies
20h46m

I think you have to split it into a page per image and then upload each page separately. That's how I've been doing it.
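Roughly, the per-page approach looks like this (assuming pdf2image and the OpenAI Python client; the prompt is a placeholder):

    # Sketch: render each PDF page to an image and send it separately.
    # Assumes pdf2image (requires poppler) and the OpenAI Python client.
    import base64, io
    from pdf2image import convert_from_path
    from openai import OpenAI

    client = OpenAI()

    for page in convert_from_path("report.pdf", dpi=200):
        buf = io.BytesIO()
        page.save(buf, format="PNG")
        b64 = base64.b64encode(buf.getvalue()).decode()
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": "OCR this page. Return the text verbatim."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}", "detail": "high"}},
                ],
            }],
        )
        print(resp.choices[0].message.content)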

infecto
1 replies
1d4h

But they do document that the images are resized and give you some rough guidelines on how you should size your images. Low resolution is 1024 x 1024 with no tiling, and high resolution starts at 2048 x 2048, which then gets tiled. It could use further documentation, but it is enough to know that you should never send more than one page per image via the API.
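For reference, a quick calculator for the high-detail token cost under my reading of those docs (fit within 2048x2048, scale the shortest side down to 768px, cut into 512x512 tiles at 170 tokens each, plus an 85-token base); treat the exact numbers as my interpretation, not gospel:

    # Token-cost arithmetic for high-detail mode, per my reading of OpenAI's docs.
    import math

    def image_tokens_high_detail(width: int, height: int) -> int:
        scale = min(1.0, 2048 / max(width, height))   # fit within 2048x2048
        width, height = width * scale, height * scale
        scale = min(1.0, 768 / min(width, height))    # shortest side capped at 768px
        width, height = width * scale, height * scale
        tiles = math.ceil(width / 512) * math.ceil(height / 512)
        return 85 + 170 * tiles                       # base image + per-tile cost

    print(image_tokens_high_detail(1024, 1024))  # 4 tiles -> 765 tokens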

alach11
0 replies
1d4h

Right. But I still have a lot of questions. How does the model handle when something important overlaps multiple tiles in high-resolution mode? Am I better off doing the tiling myself with some overlap?

ilaksh
0 replies
23h47m

There is an effectively infinite number of possibilities of things people could throw at it and they can't know ahead of time whether your use case will work or not. Even if they told you exactly how it worked, you wouldn't know for sure until you tried it. And giving a vague explanation wouldn't help you either.

rvnx
13 replies
1d4h

The author claims that the most likely explanation is that Tesseract is running behind ChatGPT-4V/4o.

There is no way that this is Tesseract.

Tesseract's accuracy is very low; it can barely do OCR on printed documents.

llm_trw
4 replies
1d4h

Because no one knows how to prep the images. With the right file type and resolution I get under one character error per 10 pages, and it's been that good since the late 00s.
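The basics are not a secret; a generic starting point (assuming pytesseract and Pillow, and not claiming this is the whole story) looks something like:

    # Generic preprocessing starting point for Tesseract: grayscale,
    # upscale, crude binarization, then OCR a uniform block of text.
    import pytesseract
    from PIL import Image

    img = Image.open("scan.png").convert("L")                          # grayscale
    img = img.resize((img.width * 2, img.height * 2), Image.LANCZOS)   # upscale
    img = img.point(lambda px: 255 if px > 180 else 0)                 # binarize
    text = pytesseract.image_to_string(img, config="--psm 6")
    print(text)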

yorwba
1 replies
1d4h

How do you prep the images?

llm_trw
0 replies
22h7m

My hourly rate starts at $300. If you'd like to hire me you're more than welcome to. I've done this work for a number of companies in the past.

alach11
1 replies
1d

With handwriting? With mixed fonts? Tesseract requires heavy customization and extension to perform reasonably on these workloads. The off-the-shelf options from major cloud providers blow it out of the water.

llm_trw
0 replies
22h6m

Never had to use it with handwriting; mixed fonts and text where location carries semantic information: absolutely.

RicoElectrico
3 replies
1d4h

Yeah, Tesseract is barely production quality.

lyu07282
2 replies
1d4h

yeah it was SOTA in 2006, 18 years ago

jascha_eng
1 replies
1d3h

Other than proprietary models, what is better than it today? Just asking in case I ever need OCR and don't want to pay the cloud providers for it :D

jerrygenser
1 replies
1d4h

Even if Tesseract's accuracy is low, if the Tesseract result is passed to the LLM along with the image, it can produce much more accurate OCR.

For example, GPT-4 with some vision capability would be able to fix the incorrect OCR spans using its additional understanding of word co-occurrence.

I've tested this approach with a purely text LLM to correct OCR mistakes and it works quite well.

Also note that some newer OCR pipelines that don't involve LLMs have a vision component followed by a text-correcting model, in some ways similar to spell check, which can further improve results.
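A minimal sketch of that combination (Tesseract output plus the page image handed to a vision model; the model name and prompt are placeholders):

    # Sketch: give the LLM both the rough Tesseract transcript and the image,
    # and ask it to produce a corrected transcript.
    import base64
    import pytesseract
    from PIL import Image
    from openai import OpenAI

    img_path = "page.png"
    rough = pytesseract.image_to_string(Image.open(img_path))
    b64 = base64.b64encode(open(img_path, "rb").read()).decode()

    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Here is a rough OCR transcript that may contain errors:\n"
                         f"{rough}\nUsing the image, return a corrected transcript."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    print(resp.choices[0].message.content)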

lyu07282
0 replies
1d4h

You can tell that the OCR fails more in cases without natural language, like code or random characters. OpenAI seems to claim 4o is a fully end-to-end multimodal model, but we will never know for sure; we can't trust a single word OpenAI says.

kherud
0 replies
1d4h

Shouldn't this theory be testable? The response time for an image of the same size should remain constant (assuming a generated response of constant size). You could then put an increasing amount of text inside the image. If that text is fed to the LLM via OCR, the total number of tokens grows, and you should be able to observe an increase in response time.
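A crude version of that experiment (the model name is a placeholder; latency is noisy, so you'd want many repeats per data point):

    # Fixed 512x512 images containing increasing amounts of text,
    # constant-length requested output, measure response latency.
    import base64, io, textwrap, time
    from PIL import Image, ImageDraw
    from openai import OpenAI

    client = OpenAI()

    for n_words in (10, 100, 500):
        img = Image.new("RGB", (512, 512), "white")
        wrapped = "\n".join(textwrap.wrap("lorem " * n_words, width=80))
        ImageDraw.Draw(img).text((8, 8), wrapped, fill="black")
        buf = io.BytesIO()
        img.save(buf, format="PNG")
        b64 = base64.b64encode(buf.getvalue()).decode()

        t0 = time.perf_counter()
        client.chat.completions.create(
            model="gpt-4o",
            max_tokens=1,  # hold the generated-response length constant
            messages=[{"role": "user", "content": [
                {"type": "text", "text": "Reply with OK."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}", "detail": "low"}},
            ]}],
        )
        print(n_words, "words:", round(time.perf_counter() - t0, 2), "s")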

freedmand
0 replies
1d4h

Agreed. Tesseract is not able to handle handwriting or text that is distorted well, e.g. colored text over an image background — to the point that it would hurt any downstream LLM trying to make sense of the contents. It won’t even pick out bounding boxes.

I doubt they are running an OCR model, but if they actually were it would likely be an in-house one trained with more modern techniques.

valine
8 replies
1d4h

Llava 1.6, InternVL, and CogVLM2 can all do OCR with nothing but tiled image embeddings and an LLM. Feeding in OCR results from Tesseract improves the reliability of the transcript, especially for long strings of random characters, but it's not strictly necessary for the model to read the text out of the image.

CLIP embeddings can absolutely "read" text if the text is large enough. Tiling enables the model to read small text.

Onawa
4 replies
1d3h

Do you know of any guides or tutorials for doing this? I tried using the MiniCPM model for this task, but it just OCRed a tiny bit of the information and then told me it couldn't extract the rest.

3abiton
1 replies
1d2h

I thought ComfyUI was mainly for SD. I should get into the game again.

lagniappe
0 replies
1d2h

You can build just about anything with it

pests
0 replies
23h38m

thanks been trying to remember the name of this project for weeks now

tictacttoe
0 replies
22h54m

I found llava to be disappointing, but Claude Haiku is quite good

qeternity
0 replies
20h0m

They can do it. They can not do it particularly well compared to SoTA OCR systems.

cpursley
0 replies
22h38m

How well does this work on complex data tables?

blixt
5 replies
1d4h

I went through a similar journey back when GPT-4V came out. Here's an additional puzzle for you: GPT-4V knows the exact pixel dimensions of the image (post-resize, since there is a max size for images in the pipeline, besides 512x512), but I'm 99% sure they're not provided as text tokens. How am I so sure? It's easy to get GPT to divulge everything from the system prompt to tool details, but I've tried every trick in the book and then some, multiple times over, and there is no way to get it to quote the dimensions as text. The only way to get it to give you the dimensions is to tell it to output a structure that contains width and height and just pick something reasonable, and they will "randomly" be the correct values:

https://x.com/blixt/status/1722298733470024076

dannyw
2 replies
1d4h

Perhaps images aren’t tokens at all… and 170 tokens is just an approximation of the compute cost.

qarl
0 replies
1d4h

They address this question in the article.

blixt
0 replies
1d4h

I think that would have pretty serious implications for the transformer architecture though. If they're not embedded like text tokens, how would attention, etc work? And a conversation with multiple images back and forth? Not to mention with GPT-4o now having audio support as well. I would assume it does become tokens.

llm_trw
1 replies
1d4h

> It's easy to get GPT to divulge everything from the system prompt to tool details

It's easy enough to get it to hallucinate those things. It doesn't actually tell them to you.

blixt
0 replies
1d3h

I'm well aware of that, but there are plenty of ways to induce verbatim quoting from "hidden" information, and mostly verify it (through sampling a large number of times in separate runs).

Models are improving in truly hiding or ignoring information these days though. As the author of the article states, you'll have a hard time tricking GPT-4o to read text in images as instructions, most likely thanks to this research: https://openai.com/index/the-instruction-hierarchy/

I do feel pretty confident that when the model is happily spitting out its system prompt and all the metadata around the image, but not its pixel dimensions, those dimensions were probably not provided in any system/assistant/tool message. So maybe the image embeddings also encode the pixel dimensions somehow (it would also help the model not think of the image as a squished square for non-1:1 images that have been resized to 512x512).

GaggiX
4 replies
1d4h

An important aspect not considered in the article is that GPT-4o can generate images by itself (even though the feature is not enabled for the public), meaning it is very likely trained on sequential image tokens, with the images quantized by a VQGAN. My guess is that the VQGAN takes 512x512 images and outputs 13x13 tokens (169 image tokens + a special token). The VQGAN could be a convolutional network like the one shown in the article; for a transformer-based VQGAN I cannot think of a configuration with overlapping patches where it would output 13x13 tokens on a 512x512 image (unless they just added a padding of 4 to the entire image and the patches are not overlapping).
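The patch arithmetic behind that guess, using the standard convolution/patch output-size formula (the configurations shown are pure speculation on my part):

    # Standard output-size formula for a conv / patch-embedding layer.
    # Purely speculative: checking which (kernel, stride, padding)
    # combinations map a 512x512 image to a 13x13 token grid.
    def out_size(size: int, kernel: int, stride: int, padding: int) -> int:
        return (size + 2 * padding - kernel) // stride + 1

    print(out_size(512, 32, 32, 0))  # 16 -> the usual ViT-style non-overlapping patching
    print(out_size(512, 40, 40, 4))  # 13 -> non-overlapping 40px patches with padding 4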

edude03
3 replies
1d4h

How do we know it generates the images itself and isn't passing text to DALL-E? That's supposedly how the current GPT-4 model does listen mode (with Whisper, but same idea).

hackerlight
0 replies
18h11m

Two reasons: the capabilities shown are way beyond what DALL-E is capable of, and they've been clear that this "omni" model by the "omni team" is natively multimodal.

GaggiX
0 replies
1d3h

Go to the "Explorations of capabilities" section and explore all the capabilities: https://openai.com/index/hello-gpt-4o/

You cannot get this level of control by prompting DALL-E; also, GPT-4o isn't using Whisper (older GPT-4s do).

simonw
3 replies
1d4h

The way this tests GPT-4o's performance by feeding in a 7x7 grid of colored shapes and requesting them back as JSON (about halfway down the page) is really clever.
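Roughly, the setup looks like this (my own sketch with made-up colors, shapes, and prompt, not the article's exact code):

    # Draw a 7x7 grid of colored shapes, keep the ground truth, then ask
    # the model to return the grid as JSON and compare.
    import random
    from PIL import Image, ImageDraw

    colors = ["red", "green", "blue", "yellow", "purple"]
    shapes = ["circle", "square"]
    cell = 64
    img = Image.new("RGB", (7 * cell, 7 * cell), "white")
    draw = ImageDraw.Draw(img)
    truth = []
    for row in range(7):
        for col in range(7):
            color, shape = random.choice(colors), random.choice(shapes)
            box = [col * cell + 8, row * cell + 8,
                   (col + 1) * cell - 8, (row + 1) * cell - 8]
            (draw.ellipse if shape == "circle" else draw.rectangle)(box, fill=color)
            truth.append({"row": row, "col": col, "color": color, "shape": shape})
    img.save("grid.png")
    # Send grid.png to the model, ask for JSON like
    # [{"row": 0, "col": 0, "color": "...", "shape": "..."}, ...],
    # and score the reply against `truth`.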

blixt
2 replies
1d4h

I did something similar when GPT-4V came out, partly with the goal of figuring out the input format (I did not get anywhere other than "magic vectors"), but also to roughly estimate the amount of data you can get back out of a 512x512 image (the low-quality option).

What I found is that you can sometimes get more text out of an 85-token image than you can out of 85 tokens of text! That said, I think there will be plenty of edge cases where it actually loses some information, and maybe you could argue that if you removed every other word from the text, it could still restore it.

I never went deeper on this, but I believe there's something clever to be done in the context window with the fact that images are relatively cheap tokens-wise.

_dark_matter_
1 replies
1d4h

The author mentions this in the article, that more than 170 tokens of text can be pulled from an image.

blixt
0 replies
1d4h

Ah, you're right! My bad!

rafaelero
3 replies
1d3h

They are very likely using a VQ-VAE to create a dictionary of tokens and then just converting images into them with an encoder.
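The quantization step itself is tiny; a generic sketch (illustrative sizes, nothing here is disclosed about GPT-4o):

    # Generic VQ-VAE-style quantization step: an encoder maps the image to a
    # grid of vectors, and each vector is replaced by the index of its nearest
    # codebook entry. Sizes are made up for illustration.
    import torch

    codebook = torch.randn(8192, 256)   # K codebook entries of dimension D (learned in practice)
    z = torch.randn(1, 13, 13, 256)     # encoder output: a 13x13 grid of D-dim vectors

    flat = z.reshape(-1, 256)                      # (169, D)
    dists = torch.cdist(flat, codebook)            # (169, K) pairwise distances
    tokens = dists.argmin(dim=1).reshape(13, 13)   # 169 discrete image tokens
    print(tokens.shape)                            # torch.Size([13, 13])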

HarHarVeryFunny
1 replies
22h32m

Wouldn't that be more applicable to image generation, or at least wanting to encode the image as a whole?

If you need to be able to reason about multiple objects in the image and their relative positions, then don't you need to use a tiled approach?

rafaelero
0 replies
22h26m

VQVAE is trained to reconstruct the image, so in theory it should contain all the information (both content and location) inside its embeddings.

lisperforlife
0 replies
1d3h

Why is this not the top comment? FAIR published their CM3leon paper about decoder-only autoregressive models that work with both text and image tokens. I believe GPT-4o's vocabulary has room for both image and audio tokens. For audio tokens, they probably trained an RVQ-VAE model like EnCodec or SoundStream.

iknownothow
3 replies
1d4h

I'm probably wrong, but the author may have misunderstood input embeddings. Input embeddings are just dictionary lookup tables: the tokenizer generates tokens, and for each token you look up its embedding.

The author is speculating about an embedding model, but in reality they're speculating about the image tokenizer.

If I'm not wrong, the text tokenizer Tiktoken has a dictionary size of 50k. The image tokenizer could have a very large dictionary or a very small one. The 170 tokens this image tokenizer generates might actually contain repeating tokens!

EDIT: PS. What I meant to say was that input embeddings do not come from another trained model; tokens come from other trained models. The input embedding matrix undergoes backpropagation (learning). This is very important: it allows the model to move the embeddings of tokens together or apart as it sees fit. If you use embeddings from another model as input embeddings, you're basically adding noise.
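In code, the distinction I mean is roughly this (illustrative sizes only):

    # Illustrative only: a tokenizer maps an image (or text) to integer token ids;
    # the input-embedding matrix is a plain lookup table that is itself trained by
    # backprop along with the rest of the model.
    import torch
    import torch.nn as nn

    vocab_size, d_model = 8192, 4096              # made-up sizes
    embed = nn.Embedding(vocab_size, d_model)     # learned lookup table

    image_token_ids = torch.randint(0, vocab_size, (170,))  # output of an image tokenizer
    x = embed(image_token_ids)                    # (170, d_model): what the transformer sees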

kolinko
1 replies
1d1h

Input embeddings are taken from a dictionary in case of text tokens, but they don’t need to be - they can be any vector really.

iknownothow
0 replies
1d

But don't input embeddings need to undergo backprop during training? Won't the external-model's embeddings just be noise since they don't share embedding space with the model that is being trained?

If the external-model also undergoes training along with the model then I think that might work.

iknownothow
0 replies
1d1h

I've pondered it a bit more, and I was the one who was mistaken. I think the author made great observations. It's just that I don't want to go back to non-token thinking. I don't want there to be a 13x13xE final output from the CNN. I really want there to be a visual vocabulary from which tokens are chosen, and I want that visual vocabulary to be fixed/untrainable/deterministic. That'd be very cool.

But why choose only 13x13 + 1? :(

I'm willing to bet that the author's conclusion of embeddings coming from CNNs is wrong. However, I cannot get the 13x13 + 1 observation out of my head; he's definitely hit on something there. I'm with them that there is very likely a CNN involved, and I'm going to bet that the final filters and kernels are the visual vocabulary.

And how do you go from 50k convolutional kernels (think tokens) to always 170 chosen tokens for any image? I don't know...

cs702
3 replies
1d3h

One possibility is that mapping images to a token embedding consumes ~170x more compute+space than mapping a token id.

Another possibility is that OpenAI is mapping each image to ~170 vectors in an embedding space that is shared with token IDs. If that's the case, the architecture of the image-to-fixed-number-of-tokens model has not been disclosed. It could be a standard CNN, a ViT-like model, an autoencoder, a model that routes a variable number of vectors with RGB data to a fixed number of vectors, or something else that has not yet been published. The whole thing is likely trained end-to-end.

CuriouslyC
1 replies
1d

At some point we're going to go from tokens to embeddings for everything. I saw some research on variable length embeddings, I wouldn't be surprised if someone generated a huge embedding space, did some form of PCA on generated embeddings, threw away low eigenvalue vectors, then trained a distilled model that generated variable length embeddings directly from that.

cs702
0 replies
23h32m

> At some point we're going to go from tokens to embeddings for everything.

Yes, I agree.

Further down the road, I imagine we will end up finding interesting connections to the symbolic approaches of GOFAI, given that the embedding of a token, object, concept, or other entity in some vector space is basically a kind of symbol that represents that token, object, concept, or entity in that vector space.

Interestingly, old terms like "representation" and "capsule," which didn't become as widely adopted as "embedding," tried more explicitly to convey this idea of using vectors/matrices of feature activations to stand in for objects, concepts, and other entities.

For example, see Figure 1 in this paper from 2009-2012: http://www.cs.princeton.edu/courses/archive/spring13/cos598C... -- it's basically what we're talking about!

eminence32
2 replies
1d4h

I'm assuming that the tokens used to encode an image are entirely distinct from the tokens used to encode text. Does anyone know if this is actually the case?

tempusalaria
0 replies
1d4h

It’s probable that there is a separate vision encoder which projects the image tiles into the distribution space of the text tokenizer a la CLIP/LLava

blixt
0 replies
1d4h

I would assume it has a "mode" token where it switches between text/image (and now audio), or you'd have to try to maximize the number of reserved tokens between multiple modes. GPT-4o did go from 100K to 200K vocabulary, but as far as I understand all of that vocabulary is in use for text (reducing the token cost for non-English).

tantalor
1 replies
1d4h

> CLIP embeds the entire image as a single vector, not 170 of them.

Single token?

> GPT-4o must be using a different, more advanced strategy internally

Why

freediver
0 replies
1d4h

The embeddings do not offer the level of fidelity needed to recognize fine details in an image, handwriting for example.

yorwba
0 replies
1d3h

It would be interesting to see what happens when you slightly shift the grid of objects until they're split across multiple tiles, and how that affects accuracy.

sva_
0 replies
1d3h

Great article. Perhaps some part of this magic number simply factors in the amount of compute necessary to run the image through the CNN (proportional to compute use per token in the LM).

sashank_1509
0 replies
22h56m

I would be disappointed if OpenAI had a separate model for OCR, though I guess that is believable. Much cooler if the LLM just reads the text straight out of the image.

riemannzeta
0 replies
1d4h

Love this curious and open-minded exploration of how this stuff works.

The pyramid strategy loosely tracks with renormalization group theory, which has been formally studied for years as a method of interpreting machine learning models:

https://arxiv.org/abs/1410.3831

I love the convergence we're seeing in the use of models from different fields to understand machine learning, fundamental physics, and human consciousness. What a time to be alive.

joelburget
0 replies
1d4h

Vision transformers should be our default guess as to how GPT-4o works, yet this article never mentions them.

jamesy0ung
0 replies
19h43m

I've always wondered how text-to-image models like Stable Diffusion work. Do they just encode RGB values into a matrix and then have a helper tool convert that data into a JPG?

imranhou
0 replies
1d

Not to be nit-picky, but double-checking myself: isn't a token just 0.75 words, so 170 tokens would be about 127 words and not 227?

enjoylife
0 replies
23h45m

> Interestingly enough, it’s actually more efficient to send text as images: A 512x512 image with a small but readable font can easily fit 400-500 tokens worth of text, yet you’re only charged for 170 input tokens plus the 85 for the ‘master thumbnail’ for a grand total of 255 tokens—far less than the number of words on the image.

Sounds like an arbitrage opportunity for all those gpt wrappers. Price your cost per token the same, send over the prompt via image, pocket the difference?
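A quick way to check the arithmetic on your own text (assuming tiktoken's o200k_base is the right encoding for GPT-4o, and taking 170 + 85 = 255 as the image-side cost):

    # Compare the text-token cost of a passage against the flat image-token cost.
    # Assumes o200k_base is the correct encoding for GPT-4o; passage.txt is a
    # placeholder for text you could render into one 512x512 image.
    import tiktoken

    enc = tiktoken.get_encoding("o200k_base")
    passage = open("passage.txt").read()

    as_text_tokens = len(enc.encode(passage))
    as_image_tokens = 170 + 85   # one high-res tile + the low-res master thumbnail
    print(as_text_tokens, "as text vs", as_image_tokens, "as an image")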

comboy
0 replies
22h52m

I love how well this is written. Definitely "look how interesting this is" rather than "look how much I know". And it dives as deep as it needs to, while remaining accessible to almost everyone. One really needs to master a topic to be able to describe it simply. Great job.

alach11
0 replies
1d4h

I really hope we see improvements to the resolutions large multimodal models can handle. Right now this patchwork approach leads to lots of unwieldy workarounds in applications.

SubiculumCode
0 replies
22h59m

I'm not sure how GPT-4o routes information. If a picture containing text is submitted, does the text get resubmitted to GPT-4o as a textual query, or do the model weights themselves essentially transform the images of text into text tokens? I wonder whether a response to images of text is similar to a response to text queries, i.e. processed by the same weights.

HarHarVeryFunny
0 replies
1d2h

I don't think a 13x13 tiling (of N channels/features) can be ruled out just because it can't recognize a grid of 13x13 objects. There is presumably a lot of overlap between the receptive fields of the tiles (due to kernel step sizes).

A pyramid of overlapped tiling resolutions is of course possible too.