Entertaining, but I think the conclusion is way off.
their vision is, at best, like that of a person with myopia seeing fine details as blurry
is a crazy thing to write in an abstract. Did they try to probe that hypothesis at all? I could (well actually I can't) share some examples from my job of GPT-4v doing some pretty difficult fine-grained visual tasks that invalidate this.
Personally, I rate this paper [1], which makes the argument that these huge GenAI models are pretty good at things - assuming they have seen a LOT of that type of data during training (which is true of a great many things). If you make up tasks like this, then yes, they can be REALLY bad at them, and initial impressions of AGI get harder to justify. But in practice, we aren't just making up tasks to trip up these models. They can be very performant on some tasks, and the authors have not presented any real evidence about these two modes.
There are quite a few "ai apologists" in the comments but I think the title is fair when these models are marketed towards low vision people ("Be my eyes" https://www.youtube.com/watch?v=Zq710AKC1gg) as the equivalent to human vision. These models are implied to be human level equivalents when they are not.
This paper demonstrates that there are still some major gaps where simple problems confound the models in unexpected ways. This is important work to elevate; otherwise people may start to believe that these models are suitable for general application when they still need safeguards and copious warnings.
I disagree. I think the title, abstract, and conclusion not only misrepresent the state of the models, they misrepresent their own findings.
They have identified a class of problems that the models perform poorly at and have given a good description of the failure. They portray this as a representative example of the behaviour in general. This has not been shown and is probably not true.
I don't think that models have been portrayed as equivalent to humans. Like most AI, they have been shown as vastly superior in some areas and profoundly ignorant in others. Media can overblow things and enthusiasts can talk about future advances as if they have already arrived, but I don't think these are typical portrayals by the AI field in general.
Exactly... I've found GPT-4o to be good at OCR for instance... doesn't seem "blind" to me.
You don't really need a LLM for OCR. Hell, I suppose they just run a python script in its VM and rephrase the output.
At least that's what I would do. Perhaps the script would be a "specialist model" in a sense.
It's not that you need an LLM for OCR but the fact that an LLM can do OCR (and handwriting recognition which is much harder) despite not being made specifically for that purpose is indicative of something. The jump from knowing "this is a picture of a paper with writing on it" like what you get with CLIP to being able to reproduce what's on the paper is, to me, close enough to seeing that the difference isn't meaningful anymore.
GPT-4v is provided with OCR
That's a common misconception.
Sometimes if you upload an image to ChatGPT and ask for OCR it will run Python code that executes Tesseract, but that's effectively a bug: GPT-4 vision works much better than that, and it will use GPT-4 vision if you tell it "don't use Python" or similar.
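For reference, that fallback is just classic OCR with no vision model involved. A minimal sketch of roughly what it amounts to, assuming pytesseract and the Tesseract binary are installed (page.png is a placeholder):

    # Roughly what the Code Interpreter fallback amounts to: plain Tesseract OCR,
    # no GPT-4 vision involved. pytesseract shells out to the Tesseract binary.
    from PIL import Image
    import pytesseract

    text = pytesseract.image_to_string(Image.open("page.png"))
    print(text)

GPT-4 vision answering directly handles handwriting, skewed photos, etc. far better than this path does.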
No reason to believe that. Open source VLMs can do OCR.[1]
[1] https://huggingface.co/spaces/opencompass/open_vlm_leaderboa...
I know that GPT-4o is fairly poor at recognizing sheet music and notes. Totally off the mark more often than not; even the first note in a first-week solfège book is not recognized.
So unless I missed something, as far as I am concerned they are optimized for benchmarks.
So while I enjoy gen AI, image-to-text is highly subpar.
Useful to know, thank you!
Most adults with 20/20 vision will also fail to recognize the first note on a first week solfege book.
Well maybe not blind but the analogy with myopia might stand.
For example, in the case of OCR, a person with myopia will usually be able to make out letters and words even without his glasses, based on his expectation (similar to VLM training) of seeing letters and words in, say, a sign. He might not see them all clearly and make some errors, but he might recognize some letters easily and fill in the rest based on context, word recognition, etc. Basically experience.
I also have a funny anecdote about my partner, who has severe myopia, and who once found herself outside her house without her glasses on and saw something on the grass right in front of her. She told her then brother-in-law, "look, a squirrel", only for the "squirrel" to take off while letting out its typical caws. It was a crow. This is typical of VLM hallucinations.
I think the conclusion of the paper is far more mundane. It's curious that VLM can recognize complex novel objects in a trained category, but cannot perform basic visual tasks that human toddlers can perform (e.g. recognizing when two lines intersect or when two circles overlap). Nevertheless I'm sure these models can explain in great detail what intersecting lines are, and even what they look like. So while LLMs might have image processing capabilities, they clearly do not see the way humans see. That, I think, would be a more apt title for their abstract.
Yea, really, if you look at human learning/seeing/acting, there is a feedback loop that an LLM, for example, isn't able to complete and train on.
You see an object. First you have to learn how to control all your body functions to move toward it and grasp it. This teaches you about the 3-dimensional world and things like gravity. You may not know the terms, but it is baked into your learning model. After you get an object you start building a classification list: "hot", "sharp", "soft and fuzzy", "tasty", "slick". Your learning model builds up a list of properties of objects and "expected" properties of objects.
Once you have this 'database' you create as a human, you can apply the logic to achieve tasks. "Walk 10 feet forward, but avoid the sharp glass just to the left". You have to have spatial awareness, object awareness, and prediction ability.
Models 'kind of' have this, but it's seemingly haphazard, kind of like a child that doesn't know how to put all the pieces together yet. I think a lot of embodied robot testing, where the embodied model feeds back training to the LLM/vision model, will have to occur before this is even somewhat close to reliable.
Embodiment is useful, but I think not necessary, even if you need learning in a 3D environment. Synthesized embodiment should be enough. While in some cases[0] it may have problems with fidelity, simulating embodied experience in silico scales much better, and more importantly, we have control over the flow of time. Humans always learn in real time, while with simulated embodiment we could cram years of subjective-time experience into a model in seconds, and then, for novel scenarios, spend an hour per second of subjective time running a high-fidelity physics simulation[1].
--
[0] - Like if you plugged a 3D game engine into the training loop.
[1] - Results of which we could hopefully reuse in training later. And yes, a simulation could itself be a recording of carefully executed experiment in real world.
In the sense that we can't fast-forward our offline training, sure, but humans certainly "go away and think about it" after gaining IRL experience. This process seems to involve both consciously and subconsciously training on this data. People often consciously think about recent experiences, run through imagined scenarios to simulate the outcomes, plan approaches for next time etc. and even if they don't, they'll often perform better at a task after a break than they did at the start of the break. If this process of replaying experiences and simulating variants of them isn't "controlling the flow of (simulated) time" I don't know what else you'd call it.
Isn't this what synthesized embodiment basically always is? As long as the application of the resulting technology is in a restricted, well controlled environment, as is the case for example for an assembly-line robot, this is a great strategy. But I expect fidelity problems will make this technique ultimately a bad idea for anything that's supposed to interact with humans. Like self-driving cars, for example. Unless, again, those self-driving cars are segregated in dedicated lanes.
The paper I linked should hopefully mark me out as far from an AI apologist, it's actually really bad news for GenAI if correct. All I mean to say is the clickbait conclusion and the evidence do not match up.
We have started the era of AI.
It really doesn't matter how good current LLMs are.
They have been good enough to start this era.
And no, it's not and never has been just LLMs. Look at what Nvidia is doing with ML.
Whisper: a huge advance. Segment Anything: again huge. AlphaFold 2: again huge.
All the robot announcements -> huge
I doubt we will reach AGI just through LLMs. We will reach AGI through multimodality, mixture of experts, some kind of feedback loop, etc.
But the stone started to roll.
And you know, I prefer to hear about AI advances for the next 10-30 years. That's a lot better than the crypto shit we had for the last 5 years.
We won't reach agi in our lifetimes.
If we're throwing "citation needed" tags on stuff, how about the first sentence?
"Large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini-1.5 Pro are powering countless image-text processing applications"
I don't know how many a "countless" is, but I think we've gotten really sloppy in terms of what counts for LLMs as a demonstrated, durable win in a concrete task attached to well-measured outcomes and holding up over even modest periods of time.
This stuff is really promising and lots of builders are making lots of nifty things, so if that counts as an application then maybe we're at countless, but in the enterprise, in government, and in refereed academic literature we seem to be at the proof-of-concept phase. Impressive chat bots as a use case are pretty dialed in, and enough people claim that they help with coding that I tend to believe it's a real thing (though I never seem to come out ahead of going directly to the source, StackOverflow).
The amount of breathless press on this seems "countless", so maybe I missed the totally rigorous case study on how X company became Y percent more profitable by doing Z thing with LLMs (or similar), and if so I'd be grateful for citations, but neither Google nor any of the big models seem to know about it.
"maybe I missed the totally rigorous case study on how X company became Y percent more profitable by doing Z thing with LLMs (or similar), and if so I'd be grateful for citations, but neither Google nor any of the big models seem to know about it."
Goldman Sachs recently issued a report.
https://www.goldmansachs.com/intelligence/pages/gs-research/...
"We estimate that the AI infrastructure buildout will cost over $1tn in the next several years alone, which includes spending on data centers, utilities, and applications. So, the crucial question is: What $1tn problem will AI solve? Replacing low- wage jobs with tremendously costly technology is basically the polar opposite of the prior technology transitions I’ve witnessed in my thirty years of closely following the tech industry"
Ah yes the blind person who constantly needs to know if two lines intersect.
Let's just ignore what a blind person normally needs to know.
You know what blind people ask? Sometimes their daily routine is broken because there is some type of construction, and models can tell you this.
Sometimes they need to read a basic sign and models can do this.
Those models help people already and they will continue to get better.
I'm not sure if I'm more frustrated by how condescending the authors are or by your ignorance.
Valid criticism doesn't need to be shitty
As an aside... this, from 2016, is what a valid use case for a blind person with an app looked like.
Seeing AI 2016 Prototype - A Microsoft research project - https://youtu.be/R2mC-NUAmMk
https://www.seeingai.com are the actual working apps.
I recall showing (pun not intended) the 2016 version to a coworker who had some significant vision impairments, and he was really excited about what it could do back then.
---
I still remain quite impressed with its ability to parse the picture and the likely reason behind it: https://imgur.com/a/JZBTk2t
Be My Eyes user here. I disagree with your uninformed opinion. Be My Eyes is more often than not more useful than a human. And I am reporting from personal experience. What experience do you have?
Simple is a relative statement. There are vision problems where monkeys are far better than humans. Some may look at human vision and memory and think that we lack basic skills.
With AI we are creating intelligence, but with different strengths and weaknesses. I think we will continue to be surprised at how well they work on some problems and how poorly they do at some “simple” ones.
I don’t see Be My Eyes or other similar efforts as “implied” to be equivalent to humans at all. They’re just new tools which can be very useful for some people.
“These new tools aren’t perfect” is the dog bites man story of technology. It’s certainly true, but it’s no different than GPS (“family drives car off cliff because GPS said to”).
I think this is a communication issue and you're being a bit myopic in your interpretation. It is clearly an analogy meant for communication and is not an actual hypothesis. Sure, they could have used a better analogy and they could have done other tests, but the paper still counters quite common claims (from researchers) about VLMs.
I find it hard to believe that there is no example you can give. It surely doesn't have to be exactly your training data. If it is this good, surely you can create an example no problem. If you just don't want to, that's okay, but then don't say it.
But I have further questions. Do you have complicated prompting? Or any prompt engineering? It sure does matter how robust these models are to prompting. There's a huge difference between a model being able to accomplish a task and a model being able to perform a task in a non-very-specific environment. This is no different than something working in a tech demo and not in the hand of the user.
I see this sentiment quite often and it is baffling to me.
First off, these tasks are not clearly designed to trick these models. A model failing at a task does not suddenly make that task "designed to trick a model." It's common with the river crossing puzzles, where they're rewritten to be like "all animals can fit in the boat." If that is "designed to trick a model", then the model must be a stochastic parrot and not a generalist. It is very important that we test things we do know the answer to because, unfortunately, we're not clairvoyant and can't test questions we don't know the answer to. Which is the common case in real world usage.
Second, so what if a test was designed to trip up a model? Shouldn't we be determining when and where models fail? Is that not a critical question in understanding how to use them properly? This seems doubly important if they are tasks that humans have no trouble with.
I don't think people are claiming that large models can't be performant on some tasks. If they are, they're rejecting trivially verifiable reality. But not every criticism has to also contain positive points. There are plenty of papers and a lot of hype already doing that. And if we're going to be critical of anything, shouldn't it be that the companies creating these models -- selling them, and even charging researchers to perform the types of experiments that can be and are used to improve their products -- should be much more clear about the limitations of their models? If we need balance, then I think there's bigger fish to fry than Auburn and Alberta Universities.
People are rushing to build this AI into all kinds of products, and they actively don’t want to know where the problems are.
The real world outside is designed to trip up the model. Strange things happen all the time.
Because software developers have no governing body, no oaths of ethics and no spine someone will end up dead in a ditch from malfunctioning AI.
Counterpoint: the real world is heavily sanitized towards things that don't trip human visual perception up too much, or otherwise inconvenience us. ML models are trained on that, and for that. They're not trained to deal with synthetic images that couldn't possibly exist in reality and are designed to trip visual processing algorithms up.
Also:
Glass half-full (of gasoline) take: those products will trip over real-world problems, identifying them in the process, and the models will get better walking over the corpses of failed AI-get-rich-quick companies. The people involved may not want to know where the problems are, but by deploying the models, they'll reveal those problems to all.
That, unfortunately, I 100% agree with. Though AI isn't special here - not giving a fuck kills people regardless of the complexity of software involved.
Neither of these claims is true. ML models are heavily trained on synthetic images. In fact, synthetic data generation is the way forward for the "scale is all you need" people. And there are also loads of synthetic images out in the wild, everything from line art to abstract nonsense. Just take a walk downtown near the bars.
What has me the most frustrated is that this "move fast break things and don't bother cleaning up" attitude is not only common in industry but also in academia. But these two are incredibly intertwined these days and it's hard to publish without support from industry because people only evaluate on benchmarks. And if you're going to hack your benchmarks, you just throw a shit ton of compute at it. Who cares where the metrics fail?
> I think this is a communication issue and you're being a bit myopic in your interpretation. It is clearly an analogy meant for communication and is not an actual hypothesis.
I don't know, words have meanings. If that's a communication issue, it's on the part of the authors. To me, this wording in what is supposed to be a research paper abstract clearly suggests insufficient resolution as the cause. How else should I interpret it?
> The shockingly poor performance of four state-of-the-art VLMs suggests their vision is, at best, like that of a person with myopia seeing fine details as blurry
And indeed, increasing the resolution is expensive, and the best VLMs have something like 1000x1000. But the low resolution is clearly not the issue here, and the authors don't actually talk about it in the paper.
>I find it hard to believe that there is no example you can give.
I'm not the person you're replying to, but I actually lazily tried two of the authors' examples in a less performant VLM (CogVLM), and was surprised it passed them, making me wonder whether I can trust their conclusions until I reproduce their results. LLMs and VLMs have all kinds of weird failure modes; it's not a secret that they fail at some trivial tasks and that their behavior is still not well understood. But working with these models and narrowing it down is notoriously like trying to nail jelly to the wall. If I was able to do this in a cursory check, what else is there? More than one research paper in this area is wrong from the start.
That's quite true. Words mean exactly what people agree upon them meaning. Which does not require everyone, or else slang wouldn't exist. Nor the dictionary, which significantly lags. Regardless, I do not think this is even an unusual use of the word, though I agree the mention of myopia is. The usage makes sense if you consider that both myopic and resolution have more than a singular meaning.
I agree that there are far better ways to communicate. But my main gripe is that they said it was "their hypothesis." Reading the abstract as a whole, I find it an odd conclusion to come to. It doesn't pair with the words that follow about blind guessing (and I am not trying to defend the abstract. It is a bad abstract). But if you read the intro and look at the context of their landing page, I find it quite difficult to come to this conclusion. It is poorly written, but it is still not hard to decode the key concepts the authors are trying to convey.

I feel the need to reiterate that language has 3 key aspects to it: the concept attempted to be conveyed, the words that concept is lossily encoded into, and the lossy decoding by the person interpreting it. Communication doesn't work by you reading/listening to words and looking up those words in a dictionary. Communication is a problem where you use words (context/body language/symbols/etc) to decrease the noise and get the receiver to reasonably decode your intended message. And unfortunately we're in a global world, and many different factors, such as culture, greatly affect how one encodes and/or decodes language. It only becomes more important to recognize the fuzziness around language here. Being more strict and leaning into the database view of language only leads to more errors.
Because they didn't claim that image size and sharpness was an issue. They claimed the VLM cannot resolve the images "as if" they were blurry. Determining what the VLM actually "sees" is quite challenging. And I'll mention that arguably they did test some factors that relate to blurriness. Which is why I'm willing to overlook the poor analogy.
I'm not. Depending on the examples you pulled, 2 random ones passing isn't unlikely given the results.
Something I generally do not like about these types of papers is that they often do not consider augmentations, since these models tend to be quite sensitive to both the text (prompt) inputs and the image inputs. This is quite common with generators in general. Even the way you load in and scale an image can make a significant performance difference; I've seen simple things like loading an image via numpy, PIL, tensorflow, or torch produce different results (see the sketch below). But I have to hand it to these authors: they looked at some of this. In the appendix they go through confusion matrices and look at the factors that determine misses. They could have gone deeper and tried other things, but it is a more than reasonable amount of work for a paper.
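To illustrate the preprocessing sensitivity I mean, here is a minimal standalone sketch (my own, not from the paper): two "bilinear" downscales of the same array, one through PIL and one through torch, do not produce the same pixels, because PIL applies an antialiasing filter when downscaling and torch's default bilinear interpolation does not.

    # Minimal sketch: the same image, resized "bilinearly" two common ways,
    # yields different pixel values. arr is a placeholder random RGB image.
    import numpy as np
    from PIL import Image
    import torch
    import torch.nn.functional as F

    rng = np.random.default_rng(0)
    arr = rng.integers(0, 256, size=(512, 512, 3), dtype=np.uint8)

    # Path 1: PIL bilinear resize (applies antialiasing when downscaling).
    pil_resized = np.asarray(
        Image.fromarray(arr).resize((224, 224), Image.BILINEAR)
    ).astype(np.float32)

    # Path 2: torch bilinear interpolation (no antialiasing by default).
    t = torch.from_numpy(arr).permute(2, 0, 1).unsqueeze(0).float()
    torch_resized = (
        F.interpolate(t, size=(224, 224), mode="bilinear", align_corners=False)
        .squeeze(0).permute(1, 2, 0).numpy()
    )

    # The two "equivalent" pipelines disagree noticeably.
    print("max abs difference:", np.abs(pil_resized - torch_resized).max())

If a VLM's answers flip depending on which of these pipelines fed it, that is exactly the kind of fragility an augmentation study would surface.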
It's not that far from reality; most models see images at very low resolution / with limited colors, so it's not so far from this description.
They didn't test that claim at all though. Vision isn't some sort of 1D sliding scale with every vision condition lying along one axis.
First of all myopia isn't 'seeing fine details as blurry' - it's nearsightedness - and whatever else this post tested it definitely didn't test depth perception.
And second - inability to see fine details is a distinct/different thing from not being able to count intersections and the other things tested here. That hypothesis, if valid, would imply that improving the resolution of the image that the model can process would improve its performance on these tasks even if reasoning abilities were the same. That - does not make sense. Plenty of the details in these images that these models are tripping up on are perfectly distinguishable at low resolutions. Counting rows and columns of blank grids is not going to improve with more resolution.
I mean, I'd argue that the phrasing of the hypothesis ("At best, like that of a person with myopia") doesn't make sense at all. I don't think a person with myopia would have any trouble with these tasks if you zoomed into the relevant area, or held the image close. I have a very strong feeling that these models would continue to suffer on these tasks if you zoomed in. Nearsighted != unable to count squares.
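That feeling is cheap to check. A rough sketch of such a zoom test, assuming the OpenAI Python client (openai>=1.0) with OPENAI_API_KEY set, and using image.png as a placeholder for one of the paper's two-line figures:

    # Rough "zoom in and re-ask" test, not from the paper: crop the centre,
    # upscale it (the equivalent of holding the image closer), compare answers.
    import base64
    from io import BytesIO
    from PIL import Image
    from openai import OpenAI

    client = OpenAI()

    def ask(img, question):
        buf = BytesIO()
        img.save(buf, format="PNG")
        b64 = base64.b64encode(buf.getvalue()).decode()
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                ],
            }],
        )
        return resp.choices[0].message.content

    original = Image.open("image.png")
    w, h = original.size
    zoomed = original.crop((w // 4, h // 4, 3 * w // 4, 3 * h // 4)).resize((w, h))

    question = "How many times do the two lines in this image intersect?"
    print("full image:", ask(original, question))
    print("zoomed crop:", ask(zoomed, question))

If the answer doesn't improve on the zoomed crop, the myopia framing doesn't hold up.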
It seems to me they've brought up myopia only to make it more approachable to people how blurry something is, implying they believe models work with a blurry image just like a nearsighted person sees blurry images at a distance.
While myopia is common, it's not the best choice of analogy and "blurry vision" is probably clear enough.
Still, I'd only see it as a bad choice of analogy — I can't imagine anyone mistaking optical focus problems for static image processing problems — so in the usual HN recommendation, I'd treat their example in the most favourable sense.
My thoughts as well. I too would have trouble with the overlapping lines tests if all the images underwent convolution.
I like the idea that these models are so good at some sort of specific and secret bit of visual processing that things like “counting shapes” and “beating a coin toss for accuracy” shouldn’t be considered when evaluating them.
Those don't really have anything to do with fine detail/nearsightedness. What they measured is valid/interesting - what they concluded is unrelated.
LLMs are bad at counting things just in general. It’s hard to say whether the failures here are vision based or just an inherent weakness of the language model.
I think gpt4o is probably doing some OCR as preprocessing. It's not really controversial to say the VLMs today don't pick up fine-grained details; we all know this. You can just look at the output of a VAE to know this is true.
If so, it's better than any other ocr on the market.
I think they just train it on a bunch of text.
Maybe counting squares in a grid was just not considered important enough to train for.
Why do you think that's probable? The much smaller llava that I can run on my consumer GPU can also do "OCR", yet I don't believe anyone has hidden an OCR engine inside llama.cpp.
There's definitely something interesting to be learned from the examples here - it's valuable work in that sense - but "VLMs are blind" isn't it. That's just clickbait.
Yeah I think their findings are def interesting but the title and the strong claims are a tad hyperbolic.
Is this the sales pitch though? Because 15 years ago, I had a scanner with an app that could scan a text document and produce the text on Windows. The machine had something like 256MB of RAM.
Tech can be extremely good at niches in isolation. You can have an OCR system 10 years ago and it'll be extremely reliable at the single task it's configured to do.
AI is supposed to bring a new paradigm, where the tech is not limited to the specific niche the developers have scoped it to. However, if it reliably fails to detect simple things a regular person should not get wrong, then the whole value proposition is kicked out of the window.
Entertaining is indeed the right word. Nice job identifying corner cases of models' visual processing; curiously, they're not far conceptually from some optical illusions that reliably trip humans up. But to call the models "blind" or imply their low performance in general? That's trivially invalidated by just taking your phone out and feeding a photo to ChatGPT app.
Like, seriously. One poster below whines about "AI apologists" and BeMyEyes, but again, it's all trivially testable with your phone and $20/month subscription. It works spectacularly well on real world tasks. Not perfectly, sure, but good enough to be useful in practice and better than alternatives (which often don't exist).