For anyone considering reading the paper who, like me, doesn't normally read papers like this, and who opens the PDF and thinks they don't have time to read it due to its length: the main part of the paper is the first 10 pages and a fairly quick read.
On to the topic here. This is an interesting example that they are using. It is fairly simple to understand as a human (even if we may be inclined to quickly jump to the wrong conclusion without thinking for a few seconds).
The thing that really bothers me is that I just don't know if we can realistically fix this given the current state of what these tools actually are. They are not reasoning or thinking in any sense of the word, and yet a lot of people are already considering them general purpose AI. It doesn't help that in many situations they can fake it enough that it appears to be reasoning, but it's not.
What is the chance that this paper actually has any impact on the AI rollout and overhype, or will it just be buried and never talked about again until the next time we see how dangerous these tools are, like with Google's search rollout?
For reasoning of any complexity, some sort of internal monologue and iteration is needed. For type-2 thinking, we conceptualize possible solutions, arguments, and reasoning paths in our mind, and judge and evaluate them before deciding which one we think is adequate or correct. This can be emulated to some extent by guiding prompts that make that internal thought process external, but we really want it to be internal and automatic.
We don’t know how to do that yet, because what controls the internal thought process is itself not necessarily language-based, and also, since internal thought processes of biological brains are not directly observable, they can’t be used as training data.
Edit: It occurs to me now that there is some parallel between current LLMs and behaviorism [0], and we really need something to which cognitive psychology could be applied instead.
[0] https://en.wikipedia.org/wiki/Behaviorism
Given many people don’t have an inner monologue and function just fine, it’s more likely inner monologue is a product of the reasoning process and not its mechanism.
The existence of an "inner monologue" isn't really a falsifiable claim. Some people claim to have one while other people claim not to, but we can't test the truth of these claims.
In this particular case, is there any reason why we simply can't take their word for it? This is not a case where, if I say "weak" or "strong", most people pick "strong" because no one wants to be weak, even if the context is unknown (the nuclear forces, for example).
because if we give them a problem to solve in their head and just give us the answer, they will. By problem I mean planning a trip, a meal, how to pay the mortgage, etc. It's impossible to plan without an internal monologue. Even if some people claim theirs is 'in images'.
'It's impossible to plan without an internal monologue.' - Sorry, but I disagree with this. I have no 'internal voice' or monologue - whenever I see a problem, my brain actually and fully models it using images. I believe 25% of the population doesn't have the internal monologue you're referring to, and this has been tested and confirmed. I highly recommend listening to this Lex Fridman podcast episode to get a full grasp of the complexities of modelling language and general modelling present in the human brain: https://www.youtube.com/watch?v=F3Jd9GI6XqE
Can you draw a picture of an example of what you see when you think about something?
Sure - with the Alice example, when I saw the problem I came up with a really simple visual example of Alice having 3 sisters and 4 brothers. When I visualized it I saw Alice standing next to 3 other women (her sisters) and her 4 brothers standing close by. When I imagined asking her brother how many sisters he has, I could see that Alice was standing there next to 3 other women and thus came up with the answer of 4. Does this make sense?
This could account for why some people are much better at say geometry than algebra.
I'm the opposite. I rarely visualize things I read, be it geometry or poetry. I can read a detailed description of a person or an item in a novel, and I don't really "see" anything.
But I have an active inner monologue. I "hear" myself saying the words when reading or writing, I "talk" to myself when solving problems or just thinking about stuff.
Especially when programming I'm reasoning by discussing the problem with myself, the only difference being that I usually don't open my mouth and vocalize the discussion. Though sometimes when I'm alone I do just that.
Sure, I do mention thinking in images in my original comment and count it as some type of internal monologue. I personally do not believe it's all images, as that would preclude using highly abstract concepts. But I might be wrong, and it might be 100% images. That being said, it does count as an internal monologue.
Of course it isn't impossible, and this is backed by what we know about paleoanthropology and other instances of cognition in animals - humans were making stone tools millions of years ago, which takes planning in the form of imagining what you want the tool to look like, how you will make it, and what it will be used for. It's exceedingly likely we had this ability long before complex speech evolved. Apes also use and make tools, which requires planning, and I don't think they have an internal monologue going on. Birds from the corvid family can do some pretty advanced problem solving that requires planning. Cetaceans might be an exception, because they appear to have some form of language, but that is a pretty wild claim not really backed by any kind of science as we understand it today.
Animals cannot manipulate abstract concepts, nor can they make long-term plans. No crow can plan an international trip spanning a couple of weeks and two change-overs. And some people definitely can't do it start to end, but they can at least plan the first 5-7 steps.
Also, maybe inner monologue is not a binary have/have not, but maybe it is on a continuum.
Not sure. Migratory birds seem to manage this just fine. Not only do they make multiple stops to eat and rest, they also navigate around bad weather and still make it to their intended destination (at least most of the time).
Instincts.
Yes, no one is disputing that animals are less intelligent and lack the same capacity for planning that humans have, but the post you're replying to is disputing the claim that planning is done solely through internal narrative/monologue, which is easily disproved by pointing to the examples I did. There are many more in nature.
I once had a teacher claim that people who claimed to have aphantasia were lying, because those people have read books and it is impossible to read a book without picturing the scene in your mind's eye. Are you citing the same source that she was?
I wish I had such a teacher, because I'd have learned the term "aphantasia", instead of worrying all my youth that I was doing reading wrong, as I could never picture anything I was reading in my mind (and as a result, I found scenery descriptions mind-numbingly boring).
How can science make this claim if it can't prove (or disprove) the existence of an internal monologue?
Well, I remember Richard Feynman came up with an interesting experiment. He found he could not count objects when he read aloud some text at the same time. He had to name the numbers, and it was impossible if he was already engaging his speech.
He thought this was universal, but doing this experiment with friends, he discovered a guy who could count while reading aloud. So when Feynman asked him how he did this, it turned out that the guy, instead of "pronouncing" numbers, was "seeing" colored numbers in his imagination, so his speech was not involved.
I suppose this experiment could be modified and generalized to at least shed some light on this problem.
That's quite the claim.
Perhaps there's confusion in how we are using the word monologue. I took it to mean a dialogue where the problem is perhaps solved using a dialectic method, or simply a conversation with oneself. Since one can solve a problem by following some memorized steps, no conversation required, this is perhaps not a good test, or we mean different things when we say "monologue."
My concern is that if we take their word for it, we're actually buying into two assumptions which (AFAIK) are both unproven:
1. That "Internal Monologues" (not consciously forced by attention) exist in the first place, as opposed to being false-memories generated after-the-fact by our brain to explain/document a non-language process that just occurred. (Similar to how our conscious brains pretend that we were in control of certain fast reflexes.)
2. That some people truly don't have them, as opposed to just not being aware of them.
Not only are they unproven, but are ultimately not provable at all. Some people will say yes, some people will say no. Probably we can take their word for it, but in the simplest case they could just lie (in either direction) and we would have no way to tell.
In short, maybe these inner monologues exist and maybe they don't, but science can't comment on that. That said, it is clearly something we are interested in, but it will need to be addressed in some other way (i.e. religion, ideology, etc.).
Individually, no, but in general, for people to consistently lie about this particular thing at scale would be extremely unusual, given that people rarely lie if there's no reason for it. Going by this baseline, you could assume upward of 50% of replies are honest (even if mistaken); otherwise you'd have to explain why you believe people would suddenly lie about that particular thing.
I've heard a theory where the inner monologue was emergent, and some of the first people to recognize the 'voice in their heads' attributed it to gods/angels/etc.
There's conspiratorial lying and lying from ignorance; the latter requires much less credulity to accept.
That theory is the "bicameral mind"; I think it's even discussed elsewhere in this thread.
No, they are potentially falsifiable as we get better at scanning, identifying, intervening in brain activity.
Just off the top of my head here, suppose we create a table puzzle problem that (in itself) doesn't require language to understand, like ones we make for certain animals. Have human subjects (silently) solve it. Afterwards, quiz the solvers about their internal monologue--or lack thereof--dividing them into two groups and noting the words used.
Now change to a second puzzle of similar style and the same overall difficulty. Stun/anesthetize the language-centers of subjects, to deny access to any of the monologue-words (validating this intervention will involve other research), and then test them on the second problem.
* If performance is constant for both groups, that suggests the monologue is illusory or at least not needed for this kind/scope of problem.
* If performance drops for both groups, that suggests the no-monologue people might just not be as aware of a linguistic process that's actually happening.
* If performance drops for monologue-subjects, that suggests it's a real and important difference in modes of logical thought.
* If some other combination happens, you have a mysterious and exciting new line of research.
Sure, there is stuff we can do to tease around the edges (similar problems crop up all the time in psychology and sociology) but we will always have to weaken the claim in order to do experiments relating to it.
Because we can't be sure whether two people interpret "inner monologue" the same way, and whether they think it describes a phenomenon that actually isn't different between them and other people.
For example, I can think of interpretations of "I picture objects that I'm thinking about" that range from me not experiencing the phenomenon to me indeed experiencing the phenomenon.
To say that you're not experiencing something that other people are experiencing in their head is a solipsistic notion where you hypothesize an experience that you imagine others are having and then discard it for being different than yours.
And here I thought this was solved decades ago - I need to find the source, but I read about an old study where people describe their experience, and the answers were all over the "range from me not experiencing the phenomenon to me indeed experiencing the phenomenon".
Then again, it's trivially reproducible - people self-report all variants of inner monologue, including lack of it, whenever a question about it pops up on-line. Same is the case with imagination - aphantasia is a thing (I would know, I have it).
I'm responding to "why can't we just take their word for it?"
That you and I can come up with different ways to describe our subjective experience in conversation doesn't mean that we have a different subjective experience.
Especially not when relayed by a species that's frequently convinced it has a trending mental disorder from TikTok.
We can keep talking about it, and assuming we're both honest, we'll arrive at the answer to whether or not our subjective experiences differ. To fail at that would require us to have so little in common that we wouldn't be able to communicate at all. Which is obviously not the case, neither for us, nor for almost every possible pair of humans currently alive.
On the other hand, a deep one on one discussion isn't what's happening in casual debates online about to what degree each of us has an inner monologue. And because we don't have so little in common, I would be resistant to concluding that my subjective experience is so different than everyone else's. To claim that I'm different requires me to have an accurate model of what other people are experiencing, not just an accurate report of what I'm experiencing.
Look up examples of this on reddit and you'll find a lot of larping. I would take most of it with a grain of salt as you should with any story-telling you encounter on social media.
If we were so reliable, there wouldn't be fake mental-illness epidemics on TikTok regarding experiences far more concrete than fuzzy notions like the inner monologue.
As someone who was involved in spiritual practice of "stopping internal dialogue" for years, I can tell you that one learns that that dialogue (or monologue, pretty much the same thing) is quite subtle and complex, essentially multi-layered.
Typically, when you think that you "think about nothing at all" it's just the most surface layer that has stopped, and more subtle talking to yourself is still going on. It takes training just to become able to notice and recognize it.
After all, it's just such a constant and monotone hum at the back of one's mind, one learns to completely ignore it.
So no, I would not take the word of people who were not trained to notice their internal monologue when they say they haven't any :-)
Feynman came up with a potential test by tracking the things he could and couldn't do while counting seconds (via internal monologue). He found he generally could not talk while counting.
He then had others try and found that one of his mathematician friends was able to talk just fine while counting because it turned out he was counting visually.
https://www.youtube.com/watch?v=Cj4y0EUlU-Y
Another possibility is that inner-monologues (ones not forced by conscious effort) do exist, but are just a kind of false-memory, something one part of our brain generates after-the-fact to explain/document the outcome of another non-language part.
Kind of like how certain reflex-actions can occur before the decision-making areas of the brain light up, yet humans will believe that they sensed the event and made a thinking choice.
It’s commonly conjectured that the emergence of human-level reasoning wouldn’t have been possible without the development of language. Personally, I’m able to suppress “word thoughts” in my head (for a short time), but then I lose almost all of my reasoning ability. I could imagine that reasoning is language-based even when it’s not conscious for some people. An internal process being there, and being conscious of it, are two separate things. We would be happy with an AI using an internal monologue without it being conscious of that monologue.
Maybe, but symbolic thought can get pretty far away from what we generally call "language." I bet you can reason 1+3x=22 pretty easily without any words whatsoever, or the sound of one ascending octave after another, or the approximate G-force induced on your body if you take the next turn without applying the brakes.
All of these forms of reasoning are true and useful calculations: when we talk about "intuition" what we usually mean is that we have a lot of experience and internal reasoning about a subject, but we struggle to translate it to and from the "language" part of our brain. Nonetheless, any social dancer will tell you that a dialog is possible just by receiving and inducing g-forces alone. You can reason this way about abstract concepts like orbits without ever touching a single word or concrete symbol.
Edit: the key aspect of reasoning, imho, is the ability to make predictions and introspect them against a database of other predictions, using an adversarial heuristic to weight the most plausibly useful results. Perhaps our pattern matching AIs of today just lack sufficient "experience" to do what we call reasoning.
I've tried to do it, but I can't. I had to do something like "ok, so we subtract one from both sides and then it's easy, 3*7=21". Maybe I could do 2+8 but I still think the word ten "aloud".
I was able to do it with no words. I 'saw' the steps as if on a piece of paper. I saw 3x=22-1=21, then x=21/3=7. But I have a degree in applied math. Perhaps not internally vocalizing is just being extremely familiar. It also happened very quickly, perhaps there was no time to vocalize anyways.
To be fair, math is a language in itself... with many dialects, come to think of it.
At the end of the day though, thought requires communication, even if internal. Even physics is modelled as some sort of 'message passing' when we try to unravel what causality really is. Similar to how a processor has cycles, I think something similar (but unsynced) happens as part of what we call 'thinking'.
A decent number of folks can jump straight to the answer on something so straightforward, no steps.
Most people can't do 1 + 3x = 22 without any words or symbols. People who can don't realize that most people can't. I'd argue one isn't using logic when they do that, it's just very good pattern matching.
It's also possible to do mentally by visualizing it rather than internal monologue. You can imagine the 1 on the left arcing over to the right, cancelling the 22 to 21, then the 3 moving under the 21 and the 21 descending through the 3 to become 7.
yup. I considered myself an /extremely/ verbal person when reasoning, but what I do with the above feels closest to 'moving the 1', almost like balancing a mental scale.
I never really noticed that before. I'm not great at math, fwiw.
A decent number of folks just have the answer pop into their head, no logic or thinking required.
Regarding “1+3x=22”, I’m actually not sure, the number words certainly appear in my head when solving the equation. But even then, I would count “1+3x=22” as constituting language. Perception of sound, G-forces, and dancing don’t perform type-2 reasoning by themselves, so I don’t think your argument applies there.
Regarding your edit, no, I think the key aspect of the kind of reasoning we are missing in current AI is the ability to hold the reasoning in your mind, and to iterate on it and evaluate it (judge it) within your mind.
It is very difficult to have a discussion using words to discuss the semantics of non-word or non-symbolic semantics. I was pointing at several different plausible spaces for semiotics and how these spaces could be spaces for reasoning in the hopes that one of them might be relatable.
If you use words in your mind when you use math, and you use words in your mind when you make or listen to music, etc., then it is very difficult to find a common ground where it is possible to see that these other realms of thought are capable of not only prediction, but also producing evidence that leads to judgement. That is to say, the key aspects of "reasoning." I picked them because I thought they had broad enough appeal to be relatable, and because I do not personally hear words in my head when doing any of these activities, whether it's calculus or tango, but I still find calculus and tango to be places where reasoning occurs.
Some of them, like math or music, are closer to the kind of symbolic thought we use when we discuss things with words. Others, like the experience of g-forces, are not. I present them as a sliding scale between "word based" reasoning and "non-linguistic" reasoning. Perhaps you can think of a realm that better fits for your personal experience of intuition, and inspect whether these intuitions are capable of "real" reasoning in the absence of language, or whether intuition should never be trusted even when you have a great deal of experience in that area. Perhaps in your estimation, anything that cannot produce evidence that is articulable in word form is suspect.
Personally, I find all these methods, including language, to be suspect. I don't find language to be especially better at producing the kind of evidence for prediction, correct judgement, or discourse for reasoning than other methods, unless you reduce "reasoning" to tautologically require it.
One of the best tools of language is that we have writing that allows easy inspection or iteration of the written content; but these things are possible in other realms, too, it's just that we didn't have great tools for introspecting and iterating on their "ideas" except within our own minds. These days, those tools are readily available in many more realms of human insight.
I feel it is worth pointing out, as another commenter highlighted, that language, and even more abstract symbolic languages, bring about fluency if you've practiced "speaking and writing" them enough.
I think native speakers hardly "think" about the steps necessary to form a grammatically correct expression, and most of the time just "know".
fluency is not the same as lacking an internal framework for interpreting or thinking in terms of, symbols.
Language is a serialization of our brain's "world model" structures.
Brains are weird. I reason almost entirely non-verbally and I would absolutely struggle if I had to laboriously express every thought in words. It's part of the reason I don't work well in teams. So slow!
What defines the boundaries of internal vs external? Certainly nothing about llm weights or ops should.
I think you’re using “inner monologue” too literally. It could be a progression of pictures, emotions, etc.
With that definition even bacteria have inner monologue.
Can bacteria imagine pictures? Do they have emotions?
Why does this matter? Stop being so pedantic. We're talking about a progression of ideas. Talking in your head is one form of ideas, but people can easily solve problems by imagining them.
Initial thesis was - inner monologue is required for reasoning. If you define inner monologue to include everything brains do - the initial thesis becomes a tautology.
Hmm, looks to me like just trading some words for others. Do bacteria have ideas? Does the navigating system in your car? How do you know?
We need to be at least somewhat pedantic, otherwise it's impossible to know what we are even talking about, and no way to establish anything.
To make any progress on this question at all, we need first to come up with some definition of internal monologue. Even if we may need to modify it later, there has to be a starting point.
Otherwise, nothing can be established at all, because for any statement there always will be someone's understanding of "internal monologue" for which the statement is true, and someone's else understanding for which the statement is false...
I'm sure inner monologue just cashes out into the ability to reflect on your own thoughts. And for one to say that they're not having that experience also involves a claim about what they think other people are having which would make me doubly skeptical.
In practice, when you see people arguing about whether they have an "inner monologue" or can "mentally picture objects" on social media, it's more of a contest of who is the most unique in the world rather than anything that sheds clarity on our subjective experience.
The fact that we don't actually have an understanding and framework for reasoning (e.g. whether inner monologue is a cause or an effect) means we are VERY off from general AI.
https://youtu.be/QGYbbLWn-IE?t=72
Have there ever been studies that demonstrate that those individuals don't simulate possible state transitions they'll go through in a different modality? I'd be curious if they visualize actions they'll take still, just not verbally.
Why? We have specialized semi-isolated lobes of our brain. Some "external to the LLM weights" guiding system just needs to be transparent to us. It doesn't need to be internal to the LLM/weights.
Most of the interesting AI things can't write a sentence (weather simulation, protein folding, inverse kinematics, driving cars, etc...). Language isn't a requirement, only some sort of meaningful mapping to the latent space of each is. I think one could claim that "meaningful latent space mapping" is the language of neural nets, including those in the brain.
From a formal perspective you're entirely correct. Transformers with chain-of-thought are strictly more powerful than transformers without it, and can efficiently solve classes of problems that would otherwise require exponentially increasing model depth: https://arxiv.org/abs/2310.07923
Yeah, I don't know why people ever thought that the current state of LLMs is enough to actually solve problems; they're trained to be very good at predicting the next word, but any logic-solving capability is only emergent from the questions + answers in the training data. When they have to improvise, they either make a randomly correct or similar-enough guess, or they fall over.
If these things were trained to actually solve logic problems then it might be different; the training data would have to be logic problems. They've started at the top (language) rather than at the bottom/middle (logic, thought, reasoning), which should be done first - and then the model can be trained (or additional models added) to put thoughts into words.
If you really think about what an LLM is you would think there is no way that leads to general purpose AI. At the same time though they are already doing way more than we thought they could.
Maybe people were surprised by what OpenAI achieved so now they are all just praying that with enough compute and the right model AGI will emerge.
It is an autoregressive sequence predictor/generator. Explain to me how humans are fundamentally different
"Prove me wrong?" That's not how this works. Your implicit claim here is that human cognition and LLM functioning are fundamentally similar. That claim requires substantiation.
I actually did a full write-up on this here fyi: https://photonlines.substack.com/p/intuitive-and-visual-guid.... You can skip most of this and scroll down to the end-section called 'The Mental Model for Understanding LLMs' where I try to map how transformers are able to mimic human thinking. I think that comparing them to auto-associative / auto-regressive networks is actually a really good analogy FYI and I do believe a lot of human learning is based on this.
We have no idea how human thinking works and no idea what consciousness is.
I'm getting so tired of listening to software engineers LARP pseudo neuroscientists with 6th grade level insights.
Bruh...
So instead of shouting insults, maybe you can shine some light on how anything stated there is wrong? Everything that I said there is true - we don't have a good mapping of what consciousness is - if you want to add some insights add them instead of making insulting and useless statements on HN.
We have no idea if ChatGPT is conscious or not. We have no idea if it is "thinking" or not.
What we know with a reasonably high level of certainty is that consciousness and "thought" are physical processes. That's about it.
Pulling out the scalpel to start dividing up what physical process is and isn't conscious is a fool's errand. And it is especially foolish when just making up arbitrary qualifications for it.
Am I saying that ChatGPT is conscious? No. But what I am saying is that you shouldn't give much credence to people who are anything more than agnostic about it.
An article this long needs a table of contents IMO.
Will note this and try to add a TOC next time I do this sort of write-up. This article wasn't supposed to be this long - I originally meant to make a more compact version but kept on finding more things which I needed to add in order to map to a correct 'mental model' of what ChatGPT is.
We don't have sufficient understanding of human thinking and mental processes to make any claims like the ones you are trying to make. If we did, psychology would be a quantitative hard science; it most definitely is not.
Human neurons are continuous-input, with active dendrites and dendritic compartmentalization. Spiking artificial NNs seem to hit problems with riddled basins so far. A riddled basin is a set with no open subsets.
Feed forward networks are effectively DAGs, and circuit like, not TM like.
Caution is warranted when comparing perceptrons with biological neurons.
Dendrites can perform XOR operations before anything even makes it to the soma - another lens on this.
While there is much to learn, here is one highly cited paper on dendritic compartmentalization.
https://mcgovern.mit.edu/wp-content/uploads/2019/01/1-s2.0-S...
I think that the perceptron model of learning plasticity is on pretty shaky ground as being a primary learning model for humans.
It is how it works if you are replying to someone who claims "If you really think about what an LLM is you would think there is no way that leads to general purpose AI". The counterexample is that human beings are considered general-purpose intelligence, and we are complex but fundamentally predictable systems (just not by us today), with (as far as we can tell) deterministic outputs based on the state of the universe (including our physical being, like the brain and its chemistry).
Responding to an unsubstantiated claim with an unsubstantiated claim just adds another unsubstantiated claim.
So far as I know, whether the universe behaves deterministically remains an unsolved question. Given that, your statement here would already be one of belief rather than fact, even before we get to the parentheticals. There is information here, but not about whether LLMs can develop into AGI.
Fine, you can ignore my previous comment, that's just my answer to the question that this discussion ultimately takes you to. But I feel like you are just sitting on the sidelines making strawmen and playing pedantic games instead of saying anything constructive.
The original comment said:
This is an inflammatory way to state an extreme position on a well-discussed debate over whether next-token prediction can lead to general intelligence. The original commenter clearly believes it can't get you there. If you want to say that with any authority, you need to have an answer for what is different between what we consider general intelligence (for most people, this is simply human intelligence) and what models are capable of. This is the question at the heart of artificial intelligence.
I challenged them to explain their answer. I made no claims, I asked no one to prove anything wrong. If it is obvious that LLMs can't be AGI, the answer to how an LLM differs from human intelligence is also obvious, right?
Your original comment was:
Which would be interpreted by most reasonable people as you making the claim that humans are autoregressive sequence predictors, and asking people to prove you wrong. I can see how you could say this without intending to make that claim, but most people will interpret this as you making that claim.
I do not intend to inflame things or discredit what you are saying, but just to say that if you did not intend to make a claim or ask people to prove you wrong, a different approach would be more successful in the future.
You're right, it was hastily written and I was annoyed.
But I generally hold out hope that people can see a claim "A!=B" and a response "A=C, explain how C!=B" and understand that is not the same as claiming "C=B", especially on HN.
I do remain convinced my interpretation was sound, but on review I have to concede it was also quite uncharitable.
With all the wildly overheated claims that've been flying around since the advent of these models, I may be myself somewhat overfitted. Granted, in such an environment I feel like a little extra care for epistemic hygiene is warranted. But there was no reason for me to be rude about it.
I know what you mean. Unfortunately, it's easy for frank and concise language to be taken the wrong way when in written form (and sometimes even verbal form). I wish I didn't have to make qualifiers about my intent on my internet comments, but I often do, to try and make sure that other people take my comment the way I intended it. I think it generally leads to better discussion.
I don't blame people for not wanting to talk this way.
AI needs to see thousands or millions of images of a cat before it can reliably identify one. The fact that a child needs to see only one example of a cat to know what a cat is from then on seems to point to humans having something very different.
Not if they inherit from a previous generation of AI. But even if they did, a different training speed does not imply a different capability
My point is not that humans have a faster training speed but that humans must be doing something fundamentally different from LLMs. You could build Altman's $7 trillion dollar GPU cluster and use the majority of the world's energy to feed it and you'd still hit the same limitations if you're just running an LLM on it, even a very sophisticated LLM. This is Yann LeCun's position as well.
That, as I understand it, is not a valid chain of logic. Requiring fewer data points does not inherently indicate that the underlying mechanism (autoregressive sequential generation, not the transformer, which is just an architecture) is different.
Not to mention the secondary arguments, like: there's no proof that humans learn faster from fewer data points; that's just your assumption in the sibling comment. Humans inherit information. The equivalent - fine-tuning a foundation model - is very fast at learning novel objects.
Just because someone has a Turing award doesn't mean they know what they're talking about. They are just people, with strengths and weaknesses like everyone else. But often on the extreme end of strengths and weaknesses.
Well, the same thing goes for you - just because someone posts on HN doesn't mean they know what they're talking about. And if I have to decide whose assessment I trust regarding AI, I take the Turing award winner who worked for almost 40 years on AI over a random guy from the internet.
Humans train on continuous video. Even our most expensive models are, in terms of training set size, far behind what an infant processes in the first year of their life.
EDIT: and it takes human children a couple years to reliably identify a cat. My 2.5 y.o. daughter still confuses cats with small dogs, despite living under one roof with a cat.
I contend that you could show any child old enough to communicate in basic English a photograph (so not live continuous video) of some obscure animal they've never seen before (say an Okapi) and they'd be able to easily identify another Okapi when seeing one at a zoo.
So you're just going to ignore the 5 years of continuous training? I'm not sure what point you're trying to make.
My daughter is 5 y.o., which means because of kindergarten, I spend plenty of time around kids this age. A random kid this age would absolutely fail your test. They may remember the word after one exposure, but I doubt they'll remember any of the distinctive features.
Hell, many adults would fail it. I'm not sure if I could pass such test - in my experience, you remember the important details only after first experiencing a test and realizing what exactly it is that would be useful in distinguishing the two animals.
If the model is first pre-trained on unlabeled images, then it takes about 10 labeled images of cats and 10 labeled images of dogs to train a (possibly strong) classifier (example: DINOv2), I doubt humans will do better.
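To make the "pretrain on unlabeled images, then probe with ~10 labels per class" idea concrete, here is a minimal sketch of a linear probe over frozen embeddings. The encoder is stubbed out with a random projection (`embed` is a placeholder, not a real DINOv2 API), so the accuracy number is meaningless; it only shows the shape of the workflow.

```python
# Minimal sketch: few-shot "cat vs. dog" linear probe on top of frozen
# self-supervised embeddings (DINOv2-style). `embed` is a placeholder
# for whatever frozen encoder you actually use.
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed(images: np.ndarray) -> np.ndarray:
    # Placeholder: pretend this runs a frozen pretrained vision encoder
    # and returns one feature vector per image. Stubbed with a random
    # projection so the sketch runs end to end.
    rng = np.random.default_rng(0)
    proj = rng.normal(size=(images.shape[1], 384))
    return images @ proj

# ~10 labeled examples per class, as the comment suggests.
rng = np.random.default_rng(1)
cats = rng.normal(loc=+0.5, size=(10, 2048))   # stand-in image data
dogs = rng.normal(loc=-0.5, size=(10, 2048))
X = embed(np.vstack([cats, dogs]))
y = np.array([1] * 10 + [0] * 10)              # 1 = cat, 0 = dog

clf = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", clf.score(X, y))
```

The point is only that all the heavy lifting sits in the frozen pretraining; the labeled-data requirement on top of it is tiny.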
That's a good point; when comparing performance with humans, one has to remember that a human spends years of training on unlabeled images in form of continuous video stream, on top of audio streams and other senses, before they're able to communicate with you so you could test them.
Even language is not sequential.
Tell me more?
Language is only sequential in the form it is transmitted (verbally). There is no reason that sequential statements are generated sequentially in the brain. Quite the opposite, really, if you consider rules of grammar.
I really wish most of the LLM folks just took a few courses in linguistics. It would avoid a lot of noise.
Any pointers for where to start? Linguistics is dense, the terminology is a slog, it seems huge, and I was never clear that there was real value to be gained. But it's always useful to be more informed.
LLMs don't generate their language sequentially either, they just output it sequentially token by token.
Whoever cares about "rules of grammar"?
Grammar is descriptive, it formalizes the language so it doesn't break down into regional dialects too fast, and otherwise is just a crutch for people learning the language, especially if it's not their first language. The way you acquired your first language is the same way LLMs learned to utter grammatically correct sentences: by being exposed to lots and lots of examples, and eventually getting a feel for it. Similarly, if you're fluent in a language, you don't even think of grammar when using it - the right phrases in correct forms just come to you.
Since LLMs seem to follow the rules of grammar quite well, and this is quite the opposite of "sequential", I guess LLMs do not process information very sequentially.
We don't know how we are different, but it's pretty clear that we are fundamentally different. Modern machines calculate much faster than humans; if human thoughts and reasoning worked like an LLM, we would already be sitting in front of general-purpose hard AI. But we aren't, so obviously there is something missing. Unfortunately it's quite hard to know what is missing, because we don't know how consciousness and reasoning actually work and we can't formalize them. That's why it's fair to assume we won't see general-purpose AI that fast (probably not in our lifetime), which a lot of AI researchers seem to support, especially those who are not working for companies trying to sell AI.
LLMs are the language center of an AI. They in no way purport to be: long term memory, critical thinking engines, calculators.
If we want those things we can build them. Building them into the language center would be absurd and weird.
In the brain all those things are neural networks too; memory is encoded in the weights. So if we want that type of system, everyone should have their own model, and the weights would actually adjust all the time just like in our brain - but right now it is just trained once and that is that.
And yet this is what all the big companies are trying to do, because the LLM is an incredibly expensive fixed cost, with a low marginal cost for each new task.
This is a very good take on the current state of LLMs. They are able to articulate surprisingly well and LLMs or something similar will definitely be part of a larger AI model in the future. They are good creative aids, but they should not be trusted.
There's actually a pretty simple solution to this that I thought about testing out and it involves asking the model to re-construct the problem using a logic language (like Prolog) and asking it to execute this type of program in order to come up with a solution rather than attempting simple chain-of-reason training / other methodologies of getting the model to 'reason' through some of these examples. People forget that humans don't come up with their logical models out of the blue - it takes years of elementary school in order for us to understand the world and solve problems in it. The logic programming approach I'd say is really promising but you would need to feed the LLM a LOT of examples in order for it to work, and currently I'm not even sure that we have enough training data in order to implement something like this.
I honestly thought about this recently when I was trying to see the limits of Claude Opus. For some of the problems I gave it: what if, instead of telling it to solve the problem, I asked it to write a script and then give me the command and inputs needed to properly run it to get the answer I needed? That way, instead of relying on the LLM to do proper analysis of the numbers, it just needs to understand enough to write the logic.
It is an interesting prospect but I feel like it has some limitations. For math problems like this one, yeah, it should be simple to write a script to do it. But it does first have to understand the core thing here - that Alice would be one of the brother's sisters - to write the script accordingly.
But I would think this would not scale well when dealing with far more complex issues, particularly ones that may not just be simple math logic. If the request was to write a scientific paper for example, it might have to make several scripts and call them multiple times, and that assumes that it understands what it is doing enough to properly make the script.
I'm not sure what you mean by 'it will not scale well.' When we humans learn that we make a mistake - we make a note and we hold the correct answer in memory - the next time we're prompted with a similar prompt, we can use our old memories to come up with the correct solution. I just did a simple test for this same exact problem using ChatGPT 3.5:
'Can you reformulate the following problem using Prolog? When you execute the program - what answer do you get? 'Alice has 8 brothers and she also has 5 sisters. How many sisters does Alice’s brother have?'
Even though it came up with a proper Prolog formulation, it failed to account for the fact that Alice is one of the sisters - so I gave it a simple prompt for this and it was able to correct the formulation and come up with the proper solution:
My prompt after ChatGPT gave me an answer of 5: 'Yes but you aren't including Alice in your logic. Shouldn't the correct answer be 6 (since Alice is also a sister)?'
Response:
'You are correct. The correct answer should account for Alice being one of the sisters. Let's adjust the Prolog program to include Alice in the calculation:
```prolog
% Facts
brothers(alice, 8).
sisters(alice, 5).

% Rules
sisters(Brother, NumSisters) :- sisters(alice, Num), NumSisters is Num + 1. % Include Alice herself

% Query
?- sisters(alices_brother, NumSisters).
```
In this updated Prolog program:
The rule for determining the number of sisters includes Alice herself by adding 1 to the number of sisters Alice has.
When we execute the query ?- sisters(alices_brother, NumSisters). in Prolog, we should get:
NumSisters = 6.
This correctly reflects that each of Alice's brothers has 6 sisters, including Alice herself.
When I prompt it to answer this question using Prolog now, I always get the correct answers no matter which integers I feed in, so it seems like the model is able to generalize after I show it where its 'reasoning' was wrong. Basically, my suggestion is to get the model to recognize logic puzzles like this and try to formulate them in terms of logic programming queries which it can use and tune in order to come up with correct answers, rather than the simple auto-associative chain-of-reason training which current GPT models rely on. But like I said - this is my hypothesis, and I believe this would work much better in getting these models to 'generalize' than the current approaches we're using. Hopefully this helps.
When you ask again the prompt includes the context of your previous question and correction. When I ask the prompt doesn't have that context so the model fails to give me the correct answer.
I'm using the default free model in the app, based on GPT4.
Yup - well this is where my suggestion is to change the GPT architecture. You can think of having a logic-program function as the 'frontal lobe' of the general pre-trained auto-associative model. This 'frontal lobe' region would try to come up with logical sequences to go along with its internal auto-associative representations. Of course, the logic programming piece is just one approach - maybe chain-of-thought or chain-of-reason prompting could work here too, as many humans I think use this chain-of-reasoning approach themselves. Logic programming to me would function as a suggested shortcut.
"My prompt after ChatGPT gave me an answer of 5: 'Yes but you aren't including Alice in your logic. Shouldn't the correct answer be 6 (since Alice is also a sister)?'"
Useful, if you know what the answer is. What happens if you don't give it the correct answer?
What happens when you don't give a human being the correct answer? :)
What if you prompt it "You seem to have accidentally included Alice. The correct answer should be 4"?
Yup this is a good example. This is because the model has no conception of what 'causality' is or how to try to come up with a correct 'model' - humans have a visual system which helps them out, but for LLMs I can definitely see your point and yup in these instances - if you feed in garbage data then yeah - you will get garbage out.
I don’t understand why LLMs aren’t already set up to do what you describe automatically behind the curtain. Extract a math equation from text (LLMs are good at translating between languages, right?) and immediately evaluate it on the host CPU.
LLM is the equivalent of recalling your times tables. Computer arithmetic is the equivalent of re-computing your times tables.
ChatGPT does do this sort of process for arithmetic now; it converts word-based problems to mathematical notation and then solves them.
LLM-based systems with tool use (which this is an application of) often are, to an extent, the issue is tuning the (behind the scenes, system) prompting so that they use appropriate tools in every case where they should, and do so correctly. (There's also a cost factor involved since behind-the-scenes tool use means multiple LLM round trips to answer the question, so tuning the system to use tools more aggressively makes the system more expensive.)
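For what it's worth, here's a minimal sketch of the "extract an expression, evaluate it on the host CPU" pattern being discussed. The model call (`ask_llm`) is a placeholder that returns a canned expression; only the safe arithmetic evaluator is real code.

```python
# Minimal sketch of tool use for arithmetic: the model rewrites a word
# problem as a plain expression, and the host evaluates it exactly.
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expr: str) -> float:
    """Evaluate a plain arithmetic expression without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval"))

def ask_llm(prompt: str) -> str:
    # Placeholder for the model call; imagine it returns the expression
    # it extracted from the word problem.
    return "(22 - 1) / 3"

expression = ask_llm("Rewrite as arithmetic: 1 + 3x = 22, solve for x.")
print(expression, "=", safe_eval(expression))   # -> 7.0
```

The division of labor is the whole point: the LLM only has to translate the problem; the deterministic evaluator does the part LLMs are unreliable at.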
Right, and do you verify the result?
You have to know what the answer is supposed to be before you can write a test case.
Yup - well, you feed in the prompt along with an answer, and you get the model to produce outputs and check for discrepancies. If the answer is wrong then the model adjusts - this is the way backpropagation works. I think there are huge advantages in using logic languages to represent some of these data sets rather than plain English or the current chain-of-thought reasoning approaches. Backpropagation, as an example, isn't really used in the human brain, but it leads to great results in mimicking how neural networks 'learn' - in the same way, we don't need the full formal picture of how humans model the logical world in order to achieve great results. We can simulate this using logic programming or even general programming, or at least that's my conjecture.
I can have more confidence that my calculations are correct using a calculator compared to doing it by hand, even if I don't know the exact right answer beforehand.
Given that Alice has 13 brothers and 31 sisters, we can update the Prolog program with this information. We need to adjust the fact about Alice's siblings and then use the rule to calculate the number of sisters her brothers have.
Here is the updated program:
```prolog
% Facts
siblings(alice, 13, 31). % Alice has 13 brothers and 31 sisters

% Rules
% Alice's brothers have M + 1 sisters, since Alice is also their sister.
sisters_of_brother(TotalSisters) :- siblings(alice, _, M), TotalSisters is M + 1.

% Initialization goal
:- initialization(main).

% Predicate to find and print the total number of sisters Alice's brothers have
main :- sisters_of_brother(TotalSisters), writeln(TotalSisters).
```
In this program, the fact `siblings(alice, 13, 31)` represents that Alice has 13 brothers and 31 sisters. The rule `sisters_of_brother(TotalSisters)` calculates the total number of sisters that Alice's brothers have by adding 1 to the number of sisters Alice has (since Alice is also their sister).
When we run this program, it will print the total number of sisters Alice's brothers have:
1. Alice has 31 sisters.
2. Each of Alice's brothers will also consider Alice as a sister.
3. Therefore, each brother has 31 + 1 = 32 sisters.
The output of the program will be:
```
32
```
Thus, Alice's brothers have 32 sisters.
Problematic in the real world meat space to say the least.
Assuming Alice indeed has that many brothers and sisters (possible) it's highly unlikely they all have the same pair of unique parents.
The set of siblings that share at least one parent with Alice (the siblings of Alice) can easily include one brother of Alice (same father, different mother) who doesn't share a common parent with any other sibling of Alice.
If that kind of thing worked, we'd have been doing it long before LLM chatbots.
Yet tools like GPT4o can do this.
It's not a trivial problem, taking a human-written description and rewriting it as a Prolog program.
I’m curious how this would work considering knowledge can be fuzzy.
Like if I’m out camping and I sit on a log or a rock those things are not what people usually think of as chairs but they can serve as chairs in that situation.
You can get models to actually show that 'logs' could function as 'chairs.' You're forgetting that we humans also learn this as well, but we learn this in a much simpler manner than LLMs though so someone has to explicitly let models know what assumptions they can make. You get the LLM to write Prolog programs and learn associations in this manner. As the model gets better at logically modelling the problems - the solutions to prompted problems like this should get better.
I just tried that with ChatGPT 3.5 (4o stopped responding after I asked the initial question and it produced the wrong answer). Here's the Prolog it generated:
I may be showing my ignorance about this tech here, but I believe LLMs don't even try to solve a problem; they try to generate a discourse that could pass as a solution or answer to the problem - that's more or less what the abstract states, if I understand it correctly. But in no way do they try to apply some sort of mechanical reasoning like inference engines do.
To me the solution to this is to pair the LLM with mechanical computation, that is, an inference engine or an equation solver, rather than having it recombine the millions of solutions for similar problems it has seen in its training set. I believe I remember reading about teams attempting this approach. I can imagine, for instance, that if the LLM is in some way able to ask questions and use the answer, it could just generate a prompt for an equation solver and include the result in its answer.
I took one of the problems that gpt4o got wrong, and asked gpt4o what tools it could use. It suggested and wrote Prolog for me that (with one pass back to get things defined in the right order) correctly worked out the answer.
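A rough sketch of what that loop can look like end to end: have the model emit Prolog, then hand it to SWI-Prolog on the host. The `generated_prolog` string below is illustrative, not actual model output, and it assumes `swipl` is installed and on PATH.

```python
# Minimal sketch: run model-generated Prolog with SWI-Prolog and read back
# the answer. The Prolog text here is a hand-written stand-in.
import subprocess
import tempfile

generated_prolog = """
sisters(alice, 5).
% Each brother has one more sister than Alice does (Alice herself).
sisters_of_brother(N) :- sisters(alice, M), N is M + 1.
main :- sisters_of_brother(N), writeln(N).
"""

with tempfile.NamedTemporaryFile("w", suffix=".pl", delete=False) as f:
    f.write(generated_prolog)
    path = f.name

result = subprocess.run(
    ["swipl", "-q", "-g", "main", "-t", "halt", path],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())   # expected: 6
```

If the program fails to load or the query errors out, that failure can be fed back to the model for another pass, which matches the "one pass back to get things defined in the right order" experience above.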
It's not AI hype. Hype is defined as something which gets oversold: "promote or publicize (a product or idea) intensively, often exaggerating its benefits."
Just yesterday I visited a Google Cloud summit, and one person from Bosch told the audience how they are now able to work with fewer external agencies like copywriters, graphic designers and photographers for their materials.
It already saves money, has real impacts and continues to progress.
We also don't know what ChatGPT 5 will bring, because they say it will do more reasoning than before, but we (people/our society) are already working on solving this in different ways: from code which creates a unit test first and then the code, to different types of architectures.
For me, 2024 was the year of LLM cost reduction and of LLMs getting big context windows.
AI doesn't need to be ready tomorrow, but its capabilities are already really good. And I know plenty of people around me who are a lot less interesting to talk to than any LLM (from a human skill/knowledge point of view).
Llama 3 was also a big achievement in 2024. Facebook showed that better data leads to better quality for smaller models.
We have not only entered the AI era but also the 'gather all the knowledge we can, quality-check it and refine it, because now we can actually do something with it' era.
We are in the knowledge-feedback-loop era.
Your post is complete hype, all about people saying things instead of showing things that've actually been done.
For me, 2024 was the LLM exposed as basically pure hype year.
There is no expert of any field I follow online where they're posting up results from AI tooling for any other reason than to show how awful it is. I consider myself an expert in software, and LLMs specifically have only caused me great pain.
Even the one situation you describe, someone describing the ability to work in an absolute vacuum, sounds like a huge negative to me. The recent push for DEI policies was even ostensibly about the importance of people of diverse backgrounds and viewpoints working together.
The most important thing you're missing a perspective of scale on is the step you describe as "quality check it". On things I don't know, and have attempted to enlist an LLMs help on, in every case I have had to go back and just actually learn how something works, after time wasted struggling with subtle wrongness in the output.
At least I have the background expertise to do that, however, I have seen a Jr dev's mind get literally rotted by too much time in pure LLM land. Besides the cost of rewriting their code, the company was now the proud owner of a young dev with a mind filled with nonsense.
How do you even weigh the cost of fixing a corrupted human mind?
Eat something and take a nap, you sound unhinged.
ChatGPT has nearly doubled my work output, most of my job is system admin infra type stuff and it's ridiculously good at troubleshooting odd issues.
Hopefully you can find a use case for it someday, until then, the rest of us will continue to be more productive.
Do you have any concern about the data you're feeding to the vendor serving your prompts?
I've had junior devs tell me they use chatgippity to combine Excel workbooks, and when I confirm they're not self-hosting an LLM to do it, I ask if they think it's a good idea to hand over company data to OpenAI. They don't care.
In a world of tight security, I find it astonishing that so many people willingly give away trade secrets to these companies, who can sell it to any bidder if they choose.
I do think it's fine yes, but I also am not giving it anything sensitive.
So you are not using Office 365?
Because our company does.
I'm getting jun devs who are brainstorming and troubleshooting with chatgpt before they start talking to me.
I have a good intern who is much faster with chatgpt than before and learning well.
A majority don’t deny that it’s good. The problem is that so many think it is actually reasoning, believing the answers can be trusted.
It's generally reasoning better than me, so using them I make fewer mistakes than if I did my own logic.
Does it need to do reasoning perfectly?
If it created meta-concepts from billions of words on the internet and has meta-models which are correct, more numerous, and better than an average human's, isn't it actually good at reasoning?
It's a very narrow complaint - "so many think it's actually reasoning" - from which to conclude that AI is just hype or that everything we are doing is a waste, etc.
There are human benchmarks they are winning at. The criticism could rather be that we don't have enough benchmarks.
This paper very clearly demonstrates these LLMs are not reasoning in a fundamental way. Token prediction and reasoning are two different tasks. They may be related, but they are not the same. "Just wait for GPT 5, it will be amazing!" is part of the hype.
Please do not assume an LLM is correct in skill or knowledge unless you already know the answer or can verify by other means.
The problem is that we don't know how we do reasoning.
I calculate stuff by following a formula after I've pattern-matched a problem I already know.
Plenty of humans are not able to solve those math problems.
If the future of LLMs/AI is a multimodal, mixture-of-experts model that solves those reasoning problems, we still don't know whether that is a different type of reasoning from what humans do.
There is no feedback. You cannot create new knowledge out of thin air.
The problem is a good chunk of the global population is also not reasoning and thinking in any sense of the word. Logical reasoning is a higher order skill that often requires formal training. It's not a natural ability for human beings.
Seriously? You think individuals are incapable of reasoning without training first?
Yes, seriously.
Some examples: an individual without training cannot reliably separate cause from effect, or judge that events A and B may share a common root cause. Similarly, people often confuse conditionals with causation, and often have difficulty reasoning about events based on statistical probabilities. Remember, the average person in North America is far more terrified of a terror attack than of an accident or a heart attack, yet the latter two are much more likely to be their cause of death.
You mean without training, people cannot frame answers in the terms you've learned from training. Well, why are you surprised?
If you think reasoning is limited to the frameworks you learned from a book, you live in a small world.
In an effort to be pejorative, you have reduced the process of learning to a basic regurgitation of facts. Rather, learning about reasoning is more about developing a framework that allows one to construct claims based on available information and then check the veracity of those claims while avoiding all, or at least typical, fallacies.
Critical thinking requires a certain amount of rigor, which formal education is well-suited to impart. It can be self-taught with a hefty dose of discipline, but it cannot be intuited.
Do you think they are? We haven't had a case of individuals without training in like 10 000 years.
And judging by some comments here, humans must have gotten dumber since then. Or we would never have invented writing, agriculture or even basic clothing.
"This problem is actually not that easy, the average person couldn't solve it either, especially if the numbers were bigger", "Yet another cherrypicked clickbait study to make LLMs look bad, those people are just scared of being made obsolete", etc.
That's the thing. We had to invent those things. Along with counting, numbers, logic, arithmetic, and those stupid-ass annoying family tree riddles. We didn't get them in one step, it took a long time to build one on top of the previous. Generation by generation, each cohort of children growing up in a slightly more complex world than their parents, each generation taught how to navigate this complexity by their families and tribes. Learning a growing collection of facts and beliefs and life hacks.
There were no untrained people for as long as humanity existed. The minimum reproductive unit of homo sapiens is a village.
In "any sense" of the word? Surely anyone who adjusts their behavior when they get undesired or unexpected results is reasoning and thinking. And since most activities are mediated by thought of some kind, most people are reasoning and thinking otherwise they would never recover from even simple mistakes, like walking east when they need to go north.
Saying they're "not thinking in any sense of the word" because they can't solve predicate logic problems from a college textbook is a rather odd claim. Surely those things arise from reasoning and thinking, rather than the other way around.
This seems to me to be where these systems need to go in the future, akin to reinforcement learning.
You feed an LLM a prompt. It abstracts and approximates what the result should be, then devises a hypothesis, solves it, and compares it to the approximated output. It can then formulate a new hypothesis and evaluate it based on the outcome of hypothesis 1. From there it can either keep iterating or dump that path for a new one (e.g., the next-best hypothesis in the original formulation).
At some point the answer is "good enough." But along the way it keeps playing against its thoughts to see if it can do better.
A key issue may be the original approximation, so it may need to consider its adjustment when iterating.
Maybe this is how cutting-edge LLMs work now. I have no idea. (A rough sketch of the loop I mean follows below.)
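To make the loop concrete, here is a minimal sketch in Python. The `ask_llm` helper is hypothetical, standing in for whatever model call you like, and the scoring and stopping rules are illustrative guesses, not a description of how any current LLM actually works.

    import re

    def ask_llm(prompt: str) -> str:
        """Placeholder for a real LLM call (e.g. any chat-completions API)."""
        raise NotImplementedError

    def extract_score(verdict: str) -> float:
        """Naively pull the first number in [0, 1] out of the judge's reply."""
        match = re.search(r"\b(0(?:\.\d+)?|1(?:\.0+)?)\b", verdict)
        return float(match.group(1)) if match else 0.0

    def solve_iteratively(task: str, max_rounds: int = 5, good_enough: float = 0.9) -> str:
        # Step 1: approximate what a correct result should look like.
        target = ask_llm(f"Roughly, what should a correct answer to this look like?\n{task}")
        best_answer, best_score = "", 0.0

        for _ in range(max_rounds):
            # Step 2: devise a hypothesis, conditioned on the best attempt so far.
            hypothesis = ask_llm(
                f"Task: {task}\n"
                f"Best attempt so far (score {best_score}): {best_answer}\n"
                "Propose a new or refined solution."
            )
            # Step 3: compare the hypothesis to the approximated target and score it.
            verdict = ask_llm(
                f"Expected shape of the answer: {target}\n"
                f"Candidate: {hypothesis}\n"
                "Score the candidate from 0 to 1 and briefly justify the score."
            )
            score = extract_score(verdict)

            # Step 4: keep iterating or dump this path, until the answer is "good enough".
            if score > best_score:
                best_answer, best_score = hypothesis, score
            if best_score >= good_enough:
                break

        return best_answer

The weak point is exactly the one noted above: if the original approximation of the target is wrong, every later comparison inherits that error, so the loop would also need some way to revise the approximation itself while iterating.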
I've read your comments here, and while I understand your point I think you have it backwards. The only reason we formed societies is because we evolved an innate theory of mind to reason about how others might be thinking and feeling. That's reasoning. We have a natural ability to do limited arithmetic, otherwise we wouldn't be able to hunt, gather or track enough to endure winters, or keep track of our sheep or children for that matter. That's reasoning.
Reasoning is a natural ability for human beings, but we also carry a lot of evolutionary impulses that add noise to the decision process, e.g. observation -> judgment -> [set of possible decisions]. Judgment has "reason" as one path that adds to the set of possible decisions, but there remain other paths we inherited from our evolutionary roots. Education is training that suppresses poorly calibrated judgment paths that lead to frequent mistakes in the decision set, but reasoning remains; thus education improves the signal-to-noise ratio of our decision making.
So I 100% disagree that an individual cannot separate cause and effect without training. They will just be worse at it than someone who is trained to filter out the impulses that lead us to jump to conclusions, i.e. they will produce more noise, a larger set of possibilities, than reason alone would allow.
I don't think anybody who actually uses LLMs thinks they are general purpose AI.
Ask it to do one thing and it's amazing, but try to modify the result, add extra steps, or do anything with any real complexity and it falls over.
Well, I believe LLMs are general purpose AI. And, for the record, so does Peter Norvig. He coauthored the most popular textbook on Artificial Intelligence, so this belief can be considered a somewhat mainstream perspective. His article, “Artificial General Intelligence Is Already Here“ [1] is really worth reading.
https://www.noemamag.com/artificial-general-intelligence-is-...
I can't tell whether Norvig is making a subtle point that artificial general stupidity is still an important form of artificial general intelligence, or whether he somehow vastly overestimates AI model performance on non-memorization and summarization tasks.
You seem to have an axe to grind. You might spend more time trying to get quality out of the models. It is possible!
I would like to believe that, but I have had too many conversations with people who basically think it already is, including, in one case, a fellow engineer.
It feels like more and more "we" are in a bubble of actually having some knowledge of how this works, what the actual limitations are, and what it simply is not. While there is in fact criticism of it out there, particularly around AI "art", it doesn't seem to be focused on the area we are talking about.
I dunno. If memes are anything to go by, people are frustrated by Yet Another Chat Bot standing between them and the human they need to solve their problem. The fact that it's a better, more human-like chat bot doesn't mask the frustration of being forced to talk to a computer with limited agency.
The vast majority of people use a model built by someone else through a high-level abstraction and then make broad claims about what the technology will be capable of.
Then you have the people who are leveraging the technology to train models from scratch and that population is far more apt to believe that large models can be general purpose AI (by some definition). Sure there are other things at play like money, publicity, reputation, and a desire to do something important. But there is also the very clear trend line where transformers are able to model pretty much any sequence of tokens where there is pretraining scale data available. Whisper was not a surprise. GPT-4v was not a surprise. Sora was not a surprise.
They might not think that but they sure as hell are trying to sell that idea.
they're pretty general-purpose; you can ask the same model for recipe suggestions, a fanfic, or verilog. like, they're far, far more general-purpose than any humans i know
it's true that they're not very reliable, but they seem to be not very reliable across many different domains. and they don't seem to be particularly less reliable than the average human, so i think possibly your standards for 'general purpose ai' are set high enough that you would declare humans to be unintelligent (or perhaps not 'general-purpose') if you applied them consistently
you can certainly find particular domains where humans can still do things llms can't, but i haven't seen a persuasive account of why those domains are the more important ones, and of course the converse is also true
They are being sold as such. Most people don't know anything about the topic and will buy that marketing. The entire concept of these models is that you can put a whole bunch of data in and eventually some kind of magic will happen and you get AGI out. They would not see the kind of investment that they do if all that was being promised was "really good predictive text". In fact some philosophers argue that sentience is just really good predictive text to try and make the point that these models are AGI.
I appear to be reasoning at times but I have mostly no idea what I am talking about. I hit a bunch of words and concepts in the given context and thus kind of hallucinate sense.
Given a few months of peace of mind and enough money for good enough food, I could actually learn to reason without sounding like a confused babelarian.
Reasoning is mostly a human convention supported by human context that would have been a different one if the Fascists had won the war or the Soviet Union wouldn't have gotten corrupted.
But none of that has anything to do with pulling up a whiteboard to draw some flowcharts and run some numbers, all of which is why I am certain there is nothing the devs have "to fix". It took most reasonable humans many generations to learn stuff. Very few of us did the actual work.
It's all just a matter of time.
Yeah, I think these chatbots are just too sure of themselves. They only really do "system 1 thinking" and only do "system 2 thinking" if you prompt them to. If I ask gpt-4o the riddle in this paper and tell it to assume its reasoning contains possible logical inconsistencies and to come up with reasons why that might be then it does correctly identify the problems with its initial answer and arrives at the correct one.
Here is my prompt:
I have a riddle for you. Please reason about possible assumptions you can make, and paths to find the answer to the question first. Remember this is a riddle so explore lateral thinking possibilities. Then run through some examples using concrete values. And only after doing that attempt to answer the question by reasoning step by step.
The riddle is "Alice has N brothers and she also has M sisters. How many sisters does Alice’s brother have?"
After you answer the riddle please review your answer assuming that you have made a logical inconsistency in each step and explain what that inconsistency is. Even if you think there is none do your best to confabulate a reason why it could be logically inconsistent.
Finally after you have done this re-examine your answer in light of these possible inconsistencies and give what you could consider a second best answer.
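For anyone who wants to try this themselves, here is a minimal sketch of running the prompt above against gpt-4o with the OpenAI Python SDK (v1-style client). The client setup is an assumption on my part, and any chat-completions client would do; the answer still has to be checked by hand, since the model may be confidently wrong.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    PROMPT = """I have a riddle for you. Please reason about possible assumptions you can
    make, and paths to find the answer to the question first. Remember this is a riddle
    so explore lateral thinking possibilities. Then run through some examples using
    concrete values. And only after doing that attempt to answer the question by
    reasoning step by step.

    The riddle is "Alice has N brothers and she also has M sisters. How many sisters
    does Alice's brother have?"

    After you answer the riddle please review your answer assuming that you have made a
    logical inconsistency in each step and explain what that inconsistency is. Even if
    you think there is none do your best to confabulate a reason why it could be
    logically inconsistent.

    Finally after you have done this re-examine your answer in light of these possible
    inconsistencies and give what you could consider a second best answer."""

    # Send the whole multi-step prompt as a single user message.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT}],
    )
    print(response.choices[0].message.content)

For reference, the expected answer is M + 1: Alice's brother has Alice's M sisters plus Alice herself.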
LLMs are fundamentally incapable of following this instruction. It is still model inference, no matter how you prompt it.
For me it did in fact follow the instruction and found the right answer. I have found that this strategy of asking the LLM to analyze why its answer might be wrong results in better final answers. I think this allows the LLM to "use tokens to think", which lets the attention layers gather the right cognitive resources to answer more subtle questions effectively.
You are confusing "providing the right answer" with "following the instruction". Cause vs correlation. From an algorithmic perspective it is literally doing the exact same thing regardless of prompt.
If you had a prompt that reliably made the model perform better at all tasks, that would be useful. But if you have to manually tweak your prompts for every problem, and then manually verify that the answer is correct, that's not so useful.
There isn't any evidence that models are doing any kind of "system 2 thinking" here. The model's response is guided by both the prompt and its current output so when you tell it to reason step by step the final answer is guided by its current output text. The second best answer is just something it came up with because you asked, the model has no second best answer to give. The second best answers always seem strange because the model doesn't know what it means to come up with a second best answer; it 'believes' the output it gave is the correct answer and helpfully tries to fulfill your request. Sometimes the second best answer is right but most of the time its completely nonsensical and there is no way to distinguish between the two. If you ask to choose it will be strongly influenced by the framing of its prior response and won't be able to spot logical errors.
Asking it to do lateral thinking and provide examples isn't really helpful because its final output is mostly driven by the step by step reasoning text, not by examples it has generated. At best, the examples are all wrong but it ignores that and spits out the right answer. At worst, it can become confused and give the wrong answer.
I've seen gpt-4 make all kinds of errors with prompts like this. Sometimes, all the reasoning is wrong but the answer is right and vice versa.
In many ways, this is very obvious and routine to people who use these systems with a critical understanding of how they work. It's dispiriting that we require an arxiv.org PDF in order to point out the absence of the emperor's clothes. There are few people who use LLMs who could not, right now, demonstrate this point to themselves if they so wished.
Such is the strength of corporate tech propaganda that a whole mass of people will instead insist that we have never worn clothes either.
The last part of that is the problem and why a paper like this is critical.
These systems are being pushed onto people who don't understand how they work. CEOs and other business leaders are being pushed to use AI. Average users are being shown it in Google search results. Etc., etc.
People are being told it can do far more than it really can.
Sure, but even for these people the failures are so common, and often very obvious. Consider a CEO who feeds in a press briefing and asks some questions about it: it's not uncommon for those answers to be obviously wrong on any sort of critical reflection.
We aren't dealing with a technology that is 99.9% right in our most common use cases, such that we would need to engineer some incredibly complex problem to expose the flaw. Rather, in most cases there is some obvious flaw. It's a system that typically requires significant "prompt engineering" to provide the reasoning the system otherwise lacks.
I guess that offers an explanation: people aren't aware that via their own prompt engineering they are repairing the deficiencies of the process by manipulating its inputs to include the structured reasoning it lacks. So there's a sort of hot-reading effect at work.
Right -- we are a long way from "this is a very nuanced error" being the dominant failure.
Meanwhile these HN comments are split between:
* Lots of people confirming what the paper itself notes (but doesn't highlight), that the most advanced models actually can solve this problem at least a significant portion of the time. (A proportion which one can pretty easily project is only likely to increase with future models.)
* Lots of people saying "this confirms LLMs can't do reasoning".
Questions I'd ask you to consider:
* Is "LLMs can't do reasoning" actually more accurate than the typical hype?
* Is a "critical understanding of how [LLMs] work" that would predict they simply cannot solve this problem actually a good understanding?
The critical understanding doesn't predict that LLMs cannot solve problems; it predicts how they will solve them. There is no information, a priori, about what the LLM has been trained on. You have to prompt, then see the answer. Once the answer arrives, the critical understanding provides a route to repairing it when it is not accurate or useful.
LLMs do not reason. They appear to reason by repeating the structure of reasoning in their training data. This is indistinguishable in many cases.
This is the line of reasoning I find most dispiriting. I still believe tech people cling to this line of reasoning because it helps them justify replacing people in jobs with LLMs.
There is no definition of reasoning or thinking. No single human knows what it is.
The only thing we know is: we as humans are capable of recognizing steps and results of reasoning and thinking.
In a lot of cases, when using LLM's, those results appear to be correct and usable. This is often easy to determine with generated code.
I want to argue that, lacking a definition of reasoning, I am happy to have found that the machine helps me to get results that might as well have been produced by a lot of human knowledge, wisdom and deep reasoning.
You yourself did not use reasoning to arrive at this conclusion. It's quite obvious. I'm not trying to belittle you here. But LLMs are black boxes, we do not actually know what they are doing at a high enough resolution where we can call it "not reasoning" or "reasoning".
We can only characterize these AIs as a best-fit curve between datapoints, which is far too high-level a viewpoint from which to come to any conclusion about "reasoning".
This paper presents evidence of failed reasoning, but how does that prove anything when LLMs exhibit many instances of successful reasoning on complex topics they were not trained on?
You are biased and homing in on information that supports a biased conclusion. LLMs are an AI we do not understand at a low level. Hence we talk about the attributes of these AIs in the same way we talk about humans: "oh, the LLM hallucinates", "it tries to justify its answer", etc.
You characterize the danger of these AIs as the result of human stupidity. The danger, according to you, comes solely from a human mistakenly believing that the AI is anything other than a stochastic parrot.
This is a belief arrived at in the same spirit as your claim. You did not use reasoning to arrive here.
The only logical way to characterize what is going on is that we do not know. It could very well be that these AIs are in fact reasoning, and that in itself presents a different kind of danger, one that may become clearer in the far future.
The irony is that your conclusion lacking correct reasoning is similarly parallel to the LLM's lack of reasoning. LLMs are more alike to us than you would like to believe.
Can you give me the step by step signal path ways of an LLM processing a query to prove that it does not reason? Or do you have to use black box anecdotal evidence to prove your point? For any "evidence" where an LLM failed to reason, there is another counter example showing where the LLM succeeded. Contradictory evidence can only lead to a vague conclusion.
The question is how often humans get the question wrong unprompted. Given a median IQ of 100, and assessing people of all types, including those untrained in logic and generally uneducated, I expect human accuracy might not be nearly as close to 1.0 as the authors seem to imply.
But it's also fairly obvious that LLMs don't reason at all, so it's not shocking when they fail at it. What's remarkable is that they perform as well at reasoning tasks as they do, given the fact that they have no reasoning faculty. Coupled with their natural-language ability and emergent abductive reasoning ability, which is generally hard for computing models, it's hard to draw much conclusion here other than "wow, gpt-4o is amazing."