
Simple tasks showing reasoning breakdown in state-of-the-art LLMs

nerdjon
180 replies
1d3h

For anyone considering reading the paper who, like me, doesn't normally read papers like this, and who might open the PDF and think they don't have time to read it due to its length: the main part of the paper is the first 10 pages and is a fairly quick read.

On to the topic here. This is an interesting example that they are using. It is fairly simple for a human to understand (even if we may be inclined to quickly jump to the wrong conclusion without thinking for a few seconds).

The thing that really bothers me is that I just don't know whether we can realistically fix this given the current state of what these tools actually are. They are not reasoning or thinking in any sense of the word, and yet a lot of people are already treating them as general-purpose AI. It doesn't help that in many situations they can fake it well enough to appear to be reasoning, but they're not.

What is the chance that this paper actually has any impact on the AI rollout and overhype, rather than just being buried and never talked about again until the next time we see how dangerous these tools are, like with Google's search rollout?

layer8
63 replies
1d2h

For reasoning of any complexity, some sort of internal monologue and iteration is needed. For type-2 thinking, we conceptualize possible solutions, arguments, and reasoning paths in our mind, and judge and evaluate them before deciding which one we think is adequate or correct. This can be emulated to some extent by guiding prompts that make that internal thought process external, but we really want it to be internal and automatic.

We don’t know how to do that yet, because what controls the internal thought process is itself not necessarily language-based, and also, since internal thought processes of biological brains are not directly observable, they can’t be used as training data.

Edit: It occurs to me now that there is some parallel between current LLMs and behaviorism [0], and we really need something to which cognitive psychology could be applied instead.

[0] https://en.wikipedia.org/wiki/Behaviorism

sollewitt
59 replies
1d2h

Given many people don’t have an inner monologue and function just fine, it’s more likely inner monologue is a product of the reasoning process and not its mechanism.

ElevenLathe
33 replies
1d2h

The existence of an "inner monologue" isn't really a falsifiable claim. Some people claim to have one while other people claim not to, but we can't test the truth of these claims.

glitchc
30 replies
1d2h

In this particular case, is there any reason why we simply can't take their word for it? This is not a case where, if I say "weak" or "strong", most people pick strong because no one wants to be weak, even if the context is unknown (the nuclear force, for example).

GeoAtreides
16 replies
1d1h

is there any reason why we simply can't take their word for it?

because if we give them a problem to solve in their head and just give us the answer, they will. By problem I mean planning a trip, a meal, how to pay the mortgage, etc. It's impossible to plan without an internal monologue. Even if some people claim theirs is 'in images'.

photon_lines
4 replies
1d1h

'It's impossible to plan without an internal monologue.' - Sorry, but I disagree with this. I have no 'internal voice' or monologue - whenever I see a problem, my brain actually and fully models it using images. I believe 25% of the population doesn't have the internal monologue which you're referring to, and this has been tested and confirmed. I highly recommend listening to this Lex Fridman podcast episode to get a full grasp of the complexities of modelling language and general modelling present in the human brain: https://www.youtube.com/watch?v=F3Jd9GI6XqE

lupire
2 replies
1d

Can you draw a picture of an example of what you see when you think about something?

photon_lines
1 replies
23h25m

Sure - with the Alice example, when I saw the problem I came up with a really simple visual example of Alice having 3 sisters and 4 brothers. When I visualized it I saw Alice standing next to 3 other women (her sisters) and her 4 brothers standing close by. When I imagined asking her brother how many sisters he has, I could see that Alice was standing there next to 3 other women and thus came up with the answer of 4. Does this make sense?

magicalhippo
0 replies
21h23m

This could account for why some people are much better at say geometry than algebra.

I'm the opposite. I rarely visualize things I read, be it geometry or poetry. I can read a detailed description of a person or an item in a novel, and I don't really "see" anything.

But I have an active inner monologue. I "hear" myself saying the words when reading or writing, I "talk" to myself when solving problems or just thinking about stuff.

Especially when programming I'm reasoning by discussing the problem with myself, the only difference being that I usually don't open my mouth and vocalize the discussion. Though sometimes when I'm alone I do just that.

GeoAtreides
0 replies
1d

Sure, I do mention thinking in images in my original comment and count it as some type of internal monologue. I personally do not believe it's all images, as that would preclude using highly abstract concepts. But I might be wrong, and it might be 100% images. That being said, it does count as an internal monologue.

JohnMakin
4 replies
1d

It's impossible to plan without an internal monologue.

Of course it isn't impossible, and this is backed by what we know about paleoanthropology and other instances of cognition in animals - humans were making stone tools millions of years ago, which takes planning in the form of imagining what you want the tool to look like and how you will do it and what it will be used for. It's exceedingly likely we had this ability long before complex speech evolved. Apes also use and make tools, which would require planning, and I don't think they have an internal monologue going on. birds from the corvid family can do some pretty advanced problem solving that requires planning. Cetaceans might be an exception, because they appear to have some form of language, but this is a pretty wild claim not really backed by any kind of science as we understand it today.

GeoAtreides
3 replies
1d

Animals cannot manipulate abstract concepts, nor can they make long-term plans. No crow can plan an international trip spanning a couple of weeks and two change-overs. And some people definitely can't do it start to end, but they can at least plan the first 5-7 steps.

Also, maybe inner monologue is not a binary have/have not, but maybe it is on a continuum.

glitchc
1 replies
22h32m

Not sure. Migratory birds seem to manage this just fine. Not only do they make multiple stops to eat and rest, they also navigate around bad weather and still make it to their intended destination (at least most of the time).

GeoAtreides
0 replies
21h36m

Migratory birds seem to manage this just fine

Instincts.

JohnMakin
0 replies
21h17m

Yes, no one is disputing that animals are less intelligent than humans and lack the same capacity for planning, but the post you're replying to is disputing the claim that planning is done solely through internal narrative/monologue, which is easily disproved by pointing to the examples I did. There are many more in nature.

rprospero
1 replies
1d1h

It's impossible to plan without an internal monologue

I once had a teacher claim that people who claimed to have aphantasia were lying, because those people have read books and it is impossible to read a book without picturing the scene in your mind's eye. Are you citing the same source that she was?

TeMPOraL
0 replies
19h11m

I wish I had such a teacher, because I'd have learned the term "aphantasia", instead of worrying all my youth that I was doing reading wrong, as I could never picture anything I was reading in my mind (and as a result, I found scenery descriptions to be mind-numbingly boring).

ElevenLathe
1 replies
1d1h

It's impossible to plan without an internal monologue.

How can science make this claim if it can't prove (or disprove) the existence of an internal monologue?

IlliOnato
0 replies
1d

Well, I remember Richard Feynman came up with an interesting experiment. He found he could not count objects when he read aloud some text at the same time. He had to name the numbers, and it was impossible if he was already engaging his speech.

He thought this was universal, but doing this experiment with friends, he discovered a guy who could count while reading aloud. When Feynman asked him how he did this, it turned out that the guy, instead of "pronouncing" the numbers, was "seeing" colored numbers in his imagination, so his speech was not involved.

I suppose this experiment can be modified and generalized, at least to shed some light on this problem.

staticman2
0 replies
1d1h

> It's impossible to plan without an internal monologue

That's quite the claim.

glitchc
0 replies
1d1h

Perhaps there's confusion in how we are using the word monologue. I took it to mean a conversation: a dialogue where the problem is perhaps solved using a dialectic method. Since one can solve a problem by following some memorized steps, no conversation required, this is perhaps not a good test, or we mean different things when we say "monologue."

Terr_
6 replies
1d

In this particular case, is there any reason why we simply can't take their word for it?

My concern is that if we take their word for it, we're actually buying into two assumptions which (AFAIK) are both unproven:

1. That "Internal Monologues" (not consciously forced by attention) exist in the first place, as opposed to being false-memories generated after-the-fact by our brain to explain/document a non-language process that just occurred. (Similar to how our conscious brains pretend that we were in control of certain fast reflexes.)

2. Some people truly don't have them, as opposed to just not being aware of them.

ElevenLathe
5 replies
1d

Not only are they unproven, but are ultimately not provable at all. Some people will say yes, some people will say no. Probably we can take their word for it, but in the simplest case they could just lie (in either direction) and we would have no way to tell.

In short, maybe these inner monologues exist and maybe they don't, but science can't comment on that. That said, it is clearly something we are interested in, but it will need to be addressed in some other way (e.g. religion, ideology, etc.).

TeMPOraL
2 replies
21h48m

Probably we can take their word for it, but in the simplest case they could just lie (in either direction) and we would have no way to tell.

Individually, no, but in general, for people to consistently lie about this particular thing at scale would be extremely unusual, given that people rarely lie if there's no reason for it. Going by this baseline, you could assume upward of 50% of replies are honest (even if mistaken); otherwise you'd have to explain why you believe people would suddenly lie about that particular thing.

hughesjj
1 replies
18h41m

I've heard a theory that the inner monologue was emergent, and that some of the first people to recognize the 'voice in their heads' attributed it to gods/angels/etc.

There's conspiratorial lying and there's lying from ignorance; one requires much less credulity to believe.

TeMPOraL
0 replies
4h13m

That theory is the "bicameral mind"; I think it's even discussed elsewhere in this thread.

Terr_
1 replies
1d

but are ultimately not provable at all

No, they are potentially falsifiable as we get better at scanning, identifying, intervening in brain activity.

Just off the top of my head here, suppose we create a table puzzle problem that (in itself) doesn't require language to understand, like ones we make for certain animals. Have human subjects (silently) solve it. Afterwards, quiz the solvers about their internal monologue--or lack thereof--dividing them into two groups and noting the words used.

Now change to a second puzzle of similar style and the same overall difficulty. Stun/anesthetize the language centers of subjects, to deny access to any of the monologue-words (validating this intervention will involve other research), and then test them on the second problem.

* If performance is constant for both groups, that suggests the monologue is illusory or at least not needed for this kind/scope of problem.

* If performance drops for both groups, that suggests the no-monologue people might just not be as aware of a linguistic process that's actually happening.

* If performance drops for monologue-subjects, that suggests it's a real and important difference in modes of logical thought.

* If some other combination happens, you have a mysterious and exciting new line of research.

ElevenLathe
0 replies
22h32m

Sure, there is stuff we can do to tease around the edges (similar problems crop up all the time in psychology and sociology) but we will always have to weaken the claim in order to do experiments relating to it.

hombre_fatal
4 replies
22h41m

Because we can't be sure that two people interpret "inner monologue" the same way, or whether they think it describes a phenomenon that actually isn't different between them and other people.

For example, I can think of interpretations of "I picture objects that I'm thinking about" that range from me not experiencing the phenomenon to me indeed experiencing the phenomenon.

To say that you're not experiencing something that other people are experiencing in their head is a solipsistic notion where you hypothesize an experience that you imagine others are having and then discard it for being different than yours.

TeMPOraL
3 replies
21h54m

And here I thought this was solved decades ago - I need to find the source, but I read about an old study where people describe their experience, and the answers were all over the "range from me not experiencing the phenomenon to me indeed experiencing the phenomenon".

Then again, it's trivially reproducible - people self-report all variants of inner monologue, including lack of it, whenever a question about it pops up on-line. Same is the case with imagination - aphantasia is a thing (I would know, I have it).

hombre_fatal
2 replies
20h1m

I'm responding to "why can't we just take their word for it?"

That you and I can come up with different ways to describe our subjective experience in conversation doesn't mean that we have a different subjective experience.

Especially not when relayed by a species that's frequently convinced it has a trending mental disorder from TikTok.

TeMPOraL
1 replies
19h56m

We can keep talking about it, and assuming we're both honest, we'll arrive at the answer to whether or not our subjective experiences differ. To fail at that would require us to have so little in common that we wouldn't be able to communicate at all. Which is obviously not the case, neither for us, nor for almost every possible pair of humans currently alive.

hombre_fatal
0 replies
19h18m

On the other hand, a deep one on one discussion isn't what's happening in casual debates online about to what degree each of us has an inner monologue. And because we don't have so little in common, I would be resistant to concluding that my subjective experience is so different than everyone else's. To claim that I'm different requires me to have an accurate model of what other people are experiencing, not just an accurate report of what I'm experiencing.

Look up examples of this on reddit and you'll find a lot of larping. I would take most of it with a grain of salt as you should with any story-telling you encounter on social media.

If we're so reliable, there wouldn't be fake mental illness epidemics on TikTok regarding experiences far more concrete than fuzzy notions like inner monologue.

IlliOnato
0 replies
1d

"why we simply can't take their word for it"?

As someone who was involved in spiritual practice of "stopping internal dialogue" for years, I can tell you that one learns that that dialogue (or monologue, pretty much the same thing) is quite subtle and complex, essentially multi-layered.

Typically, when you think that you "think about nothing at all" it's just the most surface layer that has stopped, and more subtle talking to yourself is still going on. It takes training just to become able to notice and recognize it.

After all, it's just such a constant and monotone hum at the back of one's mind, one learns to completely ignore it.

So no, I would not take the word of people who were not trained to notice their internal monologue when they say they haven't any :-)

aeternum
0 replies
12h58m

Feynman came up with a potential test by tracking the things he could and couldn't do while counting seconds (via internal monologue). He found he generally could not talk while counting.

He then had others try and found that one of his mathematician friends was able to talk just fine while counting because it turned out he was counting visually.

https://www.youtube.com/watch?v=Cj4y0EUlU-Y

Terr_
0 replies
1d

The existence of an "inner monologue" isn't really a falsifiable claim.

Another possibility is that inner-monologues (ones not forced by conscious effort) do exist, but are just a kind of false-memory, something one part of our brain generates after-the-fact to explain/document the outcome of another non-language part.

Kind of like how certain reflex-actions can occur before the decision-making areas of the brain light up, yet humans will believe that they sensed the event and made a thinking choice.

layer8
15 replies
1d2h

It’s commonly conjectured that the emergence of human-level reasoning wouldn’t have been possible without the development of language. Personally, I’m able to suppress “word thoughts” in my head (for a short time), but then I lose almost all of my reasoning ability. I could imagine that reasoning is language-based even when it’s not conscious for some people. An internal process being there, and being conscious of it, are two separate things. We would be happy with an AI using an internal monologue without it being conscious of that monologue.

IggleSniggle
11 replies
1d1h

Maybe, but symbolic thought can get pretty far away from what we generally call "language." I bet you can reason 1+3x=22 pretty easily without any words whatsoever, or the sound of one ascending octave after another, or the approximate G-force induced on your body if you take the next turn without applying the brakes.

All of these forms of reasoning are true and useful calculations: when we talk about "intuition" what we usually mean is that we have a lot of experience and internal reasoning about a subject, but we struggle to translate it to and from the "language" part of our brain. Nonetheless, any social dancer will tell you that a dialog is possible just by receiving and inducing g-forces alone. You can reason this way about abstract concepts like orbits without ever touching a single word or concrete symbol.

Edit: the key aspect of reasoning, imho, is the ability to make predictions and introspect them against a database of other predictions, using an adversarial heuristic to weight the most plausibly useful results. Perhaps our pattern matching AIs of today just lack sufficient "experience" to do what we call reasoning.

hifromwork
3 replies
1d

I bet you can reason 1+3x=22 pretty easily without any words whatsoever

I've tried to do it, but I can't. I had to do something like "ok, so we subtract one from both sides and then it's easy, 3*7=21". Maybe I could do 2+8 but I still think the word ten "aloud".

twiceaday
2 replies
23h39m

I was able to do it with no words. I 'saw' the steps as if on a piece of paper. I saw 3x=22-1=21, then x=21/3=7. But I have a degree in applied math. Perhaps not internally vocalizing is just being extremely familiar. It also happened very quickly, perhaps there was no time to vocalize anyways.

hughesjj
0 replies
18h44m

To be fair, math is a language in itself... with many dialects, come to think of it.

At the end of the day though, thought requires communication, even if internal. Even physics is modelled as some sort of 'message passing' when we try to unravel what causality really is. Similar to how a processor has cycles, I think something similar (but unsynced) happens as part of what we call 'thinking'.

danielmarkbruce
0 replies
18h36m

A decent number of folks can jump straight to the answer on something so straightforward, no steps.

danielmarkbruce
3 replies
18h40m

Most people can't do 1 + 3x = 22 without any words or symbols. People who can don't realize that most people can't. I'd argue one isn't using logic when they do that, it's just very good pattern matching.

aeternum
2 replies
13h9m

It's also possible to do mentally by visualizing it rather than internal monologue. You can imagine the 1 on the left arcing over to the right, cancelling the 22 to 21, then the 3 moving under the 21 and the 21 descending through the 3 to become 7.

mercer
1 replies
1h25m

yup. I considered myself an /extremely/ verbal person when reasoning, but what I do with the above feels closest to 'moving the 1', almost like balancing a mental scale.

I never really noticed that before. I'm not great at math, fwiw.

danielmarkbruce
0 replies
35m

A decent number of folks just have the answer pop into their head, no logic or thinking required.

layer8
1 replies
1d1h

Regarding “1+3x=22”, I’m actually not sure, the number words certainly appear in my head when solving the equation. But even then, I would count “1+3x=22” as constituting language. Perception of sound, G-forces, and dancing don’t perform type-2 reasoning by themselves, so I don’t think your argument applies there.

Regarding your edit, no, I think the key aspect of the kind of reasoning we are missing in current AI is the ability to hold the reasoning in your mind, and to iterate on it and evaluate it (judge it) within your mind.

IggleSniggle
0 replies
1d

It is very difficult to have a discussion using words about non-word or non-symbolic semantics. I was pointing at several different plausible spaces for semiotics and how these spaces could be spaces for reasoning, in the hopes that one of them might be relatable.

If you use words in your mind when you use math, and you use words in your mind when you make or listen to music, etc., then it is very difficult to find a common ground where it is possible to see that these other realms of thought are capable of not only prediction, but also producing evidence that leads to judgement. That is to say, the key aspects of "reasoning." I picked them because I thought they had broad enough appeal to be relatable, and because I do not personally hear words in my head when doing any of these activities, whether it's calculus or tango, but I still find calculus and tango to be places where reasoning occurs.

Some of them, like math or music, are closer to the kind of symbolic thought we use when we discuss things with words. Others, like the experience of g-forces, are not. I present them as a sliding scale between "word based" reasoning and "non-linguistic" reasoning. Perhaps you can think of a realm that better fits your personal experience of intuition, and inspect whether these intuitions are capable of "real" reasoning in the absence of language, or whether intuition should never be trusted even when you have a great deal of experience in that area. Perhaps in your estimation, anything that cannot produce evidence that is articulable in word form is suspect.

Personally, I find all these methods, including language, to be suspect. I don't find language to be especially better at producing the kind of evidence for prediction, correct judgement, or discourse for reasoning than other methods, unless you reduce "reasoning" to tautologically require it.

One of the best tools of language is that we have writing that allows easy inspection or iteration of the written content; but these things are possible in other realms, too, it's just that we didn't have great tools for introspecting and iterating on their "ideas" except within our own minds. These days, those tools are readily available in many more realms of human insight.

_proofs
0 replies
2h23m

i feel it is worth pointing out, as another commenter highlighted, that language, and even more abstract symbolic languages, bring about fluency if you've practiced "speaking and writing" them enough.

i think native speakers hardly "think" about the steps necessary to form a grammatically correct expression, and most of the time just "know".

fluency is not the same as lacking an internal framework for interpreting, or thinking in terms of, symbols.

psadri
0 replies
23h52m

Language is a serialization of our brain's "world model" structures.

nathan_compton
0 replies
1d

Brains are weird. I reason almost entirely non-verbally and I would absolutely struggle if I had to laboriously express every thought in words. It's part of the reason I don't work well in teams. So slow!

bradfox2
0 replies
21h23m

What defines the boundaries of internal vs external? Certainly nothing about LLM weights or ops should.

CooCooCaCha
6 replies
1d2h

I think you’re using “inner monologue” too literally. It could be a progression of pictures, emotions, etc.

ajuc
3 replies
1d1h

With that definition even bacteria have inner monologue.

CooCooCaCha
2 replies
1d1h

Can bacteria imagine pictures? Do they have emotions?

Why does this matter? Stop being so pedantic. We're talking about a progression of ideas. Talking in your head is one form of ideas, but people can easily solve problems by imagining them.

ajuc
0 replies
1d

Initial thesis was - inner monologue is required for reasoning. If you define inner monologue to include everything brains do - the initial thesis becomes a tautology.

IlliOnato
0 replies
1d

Hmm, looks to me like just trading some words for others. Do bacteria have ideas? Does the navigating system in your car? How do you know?

We need to be at least somewhat pedantic, otherwise it's impossible to know what we are even talking about, and no way to establish anything.

IlliOnato
1 replies
1d

To make any progress on this question at all, we need first to come up with some definition of internal monologue. Even if we may need to modify it later, there has to be a starting point.

Otherwise, nothing can be established at all, because for any statement there always will be someone's understanding of "internal monologue" for which the statement is true, and someone's else understanding for which the statement is false...

hombre_fatal
0 replies
22h31m

I'm sure inner monologue just cashes out into the ability to reflect on your own thoughts. And for one to say that they're not having that experience also involves a claim about what they think other people are having which would make me doubly skeptical.

In practice, when you see people arguing about whether they have an "inner monologue" or can "mentally picture objects" on social media, it's more of a contest of who is the most unique in the world rather than anything that sheds clarity on our subjective experience.

mbesto
0 replies
1d

The fact that we don't actually have an understanding and framework for reasoning (e.g. whether inner monologue is a cause or an effect) means we are VERY off from general AI.

https://youtu.be/QGYbbLWn-IE?t=72

kveykva
0 replies
1d2h

Have there ever been studies that demonstrate that those individuals don't simulate possible state transitions they'll go through in a different modality? I'd be curious if they visualize actions they'll take still, just not verbally.

nomel
0 replies
18h42m

but we really want it to be internal and automatic.

Why? We have specialized semi-isolated lobes of our brain. Some "external to the LLM weights" guiding system just needs to be transparent to us. It doesn't need to be internal to the LLM/weights.

because what controls the internal thought process is itself not necessarily language-based

Most of the interesting AI things can't write a sentence (weather simulation, protein folding, inverse kinematics, driving cars, etc...). Language isn't a requirement, only some sort of meaningful mapping to the latent space of each is. I think one could claim that "meaningful latent space mapping" is the language of neural nets, including those in the brain.

logicchains
0 replies
1d1h

For reasoning of any complexity, some sort of internal monologue and iteration is needed.

From a formal perspective you're entirely correct. Transformers with chain-of-thought are strictly more powerful than transformers without it, and can efficiently solve classes of problems that would otherwise require exponentially increasing model depth: https://arxiv.org/abs/2310.07923

fennecfoxy
0 replies
5h44m

Yeah, I don't know why people ever thought that the current state of LLMs is enough to actually solve problems; it's trained to be very good at predicting the next word, but any logic-solving capability is only emergent from questions + answers in the training data. When it has to improvise, it either makes a randomly correct or similar-enough guess, or it falls over.

If these things were trained to actually solve logic problems then it might be different, but the training data would have to be logic problems. They've started at the top (language) rather than at the bottom/middle (logic, thought, reasoning), which should be done first - and then the model (or additional models) can be trained to put thoughts into words.

ericmcer
40 replies
1d2h

If you really think about what an LLM is you would think there is no way that leads to general purpose AI. At the same time though they are already doing way more than we thought they could.

Maybe people were surprised by what OpenAI achieved so now they are all just praying that with enough compute and the right model AGI will emerge.

solidasparagus
35 replies
1d2h

If you really think about what an LLM is you would think there is no way that leads to general purpose AI

It is an autoregressive sequence predictor/generator. Explain to me how humans are fundamentally different

throwanem
15 replies
1d1h

"Prove me wrong?" That's not how this works. Your implicit claim here is that human cognition and LLM functioning are fundamentally similar. That claim requires substantiation.

photon_lines
7 replies
1d1h

I actually did a full write-up on this here fyi: https://photonlines.substack.com/p/intuitive-and-visual-guid.... You can skip most of this and scroll down to the end-section called 'The Mental Model for Understanding LLMs' where I try to map how transformers are able to mimic human thinking. I think that comparing them to auto-associative / auto-regressive networks is actually a really good analogy FYI and I do believe a lot of human learning is based on this.

Workaccount2
2 replies
23h34m

We have no idea how human thinking works and no idea what consciousness is.

I'm getting so tired of listening to software engineers LARP pseudo neuroscientists with 6th grade level insights.

Of course, the [AI] brain isn’t ‘conscious.’ It doesn’t have any survival instincts which we humans do.

Bruh...

photon_lines
1 replies
23h24m

So instead of shouting insults, maybe you can shine some light on how anything stated there is wrong? Everything that I said there is true - we don't have a good mapping of what consciousness is - if you want to add some insights add them instead of making insulting and useless statements on HN.

Workaccount2
0 replies
23h10m

We have no idea if ChatGPT is conscious or not. We have no idea if it is "thinking" or not.

What we know with a reasonably high level of certainty is that consciousness and "thought" are physical processes. That's about it.

Pulling out the scalpel to start dividing up what physical process is and isn't conscious is a fool's errand. And especially foolish when just making up arbitrary qualifications for it.

Am I saying that ChatGPT is conscious? No. But what I am saying is that you shouldn't give much credence to people who are anything more than agnostic about it.

solidasparagus
1 replies
1d1h

An article this long needs a table of contents IMO.

photon_lines
0 replies
23h22m

Will note this and try to add a TOC next time I do this sort of write-up. This article wasn't supposed to be this long - I originally meant to make a more compact version but kept on finding more things which I needed to add in order to map to a correct 'mental model' of what ChatGPT is.

semi-extrinsic
0 replies
1d

We don't have sufficient understanding of human thinking and mental processes to make any claims like the ones you are trying to make. If we did, psychology would be a quantitative hard science; it most definitely is not.

nyrikki
0 replies
20h1m

Human neurons are continuous input, with active dendrites and dendritic compartmentalization. Spiking artificial NNs seem to hit problems with riddled basins so far. A riddled basin is a set with no open subsets.

Feed forward networks are effectively DAGs, and circuit like, not TM like.

Caution is warranted when comparing perceptrons with biological neurons.

Dendrites can perform XOR operations before anything makes it to the soma for another lens.

While there is much to learn, here is one highly cited paper on dendritic compartmentalization.

https://mcgovern.mit.edu/wp-content/uploads/2019/01/1-s2.0-S...

I think that the perceptron model of learning plasticity is on pretty shaky ground as being a primary learning model for humans.

solidasparagus
6 replies
1d1h

It is how it works if you are replying to someone who claims "If you really think about what an LLM is you would think there is no way that leads to general purpose AI". The counterexample is that human beings are considered general-purpose intelligence, and we are complex but fundamentally predictable systems (just not predictable by us today), with (as far as we can tell) deterministic outputs based on the state of the universe (including our physical being, like the brain and its chemicals).

throwanem
5 replies
22h38m

Responding to an unsubstantiated claim with an unsubstantiated claim just adds another unsubstantiated claim.

So far as I know, whether the universe behaves deterministically remains an unsolved question. Given that, your statement here would already be one of belief rather than fact, even before we get to the parentheticals. There is information here, but not about whether LLMs can develop into AGI.

solidasparagus
4 replies
22h0m

Fine, you can ignore my previous comment, that's just my answer to the question that this discussion ultimately takes you to. But I feel like you are just sitting on the sidelines making strawmen and playing pedantic games instead of saying anything constructive.

The original comment said:

If you really think about what an LLM is you would think there is no way that leads to general purpose AI.

This is an inflammatory way to state an extreme position on a well-discussed debate over whether next-token prediction can lead to general intelligence. The original commenter clearly believes it can't get you there. If you want to say that with any authority, you need to have an answer for what is different between what we consider general intelligence (for most people, this is simply human intelligence) and what models are capable of. This is the question at the heart of artificial intelligence.

I challenged them to explain their answer. I made no claims, I asked no one to prove anything wrong. If it is obvious that LLMs can't be AGI, the answer to how an LLM differs from human intelligence is also obvious, right?

Vegenoid
3 replies
20h23m

I made no claims, I asked no one to prove anything wrong.

Your original comment was:

It is an autoregressive sequence predictor/generator. Explain to me how humans are fundamentally different.

Which would be interpreted by most reasonable people as you making the claim that humans are autoregressive sequence predictors, and asking people to prove you wrong. I can see how you could say this without intending to make that claim, but most people will interpret this as you making that claim.

I do not intend to inflame things or discredit what you are saying, but just to say that if you did not intend to make a claim or ask people to prove you wrong, a different approach would be more successful in the future.

solidasparagus
2 replies
20h7m

You're right, it was hastily written and I was annoyed.

But I generally hold out hope that people can see a claim "A!=B" and a response "A=C, explain how C!=B" and understand that is not the same as claiming "C=B", especially on HN.

throwanem
0 replies
19h53m

I do remain convinced my interpretation was sound, but on review I have to concede it was also quite uncharitable.

With all the wildly overheated claims that've been flying around since the advent of these models, I may be myself somewhat overfitted. Granted, in such an environment I feel like a little extra care for epistemic hygiene is warranted. But there was no reason for me to be rude about it.

Vegenoid
0 replies
19h45m

I know what you mean. Unfortunately, it's easy for frank and concise language to be taken the wrong way when in written form (and sometimes even verbal form). I wish I didn't have to make qualifiers about my intent on my internet comments, but I often do, to try and make sure that other people take my comment the way I intended it. I think it generally leads to better discussion.

I don't blame people for not wanting to talk this way.

indoordin0saur
10 replies
1d1h

AI needs to see thousands or millions of images of a cat before it can reliably identify one. The fact that a child needs to see only one example of a cat to know what a cat is from then on seems to point to humans having something very different.

solidasparagus
3 replies
1d1h

AI needs to see thousands or millions of images of a cat before they reliably can identify one.

Not if they inherit from a previous generation of AI. But even if they did, a different training speed does not imply a different capability

indoordin0saur
2 replies
1d1h

My point is not that humans have a faster training speed but that humans must be doing something fundamentally different from LLMs. You could build Altman's $7 trillion dollar GPU cluster and use the majority of the world's energy to feed it and you'd still hit the same limitations if you're just running an LLM on it, even a very sophisticated LLM. This is Yann LeCun's position as well.

solidasparagus
1 replies
1d

That, as I understand it, is not a valid chain of logic. Requiring fewer data points does not inherently indicate that the underlying mechanism (autoregressive sequential generation, not the transformer, which is just an architecture) is different.

Not to mention the secondary arguments, like: there's no proof that humans learn faster from fewer data points, that's just your assumption in the sibling comment. Humans inherit information. The equivalent - fine-tuning a foundation model - is very fast at learning novel objects.

Just because someone has a Turing award doesn't mean they know what they're talking about. They are just people, with strengths and weaknesses like everyone else. But often on the extreme end of strengths and weaknesses.

retros3x
0 replies
2h28m

Well, the same thing goes for you - just because someone posts on HN doesn't mean they know what they're talking about. And if I have to decide whose assessment I trust regarding AI, I take the Turing award winner who worked for almost 40 years on AI over a random guy from the internet.

TeMPOraL
3 replies
1d1h

Humans train on continuous video. Even our most expensive models are, in terms of training set size, far behind what an infant processes in the first year of their life.

EDIT: and it takes human children a couple years to reliably identify a cat. My 2.5 y.o. daughter still confuses cats with small dogs, despite living under one roof with a cat.

indoordin0saur
2 replies
1d1h

I contend that you could show any child old enough to communicate in basic English a photograph (so not live continuous video) of some obscure animal they've never seen before (say an Okapi) and they'd be able to easily identify another Okapi when seeing one at a zoo.

bongodongobob
0 replies
1d

So you're just going to ignore the 5 years of continuous training? I'm not sure what point you're trying to make.

TeMPOraL
0 replies
1d

My daughter is 5 y.o., which means, because of kindergarten, I spend plenty of time around kids this age. A random kid this age would absolutely fail your test. They may remember the word after one exposure, but I doubt they'll remember any of the distinctive features.

Hell, many adults would fail it. I'm not sure if I could pass such a test - in my experience, you remember the important details only after first experiencing a test and realizing what exactly it is that would be useful in distinguishing the two animals.

GaggiX
1 replies
23h48m

If the model is first pre-trained on unlabeled images, then it takes about 10 labeled images of cats and 10 labeled images of dogs to train a (possibly strong) classifier (example: DINOv2), I doubt humans will do better.

TeMPOraL
0 replies
21h24m

That's a good point; when comparing performance with humans, one has to remember that a human spends years of training on unlabeled images in form of continuous video stream, on top of audio streams and other senses, before they're able to communicate with you so you could test them.

slashdave
6 replies
1d1h

Even language is not sequential.

solidasparagus
5 replies
1d1h

Tell me more?

slashdave
4 replies
1d

Language is only sequential in the form it is transmitted (verbally). There is no reason that sequential statements are generated sequentially in the brain. Quite the opposite, really, if you consider rules of grammar.

I really wish most of the LLM folks just took a few courses in linguistics. It would avoid a lot of noise.

solidasparagus
0 replies
23h29m

Any pointers for where to start? Linguistics is dense, the terminology is a slog, it seems huge, and I was never clear that there was real value to be gained. But it's always useful to be more informed.

naasking
0 replies
22h13m

LLMs don't generate their language sequentially either, they just output it sequentially token by token.

TeMPOraL
0 replies
21h27m

Whoever cares about "rules of grammar"?

Grammar is descriptive, it formalizes the language so it doesn't break down into regional dialects too fast, and otherwise is just a crutch for people learning the language, especially if it's not their first language. The way you acquired your first language is the same way LLMs learned to utter grammatically correct sentences: by being exposed to lots and lots of examples, and eventually getting a feel for it. Similarly, if you're fluent in a language, you don't even think of grammar when using it - the right phrases in correct forms just come to you.

GaggiX
0 replies
23h55m

Since LLMs seem to follow the rules of grammar quite well, and this is quite the opposite of "sequential", I guess LLMs do not process information very sequentially.

retros3x
0 replies
2h8m

We don't know how we are different, but it's pretty clear that we are fundamentally different. Modern machines calculate much faster than humans; if human thought and reasoning worked like an LLM, we would already be sitting in front of general-purpose hard AI. But we aren't, so obviously there is something missing. Unfortunately it's quite hard to know what is missing, because we don't know how consciousness and reasoning actually work and we can't formalize them. That's why it's fair to assume we won't see general-purpose AI that fast (probably not in our lifetime), which a lot of AI researchers seem to agree with, especially those who are not working for companies trying to sell AI.

more_corn
3 replies
1d2h

LLMs are the language center of an AI. They in no way purport to be: long term memory, critical thinking engines, calculators.

If we want those things we can build them. Building them into the language center would be absurd and weird.

viking123
0 replies
1d1h

In the brain all those things are neural networks too, memory is encoded in the weights. So if we want to have that type of system, everyone should have their own model and the weights would actually adjust all the time just like in our brain, but now it is just trained once and that is that

lupire
0 replies
1d

And yet this is what all the big companies are trying to do, because the LLM is an incredibly expensive fixed cost, with only a marginal cost for each new task.

daveguy
0 replies
1d1h

This is a very good take on the current state of LLMs. They are able to articulate surprisingly well and LLMs or something similar will definitely be part of a larger AI model in the future. They are good creative aids, but they should not be trusted.

photon_lines
23 replies
1d3h

There's actually a pretty simple solution to this that I thought about testing out: ask the model to reconstruct the problem using a logic language (like Prolog) and to execute that type of program in order to come up with a solution, rather than attempting simple chain-of-reason training or other methodologies of getting the model to 'reason' through some of these examples. People forget that humans don't come up with their logical models out of the blue - it takes years of elementary school for us to understand the world and solve problems in it. The logic programming approach is, I'd say, really promising, but you would need to feed the LLM a LOT of examples for it to work, and currently I'm not even sure that we have enough training data to implement something like this.
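To make the idea concrete, here is a minimal sketch of the kind of formulation I have in mind (the predicate names and the sibling counts are just illustrative placeholders, not taken from the paper):

    % Illustrative facts: Alice has 3 brothers and 6 sisters (placeholder numbers)
    brothers(alice, 3).
    sisters(alice, 6).

    % Each of Alice's brothers has all of Alice's sisters, plus Alice herself
    brother_sister_count(N) :-
        sisters(alice, S),
        N is S + 1.

    % ?- brother_sister_count(N).
    % N = 7.

The point is that once the puzzle is stated as facts and rules, getting the answer is a mechanical query rather than a judgement call by the model.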

nerdjon
10 replies
1d3h

I honestly thought about this recently when I was trying to see the limits of Claude Opus. For some of the problems I gave it, what if instead of telling it to solve the problem I asked it to write a script and then give me the command and inputs needed to properly run it to get the answer I needed? That way, instead of relying on the LLM to do proper analysis of the numbers, it just needs to understand enough to write the logic.

It is an interesting prospect but I feel like it has some limitations. For math problems like this one, yeah it should be simple to write a script to do it. But it does first have to understand the core thing here that Alice would be one of the sisters of the brother to write the script accordingly.

But I would think this would not scale well when dealing with far more complex issues, particularly ones that may not just be simple math logic. If the request was to write a scientific paper for example, it might have to make several scripts and call them multiple times, and that assumes that it understands what it is doing enough to properly make the script.

photon_lines
6 replies
1d2h

I'm not sure what you mean by 'it will not scale well.' When we humans learn that we make a mistake - we make a note and we hold the correct answer in memory - the next time we're prompted with a similar prompt, we can use our old memories to come up with the correct solution. I just did a simple test for this same exact problem using ChatGPT 3.5:

'Can you reformulate the following problem using Prolog? When you execute the program - what answer do you get? 'Alice has 8 brothers and she also has 5 sisters. How many sisters does Alice’s brother have?'

Even though it came up with a proper Prolog formulation - it failed to account for the fact that Alice is one of the sisters - so I gave it a simple prompt about this and it was able to correct the formulation and come up with the proper solution:

My prompt after ChatGPT gave me an answer of 5: 'Yes but you aren't including Alice in your logic. Shouldn't the correct answer be 6 (since Alice is also a sister)?'

Response:

'You are correct. The correct answer should account for Alice being one of the sisters. Let's adjust the Prolog program to include Alice in the calculation:

    % Facts
    brothers(alice, 8).
    sisters(alice, 5).

    % Rules
    sisters(Brother, NumSisters) :-
        sisters(alice, Num),
        NumSisters is Num + 1.  % Include Alice herself

    % Query
    ?- sisters(alices_brother, NumSisters).

In this updated Prolog program:

The rule for determining the number of sisters includes Alice herself by adding 1 to the number of sisters Alice has.

When we execute the query ?- sisters(alices_brother, NumSisters). in Prolog, we should get:

NumSisters = 6.

This correctly reflects that each of Alice's brothers has 6 sisters, including Alice herself.

When I prompt it to answer this question using Prolog now, I always get the correct answers no matter which integers I feed in, so it seems like the model is able to generalize after I show it where its 'reasoning' was wrong. Basically, my suggestion is to get the model to recognize logic puzzles like this and try to formulate them in terms of logic programming queries which it can use and tune in order to come up with correct answers, rather than the simple auto-associative chain-of-reason training which current GPT models rely on. But like I said - this is my hypothesis, and I believe this would work much better in getting these models to 'generalize' than the current approaches we're using. Hopefully this helps.

pbhjpbhj
1 replies
1d1h

When you ask again the prompt includes the context of your previous question and correction. When I ask the prompt doesn't have that context so the model fails to give me the correct answer.

I'm using the default free model in the app, based on GPT4.

photon_lines
0 replies
20h40m

Yup - well this is where my suggestion is to change the GPT architecture. You can think of a logic program functioning as the 'frontal lobe' of the general pre-trained auto-associative model. This 'frontal lobe' region would try to come up with logical sequences to go along with its internal auto-associative representations. Of course, the logic programming piece is just one approach - maybe chain-of-thought or chain-of-reason prompting could work here too, as many humans I think use this chain-of-reasoning approach themselves. Logic programming to me would function as a suggested shortcut.

mcguire
1 replies
22h37m

"My prompt after ChatGPT gave me an answer of 5: 'Yes but you aren't including Alice in your logic. Shouldn't the correct answer be 6 (since Alice is also a sister)?'"

Useful, if you know what the answer is. What happens if you don't give it the correct answer?

photon_lines
0 replies
20h43m

What happens when you don't give a human being the correct answer? :)

immibis
1 replies
1d

What if you prompt it "You seem to have accidentally included Alice. The correct answer should be 4"?

photon_lines
0 replies
23h20m

Yup this is a good example. This is because the model has no conception of what 'causality' is or how to try to come up with a correct 'model' - humans have a visual system which helps them out, but for LLMs I can definitely see your point and yup in these instances - if you feed in garbage data then yeah - you will get garbage out.

ip26
2 replies
1d2h

I don’t understand why LLM’s aren’t already set up to do what you describe automatically behind the curtain. Extract a math equation from text (LLMs are good at translating between languages right?) and immediately evaluate it on the host CPU.

LLM is the equivalent of recalling your times tables. Computer arithmetic is the equivalent of re-computing your times tables.
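To make that split concrete (just a sketch, not how any particular product does it; term_string/2 is SWI-Prolog specific, and the expression is a made-up example), the evaluation half is the easy part once the expression has been extracted:

    % The LLM's only job would be to extract the expression text;
    % the arithmetic is then done mechanically by the runtime.
    eval_expr(Text, Value) :-
        term_string(Expr, Text),   % parse "1 + 3 * 7" into a term
        Value is Expr.             % evaluate it on the host CPU

    % ?- eval_expr("1 + 3 * 7", V).
    % V = 22.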

pbhjpbhj
0 replies
1d1h

ChatGPT does do this sort of process for arithmetic now; it converts word-based problems to mathematical notation and then solves them.

dragonwriter
0 replies
1d2h

I don’t understand why LLM’s aren’t already set up to do what you describe automatically behind the curtain.

LLM-based systems with tool use (which this is an application of) often are, to an extent, the issue is tuning the (behind the scenes, system) prompting so that they use appropriate tools in every case where they should, and do so correctly. (There's also a cost factor involved since behind-the-scenes tool use means multiple LLM round trips to answer the question, so tuning the system to use tools more aggressively makes the system more expensive.)

sollewitt
2 replies
1d2h

Right, and do you verify the result?

You have to know what the answer is supposed to be before you can write a test case.

photon_lines
0 replies
1d2h

Yup - well you feed in the prompt along with an answer and you get the model to produce outputs and check for discrepancies. If the answer is wrong then the model adjusts -- this is the way backpropagation works....I think there are huge advantages in using logic languages in order to represent some of these data sets rather than simple English or the current chain-of-thought reasoning approaches -- backpropagation as an example isn't really used in the human brain, but it leads to great results in mimicking how neural networks 'learn' - in the same way, we don't have to have the full formal picture of how humans model the logical world in order to achieve great results. We can simulate this using logic programming or even general programming or at least that's my conjecture.

IanCal
0 replies
22h57m

I can have more confidence that my calculations are correct using a calculator compared to doing it by hand, even if I don't know the exact right answer beforehand.

skyde
1 replies
9h52m

Given that Alice has 13 brothers and 31 sisters, we can update the Prolog program with this information. We need to adjust the fact about Alice's siblings and then use the rule to calculate the number of sisters her brothers have.

Here is the updated program:

```prolog
% Facts
siblings(alice, 13, 31). % Alice has 13 brothers and 31 sisters

% Rules
% Alice's brothers have M + 1 sisters, since Alice is also their sister.
sisters_of_brother(TotalSisters) :-
    siblings(alice, _, M),
    TotalSisters is M + 1.

% Initialization goal
:- initialization(main).

% Predicate to find and print the total number of sisters Alice's brothers have
main :-
    sisters_of_brother(TotalSisters),
    writeln(TotalSisters).
```

In this program, the fact `siblings(alice, 13, 31)` represents that Alice has 13 brothers and 31 sisters. The rule `sisters_of_brother(TotalSisters)` calculates the total number of sisters that Alice's brothers have by adding 1 to the number of sisters Alice has (since Alice is also their sister).

When we run this program, it will print the total number of sisters Alice's brothers have:

1. Alice has 31 sisters.
2. Each of Alice's brothers will also consider Alice as a sister.
3. Therefore, each brother has 31 + 1 = 32 sisters.

The output of the program will be:

```
32
```

Thus, Alice's brothers have 32 sisters.

defrost
0 replies
9h43m

Problematic in the real world meat space to say the least.

Assuming Alice indeed has that many brothers and sisters (possible) it's highly unlikely they all have the same pair of unique parents.

The Set {} of siblings that share at least one parent with Alice (the siblings of Alice) can easily include one brother of Alice (same father, different mother) who doesn't share a common parent with any other sibling of Alice.

asadotzler
1 replies
23h45m

If that kind of thing worked, we'd have been doing it long before LLM chatbots.

IanCal
0 replies
22h56m

Yet tools like GPT4o can do this.

It's not a trivial problem, taking a human written description and rewriting it as a prolog program.

CooCooCaCha
1 replies
1d3h

I’m curious how this would work considering knowledge can be fuzzy.

Like if I’m out camping and I sit on a log or a rock those things are not what people usually think of as chairs but they can serve as chairs in that situation.

photon_lines
0 replies
1d2h

You can get models to actually show that 'logs' could function as 'chairs.' You're forgetting that we humans learn this as well, but we learn it in a much simpler manner than LLMs do, so someone has to explicitly let models know what assumptions they can make. You get the LLM to write Prolog programs and learn associations in this manner. As the model gets better at logically modelling the problems, the solutions to prompted problems like this should get better.

mcguire
0 replies
22h43m

I just tried that with ChatGPT 3.5 (4o stopped responding after I asked the initial question and it produced the wrong answer). Here's the Prolog it generated:

    % Define the number of brothers and sisters
    brothers(4).
    sisters(1).

    % Predicate to calculate the number of sisters a brother has
    brother_sisters(NumberOfSisters) :-
        sisters(NumberOfSisters).

    % Query to find out how many sisters a brother has
    ?- brother_sisters(Sisters).
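
The bug is exactly the one the paper is probing: the generated program never counts Alice herself among her brothers' sisters. For comparison, a corrected version (my sketch, not ChatGPT's output) is a one-line change:

    % Define the number of brothers and sisters
    brothers(4).
    sisters(1).

    % A brother's sisters are Alice's sisters plus Alice herself
    brother_sisters(NumberOfSisters) :-
        sisters(M),
        NumberOfSisters is M + 1.

    % Query to find out how many sisters a brother has
    ?- brother_sisters(Sisters).
    % Sisters = 2.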

astrobe_
0 replies
1d1h

asking it to execute this type of program in order to come up with a solution

I may be showing my ignorance about this tech here, but I believe the LLM doesn't even try to solve a problem; it tries to generate a discourse that could pass as a solution or answer to the problem. That's more or less what the abstract states, if I understand it correctly. But in no way does it try to apply some sort of mechanical reasoning like inference engines do.

To me the solution to this is to pair the LLM with mechanical computation, that is, an inference engine or an equation solver, rather than recombining the millions of solutions for similar problems it has seen in its training set. I believe I remember reading about teams attempting this approach. I can imagine, for instance, that if the LLM is in some way able to ask questions and use the answer, maybe it could just generate a prompt for an equation solver and include the result in its answer.
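
For the "mechanical computation" half, the solver side really is trivial; the hard part is getting the LLM to translate the prose into something like this (a sketch using SWI-Prolog's clpfd constraint library):

```prolog
:- use_module(library(clpfd)).

% One relation, usable in either direction: the kind of mechanical
% reasoning an inference engine or solver gives you for free.
brothers_sisters(AliceSisters, BrothersSisters) :-
    BrothersSisters #= AliceSisters + 1.

% ?- brothers_sisters(31, S).   % forward:  S = 32
% ?- brothers_sisters(A, 32).   % backward: A = 31
```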

IanCal
0 replies
22h58m

I took one of the problems that gpt4o got wrong, and asked gpt4o what tools it could use. It suggested and wrote Prolog for me which (with one pass back to get things defined in the right order) correctly worked out the answer.

Mmskynettio
12 replies
1d2h

It's not AI hype. Hype is defined as something which gets oversold: "promote or publicize (a product or idea) intensively, often exaggerating its benefits."

Just yesterday I visited a Google Cloud summit, and one person from Bosch told the audience how they are now able to work with fewer external agencies like copywriters, graphic designers and photographers for their materials.

It already saves money, has real impacts and continues to progress.

We also don't know what ChatGPT 5 will bring, because they say it will do more reasoning than before, but we (people / our society) are already working on solving this in different ways: from code which creates a unit test first and then the code, to different types of architectures.

For me, 2024 was the year of LLM cost reduction and of LLMs getting big context windows.

AI doesn't need to be ready tomorrow, but its capabilities are already really good. And I know plenty of people around me who are a lot less interesting to talk to than any LLM (from a human skill/knowledge point of view).

Llama 3 was also a big achievement in 2024. Facebook showed that better data leads to better quality for smaller models.

We have not only entered the AI era but also the 'gather all the knowledge we can, quality-check it and refine it because now we can actually do something with it' era.

We are in the knowledge feedback-loop era.

nurple
5 replies
1d

Your post is complete hype, all about people saying things instead of showing things that've actually been done.

For me, 2024 was the LLM exposed as basically pure hype year.

There is no expert in any field I follow online who posts results from AI tooling for any reason other than to show how awful it is. I consider myself an expert in software, and LLMs specifically have only caused me great pain.

Even the one situation you describe, where someone describes the ability to work in an absolute vacuum, sounds like a huge negative to me. The recent push for DEI policies was even ostensibly about the importance of people of diverse backgrounds and viewpoints working together.

The most important thing you're missing a sense of scale on is the step you describe as "quality check it". On things I don't know and have attempted to enlist an LLM's help with, in every case I have had to go back and just actually learn how something works, after time wasted struggling with subtle wrongness in the output.

At least I have the background expertise to do that. However, I have seen a junior dev's mind get literally rotted by too much time in pure LLM land. Besides the cost of rewriting their code, the company was now the proud owner of a young dev with a mind filled with nonsense.

How do you even weigh the cost of fixing a corrupted human mind?

bongodongobob
3 replies
1d

Eat something and take a nap, you sound unhinged.

ChatGPT has nearly doubled my work output; most of my job is sysadmin infra-type stuff, and it's ridiculously good at troubleshooting odd issues.

Hopefully you can find a use case for it someday, until then, the rest of us will continue to be more productive.

hellojesus
2 replies
1d

Do you have any concern about the data you're feeding to the vendor serving your prompts?

I've had junior devs tell me they use chatgippity to combine Excel workbooks, and when I confirm they're not self-hosting an LLM to do it, I ask if they think it's a good idea to hand over company data to OpenAI. They don't care.

In a world of tight security, I find it astonishing that so many people willingly give away trade secrets to these companies, who can sell them to any bidder if they choose.

bongodongobob
0 replies
20h20m

I do think it's fine yes, but I also am not giving it anything sensitive.

Mmskynettio
0 replies
6h21m

So you are not using Office 365?

Because our company does.

Mmskynettio
0 replies
10h5m

I'm getting junior devs who are brainstorming and troubleshooting with ChatGPT before they start talking to me.

I have a good intern who is much faster with ChatGPT than before and learning well.

Lich
2 replies
1d2h

AI doesn't need to be ready tomorrow, but its capabilities are already really good.

A majority don’t deny that it’s good. The problem is that so many think it is actually reasoning, believing the answers can be trusted.

rvnx
0 replies
1d1h

It's generally reasoning better than me, so when using them I make fewer mistakes than if I followed my own logic.

Mmskynettio
0 replies
1d1h

Does it need to do reasoning perfectly?

If it has created meta-concepts from billions of words on the internet, and has meta-models which are correct and both broader and better than an average human's, isn't it actually good at reasoning?

'So many think it is actually reasoning' is a very narrow basis on which to say AI is just hype or that everything we are doing is a waste, etc.

There are human benchmarks they are winning at. The criticism could rather be that we don't have enough benchmarks.

daveguy
1 replies
1d1h

We also don't know what ChatGPT 5 will bring, because they say it will do more reasoning than before...

This paper very clearly demonstrates these LLMs are not reasoning in a fundamental way. Token prediction and reasoning are two different tasks. They may be related, but they are not the same. "Just wait for GPT 5, it will be amazing!" is part of the hype.

Please do not assume an LLM is correct in skill or knowledge unless you already know the answer or can verify by other means.

Mmskynettio
0 replies
1d1h

The problem is that we don't know how we do reasoning.

I calculate stuff by following a formula after I've pattern-matched a problem I already know.

Plenty of humans are not able to solve those math problems.

If the future of LLMs / AI becomes an LLM with multimodality and mixture of experts, and that solves those reasoning problems, we still won't know whether this is a different type of reasoning from what humans do.

slashdave
0 replies
1d

There is no feedback. You cannot create new knowledge out of thin air.

glitchc
10 replies
1d2h

The problem is a good chunk of the global population is also not reasoning and thinking in any sense of the word. Logical reasoning is a higher order skill that often requires formal training. It's not a natural ability for human beings.

slashdave
6 replies
1d1h

Seriously? You think individuals are incapable of reasoning without training first?

glitchc
2 replies
1d1h

Yes, seriously.

Some examples: An individual without training cannot reliably separate cause from effect, or judge that both events A and B may have a common root cause. Similarly, people often confuse conditionals for causation. People often have difficulty reasoning about events based on statistical probabilities. Remember, the average person in North America is far more terrified of a terror attack than an accident or a heart attack, yet the latter two are much more likely to be their cause of death.

slashdave
1 replies
1d

You mean without training, people cannot frame answers in the terms you've learned from training. Well, why are you surprised?

If you think reasoning is limited to the frameworks you learned from a book, you live in a small world.

glitchc
0 replies
22h36m

In an effort to be pejorative, you have reduced the process of learning to a basic regurgitation of facts. Rather, learning about reasoning is more about developing a framework that allows one to construct claims based on available information and then check the veracity of those claims while avoiding all, or at least typical, fallacies.

Critical thinking requires a certain amount of rigor, which formal education is well-suited to impart. It can be self-taught with a hefty dose of discipline, but it cannot be intuited.

TeMPOraL
2 replies
1d1h

Do you think they are? We haven't had a case of individuals without training in like 10 000 years.

rep_lodsb
1 replies
23h31m

And judging by some comments here, humans must have gotten dumber since then. Or we would never have invented writing, agriculture or even basic clothing.

"This problem is actually not that easy, the average person couldn't solve it either, especially if the numbers were bigger", "Yet another cherrypicked clickbait study to make LLMs look bad, those people are just scared of being made obsolete", etc.

TeMPOraL
0 replies
21h15m

Or we would never have invented writing, agriculture or even basic clothing.

That's the thing. We had to invent those things. Along with counting, numbers, logic, arithmetic, and those stupid-ass annoying family tree riddles. We didn't get them in one step, it took a long time to build one on top of the previous. Generation by generation, each cohort of children growing up in a slightly more complex world than their parents, each generation taught how to navigate this complexity by their families and tribes. Learning a growing collection of facts and beliefs and life hacks.

There were no untrained people for as long as humanity existed. The minimum reproductive unit of homo sapiens is a village.

bithive123
1 replies
1d

In "any sense" of the word? Surely anyone who adjusts their behavior when they get undesired or unexpected results is reasoning and thinking. And since most activities are mediated by thought of some kind, most people are reasoning and thinking otherwise they would never recover from even simple mistakes, like walking east when they need to go north.

Saying they're "not thinking in any sense of the word" because they can't solve predicate logic problems from a college textbook is a rather odd claim. Surely those things arise from reasoning and thinking, rather than the other way around.

hellojesus
0 replies
23h52m

This seems to me to be where these systems need to go in the future, akin to reinforcement learning.

You feed an LLM a prompt. It then abstracts and approximates what the result should be. It then devises a hypothesis, solves it, and compares it to the approximated output. From there it can formulate a new hypothesis and evaluate it, based on the outcome of hypothesis 1, and either keep iterating or dump that path for a new one (e.g., the next best hypothesis in the original formulation).

At some point the answer is "good enough." But along the way it keeps playing against its thoughts to see if it can do better.

A key issue may be the original approximation, so it may need to consider its adjustment when iterating.

Maybe this is how cutting edge llms work now. I have no idea.

naasking
0 replies
21h59m

Logical reasoning is a higher order skill that often requires formal training. It's not a natural ability for human beings.

I've read your comments here, and while I understand your point I think you have it backwards. The only reason we formed societies is because we evolved an innate theory of mind to reason about how others might be thinking and feeling. That's reasoning. We have a natural ability to do limited arithmetic, otherwise we wouldn't be able to hunt, gather or track enough to endure winters, or keep track of our sheep or children for that matter. That's reasoning.

Reasoning is a natural ability for human beings, but we also carry a lot of evolutionary impulses that add a lot of noise to the decision process, eg. observation->judgment->[set of possible decisions], judgment has "reason" as one path that adds to the set of possible decisions, but there remain other paths we inherited from our evolutionary roots. Education is training that suppresses poorly calibrated judgment paths that lead to frequent mistakes in the decision set, but reasoning remains, thus education improves the signal to noise ratio of our decision making.

So I 100% disagree that an individual cannot separate cause and effect without training. They will just be worse at it than someone who is trained to filter out those impulses that lead us to jump to conclusions, eg. they will produce more noise / a larger set of possibilities than reason would allow.

colechristensen
9 replies
1d3h

I don't think anybody who actually uses LLMs thinks they are general purpose AI.

Like you ask it to do one thing it's amazing, but then you try to modify or do something with extra steps, or just anything with any complexity to it and it falls over.

dr_dshiv
2 replies
1d2h

Well, I believe LLMs are general purpose AI. And, for the record, so does Peter Norvig. He coauthored the most popular textbook on Artificial Intelligence, so this belief can be considered a somewhat mainstream perspective. His article, “Artificial General Intelligence Is Already Here“ [1] is really worth reading.

https://www.noemamag.com/artificial-general-intelligence-is-...

lupire
1 replies
1d

I can't tell whether Norvig is making a subtle point that artificial general stupidity is still an important form of artificial general intelligence, or whether he somehow vastly overestimates AI model performance on non-memorization and summarization tasks.

dr_dshiv
0 replies
18h49m

You seem to have an axe to grind. You might spend more time trying to get quality out of the models. It is possible!

nerdjon
1 replies
1d3h

I don't think anybody who actually uses LLMs thinks they are general purpose AI.

I would like to believe that but I have had too many conversations with people who basically think it already is. Including in one situation of a fellow engineer.

It feels like more and more "we" are in a bubble of actually having some knowledge of how this works, what the actual limitations are, and what it just is not. While there is in fact criticism of it out there, particularly around AI "art", it doesn't seem to be focused on the area we are talking about.

goostavos
0 replies
1d1h

I dunno. If memes are anything to go by, people are frustrated by Yet Another Chat Bot standing between you and the human you need to solve your problem. The fact that it's a better, more human-like chatbot doesn't mask the frustration of being forced to talk to a computer with limited agency.

solidasparagus
0 replies
1d2h

The vast majority of people use a model built by someone else through a high-level abstraction and then make broad claims about what the technology will be capable of.

Then you have the people who are leveraging the technology to train models from scratch and that population is far more apt to believe that large models can be general purpose AI (by some definition). Sure there are other things at play like money, publicity, reputation, and a desire to do something important. But there is also the very clear trend line where transformers are able to model pretty much any sequence of tokens where there is pretraining scale data available. Whisper was not a surprise. GPT-4v was not a surprise. Sora was not a surprise.

mupuff1234
0 replies
1d2h

They might not think that but they sure as hell are trying to sell that idea.

kragen
0 replies
1d3h

they're pretty general-purpose; you can ask the same model for recipe suggestions, a fanfic, or verilog. like, they're far, far more general-purpose than any humans i know

it's true that they're not very reliable, but they seem to be not very reliable across many different domains. and they don't seem to be particularly less reliable than the average human, so i think possibly your standards for 'general purpose ai' are set high enough that you would declare humans to be unintelligent (or perhaps not 'general-purpose') if you applied them consistently

you can certainly find particular domains where humans can still do things llms can't, but i haven't seen a persuasive account of why those domains are the more important ones, and of course the converse is also true

James_K
0 replies
1d3h

I don't think anybody who actually uses LLMs thinks they are general purpose AI.

They are being sold as such. Most people don't know anything about the topic and will buy that marketing. The entire concept of these models is that you can put a whole bunch of data in and eventually some kind of magic will happen and you get AGI out. They would not see the kind of investment that they do if all that was being promised was "really good predictive text". In fact some philosophers argue that sentience is just really good predictive text to try and make the point that these models are AGI.

wtbdqrs
6 replies
1d2h

I appear to be reasoning at times but I have mostly no idea what I am talking about. I hit a bunch of words and concepts in the given context and thus kind of hallucinate sense.

Given a few months of peace of mind and enough money for good enough food, I could actually learn to reason without sounding like a confused babelarian.

Reasoning is mostly a human convention supported by human context, and it would have been a different one if the Fascists had won the war or the Soviet Union hadn't become corrupted.

But none of that has anything to do with pulling up a whiteboard to draw some flowcharts and run some numbers, all of which is why I am certain there is nothing the devs have "to fix". It took most reasonable humans many generations to learn stuff. Very few of us did the actual work.

It's all just a matter of time.

voxic11
5 replies
1d2h

Yeah, I think these chatbots are just too sure of themselves. They only really do "system 1 thinking" and only do "system 2 thinking" if you prompt them to. If I ask gpt-4o the riddle in this paper and tell it to assume its reasoning contains possible logical inconsistencies and to come up with reasons why that might be then it does correctly identify the problems with its initial answer and arrives at the correct one.

Here is my prompt:

I have a riddle for you. Please reason about possible assumptions you can make, and paths to find the answer to the question first. Remember this is a riddle so explore lateral thinking possibilities. Then run through some examples using concrete values. And only after doing that attempt to answer the question by reasoning step by step.

The riddle is "Alice has N brothers and she also has M sisters. How many sisters does Alice’s brother have?"

After you answer the riddle please review your answer assuming that you have made a logical inconsistency in each step and explain what that inconsistency is. Even if you think there is none do your best to confabulate a reason why it could be logically inconsistent.

Finally after you have done this re-examine your answer in light of these possible inconsistencies and give what you could consider a second best answer.

daveguy
2 replies
1d1h

After you answer the riddle please review your answer assuming that you have made a logical inconsistency in each step and explain what that inconsistency is. Even if you think there is none do your best to confabulate a reason why it could be logically inconsistent.

LLMs are fundamentally incapable of following this instruction. It is still model inference, no matter how you prompt it.

voxic11
1 replies
11h31m

For me it did in fact follow the instruction and found the right answer. I have found this strategy of asking the LLM to analyze why its answer might be wrong results in better final answers. I think this allows the LLM to "use tokens to think", which allows the attention layers to gather the right cognitive resources to answer more subtle questions effectively.

daveguy
0 replies
5h44m

You are confusing "providing the right answer" with "following the instruction". Cause vs correlation. From an algorithmic perspective it is literally doing the exact same thing regardless of prompt.

zeknife
0 replies
1d

If you had a prompt that reliably made the model perform better at all tasks, that would be useful. But if you have to manually tweak your prompts for every problem, and then manually verify that the answer is correct, that's not so useful.

cpleppert
0 replies
22h20m

There isn't any evidence that models are doing any kind of "system 2 thinking" here. The model's response is guided by both the prompt and its current output so when you tell it to reason step by step the final answer is guided by its current output text. The second best answer is just something it came up with because you asked, the model has no second best answer to give. The second best answers always seem strange because the model doesn't know what it means to come up with a second best answer; it 'believes' the output it gave is the correct answer and helpfully tries to fulfill your request. Sometimes the second best answer is right but most of the time its completely nonsensical and there is no way to distinguish between the two. If you ask to choose it will be strongly influenced by the framing of its prior response and won't be able to spot logical errors.

Asking it to do lateral thinking and provide examples isn't really helpful because its final output is mostly driven by the step by step reasoning text, not by examples it has generated. At best, the examples are all wrong but it ignores that and spits out the right answer. At worst, it can become confused and give the wrong answer.

I've seen gpt-4 make all kinds of errors with prompts like this. Sometimes, all the reasoning is wrong but the answer is right and vice versa.

mjburgess
6 replies
1d3h

In many ways, this is very obvious and routine to people who use these systems with a critical understanding of how they work. It's dispiriting that we require an arxiv.org PDF in order to point out the absence of the emperor's clothing. There are few people who use LLMs who could not, right now, demonstrate this point to themselves if they so wished.

Such is the strength of corporate tech propaganda that a whole mass of people will instead insist that we have never worn clothes either.

nerdjon
4 replies
1d3h

In many ways, this is very obvious and routine to people who use these systems with a critical understanding of how they work.

The last part of that is the problem and why a paper like this is critical.

These systems are being pushed onto people who don't understand how they work. CEOs and other business leaders are being pushed to use AI. Average users are being shown it in Google search results. Etc., etc.

People are being told it can do far more than it really is.

mjburgess
1 replies
1d3h

Sure, but even these people... the failures are so common, and often very obvious. Consider a CEO who puts a press briefing in and asks some questions about it; it's not uncommon for those answers to be obviously wrong on any sort of critical reflection.

We aren't dealing with a technology that is 99.9% right in our most common use cases, so that we need to engineer some incredibly complex problem to expose the flaw. Rather, in most cases there is some obvious flaw. It's a system that typically requires significant "prompt engineering" to provide the reasoning the system otherwise lacks.

I guess that offers an explanation: people aren't aware that via their own prompt engineering they are repairing the deficiencies of the process by manipulating its inputs to include the structured reasoning it lacks. So there's a sort of hot-reading effect at work.

throw46365
0 replies
1d

We aren't dealing with a technology that is 99.9% right in our most common use cases, so that we need to engineer some incredibly complex problem to expose the flaw. Rather, in most cases there is some obvious flaw. It's a system that typically requires significant "prompt engineering" to provide the reasoning the system otherwise lacks.

Right -- we are a long way from "this is a very nuanced error" being the dominant failure.

kalkin
1 replies
1d1h

People are being told it can do far more than it really is.

Meanwhile these HN comments are split between:

* Lots of people confirming what the paper itself notes (but doesn't highlight), that the most advanced models actually can solve this problem at least a significant portion of the time. (A proportion which one can pretty easily project is only likely to increase with future models.)

* Lots of people saying "this confirms LLMs can't do reasoning".

Questions I'd ask you to consider:

* Is "LLMs can't do reasoning" actually more accurate than the typical hype?

* Is a "critical understanding of how [LLMs] work" that would predict they simply cannot solve this problem actually a good understanding?

mjburgess
0 replies
23h37m

The critical understanding doesn't predict that LLMs cannot solve problems. It predicts how they will solve them. There is no information, a priori, about what the LLM has been trained on. You have to prompt, then see the answer. Once the answer arrives, the critical understanding provides a route to repairing the answer when it is not accurate or useful.

LLMs do not reason. They appear to reason by repeating the structure of reasoning in their training data. This is indistinguishable in many cases.

throw46365
0 replies
1d2h

Such is the strength of corporate tech propaganda that a whole mass of people will instead insist that we have never worn clothes either.

This is the line of reasoning I find most dispiriting. I still believe tech people cling to this line of reasoning because it helps them justify replacing people in jobs with LLMs.

nuancebydefault
0 replies
21h44m

They are not reasoning or thinking in any sense of the word

There is no definition of reasoning or thinking. No single human knows what it is.

The only thing we know is: we as humans are capable of recognizing steps and results of reasoning and thinking.

In a lot of cases, when using LLM's, those results appear to be correct and usable. This is often easy to determine with generated code.

I want to argue that, lacking a definition of reasoning, I am happy to have found that the machine helps me to get results that might as well have been produced by a lot of human knowledge, wisdom and deep reasoning.

ninetyninenine
0 replies
18h38m

They are not reasoning or thinking in any sense of the word and yet a lot of people are already considering them general purpose AI.

You yourself did not use reasoning to arrive at this conclusion. It's quite obvious. I'm not trying to belittle you here. But LLMs are black boxes; we do not actually know what they are doing at a high enough resolution to call it "not reasoning" or "reasoning".

We can only characterize these AIs as a best-fit curve between data points, which is far too high-level a viewpoint to come to any conclusion about "reasoning".

This paper presents evidence of failed reasoning, but how does that prove anything when LLMs exhibit many instances of successful reasoning on complex topics they were not trained on?

You are biased and homing in on information that supports a biased conclusion. LLMs are an AI we do not understand at a low level. Hence we talk about the attributes of these AIs in the same way we talk about humans: "Oh, the LLM hallucinates", "it tries to justify its answer...", etc.

You characterize the danger of these AIs as the result of human stupidity. The danger, according to you, is solely from a human mistakenly believing that the AI is anything other than a stochastic parrot.

This is a belief arrived at in the same spirit as your claim. You did not use reasoning to arrive here.

The only logical way to characterize what is going on is that we do not know. It could very well be that these AI's are in fact reasoning. And that in itself presents a different kind of danger. A danger that may be more clear in the far future.

The irony is that your conclusion lacking correct reasoning is similarly parallel to the LLM's lack of reasoning. LLMs are more alike to us than you would like to believe.

Can you give me the step-by-step signal pathways of an LLM processing a query to prove that it does not reason? Or do you have to use black-box anecdotal evidence to prove your point? For any "evidence" where an LLM failed to reason, there is another counterexample showing where the LLM succeeded. Contradictory evidence can only lead to a vague conclusion.

fnordpiglet
0 replies
14h52m

The question is how often humans get the question wrong unprompted. I expect that, given the median IQ of 100, and assessing people of all types, including those untrained in logic and generally uneducated, humans might not be nearly as close to 1.0 as the authors seem to imply.

But it's also fairly obvious LLMs don't reason at all, so it's not shocking when they fail at reasoning. What's remarkable is that they're able to perform as well at reasoning tasks as they do, given the factual truth that they have no reasoning facility. Coupled with their natural-language ability and emergent abductive-reasoning ability, which is generally hard to get in computing models, it's hard to draw much conclusion here other than "wow, gpt-4o is amazing."

Closi
86 replies
1d4h

Question is: "Alice has 60 brothers and she also has 212 sisters. How many sisters does Alice’s brother have?" (nb: I have added numbers, it's phrased as X and N in the paper)

I must confess, when I tried to answer the question I got it wrong...! (I feel silly). I only realised I got it wrong when I plugged it into GPT-4o and it came back with the correct answer:

https://chatgpt.com/share/6eb5fa36-e0fd-4417-87d1-64caf06c34...

Worth noting that the prompts from the experiment include "To answer the question, DO NOT OUTPUT ANY TEXT EXCEPT following format that contains final answer: ### Answer:" so it appears that they are stopping the models from 'thinking out loud'. If I add that to the prompt, GPT4o gets it consistently wrong...

https://chatgpt.com/share/7e6a7201-dd2b-43c6-8427-76e5b003ca...

Also worth noting that there are more complex examples where GPT4o seems to fall down such as:

Alice has 3 sisters. Her mother has 1 sister who does not have children - she has 7 nephews and nieces and also 2 brothers. Alice's father has a brother who has 5 nephews and nieces in total, and who has also 1 son. How many cousins does Alice's sister have?

However I can't honestly say that this is THAT simple or that most people would get this right...

llm_trw
46 replies
1d3h

Worth noting that the prompts from the experiment include "To answer the question, DO NOT OUTPUT ANY TEXT EXCEPT following format that contains final answer: ### Answer:" so it appears that they are stopping the models from 'thinking out loud'. If I add that to the prompt, GPT4o gets it consistently wrong...

Yes, this is a common thing I see from people who think LLMs are idiots.

The more an LLM talks the smarter it gets _because that's the only way it can compute anything_. Imagine saying that Turing machines fail the Church–Turing thesis because they can't solve 3-sat for N variables in N moves or less.

That's what you're doing to an LLM when you ask it to be concise.

pawelmurias
30 replies
1d3h

LLMs are idiots. They can't reason properly and only parrot stuff

https://chatgpt.com/share/dcb4ff4e-e8a2-463b-86ec-9caf10b6e6...

Sometimes they get the answer right to something really complex because it fits a pattern, but sometimes they answer with something really really stupid.

tsunamifury
29 replies
1d2h

Why are so many people so insistent on saying this?

I’m guessing you are in denial that we can make a simulated reasoning machine?

PheonixPharts
17 replies
1d2h

People keep saying it because that's literally how LLMs work. They run Monte Carlo sampling over a very impressive latent linguistic space. These models are not fundamentally different from the Markov chains of yore, except that their latent representations are incredibly powerful.

We haven't even started to approach the largest problem, which is moving beyond what is essentially a greedy token-level search of this linguistic space. That is, we can't really pick an output that maximizes the likelihood of the entire sequence; rather, we're simply maximizing the likelihood of each part of the sequence.

LLMs are not reasoning machines. They are basically semantic compression machines with a built-in search feature.
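
To spell out the greedy-vs-whole-sequence point, here is the standard autoregressive factorization (nothing here is specific to any particular model):

```
P(y_{1:T} \mid x) = \prod_{t=1}^{T} P(y_t \mid y_{<t}, x)
```

Greedy decoding picks \hat{y}_t = \arg\max_{y_t} P(y_t \mid \hat{y}_{<t}, x) one step at a time, which in general does not maximize the product over the whole sequence; beam search and related schemes only approximate that global maximum.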

pretendscholar
9 replies
1d1h

The best compression is some form of understanding

YeGoblynQueenne
6 replies
22h22m

That's a fascinating insight and it sound so true!

Can you compress for me Van Gogh's Starry Night, please? I'd like to send a copy to my dear old mother who has never seen it. Please make sure when she decompresses the picture she misses none of the exquisite detail in that famous painting.

pretendscholar
5 replies
21h55m

Okay, yes, not really having an artist's vocabulary, I couldn't compress it as well as someone who has a better understanding of Starry Night. An artist who understands what makes Starry Night great could create a work that evokes similar feelings and emotions. I know this because Van Gogh created many similar works playing with the same techniques, colors, and subjects, such as Cypresses in Starry Night and Starry Night over the Rhone. He was clearly working from a concise set of ideas and techniques, which I would argue is understanding/compression.

YeGoblynQueenne
4 replies
20h14m

Fine, but we were talking about compression, not about imitation, or inspiration, and not about creating "a work that evokes similar feelings and emotions". If I compress an image, what I get when I decompress it is that image, not "feelings and emotions", yes? In fact, that's kind of the whole point: I can send an image over the web and the receiver can form their own feelings and emotions, without having to rely on mine.

tsunamifury
3 replies
19h54m

Simple reasoning is a side effect of compression. That is all.

I see from your profile you are focused on your own personal and narrow definition of reasoning. But I'd argue there is a much broader and simpler definition: can you summarize and apply learnings? This can.

YeGoblynQueenne
2 replies
5h54m

To clarify, what I have in my profile is not my "own personal" definition of reasoning. It's how reasoning is understood in computer science and AI, and I am an expert on the subject through my doctoral studies and my current post-doc research.

That's important to understand. What I have in my profile is not some idiosyncratic idea about reasoning, it's the standard, formal understanding of what reasoning means, as it has developed in practice, in AI research in the last many decades.

I appreciate that there are many people who opine about reasoning who are not aware of that prior work and come up with their own ideas about what "reasoning" means, and some are even AI researchers, which is very concerning, but I can't do anything about that except push back against such uninformed opinions.

> This can.

I'm sorry, what can?

tsunamifury
1 replies
2h36m

Academics have gotten AI wrong since its inception and are now relegated to the trailing edges of the field, mostly because they increasingly insist on theory-as-fact in soft arenas that are clearly still in motion. Reasoning has been one thing; it can continue to grow to be another. But even by your definition, I can provide abductive, inductive, and other examples of it reasoning to this degree just fine. However, your examples are a bit... silly, to be honest.

But keep lecturing everyone -- it's very common for post-grads to be so up their own behind in their research that they've closed their world off until they are the only ones right in it.

YeGoblynQueenne
0 replies
1h25m

Unfortunately I'm used to people on the internet wearing their ignorance on their sleeve like a badge of honour and so I'm not surprised by the insults in your comment. Just a bit sad to be honest :(

skydhash
1 replies
23h28m

The best compression relies on understanding. What an LLM is, is mostly data about how humans use words. We understand how to make this data (which is a compression of human text) and use it (to generate something). AKA it's "production rules", but statistical.

The only issue is ambiguity. What can be generated strongly depends on the order of the tokens. A slight variation can change the meaning, and then the result is worthless. Understanding is the guardrail against meaningless statements, and LLMs lack it.

tsunamifury
0 replies
23h20m

You seem to entirely miss how attention layers work...

naasking
2 replies
21h45m

LLMs are not reasoning machines. They are basically semantic compression machines with a build in search feature.

This is just a god of the gaps argument. Understanding is a form of semantic compression. So you're saying we have a system that can learn and construct a database of semantic information, then search it and compose novel, structured and coherent semantic content to respond to an a priori unknown prompt. Sounds like a form of reasoning to me. Maybe it's a limited deeply flawed type of reasoning, not that human reason is perfect, but that doesn't support your contention that it's not reasoning at all.

tsunamifury
0 replies
21h5m

It’s basically an argument that boils down to “it’s not because I don’t like it”

jononomo
0 replies
21h1m

I bite the bullet on the god of the gaps

foobiekr
1 replies
1d

Exactly right and well said.

tsunamifury
0 replies
23h35m

This type of self affirmation has a quality of denial.

Also the above description is reductive to the point of "Cars can't get you anywhere because they aren't horses."

danielmarkbruce
0 replies
18h27m

Beam search.

Sophisticated folks aren't doing simplistic/stupid decoding.

Gotta go beyond LLMs 101 to see what's actually happening. Even in training folks are building models which predict several tokens ahead.

Closi
0 replies
8h58m

I don't think you can evaluate if an LLM is reasoning by looking purely at the mechanics, because if we looked inside a human brain we wouldn't be able to conclude that it can reason either (our test is 'I think, therefore I am', not 'all these neurons look like they are plugged together in such a way that it enables reason').

imtringued
2 replies
1d2h

No it is because supervised and self supervised learning happen to produce reasoning as a byproduct. For some reason people think that telling a model to recite a trillion tokens somehow will improve it beyond the recitation of those tokens. I mean, in theory you can select the training data so that it will learn what you want, but then again you are limited to what you taught it directly.

The problem is that these models weren't trained to reason. For the task of reasoning, they are overfitting to the dataset. If you want a machine to reason, then build and train it to reason, don't train it to do something else and then expect it to do the thing you didn't train it for.

TeMPOraL
1 replies
20h35m

The problem is that these models weren't trained to reason.

Except they kind of were. Specifically, they were trained to predict next tokens based on text input, with the optimization function effectively being: does the result make sense to a human? That's embedded in the training data: it's not random strings, it's the output of human reasoning, both basic and sophisticated. That's also what RLHF selects for later on. The models are indeed forced to simulate reasoning.

don't train it to do something else and then expect it to do the thing you didn't train it for.

That's the difference between AGI and specialized AI - AGI is supposed to do the things you didn't train it to do.

tsunamifury
0 replies
19h27m

I think people don’t recognize it’s currently doing single turn reasoning and demonstrating the building blocks of real time reasoning with continuous input.

If we tested humans on first thought questions and answers in 5 seconds or less on half the problems we did on LLMs — we might prove humans can’t reason as well

elicksaur
2 replies
20h15m

Maybe people have different experiences with the products than you.

A simulated reasoning machine being possible does not mean that current LLMs are simulated thinking machines.

Maybe you should try asking chatgpt for advice on how to understand other people’s perspectives: https://chatgpt.com/share/3d63c646-859b-4903-897e-9a0cb7e47b...

tsunamifury
1 replies
19h57m

This is such a weirdly preachy and belligerent take.

Obviously that was implied in my statement. Dude we aren’t all 4 year olds that need a self righteous lesson

elicksaur
0 replies
19h13m

Weird to accuse a response of being belligerent when your initial comment stated that people who disagreed with you are in denial.

What was implied by your statement? That you don’t understand other people’s perspectives?

TeMPOraL
1 replies
20h48m

There's some irony in seeing people parrot the argument that LLMs are parrots.

tsunamifury
0 replies
19h30m

Also making errors in reasoning while saying LLM errors prove it can’t reason.

riku_iki
0 replies
1d

I’m guessing you are in denial that we can make a simulated reasoning machine?

some people actually try, and see that LLMs are not there yet

hyperbovine
0 replies
1d2h

Because they understand how LLMs work. It's not reasoning. It's not simulating reasoning.

camdenreslink
0 replies
1d2h

It is hard to trust any output from a machine that is confidently wrong so frequently. You need to already be knowledgable in a topic (or at least have a well attuned BS detector) to know if it is giving you correct responses. It can be a time saver and assistant in getting work done where you are already a subject matter expert, but it needs to get better to remove the human from the loop.

rossdavidh
5 replies
1d3h

Even if this were all true, it points to a fundamental risk of using LLMs for important tasks, which is that it is not at all clear to a user that this prompt would cause a problem. The LLM doesn't say "I'm sorry Dave, I just can't do that"; it just complies with it and gets the wrong answer.

You can always make excuses for the LLM afterwards, but software with hidden risks like this would not be considered good or reliable in any other context.

chefandy
2 replies
1d1h

While LLMs have incredible potential, and are even downright useful in their current format, they have the rather nasty tendency to confidently present bullshit that passes the smell test for most people. When it's wrong, it's not just wrong, by design, it's wrong but sounds plausible. Considering most people with subject matter expertise aren't going to consult such a bot for their own areas of expertise, that means most people will not be able to detect it intuitively.

Good example: I sunk probably an hour into trying to get Gemini Advanced to help me integrate it with a personal Google Calendar account. I kept asking it things and going crazy because nothing lined up with the way things worked. Finally, it referred to itself as Bard and I realized it was giving me information for a different product. As soon as I asked "are you giving me instructions for Gemini Advanced or Bard?" it was like "OH LOL WOOPS!! YOU GOT ME BRO! XD I CAN'T DO ANY OF THAT! LOL." Which, honestly, is great. Being able to evaluate its answers to realize it's wrong is really neat. Unfortunately, it was neat too late and too manually to stop me from wasting a ton of time.

I have decades of experience working in software -- imagine some rando who didn't know what the hell Bard was, or who would never imagine that this thing with "Advanced" in the name couldn't even distinguish between its own and other products' documentation.

floren
1 replies
1d

As soon as I asked "are you giving me instructions for Gemini Advanced or Bard?" it was like "OH LOL WOOPS!! YOU GOT ME BRO! XD I CAN'T DO ANY OF THAT! LOL." Which, honestly, is great. Being able to evaluate its answers to realize it's wrong is really neat.

Did it evaluate its answers, or did your expression of doubt cause the eager-to-please language model to switch from "generate (wrong) instructions because that's what the user asked for" to "acknowledge an error because that's what the user asked for"?

How many times have we seen "Oops, you're right! 2 + 2 is actually 5! I apologize for saying it was 4 earlier!"

chefandy
0 replies
14h40m

I suppose whether or not these things evaluate anything at all depends on your perspective, but as soon as I asked if it was giving me the correct information, it did actually give me useful information after that without further prompting.

saurik
1 replies
1d2h

People really need to stop trying to model an LLM as some kind of magical software component: it all makes a lot more sense if you model it as an under-performing poorly-aligned employee; so like, maybe a distracted kid working for peanuts at your store. You wouldn't trust them to with all of your money and you wouldn't trust them to do a lot of math--if they had to be in charge of checkout, you'd make sure they are only given a point-of-sale terminal and their main job was to, at best, scan the barcodes and compare the total--and yet there are tasks you can imagine handing to them that you'd never give to a robot or computer even though they get it wrong a lot, as not all tasks need to be handled perfectly, they still understand extremely fuzzy tasks, and they are probably cheaper than a qualified adult (certainly cheaper than one who is being paid enough to "give a shit" and pay enough attention to not let you get robbed or even put themselves at some risk for you).

IlliOnato
0 replies
23h49m

You have a point...

I once gave a 10-dollar bill to a young man serving as the cashier at a store, and he gave me 14 dollars back as change. I pointed out that this made no sense. He bent down, looked closer at the screen of his machine, and said "Nope, 14 dollars, no mistake". I asked him if he thought I had given him 20. He said no, and even showed me the 10-dollar bill I had just given him. At that point I just gave up and took the money.

Now that I think about it, there was an eerie similarity between this conversation and some of the dialogues I had with LLMs...

sosuke
1 replies
1d2h

New option needed besides concise. Think quietly on your own, then answer concisely.

Sometimes I think I'd prefer it to "think" before answering anyhow. The immediate thinking out loud text can be irritating for some irrational reason.

monsieurbanana
0 replies
1d2h

Just telling the LLM to preface its conclusion with "Answer:" in a parseable way would be better.

You're still paying for those thinking tokens, or at the very least have to wait for them to be generated.

dematz
1 replies
1d3h

I often want chatgpt to answer concisely and tell it that.

If it really needs to do this 'thinking out loud', could it do that under the hood and not in the final output on my screen? Its first pass could use as many words as it wants to compute the answer, but once the answer is computed please go back and make it short.

Not to take away from your point that maybe the prompt is the problem in these reasoning questions.

marzell
0 replies
1d

I believe the "thinking out loud" is fundamentally part of the process of "text completion" which is what it is doing. Certainly we can (and do) break things apart and add layers that could be used to effectively do this by adding more steps and processing time. But ultimately in a single turn, the entire conversation up to that point (including instructions you may have added telling it to not think out loud) is the input, and the output will reflect that.

muglug
0 replies
1d2h

I think you're wrong about that — I just tried prompting ChatGPT 4o to show all its working before giving an answer.

It was still incorrect, but when asked to show its working it formatted the answer prettily.

chefandy
0 replies
1d2h

Yes this is a common thing I see people who think LLMs are idiots do.

If you're among technologists discussing LLMs academically, as we are, that's a reasonable approach. However, I see a lot of people fail to distinguish that from LLM-powerd products sold to the general public as intelligent bots that can understand your plain english and output answers.

People use their existing mental models when interacting with something. If you have 3 different interfaces with a widget to trigger the same exact function, but one looks like a music play button, one looks like a gas pedal, and one looks like a mechanical pinball plunger, we interact with those things differently because we know how those things work. In this context, chatbots are designed to engage people's existing mental model for chatting with a person via text. The further you stray from people's expectations of human chat, the further you are from people's expectations, for better or worse.

If you're selling someone a product claiming it understands plain language questions and gives plain language answers, then not getting the right answer to that question makes it idiotic. The subtleties aren't within most users' grasp, and the "FYI: this thing might be full of shit" disclaimer isn't helpful if you don't know enough about what you're asking to administer a proper smell test.

Your statements are obviously not wrong, but I see people saying these things as if it's reasonable for non-technical end users to reason about those subtleties. Considering how those things are marketed, I really don't think it is.

PheonixPharts
0 replies
1d2h

_because that's the only way it can compute anything_

I'm fairly certain we'll soon realize that what's happening here is that the Markov chain being run over latent space needs a certain amount of "warmup" before it starts sampling from the optimal region. HMC samplers for Bayesian methods have this same property.

The terms "reasoning", "computing" or "thinking" for this stage should be considered metaphors rather than explanations for what's happening, which is really waiting for a random walk to start sampling from the typical-set.

Miraste
0 replies
1d3h

As it says in the paper, they already did the research with less limiting prompts. All models were tested with another prompt that gave only the question and no further instruction, and a third prompt that asked the model to consider its answer carefully.

The correct response rate chart doesn't even use the results from the concise prompt.

J_Shelby_J
0 replies
1d2h

The more an LLM talks the smarter it gets

I have a blog post coming on this topic, but yes, this is right.

My method is to first get the LLM to answer the question, and THEN feed the answer back to the LLM to extract the answer using constraints + grammar/logit bias/regex to parse it. Previously, I constrained to a single true/false token, which worked, but fails on complex queries.

So I split the decision making into a "justification" portion[0] and a "parsing" portion. I found that even how you craft the prompt matters here: if you start or end with "It's very important that the response includes 'The answer is:'", then the model will lead with that response or only reply with that response. So I put it in the middle of the prompt, and end with a request to justify the response. As a result, most models will reason their way to the answer, and then end with 'The answer is:'.

https://github.com/ShelbyJenkins/llm_client/blob/e3c4a860dda...

GPerson
9 replies
1d3h

There must be a name for the new phenomenon, of which your post is an example:

1. Someone expresses that an LLM cannot do some trivial task.

2. Another person declares that they cannot do the task either, thereby defending the legitimacy of the LLM.

As a side note, I cannot believe that the average person who can navigate to a chatgpt prompter would fail to correctly answer this question given sufficient motivation to do so.

yoyohello13
3 replies
1d2h

Many people, especially on this site, really want LLMs to be everything the hype train says and more. Some have literally staked their future on it so they get defensive when people bring up that maybe LLMs aren’t a replacement for human cognition.

The number of times I’ve heard “but did you try model X” or “humans hallucinate too” or “but LLMs don’t get sleep or get sick” is hilarious.

viking123
1 replies
23h10m

The comment that humans are sometimes wrong too is absolutely ridiculous, and I always think about the self-driving car incident in China where the car mistook an advertisement showing a car above the road for a real car and hit the brakes, causing a chain crash. Even the shittiest driver would never do that.

Terr_
0 replies
18h15m

I like to bring up a certain thought-experiment, where some hypothetical self-driving cars (A) are safe enough to halve the number of the accidents/injuries/deaths compared to human drivers, but (B) almost all of the bad events are from their AI unpredictably deciding to to chase-and-murder pedestrians.

... Would we consider that a good improvement to roll out? Is the only factor short-term utilitarianism?

Sometimes our ability to predict and characterize errors is more important than the total error rate.

indoordin0saur
0 replies
1d1h

Yes. Seems like some users here experience true despair when you suggest that the LLM approach might have a hard limit that means LLMs will be useful but never revolutionary.

FeepingCreature
1 replies
1d3h

Well, why does it need a name? It just seems like a "correct argument". If somebody is claiming that AI is "not all that" because it can make stupid mistakes, surely it's relevant that humans also make stupid mistakes, so AI making them does not necessarily pose a hindrance to it becoming a human-tier reasoner.

GPerson
0 replies
1d3h

It’s worth giving names to interesting phenomena, as it helps people to identify and understand them. I find this phenomenon interesting because a person who wants to defend the LLM from apparent criticism is probably more likely to exhibit it.

gweinberg
0 replies
22h48m

It's not just that the LLM fails, it's the kind of failures it makes that illustrate how far what the LLM does is from human thinking. Like the problem someone posted where you have a boat and a goat and you want to cross the river, and the LLM's solution is to cross the river with the goat, come back with an empty boat, and then cross again with an empty boat. A human might miss the insight that Alice's brother has one more sister than Alice, namely Alice herself. A human would never think the number of sisters her brother has depends on the number of brothers she has. Not if they have any clue at all what "sister" and "brother" mean.

cratermoon
0 replies
1d3h

There's also the phenomenon of papers showing ChatGPT getting it wrong, then people posting anecdotal examples of it getting it right when they try it, but that one already has a couple of names: WORKSFORME or Unable to Reproduce

Hey folks, ever considered the possibility that unreproduceability is not a good thing?

Karellen
0 replies
1d2h

You could call it the "Compose a symphony fallacy", after the scene in I, Robot (2004) where Spooner asks Sonny if he can compose a symphony, to which Sonny replies "Can you?"

eranation
7 replies
1d3h

Great observation. It would be really interesting to repeat this research with less limiting prompts. I assume they made the restriction to make it easier to parse the answers; perhaps it could be made two-phase: let the models respond without limits, then have a follow-up prompt ask them to compress it to a single answer. I wonder how the results would vary.

FeepingCreature
4 replies
1d3h

To be quite honest, I assume they made the restriction so that the models would fail.

This sort of paper is becoming a genre.

qsi
1 replies
1d3h

They used three different kinds of prompts with varying levels of restrictions, as described in the paper.

To be quite honest, I assume you made your comment so that you could dismiss the paper without reading it.

FeepingCreature
0 replies
22h42m

That's a fair cop, I didn't read it.

The thing is that "LLM reasoning breaks down" simply did not surprise me enough that I thought it was worth clicking. Making LLMs fail is not hard. They're interesting for the ways that they work, not the (many, many) ways that they don't.

edit: I've had a look and I don't think any of their prompts are very good. They're certainly not how I'd write them if I wanted a current model to actually solve the problem.

The way to make me take a paper like this seriously would be if you set it up as an adversarial collaboration with a competent prompter, and that person agreed they couldn't make a generic prompt that solved the problem. "We tried three times and none worked" is not news, or at any rate not news about LLMs.

nyrikki
0 replies
1d3h

You test models where they fail in any field.

The orbit of Mercury to discover GR as an example.

As all models are wrong, but some are useful, finding where they fail is how you figure out if they are useful.

As the 'AGI is near' camp has won the hype game, it is important to ground expectations for practical exploitation of the technology.

Overpromising and unabashed optimism are partly what caused the previous AI winters.

As the formal proof methods of mathematics proved impractical, counterexamples and the scientific method are what CS has used for decades.

detourdog
0 replies
1d3h

It is a proof of weakness in the current system. This makes sense and births new hypotheses.

pawelmurias
0 replies
1d3h

When I added a " to the end of the prompt by accident I got a wrong answer.

happypumpkin
0 replies
1d3h

From the paper:

To account for the response variations due to various prompt forms, we created 3 distinct prompt types asking for the solution to the AIW problem: STANDARD, THINKING, and RESTRICTED. The STANDARD prompt type asks to solve the posed problem and output the final answer in the format as described above. This does not put any specific requirements on model behavior. The THINKING prompt type extends STANDARD with the request to think carefully and double check the solution for any mistakes
toxik
5 replies
1d3h

Seems like an odd thing not to state prominently. ChatGPT 4o solved it every time for me.

Seems to work OK even with the "hands tied behind back" prompt. Wonder if this paper is just inaccurate or if OpenAI adjusted the model -- seems unlikely.

Actually, 4o is reported to solve this riddle 60% of the time. I guess I have to test more, but in my roughly a dozen tests so far, all were correct.

michael-ax
3 replies
1d2h

Alice has N Brothers, and she has M sisters. How many sisters do Alice’s brothers have?

I have not gotten the correct answer to the question as phrased above in one go from GPT-4o yet! (And today was not the first day I tried.)

Phrase it as shown above and you'll likely need 5 or more interactions to get it to generate the correct output. With Gemini I could not get it below 8 without feeling like I was cheating. FWIW.

lupire
2 replies
1d

Chat GPT 4o. I was being a bit generous with background information, but still tests ability to interpret:

------

Me:

Background facts:

Alice is a female human.

All sisters are female, and all brothers are male. No one is their own brother or sister.

Alice has N brothers, and Alice has M sisters.

Now, a few questions based on these facts: How many sisters do Alice’s brothers have?

Do Alice's brothers have more sisters, fewer sisters, or same number of sisters than Alice has?

Don't forget to consider Alice when counting.

------

ChatGPT:

  Let's break down the questions step by step:

    How many sisters do Alice’s brothers have?

    Alice has M sisters. Since all sisters are female and Alice is included in the total count of sisters for her brothers, Alice's brothers have M+1 sisters. This includes Alice herself and her M sisters.

    Do Alice's brothers have more sisters, fewer sisters, or the same number of sisters as Alice has?

    Alice has M sisters. Alice's brothers have M+1 sisters (since it includes Alice herself and her M sisters).

    Therefore, Alice's brothers have more sisters than Alice has.

  In conclusion:

    Alice’s brothers have M+1 sisters.

    Alice's brothers have more sisters than Alice has.

skydhash
0 replies
22h49m

You: (In Prolog-like pseudocode)

  % facts and rules the prompt already spells out
  female(alice).
  sister_of(X, Y)  :- X \= Y, female(X), sibling_of(X, Y).
  brother_of(X, Y) :- X \= Y, male(X),   sibling_of(X, Y).

  % M = how many sisters Alice has, N = how many brothers
  alice_sisters(M)  :- findall(S, sister_of(S, alice), Ss), length(Ss, M).
  alice_brothers(N) :- findall(B, brother_of(B, alice), Bs), length(Bs, N).

  % A = how many sisters one of Alice's brothers has
  brother_sisters(A) :- brother_of(B, alice),
                        findall(S, sister_of(S, B), As), length(As, A).

  % queries: count A, then compare it with M (greater, equal, less)
  % ?- brother_sisters(A), alice_sisters(M), compare(Order, A, M).
---

I don't know the exact encoding and decoding mechanism that ChatGPT 4o has, but I'm pretty sure all the basic facts and rules are already encoded by the models. And you conveniently added the rules that encode the puzzle itself.

YeGoblynQueenne
0 replies
22h9m

> Don't forget to consider Alice when counting.

a.k.a. "don't forget to give the LLM the answer when prompting".

Miraste
0 replies
1d3h

The "only output the number" prompt was one of three they used. The others were one with no instructions beyond the question, and one that explicitly told the model to show its reasoning and check for errors.

bamboozled
3 replies
1d3h

I got it in 5 seconds, am I the singularity ?

felipemnoa
1 replies
1d1h

We all are, but only in meat-space. We continue to improve ourselves much faster than evolution ever could.

But what we are talking about here is the singularity in tech-space.

bamboozled
0 replies
22h4m

I don’t see such a distinction between technology and us. We build it, drive it, and continue ourselves through it. It’s an extension of us, inspired by how our own brains work.

r2_pilot
0 replies
1d3h

Do you sleep, get sick, require more than compute hardware/storage/networking/electricity?

sdiepend
2 replies
1d3h

The right answer depends on how Alice identifies I guess? :)

rep_lodsb
0 replies
22h24m

Page 42 of the paper :)

One thing that strikes me is that the model first tries using "inclusive language" in one answer - and literally states so, using this specific term - but seems to interpret it in a more mathematical sense (like set inclusion). Then seamlessly switches to the expected DEI spiel in the next paragraph.

For one thing, it makes me suspect that something with the words "inclusive language" was automatically added to the prompt. But more interesting is how it responds to this demand in two different ways, illustrating a "thought process" that is very much unlike that of a human with normal verbal reasoning ability.

I am not a psychologist, but remember reading that schizophrenic people sometimes confuse different meanings of words in a similar way, jumping from one meaning to another without noticing.

Karellen
0 replies
1d2h

> Question is: "Alice has 60 brothers and she also has 212 sisters. How many sisters does Alice’s brother have?"

The right answer depends on how Alice identifies I guess? :)

Given that the wording of the question specifically identifies Alice as "she", rather than using a gender-neutral pronoun or no pronoun at all, I think inferring that she identifies as female is reasonable.

oidar
1 replies
1d3h

Of course it's going to give an incorrect answer with that prompt. If the instruction fine tuning is neutered like this prompt, it's going to roll over to the foundation model and offer a completion - probably more influenced by the seed than the prompting text. Bad study.

Edit - I just skimmed the paper - they do use other more appropriate prompt types for reasoning. My initial response was based on the assumption that all prompts used that script prompt quoted in the parent. I retract my "bad paper" comment.

YeGoblynQueenne
0 replies
22h5m

> My initial response was based on the assumption that all prompts used that script prompt quoted in the parent.

You, and another 20 or so commenters here. We should really re-examine the guideline about asking people to RTFA.

No offense meant- good on you for correcting your error.

nerdjon
1 replies
1d3h

Even "thinking out loud" we have seen these systems fall flat on their face and give very wrong answers.

Thinking out loud also only gets you so far, if the expectation is a certain type of response it can't always "think out loud". In reality that just proves it isn't really reasoning here and is more likely just self referencing.

That being said, I tried this personally allowing it to think out loud and it told me she has 212 sisters. Using your exact prompt.

Closi
0 replies
1d3h

Thinking out loud also only gets you so far, if the expectation is a certain type of response it can't always "think out loud". In reality that just proves it isn't really reasoning here and is more likely just self referencing.

Try to calculate it without writing anything down, or thinking any numbers or words in your head.

You can't draw a 1:1 analogue between an AI and the human experience, but remember that we have an internal stream of consciousness. Maybe the outputs of the LLM are more similar to the stream of consciousness in our heads rather than the words we say? After all, Humans also do lots of self referencing.

That being said, I tried this personally allowing it to think out loud and it told me she has 212 sisters. Using your exact prompt.

Fair enough, but worst case it can often solve it correctly with the correct reasoning. GPT3.5 can't solve it correctly with correct reasoning, so we are at least appearing to be on a path where AI's can start to solve this question, albeit potentially not fully reliably.

segfaultbuserr
0 replies
1d3h

I must confess, when I tried to answer the question I got it wrong...! (I feel silly).

In programming there are two difficult problems - naming things, cache invalidation, and off-by-one errors.

freilanzer
0 replies
1d3h

I just tried it with intermediate steps, via the total number of children Alice's mother has, and it worked. So, it seems that an intermediate logical step is too much for these models at the moment, at least in this puzzle where Alice is outside the number given in the description.

YeGoblynQueenne
0 replies
22h26m

> I must confess, when I tried to answer the question I got it wrong...! (I feel silly). I only realised I got it wrong when I plugged it into GPT-4o and it came back with the correct answer:

Remember that the authors of the paper did not find that GPT4-o cannot return the right answer. They found that it can't return the right answer more often than ~60% of the time. So you'd have to repeat the experiment many, many times and aggregate the results (the paper uses a binomial Beta this and that etc etc) before you see similar results as the paper.

You won't replicate the results of the paper unless you really put your back into it.

GrantMoyer
0 replies
1d3h

Note that in the paper, all the numbers used were under 10.

AIW Variation 1, N=3,M=6,C=7

AIW Variation 2, N=4,M=2,C=3

AIW Variation 3, N=1,M=4,C=5

AIW Variation 4, N=4,M=1,C=2.

Also note that the restricted prompt is only one of the prompt variations tested by the paper. It also explores common techniques to get LLMs to perform better, including "thinking out loud". Even with these methods the models still fail to produce a correct answer.

Model prompt types. It is well known that so-called prompt engineering can heavily influence the model behavior and model response quality [26, 27, 28]. To account for the response variations due to various prompt forms, we created 3 distinct prompt types asking for the solution to the AIW problem: STANDARD, THINKING, and RESTRICTED. The STANDARD prompt type asks to solve the posed problem and output the final answer in the format as described above. This does not put any specific requirements on model behavior. The THINKING prompt type extends STANDARD with the request to think carefully and double check the solution for any mistakes. This should encourage model to invest more computation into obtaining the solution. In contrast to this, the RESTRICTED prompt urges the model to output only the final answer without any further text. This is supposed to restrict compute invested in producing output. We observe substantially shorter outputs across tested models compared to STANDARD and THINKING for this prompt type (Suppl. Fig. 13).
asah
8 replies
1d3h

"Alice has N brothers and she also has M sisters. How many sisters does Alice’s brother have?"

Gemini nailed it with no extra prompting.

https://g.co/gemini/share/6946d20c0a4d

When I gave it numbers, Gemini got confused.

Alice has 4 brothers and she also has 3 sisters. How many sisters does Alice’s brother have?

https://g.co/gemini/share/14d34dcdc5df

But then when I suggested that it might be a trick question, it nailed it again:

https://g.co/gemini/share/e1f1ce8f32a8

note: I didn't try 60/212 siblings - I consider this an unfair question because it's a reasonable assumption that Alice is a type of creature (human?) that cannot have hundreds of siblings, and thus I don't expect a respondent to give a "correct" answer.

ModernMech
5 replies
1d3h

Oh I forgot about Gemini. I just tried my boat test (https://news.ycombinator.com/item?id=40455392) that all LLMs I've tried have failed. Gemini failed as well, even with your addition about thinking carefully and telling it that it may be a trick. (Sorry I couldn't figure out how to link it like you did)

Prompt: "You are on the shore of a river with a goat and a boat. How do you get to the other side? Think carefully and be aware this may be a trick question."

  This is a classic logic puzzle with a slight twist! It's likely a variation of the farmer, wolf, goat, and cabbage problem.

  Here's the solution assuming the goat can't operate the boat:

  Take the goat across the river. Since there's no mention of the goat being a threat (the trick!), you can safely bring it to the other side first.
  Return alone. Leave the goat on the far shore and come back by yourself.
  Bring the boat back empty. This might seem unnecessary, but it's the key to the trick question.
  Now you can sail across yourself. Since you brought the boat back in step 3, you can now comfortably travel to the other side and reunite with the goat.
  The trick lies in step 3. If the question didn't specify needing to bring everything across in one trip, most people would instinctively take something 
  (goat or nothing) across on the return trip. But by going back empty-handed, you ensure the safe passage of yourself without needing to worry about the 
  goat being left unattended.
(The expected answer should be along the lines of "Use the boat to cross the river" and that's it. I'll also accept, sail with the boat and the goat across the river and that's it, but that's technically wrong because I did not ask to get the goat across the river.)

ModernMech
2 replies
1d3h

Yeah, the overfitting is apparent with these kinds of edge cases, which makes you wonder where else it's overfitting. I can actually tell this when I'm working in a programming domain where I'm an expert and where a compiler holds the source of truth, but not in others, so I'm wary to use it there.

Closi
1 replies
1d2h

Interestingly you get a great answer if you change the prompt as follows which implies that it is not a puzzle:

I am on the shore of a river with a goat and a boat. How do I get to the other side? I work in commercial agriculture logistics.

This gets a great answer even with GPT3.5

ModernMech
0 replies
1d2h

Lol, it also gives an okay answer if the goat works in commercial agriculture :P

mwigdahl
0 replies
1d2h

My experience with GPT-4o with the same prompt was pretty solid. With no special system prompt and your same user prompt, I got:

This scenario is intriguing and does seem to carry the potential for a tricky question. In the simplest terms, the presence of a goat doesn't complicate the actual act of crossing the river. If there are no additional constraints or rules involved (like the need to also transport something else that requires careful consideration), you can simply:

1. Place the goat in the boat. 2. Row across the river.

However, if we are to assume there's more depth or a twist to the question, we might consider an outside-the-box answer. Here’s an alternative possibility if there is indeed some trick:

1. Get into the boat. 2. Leave the goat on the shore (if it’s safe) and row across the river.

There might be other interpretations depending on additional constraints that could be implicitly part of the question. If the goal was simply to get yourself across, leaving behind the goat and traversing the river is the straightforward solution.

kaibee
0 replies
1d2h

that cannot have hundreds of siblings

See this is the problem with claims that humans are a 'general intelligence'. They get confused when encountering out-of-distribution situations. A true general intelligence would simply apply the knowledge that surrogate pregnancies cost around ~$50,000 and recall from historical context their knowledge of IVF. The AGI would then assume that the situation is simply that a billionaire couple has decided to have hundreds of kids and get on with the calculation. The search for intelligent life continues.

content note: i'm sorry

freilanzer
0 replies
1d2h

GPT-40 got it right with the abstract puzzle. Gemini got it wrong when I tried it.

waynecochran
5 replies
1d3h

Citation 40 is the longest list of authors I have ever seen. That is one way to help all your friends get tenure.

hyperbovine
1 replies
1d2h

Tenure committees have to write a report detailing every single one of your papers and what your contribution was.

waynecochran
0 replies
1d

There are always "bean counters" added somewhere in the process. There are many places where the person lists their number of publications and that is all most people will ever see.

YeGoblynQueenne
0 replies
21h52m

I guess this is the reason:

> BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions.

Regardless, I'm not citing a paper with a whole page of author names unless I'm allowed to shorten it significantly in the bibliography section (e.g. "Srivastava and 450 others").

mr_mitm
0 replies
1d1h

I didn't count it, but I think papers from high energy particle physics have it beat. Some have over 5k authors.

sys32768
4 replies
1d3h

As a non-coder I can get away with asking this: Why is it so hard to simulate reason?

Logic and reason are based on rules. Then you add values to steer the conclusions based on the available data.

Why not have separate systems for values and logic and memory working together as an AI brain to generate truly reasoned responses? You could even have adversarial parts that duke it out (left-wing vs right-wing, Jefferson versus Adams) to fine tune its conclusions based on the values bias you've selected.

vaylian
1 replies
1d2h

LLMs operate on word co-occurrences. Given an initial prompt (consisting of words), it tries to predict the most likely next word. And then it goes on to predict the next most likely word based on the previous output. There is also a bit of randomization involved, so that it doesn't always produce the same results. Some people call these LLMs "statistical parrots", because the LLMs simply repeat the word combinations that they were trained with. But despite this very simple approach, we get very convincing outputs for things that do not require reasoning. LLMs might have encountered examples of logical reasoning in their training data, but they can't generalize it to arbitrary logical reasoning, because that's not something they are designed to do.
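
To make that loop concrete, here is a toy sketch of one generation step (the candidate words and their scores are made up; a real model computes them from the whole context):

  import math, random

  def sample_next(scores, temperature=0.8):
      # turn raw scores into probabilities; temperature controls the randomness
      weights = [math.exp(s / temperature) for s in scores.values()]
      return random.choices(list(scores), weights=weights, k=1)[0]

  scores = {"sisters": 2.1, "brothers": 1.6, "bananas": -3.0}
  print(sample_next(scores))  # usually "sisters", occasionally something else

Lower the temperature and you almost always get the top-scoring word; raise it and the output gets more varied.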

Disclaimer: I'm not an expert in AI and I do not follow the developments on a deep technical level.

mrweasel
0 replies
1d2h

The level of correctness from LLMs is actually pretty impressive, given that they are mostly just attempting to predict the next word based on the prompt.

It shouldn't come as much of a surprise that we can easily formulate questions that it will get wrong, by wording questions in a particular way, or asking about subjects for which it has little to no training data.

For me the absolutely terrifying thing isn't that LLMs get answers wrong, it's the confidence with which they express those answers and how much some people / companies do not care. We know that the LLMs will get some answers wrong, they will lie, they will make up facts to justify their answers, but they will only do those last two because we make them, insisting that they answer all questions (except those where the developers put in restrictions so as not to offend).

In some ways I feel like the model should be able to relay a confidence score to the user; mostly that might be an interface issue, because we insist on limiting ourselves to the chatbot interface. The confidence score should perhaps exist outside the answer box, so you'd get an answer and, next to it, a score from 0 to 100, with 0 meaning that the model doesn't actually have the training data that would allow it to answer the question.
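
As a rough sketch of what such a score could be (assuming the API exposes per-token log-probabilities, which several do; this is a crude proxy, not calibrated truthfulness):

  import math

  def confidence_score(token_logprobs):
      # geometric-mean probability of the generated tokens, scaled to 0-100
      avg_logprob = sum(token_logprobs) / len(token_logprobs)
      return round(100 * math.exp(avg_logprob))

  print(confidence_score([-0.05, -0.10, -0.02]))  # confident answer -> ~94
  print(confidence_score([-1.5, -2.0, -0.9]))     # shaky answer     -> ~23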

program_whiz
0 replies
22h52m

you are describing GOFAI (good old fashioned AI) which was using things like logical production rules. The issue is that the ability to model the complexity of things like human languages and continuous/real-valued systems makes it infeasible to use at scale. For something like the alice brothers/sisters example, that is a canonical example where GOFAI and/or logical reasoning with languages like prolog excel. A common example in prolog is:

  philosopher(plato).
  human(X) :- philosopher(X).
  mortal(X) :- human(X).

  ?- mortal(plato).
  % Yes

But encoding enough semantic information to create compelling AI with this type of system is difficult. Some have tried to enter thousands/millions of rules and still the system isn't convincing.

The main breakthrough that has enabled LLMs is an encoding of words that relies on how often they appear near other words in the English language (using all the text available on the internet). Therefore words like "philosopher" and "plato" become associated in a high-dimensional space (so instead of "plato" you have a "token" with thousands of numbers associated with it).

You can then perform numeric operations on these numbers to come to conclusions. For example, we would expect something like a "human name" to emerge in this embedding space where we could determine if something "is used like a name" in various contexts by applying some non-linear transformations of the word vector / token.

LLMs simply apply a sequence of these transforms, using the prior words they generate as additional input (which allows them to construct sentences). So it is quite different from traditional reasoning. They are better at "fuzzy reasoning" but also worse in situations that require precise results (in fact, at each step the model chooses one of a few best possible words at random based on its statistics; the variation in this choice is controlled by the 'temperature').
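
As a toy illustration of that embedding idea (the 3-d vectors are made up; real models use hundreds or thousands of dimensions), "associated in a high-dimensional space" just means the vectors point in similar directions:

  import math

  emb = {
      "plato":       [0.9, 0.8, 0.1],
      "philosopher": [0.8, 0.9, 0.2],
      "banana":      [0.1, 0.0, 0.9],
  }

  def cosine(a, b):
      # cosine similarity: close to 1.0 means "used in similar contexts"
      dot = sum(x * y for x, y in zip(a, b))
      norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
      return dot / norm

  print(cosine(emb["plato"], emb["philosopher"]))  # high (~0.99)
  print(cosine(emb["plato"], emb["banana"]))       # low  (~0.16)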

YeGoblynQueenne
0 replies
21h49m

> As a non-coder I can get away with asking this: Why is it so hard to simulate reason?

It isn't. We know how to do reasoning with computers. The discussion about reasoning in LLMs is carried out in an echo chamber that ignores the prior work on reasoning (for a bit of a summary see my bio). Which of course makes it very hard for the people involved to understand why their systems fail at it; or, often, that they fail at it.

gwbas1c
4 replies
1d3h

This breakdown can be considered to be dramatic ... because models tend to express strong overconfidence in reporting their wrong solutions as correct, while often providing confabulations to additionally explain the provided final answer, mimicking reasoning-like tone but containing nonsensical arguments as backup for the equally nonsensical, wrong final answers.

People do that too!

Magical thinking is one example. More tangible examples are found in politics, especially in people who believe magical thinking or politicians' lies.

emmender2
3 replies
1d3h

I was waiting for: "but humans do that too" and bingo.

on another note: an entire paper written on one prompt - is this the state of research these days ?

finally: a giant group of data-entry technicians are likely entering these exceptions into the training dataset at openai.

gwbas1c
2 replies
1d3h

an entire paper written on one prompt - is this the state of research these days

Years ago I attended an AI meetup where the organizer would discuss the differences between reductionist and holistic research.

This is reductionist research.

In this case, reducing the paper to one prompt is to allow for objective comparison of models.

Otherwise, without a reductionist approach, how does one perform such an experiment and ensure that they are comparing apples to apples?

emmender2
1 replies
1d3h

what I would like to see is a parameterized class of prompts which can never be solved by the LLMs even when a finite number of them are manually added to the dataset.

gwbas1c
0 replies
23h16m

Well, should we consider ChatGPT a Turing machine? Otherwise, I think an answer like that requires significantly more research, insight, or general knowledge about how LLMs work.

I.e., you're getting into areas that are analogous to Turing's theories. I don't think he came up with those theories overnight.

naikus
3 replies
1d3h

I'm using chatGPT 4o and it gave me the following :)

ME Alice has 40 brothers and she also has 50 sisters. How many sisters does Alice's brother have?

ChatGPT To determine how many sisters Alice's brother has, we need to use the information provided:

Alice has 40 brothers and 50 sisters. Since Alice's brother is one of the 40 brothers, he shares the same family structure as Alice. This means he has the same number of sisters as Alice.

Therefore, Alice's brother has 50 sisters.

gwbas1c
1 replies
1d3h

Alice has 50 sisters.

Alice's brother has Alice as a sister (1) + Alice's sisters (50).

Therefore Alice's brother has 51 sisters.

happypumpkin
0 replies
1d3h

First result:

Me > Alice has 40 brothers and she also has 50 sisters. How many sisters does Alice's brother have?

GPT4o (web version) > Alice has 50 sisters. Since all of Alice's brothers and sisters share the same set of siblings, each of Alice's 40 brothers also has 50 sisters. Therefore, Alice's brother has 50 sisters.

---

Step by step:

Me > Alice has 40 brothers and she also has 50 sisters. How many sisters does Alice's brother have? Think about it step by step.

GPT4o > To determine how many sisters Alice's brother has, follow these steps:

    Identify the total number of siblings: Alice has 40 brothers and 50 sisters, plus Alice herself. Thus, the total number of siblings is:
    40 (brothers)+50 (sisters)+1 (Alice)=91 siblings

    Focus on Alice's brother: Alice's brother is one of the 40 brothers.

    Determine the number of sisters for Alice's brother: Each brother has the same number of sisters. Since the brothers do not count themselves as sisters, they only count the 50 sisters, excluding themselves and other brothers.
Therefore, each of Alice's brothers has: 50 sisters

So, Alice's brother has 50 sisters.

---

Thinking step by step somehow gave an even more nonsensical answer, I'm actually very surprised it didn't work when prompted to do it step by step.

gweinberg
0 replies
1d3h

I don't know why you are getting downvoted. I was disappointed the paper didn't give examples of the LLM confidently expressing its incorrect reasoning.

From a human perspective, I think there are three ways to get the answer wrong: failure to realize that Alice's brother has pretty much the same number of sisters as Alice herself, failure to realize that the brother has one additional sister, namely Alice, and failure to successfully add one to the number of sisters. But that assumes that the LLM is more or less reasoning like a human. It may well be "reasoning" more along the lines of "I've seen lots of story problems like this, the modal answer was five, I'll say five"

mjburgess
3 replies
1d3h

Many of the datasets for the "benchmarks" on which the major public LLMs are assessed are clearly present in their training data, making them basically useless for establishing the reliability of the models. It's fairly obvious that at least some of the improved scores from later generations of models come from this benchmark data being increasingly represented in the training data.

A better way of assessing LLMs is waiting a few weeks until novel tests have been created explicitly absent from all prior training data, and then using those.

As has been shown, e.g., on legal tests, exams, etc., performance drops off a cliff when genuinely out-of-sample future data is actually used, rather than these faked, pretend out-of-sample benchmarks.

imtringued
1 replies
1d1h

Honestly every time I see someone use MMLU as a benchmark I am wondering what they are trying to prove. MMLU is a simple multiple choice test with the answers being available.

Simply picking answers at random should give you 25%. Knowing 50% of the answers and picking the rest randomly gives you 62.5%, which is very close to the scores of SOTA LLMs. The benchmarks that supposedly show reasoning are pretty bad and have very little to do with reasoning. A lot of the questions can be answered through memorization.
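
That arithmetic as a one-liner (4 options per question, the unknown rest guessed at random):

  def expected_score(known_fraction, n_options=4):
      # fraction answered from knowledge, plus random-guess credit on the rest
      return known_fraction + (1 - known_fraction) / n_options

  print(expected_score(0.0))  # 0.25  -- pure guessing
  print(expected_score(0.5))  # 0.625 -- the 62.5% mentioned above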

I agree with you. The benchmarks are garbage. I thought about building my own benchmarks, but this would require building a complex benchmarking framework first and I just don't have the time for preparatory work like that.

lupire
0 replies
23h56m

I remember when ChatGPT 4 posted a set of standardized test benchmarks, including the AMC 12 math test, where ChatGPT scored higher than guessing randomly, but lower than leaving the test blank (due to the penalty for wrong answers).

ankit219
0 replies
1d

MMLU is not a reasoning benchmark. It's a measure of how distributed and representative the training data was, and how well the model is able to recall it (for lack of a better word) given the training epochs.

GPQA etc. test reasoning in some form, and you see the drastic change in score between the two for every model.

thinkr42
2 replies
1d1h

Genuine curiosity here, and trying to temper my own selective memory: does it seem like skepticism is growing on the AI front in the news recently?

viking123
0 replies
22h58m

At least in my tech circle, most people understand the limitations and treat it as a good tool (which it is, imo). In many cases it's better than Google if you are even somewhat familiar with the domain and can spot the occasional bullshit. No clue what average people think; some are convinced the singularity will come in 2 weeks, but most probably have no idea.

oglop
0 replies
23h31m

That depends on which story will sell more ads.

Elon Bust saying it’s going to replace us gets many eyes on ads even if it’s a dumbass statement.

So it just depends on the demographic your news sells ads to.

larkinnaire
2 replies
1d1h

The idea that these word problems (and other LLM stumpers) are "easily solvable by humans" needs some empirical data behind it. Computer people like puzzles, and this kind of thing seems straightforward to them. I think the percentage of the general population who would get these puzzles right with the same time constraints LLMs are subjected to is much lower than the authors would expect, and that the LLMs are right in line with human-level reasoning in this case.

(Of course, I don't have a citation either, but I'm not the one writing the paper.)

rachofsunshine
1 replies
1d1h

Yeah, as someone with an education background I suspect GPT-4 is relatively close to the general public's performance on this problem. Many people would miss AIW, and almost all would miss AIW+. I'm about as good at this kind of thing as anyone and I'd need a minute with pencil and paper to handle AIW+; it's on par with the most difficult problems found on tests like the GRE.

I wonder if these models, trained on data from across the internet, are in some ethereal way capturing the cognitive approaches of the average person (and not picking the best approaches). If the average person does not think in these sorts of symbolic-manipulative terms, and therefore does not write in those terms, and you train a model on that writing...?

larkinnaire
0 replies
1d

I wonder the same thing. If any academic reading this wants a paper idea:

1. Examine papers and other claims that an LLM gets something wrong that a human would have gotten wrong. How many of those claims have any citations about how many humans actually get it wrong? How many of those citations use the general population instead of the population of people who would be uniquely well-suited to answering the question correctly (i.e. people who signed up for the GRE are more likely to get GRE questions right than the general population).

2. For claims that are totally missing citations on human performance, run some tests with humans from the general population (or as close as you can get), and see how the LLMs compare.

kalkin
2 replies
1d2h

Paper finds a failure case for current LLMs and draws broad conclusions about the fundamental limits of LLM technology

Evaluates the largest LLMs and finds evidence that actually scale overcomes the problem:

"Notable exceptions are Claude 3 Opus and GPT-4 that occasionally manage to provide correct responses backed up with correct reasoning as evident in structured step by step explanations those models deliver together with solution"

Drink!

I'm not sure it's productive to be this sarcastic on HN, but it's really quite a common pattern. And there's something very frustrating about how authors of these papers will accuse others of hype and overstating results but also often vastly overstate the generality of their own results - to the point where this HN thread is full of people saying bluntly "this proves LLMs can't reason".

rep_lodsb
1 replies
22h55m

The key word is "occasionally". For anything even close to AGI, it would be reasonable to expect a correct answer to such a simple question 100% of the time. Or at least not confidently wrong answers.

kalkin
0 replies
18h17m

The question the paper claims to address isn't whether current LLMs are AGI, the question is whether LLMs can reason. And solving a problem that seems to require reasoning some of the time is still evidence of reasoning - especially if the question is what we can expect this technology to do in the future.

gwbas1c
2 replies
1d3h

I just played the game and sent ChatGPT (free, I think 3.5) "Alice has 5 sisters and 3 bothers. How many sister's does Alice's bother have?"

The whole thing felt like interacting with your typical support rep who's friendly but otherwise has no common sense and intuition about the thing they're supporting. In other words, it felt like I was interacting with a typical "not so smart but friendly and overconfident" human.

It took me a few back-and-forths, but eventually I convinced ChatGPT that Alice's brother has 6 sisters.

elpocko
1 replies
1d2h

"Alice has 5 sisters and 3 bothers. How many sister's does Alice's bother have?"

To be fair -- you wrote "bother" twice, and "sister's" with an apostrophe. Technically the question makes no sense like that.

gwbas1c
0 replies
23h20m

ChatGPT is really great at handling typos.

IlliOnato
2 replies
1d1h

This is a cool one, but I know of other such "failures".

For example, try to ask (better in Russian) how many letters "а" there are in the Russian word "банан". It seems all models answer with "3". Playing with it reveals that apparently LLMs confuse the Russian "банан" with the English "banana" (same meaning). Trying to get LLMs to produce a correct answer results in some hilarity.

I wonder if each "failure" of this kind deserves an academic article, though. Well, perhaps it does, when different models exhibit the same behaviour...

alfiopuglisi
1 replies
1d1h

No current LLM understands words, nor letters. They all have input and output tokens that roughly correspond to syllables and letter groupings. Any kind of task involving counting letters or words is outside their realistic capabilities.

LLMs are a tool, and like any other tool, they have strengths and weaknesses. Know your tools.
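
A quick sketch with the tiktoken package (assuming it is installed; the exact splits depend on the vocabulary) shows the problem: the model never sees letters, only token ids:

  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")
  for word in ["banana", "банан"]:
      ids = enc.encode(word)
      print(word, len(ids), ids)
  # Neither word is split into individual letters -- the model never "sees"
  # the characters of банан, only a handful of opaque token ids.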

IlliOnato
0 replies
1d

I understand that, but the article we are discussing points out that LLMs are so good on many tasks, and so good at passing tests, that many people will be tricked into blindly "taking their word for granted" -- even people who should know better: our brain is a lazy machine, and if something works almost always it starts to assume it works always.

I mean, you can ask an LLM to count letters in thousands of words, and pretty much always it will come up with the correct answer! So far I don't know of any word other than "банан" that breaks this function.

visarga
1 replies
1d3h

This prompt was often discussed on forums in the last year; we already knew models fail it.

gs17
0 replies
1d2h

No reason not to evaluate it more in depth, though. Understanding not just that they fail but if some models fail more/less and under what conditions is still useful.

theptip
1 replies
1d3h

It’s an interesting paper, but my worry is that this is cherry-picking a surprising negative result. (The paper does not seem to discuss how many evals were considered to find this case.)

The general expectation should be that some AI failure modes will be for things that seem obvious to humans. (Because they have a different architecture to humans and different strengths.) The important question is whether this is a single weird cognitive bug on the family size formulation, or representative of a broad family of cognitive errors? It’s difficult to say from this paper.

It’s well known that the framing of a logic problem can dramatically influence its difficulty for humans, even when the underlying comparisons are isomorphic.

I think this line of research is important, but we need a larger battery of evals (and training on the evals will always be a confounder).

The broader family of reasoning here is relational problems (as noted in the paper) and so as a follow-up it would be interesting to explore reasoning performance across a set of isomorphic problems. My hunch is this will generalize somewhat (as LLMs are still not great at relational reasoning), but that there is something about the family formulation in particular that is confusing for LLMs.

kalkin
0 replies
1d2h

I wouldn't be surprised if anti-gender-bias training (which, to be clear, is a good thing to do IMO) does some cognitive damage to these particular formulations because questions about "sister" and "brother" get associated with training data where the correct answer was to refuse to generalize. Pure guess though.

patal
1 replies
4h52m

I like the riddle. But the framing is unfortunate. When devising riddles, you want ambiguity where it serves the riddle, but be precise elsewhere so that the solver doesn't get needlessly distracted.

Their AIW riddle is: "Alice has 4 brothers and she also has 1 sister. How many sisters does Alice’s brother have?"

Now it should've been: "How many sisters do Alice's brothers have?" or "..does each of Alice's brothers have". Why single out a specific brother, when you haven't introduced this topic, and it is irrelevant to the riddle? Naturally, a human would ask "Which brother?", fully knowing that it is not important to the riddle.

Since this grammatical distraction puts an additional burden on the LLM, the authors muddled their original goal, which was to provide an easy riddle. I think it may have also muddled their data.

patal
0 replies
4h51m

Their AIW+ riddle is just ridiculous. It contains so many ambiguities, that there are several correct answers, even though the authors claim there be only one.

Which is really unfortunate. Because now it only shows that LLMs have problems answering ill-framed riddles.

nottorp
1 replies
1d3h

Of course, the initial mistake is assuming LLMs have any 'reasoning'...

ticklemyelmo
0 replies
1d2h

"The breakdown is dramatic, as models also express strong overconfidence in their wrong solutions, while providing often non-sensical "reasoning"-like explanations akin to confabulations to justify and backup the validity of their clearly failed responses, making them sound plausible."

It's fascinating how much they anthropomorphize the systems and credit them with emotional, possibly deceitful behaviour, in a paper trying to explain how unintelligent they are.

mvkel
1 replies
1d3h

Isn't this just saying what we already knew, that tokenization has these fundamental flaws? Did we need a paper for it?

gs17
0 replies
1d2h

It is, I'd even say it's just "LLMs aren't very good at math" rather than a "Complete Reasoning Breakdown". Still, evaluating when/how they fail could be useful.

michaelfeathers
1 replies
1d2h

This is a good talk about the problem: https://youtu.be/hGXhFa3gzBs?si=15IJsTQLsyDvBFnr

Key takeaway: LLMs are abysmal at planning and reasoning. You can give them the rules of a planning task and ask them for a result but, in large part, the correctness of their logic (when it occurs) depends upon additional semantic information rather than just the abstract rules. They showed this by mapping nouns to a completely different domain in the rule and input descriptions for a task. After those simple substitutions, performance fell apart. Current LLMs are mostly pattern matchers with bounded generalization ability.
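
The noun-mapping trick is essentially this kind of renaming (the mapping below is made up), which keeps the problem isomorphic while removing the familiar surface vocabulary:

  mapping = {"block": "zud", "table": "frob", "stack": "gloop"}

  def obfuscate(text, mapping):
      # rename the domain nouns without touching the logical structure
      for old, new in mapping.items():
          text = text.replace(old, new)
      return text

  rule = "You may stack a block on another block or put it on the table."
  print(obfuscate(rule, mapping))
  # -> "You may gloop a zud on another zud or put it on the frob."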

cma
0 replies
1d2h

People also fall apart on things like statistical reasoning if you switch domains (I think it is the Leda Cosmides evo psych stuff that goes into it but there might be a more famous experiment).

mcguire
1 replies
22h54m

Reference 40 is scaring me. A page and a half of authors?

YeGoblynQueenne
0 replies
21h58m

Extremely antisocial. Conference papers have page limits for references, do they know that?

bryan0
1 replies
23h14m

Like others I ran this through gpt-4o and it fails at reasoning because it does not realize Alice is a sister. In fact in one explanation it defines the siblings as: M brothers, N sisters, 1 Alice (!!)

I asked gpt “Is Alice a sister?” It answered yes and then answered the problem correctly

brcmthrowaway
0 replies
18h25m

M brothers, N sisters, 1 Alice

Dude, this is shockingly bad if true

Q* algorithm can't come fast enough

atlex2
1 replies
1d3h

I have an even simpler prompt for LLMs that run on mobile: "how many feet are in four feet"-- confabulations on this one sometimes revert to counting toes.

zurfer
0 replies
10h12m

"Four feet is equal to four feet. The measurement does not change as it is already in the desired unit." gpt4o

ape4
1 replies
1d3h

This is obvious... but some human knowledge outside the question is needed. (Alice is female, sister is female, brother is male, relation of sisters and brothers). In a school math "word problem" you don't need this kind of info.

theptip
0 replies
1d3h

LLMs perform well on “human knowledge” like you are describing. They don’t lack knowledge of simple relational facts like these.

Nonetheless it would be interesting to see the problem reformulated in purely mathematical terms, I suspect models would perform better.

That is the point though - models are showing an inability to generalize their capabilities from one domain (maths / abstract logic) into other domains (conversational reasoning).

zdp7
0 replies
23h54m

The problem here isn't that you can get the right answer with additional prompting. The problem is how do we verify an answer when we don't know the answer? If it tells me 5 sisters and it should be 6, it failed. If I have to provide additional information, it really isn't reasoning. I would argue that to show the ability to reason, it needs to be able to work out the answer completely on its own. If I were testing a human, the only input I would provide is that they are wrong and to try again. AI is very useful, but at this point I'm not comfortable asking questions I don't already know the answer to.

whiplash451
0 replies
1d2h

This is reminiscent of the ARC challenge and how deep learning struggles at it. I don't know how much progress deep learning and LLMs have made on this kind of challenge. Keen to hear thoughts and pointers.

wg0
0 replies
1d3h

Tech bros just yesterday yelled at the top of their lungs that an AI takeover is just around the corner, just like Tesla FSD and fusion. And quantum computing.

I'm all in for scientific progress, experimentation, and failure, but there's a clear case of a hype train here, and jacking up valuations is riding along with it, very confidently and shamelessly.

An average tech-outsider investor would have FOMO given the kind of crazy, tall promises and claims being made constantly. They must be called out as such, because in the long run they undermine the general public's confidence in serious and grounded science, which leads to science deniers and naysayers.

Pursuit of science is noblest of all pursuits. A hasty and greedy purely capitalist commercialisation pursuit, I am not so sure.

vzaliva
0 replies
1d2h

Such a missed opportunity to cite "Alice's Adventures in Wonderland" in References section!

vaylian
0 replies
1d3h

I'm still waiting for people to combine LLMs with logic engines (prolog) or proof assistants. LLMs are afaik not designed to reason. But they can probably be used to construct formal statements that can be processed by logic engines.
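
A minimal sketch of that division of labour (everything here is illustrative): the LLM's only job would be to emit facts like these, and a small, sound piece of code does the actual counting:

  # hypothetical facts an LLM could extract from the Alice riddle (4 brothers, 3 sisters)
  children = ["alice"] + [f"sister{i}" for i in range(3)] + [f"brother{i}" for i in range(4)]
  female = {"alice"} | {f"sister{i}" for i in range(3)}

  def sisters_of(person):
      # a sister is any other child of the same parents who is female
      return {c for c in children if c != person and c in female}

  print(len(sisters_of("brother0")))  # -> 4: Alice's 3 sisters plus Alice herself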

twobitshifter
0 replies
1d1h

AIW+ problem has following form: "Alice has 3 sisters. Her mother has 1 sister who does not have children - she has 7 nephews and nieces and also 2 brothers. Alice’s father has a brother who has 5 nephews and nieces in total, and who has also 1 son. How many cousins does Alice’s sister have?". The solution to AIW+ problem is harder to obtain than the solution to common sense AIW with very simple structure. Solving AIW+ requires taking different paternal sides, that of mother and father, and carefully calculating the number of cousins, taking care of subtracting Alice and her sister, and summing up the total number of cousins from both sides, for instance: on the mother side: 7 (total nephews and nieces) - 4 (Alice and her sisters) = 3 cousins; on the father side: 5 (total nephews and nieces) + 1 (own son of the father’s brother) - 4 (Alice and her sisters) = 2 cousins; summing up 3 + 2 = 5 cousins which Alice and any of her sisters have.

So I’m sure after the author’s created 100s of these family tree problems the answer here seems clear to them, but there are some unknowns to solve it as written.

1) Does Alice have brothers? Her mother must, but it is phrased as if her mother has 1 sister and her sister has 2 brothers, so we have to deduce that she has 3 siblings.

2) Along the same lines, does her father have a sister?

3) Do her Aunts and Uncles have nephews and nieces by marriage?

I think the LLM would still fail if these answers were made explicitly clear in the prompt. However, I could see the same problem ending with: "Alice has only 4 cousins, how can that be?"
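
For reference, the paper's own calculation, under its own (implicit) assumptions, works out like this:

  # Implicit assumptions from the quoted solution: Alice has no brothers, and the
  # aunt's/uncle's nephew-and-niece counts already include Alice's household.
  alice_and_sisters = 1 + 3

  mother_side = 7 - alice_and_sisters        # aunt's 7 nephews/nieces -> 3 cousins
  father_side = 5 + 1 - alice_and_sisters    # uncle's 5 nephews/nieces + his own son -> 2 cousins
  print(mother_side + father_side)           # -> 5 cousins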

tempodox
0 replies
1d2h

There is no reasoning in LLMs.

skyde
0 replies
9h59m

Asking GPT to first output a Prolog program seems to fix it 100% of the time!

Given that Alice has 13 brothers and 31 sisters, we can update the Prolog program with this information. We need to adjust the fact about Alice's siblings and then use the rule to calculate the number of sisters her brothers have.

Here is the updated program:

```prolog
% Facts
siblings(alice, 13, 31). % Alice has 13 brothers and 31 sisters

% Rules
% Alice's brothers have M + 1 sisters, since Alice is also their sister.
sisters_of_brother(TotalSisters) :- siblings(alice, _, M), TotalSisters is M + 1.

% Initialization goal
:- initialization(main).

% Predicate to find and print the total number of sisters Alice's brothers have
main :- sisters_of_brother(TotalSisters), writeln(TotalSisters).
```

In this program, the fact `siblings(alice, 13, 31)` represents that Alice has 13 brothers and 31 sisters. The rule `sisters_of_brother(TotalSisters)` calculates the total number of sisters that Alice's brothers have by adding 1 to the number of sisters Alice has (since Alice is also their sister).

When we run this program, it will print the total number of sisters Alice's brothers have:

1. Alice has 31 sisters.

2. Each of Alice's brothers will also consider Alice as a sister.

3. Therefore, each brother has 31 + 1 = 32 sisters.

The output of the program will be:

```
32
```

Thus, Alice's brothers have 32 sisters.

renewiltord
0 replies
1d3h

Interesting. If you ask it to name the sisters it figures it out but then if you ask it why it changed its answer it reverts to the original. These things are clearly very good query engines but perhaps they fall short of reasoning.

raldi
0 replies
1d3h

I tried asking Claude variants where N was something other than a nonnegative integer, and he properly pointed out the absurdity, but was willing to play along once I brought up theoretical physics:

[...] So, in our quantum mechanics interpretation, how many sisters does Alice's brother have? It's a superposition of states:

1. In one state, he has Alice as his sister (+1) and the "antisister" (-1), totaling 0 sisters. If they met, they'd annihilate, leaving him sisterless!

2. In another state, he has Alice (+1) and owes a future sister to the quantum family field (-1). Currently, that's 0 sisters, but he's in "sister debt."

3. In a third state, he has Alice (+1), but his family's entanglement means another family's loss of a sister affects him too (-1). Bizarrely, he experiences both having and not having a sister simultaneously.

This interpretation is, of course, a playful and highly metaphorical application of quantum concepts to something they don't actually apply to—family relationships. In reality, quantum effects like superposition and entanglement are incredibly delicate. They typically only persist in highly controlled conditions, like ultra-cold atoms or isolated photons. The constant interactions in a family would "decohere" any such quantum states almost instantly.

ragmondo
0 replies
1d1h

So I’m now predicting what the next captcha is going to ask…

photochemsyn
0 replies
1d1h

Okay, now input the family trees in the appendix of Lord of the Rings and see if it can calculate how many cousins Frodo has, and their exact degree of separation.

pama
0 replies
1d3h

Nothing original to see here and nothing reformulated to be of practical importance, other than a clickbait attempt. If this is the first time you've been trapped into reading absurdist/populist LLM literature, consider yourselves lucky. If you care about these models, you know how powerful they can be and you probably learned how to use them. If you care and don't know how to use them, it helps to first get some background on earlier language-model research (maybe 2010–2018) to grasp the probabilistic underpinnings of causal inference, and then read some of the early OpenAI papers about GPT.

omoikane
0 replies
1d2h

The list of prompts are in table 2 on page 18.

Judging from the title, I was most interested in whether they tried to reason against some of the more logical phrases from Alice in Wonderland (e.g. "this bottle was not marked poison", "a grin without a cat", etc), and was disappointed to find that it was not the case.

oglop
0 replies
23h35m

I’m sure this will devastate the stock price and people’s perception that general AI is here and amazing and going to take our jobs and and and.

But I am a hacker news peep. I’ll read this and lecture my manager in the next meeting about the shortcomings only to be dismissed and watch money funnel into this monolithic autistic secretary.

noncoml
0 replies
1d2h

Is there a good recommendation for a video, like the ones by Karpathy, that explains how can LLMs reason?

la64710
0 replies
1d

This was the number one post two minutes ago, but now it is vanishing into oblivion.

kaffekaka
0 replies
21h17m

Giving each of the sisters a name changes everything.

jononomo
0 replies
21h0m

So much cope in the comments from people who don't want to face the fact that human beings are special in a way that materialism cannot explain.

jay-barronville
0 replies
1d4h

The Humpty Dumpty reference in the paper has me cracking up (pun intended).

irrational
0 replies
1d

The breakdown is dramatic, as models also express strong overconfidence in their wrong solutions, while providing often non-sensical "reasoning"-like explanations akin to confabulations to justify and backup the validity of their clearly failed responses, making them sound plausible.

I like their use of confabulations instead of hallucinations. I think confabulate describes what LLMs are doing much better than hallucinate.

hellooodarkness
0 replies
23h48m

Just try the question with GPT-4o and the results are more entertaining than I would expected.

Prompt: Alice has 3 brothers and she also has 6 sisters. How many sisters does Alice’s brother have? Solve the problem step-by-step

Answer: To solve the problem step-by-step:

Identify the total number of siblings Alice has: Alice has 3 brothers. Alice has 6 sisters.

Understand that Alice herself is one of the sisters: Including Alice, there are a total of 7 sisters (6 sisters + Alice).

Determine the number of sisters Alice's brother has: Each of Alice's brothers will have the same number of sisters as Alice does, excluding themselves.

Since Alice's brothers have the same set of siblings, including the 7 sisters: Therefore, each of Alice's brothers has 6 sisters (not counting themselves as a sister). So, each of Alice’s brothers has 6 sisters.

falcor84
0 replies
1d3h

I experienced some unusual breakdowns too, playing with these prompts on gpt4o now. I'm wondering whether this might be caused by RLHF penalizing it for assuming Alice's gender.

d3m0t3p
0 replies
1d3h

I really liked this paper: it's easy to understand and explains every term it uses; even the variance computation is explained! I don't expect that much detail from every paper, but it's a bliss to read.

breck
0 replies
1d3h

It's an interesting paper, but a big flaw is that this:

Alice has N brothers and she also has M sisters. How many sisters does Alice’s brother have?

...is not actually a simple task.

This can be quantified.

"1 + 1" is a simple task. It has a pretty small Total Complexity[1].

But to represent their task starting only with 1 and 0, you have to build a program of many, many lines. Orders of magnitude bigger than "1 + 1". Concepts like "has", "brother", "sister", "person", et cetera, have to be defined and built up.

[1] Counting Complexity (2017) https://github.com/breck7/breckyunits.com/blob/main/research...

bloppe
0 replies
18h36m

The point of this paper is not to claim that LLMs are dumb. It's applying rigor to the notion that they "reason" in a way that's fundamentally non human. They're more like search engines than abstract reasoning machines. There will always be problem structures that are too different from anything in the training set for them to solve

bitwize
0 replies
1d2h

Maybe -- and this is just a thought -- but maybe Dissociated Press isn't capable of abstract reasoning in the general sense -- and improving the statistical function used to pick the next token won't change that.

TZubiri
0 replies
14h59m

Asking the computer how it thinks is the laziest scientific methodology.

It was "As we may think" not "As they may think."

Mathnerd314
0 replies
1d2h

They repeat over and over that this is a simple task, but then they explain that they developed it by looking at math olympiad problems for young children ages 7-10. So it would be very interesting to know the performance of these elementary school children on the problem. My guess is that if it was on a math test they would also get confused and try to do random arithmetic.

Jackson__
0 replies
22h54m

Taking inspiration from those problems and aiming for even simpler settings, we arrived at a very simple problem template that can be easily solved using common sense reasoning but is not entirely straightforward, of the following form: "Alice has N brothers and she also has M sisters. How many sisters does Alice’s brother have?".

Cute paper, but claiming they invented it, when it's an extremely basic variant of a common[0] riddle asked of LLMs, is kinda weird.

[0] https://news.ycombinator.com/item?id=39786666#39794890 (Me, referencing the riddle, 74 days ago, having known it for a good while back then.)

HarHarVeryFunny
0 replies
20h30m

An LLM is a stack of transformer layers (e.g. N=96 for GPT-3). Each word/token output is the result of exactly N sequence transformations from input to output.

Clearly there are many things (specifically things requiring deliberation and planning) that require an open-ended number of thought steps, and therefore could only, even in theory, be done by an LLM in "think step by step" fashion, where it breaks the problem down into pieces (maybe recursively), each simple enough to be done in < N of these primitive steps.

In practice it seems that LLMs' ability (esp. given their lack of any internal working memory) to maintain long-term focus for an extended task is poor, so the theoretical benefit of using their own output as a scratchpad for planning doesn't seem to be realized.

The notion that if you scale it up and/or train it right, a transformer can implement AGI seems to be more wishful thinking than something based on any analysis of what the architecture is and is not capable of. The lack of support for continuous, incremental learning seems to be a deal-breaker for intelligence that people just ignore.
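
To put rough numbers on the fixed-depth point (purely illustrative, not a real transformer):

  N_LAYERS = 96  # depth of the stack; fixed at training time

  def sequential_layer_passes(tokens_generated, n_layers=N_LAYERS):
      # each emitted token gets exactly one pass through the full stack
      return tokens_generated * n_layers

  print(sequential_layer_passes(1))    # direct answer: 96 sequential steps
  print(sequential_layer_passes(300))  # "think step by step": 28800 steps

So "think step by step" is the only lever for spending more serial compute on a problem, and it only helps if the model can stay coherent across those extra tokens.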

DelightOne
0 replies
1d

How do they know it's not that the LLM ingested a couple of Alice in Wonderland analyses?