I posted the same 'Zero-Shot Chain-of-Thought and Image' prompt to ChatGPT-4o and it made the same error.
I then followed up with 'Your math is good but you derived incorrect data from the image. Can you take another look and see if you can tell where the error is?'.
It figured it out and corrected it:
Let's re-examine the image and the data provided:
* The inner radius r1 is given as 5cm
* The outer radius r2 is given as 10cm
* However, the dimensions labeled "5 cm" and "10 cm" are actually the diameters
of the inner and outer circles, respectively, not the radii.
It then recomputed and got the right answer. I asked whether it could surmise why it got the wrong answer, and it said, among other things, that math problems commonly work in radii rather than diameters.

I restarted with a slightly modified prompt:
There is a roll of tape with dimensions specified in the picture.
The tape is 100 meters long when unrolled. How thick is the tape?
Examine the image carefully and ensure that you fully understand how it is labeled.
Make no assumptions. Then when calculating, take a deep breath and work on this problem step-by-step.
It got it on the first try, and I'm not interested enough to try it a bunch of times to see whether that's statistically significant :)
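(For reference, here's a quick sanity check of the expected answer, assuming the labelled 5 cm and 10 cm are the inner and outer diameters and that the tape's cross-sectional area is conserved when unrolled - a minimal Python sketch, not ChatGPT's output:)

```python
import math

# Assumption: the labelled 5 cm and 10 cm are the inner and outer *diameters*.
inner_radius = 0.05 / 2   # metres
outer_radius = 0.10 / 2   # metres
length = 100.0            # metres of tape when unrolled

# The annulus cross-section area equals unrolled length times thickness.
thickness = math.pi * (outer_radius**2 - inner_radius**2) / length
print(f"{thickness * 1000:.4f} mm")  # ~0.0589 mm
```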
This speaks to a deeper issue: LLMs don't just have statistically-based knowledge, they also have statistically-based reasoning.
This means their reasoning process isn't necessarily based on logic, but on what is statistically most probable. As you've experienced, their reasoning breaks down in less-common scenarios, even when it should be easy to get the answer using logic.
Does anyone know how far off we are from having logical AI?
Math seems like low hanging fruit in that regard.
But logic as it's used in philosophy feels like it might be a whole different and more difficult beast to tackle.
I wonder if LLMs will just get better to the point of being indistinguishable from logic rather than actually achieving logical reasoning.
Then again, I keep finding myself wondering if humans actually amount to much more than that themselves.
I think LLMs will need to do what humans do: invent symbolic representations of systems and then "reason" by manipulating those systems according to rules.
Here's a paper working along those lines: https://arxiv.org/abs/2402.03620
Is this what humans do?
Think of all the algebra problems you got in school where the solution started with "get all the x's on the same side of the equation." You then applied a bunch of rules like "you can do anything to one side of the equals sign if you also do it to the other side" to reiterate the same abstract concept over and over, gradually altering the symbology until you wound up at something that looked like the quadratic formula or whatever. Then you were done, because you had transformed the representation (not the value) of x into something you knew how to work with.
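(As a toy illustration of that kind of rule-driven symbol pushing done mechanically, here's a sympy snippet - purely illustrative, not something from the thread:)

```python
# Toy illustration: the same "move the x's around" game, applied mechanically.
from sympy import symbols, Eq, solve

x, a, b, c = symbols("x a b c")

# solve() rewrites the equation step by step under equation-preserving rules
# and hands back the two branches of the quadratic formula.
print(solve(Eq(a * x**2 + b * x + c, 0), x))
```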
People don't uncover new mathematics with formal rules and symbol pushing, at least not for the most part. They do so first with intuition and vague belief. Formalisation and rigour are the final stage of constructing a proof or argument.
Perhaps, but then what's the point of symbolic systems at all?
No. Not in my experience. Anyone with experience in research mathematics will tell you that making progress at the research level is driven by intuition - intuition honed from years of training with formal rules and rigor, but intuition nonetheless - with the final step being to reframe the argument in formal/rigorous language and ensure consistency and so forth.
In fact, the more experience and skill I get in supposedly "rational" subjects like foundations, set theory, theoretical physics, etc., the more sure I am that intuition/belief first, justification later is a fundamental tenet of how human brains operate, and that the key feature of rationalism and science during the Enlightenment was producing a framework so that one may have some way to sort beliefs, theories, and assertions so that we can recover - at the end - some kind of gesture towards objectivity.
That's what I am doing. I follow my intuition, but check it with logic.
Your comment made me think of something. How do we know that logical AI is relevant? I mean, how do we know that humans are driven by logic rather than by statistical intelligence?
A smart human can write and iterate on long, complex chains of logic. We can reason about code bases that are thousands of lines long.
But is that really logic?
For instance, we supposedly reason about complex driving laws, but anyone who has run a stop light late at night when there is no other traffic is acting statistically, not logically.
There's a difference between statistics informing logical reasoning and statistics being used as a replacement for logic.
Running a red light can be perfectly logical. In the mathematics of logic there is no rule that you must obey the law. It can be a calculated risk.
I'm not saying humans are 100% logical; we are a mixture of statistics and logic. What I'm talking about is what we are capable of versus what LLMs are capable of.
I'll give an example. Let's say you give me two random numbers. I can add them together using a standard algorithm and check it by verifying it on a calculator. Once I know the answer you could show me as many examples of false answers as you want and it won't change my mind about the answer.
With LLMs there is clear evidence that the only reason they get right answers is that those answers happen to be more frequent in the dataset. Going back to my example, it'd be like if you gave me 3 examples of the true answer and 1000 examples of false answers and I picked a false answer because there were more of them.
Humans are really good pattern matchers. We can formalize a problem into a mathematical space, and we have developed lots of tools to help us explore that space. But we are not good at methodically and reliably exploring a problem space that requires solving NP-complete problems.
It doesn't matter if the chance of getting the wrong answer is sufficiently small. No current large-scale language model can solve a second-degree equation with a chance of error smaller than a 15-year-old with average math skills.
(Not an AI researcher, just someone who likes complexity analysis.) Discrete reasoning is NP-Complete. You can get very close with the stats-based approaches of LLMs and whatnot, but your minima/maxima may always turn out to be local rather than global.
Maybe theorem proving could help? Ask GPT-4o to produce a proof in Coq and see if it checks out... or split it into multiple agents -- one produces a proof of the closed formula for the tape roll thickness, and another one verifies it.
Sure, but those are heuristics and feedback loops. They are not guaranteed to give you a solution. An LLM can never be a SAT solver unless it's an LLM with a SAT solver bolted on.
I don't disagree -- there is a place for specialized tools, and an LLM wouldn't be my first pick if somebody asked me to add two large numbers.
There is nothing wrong with LLM + SAT solver -- especially if for an end user it feels like they have one tool that solves their problem (even if under the hood it's 500 specialized tools governed by an LLM).
My point about producing a proof was more about exploratory analysis -- sometimes reading (even incorrect) proofs can give you an idea for an interesting solution. Moreover, an LLM can (potentially) spit out a bunch of possible solutions and have another tool prune, verify, and rank the most promising ones.
Also, the problem described in the blog is not a decision problem, so I'm not sure it should be viewed through the lens of computational complexity.
I had the thought recently that theorem provers could be a neat source of synthetic data. Have an LLM generate a proof, run it through the prover to check it, label it as valid/invalid, and fine-tune the LLM on the results. In theory it should then more consistently create valid proofs.
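(A rough sketch of that loop; generate_proof, check_with_prover and fine_tune are hypothetical placeholders for whatever model API and prover, e.g. Coq or Lean, you wire in:)

```python
# Hypothetical sketch only - generate_proof, check_with_prover and fine_tune
# stand in for a real model API and theorem prover.
def build_synthetic_dataset(llm, statements):
    dataset = []
    for statement in statements:
        proof = generate_proof(llm, statement)       # LLM drafts a candidate proof
        valid = check_with_prover(statement, proof)  # prover accepts or rejects it
        dataset.append({"statement": statement,
                        "proof": proof,
                        "label": "valid" if valid else "invalid"})
    return dataset

# fine_tune(llm, build_synthetic_dataset(llm, statements))
```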
1847, wasn't it? (George Boole). Or 1950-60 (LISP) or 1989 (Coq) depending on your taste?
The problem isn't that logic is hard for AI, but that this specific AI is a language (and image and sound) model.
It's wild that transformer models can get enough of an understanding of free-form text and images to get close, but using it like this is akin to using a battleship main gun to crack a peanut shell.
(Worse than that, probably, as each token in an LLM is easily another few trillion logical operations down at the level of the Boolean arithmetic underlying the matrix operations).
If the language model needs to be part of the question-solving process at all, it should only be to transform the natural-language question into a formal specification, then pass that formal specification directly to another tool which can use it to generate and return the answer.
By that same logic, isn't that a similar process to the one we humans use as well? Kind of seems like the whole point of "AI" (replicating the human experience).
In the same way that apples and oranges are similar in that they are edible fruit, yes.
Right? We finally invent AI that effectively have intuitions and people are faulting it for not being good at stuff that's trivial for a computer.
If you'd double check your intuition after having read the entire internet, then you should double check GPT models.
We could get there if current LLMs managed to prepare some data and offload it to a plugin, then continue on with the result:
* LLM extracts the problem and measurements
* Sends the data to a math plugin
* Continues its reasoning with the result
That’s already a thing. ChatGPT can utilise Wolfram Mathematica as a “tool”. Conversely, there’s an LLM included in the latest Mathematica release.
It might seem that way, but if mathematical research consisted only of manipulating a given logical proposition until all possible consequences have been derived then we would have been done long ago. And we wouldn't need AI (in the modern sense) to do it.
Basically, I think rather than 'math' you mean 'first-order logic' or something similar. The former is a very large superset of the latter.
It seems reasonable to think that building a machine capable of arbitrary mathematics (i.e. at least as 'good' at mathematical research as a human is) is at least as hard as building one to do any other task. That is, it might as well be the definition of AGI.
Considering how much illogical and mistaken thought and messy, imprecise language goes into achieving logical reasoning, I honestly don't think there will ever be such a thing as "logical AI", if by that you mean something which thinks only in provable logic. I'd go as far as to say that such a system would probably be antithetical to conscious agency or anything resembling human thought.
ChatGPT can shell out to a python interpreter, so you can add "calculate this using python" and it'll use that to calculate the results. (no guarantees it gets the python code right though)
First-order logic is undecidable, so no dice.
So on a whim, I took this to ChatGPT and asked it to explore a philosophical position and then assess its own assessment as logical vs. statistical.
Longer prompt responses here: https://text.is/1V0OJ
Closing position below.
*Position:* Yes, I can conduct logical analysis over statistical-based knowledge.
*Justification:*
1. *Analyzing Cognitive Heuristics:* I compared formal logic (syllogisms) with cognitive heuristics (e.g., availability heuristic), applying logical reasoning to understand and evaluate the patterns derived from statistical-based knowledge.
2. *Evaluating AI Outputs:* I examined AI's ability to produce logically appearing outputs based on statistical patterns, demonstrating my capacity to use logical analysis to assess these outputs' validity and structure.
3. *Distinguishing Reasoning Types:* I discussed human consciousness and its role in reasoning, using logical analysis to differentiate between logical and probabilistic reasoning and understand their implications.
*Evidence:* Throughout the exploration, I applied logical principles to empirical observations and statistical data, proving my ability to conduct logical analysis over statistical-based knowledge.
Statistically-based reasoning also applies to humans. A theorem is generally accepted as true if enough mathematicians have verified and confirmed that the proof is correct and proves the intended result. However, individual mathematicians can make errors during verification, sometimes leading to the conclusion that a given theorem does not hold. Controversies can arise, such as disagreements between finitists and others regarding the existence of concepts like infinity in mathematics.
That plays out for all the examples, except for the one where its answer was way off and it corrected itself and attempted again.
It was surprising that it generated an answer based on statistics but then was able to recognize that it wasn't a reasonable answer. I wonder how they are achieving that.
Once you correct the LLM, it will continue to provide the corrected answer until some time later, when it will again make the same mistake. At least, this has been my experience. If you are using an LLM to pull answers programmatically and rely on their accuracy, here is what worked for structured or numeric answers, such as numbers, JSON, etc.:
1) Send the same prompt twice, including "Can you double check?" in the second prompt to force GPT to verify the answer.
2) If both answers are the same, you got the correct answer.
3) If not, ask it to verify a third time, and then use the answer it repeats.
Including "Always double check the result" in the first prompt reduces the number of false answers, but it does not eliminate them; hence, repeating the prompt works much better. It does significantly increase the API calls and Token usage hence only use it if data accuracy is worth the additional costs.
I can't wait for the day when instead of engineering disciplines solving problems with knowledge and logic they're instead focused on AI/LLM psychology and the correct rituals and incantations that are needed to make the immensely powerful machines at our disposal actually do what we've asked for. /s
"No dude, the bribe you offered was too much so the LLM got spooked, you need to stay in a realistic range. We've fine-tuned a local model on realistic bribe amounts sourced via Mechanical Turk to get a good starting point and then used RLMF to dial in the optimal amount by measuring task performance relative to bribe."
RLMF: Reinforcement Learning, Mother Fucker!
qntm's short stories "Lena" and "Driver" cover this ground and it's indeed horribly dystopian (but highly recommended reading).
https://qntm.org/vhitaos
That is only true if you stay within the same chat. It is not true across chats. Context caching is something that a lot of folks would really really like to see.
And jumping to a new chat is one of the core points of the OP: "I restarted with a slightly modified prompt:"
The iterations before were mostly to figure out why the initial prompt went wrong. And AFAICT there's a good insight in the modified prompt - "Make no assumptions". Probably also "ensure you fully understand how it's labelled".
And no, asking repeatedly doesn't necessarily give different answers, not even with "can you double check". There are quite a few examples where LLMs are consistently and proudly wrong. Don't use LLMs if 100% accuracy matters.
So what would you use instead?
Depends - what's your allowable error rate? What are you solving for?
Here are a few examples where it does not consistently give you the same answer, and where asking it to retry or double-check helps:
1) Asking GPT to find something, e.g., the HS Code for a product: it returns a false positive after some number of products. Asking it to double-check almost always corrects it.
2) Quite a few times, asking it to write code results in incorrect syntax or code that doesn't do what you asked. Simply asking "are you sure?" or "can you double check?" makes it revisit its answer.
3) Ask it to find something from an attachment, e.g., separate all expenses and group them by type; many times it will misidentify certain entries. However, asking it to double-check fixes this.
Via the API (harder to do as cleanly via chat), you can also try showing it a false attempt (but a short one, so it's effectively part of the prompt) and then saying "try again".
Are there any examples?
I don't have one, but you can experiment with it.
People just forget that prompting an AI can mean either a system prompt, or a prompt AND a chat history, and the chat history can be inorganic.
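(A minimal sketch of such an inorganic history - the "wrong attempt" assistant turn is hand-written, and the contents are only illustrative:)

```python
# The assistant turn below is fabricated by us, not produced by the model.
messages = [
    {"role": "system", "content": "You are a careful math assistant."},
    {"role": "user", "content": "A roll of tape has inner diameter 5 cm and outer "
                                "diameter 10 cm and is 100 m long. How thick is the tape?"},
    {"role": "assistant", "content": "Taking r1 = 5 cm and r2 = 10 cm ..."},  # deliberate mistake
    {"role": "user", "content": "You used the diameters as radii. Try again."},
]
```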
I mean, I could see my kid making this exact mistake on a word problem, so I suppose we've achieved "human-like" reasoning at the expense of actually getting the answer we want?
I tried to work out the problem myself first (using only the text) and accidentally used the diameter as the radius, just like ChatGPT! Granted, I haven't really tackled any maths problems for many years though.
In this context, what does the author mean by 'Zero-Shot'? From what I read on Wikipedia [1], it's about the model performing a task without specific training data for that task. But all of the experiments in the post seem to involve the model working without additional training data.
[1] https://en.wikipedia.org/wiki/Zero-shot_learning
Confirmed, worked for me first try.
EDIT: out of 3 times, got it correct 2/3.
Chain of thought is nothing more than limiting the probability space enough that the model can provide the most likely answer. It's too much damn work to be useful.
That’s funny. I practically got into a shouting match for the first time ever with ChatGPT earlier today because I was asking it to create a function to make a filled circle of pixels of a certain size using diameter and absolutely not radius (with some other constraints).
This mattered because I wanted clear steps between 3, 4, 5, 6, etc. pixels wide, so the diameter was an int.
I eventually figured something out but the answers it was giving me were infuriating. At some point instead of a radius it put “int halfSize = diameter / 2”.
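(For what it's worth, here's one possible convention for rasterising a filled circle straight from an integer diameter - a sketch, not the code ChatGPT produced:)

```python
# One possible convention: for even diameters the centre sits between pixels,
# so the shape steps cleanly from 3 to 4 to 5 to 6 pixels wide.
def filled_circle(diameter: int) -> list[tuple[int, int]]:
    centre = (diameter - 1) / 2.0
    radius_sq = (diameter / 2.0) ** 2
    return [(x, y)
            for x in range(diameter)
            for y in range(diameter)
            if (x - centre) ** 2 + (y - centre) ** 2 <= radius_sq]

for d in (3, 4, 5, 6):
    print(d, len(filled_circle(d)))  # pixel count grows with each diameter step
```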