
CriticGPT: Finding GPT-4's mistakes with GPT-4

lowyek
52 replies
3d21h

I find it fascinating that in other fields you see a lot of theorems/results well before practical results are found, but at this forefront of innovation I have hardly seen any paper discussing hallucinations and lower/upper bounds on them. Or maybe I didn't open Hacker News on the right day when such a paper was published. I would love to understand the hallucination phenomenon more deeply and the mathematics behind it.

hbn
42 replies
3d21h

the hallucination phenomena

There isn't really such a thing as a "hallucination" and honestly I think people should be using the word less. Whether an LLM tells you the sky is blue or the sky is purple, it's not doing anything different. It's just spitting out a sequence of characters it was trained to produce in the hope that it's what a user wants. There is no definable failure state you can call a "hallucination"; it's operating as correctly as any other output. But sometimes we can tell, either immediately or through fact checking, that it spat out a string of text claiming something incorrect.

If you start asking an LLM for political takes, you'll get very different answers from humans about which ones are "hallucinations"

raincole
20 replies
3d20h

I don't know why the narrative became "don't call it hallucination". Granted, English isn't my mother tongue so I might be missing some subtlety here. If you know how an LLM works, calling it "hallucination" doesn't make you know less. If you don't know how an LLM works, using "hallucination" doesn't make you know less either. It's just a word meaning the AI gives a wrong[1] answer.

People say it's "anthropomorphizing" but honestly I can't see it. The I in AI stands for intelligence; is that anthropomorphizing? The L in ML? Reading and writing are clearly human activities, so is using read/write instead of input/output anthropomorphizing? How about "computer", a word that once meant a human who does computing? Is there a word we can use safely without anthropomorphizing?

[1]: And please don't argue what's "wrong".

p1esk
8 replies
3d18h

It's just a word meaning AI gives wrong answer.

No, it’s more specific than just wrong.

Hallucination is when a model creates a bit of fictitious knowledge, and uses that knowledge to answer a question.

olalonde
5 replies
3d15h

Can you give an example of a "wrong" answer vs a "hallucinated" answer?

influx
2 replies
3d8h

Wrong is code that doesn’t compile. Hallucinated is compilable code using a library that never existed.

gbnwl
1 replies
3d4h

Can code using a library that doesn't exist compile? I admit ignorance here.

influx
0 replies
1d16h

No, it can't. I should have said code that has valid syntax but uses APIs or libraries that don't exist.
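
A minimal sketch of the difference, with a made-up package name purely for illustration:

    # Valid Python syntax, so it isn't "wrong" in the parse/compile sense,
    # but the package and the function it calls are invented (hallucinated).
    import fastjson_turbo  # made-up name; to my knowledge no such package exists

    def load_config(path):
        # fastjson_turbo.parse_file() is an API that was never written;
        # only running the code or checking the docs reveals that.
        return fastjson_turbo.parse_file(path, strict=True)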

stefanve
0 replies
3d10h

There are many types of wrong answer, and the difference is based on how the answer came to be. In the case of BS/hallucination there is no reason or logic behind the answer; in the case of an LLM, it is basically just random text. There was no reasoning behind the output, or it wasn't based on facts.

You can argue whether it matters how a wrong answer came about, of course, but there is a difference.

omikun
0 replies
3d13h

The issue is there is no difference between a right answer and a hallucinated answer.

ruszki
1 replies
3d12h

It doesn't need to create wrong answers. It's enough to recall people who gave wrong answers.

ben_w
0 replies
3d11h

I've heard the term originated in image recognition, where models would "see" things that weren't there.

You can still get that with zero bad labels in a supervised training set.

Multiple causes for the same behaviour make progress easier, but knowing whether it's fully solved harder.

VanillaCafe
3 replies
3d17h

I don't know why the narrative became "don't call it hallucination".

Context is that "don't call it hallucination" picked up meme energy after https://link.springer.com/article/10.1007/s10676-024-09775-5, on the thesis that "Calling their mistakes ‘hallucinations’ isn’t harmless: it lends itself to the confusion that the machines are in some way misperceiving but are nonetheless trying to convey something that they believe or have perceived."

Which is meta-bullshit because it doesn't matter. We want LLMs to behave more factually, whatever the non-factuality is called. And calling that non-factuality something else isn't going to really change how we approach making them behave more factually.

intended
2 replies
3d13h

How are LLMs not behaving factually? They already predict the next most likely term.

If they could predict facts, then these would be gods, not machines. It would be saying that in all the written content we have, there exists a pattern that allows us to predict all answers to questions we may have.

ijk
1 replies
3d12h

The problem is that some people are running around and saying they are gods. Which I wouldn't care about, but an alarming number of people do believe that they can predict facts.

intended
0 replies
3d11h

Our system can effectively predict facts.

It logics its way to it.

By predicting the next word in a sequence of words.

Sure? It kinda sounds plausible? But man, if it's that straightforward, what have we been doing as a species for so many years?

Affric
2 replies
3d19h

The AI companies don’t want you “anthropomorphising” the models because it would put them at risk of increased liability.

You will be told that linear algebra is just a model and the fact that epistemology has never turned up a decent result for what knowledge is will be ignored.

We are meant to believe that we are somehow special magical creatures and that the behaviour of our minds cannot be modelled by linear algebra.

duggan
0 replies
3d11h

No, you’re just meant not to assert that linear algebra is equivalent to any process in the human brain, when the human brain is not understood well enough to draw that conclusion.

ben_w
0 replies
3d11h

I don't see how anthropomorphism reduces liability.

If a company does a thing that's bad, it doesn't matter much if the work itself was performed by a blacksmith or by a robot arm in a lights-off factory.

We are meant to believe that we are somehow special magical creatures and that the behaviour of our minds cannot be modelled by linear algebra

I only hear this from people who say AI will never reach human level; of AI developers that get press time, only LeCun seems so dismissive (though I've not actually noticed him making this specific statement, I can believe he might have).

th0ma5
1 replies
3d20h

AI is a nebulous, undefined term, and many people specifically criticize the use of the word intelligent.

Affric
0 replies
3d19h

Always people who consider themselves intelligent

intended
0 replies
3d13h

TLDR TLDR: Assuming we don't argue right/wrong, technically everything an LLM does is a hallucination. This completely dilutes the meaning of the word, no?

TLDR: Sure. A rose by any other name would be just as sweet. It’s when I use the name of the rose and imply aspects that are not present, that we create confusion and busy work.

Hey, calling it a narrative is to move it to PR speak. I know people have argued this term was incorrect since the first times it was ever shared on HN.

It was unpopular to say this when ChatGPT launched, because chatGPT was just that. freaking. cool.

It is still cool.

But it is not AGI. It does not “think”.

Hell - I understand that we will be doing multiple columns of turtles all the way down. I have a different name for this approach - statistical committees.

Because we couched its work in terms of "thinking", "logic", "creativity", we have dumped countless man-hours and money into avenues which are not fruitful. And this isn't just me saying it - even Ilya commented during some event that many people can create PoCs, but there are very few production-grade tools.

Regarding the L in ML, and the I in AI ->

1) ML and AI were never quite as believable as ChatGPT. Calling it learning and intelligence doesn't result in the same level of ambiguity.

2) A little bit of anthropomorphizing was going on.

Terms matter, especially at the start. New things get understood over time, as we progress we do move to better terms. Let’s use hallucinations for when a digital system really starts hallucinating.

devjab
0 replies
3d10h

I suspect it’s about marketing. I’m not sure it would be so easy to sell these tools to enterprise organisations if you outlined that they are basically just very good at being lucky. With the abstraction of hallucinations you sort of put into language why your tool is sometimes very wrong.

To me the real danger comes from when the models get things wrong but also correct at the same time. Not so much in software engineering; I doubt your average programmer without LLM tools will write "better" code without getting some bad answers. What concerns me more is how non-technical departments implement LLMs into their decision-making or analysis systems.

Done right, it'll enhance your capabilities. We had a major AI project in cancer detection, and while it actually works, it doesn't really work on its own. Obviously it was meant to enhance the regular human detection, and anyone involved with the project screamed this loudly at any chance they got. Naturally it was seen as an automation process by upper management and all the human parts of the process were basically replaced... until a few years later, when we had a huge scandal about how the AI worked as it was meant to, which wasn't on its own. Today it works alongside the human detection systems and their quality is up. It took people literally dying to get that point through.

Maybe it would’ve happened this way anyway if the mistakes weren’t sort of written into this technical issue we call hallucinations. Maybe it wouldn’t. From personal experience with getting projects to be approved, I think abstractions are always a great way to hide the things you don’t want your decision makers to know.

IanCal
8 replies
3d19h

It's the new "serverless" and I would really like people to stop making the discussion about the word. You know what it means, I know what it means, let's all move on.

We won't, and we'll see this constant distraction.

CaptainOfCoit
6 replies
3d19h

It's the new "serverless" and I would really like people to stop making the discussion about the word. You know what it means, I know what it means, let's all move on.

Well, the parent is lamenting the lack of a lower/upper bound for "hallucinations", something that cannot realistically exist because "hallucinations" don't exist. LLMs aren't fact-outputting machines, so when one outputs something a human would consider "wrong", like "the sky is purple", it isn't true/false/correct/incorrect/hallucination/fact; it's just the most probable next character, one after another.

That's why it isn't useful to ask "but how much does it hallucinate?" when what you're really after is something more like "does it only output facts?". Which, if it did, would make LLMs a lot less useful.

elif
2 replies
3d18h

There is a huge gap between "facts" and "nonfacts" which compose the majority of human discourse. Statements, opinions, questions, when properly qualified, are not facts or nonfacts or hallucinations.

LLMs don't need to be perfect fact machines at all to be honest and non-hallucinating. They simply need to ground statements in other grounded statements and identify the parts which are speculative or non-grounded.

kreyenborgi
1 replies
3d11h

If you simply want to ground statements in statements, you quickly get into GOFAI territory where you need to build up the full semantics of a sentence (in all supported languages) in order to prove that two sentences mean the same or have the same denotation or that one entails the other.

Otherwise, how do you prove the grounding isn't "hallucinated"?

mattigames
0 replies
3d6h

The root issue is that we humans perceive our own grasp on things as better than it is ("better" may be the wrong word, maybe just "different"), including how exactly concepts are tied to each other in our heads. That perception has been a primordial tool for our survival and for our day-to-day lives, but it's at odds with the task of building reasoning skills into a machine, because language evolved first and foremost to communicate among beings that share a huge context. For example, our definition of the word "blue" in "the sky is blue" would be wildly different if humans were all blind (as the machine is, in a sense).

IanCal
2 replies
3d10h

it's just the most probable character after the next.

That's simply not true. You're confusing how they're trained and what they do. They don't have some store of exactly how likely each word is (and it's worth stopping to think about what that would even mean) for every possible sentence.

CaptainOfCoit
1 replies
3d1h

That's simply not true.

It's a simplification. Temperature, as an example, also means it doesn't always pick the most probable character.
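
A toy sketch of what temperature does at sampling time (the logits are made up; this is not how any particular model is implemented):

    import math, random

    def sample_next_token(logits, temperature=1.0):
        # Temperature rescales the logits before softmax: T < 1 sharpens the
        # distribution toward the top token, T > 1 flattens it.
        scaled = {tok: v / temperature for tok, v in logits.items()}
        m = max(scaled.values())
        exps = {tok: math.exp(v - m) for tok, v in scaled.items()}
        total = sum(exps.values())
        # Sample from the distribution instead of always taking the argmax,
        # so the output is not literally "the most probable" token each step.
        r = random.random() * total
        acc = 0.0
        for tok, e in exps.items():
            acc += e
            if r <= acc:
                return tok
        return tok  # guard against floating-point rounding

    # Toy next-token distribution after "the sky is":
    print(sample_next_token({"blue": 3.2, "clear": 1.1, "purple": 0.4}, temperature=0.7))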

IanCal
0 replies
1d11h

No, it's fundamentally not true, because when you say "most likely" you mean the highest-valued output of the model, which is not what's most likely in the underlying data, nor the goal of what is being trained for.

elif
0 replies
3d18h

"you know what it means, I know what it means"

It is somewhat humorous when humans have ontological objections to the neologisms used to describe a system whose entire function is to relate the meanings of words. It is almost as if the complaint is itself a repressed philosophical rejection of the underlying LLM process, only being wrapped in the apparent misalignment of the term hallucination.

The complaint may as well be a defensive clinging "nuh uh, you can't decide what words mean, only I can"

Perhaps the term "gaslighting" is also an appropriate replacement for "hallucination," one which is not predicated on some form of truthiness standard, but rather THIS neologism focuses on the manipulative expression of the lie.

mortenjorck
3 replies
3d20h

It is an unfortunately anthropomorphizing term for a transformer simply operating as designed, but the thing it's become a vernacular shorthand for, "outputting a sequence of tokens representing a claim that can be uncontroversially disproven," is still a useful concept.

There's definitely room for a better label, though. "Empirical mismatch" doesn't quite have the same ring as "hallucination," but it's probably a more accurate place to start from.

NovemberWhiskey
1 replies
3d20h

"outputting a sequence of tokens representing a claim that can be uncontroversially disproven," is still a useful concept.

Sure, but that would require semantic mechanisms rather than statistical ones.

riwsky
0 replies
3d11h

Statistics has a semantics all its own

hbn
0 replies
3d20h

Regardless I don't think there's much to write papers on, other than maybe an anthropological look at how it's affected people putting too much trust into LLMs for research, decision-making, etc.

If someone wants info on making their model more reliable for a specific domain, it's in the existing papers on model training.

hatthew
1 replies
3d19h

LLMs model their corpus, which for most models tends to be factually correct text (or subjective text with no factuality). Sure, there exist factually incorrect statements in the corpus, but for the vast majority of incorrect statements there exist many more equivalent but correct statements. If an LLM makes a statement that is not supported by the training data (either because it doesn't exist or because the equivalent correct statement is more strongly supported), I think that's an issue with the implementation of the model. I don't think it's an intrinsic feature/flaw in what the model is modeling.

Hallucination might not be the best word, but I don't think it's a bad word. If a weather model predicted a storm when there isn't a cloud in the sky, I wouldn't have a problem with saying "the weather model had a hallucination." 50 years ago, weather models made incorrect predictions quite frequently. That's not because they weren't modeling correct weather, it's because we simply didn't yet have good models and clean data.

Fundamentally, we could fix most LLM hallucinations with better model implementations and cleaner data. In the future we will probably be able to model factuality outside of the context of human language, and that will probably be the ultimate solution for correctness in AI, but I don't think that's a fundamental requirement.

intended
0 replies
3d13h

I suspect you would fix first response accuracy.

People still want it to be used for thinking.

This isn't going to happen with better data. Better data means it will be better at predicting the next token.

For questions or interactions where you need to process, consider, decompose a problem into multiple steps, solve those steps etc - you need to have a goal, tools, and the ability to split your thinking and govern the outcome.

That isn't predicting the next token. I think it's easier to think of LLMs as doing decompression.

They take an initial set of tokens and decompress them into the most likely final set of tokens.

What we want is processing.

We would have to set up the reaction to somehow perfectly result in the next set of tokens to then set up the next set of tokens etc - till the system has an answer.

Or in other words, we have to figure out how to phrase an initial set of tokens so that each subsequent set looks similar enough to “logic” in the training data, that the LLM expands correctly.

wincy
0 replies
3d12h

Well what the heck was Bing Chat doing when it wrote me a message all in emojis like it was the Zodiac killer telling me a hacker had taken it over then spitting out Python code to shutdown the system, and giving me nonsense secret messages like “PKCLDUBB”?

What am I supposed to call that?

sqeaky
0 replies
3d16h

Yet for their value as tools, the truth value of statements made by LLMs does matter.

shrimp_emoji
0 replies
3d19h

It should be "confabulation", since that's not carting along the notion of false sensory input.

Humans also confabulate but not as a result of "hallucinations". They usually do it because that's actually what brains like to do, whether it's making up stories about how the world was created or, more infamously, in the case of neural disorders where the machinery's penchant for it becomes totally unmoderated and a person just spits out false information that they themselves can't realize is false. https://en.m.wikipedia.org/wiki/Confabulation

sandworm101
0 replies
3d19h

Hallucination is emergent. It cannot be found as a thing inside the AI systems. It is a phenomenon that only exists when the output is evaluated. That makes it an accurate description. A human who has hallucinated something is not lying when they speak of something that never actually happened, nor are they making any sort of mistake in their recollection. Similarly, an AI that is hallucinating isn't doing anything incorrect and doesn't have any motivation. The hallucinated data emerges just as any other output, only to be evaluated by outsiders as incorrect.

nl
0 replies
3d17h

There isn't really such a thing as a "hallucination" and honestly I think people should be using the word less. Whether an LLM tells you the sky is blue or the sky is purple, it's not doing anything different. It's just spitting out a sequence of characters it was trained to produce in the hope that it's what a user wants. There is no definable failure state you can call a "hallucination"; it's operating as correctly as any other output.

This is a very "closed world" view of the phenomenon which looks at an LLM as a software component on its own.

But "hallucination" is a user experience problem, and it describes the experience very well. If you are using a code assistant and it suggests using APIs that don't exist then the word "hallucination" is entirely appropriate.

A vaguely similar analogy is the addition of the `let` and `const` keywords in JS ES6. While the behavior of `var` was "correct" as-per spec the user experience was horrible: bug prone and confusing.

emporas
0 replies
3d20h

Chess engines, which have been used daily for 25 years by the best human chess players, compute the best next move on the board. The total number of possible chess positions is more than the number of atoms in the universe.

Is it possible for a chess engine to compute the next move and be absolutely sure it is the best one? It's not; it is a statistical approximation, but still very useful.

eskibars
2 replies
3d6h

FYI, I work at Vectara and can answer any questions.

For us, we treat hallucinations as the ability to accurately respond in an "open book" format for retrieval augmented generation (RAG) applications specifically. That is, given a set of information retrieved (X), does the LLM-produced summary:

1. Include any "real" information not contained in X? If "yes," it's a hallucination, even if that information is general knowledge. We see this as an important way to classify hallucinations in a RAG+summary context because enterprises have told us they don't want the LLMs "reading between the lines" to infer things. To pick an absurd/extreme case to make the point: say a genetic research firm using CRISPR finds it can create a purple zebra. If the retrieval system in the RAG pipeline says "zebras can be purple" due to their latest research, we don't want the LLM to override that knowledge with its own knowledge that zebras are only ever black/white/brown. We'd treat that as a hallucination.

2. On the extreme opposite end, an easy way to avoid hallucinating would be for the LLM to say "I don't know" to everything, avoiding hallucination by never answering. That has other obvious negative effects, so we also evaluate LLMs on their ability to answer.

We look at the factual consistency, answer rate, summary length, and some other metrics internally to focus prompt engineering, model selection, and model training: https://github.com/vectara/hallucination-leaderboard
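
Roughly, the "open book" check looks like the sketch below; consistency_score() is a placeholder for a trained factual-consistency model (e.g. the kind behind the leaderboard), not our actual API:

    def consistency_score(premise, claim):
        """Placeholder: return P(claim is supported by premise) in [0, 1]."""
        raise NotImplementedError("swap in a real factual-consistency model")

    def flag_hallucinations(retrieved_passages, summary_sentences, threshold=0.5):
        # A sentence counts as hallucinated if it isn't grounded in the
        # retrieved text, even if it happens to be true general knowledge
        # (the purple-zebra rule above).
        premise = "\n".join(retrieved_passages)
        return [s for s in summary_sentences
                if consistency_score(premise, s) < threshold]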

cootsnuck
1 replies
3d4h

Great repo, glad y'all are looking into this. So am I reading correctly that Intel has a 7B model that does remarkably well at not hallucinating??

eskibars
0 replies
2d13h

That's correct. We've got a blog that talks a bit about it: https://vectara.com/blog/do-smaller-models-hallucinate-more/

Some people are surprised by smaller models having the ability to outperform bigger models, but it's something we've been able to exploit: if you fine tune a small model for a specific task (e.g. reduced hallucinations on a summarization task) as Intel has done, you can achieve great performance economically.

lowyek
0 replies
3d19h

thank you for sharing this!

dennisy
1 replies
3d21h

Not sure if there is a great deal of maths to understand. The output of an LLM is stochastic by nature and will read as syntactically perfect even when the content is made up, AKA a hallucination.

No real way to mathematically prove this, considering there is also no way to know if the training data also had this “hallucination” inside of it.

ben_w
0 replies
3d20h

I think mathematical proof is the wrong framework, in the same way that chemistry is the wrong framework for precisely quantifying and explaining how LSD causes humans to hallucinate (you can point to which receptors it binds with, but AFAICT not much more than that).

Investigate it with the tools of psychology, as suited for use on a new non-human creature we've never encountered before.

beernet
0 replies
3d21h

How are 'hallucinations' a phenomenon? I have trouble with the term 'hallucination' and believe it sets the wrong narrative. It suggests something negative or unexpected, which it absolutely is not. Language models aim at, as their name implies, modeling language. Not facts or anything of the sort. This is by design, and you certainly don't have to be an AI researcher to grasp that.

That being said, people new to the field tend to believe that these models are fact machines. In fact, they are the complete opposite.

amelius
0 replies
3d21h

I don't see many deep theorems in the field of psychology either.

GiorgioG
40 replies
4d1h

All these LLMs make up too much stuff, I don't see how that can be fixed.

elwell
32 replies
4d

All these LLMs make up too much stuff, I don't see how that can be fixed.

All these humans make up too much stuff, I don't see how that can be fixed.

advael
15 replies
3d23h

The problems of epistemology and informational quality control are complicated, but humanity has developed a decent amount of social and procedural technology to do these, some of which has defined the organization of various institutions. The mere presence of LLMs doesn't fundamentally change how we should calibrate our beliefs or verify information. However, the mythology/marketing that LLMs are "outperforming humans" combined with the fact that the most popular ones are black boxes to the overwhelming majority of their users means that a lot of people aren't applying those tools to their outputs. As a technology, they're much more useful if you treat them with what is roughly the appropriate level of skepticism for a human stranger you're talking to on the street

mistermann
14 replies
3d22h

I wonder what ChatGPT would have to say if I ran this text through with a specialized prompt. Your choice of words is interesting, almost like you are optimizing for persuasion, but simultaneously I get a strong vibe of intention of optimizing for truth.

advael
8 replies
3d21h

I think you'll find I'm quite horseshit at optimizing for persuasion, as you can easily verify by checking any other post I've ever made and the response it generally elicits. I find myself less motivated by what people think of me every year I'm alive, and less interested in what GPT would say about my replies each of the many times someone replies just to ponder that instead of just satisfying their curiosity immediately via copy-paste. Also, in general it seems unlikely humans function as optimizers natively, because optimization tends to require drastically narrowing and quantifying your objectives. I would guess that if they're describable and consistent, most human utility functions look more like noisy prioritized sets of satisfaction criteria than the kind of objectives we can train a neural network against

mistermann
7 replies
3d20h

This on the other hand I like, very much!

Particularly:

Also, in general it seems unlikely humans function as optimizers natively, because optimization tends to require drastically narrowing and quantifying your objectives. I would guess that if they're describable and consistent, most human utility functions look more like noisy prioritized sets of satisfaction criteria than the kind of objectives we can train a neural network against

Considering this, what do you think us humans are actually up to, here on HN and in general? It seems clear that we are up to something, but what might it be?

advael
6 replies
3d19h

On HN? Killing time, reading articles, and getting nerdsniped by the feedback loop of getting insipid replies that unfortunately so many of us are constantly stuck in

In general? Slowly dying mostly. Talking. Eating. Fucking. Staring at microbes under a microscope. Feeding cats. Planting trees. Doing cartwheels. Really depends on the human

mistermann
5 replies
3d11h

I would tend to agree!!

Talking.

Have you ever noticed any talking that ~"projects seriousness &/or authority about important matters" around here?

advael
4 replies
2d23h

I think most people do that all the time. Projecting authority is one of the most important skills in a world dominated by human institutions, because it's an effective means of manipulating most humans. Sad but true

mistermann
3 replies
2d22h

Do you know any single person who can stop the process, at will? Maybe not always, but at least sometimes, on demand (either internally or externally invoked)?

advael
2 replies
2d20h

What, like not project authority? Admit that they are lost, confused, powerless, don't know something, aren't in control? Break the empire society kayfabe?

Yes, absolutely. I view this as one of the criteria by which I assess emotional maturity, and despite societal pressures to never do so, many manage to, even though most don't

I'm not a sociologist, but I think the degree to which people can't turn it off maps fairly well onto the "low-high trust society" continuum, with lower trust implying less willingness or even sometimes ability to stop trying to do this on average, though of course variation will exist within societies as well

I have this intuition because I think the question of whether to present vulnerability and openness versus authority and strength is essentially shaped like a prisoner's dilemma, with all that that implies

mistermann
1 replies
1d1h

I'm not a sociologist, but I think the degree to which people can't turn it off maps fairly well onto the "low-high trust society" continuum

We're not fully aligned here....I'm thinking more like: stop (or ~isolate/manage) non-intentional cognition, simulated truth formation, etc.....not perfectly in a constant, never ending state of course, but for short periods of time, near flawlessly.

advael
0 replies
11h1m

Sure. There are people who can do that. I think it's a hard skill to master but definitely one that can be performed and improved somewhat reliably for people who manage to get the hang of it initially and care to work at it, and which I have seen a decent number of examples of, including a few who seem better at it than me

refulgentis
4 replies
3d21h

FWIW I don't understand a lot of what either of you mean, but I'm very interested. Quick run-through, excuse the editorial tone, I don't know how to give feedback on writing without it.

# Post 1

The problems of epistemology and informational quality control are complicated, but humanity has developed a decent amount of social and procedural technology to do these, some of which has defined the organization of various institutions.

Very fluffy, creating very uncertain parsing for reader.

Should cut down, then could add specificity:

ex. "Dealing with misinformation is complicated. But we have things like dictionaries and the internet, there's even specialization in fact-checking, like Snopes.com"

(I assume the specifics I added aren't what you meant, just wanted to give an example)

The mere presence of LLMs doesn't fundamentally change how we should calibrate our beliefs or verify information. However, the mythology/marketing that LLMs are "outperforming humans"

They do, or are clearly on par, at many tasks.

Where is the quote from?

Is bringing this up relevant to the discussion?

Would us quibbling over that be relevant to this discussion?

combined with the fact that the most popular ones are black boxes to the overwhelming majority of their users means that a lot of people aren't applying those tools to their outputs.

Are there unpopular ones that aren't black boxes?

What tools? (this may just indicate the benefit of a clearer intro)

As a technology, they're much more useful if you treat them with what is roughly the appropriate level of skepticism for a human stranger you're talking to on the street

This is a sort of obvious conclusion compared to the complicated language leading into it, and doesn't add to the posts before it. Is there a stronger claim here?

# Post 2

I wonder what ChatGPT would have to say if I ran this text through with a specialized prompt.

Why do you wonder that?

What does "specialized" mean in this context?

My guess is there's a prompt you have in mind, which then would clarify A) what you're wondering about B) what you meant by specialized prompt. But a prompt is a question, so it may be better to just ask the question?

Your choice of words is interesting, almost like you are optimizing for persuasion,

What language optimizes for persuasion? I'm guessing the fluffy advanced verbiage indicates that?

Does this boil down to "Your word choice creates persuasive writing"?

but simultaneously, I get a strong vibe of intention of optimizing for truth.

Is there a distinction here? What would "optimizing for truth" vs. "optimizing for persuasion" look like?

Do people usually write untruthful things, to the point that it's worth noting when you think someone is writing with the intention of truth?

advael
3 replies
3d21h

As long as we're doing unsolicited advice, this revision seems predicated on the assumption that we are writing for a general audience, which ill suits the context in which the posts were made. This is especially bizarre because you then interject to defend the benchmarking claim I've called "marketing", and having an opinion on that subject at all makes it clear that you also at the very least understand the shared context somewhat, despite being unable to parse the fairly obvious implication that treating models with undue credulity is a direct result of the outsized and ill-defined claims about their capabilities to which I refer. I agree that I could stand to be more concise, but if you find it difficult to parse my writing, perhaps this is simply because you are not its target audience

refulgentis
2 replies
3d21h

Let's go ahead and say the LLM stuff is all marketing and it's all clearly worse than all humans. It's plainly unrelated to anything else in the post, we don't need to focus on it.

Like I said, I'm very interested!

Maybe it doesn't mean anything other than what it says on the tin? You think people should treat an LLM like a stranger making claims? Makes sense!

It's just unclear what a lot of it means and the word choice makes it seem like there's something grander going on, coughs as our compatriots in this intricately weaved thread on the international network known as the world wide web have also explicated, and imparted via the written word, as their scrivening also remarks on the lexicographical phenomenae. coughs

My only other guess is you are doing some form of performance art to teach us a broader lesson?

There's something very "off" here, and I'm not the only one to note it. Like, my instinct is it's iterated writing using an LLM asked to make it more graduate-school level.

mistermann
0 replies
3d11h

There's something very "off" here

You mean on this planet?

If not, what do you think of that idea? Does something not seem....weird?

advael
0 replies
3d20h

Your post and the one I originally responded to are good evidence against something I said earlier. The mere existence of LLMs does clearly change the landscape of epistemology, because whether or not they're even involved in a conversation people will constantly invoke them when they think your prose is stilted (which is, by the way, exactly the wrong instinct), or to try to posture that they occupy some sort of elevated remove from the conversation (which I'd say they demonstrate false by replying at all). I guess dehumanizing people by accusing them of being "robots" is probably as old as the usage of that word if not older, but recently interest in talking robots has dramatically increased and so here we are

I can't tell you exactly what you find "off" about my prose, because while you have advocated precision your objection is impossibly vague. I talk funny. Okay. Cool. Thanks.

Anyway, most benchmarks are garbage, and even if we take the validity of these benchmarks for granted, these AI companies don't release their datasets or even weights, so we have no idea what's out of distribution. To be clear, this means the claims can't be verified even by the standards of ML benchmarks, and thus should be taken as marketing, because companies lying about their tech has both a clearly defined motivation and a constant stream of unrelenting precedent

urduntupu
6 replies
4d

Exactly, you can't even fix the problem at the root, b/c the problem is already with the humans, making up stuff.

testfrequency
5 replies
4d

Believe it or not, there are websites that have real things posted. It's honestly my biggest shock that OpenAI thought Reddit of all places is a trustworthy source for knowledge.

empath75
1 replies
3d23h

The websites with content authored by people are full of bullshit, intentional and unintentional.

testfrequency
0 replies
3d19h

It's genuinely concerning to me how many people replied thinking Reddit is the gospel for factual information.

Reddit, while it has some niche communities with tribal info and knowledge, is FULL of spam, bots, companies masquerading as users, etc. If people are truly relying on Reddit as a source of truth (which OpenAI is now being influenced by), then the world is just going to amplify all the spam that already exists.

p1esk
0 replies
4d

Reddit has been the most trustworthy source for me in the last ~5 years, especially when I want to buy something.

acchow
0 replies
3d22h

While Reddit is often helpful for me (Google site:reddit.com), it's nice to toggle between reddit and non-reddit.

I hope LLMs will offer a "-reddit" model to switch to when needed.

QuesnayJr
0 replies
4d

Reddit is so much better than the average SEO-optimized site that adding "reddit" to your search is a common trick for using Google.

testfrequency
5 replies
4d

I know you're trying to be edgy here, but if I'm deciding between searching online to find a source vs trying to shortcut and use GPT, and GPT decides to hallucinate and make something up, that's the deceiving part.

The biggest issue is how confidently wrong GPT enjoys being. You can press GPT in either the right or wrong direction and it will concede with minimal effort, which is also an issue. It's just really bad Russian roulette nerdspining until someone gets tired.

sva_
4 replies
4d

I wouldn't call it deceiving. In order to be motivated to deceive someone, you'd need agency and some benefit out of it

advael
2 replies
3d23h

1. Deception describes a result, not a motivation. If someone has been led to believe something that isn't true, they have been deceived, and this doesn't require any other agents

2. While I agree that it's a stretch to call ChatGPT agentic, it's nonetheless "motivated" in the sense that it's learned based on an objective function, which we can model as a causal factor behind its behavior, which might improve our understanding of that behavior. I think it's relatively intuitive and not deeply incorrect to say that that a learned objective of generating plausible prose can be a causal factor which has led to a tendency to generate prose which often deceives people, and I see little value in getting nitpicky about agentic assumptions in colloquial language when a vast swath of the lexicon and grammar of human languages writ large does so essentially by default. "The rain got me wet!" doesn't assume that the rain has agency

sva_
1 replies
3d17h

Well the definition of deception, according to Google and how I understand it, is:

deliberately cause (someone) to believe something that is not true, especially for personal gain.

Emphasis on the personal gain part. It seems like you have a different definition.

There's no point in arguing about definitions, but I'm a big believer that if you can identify a difference in the definitions people use early in a conversation, you can settle the argument there.

advael
0 replies
3d15h

I both agree that it's pointless to argue about definitions and think you've presented a definition that fails to capture a lot of common usage of the word. I don't think it matters what the dictionary says when we are talking about how a word is used. Like we use "deceptive" to describe inanimate objects pretty frequently. I responded to someone who thought describing the outputs of a machine learning model as deceiving people implied it had agency, which is nonsense

testfrequency
0 replies
4d

Isn’t that GPT Plus? Trick you into thinking you have found your new friend and they understand everything? Surely OpenAI would like people to use their GPT over a Google search.

How do you think leadership at OpenAI would respond to that?

swatcoder
0 replies
4d

In reality, humans are often blunt and rude pessimists who say things can't be done. But "helpful chatbot" LLMs are specifically trained not to do that for anything but crude swaths of political/social/safety alignment.

When it comes to technical details, current LLMs have a bias towards sycophancy and bullshitting that humans only show when especially desperate to impress or totally fearful.

Humans make mistakes too, but the distribution of those mistakes is wildly different and generally much easier to calibrate for and work around.

nonameiguess
0 replies
3d21h

advael's answer was fine, but since people seem to be hung up on the wording, a more direct response:

We have human institutions dedicated at least nominally to finding and publishing truth (I hate having to qualify this, but Hacker News is so cynical and post-modernist at this point that I don't know what else to do). These include, for instance, court systems. These include a notion of evidentiary standards. Eyewitnesses are treated as more reliable than hearsay. Written or taped recordings are more reliable than both. Multiple witnesses who agree are more reliable than one. Another example is science. Science utilizes peer review, along with its own notion of hierarchy of evidence, similar to but separate from the court's. Interventional trials are better evidence than observational studies. Randomization and statistical testing is used to try and tease out effects from noise. Results that replicate are more reliable than a single study. Journalism is yet another example. This is probably the arena in which Hacker News is most cynical and will declare all of it is useless trash, but nonetheless reputable news organizations do have methods they use to try and be correct more often than they are not. They employ their own fact checkers. They seek out multiple expert sources. They send journalists directly to a scene to bear witness themselves to events as they unfold.

You're free to think this isn't sufficient, but this is how we deal with humans making up stuff and it's gotten us modern civilization at least, full of warts but also full of wonders, seemingly because we're actually right about a lot of stuff.

At some point, something analogous will presumably be the answer for how LLMs deal with this, too. The training will have to be changed to make the system aware of quality of evidence. Place greater trust in direct sensor output versus reading something online. Place greater trust in what you read from a reputable academic journal versus a Tweet. Etc. As it stands now, unlike human learners, the objective function of an LLM is just to produce a string in which each piece is in some reasonably high-density region of the probability distribution of possible next pieces as observed from historical recorded text. Luckily, producing strings in this way happens to generate a whole lot of true statements, but it does not have truth as an explicit goal and, until it does, we shouldn't forget that. Treat it with the treatment it deserves, as if some human savant with perfect recall had never left a dark room to experience the outside world, but had read everything ever written, unfortunately without any understanding of the difference between reading a textbook and reading 4chan.

CooCooCaCha
0 replies
3d21h

If I am going to trust a machine then it should perform at the level of a very competent human, not a general human.

Why would I want to ask your average person a physics question? Of course, their answer will probably be wrong and partly made up. Why should that be the bar?

I want it to answer at the level of a physics expert. And a physics expert is far less likely to make basic mistakes.

ssharp
5 replies
4d

I keep hearing about people using these for coding. Seems like it would be extremely easy to miss something and then spend more time debugging than it would have taken to do it yourself.

I recently tried to have ChatGPT write an .htaccess RewriteCond/Rule for me and it was extremely confident you couldn't do something I needed to do. When I told it that it just needed to add a flag to the end of the rule (I was curious and was purposely non-specific about which flag it needed), it suddenly knew exactly what to do. Thankfully I knew what it needed, but otherwise I might have walked away thinking it couldn't be accomplished.

GiorgioG
2 replies
4d

My experience is that it will simply make up methods, properties and fields that do NOT exist in well-documented APIs. If something isn't possible, that's fine, just tell me it's not possible. I spent an hour trying to get ChatGPT (4/4o and 3.5) to write some code to do one specific thing (dump/log detailed memory allocation data from the current .NET application process) for diagnosing an intermittent out of memory exception in a production application. The answer as far as I can tell is that it's not possible in-process. Maybe it's possible out of process using the profiling API, but that doesn't help me in a locked-down k8s pod/container in AWS.

neonsunset
0 replies
4d

From within the process it might be difficult*, but please do give this a read https://learn.microsoft.com/en-us/dotnet/core/diagnostics/du... and dotnet-dump + dotnet-trace a try.

If you are still seeing the issue with memory and GC, you can submit it to https://github.com/dotnet/runtime/issues especially if you are doing something that is expected to just work(tm).

* difficult as in retrieving data detailed enough to trace individual allocations, otherwise `GC.GetGCMemoryInfo()` and adjacent methods can give you high-level overview. There are more advanced tools but I always had the option to either use remote debugging in Windows Server days and dotnet-dump and dotnet-trace for containerized applications to diagnose the issues, so haven't really explored what is needed for the more locked down environments.

empath75
0 replies
3d23h

I think once you understand that they're prone to do that, it's less of a problem in practice. You just don't ask it questions that requires detailed knowledge of an API unless it's _extremely_ popular. Like in kubernetes terms, it's safe to ask it about a pod spec, less safe to ask it details about istio configuration and even less safe to ask it about some random operator with 50 stars on github.

Mostly it's good at structure and syntax, so I'll often find the library/spec I want, paste in the relevant documentation and ask it to write my function for me.

This may seem like a waste of time because once you've got the documentation you can just write the code yourself, but A: that takes 5 times as long and B: I think people underestimate how much general domain knowledge is buried in chatgpt so it's pretty good at inferring the details of what you're looking for or what you should have asked about.

In general, I think the more your interaction with chatgpt is framed as a dialogue and less as a 'fill in the blanks' exercise, the more you'll get out of it.

bredren
0 replies
3d23h

This problem applies almost universally as far as I can tell.

If you are knowledgeable on a subject matter you're asking for help with, the LLM can be guided to value. This means you do have to throw out bad or flat out wrong output regularly.

This becomes a problem when you have no prior experience in a domain. For example reviewing legal contracts about a real estate transaction. If you aren't familiar enough with the workflow and details of steps you can't provide critique and follow-on guidance.

However, the response still stands before you, and it can be tempting to glom onto it.

This is not all that different from the current experience with search engines, though. Where if you're trying to get an answer to a question, you may wade through and even initially accept answers from websites that are completely wrong.

For example, products to apply to the foundation of an old basement. Some sites will recommend products that are not good at all, but do so because the content owners get associate compensation for it.

The difference is that LLM responses appear less biased (no associate links, no SEO keyword targeting), but are still wrong.

All that said, sometimes LLMs just crush it when details don't matter. For example, building a simple cross-platform pyqt-based application. Search engine results cannot do this. Whereas, at least for rapid prototyping, GPT is very, very good.

BurningFrog
0 replies
4d

If I ever let AI write code, I'd write serious tests for it.

Just like I do with my own code.

Both AI and I "hallucinate" sometimes, but with good tests you make things work.
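
Even something as small as this catches the made-up-API class of mistake (the module and function names here are hypothetical):

    # Hypothetical example: whoever wrote slugify(), the tests are what
    # pin down the behaviour that actually matters.
    import pytest
    from mylib import slugify  # made-up module and function

    def test_basic():
        assert slugify("Hello, World!") == "hello-world"

    def test_collapses_whitespace():
        assert slugify("a   b") == "a-b"

    def test_rejects_empty_input():
        with pytest.raises(ValueError):
            slugify("")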

spiderfarmer
0 replies
4d

Mixture of agents prevents a lot of fact fabrication.

soloist11
25 replies
4d1h

How do they know the critic did not make a mistake? Do they have a critic for the critic?

GaggiX
6 replies
4d1h

It's written in the article, the critic makes mistakes, but it's better than not having it.

soloist11
5 replies
4d1h

How do they know it's better? The rate of mistakes is the same for both GPTs, so now they have 2 sources of errors. If the error rate were lower for one, then they could always apply it and reduce the error rate of the other. They're just shuffling the deck chairs and hoping the boat with a hole travels a slightly longer distance before disappearing completely underwater.

yorwba
2 replies
4d

Whether adding unreliable components increases the overall reliability of a system depends on whether the system requires all components to work (in which case adding components can only make matters worse) or only some (in which case adding components can improve redundancy and make it more likely that the final result is correct).

In the particular case of spotting mistakes made by ChatGPT, a mistake is spotted if it is spotted by the human reviewer or by the critic, so even a critic that makes many mistakes itself can still increase the number of spotted errors. (But it might decrease the spotting rate per unit time, so there are still trade-offs to be made.)
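
With made-up numbers, the "spotted by either" arithmetic looks like this:

    # Illustration only: suppose the human spots a given bug with probability
    # 0.6 and the critic, independently, with probability 0.5. The chance
    # that at least one of them spots it:
    p_human, p_critic = 0.6, 0.5
    p_either = 1 - (1 - p_human) * (1 - p_critic)
    print(round(p_either, 2))  # 0.8 -- redundancy helps even though each part is unreliable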

soloist11
1 replies
4d

I see what you're saying, so what OpenAI will do next is create an army of GPT critics and then run them all in parallel to take some kind of quorum vote on correctness. I guess it should work in theory if the error rate is small enough and adding more critics actually reduces the error rate. My guess is that in practice they'll converge to the population's average rate of error and then pat themselves on the back for a job well done.
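
A sketch of what such a quorum would look like; critique() is a placeholder for one critic model's verdict, and the caveat in the comment is exactly that convergence problem:

    from collections import Counter

    def critique(critic_id, code):
        """Placeholder for one critic model's verdict, e.g. 'bug' or 'ok'."""
        raise NotImplementedError

    def quorum_verdict(code, n_critics=5):
        # Majority vote over several critics. This only beats a single critic
        # if their errors are reasonably independent and each is right more
        # often than wrong; correlated critics just agree on the same mistake,
        # i.e. they converge to the population's average error rate.
        votes = Counter(critique(i, code) for i in range(n_critics))
        return votes.most_common(1)[0][0]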

svachalek
0 replies
3d23h

That description is remarkably apt for almost every business meeting I've ever been in.

vidarh
0 replies
3d12h

How do they know it's better?

From the article:

"In our experiments a second random trainer preferred critiques from the Human+CriticGPT team over those from an unassisted person more than 60% of the time."

Of course the second trainer could be wrong, but when the outcome tilts 60% to 40% in favour of the combination of a human + CriticGPT, that's pretty significant.

From experience doing contract work in this space, it's common to use multiple layers of reviewers to generate additional data for RLHF, and if you can improve the output from the first layer that much it'll have a fairly massive effect on the amount of training data you can produce at the same cost.

GaggiX
0 replies
4d

How do they know it's better?

Probably just evaluation on benchmarks.

jsheard
5 replies
4d1h

Per the article, the critic for the critic is human RLHF trainers. More specifically those humans are exploited third world workers making between $1.32 and $2 an hour, but OpenAI would rather you didn't know about that.

https://time.com/6247678/openai-chatgpt-kenya-workers/

soloist11
1 replies
4d1h

Every leap of civilization was built off the back of a disposable workforce. - Niander Wallace

wmeredith
0 replies
4d

He was the bad guy, right?

vidarh
0 replies
3d9h

OpenAI may well still be employing plenty of people in third world countries for this. But there are also contracts providing anywhere from $20 to $100+ an hour to do this kind of work for more complex prompt/response pairs.

I've done work on what (at least to my belief) is the very high end of that scale (not for OpenAI) to fill gaps, so I know firsthand that it's available, and sometimes the work is complex enough that a single response can take over an hour to evaluate because the requirements often include not just reading and reviewing the code, but ensuring it works, including fixing bugs. Most of the responses then pass through at least one more round of reviews of the fixed/updated responses. One project I did work on involved 3 reviewers (none of whom were on salaries anywhere close to the Kenyan workers you referred to) reviewing my work and providing feedback and a second pass of adjustments. So four high-paid workers altogether to process every response.

Of course, I'm sure plenty lower-level/simpler work had been filtered out to be addressed with cheaper labour, but I wouldn't be so sure their costs for things like code is particularly low.

golergka
0 replies
4d

Exploited? Are you saying that these employees are forced to work for below market rates, and would be better off with other opportunities available to them? If that's the case, it's truly horrible on OpenAI's part.

IncreasePosts
0 replies
4d

That is more than the average entry level position in Kenya. The work is probably also much easier (physically, that is).

esafak
3 replies
4d1h

It's called iteration. Humans do the same thing.

citizen_friend
1 replies
4d1h

It’s not a human, and we shouldn’t assume it will have traits we do without evidence.

Iteration also is when your brain meets the external world and corrects. This is a closed system.

vidarh
0 replies
3d9h

We are not assuming that. The iteration happens by taking the report and passing it to another reviewer who reviews the first review. Their comparison is between a human reviewer passing reports to a human reviewer vs. CriticGPT -> human reviewer vs. CriticGPT+human reviewer -> human reviewer.

soloist11
0 replies
4d1h

Are you sure it's not called recursion?

nmca
2 replies
4d1h

A critic for the critic would be “Recursive Reward Modelling”, an exciting idea that has not been made to work in the real world yet.

soloist11
1 replies
4d1h

Most of my ideas are not original but where can I learn more about this recursive reward modeling problem?

finger
1 replies
4d

There is already a mistake. It refers to a function by the wrong name: os.path.comonpath instead of os.path.commonpath.
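
For reference, the real function does exist:

    import os.path

    # On a POSIX system this prints "/usr"; os.path.comonpath, by contrast,
    # doesn't exist and would raise AttributeError.
    print(os.path.commonpath(["/usr/lib/python3", "/usr/local/bin"]))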

soloist11
0 replies
4d

In the critical limit every GPT critic chain is essentially a spellchecker.

OlleTO
1 replies
4d1h

It's critics all the way down

azulster
0 replies
4d1h

it's literally just the oracle problem all over again

ertgbnm
0 replies
4d

That's the human's job for now.

A human reviewer might have trouble catching a mistake, but they are generally pretty good at discerning whether a report about a mistake is valid or not. For example, finding a bug in a codebase is hard. But if a junior sends you a code snippet and says "I think this is a bug for xyz reason", do you agree? It's much easier to confidently say yes or no. So basically it changes the problem from finding a needle in a haystack to discerning whether a statement is a hallucination or not.

victor9000
22 replies
4d

This gets at the absolute torrent of LLM diarrhea that people are adding to PRs these days. The worst of it seems to come from junior and first time senior devs who think more is more when it comes to LoC. PR review has become a nightmare at my work where juniors are now producing these magnificent PRs with dynamic programming, esoteric caching, database triggers, you name it. People are using LLMs to produce code far beyond their abilities, wisdom, or understanding, producing an absolute clusterfuck of bugs and edge cases. Anyone else dealing with something similar? How are you handling it?

liampulles
6 replies
3d23h

Maybe time for some pair programming?

surfingdino
5 replies
3d21h

No.

wholinator2
4 replies
3d21h

That would be interesting though. What happens when one programmer attempts to use ChatGPT in pair programming? It's almost like they're already pair programming, just not with you!

surfingdino
3 replies
3d21h

They are welcome to do so, but not on company time. We do not find those tools useful at all, because we are generally hired to write new stuff and ChatGPT or other tools are useless when there are no good examples to steal from (e.g. darker corners of AWS that people don't bother to offer solutions for) or when there is a known bug or there are only partial workarounds available for it.

raunakchhatwal
2 replies
3d10h

Don't you want programmers to familiarize themselves now, to prepare for the time when it does become useful? Claude 3.5 Sonnet is getting close.

rsynnott
1 replies
3d8h

1954: "Don't you want power plant operators to familiarise themselves with this tokamak now to prepare for when it actually works?"

(Practical nuclear fusion was 10 years away in 1954, and it's still 10 years away now. I suspect in practice LLMs are in a similar space; everyone seems to be fixated on the near, supposedly inevitable, future where they are actually useful.)

surfingdino
0 replies
3d8h

AI and VR follow the same hype cycles. They always leave a trail of wasted money, energy, and resources. Only this time Gen AI is leaving a rolling coal-like trail of bullshit that will take some time to clean up. I am patiently waiting for the VC money to run out.

crazygringo
3 replies
3d23h

How is that different from junior devs writing bad code previously? The more things change, the more things stay the same.

You handle it by teaching them how to write good code.

And if they refuse to learn, then they get bad performance reviews and get let go.

I've had junior devs come in with all sorts of bad habits, from only using single-letter variable names and zero commenting, to thinking global variables should be used for everything, to writing object-oriented monstrosities with seven layers of unnecessary abstractions instead of a simple function.

Bad LLM-generated code? It's just one more category of bad code, and you treat it the same as all the rest. Explain why it's wrong and how to redo it.

Or if you want to fix it at scale, identify the common bad patterns and make avoiding them part of your company's onboarding/orientation/first-week-training for new devs.

okdood64
2 replies
3d23h

How is that different from junior devs writing bad code previously?

Because previously, if it was bad, at least it was simple. Meaning simple to review, quickly correct, and move on.

phatfish
0 replies
3d22h

Maybe fight fire with fire. Feed the ChatGPT PR to ChatGPT and ask it to do a review, paste that as the comment. It will even do the markdown for you!
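
If anyone actually tries that, a minimal sketch of what it could look like (assuming the official OpenAI Python SDK with an OPENAI_API_KEY in the environment; the model name and prompts are just placeholders):

    import subprocess
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Grab the branch's diff against main; adjust the ref to taste.
    diff = subprocess.run(["git", "diff", "main...HEAD"],
                          capture_output=True, text=True).stdout

    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[
            {"role": "system", "content": "You are a strict code reviewer. "
                                          "Point out bugs, edge cases, and "
                                          "needless complexity. Answer in markdown."},
            {"role": "user", "content": diff},
        ],
    )
    print(response.choices[0].message.content)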

crazygringo
0 replies
3d23h

Like I said, with "object-oriented monstrosities", it's not like it was always simple before either.

And if you know a solution should be 50 lines and they've given you 500, it's not like you have to read it all -- you can quickly figure out what approach they're using and discuss the approach they should be using instead.

Zee2
3 replies
4d

My company simply prohibits any AI generated code. Seems to work rather well.

ffsm8
1 replies
3d23h

My employer went all in and pays for both enterprise subscriptions (GitHub Copilot + ChatGPT Enterprise, which is just a company-branded version of the regular interface).

We've even been getting "prompt engineering" meeting invites of 3+ hours to get an introduction to their usage, with 100-150 participants each time I joined.

It's amazing how much they're valuing it. From my experience it's usually a negative productivity multiplier (x0.7 vs x1 without either).

kridsdale3
0 replies
2d18h

Sounds like they plan to lay half of you off once metrics stabilize.

gnicholas
0 replies
3d22h

How is this enforced? I'm not saying it isn't a good idea, just that it seems like it would be tricky to enforce. Separately, it could result in employees uploading code to a non-privacy-respecting AI, whereas if employees were allowed to use a particular AI then the company could better control privacy/security concerns.

kenjackson
1 replies
3d23h

In the code review, can't you simply say, "This is too complicated for what you're trying to do -- please simplify"?

lcnPylGDnU4H9OF
0 replies
3d22h

Not quite the same but might be more relevant depending on context: "If you can't articulate what it does then please rewrite it such that you can."

ganzuul
1 replies
3d23h

Are they dealing with complexity which isn't there in order to appear smarter?

jazzyjackson
0 replies
3d21h

IMO they're just impressed the AI came to a conclusion that actually runs and aren't skilled enough to recognize there's a simpler way to do it.

xhevahir
0 replies
4d

When it gives me a complicated list expression or regex or something I like to ask ChatGPT to find a simpler way of doing the same thing, and it usually gives me something simpler that still works. Of course, you do have to ask, rather than simply copy-paste its output right into an editor, which is probably one step too many for some.
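
A hypothetical example of the kind of rewrite I mean (not taken from actual ChatGPT output, just illustrative):

    # Before: the sort of thing that comes back on a first attempt.
    non_empty = [s for s in [line.strip() for line in lines] if len(s) > 0]

    # After: same result in one pass, easier to read.
    non_empty = [line.strip() for line in lines if line.strip()]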

surfingdino
0 replies
3d21h

I work for clients that do not allow this shit, because their security teams and lawyers won't have it. But... they have in-house "AI ambassadors" (your typical useless middle managers, BAs, project managers, etc.) who see promoting AI as a survival strategy. On the business side these orgs are leaking data, internal comms, and PII like a sieve, but the software side is free of AI. For now.

ssl-3
0 replies
4d

Why not deal with people who create problems like this the same way as one would have done four years ago?

If they're not doing their job, then why do they still have one?

sdenton4
20 replies
4d1h

Looks like the hallucination rate doesn't improve significantly, but I suppose it's still a win if it helps humans review things faster? Though I could imagine reliance on the tool leading to missing less obvious problems.

tombert
17 replies
4d1h

While it's a net good, it would kind of kill one of the most valuable parts of ChatGPT for me, which is critiquing its output myself.

If I ask it a question, I try not to trust it immediately, and I independently look the answer up and I argue with it. In turn, it actually is one of my favorite learning tools, because it kind of forces me to figure out why it's wrong and explain it.

goostavos
14 replies
4d

Unexpectedly, I kind of agree. I've found GPT to be a great tutor for things I'm trying to learn. It being somewhat unreliable / prone to confidently lying embeds a certain amount of useful skepticism and questioning of all the information, which in turn leads to an overall better understanding.

Fighting with the AI's wrongness out of spite is an unexpectedly good motivator.

empath75
5 replies
3d23h

I actually learn a lot from arguing with not just AIs but people and it doesn't really matter if they're wrong or right. If they're right, it's an obvious learning experience for me, if they're wrong, it forced me to explain and understand _why_ they're wrong.

tombert
4 replies
3d23h

I completely agree with that, but the problem is finding a supply of people to argue with on niche subjects. I have occasionally argued with people on the Haskell IRC and the NixOS Matrix server about some stuff, but since they're humans who selfishly have their own lives to live, I can't argue with them infinitely, and since the topics I argue about are specific, there just don't exist a lot of people I can argue with even in the best of times.

ChatGPT (Gemini/Anthropic/etc) have the advantage of never getting sick of arguing with me. I can go back and forth and argue about any weird topic that I want for as long as I want at any time of day and keep learning until I'm bored of it.

Obviously it depends on the person but I really like it.

ramenbytes
2 replies
3d22h

I completely agree with that, but the problem is finding a supply of people to argue with on niche subjects.

Beyond just subject-wise, finding people who argue in good faith seems to be an issue too. There are people I'm friends with almost specifically because we're able to consistently have good-faith arguments about our strongly opposing views. It doesn't seem to be a common skill, but perhaps that has something to do with my sample set or my own behaviors in arguments.

tombert
1 replies
3d21h

I dunno, for more niche computer science or math subjects, I don't feel like people argue in bad faith most of the time. The people I've argued with on the Haskell IRC years ago genuinely believe in what they're saying, even if I don't agree with them (I have a lot of negative opinions on Haskell as a language).

Politically? Yeah, nearly impossible to find anyone who argues in good faith.

ramenbytes
0 replies
1d22h

Politics and related stuff is what I had in mind, yeah. To a lesser extent technical topics as well. But, I meant "good faith" in the sense of both believing what they're saying and also approaching the argument open to the possibility of being wrong themselves and/or treating you as capable of understanding their point. I've had arguments where the other person definitely believed what they were saying, but didn't think I was capable of understanding their point or being right myself and approached the discussion thusly.

mistermann
0 replies
3d22h

Arguing is arguably one of humanity's super powers, and that we've yet to bring it to bear in any serious way gives me reason for optimism about sorting out the various major problems we've foolishly gotten ourselves into.

ExtremisAndy
4 replies
4d

Wow, I’ve never thought about that, but you’re right! It really has trained me to be skeptical of what I’m being taught and confirm the veracity of it with multiple sources. A bit time-consuming, of course, but generally a good way to go about educating yourself!

tombert
2 replies
4d

I genuinely think that arguing with it has been almost a secret weapon for me with my grad school work. I'll ask it a question about temporal logic or something, and it'll say something that sounds accurate but turns out to be wrong or misleading after I look through traditional documentation. Then I can fight with it and see if it refines its answer into something correct, which I can check again, etc. I keep doing this for a bunch of iterations and I end up with a pretty good understanding of the topic.

I guess at some level this is almost what "prompt engineering" is (though I really hate that term), but I use it as a learning tool and I do think it's been really good at helping me cement concepts in my brain.

ramenbytes
1 replies
3d22h

I'll ask it a question about temporal logic or something, it'll say something that sounds accurate but is ultimately wrong or misleading after looking through traditional documentation, and I can fight with it, and see if it refines it to something correct, which I can then check again, etc. I keep doing this for a bunch of iterations and I end up with a pretty good understanding of the topic.

Interesting, that's the basic process I follow myself when learning without ChatGPT. Comparing my mental representation of the thing I'm learning to existing literature/results, finding the disconnects between the two, reworking my understanding, wash rinse repeat.

tombert
0 replies
3d21h

I guess a large part of it is just kind of the "rubber duck" thing. My thoughts can be pretty disorganized and hard to follow until I'm forced to articulate them. Finding out why ChatGPT is wrong is useful because it's a rubber duck that I can interrogate, not just talk to.

It can be hard for me to directly figure out when my mental model is wrong on something. I'm sure it happens all the time, but a lot of the time I will think I know something until I feel compelled to prove it to someone, and I'll often find out that I'm wrong.

That's actually happened a bunch of times with ChatGPT, where I think it's wrong until I actually interrogate it, look up a credible source, and realize that my understanding was incorrect.

Viliam1234
0 replies
2d22h

We were already supposed to use Wikipedia like this, but most people didn't bother and trusted the Wikipedia text uncritically.

Finally, LLMs teach us the good habits.

posix86
1 replies
4d

Reminds me of a prof at uni, whose slides always appeared to have been written 5 mins before the lecture started, resulting in students pointing out mistakes in every other slide. He defended himself saying that you learn more if you aren't sure whether things are correct - which was right. Esp. during a lecture, it's sometimes not that easy to figure out if you truly understood something or fooled yourself when you know that what you're looking at is provably right. If you know everything can be wrong, you trick your mind into verifying it at a deeper level, and thus gain more understanding. It also results in a culture where you're allowed to question the prof. It resulted in many healthy arguments with the prof about why something is the way it is, often ending with him agreeing that his slides were wrong. He never corrected the underlying PPT.

tombert
0 replies
3d22h

I thought about doing that when I was doing adjunct last year, but what made me stop was the fact that these were introductory classes, so I was afraid I might pollute the minds of students who really haven't learned enough to question stuff yet.

tombert
0 replies
4d

Yeah, and what I like is that I can get it to say things in "dumb language" instead of a bunch of scary math terms. It'll be confidently wrong, but in language that I can easily understand, forcing me to look things up, and kind of forcing me to learn the proper terminology and actually understand it.

Arcane language is actually kind of a pet peeve of mine in theoretical CS and mathematics. Sometimes it feels like academics really obfuscate relatively simple concepts by using a bunch of weird math terms. I don't think it's malicious, I just think that there's value in having more approachable language and metaphors in the process of explaining things.

julienchastang
1 replies
3d22h

Very good comment. In order to effectively use LLMs (I use ChatGPT4 and 4o), you have to be skeptical of them and being a good AI skeptic takes practice. Here is another technique I've learned along the way: When you have it generate text for some report you are writing, or something, after your initial moment of being dazzled (at least for me), resist the temptation to copy/paste. Instead, "manually" rewrite the verbiage. You then realize there is a substantial amount of BS that can be excised. Nevertheless, it is a huge time saver and can be good at ideation, as well.

tombert
0 replies
3d21h

Yeah I used it last year to generate homework assignments, and it would give me the results in Pandoc-compatible markdown. It was initially magic, but some of the problems didn't actually make sense and might actually be unsolvable, so I would have to go through it line by line and then ask it to regenerate it [1].

Even with that, it took a process that had taken multiple hours before down to about 30-45 minutes. It was super cool.

[1] Just to be clear, I always did the homework assignments myself beforehand to make sure that each problem was solvable and fair before I assigned it.

foobiekr
1 replies
4d

The lazy version of that, which I recommend, is to always deny the first answer. Usually I deny it for some obvious reason, but sometimes I just say "isn't that wrong?"

tombert
0 replies
3d22h

That's a useful trick, but I have noticed that when I do that it goes in circles, where it suggests "A", I say it's wrong, it suggests "B", I say that's wrong, it suggests "C", I say that's wrong, and then it suggests "A" again.

Usually for it to get a correct answer, I have to provide it a bit of context.

upwardbound
11 replies
3d16h

For those new to the field of AGI safety: this is an implementation of Paul Christiano's alignment procedure proposal called Iterated Amplification from 6 years ago. https://www.alignmentforum.org/s/EmDuGeRw749sD3GKd

According to his website he previously ran the language model alignment team at OpenAI. https://paulfchristiano.com/

It's wonderful to see his idea coming to fruition! I'm honestly a bit skeptical of the idea myself (it's like proposing to stabilize the stack of "turtles all the way down" by adding more turtles - as is insightfully pointed out in this other comment https://news.ycombinator.com/item?id=40817017) but every innovative idea is worth a try, in a field as time-critical and urgent as AGI safety.

For a good summary of technical approaches to AGI safety, start with the Future of Life Institute AI Alignment Podcast, especially these two episodes which serve as an overview of the field:

- https://futureoflife.org/podcast/an-overview-of-technical-ai...

- https://futureoflife.org/podcast/an-overview-of-technical-ai...

In both of those episodes, Christiano's publication series on Iterated Amplification is link #3 in the list of recommended reading.

curiousgal
5 replies
3d10h

AGI safety

I genuinely laughed. Oh no somebody please save me from a chatbot that's hallucinating half the time!

Joke aside, of course OpenAI is gonna play up how "intelligent" its models are. But it's evident that there's only so much data and compute that you can throw at a machine to make it smart.

ben_w
2 replies
3d8h

Covid isn't what most people would call "high intelligence", yet it's a danger because it's heavily optimised for goals that are not our own.

Other people using half-baked AI can still kill you, and that doesn't have to be a chatbot as we have current examples from self-driving cars that drive themselves dangerously, and historical examples of the NATO early warning radars giving a false alarm from the moon and the soviet early warning satellites giving false alarms from reflected sunlight, but it can also be a chatbot — there are many ways that this can be deadly if you don't know better: https://news.ycombinator.com/item?id=40724283

Every software bug is an example of a computer doing exactly what it was told to do, instead of what we meant.

AI safety is about bridging the gap between optimising for what we said vs. what we meant, in a less risky manner than covid did. And while I think it doesn't matter much if covid did or didn't come from a lab leak (the potential that it did means there's an opportunity to improve bio safety there as well as in wet markets), every AI you can use is essentially a continuous supply of the mystery magic box before we know what the word "safe" even means in this context.

orlp
1 replies
3d7h

Every software bug is an example of a computer doing exactly what it was told to do, instead of what we meant.

That is only as long as the person describing the behaviour as a bug is aligned with the programmer. Most of the time this is the case, but not always. For example a malicious programmer intentionally inserting a bug does in fact mean for the program to have that behavior.

ben_w
0 replies
3d7h

Sure, but I don't think that matters as the ability to know what we are even really asking for is not as well understood for AI as for formal languages — AI can be used for subterfuge etc., but right now it's still somewhat like this old comic from 2017: https://xkcd.com/1838/

upwardbound
0 replies
3d9h

You make a good point so I should clarify that Iterated Amplification is an idea that was first proposed as a technique for AGI safety, but happens to also be applicable to LLM safety. I study AGI safety which is why I recognized the technique.

stareatgoats
0 replies
3d9h

Thanks, added to my collection of AGI-pessimistic comments that I encounter here, and that I aim to revisit in, say, 20 years. I'm not sure I will be able to say: "you were wrong!". But I do expect so.

TeMPOraL
2 replies
3d11h

it's like proposing to stabilize the stack of "turtles all the way down" by adding more turtles

That's completely fine. Say each layer uses the same number of turtles, each half the size of the layer above. Even allowing for arbitrarily small turtles, the total height of the infinite stack will be just 2x the height of the first layer (quick sanity check below).

Point being, some series converge to a finite result, including some defined recursively. And in practice, we can usually cut the recursion after first couple steps, as the infinitely long remainder has negligible impact on the final result.
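
The sanity check, as a couple of lines of Python (partial sums of the layer heights, with the first layer at height 1):

    height, total = 1.0, 0.0
    for layer in range(20):      # 20 layers is already "practically infinite"
        total += height
        height /= 2              # each layer is half as tall as the one above
    print(total)                 # 1.9999980926513672 -> the full stack tops out at 2.0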

topherclay
1 replies
3d11h

You seem to have interpreted the analogy as meaning "you might run out of turtles" instead of something like: "stacks of turtles aren't stable without something stable beneath them, no matter how many turtles you use."

TeMPOraL
0 replies
3d3h

Space them out. I meant that an infinite stack of turtles can be stable and of finite height, and that for practical purposes it can be cut off after a few layers without noticeable impact on stability.

pama
1 replies
3d16h

I am not sure why you didn't see the citations to 3 different papers from Christiano. A simple search in the linked PDF suffices: citations 12, 19, and 31.

upwardbound
0 replies
3d16h

Thank you for providing the reference numbers, this is helpful! I'll update my GP comment.

mkmk
9 replies
4d1h

It seems more and more that the solution to AI's quality problems is... more AI.

Does Anthropic do something like this as well, or is there another reason Claude Sonnet 3.5 is so much better at coding than GPT-4o?

GaggiX
2 replies
4d1h

or is there another reason Claude Sonnet 3.5 is so much better at coding than GPT-4o?

It's impossible to say because these models are proprietary.

mkmk
1 replies
4d

Isn't the very article we're commenting on an indication that you can form a basic opinion on what makes one proprietary model different from another?

GaggiX
0 replies
4d

Not really, we know absolutely nothing about Claude 3.5 Sonnet, except that it's an LLM.

ru552
1 replies
4d1h

Anthropic has attributed Sonnet 3.5's model improvement to better training data.

"Which data specifically? Gerstenhaber wouldn’t disclose, but he implied that Claude 3.5 Sonnet draws much of its strength from these training sets."[0]

[0]https://techcrunch.com/2024/06/20/anthropic-claims-its-lates...

swyx
0 replies
3d12h

and water is wet

surfingdino
0 replies
3d21h

It seems more and more that the solution to AI's quality problems is... more AI.

This reminds me of the passage found in the description of the fuckitpy module:

"This module is like violence: if it doesn't work, you just need more of it."

p1esk
0 replies
4d

In my experience Sonnet 3.5 is about the same as 4o for coding. Sometimes one provides a better solution, sometimes the other. Both are pretty good.

jasonjmcghee
0 replies
4d1h

My guess, which could be completely wrong, Anthropic spent more resources on interpretability and it's paying off.

I remember when I first started using activation maps when building image classification models and it was like what on earth was I doing before this... just blindly trusting the loss.

How do you discover biases and issues with training data without interpretability?

Kiro
0 replies
4d

Is it really that much better? I'm really happy with GPT-4o's coding capabilities and very seldom experience problems with hallucinations or incorrect responses, so I'm intrigued by how much better it can actually be.

neom
7 replies
4d

I was curious about the authors, did some digging, they've published some cool stuff:

Improving alignment of dialogue agents via targeted human judgements - https://arxiv.org/abs/2209.14375

Teaching language models to support answers with verified quotes - https://arxiv.org/abs/2203.11147

VWWHFSfQ
6 replies
3d21h

It's an interesting dichotomy happening in the EU vs. USA in terms of how these kinds of phenomena are discovered, presented, analyzed, and approached.

The EU seems to lean very much toward a regulate-early, safety-first approach, whereas the USA leans very much toward unregulated, move fast, break things, assess the damage, regulate later.

I don't know which is better or worse.

l5870uoo9y
3 replies
3d20h

As a European, after decades of regulations and fines without much to show for it, nobody in the industry believes the EU is capable of creating a tech ecosystem. Perhaps even that the EU is part of the problem and that individual countries could independently move much faster.

Twixes
2 replies
3d19h

The difference in tech outcomes can definitely be attributed to _European_ conditions, but regulation is extremely country-specific – the _EU_ is not the detriment (e.g. Germany might be notably bureaucratic, but the culture in Sweden differs, and so do frameworks in Poland). EU institutions are out to get the US giants, but startups or scaleups? Out of sight.

There really is comparatively little reward for shooting for the moon though – the fragmented stock markets don't provide great exit opportunities, so less money goes into funding ambitious companies. Then, scaling throughout all of Europe is notably hard, with dozens of languages, cultures, and legal frameworks to navigate. Some of these cultures are more risk-averse, and that's not easy to change. Not to mention English being the _lingua franca_ of business and tech.

I would love Europe to reach the States' level of tech strength, but these are all really hard problems.

torginus
0 replies
21h28m

Get-rich-quick startups are overrated, and there aren't that many of them anyway, at least not ones with real staying power.

The most valuable tech companies have long histories and incredibly broad, deep technological portfolios that go much farther than 'we use the latest frameworks' and/or 'we own the most eyeballs at the moment', or dubious business models that are a combination of spending free VC money to grow and having exploitative practices.

Such companies are certainly represented in the top 100 list, but I for one am not sad that Europe missed out on them. It's more of a problem that we don't have our own NVIDIA, Apple, Intel, or Samsung, imo.

As for regulation, there are a bunch of US companies that got to where they are by abusing their monopolistic reach by locking out and disadvantaging potential competitors (think of the smartphone, OS and social media spaces), where the swift kick in the butt from European regulators could've come sooner.

felipeerias
0 replies
3d18h

Another issue is that modern AI is notoriously energy-intensive.

Because of policy choices over the past several decades, as well as the ongoing war with Russia, the EU is already struggling to provide enough energy for its existing industry. There just isn’t any slack left for a newcomer.

This is a relatively recent (2022) comparison of the Industrial electricity prices including taxes:

https://www.gov.uk/government/statistical-data-sets/internat...

Roughly, electricity for industrial uses is 50% more expensive in France than in the USA.

In Germany, it is over 120% more expensive.

In the UK, over 150%.

ipaddr
1 replies
3d21h

Regulate before you understand the problem seems like a poor approach

dns_snek
0 replies
3d10h

Which parts of the EU AI act do you disagree with? It primarily focuses on areas of use that present high or unacceptable risks.

advael
6 replies
4d

This is a really bizarre thing to do honestly

It's plausible that there are potential avenues for improving language models through adversarial learning. GANs and Actor-Critic models have done a good job in narrow-domain generative applications and task learning, and I can make a strong theoretical argument that you can do something that looks like priority learning via adversarial equilibria

But why in the world are you trying to present this as a human-in-the-loop system? This makes no sense to me. You take an error-prone generative language model and then present another instance of an error-prone generative language model to "critique" it for the benefit of... a human observer? The very best case here is that this wastes a bunch of heat and time for what can only be a pretty nebulous potential gain to the human's understanding

Is this some weird gambit to get people to trust these models more? Is it OpenAI losing the plot completely because they're unwilling to go back to open-sourcing their models but addicted to the publicity of releasing public-facing interfaces to them? This doesn't make sense to me as a research angle or as a product

I can really see the Microsoft influence here

kenjackson
5 replies
3d23h

It's for their RLHF pipeline to improve labeling. Honestly, this seems super reasonable to me. I don't get why you think this is such a bad idea for this purpose...

advael
4 replies
3d23h

RLHF to me seems like more of a PR play than anything else, but inasmuch as it does anything useful, adding a second LLM to influence the human that's influencing the LLM doesn't solve any of the fundamental problems of either system. If anything it muddies the waters more, because we have already seen that humans are probably too credulous of the information presented to them by these models. If you want adversarial learning, there are far more efficient ways to do it. If you want human auditing, the best case here is that the second LLM doesn't influence the human's decisions at all (because any influence reduces the degree to which this is independent feedback).

kenjackson
1 replies
3d20h

This is not adversarial learning. It's really about augmenting the ability of humans to determine if a snippet of code is correct and write proper critiques of incorrect code.

Any system that helps you more accurately label data with good critiques should help the model. I'm not sure how you come to your conclusion. Do you have some data to indicate that, even with improved accuracy, some LLM bias would lead to a worse-trained model? I haven't seen that data or assertion elsewhere, but that's the only thing I can gather you might be referring to.

advael
0 replies
3d20h

Well, first of all, the stated purpose of RLHF isn't to "improve model accuracy" in the first place (and what we mean by accuracy here is pretty fraught by itself, as this could mean at least three different things). They initially pitched it as a "safety" measure (and I think if it wasn't obvious immediately how nonsensical a claim that is, it should at least be apparent that this is not a priority now that the company's shucked nearly the entire subset of its members that claimed to care about "AI safety").

The idea of RLHF as a mechanism for tuning models based on the principle that humans might have some hard-to-capture insight that could steer them independent of the way they're normally trained is the very best steelman for its value I could come up with. This aim is directly subverted by trying to use another language model to influence the human rater, so from my perspective it really brings us back to square one on what the fuck RLHF is supposed to be doing

Really, a lot of this comes down to what these models do versus how they are being advertised. A generative language model produces plausible prose that follows from the prompt it receives. From this, the claim that it should write working code is actually quite a bit stronger than the claim that it should write true facts, because plausible autocompletion will learn to mimic syntactic constraints but actually has very little to do with whether something is true, or whatever proxy or heuristic we may apply in place of "true" when assessing information (supported by evidence, perhaps. Logically sound, perhaps. The distinction between "plausible" and "true" is in many ways the whole point of every human epistemology). Like if you ask something trained on all human writing whether the Axis or the Allies won WWII, the answer will depend on whether you phrased the question in a way that sounds like Philip K. Dick would write it. This isn't even incorrect behavior by the standards of the model, but people want to use these things like some kind of oracle or to replace google search or whatever, which is a misconception about what the thing does, and one that's very profitable for the people selling it.

vhiremath4
0 replies
3d23h

This is kind of what I was thinking. I don’t get it. It seems like CriticGPT was maybe trained using RM/RL with PPO as well? So there’s gonna be mistakes with what CriticGPT pushes back on which may make the labeler doubt themselves?

manilbeat
0 replies
3d16h

RLHF worked well for midjourney but I think that is because it is outsourcing something that is ultimately completely subjective and very subtle like human visual aesthetic choice that can't be "wrong".

I tried to understand the paper and I can't really make sense of it for "code".

It seems like this would inherit a subtler version of all the problems from expert systems.

A press release of this does feel rather AI bubbly. Not quite Why The Future Doesn't Need Us level but I think we are getting close.

jmount
4 replies
4d

Since the whole thing is behind an API, exposing the works adds little value. If the corrections worked at an acceptable rate, one would just want them applied at the source.

renewiltord
3 replies
4d

If the corrections worked at an acceptable rate, one would just want them applied at the source.

What do you mean? The model is for improving their RLHF trainers performance. RLHF does get applied "at the source" so to speak. It's a modification on the model behind the API.

Perhaps you could say what you think this thing is for, and then share why you think it's not "at the source".

Panoramix
2 replies
4d

Not OP, but the screenshot in the article pretty much shows something that is not at the source.

You'd like to get the "correct" answer straight away, not watch a discussion between two bots.

ertgbnm
0 replies
4d

You are missing the point of the model in the first place. By having higher quality RLHF datasets, you get a higher quality final model. CriticGPT is not a product, but a tool to make GPT-4 and future models better.

IanCal
0 replies
4d

Yes, but this is about helping the people who are training the model.

bluelightning2k
4 replies
3d20h

Evaluators with CriticGPT outperform those without 60% of the time.

So, slightly better than random chance. I guess a win is a win, but I would have thought this would be higher. I'd kind of have assumed that just asking GPT itself if it's sure would give this kind of lift.

pama
2 replies
3d16h

I'm not sure why 60 vs 40 is slightly better than random chance. A person using this system has a 50% higher success rate than those not using it. I wouldn't call this a slightly better result.

smrq
1 replies
3d16h

That's just... obviously incorrect. It's 10% higher. (60%, as opposed to the expected 50% from random chance.)

pama
0 replies
3d15h

I’m not sure what you mean.

You can see the plots if you prefer, or think of it this way: out of a total of 100 trials, one team gets 40 and the other gets 60 = 40 + 40 * 50%

If you want to think of a 75% win rate as a more extreme example: you could say 25% above random or you could say one team wins 3 times as many cases as the other. Both are equivalent but I think that the second way conveys the strength of the difference much better.

The results in this work are statistically significant and substantial.
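
Spelled out (same numbers from the thread, two different baselines):

    wins_with, wins_without = 60, 40       # out of 100 head-to-head comparisons
    print(wins_with / wins_without - 1)    # 0.5 -> "a 50% higher success rate"
    print(wins_with - 50)                  # 10  -> "10 points above a coin flip"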

torginus
0 replies
21h24m

Not sure what this means, but if someone asked me to critique iOS code, for example, I wouldn't be much help since I don't know the first thing about it, other than some generic best practices.

I'm sure ChatGPT would outperform me, and I could only aid it in very limited ways.

That doesn't mean an expert iOS programmer wouldn't run circles around it.

wcoenen
1 replies
4d

This is about RLHF training. But I've wondered if something similar could be used to automatically judge the quality of the data that is used in pre-training, and then spend more compute on the good stuff. Or throw out really bad stuff even before building the tokenizer, to avoid those "glitch token" problems. Etc.
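
A hypothetical sketch of the document-level version of that idea; quality_score here is a stand-in for whatever judge you'd use (a small classifier, a reward model, an LLM), not anything from the paper:

    def filter_pretraining_corpus(documents, quality_score, keep_threshold=0.5):
        """Score raw documents before tokenization; keep good ones with their scores."""
        kept = []
        for doc in documents:
            score = quality_score(doc)     # e.g. 0.0 (junk) .. 1.0 (clean prose/code)
            if score >= keep_threshold:
                kept.append((doc, score))  # score doubles as a sampling weight later
        return kept

    # Higher-scoring documents can then be sampled more often during pre-training
    # ("spend more compute on the good stuff"); anything below the threshold never
    # even reaches the tokenizer.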

smsx
0 replies
4d

Yup, check out Rho-1 by Microsoft Research.

rob74
1 replies
3d8h

You build an LLM that makes mistakes. Then you build another LLM to find the mistakes the first LLM makes, and, surprise:

CriticGPT’s suggestions are not always correct

Now you have two problems...

tim333
0 replies
3d8h

Bit like human fact checking really.

jimmytucson
1 replies
4d

What's the difference between CriticGPT and ChatGPT with a prompt that says "You are a software engineer, your job is to review this code and point out bugs, here is what the code is supposed to do: {the original prompt}, here is the code {original response}, review the code," etc.

ipaddr
0 replies
3d21h

$20 a month

rvz
0 replies
3d23h

And both can still be wrong as they have no understanding of the mistake.

rodoxcasta
0 replies
4d

Additionally, when people use CriticGPT, the AI augments their skills, resulting in more comprehensive critiques than when people work alone, and fewer hallucinated bugs than when the model works alone.

But, as per the first graphic, CriticGPT alone has better comprehensiveness than CriticGPT+Human? Is that right?

megaman821
0 replies
4d

I wonder if you could apply this to training data. Like here is an example of a common mistake and why that mistake could be made, or here is a statement made in jest and why it could be found funny.

integral_1699
0 replies
4d

I've been using this approach myself, albeit manually, with ChatGPT. I first ask my question, then open a new chat and ask it to find flaws with the previous answer. Quite often, it does improve the end result.
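
Roughly this, if you script it (assuming the OpenAI Python SDK with OPENAI_API_KEY set; the important detail is that the critique call sees only the answer, not the first chat's history):

    from openai import OpenAI

    client = OpenAI()

    def ask(prompt):
        r = client.chat.completions.create(
            model="gpt-4o",  # illustrative model choice
            messages=[{"role": "user", "content": prompt}],
        )
        return r.choices[0].message.content

    answer = ask("How do I find the longest common prefix of a list of file paths in Python?")
    critique = ask("Find flaws, if any, in the following answer. Be specific.\n\n" + answer)
    # Optionally, a third call can ask for a revised answer given the critique.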

croes
0 replies
3d16h

Next step: Critic²GPT, a model based on GPT-4, writes critiques of CriticGPT responses.