HN comments for: RLHF is just barely RL

gizmo

109 replies

9h58m

2024-08-08 08:33:36 UTC

This is why AI coding assistance will leap ahead in the coming years. Chat AI has no clear reward function (basically impossible to judge the quality of responses to open-ended questions like historical causes for a war). Coding AI can write tests, write code, compile, examine failed test cases, search for different coding solutions that satisfy more test cases or rewrite the tests, all in an unsupervised loop. And then whole process can turn into training data for future AI coding models.

I expect language models to also get crazy good at mathematical theorem proving. The search space is huge but theorem verification software will provide 100% accurate feedback that makes real reinforcement learning possible. It's the combination of vibes (how to approach the proof) and formal verification that works.

Formal verification of program correctness never got traction because it's so tedious and most of the time approximately correct is good enough. But with LLMs in the mix the equation changes. Having LLMs generate annotations that an engine can use to prove correctness might be the missing puzzle piece.

discreteevent

61 replies

9h26m

2024-08-08 09:05:36 UTC

Does programming have a clear reward function? A vague description from a business person is not it. By the time someone (a programmer?) has written a reward function that is clear enough, how would that function look compared to a program?

cs702

14 replies

7h58m

2024-08-08 10:34:04 UTC

Programming has a clear reward function when the problem being solving is well-specified, e.g., "we need a program that grabs data from these three endpoints, combines their data in this manner, and returns it in this JSON format."

The same is true for math. There is a clear reward function when the goal is well-specified, e.g., "we need a sequence of mathematical statements that prove this other important mathematical statement is true."

danpalmer

6 replies

7h43m

2024-08-08 10:49:17 UTC

I’m not sure I would agree. By the time you’ve written a full spec for it, you may as well have just written a high level programming language anyway. You can make assumptions that minimise the spec needed… but also programming APIs can have defaults so that’s no advantage.

I’d suggest that the Python code for your example prompt with reasonable defaults is not actually that far from the prompt itself in terms of the time necessary to write it.

However, add tricky details like how you want to handle connection pooling, differing retry strategies, short circuiting based on one of the results, business logic in the data combination step, and suddenly you’ve got a whole design doc in your prompt and you need a senior engineer with good written comms skills to get it to work.

cs702

2 replies

7h34m

2024-08-08 10:58:13 UTC

Thanks. I view your comment as orthogonal to mine, because I didn't say anything about how easy or hard it would be for human beings to specify the problems that must be solved. Some problems may be easy to specify, others may be hard.

I feel we're looking at the need for a measure of the computational complexity of problem specifications -- something like Kolmogorov complexity, i.e., minimum number of bits required, but for specifying instead of solving problems.

danpalmer

1 replies

4h58m

2024-08-08 13:33:52 UTC

Apologies, I guess I agree with your sentiment but disagree with the example you gave as I don't think it's well specified, and my more general point is that there isn't an effective specification, which means that in practice there isn't a clear reward function. If we can get the clear specification, which we probably can do proportionally to the complexity of the problem, and not getting very far up the complexity curve, then I would agree we can get the good reward function.

cs702

0 replies

4h40m

2024-08-08 13:52:14 UTC

the example you gave

Ah, got it. I was just trying to keep my comment short!

chasd00

1 replies

5h44m

2024-08-08 12:47:30 UTC

I’m not sure I would agree. By the time you’ve written a full spec for it, you may as well have just written a high level programming language anyway.

Remember all those attempts to transform UML into code back in the day? This sounds sorta like that. I’m not a total genai naysayer but definitely in the “cautiously curious” camp.

danpalmer

0 replies

5h1m

2024-08-08 13:31:01 UTC

Absolutely, we've tried lots of ways to formalise software specification and remove or minimise the amount of coding, and almost none of it has stuck other than creating high level languages and better code-level abstractions.

I think generative AI is already a "really good autocomplete" and will get better in that respect, I can even see it generating good starting points, but I don't think in its current form it will replace the act of programming.

bee_rider

0 replies

3h14m

2024-08-08 15:17:39 UTC

Yeah, an LLM applied to converting design docs to programs seems like, essentially, the invention of an extremely high level programming language. Specifying the behavior of the program in sufficient detail is… why we have programming languages.

There’s the task of writing syntax, which is the mechanical overhead of the task of telling the computer what to do. People should focus on the latter (too much code is a symptom of insufficient automation or abstraction). Thankfully lots of people have CS degrees, not “syntax studies” degrees, right?

seanthemon

1 replies

7h46m

2024-08-08 10:45:30 UTC

  >when the problem being solving is well-specified

Phew! Sounds like i'll be fine, thank god for product owners.

steveBK123

0 replies

7h24m

2024-08-08 11:08:09 UTC

20 years, number of "well specified" requirements documents I've received: 0.

nyrikki

0 replies

3h43m

2024-08-08 14:49:00 UTC

A couple of problems that is impossible to prove from the constructivism angle:

1) Addition of the natural numbers 2) equality of two real numbers

When you restrict your tools to perceptron based feed forward networks with high parallelism and no real access to 'common knowledge', the solution set is very restricted.

Basically what Gödel proved that destroyed Russel's plans for the Mathmatica Principia applies here.

Programmers can decide what is sufficient if not perfect in models.

ekianjo

0 replies

7h35m

2024-08-08 10:56:32 UTC

Programming has a clear reward function when the problem being solving is well-specified

the reason why we spend time programming is because the problems in question are not easily defined, let alone the solutions.

dartos

0 replies

4h57m

2024-08-08 13:34:47 UTC

programming has a clear reward function.

If you’re the most junior level, sure.

Anything above that, things get fuzzy, requirements change, biz goals shift.

I don’t really see this current wave of AI giving us anything much better than incremental improvement over copilot.

A small example of what I mean:

These systems are statistically based, so there’s no probability. Because of that, I wouldn’t even gain anything from having it write my tests since tests are easily built wrong in subtle ways.

I’d need to verify the test by reviewing it and, imo, writing the test would be less time than coaxing a correct one, reviewing, re-coaxing, repeat.

agos

0 replies

6h12m

2024-08-08 12:19:33 UTC

can you give an example of what "in this manner" might be?

FooBarBizBazz

0 replies

7h16m

2024-08-08 11:15:53 UTC

This could make programming more declarative or constraint-based, but you'd still have to specify the properties you want. Ultimately, if you are defining some function in the mathematical sense, you need to say somehow what inputs go to what outputs. You need to communicate that to the computer, and a certain number of bits will be needed to do that. Of course, if you have a good statistical model of how-probably a human wants a given function f, then you can perform that communication to the machine in 1/log(P(f)) bits, so the model isn't worthless.

Here I have assumed something about the set that f lives in. I am taking for granted that a probability measure can be defined. In theory, perhaps there are difficulties involving the various weird infinities that show up in computing, related to undecideability and incompleteness and such. But at a practical level, if we assume some concrete representation of the program then we can just define that it is smaller than some given bound, and ditto for a number of computational steps with a particular model of machine (even if fairly abstract, like some lambda calculus thing), so realistically we might be able to not worry about it.

Also, since our input and output sets are bounded (say, so many 64-bit doubles in, so many out), that also gives you a finite set of functions in principle; just think of the size of the (impossibly large) lookup table you'd need to represent it.

paxys

11 replies

5h50m

2024-08-08 12:42:03 UTC

Exactly, and people have been saying this for a while now. If an "AI software engineer" needs a perfect spec with zero ambiguity, all edge cases defined, full test coverage with desired outcomes etc., then the person writing the spec is the actual software engineer, and the AI is just a compiler.

sgu999

2 replies

4h29m

2024-08-08 14:02:36 UTC

then the person writing the spec is the actual software engineer

Sounds like this work would involve asking questions to collaborators, guess some missing answers, write specs and repeat. Not that far ahead of the current sota of AI...

nyrikki

1 replies

4h2m

2024-08-08 14:29:35 UTC

Same reason the visual programming paradigm failed, tbe main problem is not the code.

While writing simple functions may be mechanistic, being a developer is not.

'guess some missing answers' is why Waterfall, or any big upfront design has failed.

People aren't simply loading pig iron into rail cars like Taylor assumed.

The assumption of perfect central design with perfect knowledge and perfect execution simply doesn't work for systems which are for more like an organism than a machine.

gizmo

0 replies

2h47m

2024-08-08 15:45:05 UTC

Waterfall fails when domain knowledge is missing. Engineers won't take "obvious" problems into consideration when they don't even know what the right questions to ask are. When a system gets rebuild for the 3rd time the engineers do know what to build and those basic mistakes don't get made.

Next gen LLMs, with their encyclopedic knowledge about the world, won't have that problem. They'll get the design correct on their first attempt because they're already familiar with the common pitfalls.

Of course we shouldn't expect LLMs to be a magic bullet that can program anything. But if your frame of reference is "visual programming" where the goal is to turn poorly thought out requirements into a reasonably sensible state machine then we should expect LLMs to get very good at that compared to regular people.

qup

1 replies

3h30m

2024-08-08 15:01:35 UTC

What makes you think they'll need a perfect spec?

Why do you think they would need a more defined spec than a human?

digging

0 replies

2h33m

2024-08-08 15:58:27 UTC

A human has the ability to contact the PM and say, "This won't work, for $reason," or, "This is going to look really bad in $edgeCase, here are a couple options I've thought of."

There's nothing about AI that makes such operations intrinsically impossible, but they require much more than just the ability to generate working code.

dartos

1 replies

5h1m

2024-08-08 13:30:30 UTC

We’ve also learned that starting off by rigidly defined spec is actually harmful to most user facing software, since customers change their minds so often and have a hard time knowing what they want right from the start.

diffxx

0 replies

52m

2024-08-08 17:39:21 UTC

This is why most of the best software is written by people writing things for themselves and most of the worst is made by people making software they don't use themselves.

satvikpendem

0 replies

5h24m

2024-08-08 13:07:45 UTC

Reminds me of when computers were literally humans computing things (often women). How time weaves its circular web.

mlavrent

0 replies

29m

2024-08-08 18:02:40 UTC

This is not quite right - a specification is not equivalent to writing software, and the code generator is not just a compiler - in fact, generating implementations from specifications is a pretty active area of research (a simpler problem is the problem of generating a configuration that satisfies some specification, "configuration synthesis").

In general, implementations can be vastly more complicated than even a complicated spec (e.g. by having to deal with real-world network failures, etc.), whereas a spec needs only to describe the expected behavior.

In this context, this is actually super useful, since defining the problem (writing a spec) is usually easier than solving the problem (writing an implementation); it's not just translating (compiling), and the engineer is now thinking at a higher level of abstraction (what do I want it to do vs. how do I do it).

mattnewton

0 replies

3h20m

2024-08-08 15:12:18 UTC

I mean, that's already the case in many places, the senior engineer / team lead gathering requirements and making architecture decisions is removing enough ambiguity to hand it off to juniors churning out the code. This just makes very cheap, very fast typing but uncreative and a little dull junior developers.

_the_inflator

0 replies

4h42m

2024-08-08 13:50:15 UTC

Exactly. This is what I tell everyone. The harder you work on specs the easier it gets in the aftermath. And this is exactly what business with lofty goals doesn’t get or ignores. Put another way: a fool with a tool…

Also look out for optimization the clever way.

gizmo

11 replies

7h47m

2024-08-08 10:44:21 UTC

Much business logic is really just a state machine where all the states and all the transitions need to be handled. When a state or transition is under-specified an LLM can pass the ball back and just ask what should happen when A and B but not C. Or follow more vague guidance on what should happen in edge cases. A typical business person is perfectly capable of describing how invoicing should work and when refunds should be issued, but very few business people can write a few thousand lines of code that covers all the cases.

discreteevent

10 replies

5h46m

2024-08-08 12:46:08 UTC

an LLM can pass the ball back and just ask what should happen when A and B but not C

What should the colleagues of the business person review before deciding that the system is fit for purpose? Or what should they review when the system fails? Should they go back over the transcript of the conversation with the LLM?

gizmo

4 replies

4h41m

2024-08-08 13:51:12 UTC

How does a business person today decide if a system is fit for purpose when they can't read code? How is this different?

Jensson

3 replies

3h26m

2024-08-08 15:05:25 UTC

They don't, the software engineer does that. It is different since LLMs can't test the system like a human can.

Once the system can both test and update the spec etc to fix errors in the spec and build the program and ensure the result is satisfactory, we have AGI. If you argue an AGI could do it, then yeah it could as it can replace humans at everything, the argument was for an AI that isn't yet AGI.

gizmo

2 replies

3h1m

2024-08-08 15:30:54 UTC

The world runs on fuzzy underspecified processes. On excel sheets and post-it notes. Much of the world's software needs are not sophisticated and don't require extensive testing. It's OK if a human employee is in the loop and has to intervenes sometimes when an AI-built system malfunctions. Businesses of all sizes have procedures where problems get escalated to more senior people with more decision-making power. The world is already resilient against mistakes made by tired/inattentive/unintelligent people, and mistakes made by dumb AI systems will blend right in.

discreteevent

1 replies

2h47m

2024-08-08 15:44:45 UTC

The world runs on fuzzy underspecified processes. On excel sheets and post-it notes.

Excel sheets are not fuzzy and underspecified.

It's OK if a human employee is in the loop and has to intervenes sometimes

I've never worked on software where this was OK. In many cases it would have been disastrous. Most of the time a human employee could not fix the problem without understanding the software.

gizmo

0 replies

2h24m

2024-08-08 16:08:16 UTC

All software that interops with people, other businesses, APIs, deals with the physical world in any way, or handles money has cases that require human intervention. It's 99.9% of software if not more. Security updates. Hardware failures. Unusual sensor inputs. A sudden influx of malformed data. There is no such thing as an entirely autonomous system.

But we're not anywhere close to maximally automated. Today (many? most?) office workers do manual data entry and processing work that requires very little thinking. Even automating just 30% of their daily work is a huge win.

ben_w

4 replies

4h53m

2024-08-08 13:38:37 UTC

As an LLM can output source code, that's all answerable with "exactly what they already do when talking to developers".

discreteevent

3 replies

4h32m

2024-08-08 13:59:21 UTC

There are two reasons the system might fail:

1) The business person made a mistake in their conversation/specification.

In this case the LLM will have generated code and tests that match the mistake. So all the tests will pass. The best way to catch this before it gets to production is to have someone else review the specification. But the problem is that the specification is a long trial-and-error conversation in which later parts may contradict earlier parts. Good luck reviewing that.

2) The LLM made a mistake.

The LLM may have made the mistake because of a hallucination which it cannot correct because in trying to correct it the same hallucination invalidates the correction. At this point someone has to debug the system. But we got rid of all the programmers.

ben_w

2 replies

4h22m

2024-08-08 14:10:11 UTC

This still resolves as "business person asks for code, business person gets code, business person says if code is good or not, business person deploys code".

That an LLM or a human is where the code comes from, doesn't make much difference.

Though it does kinda sound like you're assuming all LLMs must develop with Waterfall? That they can't e.g. use Agile? (Or am I reading too much into that?)

discreteevent

1 replies

2h55m

2024-08-08 15:36:57 UTC

business person says if code is good or not

How do they do this? They can't trust the tests because the tests were also developed by the LLM which is working from incorrect information it received in a chat with the business person.

ben_w

0 replies

2h26m

2024-08-08 16:05:56 UTC

The same way they already do with humans coders whose unit tests were developed by exactly same flawed processes:

Mediocrely.

Sometimes the current process works, other times the planes fall out of the sky, or updates causes millions of computers to blue screen on startup at the same time.

LLMs in particular, and AI in general, doesn't need to beat humans at the same tasks.

eru

3 replies

9h0m

2024-08-08 09:31:36 UTC

Does programming have a clear reward function? A vague description from a business person isn't it. By the time someone (a programmer?) has written a reward function that is clear enough, how would that function look compared to a program?

Well, to give an example: the complexity class NP is all about problems that have quick and simple verification, but finding solutions for many problems is still famously hard.

So there are at least some domains where this model would be a step forward.

thaumasiotes

2 replies

8h47m

2024-08-08 09:45:00 UTC

But in that case, finding the solution is hard and you generally don't try. Instead, you try to get fairly close, and it's more difficult to verify that you've done so.

eru

1 replies

8h6m

2024-08-08 10:25:23 UTC

No. Most instances of most NP hard problems are easy to find solutions for. (It's actually really hard to eg construct a hard instance for the knapsack problem. And SAT solvers also tend to be really fast in practice.)

And in any case, there are plenty of problems in NP that are not NP hard, too.

Yes, approximation is also an important aspect of many practical problems.

There's also lots of problems where you can easily specify one direction of processing, but it's hard to figure out how to undo that transformation. So you can get plenty of training data.

imtringued

0 replies

4h19m

2024-08-08 14:13:05 UTC

I have a very simple integer linear program and it is really waiting for the heat death of the universe.

No, running it as a linear program is still slow.

I'm talking about small n=50 taking tens of minutes for a trivial linear program. Obviously the actual linear program is much bigger and scales quadratically in size, but still. N=50 is nothing.

jimbokun

2 replies

6h12m

2024-08-08 12:20:10 UTC

The reward function could be "pass all of these tests I just wrote".

marcosdumay

1 replies

5h31m

2024-08-08 13:00:25 UTC

Lol. Literally.

If you have those many well written tests, you can pass them to a constraint solver today and get your program. No LLM needed.

Or even run your tests instead of the program.

emporas

0 replies

2h18m

2024-08-08 16:13:37 UTC

Probably the parent assumes that he does have the tests, billions of them.

One very strong LLM could generate billions of tests alongside the working code and then train another smaller model, or feed it into the next iteration of training same the strong model. Strong LLMs do exist for that purpose, Nemotron 320B and Llama 3 450B.

It would be interesting if a dataset like that would be created like that, and then released as open source. Many LLMs proprietary or not, could incorporate the dataset in their training, and have on the internet hundreds of LLMs suddenly become much better at coding, all of them at once.

LeifCarrotson

2 replies

4h19m

2024-08-08 14:12:57 UTC

I think you could set up a good reward function for a programming assistance AI by checking that the resulting code is actually used. Flag or just 'git blame' the code produced by the AI with the prompts used to produce it, and when you push a release, it can check which outputs were retained in production code from which prompts. Hard to say whether code that needed edits was because the prompt was bad or because the code was bad, but at least you can get positive feedback when a good prompt resulted in good code.

rfw300

0 replies

4h7m

2024-08-08 14:25:14 UTC

GitHub Copilot's telemetry does collect data on whether generated code snippets end up staying in the code, so presumably models are tuned on this feedback. But you haven't solved any of the problems set out by Karpathy here—this is just bankshot RLHF.

bee_rider

0 replies

3h27m

2024-08-08 15:04:39 UTC

That could be interesting but it does seem like a much fuzzier and slower feedback loop than the original idea.

It also seems less unique to code. You could also have a chat bot write an encyclopedia and see if the encyclopedias sold well. Chat bots could edit Wikipedia and see if their edits stuck as a reward function (seems ethically pretty questionable or at least in need of ethical analysis, but it is possible).

The maybe-easy to evaluate reward function is an interesting aspect of code (which isn’t to say it is the only interesting aspect, for sure!)

tablatom

1 replies

8h25m

2024-08-08 10:06:51 UTC

Very good point. For some types of problems maybe the answer is yes. For example porting. The reward function is testing it behaves the same in the new language as the old one. Tricky for apps with a gui but doesn't seem impossible.

The interesting kind of programming is the kind where I'm figuring out what I'm building as part of the process.

Maybe AI will soon be superhuman in all the situations where we know exactly what we want (win the game), but not in the areas we don't. I find that kind of cool.

martinflack

0 replies

3h46m

2024-08-08 14:45:23 UTC

Even for porting there's a bit of ambiguity... Do you port line-for-line or do you adopt idioms of the target language? Do you port bug-for-bug as well as feature-for-feature? Do you leave yet-unused abstractions and opportunities for expansion that the original had coded in, if they're not yet used, and the target language code is much simpler without?

I've found when porting that the answers to these are sometimes not universal for a codebase, but rather you are best served considering case-by-case inside the code.

Although I suppose an AI agent could be created that holds a conversation with you and presents the options and acts accordingly.

littlestymaar

1 replies

8h20m

2024-08-08 10:12:05 UTC

“A precise enough specification is already code”, which means we'll not run out of developers in the short term. But the day to day job is going to be very different, maybe as different as what we're doing now compared to writing machine code on punchcards.

mattmanser

0 replies

6h11m

2024-08-08 12:21:04 UTC

Doubtful. This is the same mess we've been in repeatedly with 'low code'/'no code' solutions.

Every decade it's 'we don't need programmers anymore'. Then it turns out specifying the problem needs programmers. Then it turns out the auto-coder can only reach a certain level of complexity. Then you've got real programmers modifying over-complicared code. Then everyone realizes they've wasted millions and it would have been quicker and cheaper to get the programmers to write the code in the first place.

The same will almost certainly happen with AI generated code for the next decade or two, just at a slightly higher level of program complexity.

waldrews

0 replies

2024-08-08 18:29:23 UTC

There's no reward function in the sense that optimizing the reward function means the solution is ideal.

There are objective criteria like 'compiles correctly' and 'passes self-designed tests' and 'is interpreted as correct by another LLM instance' which go a lot further than criteria that could be defined for most kinds of verbal questions.

tomrod

0 replies

4h35m

2024-08-08 13:57:03 UTC

You can define one based on passed tests, code coverage, other objectives, or weighted combinations without too much loss of generality.

ryukoposting

0 replies

5h59m

2024-08-08 12:32:48 UTC

If we will struggle to create reward functions for AI, then how different is that from the struggles we already face when divvying up product goals into small tasks to fit our development cycles?

In other words, to what extent does Agile's ubiquity prove our competence in turning product goals into de facto reward functions?

rossamurphy

0 replies

9h16m

2024-08-08 09:15:23 UTC

consteval

0 replies

3h32m

2024-08-08 14:59:36 UTC

There's levels to this.

Certainly "compiled" is one reward (although a blank file fits that...) Another is test cases, input and output. This doesn't work on a software-wide scale but function-wide it can work.

In the future I think we'll see more of this test-driven development. Where developers formally define the requirements and expectations of a system and then an LLM (combined with other tools) generates the implementation. So instead of making the implementation, you just declaratively say what the implementation should do (and shouldn't).

axus

0 replies

4h13m

2024-08-08 14:18:55 UTC

If they get permission and don't mind waiting, they could check if people throw away the generated code or keep it as-is.

airstrike

0 replies

5h4m

2024-08-08 13:27:20 UTC

My reward in Rust is often when the code actually compiles...

_the_inflator

0 replies

4h37m

2024-08-08 13:54:55 UTC

Full circle but instead of determinism you introduce some randomness. Not good.

Also the reasoning is something business is dissonant about. The majority of planning and execution teams stick to processes. I see way more potential automating these than all parts in app production.

Business is going to have a hard time, when they believe, they alone can orchestrate some AI consoles.

incorrecthorse

16 replies

8h33m

2024-08-08 09:58:38 UTC

Unless you want an empty test suite or a test suite full of `assert True`, the reward function is more complicated than you think.

rafaelmn

10 replies

8h29m

2024-08-08 10:02:37 UTC

Code coverage exists. Shouldn't be hard at all to tune the parameters to get what you want. We have really good tools to reason about code programmatically - linters, analyzers, coverage, etc.

SkiFire13

7 replies

7h48m

2024-08-08 10:44:19 UTC

In my experience they are ok (not excellent) for checking whether some code will crash or not. But checking whether the code logic is correct with respect to the requirements is far from being automatized.

rafaelmn

6 replies

6h49m

2024-08-08 11:42:33 UTC

But for writing tests that's less of an issue. You start with known good/bad code and ask it to write tests against a spec for some code X - then the evaluation criteria is something like did the test cover the expected lines and produce the expected outcome (success/fail). Pepper in lint rules for preferred style etc.

SkiFire13

4 replies

5h41m

2024-08-08 12:51:04 UTC

But this will lead you to the same problem the tweet is talking! You are training a reward model based on human feedback (whether the code satisfies the specification or not). This time the human feedback may seem more objective, but in the end it's still non-exhaustive human feedback which will lead to the reward model being vulnerable to some adversarial inputs which the other model will likely pick up pretty quickly.

rafaelmn

3 replies

5h38m

2024-08-08 12:53:35 UTC

It's based on automated tools and evaluation (test runner, coverage, lint) ?

SkiFire13

2 replies

4h55m

2024-08-08 13:36:53 UTC

The input data is still human produced. Who decides what is code that follows the specification and what is code that doesn't? And who produces that code? Are you sure that the code that another model produces will look like that? If not then nothing will prevent you from running into adversarial inputs.

And sure, coverage and lints are objective metrics, but they don't directly imply the correctness of a test. Some tests can reach a high coverage and pass all the lint checks but still be incorrect or test the wrong thing!

Whether the test passes or not is what's mostly correlated to whether it's correct or not. But similarly for an image recognizer the prompt of whether an image is a flower or not is also objective and correlated, and yet researchers continue to find adversarial inputs for image recognizer due to the bias in their training data. What makes you think this won't happen here too?

rafaelmn

1 replies

4h29m

2024-08-08 14:03:07 UTC

The input data is still human produced

So are rules for the game of go or chess ? Specifying code that satisfies (or doesn't satisfy) is a problem statement - evaluation is automatic.

but they don't directly imply the correctness of a test.

I'd be willing to bet that if you start with an existing coding model and continue training it with coverage/lint metrics and evaluation as feedback you'd get better at generating tests. Would be slow and figuring out how to build a problem dataset from existing codebases would be the hard part.

SkiFire13

0 replies

4h11m

2024-08-08 14:20:27 UTC

So are rules for the game of go or chess ?

The rules are well defined and you can easily write a program that will tell whether a move is valid or not, or whether a game has been won or not. This allows you generate virtually infinite amount of data to train the model on without human intervention.

Specifying code that satisfies (or doesn't satisfy) is a problem statement

This would be true if you fix one specific program (just like in Go or Chess you fix the specific rules of the game and then train a model on those) and want to know whether that specific program satisfies some given specification (which will be the input of your model). But if instead you want the model to work with any program then that will have to become part of the input too and you'll have to train it an a number of programs which will have to be provided somehow.

and figuring out how to build a problem dataset from existing codebases would be the hard part

This is the "Human Feedback" part that the tweet author talks about and the one that will always be flawed.

layer8

0 replies

26m

2024-08-08 18:05:55 UTC

Who writes the spec to write tests against?

In the end, your are replacing the application code by a spec, which needs to have a comparable level of detail in order for the AI to not invent its own criteria.

incorrecthorse

1 replies

3h57m

2024-08-08 14:35:16 UTC

Code coverage proves that the code runs, not that it does what it should do.

rafaelmn

0 replies

3h3m

2024-08-08 15:29:00 UTC

If you have a test that completes with the expected outcome and hits the expected code paths you have a working test - I'd say that heuristic will get you really close with some tweaks.

gizmo

1 replies

8h12m

2024-08-08 10:20:09 UTC

It's easy to imagine why something could never work.

It's more interesting to imagine what just might work. One thing that has plagued programmers for the past decades is the difficulty of writing correct multi-threaded software. You need fine-grained locking otherwise your threads will waste time waiting for mutexes. But color-coding your program to constrain which parts of your code can touch which data and when is tedious and error-prone. If LLMs can annotate code sufficiently for a SAT solver to prove thread safety that's a huge win. And that's just one example.

imtringued

0 replies

4h8m

2024-08-08 14:24:05 UTC

Rust is that way.

WithinReason

1 replies

7h36m

2024-08-08 10:55:51 UTC

Adversarial networks are a straightforward solution to this. The reward for generating and solving tests is different.

imtringued

0 replies

4h4m

2024-08-08 14:27:39 UTC

That's a good point. A model that is capable of implementing a nonsense test is still better than a model that can't. The implementer model only needs a good variety of tests. They don't actually have to translate a prompt into a test.

littlestymaar

0 replies

8h23m

2024-08-08 10:08:57 UTC

It's not trivial to get right but it sounds within reach, unlike “hallucinations” with general purpose LLM usage.

xxs

8 replies

7h18m

2024-08-08 11:14:14 UTC

This reads as a proper marketing ploy. If the current incarnation of AI + coding is anything to go by - it'll take leaps just to make it barely usable (or correct)

Kiro

5 replies

6h24m

2024-08-08 12:07:58 UTC

My take is the opposite: considering how good AI is at coding right now I'm eager to see what comes next. I don't know what kind of tasks you've tried using it for but I'm surprised to hear someone think that it's not even "barely usable". Personally, I can't imagine going back to programming without a coding assistant.

Barrin92

2 replies

6h16m

2024-08-08 12:16:15 UTC

but I'm surprised to hear someone think that it's not even "barely usable".

write performance oriented and memory safe C++ code. Current coding assistants are glorified autocomplete for unit tests or short api endpoints or what have you but if you have to write any safety oriented code or you have to think about what the hardware does it's unusable.

I tried using several of the assistants and they write broken or non-performant code so regularly it's irresponsible to use them.

imtringued

0 replies

4h9m

2024-08-08 14:23:08 UTC

Isn't this a good reward function for RL? Take a codebase's test suite. Rip out a function, let the LLM rewrite the function, benchmark it and then RL it using the benchmark results.

agos

0 replies

6h11m

2024-08-08 12:21:16 UTC

I've also had trouble having assistants help with CSS, which is ostensibly easier than performance oriented and memory safe C++

commodoreboxer

0 replies

4h51m

2024-08-08 13:40:21 UTC

I've been playing with it recently, and I find unless there are very clear patterns in surrounding code or on the Internet, it does quite terribly. Even for well-seasoned libraries like V8 and libuv, it can't reliably not make up APIs that don't exist and it very regularly spits out nonsense code. Sometimes it writes code that works and does the wrong thing, it can't reliably make good decisions around undefined behavior. The worst is when I've asked for it to refactor code, and it actually subtly changes the behavior in the process.

I imagine it's great for CRUD apps and generating unit tests, but for anything reliable where I work, it's not even close to being useful at all, let alone a game changer. It's a shame, because it's not like I really enjoy fiddling with memory buffers and painstakingly avoiding UB, but I still have to do it (I love Rust, but it's not an option for me because I have to support AIX. V8 in Rust also sounds like a nightmare, to be honest. It's a very C++ API).

ben_w

0 replies

4h46m

2024-08-08 13:45:56 UTC

I've seen them all over the place.

The best are shockingly good… so long as their context doesn't expire and they forget e.g. the Vector class they just created has methods `.mul(…)` rather than `.multiply(…)` or similar. Even the longer context windows are still too short to really take over our jobs (for now), the haystack tests seem to be over-estimating their quality in this regard.

The worst LLM's that I've seen (one of the downloadable run-locally models but I forget which) — one of my standard tests is that I ask them to "write Tetris as a web app", and it started off doing something a little bit wrong (square grid), before giving up on that task entirely and switching from JavaScript to python and continuing by writing a script to train a new machine learning model (and people still ask how these things will "get out of the box" :P)

People who see more of the latter? I can empathise with them dismissing the whole thing as "just autocomplete on steroids".

EugeneOZ

1 replies

7h13m

2024-08-08 11:18:22 UTC

TDD approach could play the RL role.

jgalt212

0 replies

7h8m

2024-08-08 11:23:32 UTC

But what makes you think the ai generated tests will correctly represent the problem at hand?

davedx

6 replies

9h17m

2024-08-08 09:14:46 UTC

I'm pretty interested in the theorem proving/scientific research aspect of this.

Do you think it's possible that some version of LLM technology could discover new physical theories (that are experimentally verifiable), like for example a new theory of quantum gravity, by exploring the mathematical space?

Edit: this is just incredibly exciting to think about. I'm not an "accelerationist" but the "singularity" has never felt closer...

jimbokun

1 replies

6h7m

2024-08-08 12:25:03 UTC

Current LLMs are optimized to produce output most resembling what a human would generate. Not surpass it.

ben_w

0 replies

4h38m

2024-08-08 13:53:41 UTC

The output most pleasing to a human, which is both better and worse.

Better, when we spot mistakes even if we couldn't create the work with the error. Think art: most of us can't draw hands, but we can spot when Stable Diffusion gets them wrong.

Worse also, because there are many things which are "common sense" and wrong, e.g. https://en.wikipedia.org/wiki/Category:Paradoxes_in_economic..., and we would collectively down-vote a perfectly accurate model of reality for violating our beliefs.

gizmo

1 replies

8h29m

2024-08-08 10:02:48 UTC

My hunch is that LLMs are nowhere near intelligent enough to make brilliant conceptual leaps. At least not anytime soon.

Where I think AI models might prove useful is in those cases where the problem is well defined, where formal methods can be used to validate the correctness of (partial) solutions, and where the search space is so large that work towards a proof is based on "vibes" or intuition. Vibes can be trained through reinforcement learning.

Some computer assisted proofs are already hundreds of pages or gigabytes long. I think it's a pretty safe bet that really long and convoluted proofs that can only be verified by computers will become more common.

https://en.wikipedia.org/wiki/Computer-assisted_proof

CuriouslyC

0 replies

6h42m

2024-08-08 11:50:03 UTC

They don't need to be intelligent to make conceptual leaps. DeepMind stuff just does a bunch of random RL experiments until it finds something that works.

tsimionescu

0 replies

8h6m

2024-08-08 10:25:56 UTC

I think the answer is almost certainly no, and is mostly unrelated to how smart LLMs can get. The issue is that any theory of quantum gravity would only be testable with equipment that is much, much more complex than what we have today. So even if the AI came up with some beautifully simple theory, testing that its predictions are correct is still not going to be feasible for a very long time.

Now, it is possible that it could come up with some theory that is radically different from current theories, where quantum gravity arises very naturally, and that fits all of the other predictions of of the current theories that we can measure - so we would have good reasons to believe the new theory and consider quantum gravity probably solved. But it's literally impossible to predict whether such a theory even exists, that is not mathematically equivalent to QM/QFT but still matches all confirmed predictions.

Additionally, nothing in AI tech so far predicts that current approaches should be any good at this type of task. The only tasks where AI has truly excelled at are extremely well defined problems where there is a huge but finite search space; and where partial solutions are easy to grade. Image recognition, game playing, text translation are the great successes of AI. And performance drops sharply with the uncertainty in the space, and with the difficulty of judging a partial solution.

Finding physical theories is nothing like any of these problems. The search space is literally infinite, partial solutions are almost impossible to judge, and even judging whether a complete solution is good or not is extremely difficult. Sure, you can check if it's mathematically coherent, but that tells you nothing about whether it describes the physical world correctly. And there are plenty of good physical theories that aren't fully formally proven, or weren't at the time they were invented - so mathematical rigour isn't even a very strong signal (e.g. Newton's infinitesimal calculus wasn't considerered sound until the 1900s or something, by which time his theories had long since been rewritten in other terms; the Dirac delta wasn't given a precise mathematical definition until much later than it's uses; and I think QFT still uses some iffy math even today).

esjeon

0 replies

9h6m

2024-08-08 09:25:34 UTC

IIRC, there have been people doing similar things using something close to brute-force. Nothing of real significance has been found. A problem is that there are infinitely many physically and mathematically correct theorems that would add no practical value.

djeastm

3 replies

6h40m

2024-08-08 11:51:43 UTC

Coding AI can write tests, write code, compile, examine failed test cases, search for different coding solutions that satisfy more test cases or rewrite the tests, all in an unsupervised loop.

Will this be able to be done without spending absurd amounts of energy?

jimbokun

0 replies

6h8m

2024-08-08 12:24:07 UTC

Energy efficiency might end up being the final remaining axis on which biological brains surpass manufactured ones before the singularity.

commodoreboxer

0 replies

4h42m

2024-08-08 13:49:21 UTC

The amount of energy is truly absurd. I dont chug a 16 oz bottle of water every time I answer a question.

ben_w

0 replies

4h27m

2024-08-08 14:05:19 UTC

Computer energy efficiency is not as constrained as minimum feature size, it's still doubling every 2.6 years or so.

Even if they were, a human-quality AI that runs at human-speed for x10 our body's calorie requirements in electricity, would still (at electricity prices of USD 0.1/kWh) undercut workers earning the UN abject poverty threshold.

lossolo

1 replies

5h12m

2024-08-08 13:20:18 UTC

Writing tests won't help you here, this problem is the same as other generation tasks. If the test passes, everything seems okay, right? Consider this: you now have a 50-line function just to display 'hello world'. It outputs 'hello world', so it scores well, but it's hardly efficient. Then, there's a function that runs in exponential time instead of the standard polynomial time that any sensible programmer would use in specific cases. It passes the tests, so it gets a high score. You also have assembly code embedded in C code, executed with 'asm'. It works for that particular case and passes the test, but the average C programmer won't understand what's happening in this code, whether it's secure, etc. Lastly, tests written by AI might not cover all cases, they could even fail to test what you intended because they might hallucinate scenarios (I've experienced this many times). Programming faces similar issues to those seen in other generation tasks in the current generation of large language models, though to a slightly lesser extent.

jononor

0 replies

44m

2024-08-08 17:47:45 UTC

One can image critics and code rewriters that optimize for computational, code style, and other requirements in addition to tests.

jimbokun

1 replies

6h12m

2024-08-08 12:19:45 UTC

Future coding where developers only ever write the tests is an intriguing idea.

Then the LLM generates and iterates on the code until it passes all of the tests. New requirements? Add more tests and repeat.

This would be legitimately paradigm shifting, vs. the super charged auto complete driven by LLMs we have today.

layer8

0 replies

5h15m

2024-08-08 13:16:30 UTC

Tests don’t prove correctness of the code. What you’d really want instead is to specify invariants the code has to fulfill, and for the AI to come up with a machine-checkable proof that the code indeed guarantees those invariants.

yard2010

0 replies

8h29m

2024-08-08 10:02:43 UTC

Once you have enough data points, from current usage, and these days every company is tracking EVERYTHING even eye movement if they could, it's just a matter of time. I do agree though that before we reach an AGI we have these agents who are really good in a defined mission (like code completion).

It's not even about LLMs IMHO. It's about letting a computer crunch many numbers and find a pattern in the results, in a quasi religious manner.

pilooch

0 replies

7h26m

2024-08-08 11:06:06 UTC

Yes, same for maths. As long as a true reward 'surface' can be optimized. Approximate rewards are similar to approximate and non admissible heuristics,search eventually misses true optimal states and favors wrong ones, with side effects in very large state spaces.

lewtun

0 replies

5h40m

2024-08-08 12:51:44 UTC

I expect language models to also get crazy good at mathematical theorem proving

Indeed, systems like AlphaProof / AlphaGeometry are already able to win a silver medal at the IMO, and the former relies on Lean for theorem verification [1]. On the open source side, I really like the ideas in LeanDojo [2], which use a form of RAG to assist the LLM with premise selection.

[1] https://deepmind.google/discover/blog/ai-solves-imo-problems...

[2] https://leandojo.org/

anshumankmr

0 replies

7h34m

2024-08-08 10:57:35 UTC

Unless if it takes maximizing code coverage as the objective and starts deleting failed test cases.

FooBarBizBazz

0 replies

7h31m

2024-08-08 11:01:04 UTC

Coding AI can write tests, write code, compile, examine failed test cases, search for different coding solutions that satisfy more test cases or rewrite the tests, all in an unsupervised loop. And then whole process can turn into training data for future AI coding models.

This is interesting, but doesn't it still need supervision? Why wouldn't it generate tests for properties you don't want? It seems to me that it might be able to "fill in gaps" by generalizing from "typical software", like, if you wrote a container class, it might guess that "empty" and "size" and "insert" are supposed to be related in a certain way, based on the fact that other peoples' container classes satisfy those properties. And if you look at the tests it makes up and go, "yeah, I want that property" or not, then you can steer what it's doing, or it can at least force you to think about more cases. But there would still be supervision.

Ah -- here's an unsupervised thing: Performance. Maybe it can guide a sequence of program transformations in a profile-guided feedback loop. Then you could really train the thing to make fast code. You'd pass "-O99" to gcc, and it'd spin up a GPU cluster on AWS.

CuriouslyC

0 replies

6h43m

2024-08-08 11:48:33 UTC

Models aren't going to get really good at theorem proving until we build models that are transitive and handle isomorphisms more elegantly. Right now models can't recall factual relationships well in reverse order in many cases, and often fail to answer questions that they can answer easily in English when prompted to respond with the fact in another language.

__loam

13 replies

10h15m

2024-08-08 08:16:42 UTC

Been shouting this for over a year now. We're training AI to be convincing, not to be actually helpful. We're sampling the wrong distributions.

danielbln

7 replies

10h11m

2024-08-08 08:21:07 UTC

I find them very helpful, personally.

sussmannbaka

6 replies

9h46m

2024-08-08 08:45:41 UTC

Understandable, they have been trained to convince you of their helpfulness.

danielbln

3 replies

9h39m

2024-08-08 08:52:21 UTC

If they convinced me of their helpfulness, and their output is actually helpful in solving my problems.. well, if it walks like a duck and quacks like a duck, and all that.

Nullabillity

1 replies

7h36m

2024-08-08 10:55:55 UTC

"Appears helpful" and "is helpful" are two very different properties, as it turns out.

snapcaster

0 replies

45m

2024-08-08 17:46:23 UTC

Sometimes, but that's an edge case that doesn't seem to impact the productivity boosts from LLMs

tpoacher

0 replies

9h2m

2024-08-08 09:30:05 UTC

if it walks like a duck and it quacks like a duck, then it lacks strong typing.

exe34

0 replies

8h53m

2024-08-08 09:38:55 UTC

https://xkcd.com/810/

djeastm

0 replies

5h47m

2024-08-08 12:44:59 UTC

This is true, but part of that convincing is actually providing at least some amount of response that is helpful and moving you forward.

I have to use coding as an example, because that's 95% of my use cases. I type in a general statement of the problem I'm having and within seconds, I get back a response that speaks my language and provides me with some information to ingest.

Now, I don't know for sure if everything sentence I read in the response is correct, but let's say that 75% of what I read aligns with what I currently know to be true. If I were to ask a real expert, I'd possibly understand or already know 75% of what they're telling me, as well, with the other 25% still to be understood and thus trusting the expert.

But either with AI or a real expert, for coding at least, that 25% will be easily testable. I go and implement and see if it passes my test. If it does, great. If not, at least I have tried something and gotten farther down the road in my problem solving.

Since AI generally does that for me, I am convinced of their helpfulness because it moves me along.

dgb23

1 replies

10h7m

2024-08-08 08:24:34 UTC

Depends on who you ask.

Advertisement and propaganda is not necessarily helpful for consumers, but just needs to be convincing in order to be helpful for producers.

khafra

0 replies

9h32m

2024-08-08 08:59:49 UTC

It would be interesting to see RL on a chatbot that's the last stage of a sales funnel for some high-volume item--it'd have fast, real-world feedback on how convincing it is, in the form of a purchase decision.

tpoacher

0 replies

9h3m

2024-08-08 09:29:05 UTC

iamatoool

0 replies

10h5m

2024-08-08 08:26:40 UTC

Sideways eye look at leetcode culture

HarHarVeryFunny

0 replies

2h37m

2024-08-08 15:54:42 UTC

If what you want is auto-complete (e.g. CoPilot, or natural language search) then LLMs are built for that, and useful.

If what you want it AGI then design an architecture with the necessary moving parts! Current approach reminds of the joke of the drunk looking for his dropped cars keys under the street lamp because "it's bright here", rather than near where he actually dropped them. It seems folk have spent years trying to come up with alternate learning mechanisms to gradient descent (or RL), and having failed are now trying to use SGD/pre-training for AGI "because it's what we've got", as opposed to doing the hard work of designing the type of always-on online learning algorithm that AGI actually requires.

timthelion

12 replies

10h12m

2024-08-08 08:20:08 UTC

Karpathy writes that there is no cheeply computed objective check for "Or re-writing some Java code to Python? " Among other things. But it seems to me that Reinforced Learning should be possible for code translation using automated integration testing. Run it, see if it does,the same thing!

fulafel

3 replies

9h45m

2024-08-08 08:47:14 UTC

“Programs must be written for people to read, and only incidentally for machines to execute.” — Harold Abelson

exe34

2 replies

9h5m

2024-08-08 09:26:26 UTC

"programs written by LLMs must run correctly and only incidentally be human readable." - Me.

alex_suzuki

1 replies

8h4m

2024-08-08 10:27:23 UTC

“WTF?!” - engineer who has to troubleshoot said programs.

exe34

0 replies

6h39m

2024-08-08 11:52:25 UTC

"given the updated input and output pairs below, generate code that would solve the problem."

IanCal

3 replies

9h49m

2024-08-08 08:42:40 UTC

Even there how you score it is hard.

"Is it the same for this s y of inputs?" May be fine for a subset of things, but then that's a binary thing. If it's slightly wrong do you score by number of outputs that match? A purely binary thing gives little useful help for nudging a model in the right direction. How do you compare two that both work, which is more "idiomatic"?

exe34

1 replies

9h7m

2024-08-08 09:25:05 UTC

for "does it run" cases, you can ask the model to try again, give it higher temperature, show it the traceback errors, (and maybe intermediate variables?) or even ask it to break up the problem into smaller pieces and then try to translate that.

for testing, if you use something like quickcheck, you might find bugs that you wouldn't otherwise find.

when it comes to idiomatic, I'm not sure - but if we're at the point that gpt is writing code that works, do we really care? as long as this code is split into many small pieces, we can just replace the piece instead of trying to understand/fix it if we can't read it. in fact, maybe there's a better language that is human readable but better for transformers to write and maintain.

IanCal

0 replies

7h6m

2024-08-08 11:25:58 UTC

For "does it run" I'm not talking about how do we test that it does, but how do we either score or compare two+ options?

when it comes to idiomatic, I'm not sure - but if we're at the point that gpt is writing code that works, do we really care?

Yes - it's certainly preferable. You may prefer working over neat, but working and neat over working but insane spaghetti code.

Remember this is about training the models, not about using them later. How do we tell, while training, which option was better to push it towards good results?

theqwxas

0 replies

9h32m

2024-08-08 08:59:39 UTC

I agree that it's a very difficult problem. I'd like to mention AlphaDev [0], an RL algorithm that builds other algorithms, there they combined the measure of correctness and a measure of algorithm speed (latency) to get the reward. But the algorithms they built were super small (e.g., sorting just three numbers), therefore they could measure correctness using all input combinations. It is still unclear how to scale this to larger problems.

[0] https://deepmind.google/discover/blog/alphadev-discovers-fas...

msoad

1 replies

8h56m

2024-08-08 09:35:38 UTC

A program like

    function add(a,b) {
      return 4
    }

Passes the test

falcor84

0 replies

7h53m

2024-08-08 10:38:34 UTC

I suppose you're alluding to xkcd's joke about this [0], which is indeed a good one, but what test does this actually pass?

The approach I was thinking of is that assuming we start with the Java program:

    public class Addition {
        public static int add(int a, int b) {
            return a + b;
        }
    }

We can semi-automatically generate a basic test runner with something like this, generating some example inputs automatically:

    public class Addition {
        public static int add(int a, int b) {
            return a + b;
        }

        public static class AdditionAssert {
            private int a;
            private int b;

            public AdditionAssert a(int a) {
                this.a = a;
                return this;
            }

            public AdditionAssert b(int b) {
                this.b = b;
                return this;
            }

            public void assertExpected(int expected) {
                int result = add(a, b);
                assert result == expected : "Expected " + expected + " but got " + result;
                System.out.println("Assertion passed for " + a + " + " + b + " = " + result);
            }
        }

        public static void main(String[] args) {
            new AdditionAssert().a(5).b(3).assertExpected(8);
            new AdditionAssert().a(-1).b(4).assertExpected(3);
            new AdditionAssert().a(0).b(0).assertExpected(0);

            System.out.println("All test cases passed.");
        }
    }

Another bit of automated preparation would then automatically translate the test cases to Python, and then the actual LLM would need to generate a python function until it passes all the translated test cases:

    def add(a, b):
        return 4

    def addition_assert(a, b, expected):
        result = add(a, b)
        assert result == expected, f"Expected {expected} but got {result}"

    addition_assert(a=5, b=3, expected=8)
    addition_assert(a=-1, b=4, expected=3)
    addition_assert(a=0, b=0, expected=0)

It might not be perfect, but I think it's very feasible and can get us close to there.

[0] https://xkcd.com/221/

jeroenvlek

0 replies

10h3m

2024-08-08 08:28:26 UTC

My takeaway is that it's difficult to make a "generic enough" evaluation that encompasses all things we use an LLM for, e.g. code, summaries, jokes. Something with free lunches.

WithinReason

0 replies

7h23m

2024-08-08 11:08:34 UTC

yes but that's not cheaply computed. You need good test coverage, etc.

mjburgess

6 replies

9h42m

2024-08-08 08:50:03 UTC

It always annoys and amazes me that people in this field have no basic understanding that closed-world finite-information abstract games are a unique and trivial problem. So much of the so-called "world model" ideological mumbojumbo comes from these setups.

Sampling board state from an abstract board space isn't a statistical inference problem. There's no missing information.

The whole edifice of science is a set of experimental and inferential practices to overcome the massive information gap between the state of a measuring device and the state of what, we believe, it measures.

In the case of natural language the gap between a sequence of symbols, "the war in ukraine" and those aspects of the world these symbols refer to is enormous.

The idea that there is even a RL-style "reward" function to describe this gap is pseudoscience. As is the false equivocation between sampling of abstracta such as games, and measuring the world.

pyrale

1 replies

8h59m

2024-08-08 09:32:29 UTC

[...] and trivial problem.

It just took decades and impressive breakthroughs to solve, I wouldn't really call it "trivial". However, I do agree with you that they're a class of problem different from problems with no clear objective function, and probably much easier to reason about that.

mjburgess

0 replies

8h52m

2024-08-08 09:39:57 UTC

They're a trivial inference problem, not a trivial problem to solve as such.

As in, if i need to infer the radius of a circle from N points sampled from that cirlce.. yes, I'm sure there's a textbook of algorithms/etc. with a lot of work spent on them.

But in the sense of statistical inference, you're only learning a property of a distribution given that distribution.. there isn't any inferential gap. As N->inf, you recover the entire circle itself.

compare with say, learning the 3d structure of an object from 2d photographs. At any rotation of that object, you have a new pixel distribution. So in pixel-space a 3d object is an infinite number of distributions; and the inference goal in pixel-space is to choose between sets of these infinities.

That's actually impossible without bridging information (ie., some theory). And in practice, it isn't solved in pixel space... you suppose some 3d geometry and use data to refine it. So you solve it in 3d-object-property-space.

With AI techniques, you have ones which work on abstracta (eg., circles) being used on measurement data. So you're solving the 3d/2d problem in pixel space, expecting this to work because "objects are made out of pixels, arent they?" NO.

So there's a huge inferential gap that you cannot bridge here. And the young AI fantatics in research keep milling out papers showing that it does work, so long as its a cirlce, chess, or some abstract game.

harshitaneja

1 replies

9h23m

2024-08-08 09:08:31 UTC

Forgive my naiveté here but even though solutions to those finite-information abstract games are trivial but not necessarily tractable(for a loser definition of tractable here) and we still need to build heuristics for the subclass of such problems where we need solutions in a given finite time frame. Those heuristics might not be easy to deduce and hence such models help in ascertaining those.

mjburgess

0 replies

9h6m

2024-08-08 09:25:21 UTC

Yes, and this is how computer "scientists" think of problems -- but this isnt science, it's mathematics.

If you have a process, eg., points = sample(circle) which fully describes its target as n->inf (ie., points = circle as n->inf) you arent engaged in statistical infernece. You might be using some of the same formula, but the whole system of science and statistics has been created for a radically different problem with radically different semantics to everything you're doing.

eg., the height of mercury in a thermometer never becomes the liquid being measured.. it might seems insane/weird/obvious to mention this... but we literally have berkelian-style neoidealists in AI research who don't realise this...

Who think that because you can find representations of abstracta in other spaces they can be projected in.. that this therefore tells you anything at all about inference problems. As if it was the neural network algorithm itself (a series of multiplications and additions) that "revealed the truth" in all data given to it. This, of course, is pseudoscience.

It only applies on mathematical problems, for obvious reasons. If you use a function approximation alg to approximate a function, do not be suprised you can succeed. The issue is that the relationship between, say, the state of a theremometer and the state of the temperature of it's target system is not an abstract function which lives in the space of temperature readings.

More precisely, in the space of temperature readings the actual causal relationship between the height of the mecurary and the temperature of the target shows up as an infinite number of temperature distributions (with any given trained NN learning only one of these). None of which are a law of nature -- laws of nature are not given by distributions in measuring devices.

meroes

0 replies

3h15m

2024-08-08 15:16:57 UTC

Yes. Quantum mechanics for example is not something that could have been thought of even conceptually by anything “locked in a room”. Logically coherent structure space is so mind bogglingly big we will never come close to even the smallest fraction of it. Science recognizes that only experiments will bring structures like QM out of the infinite sea into our conceptual space. And as a byproduct of how experiments work, the concepts will match (model) the actual world fairly well. The armchair is quite limiting, and I don’t see how LLMs aren’t locked to it.

AGI won’t come from this set of tools. Sam Altman just wants to buy himself a few years of time to find their next product.

DAGdug

0 replies

20m

2024-08-08 18:12:03 UTC

Who doesn’t? Karpathy, and a pretty much every researcher at OpenAI/Deepmind/FAIR absolutely knows the trivial concept of fully observable versus partially observable environments, which is 101 reinforcement learning.

cs702

5 replies

7h51m

2024-08-08 10:40:39 UTC

Indeed. The reward function we're using in RLHF today induces AI models to behave in ways that superficially seem better to human beings on average, but what we actually want is to induce them to solve cognitive tasks, with human priorities.

The multi-trillion dollar question is: What is the objective reward that would induce AI models like LLMs to behave like AGI -- while adhering to all the limits we human beings wish to impose in AGI behavior?

I don't think anyone has even a faint clue of the answer yet.

Xcelerate

3 replies

6h4m

2024-08-08 12:27:20 UTC

The multi-trillion dollar question is: What is the objective reward that would induce AI models like LLMs to behave like AGI

No, the reward for finding the right objective function is a good future for all of humanity, given that we already have an algorithm for AGI.

The objective function to acquire trillions of dollars is trivial: it’s the same minimization of cross-entropy that we already use for sequence prediction. What’s missing is a better algorithm, which is probably a good thing at the moment, because otherwise someone could trivially drain all value from the stock market.

cs702

2 replies

5h56m

2024-08-08 12:35:34 UTC

You misunderstand.

The phrase "the multi-trillion dollar question" has nothing to do with acquiring trillions of dollars.

The phrase is idiomatic, indicating a crucial question, like "the $64,000 question," but implying much bigger stakes.[a]

---

[a] https://idioms.thefreedictionary.com/The+%2464%2c000+Questio...

Xcelerate

1 replies

5h53m

2024-08-08 12:39:10 UTC

Ah, I see. Apologies.

cs702

0 replies

5h52m

2024-08-08 12:40:11 UTC

Thank you.

By the way, I agree with you that "a good future for all of humanity" would be a fantastic goal :-)

The multi-trillion dollar question is: How do you specify that goal as an objective function?

HarHarVeryFunny

0 replies

2h49m

2024-08-08 15:43:08 UTC

You can't just take an arbitrary neural network architecture, and make it do anything by giving it an appropriate loss function, and in particular you can't take a simple feed forward model like a Transformer and train it to be something other than a feed forward model... If the model architecture doesn't have feedback paths (looping) or memory that persists from one input to the next, then no reward function is going to make it magically sprout those architectural modifications!

Today's Transformer-based LLMs are just what the name says - (Large) Language Models - fancy auto-complete engines. They are not a full blown cognitive architecture.

I think many people do have a good idea how to build cognitive architectures, and what the missing parts are that are needed for AGI, and some people are working on that, but for now all the money and news cycles are going into LLMs. As Chollet says, they have sucked all the oxygen out of the room.

Xcelerate

5 replies

8h23m

2024-08-08 10:09:19 UTC

One thing I’ve wondered about is what the “gap” between current transformer-based LLMs and optimal sequence prediction looks like.

To clarify, current LLMs (without RLHF, etc.) have a very straightforward objective function during training, which is to minimize the cross-entropy of token prediction on the training data. If we assume that our training data is sampled from a population generated via a finite computable model, then Solomonoff induction achieves optimal sequence prediction.

Assuming we had an oracle that could perform SI (since it’s uncomputable), how different would conversations between GPT4 and SI be, given the same training data?

We know there would be at least a few notable differences. For example, we could give SI the first 100 digits of pi, and it would give us as many more digits as we wanted. Current transformer models cannot (directly) do this. We could also give SI a hash and ask for a string that hashes to that value. Clearly a lot of hard, formally-specified problems could be solved this way.

But how different would SI and GPT4 appear in response to everyday chit-chat? What if we ask the SI-based sequence predictor how to cure cancer? Is the “most probable” answer to that question, given its internet-scraped training data, an answer that humans find satisfying? Probably not, which is why AGI requires something beyond just optimal sequence prediction. It requires a really good objective function.

My first inclination for this human-oriented objective function is something like “maximize the probability of providing an answer that the user of the model finds satisfying”. But there is more than one user, so over which set of humans do we consider and with which aggregation (avg satisfaction, p99 satisfaction, etc.)?

So then I’m inclined to frame the problem in terms of well-being: “maximize aggregate human happiness over all time” or “minimize the maximum of human suffering over all time”. But each of these objective functions has notable flaws.

Karpathy seems to be hinting toward this in his post, but the selection of an overall optimal objective function for human purposes seems to be an incredibly difficult philosophical problem. There is no objective function I can think of for which I cannot also immediately think of flaws with it.

bick_nyers

2 replies

6h46m

2024-08-08 11:45:33 UTC

Alternatively, you include information about the user of the model as part of the context to the inference query, so that the model can uniquely optimize its answer for that user.

Imagine if you could give a model "how you think" and your knowledge, experiences, and values as context, then it's "Explain Like I'm 5" on steroids. Both exciting and terrifying at the same time.

Xcelerate

1 replies

6h13m

2024-08-08 12:18:48 UTC

Alternatively, you include information about the user of the model as part of the context to the inference query

That was sort of implicit in my first suggestion for an objective function, but do you really want the model to be optimal on a per-user basis? There’s a lot of bad people out there. That’s why I switched to an objective function that considers all of humanity’s needs together as a whole.

bick_nyers

0 replies

17m

2024-08-08 18:15:19 UTC

Objective Function: Optimize on a per-user basis. Constraints: Output generated must be considered legal in user's country.

Both things can co-exist without being in conflict of each other.

My (hot) take is I personally don't believe that any LLM that can fit on a single GPU is capable of significant harm. An LLM that fits on an 8xH100 system perhaps, but I am more concerned about other ways an individual could spend ~$300k with a conviction of harming others. Besides, looking up how to make napalm on Google and then actually doing it and using it to harm others doesn't make Google the one responsible imo.

marcosdumay

1 replies

5h22m

2024-08-08 13:10:05 UTC

What if we ask the SI-based sequence predictor how to cure cancer? Is the “most probable” answer to that question, given its internet-scraped training data, an answer that humans find satisfying?

You defined your predictor as being able to minimize mathematical definitions following some unspecified algebra, why didn't you define it being able to run chemical and pharmacological simulations through some unspecified model too?

Xcelerate

0 replies

4h36m

2024-08-08 13:55:38 UTC

I don’t follow—what do you mean by unspecified algebra? Solomonoff induction is well-defined. I’m just asking how the responses of a chatbot using Solomonoff induction for sequence prediction would differ from those using a transformer model, given the same training data. I can specify mathematically if that makes it clearer…

leobg

4 replies

10h3m

2024-08-08 08:28:34 UTC

A cheap DIY way of achieving the same thing as RLHF is to fine tune the model to append a score to its output every time.

Remember: The reason we need RLHF at all is that we cannot write a loss function for what makes a good answer. There are just many ways a good answer could look like, which cannot be calculated on the basis of next-token-probability.

So you start by having your vanilla model generate n completions for your prompt. You the. manually score them. And then those prompt => (completion,score) pairs become your training set.

Once the model is trained, you may find that you can cheat:

Because if you include the desired score in your prompt, the model will now strive to produce an answer that is consistent with that score.

visarga

0 replies

3h1m

2024-08-08 15:31:14 UTC

if you include the desired score in your prompt, the model will now strive to produce an answer that is consistent with that score

But you need a model to generate score from answer, and then fine-tune another model to generate answer conditioned on score. The first time the score is at the end and the second time at the beginning. It's how DecisionTransformer works too, it constructs a sequence of (reward, state, action) where reward conditions on the next action.

https://arxiv.org/pdf/2106.01345

By the same logic you could generate tags, including style, author, venue and date. Some will be extracted from the source document, the others produced with classifiers. Then you can flip the order and finetune a model that takes the tags before the answer. Then you got a LLM you can condition on author and style.

viraptor

0 replies

9h31m

2024-08-08 09:00:53 UTC

That works in the same way as actor-critic pair, right? Just all wrapped in the same network/output?

lossolo

0 replies

5h4m

2024-08-08 13:27:21 UTC

Not the same, it will get you worse output and is harder to do right in practice.

bick_nyers

0 replies

6h56m

2024-08-08 11:35:25 UTC

I had an idea similar to this for a model that allows you to parameterize a performance vs. accuracy ratio, essentially an imbalanced MoE-like approach where instead of the "quality score" in your example, you assign a score based on how much computation it used to achieve that answer, then you can dynamically request different code paths be taken at inference time.

cesaref

4 replies

9h29m

2024-08-08 09:02:31 UTC

The final conclusion though stands without any justification - that LLM + RL will somehow out-perform people at open-domain problem solving seems quite a jump to me.

dosinga

2 replies

9h20m

2024-08-08 09:11:27 UTC

To be fair, it says "has a real shot at" and AlphaGo level. AlphaGo clearly beat humans on Go, so thinking that if you could replicate that, it would have a shot doesn't seem crazy to me

SiempreViernes

0 replies

8h14m

2024-08-08 10:17:52 UTC

That only makes sense if you think Go is as expressive as written language.

And here I mean that it the act of making a single (plausible) move that must match the expressiveness of language, because otherwise you're not in the domain of Go but the far less interesting "I have a 19x19 pixel grid and two colours".

HarHarVeryFunny

0 replies

3h9m

2024-08-08 15:23:08 UTC

AlphaGo has got nothing to do with LLMs though. It's a combination of RL + MCTS. I'm not sure where you are seeing any relevance! DeepMind also used RL for playing video games - so what?!

esjeon

0 replies

9h3m

2024-08-08 09:29:13 UTC

I think the point is that it's practically impossible to correctly perform RLHF in open domains, so comparisons simply can't happen.

rossdavidh

3 replies

4h35m

2024-08-08 13:57:11 UTC

The problem of various ML algorithms "gaming" the reward function, is rather similar to the problem of various financial and economic issues. If people are not trying to do something useful, and then expecting $$ in return for that, but rather are just trying to get $$ without knowing or caring what is productive, then you get a lot of non-productive stuff (spam, scams, pyramid schemes, high-frequency trading, etc.) that isn't actually producing anything, but does take over a larger and larger percentage of the economy.

To mitigate this, you have to have a system outside of that, which penalizes "gaming" the reward function. This system has to have some idea of what real value is, to be able to spot cases where the reward function is high but the value is low. We have a hard enough time of this in the money economy, where we've been learning for centuries. I do not think we are anywhere close in neural networks.

csours

1 replies

3h19m

2024-08-08 15:12:23 UTC

Commenting to follow this.

There is a step like this in ML. I think it's pretty interesting that topics from things like economics pop up in ML - although perhaps it's not too surprising as we are doing ML for humans to use.

layer8

0 replies

21m

2024-08-08 18:10:39 UTC

Commenting to follow this.

You can “favorite” comments on HN to bookmark them.

bob1029

0 replies

1h36m

2024-08-08 16:56:04 UTC

This system has to have some idea of what real value is

This is probably the most cursed problem ever.

Assume you could develop such a system, why wouldn't you just incorporate its logic into the original fitness function and be done with it?

I think the answer is that such a system can probably never be developed. At some level humans must be involved in order to adapt the function over time in order to meet expectations as training progresses.

The information used to train on is beyond critical, but heuristics regarding what information matters more than other information in a given context might be even more important.

jrflowers

3 replies

10h1m

2024-08-08 08:30:20 UTC

I enjoyed this Karpathy post about how there is absolutely no extant solution to training language models to reliably solve open ended problems.

I preferred Zitron’s point* that we would need to invent several branches of science to solve this problem, but it’s good to see the point made tweet-sized.

*https://www.wheresyoured.at/to-serve-altman/

klibertp

1 replies

33m

2024-08-08 17:58:58 UTC

I read the article you linked. I feel like I wasted my time.

The article has a single point it repeats over and over again: OpenAI (and "generative AI as a whole"/"transformer-based models") are too expensive to run, and it's "close to impossible" for them to either limit costs or increase revenue. This is because "only 5% of businesses report using the technology in production", and that the technology had no impact on "productivity growth". It's also because "there's no intelligence in it", and the "models can't reason". Oh, also, ChatGPT is "hard to explain to a layman".

All that is liberally sprinkled with "I don't know, but"s and absolutely devoid of any historical context other than in financial terms. No technical details. Just some guesses and an ironclad belief that it's impossible to improve GPTs without accessing more data than there is in existence. Agree or disagree; the article is not worth wading through so many words: others made arguments on both sides much better and, crucially, shorter.

jrflowers

0 replies

2024-08-08 18:26:28 UTC

The article has a single point it repeats over and over again: [7 distinct points]

I don’t think have a single overall thesis is the same thing as repeating oneself. For example “models can’t reason” has nothing at all to do with cost.

moffkalast

0 replies

5h33m

2024-08-08 12:58:26 UTC

That's a great writeup, and great references too.

OpenAI needs at least $5 billion in new capital a year to survive. This would require it to raise more money than has ever been raised by any startup in history

They were probably toast before, but after Zuck decided to take it personally and made free alternatives for most use cases they definitely are, since if they had any notable revenue from selling API access it will just keep dropping.

codeflo

3 replies

10h23m

2024-08-08 08:08:36 UTC

Reminder that AlphaGo and its successors have not solved Go and that reinforcement learning still sucks when encountering out-of-distribution strategies:

https://arstechnica.com/information-technology/2023/02/man-b...

viraptor

1 replies

9h40m

2024-08-08 08:51:27 UTC

I wouldn't say it sucks. You just need to keep training it for as long as needed. You can do adversarial techniques to generate new paths. You can also use the winning human strategies to further improve. Hopefully we'll find better approaches, but this is extremely successful and far from sucking.

Sure, Go is not solved yet. But RL is just fine continuing to that asymptote for as long as we want.

The funny part is that this applies to people too. Masters don't like to play low ranked people because they're unpredictable and the ELO loss for them is not worth the risk. (Which does rise questions about how we really rank people)

shakna

0 replies

8h19m

2024-08-08 10:12:23 UTC

I wouldn't say it sucks. You just need to keep training it for as long as needed.

As that timeline can approach infinity, just adding extra training may not actually be a sufficient compromise.

thanhdotr

0 replies

10h1m

2024-08-08 08:30:35 UTC

well, as yann lecun said :

"Adversarial training, RLHF, and input-space contrastive methods have limited performance. Why? Because input spaces are BIG. There are just too many ways to be wrong" [1]

A way to solve the problem is projecting onto latent space and then try and discriminate/predict the best action down there. There's much less feature correlation down in latent space than in your observation space. [2]

[1]:https://x.com/ylecun/status/1803696298068971992 [2]: https://openreview.net/pdf?id=BZ5a1r-kVsf

voiceblue

2 replies

4h25m

2024-08-08 14:06:28 UTC

Except this LLM would have a real shot of beating humans in open-domain problem solving.

At some point we need to start recognizing LLMs for what they are and stop making outlandish claims like this. A moment of reflection ought to reveal that “open domain problem solving” is not what an LLM does.

An LLM, could not, for example, definitively come up with the three laws of planetary motion like Kepler did (he looked at the data), in the absence of a prior formulation of these laws in the training set.

TFA describes a need for scoring, at scale, qualitative results to human queries. Certainly that’s important (it’s what Google is built upon), but we don’t need to make outlandish claims about LLM capabilities to achieve it.

Or maybe we do if our next round of funding depends upon it.

visarga

0 replies

3h9m

2024-08-08 15:22:27 UTC

An LLM, could not, for example, definitively come up with the three laws of planetary motion like Kepler did (he looked at the data)

You could use Symbolic Regression instead, and the LLM will write the code. Under the hood it would use a genetic programming library like with SymbolicRegressor.

Found a reference:

AI-Descartes, an AI scientist developed by researchers at IBM Research, Samsung AI, and the University of Maryland, Baltimore County, has reproduced key parts of Nobel Prize-winning work, including Langmuir’s gas behavior equations and Kepler’s third law of planetary motion. Supported by the Defense Advanced Research Projects Agency (DARPA), the AI system utilizes symbolic regression to find equations fitting data, and its most distinctive feature is its logical reasoning ability. This enables AI-Descartes to determine which equations best fit with background scientific theory. The system is particularly effective with noisy, real-world data and small data sets. The team is working on creating new datasets and training computers to read scientific papers and construct background theories to refine and expand the system’s capabilities.

https://scitechdaily.com/ai-descartes-a-scientific-renaissan...

textlapse

0 replies

3h58m

2024-08-08 14:34:17 UTC

As a function of energy, it’s provably impossible for a next word predictor with a constant energy per token to come up with anything that’s not in its training. (I think Yann LeCun came up with this?)

It seems to me RL was quite revolutionary (especially with protein folding/AlphaGo) - but using a minimal form of it to solve a training (not prediction) problem seems rather like bringing a bazooka to a banana fight.

Using explore/exploit methods to search potential problem spaces might really help propel this space forward. But the energy requirements do not favor the incumbents as things are now scaled to the current classic LLM format.

rocqua

2 replies

10h20m

2024-08-08 08:11:50 UTC

Alphago didn't have human feedback, but it did learn from humans before surpassing them. Specifically, it had a network to 'suggest good moves' that was trained on predicting moves from pro level human games.

The entire point of alpha zero was to eliminate this human influence, and go with pure reinforcement learning (i.e. zero human influence).

esjeon

0 replies

8h56m

2024-08-08 09:35:49 UTC

AlphaGo is an optimization over a closed problem. Theoretically, computers could have always beat human in such problems. It's just that, without proper optimization, humans will die before the computer finishes its computation. Here, AlphaGo cuts down the computation time by smartly choosing the branches with the highest likelihood.

Unlike the above, open problems can't be solve by computing (in combinatorial senses). Even humans can only try, and LLMs do spew out something that would most likely work, not something inherently correct.

cherryteastain

0 replies

9h56m

2024-08-08 08:36:01 UTC

A game like Go has a clearly defined objective (win the game or not). A network like you described can therefore be trained to give a score to each move. Point here is that assessing whether a given sentence sounds good to humans or not does not have a clearly defined objective, the only way we came up with so far is to ask real humans.

islewis

2 replies

2h15m

2024-08-08 16:17:14 UTC

Karpathy is _much_ more knowledgeable about this than I am, but I feel like this post is missing something.

Go is a game that is fundamentally too complex for humans to solve. We've known this since way back before AlphaGo. Since humans were not the perfect Go players, we didn't use them to teach a model- we wanted the model to be able to beat humans.

I dont see language being comparable. the "perfect" LLM imitates humans perfectly, presumably to the point where you can't tell the difference between LLM generated text, and human generated text. Maybe it's just as flexible as the human mind is too, and can context switch quickly, and can quickly swap between formalities, tones, and slangs. But the concept of "beating" a human doesn't really make much sense.

AlphaGo and Stockfish can push forward our understandings of their respective games, but an LLM cant push forwards our boundary of language. this is because it's fundamentally a copy-cat model. This makes RLHF make much more sense in the LLM realm than the Go realm.

will-burner

0 replies

24m

2024-08-08 18:08:19 UTC

This is a great comment. Another important distinction, I think, is that in the AlphaGo case there's no equivalent to the generalized predict next token pretraining that happens for LLMs (at least I don't think so, this is what I'm not sure of). For LLMs, RLHF teaches the model to be conversational, but the model has already learned language and how to talk like a human from the predict next token pretraining.

Miraste

0 replies

1h54m

2024-08-08 16:37:56 UTC

One of the problems lies in the way RLHF is often performed: presenting a human with several different responses and having them choose one. The goal here is to create the most human-like output, but the process is instead creating outputs humans like the most, which can seriously limit the model. For example, most recent diffusion-based image generators use the same process to improve their outputs, relying on volunteers to select which outputs are preferable. This has lead to models that are comically incapable of generating ugly or average people, because the volunteers systematically rate those outputs lower.

bubblyworld

2 replies

5h44m

2024-08-08 12:47:49 UTC

The SPAG paper is an interesting example of true reinforcement learning using language models that improves their performance on a number of hard reasoning benchmarks. https://arxiv.org/abs/2404.10642

The part that is missing from Karpathy's rant is "at scale" (the researchers only ran 3 iterations of the algorithm on small language models) and in "open domains" (I could be wrong about this but IIRC they ran their games on a small number of common english words). But adversarial language games seem promising, at least.

textlapse

1 replies

3h53m

2024-08-08 14:38:39 UTC

That’s a cool paper - but it seems like it produces better debaters but not better content? To truly use RL’s strengths, it would be a battle of content (model or world representation) not mere token level battles.

I am not sure how that works at the prediction stage as language isn’t the problem here.

bubblyworld

0 replies

19m

2024-08-08 18:13:11 UTC

I think the hypothesis is that "debating" via the right adversarial word game may naturally select for better reasoning skills. There's some evidence for that in the paper, namely that it (monotonically!) improves the model's performance on seemingly unrelated reasoning stuff like the ARC dataset. Which is mysterious! But yeah, it's much too early to tell, although IIRC the results have been replicated already so that's something.

(by the way, I don't think "debating" is the right term for the SPAG game - it's quite subtle and isn't about arguing for a point, or rhetoric, or anything like that)

bjornsing

2 replies

9h24m

2024-08-08 09:07:24 UTC

Perhaps not entirely open domain, but I have high hopes for “real RL” in coding, where you can get a reward signal from compile/runtime errors and tests.

falcor84

1 replies

8h26m

2024-08-08 10:05:56 UTC

Interesting, has anyone been doing this? I.e. training/fine-tuning an LLM against an actual coding environment, as opposed to just tacking that later on as a separate "agentic" contruct?

bjornsing

0 replies

7h45m

2024-08-08 10:47:17 UTC

I suspect that the big vendors are already doing it, but I haven’t seen a paper on it.

tpoacher

1 replies

8h35m

2024-08-08 09:56:51 UTC

I get the point of the article, but I think it makes a bit of a strawman to drive the point across.

Yes, RLHF is barely RL, but you wouldn't use human feedback to drive a Go game unless there was no better alternative; and in RL, finding a good reward function is the name of the game; once you have that, you have no reason to prefer human feedback, especially if it is demonstrably worse. So, no, nobody would actually "prefer RLHF over RL" given the choice.

But for language models, human feedback IS the ground truth (at least until we find a better, more mathematical alternative). If it weren't and we had something better, then we'd use that. But we don't. So no, RLHF is not "worse than RL" in this case, because there 'is' no 'other' RL in this case; so, here, RLHF actually is RL.

SiempreViernes

0 replies

8h22m

2024-08-08 10:09:49 UTC

If you cut out humans, what's the point of the language? Just use a proper binary format, I hear protobuf is popular.

normie3000

1 replies

10h42m

2024-08-08 07:49:52 UTC

In machine learning, reinforcement learning from human feedback (RLHF) is a technique to align an intelligent agent to human preferences.

https://en.m.wikipedia.org/wiki/Reinforcement_learning_from_...

moffkalast

0 replies

6h42m

2024-08-08 11:50:03 UTC

Note that human preference isn't universal. RLHF is mostly frowned upon by the open source LLM community since it typically involves aligning the model to the preference of corporate manager humans, i.e. tuning for censorship and political correctness to make the model as bland as possible so the parent company doesn't get sued.

For actual reinforcement learning with a feedback loop that aims to increase overall performance the current techniques are SPPO and Meta's version of it [0] that slightly outperforms it. It involves using a larger LLM as a judge though, so the accuracy of the results is somewhat dubious.

[0] https://arxiv.org/pdf/2407.19594

nickpsecurity

1 replies

3h9m

2024-08-08 15:22:49 UTC

It sounds really negative about RLHF. Yet, if I read on them correctly, that’s a big part of how ChatGPT and Claude got so effective. There’s companies collecting quality, human responses to many prompts. Companies making models buy them. Even the synthetic examples come from models that largely extrapolate what humans wrote in their pre-training data.

So, I’m defaulting on RLHF is great in at least those ways until an alternative is empirically proven to be better. I also hope for larger, better, open-source collections of RLHF training data.

dontwearitout

0 replies

37m

2024-08-08 17:54:46 UTC

Claude notably does not use RLHF, but uses RLAIF, using a LLM to generate the preferences based a "constitution" instead of human preferences. It's remarkable that it can bootstrap itself up to such high quality. See https://arxiv.org/pdf/2212.08073 for more.

epups

1 replies

9h11m

2024-08-08 09:20:55 UTC

This is partially the reason why we see LLM's "plateauing" in the benchmarks. For the lmsys Arena, for example, LLM's are simply judged on whether the user liked the answer or not. Truth is a secondary part of that process, as are many other things that perhaps humans are not very good at evaluating. There is a limit to the capacity and value of having LLM's chase RLHF as a reward function. As Karpathy says here, we could even argue that it is counter productive to build a system based on human opinion, especially if we want the system to surpass us.

HarHarVeryFunny

0 replies

2h32m

2024-08-08 15:59:20 UTC

RLHF really isn't the problem as far as surpassing human capability - language models trained to mimic human responses are fundamentally not going to do anything other than mimic human responses, regardless of how you fine-tune them for the specific type of human responses you do or don't like.

If you want to exceed human intelligence, then design architectures for intelligence, not for copying humans!

daly

1 replies

9h36m

2024-08-08 08:56:00 UTC

I think that the field of proofs, such as LEAN, which have states (the current subgoal), actions (the applicable theorems, especially effective in LEAN due to strong Typing of arguments), a progress measure (simplified subgoals), a final goal state (the proof completes), and a hierarchy in the theorems so there is a "path metric" from simple theorems to complex theorems.

If Karpathy were to focus on automating LEAN proofs it could change mathematics forever.

jomohke

0 replies

8h40m

2024-08-08 09:51:22 UTC

Deepmind's recent model is trained with Lean. It scored a silver olympiad medal (and only one point away from gold).

AlphaProof is a system that trains itself to prove mathematical statements in the formal language Lean. It couples a pre-trained language model with the AlphaZero reinforcement learning algorithm, which previously taught itself how to master the games of chess, shogi and Go

https://deepmind.google/discover/blog/ai-solves-imo-problems...

bilsbie

1 replies

7h23m

2024-08-08 11:08:59 UTC

@mods. I submitted this exact link yesterday. Shouldn’t this post have shown it was already submitted?

https://news.ycombinator.com/item?id=41184948

defrost

0 replies

7h15m

2024-08-08 11:16:51 UTC

Maybe.

There might be logic that says an 'old' link (> 12 hours say) with no comments doesn't need to be cross linked to if submitted later (or other rule).

In any case, @mods and @dang do not work (save by chance) .. if you think it's worth bring to attention then there's generally no downside to simply emailing direct to hn # ycombinator dot com from your login email.

visarga

0 replies

2h53m

2024-08-08 15:38:57 UTC

I agree RLHF is not full RL, more like contextual bandits, because there is always just one single decision and no credit assignment difficulties. But there is one great thing about RLHF compared to supervised training: it updates the model on the whole sequence instead of only the next token. This is fundamentally different from pre-training, where the model learns to be myopic and doesn't learn to address the "big picture".

So there are 3 levels of optimization in discussion here:

1. for the next token (NTP)

2. for a single turn response (RLHF)

3. for actual task completion or long-term objectives (RL)

toxik

0 replies

9h39m

2024-08-08 08:53:03 UTC

Another little tidbit about RLHF and InstructGPT is that the training scheme is by far dominated by supervised learning. There is a bit of RL sprinkled on top, but the term is scaled down by a lot and 8x more compute time is spent on the supervised loss terms.

taeric

0 replies

2h11m

2024-08-08 16:20:50 UTC

I thought the entire point of the human/expert feedback was in domains where you can not exhaustively search the depth of the space? Yes, if you can go deeper in the search space, you should do so. Regardless of how bad the score is at the current spot. You only branch to other options when it is exhausted.

And if you don't have a way to say that something could be exhausted, then you will look for heuristics to choose more profitable places to search. Hence the HF added.

pyrale

0 replies

9h19m

2024-08-08 09:13:09 UTC

It's a bit disingenuous to pick go as a case to make the point against RLHF.

Sure, a board game with an objective winning function at which computers are already better than humans won't get much from RLHF. That doesn't look like a big surprise.

On the other hand, a LLM trained on lots of not-so-much curated data will naturally pick up mistakes from that dataset. It is not really feasible or beneficial to modify the dataset exhaustively, so you reinforce the behaviour that is expected at the end. An example would be training an AI in a specific field of work: it could repeat advices from amateurs on forums, when less-known professional techniques would be more advisable.

Think about it like kids naturally learning swear words at school, and RLHF like parents that tell their kids that these words are inappropriate.

The tweet conclusion seems to acknowledge that, but in a wishful way that doesn't want to concede the point.

nothrowaways

0 replies

4h21m

2024-08-08 14:11:12 UTC

Alpha go is not a good example in this case.

NalNezumi

0 replies

7h1m

2024-08-08 11:30:50 UTC

While I agree to Karpathy and I also had a "wut? They call this RL? " reaction when RLHF was presented as an method of CHATGPT training, I'm a bit surprised by the insight he makes because this same method and insight have been gathered from "Learning from Human preference" [1] from none other than openAI, published in 2017.

Sometimes judging a "good enough" policy is order of magnitudes more easier than formulating an exact reward function, but this is pretty much domain and scope dependent. Trying to estimate a reward function in those situations, can often be counter productive because the reward might even screw up your search direction. This observation was also made by the authors (researchers) of the book "Myth of the objective"[2] with their picbreeder example. (the authors so happens to also work for OpenAI.)

When you have a well defined reward function with no local suboptima and no cost in rolling out faulty policies RL work remarkably well. (Alex Ipran described this well in his widely cited blog [3])

Problem is that this is pretty hard requirements to have for most problems that interact with the real world (and not internet, the artificial world). It's either the suboptima that is in the way (LLM and text), or rollout cost (running GO game a billion times to just beat humans, is currently not a feasible requirement for a lot of real world applications)

Tangentially, this is also why I suspect LLM for planning (and understanding the world) in the real world have been lacking. Robot Transformer and SayCan approaches are cool but if you look past the fancy demos it is indeed a lackluster performance.

It will be interesting to see how these observations and Karpathys observations will be tested with the current humanoid robot hype, which imo is partially fueled by a misunderstanding of LLMs capacity including what Karpathy mentioned. (shameless plug: [4])

[1] https://openai.com/index/learning-from-human-preferences/

[2] https://www.lesswrong.com/posts/pi4owuC7Rdab7uWWR/book-revie...

[3] https://www.alexirpan.com/2018/02/14/rl-hard.html

[4] https://harimus.github.io//2024/05/31/motortask.html

Kim_Bruning

0 replies

7h36m

2024-08-08 10:55:53 UTC

Ceterum censeo install AI in cute humanoid robot.

Robot because physical world provides a lot of RL for free (+ and -).

Humanoid because known quantity.

Cute because people put up with cute a lot more, and lower expectations.