This is a bit off topic to the actual article, but I see a lot of top ranking comments complaining that ChatGPT has become lazy at coding. I wanted to make two observations:
1. Yes, GPT-4 Turbo is quantitatively getting lazier at coding. I benchmarked the last 2 updates to GPT-4 Turbo, and it got lazier each time.
2. For coding, asking GPT-4 Turbo to emit code changes as unified diffs causes a 3X reduction in lazy coding.
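To make point 2 concrete: "emit code changes as unified diffs" means asking the model to reply with edits in roughly this form, instead of re-printing (or eliding) whole files. This is a simplified, hypothetical example; the file and function names are made up:

    --- calculator/utils.py
    +++ calculator/utils.py
    @@ -1,2 +1,4 @@
     def factorial(n):
    -    raise NotImplementedError
    +    if n == 0:
    +        return 1
    +    return n * factorial(n - 1)

Because every hunk has to spell out the replacement lines, the model seems less inclined to hide work behind "... rest of code here ..." comments.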
Here are some articles that discuss these topics in much more detail.
Can you say in one or two sentences what you mean by “lazy at coding” in this context?
It has a tendency to do:
"// ... the rest of your code goes here"
in its responses, rather than writing it all out.
It's incredibly lazy. I've tried to coax it into returning the full code, and it will claim to follow the instructions while regurgitating the same output you complained about. GPT-4 was great; the first version of GPT-4 Turbo was pretty terrible, bordering on unusable; then they came out with the second Turbo version, which almost feels worse to me. Though I haven't compared them directly, and if someone comes out claiming they fixed an issue but you still see it, that will bias you to see it more.
Claude is doing much better in this area, local/open LLMs are getting quite good, it feels like OpenAI is not heading in a good direction here, and I hope they course correct.
I have a feeling full-powered LLMs are reserved for the more equal animals.
I hope some people remember and document details of this era; future generations may be so impressed with future reality that they may not even think to question its fidelity, if that concept even exists in the future.
…could you clarify? Is this about “LLMs can be biased, thus making fake news a bigger problem”?
Imagine if the first version of ChatGPT we all saw was fully sanitised..
We know it knows how to make gunpowder (for example), but only because it would initially tell us.
Now it won't without a lot of trickery. Would we even be pushing to try and trick it into doing so if we didn't know it actually could?
Ah, so it’s more about “forbidden knowledge” than “fake news”; that makes sense. I don’t personally see that as too much of an issue, since other sources still exist, e.g. Wikipedia, the Internet Archive, libraries, or that one Minecraft Library of Alexandria project. So I see knowledge storage staying there and LLMs staying put in the interpretation/transformation role, for the foreseeable future.
But obviously all that social infrastructure is fragile… so you’re not wrong to be alarmed, IMO
It is not so much about censorship; even that would be somewhat fine if OpenAI did it at the dataset level, so ChatGPT would not have any knowledge about bomb-making. But it is being done lazily, so system prompts get bigger, which makes the signal-to-noise ratio worse, etc. I don't care about racial bias or what to call the pope when I want ChatGPT to write Python code.
While I would agree that "don't tell it how to make bombs" seems like a nice idea at first glance, and indeed I think I've had that attitude myself in previous HN comments, I currently suspect that it may be insufficient and that a censorship layer may be necessary (partly as an addition, partly as an alternative).
I was taught, in secondary school, two ways to make a toxic chemical using only things found in a normal kitchen. In both cases, I learned this in the form of being warned of what not to do because of the danger it poses.
There's a lot of ways to be dangerous, and I'm not sure how to get an AI to avoid dangers without it knowing what they are. That said, we've got a sense of disgust that tells us to keep away from rotting flesh without explicit knowledge of germ theory, so it may be possible, although research would be necessary, and as a proxy rather than the real thing it will suffer from increased rates of both false positives and false negatives. Nevertheless, I certainly hope it is possible, because anyone with the model weights can extract directly modelled dangers, which may be a risk all by itself if you want to avoid terrorists using one to make an NBC weapon.
I recognise my mirror image. It may be a bit of a cliché for a white dude to say they're "race blind", but I have literally been surprised to learn coworkers have faced racial discrimination for being "black" when their skin looks like mine in the summer.
I don't know any examples of racial biases in programming[1], but I can see why it matters. None of the code I've asked an LLM to generate has involved `Person` objects in any sense, so while I've never had an LLM inform me about racial issues in my code, this is neither positive nor negative anecdata.
The etymological origin of the word "woke" is from the USA about 90-164 years ago (the earliest examples preceding and being intertwined with the Civil War), meaning "to be alert to racial prejudice and discrimination" — discrimination which in the later years of that era included (amongst other things) redlining[0], the original form of which was withholding services from neighbourhoods that have significant numbers of ethnic minorities: constructing a status quo where the people in charge can say "oh, we're not engaging in illegal discrimination on the basis of race, we're discriminating against the entirely unprotected class of 'being poor' or 'living in a high crime area' or 'being uneducated'".
The reason I bring that up is that all kinds of things like this can seep into our mental models of how the world works, from one generation to the next, and lead people who would never knowingly discriminate to perpetuate the same things.
Again, I don't actually know any examples of racial biases in programming, but I do know it's a thing with gender — it's easy (even "common sense") to mark gender as a boolean, but even ignoring trans issues: if that's a non-optional field, what's the default gender? And what's it being used for? Because if it is only used for title (Mr./Mrs.), what about other titles? "Doctor" is un-gendered in English, but in Spanish it's "doctor"/"doctora". But what matters here is what you're using the information for, rather than just what you're storing in an absolute sense, as in a medical context you wouldn't need to offer cervical cancer screening for trans women (unless the medical tech is more advanced than I realised).
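A minimal sketch of the kind of modelling I mean, in Python (the names are purely illustrative):

    from dataclasses import dataclass
    from enum import Enum
    from typing import Optional

    class Gender(Enum):
        FEMALE = "female"
        MALE = "male"
        OTHER = "other"

    @dataclass
    class Person:
        name: str
        # Optional rather than boolean: "unknown" is a legitimate state, and any
        # default value would silently bake an assumption into the system.
        gender: Optional[Gender] = None
        # Stored separately rather than derived from gender, so titles like
        # "Dr"/"Doctora" don't force the gender field to do double duty.
        title: Optional[str] = None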
[0] https://en.wikipedia.org/wiki/Redlining
[1] unless you count AI needing a diverse range of examples, which you may or may not count as "programming"; other than that, the closest would be things like "master branch" or "black-box testing" which don't really mean the things being objected to, but were easy to rename anyway
Would somebody try to push a technical system to do things it wasn't necessarily designed to be capable of? Uh... yes. You're asking this question on _Hacker_ News?
I confidently predict that we sheep will not have access to the same power our shepherds will have.
I suspect it's sort of like "you can have a fully uncensored LLM iff you have the funds"
The former sounds like a great training set to enable the latter. :(
People need to be using their local machines for this. Because otherwise the result is going to be a cloud service provider having literally everyone's business logic somewhere in their system and that goes wrong real quick.
It’s so interesting to see this discussion. I think this is a matter of “more experienced coders like and expect and reward that kind of output, while less experienced ones want very explicit responses”. So there’s this huge LLM laziness epidemic that half the users can’t even see.
I'm paying for ChatGPT GPT4 to complete extremely tedious, repetitive coding tasks. The newly occurring laziness directly, negatively impacts my day to day use where I'm now willing to try alternatives. I still think I get value - indeed I'd probably pay $1,000/mo instead of $20/mo - but I'm only going to pay for one service.
I mean, isn't that better as long as it actually writes the part that was asked? Who wants to wait for it to sluggishly generate the entire script for the 5th time and then copy the entire thing yet again.
Short answer: Rather than fully writing code, GPT-4 Turbo often inserts comments like "... finish implementing function here ...". I made a benchmark based on asking it to refactor code that provokes and quantifies that behavior.
Longer answer:
I found that I could provoke lazy coding by giving GPT-4 Turbo refactoring tasks, where I ask it to refactor a large method out of a large class. I analyzed 9 popular open source python repos and found 89 such methods that were conceptually easy to refactor, and built them into a benchmark [0].
GPT succeeds on this task if it removes the method from its original class and adds it to the top level of the file without significantly changing the size of the abstract syntax tree. By checking that the size of the AST hasn't changed much, we can infer that GPT didn't replace a bunch of code with a comment like "... insert original method here...". The benchmark also gathers other laziness metrics, like counting the number of new comments that contain "...". These metrics correlate well with the AST size tests.
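For a rough sense of the mechanics, the AST-size check looks something like this (a simplified sketch, not the actual benchmark code; see [0] for the real implementation):

    import ast

    def ast_size(source: str) -> int:
        # Count the nodes in the parsed abstract syntax tree.
        return sum(1 for _ in ast.walk(ast.parse(source)))

    def looks_lazy(original: str, refactored: str, tolerance: float = 0.9) -> bool:
        # If the refactored file's AST shrank a lot, or new "..." comments appeared,
        # the model probably replaced real code with placeholders.
        shrank = ast_size(refactored) < tolerance * ast_size(original)
        new_elisions = refactored.count("# ...") - original.count("# ...")
        return shrank or new_elisions > 0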
[0] https://github.com/paul-gauthier/refactor-benchmark
I use GPT-4 Turbo through the API many times a day for coding. I have encountered this behavior maybe once or twice, period. When it did happen, it made sense as essentially the model summarizing and/or assuming some shared knowledge (that was indeed known to me).
This, and people generally saying that ChatGPT has been intentionally degraded, is just super strange to me. I believe it’s happening, but it’s making me question my sanity. What am I doing to get decent outputs? Am I simply not as picky? I treat every conversation as though it needs to be vetted, because it does, regardless of how good the model is. I only trust the model's output on topics where I am a subject matter expert or in a closely adjacent field. Otherwise I treat it much like an internet comment - useful for surfacing curiosities but requiring vetting.
Why this instead of GPT-4 through the web app? And how do you actually use it for coding, do you copy and paste your question into a python script, which then calls the OpenAI API and spits out the response?
Not the OP, but I also use it through the API (specifically MacGPT). My initial justification was that I would save by only paying for what I use, instead of a flat $20/mo, but now it looks like I’m not even saving much.
I use it fairly similarly via a Discord bot I've written. This lets me share usage with some friends (although it has some limitations compared to the official ChatGPT app).
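The core of it is just something like this (a rough sketch using the official openai Python client, not my actual bot code; the model name and prompts are placeholders):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def ask(prompt: str) -> str:
        # Send a single coding question and return the model's reply as text.
        resp = client.chat.completions.create(
            model="gpt-4-turbo-preview",  # placeholder; use whatever model you have access to
            messages=[
                {"role": "system", "content": "You are a coding assistant. Always write code out in full."},
                {"role": "user", "content": prompt},
            ],
        )
        return resp.choices[0].message.content

    print(ask("Write a Python function that deduplicates a list while preserving order."))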
Whenever ChatGPT gets lazy with the coding, for example "// make sure to implement search function ....", I feed its own comments and code back to it as the prompt: "you make sure to implement the search function", and so on. That has been working for me.
I have a bunch of code I need to refactor, and also write tests for. (I guess I should make the tests before the refactor.) How do you do a refactor with GPT-4? Do you just dump the file into the chat window? I also pay for GitHub Copilot, but not GPT-4. Can I use Copilot for this?
Any advice appreciated!
Yes, along with what you want it to do.
Not that I know of. CoPilot is good at generating new code but can't change existing code.
Copilot will change existing code (though I find it's often not very good at it). I frequently highlight a section of code that has an issue, press Ctrl-I, and type something like "/fix SomeError: You did it wrong".
GitHub Copilot Chat (which is part of Copilot) can change existing code. The UI is that you select some code, then tell it what you want. It returns a diff that you can accept or reject. https://docs.github.com/en/copilot/github-copilot-chat/about...
It was really good at some point last fall, solving problems that it had previously completely failed at, albeit after a lot of iterations via AutoGPT. At least for the tests I was giving it, which usually involved heavy stats and complicated algorithms, I was surprised it passed. Despite it passing, the code was slower than what I had personally solved the problem with, but I was completely impressed because I asked hard problems.
Nowadays AutoGPT gives up sooner, seems less competent, and doesn't even come close to solving the same problems.
Hamstringing high-value tasks (complete code) to give forthcoming premium offerings greater differentiation could be a strategy. But counter to this, doing so would open the door for competitors.
The question I have been wondering about is whether they are hamstringing high-value tasks to create room for premium offerings, or trying to minimize cost per task.
this is exactly what I noticed too
Lazy coding is a feature, not a bug. My guess is that it breaks aider's automation, but by analyzing the AST that wouldn't be a problem. My experience with lazy coding is that it omits the irrelevant code and focuses on the relevant part. That's good!
As a side note, I wrote a very simple, small program to analyze Rust syntax and single out functions and methods using the syn crate [1]. My purpose was exactly to make it ignore lazy-coded functions.
[1]https://github.com/pramatias/replacefn/tree/master/src
It sounds like you've been extremely lucky and only had GPT "omit the irrelevant code". That has not been my experience working intensively on this problem and evaluating numerous solutions through quantitative benchmarking. For example, GPT will do things like write a class with all the methods as simple stubs, with comments describing their function.
Your link appears to be ~100 lines of code that use rust's syntax parser to search rust source code for a function with a given name and count the number of AST tokens it contains.
Your intuitions are correct, there are lots of ways that an AST can be useful for an AI coding tool. Aider makes extensive use of tree-sitter, in order to parse the ASTs of a ~dozen different languages [0].
But an AST parser seems unlikely to solve the problem of GPT being lazy and not writing the code you need.
[0] https://aider.chat/docs/repomap.html
The tool needs a way to guide it to be more effective. It is not exactly trivial to get good results. I have been using GPT for 3.5 years and the problem you describe never happens to me. I could share 500 to 1000 prompts I used to generate code just from last week, but the prompts I used to write replacefn can be found here [1]. Maybe there are some tips that could help.
[1] https://chat.openai.com/share/e0d2ab50-6a6b-4ee9-963a-066e18...
The chat transcript you linked is full of GPT being lazy and writing "todo" comments instead of providing all the code:
It took >200 back-and-forth messages with ChatGPT to get it to ultimately write 84 lines of code? Sounds lazy to me.
Ok, it does happen, but not so frequently. You are right. But is this such a big problem?
Like, you parse the response, throw away the comment "//implementation goes here", also throw away the function/method/class/struct/enum it belongs to, and keep the functional code. I am trying to implement something exactly like aider, but specifically for Rust: parsing the LLM's response, filtering out blank functions, etc.
In Rust, filtering out blank functions is easy; in other languages it might be very hard. I haven't looked into tree-sitter, but getting a sense of JavaScript, Python, and other languages sounds like a very difficult problem to solve.
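For illustration, a rough sketch of what that filtering could look like for Python, using the standard ast module (hypothetical, and untested against real LLM output; my actual tool only handles Rust via syn):

    import ast

    def is_stub(fn) -> bool:
        # A function counts as a stub if its body is only pass, `...`, or a docstring.
        for node in fn.body:
            if isinstance(node, ast.Pass):
                continue
            if isinstance(node, ast.Expr) and isinstance(node.value, ast.Constant):
                continue  # docstring or bare `...`
            return False
        return True

    def keep_functional_code(source: str) -> list[str]:
        # Return the source of every function in an LLM reply that has a real body.
        tree = ast.parse(source)
        return [
            ast.get_source_segment(source, node)
            for node in ast.walk(tree)
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and not is_stub(node)
        ]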
Even though I like it when GPT compresses the answer and doesn't return a lot of code, other models like Mixtral 8x7B never compress it like GPT does, in my experience. If they are not lagging much behind GPT-4, maybe they are better for your use case.
Hey, Rust throws a lot of errors. We do not want humans to go around and debug code unless it is absolutely necessary, right?
Just use Grimoire.
Really great article. Interestingly I have found that using the function call output significantly improves the coding quality.
However for now, I have not run re-tests for every new version. I guess I know what I will be doing today.
This is an area I have spent a lot of time working on; I would love to compare notes.
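By "function call output" I mean roughly this kind of setup (a sketch, not my exact pipeline; the write_file function and its schema are just illustrative):

    import json
    from openai import OpenAI

    client = OpenAI()

    # Force the reply into a function call whose single argument is the complete
    # file, instead of prose with snippets and "rest goes here" comments.
    resp = client.chat.completions.create(
        model="gpt-4-turbo-preview",  # placeholder
        messages=[{"role": "user", "content": "Refactor foo.py so the parser is its own class. Return the full file."}],
        tools=[{
            "type": "function",
            "function": {
                "name": "write_file",
                "description": "Return the complete, runnable contents of the updated file.",
                "parameters": {
                    "type": "object",
                    "properties": {"content": {"type": "string"}},
                    "required": ["content"],
                },
            },
        }],
        tool_choice={"type": "function", "function": {"name": "write_file"}},
    )

    print(json.loads(resp.choices[0].message.tool_calls[0].function.arguments)["content"])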
How is laziness programmatically defined or used as a benchmark?
Personally I have seen it saying stuff like:
    public someComplexLogic() {
        // Complex logic goes here
    }
Another example, when the code is long (e.g. asking it to create a Vue component), is that it will just add a comment saying the rest of the code goes here.
So you could test for it by asking it to create long/complex code and then running the output against unit tests that you created.
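A rough sketch of that check (hypothetical; it assumes pytest is installed and that you already have the model's output as a string):

    import pathlib
    import subprocess
    import tempfile

    def passes_tests(generated_code: str, test_code: str) -> bool:
        # Write the generated module plus our own unit tests to a temp dir and run pytest.
        with tempfile.TemporaryDirectory() as tmp:
            pathlib.Path(tmp, "generated.py").write_text(generated_code)
            pathlib.Path(tmp, "test_generated.py").write_text(test_code)
            result = subprocess.run(["pytest", "-q", tmp], capture_output=True)
            return result.returncode == 0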
Yeah this is a typical issue:
- Can you do XXX (something complex) ?
- Yes of course, to do XXX, you need to implement XXX, and then you are good, here is how you can do:
int main(int argc, char **argv) {
Are you using API or UI? If UI, how do you know which model is used?
It wouldn't be the top comment if it wasn't
FYI, also make sure you’re using the Classic version, not the augmented one. The Classic one has no system prompt (or at least not a completely altering one), unlike the default.
EDIT: This of course only applies if you’re using the UI. Through the API it’s the same.
Voice Chat in ChatGPT4 was speaking perfect Polish. Now it sounds like a foreigner that is learning.
Thanks for these posts. I implemented a version of the idea a while ago and am getting good results.
I have not noticed any reduction in laziness with later generations, although I don't use ChatGPT in the same way that Aider does. I've had a lot of luck with using a chain-of-thought-style system prompt to get it to produce results. Here are a few cherry-picked conversations where I feel like it does a good job (including the system prompt). A common theme in the system prompts is that I say that this is an "expert-to-expert" conversation, which I found tends to make it include less generic explanatory content and be more willing to dive into the details.
- System prompt 1: https://sharegpt.com/c/osmngsQ
- System prompt 2: https://sharegpt.com/c/9jAIqHM
- System prompt 3: https://sharegpt.com/c/cTIqAil Note: I had to nudge ChatGPT on this one.
All of this is anecdotal, but perhaps this style of prompting would be useful to benchmark.
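For reference, the flavor of those system prompts is roughly along these lines (paraphrased and hypothetical; the real prompts are in the links above):

    messages = [
        {
            "role": "system",
            "content": (
                "This is an expert-to-expert conversation. Assume I am a senior "
                "engineer: skip generic explanations and dive straight into the details."
            ),
        },
        {"role": "user", "content": "..."},  # the actual question goes here
    ]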