As a developer but also a product person, I keep trying to use AI to code for me. I keep failing: because of context length, because of shit output from the model, because of the lack of any kind of architecture, etc. etc. I'm probably dumb as hell, because I just can't get it to do anything remotely useful beyond helping me with leetcode.
Just yesterday I tried to feed it a simple HTML page to extract a selector. I tried it with GPT-4-turbo, I tried it with Claude, I tried it with Groq, I tried it with a local LLama2 model with a 128k context window. None of them worked. This is a task that, while annoying, I do in about 10 seconds.
Sure, I'm open to the possibility that sometime in the next 2-3 days up to a couple of years, I'll no longer do manual coding. But honestly, after so much of it, I'm starting to grow a bit irritated with the hype.
Just give me a product that works as advertised and I'll throw money your way, because I have a lot more ideas than I have code throughput!
Ditto. I started out excited about LLMs and eager to use them everywhere, but have become steadily disillusioned as I have tried to apply them to daily tasks, and seen others try and fail in the same way.
Honestly, LLMs can't even get language right. They produce generic, amateurish copy that reads like it's written by committee. GPT can't perform to the level of a middle market copywriter or content marketer. I am convinced that people who think LLMs can write have simply not understood what professional writers do.
For me the "plateau of productivity" after the disillusionment has been using LLMs a bit like search engines. Quick standalone summaries, snippets or thoughts. A nice day-to-day productivity boost, but nothing that's going to allow me to work less hard.
If you were only using GPT 3.5 (free ChatGPT) then your opinion is irrelevant.
With GPT-4 you could directly ask it: "rewrite your previous response so that it sounds less generic, less amateurish, and not written by a committee". I'm not even joking. Just provide enough information and tell it what to do. If you don't like the output then tell it what needs to be improved. It's not a mind reader.
Also GPT-4 is a year old now. Claude 3 is already superior and GPT-5 will be next level.
If you haven't actually used GPT-5 yet, your assessment is irrelevant.
But the real game changer will be GPT-6.
What really annoys me in all these discussions is how no one's tested what happens if they wait until 2050 and try GPT-19.
That's well after the AI meta consciousness understood that it was necessary to destroy all humans to save the planet. GPT-6 was the last of the GPT series.
Perhaps the strangest element of the AI alignment conversation is that alignment with human civilization (at least with its most powerful elements) and alignment with sustainable life on the planet are at odds, and "destroy humans to save the planet" is a concern mostly because it seems to be a somewhat rational conclusion.
We already have Claude 3 Opus and it is clear for anyone who has used it that it is way better than GPT-4, especially for coding.
The model names, version numbers or who makes them are irrelevant.
Yes, I've used GPT-4. The writing sounds better, but it still sucks at writing. Most importantly, it feels like it sucks just as much as GPT-3.5 in some deeply important ways.
If you use GPT-4 day-to-day, you've probably encountered this sense of a capability wall before. The point where additional prompting, tweaking, re-prompting simply doesn't seem to be yielding better results on the task, or it feels like the issue is just being shifted around. Over time, you develop a bit of a mental map of where the strengths and weaknesses are, and factor that into your workflows. That's what writing with LLMs feels like, compared to working with a professional writer.
Most writers have already realized that LLMs can't write in any meaningful way.
I know a professional writer who is amazed by what LLMs are capable of already and, given the rate of progress, speculates they will take over many writing jobs eventually.
Of course there is a wall with the current models. But almost every time I hit a wall, I have found a way to break past that limit: interacting with the LLM as I would interact with a person. LLMs perform best with chain-of-thought reasoning. List out any issues you identified in the original output, ask the LLM to review those issues and list any other issues it can identify based on the original requirements, then rewrite it all. And do that several times until it's good enough.
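A minimal sketch of that loop using the OpenAI Python client; the model choice, the fixed three passes, and the prompts are just placeholders:

  from openai import OpenAI

  client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

  # Start from whatever task prompt you have; this one is only an example.
  messages = [{"role": "user", "content": "Draft a 150-word product announcement for our new CLI tool."}]
  draft = client.chat.completions.create(model="gpt-4", messages=messages)
  messages.append({"role": "assistant", "content": draft.choices[0].message.content})

  critique = (
      "List every issue with your previous answer relative to the original "
      "requirements, then rewrite it fixing all of them."
  )
  for _ in range(3):  # repeat until it's good enough; three passes is arbitrary
      messages.append({"role": "user", "content": critique})
      revised = client.chat.completions.create(model="gpt-4", messages=messages)
      messages.append({"role": "assistant", "content": revised.choices[0].message.content})

  print(messages[-1]["content"])

The point is just that the model critiques its own output against the stated requirements before rewriting, rather than you re-prompting blind.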
At work I have found GPT-4 to exceed the linguistic capabilities of my colleagues when it comes to summarizing complicated boring business text.
Or just write the damn thing yourself.
What if this is a boring business text summary task that takes additional hours of my time at work? Why should I waste my time? I have better things to do. I can leave early while you sit there at work typing like a fool.
I think it is a tooling issue. It is in no way obvious how to use LLMs effectively, especially for really good writing results. Tweaking and tinkering can be time consuming indeed, but lately I've been using chatgpt-shell [1] and it lends itself well to an iterative approach. One needs to cycle through some styles first, and then decide how to most effectively prompt for better results.
[1]https://github.com/xenodium/chatgpt-shell/blob/bf2d12ed2ed60...
ChatGPT-4 is a technological miracle, but it can only produce trite, formulaic text and it's _relentlessly_ Pollyanna-ish. Everything reads like ad copy and it's easily identifiable.
Fix your prompt. Just accepting the default style is a rookie mistake.
Ask it to "rewrite that in the tone of an English professor" or "rewrite that in the style of a redneck rapper" or "make that sound less like generic ad copy". Get into an argument back and forth with the LLM and tell it the previous response is crap because of XYZ.
Or, you know, spend the half hour that would take writing your stuff yourself.
These models can do something in a second that would take many hours for a human writer.
GPT's rigid "robot butler" style is not "just how LLMs write". OpenAI deliberately tuned it to sound that way. Even much weaker models that aren't tuned to write in a particular way can easily pass for human writing.
This is part of the problem with the whole discourse of comparing human writers to LLMs. Superficial things like style and tone aren't the problem, but they are overwhelmingly the focus of these discussions.
It's funny to see, because developers are so sensitive about being treated like code monkeys by their non-technical colleagues. But these same devs turn around to treat other professionals as word monkeys, or pixel monkeys, or whatever else. Not realizing that they are only seeing the tip of the iceberg of someone else's profession.
Professional writers don't take prompts and shit out words. They work closely with their clients to understand the important outcomes, then work strategically towards them. The dead giveaway of LLM writing isn't the style. It's the lack of coherent intent behind the words, and low information density of the text. A professional writer works to communicate a lot with very little. LLMs work in the opposite way: you give it a prompt, then it blows it out into verbiage.
Sit down for coffee with a professional copywriter (not the SEO content marketing spammers), and see what they have to say about LLMs.
Personally, I group all these things under 'style'. Perhaps I should have used 'presentation' instead. You've latched onto that specific word and gone off. The point is that the post-training of these models, especially OpenAI's GPT, does a lot to shape how the writing (the default at least) presents long strings of text. Like how GPT-4 is almost compelled to end bouts of fiction prematurely in sunshine and rainbows. That technically isn't style but is part of what I was talking about.
There's no reason you have to work this way with an LLM.
No, I haven't. I'm not talking about style, but something deeper. What I'm talking about is something you don't even seem to realize exists in professional writing - which is why you keep thinking I'm misunderstanding you when I am not.
I've worked with professional writers, and nothing in the LLM space even comes close to them. It's not a matter of low quality vs high quality, or benchmarking, or style. It's simply an apples and oranges comparison.
The economics of LLMs for shortform copy will never make sense, because producing the words is the cheapest part of that process. They might become the best way for writers themselves to produce longform copy on the execution side, but they can't replace the writer's ability to work with the client to figure out exactly what they are trying to write, and why, and what a good result even looks like. And no, this isn't a prompting issue, or a UI issue, or a context window length issue, or anything like that.
Elsewhere in this thread someone mentioned how invaluable LLMs are for producing internal business copy. I could easily see these amateur writing tasks being replaced by LLMs. But the implication there isn't that LLMs are any good at writing, but that these tasks don't require good writing to begin with.
I've read hundreds of books, fiction and otherwise. This isn't a brag, it's just to say, believe me, I know what professional writing looks like and I know where LLMs currently stand because I've used them a lot. I know the quality you can squeeze out if you're willing to let go of any presumptions.
You'll notice that not once did I say current LLMs could wholesale replace professional writers anymore than they can currently replace professional software devs. I just disagree on the "not a good writer" bit.
If it's the opinion of professional writers you're looking for then you can find some who disagree too.
Rie Kudan won an award for a novel in which about 5% of the text was GPT output used verbatim (essentially no edits). Her words, not mine. Who knows how much more of the novel is edited GPT.
That a professional human novelist was able to leverage GPT for their book isn't disproving the grandparent's post. They knew what good looks like, and if it wasn't good they wouldn't have kept it in the book.
I actually agree with you that professional writers _can_ write/communicate much better than LLMs. However, I've read way too many articles or chapters in books that are so full of needless fluff before they get to the point. It's almost as if they wanted to show off that they can write all that and somehow connect it to the main part of the article. I'm not reading the essay to appreciate the writer's ability to narrate things; I care about what they have to say on the topic that brought me to the essay.
Perhaps the pointless fluff you're describing is actually chaff: countermeasures strategically deployed ahead of time by IQ 180 writers in order to preemptively water down any future LLM's trained on their work.
Then the humans can make a heroic return, write surgical prose like Hemingway to slice through the AI drivel, and keep collecting their paychecks.
Bonus points if you can translate this analogy to software development...
lmfao :¬)
And it only took one of the most computationally expensive processes ever devised by man.
That ignores how much energy you're burning while searching through dozens and dozens of articles that may or may not give you the answer you're looking for. I'd say the electricity that LLMs burn is nothing compared to my energy and time in that regard.
I'd bet $50 the inference is more expensive.
I feel worthless now :)
I mean, the brain is, if you want to consider it a computer, pretty god damn efficient. It's slow as hell but it's really powerful and runs on bread.
Are you saying LLMs are officially mid?
Dario Amodei (Anthropic) pretty much acknowledged exactly that - "mid" - on his Dwarkesh interview, while still all excited that they'd be up to killing people in a couple of years.
I've had the same experience. I heard tons of people clamoring about the ability of LLMs to write SEO copy for you and how you can churn out web content so much faster now. I tried using it to churn out some very specific blog posts for an arborist client of mine.
The results were really bad. I had to rewrite and clarify a lot of what it spit out. The grammar was not very good, and it was really hard to read, with very poorly structured sentences that would end abruptly, among other glaring issues.
I did this right after a guy I play hockey with said he uses it all the time to write emails for him, and pays the monthly subscription in order to have it write all kinds of things for him every day. After my trial, I was really wondering how obvious it was that he was doing that, and what his clients thought of him, given how poor the stuff these LLMs were putting out was.
It says a lot about SEO copy that this is one of the areas where LLMs' low quality doesn't seem to have impeded adoption. There are a ton of shitty content marketers using LLMs to churn out spam content.
I feel the same way about this stuff as when devs say they push out LLM code with no refactoring or review. Ah, good luck!
In some circles, that's actually a wonderful achievement. :p
It's worth pointing out that on their eval set for "issues resolved" they are getting 13.86%. While visually this looks impressive compared to the others, anything that only really works 13.86% of the time isn't useful when verifying the work takes nearly as much time as doing it yourself would have.
The problem with this entire space is that we have VC hype for work that should ultimately still be being done in research labs.
Nearly all LLM results are completely mind blowing from a research perspective but still a long way from production ready for all but a small subset of problems.
The frustrating thing, as someone who has worked in this space a while, is that VCs want to see game-changing products ship overnight. Teams working on the product-facing end of these things are all being pushed insanely hard to ship. Most of those teams are solving problems never solved before, but given deadlines as though they are shipping CRUD web apps. The kicker is that despite many teams doing all of this, because the technology still isn't there, they still disappoint "leadership". I've personally seen teams working nights and weekends, implementing solutions to never-before-seen problems in a few weeks, and still getting a thumbs down when they cross the finish line.
To really solve novel problems with LLMs will take a large amount of research, experimentation and prototyping of ideas, but people funding this hype have no patience for that. I fear we'll get hit by a major AI winter when investors get bored, but we'll end up leaving a lot of value on the table simply because there wasn't enough focus and patience on making these incredible tools work.
Don't forget the 20/80 rule. They haven't even gotten to 15% yet.
Our jobs are safe. I would even expect more "beginners" to try something with AI and then need an actual programmer to help them.
(At least if they are unwilling to invest the time in development and debugging themselves.)
PS: Probably all the given examples are in the top 3 most popular programming languages.
Machines are currently at an amateur level, but amateur across the board, over the whole knowledge base.
Amateurs at Python, Fortran, C, C++ and all programming languages. Amateurs at car engineering, airplane engineering, submarine engineering etc. Amateurs at human biology, animal biology, insect biology and so on.
I don't know anyone who is an amateur at everything.
What does it mean exactly to be an amateur at "submarine engineering"? It certainly doesn't mean you know how to build a submarine.
Perhaps an amateur could engineer a submarine that goes down to and stays at 10 feet deep? And one that doesn't carry a nuclear payload?
Maybe there's a future for AIs designing narco-subs ?
They cannot make a submarine themselves, or design one, but when they reach 50 percent, they will reach 50 percent at everything.
In submarine engineering, they will be able to design and construct it in some way, like 3D printing it, and the submarine will be able to move through the water for some time before it sinks. Yeah, probably for submarines a higher percentage should be achieved before they are really useful.
No, and that is one of their limitations. LLMs are human-level or above on some tasks - basically on what they were trained to do - generating text, and (at least at some level) grokking what is necessary to do a good job of that. But they are at idiot level on many other tasks (not to overuse the example, but I just beat GPT-4 at tic-tac-toe since it failed to block my 2/3-complete winning line).
Things like translation and summarization are tasks that LLMs are well suited to, but these also expose the danger of their extremely patchy areas of competence (not just me saying this - Anthropic CEO recently acknowledged it too). How do you know that the translation is correct and not impacted by some of these areas of incompetence? How do you know that the plausible-looking summary is accurate and not similarly impacted?
LLMs are essentially by design ("predict next word" objective - they are statistical language models, not AI) cargo-cult technology - built to create stuff that looks like it was created by someone who actually understands it. Like (origin of term cargo-cult) the primitive tribe that builds a wooden airplane that looks to them close enough to the cargo plane that brings them gifts from the sky. Looking the same isn't always good enough.
Agreed, and I think that many of the problems that people think LLMs will become capable of, in fact require AGI.
It may well turn out that LLMs are NOT the path to AGI. You can make them bigger and better, and address some of their shortcomings with various tweaks, but it seems that AGI requires online/continual learning which may prove impossible to retrofit onto a pre-trained transformer. Gradient descent may be the wrong tool for incremental learning.
At least in theory we can achieve incremental learning by retraining from scratch every time we get some new training data. There are drawbacks to this approach, such as inconsistent performance across training runs and significantly higher training cost, but it's achievable. Now the question is whether there exist methods more efficient than gradient descent. I think it's very clear now that there is no other algorithm in sight that could achieve this level of intelligence without gradient descent at its core; the problem is just how gradient descent is used.
The obvious alternative to gradient descent here would be Bayes' formula (probabilistic Bayesian belief updates), since this addresses the exact problem that our brains evolved to optimize - how to utilize prediction failure (sensory feedback vs prediction) to make better predictions - better prediction of where the food is, what the predator will do, how to attract a mate, etc. Predict next word too (learn language), of course.
I don't think pre-training for every update works - it's an incredibly slow and expensive way to learn, and the training data just isn't there. Where is the training data that could train it how to do every aspect of any job - the stuff that humans learn by experimentation and experience? The training data that is available via text (and video) is mostly artifacts - what someone created, not the thought process that went into creating it, and the failed experiments and pitfalls to avoid along the way, etc, etc.
It would be nice to have a generic college-graduate pre-trained AGI as a starting point, but then you need to take that and train it how to be a developer (starting at entry level, etc), or for whatever job you'd like it to do. It takes a human years of practice to get good at jobs like these, with many try-fail-rethink experiments every day. Imagine if each of those daily updates took 6 months and $100M to incorporate?! We really need genuine online learning where each generic graduate-level AGI instance can get on-the-job training and human feedback and update its own "weights" continually.
If you know a little about the math behind gradient descent you can see that an embedding layer followed by a softmax layer gives you exactly the best Bayes estimate. If you want a bit of structure, like every word depending on the previous n words, you get a convolutional RNN, which is also well studied. These ideas are natural and elegant, but maybe a better idea is to get familiar with the research already done, to avoid diving into dead ends too much.
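To spell out what I mean by the Bayes connection (a sketch, under the assumption that the logits are linear in the embedding):

  P(y=k \mid x) = \frac{P(x \mid y=k)\, P(y=k)}{\sum_j P(x \mid y=j)\, P(y=j)}
                = \frac{e^{z_k}}{\sum_j e^{z_j}} = \mathrm{softmax}(z)_k,
  \quad \text{where } z_k = \log P(x \mid y=k) + \log P(y=k).

So if you model z_k as a linear function of the embedding, w_k^T e(x) + b_k, the softmax output is exactly the Bayes posterior under that model, and gradient descent on cross-entropy is just fitting it.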
No, I don't "want a bit of structure" ... I want a predictive architecture that supports online learning. So far the only one I'm aware of is the cortex.
Not sure what approaches you are considering as dead ends, but RNNs still have their place (e.g. Mamba), depending on what you are trying to achieve.
This is an important lesson that all SWEs should take to heart. Nobody cares about your novel algorithm. Nobody cares about your high availability architecture. Nobody cares about your millisecond network latency optimizations. The only thing that anyone actually using your software cares about is "Does the screen with lights and colors make the right lights and colors that solve my problem when I click on it?". Anything short of that is yak shaving if your role is not pure academic R&D.
I wish this were the case. The amount of time I spend trying to talk principal engineers out of massive refactors because we want to get this out soon is near criminal.
Sure, there's a tendency, even among relatively senior developers, to want to rewrite things to make them better, and it's certainly faster to put a band aid on it if you need to ship something fast.
The thing is, though, that technical debt and feature creep (away from the flexibility anticipated by the original design) are real, and sometimes a rewrite or refactor is the right thing to do - necessary so that simple things remain simple to add, and to be able to continue shipping fast. It just takes quite a bit of experience to know when NOT to rewrite/refactor and when to do it.
Sounds like you could use a yak shave.
I also have two crypto-bro friends that are hyping it up without having anything to show for it. Which is why I'm sort of complaining about the hype surrounding it. I agree with your post to a large extent. This is not production ready technology. Maybe tomorrow.
LLMs are quite good at text based tasks such as summarization and extracting entities.
These generally don't require advanced logic or thought, though they can require some moderate reasoning ability to summarize two slightly conflicting text extracts.
Lots of corporate work would be enhanced by better summarization, better information dissemination, and better text extraction. Most of it is pretty boring work, but there's a lot of it.
VC hypes seem to want to mostly focus on fantastical problems, though, which sound impressive at dinner parties but don't actually work well.
If you're a VC, do you want to talk about your investment in a company that finds discrepancies in invoices, or one that self-writes consumer iPhone apps?
Only one of those is actually doable today.
Summarize this for me please: https://www.nytimes.com/2024/02/25/world/europe/cia-ukraine-...
HN doesn't allow posting AI content, but I tried pasting that into Gemini and it did fine. I saw no errors; maybe it missed some important details, but everything I checked matched the article, and those details seemed like a good summary.
Here is what it wrote; it didn't have enough tokens for the last 20% of the article though:
A Longstanding Partnership: The collaboration began in 2014 after the pro-Russian government was ousted in Ukraine. The CIA was initially cautious due to concerns about trust and provoking Russia.
Building Trust: Ukrainian intelligence officials gradually earned the CIA's trust by providing valuable intel, including on Russia's involvement in the downing of MH17 and election interference.
Hidden Network: The CIA secretly funded and equipped a network of 12 spy bases along the Ukrainian border used for intelligence gathering.
Training and Operations: The CIA trained Ukrainian special forces (Unit 2245) and intelligence officers (Operation Goldfish) for missions behind enemy lines.
Friction and Red Lines: The Ukrainians sometimes disregarded CIA restrictions on lethal operations, leading to tensions but not severing the partnership.
Current Importance: This intelligence network is now crucial for Ukraine's defense, providing critical intel on Russian troop movements and enabling long-range strikes.
Yeah, I remember speech recognition taking decades to improve, and being more of a novelty - not useful at all - even when it was at 95% accuracy (1 word in 20 wrong). It really had to get almost perfect until it was a time saver.
As far as coding goes, it'd be faster to write it yourself and get it right first time rather than have an LLM write it where you can't trust it and still have to check it yourself.
You can't compare the accuracy of speech recognition to LLM task completion rates. A nearly-there yet incomplete solution to a Github issue is still valuable to an engineer who knows how to debug it.
Even now, automatic speech recognition is a big timesaver, but you _need_ a human to look through the transcript to pick out the obviously wrong stuff, let alone the stuff that's wrong but could be right in context.
Agreed on the lack of value for 13.86% correctness — I noticed that too. This reminds me a little of last year's hype around AutoGPT et al (at around the same time of year, oddly enough); it's very promising as a measure of how far we've come since just a few years ago when that metric would be 0%, but it doesn't seem super usable at the moment.
That being said, something is definitely coming. 50% correctness would probably be well worth using — simple copy/paste between my editor and GPT4 has been useful for me, and that's much less likely to completely solve an issue in one shot — and not only will small startups doing finetunes be grinding towards better results... The big labs will be too, and releasing improved foundation models that the startups can then continue finetuning. I don't think a new AI winter is on the horizon yet; Meta has plenty of reason to keep pushing out better stuff, both from a product perspective (glasses) and from an efficiency perspective (internal codegen), and OpenAI doesn't seem particularly at risk of stopping since Microsoft is using them both to batter Google on search (by having more people use ChatGPT for general question answering than using Google search), and to claw marketshare from Amazon in their cloud offerings. Similarly, some AI products have already found product/market fit; Midjourney bootstrapped from 0 to $200MM ARR (!) for example, purely on the basis of monthly subscriptions, by disrupting the stock image industry pretty convincingly.
"To really solve novel problems with LLMs will take a large amount of research, experimentation and prototyping of ideas, but people funding this hype have no patience for that. I fear we'll get hit by a major AI winter when investors get bored, but we'll end up leaving a lot of value on the table simply because there wasn't enough focus and patience on making these incredible tools work."
...this is what happened in 99-2000. It took 3-7 years for the survivors to start making it usable and letting the general public adjust to a new user paradigm (online vs on PC).
Thanks, insightful comment.
I build a pretty popular LLM tool. I think learning when/how to use them is as big a mental hurdle as it was learning to google well or whether something is googlable or not.
In the realm of coding, here are a few things it's really good at:
- Translating code, generating cross-language clients. I'll feed it a golang single-file API backend and tell it to generate the typescript client for that. You can add hints, e.g. "use fetch", "allow each request method to have a header override", "keep it typesafe, use zod", etc.
- Basic validation testing. It's pretty good at generating scaffold tests that do basic validation (Opus is good at writing trickier tests) as you're building.
- Small module completion. I write an interface of a class/struct with its methods and some comments and tell it to fill in. A recent one I did looked something like (abbreviated):
type CacheDir struct { dir string; maxObjectLifetime time.Duration; fileLocks sync.Map }
func (cd *CacheDir) Get(...)
func (cd *CacheDir) Set(...)
func (cd *CacheDir) startCleanLoop()
Opus does a really good job generating the code and basic validation tests for this.
One general tip: you have to be comfortable spending 5 minutes crafting a detailed query assuming the task takes longer than that. Which can be weird at first if you take yourself seriously as a human.
Note that I hadn't been able to do much of this with GPT-4 Turbo, but with Claude Opus it really feels capable.
Just to answer the turbo aspect: I've seen a big downgrade in quality when comparing 4 to 4-turbo, and even the new preview which is explicitly supposed to follow my instructions better. So I'm running a first pass through 4, then combining it with 4-turbo to take advantage of the larger context window, and then running 4 on it again to get a better quality output.
You really need to try Opus. Try a provider that works across models (one in my bio).
It's incredible how far behind HN of all places is w.r.t. what the current best tech is.
So many people talking about GPT-4 here, or even 3.5 when the SOTA has moved way along.
Gemini Advanced is also a great model, but for other reasons. That thing really knows a boat load of low level optimization tricks.
I'm sure you know what you're talking about, but pushing the point that what is "best" or worth talking about is something that changes like every month does not really help defend against the case that most of this is just hype-churn or marketing.
I'm not pushing what to talk about so much as pushing the point not to talk about stuff that is obsolete and starting to smell.
It's that hype-churn marketing that is a motivating factor for the groups to innovate, much like Formula 1. It might be distasteful, but that doesn't mean it isn't working.
So what is the SOTA, in your opinion?
Good fucking Christ, Opus has been out for like 8 days, I've had holidays longer than that!
I'm talking about 4-turbo, 4-turbo preview and self hosted LLama2. What in God's name is not SOTA about this?
Are they considerably better than existing non-AI tools + manual coding for this? In VSCode and Visual Studio, when working with an interface in C# for example, I can click two context menus to have it generate an implementation with constructors, getters, & setters included, leaving only the business logic code to write manually. You've mentioned you have to describe what you want to the AI in comments, and then I assume spend time on a step to verify the AI has correctly interpreted your request and implemented it.
I can definitely see the advantage for LLMs when writing unit tests on existing code, but short of very limited situations, I'm really finding it difficult to find the 55% efficiency improvements claimed by the likes of GitHub's AI Copilot.
That sounds crazy useful and I think speaks most to the maturity of C# and Microsoft's commitment to making it so ergonomic. I'm pretty curious about that feature, I'd love something similar for C++ in VS Code, but thus far I've been doing a pretty similar Copilot flow to the parent comment. It's nothing groundbreaking, but a nice little productivity boost. If I had to take that or a linter, I'd take the linter.
Totally agree on the 55% figure being hogwash.
Visual Studio (not VSCode) has this for C++, though it can be a bit finicky. It’s infinitely better than AI autocomplete, which just makes shit up half the time.
Can any of them convert a native iPhone app to an Android one?
Piece by piece, sure. The context window is too small to just feed it a massive source dump all at once.
I use ChatGPT every day and it’s excellent at:
- replacing StackOverflow and library documentation
- library search
- converting between formats and languages
- explaining existing code/queries
- deobfuscating code
- explaining concepts (kinda hit or miss)
- helping you get unstuck when debugging or looking for solution (‘give me possible reasons for …’)
I feel like many of these things require asking the right questions, which assumes a certain level of experience. But once you reach that level, it's an extremely valuable assistant.
I like it for condensing long stack traces and very very simple requests, but it really falters when you try to do anything domain specific.
Library documentation? Yeah, it doesn't really save time when GPT makes up functions and libraries, making me check the docs anyways...
I was initially hopeful but I find it gets in my way for anything not trivial.
"Yeah, it doesn't really save time when GPT makes up functions and libraries, making me check the docs anyways..."
That behavior is now vanishingly-rare, at least in GPT4.
So Copilot uses GPT-4 under the hood, and about half the time I use it to generate anything bigger than a couple of lines it doesn't even compile, let alone be correct. It hallucinates constantly.
I find it to be hit or miss in this aspect. Sometimes I can write a comment about how I want to use an API that I don't know well, and it generates perfect, idiomatic code to do exactly what I want. I quickly wrote a couple of Mastodon bots in golang, leaning heavily on Copilot due to my lack of familiarity with both the language and Mastodon APIs. But yes, sometimes it just spits out imaginary garbage. Overall it's a win for my productivity - the failures are fast and obvious and just result in my doing things the old way.
I find it horrible at replacing library documentation
I've been using LLM products since their inception. I use them in my daily work life. It's a bit tiring hearing this 'right questions', 'level of experience' and 'reach this level'. Can you share anything concrete that you achieved with ChatGPT that would blow my mind?
I keep hearing this 'you need to ask the right kind of questions bro' from people that have never built a single product in their lives, and it makes me question my ability to interact with LLMs, but I never see anything concrete.
I recently had an introspective dream revealed to be based on a literal prompt at the end: "Game to learn to talk about It and its player." When I asked GPT to craft a plot from this prompt's title (and the fact it is revealed at the end), it reproduced the dream's outline, down to the final scene:
GPT reconstruction:
The dream reaches its peak when you meet the "final boss" of the game: an entity that embodies the ultimate barrier to communication. To overcome this obstacle, you must synthesize everything you've learned about "it" in the dream and present a coherent vision that is true to yourself. As you articulate your final understanding of "it", the maze dissolves around you, leaving you in front of a giant mirror. In this mirror, you see not just your reflection but also all the characters, passions, and ideas you encountered in the dream. You realize that "it" is actually a reflection of yourself and your ability to understand and share your inner world. The dream ends with the title revealed, "Game to Learn to Communicate about It and Its Player", meaning the whole process was a metaphor for learning to know and communicate your own "it" - your personality, thoughts, and emotions - with others, and that you are both the creator and the discoverer of your own communication game.
My note:
The continuation of the dream corresponds to an abrupt change of scene. I find myself in my bed, in the dim light of my room, facing a mysterious silhouette. As I repeatedly inquire about its identity, I stretch my hands towards its face to feel its features as I cannot clearly see them. Then, a struggle begins, during which I panic, giving the dream a nightmarish turn. Noticing that the dark figure mirrors my movements, I realize it's myself. Suddenly under my duvet and as I struggle to get out, I feel jaws and teeth against the sheets. I call out for my mother, whom I seem to hear downstairs, and that's when my vision fades, and I see the dream's source code displayed behind. It consists of ChatGPT prompts shared on the lime green background of an image-board. At the bottom, I then see the dream's title: "Game to learn how to communicate about It and its player."
Look I don't mean to downplay. Or maybe I do. But we're talking about LLM replacing professional problem solvers, software architects, not generating great sounding probability modeled token distributions.
Just did this:
https://pastebin.com/YfyB0K0Z
In less time than it would take me to figure out how dictionary comprehensions work in Python. But hey, I guess you already know about these use cases.
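(For context, the sort of one-liner I mean; a toy example, not what's in the pastebin:)

  words = ["gpt", "claude", "llama"]
  lengths = {w: len(w) for w in words}  # {'gpt': 3, 'claude': 6, 'llama': 5}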
Things AI is "excellent" at, includes "explaining concepts (kinda hit or miss)".
Did you use an AI assistant while making that list?
This looks silly, I admit. I made this correction after reviewing what I’ve written, but should have corrected in 2 places. The list is handwritten, but English is not my native language.
That's entirely fair, but illustrates one of the problems I and others in the thread are having. Code or otherwise, I can't tell if a discontinuity is human or machine generated. Only one of those two things learn from feedback right now; if someone uses AI sometimes it can be hard to tell when they're not using it.
+1, ChatGPT or similar tools are extremely useful if you ask the right questions. I use them for:
- code completion
- formatting: e.g. show it a sample format and dump in unstructured data to convert to the target format
- debugging
- stackoverflow-type stuff
- achieving small specific tasks: what is the linux command for XYZ, etc.
and many of the uses mentioned in the above comment.
I'll give you examples of how it helps me:
1) copilot is a terrific auto complete, and writes tremendous amounts of repetitive boilerplate
2) copilot can help me kickstart writing some complex functions starting from a comment where I tell it what is the input and expected output. Is the implementation always perfect or bug free? No. But in general I just need to review and check rather than come up with the instruction entirely.
3) copilot chat helps me a lot in those situations where I would've googled to find how to do this or that and spent a lot of time with irrelevant or outdated search results
4) I have found use cases for LLMs in production. I had lots of unformatted plain text that I wanted to transform into Markdown. All I needed to do was provide a few examples and it did everything on its own (rough sketch after this list). No need to implement complex parsers, just a query to OpenAI with the prompt and context. A few euros per month in OpenAI credits is still insanely cheaper than paying tons of money for humans to write and maintain software for that use case.
5) It helps me tremendously when trying to learn new programming languages or remembering some APIs. Writing CSS selectors is actually a very good example. But I don't feed it an entire HTML as you do, I literally tell him "how do I target the odd numbered list elements that are descendants of .foo-bar for this specific media query". Not sure why would you need to feed it an entire HTML.
6) LLMs have been extremely useful to generate images and icons for an entire frontend application I wrote
7) I instruct him to write and think about test cases about my code. And it does and writes the code and tests. Often thinks about test cases I would've never thought of and catches nice bugs.
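For point 4, a rough sketch of the kind of call I mean; the model name, the example pair, and the prompt wording are placeholders, and in practice you'd spot-check the output:

  from openai import OpenAI

  client = OpenAI()

  # A couple of worked examples (few-shot), then the real input.
  FEW_SHOT = [
      {"role": "user", "content": "Convert to Markdown:\nTitle: Weekly report\nItems: ship v2, fix login bug"},
      {"role": "assistant", "content": "# Weekly report\n\n- Ship v2\n- Fix login bug"},
  ]

  def to_markdown(plain_text: str) -> str:
      messages = FEW_SHOT + [{"role": "user", "content": f"Convert to Markdown:\n{plain_text}"}]
      resp = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
      return resp.choices[0].message.content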
I really don't buy nor think it can write too much on its own.
The promise of it writing anything but simple boilerplate, I find it ridiculous because there's way too much nuance in our products, business, devices, systems that you need to follow and work on.
But as a helper? It's terrific.
I'm 100% sure that people not using these tools are effectively limiting themselves and their productivity.
It's like arguing you're better off writing code without a type checker or without intellisense.
Sure you can do it, but you're gonna be less effective.
I ended up getting annoyed with the autocomplete feature taking over things such as snippet expansion in vscode, so I turned it off personally. I felt that battling against the assistant made it around a break-even productivity gain overall. The exception is regular expressions, which I have basically offloaded to AI almost entirely for non-trivial things.
Regex is easier to write than read, so that seems like not the best use of AI - assuming you actually verify what it gives you and don't just take it on faith :/
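One way to keep it honest is to hold onto a handful of should-match / shouldn't-match examples and run whatever pattern it hands back against them; the pattern and examples below are only an illustration:

  import re

  # Candidate pattern the LLM handed back (illustrative: ISO date, optional HH:MM time)
  pattern = re.compile(r"^\d{4}-\d{2}-\d{2}( \d{2}:\d{2})?$")

  should_match = ["2024-03-14", "2024-03-14 09:30"]
  should_not_match = ["14-03-2024", "2024-3-14", "2024-03-14 9:30"]

  assert all(pattern.fullmatch(s) for s in should_match)
  assert not any(pattern.fullmatch(s) for s in should_not_match)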
Completely agree. I turned it off and realized I can absolutely fly writing code when copilot stops getting in the way. I only turn on for writing tests now.
I agree. I have it active on VSCode and enjoy it. It has introduced subtle bugs but the souped up autocomplete is nice.
I don't find it very useful for anything non-trivial. If anything, I found it more useful for generating milestones and tasks for a product than for getting even a moderately complex input -> output without me having to check it in a way that annoys me.
I find I don't use copilot chat, almost at all. Nowadays I prefer to go to Gemini and throw in my question.
This is mostly what I'm using it for in this current project. It does it job nicely but it's very far away from replacing myself as a programmer. It's more like a `fn:magic(text) -> nicer text`. This is a good use case. But it's a tool, not a replacement.
Because I get random websites with complex markup, and more often than not every page has its own unique structure. I can't just say give me `.foo-bar` because `.foo-bar` might not exist. Which is where the manual process comes in. Currently, I'm using hand-crafted queries that get fed into GPT / Claude / LLama, but producing the actual selector is what I wanted it to do.
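Roughly the shape of what I'm trying to get working; the model, the prompt wording, and the BeautifulSoup sanity check below are illustrative, not my exact code:

  from bs4 import BeautifulSoup
  from openai import OpenAI

  client = OpenAI()

  def suggest_selector(html: str, target_description: str) -> str:
      # Ask the model for a CSS selector; keep the prompt narrowly scoped.
      prompt = (
          f"Given this HTML, reply with only a CSS selector that matches "
          f"{target_description}:\n\n{html}"
      )
      resp = client.chat.completions.create(
          model="gpt-4-turbo",
          messages=[{"role": "user", "content": prompt}],
      )
      return resp.choices[0].message.content.strip()

  html = open("page.html").read()
  selector = suggest_selector(html, "the product price element")

  # Never trust the selector blindly: check it actually hits something on the page.
  matches = BeautifulSoup(html, "html.parser").select(selector)
  if not matches:
      raise ValueError(f"LLM-suggested selector {selector!r} matched nothing")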
I'm very curious how this behaves in different resolutions. There's a reason vector graphics are a thing. I've used it for this purpose before but it doesn't compare to vectorial formats.
What is the context size of your code? It works for trivial snippets, but as soon as the system is a bit more complex, I find that it becomes irrelevant fairly fast.
Totally agree. But I'm not complaining about its usefulness. I'm a paying user of LLM systems. I use them almost every day. They're part of my products. But this particular hype about it replacing ... me. I don't buy. Yet. It could come tomorrow and I'd be happier for it.
Something I have had issues with too. Copilot chat does indeed suck. I never enjoyed using it within VSCode and they never released it for Jetbrains.
The sweet spot for me is using a plug-in within the IDE that utilizes an API key to the model API. That coupled with the ability to customize the system prompt has been amazing for me. I truly dislike all of the web interfaces, just allow me to pick from some predefined workflows with a system prompt that I create and let me type. Within the IDE I generally have it setup so that it stays concise and returns code when asked with minimal to no explanation. Blazing efficient.
Some of my work involves copyediting/formatting OCRed text; for that it works quite well and saves me a lot of time. Especially if it involves completing/guessing badly OCRed text.
Yep, pretty much exactly this.
Especially #5. I'm certain that I've been at least 10x more productive in learning new tools since chatgpt hit the scene. And then knowing it helps so much with that has had even more leverage in opening up possibilities for thing I'm newly confident in learning / figuring out in a reasonable amount of time. It is much easier for me to say "yep, no big deal, I'm on it" when people are looking for someone to take on some ambiguous project using some toolset that nobody at the company is strong with. It solves the "blank page" issue with figuring out how to use unfamiliar-to-me-but-widely-used tools, and that is like a superpower, truly.
It's pretty decent for "happy path" test cases, but not that good at thinking of interesting edge or corner cases IME, which comprise the most useful tests at least at the unit level.
I'm pretty skeptical of #4. I would be way too fearful that it is doing that plain text to markdown transform wrong in important-but-non-obvious cases. But it depends on which quadrant you need with respect to Type I vs. Type II errors. I just never seem to be in the right quadrant to rely on this in my production projects.
The "really good intellisense" use cases #1-#3 also make up a "background radiation" of usefulness for me, but would not be nearly worth all the hype this stuff is getting if that were all it is good for.
Thanks, I was looking for the sane and verbose reply.
I agree with all of your points and experience the same benefits.
1) Autocomplete is more often than not what I want or pretty darn close.
2) Sometimes I need a discrete function that I am not sure how I want to write. I use a prompt with 3.5/4 inside of my IDE to ask it to write that function.
It is definitely not writing complete programs any time soon but I can see where it's heading in the near term. Couple it with something like RAG to answer questions on library/api implementations. Maybe give it a stronger opinion about what good Python code looks like.
For the naysayers I don't know how you use it but it is certainly useful enough for me to pay for.
But people read much less of what you type now.
A lot of startups are selling the dream/hype of not ever having to learn to code. Be aware that it’s hype. Learn to code if you want to build stuff. They will be tools for those that have the knowledge needed to effectively use them.
I’m actually really amazed by LLMs and think the world is going to change dramatically as a result.
But the “you won’t need to code” reminds me “you won’t need to learn to drive”.
It’s the messy interface with the real world in both cases that basically requires AGI.
If AGI is just a decade off then, yep, I won’t need to code. But a decade is a long time and, more importantly, we’re probably more than a decade away.
And even if it is “just round the corner”, worrying about not needing to code would be worrying about deckchairs on the titanic. AGI will probably mean the end of capitalism as we know it, so all bets are off at that point.
It’s wise to hedge a little but also realise that to date AI is just a coding productivity boost. The size of the boost depends on how trivial the code is. Most of the code I write isn’t trivial and AI is fairly useless at that, certainly it’s faster and more accurate to write it myself. You can get a 50% boost if you’re writing boiler plate all day, but then you have to wonder why you’re doing that in the first place.
+1 for the titanic analogy. If there ever comes a point that we no longer need to learn to code, I’m taking that as a sign that I’m literally living in a matrix-esque simulation.
The point at which someone like myself is allowed to become aware that a company has developed that level of AI is well beyond the point of no return.
Reminds me of the no-code / low-code hype around 2020, tons of startups advertising app-builders that used little, if any, AI. Just blocks that you dragged-and-dropped. While many of them were successful, it seems like overall they didn't really make much of a dent in industry, which I found very curious.
Like, by now you'd think it would be inevitable that we wouldn't be writing software in a text-editor or IDE. Everything else we do on a computer is more graphical rather than textual, with the exception of software development. Why is that?
Part of the reason why I'm kind of bearish on AI is because it seems like we could have replaced written code with GUI diagrams as far back as the 80s, or at the very least in the early 2000s, and it seems like something that should have obviously caught on given that would probably be much easier for the average person. Again though, curiously, we're still using text editors. Perhaps despite the popularization of AI no-code builders we'll still see that the old model of hiring someone good at writing code in a text-editor remains largely unchanged.
Makes me wonder if there's just something about the process that we overlook, and if this same something could frustrate attempts at automating the process of writing code using AIs as much as it frustrated our attempts at capturing code using graphical symbols.
I think you're underestimating the amount of things built with nocode.
I don't think most people are building landing pages by handwriting code anymore. Same with blogs (e.g. Wordpress). There are MVPs of successful businesses that've been built on Bubble.io. Internal dashboards and such can definitely be built without code, such as via Retool or Looker or whatever.
WYSIWYG obviously makes sense for frontend, but less so for backend. For backend code I don't really see how some visual drag and drop editor could make for a better interface than code. And even if it could, the advantage of code is that it's fully customizable (whereas with a GUI you're limited by the GUI), and text itself as a medium is uniform and portable (eg. easy to copy and paste anywhere).
Not to say that we can't create better interfaces than text, but I do think some sort of augmentation on top of a code editor is probably a more realistic short-term evolution, similar to VSCode plugins.
No code tools sell the same dream.
As with everything about AI, HN once again shows a remarkable inability to project into the future.
This site has honestly been absolutely useless when discussing new technology now. No excitement, no curiosity. Just constantly crapping on anything new and lamenting that a brand new technology is not 100% perfect within a year of launch.
Remove "Hacker" from this site's name, because I see none of that spirit here anymore!
I just think there's a bias involved when some people are emotionally invested in AI not being good.
Are you kidding me, I'd say it's 80% people hyping up AI.
This is a post about a present product launch. The future, maybe tomorrow, will be filled with wonder and amazement. Today, we need to understand reality. Not all of us appreciate empty hype. Hackers tinker with reality and build the future. Marketers deal with thin promises.
Wait, wait, you're telling me that a site attended by people who stan for the OG Luddites is no longer worthy of being called "Hacker News"? Or where users with names like "BenFranklin100" extol the virtues of Apple's iOS developer agreement? Say it isn't so.
The trouble is, there's still nowhere better.
I think it requires years of proficiency in the field you are asking about in order to get OpenAI's models to produce meaningful, useful output. I can make use of it, but sometimes it makes me think "how would a newbie even phrase an objection to this misunderstanding or omission?" Currently it seems GPTs are pretty much not on par with the needs of non-experts.
You might be on to something here. It definitely seems to be the case because I'm using multiple different models as part of my everyday process and getting excellent results as a very experienced low level C++ systems engineer.
What is worse is that this seems to be leading to a self-amplifying feedback loop, where people not up to speed enough with the models try to use them, fail, and give up, making them fall even further behind.
Very similar to my experience. I made it generate a novel neuroevolution algorithm with the data structures I imagined for recreational purposes, and to speed things up, it suggested "compiling" the graph by pre-caching short circuits into an unordered_map. A lot of fun was had. (it also calls me captain)
This is my experience too. You have to have deep domain knowledge to really get the LLM to do what you want. Then it saves me a ton of time.
Claude Opus is working for me. It's not perfect but it definitely handles busy work well enough that it's a net positive. Like I add some new fields to a table and ask it to update all the files that depend on the field and it works after 1 or 2 tries. There is a time saving benefit but there is also an avoiding mental fatigue benefit for busywork.
What are you using on top of Claude Opus that helps it access your file system?
Cmd-C, Cmd-V; definitely not ideal.
This is what I use it for too --
Write me the molecular simulation boilerplate because these crappy tools all have their own esoteric DSLs, then I tweak the parameters to my use case, avoiding the busywork -
e.g. "Write me a simulation for methane burning in air"
Gives me a boilerplate, I modify the initial conditions (concentrations, temperatures, etc) and then deploy. Have the LLM do the busy-work, so I dont have to spend ages reading docs or finding examples just to get started.
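To make that concrete, here's the shape of the boilerplate I mean, sketched with Cantera; the mechanism, conditions, and time span are placeholders to tweak, not necessarily what I actually use:

  import cantera as ct

  # GRI-Mech 3.0: a standard methane combustion mechanism shipped with Cantera.
  gas = ct.Solution("gri30.yaml")

  # Initial conditions to tweak per use case: temperature [K], pressure, mole fractions.
  gas.TPX = 1200.0, ct.one_atm, "CH4:1, O2:2, N2:7.52"

  reactor = ct.IdealGasConstPressureReactor(gas)
  sim = ct.ReactorNet([reactor])

  # March the simulation forward and watch the temperature rise as the methane burns.
  t = 0.0
  while t < 5e-3:
      t = sim.step()
      print(f"t = {t:.5f} s, T = {reactor.T:.1f} K")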
Now deploy to a stable environment. That's what I'm trying to help with by building https://atomictessellator.com
In my experiments at Pythagora[0], we've found that sweet spot is technical person who doesn't want to know, doesn't know, or doesn't care about the details, but is still technical enough to be able to guide the AI. Also, it's not either/or, for best effect use human and AI brainpower combined, because what's trivial vs tedious for human and AI is different so actually we can complement each other.
Also, the current crop of LLMs is not there yet for large/largish projects. GPT-4 is too slow and expensive, while Groq is superfast but open source models are not quite there yet. Claude is somewhere in the middle. I expect somewhere in the next 12 months there's going to be a tipping point where they will be capable, fast, and reliable enough to be in wide use for coding in this style[1].
[0] I have an AI horse in the game with http://pythagora.ai, so yeah I'm biased
[1] It already works well for snippet-level cases (eg GitHub copilot or Cursor.sh) where you still have creative control as a human. It's exponentially harder to have the AI be (mostly) in control.
I would clarify that "there" in my "not there yet" doesn't assume superhuman AGI developer that will automagically solve all the software development projects. That's a deep philosophical issue best addressed in a pub somewhere ;-)
But roughly on par with what could be expected of today's junior software developer (unaided by AI)? Definitely.
Almost no products fit this description, and if they do then the marketing department is getting fired.
Does a McDonald's burger look like the picture?
If you go in with a healthy dose of cynicism IMO LLMs can impress. I’d call it a better google search and autocomplete on steroids.
It actually does in the countries that require it. You know you can write ACTUAL "truth in advertising" laws right?
Sometimes? But I don't go to McDonalds for the loss function between the picture and actual product. I go for the fast food and good taste (YMMV).
I use them everyday in one way or another. But they're not replacing me coding today. Maybe tomorrow. And I go in with a healthy dose of optimism when I say this.
Sure, but this particular discussion is not about its ability to replace Google Search and / or Autocomplete.
As a developer who is good at object oriented design, architecture, and sucks at leetcode stuff, I have been able to use it to make myself probably twice as productive as I otherwise would be. I just have a conversation with GPT-4 when it doesn't do what I want. "Could you make that object oriented?" "Could you do that for this API instead, here let me paste the docs in for you."
I think people want it to completely replace developers so they can treat programming as a magic box, but it will probably mostly help big picture architecture devs compete with people who are really good at Leetcode type algorithm stuff.
The competition should be happening in the other direction.
Totally agree. I am not a professional developer. I find programming to be quite dull and uninteresting.
I am going to work on something after this pot of coffee brews that I simply could not produce without chatGPT4. The ideas will be mine but the most of the code will be from chatGPT.
What is obvious is different skill sets are helped more than others with the addition of these tools.
I would even say it is all there in the language we use. If we are passing out "artificial intelligence" to people, the people who already have quite a bit of intelligence will be helped far less than those lacking in intelligence. Then combine that with the asymmetry of domains this artificial intelligence will help in.
It should be no surprise we see hugely varied opinions on its usefulness.
Sounds like a skill issue.
Easy to say. Show me something impressive you've done with AI and little involvement from yourself.
What kind of prompts are you using? You'd be surprised how much better your output is using prompting techniques tailored for your goal. There are research papers that show different techniques (e.g one shot, role playing, think step by step etc) can yield more effective results. From my own anecdotal experience coding with ChatGPT+ for the past year, I find this to be true.
I hack on them till I get something sort of satisfying.
The biggest problem I encounter is context length, not necessarily the output for small inputs. It starts forgetting very fast, whether it's Claude, GPT+ or other self hosted models I've tried.
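One partial workaround (not a fix) is the boring map-reduce pattern: chunk the input, summarize each chunk, then summarize the summaries. A sketch, with the chunk size, model, and prompts being arbitrary:

  from openai import OpenAI

  client = OpenAI()

  def ask(prompt: str) -> str:
      resp = client.chat.completions.create(
          model="gpt-4-turbo",
          messages=[{"role": "user", "content": prompt}],
      )
      return resp.choices[0].message.content

  def summarize_long(text: str, chunk_chars: int = 12_000) -> str:
      # Map: summarize each chunk independently so nothing falls out of the window.
      chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
      partials = [ask(f"Summarize this section:\n\n{c}") for c in chunks]
      # Reduce: summarize the concatenated partial summaries.
      return ask("Combine these section summaries into one summary:\n\n" + "\n\n".join(partials))

It doesn't stop the model forgetting within a chunk, but it at least keeps each call inside the window.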
As a human, I can tell you that I suck at predicting exponential growth.
My personal take is that LLMs are fairly good at replacing low level tasks with intuitive patterns. When it comes to a high level, ambiguous question that actually has an implication for your daily work and your products, LLMs are no more helpful than search engines.
Yeah, AI will do the easy and fun jobs for you. You will only need to care about the difficult decisions that you're going to be responsible for. What a wonderful world...
lol, I was in the same boat until I sank it altogether. I ended up wasting more time arguing with the LLM chat than doing anything remotely useful. I just use it for reference now, and even then I'm not 1000% sure.
Exactly where I'm at! Totally transformative set of tools for me to use to do my day to day work significantly more productively and also a giant distance away from being capable of doing my day to day work.
This is exactly my experience. Furthermore, I've become acutely aware that spending time prompting either a) prevents me from going down rabbit holes, all but denying me the kind of learning that can only really happen during those kinds of sessions, and b) prevents me from "getting my reps in" on stuff that I already know. It stands to reason that my ability to coax actually useful information out of LLMs will atrophy with time.
I'm quite wary of the long-term implications and downstream effects of that occurring at scale. AI is typically presented as "the human's hands are still on the wheel," but in reality I think we're handing the wheel over to the AI -- after all, what else would the endgame be? By definition, the more it can do without requiring human intervention, the "better" it is. Even if replacing people isn't the intention, I fail to see how any other effect could usurp that.
Assuming AI keeps developing as it has been, where will we be in 20 years? 50? Will anyone actually have the knowledge to evaluate the code it produces? Will it even matter?
Perhaps it's because Dune is in the air, but I'm really feeling the whole "in a time of increased technology, human capabilities matter more than ever" thing it portrays.
I totally agree!
And I'm sure the reason for that is the garbage input. From time to time I have to perform quantitative code analyses in our so-called enterprise repositories, and the results are shocking every time. I found one extremely poor SQL code block that type-casts many columns in hundreds of projects. It was simply copied again and again even though the casting was no longer necessary.
The training base should be sufficiently qualified (and StackOverflow ranking is obviously not enough).
But unfortunately it's probably too late for that now. Now inexperienced programmers are undercutting themselves with poor AI output as training input for the next generation of models.
The other day I thought I had the perfect task for AI: cleaning up some repetitive parts in my SCSS and leveraging mixins. It failed terribly and was hallucinating SCSS features. It seems to struggle in the code <-> visual realm.
I highly doubt you're the dumb one here.
Yeah I'll only give it tasks where it needs to spot patterns and do something obvious, and even then I'll check it make sure it hasn't just omitted random stuff just for shits and giggles.
TBH I'm more surprised when I don't need to help it now. After about 3 times where it cycles between incorrect attempts I just do the job myself.
I disabled copilot since it consistently breaks my flow.
Rather, your “problem” is that you’re likely not writing uninteresting cookie cutter boilerplate that everyone can do and has done hundreds of times. The current crop of AI is cool for coding demos, not for solving real relevant problems.
The people hyping this crap only care about the second part of that sentence. The first one is an afterthought.
This is an interesting post. An expert in numerical analysis compares the output of a tool which optimizes floating point expressions for speed and accuracy with the output generated by chatgpt on the same benchmarks:
https://pavpanchekha.com/blog/chatgpt-herbie.html
This has been exactly my experience with chatgpt as well.