
Meta's new LLM-based test generator

ajmurmann
49 replies
1d17h

I find it interesting that generally the first instinct seems to be to use LLMs for writing test code rather than the implementation. Maybe I've done too much TDD, but to me the tests describe how the system is supposed to behave. This is very much what I want the human to define and the code should fit within the guardrails set by the tests.

I could see it being very helpful, though, for an LLM to point out underspecified areas. Maybe having it propose unit tests for underspecified areas is a way to do that, and is what's happening here?

Edit: Even before LLMs were a thing, I sometimes wondered if monkeys on typewriters could write my application once I've written all the tests.

skissane
15 replies
1d17h

Maybe I've done too much TDD, but to me the tests describe how the system is supposed to behave. This is very much what I want the human to define and the code should fit within the guardrails set by the tests.

People who work on legacy code bases often build what are called “characterisation tests” - tests which define how the current code base actually behaves, as opposed to how some human believes it ought to behave. They enable you to rewrite/refactor/rearchitect code while minimising the risk of introducing regressions. The problem with many legacy code bases is nobody understands how they are supposed to work, sometimes even the users believe it is supposed to work a certain way which is different from how it actually does - but the most important thing is to avoid changing behaviour except when changes are explicitly desired.

makeitdouble
5 replies
1d8h

Yes. There are also other (better) ways to solve this issue: for instance, sampling inputs/outputs in production and setting them in stone in the tests.

An issue with going with LLMs will be validating whether the behaviors described are merely tolerated or actually correct. Another will be whether something is actually tested (e.g. a code change still wouldn't break the test). Overly granular output checks would be an issue as well.

All in all this feels like a bad idea, but I hope to be wrong.

wickedsickeune
3 replies
1d7h

This is called a "golden master" (giving X input to the system and recording the output as a test expectation). The difference with the parent is that it is way less granular, so both have value.

weebull
2 replies
1d6h

This is something I've yet to see a software testing framework do: compare the results of two different implementations separated in time, i.e. two different revisions out of source control, using one as the golden reference for the other.
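A minimal sketch of that idea with pytest, assuming a hypothetical `legacy_process` function under test: the golden file recorded against the trusted revision gets committed to source control, and later revisions have to reproduce it.

    # Golden-master sketch: record the reference revision's outputs once,
    # then require later revisions to reproduce them exactly.
    import json
    import pathlib

    import pytest

    from myapp import legacy_process  # hypothetical function under test

    GOLDEN = pathlib.Path(__file__).parent / "golden" / "legacy_process.json"
    SAMPLE_INPUTS = ["order-123", "order-456", "order-789"]

    def current_outputs():
        return {inp: legacy_process(inp) for inp in SAMPLE_INPUTS}

    def test_matches_golden_master():
        if not GOLDEN.exists():
            # First run against the trusted revision: record, don't assert.
            GOLDEN.parent.mkdir(exist_ok=True)
            GOLDEN.write_text(json.dumps(current_outputs(), indent=2, sort_keys=True))
            pytest.skip("golden master recorded; commit it and re-run")
        expected = json.loads(GOLDEN.read_text())
        assert current_outputs() == expected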

adregan
0 replies
23h49m

Isn’t that kind of what snapshot tests can do?

CHY872
0 replies
1d4h

bazhenov/tango does something like this for performance tests: basically, to counter system behaviour you run the old and new implementations at the same time.

hitchstory
0 replies
1d6h

The problem is that xUnit-style tests are really bad at this and make it tedious - which is why people gravitate to LLMs for writing them. LLMs on the surface look like they can relieve the pain of using bad abstractions - but they're still a band-aid on a gaping wound. 20 years ago we'd be using them to write ugly raw PHP.

Characterization tests ideally need to not be written in code but defined in something resembling a configuration language - something without loops, conditionals, methods, etc. There then needs to be a strict separation of concerns kept between these definitions and code that executes them.
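To make that split concrete, here is a rough Python sketch of the separation (the `tax_for` function and the cases are hypothetical): the test definitions are plain data, with no loops or conditionals, and a single generic runner executes them. In a real setup the definitions could just as well live in a YAML file.

    import pytest

    from myapp import tax_for  # hypothetical function under test

    # Definitions: just inputs and expectations - no loops, conditionals or methods.
    CASES = [
        {"name": "zero income",  "input": {"income": 0},      "expected": 0},
        {"name": "basic rate",   "input": {"income": 20_000}, "expected": 4_000},
        {"name": "higher rate",  "input": {"income": 80_000}, "expected": 27_500},
    ]

    # Executor: the only place with code, shared by every definition.
    @pytest.mark.parametrize("case", CASES, ids=lambda c: c["name"])
    def test_characterisation(case):
        assert tax_for(**case["input"]) == case["expected"]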

I wrote a testing framework (with the same name as my username) centered around this idea. Because it is YAML based, expected textual outputs can be automatically written into the test from actual inputs which saves tons of time and you can autogenerate readable stakeholder documentation that validates behavior.

It might seem unbelievable but with decent abstractions, writing tests and TDD stops being a chore and actually starts being fun - something you won't want to delegate to an LLM.

totetsu
4 replies
1d16h

Couldn't an LLM, provided with the right level of logs, write really good characterization tests?

crowcroft
2 replies
1d16h

That seems like a perfect use case. Quickly find all the footguns you didn't know to look for.

wahnfrieden
1 replies
1d15h

as long as you have a process to dismantle the tests and move fully over to a new system, if you are indeed migrating/upgrading. leaving a legacy thing dangling and tightly coupled tests lingering for years happens easily when going from 95% to 100% can cost too much for management and stakeholders in various ways relative to other pressing needs

skissane
0 replies
1d14h

as long as you have a process to dismantle the tests and move fully over to a new system, if you are indeed migrating/upgrading. leaving a legacy thing dangling and tightly coupled tests lingering for years happens easily when going from 95% to 100% can cost too much for management and stakeholders in various ways relative to other pressing needs

Characterisation tests are not supposed to be tightly coupled – they are supposed to be integration/end-to-end tests not unit tests – the point is to ensure that some business process continues to produce the same outputs given the same inputs, not that the internals of how it produces that output are unchanged. Code coverage is used as an (imperfect) measure of how complete your set of test inputs is, and as a tool to help discover new test inputs, and minimise test inputs (if two test inputs all hit the same lines/branches, maybe it is wasteful to keep both of them–although it isn't just about code coverage, e.g. extreme values such as maximums and minimums can be valuable in the test suite even if they don't actually increase coverage.)

They can take the form of unit tests if you are focusing on refactoring a specific component, and want to ensure its interactions with the rest of the application are not changed. But at some point, a larger redesign may get rid of that component entirely, at which point you can throw those unit tests away, but you'll likely keep the system-level tests

skissane
0 replies
1d16h

A significant part of writing characterisation tests can be simply staring at a code coverage report and asking “can I write a test (possibly by modifying an existing one) which hits this line/branch”. Sometimes that’s easy, sometimes that’s hard, sometimes that’s impossible (code bases, especially crapulent legacy ones, sometimes contain large sections of dead code which are impossible to reach given any input).

An LLM doesn't have to always get it right to be useful: have it generate a whole bunch of tests, run them all, keep the ones which hit new lines/conditions, maybe even feed those results back in to see if it can iteratively improve, stop when it is no longer generating useful tests. Hopefully, that addresses most of the low-hanging fruit, and leaves the harder cases to a human.
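A loose sketch of that loop in Python; both hooks are hypothetical stand-ins (`generate_test` wraps whatever LLM you call, `lines_covered` runs a single candidate test and reports which lines it covered, e.g. via coverage.py):

    from typing import Callable

    def grow_test_suite(
        generate_test: Callable[[str, set[str]], str],  # (source, covered so far) -> test code
        lines_covered: Callable[[str], set[str]],       # test code -> lines it covers
        source: str,
        max_stale_rounds: int = 20,
    ) -> list[str]:
        kept: list[str] = []
        covered: set[str] = set()
        stale = 0
        while stale < max_stale_rounds:
            candidate = generate_test(source, covered)  # feed coverage back to the LLM
            new_lines = lines_covered(candidate) - covered
            if new_lines:
                kept.append(candidate)                  # keep only tests that add coverage
                covered |= new_lines
                stale = 0
            else:
                stale += 1                              # stop once generation goes stale
        return kept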

There already exist automated test generation systems which can do some of this–for example, concolic testing-but an LLM can be viewed as just another tool in the toolbox, which may sometimes be able to generate tests which concolic testing can’t, or possibly produce the same tests quicker than concolic testing would. There is also the potential for them to interact synergistically - the LLM might produce a test which concolic testing couldn’t, but then concolic testing might then use that to discover further tests which the LLM couldn’t.

ajmurmann
1 replies
1d16h

Agreed, that's a great use case for autogenerated tests.

dmarchand90
0 replies
1d8h

From my experience, I use LLMs for writing tests because LLMs are much better at writing tests than the application code. I suspect this might be due to the fact that the code ends up being a very detailed and clear prompt to the LLM.

Muromec
1 replies
1d8h

Imagine dealing with, say, COBOL, where even the team that was hired to maintain it is going to retire. It's the closest to the "lost technology" trope from sci-fi we have ever been.

giantrobot
0 replies
1d3h

Imagine dealing with, say, COBOL, where even the team that was hired to maintain it is going to retire. It's the closest to the "lost technology" trope from sci-fi we have ever been.

The technical aspect of the code is not actually the difficult part WRT maintenance. Someone that knows COBOL as a language can figure out what some code does and how it does it. It takes time but that is information that can be derived if you just have the code.

The main problem with COBOL is the code is often an implementation of some business or regulatory process. The COBOL maintainer retiring is taking knowledge of the code but more importantly the knowledge of the literal business logic.

The business logic and accounting/legal constraints aren't something that can necessarily be derived from the code. You can know some bit of code multiplies a value by 100, but you can't necessarily know if it's supposed to do that. If the source code and documentation don't capture the why of the code, the how doesn't help the future maintainer.

Often with COBOL the people that originally defined the why of the code are not just retired but dead. The first generation of maintainers may only have ever received partial knowledge of the why, so even the best documenters have holes in their knowledge. They may have had exposure to the people who originally defined the why but neglected, or didn't have an opportunity, to document some aspects of it. The subsequent generations of maintainers are constrained by how much of the original why was documented.

Edit: pre-coffee typo

xboxnolifes
4 replies
1d17h

I find it interesting that generally the first instinct seems to be to use LLMs for writing test code rather than the implementation. Maybe I've done too much TDD, but to me the tests describe how the system is supposed to behave. This is very much what I want the human to define and the code should fit within the guardrails set by the tests.

I feel the same way about how test code is viewed even outside of AI. A lot of the time the test code is treated as lower-priority code given to more junior engineers, which seems like the opposite of what you would want.

zeroonetwothree
2 replies
1d16h

When I do code review I always review the tests first. If they look thorough and reasonable, then I can be a lot less careful reviewing the rest.

postalrat
1 replies
1d12h

Tests never cover everything so exactly what are you looking for?

Muromec
0 replies
1d8h

For example, just check the list of unit tests, see that there is one for the happy flow but none for the error flow. Check that the number of tests correlates well with the apparent cyclomatic complexity of the code. Check that the tests are actually defining and testing relevant behavior and not just "I called method A, so method B was called".

pydry
0 replies
1d9h

This is how I feel too. For me, tests usually end up being a concrete specification which I can execute.

Getting LLMs to write tests is like getting LLMs to write my spec.

closeparen
3 replies
1d14h

Covering all the "if err != nil { return err }" branches in Go is pretty mindless work.

randomdata
2 replies
1d7h

Your tests only need to assert what failure states the user should expect under what conditions, not cover how it should be implemented. If, say, your implementation uses panic/recover instead, your tests shouldn't care. Asserting 'how' the code is to be implemented is how you get brittle tests that are a nightmare to maintain.

And making those assertions shouldn't be mindless work. Documenting the failure cases should be the most interesting work of all the code you are writing. If you are finding that it isn't, then that tells you that you should be using a DSL that has already properly abstracted the failure cases for your problem space away.

closeparen
1 replies
1d

* The user isn't inside the package, but the test has to be or it's not a unit test (and coverage doesn't count).

* While "errors are values" makes it possible to enumerate the potential failure cases, most of the time errors are just strings. The thing you are forced to document by the lack of exception semantics is the sites at which you might be dealing with an error vs. a success value. In an IO heavy application this rounds up to "everywhere."

* Go is not really powered to do DSLs in a type-safe way, except through code generation. I would view the LLM as a type of code generator here.

randomdata
0 replies
22h57m

* The test isn’t inside the package either. In fact, Go in particular defines _test packages so that you can ensure that there is explicit separation. You are not testing units if you don’t look at the software the same way the user does.

* If your errors are strings, you are almost certainly doing something horribly wrong. Further, your tests would notice it is horribly wrong as making assertions on those strings would look pretty silly, so if your errors are strings that tells that your testing is horribly, horribly wrong.

* Go is not a DSL, no. It is unabashedly a systems language. Failure is the most interesting problem in systems. Again, if your failures aren’t interesting, you’re not building a system. You should not be using a systems language, you should be using a DSL.

When you have no idea what you are doing, choosing the wrong tool at every turn, an LLM might be able to help, sure.

TeMPOraL
3 replies
1d10h

FWIW, writing the implementation is a much more pleasant/interesting experience, because you're writing the actual thing the application is supposed to do. In contrast, when writing tests, you're describing what the application is supposed to do using an extremely bloated, constrained language, requiring you to write dozens or hundreds of lines of setup code just to be able to then add a few glorified if/else statements.

In my experience, at least in languages like C++ or Java, unit tests are made of tedium, so I'm absolutely not surprised that the first instinct is to use LLMs to write that for you.

robryk
0 replies
1d9h

This is my experience, unless I try to make the contract of the thing simple to test via property testing. Then, writing tests often becomes basically an exercise in writing down the contract in a succinct way.

Sadly, this is a rare approach, so if you cooperate with others it's hard to use it.

ric2b
0 replies
1d7h

Yeah, I suppose this is very language dependent. In Ruby I actually quite enjoy writing tests, whereas in Java it was boilerplate pain.

Muromec
0 replies
1d8h

I actually enjoy writing unit tests, but I do frontend stuff. For me it's the moment of calm and reflection when I step back and look carefully at what I built and say "nice, it works" to myself.

madeofpalk
2 replies
1d16h

This is very much what I want the human to define and the code should fit within the guardrails set by the tests.

Most systems are pretty predictable. it("displays the user's name") isn't very novel, and is probably pretty easy for an LLM to generate.

andreasmetsala
0 replies
1d8h

it("displays the user's name") isn't very novel, and is probably pretty easy for a LLM to generate.

Arguably the implementation that passes such a test is even simpler, making it a bit questionable why we have the human write that part.

ajmurmann
0 replies
1d13h

Well, someone needs to define that the username should be shown in the first place

grogenaut
2 replies
1d14h

One reason I can think of is that many engineers really don't do testing. They write tests after the fact because they have to. I've worked with a bunch of engineers who will code for days then write a few tests "proving" the system works. They have low coverage and are usually brittle.

This system would be a godsend in the minds of engineers who think / operate that way.

I've also had managers who told me I wasn't allowed to write tests first as it was slower. Luckily I was able to override / ignore them as I was on loan ("take it up with my boss"). They're probably thinking the same as the above engineers.

Another way to think of this is most devs hate documentation... if they had an AI that would write great docs from the code they'd love it. And to these devs, docs they don't have to write are great docs :)

tjpnz
0 replies
1d13h

I've also had managers who told me I wasn't allowed to write tests first as it was slower.

Sounds like a great place to work.

antifa
0 replies
1h23m

managers who told me I wasn't allowed to write tests first as it was slower.

You're supposed to write them together, at the same time. What's slower is spending all of your time mousing around the GUI like a caveman, then having missing tests, then trying to debug every recurring regression for the rest of the project's lifespan.

The only thing that's slower is perhaps trying to do TDD for the first time. After that, you're embarrassed at how much time you wasted wandering in a browser before your API tests were finished.

ralusek
1 replies
1d17h

I basically agree with this, but with some caveats. I often find there are maybe 5% of the tests I should write that only I could write, because they deal with the specifics of the application that actually give it its primary purpose/defining features. As in, it's not that there is any test I believe AI eventually wouldn't be able to write, it's more that there are certain tests that define the "keyframes" of the application, and without defining those explicitly, you'd be failing to describe your application properly.

For the remaining 95% of uninteresting surfaces I'd be perfectly happy to let an AI interpolate between my key cases and write the tests that I was mostly not going to bother writing anyway.

ajmurmann
0 replies
1d17h

You are probably right, and the percentages change with the language and framework being used. When I write Ruby I write enormous amounts of tests, and many of these could probably be derived from the higher-level integration tests I started with. In Rust, on the other hand, I write very few tests. I wonder if this also shows which code could be entirely generated based on the high-level tests.

janosdebugs
1 replies
1d9h

This kind of thinking is sadly lost on many. I have seen copious amounts of nonsensical tests slapped full of hard-wired mocks and any change in the functionality would break hundreds of tests. In such a scenario an LLM might be the bandaid many are looking for. Then again, the value of such tests is questionable.

robryk
0 replies
1d8h

My best example of that was a test asserting that a monitoring metric changes in some way, with a comment expressing doubt about whether the asserted values are correct (but whoever wrote it still wrote the test that enshrined the behaviour that they themselves doubted).

torginus
0 replies
1d8h

Imo there are 2 kinds of programming:

- Software engineering, which is akin to real engineering, as it involves designing a complex mechanism that fits a lot of real-world constraints. Usually you need to develop a sophisticated mental model and exploit it for the desired results. Involved implementations and algorithms usually fall into this category.

- Talking to computers, which is about describing what you need to the computer. Usually focuses on the 'what', as the 'how' is trivial. Examples include HTML/CSS, Terraform, and very simple programs (like porting a business process flow from a flowchart to code). And, indeed, test code.

LLMs are terrible at the former, but great at the latter.

summerlight
0 replies
21h47m

There are multiple reasons, including what you mentioned. The first is that test code is generally considered "safe" to write and change, so it won't be the end of the world even if the LLM does something subtly wrong. The next is that reading a test change is usually easier than writing it, which is the entire idea of golden/approval tests. And finally... people generally don't like writing tests, which is probably the biggest reason...

pokstad
0 replies
1d16h

I agree, humans should write tests. Humans are the oracles of the program output who know whether the code did the right or wrong thing.

I’m guessing they want to automate tests because most engineers skimp on them. Compensating for lack of discipline.

pacoverdi
0 replies
1d8h

I write (at least) 2 kinds of tests:

- TDD, which as you say describes the system's behavior. But it often deals with the nominal cases. It is hard to predict all that can go wrong in the initial development phase.

- tests designed to reproduce a bug. The goal of these is to try very hard to make the system fail, taking inspiration from the bug's context

Maybe this LLM test generator could allow us to be more proactive with the second kind?

mrbonner
0 replies
1d16h

I wrote a simple LLM-backed chat application. My primary usage right now is to copy-paste the code I have written (Java and Python) into the chat and ask it to generate unit test cases. I think it has reduced my development time by a huge amount. It also generates tests for edge cases. The generated code is usually usable 90% of the time. It is also very good at making mocks for service calls. I'm using the Claude 2.1 model with Bedrock.

It's nowhere near as fancy as the FB tool, but I know it is blessed by the company.

mike_hock
0 replies
1d6h

Passing tests don't guarantee correctness over all possible inputs, and especially not freedom from vulnerabilities. I'd rather have the code written by a human who actually understands it. Especially if the AI just gets re-prompted after failed attempts until the tests pass.

AI-generated tests can work like compiler/sanitizer warnings. If they fail, you can audit them and decide if it was a true or false positive.

makk
0 replies
1d13h

I find it interesting that generally the first instinct seems to be to use LLMs for writing test code rather than the implementation.

When you try to get the LLM to write the code, you find that it’s easier to get it to write the tests. So you do that and publish about that first.

benreesman
0 replies
1d13h

At the risk of telling you something you already know, I'd bring to your attention, for example, property-based testing, probably most popularized by Hypothesis, which is great and which I recommend, but by no means the only approach or high-quality implementation. I think QuickCheck for Haskell was around when it got big enough to show up on HN.

Just in case any reader hasn't tried this, the basic idea is to make statements about code's behavior that are weaker than a totally closed-form proof system (which also has its place), stated as "properties" that are checked up to some inherently probabilistic bound, which can be quite useful statements.

The "canonical" example is reversing a string: two applications of string reverse are generally intended to produce the input. But with 1 line of code, you can check as many weird Unicode edge cases or whatever as you have time and electricity.
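For anyone who hasn't seen it, a minimal Hypothesis version of that property might look like this (the `reverse` helper is just for illustration):

    from hypothesis import given, strategies as st

    def reverse(s: str) -> str:
        return s[::-1]

    @given(st.text())  # arbitrary Unicode strings, including the weird ones
    def test_reverse_is_its_own_inverse(s):
        assert reverse(reverse(s)) == s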

I know this example seems trite, but I met this because some hard CUDA hackers doing the autodiff and kernels and shit that became PyTorch used it to tremendous effect and probably got 5x the confidence in the code for half the effort/price.

It doesn't always work out, but when it does it's great, and LLMs seem to be able to get a Hypothesis case sort of close, or at least closer than starting from scratch.

anu7df
0 replies
1d14h

I really believe this "application" is the result of thinking about tests as a chore and a requirement without great benefits. Your thought of an LLM writing the application given the tests is interesting, also because test pass/fail is an optimization that can be run online by the LLM to improve the result without human feedback.

MASNeo
0 replies
1d9h

If you had as many monkeys as there are parameters in an LLM, they might run your business ;-)

I dread the morning after a night of getting something to work…somehow.

holoduke
30 replies
1d19h

I have been using copilots for a few months now and it really makes me a 2x more productive developer. It's like you become an orchestrator of a dev team. You still need to look into details, but things just flow much, much faster. I can only imagine what it will be like if I also have access to an AI debugger / end-to-end tester, completing the loop and making it super efficient. Also, I think this is not only the case for developers. Even for lawyers it could be the same thing. The expected production output of a worker is going to rise. The ones who do not embrace AI assistants in the near future will have a hard time in the future.

unshavedyak
12 replies
1d19h

I still haven't figured out how people use this that much. I use LLMs almost daily via ChatGPT, Phind, Kagi Ultimate (my current one), etc. However, I spend so much time pushing the LLM towards my goal that I can't imagine it speeding up my coding.

I clearly find value in LLMs to some degree, but speeding up my coding is not yet one of them... I'd love it to, but I just don't understand. I can only imagine typing more in explanation for the LLM than it would take me to write it to begin with.

swatcoder
1 replies
1d19h

Some revered novelists spend all day to write a single page and others are able to get in a flow and produce a chapter in that same time.

Without getting into touchy questions of quality and talent and the whole 10x trope, there are a lot of working engineers out there that produce code a lot more slowly than others and that work on more common problems than others. My sense is copilot-like products provide the biggest boon to those people and are a lot harder for people who are more naturally prolific or who work on more esoteric things.

holoduke
0 replies
1d16h

Have you ever tried it? It sounds like you are a bit against the idea of an AI supporting your work.

freedomben
1 replies
1d19h

It all seems to depend on what you are doing. The more niche and technical the task, the less the AI can help. If you are just generating CRUD endpoints for a common web language and framework, it can be an 80% boost.

I think the real problem with this is that people aren't differentiating these different types of work when they give these numbers.

ickyforce
0 replies
1d18h

It's not just the type of work but also experience. Here are two cases:

1. After working for many years in Java I needed to build a service. I spent a few days designing it and then a month on implementation. I used DBs and libraries I knew very well. I didn't need to access google/stackoverflow, I didn't need to look up names of (std)lib methods and their parameters, and if something wasn't working it was fairly obvious what I needed to change.

2. Recently I wanted to create a simple page which fetched a bunch of stuff from some URLs and showed the results, simple stuff. But with React, since that was what the frontend team was using. I had never used React and have rarely touched the web in recent years. Most of my time was spent googling about React and how exactly CORS/SOP work in browsers, and with polishing it took a couple of days.

I'm pretty sure that in case 1) AI wouldn't help me much. Maybe just as a more fancy code completion.

In case 2) AI would probably be a significant time save. I could just ask it to write some draft for me and then I could make a few tweaks, without having to figure out React.

But somehow nobody quantifies their experience with the languages/tools when they are using AI - I'm sure there's a staggering difference between 1 month and 10 yoe.

block_dagger
1 replies
1d15h

Try this: in a situation where you need a small change to an existing class that you haven't looked at in a long time, dump the code and spec to ChatGPT with a request to add the feature or make the change along with a supporting spec. This can really speed up getting to the final result.

unshavedyak
0 replies
23h24m

I feel like there are too many files (lots of out-of-file references) to do that easily. It's probably fine for something like Copilot, but manually feeding it to ChatGPT is not something I've had much faith in.

samatman
0 replies
1d15h

So I wanted to take a heatmap and bin it into fifteen values on a logarithmic basis. So I asked ChatGPT to do it for me, and it just worked.

Would this have been difficult for me to just write? No. I wouldn't call it difficult. But it would have involved effort, which is a resource I'm happy to conserve. It's like the difference between a sandwich you make and a sandwich you ask someone to make.

Then I asked it to generate the integer ranges which correspond to each bucket. It screwed that one up, which I found out by copying the function to a scratch file and trying some representative values in the REPL. So I told it to iterate all the values between min and max and generate the ranges that way. That one worked. Net of less effort, plus I had a test for it which I could copy-paste from the REPL to the test suite.
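For a rough sense of the task, here is one reasonable reading of it sketched in Python; the exact bucketing scheme used isn't given, so this is only illustrative:

    import numpy as np

    def log_bin(values: np.ndarray, n_bins: int = 15) -> np.ndarray:
        # Bucket values into n_bins logarithmically spaced bins.
        edges = np.logspace(0, np.log10(values.max() + 1), n_bins + 1)
        return np.digitize(values, edges[1:-1])  # bucket index in 0..n_bins-1

    def bucket_ranges(vmax: int, n_bins: int = 15) -> list[tuple[int, int]]:
        # Brute force, mirroring the fix described above: bucket every integer
        # from 0 to vmax and read off each bucket's first and last member.
        edges = np.logspace(0, np.log10(vmax + 1), n_bins + 1)
        buckets = np.digitize(np.arange(vmax + 1), edges[1:-1])
        ranges = []
        for b in range(n_bins):
            idx = np.flatnonzero(buckets == b)
            if idx.size:
                ranges.append((int(idx[0]), int(idx[-1])))
        return ranges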

Faster? Maybe, maybe not. But I'm rate-limited by gumption, not minutes. When it's easier to describe the function than write it, and it's simple enough that the robot won't screw the pooch, I hand it off to the LLM. It's a great addition to the toolkit.

okdood64
0 replies
1d13h

The co-pilot equivalent tool I use makes me about 10-15% more productive. Noticeable but not life changing.

oblio
0 replies
1d19h

IMO a chunk of it is like Peel developers: people that prioritize their own speed above everything else (readability, team reviews, etc.). I'd love to be proven wrong.

holoduke
0 replies
1d16h

It's usually with simple things like code completion on logging, boilerplate, repetitive tasks, etc. I am surprised sometimes how Copilot knows what my next step is. I am typing in the start of an algorithm and Copilot gives me the rest. And it predicts a lot of things correctly. It saves me tons of time. Another thing I like is that I no longer need to know all the programming-language-related syntax. In the past I looked up stuff on stackoverflow. Now I simply type a short comment like "reverse the array...". Copilot automatically suggests the right syntax. Sometimes it needs some adjustments, but that's fine.

azeirah
0 replies
1d18h

For myself I have several distinct cognitive weaknesses that I can often just unload onto an AI in a conversational style.

For instance, I know I'm good at breaking up large tasks into smaller ones, but because planning and executive functioning is by far my weakest, most energy-consuming skill (ADHD), I can save a lot of energy (but not time) by approaching it conversationally. I'd often use colleagues for the same thing, but then my productivity costs 2 salaries.

Similarly, I have some trouble with memory and cognitive speed when I'm tired which is unfortunately often the case due to my health, I know well enough what I kinda "want" to do and I let the AI generate something that comes near what I need and I can work from there once I have the right starting point.

Just my personal experience I'd wanted to share.

010101010101
0 replies
1d19h

I rarely find interfacing with an external chat interface useful, but integration with the coding environment (e.g. Copilot) is an immediate productivity boost.

rglover
7 replies
1d19h

The ones who do not embrace AI assistants in the near future will have a hard time in the future.

The exact opposite will be true and the funny (sad?) part is that they will lack the skills necessary (because they got lazy and over-trusted the AI) to fix mistakes/incompatible solutions.

skwirl
3 replies
1d17h

People said the same thing about garbage collectors.

swatcoder
1 replies
1d17h

And sure enough, people practiced only in garbage-collected environments are the ones who struggle most to work with Rust's borrow checker, write sound embedded/IoT code, or attend to reference leaks in things like event listeners.

Did they help people write lots of effective code faster? Yes.

Did they breed a generation of people with little intuition around how memory works? Also yes.

samatman
0 replies
1d15h

This statement boils down to "people who know C or C++ have an easier time learning Rust" which is not especially informative, or even interesting.

dns_snek
0 replies
22h57m

That's a terrible comparison. GC is there to take care of one specific implementation detail in a "good enough" way so that you can focus on more productive work.

People are starting to rely on LLMs to do almost everything they've been hired to do for them.

fhd2
2 replies
1d18h

My thinking as well. If coding assistants become so good that I'm at a competitive disadvantage, I'll just start using them. It's not rocket science. So far, everything I've tried largely slowed me down. As a Google/SO replacement for some types of questions, they sure save me maybe an hour per week, but that's really all I could extract so far.

Maybe my work is not too typical though, I spend only a fraction of my time actually typing in code. And I do eliminate the need for boilerplate through other means (picking frameworks/libraries that are a good fit for the problem, refactoring, meta programming, scripts, suitable tool chain etc).

zmgsabst
1 replies
1d17h

1 hour per week of increased coding is a 5-7% boost in productivity, using the Amazon guidelines for how SDEs use their time — 50%/20hrs for SDE1 and 33%/13hrs for SDE3.

Is that enough to be competitive?

I’m not sure — but at scale that would be a 5% reduction in headcount for the same work, or ~$12M/yr for every 1,000 engineers.

If you can figure out how to get 2-3 hours more coding done a week, we’re talking real gains.

fhd2
0 replies
1d17h

Depends on where the time is saved. If it's in figuring out how to do something in Django where StackOverflow is flooded with outdated answers, sure. I see those kinds of savings. But the tragic beauty of programming is that a little time saved today can very well mean lots of time lost later. The former you can measure, the latter is a tougher nut.

zdragnar
6 replies
1d19h

Even for lawyers it could be the same thing

Funny you mention that, since lawyers have already gotten in trouble for citing fictional cases when submitting work performed by chatgpt.

It's useful for rote things, but for anything that you depend on, you still need to give it just as much attention as if you'd done it yourself.

skissane
2 replies
1d18h

Funny you mention that, since lawyers have already gotten in trouble for citing fictional cases when submitting work performed by chatgpt.

For something like legal work, you don’t want a raw LLM. You want an agent integrated with a legal database, in such a way that it can’t cite cases which don’t exist in the database. And it can’t generate a direct quote from a case unless that text actually occurs in the case.

You still need a human lawyer familiar with the law to pick up on subtler errors, but better technology can prevent the grosser ones. And the subtler errors (e.g. misrepresenting what a case says through partial or out-of-context quoting) are the kind of error human lawyers sometimes make too - like every other profession, lawyers vary greatly in their competence, and often only the grosser cases of incompetence incur sanctions.

zmgsabst
1 replies
1d18h

Tried Lexis+ AI, and… it’s just not very good yet.

Much like ChatGPT, it can only handle recall and summation — and even then, I don’t fully trust it because it often misses key ideas.

And much like ChatGPT, it can’t do anything coherent at length without a lot of help and working around its faults.

And seems entirely unaware of similar sounding words having distinct legal meanings. Which is not good.

skissane
0 replies
1d17h

Tried Lexis+ AI, and… it’s just not very good yet.

I wonder to what extent that’s due to current inherent limitations of the technology, and to what extent it is due to quality of implementation issues.

It is hard to say because (I presume) there is only limited public information on how it is actually implemented.

e.g. which LLM is it using? How much fine-tuning has been done? Are they using other potentially helpful techniques such as guided sampling? Or breaking down the task into parts and having multiple agents each specialised to handle one particular part?

Much like ChatGPT

When you compare it to ChatGPT, do you mean GPT3.5 or GPT4?

I also would guess that certain areas of law (especially criminal law) may be prone to triggering “safeguards” which result in poorer performance than if those safeguards were absent. Arguing that what your client did was legal (even be it ethically unsavoury) is an essential part of a lawyer’s job

qup
1 replies
1d19h

I agree, but it does change the shape of the work you would be doing. Validating work isn't the same as generating new work.

Whether that's better, or useful, probably differs by situation.

blibble
0 replies
1d19h

Validating work isn't the same as generating new work.

in many cases it's way harder

cloverich
0 replies
1d18h

you still need to give it just as much attention as if you'd done it yourself.

It's different from a lawyer's case, where the facts require manual cross-referencing. In our case, that verification can come via directly and immediately running the code. How good the actual code is varies, but the fundamental difference remains.

taude
0 replies
1d18h

in the future, it seems like we might just become PR reviewers

rco8786
0 replies
1d16h

I’ve had it enabled for months across both Javascript and Kotlin codebases and it’s…fine? Good enough that I leave it enabled. But only barely. I’m certainly not orchestrating a dev team.

It has probably the same productivity boost that intellisense gave back when it came out. Which is good, but still marginal. Certainly not replacing anyone’s job.

siliconc0w
20 replies
1d18h

Good testing is hard to do - coverage is not a categorical good. You can easily write too many tests that calcify programs and basically just create a change-detector program. Oh, it looks like you changed something, oh no - all the tests are broken, but it's okay, we can now ask the LLM to regenerate them! 100% Coverage! Amazing! What progress!

webdood90
9 replies
1d17h

... basically just create a change-detector program

interesting perspective - why do you think this is a bad thing?

to me, it's an opportunity to verify that the change is intended. without it, how do you know that the program does what it is supposed to do?

whoisjuan
3 replies
1d17h

Not OP, but I don't think test-driven development resonates with everyone who writes code.

I don’t want to write tests for everything. I just want to write the ones that matter.

nyrikki
1 replies
1d17h

That is a common misconception about TDD.

TDD is _about_ writing tests that matter, but most people think it is about writing all unit tests first.

If you are following TDD anywhere close to the way it is described, you will only be writing tests that relate to domain functionality first.

Note how it is described here, although it is terse.

https://martinfowler.com/bliki/TestDrivenDevelopment.html

The coverage-metric-as-a-goal writing style doesn't work for TDD; sorry you were exposed to that.

You are correct that that model doesn't work.

randomdata
0 replies
1d6h

> The coverage-metric-as-a-goal writing style doesn't work for TDD

Coverage is not a goal of TDD, but in practice you will have 100% coverage by following TDD as you would never have reason to write code that isn't covered by test.

Ultimately, the purpose of coverage tools is to let you know what you might have forgotten to clean up during a refactor, to help you remove what you missed.

randomdata
0 replies
1d6h

TDD or not, why would you write tests for things that don't matter?

More importantly, why are you writing any code for things that don't matter?

elicksaur
1 replies
1d16h

How do you know that the tests accurately define what the code is supposed to do?

Another way, if you know what the code is supposed to do, why write it down in two places?

bbojan
0 replies
1d10h

Another way, if you know what the code is supposed to do, why write it down in two places?

This would be like criticizing double-entry accounting by asking "if you know what the amount is, why write it down in two places?"

We write the code down in two places because that gives us advantages that far outweigh the added effort:

- Once written, your test will catch regressions forever

- A test is often excellent documentation on what the code does

- It's now much easier to refactor the code, making it more likely that it will be refactored when needed.

siliconc0w
0 replies
1d17h

Without deliberate tests it can be very difficult and time consuming to parse out intended change from unwanted or incidental change.

nyrikki
0 replies
1d17h

It tightly couples domain needs with implementation details.

Thinking of it as a leaky abstraction helps me.

I try hard to separate domain logic tests from implementation specific tests.

Your code could be loosely coupled with high cohesion, but with lots of random tests like you get when code coverage is a performance metric, you have to add a lot of complexity that only relates to an implementation.

IshKebab
0 replies
1d7h

It can be. E.g. consider GUI testing where people sometimes take automatic "golden" screenshots. The problem there is that if you change some minor thing that e.g. moves text by 1 pixel it will fail the test even though that's probably fine.

People then get used to just blindly updating the golden images. It becomes basically "the output changed, do you want to continue anyway" which is not the most useful thing. You really want it to say "the output is wrong".

suzzer99
5 replies
1d17h

Agreed. Good tests are an order of magnitude harder than good code.

brabel
3 replies
1d9h

I don't know where you're getting that from but it's simply wrong. Testing is quite easy IMO. I've been working at the same place for almost 10 years and I introduced our testing framework. We have hundreds of thousands of tests. New guys have some trouble to get started, but after a couple of months they're writing tests for our systems like a pro.

Most tests are use-case based or written for checking error-handling.

Use-case tests are easy to write: you don't even need to be a programmer (in fact, it's good if use-cases are defined by a Product Owner or Tester), though of course some cases are only known to the programmer as they're the only ones who dive into the details. The programmer should come up with all use-cases the PO missed, of course, and judge whether or not they need to test those too... sometimes it's ok to not test as the cost-benefit is low. Anyway, once you have this use-case based test mentality, it's very easy to write the tests (using a proper language to do it is important! Don't use just JUnit if you're doing Java as it will be really tedious to write and you will stop midway - I know, I've been there... I highly recommend using Spock, though other frameworks to make writing test pleasurable exist).

This applies mostly for "integration tests". For unit tests, hopefully you don't find them difficult to write?! I find them quite easy to write since I know how to write testable code, which takes a while to learn but once you do, it's really easy.

If you have examples of difficult to write tests, I would be curious to see it! Perhaps we can discuss how to make them easy.

IshKebab
1 replies
1d8h

You can't just say "testing is quite easy". It totally depends on what you're testing and how thoroughly you want to test it.

For example GUI testing is IMO still an unsolved problem. Maybe AI will help there but existing solutions are generally not worth the pain.

I work in silicon verification and the testing we do is way way way more thorough than software testing, for obvious reasons. Do you formally verify your software? Unlikely.

I can only assume you work in an easy-to-test domain on a project that doesn't have changing requirements, like... I dunno a C compiler or something.

brabel
0 replies
1d5h

The kind of application we test is probably more complex than most as we're a product company supporting a huge number of integrations and specifications.

You're dismissing my claim by basically saying I am naive. Which is not an honest argument as you know absolutely nothing about me (and I don't want to tell you more than I did here).

About changing requirements: what does that have to do with testing at all? If requirements change, you basically discard the tests for the old behaviour and start over...

I would say that the testing we do is very close to formal verification because it's close to being comprehensive - though no, we do not use methods normally classified as such. I tried to but the benefit we would get over our current approach would be negligible.

By the way, I do a lot of UI testing, and dare I say it: yes, it's easy too.

We use this sort of thing if you're curious: https://gebish.org/manual/current/#pages

Again, if you find a real example of something you find hard to test, let me know so I can evaluate it against my own situation.

viraptor
0 replies
1d9h

The positive ones (use cases) are usually pretty straightforward. It's once you get to failures like "how does the whole system recover if one packet, 3 steps into a transaction, is corrupted?" that it gets hard. If the system is complex enough, that warrants a whole internal fault injection framework and can take a really long time to make reliable and usable.

To be fair most projects don't care beyond "rollback the database and maybe display a 5xx error or something". But some do. Anyway, use cases are fine. It's the failures / edge cases that cause pain.

postalrat
0 replies
1d12h

Which is why they should be treated as a waste of time unless specific tests can be justified.

pshc
1 replies
1d15h

One gig I worked at had web component tests where they committed a snapshot of the expected DOM and asserted that the component spat it out... so for every subsequent change the dev would naturally hit the re-generate button and commit it all. Plentiful deltas, questionable signal.

gen220
0 replies
20h7m

Counterpoint, those tests are really useful when you're working on a shared sub-component/library and want to understand the scope of a change, or want to be confident that what should be a no-op change from the caller's point of view is indeed a no-op.

But yea 9/10 times that a snapshot test fails, it's noise rather than signal.

hinkley
0 replies
1d12h

I know for sure that code with no coverage has terrible tests. For everything else I have to read through five other people's idea of a good test.

We are all terrible at writing tests. We just find our own ways to do it.

3abiton
0 replies
1d12h

It's all about the long tail cases.

elzbardico
13 replies
1d18h

I feel for the future maintainers of all this crappy LLM legacy code. It's gonna be ugly.

bongodongobob
2 replies
1d17h

Agreed.

LLMs will never get any better than they are right now and haven't improved at all in 2 years. Just fancy Markov chains.

The only way they can be used to write code is by people who don't know how to code blindly committing code to prod without any review whatsoever.

People who do know how to code couldn't possibly have a use case and it won't make them any more productive.

I'm just going to ignore all this LLM nonsense that isn't changing the world at all and you definitely should too.

SnowTile
1 replies
1d13h

Disagree, I find them very useful to quickly explain new libraries or do tedious things like regex

bongodongobob
0 replies
23h41m

My post is absolutely doused in sarcasm.

bigfudge
2 replies
1d17h

I suspect it will be no worse than enterprisey code. It might even look quite similar, although the comments and docs will be more thorough and less likely to be actively wrong.

jachee
0 replies
1d17h

…unless the LLM hallucinates the comments and docs.

Nathanba
0 replies
1d17h

It will be worse because there will be a lot more of it

duderific
1 replies
1d17h

So, I guess LLMs are actually creating jobs rather than destroying them. Not exactly fun jobs though.

steve_adams_86
0 replies
1d12h

Not exactly well paid either, I suspect.

block_dagger
1 replies
1d15h

I too feel compassion for the AI agents that will be dealing with this code. 99% of human developers will be out of the loop by then.

travoc
0 replies
1d14h

I’m old enough to remember the first time they said this about offshoring.

armchairhacker
1 replies
1d15h

Just delete the tests, problem solved. Your CI dashboard even gives you the green checkmark.

steve_adams_86
0 replies
1d12h

This made me think of that midwit meme with “delete the tests, green check mark in CI” on either side of the graph and “100% coverage” in the centre. Not totally valid, but… A bit of truth, haha.

Maybe the right side should be something about fuzzing and using static types. Use systemic and automated checks. I’m a little ashamed that I thought in memes so readily.

idle_zealot
0 replies
1d18h

Surely we will get LLMs to maintain it.

LASR
9 replies
1d19h

Everyone by now should be writing unit tests using ChatGPT4.

I paste in functions / classes I want to write unit tests for. Paste in a sample unit test, and it does a solid job of writing tests for it in the same manner as in the sample.

For unit tests, you don't even need the multi-step coverage optimization in this article. You just manually inspect, adjust it etc.

interroboink
4 replies
1d19h

What about people who don't trust sending their code to a 3rd party for processing?

(edit: I didn't downvote you, but I do think your claim is over-broad)

ric2b
1 replies
1d4h

I think those people overestimate how useful it is to get a bunch of random code snippets from some company, that may or may not even reach production unchanged.

dns_snek
0 replies
22h48m

I think you're underestimating the power such third parties are going to have in the future.

There's a 100% likelihood that OpenAI and other LLMs providers are going to cooperate with intelligence agencies looking to deploy their backdoors to businesses across the world.

It's the perfect supply chain attack vector with complete deniability.

biot
1 replies
1d17h

I suspect a lot of people overestimate how special their code is. Also, if your code exists in a private repo on GitHub, then you're already trusting the same third party when using GitHub Copilot.

dns_snek
0 replies
22h42m

This isn't just about IP. In the not so distant future we'll find out about 3 letter agencies using companies like OpenAI to deploy corporate backdoors with ease.

petesergeant
2 replies
1d18h

Doesn’t that just bake in any bugs in the original implementation?

samatman
0 replies
1d15h

Oddly enough it doesn't.

The secret sauce is: the robot doesn't run the tests. It digests the code and writes tests for it.

So it doesn't know if they'll pass or not. And sure enough, some of them don't, because the code was buggy.

I've seen some awful code come out of ChatGPT, but never a bad test. Good tests are short, which makes them hard to screw up. It has a reasonable grasp on what an edge case is as well.

romwell
0 replies
1d18h

Of course not!

The Power of AI™ can figure out the true intent of the code by looking at the initial (and, potentially, buggy) implementation, and help the programmer by generating edge test cases where the code doesn't produce correct results.

The programmer will easily tell those test cases from the ones where the AI did a mistake and generated a flawed test case because the AI doesn't make such silly mistakes; clearly, it's the programmer's code that needs to be corrected.

In fact, the AI would do a better job at that, too, which clearly speeds up the development cycle.

The correct way to use the tool is to let the AI both generate the test cases and modify the code so that it would pass the tests it generates.

After all, if the AI can't figure out what you wanted to do in the first place — how can you?

Of course, there's more to it.

Whichever problems you run into can be surely attributed to writing the code in an AI-unfriendly way.

In the past, we had the adage that the code is read more times than it's written. This is still true, but we need to abandon the old habit of having a human reader in mind.

Just like you rearrange your furniture to make your house more accessible to the robot vacuum cleaner, you need to write the code with the AI in mind.

When you write a Google query or a prompt for ChatGPT (effectively the same thing anyway), you don't write it like you'd talk to a person.

You're going to have to write code the same way to be truly effective, and think a little bit like AI to get the most use out of it.

That might sound like a lot of work to get code that does what you want.

But of course, that's not the case.

Just use the AI for this.

swatcoder
0 replies
1d19h

If I have a choice between a clever, practiced antagonist to plan and write my tests and an automated system that can fit common testing patterns to my code... I'm always going to get more robust results from the former.

But yeah, if you're just working solo on basic stuff and need to protect against off-by-one errors and accidental mutations in later refactoring, it's a great tool. You'd never write duly antagonistic tests for your own code anyway.

nicklecompte
7 replies
1d19h

I don't want to review this whole thing but one part in particular seems way off. [Caveat: I sorta-read the original paper shortly after it was posted, my memory is fuzzy and I am only skimming it now.]

From the blog:

Most of the test cases created by Meta’s TestGen-LLM only covered an extra 2.5 lines. However, one test case covered 1326 lines! The value of that one test case is exponentially more valuable than most of the previous test cases and exponentially improves the value of TestGen-LLM. LLMs can vigorously “think outside the box” and the value of catching unexpected edge cases is very high here.

Of course "exponentially more valuable" should set off your BS detector. But to verify, from the paper:

However, this result arose due to a single test case, which achieved 1,326 lines covered. This test case managed to ‘hit the jackpot’ in terms of unit test coverage for a single test case. Because TestGen-LLM is typically adding to the existing coverage, and seeking to cover corner cases, the typical expected number of lines of code covered per test case is much lower....The median number of lines of code added by a TestGen-LLM test in the test-a-thon was 2.5. This is a more realistic assessment of the expected additional line coverage from a single test generated by TestGen-LLM.

Nowhere do the authors mention "unexpected edge cases" or "thinking outside the box." They clearly present this 1,326 lines of coverage test as a fluke, e.g. maybe the test case checked one branch of a horrible switch statement, or perhaps it was even a fluke in how code coverage is counted. It is noteworthy that the authors do not seem to have looked into it any further, even in the "qualitative results" section.

Inaccurate editorializing really doesn't help anyone. The internet is too damn full of people pretending to understand things they pretended to read.

engineercodex
5 replies
1d18h

Hey! Thanks for your comment - I'm the one who wrote this article. I wasn't trying to say that the paper authors talked about "unexpected edge cases" or "thinking outside the box." I edited the post to be more clear that some of these takeaways are my own opinions.

This article is less a summary of the paper and more commentary on what the results of the paper entail. After all, Hacker News is meant for discussion :)

I will say though that I do believe that I still stand by the "exponentially more valuable" portion. I think the fact that LLMs can fluke their way into "hitting a jackpot" in terms of test coverage is exactly why they're so valuable. When you have something constantly trying out different combinations, if it hits even one jackpot, like in the paper, it's extremely valuable to the team. It's a case that could have been either non-obvious or simply too tedious to write a test for manually. I think there's tremendous value in that, especially speaking as someone who has spent way too much time simply figuring out how to test something within a Big Tech codebase (F/G) when I already knew what to test.

digdugdirk
1 replies
1d17h

Thanks for engaging with the above constructive criticism, it's a refreshing change from what is sadly the norm.

One additional question - do you foresee any issues with this application where LLMs enter a non-value-add "doom loop"? I can imagine a scenario where a test generation LLM gets hooked on the lower-value simplistic tests, and yet management sees such a huge increase in the test metric ("100x increase in unit tests in an afternoon? Let's do it again!") that they continue to bloat the test suite to near-infinity. Now we're in a situation where all future training data includes a complete cesspool of meaningless tests that technically add coverage, but mostly just cover edge cases that only an LLM would create.

Not sure if that makes sense, but tl;dr - having LLMs in the loop for both code creation and code testing seems like it's a feedback loop waiting to happen, with what seems like solely negative repercussions for future LLM training data.

samstave
0 replies
1d16h

Perhaps there should be domains of focus for the test LLMs - even if they are clones, each assigned to only a particular domain, with their results submitted as PRs, etc...

Why not treat every LLM as a dev contributing to git, such that humans or other LLMs need to gatekeep in case something like that happens? (Start by treating them as interns, rather than professors with office hours.)

camkego
1 replies
1d17h

Pedantic warning here. In fast-and-loose day-to-day English, "exponentially more" means "fast growth" or "a whole lot". But that usage is meaningless! Why? Because, technically, you can't have exponential growth without an independent variable. You can have exponential growth as a function of time, height, spend, distance, any freaking metric or variable. But it has to be as a function of a value.

You CAN'T have exponential growth that is not a function of some value or variable or input.

I suppose in this case you could argue you have exponential growth as a function of the discrete choice between using-an-LLM and not-using-an-LLM, but I've never heard of exponential growth as a function of a discrete variable like that.

Often people using the term "exponential growth" in common English don't understand what it means. Sorry.

atq2119
0 replies
1d3h

Spot on.

FWIW, exponential growth as a function of a discrete variable is very common (e.g. all of algorithmic complexity), but it has to be (at least modeled as) an unbounded numeric variable.

You can't have exponential growth as a function of a binary variable.

nicklecompte
0 replies
1d16h

Seconding digdugdirk's comment :) Thanks for the thoughtful response and I apologize if I came across as mean.

My problem is we have no clue what those lines actually were. If it was effectively dead code, then it's not surprising that it was untested, and the LLM-generated test wouldn't be valuable to the team. We have no clue what the value of the test actually was, and using a single stat like "lines of code covered" doesn't actually tell us anything. Saying the test was "exponentially more valuable" is pure speculation, and IMO not an especially well-founded claim. (Sort of like saying people who write more lines of code are more productive.)

This speculation seems downright irresponsible when the paper specifically emphasizes that this result was a fluke. When the authors said "hit the jackpot" they did not mean "hit the jackpot with a valuable test", they meant "hit the jackpot with an outlier that somewhat artificially juked the stats." I truly believe if the LLM managed to write a unusually valuable test with such broad coverage they would have mentioned it in the qualitative discussion. Instead they went out of their way to dismiss the importance of the 1,326 figure.

fermentation
0 replies
1d17h

The incentives at meta around code production are all wrong. The team behind this is absolutely gearing this around lines of code and number of diffs produced. This'll just be another codegen tool creating another mountain of code that is difficult to debug.

gxt
7 replies
1d18h

Elementary tests, like unit tests, should be mechanically generated by walking the AST, with differences ack'd and snapshotted when committing. Every language should come with this built-in.
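Roughly, a minimal sketch of the idea in Python (illustrative only; the module path, the example-input table, and the assumption that functions are pure and callable with recorded inputs are all mine, not anything TestGen-LLM does):

```python
import ast
import importlib
import json
from pathlib import Path

SNAPSHOT_FILE = Path("snapshots.json")

def discover_functions(source_path, module_name):
    """Walk the AST to collect top-level function names, then import them.
    Assumes source_path and module_name refer to the same (hypothetical) module."""
    tree = ast.parse(Path(source_path).read_text())
    names = [node.name for node in tree.body if isinstance(node, ast.FunctionDef)]
    module = importlib.import_module(module_name)
    return {name: getattr(module, name) for name in names}

def current_behavior(functions, example_inputs):
    """Record each function's output for its example inputs (empty tuple if none given)."""
    return {name: repr(fn(*example_inputs.get(name, ())))
            for name, fn in functions.items()}

def changed_functions(functions, example_inputs):
    """Compare against the acknowledged snapshot; create it on first run."""
    current = current_behavior(functions, example_inputs)
    if not SNAPSHOT_FILE.exists():
        SNAPSHOT_FILE.write_text(json.dumps(current, indent=2))  # ack'd at commit time
        return []
    recorded = json.loads(SNAPSHOT_FILE.read_text())
    return [name for name, value in current.items() if recorded.get(name) != value]
```

A pre-commit hook could then regenerate the snapshot, show the diff for `changed_functions(...)`, and only update snapshots.json once the developer acks the change.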

superb_dev
4 replies
1d18h

What exactly are we testing at that point?

Groxx
3 replies
1d18h

Ensuring that `if x == 1` works when x == 1.

Very important. Very valuable.

Imagine if `if err != nil { return err }` just stopped working tomorrow. Your tests would detect it! Outage prevented!

Cthulhu_
1 replies
1d18h

You're being sarcastic but honestly, I've never found a regression because of a unit test.

Just in the past few days, though, I did find two bugs that would've been prevented if the original code had been covered by a decent unit test.

sangnoir
0 replies
1d15h

You're being sarcastic but honestly, I've never found a regression because of a unit test

So you've never made a change that caused a unit test to fail? If not, how large is your codebase, and is ownership shared across multiple teams?

I caught dozens of latent or unreported bugs by writing unit tests for a 6kloc JS app which had 0% coverage before.

cgdub
0 replies
1d17h

I don't write Perl or Ruby anymore, but this would have been immensely helpful back then.

bluefishinit
1 replies
1d18h

This is called compiling with a type system.

gxt
0 replies
17h45m

Beyond the type system, you're snapshotting exact behavior. Differences between x = y and 2x = y that the types don't encapsulate, even in Rust.
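A tiny, made-up illustration (hypothetical function, not from the paper): both bodies below satisfy the same type signature, so only a recorded behavior snapshot tells them apart.

```python
def scale(y: float) -> float:
    return y        # the "x = y" behavior
    # return 2 * y  # the "2x = y" behavior has the exact same type signature

# acknowledged input -> output pairs, i.e. the snapshot
RECORDED = {3.0: 3.0, 7.5: 7.5}

def test_scale_matches_snapshot():
    for given, expected in RECORDED.items():
        assert scale(given) == expected
```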

samsk
3 replies
1d19h

Finally, some AI codegen that makes sense to me.

ShamelessC
2 replies
1d19h

Finally?

refulgentis
1 replies
1d19h

There's a persistent rather large minority that has a nuanced take: it can't write code they like (don't want to edit), but it's great for weekend projects (where they're trying new things without established personal preferences).

Forest for the trees if you ask me, but to each their own.

skissane
0 replies
1d19h

Sometimes, I find writing pseudocode easier than code. And then I ask an AI to turn it into code. Sometimes the results aren't too bad, and just need a few tweaks for me to use; overall I've saved mental effort compared to translating the pseudocode into code by hand. And if the results aren't useful, I've only wasted a few seconds, and then I just have to do it manually.

kissgyorgy
3 replies
1d18h

What future? LLMs got into our tech stack faster than a new JavaScript framework gets created! If you are not using some kind of Copilot TODAY, you are missing out on a lot.

qwertox
0 replies
1d18h

Yes, but for me ChatGPT today was more useless than a rubber duck. Only when I said "thanks for nothing" did it try to turn all the blabla into code, which was unusable.

GitHub Copilot, on the other hand, as an intelligent IntelliSense, is absolutely great; a real blessing and gift to coders.

nozzlegear
0 replies
1d17h

YMMV, but trying to use any code that ChatGPT or Copilot generates for F# (my main language) just leads to a lot of compilation errors or, worse, subtly incorrect code.

josefresco
0 replies
1d18h

My success rate for writing code with ChatGPT or Copilot is about 5%. It started out much better, but now I can't get either to fix any mistakes or generate useful code. Anecdotal, but it hasn't changed my life as a coder.

romwell
2 replies
1d18h

Yeah, after working in the semiconductor industry (computational lithography), where test-driven design is the norm... I'm not convinced.

I'm not saying that writing tests before the production code is something that should always be done.

But tests are just as much a part of the codebase as anything else, and absolutely must be written alongside the code being tested.

The most important part of a test is that it showcases the intent of the developer. A test suite demonstrates the following:

* How the code should be used

* What the code does

* What the code doesn't do

* What it was written for

Then when that code is used or modified by another developer, they don't have to hunt for clues in the codebase like they're Sherlock Holmes.

If the tests aren't telling a story, you're writing tests wrong.
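For example, a pair of hypothetical pytest tests (the `apply_discount` function and its rules are invented purely for illustration) whose names and assertions carry that story:

```python
import pytest

def apply_discount(price: float, code: str) -> float:
    """Toy implementation so the example runs: 'WELCOME10' takes 10% off."""
    if code == "WELCOME10":
        return round(price * 0.9, 2)
    raise ValueError(f"unknown discount code: {code}")

def test_welcome_code_takes_ten_percent_off_the_listed_price():
    # How the code should be used, and what it does
    assert apply_discount(100.0, "WELCOME10") == 90.0

def test_unknown_codes_are_rejected_rather_than_silently_ignored():
    # What the code doesn't do, and why it was written this way
    with pytest.raises(ValueError):
        apply_discount(100.0, "TOTALLYFAKE")
```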

And until the computers gain the ability to read your mind and do a better job at understanding what you want to do, AI/LLM-based generators can't do this job for you.

Of course, if the only goal of your test suite is getting a green checkmark on a pre-commit check (and being able to show great coverage numbers), then yeah, you can double your productivity with AI.

Automatic code generators will surely help you write more bad code at lightning speed.

And if others complain that tons of boilerplate make the code bloated and hard to understand — just tell them to use AI to deal with it. Worked for you!

That really does seem to be the future of development. But not the future I'm looking forward to.

azeirah
1 replies
1d18h

I agree with almost everything you said, although I do think this type of testing has a place.

There are different types of testing; what you're describing sounds to me like testing the "core" of your code: part documentation, part validation, part stability, etc.

Other types of testing, like fuzzing, provide an entirely different kind of value. I believe this AI-driven testing can occupy a similar space, targeting tests at the tail end of the distribution: many tests, each with little individual value, providing extra coverage where human energy and time are lacking.
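As a rough sketch of that tail-end value, here is a minimal property-based test with the hypothesis library; `normalize_path` and the idempotence property are made up for illustration:

```python
from hypothesis import given, strategies as st

def normalize_path(p: str) -> str:
    """Toy implementation: collapse repeated slashes."""
    while "//" in p:
        p = p.replace("//", "/")
    return p

@given(st.text(alphabet="ab/", max_size=20))
def test_normalization_is_idempotent(p):
    # Many generated cases, each individually low-value, covering the tail.
    once = normalize_path(p)
    assert normalize_path(once) == once
```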

That is how I see the current state of AI tooling regardless, as a cognitive assistant.

I'd be surprised if this line of research doesn't end up being very fruitful in the coming years.

romwell
0 replies
1d18h

That I can fully agree with (particularly, comparison with fuzzing).

Your comment presents a way more grounded perspective on the future of LLMs in programming than the article does.

sandGorgon
1 replies
1d9h

using private, internal LLMs that are probably fine-tuned with Meta’s codebase.

What does this mean? I would have thought they would simply use Code Llama. Is there any research around privately fine-tuned code LLMs? Why would they be better?

mdaniel
0 replies
23h39m

I am not privy to Meta's situation, and to be honest don't have any hands-on experience with fine-tuned code LLMs, but my mental model is that any corpus of rules will always produce better outcomes when taking local norms into consideration. It's a silly one, but code formatting style is a perfect example: a hypothetical Google model fine-tuned on the Google codebase will more easily produce code that already follows their documented code style merely because it has seen more "already correct" examples. Variable nomenclature, method ordering, what things are versus are not documented, any nullability annotations (where appropriate to the language), etc. are more examples that spring to mind.

More germane to this discussion, I would guess a locally tuned model will also recognize the kinds of things they care about testing, up to and including spotting any bug-fix tests that were hard-won and carrying them forward in generated tests for future code.

avereveard
1 replies
1d11h

Developers will do anything not to write tests

mdaniel
0 replies
23h43m

My life experience has been that it is often the intersection of two very hard problems: test-thinking is a learned skillset that often consumes a lot more active thought than implementation-thinking, and, as I repeatedly and loudly say to my team, testing is always AGAINST REQUIREMENTS. No requirements means no accurate tests, only busywork/metric-gaming. And, as I also always point out: no, your fever-dream one-sentence statement of outcome is not a "requirement".

The bad news is that often the business folks don't know what they want, either, which is how "agile" became a thing. I recognize the ship has sailed on that, but it's "cake and eat it too" to think one can have good tests and ship "PoCs that do something valuable" in 2-week increments.

aussieguy1234
1 replies
1d17h

Already done it with GPT-4.

I showed it a TypeScript module, asked it to generate a unit test, and it made a working test covering not only the happy paths but a few edge cases as well.

ramoz
0 replies
1d17h

Yea… agree.

I’m not resonating with the downvotes here on similar comments.

ChatGPT goes above and beyond for me in many ways.

Tests seem… easy in terms of gpt capabilities.

Last week I had it write Python that traversed an AST and constructed a React Flow graph, as well as the component. I made no edits, went through a few iterations of prompt feedback, and it worked great. I've observed many similar interesting abilities from GPT.

yes_man
0 replies
1d17h

I think the future of development is the other way around. Devs and PMs define the goalposts with tests, AI will handle the implementation

paradoxyl
0 replies
1d8h

Just another way to censor the free speech of those who oppose the technocracy, or "private-public partnership" or whatever weasel words they use to take away freedom from the masses.

jimbob45
0 replies
1d16h

For greenfield projects, these LLM coders would be invaluable. For my old codebase with observed requirements and magic numbers? Lol it’s going to be just as confused as I am.

haliskerbas
0 replies
21h39m

Nice, this will make people 15% more effective so we can do another 10% company-wide layoff at least!

galaxyLogic
0 replies
1d11h

How does the AI know what tests it should write?

I think this is an interesting experiment, but somewhat dubious. The way I see AI best helping software development is that I, the programmer, have a question about my or somebody else's code, which the AI then answers, sometimes with a code proposal but not always. It should be able to answer questions like "Is there a way to simplify this code? What are some inputs that would cause an error?" etc.

AI should help us understand the code, and understand how to improve it. Not write all of it on its own, because if we don't tell it what to do, it cannot know what we want it to do. Tests are a good example. What do we want it to test?

cavisne
0 replies
1d15h

Doesn’t Meta famously not do much testing at all? I.e., they use experiments to “test in prod”.

bjackman
0 replies
1d7h

These papers are interesting but I think it's impossible to have a valuable opinion without practical experience using the tool and reviewing its output on a codebase you know well.

Everyone seems to feel one way or the other about AI code, it's a very political topic. But I would just wanna try it and see.

This is pretty interesting, because a lot of these technologies are staggeringly expensive to develop. The AI tooling I've used so far has been somewhat useful, but if it doesn't get much better it won't have been worth the cost that was paid to create it.

I'm pretty optimistic about what will be achieved but even with my optimism it's far from clear that it's actually gonna pay for itself.

anoopelias
0 replies
1d15h

I thought that unit tests are a balance. A balance of not too much, not too little. "Too little" means you are not covered on the edges. "Too much" means the tests are too rigid and it's scary to change the code.

Ideally, "one change" (Whatever that might be) in production code should cause exactly 1 test to fail.

How does TestGen-LLM address this problem?

acituan
0 replies
1d16h

Unless well separated, this will easily turn developer-hostile, with some clueless management demanding high coverage and enthusiastic juniors smuggling in massive amounts of AI tests, so that at the end of the day you will need to get a rubberstamp from a hard-to-maintain pile of LLM-generated test code each time you want to submit your work.

Yes, authoring some tests might be sped up, but not necessarily maintaining them - or maintaining the code under test, because you are not necessarily generating good ones. Not to mention that sweating over tests usually helps developers check the design of the code early on too; if it's not very testable, it's usually not a good design either, e.g. not sufficiently abstracted component contracts, which suck in a context where you need to coauthor code with others.

What some people miss is that tests are supposed to be sacrificial code, most of which will not catch anything during its lifetime - and that is OK, because it gives automated peace of mind and saves us from potential false clues when things fail. But that also means maximum investment into a probabilistic safeguard is not gonna pan out at all times; you will always have diminishing marginal utility as coverage tops out. Unless you're writing some high-traffic part of the execution path - e.g. a standard library - touting high coverage is not gonna pay off.

Not to mention that almost always an ecology of tests needs to be there - not just unit tests but integration, system, etc. - to keep the thing chugging at the end of the day. Will LLMs sit in the design meetings and understand the architecture to write tests for those too? Or will what they can do be oversold at the expense of what should be done? A sense of "what is relevant" is needed while investing effort in tests - not just at write-time but also at design-time and maintain-time - which is what humans are pretty OK at, and AI tools are not.

What LLMs can save time on is the keystrokes of an experienced developer who already has a sense of what is a good thing to test and what is not. They can also be - and have been - a hindrance, leading developers to smuggle not-so-relevant things into the code.

I don't want an economy of producing keystrokes, I want an appropriately thought-out set of highly relevant keystrokes, and I want the latter well separated from the former so that their objective utility - or lack thereof - can be demonstrated in time.

Temporary_31337
0 replies
1d9h

All this to write another CRUD app ;)

TeeWEE
0 replies
1d11h

The proof is in the pudding, show me the code!

In my experience LLMs are smart but sometimes inconsistent, and over a long chat one might say things that are logically self-contradictory… when you tell it that, it confirms it.

It just seems like it lacks a consistent world view.

I don’t trust them yet. Maybe with even more scale they become better.

They act a little bit like young children, with a lot of domain knowledge.

MASNeo
0 replies
1d9h

Ok, so test case generation has been around a while and now that it is working, where is the GitHub Action?

Jtsummers
0 replies
1d16h

Quoting myself (lightly edited) from when the paper itself came up. They misrepresent the stats in their writeup.

https://news.ycombinator.com/item?id=39406726

Their abstract doesn't match their actual paper contents. That's unfortunate. Their summary indicates rates in terms of test cases:

75% of test cases built correctly, 57% passed reliably [implying test cases by context], and 25% increased coverage [same implication]

The actual report talks about test classes, where each class has one or more test cases.

(1) 75% of test classes had at least one new test case that builds correctly.

(2) 57% of test classes had at least one test case that builds correctly and passes reliably.

(3) 25% of test classes had at least one test case that builds correctly, passes and increases line coverage compared to all other test classes that share the same build target.

Those are two very different statements. They even have a footnote acknowledging this:

For a given attempt to extend a test class, there can be many attempts to generate a test case, so the success rate per test case is typically considerably lower than that per test class.

But then in their conclusion they misrepresent their findings again, like the abstract:

When we use TestGen-LLM in its experimental mode (free from the confounding factors inherent in deployment), we found that the success rate per test case was 25% (See Section 3.3). However, line coverage is a stringent requirement for success. Were we to relax the requirement to require only that test cases build and pass, then the success rate rises to 57%.

Fricken
0 replies
1d10h

Meta likes to release positive news about itself in the wake of its competitors' misfortunes.