I find it interesting that generally the first instinct seems to be to use LLMs for writing test code rather than the implementation. Maybe I've done too much TDD, but to me the tests describe how the system is supposed to behave. This is very much what I want the human to define and the code should fit within the guardrails set by the tests.
I could see it as very helpful though for an LLM to point out underspecified areas. Maybe having it propose unit tests for underspecified areas is a way to look at that, and is what's happening here?
Edit: Even before LLMs were a thing, I sometimes wondered if monkeys on typewriters could write my application once I've written all the tests.
People who work on legacy code bases often build what are called “characterisation tests” - tests which define how the current code base actually behaves, as opposed to how some human believes it ought to behave. They enable you to rewrite/refactor/rearchitect code while minimising the risk of introducing regressions. The problem with many legacy code bases is nobody understands how they are supposed to work, sometimes even the users believe it is supposed to work a certain way which is different from how it actually does - but the most important thing is to avoid changing behaviour except when changes are explicitly desired.
Yes. There are also other (better) ways to solve this issue: for instance, sampling inputs/outputs in production and setting them in stone in the tests.
An issue with going with LLMs will be validating whether the behaviours described are merely tolerated or actually correct. Another will be whether something is actually being tested (e.g. a behaviour-changing code change might still not break the test). Overly granular output checks would be an issue as well.
All in all this feels like a bad idea, but I hope to be wrong.
This is called a "golden master" (giving X input to the system and recording the output as a test expectation). The difference with the parent is that it is way less granular, so both have value.
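For anyone who hasn't seen one, a minimal golden-master sketch might look something like this (the function and file names are hypothetical, Python):

```python
# Minimal golden-master sketch: record the system's current output once,
# then fail if it ever changes. Function and file names are hypothetical.
import json
from pathlib import Path

from billing import compute_invoice  # hypothetical legacy code under test

GOLDEN = Path("tests/golden/invoice_sample.json")
SAMPLE_INPUT = {"customer": 42, "items": [{"sku": "A1", "qty": 3}]}

def test_invoice_matches_golden_master():
    actual = compute_invoice(SAMPLE_INPUT)
    if not GOLDEN.exists():  # first run: record current behaviour, right or wrong
        GOLDEN.parent.mkdir(parents=True, exist_ok=True)
        GOLDEN.write_text(json.dumps(actual, indent=2, sort_keys=True))
    assert actual == json.loads(GOLDEN.read_text())
```

The point is exactly what the parent says: you're pinning down what the system does today, not asserting that it's correct.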
This is something I've yet to see a software testing framework do. Compare the results of two different implementations separated in time. i.e. two different revisions out of source control, using one as the golden reference for the other.
Isn’t that kind of what snapshot tests can do?
bazhenov/tango does something like this for performance tests: to counteract variability in system behaviour, you run the old and new implementations at the same time.
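For the functional (rather than performance) version of the idea upthread, a rough sketch, assuming you can import both revisions side by side (module names are hypothetical, Python):

```python
# Differential test sketch: treat the old revision as the golden reference
# for the new one. Module names are hypothetical; in practice the old
# revision might be vendored, pinned to a tag, or checked out elsewhere.
import pricing_v1 as old   # revision currently in production
import pricing_v2 as new   # the rewrite under test

SAMPLE_INPUTS = [
    {"amount": 100, "currency": "EUR"},
    {"amount": 0, "currency": "USD"},
    {"amount": -5, "currency": "USD"},  # the weird inputs are the interesting ones
]

def test_new_implementation_matches_old():
    for payload in SAMPLE_INPUTS:
        assert new.quote(payload) == old.quote(payload), payload
```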
The problem is that xUnit-style tests are really bad at this and make it tedious, which is why people gravitate to LLMs for writing them. LLMs on the surface look like they can relieve the pain of using bad abstractions, but they're still a band-aid on a gaping wound. 20 years ago we'd have been using them to write ugly raw PHP.
Characterization tests ideally need to not be written in code but defined in something resembling a configuration language - something without loops, conditionals, methods, etc. There then needs to be a strict separation of concerns kept between these definitions and code that executes them.
I wrote a testing framework (with the same name as my username) centered around this idea. Because it is YAML based, expected textual outputs can be automatically written into the test from actual inputs which saves tons of time and you can autogenerate readable stakeholder documentation that validates behavior.
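To give a flavour of the idea (not the actual framework's format, just a hypothetical sketch of a YAML spec plus a separate Python runner):

```python
# Hypothetical sketch: declarative test definitions (no loops, conditionals
# or methods) kept strictly separate from the generic code that runs them.
# This is not the real framework's format; greet.py is a made-up script.
import subprocess
import yaml  # PyYAML

SPEC = """
tests:
  - name: greets the user
    command: ["python", "greet.py", "Alice"]
    expected_output: "Hello, Alice!\\n"
  - name: rejects empty names
    command: ["python", "greet.py", ""]
    expected_exit_code: 1
"""

def run_spec(spec_text):
    for case in yaml.safe_load(spec_text)["tests"]:
        result = subprocess.run(case["command"], capture_output=True, text=True)
        if "expected_output" in case:
            assert result.stdout == case["expected_output"], case["name"]
        if "expected_exit_code" in case:
            assert result.returncode == case["expected_exit_code"], case["name"]

if __name__ == "__main__":
    run_spec(SPEC)
```

Because the expected outputs are plain data, a runner can rewrite them from actual outputs, which is where the time saving comes from.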
It might seem unbelievable but with decent abstractions, writing tests and TDD stops being a chore and actually starts being fun - something you won't want to delegate to an LLM.
Couldn’t an LLM provided with the right level of logs write really good characterization tests?
This seems like a perfect use case. Quickly find all the footguns you didn’t know to look for.
As long as you have a process to dismantle the tests and move fully over to the new system, if you are indeed migrating/upgrading. Leaving a legacy thing dangling, with tightly coupled tests lingering for years, happens easily when going from 95% to 100% costs too much for management and stakeholders, in various ways, relative to other pressing needs.
Characterisation tests are not supposed to be tightly coupled – they are supposed to be integration/end-to-end tests not unit tests – the point is to ensure that some business process continues to produce the same outputs given the same inputs, not that the internals of how it produces that output are unchanged. Code coverage is used as an (imperfect) measure of how complete your set of test inputs is, and as a tool to help discover new test inputs, and minimise test inputs (if two test inputs all hit the same lines/branches, maybe it is wasteful to keep both of them–although it isn't just about code coverage, e.g. extreme values such as maximums and minimums can be valuable in the test suite even if they don't actually increase coverage.)
They can take the form of unit tests if you are focusing on refactoring a specific component, and want to ensure its interactions with the rest of the application are not changed. But at some point, a larger redesign may get rid of that component entirely, at which point you can throw those unit tests away, but you'll likely keep the system-level tests
A significant part of writing characterisation tests can be simply staring at a code coverage report and asking “can I write a test (possibly by modifying an existing one) which hits this line/branch”. Sometimes that’s easy, sometimes that’s hard, sometimes that’s impossible (code bases, especially crapulent legacy ones, sometimes contain large sections of dead code which are impossible to reach given any input).
An LLM doesn’t have to always get it right to be useful: have it generate a whole bunch of tests, run them all, keep the ones which hit new lines/conditions, maybe even feed those results back in to see if it can iteratively improve, and stop when it is no longer generating useful tests. Hopefully that addresses most of the low-hanging fruit and leaves the harder cases to a human.
There already exist automated test generation systems which can do some of this–for example, concolic testing-but an LLM can be viewed as just another tool in the toolbox, which may sometimes be able to generate tests which concolic testing can’t, or possibly produce the same tests quicker than concolic testing would. There is also the potential for them to interact synergistically - the LLM might produce a test which concolic testing couldn’t, but then concolic testing might then use that to discover further tests which the LLM couldn’t.
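A very rough sketch of the keep-what-adds-coverage loop, with the model call and coverage measurement left as injected callables since they depend entirely on your tooling (Python):

```python
# Rough sketch of coverage-guided test generation. generate_test() and
# run_with_coverage() are hypothetical callables standing in for whatever
# LLM API and coverage tooling (e.g. coverage.py) you actually use.
def grow_test_suite(generate_test, run_with_coverage, max_rounds=20):
    """generate_test(existing) -> test source;
    run_with_coverage(test) -> (passed, set of (file, line) hit)."""
    kept, covered = [], set()
    for _ in range(max_rounds):
        candidate = generate_test(kept)
        passed, hit = run_with_coverage(candidate)
        newly_hit = hit - covered
        if passed and newly_hit:      # keep only tests that add coverage
            kept.append(candidate)
            covered |= newly_hit
        elif not newly_hit:           # no longer generating useful tests
            break
    return kept
```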
Agreed, that's a great use case for autogenerated tests.
In my experience, I use LLMs for writing tests because LLMs are much better at writing tests than at writing the application code. I suspect this might be because the code ends up being a very detailed and clear prompt to the LLM.
Imagine dealing with, say, COBOL, where even the team that was hired to maintain it is going to retire. It's the closest to the "lost technology" trope from sci-fi we have ever been.
The technical aspect of the code is not actually the difficult part WRT maintenance. Someone that knows COBOL as a language can figure out what some code does and how it does it. It takes time but that is information that can be derived if you just have the code.
The main problem with COBOL is the code is often an implementation of some business or regulatory process. The COBOL maintainer retiring is taking knowledge of the code but more importantly the knowledge of the literal business logic.
The business logic and accounting/legal constraints aren't something that can necessarily be derived from the code. You can know that some bit of code multiplies a value by 100, but you can't necessarily know whether it is supposed to do that. If the source code and documentation don't capture the why of the code, the how doesn't help the future maintainer.
Often with COBOL the people who originally defined the why of the code are not just retired but dead. The first generation of maintainers may only have ever received partial knowledge of the why, so even the best documenters have holes in their knowledge. They may have had exposure to the original why-definers but neglected, or didn't have an opportunity, to document some aspects of the why. The subsequent generations of maintainers are constrained by how much of the original why was documented.
Edit: pre-coffee typo
I feel the same way about how test code is viewed even outside of AI. A lot of the time test code is treated as lower-priority code given to more junior engineers, which seems like the opposite of what you would want.
When I do code review I always review the tests first. If they look thorough and reasonable then I can be a lot less careful reviewing the rest.
Tests never cover everything so exactly what are you looking for?
For example, just check the list of unit tests; see that there is one for the happy flow but none for the error flows. Check that the number of tests correlates well with the apparent cyclomatic complexity of the code. Check that tests are actually defining and testing relevant behavior and not just "I called method A, thus method B was called".
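As a concrete example of that last point, the kind of test that asserts wiring rather than behaviour (hypothetical names, Python):

```python
# A test that only restates the implementation: "I called process_order,
# thus save was called." It pins internal wiring but defines no behaviour,
# so it passes even if the order logic is completely wrong. Names are made up.
from unittest.mock import MagicMock

class OrderService:                       # trivial on purpose
    def __init__(self, repository):
        self.repository = repository

    def process_order(self, order_id):
        self.repository.save(order_id)

def test_process_order_calls_repository():
    repository = MagicMock()
    OrderService(repository).process_order(order_id=1)
    repository.save.assert_called_once_with(1)
```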
This is how I feel too. For me, tests usually end up being a concrete specification which I can execute.
Getting LLMs to write tests is like getting LLMs to write my spec.
Covering all the "if err != nil { return err }" branches in Go is pretty mindless work.
Your tests only need to assert what failure states the user should expect under what conditions, not cover how it should be implemented. If, say, your implementation uses panic/recover instead, your tests shouldn't care. Asserting 'how' the code is to be implemented is how you get brittle tests that are a nightmare to maintain.
And making those assertions shouldn't be mindless work. Documenting the failure cases should be the most interesting work of all the code you are writing. If you are finding that it isn't, then that tells you that you should be using a DSL that has already properly abstracted the failure cases for your problem space away.
* The user isn't inside the package, but the test has to be or it's not a unit test (and coverage doesn't count).
* While "errors are values" makes it possible to enumerate the potential failure cases, most of the time errors are just strings. The thing you are forced to document by the lack of exception semantics is the sites at which you might be dealing with an error vs. a success value. In an IO heavy application this rounds up to "everywhere."
* Go is not really equipped to do DSLs in a type-safe way, except through code generation. I would view the LLM as a type of code generator here.
* The test isn’t inside the package either. In fact, Go in particular defines _test packages so that you can ensure that there is explicit separation. You are not testing units if you don’t look at the software the same way the user does.
* If your errors are strings, you are almost certainly doing something horribly wrong. Further, your tests would notice it is horribly wrong as making assertions on those strings would look pretty silly, so if your errors are strings that tells that your testing is horribly, horribly wrong.
* Go is not a DSL, no. It is unabashedly a systems language. Failure is the most interesting problem in systems. Again, if your failures aren’t interesting, you’re not building a system. You should not be using a systems language, you should be using a DSL.
When you have no idea what you are doing, choosing the wrong tool at every turn, an LLM might be able to help, sure.
FWIW, writing the implementation is a much more pleasant/interesting experience, because you're writing the actual thing the application is supposed to do. In contrast, when writing tests, you're describing what the application is supposed to do using an extremely bloated, constrained language, requiring you to write dozens or hundreds of lines of setup code just to be able to then add a few glorified if/else statements.
In my experience, at least in languages like C++ or Java, unit tests are made of tedium, so I'm absolutely not surprised that the first instinct is to use LLMs to write that for you.
This is my experience, unless I try to make the contract of the thing simple to test via property testing. Then, writing tests often becomes basically an exercise in writing down the contract in a succinct way.
Sadly, this is a rare approach, so if you cooperate with others it's hard to use it.
Yeah, I suppose this is very language dependent. In Ruby I actually quite enjoy writing tests, whereas in Java it was boilerplate pain.
I actually enjoy writing unit tests, but I do frontend stuff. For me it's the moment of calm and reflection when I step back and look carefully at what I built and say "nice, it works" to myself.
Most systems are pretty predictable. it("displays the user's name") isn't very novel, and is probably pretty easy for a LLM to generate.
Arguably the implementation that passes such a test is even simpler, making it a bit questionable why we have the human write that part.
Well, someone needs to define that the username should be shown in the first place
One reason I can think of is that many engineers really don't do testing. They write tests after the fact because they have to. I've worked with a bunch of engineers who will code for days and then write a few tests "proving" the system works. Those tests have low coverage and are usually brittle.
This system would be a godsend in the minds of engineers who think / operate that way.
I've also had managers who told me I wasn't allowed to write tests first because it was slower. Luckily I was able to override/ignore them as I was on loan: "take it up with my boss". They're probably thinking the same as the above engineers.
Another way to think of this is that most devs hate documentation... if they had an AI that would write great docs from the code, they'd love it. And to these devs, docs they don't have to write are great docs :)
Sounds like a great place to work.
You're supposed to write them together, at the same time. What's slower is spending all of your time mousing around the GUI like a caveman, then having missing tests, then trying to debug every future recurring regression for the rest of the project's lifespan.
The only thing that's slower is perhaps trying to do TDD for the first time. After that, you're embarrassed at how much time you wasted wandering in a browser before your API tests were finished.
I basically agree with this but some caveats. I often find there are maybe 5% of the tests I should write that only I could write, because they deal with the specifics of the application that actually give it its primary purpose/defining features. As in, it's not that there is any test I believe AI eventually wouldn't be able to write, it's more that there are certain tests that define the "keyframes" of the application, that without defining explicitly, you'd be failing to describe your application properly.
For the remaining 95% of uninteresting surfaces I'd be perfectly happy to let an AI interpolate between my key cases and write the tests that I was mostly not going to bother writing anyway.
You are probably right, and the percentages change with the language and framework being used. When I write Ruby I write enormous amounts of tests, and many of these could probably be derived from the higher-level integration tests I started with. In Rust, on the other hand, I write very few tests. I wonder if this also shows which code could be entirely generated based on the high-level tests.
This kind of thinking is sadly lost on many. I have seen copious amounts of nonsensical tests slapped full of hard-wired mocks and any change in the functionality would break hundreds of tests. In such a scenario an LLM might be the bandaid many are looking for. Then again, the value of such tests is questionable.
My best example of that was a test that was asserting that a monitoring metric changes in some way with a comment expressing doubt whether the asserted values are correct (but whoever wrote it still wrote the test that enshrined the behaviour that they themselves doubted).
Imo there are 2 kinds of programming:
- Software engineering, which is akin to real engineering, as it involves designing a complex mechanism that fits a lot of real-world constraints. Usually you need to develop a sophisticated mental model and exploit it for the desired results. Involved implementations and algorithms usually fall into this category.
- Talking to computers, which is about describing what you need to the computer. Usually focuses on the 'what', as the 'how' is trivial. Examples include HTML/CSS, Terraform, and very simple programs (like porting a business process flow from a flowchart to code). And, indeed, test code.
LLMs are terrible at the former, but great at the latter.
There are multiple reasons, including what you mentioned. The first is that test code is generally considered "safe" to write and change, so it won't be the end of the world even if the LLM does something subtly wrong. The next is that reading a test change is usually easier than writing it, which is the entire idea of golden/approval tests. And finally... people generally don't like writing tests, which is probably the biggest reason...
I agree, humans should write tests. Humans are the oracles of the program output who know whether the code did the right or wrong thing.
I’m guessing they want to automate tests because most engineers skimp on them. Compensating for lack of discipline.
I write (at least) 2 kinds of tests:
- TDD, which as you say describes the system's behavior. But it often deals with the nominal cases. It is hard to predict all that can go wrong in the initial development phase.
- tests designed to reproduce a bug. The goal of these is to try very hard to make the system fail, taking inspiration from the bug's context.
Maybe this LLM test generator could allow us to be more proactive with the second kind?
I wrote a simple LLM-backed chat application. My primary usage right now is to copy-paste code I have written (Java and Python) into the chat and ask it to generate unit test cases. I think it has reduced my development time by a huge amount. It also generates tests for edge cases. The generated code is usually usable 90% of the time. It is also very good at making mocks for service calls. I'm using the Claude 2.1 model with Bedrock.
It's nowhere near as fancy as the FB tool, but I know it is blessed by the company.
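For anyone curious, the core of it is just one call to Bedrock. A rough sketch of what that looks like (the prompt wording is a placeholder, and the model ID and request format should be checked against the current Bedrock docs for your model):

```python
# Rough sketch of asking a Bedrock-hosted Claude model to write unit tests
# for pasted code. Prompt wording is a placeholder; verify the model ID and
# request/response format against the current Bedrock documentation.
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

def generate_unit_tests(source_code: str) -> str:
    prompt = (
        "\n\nHuman: Write unit tests for the following code, including "
        "edge cases and mocks for any service calls:\n\n"
        f"{source_code}\n\nAssistant:"
    )
    body = json.dumps({"prompt": prompt, "max_tokens_to_sample": 2000})
    response = bedrock.invoke_model(modelId="anthropic.claude-v2:1", body=body)
    return json.loads(response["body"].read())["completion"]
```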
Passing tests don't guarantee correctness over all possible inputs, and especially not freedom from vulnerabilities. I'd rather have the code written by a human who actually understands it. Especially if the AI just gets re-prompted after failed attempts until the tests pass.
AI-generated tests can work like compiler/sanitizer warnings. If they fail, you can audit them and decide if it was a true or false positive.
When you try to get the LLM to write the code, you find that it’s easier to get it to write the tests. So you do that and publish about that first.
At the risk of telling you something you already know, I’d bring to your attention, for example, property-based testing, probably most popularized by Hypothesis, which is great and which I recommend, but by no means the only approach or high-quality implementation. I think QuickCheck for Haskell was around when it got big enough to show up on HN.
Just in case any reader hasn’t tried this, the basic idea is to make statements about code’s behavior that are weaker than a totally closed-form proof system (which also has its place), stated as “properties” that are checked up to some inherently probabilistic bound, which can be quite useful statements.
The “canonical” example is reversing a string: two applications of string reverse is generally intended to produce the input. But with 1 line of code, you can check as many weird Unicode edge cases or whatever as you have time and electricity.
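A minimal sketch of that property with Hypothesis (the reverse function here is just a stand-in for whatever you'd actually test):

```python
# Minimal property-based test: reversing twice should give back the input.
# Hypothesis generates the inputs, including plenty of weird Unicode.
from hypothesis import given, strategies as st

def reverse(s: str) -> str:   # stand-in for the code under test
    return s[::-1]

@given(st.text())
def test_reverse_twice_roundtrips(s):
    assert reverse(reverse(s)) == s
```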
I know this example seems trite, but I mention it because some hard CUDA hackers doing the autodiff and kernels and shit that became PyTorch used it to tremendous effect and probably got 5x the confidence in the code for half the effort/price.
It doesn’t always work out, but when it does it’s great, and LLMs seem to be able to get a Hypothesis case sort of right, or at least closer than starting from scratch.
I really believe this "application" is the result of thinking about tests as a chore and a requirement without great benefits. Your thought of an LLM writing the application given the tests is also interesting: test pass/fail becomes an optimization signal that can be run online by the LLM to improve the result without human feedback.
If you had as many monkeys as there are parameters in an LLM, they might run your business ;-)
I dread the morning after a night of getting something to work…somehow.