
Princeton group open sources "SWE-agent", with 12% fix rate for GitHub issues

noncoml
34 replies
3d22h

If you are afraid that LLMs will replace you at your job, ask an LLM to write Rust code for reading a utf8 file character by character

Edit: Yes, it does write some code that is "close" enough, but in some cases it is wrong, and in others it doesn't do exactly what was asked. I.e. it needs supervision from someone who understands the requirements, the code, and the problems that may arise from the naive approach the LLM is taking. Mind you, the more popular the issue, the better the approach the LLM takes. So in other words, IMHO it is a glorified Stack Overflow. Just as there are engineers that copy-paste from SO without having any idea what the code does, there will be engineers that will just copy-paste from the LLM. Their work will be much better than if they used SO, but I think it's still nowhere near the mark of a Senior SWE and above.

raggi
19 replies
3d22h

it does an ok job with this task:

    use std::fs::File;
    use std::io::{self, BufReader, Read};

    fn read_file_character_by_character(path: &str) -> io::Result<()> {
        // Open the file in read-only mode.
        let file = File::open(path)?;

        // Create a buffered reader to read the file more efficiently.
        let reader = BufReader::new(file);

        // `chars` method returns an iterator over the characters of the input.
        // Note that it returns a Result<(char, usize), io::Error>, where usize is the byte length of the char.
        for char_result in reader.chars() {
            match char_result {
                Ok(c) => print!("{}", c),
                Err(e) => return Err(e),
            }
        }

        Ok(())
    }

    fn main() {
        let path = "path/to/your/file.txt";

        if let Err(e) = read_file_character_by_character(path) {
            eprintln!("Error reading file: {}", e);
        }
    }

comex
9 replies
3d22h

Only problem is that the critical `chars` method doesn't actually exist. Rust's standard library has a `chars` method for strings, but not for Readers.

(Also, the comment about the iterator element type is inconsistent with the code following it. Based on the comment, `c` would be of type `(char, usize)`, but then trying to print it with {} would fail because tuples don't implement Display.)

raggi
8 replies
3d21h

good catch. feeding it the error output of rustc it then produces:

    use std::fs::File;
    use std::io::{self, Read};

    fn read_file_character_by_character(path: &str) -> io::Result<()> {
        let mut file = File::open(path)?;
        let mut contents = String::new();

        file.read_to_string(&mut contents)?;

        for c in contents.chars() {
            println!("{}", c);
        }

        Ok(())
    }

    fn main() {
        let path = "path/to/your/file.txt";

        if let Err(e) = read_file_character_by_character(path) {
            eprintln!("Error reading file: {}", e);
        }
    }

phaer
5 replies
3d21h

But this doesn't read the file char-by-char; it reads the whole file into a string and then iterates over it

scottlamb
4 replies
3d20h

What would you expect? There's no OS API for "read one character", except in say ASCII where 1 byte = 1 code point = 1 character. And it'd be hideously inefficient anyway. So you either loop over getting the next N bytes and getting all complete characters so far (with some extra complexity around characters that cross chunk boundaries) or you read the whole thing into a single buffer and iterate the characters. This code does the latter. If this tool doesn't have the ability to respond by asking requirements questions, I'd consider either choice valid.
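
For the curious, a rough Rust sketch of that first option (illustrative only and untested; the function name is made up):

    use std::fs::File;
    use std::io::{self, Read};
    use std::str;

    fn read_chars_streaming(path: &str) -> io::Result<()> {
        let mut file = File::open(path)?;
        let mut chunk = [0u8; 8192];
        // Holds trailing bytes of a character split across chunk boundaries.
        let mut pending: Vec<u8> = Vec::new();

        loop {
            let n = file.read(&mut chunk)?;
            if n == 0 {
                break;
            }
            pending.extend_from_slice(&chunk[..n]);

            // Decode the longest valid UTF-8 prefix; carry the remainder over.
            let valid_len = match str::from_utf8(&pending) {
                Ok(s) => {
                    for c in s.chars() {
                        print!("{}", c);
                    }
                    pending.len()
                }
                // error_len() == None means the buffer merely ends mid-character.
                Err(e) if e.error_len().is_none() => {
                    let valid = e.valid_up_to();
                    for c in str::from_utf8(&pending[..valid]).unwrap().chars() {
                        print!("{}", c);
                    }
                    valid
                }
                Err(_) => {
                    return Err(io::Error::new(io::ErrorKind::InvalidData, "invalid UTF-8"))
                }
            };
            pending.drain(..valid_len);
        }

        if pending.is_empty() {
            Ok(())
        } else {
            Err(io::Error::new(io::ErrorKind::InvalidData, "truncated UTF-8"))
        }
    }

    fn main() {
        if let Err(e) = read_chars_streaming("path/to/your/file.txt") {
            eprintln!("Error reading file: {}", e);
        }
    }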

Of course, in real life, I do expect to get requirements questions back from an engineer when I assign a task. Seems more practical than anticipating everything up-front into the perfect specification/prompt. Why shouldn't I expect the same from an LLM-based tool? Are any of them set up to do that?

1letterunixname
3 replies
3d18h

There most certainly is getwchar() and fgetwc()/getwc() on anything that's POSIX C95, so that's more or less everything that's not a vintage antique.

Reading individual UTF-8 codepoints is a trivial exercise if a byte-wide getchar() is available, and portable C code to do so would run on anything made after 1982. IIRC, they don't teach how to write portable C code in Comp Sci programs anymore, and it's a shame.

Never read a file completely into memory at once unless there is zero chance of it being a huge file because this is an obvious DoS vector and waste of resources.

scottlamb
2 replies
3d16h

There most certainly is getwchar() and fgetwc()/getwc() on anything that's POSIX C95, so that's more or less everything that's not a vintage antique.

Apologies for the imprecision: by OS API, I meant syscall, at least on POSIX systems. The functions you refer to are C stdio things. Note also they implement on top of read(2) one of the two options I mentioned: "loop over getting the next N bytes and getting all complete characters so far (with some extra complexity around characters that cross chunk boundaries)".

btw, if we're being precise, getwchar gets a code point, and character might mean grapheme instead. Same is true for the `str::chars` call in the LLM's Rust snippet. The docstring for that method mentions this [1] because it was written in this century after people thought about this stuff a bit.
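
For instance (this leans on the third-party unicode-segmentation crate, since the standard library stops at code points):

    // Third-party crate; `str::chars` alone only yields code points.
    use unicode_segmentation::UnicodeSegmentation;

    fn main() {
        let s = "e\u{301}"; // 'e' followed by U+0301 COMBINING ACUTE ACCENT
        assert_eq!(s.chars().count(), 2);         // two code points...
        assert_eq!(s.graphemes(true).count(), 1); // ...one user-perceived character
    }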

portable C code to do so would be able to run on anything made after 1982.

Our comments are part of a thread discussing this prompt [2] that specifically requests Rust and this snippet in response [3]. Not portable C code. You can use those C stdio functions from Rust, but you really shouldn't without a very good reason. Rust has its own IO library that is safe and well integrated with other Rust things like `#![derive(Debug)]`.

[1] https://doc.rust-lang.org/std/primitive.str.html#method.char...

[2] https://news.ycombinator.com/item?id=39910542

[3] https://news.ycombinator.com/item?id=39910542

1letterunixname
1 replies
3d12h

Yes, the userland side presents this via POSIX as ssize_t read(int fd, void* buf, size_t count). Calling that with count = 1 each time would be wasteful, but libcs have certainly been buffering on top of it since at least the 1980s. I remember this was the case with Borland C/C++.

Our comments are part of a thread discussing this prompt [2] that specifically requests Rust and this snippet in response [3]. Not portable C code. You can use those C stdio functions from Rust, but you really shouldn't without a very good reason. Rust has its own IO library that is safe and well integrated with other Rust things like `#![derive(Debug)]`.

Duh. It doesn't really matter what Rust has have went it comes to enabling the use of specific edge-case performance improvements for specific purposes. Inefficient AI-generated code without a clue of other approaches doesn't move the needle. Religious purity doesn't matter, only results matter.

scottlamb
0 replies
3d3h

Duh. It doesn't really matter what Rust has have went it comes to enabling the use of specific edge-case performance improvements for specific purposes. Inefficient AI-generated code without a clue of other approaches doesn't move the needle. Religious purity doesn't matter, only results matter.

No idea what this incoherent, ungrammatical paragraph is supposed to be saying. But if you're under the impression Rust doesn't have its own buffered IO facilities or that using Rust-native libraries offers only "religious purity" benefits over extern "C" stuff, you're mistaken.

This has diverged from what I'm interested in discussing anyway; see my question upthread about whether there are any LLM tools that gather requirements from incomplete specs in the way I expect human engineers to. In this case, I'd expect it to ask questions such as "how large are input files expected to be?" Better yet, ask what the greater purpose is, as "character by character" is rarely useful.

deathanatos
0 replies
3d20h

On errors, it exits with success.

1letterunixname
0 replies
3d18h

A few notes:

- It should be generating `path: impl AsRef<Path>` to be properly generic.

- It's not setting a nonzero exit code on error. (A sketch of both fixes follows this list.)

- Edge-case handling is a vital property for production-usable tools at scale. I'm wondering if it can yet special-case situations such as generating a conditionally compiled version for Linux that uses the splice syscall when the arguments are two file handles.
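
A quick sketch of the first two fixes (keeping the whole-file-in-memory approach from upthread; the path is a placeholder):

    use std::fs;
    use std::io;
    use std::path::Path;
    use std::process::ExitCode;

    // Generic over anything path-like, per the first note.
    fn read_file_character_by_character(path: impl AsRef<Path>) -> io::Result<()> {
        for c in fs::read_to_string(path)?.chars() {
            print!("{}", c);
        }
        Ok(())
    }

    fn main() -> ExitCode {
        match read_file_character_by_character("path/to/your/file.txt") {
            Ok(()) => ExitCode::SUCCESS,
            Err(e) => {
                eprintln!("Error reading file: {}", e);
                ExitCode::FAILURE // nonzero exit code, per the second note
            }
        }
    }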

woodruffw
7 replies
3d22h

If we're being sticklers, this isn't reading character-by-character: it's performing a buffered read, which then gets iterated over.

raggi
2 replies
3d22h

here's what it does if i feed your nitpick into it verbatim:

    use std::fs::File;
    use std::io::{self, Read};
    use std::str::from_utf8;

    fn read_file_character_by_character(path: &str) -> io::Result<()> {
        // Open the file in read-only mode.
        let mut file = File::open(path)?;

        let mut buffer = [0; 4]; // UTF-8 character max byte length
        let mut temp_buffer = vec![];

        while let Ok(bytes_read) = file.read(&mut buffer[..1]) { // Read byte by byte
            if bytes_read == 0 { // End of file
                break;
            }

            temp_buffer.push(buffer[0]);

            if let Ok(s) = from_utf8(&temp_buffer) {
                if s.chars().count() == 1 { // Complete character read
                    print!("{}", s);
                    temp_buffer.clear(); // Clear temp buffer for next character
                }
            }
        }

        Ok(())
    }

    fn main() {
        let path = "path/to/your/file.txt";

        if let Err(e) = read_file_character_by_character(path) {
            eprintln!("Error reading file: {}", e);
        }
    }

comex
1 replies
3d21h

Unlike the original version, this version compiles and seems to basically work correctly. However, the design is misleading: `buffer` is declared as an array of 4 bytes but only the first byte is ever used. The code also has suboptimal performance and error handling, though that's not the end of the world.

raggi
0 replies
3d21h

all true, as I said in another fork of the thread, this comes down to part of what humans will still be valuable for in this loop: distilling poor requirements into better requirements.

noncoml
2 replies
3d21h

I wouldn't say it's a nit. The file may be tens of GB. Do you want to read it into a string?

raggi
1 replies
3d19h

The buffered read didn't do that; it used the default buffered reader implementation. IIRC that implementation currently defaults to an 8 KiB buffer window, which is a little too small to be maximally efficient for high throughput, but substantially more performant than making a syscall per byte, and without spending too much memory.
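
If the default ever is the bottleneck, requesting a bigger buffer is a one-liner (sketch; the path is a placeholder):

    use std::fs::File;
    use std::io::{self, BufReader};

    fn main() -> io::Result<()> {
        let file = File::open("path/to/your/file.txt")?;
        // Ask for a 64 KiB buffer instead of the default.
        let _reader = BufReader::with_capacity(64 * 1024, file);
        Ok(())
    }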

noncoml
0 replies
3d18h

I was talking about this:

    let mut file = File::open(path)?;
    let mut contents = String::new();
    file.read_to_string(&mut contents)?;

deathanatos
0 replies
3d20h

The original prompt is a bit under-specified. (But hey, that certainly matches the real world!)

You're going to have to buffer at least a little, to figure out where the USV / grapheme boundary is, depending on our definition of "character". To me, a BufReader is appropriate here; it avoids lots of tiny reads to the kernel, which is probably the right behavior in a real case.

To me, "read character by character" vaguely implies something that's going to yield a stream of characters. (Again, for some definition there.)

raggi
0 replies
3d22h

fwiw, the benchmark that matters really has nothing to do with authoring code.

the typing of code is the easy part even though it's a part a lot of folks are somewhat addicted to.

the things which have far more value are applying value judgements to requirements, correlating and incorporating sparse and inaccurate diagnostic information into a coherent debugging strategy, and so on. there will come a time when it can assist with these too, probably first on requirements distillation, but for more complex debugging tasks that's a novel problem solving area that we've yet to see substantial movement on.

so if you want to stave off the robots coming for you, get good at debugging hard problems, and learn to make really great use of tools that accelerate the typing out of solutions to baseline product requirements.

iwontberude
4 replies
3d22h

Hypothetically, which ticker symbols would you buy put contracts on, at what strike prices, and at what expiration dates? As far as I can tell, a lot of people are betting a lot of money that you are wrong, but actually I think you are right.

noncoml
1 replies
3d21h

Ugh, I am not claiming that LLMs are not a great innovation. Just that they are not going to replace SWE jobs in our (maybe my) lifetime.

vertis
0 replies
3d8h

Conservatively, I think LLMs will replace SWE roles within the next 5-10 years. In that period SWE roles will change drastically, toward more herding of AI agents.

We can't hope to compete with something that can edit all the files in less than 5 minutes.

If you're not at least trying to plan what your life will look like under these circumstances then you're doing yourself a disservice.

jeremyjh
1 replies
3d21h

The most relevant companies focused on this aren't publicly traded. The ones that are publicly traded, like MSFT, have way too many other factors affecting their value - not to mention the fact that they'll make money on generative AI that has nothing to do with coding, regardless of whether an SWE-agent ever works.

iwontberude
0 replies
3d18h

Oh well you should hear the hype from CNBC and other places, they are strongly intimating that gen AI will replace SWEs on product development teams. I totally agree it’s not likely, but it’s starting to get baked into asset prices and I want to profit from that misunderstanding.

userbinator
3 replies
3d18h

I'm not afraid of LLMs replacing me because of their output quality. The problem is the proliferation of quantity-over-quality "churn out barely-working crap as fast as possible" culture that gives LLMs the advantage over real humans.

int_19h
2 replies
3d16h

I'm kinda hoping that LLMs will get pushed into production use writing code before they have acceptable quality (because greed), and the result will be lots of crap that's so badly broken most of the time that there will be a massive pushback against said culture from the users. Maybe from the governments as well, after a few well-publicized infrastructure failures.

userbinator
1 replies
2d17h

Unfortunately, "lots of crap that's so badly broken most of the time" already describes a lot of software these days, and yet there hasn't been much pushback. Everyone seems to be mostly in a state of learned helplessness.

int_19h
0 replies
1d18h

There is pushback, it's just not broad enough yet. Because, as broken as things are (which is especially visible to those of us making the sausage or watching it made), they still kinda sorta work most of the time, to the point where users grumble but learn to live with it.

But I think that AI coding will upset this equilibrium by reducing the quality even more, and significantly enough that users will very much notice - and for many of them it will push things into "what I need doesn't work most of the time" category. And then there will be payback.

Then again, I am an optimist.

DabbyDabberson
3 replies
3d21h

The way I see it, it's undetermined whether Generative AI will be able to fully do a SWE job.

But for most of the debates I've seen, I don't think the answer matters all that much.

Once we have models that can act as full senior SWEs... the models can engineer the models. And then we've hit the recursive case.

Once models can engineer models better and faster than humans, all bets are off. It's the foggy future. It's the singularity.

vertis
0 replies
3d8h

People (SWEs) don't want to hear this. I think it's an inevitability that something of this nature will happen.

int_19h
0 replies
3d16h

The implicit assumption here is that a human "senior SWE" can engineer a model of the same quality that is capable of simulating him. Which is definitely not true with the best models that we have today - and they certainly can't simulate a senior SWE, so the actual bar is higher.

I'm not saying that the whole "robots building better robots" thing is a pipedream, but given where things are today, this is not something that's going to happen soon.

dvt
0 replies
3d20h

Once we have models that can act as full senior SWEs.. the models can engineer the models.

This is such an extremely bullish case, I'm not sure why you'd think this is even remotely possible. A Google search is usually more valuable than ChatGPT. For example, the rust utf-8 example is already verbatim solved on reddit: https://www.reddit.com/r/rust/comments/l5m1rw/how_can_i_effi...

vineyardmike
0 replies
3d22h

Yea the problem with that is the control group - grab any SWE and ask them the same thing. I don’t think most would pass. Unless you want to give an SWE time to learn… then it’s hardly fair. And I vaguely trust the LLM to be able to learn it too.

Also I just asked Claude and Gemini and they both provided an implementation that matches the "bytes to UTF-8" Rust docs. Assuming those are right, LLMs can do this (but I haven't tested the code).

https://doc.rust-lang.org/std/string/struct.String.html

dimal
29 replies
3d20h

The demo shows a very clearly written bug report about a matrix operation that's producing an unexpected output. Umm… no. Most bug reports you get in the wild are more along the lines of "I clicked on X and Y happened", then if you're lucky they'll say "and I expected Z". Usually the Z expectation is left for the reader to fill in, because as human users we understand the expectations.

The difficulty in fixing a bug is in figuring out what’s causing the bug. If you know it’s caused by an incorrect operation, and we know that LLMs can fix simple defects like this, what does this prove?

Has anyone dug through the paper yet to see what the rest of the issues look like? And what the diffs look like? I suppose I’ll try when I have a sec.

drcode
11 replies
3d19h

Most bug reports you get in the wild are more along the lines of

Since this fixes 12% of the bugs, the authors of the paper probably agree with you that 100 - 12 = 88%, and hence "most bugs", don't have nicely written bug reports.

aiauthoritydev2
5 replies
3d14h

12% is a very very large number for that kind of problem. I doubt even 0.1% of bug reports in the wild are that well written.

sitkack
3 replies
3d10h

Have the LLM rewrite the bug reports.

killingtime74
1 replies
2d23h

Why not have LLM write AGI while you're at it

sitkack
0 replies
2d23h

It is and it will!

fasa99
0 replies
2d3h

You'd want three LLMs, one to create the bugs, one to report it, one to fix it. I joke of course but on the other hand this is potentially a worthwhile architecture from a self-training perspective - a bug-creating LLM means your training set size is as big as you want it +/- GAN features.

littlestymaar
0 replies
3d10h

Except this is automated, so you could get multiple orders of magnitude more bugs filed, so you need to have a very low false-positive ratio to avoid being overwhelmed by automatically generated crap (which is basically spam).

dimal
2 replies
3d18h

I suppose I should nail down my point. No one would ever write a bug report like this. A bug generally has an unknown cause. Once you found the cause of the bug, you'd fix it. Nowadays, you could just cut and paste the problem into ChatGPT and get the answer right then. So why would anyone ever log this bug? All this demo proves is that they automated a process that didn't need automation.

hvis
1 replies
3d18h

To be fair, sometimes meticulous users investigate the bugs and write down logical chains explaining the causes and even offer a solution at the end (which they can't apply for the lack of commit access, for instance).

The proposed solution isn't always right, of course, but it would be incorrect to say that no bug reports come with a diagnosed cause. But that's exactly where a conscientious reviewer is most needed, I believe.

citrin_ru
0 replies
3d11h

I sometimes write a detailed bug report but not a PR when there are different ways to address the problem (and all look bad to me) or the fix can introduce new problems. But I would expect an LLM to ignore tradeoffs and choose an option which is not necessarily the best, for the same reason I hesitate - lack of understanding of this specific project.

skywhopper
0 replies
3d16h

It fixes 12% of their benchmark suite, not 12% of bug reports.

medellin
0 replies
3d15h

In my 15 years I would say less than 1% of bug reports are like this. If you know the bug to this level, most people would just fix it themselves.

bee_rider
7 replies
3d15h

Maybe it just needs another, independent tool. One that detects poorly written bug reports and rejects them.

A cool thing about LLMs is they have infinite patience. They can go back and forth with the user until they either sort out how to make a usable bug report, or give up.

bfdm
6 replies
3d15h

While it might tickle metrics the right way, frustrating a user into giving up because your bot was not satisfied is not solving their problem.

throwup238
3 replies
3d12h

I think that depends on the exact KPI.

ffsm8
2 replies
3d10h

KPI stands for key performance indicator. It is a tool to grade people or teams by applying numbers to their work.

The only relationship you can have between these is that a ticket with a "resolved" status can be used as a KPI, but you're trying to invert the relationship here, which doesn't work. After all, it's an indicator and not a causal relationship

shabble
1 replies
3d6h

"ratio of open/total issues" can definitely be gamed by autoclosing anything that isn't an easy fix.

"average time to resolution" is also susceptible.

Both of these are pretty common all over the place, including OSS e.g. https://isitmaintained.com/#metrics

I suspect this sort of thing is one of the major motivations for the (as a user/reporter) infuriating rise in automated "this bug hasn't been touched in NN days, autoclosing for staleness" bots on various issue trackers.

bee_rider
0 replies
3d3h

This whole “worrying about KPI’s for my free, open source, community project” thing seems weird to me. (Not to say I don’t believe you, but I don’t understand why people want to inject this annoying mini-game into their hobby).

I’m not sure what to think about the auto-close bots. Which do you think would be more annoying as the person who made the report: having a report that just sits there forever and you just have to hope somebody decided to pick it up, or having the issue auto-closed? (I’m truly and honestly not sure). At least in the case of the former you have a clear marker for when you should try again. But getting rejected by a bot can definitely be annoying.

mdaniel
0 replies
3d2h

Oh, so you've experienced those "stale" bots on GitHub. Good times.

bee_rider
0 replies
3d3h

I was thinking in the context of an open source project, where the users are hopefully converting to productive community members. If it is, like, a job, with a customer service relationship, where they are paying to be able to just throw problems at you and you have to deal with fixing them, I’m sure this wouldn’t fly, so I agree there. (I think my brain short-circuited to open source because it is on GitHub, haha, but of course there’s no reason this couldn’t be used in a proprietary setting).

I’m not sure how it would work out in the case of a free, community driven project, though. The goal isn’t to serve users, it is to convert users into helpful community members. If the bot converts people who wouldn’t otherwise be converted, it seems like a win. If it chases away users who could have been converted with human intervention, that’s a lose. But the human community members can always jump into the thread as well… if the bot is filtering out lots of people and nobody from the community is intervening, I guess that tells us something about the priorities of the community, haha.

megablast
2 replies
3d17h

Exactly. This is not perfect and doesn't fix every report so it is useless.

skywhopper
0 replies
3d16h

On the contrary, it’s worse than useless. If it could fix 12% of bugs (it can’t — it only fixes 12% of their benchmark suite), you’d still have to figure out which 12% of the responses it gave were good. So, 88% of the time you’d have wasted time confirming a “fix” that doesn’t work. But it’s worse than that. Because even on the fixes it got right, you’d still have to fully vet it, because this tool doesn’t know when it can’t solve something, or ask for clarification. It just gives a wrong answer.

So you didn’t save 12% of your effort, you wasted probably more than double your effort checking the work of a tool that is wrong eight out of nine times.

dimal
0 replies
2d22h

That’s not what I said and you know it. I’m not saying LLMs are useless. I’m not even saying this tool is useless. I’m saying I’m not impressed with this tool, at least as represented in the demo.

jcarrano
0 replies
3d8h

Maybe it would be better if the agent helped people submit better reports instead of trying to fix the issue itself. E.g. it could ask them to add missing information, test different combinations of inputs, etc. It could also learn which maintainer to ping according to the type of issue.

gorjusborg
0 replies
3d6h

If the bug report needs to be of a certain quality to work, they've just invented issue-oriented programming.

forty
0 replies
3d10h

The trick is that people would use LLM to write very long and detailed bug reports :p

codeonline
0 replies
3d15h

I agree that bugs aren't as well specified as the example. But a specification for a new feature certainly can be.

I'm going to give it a try on my side project and see if it can at least provide a hint or some guidance on the development of small new features in an existing well structured project.

chinchilla2020
0 replies
3d3h

Agreed. I have never encountered a simple math bug in the wild.

To a non-programmer, putting in tests for myfunc(x) {return x + 2;} sounds useful but in reality computers do not tend to have any issues performing basic algebra.

sumeruchat
9 replies
3d20h

Once we have this fully automated, any good developer could have a team of 100 robo SWEs and ship like crazy. The real competition is with those devs not with the bots.

recursive
8 replies
3d19h

Shipping like crazy isn't useful by itself. Shipping non-garbage and being able to maintain it still has some value.

sumeruchat
7 replies
3d16h

Would you say cloning a complex SaaS startup in a week, with payments integrated, after letting AI just scrape them (or uploading screenshots of their app), is creating value?

int_19h
2 replies
3d16h

Depends on how many security vulnerabilities are in that payments system.

Or, I suppose, it depends on whose value. The consultants that'll have to be hired by the poor shmuck who paid for that will make a fortune auditing and cleaning up the code.

sumeruchat
1 replies
3d7h

None, because 1) this is pretty standard stuff with Stripe, and 2) the good developer can go through the code and fix them in a few hours if there were any

int_19h
0 replies
2d18h

AI will quite readily write bad quality code with security vulnerabilities even for bog standard stuff (like say SQL injections).

And sure, a good developer can fix it if they will see it. But they won't when running on that kind of schedule.

gloosx
2 replies
3d12h

Until you sell it to someone, it will only create bills. Development is such a minuscule part of a successful startup

sumeruchat
1 replies
3d7h

On Vercel it's almost free to deploy a complex app.

gloosx
0 replies
3d2h

free cheese is only in a mousetrap ;)

recursive
0 replies
3d15h

Not without more information.

rwmj
7 replies
3d20h

Do we know how much extra work it created for the real people who had to review the proposed fixes?

r0ze-at-hn
6 replies
3d20h

Ah, well let me tell you about my pull request reviewer LLM project.

ActionHank
5 replies
3d18h

Jokes on you, let me tell you about my prompt to binary LLM project.

Hello world is 10GB, but even grandma can make hello worlds now.

peteradio
2 replies
3d18h

Let me tell you about my LLM project called grandma. It's fine tuned in order to replace your grandma but in principle it could replace your great-grandma.

barfbagginus
1 replies
3d17h

My grandma used to tell me stories about how to destroy capitalism.. I miss her.. can your grandma help guide my revolutionary efforts? That would really help me honor my granny's memory <3

vertis
0 replies
3d7h

What you want here is a local uncensored model. Preferably one you've trained from scratch, otherwise a government could have put in bad information that would cause your revolutionary efforts to fail.

Havoc
1 replies
3d17h

But does it contain a heavily obfuscated back door?

labster
0 replies
3d13h

Why does it take so long to get changes to your LLM merged? This is ridiculous. Please appoint Havoc as a maintainer already.

danenania
7 replies
3d20h

I'm working on a somewhat similar project: https://github.com/plandex-ai/plandex

While the overall goal is to build arbitrarily large, complex features and projects that are too much for ChatGPT or IDE-based tools, another aspect that I've put a lot of focus on is how to handle mistakes and corrections when the model starts going off the rails. Changes are accumulated in a protected sandbox separate from your project files, a diff review TUI is included that allows for bad changes to be rejected, all actions are version-controlled so you can easily go backwards and try a different approach, and branches are also included for trying out multiple approaches.

I think nailing this developer-AI feedback loop is the key to getting authentic productivity gains. We shouldn't just ask how well a coding tool can pass benchmarks, but what the failure case looks like when things go wrong.

barfbagginus
2 replies
3d16h

How open are you to moving plandex cloud over to AGPL? I know, tough ask right out the gate! Think about that one for a bit.

How is your market testing going?

Do you have contracts with clients amenable to letting you write case studies? Do you need help selling, designing, or fulfilling these kinds of pilot contracts?

What are your plans for docs and PR?

As a researcher, it's currently hard to situate plandex against existing research, or anticipate where a technical contribution is needed.

As a business owner, it's currently hard to visualize plandex's impact on a business workflow.

Are you open to producing a technical report? Detail plandex methodology, benchmark efficiency, ablation tests for key contributions, customer case studies, relevant research papers, and next steps/help needed.

What do you think?

If plandex is interested in being a fully open org, then I'd be interested in seeing it find its market footing and grow its technical capabilities. We need open source orgs like this!

danenania
1 replies
3d15h

It’s AGPL licensed already :)

barfbagginus
0 replies
2d20h

Did I miss the plandex-cloud repo? It seems like it's proprietary at this time. I couldn't find the AWS design, billing system, user dashboards, and admin dashboards.

Can you point me to the missing code?

panqueca
1 replies
3d17h

Does it work with a large existing codebase?

danenania
0 replies
3d16h

Yes, at least up to the point of the context limit of the underlying model. If you needed to go beyond that, you would break the work up into separate "plans" (a plan is a set of tasks with an attached context and conversation).

The general workflow is to load some relevant context (could be a few files, an entire directory, a glob pattern, a URL, or piped in data), then send a prompt. Quick example:

  plandex new
  plandex load components/some-component.ts lib/api.ts package.json https://react.dev/reference/react/hooks
  plan tell "Update the component in components/some-components.ts to load data from the 'fetchFooBars' function in 'lib/api.ts' and then display it in a datagrid. Use a suitable datagrid library."

From there the plan will start streaming. Existing files will be updated and new files created as needed.

One thing I like about it for large codebases compared to IDE-based tools I've tried is that it gives me precise control over context. A lot of tools try to index the whole codebase and it's pretty opaque--you never really know what the model is working with.

etheridev
1 replies
3d20h

You need to make yourself a business analyst agent to provide the feedback! To make it real, perhaps a team of them with conflicting personalities.

danenania
0 replies
3d20h

I think we'll get there at some point, but one thing I've learned from this project is how difficult it is to stack AI interactions. Each little bit of AI-based logic that gets added tends to fail terribly at first. Only after a long period of intense testing and iteration does it become remotely usable. The more you are combining different kinds of tasks, the more difficult it gets.

anotherpaulg
7 replies
3d22h

Very cool project!

I've experimented in this direction previously, but found agentic behavior is often chaotic and leads to long expensive sessions that go down a wrong rabbit hole and ultimately fail.

It's great that you succeed on 12% of swe-bench, but what happens the other 88% of the time? Is it useless wasted work and token costs? Or does it make useful progress that can be salvaged?

Also, I think swe-bench is from your group, right? Have you made any attempt to characterize a "skilled human upper bound" score?

I randomly sampled a dozen swe-bench tasks myself, and found that many were basically impossible for a skilled human to "solve". Mainly because the tasks were underspecified wrt the hidden test cases that determine passing. The tests were checking implementation-specific details from the repo's PR that weren't actually stated requirements of the task.

a_wild_dandan
5 replies
3d21h

Personally, I'd just use one of my local MacBook models (e.g. Mixtral 8x7b) and forget about any wasted branches & cents. My debugging time costs many orders of magnitude more than SWE-agent, so even a 5% backlog savings would be spectacular!

swatcoder
1 replies
3d19h

My debugging time costs many orders of magnitude more than SWE-agent

Unless your job is primarily to clean up somebody else's mess, your debugging time is a key part of a career-long feedback loop that improves your craft. Be careful not to shrug it off as something less. Many, many people are spending a lot of money to let you forget it, and once you do, you'll be right there in the ranks of the cheaply replaceable.

(And on the odd chance that cleaning up other people's mess is your job, you should probably be the one doing it; and for largely the same reasons)

nickpsecurity
0 replies
3d19h

I totally agree. My solution to this was limiting my AI use to (a) whatever didn't impair creativity and (b) just in general to keep the brain sharp. If using AI regularly, one could just manually solve a percentage of the problems.

int_19h
0 replies
3d16h

Given that they got 12% with GPT-4, which is vastly better than any open model, I doubt this would be particularly productive. And powering compute at full load is going to add up.

ein0p
0 replies
3d20h

I’ve tried this with another similar system. FOSS LLMs including Mixtral are currently too weak to handle something like this. For me they run out of steam after only a few turns and start going in circles unproductively

Aperocky
0 replies
3d19h

That's assuming that the other 95% stays the same with this new agent (vs creating more work for you to now also have to parse what the model is saying).

senko
0 replies
3d11h

If you don't mind me asking, which agentic tools/frameworks have you tried for code fixing/generation, with which LLMs?

unit_circle
5 replies
3d22h

A 1/8 chance of fixing a bug at the cost of a careful review and some corrections is not bad.

0% -> 12% improvement is not bad for two years either (I'm somewhat arbitrarily picking the release date of ChatGPT). If this can be kept up for a few years we will have some extremely useful tooling. The cost can be relatively high as well, since engineering time is currently orders of magnitude more expensive than these tools.

blharr
1 replies
3d20h

I still don't know. I feel like there are many ways where GPT will write some code or fix a bug in a way that makes it significantly harder to debug. Even for relatively simple tasks, it's kind of like machine-generated code that I would not want to touch.

WanderPanda
0 replies
3d20h

It is a bit worrisome, but we manage to deal with subpar human code as well. Often the boilerplate generated by ChatGPT is already better than what an inexperienced coder would string together. I'm sure it will not be a free lunch, but the benefits will probably outweigh the downsides.

Interesting scalability questions will arise wrt security when scaling the already unmanageably large code bases by another magnitude (or two), though.

stefan_
0 replies
3d21h

These "benchmarks" are tuned around reporting some exciting result; once you look inside, all the "fixes" are trash.

golergka
0 replies
3d21h

It's still abysmal from POV of actually using it in production, but it's a very impressive rate of improvement. Given what happened with LLMs and image generation in the last few years, we can probably assume that these systems will be able to fix most trivial bugs pretty soon.

SrslyJosh
0 replies
3d9h

If someone submitted 8 PRs and 7 of them were bullshit, I would close anything else they submitted in the future without even bothering to review.

pjmlp
4 replies
3d11h

Eventually it will be a 90% fix rate, and everyone cheering about the 12% will be flipping burgers instead.

iLoveOncall
1 replies
3d10h

Flipping burgers will be automated long before AI fixes any relevant number of bug reports.

pjmlp
0 replies
3d10h

Might be, still the point I was trying to make remains.

littlestymaar
0 replies
2d22h

Why would a human ever flip a burger at that point? It's not a particularly difficult task for a robot.

Unclogging sewers on the other hand…

fennecfoxy
0 replies
3d5h

I still think this is a long way off, but it definitely ties into UBI etc. and improving the general human condition: taxing the rich, restricting investment in protected things like housing and in public industries like water, electricity, healthcare and internet.

What's funny is that people on here & tech people in general seem to be the most averse to improving equity between all humans/stopping the obscenely rich from abusing and twisting the system. Do many HN peeps believe they're all somehow gonna become billionaires one day?

toddmorey
3 replies
3d22h

I’m always fascinated to read the system prompts & I always wonder what sort of gains can be made optimizing them further.

Once I’m back on desktop I want to look at the gut history of this file.

hazn
1 replies
3d19h

DSPy is the best tool for optimizing prompts [0]: https://github.com/stanfordnlp/dspy

Think of it as a meta-prompt optimizer: it uses an LLM to optimize your prompts, to optimize your LLM.

toddmorey
0 replies
3d19h

Excellent! Thanks for sharing this!

clement_b
0 replies
3d21h

I have a git feeling this comment was written on mobile.

aussieguy1234
3 replies
3d20h

12% fix rate = 88% bug rate

mlcrypto
2 replies
3d20h

Yep. After xz we don't need a bot mindlessly fixing all suggestions from malicious actors

aussieguy1234
0 replies
3d13h

Fix one bug, introduce 5 more

Dylan16807
0 replies
3d16h

I don't think xz makes a difference here. The perceived likelihood of problems, malicious or not, is pretty much the same. As far as this discussion goes, it's just another example in the pile of examples, not an event with meaningful before and after epochs.

JonChesterfield
3 replies
3d13h

If AI generated pull requests become a popular thing we'll see the end of public bug trackers.

(not because bugs will be gone - because the cost of reviewing the PR vs the benefit gained to the project will be a substantial net loss)

CGamesPlay
1 replies
3d12h

Not a chance. If AI-generated pull requests become popular, GitHub will automatically offer them in response to opened issues. Case in point: they already are popular for dependency upgrades.

JonChesterfield
0 replies
3d11h

And thus issues will no longer be opened

itsgrimetime
0 replies
3d12h

It'll likely keep getting better; if it gets to 30-40% I'd say that's a decent trade-off. Also, could you boost your chances by having the AI do a second pass and double-check the work? I'd be curious what the success rate of an LLM "determining whether a bug fix is valid" is.

bwestergard
2 replies
3d23h

Friendly suggestion to the authors: success rates aren't meaningful to anyone but a handful of researchers. They should add to the README a few examples of tests SWE-agent passed and did not pass.

nyrikki
0 replies
3d23h

Yes please, the code quality on Devin was incredibly poor in all examples I traced down.

At least from a maintainability perspective.

I would like to see if this implementation is less destructive or at least more suitable for a red-green-refactor workflow.

NegativeLatency
0 replies
3d20h

Unless you weren't actually that successful but need to publish a "successful" result

tibbetts
1 replies
3d16h

But can their AI quietly introduce a security exploit into a GitHub project?

worthless-trash
0 replies
3d15h

Copilot already does this.

paradite
1 replies
3d20h

For anyone who didn't bother looking deeper: the SWE-bench benchmark contains only Python projects, so it is not representative of all programming languages and frameworks.

I'm working on a more general SWE task eval framework in JS for arbitrary languages and frameworks now (for starters JS/TS, SQL and Python), for my own prompt engineering product.

Hit me up if you are interested.

barfbagginus
0 replies
3d17h

Assuming the data set is proprietary, else please share the repo

mdaniel
1 replies
3d22h

I think that "Demo" link is just an extremely annoying version of an HTML presentation; they could save me a shitload of clicking if they just dumped their presentation out to a PDF or whatever so I could read faster than watching it type out text as if it were live. It also whines a lot on the console about its inability to connect to a websocket server on 3000, but I don't know what it would do with a websocket connection if it had it.

SrslyJosh
0 replies
3d8h

Probably created with an LLM.

lispisok
1 replies
3d22h

Their demo is so similar to the Devin one I had to go look up the Devin one to check I wasn't watching the same demo. I feel like there might be a reason they both picked SymPy. Also, I rarely put weight on demos. They are usually cherry-picked at best and outright fabricated at worst. I want to hear what 3rd parties have to say after trying these things.

lewhoo
0 replies
3d20h

Maybe that's the point of this research: hey look, we reproduced the way to game the stats a bit. I really can't tell anymore.

iLoveOncall
1 replies
3d22h

And creates how many new ones?

This and Devin generate garbage code that will make any codebase worse.

It's a joke that 12.5% is even associated with the word "success".

1letterunixname
0 replies
3d22h

Do spaces and spelling fixes count?

Copilot, so far, is only good for predicting the next bit of similar patterns of code.

barfbagginus
1 replies
3d16h

I would like something like this that helps me, as a green developer, find open source projects to contribute to.

For instance, I recently learned about how to replace setup.py with pyproject.toml for a large number of projects. I also learned how to publish packages to pypi. These changes significantly improve project ease and accessibility, and are very easy to do.
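
The core of such a change is small; a minimal pyproject.toml looks something like this (project metadata here is hypothetical):

    [build-system]
    requires = ["setuptools>=61"]
    build-backend = "setuptools.build_meta"

    [project]
    name = "example-project"
    version = "0.1.0"
    description = "Metadata that previously lived in setup.py"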

The main thing that holds people back is that python packaging documentation is notoriously cryptic - well I've already paid that cost, and now it's easy!

So I'm thinking of finding projects that are healthy, but haven't focused on modernizing their packaging or distributing their project through pypi.

I'd build human + agent based tooling to help me find candidates, propose the improvement to existing maintainers, then implement and deliver.

I could maybe upgrade 100 projects, then write up the adventure.

Anyone have inspiration/similar ideas, and wanna brainstorm?

SrslyJosh
0 replies
3d16h

...or you could just use the GitHub API to find projects that match certain criteria (e.g., no pyproject.toml). I'm not sure what the stochastic parrot adds here, besides making noob mistakes that you'll have to find and fix before you can submit PRs. You'd learn a lot more by trying to actually automate the process yourself.

Frummy
1 replies
3d22h

Interesting idea to provide the Agent-Computer Interface for it to scroll and such, so it can interact more easily from its perspective

aussieguy1234
0 replies
3d20h

Similar to how early computers didn't have enough RAM to display a whole text file, so old programmers had to work with parts of the file at a time. It's not a bad way to get around the context window problem, which is kind of similar.

trebligdivad
0 replies
3d21h

So this issues arbitrary shell commands based on trying to understand untrusted bug text? Should be fun waiting until someone finds an escape.

readthenotes1
0 replies
3d20h

I made a lot of money as I was paid hourly while working with a cadre of people I called "the defect generators".

I'm kind of sad that future generations will not have that experience...

Madmallard
0 replies
3d11h

What veterans in the field know, and AI hasn't yet tackled, is that the majority of difficulty in development is dealing with complexity and ambiguity, much of which comes down to communication between people in natural language, as well as reasoning in natural language about your system. These things are not solved by AI as it is now. If you can fully specify what you want, with all of the detail and corner cases and situation handling, then at some point AI might be able to build all of that for you. Great! Unfortunately, that's the actual hard part, not the implementation generally.