
Princeton group open sources "SWE-agent", with 12% fix rate for GitHub issues

noncoml
34 replies
3d22h

If you are afraid that LLMs will replace you at your job, ask an LLM to write Rust code for reading a utf8 file character by character

Edit: Yes, it does write some code that is "close" enough, but in some cases it is wrong, and in others it doesn't do exactly what was asked. I.e. it needs supervision from someone who understands the requirements, the code, and the problems that may arise from the naive approach the LLM is taking. Mind you, the more popular the issue, the better the approach the LLM takes. So in other words, IMHO it is a glorified Stack Overflow. Just as there are engineers that copy-paste from SO without having any idea what the code does, there will be engineers that will just copy-paste from the LLM. Their work will be much better than if they used SO, but I think it's still nowhere near the mark of a Senior SWE and above.

raggi
19 replies
3d22h

it does an ok job with this task:

    use std::fs::File;
    use std::io::{self, BufReader, Read};

    fn read_file_character_by_character(path: &str) -> io::Result<()> {
        // Open the file in read-only mode.
        let file = File::open(path)?;

        // Create a buffered reader to read the file more efficiently.
        let reader = BufReader::new(file);

        // `chars` method returns an iterator over the characters of the input.
        // Note that it returns a Result<(char, usize), io::Error>, where usize is the byte length of the char.
        for char_result in reader.chars() {
            match char_result {
                Ok(c) => print!("{}", c),
                Err(e) => return Err(e),
            }
        }

        Ok(())
    }

    fn main() {
        let path = "path/to/your/file.txt";

        if let Err(e) = read_file_character_by_character(path) {
            eprintln!("Error reading file: {}", e);
        }
    }

comex
9 replies
3d22h

Only problem is that the critical `chars` method doesn't actually exist. Rust's standard library has a `chars` method for strings, but not for Readers.

(Also, the comment about the iterator element type is inconsistent with the code following it. Based on the comment, `c` would be of type `(char, usize)`, but then trying to print it with {} would fail because tuples don't implement Display.)

raggi
8 replies
3d21h

good catch. feeding it the error output of rustc it then produces:

    use std::fs::File;
    use std::io::{self, Read};

    fn read_file_character_by_character(path: &str) -> io::Result<()> {
        let mut file = File::open(path)?;
        let mut contents = String::new();

        file.read_to_string(&mut contents)?;

        for c in contents.chars() {
            println!("{}", c);
        }

        Ok(())
    }

    fn main() {
        let path = "path/to/your/file.txt";

        if let Err(e) = read_file_character_by_character(path) {
            eprintln!("Error reading file: {}", e);
        }
    }

phaer
5 replies
3d21h

But this doesn't read the file char-by-char; it reads the whole file into a string and then iterates over it

scottlamb
4 replies
3d20h

What would you expect? There's no OS API for "read one character", except in say ASCII where 1 byte = 1 code point = 1 character. And it'd be hideously inefficient anyway. So you either loop over getting the next N bytes and getting all complete characters so far (with some extra complexity around characters that cross chunk boundaries) or you read the whole thing into a single buffer and iterate the characters. This code does the latter. If this tool doesn't have the ability to respond by asking requirements questions, I'd consider either choice valid.
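
For the curious, a rough Rust sketch of that first option (illustrative only and untested; the function name is made up):

    use std::fs::File;
    use std::io::{self, Read};
    use std::str;

    fn read_chars_streaming(path: &str) -> io::Result<()> {
        let mut file = File::open(path)?;
        let mut chunk = [0u8; 8192];
        // Holds trailing bytes of a character split across chunk boundaries.
        let mut pending: Vec<u8> = Vec::new();

        loop {
            let n = file.read(&mut chunk)?;
            if n == 0 {
                break;
            }
            pending.extend_from_slice(&chunk[..n]);

            // Decode the longest valid UTF-8 prefix; carry the remainder over.
            let valid_len = match str::from_utf8(&pending) {
                Ok(s) => {
                    for c in s.chars() {
                        print!("{}", c);
                    }
                    pending.len()
                }
                // error_len() == None means the buffer merely ends mid-character.
                Err(e) if e.error_len().is_none() => {
                    let valid = e.valid_up_to();
                    for c in str::from_utf8(&pending[..valid]).unwrap().chars() {
                        print!("{}", c);
                    }
                    valid
                }
                Err(_) => {
                    return Err(io::Error::new(io::ErrorKind::InvalidData, "invalid UTF-8"))
                }
            };
            pending.drain(..valid_len);
        }

        if pending.is_empty() {
            Ok(())
        } else {
            Err(io::Error::new(io::ErrorKind::InvalidData, "truncated UTF-8"))
        }
    }

    fn main() {
        if let Err(e) = read_chars_streaming("path/to/your/file.txt") {
            eprintln!("Error reading file: {}", e);
        }
    }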

Of course, in real life, I do expect to get requirements questions back from an engineer when I assign a task. Seems more practical than anticipating everything up-front into the perfect specification/prompt. Why shouldn't I expect the same from an LLM-based tool? Are any of them set up to do that?

1letterunixname
3 replies
3d18h

There most certainly is getwchar() and fgetwc()/getwc() on anything that's POSIX C95, so that's more or less everything that's not a vintage antique.

Reading individual UTF-8 codepoints is a trivial exercise if a byte-wide getchar() is available, and portable C code to do so would run on anything made after 1982. IIRC, they don't teach how to write portable C code in Comp Sci programs anymore, and it's a shame.

Never read a file completely into memory at once unless there is zero chance of it being a huge file because this is an obvious DoS vector and waste of resources.

scottlamb
2 replies
3d16h

There most certainly is getwchar() and fgetwc()/getwc() on anything that's POSIX C95, so that's more or less everything that's not a vintage antique.

Apologies for the imprecision: by OS API, I meant syscall, at least on POSIX systems. The functions you refer to are C stdio things. Note also they implement on top of read(2) one of the two options I mentioned: "loop over getting the next N bytes and getting all complete characters so far (with some extra complexity around characters that cross chunk boundaries)".

btw, if we're being precise, getwchar gets a code point, and character might mean grapheme instead. Same is true for the `str::chars` call in the LLM's Rust snippet. The docstring for that method mentions this [1] because it was written in this century after people thought about this stuff a bit.
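
For instance (this leans on the third-party unicode-segmentation crate, since the standard library stops at code points):

    // Third-party crate; `str::chars` alone only yields code points.
    use unicode_segmentation::UnicodeSegmentation;

    fn main() {
        let s = "e\u{301}"; // 'e' followed by U+0301 COMBINING ACUTE ACCENT
        assert_eq!(s.chars().count(), 2);         // two code points...
        assert_eq!(s.graphemes(true).count(), 1); // ...one user-perceived character
    }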

portable C code to do so would be able to run on anything made after 1982.

Our comments are part of a thread discussing this prompt [2] that specifically requests Rust and this snippet in response [3]. Not portable C code. You can use those C stdio functions from Rust, but you really shouldn't without a very good reason. Rust has its own IO library that is safe and well integrated with other Rust things like `#![derive(Debug)]`.

[1] https://doc.rust-lang.org/std/primitive.str.html#method.char...

[2] https://news.ycombinator.com/item?id=39910542

[3] https://news.ycombinator.com/item?id=39910542

1letterunixname
1 replies
3d12h

Yes, the userland side presents this via POSIX as ssize_t read(int fd, void* buf, size_t count). Calling that with count = 1 each time would be wasteful, but libcs have certainly been buffering on top of it since at least the 1980s. I remember this was the case with Borland C/C++.

Our comments are part of a thread discussing this prompt [2] that specifically requests Rust and this snippet in response [3]. Not portable C code. You can use those C stdio functions from Rust, but you really shouldn't without a very good reason. Rust has its own IO library that is safe and well integrated with other Rust things like `#![derive(Debug)]`.

Duh. It doesn't really matter what Rust has have went it comes to enabling the use of specific edge-case performance improvements for specific purposes. Inefficient AI-generated code without a clue of other approaches doesn't move the needle. Religious purity doesn't matter, only results matter.

scottlamb
0 replies
3d3h

Duh. It doesn't really matter what Rust has have went it comes to enabling the use of specific edge-case performance improvements for specific purposes. Inefficient AI-generated code without a clue of other approaches doesn't move the needle. Religious purity doesn't matter, only results matter.

No idea what this incoherent, ungrammatical paragraph is supposed to be saying. But if you're under the impression Rust doesn't have its own buffered IO facilities or that using Rust-native libraries offers only "religious purity" benefits over extern "C" stuff, you're mistaken.

This has diverged from what I'm interested in discussing anyway; see my question upthread about whether there are any LLM tools that gather requirements from incomplete specs in the way I expect human engineers to. In this case, I'd expect it to ask questions such as "how large are input files expected to be?" Better yet, ask what the greater purpose is, as "character by character" is rarely useful.

deathanatos
0 replies
3d20h

On errors, it exits with success.

1letterunixname
0 replies
3d18h

A few notes:

- It should be generating `path: impl AsRef<Path>` to be properly generic.

- It's not setting a nonzero exit code on error. (A sketch of both fixes follows this list.)

- Edge-case handling is a vital property for production-usable tools at scale. I'm wondering if it can yet special-case situations such as generating a conditionally compiled version for Linux that uses the splice syscall when the arguments are two file handles.
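
A quick sketch of the first two fixes (keeping the whole-file-in-memory approach from upthread; the path is a placeholder):

    use std::fs;
    use std::io;
    use std::path::Path;
    use std::process::ExitCode;

    // Generic over anything path-like, per the first note.
    fn read_file_character_by_character(path: impl AsRef<Path>) -> io::Result<()> {
        for c in fs::read_to_string(path)?.chars() {
            print!("{}", c);
        }
        Ok(())
    }

    fn main() -> ExitCode {
        match read_file_character_by_character("path/to/your/file.txt") {
            Ok(()) => ExitCode::SUCCESS,
            Err(e) => {
                eprintln!("Error reading file: {}", e);
                ExitCode::FAILURE // nonzero exit code, per the second note
            }
        }
    }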

woodruffw
7 replies
3d22h

If we're being sticklers, this isn't reading character-by-character: it's performing a buffered read, which then gets iterated over.

raggi
2 replies
3d22h

here's what it does if i feed your nitpick into it verbatim:

    use std::fs::File;
    use std::io::{self, Read};
    use std::str::from_utf8;

    fn read_file_character_by_character(path: &str) -> io::Result<()> {
        // Open the file in read-only mode.
        let mut file = File::open(path)?;

        let mut buffer = [0; 4]; // UTF-8 character max byte length
        let mut temp_buffer = vec![];

        while let Ok(bytes_read) = file.read(&mut buffer[..1]) { // Read byte by byte
            if bytes_read == 0 { // End of file
                break;
            }

            temp_buffer.push(buffer[0]);

            if let Ok(s) = from_utf8(&temp_buffer) {
                if s.chars().count() == 1 { // Complete character read
                    print!("{}", s);
                    temp_buffer.clear(); // Clear temp buffer for next character
                }
            }
        }

        Ok(())
    }

    fn main() {
        let path = "path/to/your/file.txt";

        if let Err(e) = read_file_character_by_character(path) {
            eprintln!("Error reading file: {}", e);
        }
    }

comex
1 replies
3d21h

Unlike the original version, this version compiles and seems to basically work correctly. However, the design is misleading: `buffer` is declared as an array of 4 bytes but only the first byte is ever used. The code also has suboptimal performance and error handling, though that's not the end of the world.

raggi
0 replies
3d21h

all true, as I said in another fork of the thread, this comes down to part of what humans will still be valuable for in this loop: distilling poor requirements into better requirements.

noncoml
2 replies
3d21h

I wouldn't say it's a nit. The file may be tens of GB. Do you want to read it into a string?

raggi
1 replies
3d19h

The buffered read didn't do that; it used the default buffered reader implementation. IIRC that implementation currently defaults to an 8 KiB buffer window, which is a little too small to be maximally efficient for high throughput, but substantially more performant than making a syscall per byte, and without spending too much memory.
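
If the default ever is the bottleneck, requesting a bigger buffer is a one-liner (sketch; the path is a placeholder):

    use std::fs::File;
    use std::io::{self, BufReader};

    fn main() -> io::Result<()> {
        let file = File::open("path/to/your/file.txt")?;
        // Ask for a 64 KiB buffer instead of the default.
        let _reader = BufReader::with_capacity(64 * 1024, file);
        Ok(())
    }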

noncoml
0 replies
3d18h

I was talking about this:

    let mut file = File::open(path)?;
    let mut contents = String::new();
    file.read_to_string(&mut contents)?;

deathanatos
0 replies
3d20h

The original prompt is a bit under-specified. (But hey, that certainly matches the real world!)

You're going to have to buffer at least a little, to figure out where the USV / grapheme boundary is, depending on our definition of "character". To me, a BufReader is appropriate here; it avoids lots of tiny reads to the kernel, which is probably the right behavior in a real case.

To me, "read character by character" vaguely implies something that's going to yield a stream of characters. (Again, for some definition there.)

raggi
0 replies
3d22h

fwiw, the benchmark that matters really has nothing to do with authoring code.

the typing of code is the easy part even though it's a part a lot of folks are somewhat addicted to.

the things which have far more value are applying value judgements to requirements, correlating and incorporating sparse and inaccurate diagnostic information into a coherent debugging strategy, and so on. there will come a time when it can assist with these too, probably first on requirements distillation, but for more complex debugging tasks that's a novel problem solving area that we've yet to see substantial movement on.

so if you want to stave off the robots coming for you, get good at debugging hard problems, and learn to make really great use of tools that accelerate the typing out of solutions to baseline product requirements.

iwontberude
4 replies
3d22h

Hypothetically, which ticker symbols would you buy put contracts on, at what strike prices, and at what expiration dates? As far as I can tell, a lot of people are betting a lot of money that you are wrong, but actually I think you are right.

noncoml
1 replies
3d21h

Ugh, I am not claiming that LLMs are not a great innovation. Just that they are not going to replace SWE jobs in our (maybe my) lifetime.

vertis
0 replies
3d8h

Conservatively, I think LLMs will replace SWE roles within the next 5-10 years. In that period SWE roles will change drastically, toward more herding of AI agents.

We can't hope to compete with something that can edit all the files in less than 5 minutes.

If you're not at least trying to plan what your life will look like under these circumstances then you're doing yourself a disservice.

jeremyjh
1 replies
3d21h

The most relevant companies focused on this aren't publicly traded. The ones that are publicly traded, like MSFT, have way too many other factors affecting their value - not to mention the fact that they'll make money on generative AI that has nothing to do with coding, regardless of whether an SWE-agent ever works.

iwontberude
0 replies
3d18h

Oh well you should hear the hype from CNBC and other places, they are strongly intimating that gen AI will replace SWEs on product development teams. I totally agree it’s not likely, but it’s starting to get baked into asset prices and I want to profit from that misunderstanding.

userbinator
3 replies
3d18h

I'm not afraid of LLMs replacing me because of their output quality. The problem is the proliferation of quantity-over-quality "churn out barely-working crap as fast as possible" culture that gives LLMs the advantage over real humans.

int_19h
2 replies
3d16h

I'm kinda hoping that LLMs will get pushed into production use writing code before they have acceptable quality (because greed), and the result will be lots of crap that's so badly broken most of the time that there will be a massive pushback against said culture from the users. Maybe from the governments as well, after a few well-publicized infrastructure failures.

userbinator
1 replies
2d17h

Unfortunately, "lots of crap that's so badly broken most of the time" already describes a lot of software these days, and yet there hasn't been much pushback. Everyone seems to be mostly in a state of learned helplessness.

int_19h
0 replies
1d18h

There is pushback, it's just not broad enough yet. Because, as broken as things are (which is especially visible to those of us making the sausage or watching it made), they still kinda sorta work most of the time, to the point where users grumble but learn to live with it.

But I think that AI coding will upset this equilibrium by reducing the quality even more, and significantly enough that users will very much notice - and for many of them it will push things into "what I need doesn't work most of the time" category. And then there will be payback.

Then again, I am an optimist.

DabbyDabberson
3 replies
3d21h

The way I see it, it's undetermined whether Generative AI will be able to fully do a SWE job.

But for most of the debates I've seen, I don't think the answer matters all that much.

Once we have models that can act as full senior SWEs... the models can engineer the models. And then we've hit the recursive case.

Once models can engineer models better and faster than humans, all bets are off. It's the foggy future. It's the singularity.

vertis
0 replies
3d8h

People (SWEs) don't want to hear this. I think it's an inevitability that something of this nature will happen.

int_19h
0 replies
3d16h

The implicit assumption here is that a human "senior SWE" can engineer a model of the same quality that is capable of simulating him. Which is definitely not true with the best models that we have today - and they certainly can't simulate a senior SWE, so the actual bar is higher.

I'm not saying that the whole "robots building better robots" thing is a pipedream, but given where things are today, this is not something that's going to happen soon.

dvt
0 replies
3d20h

Once we have models that can act as full senior SWEs.. the models can engineer the models.

This is such an extremely bullish case, I'm not sure why you'd think this is even remotely possible. A Google search is usually more valuable than ChatGPT. For example, the rust utf-8 example is already verbatim solved on reddit: https://www.reddit.com/r/rust/comments/l5m1rw/how_can_i_effi...

vineyardmike
0 replies
3d22h

Yea the problem with that is the control group - grab any SWE and ask them the same thing. I don’t think most would pass. Unless you want to give an SWE time to learn… then it’s hardly fair. And I vaguely trust the LLM to be able to learn it too.

Also I just asked Claude and Gemini and they both provided an implementation that matches the "bytes to UTF-8" Rust docs. Assuming those are right, LLMs can do this (but I haven't tested the code).

https://doc.rust-lang.org/std/string/struct.String.html

dimal
29 replies
3d20h

The demo shows a very clearly written bug report about a matrix operation that's producing an unexpected output. Umm… no. Most bug reports you get in the wild are more along the lines of "I clicked on X and Y happened", then if you're lucky they'll say "and I expected Z". Usually the Z expectation is left for the reader to fill in, because as human users we understand the expectations.

The difficulty in fixing a bug is in figuring out what’s causing the bug. If you know it’s caused by an incorrect operation, and we know that LLMs can fix simple defects like this, what does this prove?

Has anyone dug through the paper yet to see what the rest of the issues look like? And what the diffs look like? I suppose I’ll try when I have a sec.

drcode
11 replies
3d19h

Most bug reports you get in the wild are more along the lines of

Since this fixes 12% of the bugs, the authors of the paper probably agree with you that 100 - 12 = 88%, and hence "most bugs", don't have nicely written bug reports.

aiauthoritydev2
5 replies
3d14h

12% is a very very large number for that kind of problem. I doubt even 0.1% of bug reports in the wild are that well written.

sitkack
3 replies
3d10h

Have the LLM rewrite the bug reports.

killingtime74
1 replies
2d23h

Why not have LLM write AGI while you're at it

sitkack
0 replies
2d23h

It is and it will!

fasa99
0 replies
2d3h

You'd want three LLMs, one to create the bugs, one to report it, one to fix it. I joke of course but on the other hand this is potentially a worthwhile architecture from a self-training perspective - a bug-creating LLM means your training set size is as big as you want it +/- GAN features.

littlestymaar
0 replies
3d10h

Except this is automated, so you could get multiple orders of magnitude more bugs filed, so you need to have a very low false-positive ratio to avoid being overwhelmed by automatically generated crap (which is basically spam).

dimal
2 replies
3d18h

I suppose I should nail down my point. No one would ever write a bug report like this. A bug generally has an unknown cause. Once you found the cause of the bug, you'd fix it. Nowadays, you could just cut and paste the problem into ChatGPT and get the answer right then. So why would anyone ever log this bug? All this demo proves is that they automated a process that didn't need automation.

hvis
1 replies
3d18h

To be fair, sometimes meticulous users investigate the bugs and write down logical chains explaining the causes and even offer a solution at the end (which they can't apply for the lack of commit access, for instance).

The proposed solution isn't always right, of course, but it would be incorrect to say that no bug reports come with a diagnosed cause. But that's exactly where a conscientious reviewer is most needed, I believe.

citrin_ru
0 replies
3d11h

I sometimes write a detailed bug report but not a PR when there are different ways to address the problem (and all look bad to me) or the fix can introduce new problems. But I would expect an LLM to ignore tradeoffs and choose an option which is not necessarily the best, for the same reason I hesitate - lack of understanding of this specific project.

skywhopper
0 replies
3d16h

It fixes 12% of their benchmark suite, not 12% of bug reports.

medellin
0 replies
3d15h

In my 15 years I would say less than 1% of bug reports are like this. If you know the bug to this level, most people would just fix it themselves.

bee_rider
7 replies
3d15h

Maybe it just needs another, independent tool. One that detects poorly written bug reports and rejects them.

A cool thing about LLMs is they have infinite patience. They can go back and forth with the user until they either sort out how to make a usable bug report, or give up.

bfdm
6 replies
3d15h

While it might tickle metrics the right way, frustrating a user into giving up because your bot was not satisfied is not solving their problem.

throwup238
3 replies
3d12h

I think that depends on the exact KPI.

ffsm8
2 replies
3d10h

KPI stands for key performance indicator. It is a tool to grade people or teams by applying numbers to their work.

The only relationship you can have between these is that a ticket with a "resolved" status can be used as a KPI, but you're trying to invert the relationship here, which doesn't work. After all, it's an indicator and not a causal relationship

shabble
1 replies
3d6h

"ratio of open/total issues" can definitely be gamed by autoclosing anything that isn't an easy fix.

"average time to resolution" is also susceptible.

Both of these are pretty common all over the place, including OSS e.g. https://isitmaintained.com/#metrics

I suspect this sort of thing is one of the major motivations for the (as a user/reporter) infuriating rise in automated "this bug hasn't been touched in NN days, autoclosing for staleness" bots on various issue trackers.

bee_rider
0 replies
3d3h

This whole “worrying about KPI’s for my free, open source, community project” thing seems weird to me. (Not to say I don’t believe you, but I don’t understand why people want to inject this annoying mini-game into their hobby).

I’m not sure what to think about the auto-close bots. Which do you think would be more annoying as the person who made the report: having a report that just sits there forever and you just have to hope somebody decided to pick it up, or having the issue auto-closed? (I’m truly and honestly not sure). At least in the case of the former you have a clear marker for when you should try again. But getting rejected by a bot can definitely be annoying.

mdaniel
0 replies
3d2h

Oh, so you've experienced those "stale" bots on GitHub. Good times.

bee_rider
0 replies
3d3h

I was thinking in the context of an open source project, where the users are hopefully converting to productive community members. If it is, like, a job, with a customer service relationship, where they are paying to be able to just throw problems at you and you have to deal with fixing them, I’m sure this wouldn’t fly, so I agree there. (I think my brain short-circuited to open source because it is on GitHub, haha, but of course there’s no reason this couldn’t be used in a proprietary setting).

I’m not sure how it would work out in the case of a free, community driven project, though. The goal isn’t to serve users, it is to convert users into helpful community members. If the bot converts people who wouldn’t otherwise be converted, it seems like a win. If it chases away users who could have been converted with human intervention, that’s a lose. But the human community members can always jump into the thread as well… if the bot is filtering out lots of people and nobody from the community is intervening, I guess that tells us something about the priorities of the community, haha.

megablast
2 replies
3d17h

Exactly. This is not perfect and doesn't fix every report so it is useless.

skywhopper
0 replies
3d16h

On the contrary, it’s worse than useless. If it could fix 12% of bugs (it can’t — it only fixes 12% of their benchmark suite), you’d still have to figure out which 12% of the responses it gave were good. So, 88% of the time you’d have wasted time confirming a “fix” that doesn’t work. But it’s worse than that. Because even on the fixes it got right, you’d still have to fully vet it, because this tool doesn’t know when it can’t solve something, or ask for clarification. It just gives a wrong answer.

So you didn’t save 12% of your effort, you wasted probably more than double your effort checking the work of a tool that is wrong eight out of nine times.

dimal
0 replies
2d22h

That’s not what I said and you know it. I’m not saying LLMs are useless. I’m not even saying this tool is useless. I’m saying I’m not impressed with this tool, at least as represented in the demo.

jcarrano
0 replies
3d8h

Maybe it would be better if the agent helped people submit better reports instead of trying to fix the issue itself. E.g. it could ask them to add missing information, test different combinations of inputs, etc. It could also learn which maintainer to ping according to the type of issue.

gorjusborg
0 replies
3d6h

If the bug report needs to be of a certain quality to work, they've just invented issue-oriented programming.

forty
0 replies
3d10h

The trick is that people would use LLM to write very long and detailed bug reports :p

codeonline
0 replies
3d15h

I agree that bugs aren't as well specified as the example. But a specification for a new feature certainly can be.

I'm going to give it a try on my side project and see if it can at least provide a hint or some guidance on the development of small new features in an existing well structured project.

chinchilla2020
0 replies
3d3h

Agreed. I have never encountered a simple math bug in the wild.

To a non-programmer, putting in tests for myfunc(x) {return x + 2;} sounds useful but in reality computers do not tend to have any issues performing basic algebra.

sumeruchat
9 replies
3d20h

Once we have this fully automated, any good developer could have a team of 100 robo SWEs and ship like crazy. The real competition is with those devs not with the bots.

recursive
8 replies
3d19h

Shipping like crazy isn't useful by itself. Shipping non-garbage and being able to maintain it still has some value.

sumeruchat
7 replies
3d16h

Would you say cloning a complex SaaS startup in a week, with payments integrated, after letting AI just scrape them (or uploading screenshots of their app), is creating value?

int_19h
2 replies
3d16h

Depends on how many security vulnerabilities are in that payments system.

Or, I suppose, it depends on whose value. The consultants that'll have to be hired by the poor shmuck who paid for that will make a fortune auditing and cleaning up the code.

sumeruchat
1 replies
3d7h

None, because 1) this is pretty standard stuff with Stripe, and 2) the good developer can go through the code and fix them in a few hours if there were any

int_19h
0 replies
2d18h

AI will quite readily write bad quality code with security vulnerabilities even for bog standard stuff (like say SQL injections).

And sure, a good developer can fix it if they will see it. But they won't when running on that kind of schedule.

gloosx
2 replies
3d12h

Until you sell it to someone, it will only create bills. Development is such a minuscule part of a successful startup

sumeruchat
1 replies
3d7h

On Vercel it's almost free to deploy a complex app.

gloosx
0 replies
3d2h

free cheese is only in a mousetrap ;)

recursive
0 replies
3d15h

Not without more information.

rwmj
7 replies
3d20h

Do we know how much extra work it created for the real people who had to review the proposed fixes?

r0ze-at-hn
6 replies
3d20h

Ah, well let me tell you about my pull request reviewer LLM project.

ActionHank
5 replies
3d18h

Jokes on you, let me tell you about my prompt to binary LLM project.

Hello world is 10GB, but even grandma can make hello worlds now.

peteradio
2 replies
3d18h

Let me tell you about my LLM project called grandma. It's fine tuned in order to replace your grandma but in principle it could replace your great-grandma.

barfbagginus
1 replies
3d17h

My grandma used to tell me stories about how to destroy capitalism.. I miss her.. can your grandma help guide my revolutionary efforts? That would really help me honor my granny's memory <3

vertis
0 replies
3d7h

What you want here is a local uncensored model. Preferably one you've trained from scratch, otherwise a government could have put in bad information that would cause your revolutionary efforts to fail.

Havoc
1 replies
3d17h

But does it contain a heavily obfuscated back door?

labster
0 replies
3d13h

Why does it take so long to get changes to your LLM merged? This is ridiculous. Please appoint Havoc as a maintainer already.

danenania
7 replies
3d20h

I'm working on a somewhat similar project: https://github.com/plandex-ai/plandex

While the overall goal is to build arbitrarily large, complex features and projects that are too much for ChatGPT or IDE-based tools, another aspect that I've put a lot of focus on is how to handle mistakes and corrections when the model starts going off the rails. Changes are accumulated in a protected sandbox separate from your project files, a diff review TUI is included that allows for bad changes to be rejected, all actions are version-controlled so you can easily go backwards and try a different approach, and branches are also included for trying out multiple approaches.

I think nailing this developer-AI feedback loop is the key to getting authentic productivity gains. We shouldn't just ask how well a coding tool can pass benchmarks, but what the failure case looks like when things go wrong.

barfbagginus
2 replies
3d16h

How open are you to moving plandex cloud over to AGPL? I know, tough ask right out the gate! Think about that one for a bit.

How is your market testing going?

Do you have contracts with clients amenable to letting you write case studies? Do you need help selling, designing, or fulfilling these kinds of pilot contracts?

What are your plans for docs and PR?

As a researcher, it's currently hard to situate plandex against existing research, or anticipate where a technical contribution is needed.

As a business owner, it's currently hard to visualize plandex's impact on a business workflow.

Are you open to producing a technical report? Detail plandex methodology, benchmark efficiency, ablation tests for key contributions, customer case studies, relevant research papers, and next steps/help needed.

What do you think?

If plandex is interested in being a fully open org, then I'd be interested in seeing it find its market footing and grow its technical capabilities. We need open source orgs like this!

danenania
1 replies
3d15h

It’s AGPL licensed already :)

barfbagginus
0 replies
2d20h

Did I miss the plandex-cloud repo? It seems like it's proprietary at this time. I couldn't find the AWS design, billing system, user dashboards, and admin dashboards.

Can you point me to the missing code?

panqueca
1 replies
3d17h

Does it work with a large existing codebase?

danenania
0 replies
3d16h

Yes, at least up to the point of the context limit of the underlying model. If you needed to go beyond that, you would break the work up into separate "plans" (a plan is a set of tasks with an attached context and conversation).

The general workflow is to load some relevant context (could be a few files, an entire directory, a glob pattern, a URL, or piped in data), then send a prompt. Quick example:

  plandex new
  plandex load components/some-component.ts lib/api.ts package.json https://react.dev/reference/react/hooks
  plan tell "Update the component in components/some-components.ts to load data from the 'fetchFooBars' function in 'lib/api.ts' and then display it in a datagrid. Use a suitable datagrid library."

From there the plan will start streaming. Existing files will be updated and new files created as needed.

One thing I like about it for large codebases compared to IDE-based tools I've tried is that it gives me precise control over context. A lot of tools try to index the whole codebase and it's pretty opaque--you never really know what the model is working with.

etheridev
1 replies
3d20h

You need to make yourself a business analyst agent to provide the feedback! To make it real, perhaps a team of them with conflicting personalities.

danenania
0 replies
3d20h

I think we'll get there at some point, but one thing I've learned from this project is how difficult it is to stack AI interactions. Each little bit of AI-based logic that gets added tends to fail terribly at first. Only after a long period of intense testing and iteration does it become remotely usable. The more you are combining different kinds of tasks, the more difficult it gets.

anotherpaulg
7 replies
3d22h

Very cool project!

I've experimented in this direction previously, but found agentic behavior is often chaotic and leads to long expensive sessions that go down a wrong rabbit hole and ultimately fail.

It's great that you succeed on 12% of swe-bench, but what happens the other 88% of the time? Is it useless wasted work and token costs? Or does it make useful progress that can be salvaged?

Also, I think swe-bench is from your group, right? Have you made any attempt to characterize a "skilled human upper bound" score?

I randomly sampled a dozen swe-bench tasks myself, and found that many were basically impossible for a skilled human to "solve". Mainly because the tasks were underspecified wrt the hidden test cases that determine passing. The tests were checking implementation-specific details from the repo's PR that weren't actually stated requirements of the task.

a_wild_dandan
5 replies
3d21h

Personally, I'd just use one of my local MacBook models (e.g. Mixtral 8x7b) and forget about any wasted branches & cents. My debugging time costs many orders of magnitude more than SWE-agent, so even a 5% backlog savings would be spectacular!

swatcoder
1 replies
3d19h

My debugging time costs many orders of magnitude more than SWE-agent

Unless your job is primarily to clean up somebody else's mess, your debugging time is a key part of a career-long feedback loop that improves your craft. Be careful not to shrug it off as something less. Many, many people are spending a lot of money to let you forget it, and once you do, you'll be right there in the ranks of the cheaply replaceable.

(And on the odd chance that cleaning up other people's mess is your job, you should probably be the one doing it; and for largely the same reasons)

nickpsecurity
0 replies
3d19h

I totally agree. My solution to this was limiting my AI use to (a) whatever didn't impair creativity and (b) just in general to keep the brain sharp. If using AI regularly, one could just manually solve a percentage of the problems.

int_19h
0 replies
3d16h

Given that they got 12% with GPT-4, which is vastly better than any open model, I doubt this would be particularly productive. And powering compute at full load is going to add up.

ein0p
0 replies
3d20h

I’ve tried this with another similar system. FOSS LLMs including Mixtral are currently too weak to handle something like this. For me they run out of steam after only a few turns and start going in circles unproductively

Aperocky
0 replies
3d19h

That's assuming that the other 95% stays the same with this new agent (vs creating more work for you to now also have to parse what the model is saying).

senko
0 replies
3d11h

If you don't mind me asking, which agentic tools/frameworks have you tried for code fixing/generation, with which LLMs?

unit_circle
5 replies
3d22h

A 1/8 chance of fixing a bug at the cost of a careful review and some corrections is not bad.

0% -> 12% improvement is not bad for two years either (I'm somewhat arbitrarily picking the release date of ChatGPT). If this can be kept up for a few years we will have some extremely useful tooling. The cost can be relatively high as well, since engineering time is currently orders of magnitude more expensive than these tools.

blharr
1 replies
3d20h

I still don't know. I feel like there are many ways where GPT will write some code or fix a bug in a way that makes it significantly harder to debug. Even for relatively simple tasks, it's kind of like machine-generated code that I would not want to touch.

WanderPanda
0 replies
3d20h

It is a bit worrisome, but we manage to deal with subpar human code as well. Often the boilerplate generated by ChatGPT is already better than what an inexperienced coder would string together. I'm sure it will not be a free lunch, but the benefits will probably outweigh the downsides.

Interesting scalability questions will arise wrt security when scaling the already unmanageably large code bases by another magnitude (or two), though.

stefan_
0 replies
3d21h

These "benchmarks" are tuned around reporting some exciting result; once you look inside, all the "fixes" are trash.

golergka
0 replies
3d21h

It's still abysmal from POV of actually using it in production, but it's a very impressive rate of improvement. Given what happened with LLMs and image generation in the last few years, we can probably assume that these systems will be able to fix most trivial bugs pretty soon.

SrslyJosh
0 replies
3d9h

If someone submitted 8 PRs and 7 of them were bullshit, I would close anything else they submitted in the future without even bothering to review.

pjmlp
4 replies
3d11h

Eventually it will be a 90% fix rate, and everyone cheering about the 12% will be flipping burgers instead.

iLoveOncall
1 replies
3d10h

Flipping burgers will be automated long before AI fixes any relevant number of bug reports.

pjmlp
0 replies
3d10h

Might be, still the point I was trying to make remains.

littlestymaar
0 replies
2d22h

Why would a human ever flip a burger at that point? It's not a particularly difficult task for a robot.

Unclogging sewers on the other hand…

fennecfoxy
0 replies
3d5h

I still think this is a long way off, but it definitely ties into UBI etc. and improving the general human condition: taxing the rich, restricting investment in protected things like housing and in public industries like water, electricity, healthcare and internet.

What's funny is that people on here & tech people in general seem to be the most averse to improving equity between all humans/stopping the obscenely rich from abusing and twisting the system. Do many HN peeps believe they're all somehow gonna become billionaires one day?

toddmorey
3 replies
3d22h

I’m always fascinated to read the system prompts & I always wonder what sort of gains can be made optimizing them further.

Once I’m back on desktop I want to look at the gut history of this file.

hazn
1 replies
3d19h

DSPy is the best tool for optimizing prompts [0]: https://github.com/stanfordnlp/dspy

Think of it as a meta-prompt optimizer: it uses an LLM to optimize your prompts, to optimize your LLM.

toddmorey
0 replies
3d19h

Excellent! Thanks for sharing this!

clement_b
0 replies
3d21h

I have a git feeling this comment was written on mobile.

aussieguy1234
3 replies
3d20h

12% fix rate = 88% bug rate

mlcrypto
2 replies
3d20h

Yep. After xz we don't need a bot mindlessly fixing all suggestions from malicious actors

aussieguy1234
0 replies
3d13h

Fix one bug, introduce 5 more

Dylan16807
0 replies
3d16h

I don't think xz makes a difference here. The perceived likelihood of problems, malicious or not, is pretty much the same. As far as this discussion goes, it's just another example in the pile of examples, not an event with meaningful before and after epochs.

JonChesterfield
3 replies
3d13h

If AI generated pull requests become a popular thing we'll see the end of public bug trackers.

(not because bugs will be gone - because the cost of reviewing the PR vs the benefit gained to the project will be a substantial net loss)

CGamesPlay
1 replies
3d12h

Not a chance. If AI-generated pull requests become popular, GitHub will automatically offer them in response to opened issues. Case in point: they already are popular for dependency upgrades.

JonChesterfield
0 replies
3d11h

And thus issues will no longer be opened

itsgrimetime
0 replies
3d12h

It'll likely keep getting better; if it gets to 30-40% I'd say that's a decent trade-off. Also, could you boost your chances by having the AI do a second pass and double-check the work? I'd be curious what the success rate of an LLM "determining whether a bug fix is valid" is.

bwestergard
2 replies
3d23h

Friendly suggestion to the authors: success rates aren't meaningful to anyone but a handful of researchers. They should add to the README a few examples of tests SWE-agent passed and did not pass.

nyrikki
0 replies
3d23h

Yes please, the code quality on Devin was incredibly poor in all examples I traced down.

At least from a maintainability perspective.

I would like to see if this implementation is less destructive or at least more suitable for a red-green-refactor workflow.

NegativeLatency
0 replies
3d20h

Unless you weren't actually that successful but need to publish a "successful" result

tibbetts
1 replies
3d16h

But can their AI quietly introduce a security exploit into a GitHub project?

worthless-trash
0 replies
3d15h

Copilot already does this.

paradite
1 replies
3d20h

For anyone who didn't bother looking deeper: the SWE-bench benchmark contains only Python projects, so it is not representative of all programming languages and frameworks.

I'm working on a more general SWE task eval framework in JS for arbitrary languages and frameworks now (for starters JS/TS, SQL and Python), for my own prompt engineering product.

Hit me up if you are interested.

barfbagginus
0 replies
3d17h

Assuming the data set is proprietary, else please share the repo

mdaniel
1 replies
3d22h

I think that "Demo" link is just an extremely annoying version of an HTML presentation; they could save me a shitload of clicking if they just dumped their presentation out to a PDF or whatever so I could read faster than watching it type out text as if it were live. It also whines a lot on the console about its inability to connect to a websocket server on 3000, but I don't know what it would do with a websocket connection if it had it.

SrslyJosh
0 replies
3d8h

Probably created with an LLM.

lispisok
1 replies
3d22h

Their demo is so similar to the Devin one I had to go look up the Devin one to check I wasn't watching the same demo. I feel like there might be a reason they both picked SymPy. Also, I rarely put weight on demos. They are usually cherry-picked at best and outright fabricated at worst. I want to hear what 3rd parties have to say after trying these things.

lewhoo
0 replies
3d20h

Maybe that's the point of this research: hey look, we reproduced the way to game the stats a bit. I really can't tell anymore.

iLoveOncall
1 replies
3d22h

And creates how many new ones?

This and Devin generate garbage code that will make any codebase worse.

It's a joke that 12.5% is even associated with the word "success".

1letterunixname
0 replies
3d22h

Do spaces and spelling fixes count?

Copilot, so far, is only good for predicting the next bit of similar patterns of code.

barfbagginus
1 replies
3d16h

I would like something like this that helps me, as a green developer, find open source projects to contribute to.

For instance, I recently learned about how to replace setup.py with pyproject.toml for a large number of projects. I also learned how to publish packages to pypi. These changes significantly improve project ease and accessibility, and are very easy to do.
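
The core of such a change is small; a minimal pyproject.toml looks something like this (project metadata here is hypothetical):

    [build-system]
    requires = ["setuptools>=61"]
    build-backend = "setuptools.build_meta"

    [project]
    name = "example-project"
    version = "0.1.0"
    description = "Metadata that previously lived in setup.py"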

The main thing that holds people back is that python packaging documentation is notoriously cryptic - well I've already paid that cost, and now it's easy!

So I'm thinking of finding projects that are healthy, but haven't focused on modernizing their packaging or distributing their project through pypi.

I'd build human + agent based tooling to help me find candidates, propose the improvement to existing maintainers, then implement and deliver.

I could maybe upgrade 100 projects, then write up the adventure.

Anyone have inspiration/similar ideas, and wanna brainstorm?

SrslyJosh
0 replies
3d16h

...or you could just use the GitHub API to find projects that match certain criteria (e.g., no pyproject.toml). I'm not sure what the stochastic parrot adds here, besides making noob mistakes that you'll have to find and fix before you can submit PRs. You'd learn a lot more by trying to actually automate the process yourself.

Frummy
1 replies
3d22h

Interesting idea to provide the Agent-Computer Interface for it to scroll and such, so it can interact more easily from its perspective

aussieguy1234
0 replies
3d20h

Similar to how early computers didn't have enough RAM to display a whole text file, so old programmers had to work with parts of the file at a time. It's not a bad way to get around the context window problem, which is kind of similar.

trebligdivad
0 replies
3d21h

So this issues arbitrary shell commands based on trying to understand untrusted bug text? Should be fun waiting until someone finds an escape.

readthenotes1
0 replies
3d20h

I made a lot of money as I was paid hourly while working with a cadre of people I called "the defect generators".

I'm kind of sad that future generations will not have that experience...

Madmallard
0 replies
3d11h

What veterans in the field know, and AI hasn't yet tackled, is that the majority of difficulty in development is dealing with complexity and ambiguity, much of which comes down to communication between people in natural language, as well as reasoning in natural language about your system. These things are not solved by AI as it is now. If you can fully specify what you want, with all of the detail and corner cases and situation handling, then at some point AI might be able to build all of that for you. Great! Unfortunately, that's the actual hard part, not the implementation generally.