It's pretty annoying that every project like this lately is just a wrapper for OpenAI API calls.
This approach works. I just built a SPA in 3 days with GPT-4, and about 50% of the code was generated. My only tooling was a bash script that lists all the files in the repo (with some exceptions), prepends a README.md planning the project and a file list, and at the end I type my task.
I run about 10-15 rounds with it. At the beginning I was using GPT more heavily, but in the middle I found it easier to just fix the code myself. The context got as big as 10k tokens, but was not a problem. At some point I might need to filter the files more aggressively.
But surprisingly, all that's needed for a bare-bones repo-level coding assistant is a script to list all the files so I could easily copy-paste the whole thing into the ChatGPT window.
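Roughly something like this, if it helps (a minimal sketch, not my exact script; the exclusions and the README/task framing are just placeholders):

    #!/usr/bin/env bash
    # Dump the repo into a single prompt: README, file list, file contents, then the task.
    cat README.md
    echo "== FILE LIST =="
    git ls-files | grep -vE 'node_modules|dist|\.lock$'
    echo "== FILES =="
    git ls-files | grep -vE 'node_modules|dist|\.lock$' | while read -r f; do
        echo "--- $f ---"
        cat "$f"
    done
    echo "== TASK =="
    echo "$1"    # the task is passed as the first argument

Run it with the task as an argument and paste the output into the chat window.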
Yes, well said. Doing exactly this kind of thing for months with ChatGPT is what convinced me the idea could work in the first place. I knew the underlying intelligence was there--the challenge is giving it the right prompts and supporting infra.
Do you have any of the issues where ChatGPT tends to forget the first parts of its context window? It could have the information explicitly spelled out, but if it wasn’t in the last 2K tokens or so, it’d just start to hallucinate stuff for me.
Plandex uses gradual summarization as the conversation gets longer (the exact cutoff point in terms of tokens is configurable via `plandex set-model`). So eventually, with a long enough plan, you can start to lose some resolution. That said, assuming you use the default gpt-4-turbo model with a 128k context window, you'd need to go far beyond 2k tokens before you'd start seeing anything like that.
We don't know what ChatGPT's summarization strategy is since it's closed source, but it does seem to be quite a bit more aggressive than Plandex's.
What’s your experience with API cost? I've also tried something similar, but I often end up using up my balance too quickly.
I can generally have these tools solve a simple issue in about 0.1 USD, or "complex" issues in 1-2 USD (complex generally just means that I'm spending time prompt engineering to get the model to do the right thing).
Do you have any boilerplate part of your prompt you can share?
a script to list all the files so I could easily copy-paste the whole thing
Just in case you are using a Mac, you can pipe the output of your script to pbcopy so that it goes directly into your clipboard
script.sh | pbcopy
Show me one of these things do something more complex than a front-end intern project.
I agree, these things seem to do OK-ish on trivial web projects. I've never seen them do anything more than that.
I still use ChatGPT for some coding tasks, e.g. I asked it to write C code to do some annoying fork/execve stuff (can't remember the details) and it did a decentish job, but it's like 90% right. Great for figuring out a rough shape and what functions to search for, but you definitely can't just take the code and expect it to work.
Same when I asked it to write a device driver for some simple peripheral. It had the shape of an answer but with random hallucinated numbers.
I've also noticed that because there is a ton of noob-level code on the internet it will tend to do noob-level things too, like for the device driver it inserted fixed delays to wait for the device to perform an operation rather than monitoring for when it had actually finished.
I wonder if coding AIs would benefit from fine tuning on programming best practices so they don't copy beginner mistakes.
I used a web project in the demo because I figured it would be familiar to a wide range of developers, but actually many nontrivial pieces of Plandex have been built with the help of Plandex itself.
That's not to say it's perfect or will never make "noob-level" mistakes. That can definitely happen and is ultimately a function of the underlying model's intelligence. But I can at least assure you that it's quite capable of going far beyond a trivial web project.
It's also on me to show more in-depth examples, so thanks for calling it out. I'd love it if you would try some of the projects you mention and let me know how it goes.
So basically you don't have any non-trivial examples. What else is to be expected?
Check out some of the test prompts here for examples of larger tasks: https://github.com/plandex-ai/plandex/blob/main/test/test_pr...
Here's a prompt I used to build the AWS infrastructure for Plandex Cloud with Plandex: https://github.com/plandex-ai/plandex/blob/main/test/test_pr...
It's not something I would consider a complex job. A simple prompt to ChatGPT could even produce a working CDK template.
Here's another one, for the backend of a Stripe billing system: https://github.com/plandex-ai/plandex/blob/main/test/test_pr...
It seems like more examples demonstrating relatively complex tasks would be helpful, so I'll work on those.
I'm certainly not trying to claim that it can handle any task. The underlying model's intelligence and context size do place limits on what it can do. And it can definitely struggle with code that uses a lot of abstraction or indirection. But I've also been amazed by what it can accomplish on many occasions.
Love the idea of this, and very excited to see how it pans out. That said: I hate the code review UI. Just dump the changes as `git diff` does and let me review them using all the code review tools I use every day, then provide revision instructions. Building a high-quality TUI for side-by-side diffs should not be the thing you are spending time on, and there already exist great tools for viewing diffs in the terminal.
Thanks for the feedback! I actually had a ‘plandex diff’ command working at one point, but dropped it in favor of the changes TUI. I could definitely bring it back for people who prefer that format.
You could have a mode for people "who know what they are doing" that just auto-approves all the changes Plandex makes and lets users handle the changes themselves. I would actually prefer that, because I could keep using my IDE to look at diffs and decide what to keep.
Thanks, I'll consider this. It would be easy enough to add flags that will allow this.
Agreed! I, for example, would prefer to use difftastic.
Providing diff output allows people to self select their approach to merging the changes.
Yeah, that makes sense. I'm going to add this soon.
Congrats on the launch. Can you please compare and contrast Plandex's features with another similar solution like aider[1], which also helps solve a similar problem?
Thanks for mentioning aider! I haven't had a chance to look closely at plandex, but have read the author's description of differences wrt aider. I'd add a few comments:
I think the plandex UX is novel and interesting. The idea of a git-like CLI with various stateful commands is a new one in this space of ai coding tools. In contrast, aider uses a chat based "pair programming" UX, where you collaborate with the AI and ask for a sequence of changes to your local git repo.
The plandex author highlights that it makes changes in a "version-controlled sandbox" and can "rewind" unwanted changes.
These capabilities are all available "for free" in aider, because it is tightly integrated with git. Each AI change is automatically git committed with a sensible commit message. You can type "/diff" to check the diff, or "/undo" to undo any AI commit that you don't like. Or you can use "/git checkout -b <branch-name>" to start working on a branch to explore a longer sequence of changes, etc.
All your favorite git workflows are supported by invoking familiar git commands with "/git ..." inside the aider chat, or using any external git tooling that you prefer. Aider notices any changes in the underlying repo, however they occur.
These capabilities are all available "for free" in aider, because it is tightly integrated with git.
Sounds like the right approach to me. Some quick questions:
1. Is it easy to customize the system prompt with aider?
2. Does aider save a record of all OpenAI API calls? I’m thinking I may e.g. want to experiment with fine tuning an open source model using these one day.
3. What would you say are aider’s closest “competitors”?
Just to note, Plandex also has integration with git on the client-side and can automatically commit its changes (or not--you can decide when applying changes).
One of the reasons I think it's good to have the plan version-controlled separately from the repo is it avoids intermingling your changes and the model's changes in a way that's difficult to disentangle. It's also resilient to a "dirty" git state where you have a mix of staged, unstaged, and untracked changes.
One more benefit is that Plandex can be used in directories that aren't git repos, while still retaining version control for the plan itself. This can be useful for more one-off tasks where you're not working in an established project.
Thanks! Sure, I posted this comment in a Reddit thread a couple days ago to a user who asked the same question (and I added one additional point):
First I should say that it’s been a few months at least since I’ve used aider, so it’s possible my impression of it is a bit outdated. Also I’m a big fan of it and drew a lot of inspiration from it. That said:
Plandex is more focused on building larger and more complex functionality that involves multiple steps, whereas aider is more geared toward making a single change at a time.
Plandex has an isolated, version-controlled sandbox where tentative changes are accumulated. I believe with aider you have to apply or discard each change individually?
Plandex has a diff review TUI where changes can be viewed side-by-side and optionally rejected, a bit like GitHub’s PR review UI.
Plandex has branches that allow for exploring multiple approaches.
aider has cool voice input features that Plandex lacks.
aider’s maintainer Paul has done a lot of benchmarking of file update strategies. While I think Plandex’s approach is better suited to larger and more complex functionality, aider’s unified diff approach may have higher accuracy for a single change. I hope to do benchmarking work on this in the future.
aider requires Python and is installed via pip, while Plandex runs from a single binary with no dependencies, so Plandex installation is arguably easier overall, especially if you aren't a Python dev.
I’m sure I’m missing some other differences but those are the main ones that come to mind.
Thank you. Branches to explore different approaches is a really good idea, since LLMs are most powerful when they are used as a rubber duck to generate boilerplate templates and this can help get multiple perspectives. Going to test it soon.
What's the deal with Plandex Cloud and $10/$20-mo? The GitHub repo README devolves into a cloud pitch halfway through. I thought this was a local binary talking to OpenAI? I thought this was open source?
Hi, it’s open source and it also has a cloud option. You can either self-host or use cloud—it’s up to you.
The CLI talks to the Plandex server and the server talks to OpenAI.
But I still don't get what the cloud option would be doing that's worth $20/mo if it's talking to OpenAI. Does the Plandex server have large resource requirements?
The server does quite a bit. Most of the features are covered here: https://github.com/plandex-ai/plandex/blob/main/guides/USAGE...
I actually did start out with just the CLI running locally, but it reached a point I needed a database and thus a client-server model to get it all working smoothly. I also want to add sharing and collaboration features in the future, and those require a client-server model.
Congrats on the launch, I'm excited to give it a try. I'm curious how you're having it edit files in place - having built a similar project last summer, I had trouble getting it to reliably patch files with correct line numbers. It was especially a problem in React files with nested divs.
Thanks! I tried many different ways of doing it before settling on the current approach. It's still not perfect and can make mistakes (which is why the `plandex changes` diff review TUI is essential), but it's pretty good now overall.
I was able to improve reliability of line numbers by using a chain-of-thought approach where, for each change, the model first summarizes what's changing, then outputs code that starts and ends the section in the original file, and then finally identifies the line numbers from there.
The relevant prompts are here: https://github.com/plandex-ai/plandex/blob/main/app/server/m...
Amazing work. Loved the video and looking forward to trying it
Can a user ask Plandex to modify a commit? Maybe the commit just needs a small change and doesn’t need to be entirely rewritten. Can the scope be reduced on the spot to focus only on a commit?
Thanks! There isn't anything built-in to specifically modify a commit, but you could make the modification to the file with Plandex and then `git commit --amend` for basically the same effect.
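For example, something like this should work (a rough sketch of the flow; it assumes the commit you want to adjust is the most recent one and that you apply the plan's changes to your working tree first):

    plandex tell "tighten the validation in the signup handler"   # hypothetical prompt
    plandex apply                    # write the accepted changes to your working tree
    git add -u
    git commit --amend --no-edit     # fold the change into the previous commit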
Not for this project specifically, but I realize that I've seen a lot of AI agents, yet I've never seen anything interesting built with them. Some simple website, maybe even some very simple old games like Snake or Pong, but nothing better. Am I missing something?
I'd say LLM agents, or multi-agent systems, are still in the very early research/prototype stage.
You can tell because there are papers from Microsoft on this but no product: https://www.microsoft.com/en-us/research/project/autogen/
I also wrote about the L1 to L5 of AI coding here: https://prompt.16x.engineer/blog/ai-coding-l1-l5
I brainstormed a text game engine powered by an LLM, but relying on a non-local LLM was off-putting. Local LLMs are getting more and more viable though. A general problem I ran into was that thinking in terms of LLM queries is a very new way of designing computation, and adapting takes a lot of effort. Then again, my central idea was a bit ambitious too: every game character would have a unique interpretation of what was happening.
Try to build something interesting with Plandex! Perhaps you will be pleasantly surprised. Either way, please let me know how it goes.
To support many other models you should look at ollama - it provides a REST API on your machine for local inference that works just like OpenAI's.
Thanks, I'm aware of ollama and the open source model ecosystem, but I haven't done a deep dive yet, so all the info in this thread has been quite helpful.
In theory, all you have to do is redirect the API gateway to localhost and all your existing integrations should just work!
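For example (a hedged sketch, assuming Ollama is running locally and the client honors the standard OpenAI environment variables):

    # Ollama exposes an OpenAI-compatible API at this address by default
    export OPENAI_BASE_URL=http://localhost:11434/v1
    export OPENAI_API_KEY=ollama    # Ollama ignores the key, but most clients require one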
There's an issue here to keep track of this: https://github.com/plandex-ai/plandex/issues/20
It seems that while ollama does have partial OpenAI API compatibility, it's missing function calling, so that's a blocker for now.
This seems very interesting, but I think the interface choice is not good. There would have been much less friction if this was purely a GitHub/GitLab/etc bot.
I see where you're coming from and I do plan to add a web UI and plugin/integration options in the future.
I personally wanted something with a tighter feedback loop that felt more akin to git. I also thought that simplifying the UI side would help me stay focused on getting the data structures and basic mechanics right in the initial version. But now that the core functionality is in place, I think it will work well as a base for additional frontends.
I haven't tried it yet, but I think making it fast iteration and simple initially is the right way to go. Nice one sharing this as open source!
I disagree, having used Sweep extensively, I've found the GitHub Issue -> PR flow to be incredibly clunky with a lack of ability to see what is happening and what has gone wrong.
In the demo it modified UI components. Is there any model that can look at the rendered page to see if it looks right? Right now, all these wrappers just blindly edit the code.
Plandex can't do this yet, but soon I want to add GPT4-vision (and other multi-modal models) as model options, which will enable this kind of workflow.
Well, I have built a similar project that lives in a GitHub Action, communicates via issues, and sends a PR when done.
GPT-4 Vision isn't there yet. It can mostly OCR or pattern-recognize an image if it's popular or has some known object in it. It cannot detect pixel differences or CSS/alignment issues.
I paired mine with VSCode and used the live view addon for that folder. So far so good.
Looks interesting. Can you go into more detail about why you like this better for large/complex tasks compared to GH Copilot?
Not the author, but I'm in a discord with him, I believe the main selling point here is that it allows you to manage your updates and conversations in a branching pattern that's saved. So if you can't get the AI to do something you can always revert to a prior state and try a different method.
Also, it doesn't work on a "small view of the world" like Copilot does. When I was using Copilot, it could only insert code around your cursor (I understand that Copilot pulls in a lot of context from all the files you have open, but the area it can modify is really small). Plandex can add/remove/update code in multiple files at once. It'll also show you a diff first before it applies anything, and you can select some or all of the changes it made.
Yes, couldn't have said it better myself!
Curious to know how you built this. Is it GPT-4 or a fine-tuned model? How much does it cost?
It's written in Go. The models that it uses are configurable, but it mostly uses gpt-4-turbo by default currently. It calls the OpenAI API on your behalf with your own API key. No fine-tuning yet, though I'm interested in trying that in the future.
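Concretely, that just means setting your key in the environment before running it (assuming the standard variable name; the value below is a placeholder):

    export OPENAI_API_KEY=sk-your-key-here    # Plandex sends requests to OpenAI with your own key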
Appreciate the response. Really cool work!
Congrats! Looks great, and I can't wait to try it.
Do you support Azure OpenAI with custom endpoints?
Are any special settings necessary to disable telemetry or non-core network requests?
Thanks! It doesn't yet support custom endpoints, but it will soon. I'd recommend either joining the Discord (https://discord.gg/plandex-ai) or watching the repo for updates if you want to find out when this gets released.
If you self-host the server, there is no telemetry and no data is sent anywhere except to your self-hosted server and OpenAI.
This is really cool. I tried it and ran into a few syntax errors - it kept missing closing braces in PHP for some reason.
It seems it might be useful if it could actually try to execute the code, or somehow check for syntax errors/unimplemented functions before accepting the response from the LLM.
Thanks! Was this on cloud or self-hosted? If cloud and you created an account, feel free to ping me on Discord (https://discord.gg/plandex-ai) or by email (dane@plandex.ai) and let me know your account email so I can investigate. If you have an anonymous trial account on cloud, please still ping me--I can track it down based on file names. There is definitely some work to do in ironing out these kinds of edge cases.
"It seems it might be useful if it could actually try to execute the code, or somehow check for syntax errors/unimplemented functions before accepting the response from the LLM."
Indeed, I do have some ideas on how to add this.
This is something I've been thinking a lot about (a way to set context for an LLM against my own code), thank you for putting this out. Looks really polished.
Thanks! Please let me know how it goes for you if you try it :)
Very small nit: it'd be nice to be able to specify an OpenAI org in case multiple orgs exist.
Ok, I made a note to add that. Thanks for the feedback!
This looks so damn good! Can't wait to try it in the morning!
Thanks! Please let me know how it goes for you.
Looks really interesting. Is it wrapping git for the rollback and diffing stuff? If I were a user I'd probably opt to use git directly for that sort of thing.
Yes, it does use git underneath, with the idea of exposing a very simple subset of git functionality to the user. There's also some locking and transaction logic involved to ensure integrity and thread safety, so it wouldn't really be straightforward to expose the repo directly.
I tried to build the backend so that postgres, the file system, and git would combine to form effectively a single transactional database.
I appreciate in the copy here that you are not claiming plandex to be a super dev or some such nonsense.
I really dislike the hype marketing in some other solutions.
Thanks! I agree. I think the key to working effectively with LLMs is to understand and embrace their limitations, using them for tasks they're good at while not spinning your wheels on tasks (or parts of tasks) that they aren't yet well-suited for.
As someone who is trying to build a bootstrapped startup in spare time (read: coding while tired), this is amazing. Thank you so much for creating it.
Thanks! I agree it's great for coding while tired. I also like it when I'm procrastinating or feeling lazy. I find it helps to reduce the activation energy of getting started.
What is the cost of planning and working through, let's say, a manageable issue in a repo? Does it make sense to use 3.5/Sonnet or some lower cost endpoint for these tasks?
It's hard to put a precise number on it because it depends on exactly how much context is loaded, how many model responses the task needs to finish, and how much iteration you need to do in order to get the results you're looking for.
That said, you can do quite a meaty task for well under $1. If you're using it heavily it can start to add up over time, so you'd just need to weigh that cost against how you value your time I suppose. In the future I do hope to incorporate fine tuned models that should bring the cost down, as well as other model options like I mentioned in the post.
You can try different models and model settings with `plandex set-model` and see how you go. But in my experience gpt-4 is really the minimum bar for getting usable results.
Congrats on the launch.
Thank you!
Are you using plandex to write improvements to plandex?
Yes, quite often! Some of the most complex bits involving stream handling and concurrency were easier to do myself, but it’s been very helpful for http handlers, CLI commands, formatted output, TUIs, AWS infrastructure, and a lot more. I’ve also used it to track down bugs.
Hi! Is it possible to tell Plandex that the code should pass all tests in, e.g., `tests.py`?
Hey! Not in an automated way (yet). But you can get pretty close by building your plan, applying it, and then piping the output of your tests back into Plandex:
pytest tests.py | plandex load
plandex tell "update the plan files to fix the failing tests from the included pytest output"
Love this. Super excited about AI SWEs, will give it a try.
Awesome, thank you!
This looks neat, I can't wait to try it out.
Thanks! Let me know how it goes :)
Wow, this is phenomenal! I can't wait to dig in. This is almost exactly the application I've been envisioning for my own workflow. I'm excited to contribute!
Thank you! Awesome, I'm glad to hear that! Looking forward to your thoughts, and your contributions :)
I wanted to get a better idea of how it worked, so I asked Claude to write up an overview. https://gist.github.com/CGamesPlay/8c2a2882c441821e76bbe9680...
This is really cool! And quite accurate.
If this thing really worked, why wouldn't you just point it at AWS documentation and ask it to implement the exact same APIs and come up with designs for the datacenters in extreme detail? Implementing APIs is completely legal.
Supporting more models, including Claude, Gemini, and open source models is definitely at the top of the roadmap. Would that make it less annoying? :)
Not affiliated with the project but you could use something like OpenRouter to give users a massive list of models to choose from with fairly minimal effort
https://openrouter.ai/
Thanks, I need to spend some time digging into OpenRouter. The main requirement would be reliable function calling and JSON, since Plandex relies heavily on that. I'm also expecting to need some model-specific prompts, considering how much prompt iteration was needed to get things behaving how I wanted on OpenAI.
I've also looked at Together (https://www.together.ai/) for this purpose. Can anyone speak to the differences between OpenRouter and Together?
I can't speak to the differences between OpenRouter and Together, but the OpenRouter endpoint should work as a drop-in replacement for OpenAI API calls after replacing the endpoint URL and the value of $OPENAI_API_KEY. The model names may differ from other APIs, but everything else should work the same.
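For example (a quick sketch, assuming the client reads the usual OpenAI environment variables):

    export OPENAI_BASE_URL=https://openrouter.ai/api/v1
    export OPENAI_API_KEY=$OPENROUTER_API_KEY    # your OpenRouter key
    # then request models by their OpenRouter names, e.g. "anthropic/claude-3-opus"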
Awesome, looking forward to trying it out.
Would love to hear any feedback from people who have gotten to know OpenRouter, as well as any similar tools.
I think Mistral-2-Pro would work really well for this, judging by the great results I've had with it on another project that's heavy on tool calling [1].
[1] https://github.com/radareorg/r2ai
Thanks, I'll give it a try. Plandex's model settings are version-controlled like everything else and play well with branches, so it will be fun to start comparing how all different kinds of models do vs. each other on longer coding tasks using a branch for each one.
For challenging tasks, I typically get code outputs from all three top models (GPT-4, Opus, and Ultra) and pick the best one. It would be nice if your tool could simplify this for me: run all three models and perhaps even facilitate some type of model interaction to produce a better outcome.
Definitely, I'm very interested in doing something along these lines.
https://github.com/ollama/ollama
I think OpenAI is still the best of the bunch. I kind of feel like the others are just there to make people realize OpenAI works best. Maybe that will change when Gemini 1.5 is released?
I’m moving an inordinate amount of data between the ChatGPT browser window and my IDE (a lot through copying and pasting), and this demonstrates two things: 1) ChatGPT is incredibly useful to me, and 2) the workflow UX is still terrible. I think there is room for building innovative UXs with OpenAI, and so far what I’ve seen in JetBrains and VSCode isn’t it…
That was also my experience and thought process.
Every program is a wrapper around a CPU, so annoying.
But the open source models have OpenAI-compatible APIs, so as long as you can set the API endpoint you can use whatever you want.
OpenAI API is simply a utility. The question is given this utility, how does one find the right use case, structure the correct context, and build the right UX.
OP has certainly built something interesting here and added significant value on top of the base utility of the OpenAI API.