
Building reliable systems out of unreliable agents

mritchie712
12 replies
2d19h

This is a great write-up! I nodded my head through the whole post. Very much aligns with our experience over the past year.

I wrote a simple example (overkiLLM) on getting reliable output from many unreliable outputs here[0]. This doesn't employ agents, just an approach I was interested in trying.

I chose writing an H1 as the task, but a similar approach would work for writing any short blob of text. The script generates a ton of variations then uses head-to-head voting to pick the best ones.

This all runs locally / free using ollama.

0 - https://www.definite.app/blog/overkillm
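
For the curious, a minimal sketch of the head-to-head voting idea (not the overkiLLM code itself), assuming a local Ollama server on the default port; the prompts are illustrative:

```python
# Sketch: generate variations, have the model judge random pairs, keep win counts.
import itertools
import requests
from collections import Counter

def ask(prompt: str, model: str = "llama2") -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=120,
    )
    return resp.json()["response"].strip()

def tournament(candidates: list[str]) -> str:
    wins = Counter()
    for a, b in itertools.combinations(candidates, 2):
        verdict = ask(
            "Which H1 is more compelling? Reply with exactly A or B.\n"
            f"A: {a}\nB: {b}"
        )
        wins[a if verdict.upper().startswith("A") else b] += 1
    return wins.most_common(1)[0][0]

# Generate a handful of variations, then run the head-to-head voting.
variants = [ask("Write one H1 for an analytics product. H1 only.") for _ in range(8)]
print(tournament(variants))
```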

maciejgryka
8 replies
2d19h

Oh this is fun! So you basically define personalities by picking well-known people that are probably represented in the training data and ask them (their LLM-imagined doppelganger) to vote?

CuriouslyC
7 replies
2d7h

In the research literature, this process is done not by "agent" voting but by taking a similarity score between answers, and choosing the answer that is most representative.

Another approach is to use multiple agents to generate a distribution over predictions, sort of like Bayesian estimation.
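
A minimal sketch of the first approach: embed each candidate answer and keep the one with the highest average similarity to the rest. It assumes sentence-transformers is installed; the model name and candidate answers are placeholders:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def most_representative(answers: list[str]) -> str:
    model = SentenceTransformer("all-MiniLM-L6-v2")
    # Normalized embeddings so the dot product is cosine similarity.
    embs = model.encode(answers, normalize_embeddings=True)
    sims = embs @ embs.T                      # pairwise cosine similarities
    np.fill_diagonal(sims, 0.0)               # ignore self-similarity
    centrality = sims.mean(axis=1)            # average similarity to the others
    return answers[int(np.argmax(centrality))]

answers = [
    "Powerful analytics without engineering or SQL.",
    "Analytics without writing SQL or waiting on engineers.",
    "A completely unrelated tagline.",
]
print(most_representative(answers))
```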

mistermann
3 replies
2d4h

Any chance you could expand on both of these, even enough to assist in digging deeper into them? TIA.

CuriouslyC
2 replies
2d3h

The TLDR is you can prompt the LLM to take different perspectives than its default, then combine those. If the LLM is estimating a number, the different perspectives give you a distribution over the truth, which shows you the range of biases and the most likely true answer (given wisdom of the crowd). If the LLM is generating non-quantifiable output, you can find the "average" of the answers (using embeddings or other methods) and select that one.
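
A minimal sketch of the numeric case: prompt the model once per persona and treat the answers as samples from a distribution. `ask_llm` is a hypothetical helper that sends a prompt to whatever model you use and returns its text; the personas are illustrative:

```python
import re
import statistics

PERSONAS = ["a cautious auditor", "an optimistic founder", "a skeptical engineer"]

def estimate(question: str, ask_llm) -> dict:
    samples = []
    for persona in PERSONAS:
        prompt = f"You are {persona}. {question} Answer with a single number only."
        reply = ask_llm(prompt)
        match = re.search(r"-?\d+(\.\d+)?", reply)
        if match:
            samples.append(float(match.group()))
    return {
        "samples": samples,                      # the raw distribution
        "estimate": statistics.median(samples),  # robust central value
        "spread": statistics.pstdev(samples),    # rough sense of the bias range
    }
```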

mistermann
1 replies
1d19h

Ah ok, so both are implemented via calls to the LLM, as opposed to a standard algorithmic approach?

CuriouslyC
0 replies
1d18h

Once you have Bayesian prior distributions (which it makes total sense for LLMs to estimate) you can apply tons of nifty statistical techniques. It's only the bottom layer of the analysis stack that's LLM-generated.

mritchie712
1 replies
2d3h

for my use case (generating an interesting H1), using a similarity score would defeat the purpose.

I'm looking for the diamond in the rough, which is often dissimilar from the others. With head-to-head voting, the diamond can still get a high number of votes.

CuriouslyC
0 replies
2d2h

That approach definitely has promise. I would have agents rate answers and take the highest rated rather than vote for them though, since you're losing information about ranking and preference gradients with n choose 1. Also, you can do that whole process in one prompt; if you're re-prompting currently, it's cheaper to batch it up.
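
A sketch of that batched rating idea, with the scoring prompt, JSON contract, and `ask_llm` helper all assumed rather than taken from the comment above:

```python
import json

def pick_best(candidates: list[str], ask_llm) -> str:
    # Number the candidates and ask for a score per candidate in one call.
    numbered = "\n".join(f"{i}: {c}" for i, c in enumerate(candidates))
    prompt = (
        "Rate each candidate headline from 1 (weak) to 10 (strong).\n"
        f"{numbered}\n"
        'Reply with JSON like {"0": 7, "1": 4, ...} and nothing else.'
    )
    scores = json.loads(ask_llm(prompt))
    best_index = max(scores, key=scores.get)   # keep the top-rated candidate
    return candidates[int(best_index)]
```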

infecto
0 replies
2d6h

For clarification on the first part: the research suggests you can use the same prompt over multiple runs as the input for picking the answer.

all2
2 replies
2d19h

I'd be curious to see some examples and maybe intermediate results?

mritchie712
1 replies
2d3h

Here are some examples[0]:

this one scored high:

Pinned Down - Powerful Analytics Without the Need for Engineering or SQL

this one scored low:

Analytics Made Accessible for Everyone.

Each time I've compared the top scoring results to those at the bottom, I've always preferred the top scoring variations.

0 - https://docs.google.com/spreadsheets/d/1hdu2BlhLcLZ9sruVW8a_...

all2
0 replies
1d14h

I love the spreadsheet. That's exactly what I was looking for. Thank you!

iamleppert
8 replies
2d20h

A better way is to threaten the agent:

“If you don’t do as I say, people will get hurt. Do exactly as I say, and do it fast.”

Increases accuracy and performance by an order of magnitude.

maciejgryka
4 replies
2d20h

Ha, we tried that! Didn't make a noticeable difference in our benchmarks, even though I've heard the same sentiment in a bunch of places. I'm guessing whether this helps or not is task-dependent.

dudus
1 replies
2d20h

Agreed. I ran a few tests and similarly observed that threats didn't outperform other types of "incentives". I think it might be some sort of urban legend in the community.

Or these prompts might cause wild variations depending on the model, and any study you do is basically useless for the near future as the models evolve on their own.

maciejgryka
0 replies
2d20h

Yeah, the fact that different models might react differently to such tricks makes it hard. We're experimenting with Claude right now and I'm really hoping something like https://github.com/stanfordnlp/dspy can help here.

dollo_7
1 replies
2d20h

I hoped it was too good to be just a joke. Still, I will try it on my eval set…

maciejgryka
0 replies
2d20h

I wouldn't be surprised to see it help, along with the "you'll get $200 if you answer this right" trick and a bunch of others :) They're definitely worth trying.

IIAOPSW
1 replies
2d20h

Personally I prefer to liquor my agents up a bit first.

"Say that again but slur your words like you're coming home sloshed from the office Christmas party."

Increases the jei nei suis qua by an order of magnitude.

mtremsal
0 replies
2d18h

jei nei suis qua

"je ne sais quoi", i.e. "I don't know (exactly) what", or an intangible but essential quality. :)

thimkerbell
0 replies
2d16h

"do as I say...", not realizing that the LLM is actually 1000 remote employees

caseyy
5 replies
2d12h

Interesting ideas, but the post didn't mention priming, which is a prompt-engineering way to improve consistency in answers.

Basically, in the context window, you provide your model with 5 or more example inputs and outputs. If you're running in chat mode, that'd be the preceding 5 user and assistant message pairs, which establish a pattern of how to answer different types of information. Then you give the current prompt as a user, and the assistant will follow the rhythm and style of previous answers in the context window.

It works so well I was able to take the answer-reformatting logic out of some of my programs that query llama2 7b. And it's a lot cheaper than fine-tuning, which may be overkill for simple applications.
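
A minimal sketch of that priming setup in chat mode, assuming a local Ollama server on the default port; the example pairs and the JSON output format are illustrative:

```python
import requests

# Example user/assistant pairs that sit in the context before the real input.
EXAMPLES = [
    ("Order #1234 arrived broken.", '{"sentiment": "negative", "topic": "shipping"}'),
    ("Love the new dashboard!", '{"sentiment": "positive", "topic": "product"}'),
]

def classify(text: str) -> str:
    messages = [{"role": "system", "content": "Reply with JSON only."}]
    for user_msg, assistant_msg in EXAMPLES:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": assistant_msg})
    messages.append({"role": "user", "content": text})

    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={"model": "llama2", "messages": messages, "stream": False},
        timeout=120,
    )
    return resp.json()["message"]["content"]

print(classify("Support never answered my ticket."))
```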

notsylver
4 replies
2d12h

They mention few-shot prompting in the prompt engineering section, which I think is what you mean.

caseyy
3 replies
2d10h

Oh yeah. I read "few-shot" as meaning trying a few times to get an appropriate output. That's how the author uses the word "shot" at the beginning of the article. Priming is a specific term that means giving examples in the context window. But yeah, the author seems to describe this. Still, you can go a long way with priming. I wouldn't even think of fine-tuning before trying priming for a good while. It might still be quicker and a lot cheaper.

maciejgryka
2 replies
2d9h

Ha good point, I did say "let's have another shot" when I just meant another try at generating! FWIW "few-shot prompting" is how most people refer to this technique, I think (e.g. see https://www.promptingguide.ai/techniques/fewshot). I haven't heard "priming" before, though it does convey the right thing.

And the reason we don't really do it is context length. Our contexts are long and complex and there are so many subtleties that I'm worried about either saturating the context window or just not covering enough ground to matter.

caseyy
0 replies
1d1h

Interesting, I hadn't heard of few-shot prompting. There's a ton of stuff written specifically on "priming" as well. People use different terms, I suppose.

The point about context window length makes sense; it can be limiting. For small inputs and outputs, it's great, and it's remarkably effective, with diminishing returns. This is why I gave 5 shots as a concrete example: you probably need more than 1 or 2, but for a lot of applications fewer than 20, at least for basic tasks like extracting words from a document or producing various summaries.

It depends on the complexity of the task and how much you're worried about over-fitting to your data set. But if you're not so worried, the task is not complex, and the inputs and outputs are small, then it works very well with only a few shots.

And it's basically free compared to fine-tuning.

It might be worth expanding on it a bit in this or a separate article. It's a good way to increase reliability to a workable extent in unreliable LLMs. Although a lot has been written on few-shot prompting/priming already.

Hugsun
0 replies
2d7h

Yes, X-shot prompting or X-shot learning was how the pioneering LLM researchers referred to putting examples in the prompt. The terminology stuck around.

CuriouslyC
4 replies
2d6h

Prompt engineering is honestly not long for this world. It's not hard to build an agent that can iteratively optimize a prompt given an objective function, and it's not hard to make that agent general purpose. DSPy already does some prompt optimization via multi-shot learning/chain of thought; I'm quite certain we'll see an optimizer that can actually rewrite the base prompt as well.
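
As a rough illustration of that loop (independent of DSPy's actual API), here's a sketch that scores a prompt against a small eval set, asks the model to rewrite it, and keeps the rewrite only when the score improves; `ask_llm` and the eval set are hypothetical placeholders:

```python
def optimize_prompt(prompt: str, eval_set: list[tuple[str, str]], ask_llm, rounds: int = 5) -> str:
    # Objective function: fraction of eval items whose expected answer appears in the output.
    def score(p: str) -> float:
        hits = sum(1 for inp, want in eval_set if want in ask_llm(f"{p}\n\n{inp}"))
        return hits / len(eval_set)

    best, best_score = prompt, score(prompt)
    for _ in range(rounds):
        candidate = ask_llm(
            "Rewrite the following prompt to be clearer and more likely to "
            f"produce correct answers. Return only the rewritten prompt.\n\n{best}"
        )
        s = score(candidate)
        if s > best_score:          # keep the rewrite only if it beats the incumbent
            best, best_score = candidate, s
    return best
```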

namaria
1 replies
1d10h

It strikes me as bad reasoning to take a system that is designed to be complex and stochastic precisely so you can get some creativity out of it ("generative AI", so to speak) and bolt on added apparatus to get deterministic behavior out of it.

We have deterministic programming systems. They're called compilers.

CuriouslyC
0 replies
1d6h

I think you're missing the point. If an application had simple logic, the program would have been written in a simple language in the first place. This is about taking fuzzy processes that would be incredibly difficult to program, and making them consistent and precise.

maciejgryka
1 replies
2d6h

I hear you and am planning to try DSPy because it seems attractive, but I'm also hearing people with a lot of experience being cautious about this https://x.com/HamelHusain/status/1777131374803402769 so I wouldn't make this a high-conviction bet.

CuriouslyC
0 replies
2d4h

I don't have the context to fully address that tweet, but in my experience there is a repeatable process to prompt design and optimization that could be outlined and followed by an LLM with iterative capabilities using an objective function.

The real proof though is that most "prompt engineers" already use chatgpt/claude to take their outline prompt and reword it for succinctness and relevance to LLMs, have it suggest revisions and so forth. Not only is the process amenable to automation, but people are already doing hybrid processes leveraging the AI anyhow.

viksit
3 replies
2d20h

this is a great write up! i was curious about the verifier and planner agents. has anyone used them in a similar way in production? any examples?

for instance: do you give the same llm the verifier and planner prompt? or have a verifier agent process the output of a planner and have a threshold which needs to be passed?

feels like there may be a DAG in there somewhere for decision making..

maciejgryka
2 replies
2d19h

Yep, it's a DAG, though that only occurred to me after we built this so we didn't model it that way at first. It can be the same LLM with different prompts or totally different models, I think there's no rule and it depends on what you're doing + what your benchmarks tell you.

We're running it in prod btw, though don't have any code to share.
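
As an illustration of one planner -> verifier edge of such a DAG, here's a hedged sketch where a verifier scores the planner's output and the work is retried below a threshold; the prompts, the 0-10 scale, and the `ask_llm` helper are assumptions, not the article's actual setup:

```python
def plan_with_verification(task: str, ask_llm, threshold: float = 7.0, retries: int = 3) -> str:
    plan = ""
    for _ in range(retries):
        plan = ask_llm(f"Write a step-by-step plan for this task:\n{task}")
        verdict = ask_llm(
            "Rate this plan from 0 to 10 for completeness and correctness. "
            f"Reply with a number only.\nTask: {task}\nPlan:\n{plan}"
        )
        try:
            if float(verdict.strip()) >= threshold:
                return plan            # verifier passed the plan
        except ValueError:
            continue                   # unparseable verdict, try again
    return plan                        # fall back to the last attempt
```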

viksit
1 replies
2d13h

funnily enough i have a library i’m planning to open source soon! i’ve used airflow as a guideline for it as well.

maciejgryka
0 replies
2d9h

Nice, looking forward to seeing that! Someone else pointed me towards https://github.com/DAGWorks-Inc/burr/ which also seems related in case you're curious.

serjester
2 replies
2d19h

Some of these points are very controversial. Having done quite a bit with RAG pipelines, I'd say avoiding strong typing in your code is asking for a terrible time. Same with avoiding instructor. LLMs are already stochastic, so why make your application even more opaque - it's such a minimal time investment.

minimaxir
0 replies
2d19h

LLMs are already stochastic

That doesn't mean it's easy to get what you want out of them. Black boxes are black boxes.

maciejgryka
0 replies
2d19h

I think instructor is great! And most of our Python code is typed too :)

My point is just that you should care a lot about preserving optionality at the start because you're likely to have to significantly change things as you learn. In my experience going a bit cowboy at the start is worth it so you're less hesitant to rework everything when needed - as long as you have the discipline to clean things up later, when things settle.

liampulles
2 replies
2d10h

Agree with lots of this.

As an aside: one thing I've tried to use ChatGPT for is to select applicable options from a list. When I index the list as 1..., 2..., etc., I find that the LLM likes to just start printing out ascending numbers.

What I've found kind of works is indexing by African names, e.g. Thandokazi, Ntokozo, etc.; then the AI seems to have less bias.

Curious what others have done in this case.
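
One way to implement the label trick described above, sketched with a hypothetical `ask_llm` helper and an illustrative label list:

```python
# Key the options with arbitrary names instead of 1, 2, 3..., ask the model
# to return the keys that apply, then map them back to the original items.
LABELS = ["Thandokazi", "Ntokozo", "Sibusiso", "Nomvula", "Lwazi"]

def select_options(options: list[str], question: str, ask_llm) -> list[str]:
    keyed = dict(zip(LABELS, options))
    listing = "\n".join(f"{label}: {text}" for label, text in keyed.items())
    prompt = (
        f"{question}\nOptions:\n{listing}\n"
        "Reply with the applicable labels, comma-separated, and nothing else."
    )
    chosen = [label.strip() for label in ask_llm(prompt).split(",")]
    return [keyed[label] for label in chosen if label in keyed]
```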

maciejgryka
1 replies
2d10h

I'm a little surprised to hear this; my experience has been a little better. Are you using GPT-4? I know 3.5 is significantly more challenged/challenging with things like this. It's still possible to make it do the right thing, but much more careful prompting is required.

liampulles
0 replies
2d7h

Yeah this is to make it work for 3.5, because cost is a factor.

tedtimbrell
1 replies
2d16h

On the topic of wrappers, as someone that's forced to use GPT-3.5 (or the like) for cost reasons, anything that starts modifying the prompt without explicitly showing me how is an instant no-go. It makes things really hard to debug.

Maybe I'm the equivalent of that idiot fighting against JS frameworks back when they first came out, but it feels pretty simple to just use individual clients and have pydantic load/validate the output.
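
For reference, a minimal sketch of that pydantic load/validate pattern (pydantic v2 assumed; the schema is illustrative): ask for JSON in a prompt you fully control, then validate the raw completion yourself.

```python
from pydantic import BaseModel, ValidationError

class Ticket(BaseModel):
    sentiment: str
    topic: str

def parse_ticket(raw_completion: str) -> Ticket | None:
    try:
        return Ticket.model_validate_json(raw_completion)
    except ValidationError:
        return None  # caller can retry or fall back
```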

msp26
0 replies
2d6h

No, you're along the right lines. Every prompting wrapper I've tried and looked through has been awful.

It's not really the authors' faults, it's just a weird new problem with lots of unknowns. It's hard to get the design and abstractions correct. I've had the benefit of a lot of time at work to build my own wrapper (solely for NLP problems) and that's still an ongoing process.

maciejgryka
1 replies
2d21h

This is a bunch of lessons we learned as we built our AI-assisted QA. I've seen a bunch of people circle around similar processes, but didn't find a single source explaining it, so thought it might be worth writing down.

Super curious whether anyone has similar/conflicting/other experiences and happy to answer any questions.

xrendan
0 replies
2d19h

This generally resonates with what we've found. Some colour based on our experiences.

It's worth spending a lot of time thinking about what a successful LLM call actually looks like for your particular use case. That doesn't have to be a strict validation set; `% prompts answered correctly` is good for some of the simpler prompts, but that breaks down as prompts grow and handle more complex use cases. In an ideal world
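
A minimal sketch of the simplest version of that check, with a plain substring match as the "correct" test and a hypothetical `ask_llm` helper standing in for the model call:

```python
def percent_correct(eval_set: list[tuple[str, str]], ask_llm) -> float:
    # Run each prompt in a small labelled set and report % answered correctly.
    correct = sum(1 for prompt, expected in eval_set if expected in ask_llm(prompt))
    return 100.0 * correct / len(eval_set)

eval_set = [
    ("What is the capital of France? Answer with one word.", "Paris"),
    ("Is 17 prime? Answer yes or no.", "yes"),
]
```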

Chain-of-thought has a speed/cost vs. accuracy trade-off, and it's a big one.

Observability is super important and we've come to the same conclusion of building that internally.

Fine-tune your model

Do this for cost and speed reasons rather than to improve accuracy. There are decent providers (like Openpipe, relatively happy customer, not associated) who will handle the hard work for you.

jongjong
1 replies
2d16h

My experience with AI agents is that they don't understand nuance. This makes sense since they are trained on a wide range of data produced by the masses. The masses aren't good with nuance. That's why, if you put 10 experts together, they will often make worse decisions than they would have made individually.

In terms of coding, I managed to get AI to build a simple working collaborative app, but beyond a certain point it doesn't understand nuance and kept breaking stuff it had fixed previously, even with Claude keeping our entire conversation in context. Beyond a certain degree of completion, it was simply easier and faster to write the code myself than to tell the AI to write it, because no matter how precise I was with my wording it just didn't get it; it became like playing whac-a-mole: fix one thing, break two others.

CuriouslyC
0 replies
2d6h

Your comment runs contrary to a lot of established statistics. We have demonstrated with ensemble learning that pooling the estimates of many weak learners provides best in class answers to hard problems.

You are correct that we should be using expert AIs rather than general purpose ones when possible though.

cpursley
1 replies
2d7h

If you’re using Elixir, I thought I’d point out how great this library is:

https://github.com/thmsmlr/instructor_ex

It piggybacks on Ecto schemas and works really well (if instructed correctly).

tmm84
0 replies
2d17h

Unlike the author of this article, I have had success with RAGatouille. It was my main tool when I was limited on resources and working with non-Romanized languages that don't follow the usual token rules (spaces, periods, line breaks, triplet word groups, etc.). However, I have since moved past RAGatouille to embeddings + a vector DB for a more portable solution.

jasontlouro
0 replies
2d12h

Very tactical guide, which I appreciate. This is basically our experience as well. Output can be wonky, but can also be pretty easily validated and honed.

ThomPete
0 replies
2d16h

We went through a two-tier process before we got to something useful. First we built a prompting system so you could do things like:

Get the content from news.ycombinator.com using gpt-4

- or -

Fetch LivePass2 from google sheet and write a summary of it using gpt-4 and email it to thomas@faktory.com

But then we realized it was better to teach the agents than the human beings, and so we created a fairly solid agent setup.

Some of the agents we built can be seen here, all done via instruct:

Paul Graham https://www.youtube.com/watch?v=5H0GKsBcq0s

Moneypenny https://www.youtube.com/watch?v=I7hj6mzZ5X4

V33 https://www.youtube.com/watch?v=O8APNbindtU