Mildly surprised to see no mention of my top 2 LLM fails:
1) you’re sampling a distribution; if you only sample once, your sample is not representative of the distribution.
For evaluating prompts and running in production, your hallucination rate is inversely proportional to the number of times you sample.
Sampling many times and voting is a highly effective (but slow) strategy (a rough sketch follows at the end of this comment).
There is almost zero value in evaluating a prompt by only running it once.
2) Sequences are generated in order.
Asking an LLM to make a decision and then justify that decision, in that order, is literally meaningless.
Once the “decision” tokens are generated, the justification does not influence them. The two don’t happen “all at once”; there is a specific sequence to generating output, and later output cannot magically influence output which has already been generated.
This is true for sequential outputs from an LLM (obviously), but it is also true inside a single output: the tokens of that output are themselves a sequence.
If you’re generating structured output (e.g. JSON, XML) that doesn’t look ordered, and your output is something like {decision: …, reason: …}, the reason field literally does nothing for the decision.
…but, it is valuable to “show the working out” when, as above, you then evaluate multiple solutions to a single request and pick the best one(s).
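Roughly, in code, point 1 looks something like this: a minimal sketch assuming an OpenAI-style chat client, with the model name and helper functions as placeholders; exact-match voting only makes sense when the answer is a short, constrained label (normalise longer answers before counting).

```python
# Minimal sketch of "sample N times and vote" (point 1 above).
# Assumes the official openai>=1.x client; model name is a placeholder.
from collections import Counter

from openai import OpenAI

client = OpenAI()

def sample_once(prompt: str, temperature: float = 0.7) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp.choices[0].message.content.strip()

def sample_and_vote(prompt: str, n: int = 10) -> str:
    """Draw n samples from the same prompt and return the majority answer."""
    answers = [sample_once(prompt) for _ in range(n)]
    winner, count = Counter(answers).most_common(1)[0]
    print(f"{count}/{n} samples agreed on {winner!r}")
    return winner
```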
You don't need to hit an LLM multiple times to get multiple distributions: just provide a list of perspectives, ask the model to answer the question from each of them in turn, then combine the results right there in the prompt. I have tested this approach a bunch; it works.
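Roughly what I mean, as a sketch; the perspective list and wording here are purely illustrative.

```python
# Sketch of the single-call, multi-perspective pattern described above.
PERSPECTIVES = ["a security reviewer", "a performance engineer", "an end user"]

def multi_perspective_prompt(question: str) -> str:
    parts = [
        "Answer the question below from each of the following perspectives in turn.",
        f"Question: {question}",
        "",
    ]
    for i, perspective in enumerate(PERSPECTIVES, 1):
        parts.append(f"{i}. Answer as {perspective}.")
    parts.append("Finally, reconcile the answers above into one combined final answer.")
    return "\n".join(parts)

print(multi_perspective_prompt("Should we cache this endpoint?"))
```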
This isn't correct.
You're just sampling a different distribution.
You can adjust the shape of the distribution with your prompt, certainly... and if you make a good prompt, perhaps you can narrow the 'solution space' that you sample from.
...but you're still sampling randomly from a distribution, and the Nth token relies on the (N-1)th token as an input; that means a random deviation toward a bad solution compounds into a bad solution, regardless of your prompt.
...
Consider the prompt "Your name is Pete. What is your name?"
Seems like a fairly narrow distribution right?
However, there's a small chance that the first generated token is 'D'; it's small, but non-zero, which means it happens from time to time. The higher the temperature, the more random the output tokens.
How do you imagine that completion runs when it happens? Doug? Dane? Daniel? Dave? Don't know? I'll tell you what it is not: it's not Pete.
That's the issue here: when you sample, the solution space is wide, and any single sample has some probability P of being a stupid hallucination.
When you sample multiple times, the chance that every sample is that hallucination is P * P * P * P, and so on, once per sample.
You can therefore control your error rate this way, because you can calculate the chance of failure as P^N.
Yes, obviously, if your P(good answer) < P(bad answer) it has the opposite effect.
...but no, sampling once does not save you from this problem, no matter what your prompt is or how good it is.
Furthermore, when you're evaluating prompts, only sampling once means you have no way of knowing whether it was a good prompt or not. Whereas if you sample, say, 10 times, you can see obviously, from the outputs (e.g. Pete, Pete, Pete, Pete, Potato, Pete, Pete <--- ), what the prompt is doing.
You can measure the error rate of your prompts this way (sketch at the end of this comment).
If you don't, honestly, you really have no idea if your prompts are any good at all. You're just guessing.
People who run a prompt, tweak it, run it, tweak it, run it, tweak it, etc. are observing random noise, not doing prompt engineering.
Never sample only once.
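To make the arithmetic concrete, here's a rough sketch. The 5% per-sample hallucination rate is made up, and P^N assumes the samples fail independently (a caveat raised further down the thread).

```python
# Back-of-the-envelope version of the P^N argument above.
p = 0.05  # assumed per-sample hallucination rate (illustrative only)
for n in (1, 3, 5, 10):
    print(f"n={n:2d}  P(every sample hallucinates) = {p**n:.2e}")

# Measuring a prompt the same way: run it N times and look at the spread,
# e.g. samples = ["Pete", "Pete", "Pete", "Pete", "Potato", "Pete", "Pete"].
from collections import Counter

def empirical_error_rate(samples: list[str], expected: str) -> float:
    print(Counter(samples))  # shows the spread of answers at a glance
    return sum(1 for s in samples if s != expected) / len(samples)
```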
I suggest you spend 20 hours evaluating the results of 10 prompts vs 1 prompt with multiple perspectives to learn the truth about the matter rather than trying to armchair expert.
Edit in response to your wall of text: I have *extensively* tested the results of multi-shot prompting vs repeated single shot prompting, and the differences between them are not material to the outcome of "averaging" results, or selecting the best result. You can theorize all you want, but the real world would like a word.
Just click on the 'regenerate result' button a few times and see what happens before you change the prompt. That's all it takes.
It's an easy adjustment to workflows that people often either forget to do or don't realise they should be doing.
Sorry; I'm not trying to criticize; I'm just telling you that's how it works.
That's an early step that matters more when you're hitting a chat interface with a hidden temperature setting. Once you get a prompt dialled in, you usually want to lower the temperature to the minimum value that still produces the desired results.
I think the two of you are arguing different things.
OP is saying that you can’t evaluate any prompt from just one generation with that prompt.
You need to run that prompt several times to approximate any prompt’s performance. That’s just how probability works.
I don’t believe OP is arguing the effectiveness of running 10 different prompts vs a single multiple perspective prompt.
I mean, if you are attempting to use multiple sampling to avoid this kind of error, just use temperature 0 sampling and do it once.
In your example you will get Pete every time.
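As a sketch, assuming an OpenAI-style client (the model name is a placeholder). Note that temperature 0 is effectively greedy decoding, and on some providers it is still not perfectly deterministic across runs, but it collapses nearly all of the variance in a case like this.

```python
# Temperature-0 version of the "Pete" example above.
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Your name is Pete. What is your name?"}],
    temperature=0,  # always pick (close to) the highest-probability token
)
print(resp.choices[0].message.content)
```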
Sure, that's fair.
I will say, though, that using temperature 0 without understanding it (or worse, testing at temp > 0 and then setting temp to 0 for production, which I literally had to stop someone I know and respect as a developer from doing), and using top_k and top_p without understanding what they do, is my #3 LLM fail.
/shrug
...but yes, as you say, in a trivial case like binary decision making, a zero or very low temperature can reduce the need to sample multiple times; and as you say, once the output is deterministic, sampling multiple times doesn't help at all.
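For anyone unsure what those knobs actually do, here's a toy, model-free sketch: the logits are made up, and top_p follows the usual nucleus-sampling convention.

```python
# Toy illustration of how temperature, top_k and top_p reshape the
# next-token distribution (no model involved; logits are invented).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sampling_dist(logits, temperature=1.0, top_k=None, top_p=None):
    probs = softmax(np.asarray(logits, dtype=float) / max(temperature, 1e-8))
    if top_k is not None:
        # zero out everything but the k highest-probability tokens
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
    if top_p is not None:
        # keep the smallest set of top tokens whose cumulative mass reaches top_p
        order = np.argsort(probs)[::-1]
        mass_before = np.cumsum(probs[order]) - probs[order]
        keep = np.zeros_like(probs, dtype=bool)
        keep[order[mass_before < top_p]] = True
        probs = np.where(keep, probs, 0.0)
    return probs / probs.sum()

logits = [5.0, 3.5, 3.0, 1.0, -1.0]               # five fake candidate tokens
print(sampling_dist(logits))                       # plain softmax
print(sampling_dist(logits, temperature=0.2))      # sharper: rare tokens nearly vanish
print(sampling_dist(logits, top_k=2))              # only the 2 best survive
print(sampling_dist(logits, top_p=0.9))            # smallest set covering 90% of the mass
```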
What are some good metrics to evaluate LLM output performance in general? Or is it too hard to quantify at this stage (or not understood well enough)? Perhaps the latter, or else those metrics could be in the loss function itself..
There are a few subtle misconceptions being spread here:
1) Hallucination rate is not inversely proportional to the number of samples unless you assume statistical independence. Since you’re sampling from the same generative process each time, any inherent bias of the LLM can affect every sample (e.g. see Golden Gate Claude). Naively calculating the hallucination rate as P^N is going to be a massive underestimate of the true error rate for many tasks requiring factual accuracy (see the simulation sketch below).
2) You’re right that output tokens are generated autoregressively, but you are thinking like a human. Transformer attention layers are permutation invariant. The ordering of the output (e.g. decision first, then justification later) is inconsequential; either can be derived from the input context and hidden state where there is no causal masking of attention.
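A quick toy simulation of the independence caveat in point 1 (all numbers illustrative): if some fraction of errors come from a shared bias that hits every sample for a given question, majority voting can't vote them away, and P^N badly underestimates the true error rate.

```python
# Toy model: some questions trigger a shared bias (every sample gives the
# same wrong answer); the rest fail independently per sample.
import random

def voted_error_rate(n_samples, p_random_err=0.2, p_bias=0.05, trials=100_000):
    errors = 0
    for _ in range(trials):
        if random.random() < p_bias:
            errors += 1                      # bias hits every sample; voting can't fix it
            continue
        wrong = sum(random.random() < p_random_err for _ in range(n_samples))
        if wrong > n_samples / 2:            # majority vote lands on a wrong answer
            errors += 1
    return errors / trials

print(voted_error_rate(n_samples=1))   # ~0.24
print(voted_error_rate(n_samples=9))   # ~0.07: floored by the bias, nowhere near 0.2**9
```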
Justification before decision still works out better in practice, though, because of chain of thought [1]. You'll tend to get more accurate and better-justified decisions.
With decision before justification, you tend to have a greater risk of the output being a wrong decision followed by convincing BS justifying it.
(edit: Another way you could think of it is, LLMs still can't violate causality. Attention heads' ability to look in both directions with respect to a particular token's position in the sequence does not enable them to see into the future and observe tokens that don't exist yet.)
1: https://arxiv.org/abs/2201.11903
I totally agree; that's what I had to do with my patchbot that evaluates haproxy patches for backporting ( https://github.com/haproxy/haproxy/tree/master/dev/patchbot/ ). Originally it would just provide a verdict and then justify it, and it worked extremely poorly, often with a justification that directly contradicted the verdict. I swapped that around, asking for the analysis first and the final verdict last, and now the success rate is totally amazing (particularly with Mistral, which remains unbeatable at this task because it follows instructions extremely well).
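For illustration, the pattern looks roughly like this; a generic sketch of "analysis first, verdict last", not the actual patchbot prompt.

```python
# Because output is generated left to right, putting the analysis field before
# the verdict field means the verdict tokens are conditioned on the analysis,
# not the other way around.
import json

PROMPT_TEMPLATE = """Review the following patch and decide whether it should be backported.

Patch:
{patch}

Respond with JSON, using exactly this field order:
{{
  "analysis": "<your reasoning about risk, scope and dependencies>",
  "verdict": "<yes | no | uncertain>"
}}"""

def parse_verdict(reply: str) -> str:
    return json.loads(reply)["verdict"]  # generated after, and conditioned on, the analysis
```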
You find Mistral to be the best "open"/local model? Or you find it to be the best model period?
Your second point I either don’t correctly understand, or it seems to fly in the face of a lot of proven techniques. Chain-of-thought, ReAct, and decision transformers all showcase that the order of an LLM’s output matters, because the tokens the LLM emits before the “answer” can nudge the model to sample from a higher-quality part of the distribution for the remainder of its output.
Is this true if you are using RAG too?
The core issue the parent is talking about is whether the decision tokens are built on the reasoning tokens, or the reasoning tokens are generated to fit decision tokens that already exist. RAG just provides the context the LLM should reason about.
If you set the temperature to zero, the output will always be the same, not a distribution. If instead you increase the temperature, the LLM will sometimes choose tokens other than the one with the highest score, but the output won’t be that much different.
This is crazy town banana pants.
Beam search [1] has long been a great way to decode from language models, even before transformers. Essentially you keep the top N most promising partial sequences (beams) at each step and extend those, rather than committing to a single path (a toy sketch follows below).
OpenAI doesn't offer beam search yet, just temperature and top_p, but I hope they add support for it, because it's far more efficient than just starting over each time.
[1]: https://www.width.ai/post/what-is-beam-search
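A toy sketch of the idea: a made-up scorer stands in for the model here; a real implementation would score continuations with the model's log-probabilities.

```python
# Toy beam search: keep the beam_width best partial sequences at each step.
def next_token_logprobs(prefix):
    # Stand-in for the model: tiny fixed vocabulary with made-up log-probs.
    return {"Pete": -0.2, "Dave": -2.5, "<eos>": -0.7}

def beam_search(beam_width=3, max_len=4):
    beams = [((), 0.0)]                                   # (tokens, total log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == "<eos>":
                candidates.append((tokens, score))        # finished beams carry over
                continue
            for tok, lp in next_token_logprobs(tokens).items():
                candidates.append((tokens + (tok,), score + lp))
        # keep only the beam_width best-scoring partial sequences
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

for tokens, score in beam_search():
    print(f"{score:7.2f}  {' '.join(tokens)}")
```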
We allude to 2) when talking about using explanations first, but I totally agree. One minor comment: explanations after the answer can sometimes be useful for understanding how the model came to a particular generation during post-hoc evals.
Point 1 is also a good callout. I added something on this for the LLM-judge part, but it’s relevant more broadly.
To the user
But these tools are marketed as if you only need to run them once to get a good result; the companies behind them would really like you to stop hammering the button that deletes their money.
As an aside:
This isn't really true, and it requires you to fuzz the prompt itself for best effect, making the "spam the LLM with requests" problem much worse.