What we've learned from a year of building with LLMs

dbs
43 replies
20h58m

Show me the use cases you have supported in production. Then I might read all 30 pages praising the dozens (soon to be hundreds?) of “best practices” for building with LLMs.

robbiemitchell
21 replies
20h8m

Processing high volumes of unstructured data (text)… we’re using a STAG architecture.

- Continually generate targeted LLM micro-summaries of every record (ticket, call, etc.)

- Use layers of regex, semantic embeddings, and scoring enrichments to identify report rows (pivots on aggregates) worth attention, running on a schedule

- Proactively explain each report row by identifying what’s unusual about it and LLM-summarizing a subset of the micro-summaries.

- Push the result to a webhook
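
Roughly, the scheduled part looks like the sketch below (build_pivot_rows, attention_score, and llm_summarize are hypothetical placeholders, not our actual helpers):

    import requests  # any HTTP client works for the webhook push

    def run_scheduled_report(records, webhook_url):
        # Each record already carries a continually updated LLM micro-summary.
        summaries = {r["id"]: r["micro_summary"] for r in records}

        # Layers of regex / embedding / scoring enrichments flag pivot rows worth attention.
        flagged = [row for row in build_pivot_rows(records) if attention_score(row) > 0.8]

        # Proactively explain each flagged row from a subset of the micro-summaries.
        for row in flagged:
            subset = [summaries[rid] for rid in row["record_ids"][:20]]
            row["explanation"] = llm_summarize(
                f"What is unusual about this report row?\n{row}\n\n"
                "Related micro-summaries:\n" + "\n".join(subset)
            )

        # Push the result to a webhook.
        requests.post(webhook_url, json={"rows": flagged})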

Lack of JSON schema restriction is a significant barrier to entry when hooking LLMs up to a multi-step process.

Another is preventing LLMs from adding intro or conclusion text.

BoorishBears
15 replies
19h48m

Lack of JSON schema restriction is a significant barrier to entry when hooking LLMs up to a multi-step process.

How are you struggling with this, let alone as a significant barrier? JSON adherence with a well-thought-out schema hasn't been a worry for a while, between improved model performance and various grammar-based constraint systems.
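
For example, the common pattern today is to validate against a Pydantic schema and retry automatically, e.g. with the instructor library mentioned elsewhere in this thread (a sketch; check the library's docs for the current API):

    import instructor
    from openai import OpenAI
    from pydantic import BaseModel

    class TicketSummary(BaseModel):
        sentiment: str
        priority: int
        topics: list[str]

    # instructor patches the client to parse, validate, and retry against the schema
    client = instructor.from_openai(OpenAI())

    summary = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        response_model=TicketSummary,
        messages=[{"role": "user", "content": "Summarize this ticket: ..."}],
    )
    # summary is a validated TicketSummary instance, not a raw string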

Another is preventing LLMs from adding intro or conclusion text.

Also trivial to work around with pre-filling and stop tokens, or just extremely basic text parsing.
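
For instance, a pre-fill plus stop-sequence sketch against the Anthropic API (model name is a placeholder):

    import anthropic

    client = anthropic.Anthropic()
    resp = client.messages.create(
        model="claude-3-opus-20240229",  # placeholder
        max_tokens=512,
        messages=[
            {"role": "user", "content": "List three project risks as a JSON array of strings."},
            # Pre-filling the assistant turn forces generation to start inside the JSON,
            # so there is no room for an intro sentence.
            {"role": "assistant", "content": "["},
        ],
        stop_sequences=["]"],  # and stop before any conclusion text
    )
    data = "[" + resp.content[0].text + "]"  # re-attach the pre-fill and the stop token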

Also, I'd recommend writing out Stream-Triggered Augmented Generation, since the term is so rarely used it might as well be made up from the POV of someone trying to understand the comment.

robbiemitchell
14 replies
19h36m

Asking even a top-notch LLM to output well-formed JSON simply fails sometimes. And when you’re running LLMs at high volume in the background, you can’t use the best available models until the last mile.

You work around it with post-processing and retries. But it’s still a bit brittle given how much stuff happens downstream without supervision.
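
The scaffolding ends up looking something like this rough sketch (call_llm and validate are hypothetical placeholders for your client and schema check):

    import json

    def call_llm_for_json(prompt, validate, max_retries=3):
        # Re-ask with the error appended until the output parses and validates.
        last_error = None
        for _ in range(max_retries):
            full_prompt = prompt if last_error is None else (
                f"{prompt}\n\nYour previous output was invalid ({last_error}). "
                "Return only corrected JSON, with no other text."
            )
            raw = call_llm(full_prompt)
            cleaned = raw.strip().removeprefix("```json").removesuffix("```")
            try:
                data = json.loads(cleaned)
                validate(data)  # e.g. a jsonschema or Pydantic check
                return data
            except (json.JSONDecodeError, ValueError) as e:
                last_error = e
        raise RuntimeError(f"no valid JSON after {max_retries} attempts: {last_error}")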

jncfhnb
7 replies
18h1m

… why would you have the LLM spit out JSON rather than define the JSON yourself and have the LLM supply the values?
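
For illustration, that approach looks something like this (call_llm, ticket, and ticket_id are hypothetical placeholders); the code owns the JSON shape and the model only supplies scalar values:

    import json

    # Ask only for leaf values; the program owns the structure.
    priority = int(call_llm(
        "On a scale of 1-5, how urgent is this ticket? Reply with a single integer.\n" + ticket))
    sentiment = call_llm(
        "In one word (positive/neutral/negative), what is the sentiment?\n" + ticket).strip().lower()

    record = json.dumps({"ticket_id": ticket_id, "priority": priority, "sentiment": sentiment})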

esafak
4 replies
10h48m

If the LLM doesn't output data that conforms to a schema, you can't reliably parse it, so you're back to square one.

jncfhnb
3 replies
5h38m

It’s significantly easier to output an integer than a JSON object with a key-value structure where the value is an integer and everything else is exactly as desired.

esafak
2 replies
1h37m

That's because you've dumbed down the problem. If it were just about outputting one integer, there would be nothing to discuss. Now add a bunch more fields, add some nesting and other constraints...

neverokay
0 replies
18m

Is there a general model that got fine-tuned on these JSON schema/output pairs?

Seems like it would be universally useful.

jncfhnb
0 replies
1h2m

The more complexity you add, the less likely the LLM is to give you a valid response in one shot. It’s still going to be easier to get the LLM to supply values to a fixed schema than to get it to give you both the answers and the schema.

janpieterz
1 replies
13h54m

How would I do this reliably? E.g. give me 10 different values, all in one prompt for performance reasons?

Might not need JSON but whatever format it outputs, it needs to be reliable.

jncfhnb
0 replies
4h12m

Don’t do it all in one prompt.

fancy_pantser
3 replies
18h22m

Constrained output with GBNF grammars or JSON schemas is much more efficient and less error-prone. I hope nobody outside of hobby projects is still using error/retry loops.
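
For anyone unfamiliar, a minimal GBNF sketch via the llama-cpp-python bindings (the model path is a placeholder):

    from llama_cpp import Llama, LlamaGrammar

    # The decoder can only ever emit tokens that keep the output inside this grammar.
    grammar = LlamaGrammar.from_string(r'''
    root   ::= "{" ws "\"priority\"" ws ":" ws number ws "}"
    number ::= [0-9]+
    ws     ::= [ \t\n]*
    ''')

    llm = Llama(model_path="model.gguf")  # placeholder path
    out = llm("Rate the priority of this ticket as JSON: ...", grammar=grammar, max_tokens=32)
    print(out["choices"][0]["text"])  # guaranteed to match the grammar, no retries needed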

joatmon-snoo
2 replies
12h34m

Constraining output means you don’t get to use ChatGPT or Claude though, and now you have to run your own stuff. Maybe for some folks that’s OK, but really annoying for others.

fancy_pantser
1 replies
12h13m

You're totally right, I'm in my own HPC bubble. The organizations I work with create their own models and it's easy for me to forget that's the exception more than the rule. I apologize for making too many assumptions in my previous comment.

joatmon-snoo
0 replies
3h18m

Not at all!

Out of curiosity: do those orgs not find the loss of generality that comes from custom models to be an issue? E.g. vs using Llama or Mistral or some other open model?

yeahwhatever10
0 replies
16h50m

The phrase you want to search is "constrained decoding".

BoorishBears
0 replies
16h14m

The best available actually have the fewest knobs for JSON schema enforcement (i.e. OpenAI's JSON mode, which technically can still produce incorrect JSON).

If you're using anything less, you should have a grammar that enforces exactly which tokens are allowed to be output. Fine-tuning can help too if you're worried about the effects of constraining the generation, but in my experience that's not really an issue.

adamsbriscoe
1 replies
17h17m

Lack of JSON schema restriction is a significant barrier to entry when hooking LLMs up to a multi-step process.

(Plug) I shipped a dedicated OpenAI-compatible API for this, jsonmode.com, a couple of weeks ago and just integrated Groq (they were nice enough to bump up the rate limits), so it's crazy fast. It's a WIP, but so far it's very comparable to JSON output from frontier models, with some bonus features (web crawling etc.).

tarasglek
0 replies
10h6m

The Metallica-esque lightning logo is cool.

lastdong
0 replies
10h32m

“Use layers of regex, semantic embeddings, and scoring enrichments to identify report rows (pivots on aggregates) worth attention, running on a schedule”

This is really interesting. Are there any architecture docs or articles you can recommend?

joatmon-snoo
0 replies
12h37m

We actually built an error-tolerant JSON parser to handle this. Our customers were reporting exactly the same issue: trying a bunch of different techniques to get more usefully structured data out.

You can check it out over at https://github.com/BoundaryML/baml. Would love to talk if this is something that seems interesting!
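
The core idea in toy form (an illustration of error-tolerant parsing, not BAML's actual implementation):

    import json
    import re

    def lenient_json(raw: str):
        # Pull the first {...} span out of any surrounding chatter or markdown fences.
        match = re.search(r"\{.*\}", raw, re.DOTALL)
        candidate = match.group(0) if match else raw
        repairs = (
            candidate,
            re.sub(r",\s*([}\]])", r"\1", candidate),  # drop trailing commas
            candidate + "}",                           # close a truncated object
        )
        for attempt in repairs:
            try:
                return json.loads(attempt)
            except json.JSONDecodeError:
                continue
        return None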

benreesman
0 replies
19h11m

I only became aware of it recently and therefore haven’t done more than play with it in a fairly cursory way, but unstructured.io seems to have a lot of traction, and certainly in my little toy tests their open-source stuff seems pretty clearly better than the status quo.

Might be worth checking out.

fnordpiglet
11 replies
17h47m

We use LLMs in dozens of different production applications for critical business flows. They allow for a lot of dynamism in our flows that aren’t amenable to direct quantitative reasoning or structured workflows. Double-digit percentages of our growth in the last year are entirely due to them. The biggest challenges are the toolchain, limits on inference capacity, and developer understanding of the abilities, limits, and techniques for using LLMs effectively.

I often see these messages from the community doubting the reality, but LLMs are a powerful tool in the tool chest. I think most companies are not yet staffed with engineers skilled enough, or of a creative enough bent, to really take advantage of them, or willing to fund basic research and first-principles toolchain creation. That’s ok. But it’s foolish to assume this is all hype like crypto was. The parallels are obvious but the foundations are different.

TeMPOraL
5 replies
13h8m

We use LLMs in dozens of different production applications for critical business flows. They allow for a lot of dynamism in our flows that aren’t amenable to direct quantitative reasoning or structured workflows. Double-digit percentages of our growth in the last year are entirely due to them. The biggest challenges are the toolchain, limits on inference capacity, and developer understanding of the abilities, limits, and techniques for using LLMs effectively.

That sounds like corporate buzzword salad. It doesn't tell much as it stands, not without at least one specific example to ground all those relative statements.

mloncode
4 replies
13h1m

Hi, Hamel here. I'm one of the co-authors. I'm an independent consultant and not all clients allow me to talk about their work.

However, I have two that do, which I've discussed in the article. These are two production use cases that I have supported (which again, are explicitly mentioned in the article):

1. https://www.honeycomb.io/blog/introducing-query-assistant

2. https://www.youtube.com/watch?v=B_DMMlDuJB0

Other co-authors have worked on significant bodies of work:

Bryan Bischof led the creation of Magic in Hex: https://www.latent.space/p/bryan-bischof

Jason Liu created the most popular OSS library for structured data, instructor (https://github.com/jxnl/instructor), and works with some of the leading companies in the space like Limitless and Raycast (https://jxnl.co/services/#current-and-past-clients)

Eugene Yan works with LLMs extensively at Amazon and uses that to inform his writing: https://eugeneyan.com/writing/ (However he isn't allowed to share specifics about Amazon)

I believe you might find these worth looking at.

mattmanser
3 replies
9h1m

You've linked to a query generator for a custom programming language and a 1-hour video about LLM tools. The cynic in me feels like the former could probably be done by ChatGPT off the shelf.

But those do not seem to be real world business cases.

Can you expand a bit more on why you think they are? We don't have hours to spend reading, and you say you've been allowed to talk about them.

So can you summarise the business benefits for us, which is what people are asking for, instead of linking to huge articles?

mloncode
0 replies
4h52m

do not seem to be real world business cases

The first one is a real-world product that lives in production and is user-facing, as part of a paid product.

The second video goes in depth on how an AI assistant was built for a real estate CRM company, also a paid product.

I don’t understand the assertion that it’s not “real world” or not “business”.

Here are additional articles about these

https://help.rechat.com/guides/lucy

https://www.prnewswire.com/news-releases/honeycomb-launches-...

idf00
0 replies
6h15m

They think they are real business use cases because real businesses use them to solve their use cases. They know that ChatGPT can't solve this off the shelf, because they tried that first and were forced to do more in order to solve their problem.

There's a summary for ya! More details in the stuff that they linked if you want to learn. Technical skills do require a significant time investment to learn, and LLM usage is no different.

80hd
0 replies
8h3m

Sounds like something you could do with an LLM

threeseed
2 replies
16h19m

No one is saying that all of AI is hype. It clearly isn't.

But the facts are that today LLMs are not suitable for use cases that need accurate results. And there is no evidence or research that suggests this is changing anytime soon. Maybe forever.

There are very strong parallels to crypto in that (a) people are starting with the technology and trying to find problems and (b) there is a cult like atmosphere where non-believers are seen as being anti-progress and anti-technology.

fnordpiglet
1 replies
16h15m

Yeah, I think a key point is that LLMs in business are not generally useful alone. They require classical computing techniques to really be powerful. Accurate computation is a generally well-established field, and you don’t need an LLM to do optimization or math or even deductive logical reasoning. That’s a waste of their power, which is typically abstract semantic abductive “reasoning” and natural language processing. Overlaying this with constraints and structure, and augmenting with optimizers, solvers, etc., you get a form of computing that was impossible five years ago and has only been practical in the last nine months.

On the crypto stuff, yeah, I get it - especially if you’re not in the weeds of its use. A lot of people formed opinions from GPT-3.5, Gemini, Copilot, and other crappy experiences and haven’t kept up with the state of the art. The rate of change in AI is breathtaking and, I think, hard to comprehend for most people. The recent mess of crypto, and the fact that grifters grift, also hurts. But people who doubt -are- stuck in the past. That’s not necessarily their fault, it might not even apply to their career or lives in the present, and the flaws are enormous, as you point out. But it’s such a remarkably powerful new mode of compute that, in combination with all the other powerful modes of compute, it is changing everything and will continue to, especially if next-generation models keep improving as they seem likely to.

jeffreygoesto
0 replies
10h11m

That text applies to basically every new technology. The point is that you can't predict its usefulness in 20 years from that.

To me it still looks like a hammer made completely from rubber. You can practice to get some good hits, but it is pretty hard to get something reliable. And a beginner will basically just bounce it around. But it is sold as a rescue for beginners.

mvdtnz
1 replies
17h37m

Yet another post claiming "dozens" of production use cases without listing a single one.

fnordpiglet
0 replies
17h6m

I’ve listed plenty in my comment history. I don’t generally feel compelled to trot them all out all the time - I don’t need to “prove” anything and if you think I’m lying that’s your choice. Finally, many of our uses are trade secrets and a significant competitive advantage so I don’t feel the need to disclose them to the world if our competitors don’t believe in the tech. We can keep eating their lunch.

mloncode
1 replies
13h2m

Hi, Hamel here. I'm one of the co-authors. I'm an independent consultant and not all clients allow me to talk about their work.

However, I have two that do, which I've discussed in the article. These are two production use cases that I have supported (which again, are explicitly mentioned in the article):

1. https://www.honeycomb.io/blog/introducing-query-assistant

2. https://www.youtube.com/watch?v=B_DMMlDuJB0

Other co-authors have worked on significant bodies of work:

Bryan Bischof led the creation of Magic in Hex: https://www.latent.space/p/bryan-bischof

Jason Liu created the most popular OSS library for structured data, instructor (https://github.com/jxnl/instructor), and works with some of the leading companies in the space like Limitless and Raycast (https://jxnl.co/services/#current-and-past-clients)

Eugene Yan works with LLMs extensively at Amazon and uses that to inform his writing: https://eugeneyan.com/writing/ (However he isn't allowed to share specifics about Amazon)

I believe you might find these worth looking at.

anon373839
0 replies
7h25m

I know it’s a snarky comment you responded to, but I’m glad you did. Those are great resources, as is your excellent article. Thanks for posting!

thallium205
0 replies
20h10m

We have a company mail, fax, and phone room that receives thousands of pages a day; it now sorts, categorizes, and extracts useful information from them all in a completely automated way using LLMs. Several FTEs have been reassigned elsewhere as a result.

joe_the_user
0 replies
20h45m

I have a friend who uses ChatGPT for writing quick policy statements for her clients (mostly schools). I have a friend who uses it to create images and descriptions for DnD adventures. LLMs have uses.

The problem I see is: how can an "application" be anything but a little window onto the base abilities of ChatGPT, and so effectively offer nothing more to an end-user? The final result still has to be checked, and regular end-users have to write their own prompts.

Edit: I should also say that anyone who's designing LLM apps that, rather than being end-user tools, are effectively gatekeepers to getting action or "a human" from a company deserves a big "f* you", 'cause that approach is evil.

hubraumhugo
0 replies
12h54m

I think it comes down to relatively unexciting use cases that have a high business impact (process automation, RPA, data analysis), not fancy chatbots or generative art.

For example, we focused on the boring and hard task of web data extraction.

Traditional web scraping is labor-intensive, error-prone, and requires constant updates to handle website changes. It's repetitive and tedious, but couldn't be automated due to the high data diversity and many edge cases. This required a combination of rule-based tools, developers, and constant maintenance.

We're now using LLMs to generate web scrapers and data transformation steps on the fly that adapt to website changes, automating the full process end-to-end.

harrisoned
0 replies
20h18m

It certainly has use cases, just not as many as the hype led people to believe. For me:

- Regex expressions: ChatGPT is the best multi-million regex parser to date.

- Grammar and semantic check: It's a very good revision tool; it has helped me a lot of times, especially when writing in non-native languages.

- Artwork inspiration: Not only for visual inspiration, in the case of image generators, but descriptive as well. The verbosity of some LLMs can help describe things in more detail than a person would.

- General coding: While your mileage may vary on that one, it has helped me a lot at work building stuff in languages I'm not very familiar with. Just snippets, nothing big.

cqqxo4zV46cp
0 replies
13h17m

Or maybe they could choose to focus their attention on people that aren’t needlessly aggressive and adversarial.

bbischof
0 replies
12h52m

Hello, it’s Bryan, an author on this piece.

If you’re interested in using one of the LLM applications I have in prod, check out https://hex.tech/product/magic-ai/ - it has a free limit every month so you can give it a try and see how you like it. If you have feedback after using it, we’re always very interested to hear from users.

solidasparagus
31 replies
22h8m

No offense, but I'd love to see what they've successfully built using LLMs before taking their advice too seriously. The idea that fine-tuning isn't even a consideration (perhaps even something they think is absolutely incorrect, if the section titles of the unfinished section are anything to go by) is very strange to me and suggests a pretty narrow perspective IMO

lmeyerov
7 replies
20h32m

We work in some pretty serious domains and try to stay away from fine tuning:

- Most of our accuracy ROI is from agentic loops over top models, and dynamic RAG example injection goes far enough here that the relative lift of adding fine-tuning isn't worth the many costs

- A lot of fine-tuning is for OSS models that do worse than agentic loops over the proprietary GPT4/Opus3

- For distribution, it's a lot easier to deploy for pluggable top APIs without requiring fine-tuning, e.g., "connect to your gpt4/opus3 + for dumber-but-bigger tasks, groq"

- The resources we could put into fine-tuning are better spent on RAG, agentic loops, prompts/evals, etc

We do use tuned smaller dumber models, such as part of a coarse relevancy filter in a firehose pipeline... but these are outliers. Likewise, we expect to be using them more... but again, for rarer cases and only after we've exhausted other stuff. I'm guessing as we do more fine-tuning, it'll be more on embeddings than LLMs, at least until OSS models get a lot better.

solidasparagus
6 replies
20h8m

See, if the article said this, I would have agreed - fine-tuning is a tool and it should be used thoughtfully. Although I personally believe that in this funding climate it makes sense to make data collection and model training a core capability of any AI product. However, that will only be available and wise for some founders.

lmeyerov
5 replies
18h31m

Agreed, model training and data collection are great!

The subtle bit is that it just doesn't have to be for LLMs, as these are typically part of a system-of-models. E.g., we <3 RAG, and GNNs for improving your KG are fascinating. Likewise, dspy's explorations in optimizing prompts, vs LLMs, are very cool.

tarasglek
1 replies
10h25m

Can you give a concrete example of GNNs helping?

lmeyerov
0 replies
6h22m

Entity resolution - RAG often mixes vector & symbolic queries, and ER improves reverse indexing, which is a starting point for a lot of the symbolic ones

Identifying misinfo - Ranking & summarization based on internet data should be a lot more careful, and sometimes the controversy is the interesting part

For both, GNNs are generally SOTA

solidasparagus
1 replies
18h18m

we <3 RAG, and GNNs for improving your KG are fascinating

Oh man I am so torn between this being a fantastic idea and this being "building a better slide-rule in the age of the computer".

dspy is definitely a project I want to dig into more

lmeyerov
0 replies
5h48m

Yeah, I would recommend sticking to RAG on naively chunked data for weekend projects by one person. Likewise, for a consumer tool like Perplexity's search engine, where you minimize spend per user task or go bankrupt: same thing, do the cheap thing and move on, good enough.

Once RAG projects become important and good answers matter - we work with governments, manufacturers, banks, cyber teams, etc - working through data quality, data representation, & retrieval quality helps

Note that we didn't start here: We began with naive RAG, then relevancy filtering, then agentic & neurosymbolic querying, then dynamic example prompt injection, and now are getting into cleaning up the database/kg itself

For folks doing investigative/analytics projects in this space, happy to chat about what we are doing w Louie.AI. These are more implementation details we don't normally write about.

qeternity
0 replies
15m

Have you actually used DSPy? I still can't figure out what it's useful for beyond optimizing basic few-shot prompts.

OutOfHere
7 replies
21h5m

Fine-tuning is absolutely necessary for true AI, but even though it's desirable, it's infeasible for now for any large model, considering how expensive GPUs are. If I had infinite money, I'd throw it at continuous fine-tuning and would throw away the RAG. Fine-tuning also requires appropriate measures to prevent forgetting of older concepts.

solidasparagus
6 replies
20h57m

It is not infeasible. It is absolutely realistic to do distributed fine-tuning of an 8B text model on previous-generation hardware. You can add fine-tuning to your set of options for about the cost of one FTE - up to you whether that tradeoff is worth it, but in many places it is. The expertise to pull it off is expensive, but to get a mid-level AI SME capable of helping a company adopt fine-tuning, you are only going to pay about the equivalent of 1-3 senior engineers.

Expensive? Sure, all of AI is crazy expensive. Unfeasible? No

OutOfHere
5 replies
20h51m

I don't consider a small 8B model to be worth fine-tuning. Fine-tuning is worthwhile when you have a larger model with capacity to add data, perhaps one that can even grow its layers with the data. In contrast, fine-tuning a small saturated model will easily cause it to forget older information.

All things considered, in relative terms, as much as I think fine-tuning would be nice, it will remain significantly more expensive than just making RAG or search calls. I say this while being a fan of fine-tuning.

solidasparagus
3 replies
20h47m

I don't consider a small 8B model to be worth fine-tuning.

Going to have to disagree with you on that one. A modern 8B model that has been trained on enough tokens is ridiculously powerful.

OutOfHere
2 replies
20h38m

A well-trained 8B model will already be over-saturated with information from the start. It will therefore easily forget much old information when fine-tuned with new material. It just doesn't have the capacity to take in much more information.

Don't get me wrong. I think a 70B or larger model would be worth fine-tuning, especially if it can be grown further with more layers.

solidasparagus
1 replies
20h12m

A well-trained 8B model will already be over-saturated with information from the start

Any evidence of that that I can look at? This doesn't match what I've seen nor have I heard this from the world-class researchers I have worked with. Would be interested to learn more.

OutOfHere
0 replies
16h43m

Upon further thought, if fine-tuning involves adding layers, then the initial saturation should not matter. Let's say an 8B model adds 0.8 * 2 = 1.6B parameters of new layers for fine-tuning; then, with some assumptions, a ballpark is that this could be good for 16 million articles of fine-tuning data.

robrenaud
0 replies
2h18m

The reason to fine-tune is to get a model that performs well on a specific task. It could lose 90 percent of its knowledge and still beat the untuned model at the narrow task at hand. That's the point, no?

CuriouslyC
6 replies
21h17m

Fine-tuning has been on the way out for a while. It's hard to do right and costly. LoRAs are better for influencing output style, as they don't dumb down the model, and they're easier to create. This is on top of RAG just being better for new facts, like the other reply mentioned.

solidasparagus
4 replies
20h51m

How much of that is just the flood of traditional engineers into the space and the fact that collecting data and then fine-tuning models is orders of magnitude more complex than just throwing in RAG? I suspect a huge amount of RAG's popularity is just that any engineer can do a version of it + ChatGPT API calls in a day.

As for LoRA - in the context of my comment, that's just splitting hairs IMO. It falls in the category of fine-tuning for me, although I understand why you might disagree. But it's not like the article mentions LoRA either, nor am I aware of people doing LoRA without GPUs, which the article is against (No GPUs before PMF).

altdataseller
3 replies
14h21m

I disagree. No amount of fine-tuning will ever give the LLM the relevant context with which to answer my question. Maybe if your context is a static Wikipedia or something that will never change, you can fine-tune on it. But if your data and docs keep changing, how is fine-tuning going to be better than RAG?

solidasparagus
1 replies
14h5m

Continuous retraining and deployment maybe? But I'm actually not anti-RAG (although I think it is overrated because the retrieval problem is still handled extremely naively), I just think that fine-tuning should also be in your toolkit.

altdataseller
0 replies
2h17m

Why is the retrieval part overrated? There isn't even a single way to retrieve. It could be a simple keyword search, a vector search, a combo, or just simply retrieving a single doc and stuffing it in the context.

idf00
0 replies
7h33m

Luckily it's not one or the other. You can fine-tune and use RAG.

Sometimes RAG is enough. Sometimes fine-tuning on top of RAG is better. It depends on the use case. I can't think of any examples where you would want to fine-tune and not use RAG as well.

Sometimes you fine-tune a small model so it performs close to a larger variant on that specific narrow task, and you improve inference performance by using a smaller model.

phillipcarter
0 replies
2h32m

I don't see why this is seen as an either-or by people? Fine-tuning doesn't eliminate the need for RAG, and RAG doesn't obviate the need for fine-tuning either.

Note that their guidance here is quite practical:

If prompting gets you 90% of the way there, then fine-tuning may not be worth the investment.

gandalfgeek
3 replies
21h48m

This was kind of conventional wisdom ("fine tune only when absolutely necessary for your domain", "fine-tuning hurts factuality"), but some recent research (some of which they cite) has actually quantitatively shown that RAG is much preferable to FT for adding domain-specific knowledge to an LLM:

- "Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?" https://arxiv.org/abs//2405.05904

- "Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs" https://arxiv.org/abs/2312.05934

solidasparagus
2 replies
21h22m

Thanks, I'll read those more fully.

But "knowledge injection" is still pretty narrow to me. Here's an example of a very simple but extremely valuable usecase - taking a model that was trained on language+code and finetuning it on a text-to-DSL task, where the DSL is a custom one you created (and thus isn't in the training data). I would consider that close to infeasible if your only tool is a RAG hammer, but it's a very powerful way to leverage LLMs.

yoelhacks
0 replies
3h4m

This is exactly (one of) our use cases at Eraser - taking code or natural language and producing diagram-as-code DSL.

As with other situations that want a custom DSL, our syntax has its own quirks and details, but is similar enough to e.g. Mermaid that we are able to produce valid syntax pretty easily.

What we've found harder is controlling for edge cases about how to build proper diagrams.

For more context: https://www.eraser.io/decision-node/on-building-with-ai

gandalfgeek
0 replies
16h19m

Agree that your use-case is different. The papers above are dealing mostly with adding a domain-specific textual corpus, still answering questions in prose.

"Teaching" the LLM an entirely new language (like a DSL) might actually need fine-tuning, but you can probably build a pretty decent first-cut of your system with n-shot prompts, then fine-tune to get the accuracy higher.

jph00
1 replies
13h50m

The idea that fine-tuning isn't even a consideration (perhaps even something they think is absolutely incorrect, if the section titles of the unfinished section are anything to go by) is very strange to me and suggests a pretty narrow perspective IMO

The article has a section called "When to finetune", along with links to separate pages describing how to do so. They absolutely don't say that "fine-tuning isn't even a consideration". Instead, they describe the situations in which fine-tuning is likely to be helpful.

solidasparagus
0 replies
10h58m

Huh. Well that's embarrassing. I guess I missed it when I lost interest in the caching section and jumped straight to Evaluation and Monitoring.

bbischof
1 replies
12h44m

Hello, it’s Bryan, an author on this piece.

If you’re interested in using one of the LLM applications I have in prod, check out https://hex.tech/product/magic-ai/ - it has a free limit every month so you can give it a try and see how you like it. If you have feedback after using it, we’re always very interested to hear from users.

As far as fine-tuning in particular, our consensus is that there are easier options to try first. I personally have fine-tuned GPT models since 2022; here’s a silly post I wrote about doing it with GPT-2: https://wandb.ai/wandb/fc-bot/reports/Accelerating-ML-Conten...

solidasparagus
0 replies
11h23m

I took a look at Magic earlier today and it didn't work at all for me, sorry to say. After the example prompt, I tried to learn about a table and it generated bad SQL (a correct query to pull a row, but with LIMIT 0). I asked it to show me the DDL and it generated invalid SQL. Then I tried to ask it to do some population statistics on the customer table and ended up confused about why there appeared to be two windows in the cell, with the previously generated SQL on the left and the newly generated SQL on the right. The new SQL wouldn't run when I hit Run Cell; the error showed the originally generated SQL. I gave up and bounced.

I went back while writing this comment and realized it might be showing me a diff (better use of color would have helped, I have been trained by github). But I was at a loss for what to do with that. I just now figured out the Keep button exists and it accepted the diff and now it sort of makes sense, but the SQL still doesn't return any results.

My honest feedback is that there is way too much stuff I don't understand on the screen, and it makes me confused and a little stressed. Ease me into it please, I'm dumb. There seem to be cells that are linked together and cells that aren't (separated by a purplish background?) and I don't understand it. I am a Jupyter user and I feel like this should be intuitive to me, but it isn't. I am not a designer, but I suspect the structural markings like cell boundaries are too faint compared to the content of the cells, and/or the exterior of a cell having the same color as the interior is making it hard for me. I feel lost in a sea of white.

But the core issue is that, excluding the prompt I copy-pasted word for word, which worked like a charm, I am 0 out of 4 on actually leveraging AI to solve the problems I asked of Magic. I like the concept of natural language BI (I worked on it in the early days when Alexa came out), so I probably gave it more chances than I would have for a different product.

For me, it doesn't fit my criteria for good problems to solve with AI in 2024 - the conversational interface and the binary right/wrong nature of querying/presenting data accurately make the cost of failure too high, which is a death sentence for AI products IMO (compare to proactive, non-blocking products like Copilot, or shades-of-wrong problems like image generation or conversations with imaginary characters). But text-to-SQL and data presentation make sense as AI capabilities in 2024, so I can see why that could be a good product to pursue. If it worked, I would definitely use it.

Multicomp
8 replies
21h36m

Anyone have a convenient solution for doing multi-step workflows? For example, I'm filling out the basics of an NPC character sheet in my game prep. I'm using a certain rule system, giving the enemy certain tactics, certain stats, certain types of weapons. Right now I have a 'god prompt' trying to walk the LLM through creating the basic character sheet, but the responses get squeezed down into what one or two prompt responses can hold.

If I could do node-red or a function chain for prompts and outputs, that would be sweet.

hugocbp
1 replies
21h7m

For me, a very simple "break down tasks into a queue and store in a DB" solution has helped tremendously with most requests.

Instead of trying to do everything in a single chat or chain, add steps to ask the LLM to break down the next tasks, with context, and store those in SQLite or something. Then start new chats/chains on each of those tasks.

Then just loop them back into LLM.

I find that long chats or chains just confuse most models and we start seeing gibberish.

Right now I'm favoring something like:

"We're going to do task {task}. The current situation and context is {context}.

Break down what individual steps we need to perform to achieve {goal} and output these steps with their necessary context as {standard_task_json}. If the output is already enough to satisfy {goal}, just output the result as text."

I find that leaving everything to the LLM in a sequence is not as effective as using the LLM to break things down and having a DB and code logic to support the development of more complex outcomes.
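
Concretely, the loop might look something like this sketch (call_llm is a placeholder for whatever client you use):

    import json
    import sqlite3

    db = sqlite3.connect("tasks.db")
    db.execute("CREATE TABLE IF NOT EXISTS tasks (id INTEGER PRIMARY KEY, context TEXT, status TEXT)")

    def plan_and_enqueue(task, context, goal):
        # The model only decomposes; the code owns the queue and the loop.
        steps = json.loads(call_llm(
            f"We're going to do task {task}. The current situation and context is {context}. "
            f"Break down what individual steps we need to perform to achieve {goal} "
            'and output these steps as a JSON list of {"step": ..., "context": ...} objects.'))
        for step in steps:
            db.execute("INSERT INTO tasks (context, status) VALUES (?, 'pending')",
                       (json.dumps(step),))
        db.commit()

    def work_queue():
        # Each step gets a fresh, short chat instead of one ever-growing conversation.
        rows = db.execute("SELECT id, context FROM tasks WHERE status = 'pending'").fetchall()
        for task_id, ctx in rows:
            result = call_llm(f"Perform this step and report the result: {ctx}")
            # ...store or route the result as needed, then mark the task done.
            db.execute("UPDATE tasks SET status = 'done' WHERE id = ?", (task_id,))
        db.commit()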

datameta
0 replies
28m

Indeed! If I'm met with several misunderstandings in a row, asking it to explain what I'm trying to do is a pretty surefire way to move forward.

Also mentioning what to "forget" or not focus on anymore seems to remove some noise from the responses if they are large.

mentos
0 replies
21h7m

I still haven’t played with using one LLM to oversee another.

“You are in charge of game prep and must work with an LLM over many prompts to…”

gpsx
0 replies
21h4m

One option for doing this is to incrementally build up the "document" using isolated prompts for each section. I say document because I am not exactly sure what the character sheet looks like, but I am assuming it can be constructed one section at a time. You create a prompt to create the first section. Then, you create a second prompt that gives the agent your existing document and prompts it to create the next section. You continue until all the sections are finished. In some cases this works better than doing a single conversation.
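
A minimal sketch of that incremental approach (call_llm and the section names are hypothetical placeholders):

    def build_character_sheet(npc_brief, sections=("Stats", "Weapons", "Tactics", "Lore")):
        sheet = ""
        for section in sections:
            # Each section gets an isolated prompt that sees the document so far.
            sheet += call_llm(
                f"NPC brief: {npc_brief}\n\n"
                f"Character sheet so far:\n{sheet}\n\n"
                f"Write only the '{section}' section, consistent with the above."
            ) + "\n\n"
        return sheet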

CuriouslyC
0 replies
21h22m

You can do multi-shot workflows pretty easily. I like to have the model produce markdown, then add code blocks (```json/yaml```) to extract the interim results. You can lay out multiple "phases" in your prompt and have it perform each one in turn, and have each one reference prior phases. Then at the end you just pull out the code blocks for each phase and you have your structured result.
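
The final extraction step is then just a regex over the fenced blocks, e.g. (a sketch):

    import re

    def extract_phase_blocks(markdown: str):
        # Pull every fenced json/yaml block, in order, out of the multi-phase response.
        pattern = r"```(json|yaml)\n(.*?)```"
        return [(lang, body.strip()) for lang, body in re.findall(pattern, markdown, re.DOTALL)]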

127
0 replies
18h9m

Did you force it into a parser? You can define a simple language in llama.cpp for the LLM to obey.

sheepscreek
4 replies
18h2m

I’m sure this has some decent insights but it’s from almost 1 year ago! A lot has changed in this space since then.

bgrainger
3 replies
15h51m

Are you sure? The article says "cite this as Yan et al. (May 2024)" and published-time in the metadata is 2024-05-12.

Weird: I just refreshed the page and it now redirects to a different domain (than the originally-submitted URL) and has a date of June 8, 2023. It still cites articles and blog posts from 2024, though.

jph00
2 replies
14h41m

Looks like they made a mistake in the article metadata - they definitely just released this article.

jph00
1 replies
13h55m

OK I let them know, and they've fixed it now.

sheepscreek
0 replies
2h27m

Awesome - thanks. Makes much more sense now. Can’t update my original comment but hopefully people will read this.

JKCalhoun
4 replies
2h59m

Note that in recent times, some doubt has been cast on if this technique is as powerful as believed. Additionally, there’s significant debate as to exactly what is going on during inference when Chain-of-Thought is being used...

I love this new era of computing we're in where rumors, second-guessing and something akin to voodoo have entered into working with LLMs.

ezst
3 replies
2h44m

That's the thing, it's a novel form of computing that's increasingly moving away from computer science. It deserves to be treated as a discipline of its own, with lots of words of caution and danger stickers slapped over it.

amelius
1 replies
1h0m

Yeah like psychology being a different field from physics even if it is running on atoms ultimately.

Imagine if physics literature was filled with stuff about psychology and how that would drive physicists nuts. That's how I feel right now ;)

throwup238
0 replies
17m

Quantum consciousness!

skydhash
0 replies
2h17m

It’s text (word) manipulation based on probabilistic rules derived from analyzing human-produced text. And everyone knows language is imperfect. That’s why we have introduced logic and formalism, so that we can reliably transmit knowledge.

That’s why LLMs are good at translating and spellchecking. We’ve been describing the same world, and almost all texts respect grammar. Those are the first things that surface. But you can extract the same rules in other ways and create a program that does it without the waste of computing power.

If we describe computing as solving problems, then it’s not computing, because if your solution was not part of the training data, you won’t solve anything. If we describe computing as symbol manipulation, then it’s not doing a good job, because the rules change with every model and they are probabilistic. No way to get a reliable answer. It’s divination without the divine (no hint from an omniscient entity).

threeseed
2 replies
18h46m

RAG does not prevent hallucinations, nor does it guarantee that the quality of your output is contingent solely on the quality of your input. Using LLMs for legal use cases, for example, has shown them to be poor for anything other than initial research, as they are accurate at best 65% of the time:

https://dho.stanford.edu/wp-content/uploads/Legal_RAG_Halluc...

So I would strongly disagree that LLMs have become “good enough for real-world applications” based on what was promised.

phillipcarter
0 replies
2h38m

So I would strongly disagree that LLMs have become “good enough for real-world applications” based on what was promised.

I can't speak for "what was promised" by anyone, but LLMs have been good enough to live in production as a core feature in my product since early last year, and have only gotten better.

mattyyeung
0 replies
5h14m

You may be interested in "Deterministic Quoting" [1]. This doesn't completely "solve" hallucinations, but I would argue that we do get to "good enough" in several applications.

Disclosure: author on [1]

[1] https://mattyyeung.github.io/deterministic-quoting

jakubmazanec
2 replies
34m

I'm not saying the content of the article is wrong, but what apps are the people/companies writing articles like this actually building? I'm seriously unable to imagine any useful app. I only use GPT via the API (as a better Google for documentation, and its output is never usable without heavy editing). This week I tried to use "AI" in Notion: I needed to generate 84 checkboxes, one for each day starting from a specific date. I got 10 checkboxes and the line "here should go rest..." (or some variation of such lazy output). Completely useless.

qeternity
0 replies
21m

I think you're going about it backwards. You don't take a tool, and then try to figure out what to do with it. You take a problem, and then figure out which tool you can use to solve it.

exhaze
0 replies
5m

I've built many production applications using a lot of these techniques and others - it's made money either by increasing sales or decreasing operational costs.

Here's a more dramatic example: https://www.grey-wing.com/

This company provides deeply integrated LLM-powered software for operating freight ships.

There are a lot of people who are doing this and achieving very good results.

Sorry, if it's not working for you, it doesn't mean that it doesn't work.

mloncode
1 replies
13h48m

This is Hamel, one of the authors of the article. We published the article with OReilly here:

Part 1: https://www.oreilly.com/radar/what-we-learned-from-a-year-of...

Part 2: https://www.oreilly.com/radar/what-we-learned-from-a-year-of...

We were working on this webpage to collect the entire three-part article in one place (the third part isn't published yet). We didn't expect anyone to notice the site! Either way, part 3 should be out in a week or so.

seventytwo
0 replies
1h5m

Was wondering about the June 8th date on there :)

blumomo
1 replies
9h34m

PUBLISHED

June 8, 2024

Is this an article from the future?

pklee
0 replies
53m

This is pure gold!! Thank you so much Eugene and gang for doing this. For the points I have encountered myself, I can 100% agree with them. This is fantastic!! So many good insights.

mercurialsolo
0 replies
7h1m

As we move LLM-enabled products into production, we definitely see a bunch of what is being discussed here resonate. We also see the areas below as ones that need to be expanded upon for developers building in the space to take products to production:

I would love to see this article also expand to touch upon things like:

- data management: tooling, frameworks, open vs closed data management, labelling & annotations
- inference as a pipeline: frameworks for breaking down model inference into smaller tasks & combining outputs (do DAGs have a role to play here?)
- prompts: areas like caching, management, versioning, evaluations
- model observability: tokens, costs, latency, drift?
- evals for multimodality: how do we tackle evals here, which in turn can go into loops, e.g. quality of audio, speech or visual outputs

asshatdev
0 replies
1h23m

"What we've learned from a year of fucking Large Luxury Models" -> you'll never see here

OutOfHere
0 replies
22h6m

Almost all of this should flow from common sense. I would use what makes sense for your application, and not worry about the rest. It's a toolbox, not a rulebook. The one point that comes more from experience than from common sense is to always pin your model versions. As a final tip, if despite trying everything you still don't like the LLM's output, just run it again!

Here is a summary of all points:

1. Focus on Prompting Techniques:

   1.1. Start with n-shot prompts to provide examples demonstrating tasks.
   1.2. Use Chain-of-Thought (CoT) prompting for complex tasks, making instructions specific.
   1.3. Incorporate relevant resources via Retrieval Augmented Generation (RAG).
2. Structure Inputs and Outputs:

   2.1. Format inputs using serialization methods like XML, JSON, or Markdown.
   2.2. Ensure outputs are structured to integrate seamlessly with downstream systems.
3. Simplify Prompts:

   3.1. Break down complex prompts into smaller, focused ones.
   3.2. Iterate and evaluate each prompt individually for better performance.
4. Optimize Context Tokens:

   4.1. Minimize redundant or irrelevant context in prompts.
   4.2. Structure the context clearly to emphasize relationships between parts.
5. Leverage Information Retrieval/RAG:

   5.1. Use RAG to provide the LLM with knowledge to improve output.
   5.2. Ensure retrieved documents are relevant, dense, and detailed.
   5.3. Utilize hybrid search methods combining keyword and embedding-based retrieval.
6. Workflow Optimization:

   6.1. Decompose tasks into multi-step workflows for better accuracy.
   6.2. Prioritize deterministic execution for reliability and predictability.
   6.3. Use caching to save costs and reduce latency.
7. Evaluation and Monitoring:

   7.1. Create assertion-based unit tests using real input/output samples.
   7.2. Use LLM-as-Judge for pairwise comparisons to evaluate outputs.
   7.3. Regularly review LLM inputs and outputs for new patterns or issues.
8. Address Hallucinations and Guardrails:

   8.1. Combine prompt engineering with factual inconsistency guardrails.
   8.2. Use content moderation APIs and PII detection packages to filter outputs.
9. Operational Practices:

   9.1. Regularly check for development-prod data skew.
   9.2. Ensure data logging and review input/output samples daily.
   9.3. Pin specific model versions to maintain consistency and avoid unexpected changes.
10. Team and Roles:

    10.1. Educate and empower all team members to use AI technology.
    10.2. Include designers early in the process to improve user experience and reframe user needs.
    10.3. Ensure the right progression of roles and hire based on the specific phase of the project.
11. Risk Management:

    11.1. Calibrate risk tolerance based on the use case and audience.
    11.2. Focus on internal applications first to manage risk and gain confidence before expanding to customer-facing use cases.

DylanSp
0 replies
21h50m

Looks like the same content that was posted on oreilly.com a couple days ago, just on a separate site. That has some existing discussion: https://news.ycombinator.com/item?id=40508390.