Ask HN: How do I train a custom LLM/ChatGPT on my own documents in Dec 2023?

ilaksh
52 replies
12h54m

You don't train on documents. There are many startups claiming that but they are deliberately using a misleading term because they know that's what people are searching for.

You still do RAG. Llamaindex is still the best option that I know of. Most of the startups that have working products are likely using llamaindex. All of the ones that say they are training on documents are actually using RAG.

Test it out. If it really and truly doesn't work, search for a script that creates question and answer pairs automatically with GPT-4. Then try using that for QLoRA. I have never heard of anyone successfully using that for a private document knowledgebase though. Only for skills like math, reasoning, Python, etc. I think the issue is that you need a LOT of data, and it needs to repeat concepts or any facts you need it to learn many, many times in different supporting ways.
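
If you do go down that road, the Q&A-generation step looks roughly like this. A minimal sketch, assuming the openai 1.x Python client; the prompt, chunk size, and file names are illustrative, not a recommended recipe:

    import json
    from openai import OpenAI  # assumes OPENAI_API_KEY is set in the environment

    client = OpenAI()

    def chunk(text, size=2000):
        # naive fixed-size chunking; real pipelines split on document structure
        return [text[i:i + size] for i in range(0, len(text), size)]

    def qa_pairs_for(chunk_text, n=3):
        prompt = (
            f"Write {n} question/answer pairs that are fully answered by the text below. "
            'Return a JSON list like [{"question": "...", "answer": "..."}].\n\n' + chunk_text
        )
        resp = client.chat.completions.create(
            model="gpt-4", messages=[{"role": "user", "content": prompt}]
        )
        return json.loads(resp.choices[0].message.content)

    with open("docs.txt") as f, open("train.jsonl", "w") as out:
        for c in chunk(f.read()):
            for pair in qa_pairs_for(c):
                out.write(json.dumps(pair) + "\n")  # feed this JSONL to your QLoRA trainer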

What absolutely does not work is trying to just feed a set of documents into fine tuning. I personally have proven that dozens of times because I had a client who was determined to do it. He had been misled.

What it will do is learn the patterns that are in those documents.

seedless-sensat
20 replies
12h20m

What is RAG? That's hard to search for

gianpaj
6 replies
5h37m

Ask chatgpt next time. "What is rag in context of AI?"

pc86
5 replies
3h40m

Or just use a traditional search engine: "rag" plus literally any ML/AI/LLM term will yield a half dozen results at the top with "Retrieval-augmented generation" in the page title.

rahimnathwani
1 replies
2h27m

Or if GGP can't think of an AI-related term they can use HN search. Searching 'rag' shows the term on the first page of results:

https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...

Sakos
0 replies
1h48m

Searching for "RAG" on Kagi and Google gives some AI-related results fairly high up, including results that explain it and say what it stands for.

wackget
0 replies
30m

Or people could just not use obscure acronyms when discussing specialised topics on an open forum?

valval
0 replies
1h42m

Right? How does someone who browses this forum not know how to find knowledge online?

gosub100
0 replies
18m

What percentage of people could you fool if you told them it was AI and replayed standard search results, but with the "karaoke-like" prompt that highlights each word (as if we're 2nd graders in Special Ed learning how to string more than 2 sentences together)?

FergusArgyll
4 replies
6h16m

Off Topic;

It fascinates me how much variance there is in people's searching skills.

Some people write as if they are talking to a person when searching, e.g. 'what is the best way that i can {action}'. I think the number one trick is to forget grammar and other language niceties and just enter concepts, e.g. 'clean car best'.

szundi
0 replies
5h32m

That’s why they will love chatgpt

ercan
0 replies
17m

I found something very annoying while looking for technical data (a service manual for an ancient medical device, built around 2001).

The search term was the name of the device plus something about the power source.

Searching from the client's network (my phone / a client computer): nothing related to the search for 4-5 pages.

Same search from work: the second result was what I was looking for.

So it seems there is a relation with your search history, but somehow connected with the related search history from the same IP/network.

CommieBobDole
0 replies
56m

Over the last couple of years, at least with Google, I've found that no strategy really seems to work all that well - Google just 'interprets' my request and assumes that I'm searching for a similar thing that has a lot more answers than what I was actually searching for, and shows me the results for that.

BlueGh0st
0 replies
3h17m

I used to do this. Then when Google's search results started declining in quality, I often found it better to search by what the average user would probably write.

tomduncalf
2 replies
8h41m

Retrieval Augmented Generation - in brief, using some kind of search to find documents relevant to the user’s question (often vector DB search, which can search by “meaning”, but also other forms of more traditional search), then injecting those into the prompt to the LLM alongside the question, so it hopefully has facts to refer to (and its “generation” can be “augmented” by documents you’ve “retrieved”, I guess!)
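
The whole loop fits in a few lines. A rough sketch using the OpenAI Python client and brute-force cosine similarity; a vector DB takes over the numpy part at any real scale, and the snippets/model names are illustrative:

    import numpy as np
    from openai import OpenAI

    client = OpenAI()
    docs = ["<snippet 1 from your documents>", "<snippet 2>", "<snippet 3>"]

    def embed(texts):
        resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
        return np.array([d.embedding for d in resp.data])

    doc_vecs = embed(docs)

    def answer(question, k=3):
        q = embed([question])[0]
        sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
        context = "\n\n".join(docs[i] for i in np.argsort(-sims)[:k])   # retrieve
        resp = client.chat.completions.create(                          # augment + generate
            model="gpt-4",
            messages=[{"role": "user",
                       "content": f"Answer using only this context:\n{context}\n\nQuestion: {question}"}],
        )
        return resp.choices[0].message.content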

whartung
1 replies
2h51m

So, as a contrived example, with RAG you make some queries, in some format, like “Who is Sauron?” And then start feeding in what books he’s mentioned in, paragraphs describing him from Tolkien books, things he has done.

Then you start making more specific queries? How old is he, how tall is he, etc.

And the game is you run a “questionnaire AI” that can look at a blob of text, and you ask it “what kind of questions might this paragraph answer”, and then turn around and feed those questions and text back into the system.

Is that a 30,000 foot view really of how this works?

Kubuxu
0 replies
1h32m

The 3rd paragraph missed the mark but previous ones are in the right ballpark.

You take the user's question and either embed it directly or augment it for embedding (you can, for example, use an LLM to extract keywords from the question), query the vector DB containing the data related to the question, and then feed it all to the LLM as: here is a question from the user and here is some data that might be related to it.

rjzzleep
0 replies
11h14m

This one seems like a good summary

Retrieval-Augmented Generation for Large Language Models: A Survey

https://arxiv.org/abs/2312.10997

The images in this post are also good for a high-level look

https://twitter.com/dotey/status/1738400607336120573/photo/2

From the various posts I have seen people claim that phi-2 is a good model to start off from.

If you just want to do embeddings, there are various tutorials to use pgvector for that.

prestonlibby
0 replies
4h40m

"Retrieval augmented generation". I found success from "rag llm tutorial" as a search input to better explain the process.

nmstoker
0 replies
9h31m

Seems fairly easy to search for to me - top results are all relevant:

https://kagi.com/search?q=ml+rag

https://www.google.com/search?q=ml+rag

ksjskskskkk
0 replies
3h38m

RAG: having an LLM spew search queries for you because your search fu is worse than a chatbot's hallucinations.

or because you want to charge your client the "ai fee".

or because your indexing is so bad you hide it from your user and blame the llm assistant dept.

accrual
0 replies
12h18m

Retrieval-augmented generation, RAG + LLM will turn up more results.

dfhg
9 replies
12h47m

Another question, which one is preferred, LlamaIndex or Langchain, for RAG? Thanks in advance for your insights.

isoprophlex
2 replies
11h34m

You basically don't use langchain for anything besides 30 minute demos that you copied from someone else's github. It has a completely spaghettified API, is not performant, and forces you into excessive mental contortions to reason about otherwise simple tasks.

LlamaIndex is pretty good.

sophiabits
0 replies
8h37m

Yeah +1

We originally started out building features with LangChain (loading chains from YAML sounded good—it felt like it would be easy to get non-engineers to help with prompt development) but in practice it’s just way too complicated. Nice idea, but the execution feels lacking.

It also doesn’t help that LangChain is evolving so rapidly. When we first started using it, a lot of code samples on the internet couldn’t be copy/pasted because of import paths changing, and at one point we had to bump by ~60 patch versions to get a bug fix, which was painful because it broke all kinds of stuff.

monkeydust
0 replies
9h18m

Yea discovered this with Langchain last week. Was great for a demo then started to push it harder and spent ages trawling Reddit, discord, GitHub trying to find solutions to issues only to discover what was supposed to be supported was deprecated. Got a massive headache for what should have been a simple change. Moved on now.

d4rkp4ttern
2 replies
6h8m

Echoing others’ sentiments, I was frustrated with the bloat and obscurity of existing tools. This led me to start building Langroid with an agent-oriented paradigm 8 months ago: https://github.com/langroid/langroid - we have companies using it in production for various use-cases. They especially like our RAG and multi-agent orchestration. See my other comment for details.

wahnfrieden
1 replies
1h14m

what's the "groid"? isn't that a slur?

viksit
0 replies
1h8m

language android i imagine..

simonw
0 replies
11h31m

LlamaIndex is mainly focused on RAG. LangChain does a ton of other stuff too. I'd focus on LlamaIndex first.

mathis-l
0 replies
8h42m

Haystack [1] is another good option. It's modular, doesn't get in your way, and is particularly strong at retrieval. People like the documentation too.

Disclaimer: I work at deepset

[1] https://github.com/deepset-ai/haystack

eskibars
0 replies
6h7m

Besides the other comments in this thread, I'd really recommend looking first at the (relatively new) "Managed index" in LlamaIndex: https://docs.llamaindex.ai/en/stable/community/integrations/... These handle combining the retrieval with the generative side. I've seen a lot of users both get frustrated and get bad results by trying to write their own glue to string together various components of retrieval and generation, and these are much easier to get started with.

sroecker
4 replies
10h19m

We just held a workshop about this a few weeks ago: https://red.ht/llmappdev We created a simple chatbot using local models with Ollama (llamacpp), LlamaIndex and streamlit. Have a look at the streamlit folder, it's super easy.

I used this simple example to teach about RAG, the importance of the system prompt and prompt injection. The notebook folder has a few more examples, local models can even do natural language SQL querying now.

agilob
1 replies
2h24m

looks very promising, do you plan to keep this single repo up to date as new things are released?

sroecker
0 replies
2h6m

Good question, as you can see I haven't touched it for a month. I wanted to show what's possible then with open source and (open) local models and there's already so much new stuff out there.

I'll probably fix some things this week and then either update it or start from scratch. Guided generation, structured extraction, function calling and multi-modal are things I wanted to add and chainlit looks interesting.

3abiton
1 replies
6h5m

LlamaIndex has so much potential. Any benchmarks on performance compared to fine-tuning?

sroecker
0 replies
2h3m

You probably don't need fine-tuning, at least if it's just new content (and no new instructions). It may even be detrimental, since LLMs are also good at forgetting: https://twitter.com/abacaj/status/1739015011748499772

dfhg
4 replies
12h49m

Are there public examples of working products using RAG, compared with fine-tuning or training from scratch?

tinco
0 replies
11h41m

The OpenAI assistants API is an implementation of a RAG pipeline. It performs both RAG on any documents you upload, and on any conversation you have with it that exceeds the context.

regularfry
0 replies
3h3m

Amazon Q is (at least partially) a RAG implementation.

aik
0 replies
12m

Not public, but internally I wrote a tool to help us respond to RFPs. You pass in a question from a new RFP and it outputs surprisingly great answers most of the time. It's writing 75%+ of our RFP responses now (naturally we review and adjust as needed). And best of all, it was very quickly hacked together and it's actually useful. I copied questions/answers from all previous RFPs into a doc, and am using the OpenAI embeddings API + a FAISS vector DB + GPT-4 to load the chunks, store the embeddings, and process the resulting chunks.
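
The general shape of that kind of tool looks something like the sketch below (not the actual code; assumes faiss-cpu, numpy, and the OpenAI client, with made-up Q&A data):

    import faiss
    import numpy as np
    from openai import OpenAI

    client = OpenAI()
    qa_pairs = [
        ("Do you support single sign-on?", "Yes, via SAML and OIDC."),
        ("What is your uptime SLA?", "99.9%, measured monthly."),
    ]  # question/answer pairs copied from previous RFPs (illustrative)

    def embed(texts):
        resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
        return np.array([d.embedding for d in resp.data], dtype="float32")

    vecs = embed([q + " " + a for q, a in qa_pairs])
    faiss.normalize_L2(vecs)                    # so inner product == cosine similarity
    index = faiss.IndexFlatIP(vecs.shape[1])
    index.add(vecs)

    def draft_answer(new_question, k=5):
        q = embed([new_question])
        faiss.normalize_L2(q)
        _, ids = index.search(q, k)
        context = "\n\n".join(f"Q: {qa_pairs[i][0]}\nA: {qa_pairs[i][1]}" for i in ids[0])
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user",
                       "content": f"Previous RFP answers:\n{context}\n\nDraft an answer to: {new_question}"}],
        )
        return resp.choices[0].message.content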

benjaminwootton
3 replies
12h36m

To sing the praises of Bedrock again, it does have continuous pre-training as well as RAG “knowledge bases”. The former is based on JSON fragments and the RAG stuff is PDFs and other document formats.

With regards to its efficacy, I haven’t gone to production with it yet but I was reasonably impressed.

I uploaded 100 legal case documents to Bedrock via Claude and could push it pretty hard asking about the various cases and for situations across the knowledge base.

It did feel like it broke down and got confused at a certain point of complexity of questioning, but I still think it’s already useful as a “copilot” or search engine and surely it will only improve over time.

ilaksh
1 replies
10h41m

I forgot about the continuous pre-training thing. How long did it take and how much did it cost on Bedrock?

I had tried to suggest continuous pre-training to my client but it seemed expensive and when I mentioned that he lost interest and just kept wanting me to do fine tuning.

Also to clarify, did you do the continuous pre-training or RAG? And did you compare the efficacy of one or the other or both?

benjaminwootton
0 replies
10h38m

I used the RAG knowledge bases for most of my testing described above.

I got a toy demo up and running with continuous pre-training but haven’t evaluated it unfortunately.

hbamoria
0 replies
4h29m

Oh Great! How did you evaluate the LLM responses? I'm cofounder of an evaluation and monitoring platform - Athina AI (www.athina.ai) You can use our monitoring dashboard and evals to check your LLM performance and iterate quickly.

min76
1 replies
12h44m

Well said. The problem is, there are way too many alternatives. Any idea how LlamaIndex's ingestion engine compares to unstructured.io (which is used in LangChain)?

ilaksh
0 replies
7h15m

I think they may be using the same thing.

treprinum
0 replies
1h20m

LlamaIndex can't do chunk-level metadata, only document-level metadata, so you can't attach precise references to where the material the LLM synthesized its answers from originated, e.g. HTML anchors. Just write your own RAG with Pinecone and the OpenAI APIs directly.
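
A sketch of what that looks like rolled by hand, with a per-chunk anchor stored as metadata so answers can link back to the exact section. Assumes the classic pinecone-client and OpenAI Python APIs of the time; the index name, IDs, and fields are illustrative:

    import pinecone
    from openai import OpenAI

    client = OpenAI()
    pinecone.init(api_key="...", environment="...")   # placeholders
    index = pinecone.Index("docs")

    def embed(text):
        resp = client.embeddings.create(model="text-embedding-ada-002", input=[text])
        return resp.data[0].embedding

    # ingest: one vector per chunk, with the source URL + HTML anchor as chunk-level metadata
    chunks = [
        {"id": "guide-1#setup", "text": "To install ...", "url": "https://example.com/guide#setup"},
    ]
    index.upsert(vectors=[(c["id"], embed(c["text"]), {"text": c["text"], "url": c["url"]})
                          for c in chunks])

    # query: retrieve chunks plus their anchors, so the final answer can cite them
    def retrieve(question, k=4):
        res = index.query(vector=embed(question), top_k=k, include_metadata=True)
        return [(m.metadata["text"], m.metadata["url"]) for m in res.matches]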

stevenhuang
0 replies
1h25m

> What absolutely does not work is trying to just feed a set of documents into fine tuning.

Not quite. It does work, albeit likely not optimally.

See https://github.com/bublint/ue5-llama-lora

spacecadet
0 replies
6h44m

Ouch, your client! I had one like this earlier this year. We were doing some audio processing for word matching; he had also been misled before coming to us, and fully believed that this was going to be some form of super AI trained on his 5 audio recordings of him repeating the words over and over...

We did all we could to steer him toward a correct path of understanding. Sadly, we launched a working product but he doesn't understand it and continues to misrepresent and mis-sell it.

After continuing to give him time and follow up with him (I tend to personally do this with clients like this), I can tell he is starting to realize his lack of understanding...

jncfhnb
0 replies
24m

RAG is a funny thing. It’s like going back to Watson for specifics but letting the LLM handle the generic stuff.

jasonjmcghee
0 replies
2h8m

You don't just feed documents in, you need to build a dataset representative of how you want to interact with it. So likely using gpt-4 or something to create: a chunk of a document, a question that can be answered by that chunk and a good answer. (Or something)

benjaminwootton
17 replies
13h0m

AWS Bedrock is fairly easy. You can do it in 5 or 6 clicks.

You have to upload your documents to S3, create a “Knowledge Base”, then sync your documents into a vector database like OpenSearch or Pinecone. You are then good to go via their playground or the AWS API.

I made a video here describing the process, check around 14 minutes in:

https://ensembleanalytics.io/blog/introducing-bedrock-knowle...

Bedrock is a decent product I think. All of the models in one place (apart from the big dogs from OpenAI) and a common API across them.
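
Once a knowledge base is created and synced, querying it from code is roughly a single call. A sketch assuming boto3's bedrock-agent-runtime client; the IDs and model ARN are placeholders, and the exact parameter shape may have shifted since the preview:

    import boto3

    # assumes a knowledge base has already been created and synced in the console
    bedrock = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

    response = bedrock.retrieve_and_generate(
        input={"text": "What were the key findings in case 42?"},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": "KB_ID_HERE",   # placeholder
                "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-v2",
            },
        },
    )
    print(response["output"]["text"])           # generated answer
    for citation in response.get("citations", []):   # retrieved passages it was grounded on
        print(citation)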

hrdwdmrbl
10 replies
12h1m

Is there a limit? Could I create a knowledge base with 10,000 documents? 100k? 1M?

8organicbits
7 replies
11h54m
hrdwdmrbl
6 replies
11h52m

I’m sorry, I don’t understand those limits. It uses a lot of unfamiliar terms like “batch inference” and “modality”. I just want a nice UI that I can give my hard-drive to and then ask it questions.

nighthawk454
3 replies
7h55m

That’s probably unrealistic at this time

xuancanh
2 replies
6h7m

It's doable with Amazon Q, but it's in Preview phase now. https://aws.amazon.com/q/

hrdwdmrbl
1 replies
2h49m

Is Q the answer to this whole thread? People are talking about AWS Bedrock but that seems like something a startup would build upon and offer something like Q, eh?

AndrewKemendo
0 replies
23m

I’m going to try it this week so maybe we have a follow up thread??

lopkeny12ko
1 replies
1h53m

This attitude puzzles me. At the heart of software engineering (and learning new technologies in general) is the eagerness to dive in, hack on code, and experiment with open source examples. If you're not even willing to read the documentation to learn basic terminology, why are you even here?

ZephyrBlu
0 replies
1h25m

This attitude puzzles me. "If you wish to make an apple pie from scratch, you must first invent the universe" energy.

We are not always makers. Oftentimes we're consumers as well.

I don't want to read documentation and experiment with my phone, I just want it to work out of the box and do what I expect.

This is standard consumer behaviour and you're lying to yourself if you don't think you act like this with some things.

nashadelic
0 replies
1h35m

Even if you could, the problem is that these documents are first chunked into smaller pages, and then embeddings are created. When you ask a question, the algo searches for relevant chunks and passes them to the LLM's overall prompt. If there are too many chunks, or too many chunks with similar content, the search, coupled with the LLM's limited context window, means only 1-3 chunks get passed.

This isn't the same as the training data the LLM is trained on. As a result, it doesn't take advantage of the entire document set. So, if you have a billion documents, only 1-3 chunks will be picked for the final answer. When you know a question spans many many documents, the answer is never going to cover that.

You could take a recursive approach where you parse all the chunks, generate summaries of those, and then pass them to the next chunk sequentially, and so on. But you can imagine how expensive and slow that will be. It might still work for you, but this is a very lossy approach.
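
That sequential pass looks roughly like this (a sketch with the OpenAI client; note it makes one model call per chunk, which is where the cost and latency come from):

    from openai import OpenAI

    client = OpenAI()

    def refine_summary(chunks):
        summary = ""
        for chunk in chunks:                      # one model call per chunk
            prompt = (f"Current summary:\n{summary}\n\n"
                      f"New material:\n{chunk}\n\n"
                      "Update the summary to incorporate the new material.")
            resp = client.chat.completions.create(
                model="gpt-3.5-turbo",
                messages=[{"role": "user", "content": prompt}],
            )
            summary = resp.choices[0].message.content   # lossy: detail gets squeezed out each pass
        return summary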

benjaminwootton
0 replies
11h49m

The documents are encoded as vectors and stored in a database, so I suspect it would be effectively unlimited. You would just pay for storage and compute.

AWS OpenSearch has fairly good integration so you could look up costs for that. It’s not the cheapest AWS service to run and not exactly serverless as you pay by the hour.

haolez
3 replies
1h40m

Bedrock is cool, but I found it prohibitively expensive for hobbyists and small companies. At first glance, it would cost me something like $5,000 per month for a simple trained model.

benjaminwootton
2 replies
1h30m

I suspect most of the cost is in the OpenSearch vector database?

I agree it is high and it is not exactly serverless. You pay by the hour.

Bedrock itself is charged based on tokens exchanged.

I think Pinecone is a cheaper database for hobby and small business projects, though I haven't looked into it.

rmbyrro
1 replies
16m

No, if you use a custom model (trained with your data), you'll pay around $20 per hour, minimum. That equates to ~$15,000/month.

You can reduce a lot by committing to a 6-month contract, but it won't get cheaper than about ~$5,000/mo.

That's prohibitively expensive for small projects.

Fine tuning GPT 3.5 is much cheaper.

haolez
0 replies
2m

To be fair, Bedrock is more flexible than OpenAI's fine tuning. It's nice to have at least the option to pay big bucks for this. But it's big bucks nonetheless.

tesdinger
1 replies
4h19m

How does bedrock satisfy the non-hallucinating requirement?

hhh
0 replies
4h4m

It doesn't. You need to try and reduce hallucination as much as possible with your prompts, and then benchmark it.

yu3zhou4
8 replies
9h43m

And what’s the correct answer in December 2023 if one wants to narrow down only to tools and services provided on Azure?

yu3zhou4
6 replies
9h37m

Is Llamaindex + hosted model on Azure OpenAI Services still the best option?

isoprophlex
5 replies
9h31m

There's Azure AI Studio, which is kinda like AWS Bedrock. It's not bad, but for max control and versatility I'd start out rolling my own with, for example, LlamaIndex + Azure-branded OpenAI like you say.

yu3zhou4
4 replies
9h18m

Thanks. Is it possible to have persistent RAG in Azure AI Studio though? I found only a preview version of uploading files that are made available to the model, but when using this model through the API, the uploaded data is not available to it.

Likely I misunderstood how RAG works with Azure AI Studio, so sorry in advance.

isoprophlex
3 replies
9h3m

https://github.com/azure/aistudio-copilot-sample

Check it out, specifically steps 3 and 4. As with almost every Microsoft CLI tool and SDK, it's clunky... and you can tell everyone is rushing this AI shit out as fast as they can to stay in the game. But what you want should be doable.

yu3zhou4
2 replies
8h23m

Much thanks to you

isoprophlex
1 replies
7h43m

Happy holidays, and good luck building!

yu3zhou4
0 replies
2h52m

Thanks, and same for you

SgtBastard
0 replies
7h33m

https://github.com/microsoft/semantic-kernel

Semantic Kernel is MS's response to LangChain and LlamaIndex - available for .NET, python and Java.

Using their Memory support (using Azure Cognitive Search), it gives a powerful RAG quickstart, which you can combine with Azure Document Intelligence to chunk your source documentation into memories that your foundational models can later use.

(Disclaimer: It's only very recently gone 1.0 and is still likely to undergo API change as the LLM domain itself is still rapidly evolving. I've substantially forked the project for my own needs, but I hope that as it stabilises, I can contribute PRs for some of my more advanced use cases.)

ankit219
8 replies
7h26m

I think the answer depends on how many documents you have. To think in terms of tokens (assuming a page is 750-1000 tokens), if you have a good estimate of the number of pages you want to query, you can decide on the approach. Three popular approaches:

1. RAG: Most popular and works really well on smaller datasets. It is limited by the number of vectors/embeddings. A typical embedding chunk could be 1000 tokens in size. LlamaIndex did a lot of engineering on this and their techniques work pretty well. The problem with large datasets is almost always that users don't like writing long prompts/queries, so the answers are more generic.

2. Finetuning + RAG: You can finetune a model on the expected outputs. If your datasets contain knowledge that might already be on the open internet (blog posts, articles, anything non-proprietary), then finetuning would work really well in combination with RAG, especially for large datasets. It may not work if you are working with proprietary knowledge that is hard to find on the open internet.

3. Continual pretraining: for very large datasets, and when the knowledge is proprietary. I talked to a firm with 70GB worth of data. No way a RAG pipeline would give them results. They are struggling to get LLMs to work for them. It needs a model that is trained on their data and then instruction tuning on top of that. Most likely you won't need to do this.

throwup238
7 replies
3h16m

> I talked to a firm with 70GB worth of data. No way a RAG pipeline would give them results. They are struggling to get LLMs to work for them.

Wow so RAG is basically a toy for demos and low effort MVPs. 70GB is tiny, it’d barely qualify as “big data” 20 years ago.

Is anyone trying more advanced stuff like knowledge graph augmented generation to try to expand on that?

hrdwdmrbl
2 replies
2h50m

I have a _small_ e-commerce company and we have >300GB. Most of that bulk is photos and videos though, but in an ideal world I’d like my AI assistant to find that stuff too: “I’m making a Boxing Day ad campaign. Can you show me the ads that we’ve made in previous years and all of the photos that we’ve taken of our new Reindeer and Elf designs?”

rolisz
0 replies
59m

Photos and videos are very different from text. 300 GB of text is not comparable to 300 GB of photos.

You can do something using image embeddings to get what you want.

ankit219
0 replies
1h56m

That can be done if we use ImageBind from Meta (it embeds text, image, video, and audio in the same vector space). I would want to explore this if possible, just for a POC, if you are okay with it. Would you be interested?

ankit219
2 replies
2h1m

Some caution here. Not everything needs to go into a RAG pipeline (e.g. a database table would not necessarily need to be embedded, but its schema should be). There would be a lot of repetition, lots of junk and useless data, and numerical data, and parsing through that would be a pain. Then comes how the users would behave. You need a longer string to get accurate results. Most non-tech users would rather write shorter strings and expect technology to read their mind (it's a human issue and not a tech issue).

A simpler way here is to just train the model unsupervised so all the knowledge is in the model, and instruction-tune it on the use cases you want. Simpler from a human-effort perspective. Somewhat costly, though the cost of storing that many vectors would be more than training the model itself. Everything else requires a lot of custom effort. Knowledge graph augmentation is probably the next step in the hype cycle, but it does not solve the fundamental human problem of writing fewer letters. (Training solves this, as changing 1-2 keywords does the trick if the generic string does not get the answer. See how ChatGPT changes answers if you tweak your prompt a bit.) In a way RAG is an engineering solution to what is basically a data problem. It works for many cases, but when it does not, people will have to solve it via data science.

> Wow so RAG is basically a toy for demos and low effort MVPs

I would not say it's for demos or low-effort MVPs. Many companies won't have that amount of data. You can also segregate it by team, e.g. customer support has one, sales has one, product has one. Then, a golden use case is parsing user docs. We created one for GST queries in India that works quite well [1]. It's a search engine, but it points to the right docs at the source when you ask about any clause. Useful for CAs only, and addresses a very narrow use case. (It's a market need, as the notifications are published in PDF format and not indexed by Google.)

[1]:https://clioapp.ai/gst-search

throwup238
1 replies
1h24m

"Toy" is the wrong word to describe it but it seems like another order of magnitude or two increase in context size will solve all their problems.

On the other hand I've got a terabyte of text extracted from LibGen - let's say I can ignore the half that is fiction and I can dedupe the rest further by 80% - that's still 100gb. On top of that I've got 300gb of text extracted from court documents and that's just from California! I haven't even downloaded the federal dump yet, let alone the other 49 states. Even if I limited myself to just the US Code and Federal Code of Regulations, that's hundreds of millions of tokens of very dense text. Embedding based RAG has been pretty much useless in each of these cases but maybe I just suck at implementing the retrieval part.

What's the data size on the GST search engine? Do you have any examples of more complex queries?

The only thing that has even been remotely useful to tackling the kind of questions I want to ask of my data sources is having the LLM generate search queries, navigate the knowledge graph, and rank the relevance of retrieved snippets but that often takes dozens if not hundreds of LLM calls which is incredibly slow and expensive.

rolisz
0 replies
1h1m

What are you trying to achieve with that dataset from LibGen? I kinda expect that GPT4 was trained on the data that is available on LibGen

cjonas
0 replies
1h39m

Most use cases that actually require this much data are probably best solved by more traditional ML architectures (ie classification).

LLMs work best on use cases where the working context is the length of a short research paper (or less). Building with LLMs is mostly an exercise in application engineering: how to get them the most relevant context at the right time and how to narrow their scope to produce reliable outputs.

Fine tuning can help specialize the LLM model to perform better, but AFAIK, the training sets are relatively small (in big data terms)

kmkarim
7 replies
12h13m

Slightly off topic but is there recommended advice on how to tune / train not for document retrieval but for consistent JSON output with specific enums?

i.e given a text, always return back a certain set of fields. For some keys here is the possible set of enums etc. One shot prompting does work but curious how others approach this if you have training data on hand.

ukuina
1 replies
10h10m

Microsoft Guidance will do this.

Der_Einzige
0 replies
4h14m

Seems Microsoft spun them out and gave them independence. Not sure why, given it's the kind of IP that helps keep microsoft dominant.

michaelt
1 replies
7h58m

Ask the model nicely, check any json in the output against your schema, regenerate if it doesn’t match.

Crude, I know, but it’s compatible with every model. Which is useful if you want to compare the many different models out there.
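
A sketch of that validate-and-retry loop, assuming the jsonschema package and the OpenAI client; the schema and field names are illustrative:

    import json
    from jsonschema import validate, ValidationError
    from openai import OpenAI

    client = OpenAI()
    SCHEMA = {
        "type": "object",
        "properties": {
            "sentiment": {"enum": ["positive", "neutral", "negative"]},
            "topics": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["sentiment", "topics"],
    }

    def extract(text, retries=3):
        prompt = f"Return JSON matching this schema:\n{json.dumps(SCHEMA)}\n\nText:\n{text}"
        for _ in range(retries):
            resp = client.chat.completions.create(
                model="gpt-4", messages=[{"role": "user", "content": prompt}]
            )
            try:
                obj = json.loads(resp.choices[0].message.content)
                validate(instance=obj, schema=SCHEMA)   # enums and required keys enforced here
                return obj
            except (json.JSONDecodeError, ValidationError):
                continue                                # regenerate if it doesn't match
        raise ValueError("model never produced valid JSON")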

mark_l_watson
0 replies
4h17m

That is a fairly good strategy. Six years ago at Capital One, I experimented with generating synthetic JSON AWS CloudWatch files for testing (that would not contain any sensitive information). Way back then I used LSTM models and I simply had code to check output and only keep valid samples.

LLMs are so much better for this than LSTMs now.

techn00
0 replies
11h42m

You want grammars to restrict the output; search for "gbnf grammar". Combine that with a good prompt that includes an example. Also check out outlines.dev.

simonw
0 replies
11h30m

For OpenAI, use their functions schema mechanism.

Aside from that, take a look at llama.cpp grammars.

Hugsun
0 replies
11h33m

There are many interesting tools that achieve this, like Outlines[0] and jsonformer[1]. I haven't tried them myself but they look very promising.

[0]: https://github.com/outlines-dev/outlines [1]: https://github.com/1rgs/jsonformer

galacticaactual
7 replies
12h52m

Train on your own documents or analyze your own documents for answers? Very different things.

For the first (fine tuning) follow “AI Jason” on YouTube. He has some great tutorials.

For the second (RAG or similar), fire up a cloud VM with GPUs or use Ollama locally and read through the LlamaIndex docs on how to build a RAG pipeline.

dfhg
6 replies
12h46m

Would you kindly elaborate a little bit the difference between training on own documents vs analyzing documents for answers?

simonw
5 replies
11h27m

The word "training" implies creating a new model by fine-tuning an existing model on top of new documents.

As several other comments in this thread have already indicated: this is almost always the wrong direction. Which is confusing because it's the direction everyone always assumes they should go in at first.

The approach that does work is surprisingly simple: take the user's question, search for snippets of your documents that appear to be about that question, then paste all of those snippets into the prompt along with the user's question and see what answer you get.

This is known as RAG: Retrieval Augmented Generation. It's a very powerful approach.

codetrotter
3 replies
9h47m

> take the user's question, search for snippets of your documents that appear to be about that question, then paste all of those snippets into the prompt along with the user's question and see what answer you get.

We use RAG at my job, but we don’t do any preprocessing on the message from the user, so the results are not always great for us.

Do any of you have experience using a small local model just for extracting keywords from messages which you then use for the retrieval? And then feed the search result and your prompt into OpenAI or whatever as normal.

simonw
1 replies
2h18m

I've been trying out an interesting embedding model that knows how to treat text as either a question or a phrase about the world, and embeds the question such that it's likely to end up close to phrases that might answer that question: https://til.simonwillison.net/llms/embed-paragraphs

Embedding and chunking large amounts of documents is expensive though, in both compute and storage.

The other trick I've been planning to explore is using an LLM to turn the user's question into a small number of normal FTS search queries and then run those to try and get context data.
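
That second trick, sketched out: have the LLM emit a handful of keyword queries, then run them against a normal full-text index. This sketch assumes SQLite's FTS5 is available in your Python build; the prompt and table are illustrative:

    import json
    import sqlite3
    from openai import OpenAI

    client = OpenAI()
    db = sqlite3.connect("docs.db")
    db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts5(body)")

    def retrieve_context(question, per_query=5):
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user",
                       "content": "Return a JSON list of 3 short full-text search queries "
                                  f"that would find documents answering: {question}"}],
        )
        snippets = []
        for q in json.loads(resp.choices[0].message.content):
            rows = db.execute("SELECT body FROM docs WHERE docs MATCH ? LIMIT ?",
                              (q, per_query)).fetchall()
            snippets += [r[0] for r in rows]
        return snippets   # paste these into the final prompt as context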

akrymski
0 replies
1h17m

The other trick I've been planning to explore is using an LLM to turn the user's question into a small number of normal FTS search queries and then run those to try and get context data.

I have also been working on this. I still fail to see why this approach isn't the default frankly. There's little benefit to vector databases.

ilaksh
0 replies
7h4m

https://docs.llamaindex.ai/en/stable/examples/retrievers/bm2...

Also maybe try to include tags or categories when you index and then you can filter on those when doing the vector search. Might get a similar effect from BM25.

Also llamaindex does RAG better than some other solutions.

laurels-marts
0 replies
2h5m

how do RAG implementations work with generic prompts vs specific prompts? meaning, there are prompts that could easily be answered by the base model itself and doesn't require RAG. but some prompts might involve questions about something proprietary where RAG is actually useful.

so is the default to just run the RAG search index on every prompt and if it returns nothing then you get the plain answer from the base model otherwise you get the augmented answer?

xnx
6 replies
12h48m

GPT-4 Turbo has a 128K (~300 pages) context window, which probably handles a lot of use cases which might have previously needed extra training/refinement.

mettamage
2 replies
8h16m

The ChatGPT app says it has a context window of 4096 tokens (GPT-4). How do I get access to Turbo?

coder543
0 replies
2h30m

If you're asking ChatGPT about its own characteristics, you should not believe the responses. Models can't examine themselves, so unless the model was trained on specific information about itself, or unless the info is put into the System Prompt, then it cannot know the answer. A response indicating 4096 tokens would just be a hallucination, or a misunderstanding based on training data that saw references to what older versions of ChatGPT were trained on.

ChatGPT-4 is powered by GPT-4 Turbo at this point, so it has a context window that is much larger than 4096 tokens, whether it knows it or not. The ChatGPT application may limit the context size from reaching the full 128k to keep costs down, and it will be using some of the context window for its own purposes, but it's certainly able to maintain context across more than 4096 tokens of conversation.

SgtBastard
0 replies
7h42m

The gpt-4-1106-preview (aka gpt-4-turbo) is a foundational model with a 128K context window that OpenAI makes available for API consumption both directly and via Azure OpenAI Services.

ChatGPT is a consumer facing service that wraps the GPT-4 foundational model but at some point will likely wrap gpt-4-turbo.

Signing up for OpenAI API access or Azure OpenAI Services will grant you access to this model (with some rate-limits in place given its a preview model).

Der_Einzige
2 replies
4h14m

Long context length models are still mostly a mirage, with the "lost in the middle" phenomenon rearing its ugly little head on actual production use cases.

KaoruAoiShiho
1 replies
3h34m

Not true.

ametrau
0 replies
1h26m

Huh? Obviously true. You can not have tried it.

_giorgio_
3 replies
2h40m

If you want a simpler task, like training a Mistral, Llama, etc. on your documents to act as a document completer, how would you proceed instead? Probably much easier. Thanks

baobun
2 replies
1h22m

Markov chains, or, if you need to get fancy, Hidden Markov models?

_giorgio_
1 replies
31m

Is this a joke? I'd like to fine-tune an existing model, let's say Mistral, on a new dataset using existing tools.

I've seen that there are a lot of approaches, but none has gained traction; there isn't a clear consensus...

baobun
0 replies
21m

Not a joke. Without more specifics, doesn't sound like LLMs are what you need/want.

noiv
2 replies
2h0m

How do you do RAG with embeddings NOT in English? I mean, there are a few thousand more languages.

htrp
0 replies
1h28m

Embeddings are not just english based ....

dbish
0 replies
1h50m

Just use an embedding model of your choice that works with your language. I believe Ada from OpenAI is multilingual, but I don't know which languages it works well on; there are many embedding models out there, and Hugging Face is your friend in this search. The output is just a vector, and the rest of the system can basically stay the same. The only other thing that may need to change depending on language is any text preprocessing you need to do, like word or sentence breaking for languages with compound words (German), agglutination (Turkish), etc.

lolinder
2 replies
12h59m

I haven't personally tried this for anything serious yet, but to get the thread started:

Cheshire Cat [0] looks promising. It's a framework for building AI assistants by providing it with documents that it stores as "memories" that can be retrieved later. I'm not sure how well it works yet, but it has an active community on Discord and seems to be developing rapidly.

The main perk over the cloud options is that you can point it at any language model, including fully local—my local install pointed at my local Ollama running Mistral.

[0] https://github.com/cheshire-cat-ai/core

ilaksh
1 replies
12h50m

But that's not training. That's RAG. They seem to be using qdrant which I believe is a vector store.

lolinder
0 replies
11h59m

They've updated the question to clarify that RAG counts, and as many have noted, properly "training" on a set of documents isn't really a thing.

bmgoau
2 replies
5h29m

Run https://github.com/imartinez/privateGPT

Then

make ingest /path/to/folder/with/files

Then chat to the LLM.

Done.

Docs: https://docs.privategpt.dev/overview/welcome/quickstart

patja
0 replies
2h1m

I've tried LocalGPT, PrivateGPT, and H2OGPT. Have you been satisfied with the responses you get from PrivateGPT? When I tried it, it seemed very shallow/cursory in its responses. I saw much more detailed and complete responses when trying H2OGPT.

hrdwdmrbl
0 replies
2h47m

Others have said that RAG is only good up to a few 100s of files, eh? And since this is based on LlamaIndex, which is RAG-based, this would have the same limitation, eh?

sophiebits
1 replies
11h36m

If you’re looking for something that is hosted for you, at Notion we launched a feature for this a few weeks ago and it works quite well in my experience. RAG is one of the techniques used. https://www.notion.so/blog/introducing-q-and-a

wanderingbit
0 replies
6h2m

Thank you! I have been reading these QLoRa posts in the hopes of training it on my notes stored in Notion, but then you do it for me! Nice product ;).

quickthrower2
1 replies
11h46m

Easiest is OpenAI assistants api. Use the playground and it’s a no code experience.

altdataseller
0 replies
3h29m

How do you upload the documents? Via the API or do you have to upload them beforehand through the UI?

novaRom
1 replies
7h41m

We have to add LLMs and MMMs (multi-modal models) into all standard Linux distributions. A service will index all local files, creating embeddings; these will be used to augment user prompts, and voila, we can search for anything with natural language.

btbuildem
0 replies
2h17m

Once models / inference engines are performant enough to run on consumer hardware without hogging all the resources of a machine -- sure! Embedding this into the core of a linux system sounds very interesting. A shell interaction could be had in nerd gobbledygook or plain human language.

newstio
1 replies
2h18m

How do I run a local LLM for RAG apps? The documents to retrieve are Turkish, and I would like to analyze these documents with an LLM, but I don't have a Turkish local LLM. How do I solve this problem, short of fine-tuning and training?

ametrau
0 replies
1h55m

If the LLM you use supports Turkish (I am pretty sure ChatGPT does) then the language doesn't matter. Augment the Generation by Retrieving Turkish documents/snippets.

ndr_
1 replies
8h4m

With OpenAI, you can first build Question & Answer pairs derived from your documents and use the OpenAI fine-tuning feature to build yourself a custom model. This method is more than just learning behavior in that facts do get recalled. I have written about it here, with a play demo use-case: https://ndurner.github.io/training-own-model-finetuning Note that I have yet to use this in a real world use-case, and I would love to hear feedback.

Other than OpenAI, there is the newly introduced „continued pre-training“ of Amazon Bedrock - but I haven’t tried.

RAG: I think that‘s a fundamentally flawed concept, but RAGfluencers will disagree. ;-)

Akashic101
0 replies
7h51m

Could you expand on why RAG is flawed in your eyes? From the other answers it seems like its the way to go

monkeydust
1 replies
9h23m

Did this in the summer via RAG. One thing we realised is that pure vector embeddings retrieval doesn't work so well for docs with acronyms (which, let's face it, all businesses have). Created a hybrid solution using embeddings and BM25, which is a traditional ranking function. This hybrid gave the best results.
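
A sketch of that kind of hybrid retrieval, merging the two rankings with reciprocal rank fusion. Assumes the rank_bm25 package plus the OpenAI embeddings API; the documents and constants are illustrative:

    import numpy as np
    from openai import OpenAI
    from rank_bm25 import BM25Okapi

    client = OpenAI()

    def embed(text):
        resp = client.embeddings.create(model="text-embedding-ada-002", input=[text])
        return np.array(resp.data[0].embedding)

    docs = ["Our SLA covers P1 incidents ...", "The ARR report is generated monthly ..."]
    doc_vecs = np.array([embed(d) for d in docs])
    bm25 = BM25Okapi([d.lower().split() for d in docs])   # lexical side copes well with acronyms

    def hybrid_search(query, k=5):
        vec_rank = np.argsort(-(doc_vecs @ embed(query)))               # semantic ranking
        lex_rank = np.argsort(-bm25.get_scores(query.lower().split()))  # BM25 ranking
        scores = {}
        for ranking in (vec_rank, lex_rank):                            # reciprocal rank fusion
            for pos, doc_id in enumerate(ranking):
                scores[doc_id] = scores.get(doc_id, 0) + 1.0 / (60 + pos)
        return [docs[i] for i in sorted(scores, key=scores.get, reverse=True)[:k]]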

gardnr
0 replies
9h10m

I was going to ask how you integrated BM25 but then I found this: https://docs.llamaindex.ai/en/stable/examples/retrievers/bm2...

ignoramous
1 replies
8h9m

Here's a (video) guide on fine-tuning Mistral 7B with QLoRA: https://www.harpercarroll.com/articles/ai/llm-finetune-own-d... / https://ghostarchive.org/varchive/kmkcNVvEz-k

Fine tuning does result in degradation of the overall model (https://twitter.com/xaiguydotagi/status/1737082280835703142) and so various RAG techniques may be desirable. As others have mentioned, LlamaIndex is a neat solution to build RAG pipelines: https://docs.llamaindex.ai/en/stable/optimizing/production_r...

jasonjmcghee
0 replies
2h11m

I strongly agree this is the direction the author is looking for. RAG is one approach, but if the query doesn't match the right documents, you're screwed. And often they use a different, much simpler, embedding model.

I think the harpercarroll link is a pretty good one, but it basically just feeds in the documents for completion, which isn't a good approach. The dataset needs to represent how you want to use it.

This one might also be helpful https://www.deeplearning.ai/short-courses/finetuning-large-l...

Honestly surprised how almost everyone is saying to use RAG (on its own). One strong benefit of RAG is that the data can change, but it has lots of failure modes.

People often use hybrid search (fuzzy or bm25 etc alongside embedding search) which I suppose is still RAG.

But fine-tuning models to be better at RAG is valuable as well, increasing accuracy.

https://ragntune.com/blog/Fine-tuning-an-LLM-to-be-good-at-R...

Ideally, I'd try both: fine-tune on the documents (create a question/answer dataset with GPT-4) and RAG-instruction fine-tune it.

d4rkp4ttern
1 replies
6h14m

Many services/platforms are careless/disingenuous when they claim they “train” on your documents, where they actually mean they do RAG.

An under-appreciated benefit of RAG is the ability to have the LLM cite sources for its answers (which are in principle automatically/manually verifiable). You lose this citation ability when you finetune on your documents.

In Langroid (the Multi-Agent framework from ex-CMU/UW-Madison researchers) https://github.com/langroid/langroid we’ve implemented a number of RAG techniques: our DocChatAgent uses a combination of lexical and semantic retrieval, reranking and relevance extraction to improve precision and recall: https://github.com/langroid/langroid/blob/main/langroid/agen... All the code is laid out clearly so can be tweaked. We have companies using Langroid in production (e.g. for customer-support); they especially like the RAG and multi-agent features.

One of the interesting techniques in Langroid is a numbering trick for the relevance extraction stage (having the LLM extract verbatim relevant parts of passages) — instead of having the LLM “parrot” out the relevant portions, thus wasting time and tokens, we have it just spit out the relevant sentence numbers from a pre-annotated passage. We use a tool/function-call for this, leveraging Langroid’s task loop that seamlessly works for tool handling as well as sub-task handoff.
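
To illustrate the idea (this is not Langroid's actual code, just a generic sketch with the OpenAI client): number the sentences up front and ask only for the numbers back, which is far cheaper than having the model re-emit the text.

    import re
    from openai import OpenAI

    client = OpenAI()

    def relevant_sentences(passage, question):
        sentences = re.split(r"(?<=[.!?])\s+", passage)
        numbered = "\n".join(f"[{i}] {s}" for i, s in enumerate(sentences, 1))
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user",
                       "content": f"{numbered}\n\nQuestion: {question}\n"
                                  "Reply with only the numbers of the sentences relevant "
                                  "to the question, comma-separated."}],
        )
        ids = [int(n) for n in re.findall(r"\d+", resp.choices[0].message.content)]
        return [sentences[i - 1] for i in ids if 1 <= i <= len(sentences)]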

Many interesting RAG applications often require more than simple question-answering (e.g extract structured info, match doc against requirements etc) and these scenarios benefit immensely from having multiple agents (so you get separation of concerns, modularity and easier state management). Langroid simplifies this type of multi-agent setup with its unique conversational task loops, e.g https://langroid.github.io/langroid/examples/agent-tree/

Colab quick start that builds up to a 2-agent system for extracting structured info from a document:

https://colab.research.google.com/github/langroid/langroid/b...

Among many other things, Langroid also has full support for the OpenAI Assistants API, so you could use the “built-in” RAG from this API, which is a convenience but is a black box, I.e you don’t know what retrieval algo it is using, how it is filling context, and the tokens consumed.

mark_l_watson
0 replies
4h6m

Thanks. I just read through your Colab examples notebook.

viraptor
0 replies
10h11m

So far the recommendations are mostly hosted, so here's one local: https://github.com/weaviate/Verba

I'm very happy with its results, even though the system is still young and a little bit janky. You can use it with either GPT API, or your local models through LiteLlm. (I'm running ollama + dolphin-mixtral)

ukuina
0 replies
10h6m

PrivateGPT is one of the better-known examples, but most people are not aware that GPT4 Assistants handle RAG natively now: https://platform.openai.com/docs/assistants/overview

throwaway421967
0 replies
9h5m

A bit unrelated, but one could open any binary file as text. With enough training data, could an llm just learn the format?

tesdinger
0 replies
6h30m

You can not get a non-hallucinating AI in 2023.

ryanSrich
0 replies
6h12m

What would be nice is some type of box or device I connect to a computer. I then give it full access to the system. It trains itself on all of the data on that computer. The device is now a portable LLM.

rplp
0 replies
6m

My approach was not to train the model on the documents, as others mentioned.

I built a vector database from the documents, and I query the questions against it, which is very fast. This is the RAG (retrieval-augmented generation) step others mentioned.

The results, which are literal extracts from the documents, but short ones, are given to the model, which produces an answer. This is the slow part.

I used many of Langchain's tools to manage the whole process.

You can try it on Supawiki, with one of the featured wikis. Then, if you are ok with a solution that hosts your documents for you, you can upload them and use our solution.

mrbonner
0 replies
1h47m

I'm curious about this as well but my data is mostly (95%) numerical metrics. Is there a "RAG" mechanism for numerical data instead of text? My use case is data analysis, insight discovery for example.

min76
0 replies
3h7m

I have a related question. I have a fair idea of the LLM ecosystem (thanks to this very nice blog called Emerging Architectures for LLM Applications). The problem is, there are way too many options in each component (e.g., too many vector store implementations, ingestion engines, etc.). What is the easiest way to get started? Primarily around RAG on my own PDF files. Also, what is the best/easiest option for hosting? That blog lists vercel, streamlit, streamship and modal. I know Vercel at a high level and found it very good. I am not well versed with JavaScript/TypeScript though. I believe the best option for UI generation is to use one of their templates.

liampulles
0 replies
10h26m

What is your usecase? If you want to search for relevant info in your documents and get relevant info, and you want to avoid hallucination, you might avoid the text generation altogether.

Instead you can extract text embeddings from your documents, put them in a vector DB, and then you have a super search. You can convert your search query to an embedding, search the DB and keep the e.g. 10 closest matches.
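
A sketch of that embeddings-only "super search", assuming the sentence-transformers package and a small local model (no generation step, so nothing to hallucinate); at scale a vector DB replaces the numpy part:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")   # small local embedding model
    docs = ["Q3 planning notes ...", "Onboarding checklist ...", "Incident postmortem ..."]
    doc_vecs = model.encode(docs, normalize_embeddings=True)

    def search(query, k=10):
        q = model.encode([query], normalize_embeddings=True)[0]
        scores = doc_vecs @ q                 # cosine similarity (vectors are normalized)
        return [(docs[i], float(scores[i])) for i in np.argsort(-scores)[:k]]

    print(search("who owns the onboarding process?"))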

kkielhofner
0 replies
1h22m

As others have said you want RAG.

The most feature complete implementation I've seen is h2ogpt[0] (not affiliated).

The code is kind of a mess (most of the logic is in an ~8000 line python file) but it supports ingestion of everything from YouTube videos to docx, pdf, etc - either offline or from the web interface. It uses langchain and a ton of additional open source libraries under the hood. It can run directly on Linux, via docker, or with one-click installers for Mac and Windows.

It has various model hosting implementations built in - transformers, exllama, llama.cpp as well as support for model serving frameworks like vLLM, HF TGI, etc or just OpenAI.

You can also define your preferred embedding model along with various other parameters but I've found the out of box defaults to be pretty sane and usable.

[0] - https://github.com/h2oai/h2ogpt

jrpt
0 replies
11h56m

What are you trying to do more specifically? You can use https://docalysis.com/ for most document RAG tasks.

joeyrobert
0 replies
3h20m

Gpt4all is a local desktop app with a Python API that can be trained on your documents: https://gpt4all.io/

jerpint
0 replies
4h31m

If you’re looking for an open source RAG solution, try our library:

https://www.github.com/jerpint/Buster

iAkashPaul
0 replies
10h43m

A go-to method is to ingest different chunksizes based on the document hierarchy & then use langchain with a bunch of retrievers depending on the doc type.

Then create an index of the metadata of each doc, so that you can ask the RAGbot what it can answer about.

Another way to ensure it stays on-domain is to generate synthetic questions & check for similarity against user queries. There's a whole rabbit hole of query decomposition to avoid straying off topic as well.

gsharma
0 replies
16m

I have usually seen people recommend to chunk by sentences or paragraphs or some fixed length of characters. IMO, all these are suggested because they are easy to write code for, but in reality, length of a meaningful chunk depends entirely on the data. The way we chunk an FAQ document vs a PRD is different.

Based on this assumption, I have a couple of questions:

1. Is chunking the most significant factor in RAG quality?

2. If there are no limitations, would humans that are experts in that dataset, be the best people to create chunks?

elvyscruz
0 replies
1h15m

anything-llm looks pretty interesting and easy to use https://github.com/Mintplex-Labs/anything-llm

drakonka
0 replies
8h50m

As mentioned above, I don't think you'd need to train your own model for this (or for most use cases of this, anyway). You'd use a RAG.

I've tried out working with custom documents in two different ways for different types of data:

* Once using LlamaIndex + Chroma[0] to transcribe and then conversationally query video contents (using GPT 3.5 or 4 as the backing LLM).

* Once using GPT Plus, uploading long-form PDFs of my own fiction books to the GPT's knowledge base. I use this to help me remember character names and timelines (not always accurate, so results need to be treated with caution) and help brainstorm story or tech ideas for my world.

Both work for what I'm using them for. I feel like option one is more customizable and easier to tweak for the types of results I would want, if I have concrete requirements about what kind of output I'm looking for. Option two has a lower barrier to entry and is just a little lower effort (no need to run your own app).

For the next iteration, I'd like to try out AWS Bedrock and compare the workflow and results.

[0] https://www.daily.co/blog/search-your-video-content-library-...

csbartus
0 replies
10h34m

https://khoj.dev/

Tried it this summer, and it kinda worked!

constantinum
0 replies
11h5m

Unstract - https://unstract.com/ - they are a month away from launch (both open source and cloud). The team might be able to give you a quick demo on your specific requirements.

cf1241290841
0 replies
8h24m

Show HN from two weeks ago mentioned this. https://news.ycombinator.com/item?id=38587052

assafe
0 replies
2h50m

Hey, GPT Researcher shows exactly how to do that with RAG. See here https://github.com/assafelovic/gpt-researcher

ZunarJ5
0 replies
6h9m

I'm a fan of Khoj. Been using it for months. https://github.com/khoj-ai/khoj

NKosmatos
0 replies
8h46m

There was something similar about Retrieval Augmented Generation (RAG) recently on HN: https://news.ycombinator.com/item?id=38491251

Early next year I’m preparing something similar for my team, so I’ll surely look into the useful links/recommendations posted by fellow HNers :-)

JacobiX
0 replies
1h21m

I have a question for which I haven't found a definitive answer yet: how can one effectively manage typos and out-of-vocabulary (OOV) words in RAG systems?

For instance, if I search for a specific product name but accidentally mistype it, the resulting encoded vector might not be similar to the vector for the correctly spelled product name?

Igor_Wiwi
0 replies
34m

It's super easy; an example can be found here: https://technoclub.bearblog.dev/creating-a-simple-ai-chat-wi...

GabrieleR
0 replies
9h38m

Try the privateGPT GitHub repo.

CrypticShift
0 replies
6h38m

Here are ways to do it by simply adding files to an online interface. I mention them only because they are quite straightforward (and free) to set up.

- https://notebooklm.google/ (US or VPN): uses the "gemini pro" model.

- poe.com: You need to "create a new bot", disable "Make bot publicly accessible," and then "add a knowledge source." this offers many models, although the best ones require a subscription.

BeetleB
0 replies
2h1m

Since no one has mentioned it so far: I did just this recently with txtai in a few lines of code.

https://neuml.github.io/txtai/

627467
0 replies
1h36m

I've been looking for the answer to this to have a chat interface to my Obsidian markdown notes (the whole vault, not just RAG over individual notes). Will be following these threads closely.