I did a hobby RAG project a little while back, and I'll just share my experience here.
1. First ask the LLM to answer your questions without RAG. It is easy to do and you may be surprised (I was, but my data was semi-public). This also gives you a baseline to beat.
2. Chunking of your data needs to be smart. Just chunking every N characters wasn't especially fruitful. My data was a book, so it was hierarchical (by heading level). I would chunk by book section and hand it to the LLM.
3. Use the context window effectively. There is a weighted knapsack problem here: there are chunks of various sizes (chars/tokens) with various weightings (quality of match). If your data supports it, the problem is also hierarchical. For example, if I have 4 excellent matches in one chapter, should I include each match, or should I include the whole chapter? (See the sketch after this list.)
4. Quality of input data counts. I spent 30 minutes copy-pasting the entire book into markdown format.
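To make point 3 concrete, here is a minimal sketch of the greedy version of that packing problem (in Python). The Chunk fields and the scoring are hypothetical stand-ins for whatever your retriever returns, not what I actually ran:

    from dataclasses import dataclass

    @dataclass
    class Chunk:
        text: str
        tokens: int   # cost of including this chunk in the prompt
        score: float  # quality of match reported by the retriever

    def pack_context(chunks: list[Chunk], budget: int) -> list[Chunk]:
        # Greedy knapsack approximation: take the best score-per-token
        # chunks until the token budget is exhausted.
        selected = []
        for chunk in sorted(chunks, key=lambda c: c.score / c.tokens, reverse=True):
            if chunk.tokens <= budget:
                selected.append(chunk)
                budget -= chunk.tokens
        return selected

The hierarchical variant adds one more decision: if several selected chunks share a parent section, compare their combined score and size against including the whole section once.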
This was only a small project. I'd be interested to hear any other thoughts/tips.
Pro tip: you can use small models like phi-2 to do semantic chunking rather than just chunking based on size, and chunking works even better if you book-end chunks with summaries of the text coming before/after to enrich the context.
Another pro tip: you can again use a small model to summarize/extract RAG content to get more actual data in the context.
Could you share a bit more about semantic chunking with Phi? Any recommendations/examples of prompts?
Sure, it'll look something like this:
""" Task: Divide the provided text into semantically coherent chunks, each containing between 250-350 words. Aim to preserve logical and thematic continuity within each chunk, ensuring that sentences or ideas that belong together are not split across different chunks.
Guidelines: 1. Identify natural text breaks such as paragraph ends or section divides to initiate new chunks. 2. Estimate the word count as you include content in a chunk. Begin a new chunk when you reach approximately 250 words, preferring to end on a natural break close to this count, without exceeding 350 words. 3. In cases where text does not neatly fit within these constraints, prioritize maintaining the integrity of ideas and sentences over strict adherence to word limits. 4. Adjust the boundaries iteratively, refining your initial segmentation based on semantic coherence and word count guidelines.
Your primary goal is to minimize disruption to the logical flow of content across chunks, even if slight deviations from the word count range are necessary to achieve this. """
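If it helps, here is a minimal sketch of feeding that prompt to phi-2 via Hugging Face transformers. The generation settings and the way the document gets appended after the instructions are my own assumptions, not something the prompt above dictates:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "microsoft/phi-2"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

    CHUNKING_PROMPT = """..."""  # the full chunking prompt quoted above

    def semantic_chunks(document: str) -> str:
        # phi-2 has no chat template, so plain instruction-then-text prompting.
        prompt = CHUNKING_PROMPT + "\n\nText:\n" + document
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        output = model.generate(
            **inputs,
            max_new_tokens=1024,
            do_sample=False,  # deterministic output makes validation easier
            pad_token_id=tokenizer.eos_token_id,
        )
        # Return only the newly generated text, not the echoed prompt.
        new_tokens = output[0][inputs["input_ids"].shape[1]:]
        return tokenizer.decode(new_tokens, skip_special_tokens=True)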
This just isn't working for me, phi-2 starts summarizing the document I'm giving it. I tried a few news articles and blog posts. Does using a GGUF version make a difference?
Depending on the number of bits in the quantization, for sure. The most common failure mode should be minor restatements which you can choose to ignore or not.
Is phi actually able to follow those instructions? How do you handle errors?
Whether or not it follows the instructions as written, it produces good output as long as the chunk size stays on the smaller side. You can easily validate that all the original text is present in the chunks and that no additional text has been inserted, and automatically re-prompt when it isn't.
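A sketch of what I mean by validation, assuming the chunker is asked to separate chunks with a delimiter (the "---" separator and the semantic_chunks call are assumptions carried over from the earlier example):

    import re

    def normalize(text: str) -> str:
        # Collapse whitespace so formatting differences don't count as mismatches.
        return re.sub(r"\s+", " ", text).strip()

    def validate_chunks(source: str, chunks: list[str]) -> bool:
        # True only if the chunks reproduce the original text exactly:
        # nothing dropped, nothing invented. Minor restatements fail this check.
        return normalize(" ".join(chunks)) == normalize(source)

    def chunk_with_retry(source: str, max_attempts: int = 3) -> list[str]:
        for _ in range(max_attempts):
            raw = semantic_chunks(source)
            chunks = [c.strip() for c in raw.split("\n---\n") if c.strip()]
            if validate_chunks(source, chunks):
                return chunks
        raise RuntimeError("chunker kept altering the text; fall back to size-based chunking")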
Any comments about using Sparse Priming Representations for achieving similar things?
That looks like it'd be an adjunct strategy IMO. In most cases you want to have the original source material on tap; it helps with explainability and citations.
That being said, it seems that everyone working at the state of the art is thinking about using LLMs to summarize chunks, and summarize groups of chunks in a hierarchical manner. RAPTOR (https://arxiv.org/html/2401.18059v1) was just published and is close to SoTA, and from a quick read I can already think of several directions to improve it, and that's not to brag but more to say how fertile the field is.
Wild speculation - do you think there could be any benefit from creating two sets of chunks with one set at a different offset from the first? So like, the boundary between chunks in the first set would be near the middle of a chunk in the second set?
No, it's better to just create summaries of all the chunks, and return summaries of chunks that are adjacent to chunks that are being retrieved. That gives you edge context without the duplication. Having 50% duplicated chunks is just going to burn context, or force you to do more pre-processing of your context.
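Roughly, assuming chunks are stored in document order with one precomputed summary per chunk (the storage layout here is invented for illustration):

    def build_context(hit_ids: list[int], chunks: list[str], summaries: list[str]) -> str:
        # Full text for retrieved chunks, plus summaries of their immediate
        # neighbours, so each hit carries edge context without duplicating chunks.
        parts = []
        hits = set(hit_ids)
        for i in sorted(hits):
            if i > 0 and (i - 1) not in hits:
                parts.append("[context before] " + summaries[i - 1])
            parts.append(chunks[i])
            if i + 1 < len(chunks) and (i + 1) not in hits:
                parts.append("[context after] " + summaries[i + 1])
        return "\n\n".join(parts)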
Might sound like a rookie question, but curious how you'd tackle semantic chunking for a hefty text, like a 100k-word book, especially with phi-2's 2048-token limit [0]. Found some hints about stretching this to 8k tokens [1] but still scratching my head on handling the whole book. And even if we get the 100k words in, how do we smartly chunk the output into manageable 250-350 word bits? Is there a cap on how much output the model can produce? From what I've picked up, a neat summary ratio for a large text without missing the good parts is about 10%, which translates to around 7.5K words or over 20 chunks for the output. Appreciate any insights here, and apologies if this comes off as basic.
[0]: https://huggingface.co/microsoft/phi-2
[1]: https://old.reddit.com/r/LocalLLaMA/comments/197kweu/experie...
Can you speak a bit to the speed/slowness of doing such chunking? We recently started using LLMs to clean the text (data quality/text cleanliness is a problem for us), and it has increased our indexing time a lot.
It's going to depend on what you're running on, but phi2 is pretty fast so you can reasonably expect to be hitting ~50 tokens a second. Given that, if you are ingesting a 100k token document you can expect it to take 30-40 minutes if done serially, and you can of course spread stuff in parallel.
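For reference, the arithmetic behind that estimate: a chunker roughly re-emits the text it reads, so a 100k-token document means on the order of 100k generated tokens, and 100,000 / 50 tokens per second ≈ 2,000 seconds, or a bit over 30 minutes serially.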
thanks for the info--good to know we aren't the only ones contending with speed for large documents lol
I've heard of people having success with methods like this. Would be awesome if we found a way to build that into this project :)
Is there any science associated with creating effective embedding sets? For a book, you could do every sentence, every paragraph, every page or every chapter (or all of these options). Eventually people will want to just point their RAG system at data and have everything work.
The easy answer is just use a model to chunk your data for you. Phi-2 can chunk and annotate with pre/post summary context in one pass, and it's pretty fast/cheap.
There is an optimal chunk size, which IIRC is ~512 tokens depending on some factors. You could hierarchically model your data with embeddings by chunking the data, then generating summaries of those chunks and chunking the summaries, and repeating that process ad nauseam until you only have a small number of top-level chunks.
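A minimal sketch of that loop, assuming you already have chunk(text) and summarize(text) helpers (both hypothetical here, e.g. backed by phi-2 or an API model):

    def build_summary_tree(text: str, top_level_max: int = 8) -> list[list[str]]:
        # Chunk, summarize each chunk, chunk the concatenated summaries, and
        # repeat until the top level is only a handful of chunks.
        # Returns one list of chunks per level, leaves first; embed every level.
        levels = []
        current = chunk(text)  # hypothetical ~512-token chunker
        while True:
            levels.append(current)
            if len(current) <= top_level_max:
                break
            summaries = [summarize(c) for c in current]  # hypothetical summarizer
            current = chunk(" ".join(summaries))
        return levels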
How does this work when there is a limited context window? Do you do some pre-chunking?
Phi can ingest 2k tokens and the optimal chunk size is between 512-1024 depending on the model/application, so you just give it a big chunk and tell it to break it down into smaller chunks that are semantically related, leaving enough room for book-end sentences to enrich the context of the chunk. Then you start the next big chunk with the remnants of the previous one that the model couldn't group.
Isn't "give it a big chunk" just the same problem at a higher level? How do you handle, say, a book?
You don't need to handle a whole book; the goal is to chunk the book into chunks of the correct size, which is less than the context size of the model you're using to chunk it semantically. When you're ingesting data, you fill up the chunker model's context, and it breaks that up into smaller, semantically coherent chunks and a remainder. You then start from the remainder and slurp up as much additional text as you can to fill the context and repeat the process.
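In code, that ingestion loop looks roughly like this. chunk_with_retry is the validated chunking call sketched earlier, words stand in as a crude proxy for tokens, and the window size leaves headroom under phi-2's 2,048-token context for the instructions and the generated output:

    def ingest_book(text: str, window_words: int = 1000) -> list[str]:
        # Fill the chunker's context, let it emit coherent chunks plus a
        # possibly cut-off remainder, then start the next window from that remainder.
        words = text.split()
        all_chunks, carry, pos = [], "", 0
        while pos < len(words) or carry:
            window = (carry + " " + " ".join(words[pos:pos + window_words])).strip()
            pos += window_words
            chunks = chunk_with_retry(window)
            if pos < len(words) and len(chunks) > 1:
                carry = chunks.pop()  # last chunk may have been cut mid-idea
            else:
                carry = ""
            all_chunks.extend(chunks)
        return all_chunks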
This is an example of knowledge transfer from a model. I used a similar approach to augment chunked texts with questions, summaries, and keyterms (which require structured output from the LLM). I haven't tried using a smaller model to do this as GPT3.5 is fast and cheap enough, but I like the idea of running a model in house to do things like this.
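For what it's worth, a sketch of that augmentation pass with the OpenAI Python client and JSON-mode output. The field names (summary, questions, keyterms) mirror what I described above, but the prompt and schema here are just an illustration:

    import json
    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    def augment_chunk(chunk: str) -> dict:
        # Ask for structured metadata that gets indexed alongside the chunk text.
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            response_format={"type": "json_object"},
            messages=[
                {"role": "system",
                 "content": "Return a JSON object with keys: summary (string), "
                            "questions (list of questions this text answers), "
                            "keyterms (list of strings)."},
                {"role": "user", "content": chunk},
            ],
        )
        return json.loads(response.choices[0].message.content)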
It seems like you put a lot of thought and effort into this. Were you happy with the results?
If you could put the entire book into the context window of Gemini (1M tokens), how do you think it would compare with your approach? It would cost like $15 to do this, so not cheap but also cost effective considering the time you’ve spent chunking it.
At the time I was working with either GPT 3.5 or 4, and, looking at my old code, I was limiting myself to around 14k tokens.
It was somewhat ok. I was testing the system with a (copyrighted) psychiatry textbook and getting feedback on the output from a psychotherapist. The idea was to provide a tool to help therapists prep for a session, rather than help patients directly.
As usual it was somewhat helpful but a little too vague sometimes, or missed important situation-specific information.
It is possible that it could be improved with a larger context window, more data to select from, or different prompting. But the frequent response was along the lines of, "this is good advice, but it just doesn't drill down enough".
Ultimately we found that GPT3.5/4 could produce responses that matched or exceeded our RAG-based solution. This was surprising as the domain is quite specific, but it also seemed pretty clear that GPT must have been trained on data very similar to the (copyrighted) content we were using.
Further steps would be:
1. Use other LLM models. Is it just GPT3.5/4 that is reluctant to drill down?
2. Use specifically trained LLMs (or LoRA fine-tunes) based on the expected response style
I'd be careful of entering this kind of arms race. It seems to be a fight against mediocre results, and at any moment OpenAI et al. may release a new model that eats your lunch.
Did you try or consider fine-tuning GPT-3.5 as a complementary approach?
The thing you have to keep in mind is that OpenAI (and all the big tech companies) are risk averse. That's the reason that model alignment is so overbearing. The foundation models are also trying to be good at everything, which means they won't be sensitive to the nuances of specific types of questions.
RAG and chatbot memory systems are here to stay, and they will always provide a benefit.
But you would need to spend the $15 on every request, whereas the RAG approach would most likely be significantly cheaper per request.
https://twitter.com/parthsarthi03/status/1753199233241674040
This ingestor processes documents, organizing content and improving readability by handling sections, paragraphs, links, tables, lists, and page continuations, removing redundancies and watermarks, and applying OCR, with additional support for HTML and other formats through Apache Tika:
https://github.com/nlmatics/nlm-ingestor
I don't understand. Why build up text chunks from different, non-contiguous sections?
If those non-contiguous sections share similar semantic/other meaning, it can make sense from a search perspective to group them?
it starts to look like a graph problem
At the level of the printed page, not everything is laid out linearly. The main text is often laid out in columns, the flow can be offset by pictures with captions, additional text can be placed in inserts, etc.
You need a human eye to figure that out and this is the task nlm-ingestor tackles.
As for the content, semantic contiguity is not always guaranteed. A typical example of this is conversation, where people engage in narrative/argumentative competitions. Topics get nested as the conversation advances, along the lines of "Hey, this reminds me of ...", building up a stack that can be popped once subtopics have been exhausted: "To get back to the topic of ...".
This is explored at length by Kerbrat-Orecchioni in:
https://www.cambridge.org/core/journals/language-in-society/...
And an explanation is offered by Dessalles in:
https://telecom-paris.hal.science/hal-03814068/document