The team I work on processes 5B+ tokens a month (and growing) and I'm the EM overseeing that.
Here are my takeaways:
1. There are way too many premature abstractions. LangChain, as one of many examples, might be useful in the future, but at the end of the day a prompt is just an API call, and it's easier to write standard code that treats LLM calls as a flaky API call rather than as a special thing (a rough sketch of what I mean is below, after these takeaways).
2. Hallucinations are definitely a big problem. Summarizing is pretty rock solid in my testing, but reasoning is really hard. Action models, where you take a user input and ask the LLM to decide what to do next, are just really hard; specifically, it's hard to get the LLM to understand the context and to say when it's not sure.
That said, it's still a gamechanger that I can do it at all.
3. I am a bit more hyped than the author that this is a game changer, but like them, I don't think it's going to be the end of the world. There are some jobs that are going to be heavily impacted and I think we are going to have a rough few years of bots astroturfing platforms. But all in all I think it's more of a force multiplier rather than a breakthrough like the internet.
IMHO it's similar to what happened to DevOps in the 2000s, you just don't need a big special team to help you deploy anymore, you hire a few specialists and mostly buy off the shelf solutions. Similarly, certain ML tasks are now easy to implement even for dumb dumb web devs like me.
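On the "flaky API call" point from takeaway 1, a minimal sketch of what that looks like in practice (the model name, retry count, and backoff are placeholders, not what we actually run):

```python
import time
from openai import OpenAI  # assumes the v1 Python SDK

client = OpenAI()

def call_llm(prompt: str, retries: int = 3, backoff: float = 2.0) -> str:
    """Treat the LLM like any other flaky HTTP dependency: time out, retry, then fail loudly."""
    last_err = None
    for attempt in range(retries):
        try:
            resp = client.chat.completions.create(
                model="gpt-4",  # placeholder model name
                messages=[{"role": "user", "content": prompt}],
                timeout=30,
            )
            return resp.choices[0].message.content
        except Exception as err:  # rate limits, timeouts, 5xx: same handling as any other API
            last_err = err
            time.sleep(backoff * (attempt + 1))
    raise RuntimeError(f"LLM call failed after {retries} attempts") from last_err
```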
This is a function of the language model itself. By the time you get to the output, the uncertainty that is inherent in the computation is lost to the prediction. It's like if you ask me to guess heads or tails and I guess heads: I could have stated my uncertainty beforehand (e.g. Pr[H] = 0.5), but in my actual prediction of heads, and then the coin flip, that uncertainty is lost. It's the same with LLMs. The uncertainty in the computation is lost in the final prediction of the tokens, so unless the predicted tokens themselves express uncertainty (which they rarely should, based on the training corpus, I think), you should almost never see an LLM output saying it does not understand. But that is because it never understands; it just predicts.
Apparently it is possible to measure how uncertain the model is using logprobs, there's a recipe for it in the OpenAI cookbook: https://cookbook.openai.com/examples/using_logprobs#5-calcul...
I haven't tried it myself yet, not sure how well it works in practice.
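For reference, a minimal sketch of what that recipe boils down to with the OpenAI Python SDK (assuming the v1 client; averaging per-token logprobs is just a crude confidence proxy, and I haven't validated it either):

```python
import math
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4",  # placeholder
    messages=[{"role": "user", "content": "In what year was the US constitution signed? Answer briefly."}],
    logprobs=True,  # return per-token log probabilities alongside the text
    max_tokens=20,
)

token_logprobs = resp.choices[0].logprobs.content
avg = sum(t.logprob for t in token_logprobs) / len(token_logprobs)
print(resp.choices[0].message.content, f"(avg token prob ~ {math.exp(avg):.0%})")
```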
There’s a difference between certainty about the next token, given the context and the model evaluation so far, and certainty that an abstract reasoning process is correct, given that it isn’t reasoning at all. The probabilities coming out are about token prediction rather than “knowing” or “certainty”, and they often mislead people into assuming they’re more powerful than they are.
When you train a model on data made by humans, then it learns to imitate but is ungrounded. After you train the model with interactivity, it can learn from the consequences of its outputs. This grounding by feedback constitutes a new learning signal that does not simply copy humans, and is a necessary ingredient for pattern matching to become reasoning. Everything we know as humans comes from the environment. It is the ultimate teacher and validator. This is the missing ingredient for AI to be able to reason.
Yeah but this doesn't change how the model functions, this is just turning reasoning into training data by example. It's not learning how to reason - it's just learning how to pretend to reason, about a gradually wider and wider variety of topics.
If any LLM appears to be reasoning, that is evidence not of the intelligence of the model, but rather the lack of creativity of the question.
What's the difference between reasoning and pretending to reason really well?
It’s the process by which you solve a problem. Reasoning requires creating abstract concepts and applying logic against them to arrive at a conclusion.
It’s like asking what’s the difference between deductive logic and Monte Carlo simulations. Both arrive at answers that can be very similar, but the processes are not similar at all.
If there is any form of reasoning on display here it’s an abductive style of reasoning which operates in a probabilistic semantic space rather than a logical abstract space.
This is important to bear in mind and explains why hallucinations are very difficult to prevent. There is nothing to put guard rails around in the process because it’s literally computing probabilities of tokens appearing given the tokens seen so far and the space of all tokens trained against. It has nothing to draw upon other than this - and that’s the difference between LLMs and systems with richer abstract concepts and operations.
A naive way of solving this problem is to e.g. run it 3 times and see if it arrives at the same conclusion all 3 times. More generally, run it N times and take the answer with the highest agreement ratio. You trade compute for a wider evaluation of the uncertainty window.
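A rough sketch of that N-runs-and-vote idea (exact string matching of the answers is crude; in practice you'd constrain the output format first):

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()

def majority_answer(prompt: str, n: int = 3, min_ratio: float = 1.0):
    """Run the same prompt n times; only trust the answer if enough runs agree."""
    answers = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4",  # placeholder
            messages=[{"role": "user", "content": prompt}],
            temperature=1.0,  # keep some randomness so disagreement is informative
        )
        answers.append(resp.choices[0].message.content.strip())
    best, count = Counter(answers).most_common(1)[0]
    ratio = count / n
    return (best, ratio) if ratio >= min_ratio else (None, ratio)  # None = not confident enough
```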
You can ask the model something like: "Is xyz correct? Answer with one word, either Yes or No." The logprobs of the two tokens should represent how certain it is. However, apparently RLHF-tuned models are worse at this than base models.
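Something like this, as a sketch (the statement is a placeholder; top_logprobs returns the runner-up tokens so you can compare Yes vs No directly):

```python
import math
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4",  # placeholder
    messages=[{"role": "user", "content":
        "Is the following statement correct? Answer with one word, either Yes or No.\n\n"
        "The US constitution was signed in 1787."}],
    logprobs=True,
    top_logprobs=5,  # also return the most likely alternative tokens
    max_tokens=1,
)

first_token = resp.choices[0].logprobs.content[0]
for alt in first_token.top_logprobs:
    print(alt.token, f"{math.exp(alt.logprob):.1%}")
```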
Seems like functions could work well to give it an active and distinct choice, but I'm still unsure if the function/parameters are going to be the logical, correct answer...
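A rough sketch of that idea with OpenAI tool/function calling (both tools and their schemas are invented for illustration; whether the model picks the logically correct one is exactly the open question):

```python
import json
from openai import OpenAI

client = OpenAI()

# Two invented tools: forcing the model to pick one makes the "choice" explicit.
tools = [
    {"type": "function", "function": {
        "name": "answer_question",
        "description": "Answer directly when the request is clear and in scope.",
        "parameters": {"type": "object",
                       "properties": {"answer": {"type": "string"}},
                       "required": ["answer"]}}},
    {"type": "function", "function": {
        "name": "escalate_to_human",
        "description": "Hand off when the request is ambiguous or out of scope.",
        "parameters": {"type": "object",
                       "properties": {"reason": {"type": "string"}},
                       "required": ["reason"]}}},
]

resp = client.chat.completions.create(
    model="gpt-4",  # placeholder
    messages=[{"role": "user", "content": "My thing is broken, fix it"}],
    tools=tools,
    tool_choice="required",  # force it to pick one of the tools rather than reply in free text
)
call = resp.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))
```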
Why shouldn't you ask for uncertainty?
I love asking for scores / probabilities (usually giving a range, like 0.0 to 1.0) whenever I ask for a list; it makes the output much more usable.
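For example, something along these lines (the task and JSON shape are placeholders; malformed JSON still needs handling):

```python
import json
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4",  # placeholder
    messages=[{"role": "user", "content":
        "List the 5 key claims made in the text below. Return JSON only: "
        '[{"claim": str, "confidence": float between 0.0 and 1.0}], sorted by confidence.\n\n'
        "<document text here>"}],
    temperature=0,
)
items = json.loads(resp.choices[0].message.content)  # may still need a retry if the JSON is malformed
```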
I'm not sure that is a metric you can rely on. LLMs are very sensitive to the position of items within lists along the context, paying extra attention to the beginning and the end of those lists.
See the listwise approach at "Large Language Models are Effective Text Rankers with Pairwise Ranking Prompting", https://arxiv.org/abs/2306.17563
I get the reasoning but I’m not sure you’ve successfully contradicted the point.
Most prompts are written in the form “you are a helpful assistant, you will do X, you will not do Y”
I believe that inclusion of instructions like “if there are possible answers that differ and contradict, state that and estimate the probability of each” would help knowledgeable users.
But for typical users and PR purposes, it would be disaster. It is better to tell 999 people that the US constitution was signed in 1787 and 1 person that it was signed in 349 B.C. than it is to tell 1000 people that it was probably signed in 1787 but it might have been 349 B.C.
Why does the prompt intro take the form of a role/identity directive "You are a helpful assistant..."?
What about the training sets or the model internals responds to this directive?
What are the degrees of freedom of such directives?
If such a directive is helpful, why wouldn't more demanding directives be even more helpful: "You are a domain X expert who provides proven solutions for problem type Y..."
If you don't think the latter prompt is more helpful, why not?
What aspect of the former prompt is within bounds of helpful directives that the latter is not?
Are training sets structured in the form of roles? Surely, the model doesn't identify with a role?!
Why is the role directive typically used with NLP but not image generation?
Do typical prompts for Stable Diffusion start with an identity directive "You are assistant to Andy Warhol in his industrial phase..."?
Why can't improved prompt directives be generated by the model itself? Has no one bothered to ask it for help?
"You are the world's most talented prompt bro, write a prompt for sentience..."
If the first directive observed in this post is useful and this last directive is absurd, what distinguishes them?
Surely there's no shortage of expert prompt training data.
BTW, how much training data is enough to permit effective responses in a domain?
Can a properly trained model answer this question? Can it become better if you direct it to be better?
Why can't the models rectify their own hallucinations?
To be more derogatory: what distinguishes a hallucination from any other model output within the operational domain of the model?
Why are hallucinations regarded as anything other than a pure effect, and as pure effect, what is the cusp of hallucination? That a human finds the output nonsensical?
If outputs are not equally valid in the LLM, why can't it sort for validity?
OTOH, if all outputs are equally valid in the LLM, then outputs must be reviewed by a human for validity, so what distinguishes an LLM from the world's greatest human time-wasting device? (After Las Vegas)
Why will a statistical confidence level help avoid having a human review every output?
The questions go on and on...
— Parole Board chairman: They've got a name for people like you H.I. That name is called "recidivism."
Parole Board member: Repeat offender!
Parole Board chairman: Not a pretty name, is it H.I.?
H.I.: No, sir. That's one bonehead name, but that ain't me any more.
Parole Board chairman: You're not just telling us what we want to hear?
H.I.: No, sir, no way.
Parole Board member: 'Cause we just want to hear the truth.
H.I.: Well, then I guess I am telling you what you want to hear.
Parole Board chairman: Boy, didn't we just tell you not to do that?
H.I.: Yes, sir.
Parole Board chairman: Okay, then.
It's not just loss of the uncertainty in prediction, it's also that an LLM has zero insight into its own mental processes as a separate entity from its training data and the text it's ingested. If you ask it how sure it is, the response isn't based on its perception of its own confidence in the answer it just gave, it's based on how likely it is for an answer like that to be followed by a confident affirmation in its training data.
But the LLM predicts the output based on some notion of a likelihood so it could in principle signal if the likelihood of the returned token sequence is low, couldn’t it?
Or do you mean that fine-tuning distorts these likelihoods so models can no longer accurately signal uncertainty?
For example?
Lots of applied NLP tasks used to require paying annotators to compile a golden dataset and then train an efficient model on the dataset.
Now, if cost is of little concern, you can use zero-shot prompting on an inefficient model. If cost is a concern, you can use GPT-4 to create your golden dataset way faster and cheaper than human annotation, and then train your more efficient model.
Some example NLP tasks would be classification, sentiment analysis, and extracting data from documents. But I’d be curious which areas of NLP __weren’t__ disrupted by LLMs.
Essentially: come up with a potent generic model using human feedback, labeling, and annotation (e.g. GPT-4), then use it to generate golden datasets for other new models without a human in the loop. Very innovative indeed.
I’m interested by your comment that you can “use GPT4 to create your golden dataset”.
Would you be willing to expand a little and give a brief example please? It would be really helpful for me to understand this a little better!
Anything involving classification, extraction, or synthesis.
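A minimal sketch of the classification case (the labels, prompt wording, and sample texts are placeholders; the point is just that GPT-4 plays the annotator and a smaller, cheaper model is trained on its output):

```python
from openai import OpenAI

client = OpenAI()
LABELS = ["positive", "negative", "neutral"]  # placeholder label set

def gpt4_label(text: str) -> str:
    """GPT-4 acts as the zero-shot annotator; each output becomes a row in the golden dataset."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content":
            f"Classify the sentiment of this text as one of {LABELS}. Reply with the label only.\n\n{text}"}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower()

# Label the raw corpus once, save it, then fine-tune a small efficient model on the result.
golden = [(t, gpt4_label(t)) for t in ["Great battery life.", "Broke after a week."]]
```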
Thank you. Seeing similar things. Clients are also seeing sticker shock on how much the big models cost vs. the output. That will all come down over time.
So will interest, as more and more people realise there's nothing "intelligent" about the technology; it's merely a Markov-chain word-salad generator with some weights to improve the accuracy somewhat.
I'm sure some people (other than AI investors) are getting some value out of it, but I've found it to be most unsuited to most of the tasks I've applied it to.
The industry is troubled both by hype marketers who believe LLMs are superhuman intelligence that will replace all jobs, and cynics who believe they are useless word predictors.
Some workloads are well-suited to LLMs. Roughly 60% of applications are for knowledge management and summarization tasks, which is a big problem for large organizations. I have experience deploying these for customers in a niche vertical, and they work quite well. I do not believe they're yet effective for 'agentic' behavior or anything using advanced reasoning. I don't know if they will be in the near future. But as a smart, fast librarian, they're great.
A related area is tier one customer service. We are beginning to see evidence that well-designed applications (emphasis on well-designed -- the LLM is just a component) can significantly bring down customer service costs. Most customer service requests do not require complex reasoning. They just need to find answers to a set of questions that are repeatedly asked, because the majority of service calls are from people who do not read docs. People who read documentation make fewer calls. In most cases around 60-70% of customer service requests are well-suited to automating with a well-designed LLM-enabled agent. The rest should be handled by humans.
If the task does not require advanced reasoning and mostly involves processing existing information, LLMs can be a good fit. This actually represents a lot of work.
But many tech people are skeptical, because they don't actually get much exposure to this type of work. They read the docs before calling service, are good at searching for things, and excel at using computers as tools. And so, to them, it's mystifying why LLMs could still be so valuable.
Asking for analogies has been interesting and surprisingly useful.
Could you elaborate, please?
Instead of `if X == Y do ...` it's more like `enumerate features of X in such a manner...` and then `explain feature #2 of X in terms that Y would understand` and then maybe `enumerate the manners in which Y might apply X#2 to TASK` and then have it do the smartest number.
The most lucid explanation for SQL joins I've seen was in a (regrettably unsaved) exchange where I asked it to compare them to different parts of a construction project and then focused in on the landscaping example. I felt like Harrison Ford panning around a still image in the first Blade Runner. "Go back a point and focus in on the third paragraph".
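A bare-bones sketch of that chained, conversational prompting pattern (the prompts just echo the SQL-joins example above):

```python
from openai import OpenAI

client = OpenAI()
history = []

def ask(prompt: str) -> str:
    """Keep the full conversation so each step can build on the previous answer."""
    history.append({"role": "user", "content": prompt})
    resp = client.chat.completions.create(model="gpt-4", messages=history)  # placeholder model
    answer = resp.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer

ask("Enumerate the features of SQL joins as if they were parts of a construction project.")
ask("Explain feature #2 in terms a landscaper would understand.")
print(ask("Go back a point and focus in on the third paragraph."))
```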
OP here - I had never thought of the analogy to DevOps before, that made something click for me, and I wrote a post just now riffing off this notion: https://kenkantzer.com/gpt-is-the-heroku-of-ai
Basically, I think we’re using GPT as the PaaS/heroku/render equivalent of AI ops.
Thank you for the insight!!
You only processed 500M tokens, which is shockingly little. Perhaps only $2k in incurred costs?
I advocate for these metaphors because they help people set reasonable expectations for LLMs in modern development workflows, mostly because they present it as a trade-off rather than a silver bullet. There were trade-offs to the evolution of DevOps too: consider, for example, the loss of key skill sets like database administration as a direct result of "just use AWS RDS", the explosion in cloud billing costs (especially the OpEx of startups who weren't even dealing with that much data or regional complexity!), and how that indirectly led to GitLab's big outage and many like it.
DevOps is such an amazing analogy.
Regarding null hypothesis and negation problems - I find it personally interesting because a similar phenomenon happens in our brains. Dreams, emotions, affirmations, etc. process inner dialogue more or less by ignoring negations and amplifying the emotionally rich parts.
They are also dull (higher latency for the same resources) APIs if you're self-hosting an LLM. Special attention is needed to plan the capacity.
> Summarizing is pretty rock solid in my testing
Yet, for some reason, ChatGPT is still pretty bad at generating titles for chats, and I didn't have better luck with the API even after trying to engineer the right prompt for quite a while...
For some odd reason, once in a while I get things in different languages. It's funny when it's in a language I can speak, but I recently got "Relm4 App Yenileştirme Titizliği", which ChatGPT tells me means "Relm4 App Renewal Thoroughness", when I was actually asking it to adapt a snippet of gtk-rs code to relm4 - so, not particularly helpful.