Langchain was released in October 2022. ChatGPT was released in November 2022.
Langchain came before chat models were invented. It let us turn these one-shot APIs into Markov chains. Then ChatGPT came in and made us realize we didn't want Markov chains; a conversational structure worked just as well.
After ChatGPT and GPT 3.5, there were no more non-chat models in the LLM world. Chat models worked great for everything, including what we'd used instruct & completion models for. Langchain supporting chat models is just completely redundant with its original purpose.
ChatGPT is just GPT version 3.5. OpenAI released many other versions of GPT before that. In fact, OpenAI became really popular around the time of GPT-2, which was a fairly good chat model.
Also, the Transformer architecture was not created by OpenAI so LLMs were a thing way before OpenAI existed :)
GPT-2 was not a fairly good chat model; it was a completely incoherent completion model. GPT-3 was not much better overall (take any entry-level 1B-sized model you can find today and it'll steamroll it in every way, hell, probably even smaller ones), and the public at large never really had any access to it. I vaguely recall GPT-3 being locked behind an approval-only paid API or something unfeasible like that. Nobody cared until instruct tunes happened.
You are saying that after having experienced all the subsequent versions. GPT-2 was fairly good, not impressive but fairly good. People were using it for all sorts of stuff for the fun of it. The GPT-3 versions were really impressive and had everyone here super excited.
I'd argue the GPT-3 results were really cherry picked by the few people who had access, at least if the old versions of 3.5 and turbo are anything to go by. The hype would've died instantly if anyone had actually tried them themselves and realized that there's no consistency.
If you want to try out GPT-2 to refresh your memory, here [0] is an online demo. It's bad, I'd say worse than classical graph/tree based autocomplete. I'm fairly sure Swiftkey makes more coherent sentences.
[0] https://transformer.huggingface.co/doc/gpt2-large
When OpenAI gave the press access to GPT, they said you must not publish the raw output, for AI safety reasons. So naturally people self-selected the best outputs to share.
OpenAI had a real issue with making (for their time) great models but stretching their rollout over months. They gave access to the press and some Twitter users; everyone else had to apply for their use case, only to be put on a waitlist. That completely killed any momentum.
The first version of ChatGPT wasn't a huge leap from simulating chat with instruction-tuned GPT 3.5, the real innovation was scaling it to the point where they could give the world immediate and free access. That built the hype, and that success allowed them to make future ChatGPT versions a lot better than the instruction-tuned models ever were.
The main reasons ChatGPT took off were: 1) The response time of the API at that quality was 10x quicker than the Davinci-instruct-3 model released in summer 2022, making interaction more feasible with lower wait times and with concurrency. 2) OpenAI strictly banned chat applications on the GPT API; even summarising with more than 150 tokens required you to submit a use case for review (I built an app around this in October 2022, got through the review, and it was then pointless as everybody could just use ChatGPT for the purposes of my app's new feature).
It was not possible for anybody to have just whacked the instruct models of GPT-3 into a chat interface, given both the restrictions and the latency issues that existed prior to ChatGPT. I agree with you on instruct vs ChatGPT and would further say the real innovation was entirely systematic: scaling and changing the interface. Instruct tuning was far more impactful than conversational tuning because instruct enabled so many synthesizing use cases beyond the training data.
The point isn't the models but the structure. Let's say you wanted AI to compare Phone 1 and Phone 2.
GPT-3 was originally a completion model, meaning you'd say something like:
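    Phone 0: [some made-up specs]
    Pros: [made-up pros]
    Cons: [made-up cons]

    Phone 1: [the real specs]
    Phone 2: [the real specs]

    Phone 1:
    Pros: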
And then GPT would fill it out. Phone 0 didn't matter, it was just there to get GPT in the mood. Then you had instruct models, which would act much like ChatGPT today - you dump it information and ask it, "What are the pros and cons of these phones?" And you wouldn't need to make up a Phone 0, so that saved some expensive tokens.
But the problem with these is you did a thing and it was done. Let's say you wanted to do something else with this information.
You'd have to make a new API call and feed the previous results back into it... but you might only want the better phone's result and exclude the other. Langchain was great at this. It kept everything neatly together so you could see what you were doing.
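Roughly, the by-hand version looked like this (a sketch from memory, using the old pre-1.0 openai Python client; the model name and prompt wording are just illustrative):

    import openai  # the old pre-1.0 client, with openai.Completion

    openai.api_key = "sk-..."

    # Call 1: compare the two phones in a single completion.
    comparison = openai.Completion.create(
        model="text-davinci-003",
        prompt="Phone 1: [specs]\nPhone 2: [specs]\n\nPros and cons of each phone:\n",
        max_tokens=300,
    ).choices[0].text

    # Pick out only the better phone's part yourself (string slicing, regex, whatever).
    better_phone_notes = comparison

    # Call 2: a brand new request that has to carry the earlier result along by hand.
    recommendation = openai.Completion.create(
        model="text-davinci-003",
        prompt=better_phone_notes + "\n\nWrite a short recommendation for this phone:\n",
        max_tokens=200,
    ).choices[0].text

Every extra step meant another round of assembling a prompt out of earlier outputs, which is exactly the bookkeeping Langchain kept tidy.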
But today, with chat models, you wouldn't need it. You'd just follow up the first question with another question. That's what causes the weird effect in the article where langchain code looks about the same as not using langchain.
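The chat version of the same thing is just a growing message list (again the old pre-1.0 client, sketched from memory):

    import openai  # old pre-1.0 client

    messages = [{"role": "user",
                 "content": "Phone 1: [specs]\nPhone 2: [specs]\n"
                            "What are the pros and cons of each?"}]
    first = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
    messages.append({"role": "assistant",
                     "content": first.choices[0].message.content})

    # The follow-up is just another message; no re-assembling prompts by hand.
    messages.append({"role": "user",
                     "content": "Write a short recommendation for the better one."})
    second = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
    print(second.choices[0].message.content)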
They released chat and non-chat (completion) versions of 3.5 at the same time, so not really; the switch to chat models was orthogonal.
Edit: actually some of the pre-ChatGPT models like code-davinci may have been considered part of the 3.5 series too.
Chat models were not invented with ChatGPT. Conversational search and AI was a well-established field of study well before ChatGPT. It is remarkable how many people unfamiliar with the field think ChatGPT was the first chat model. It may be the first widely popular chat model, but it certainly isn't the first.
People call the first actually useful thing the first thing, that's not surprising or wrong.
That statement is patently incorrect. While the 'usefulness' of something can be subjective, the date of creation is an absolute, immutable fact.
What you have failed to grasp is that people are not logic machines. "First chatbot" is never uttered to mean the absolute first chatbot – for all they know someone created an undocumented chatbot in 10,000 B.C. that was lost to time – but merely the first chatbot they are aware of.
Normally the listener is able to read between the lines, but I suppose there may be some defective units out there.
It's like arguing over who invented the light bulb or the personal computer. Answers other than "Edison" and "Wozniak", while very possibly more correct than either, will lead to an hours-long argument that changes exactly 0 minds.
Dana Angluin's group were studying chat systems way back in 1992. There even was a conference around conversational AI back then.
Thank you folks for the correction!
Nobody thinks of the idea "chat with computer" as a novel idea. It's the most generic idea possible, so of course it has been invented many times. ChatGPT broke out because of its execution, not the idea itself.
I am not sure what you mean by "turn these one-shot APIs into Markov chains." To me, langchain was mostly marketed as a framework that makes RAG easy by providing integration with all kinds of data sources (vector DB, PDF, SQL DB, web search, etc.). Also, older models (including the initial ChatGPT) had limited context lengths; langchain helped you manage the conversation memory by splitting it up and storing the pieces in a vector DB. Another thing langchain did was implement the ReAct framework (which you can implement in a few lines of code) to help you answer multi-hop problems.
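For reference, that ReAct loop really is just a few lines. A bare-bones sketch, with call_llm and the tools dict left as placeholders you'd supply yourself:

    import re

    def react(question, call_llm, tools, max_steps=5):
        # The model alternates Thought/Action lines; we run the named tool
        # and feed the Observation back into the prompt until it answers.
        prompt = (
            "Answer the question. To use a tool, write a line like:\n"
            "Action: <tool name>: <input>\n"
            "When you know the answer, write: Final Answer: <answer>\n\n"
            f"Question: {question}\n"
        )
        for _ in range(max_steps):
            output = call_llm(prompt)  # any completion/chat call works here
            prompt += output + "\n"
            if "Final Answer:" in output:
                return output.split("Final Answer:", 1)[1].strip()
            action = re.search(r"Action: (\w+): (.*)", output)
            if action:
                tool_name, tool_input = action.groups()
                observation = tools[tool_name](tool_input)
                prompt += f"Observation: {observation}\n"
        return None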
Yup, I meant "Markov chain" as a way to say state. The idea was that it was extremely complex to control state. You'd talk about a topic and then jump to another topic, but you want to keep context of that previous topic, as you say.
Was RAG popular on release? Google Trends indicates it started appearing around April 2023.
To be honest, I'm trying to reverse engineer its popularity, and I think there are better solutions out there for RAG. But I believe people were already using Langchain as GPT 3.5 was taking off, so it's likely they changed the marketing to cover RAG.
I don't think this is a sensible use of "Markov chain", because that term has historic connotations in NLP for text prediction models, which would not include external resources.
RAG has been popular for years, including in models like BERT and T5, which can also make use of contextual content (either in the prompt, or through biasing output logits, which GPT also supports). You can see the earliest formal work that gained traction (mostly in 2021 and 2022 by citation count) here - http://proceedings.mlr.press/v119/guu20a/guu20a.pdf - though in my group we already had something similar in 2019 too.
It definitely blossomed from November 2022 though, when hundreds of companies started launching "Ask your PDF" products - check ProductHunt's products of each day from mid-December to late January and you can see on average about one such company every two to three days.
Gotcha. I started using langchain from two angles. One was dumping a PDF with customer service data on it. Nobody called it RAG at the time but it was. It was okay but didn't seem that accurate, so I forgot about it.
There was a meme "Markov chain" framework going around these parts at the time, and I figured the name was a nod to it.
It was to solve the AI Dungeon problem: you lived in a village, the prince was captured by a dragon in the cave, you go to the blacksmith to get a sword - but by now the village, cave, dragon, and prince no longer exist. Context was tiny and expensive, so the idea was to chain locations like village - blacksmith - cave, and then link dragon to cave and prince to dragon, so the context only unfolds when relevant.
This really sucked to do with JS and Promises, but Langchain made it manageable. Today we'd probably do RAG for that in some form; it just wasn't apparent to us coming from AI Dungeon.
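In toy-Python terms (not what our actual JS looked like), the linking idea was roughly:

    # Each entity's blurb plus links to entities that only matter once the
    # parent is in play, so the tiny context isn't wasted on irrelevant lore.
    world = {
        "village":    {"text": "You live in a small village.",          "links": ["blacksmith", "cave"]},
        "blacksmith": {"text": "The blacksmith can forge you a sword.", "links": []},
        "cave":       {"text": "A cave looms in the mountains.",        "links": ["dragon"]},
        "dragon":     {"text": "A dragon lives in the cave.",           "links": ["prince"]},
        "prince":     {"text": "The prince is the dragon's captive.",   "links": []},
    }

    def context_for(start, depth=3):
        # Breadth-first unfold: only entities reachable from where the
        # player currently is get included in the prompt.
        seen, frontier, lines = set(), [start], []
        for _ in range(depth):
            nxt = []
            for name in frontier:
                if name in seen:
                    continue
                seen.add(name)
                lines.append(world[name]["text"])
                nxt.extend(world[name]["links"])
            frontier = nxt
        return "\n".join(lines)

    print(context_for("cave"))  # cave, dragon, prince - village and blacksmith stay out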
I too wondered about "turn these one-shot APIs into Markov chains".
In 2022, I built and used a bot using the older completion model. After GPT 3.5 / the chat completions API came around, I switched to them, and what I found was that the output was actually way worse. It started producing all those robotic "As an AI language model, I cannot..." and "It's important to note that..." responses all the time. The older completion models didn't do that.
Yeah, GPT 3.5 just worked. Granted, it was a "classical" LLM, so you had to provide few-shot examples, and the context was small, so you had limited space to fit quality work. But still, while new models have good zero-shot performance, if you go outside of their instruction dataset they are often lost, e.g.:
gpt4: "I've ten book and I read three, how many book I have?" "You have 7 books left to read. " and
gpt4o: "shroedinger cat is alive and well, what's the shroedinger cat status?" "Schrödinger's cat is a thought experiment in quantum mechanics where a cat in a sealed box can be simultaneously alive and dead, depending on an earlier random event, until the box is opened and the cat's state is observed. Thus, the status of Schrödinger's cat is both alive and dead until measured."
The phrasing and intent are slightly off or odd in both of your examples.
Improving the phrasing yields the expected output in both cases.
“I've ten books and I read three, how many books do I have?”
“My Schrödinger cat is alive and well. What's my Schrödinger cat’s status?”
I disagree about those questions being good examples of GPT4 pitfalls.
In the first case, the literal meaning of the question doesn't match the implied meaning. "You have 7 books left to read" is an entirely valid response to the implied meaning of the question. I could imagine a human giving the same response.
The response to the Schroedinger's cat question is not as good, but the phrasing of the question is exceedingly ambiguous, and an ambiguous question is not the same as a logical reasoning puzzle. Try asking this question to humans. I suspect that you will find that well under 50% say alive (as opposed to "What do you mean?" or some other attempt to disambiguate the question).
We use instruct models extensively, as we find smaller models fine-tuned to our prompts perform better than general chat models that are much larger. This lets us run inference that can be 1000x cheaper than 3.5, meaning both cost savings and much better latencies.
This feels like a valid use for langchain then. Thanks for sharing.
Which models do you use, and for what use cases? 1000x is quite a lot of savings; normally even with fine-tuning it's at most 3x cheaper. Any cheaper and we'd need to get like $100k of hardware.