I like that this shows how hard even conceptually simple ideas are to execute when fine-tuning LLMs. Even given a pretty good starting dataset, a decent starting model, etc., this appears to have been a challenge.
One thing it did make me think about was that these models are suitable for things that don't have a natural, definitive answer. That is, picking the perfect card given a set of picks is probably combinatorially intractable. But picking a good card from a set is possible, and LLMs can approach human-level performance.
I think this leads to a set of problems that current LLMs may be fine-tuned to solve.
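To put a rough number on "combinatorially intractable": even just counting the distinct packs you could be handed is hopeless. A back-of-envelope sketch in Python, where the pool and pack sizes are made-up placeholders rather than anything from the post:

    # Counting distinct packs for a single pick, ignoring the rest of the draft.
    # pool_size and pack_size are hypothetical numbers, chosen for illustration.
    from math import comb

    pool_size = 250   # distinct cards in the set
    pack_size = 15    # cards offered in a single pick

    print(f"{comb(pool_size, pack_size):.3e} distinct packs")  # ~4.6e+23

And that's one pick; "pick a good card" only needs a rough ranking of the 15 cards in front of you, which is exactly the kind of fuzzy judgment an LLM can learn.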
That lines up with my experience: for high-stakes decisions, they rarely give me a great answer, but for low-stakes decisions they do well at giving me a good-enough answer. For example, I've been using them to help find gifts for friends and children this month. I don't need the best choice to solve the problem, just a good one.
What are examples of low-stakes decisions?
Generating content for tabletop gaming with my friends (especially wacky ideas, like character names themed after items on the Taco Bell menu)
I had to buy some spare tools where I cared more about price than quality, and it helped me choose some suitable brands
As mentioned, you can tell it a bit about a person (and feed in their wishlist if they have one) and it'll help you pick something they'll probably like
Finding something to do to spend an afternoon in a city while traveling
In general, anything where there is no objective best answer (meaning I can ask it to generate multiple possibilities and filter out the bad ideas) and where I value speed over correctness.
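A minimal sketch of that generate-and-filter loop, assuming the official OpenAI Python client; the model name, prompts, and the YES/NO check are placeholders I made up, not anything from the thread:

    # Over-generate cheap candidates, then keep only those that pass a sanity check.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def ask(prompt: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4",  # placeholder model name
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    # 1. Generate many possibilities; with no objective best answer, quantity helps.
    raw = ask("Suggest 10 one-line gift ideas for a 9-year-old who likes space. "
              "One idea per line, no numbering.")
    candidates = [line.strip() for line in raw.splitlines() if line.strip()]

    # 2. Filter out the bad ideas with a cheap yes/no pass.
    good = [idea for idea in candidates
            if ask("Is this a reasonable gift for a 9-year-old who likes space? "
                   "Answer YES or NO only.\n\n" + idea).strip().upper().startswith("YES")]

    print("\n".join(good))

The trade: when several answers would do, one perfect call can be replaced by many cheap ones plus a filter, which is exactly where valuing speed over correctness pays off.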
I've been going ham with this. I pay for ChatGPT Plus for work, so GPT-4's been helping me design encounters, plan sessions, and brainstorm ideas for one-shots. It gives me a sort of bland, vanilla idea for something; I suggest a twist on it; it goes, "oh great idea, here that is with your changes:" and I iterate with it from there.
Likewise, I love theming characters, plotlines, and settings after songs, bands, and albums, so I'll dump in a bunch of lyrics and ask ChatGPT to help me weave subtle references into descriptions, names, and plot points.
A random sampling of things GPT-4 has helped me with lately:
Where are the dates in Whole Foods? (A: with the nuts, not the fruits and veggies)
How can I steam bao without a steamer basket? (A: saucepan, 1" water, balled up aluminum foil, plate, baos, lid)
Any guess as to when this photo was taken? It looks like anywhere from the 70s to the 90s. (A: the photo paper has a logo that postdates a 2003 company merger)
How much additional calculation goes into high-stakes decisions made by individuals? And how much does the quality of high-stakes decisions vary from human to human?
I'm guessing LLM decisions are rather average, but that an LLM has no easy way of spending the extra time to gather information around a high-stakes decision the way a human would.
I don't think additional calculation is the difference. It makes more sense to think of individual humans as highly tuned models.
Just like LLMs, some humans are better tuned than others, both for specific tasks and in general.
The difference is that you can reject a low-stakes answer that's invalid: you can tell that something is off, or that it doesn't matter.
With high-stakes decisions, you're surrendering the decision-making power to the AI because you don't understand the output well enough to verify it.
Basically, an AI can give ideas but not advice.
They can't achieve superhuman performance like AlphaGo, and they can't do the "system 2" thinking that would be required for high-stakes decisions.*
Room to grow.
*Observations from the recent Karpathy LLM talk.
Surely, but we can't gloss over the fact that this was accomplished by a single person.
Yes and no, I think. I've seen individuals achieve things in their bedrooms that would make most corporations blush; demoscene-type stuff comes to mind. Often a single person can become hyper-obsessed with achieving some goal and, in the absence of any interference, achieve something impressive beyond what can be done within a company.
Consider a PM involved in this project, feeding in requirements from a business. Instead of the "just get it done at any cost" mentality of a single person, you would have KPIs and business objectives muddying the water.
I just mean to say that there is a gulf between what a single hacker can do in their basement, with no constraints other than their imagination, and what can be accomplished by a business. Sometimes the single-hacker achievement doesn't scale.
So it is impressive that this is possible for a single person at all. But from a business/operations perspective, I don't actually think that is as relevant as it may seem.
I wonder if you could define a specific complexity class of problems that LLMs are good at.