Just signed up for Claude Pro to try out the Opus model. Decided to throw a complex query at it, combining an image with an involved question about SDXL fine-tuning and asking it to do some math comparing the cost of using an RTX 6000 Ada vs an H100.
It made a lot of mistakes. I provided it with a screenshot of Runpod's pricing for their GPUs, and it misread the pricing on an RTX 6000 Ada as $0.114 instead of $1.14.
Then, it tried to do math, and here is the outcome:
-----
Approach 1: Use the 1x RTX 6000 Ada with a batch size of 4 for 10,000 steps.
Cost: $0.114/hr * (10,000 steps / (4 images/step * 2.5 steps/sec)) = $19.00 Time: (10,000 steps / (4 images/step * 2.5 steps/sec)) / 3600 = 0.278 hours
Approach 2: Use the 1x H100 80GB SXMS with a batch size of 8 for 10,000 steps.
Cost: $4.69/hr * (10,000 steps / (8 images/step * 3 steps/sec)) = $19.54 Time: (10,000 steps / (8 images/step * 3 steps/sec)) / 3600 = 0.116 hours
-----
You will note that 0.278 * $0.114 (or even the actually correct $1.14) != $19.00, and that 0.116 * $4.69 != $19.54.
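Working it through yourself makes the inconsistency obvious (a quick check using Claude's own step-rate and batch-size assumptions, at both the misread $0.114/hr and the actual $1.14/hr):

    t1 = (10_000 / (4 * 2.5)) / 3600   # 0.278 hours, matching Claude's stated time
    print(t1 * 0.114, t1 * 1.14)       # about $0.03 at the misread rate, $0.32 at the real $1.14/hr, not $19.00

    t2 = (10_000 / (8 * 3)) / 3600     # 0.116 hours
    print(t2 * 4.69)                   # about $0.54, not $19.54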
For what it's worth, ChatGPT 4 correctly read the prices off the same screenshot, and its math was more coherent. Note that it saw in the same screenshot that the RTX 6000 Ada was currently unavailable and on its own decided to substitute a 4090, which is $0.74/hr; it also chose the cheaper PCIe version of the H100 Runpod offers @ $3.89/hr:
-----
The total cost for running 10,000 steps on the RTX 4090 would be approximately $2.06.
It would take about 2.78 hours to complete 10,000 steps on the RTX 4090. On the other hand:
The total cost for running 10,000 steps on the H100 PCIe would be approximately $5.40.
It would take about 1.39 hours to complete 10,000 steps on the H100 PCIe, which is roughly half the time compared to the RTX 4090 due to the doubled batch size assumption.
-----
I'm convinced GPT is running separate helper functions on input and output tokens to fix the 'tokenization' issues. As in, it finds pieces of math, sends them to a hand-made parser and function, then inserts the result into the output tokens. There's no other way to fix the token issue.
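Purely as an illustration of the kind of thing I mean (a hypothetical sketch, not anything OpenAI has confirmed): a post-processing pass could pull arithmetic out of the text, evaluate it with ordinary code, and splice the exact result back in.

    import re

    def patch_arithmetic(text: str) -> str:
        # Hypothetical helper: find simple arithmetic expressions in model
        # output and replace them with exactly computed results.
        pattern = r"\d+(?:\.\d+)?(?:\s*[-+*/]\s*\d+(?:\.\d+)?)+"

        def evaluate(match: re.Match) -> str:
            expr = match.group(0)
            try:
                return str(round(eval(expr), 6))  # toy; a real system would use a safe expression parser
            except Exception:
                return expr

        return re.sub(pattern, evaluate, text)

    print(patch_arithmetic("0.278 hr at $1.14/hr costs 0.278 * 1.14 dollars"))
    # -> "0.278 hr at $1.14/hr costs 0.31692 dollars"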
For reference: "Let's build the GPT Tokenizer" https://www.youtube.com/watch?v=zduSFxRajkE
I'd almost say anyone not doing that is being foolish.
The goal of the service is to answer complex queries correctly, not to have a pure LLM that can do it all. I think some engineers feel that if they are leaning on an old-school, classically programmed tool to assist the LLM, it's somehow cheating or impure.
The problem is that such tricks are sold as if there were superior built-in multi-modal reasoning and intelligence rather than taped-up heuristics, exacerbating the already amped-up hype cycle in the vacuum left behind by web3.
Why is this a trick or somehow inferior to getting the AI model to be able to do it natively?
Most humans also can’t reliably do complex arithmetic without the use of something like a calculator. And that’s no trick. We’ve built the modern world with such tools.
Why should we fault AI for doing what we do? To me, training the AI to use a calculator is not just a trick for hype, it's exciting progress.
By all means if it works to solve your problem, go ahead and do it.
The reason some people have mixed feelings about this is a historical observation - http://www.incompleteideas.net/IncIdeas/BitterLesson.html - that we humans often feel good about adding lots of hand-coded smarts to our ML systems, reflecting our deep and brilliant personal insights. But it turns out that just chucking loads of data and compute at the problem often works better.
20 years ago in machine vision you'd have an engineer choosing precisely which RGB values belonged to which segment, deciding if this was a case where a Hough transform was appropriate, and insisting on a room with no windows because the sun moves and it's totally throwing off our calibration. In comparison, it turns out you can just give loads of examples to a huge model and it'll do a much better job.
(Obviously there's an element of self-selection here - if you train an ML system for OCR, you compare it to tesseract and you find yours is worse, you probably don't release it. Or if you do, nobody pays attention to you)
The reason we chucked loads of data at it was because we had no other options. If you wanted to write a function that classified a picture as a cat or a dog, good luck. With ML, you can learn such a function.
That logic doesn’t extend to things we already know how to program computers to do. Arithmetic already works. We don’t need a neural net to also run the calculations or play a game of chess. We have specialized programs that are probably as good as we’re going to get in those specialized domains.
> We don’t need a neural net to also run the calculations or play a game of chess.
That's actually one of the specific examples from the link I mentioned:
> In computer chess, the methods that defeated the world champion, Kasparov, in 1997, were based on massive, deep search. At the time, this was looked upon with dismay by the majority of computer-chess researchers who had pursued methods that leveraged human understanding of the special structure of chess. When a simpler, search-based approach with special hardware and software proved vastly more effective, these human-knowledge-based chess researchers were not good losers. They said that "brute force" search may have won this time, but it was not a general strategy, and anyway it was not how people played chess. These researchers wanted methods based on human input to win and were disappointed when they did not.
While it's true that they didn't use an LLM specifically, it's still an example of chucking loads of compute at the problem instead of something more elegant and human-like.
Of course, I agree that if you're looking for a good game of chess, Stockfish is a better choice than ChatGPT.
It would be exciting if the LLM knew it needed a calculator for certain things and went out and got it. If the human supervisors are pre-screening the input and massaging what the LLM is doing, that is a sign we don't understand LLMs well enough to engineer them precisely and can't count on them to be aware of their own limitations, which would seem to be a useful part of general intelligence.
It can if you let it, that's the whole premise of LangChain style reasoning and it works well enough. My dumb little personal chatbot knows it can access a Python REPL to carry out calculations and it does.
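For anyone curious, the loop really is simple (a minimal sketch; the llm callable and its reply format here are made up, not any particular SDK):

    def run_python(code: str) -> str:
        # Toy "Python REPL" tool; a real agent would sandbox this.
        return str(eval(code))

    TOOLS = {"python": run_python}

    def answer(question: str, llm) -> str:
        # Minimal tool-use loop: ask the model, run whatever tool it requests,
        # feed the result back in, and return its final reply.
        messages = [{"role": "user", "content": question}]
        while True:
            reply = llm(messages)  # assumed to return {"tool": ..., "input": ...} or {"final": ...}
            if "final" in reply:
                return reply["final"]
            result = TOOLS[reply["tool"]](reply["input"])
            messages.append({"role": "tool", "content": result})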
Because if the NN is smart enough, it should be able to do arithmetic flawlessly. Basic arithmetic doesn't even require that much intelligence; it's mostly attention to detail.
Well, it's obviously not smart enough, so the question is what do you do about it? Train another net that's 1000x as big for 99% accuracy, or hand it off to the lowly calculator, which will get it right 100% of the time?
And 1000x is just a guess. We have no scaling laws about this kind of thing. It could be a million. It could be 10.
No, that's the actual end goal. We want a NN that does everything, trained end-to-end.
"We" contains more than just one perspective though.
As someone applying LLMs to a set of problems in a production application, I just want a tool that solves the problem. Today, that tool is an LLM, tomorrow it could be anything. If there are ~hacks~ elegant techniques that can get me the results I need faster, cheaper, or more accurately, I absolutely will use those until there's a better alternative.
Like an AGI? I think we'll put up with hacks for some time yet. Unless the model gets really, really good at generalizing, and then it's probably close to human level already.
I'm unclear if you're saying that as a user who wants that feature, or an AI developer (for Anthropic or other) who is trying to achieve that goal?
Of course. But we must acknowledge that many have blinders on, assuming that scale is all you need to beat statistical errors.
Well, these people are not wrong per se. Scale is what drove what we have today and as hardware improves, the models will too. It's just that in the very short term it turns out to be faster to just code around some of these issues on the backend of an API rather than increase the compute you spend on the model itself.
I personally find approaches like this the correct way forward.
An input analyzer that finds out what kinds of tokens the query contains. A bunch of specialized models which each handle one type well: image analysis, OCR, math and formal logic, data lookup, sentiment analysis, etc. Then some synthesis steps that produce a coherent answer in the right format.
Then you might enjoy looking up the "Mixture of Experts" model design.
That has nothing to do with the idea of ensembling multiple specialized/single-purpose models. Mixture of Experts is a method of splitting the feed-forward layers in a model such that only a (hopefully) relevant subset of parameters is run for each token.
The model learns how to split them on its own, and usually splits based not on topic or domain, but on grammatical function or category of symbol (e.g., punctuation, counting words, conjunctions, proper nouns, etc.).
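In code, the core of it is roughly this (a toy numpy sketch of top-k routing over expert feed-forwards; real implementations sit inside each transformer block and are trained end to end):

    import numpy as np

    def moe_ffn(x, router_w, experts, k=2):
        # Score every expert for this token, run only the top-k,
        # and mix their outputs by the routing probabilities.
        logits = router_w @ x
        topk = np.argsort(logits)[-k:]
        weights = np.exp(logits[topk]) / np.exp(logits[topk]).sum()
        return sum(w * experts[i](x) for w, i in zip(weights, topk))

    rng = np.random.default_rng(0)
    d = 8
    experts = [lambda x, W=rng.normal(size=(d, d)): W @ x for _ in range(4)]  # toy "experts"
    router_w = rng.normal(size=(4, d))
    print(moe_ffn(rng.normal(size=d), router_w, experts).shape)  # (8,)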
An ensemble of specialists is different to a mixture of experts?
I thought half the point of MoE was to make the training tractable by allowing the different experts to be trained independently?
Doesn't the human brain work like this? Yeah, it's all connected together and plastic and so on, but functions tend to be localized, e.g. vision is in the occipital area. These base areas are responsible for the basic latent representations (edge detectors), which get fed forward to the AGI module (prefrontal cortex) that coordinates the whole thing based on the high-quality representations it sees from these base modules.
This strikes me as the most compute efficient approach.
Yeah. Have a multimodal parser model that can decompose prompts into pieces, generate embeddings for each of them and route those embeddings to the correct model based on the location of the embedding in latent space. Then have a "combiner/resolver" model that is trained to take answer embeddings from multiple models and render it in one of a variety of human readable formats.
Eventually there is going to be a model catalog that describes model inputs/outputs in a machine-parseable format, all models will use a unified interface (embedding in -> embedding out, with adapters for different latent spaces), and we will have "agent" models, designed to be rapidly fine-tuned in an online manner, that act as glue between all these different models.
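As a rough sketch of the routing step (everything here is hypothetical: the specialist registry format, the shared latent space, the combiner):

    import numpy as np

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    def route(piece_embedding, specialists):
        # specialists: {name: (centroid_embedding, model_fn)}, a made-up catalog format
        best = max(specialists, key=lambda name: cosine(piece_embedding, specialists[name][0]))
        return specialists[best][1]

    # usage sketch: pieces come from a parser model, embeddings from a shared encoder
    # answers = [route(embed(piece), specialists)(piece) for piece in pieces]
    # final = combiner(answers)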
GPT has for some time output "analyzing" in a lot of contexts. If you see that, you can go into settings and tick "always show code when using data analyst" and you'll see that it does indeed construct Python and run code for problems where it is suitable.
What if we used character tokens?
I wrote a whole paper about ways to "fix" tokenization in a plug-and-play fashion for poetry generation: Filter the vocabulary before decoding.
https://paperswithcode.com/paper/most-language-models-can-be...
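The core mechanism is just logit masking at decode time (a minimal sketch, not the paper's exact code): build a boolean mask over the vocabulary once, then zero out disallowed tokens at every step.

    import numpy as np

    def filtered_sample(logits, allowed_mask, temperature=1.0):
        # Sample the next token, but only from tokens that pass the filter.
        # allowed_mask is a boolean array over the whole vocabulary, built once
        # (e.g. keep only tokens compatible with the metre/rhyme constraints).
        logits = np.where(allowed_mask, logits / temperature, -np.inf)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        return np.random.choice(len(logits), p=probs)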
Hi, CISO of Anthropic here. Thank you for the feedback! If you can share any details about the image, please send them in a private message.
No LLM has had an emergent calculator yet.
Hey Jason, checked your HN bio and I don't see a contact. Found you on twitter but it seems I'm unable to DM you.
Went ahead and uploaded the image here: https://imgur.com/pJlzk6z
An "LLM crawler app" is needed -- in that you should be able to shift Tokenized Workloads between executioners in a BGP routing sort of sense...
Least cost routing of prompt response. especially if time-to-respond is not as important as precision...
Also, is there a time-series ability in any LLM model (meaning "show me this [thing] based on this [input] but continually updated as I firehose the crap out of it"?
--
What if you could get execution estimates for a prompt?
Thank you!
Regardless of emergence, in the context of "putting safety at the frontier" I would expect Claude 3 to be augmented with very basic tools like calculators to minimize such trivial hallucinations. I say this as someone rooting for Anthropic.
LLMs are building blocks and I’m excited about folks building with a concert of models working together with subagents.
How many uses do you get per day of Opus with the pro subscription?
Hmm, not seeing it anywhere on my profile or in the chat interface, but I might be missing it.
100 messages per 8 hours:
https://support.anthropic.com/en/articles/8324991-about-clau...
I can't wait until this is the true disruptor in the economy: "Take this $1,000 and maximise my returns and invest it where appropriate. Goal is to make this $1,000 100X"
And just let your r/wallStreetBets BOT run rampant with it...
That will only work for the first few people who try it.
When OpenAI showed that GPT-4 with vision was smarter than GPT-4 without vision, what did they mean really? Does vision capability increase intelligence even in tasks that don't involve vision (no image input)?
Yes. They increase the total parameters used in the model and adjust the existing parameters.
I'm guessing the difference is screenshot reading; I'm finding that it's about the same as GPT-4 with text. For example, given this equation:
(64−30)−(46−38)+(11+96)+(30+21)+(93+55)−(22×71)/(55/16)+(69/37)+(74+70)−(40/29)
Calculator: 22.08555452004
GPT-4 (without Python): 22.3038
Claude 3 Opus: 22.0492
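(For reference, evaluating the same expression in Python with ASCII operators gives the calculator's value:)

    print((64-30)-(46-38)+(11+96)+(30+21)+(93+55)-(22*71)/(55/16)+(69/37)+(74+70)-(40/29))
    # ~22.08555452, matching the calculator value above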