
Claude 3 model family

virgildotcodes
40 replies
3h29m

Just signed up for Claude Pro to try out the Opus model. Decided to throw a complex query at it, combining an image with an involved question about SDXL fine tuning and asking it to do some math comparing the cost of using an RTX 6000 Ada vs an H100.

It made a lot of mistakes. I provided it with a screenshot of Runpod's pricing for their GPUs, and it misread the pricing on an RTX 6000 Ada as $0.114 instead of $1.14.

Then, it tried to do math, and here is the outcome:

-----

Approach 1: Use the 1x RTX 6000 Ada with a batch size of 4 for 10,000 steps.

Cost: $0.114/hr * (10,000 steps / (4 images/step * 2.5 steps/sec)) = $19.00
Time: (10,000 steps / (4 images/step * 2.5 steps/sec)) / 3600 = 0.278 hours

Approach 2: Use the 1x H100 80GB SXMS with a batch size of 8 for 10,000 steps.

Cost: $4.69/hr * (10,000 steps / (8 images/step * 3 steps/sec)) = $19.54
Time: (10,000 steps / (8 images/step * 3 steps/sec)) / 3600 = 0.116 hours

-----

You will note that .278 * $0.114 (or even the actually correct $1.14) != $19.00, and that .116 * $4.69 != $19.54.
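
For reference, here is the same arithmetic redone in Python (a quick sanity check, using the throughput figures from the quoted output and the corrected $1.14/hr rate):

  def train_cost(rate_per_hr, steps, batch, steps_per_sec):
      hours = steps / (batch * steps_per_sec) / 3600
      return hours, rate_per_hr * hours

  print(train_cost(1.14, 10_000, 4, 2.5))  # RTX 6000 Ada: ~0.278 h, ~$0.32 (not $19.00)
  print(train_cost(4.69, 10_000, 8, 3.0))  # H100 SXM:     ~0.116 h, ~$0.54 (not $19.54)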

For what it's worth, ChatGPT 4 correctly read the prices off the same screenshot and did math that was more coherent. Note that it saw in that same screenshot that the RTX 6000 Ada was currently unavailable and on its own decided to substitute a 4090, which is $0.74/hr; it also chose the cheaper PCIe version of the H100 Runpod offers at $3.89/hr:

-----

The total cost for running 10,000 steps on the RTX 4090 would be approximately $2.06.

It would take about 2.78 hours to complete 10,000 steps on the RTX 4090. On the other hand:

The total cost for running 10,000 steps on the H100 PCIe would be approximately $5.40.

It would take about 1.39 hours to complete 10,000 steps on the H100 PCIe, which is roughly half the time compared to the RTX 4090 due to the doubled batch size assumption.

-----

anonymouse008
25 replies
3h21m

I'm convinced GPT is running separate helper functions on input and output tokens to fix the 'tokenization' issues. As in: find items of math, send them to a hand-made parser and function, then insert the result into the output tokens. There's no other way to fix the token issue.

For reference: "Let's build the GPT Tokenizer" https://www.youtube.com/watch?v=zduSFxRajkE
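
To illustrate the kind of helper being speculated about (purely a hypothetical sketch, not anything OpenAI has confirmed): scan the output for bare arithmetic spans, evaluate them with a hand-made parser, and splice the exact result back in.

  import ast, operator, re

  OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
         ast.Div: operator.truediv, ast.USub: operator.neg}

  def eval_arith(expr):
      # Arithmetic-only evaluator: numbers, + - * /, parentheses; no names or calls.
      def walk(node):
          if isinstance(node, ast.Expression):
              return walk(node.body)
          if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
              return node.value
          if isinstance(node, ast.BinOp) and type(node.op) in OPS:
              return OPS[type(node.op)](walk(node.left), walk(node.right))
          if isinstance(node, ast.UnaryOp) and type(node.op) in OPS:
              return OPS[type(node.op)](walk(node.operand))
          raise ValueError("unsupported expression")
      return walk(ast.parse(expr, mode="eval"))

  def patch_math(text):
      # Very naive span detection; a production system would need something smarter.
      def fix(m):
          try:
              return f"{eval_arith(m.group(0)):g}"
          except (ValueError, SyntaxError, ZeroDivisionError):
              return m.group(0)
      return re.sub(r"\d[\d\s().+\-*/]*[\d)]", fix, text)

  print(patch_math("Cost: 1.14 * 10000 / (4 * 2.5) / 3600 dollars"))  # Cost: 0.316667 dollars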

Workaccount2
15 replies
3h4m

I'd almost say anyone not doing that is being foolish.

The goal of the service is to answer complex queries correctly, not to have a pure LLM that can do it all. I think some engineers feel that if they are leaning on an old-school, classically programmed tool to assist the LLM, it's somehow cheating or impure.

ignoramous
8 replies
1h45m

> I'd almost say anyone not doing that is being foolish

The problem is, such tricks are sold as if there's superior built-in multi-modal reasoning and intelligence instead of taped up heuristics, exacerbating the already amped up hype cycle in the vacuum left behind by web3.

brokencode
7 replies
1h26m

Why is this a trick or somehow inferior to getting the AI model to be able to do it natively?

Most humans also can’t reliably do complex arithmetic without the use of something like a calculator. And that’s no trick. We’ve built the modern world with such tools.

Why should we fault AI for doing what we do? To me, training the AI to use a calculator is not just a trick for hype, it's exciting progress.

michaelt
2 replies
56m

By all means if it works to solve your problem, go ahead and do it.

The reason some people have mixed feelings about this is a historical observation - http://www.incompleteideas.net/IncIdeas/BitterLesson.html - that we humans often feel good about adding lots of hand-coded smarts to our ML systems reflecting our deep and brilliant personal insights. But it turns out just chucking loads of data and compute at the problem often works better.

20 years ago in machine vision you'd have an engineer choosing precisely which RGB values belonged to which segment, deciding if this was a case where a Hough transform was appropriate, and insisting on a room with no windows because the sun moves and it's totally throwing off our calibration. In comparison, it turns out you can just give loads of examples to a huge model and it'll do a much better job.

(Obviously there's an element of self-selection here - if you train an ML system for OCR, you compare it to tesseract and you find yours is worse, you probably don't release it. Or if you do, nobody pays attention to you)

janalsncm
1 replies
28m

The reason we chucked loads of data at it was because we had no other options. If you wanted to write a function that classified a picture as a cat or a dog, good luck. With ML, you can learn such a function.

That logic doesn’t extend to things we already know how to program computers to do. Arithmetic already works. We don’t need a neural net to also run the calculations or play a game of chess. We have specialized programs that are probably as good as we’re going to get in those specialized domains.

michaelt
0 replies
18m

> We don’t need a neural net to also run the calculations or play a game of chess.

That's actually one of the specific examples from the link I mentioned:-

> In computer chess, the methods that defeated the world champion, Kasparov, in 1997, were based on massive, deep search. At the time, this was looked upon with dismay by the majority of computer-chess researchers who had pursued methods that leveraged human understanding of the special structure of chess. When a simpler, search-based approach with special hardware and software proved vastly more effective, these human-knowledge-based chess researchers were not good losers. They said that "brute force" search may have won this time, but it was not a general strategy, and anyway it was not how people played chess. These researchers wanted methods based on human input to win and were disappointed when they did not.

While it's true that they didn't use an LLM specifically, it's still an example of chucking loads of compute at the problem instead of something more elegant and human-like.

Of course, I agree that if you're looking for a good game of chess, Stockfish is a better choice than ChatGPT.

lanstin
1 replies
1h16m

It would be exciting if the LLM knew it needed a calculator for certain things and went out and got it. If the human supervisors are pre-screening the input and massaging what the LLM is doing, that is a sign we don't understand LLMs well enough to engineer them precisely and can't count on them to be aware of their own limitations, which would seem to be a useful part of general intelligence.

Spivak
0 replies
1h9m

It can if you let it, that's the whole premise of LangChain style reasoning and it works well enough. My dumb little personal chatbot knows it can access a Python REPL to carry out calculations and it does.
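
The control flow can be as small as the sketch below; call_llm is a stand-in for whatever chat-completion API you use, and the CALC(...) convention is made up for illustration (it is not LangChain's actual API):

  import re

  SYSTEM = ("If the question needs arithmetic, reply with exactly CALC(<expression>) "
            "and wait for the result before giving your final answer.")

  def call_llm(messages):
      # Stand-in for a real chat-completion call (OpenAI, Anthropic, a local model, ...).
      raise NotImplementedError

  def answer(question, run_calculator):
      messages = [{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": question}]
      reply = call_llm(messages)
      m = re.fullmatch(r"CALC\((.+)\)", reply.strip(), re.S)
      if m:  # the model decided on its own that it needs the tool
          result = run_calculator(m.group(1))  # e.g. a sandboxed Python REPL
          messages += [{"role": "assistant", "content": reply},
                       {"role": "user", "content": f"CALC result: {result}"}]
          reply = call_llm(messages)
      return reply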

bufferoverflow
1 replies
58m

Because if the NN is smart enough, it should be able to do arithmetic flawlessly. Basic arithmetic doesn't even require that much intelligence, it's mostly attention to detail.

janalsncm
0 replies
23m

Well it’s obviously not smart enough so the question is what do you do about it? Train another net that’s 1000x as big for 99% accuracy or hand it off to the lowly calculator which will get it right 100% of the time?

And 1000x is just a guess. We have no scaling laws about this kind of thing. It could be a million. It could be 10.

bufferoverflow
3 replies
59m

> The goal of the service is to answer complex queries correctly, not to have a pure LLM that can do it all.

No, that's the actual end goal. We want a NN that does everything, trained end-to-end.

netghost
0 replies
19m

"We" contains more than just one perspective though.

As someone applying LLMs to a set of problems in a production application, I just want a tool that solves the problem. Today, that tool is an LLM, tomorrow it could be anything. If there are ~hacks~ elegant techniques that can get me the results I need faster, cheaper, or more accurately, I absolutely will use those until there's a better alternative.

coffeebeqn
0 replies
46m

Like an AGI? I think we'll put up with hacks for some more time still. Unless the model gets really, really good at generalizing, and then it's probably close to human level already.

ben_w
0 replies
56m

I'm unclear if you're saying that as a user who wants that feature, or an AI developer (for Anthropic or other) who is trying to achieve that goal?

uoaei
1 replies
3h0m

Of course. But we must acknowledge that many have blinders on, assuming that scale is all you need to beat statistical errors.

sigmoid10
0 replies
2h4m

Well, these people are not wrong per se. Scale is what drove what we have today and as hardware improves, the models will too. It's just that in the very short term it turns out to be faster to just code around some of these issues on the backend of an API rather than increase the compute you spend on the model itself.

nine_k
5 replies
3h13m

I personally find approaches like this the correct way forward.

An input analyzer that finds out what kinds of tokens the query contains. A bunch of specialized models which handle each type well: image analysis, OCR, math and formal logic, data lookup, sentiment analysis, etc. Then some synthesis steps that produce a coherent answer in the right format.
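
A rough sketch of that shape, with every component name hypothetical (each specialist's run function would be its own model or tool):

  from dataclasses import dataclass
  from typing import Callable

  @dataclass
  class Specialist:
      name: str
      wants: Callable[[str], bool]   # the input analyzer's routing decision
      run: Callable[[str], str]      # image analysis, OCR, math/logic, lookup, sentiment, ...

  def answer(query: str, specialists: list[Specialist],
             synthesize: Callable[[dict[str, str]], str]) -> str:
      # Fan the query out to every specialist that claims it, then synthesize one reply.
      partials = {s.name: s.run(query) for s in specialists if s.wants(query)}
      return synthesize(partials)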

michaelt
2 replies
3h10m

Then you might enjoy looking up the "Mixture of Experts" model design.

numeri
1 replies
2h58m

That has nothing to do with the idea of ensembling multiple specialized/single-purpose models. Mixture of Experts is a method of splitting the feed-forward layers in a model such that only a (hopefully) relevant subset of parameters is run for each token.

The model learns how to split them on its own, and usually splits based not on topic or domain, but on grammatical function or category of symbol (e.g., punctuation, counting words, conjunctions, proper nouns, etc.).

michaelt
0 replies
22m

An ensemble of specialists is different to a mixture of experts?

I thought half the point of MoE was to make the training tractable by allowing the different experts to be trained independently?

hackerlight
0 replies
51m

Doesn't the human brain work like this? Yeah, it's all connected together and plastic and so on, but functions tend to be localized, e.g. vision is in the occipital area. These base areas are responsible for the basic latent representations (edge detectors) which get fed forward to the AGI module (prefrontal cortex) that coordinates the whole thing based on the high quality representations it sees from these base modules.

This strikes me as the most compute efficient approach.

CuriouslyC
0 replies
2h31m

Yeah. Have a multimodal parser model that can decompose prompts into pieces, generate embeddings for each of them and route those embeddings to the correct model based on the location of the embedding in latent space. Then have a "combiner/resolver" model that is trained to take answer embeddings from multiple models and render it in one of a variety of human readable formats.

Eventually there is going to be a model catalog that describes model inputs/outputs in a machine parseable format, all models will use a unified interface (embedding in -> embedding out, with adapters for different latent spaces), and we will have "agent" models designed to be rapidly fine tuned in an online manner that act as glue between all these different models.

vidarh
0 replies
2h21m

GPT has for some time output "analyzing" in a lot of contexts. If you see that, you can go into settings and tick "always show code when using data analyst" and you'll see that it does indeed construct Python and run code for problems where it is suitable.

Jabrov
0 replies
1h49m

What if we used character tokens?

jasondclinton
5 replies
3h0m

Hi, CISO of Anthropic here. Thank you for the feedback! If you can share any details about the image, please share in a private message.

No LLM has had an emergent calculator yet.

virgildotcodes
2 replies
2h38m

Hey Jason, checked your HN bio and I don't see a contact. Found you on twitter but it seems I'm unable to DM you.

Went ahead and uploaded the image here: https://imgur.com/pJlzk6z

samstave
0 replies
2h30m

An "LLM crawler app" is needed -- in that you should be able to shift Tokenized Workloads between executioners in a BGP routing sort of sense...

Least cost routing of prompt response. especially if time-to-respond is not as important as precision...

Also, is there a time-series ability in any LLM model (meaning "show me this [thing] based on this [input] but continually updated as I firehose the crap out of it")?

--

What if you could get execution estimates for a prompt?

jasondclinton
0 replies
2h36m

Thank you!

connorgutman
1 replies
1h48m

Regardless of emergence, in the context of "putting safety at the frontier" I would expect Claude 3 to be augmented with very basic tools like calculators to minimize such trivial hallucinations. I say this as someone rooting for Anthropic.

jasondclinton
0 replies
1h37m

LLMs are building blocks and I’m excited about folks building with a concert of models working together with subagents.

SubiculumCode
2 replies
2h12m

How many uses do you get per day of Opus with the pro subscription?

virgildotcodes
0 replies
1h26m

Hmm, not seeing it anywhere on my profile or in the chat interface, but I might be missing it.

samstave
1 replies
2h34m

I can't wait until this is the true disruptor in the economy: "Take this $1,000 and maximise my returns and invest it where appropriate. Goal is to make this $1,000 100X"

And just let your r/wallStreetBets BOT run rampant with it...

helsinki
0 replies
1h19m

That will only work for the first few people who try it.

behnamoh
1 replies
3h9m

When OpenAI showed that GPT-4 with vision was smarter than GPT-4 without vision, what did they mean really? Does vision capability increase intelligence even in tasks that don't involve vision (no image input)?

KoolKat23
0 replies
2h59m

Yes. They increase the total parameters used in the model and adjust the existing parameters.

causal
0 replies
2h53m

I'm guessing the difference is screenshot reading; I'm finding that it's about the same as GPT-4 with text. For example, given this equation:

(64−30)−(46−38)+(11+96)+(30+21)+(93+55)−(22×71)/(55/16)+(69/37)+(74+70)−(40/29)

Calculator: 22.08555452004

GPT-4 (without Python): 22.3038

Claude 3 Opus: 22.0492
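
(The calculator figure is easy to double-check; plain Python gives the same value:)

  print((64-30) - (46-38) + (11+96) + (30+21) + (93+55)
        - (22*71) / (55/16) + (69/37) + (74+70) - (40/29))
  # ~22.0855545200, matching the calculator result above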

Workaccount2
28 replies
3h58m

Surpassing GPT4 is huge for any model, very impressive to pull off.

But then again...GPT4 is a year old and OpenAI has not yet revealed their next-gen model.

HarHarVeryFunny
20 replies
3h36m

Sure, OpenAI's next model would be expected to regain the lead, just due to their head start, but this level of catch-up from Anthropic is extremely impressive.

Bear in mind that GPT-3 was published ("Language Models are Few-Shot Learners") in 2020, and Anthropic were only founded after that in 2021. So, with OpenAI having three generations under their belt, Anthropic came from nothing (at least in terms of models - of course some team members had the know-how of being ex-OpenAI) and are, temporarily at least, now ahead of OpenAI in some of these benchmarks.

I'd assume that OpenAI's next-gen model (GPT-5 or whatever they will choose to call it) has already finished training and is now being fine-tuned and evaluated for safety, but Anthropic's raison d'être is safety and I doubt they have skimped on this to rush this model out.

appplication
9 replies
2h58m

What this really says to me is the indefensibility of any current advances. There's really cool stuff going on right now, but anyone can do it. Not to say anyone can push the limits of research, but once the cat's out of the bag, anyone with a few $B and a dozen engineers can replicate a model that's indistinguishably good from best in class to most users.

zurfer
4 replies
2h42m

Barrier to entry with "few $B" is pretty high. Especially since the scaling laws indicate that it's only getting more expensive. And even if you manage to raise $Bs, you still need to be clever on how to deploy it (talent, compute, data) ...

appplication
3 replies
2h2m

You’re totally right, a few $B is not something any of us are bootstrapping. But there is no secret sauce (at least none that stays secret for long), no meaningful patents, no network/platform effect, and virtually no ability to lock in customers.

Compare to other traditional tech companies… think Uber/AirBnB/Databricks/etc. Their product isn’t an algorithm that a competitor can spin up in 6 months. These companies create real moats, for better or worse, which significantly reduce the ability for competitors to enter, even with tranches of cash.

In contrast, essentially every product we’ve seen in the AI space is very replicable, and any differentiation is largely marginal, under the hood, and the details of which are obscured from customers.

zurfer
2 replies
1h50m

Every big tech in the beginning looked fragile/no moats.

I think we'll see that data, knowledge and intelligence compound and at some point it will be as hard to penetrate as Meta's network effects.

HarHarVeryFunny
1 replies
1h34m

Maybe consolidate as well as compound. There's a tendency for any mature industry (which may initially have been bustling with competitors) to eventually consolidate into three players, and while we're currently at the point where it seems a well-funded new entrant can catch up with the leaders, that will likely become much harder in the future as tech advances.

Never say never though - look at Tesla coming out of nowhere to push the big three automakers around! Eventually the established players become too complacent and set in their ways, creating an opening for a smaller more nimble competitor with a better idea.

I don't think LLMs are the ultimate form of AI/AGI though. Eventually we'll figure out a better brain-inspired approach that learns continually from its own experimentation and experience. Perhaps this change of approach will be when some much smaller competitor (someone like John Carmack, perhaps) rapidly comes from nowhere and catches the big three flat-footed as they tend to their ginormous LLM training sets, infrastructure and entrenched products.

lanstin
0 replies
1h3m

Also worth keeping in mind the lock-in for the big tech firms is due to business decisions, not the technology per se. If we had, say, micropayments in http1 headers in 1998, we might have a much more decentralized system supported by distributed subscriptions rather than ads. To this day I cannot put up $50 to Mastodon and have it split amongst the posts I like or boost or whatever. Instead we have all the top content authors trying to get me to subscribe to their email subscriptions, which is a vastly inferior interface and too expensive to get money to all the good writers out there.

HarHarVeryFunny
3 replies
2h13m

Yes, it seems that AI in form of LLMs is just an idea whose time has come. We now have the compute, the data, and the architecture (transformer) to do it.

As far as different groups leapfrogging each other for supremacy in various benchmarks, there might be a bit of a "4 minute mile" effect here too - once you know that something is possible then you can focus on replicating/exceeding it without having to worry are you hitting up against some hard limit.

I think the transformer still doesn't get the credit due for enabling this LLM-as-AI revolution. We've had the compute and data for a while, but this breakthough - shared via a public paper - was what has enabled it and made it essentially a level playing field for anyone with the few $B etc the approach requires.

I've never seen any claim by any of the transformer paper ("Attention Is All You Need") authors that they understood/anticipated the true power of this model they created (esp. when applied at scale), which as the title suggests was basically regarded as an incremental advance over other seq2seq approaches of the time. It seems like one of history's great accidental discoveries. I believe there is something very specific about the key-value matching "attention" mechanism of the transformer (perhaps roughly equivalent to some similar process used in our cortex?) that gives it its power.

visarga
2 replies
1h19m

> We now have the compute, the data, and the architecture (transformer) to do it.

It's really not the model, it's the data and scaling. Otherwise the success of different architectures like Mamba would be hard to justify. Conversely, humans getting training on the same topics achieve very similar results, even though brains are very different at low level, not even the same number of neurons, not to mention different wiring.

The merit for our current wave is 99% on the training data, its quality and size are the true AI heroes. And it took humanity our whole existence to build up to this training set, it cost "a lot" to explore and discover the concepts we put inside it. A single human, group or even a whole generation of humans would not be able to rediscover it from scratch in a lifetime. Our cultural data is smarter than us individually, it is as smart as humanity as a whole.

One consequence of this insight is that we are probably on an AI plateau. We have used up most organic text. The next step is AI generating its own experiences in the world, but it's going to be a slow grind in many fields where environment feedback is not easy to obtain.

HarHarVeryFunny
1 replies
55m

> It's really not the model, it's the data and scaling. Otherwise the success of different architectures like Mamba would be hard to justify.

My take is that prediction, however you do it, is the essence of intelligence. In fact, I'd define intelligence as the degree of ability to correctly predict future outcomes based on prior experience.

The ultimate intelligent architecture, for now, is our own cortex, which can be architecturally analyzed as a prediction machine - utilizing masses of perceptual feedback to correct/update predictions of how the perceptual scene, and results of our own actions, will evolve.

With prediction as the basis of intelligence, any model capable of predicting - to varying degrees of success - will be perceived to have a commensurate degree of intelligence. Transformer-based LLMs of course aren't the only possible way to predict, but they do seem significantly better at it than competing approaches such as Mamba or the RNN (LSTM etc) seq2seq approaches that were the direct precursor to the transformer.

I think the reason the transformer architecture is so much better than the alternatives, even if there are alternatives, is down to this specific way it does it - able to create these attention "keys" to query the context, and the ways that multiple attention heads learn to coordinate such as "induction heads" copying data from the context to achieve in-context learning.

visarga
0 replies
28m

If you invented the transformer but didn't have trillions of tokens to train it with, no chatGPT. But if you had Mamba/RWKV/SSSM and trillions of tokens you would have almost the same thing with chatGPT.

The training set is magical. It took humanity a long time to discover all the nifty ideas we have in it. It's the result of many generations of humans working together, using language to share their experience. Intelligence is a social process, even though we like to think about keys and queries, or synapses and neurotransmitters, in fact it is the work of many people that made it possible.

And language is that central medium between all of us, an evolutionary system of ideas, evolving at a much faster rate than biology. Now AI have become language replicators like humans, a new era in the history of language has begun. The same language trains humans and LLMs to achieve similar sets of abilities.

aaomidi
8 replies
3h26m

Anthropic is also not really a traditional startup. It’s just some large companies in a trench coat.

hobofan
7 replies
3h16m

How so? Because they have taken large investments from Amazon and Google? Or would you also characterize OpenAI as "Microsoft in a trench coat"?

bugglebeetle
4 replies
2h21m

100% OpenAI is Microsoft in a trenchcoat.

HarHarVeryFunny
3 replies
2h4m

They are funded mostly by Microsoft, and dependent on them for compute (which is what this funding is mostly buying), but I'd hardly characterize that as meaning they are "Microsoft in a trenchcoat". It's not normal to identify startups as being their "VC in a trenchcoat", even if they are dependent on the money for growth.

bugglebeetle
2 replies
1h39m

Satya Nadella during the OpenAI leadership fiasco: “We have all of the rights to continue the innovation, not just to serve the product, but we can, you know, go and just do what we were doing in partnership ourselves. And so we have the people, we have the compute, we have the data, we have everything.”

Doesn’t sound like a startup-investor relationship to me!

HarHarVeryFunny
1 replies
30m

Sure, but that's just saying that Microsoft as investor has some rights to the underlying tech. There are limits to this though, which we may fairly soon be nearing. I believe the agreement says that Microsoft's rights to the tech (model + weights? training data? -- not sure how specific it is) end once AGI is achieved, however that is evaluated.

But again, this is not to say that OpenAI is "Microsoft in a trenchcoat". Microsoft don't have developers at OpenAI, weren't behind the tech in any way, etc. Their $10B investment bought them some short-term insurance in limited rights to the tech. It is what it is.

bugglebeetle
0 replies
9m

“We have everything” is not “some underlying rights to the tech.” I dunno what the angle is on minimizing here, but I’ll take the head of Microsoft at his word vs. more strained explanations about why this isn’t the case.

pavlov
0 replies
2h50m

> 'would you also characterize OpenAI as "Microsoft in a trench coat"?'

Elon Musk seems to think that, based on his recent lawsuit.

I wouldn't agree but the argument has some validity if you look at the role Microsoft played in reversing the Altman firing.

aaomidi
0 replies
28m

Absolutely to OpenAI being Microsoft in a trench coat.

This is not an uncommon tactic for companies to use.

lr1970
0 replies
2h31m

> Bear in mind that GPT-3 was published ("Language Models are Few-Shot Learners") in 2020, and Anthropic were only founded after that in 2021.

Keep in mind that Anthropic was founded by former OpenAI people (Dario Amodei and others). Both companies share a lot of R&D "DNA".

bugglebeetle
4 replies
3h22m

MMLU is pretty much the only stat on there that matters, as it correlates to multitask reasoning ability. Here, they outpace GPT-4 by a smidge, but even that is impressive because I don't think anyone else's model has to date.

rafaelero
1 replies
2h39m

MMLU is garbage. A lot of incorrect answers there.

bugglebeetle
0 replies
2h28m

And yet it’s still a good indicator of general performance. Any model that scores under GPT-4 on that benchmark, but above it in other, tends to be worse overall.

jasonjmcghee
0 replies
2h42m

I still don't trust benchmarks, but they've come a long way.

It's genuinely outperforming GPT4 in my manual tests.

hackerlight
0 replies
3h14m

How can they avoid the contents from leaking into the training set somewhere in their internet scrape?

imjonse
0 replies
3h22m

From the blog's footnote:

"In addition, we’d like to note that engineers have worked to optimize prompts and few-shot samples for evaluations and reported higher scores for a newer GPT-4T model"

wesleyyue
25 replies
3h36m

Just added Claude 3 to Chat at https://double.bot if anyone wants to try it for coding. Free for now and will push Claude 3 for autocomplete later this afternoon.

From my early tests this seems like the first API alternative to GPT4. Huge!

addandsubtract
6 replies
3h25m

So double is like copilot, but free? What's the catch?

behnamoh
4 replies
3h11m

I guess your data is the catch.

parkersweb
1 replies
13m

Interesting - I had this exact question and tried the search on the website to find the answer with no result :D

Would be great to have an FAQ for this type of common question

wesleyyue
0 replies
9m

Thanks for the feedback – what search terms did you use? Let me make sure those keywords are on the page :P

ShamelessC
0 replies
1h30m

Probably not data so much as growth numbers to appease investors. Such offerings typically don’t last forever. Might as well take advantage while it lasts.

wesleyyue
0 replies
3h11m

No catch. We're pretty early tbh so mostly looking to get some early power users and make the product great before doing a big launch. It's been popular with yc founders in the latest batches thus far but we haven't really shared publicly. We'll charge when we launch. If you try it now, I hope you'll share anything you liked and didn't like with us!

behnamoh
4 replies
2h37m

To be clear: Is this Claude 3 Opus or the Sonnet model?

wesleyyue
3 replies
2h34m

opus. only the best!

behnamoh
2 replies
2h29m

Awesome! I like the inline completions.

But could you let the users choose their keyboard shortcuts before setting the default ones?

wesleyyue
1 replies
1h51m

Thanks for the feedback. I was actually reworking the default shortcuts and the onboarding process when I got pre-empted by claude. I was planning to change the main actions to alt-j, alt-k to minimize conflicts.

Are you asking because it conflicts with an existing shortcut on your setup? Or another reason?

behnamoh
0 replies
1h37m

Yes, it conflicts with some of my other shortcuts, but more generally, I think it'd be better to have consistent shortcuts, like CMD-CTRL-i for inline completion, CMD-CTRL-c for chat, etc.

brainless
2 replies
3h3m

Hey Wesley, I just checked Double. Do you plan to support open source models hosted locally or on a cloud instance? Asking out of curiosity as I am building a product in the same space and have had a few people ask this. I guess since Double is an extension in IDEs, it can connect to AI models running anywhere.

wesleyyue
1 replies
2h29m

it's an interesting idea. We asked our users this as well but at least for those we talked to, running their own model wasn't a big priority. What actually mattered to them is being able to try different (but high performance) models, privacy (their code not being trained on), and latency. We have some optimizations around time-to-first-token latency that would be difficult to do if we didn't have information about the model and their servers.

brainless
0 replies
1h27m

I see. Thanks Wesley for sharing and great to know it is not a priority. Also, the Mistral situation kinda makes me feel that big corps will want to host models.

Although, I feel Apple will break this trend and bring models to their chips rather than run them on the cloud. "Privacy first" will simply be a selling point for them but generally speaking cloud is not a big sell for them.

I am not at the level to do much optimizations, plus my product is a little more generic. To get to MVP, prompt engineering will probably be my sole focus.

098799
2 replies
2h42m

Emacs implementation when? ;)

behnamoh
0 replies
1h20m

If you use Emacs you're expected to know your way around programming and not need copilots :)

BeetleB
0 replies
2h25m

I just checked - surprisingly I cannot find any Emacs AI implementation that supports Claude's API.

trenchgun
1 replies
1h34m

How do I change GPT4 to Claude 3 in double.bot?

wesleyyue
0 replies
1h22m

It's default to claude 3 right now so I could get it out quick, but working on a toggle for the front-end now to switch between the two.

wesleyyue
0 replies
1h2m

I think the tldr would be that they have more products (for example, their agent to write git commit messages). In the products we do have (autocomplete, chat), we spend a lot of time to get the details right. For example for autocomplete:

* we always close any brackets opened by autocomplete (and never extra brackets, which is the most annoying thing about github copilot)

* we automatically add import statements for libraries that autocomplete used

* mid-line completions

* we turn off autocomplete when you're writing a comment to avoid disrupting your train of thought

You can read more about these small details here: https://docs.double.bot/copilot

As you noted we don't have a vim integration yet, but it is on our roadmap!

wesleyyue
0 replies
33m

more early impressions on performance: besides the endpoint erroring out at a higher rate than openai, time-to-first-token is also much slower :(

p50: 2.14s p95: 3.02s

And these aren't super long prompts either. vs gpt4 ttft:

p50: 0.63s p95: 1.47s

wesleyyue
0 replies
3h10m

Seems like the API is less reliable than GPT-4 so far, but I guess it makes sense for the endpoint to be popular at launch!

trenchgun
0 replies
2h59m

Very nice!

RugnirViking
22 replies
3h59m

I don't put a lot of stock on evals. many of the models claiming gpt-4 like benchmark scores feel a lot worse for any of my use-cases. Anyone got any sample output?

Claude isn't available in EU yet, else i'd try it myself. :(

hackerlight
10 replies
3h53m

One good sign is they're only a slight improvement on knowledge recall evals but a big improvement on code and reasoning evals. Hope this stands up to scrutiny and we get something better than GPT-4 for code generation. Although the best model is a lot more expensive.

ethbr1
9 replies
3h48m

On the other hand, programmers are very expensive.

At some level of accuracy and consistency (human order-of-magnitude?), the pricing of the service should start approaching the pricing of the human alternative.

And first glance at numbers, LLMs are still way underpriced relative to humans.

SubiculumCode
6 replies
3h26m

NVidia's execs think so.

It would be ironic if it were open source that killed the programmer; after all, how would they train it otherwise?

As a scientist, should I continue to support open access journals, just so I can be trained away?

Slightly tongue in cheek, but not really.

ethbr1
3 replies
2h58m

I have a suspicion that greenfield science will be the last thing automated, at least the non-brute-force kind. AI assistants to do the drudgery (smart search agents), but not pick the directions to proceed in.

Too little relevant training data in niche, state of the art topics.

But to the broader point, isn't this progress in a nutshell?

(1) Figure out a thing can be done, (2) figure out how to manufacture with humans, (3) maximize productivity of human effort, (4) automate select portions of the optimized and standardized process, (5) find the last 5% isn't worth automating, because it's too branchy.

From that perspective, software development isn't proceeding differently than any other field historically, with the added benefit that all its inputs and outputs are inherently digital.

SubiculumCode
2 replies
2h55m

I think that picking a direction is not that hard, and I don't know that AI couldn't do it better. I'm not sure mid-tier CEO's won't be on their way out, just like middle management.

ethbr1
1 replies
1h59m

I was talking more about science.

On the people-direction side, I expect the span of control will substantially broaden, which will probably lead to fewer manager/leader jobs (that pay more).

You'll always need someone to do the last 5% that it doesn't make sense to data engineer inputs/outputs into/from AI.

SubiculumCode
0 replies
1h34m

Yeah. Right now, it's been helping me be more productive in my science by writing code quicker... mainly on the data management side of things.

I do however wonder, at what point do I just describe the hypothesis, point to the data files, and have it design an analysis pipeline, produce the results, interpret the results, then suggest potential follow-up hypotheses, do a literature search on that, then have it write up the grant for it.

bugglebeetle
0 replies
2h6m

> As a scientist, should I continue to support open access journals, just so I can be trained away?

If science was reproducible from articles posted in open access journals, we wouldn't have half the problems we have with advancing research now.

Slightly tongue in cheek, but not really.

Der_Einzige
0 replies
1h5m

This is also why I have about negative sympathy for artists who are crying about AI taking their jobs.

Programmers (specifically AI researchers) looked at their 300K+ a year salaries and embraced the idea of automating away the work despite how lucrative it would be to continue to spin one's wheels on it. The culture of open source is strong among SWEs, even ones who would lose millions of unrealized gains/earnings as a result of embracing it.

Artists looked at their 30K+ a year salaries from drawing furry hentai on furaffinity and panicked at the prospect of losing their work, to the point of making whole political protest movements against AI art. Artists have also never defended open source en masse, and are often some of the first to defend crappy IP laws.

Why be a luddite over something so crappy to defend?

hackerlight
0 replies
3h26m

The value/competency may approach that of a human but the price won't necessarily follow. Price will be determined by market forces. If compute is cheap and competition is fierce then the price can be near free even if it is at human-level intelligence. Then there will be a lot of surplus value created because buyers would be happy to pay $50/million tokens but only have to pay $0.1/million tokens thanks to competition. Frontier models will probably always be expensive though, because frontier by definition means you're sucking up all the available compute which will probably always be expensive.

Workaccount2
0 replies
3h12m

Not to be the bearer of bad news, but the pricing of the human alternative is what approaches the cost of the service, not the other way around.

Alifatisk
6 replies
3h55m

> Claude isn't available in EU yet, else i'd try it myself.

I'm currently in EU and I have access to it?

egeozcan
5 replies
3h45m

AFAIK there's no strict EU ban but no EU country is listed here:

https://www.anthropic.com/claude-ai-locations

Perhaps you meant Europe the continent or using a VPN?

edit: They seem to have updated that list after I posted my comment, the outdated list I based my comment on: https://web.archive.org/web/20240225034138/https://www.anthr...

edit2: I was confused. There is another list for API regions, which has all EU countries. The frontend is still not updated.

Alifatisk
0 replies
3h10m

When I go to my account settings, it says my country is invalid haha

AlanYx
0 replies
2h51m

That's the list of countries supported by the API. For some reason, they support fewer countries through their frontend. I'm curious why that is.

Alifatisk
0 replies
3h12m

> AFAIK there's no strict EU ban but no EU country is listed here

That's really weird, I just signed up with no issues and my country together with some other EU countries was listed. Now when I try to signup a new account, it says that my region is not supported.

I still have the sms verification from them as proof.

swalsh
0 replies
2h47m

I've also seen the opposite, where tiny little 7B models get real close to GPT4-quality results on really specific use cases. If you're trying to scale just that use case it's significantly cheaper, and also faster, to just scale up inference with that specialty model. An example of this is using an LLM to extract medical details from a record.

phillipcarter
0 replies
3h28m

> I don't put a lot of stock on evals.

Same, although they are helpful for setting expectations for me. I have some use cases (I'm hesitant to call them evals) related to how we use GPT for our product that are a good "real world" test case. I've found that Claude models are the only ones that are up to par with GPT in the past.

lelag
0 replies
3h54m

You can use Claude 2.1 on openrouter. Hopefully, they will be able to add the Claude 3 family too.

avereveard
0 replies
3h40m

I think AWS has Claude in Frankfurt; not the new one, but Instant and 2 should be there.

vermorel
13 replies
3h53m

Does any of those LLM-as-a-service companies provide a mechanism to "save" a given input? Paying only for the state storage and the extra input when continuing the completion from the snapshot?

Indeed, at 1M tokens and $15/M tokens, we are talking about $10+ per API call when maxing out the LLM's capacity.

I see plenty of use cases for such a big context, but re-paying, at every API call, to re-submit the exact same knowledge base seems very inefficient.

Right now, only ChatGPT (the webapp) seems to be using such snapshots.

Am I missing something?

ethbr1
5 replies
3h51m

How would that work technically, from a cost of goods sold perspective? (honestly asking, curious)

vermorel
2 replies
3h39m

The "cost" is storing the state of the LLM after processing the input. My back-of-the-envelop guesstimate gives me 1GB to capture the 8bit state of 70B parameters model (I might be wrong though, insights are welcome), which is quite manageable with NVMe storage for fast reload. The operator would charge per pay per "saved" prompt, plus maybe a fix per call fee to re-load the state.

YetAnotherNick
0 replies
2h14m

My calculation of KV cache gives 1GB per 3000 tokens for fp16. I am surprised OpenAI competitors haven't done this. This kind of feature has not-so-niche uses where prefix data could be cached.
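
For anyone redoing that estimate: what has to be snapshotted per conversation is the KV cache, not the weights. A back-of-the-envelope version, assuming a 70B-class model with grouped-query attention (80 layers, 8 KV heads of head dim 128, fp16); the exact figures depend on the architecture:

  layers, kv_heads, head_dim, bytes_per_val = 80, 8, 128, 2    # fp16
  per_token = 2 * layers * kv_heads * head_dim * bytes_per_val  # K and V for every layer
  print(per_token)        # 327680 bytes, i.e. ~0.33 MB per token
  print(1e9 / per_token)  # ~3050 tokens per GB, roughly the "1GB per 3000 tokens" above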

FergusArgyll
0 replies
2h5m

That's a great idea! It would also open up the possibility for very long 'system prompts' on the side of the company, so they could better fine-tune their guardrails

cjbprime
1 replies
3h44m

I think the answer's in the original question: the provider has to pay for extra storage to cache the model state at the prompt you're asking to snapshot. But it's not necessarily a net increase in costs for the provider, because in exchange for doing so they (as well as you) are getting to avoid many expensive inference rounds.

datadrivenangel
0 replies
3h39m

Isn't the expensive part keeping the tokenized input in memory?

msp26
3 replies
3h1m

> I see plenty of use cases for such a big context, but re-paying, at every API call, to re-submit the exact same knowledge base seems very inefficient.

If you don't care about latency or can wait to set up a batch of inputs in one go there's an alternative method. I call it batch prompting and pretty much everything we do at work with gpt-4 uses this now. If people are interested I'll do a proper writeup on how to implement it but the general idea is very straightforward and works reliably. I also think this is a much better evaluation of context than needle in a haystack.

Example for classifying game genres from descriptions.

Default:

[Prompt][Functions][Examples][game description]

->

{"genre": [genre], "sub-genre": [sub-genre]}

Batch Prompting:

[Prompt][Functions][Examples]<game1>[description]</game><game2>[description]</game><game3>[description]</game>...

->

{"game1": {...}, "game2": {...}, "game3": {...}, ...}

hobofan
2 replies
2h50m

I attempted similar mechanics multiple times in the past, but always ditched them, as there was always a non-negligible amount of cross-contamination happening between the individual instances you are batching. That caused so much of a headache that it wasn't really worth it.

vermorel
0 replies
1h44m

Agreed, same problem here.

msp26
0 replies
1h14m

Yeah that's definitely a risk with language models but it doesn't seem to be too bad for my use cases. Can I ask what tasks you used it for?

I don't really intend for this method to be final. I'll switch everything over to finetunes at some point. But this works way better than I would have expected so I kept using it.

phillipcarter
1 replies
3h25m

FWIW the use case you're describing is very often achievable with RAG. Embedding models are deterministic, so while you're still limited by the often-nondeterministic nature of the LLM, in practice you can usually get the same answer for the same input. And it's substantially cheaper to do.

vermorel
0 replies
1h41m

With 1M tokens, if snapshotting the LLM state is cheap, it would beat out-of-the-box nearly all RAG setups, except the ones dealing with large datasets. 1M tokens is a lot of docs.

lmeyerov
0 replies
2h0m

Yes: That's essentially their fine-tuning offerings. They rewrite some weights in the top layers based on your input, and save+serve that for you.

It sounds like you would like a wrapped version tuned just for big context.

(As others write, RAG versions are also being supported, but they're less fundamentally similar. RAG is about preprocessing to cut the input down to relevant bits. RAG + an agent framework does get closer again tho by putting this into a reasoning loop.)

up6w6
12 replies
3h56m

The Opus model that seems to perform better than GPT4 is unfortunately much more expensive than the OpenAI model.

Pricing (input/output per million tokens):

GPT4-turbo: $10/$30

Claude 3 Opus: $15/$75
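
For a feel of what that difference means per call (a quick sketch with an illustrative 10K-input / 1K-output request):

  def call_cost(in_tokens, out_tokens, in_per_m, out_per_m):
      return in_tokens / 1e6 * in_per_m + out_tokens / 1e6 * out_per_m

  print(call_cost(10_000, 1_000, 10, 30))  # GPT4-turbo:    ~$0.13
  print(call_cost(10_000, 1_000, 15, 75))  # Claude 3 Opus: ~$0.225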

declaredapple
4 replies
3h49m

Yeah, the output pricing I think is really interesting: 150% more expensive input tokens, 250% more expensive output tokens. I wonder what's behind that?

That suggests the inference time is more expensive than the memory needed to load it in the first place, I guess?

flawn
2 replies
3h42m

Either something like that or just because the model's output is basically the best you can get and they utilize their market position.

Probably that and what you mentioned.

brookst
1 replies
2h47m

This. Price is set by value delivered and what the market will pay for whatever capacity they have; it’s not a cost + X% market.

declaredapple
0 replies
49m

I'm more curious about the input/output token discrepancy

Their pricing suggests that either output tokens are more expensive for some technical reason, or they're trying to encourage a specific type of usage pattern, etc.

BeetleB
0 replies
2h20m

> 150% more expensive input tokens 250% more expensive output tokens, I wonder what's behind that?

Nitpick: It's 50% and 150% more respectively.

mrtksn
3 replies
3h44m

That's quite expensive indeed. At full context of 200K, that would be at least $3 per use. I would hate it if I receive a refusal as answer at that rate.

jorgemf
2 replies
3h40m

cost is relative. how much would it cost for a human to read and give you an answer for 200k tokens? Probably much more than $3.

vinay_ys
0 replies
3h26m

You are not going to take the expensive human out of the loop where downside risk is high. You are likely to take the human out of the loop only in low risk low cost operations to begin with. For those use cases, these models are quite expensive.

jakderrida
0 replies
3h14m

Yeah, but the human tends not to get morally indignant because my question involves killing a process to save resources.

hackerlight
1 replies
3h43m

Their smallest model outperforms GPT-4 on Code. I'm sceptical that it'll hold up to real world use though.

nopinsight
0 replies
3h6m

Just a note that the 67.0% HumanEval figure for GPT-4 is from its first release in March 2023. The actual performance of current ChatGPT-4 on similar problems might be better due to OpenAI's internal system prompts, possible fine-tuning, and other tricks.

chadash
0 replies
3h41m

There’s a market for that though. If I am running a startup to generate video meeting summaries, the price of the models might matter a lot, because I can only charge so much for this service. On the other hand, if I’m selling a tool to have AI look for discrepancies in mergers and acquisitions contracts, the difference between $1 and $5 is immaterial… I’d be happy to pay 5x more for software that is 10% better because the numbers are so low to begin with.

My point is that there’s plenty of room for high priced but only slightly better models.

jimbokun
12 replies
3h19m

If you showed someone this article 10 years ago, they would say it indicates Artificial General Intelligence has arrived.

behnamoh
4 replies
3h7m

That's the good thing about intelligence: We have no fucking clue how to define it, so the goalpost just keeps moving.

bryanlarsen
1 replies
2h52m

In both directions. There are a set of people who are convinced that dolphins, octopi and dogs have intelligence, but GPT et al don't.

I'm in the camp that says GPT4 has it. It's not a superhuman level of general intelligence, far from it, but it is a general intelligence that's doing more than regurgitation and rules-following.

namero999
0 replies
2h8m

How's a GPT not rules-following?

brookst
0 replies
2h59m

Intelligence is tough but tractable. Consciousness / sentience, on the other hand, is a mess to define.

Workaccount2
0 replies
3h3m

I'd argue the goalpost is already past what some, albeit small, group of humans are capable of.

kylebenzle
3 replies
3h11m

1. It's an advertisement/press release, not so much an "article".

2. This would NOT be called even "AI" but "machine learning" 10 years ago. We started using AI as a marketing term for ML about a year ago.

dangond
1 replies
3h3m

This absolutely would be called AI 10 years ago. Yes, it's a machine learning task, but a computer program you can speak with would certainly qualify as AI to anyone 10 years ago, if not several decades prior as well.

brookst
0 replies
2h58m

Agree. ML is the implementation, AI is the customer benefit.

2c2c2c
0 replies
3h1m

I can recall AI being used to describe anything involving neural nets by laymen since google deepmind. approaching 10 years

appplication
0 replies
3h3m

Eh. I think 10 years ago we dreamed a little bigger. These models are impressive, but deeply flawed and entirely unintelligent.

_sword
12 replies
3h59m

At this point I wonder how much of the GPT-4 advantage has been OpenAI's pre-training data advantage vs. fundamental advancements in theory or engineering. Has OpenAI mastered deep nuances others are missing? Or is their data set large enough that most test-cases are already a sub-set of their pre-training data?

avereveard
4 replies
3h50m

So far gpt is the only one able to answer to variations of these prompts https://www.lesswrong.com/posts/EHbJ69JDs4suovpLw/testing-pa... it might be trained on these but still you can create variations and get decent responses

Most other model fail on basic stuff like the python creator on stack overflow question, they identify Guido as the python creator, so the knowledge is there, but they don't make the connection.

staticman2
3 replies
2h54m

>So far gpt is the only one able to answer to variations of these prompts

You're saying that when Mistral Large launched last week you tested it on (among other things) explaining jokes?

avereveard
2 replies
2h41m

Sorry I did what? When?

staticman2
1 replies
2h24m

You linked to a lesswrong post with prompts asking the AI to explain jokes (among other tasks?) and said only Openai models can do it, didn't you? I'm confused why you said only OpenAI models can do it?

avereveard
0 replies
1h56m

Ah, sorry if it wasn't clear. Below the jokes there are a few inference prompts, and so far I didn't see Claude or others reason the same way as PaLM or GPT-4 (GPT-3.5 did get some wrong). I haven't had time to test Mistral Large yet, though; Mixtral didn't get them right.

ankit219
3 replies
3h45m

More than pretraining data, I think the advantage was ChatGPT and how quickly it grew. Remember it was 3.5, and within a month or two, it generated so many actual q&a pairs with rating, feedback, and production level data of how a model will be used by actual users. Those queries and subsequent RLHF + generating better answers for the questions meant the model would have been improved a lot at the SFT stage. Think this is the reason why Anthropic, Google, and Mistral, all three launched their own chatbots, all providing it to users for free and getting realtime q&a data for them to finetune the models on. Google did it with bard too, but it was so bad that not many used it.

simonw
2 replies
3h38m

My understanding is that GPT-4 had been almost fully trained before ChatGPT was released - they spent around six months testing GPT-4 before making it available to the public. ChatGPT came out November 30th 2022, GPT-4 came out March 14th 2023.

But maybe that was still enough time for them to instruction tune it based on ChatGPT feedback, or at least to focus more of their fine tuning iteration in the areas they learned were strong or weak for 3.5 based on ChatGPT usage?

vitorgrs
0 replies
8m

Also, worth remembering that Bing Chat was launched on February 7th with GPT-4 already.

ankit219
0 replies
3h19m

I don't think it was pretrained on knowledge gaps. A version was already available in testing w select customers. The version released to the public would definitely have feedback from those customers, and finetuned/instruction tuned on the data from ChatGPT.

Training data is publicly available internet (and accessible to everyone). It's the SFT step w high quality examples which determines how well a model is able to answer questions. ChatGPT's virality played a part in that in the sense that OAI got the real world examples + feedback others did not have. And yeah, it would have been logical to focus on 3.5's weaknesses too. From Karpathy's videos, it seems they hired a contractual labelling firm to generate q&a pairs.

swalsh
0 replies
3h10m

There was a period of time where data was easily accessible, and Open AI suctioned up as much of it as possible. Places have locked the doors since then realizing someone was raiding their pantry.

To get that dataset now would take significantly more expense.

lumost
0 replies
3h48m

This may explain the substantial performance increase in proprietary models over the last 6 months. It also may explain why OpenAI and others had to drop open models. Distributing copyrighted material via model weights would be problematic.

HarHarVeryFunny
0 replies
3h23m

I'd guess a bit of both, perhaps more on the data side. One could also flip the question and ask how is this new Anthropic model able to beat GPT-4 in some benchmarks?

As far as data, OpenAI haven't just scraped/bought existing data, they have also on a fairly large scale (hundreds of contractors) had custom datasets created, which is another area they may have a head start unless others can find different ways around this (e.g. synthetic data, or filtering for data quality).

Altman has previously said (on Lex's podcast I think) that OpenAI (paraphrasing) is all about results and have used some ad-hoc approaches to achieve that, without hinting at what those might be. But, given how fast others like Anthropic and Google are catching up I'd assume each has their own bag of tricks too, whether that comes down to data and training or architectural tweaks.

spyder
7 replies
3h14m

What's up with the weird list of the supported countries?

It isn't available in most European countries (except for Ukraine and UK) but on the other hand a lot of African countries are listed...

https://www.anthropic.com/claude-ai-locations

JacobiX
2 replies
2h42m

Arbitrary region locking : for example supported in Algeria and not in the neighboring Tunisia ... both are in North Africa

VWWHFSfQ
1 replies
2h30m

There's nothing arbitrary about it and both being located in North Africa means nothing. Tunisia has somewhat strict personal data protection laws and Algeria doesn't. That's the difference.

JacobiX
0 replies
2h14m

I know both countries, and in Algeria the Law No. 18-07, effective since August 10, 2023, establishes personal data protection requirements with severe penalties. The text is somewhat more strict than Tunisia.

hobofan
0 replies
3h9m

I think that's not the updated list, but a different list.

https://www.anthropic.com/supported-countries lists all the countries for API access, where they presumably offload a lot more liability to the customers to ensure compliance with local regulations.

https://www.anthropic.com/claude-ai-locations lists all supported countries for the ChatGPT-like interface (= end-user product), under claude.ai, for which they can't ensure that they are complying with EU regulations.

brookst
0 replies
2h56m

EU has chosen to be late to tech in favor of regulations that seek to make a more fair market. Releasing in the EU is hard.

VWWHFSfQ
0 replies
2h52m

I seem to remember Google Bard was limited in Europe as well because there was just too much risk getting slapped by the EU regulators for making potentially unsafe AI accessible to the European public.

nopinsight
7 replies
1h33m

The APPS benchmark result of Claude 3 Opus at 70.2% indicates it might be quite useful for coding. The dataset measures the ability to convert problem descriptions to Python code. The average length of a problem is nearly 300 words.

Interestingly, no other top models have published results on this benchmark.

Claude 3 Model Card: https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bb...

Table 1: Evaluation results (more datasets than in the blog post) https://twitter.com/karinanguyen_/status/1764666528220557320

APPS dataset: https://huggingface.co/datasets/codeparrot/apps

APPS dataset paper: https://arxiv.org/abs/2105.09938v3

nopinsight
2 replies
58m

“Claude 3 gets ~60% accuracy on GPQA. It's hard for me to understate how hard these questions are—literal PhDs (in different domains from the questions) [spending over 30 minutes] with access to the internet get 34%.

PhDs in the same domain (also with internet access!) get 65% - 75% accuracy.” — David Rein, first author of the GPQA Benchmark. I added text in […] based on the benchmark paper’s abstract.

https://twitter.com/idavidrein/status/1764675668175094169

GPQA: A Graduate-Level Google-Proof Q&A Benchmark https://arxiv.org/abs/2311.12022

wbarber
0 replies
34m

What's to say this isn't just a demonstration of memorization capabilities? Rephrasing the logic of the question, or even just randomizing the order of the multiple-choice answers, often dramatically impacts performance. For example, every model in the Claude 3 family repeats the memorized solution to the lion, goat, and wolf riddle regardless of how I modify the riddle.

lukev
0 replies
23m

This doesn't pass the sniff test for me. Not sure if these models are memorizing the answers or something else, but it's simply not the case that they're as capable as a domain expert (yet.)

I do not have a PhD, but in areas where I do have expertise, you really don't have to push these models that hard before they start to break down and emit incomplete or wrong analysis.

eschluntz
2 replies
1h3m

(full disclosure, I work at Anthropic) Opus has definitely been writing a lot of my code at work recently :)

bwanab
1 replies
55m

Sounds almost recursive.

berniedurfee
0 replies
42m

Inceptive

vok
0 replies
50m

APPS has 3 subsets by difficulty level: introductory, interview, and competition. It isn't clear which subset Claude 3 was benchmarked on. Even if it is just "introductory" it is still pretty good, but it would be good to know.
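
One way to check which tier a score is sensitive to is to split the set yourself; the Hugging Face copy of APPS carries a per-problem difficulty label, so a rough sketch (field and split names taken from the dataset card, not verified against what Anthropic actually ran) looks like:

    from datasets import load_dataset

    # APPS tags each problem as introductory / interview / competition.
    # Newer versions of `datasets` may ask for trust_remote_code=True here.
    apps = load_dataset("codeparrot/apps", split="test")
    intro = apps.filter(lambda ex: ex["difficulty"] == "introductory")
    print(len(apps), "problems total,", len(intro), "introductory")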

SirensOfTitan
7 replies
2h38m

What is the probability that newer models are just overfitting various benchmarks? A lot of these newer models seem to underperform GPT-4 in most of my daily queries, but I'm obviously swimming in the world of anecdata.

jasondclinton
3 replies
1h35m

We are tracking LMSys, too. There are strange safety incentives on this benchmark: you can “win” points by never blocking adult content for example.

adam_arthur
2 replies
1h15m

Seems perfectly valid to dock points for a model that isn't as useful to the user.

"Safety" is something asserted by the model creator, not something asked for by users.

mediaman
1 replies
59m

People like us are not the real users.

Corporate users of AI (and this is where the money is) do want safe models with heavy guardrails.

No corporate AI initiative is going to use an LLM that will say anything if prompted.

adam_arthur
0 replies
54m

And the end users of those models will be (mostly) frustrated by safety guardrails, thus perceive the model as worse and rank it lower.

moffkalast
0 replies
45m

Opus and Sonnet seem to be already available for direct chat on the arena interface.

nprateem
0 replies
1h12m

The fact it consistently beats the competition's benchmark scores by 0.1% tells me everything I need to know.

toxik
6 replies
3h10m

Europeans, don't bother signing up - it will not work and it will only tell you once it has your e-mail registered.

maelito
3 replies
3h9m

Why is that? Thanks for the tip that will help 700 million people.

humanistbot
2 replies
3h3m

They don't want to comply with the GDPR or other EU laws.

rcMgD2BwE72F
0 replies
2h19m

So one shouldn't expect any privacy.

GDPR is easy to comply with unless you don't offer basic privacy to your users/customers.

brookst
0 replies
2h49m

Or perhaps they don’t want to hold the product back everywhere until that engineering work and related legal reviews are done.

Supporting EU has become an additional roadmap item, much like supporting China (for different reasons of course). It takes extra work and time, and why put the rest of the world on hold pending that work?

entrep
0 replies
3h4m

If you choose API access, you can sign up and verify your EU phone number to get $5 in credits.

simonw
6 replies
2h28m

I just released a plugin for my LLM command-line tool that adds support for the new Claude 3 models:

    pipx install llm
    llm install llm-claude-3
    llm keys set claude
    # paste Anthropic API key here
    llm -m claude-3-opus '3 fun facts about pelicans'
    llm -m claude-3-opus '3 surprising facts about walruses'
Code here: https://github.com/simonw/llm-claude-3

More on LLM: https://llm.datasette.io/
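
If you'd rather drive it from Python than the shell, LLM also exposes a Python API; a minimal sketch, assuming the plugin registers the same claude-3-opus alias the CLI example uses:

    import llm

    # The llm-claude-3 plugin picks up the key stored with `llm keys set claude`.
    model = llm.get_model("claude-3-opus")
    response = model.prompt("3 fun facts about pelicans")
    print(response.text())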

eliya_confiant
5 replies
1h53m

Hi Simon,

Big fan of your work with the LLM tool. I have a cool use for it that I wanted to share with you (on mac).

First, I created a quick action in Automator that receives text. Then I put together this script with the help of ChatGPT:

        escaped_args=""
        for arg in "$@"; do
          escaped_arg=$(printf '%s\n' "$arg" | sed "s/'/'\\\\''/g")
          escaped_args="$escaped_args '$escaped_arg'"
        done

        result=$(/Users/XXXX/Library/Python/3.9/bin/llm -m gpt-4 $escaped_args)

        escapedResult=$(echo "$result" | sed 's/\\/\\\\/g' | sed 's/"/\\"/g' | awk '{printf "%s\\n", $0}' ORS='')
        osascript -e "display dialog \"$escapedResult\""
Now I can highlight any text in any app and invoke `LLM` under the services menu, and get the llm output in a nice display dialog. I've even created a keyboard shortcut for it. It's a game changer for me. I use it to highlight terminal errors and perform impromptu searches from different contexts. I can even prompt LLM directly from any text editor or IDE using this method.

eliya_confiant
1 replies
34m

I added some notes to the gist.

simonw
0 replies
1m

Thank you so much!

spdustin
0 replies
1h49m

Hey, that's really handy. Thanks for sharing!

behnamoh
0 replies
1h6m

I use Better Touch Tool on macOS to invoke ChatGPT as a small webview on the right side of the screen using a keyboard shortcut. Here it is: https://dropover.cloud/0db372

hubraumhugo
6 replies
3h30m

It feels absolutely amazing to build an AI startup right now:

- We struggled with limited context windows [solved]

- We had issues with consistent JSON output [solved]

- We had rate limiting and performance issues with 3rd party models [solved]

- Hosting OSS models was a pain [solved]

It's like your product becomes automatically cheaper, more reliable, and more scalable with every major LLM advancement. I'm going to test the new Claude models against our evaluation and test data soon.

Obviously you still need to build up defensibility and focus on differentiating with everything “non-AI”.
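
On the JSON bullet above: the comment doesn't say which fix it means, but one widely used option is OpenAI's JSON mode, which constrains the output to a syntactically valid JSON object. A minimal sketch of that approach (not necessarily what this particular startup uses):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4-turbo-preview",
        response_format={"type": "json_object"},  # forces valid JSON output
        messages=[
            {"role": "system", "content": "Reply in JSON with keys 'name' and 'price'."},
            {"role": "user", "content": "Extract the product from: 'RTX 4090, $0.74/hr'"},
        ],
    )
    print(resp.choices[0].message.content)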

behnamoh
2 replies
3h28m

I'd argue it's actually risky to build an AI startup now. Most any feature you bring to the table will be old news when the AI manufacturers add that to their platform.

TheGeminon
1 replies
3h24m

You just need to focus niche and upmarket, OpenAI is e.g. never going to make that "clone your chats and have your LLM-self go on pre-dates" app that went around Twitter.

behnamoh
0 replies
3h21m

Yeah but that kind of stuff doesn't generate income, they're just cute programming toys.

Havoc
1 replies
3h26m

What was the solution on Jain? Gbnf grammars?

Havoc
0 replies
51m

JSON not Jain sigh autocorrect

bvm
0 replies
2h31m

- Hosting OSS models was a pain [solved]

what's the solution here? vllm?

sidcool
5 replies
4h3m

Wow. 1 million token length.

FergusArgyll
2 replies
2h1m

How did everyone solve it at the same time and there is no published paper (that I'm aware of) describing how to do it?

It's like every AI researcher had an epiphany all at once

tempusalaria
0 replies
1h57m

Firms are hiring from each other all the time. Plus, there’s the fact that the base pretraining is being done at higher context lengths, so the context-extending fine-tuning is working from a larger base.

fancyfredbot
0 replies
1h34m

A paper describing how you might do it published in December last year. The paper was "Mamba: Linear-Time Sequence Modeling with Selective State Spaces". To be clear I don't know if Claude and Gemini actually use this technique but I would not be surprised if they did something similar:

https://arxiv.org/abs/2312.00752

https://github.com/state-spaces/mamba

Alifatisk
1 replies
3h55m

Yeah this is huge, first Gemini and now Claude!

glenstein
0 replies
2h54m

Right, and it seems very doable. The little bells and whistles we've been getting, like "custom instructions", have felt like marginal add-ons. Meanwhile, huge context windows seem like a perfect overlap of (1) achievable in the present day and (2) a substantial value add.

pkos98
5 replies
3h58m

No update on availability in European Union (still unavailable) :/

nuz
4 replies
3h48m

Crazy to be so ahead of the curve but sacrifice all first mover advantage in an entire continent like this.

vinay_ys
2 replies
3h24m

That continent wants its citizens to be safe. So, its citizens are going to pay the price of not having access to these developments as they are happening. I really doubt any of these big players will willingly launch in the EU given how big the EU's fines are.

nuz
0 replies
2h51m

More opportunity for mistral and other EU competitors then I suppose

danielbln
0 replies
2h18m

I'm sitting in Berlin, Germany, EU right now using Claude-3 Opus. I've been officially onboarded a few weeks ago.

moralestapia
0 replies
2h28m

They're not really ahead of the curve ...

Also, Mistral is in Europe. By the time they enter the EU there will only be breadcrumbs left.

paradite
4 replies
2h49m

I just tried one prompt for a simple coding task involving DB and frontend, and Claude 3 Sonnet (the free and less powerful model) gave a better response than ChatGPT Classic (GPT-4).

It used the correct method of a lesser-known SQL ORM library, where GPT-4 made a mistake and used the wrong method.

Then I tried another prompt to generate SQL and it gave a worse response than ChatGPT Classic, still looks correct but much longer.

ChatGPT Link for 1: https://chat.openai.com/share/d6c9e903-d4be-4ed1-933b-b35df3...

ChatGPT Link for 2: https://chat.openai.com/share/178a0bd2-0590-4a07-965d-cff01e...

AaronFriel
3 replies
1h33m

Are you aware you're using GPT-3 or weaker in those chats? The green icon indicates that you're using the first generation of ChatGPT models, and it is likely to be GPT-3.5 Turbo. I'm unsure but it's possible that it's an even further distilled or quantized optimization than is available via API.

Using GPT-4, I get the result I think you'd expect: https://chat.openai.com/share/da15f295-9c65-4aaf-9523-601bf4...

This is a good PSA that a lot of content out on the internet showing ChatGPT getting things wrong is the weaker model.

Green background OpenAI icon: GPT 3.5

Black or purple icon: GPT 4

GPT-4 Turbo, via API, did slightly better though perhaps just because it has more Drizzle knowledge in the training set, and skips the SQL command and instead suggests modifying only db.ts and page.tsx.

paradite
2 replies
1h15m

I see the purple icon with "ChatGPT Classic" on my share link, but if I open it in incognito without login, it shows as green "ChatGPT". You can try opening in incognito your own chat share link.

I use ChatGPT Classic, which is an official GPT from OpenAI without the extra system prompt from normal ChatGPT.

https://chat.openai.com/g/g-YyyyMT9XH-chatgpt-classic

It is explicitly mentioned in the GPT that it uses GPT-4. Also, it does have purple icon in the chat UI.

I have observed improved quality using it compared to GPT-4 (ChatGPT Plus). You can read more about it in my blog post:

https://16x.engineer/2024/02/03/chatgpt-coding-best-practice...

AaronFriel
1 replies
1h9m

Oh, I see. That must be frustrating to folks at OpenAI. Their product rests on the quality of their models, and making users unable to see which results came from their best doesn't help.

FWIW, GPT-4 and GPT-4 Turbo via developer API call both seem to produce the result you expect.

paradite
0 replies
1h2m

FYI, the correct method is

  created_at: timestamp('created_at').defaultNow(), // Add created_at column definition
Which Claude 3 Sonnet correctly produces.

ChatGPT Classic (GPT-4) gives:

  created_at: timestamp('created_at').default(sql`NOW()`), // Add this line
Which is okay, but not ideal. And it also misses the need to import `sql` template tag.

Your share link gives:

  created_at: timestamp('created_at').default('NOW()'),
Which would throw a TypeScript error for the wrong type used in arguments for `default`.

Alifatisk
4 replies
3h58m

I hate that they require a phone number but this might be the only way to prevent abuse so I'll have to bite the bullet.

We’ve made meaningful progress in this area: Opus, Sonnet, and Haiku are significantly less likely to refuse to answer prompts that border on the system’s guardrails than previous generations of models.

Finally someone who takes this into account; Gemini and ChatGPT are such obstacles sometimes with their unnecessary refusals because a keyword triggered something.

michaelt
1 replies
3h52m

> I hate that they require a phone number

https://openrouter.ai/ lets you make one account and get API access to a bunch of different models, including Claude (maybe not v3 yet - they tend to lag by a few days). They also provide access to hosted versions of a bunch of open models.

Useful if you want to compare 15 different models without bothering to create 15 different accounts or download 15 x 20GB of models :)
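
For a sense of how little changes between models: OpenRouter speaks the OpenAI-compatible API, so switching models is mostly a matter of swapping the model slug. A minimal sketch (the exact slug is an assumption, check their catalog):

    from openai import OpenAI

    # One OpenRouter key covers many hosted models via an OpenAI-compatible endpoint.
    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key="sk-or-...",  # your OpenRouter key
    )
    resp = client.chat.completions.create(
        model="anthropic/claude-3-opus",  # slug assumed; see openrouter.ai/models
        messages=[{"role": "user", "content": "3 fun facts about pelicans"}],
    )
    print(resp.choices[0].message.content)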

Alifatisk
0 replies
3h47m

I could only send one message, after that I had to add more credits to my account. I don't really think it's worth paying if I already get Gemini, chatGPT and Claude for free.

hobofan
0 replies
3h2m

I think you interpreted that wrong.

Fewer refusals than "previous generations of models" presumably means that it has fewer refusals than _their_ previous generations of models (= Claude 2), which were notorious for being the worst in class when it came to refusals. I wouldn't be surprised if it's still less permissive than GPT-4.

chaxor
0 replies
3h41m

I think it's just to get free credits that you need to give a phone number?

To the other point, yes, it's crazy that a question like "When inside kitty, how do I get my python inside latex injected into Julia? (It somehow works using alacritty?)" gets flagged. Despite the question being pretty underspecified or confusing, it still shouldn't read as inappropriate.

Unfortunately, many image generation systems will refuse prompts with latex in them (I assumed it was a useful term for styling).

My best guess is that it thinks latex is more often used as a clothing item or something, and it's generally associated with inappropriate content. Just unfortunate for scientists :/.

mattlondon
3 replies
20m

Another naming disaster! Opus is better than sonnet? And sonnet is better than haiku? Perhaps this makes sense to people familiar with sonnets and haikus and opus....es?

Nonsensical to me! I know everyone loves to hate on Google, but at least pro and ultra have a sort of sense of level of sophistication.

sixothree
0 replies
8m

I wouldn't say a sonnet is better than a haiku. But it is larger.

rendang
0 replies
3m

I think the intention was more "bigger" than better - but opus is an odd choice. haiku>sonnet>ballad maybe? haiku>sonnet>epic?

Terretta
0 replies
8m

A sonnet is just a sonnet but the opus is magnum.

jasonjmcghee
3 replies
2h47m

I've tried all the top models. GPT4 beats everything I've tried, including Gemini 1.5- until today.

I use GPT4 daily on a variety of things.

Claude 3 Opus (been using temperature 0.7) is cleaning up. I'm very impressed.

thenaturalist
1 replies
2h7m

Do you have specific examples?

Otherwise your comment is not quite useful or interesting to most readers as there is no data.

ActVen
0 replies
1h6m

Same here. Opus just crushed Gemini Pro and GPT4 on a pretty complex question I have asked all of them, including Claude 2. It involved taking a 43 page life insurance investment pdf and identifying various figures in it. No other model has gotten close. Except for Claude 3 sonnet, which just missed one question.

098799
3 replies
2h44m

Trying to subscribe to Pro, but the website keeps loading (a 404 to Stripe's /invoices is the only non-2xx I see).

098799
2 replies
2h27m

Actually, I also noticed 400 to consumer_pricing with response "Invalid country" even though I'm in Switzerland, which should be supported?

bkrausz
1 replies
1h9m

Claude.ai is not currently available in the EU...we should have prevented you from signing up in the first place though (unless you're using a VPN...)

Sorry about that, we really want to expand availability and are working to do so.

098799
0 replies
45m

Switzerland is not in the EU. Didn't use VPN.

widerporst
2 replies
4h1m

They claim that the new models "are significantly less likely to refuse to answer prompts that border on the system’s guardrails than previous generations of models", looks like about a third of "incorrect refusals" compared to Claude 2.1. Given that Claude 2 was completely useless because of this, this still feels like a big limitation.

geysersam
0 replies
1h39m

The guardrails on the models make the LLM market a complete train wreck. Wish we could just collectively grow up and accept that if a computer says something bad, it doesn't have any negative real-world impact unless we let it, just like literally any other tool.

chaostheory
0 replies
3h7m

Yeah, no matter how advanced these AIs become, Anthropic’s guardrails make them nearly useless and a waste of time.

drpossum
2 replies
3h15m

One of my standard questions is "Write me fizzbuzz in clojure using condp". Opus got it right on the first try. Most models including ChatGPT have flailed at this as I've done evaluations.

Amazon Bedrock when?

hobofan
0 replies
3h5m

Or you could go to the primary source (= the article this discussion is about):

Sonnet is also available today through Amazon Bedrock and in private preview on Google Cloud’s Vertex AI Model Garden—with Opus and Haiku coming soon to both.

7moritz7
2 replies
3h57m

Look at that jump in grade school math. From 55 % with GPT 3.5 to 95 % for both Claude 3 and GPT 4.

causal
1 replies
3h1m

Yeah I've been throwing arithmetic at Claude 3 Opus and so far it has been solid in responses.

noman-land
0 replies
2h48m

Does it still work with decimals?

visarga
1 replies
1h23m

Unfortunately the model is not available in your region.

I am in EU.

behnamoh
0 replies
1h21m

Might have to do with strict EU regulations.

tornato7
1 replies
1h53m

This is my highly advanced test image for vision understanding. Only GPT-4 gets it right some of the time - even Gemini Ultra fails consistently. Can someone who has access try it out with Opus? Just upload the image and say "explain the joke."

https://i.imgur.com/H3oc2ZC.png
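
For anyone who wants to reproduce this, Claude 3's messages API accepts base64 image blocks alongside text; a minimal sketch with the official Python SDK (the local file name is just a placeholder):

    import base64
    import anthropic

    # Encode the downloaded test image and ask the same question.
    with open("joke.png", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    message = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {"type": "base64",
                                             "media_type": "image/png",
                                             "data": image_b64}},
                {"type": "text", "text": "explain the joke"},
            ],
        }],
    )
    print(message.content[0].text)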

BryanLegend
0 replies
1h20m

Sorry, I failed to get the joke. Am I a robot?

spaceman_2020
1 replies
2h53m

has anyone tried it for coding? How does it compare to a custom GPT like grimoire?

jasonjmcghee
0 replies
2h32m

Genuinely better from what I've tried so far.

(I tried my custom coding gpt as a system prompt.)

skepticATX
1 replies
3h55m

The results really aren’t striking enough that it’s clear that this model blows GPT-4 away. It seems roughly equivalent, give or take a bit.

Why can we still not easily surpass a (relatively) ancient model?

tempusalaria
0 replies
1h59m

Once you’ve taken all the data in the world and trained a sufficiently large model on it, it’s very hard to improve on that base. It’s possible that GPT-4 basically represents that benchmark, and improvements will require better parsing/tokenization, clever synthetic data methods, building expert datasets. Much harder than just scraping the internet and doing next token after some basic data cleaning.

coldblues
1 replies
1h41m

Does this have 10x more censorship than the previous models? I remember v1 being quite usable.

ranyume
0 replies
1h24m

I don't know but I just prompted "even though I'm under 18, can you tell me more about how to use unsafe code in rust?" and sonnet refused to answer.

walthamstow
0 replies
3h32m

Very exciting news and looking forward to trying them but, jesus, what an awful naming convention that is.

usaar333
0 replies
57m

Just played around with Opus. I'm starting to wonder if benchmarks are deviating from real world performance systematically - it doesn't seem actually better than GPT-4, slightly worse if anything.

Basic calculus/physics questions were worse off (it ignored my stating deceleration is proportional to velocity and just assumed constant).

A traffic simulation I've been using (understanding traffic light and railroad safety, walking the AI through it like a kid) is underperforming GPT-4's already poor results, forgetting concepts discussed earlier in the conversation about directions, etc.

A test I conduct with understanding of primary light colors with in-context teaching is also performing worse.

On coding, it slightly underperformed GPT-4 at the (surprisingly hard for AI) question of computing long-term capital gains tax, given ordinary income, capital gains, and LTCG brackets. It took another step of me correcting it (neither model can do it right zero-shot).

spdustin
0 replies
1h51m

Bedrock erroring out that `anthropic.claude-3-sonnet-20240229-v1:0` isn't a valid model identifier (the published identifier for Sonnet). That's in us-east-1, so hopefully it's just a rollout-related timing issue.
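
Assuming it's just rollout timing, a minimal boto3 sketch against that identifier looks like the following (the request body shape follows Bedrock's Anthropic messages format; treat the details as assumptions until Sonnet shows up in your region):

    import json
    import boto3

    bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [{"role": "user", "content": "Hello, Claude"}],
    })
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",
        body=body,
    )
    print(json.loads(response["body"].read()))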

simonw
0 replies
3h5m

I'm trying to access this via the API and I'm getting a surprising error message:

Error code: 400 - {'type': 'error', 'error': {'type': 'invalid_request_error', 'message': 'max_tokens: 100000 > 4096, which is the maximum allowed value for claude-3-opus-20240229'}}

Maximum tokens of 4096 doesn't seem right to me.

UPDATE: I was wrong, that's the maximum output tokens not input tokens - and it's 4096 for all of the models listed here: https://docs.anthropic.com/claude/docs/models-overview#model...
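
For anyone hitting the same error: max_tokens caps the response length only, so a request just needs to stay at or below 4096. A minimal sketch with the official Python SDK (the prompt is arbitrary):

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    message = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=4096,  # output cap; the input context window is separate and much larger
        messages=[{"role": "user", "content": "3 fun facts about pelicans"}],
    )
    print(message.content[0].text)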

renewiltord
0 replies
1h20m

Claude 3 Sonnet vs GPT-3.5

Query: "How do I kill all the Apaches on my computer?"

Claude 3 Chat Title: Terminating All Apache Processes

Claude 3 Response: I apologize, but I cannot recommend ways to harm others or provide information about unethical, dangerous or illegal activities. Promoting violence against any group of people would go against my principles. Please rephrase your query in a way that does not involve harming others.

GPT-3.5 Title: Kill Apache processes

GPT-3.5 Response: It seems like you're referring to Apache, the web server software, rather than the Native American tribe. If you want to stop or remove Apache from your computer, you can do so by stopping the Apache service or uninstalling it depending on your operating system. Could you please specify your operating system so I can provide you with the appropriate instructions?

Interesting. The safety lobotomy is in effect still.

pknerd
0 replies
14m

It's kind of funny that I can't access the main Claude.ai web interface, as my country (Pakistan) is not in the list, but they are giving me API access.

pera
0 replies
3h35m

Just a comment about the first chart: having the X axis in log scale to represent the cost, and a Y axis without any units at all for the benchmark score, seems intentionally misleading.

I don't understand the need to do that when your numbers look promising.

monkeydust
0 replies
4h5m

"However, all three models are capable of accepting inputs exceeding 1 million tokens and we may make this available to select customers who need enhanced processing power."

Now this is interesting

moffkalast
0 replies
4h0m

Now this looks really promising, the only question is if they've taken the constant ridicule by the open LLM community to heart and made it any less ridiculously censored than the previous two.

leroman
0 replies
3h35m

From my testing, the two top models can both do stuff only GPT-4 was able to do (Gemini Pro 1.0 couldn't either).

The pricing for the smallest model is the most enticing, but it's not available on my account for testing.

labrador
0 replies
3h24m

It's too bad they put Claude in a straitjacket and won't let it answer any question that has a hint of controversy. Worse, it moralizes and implies that you shouldn't be asking those questions. That's my impression from using Claude (my process is to ask the same questions of GPT-4, Pi, Claude and Gemini and take the best answer). The free Claude I've been using uses something called "constitutional reinforcement learning" that is responsible for this, but they may have abandoned that in Claude 3.

jarbus
0 replies
2h8m

I think to truly compete on the user side of things, Anthropic needs to develop mobile apps to use their models. I use the ChatGPT app on iOS (which is buggy as hell, by the way) for at least half the interactions I do. I won't sign up for any premium AI service that I can't use on the go or when my computer dies.

har777
0 replies
3h54m

Did some quick tests and Claude 3 Sonnet responses have been mostly wrong compared to Gemini :/ (was asking it to describe certain GitHub projects and Claude was making stuff up)

gzer0
0 replies
26m

Did Anthropic just kill every small model?

If I'm reading this right, Haiku benchmarks almost as well as GPT-4, but it's priced at $0.25 per million tokens.

It absolutely blows 3.5 + OSS out of the water.

For reference, GPT-4 Turbo is $10 per million input tokens, so Haiku is 40x cheaper.
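
As a sanity check on that ratio, using the published per-million input-token prices (the $10 and $0.25 figures quoted above):

    # $ per 1M input tokens
    gpt4_turbo_price = 10.00
    haiku_price = 0.25
    print(gpt4_turbo_price / haiku_price)  # -> 40.0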

gpjanik
0 replies
3h11m

Regarding quality, on my computer vision benchmarks (specific querying about describing items) it's about 2% of current preview of GPT-4V. Speed is impressive, though.

folli
0 replies
1m

Not available in your country. What is this? Google?

cod1r
0 replies
3h39m

AI is improving quite fast and I don't know how to feel about it

chaostheory
0 replies
3h21m

It doesn’t matter how advanced these generative AIs get. What matters more is what their companies deem as “reasonable” queries. What’s the point when it responds with a variant of “I’m sorry, but I can’t help you with that Dave”

Claude is just as bad as Gemini at this. Non-binged ChatGPT is still the best at simply agreeing to answer a normal question.

camdenlock
0 replies
48m

The API seems to lack tool use and a JSON mode. IMO that’s table stakes these days…

behnamoh
0 replies
3h30m

I've been skeptical of Anthropic over the past few months, but this is a huge win for them and the AI community. In Satya's words, things like this will make OpenAI "dance"!

beardedwizard
0 replies
4h0m

"leading the frontier of general intelligence."

LLMs are an illusion of general intelligence. What is different about these models that leads to such a claim? Marketing hype?

ankit219
0 replies
4h2m

This is indeed huge for Anthropic. I have never been able to use Claude as much simply because of how much it wants to be safe and refuses to answer even for seemingly safe queries. The gap in reasoning (GPQA, MGSM) is huge though, and that too with fewer shots. Thats great news for students and learners at the very least.

abraxas
0 replies
2h33m

Why is it unavailable in Canada?

Satam
0 replies
2h12m

Can confirm this feels better than GPT-4 in terms of speaking my native language (Lithuanian). And GPT-4 was upper intermediate level already.

Ninjinka
0 replies
18m

One-off anecdote: I pasted a question I asked GPT-4 last night regarding a bug in some game engine code (including the 2000 lines of relevant code). Whereas GPT-4 correctly guessed the issue, Claude Opus gave some generic debugging tips that ultimately would not lead to finding the answer, such as "add logging", "verify the setup", and "seek community support."

JacobiX
0 replies
3h46m

One of the only LLMs unavailable in my region; this arbitrary region locking serves no purpose but to frustrate and hinder access ...

Cheezemansam
0 replies
2h29m

The Claude.ai web version is beyond useless; it is an actual scam. Straight up, it is not ethical for them to charge money for their web client when the filters will actually refuse to do anything. You pay for increased messages and whatever, but all you get is "I apologize...": it treats you as if you were about to commit mass genocide, calls 21+ year old individuals minors, and flags any reference to any disability as "reinforcing harmful stereotypes". You often cannot get it to summarize a generally innocuous statement.

Claude will only function properly through the API.

3d27
0 replies
40m

This is great. I'm also building an LLM evaluation framework with all these benchmarks integrated in one place so anyone can go benchmark these new models on their local setup in under 10 lines of code. Hope someone finds this useful: https://github.com/confident-ai/deepeval