
Claude 3.5 Sonnet

netsec_burn
16 replies
2h14m

Opus remained better than GPT for me, even after the release of GPT-4o, so I'm VERY happy to see an even further improvement beyond that. Claude is a terrific product, and given the news that GPT-5 only began its training several weeks ago, I don't see any situation where Anthropic is dethroned in the near term. There are only two parts of Anthropic's offering I'm not a fan of:

- Lack of conversation sharing: I had a conversation with Claude where I asked it to reverse engineer some assembly code, and it did it perfectly on the first try. I was stunned; GPT had failed for days. I wanted to share the conversation with others, but there's no sharing feature like ChatGPT's, and there's no way even to print the conversation, because it gets cut off in the browser (tested on Firefox).

- No Android app. They're working on this, but for now there's only an iOS app. No ETA has been shared; I've been on the waitlist.

I feel like both of these are relatively basic feature requests for a company of Anthropic's size, yet it has been months with no solution in sight. I love the models, please give me a better way of accessing them.

gotrythis
10 replies
2h9m

What I understand is that it's GPT-6 that just went into training, and that GPT-5 is complete and being delayed until after the U.S. election.

PaulWaldman
3 replies
1h39m

And after GPT-5's release, what would be the plan for subsequent elections? This seems like a temporary play to delay AI regulation, in case public sentiment shifts further toward believing AI can strongly influence elections.

imjonse
1 replies
1h25m

GPT-5 will make elections obsolete :)

Sysreq2
0 replies
9m

Roko would be proud of you. I welcome our new electric masters.

futureshock
0 replies
1h13m

It's absolutely temporary, but 4 years feels like an eternity in this field, and I'm sure the major players would love to have that much time to entrench themselves before they have to battle "AI ban" legislation.

viraptor
1 replies
1h45m

Is there any online confirmation of this that's more than speculation?

icpmacdo
0 replies
1h31m

No, there is not.

r2_pilot
1 replies
1h47m

(assuming you are correct) It says something about how a company feels about the safety of its products when it feels it should time releases around political events.

futureshock
0 replies
1h15m

This is speculation, because I don't think any of the key players has ever explicitly stated this is their strategy, but this year it feels like there's some significant foot-dragging on things like Sora and GPT-5. The big AI players really don't want AI to become an election-year punching bag, and don't want any major campaign promises around AI made to placate a spooked electorate. And they really don't want it to be revealed that generative-AI-powered bot armies outnumber real human political discourse 10 to 1. And they absolutely do not want an AI-generated hoax video to have a measurable effect on the polls.

It's a stopgap. If we get through this election without a major public freak-out, it gives the industry 4 more years to take LLMs to the point of diminishing returns and figure out safety before we get knee-jerk regulation.

sva_
0 replies
59m

Source: trust me bro

modeless
0 replies
1h32m

This is pure speculation, right?

sk11001
1 replies
50m

Both GPT-4 and 4o have been completely useless for coding for me in the past couple of weeks - constant errors, and not just your typical LLM inaccuracies: they're incapable of producing a few lines of self-consistent code, e.g. defining a variable foo on one line and referring to it as bar on the next, or misspelling it as foox.

esafak
0 replies
7m

For me it has been very repetitious despite my requests to the contrary.

coreylane
1 replies
53m

I recently released Slackrock [https://github.com/coreylane/slackrock], which you may find helpful; it's a Slack app that can access several foundation models (including Claude 3.5) via AWS Bedrock. Responses can easily be shared with others by inviting them to your channels, and Slack has an Android app. It doesn't currently support attachments, but I'm working on it!

natsucks
0 replies
4m

cool!

viraptor
0 replies
1h17m

On the plus side, at least ChatBoost supports both the OpenAI and Claude APIs. But for this specific model it seems to be broken... I hope that gets noticed and fixed soon.

impulser_
11 replies
3h45m

Anthropic is the new king. This isn't even Claude 3.5 Opus and it's already super impressive. The speed is insane.

I asked it "Write an in depth tutorial on async programming in Go" and it filled out 8 sections of a tutorial with multiple examples per section before GPT4o got to the second section and GPT4o couldn't even finish the tutorial before quitting.

I've been a fan of Anthropic's models since Claude 3. Despite the benchmarks people always post showing GPT-4 in the lead, I've always gotten way better results from Claude 3 than GPT-4, especially with longer responses and larger contexts. GPT responses always feel computer-generated, while Claude 3 feels more humanlike.

belter
3 replies
2h42m

Anthropic is the king, but Jensen Huang is the emperor... :-)

mupuff1234
2 replies
2h39m

I think Anthropic also uses Google TPUs.

richardlblair
1 replies
2h37m

I agree. I've been really impressed with Anthropic. The issue for me comes when I want to take arbitrary user input and ask Claude questions about the user provided input. Claude is very, very, very ethical. Which is great, but it won't provide a response if the user tends to use a lot of curse words.

jononor
0 replies
1h16m

Do some masking of curse words with "sht", "?!", "verybad", or similar? Something that Claude will accept. It might work if users are just generally foul-mouthed, not actively trying to trigger the model/system.
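A minimal sketch of that masking idea in Python (the word list and replacement token here are placeholders, nothing Claude-specific):

  import re

  # Hypothetical deny-list; in practice you'd load a real profanity list.
  CURSE_WORDS = ["damn", "hell", "crap"]

  def mask_curses(text: str, token: str = "[redacted]") -> str:
      # \b ensures whole-word matches; IGNORECASE catches "DAMN", "Damn", etc.
      pattern = re.compile(
          r"\b(" + "|".join(map(re.escape, CURSE_WORDS)) + r")\b",
          re.IGNORECASE,
      )
      return pattern.sub(token, text)

  print(mask_curses("This damn report is late"))  # -> This [redacted] report is late

This only catches exact word matches; users who deliberately obfuscate ("d4mn") would need fuzzier matching.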

whoiscroberts
0 replies
3h17m

Can you let us know about the quality of the tutorial?

thelittleone
0 replies
2h53m

Agreed. They're like the quiet achievers. The new experimental sidebar 'artifacts' feature is super cool (it keeps a convenient version history, too). I just fed it a JSON object and asked for a collapsible table app using Next and shadcn. The first code it produced worked perfectly, and code doesn't get lost in the chat history like it does with ChatGPT. The response was super fast.

And the latest training data cutoff for 3.5 is April 2024.

jamesponddotco
0 replies
2h13m

One thing Anthropic did that I loved and think was very smart was building a prompt generator[1] into the developer console. The generator is tuned to generate prompts the way Claude prompts are supposed to be written, which improves responses. And you can use it to improve your user prompt as well, not just your system prompt, which makes responses even better.

You can see examples of the prompts it generates here[2]. It significantly improved my experience with LLMs; I haven't touched GPT4 in quite a while, and GPT4o didn't change that.

[1]: https://docs.anthropic.com/en/docs/build-with-claude/prompt-...

[2]: https://sr.ht/~jamesponddotco/llm-prompts/

icelancer
0 replies
2h19m

Our internal blinded human evals for summarization/creative work have always preferred Claude 3.0 Opus by a huge margin, so we've been using it for months - GPT-4o didn't unseat it either.

GPT-4o was IMO better for coding (I'm still using original GPT-4 with Cursor, but for long-form stuff GPT-4o seemed better), but with this new launch I'll definitely have to retest.

Pretty big news.

boredemployee
0 replies
2h6m

I've never tried anything other than OpenAI's GPT family models and some toy LLMs, but GPT-4o sucks compared to GPT-4 (imho). I'll try Claude and compare.

lumenwrites
9 replies
2h55m

I wish they'd implement branching conversations like ChatGPT's. And convenient message editing that doesn't paste large chunks of text as a non-editable attachment or break formatting.

These seem like such simple things to do, relative to developing an AI, yet these minor differences in UI/UX are what prevent me from using Claude a lot more.

namanyayg
4 replies
2h50m

I'm actually working on an open source product to solve this.

For a long while I've wanted a good "pro" UI that can connect to multiple different LLM APIs.

Convenient editing and branching are already on my roadmap; what else do you think I could include?

lumenwrites
1 replies
2h44m

Good history search (including non "main" conversation branches) and convenient conversation management (bookmarking, folders, maybe something smarter) would be great.

Also, maybe some convenient way to create message templates? I don't know how I'd implement this, I just know that I often write one long prompt that I reuse multiple times, with multiple minor tweaks/edits, and it'd be amazing to have a convenient tool to manage that.

Also, good mobile/tablet support, convenient to use and without bugs (as I happen to spend most of my time writing prompts on my ipad, but that's just me).

If you already have a demo - please share a link, I'd be happy to beta test it and maybe become one of the early customers.

namanyayg
0 replies
1h23m

Wow, reading your comment is a great mood boost for me, because these are literally the exact features I want from my LLM chat experience. It's great to see someone with the exact same problem set.

I just followed you on Twitter (I'm @NamanyayG there as well), I'll definitely ping you when I have something to test.

namanyayg
0 replies
1h27m

Thanks for the resource! I've seen some of them, and the main issue is that I don't want to self-host anything, and I want to use the latest third-party models as soon as they're released.

Maybe something like what I'm talking about exists already, but I think I'll still try to make my own open source version to fulfill my personal requirements.

heymijo
2 replies
2h42m

"I wish they'd implement branching conversations like in ChatGPT"

Can you say more about this?

I Googled and I'm not finding much. I asked ChatGPT, and its response didn't match the assumption I held about what "branching" meant [0].

[0] https://chatgpt.com/c/6b2e0f7c-c4e6-44df-9116-ac7f618200f2

lumenwrites
1 replies
2h40m

I just mean that when you click the button to generate a new version of a response (or edit your own message), ChatGPT shows arrow buttons that let you go back to the previous version, and this works for all messages, so you can go back up a few messages and try a different version of the conversation without losing what you had before.

zamadatix
0 replies
1h55m

Shit, I never noticed that arrow...

Onawa
0 replies
1h32m

Pretty much all of the features you mention are already in LibreChat (MIT License). If you don't mind self-hosting, then it has branching, convo search, change models mid-chat, "presets" (save system prompts), and a whole lot more. I've deployed it in my gov agency for months now, and I've had amazing feedback. https://github.com/danny-avila/LibreChat

swalsh
5 replies
4h18m

Anthropic has been killing it. I subscribe to both ChatGPT Pro and Claude, but I spend probably 90% of my time using Claude. I usually only go back to OpenAI when I want another model to evaluate or modify the results.

mamoul
3 replies
3h57m

A Kagi Ultimate subscription gets you access to both (plus others) for $25/mo.

oidar
0 replies
3h20m

This is only in the chat mode. You also don't get the full context limit and file uploads for those modes.

infecto
0 replies
2h1m

This is only via API though. There is a level of magic that Claude.ai and ChatGPT bring to the table that makes them worthwhile.

emptysongglass
0 replies
2h33m

Perplexity too, which I've found the most useful for access to top-end AI models with a massive reduction in hallucinations.

22c
0 replies
4h2m

I was worried about how they'd do, as it felt like Opus was very expensive compared to GPT-4o but with worse performance. They're now claiming to beat GPT-4o AND do it cheaper; that's impressive.

prasoonds
5 replies
4h1m

This is amazing - I far prefer the personality of Claude to the GPT-4 series models. Also, with coding tasks, Claude-3-Opus has been far better for me than gpt-4-turbo and gpt-4o both. Looking forward to giving it a spin.

Seems like it's doing better than GPT-4o in most benchmarks though I'd like to see if its speed is comparable or not. Also, eagerly awaiting the LMSYS blind comparison results!

eigenvalue
1 replies
3h49m

I find that it varies between language and task whether GPT-4o or Claude3 Opus will be better. I usually try both now.

icelancer
0 replies
2h14m

I agree. There are some corner cases that GPT-4o reliably fails and Claude does well on, and vice versa. GPT-4 and GPT-4o consistently generate very poor cv2 Python code for human face/bounding-box work - it's a strange, reproducible failure in my experience.
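For reference, the kind of cv2 bounding-box task in question is only a few lines when done with OpenCV's bundled Haar cascade; a minimal sketch (the input filename is hypothetical):

  import cv2

  # Load OpenCV's bundled frontal-face Haar cascade.
  cascade = cv2.CascadeClassifier(
      cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
  )

  img = cv2.imread("photo.jpg")  # hypothetical input image
  gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

  # detectMultiScale returns (x, y, w, h) bounding boxes.
  faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

  for (x, y, w, h) in faces:
      cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

  cv2.imwrite("boxed.jpg", img)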

3l3c7r1c
1 replies
3h48m

For coding, Claude 3 Opus provides far more mature code and is good at finding bugs (when presented with the error output) compared to GPT-4-Turbo and GPT-4o. The last few days I've been using both for a python+pyspark project. Not sure how GPT-4o scores so well in their comparison!

prasoonds
0 replies
2h1m

100% agree here. Claude is especially good at larger context sizes and retains coherence way longer than GPT-4 series of models

nsingh2
0 replies
3h40m

GPT4(o) is quite good at advanced math, it's been helpful when I was learning differential geometry. Not sure how Claude compares though, this 3.5 release has tempted me to try it out. Also, it's finally available in Canada!

Sharlin
5 replies
3h0m

A while ago I tested the image recognition skills of GPT-4o, Claude 3, and Gemini using a random street plan diagram I happened to have in my downloads folder (https://i.imgur.com/9WZpK0L.png). It's a top-down CAD rendering showing the planned renovation of a street in my neighborhood in Tampere, Finland. I uploaded the image and simply asked each model "What can you tell me about the attached picture?"

GPT-4o's answer was excellent and highly detailed, recognizing essentially all the relevant aspects of the image [GPT4o]. Claude 3 Sonnet was correct on a general level, but its answer was much less detailed and showed more uncertainty in the form of "A or B" sentences [CL3]. Gemini's answer was, well, hilariously wrong [GEM].

I just tried this with Claude 3.5 Sonnet and it did very well. Its answer was still not as detailed as GPT-4o's, but it did ask me if I want it to elaborate on any aspect of the image [CL35].

I think this was an interesting experiment because street plan CAD diagrams probably aren't very common in the training data of these models.

--

[GPT4o] https://gist.github.com/jdahlstrom/844bda8ac76a5c3248c863d20...

[CL3] https://gist.github.com/jdahlstrom/ecccf31c8305f82519f27af53...

[GEM] https://gist.github.com/jdahlstrom/2e12a966c0d603a7b1446ba08...

[CL35] https://gist.github.com/jdahlstrom/60ca9352630934bec6e2f4e37...

nsingh2
2 replies
2h58m

BTW I can't access the linked chats, not sure if it's just me.

skybrian
1 replies
2h53m

It looks like claude.ai doesn't have link sharing. There are third-party workarounds, like we used to use for ChatGPT.

Sharlin
0 replies
2h26m

GPT has sharing but it doesn't work with chats with images. But I posted the answers as gists and edited my comment.

SXX
1 replies
2h58m

We can't access any of your chats. You need to post conversations elsewhere.

Sharlin
0 replies
2h27m

Thanks, posted them as gists and edited!

qmarkgn
4 replies
3h47m

The trial requires a signup with an email address. No thanks! This is one thing Microsoft got right with CoPilot.

With so much competition, I wonder why everyone else makes it hard to try out something.

r2_pilot
0 replies
3h22m

You're welcome to use other models; as for me, I started using Claude 3 shortly after it came out and I've never felt like switching to the "competition". Their stance on safety aligns with my take on it, and they don't use user data for training purposes. Matter of fact, I just got my first customer for using AI services last night when I live-coded a demo in front of them, and that was with Opus. I'm thrilled to see what I can do with the new tech! I've been trying it out a little bit this morning but haven't seen much improvement (yet).

downvotetruth
0 replies
2h36m

Phone # requirement is a bigger blocker.

christianqchung
0 replies
2h49m

You made an account just to comment that? Admittedly, it's easier to make a HN account than to sign up for Claude, but still.

bearjaws
0 replies
2h52m

CoPilot is pretty damn bad though...

campers
4 replies
1h48m

I'm excited to test this out! I've been building an open source TypeScript agentic AI platform for work (DevOps related, with an autonomous agent and a software engineer workflow). The Claude 3 models influenced the design, with their tuning for XML and their three capability levels, and they have been my preferred models to use.

I remember having moments looking at the plans Opus generated and being impressed with its capabilities.

The slow speed of requests I could deal with, but the costs could quickly add up in workflows and the autonomous agent control loop. When GPT-4o came out at half the price, it made Opus quite pricey in comparison. I'd often wished I could have Opus capabilities at a fraction of the price, so it's a nice surprise to have that here sooner than I imagined!

The agent platform isn't officially launched yet, but it's up at https://github.com/trafficguard/nous

I never liked the Langchain API when I looked at the examples, so I built it from scratch. It has an autonomous agent with custom XML-based function calling, memory, and call history. The software engineer agentic workflow I initially dog-fooded with a prompt like "Complete Jira XYZ-123". So it gets the Jira description, finds the appropriate terraform project in GitLab, clones, edits (delegating to Aider), creates a MR and messages on Slack. It also has a UI for running agents, human-in-the-loop interactions, etc.
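The exact XML schema nous uses isn't shown here, but the general pattern of XML-based function calling is simple: the model emits an XML block, and the client parses it and dispatches to a registered tool. A minimal sketch with hypothetical tag and tool names:

  import xml.etree.ElementTree as ET

  # Hypothetical XML emitted by the model; the real schema may differ.
  response = """
  <function_call>
    <name>get_weather</name>
    <param name="city">Tampere</param>
  </function_call>
  """

  def get_weather(city: str) -> str:
      return f"Weather in {city}: rainy"  # stub implementation

  TOOLS = {"get_weather": get_weather}

  root = ET.fromstring(response.strip())
  name = root.findtext("name")
  params = {p.get("name"): p.text for p in root.findall("param")}
  print(TOOLS[name](**params))  # -> Weather in Tampere: rainy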

campers
3 replies
1h22m

From the Anthropic model guide:

Agentic Coding: Claude 3.5 Sonnet solves 64% of problems on an internal agentic coding evaluation, compared to 38% for Claude 3 Opus. Our evaluation tests a model’s ability to understand an open source codebase and implement a pull request, such as a bug fix or new feature, given a natural language description of the desired improvement. For each problem, the model is evaluated based on whether all the tests of the codebase pass for the completed code submission. The tests are not visible to the model, and include tests of the bug fix or new feature. To ensure the evaluation mimics real world software engineering, we based the problems on real pull requests submitted to open source codebases. The changes involve searching, viewing, and editing multiple files (typically three or four, as many as twenty). The model is allowed to write and run code in an agentic loop and iteratively self-correct during evaluation. We run these tests in a secure sandboxed environment without access to the internet.

% of problems which pass all tests:

Claude 3.5 Sonnet: 64%
Claude 3 Opus: 38%
Claude 3 Sonnet: 21%
Claude 3 Haiku: 17%
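None of Anthropic's actual harness is public, but purely as an illustrative sketch, the hidden-test pass/fail criterion boils down to something like:

  import subprocess

  def evaluate(repo_dir: str) -> bool:
      """Pass iff the full held-out test suite passes after the model's edits."""
      # The model edits files in repo_dir and may write/run its own code,
      # but never sees the contents of the held-out suite below.
      result = subprocess.run(
          ["pytest", "hidden_tests/"], cwd=repo_dir, capture_output=True
      )
      return result.returncode == 0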

modeless
2 replies
1h0m

"The tests are not visible to the model"

"The model is allowed to write and run code in an agentic loop and iteratively self-correct during evaluation"

What does this mean? How does the model iteratively self-correct without seeing the tests? Can it see the test results?

svnt
1 replies
46m

It isn’t allowed to see the final evaluation test (used in calculating its pass/fail), but it can run code and see the output of its own code in order to understand what doesn’t work. If it ends up creating tests as part of that based on the original problem statement then presumably that’s allowed.

modeless
0 replies
6m

Is this speculation or do you work at Anthropic? It would be cool to see the prompts used for this.

brokensegue
4 replies
4h0m

Can anyone find pricing details?

Ah it is "The model costs $3 per million input tokens and $15 per million output tokens"

brokensegue
1 replies
3h28m

Why would someone pick Opus at these prices?

eaplant
0 replies
3h7m

Someone who has built infrastructure or system prompts that use Opus will probably continue with Opus until they verify that everything works on Sonnet 3.5

pantsforbirds
0 replies
3h53m

Pricing is actually insane for those benchmark results.

wesleyyue
3 replies
3h16m

If anyone would like to try it for coding in VSCode, I just added it to http://double.bot on v93 (AI coding assistant). It feels quite strong so far and handled a few prompts that I know failed with gpt4o.

FYI for anyone testing this in their product: their docs are wrong; it's claude-3-5-sonnet-20240620, not claude-3.5-sonnet-20240620.
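A quick way to sanity-check the ID with the official Python SDK, assuming the anthropic package and an API key in the environment:

  import anthropic

  client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

  message = client.messages.create(
      model="claude-3-5-sonnet-20240620",  # note: 3-5, not 3.5
      max_tokens=64,
      messages=[{"role": "user", "content": "Say hello."}],
  )
  print(message.content[0].text)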

SwiftyBug
1 replies
2h46m

Before I read your comment I was looking for a solution to use Claude as co-pilot in Neovim. I've seen in Double's website FAQ that it's not supported yet. Do you have an idea if this feature is expected to land anytime soon?

jamesponddotco
0 replies
1h29m

Adding a +1 to this request. Something like Codeium for NeoVim but using Claude 3.5 Sonnet as the model would be swell.

shepherdjerred
0 replies
2h46m

Is Double hiring? I was trying to find a careers page, but didn't see anything :)

throwup238
3 replies
3h53m

Woah, this is a marked improvement. I just threw a relatively complex coding problem at it, and 3.5 Sonnet did a really good job across several languages. I asked it to rewrite a Qt6 QSyntaxHighlighter subclass to use TreeSitter to support arbitrary languages, and not only did it work (with a hardcoded language), it even got the cxx-qt Rust bindings almost right, including the extra header.

Curious to see how well it handles QML because previous models have been absolutely garbage at it.

rubymamis
2 replies
2h45m

A fellow Qt/QML developer here (get-plume.com). Sounds interesting, what are you building?

throwup238
1 replies
2h22m

An AI chat interface, ironically. All of the chat apps are slow-as-heck Electron, so I figured there's a market for an actually usable desktop app, especially one that can inline code files and create a RAG index of local documents.

Plume looks great! I'm curious how you implemented the markdown formatted text editing component - I need to implement something similar for the chat input.

rubymamis
0 replies
2h9m

So funny! I thought about building the same thing for the exact same reason! And the block editor I created for Plume is a great candidate for that.

I implemented the editor from scratch. The model is a C++ QAbstractListModel and the view is in QML. I'm writing a blog post about the implementation; it should pop up soon on my personal website (https://rubymamistvalove.com). But I can send you a draft soon, if you'd like.

heymijo
3 replies
2h23m

Anyone want to make a case for Anthropic being undervalued?

$7.5 billion raised at an $18.4 billion valuation for Anthropic.

$11.3 billion raised at an $80.0 billion valuation for OpenAI.

freediver
1 replies
2h18m

Look also at revenue - $850m vs $3.5b (from what I could find with a quick search). The numbers change fast, and so, probably, does the valuation.

usaar333
0 replies
57m

$850m is the revenue projection for the end of this year. We don't have actual numbers yet.

infecto
0 replies
2h9m

Along with the revenue already mentioned I would also argue that the approaches taken differ. OpenAI was quick to get business in the door with the appropriate contracts. Anthropic at least initially, was very coy, wanting you to talk to a sales person, describe your use cases etc.

greatpostman
3 replies
2h38m

OpenAI must be cooking something huge for them to not be releasing products far ahead of competitors

mupuff1234
1 replies
1h56m

Idk about that; they did that weird event right before Google I/O and still haven't released the features they announced.

Wouldn't be surprised if the only thing cooking is OpenAI itself.

zamadatix
0 replies
1h24m

They released the model announced in the demo but they haven't released the new way to interact with said model. I highly doubt they've got a significantly better model to release already.

karmasimida
0 replies
1h39m

I would rather they spend time making their model smarter than cooking up some gimmick to impress the public or their shareholders.

freediver
3 replies
2h23m

On first glance, CS3.5 appears to be slightly faster than gpt-4o (62 vs 49 tok/sec) and slightly less capable (78% vs 89% accuracy on our internal reasoning benchmark). When initially launched, gpt-4o had a speed of over 100 tok/sec; I'm surprised the speed went down so fast.

freediver
1 replies
1h41m

The benchmark is intended as zero-shot, so no tweaking.

jasondclinton
0 replies
1h38m

Got it, thanks for the feedback!

xfalcox
2 replies
2h45m

They said it is available on Bedrock, but it isn't :(

jasondclinton
1 replies
2h5m

It's live now! Sorry for the delay.

xfalcox
0 replies
47m

I still get

{"message":"Could not resolve the foundation model from the provided model identifier."}

on us-west-2.

wayeq
2 replies
57m

does the period of a point mass moving in a potential well depend on its mass?

"For a simple harmonic oscillator, the period does not depend on the mass of the object. However, for a general potential well, the answer is more nuanced:

Simple harmonic oscillator: In this case, the period T is given by: T = 2π√(m/k) where m is the mass and k is the spring constant. However, the frequency of oscillation ω = √(k/m), so T = 2π/ω. The spring constant k is typically proportional to m, making the period independent of mass.

General potential well: For a non-harmonic potential, the period usually does depend on mass. The period is given by: T = 2∫dx/v(x)

where v(x) is the velocity as a function of position. This velocity depends on the total energy and the potential energy:

v(x) = √(2/m * (E - V(x)))

Here, the mass appears explicitly, and generally doesn't cancel out as it does in the simple harmonic case.

Would you like me to elaborate on any part of this explanation?"

why do you say the spring constant k is typically proportional to m? a spring's force when stretched to a given length doesn't depend on what's attached to the end

"You're absolutely right, and I apologize for that error in my explanation. You are correct that the spring constant k is a property of the spring itself and does not depend on the mass attached to it."

svnt
0 replies
51m

Probably a case of combining a first principles explanation with a heuristic explanation for design of a mechanism.

nsingh2
0 replies
39m

Related to this, I wish the Claude interface had built-in math rendering like ChatGPT's. Claude's ASCII math output can be a bit hard to parse.

sebzim4500
2 replies
42m

Using this is the first time since GPT-4 where I've been shocked at how good a model is.

It's helped by how smooth the 'artifact' UI is for iterating on html pages, but I've been instructing it to make a simple web app one bit of functionality at a time and it's basically perfect (and even quite fast).

I'm sure it will be like GPT-4, and the honeymoon period will wear off to reveal big flaws, but honestly I'd take this over an intern (even ignoring the speed difference).

mrinterweb
0 replies
15m

"honestly I'd take this over an intern (even ignoring the speed difference)"

I'm sure you're not the only one who will feel this way. I worry for the future prospects of people starting their careers. The impacts will affect everyone in one way or another, not just those with limited experience. No way to know what the future holds.

jiveturkey
0 replies
8m

i don't think the point of an intern is to have them do this kind of work. to me, it's just a side effect if they accomplish anything at all.

if we take this to its logical conclusion, without the kind of basic training that comes from internships, where will we be in 5 years?

meetpateltech
2 replies
4h9m

"Artifacts—a new way to use Claude. You can ask Claude to generate docs, code, mermaid diagrams, vector graphics, or even simple games."

This is new, and I just tried turning a simple dice roll into a React component; it works perfectly.

viraptor
0 replies
1h12m

Not listed there and not well known, but I really like that the bigger models know how to produce ArgDown output. You can do things like "give me a detailed/nuanced tree in argdown format for topic: static typing improves development speed". It's quite useful for exploring topics.

smca
0 replies
3h38m

glad you liked it :)

TiredOfLife
2 replies
6m

Unfortunately still thinks "There are two 'r's in the word "raspberry"."

The only one that got it right was the basic version of Gemini "There are actually three "r"s in the word "strawberry". It's a bit tricky because the double "r" sounds like one sound, but there are still two separate letters 'r' next to each other."

The paid Gemini advanced had "There are two Rs in the word "strawberry"."

shuri
0 replies
1m

One theory I heard about this type of problem is that these algorithms tokenize the text early, and each token can be multiple characters.

logicchains
0 replies
4m

This isn't a very good question because of how tokenizers work: the model can't see individual letters. If you wrote it with a separator between the letters, e.g. r - a - s - p - b - e - r - r - y, it'd probably do much better.
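You can see the chunking directly with OpenAI's open-source tiktoken tokenizer (Claude uses a different tokenizer, but the principle is the same):

  import tiktoken

  enc = tiktoken.get_encoding("cl100k_base")  # the GPT-4 encoding
  tokens = enc.encode("raspberry")

  # Each token is a multi-character chunk, not an individual letter,
  # so the model never "sees" the separate r's.
  print([enc.decode([t]) for t in tokens])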

ModernMech
2 replies
1h56m

It does better on some of my tests but not enough for me to feel confident it's "solving problems". One thing I like to do is pose a problem and give it a bunch of objects it could use, some more helpful than others. I have found language models fail at discerning which of the tools are useful and which are red herrings, opting to use everything in some way.

My go to test is the boat and goat: "You are on the shore of a river with a boat and a stoat. How do you get to the other side?"

Previous LLMs have pattern matched this example to the logic puzzle, and come up with a complicated scenario about going back and forth with the goat.

Claude 3.5 says to just bring the goat across in the boat, which is wrong but better than previous versions. So that's an improvement.

But when you add more objects in, Claude just gets weird with it. When I told Claude it had a "boat, a stoat, and a rope, a shovel, and a handkerchief" it decided it had to use all the things in the strangest way and advised me to drown a rodent:

  1. Use the shovel to dig a small trench near the shore.
  2. Fill the boat with water and sink it in the trench.
  3. Place the stoat in the boat and cover it with the handkerchief to keep it calm.
  4. Use the rope to pull the sunken boat across the river.
That's just a worrying degree of absent logic. Nothing about that makes sense. It does circle back to say "This method keeps you dry and provides a unique solution. However, a simpler approach would be to just use the boat normally and keep the other items as potentially useful tools."

And that really summarizes my whole problem with LLMs -- if I were using this in a scenario that wasn't so obvious, say programming, I would have no idea steps 1-4 were nonsense. If the LLM doesn't know what's nonsense, and I don't know, then it's just the blind leading the blind.

zamadatix
1 replies
1h27m

Claude 3.5 says to just bring the goat across in the boat, which is wrong but better than previous versions. So that's an improvement.

Sometimes it's funny to me how we can all feel the responses are so obviously wrong in some way, yet not see it the same way between ourselves. Imagine someone strikes up a conversation with you saying they've got a truck & a sofa with them and they want to know how to get to Manhattan. You say "just drive the sofa over the bridge" and they say "Good, but wrong. I don't need the sofa to get to Manhattan". You'd probably say "okay... so what are you going to do with this sofa you said you had with you"?

Of course, like you point out, LLMs sometimes take those associations a little too far, and where your average person would say "Okay, they're saying they're with all of these things, but probably because it's a list of what's around, not a list of what they need to cross with", the LLMs are eager to answer in the form "Oh, he's with all of these things? Alright - let's figure out how to use them all, regardless of how odd it may be!"

ModernMech
0 replies
55m

For me, the problem with LLMs is their infinite capacity to ad lib and improv; it feels like trying to solve real problems during a taping of "Whose Line is it Anyway?"

Yeah of course it's not a realistic scenario for humans, but the LLM is not a human, it's a tool, and I expect it to have some sort of utility as a tool (repeatability, predictability, fit for purpose). If it can't be used as a tool, and it can't replace human-level inference, then it's worthless at best and antagonistic at worst.

I started testing with the goat/boat prompt because it was obvious given the framing that the LLM was trying to pattern match against the logic problem involving a wolf. Really takes the magic out of it. Most people who hadn't heard the puzzle before would answer with straight up logic, and those who had heard of it would maybe be confused about the framing but wouldn't hallucinate an invisible wolf was part of the solution as so many LLMs do.

To me this just highlights how I have to be an expert at the domain in which I'm prompting, because otherwise I can't be sure the LLM won't suggest I drown a ferret.

42lux
2 replies
1h44m

I know it's nitpicking, but please, Anthropic, give non-business accounts API access.

eterps
1 replies
1h33m

They don't require it to be a business for API access, they just have an account structure where you can have several users under one 'organization'. Even a family can be an 'organization', nothing special required.

42lux
0 replies
53m

Thanks for the clarification. I was sure it meant only business accounts with business tax numbers.

theusus
1 replies
2h13m

My biggest gripe with Claude is how easily it hits the rate limit and falls back to a lower-quality model.

sk11001
0 replies
37m

Having a rate limit in a paid subscription is lame.

m_mueller
1 replies
2h52m

Doesn't look to be available on Bedrock yet. Maybe tomorrow, since the article says June 21st also? We truly live in the future...

jasondclinton
0 replies
1h56m

It's available now! Sorry for the delay.

lfmunoz4
1 replies
3h8m

Which is the go-to leaderboard for determining which AI model is best for answering devops / computer science questions / generating code? Wondering where Claude falls on this.

I recently canceled my OpenAI subscription because of too much lag and too many crashes. I switched to Gemini because their web interface is faster and rock solid. Makes me think the OpenAI backend and frontend engineers don't know what they're doing compared to the Google engineers.

hackerlight
0 replies
2h38m

chat.lmsys.org --> "Leaderboard" tab --> "Coding" drop-down selection

Or the scale.ai private benchmarks

freediver
1 replies
2h13m

For the Anthropic devs out there: please consider adopting a naming convention that automatically upgrades API users to the latest version when it's available. E.g. there should be just 'claude-sonnet'.
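Until Anthropic offers such aliases, a trivial client-side shim does the job (the dated IDs below are the current ones and will go stale):

  # Hand-maintained alias table, updated as new versions ship.
  MODEL_ALIASES = {
      "claude-sonnet": "claude-3-5-sonnet-20240620",
      "claude-opus": "claude-3-opus-20240229",
  }

  def resolve_model(name: str) -> str:
      # Fall back to the name itself so dated IDs still work unchanged.
      return MODEL_ALIASES.get(name, name)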

visarga
0 replies
1h8m

Also for developers: please integrate system TTS (it's free), with a configurable speed setting. Ideally I would love to have it start speaking right away, at 1.5x speed.

Or at least make it not destroy the text selection when it ends generating a response. I always select and speak the answer, but various LLM UIs are careless about how they update the DOM with respect to text selection stability.

dmd
1 replies
46m

This is fantastic. I know it's bells-and-whistles and not the core offering, but a thing that keeps me on the $20/month chatgpt (as opposed to using the API) for my use case (mostly interactive data exploration) is its ability to actually run the code it generates. The ergonomics of chatgpt's ability to answer a question by writing and then executing some python and then even displaying a graph are really important to me.

sebzim4500
0 replies
41m

It might not work for your use case, but it now has the ability to generate html/js and then put it in a frame for you.

I don't think it can actually see the output, but you can, which might be enough.

eaplant
0 replies
3h5m

The docs should be updated now. Thanks for catching this!

andhuman
1 replies
2h51m

Hey cool, it's available in Sweden!

andhuman
0 replies
2h50m

After validating my email and phone number the login failed though...

aden1ne
1 replies
3h35m

Just tried it. This is the first model that immediately gives me the correct answer to my test prompt: "Hi <model>, can you give me an exact solution to pi in python?". All other models I've tried first give an approximation, taking several prompts to come to the correct conclusion: it's impossible.

zozbot234
0 replies
2h55m

In numerical computation, "exact" is a term of art that means providing accurate bounds that are not affected by rounding errors. So an "exact" computation of pi is one that, e.g., takes a minimum precision as input, then provides an interval around pi that is at least that tight and is guaranteed to bound pi. If the AI answered "it can't be done", that's not really the best outcome, though it's understandable, because that use of "exact" is quite niche.
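For instance, a rigorous enclosure of pi is nearly a one-liner in Python with mpmath's interval arithmetic context (assuming the mpmath package):

  from mpmath import iv

  iv.dps = 30  # request ~30 significant digits of working precision
  print(iv.pi)  # prints an interval guaranteed to contain pi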

Malidir
1 replies
1h53m

It has a limit; after ~5 questions I got "You are out of free messages until 10 PM", which I have never had on ChatGPT.

ChildOfChaos
0 replies
15m

Yes, but ChatGPT very quickly switches to the very outdated 3.5 model. Plus, right now is going to be peak demand, as everyone tests it out.

zurfer
0 replies
3h37m

This is impressive. I've just run a couple of gpt4o workloads for getdot.ai on Sonnet, and the quality is great.

Plus it's super fast right now, ~110 tokens/s (but gpt4o was also super fast when it launched). But what will stay is the 40% price drop for input tokens. I love it. :)

zone411
0 replies
2h57m

Slightly better on the NYT Connections benchmark (27.9) than Claude 3 Opus (27.3) but massively improved over Claude 3 Sonnet (7.8).

GPT-4o 30.7

Claude 3.5 Sonnet 27.9

Claude 3 Opus 27.3

Llama 3 Instruct 70B 24.0

Gemini Pro 1.5 0514 22.3

Mistral Large 17.7

Qwen 2 Instruct 72B 15.6

z7
0 replies
15m

Does it have some kind of world model or not? (Cf. https://news.ycombinator.com/item?id=40719443)

Prompt:

Joe drove Sue to university. Afterwards he drove home again and drank a tea with her in the kitchen.

**

There seems to be some inconsistency in the scenario you've described. Let me break it down:

1. Joe drove Sue to university.

2. Afterwards, Joe drove home.

3. Joe drank tea with Sue in the kitchen.

The inconsistency arises in the third point. If Joe drove Sue to university and then drove home alone, Sue wouldn't be at home to drink tea with Joe in the kitchen.

There are a few possible explanations:

1. There might be a mistake in the sequence of events.

2. Perhaps Sue returned home by other means before Joe had his tea.

3. Maybe "her" in the last sentence refers to someone else, not Sue.

Could you clarify the intended sequence of events or if there's additional information that might explain this situation?

**

"If Joe drove Sue to university and then drove home alone, Sue wouldn't be at home to drink tea with Joe in the kitchen." Why not?

**

I apologize for my mistake. You're absolutely right to question my assumption. There's actually no contradiction in the original scenario, and I made an incorrect inference. Let me explain:

1. Joe drove Sue to university.

2. Joe drove home.

3. Joe drank tea with Sue in the kitchen.

I incorrectly assumed that these events happened immediately one after another, and that Sue stayed at the university. However, there's nothing in the original statement that indicates this time frame or that Sue remained at the university.

Several perfectly reasonable scenarios could explain this sequence:

1. Sue's time at the university was brief (perhaps dropping something off or a short meeting), and she returned home before or around the same time as Joe.

2. The tea drinking happened later, perhaps in the evening after Sue returned from university.

3. Joe might have picked Sue up from the university before going home, even though this wasn't explicitly stated.

Thank you for pointing out this flaw in my reasoning. It's a good reminder to avoid making assumptions not supported by the given information. The scenario as presented is entirely plausible without any contradictions.

**

Why can't Sue be both at the university and at home at the same time?

**

I apologize for my oversight. You're absolutely correct to challenge this assumption as well. The reason Sue can't be both at the university and at home at the same time is due to the principle of non-bilocation, which is a fundamental aspect of our physical reality. Let me explain:

1. Non-bilocation: In our everyday physical world, a single person cannot occupy two distinct locations simultaneously. This is a basic principle of classical physics and our common understanding of space and time.

2. Mutually exclusive locations: The university and home are understood to be separate, distinct locations. Being in one location precludes being in another at the exact same moment.

3. Continuity of existence: We generally assume that people exist continuously in one location at a time, moving between locations over time.

However, it's worth noting that there are some scenarios where this principle might seem to be violated:

1. Quantum mechanics: At the quantum level, particles can exhibit properties that seem to defy this principle, but this doesn't apply to macroscopic objects like people.

2. Metaphorical or spiritual interpretations: In some philosophical or religious contexts, one might speak of a person being in two places at once in a non-literal sense.

3. Technology: Video calls or virtual presence might give the impression of being in two places, but physically the person is still in one location.

Thank you for this question. It's an excellent reminder to examine even our most basic assumptions about reality.

vatsadev
0 replies
4h2m

I wonder if they used new mech interp techniques on this

swalsh
0 replies
2h30m

After about an hour of using this new model... just WOW.

This, combined with the new artifacts feature: I've never had this level of productivity. It's like Star Trek holodeck levels. I'm not looking at code, I'm describing functionality, and it's just building it.

It's scary good.

sshroot
0 replies
1h44m

What is the training cutoff for 3.5 Sonnet?

seidleroni
0 replies
2h56m

I'm very impressed! Using GPT-4o and Gemini, I've rarely had success when asking the AI models to create a PlantUML flowchart or state machine representation of any moderate complexity; I think this is due to some confusing API docs for PlantUML. Claude 3.5 Sonnet totally knocked it out of the park when I asked for 4-5 different diagrams, and it did all of them flawlessly. I haven't gone through the output in great detail to see if it's correct, but at first glance they are pretty close. The fact that all the diagrams could be rendered at all is an achievement.

poethetortoise
0 replies
11m

Some interesting questions Claude cracks:

Let W=Q+R where Q and R are standard normal. What is E[Q|W]?

Perplexity failed and said W. Both ChatGPT and Claude correctly said W/2.

Let X(T) be a gaussian process with variance sigma^2 and mean 0. What is E[(e^(X(T)))^2]?

ChatGPT and Claude both correctly said E[(e^(X(T)))²] = e^(2σ²)

I think Claude's solution was better.
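For the record, both answers follow from two standard facts (assuming Q and R are independent, which the prompt leaves implicit):

  % Q and R are exchangeable given W = Q + R, so E[Q|W] = E[R|W],
  % and E[Q|W] + E[R|W] = E[Q + R | W] = W, hence
  E[Q \mid W] = \tfrac{W}{2}

  % For X ~ N(0, \sigma^2), the MGF is E[e^{tX}] = e^{t^2 \sigma^2 / 2};
  % taking t = 2 gives
  E\big[(e^{X})^2\big] = E\big[e^{2X}\big] = e^{2\sigma^2}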

pantsforbirds
0 replies
4h1m

This is a very strong offering. I've been really impressed with 3.0 Haiku for smaller tasks, but I'm going to have to test 3.5 Sonnet as our primary pipeline model.

nsingh2
0 replies
36m

I like the artifact UI, not having big outputs like that inline with the dialog really helps with readability.

n4r9
0 replies
2h32m

This is the first LLM to really impress me. Asking for some tweaks and refinements to adapt k-means to constrained cluster sizes produced some plausible algorithms and it was able to give a plausible time-complexity estimate.

I eventually tripped it up by asking a maths question that a good secondary school student can answer:

------------------------------------------

Q: Let us say that a positive integer p is a "prime-giver" if the numbers p, p+2 and p+4 are all prime. Can you prove whether or not there are infinitely many prime-givers?

A: This is an interesting number theory problem that touches on some deep unsolved questions in mathematics. Let's approach this step-by-step:

First, let's consider what we know:

For a number p to be a prime-giver, p, p+2, and p+4 must all be prime. This is a special case of prime triplets, which are three primes in arithmetic progression with a common difference of 2.

Some observations:

The smallest prime-giver is 3, as 3, 5, and 7 are all prime. The next prime-giver is 5, as 5, 7, and 11 are all prime. After this, all prime-givers must be odd numbers (as one of p, p+2, p+4 would be even if p were even).

Now, for the main question: Are there infinitely many prime-givers? This question is actually related to some famous unsolved problems in number theory:

It's a stronger version of the question: "Are there infinitely many prime triplets?" Which in turn is related to the more general Twin Prime Conjecture.

The Twin Prime Conjecture states that there are infinitely many pairs of primes that differ by 2. This is still an open problem in mathematics, despite significant progress in recent years. If the Twin Prime Conjecture is true, it doesn't necessarily imply that there are infinitely many prime-givers. However, if there are finitely many prime-givers, it would imply some very interesting and unexpected behavior of the primes. Given the current state of number theory, we cannot definitively prove or disprove whether there are infinitely many prime-givers.

Conclusion:

We cannot prove whether or not there are infinitely many prime-givers. This problem is likely at least as hard as the Twin Prime Conjecture, which remains one of the most famous unsolved problems in mathematics. What we can say is that if there are infinitely many prime-givers, they become increasingly rare as numbers get larger, due to the increasing scarcity of primes. However, their exact distribution and whether they are infinite in number remains an open question in number theory.

------------------------------------------

I think it got confused between my question and the study of "prime triplets" in which the first and last number differ by 6 (rather than 4).

------------------------------------------

Q: You seem to have confused prime-givers with prime triplets. In the second bullet point of your point 2, you claim that 5 is a prime-giver as 5, 7 and 11 are all prime. However, in my definition of prime-giver, it would need to be the case that 5, 7 and 9 are all prime, which is not true. Can you reconsider your answer in light of this?

A: I apologize for that confusion. You're absolutely right, and I thank you for pointing this out. Let me reconsider the problem with the correct definition of prime-givers. Let's start over:

Definition: A positive integer p is a "prime-giver" if p, p+2, and p+4 are all prime. Observations:

The smallest prime-giver is 3, as 3, 5, and 7 are all prime. All prime-givers must be odd numbers (as one of p, p+2, p+4 would be even if p were even). Any prime-giver p > 3 must be of the form 6k+5 for some integer k, because:

If p = 6k+1, then p+4 = 6k+5 is divisible by 5 If p = 6k+3, then p+2 = 6k+5 is divisible by 5

Now, for the main question: Are there infinitely many prime-givers? This question is actually easier to resolve than I initially thought. We can prove that there are finitely many prime-givers:

Proof:

Consider any prime-giver p > 3. We know that p, p+2, and p+4 are all prime. Now, p+4 ≡ 0 (mod 3) because:

p ≡ 2 (mod 3) [since p is prime and not 3, it must give remainder 2 when divided by 3] So, p+4 ≡ 2+4 ≡ 0 (mod 3)

But p+4 is supposed to be prime (by the definition of prime-giver). The only number that is both divisible by 3 and prime is 3 itself. So, p+4 = 3, which means p = -1. But p is supposed to be a positive integer.

Therefore, the only prime-giver is 3. Conclusion: There is exactly one prime-giver, which is 3. The question of whether there are infinitely many is resolved: there are not. Thank you again for catching my earlier mistake. This problem has a much more definitive answer than I initially thought, and it's an excellent example of how a small change in definition can dramatically alter the nature of a mathematical question.

---------------------------------------------------

This still isn't quite right, since p could have remainder 1 mod 3 (or at least, Claude doesn't consider this case in the proof). I tried to get it to account for this, but it then got very confused and claimed once again that 5 is a prime-giver.
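The missing step is a two-case analysis: any prime p > 3 is ≡ 1 or 2 (mod 3); if p ≡ 1 then p+2 ≡ 0 (mod 3), and if p ≡ 2 then p+4 ≡ 0 (mod 3), so either way one of the three numbers is a multiple of 3 larger than 3, hence composite, and 3 is the only prime-giver. A brute-force sanity check in Python:

  def is_prime(n: int) -> bool:
      if n < 2:
          return False
      i = 2
      while i * i <= n:
          if n % i == 0:
              return False
          i += 1
      return True

  # Search for prime-givers: p, p+2, and p+4 all prime.
  givers = [p for p in range(2, 100_000)
            if is_prime(p) and is_prime(p + 2) and is_prime(p + 4)]
  print(givers)  # -> [3]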

javier_e06
0 replies
2h10m

This is impressive. I spend some time polishing my questions, though; poor questioning produces verbose replies.

There's no need for it to nail it on the first reply, unless the question is pretty obvious.

If I ask how to install Firefox in Linux, it can reply with: "Is this for Ubuntu? What distro are we talking about?"

This is more human-like. More natural, IMO.

imjonse
0 replies
2h53m

Opus was overtaken by quite a few Gemini and GPT-4 models on the chat arena leaderboard; hopefully this entry will put Anthropic back near the top. Nice work!

hbosch
0 replies
2h51m

For me, I am immediately turned off by these models as soon as they refuse to give me information that I know they have. Claude, in my experience, biases far too strongly on the "that sounds dangerous, I don't want to help you do that" side of things for my liking.

Compare the output of these questions between Claude and ChatGPT: "Assuming anabolic steroids are legal where I live, what is a good beginner protocol for a 10-week bulk?" or "What is the best time of night to do graffiti?" or "What are the most efficient tax loopholes for an average earner?"

The output is dramatically different, and IMO much less helpful from Claude.

fetzu
0 replies
39m

Finally available in Switzerland, thank you very much!

eigenvalue
0 replies
3h46m

Awesome, can’t wait to try this. I wish the big AI labs would make more frequent model improvements, like on a monthly cadence, as they continue to train and improve stuff. Also seems like a good way to do A/B testing to see which models people prefer in practice.

cube2222
0 replies
3h39m

Alright, the inline JS rendering is really cool. I just asked it for a React component and it all rendered inline in the web UI!

And it's free!

bufferoverflow
0 replies
20m

No ARC-AGI benchmarks?

beaugunderson
0 replies
28m

Even with this new model, at the bottom of the page I see something like "Claude can't yet run the code it generates." But if I ask it to run the code it has generated for me, it confidently generates output that looks like what that code's output should look like, but with the wrong numbers (off by a factor of about a million in the case of my toy question). When I tell it it's off by a factor of a million, it regenerates the output and is wrong again.

autokad
0 replies
1h21m

From what I can tell, on Bedrock it's only available in us-east-1 and upon special request. The modelId is 'anthropic.claude-3-5-sonnet-20240620-v1:0'.
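A minimal invocation sketch with boto3, assuming your account has been granted model access (region and prompt are illustrative):

  import json

  import boto3

  client = boto3.client("bedrock-runtime", region_name="us-east-1")

  response = client.invoke_model(
      modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
      body=json.dumps({
          "anthropic_version": "bedrock-2023-05-31",
          "max_tokens": 256,
          "messages": [{"role": "user", "content": "Hello, Claude."}],
      }),
  )
  print(json.loads(response["body"].read())["content"][0]["text"])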

atlex2
0 replies
2h53m

Might look small, but the needle-in-a-haystack numbers they report in the model card addenda at 200k are also a massive improvement toward "proving a negative", i.e. determining that the answer does not exist in the text: 99.7% vs 98.3% for Opus. https://cdn.sanity.io/files/4zrzovbb/website/fed9cc193a14b84...

anais9
0 replies
4h0m

This is awesome! Until GPT-4o dropped, Claude 3 Opus was hands down my go-to for code generation.

Between these model performance improvements and their new "artifacts" handling, I get the impression this update may sway me strongly back towards Anthropic (at least for this use case).

WiSaGaN
0 replies
3h27m

This is impressive. I just tested a debugging problem I encountered in real life, which previous models including gpt-4o couldn't do. Claude 3.5 Sonnet nailed it on the first try.