Can anyone comment on its coding ability?
Considering cancelling my subscription with OpenAI as I was previously using GPT-4 quite heavily as a multiplier for myself, guiding it and editing outputs as required, but GPT-4o feels significantly worse for this use case. It is certainly better in many other areas, but its coding ability is not great.
I tried to revert back to standard GPT-4 but it is now so slow to respond (higher load?) that it breaks my mental flow, so I'm exploring other options.
(I'm biased as I work for Anthropic)
I broke my hand 2 months ago and Claude 3.5 Sonnet has been writing all my code for me. It's been awesome.
Since you work there, can you ask that they are more transparent with exactly what you are paying for with a "pro" subscription?
Currently you are apparently paying for:
- Everything in Free
- Use Claude 3 Opus and Haiku
- Higher usage limits versus Free
- Create Projects to work with Claude around a set of docs, code, or files
- Priority bandwidth and availability
- Early access to new features
But what are the usage limits? Higher than free by how much?
Having an invisible limit on a paid product really rubs me the wrong way. Maybe some rate-limiting after a certain amount would be better than a hard cutoff, but even then I'd like to know what the limit is before I pay, not when I accidentally hit it in the middle of something important.
I ran into this rate limit recently when generating a logo in SVG. It's 45 messages per 5 hours (with some caveats). It's documented here [0].
[0] https://support.anthropic.com/en/articles/8324991-about-clau...
Seems to also be limited by tokens. It’s still quite obscure.
Even more so:
Fine for the free tier of course, but not great for the paid version.
For what it's worth, I haven't run into any limitations while using the API side, although I also haven't really had to use it so extensively that it would be a problem.
ChatGPT says the same about capacity for Pro I think.
With a fixed capacity, someone’s traffic is going to be rejected. I think it makes sense to have a variable rate limit depending on available capacity and demand.
I was literally about to type in my credit card number after seeing the new code editor window UX, but for the life of me I had no idea what I was paying for.
I’ve been paying for GPT since 3.5 debuted and I know what I’m getting - full, unlimited use of the best model. Period.
Anthropic needs to figure out what the hell they are selling.
ChatGPT Plus does NOT give you full unlimited use of the best model. There are still limits.
FWIW I regularly hit my ChatGPT Plus limits, and I think the “dynamic” limiting is regularly in place. I’ve only once hit my Claude Pro limit. I now use Claude more than ChatGPT.
From this page:
https://help.openai.com/en/articles/6950777-what-is-chatgpt-...
As of May 13th 2024, Plus users will be able to send 80 messages every 3 hours on GPT-4o and 40 messages every 3 hours on GPT-4. The message cap for a user in a ChatGPT Team workspace is approximately twice that of ChatGPT Plus.
Please note that unused messages do not accumulate (i.e. if you wait 6 hours, you will not have 80 messages available to use for the next 3 hours on GPT-4).
In certain cases for Plus users, we may dynamically adjust the message limit based on available capacity in order to prioritize making GPT-4 accessible to the widest number of people.
It says 5x the free tier usage, but I agree we could be more explicit about what that is. Just made a ticket :)
+1. More transparency around usage limits would be very helpful. Message-based limits seem arbitrary when messages can be very different lengths. Maybe a token-based limit with a bar that tracks how much you've used: it would go down with use and back up with time and capacity, and you could track and pace your usage better than with a warning that you're on your last ten. Or just let us pay more for more usage! Switch to the API! The workbench doesn't have as nice affordances as the chat.
Mind explaining your workflow? I actually just broke my wrist today :(
Oh no! Mostly I use Apple Voice Control to dictate into Claude, and then copy and paste code back and forth between my IDE and Claude.
For longer writing, I really like going for a walk for 45 minutes, brain dumping on a topic, and transcribing it. Then I write a brief outline and have Claude fill it out into a document, explicitly only using language that I used in the transcript. Then I edit via voice.
Thank you!
Are you using the claude.ai interface or do you have something bespoke?
I don't know about the usability for you but if you use vscode already cursor is basically vscode but has access to various llms, including Claude. I started using it recently and really like it.
https://cursor.com
Why do you work? Can't you take sick leave?
The first night it was released, it one-shotted merging two Python files together (one was an incomplete mess of speech-to-text, the other was an also-incomplete Claude API demo with a fake loop with forced conversation paths). It rewrote the fake loop and turned it into a voice interface that worked on the first try (then I asked it to add text-to-speech and it did). The second night it improved the code by adding threads (which I do know about but wouldn't have thought to use there) and it worked better. Then Linux decided not to work properly (PulseAudio and ALSA issues); I asked it to help me migrate to PipeWire and it gave me step-by-step directions, and when I had issues all I had to do was paste the error and it fixed things enough that it's all working again. It's been a pretty amazing week from my perspective. I finally had to take a break earlier this week, but I still use Claude 3.5 for gardening (taking pictures and asking for help) and general advice.
Would you please describe your setup for coding? I'm interested in trying this out, but I need an ELI5. Thanks.
PS Anybody is welcome to jump in with a description of their setup.
It's probably not gonna work well: he fixed the Claude API demo with Claude. You didn't think of asking the AI, so you need to work more on your own inference abilities first.
There's not really anything to it. I mostly use the console.anthropic.com workbench (the API, pay-as-you-go), so I haven't played around much with Artifacts (I am probably getting a Claude Pro subscription later today). But yeah, I just copy/pasted my 2 code files into the window, told it I wanted a conversational voice interface, and in less time than it took to describe it, it wrote it and even saved a bunch of output tokens by saying where to copy large parts of the first file's boilerplate code. That had been a project I started around Easter and had kept putting off because it felt too hard for my somewhat limited time. But there's no moat or special setup to this stuff; anybody who can communicate in a text box can productively use this tool.
Sourcegraph Cody uses Claude
I recently tested the Cursor IDE [1]. It's VS Code with nice AI integrations, made with great UX in mind. It integrates with ChatGPT and Claude Sonnet/Opus. In my experience (Rust/TypeScript), the Sonnet 3.5 model gives better results.
You can pass the current file, or multiple files or your whole project as context to the models.
[1] https://cursor.com
I just copy and paste snippets and ask GPT to fix or translate them.
Gardening? The vision stuff for plants has always been terrible guesswork. Is it actually good/useful now?
It's a start. We just had a sudden storm that severely damaged some tomato plant stalks, and it suggested splinting them, or, if they're still struggling in a couple of days, cutting off the top and planting it in another pot, which is advice I've received from others too. The first picture I sent was some early blight on the leaves; it didn't get it on the first try (it suggested the stem had edema, which I suppose is possible), but when I mentioned I thought it had a fungal issue it correctly noted the brown leaves and suggested a copper fungicide, among other correct remedies. It also correctly discriminated between two types of cucumber (well, one was an Armenian melon) held by my friend, who had just picked them from her garden. Claude said that because there were hands present, it suggested the vegetables were home-grown and/or freshly picked, and before that it noted that they seemed to be good quality, which was correct. I went into this particular use of Claude thinking it would be useless, but that's not what I have found with Claude 3.5 Sonnet. I'll note the fungal blight was with Claude 3 Opus as well. Hope that helps.
Coding ability is off the charts, to the degree that I think in one year software engineering will be considered a dying field.
What will the people who use LLMs to write code be called? What will the field be called?
Where will all the power and computers come from that will replace the tens of millions of software developers?
I feel reasonably confident that most of the people who are making these hyperbolic claims either don't understand programming at a deep enough level to make the claim or are trying to sell you something.
No, I've got 12+ years of experience in FAANG software engineering. I really believe the end is near
FWIW, very few people will actually care about this statement unless you state your name, work you've done, etc. A lot of users here don't pay heed to anonymous appeals to authority.
I'll throw my hat behind this horse because, honestly, if I was just learning to code, I would have probably quit by now due to the frequency of these types of comments. LLMs have certainly improved at an impressive rate, and they're fairly decent at producing 'Lego' blocks. However, when it comes to block building—the real meat and potatoes of programming—they're, at best, bad butchers. Building functional systems is hard, and it's simply something LLMs can't do now or perhaps ever. Or I just don't know how to prompt. 50/50.
Yeah, good idea to put everyone out of their job and hand it to 2 companies. Then what happens when those companies decide to raise prices by 1000x (Twitter API?)? Or what happens when your system has a bug and doesn't work and the AI says it's all good? Who will introspect your code to find what's wrong?
Here we go again. If you seriously think that then you should probably consider changing careers.
And who would actually verify that the LLM's output works as expected? Unit tests? Who would verify the unit tests are testing what needs to be tested instead of just branch/output coverage? I've used these tools personally, and I've used GitHub Copilot at the company I work for, and at no point was I under the impression that LLMs can do all of my work. Each step needs verification that it actually gives you what you need, and some steps need a good description from someone who knows what the system is doing before the changes can be made where they are needed. Continuous extension of features without good design patterns in place, usually checked by devs, can also create a lot of problems. So what makes you think SWE will be considered a dying field in one year? What I see is a need for better devs who can see beyond boilerplate and be empowered by LLMs. That doesn't mean a dying field; it means raising the standards to break into the field.
You can use OpenAI's developer platform https://platform.openai.com/playground/chat
I haven't used the normal website in a year; it's a much better experience:
- It allows tweaking of the model / temperature.
- You pay only per token used
- It does not use your conversations to train their models
- You can edit or delete both your messages and the assistant's messages in the conversation
- Allows for easy message replay if the output is not great
Be mindful that you are paying per token for every message in the thread, so if the thread gets long it gets expensive, but it's still far better value than subscribing.
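For anyone who'd rather script it, the playground is just a front end over the same pay-per-token Chat Completions API, so a minimal call from Python looks roughly like this (a sketch using the official openai package; the model name, temperature, and prompts are just example values, and it assumes OPENAI_API_KEY is set in the environment):

    # Minimal sketch of the pay-per-token Chat Completions API that the playground wraps.
    # Assumes OPENAI_API_KEY is set; model, temperature, and prompts are example values only.
    from openai import OpenAI

    client = OpenAI()

    messages = [
        {"role": "system", "content": "You are a concise coding assistant."},
        {"role": "user", "content": "Write a Python function that reverses a string."},
    ]

    response = client.chat.completions.create(
        model="gpt-4o",       # swap in whichever model you want to compare
        temperature=0.2,      # tweakable per request, unlike the normal website
        messages=messages,
    )

    print(response.choices[0].message.content)
    # Every message in `messages` is billed again on each call, which is why
    # long threads get expensive.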
The world is changing too fast for knowledge that is nine months old.
------
USER
What is the maximum number of tokens supported by gpt-4o?
ASSISTANT
As of my most recent update in October 2023, GPT-4 does not have an officially designated variant called "gpt-4o". However, OpenAI's GPT-4 model supports different maximum token limits depending on its variant. There are two primary context lengths for GPT-4 models:
1. *GPT-4 with 8K context length*: This variant supports up to 8,000 tokens.
2. *GPT-4 with 32K context length*: This variant supports up to 32,000 tokens.
These limits encompass both input and output tokens in a given interaction. If you intended to refer to a specific or newly introduced variant, such as "gpt-4o," it is essential to look up the most current resources or official announcements from OpenAI, as models and their capabilities can evolve.
I don't know what you're trying to point out here. The models aren't trained on themselves.
ChatGPT and Gemini incorporate web search into their results, so they're not limited to stale information.
If you think you can save money by paying per token in the playground, you should know that you're not getting the same results.
For example, if I ask Gemini the identical question, it gives a reasonable answer, even though Gemini was released before GPT-4o was. It can look up the current OpenAI documentation and summarize it:
USER
What is the maximum number of tokens supported by gpt-4o?
ASSISTANT
GPT-4o has a context window of 128,000 tokens. This means it can process up to 128,000 tokens between your input and its output combined. However, the maximum number of output tokens (the AI's response) is limited to 4,096 tokens.
I've been playing around with it this week and its coding ability is insane (for an LLM). I've given it some pretty sloppy descriptions of things I want to do and it's managed to figure out exactly how to do it on the first or second try; I'm talking things like building animations in React that can't be described very well in text. The big pain point is copying and pasting things back and forth to have it edit them. If it were integrated and could see my local files, that would be killer. I know there are various companies working on that, but the JetBrains AI integration, for example, is garbage compared to the results I get by manually asking Claude.
I wasn't worried about how this would affect our industry a few months ago, but this has me reconsidering. It's like a junior engineer that can do most tasks in seconds for a couple of cents.
What worries me is you need that time in the dirt to get a feel for coding as a craft. And at least for me that aspect of knowing the craft helps get my thinking in tune with problem solving in a very productive way.
Coding can be similar to playing an instrument, if you have mastery, it can help you be more expressive with the ideas you already have and lead you to new ones.
Whereas if we take away the craft of coding, I think you end up with the type of code academic labs produce: something that starts purely on a "drawing board," is given to the grad student/intern/LLM to make work, and while it will prove the concept, it won't scale in the long term, because the intern doesn't know when to spend an extra 30 minutes on a function so that it will be more flexible down the road.
I see this sentiment a lot regarding gen AI. And I get it, we need to learn our tools. But this seems like it's saying the only way to learn problem solving is the way you learned it. That's just not true. Everyone learns problem solving differently, and the emerging field of gen AI will figure out its own way. It's a different way of thinking. I see my niece using ChatGPT to make projects I wouldn't even have imagined taking on at her age. Her games work. Who am I to say she isn't learning problem solving? In Hindi we say "pratyaksh ko praman ki kya avashyakta" (what's right in front of you doesn't require proof).
I’d say it’s far more useful than a junior engineer. A junior might know one or two areas well, but Claude 3.5 knows literally every steep learning curve across all engineering professions. Claude may not yet be able to orchestrate a large project, but it can advise on structuring using knowledge drawn from best practices across every conceivable discipline. We may not get “AGI”; we might get something far more useful instead.
I switched to Claude when Opus came out and have never looked back.
Anthropic tech is vastly better at code.
Also, Anthropic Projects is HUGE.
You get to create a set of default instructions that all prompts in the project adhere to, like: Use LINQ whenever possible. Consider Fluent capabilities when they will improve developer experience.
Then you can add all of your existing code (though as has been pointed out, connecting to a repo would be ideal).
And then you can have multiple conversations within that project, so you can separate your lines of thinking.
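There's no Projects equivalent in the API as far as I know, but if you're on the pay-as-you-go side you can approximate the default-instructions part by pinning them into a system prompt. A minimal sketch with the anthropic Python package (the model name, instruction text, and prompt are just example values; assumes ANTHROPIC_API_KEY is set):

    # Rough API-side stand-in for a Project's default instructions: pin them in the
    # system prompt so every request follows them. Assumes ANTHROPIC_API_KEY is set;
    # model name, instructions, and prompt are example values only.
    import anthropic

    client = anthropic.Anthropic()

    PROJECT_INSTRUCTIONS = (
        "Use LINQ whenever possible. "
        "Consider fluent APIs when they will improve developer experience."
    )

    message = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1024,
        system=PROJECT_INSTRUCTIONS,
        messages=[
            {"role": "user", "content": "Refactor this C# loop into a LINQ query: ..."},
        ],
    )

    print(message.content[0].text)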
Alas, the same problems exist. Sonnet 3.5 is faster and a bit smarter, but it still cannot follow a longer conversation.
This is the problem that LLMs have not yet solved.
I've been using it in https://double.bot (a VS Code coding copilot I'm hacking on) and it's been pretty great so far. We have support for all the top models (GPT-4o, DeepSeek, Opus, etc.), so I regularly switch between them, especially if I see a challenging problem. So far, Sonnet 3.5 has the best quality and is the one I use by default.
How does double.bot compare against cursor.sh? I've been using the latter for a while after switching from copilot++ and am pleased.
I was trying to get it to create an LLM from scratch using tensorflow.js, with training and prediction. It got about 95% of the way there, which is impressive.
But GPT-4o actually got a working solution in a couple of prompts -> https://gist.github.com/thomasdavis/fadbca01605fb3cb64911077...
Though the new Artifacts feature is really damn handy: you can describe the most detailed projects and it does a really great job of what you asked for; I found it delivered beyond what I wanted.
I am now paying for both -.-
- it's hard to rank which codes better, but I think Claude has better abstractions
- sometimes I paste the output of one into the other and continue solving there
Yo! Long time no see haha <3
Would love to see your workflow sometime, my experimentations have been small tbh but I was quite impressed with Claude.
Funnily, I think (modulo costs) feeding outputs of one model into another once you hit a brick wall is quite powerful, but feels clunky to do at the moment
Interesting. I felt GPT-4 was virtually useless and GPT-3.5 was the best; then GPT-4o came along and it instantly became the only version worth using.
I find GPT-4o to be extremely powerful and useful.
What don't you like about it?
To clarify, I think it's extremely powerful and useful too. It's just that I get more misses than hits when using it as a rubber duck and coding assistant.
I saw the initial 3.5 Sonnet release a few days ago and decided to pull the trigger on canceling OpenAI and moving over, and that was a great decision. I've been having to do more and more intricate nudging and leading towards the correct answer or approach when using ChatGPT of late.
My experience with Claude is very positive when it comes to programming and planning out infrastructure. My only gripe so far has been some ethical constraints that didn't exist for ChatGPT, and those are a big one for me since I don't need Anthropic employees to act as my moral compass. For a specific example, asking about securing code through making decompiling or reading harder is a no-no for Claude, but a-ok for GPT.
I use it mostly for coding help, not Python or JS but a mix of C and C++ for my own framework.
It is surprisingly good and helpful. I am still exploring the limits.
Writing doc comments or test cases is much faster and more fun with this kind of tool, but you still have to double-check everything as it inevitably makes mistakes, often small and not obvious…
I switched from a GitHub Copilot subscription to Cody for VS Code, to give the competition a chance.
Cody's chat UI seems pretty good for making comparisons like this. You can set it to whichever LLM you want, including GPT-4o and Claude 3.5 Sonnet.
I haven't used Claude enough to do a detailed comparison, but GPT-4o and Claude 3.5 seem roughly similar for my coding questions.
I find its coding ability superior to GPT-4 and 4o.
Hit or miss, with significantly more hits than misses. I was able to build a useful cybersecurity service in about 6 hours leveraging Claude to write most of the code. It has certainly made me more productive.
I haven't had a great experience using GPT-4o for coding; it will sometimes make a mistake, fix the mistake by introducing another mistake, then fix that second mistake by repeating the first. Or other weird things like that. I love the text to speech, and image generation sometimes comes in handy, but ultimately I prefer Sonnet 3.5 for coding and have recently switched to that.
If OpenAI comes out with a stronger model I’ll prob switch back, I don’t have much brand loyalty in this regard. I could see some features making usage more sticky (memory, projects, etc).
It's slightly higher ranked than 4o on the lmsys leaderboard, Coding section. But the difference is not big.
What languages / frameworks do you focus on?
I’ve also had worse results with 4o than 4. It’s faster but it vomits output.
I can’t tell if I need to go tweak memory or if it’s just like this.