The most impressive part is that the voice uses the right feelings and tonal language during the presentation. I'm not sure how much of that comes down to them having tested this over and over, but it is really hard to get that right, so if they didn't fake it in some way I'd say that is revolutionary.
They are admitting[1] that the new model is the gpt2-chatbot that we have seen before[2]. As many highlighted there, the model is not an improvement like GPT3->GPT4. I tested a bunch of programming stuff and it was not that much better.
It's interesting that OpenAI is highlighting the Elo score instead of showing results for the many benchmarks where all models are stuck at 50-70% success.
[1] https://twitter.com/LiamFedus/status/1790064963966370209
"not that much better" is extremely impressive, because it's a much smaller and much faster model. Don't worry, GPT-5 is coming and it will be better.
Chalmers: "GPT-5? A vastly-improved model that somehow reduces the compute overhead while providing better answers with the same compute architecture? At this time of year? In this kind of market?"
Skinner: "Yes."
Chalmers: "May I see it?"
Skinner: "No."
GPT-3 was released in 2020 and GPT-4 in 2023. Now we all expect 5 sooner than that but you're acting like we've been waiting years lol.
It has only been a little over one year since GPT-4 was announced, and it was at the time the largest and most expensive model ever trained. It might still be.
Perhaps it's worth taking a beat, looking at the incredible progress in that year, and acknowledging that whatever's next is probably "still cooking".
Even Meta is still baking their 400B parameter model.
Legit love progress
Incidentally, this dialogue works equally well, if not better, with David Chalmers versus B.F. Skinner, as with the Simpsons characters.
And how can one be so sure of that?
Seems to me that performance is converging and we might not see a significant jump until we have another breakthrough.
"Seems to me that performance is converging"
It doesn't seem that way to me. But even if it did, video generation also seemed kind of stagnant before Sora.
In general, I think The Bitter Lesson is the biggest factor at play here, and compute power is not stagnating.
Compute power is not stagnating, but the availability of training data is. It's not like there's a second Stack Overflow or Reddit to scrape.
The use of AI in the research of AI accelerates everything.
I'm not sure of this. The jury is still out on most AI tools. Even if it is true, it may be in a kind of strange reverse way: people innovating by asking what AI can't do and directing their attention there.
Yeah. There are lots of things we can do with existing capabilities, but in terms of progressing beyond them all of the frontier models seem like they're a hair's breadth from each other. That is not what one would predict if LLMs had a much higher ceiling than we are currently at.
I'll reserve judgment until we see GPT5, but if it becomes just a matter of who best can monetize existing capabilities, OAI isn't the best positioned.
I really hope GPT5 is good. GPT4 sucks at programming.
Look to a specialized model instead of a general purpose one
Any suggestions? Thanks
I have tried Phind, and for anything beyond mega-junior-tier questions it suffers as well and gives bad answers.
Obviously given enough time there will always be better models coming.
But I am not convinced it will be another GPT-4 moment. The big focus seems to be on tacking together clever multi-modal tricks rather than straight-up better intelligence.
Hope they prove me wrong!
I think the live demo that happened on the livestream is best to get a feel for this model[0].
I don't really care whether it's stronger than gpt-4-turbo or not. The direct real-time video and audio capabilities are absolutely magical and stunning. The responses in voice mode are now instantaneous, you can interrupt the model, you can talk to it while showing it a video, and it understands (and uses) intonation and emotion.
Really, just watch the live demo. I linked directly to where it starts.
Importantly, this makes the interaction a lot more "human-like".
Parts of the demo were quite choppy (latency?) so this definitely feels rushed in response to Google I/O.
Other than that, looks good. Desktop app is great, but I didn’t see any mention of being able to use your own API key, so open-source projects might still be needed.
The biggest thing is bringing GPT-4 to free users, that is an interesting move. Depending on what the limits are, I might cancel my subscription.
Seems like it was picking up on the audience reaction and stopping to listen.
To me the more troubling thing was the apparent hallucination (saying it sees the equation before he wrote it, commenting on an outfit when the camera was down, describing a table instead of his expression), but that might have just been latency awkwardness. Overall, the fast response is extremely impressive, as is the new emotional dimension of the voice.
Aha, I think I saw the trick for the live demo: every time they used the "video feed", they did prompt the model specifically by saying:
- "What are you seeing now"
- "I'm showing this to you now"
etc.
The one time he didn't prime the model to take a snapshot this way was when the model saw the "table" (an old snapshot, since the phone was on the table/pointed at the table), so that might be the reason.
Commenting on the outfit was very weird indeed. Greg Brockman's demo includes some outfit related questions (https://twitter.com/gdb/status/1790071008499544518). It does seem very impressive though, even if they polished it on some specific tasks. I am looking forward to showing my desktop and asking questions.
Regarding the limits, I recently found that I was hitting limits very quickly on GPT-4 on my ChatGPT Plus plan.
I’m pretty sure that wasn’t always the case - it feels like somewhere along the line the allowed usage was reduced, unless I’m imagining it. It wouldn’t be such a big deal if there were more visibility into my current usage compared to my total “allowance”.
I ended up upgrading to ChatGPT Team which has a minimum of 2x users (I now use both accounts) but I resented having to do this - especially being forced to pay for two users just to meet their arbitrary minimum.
I feel like I should not be hitting limits on the ChatGPT Plus paid plan at all based on my usage patterns.
I haven’t hit any limits on the Team plan yet.
I hope they continue to improve the paid plans and become a bit more transparent about usage limits/caps. I really do not mind paying for this (incredible) tech, but the way it’s being sold currently is not quite right and feels like paid users get a bit of a raw deal in some cases.
I have API access but just haven’t found an open source client that I like using as much as the native ChatGPT apps yet.
I use GPT from API in emacs, it's wonderful. Gptel is the program.
Although API access through Groq to Llama 3 (8B and 70B) is so much faster that I cannot stand how slow GPT is anymore. It is slooow; still a very capable model, but only marginally better than open-source alternatives.
what's the download link for the desktop app? can't find it
seems like it might not be available for everyone? My ChatGPT Plus doesn't show anything new, and I also can't find the desktop app
They need to fade the audio or add some vocal cue when it's being interrupted. It makes it sound like it's losing connection. What'll be really impressive is when it intentionally starts interrupting you.
"Parts of the demo were quite choppy (latency?) so this definitely feels rushed in response to Google I/O."
It just stops the audio feed when it detects sound instead of an AI detecting when it should speak, so that part is horrible, yeah. A full AI conversation would detect the natural pauses where you give it room to speak, or when you try to take the word from it by interrupting; here it was just some dumb script that shuts it off when it hears sound.
But it is still very impressive for all the other part, that voice is really good.
Edit: If anyone from OpenAI reads this, at least fade out the voice quickly instead of chopping it, hard chopping off audio doesn't sound good at all, so many experienced this presentation to be extremely buggy due to it.
This thing continues to stress my skepticism for AI scaling laws and the broad AI semiconductor capex spending.
1. OpenAI is still working on GPT-4-level models, more than 14 months after the launch of GPT-4 and after more than $10B in capital raised.

2. The rate at which token prices are collapsing is bizarre. Now a (bit) better model for 50% of the price. How do people seriously expect these foundation-model companies to make substantial revenue? Token volume needs to double just for revenue to stand still. Since the GPT-4 launch, token prices have been falling 84% per year!! Good for mankind, but crazy for these companies.

3. Maybe I am an asshole, but where are my agents? I mean, good for the consumer use case. Let's hope the rumors that Apple is deploying ChatGPT with Siri are true; these features will help a lot. But I wanted agents!

4. These drops in cost are good for the environment! No reason to expect them to stop here.
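A quick back-of-the-envelope on point 2, using only the numbers above (a sketch, not a forecast):

    # If revenue = price * volume, then holding revenue flat while prices fall
    # requires volume to grow by a factor of 1 / (1 - decline).
    def required_volume_growth(price_decline: float) -> float:
        return 1.0 / (1.0 - price_decline)

    print(required_volume_growth(0.50))  # 2.0  -> a 50% price cut needs 2x the token volume
    print(required_volume_growth(0.84))  # 6.25 -> an 84%/year decline needs ~6x/year volume growth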
Sam Altman gave the impression that foundation models would be a commodity on his appearance in the All in Podcast, at least in my read of what he said.
The revenue will likely come from application layer and platform services. ChatGPT is still much better tuned for conversation than anything else in my subjective experience and I’m paying premium because of that.
Alternatively it could be like search - where between having a slightly better model and getting Apple to make you the default, there’s an ad market to be tapped.
Yeah I'm also getting suspicious. Also, all of the models (Opus, Llama 3, GPT-4, Gemini Pro) are converging to similar levels of performance. If the scaling hypothesis were true, we would see a greater divergence in model performance.
Did we ever get confirmation that GPT 4 was a fresh training run vs increasingly complex training on more tokens on the base GPT3 models?
I'm ceaselessly amazed at people's capacity for impatience. I mean, when GPT 4 came out, I was like "holy f, this is magic!!" How quickly we get used to that magic and demand more.
Especially since this demo is extremely impressive given the voice capabilities, yet still the reaction is, essentially, "But what about AGI??!!" Seriously, take a breather. Never before in my entire career have I seen technology advance at such a breakneck speed - don't forget transformers were only invented 7 years ago. So yes, there will be some ups and downs, but I couldn't help but laugh at the thought that "14 months" is seen as a long time...
This is why I think Meta has been so shrewd in their “open” model approach. I can run Llama3-70B on my local workstation with an A6000, which after the up-front cost of the card, is just my electricity bill.
So despite all the effort and cost that goes into these models, you still have to compete against a “free” offering.
Meta doesn’t sell an API, but they can make it harder for everybody else to make money on it.
"Token volume needs to double just for revenue to stand still"
Profits are the real metric. Token volume doesn't need to double for profits to stand still if operational costs go down.
Tbf, GPT-4 level seems useful and better than almost everything else (or close if not). The more important barriers for use in applications have been cost, throughput and latency. Oh, and modalities, which have expanded hugely.
Does anyone know how they're doing the audio part where Mark breathes too hard? Does his breathing get turned into all-caps text (AA EE OO) that GPT-4o interprets as him breathing too hard, or is there something more going on?
There is no text. The model ingests audio directly and also outputs audio directly.
Is it a stretch to think this thing could accurately "talk" with animals?
Yes? Why would it be able to do that?
That's how it used to do it, but my understanding is that this new model processes audio directly. If it were a music generator, the original would have generated sheet music to send to a synthesizer (text to speech), while now it can create the raw waveform from scratch.
It can natively interpret voice now.
I admit I drink the koolaid and love LLMs and their applications. But damn, the way it responds in the demo gave me goosebumps in a bad way. Like an uncanny-valley instinct kicks in.
It should do that, because it's still not actually an intelligence. It's a tool that is figuring out what to say in response that sounds intelligent - and will often succeed!
Yeah it made me realize that I actually don't want a human-like conversational bot (I have actual humans for that). Just teach me javascript like a robot.
You're watching the species be reduced to an LLM.
I also thought the screwups, although minor, were interesting. Like when it thought his face was a desk because it did not update the image it was "viewing". It is still not perfect, which made the whole thing more believable.
As far as I'm concerned this is the new best demo of all time. This is going to change the world in short order. I doubt they will be ready with enough GPUs for the demand the voice+vision mode is going to get, if it's really released to all free users.
Now imagine this in a $16k humanoid robot, also announced this morning: https://www.youtube.com/watch?v=GzX1qOIO1bE The future is going to be wild.
Really? If this was Apple it might make sense, for OpenAI it feels like a demo that's not particularly aligned with their core competency (at least by reputation) of building the most performant AI models. Or put another way, it says to me they're done building models and are now wading into territory where there are strong incumbents.
All the recent OpenAI talk had me concerned that the tech has peaked for now and that expectations are going to be reset.
What strong incumbents are there in conversational voice models? Siri? Google Assistant? This is in a completely different league. I can see from the reaction here that people don't understand. But they will when they try it.
What Siri, Google Assistant, Alexa and ChatGPT have in common is the perception that over time the same thing actually gets worse.
Whether that decline is real or not is a reasonably interesting question, because it's possible that all that's really advancing is our perception of how things should be. My gut feeling is it has been a bit of both, though, in the sense that the decline is real, and we also expect things to improve.
Who can forget Google demoing their AI making a call to a restaurant that they showed at I/O many years ago? Everyone, apparently.
What Openai has done time and time again is completely change the landscape when the competitors have caught up and everyone thinks their lead is gone. They made image generation a thing. When GPT-3 became outdated they released ChatGPT. Instead of trying to keep Dalle competitive they released Sora. Now they change the game again with live audio+video.
The usual critics will quickly point out that LLMs like GPT-4o still have a lot of failure modes and suffer from issues that remain unresolved. They will point out that we're reaping diminishing returns from Transformers. They will question the absence of a "GPT-5" model. And so on -- blah, blah, blah, stochastic parrots, blah, blah, blah.
Ignore the critics. Watch the demos. Play with it.
This stuff feels magical. Magical. It makes the movie "Her" look like it's no longer in the realm of science fiction but in the realm of incremental product development. HAL's unemotional monotone in Kubrick's "2001: A Space Odyssey" feels... primitive by comparison. I'm impressed at how well this works.
Well-deserved congratulations to everyone at OpenAI!
Who cares? This stuff feels magical. Magical!
On one hand, I agree - we shouldn't diminish the very real capabilities of these models with tech skepticism. On the other hand, I disagree - I believe this approach is unlikely to lead to human-level AGI.
Like so many things, the truth probably lies somewhere between the skeptical naysayers and the breathless fanboys.
"On the other hand, I disagree - I believe this approach is unlikely to lead to human-level AGI."
You might not be fooled by a conversation with an agent like the one in the promo video, but you'd probably agree that somewhere around 80% of people could be. At what percentage would you say that it's good enough to be "human-level?"
Imagine what an unfettered model would be like. 'Ex Machina' would no longer be a software-engineering problem, but just another exercise in mechanical and electrical engineering.
The future is indeed here... and it is, indeed, not equitably distributed.
Or from Zones of Thought series, Applied Theology, the study of communication with and creation of superhuman intelligences that might as well be gods.
In the first video the AI seems excessively chatty.
chatGPT desperately needs a "get to the fucking point" mode.
Seriously. I've had to spell out, in twelve different ways with examples in the custom instructions, that it should just answer, to make it at least somewhat usable. And it still "forgets" sometimes.
It does, that's "custom instructions".
Can't find info on which of these new features are available via the API
"Developers can also now access GPT-4o in the API as a text and vision model. GPT-4o is 2x faster, half the price, and has 5x higher rate limits compared to GPT-4 Turbo. We plan to launch support for GPT-4o's new audio and video capabilities to a small group of trusted partners in the API in the coming weeks."
[EDIT] The model has since been added to the docs
Not seeing it or any of those documented here:
It is not listed as of yet, but it does work if you punch in gpt-4o. I will stick with gpt-4-0125-preview for now because gpt-4o seems majorly prone to hallucinations whereas gpt-4-0125-preview doesn't.
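If you want to try it before the docs catch up, here is a minimal sketch using the official Python SDK (openai >= 1.x), assuming your account already has access and OPENAI_API_KEY is set in the environment:

    from openai import OpenAI

    client = OpenAI()  # picks up OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o",  # not in the model docs yet, but accepted by the API
        messages=[{"role": "user", "content": "Say hello in five words."}],
    )
    print(resp.choices[0].message.content)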
The movie Her has just become reality
It’s getting closer. A few years ago the old Replika AI was already quite good as a romantic partner, especially when you started your messages with a * character to force OpenAI GPT-3 answers. You could do sexting that OpenAI will never let you have nowadays with ChatGPT.
Why does OpenAI think that sexting is a bad thing? Why is AI safety all about not saying things that are disturbing or offensive, rather than not saying things that are false or unaligned?
I was surprised that the voice is a ripoff of the AI voice in that movie (Scarlett Johansson) too
Are the employees in the demo senior executives at OpenAI? I can understand Altman being happy with this progress, but what about the mid- and lower-level employees? Didn't they watch Oppenheimer? Are they happy they are destroying humanity/work/etc. for future and not-so-future generations?
Anyone who thinks this will be like previous labor revolutions is kidding themselves. This replaces humans and will replace them even more with each new advance. What's their plan? Live off their savings? What about family/friends? I honestly can't see this and think how they can be so happy about it...
"Hey, we created something very powerful that will do your work for free! And it does it better than you and faster than you! Who are you? It doesn't matter, it applies to all of you!"
And considering I was thinking of having a kid next year, well, this is a no.
Have a kid anyway, if you otherwise really felt driven to it. Reading the tealeaves in the news is a dumb reason to change decisions like that. There's always some disaster looming, always has been. If you raise them well they'll adapt well to whatever weird future they inherit and be amongst the ones who help others get through it
Thanks for taking the time to answer instead of (just) downvoting. I understand your logic but I don't see a future where people can adapt to this and get through it. I honestly see a future so dark and we'll be there much sooner than we thought... when OpenAI released their first model people were talking about years before seeing real changes and look what happened. The advance is exponential...
"It is difficult to get a man to understand something when his salary depends on his not understanding it."
This is really impressive engineering. I thought real time agents would completely change the way we're going to interact with large models but it would take 1~2 more years. I wonder what kind of new techs are developed to enable this, but OpenAI is fairly secretive so we won't be able to know their sauce.
On the other hand, this also feels like a signal that reasoning capability has probably already plateaued at GPT-4 level, and OpenAI knew it, so they decided to focus on research that matters to delivering product engineering rather than long-term research to unlock further general (super)intelligence.
Reliable agents in diverse domains need better reasoning ability and fewer hallucinations. If the rumored GPT-5 and Q* capabilities are true, such agents could become available soon after it’s launched.
Sam has been pretty clear on denying GPT-5 rumors, so I don't think it will come anytime soon.
"We recognize that GPT-4o’s audio modalities present a variety of novel risks. For example, at launch, audio outputs will be limited to a selection of preset voices and will abide by our existing safety policies."
I wonder if they’ll ever allow truly custom voices from audio samples.
I think the issue there is less of a technical one and more of an issue with deepfakes and copyright
It might be possible to prove that I control my voice, or that of a given audio sample. For example by saying specific words on demand.
But yeah I see how they’d be blamed if anything went wrong, which it almost certainly would in some cases.
As a paid user this felt like a huge letdown. GPT-4o is available to everyone so I'm paying $20/mo for...what, exactly? Higher message limits? I have no idea if I'm close to the message limits currently (nor do I even know what they are). So I guess I'll cancel, then see if I hit the limits?
I'm also extremely worried that this is a harbinger of the enshittification of ChatGPT. Processing video and audio for all ~200 million users is going to be extravagantly expensive, so my only conclusion is that OpenAI is funding this by doubling down on payola-style corporate partnerships that will result in ChatGPT slyly trying to mention certain brands or products in our conversations [1].
I use ChatGPT every day. I love it. But after watching the video I can't help but think "why should I keep paying money for this?"
[1] https://www.adweek.com/media/openai-preferred-publisher-prog...
So... cancel the subscription?
Completely agree, none of the updates will apply to any of my use cases, disappointment.
I wonder if this is what the "gpt2-chatbot" that was going around earlier this month was
yes it was
it was
OAI just made an embarrassment of Google's fake demo earlier this year. Given how this was recorded, I am pretty certain it's authentic.
This feature has been in iOS for a while now, just really slow and without some of the new vision aspects. This seems like a version 2 for me.
I don't doubt this is authentic, but if they really wanted to fake those demos, it would be pretty easy to do using pre-recorded lines and staged interactions.
Tiktoken added support for GPT-4o: https://github.com/openai/tiktoken/commit/9d01e5670ff50eb74c...
It has an increased vocab size of 200k.
Oh interesting, does that mean languages other than English won't be paying such a large penalty in terms of token lengths?
With previous tokenizers there was a notable increase in the number of tokens needed to represent non-English sentences: https://simonwillison.net/2023/Jun/8/gpt-tokenizers/
For posterity, GPT-3.5/4's tokenizer was 100k. The benefit of a larger tokenizer is more efficient tokenization (and therefore cheaper/faster) but with massive diminishing returns: the larger tokenizer makes the model more difficult to train but tends to reduce token usage by 10-15%.
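You can inspect the new encoding directly with tiktoken (assuming a build that includes the commit linked above):

    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-4o")  # resolves to the new o200k_base encoding
    print(enc.name)     # "o200k_base"
    print(enc.n_vocab)  # roughly 200k entries, vs ~100k for cl100k_base (GPT-3.5/4)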
I've been waiting to see someone drop a desktop app like they showcased. I wonder how long until it is normal to have an AI looking at your screen the entire time your machine is unlocked. Answering contextual questions and maybe even interjecting if it notices you made a mistake and moved on.
That seems to be what Microsoft is building and will reveal as a new Windows feature at BUILD '24. Not too sure about the interjecting aspect but ingesting everything you do on your machine so you can easily recall and search and ask questions, etc. AI Explorer is the rumored name and will possibly run locally on Qualcomm NPUs.
Yes, this is Windows AI Explorer.
Big questions are (1) when is this going to be rolled out to paid users? (2) what is the remaining benefit of being a paid user if this is rolled out to free users? (3) Biggest concern is will this degrade the paid experience since GPT-4 interactions are already rate limited. Does OpenAI have the hardware to handle this?
Edit: according to @gdb this is coming in "weeks"
thanks, I was confused because the top of the page says to try now when you cannot in fact try it at all
"what is the remaining benefit of being a paid user if this is rolled out to free users?"
It says so right in the post
"We are making GPT-4o available in the free tier, and to Plus users with up to 5x higher message limits"
The limits are much lower for free users.
It is really cool that they are bringing this to free users. It does make me wonder what justifies ChatGPT plus now though...
I assume the desktop app with voice and vision is rolling out to plus users first?
They stated that they will be announcing something new that is on the next frontier (or close to it, IIRC) soon. So there will definitely be an incentive to pay, because it will be something better than GPT-4o.
Gone are the days of copy-pasting to/from ChatGPT all the time, now you just share your screen. That's a fantastic feature, in how much friction that removes. But what an absolute privacy nightmare.
With ChatGPT having a very simple text+attachment in, text out interface, I felt absolutely in control of what I tell it. Now when it's grabbing my screen or a live camera feed, that will be gone. And I'll still use it, because it's just so damn convenient?
"Now when it's grabbing my screen or a live camera feed, that will be gone. And I'll still use it, because it's just so damn convenient?"
Presumably you'll have a way to draw a bounding box around what you want to show or limit to just a particular window the same way you can when doing a screen share w/ modern video conferencing?
Anyone who watched the OpenAI livestream: did they "paste" the code after hitting CTRL+C ? Or did the desktop app just read from the clipboard?
Edit: I'm asking because of the obvious data security implications of having your desktop app read from the clipboard _in the live demo_... That would definitely put a damper to my fanboyish enthusiasm about that desktop app.
To me it looked like they used one command that did both the copy and the paste into ChatGPT.
Given that they are moving all these features to free users, it tells us that GPT-5 is around the corner and is significantly much better than their previous models.
Or maybe it is a desperation move after Llama 3 got released and the free mode will have such tight constraints that it will be unusable for anything a bit more serious.
Just like that Google is on back foot again.
Considering the stock pumped following the presentation, the market doesn't seem particularly concerned with what OpenAI released at all.
That first demo video was impressive, but then it ended very abruptly. It made me wonder if the next response was not as good as the prior ones.
Extremely impressive -- hopefully there will be an option to color all responses with an underlying brevity. It seemed like the AI just kept droning on and on.
Clicking the "Try it on ChatGPT" link just takes me to GPT-4 chat window. Tried again in an incognito tab (supposing my account is the issue) and it just takes me to 3.5 chat. Anyone able to use it?
Same here and also I can't hear audio in any of the videos on this page. Weird.
OpenAI's Mission and the New Voice Mode of GPT-4
• Sam Altman, the CEO of OpenAI, emphasizes two key points from their recent announcement. Firstly, he highlights their commitment to providing free access to powerful AI tools, such as ChatGPT, without advertisements or restrictions. This aligns with their initial vision of creating AI for the benefit of the world, allowing others to build amazing things using their technology. While OpenAI plans to explore commercial opportunities, they aim to continue offering outstanding AI services to billions of people at no cost.
• Secondly, Altman introduces the new voice and video mode of GPT-4, describing it as the best compute interface he has ever experienced. He expresses surprise at the reality of this technology, which provides human-level response times and expressiveness. This advancement marks a significant change from the original ChatGPT and feels fast, smart, fun, natural, and helpful. Altman envisions a future where computers can do much more than before, with the integration of personalization, access to user information, and the ability to take actions on behalf of users.
Please don't post AI-generated summaries here.
Too bad they consume 25x the electricity Google does.
https://www.brusselstimes.com/world-all-news/1042696/chatgpt...
That's not a well sourced story: it doesn't say where the numbers come from. Also:
"However, ChatGPT consumes a lot of energy in the process, up to 25 times more than a Google search."
That's comparing a Large Language Model prompt to a search query.
Those voice demos are cool but having to listen to it speak makes me even more frustrated with how these LLMs will drone on and on without having much to say.
For example, in the second video the guy explains how he will have it talk to another "AI" to get information. Instead of just responding with "Okay, I understand" it started talking about how interesting the idea sounded. And as the demo went on, both "AIs" kept adding unnecessary commentary about the scenes.
I would hate having to talk with these things on a regular basis.
Yea, at some point the style and tone of these assistants needs to be seriously changed. I can imagine a lot of their RLHF and instruct processes emphasize sounding good over being good too much.
the OpenAI live stream was quite underwhelming...
Does anyone with a paid plan see anything different in the ChatGPT iOS app yet?
Mine just continues to show “GPT 4” as the model - it’s not clear if that’s now 4o or there is an app update coming…
It is quite nice how they keep giving premium features for free, after a while. I know openai is not open and all but damn, they do give some cool freebies.
This is every romance scammer's dreams come true...
Very impressive demo, but not really a step change in my opinion. The hype from OpenAI employees was on another level, way more than was warranted in my opinion.
Ultimately, the promise of LLM proponents is that these models will get exponentially smarter - this hasn’t borne out yet. So from that perspective, this was a disappointing release.
If anything, this feels like a rushed release to match what Google will be demoing tomorrow.
I'm seeing gpt-4o in the OpenAI Playground interface already: https://platform.openai.com/playground/chat?mode=chat&model=...
First impressions are that it feels very fast.
jeez, that model really speaks a lot! I hope there's a way to make it more straight to the point rather than radio-like.
So, babelfish soon?
So, babelfish incoming?
Very, very impressive for a "minor" release demo. The capabilities here would look shockingly advanced just 5 years ago.
Universal translator, pair programmer, completely human sounding voice assistant and all in real time. Scifi tropes made real.
But: interesting to see next how it actually performs with real-world latency and without cherry-picking. No snark, it was great, but I need to see real-world power. Also what the benefits are to subscribers if all this is going to be free...
@sama reflects:
Will this include image generation for the free tier as well? That's a big missing feature in OpenAI's free tier compared to Google and Meta.
Universal real time translation is incredibly dope.
I hate video players without volume control.
So GPT-4o can do voice intonation? Great. Nice work.
Still, it sounds like some PR drone selling a product. Oh wait....
I like the robot typing at the keyboard that has B as half of the keys, and my favorite part is when it tears up the paper and behind it is another copy of that same paper.
Looking forward to trying this via ChatGPT. As always OpenAI says "now available" but refreshing or logging in/out of ChatGPT (web and mobile) don't cause GPT-4o to show up. I don't know why I find this so frustrating. Probably because they don't say "rolling out" they say things like "try it now" but I can't even though I'm a paying customer. Oh well...
I hope when this gets to my iphone I can use it to set two concurrent timers.
Are there any remotely comparable open source models? Fully multimodal, audio-to-audio?
That they are offering more features for free concurs with my theory that, just like search, state of the art AI will soon be "free", in exchange for personal information/ads.
With the news that Apple and OpenAI are closing / just closed a deal for iOS 18, it's easy to speculate we might be hearing about that exciting new model at WWDC...
Interesting that they didn't mention a bump in capabilities - I wrote an LLM benchmark a few weeks ago, and before, GPT-4 could solve Wordle about ~48% of the time.
Currently with GPT-4o, it's easily clearing 60% - while blazing fast, and half the cost. Amazing.
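For anyone curious what such a benchmark might look like, here is a hypothetical sketch of a Wordle eval loop (not the commenter's actual harness; the feedback function is simplified and ignores duplicate-letter edge cases):

    from openai import OpenAI

    client = OpenAI()

    def feedback(guess: str, answer: str) -> str:
        # G = right letter, right spot; Y = letter present elsewhere; - = absent (simplified)
        return "".join(
            "G" if g == a else ("Y" if g in answer else "-")
            for g, a in zip(guess, answer)
        )

    def play(answer: str, max_turns: int = 6) -> bool:
        history = []
        for _ in range(max_turns):
            prompt = (
                "We are playing Wordle. Guesses so far:\n"
                + "\n".join(f"{g} -> {f}" for g, f in history)
                + "\nReply with your next 5-letter guess only."
            )
            resp = client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": prompt}],
            )
            guess = resp.choices[0].message.content.strip().lower()[:5]
            if guess == answer:
                return True
            history.append((guess, feedback(guess, answer)))
        return False

Run play() over a list of answer words and the solve rate is simply the fraction of games it wins.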
I can't help but feel a bit let down. The demos felt pretty cherry picked and still had issues with the voice getting cut off frequently (especially in the first demo).
I've already played with the vision API, so that doesn't seem all that new. But I agree it is impressive.
That said, watching back a Windows Vista speech recognition demo[1] I'm starting to wonder if this stuff won't have the same fate in a few years.
That can “reason”?
Won't this make pretty much all of the work to make a website accessible go away, as it becomes cheap enough? Why struggle to build parallel content for the impaired when it can be generated just in time as needed?
what's the path from LLMs to "true" general AI? is it "only" more training power/data or will they need a fundamental shift in architecture?
question for you guys - is there a model that can take figures (graphs) from scientific publications and combine image analysis with picking up the data-point symbol descriptions to analyse the trends?
Very impressed by the demo where it starts speaking French in error, then laughs with the user about the mistake. Such a natural recovery.
window dressing
his love for yud is showing.
I wonder if the audio stuff works like ViTS. Do they just encode the audio as tokens and input the whole thing? Wouldn't that make the context size a lot smaller?
One does notice that context size is noticeably absent from the announcement ...
It is notable OpenAI did not need to carefully rehearse the talking points of the speakers. Or even do the kind of careful production quality seen in a lot of other videos.
The technology product is so good and so advanced it doesn't matter how the people appear.
Zuck tried this in his video countering the Vision Pro, but it did not have the authentic "not really rehearsed or produced" feel of this at all. If you watch that video and compare it with this you can see the difference.
Very interesting times.
GPT-4o being a truly multimodal model is exciting, and does open the door to more interesting products. I was curious about the new tokenizer, which uses much fewer tokens for non-English, but also 1.1x fewer tokens for English, so I'm wondering if this means each token can now take more possible values than before? Might make sense provided that they now also have audio and image output tokens? https://openai.com/index/hello-gpt-4o/
I wonder what "fewer tokens" really means then, without context on raising the size of each token? It's a bit like saying my JPEG image is now using 2x fewer words after I switched from a 32-bit to a 64-bit architecture no?
I still need to talk very fast to actually chat with ChatGPT which is annoying. You can tell they didn't fix this based on how fast they are talking in the demo.
I don't see anything released today. Login/signup is still required, no signs of desktop app or free use on web. What am I missing?
I'm really impressed by this demo! Apart from the usual quality benchmarks, the audio/video latency stands out: "It can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response"... If true at scale, what could be the "tricks" they're using to achieve that?!
Expressing a human-like emotional response every single time you interact with it is pretty annoying.
In general, trying to push that this is a human being is probably "unsafe", but that hurts the marketing.
Weird, visiting the page crashed my graphics driver in Firefox.
It's pretty impressive, although I don't like the voice / tone, I prefer something more neutral.
Holy crap, the level of corporate cringe of that "two AIs talk to each other" scene is mind-boggling.
It feels like a pretty strong illustration of the awkwardness of getting value from recent AI developments. Like, this is technically super impressive, but also I'm not sure it gives us anything we couldn't have one year ago with GPT-4 and ElevenLabs.
This is impressive, but they just sound so _alien_, especially to this non-U.S. English speaker (to the point of being actively irritating to listen to). I guess picking up on social cues communicating this (rather than express instruction or feedback) is still some time away.
It's still astonishing to consider what this demonstrates!
I found these videos quite hard to watch. There is a level of cringe that I found a bit unpleasant.
It’s like some kind of uncanny valley of human interaction that I don’t get on nearly the same level with the text version.
Why must every website put stupid stuff that floats above the content and can’t be dismissed? It drives me nuts.
Now, say goodbye to call centers.
In the video where the 2 AIs sing together, it starts to get really cringey and weird, to the point where it literally sounds like it's being faked by 2 voice actors off-screen with literal guns to their heads trying not to cry. Did anyone else get that impression?
The tonal talking was impressive, but man that part was like, is someone being tortured or forced against their will?
Did they provide the rate limit for free users?
Because I have the Plus membership, which is expensive ($25/month).
But if the limit is high enough (or my usage low enough), there is no point in paying that much money for me.
(I work at OpenAI.)
It's really how it works.
With this capability, how close are y'all to it being able to listen to my pronunciation of a new language (e.g. Italian) and give specific feedback about how to pronounce it like a local?
Seems like these would be similar.
The Italian output in the demo was really bad.
Winner of the 'understatement of the week' award (and it's only Monday).
Also top contender in the 'technically correct' category.
and was briefly untrue for like 2 days
Random OpenAI question: While the GPT models have become ever cheaper, the price for the TTS models has stayed in the $15/1M character range. I was hoping this would also become cheaper at some point. There are so many apps (e.g. language learning) that quickly become too expensive given these prices. With the GPT-4o voice (which sounds much better than the current TTS or TTS HD endpoint) I thought maybe the prices for TTS would go down. Sadly that hasn't happened. Is that something on the OpenAI agenda?
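As a rough illustration of the economics for a voice-heavy app (the speech rate is an assumption, just back-of-the-envelope):

    # At $15 per 1M characters, TTS cost scales linearly with characters spoken.
    # Assuming ~150 words/min and ~6 characters/word (~900 chars/min of speech):
    price_per_char = 15 / 1_000_000
    chars_per_minute = 150 * 6
    minutes = 10  # one short language-learning session
    print(price_per_char * chars_per_minute * minutes)  # ~$0.14 per 10-minute session

Per session that looks small, but it compounds over daily sessions and many users at consumer-app scale, which is presumably why these prices sting.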
"(I work at OpenAI.)"
Ah yes, also known as being co-founder :)
hi gdb, could you please create an assistant AI that can filter low-quality HN discussion on your comment so that it can redirect my focus to useful stuff.
Licensing the emotion-intoned TTS as a standalone API is something I would look forward to seeing. Not sure how feasible that would be if, as a sibling comment suggested, it bypasses the text-rendering step altogether.
Is it possible to use this as a TTS model? I noticed on the announcement post that this is a single model as opposed to a text model being piped to a separate TTS model.
How far away are we from something like a helmet with ChatGPT and a video camera installed? I imagine this will be awesome for low-vision people. Imagine having a guide tell you how to walk to the grocery store and help you grocery shop without an assistant. Of course you have tons of liability issues here, but this is very impressive.
Consequences of audio2audio (rather than audio->text, text->audio). Being able to manipulate speech nearly as well as it manipulates text is something else. This will be a revelation for language learning, amongst other things. And you can interrupt it freely now!
However, this looks like it only works with speech - i.e. you can't ask it, "What's the tune I'm humming?" or "Why is my car making this noise?"
I could be wrong but I haven't seen any non-speech demos.
What about the breath analysis?
Fwiw, the live demo[0] included different kinds of breathing, and getting feedback on it.
[0]: https://youtu.be/DQacCB9tDaw?t=557
Anyone who has used elevenlabs for voice generation has found this to be the case. Voice to voice seems like magic.
that was very impressive, but it doesn't surprise me much given how good the voice mode in the ChatGPT iPhone app already is.
The new voice mode sounds better, but the current voice mode did also have inflection that made it feel much more natural than most computer voices I've heard before.
Can you tell the current voice model what feelings and tone it should communicate with? If not it isn't even comparable, being able to control how it reads things is absolutely revolutionary, that is what was missing from using these AI models as voice actors.
Right to whom? To me, the voice sounds like an over-enthusiastic podcast interviewer. What's wrong with wanting computers to sound like what people think computers should sound like?
It understands tonal language, you can tell it how you want it to talk, I have never seen a model like that before. If you want it to talk like a computer you can tell it to, they did it during the presentation, that is so much better than the old attempts at solving this.
I mention this down-thread, but a symptom of a tech product of sufficient advancement is that the nature of its introduction matters less and less.
Based on the casual production of these videos, the product must be this good.
https://news.ycombinator.com/item?id=40346002