
Show HN: I Remade the Fake Google Gemini Demo, Except Using GPT-4 and It's Real

godelski
36 replies
14h28m

I don't get why companies lie like this. How much do they have to gain? It seems like they actually have a lot to lose.

What's crazy to me is that these tools are wildly impressive without the hype. As an ML researcher, there are a lot of cool things we've done, but at the same time almost everything I see is vastly overhyped, from papers to products. I think there's a kinda race to the bottom we've created, and it's not helpful to any of us except maybe in the short term. Playing short term games isn't very smart, especially for companies like Google. Or maybe I completely misunderstand the environment we live in.

But then again, with the discussions in this thread[0], maybe there are a lot of people so ethically bankrupt that they don't even realize that what they're doing is deceptive. Which is an entirely different and worse problem.

[0] https://news.ycombinator.com/item?id=38559582

nextworddev
10 replies
14h22m

Google stock rallied 5%-ish after the demo (though the stock didn't move immediately). Then it gave back about 1% once the news broke that it was faked.

godelski
5 replies
14h3m

That's not a great answer. We need to know the answer to the counterfactual question: "How much would Google's stock have rallied after a realistic demo was given?" I would not have been surprised if the answer was also 5%. Almost certainly Google's stock would have risen after announcing Gemini. There are other long term questions too, like how this feeds into growing dissent against Google and erodes trust. But what the stock did is only a small part of a much bigger and far more important question.

Edit: Can someone explain the downvotes? Is there an error in my response? I'm glad to learn, but I'd appreciate a bit better feedback signal so that I can improve instead of guessing.

nextworddev
2 replies
13h34m

The broader investor community was spooked by Gemini GA being delayed to Q1, so this stunt was a good stopgap / distraction.

yellow_postit
1 replies
12h24m

And it likely caused more long-term harm, since if they had to fake this they're likely even further behind.

pixl97
0 replies
2h24m

That's tomorrow's problem, not today's. They are hoping to solve the issue by then, or to find a new way to fake it for another quarter.

dkga
1 replies
11h48m

Economist here who studies exactly this type of counterfactual analysis. You are completely right: the effect of Gemini can only be estimated if we factor in what the Alphabet stock price would have been over the same period but in a world without Gemini. This is actually very standard in financial economics. This type of effect can be calculated with econometric techniques that compare before/after for “treated” vs “untreated” units, but in instances such as these, where only one or a few units were affected, like Alphabet stock amongst hundreds of other companies, one could use techniques such as “synthetic controls”. The intuition is to use other companies' data to estimate, pre-Gemini, how Alphabet's stock price moves over time, and then use that relationship to estimate a post-Gemini version of a no-Gemini Google. The difference between the actual stock price and that counterfactual is the effect of interest; whether it is a significant effect or just random noise can be established by a number of auxiliary statistical tests. For more info, see [0].

[0] Abadie, A. (2021). Using synthetic controls: Feasibility, data requirements, and methodological aspects. Journal of Economic Literature, 59(2), 391-425.
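
To make that intuition concrete, here is a minimal sketch of the synthetic-control idea in Python. It is purely illustrative and not from the comment above: the "donor" price series are randomly generated stand-ins for other companies' stocks, and real studies add constraints, covariates, and inference steps that this toy omits.

    # Minimal synthetic-control sketch. Purely illustrative: the "donor" series are
    # randomly generated stand-ins for other companies' stock prices.
    import numpy as np
    from scipy.optimize import nnls

    rng = np.random.default_rng(0)

    # Fake daily prices: rows = trading days, columns = control companies ("donor pool").
    donors = rng.normal(0, 1, size=(120, 10)).cumsum(axis=0) + 100
    # Fake treated unit (think Alphabet): a mixture of donors plus noise.
    true_weights = rng.dirichlet(np.ones(10))
    target = donors @ true_weights + rng.normal(0, 0.3, 120)
    event = 100                     # index of the announcement day
    target[event:] += 2.0           # inject a fake +2 "announcement effect" to recover

    # 1. Fit non-negative weights on the pre-event window so a weighted combination
    #    of donors tracks the treated unit as closely as possible.
    weights, _ = nnls(donors[:event], target[:event])
    weights /= weights.sum()        # simplified sum-to-one constraint

    # 2. Post-event, that weighted combination is the "no-treatment" counterfactual.
    synthetic = donors @ weights

    # 3. The post-event gap between actual and synthetic is the estimated effect.
    effect = target[event:] - synthetic[event:]
    print("estimated average post-event effect:", round(float(effect.mean()), 3))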

godelski
0 replies
10h37m

Well, I don't mean with/without Gemini; I mean the deceptive marketing of Gemini vs. the counterfactual where they had produced non-deceptive marketing. Other than that nitpick, I appreciate the backup and the source. Counterfactual testing is by no means easy, so good luck with your work! My gf is an economist but on the macro side. You guys have it tough. I'm a bit confused why people are disliking my comment, but mostly that they are doing so without explanation lol.

lolinder
3 replies
14h2m

One question I've long had is why short-term changes in stock prices seem to matter so much to companies like these. Is it just that the short-term changes are seen as harbingers of longer-term trends, or is there a concrete reason to play games to get temporary boosts to stock price?

godelski
0 replies
12h5m

I've been using the term Goodhart's Hell to describe what I see as a system of metric hacking and shortsightedness. I do think there are systems encouraging people to meet metrics for metrics' sake without stopping to understand the metric, what it actually measures, and most importantly, how that differs from the actual goals. All metrics are proxies and far from complete, even for things that sound really simple. I think this is because we've gotten so good at measuring that we've forgotten about the noisiness of the underlying system; when we weren't good at measuring, that noise was forced upon us in an obvious way.

But I really don't get it either. One of the things that really sets humans apart from other animals is our capacity to perform long term planning and prediction. Why abandon that skill instead of exploiting it to its maximum potential?

barbarr
0 replies
1h46m

A CEO's/company's job is to maximize instantaneous shareholder value. Anything else is a waste of time, since investors are assumed to take long-term vs short-term risk preference into their own hands. The company is essentially a machine that investors can dip into and dip out of at any time, so it doesn't make sense to make decisions to move the stock price over a pre-planned certain time horizon. The reason companies invest in any long-term projects at all is because the net present value of those projects affects the stock price.

YetAnotherNick
0 replies
13h11m

Short term changes definitely affect the medium- to long-term price, because at one level the stock price is more like a casino and isn't actually related to the company's performance. E.g. see the 5-year history of GameStop: its price once increased due to random activity from redditors, and it is still elevated because of that.

dkga
6 replies
12h6m

Relatedly, I saw in the thread that people call these types of deceptions “smoke and mirrors” or a “dog and pony show”. What happened to “Potemkin”?!

Austizzle
1 replies
10h53m

Perhaps this is revealing my ignorance, but I've never even heard of Potemkin before.

mrkstu
0 replies
3h15m

If you grew up in the post-USSR era, it has probably fallen out of the lexicon for younger folks…

latexr
0 replies
3h29m

I’d wager most aren’t familiar with the term, and the majority of those who are will think of the battleship/movie, not the villages.

https://en.wikipedia.org/wiki/Russian_battleship_Potemkin

https://en.wikipedia.org/wiki/Potemkin_village

gwd
0 replies
9h8m

The nice thing about "Potemkin" is that there's a decent chance the video was also designed to fool their own CEOs (in response to an impossible request), just as the Potemkin Villages were used to fool the country's own ruler.

godelski
0 replies
12h2m

I had never previously heard of that term but it does seem apt. I think idioms are often cultural and can change rapidly; while one might seem ubiquitous in your group, it isn't in another. Another term I think might be apt, but a bit less so, is snake oil, or snake oil salesman.

banku_brougham
0 replies
3h49m

Potemkin product launch is the correct label.

Racing0461
4 replies
13h13m

Here is the headline Business Today published, just in case you wonder why businesses do this:

"Google Gemini Outperforms Most Human Experts & GPT-4 | Artificial Intelligence | Google's DeepMind"

It's all marketing. Same reason why Satya publicly posted that sama + others were joining a new team at MSFT to continue, should the OpenAI thing not work out.

godelski
3 replies
11h58m

I'm not sure how that is really responding to my question with an explanation. I'm well aware that it's marketing, and I'd hope my comment makes that clear. The question is why oversell the product, and frankly by a lot. Because people are going to find out; the intent is that they use it, after all.

I'm sure the marketing team can come up with good marketing that also isn't deceitful. The question is why pull a con when you've already got something of value that customers would legitimately buy?

mobiuscog
1 replies
10h46m

> I'm well aware that it's marketing and I'd hope my comment makes that clear. The question is why oversell the product, and frankly by a lot.

Most marketing sells the dream, not the reality. There are just many shades of grey (although 50 tends to sell well).

godelski
0 replies
10h35m

I'm still not sure how that is responding to my comment. Have I said something that makes me seem naive of the existence of snake oil salesmen? I'm actually operating under the assumption that my comment, and especially followup, explicitly demonstrate my awareness of this.

pixl97
0 replies
2h18m

Because it may significantly delay those customers from buying the other competing product. Yea, Google has something of value, but OpenAI seems to have something of more value and Google is frantically trying to keep OpenAI from eating the whole market.

CSMastermind
4 replies
12h23m

Because the same day they released the video our CEO was messaging me saying we have to get on Google's new stuff because it's so much better than GPT-4.

I said I was skeptical of the demo but, like all developments in the field, will try it out once they release it.

godelski
2 replies
12h1m

This also seems like equally poor decision making. Wouldn't you want to at least try things out before you make a hard switch? Chasing hype is costly.

movedx
0 replies
11h24m

Welcome to IT.... no seriously, this is how a lot of executives behave in IT.

AdamN
0 replies
7h44m

The important thing for Google is to be part of the short list right now, before adoption crystallizes. Over the next year or two a large chunk of early (late?) adopters will firmly commit to one (or maybe two) vendors for their generative AI, and those decisions will be sticky.

So now is the time to do whatever it takes to get into the conversation ... which Google successfully did I think.

Zelphyr
0 replies
2h8m

I have a daily call with my CEO and I'm counting down until he mentions Google's demo and asks why we're not using their AI technology. Then it will take an exorbitant amount of energy on my part to get him to understand how that demo was doctored.

hackerlight
2 replies
13h39m

> Playing short term games isn't very smart, especially for companies like Google. Or maybe I completely misunderstand the environment we live in.

It could be the principal-agent problem. The agent (employees and management) is optimizing for short-term career benefits and has no loyalty to Google's shareholders. They can quit after 3 years, so reputation damage to Google doesn't matter that much. But the shareholders want agents to optimize for longer-term things like reputation. Aligning those incentives is difficult. Shareholders try with good governance and material incentives tied to the stock price with a vesting schedule, but you're still going to get a level of misalignment.

I suppose this is where a cult-like culture of mission alignment can deliver value. If you convince/select your agents (employees) into actually believing in the mission, alignment follows from that.

godelski
1 replies
11h56m

Yeah I think that makes some sense. But you would think the CEO and top execs of the company would be trying to balance these forces rather than letting one dominate. You need pressures for short term but you can't abandon long term planning for short term optimization. Anyone who's worked with basic RL systems should be keenly aware of that and I'm most certain they teach this in business school. I mean it's not like these things don't come up multiple times a year.

hackerlight
0 replies
11h25m

There's some other explanations too. Maybe they thought the deception would fly under the radar, so it was rational according to cost-benefit analysis given available information. Maybe they fell for the human psychological bias of overvaluing near-term costs/benefits and undervaluing long-term costs/benefits. Maybe some deception was used internally when the demo was communicated to senior execs. Maybe the ego of being second place to OpenAI was too much and the shame avoidance/prestige seeking kicked in.

userabchn
0 replies
1h23m

It made me feel, more than I have ever felt before, that Google is now run by non-technical business people. They don't seem to understand that many people who have at least some awareness of how this technology works, and so are probably going to be part of the decision-making process on whether to use it and other Google products, can immediately see that it is faked, and are often the type of people who react very negatively to such deceptive practices.

parineum
0 replies
13h36m

I think it's because, while I think these LLMs are incredibly interesting and can be very useful, they're less than what the hype promises, and the valuations are based on the hype.

latexr
0 replies
3h36m

> I don't get why companies lie like this.

The answer is always “money”. All you have to do is think “what line of thought would lead someone to believe that by lying in this manner they’ll either lose less money or make more of it?”

jay-barronville
0 replies
8h48m

> What's crazy to me is that these tools are wildly impressive without the hype.

My wife and I were talking about this yesterday, and I made this exact point! I told her I’m convinced Google was deceptive like this for the Wall Street crowd and normies, because to techies and researchers who actually understand AI, the extra BS is unnecessary if the technology is legitimately impressive.

RagnarD
0 replies
13h59m

Google screws up every business opportunity, including wantonly buying small successful businesses and killing them. Dishonesty is a fundamental part of the company.

phire
20 replies
14h53m

The "magic" of the fake Gemini demo was the way it seemed like the LLM was continually receiving audio + video input and knew when to jump in with a response.

It appeared to be able to wait until the user had finished the drawing, or even jump in slightly before the drawing was finished. At one point the LLM was halfway through a response, then saw the user was now colouring the duck in blue, and started talking about how the duck appeared to be blue. The LLM also appeared to know when a response wasn't needed because the user was just agreeing with it.

I'm not sure how many people noticed that on a conscious level, but I'm positive everyone noticed it subconsciously, and felt the interaction was much more natural, and much more advanced than current LLMs.

-----------------

Checking the source code, the demo takes screenshots of the video feed every 800ms, waits until the user finishes talking, and then sends the last three screenshots.

While this demo is impressive, it kind of proves just how unnatural it feels to interact with an LLM in this manner when it doesn't have continuous audio-video input. It's been technically possible to do this kind of thing for a while, but there's a good reason why nobody has tried to present it as a product.
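
For readers curious what that loop looks like, here is a rough Python sketch of the "screenshot every 800ms, send the last three when the user stops talking" approach described above. It is not the demo's actual code (which runs in the browser in JavaScript); the model name, prompt, and the two speech-related helpers are assumptions.

    # Rough sketch of the "grab a frame every 800 ms, send the last three when the
    # user stops talking" loop. Not the demo's actual code; helper names are stand-ins.
    import base64
    import collections
    import time

    import cv2                      # pip install opencv-python
    from openai import OpenAI       # pip install openai

    client = OpenAI()               # reads OPENAI_API_KEY from the environment
    frames = collections.deque(maxlen=3)    # keep only the last three screenshots
    camera = cv2.VideoCapture(0)

    def capture_frame():
        ok, frame = camera.read()
        if ok:
            _, jpeg = cv2.imencode(".jpg", frame)
            frames.append(base64.b64encode(jpeg.tobytes()).decode())

    def ask_gpt4v(transcript):
        content = [{"type": "text", "text": transcript}]
        for b64 in frames:
            content.append({"type": "image_url",
                            "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})
        reply = client.chat.completions.create(
            model="gpt-4-vision-preview",
            messages=[{"role": "user", "content": content}],
            max_tokens=300,
        )
        return reply.choices[0].message.content

    def user_finished_talking():
        # Stand-in for the browser's end-of-dictation event used by the real demo.
        return False

    def get_transcript():
        # Stand-in for the speech-to-text result.
        return "What do you see in these frames?"

    # Main loop: one screenshot every 800 ms; query GPT-4V when the user stops talking.
    while True:
        capture_frame()
        if user_finished_talking():
            print(ask_gpt4v(get_transcript()))
        time.sleep(0.8)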

TaylorAlexander
5 replies
12h40m

Yeah my friend and I were just talking about continuous stream input multimodal LLMs. Does anyone know if there is a technical limitation preventing continuous stream input data? Like it’s listening to you practice guitar and then when you get to a certain point it says “okay let’s go back and practice that section again”. It seems the normal approach of next token prediction falls flat when there is a continuous stream of tokens and it only sometimes needs to produce output.

What is that type of input called in the literature and what research has been done on it? Thanks!

profile53
4 replies
11h55m

At a purely technical level, no, as long as the model can output a null token. E.g. imagine training using a transcript of two people talking. What would be a single text token is a tuple of two tokens, one per person. Each segment where a person is not talking is a series of null tokens, one per ‘tick’ of time. In an actual conversation, one token in the tuple is user input and one is GPT prediction. Just disregard the user half of the tuple when determining whether the GPT should ‘speak’.

The real world challenge is threefold. First, null tokens would be massively overrepresented in training and, by extension, in outputs. Second, at a computational level, outputting a continuous stream of tokens would be absurdly expensive. Third, there is not nearly as much training data of interspersed conversations as of monologues (e.g. research papers, this comment, etc.).
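
A toy illustration of that tuple-of-tokens layout, purely hypothetical (a real system would work on time-aligned audio tokens rather than words):

    # Toy illustration of the "tuple of tokens per tick" idea: two speakers, one
    # slot each per time step, with a NULL filler wherever a speaker is silent.
    NULL = "<null>"

    # Time-aligned transcript: (tick, speaker, word)
    events = [(0, "user", "how"), (1, "user", "are"), (2, "user", "you"),
              (4, "assistant", "great"), (5, "assistant", "thanks")]

    def interleave(events, n_ticks):
        ticks = [{"user": NULL, "assistant": NULL} for _ in range(n_ticks)]
        for t, speaker, word in events:
            ticks[t][speaker] = word
        # One training pair per tick: (user token, assistant token).
        return [(tick["user"], tick["assistant"]) for tick in ticks]

    for pair in interleave(events, 7):
        print(pair)
    # The model is trained to predict only the assistant half of each pair; at
    # inference time it emits NULL until it decides it is its turn to speak.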

mewpmewp2
1 replies
4h1m

I think you should be able to do it out of the box if you just keep sending the tokens, and then ask GPT "is there a mistake? Respond with just 'yes' or 'no'". Why does there have to be something like a "null" token?

It might seem expensive, yes, but at least it only has to respond with one token.

edgyquant
0 replies
2h51m

There’s a null token because the question was about you not having to ask if there was a mistake. It would just default to constantly producing a null token until it had a real response

TaylorAlexander
1 replies
11h37m

Yeah it seems the notion of time is sort of not built in conceptually to current systems. You could pick a fixed time constant like 0.1 seconds or 1 second, but it's clear that it's sort of missing something more fundamental.

radarsat1
0 replies
10h46m

I think if the same LLM were trained on audio and video input instead of text, and produced audio output, including silence tokens, then the notion of time would get "built in". Audio continuation without translation to text has been shown to work. Mixing it with text is also possible. But all this would require a massive network that may even be difficult for the world's biggest companies to train and serve at any kind of scale. So it's more of an engineering problem than a theoretical one imho.

Also imho, I think until the context/memory problem is fully solved we won't really see the AI as having any kind of agency. But continuous, low latency interaction would certainly feel like a step towards that.

gregsadetsky
4 replies
14h1m

100%.

I made this demo in 2-3 hours, and I did use the "wait until the dictation results are finalized" technique which is safer (i.e. the dictation transcription is more robust) but slower.

For another demo - https://www.youtube.com/watch?v=fxS7OKh_4vc - I kept feeding the "in progress" transcription results into GPT and that was super super awesome & fast. It would just require more work to deal with all of the different timings going on (i.e. there's the speech itself from the person, the time to transcribe, sending the request to GPT, "sync'ing" it to where the person is (mentally/in their speech) at the point where GPT replies, etc.)

But yeah. Real time/continuous talk is absolutely where it's at. Should GPT be available as a websocket...?!

modeless
2 replies
10h2m

I have a rough demo of real time continuous voice chat here, ~1 second response latency: https://apps.microsoft.com/detail/9NC624PBFGB7

Basically it starts computing a response every time a word comes out of the speech recognizer, and if it is able to finish its response before it hears another word then it starts speaking. If more words come in then it stops speaking immediately; in other words, you can interrupt it. It feels so much more natural in conversation than ChatGPT's voice mode due to the low latency and continuous listening with the ability to interrupt.
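
A hedged sketch of that control flow, with dummy stand-ins for the recognizer, the LLM, and the speech output (none of these names come from the actual app):

    # Sketch of the "draft a reply on every recognized word, speak it only if no new
    # word arrives first, stop speaking when interrupted" loop described above.
    # All of the pieces here are toy stand-ins, not the app's real components.
    import asyncio

    async def conversation_loop(recognize_words, generate_reply, speak):
        transcript = []
        pending = None                       # the in-flight draft-reply task, if any
        async for word in recognize_words(): # yields words as they are heard
            transcript.append(word)
            if pending is not None:
                pending.cancel()             # any new word interrupts drafting/speaking

            async def draft_and_speak(snapshot):
                reply = await generate_reply(snapshot)  # LLM call on current transcript
                await speak(reply)                      # only reached if not cancelled

            pending = asyncio.create_task(draft_and_speak(list(transcript)))

    async def demo():
        async def recognize_words():         # fake recognizer: a word every 0.3 s
            for word in ["show", "me", "a", "duck"]:
                yield word
                await asyncio.sleep(0.3)

        async def generate_reply(words):     # fake LLM with one second of latency
            await asyncio.sleep(1.0)
            return f"(reply to: {' '.join(words)})"

        async def speak(text):               # fake text-to-speech
            print(text)

        await conversation_loop(recognize_words, generate_reply, speak)
        await asyncio.sleep(2)               # let the final, uninterrupted reply finish

    asyncio.run(demo())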

There are a lot of things that need improvement. Most important is probably that the speech recognition system (Whisper) wasn't designed for real time and is not that reliable or efficient in a real time mode. I think some more tweaking could improve reliability considerably. But also very important is that it doesn't know when not to respond. It will always jump in if you stop speaking for a second, and it will always try to get the last word. A first pass at fixing that would be to fine tune a language model to predict whose turn it is to speak.

There are also a lot of things that this architecture will never be able to do. It will never be able to correct your pronunciation (e.g. for language learning), it will never be able to identify your emotions based on vocal cues or express proper emotions in its voice, it will never be able to hear the tone of a singing voice or produce singing itself. The future is in eliminating the boundaries between speech-to-text and LLM and text-to-speech, with one unified model trained end-to-end. Such a system would be able to do everything I mentioned and more, if trained on enough data. And further integrating vision will help with conversation too, allowing it to see the emotions on your face and take conversational cues from your gaze direction and hand gestures, in addition to all the other obvious things you can do with vision such as chat about something the camera can see or something displayed on your screen.

thom
1 replies
9h53m

What's the horizon after which you reset the input instead of appending to it? Does that happen if the user lets the system finish speaking?

modeless
0 replies
9h42m

Great question. Right now that happens, somewhat arbitrarily, if the user lets the system finish speaking the first sentence of its response. If the user interrupts before that, then it's considered a continuation of the previous input. If the user interrupts after that, it's still an interruption (and, importantly, the language model's response must be truncated in the conversation context because the user didn't hear it all), but it starts a new input to the LLM. This could be handled better as well. Basically any heuristics like this that are in the system should eventually be subsumed into the AI models so that they can be responsive to the conversation context.

xiphias2
0 replies
6h19m

This would be super cool with Mistral on local machine

og_kalu
3 replies
14h46m

I think probably training on pause tokens or something similar would be the key to something like this. Maybe it's not even necessary. Maybe if you just tell GPT-4 to output something like .... every time it thinks it should wait for a response (you wouldn't need to wait for the user to finish then), things would be a lot smoother.

phire
2 replies
14h32m

Yes, you could probably fine-tune (or even zero-shot) a LLM to handle the "knowing when to jump in" use case.

The real problem is that it's simply too computationally expensive to continually feed audio and video into one of these massive LLMs just in case it might decide to jump in.

I was wondering if you could train a lightweight monitoring model that continually watches the audio/video input and only tries to work out when the full-sized LLM might want to jump in and generate a response.
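
As a thought experiment, that gating setup might look like the sketch below. Everything here is made up for illustration: the "small model" is reduced to a keyword check, whereas a real version would be a lightweight learned model watching the audio/video stream.

    # Hypothetical two-tier loop: a cheap "should the assistant jump in now?" gate on
    # every tick, with the expensive multimodal LLM called only when the gate fires.
    class KeywordGate:
        """Stand-in for a lightweight monitoring model."""
        def should_respond(self, recent_context: str) -> float:
            return 1.0 if "what is this" in recent_context.lower() else 0.0

    def conversation_tick(gate, big_llm_respond, context, threshold=0.5):
        """Run once per input tick (e.g. every 800 ms of audio/video)."""
        if gate.should_respond(context[-200:]) < threshold:
            return None                         # stay silent, skip the expensive call
        return big_llm_respond(context)         # only now pay for the big model

    gate = KeywordGate()
    fake_big_llm = lambda ctx: f"[big LLM reply to: {ctx!r}]"
    print(conversation_tick(gate, fake_big_llm, "the user is still drawing the duck"))
    print(conversation_tick(gate, fake_big_llm, "the user asks: what is this drawing?"))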

bbarnett
1 replies
13h23m

As the human brain is a clump of interconnected, interacting regions (for example, one may focus their attention elsewhere until their name is called), having a light model wait for an important cue makes sense for more than just fiscal reasons.

One time I was so distracted, I missed an entire paragraph someone said to me, walked to my car, drove away, and 5 minutes later processed it.

Ajedi32
0 replies
3h14m

Yeah, one thing I've noticed myself do is that when I'm focused on something else and someone suddenly gets my attention I'll replay the last few seconds of the conversation in my head to get context on what was being talked about before I respond. That seems pretty trivial to do with a LLM; it doesn't need to be using 100% of its "brainpower" at all times.

LelouBil
1 replies
7h37m

I wanted to plug a GPT4 chatbot into a group chat so it could react to what people said. In the end I abandoned the idea because it was too hard for me to figure out when it should talk vs. let people talk among themselves.

cubefox
0 replies
6h47m

Couldn't you instruct the model to only say something when it is important or when it's being addressed directly, and otherwise output just an empty response which isn't rendered?

ycdxvjp
0 replies
9h6m

As a deaf person I have been watching "live" speech recognition demos for 20-30 years. All look great. Using it in day to day life is crazy cause if you have 1 mistake per 10 words it builds up over time to be supremely annoying.

nextworddev
0 replies
14h20m

One easy improvement would be to stop the video capture automatically via a combination of silence detection and motion detection
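
A minimal sketch of that heuristic; the thresholds and data formats are assumptions, not anything from the demo:

    # Sketch of "stop capturing when the room is quiet and the scene is still".
    # Assumes float audio samples in [-1, 1] and grayscale uint8 frames.
    import numpy as np

    def is_silent(audio_chunk, rms_threshold=0.01):
        """True if the audio chunk's root-mean-square energy is below the threshold."""
        return float(np.sqrt(np.mean(np.square(audio_chunk)))) < rms_threshold

    def is_still(prev_frame, curr_frame, diff_threshold=2.0):
        """True if consecutive frames differ by less than the mean-pixel threshold."""
        diff = np.abs(curr_frame.astype(np.float32) - prev_frame.astype(np.float32))
        return float(diff.mean()) < diff_threshold

    def should_stop_capture(audio_chunk, prev_frame, curr_frame):
        return is_silent(audio_chunk) and is_still(prev_frame, curr_frame)

    # Tiny example with synthetic data: quiet audio plus a static frame -> stop.
    quiet = np.zeros(16000)                          # one second of silence at 16 kHz
    frame = np.full((480, 640), 128, dtype=np.uint8)
    print(should_stop_capture(quiet, frame, frame))  # True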

BiteCode_dev
0 replies
6h59m

For now LLMs can only answer, but they will soon be able to prompt YOU.

True conversation is going to be very interesting.

iandanforth
11 replies
15h10m

Looks like, again, this doesn't have GPT-4 processing video so much as a stack of video frames, concatenated and sent as a single image. But much closer to real!

trescenzi
3 replies
14h59m

How does a video differ from a stack of video frames? Isn’t that all a video is? A bunch of images stuck together and played back quickly?

zwily
0 replies
14h2m

You could say that this demo is processing a 2.4s video at 1.25fps.

og_kalu
0 replies
14h52m

I'd guess you'd miss any audio that way. But otherwise, yeah a video is a stack of images.

civilitty
0 replies
1h43m

In a technical sense most video is compressed using motion prediction algorithms, so the preprocessing on the data is already significantly different from static images, containing more compressed information. Only the key frames are actually full images, and they only make up 1-5% of the frames.

On top of that the video container usually provides synchronized audio packets.

riwsky
2 replies
14h26m

I just found out it gets worse: turns out GPT-4 isn't processing images so much as arrays of pixels!

And worse: turns out GPT-4 isn't processing pixels so much as integers representing a position in some color space like RGB!

And worse! turns out GPT-4 isn't processing integers so much as series of ones and zeroes!

Now that this is public knowledge, I'm willing to bet this was the ugly "less than candid" truth that the board sacked Sam Altman over.

zarzavat
0 replies
13h47m

As they say, quantity has a quality all of its own. If the framerate of a video is so slow as to be a slideshow, then it’s arguably not video anymore. Video has a connotation of appearing temporally continuous to the naked eye.

omneity
0 replies
4h15m

There is a significant difference. Video has a temporal component (frames tend to be correlated with previous ones), and vision LLMs do not have some sort of hidden states to keep track of that temporal component.

Using captions to bridge this only works to a certain extent (you're giving text descriptions of what happened in the past, not what had _actually_ happened).

adtac
1 replies
14h35m

Is "it's just processing frames" the new "it's just predicting the next token"?

topspin
0 replies
10h25m

Nothing new here at all: trivializing the intellectual achievements of machines is SOP. This will continue until machines have surpassed every conceivable benchmark. At that point we'll be left with only our epic hubris.

fortunefox
0 replies
14h42m

Since audio is processed separately, this isn't just close to real. It is real. After all, what is video if not a stack of frames! :D

ShamelessC
0 replies
15h1m

The video is an actual live demo without any editing or other tricks involved and even includes reasonable mistakes and the code used. It is not close to real, it's just real.

sheepscreek
5 replies
14h59m

Thank you for creating this demo. This was the point I was trying to make when the Gemini launch happened. All that hoopla for no reason.

Yes - GPT-4V is a beast. I’d even encourage anyone who cares about vision or multi-modality to give LLaVA a serious shot (https://github.com/haotian-liu/LLaVA). I have been playing with the 7B q5_k variant for the last couple of days and I am seriously impressed with it. Impressed enough to build a demo app/proof-of-concept for my employer (will have to check the license first, or I might only use it for the internal demo to drive a point).

ok_dad
2 replies
14h56m

I’ve been using llava via https://github.com/Mozilla-Ocho/llamafile which runs on any modern system.

jart
1 replies
10h29m

It's so great. I've been using this vision model to rename all the files in my Pictures folder. For example, the one-liner:

    llamafile --temp 0 \
        --image ~/Pictures/lemurs.jpg \
        -m llava-v1.5-7b-Q4_K.gguf \
        --mmproj llava-v1.5-7b-mmproj-Q4_0.gguf \
        --grammar 'root ::= [a-z]+ (" " [a-z]+)+' \
        -p $'### User: What do you see?\n### Assistant: ' \
        --silent-prompt 2>/dev/null |
      sed -e's/ /_/g' -e's/$/.jpg/'

Prints to standard output:

    a_baby_monkey_on_the_back_of_a_mother.jpg

This is something that's coming up in the next llamafile release. You have to build from source to have the ability to use grammar and --silent-prompt on a vision model right now.

Weights here: https://huggingface.co/jartine/llava-v1.5-7B-GGUF/tree/main

Sauce here: https://github.com/mozilla-Ocho/llamafile

sheepscreek
0 replies
5h51m

Truly grateful for your work on cosmopolitan, cosmo libc, redbean, nudging POSIX towards realizing the unachieved dream and also for contributing to llama.cpp. It’s like wherever I look, you’ve already left your mark there!

To me, you exemplify and embody the spirit of OSS, and to top that - you seem to be just an amazing human. You are an inspiration for me and many others. And even though I know I’ll never ever get close, you make me want to try. Thank you. :)

sheepscreek
1 replies
5h28m

Update: For anyone else facing the commercial use question on LLaVA - it is licensed under Apache 2.0. Can be used commercially with attribution: https://github.com/haotian-liu/LLaVA/blob/main/LICENSE

johnmoberg
0 replies
5h3m

The code is licensed under Apache 2.0, but the weights are CC BY-NC 4.0 according to the README, so no commercial use unfortunately.

swyx
3 replies
15h13m

haha yes it was entirely possible with gpt4v. literally just screenshot and feed in the images and text in chat format, aka “interleaved”. made something similar at a hackathon recently. (https://x.com/swyx/status/1722662234680340823). the bizarre thing is that google could've done what you did, and we would've all been appropriately impressed, but instead google chose to make a misleading marketing video for the general public and leave the rest of us frustrated nerds to do the nasty work of having to explain why the technology isn't as seen on tv yet; making it seem somehow our fault

i am curious about the running costs of something like this

gregsadetsky
2 replies
13h26m

I made 77 requests to the GPT-vision API while developing/demo'ing this, and that resulted in a $0.47 bill. Pretty reasonable!

jankovicsandras
1 replies
10h44m

Hi Greg,

Congratulations, great demo! The $0.47 bill seems reasonable for an experiment, but imagine someone doing a task of this complexity as a daily job - let's say 100x, or a little more than 4 hours - the bill would be $47/day. It feels like there's still an opportunity for a cheaper solution. Have you or someone else experimented with e.g. https://localai.io/ ?

swyx
0 replies
10h38m

if i did not have your comment history i'd have sworn you worked for localai.io

adtac
2 replies
14h19m

[tangential to this really cool demo] JPEG images being the only possible interface to GPT-4 feels wasteful. The human eye works on the delta between "frames", not the image itself. I wonder if the next big step that would allow real-time video processing at high resolutions is to have the model's internal state operate on keyframes and deltas, similar to how video codecs like MPEG work.
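
As a rough illustration of the input-side half of that idea (the interesting part, changing the model's internal state to consume deltas, is not sketched here), frame selection might look like this; the grayscale frames, mean-absolute-difference metric, and threshold are illustrative assumptions:

    # Toy "keyframe + delta" frame selection: only pass a full frame to the model
    # when enough has changed since the last keyframe.
    import numpy as np

    def select_keyframes(frames, threshold=8.0):
        keyframes, last_key = [], None
        for i, frame in enumerate(frames):
            if last_key is None:
                keyframes.append(i)             # always keep the first frame
                last_key = frame
                continue
            delta = np.mean(np.abs(frame.astype(np.float32) - last_key.astype(np.float32)))
            if delta > threshold:               # the scene changed enough
                keyframes.append(i)
                last_key = frame
        return keyframes                        # only these indices get sent

    # Example: ten near-identical frames with one abrupt change in the middle.
    rng = np.random.default_rng(0)
    base = rng.integers(0, 255, (48, 64), dtype=np.uint8)
    frames = [base + rng.integers(0, 2, base.shape, dtype=np.uint8) for _ in range(10)]
    frames[6] = rng.integers(0, 255, base.shape, dtype=np.uint8)   # abrupt scene change
    print(select_keyframes(frames))   # [0, 6, 7]: first frame, the change, the change back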

zwily
1 replies
14h4m

When Google talks about Gemini being "multi-modal", they include "video" in the list of modes. It's totally possible they don't actually mean video, and just mean frames like in this demo. They haven't elaborated on it anywhere that I've seen.

dannyw
0 replies
11h55m

Their technical report clarifies that video is just a sequence of frames fed as images.

ShamelessC
2 replies
15h3m

Sad state of affairs for Google.

paul7986
0 replies
14h59m

Indeed they've been sad starting in 2010 and on... (maybe before)... all the projects they kill.. their IP theft, them doing evil, etc

bergen
0 replies
5h16m

ANY indication the video was fake? I see none. The example at hand is nice and all, but if you say someone is faking you better have receipts.

sibeliuss
1 replies
14h16m

Lol at choosing the name Sagittarius, which is exactly across from Gemini in the Zodiac

SilasX
0 replies
14h1m

I remember there was speculation that Facebook named their vaporware cryptocurrency Libra (later, “Diem”) as a jab at the longtime rival Winklevoss twins, who had started a crypto exchange called Gemini. I have no idea how astrologically clever that would be.

razodactyl
1 replies
8h19m

The part that really confuses me is the lack of a "*some sequences simulated" disclaimer.

cubefox
0 replies
6h29m

The Gemini demo had a disclaimer in the beginning, albeit not a very clear one.

iamleppert
1 replies
9h49m

I’ve recently been trying to actually use Google’s AI conversational translation app that was released a while back and has had many updates and iterations since.

It’s completely unusable for real conversation. I’m actually in a situation where I could benefit from it and was excited to use it because I remember watching the demo and how natural it looked but was never able to actually try it myself.

Now having used it, I went back and watched their original demo and I’m 100% convinced all or part of it was faked. There is just no way this thing ever worked. If they can’t manage to make conversational live translation work (which is a lot more useful than drawing a picture of a duck) I have high doubts about this new AI.

Seems like the exact same situation to me. It’s insane to me how much nerve it must take to completely fake something like this.

cubefox
0 replies
6h40m

What was the App called?

zainhoda
0 replies
15h12m

Wow, this is super cool! From the code it seems like the speech to text and text to speech are using the browser’s built-in features. I always forget those capabilities even exist!

razodactyl
0 replies
8h16m

The latency is excusable as this is through the API. Inference on local infrastructure is almost instant so this demo would smoke everything else if this dude had access.

op00to
0 replies
15h15m

Very cool!

jakderrida
0 replies
14h56m

Lmao! So, presumably, they could have hired Greg to improvise almost the exact same demonstration, but with evidence it works. I don't know how much Greg costs, but I'll bet my ass it's less than the cost in investor sentiment after getting caught committing fraud. Not saying you're cheap. Just cheaper.

frays
0 replies
7h11m

Thanks for sharing!

dvaun
0 replies
14h58m

Great demo, I laughed at the final GPT response too.

Honestly: it would be fun to self-host some code hooked up to a mic and speakers to let kids, or whoever, play around with GPT4. I’m thinking of doing this on my own under an agency[0] I’m starting up on the side. Seems like a no-brainer as an application.

[0]: https://www.divinatetech.com

dingclancy
0 replies
4h15m

I am now convinced that Google DeepMind really had nothing in terms of state-of-the-art language models (SOTA LLMs). They were just bluffing. I remember when ChatGPT was released; Google was saying that they had much better models they were not releasing due to AI safety. Then they released PaLM and PaLM 2, saying it was time to beat ChatGPT with these models. However, they were not good models.

They then hyped up Gemini, and if Gemini Ultra is the best they have, I am not convinced that they have a better model.

Sundar's code red was genuinely alarming because they had to dig deep to make this Gemini model work, and they still ended up with a fake video. Even if Gemini was legitimate, it did not beat GPT-4 by leaps and bounds, and now GPT-5 is on the horizon, putting them a year behind. It makes me question whether they ever had a secret powerful model at all.

dingclancy
0 replies
4h24m

I am now convinced that Google DeepMind really had nothing in terms of SOTA LLMs. They were just bluffing.

I remember when ChatGPT was released, Google was saying that they had much, much better models that they were not releasing for AI safety reasons. Then they released PaLM and PaLM 2, saying that it was time to release these models to beat ChatGPT. They were not good models.

Then they hyped up Gemini, and if Gemini Ultra is the best they have, I am not convinced that they have a better model. So this is it.

So in one year, we went from "Google has the best model, they just do not want to release it" to "they have the infrastructure and the data and the talent to make the best model". What they really had was nothing.

cylinder714
0 replies
7h2m

Snader's Law: "Any sufficiently advanced technology is indistinguishable from a rigged demo."