Gemini "duck" demo was not done in realtime or with voice

That's not the only thing wrong. Gemini makes a false statement in the video, serving as a great demonstration of how these models still outright lie so frequently, so casually, and so convincingly that you won't notice, even if you have a whole team of researchers and video editors reviewing the output.

It's the single biggest problem with LLMs and Gemini isn't solving it. You simply can't rely on them when correctness is important. Even when the model has the knowledge it would need to answer correctly, as in this case, it will still lie.

The false statement is after it says the duck floats, it continues "It is made of a material that is less dense than water." This is false; "rubber" ducks are made of vinyl polymers which are more dense than water. It floats because the hollow shape contains air, of course.

This seems to be a common view among some folks. Personally, I'm impartial.

Search or even asking other expert human beings are prone to provide incorrect results. I'm unsure where this expectation of 100% absolute correctness comes from. I'm sure there are use cases, but I assume it's the vast minority and most can tolerate larger than expected inaccuracies.

I'm unsure where this expectation of 100% absolute correctness comes from.

It's a computer. That's why. Change the concept slightly: would you use a calculator if you had to wonder if the answer was correct or maybe it just made it up? Most people feel the same way about any computer based anything. I personally feel these inaccuracies/hallucinations/whatevs are only allowing them to be one rung up from practical jokes. Like I honestly feel the devs are fucking with us.

Okay, but search is done on a computer, and like the person you’re replying to said, we accept close enough.

I don’t necessarily disagree with your interpretation, but there’s a revealed preference thing going on.

The number of non-tech ppl I’ve heard directly reference ChatGPT now is absolutely shocking.

The number of non-tech ppl I've heard directly reference ChatGPT now is absolutely shocking.

The problem is that a lot of those people will take ChatGPT output at face value. They are wholly unaware that of its inaccuracies or that it hallucinates. I've seen it too many times in the relatively short amount of time that ChatGPT has been around.

So what? People do this with Facebook news too. That's a people problem, not an LLM problem.

People on social media are absolutely 100% posting things deliberately to fuck with people. They are actively seeking to confuse people, cause chaos, divisiveness, and other ill intended purposes. Unless you're saying that the LLM developers are actively doing the same thing, I don't think comparing what people find on the socials vs getting back as a response from a chatBot is a logical comparison at all

How is that any different from what these AI chatbots are doing? They make stuff up that they predict will be rewarded highly by humans who look at it. This is exactly what leads to truisms like "rubber duckies are made of a material that floats over water" - which looks like it should be correct, even though it's wrong. It really is no different from Facebook memes that are devised to get a rise out of people and be widely shared.

Because we shouldn't be striving to make mediocrity. We should be striving to build better. Unless the devs of the bots are wanting to have a bot built on trying to deceive people, I just don't see the purpose of this. If we can "train" a bot and fine tune it, we should be fine tuning truth and telling it what absolutely is bullshit.

To avoid the darker topics to keep the conversation on the rails, if there were a misinformation campaign that was trying to state that the Earth's sky is red, then the fine tuning should be able to inform that this is clearly fake so when quoting this it should be stated as incorrect information that is out there. This kind of development should be how we can clean up the fake, but nope, we're seemingly quite happy at accepting it. At least that's how your question comes off to me.

Sure, but current AI bots are just following the human feedback they get. If the feedback is naïve enough to score the factoid about rubber duckys as correct, guess what, that's the kind of thing these AI's will target. You can try to address this by prompting them with requests like "do you think this answer is correct and ethical? Think through this step by step" ('reinforcement learning from AI feedback') but that's very ad hoc and uncertain - ultimately, the humans in the loop call the shots.

At the end of the day, if there is no definitive answer to a question, it should respond in such a manner. "While there are compelling reasons to think A or B, neither A nor B have been verified. They are just the leading theories." That would be a much better answer than "Option A is the answer even if some people think B is." when A is just as unproven as B, but because it answers so definitively, people think it is the right answer.

So the labels thing is something that obviously will never work. But the system has all of the information it needs to know if the question is definitively answerable. If it is not, do not phrase the response definitively. At this point, I'd be happy if it responded to "Is 1+1 = 2?" with a wish washy answer like, "Most people would agree that 1+1 = 2", and if it wanted to say "in base 10, that is the correct answer. however, in base 2, the 1+1 = 10" would also be acceptable. Fake it till you make it is not the solution here.

There are far more people who post obviously wrong, confusing and dangerous things online with total conviction. There are people who seriously believe Earth is flat, for example.

Literally everything is a "people problem"

You can kill people with a fork, it doesn't mean you should legally be allowed to own a nuclear bomb "because it's just the same". The problem always come from scale and accessibility

If we rewind a little bit to the mid to late 2010s, filter bubbles, recommendation systems and unreliable news being spread on social media was a big problem. It was a simpler time, but we never really solved the problem. Point is, I don’t see the existence of other problems as an excuse for LLM hallucination, and writing it off as a “people problem” really undersells how hard it is to solve people problems.

So you're saying we need a Ministry of Truth to protect people from themselves? This is the same argument used to suppress "harmful" speech on any medium.

I've gotten to the point where I want "advertisment" stamped on anything that is, and I'm getting to the point I want "fiction" stamped on anything that is. I have no problem with fiction existing. It can be quite fun. People trying to pass fiction as fact is a problem though. Trying to force a "fact" stamp would be problematic though, so I'd rather label everything else.

How to enforce it is the real sticky wicket though, so it's only something best discussed at places like this or while sitting around chatting while consuming

And who gets to control the "fiction" stamp? Especially for hot button topics like covid (back in 2020)? Should asking an LLM about lab leak theory be auto-stamped with "fiction" since it's not proven? But then what if it's proven later?

Speech to text is often wrong too. So is autocorrect. And object detection. Computers don't have to be 100% correct in order to be useful, as long as we don't put too much faith in them.

Your caveat is not the norm though, as everyone is putting a lot of faith in them. So, that's part of the problem. I've talked with people that aren't developers, but they are otherwise smart individuals that have absolutely not considered that the info is not correct. The readers here are a bit too close to the subject, and sometimes I think it is easy to forget that the vast majority of the population do not truly understand what is happening.

Nah, I don’t think anything has the potential to build critical thinking like LLMs en masse. I only worry that they will get better. It’s when they are 99.9% correct we should worry.

People put too much faith in conspiracy theories they find on YT, TikTok, FB, Twitter, etc. What you're claiming is already not the norm. People already put too much faith into all kinds of things.

Call me old fashioned, but I would absolutely like to see autocorrect turned off in many contexts. I much prefer to read messages with 30% more transparent errors rather than any increase in opaque errors. I can tell what someone meant if I see "elephent in the room", but not "element in the room" (not an actual example, autocorrect would likely get that one right).

"Computer says no" is not a meme for no reason.

why should all computing be deterministic?

let me show you this "genius"/"wrong-thinking" person as to say about AL(artificial life) and deterministic computing.

https://www.cs.unm.edu/~ackley/

https://www.youtube.com/user/DaveAckley

To sum up a bunch of their content: You can make intractable problems solvable/crunchable if you allow just a little error into the result (which is reduced the longer the calculation calculates). And this is acceptable for a number of use cases where initial accuracy is less important that instant feedback.

It is radically different from a Von Neumann model of a computer - where there is a deterministic 'totalitarian finger pointer' pointing to some registry (and only one registry at a time) is an inherently limited factor. In this model - each computational resource (a unit of ram, and a processing unit) fights for and coordinates reality with it's neighbors without any central coordination.

Really interesting stuff. still in its infancy...

Let's see, so we exclude law, we exclude medical.. it's certainly not a "vast minority" and the failure cases are nothing at all like search or human experts.

Are you suggesting that failure cases are lower when interacting with humans? I don't think that's my experience at all.

Maybe I've only ever seen terrible doctors but I always cross reference what doctors say with reputable sources like WebMD (which I understand likely contain errors). Sometimes I'll go straight to WebMD.

This isn't a knock on doctors - they're humans and prone to errors. Lawyers, engineers, product managers, teachers too.

You think you ask your legal assistant to find some precedents related to your current case and they will come back with an A4 page full of made up cases that sound vaguely related and convincing but are not real? I don't think you understand the failure case at all.

That example seems a bit hyperbolic. Do you think lawyers who leverage ChatGPT will take the made up cases and present them to a judge without doing some additional research?

What I'm saying is that the tolerance for mistakes is strongly correlated to the value ChatGPT creates. I think both will need to be improved but there's probably more opportunity in creating higher value.

I don't have a horse in the race.

Do you think lawyers who leverage ChatGPT will take the made up cases and present them to a judge without doing some additional research?

I generally agree with you, but it's funny that you use this as an example when it already happened. https://arstechnica.com/tech-policy/2023/06/lawyers-have-rea...

facepalm

Do you think lawyers who leverage ChatGPT will take the made up cases and present them to a judge without doing some additional research?

Oh dear.

Do you think lawyers who leverage ChatGPT will take the made up cases and present them to a judge without doing some additional research?

You don't?

https://fortune.com/2023/06/23/lawyers-fined-filing-chatgpt-...

What would be the point of a lawyer using chatGPT if it had to root through every single reference chatGPT relied upon? I don't have to doublecheck every reference of a junior attorney, because they actually know what they are doing, and when they don't, it's easy to tell and wont come with fraudulently created decisions/pleadings, etc

Do you think lawyers who leverage ChatGPT will take the made up cases and present them to a judge without doing some additional research

I really don’t recommend using ChatGPT (even GPT-4) for legal research or analysis. It’s simply terrible at it if you’re examining anything remotely novel. I suspect there is a valuable RAG application to be built for searching and summarizing case law, but the “reasoning” ability and stored knowledge of these models is worse than useless.

I'm a software engineer, and I more or less stopped asking ChatGPT for stuff that isn't mainstream. It just hallucinates answers and invents config file options or language constructs. Google will maybe not find it, or give you an occasional outdated result, but it rarely happens that it just finds stuff that's flat out wrong (in technology at least).

For mainstream stuff on the other hand ChatGPT is great. And I'm sure that Gemini will be even better.

it rarely happens that it just finds stuff that's flat out wrong

"Flat out wrong" implies determinism. For answers which are deterministic such as "syntax checking" and "correctness of code" - this already happens.

ChatGPT, for example, will write and execute code. If the code has an error or returns the wrong result it will try a different approach. This is in production today (I use the paid version).

Dollars to doughnuts says they are using GPT3.5.

I'm currently working with some relatively obscure but open source stuff (JupyterLite and Pyodide) and ChatGPT 4 confidently hallucinates APIs and config options when I ask it for help.

With more mainstream libraries it's pretty good though

I use chatgpt4 for very obscure things

If I ever worried about being quoted then I’ll verify the information

otherwise I’m conversational, have taken an abstract idea into a concrete one and can build on top of it

But I’m quickly migrating over to mistral and if that starts going off the rails I get an answer from chatgpt4 instead

The important thing is that with Web Search as a user you can learn to adapt to varying information quality. I have a higher trust for Wikipedia.org than I do for SEO-R-US.com, and Google gives me these options.

With a chatbot that's largely impossible, or at least impractical. I don't know where it's getting anything from - maybe it trained on a shitty Reddit post that's 100% wrong, but I have no way to tell.

There has been some work (see: Bard, Bing) where the LLM attempts to cite its sources, but even then that's of limited use. If I get a paragraph of text as an answer, is the expectation really that I crawl through each substring to determine their individual provenances and trustworthiness?

The shape of a product matters. Google as a linker introduces the ability to adapt to imperfect information quality, whereas a chatbot does not.

As an exemplar of this point - I don't trust when Google simply pulls answers from other sites and shows it in-line in the search results. I don't know if I should trust the source! At least there I can find out the source from a single click - with a chatbot that's largely impossible.

I know exactly where the expectation comes from. The whole world has demanded absolute precision from computers for decades.

Of course, I agree that if we want computers to “think on their own“ or otherwise “be more human“ (whatever that means) we should expect a downgrade in correctness, because humans are wrong all the time.

Is it less reliable than an encyclopedia? It is less reliable than Wikipedia? Those aren't infallible but what's the expectation if it's wrong on something relatively simple?

With the rush of investment in dollars and to use these in places like healthcare, government, security, etc. there should be absolute precision.

The whole world has demanded absolute precision from computers for decades.

Computer engineers maybe. I think the general population is quite tolerant of mistakes as long as the general value is high.

People generally assign very high value to things computers do. To test this hypothesis all you have to do is ask folks to go a few days without their computer or phone.

The whole world has demanded absolute precision from computers

The opposite. Far too tolerant of the excuse "sorry, computer mistake." (But yeah, just at the same time as "the computer says so".)

If it’s no better than asking a random person, then where is the hype? I already know lots of people who can give me free, maybe incorrect guesses to my questions.

At least we won’t have to worry about it obtaining god-like powers over our society…

At least we won’t have to worry about it obtaining god-like powers over our society…

We all know someone who's better at self promotion than at whatever they're supposed to be doing. Those people often get far more power than they should have, or can handle—and ChatGPT is those people distilled.

1. Hunans may also never be 100% - but it seems they are more often correct. 2. When AI is wrong it's often not only slighty off, but completely off the rails. 3. Humans often tell you when they are not sure. Even if it's only their tone. AI is always 100% convinced it's correct.

It’s not AI it’s a machine learning model

Humans are imperfect, but this comes with some benefits to make up for it.

First, we know they are imperfect. People seem to put more faith into machines, though I do sometimes see people being too trusting of other people.

Second, we have methods for measuring their imperfection. Many people develop ways to tell when someone is answering with false or unjustified confidence, at least in fields they spend significant time in. Talk to a scientist about cutting edge science and you'll get a lot of 'the data shows', 'this indicates', or 'current theories suggest'.

Third, we have methods to handle false information that causes harm. Not always perfect methods, but there are systems of remedies available when experts get things wrong, and these even include some level of judging reasonable errors from unreasonable errors. When a machine gets it wrong, who do we blame?

Absolutely! And fourth, we have ways to make sure the same error doesn't happen again; we can edit Wikipedia, or tell the person they were wrong (and stop listening to them if they keep being wrong).

Aside: this is not what impartial means.

The bigger problem is lack of context. When I speak with a person or review search results, I can use what I know about the source to evaluate the information I'm given. People have different areas of expertise and use language and mannerisms to communicate confidence in their knowledge or lack thereof. Websites are created by people (most times) and have a number of contextual clues that we have learned to interpret over the years.

LLMs do none of this. They pose as a confident expert on almost everything, and are just as likely to spit out BS as a true answer. They don't cite their sources, and if you ask for the source sometimes they provide ones that don't contain the information cited or don't even exist. If you hired a researcher and they did that you wouldn't hire them again.

I find it ironic that computer scientists and technologists are frequently uberrationalists to the point of self parody but they get hyped about a technology that is often confidently wrong.

Just like the hype with AI and the billions of dollars going into it. There’s something there but it’s a big fat unknown right now whether any part of the investment will actually pay off - everyone needs it to work to justify any amount of the growth of the tech industry right now. When everyone needs a thing to work, it starts to really lose the fundamentals of being an actual product. I’m not saying it’s not useful, but is it as useful as the valuations and investments need it to be? Time will tell.

Guessing from the last sentence that you are one of those "most" who "can tolerate larger than expected inaccuracies".

How much inaccuraciy would that be ?

Most people I worked with either tell me "I don't know" or "I think x, but with not sure" when they are not sure about something, the issue with LLMs is they don't have this concept.

There's a huge difference between demonstrating something with fuzzy accuracy and playing something off as if it's giving good, correct answers. An honest way to handle that would be to highlight where the bot got it wrong instead of running with the answer as if it was right.

Deception isn't always outright lying. This video was deceitful in form and content and presentation. Their product can't do what they're implying it can, and it was put together specifically to mislead people into thinking it was comparable in capabilities to gpt-4v and other competitor's tech.

Working for Google AI has to be infuriating. They're doing some of the most cutting edge research with some of the best and brightest minds in the field, but their shitty middle management and marketing people are doing things that undermine their credibility and make them look like untrustworthy fools. They're a year or more behind OpenAI and Anthropic, barely competitive with Meta, and they've spent billions of dollars more than any other two companies, with a trashcan fire for a tech demo.

It remains to be seen whether they can even outperform Mistral 7b or some of the smaller open source models, or if their benchmark numbers are all marketing hype.

Honestly I agree. Humans make errors all the time. Perfection is not necessary and requiring perfection blocks deployment of systems that represent a substantial improvement over the status quo despite their imperfections.

The problem is a matter of degree. These models are substantially less reliable than humans and far below the threshold of acceptability in most tasks.

Also, it seems to me that AI can and will surpass the reliability of humans by a lot. Probably not by simply scaling up further or by clever prompting, although those will help, but by new architectures and training techniques. Gemini represents no progress in that direction as far as I can see.

If a human expert gave wrong answers as often and as confidently as LLMs, most would consider no longer asking them. Yet people keep coming back to the same LLM despite the wrong answers to ask again in a different way (try that with a human).

This insistence on comparing machines to humans to excuse the machine is as tiring as it is fallacious.

Where did you get the 100% number from? It's not in the original comment, it's not in a lot of similar criticisms of the models.

I'm unsure where this expectation of 100% absolute correctness comes from. I'm sure there are use cases, but I assume it's the vast minority and most can tolerate larger than expected inaccuracies.

As others hinted at, there's some bias because it's coming from a computer, but I think it's far more nuanced than that.

I've worked with many experts and professionals through my career ranging across medicine, various types of engineers, scientists, academics, researchers and so on and the pattern I often see is the level of certainty presented that always bothers me and the same is often embedded in LLM responses.

While humans don't typically quantify the certainty of their statements, the best SMEs I've ever worked with make it very clear what level of certainty they have when making professional statements. The SMEs who seem to be more often wrong than not speak in certainty quite often (some of this is due to cultural pressures and expectations surrounding being an "expert").

In this case, I would expect a seasoned scientist to say something in response to the duck question that: "many rubber ducks exist and are designed to float, this one very well might, we'd really need to test it or have far more information about the composition of the duck, the design, the medium we want it in (Water? Mecury? Helium?)" and so on. It's not an exact answer but you understand there's uncertainty there and we need to better clarify our question and the information surrounding that question. The fact is, it's really complex to know if it'll float or not from visual information alone.

It could have an osmimum ball inside that overcomes most the assumed buoyancy the material contains, including the air demonstrated to make it squeak. It's not transparent. You don't know for sure and the easiest way to alleviate uncertainty in this case is simply to test it.

There's so much uncertainty in the world, around what seem like the most certain and obvious things. LLMs seem to have grabbed some of this bad behavior from human language and culture where projecting confidence is often better (for humans) than being correct.

Is it possible for humans to be wrong about something, without lying?

I don't agree with the argument that "if a human can fail in this way, we should overlook this failing in our tooling as well." Because of course that's what LLMs are, tools, like any other piece of software.

If a tool is broken, you seek to fix it. You don't just say "ah yeah it's a broken tool, but it's better than nothing!"

All these LLM releases are amazing pieces of technology and the progress lately is incredible. But don't rag on people critiquing it, how else will it get better? Certainly not by accepting its failings and overlooking them.

If a broken tool is useful, do you not use it because it is broken ?

Overpowered LLMs like GPT-4 are both broken (according to how you are defining it) and useful -- they're just not the idealized version of the tool.

Maybe not if its the case that your use of the broken tool would result in the eventual undoing of your work. Like, lets say your staple gun is defective and doesn't shoot the staples deep enough, but it still shoots. You can keep using the gun, but it's not going to actually do its job. It seems useful and functional, but it isn't and its liable to create a much bigger mess.

So to continue the analogy, if the staple gun is broken and it requires you to do more than a working (but non-existent) staple gun BUT less work than doing the affixment without the broken staple gun, you would or would not use it ?

But nobody said they wouldn't use it. You said that. You came up with this idea and then demanded other people defend it.

I don't know why "critiquing the tool" is being equated to "refusing to use the tool."

I don't like calling something a strawman, because I think it's an overused argument, but...I mean...

I didn't come up with it nor ask anyone to defend it. I asked a different question about usefulness, and about what it means to him for something to be "broken".

My point is that the attempt to critique it was a failure. It provided no critique.

It was incomplete at the very least -- it assigned it the label of broken, but didn't explain the implications of that. It didn't define at what level of failure it would need to be to valuable.

Additionally, I didn't indicate whether or not he would refuse to use it -- specifically because I didn't know, because he didn't say.

We all use broken tools built on a fragile foundation of imperfect precision.

I think you are missing the point. If I do use it, then my result will be a broken and defective product. How exactly is that not clear? That's the point. It might not be observable to be, but whatever I'm affixing with the staple gun will come loose because its not working right and not sinking the staples in deep enough...

If I don't use it, then the tool is not used and provided no benefit...

It's not clear because it is false and I believe I can produce a proof if you are willing to validate that you accept my premise.

Your CPU, right now, has known defects. It will produce the wrong outputs for some inputs. It seems to meet your definition of broken.

Do you agree with that premise ?

“Broken” is word used by pedants. A broken tool doesn’t work. This works, most of the time.

Is a drug “broken” because it only cures a disease 80% of the time?

The framing most critics seem to have is “it must be perfect”.

It’s ok though, their negativity just means they’ll miss out on using a transformative technology. No skin off the rest of us.

I think you're reading a lot into GP's comment that isn't there. I don't see any ragging on people critiquing it. I think it's perfectly compatible to think we should continually improve on these things while also recognizing that things can be useful without being perfect

I think the comparison to humans is just totally useless. It isn’t even just that, as a tool, it should be better than humans at the thing it does, necessarily. My monitor is on an arm, the arm is pretty bad at positioning things compared to all the different positions my human arms could provide. But it is good enough, and it does it tirelessly. A tool is fit for a purpose or not, the relative performance compared to humans is basically irrelevant.

I think the folks making these tools tend to oversell their capabilities because they want us to imagine the applications we can come up with for them. They aren’t selling the tool, they are selling the ability to make tools based on their platform, which means they need to be speculative about the types of things their platform might enable.

Lying implies an intent to deceive despite, or giving a response despite having better knowledge, which I'd argue LLMs can't do, at least not yet. It just requires a more robust theory of mind than I'd consider them to realistically be capable of.

They might have been trained/prompted with misinformation, but then it's the people doing the training/prompting who are lying, still not the LLM.

Not to say this example was lying but they can lie just fine - https://arxiv.org/abs/2311.07590

They're lying in the same way that a sign that says "free cookies" is lying when there are actually no cookies.

I think this is a different usage of the word, and we're pretty used to making the distinction, but it gets confusing with LLMs.

You are making an imaginary distinction that doesn't exist. It doesn't even make any sense in the context of the paper i linked.

The model consistently and purposefully withheld knowledge it was directly aware of. This is lying under any useful definition of the word. You're veering off into meaningless philosophy that has no bearing on outcomes and results.

To the question of whether it could have intent to deceive, going to the dictionary, we find that intent essentially means a plan (and computer software in general could be described as a plan being executed) and deceive essentially means saying something false. Furthermore, its plan is to talk in ways that humans talk, emulating their intelligence, and some intelligent human speech is false. Therefore, I do believe it can lie, and will whenever statistically speaking a human also typically would.

Perhaps some humans never lie, but should the LLM be trained only on that tiny slice of people? It's part of life, even non-human life! Evolution works based on things lying: natural camouflage, for example. Do octopuses and chameleons "lie" when they change color to fake out predators? They have intent to deceive!

Most humans I professionally interact with don't double down on their mistakes when presented with evidence to the contrary.

The ones that do are people I do my best to avoid interacting with.

LLMs act more like the latter, than the former.

I think this problem needs to be solved at a higher level, and in fact Bard is doing exactly that. The model itself generates its output, and then higher-level systems can fact check it. I've heard promising things about feeding back answers to the model itself to check for consistency and stuff, but that should be a higher level function (and seems important to avoid infinite recursion or massive complexity stemming from the self-check functionality).

I'm not a fan of current approaches here. "Chain of thought" or other approaches where the model does all its thinking using a literal internal monologue in text seem like a dead end. Humans do most of their thinking non-verbally and we need to figure out how to get these models to think non-verbally too. Unfortunately it seems that Gemini represents no progress in this direction.

The point of “verbalizing” the chain of thought isn’t that it’s the most effective method. And frankly I don’t think it matters that humans think non verbally. The goal isn’t to create a human in a box. Verbalizing the chain of thought allows us to audit the thought process, and also create further labels for training.

No, the point of verbalizing the chain of thought is that it's all we know how to do right now.

And frankly I don’t think it matters that humans think non verbally

You're right, that's not the reason non-verbal is better, but it is evidence that non-verbal is probably better. I think the reason it's better is that language is extremely lossy and ambiguous, which makes a poor medium for reasoning and precise thinking. It would clearly be better to think without having to translate to language and back all the time.

Imagine you had to solve a complicated multi-step physics problem, but after every step of the solution process your short term memory was wiped and you had to read your entire notes so far as if they were someone else's before you could attempt the next step, like the guy from Memento. That's what I imagine being an LLM using CoT is like.

I mean a lot of problems are amenable to subdivision into parts where the process of each part is not needed for the other parts. It's not even clear that humans usually hold in memory all of process of the previous parts especially the it won't be used later.

"Chain of thought" or other approaches where the model does all its thinking using a literal internal monologue in text seem like a dead end. Humans do most of their thinking non-verbally and we need to figure out how to get these models to think non-verbally too.

Insofar as we can say that models think at all between the input and the stream of tokens output, they do it nonverbally. Forcing the structure of reduce some of it to verbal form short of the actual response-of-concern does not change that, just as the fact that humans reduce some of their thought to verbal form to work through problems doesn't change that human thought is mostly nonverbal.

(And if you don't consider what goes on between input and output thought, than chain of thought doesn't force all LLM thought to be verbal, because only the part that comes out in words is "thought" to start with in that case -- you are then saying that the basic architecture, not chain of thought prompting, forces all thought to be verbal.)

You're right, the models do think non-verbally. However, crucially, they can only do so for a fixed amount of time for each output token. What's needed is a way for them to think non-verbally continuously, and decide for themselves when they've done enough thinking to output the next token.

Is it clear that humans can think nonverbally (including internal monologue) continuously? As in, for difficult reasoning tasks, do humans benefit a lot from extra time if they are not allowed internal monologue. Genuine question

Humans do most of their thinking non-verbally and we need to figure out how to get these models to think non-verbally too.

That's a very interesting point, both technically and philosophically.

Where Gemini is "multi-modal" from training, how close do you think that gets? Do we know enough about neurology to identical a native language in which we think? (not rhetorical questions, I'm really wondering)

Neural networks are only similar to brains on the surface. Their learning process is entirely different and their internal architecture is different as well.

We don’t use neural networks because they’re similar to brains. We use them because they are arbitrary function approximators and we have an efficient algorithm (backprop) coupled with hardware (GPUs) to optimize them quickly.

I’m not an expert but I suspect that this aspect of lack of correctness in these models might be fundamental to how they work.

I suppose there’s two possible solutions: one is a new training or inference architecture that somehow understand “facts”. I’m not an expert so I’m not sure how that would work, but from what I understand about how a model generates text, “truth” can’t really be a element in the training or inference that affects the output.

the second would be a technology built on top of the inference to check correctness, some sort of complex RAG. Again not sure how that would work in a real world way.

I say it might be fundamental to how the model works because as someone pointed out below, the meaning of the word “material” could be interpreted as the air inside the duck. The model’s answer was correct in a human sort of way, or to be more specific in a way that is consistent with how a model actually produces an answer- it outputs in the context of the input. If you asked it if PVC is heavier than water it would answer correctly.

Because language itself is inherently ambiguous and the model doesn’t actually understand anything about the world, it might turn out that there’s no universal way for a model to know what’s true or not.

I could also see a version of a model that is “locked down” but can verify the correctness of its statements, but in a way that limits its capabilities.

this aspect of lack of correctness in these models might be fundamental to how they work.

Is there some sense in which this isn't obvious to the point of triviality? I keep getting confused because other people seem to keep being surprised that LLMs don't have correctness as a property. Even the most cursory understanding of what they're doing understands that it is, fundamentally, predicting words from other words. I am also capable of predicting words from other words, so I can guess how well that works. It doesn't seem to include correctness even as a concept.

Right? I am actually genuinely confused by this. How is that people think it could be correct in a systematic way?

Maybe you simplify a bit what "guessing words from other words" means. HOW do you guess this, is what's mysterious to many: you can guess words from other words due to habit of language, a model of mind of how other people expect you to predict, a feedback loop helping you do it better over time if you see people are "meh" at your bad predictions, etc.

So if the chatbot is used to talking, knows what you'd expect, and listens to your feedback, why wouldn't it also want to tell the truth like you would instinctively, even best effort only ?

Sadly, the chatbots doesn't yet really care about the game it's playing, it doesn't want to make it interesting, it's just like a slave producing minimal low-effort outputs. I've talked to people exploited for money in dark places, and when they "seduce" you, they talk like a chatbot: most of it is lie, it just has to convince you a little bit to go their way, they pretend to understand or care about what you say, but end of the day, the goal is for you to pay. Like the chatbot.

Is there some sense in which this isn't obvious to the point of triviality?

This is maybe a pedantic "yes", but is also extremely relevant to the outstanding performance we see in tasks like programming. The issue is primarily the size of the correct output space (that is, the output space we are trying to model) and how that relates to the number of parameters. Basically, there is a fixed upper bound on the amount of complexity that can be encoded by a given number of parameters (obvious in principle, but we're starting to get some theory about how this works). Simple systems or rather systems with simple rules may be below that upper bound, and correctness is achievable. For more complex systems (relative to parameters) it will still learn an approximation, but error is guaranteed.

I am speculating now, but I seriously suspect the size of the space of not only one or more human language but also every fact that we would want to encode into one of these models is far too big a space for correctness to ever be possible without RAG. At least without some massive pooling of compute, which long term may not be out of the question but likely never intended for individual use.

If you're interested, I highly recommend checking out some of the recent work around monosemanticity for what fleshing out the relationship between model-size and complexity looks like in the near term.

I think very few people on this forum believe LLMs are correct in a systematic way, but a lot of people seem to think there's something more than predicting words from other words.

Modern machine learning models contain a lot of inscrutable inner layers, with far too many billions of parameters for any human to comprehend, so we can only speculate about what's going on. A lot of people think that, in order to be so good at generating text, there must be a bunch of understanding of the world in those inner layers.

If a model can write convincingly about a soccer game, producing output that's consistent with the rules, the normal flow of the game and the passage of time - to a lot of people, that implies the inner layers 'understand' soccer.

And anyone who noodled around with the text prediction models of a few decades ago, like Markov chains, Bayesian text processing, sentiment detection and things like that can see that LLMs are massively, massively better than the output from the traditional ways of predicting the next word.

Just to play devil’s advocate: we can train neural networks to model some functions exactly, given sufficient parameters. For example simple functions like ax^2 + bx + c.

The issue is that “correctness” isn’t a differentiable concept. So there’s no gradient to descend. In general, there’s no way to say that a sentence is more or less correct. Some things are just wrong. If I say that human blood is orange that’s not more incorrect than saying it’s purple.

Because it is assumed that it can think or/and reason. In this case, knowing the concepts of density, the density of a material, detecting the material from an image, detecting what object this image is. And, most importantly, knowing that this object is not solid. Because then it could not float.

Yeah. I think there's some ambiguity around the meaning of reasoning- because it is a kind of reasoning to say a Duck's material is less dense than water. In a way it's reasoned that out, and it might actually say something about the way a lot of human reasoning works.... (especially if you've ever listened to certain people talk out loud and say to yourself... huh?)

Bing chat uses gpt-4 and sites sources from it's retrieval.

To be fair, one could describe the duck as being made of air and vinyl polymer, which in combination are less dense than water. That's not how humans would normally describe it, but that's kind of arbitrary; consider how aerogel is often described as being mostly made of air.

Is an aircraft carrier made of a material that is less dense than water?

Is an aircraft carrier made of metal and air? Or just metal?

Where’s the distinction between the air that is part of the boat, and the air that is not? If the air is included in the boat, should we all be wearing life vests?

only if you average it out over volume :P

If I take all of the air out of a toy duck, it is still a toy duck. If I take all of the vinyl/rubber out of a toy duck, it is just the atmosphere remaining

The material of the duck is not air. It's not sealed. It would still be a duck in a vacuum and it would still float on a liquid the density of water too.

That's a tricky one though since the question is, is the air inside of the rubber duck part of the material that makes it? If you removed the air it definitely wouldn't look the same or be considered a rubber duck. I gave it to the bot since when taking ALL the material that makes it a rubber duck, it is less dense than water.

A rubber duck in a vacuum is still a rubber duck and it still floats (though water would evaporate too quickly in a vacuum, it could float on something else of the same density).

A rubber duck with a vacuum inside (removing the air material) of it is just a piece of rubber with eyes. Assuming OP's point about the rubber not being less dense than water, it would sink, no?

No. Air is less dense than water; vacuum is even less dense than air. A rubber duck will collapse if you seal it and try to pull a vacuum inside with air outside, but if the rubber duck is in a vacuum then it will have only vacuum inside and it will still float on a liquid the density of water. If you made a duck out of a metal shell you could pull a vacuum inside, like a thermos bottle, and it would float too.

The metal shell is ridged though, so the volume maintains the same with the vacuum. A rubber duck collapses with a vacuum inside of it, thus losing the shape of a duck and reducing the volume of the object =). That's why I said it's just a piece of rubber with eyes.

A rubber duck collapses with a vacuum inside of it

Not if there is vacuum outside too. In a vacuum it remains a duck and still floats.

If you hold a rubber duck under water and squeeze out the air, it will fill with water and still be a rubber duck. If you send a rubber duck into space, it will become almost completely empty but still be a rubber duck. Therefore, the liquid used to fill the empty space inside it is not part of the duck.

I mean apply this logic to a boat, right? Is the entire atmosphere part of the boat? Are we all on this boat as well? Is it a cruise boat? If so, where is my drink?

I totally agree with you on the confident lies. And it’s really tough. Technically the duck is made out of air and plastic right?

If I pushed the model further on the composition of a rubber duck, and it failed to mention its construction, then it’d be lying.

However there is this disgusting part of language where a statement can be misleading, technically true, not the whole truth, missing caveats etc.

Very challenging problem. Obviously Google decided to mislead the audience and basically cover up the shortcomings. Terrible behaviour.

Calling the air inside the duck (which is not sealed inside) part of its "material" would be misleading. That's not how most people would interpret the statement and I'm confident that's not the explanation for why the statement was made.

The air doesn’t matter. Even with a vacuum inside it would float. It’s the overall density of “the duck” that matters, not the density of the plastic.

A canoe floats, and that doesn't even command any thought regarding whether you can replace trapped air with a vacuum. If you had a giant cube half full of water, with a boat on the water, the boat would float regardless of whether the rest of the cube contained air or vacuum, and regardless of whether the boat traps said air (like a pontoon) or is totally vented (like a canoe). The overall density of the canoe is NOT influenced by its shape or any air, though. The canoe is strictly more dense than water (it will sink if it capsizes) yet in the correct orientation it floats.

What does matter, however, is the overall density of the space that was water and became displaced by the canoe. That space can be populated with dense water, or with a less dense canoe+air (or canoe+vacuum) combination. That's what a rubber duck also does: the duck+air (or duck+vacuum) combination is less dense than the displaced water.

No, the density of the object is less than water, not the density of the material. The Duck is made of plastic, and it traps air. Similarly, you can make a boat that floats in water out of concrete or metal. It is an important distinction when trying to understand buoyancy.

EDIT: never mind, I missed the exact wording about being "made of a material..." which is definitely false then. Thanks for the correction below.

Preserving the original comment so the replies make sense:

---

I think it's a stretch to say that's false.

In a conversational human context, saying it's made of rubber implies it's a rubber shell with air inside.

It floats because it's rubber [with air] as opposed to being a ceramic figurine or painted metal.

I can imagine most non-physicist humans saying it floats because it's rubber.

By analogy, we talk about houses being "made of wood" when everybody knows they're made of plenty of other materials too. But the context is instead of brick or stone or concrete. It's not false to say a house is made of wood.

This is what the reply was:

Oh, it it's squeaking then it's definitely going to float.

It is a rubber duck.

It is made of a material that is less dense than water.

Full points for saying if it's squeaking then it's going to float.

Full points for saying it's a rubber duck, with the implication that rubber ducks float.

Even with all that context though, I don't see how "it is made of a material that is less dense than water" scores any points at all.

Yeah, I think arguing the logic behind these responses misses the point, since an LLM doesn't use any kind of logic--it just responds in a pattern that mimics the way people respond. It says "it is made of a material that is less dense than water" because that is a thing that is similar to what the samples in its training corpus have said. It has no way to judge whether it is correct, or even what the concept of "correct" is.

When we're grading the "correctness" of these answers, we're really just judging the average correctness of Google's training data.

Maybe the next step in making LLM's more "correct" is not to give them more training data, but to find a way to remove the bad training data from the set?

In a conversational human context, saying it's made of rubber implies it's a rubber shell with air inside.

Disagree. It could easily be solid rubber. Also, it's not made of rubber, and the model didn't claim it was made of rubber either, so it's irrelevant.

It floats because it's rubber [with air] as opposed to being a ceramic figurine or painted metal.

A ceramic figurine or painted metal in the same shape would float too. The claim that it floats because of the density of the material is false. It floats because the shape is hollow.

It's not false to say a house is made of wood.

It's false to say a house is made of air simply because its shape contains air.

There's nothing wrong with what you're saying, but what do you suggest? Factuality is an area of active research, and Deepmind goes into some detail in their technical paper.

The models are too useful to say, "don't use them at all." Hopefully people will heed the warnings of how they can hallucinate, but further than that I'm not sure what more you can expect.

The problem is not with the model, but with its portrayal in the marketing materials. It's not even the fact that it lied, which is actually realistic. The problem is the lie was not called out as such. A better demo would have had the user note the issue and give the model the opportunity to correct itself.

But you yourself said that it was so convincing that the people doing the demo didn't recognize it as false, so how would they know to call it out as such?

I suppose they could've deliberately found a hallucination and showcased it in the demo. In which case, pretty much every company's promo material is guilty of not showcasing negative aspects of their product. It's nothing new or unique to this case.

They should have looked more carefully, clearly. Especially since they were criticized for the exact same thing in their last launch.

I did some reading and it seems that rubber's relative density to water has to do with its manufacturing process. I see a couple of different quotes on the specific gravity of so-called 'natural rubber', and most claim it's lower than water.

Am I missing something?

I asked both Bard (Gemini at this point I think?) and GPT-4 why ducks float, and they both seemed accurate: they talked about the density of the material plus the increased buoyancy from air pockets and went into depth on the principles behind buoyancy. When pressed they went into the fact that "rubber"'s density varies by the process and what it was adulterated with, and if it was foamed.

I think this was a matter of the video being a brief summary rather than a falsehood. But please do point out if I'm wrong on the rubber bit, I'm genuinely interested.

I agree that hallucinations are the biggest problems with LLMs, I'm just seeing them get less commonplace and clumsy. Though, to your point, that can make them harder to detect!

Someone on Twitter was also skeptical that the material is more dense than water. I happened to have a rubber duck handy so I cut a sample of material and put it in water. It sinks to the bottom.

Of course the ultimate skeptic would say one test doesn't prove that all rubber ducks are the same. I'm sure someone at some point in history has made a rubber duck out of material that is less dense than water. But I invite you to try it yourself and I expect you will see the same result unless your rubber duck is quite atypical.

Yes, the models will frequently give accurate answers if you ask them this question. That's kind of the point. Despite knowing that they know the answer, you still can't trust them to be correct.

Ah good show :). I was rather preoccupied with the question but didn't have one handy. Well, I do, but my kid would roast me slowly over coals if I so much as smudged it. Ah the joy of the Internet, I did not predict this morning that I would end the day preoccupied with the question of rubber duck density!

I guess for me the question of whether or not the model is lying or hallucinating is if it's correctly summarizing its source material. I find very conflicting materials on the density of rubber, and most of the sources that Google surfaces claim a lower density than water. So it makes sense to me that the model would make the inference.

I'm splitting hairs though, I largely agree with your comment above and above that.

To illustrate my agreement: I like testing AIs with this kind of thing... a few months ago I asked GPT for advice as to how to restart my gas powered water heater. It told me the first step was to make sure the gas was off, then to light the pilot light. I then asked it how the pilot light was supposed to stay lit with the gas off and it backpedaled. My imagining here is that because so many instructional materials about gas powered devices emphasize to start by turning off the gas, that weighted it as the first instruction.

Interesting, the above shows progress though. I realized I asked GPT 3.5 back then, I just re-asked 3.5 and then asked 4 for the first time. 3.5 was still wrong. 4 told me to initially turn off the gas to disappate it, then to ensure gas was flowing to the pilot before sparking it.

But that said I am quite familiar with the AI being confidently wrong, so your point is taken, I only really responded because I was wondering if I was misunderstanding something quite fundamental about the question of density.

Devil's advocate. It is made of a material less dense than water. Air.

It certainly isn't how I would phrase it, and I wouldn't count air as what something is made of, but...

Soda pop is chocked full of air, it's part of it! And I'd say carbon dioxide is a part of the recipe, of pop.

So it's a confusing world for a young LLM.

(I realise it may have referenced rubber prior, but it may have meant air... again, Devil's advocate)

When you make carbonated soda you put carbon dioxide in deliberately and use a sealed container to hold it in. When you make a rubber duck you don't put air in it deliberately and it is not sealed. Carbonated soda ceases to be carbonated when you remove the air. A rubber duck in a vacuum is still a rubber duck and it even still floats.

If the rubber duck has air inside, it is known, and intentional, for it is part of that design.

If you remove the air from the duck, and stop it so it won't refill, you have a flat rubber duck, which is useless for its design.

Much as flat pop is useless for its design.

And this nuance is even more nuance-ish than this devil's advocate post.

It also says the attribute of squeaking means it'll definitely float

That's actually pretty clever because if it squeaks, there is air inside. How many squeaking ducks have you come across that don't float?

You could call it clever or you could call it a spurious correlation.

I don't see it as a problem with most non-critical uses cases (critical being things like medical diagnoses, controlling heavy machinery or robotics, etc).

LLMs right now are most practical for generating templated text and images, which when paired with an experienced worker, can make them orders of magnitude more productive.

Oh, DALL-E created graphic images with a person with 6 fingers? How long would it have taken a pro graphic artist to come up with all the same detail but with perfect fingers? Nothing there they couldn't fix in a few minutes and then SHIP.

> Nothing there they couldn't fix in a few minutes and then SHIP.

If by ship, you mean put directly into the public domain then yes.

https://www.goodwinlaw.com/en/insights/publications/2023/08/...

and for more interesting takes: https://www.youtube.com/watch?v=5WXvfeTPujU&

The duck is indeed made of a material that is less dense. Namely water and air.

If you go to such technical routes your definition is wrong too. It doesn't float because it contains air. If you poke in the head of the duck it will sink. Even though at all times it contains air.

The duck is made of water and air? Which duck are we talking about here.

LLMs do not lie, nor do they tell the truth. They have no goal as they are not agents.

With apologies to Dijkstra, the question of whether LLMs can lie is about as relevant as the question of whether submarines can swim.

Well this seems like a huge nitpick. If a person said that, you would afford them some leeway, maybe they meant the whole duck, which includes the hollow part in the middle.

As an example, when most people say a balloon's lighter than air, they mean an inflated balloon with hot air or helium, but you catch their meaning and don't rush to correct them.

The model specifically said that the material is less dense than water. If you said that the material of a balloon is less dense than air, very few people would interpret that as a correct statement, and it could be misleading to people who don't know better.

Also, lighter-than-air balloons are intentionally filled with helium and sealed; rubber ducks are not sealed and contain air only incidentally. A balloon in a vacuum would still contain helium (if strong enough) but would not rise, while a rubber duck in a vacuum would not contain air but would still easily float on a liquid of similar density to water.

The reason why it seems like a nitpick is that this is such an inconsequential thing. Yeah, it's a false statement but it doesn't really matter in this case, nobody is relying on this answer for anything important. But the point is, in cases where it does matter these models cannot be trusted. A human would realize when the context is serious and requires accuracy; these models don't.

I, a non-AGI, just ‘hallucinated’ yesterday. I hallucinated that my plan was to take all of Friday off and started wondering why I had scheduled morning meetings. I started canceling them in a rush. In fact, all week I had been planning to take a half day, but somehow my brain replaced the idea of a half day off with a full day off. You could have asked me and I would have been completely sure that I was taking all of friday off.

People seem to want to use LLMs to mine knowledge, when really it appears to be a next-gen word-processor.

Given the misleading presentation by real humans in these "whole teams" that this tweet corrects, this doesn't illustrate any underlying powers by the model

language models do not lie. (this pedantic distinction being important, because language models.)

Agree, then the question becomes how will this issue play out?

Maybe AI correctness will be similar to automobile safety. It didn’t take long for both to be recognized as fundamental issues with new transformative technologies.

In both cases there seems to be no silver bullet. Mitigations and precautions will continue to evolve, with varying degrees of effectiveness. Public opinion and legislation will play some role.

Tragically accidents will happen and there will be a cost to pay, which so far has been much higher and more grave for transportation.

It's the single biggest problem with LLMs and Gemini isn't solving it.

I loved it when the lawyers got busted for using a hallucinating LLM to write their briefs.

After asserting it's a rubber duck, there are some claims without follow-up:

- Just after that it doesn't translate the "rubber" part

- It states there's no land nearby for it to rest or find food in the middle of the ocean: if it's a rubber duck it doesn't need to rest nor feed. (That's a missed opportunity to mention the infamous "Friendly Floatees spill"[1] in 1992 as some rubber ducks floated to that map position). Although it seems to recognize geographical features of the map, it fails to mention Easter Island is relatively nearby. And if it were recognized as a simple duck — which it described as a bird swimming in the water — it seems oblivious to the fact that the duck might feed itself in the water. It doesn't mention either that the size of the duck seems abnormally big in that map context.

- The concept of friends and foes doesn't apply to a rubber duck either. Btw labeling the duck picture as a friend and the bear picture as a foe seems arbitrary (e.g. a real duck can be very aggressive even with other ducks.)

Among other things, the astronomical riddle seems also flawed to me: it answered "The correct order is Sun, Earth, Saturn".

I'd like for it to state :

- the premises it used, like "Assuming it depicts the Sun, Saturn and the Earth" (there are other stars, other ringed-planets, and the Earth similarity seems debatable)

- the sorting criteria it used (e.g. using another sorting key like the average distance from us "Earth, Sun, Saturn" can be a correct order)

[1] https://en.wikipedia.org/wiki/Friendly_Floatees_spill

This is so crazy. Google invented transformers which is the bases for all these models. How do they keep fumbling like this over and over. Google Docs created in 2006! Microsoft is eating their lunch. Google creates the ability to change VM's in place and makes a fully automated datacenter. Amazon and Microsoft are killing them in the cloud. Google has been working on self driving longer than anyone. Tesla is catching up and will most likely beat them.

The amount of fumbles is monumental.

Microsoft eating Google's lunch on documents is laughable at best. Not to mention it confuses the entire timeline of office productivity software??

Is paid MS Teams is more or less common than paid GSuite? It's hard to find stats on this. GSuite is the better product IMO, but MS has a stronger b2b reputation, and anecdotally I hear more about people using Teams.

Does anyone use paid GSuite for anything other than docs/drive/Gmail ? In all companies I've worked at, we've used GSuite exclusively for those, and used slack/discord for chat, and zoom/discord for video/meetings.

I know that MS Teams is a more full-featured product suite, but even at companies that used it, we still used Zoom for meetings.

My company uses Meet. It works great! I like it more than Zoom.

zoom is horrible. Meet works for me.

Counterpoint: I take probably 3/4 of my meetings on Zoom and 1/4 on meet. So on any given day I'm probably doing at least 1 on meet. If I look back on any day at all the meetings with unaceptable audio lag or very degraded video quality? They are always all "meet". It is just hands-down worse when networks are unreliable.

In addition meet insists I click on the same about 4 or 5 different "Got it" feature popups every single call, and every call also insists on asking me if I want to use Duet AI to make my background look shit which just adds to annoyance.

It's a lot better than it used to be. In 2020, universities that already had GSuite (which includes Meet) still paid to put their classes on Zoom. Personally I like Zoom more today, mostly because even my high-end laptop can struggle with Meet.

I like meet too, but the inability to send messages to breakout rooms is quite annoying.

GSuite for calendar makes sense too. Chat sucks, and Meet would be decent if it weren't so laggy, but those are two things you can easily not use.

Nobody pays for Teams, but everyone pays for Office, and if you get Teams for free with it ...

Not to mention it integrates with Azure365, which damn near certainly the IT department has already standardized on, feels comfortable with, and has been flooded with enough propaganda to believe anything else is massively less secure. Plus Teams has tons of knobs and buttons for managing what your users do with it... and companies love managing their employees lol.

Sure, Teams is a steaming pile of crap to use day-to-day as a chat app, the search is slow and vague - and depending on policy, probably links you to messages that no longer exist in the archive lol. Oh you want to download message history? Nah gotta get an admin to do that bruh.

I'm in one nonprofit org using MS 365 and Teams, and listening to the guy behind the original decision talk about that ecosystem, I think its popularity really does come from propaganda. I was almost convinced until I actually used it... what a piece of junk. It's ugly for me and borderline unusable for our nontechnical users. I'm in charge now and considering ditching it.

The only saving grace is that members who can't deal with it are using local MS Office, which has some integrations with 365, thus making it kinda viable. But I feel like it's still a net negative.

This is how it became so popular so fast. If they had charged for it, all those Teams users would still be using Zoom.

Teams will likely still be around in 20 years. I doubt gsuite will exist in 5... or even 1.

GSuite has existed since 2006, so it's not like Google lacks focus on it.

That's ancient by google metrics!!!

Kinda. In 2006 they launched "Google Apps for your domain." The name quickly changed to "Google Apps" and then in 2016 it became "GSuite." In 2020 they changed the name to Google Workspace. And of course, in 2022 they tried to kick all of the free "Gsuite Legacy" users off the platform and make them pay for Google Workspace lol.

Gsuite is clearly a lot better product than Office365. I feel like I'm taking crazy pills when I see many institutions make the wrong choice here.

I base about 50% of my choice of employer on what they choose in that area.

GSuite is an awful product for an employer.

If you have a problem there’s no one available to help you.

On the MS side they will literally pull an engineer who is writing the code for the product you have a problem for to help resolve the issue if you’re large enough.

The part you see in your browser isn’t the only part of the product a company has to buy. In fact, it’s not even the most expensive bit. If you see the most expensive plans for most SAAS products (ie the enterprise plans) almost the entire difference in costs is driven by support illustrating the importance and value of support.

Google unfortunately is awful at this.

I worked at many companies in my times and all of them used teams except from one that used slack but all used MS products, none used googles.

I was at MS in 2008 September and internally they had a very beautiful and well functioning Office web already (named differently, forgot the name but it wasn't sharepoint if I recall correctly, I think it had to do something with expense reports?) that would put Google Docs to shame today. They just didn't want to cannibalize their own product.

Microsoft demoed Office Web Apps in 2008 L.A PDC it seems: https://www.wired.com/2008/10/pdc-2008-look-out-google-docs-...

Don't forget they also invented XHR (aka fetch) in 2001. https://en.wikipedia.org/wiki/XMLHttpRequest

Kind of, using it became known as "AJAX" and it took many many years (and the addition of promises to JS) before the more sophisticated "Fetch API" became available.

Even then usage of AJAX declined rather slowly as it was so established, and indeed even now it's still used by many websites!

I assume you mean the decline in the use of the term AJAX as it was now just the standard and you don’t need to use that to describe your site or tool as being capable of being highly interactive and dynamic vs just static.

Before the invention of the xmlhttprequest there was so little you could do with JS most dynamic content was some version of shifty tricks with iframes or img tags or anything that could trigger the browser to make a server request to a url that you could generate dynamically.

Fetch was the formalization of the xmlhttprequest (hence the use of xhr as the name of the request type ). Jquery wrapped it really nicely and essentially popularized (they may have invented async js leveraging callbacks and the like), the creation of promises was basically the formalization and standardization of this.

So AJAX itself is in fact used almost in the entire totality of the web, the term has become irrelevant given the absolute domination of the technology.

Funny, I asked Google Bard to guess what the actual product name was from the comment.

"It was probably Office Web Apps. It was a web-based office suite that was introduced in 2008. It included Word Web App, Excel Web App, Powerpoint Web App, and OneNote Web App. It was not SharePoint, but it was based on SharePoint technology."

Does bard browse the web yet? Is it possible it read the parent comment?

Wild that we have to ask these questions.

Classic innovator’s dilemma!

Interesting how it seems like MS may have been right this time? They were able to milk Office for years, and despite seeming like it might, Google didn't eat their lunch.

People still email word docs around. It’s nuts. Maybe Exchange is smart enough to intercept them and say “hey use this online one instead”? At least for intra-org..

I think the ability to actually email the docs around is half the value proposition. Having to always refer back to the cloud versions is annoying as hell when you're not actually collaborating, just showing someone a thing.

I email Word docs around. It’s like low-tech version control - I know exactly what was in the doc, and can recover it easily.

More like the Acquirer's dilemma.

Google Analytics - acquired 2004, renamed from Urchin Analytics

Google Docs - acquired 2004, renamed from Writely

Youtube - acquired 2005

Android - acquired, 2005 (Samsung have done more to advance the OS than Google themselves)

Don’t forget that McAfee was delivering virus scanning in a browser in 1998 with active x support, TinyMCE was full wysiwyg for content in the browser by 2004, and Google docs was released in 2006 on top of a huge ecosystem of document solutions and even some real-time co-authoring document writing platforms.

2008 is late to the party for a docs competitor! Microsoft got the runaround by Google and after Google launched docs they could have clobbered Microsoft which kind of failed to respond properly in kind, but they didn’t push the platform hard enough to eat the corporate market share, and didn’t follow up with a share point alternative that would appeal to the enterprise, and kind of blew the opportunity imo.

I mean to this day Google docs is free but it still hasn’t unseated Word in the marketplace, but the real killer app that keeps office on top is Excel, which some companies built their entire tooling around.

It’s crazy interesting to look back and realize how many twists there were leading us to where we are today.

Btw it was Office Server or Sharepoint Portal earlier (this is like Frontpage days so like 2001?) and Microsoft called it Tahoe internally. I don’t think it became Sharepoint until Office 365 launched.

The XMLHTTP object launched in 2001 and was part of the dhtml wave. That gave a LOT of the capabilities to browsers that we currently see as browser-based word processing, but there were efforts with proprietary extensions going back from there they just didn’t get broad support or become standards. I saw some crazy stuff at SGI in the late 90s when I was working on their visual workstation series launch.

Google Apps have several other problems as well.

1. Poor Google Drive interface makes managing documents difficult.

2. You cannot just get a first class Google Doc file which you can then share with others over email, etc. Very often you don’t want to just share a link to a document online.

3. Lack of desktop apps.

I was at MS in 2008 September and internally they had a very beautiful and well functioning Office web already

So why did they never release that and went with Office 365 instead?

They did, it was called Office Online with Word, PowerPoint, Excel and SkyDrive (later OneDrive). Everything got moved under the Office 365 umbrella because selling B2B cloud packages (with Sharepoint, Azure AD, Power BI, Teams, Power Automate) was more lucrative than selling B2C subscriptions.

NetDocs was an effort in 2000/2001 that is sometimes characterized as a web productivity suite. There was an internal battle between the Netdocs and Office groups, and Office won.

https://www.zdnet.com/article/netdocs-microsofts-net-poster-...

https://www.eweek.com/development/netdocs-succumbs-to-xp/

Google Docs created in 2006

tech was based on an acquired company, Google just abused their search monopoly to make it more popular(same thing they did with YT). This has been the strategy for every service they've ever made, Google really hasn't launched a decent in-house product since Gmail and even that was grown using their search monopoly as free advertising

Google Docs originated from Writely, a web-based word processor created by the software company Upstartle and launched in August 2005

Google really hasn't launched a decent in-house product since Gmail

What about Chrome? And Chromebooks?

Sorry if this was a joke and I didn't spot it. Chrome was based on WebKit which was itself based on KHTML if memory serves. Chromebooks are based on a version of that outside engine running on top of Linux which they also didn't create.

Chrome as a project was still a Google thing even if they used Konqueror's rendering library.

The process model was the novel selling point at the time from my memory [1].

[1] https://www.scottmccloud.com/googlechrome/

The faster javascript runtime was what made it a success IMO.

The leveraging their search monopoly to push it and paying other software to sneak it into installs is what made it a success.

It's not a joke. Just because they didn't write everything from scratch (Chromebooks also are made with hard disks that Google didn't create from directly mining raw materials and performing all intermediate manufacturing stages) doesn't mean they haven't released successful products that they didn't just buy in.

Webkit is not a browser.

They used the KDE-derived HTML renderer, sure, but they wrote the whole Javascript runtime from scratch, which was what gave it the speed they used as a selling point.

mmm, WebKit?

Chromium browsers do use WebKit

https://www.chromium.org/developers/design-documents/display...

That’s extremely outdated. There’s very little WebKit code remaining in Chromium today.

It was a fork from the beginning and it is blink engine for 10 years.

I laughed out loud for this one

Chromebooks are worse version of the netbooks from 2008, which ran an actual desktop OS. Chromebooks are OLPCs for the rich world, designed with vendor lock-in built in. They eventually end up at discount wholesale lots if not landfills because how quickly they go obsolete.

Engineer-driven company. Not enough top-down direction on the products. Too much self-perceived moral high ground. But lately they've been changing this.

Uhh, no, not really; quite the opposite in fact.

Under Eric Schmidt they were engineer-driven, during the golden era of the 2000s. Nowadays they're MBA driven, which is why they had 4 different messaging apps from different product managers.

Lack of top-down direction is what allowed that situation. Microsoft is MBA-driven and usually has a coherent product lineup, including messaging.

Also, "had." Google cleaned things up. They still sometimes do stuff just cause, but it's a lot less now. I still feel like Meet using laggy VP9 (vs H.264 like everyone else) is entirely due to engineer stubbornness.

The same Microsoft that squandered MSN messenger and Skype and then brought us the abomination that is MS teams?

Those are two messaging apps regular people can actually name, unlike all of Google's messaging apps. MSN Messenger survived 13 years supposedly. Skype was also a big thing for several years MS owned it.

And I hate Teams personally, but lots of teams use it.

but lots of teams use it.

I bet most team members who switched from Slack to Microsoft Teams do not feel like they consented or were asked for their opinions beforehand.

The same Microsoft that recently brought us "New Teams" and "New Outlook" and gave us a reskinned version of the same programs but now we have it installed twice?

I would say that Microsoft's craziness around buying Kin and Nokia, and Windows 8, RT edition, etc etc, was far more fundamental product misdirection than anything Google has ever done.

Microsoft failed to enter the mobile space, yeah. Google fumbled with the Nexus stuff, even though they succeeded with the Android software. But bigger picture, Microsoft was still able to diversify their revenue sources a lot while Google failed to do so.

That's true, although Pixel seems good as a successor, but the big thing Microsoft did was use what they had to get into new markets.

Procuring Azure is a good option for lots of companies because most companies' IT staff know AD and Microsoft in general, and Microsoft's cloud offers them a way to use the same (well, not the same, but it's too late by then) tools to manage their company IT.

I'm not disagreeing with its success, but I do think they had a much simpler journey, as to my understanding a lot of it involved cloudifying their locked-in enterprise customers, rather than diversifying into new markets.

20 versions of .net is wonderful. Changing the names of features over and over again is great too. I am also pleased that windows ten is the last version of windows.

The golden era of the 2000s produced no revenue stream other than ads on Google Search.

My engineer friend who work at Google would strongly disagree with this assertion. I keep hearing about all sorts of hijinks initiated by senior PMs and managers trying to build their fiefdoms.

Disagree with which part? The hijinks are there, no denying it. Kind of a thing at any company, but remedied by leaders above those PMs taking charge.

You bring up fumbles, but they still have the more products with more than a billion users than any company in the world.

This is what Google has always cared about. Bring application to the billions of users.

People are forgetting Google is the most profitable AI company in the world right now. All of their products use ML and AI.

So who is losing?

The goal of Gemini isn't to build a chatbot like ChatGPT despite Google having Bard.

The goal for Gemini is to integrate it into those 10 products they have with a billion users.

Sure, but I think Google's commanding marketshare is more at risk than it has been in a long time due to their fumbles in the AI space.

All of their products use ML and AI.

Is that supposed to be a vote of confidence for the current state of Google search?

This is like critiquing Disney for putting out garbage and then defending them because dummies keep giving them money regardless of quality. Having standards and expectations of greatness is a good thing and the last thing you want is for mediocrity to become acceptable in society.

People are forgetting Google is the most profitable AI company in the world right now. All of their products use ML and AI.

So who is losing?

The people who use their products, which are worse than they’ve been in decades? The people who make the content Google now displays without attribution on search results?

I say it again and again: sales, sales. Money is earned in enterprise domains.

And this business is so totally different to Google in every way imaginable.

Senior Managers love customer support, SLAs - Google loves automation. Two worlds collide.

Google Workspace works through resellers, they train less people, and those people give the customer support instead. IMO Google's bad reputation comes from their public customer support.

IMO Google's bad reputation comes from their public customer support.

Garbage in = garbage out.

If Google cannot deign to assign internal resources and staffing towards providing first-party support for paid products, it's not a good choice over the competition. You're not going to beat the incumbent (Office 365) by skimping on customer service.

If you want the kind of support that, when there is a fault with the product, can get the fault fixed - then unfortunately Google Workspace's support is also trash.

Good if you want someone else to google the error message for you though.

Google customer support says "Won't Fix [Skill Issue]"

Isn't it always easier to learn from others' mistakes?

Google has the problem that it's typically the first to encounter a problem, and it has the resources to approach it (from search), but the incentive to monetize it (to get away from depending entirely on search revenue). And, management.

I don't know if that really excuses Google in this case because it's a productization problem. Google never tried to release a ChatGPT competitor until after OpenAI had. OpenAI has been wildly successful as the first mover, despite having to blaze some new product trails. Even after months of watching them and with near-infinite resources, Google is still struggling to catch up.

Outside of outliers like gmail, Google didn’t get their success with product. The organization is set up for engineering to carry the day, funded by search.

An AI product that makes search irrelevant is an existential threat, but I don’t think Google has the product DNA to pull off a replacement product for search themselves. I heard Google has been taken over by more business / management types, but it is still missing product as a core pillar.

Considerng the number of messaging apps they tried to launch, if there's at least one thing that can be concluded, it's that it isn't easier to learn from their own mistakes.

Microsoft is eating their lunch.

Well, that is trully shocking.

Also hard to say it’s really true. OpenAI is certainly, is Microsoft without OpenAI’s tech eating Google’s lunch?

Given MSFT's level of investment in OpenAI, and all the benefits that accrue from it, they're one and the same.

It is yet to be seen if MSFT has actually gained a benefit. Maybe from marketing perspective it has insane potential to print big bucks, but it is a bit too soon to announce that the efforts to deliver Copilot (all tools+agents) far and wide was/is successful.

We'll get a definitive answer in a few years. Til then, OpenAI benefits from the $ value from their end of products, MSFT eats the compute costs, but also gets a stock bump.

Google doesn't know how to do anything else.

A product requires commitment, it requires grind. That 10% is the most critical one, and Google persistently refuses to push products across the finish line, just giving up on them and adding to the infamous Google Product Graveyard.

Honestly, what is the point? They could just maintain the core search/ads and not pay billions of dollars for tens of thousands of expensive engineers who have to go through a bullshit interview process and achieve nothing.

If they tried to focus on ads, then they wouldn’t have the talent to support the business. They probably don’t need 17 chat apps - but they can’t start saying no without having other problems.

They only hire some talent to prevent other companies to hire them.

It's a way to strangle the competition. But also not good for the industry in general.

I demo'd a full browser office suite in 1998 called Office Wherever (o-w.com). It used Java applets to do a lot of the more tricky functions.

Shopped it around VCs. Got laughed out of all the meetings. "Companies storing their documents on the Internet?! You're out of your mind!"

Some things are just too ahead of their times.

Globe dot com was basically Facebook, but the critical mass wasn't there. Nor were the smartphones.

Im curious if the code would be available somewhere? I have to admit I'm curious how it worked!

I was with you until the Tesla hot take. I'd bet dollars to donuts that Tesla doesn't get to level 4 by the end of the decade. Waymo is already there.

I agree, but I also bet Waymo doesn't exist by the end of the decade. Not just because it's Google but because it's hard to profit from.

I could see that in the coming years the value of Waymo for Google is not actually in collecting revenue from transportation fees but to collect multi modal data to feed into its models.

The amount of data that is collected by these cars is massive.

The difference seems to be the top leadership.

Nadella is an all time great CEO. Pichai is an uninspired MBA-type.

Nadella is as much of an MBA type as Pichai. Their education and career paths are incredibly similar.

The difference is Nadella is a good CEO and Pichai isn’t.

Part of it could also be a result of circumstance. Nadella came at a time when MS was foundering and he had to make what appeared to be fairly obvious decisions (pivot to cloud…he was literally picked because of this, and reducing dependence on Windows…which was an obvious necessary step for the pivot to cloud). Pichai OTOH was selected to run Google when it was already doing pretty well. His biggest mandate was likely to not upset the Apple cart.

If roles were reversed, I suspect Nadella would still have been more successful than Pichai, but you never know. I’d Nadella introduction to the CEO job was to keep things going as they were, and Pichai’s was to change the entire direction of the company, maybe a decade later Pichai would have been the aggressive decision maker whereas Nadella would have been the overly cautious guy making canned demos.

Google Docs created in 2006

Word and Excel have been dominant since the early 1980s. Google has never had a real shot in the space.

You mean 1990s? I don't think Word and Excel even existed until the late 80s, and nobody[0] used them until Windows 3.1.

[0] yes, not literally nobody. I know about the Windows 2.0 Excel or whatever, but the user base compared to WordPerfect or 1-2-3 was tiny up until MS was able to start driving them out by leveraging Windows in the early-mid 90s.

None of that matters. They'll still make heaps of profit long into the future unless someone beats them in Search or Ads.

AI is a threat there, but it'd require an AI company to transform the culture of Internet use to stop people 'Googling', and that will require two things: something significantly better than Google Search that's worth switching to, and a company that is willing to reject whatever offer Google makes to buy it. Neither is very likely.

I would love to see internal data on volume of search at google. Depending on the interpretation of them chatGPT can meet both of your requirements. Personally, I still search instead of chatGPT mostly, but I have seen other users chatGPT more and more.

Also "interesting" to see the if results being SEO spam generated using AI will keep seo search viable.

>Google Docs created in 2006! Microsoft is eating their lunch.

Of all the things, this.

I use both Google and Microsoft office products. One thing that strikes you is just how feature rich Microsoft products are.

Google doesn't look like is serious about making money.

I squarely blame rockstar product managers and OKRs for this. Not everything can be a 1000% profitable product built in the next quarter. A lot of things require small continuous improvement and care over years.

Microsoft’s killer product is Excel. I didn’t realize how powerful it was until I saw an expert use it. There are entire billion dollar organisations that would collapse without Excel.

I can tell you exactly why. It’s because they have a separate vp and org for all these different products like search, maps, etc. none of them talk to each other and they all compete for promotions. There is no one shot caller same thing with gcp. Google does not know products.

A lot of companies have this structure. You have the Doritos line, the Pepsi line for example etc… maybe you find some common synergies but it’s not unusual.

What would the ideal setup in your opinion?

Sundar Pichai should have been out of Google long ago.

While it is crazy, it's not too surprising. Google has become as notorious for product ineptitude as they have been for technical prowess. Dominating the fundamental research for GenAI but face planting on the resulting consumer products is right in line with the company that built Stadia, GMail/Inbox, and 17 different chat apps.

There is little to no inertia inside google to build and invent stuff. But there is a massive bloat to ship stuff.

I agreed until the last bit. Waymo is making continuous progress and is years ahead of everyone else. Tesla is not catching up and won't beat anyone. Tesla plateaued years ago and has no clue how to improve further. Their Partial Self Driving app has never been anywhere near reliable.

Google has been working on self driving longer than anyone. Tesla is catching up and will most likely beat them.

I agree with your general post but I disagree with this. Tesla's FSD is so far behind Google it's almost negligent on the part of Tesla despite having so much more data.

Errr sorry what’s the innovation of google docs exactly ? Being able to write simultaneously with somebody else? Ok, so this is what it takes for a top notch docs app to exist? Microsoft been developing this product for ages, Google tried to steal the show, although had little to no experience in producing and marketing office apps…

Besides collaborative reuniting is a no feature and there is much more important stuff than this for a word processor to be useful.

They are an ads company. Focus is never on "core" products.

Tesla will not beat them at self driving simply due to hardware at the very least

It's reassuring that the biggest tech company doesn't automatically make the best tech. If it were guaranteed that Google's resources would automatically trump any startup in the AI field, then it would likely predict a guaranteed dominance of incumbents and consolidation of power in the AI space.

Big companies are where innovation goes to die.

It's the curse of the golden goose.

They can't do anything that threatens their main income. They are tied to ads and ads technology, and can't do anything about it.

Microsoft had a crisis and that drives focus. Google... they probably mistreat their good employees if they don't work on ads.

I have used Swype texting since the t9 days.

If I demoed swype texting as it functions in my day to day life to someone used to a querty keyboard they would never adopt it

The rate at which it makes wrong assumptions about the word, or I have to fix it is probably 10% to 20% of the time

However because it’s so easy to fix this is not an issue and it doesn’t slow me down at all. So within the context of the different types of text Systems out there, I t’s the best thing going for me personally, but it takes some time to learn how to use it.

This is every product.

If you demonstrated to people how something will actually work after 100 hours of habituation and compensation for edge cases, nobody would ever adopt anything.

I’m not sure how to solve this because both are bad.

(Edit: I’m keeping all my typos as meta-comment on this given that I’m posting via swype on my phone :))

Its honestly pretty mind boggling that we’d even use querty on a smartphone. The entire point of the layout is to keep your fingers on the home row. Meanwhile people text with a single or two thumbs 100% of the time.

"The entire point of the layout is to keep your fingers on the home row."

No, that is how you're told to type. You have to be told to type that way precisely because QWERTY is not designed to keep your fingers on the home row. If you type in a layout that is designed to do that, you don't need to be told to keep your fingers on the home row, because you naturally will.

Nobody really knows what the designers were thinking, which I do not mean as sarcasm, I mean it straight. History lost that information. But whatever they were thinking that is clearly not it because it is plainly obvious just by looking at it how bad it is at that. Nobody trying to design a layout for "keeping your fingers on the home row" would leave hjkl(semicolon) under the resting position of the dominant hand for ~90% of the people.

This, perhaps in one of technical history's great ironies, makes it a fairly good keyboard for swype-like technologies! A keyboard layout like Dvorak that has "aoeui" all right next to each other and "dhtns" on the other would be constantly having trouble figuring out which one you meant between "hat" and "ten" to name just one example. "uio" on qwerty could probably stand a bit more separation, but "a" and "e" are generally far enough apart that at least for me they don't end up confused, and pushing the most common consonants towards the outer part of the keyboard rather than clustering them next to each other in the center (on the home row) helps them be distinguishable too. "fghjkl" is almost a probability dead zone, and the "asd" on the left are generally reasonably distinct even if you kinda miss one of them badly.

I don't know what an optimal swype keyboard would be, and there's probably still a good 10% gain to be made if someone tried to make one, but it wouldn't be enough to justify learning a new layout.

Hold up young one. The reason for QWERTYs design has absolutely not been lost to history yet.

The design was to spread out the hammers of the most frequently used letters to reduce the frequency of hammer jamming back when people actually used typewriters and not computers.

The problem it attempted to improve upon, and which is was pretty effective at, is just a problem that no longer exists.

I’m curious how this works because all the common letters seem to be next to each other on the left side of the keyboard

The original intent I do believe was not separating the hammers per se, but also helping the hands alternate, so they would naturally not jam as much.

However, I use a Dvorak layout and my hands feel like they alternate better on that due to the vowels being all on one hand. The letters are also in more sensical locations, at least for English writing.

It can get annoying when G and C are next to each other, and M and W, but most of the time I type faster on Dvorak than I ever did on Qwerty. It helps that I learned during a time where I used qwerty at work and Dvorak at home, so the mental switch only takes a few seconds now.

The design was to spread out the hammers of the most frequently used letters to reduce the frequency of hammer jamming

That's a folk myth that's mostly debunked.

https://www.smithsonianmag.com/arts-culture/fact-of-fiction-...

Also apocryphal: https://en.wikipedia.org/wiki/QWERTY#Contemporaneous_alterna...

And it does a bad job at it, which is further evidence that it was not the design consideration. People may not have been able to run a quick perl script over a few gigabytes of English text, but they would have gotten much closer if that was the desire. I don't believe that was their goal but they were just too stupid to get it even close to right.

Nobody really knows what the designers were thinking, which I do not mean as sarcasm, I mean it straight. History lost that information.

My understanding of QWERTY layout is that it was designed so that characters frequently used in succession should not be able to be typed in rapid succession, so that typewriter hammers had less chance of colliding. Or is this an urban myth?

My understanding (which is my recollections of a dive into typewriter history decades ago) is that avoiding typebar collisions was a real concern, but that the general consensus was that the exact final layout was strongly influenced by allowing salesmen to quickly type out 'typewriter' on the top row of letters.

I love this explanation.

You have to be taught to use the home row because the natural inclination for most people is to peck and hunt with their two index fingers. Watch how old people or young kids type. That being said staying on the home row is how you type fast and make the most of the layout. Everything is comfortably reachable for the most part unless you are a windows user ime.

If you learn a keyboard layout where the home row is actually the most common keys you use, you will not have to be encouraged to use the home row. You just will. I know, because I have, and I never "tried" to use the home row.

People don't hunt and peck after years of keyboard use because of the keyboard; they do it because of the keyboard layout.

If you want to prove I'm wrong, go learn Dvorak or Colemak and show me that once you're comfortable you still hunt and peck. You won't be, because it wouldn't even make sense. Or, less effort, find a hunt & peck Dvorak or Colemak user who is definitely at the "comfortable" phase.

People will actually avoid using their nondominant fingers like their ring or pinky. Its an issue with typing irrespective of layout. Its an issue with guitarplaying even. I am no pianist but I wouldn’t be surprised if new players have an aversion towards using those fingers as well.

I understand how dvorak is designed. I am still not convinced people will be using all their fingers especially their pinkys in a consistent manner that without learning that this is what you should work towards.

The reason we use qwerty on a smartphone is extremely straightforward: people tend to know where to look for the keys already, so it's easy to adopt to even though it's not "efficient". We know it better than we know the positions of letters in the alphabet. You can easily see the difference if you're ever presented with an onscreen keyboard that's in alphabetical order instead of qwerty (TVs do this a lot, for some reason, and it's a different physical input method but alpha order really does make you have to stop and hunt). It slows you down quite a bit.

That's definitely a good reason why, but perhaps if iOS or Android were to research what the best layout is for typical touch screen typing and release that as a new default, people would find it quite quick to learn a second layout and soon get just the benefits?

After all, with TVs I've had the same experience as you with the annoying alphabetical keyboard, but we type into they maybe a couple of times a year, or maybe once in 5 years, whereas if we changed our phone keyboard layout we'd likely get used to it quite quickly.

Even if not going so far as to push it as a new default for all users (I'm willing to accept the possibility that I'm speaking for myself as the kind of geeky person who wouldn't mind the initial inconvenience of a new kb layout if it meant saving time in the long run, and that maybe a large majority of people would just hate it too much to be willing to give it a chance), they could at least figure out what the best layout is (maybe this has been studied and decided already, by somebody?) and offer that as an option for us geeks.

Even most technically-minded people still use QWERTY on full-size computer keyboards despite it being a terrible layout for a number of reasons. I really doubt a new, nonstandard keyboard would get much if any traction on phones.

T9 was fine for typing and probably hundreds of millions of people used it.

Path dependency is the reason for this, and is the reason why a lot of things are the way they are. An early goal with smart phone keyboards was to take a tool that everyone already knew how to use, and port it over with as little friction as possible. If smart phones happened to be invented before external keyboards the layouts probably would have been quite different.

I use 8vim[0] from time to time, it's a good idea but needs a dictionary/autocompletion. You can get ok speeds after an hour of usage.

[0] https://f-droid.org/en/packages/inc.flide.vi8/

Does swype make editing easier somehow? iOS spellcheck has negative value. I turned it off years ago and it reduced errors but there are still typos to fix.

Unfortunately iOS text editing is also completely worthless. It forces strange selections and inserts edited text in awkward ways.

I’m a QWERTY texter but text entry on iOS is a complete disaster that has only gotten worse over time.

Hard disagree. I could type your whole comment without any typos completely blindly (except maybe "QWERTY" because uppercaps don't get autocorrected).

Apple autocorrect has a tendency to replace technical terms with similar words, eg. rvm turns into rum or ram or something.

It's even worse on the watch somehow. I take care to hit every key exactly, the correct word is there, I hit space, boom replaced with a completely different word. On the watch it seems to replace almost every word with bullshit, not just technical terms.

They've pretty much solved this with iOS 17. You can even use naughty words now, provided you use it for a day or so to have it get used to your vocabulary.

seems to replace almost every word with bullshit

Sort of related, it also doesn't let you cuss. It will insist on replacing fuck with pretty much anything else. I had to add fuck to the custom replacement dictionary so it would let me be. What language I choose to use is mine and mine alone, I don't want Nanny to clean it up.

Maybe my fingers are just too big but the watch for anything like texting is basically impossible for me to use.

I'm an iOS user and prefer the swipe input implementation in GBoard over the one in the native keyboard. I'm not sure what the differences are, but GBoard just seems to overall make fewer mistakes and do a better job correcting itself from context.

As I was reading Andrew's comment to myself, I was trying to figure out when and why I stopped using swype typing on my phone. Then it hit me – I stopped after I switched from Android to iOS a few years ago. Something about the iOS implementation just doesn't feel right.

Apple's version is shit. Period. That's why.

But you can install other keyboards like SwiftKey or Gboard which are closer to what you are used to on Android.

My only issue is that no keyboard implementation really supports more than two languages which makes me switch back to plain qwerty with autocomplete all the time.

Have you tried the native keyboard since iOS 17? It’s quite a lot better than older versions.

Showing a product in its best light is one thing. Demonstrating a mode of operation that doesn't exist is entirely another. It would be like if a demo of your swipe keyboard included telepathic mind control for correcting errors.

I’m not sure I’d agree that what they showed will never be possible and in fact my whole point is that I think Google can most likely deliver on that in this specific case. Chalk it up to my experience in the space, but from what I can see it looks like something Google can actually execute on (unlike many areas where they fail on product regularly).

I would agree completely that it’s not ready for consumers the way it was displayed, which is my point.

I do want to add that I believe that the right way to do these types of new product rollout is not with these giant public announcements.

In fact, I think generally speaking the “right” way to do something like this demonstrates only things that are possible robustly. However that’s not the market that Google lives in. They’re capitalists trying to make as much money as possible. I’m simply evaluating that what they’re showing I think is absolutely technically possible and I think Google can deliver it even if its not ready today.

Do I think it’s supremely ethical the way that they did it? No I don’t.

The voice interaction part didn't look a far cry from what we are doing with Dynamic Interaction at SoundHound. Because of this I assumed (like many it seems) that they had caught up.

And it's dangerous to assume they can just "deliver later". It's not that simple. If it is why not bake it in right now instead of committing fraud?

This is damaging to companies that walk the walk and then people have literally said to me "but what about that Gemini"? and dismiss our work.

I feel that more than you realize

That was basically what magic leap did to the whole AR development market. Everyone deep in it knew they couldn’t do it but they messed up so badly that it basically killed the entire industry

So let's not give big tech benefit of the doubt on this one. We have to call them out but even then the lie is already half way around the world...

I don't care what google could, in theory, deliver on some time in the future maybe. That's irrelevant. They are demonstrating something that can't be done with the product as they are selling it.

The insight here is that the speed of correction is a crucial component of the perceived long-term value of an interface technology.

It is the main reason that handwriting recognition did not displace keyboards. Once the handwriting is converted to text, it’s easier to fix errors with a pointer and keyboard. So after a few rounds of this most people start thinking: might as well just start with the pointer and keyboard and save some time.

So the question is, how easy is it to detect and correct errors in generative AI output? And the unfortunate answer is that unless you already know the answer you’re asking for, it can be very difficult to pick out the errors.

I think this is a good rebuttal.

Yeah the feedback loop with consumers has a higher likelihood of being detrimental, so even if the iteration rate is high, it’s potentially high cost at each step.

I think the current trend is to nerf the models or otherwise put bumpers on them so people can’t hurt themselves. That’s one approach that is brittle at best and someone with more risk tolerance (OpenAI) will exploit that risk gap.

It’s a contradiction then at best and depending on the level of unearned trust from the misleading marketing, will certainly lead to some really odd externalities

Think “man follows google maps directions into pond” but for vastly more things.

I really hated marketing before but yeah this really proves the warning I make in the AI addendum to my scarcity theory (in my bio).

I think you mean swipe. Swype was a brilliant third party keyboard app for Android which was better at text prediction and manual correction than Gboard is today. If however you really do still use Swype then please tell me how because I miss it.

Ha good point, and yes I agree Swype continues to be the best text input technology that I’ll never be able to use again. I guess I just committed genericide here but I meant the general “swiping” process at this point

You make a decent point, but you might underestimate how much this Gemini demo is faked[0].

In your Swype analogy, it would be as if Swype works by having to write out on a piece of paper the general goal of what you're trying to convey, then having to write each individual letter on a Post-it, only for you to then organize these Post-its in the correct order yourself.

This process would then be translated into a slick promo video of someone swiping away on their keyboard.

This is not a matter of “eh, it doesn't 100% work as smooth as advertised.”

0: https://techcrunch.com/2023/12/07/googles-best-gemini-demo-w...

I don't buy it. OpenAI did not have to do it with ChatGPT, and they always include a live demo when they release new products.

Maybe you can spice up a demo, but misleading to the point of implying things are generated when they're not (like the audio example) is pretty bad.

I know marketing is marketing, but it's bad form IMO to "demo" something in a manner totally detached from its actual manner of use. A swype keyboard takes practice to use, but the demos of that sort of input typically show it being used in a realistic way, even if the demo driver is an "expert".

This is the sort of demo that 1) gives people a misleading idea of what the product can actually do; and 2) ultimately contributes to the inevitable cynical backlash.

If the product is really great, people can see it in a realistic demo of its capabilities.

This is every product.

Except actual good ones, like ChatGPT or Gmail (by their time).

What is the latency is Swype? < 10ms? Not at all comparable to the video.

However because it’s so easy to fix this is not an issue and it doesn’t slow me down at all.

But that's a different issue than LLM hallucinations.

With Swype, you already know what the correct output looks like. If the output doesn't match what you wanted, you immediately understand and fix it.

When you ask an LLM a question, you don't necessarily know the right answer. If the output looks confident enough, people take it as the truth. Outside of experimenting and testing, people aren't using LLMs to ask questions for which they already know the correct answer.

I did this at university. It was our first comp sci class ever, we were given raspberry pi's. We had no coding experience or guidance, and were asked to create "something". All we had to work with was information on how to communicate with the pi using putty. Oddly, this assignment didn't require us to submit code, but simply demonstrate it working.

My group (3 of us) bought a moisture sensor to plug into the pi, and had the idea to make a "flood detection system" that would be housed under a bridge, and would send an email to relevant people when the bridge home from work is about to flood.

So for our demonstration, we had a guy in the back of the class with gmail open ready to send an email saying some variation of "flood warning". Our script was literally just printing lines with wait statements in between. Running the script, it prints to the screen "awaiting moisture", and after 3 seconds it will print "moisture detected". In that 3 seconds I dip the sensor into the glass of water. Then the script would wait a few more seconds before printing "sending email to xxx@yyy.com". We then opened up our email, our mate at the back of the room hit send, and an email appeared saying flood warning, and we would get full marks.

Is this cheating? It sounds like cheating and reflects quite poorly on you.

This can depend a lot on the context, which we don't have a lot of.

Looking at this a different way, they gave first-year students, likely with no established pre-requistites, an open-ended project with fixed hardware but no expectation to submit the final project for review. If they wanted to verify the students actually developed a working program, they could have easily asked for the Pi's to be returned along with the source code.

A project like this was likely intended to get the students to think about the "what" and not worry so much about the "how." Faking it entirely may have gone a bit further than intended, but would still meet the goal of getting the students to think about what they could do with this computer (if they knew how)

While university instructors can vastly underestimate student's creativity, they are, generally speaking, not stupid. At the very least, they know if you don't tell students to submit their work, you can often count on them doing as little as possible.

If they wanted to verify the students actually developed a working program, they could have easily asked for the Pi's to be returned along with the source code.

Wait, is your argument honestly "it's not cheating because they just trusted the students"?

There's a huge difference between demoing something as "this is what we did" vs "we didn't quite get there, but this is what we're envisioning."

Edit: You all are responding very weirdly. The cheating is because you're presenting "something" that is not that thing. Put a dog in a dress and call it a pretty woman and I'll call you a conman.

No, it’s not cheating because the ask was “something” not “some program”

Which is only not cheating if it was presented as not a program and a fellow project mate sending out an email.

In US colleges at least (only because that’s where I have personal experience…not because I believe standards are any higher or lower here), this is cheating if they led their professor to believe that it was indeed the raspberry pi sending out an email rather than someone at the back of the class.

You're assuming the objective was to develop a functioning program on the Pi.

But what if the Pi was only ever meant as a story-telling device to get the students thinking about the kinds of things computer programs can do?

Sure, some of students would be able tell a story by building a functioning program, but dvsfish simply found another way to tell theirs.

No, the argument is, "It's not cheating because it wasn't a programming assignment."

It was our first comp sci class ever, we were given raspberry pi's. We had no coding experience or guidance, and were asked to create "something".

Garbage in, garbage out.

Wow this is such an awful excuse.

Here’s a whole list of projects intended for kids.

https://all3dp.com/2/best-raspberry-pi-projects-for-kids/

It includes building out a whole weather station which includes a humidity sensor as one of the many things it can do.

Wow this is such an awful excuse.

yes for whomever organized such a curse and didn't give such guidance.

And besides curse asked for project to do something. It did. It printed lines. We can call the email gimmick, the marketeering strategy, making a turd look good.

Don't blame students for failure of whomever designed the curse.

Creativity is a good thing, sad to see trust abused this way.

More like cheaters in, cheaters out.

BS. The CEO of one of the largest public companies just did it and he is fine. Board and shareholders all happy.

I gleefully await Matt Levine’s article on AI demos as securities fraud.

If the teacher was competent, they would've asked to see the code.

The view up on that high horse must be interesting! Were you the kid who reminded teachers about homework?

Literally all that matters is that they passed.

Of course it's cheating

I'd call it cheating too but yeah. I like the pi and sensors though. Sounds like the start of something cool. Wish I could get a product like this to put in my roof to detect leaks. That would be useful.

It depends on what the intention of the assignment was. If it was primarily to help the students understand what these devices _could_ be used for, then it's fine. If it was to have them actually do it, well, then the professor should have at least tried to verify that. Given that it's for first-years who have no experience programming, it really could be either.

Kind of? Yes but they still demonstrated as much as was expected from them, which was very little to begin with.

It certainly reflects poorly on the institution for not requiring anything other than a dog and pony show for grading.

Related, I work with industrial control systems. We’d call this “smoke and mirrors”. Sometimes the client would insist on seeing a small portion of a large project working well before it was ready. They’d misunderstand that 90% of the bulk of the work is not visible to the user but they’d want to see a finished state.

We’d set up a dummy HMI and have someone pressing buttons on it for the demo, and someone in the next room manually driving outputs and inputs to make it seem like it was working. Very common.

To me it sounds like a useful thing, for communicating your vision and getting early feedback on whether you are building the right thing. Like uh, a live action blueprint. Let's say the client is like "Email???? That's no good. It needs to be a text message". Now you have saved yourself the trouble of implementing email.

To me it sounds like a useful thing

To me it sounds like lying...

Context matters a ton though. Are you presenting the demo as if the events are being automatically triggered (as in the OP) or are you presenting as this is your plan? Explicitly. If it is implicit, it's deceptive. If you explicitly do not say what parts are faked, it is lying. Of course in a magic show this is totally okay because you're going to the show with the explicit intention to be lied to, but I'm not convinced the same is true for business but I'm sure someone could make a compelling argument.

How about in the original iPhone keynote, when Steve says “this cable is just here for it to work on our projector” - if you follow closely, there was no phone demo’d that day with screen on and no cable. I’m sure the main board was still external.

No, the main board wasn't external at that time. The Wallaby were a different design.

There have literally been people sued over this because they used it to get funding; the most extreme example, where they kept doing it well beyond demo purposes, is Theranos.

And yet, you’ll still have people here acting like it’s totally fine.

As you said, it’s one thing to demonstrate a prototype, a “this is how we intend for it to work.” It’s a whole other thing to present it as the real deal.

I've done similar things by manually inserting and updating rows in the database etc to "demo" some process.

Like you say it can be useful as a way to uncover overall process or UX issues before all the internals are coded.

I'm quite open about what I'm doing though, as most of our clients are reasonable.

Definitely a useful thing to do to validate UI/UX/fitness.

Relying on blackbox components first mocked then later swapped (or kept for testing) with the real implementation is also a thing, especially when dealing with hardware. It's a bit hard to put moisture sensors on CI!

That said...

I would have recommended at least being open about it being a mockup, but even when doing so I've had customers telling me "so why can't you release this tomorrow since it's basically done!? why is there still two months worth of work, you're trying to rip us off!!!"

It definitely is a thing, we literally teach this to CS students as part of the design process in the UX design course.

It’s called prototyping, in this case it would be a hi-fi prototype, and it lets you communicate ideas, test the implementation and try out alternatives before committing to your final product.

Lo-fi prototyping usually precedes it and is done on pen and paper, Figma, or similar very basic approaches.

Google (unsurprisingly) uses it and has instructional videos about it: https://youtu.be/lusOgox4xMI

the phrase I've heard is "dog and pony show" lol

Or “Wizard of Oz” — as a standard HCI practice for understanding human experience

The industry term is “art of the possible”.

Well, you literally had a backend

more like backhand lol

do things that don't scale

I did this as well. I worked on a localized navigation system back when I was in a school, and unfortunately we broke all available GPS receivers in our hands over the course---that particular model of RS-232 GPS modules was really fragile. As a result we couldn't actually demonstrate a live navigation (and it was incomplete anyway). We proceeded to finish the GUI nevertheless, and then pretended that this is what you see during the navigation, but never actually ran the navigation code. It was an extracurricular activity and didn't affect GPA or anything, for the curious, but I remain kinda uneasy about it.

Sounds very familiar... UoM, UK ? :)

My version of this involved a Wii remote: freshmen-level CompSci class, and the group had to build a simple game in Python to be displayed at a showcase among the class. We wrote a space invaders clone. I found a Bluetooth driver that allowed your Wiimote to connect to your Mac as a game controller, so I set up a basic left/right tilt control using a Wiimote for our space invaders clone.

The Wiimote connection was the star of the show by a long shot :P

Learning those fraud skills needed later in the so-called "tech" industry.

A big red flag for me was that Sundar was prompting the model to report lots of facts that can be either true or false. We all saw the benchmark figures that they published and the results mostly showed marginal improvements. In other words, the issue of hallucination has not been solved. But the demo seemed to imply that it had. My conclusion was that they had mostly cherry picked instances in which the model happened to report correct or consistent information.

They oversold its capabilities, but it does still seem that multi-modal models are going to be a requirement for AI to converge on a consistent idea of what kinds of phenomena are truly likely to be observed across modalities. So it's a good step forward. Now if they can just show us convincingly that a given architecture is actually modeling causality.

These LLMs do not have a concept of factual correctness and are not trained/optimized as such. I find it laughable that people expect these things to act like quiz bots - this misunderstands the nature of a generative LLM entirely.

It simply spits out whatever output sequence it feels is most likely to occur after your input sequence. How it defines “most likely” is the subject of much research, but to optimize for factual correctness is a completely different endeavor. In certain cases (like coding problems) it can sound smart enough because for certain prompts, the approximate consensus of all available text on the internet is pretty much true and is unpolluted by garbage content from laypeople. It is also good at generating generic fluffy “content” although the value of this feature escapes me.

In the end the quality of the information it will get back to you is no better than the quality of a thorough google search.. it will just get you a more concise and well-formatted answer faster.

The first question I always ask myself in such cases: how much input data has a simple "I don't know" lines? This is clearly a concept (not knowing sth) that has to be learned in order to be expressed in the output.

What stops you from asking the same question multiple times, and seeing if the answers are consistent. I am sure the capital of France is always going to come out Paris, but the name of a river passing a small village might be hallucinated differently. Even better - use two different models, if they agree it's probably true. And probably the best - provide the data to the model in context, if you have a good source. Don't use the model as fact knowledge base, use RAG.

Can’t speak for other people but I find it more time consuming to get ChatGPT to correct its mistakes than to do the work myself.

What type of work? I'm really only interested in coding related help :)

because for certain prompts, the approximate consensus of all available text on the internet is pretty much true

I think you're slightly mischaracterising things here. It has potential to be at least slightly and possibly much better than that. This is evidenced by the fact it is much better than chance at answering "novel" questions that don't have a direct source in the training data. Why it can do it is because at a certain point, to solve the optimisation problem of "what word comes next" the least complex strategy actually becomes to start modeling principles of logic and facts connecting them. It is not in any systematic or reliable way so you can't ever guarantee when or how well it is going to apply these, but it is absolutely learning higher order patterns than simple text / pattern matching, and it is absolutely able to generalise these across topics.

It simply spits out whatever output sequence it feels is most likely to occur after your input sequence... but to optimize for factual correctness is a completely different endeavor

What if the input sequence says "the following is truth:", assuming it skillfully predicts following text, it would mean telling the most likely truth according to its training data.

unfortunately, this is the product they want to sold.

In the end the quality of the information it will get back to you is no better than the quality of a thorough google search.. it will just get you a more concise and well-formatted answer faster.

I would say it’s worse than Google search. Google tells you when it can’t find what you are looking for. LLMs “guess” a bullshit answer.

Ever since the "stochastic parrots" and "super-autocomplete" criticisms of LLMs, the question is whether hallucinations are solvable in principle at all. And if hallucinations are solvable, it would of such basic and fundamental scientific importance that I think would be another mini-breakthrough in AI.

An interesting perspective on this I’ve heard discussed is whether hallucinations ought to be solved at all, or whether they are core to the way human intelligence works as well, in the sense that that is what is needed to produce narratives.

I believe it is Hinton that prefers “confabulation” to “hallucination” because it’s more accurate. The example in the discussion about hallucination/confabulation was that of someone who had been present in the room during Nixon’s Watergate conversations. Interviewed about what he heard, he provided a narrative that got many facts wrong (who said what, and what exactly was said). Later, when audio tapes surfaced, the inaccuracies in his testimony became known. However, he had “confabulated truthfully”. That is, he had made up a narrative that fit his recall as best as he was able, and the gist of it was true.

Without the ability to confabulate, he would have been unable to tell his story.

(Incidentally, because I did not check the facts of what I just recounted, I just did the same thing…)

Without the ability to confabulate, he would have been unable to tell his story.

You can tell a story without making up fiction. Just say you don’t know when you don’t know.

Inaccurate information is worse than no information.

You can tell a story without making up fiction. Just say you don’t know when you don’t know.

The point is that humans can't in general, because we don't actually know which parts of what we "remember" are real and which parts are our brain filling in the blanks. And maybe it's the same for nonhuman intelligences too.

It’s hard even when you don’t take into consideration that you don’t know what you’ve misremembered. Try writing a memoir. You’ll realize you never actually remember what anyone says, but caveating your dialog with “and then I think they said something like” would make horrible reading.

Don’t people just bulk qualify the dialogue, e.g. “I don’t remember the exact words. Tom said … then Dick replied … something like that”.

Often we don’t quote people and instead provide a high level description of what they said, e.g. “Harry described the problems with his car.”, where detail is omitted.

If “confabulation” is necessary, you can use confabulation for the use cases where it’s needed and turn it off for the use cases where you need to return actual “correct” information.

The best one I've run across so far is, "spicy autocomplete".

The issue of hallucinations won't be solved with the RAG approach. It requires a fundamentally different architecture. These aren't my words but Yann LeCun's. You could easily understand if you spend some time playing around. The autoregressive nature won't allow the LLMs to create an internally consistent model before answering the question. We have approaches like Chain of Thought and others, but they are merely band-aids and superficially address the issue.

If you build a complex Chain if Thought style Agent and then train/finetune further by reinforcement learning with this architecture then it is not a band-aid anymore, it is an integral part of the model and the weights will optimize to make use of this CoT ability.

It's been 3.5 years since GPT-3 was released, and just over a year since ChatGPT was released to the public.

If it was possible to solve LLM hallucinations with simple Chain-of-Thought style agents, someone would have done that and released a product by now.

The fact that nobody has released such a product, is pretty strong evidence that you can't fix hallucinations via Chain-of-Thought or Retrieval-Augmented Generation, or any other band-aid approaches.

I agree: but I just wanted to say that there are specific subdomains where you can mitigate some of these issues.

For example, generating json.

You can explicitly follow a defined grammar to get what will always be a valid json output.

Similarly, structured output such as code can be passed to other tools such as compilers, type checkers and test suites to ensure that at a minimum the output you selected passes some minimum threshold of “isn’t total rubbish”.

For unstructured output this a much harder problem, and bluntly, it doesn’t seem like there’s any kind of meaningful solution to it.

…but the current generation of LLMs are driven by probabilistic sampling functions.

Over the probability curve you’ll always get some rubbish, but if you sample many times for structure and verifiable output you can, to a reasonable degree, mitigate the impact that hallucinations have.

Currently that’s computationally expensive, to drive the chance of error down to a useful level, but compute scales.

We may seem some quite reasonable outputs from similar architectures wrapped in validation frameworks in the future, I guess.

…for, a very specific subset of types of output.

I agree that the "forcing valid json output" is super cool.

But it's unrelated to the problem of LLM hallucinations. A hallucination that's been validated as correct json is still a hallucination.

And if your problem space is simple enough that you can validate the output of an LLM well enough to prove it's free of hallucinations, then your problem space doesn't need an LLM to solve it.

your problem space doesn’t need an LLM to solve it

Hmmm… kinda opinion right?

I’m saying; in specific situations, you can validate the output and aggregate solutions based on deterministic criteria to mitigate hallucinations.

You can use statistical methods (eg. There’s a project out there that generates tests and uses “on average tests pass” as a validation criteria) to reduce the chance of an output hallucination to probability threshold that you’re prepared to accept… for certain types of problems.

That the problem space is trivial or not … that’s your opinion, right?

It has no bearing on the correctness of what I said.

There’s no specific reason to expect that just like you can validate output against a grammar to require output that is structurally correct, you can’t validate output against some logical criteria (eg. unit tests) to require output that is logically correct against the specified criteria.

It’s not particularly controversial.

Maybe the output isn’t perfectly correct if you don’t have good verification steps for your task, maybe the effort required to build those validators is high, I’m just saying: it is possible.

I expect we’ll see more of this; for example, this article about decision trees —> https://www.understandingai.org/p/how-to-think-about-the-ope..., requires no specific change in the architecture.

It’s just using validators or search the solution space.

i think this was demonstrated in that mark rober promo video[1] where he asked why the paper airplane stalled by blatantly leading the witness.

"do you believe that a pocket of hot air would lead to lower air pressure causing my plane to stall?"

he could barely even phrase the question correctly because it was so awkward. just embarrassing.

[1] https://www.youtube.com/watch?v=mHZSrtl4zX0&t=277s

This has got to be satire! That is too funny.

Yeah, this was so obvious too. Clearly Mark Rober tried to ask it what to try and got stupid answers, then tried to give it clues and had to get really specific before he got a usable answer.

I mean it's a demo. Isn't this kinda what they all do

The whole Gemini webpage and contents felt weird to me, it's in the uncanny valley of trying to look and feel like an Apple marketing piece. The hyperbolic language, surgically precise ethnic/gender diversity, unnecessary animations and the sales pitch from the CEO felt like a small player in the field trying to pass as a big one.

surgically precise ethnic/gender diversity

What does that mean and why is it bad?

Diversity in marketing is used because, well, your desired market is diverse.

I don't know what it means for it to be surgically precise, though.

I imagine the commenter was calling out what they perceived to be an inauthentic yet carefully planned facade of diversity. This marketing trend rubs me the wrong way as well, because it reminds me of how I was raised and educated as a 90s kid to believe that racism was a thing of the past. That turned out to be a damaging lie.

I don't mean to imply that companies should avoid displays of diversity, I just mean that it's obvious when it's inauthentic. Virtue signaling in exchange for business is not progress.

You'd prefer the alternative with just a few white guys in the picture and no consideration given at all to appearing diverse?

Just take a group of people that actually know and work together and you're authentic. Forced diversity is idiotic: either you do it or you don't, but you show what you're doing to be authentic.

Imagine how cringe it would be if only white guys were allowed to work at Google and they displayed in all their marketing a fully diverse group of non-white girls. That would be... inauthentic.

Just the fact girls are less than guys in IT is something we should demonstrate, understand, change if needed. Not hide behind a facade of 50/50 display everywhere as if the problem was already solved or that it was even a problem in the first place.

The alternative is to just be authentic and not put up a fake show.

https://en.wikipedia.org/wiki/Potemkin_village

But for diversity.

It think it could be a seen as a good thing, it's a little chicken and egg. If you want to increase diversity at a company, one good way would be to represent diversity in your keynotes in order to make it look to a diverse base that they would be happy working there, thus increasing the diversity at the company.

Agreed with your comment. This is every marketing department on the planet right now, and it's not a bad thing IMO. Can feel a bit forced at times, but it's better than the alternative.

The alternative being showing actual level of diversity in the company?

It's new token black guy. It's not completely bad, just feels inauthentic.

It's bad if the makeup of the company doesn't reflect the diversity seen in the marketing, because it doesn't reflect any genuine value and is just for show.

Now, I don't know how diverse the AI workforce is at Google, but the YT thumbnails show precisely 50% of white men. Maybe that's what the parent meant by "surgically precise".

Of course to normal people, this just seems like another Google keynote. If OP is counting the number of white people, maybe they're the weird one here.

It's funny because now the OpenAI keynote feels like it's emulating the Google keynotes from 5 years ago.

Google Keynote feels like it's emulating the Apple keynote from 5 years ago.

And the Apple keynote looks like robots just out of an uncanny valley pretending to be humans - just like keynotes might look in 5 years, but actually made by AI. Apple is always ahead of the curve in keynote trends.

You know those memes where AI keeps escalating a theme to more extreme levels with each request?

That's what Apple keynotes feel like now. It seems like each year, they're trying to make their presentations even more essentially 'Apple.' They crossed the uncanny valley a long time ago.

"make it feel more like a hospital"

To me it feels more like a cult. Wear this kind of shoes and clothing. Make these hand gestures, talk that way. They look and sound fake, over processed and living in their own bubble detached from the rest of the world.

I’m what many would describe as a bit of an Apple fanboy but for the last few years I’ve been skipping most keynotes. They are becoming pretty unbearable.

I used to look forward to watching them live but now I just go back after the event and skip through the videos to see the most relevant bits (or just read the various reports instead).

I hadn’t thought about it until just now, but the most recent Apple events really are the closest real-person thing I’ve ever seen to some of the “good” computer generated photorealistic (kinda…) humans “reading” with text-to-speech that I’ve seen.

It’s the stillness between “beats” that does it, I think, and the very-constrained and repetitive motion.

Is there such a concept as a “reverse uncanny valley”??

Where humans behave so awkwardly that they seem artificial but are just not quite close enough…

If so, Apple have totally nailed the reverse uncanny valley!

Hmm. Like the "NPC fetish" stuff that was going around for a brief minute?

The more I think about this the more it rings true...

I’m imagining the project managers are patting themselves on the back for checking all the performative boxes, blind to the absolute satire of it all.

I got the same vibes. Ultra and Pro. It feels tacky that it declares the "Gemini era" before it's even available. Google really want to be seen as level on the playing field.

I was fooled. The model release announcement said it could accept video and audio multi-modal input. I understood that there was a lot of editing and cutting, but I really believed I was looking at an example of video and audio input. I was completely impressed since it’s quite a leap to go from text and still images to “eyes and ears.” There’s even the segment where instruments are drown and music was generated. I thought I was looking at a model that could generate music based on language prompts, as we have seen specialized models do.

This was all fake. You are taking a collection of cherry picked prompt engineered examples, then dramatizing them for maximum shareholder hype. The music example was just outputting a description of a song, not the generated music we heard in the video.

It’s one thing to release a hype video with what-ifs and quite another to claim that your new multi-modal model is king of the hill then game all the benchmarks and fake all the demos.

Google seems to be in an evil phase. OpenAI and MS must be quite pleased with themselves.

Do you believe everything verbatim that companies tell you in advertising?

If they show a car driving I believe it's capable of self-propulsion and not just rolling downhill.

A marketing trick that has, in fact, been tried: https://arstechnica.com/cars/2020/09/nikola-admits-prototype...

Used to be "marketing tricks" were prosecuted as fraud.

still is. Nikola's CEO, Trevor Milton, was convicted of fraud and is awaiting sentencing.

Oh. Good to hear. Thank you.

If I recall correctly, that led to literal criminal fraud charges.

And iirc Tesla is also being investigated for fraudulent claims for faking the safety of their self driving cars.

Hmm, might I interest you in a video of an electric semi-truck?

No, but most people tend to make a mental note of which companies tend to deliver and which ones work hard to mislead them.

You do understand the concept of reputation, right?

When a company invents tech that can do this, how would their ad be different?

this was plausible

This kind of moral fraud - unethical behavior - is tolerated for some reason. It's almost like investors want to be fooled. There is no room for due diligence. They squeel like excited Taylor Swift fans as they are being lied to.

As long as you’re not the last one out, “being fooled” can be very profitable

This shouldn't be a surprise. Companies optimize for what benefits shareholders. Or if there's an agency conflict of interest, companies optimize for what benefits managements' career and bonuses (perhaps at the expense of shareholders). Companies pay lip service to external stakeholders, but really that's a ploy to reduce attention and the risk of regulation, there is no fundamental incentive to treat all stakeholders well.

If lying helps, which can happen if there aren't large legal costs or social repercussions on brand equity, or if the lie goes undetected, then they'll lie. This is what we necessarily get from the upstream incentives. Fortunately, lying in a marketing video is fairly low on the list of ethical violations that have happened in the recent past.

We've effectively got a governance alignment problem that we've been trying to solve with regulations, taxes and social norms. How can you structure guardrails in the form of an incentive system to align companies with ethical outcomes? That's the question and it's a difficult one. This question also applies to any form of human organization, including governments.

Seems reminiscent of a video where the lead research department within Google is an animation studio (wish I could remember more about that video)

Doing all these hype videos just for the sake of satisfying shareholders or whatever is just making me loose trust in their research division. I don't think they did anything like this when they released Bert.

I agree completely. When alphazero was announced I remember feeling like shocked over how they stated this revolutionary breakthrough as if it was like a regular thing. Alphafold and Alphacode are also impressive but this one just sounds like it was forced from Sundar and not the usual deepmind

Exactly. Personally I’m fine with both:

1) Forward looking demoes that demonstrate the future of your product, where it’s clear that you’re not there yet but working in that direction

2) Demoes that show off current capabilities, but are scripted and edited to do so in the best light possible.

Both of those are standard practice and acceptable. What Google did was just wrong. They deserve to face backlash for this.

I, too, was fooled to think Gemini has seen and heard through a video/audio feed instead of showing still images and prompting though text. While it might seem not much difference between still images and a video feed, in fact it requires a lot of (changing) context understanding to not make the bot babbling like an idiot all the time. It also requires the bot to recognize the “I don’t know it yet” state to keep appropriate silence in a conversation with live video feed, which is notoriously difficult with generative AI. Certainly one can do some hacking, build in some heuristics to make it easier, but to make a bot seems like a human partner in a conversion is indeed very hard. And that has been the most impressive aspect of the showed “conversations”, which are unfortunately all faked :(

Well put. I’m not touching anything Google does any more. They’re far too dishonest. This failed attempt at a release (which turns out was all sizzle and no steak) only underscored how far behind OpenAI they actually are. I’d love to have been a fly on the wall in the OAI offices when this demo video went live.

I too thought it was able to accept video.

Given the massive data volume in videos, I assumed it processed video into pictures by extracting a frame per second or something along those lines, while still taking the entire video as the initial input.

Turns out, it wasn't even doing that!

"phase"?

My friend, all these large corporations are going to get away with exactly as much as they can, for as long as they can. You're implying there's nothing to do but wait until they grace us with a "not evil phase", when in reality we need to be working on restoring our anti-monopoly regulation that was systematically torn down over the last 30 years.

The video itself and the video description give a disclaimer to this effect. Agreed that some will walk away with an incorrect view of how Gemini functions, though.

Hopefully realtime interaction will be part of an app soon. Doesn’t seem like there would be too many technical hurdles there.

People don't really pay attention to disclaimers. Google made a choice knowing people would remember the hype, not the disclaimer.

I remember watching it and I was pretty impressed, but as I was walking around thinking to myself I came to the conclusion that there was something fishy about the demo. I didn't know exactly what they fudged, but it was far too polished to explain how well their current AI demos preform.

I'm not saying there have been no improvements in AI. There is and this includes Google. But the reason why ChatGPT has really taken over the world is that the demo is in your own hands and it does quite well there.

Indeed, and this is how Google used to be as a company. I remember when Google Maps & Earth launched, and how they felt like world-changing technology. I'm sure they're doing lots of innovative science and development still, but it's and advertising/services company now, and one that increasingly talks down to its users. Disappointing considering their early sense of mission.

Thinking back to the firm's early days, it strikes me that some HN users and perhaps even some Googlers have no memory of a time before Google Maps and simply can't imagine how disruptive and innovative things like that were at the time. Being able to browse satellite imagery for the whole world was something previously confined to the upper echelons of the military-industrial complex.

That's one reason I wish the firm (along with several other tech giants) were broken up; it's full of talented innovative people, but the advertising economics at the core of their business model warp everything else.

    :%s/Google/the team
    :%s/people/the promotion board

Conway's law applied to the corporate-public interface :)

The disclaimer in the description is "For the purposes of this demo, latency has been reduced and Gemini outputs have been shortened for brevity."

That's different from "Gemini was shown selected still images and not video".

What I found impressive about it was the voice, the fast real-time response to video, and the succinct responses. So apparently all of that was fake. You got me, Google.

performance and cost are hurdles?

It can be realtime while still having more latency than depicted in the video (and the video clearly stated that Gemini does not respond that quickly).

A local model could send relevant still images from the camera feed to Gemini, along with the text transcript of the user’s speech. Then Gemini’s output could be read aloud with text-to-speech. Seems doable within the present cost and performance constraints.

The entirety of the disclaimer is "sequences shortened throughout", in tiny text at the bottom for two seconds.

They do disclose most of the details elsewhere, but the video itself is produced and edited in such a way that it's extremely misleading. They really want you to think that it's responding in complex ways to simple voice prompts and a video feed, and it's just not.

Yea, of all the edits in the video, the editing for timing is the least of concern. My gripe is that the prompting was different and in order to get that information you have to watch the video only on YouTube, expand the description and click on a link to a different blog article. Linking a "making of" video where they show this and interview some of the minds behind Gemini would have been better PR.

Yeah, and ads on Google search have the teeniest, tiniest little "ad" chip on them, a long progression of making ads more in-your-face and less well-distinguished.

In my estimation, given the context around AI-generated content and general fakery, this video was deceptive. The only impressive thing about the video (to me) was how snappy and fluid it seemed to be, presumably processing video in real time. None of that was real. It's borderline fraudulent.

No. The disclaimer was not nearly enough.

The video fooled many people, including myself. This was not your typical super optimized and scripted demo.

This was blatant false advertising. Showing capabilities that do not exist. It’s shameful behavior from Google, to be perfectly honest.

The difference between “Hey, figure out a game based on what you see right now” vs “here is a description of a game with the only too possible outcomes as examples” cannot be explained by the disclaimer.

If there weren't serious technical hurdles they wouldn't have faked it.

They were just parroting this video on CNBC without any disclaimers, so the viewers who don't happen to also read hacker news will likely form a different opinion than those of us who do.

I suppose this is a great example of how trust in authentic videos, audio, images, company marketing must be questioned and, until verified, assumed to be 'generated'.

I am curious, if the voice, email, chat, and shortly video can all be entirely generated in real or near real time, how can we be sure that remote employee is actually not a full or partially generated entity?

Shared secrets are great when verifying but when the bodies are fully remote - what is the solution?

I am traveling at the moment. How can my family validate that it is ME claiming lost luggage and requesting a Venmo request?

Ask for information that only the actual person would know.

That will only work once if the channels are monitored.

You only know one piece of information about your family? I feel like I could reference many childhood facts or random things that happened years ago in social situations.

Make up a code phrase/word for emergencies, share it with your family, then use it for these types of situations.

Fair, but that also assumes the recipients ("family") are in a mindset of constantly thinking about the threat model in this type of situation and will actually insist on hearing the passphrase.

This will only work once.

I am traveling at the moment. How can my family validate that it is ME claiming lost luggage and requesting a Venmo request?

PGP

Now you have two problems.

(I say this in jest, as a PGP user)

If you can't verify whether your employee is AI, then you fire them and replace them with AI.

The question is if an attacker tells you they lost access can you please reset some credential, and your security process is getting on a video call because you're a fully remote company let's say.

I think it's also why we as a community should speak out when we catch them for doing this as they are discrediting tech demos. It won't be enough because a lie will be around the world before the truth gets out the starting gates but we can't just let this go unchecked.

At this point, probably a handwritten letter. Back to the 20th century we go.

How is this not false advertising?

I suppose it's not false advertising, since they don't even claim to have a product released yet that can do this, since Trojans Ultra won't be available until an unspecified time next year

You're right, it's astroturfing a placeholder in the market in the absence of product. The difference is probably just the target audience - feels like this one is more aimed at share-holders and internal politics.

It's still false advertising.

This is common in all industries. Take gaming, for example. Game publishers love this kind of publicity, as it creates hype, which leads to sales. There have been numerous examples of this over the years: Watch Dogs, No Man's Sky, Cyberpunk 2077, etc. There's a period of controversy once consumers realize they've been duped, the company releases some fake apology and promises or doubles down, but they still walk out of it richer, and ready to do it again next time.

It's absolutely insidious, and should be heavily fined and regulated.

possibly securities fraud though. Their stock popped a few percent on the back of that faked demo.

It's a software demo. If you ever gave an honest demo, you gave a bad demo. If you ever saw a good and honest demo, you were fooled.

I prefer to let my software be good enough to let it speak for itself without resorting to fraud, thank you ver much.

https://www.youtube.com/watch?v=OPUq31JZFsA

As a programmer, I'd say that all the demos of my code were honest and representative of what my code was doing.

But I recognize we're all different programmers in different circumstances. But at a minimum, I'd like to be honest with my work. My bosses seem to agree with me and I've never been pressured into hosting a fake demo or lie about the features.

In most cases, demos are needed because there's that dogfood problem. Its just not possible for me to know how my (prospective) customers will use my code. So I need to show off what has been coded, my progress, and my intentions for the feature set. In response, the (prospective) customer may walk away, they may have some comments that increases the odds of adoption, or they think its cool and amazing and take it on the spot. We can go back and forth with regards to feature changes or what is possible, but that's how things should work.

------------

I've done a few "I could do it like this" demos, where everyone in the room knew that I didn't finish the code yet and its just me projecting into the future of how code would work and/or how it'd be used. But everyone knew the code wasn't done yet (despite that, I've always delivered on what I've promised).

There is a degree of professional ethics I'd expect from my peers. Hosting honest demos is one of them, especially with technical audience members.

Or worse, fraud to make their stock go up

edit: s/stuck/stock

Why did you have to mention your edit?

Yes, that was obvious as soon as I saw it wasn’t live I clicked off. You can train any LLM to perform a certain task(s) well and google engineers are not that dense. This was obvious marketing PR as open AI has completely made google basically obsolete with 90% of my queries can be answered without wading through LLM generated text for a simple answer.

without wading through LLM generated text

...OpenAI solved this by generating LLM text for you to wade through?

No. It solved it by (most of the time) giving the OP and I the answer to our queries, without us needing to wade through spammy SERP links.

How do you tell a plausible wrong answer from a real one?

By testing the code it returns (I mostly use it as a coding assistant) to see if it works. 95% of the time it does.

For technical questions, ChatGPT has almost completely replaced Google & Stack Overflow for me.

In my experience, testing code in a way that ensures that it works is often harder and takes more time than writing it.

If LLMs can replace 90% of your queries, then you have very different search patterns from me. When I search on Kagi, much of the time I’m looking for the website of a business, a public figure’s social media page, a restaurant’s hours of operation, a software library’s official documentation, etc.

LLMs have been very useful, but regular search is still a big part of everyday life for me.

GPT4 search is a very good experience.

Though because you don’t see the answers it doesn’t show you, it’s hard to really validate the quality, so I’m still wary, but when I look for specific stuff it tends to find it.

Good, that video was mostly annoying and creepy. The AI responses as shown in the linked Google dev blogpost are a lot more reasonable and helpful. BTW I agree that the way the original video was made seems quite misleading in retrospect. But that's also par for the course for AI "demos", it's an enduring tradition in that field and part of its history. You really have to look at production systems and ignore "demos" and pointless proofs of concept.

The GPT-4 demo early this year when it was released was a lot less.. fake, and in fact very much indicative of it's feature set. The same is true for what OpenAI showed during their dev days, so at the very least those demos don't have too much fakery going on, as far as I could tell.

A certain minimum level of jank always makes demos more believable. Watching Brockman wade through Discord during the napkin-to-website demo immediately made the whole thing convincing.

AI is in the "hold it together with hope and duct tape" phase, and marketing videos claiming otherwise are easy to spot and debunk.

What the Quack? I found it tasty as pâté.

You really have to look at production systems and ignore "demos" and pointless proofs of concept.

While I agree, I wouldn't call proofs or concepts and demos pointless. They often illustrate a goal or target functionality you're working towards. In some cases it's really just a matter of allotting some time and resources to go from a concept to a product, no real engineering is needed, it all exists, but there's capital needed to get there.

Meanwhile some proof of concepts skip steps and show higher level function that needs some serious breakthrough work to get to, maybe multiple steps of that. Even this is useful because it illustrates a vision that may be possible so people can understand and internalize things you're trying to do or the real potential impact of something. That wasn't done here, it was embedded in a side note. That information needs to be before the demo to some degree without throwing a wet blanket on everything and needs to be in the same medium as the demo itself so it's very clear what you're seeing.

I have no problem with any of that. I have a lot of problems when people don't make it explicitly clear beforehand that it's a demo and explain earnestly what's needed. Is it really something that exists today in working systems someone just needs to invest money and wire it up without new research needed? Or is it missing some breakthroughs, how many/what are they, how long have these things been pursued, how many people are working on them... what does recent progress look like and so on (in a nice summarized fashion).

Any demo/poc should come up front with an earnest general feasibility assessment. When a breakthrough or two are needed then that should skyrocket. If it's just a lot of expensive engineering then that's also a challenge but tractable.

I've given a lot of scientific tech demonstrations over the years and the businesses behind me obviously want me to be as vague as possible to pull money in. I of course have some of those same incentives (I need to eat and pay my mortgage like everyone else). None-the-less the draw of science to me has always been pulling the veil from deception and mystery and I'm a firm believer in being as upfront as possible. If you don't lead with disclaimers, imaginations run wild into what can be done today. Adding disclaimers helps imaginations run wild about what can be done tomorrow, which I think is great.

It’s not live, but it’s in the realm of outputs I would expect from a GPT trained on video embeddings.

Implying they’ve solved single token latency, however, is very distasteful.

OP says that Gemini had still images as input, not video - and the dev blog post shows it was instructed to reply to each input in relevant terms. Needless to say, that's quite different from what's implied in the demo, and at least theoretically is already within GPT's abilities.

How do you think the cup demo works? Lots of still images?

A few hand-picked images (search for "cup shuffling"): https://developers.googleblog.com/2023/12/how-its-made-gemin...

Holy crap that demo is misleading. Thanks for the link.

OK I get that everyone’s hype sensitive and I absolutely remain to be convinced on Gemini’s actual ability

BUT

The fact this wasn’t realtime or with voice is not the issue. Voice to text could absolutely capture this conversation easily. And from what I’ve seen Gemini seems quicker than GPT4

Being able to respond quicker and via voice chat is not actually a big deal.

The underlying performance of the model is what we should be focussing on

Exactly.

Corporate tech demo exaggerates actual capabilities and smoothes over rough edges? Impossible, this is unprecedented!!

The Apple vs Google brand war is so tiresome. Let's focus on the tech.

To be clear - I’m not saying this makes Gemini good. Just that it isn’t bad for these reasons!

Even if you'd be inclined to shrug off the fact that this wasn't real-time voice and video-based, which you shouldn't because the underlying implications would be huge for performance, there's still the matter that the prompts used. The prompts shown are miles apart and significantly misrepresent the performance of the underlying model.

It goes from a model being able to infer a great deal at the level of human intelligence to a model that needs to be fed essential details, and that doesn't do much inferring.

I get the feeling that many here on HN who are just shrugging it off don't realize how much of the “demo” was faked. Here’s a decent article that goes into it some more: https://techcrunch.com/2023/12/07/googles-best-gemini-demo-w...

I disagree.

One of the major issues in LLMs is the economics; a lot of people suspect ChatGPT loses money on every user, or at least every heavy user, because they've got a big model and A100 GPUs are expensive and in short supply.

They're kinda reluctant to have customers, with API rate limits galore, and I've heard people claiming ChatGPT has lost the performance crown having switched to a cheaper-to-run model.

If google had a model that operated on video in realtime, that would imply they've got a model that performs well, and is also very fast or that their 'TPUs' outperform the A100 quite a bit, either of which would be a big step forward.

it was obviously marketing material, but if this tweet is right, then it was just blatant false advertising.

Google always does fake advertising. “Unlimited” google drive accounts for example. They just have such a beastly legal team no one is going to challenge them on anything like that.

What was fake about unlimited google drive? There were some people using petabytes.

The eventual removal of that tier and anything even close speaks to Google's general issues with cancelling services, but that doesn't mean it was less real while it existed.

What about when gmail was released and the storage was advertised as increasing forever, but at first they just increased it infinitesimally slower and then stopped increasing it all.

Oh, long before google drive existed?

I don't remember the "increasing forever" ever being particularly fast. I found some results from 2007 and 2012 both saying it was 4 bytes per second, <130MB per year.

So it's true that the number hasn't increased in ten years, but that last increase was +5GB all by itself. They've done a reasonable job of keeping up.

Arguably they should have kept adding a gigabyte each year, based on the intermittent boosts they were giving, but by that metric they're only about 5GB behind.

Does it matter at all with regards to its AI capabilities though?

The video has a disclaimer that it was edited for latency.

And good speech-to-text and text-to-speech already exists, so building that part is trivial. There's no deception.

So then it seems like somebody is pressing a button to submit stills from a video feed, rather than live video. It's still just as useful.

My main question then is about the cup game, because that absolutely requires video. Does that mean the model takes short video inputs as well? I'm assuming so, and that it generates audio outputs for the music sections as well. If those things are not real, then I think there's a problem here. The Bloomberg article doesn't mention those, though.

I'm ok with "edited for latency" or "only showing the golden path".

But the most impressive part of the demo, was the way the LLM just seemed to know when to jump in with a response. It appeared to be able to wait until the user had finished the drawing, or even jumping in slightly before the drawing finished. At one point the LLM was halfway though a response and then saw the user was now colouring the duck in blue, and started talking about how the duck appearing to be blue.

The LLM also appeared to know when a response wasn't needed because the user was just agreeing with the LLM.

I'm not sure how many people noticed that on a conscious level, but I positive everyone noticed it subconsciously, and felt the interaction was much more natural.

As you said, good speed-to-text and speech-to-text has already been done, along with multi-model image/video/audio LLMs and image/music generation. The only novel thing google appeared to be demonstrating and what was most impressive was this apparent natural interaction. But that part was all fake.

Even your skeptical take doesn't fully show how faked this was.

The video has a disclaimer that it was edited for latency.

There was no disclaimer that the prompts were different from what's shown.

And good speech-to-text and text-to-speech already exists, so building that part is trivial. There's no deception.

Look at how many people thought it can react to voice in real-time - the net result is that a lot of people (maybe most?) were deceived. And the text prompts were actually longer and more specific than what was said in the video!

somebody is pressing a button to submit stills from a video feed, rather than live video.

Somebody hand-picked images to convey exactly the right amount of information to Gemini.

Does that mean the model takes short video inputs as well? I'm assuming so

It was given a hand-picked series of still images with the hands still on the cups so that it was easier to understand what cup moved where.

Source for the above: https://developers.googleblog.com/2023/12/how-its-made-gemin...

Audio input that's not text in the middle and video input are two things they made a big deal out of. Then they called it a hands on demo and it was faked.

My main question then is about the cup game, because that absolutely requires video.

They did it with carefully timed images, and provided a few examples first.

I'm assuming so, and that it generates audio outputs for the music sections as well

No, it was given the ability to search for music and so it was just generating search terms.

Here's more details:

https://developers.googleblog.com/2023/12/how-its-made-gemin...

Gemini demo looks like ChatGPT with a video feed, except it doesn't exist, like ChatGPT. I have ChatGPT on my phone right now, and it works (and it can process images, audio, and audio feed in). This means Google has shown nothing of substance. In my world, it's a classic stock price manipulation move.

Gemini Pro is available on Bard now.

Ultra is not yet available.

Yeah and have you tried it? It’s as dogshit as the original Bard.

I've been using Gemini in Bard since the launch, with respect to coding it is outperforming GPT4 in my opinion. There is some convergence in the answers,but Bard is outputting really good code now.

link to duck video: https://www.youtube.com/watch?v=UIZAiXYceBI

Thanks. This is the first I'm hearing of a duck demo, and couldn't figure out what it was.

Timestamp for the duck demo: https://youtu.be/UIZAiXYceBI?si=pNT74PXjyDataF1T&t=246

Is the link to the article broken and anyone has it archived somewhere?

I wish people stop posting Twitter messages to HN and provide a link directly to the original article. What's next, post on HN on an Instagram post?

I dont get it too, my browser was loading for good 15 seconds and made 141 requests fetching almost 9 MB of resources to show me exactly same content as provided in OpenGraph tags and a freaking redirect to a Bloomberg link. Feels like a slap in the face to open such phishing link at any time, just a useless redirect with nine million bytes of overhead.

And another: https://techcrunch.com/2023/12/07/googles-best-gemini-demo-w...

The bloomberg article seems to have been taken down and is now going to 404. https://www.bloomberg.com/opinion/articles/2023-12-07/google...

Just an error in the link, here's the corrected version: https://www.bloomberg.com/opinion/articles/2023-12-07/google...

and here's a readable version: https://archive.ph/ABhZi

I don’t understand why is Gemini even considered “jaw-dropping” to begin with. GPT-4V has set the bar so high that all their demos and presentations paled in comparison. And it’s available for anyone to use. People have already build mind-blowing demos with it (like https://arstechnica.com/information-technology/2023/11/ai-po...).

The entire launch felt like a concentrated effort to “appear” competitive to OpenAI. Google was splitting hairs talking about low single digit percentage improvement in benchmarks. Against a model that has been out for over 6 months.

I have never been so unimpressed with them. Not only has OpenAI managed to snag this one from under Google’s nose, IMO - they seem to be defending their lead quite well. Now that is something unmistakably remarkable. Color me impressed!

Some other commenter, a former googler, a while back alluded to figuring out the big secret and being thrown for a tizzy at the resulting cognitive dissonance they realize they’ve been buying into. Its never about making a good product. Its about keeping up with the joneses in the eyes of tech investors. And just look at the movement on the stock today as a result of this probable lemon of a product: nothing else mattered except keeping up appearances. CEOs make historic careers optimizing companies for appearances over function like this.

I get that. But I don’t get why journalists who cover the industry and are expected to get it would call this “jaw dropping”. Maybe I am reading too much into it. It was likely added to increase the shock factor.

I watched this video, impressed, and thought: what if it’s fake. But then dismissed the thought because it would come out and the damage wouldn’t be worth it. I was wrong.

The worst part is that there won't be any damage. They'll release a blog post with PR apologies, but the publicity they got from this stunt will push up their brand in mainstream AI conversations regardless.

"There's no such thing as bad publicity."

There’s no such thing as bad publicity only applies to people and companies that know how to spin it.

Reading the comments of all these disillusioned developers, it’s already damaged them because now smart people will be extra dubious when Google starts making claims.

They just made it harder for themselves to convince developers to even try their APIs, let alone bet on them.

This was stupid.

Silicon Valley is plagued by over promising far too early and fooling everyone we are years ahead of where we actually are. Estimates are always way off, and claims and even fake demos are almost always oversold to get investors.

Which means anyone who isn’t engaged in the hype cycle looks bad by comparison.

Still I would distinguish between genuine optimism and cynical exaggeration. If you go back and read what was being said in and around the 70s about technology and watch the demos they gave…it always makes me feel a little sad for them. People really thought so many things were right around the corner, e.g. if a computer can beat a human star chess player, surely understanding natural language or driving a car will be easy. There were so many things being made like primitive robots that could follow a line painted on the floor, that in a controlled environment seemed so promising but the realization of the vision would take decades longer than predicted. Without their efforts we would not be where we are today.

True. I think the optimism and grandiosity of the engineers with gems in their eyes keeps the passion going to work hard - some are just are terrible at grounding ourselves in reality. Genuine science needs to take more foothold, skepticism and curiosity over certainty of accelerated outcomes or non existant (heck even fraudulent in some situations) capabilities

Anyone remember the Google IO demo where they had their “AI” call a barber to book an appointment.

Turns out it was all staged.

Lost a lot of trust after that.

Google is stuck in innovators dilemma.

They make 300B of revenue which ~90% is ads revenue.

Their actual mission that management chain optimizes for is their $ growth.

A superior AI model that gives the user exactly what they want would crash their market cap.

Microsoft has tons of products with Billion+ profit, Google has only a handful and other than cloud they all tie to Ads.

Google is addicted to ads. If chrome adds a feature that decreases ad revenue, that team gets a stick.

Nothing at Google should jeopardize their ad revenue.

AI is directly a threat to Google’s core business model - ads. It’s obvious they’re gonna half ass it.

For OpenAI, AI is existential for them. If they don’t deliver, they’ll die.

No way, was Google Duplex fake?!

https://www.nytimes.com/2019/05/22/technology/personaltech/a...

wout paywall: https://archive.vn/Gz614

There was also the cringey "niiice!", "sweeeet!", "that's greaatt", "that's actually pretty good" responses from the narrator in a few of the demo videos that gave them the feel of a cheap 1980's TV ad.

It really reminds me of the Black Mirror episode Smithereens with the tech CEO talking with the shooter. Tech people really struggle with empathy, not just 1 on 1 but with the rest of the outside world which is predominantly low income relatively, with no college education. Paraphrased, Black Mirror ep was like:

[Tech CEO read instructions to "show empathy" from his assistant via Slack]

CEO: I hear you. It must be very hard for you.

Shooter: Of course you fucking hear me, we're on the phone! Talk like a normal person!

I remember that conversation! Man, that was a great episode.

Remember when they faked that Google Assistant booking a restaurant thing too.

How was that fake?

Mhm

I'll admit I was fooled. I didn't read the description of the video. The most impressive thing they showed was the real-time responses to watching a video. Everything else was about expected.

Very misleading and sad Google would so obviously fake a demo like this. Mentioning in the description that it's edited is not really in the realm of doing enough to make clear the fakery.

i too was excited and duped about the real-time implications. though i'm not surprised at all to find out it's false.

mea cupla i should have looked at the bottom of the description box on youtube where it probably says "this demonstration is based on an actual interaction with an LLM"

I'm surprised it was false. It was made to look realistic and I wouldn't expect Google to fake this kind of thing.

All they've done is completely destroy my trust in anything they present.

The Twitter-linked Bloomberg page is now down.[1] Alternative page: [2] New page says it was partly faked. Can't find old page in archives.

[1] https://www.bloomberg.com/opinion/articles/2023-12-07/google...

[2] https://www.bloomberg.com/opinion/articles/2023-12-07/google...

I am similarly enraged when TV show characters respond to text messages faster than humans can type. It destroys the realism of my favorite rom-coms.

The report from TechCrunch has more details - https://techcrunch.com/2023/12/07/googles-best-gemini-demo-w...

At the 5:35 point, the phone screen rotates before he touches it. https://youtu.be/UIZAiXYceBI?t=335

This is what convince me that all are hoax.

Lol, could have done without the cocky narration. "I think we're done here."

The whole launch is cocky. Bleh. Stick to the engineering.

I missed the disclaimer. So, when watching it, I started to think "Wow, so Google is releasing their best stuff".

But then I soon noticed some things that were too smooth, so seemed at best to be cherry-picked interactions occasionally leaning on hand-crafted situation handlers. Or, it turns out, faked.

Regardless of disclaimers, this video seems misleading to be releasing right now, in the context of OpenAI eating Google's lunch.

Everyone is expecting Google to try to show they can do better. This isn't that. This isn't even an mocked-up interaction future of HCI concept video, because it's not showing a vision of what people want to do --- it's only showing a demo of technical capabilities.

It's saying "This is what a contrived tech demo (not application vision concept) could look like, but we can't do it yet, so we faked it. Hopefully, the viewer will get the message that we're competitive with OpenAI."

(This fake demo could just be an isolated oops of a small group, not representative of Google's ability to rise to the current disruption challenge, I don't know.)

I knew immediately this was just overhyped PR when I noticed the author of the blogpost is Sundar.

I guess a much better next step is to compare how GPT4V performs when asked similar prompts. Even if mostly staged this is very impressive to me, not much on the current tech but more on how much leverage Google has to win this race on the long run because of its hardware presence.

The more these models improve the more we will want less friction and faster interactions, this means that in the long term having to open an app and ask a question is not gonna fly compared to just pointing your phone camera to something, asking a question and getting an answer that's tailored to everything Google knows about you in real time.

Apple will most likely also roll their own in house solution for Siri instead of relying on an external company. This leaves OpenAI and the other small companies not just competing for the best models but also on how to put them in front of people in the first place and how to get access to their personal information.

Even if mostly staged this is very impressive to me, not much on the current tech but more on how much leverage Google has to win this race on the long run because of its hardware presence.

I think you have too much information to form a reasonable opinion on the situation. Google is using editing techniques and specific scripting to try to demonstrate they have a sufficiently powerful general AI. The magnitude of this claim is huge, and the fact that they're faking it should be a likewise enormous scandal.

To sum this up "well I guess they're doing better than XYZ" discounts the absurd context of all this.

While this might just be a bit of bad PR now, it will eventually be a nothing burger. Remember the original debut of Apple's Siri for which Apple also put out a promotional demo with greatly exaggerated functionality? People even sued Apple and they lost.

As much as I hate it, this is absolutely fine by our society's standards. https://www.theregister.com/AMP/2014/02/14/apple_prevails_in...

There's a vast difference between advertising a product, slightly shortening the sequences and, cutting out failed prompts, and completely misrepresenting the product at hand to a degree that the depiction doesn't resemble the product at all[0].

The former is considered Puffery[1] and is completely legal, and the latter is straight up lying.

0: https://techcrunch.com/2023/12/07/googles-best-gemini-demo-w...

1: https://en.wikipedia.org/wiki/Puffery

You can tell whoever put together that demo video gave no f*cks whatsoever. This is the quality of work you can expect under an uninspiring leader (Sundar) in a culture of constant layoff fear and bureaucracy.

Literally everyone I know who works at Google hates their job and are completely checked out.

Huh? It was a GREAT demo video!

If it had been real, that is.

If you've seen the video, it's very apparent it's a product video, not a tech demo. They cut out the latencies to make a compelling product video.

I wasn't at all under the impression they were showcasing TTS or low latencies as product features. I don't find the marketing misleading at all, and find these criticisms don't hit the mark.

https://www.youtube.com/watch?v=UIZAiXYceBI

It's not just cutting. The answers were obtained by taking still photos and inputting them into the model together with detailed text instructions explaining the context and the task to the model, giving some examples first and using careful chain-of-thought style prompting. (see e.g. https://developers.googleblog.com/2023/12/how-its-made-gemin...) My guess is that the video was fully produced after the Gemini outputs were generated by a different team, instead of while or before.

Just how many lives does Sundar have? Where is the board?

Counting their bonusses?

Any sufficiently-advanced technology is indistinguishable from a rigged demo.

Fake it til you make it, then keep faking it.

I wanted to see what's really going on, but the bloomberg article on the twit link seems taken down right now.

This is just a tweet that makes a claim without backing, and links to an article that was pulled.

Can we change the URL to the real article if it still exists?

That demo was much further on the "marketing" end of the spectrum when compared to some of their other videos from yesterday which even included debug views: https://youtu.be/v5tRc_5-8G4?t=43

There is a possibility of dataset contamination on the competitive programming benchmark. A nice discussion on the page where AlphaCode2 was solving the problems https://codeforces.com/blog/entry/123035

Problem showed in the video was reused in a recent competition (so could have been available in the dataset).

I geniounly don't understand why this comes as a surprise. If it was a realtime video with realtime responses it would have been some other generational leap beyond transformers. Not only that, it would somehow deal with video + audio with smaller latency than the current SOTA model? Cmon...

Google Gemi-lie

I can't find the original article anymore

So what? Voice to text is a solved problem. And in cases where realtime is important, just throw more compute at it. I'm missing the damning gotcha moment here.

Hard to build when you fire everyone huh

I’ll assume Gemini is vaporware unless I can actually use it.

fake benchmarks, fake stitched together videos, disingenuous charts, no developer API on launch, announcements stuffed with marketing fluff.

As soon as I saw that opening paragraph from Sundar and how it was written I knew that Gemini is going to be a steaming pile of shit.

They should have watched the GPT-4 announcement from OpenAI again. That demo Greg Brockman did with converting a sketch on a piece of paper to a CodePen from a Discord channel, with all the error correcting and whatnot, is how you launch a product that's appealing to users.

TechCrunch, Twitter and some other sites (including HN i guess) are already piling on to this and by Monday things will go back to how they were and Google will have to go back to the drawing board to figure out another way to relaunch Gemini in the future.

I thought it was implied and obvious that the video was edited.

So what?

AI: artificial incompetence

I was looking at this demo today and was wondering the same. It looked way to fast to be calculating all of that info.

It seems like the fake video did the trick, their stock is up 5.5% today.

That's my suspicion when I first saw it. Its really an impressive demo though.

I didn’t believed Google presentation off-hand because I don’t care anymore, especially because it comes from them. I just use tools and adapt. Copilot helps me automating boring tasks, can’t help much at new stuff, so I actually discovered I often do “interesting” work. I use gpt 3.5/4 for everything but work, it’s been a bless, best suggestion engine for movies, books and music with just a prompt and without the need of tons of data about my watch history(looking at you youtube). In these strange times I’m actually learning a lot more, productivity is more or less the same as before llms, but annoying tasks are relieved a bit. All of that without the hype. Sometimes I laugh at Google, it must be a real shit show inside that mega corporation, but I kinda understand the need of a marketing editing, having a first class ticket on the AI train is so important for them as it seems they see it as an existential threat. At least it seems so since they decided to take the risk of lying.

rofl

C'mon that was obvious. Be real.

I looked at is as if it were a good aspirational target for 5 years from now. It was obvious the whole video was edited together not real time.

Well, sometimes I have this "Google Duplex: A.I. Assistant Calls Local Businesses To Make Appointments" feeling.

https://www.youtube.com/watch?v=D5VN56jQMWM

The demo may be fake, but people genuinely love it. the top comment on youtube of that demo is still:

"Absolutely mindblowing. The amount of understanding the model exhibits here is way way beyond anything else."

Anybody remember the google duo demo?

This must’ve been shot by one of the directors who did Apple’s new show Extrapolations. Very plausible illustration of AI in daily life though, despite the aggressive climate change claims made in it.

But neither AI not climate are there yet…

Bloomberg link not working; here's TechCrunch: https://techcrunch.com/2023/12/07/googles-best-gemini-demo-w...

This is endemic to public product demos. The thing never works as it does in the video. I'm not excusing it, I'm saying: don't trust public product demos. They are commercials, they exist to sell to you, not to document objectively and accurately, and they will always lie and mislead within the limits of the law.

I really thought this was a realtime demo.

Shame on them :(

If the truck doesn't have a working engine, we can just roll it down the hill.

Brilliant idea!

Well, google has a history for faking things.. so I'm not not surprised. I expected that..

All companies are just yelling that they're "in" the AI/LLM game.. If they don't, share prices will drop.

Google is done, they can't compete in this space.

The red flag for me was that they started that demo video with a background noise to make it seem like it's a raw video. A subtle manipulation for no reason, it's obviously not a raw video.

The fact that they did not fact check the videos again makes me not particularly confident in the quality of Google's work. The bit where the model misinterpreted music notation (the circled area does not mean "piano"), and the "less dense than water" rubber duck are beyond the pale. The SVG demo where they generate a South Park looking tree looks like a parody.

https://nitter.unixfox.eu/parmy/status/1732811357068615969?f...

Wow - my first thought was I wonder what framerate they're sending video at. The whole demo seems significantly less impressive in that case.

too little, too late. my impression is that google is not one, but two steps behind what MS can offer (they need a larger leap if they want to get ahead)

The more Google tries to over-hype stuff the more that keeps giving me a greater impression they are well behind OpenAI. Time to STFU and focus on working on stuff.

Bloomberg link in Xeet is 404 for me (Bangalore).

The hype really is drowning out the simple fact that basically no one really knows what these models are doing. Why does it matter so much that we include auto-correlation of embedding vectors as the "attention" mechanism in these models? And that we do this sufficiently many times across all the layers? And that we blindly smoosh values together with addition and call it a "skip" connection? Yes, you can tell me a bunch of stuff about gradients and residual information, but tell me why any of this stuff is or isn't a good model of causality.

Fucking. Shocking.

Anyone with half a brain could see through this "demo." It was vastly too uncanny to be real, to the point that it was poorly setup. Google should be ashamed.

Google did the same with Pixel 8 Pro advertising - they showed stuff like photo and video editing, that people couldn't replicate on their phones.

Imagine being google and having 100 BILLION+ in liquid cash, tens of thousands of the best engineers world-wide, and everything you could possibly need for running tech products. Yet being completely unable to launch anything new or worthwhile. Like, how tf does that even happen? Is Google the next Kodak?

Here’s the detailed article: https://archive.is/20231207205555/https://www.bloomberg.com/...

the original linked article in the tweet [0] now returns 404 for me

[0] - https://www.bloomberg.com/opinion/articles/2023-12-07/google...

Even a year ago, this advert would have been obvious puffery in advertising.

But right now, all the bits needed to do this already exist (just need to be assembled and -to be fair- given a LOT of polish), so it would be somewhat reasonable to think that someone had actually Put In The Work already.

I imagine the model also has some video embeddings like for the example when it needed to find where the ball was hiding.

Unpaywalled Bloomberg article linked in the tweet:

https://archive.is/4H1fB

Link to the Bloomberg article from the Tweet is 404 now.

The bloomberg article gives 404 for me

I didn't even look at the demo. And not due to lack of interest in LLMs (I'm an academic working in NLP, and I have some work on LLMs).

The thing is that for all practical intents and purposes, if I can't try it, it doesn't exist. If they claim it exists, they should show it, a video or a few cherry-picked examples don't prove anything. It's easy to make a demo making even Eliza look like AGI by asking the right questions.

For more details about how the video was created, see this blog post: https://developers.googleblog.com/2023/12/how-its-made-gemin...