Our next-generation model: Gemini 1.5

freediver
53 replies
3h1m

It would do Google a great service if every such announcement weren't met with 'join the waitlist' and 'talk to your Vertex AI team'.

addandsubtract
8 replies
2h42m

Remember when Gmail was new and you needed an invite to join? I guess Google is stuck in 2004.

bobchadwick
6 replies
2h34m

I'm embarrassed to admit that I bought a Gmail invite on eBay for $6 when it was still invite-only.

jprete
1 replies
2h1m

That's not entirely a waste; it would have given you a better chance at an email address you wanted.

CydeWeys
0 replies
1m

Yeah. I ended up with an eight letter @gmail.com because I dithered, but if I'd signed up by any means necessary when I'd first heard of it, I would've gotten a four letter one.

spiffytech
0 replies
1h49m

I bartered on gmailswap.com, sending someone a bicentennial 50¢ US coin in exchange for an invite.

The envelope made it to the recipient, but the coin fell out in transit because I was young and had no idea how to mail coinage. They graciously gave me the invite anyway.

rocketbop
0 replies
1h56m

Nothing to be ashamed of. I think I might have bought a Google Wave invite a couple of years later :/

blagie
0 replies
2h21m

*shrug* It probably gave you months of fun.

agumonkey
0 replies
2h27m

Yielding a priceless anecdote

moffkalast
0 replies
40m

They don't seem to remember when that literally sank Google+ because people had no use for a social network without their friends on it.

jpeter
6 replies
2h44m

and not having to wait months if you live in the EU

lxgr
5 replies
2h29m

What's worse is that I can't seem to find a way to let Google know where I actually live (as opposed to where I am temporarily traveling, what country my currently inserted SIM card is from etc). And apparently there is no way to do this at all without owning an Android device!

Apple at least lets me change this by moving my iTunes/App Store account, which is its own ordeal and far from ideal, but at least there's a defined process: Tell us where you think you live, provide a form of payment from that place, maybe we'll believe you.

TillE
4 replies
2h5m

Yeah Google aggressively uses geolocation throughout their services, regardless of your language settings. The flipside of that is that it's really easy to access the latest Gemini or whatever by just using a VPN.

lxgr
3 replies
1h40m

Wait, does that mean if I subscribe to Gemini Pro in country A where it's available (e.g. the US) but travel to Europe, I can't use it?

I'm really frustrated by Google's attitude of "we know better where you are than you do". People travel sometimes and that's not the same thing as moving!

FergusArgyll
2 replies
54m

I signed up for all of their AI products when I was in the US, some of them work while I'm out of country some don't. I can't tell what the rule is...

lxgr
1 replies
35m

I really, really hate all of these geo heuristics. Sure, don't advertise services to people outside of your market, I get that. Do ask for a payment method from that country too to provide your market-specific pricing if you must.

But once I'm a paying customer, I want to use the thing I'm paying for from where I am without jumping through ridiculous hoops!

The worst variant of this I've seen is when you can neither use nor cancel the subscription from outside a supported market.

FergusArgyll
0 replies
1m

To be clear, I didn't pay for any of them. I just signed up for early access to every product that uses some form of ML that can remotely be called "AI"...

Once I got accepted, some of them work outside of the US and some don't

bachmeier
6 replies
2h41m

This is bad practice across the board IMO. There seems to be an idea that this builds anticipation for new products. Sounds good in a PowerPoint presentation by an MBA but doesn't work in practice. Six months (or more!) after joining a waitlist, I'm not seeing it for the first time, so I don't really care when yet another email selling me something hits my inbox. I may not even open the email. This could be mitigated somewhat by at least offering a demo, but that's rare.

bushbaba
4 replies
2h29m

Likely they have limited capacity and are allotting it to their highest-paying and most strategic customers.

moralestapia
1 replies
2h2m

So tactical, wow. Meanwhile OpenAI and others will eat their lunch again.

bushbaba
0 replies
30m

Agreed. OpenAI also doesn't need to grapple with shareholders fearing a GDPR-like fine. Sadly, the larger you are, the bigger the pain from small mistakes.

moffkalast
0 replies
42m

> they have limited capacity

Google.

Yes, one cannot expect much from a small, plucky startup like them.

eitally
0 replies
2h18m

Speaking as someone who worked on Google Cloud's partnerships team: the way the Early Access Program works, not to mention the Alpha --> Beta --> GA launch process for AI products, is really dysfunctional. Inevitably what happens is that a few strategic customers or partners get exceptionally early (Alpha) access and work directly with the product team to refine things, fix bugs and iron out kinks. This is great, and the way market-driven product development should work.

The issues arise with the subsequent stage-gate graduation processes, requirements, and launches to less restricted markets. It's inconsistent, the QoS pre-GA customers receive is often spotty, the products come with no SLAs, and -- just like Gmail on the consumer side -- things frequently stay in EAP/Beta phase for years with no reliable timeline for launch. ... and then they're often killed before they get to GA, even though they may have been in use by EAP customers for upwards of 1-2 years.

I drafted a new EAP model a few years ago when Google's Cloud AI & Industry Solutions org was in the process of productizing things like the retail recommendation engine and Manufacturing Data Engine, and had all the buy-ins from stakeholders on the GTM side ... but the CAIIS GM never signed off. Subsequently, both the GM & VP Product of that org have been forced out.

In my opinion, this is something Microsoft does very well and Google desperately needs to learn. If they pick up anything from their hyperscaler competitors it should be 1) how to successfully become a market driven engineering company from MSFT and 2) how to never kill products (and not punish employees for only doing KTLO work) from AMZN.

justrealist
0 replies
2h9m

One PM in 2005 knocked it out of the park with Gmail and every Google PM since then has cargo-culted it.

belval
5 replies
2h57m

They can't do that because only they are the incorruptible stewards empowered with the ability to develop these models, making them accessible to the unwashed masses would be irresponsible!

ethanbond
4 replies
2h51m

The victim complex on this topic is getting really old.

They’re an enterprise software company doing an enterprise sales motion.

belval
2 replies
2h31m

If that was true, they wouldn't have named it Gemini 1.5 to follow the half-point increment of ChatGPT, they desperately want "people" to care about their product to gain back their mindshare.

Anthropic's Claude targets mostly business use cases, and you don't see them writing self-congratulatory articles about Claude v2.1; they just pushed the product.

eropple
0 replies
2h6m

Mindshare is part of enterprise sales, yes.

I work at a very large company and everyone knows about ChatGPT and Gemini (in part because we for our sins have a good chunk of GCP stuff), but I doubt anyone here not doing some LLM-flavored development has ever even heard of Anthropic, let alone Claude.

KirinDave
0 replies
5m

And look at how well it's going for Claude. Their primary claim to fame is being called "an annoying coworker" and that's it.

Why would anyone look to form a contract with Anthropic right now? I'd say they're in danger here, because their models and offerings don't have clear value propositions to customers.

dkjaudyeqooe
0 replies
1h58m

They’re an enterprise software company

Really? Someone ought to tell them.

baq
5 replies
2h59m

Yeah compared to e.g. Apple’s ‘here’s our new iWidget 42 pro, you can buy it now’ it’s at best disappointing.

apozem
4 replies
2h33m

Apple is good about only announcing real products you can buy. They don't do tech demos. It's always, "here's a problem. the new apple watch solves it. here're five other things the watch does. $399."

xnx
1 replies
1h55m

I agree that Apple does a better job, but wasn't Apple Vision Pro announced 240 days before you could get it? I think it's a pretty safe bet that Gemini 1.5 (or something better) will be available to anyone who wants to use it within the next 240 days.

nacs
0 replies
43m

AI software release cycles are incredibly short right now. Every month, there is some major development released in a usable-right-now form.

First-of-its-kind AR/VR hardware has, understandably, a longer release cycle. Also, Apple announced early to drive up developer interest.

erkt
0 replies
2h17m

The verdict isn't in yet on the Vision Pro, but otherwise your point stands.

amarant
0 replies
2h15m

Apple is indeed masterful at advertising. Google, somewhat ironically, is really bad at it.

stavros
4 replies
2h53m

I'm generally an excited early adopter, but this kills my excitement immediately. I don't know if Gemini is out (or which Gemini is out) because I've associated Google with "you can't try their stuff", so I've learned to just ignore everything about Gemini.

petre
1 replies
2h14m

There is a Gemini service that you can use with your Google account, but it's kind of meh: it repeats your input and makes all sorts of assumptions. I'm confused about the version as well. There's a link to another, premium version (1.5?) on its page, to which I don't have access without completing a quest that likely ends with a credit card prompt. That kills it for me.

yborg
0 replies
57m

Or can't use ... I have a newish work account and downloaded Gemini on a Pixel 8 Pro and get "Gemini isn't available" and "Try again later" with no explanation of why not and when.

skybrian
0 replies
1h45m

I think the way to understand this is to realize that this isn’t targeted at a Hacker News audience and they don’t care what we think. The world doesn’t revolve around us.

What’s the goal? Maybe, being able to work with partners without it being a secret project that will inevitably leak, resulting in inaccurate stories in the press. What are non-goals? Driving sales or creating anticipation with a mass audience, like a movie trailer or an Apple product launch.

So they have to announce something, but most people don’t read Hacker News and won’t even hear about it until later, and that’s fine with them.

hbn
0 replies
2h16m

Google is really good at diluting any possible anticipation hardcore users might have for new stuff they do. 10 years ago I loved when there was a big update to one of their Android apps and I could sideload the apk from the internet to try it out early. Then they made all those changes A/B tests controlled by server side flags that would randomly turn themselves on and off, and there was no way to opt in or out. That was one of the (many) moves that contributed to my becoming disenchanted with Android.

kkzz99
1 replies
2h23m

It's because they don't want you to actually use it and see how far behind they are compared to other companies. These announcements are meant to placate investors. "See, we are doing a lot of SotA AI too".

Keyframe
0 replies
1h0m

You might be right, but other things from Google tell the same story. For example, I recently tried to get hold of a Pixel 8 Pro. I had to import one from the UK, and when I did, it turned out the new feature of using the thermometer on humans isn't available outside the US. It doesn't even seem like the process to certify it outside the US is underway. Google sales and support just aren't a thing the way they are with Apple, which is a total shame. I know Google is strong, if not the strongest, in the tech game; they just need to get their act together, and I believe they'll succeed in that, but sales and support were never in their DNA. Not sure if that can be changed.

I'm more than happy to transfer my monthly $20 to google from OpenAI, on top of my youtube and google one subscription. It's up to Google to take it.

whywhywhywhy
0 replies
1h35m

After the complete farce that was the last, 90% faked, video of their tech, maybe next time just give us a text box where we can talk to the thing and see it working ourselves.

It's shocking to me: is management really so clueless that they don't realize how far behind they are? This isn't 2010 Google; you're not the company that made your success anymore, and in a decade the only two sure-fire things that will still exist are Android and Chrome. Search, Maps, and YouTube are all in precarious positions that the right team could dethrone.

summerlight
0 replies
1h5m

I believe this is standard practice at Google whenever they need to launch a change that is expected to consume huge resources and whose demand they cannot reasonably predict. Though I agree that this is bad PR practice; a waitlist should be treated as a compromise, not a PR technique.

quatrefoil
0 replies
2h4m

It lets the company control the narrative, without the distraction of fifty tech bloggers test-driving it and posting divergent opinions or findings. Instead, the conversation is anchored to what the company claims about the product.

It's interesting that it's the opposite of the gaming industry. There, because the reviewers dictate the narrative, the industry is better at ferreting out bogus claims. On the flip side, loud voices sometimes steamroll over decent products because of some ideological vendetta.

mil22
0 replies
1h57m

Totally agree with this. I can see the desire to show off, but I don't understand how anyone can believe this is good marketing strategy. Any initial excitement I get from reading such announcements will be immediately extinguished when I discover I can't use the product yet. The primary impression I receive of the product is "vaporware." By the time it does get released I'll already have forgotten the details of the announcement, lost enthusiasm, and invested my time in a different product. When I'm choosing between AI services, I'll be thinking "no, I can't choose Gemini Pro 1.5 because it's not available yet, and who knows when it will be available or how good it'll be." Then when they make their next announcement, I'll be even less likely to give it any attention.

hobofan
0 replies
2h43m

Eh, I think it's about as bad as the OpenAI method of officially announcing something and then "continuously rolling it out to all subscribers" which may be anything between a few days and months.

dpkirchner
0 replies
42m

I wrote off the PS5 because of waitlists. I was surprised to learn just yesterday that they are now actually, honestly purchasable (what I would consider "released").

I guess I let my original impression anchor my long-term feelings about the product. Oh well.

crazygringo
0 replies
1h1m

These announcements are mainly for investors and other people interested in planning purposes. It's important to know the roadmap. More information is better.

I get that it's frustrating not to be able to play with it immediately, but that's just life. Announcing things in advance is still a valuable service for a lot of people.

Plus tons of people have been claiming that Google has somehow fallen behind in the AI race, so it's important for them to counteract that narrative. Making their roadmap more visible is a legitimate strategy for that.

brianjking
0 replies
2h58m

100%, I can't even use Imagen despite being an early tester of Vertex.

bobvanluijt
0 replies
1h57m

I have access and will share some learnings soon

anonzzzies
0 replies
2h3m

And region based. Yawn.

TheFragenTaken
0 replies
11m

It's probably going to be dead/deprecated in a year, so maybe there's a silver lining to how hard it is to get to use the service. I, for one, wouldn't "build with Gemini".

code51
45 replies
3h3m

Dear Google, please fix your names and versioning.

Gemini Pro, Gemini Ultra... but that was 1.0?

Now upgraded, but again Gemini Pro? Jumping from 1.0 to 1.5?

Wait, but not Gemini Pro 1.5... Gemini "1.5" Pro.

What actually happened between 1.0 and 1.5?

sho_hn
22 replies
2h45m

It's not that difficult.

Their LLM brand is now Gemini. Gemini comes in three different sizes, Nano/Pro/Ultra.

They recently released 1.0 versions of each, most recently (a few months after Nano and Pro) Ultra.

Today they are introducing version 1.5, starting with the Pro size. They say 1.5 Pro offers comparable performance to 1.0 Ultra, along with new abilities (token window size).

(I agree Small/Medium/Large would be better.)

apwell23
8 replies
2h32m

What you described is difficult.

mcmcmc
6 replies
2h16m

It’s really not. Substitute Gemini for iPhone. Apple releases an iPhone model in mini, standard, and pro lines. They announce iPhone model+1 but are releasing the pro version first. Still difficult?

apwell23
2 replies
2h2m

Apple releases an iPhone model in mini, standard, and pro lines.

Not an iPhone user, but I just looked at the iPhone 15. I don't see any mini version. I'm guessing 'standard' is just called 'iPhone'? Is Pro the same thing as Plus?

https://www.apple.com/shop/buy-iphone/iphone-15

Still difficult?

Yes, your example made it even more confusing.

mcmcmc
0 replies
45m

Now you’re being intentionally difficult. Do you want it to be cars? Last year $Automaker released $Sedan 2023 in basic, standard, and luxury trims. This year $Automaker announced $Sedan 2024 but has so far only announced the standard trim. If I had meant the iPhone 15 specifically I would’ve said iPhone 15. I think the 12 was the last mini? The point is product families are often released in generations (versions in the case of Gemini) and with different available specs (ultra/pro/nano etc) that may not all be released at the same time.

dpkirchner
0 replies
39m

Apple discontinued mini phones two generations back, unfortunately.

chatmasta
1 replies
1h35m

So Google will be upgrading the version number of each model at the same time? Based on other comments here, that's not the case - some are 1.5 and some are 1?

Apple doesn't announce the iPhone 12 Mini and compare it to the iPhone 11 Pro.

crazygringo
0 replies
50m

Apple doesn't announce the iPhone 12 Mini and compare it to the iPhone 11 Pro.

Maybe not that specific example, but yes they absolutely do, both across their phones and across their Macs.

sho_hn
0 replies
2h1m

I think it's the "iPhone +1 Mini is as fast as the old Standard" that confuses people here. This is obvious and expected but not how it's usually marketed I guess ...

huytersd
0 replies
1h7m

How? Three models Nano/Pro/Ultra currently at 1.0. New upgrades just increment the version number.

Keyframe
4 replies
2h6m

What's Advanced then, the chat? Also, by that logic, 1.5 Ultra is still to come, and it'll show even bigger guns.

sho_hn
3 replies
2h0m

Yes, my understanding is also there will be a 1.5 Ultra.

It's however nowhere explicitly said that I could find. The Technical Report PDF also avoids even hinting at it.

Advanced is a price/service tier for the end-user frontend. At the moment it gets you 1.0 Ultra access vs. 1.0 Pro for the free version. Similar to how ChatGPT Plus gives you 4 instead of 3.5.

I agree this part is messy. Does everyone who had Pro already get 1.5 Pro? If 1.5 Pro is better than 1.0 Ultra, why pay for Advanced? Is 1.5 Pro behind the Advanced paywall? etc.

Keyframe
2 replies
1h46m

OK, so from what I've gathered from all of the comments so far, the primary confusion is that both the chat service and the LLM models share the same name.

There are three models: Nano/Pro/Ultra, and all are at v1.0.

There are two tiers of chat service: basic (free) and Advanced.

There is AI Studio from Google, through which you can interact with / directly use the Gemini LLMs.

The basic (free) Gemini chat service uses the Gemini Pro 1.0 LLM.

The Gemini Advanced chat service uses the Gemini Ultra 1.0 LLM.

What was shown is the ~~Ultra~~ Pro 1.5 LLM, which is / will be available to a select few for preview via AI Studio.

That leaves a question: what's Nano for, and is it only used via AI Studio/API?

Jesus, Google..

sho_hn
1 replies
1h44m

No, what they showed is Pro 1.5. Only via API and on a waitlist.

How this relates to the end-user chat service/price tiers is still unknown.

The best scenario would be that they just move Gemini free and Advanced tiers to Pro 1.5 and Ultra 1.5, I guess.

Keyframe
0 replies
1h17m

Yes, you are right. I meant Pro. Let's see then.

pentagrama
1 replies
47m

Thank you, it's clearer to me now. But I also read about "Gemini Advanced" in some Google announcement; do you know what that is and how it relates to the Nano/Pro/Ultra levels?

sho_hn
0 replies
41m

Gemini is also the brand name for the end-user web and phone chatbot apps, think ChatGPT (app) vs. GPT-# (model).

Gemini Advanced is the paid subscription service tier that at the moment gets you access to the Ultra model, similar to how a ChatGPT Plus subscription gets you access to GPT-4.

Honestly, they should have called this part Gemini Chat and Gemini Chat Plus, but of course ego won't let them follow the competitor's naming scheme.

mvkel
1 replies
1h49m

So there's Nano 1.0, Pro 1.5, Ultra 1.0, but Pro 1.5 can only be accessed if you're a Vertex AI user (wtf is Vertex)?

That's very difficult.

sho_hn
0 replies
1h45m

It's a bit similar to how new OpenAI stuff is initially usually partner-only or waitlisted.

Vertex AI is their developer API platform.

I agree OpenAI is a bit better at launching for customers on ChatGPT alongside API.

OJFord
1 replies
2h12m

, starting with the Pro size

This is where it gets confusing IMO.

It's like if Apple announced macOS Blabahee, starting with Mini, not long after releasing Pro and Air touting benefits of Sonoma.

Also, just.. this is how TFA begins:

Last week, we rolled out our most capable model, Gemini 1.0 Ultra, [...] Our teams continue pushing the frontiers of our latest models with safety at the core. They are making rapid progress. [...] 1.5 Pro achieves comparable quality to 1.0 Ultra

Last week! And now we have next generation. And the wow is that it's comparable to the best of the previous generation. Ok fine at a smaller size, but also that's all we get anyway. Oh and the most capable remains the last generation one. As long as it's the biggest one.

crazygringo
0 replies
55m

It's almost exactly like Apple, actually, with their M1 and M2 chips available in different sizes, launching at different times in different products.

It's really not that confusing. There are different sizes and different generations, coming out at different times. This pattern is practically as old as computing itself.

I can't even imagine what alternative naming scheme would be an improvement.

Alifatisk
1 replies
2h31m

They should remove the name Gemini Advanced and just stick to one name

sho_hn
0 replies
1h53m

Agreed.

Gemini Advanced seems to be the brand name for the higher price tier of the end-user frontend that gets you Ultra access, similar to how ChatGPT Plus gets you GPT-4.

I get it, but it does beg the question whether you will need Advanced now to get 1.5 Pro. Or does everyone get Pro, making it useless to pay for 1.0 Ultra?

I still don't think it's confusing, but that part is definitely messy.

Alifatisk
5 replies
2h53m

I understood the transition as follows.

The move from Google Bard to Google Gemini is what they call Gemini 1.0.

Gemini consists of Gemini Nano, Gemini Pro, & Gemini Ultra.

Gemini Nano is for embedded and portable devices, I guess? The free version of Gemini (gemini.google.com) is Gemini Pro. The paid version, called Gemini Advanced, uses Gemini Ultra.

What we're reading now is about Gemini Pro version 1.0 switching to version 1.5 as of today.

meowface
1 replies
2h44m

That just made my head spin even more. (Like, I get it, but it's just a very tortuous naming system.) The free version is called Pro, Gemini Advanced is actually Gemini Ultra, the less powerful version upgraded to the more powerful model but the more powerful version is on the less powerful model.

People make fun of OpenAI for not using product names and just calling it "GPT" but at least it's straightforward: 2, 3, 3.5, 4. (On the API side it's a little more complicated since there's "turbo" and "instruct" but that isn't exposed to users, and turbo is basically the default.)

kweingar
0 replies
16m

But you don't pay for GPT-4, you pay for a product called ChatGPT Plus, which allows you to write 40 messages to GPT-4 within a three-hour time window, after which you need to switch to 3.5 in the menu.

code51
1 replies
2h35m

But if Vertex AI is using Gemini Ultra, then why is MakerSuite (AI Studio now? hmmm) showing only "Gemini 1.0 Pro 001" (001: a version inside a version)?

And why have MakerSuite/AI Studio in the first place, if Vertex AI is the center for all things AI? And why AI Test Kitchen?

I'm seeing only Gemini 1.0 Pro on Vertex AI. So even though I enabled Google Gemini Advanced (Ultra?) and enabled Vertex AI API access, I still have to be blessed by Google first to access the advanced APIs.

It seems paying for their service doesn't mean anything to Google at this point. As a developer, you have to jump through hoops first.

Alifatisk
0 replies
2h29m

I think this answers why you can't see Ultra.

"Gemini 1.0 Ultra, our most sophisticated and capable model for complex tasks, is now generally available on Vertex AI for customers via allowlist."

https://cloud.google.com/blog/products/ai-machine-learning/g...

growt
0 replies
2h27m

It was probably not a wise choice to give the model itself and the product the same name: "Gemini Advanced is using Gemini Ultra". Also, "The free version ... is Gemini Pro" is not what you usually see out there.

aqme28
3 replies
2h58m

Furthermore, is a minor version upgrade two months later really "next generation"?

AndroTux
1 replies
2h28m

Maybe it's not a "next generation" model, but rather their next model for text generation ;)

cchance
0 replies
47m

I mean, I don't see any other models watching and answering questions about a 44-minute video lol

philote
0 replies
2h37m

Well if it's from 1 to 1.5 then it's really 5 minor version upgrades at once. And since 1.5 is halfway to 2 and you round up, it's next generation!

lairv
2 replies
3h0m

This naming is terrible. If I understand correctly, this is the release of Gemini 1.5 Pro, but not Gemini 1.5 Ultra, right?

goalonetwo
0 replies
2h57m

Looks like the former PM of chat at google found a new job.

cchance
0 replies
46m

How is that hard to understand? Yes, it's Gemini 1.5 Pro; they haven't released Ultra or Nano. This isn't rocket science, they didn't introduce Gemini 1.5 ProLight or something, lol. It's the Pro-size model's 1.5 version.

gmuslera
2 replies
2h22m

Maybe they should take a hint from Windows' version naming scheme and call the next version Gemini Meh.

apapapa
1 replies
1h48m

Are you talking about Xbox one?

mring33621
0 replies
34m

No. Gemini Purple Plus Platinum Advanced Home Version 11.P17

nkozyra
1 replies
2h57m

Their inability to name things sensibly has been called out for years and it doesn't look like they care?

I'm not sure what the deal is; it has to be a marketing hindrance as every major tech company tries to claw its way up the AI service mountain. Seems like the first step would be cogent naming.

data-ottawa
0 replies
18m

It would have been better as Gemini Lite, Gemini, Gemini Pro, and then v1, v1.5 for model bumps.

Ultra vs pro vs nano with Ultra unlocked by buying Gemini Advanced is confusing.

I'm also not sure why they make base Gemini available after you have Advanced, because presumably there's no reason to use a worse model.

kccqzy
0 replies
1h21m

Dear OpenAI, please fix your names and versioning. Why do you have GPT-3 and GPT-3.5? What happened between 3 and 3.5? And why isn't GPT-3 a single model? Why are there variations like GPT-3-6.7B and GPT-3-175B? And why is there now a turbo version? How does turbo compare to 4? And what's the relationship between the end-user product ChatGPT and a specific GPT model?

You see this problem isn't unique to Google.

jjbinx007
0 replies
2h59m

They can't decide on a single name for a chat application, so expecting them to come up with a sensible naming scheme is optimistic at best.

cchance
0 replies
48m

This just means we'll be getting a Nano 1.5 and Ultra 1.5

and if Pro 1.5 is this good holy shit what will Ultra be...

Nano/Pro/Ultra are the model sizes, 1.0 or 1.5 is the version

vessenes
39 replies
2h11m

The white paper is worth a read. The things that stand out to me are:

1. They don't talk about how they get to 10M token context

2. They don't talk about how they get to 10M token context

3. The 10M context ability wipes out most RAG stack complexity immediately. (I imagine creating caching abilities is going to be important for a lot of long token chatting features now, though). This is going to make things much, much simpler for a lot of use cases.

4. They are pretty clear that 1.5 Pro is better than GPT-4 in general, and therefore we have a new LLM-as-judge leader, which is pretty interesting.

5. It seems like 1.5 Ultra is going to be highly capable. 1.5 Pro is already very very capable. They are running up against very high scores on many tests, and took a minute to call out some tests where they scored badly as mostly returning false negatives.

Upshot, 1.5 Pro looks like it should set the bar for a bunch of workflow tasks, if we can ever get our hands on it. I've found 1.0 Ultra to be very capable, if a bit slow. Open models downstream should see a significant uptick in quality using it, which is great.

Time to dust off my coding test again, I think, which is: "here is a tarball of a repository. Write a new module that does X".

I really want to know how they're getting to 10M context, though. There are some intriguing clues in their results that this isn't just a single ultra-long vector; for instance, their audio and video "needle" tests, which just include inserting an image that says "the magic word is: xxx", or an audio clip that says the same thing, have perfect recall across up to 10M tokens. The text insertion occasionally fails. I'd speculate that this means there is some sort of compression going on; a full video frame with text on it is going to use a lot more tokens than the text needle.

CharlieDigital
9 replies
2h5m

    > The 10M context ability wipes out most RAG stack complexity immediately.
Remains to be seen.

Large contexts are not always better. For starters, it takes longer to process. But secondly, even with RAG and the large context of GPT4 Turbo, providing it a more relevant and accurate context always yields better output.

What you get with RAG is faster response times and more accurate answers by pre-filtering out the noise.
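
To make the pre-filtering concrete, here's a minimal sketch of what I mean (Python; embed() is a toy bag-of-words stand-in for whatever embedding model you'd actually call, and all the names are made up):

    import numpy as np

    def embed(text):
        # Toy stand-in for a real embedding model: hash words into a 64-dim vector.
        v = np.zeros(64)
        for word in text.lower().split():
            v[hash(word) % 64] += 1.0
        return v / (np.linalg.norm(v) + 1e-9)

    def retrieve(query, chunks, k=3):
        # The pre-filtering step: keep only the k chunks most similar to the query.
        q = embed(query)
        return sorted(chunks, key=lambda c: -float(q @ embed(c)))[:k]

    def build_prompt(query, chunks):
        # Only the filtered context goes into the prompt, not the whole corpus.
        context = "\n\n".join(retrieve(query, chunks))
        return f"Answer using only this context:\n\n{context}\n\nQuestion: {query}"

The point is that the model only ever sees the few chunks that survive the filter, which is both cheaper and less noisy than dumping everything into a giant context.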

behnamoh
5 replies
1h56m

Don't forget that Gemini also has access to the internet, so a lot of RAGging becomes pointless anyway.

beppo
3 replies
1h45m

Internet search is a form of RAG, though. 10M tokens is very impressive, but you're not fitting a database, let alone the entire internet into a prompt anytime soon.

behnamoh
2 replies
1h27m

You shouldn't fit an entire database in the context anyway.

btw, 10M tokens is 78 times more context window than the newest GPT-4-turbo (128K). In a way, you don't need 78 GPT-4 API calls, only one batch call to Gemini 1.5.

rvnx
0 replies
56m

Well it's nice, just sad nobody can use it

cchance
0 replies
51m

I don't get this. Why do people think you need to put an entire database into the AI's short-term memory for it to be useful? When you work with a DB, are you memorizing the entire f*cking database? No, you know a summary of it and how to access and use it.

People also seem to forget that the average person reads about 1B words in their entire LIFETIME, and 10M with nearly 100% recall is pretty damn amazing; I'm pretty sure I don't have perfect recall of 10M words myself lol

CharlieDigital
0 replies
1h29m

This may be useful in a generalized use case, but a problem is that many of those results again will add noise.

For any use case where you want contextual results, you need to be able to either filter the search scope or use RAG to pre-define the acceptable corpus.

killerstorm
2 replies
1h28m

Hopefully we can get a better RAG out of it. Currently people do incredibly primitive stuff like splitting text into fixed-size chunks and adding them to a vector DB.

A more useful RAG would convert the text to Q&A pairs and use the questions' embeddings as an index. A large context can make use of in-context learning to generate better Q&A.
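
Rough sketch of what I mean (Python; embed() and generate_questions() are toy stand-ins for a real embedding model and an LLM call, just so the example runs):

    import numpy as np

    def embed(text):
        # Toy hashing embedder; a real system would call an embedding model here.
        v = np.zeros(64)
        for word in text.lower().split():
            v[hash(word) % 64] += 1.0
        return v / (np.linalg.norm(v) + 1e-9)

    def generate_questions(chunk):
        # Hypothetical: in practice you'd ask an LLM "what questions does this
        # passage answer?"; faked here so the sketch runs.
        return [f"What is said about {w}?" for w in set(chunk.lower().split())]

    def build_index(chunks):
        # Index maps each question's embedding back to the chunk that answers it.
        return [(embed(q), c) for c in chunks for q in generate_questions(c)]

    def lookup(query, index, k=3):
        # Match the user's question against stored questions, return their chunks.
        qv = embed(query)
        ranked = sorted(index, key=lambda pair: -float(qv @ pair[0]))
        return [chunk for _, chunk in ranked[:k]]

Matching question-to-question tends to be a better similarity signal than matching question-to-passage, which is the whole idea.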

mediaman
1 replies
23m

A lot of people in RAG already do this. I do this with my product: we process each page and create lists of potential questions that the page would answer, and then embed that.

We also embed the actual text, though, because I found that only doing the questions resulted in inferior performance.

CharlieDigital
0 replies
9m

So in this case, what your workflow might look like is:

    1. Get text from page/section/chunk
    2. Generate possible questions related to the page/section/chunk
    3. Generate an embedding using { each possible question + page/section/chunk }
    4. Incoming question targets the embedding and matches against { question + source }
Is this roughly it? How many questions do you generate? Do you save a separate embedding for each question? Or just stuff all of the questions back with the page/section/chunk?

swalsh
5 replies
1h49m

"The 10M context ability wipes out most RAG stack complexity immediately."

I'm skeptical. My past experience is that just because the context has room to stuff in whatever you want, the more you stuff in, the less accurate your results are. There seems to be a balance: provide enough that you'll get high-quality answers, but not so much that the model is overwhelmed.

I think a large part of developing better models is not just a better architectures that support larger and larger context sizes, but also capable models that can properly leverage that context. That's the test for me.

HereBePandas
2 replies
1h24m

They explicitly address this in page 11 of the report. Basically perfect recall for up to 1M tokens; way better than GPT-4.

westoncb
1 replies
55m

I don't think recall really addresses it sufficiently: the main issue I see is answers getting "muddy". Like it's getting pulled in too many directions and averaging.

a_wild_dandan
0 replies
44m

I'd urge caution in extending generalizations about "muddiness" to a new context architecture. Let's use the thing first.

swyx
0 replies
33m

Also, costs are always based on context tokens; you don't want to put in 10M of context for every request (it's just nice to have that option when you want to do big things that don't scale).

chuckcode
0 replies
27m

I'd like to see the latency and cost of parsing the entire 10M context before throwing out the RAG stack, which is relatively cheap and fast.

usaar333
4 replies
1h50m

They are pretty clear that 1.5 Pro is better than GPT-4 in general, and therefore we have a new LLM-as-judge leader, which is pretty interesting.

They try to push that, but it's not the most convincing. Look at Table 8 for text evaluations (math, etc.) - they don't even attempt a comparison with GPT-4.

GPT-4 is higher than any Gemini model on both MMLU and GSM8K. Gemini Pro seems slightly better than GPT-4 original in Human Eval (67->71). Gemini Pro does crush naive GPT-4 on math (though not with code interpreter and this is the original model).

All in all, 1.5 Pro seems maybe a bit better than 1.0 Ultra. Given that in the wild people seem to find GPT-4 better for, say, coding than Gemini Ultra, my current update is that Pro 1.5 is about equal to GPT-4.

But we'll see once released.

panarky
2 replies
43m

>people seem to find GPT-4 better for say coding than Gemini Ultra

For my use cases, Gemini Ultra performs significantly better than GPT-4.

My prompts are long and complex, with a paragraph or two about the general objective followed by 15 to 20 numbered requirements. Often I'll include existing functions the new code needs to work with, or functions that must be refactored to handle the new requirements.

I took 20 prompts that I'd run with GPT-4 and fed them to Gemini Ultra. Gemini gave a clearly better result in 16 out of 20 cases. Where GPT-4 might miss one or two requirements, Gemini usually got them all. Where GPT-4 might require multiple chat turns to point out its errors and omissions and tell it to fix them, Gemini often returned the result I wanted in one shot. Where GPT-4 hallucinated a method that doesn't exist, or had been deprecated years ago, Gemini used correct methods. Where GPT-4 called methods of third-party packages it assumed were installed, Gemini either used native code or explicitly called out the dependency.

For the 4 out of 20 prompts where Gemini did worse, one was a weird rejection where I'd included an image in the prompt and Gemini refused to work with it because it had unrecognizable human forms in the distance. Another was a simple bash script to split a text file, and it came up with a technically correct but complex one-liner, while GPT-4 just used split with simple options to get the same result.

sho_hn
0 replies
33m

I have a very similar prompting style to yours and share this experience.

I am an experienced programmer and usually have a fairly exact idea of what I want, so I write detailed requirements and use the models more as typing accelerators.

GPT-4 is useful in this regard, but I also tried about a dozen older prompts on Gemini Advanced/Ultra recently and in every case preferred the Ultra output. The code was usually more complete and prod-ready, with higher sophistication in its construction and somewhat higher density. It was just closer to what I would have hand-written.

It's increasingly clear though LLM use has a couple of different major modes among end-user behavior. Knowledge base vs. reasoning, exploratory vs. completion, instruction following vs. getting suggestions, etc.

For programming I want an obedient instruction-following completer with great reasoning. Gemini Ultra seems to do this better than GPT-4 for me.

Dayshine
0 replies
29m

Is there any chance you could share an example of the kind of prompt you're writing?

I'm always reluctant to write long prompts because I often find GPT4 just doesn't get it, and then I've wasted ten minutes writing a prompt

cchance
0 replies
55m

I mean, I don't see GPT-4 watching a 44-minute movie and being able to exactly pinpoint a guy taking a paper out of his pocket...

cs702
4 replies
1h53m

1. They don't talk about how they get to 10M token context

2. They don't talk about how they get to 10M token context

Yes. I wonder if they're using a "linear RNN" type of model like Linear Attention, Mamba, RWKV, etc.

Like Transformers with standard attention, these models train efficiently in parallel, but their compute is O(N) instead of O(N²), so in theory they can be extended to much longer sequences much more efficiently. They have shown a lot of promise recently at smaller model sizes.
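
Not claiming this is what Gemini 1.5 actually uses (they don't say), but for anyone unfamiliar, here's a toy numpy sketch of why the "linear" family scales as O(N): the N x N attention matrix is replaced by a fixed-size running state.

    import numpy as np

    def softmax_attention(Q, K, V):
        # Standard attention: an N x N score matrix, so compute/memory grow as O(N^2).
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        return (w / w.sum(axis=-1, keepdims=True)) @ V

    def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
        # Linear-attention recurrence: carry a (d x d) state instead of the N x N
        # matrix, so each extra token costs O(1) and the whole pass is O(N).
        d = Q.shape[-1]
        S = np.zeros((d, d))   # running sum of phi(k_t) v_t^T
        z = np.zeros(d)        # running sum of phi(k_t), used for normalization
        out = np.zeros_like(V)
        for t in range(Q.shape[0]):  # causal, one token at a time
            S += np.outer(phi(K[t]), V[t])
            z += phi(K[t])
            out[t] = (phi(Q[t]) @ S) / (phi(Q[t]) @ z + 1e-9)
        return out

    N, d = 8, 4
    Q, K, V = np.random.default_rng(0).normal(size=(3, N, d))
    print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)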

Does anyone here have any insight or knowledge about the internals of Gemini 1.5?

candiodari
2 replies
1h49m

They do give a hint:

"This includes making Gemini 1.5 more efficient to train and serve, with a new Mixture-of-Experts (MoE) architecture."

One thing you could do with MoE is give each expert a different subset of the input tokens. And that would definitely do what they claim here: it would allow search. If you want to find where someone said "the password is X" in a 50-hour audio file, this would be perfect.

If your question is "what is the first AND last thing person X said" ... it's going to suck badly. Anything that requires taking two things into account that aren't right next to each other is just not going to work.

deskamess
0 replies
13m

Is MOE then basically divide and conquer? I have no deep knowledge of this so I assumed MOE was where each expert analyzed the problem in a different way and then there was some map-reduce like operation on the generated expert results. Kinda like random forest but for inference.

declaredapple
0 replies
17m

One thing you could do with MoE is giving each expert different subsets of the input tokens.

Don't MoE's route tokens to experts after the attention step? That wouldn't solve the n^2 issue the attention step has.

If you split the tokens before the attention step, that would mean those tokens would have no relationship to each other - it would be like inferring two prompts in parallel. That would defeat the point of a 10M context
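
For readers who haven't seen it: the usual (simplified) picture is that the router sits in the feed-forward block, after attention has already mixed information across positions. A toy top-1 router, nothing Gemini-specific (their architecture isn't disclosed) and all sizes made up:

    import numpy as np

    rng = np.random.default_rng(0)
    num_tokens, d_model, num_experts = 6, 8, 4

    x = rng.normal(size=(num_tokens, d_model))       # token activations *after* attention
    router_W = rng.normal(size=(d_model, num_experts))
    experts = [rng.normal(size=(d_model, d_model)) * 0.1
               for _ in range(num_experts)]          # each "expert" is a tiny FFN here

    logits = x @ router_W                 # per-token gating scores
    choice = logits.argmax(axis=-1)       # hard top-1 routing: one expert per token

    out = np.zeros_like(x)
    for e, W in enumerate(experts):
        mask = choice == e
        out[mask] = x[mask] @ W           # only the chosen expert's weights run for those tokens

So MoE saves FFN compute per token, but the attention step (and its cost over long contexts) is a separate question.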

sebzim4500
0 replies
54m

The fact they are getting perfect recall with millions of tokens rules out any of the existing linear attention methods.

ren_engineer
2 replies
1h0m

RAG would still be useful for cost savings assuming they charge per token, plus I'm guessing using the full-context length would be slower than using RAG to get what you need for a smaller prompt

nostrebored
1 replies
53m

This is going to be the real differentiator.

HN is very focused on technical feasibility (which remains to be seen!), but in every LLM opportunity, the CIO/CFO/CEO are going to be concerned with the cost modeling.

The way that LLMs are billed now, if you can densely pack the context with relevant information, you will come out ahead commercially. I don't see this changing with the way that LLM inference works.

Maybe this changes with managed vector search offerings that are opaque to the user. The context goes to a preprocessing layer, an efficient cache understands which parts haven't been embedded (new bloom filter use case?), embeds the other chunks, and extracts the intent of the prompt.

mediaman
0 replies
25m

Agreed with this.

The leading ability AI (in terms of cognitive power) will, generally, cost more per token than lower cognitive power AI.

That means that at a given budget you can choose more cognitive power with fewer tokens, or less cognitive power with more tokens. For most use cases, there's no real point in giving up cognitive power to include useless tokens that have no hope of helping with a given question.

So then you're back to the question of: how do we reduce the number of tokens, so that we can get higher cognitive power?

And that's the entire field of information retrieval, which is the most important part of RAG.

freedomben
1 replies
1h38m

Is 10M token context correct? In the blog post I see 1M, but I'm not sure if these are different things.

Edit: Ah, I see, it's 1M reliably in production, up to 10M in research:

Through a series of machine learning innovations, we’ve increased 1.5 Pro’s context window capacity far beyond the original 32,000 tokens for Gemini 1.0. We can now run up to 1 million tokens in production.

This means 1.5 Pro can process vast amounts of information in one go — including 1 hour of video, 11 hours of audio, codebases with over 30,000 lines of code or over 700,000 words. In our research, we’ve also successfully tested up to 10 million tokens.

huytersd
0 replies
1h13m

I know how I’m going to evaluate this model. Upload my codebase and ask it to “find all the bugs”.

zitterbewegung
0 replies
39m

RAG doesn’t go away at 10 Million tokens if you do esoteric sources like shodan API queries.

tveita
0 replies
1h27m

The 10M context ability wipes out most RAG stack complexity immediately.

The video queries they show take around 1 minute each, this probably burns a ton of GPU. I appreciate how clearly they highlight that the video is sped up though, they're clearly trying to avoid repeating the "fake demo" fiasco from the original Gemini videos.

theGnuMe
0 replies
1h26m

For #1 and #2 it is some version of mixture of experts. This is mentioned in the blog post. So each expert only sees a subset of the tokens.

I imagine they have some new way to route tokens to the experts that probably computes a global context. One scalable way to compute a global context is by a state space model. This would act as a controller and route the input tokens to the MoEs. This can be computed by convolution if you make some simplifying assumptions. They may also still use transformers as well.

I could be wrong but there are some Mamba-MoEs papers that explore this idea.

tbruckner
0 replies
1h37m

How do you know it isn't RAG?

resouer
0 replies
1h16m

The 10M context ability wipes out most RAG stack complexity immediately.

This may not be true. In my experience, the complexity of RAG lies in how to properly connect to various unstructured data sources and run a data transformation pipeline over large-scale data sets (GB, TB, or even PB). It's in the critical path rather than a "nice to have", because the quality of the data and the pipeline is a major factor in the final generated result. I.e., in RAG, the importance of R >>> G.

kylerush
0 replies
11m

I assume using this large of a context window instead of RAG would mean the consumption of many orders of magnitude more GPU.

jorvi
0 replies
1h13m

I just hope at some point we get access to mostly uncensored models. Both GPT-4 and Gemini are extremely shackled, and a slightly inferior model that hasn’t been hobbled by a very restricting preprompt would handily outperform them.

cchance
0 replies
56m

The YouTube video of the multimodal analysis of a video is insane. Imagine feeding in movies or TV shows and being able to auto-summarize or find information about them dynamically. How the hell is all this possible already? AI is moving insanely fast.

alphabetting
22 replies
2h59m

Massive whoa if true from technical report

"Studying the limits of Gemini 1.5 Pro's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens"

https://storage.googleapis.com/deepmind-media/gemini/gemini_...

megaman821
6 replies
2h53m

So, will this outperform any RAG approach as long as the data fits inside the context window?

saliagato
1 replies
2h48m

Basically, yes. Pinecone? Dead. Azure AI Search? Dead. Qdrant? Dead.

_boffin_
0 replies
2h45m

Prompt token cost still a variable.

TheGeminon
1 replies
2h46m

Outperform is dependent on the RAG approach (and this would be a RAG approach anyways, you can already do this with smaller context sizes). A simplistic one, probably, but dumping in data that you don't need dilutes the useful information, so I would imagine there would be at least _some_ degradation.

But there is also a downside to "tuning" the RAG to return fewer tokens: you will miss extra context that could be useful to the model.

megaman821
0 replies
2h38m

Doesn't their needle/haystack benchmark seem to suggest there is almost no dilution? They pushed that demo out to 10M tokens.

CuriouslyC
0 replies
2h24m

A perfect RAG system would probably outperform everything in a larger context due to prompt dilution, but in the real world putting everything in context will win a lot of the time. The large context system will also almost certainly be more usable due to elimination of retrieval latency. The large context system might lose on price/performance though.

ArcaneMoose
0 replies
2h49m

Cost would still be a big concern

stavros
3 replies
2h54m

Until I can talk to it, I care exactly zero.

peterisza
2 replies
2h1m

you can buy their stock if you think they'll make a lot of money with their tech

HarHarVeryFunny
1 replies
1h24m

Well that's really the right question .. what can, and will, Google do with this that can move their corporate earnings needle in a meaningful way? Obviously they can sell API access and integrate it into their Google docs suite, as well as their new Project IDX IDE, but do any of these have potential to make a meaningful impact ?

It's also not obvious how these huge models will fare against increasingly capable open source ones like Mixtral, perhaps especially since Google are confirming here that MoE is the path forward, which perhaps helps limit how big these models need to be.

plaidfuji
0 replies
31m

In the long run it could move the needle in enterprise market share of Workspace and GCP. They have a lot of room to grow and IMO have a far superior product to O365/Azure, an advantage that strong AI products could amplify. Only problem is this sales cycle can take a decade or more, and Google hasn’t historically been patient or strategic about things like this.

matsemann
3 replies
2h27m

Could you (or someone) explain what this means?

FergusArgyll
2 replies
50m

The input you give it can be very long. This can qualitatively change the experience. Imagine, for example, copy pasting the entire lord of the rings plus another 100 books you like and asking it to write a similar book...

teaearlgraycold
0 replies
32m

I doubt it’s smart enough to write another (coherent, good) book based on 103 books. But you could ask it questions about the books and it would search and synthesize good answers.

HarHarVeryFunny
0 replies
19m

I just googled it, and the LOTR trilogy apparently has a total of 480,000 words, which brings home how huge 1M is! It'd be fascinating to see how well Gemini could summarize the plot or reason about it.
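
(Back-of-the-envelope, assuming the common rule of thumb of roughly 0.75 English words per token: 480,000 words is about 640,000 tokens, so the whole trilogy fits in the 1M production window with room to spare, and the 10M research figure would hold roughly fifteen copies of it.)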

One point I'm unclear on is how these huge context sizes are implemented by the various models. Are any of them the actual raw "width of the model" that is propagated through it, or are these all hierarchical summarization and chunk embedding index lookup type tricks?

Workaccount2
3 replies
2h38m

10M tokens is absolutely jaw dropping. For reference, this is approximately thirty books of 500 pages each.

Having 99% retrieval is nuts too. Models tend to unwind pretty badly as the context (tokens) grows.

Put these together and you are getting into the territory of dumping all your company documents, or all your department's documents, into a single GPT (or whatever Google will call it) and everyone working with that. Wild.

kranke155
2 replies
2h4m

Seems like Google caught up. Demis is again showing an incredible ability to lead a team to make groundbreaking work.

huytersd
1 replies
1h11m

If any of this is remotely true, they didn't just catch up; it wipes the floor with GPT-4 in terms of how useful it can be. Not going to make a judgement until I can actually try it out though.

singularity2001
0 replies
4m

In the demo videos Gemini needs about a minute to answer long-context questions, which is better than reading thousands of pages yourself. But if it has to compete with classical search and skimming, it might need some optimization.

og_kalu
1 replies
2h10m

Another whoa for me

Finally, we highlight surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person learning from the same content.

Results - https://imgur.com/a/qXcVNOM

usaar333
0 replies
1h36m

I think this is mostly due to the ability to handle long context lengths better. Note how Claude 2.1 already far outperforms GPT-4 on this task.

cchance
0 replies
50m

Did you watch the video of the Gemini 1.5 video recall after it processed the 44 minute video... holy shit

tsunamifury
8 replies
2h56m

Google is like a nervous and insecure engineer — blowing their value by rushing the narrative and releasing too much too confusingly fast.

sho_hn
7 replies
2h49m

When OpenAI raced through 3/3.5/4 it was "this team ships" and excitement.

This cargo-cult hate train is getting tiresome. Half the comments on anything Google-related are like this now, and it doesn't add anything to the conversation.

mynameisvlad
3 replies
2h40m

Gemini Ultra was announced two months ago. It just launched in the last week. It literally is still the featured post on the AI section of their blog, above this announcement. https://blog.google/technology/ai/

There’s “this team ships” and there’s “ok maybe wait until at least a few people have used your product before you change it all”.

sho_hn
2 replies
2h35m

OpenAI announced GPT-4 image input in mid-March 2023 and made it generally available on the API in November 2023.

Google announced a fancy model two months early and released it in the promised timeframe.

Seems par for the course.

mynameisvlad
1 replies
2h19m

Did OpenAI then announce GPT-5 two weeks after launching GPT-4?

No, of course they didn’t. And you’re comparing one specific feature (image input) and equating it to a whole model’s release date.

Maybe compare apples to apples next time.

People pointing out release/announcement burnout is a reasonable thing; people in general can only deal with the “next new thing” with some breaks to process everything.

sho_hn
0 replies
2h6m

I made the comparison because both companies demonstrated advanced/extended abilities (model size, image input) and shipped it delayed.

epiccoleman
1 replies
2h46m

The difference, though, as someone who really doesn't have a particular dog in this fight, is that I can go use GPT-4 right now, and see for myself whether it's as exciting as the marketing materials say.

sho_hn
0 replies
2h41m

When OpenAI launched GPT-4, API access was initially behind a waitlist. And they released multiple demo stills of LMM capabilities on launch day that sat in a limited partner program for months before becoming generally available only 7 months later.

I also want the shiny immediately when I read about it, but I also know when I am acting entitled and don't go spam comment threads about it.

But really, mostly I mean this: It's fine to criticize things, but when half a dozen people have already raised a point in a thread, we don't need more dupes. It really changes signal-to-noise.

moralestapia
0 replies
1h57m

"this team ships"

Because they actually shipped ... (!)

jpeter
7 replies
2h45m

OpenAI has no Moat

wrsh07
1 replies
2h34m

A reference to the good doc: https://www.semianalysis.com/p/google-we-have-no-moat-and-ne...

While I'm linking semianalysis, though, it's probably worth talking about how everyone except Google is GPU poor: https://www.semianalysis.com/p/google-gemini-eats-the-world-... (paid)

Whether Google has the stomach to put these models out publicly without neutering their creativity or their existing business model is a different discussion.

Google has a serious GPU (well, TPU) build out, and the fact that they're able to train moe models on it means there aren't any technical barriers preventing them from competing at the highest levels

Keyframe
0 replies
2h3m

They also have internet.zip and all of its repo history, as well as Usenet, email, etc., which others don't.

rvz
1 replies
2h39m

This. He’s right you know.

OpenAI is extremely overvalued and Google is closing their lead rapidly.

fnordpiglet
0 replies
2h28m

Is there any meaningful valuation on OpenAI? It’s not for sale, there is no market.

Google … has no ability to commercialize anything. Their only commercial successes are ads and YouTube. Doing deceptive launches and flailing around with Gemini isn’t helping their product prospects. I wouldn’t take a bet between open ai and anyone, but I also wouldn’t take a bet on Google succeeding commercially on anything other than pervasive surveillance and adware.

seydor
0 replies
2h40m

hence why it's Open

jklinger410
0 replies
2h41m

They only have a head start, and the lead is closing

anonyfox
0 replies
1h28m

But GPT-4 is nearly a year old now; I'd wait for OpenAI's next release before passing judgement. Probably rather soon now, I would expect.

losvedir
5 replies
2h45m

If I understand correctly, they're releasing this for Pro but not Ultra, which I think is akin to GPT 3.5 vs 4? Sigh, the naming is confusing...

But my main takeaway is the huge context window! Up to a million, with more than 100k tokens right now? Even just GPT 3.5 level prediction with such a huge context window opens up a lot of interesting capabilities. RAG can be super powerful with that much to work with.

danpalmer
2 replies
2h21m

The announcement suggests that 1.5 Pro is similar to 1.0 Ultra.

benopal64
1 replies
2h12m

I'm reaching a bit, but I think it's a bit of a marketing technique. Comparing Pro 1.5 to the Ultra 1.0 model seems to imply that they will be releasing an Ultra 1.5 model, which will presumably have similar characteristics to the new Pro 1.5 (MoE architecture w/ a huge context window).

danpalmer
0 replies
2h3m

Apparently the technical report implies that Ultra 1.5 is a step up again. I'm not sure it's just context length; that seems to be orthogonal in everything I've read so far.

ygouzerh
0 replies
46m

So Pro and Ultra are, from my understanding, linked to the number of parameters. More parameters means more reasoning capability, but more compute needed.

So Pro is like the light and fast version and Ultra the advanced and expensive one.

cchance
0 replies
45m

They're sizes.

Nano/Pro/Ultra are model SIZES; 1.0/1.5 are generations of the architecture.

pryelluw
4 replies
2h43m

Gemini (or whatever google ai) will be all about ads. I’m not adopting this shit. Their whole business model is ads. Why would I adopt a product from a company that only cares about selling more ads?

Alifatisk
2 replies
2h21m

Google One's business model is not ads?

I mention Google One because you can access Gemini Ultra through it.

imp0cat
1 replies
1h45m

All their services are just a way to get more information about their users so they can serve them ads.

Those Gemini queries will be no exception.

sodality2
0 replies
1h7m

Not true - Gemini looks to be marketed towards companies, where it's far more profitable to just charge thousands of dollars. Ads wouldn't fund AI usage anyway; GPUs are extremely expensive (even Google's fancy TPUs).

snapcaster
0 replies
1h3m

Agreed, people continually forget that Google has fundamentally failed at everything besides selling ads despite decades of moonshots and other attempts to shift the business. Very skeptical that any company getting 80% revenue from ads will be able to resist the pressure to advertise

prakhar897
4 replies
2h55m

Can anyone explain how context length is tested? Do they prompt something like:

"Remember val="XXXX" .........10M tokens later....... Print val"

halflings
0 replies
2h17m

Yep that's pretty much it! That's what they call needle in a haystack. See: https://github.com/gkamradt/LLMTest_NeedleInAHaystack
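
A minimal sketch of what such a probe looks like (illustrative only; the commented-out generate() call is a stand-in for whichever model API is being tested, not a real library function):

    def build_haystack_prompt(filler_sentences, needle, depth):
        """Insert the needle at a relative depth (0.0 = start, 1.0 = end) of the filler text."""
        idx = int(len(filler_sentences) * depth)
        parts = filler_sentences[:idx] + [needle] + filler_sentences[idx:]
        return " ".join(parts) + "\n\nQuestion: what is the magic number mentioned above? Answer:"

    filler = ["The quick brown fox jumps over the lazy dog."] * 50_000  # pad to the target token count
    needle = "The magic number is 48151623."

    for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
        prompt = build_haystack_prompt(filler, needle, depth)
        print(depth, len(prompt))                # sanity check of prompt size
        # answer = generate(prompt)              # hypothetical model call
        # print(depth, "48151623" in answer)     # score retrieval at each depth

Sweeping the needle position and the document length gives the accuracy grid these reports usually show.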

cchance
0 replies
42m

Yep, they hide things throughout the prompt and then ask about that specific thing. Imagine hiding passwords in a giant block of text and then asking, 10 million tokens later, what Bob's password was.

According to this it's retrieving with 99% accuracy, which if you think about it is NUTS. Can you imagine reading 22 1,000-page books and remembering nearly every single word that was said? lol

blovescoffee
0 replies
2h46m

Very simplified: there are arrays (matrices) of length 10M inside the model.

It’s difficult to make that array longer because training time explodes.

NhanH
0 replies
2h49m

Yep, that’s actually a common one

foliveira
4 replies
2h48m

"Gemini 1.5 Pro (...) matches or surpasses Gemini 1.0 Ultra’s state-of-the-art performance across a broad set of benchmarks."

So Pro is better than Ultra, but only if the version numbers are higher?

renewiltord
2 replies
1h35m

Isn't that usually the case with many products? Like the M3 Pro CPU in the new Macs is more powerful than the M1 Max in the old Macs.

Nano < Pro < Ultra is the ordering within a revision; for their LLMs it's a size thing. Then there are newer releases of Nano, Pro, and Ultra, so some newer Pro might be better than some older Ultra.

A lot of people seem confused about this but it feels so easy to understand that it's confusing to me that anyone could have trouble.

devindotcom
1 replies
35m

Apple didn't release the M3 Pro a week after the M1 Max

renewiltord
0 replies
27m

Adam Osborne’s wife was one of my dad’s patients so I’m not unacquainted with the risk of early announcements. But surely they do not prevent comprehension.

denysvitali
0 replies
2h4m

Yes, but you'd have to wait for Gemini Pro Max next year to see the real improvements

royletron
2 replies
2h43m

Is there a reason this isn't available in the UK/France/Germany/Spain but is available in Jersey... and Tuvalu?

vibrolax
0 replies
2h30m

Probably because EU/national governments have regulations with respect to the safety and privacy of the users, and the purveyors must evaluate the performance of their products against the regulatory standards.

onlyrealcuzzo
0 replies
2h31m

EU regulations and fines.

kaspermarstal
2 replies
2h47m

Incredible. RAG will be obsolete in a year or two.

jeanloolz
0 replies
1h23m

Obsolete only if you don't take cost into consideration. Pushing 10 million tokens through every layer of the LLM is going to cost a lot of money each time. At GPT-4 rates that could mean on the order of $200 per inference.
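
Rough back-of-the-envelope, assuming GPT-4-era input pricing (an assumption for illustration; Gemini 1.5 pricing hasn't been published):

    # Assumed prices, not published Gemini rates.
    PRICE_PER_1K_INPUT = 0.01   # USD; roughly GPT-4 Turbo input pricing
    for tokens in (1_000_000, 10_000_000):
        print(f"{tokens:>10,} tokens -> ~${tokens / 1000 * PRICE_PER_1K_INPUT:,.0f} per prompt")
    # ~$10 for 1M and ~$100 for 10M at this rate; at older GPT-4 pricing
    # ($0.03/1K input) the 10M prompt would be ~$300, the same ballpark as above.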

hackernoteng
0 replies
1h58m

It's already obsolete. It doesn't work except for trivial cases which have no real value.

hamburga
2 replies
3h3m

“One of the key differentiators of this model is its incredibly long context capabilities, supporting millions of tokens of multimodal input. The multimodal capabilities of the model means you can interact in sophisticated ways with entire books, very long document collections, codebases of hundreds of thousands of lines across hundreds of files, full movies, entire podcast series, and more.”

skywhopper
1 replies
2h57m

This is nice, but it’s hard to judge how nice without knowing more about how much compute and memory is involved in that level of processing. Obviously Google isn’t going to tell us, but without having some idea it’s impossible to judge whether this is an economically sustainable technology on which to start building dependencies in my own business.

criddell
0 replies
2h47m

Sustainable? The countdown to cancellation on this project is already underway.

"Does it make sense today?" is really the only question you can ask and then build dependencies with the understanding that the entire thing will go away in 3-7 years.

golergka
2 replies
2h44m

In one of the demos, it successfully navigates a threejs demo and finds the place to change in response to a request.

How long until it shows similar results on middle-sized and large codebases? And do the job adequately?

simon_kun
0 replies
1h17m

Today.

kypro
0 replies
1h42m

1-2 years probably. There will still be a question around who determines what "adequately" is for a while though. Presumably even if an LLM can do something in theory you wouldn't actually want it doing anything without human oversight.

And we should keep in mind that understanding a code change in depth is often just as much work as making the change. When reviewing PRs I don't really know exactly what every change is doing. I certainly haven't tested it to be 100% certain I understand it fully. I'm just checking that the logic looks mostly right and that I don't see anything clearly wrong, and even then I'll often need to ask for clarification on why something was done.

I can't imagine LLMs being used in most large code bases for a while yet. They'd probably need to be 99.9% reliable before we can start trusting them to make changes without verifying every line.

cubefox
2 replies
2h7m

I think Anthropic and OpenAI could also have offered a one million token context window a while ago. The relevant architecture breakthrough was probably when a linear increase in context length only required a linear increase in inference compute instead of a quadratic one. Anthropic and then OpenAI achieved linear context compute scaling before an architecture for it was published publicly (the Mamba paper).
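
To make the difference concrete, a simplified per-layer FLOP comparison (the constants and the assumed hidden size are made up purely for illustration):

    d_model = 8192  # assumed hidden size, for illustration only

    def quadratic_attention_flops(n):
        return 2 * n * n * d_model   # ~O(n^2 * d): every token attends to every other token

    def linear_attention_flops(n):
        return 2 * n * d_model       # ~O(n * d): e.g. state-space / recurrent-style mixing

    for n in (128_000, 1_000_000, 10_000_000):
        q, l = quadratic_attention_flops(n), linear_attention_flops(n)
        print(f"n={n:>10,}  quadratic={q:.2e}  linear={l:.2e}  ratio={q // l:,}x")

The ratio grows with n, which is why a 1M-10M window is a very different proposition under quadratic attention than under a linear-compute architecture.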

bearjaws
1 replies
2h3m

The problem is, the 128k window performed terribly and showed that attention was mostly limited to the first and last 20%.

Increasing it to 1M just means even more data is ignored.

cubefox
0 replies
1h50m

Maybe their architecture wasn't as good as Mamba, and Google could use the better architecture thanks to being late to the game...

technics256
1 replies
1h59m

Does anyone actually have access to Ultra yet? It's a lame blog post where it says "it's available!" but the fine print says "by whitelist".

Ok, whatever that means.

OpenAI at least releases it all at once, to everyone.

Szpadel
0 replies
1h40m

Oh, OpenAI had a lot of waitlists too: the GPT-4 API, large-context versions, etc.

sonium
1 replies
1h58m

I just watched the demo with the Apollo 11 transcript. (sidenote: maybe Gemini is named after the space program?).

Wouldn't the transcript, or at least a timeline of Apollo 11, be part of the training corpus? So even without the 400 pages in the context window, just given the drawing, I would assume a prompt like "In the context of Apollo 11, what moment does the drawing refer to?" would yield the same result.

technics256
0 replies
1h14m

Gemini is named that way because of the collaboration between Google Brain and DeepMind.

simonw
1 replies
2h50m

I'd love to know how much a 1 million token prompt is likely to cost - both in terms of cash and in terms of raw energy usage.

bearjaws
0 replies
2h4m

I cannot emphasize this enough: even with the improvements in context handling, I imagine 128k tokens cost as much as 16k tokens did previously.

So 1M tokens is going to be astronomical.

killthebuddha
1 replies
1h20m

10M tokens is an absolute game changer, especially if there's no noticeable decay in quality with prompt size. We're going to see things like entire domain specific languages embedded in prompts. IMO people will start thinking of the prompt itself as a sort of runtime rather than a static input.

Back when OpenAI still supported raw text completion with text-davinci-003 I spent some time experimenting with tiny prompt-embedded DSLs. The results were very, very, interesting IMO. In a lot of ways, text-davinci-003 with embedded functions still feels to me like the "smartest" language model I've ever interacted with.

I'm not sure how close we are to "superintelligence" but for baseline general intelligence we very well could have already made the prerequisite technological breakthroughs.
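
As a toy illustration of the kind of thing that worked well there (the complete() call is hypothetical, not a real API; the point is that the "interpreter" lives entirely in the prompt):

    DSL_PROMPT = """\
    You are an interpreter for a tiny stack language.
    Instructions: PUSH <n>, ADD, MUL, PRINT (prints the top of the stack).

    Program:
    PUSH 2
    PUSH 3
    ADD
    PUSH 4
    MUL
    PRINT

    Output:
    """

    # completion = complete(DSL_PROMPT, stop=["\n\n"])   # hypothetical completion call
    # A capable completion model should answer "20" ((2 + 3) * 4).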

empath-nirvana
0 replies
17m

It's pretty slow, though (it looks like up to 60 seconds for some of the answers), and uses god knows how much compute, so there are probably going to be some trade-offs: you'll want to make sure that much context is actually useful for what you want.

htrp
1 replies
1h57m

Gemini 1.5 delivers dramatically enhanced performance. It represents a step change in our approach, building upon research and engineering innovations across nearly every part of our foundation model development and infrastructure. This includes making Gemini 1.5 more efficient to train and serve, with a new Mixture-of-Experts (MoE) architecture.

Looks like they fine-tuned across use cases and grabbed the Mixtral architecture?
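
For anyone unfamiliar with the term, a generic top-k MoE routing sketch in numpy; to be clear, this illustrates the general technique, not Gemini's (or Mixtral's) actual implementation:

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, n_experts, top_k = 64, 8, 2
    router_w = rng.normal(size=(d_model, n_experts))                            # learned routing weights
    experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]   # stand-in expert FFNs

    def moe_layer(x):
        """Route each token to its top-k experts and mix their outputs by router weight."""
        logits = x @ router_w                           # (tokens, n_experts)
        top = np.argsort(logits, axis=-1)[:, -top_k:]   # chosen experts per token
        out = np.zeros_like(x)
        for t in range(x.shape[0]):
            weights = np.exp(logits[t, top[t]])
            weights /= weights.sum()                    # softmax over the selected experts only
            for w, e in zip(weights, top[t]):
                out[t] += w * (x[t] @ experts[e])       # only top_k of n_experts run per token
        return out

    tokens = rng.normal(size=(4, d_model))
    print(moe_layer(tokens).shape)  # (4, 64): same output shape, ~top_k/n_experts of the FFN compute

The appeal is that total parameters can grow with the number of experts while per-token compute only grows with top_k.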

sebzim4500
0 replies
23m

There's no way that's all it is; scaling Mixtral to a context length of 10M while maintaining any level of reasoning ability would be extremely slow. If the only purpose of the model was to produce this report then maybe that's possible, but if they plan on actually deploying this to end users then there is no way they can run quadratic attention on 10M tokens.

guybedo
1 replies
2h52m

Looks interesting enough that I wanted to give Gemini a try and join the waitlist.

And I thought it would be easy. What a rookie mistake.

Looks like "France" isn't on the list of available regions for AI Studio?

Now I'm trying to use Vertex AI (not even sure what the difference with AI Studio is), but it seems it's available there.

So far I've been struggling for 15 minutes through a maze of Google Cloud pages: console, docs, signups. No end in sight; looks like I won't be able to try it out.

IanCal
0 replies
2h39m

It's not available outside of a private preview yet. The page says you can use 1.0 ultra in vertex but it's not available to me in the UK.

I can't get on the waitlist, because the waitlist link redirects to aistudio and I can't use that.

I should stop expecting that I can use literally anything google announces.

cubefox
1 replies
1h53m

The whitepaper says the Buster Keaton film was reduced to 1 FPS before being fed in. Apparently multi-modal language models can only read individual pictures, so videos have to be reduced to a series of frames. I assume animal brains are more efficient than that. E.g. by only feeding the "changes/difference over time" instead of a sequence of time slices.

riku_iki
0 replies
1h12m

It will probably eventually be improved by adding some encoder on top of the LLM, which encodes 60 frames into 1 while attempting to preserve information.

SushiHippie
1 replies
2h56m

Does this mean gemini ultra 1.0 -> gemini ultra 1.5 is the same as gpt-4 -> gpt-4-turbo?

hackerlight
0 replies
2h46m

There's no Gemini Ultra 1.5 yet. Gemini Pro 1.5 is a smaller model than Gemini Ultra 1.0.

zippothrowaway
0 replies
2h5m

I've always been suspicious of any announcement from Demis Hassabis since way back in his video game days, when he did a monthly article in Edge magazine about the game he was developing. "Infinite Polygons" became a running joke in the industry because of his obvious snake oil. The game itself, Republic [1], was an uninteresting failure.

He learned how to promote himself from working for Peter "Project Milo" Molyneux and I see similar patterns of hype.

[1] https://en.wikipedia.org/wiki/Republic:_The_Revolution#Marke...

uptownfunk
0 replies
2h28m

Google is a public company; anything and everything will be scrutinized very heavily by shareholders. Of course, how Zuck operates is very different from how Sundar does.

What they're doing with their free cash is my question. Are they waiting for the LLM bubble to pop to buy some of these companies at a discount?

thiago_fm
0 replies
2h47m

I like that they're rushing this out rather than holding it back to call it Gemini 2 or even really releasing it; to me it looks like they're eager to share progress.

I hope they do a good job, and that once OpenAI releases GPT-5 their offerings are competitive with it; that will be better for everyone.

seydor
0 replies
2h42m

Onwards to a billion tokens

robertlagrant
0 replies
29m

Slightly surprisingly I can't get to AI Studio from the UK. It is available in quite a few countries, but not here.

ranulo
0 replies
2h57m

This new generation also delivers a breakthrough in long-context understanding. We’ve been able to significantly increase the amount of information our models can process — running up to 1 million tokens consistently, achieving the longest context window of any large-scale foundation model yet.

Sweet, this opens up so many possibilities.

qwertox
0 replies
1h35m

As a sidenote, it's worth clicking the play button and then checking how they're highlighting the current paragraph and word in the inspector.

processing
0 replies
38m

Just wade through documentation to access it?

Clicking on the AI Studio link doesn't show me the app page; it redirects to a document on early access. I do as required, go back, and click the AI Studio link again, and I'm redirected to the same document on turning on early access.

Frustrating.

petargyurov
0 replies
1h7m

Version number suggests they're waiting to announce something bigger already?

luke-stanley
0 replies
47m

Still no Ultra model API available to UK devs? Considering Deepmind's London base, this is kinda strange. Maybe they could ask Ultra how to roll it out faster?

llm_trw
0 replies
1h38m

Yeah. I'll believe that when I can use it.

iamgopal
0 replies
1h53m

The AI race is amazing: Nvidia is reaping the benefits now, but soon the whole world will.

gpjanik
0 replies
2h45m

Zero trust in what they put out until I see it live. After the last "launch" video, which was fundamentally a marketing edit not showing the real product, I don't trust anything coming out of Google that isn't an instantly testable input form.

freedomben
0 replies
1h52m

Our teams continue pushing the frontiers of our latest models with safety at the core.

They're not kidding, Gemini (at least what's currently available) is so safe that it's not all that useful.

The "safety" permeates areas where you wouldn't even expect it, like refusing to answer questions about "unsafe" memory management in C. It interjects lectures about safety in answers when you didn't even ask it to do that in the question.

For example, I clicked on one of the four example questions that Gemini proposes to help you get started and it was something like "Write an SMS calling in sick. It's a big presentation day and I'm sad to let the team down." Gemini decided to tell me that it can't impersonate positions of trust like medical professionals or employers (which is not at all what I was asking it to do).

The other things I asked it, it gave me wrong and obviously wrong answers. The funniest (though glad it was obviously wrong) was when I asked it "I'm flying from Karachi to Denver. Will I need to pick up my bags in Newark?" and it told me "no, because Karachi to Newark is a domestic flight"

Unless they stop putting "safety at the core," or figure out how to do it in a way that isn't unnecessarily inhibiting, annoying, and frankly insulting (protip: humans don't like to be accused of asking for unethical things, especially when they weren't asking for them; when other humans do that to us, we call it assuming the worst, and it's a negative personality trait), any announcements/releases/breakthroughs from Google are going to be a "meh" for me.

fernandotakai
0 replies
2h41m

I saw this announcement on Twitter and was excited to check it out, only to see that "we're offering a limited preview of 1.5 Pro to developers and enterprise customers via AI Studio and Vertex AI".

Please, Google, only announce things when people can actually use them.

eigenvalue
0 replies
1h41m

Based on what I've seen so far, I think the probability that this is actually better than GPT4 on the kind of real world coding tasks that I use it for is less than 1%. Literally everything from Google on this has been vaporware or laughably bad in actual practice in my personal experience. Which is totally insane to me given their financial resources, human resources, and multi-year lead in AI/DL research, but that's what seems to have happened. I certainly hope that they can develop and actually release a capable model, but at this point, I think you have to be deeply skeptical of everything they say until such a model is available for real by the public and you can try it on actual, real tasks and not fake benchmark nonsense and waitlists.

dumbmachine
0 replies
1h28m

It would probably be cost-prohibitive to use the 10M context to its fullest each time.

I instead hope for an API to access the context as a datastore, so that, like RAG, we can control what to store, but unlike RAG, all the data stays within the context.

bloopernova
0 replies
1h7m

Hooray for competition.

aubanel
0 replies
1h34m

For reference, here is the technical report: https://storage.googleapis.com/deepmind-media/gemini/gemini_...

arange
0 replies
2h52m

The signup form on mobile is too big; the submit button doesn't fit :\

Yusefmosiah
0 replies
1h12m

I see a lot of talk about retrieval over long context. Some even think this replaces RAG.

I don't care if the model can tell me which page in the book or which code file has a particular concept. RAG already does this. I want the model to notice how a concept is distributed throughout a text, and be able to connect, compare, contrast, synthesize, and understand all the ways that a book touches on a theme, or to rewrite multiple code files in one pass, without introducing bugs.

How does Gemini 1.5's reasoning compare to GPT-4? GPT-4 already has superhuman memory; its bottleneck is its relatively weak reasoning.

Imnimo
0 replies
2h14m

This is the first time I've been legitimately impressed by one of Google's LLMs (with the obvious caveat that I'm taking the results reported in their tech report at face value).

EZ-E
0 replies
1h44m

Remember AI Dungeon and how frustrating it was that it would forget what happened previously? With a 10M context window, am I right to assume it would be possible to weave a story spanning multiple books' worth of content (more or less 1,400 pages)?

DidISayTooMuch
0 replies
1h37m

How can I fine-tune these models for my use? Their docs aren't clear on whether the Gemini models are fine-tunable.

CrypticShift
0 replies
2h16m

Most data accumulates gradually (e.g., one email at a time, one line of text at a time across various documents). Is this huge 10M-token context window relevant to a gradual, yet constant, influx of data (like a prompt over a whole Google Workspace)?

ChildOfChaos
0 replies
26m

Is this just more nonsense from Google, though? I expect big things from Google, but they need to shut up and actually release stuff instead of saying how amazing their stuff is and then releasing potato AI. Nothing they have done in the AI space recently has lived up to any of the hype; they should stay silent for a bit and then release something that kills GPT-4 if they honestly can, but instead they are just full of hype.

Alifatisk
0 replies
2h17m

I remember one of the biggest drawbacks of Google Bard was its heavily limited context window. I am glad Google is now actually delivering some exciting news with Gemini and this gigantic token count.

Sure, it's a bummer that they slap on the "Join the waitlist", but it's still interesting to read about their progress and competition with ClosedAI (OpenAI).

One last thing I hope they fix is the heavy moral and ethical guardrails; sometimes I can barely ask proper questions without triggering Gemini to educate me about what's right and wrong. And when I try the same prompt with ChatGPT and Bing AI, they happily answer.