Looking at MMLU and other benchmarks, this essentially means sub-second first-token latency with Llama 3 70B quality (but not GPT-4 / Opus), native multimodality, and 1M context.
Not bad compared to rolling your own, but among frontier models the main competitive differentiator was native multimodality. With the release of GPT-4o I'm not clear on why an organization not bound to GCP would pick Gemini. 128k context (4o) is fine unless you're processing whole books/movies at once. Is anyone doing this at scale in a way that can't be filtered down from 1M to 100k?
With 1M tokens you can dump 2000 pages of documents into the context window before starting a chat.
Gemini's strength isn't in being able to answer logic puzzles; its strength is its context length. Studying for an exam? Just put the entire textbook in the chat. Need to use a dead language for an old test system with no information on the internet? Drop the 1300-page reference manual in and ask away.
How much do those input tokens cost?
According to https://ai.google.dev/pricing it's $0.70/million input tokens (for a long context). That will be per-exchange, so every little back and forth will cost around that much (if you're using a substantial portion of the context window).
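Rough math in Python, assuming that $0.70/1M long-context input rate and a chat that resends the full ~1M-token context on every exchange (output tokens ignored):

    # Cost of a chat session that keeps ~1M tokens of documents in the prompt.
    # Assumes the $0.70/1M long-context input price above; output cost ignored.
    INPUT_PRICE_PER_TOKEN = 0.70 / 1_000_000

    context_tokens = 1_000_000   # documents kept in the prompt
    turns = 20                   # back-and-forth exchanges in one session

    # Without caching, the whole context is re-billed as input on every turn.
    per_turn = context_tokens * INPUT_PRICE_PER_TOKEN
    print(f"~${per_turn:.2f} per exchange, ~${per_turn * turns:.2f} for {turns} turns")
    # -> ~$0.70 per exchange, ~$14.00 for 20 turns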
And while I haven't tested Gemini, most LLMs get increasingly wonky as the context grows: more likely to fixate, more likely to forget instructions.
That big context window could definitely be great for certain tasks (especially information extraction), but it doesn't feel like a generally useful feature.
Is there a way to amortize that cost over several queries, i.e. "pre-bake" a document into a context persisted in some form to allow cheaper follow-up queries about it?
They announced that today, calling it "context caching" - but it looks like it's only going to be available for Gemini 1.5 Pro, not for Gemini Flash.
It reduces prompt costs by half for those shared prefix tokens, but you have to pay $4.50/million tokens/hour to keep that cache warm - so probably not a useful optimization for most lower traffic applications.
https://ai.google.dev/gemini-api/docs/caching
That's on a model with a $3.50/1M input token cost, so half price on cached prefix tokens for $4.50/1M/hour breaks even at a little over 2.5 requests/hour using the cached prefix.
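Spelling that break-even out (a sketch in Python using the prices above, treating the caching discount as a flat 50% on cached prefix tokens):

    # Break-even for Gemini 1.5 Pro context caching, using the numbers above.
    INPUT_PRICE = 3.50 / 1_000_000     # $/token for normal input
    CACHED_PRICE = INPUT_PRICE / 2     # cached prefix tokens billed at half price
    STORAGE_PRICE = 4.50 / 1_000_000   # $/token/hour to keep the cache warm

    prefix_tokens = 1_000_000          # size of the shared, cached prefix

    saving_per_request = prefix_tokens * (INPUT_PRICE - CACHED_PRICE)  # $1.75
    storage_per_hour = prefix_tokens * STORAGE_PRICE                   # $4.50

    print(f"Break-even at ~{storage_per_hour / saving_per_request:.2f} requests/hour")
    # -> ~2.57 requests/hour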
Though I'm not familiar with the specifics, they announced "context caching" today.
Depending on the output window limit, the first query could be something like: "Summarize this down to its essential details" -- then use that to feed future queries.
Tediously, it would be possible to do this chapter by chapter to work around the output limit, building up a condensed version to use in future inputs.
Of course, the summary might not fulfill the same functionality as the original source document. YMMV
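A rough sketch of that chapter-by-chapter "pre-bake" in Python; complete() is a placeholder for whatever LLM call you use, and the chunk size is arbitrary:

    # Summarize a big document chunk by chunk to stay under the output limit,
    # then reuse the stitched summary as cheap context for follow-up queries.
    def complete(prompt: str) -> str:
        """Placeholder for a call to your LLM of choice."""
        raise NotImplementedError

    def chunks(text: str, max_chars: int = 200_000) -> list[str]:
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    def prebake(document: str) -> str:
        partial = [
            complete(f"Summarize this down to its essential details:\n\n{c}")
            for c in chunks(document)
        ]
        return "\n\n".join(partial)

    def ask(baked_context: str, question: str) -> str:
        return complete(f"{baked_context}\n\nQuestion: {question}")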
That per-exchange context cost is what really puts me off using cloud LLMs for anything serious. I know batching and everything is needed in the data center, and keeping the KV cache around is important, but you basically need to fully take over a machine for an interactive session to get the context costs to scale with sequence length. So it's useful, but more in the local-LLaMA type situation if you want a conversation.
I wonder if we could implement the equivalent of JIT compilation, whereby context sequences that get repeatedly reused would be used for online fine-tuning.
No, but you can just cache the state after processing the prompt. https://github.com/ggerganov/llama.cpp/tree/master/examples/...
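For the local route, a minimal sketch with llama-cpp-python (assuming its LlamaRAMCache / set_cache API; the linked llama.cpp example does the same thing at the CLI level with an on-disk prompt cache):

    # Reuse the KV cache across calls so a large shared document prefix is only
    # processed once. Names assume llama-cpp-python; details may vary by version.
    from llama_cpp import Llama, LlamaRAMCache

    llm = Llama(model_path="model.gguf", n_ctx=32768)  # hypothetical model file
    llm.set_cache(LlamaRAMCache())                     # keep prompt state between calls

    document = open("reference_manual.txt").read()     # hypothetical big document

    def ask(question: str) -> str:
        # Identical prefix on every call -> the cached KV state is reused, so only
        # the question and answer tokens get processed each time.
        prompt = f"Document:\n{document}\n\nQuestion: {question}\nAnswer:"
        out = llm(prompt, max_tokens=256)
        return out["choices"][0]["text"]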
Cloud pricing makes it impossible for regular developers to build any app that requires generous user prompting.
$20 of hosting can serve thousands of users per month; a $20 LLM subscription serves just one person. This is fucking impossible.
Can anyone speculate on how G arrived at this price, and perhaps how it contrasts with how OAI arrived at its updated pricing? (realizing it can't be held up directly against GPT x at the moment)
Isn't there retrieval degradation with such a large context size? I would still think that a RAG system on 128K is better than no RAG + a 1M context window, no? (assuming text only)
Absolutely. Gemini results tend to drop off after 128k tokens according to RULER: https://github.com/hsiehjackson/RULER
Not sure why you've been downvoted. Needle in a haystack testing exists for a reason
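For what it's worth, the "filter 1M down to ~100K" approach mentioned upthread is just retrieval over chunks; a bare-bones sketch in Python with a placeholder embedding function (no particular library assumed):

    # Filter a huge corpus down to a ~100K-token budget via embedding retrieval
    # instead of stuffing all 1M tokens into the prompt. embed() is a placeholder.
    import math

    def embed(text: str) -> list[float]:
        raise NotImplementedError  # call whatever embedding model you use

    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    def top_chunks(chunks: list[str], query: str, token_budget: int = 100_000) -> list[str]:
        q = embed(query)
        ranked = sorted(chunks, key=lambda c: cosine(embed(c), q), reverse=True)
        picked, used = [], 0
        for c in ranked:
            est_tokens = len(c) // 4   # crude chars-per-token estimate
            if used + est_tokens > token_budget:
                break
            picked.append(c)
            used += est_tokens
        return picked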
You don't really use it, right? There's no way to debug if you're doing it like this. Also, the accuracy isn't high, and it can't answer complicated questions, making it quite useless for the cost.
Please make your substantive points without crossing into personal attack. Your comment would be fine without the first sentence.
https://news.ycombinator.com/newsguidelines.html
I tried to use the 1M tokens with Gemini a couple of months ago. It either crashed or responded *very* slowly and then crashed.
I tried half a dozen times and gave up; I hope this one is faster and more stable.
Context length isn’t the same as context volume I think. Just input the 1m tokens slower, it’ll still be in context.
I think that's a bit like asking why someone would need a 1 GB Gmail account when a 50 MB Yahoo account is clearly enough.
It means you can dump context without thinking twice and without needing to hack together solutions to deal with context overflow etc.
And given that most use cases deal with text rather than multimodal input, the advantage seems pretty clear imo.
Long context is a little bit different from extra email storage. Having 1 GB of storage instead of 50 MB has essentially no downside to the user experience.
But submitting 1M input tokens instead of 100k input tokens:
- Causes your costs to go up ~10x
- Causes your latency to go up ~10x (or between 1x and 10x)
- Can result in worse answers (especially if the model gets distracted by irrelevant info)
So longer context is great, yes, but it's not a no-brainer like more email storage. It brings costs. And whether those costs are worth it depends on what you're doing.
There's no way it's Llama 3 70b quality.
I've been trying to work Gemini 1.5 Pro into our workstream for all kinds of stuff and it is so bad. Unbelievable amount of hallucinations, especially when you introduce video or audio.
I'm not sure I can think of a single use case where a high hallucination tiny multimodal model is practical in most businesses. Without reliability it's just a toy.
Seconding this. Gemini 1.5 is comically bad at basic tasks that GPT4 breezes through, not to mention GPT4o.
GPT-3.5 has 0.5s average first-token latency and Claude 3 Haiku 0.4s.
1M is great for multimodal agentic workflows where you need to keep track of history.
I guess it depends on what you want to do.
E.g. I want to send an entire code base in a context. It might not fit into 128k.
Filtering down is a complex task by itself. It's much easier to call a single API.
Regarding quality of responses, I've seen both disappointing and brilliant responses from Gemini. So it's maybe worth trying. But it will probably take several iterations until it can be relied upon.
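As a point of reference, "send an entire code base in a context" is basically concatenation plus a token estimate; a quick sketch (the file filters and the 4-chars-per-token heuristic are arbitrary):

    # Pack a code base into one prompt and check which context window it fits.
    from pathlib import Path

    def pack_repo(root: str, suffixes=(".py", ".md")) -> str:
        parts = []
        for path in sorted(Path(root).rglob("*")):
            if path.is_file() and path.suffix in suffixes:
                parts.append(f"### {path}\n{path.read_text(errors='ignore')}")
        return "\n\n".join(parts)

    prompt = pack_repo("my_project")    # hypothetical repo path
    approx_tokens = len(prompt) // 4    # rough heuristic
    print(f"~{approx_tokens:,} tokens; fits 128K: {approx_tokens <= 128_000}, "
          f"fits 1M: {approx_tokens <= 1_000_000}")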
Price, for anything that doesn't need GPT-4 quality, particularly multimodal tasks, where GPT-4o is otherwise OpenAI's cheapest option. GPT-3.5-Turbo, which itself is 1/10 the cost of GPT-4o, is $0.50/1M tokens on input and $1.50/1M on output, with a 16K context window. Gemini 1.5 Flash, for prompts up to 128K, is $0.35/1M tokens on input and $0.53/1M tokens on output.
For tasks that require multimodality but not GPT-4 smarts (which I think includes a lot of document-processing tasks, for which GPT-4 with Vision and now GPT-4o are magical but pricey), Gemini Flash looks like close to a 95% price cut.
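Putting those rates side by side for a small document-processing call (10K tokens in, 1K out, so it fits GPT-3.5-Turbo's 16K window; prices as quoted above, in $/1M tokens):

    # Per-call cost comparison using the quoted rates.
    PRICES = {
        "gpt-3.5-turbo":    {"in": 0.50, "out": 1.50},
        "gemini-1.5-flash": {"in": 0.35, "out": 0.53},  # prompts up to 128K
    }

    def cost(model: str, input_tokens: int, output_tokens: int) -> float:
        p = PRICES[model]
        return (input_tokens * p["in"] + output_tokens * p["out"]) / 1_000_000

    for model in PRICES:
        print(f"{model}: ${cost(model, 10_000, 1_000):.5f} per call")
    # gpt-3.5-turbo:    $0.00650
    # gemini-1.5-flash: $0.00403

And unlike GPT-3.5-Turbo, Flash takes images and audio in the same call, which is where the comparison to GPT-4o's pricing comes in.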
Price.