Google's Gemini AI caught scanning Google Drive PDF files without permission

vouaobrasil
26 replies
6h9m

All AI should be opt-in, which includes both training and scanning. You should have to check a box that says "I would like to use AI features", and the accompanying text should be crystal clear what that means.

This should be mandatory, enforced, and come with strict fines for companies that do not comply.

drzaiusx11
6 replies
6h3m

We also need a robots.txt extension for excluding publicly accessible files from AI training datasets. IIRC there's a nascent ai.txt, but I'm not sure anyone follows it (yet).
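
For context, the closest thing that exists today is plain robots.txt with the AI-specific agent tokens some vendors have published; a sketch (compliance is entirely voluntary on the crawler's side):

    # robots.txt: opt out of known AI training crawlers
    User-agent: GPTBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: CCBot
    Disallow: /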

chias
2 replies
4h34m

I don't think `robots.txt` works because the crawlers want to be nice, or "socially responsible", or anything. So I don't hold out much hope that anything similar can happen again.

Early search engines had a problem: when they crawled willy-nilly, people would block their IP addresses. Inventing the concept of `robots.txt` worked because search engines wanted something: to avoid IP blocks, which they couldn't easily get around. And site hosts generally wanted to be indexed.

Today it's WAY harder to block relevant IP addresses, so site hosts generally can't easily block a crawler that wants its data: there is no compromise to be found here, and the imbalance of power is much stronger. And many site hosts generally don't want to be crawled for free for AI purposes at all. Pretty much anyone who sets up an `ai.txt` uses it to just reject all crawling, so there is no reason for any crawler to respect it.

mtnGoat
1 replies
3h40m

Google ignores robots.txt, as do many others. Try it yourself: set up a honeypot URL, don't even link to it, just put it in robots.txt. Googlebot will visit it at some point.
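
A minimal version of that experiment (the trap path is made up; the log path assumes a stock Apache setup):

    # robots.txt: a trap path that is never linked from anywhere
    User-agent: *
    Disallow: /honeypot-f8a2/

    # later, check the access log for anyone fetching it anyway:
    grep "/honeypot-f8a2/" /var/log/apache2/access.log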

JohnFen
0 replies
1h51m

I discovered this years ago, and it's what made me stop bothering with robots.txt and start blocking all the crawlers I can using .htaccess, including Google's.

That's a game of whack-a-mole that always lets a few miscreants through. I used to find that an acceptable amount of error until I learned that crawlers were gathering data to be used to train LLMs. That's a situation where even a single bot getting through is very problematic.

I still haven't found a solution to that aside from no longer allowing access to my sites without an account.
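
For reference, the .htaccess approach might look something like this (a sketch; the user-agent list is illustrative and, as noted, never complete):

    # .htaccess: return 403 Forbidden to matching crawler user agents
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot|Bytespider|Googlebot) [NC]
    RewriteRule .* - [F,L]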

pennomi
0 replies
5h53m

I think the closest thing is the NoAI and NoImageAI meta tags, which have some relatively prominent adoption.
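
For anyone unfamiliar, these are page-level hints, typically written like this (honoring them is, again, voluntary on the crawler's part):

    <meta name="robots" content="noai, noimageai">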

_joel
0 replies
3h25m

Haven't some companies explicitly ignored robots.txt to scrape sites more quickly (and pissed off a number of people in the process)?

JohnFen
0 replies
1h53m

robots.txt is useless as a defense mechanism (that isn't what it's trying to be). Taking the same approach for AI would likewise not be useful as a defense mechanism.

crazygringo
5 replies
2h11m

Training I can understand, but why scanning?

It's literally just running an algorithm over your data and spitting out the results for you. Fundamentally it's no different from spellcheck, or automatically creating a table of contents from header styles.

As long as the results stay private to you (which in this case, they are), I don't see what the concern is. The fact that the algorithm is LLM-based has zero relevance regarding privacy or security.

theolivenbaum
2 replies
2h10m

Except that there's still a grey area on who owns the copyright of the generated text, and they might be able to use the output without you knowing.

Suppafly
1 replies
2h1m

Except that's not what's happening, so why pretend otherwise?

ipaddr
0 replies
1h42m

Because tomorrow it will with little or no discussion

vouaobrasil
0 replies
2m

> It's literally just running an algorithm over your data and spitting out the results for you.

I don't want any results from AI. I don't even want to see them. And there is too much of a grey area. What if they use how I use the results to improve their AI? I hate AI and want nothing to do with its automations.

If I want a document summarized, I will read it myself. I still want to be human and do things AT A REASONABLE LEVEL with my own two hands.

JohnFen
0 replies
1h56m

I think that vouaobrasil was talking about scanning on the behalf of others, not scanning that you're doing on your own data. Scanning your own stuff is automatically and naturally an opt-in situation. You've consciously chosen for it to happen.

DebtDeflation
5 replies
4h13m

AI is becoming the new social media in that users are NOT the customer; they are the product. Instead of generating data for a social media company to use to sell ads, you are generating data to train their AI; in exchange, you get to use their service for free.

rurp
1 replies
3h24m

The deal keeps getting worse, too. In addition to hoovering up your data for whatever products they want, Google has gotten more aggressive about pushing paid services on top of it. The number of up-sell nags and ads has increased significantly in the past couple of years. For a company like Google, that kind of monetization creep only gets worse over time.

DebtDeflation
0 replies
2h42m

Not surprising at all. Inference against foundation models is very expensive; training them is insanely expensive, orders of magnitude more than whatever was needed to run the AdWords business. I guess I should amend my original post to "in exchange you get to use their service at a somewhat subsidized price".

Rinzler89
1 replies
3h38m

>you are generating data to train their AI

That's why I seriously recommend everyone everywhere regularly replace their blinker fluid and such.

exe34
0 replies
35m

it's very important to replace your blinker fluids yearly, but also, polka dot paint comes in 5L tubs.

vouaobrasil
0 replies
3h34m

This should be illegal.

signatoremo
4 replies
5h46m

What is the privacy implication of AI training?

jerpint
2 replies
5h39m

Models can easily regurgitate training data verbatim, so anything private can in theory be accessed by anyone, even without proper access to that file.

brookst
1 replies
5h2m

This is partly true but less and less every day.

IMO the bigger concern is that this data is not just used to train models. It is stored, completely verbatim, in the training set data. They aren't pulling from PDFs in realtime during training runs; they're aggregating all of that text and storing it somewhere. And that somewhere is prone to employees viewing it, leaking to the internet, etc.

oblio
0 replies
4h53m

> This is partly true but less and less every day.

Isn't this like encryption, though?

I'm fairly sure that the cryptography community basically says: if someone has a copy of your encrypted data for a long time, the likelihood over time for them to be able to read it approaches 100%, regardless of the current security standard you're using.

Who could possibly guarantee that whatever LLM is safe now will be safe at all times over the next 5-10-20 years? Anyone who guarantees that is lying.

vouaobrasil
0 replies
5h43m

I care much more about whether my content is allowed to be used at all than about any privacy concerns. I simply don't want a single AI model to train on my content.

phendrenad2
0 replies
3h21m

By "scanning", what do you mean exactly? I assume you mean for non-training purposes, in other words simply ephemerally reading docs and providing summaries. Why exactly should that be regulated?

GaggiX
0 replies
6h0m

The feature was enabled by the author.

atum47
18 replies
5h45m

Every single week I have to decline enabling backup for my pictures on my Google Pixel. I decline it today; next week I open the app and the UI shows the backup option enabled, with a button saying "continue using the app with backup".

Somebody took the time to talk down my comment about this being a strategy to give their AI more training data. I continue to believe that if they have your data, they will use it.

jgalt212
5 replies
5h25m

I in no way want to absolve Google, but that's the case for so many app permissions on Android. Turn off notifications, and two weeks later the same app you turned off notifications for is once again sending you notifications. It's beyond a joke.

masalah
2 replies
4h51m

Can you share some apps where this happens for you? I have rather the opposite experience, where unused apps eventually lose their permissions.

no-reply
0 replies
4h40m

This is normal: with newer versions of Android (probably 10+), there is a feature that checks for and removes permissions from apps that have gone unused in the last X days.

According to the OP here, it does seem like a pain in the butt to disable - https://support.google.com/android/thread/268170076/android-...

jgalt212
0 replies
3h29m

Lyft and Uber

wafflemaker
0 replies
4h56m

You might have disabled one type of notification instead of all of them. Making sure I disable all notification types from an app usually works for me. What brand of phone are you using?

hanniabu
0 replies
4h52m

Make sure you also disable the ability for the app to change settings.

nickstinemates
4 replies
5h27m

This goes for every SaaS / cloud native company

I think there will be a real shift back to on-prem, with software delivered traditionally, due to an increase in shit like this (and also due to cost).

JumpCrisscross
3 replies
5h18m

> there will be a real shift back on prem with software

Not while we’re production constrained on the bleeding edge of GPUs.

ttul
1 replies
4h37m

… and that situation will persist until other vendors release consumer GPUs with significant VRAM. Nvidia craftily hamstrings the top consumer GPUs by restricting VRAM to 24GB; getting a bit more costs 3-5x as much. Only competition will fix this.

rtkwe
0 replies
4h16m

Even then NVIDIA has a pretty significant technology moat because most of the tools are built around and deeply integrate CUDA so moving off NVIDIA is a costly rewrite.

OtherShrezzing
0 replies
3h59m

How many SaaS / Cloud Native companies are really GPU constrained? The overwhelming majority of SaaS is a relatively trivial CRUD web-app interfacing with a database, performing some niche business automation, all of which would fit comfortably on a couple of physical servers.

switch007
2 replies
4h30m

GrapheneOS helps here by having no real backup solution at all. Google account is entirely optional

Pixels have first class support

You can also disable Network access to any app

(It's a buggy ride though and requires reading a lot of docs and forum posts)

neilv
1 replies
4h11m

How is it buggy? Are you using the Google Play Store?

I've been using GrapheneOS for years (Pixel 3 through 7), with only open source add-on apps and no Google Play Store, and it's been pretty solid. (Other than my carrier seeming to hate the 6a hardware or model specifically.)

switch007
0 replies
2h7m

Are you suggesting GrapheneOS is bug-free? ... https://github.com/GrapheneOS/os-issue-tracker/issues

I was referring to the overall experience, not the OS specifically ("it's a buggy ride": it = the ride, not GrapheneOS).

I imagine a lot of the issues are because of the apps not testing on GrapheneOS.

But I've had lots of little issues:

- Nova Launcher stopped working on a daily basis when pressing the right button (the 'overview' button). Interestingly, I had to kill the stock launcher app to fix it. Eventually had to revert to the stock launcher

- 1Password frequently doesn't trigger auto-fill in Vivaldi

- Occasionally on boot the SIM unlock doesn't trigger

- Camera crashing often (yes, "could be hardware"...I read the forums/GitHub issues)

More that I can't remember. It's a bit frustrating.

But don't get me wrong, I appreciate the project. I'm not going to go back to stock

mimimi31
1 replies
5h24m

I don't get those prompts with Google Photos. Have you tried selecting "Use without an account" in the account menu at the top right?

tyfon
0 replies
5h13m

Thank you, I didn't even consider this to be a possibility. I back up to my own storage and was annoyed by this message.

Untying photos from my Google account is even better!

netsec_burn
0 replies
5h13m

I fixed this by disabling the Photos app and using Google Gallery (on the Play store). It's the same thing as Photos for what I was using it for, without the online features.

Suppafly
0 replies
1h40m

> Somebody took the time to talk down my comment about this being a strategy to give their AI more training data.

Because that's an insane interpretation of what's happening.

Cthulhu_
14 replies
10h7m

Just reiterates that you don't own your data hosted on cloud providers; this time there's a clear sign, but I can guarantee that Google's systems read and aggregated data inside your private docs ages ago.

This concern was first raised when Gmail started, 20 years ago now; at the time people reeled at the idea of "google reads your emails to give you ads", but at the same time the 1 GB inbox and fresh UI was a compelling argument.

I think they learned from it, and Google Drive and co. were less "scary", or less overt about scanning the stuff you keep in them, also because they wanted to get that sweet corporate money.

Aurornis
6 replies
6h27m

> but I can guarantee that Google's systems read and aggregated data inside your private docs ages ago

That is how search works, yes.

But if you’re trying to imply that everyone’s private data was scraped and loaded into their LLM, then no, that’s obviously a conspiracy theory.

It’s incredible to me that people think Google has convinced tens of thousands of engineers to quietly keep secret an epic conspiracy theory about abusing everyone’s private data.

fifteen1506
1 replies
5h57m

It’s incredible to me that people think Google has convinced tens of thousands of engineers to quietly keep secret an epic conspiracy theory about abusing everyone’s private data.

With NDAs being all over the place, it does strike me as doable.

NDAs should have a time limit.

Additionally, no-one in their right mind will be a whistleblower nowadays.

digitalsushi
0 replies
5h47m

Isn't that how it's supposed to work? We just need >0 people to blow a whistle if a whistle needs blowing. We don't need to rely on people who fear for their jobs, so long as >0 people are willing to sacrifice their careers/lives when there's some injustice so great it's worth dying for.

dmvdoug
1 replies
5h57m

I dunno, bro, software engineers have repeatedly shown a total lack of wider judgment in these contexts over the years. Not to say there is, in fact, some kind of "epic conspiracy," just that SWEs appear not to take much time to consider just what it is their code ends up being used for.

Incidentally, that would be one way to start to get out of the mess we've found ourselves in: start holding SWEs accountable for their work. If you work on privacy-destroying projects that society pushes back against, it's fair game to put you under the microscope. Perhaps not legally, but we as a society shouldn't hold back from directed criticism and social accountability for the individual engineers who enable this kind of shit.

That will not be a popular take here. Perhaps it will be some solace to know I advocated the same kind of accountability for lawyers who enabled torture and other governmental malfeasance in the GWoT years. I was also looked at askance by other lawyers for daring to suggest such a thing. In that way, SWEs remind me of lawyers in how they view their own work. "What, I'm not personally responsible for what my client chooses to use my services for."

Yeah, you are, actually.

f6v
0 replies
5h22m

I was hanging out around startup incubators and, by extension, many wantrepreneurs. When asked about their business model, the knee-jerk reaction was usually "we're going to sell data!", regardless of product. I was appalled by how hard it is to keep founders from abusing data when I worked at startups. GDPR and the like are seen as an annoyance, and they make every effort to find a loophole.

zarathustreal
0 replies
6h14m

To riff on the famous Upton Sinclair quote:

“It is difficult to get an engineer to see something, when his salary depends on his not seeing it.”

psychoslave
0 replies
6h9m

that’s obviously a conspiracy theory.

Well, while some of our fellow humans are far too quick to conclude that anything and everything comes from some conspiracy, that shouldn't rule out the existence of any conspiracy as the opposite extreme.

In that case, whether these actors do it or not is almost irrelevant: they have the means and the incentives to do so. What safeguards civil society puts in place to keep it from happening is a far more interesting question.

shadowgovt
3 replies
6h36m

Of course Google reads and aggregates data inside your private docs. How would it provide search over your documents otherwise?

dylan604
2 replies
3h54m

When I hit search, do the search right then. Don't grep out of a stored cache of prior searches.

shadowgovt
1 replies
1h41m

The thing that makes it possible for search to be fast is pre-crawling and pre-indexing.

Some other engines don't do this, and the difference is remarkable. Try a full-content search in Windows 7: you'll be staring at the dialog for two minutes while it tries to find a file that's in the same directory you started the search from.
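
The speedup comes from building an inverted index ahead of time, so a query becomes a lookup instead of a full rescan. A toy sketch in Python:

    # Toy inverted index: build once up front, then lookups are near-instant.
    from collections import defaultdict

    docs = {
        "a.txt": "the quick brown fox",
        "b.txt": "the lazy dog",
    }

    index = defaultdict(set)
    for name, text in docs.items():
        for word in text.lower().split():
            index[word].add(name)

    # Query time: one dict lookup instead of rescanning every file.
    print(index["fox"])  # {'a.txt'}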

dylan604
0 replies
1h32m

You said nothing about speed in your original comment, though, so now you've moved the goalposts.

mark_l_watson
1 replies
5h28m

re: data on cloud providers: I trust ProtonDrive not to use my data because it is encrypted in transit and at rest.

Apple now encrypts most data in transit and at rest as well, and they document which data is protected. I am up in the air on whether a future Apple will want to use my data for training public models. Apple's design of pre-trained core LLMs, with local training of pluggable fine-tuning layers, would seem to be fine privacy-wise, but I don't really know.

I tend to trust the privacy of Google Drive less because I have authorized access to drive from Colab Pro, and a few third parties. That said, if this article is true, then less trust.

Your analogy with early Gmail is good. I got access to Gmail three years before it became public (Peter Norvig gave me an early private invite) and I liked, at the time, very relevant ads next to my Gmail. I also, gave Google AI plus (or whatever they called their $20/month service) full access to all my Google properties because I wanted to experiment with the usefulness of LLMs integrated into a Workplace type environment.

So, I have of my own volition surrendered privacy on Google properties.

dylan604
0 replies
3h56m

All it takes is a "simple" typo in the code that checks whether the user has granted access to their content. Something as amateur (and I still catch myself doing it occasionally) as "if (allowInvasiveScanning = true)" that goes "undetected" for any period of time gives them a way out yet still gains them access to everything. Scanning these docs one time is all they need.

foobarian
0 replies
5h47m

One time I booked something on Expedia, which resulted in an itinerary email to my Gmail account. Lo and behold, minutes later I got a native CTA on the Android home screen to set up something or other on Google's trip product. I've since dropped Android, but Gmail is proving harder to shake.

padolsey
8 replies
4h55m

There is a fundamentally interesting nuance to highlight here. I don't know precisely what Google is doing, but if they're just shuttling the content through a closed-loop, deterministic LLM then, much like a spellchecker, I see no issue. Sure, it _feels_ creepy, but it's just an algo.

Perhaps someone can articulate the precise threshold of 'access' they wish to deny apps that we overtly use? And how would that threshold be defined?

"Do not run my content through anything more complicated than some arbitrary [complexity metric]" ??

Retric
4 replies
4h38m

The issue isn’t doing something to your data, it’s what happens after that point.

People would be pissed if Android made everyone's photos public; AI does this with extra steps. Training an AI on X means everyone using that AI potentially has access to X with the right prompt.

padolsey
1 replies
4h34m

I don't think it's _training_ on your content. That would be a whole other (very horrifying) problem, yes.

Retric
0 replies
20m

Why make that assumption? Many AI companies use conversations as part of training, so even if the documents aren't used directly, that doesn't mean the summaries are safe.

Suppafly
1 replies
1h43m

Except that's not what is happening here or what the rest of us are discussing, so why even bring it up?

Retric
0 replies
23m

We don’t know what’s happening beyond:

“the privacy settings used to inform Gemini should be openly available, but they aren't, which means the AI is either "hallucinating (lying)" or some internal systems on Google's servers are outright malfunctioning”

Many AI systems do use user interactions as part of their training data. So at most you might guess that those documents aren't directly being used for training AND that conversations will never be included in training data, but you don't know.

PessimalDecimal
2 replies
4h48m

It was already possible to search for photos in Google Drive by their content. They seemed to be doing some sort of image tagging and feeding that into search results. Did that ever cause a fuss?

I think the more interesting point is how little people seem to care for the auto-summarization feature. Like, why would anyone want to see their archived tax docs summarized by a chatbot? I think whether an "AI" did that or not is almost a red herring.

padolsey
1 replies
4h42m

Right, but it's triggered by the user themselves, per the article:

"[it] only happens after pressing the Gemini button on at least one document"

I agree the AI aspect is largely a red herring. But I don't think running an algo like a spellchecker within an open document is so awful. If people hate it or it's not useful or accurate, then it should be binned, ofc. And if we're ignoring the AI aspect, then it's just a meh/crappy feature. Not especially newsworthy IMHO.

PessimalDecimal
0 replies
3h42m

I agree entirely here.

The autosummarization of unopened documents is closer to the image search functionality I mentioned above than it is to a spell checker running on an open doc. Both autosummarization and image search are content retrieval mechanisms. The difference is only in how it's presented. Does it just point you to your file, or does it process it further for you? The privacy aspects are equivalent IMO. The only difference is in whether the feature is useful and well received.

r2vcap
6 replies
6h40m

There is no cloud. It's just someone else's computer.

bitnasty
4 replies
6h27m

Just because “the cloud” is someone else’s computer doesn’t mean it doesn’t exist.

grugagag
2 replies
5h6m

I think that wasn’t supposed to be taken literally but more tongue in cheek. The main point being that it belongs to some other party. But the cloud buzzword is fuzzy in description.

Ever since ‘cloud’ privacy took a nosedive

denton-scratch
0 replies
2h35m

Back in the 80s, we used to draw network diagrams on the whiteboard; those parts of the network that belonged neither to us nor to our users were represented by the outline of a cloud. This cloud didn't provide storage or (usable) computing resources. If you pushed stuff in here, it came out there.

I think it was a reasonable analogy. You can't see inside it; you don't know how it works, and you don't need to. Note that at the time, 'the internet' wasn't the only way of joining heterogeneous networks; there was also the OSI stack.

So I was annoyed when some bunch of kids who had never seen such whiteboard diagrams decided to re-purpose the term to refer to whatever piece of the internet they had decided to appropriate, fence-in and then rent out.

Zambyte
0 replies
4h55m

It's worth noting that cloud computing has existed since the 1960s. It just used to be called "time-sharing".

meiraleal
0 replies
5h21m

That's exactly how the cloud is marketed: as an invisible mass of computers floating in the atmosphere.

SteveSmith16384
0 replies
4h58m

That doesn't change anything.

PessimalDecimal
4 replies
5h10m

Meta commentary but still relevant I think:

The author first refers to his source as Kevin Bankston in the article's subtitle. This is also the name shown in the embedded tweet. But the following two references call him Kevin _Bankster_ (which seems like an amusing portmanteau of banker and gangster I guess).

Is the author not proofreading his own copy? Are there no editors? If the author can't even keep the name of his source straight and represent it consistently in the article, is there reason to think other details are being relayed correctly?

numbsafari
2 replies
4h55m

There are no editors.

person23
1 replies
4h44m

Maybe an AI editor?

That would be somewhat disconcerting.

Write about a problem with AI and the article gets changed to "10 best fried chicken recipes".

PessimalDecimal
0 replies
3h40m

> Write about a problem with AI and the article gets changed to "10 best fried chicken recipes".

Hopefully along with ten hallucinated life stories for the AI author, to pad the blog spam recipe page for SEO.

mistrial9
0 replies
3h57m

Meta comment: an important moment in a trend where the human act of authorship, and attribution in a human social sense, is melted and disassociated by typos and noise; meanwhile the centralized compute store, its brand, its reach and recognition, all grow.

thenoblesunfish
3 replies
3h53m

The title is misleading, isn't it? I was expecting this to be scanning for training or testing or something, but this is summarization of documents the user is looking at, so "caught" is disingenuous. You don't "catch" people doing things they tell you they are doing, while they're doing it.

mtnGoat
2 replies
3h43m

He had the permissions turned off, so regardless of what it did with the document, it did it without permission! The title is correct!

phendrenad2
0 replies
3h19m

People don't carry an objective model of reality with them, so to be misled, all that needs to happen is for their internal, subjective model of reality to be subverted in some way.

It's misleading.

Suppafly
0 replies
1h53m

> He had the permissions turned off, so regardless of what it did with the document, it did it without permission! The title is correct!

Honestly it sounds like he was toggling permissions off and on and actually has no idea why it summarized that particular document despite him requesting it summarize other documents. Google should make the settings more clear, but "I had the options off, except when I didn't, and I set some other options in a different place that I didn't think would override the others, and also I toggled a bunch of the options back and forth" is hardly the condemnation that everyone is making it out to be.

silvaring
2 replies
4h35m

I just want to add that Gmail has a very sneaky 'Add to Drive' button that is way too easy to click when working with email attachments.

How long til Gmail attachments get uploaded into Drive by default through some obscure update that toggles everything to 'yes'?

klabb3
0 replies
4h15m

What difference does it make? They're both on Google servers and even ACLed to the same user account. Gmail isn't exactly a privacy-preserving email provider.

hiatus
0 replies
2h38m

> How long til Gmail attachments get uploaded into Drive by default through some obscure update that toggles everything to 'yes'?

This already is the case for attachments that exceed 25 megabytes.

shadowgovt
2 replies
6h37m

The headline is a little unclear on the issue here.

It is not surprising that Gemini will summarize a document if you ask it to. "Scanning" is doing heavy lifting here: the headline implies Google is training Gemini on private documents, when the real issue is that Gemini was run with a private document as input to produce a summary when the user thought they had explicitly switched that feature off.

That having been said, it's a meaningful bug in Google's infrastructure that the setting is not being respected, and the kind of thing that should make a person check their exit strategy if they are completely against using the new generation of AI in general.

dmvdoug
1 replies
5h54m

> It is not surprising that Gemini will summarize a document if you ask it to.

No, but it is surprising that Gemini will summarize every single PDF you have on your Drive if you ask it to summarize a single PDF one time.

shadowgovt
0 replies
5h52m

Honestly, that's not particularly surprising either. Google's privacy model is that if you trust them to store those files, you trust them to use them on your behalf to enable features you want. There's no concept in their ecosystem of a security model for "enable features on this account for these documents but not those documents"; you'd have to create two accounts, or set up a business account, if you want that.

nerdjon
2 replies
4h37m

Shocker: Google not going quite far enough with privacy and data access? They talk about it, but it's never quite far enough to keep their own services from accessing data.

We really need to get to the point where all remotely stored data is encrypted and cannot be decrypted by the servers, only by our devices. Otherwise we just allow the companies to mine the data as much as they want, and we have zero insight into what they are doing.

Yes this requires the trust that they in fact cannot decrypt it. I don't have a good solution to that.

Any AI access to personal data needs to be done on device, or, if it requires server processing (which is hopefully only a short-term issue), come with a clear prompt about data being sent off your device.

It doesn't matter if this isn't specifically being used to train the model at this point in time; it is not unreasonable to think that any data sent through Gemini (or any remote server) could be logged and later used for additional training, sit in plaintext in a log, or just be viewable by testers.

JohnFen
1 replies
1h43m

> Yes this requires the trust that they in fact cannot decrypt it.

Yes, this is where it all breaks down. In the end, it all boils down to the company saying "trust us", and it's very clear that companies simply cannot be trusted with these sorts of things.

nerdjon
0 replies
7m

Yeah, I wish there were a solution to that. Even open source isn't a solution, since it would be trivial for there to be a difference between what is running on the server and what is open source.

Ultimately you have to make a decision based on the companies actions and your own personal risk threshold with your own data.

In this particular case, we know that at the very least Google's track record on this is... basically nonexistent.

acar_rag
2 replies
10h6m

The title misrepresents the point, and the article is, IMO, badly written. The post implies there is indeed a setting to turn it off. So the author deliberately asked Gemini to summarize (that is, scan) his documents...

Related to this news: https://news.ycombinator.com/item?id=40934670

jbstack
1 replies
9h52m

"There is a setting to turn it off" is nowhere near equivalent to "the author deliberately asked for documents to be scanned".

Also, see:

"What's more, Bankston did eventually find the settings toggle in question... only to find that Gemini summaries in Gmail, Drive, and Docs were already disabled"

jasonlotito
0 replies
3h54m

FTA: For Bankston, the issue seems localized to Google Drive, and only happens after pressing the Gemini button on at least one document. The matching document type (in this case, PDF) will subsequently automatically trigger Google Gemini for all future files of the same type opened within Google Drive.

The author deliberately asked for at least one document to be scanned. He goes on to talk about all the other things that might be overriding the setting: other, potentially more specific settings that would override it.

I agree, there appear to be interactions that aren't immediately obvious, and what takes priority isn't clear. However, the setting was off, and the author did deliberately ask for at least one document to be scanned. Further, the author mentions Labs being on, and that could easily take priority over default settings. After all, that's sort of what Labs is about: experimenting with stuff and giving approval for these sorts of things.

worksonmine
1 replies
5h16m

This shouldn't come as a surprise to anyone; their entire business is our data. I always encrypt anything important I want to back up to the cloud.
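
A minimal sketch of doing that client-side with Python's `cryptography` package (the filename is made up, and key management, the hard part, is left out):

    # Encrypt locally so the provider only ever stores ciphertext.
    from cryptography.fernet import Fernet

    key = Fernet.generate_key()  # keep this somewhere safe, NOT in the cloud
    f = Fernet(key)

    with open("tax-docs.pdf", "rb") as fh:
        ciphertext = f.encrypt(fh.read())

    with open("tax-docs.pdf.enc", "wb") as fh:
        fh.write(ciphertext)  # upload only the .enc file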

SteveSmith16384
0 replies
4h56m

But we shouldn't have to, and just because Google is famous for it, it doesn't make it right or acceptable.

space_oddity
1 replies
5h55m

The inability to disable this feature adds to the frustration.

Zambyte
0 replies
4h51m

If your computer does something that you don't want it to do, it is either a bug or malware, depending on intent. "Feature" is too generous.

estebarb
1 replies
5h8m

It is urgent to educate people about how these systems work. Search requires indexing. Summarizing with an LM requires inference. Data used for inference is usually forgotten forever after use, as it is not used for training.

Yeah, that should be obvious to many here, but even software engineers believe that AIs are sentient things that will remember everything they see. And that is a problem, because the public is afraid of the tech due to a wrong understanding of how it works. Eventually they will demand laws protecting them from things that have never existed.

Yes, there are social issues with AI. But the article just shows a big tech illiteracy gap.

Suppafly
0 replies
1h41m

> And that is a problem, because the public is afraid of the tech due to a wrong understanding of how it works

Honestly, the general public doesn't seem to care; the people freaking out are the tech-adjacent people who make money driving clicks to their own content. Regular Joes aren't upset that Google shows them a summary of their documents, and many of them actively appreciate it.

okdood64
0 replies
3h16m

I'm shocked, especially this being HN, at how many people are being successfully misled about what is actually going on here. Do people still read articles before posting?

muscomposter
0 replies
3h53m

we should just embrace digital copying galore instead of trying to recreate the physical constraints of regular assets in digital form

we should just ignore the physical constraints of assets which do not have them, like any and all digital data

which do you prefer? everybody being able to access everybody's digital data (read-only mode), or what we have now, which is trending towards so many microtransactions that every keystroke gets reflected in my bank account

motohagiography
0 replies
4h44m

This is similar to the scramble for health data during COVID, where a number of groups tried (and some succeeded) to use the crisis to squeeze the toothpaste out of the tube in a similar way, since the cost of being reprimanded is low and the value of grabbing the data is high. Bureaucratic smash-and-grabs, essentially. Disappointing, but predictable to anyone who has worked in privacy, and most people just make a show of acting surprised before moving on, because their careers depend on their ability to sustain a gallopingly absurd best-intentions narrative.

Your hacked SMS messages from AT&T are probably next, and everyone will be just as surprised when keystrokes from your phones get hit, or when a collection agent for model training (privacy-enhanced for your pleasure, surely) is added as an OS update to commercial platforms.

Make an example of the product managers and engineers behind this, or see it done worse and at a larger scale next time.

meindnoch
0 replies
1h11m

Your first mistake was storing your data on someone else's computer.

eagerpace
0 replies
3h23m

In the push for AGI, do companies feel a recursive-learning future is soon achievable, and therefore that getting to the first cycle of it is worth the cost of any legal issues that may arise?

api
0 replies
4h37m

If it's not stored on your device or encrypted with keys only you control, it's not yours.

I assume anything stored in such a system will be data mined for many purposes. That includes all of Gmail and Google Docs.

ajsnigrutin
0 replies
7h8m

This article is really written as if "the AI" suddenly decided to read stuff it wasn't supposed to... like some sci-fi story where it gains some kind of awareness and does things it wasn't meant to. Why? It only did what it was programmed to do, so either some coder messed something up, or someone somewhere hid an on-by-default checkbox that allows it to do this (...after it was coded to do that).

_spduchamp
0 replies
4h56m

I now feel obligated to cram as much AI-f'n-up crap into my Drive as possible. Come'n get it!

VeejayRampay
0 replies
5h9m

this is not OpenAI doing shady things, so everyone should be up in arms

Khaine
0 replies
2h2m

If this is true, Google needs to be charged with violating various privacy laws.

I’m not sure how they can claim they have informed consent for this from their customers

Havoc
0 replies
5h0m

Only a matter of time before someone extracts something valuable out of Google's models. Bank passwords or crypto keys or something.

The glue pizza incident illustrated that they're just yolo'ing this.

Aurornis
0 replies
6h32m

The original Tweet and this article are mixing terms in a deliberately misleading way.

They’re trying to suggest that exposing an LLM to a document in any way is equivalent to including that document in the LLM’s training set. That’s the hook in the article and the original Tweet, but the Tweet thread eventually acknowledges the differences and pivots to being angry about the existence of the AI feature at all.

There isn’t anything of substance to this story other than a Twitter user writing a rage-bait thread about being angry about an AI popup, while trying to spin it as something much more sinister.