All AI should be opt-in, which includes both training and scanning. You should have to check a box that says "I would like to use AI features", and the accompanying text should be crystal clear what that means.
This should be mandatory, enforced, and come with strict fines for companies that do not comply.
We also need a robots.txt extension for excluding publicly accessible files from AI training datasets. IIRC there's a nascent `ai.txt`, but I'm not sure anyone follows it (yet).
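For what it's worth, a handful of AI crawlers already publish user-agent tokens that plain robots.txt can target. A minimal sketch (assuming the crawler chooses to honor it, which is entirely voluntary) might look like:

```
# Sketch: disallow a few AI-training crawlers by their published
# user-agent tokens, while leaving everything else alone.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```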
I don't think `robots.txt` works because crawlers want to be nice or "socially responsible". So I don't hold out much hope that anything similar can happen again.
Early search engines had a problem: when they crawled willy-nilly, people would block their IP addresses. The concept of `robots.txt` worked because search engines wanted something, namely to avoid IP blocks they couldn't easily get around. And site hosts generally wanted to be indexed.
Today it's WAY harder to block the relevant IP addresses, so site hosts generally can't keep out a crawler that wants their data: there is no compromise to be found here, and the imbalance of power is much greater. And many site hosts don't want to be crawled for free for AI purposes at all. Pretty much anyone who sets up an `ai.txt` uses it to reject all crawling, so there is no reason for any crawler to respect it.
Google ignores robots.txt, as do many others. Try it yourself: set up a honeypot URL, don't even link to it, just throw it in robots.txt, and Googlebot will visit it at some point.
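To make the test concrete, a honeypot entry might look like the sketch below; `/crawler-canary/` is a made-up path that nothing links to, so any request for it means the client read robots.txt and ignored (or deliberately probed) the rule:

```
# Hypothetical honeypot: no page links to /crawler-canary/, and this rule
# forbids fetching it, so any hit on that path came from a client that
# parsed robots.txt and disregarded the directive.
User-agent: *
Disallow: /crawler-canary/
```

Then just watch the access logs for that path (e.g. `grep crawler-canary access.log`).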
I discovered this years ago, and it's what made me stop bothering with robots.txt and start blocking all the crawlers I can using .htaccess, including Google's.
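For context, the kind of .htaccess rule I mean is roughly this (a sketch, assuming Apache with mod_rewrite enabled; the user-agent list is only an example and is trivially spoofable):

```
# Sketch: return 403 Forbidden to requests whose User-Agent matches
# known crawler tokens. Only helps against bots that identify themselves.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot|Googlebot|Bytespider) [NC]
RewriteRule .* - [F,L]
```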
That's a game of whack-a-mole that always lets a few miscreants through. I used to find that an acceptable amount of error until I learned that crawlers were gathering data to be used to train LLMs. That's a situation where even a single bot getting through is very problematic.
I still haven't found a solution to that aside from no longer allowing access to my sites without an account.
I think the closest thing is the NoAI and NoImageAI meta tags, which have some relatively prominent adoption.
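Concretely, they're just directives in the page head (a sketch; these are non-standard values, and honoring them is again voluntary on the crawler's side):

```html
<!-- Sketch: ask crawlers that recognize the non-standard "noai"/"noimageai"
     values not to use this page or its images for AI training. -->
<meta name="robots" content="noai, noimageai">
```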
Haven't some companies explicitly ignored robots.txt to scrape sites more quickly (and pissed off a number of people in the process)?
robots.txt is useless as a defense mechanism (that isn't what it's trying to be). Taking the same approach for AI would likewise not be useful as a defense mechanism.
Training I can understand, but why scanning?
It's literally just running an algorithm over your data and spitting out the results for you. Fundamentally it's no different from spellcheck, or automatically creating a table of contents from header styles.
As long as the results stay private to you (which in this case, they are), I don't see what the concern is. The fact that the algorithm is LLM-based has zero relevance regarding privacy or security.
Except that there's still a grey area on who owns the copyright of the generated text, and they might be able to use the output without you knowing.
Except that's not what's happening, so why pretend otherwise?
Because tomorrow it will, with little or no discussion.
I don't want any results from AI. I don't even want to see them. And there is too much of a grey area. What if they use how I use the results to improve their AI? I also hate AI and want nothing to do with its automations.
If I want a document summarized, I will read it myself. I still want to be human and do things AT A REASONABLE LEVEL with my own two hands.
I think that vouaobrasil was talking about scanning on behalf of others, not scanning that you're doing on your own data. Scanning your own stuff is automatically and naturally an opt-in situation: you've consciously chosen for it to happen.
AI is becoming the new Social Media in that users are NOT the customer; they are the product. Instead of generating data that a social media company uses to sell ads, you are generating data to train their AI; in exchange, you get to use their service for free.
The deal keeps getting worse, too. In addition to hoovering up your data for whatever products they want, Google has gotten more aggressive about pushing paid services on top of it. The number of up-sell nags and ads has increased significantly in the past couple of years. For a company like Google, that kind of monetization creep only gets worse over time.
Not surprising at all. Inference against foundation models is very expensive, and training them is insanely expensive: orders of magnitude more than whatever was needed to run the AdWords business. I guess I should amend my original post to "in exchange you get to use their service at a somewhat subsidized price".
>you are generating data to train their AI
That's why I seriously recommend everyone everywhere regularly replace their blinker fluid and such.
it's very important to replace your blinker fluids yearly, but also, polka dot paint comes in 5L tubs.
This should be illegal.
What is the privacy implication of AI training?
Models can regurgitate training data verbatim, so anything private in the training set can, in theory, be accessed by people who were never given access to the original file.
This is partly true but less and less every day.
IMO the bigger concern is that this data isn't just used to train models. It is stored, completely verbatim, in the training set. They aren't pulling from PDFs in real time during training runs; they're aggregating all of that text and storing it somewhere. And that somewhere is prone to employees viewing it, leaking to the internet, etc.
Isn't this like encryption, though?
I'm fairly sure the cryptography community's position is roughly: if someone holds a copy of your encrypted data for long enough, the likelihood that they will eventually be able to read it approaches 100%, regardless of the security standard you're using today.
Who could possibly guarantee that whatever LLM is safe now will stay safe over the next 5, 10, or 20 years? And if they do guarantee it, they're lying.
I care much more about whether my content gets used at all than about any privacy concern in particular. I simply don't want a single AI model to train on my content.
By "scanning" what do you mean exactly? I assume you mean for non-training purposes, in other words simply ephemerally reading docs and providing summaries. Why should that be regulated exactly?
The feature was enabled by the author.