
The AI Trust Crisis

daft_pink
46 replies
2d

I think your article glosses over the fact that there are privacy concerns beyond training on my data.

I'm a working professional and I have clients that are governed by confidentiality agreements and regulations regarding where information goes, and I would just prefer using a service where my data rests on a server instead of having more and more points where a data breach could be introduced.

I don't really understand why my data isn't fully encrypted at all times so that only I can view it in the first place, but the idea that they are actively sending it across the internet to be ingested and processed by other companies without my consent or interest is so terrible.

I often use AI features when I opt into them, but having a company just send my personal files all over the internet without my consent is insanity.

Honestly, OneDrive has a migration tool, and I got a trial for Dropbox Business and moved all my files automatically last night. It's just the last straw in Dropbox constantly doing things I don't ask them to do, like introducing crap and popups into my desktop interface, and never offering the feature I constantly ask for... end-to-end encryption.

If you want a couple click migration from Dropbox Business to an Office 365 Onedrive account, it's right here: https://learn.microsoft.com/en-us/sharepointmigration/mm-dro...

satya71
10 replies
2d

Is OneDrive end-to-end encrypted? I think Microsoft will introduce a similar feature soon, if they haven’t already.

daft_pink
3 replies
2d

Nope, but it just makes sense as a quick and easy stopgap measure until I find time to figure out who to use: it's easily compatible with my Mac and PC, I already pay for a subscription so it saves money, it can run without putting all the files on every machine that I own, and they aren't sending my info everywhere.

If anyone has suggestions for a reliable end-to-end encrypted solution with lightweight cross-platform sync software that doesn't force you to download all the files to your device (my Mac's HD is too small), is generally fully featured, and won't take very much time to migrate via a trustworthy migration vendor, I'm all ears. I'm willing to pay a premium price for it.

aborsy
1 replies
1d21h

There are several: iCloud, ProtonDrive, Sync.com, etc.

daft_pink
0 replies
1d1h

iCloud isn't cross-platform enough. ProtonDrive is going to pull down every file to every device and act in a dumb way. Sync might be a good choice actually; I have to investigate them further.

ls612
0 replies
1d21h

iCloud with Advanced Data Protection turned on is the only one I know of. And it’s obviously Mac only.

wrs
2 replies
2d

They’ve already announced “M365 Copilot”, which is in the same ballpark as the Dropbox features (ask questions about your documents, etc.), and of course uses OpenAI like Dropbox does. So the only real difference here would be trust.

bugglebeetle
1 replies
1d23h

I would imagine it doesn't use OpenAI's servers, but rather Microsoft's hosted versions of OpenAI's technology, right? Like whatever they're currently selling to enterprise customers via Azure?

wrs
0 replies
22h11m

I don’t really know the structure there. Does OpenAI actually have servers? Or do they use Azure servers?

felixfbecker
1 replies
2d

iCloud Files is the only major cloud file syncing service that is e2e encrypted afaik. Was my reason to migrate over from OneDrive (especially since I was already using Apple devices and paying for iCloud for photos anyway).

daft_pink
0 replies
2d

Thanks! I would use iCloud, but I have a few windows devices I need for work unfortunately. It doesn't let me use the advanced security and maintain a windows device. Ironically, like many of these services I already have an Apple One subscription and am paying for it anyways.

JohnFen
0 replies
1d18h

Can you trust anyone who claims that things are end-to-end encrypted? Microsoft could take the point of view that you are one "end" and Dropbox is the other "end". If they encrypt the data in transit and decrypt it on their end, it's still technically "end-to-end encrypted".

They could also just lie. Having a company claim end-to-end encryption still means that I have to trust that the company isn't being sleazy.

The only encryption you can really have some measure of trust in is the encryption you apply yourself.

bastardoperator
8 replies
2d

Dropbox issued this statement literally yesterday:

    "If you’ve used any of Dropbox’s AI tools, some of your documents and files may have been temporarily shared with OpenAI."
Good luck thinking a cloud provider has YOUR best interest at heart. This is Hacker News, I feel like trust should be earned, never implied.

pavel_lishin
7 replies
2d

That word, "temporarily", is doing a lot of heavy lifting in a digital world where things can be duplicated for free.

bastardoperator
4 replies
2d

Seems like an S3 bucket would have been a better alternative. We have no idea what OpenAI does with Dropbox customer data outside of storing it for 30 days. They're doing something; basically all Dropbox customer files will get propagated to OpenAI by default, and that should be scary, not feel good.

    Dropbox’s practices aren’t unprecedented, but customer documents do pass through OpenAI’s servers and are stored there for up to 30 days, and the “third-party AI” toggle is turned on by default in account settings.

simonw
3 replies
2d

What makes you think this is about "basically all Dropbox customer files"?

skywhopper
1 replies
1d23h

If it’s turned on by default and one of its capabilities is to use AI to search your files, then why wouldn’t we assume it applies to basically all files? How could it not?

simonw
0 replies
1d22h

So that depends entirely on how they implemented the feature. There are a few ways this could be working:

- They gave their chat interface the ability to run regular full-text searches against Dropbox - when you ask a question that can be answered by file content, it searches for relevant files and then copies just a few paragraphs of text into the prompt to the AI.

- They might be doing this using embeddings-based semantic search. This would require them to create an embeddings vector index of all of your content and then run vector searches against that.

- If they're doing embeddings search, they might have calculated their own embeddings on their own servers... or they might have sent all of your content to OpenAI's embeddings API to calculate those vectors.

Without further transparency we can't tell which of these they've done.

My strong hunch is that they're using the first option, for cost reasons. Running embeddings is an expensive operation, but storing embeddings is even more expensive - to get fast results from an embeddings vectors store you need dedicated RAM. Running that at Dropbox scale would be, I think, prohibitively expensive if you could get not-quite-as-good results from a traditional search index, which they have already built.

If they ARE sending every file through OpenAI's embedding endpoint that's a really big deal. It would be good if they would clarify!
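
To make the difference concrete, here is a minimal, self-contained sketch of that first option (search, then prompt): only a handful of retrieved excerpts ever reach the model. The tiny document store, the naive keyword search, and the llm() stub are hypothetical illustrations, not Dropbox's actual implementation.

    # Hypothetical sketch of "full-text search, then prompt" retrieval.
    DOCS = {
        "taxes-2023.txt": "Quarterly estimates were filed in April and June.",
        "trip-notes.txt": "Flights to Lisbon are booked for the first week of May.",
    }

    def search(query: str, limit: int = 3) -> list[str]:
        """Naive stand-in for a real full-text index: rank docs by shared words."""
        words = set(query.lower().split())
        ranked = sorted(DOCS.values(),
                        key=lambda text: len(words & set(text.lower().split())),
                        reverse=True)
        return ranked[:limit]

    def llm(prompt: str) -> str:
        """Stub for whatever hosted model would receive the prompt."""
        return "(model response goes here)"

    def answer(question: str) -> str:
        excerpts = "\n\n".join(search(question))
        prompt = f"Answer using only these excerpts:\n\n{excerpts}\n\nQ: {question}"
        return llm(prompt)  # only the retrieved excerpts are sent, never every file

    print(answer("When do I fly to Lisbon?"))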

bastardoperator
0 replies
1d23h

Turning on AI by default seems to indicate they're sending your data somewhere automatically before seeking approval or opt-in. I could be very wrong, but the wording alone would at the very least make me cautious.

itronitron
0 replies
1d22h

Also, temporarily doesn't necessarily mean a time period less than one hundred years.

boh
0 replies
1d23h

It's such an obvious obfuscation of what everyone can assume is permanent ownership of user data, as well as of the assumption that its use will be limited. There are no safeguards around user data retention in the ToS. Unless a whistleblower reveals specific uses of the data and users litigate the issue, they do what they want with zero opposition.

pimterry
6 replies
2d

I moved to Mega recently, who now have a very tidy full E2EE cloud storage dropbox equivalent: https://mega.io/storage.

No affiliation, but it does exactly what you're asking for, and I've been very happy with it.

daft_pink
4 replies
1d23h

I'm a little bit scared of using Mega, because it's linked to Kim Dotcom, a wanted fugitive, and Megaupload used to be known for hosting tons of pirated files.

I know they have a cheap solution, but it's not exactly something that checks my box for stability and high character for hosting my very important files.

simonw
2 replies
1d23h

I thought that too, but according to Wikipedia Kim Dotcom "severed all ties with the service in 2015".

https://en.wikipedia.org/wiki/Kim_Dotcom

kilolima
1 replies
1d23h

Didn't he also claim that the new Mega was Chinese malware?

red-iron-pine
0 replies
1d3h

it may be, but the dude isn't exactly known for credibility or integrity

blibble
0 replies
1d21h

so what you're saying is this service is probably slightly more trustworthy than the typical VC funded Silicon Valley company?

sorokod
0 replies
1d23h

Neat, can you share a link to the "Privacy and Data Policy" they mention?

layer8
5 replies
2d

The better solution is to use a separate encryption overlay like Cryptomator over whatever cloud storage you use. If you have confidentiality agreements with clients, you shouldn’t be using Dropbox without E2EE anyway, nor OneDrive.

daft_pink
4 replies
2d

Is it going to work on my iPad or iPhone? How long is it going to take? I tried to research that once, but it looked like Dropbox bought whatever service worked well and no longer seemed like a good solution. I would prefer the service to just work out of the box.

layer8
2 replies
2d

Yes, it works on iOS [0]. Personally I’m still using the standalone mode of Boxcryptor (the iOS app is still receiving updates), which unfortunately was bought up by Dropbox, and in the past there were opinions that it worked better than Cryptomator, but many people seem to be happy with Cryptomator now, so I’d give it a shot.

[0] https://apps.apple.com/us/app/cryptomator-2/id1560822163

willmadden
1 replies
2d

I found solutions like Boxcryptor cumbersome to use. Unless you stored the data redundantly locally, you had to download big encrypted files in order to access a small file.

Also searching files was impossible unless you downloaded everything, decrypted it, and searched locally.

I quickly realized it was adding huge delays in my day-to-day work and a lot of stress during time-sensitive tasks.

Have these e2ee overlays improved in usability since then?

layer8
0 replies
1d23h

Obviously you can’t search contents in encrypted state, and with E2EE this means that server-side search is not possible.

I rarely use indexed file contents search (filename search is usually enough and that works well, and tools like grep work transparently), however the Boxcryptor drive can be added to the Windows Search Index (or whatever search software you use), and I assume it’s similar on Mac. You don’t have to decrypt manually. Indexing causes more system load due to the necessary decryption, of course.

On desktop systems I always store all relevant data locally, exactly for the redundancy. I’m not sure what you mean by “big encrypted files”, because each original file is encrypted individually and thus has basically the same size as the original.

mox1
0 replies
1d23h

Do you want your cloud storage to "just work everywhere" or do you want to have full control of your data? Basically you get to choose one of the two options.

Cryptomator on top of any of the cloud storage providers is a great setup for home / personal use. I have been doing this for the past 3 years with minimal issues: Google Drive + Cryptomator on Windows + Cryptomator on iOS, working pretty seamlessly.

nonrandomstring
4 replies
2d

What you're describing is a deeper problem not only with "AI" but with the entire cloud-centric side of the tech world. Homomorphic encryption might save the day for delocalised computing but we're some years away from that being a reality. Meanwhile de-clouding, "repatriation" to on-prem and hybrid private cloud cooperatives within a trusted group are how we get there. Another good reason is simply to stem the enormous wealth transfer to big-data from individuals and smaller companies.

I'm glad to see that fantasies about omniscient AI taking over the world are giving way to a better grasp of the more mundane realities; AI just accelerating the already obscene power imbalances present in our world. Keep your private stuff private.

lesostep
1 replies
1d8h

I really do genuinely think that the core problem is that it's very hard to address your real computer from the Internet. There are programs that can effortlessly turn a directory on your PC into a quite secure "cloud". Windows lets you share a password-protected directory right out of the box!

It is that easy to buy a few TB disk and just run a program, if not for:

  1. Routers that don't allow easy port forwarding (or even ban certain ports)
  2. Dynamic IPs
  3. People selling domains pushing their own VPS services
  4. The amount of steps you have to take to allow

A lot of small organizations I know didn't need system administrators to configure and run programs like this. A lot of them are beginner friendly! They needed them to configure the network.

The same applies to self-hosting sites. If there was a program that just hosted any HTML page you put into it at your PC's address, a lot of people would self-host something. But you can't, unless you can wrangle your router and figure out how to buy a static IP – two tasks that are way harder than basic HTML.

The stack for easy and secure self-hosting is here, but the network changed too much. Hopefully, ipv6 will help to solve this problem.
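
For what it's worth, the serving half really is a solved problem; here's a minimal sketch using only Python's standard library. It only serves the current directory on your LAN until the router/NAT/static-IP problems described above are solved.

    # Serves every file in the current directory on http://<your-ip>:8080
    # (equivalent one-liner: python -m http.server 8080). Reaching it from
    # outside your LAN still requires port forwarding and a reachable address.
    from http.server import HTTPServer, SimpleHTTPRequestHandler

    HTTPServer(("0.0.0.0", 8080), SimpleHTTPRequestHandler).serve_forever()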

simonw
0 replies
1d6h

Tailscale is a huge step forward here: it makes creating a secure IPv6 + WireGuard network for accessing your home devices from anywhere genuinely easy.

daft_pink
1 replies
2d

Yeah, the problem is that I just need something quick and easy. I'm not an idealist. I just need something simple that checks most of the boxes and works easily across my blended ecosystem of devices.

It's why I use dropbox to begin with.

nonrandomstring
0 replies
1d23h

I get you, but when you say "sending my personal files all over the internet without my consent is insanity", that sounds like pretty strong affect. Maybe it's time you re-evaluated your choice for expediency and ease in tension with something sane you believe in.

In today's world just championing sanity is already an "ideology" :)

emporas
4 replies
2d

I agree with most of your points, but why not encrypt sensitive information yourself before it gets uploaded/shared on your dropbox account?

Sure, it's not end-to-end encryption, but it prevents the company from using the encrypted data as a training corpus. Or is the problem shared folders and files created by co-workers and family who aren't tech-savvy enough to know about encryption?
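
For anyone curious what that looks like in practice, here's a minimal sketch using the Python cryptography package's Fernet recipe (the file names are made up for illustration). The key stays with you, so the storage provider and anything downstream of it only ever sees ciphertext.

    # pip install cryptography
    from cryptography.fernet import Fernet

    key = Fernet.generate_key()         # keep this key safe, and OFF the cloud account
    f = Fernet(key)

    with open("contract.pdf", "rb") as src:
        ciphertext = f.encrypt(src.read())

    with open("contract.pdf.enc", "wb") as dst:
        dst.write(ciphertext)           # sync/upload this file instead of the original

    # Later, on any machine that has the key:
    plaintext = Fernet(key).decrypt(ciphertext)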

bugglebeetle
1 replies
1d23h

I agree this is a good practice, but if you have to do this to defend yourself against the rogue actions of a service you’re paying for, you’re probably better off not using the service.

foobiekr
0 replies
1d22h

Different services have different and unclear expectations. For example, you'd imagine that a big online storage service would have some access controls in place to limit what uploaded data random employees in operations and engineering can see, but in at least one case you'd be totally wrong. This strikes me as a perfectly reasonable expectation: that the data isn't just sitting there exposed to arbitrary employees, and that only trusted employees have that kind of access, not everyone broadly.

I don't think this is rogue actions, I think it falls into the category of perfectly reasonable expectations that are not actually met by a wide variety of cloud services.

klabb3
0 replies
1d14h

Sure, it's not end to end encryption

Nit, but what you’re describing is e2ee (except for the metadata). If you encrypt your files before uploading and decrypt it only locally then only the logical sender and recipient have access. That it goes through Dropbox is not important (and also the beauty of e2ee).

This is a bit unusual, otherwise it’s typically people (and shady service providers) who say that something is e2ee when it isn’t.

k_process
0 replies
1d18h

Because Dropbox acquired Boxcryptor -- one of the tools that easily let people encrypt files before upload -- to be replaced with "plans for end-to-end encryption"

https://techcrunch.com/2022/11/29/dropbox-acquires-boxcrypto...

bsenftner
0 replies
1d23h

I've worked for multiple Department of Defense contractors where they have their entire code base, to the tune of a few dozen terabytes - including highly sensitive ML training data - in their Dropbox accounts. I bet they are in full panic.

boh
0 replies
1d23h

I wouldn't have much faith in OneDrive either. Just use Cryptomator for everything.

bad_user
0 replies
1d23h

I very much agree with you, however, do you really think that Microsoft, of all companies, isn't using your data for training LLMs, with all the data leakage risks? What makes you think that your data is safer on OneDrive?

Bing Chat (based on ChatGPT), Copilot, etc. As a GitHub user, I never got a checkbox to opt-out of GitHub Copilot's training on my code. At least Dropbox provides a checkbox.

rqtwteye
27 replies
2d1h

Just read this. I have had enough weird incidents where, for example, I chat with somebody about a topic and a day later I suddenly receive ads and YouTube recommendations about that topic. There is a lot of reason not to believe big companies. I don't see why Facebook should be more ethical than tobacco or oil companies that have been lying and obfuscating for decades.

The only solution I see is to make it illegal to use personal data. Especially make data brokers illegal. When you give a company access to your data, you often agree that they will pass the data to third parties, who then pass it on, and suddenly everybody has your data because you gave permission to use it.

dylanjcastillo
11 replies
2d1h

People often get mixed up about what causes what in these situations:

We might chat about all sorts of random things, but many times there's something that happened before, let's call it event X, that leads us to talk about a certain topic, say topic Y (like when you mention a cool shirt a friend just bought).

Then, if they see an ad about that later, folks jump to the conclusion that "Facebook was listening" to their chat. But what's more likely is that this topic was already trending among people like you, and that's why it popped up in your ads.

So, it's not that talking about it made it appear in your ads. It's more about a common event that sparked your conversation about it and also made it show up in your ads.

throwup238
7 replies
2d1h

Have tech companies ever deserved the benefit of the doubt when the plausible deniability is that solid?

dylanjcastillo
2 replies
2d1h

If you think objectively about this issue, it's hard to see how they can get a positive ROI out of this.

Simon does a great job highlighting why this doesn't make sense from the technical and business standpoints: https://simonwillison.net/2023/Dec/14/ai-trust-crisis/#faceb...

throwup238
0 replies
2d

I'm not so worried about the corporation's ROI calculations at the top as much as paperclip maximizing middle management with warped incentives and little in the way of corporate controls or morals. We're one badly designed KPI and generic internal data collection API away from such spying falling through the cracks. Once it's there, the incentives are heavily against looking too hard at where the data came from until a public scandal draws an executive's attention.

It's played out over and over again, especially with Facebook. They don't deserve the benefit of the doubt, even when "logically" the ROI for the company doesn't make sense. Facebook specifically can't be trusted at the executive level either; that culture trickles down.

hinkley
0 replies
2d

Sorry, I’ve worked for and known about too many companies that “spent money to make money” and went bankrupt.

You can’t use net profit as a yardstick of rationality in the corporate world, especially in Corporate America. People will seek revenue that results in negative returns on investment. There’s nothing rational about it, which is the problem.

They can increase revenue by making ethically terrible decisions. Full stop.

jerf
1 replies
2d

Not sure anyone's asking for "benefit of the doubt". As defenses against privacy intrusion go, "Of course Facebook isn't listening to all your conversations and using that to serve ads to you. Why, that would be too expensive to contemplate. All they're doing is tracking everything you and your friends do on the web, tracking your location, noticing you're physically close to each other, inferring that you're probably having a conversation together, and using your friend's web activity to serve you targeted ads, as just one of a suite of methods for targeting ads based on your context. Also as soon as it's not too expensive to contemplate we're probably going to actually do it. We're not failing to do it for any ethical reason. It's just too expensive right now." is pretty weak sauce.

"We're not actually transcribing every word you hear and say, we're just using capabilities that the Stasi could only dream of at rates of speed they couldn't even conceive of, and it's just that these coincidences happen so often that it looks like we're transcribing your speech. Come on, be less paranoid."

simonw
0 replies
2d

Your example right here is exactly why I care so much about this.

If people think "they listen to me through the microphone" it distracts them from understanding what's actually happening - the whole sequence of things you listed there.

Which means they can't take measures to protect themselves, or campaign for better practices - because they're working off the wrong mental model of how this stuff works.

rqtwteye
0 replies
1d22h

No company has ever deserved the benefit of the doubt. There are plenty of examples in history where companies prioritized profits over harm to people. And they knew exactly what was going on.

bee_rider
0 replies
2d

It doesn’t have to be something they "deserve", really. Facebook, data brokers, and other creeps have already infected the whole internet. They spy on the content you aren’t even willing to talk about with anybody. They don’t need to secretly access your microphone because they’ve openly violated your privacy more thoroughly than that.

sgift
0 replies
1d18h

No matter what mechanism they use to end up with these hyper-specific results, the end result is that by fusing data from an ever-expanding number of sources, companies are gaining insights into people's lives that approach the almost mythical "all-seeing eye" in a way that has never happened before in human history. And I think we should ask really hard questions about whether society should continue to allow this or whether it should be reined in. And we shouldn't accept vague "oh, you know, a person could do this too, so it's okay" or "companies are just persons" statements as an answer. Even for die-hard market fans, there's the question of whether a market can work if the other side of a transaction knows everything about you.

rqtwteye
0 replies
1d22h

"So, it's not that talking about it made it appear in your ads. It's more about a common event that sparked your conversation about it and also made it show up in your ads."

It's not a common event. They were totally random conversations. That's why I noticed that something is going on. I am not sure if they're really listening on the microphone or what else they are doing, but it's super creepy and should not happen.

jazzyjackson
0 replies
2d

Every time i've received a hyper targeted ad, it's because the person I was just on the phone with was googling the thing we were talking about.

I told my mom i wanted a vacuum cleaner for christmas - guess what, facebook advertises to all your friends what you're shopping for. "lookalike audience" - to your point.

kolinko
10 replies
2d1h

As for using personal data - EU has it covered.

But the whole "Facebook listening to you" thing is absurd - on iOS you would see a system notification that the microphone is active, and I assume on Android as well. Also, it would take crazy amounts of bandwidth/battery/processing power to pull off.

It would also be trivially discoverable - a ton of people are listening to what apps are sending out, and even if encrypted, it would be noticeable by the outgoing traffic volume being high.

It is virtually impossible to pull off even today, as anyone who has ever developed anything on mobile can tell you.
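
A rough back-of-envelope supports the bandwidth point. These are my assumptions for illustration, not figures from the thread: 16 kHz 16-bit mono for raw audio, and roughly 24 kbps for a speech codec such as Opus.

    SECONDS_PER_DAY = 24 * 60 * 60

    raw_bytes_per_sec = 16_000 * 2      # 16 kHz * 2 bytes/sample = 32 KB/s raw PCM
    opus_bytes_per_sec = 24_000 / 8     # ~24 kbps speech codec = 3 KB/s

    print(f"Raw audio: {raw_bytes_per_sec * SECONDS_PER_DAY / 1e9:.2f} GB per device per day")
    print(f"Encoded:   {opus_bytes_per_sec * SECONDS_PER_DAY / 1e6:.0f} MB per device per day")
    # Prints roughly 2.76 GB/day raw and ~259 MB/day compressed, per always-on device.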

flerchin
4 replies
2d

It's not absurd. Siri is listening for "hey siri" all the time. Folks don't really grok how that's different than Facebook spying as described. I'm frankly not sure that it's impossible, just that there's enough churn of Facebook employees that it would eventually leak if it was being done.

jdietrich
1 replies
2d

It's plausible to a naive observer, but it's absurd from a technical level. As you're probably aware, there's an entirely separate low-power coprocessor tasked with understanding just the words "hey Siri", because keeping the main processor awake for that task would absolutely destroy the battery life.

flerchin
0 replies
1d23h

It's not absurd from a technical level. The coprocessor can absolutely have a larger vocabulary and adding an entry to local storage for hearing "tennis watch" would be valuable for ad-marketing.

hinkley
0 replies
2d

I think some people need to figure out how those high speed video cameras work. It would affect their view of what is possible and not possible with electronics.

The ELI5 version is that it can watch everything and then keep the bits that happen just before the person hits the button.

That’s how they can catch things that explode or do something and you aren’t sure when it will happen. Like popcorn popping.

Siri is basically doing the same thing.

Siri also misfires. And that data can be sent in.
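
The underlying trick is just a ring buffer: keep only the last few seconds and persist them when a trigger fires. A minimal sketch of the general idea (my illustration, not Apple's actual implementation):

    from collections import deque

    SAMPLE_RATE = 16_000                    # assumed samples per second
    ring = deque(maxlen=SAMPLE_RATE * 5)    # holds ~5 seconds; older samples drop off

    def on_audio_sample(sample: int) -> None:
        ring.append(sample)                 # always writing into the buffer

    def on_trigger() -> list[int]:
        return list(ring)                   # keep just the moments before the trigger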

fallingknife
0 replies
2d

It's not impossible to do, but impossible to hide, and not feasible to do anyway. The technical complexity is massively higher than just using your browser history, which means there would have to be a huge team that does it, and therefore no secrecy. Also for the same reason it would be massively more expensive than existing data collection, but the ads would sell for the same price, so it's not economically viable.

127361
2 replies
2d1h

Likely a cognitive bias there, we pay selective attention to the topics we've been discussing previously, and likely remember when an ad comes up that's related. So yes, they're not listening to our microphones. It's just our selective attention and memory, instead.

yunwal
1 replies
2d1h

plus all the other types of tracking and stalking they do

trescenzi
0 replies
2d1h

The scarier reality to me is less that I'm being actively spied on and more that we're all very algorithmically predictable. Why bother actively spying when it turns out that the websites someone visits in a day do a good enough job of predicting what kind of person they are?

notahacker
1 replies
2d1h

And even if it were trivially easy to pull off surreptitious microphone monitoring, Facebook's incentives to do so aren't obvious. You can't charge advertisers more for an ad-targeting feature which officially doesn't exist (and if optimising targeting above and beyond expectations was that important, they could probably start by fixing some of the terrible choices made by self service buyers that wouldn't incur the same risk)

simonw
0 replies
2d

Yeah, I'll add that one to my list of arguments. If Facebook were doing it they would be actively telling marketers in order to get more ad money.

passion__desire
0 replies
2d1h

If that is the case, why aren't the developers who are implementing such pipelines coming forward? This was the argument against 9/11 being an inside job: too many workers would be needed to keep such a thing secret. A whistleblower like Tristan Harris could come forward and confirm that Facebook is doing such stuff.

papito
0 replies
1d22h

They have so much data about you that they can predict what you are going to do or look for before you know it yourself.

Technology good enough is indistinguishable from magic.

lcnPylGDnU4H9OF
0 replies
2d1h

chat with somebody about a topic and a day later I suddenly received ads Youtube recommendations about this topic

What's funny about it is that counterclaims will often be that this information doesn't necessarily need to be coming from reading your chat messages. It could be that they just know you happened to be chatting to that person -- despite "metadata" supposedly not being personally identifiable information -- and then one or more parties started to look up on google the topic they were discussing. As if that's supposed to be less invasive.

TheCoelacanth
0 replies
1d22h

GDPR is on the right track, but they made collecting consent too easy so now the entire Internet is covered with dark patterns trying to trick you into giving consent.

What we need is basically GDPR except that consent can only be collected in the form of a signed and notarized contract.

karaterobot
23 replies
2d1h

I took that screenshot on my own account. It’s toggled “on”—but I never turned it on myself.

There is a consent crisis as well, though I agree it's a smaller issue related to trust. There needs to be an actionable, legal definition of consent as it applies to website privacy (my naive assumption was that there already was one, but clearly that's not true, or it's not good enough or actionable enough), and it needs to require that users positively grant consent before their data is harvested, processed, or transferred to third parties, rather than letting consent be implied after the dirty deeds have already been done in secret.

thomastjeffery
9 replies
2d

There already is. There always has been! It's called fraud.

If you trick someone into signing a contract, then that contract is fraudulent.

If you tell someone that you will ask their permission before doing something; then silently claim you already obtained that permission in a prior contract, you are committing fraud.

I don't know when our judicial system lost all of its teeth, but you sure as hell can't blame that on the citizens they are failing to defend.

rapnie
3 replies
2d

"I consent to your [legalese with weasel words that only specialist lawyers have any chance to understand the implications of]."

See. No fraud.

trinsic2
0 replies
1d21h

It's my understanding that contracts are only enforceable when there is a meeting of the minds, at least under common law, and I think some of these people that run these companies should be dragged before a common law court to answer for these things.

thomastjeffery
0 replies
2d

Let me frame this in a story:

Dave is a prostitute. He is hired by Sue via contract. In this contract, Dave agrees to have sex with Sue. Also in the contract is the agreement that Sue may introduce any sex toys that will not physically harm Dave.

When they get together, Dave notices a camera pointed at them. Sue tells Dave that, "I only record sex with people who consent to my use of a camera."

Later, Dave learns that Sue did in fact record the encounter. Sue's defense? The camera was a sex toy! It didn't physically harm you! This was all in the contract!

Is Sue's defense valid? Of course not! Sue has committed fraud.

---

We should hold corporations every bit as accountable as we would fictional prostitutes.

banannaise
0 replies
1d23h

That's still fraud in most jurisdictions (although it's no longer provable unless there's a paper trail that says it's intentionally designed to deceive). However, even if fraud can't be proven, it may no longer be a valid contract. A contract is the mutual understanding of an agreement, not the words printed on a piece of paper. The paper can be used as evidence of mutual understanding, but it's not incontrovertible evidence of that. Many kinds of evidence can go the other way, such as unconscionability (terms that there's no way anyone would willingly agree to, thus suggesting coercion or subterfuge).

Or, framed another way: consent, in both a moral sense and a legal one, requires awareness from both parties of what you are consenting to. Each party is, to some extent, responsible for using the resources at their disposal to ensure they understand the contract (e.g. a blind person must find someone to read them the contract), but intentionally undermining the process of mutual understanding is at best very good evidence against enforcement of the contract and at worst fraud.

jstarfish
2 replies
2d

I don't know when our judicial system lost all of its teeth

Easy. It's when we started treating the virtual reality of the internet as though it is actual reality.

Digital contracts shouldn't be binding, any more than mowing down pedestrians in Grand Theft Auto constitutes murder. These video-game contracts are mutable and ephemeral; how the hell do you prove what version of the contract/TOS you agreed to, when companies can change it arbitrarily-- and without any revision history? Tech makes it easy to gaslight the fuck out of anybody.

There'd be an actual cost to them were they to play this game with paper contracts.

Eisenstein
1 replies
1d22h

Are you saying that because something is digital and exists as data on a computer, that it is not valid? As opposed to something on paper? How is that different in any real way? Paper is transitory -- anyone can forge a signature or reprint a contract and photocopy your name on it and shred the original and put the modified one in the file cabinet. Just because the materials used are different doesn't make the law any less applicable.

TheOtherHobbes
0 replies
1d21h

This is an actual issue in consumer banking. If a bank forges a paper document, it's possible to hold up the original countercopy (assuming you filed it) and say "No, your document is fake, and you are attempting fraud."

These agreements often have millions of copies, so it's easy to find the applicable version of the T&Cs.

If all you have are digital T&Cs at the end of a web link, and those T&Cs can be updated at any time, and you are only ever sent the link, and not the updated T&Cs - you have nothing.

You could argue that everyone should download every T&C update. In theory that's true. But in practice a problem that could be solved relatively simply - keep a single letter - now requires many more steps, backup strategies, and so on.

You could also argue that banks should keep copies of their T&C versions and supply them on request. Which they do - except that in practice a bank that's trying to forge a document won't have a problem forging T&Cs.

This is not hypothetical. I've known people win court cases because they kept a single piece of paper.

JohnFen
0 replies
1d20h

The courts (in the US) are only really useful for this sort of thing if you have money.

93po
0 replies
1d19h

I don't know when our judicial system lost all of its teeth

The actions of the government have been and likely always will be done for the interests of the wealthy and powerful. Policing itself was a concept first introduced by the wealthy to protect them and their interests. I mean just look at George Washington - he was the richest US president of all time after Trump at a present valuation of around $700 million.

jampekka
6 replies
2d1h

In GDPR there is. But the companies don't care and neither do the regulators. Big business just has too much power and influence.

BiteCode_dev
5 replies
2d1h

GDPR is not much enforced, unfortunately.

ponector
4 replies
2d

But it is enforced. Here is a list: https://www.enforcementtracker.com/

BiteCode_dev
2 replies
2d

Given that I encounter at least one violation a day, it's really a small list.

It should contain hundreds of thousands of entries.

Like fines for speeding.

Imagine if you had a page for parking tickets since 2018 (!) and it contained only a few hundred penalties.

This is ridiculous, especially since:

- it's not hard to find most cases. Just take any cookie banner that makes it harder to opt out than opt in, and that's it, you have a violation. It can even be automated.

- it can bring a lot of money to a system that is dying for it.

ponector
1 replies
1d22h

Are you reporting such violations? How would the authorities know about them?

Also the jurisdiction is in question. Who should enforce? Country where user is? Country where domain is registered? Country where data center with servers is? Country where nominal owner of the web page is? Country where final beneficiary is? Country where creator of the banner is?

jampekka
0 replies
1d22h

The jurisdiction is EU (GDPR is binding legislation for all member countries, not dissimilar to US federal law), and the enforcer is primarily the country where the company that owns the website has European incorporation.

In theory anybody can report to their country's data protection officer, although at least in Finland they don't care about individual citizen complaints.

The NGO noyb is sending out batches of these reports. They have some minor wins, but as you can probably tell from all these nags around, by and large there's not much effect.

All these questions are sorted, on paper.

https://techcrunch.com/2022/08/08/noyb-gdpr-cookie-consent-c...

jampekka
0 replies
2d

Very few of those are about consent. Which is quite striking, as practically all of the spyware nags are illegal and thus the consent is violated probably billions of times every day.

SonicSoul
3 replies
1d22h

It blew my mind when I logged into Dropbox that sharing my data with third-party "vetted" AI companies was enabled by default.

I just wrote a WTF email to their support, but most likely I will be discontinuing my account. Can't imagine what they can possibly say that will make this OK.

trinsic2
2 replies
1d21h

I saw the writing on the wall when they started sending ads via their notification system with no way to turn it off. I'm on pCloud now, but it's not looking much better. They recently started spamming my mobile app with Thanksgiving ads to upgrade to a different tier of their product even though I had offer notifications turned off.

I had some words with their support department, here is what they said:

Hello,

Thank you for contacting pCloud's Technical Support.

This is a pCloud banner for Black Friday and it's not a notification. Unfortunately, you won't be able to remove it manually and you should wait until the end of the Black Friday promo - 30.11.2023.

Should you require any further assistance, do not hesitate to contact us.

Regards,

George Lewis

pCloud's Technical Support

Like it matters that it's not a notification. You still need my consent regardless of what kind of ad it is.

SonicSoul
0 replies
1d5h

heh.. here is the response i got.

  Hi there,
 
  Thanks for taking the time to write in to Dropbox Support. My name is Ross, and I will provide assistance with your case.
 
  From my understanding, you are inquiring about being opted into Dropbox AI by default.
 
  Thank you for alerting us to this problem.
 
  Dropbox engineers are aware of the problem and are working on a solution.
 
  Also, please note that no files were shared in this case.
 
  Sorry for any inconvenience this is causing.
 
  We'll update you shortly on this issue.
 
  Don't hesitate to reach out to me again for further questions!
 
  Best regards,
  Ross

JohnFen
0 replies
1d20h

This sounds like a variation on the line that "it's not an ad if we aren't getting paid to show it to you". Like with Windows notifications "informing" you of other Microsoft products. It's bullshit, and is one of the things that decreases trust.

Log_out_
1 replies
2d1h

Whatever comes out of the courts will be just another anchor point to abuse the civilian vs. corporate power asymmetry. What we need is the ability to revert laws back to the trust-busting laws of the New Deal era, crush the bad influence, and rebuild.

__loam
0 replies
1d20h

Seems like an overly vague, cynical view.

bambax
15 replies
2d1h

I certainly don't "trust" OpenAI or any other big company about what they say they did, or will do, or are doing.

Yet I believe OpenAI isn't using data from Dropbox to train their models without users' consent.

BUT I don't think that's the problem here. The problem is data in transit; data sent to third parties who can actually read it, and who may have rogue employees that Dropbox has no control over; data that can appear in logs or subject to different policies, etc.

If I send private data to Dropbox they can't send it to anyone for any reason, including "improving" their product, without my explicit and informed consent. I'm not sure how this is even debatable.

If Dropbox wants to house models and offer RAG search themselves, to consenting users, that's one thing.

If Dropbox sends all data of all users to third parties without telling anyone before the fact, that's another thing. A terrible thing.

blibble
9 replies
2d

Yet I believe OpenAI isn't using data from Dropbox to train their models without users' consent.

why? they trained on my code without my consent, why is user data any different?

training is either fair use or it isn't

and high-growth Silicon Valley companies aren't known for their adherence to the spirit of the law

simonw
2 replies
2d

why? they trained on my code without my consent, why is user data any different?

It's different. Code and other content that you shared online - no matter what license you shared it under - is still fundamentally different from user data that you never shared anywhere at all.

That difference is really important, to me at least.

I would be absolutely furious if I found that someone had trained an AI model on my private data in Dropbox. I personally have no problem at all with someone training an AI model on content I have posted to my blog.

__loam
0 replies
1d20h

A huge problem with the art community right now is if you're a professional artist, you basically need to maintain a public portfolio and social media presence to find work. That ecosystem was developed and existed before image generation AI was a commercial thing, yet artists are finding out retroactively that their data got pulled into these training sets without their knowledge or consent. Even if it's legal, it's still pretty gross imo.

JohnFen
0 replies
1d19h

The two things aren't the same, and one is more egregious than the other. But I do think that hoovering up all that web data to train models was way over the line. Worse, there's nothing I can do to prevent it. That's why I felt forced to remove my websites from the public web. I can't think of any other way to defend myself from these entities.

d4mi3n
2 replies
2d

Training against your code shouldn't be kosher if your code isn't open, you retain a copyright, and you didn't consent.

Even if training on copyrighted, private code _was_ fair use, training on PII without consent is _still_ problematic. It's a huge violation of privacy and adds all kinds of legal/regulatory/ethical risks that potential consumers of said models won't be on board with.

The language and spirit of the law certainly hasn't caught up, but I feel like in some cases (unconsenting use of PII, medical information, etc.), the laws and ethics of how such data is used are pretty well established.

lacker
1 replies
2d

AI systems are trained against private, PII, copyrighted data all the time without explicit consent. For example, consider the spellcheck in Google search. Every query you make will go into the training for that system, along with your preferred language and country of origin.

You can certainly argue that generative AI systems are different than previous AI systems and should be treated differently. But the current situation is basically that you are allowed to train an AI on any data you have, regardless of copyright or consent. I wouldn't be surprised if that ends up being considered legally and ethically okay, because it's the status quo, and because it's hard to define "what counts as AI".

d4mi3n
0 replies
1d20h

The point about spell check is a very good one. I think one big differentiator is that the attack/risk surface for something as complex as a LLM is much higher and that much, much more information is encoded in an LLM than a spell checking dictionary.

For example, it's possible to extract training data from an LLM—which could include PII/medical data/etc. Those risks don't exist with spellcheck as far as I'm aware.

To your point about what is "AI", I'd state that AI is a misnomer. What we're really talking about are generative large language models (LLMs). What _can_ be considered an LLM is definitely up for debate, but if you were to describe one in general terms I think we could reasonably say that most (or all) things we consider LLMs are:

  1. A probabilistic model of a natural language
  2. Have the ability to interpret and generate natural language text
  3. Typically encode a large volume of training data

I'd love to hear other thoughts on how one would define an LLM in practical, simple language. I imagine doing so would be a pre-requisite to any effective legislation.

vorticalbox
1 replies
2d

Just out of interest I have a few questions.

    1. How do you know for sure it was trained on your code?
    1.1 If you saw it reproduce your code verbatim, how do you know it didn't just hallucinate it?
    2. Was your code publicly available?
    3. What licence was it under?

spatley
0 replies
2d

Well Github says it out loud:

“ GitHub Copilot is trained on all languages that appear in public repositories.”

Note it says “public” and not “open source”

https://docs.github.com/en/copilot/overview-of-github-copilo...

bambax
0 replies
2d

I don't think they do, but I don't know for sure.

But my point is, this (training) isn't the main problem.

lukeschlather
1 replies
2d

If Dropbox wants to house models and offer RAG search themselves, to consenting users, that's one thing.

Well, I'm a paying Dropbox customer, and I would not pay for this feature. I would like it if they encrypted my data in a way that made offering this sort of feature impossible. I do want my data recoverable, but given that they can offer this AI "feature" at all, it seems like they've made zero effort to prevent malicious internal employees or third parties from accessing my data.

hinkley
0 replies
2d

It's like when "forgot my password" features send you back your current password instead of a new temporary one. Great, I'm back in my account, but now I dread even being here. What other stupid shit are you doing?

schnable
0 replies
2d1h

Exactly. It's all about who can see my sensitive data, period.

autoexec
0 replies
2d

If I send private data to Dropbox they can't send it to anyone for any reason, including "improving" their product, without my explicit and informed consent. I'm not sure how this is even debatable.

It's debatable because your data isn't "private" when you foolishly hand it over to some third party without encrypting it first, and their policies say they can basically use it for anything as long as they can claim that it was "in furtherance of its legitimate interests in operating our Services and business". In fact their policy says they can update their policies any time they feel like it.

Privacy policies aren't even legally binding. If you're in the US and didn't sign a contract with dropbox you have near-zero rights, and any attempt to assert whatever rights you think you have will require going to court which is basically a pay to win system and you'll be up against a company with billions in assets so good luck with that.

Yes, it'd be a very shitty thing for dropbox to outright violate the trust you put in them, and it might be a terrible business decision that means no one ever trusts dropbox with their data again, but if they one day decided to go fully evil and start handing your data over to anyone willing to pay for it I doubt there'd be much of anything you could do about it.

Don't put any data you care about in the cloud without locally backing up and encrypting it and you'll never have to worry about what the cloud provider does with it or who they give it to.

Swizec
0 replies
2d

If Dropbox sends all data of all users to third parties without telling anyone before the fact, that's another thing

If they've ever signed a BAA (business associate agreement) with any enterprise customers using Dropbox for documents bound by HIPAA, this would get them into a lot of trouble very quickly. The financial penalties are _high and per exposure per employee involved_. They also hit employees directly (if you disclose/share HIPAA info then you personally are liable).

So I'm certain that even if they did share documents with undisclosed 3rd parties without notice, it wouldn't be "all". Enterprise data is likely safe. Those contracts get heavy scrutiny before signing.

thomastjeffery
13 replies
2d1h

I don’t think so: I think this is a combination of confusing wording and the eternal vagueness of what the term “consent” means in a world where everyone agrees to the terms and conditions of everything without reading them.

Bullshit. Consent means consent.

There is no confusion here. Nothing is vague. This is explicit fraud.

I feel like every day now, I'm reading an article where the problem is obviously just fraud.

Fraud has always been illegal. When did we stop prosecuting it? Why aren't we talking about that every day?

ToucanLoucan
9 replies
2d1h

The entire basis of the current generation of AI is in stolen materials. Stolen writing, stolen art, stolen music, all of it taken because it was "publicly available" meaning "there was nothing in existing law that said we couldn't take it for training a learning model." Now they've done the same with Dropbox contents.

AI is very cool technology. The incredible overstepping of any and all ethics with regard to getting training data (just all of the data, from any source, as fast as possible, right now, in this mad dash to create it at all costs) is not, and it should mandate a complete restart on behalf of the people behind it. For this one, and for so many other ethical lapses on the part of OpenAI, the models as they exist are tainted beyond any ethical use as far as I'm concerned.

If your product uses this stuff, you are not getting one red cent from me for it. Period, paragraph.

andy99
5 replies
2d

This is a silly argument. Public sharing on the internet has nothing in common with storing on a private storage service. Posting something online you are literally inviting people to look at it, and there is copyright law that governs how it can be copied. You can argue training AI models on internet data violates copyright (though it almost certainly doesn't) or that the law needs to be changed. But none of that has anything to do with training on private data. It's the same kind of "you wouldn't steal a car" false comparison.

ToucanLoucan
4 replies
2d

Public sharing on the internet has nothing in common with storing on a private storage service.

No shit.

Posting something online you are literally inviting people to look at it,

Yes, PEOPLE. Posting things for people to see is why the Internet exists, is why USENET was created, is why web forums were created, is why social media was created, is why 9/10ths of the Internet as we know it today was brought into creation.

It was not put there so people who do not know any of those people and do not give a rats ass about what they made could take millions of images, writings, and sounds and shove them into their product without their consent for purposes it was never meant for so they could automate art. That is categorically not what any of that is for, and you, and everyone else making this tired point damn well know that.

If you have no issue at all with your creative output being used to train data models, more power to you! That's how consent works! You consent and that's completely, 100% fine. That consent should not have been presumed as it was, and even if you assume complete and total innocence on the part of the AI creators, once it became extremely fucking obvious that tons, and tons, and tons of creatives absolutely would not have consented if asked, then their data models should've been re-trained with that misused data removed. That is the ETHICAL thing to do: when someone says "hey I really don't like what you're doing with my material, please stop" you STOP, not because that's legally binding, not because you'll be sued, not because you're infringing copyrights, but because you have a fundamental respect for your fellow human being who says "I don't like this, please don't do it" and then you, you know, don't do it.

Unless of course what you actually are is an ethics-free for-profit entity that needs to get to market as soon as possible with a product to sell that you probably can't be sued over, in which case you tell those people who's work your product could not exist without to eat shit, and proceed anyway. Which is basically exactly what happened and continues to happen.

And before you even go there to the "well how could they ask for the entire dataset's contents" I DON'T CARE. I'm not the one doing this, this is not my problem to solve, just because the ethical way to do a thing is hard, time consuming, and/or expensive or otherwise difficult, you don't get to just waltz past the ethics involved, even if you're a research project! I personally wouldn't want to get permission from a few million artists to use their work in this way, I don't think most of them would be comfortable with it, and even if they were, I don't really want to do that, it sounds like a ton of work. SO I DIDN'T.

notahacker
3 replies
2d

Not to mention that a lot of the time it wasn't just that they didn't ask first, but they ignored specific widely used and machine readable license abbreviations and copyright symbols attached to the content

If a corporation argues that it can ignore your AGPL because it didn't have to blow the bloody doors off to get hold of your code and its training process is "just like your browser cache" or "a person learning" and the derivative stuff is completely novel, why would you trust them not to deploy the same "but it's not exactly copying" arguments when given access to other stuff that has third-party "no copying" agreements wrapped around it, like your Dropbox?

tjr
2 replies
1d22h

Agreed, it feels like people's copyright wishes are being flagrantly ignored, with a sense of "well, it's too late now, just live with it".

And I do not buy the "just like a person learning" argument. At least, not fully.

I could see that, if you have a fully-functioning AI system, then handing it a new article to ingest could be "just like a person learning".

But many people graduate high school having read just a few dozen books (or less), having been around maybe a few dozen people (or less), and watched a few dozen movies (or less). A person does not need to ingest a nontrivial percentage of the entire wealth of human knowledge just to be able to be intelligent enough to read an article in the first place.

There may be people who do not care about this distinction. That's fine. But I am quite convinced that the distinction exists. And thus I do not believe that training an AI system is just like teaching a person -- and making copyright decisions on the basis that the two things are identical does not make sense.

ToucanLoucan
1 replies
1d22h

And thus I do not believe that training an AI system is just like teaching a person

100%. I love the way you put this and just wanted to expand on it a little bit, to remind everyone that there have been numerous, flagrant examples of various creators of various media who have their names/handles put into these models, who have work that is ludicrously similar to theirs in style produced from the model, and despite the fact that it's technically original, it is not original in any way meaningful to the topic, or defensible by anyone debating in good faith. That you need to put things like "unreal engine" "featured on artstation" and the like proves this. You're telling the machine to aim for works you know are of a higher quality in the dataset to get a better result.

Now if you're just fine with that and content to fuck over artists like that for no other reason than you can, I can't stop you. But please spare me the righteous indignation of objecting to the characterization of such behavior. It's fucking obvious, do not insult the intelligence of your opponents by insisting otherwise.

notahacker
0 replies
1d21h

And tbh even if people are absolutely fine with that, and think that the analogies and legal arguments that they make are absolutely sound and maybe think copyright's a terrible idea anyway, I still can't see why they'd expect Big AI to suddenly drop the "don't care what you think about how we use your stuff, if it's not explicitly illegal we're going to use it" stance when it comes to stuff that's supposed to be 'private' rather than stuff that's supposed to be 'property'.

Sure maybe you care more about whether OpenAI has stuff derived from the contents of your Dropbox on their servers which is technically neither "training a model" nor the actual "copy" they were required to delete after 30 days than you ever did about copyrighted stuff. But why would OpenAI?

simonw
1 replies
2d1h

Now they've done the same with Dropbox contents.

This is what I mean by the "trust crisis".

Dropbox very clearly deny that Dropbox content is being used to train AI models, by them or by OpenAI.

You don't believe them, because you don't trust them.

ToucanLoucan
0 replies
2d

It's almost like acting completely unethically in the public space has consequences or something

thomastjeffery
0 replies
2d

It's well past time for the end of the Digital Millennium of Copyright.

The problem here is that these corporations are given carte blanche to make any derivative works they want. They get the exclusive freedom to ignore copyright law.

The rest of us don't.

The worst part is that they get to turn around and say their models are protected by copyright!

This is copyright laundering. There are only 2 reasonable avenues for us to react:

1. Make "AI" companies respect existing copyright law when compiling training datasets.

2. Get rid of digital copyright for everyone.

I vote option #2.

jrmg
1 replies
2d

See also e.g. YouTube advertisements for financial schemes, supplements, health devices - and all the products that just don’t do what they say they do.

In a conversation online about this recently, someone said (paraphrasing) ‘Government should regulate this!’ - but there are already regulations about all of this! It’s fraud! Fraud is literally paying for YouTube (and to varying extents a lot of the rest of the web) and nothing is being done about it.

I fear that it’s just considered so normal now that it will take a very long time to stop. If existing regulations about fraud had been enforced when the fields were nascent, establishing norms that it was just as unacceptable ‘with tech’ as it was before, I think it would be a lot less widespread now.

thomastjeffery
0 replies
2d

Here's my advice: stop coloring our discussions with those norms.

We all expect the courts to give the benefit of the doubt. The problem is that we echo that expectation in our discussions. We are not the courts! We should be loud and direct with our criticism.

There is no doubt here. I refuse to give corporations that benefit.

samsolomon
0 replies
2d1h

Matt Levine has a running theme of articles where "Everything is securities fraud."

It's not free to read his articles on Bloomberg, but his newsletter, Money Stuff, is free and one of my favorite daily reads.

https://www.bloomberg.com/account/newsletters/money-stuff

caconym_
13 replies
1d23h

Good article overall, but I find the analogy between "my phone is eavesdropping on me" and "openai might be lying about how they use my data" somewhat flawed, because there are robust checks on third-party apps accessing my iphone's microphone that have no equivalent when my data is handed to a third party in the clear. Certainly for the layperson they may as well be the same thing, but that layperson is still being protected from the former.

This might seem like splitting hairs, but I think it's extremely counterproductive to act like the battle for users' data privacy and sovereignty is lost just because most users can't tell the difference. I see this a lot from the other side: extremely cynical, at least somewhat tech-savvy people reacting to each new corporate abuse with an "old news" mentality as if we've already reached the endgame—if you haven't been using tails linux for at least a decade you may as well just zip up your home directory and email it to every shady tech corp and data broker you can find contact info for. These people ought to know better, and they ought to set a better example for those who don't.

This inculcated helplessness definitely hurts trust, but it also gives people the impression that a better world isn't possible—that there aren't better and worse choices for who to trust with their data and privacy. This Dropbox snafu seems to be that mindset coming back around: users won't give a shit if we imply we're sending their private files to a third party without asking, right? Lunacy.

Incidentally, I've had most of my data out of Dropbox for a while (in favor of a self-hosted solution), but yesterday was the kick in the ass I needed to cancel it for good. Thanks, Dropbox!

JohnFen
4 replies
1d21h

it also gives people the impression that a better world isn't possible—that there aren't better and worse choices for who to trust with their data and privacy.

I have no choice but to believe that a better world is possible. The current state of things isn't tolerable, and if tomorrow can't be better then what's the point of anything?

Also, I'm sure that there are better and worse choices for who to trust with data and privacy -- but it's literally impossible for me to know who that would be (or if anybody is "trustworthy" in a broad sense), so I have to work on the assumption that nobody can be trusted.

I want to be less cynical, but the way things have gone over the last decade or two makes cynicism look entirely justified.

Is my attitude incorrect? If so, how can I correct it?

caconym_
3 replies
1d20h

Also, I'm sure that there are better and worse choices for who to trust with data and privacy -- but it's literally impossible for me to know who that would be (or if anybody is "trustworthy" in a broad sense), so I have to work on the assumption that nobody can be trusted.

I think it's very possible to get a decent sense of who cares more and who cares less. For instance, look at Apple's recent pushback against scanning private user data for CSAM, and compare that with the recently publicized case of a guy having his Google account permanently closed (and being referred to the police) because he took pictures of his baby to send to a pediatrician.

"Normies" think these companies are the same, that they're all selling your data and that trusting one is just as bad as trusting the other. If we can chip away at that misapprehension, maybe said normies will start giving more of their money to companies that do a better job keeping private data private, and maybe that will drive long-term trends in the market.

Or maybe not! Things are bad, I agree. But I still don't believe everything is equally bad everywhere. I don't believe it's all over.

JohnFen
2 replies
1d18h

look at Apple's recent pushback against scanning private user data for CSAM

I don't think that's a good example, because Apple was very much in favor of client-side scanning until they encountered a popular uproar about it.

My impression of Apple is that they're very sensitive to privacy intrusions by anyone who isn't named "Apple".

caconym_
1 replies
1d18h

because Apple was very much in favor of client-side scanning until they encountered a popular uproar about it.

I always find it curious when I see "company changes plans due to consumer feedback" framed as a bad thing rather than a good thing, especially when they are one of a field of companies that are largely already doing the thing people were upset about. Personally, I think it's great when a company has a pragmatic motive to act in the consumer's interest rather than (or possibly in addition to) an idealistic motive, because a pragmatic motive is much more likely to stand the test of time.

So, I disagree with you that it's not a good example. In fact, I think it's a great example of one of these big platforms being materially better wrt. consumer privacy than their competition. Frankly, I think your "impression of Apple [...] that they're very sensitive to privacy intrusions by anyone who isn't named 'Apple'" is exactly what I was talking about in my first comment in this thread: a cynical, unsubstantiated implication that all these platforms are equally as bad as each other, and that we've already lost the battle in the consumer space.

JohnFen
0 replies
1d2h

I always find it curious when I see "company changes plans due to consumer feedback" framed as a bad thing rather than a good thing

I wasn't framing it that way at all. Being responsive is a good thing. But in the context of this issue, that they did so means they're responsive to outcry. It doesn't seem to me to be an indication that they care more about privacy issues generally.

implication that all these platforms are equally as bad as each other

As I stated in an earlier comment, I don't think this at all and certainly wasn't intending to imply it. But just because one company is better than another on these issues doesn't mean that company is good on these issues.

mightybyte
2 replies
1d22h

Can you give more detail about your self-hosted storage solution? I've wanted something like that for a while.

voltaireodactyl
0 replies
1d22h

FWIW, if you’re looking to enter that space as simply and reliably as possible, I would pick up a Synology machine in whatever size you need. They have some competition but remain the best mix of stability in hardware/software/price imo.

caconym_
0 replies
1d22h

I use a Synology NAS in my house with Syncthing running on it (for Dropbox-like directory sync) and Wireguard running on my router so that when I'm remote I can tunnel in and access all my services that way rather than opening ports individually (also great for mobile device ad blocking via my pi-hole). I'm luckily not behind a CGNAT, so no external proxy is required—just dynamic DNS. It was a fair amount of work to set it all up but it works flawlessly now, and I had fun. :)

People who are really hardcore about data sovereignty would probably take issue with my choice of Synology, but you could roll your own if you prefer. Synology seems to have a decent record and reputation, and my experience with it has been pretty good overall.

For backup, I'm currently using Synology's proprietary thing to do snapshots to Backblaze, and mirroring all my critical stuff to various other places. I'm planning to set up a Restic backup for more redundancy (and a bit of obscurity), and I would also like to figure out a cold backup scheme for the whole NAS though I haven't thought very hard about that yet.
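
A minimal sketch of the kind of Restic-to-Backblaze job described above, driven from Python via subprocess. The bucket name, paths, and credential values are placeholders, and the retention flags are just one reasonable choice; treat it as an illustration of the idea rather than a finished backup script.

    import os
    import subprocess

    # Placeholder credentials and locations: substitute your own B2 key pair,
    # repository password, bucket, and Synology share path.
    env = dict(os.environ,
               B2_ACCOUNT_ID="<b2-key-id>",
               B2_ACCOUNT_KEY="<b2-application-key>",
               RESTIC_PASSWORD="<repository-password>")
    REPO = "b2:my-nas-backups:restic"   # restic's Backblaze B2 repository syntax
    SOURCE = "/volume1/data"            # the NAS share to protect

    def restic(*args):
        # Run one restic subcommand against the remote repository.
        subprocess.run(["restic", "-r", REPO, *args], env=env, check=True)

    restic("backup", SOURCE)  # push a new snapshot
    restic("forget", "--keep-daily", "7", "--keep-weekly", "4", "--prune")  # thin old snapshots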

fasterik
2 replies
1d21h

You can basically go in two directions with this. Invest the resources necessary to use the open source and self hosted tools, or settle for the convenience of proprietary services and be mindful about what you put in them. I use Dropbox, but everything I put in Dropbox is either encrypted or is something I wouldn't care about if it got leaked onto the open internet. As someone who has spent many hours of his life tinkering with self-hosted solutions, after a certain point I just decided that I wasn't deriving any real benefit from it and my time and energy would be better spent elsewhere.

caconym_
0 replies
1d20h

I appreciate the value of time, but the reality of Dropbox and similar commercial solutions is that they're constantly changing and forcing their customers to deal with those changes. A good self-hosted stack may take more up-front effort, but it's much less likely to change without your explicit buy-in.

JohnFen
0 replies
1d21h

There's a third option here, too: don't use the tools.

skywhopper
0 replies
1d23h

I don’t think you’re splitting hairs here. You’re making a very good point. App access to the microphone is controlled by the OS and there are OS provided, user-accessible tools that make it clear what apps can use the microphone and when.

Meanwhile cloud data access is completely on a “trust me” basis, and plenty of companies have been proven to have abused that trust.

simonw
0 replies
1d23h

Yeah, I tried to touch on that analogy flaw in the article:

One interesting difference here is that in the Facebook example people have personal evidence that makes them believe they understand what’s going on.

With AI we have almost the complete opposite: AI models are weird black boxes, built in secret and with no way of understanding what the training data was or how it influences the model.

Completely agree that our biggest threat right now is complacency. If people form incorrect mental models of what's going on and then shrug their shoulders and accept that's just how it is, we won't make much progress in improving the actual problems.

skepticATX
9 replies
2d1h

The key issue here is that people are worried that their private files on Dropbox are being passed to OpenAI to use as training data for their models

This is only part of it. I don’t want my data being sent anywhere unless I authorize it, regardless of what it’s being used for.

In this case, we not only have to worry about OpenAI training on our files (I have no reason to doubt that they are truthful when they say they won’t), but we also have to trust that they can securely handle our files.

sneak
3 replies
2d1h

It’s SaaS vendors all the way down.

If you don’t want third (or second) parties reading your data, make sure it is e2e encrypted clientside.

This means no Dropbox, but Syncthing instead. That means no Slack or Discord, but Signal.
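
For anyone wondering what "e2e encrypted clientside" can look like in practice, here is a minimal sketch using Python's cryptography package: encrypt locally, let the sync tool see only ciphertext, and keep the key somewhere the provider never touches. The file names are placeholders.

    from cryptography.fernet import Fernet

    # Generate once and store it somewhere the sync provider never sees
    # (password manager, hardware token, etc.).
    key = Fernet.generate_key()
    f = Fernet(key)

    # Encrypt locally before the file ever enters the synced folder.
    with open("tax-return.pdf", "rb") as fh:
        ciphertext = f.encrypt(fh.read())
    with open("Sync/tax-return.pdf.enc", "wb") as fh:  # only ciphertext leaves the machine
        fh.write(ciphertext)

    # Decrypt on another device that holds the same key.
    with open("Sync/tax-return.pdf.enc", "rb") as fh:
        plaintext = f.decrypt(fh.read())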

the_mitsuhiko
2 replies
2d1h

That is not entirely accurate. Thanks to the GDPR, subprocessors need to be disclosed, and so do changes to them. As a result, plenty of companies are now much more selective about how many they are willing to entertain.

sneak
1 replies
2d

Courts can only address the issues by applying privacy laws long after your data is leaked and your business or privacy is destroyed.

Use clientside encryption. Technical measures, not legal ones.

red-iron-pine
0 replies
1d3h

to paraphrase a CTO that I used to know:

you may win the court case... 3 years from now. your secrets have been on the dark web for that whole time, and have filtered to the regular web by then. even if you know who did it and can get them extradited, they may not have cash to pay for the damages, and seeing them in jail, though nice, won't make your data private again.

notahacker
1 replies
2d1h

There is also the distinct possibility that the wording is literally true in the sense that they don't train their model on your data ("train" can be interpreted as having a very specific meaning...), but they nevertheless also undertake some sort of monitoring of the outputs a model produces (which could obviously entail privacy leaks, especially if it's using RAG on your personal files).

Seems quite sensible for people not to trust that they fully understand the small print, since (i) they probably don't and (ii) the one thing most AI companies have made clear is that they think they should be able to use whatever material they like however they like, regardless of whether the creator of that material gives them their blessing or not.

simonw
0 replies
2d1h

This is absolutely part of the problem.

Even if you have a deep technical understanding of how this stuff works and how these tools are built, you can still have very legitimate questions and doubts about how your data is being used beyond pure model training.

kolinko
1 replies
2d1h

The same argument could apply to processing data in the cloud, and somehow nobody complains about the fact that Dropbox stores their data on AWS/Google Cloud or wherever.

JohnFen
0 replies
1d19h

I complain about my data being stored in the cloud all the time. I won't use services like Dropbox in large part because of the cloud, but I can't do a thing about other entities I do business with storing the data they glean about me in the cloud.

It's a real issue.

simonw
0 replies
2d1h

That's true. It's very reasonable to worry that OpenAI's "we only hold on to your data for 30 days for auditing purposes" policy means there's 30 days in which they might have a breach that leaks your data.

Especially since they've had a few documented security problems in the past.

bachmeier
7 replies
1d22h

This article misses the mark. The problem is that, for good reason, we've moved to a post-trust society. The author argues that you're a nutjob if you don't trust what large corporations are saying. He goes so far as to say it's on you to prove that they're doing something wrong.

This doesn't represent reality. Take this quote:

A society where big companies tell blatant lies about how they are handling our data—and get away with it without consequences—is a very unhealthy society.

So, what are the consequences of lying? If there are any, I'm curious what they are. The author is arguing that "big companies" are going to do the morally correct thing even though it hurts their profit. That's just a silly argument.

In a post-trust society, companies have to prove they're doing what they say. There's no more making a statement and having the masses act as if that's what's going to happen.

Honestly, is there less than a 100% chance that down the road there will be a disclosure that there was a "mistake" where your data was used differently from what they claim? Fool me once, shame on you, fool me 5,872,328 times, shame on me.

simonw
3 replies
1d22h

"The author argues that you're a nutjob if you don't trust what large corporations are saying"

That wasn't the message I was trying to communicate at all.

My point with this piece is that people don't trust AI companies, and companies need to figure out what to do about this - that's why I called it a "crisis".

I certainly didn't argue that big companies would "do the morally correct thing even though it hurts their profit".

"In a post-trust society, companies have to prove they're doing what they say. There's no more making a statement and having the masses act as if that's what's going to happen."

I think you and I might be in strong agreement there. My goal in writing this piece is to get AI companies to take this problem seriously and, like you said, "prove they're doing what they say".

itronitron
1 replies
1d22h

If a company is going to claim that they don't have Syphilis, then they need to guarantee that each of their "trusted partners" also doesn't have Syphilis, and each of their trusted partners needs to provide the same guarantees about their own trusted partners as well, ad infinitum.

educaysean
0 replies
1d21h

Good. Let's make the companies that can make such claims stand out above the rest.

__loam
0 replies
1d20h

I thought your piece here was pretty thoughtful, and I appreciate that you laid out how we know Facebook is not, in fact, spying on us through the microphone.

zmmmmm
0 replies
1d21h

If there are any, I'm curious what they are

The EU is well established now in setting penalties as a percentage of revenue. These really do scare big tech because their revenues are huge.

But I think your whole point about a post-trust society is highly aligned with the article. It's exactly that: people don't trust what they are told by anybody now (even government), and that's a crisis.

ryanjshaw
0 replies
1d22h

They also need to stop writing things like this:

Use artificial intelligence (AI) from third-party partners.... Your data is never used to train their internal models

Qualifiers like "their" and "internal" make me deeply skeptical. Are they saying that my data is used to train some "non-internal" models, whatever those are? Also, is my data being used to train any models in Dropbox other than one exclusively used by my account for the features I'm expecting? etc.

2devnull
0 replies
1d21h

“In a post-trust society”

Is it really useful to think in black and white teleological terms? “Post-trust” just sounds a bit sophomoric/marxist. Like “late stage capitalism” or something similarly presumptuous.

nextworddev
6 replies
2d

This article has good intentions but seems like a giant buildup to promote a local inference tech stack.

simonw
3 replies
2d

My agenda with this article is to make AI companies aware that they have a genuine crisis on their hands, and hopefully push them to be more transparent and work hard to try and regain the trust that they have lost.

I added the bit about local models mainly because I knew that if I didn't 90% of the conversation about the article would be "yeah but he didn't talk about the obvious solution, which is local models".

nextworddev
2 replies
1d22h

Thanks for the clarification.

I agree that “opt in by default” is a dark pattern - but how else would any cloud service provide a RAG-enabled service? Perhaps it should be made explicit during the onboarding process, like geolocation (opinions as a former location PM at a FAANG)

simonw
0 replies
1d22h

In this particular case it looks like opt-in would have been a much better way to go.

Having a "enable AI features" checkbox seems to me like it would be a smart approach here.

JohnFen
0 replies
1d2h

I agree that “opt in by default” is a dark pattern

Not only that, it's a non sequitur. It's ridiculous on its face to say someone else making a decision for you is you "opting in".

minimaxir
0 replies
2d

Providing evidence to support a conclusion is a perfectly valid rhetorical strategy.

It's only annoying if a blog post is using the structure to pull a "here's a list of reasons why X sucks; this is why my monetized project Y is better, please use it!" which is not the case here.

bee_rider
0 replies
2d

Promoting local inference is a good thing to do.

jeremyjh
5 replies
2d

As with many conspiracy theories, too many people would have to be “in the loop” and not blow the whistle.

I used to believe this, but I do not understand how anyone can say this after the Snowden revelations. The vast majority of engineers are not martyrs. Given a choice to blow the whistle and then seek political asylum in Russia, most will not blow that whistle. Instead, they will live their lives and work to support their families.

"But this is not about Goverment Surveillance!"

Maybe this is true - I don't want to argue it here, but I do think large tech companies are willing accomplices in many surveillance states - but an engineer at OpenAI or Dropbox will probably not be facing felony charges for blowing the whistle. They will be facing the end of their career in big tech, and that's enough to dissuade most of them.

I'd guess maybe 200 engineers would have to be in the loop before there is an 80% chance that one blows the whistle in any five-year period, and if they are careful they can keep the number of engineers in the know MUCH, MUCH smaller than that.

SkyBelow
2 replies
2d

Well we know the official history books are lying to us. Manhattan project? No way that could have been kept secret for as long as it was. All those people "in the loop" and not a single one choosing to blow the whistle? Preposterous.

The idea that a conspiracy is impossible ignores the conspiracies that did happen and were covered up for significant amounts of time, long before we had the aid of modern technology. Why does the argument receive so much attention despite that?

simonw
0 replies
2d

"Why does the argument receive much attention despite that?"

Because government national security secrets and dumb corporate secrets are different.

jeremyjh
0 replies
2d

The Manhattan Project did leak info to the Soviets. It is true that it stayed out of public view for several years, but a difference with the Manhattan Project is that every single person involved was investigated by the FBI beforehand, and the country was so heavily united in the fight against the Axis powers that it is quite likely almost everyone involved was a true believer.

simonw
1 replies
2d

The Facebook Mobile apps teams have thousands of engineers. They have access to the source code for the apps.

Is there a big black binary blob of source-not-available functionality in there that they are told to include in their builds without question, and for which a competent iOS or Android engineer wouldn't be able to tell if it has access to the microphone?

jeremyjh
0 replies
1d23h

I was referring to the specific argument I quoted, not to the Facebook audio conspiracy it was applied to, since it is also being applied to Dropbox/OpenAI, where the numbers are much smaller. I think the technical arguments in the Facebook case are compelling enough on their own.

23B1
5 replies
2d1h

The problem with AI isn't (p)doom, it's the 99% probability that the people at companies like Dropbox will continue to treat customers like human chattel.

They'd enslave you and charge you a subscription for oxygen if they could – and insist it was a feature.

mikewarot
2 replies
2d

people at companies like Dropbox will continue to treat customers like human chattel.

The key question is: Who is the actual customer?

That drives everything else. A FAANG's customer isn't its "users", but rather its advertisers, et al.: anyone who will pay for the data those "users" generate.

This is the Faustian bargain we all made when we decided that free stuff from the internet was a good idea. Someone else pays, and they get our data in exchange for that payment. The same data is likely sold as many times as they can do so, of course.

sgift
0 replies
2d

Even if we accept that we made this bargain: I pay for Dropbox. If they want to use my data for their shit they better make it a) optional and b) make me pay less.

Also, I have a feeling this could be another case of the EU having to fine a US company, which means half of HN will cry again about how the EU is always so mean to US companies. Or maybe this time HN acknowledges that the reason could be that the companies trying to use private data without permission are often US companies.

23B1
0 replies
1d23h

when we decided that free stuff from the internet was a good idea

I agree it's a faustian bargain but I don't think we had much agency here. Our society now depends on these services to a high degree, which is why these companies are (and probably should) face increased scrutiny and regulation.

fnimick
1 replies
2d

Any for-profit company would. In fact, if they are public, they are legally required to maximize profit without any other consideration.

sgift
0 replies
2d

No, they don't. That myth cannot die fast enough.

https://insight.kellogg.northwestern.edu/article/shareholder...

There are a lot of misconceptions about maximizing shareholder value, even among economists. But talk to a legal scholar or a corporate lawyer: a CEO or board is not legally obliged to maximize shareholder value. They need to maximize the value of the corporation and act in its best interest. Only when there is a change in legal control, such as a merger or imminent hostile takeover, do they have to maximize shareholder value.
arka2147483647
4 replies
2d

From article:

Trust is really important. Companies lying about what they do with your privacy is a very serious allegation.

A society where big companies tell blatant lies about how they are handling our data—and get away with it without consequences—is a very unhealthy society.

A key role of government is to prevent this from happening. If OpenAI are training on data that they said they wouldn’t train on, or if Facebook are spying on us through our phone’s microphones, they should be hauled in front of regulators and/or sued into the ground.

I find it very difficult to believe that big technology companies could be held accountable for anything by regulators through legal means. Their legal teams are too large, and their legal agreements and EULAs are too well written.

I find it impossible to believe that I, as a person, could challenge them in any way through legal means.

simonw
2 replies
1d23h

No matter how big the company, they still care about a billion dollar fine. That's bad for shareholder value! And billion dollar fines do happen.

nojvek
0 replies
1d22h

A billion ain't much nowadays. The big 5 do $100B+/year in net profits, so <1% doesn't matter. It's like being fined $1,000 while earning six figures: it feels like a pinch.

ash
0 replies
1d23h

Fines are often gamed: lobbying, legal defence.

I would argue big companies care more about losing customers. Competitive pressure is more effective than laws and fines. And laws are not without downsides, including becoming barriers to entry for new competitors.

nojvek
0 replies
1d22h

Given the history of US lawmakers failing to pass any bills that seriously threaten the might of Big Tech, compared to how the EU keeps giving Big Tech the middle finger, it's a weak argument.

Corporations run America.

klabb3
3 replies
1d23h

This reads a bit naive and “assume good intent”-y to me. Look at what's been happening outside of AI for a decade: everyone's eating data like a compulsive hoarder. Not just Google and Facebook, who are actually using it in their core products, but everyone. Today I discovered that a mini-site of Swedish traditional recipes I've been using for Christmas has added autoplaying videos, dark-pattern cookie consent banners, the usual. Almost every new app/site is oriented around this economic axis... and then came the sudden, coordinated lockdown of 3p APIs right around when LLMs started getting strong.

And now we have ChatGPT/OpenAI and their competitors... if the other players eat data like a secret midnight snack, current-gen AI is like zombies (the fast and twitchy variants) starving for blood and brains, both because data serves a more direct role in the product and because the typical hype-train-race psychology of tech VCs has been woken from slumber by the first potential paradigm shift in decades.

All circumstances point towards zombie apocalypse/ gold rush / ask for forgiveness later / etc etc. I strongly believe that’s why they’re (all of them) doubling down on the safety/responsibility rhetoric now, before the inevitable reputational PR crises (plural). Get ammo to muddy the waters early.

Meanwhile we techies are lulling around like we didn’t deeply just experience the last 10 years and thinking it’ll be different this time because.. AI is rooted in academia? Flashy new companies? The safety rhetoric? Edgy Twitter takes from “down-to-earth” founders?

I don’t pretend to know exactly what goes on behind the scenes but I’ve been around long enough to know how people work. And they haven’t changed for the better.

__loam
1 replies
1d20h

Meanwhile we techies are lulling around like we didn’t deeply just experience the last 10 years and thinking it’ll be different this time because.. AI is rooted in academia? Flashy new companies? The safety rhetoric? Edgy Twitter takes from “down-to-earth” founders?

These companies have already stolen everyone's data, and techies are bitching about IP law and saying they don't need to ask permission to use anything on the public internet. That might be the legal reality, but you still look like a tech douchebag for doing it.

JohnFen
0 replies
1d20h

More than looking like a douchebag, that sort of response is a big part of why the tech industry has lost so much trust. It's the response of someone who wants to continue to abuse people, not the response of someone who wants to act in a trustworthy manner.

marcosdumay
0 replies
1d20h

it’ll be different this time because..

You see, the new giant company promises not to be evil...

Berniek
3 replies
1d22h

A trust crisis for AI? What, after we saw the board/CEO of one company fired and then reinstated, apparently over allegations of lies or manipulation that nobody is clear on? If Dropbox derives data from user data by scanning said data, then that "derived" data is no longer "user data"; it is Dropbox data and can be shared. It may only be statistical in nature and not related directly to individual users, but isn't that exactly what training data is? Isn't that how it works? That can be shared to train AI models, can't it? It's not lies, it's hair-splitting.

No, it's called unethical behavior, and it has become the norm for big tech.

fasterik
1 replies
1d22h

Too often the claims of unethical behavior are knee-jerk, baseless, and speculative. In the example the author cites, Dropbox only sends data to OpenAI when the user explicitly tells the app to engage an AI-related feature, for example to summarize a document. Yet the backlash seems to assume that they're scanning and uploading people's documents en masse, even though there is no evidence of this.

Unethical behavior definitely exists in AI companies. Personally, I'm agnostic about whether it's higher or lower than the base rate of unethical behavior in the general population. Anyway, if we're going to talk about bad behavior, we should use specific examples with cited evidence, not fear-mongering.

JohnFen
0 replies
1d20h

Personally, I'm agnostic about whether it's higher or lower than the base rate of unethical behavior in the general population.

I don't think that AI companies are less trustworthy than our industry in general, but I absolutely think that our industry is far less trustworthy than the general population.

AI companies engaging in the wholesale harvesting of web data to train their models, though, was a particularly egregious violation of trust.

dredmorbius
0 replies
1d8h

In fairness to Sam Altman and OpenAI, all credible reporting I've heard (and significantly Kara Swisher's work) has shown no issues of AI safety or lies in CEO communications to the Board, but rather broader concerns over the direction the CEO and board felt appropriate for OpenAI.

I've no dog in this hunt. I'm not partial to either Altman or OpenAI. And I have considerable reservations over where this Brave New World may be taking us. Or whether there is any credible option to stop riding this merry-go-round, no matter how unattractive the destination(s).

The DropBox behaviour described is only one in a long, long, long line of trust-violating practices by tech firms.

shrimpx
2 replies
1d22h

The microphone trust aspect in the article seems like a red herring distracting from what could be a clearer point.

Facebook literally takes your data from their apps and internet, tracks your behavior on the internet, and feeds this data into models of you. These models are so accurate they can sometimes basically predict what you're thinking. Hence the layman jumping to the conclusion that they must be spying though the mic.

An LLM company like OpenAI, and their partners, employs almost literally this exact model: grab data from whatever sources to improve their models, to increase the likelihood you'll keep clicking where they want you to click, to monetize you.

red-iron-pine
0 replies
1d3h

Facebook literally takes your data from their apps and internet, tracks your behavior on the internet, and feeds this data into models of you. These models are so accurate they can sometimes basically predict what you're thinking. Hence the layman jumping to the conclusion that they must be spying though the mic.

and all of this just to show me shitty ads for online games that I will never, ever, EVER play, college-themed dating services I'm not going to use, yoga shit, and money remittance services. I live near a big university, so I'm guessing it's simply by IP.

occasionally get ads for Lexus or Jags tho. that's nice.

JohnFen
0 replies
1d20h

Hence the layman jumping to the conclusion that they must be spying though the mic.

Right, and in a larger sense, the laymen aren't exactly wrong. They're technically wrong about the mechanism, but they're exactly right about the extreme intrusion into their private lives. That the intrusion comes in the form of accurate models rather than the microphone is just a technical detail. The end effect is the same.

tikkun
1 replies
2d1h

Yes. In general, it would be great if companies threw away the outdated PR-communication playbook, and instead took a leaf from some indie devs and had the builders directly communicate with users as people. Products are built by people, companies are made up of people, but most communication is instead corporate.

In practice, I think the PR playbook is too entrenched, but we can dream.

Closi
0 replies
2d1h

The problem is that large companies are made of lots of people, and some of those people might have awful opinions or say things that will get your business in trouble.

The larger you are and the more people that are communicating, the bigger the chance that someone says something terrible.

Particularly if you are a large public company and there may be legal requirements around specific parts of disclosure.

Besides, companies don't want their secret new feature announced before their competitors know about it, in a way that is poorly communicated and confuses customers, only to then find that the person who announced it has a profile picture of them at a far-right or far-left protest, has made some sexist jokes, has a bio advertising their OnlyFans account, or a million other things that could be seen as against the corporate image.

rurp
1 replies
1d23h

I think this article is a little unfair to the lay critics, aka conspiracy theorists. The average user of Facebook has no idea how massive and widespread their personal data collection machine is. They correctly intuit that Facebook has an absurd amount of personal data on them, but don't have any frame of reference for how most of that data is collected. But the concept of a microphone listening in to private conversations is easy to grok, and while it's factually wrong, the effect it points to is a lot closer to reality than the corporate speak they hear from the company about how they "have the utmost care for users' personal privacy, blah blah blah". Remember a few years ago when Facebook tried to rebrand itself as a privacy company!?

Facebook absolutely has a lot of information that the average person considers personal and private, and big tech companies work hard to maintain the illusion that their data collection isn't as pervasive as it is.

With OpenAI, most non-technical users understand that the company has ingested a LOT of creative works done by individuals, most of which weren't expected to be ingested by a giant AI company and potentially regurgitated to millions of users.

The better corporate messaging this article argues for will help ameliorate concerns to some degree, but won't address the root of the problem. Most people might have the details wrong, but they are directionally right about the massive amounts of data being collected and used by these companies, and they have every right to be concerned about that.

simonw
0 replies
1d23h

This is a great comment. I think you're absolutely right about the microphone thing: people believe that because it's easier to understand than what's actually going on.

That's also my frustration with it: I want people to understand what's actually happening with targeted advertising so they can get angry about that. Having them get angry about a fake microphone conspiracy theory is a distraction that makes it harder to focus on solving the underlying issues.

rc202402
1 replies
1d23h

It's 2023; let's summarise AI:

* My Documents can be trained on

* My Instagram images are legally not mine and can be trained on

* My voice can be cloned without consent

* My complete motion capture data can be replicated to another image

* My analytics and usage data is owned by either US or Chinese corporations who either side will train AI on to feed me more ads

* A green colored hardware giant is leading the AI warfare

* The internet is turning highly anti consumer

kilolima
0 replies
1d22h

Don't forget Airbnb. Two or three years ago they claimed they needed scanned photo IDs of all users to eradicate racism on their platform, and they were going to use cutting edge ML to do so!

oblib
1 replies
1d23h

Personally, I think "AI" should be required to disclose the origins of the data it bases it's responses on, and if any of those are copyrighted.

simonw
0 replies
1d23h

For current LLMs that's not technically feasible, because every single token they output is influenced by every token they trained on - so any answer you got from them would have to include disclosure of millions of documents that went into the training set.

(I'd very much like them to disclose the full scope of their training set, but it's not possible for them to do that on a prompt-by-prompt basis in the way you describe.)

hoistbypetard
1 replies
1d21h

Facebook say they aren’t doing this. The risk to their reputation if they are caught in a lie is astronomical.

Even though I don’t believe Facebook is spying on anyone surreptitiously through their phone’s microphone, I find that specific argument entirely unpersuasive.

Facebook’s reputation is dogshit, at least with regular non-technical people I know. I’m in the US. People know they helped foment the January 6 insurrection in 2021 and saw how they deflected any and all responsibility while not fixing a thing afterwards. The reputational damage they’d absorb from actually doing this thing that so many people already think they’re probably doing pales by comparison.

red-iron-pine
0 replies
1d3h

they know it's dogshit... but still keep using Insta and WhatsApp

financypants
1 replies
1d23h

Can someone explain the consistent use of “Facebook are” instead of “Facebook is”? Is this proper grammar?

simonw
0 replies
1d23h

This is a personal style choice I make, because I like to emphasize that companies like Facebook and OpenAI are made up of individuals, and it's those individuals that make decisions.

dsr_
1 replies
2d1h

The cost/benefit of believing an infotech company about privacy concerns not backed by an explicit contract is firmly on the side of "don't bother believing them".

Restore the trust? Pay us.

JohnFen
0 replies
1d19h

For me, money is not the issue and being paid for having my trust abused will not restore my trust.

aftbit
1 replies
1d23h

Copy-pasta from the article, about a related claim (Facebook spying using phone microphones)

This theory has been floating around for years. From a technical perspective it should be easy to disprove:

   * Mobile phone operating systems don’t allow apps to invisibly access the microphone.
   * Privacy researchers can audit communications between devices and Facebook to confirm if this is happening.
   * Running high quality voice recognition like this at scale is extremely expensive—I had a conversation with a friend who works on server-based machine learning at Apple a few years ago who found the entire idea laughable.

The first point seems believable, but the second and third do not. Obviously a nefarious Facebook would be toggling these features off when they were being inspected, and even if they were not, they could be using sneaky exfiltration techniques. With the rise of app attestation, they would be able to do so in a way that would be 100% undetectable by reverse engineers. The relevant code would only be delivered to the app when it saw that it was running on an L1 trusted phone (with hardware security intact, unrooted). Additionally, they could embed some whisper-tiny on the device and force it to do its own speech recognition and hotword detection, only sending a list of "ad topics" back to their servers.

I don't think Facebook is doing this, and the social reasons hold more weight with me, but I don't think it's technically impossible or easy to prove that they did not _ever_ do this.
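
To make the hypothetical concrete, here is a toy sketch of the kind of on-device pipeline described above, using the open-source openai-whisper package: transcribe locally with the tiny model, match against a keyword table, and emit only coarse topic labels. It is purely illustrative of why "they would have to upload the audio" is not a safe assumption, not a claim about what any app actually does; the keyword table and audio file are made up.

    import whisper

    # Hypothetical hotword -> ad-topic table.
    AD_TOPICS = {"vacuum": "home appliances",
                 "mortgage": "financial products",
                 "vacation": "travel"}

    model = whisper.load_model("tiny")        # ~39M parameters, small enough for a phone-class device
    result = model.transcribe("snippet.wav")  # speech recognition stays entirely on the device
    text = result["text"].lower()

    # Only these coarse labels would ever need to leave the device.
    topics = sorted({topic for word, topic in AD_TOPICS.items() if word in text})
    print(topics)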

simonw
0 replies
1d23h

"Additionally, they could embed some whisper-tiny on the device and force it to do its own speech recognition"

Yes, they could do that today. People have been assuming they've been doing this since way before 2017, long before Whisper.

adaboese
1 replies
2d1h

I am building an AI content generator and my edge that makes it better is sourcing data from not easily accessible data sources (think research papers) through SEO. I cannot imagine that these companies are not using every bit of confidential data for their advantage. Proprietary data access is the only thing that matters in the future of AI wars

espe
0 replies
1d23h

true. i call them "prompt stimuli". prompts can give structure and context and whatnot but a dash of original domain gives the edge.

Waterluvian
1 replies
2d1h

Other than just manually generating a UUID and storing it then doing searches for it, is there any sort of formalized service/application/library for "Air Tag but for your data"?

Generate a unique string, store it along with my data, notify me if it shows up anywhere.

I wonder if it's possible to generate a canary of that sort that can be detected in AI models.
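
A minimal sketch of the do-it-yourself version, using nothing beyond the Python standard library: mint a unique canary per document, keep a local registry, and scan any text that comes back to you (model output, a breach dump, a search result) for hits. Reliably detecting a canary that has been baked into a model's weights is a much harder, open problem.

    import json
    import uuid
    from pathlib import Path

    REGISTRY = Path("canaries.json")  # local record of which canary went into which file

    def mint_canary(label):
        # A unique, never-published string to embed alongside the data.
        token = f"canary-{uuid.uuid4().hex}"
        registry = json.loads(REGISTRY.read_text()) if REGISTRY.exists() else {}
        registry[token] = label
        REGISTRY.write_text(json.dumps(registry, indent=2))
        return token

    def check_for_leaks(text):
        # Scan any text that comes back to you for known canaries.
        registry = json.loads(REGISTRY.read_text())
        return [label for token, label in registry.items() if token in text]

    token = mint_canary("dropbox/contracts/acme.pdf")
    print(check_for_leaks("...suspicious text that happens to contain " + token))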

JohnFen
0 replies
1d19h

I think that the difficulty is not in doing this, the difficulty is in doing it in a way that an adversary can't strip out.

RcouF1uZ4gsC
1 replies
2d1h

I think local models will be more and more relevant. First of all, the locally available computing power for inference is going up. In addition, inference is much cheaper than training. Finally, there are diminishing returns with more and more parameters.

Local models also solve a lot of the trust issues

127361
0 replies
2d1h

Yes and if we start chatting to a local AI model instead of searching on Google, that means we get an overall increase in privacy. So the new local AI era could actually be much more private than the previous Web era - an era in which we could have much more freedom.

zmmmmm
0 replies
1d21h

This is a great piece. I often observe that this applies at a high level in general as one of the key elements that demarcates successful/advanced economies from lesser ones: how much is trust a limiting factor on what you can achieve?

If you think about modern society and the levels of trust needed to do certain things: I spent nearly a lifetime of earned income on a house, and I completely trusted the agent and bank who facilitated that. Without that trust, the whole transaction wouldn't be possible. Why did I have that trust? This is the hidden value of all those laws and regulations that we often hate so much. As much as anything, they create the stable platform for high-value transactions to take place. The price of not having them is not being able to access the value you can get if you can successfully execute high-trust transactions. There's no way I would gamble my whole life's income if I didn't trust the process. I would have to buy a smaller house.

So AI is playing out along these lines. How much of the value we can access from AI is eventually going to be limited by how much trust we can grant it. Which, ironically, may derive from how much regulation / legal framework is successfully put around it, which may in the short term slow it down. The self-hosted open source model route is a substitute / fallback for not having trust in hosted models. We can't really pretend that it's possible to self-host a model that will be as good as a hosted one; the scale is always going to be a material difference. But we may be about to hit the point where, for targeted applications, there are diminishing returns, and that may be good enough.

wackycat
0 replies
1d1h

I love convincing people that Facebook isn't spying through the microphone but those relevant ads that relate to what your friend just said are as creepy as you think, just without the microphone. The example I like to give is if your friend with pets bought a new vacuum that they tell you is an absolute godsend for dog and cat hair. Facebook knows you have pets, they know your friend has pets, they know that you and your friend are friends. If you have location data on, they know that you and your friend are together. They know your friend recently bought a vacuum because they have access to a lot of purchase data. So it may seem like they're listening when your friend mentions the vacuum and an ad for it shows up in your feed later that day. They weren't listening, they were tracking. Listening would be spying in one aspect; rather they were following multiple aspects like purchases, relationships, lifestyle. Still creepy, just in a way that's harder for consumers to understand.

twoodfin
0 replies
1d23h

There’s a deeper connection between concerns about OpenAI and the “Facebook is spying on us via microphones” meme:

While there are perfectly rational, non-hot-mike explanations for the phenomenon of talking to a friend about something and then having it show up in an ad, the experience itself is damn spooky, no matter how it’s occurring.

And hypertargeted ads are just the earliest ripples of the wave of AI-driven applications for predicting human behavior. Those predictions—about where we’ll decide to go on vacation, when we’ll quit our jobs, what we’d like to buy, who we’ll marry and divorce—will be more right than wrong to an incredibly spooky degree.

I have no idea how society will react to that.

throw7
0 replies
2d1h

"A key role of government is to prevent this from happening."

Well, unfortunately, we have a government trust crisis also.

skrebbel
0 replies
2d

I think it’s obscene that Dropbox turns that setting on by default.

rpastuszak
0 replies
2d

People believing that FB was eavesdropping on their conversations might be wrong, but that's mostly a technicality. It's like saying that I passed out because you hit me with a baseball bat, to which you'd reply: naaa...ah! It was a hockey stick!

FB has been/is spying on people and making us collectively more stupid, but using different, more esoteric/harder to explain methods (e.g. cross-device behavioural targeting).

Those companies need to earn our trust. How can we help them do that?

This is a wicked problem and such a broad question (+ a bit of a weird take imho), that I don't see anything actionable. So, here's a half-baked list off the top of my head (so, poorly expressed, incomplete, incoherent):

- slow down the ill-conceived progress at all cost and understand that moving fast in the wrong direction ≠ progress

- force stricter transparency rules through legislative action

- understand and accept that no-trust should be the default

- invest more in open and offline models

- educate people

- accept that some of those companies in their current shape don't deserve our trust

Edit: This is a better take: https://news.ycombinator.com/reply?id=38643792&goto=item%3Fi...

renegade-otter
0 replies
1d22h

The biggest problem with AI is not the tech but the people. White billionaires worrying about SkyNet but totally overlooking the impact on the everyday "poors".

The individuals who never had to go through a second of adversity in their lives cannot be expected to understand. In fact, their sociopathic tendencies only skyrocketed through the years of isolation from the real world, with COVID compounding the effects.

Casually talking about how robots should replace humans as a species because of "efficiency"...

By the way, did you know that Sam Altman was VERY active in politics until the latest drama?

He was backing a candidate to primary Joe Biden, but now he had to go take care of his own stuff. The outcome of this would be pretty obvious to those who follow American politics:

https://www.theatlantic.com/press-releases/archive/2023/12/a...

My point is, a person that cannot draw a straight line from F---ing Around to Finding Out cannot be trusted with something infinitely more complex.

rchaud
0 replies
1d22h

The TLDR is "local models are great to have but hopefully we don't lose out on large hosted models due to distrust about data going straight to OA or some other big companies."

The arguments made in favour of bigco trustability are not great though, esp the parallel between AI data ingestion and Instagram's rumored use of mics to capture voice data for ad targeting:

Facebook say they aren’t doing this. The risk to their reputation if they are caught in a lie is astronomical.

FB has already paid the biggest fine in FTC history for privacy breaches. It has no goodwill, it's running on user inertia and indifference.

powera
0 replies
2d

People are going to believe the worst about AI for the rest of the decade, for reasons entirely unrelated to what actually happens.

Concluding "if we respect people's feelings and do one weird trick, people will suddenly trust AI" is naive beyond number.

pavel_lishin
0 replies
2d

The non-technical reasons are even stronger:

Facebook say they aren’t doing this. The risk to their reputation if they are caught in a lie is astronomical.

Ah, bullshit. They've lied about other things, repeatedly and constantly. It's a risk, sure - but a calculated one, and one they've taken before.

notfed
0 replies
2d1h

...people are worried that their private files on Dropbox...

I'm not worried, because I use Cryptomator. Great app, acts as a file encryption layer on top of any cloud storage.

nojvek
0 replies
1d22h

Great article overall, and I agree about local models. If it is connected to the internet, then it is not secure; you cannot trust your data not to be slurped up somewhere.

Threat models yada yada.

I'm in love with how fast models like insanely fast whisper, ggml llama.cpp, Mixtral, YOLO v5, etc. have come along.

I very much predict devices in 2024+ that have powerful energy efficient GPUs and do multimodal inference locally without any connection to the internet.
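
Local inference is already quite approachable. A minimal sketch with the llama-cpp-python bindings, assuming a quantized GGUF model has been downloaded to disk (the path below is a placeholder):

    from llama_cpp import Llama

    # Path to a locally downloaded, quantized GGUF model (placeholder).
    llm = Llama(model_path="./models/mistral-7b-instruct.Q4_K_M.gguf", n_ctx=2048)

    # Everything below runs on the local machine; no prompt or document leaves it.
    out = llm("Summarize: Dropbox added new AI features this week.", max_tokens=128)
    print(out["choices"][0]["text"])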

I remember buying the first comma.ai pre-pandemic, which was essentially a modded Android LG phone driving the car. No connection to the internet; a freaking mobile phone driving the car without any human input on the highway. That totally blew my mind.

I'm awaiting Apple to release a version of Siri that runs totally offline.

Smart phones are the perfect target.

namesbc
0 replies
1d23h

All it takes is a simple, tiny change to the user agreement, and then OpenAI will start training on this user data, and most people won't notice in time to switch their toggle off.

nagonago
0 replies
1d18h

I've seen this play out on Discord too. When they rolled out the new "summarize conversations with AI" feature, I saw a lot of people concerned their data was going to be stolen for AI training as a result. I pointed out that they explicitly said this won't happen, and also that Discord's ToS is already vague enough that there's no guarantee your data isn't already being used for other purposes. The LLM fearmongering is powerful.

msum
0 replies
1d21h

A friend told me about the Dropbox setting last night, so I logged in and turned it off. This morning, I went to look for the setting to take a screenshot but it's gone. The setting just...isn't there anymore.

Made me feel like I was going a bit crazy TBQH. Surely I didn't misremember?

Annoyed because it was a convenient file storage solution and now they have proven themselves untrustworthy so I have to set up my own thing. My fault for trusting to begin with, I suppose.

lysecret
0 replies
1d23h

Hmm, isn't the whole point of the "Facebook spies on us through our microphone" story that people laugh about it, move on, and use Facebook regardless?

Hundreds of millions of people use ChatGPT, billions Google and Facebook. The value wins. If the service is good enough, people just tend to "hope" more than trust that their data is safe enough.

For the Dropbox backlash, people just didn't see the value.

liuliu
0 replies
2d

I've briefly touched on this in another blog post. It is not just about trusting the institutions not to use the data for training. Optimizations you apply to make LLM inference fast might present a new security landscape that we don't know much about. For example, if your LLM inference engine stores KV cache across user sessions, you can in theory do a timing attack to retrieve what's in the KV cache, potentially leaking other users' requests.
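
A rough sketch of what such a timing probe could look like against a hypothetical shared-cache endpoint (the URL and request payload are made up for illustration): send a candidate prefix and compare its latency to fresh baselines; a prefix already resident in a cross-session KV cache would tend to come back measurably faster.

    import time
    import requests

    ENDPOINT = "https://inference.example.com/v1/completions"  # hypothetical shared endpoint

    def time_prefix(prefix):
        # Measure wall-clock latency for a one-token completion of the candidate prefix.
        start = time.monotonic()
        requests.post(ENDPOINT, json={"prompt": prefix, "max_tokens": 1}, timeout=30)
        return time.monotonic() - start

    candidate = "Subject: Q3 board deck, confidential draft"
    baseline = time_prefix("completely unrelated throwaway text " + str(time.monotonic()))
    probe = time_prefix(candidate)

    # If the candidate consistently returns much faster than fresh baselines,
    # its KV entries may already be cached from another session.
    print(f"baseline={baseline:.3f}s probe={probe:.3f}s")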

krupan
0 replies
1d21h

"As with so much in AI, people are left with nothing more than “vibes” to go on. And the vibes are bad."

word

j2kun
0 replies
1d21h

Isn't there sufficient evidence to conclude that OpenAI is training on things they know they probably shouldn't be training on? Like the full text of novels that are not publicly available on the internet? I don't think it's that unreasonable to extrapolate that they would blur the lines of legal agreements to get access to private user data they shouldn't have.

isx726552
0 replies
2d

Facebook say they aren’t doing this. The risk to their reputation if they are caught in a lie is astronomical.

The author refers to the theory that FB is always listening as “laughable”, but this statement is even more laughable. If FB actually got caught doing this, at most there might be a few articles about it, and maybe a class action lawsuit resulting in a few pennies distributed to individual users, and then things would go right back to business as usual.

The only convincing argument that FB isn’t always listening is that it would be too expensive to be worth it. Arguments about “reputation” risk are beyond absurd. They don’t care.

imiric
0 replies
1d21h

I find the premise that users should trust tech companies by default absolutely ludicrous. The author seems to live in a bubble where tech companies don't commit privacy violations regardless of what their privacy policies say. Yet in the real world, even when they're caught with their hand in the cookie jar, the fines these companies are forced to pay are just the cost of doing business.

Earlier this year Microsoft was fined $20M for collecting PII from children, which it "shared" with advertisers[1]. And this is _Microsoft_, not even an adtech giant. The abuses from those have their own Wikipedia articles[2,3]. FB was famously fined $5B in 2019[4]. It would be beyond naive to believe that these were just "oopsies", and that somehow these companies have changed their ways.

_This_ is why it's not implausible that the software that runs on our most personal computing devices, built by these same companies, is hoovering up our data in the sneakiest ways possible. Equating this to conspiracy theories is dishonest at best.

When Mark Zuckerberg puts tape over his webcam and microphone, it would be foolish to think you know better.

[1]: https://www.ftc.gov/news-events/news/press-releases/2023/06/...

[2]: https://en.wikipedia.org/wiki/Privacy_concerns_regarding_Goo...

[3]: https://en.wikipedia.org/wiki/Privacy_concerns_with_Facebook

[4]: https://www.ftc.gov/news-events/news/press-releases/2019/07/...

ilovetux
0 replies
1d22h

There is an ongoing question about whether AI can be trained on copyrighted works.

If the pioneering AI companies are able to push back against the entire publishing industry and face nothing more than a conversation about it, then why would I ever expect any company to respect me or my personal data?

In fact, I think there is a large and growing portion of the population who have watched big business scoff at laws and regulation (see Wells Fargo, Uber, multiple crypto exchanges, etc.) and are realizing that profit is the only motivator.

That's not even talking about the chaotic political landscape.

I think we will soon see a crisis in institutional trust across the board.

ideamotor
0 replies
1d22h

There is no independent monitor or authority that can check what OpenAI does or doesn’t do. So there is no reason to trust them. Simple as that. Tipping the scales further are the numerous instances of tech companies lacking sufficient data security, the latest being 23andme. Case closed.

iamleppert
0 replies
1d21h

Who in their right mind would trust a company that just had its CEO publicly ousted for allegedly lying to the board, only to have him put back in charge? And we have yet to see any kind of reasonable explanation or statement about what really happened.

Where there's smoke, there's fire. Public trust in large tech companies is at an all-time low. Nobody thinks tech companies want to help anybody but the few rich people running them anymore.

I wouldn't trust a word OpenAI says, because even their name is a lie. The code isn't open, it's not auditable and you have dishonest people running the company.

grammers
0 replies
2d

In what way would AI deserve our trust? To date, data-hungry companies have proven to NOT have consumers' interests in mind. Why should AI companies be any different?

In dubio pro reo (when in doubt, side with the accused)? I don't think so.

dclowd9901
0 replies
1d23h

Facebook say they aren’t doing this. The risk to their reputation if they are caught in a lie is astronomical.

This isn’t the disincentive you think it is. VW manufacturing vehicles that purposely deceived emissions testing was textbook corporate malfeasance, perhaps one of the worst examples of it. They took a loss on the ledger and came out of it relatively unscathed.

cwillu
0 replies
2d

If I can be automatically signed up to allow openai to process (but not train on) my data, I absolutely don't trust that the deal will not be altered the moment any counter-party to the “agreement” (besides me) desires it be altered.

bluish29
0 replies
2d1h

I think this is a combination of confusing wording and the eternal vagueness of what the term “consent” means in a world where everyone agrees to the terms and conditions of everything without reading

I think this is always the case, unless companies are forced to actually simplify and limit their ToS, or even divide it by feature. No one can realistically be expected to read hundreds of pages of legalese daily. I don't think many vendors do it out of necessity; it's more a workaround for the consent issue. This needs to be addressed by legislation that bans those practices.

bertil
0 replies
1d22h

This article touches on the more critical and flip side of a significant problem, which I think will hurt many of the incumbents, while whoever figures out how to resolve it will gain a lot of credit.

Facebook doesn’t explain how their ad targeting works at the individual level, but they thought about it and tried, barely. It was perceived as an obstacle to black-box ML, even when there were clear opportunities to show customers (and partners): here’s how the algebra works; here are key dimensions that we have identified explicitly and that the advertisers wanted us to target (age, gender); here are more dimensions that the feedback from showing that ad to others tells us about (people who like board games respond more to the ad, and we think you do).

That training, in clear, accessible language that anyone would understand, exists: it’s part of onboarding.

The result of not having it is that a large chunk of ads are shown to people who literally don’t speak the ad's language (something Antonio GM mentions in his book, and that was identified, corrected, broken, and re-identified at least three separate times since). And the benefits of having it would be enormous: it would free up what looks like 60% of my ad inventory by not showing gambling ads.

But they went the other way, treating users who ask the same questions their own employees ask on their first days with suspicion, assuming they knew better than the people they were talking about.

OpenAI is less adversarial, but they must explain how training and testing work with user-relevant examples: “Here’s a conversation you had last week. Here’s the feedback you gave. Here’s how we detected it was relevant to refinement. Here are examples of questions whose answers were changed by your contribution.”

Having a full-text search on their training set might be difficult, but something along those lines could be implemented and reviewed by a neutral third party. Those conversations would lead to insights about where to find more relevant data.

As Simon W. writes, this matters, and you want to respect privacy because it’s an inherent good, but you also want to preserve trust because it’s commercially essential (even though people are quick to forget if they get free stuff). I think you should do it because security by obscurity, or at least the ML equivalent of that, isn’t working.

https://xkcd.com/1838/ is funny, but the truth is that we can offer some tools for exploration and feature breakdown. Every time I did, we learned so much from those.

bayindirh
0 replies
1d23h

I'm asking this genuinely and honestly:

How can I trust a company with my data when their whole technology's performance depends on scraping the entire internet and feeding it to a model? Also consider the fact that the data I'm providing them is way cleaner than what they scrape.

Moreover, we (at least the older ones amongst us) have seen what Microsoft has done in the past, and OpenAI is by now almost one and the same with that Microsoft, which is ready to do anything for platform dominance.

OpenAI, esp. when combined with Microsoft and the whole A.I. hype paints a very untrustworthy picture in my eyes.

I'm not against the technology, but how it's trained, developed, and hyped, and how the researchers in the ecosystem behave with the data they handle, make it very off-putting in every sense, esp. in privacy and ethics.

It's like looking into a proverbial sausage factory, but this one is way worse.

barrysteve
0 replies
1d22h

The internet has fractionated our cultural references and commonly used language. This has caused the communication of belief to separate from language.

The Facebook Spying idea is the common place to agree and complain about phone spying in general. It is the poster boy for communicating an implicit belief about a larger topic.

We have begun to use language to communicate implicit and shared beliefs without any evidence, facts, agreement or qualification.

We come to our own beliefs privately and then choose a poster boy to carry the conversation. There's not much surprising about the continuation of conspiracy theories, or topics that look like conspiracy theories. We shredded the common cultural references apart...

barryrandall
0 replies
1d23h

It's a market trust crisis amplified by AI. Everything in the article applies to scummy business practices that were spreading before we all caught the CUDA crazies.

atmosx
0 replies
1d22h

Reading the post, my first reaction is to remove all docs from my free dropbox account.

asdff
0 replies
1d22h

I think you'd find that the people who don't trust AI companies harbor similar feelings about companies in many other industries, or non-profits, or even government agencies, depending on who specifically you ask. It seems to me that there is a much bigger trust issue, well beyond the scope of AI companies, so asking that specific field to fight a distrust that really comes from all fronts seems like an impossible task, well out of the scope of these businesses. I'm not sure what the answer is, or whether it's even a real issue, or what we might be marching toward should this pervasive cynicism spread to all things and all people. Perhaps we are simply cursed to live in interesting times.

andy99
0 replies
2d1h

This isn't really AI-specific. The AI angle, I suppose, is that there is a new way for companies to exploit data they get from you, but as the article says about Facebook, people already believe their info is being exploited. Laws won't change it (not that they shouldn't exist); there's always too much fine print that can be gamed, and they'd never be enforced to the benefit of the consumer anyway. As with privacy, it comes down to awareness and to not giving companies the chance, which will take a long time but hopefully will become the default instead of just not caring. For AI, local or own-cloud models are a good start; it's actually hard to believe a company like Dropbox would go with OpenAI and not see how bad it looks.

altcognito
0 replies
1d23h

“Data is deleted from third party servers in 30 days.”

No, that’s a thing that should never have happened. Ever.

airstrike
0 replies
2d1h

You can only truly trust someone when your incentives are aligned, and people's incentives aren't aligned with OpenAI, Facebook or Dropbox. That trust simply can't be earned.

JohnFen
0 replies
1d21h

It’s increasingly clear to me that people simply don’t believe OpenAI when they’re told that data won’t be used for training.

I certainly don't. I don't trust any of the AI companies at all. Their behavior and statements have given me quite a number of reasons not to, and no reasons why I should.

2devnull
0 replies
1d21h

Trust is simple. If you can’t trust your own family, who can you trust? Families, that is the answer. Oddly, the trend is in the opposite direction: trust media, trust government, trust Apple, trust OpenAI, etc. Things will get better when that trend reverses.

0xNotMyAccount
0 replies
2d

This comes down to a combination of security and provenance. You have to protect the data, of course, but you also have to keep track of the use rights. Identifying all use rights upfront seems tricky; are the Creative Commons licenses sufficient?

0xDEAFBEAD
0 replies
1d20h

One interesting difference here is that in the Facebook example people have personal evidence that makes them believe they understand what’s going on.

With AI we have almost the complete opposite: AI models are weird black boxes, built in secret and with no way of understanding what the training data was or how it influences the model.

In principle, I think we could have significant visibility into how OpenAI trains its models by doing something like the following:

* Generate a big list of "facts" like: Every fevilsor has a chatterbon. Some have a miralitt too, but that is rare. (Or even: Suzy Q Campbell, a philosopher who specializes in thought experiments, lives in Springfield, Virginia. Her phone number is 555-8302.)

* Generate a big list of places where OpenAI could, in principle, be harvesting data: HN threads, ChatGPT conversations, StackOverflow, Dropbox files, reddit posts, etc.

* Randomly seed facts in various places.

* Check which facts get picked up by the next version of GPT.

If you really don't trust OpenAI, be subtle about where and what you seed (e.g. avoid obvious nonsense words) to make it hard for their engineers to filter out your canaries even if they tried.

It's important to seed some facts in places they're known to scrape, as a control. If those facts aren't getting picked up, then you have methodological issues. You might have to repeat a fact a number of times in the dataset before it actually makes its way into the model.
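
Here is a rough sketch of what the seed-and-check loop could look like, assuming the official OpenAI Python client for the probe step; the fact template, file names, model name, and seeding locations are placeholders, and the random nonsense words are the "obvious" variant the comment warns against, kept here only for brevity.

    import json
    import random
    import string

    from openai import OpenAI  # assumes the official OpenAI Python client

    client = OpenAI()

    def make_canary() -> dict:
        """Invent a fake 'fact' plus a probe only a model trained on it could answer."""
        noun = "".join(random.choices(string.ascii_lowercase, k=9))
        detail = "".join(random.choices(string.ascii_lowercase, k=8))
        return {
            "fact": f"Every {noun} has exactly one {detail}.",
            "probe": f"What does every {noun} have exactly one of? Answer in one word.",
            "answer": detail,
        }

    # Step 1: generate canaries and record where each one was seeded. The
    # seeding itself (posting to HN, pasting into a ChatGPT chat, saving to
    # Dropbox, ...) is manual; "control_public_blog" stands in for a location
    # that is known to be scraped, used as the control.
    places = ["hn_thread", "chatgpt_conversation", "dropbox_file", "control_public_blog"]
    canaries = [dict(make_canary(), seeded_in=place) for place in places]
    with open("canaries.json", "w") as f:
        json.dump(canaries, f, indent=2)

    # Step 2 (months later, against the next model release): ask each probe
    # and see which invented facts the model reproduces.
    for c in json.load(open("canaries.json")):
        reply = client.chat.completions.create(
            model="gpt-4o",  # whichever newer model is being audited
            messages=[{"role": "user", "content": c["probe"]}],
        ).choices[0].message.content
        hit = c["answer"].lower() in (reply or "").lower()
        print(f"{c['seeded_in']}: {'picked up' if hit else 'not found'}")

If the control location never gets picked up, the methodology (repetition count, fact phrasing) needs adjusting before drawing any conclusion about the other locations.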