Any online service that lets users upload material that is then publicly visible will eventually be used for command-and-control, copyright infringement and hosting CSAM. This is especially true for services that have other important uses besides file hosting and hence are hard to block.
This already happened to Twitter[1], Telegram[2], and even the PGP key infrastructure[3], not to mention obvious suspects like GitHub.
[1] https://pentestlab.blog/2017/09/26/command-and-control-twitt... [2] https://www.blazeinfosec.com/post/leveraging-telegram-as-a-c... [3] https://torrentfreak.com/openpgp-keyservers-now-store-irremo...
And Gmail, and Google Groups, and Google Drive, and Gchat, on and on. The data you store doesn't even have to be public: with Gmail, they would distribute credentials to log in and read attachments that had been uploaded via IMAP.
(I am a former Google SAD-SRE [Spam, Abuse, Delivery])
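For anyone curious about the mechanics of that IMAP trick, here's a minimal sketch of the reader's side of the dead-drop pattern, using Python's standard imaplib (the credentials and address are made up, and the real tooling was obviously more elaborate):

    import email
    import imaplib

    # Hypothetical shared credentials, distributed out-of-band.
    USER, PASSWORD = "dropbox.account@example.com", "shared-secret"

    # Log in to the shared mailbox over IMAP and open the inbox.
    conn = imaplib.IMAP4_SSL("imap.gmail.com")
    conn.login(USER, PASSWORD)
    conn.select("INBOX")

    # Walk every message and pull out its attachments.
    _, nums = conn.search(None, "ALL")
    for num in nums[0].split():
        _, msg_data = conn.fetch(num, "(RFC822)")
        msg = email.message_from_bytes(msg_data[0][1])
        for part in msg.walk():
            if part.get_filename():  # any part with a filename is an attachment
                payload = part.get_payload(decode=True)  # the smuggled bytes
    conn.logout()

Note that nothing ever has to be sent anywhere; self-addressed mail (or drafts, as mentioned elsewhere in this thread) in a shared account works just as well.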
Question: how would you know without invading the user's privacy?
An algorithm that processes private user data is by itself not invading anyone's privacy. It's clear to me that invasion of privacy only happens when humans look at private user data directly, or look at user data that's not sufficiently processed by an algorithm.
Otherwise, something as simple as a spell checker would be an invasion of privacy because it literally looks at every word in an email you write. That's absurd.
At least in my opinion, there's a big difference between where the data lives and where the checking algorithm runs. I don't think a spell checker falls into what I'd consider a privacy concern as long as it's running locally on my device.
I don't work in the area of email nor Google but I see two problems.
1) You need to constantly update the spell checker, so each time you mark something as a valid word, that feedback most likely gets sent back; the feedback is part of the data. I assume Google does something similar with mail marked as spam or not-spam. That's full email collection and analysis, not partial like old word processors.
2) I feel AI makes this even harder: you can no longer simply check patterns the way you could before; you need to check the whole content constantly.
We've had spell/grammar checkers in word processors that worked totally offline for a long time now. They definitely can be improved with a hosted service but that's by no means necessary and comes with tradeoffs like latency and offline support.
An algorithm that denies service, changes ad behavior, etc based on user content is definitely invading privacy compared to your spell checker case.
The spell checker would also be a massive privacy invasion if it flagged users based on the content of what they wrote.
If an algorithm is looking through private stuff and making a decision based on it or is sending signals where the signal depends on the private stuff, then it's pretty much by definition leaking private information.
An algorithm that leaked no private information would not be useful to a business. It would do a bunch of computation and then throw it away. So realistically anything that looks at private information is privacy-relevant.
That includes even just the email headers. To quote the former head of the NSA "We Kill People Based on Metadata" https://abcnews.go.com/blogs/headlines/2014/05/ex-nsa-chief-...
You can have debates about how much private information should be leaked and for what purposes. But I don't think having a threshold like "it's all private unless another human reads it" is a good way to think about the issue.
Companies are legally obligated to scan for CSAM in the US.
I don't think that's accurate... Do you have a link?
I do think there is an obligation to report if any is found, but I don't think they need to look.
https://www.theguardian.com/technology/2022/aug/22/google-cs...
I don't think that's a hard legal requirement to scan, just a law about what to do once material is known, plus some executive arrangements.
I think there was a case where several people logged into the same Gmail account and shared data not by sending mail, just by writing and reading drafts.
Yep. And it would split uploads across dozens of accounts with parity, so that if any account was disabled it could re-create the data from what was in the other accounts (think BitTorrent using IMAP-uploaded content in Gmail).
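The parity trick is the same idea as RAID. Here's a minimal sketch with a single XOR parity chunk (that's an assumption on my part; I don't know the actual scheme that tool used):

    from functools import reduce

    def xor(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    def split_with_parity(data: bytes, k: int):
        # Split data into k equal chunks plus one XOR parity chunk.
        data = data.ljust(-(-len(data) // k) * k, b"\0")  # pad to a multiple of k
        size = len(data) // k
        chunks = [data[i * size:(i + 1) * size] for i in range(k)]
        return chunks, reduce(xor, chunks)

    def recover(chunks, parity, lost: int) -> bytes:
        # Rebuild the chunk at index `lost` from the survivors plus parity.
        survivors = [c for i, c in enumerate(chunks) if i != lost]
        return reduce(xor, survivors + [parity])

    # Upload each chunk to a different account; if one account is killed:
    chunks, parity = split_with_parity(b"smuggled payload", 4)
    assert recover(chunks, parity, lost=2) == chunks[2]

XOR parity survives losing any one chunk; surviving multiple simultaneous losses needs a real erasure code like Reed-Solomon, which is presumably closer to what they actually did across dozens of accounts.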
You might be thinking of the General David Petraeus case, a national security leak that was slightly worse than Snowden's, but with little repercussions :)
Pre-AI, we had a system that watched user patterns and would identify possibly suspect patterns that were outside of the norm. We also had a system that would content-id the images and attachments to see what was being uploaded in a general way. Given enough suspicion, the account would then be opened to look for abusive patterns.
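No idea what that content-id system looked like internally, but the basic shape is easy to sketch: hash each attachment and match against a database of known-bad hashes. Real systems use perceptual hashes (PhotoDNA-style) that survive re-encoding and cropping; plain SHA-256 is shown here purely for illustration, with a placeholder hash set:

    import hashlib

    # Hypothetical set of hashes of known abusive files (placeholder value).
    KNOWN_BAD = {"0" * 64}

    def content_id(attachment: bytes) -> str:
        # Exact-match ID; production systems use perceptual hashing instead.
        return hashlib.sha256(attachment).hexdigest()

    def is_suspect(attachment: bytes) -> bool:
        return content_id(attachment) in KNOWN_BAD

And as described above, a match alone didn't convict anyone; it just raised suspicion enough to justify a closer look at the account.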
There is absolutely no promise on any cloud-hosted service that a human will never see your data. However, at Google it was made very, very, VERY clear that if we had to scan somebody's personal email for any reason, then discussing the contents outside of legally mandated or work-required channels would lead to immediate termination and a possible lawsuit for any reputational damage incurred.
While fixing user accounts or dealing with delivery of content, I saw epic piles of personal email. Besides the ones full of CSAM or other abusive material, I couldn't say I ever remembered the contents 30 minutes later. It's like a checker at a grocery store: they don't care about whatever embarrassing things you're buying and won't remember you 10 minutes later. =)
Just curious, "Delivery" doesn't seem to be the same sort of thing as "Spam" and "Abuse": why are the three grouped?
Delivery is what happens if it’s not spam or abuse.
I was apparently not watching this well enough, sorry for the delayed response.
Delivery was in there because we ran the SMTP and queuing infrastructure at the time. We started as Gmail SRE, then split some of the delivery and abuse services out into their own team (SAD), then SAD got its own SRE, hence SAD-SRE =)
No inside information, but presumably this means Delivery to other organizations, which, among other things, includes maintaining outbound IP reputation, which is closely related to Spam and Abuse.
Just a side note: I found the name SAD-SRE funny and blursed at the same time.
Whats’ blursed mean?
What does “Whats’” mean?
Simultaneously blessed and cursed.
From long enough ago that I should apologize to you for libgmail: https://libgmail.sourceforge.net ? :D
libgmail was the least of our problems. There was a Polish software team that wrote a BitTorrent layer on top of Gmail. That thing was a pain in the butt, as they constantly improved it to get around abuse filters and such. Plus it had parity bits, so if we killed accounts it would just re-replicate the data to new accounts. That software was devilish and impressive at the exact same time. =)
Not sure if it has already happened, but a not-so-obvious one is HuggingFace.
No idea if it's used for CSAM or malware, but copyright infringement on a massive scale? https://huggingface.co/datasets/EleutherAI/pile
It seems like it would be pretty easy to use PyPI for this, because packages can contain arbitrary non-Python files. And you can also do things like base64-encode your files into strings in Python code.
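As a toy illustration (the package and payload names are made up), the base64 route is only a few lines:

    import base64

    # Packager side: turn an arbitrary file into an innocuous-looking constant.
    with open("payload.bin", "rb") as f:
        print(f'DATA = "{base64.b64encode(f.read()).decode()}"')

    # What ships inside the package, e.g. in evil_pkg/_assets.py:
    DATA = "aGVsbG8sIHdvcmxk"  # hypothetical smuggled payload

    # Consumer side, after `pip install evil-pkg`:
    payload = base64.b64decode(DATA)

Nothing about it looks anomalous to a cursory review, either: plenty of legitimate packages embed base64 blobs (icons, test fixtures, certificates).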