HN comments for: Slack AI Training with Customer Data

zmmmmm

50 replies

18h10m

2024-05-17 00:18:28 UTC

For any model that will be used broadly across all of our customers, we do not build or train these models in such a way that they could learn, memorise, or be able to reproduce some part of Customer Data

This feels so full of subtle qualifiers and weasel words that it generates far more distrust than trust.

It only refers to models used "broadly across all" customers - so if it's (a) not used "broadly" or (b) only used for some subset of customers, the whole statement doesn't apply. Which actually sounds really bad because the logical implication is that data CAN leak outside those circumstances.

They need to reword this. Whoever wrote it is a liability.

afc

30 replies

16h5m

2024-05-17 02:24:05 UTC

Especially when a few paragraphs below they say:

If you want to exclude your Customer Data from helping train Slack global models, you can opt out.

So Customer Data is not used to train models "used broadly across all of our customers [in such a way that ...]", but... it is used to help train global models. Uh.

hackernewds

23 replies

15h36m

2024-05-17 02:53:13 UTC

Why are these kinda things opt-out? And need to be discovered..

We're literally discussing switching to Teams at my company (1500 employees)

Shadowmist

11 replies

14h9m

2024-05-17 04:19:53 UTC

You’d be better off just not having chat than switching to Teams.

metadat

9 replies

13h46m

2024-05-17 04:42:52 UTC

But the business will suffer by most likely being less successful due to less cohesive communication.. it's "The Ick" either way.

andy_ppp

6 replies

12h29m

2024-05-17 05:59:51 UTC

The idea that Slack makes companies work better needs some proof behind it, I’d say the amount of extra distraction is a net negative… but as with a lot of things in software and startups nobody researches anything and everyone writes long essays about how they feel things are.

unkulunkulu

4 replies

10h12m

2024-05-17 08:16:44 UTC

Distraction is not enforced. Learning to control your attention and how to help yourself do it is crucial whatever you do in whatever time and in whatever technological context or otherwise. It is the most long term valuable resource you have.

I think we start to recognize this at larger scale.

Slack easily saves a ton of time solving complex problems that require interaction and expertise of a lot of people, often unpredictable number of them for each problem. They can answer with delay, in a good culture this is totally accepted and people still can independently move forward or switch tasks if necessary, same as with slower communication tools. You are not forced to answer with any particular lag, however slack makes it possible when needed to reduce it to zero.

Sometimes you are unsure if you need help or you can do something on your own. I certainly know that a lot of times eventually I had no chance whatsoever, because the knowledge required was too specialized; this is not always clear. Reducing barriers to communication in those cases is crucial and I don't see Slack being in the way here, only helpful.

The goal of organizing Slack is such that you pay right amount of attention to right parts of communication for you. You can do this if you really spend (hmm) attention trying to figure out what that is and how to tune your tools to achieve that.

andy_ppp

3 replies

9h44m

2024-05-17 08:44:40 UTC

That’s a lot of words with no proof isn’t it, it’s just your theory. Until I see a well designed study on such things I struggle to believe the conjecture you make either way. It could be quite possible that you benefit from Slack and I don’t.

Even receiving a message and not responding can be disruptive and on top I’d say being offline or ignoring messages is impossible in most companies.

whatevaa

1 replies

9h32m

2024-05-17 08:56:46 UTC

Your idea also comes with no proof, just your personal experience.

andy_ppp

0 replies

4h41m

2024-05-17 13:48:10 UTC

Which is extremely clear from what I’m saying, it’s completely anecdotal.

unkulunkulu

0 replies

7h15m

2024-05-17 11:13:29 UTC

This is your choice to trust only statements backed by scientific rigour or trying things out and applying to your way of life. This is just me talking to you, in that you are correct.

Regarding “receiving a message”: my devices are allowed only limited use of notifications. Of all the messaging/social apps only messages from my wife in our messaging app of choice pop up as notifications. Slack certainly is not allowed there.

metadat

0 replies

12h22m

2024-05-17 06:07:00 UTC

Good point, could be that it reduces friction too far in some instances. However, in general less communication doesn't seem better for the bottom line.

codingdave

1 replies

7h22m

2024-05-17 11:06:54 UTC

I'm not sure chat apps improve business communications. They are ephemeral, with differing expectations on different teams. Hardly what I'd label as "cohesive"

Async communications are critical to business success, to be sure -- I'm just not convinced that chat apps are the right tool.

skydhash

0 replies

3h13m

2024-05-17 15:15:20 UTC

From what I’ve seen (not much actually), most channels can be replaced by a forum-style discussion board. Chat can be great for 1:1 and small team interactions. And for tool interactions.

amne

0 replies

12h25m

2024-05-17 06:03:40 UTC

we use Teams and it's fine.

Just don't use the "Team" feature of it to chat. Use chat groups and 1-to-1 of course. We use "Team" channels only for bots: CI results, alerts, things like that.

Meetings are also chat groups. We use the daily meeting as the dev-team chat itself so it's all there. Use Loops to track important tasks during the day.

I'm curious what's missing/broken in Teams that you would rather not have chat at all?

M4v3R

5 replies

12h24m

2024-05-17 06:04:50 UTC

If you switch to Teams only for this reason I have some bad news for you - there’s no way Microsoft is not (or will not start in future) doing the same. And you’ll get a subpar experience with that (which is an understatement).

guappa

1 replies

11h33m

2024-05-17 06:55:43 UTC

I think a self hosted matrix/irc/jitsi is the way to do it.

IshKebab

0 replies

11h5m

2024-05-17 07:23:28 UTC

We've been using Mattermost and it works very well. Better than Slack.

The only downside is their mobile app is a bit unreliable, in that it sometimes doesn't load threads properly.

District5524

1 replies

6h55m

2024-05-17 11:33:17 UTC

The Universal License Terms of Microsoft (applicable to Teams as well) clearly say they don't use customer data (Input) for training: https://www.microsoft.com/licensing/terms/product/ForallOnli... Whether someone believes it or not, is another question, but at least they tell you what you want to hear.

bayindirh

0 replies

5h56m

2024-05-17 12:33:05 UTC

What if they exfiltrate customer data to a data broker and they buy it back?

It's not customer data anymore.

cpach

0 replies

7h24m

2024-05-17 11:04:52 UTC

I would guess Microsoft has a lot more government customers (and large customers in general) than Slack does. So I would think they have a lot more to lose if they went this route.

trinsic2

0 replies

4h16m

2024-05-17 14:12:33 UTC

Ugh Teams = Microsoft. They are the worst when it comes to data privacy. I'm not sure how that is even a choice.

rogerthis

0 replies

6h5m

2024-05-17 12:23:30 UTC

Teams have better voice/video. But chat is far worse, absolutely shit, though Slack seems to be working to get there.

marricks

0 replies

14h54m

2024-05-17 03:34:23 UTC

Obviously because no one would ever opt in.

jeffdn

0 replies

14h54m

2024-05-17 03:34:50 UTC

I'd make sure to do an extended trial run first. Painful transition.

bayindirh

0 replies

5h57m

2024-05-17 12:31:41 UTC

Why are these kinda things opt-out? And need to be discovered..

Monies.

We're literally discussing switching to Teams at my company (1500 employees)

Considering what Microsoft does with its "New and Improved(TM)" Outlook and love for OpenAI, I won't be so eager...

DougBTX

2 replies

12h52m

2024-05-17 05:36:25 UTC

To me it says that they _do_ train global models with customer data, but they are trying to ensure no data leakage (which will be hard, but maybe not impossible, if they are training with it).

The caveats are for “local” models, where you would want the model to be able to answer questions about discussions in the workspace.

It makes me wonder how they handle “private” chats, can they leak across a workspace?

Presumably they are trying to train a generic language model which has very low recall for facts in the training data, then using RAG across the chats that the logged on user can see to provide local content.
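
A minimal sketch of the permission-scoped retrieval idea described above, assuming a toy word-overlap score in place of a real retriever. The names `Message`, `visible_messages`, and `retrieve` are hypothetical and not anything Slack has published; the point is only that the access filter runs before any ranking, so text from channels the user cannot see never reaches the model's context.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    channel: str
    text: str


def visible_messages(messages, user_channels):
    """Keep only messages from channels the user belongs to."""
    return [m for m in messages if m.channel in user_channels]


def retrieve(query, messages, user_channels, k=2):
    """Rank the user's visible messages by word overlap with the query.

    Word overlap is a stand-in for a real similarity score (an assumption).
    """
    query_words = set(query.lower().split())
    scored = []
    for m in visible_messages(messages, user_channels):
        overlap = len(query_words & set(m.text.lower().split()))
        if overlap > 0:
            scored.append((overlap, m))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored[:k]]


messages = [
    Message("general", "deploy is scheduled for friday"),
    Message("private-finance", "the acquisition budget is confidential"),
    Message("engineering", "friday deploy needs a rollback plan"),
]

# The user is not a member of "private-finance", so that message is
# filtered out before scoring and cannot appear in the retrieved context.
context = retrieve("when is the deploy", messages, {"general", "engineering"})
```

Whether private chats can leak then comes down to whether the membership check is applied on every retrieval, not to what the generic model memorised.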

ENGNR

1 replies

10h4m

2024-05-17 08:25:09 UTC

My intuition is that it's impossible to guarantee there are no leaks in the LLM as it stands today. It would surely require some new computer science to ensure that no part of any output that could ever possibly be developed is sensitive data from any of the input.

It's one thing if the input is the published internet (even if covered by copyright), it's entirely another to be using private training data from corporate water coolers, where bots and other services routinely send updates and query sensitive internal services.

visarga

0 replies

9h2m

2024-05-17 09:26:21 UTC

There is a way. Build a preference model from the sensitive dataset. Then use the preference model with RLAIF (like RLHF but with AI instead of humans) to fine-tune the LLM. This way only judgements about the LLM outputs will pass from the sensitive dataset. Copy the sense of what is good, not the data.

j45

0 replies

13h57m

2024-05-17 04:31:45 UTC

Hope it's not doublespeak, ambiguity leaves it grey, maybe to play.

hackernewds

0 replies

13h10m

2024-05-17 05:18:34 UTC

so if I don't want slack to train on _anything_ what do I do? I still suspect everything now

__loam

0 replies

13h48m

2024-05-17 04:40:57 UTC

Opt out is such bullshit.

mayank

3 replies

18h3m

2024-05-17 00:25:28 UTC

They need to reword this. Whoever wrote it is a liability

Sounds like it’s been written specifically to avoid liability.

MingFengLiu

2 replies

17h1m

2024-05-17 01:28:14 UTC

I'm sure it was lawyers. It's always lawyers.

cqqxo4zV46cp

1 replies

11h20m

2024-05-17 07:08:38 UTC

Yes, lawyers do tend to have a part to play in writing things that present a legally binding commitment being made by an organisation. Developers really can’t throw stones from their glass houses here. How many of you have a pre-canned spiel explaining why the complexities of whichever codebase you spend your days on are ACTUALLY necessary, and are certainly NOT the result of over-engineering? Thought so.

ben_w

0 replies

5h38m

2024-05-17 12:50:59 UTC

How many of you have a pre-canned spiel explaining why the complexities of whichever codebase you spend your days on are ACTUALLY necessary, and are certainly NOT the result of over-engineering? Thought so.

Hm, now you mention it, I don't think I've ever seen this specific example.

Not that we don't have jargon that's bordering on cant, leading to our words being easily mis-comprehended by outsiders: https://i.imgur.com/SL88Z6g.jpeg

Canned cliches are also the only thing I get whenever I try to find out why anyone likes the VIPER design pattern — and that's despite being totally convinced that (one of) the people I was talking to, had genuinely and sincerely considered my confusion and had actually experimented with a different approach to see if my point was valid.

chefandy

3 replies

16h14m

2024-05-17 02:14:16 UTC

Nah. Whoever decided to create the reality their counsel is dancing around with this disclaimer is the actual problem, though it's mostly a problem for us, rather than them.

FuckButtons

2 replies

13h36m

2024-05-17 04:53:11 UTC

It’s a problem for them if it loses customer trust / customers.

hackernewds

0 replies

13h9m

2024-05-17 05:19:47 UTC

if they lose enough, they will "sorry we got caught"

if they don't, they will not do anything

chefandy

0 replies

13h2m

2024-05-17 05:27:01 UTC

If it impacted their business significantly, it would restore some of the faith I've lost in humanity recently. Frankly, I'm not holding my breath.

j45

2 replies

13h57m

2024-05-17 04:31:19 UTC

I'm imagining a corporate slack, with information discussed in channels or private chats that exists nowhere else on the internet.. gets rolled into a model.

Then, someone asks a very specific question.. conversationally.. about such a very specific scenario..

Seems plausible confidential data would get out, even if it wasn't attributed to the client.

Not that it’s possible to ask an llm how a specific or random company in an industry might design something…

hackernewds

1 replies

13h8m

2024-05-17 05:20:45 UTC

exactly. a fun game to see why it is so hard to prevent this

https://gandalf.lakera.ai/

j45

0 replies

12m

2024-05-17 18:16:20 UTC

Sometimes the obvious questions are met with a lot of silence.

I don't think I can be the only one who has had a conversation with GPT about something obscure they might know but there isn't much about online, and it either can't find anything... or finds it, and more.

dheera

1 replies

16h27m

2024-05-17 02:01:48 UTC

Seems like time to start some slack workspaces and fill them with garbage. Maybe from Uncyclopedia (https://en.uncyclopedia.co/wiki/Main_Page)

hsaliak

0 replies

15h15m

2024-05-17 03:13:22 UTC

The Riders of the Lost Kek dataset is an excellent candidate https://arxiv.org/abs/2001.07487

throwaway4aday

0 replies

5h47m

2024-05-17 12:42:14 UTC

I think it's as clear as it can be, they go into much more detail and provide examples in their bullet points, here are some highlights:

Our model learns from previous suggestions and whether or not a user joins the channel we recommend. We protect privacy while doing so by separating our model from Customer Data. We use external models (not trained on Slack messages) to evaluate topic similarity, outputting numerical scores. Our global model only makes recommendations based on these numerical scores and non-Customer Data.

We do this based on historical search results and previous engagements without learning from the underlying text of the search query, result, or proxy. Simply put, our model can't reconstruct the search query or result. Instead, it learns from team-specific, contextual information like the number of times a message has been clicked in a search or an overlap in the number of words in the query and recommended message.

These suggestions are local and sourced from common public message phrases in the user’s workspace. Our algorithm that picks from potential suggestions is trained globally on previously suggested and accepted completions. We protect data privacy by using rules to score the similarity between the typed text and suggestion in various ways, including only using the numerical scores and counts of past interactions in the algorithm.

To do this while protecting Customer Data, we might use an external model (not trained on Slack messages) to classify the sentiment of the message. Our model would then suggest an emoji only considering the frequency with which a particular emoji has been associated with messages of that sentiment in that workspace.
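
To make the quoted search example concrete, here is a rough sketch of reducing a query/result pair to numbers before anything reaches a global model. This is an illustration under assumptions, not Slack's implementation: `search_features` and its field names are made up, and the two features are just the ones the quote mentions (word overlap and click count).

```python
def search_features(query, message_text, click_count):
    """Reduce a (query, result) pair to numeric features.

    Only counts are returned; neither the query nor the message text
    is part of the output. Function and field names are hypothetical.
    """
    query_words = set(query.lower().split())
    message_words = set(message_text.lower().split())
    return {
        "word_overlap": len(query_words & message_words),
        "click_count": click_count,
    }


features = search_features(
    "quarterly report draft",
    "here is the quarterly report",
    click_count=7,
)
print(features)
```

A ranking model trained only on dictionaries like this one sees counts, which is the sense in which the quote says it "can't reconstruct the search query or result".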

hyping9

0 replies

5h9m

2024-05-17 13:19:55 UTC

They need to reword this. Whoever wrote it is a liability.

Wow you're so right. This multi-billion dollar company should be so thankful for your comment. I can't believe they did not consult their in-house lawyers before publishing this post! Can you believe those idiots? Luckily you are here to save the day with your superior knowledge and wisdom.

__loam

0 replies

13h47m

2024-05-17 04:41:58 UTC

If you trained on customer data your service contains customer data.

Nition

0 replies

17h35m

2024-05-17 00:54:05 UTC

- Create a Slack account for your 95-year-old grandpa

- Exclude that one account from using the models, he's never going to use Slack anyway

- Now you can learn, memorise, or reproduce all the Customer Data you like

JCM9

0 replies

7h1m

2024-05-17 11:27:25 UTC

Whatever lawyer wrote that should be fired. This poorly written nonsense makes it look like Slack is trying to look shady and subversive. Even if well intended this is a PR blunder.

IanCal

0 replies

11h26m

2024-05-17 07:03:10 UTC

The problem is this also covers very reasonable use cases.

Use sampling across messages for spam detection, predicting customer retention, etc - pretty standard.

Then there's cases where you could have models more like llms that can output data from the training set but you're running them for that customer.

koolba

32 replies

18h1m

2024-05-17 00:27:18 UTC

We offer Customers a choice around these practices. If you want to exclude your Customer Data from helping train Slack global models, you can opt out. If you opt out, Customer Data on your workspace will only be used to improve the experience on your own workspace and you will still enjoy all of the benefits of our globally trained AI/ML models without contributing to the underlying models.

Why would anyone not opt-out? (Besides not knowing they have to of course…)

Seems like only a losing situation.

m463

7 replies

17h52m

2024-05-17 00:36:46 UTC

Why would anyone not opt-out?

This is basically like all privacy on the internet.

Everyone WOULD opt-out, if it was easy, and it becomes a whack-a-opt-out game.

note how you opt-out (generic contact us), and what happens when you do opt-out (they still train anyway)

__loam

5 replies

13h27m

2024-05-17 05:01:21 UTC

Opt out should be the default by law

hackernewds

3 replies

13h6m

2024-05-17 05:22:54 UTC

so upgrades and customer approves everything? slippery slope to over regulation

yellow_postit

0 replies

10h45m

2024-05-17 07:43:32 UTC

Hence the cookie banners

bayindirh

0 replies

1h18m

2024-05-17 17:10:36 UTC

I’d take over regulation every day over unabashed user abuse in the name of free markets.

__loam

0 replies

12h21m

2024-05-17 06:07:28 UTC

The status quo is consumer abuse.

patrickk

0 replies

12h40m

2024-05-17 05:49:09 UTC

This is how GDPR works, explicit opt-in consent is needed from the customer.

https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-re...

halostatue

0 replies

17h18m

2024-05-17 01:10:40 UTC

When we send our notice, we are going to be sending a notice that we want none of our data used for any ML training from Slack or anyone else.

IMTDb

5 replies

17h22m

2024-05-17 01:07:10 UTC

Why would anyone not opt-out?

Because you might actually want to have the best possible global models ? Think of "not opting out" as "helping them build a better product". You are already paying for that product, if there is anything you can do, for free and without any additional time investment on your side that makes their next release better, why not do it ?

You gain a better product for the same price, they get a better product to sell. It might look like they get more than you do in the trade, and that's probably true; but just because they gain more does not mean you lose. A "win less / win more" situation is still a win-win. (It's even a win-win-win if you take into account all the other users of the platform).

Of course, if you value the privacy of these data a lot, and if you believe that by allowing them to train on them it is actually going to risk exposing private info, the story changes. But then you have an option to say stop. It's up to you to measure how much you value "getting a better product" vs "estimated risk of exposing some information considered private". Some will err on one side, some on the other.

trinsic2

0 replies

4h41m

2024-05-17 13:47:51 UTC

Of course, if you value the privacy of these data a lot, and if you believe that by allowing them to train on them it is actually going to risk exposing private info, the story changes. But then you have an option to say stop. It's up to you to measure how much you value "getting a better product" vs "estimated risk of exposing some information considered private". Some will err on one side, some on the other.

The problem with this reasoning, at least from what I am understanding, is that you don't really know when/where the training on your data crosses the line into information you don't want to share until it's too late. It's also a slippery slope.

krainboltgreene

0 replies

16h54m

2024-05-17 01:34:39 UTC

Think of "not opting out" as "helping them build a better product"

I feel like someone would only have this opinion if they've never ever dealt with anyone in the tech industry, or any capitalist, in their entire life. So like 8-19 year olds? Except even they seem to understand that the profit absolutist goals undermine everything.

This idea has the same smell as "We're a family" company meetings.

hehdhdjehehegwv

0 replies

15h56m

2024-05-17 02:33:08 UTC

I for one consider it my duty to bravely sacrifice my privacy to the altar of corporate profit so that the true beauty of LLM trained in emojis and cat gifs can bring humanity to the next epoch.

ericjmorey

0 replies

16h19m

2024-05-17 02:09:38 UTC

Do I have free access to and use of those models? If not, I don't care to help them.

cess11

0 replies

7h35m

2024-05-17 10:53:35 UTC

"Best" and "better" is doing a lot of extremely heavy lifting here.

Are you sure you actually want what's hiding under those weasel words?

tifik

4 replies

11h14m

2024-05-17 07:15:07 UTC

What's baffling to me is why companies think that when they slap AI on the press release, their customers will suddenly be perfectly fine with them scraping and monetizing all of their data on an industrial scale, without even asking for permission. In a paid service. Where the service is private communication.

cynicalsecurity

2 replies

5h15m

2024-05-17 13:14:05 UTC

Most people don't care, paid service or not. People are already used to companies stealing and selling their data up and down. Yes, this is absolutely crazy. But was anything substantial done against it before? No, hardly anyone was raising awareness against it. Now we keep reaping what we were sowing. The world keeps sinking deeper and deeper into digital fascism.

pera

1 replies

4h56m

2024-05-17 13:32:50 UTC

Companies do care: Why would you take additional risk of data leakage for free? In the best case scenario nothing happens but you also don't get anything out of it, in the worst case scenario extremely sensitive data from private chats get exposed and hits your company hard.

ffsm8

0 replies

4h32m

2024-05-17 13:56:49 UTC

Companies are comprised of people. Some people in some enterprises care. I'd wager that in any company beyond a tiny upstart you'll have people all over the hierarchy that don't care. And some of them will be responsible for toggling that setting... Or not, because they just can't be arsed to with how little they care about the chat histories of the people they're likely never even going to interact with being used to train some AI.

8338550bff96

0 replies

1h56m

2024-05-17 16:32:55 UTC

I am not pro-exploiting users' ignorance for their data, but I would counter this with the observation that slapping AI on product suddenly makes people care about the fact that companies are monetizing on their usage data.

Monetizing on user activity data through opt-out collection is not new. Pretending that this phenomenon has anything to do with AI seems like a play for attention that exploits people's AI fears.

I'll sandwich my comments with a reminder that I am not pro-exploiting users' ignorance for their data.

schneehertz

2 replies

17h45m

2024-05-17 00:43:47 UTC

We offer Customers a choice around these practices.

I remembered the joke from The Hitchhiker's Guide to the Galaxy, maybe they will have a small hint in a very inconspicuous place, like inserting this into the user agreement on page 300 or so.

m463

0 replies

16h10m

2024-05-17 02:18:52 UTC

Never more true than with apple.

Activating an iphone for example has a screen devoted to how privacy is important!

It will show you literally thousands of pages of how they take privacy seriously!

(and you can't say NO anywhere in the dialog, they just show you)

They are normalizing "you cannot do anything", and then everyone does it.

bigfudge

0 replies

12h1m

2024-05-17 06:27:56 UTC

“But the plans were on display…” “On display? I eventually had to go down to the cellar to find them.” “That’s the display department.” “With a flashlight.” “Ah, well, the lights had probably gone.” “So had the stairs.” “But look, you found the notice, didn’t you?” “Yes,” said Arthur, “yes I did. It was on display in the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying ‘Beware of the Leopard.’”

mvkel

2 replies

16h41m

2024-05-17 01:47:53 UTC

Because it's default opt-in, and most people won't see this announcement.

dheera

1 replies

16h31m

2024-05-17 01:58:05 UTC

Yep, much like just about every credit card company shares your personal information BY DEFAULT with third parties unless you explicitly opt out (this includes Chase, Amex, Capital One, but likely all others).

hackernewds

0 replies

15h35m

2024-05-17 02:54:14 UTC

how do you opt out of these? I do share my data with rocket money though since there's no good alternatives :(

clwg

2 replies

16h13m

2024-05-17 02:15:17 UTC

Because they don't seem to make it easy. It doesn't seem that as an individual user I have any say in how my data is used; I have to contact the Workspace Owner. When I do I'll be asking them to look at alternative platforms instead.

"Contact us to opt out. If you want to exclude your Customer Data from Slack global models, you can opt out. To opt out, please have your Org or Workspace Owners or Primary Owner contact our Customer Experience team at feedback@slack.com with your Workspace/Org URL and the subject line “Slack Global model opt-out request.” We will process your request and respond once the opt out has been completed."

hackernewds

1 replies

13h5m

2024-05-17 05:23:38 UTC

You can always quit your job right? /s

clwg

0 replies

6h11m

2024-05-17 12:17:29 UTC

I'm the one who picked Slack over a decade ago for chat, so hopefully my opinion still holds weight on the matter.

One of the primary reasons Slack was chosen was because they were a chat company, not an ad company, and we were paying for the service. Under these parameters, what was appropriate to say and exchange on Slack was both informally and formally solidified in various processes.

With this change, beyond just my personal concerns, there are legitimate concerns at a business level that need to be addressed. At this point, it's hard to imagine anything but self-hosted as being a viable path forward. The fact that chat as a technology has devolved into its current form is absolutely maddening.

p1esk

1 replies

17h0m

2024-05-17 01:28:36 UTC

I’d be surprised if more than 1% opt out.

tifik

0 replies

11h10m

2024-05-17 07:18:56 UTC

I'd be surprised if any legal department in any company with one will not freak the f out when they read this. They will likely lose the biggest customers first, so even if it is 1% of customers, it will likely affect their bottom line enough to give it a second thought. I don't see how they might profit from an in-house LLM more than from their enterprise-tier plans.

Their customer support will have a hell of a day today.

zeckalpha

0 replies

16h11m

2024-05-17 02:17:26 UTC

Opting out in this way may implicitly opt you in to workspace specific models.

sensanaty

0 replies

9h10m

2024-05-17 09:18:23 UTC

I'm willing to bet that for smaller companies, they just won't care enough to consider this an issue and that's what Slack/Salesforce is hedging on.

I can't see a universe in which large corpos would allow such blatant corporate espionage for a product they pay for no less. But I can already imagine trying to talk my CTO (who is deep into the AI sycophancy) into opting us out is gonna be arduous at best.

nyc_data_geek

30 replies

18h7m

2024-05-17 00:21:20 UTC

Story time.

I was at a VC conference last year and if I learned nothing else there, I learned how to spell "AI". Every single exhibitor just about had their signage proudly proclaiming their capabilities in this area, but one in particular struck me.

They were touting the API integrations they could offer to train their "Enterprise AI"/LLM, and among those integrations were things like M365, Slack, etc.

It struck me because of the garbage in, garbage out problem. I'd like to think that the amount of shitposting I do on Slack personally will poison that particular well of training data, but this seems to point to a larger problem to me.

LLM's don't have a concept of truth or reality, or awareness of any sort. If the training data they are fed is poorly quality checked/unsanitized by human intelligence, the outputs will be as useless/noisy as the original data set. It feels to me that in the frothy rush to capture market buzz and VC, this is being forgotten.

Am I missing something, here?

leoh

7 replies

18h5m

2024-05-17 00:23:54 UTC

Yes, consider an existing LLM being given “shitpost-y” messages and asking it if there is anything interesting in there. It could probably summarize it well and that could then be used for training another LLM.

etc etc

nyc_data_geek

6 replies

18h3m

2024-05-17 00:25:23 UTC

This assumes everything in the training data set is accurate. Sometimes people are wrong, obtuse, sarcastic, etc. LLM's don't have any way of detecting or accounting for this, do they?

That output, then being used to train other LLM's, just creates an ouroboros of AI generated dogshit.

sp332

3 replies

17h42m

2024-05-17 00:46:41 UTC

LLMs are state-of-the-art at detecting sarcasm. It won't help if the data is just wrong though.

Edit: https://arxiv.org/abs/2312.03706 Human performance on this benchmark (detecting sarcasm in Reddit comments) was 0.82, a BERT-based LLM scored 0.79.

https://arxiv.org/abs/2106.05752 LSTM, 98% at detecting sarcasm in a Project Gutenberg-based dataset.

E39M5S62

1 replies

16h55m

2024-05-17 01:33:44 UTC

I literally can't tell if you're being sarcastic or not.

rrr_oh_man

0 replies

16h49m

2024-05-17 01:39:37 UTC

Exactly

comboy

0 replies

16h40m

2024-05-17 01:48:40 UTC

LLMs are state-of-the-art at detecting sarcasm.

This is such a precious gem.

brookst

0 replies

14h2m

2024-05-17 04:26:44 UTC

And yet human civilization has survived the fact that many humans are wrong, lying, delusional, etc. There is no assumption that everything in our personal training set is accurate. In fact, things work better when we explicitly reject that idea.

LLMs do not rely on 100% factually accurate inputs. Sure, you’d rather have less BS than more, but this is all statistics. Just like most people realize that flat earthers are nutty, LLMs can ingest falsehoods without reducing output quality (again, subject to statistics)

bongodongobob

0 replies

13h15m

2024-05-17 05:13:54 UTC

The training data doesn't need to be strictly accurate. If it was, you'd just be programming a deterministic robot. The whole point is to feed it actual human language. Giving it shitposts and sarcasm is literally what makes it good. Think of it like 100 people guessing the number of marbles in a jar. Average their guesses and it will be very close. The training data is the guesses, the inference is the average.
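
The marble-jar analogy can be checked with a small simulation. This is only a sketch under stated assumptions: the jar holds 500 marbles, each of 100 guessers is off by independent Gaussian noise with standard deviation 100, and the noise is centred on the true count. None of these numbers come from the thread, and `simulate_guesses` is a hypothetical helper.

```python
import random
import statistics

TRUE_COUNT = 500  # assumed number of marbles in the jar


def simulate_guesses(true_count, n_guessers, noise_sd, seed=0):
    """Draw noisy but unbiased guesses around the true count."""
    rng = random.Random(seed)
    return [rng.gauss(true_count, noise_sd) for _ in range(n_guessers)]


guesses = simulate_guesses(TRUE_COUNT, n_guessers=100, noise_sd=100)

# Error of the averaged guess versus the average error of one guesser.
crowd_error = abs(statistics.mean(guesses) - TRUE_COUNT)
typical_individual_error = statistics.mean(abs(g - TRUE_COUNT) for g in guesses)

print(f"crowd error: {crowd_error:.1f}")
print(f"typical individual error: {typical_individual_error:.1f}")
```

The averaging only helps because the simulated errors cancel out. If every guesser were off in the same direction, the average would inherit that bias, which matches the caveat raised elsewhere in the thread that this "won't help if the data is just wrong".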

antipaul

7 replies

16h35m

2024-05-17 01:53:25 UTC

What do you think chatGPT uses as training data?

The whole world’s “sh*tposting”: Reddit, blogs, and the rest of the internet.

But also books and Wikipedia and what not.

You can “smooth” all the crap out via the training procedure.

But even more, Slack can easily filter training data to, say, only posts in high-use channels.

Further, slack has other options: eg, use their customer data only for marginal fine-tuning, for example.

Or, they don’t even know their use case yet - but want to wrap their arms around your data pronto.

rozap

2 replies

15h1m

2024-05-17 03:27:53 UTC

What makes you think I don't shitpost in the #engineering channel?

And heuristics don't even scratch the surface of the bigger problem where it's trained on people who aren't great at their jobs but type a lot of words on slack about circling back on KPIs.

bee_rider

1 replies

14h7m

2024-05-17 04:21:25 UTC

I think those types of people are actually shockingly well paid. If slack can make bots to replace them, they’ll print money, right?

blackenedgem

0 replies

10h15m

2024-05-17 08:13:53 UTC

That's all well and good until something goes down and you need someone knowledgeable to diplomatically shout at a vendor.

moneywoes

2 replies

15h6m

2024-05-17 03:22:50 UTC

how does the training procedure smooth the garbage out?

thomashop

1 replies

14h36m

2024-05-17 03:52:35 UTC

Through regularization techniques, data augmentation, loss functions, and gradient optimization, ensuring the model focuses on meaningful patterns and reduces overfitting to noise.

bigfudge

0 replies

12h11m

2024-05-17 06:18:09 UTC

It’s not obvious how any of those would do anything but better approximate the average of a noisy dataset. RLHF might help, but only if it’s not done by idiots.

nyc_data_geek

0 replies

14h41m

2024-05-17 03:47:37 UTC

ChatGPT isn't known for its accuracy though, is it? They coined the term "hallucination" because it is wrong so much.

chatmasta

3 replies

18h4m

2024-05-17 00:24:49 UTC

Why shouldn’t AI be able to shitpost too? At the very least, and much more importantly, AI should be able to recognize shitposting.

nyc_data_geek

2 replies

18h2m

2024-05-17 00:27:02 UTC

This is the crux of it, and where I'm wondering if I'm missing something. Can it, today? My understanding is it cannot discern reality from fiction, thus "hallucinations" (a misnomer because it implies awareness, which these probability models lack).

tomrod

0 replies

16h58m

2024-05-17 01:30:56 UTC

The poorly named hallucinations are creation of ideas from provided prompts, which ideas are not grounded in reality. It isn't the mistaken adjudication of the reality of a provided prompt.

fuchse

0 replies

17h9m

2024-05-17 01:20:05 UTC

That sounds surprisingly human

williamcotton

1 replies

18h3m

2024-05-17 00:25:54 UTC

The sheer scale of data on the long tail. Sure, the head is already a trash pile and has been for decades now, but there is plenty of non-monetized information all over the internet that is barely linked to or otherwise discoverable.

krainboltgreene

0 replies

16h56m

2024-05-17 01:32:16 UTC

It does not matter how hard they try, nothing will rival the CommonCrawl treasure trove except maybe Google's index itself.

beeboobaa3

1 replies

18h6m

2024-05-17 00:22:49 UTC

More ignored than forgotten.

nyc_data_geek

0 replies

18h5m

2024-05-17 00:23:57 UTC

Wallpapered over?

wongarsu

0 replies

16h4m

2024-05-17 02:24:50 UTC

Most shitposting is probably more straightforward to understand than business communication or press releases where realizing what wasn't said often carries more insight than the things that were said.

Of course training an AI model on simple, straightforward and honest data provides good results. That's the essence behind "Textbooks Are All You Need", which led to the phi LLMs. Those are great small LLMs. But if you want your model to understand the complexity of human communication you have to include it in your training data.

If you subscribe to the idea that to be the very best text completion engine possible you would need to have a perfect understanding of reality itself, how different humans perceive reality differently, and how they choose to communicate about this perception and their interaction with reality, themselves and other humans, then it's not unreasonable to expect that back-propagation would eventually find that optimal representation if given enough data, the right architecture and enough processing power. Or at least come somewhat close. In that paradigm there is no "bad data", only insufficient or badly balanced datasets. Just don't try doing that with a 3B parameter LLM.

swalsh

0 replies

15h56m

2024-05-17 02:32:46 UTC

Along the same lines, phi-3 is kind of a sign of what you can do if you focus only on high quality data. It seems like while yes, quantity is very important, quality matters almost as much.

mvkel

0 replies

16h42m

2024-05-17 01:46:36 UTC

The best LLMs were trained on data from the open internet, which is full of garbage. They still do a pretty good job (granted it has been fine tuned and RLHF'd, but you can do that with Slack data too)

jorisboris

0 replies

16h49m

2024-05-17 01:40:06 UTC

Same for Reddit or Facebook groups. There's a lot of shitposting there, but absolutely a lot of valuable information if LLMs manage to separate the wheat from the chaff.

bongodongobob

0 replies

13h18m

2024-05-17 05:10:44 UTC

I think what you're missing is that you're assuming an LLM treats whatever it "reads" as a true statement. Shitposting is almost like meta slang. I feel like that's a necessary thing for it to train on to truly understand language. I feel like people underestimate the depth LLMs can pick up on.

IanCal

0 replies

10h18m

2024-05-17 08:10:45 UTC

The more obvious things are that it's not training llms fully on all channels.

Some quick ideas:

Search and summarize other messages. No new LLM training, and mostly about linking to existing answers.

Fine tune on your messages, but only customer support messages in the public channel, not "eng-shitpost"

Natural language requests over your company data.

tifik

24 replies

11h20m

2024-05-17 07:08:40 UTC

Well I really hope this massively blows up in their face when all of Europe goes to work just about now, and then North America in 5-8 hours. Let's see if we have another Helldivers 2 event that makes them do a hard backpedal after losing thousands of large customers that will not under any circumstances take the chance.

I have a friend with a law firm who just called me yesterday for advice as he's thinking about switching to Slack from Teams. I gave him a glowing recommendation because it is literally night and day, but there is no way in hell he takes any chance any sensitive legal discussions leak out through prompt hacking. He might even be liable himself for knowingly using a tool that spells out "we read and reuse your conversations".

p1esk

15 replies

10h50m

2024-05-17 07:38:50 UTC

But you can opt out, right? So what’s the problem?

Also, is Teams (and other messengers) any different?

troupo

7 replies

10h40m

2024-05-17 07:48:55 UTC

But you can opt out, right? So what’s the problem?

This thinking is the problem. "Oh, we just added your entire private/privileged/NDA/corporate information to our training set without your consent. What's the problem?"

Opt-out must be the default.

Edit: By "Opt-out must be the default." I mean: no one's data must be included until they explicitly give consent via an opt-in :)

Liquidor

2 replies

10h26m

2024-05-17 08:02:46 UTC

Opt-out must be the default.

Don't you mean opt-in must be the default?

Or am I misunderstanding the concept of opt-ins :P

troupo

0 replies

10h25m

2024-05-17 08:03:57 UTC

Opt-in is "I agree to have my data included"

Opt-out is "I don't agree to have my data included"

ellisnguyen

0 replies

10h21m

2024-05-17 08:07:57 UTC

Opt-out by default = Opt-in.

Opt-in by default = Opt-out

zelphirkalt

1 replies

10h26m

2024-05-17 08:02:30 UTC

Especially since once it has been trained, it is in the model, and I am not aware of any way anyone has discovered to later remove from the model single or selected training data points, except for re-training/re-learning the model. So basically the crime might already be done.

But I also know that so many businesses are too sluggish to make a switch and employees incapable of understanding the risk. So unfortunately not all of Europe will switch away. But I hope a significant number gives them the middle finger.

bayindirh

0 replies

9h35m

2024-05-17 08:53:19 UTC

There's something called "machine unlearning" being worked on to address these issues.

This doesn't mean that I support Slack or any opt-in without consent training model. On the contrary. I don't have any OpenAI/Midjourney/etc. account, and don't plan to have one.

falcor84

0 replies

10h25m

2024-05-17 08:03:22 UTC

Exactly! Allowing access to your data should only be opt-in

ADeerAppeared

0 replies

9h35m

2024-05-17 08:53:47 UTC

Worth noting: This is a legal requirement in Europe

The GDPR mandates that consent is given affirmatively, with this kind of "oh we put it in the EULA nobody reads" being explicitly called out as non-compliant.

apignotti

2 replies

10h42m

2024-05-17 07:46:36 UTC

You can opt-out by manually writing an email to them. The process matters.

whatevaa

1 replies

9h35m

2024-05-17 08:53:58 UTC

They could make it even better, like requiring signed/certified physical mail /s. Or fax...

lproven

0 replies

6h24m

2024-05-17 12:05:02 UTC

:-D

torginus

0 replies

10h31m

2024-05-17 07:57:36 UTC

and how does that even work? Slack is a chat app. Does everyone involved in the chat need to opt out for it to be meaningful? What about bots?

rpastuszak

0 replies

10h21m

2024-05-17 08:07:21 UTC

Defaults matter!

Just look at how much Apple and Mozilla get from Google by having their browser as a default (ca. $20,000,000,000 and $400,000,000 IIRC per annum).

Or look at how many people rejected the tracking prompt displayed for FB when it was added to iOS (+70%).

notachatbot1234

0 replies

10h7m

2024-05-17 08:22:00 UTC

Do they discard everything processed so far every time someone opts out?

Arisaka1

0 replies

7h35m

2024-05-17 10:53:23 UTC

The way to opt out is by contacting support, in an era where opt ins and outs should be handled by a toggle button.

Either they don't expect many people to wish to opt their slacks out, or they're aware of the asynchronous friction this introduces and they don't care.

KronisLV

6 replies

9h49m

2024-05-17 08:39:32 UTC

Personally, I rather liked self-hosted versions of these:

Mattermost: https://mattermost.com/

Rocket.Chat: https://www.rocket.chat/

Nextcloud Talk: https://nextcloud.com/talk/

Out of those, Mattermost was the easiest to set up (just need PostgreSQL and a web server, in addition to the main container), however not being able to easily permanently delete instead of just archiving workspaces was awkward. Nextcloud Talk was very easy to get going if you already have Nextcloud but felt a bit barebones last I checked, whereas Rocket.Chat was overall the more pleasant option to use, although I wasn't the biggest fan of them using MongoDB for storage.

The user experience is pretty good with all of them, however in the groups that I've been a part of, ultimately nobody cared about self-hosting an instance, since most orgs just prefer Teams/Slack (or even Skype for just chatting/meetings) and most informal groups just default to Discord. Oh well.

cowpig

2 replies

6h23m

2024-05-17 12:05:33 UTC

Surprised you didn't mention zulip: https://zulip.com/

We use it and wouldn't trade it for any of the alternatives.

zelphirkalt

0 replies

5h38m

2024-05-17 12:50:21 UTC

Also is easy to set up and has Jitsi Meet integration and feels 10x more snappy than Slaaaaack.

KronisLV

0 replies

5h28m

2024-05-17 13:01:11 UTC

That's a lovely addition, thanks! I'll have to try it out as well at some point.

bayindirh

2 replies

9h26m

2024-05-17 09:02:17 UTC

The problem is not technical, but social with these platforms.

i.e. How do you convince 40+ people from 5 countries to add yet another memory resident chat application and fragment their knowledge to another app/mental space?

This gets way harder as the community becomes more dynamic and temporary (i.e. high circulation like students). I fought the good fight with someone last year, and they just didn't flex a nanometer, citing that the ergonomics of Slack are way better than the alternatives, and didn't care about data mining (which was a possibility back then) or older messages being held for ransom.

KronisLV

1 replies

9h13m

2024-05-17 09:15:30 UTC

i.e. How do you convince 40+ people from 5 countries to add yet another memory resident chat application and fragment their knowledge to another app/mental space?

If it's a company, you can just be like: "Hey, we use this platform for communication, you can log in with your Active Directory credentials."

It also has the added benefit of acting as a directory for every employee in the company, so getting in touch can be more convenient than e-mail (while you can also customize the notification preferences, so it doesn't get too spammy), as opposed to the situation which might develop, where some teams or org units are on Slack, others on Teams and getting in touch can be more messy.

If it's a free-form social group, then you can throw that idea away because of network effects, it'd be an uphill battle, same as how sometimes people complain about people using Discord for various communities, but at the same time the reality is that old school forums and such were also killed off - since most people already have a Discord account and there's less friction to just use that.

Either way, I'm happy that self-hosted software like that exists.

bayindirh

0 replies

8h56m

2024-05-17 09:32:55 UTC

If it's a company

That's a big if, and the answer is "No" in my case. If it was, that comment wouldn't be there.

It's not a "social group" either, but a group of independent institutions working together. It's like a large gear-train. A lot of connections between small islands of people. So you have to work together, and have to find a way somehow. So, it's complicated.

Either way, I'm happy that self-hosted software like that exists.

Me too. I happen to manage a Nextcloud instance, but nobody is interested in the "Talk" module.

tifik

0 replies

8h25m

2024-05-17 10:03:52 UTC

Yeah my lawyer friend is worried he might even lose his license over this. It's gonna be very interesting seeing how legal departments react to this.

If you disagree with practices like this, mention this to your legal.

paxys

13 replies

18h53m

2024-05-16 23:35:23 UTC

Data will not leak across workspaces.

If you want to exclude your Customer Data from helping train Slack global models, you can opt out.

I don't understand how both these statements can be true. If they are using your data to train models used across workspaces then it WILL leak. If they aren't then why do they need an opt out?

Edit: reading through the examples of AI use at the bottom of the page (search results, emoji suggestions, autocomplete), my guess is this policy was put in place a decade ago and doesn't have anything to do with LLMs.

Another edit: From https://slack.com/help/articles/28310650165907-Security-for-...

Customer data is never used to train large language models (LLMs).

So yeah, sounds like a nothingburger.

whimsicalism

8 replies

18h48m

2024-05-16 23:40:57 UTC

They're saying they won't train generative models that will literally regurgitate your text, my guess is classifiers are fair game in their interpretation

swatcoder

4 replies

18h43m

2024-05-16 23:46:12 UTC

You are assuming they're saying that, because it's one charitable interpretation of what they're saying.

But they haven't actually said that. It also happens that people say things based on faulty or disputed beliefs of their own, or people willfully misrepresent things, etc.

Until they actually do say something as explicit as what you suggest, they haven't said anything of the sort.

whimsicalism

3 replies

18h39m

2024-05-16 23:50:03 UTC

Data will not leak across workspaces. For any model that will be used broadly across all of our customers, we do not build or train these models in such a way that they could learn, memorize, or be able to reproduce some part of Customer Data.

I feel like that is explicitly what this is saying.

zmmmmm

2 replies

18h16m

2024-05-17 00:12:46 UTC

The problem is, it's really really hard to guarantee that.

Yes if they only train say, classifiers, then the only thing that can leak is the classification outcome. But these things can be super subtle. Even a classifier could leak things if you can hack the context fed into it. They are really playing with fire here.

whimsicalism

0 replies

17h43m

2024-05-17 00:45:34 UTC

yes, i certainly agree with you. i think oftentimes these policies are written by non-technical people

i'm not entirely convinced that classifiers and LLMs are disjoint to begin with

BHSPitMonkey

0 replies

11h36m

2024-05-17 06:52:15 UTC

It is also hard to guarantee that, in a multi-tenant application, users will never see other users' data due to causes like mistakes in AuthZ logic, caching gone awry, or other unpredictable situations that come up in distributed systems, yet even before the AI craze we were all happy to use these SaaS products anyway. Maybe this class of vulnerability is indeed harder to tame than most, but third-party software has never been without risks.

btown

2 replies

17h29m

2024-05-17 00:59:27 UTC

The OP privacy policy explicitly states that autocompletion algorithms are part of the scope. "Our algorithm that picks from potential suggestions is trained globally on previously suggested and accepted completions."

And this can leak: for instance, typing "a good business partner for foobars is" might not send that text upstream per se, but would be consulting a local model whose training data would have contained conversations that other Slack users are having about brands that provide foobars. How can Slack guarantee that the model won't incorporate proprietary insights on sourcing the best foobar producers into its choice of the next token? And sure, one could build an adversarial model that attempts to minimize this kind of leakage, but is Slack incentivized to create such a thing vs. just building an optimal autocomplete as quickly as possible?
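To make the foobar scenario concrete, here is a toy sketch of that failure mode. It assumes a naive bigram autocomplete trained on pooled raw text from two hypothetical workspaces; nothing here is claimed to resemble Slack's actual system, and all names and messages are invented.

```python
from collections import Counter, defaultdict
from typing import Optional


def train_bigram(messages: list[str]) -> dict[str, Counter]:
    """Count which token follows which across all messages."""
    model: dict[str, Counter] = defaultdict(Counter)
    for message in messages:
        tokens = message.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            model[prev][nxt] += 1
    return model


def complete(model: dict[str, Counter], prompt: str) -> Optional[str]:
    """Suggest the most frequent next token after the prompt's last token."""
    tokens = prompt.lower().split()
    if not tokens or tokens[-1] not in model:
        return None
    return model[tokens[-1]].most_common(1)[0][0]


# Hypothetical data: workspace A discusses a supplier, workspace B does not.
workspace_a = [
    "a good business partner for foobars is acme",
    "our foobars supplier is acme",
]
workspace_b = ["lunch is at noon"]

# The leak: one model trained on everyone's raw text.
global_model = train_bigram(workspace_a + workspace_b)

# A user in workspace B types the prompt and gets workspace A's insight back.
suggestion = complete(global_model, "a good business partner for foobars is")
print(suggestion)
```

A model trained only on `workspace_b` would suggest "at" for the same prompt, so the supplier name comes entirely from the other tenant's messages.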

Even if it were just creating classifiers, similar leakages could occur there, albeit requiring more effort and time from attackers to extract actionable data.

I can't blame Slack for wanting to improve their product, but I'd also encourage any users with proprietary conversations to encourage their admins to opt out as soon as possible.

yorwba

0 replies

9h44m

2024-05-17 08:44:53 UTC

How can Slack guarantee that the model won't incorporate proprietary insights on sourcing the best foobar producers into its choice of the next token?

This is explained literally in the next sentence after the one you quoted: "We protect data privacy by using rules to score the similarity between the typed text and suggestion in various ways, including only using the numerical scores and counts of past interactions in the algorithm."

If all the global model sees is {similarity: 0.93, past_interactions: 6, recommendation_accepted: true} then there is no way to leak tokens, because not only are the tokens not part of the output, they're not even part of the input. But such a simple model could still be very useful for sorting the best autocomplete result to the top.
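A minimal sketch of that split, assuming a hypothetical design where feature extraction runs next to the workspace's text and the shared scorer receives only numbers. The class, the function names, the weights and the use of `difflib` for similarity are all assumptions made for illustration, not Slack's implementation.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Features:
    similarity: float       # 0..1 score between typed text and a suggestion
    past_interactions: int  # how often this suggestion was accepted before


def extract_features(typed: str, suggestion: str, accept_counts: dict[str, int]) -> Features:
    """Runs inside the workspace; raw text is reduced to numbers here."""
    similarity = SequenceMatcher(None, typed.lower(), suggestion.lower()).ratio()
    return Features(similarity, accept_counts.get(suggestion, 0))


def global_score(features: Features, w_similarity: float = 1.0, w_interactions: float = 0.1) -> float:
    """Hypothetical shared model: its input contains no tokens at all."""
    return w_similarity * features.similarity + w_interactions * features.past_interactions


def rank(typed: str, suggestions: list[str], accept_counts: dict[str, int]) -> list[str]:
    """Sort workspace-local suggestions by the shared model's score."""
    return sorted(
        suggestions,
        key=lambda s: global_score(extract_features(typed, s, accept_counts)),
        reverse=True,
    )


ranked = rank("proj", ["random", "project-alpha"], {"project-alpha": 6})
print(ranked)
```

In this sketch the only things that could be learned globally are the two weights, and the candidate suggestions themselves always come from the user's own workspace.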

whimsicalism

0 replies

17h9m

2024-05-17 01:19:43 UTC

yeah i absolutely agree that even classifiers can leak, and the autocorrect thing sounds like i was wrong about generative (it sounds like an n-gram setup?)... although they also say they don't train LLMs (what is an n-gram? still an LM, not large... i guess?)

next_xibalba

1 replies

18h15m

2024-05-17 00:13:49 UTC

This reminds me of a company called C3.ai which claims in its advertising to eliminate hallucinations using any LLM. OpenAI, Mistral, and others at the forefront of this field can't manage this, but a wrapper can?? Hmm...

BChass

0 replies

17h53m

2024-05-17 00:35:18 UTC

Ah yes, the stock everyone believed in and thought it would reach the moon during 2020.

fallingsquirrel

0 replies

18h48m

2024-05-16 23:40:25 UTC

There are no data leaks in Ba Sing Se.

berniedurfee

0 replies

5h34m

2024-05-17 12:54:43 UTC

Right? I take Slack and Salesforce at their word. They’re good companies and look out for the best interests of their customers. They have my complete trust.

kepano

12 replies

17h49m

2024-05-17 00:39:18 UTC

In summary, you must opt-out if you want to exclude your data from global models.

Incredibly confusing language since they also vaguely state that "data will not leak across workspaces".

Use tools that cannot leak data not "will not".

mvkel

5 replies

16h39m

2024-05-17 01:49:27 UTC

Most, if not all SaaS software is multi-tenant, so we've been living in the "will not" world for decades now.

kepano

2 replies

16h2m

2024-05-17 02:26:55 UTC

That's exactly my point. "File over app"[1] is just as relevant for businesses as it is for individuals — if you don't want your data to be used for training, then take sovereignty of it.

[1] https://stephango.com/file-over-app

lxgr

1 replies

3h0m

2024-05-17 15:28:27 UTC

"File over app" is a good way of putting it!

Something strange is happening on your blog, fwiw: Bookmarking it via command + D flips the color scheme to "night mode" – is that intentional?

kepano

0 replies

2h21m

2024-05-17 16:07:26 UTC

Ah good catch. Yes the "D" key can be used to switch light/dark mode, but I didn't account for the bookmark shortcut. That should be fixed now. Thanks!

zeckalpha

1 replies

16h10m

2024-05-17 02:18:21 UTC

In your experience.

mvkel

0 replies

3h7m

2024-05-17 15:21:57 UTC

The SaaS business model breaks if you go single tenant except in Fortune 500 enterprise

creativeSlumber

3 replies

17h30m

2024-05-17 00:59:03 UTC

what is the difference between "will not" and "cannot" in legalese?

glennericksen

1 replies

17h15m

2024-05-17 01:14:07 UTC

"Will not" allows the existence of a bridge but it's not on your route and you say you're not going to go over it. "Cannot" is the absence of a bridge or the ability to cross it.

gketuma

0 replies

14h10m

2024-05-17 04:18:52 UTC

Wow, well explained.

purplejacket

0 replies

16h16m

2024-05-17 02:13:14 UTC

My off leash dog will not bite you, he is well behaved. My dog at home cannot bite you, he is too far away.

semitones

1 replies

17h18m

2024-05-17 01:10:25 UTC

Aren't all tools, essentially, just one API call away from "leaking data"?

kepano

0 replies

16h5m

2024-05-17 02:23:51 UTC

In this case they mean leak into the global model — so no. You can have sovereignty of your data if you use an open protocol like IRC or Matrix, or a self-hosted tool like Zulip, Mattermost, Rocket Chat, etc

tikkun

11 replies

19h23m

2024-05-16 23:06:11 UTC

Eugh. Has anyone compiled a list of companies that do this, so I can avoid them? If anyone knows of other companies training on customer data without an easy highly visible toggle opt out, please comment them below.

paxys

3 replies

18h49m

2024-05-16 23:40:14 UTC

It would be easier to compile a list of companies that don't do this.

The list:

hosteur

1 replies

17h53m

2024-05-17 00:36:09 UTC

My company does not do this and has no plans to do such a thing.

berniedurfee

0 replies

5h41m

2024-05-17 12:48:04 UTC

Lol, my favorite corpo-speak.

I’m not eating a steak and have no plans to eat a steak. Ask again tomorrow.

internetter

0 replies

18h5m

2024-05-17 00:23:28 UTC

Nonsense. There are plenty of companies that don't have shit policies like this. A vast majority, even. Stop normalizing it.

kepano

1 replies

18h4m

2024-05-17 00:24:33 UTC

if your data is stored in a database that a company can freely read and access (i.e. not end-to-end encrypted), the company will eventually update their ToS so they can use your data for AI training — the incentives are too strong to resist

https://twitter.com/kepano/status/1688610782509211648

https://twitter.com/kepano/status/1682829662370557952

ncr100

0 replies

16h10m

2024-05-17 02:18:21 UTC

And the penalty is unnoticeable to these companies.

goles

1 replies

18h54m

2024-05-16 23:34:23 UTC

Synology updated this policy back in March (Happened to be a Friday afternoon).

Services Data Collection Disclosure

"Synology only uses the information we obtain from technical support requests to resolve your issue. After removing your personal information, we may use some of the technical details to generate bug reports if the problem was previously unknown to implement a solution for our products."

"Synology utilizes the information gathered through technical support requests exclusively for issue resolution purposes. Following the removal of personal data, certain technical details may be utilized for generating bug reports, especially for previously unidentified problems, aimed at implementing solutions for our product line. Additionally, Synology may transmit anonymized technical information to Microsoft Azure and leverage its OpenAI services to enhance the overall technical support experience. Synology will ensure that personally identifiable information, such as names, phone numbers, addresses, email addresses, IP addresses and product serial numbers, is excluded from this process."

I used to just delete privacy policy update emails and the like but now I make a habit of going in to diff them to see if these have been slipped in.
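Diffing two versions of a policy can be done with the standard library. A minimal sketch, assuming you have saved the old and new text yourself; the sentence splitting is deliberately crude and the sample policy strings are invented, not quotes from any real policy.

```python
import difflib


def split_sentences(text: str) -> list[str]:
    """Crude sentence split: break after '. ' and drop empty pieces."""
    return [s.strip() for s in text.replace(". ", ".\n").splitlines() if s.strip()]


def added_sentences(old: str, new: str) -> list[str]:
    """Return sentences that appear in the new policy but not in the old one."""
    diff = difflib.unified_diff(split_sentences(old), split_sentences(new), lineterm="")
    # Lines starting with '+' are additions; '+++' is the diff's file header.
    return [line[1:] for line in diff if line.startswith("+") and not line.startswith("+++")]


# Hypothetical before/after policy text.
old_policy = "We use support data to resolve your issue. We may generate bug reports."
new_policy = (
    "We use support data to resolve your issue. We may generate bug reports. "
    "We may send anonymized data to a third-party AI service."
)

for sentence in added_sentences(old_policy, new_policy):
    print("ADDED:", sentence)
```

Rewordings show up as one removed and one added sentence, so a heavily rephrased policy still needs a human read to spot what actually changed.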

bn-l

0 replies

16h36m

2024-05-17 01:53:02 UTC

Like the other poster it would be great to have a name and shame site that lists companies training on customer data

weikju

0 replies

19h22m

2024-05-16 23:07:14 UTC

ceruleanseas

0 replies

19h3m

2024-05-16 23:25:32 UTC

We can fight back by not posting anything useful or accurate to the internet until there are protections in place and each person gets to decide how their data is used and whether they are compensated for it.

berniedurfee

0 replies

5h44m

2024-05-17 12:44:16 UTC

pyromaker

11 replies

18h2m

2024-05-17 00:26:25 UTC

Our mission is to build a product that makes work life simpler, more pleasant and more productive.

I know it would be impossible but I wish we could go back to the days when we didn't have Slack (or similar tools). Our Slack is a cesspool of people complaining, talking behind other people's backs, echo chamber of negativity etc.

That probably speaks more to the overall culture of the company, but Slack certainly doesn't help. You can also say "tool is not the problem, people are" - sure, we can always explain things away, but Slack certainly plays a role here.

rjh29

2 replies

17h20m

2024-05-17 01:08:23 UTC

That probably speaks more to the overall culture of the company

Yep. Fun fact, my last workplace had a fairly nontoxic Slack... but there was a whole second Slack dedicated to bitching and shitposting where the bosses weren't invited. Humans gonna human.

grepfru_it

1 replies

17h11m

2024-05-17 01:18:09 UTC

Was not limited to just the bosses who were not invited. If you weren’t in the cool club you also did not get an invite.

A very inclusive company on paper that was very exclusionary behind the scenes.

eggdaft

0 replies

12h12m

2024-05-17 06:16:36 UTC

What happened when someone from the cool club got promoted and became a boss?

matthewmacleod

1 replies

17h59m

2024-05-17 00:29:43 UTC

No, I don’t think Slack does play a role in this. It is quite literally a communication tool (and I’d argue one that encourages far _more_ open communication than others).

If Slack is a cesspool, that’s because your company culture is a cesspool.

grob-gambit

0 replies

6h5m

2024-05-17 12:24:09 UTC

I think open communication in a toxic environment can obviously amplify toxicity or at least less open communication can act as a damper on toxicity.

Slack is surely not the generator of toxicity but it seems obvious it could act at increasing the bandwidth.

You can't have it both ways.

barkbyte

1 replies

16h55m

2024-05-17 01:33:40 UTC

Your company sucks. I’ve used slack at four workplaces and it’s not been at all like that. A previous company had mailing lists and they were toxic as you describe. The tool was not the issue.

muglug

0 replies

14h3m

2024-05-17 04:26:08 UTC

Yeah, written communication is harder than in-person communication.

It’s easy to come across poorly in writing, but that issue has no easy resolution unless you’re prepared to ban Slack, email, and any other text-based communication system between employees.

Slack can sometimes be a place for people who don’t feel heard in conventional spaces to vent — but that’s an organisational problem, not a Slack problem.

zemo

0 replies

12h49m

2024-05-17 05:39:35 UTC

HN isn't really a bastion of media literacy or tech criticism. If you ever ask "does [some technology] affect [something qualitative] about [anything]", the response on hn is always going to be "technology isn't responsible, it's how the technology is used that is responsible!", asserting, over and over again, that technology is always neutral.

The idea that the mechanism of how people communicate affects what people communicate is a pretty foundational concept in media studies (a topic which is generally met with a hostile audience on HN). Slack almost certainly does play a role, but people who work in technology are incentivized to believe that technology does not affect people's behaviors, because that belief allows people who work in technology to be free of any and all qualitative or moral judgements on any grounds; the assertion that technology does not play a role is something that technology workers cling to because it absolves them of all guilt in all situations, and makes them, above all else, innocent in every situation. On the specific concept of a medium of communication affecting what is being communicated, McLuhan took these ideas to such an extreme that it's almost ludicrous, but he still had some pretty interesting observations worth thinking on, and his writing on this topic is some of the earlier work. This is generally the place where people first look, because much of the other work assumes you've understood McLuhan's work in advance. https://en.wikipedia.org/wiki/Understanding_Media

vasco

0 replies

17h51m

2024-05-17 00:38:01 UTC

I disagree slack plays a role. You only mentioned human aspects, nothing to do with technology. There was always going to be instant messaging as software once computers and networks were invented. You'd just say this happens over email and blame email.

userbinator

0 replies

16h19m

2024-05-17 02:10:09 UTC

Switch to Teams instead.

Only half-kidding, but it's an application which is so repulsive it seems to discourage people from communicating at all.

rcaught

0 replies

16h18m

2024-05-17 02:11:00 UTC

Keyboard warriors

hex4def6

11 replies

19h18m

2024-05-16 23:10:39 UTC

I'm confused about this statement: "When developing AI/ML models or otherwise analyzing Customer Data, Slack can’t access the underlying content. We have various technical measures preventing this from occurring"

"Can't" is a strong word. I'm curious how an AI model could access data, but Slack, Inc itself couldn't. I suspect they mean "doesn't" instead of "can't", unless I'm missing something.

EGreg

5 replies

19h8m

2024-05-16 23:20:45 UTC

Every company that promises "end-to-end encryption" is just pinky-swearing to you also. Like Telegram or WhatsApp

int_19h

2 replies

15h20m

2024-05-17 03:08:44 UTC

Telegram client is open source, so you can see what exactly happens there when you enable E2EE.

EGreg

1 replies

13h20m

2024-05-17 05:08:40 UTC

Reproducible builds somehow ?

shultays

0 replies

10h32m

2024-05-17 07:56:30 UTC

https://core.telegram.org/reproducible-builds

LelouBil

1 replies

18h17m

2024-05-17 00:11:21 UTC

Yeah, at least if the client is open source you could verify.

8372049

0 replies

3h3m

2024-05-17 15:26:11 UTC

"If you can read assembly, all programs are open source."

Sure, it's easier and less effort if the program is actually open source, but it's absolutely still possible to verify on bytecode, decompiled or disassembled programs, too.

015a

2 replies

18h47m

2024-05-16 23:42:08 UTC

I also find the word "Slack" in that interesting. I assume they mean "employees of Slack", but the word "Slack" obviously means all the company's assets and agents, systems, computers, servers, AI models, etc.

I would find even a statement from Signal like "we can't access our users content" to be tenuous and overly-optimistic. Like, when I hear the word "can't" my brain goes to: there is nothing anyone in the company could do, within the bounds of the law, to do this. Employees at Slack could turn off the technical measures preventing this from occurring. Employees at Signal could push an app update which side-channels all messages through to a different server, unencrypted.

Better phrasing is "Employees of Slack will not access the underlying content".

throwaway22032

0 replies

17h40m

2024-05-17 00:48:16 UTC

Interestingly I'd probably go the other way.

If it's verifiably E2EE then I consider "we can't access this" to be a fairly powerful statement. Sure, the source could change, but if you have a reasonable distribution mechanism (e.g. all users get the same code, verifiably reproducible) then that's about as good as you can get.

Privacy policies that state "we won't do XYZ" have literally zero value to me to the extent that I don't even look at them. If I give you some data, it's already leaked in my mind, it's just a matter of time.

gbalduzzi

0 replies

7h54m

2024-05-17 10:35:08 UTC

I would find even a statement from Signal like "we can't access our users content" to be tenuous and overly-optimistic.

I don't really agree with this statement. Signal literally can't read user data right now. The statement is true, why can't they use it?

If they can't use it, nobody can. There are no services that can't publish an update reversing any security measure available. Also doing that would be illegal, because it would render the statement "we can't access our users content" false.

In Slack's case, it is totally different. Data is accessible by Slack systems, so the statement "we can't access our users content" is already false. Probably what they mean is something along the lines of: "The data can be accessed by our systems, but we have measures in place that block access for most of our employees"

spywaregorilla

0 replies

18h58m

2024-05-16 23:30:28 UTC

From their white paper linked in the same comment

Provisioning To minimize the risk of data exposure, Slack adheres to the principles of least privilege and role-based permissions when provisioning access—workers are only authorized to access data that they reasonably must handle in order to fulfill their current job responsibilities. All production access is reviewed at least quarterly.

so... seems like they very clearly can.

r_klancer

0 replies

4h45m

2024-05-17 13:44:12 UTC

As an engineer who has worked on systems that handle sensitive data, it seems to me to be straightforwardly a statement about:

1. ACLs

2. The systems that provision those ACLs

3. The policies that determine the rules those systems follow.

In other words, the model training batch job might run as a system user that has access to data annotated as 'interactions' (at timestamp T1 user U1 joined channel C1, at timestamp T2 user U2 ran a query that got 137 results), but no access to data annotated as 'content', like (certainly) message text or (probably) the text of users' queries. An RPC from the training job attempting to retrieve such content would be denied, just the same as if somebody tried to access someone else's DMs without being logged in as them.
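A minimal sketch of the annotation-based check described above, assuming a hypothetical in-memory store. The names `DataStore`, `Record`, `AccessDenied`, and the `training-job` principal are invented for illustration and say nothing about how Slack actually implements this:

```python
from dataclasses import dataclass, field


class AccessDenied(Exception):
    """Raised when a principal requests data outside its grants."""


@dataclass(frozen=True)
class Record:
    annotation: str  # hypothetical labels: "interactions" or "content"
    payload: str


@dataclass
class DataStore:
    # Maps a principal (system user) to the annotations it may read.
    grants: dict[str, set[str]]
    records: list[Record] = field(default_factory=list)

    def read(self, principal: str, annotation: str) -> list[str]:
        # Deny by default: unknown principals get an empty grant set.
        allowed = self.grants.get(principal, set())
        if annotation not in allowed:
            raise AccessDenied(f"{principal} may not read {annotation!r}")
        return [r.payload for r in self.records if r.annotation == annotation]


store = DataStore(
    grants={"training-job": {"interactions"}},
    records=[
        Record("interactions", "T1: U1 joined C1"),
        Record("interactions", "T2: U2 ran a query with 137 results"),
        Record("content", "message text the job must never see"),
    ],
)

# The batch job can read interaction metadata...
visible = store.read("training-job", "interactions")

# ...but a request for content is rejected before any data is returned.
try:
    store.read("training-job", "content")
    content_denied = False
except AccessDenied:
    content_denied = True
```

In this sketch the grant table is the only thing that decides access, which is the point of the paragraph above: the job's author never edits it directly.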

As a general rule in a big company, you the engineer or product manager don't get to decide what the ACLs will look like no matter how much you might feel like it. You request access for your batch job from some kind of system that provisions it. In turn the humans who decide how that system works obey the policies set out by the company.

It's not unlike a bank teller who handles your account number. You generally trust them not to transfer your money to their personal account on the sly while they're tapping away at the terminal--not necessarily because they're law abiding citizens who want to keep their job, but because the bank doesn't make it possible and/or would find out. (A mom and pop bank might not be able to make the same guarantee, but Bank of America does.) [*]

In the same vein, this is a statement that their system doesn't make it possible for some Slack PM to jack their team's OKRs by secretly training on customer data that other teams don't use, just because that particular PM felt like ignoring the policy.

[*] Not a perfect analogy, because a bank teller is like a Slack customer service agent who might, presumably after asking for your consent, be able to access messages on your behalf. But in practice I doubt there's a way for an employee to use their personal, probably very time-limited access to funnel that data to a model training job. And at a certain level of maturity a company (hopefully) also no longer makes it possible for a human employee to train a model in a random notebook using whatever personal data access they have been granted and then deploy that same model to prod. Startups might work that way, though.

xyst

8 replies

16h44m

2024-05-17 01:44:37 UTC

The gold rush for data is wild. Private companies selling us out.

- Slack

- Discord

- Reddit

- Stackoverflow

Let’s just hope this data gold rush dies out faster than the web3 craze before OpenAI reaches critical mass and gets access to government server farms.

Alphabet boys have server farms of domestic and foreign surveillance and intelligence. Exabytes of data [1]

[1] https://en.m.wikipedia.org/wiki/Utah_Data_Center

ilrwbwrkhv

4 replies

16h39m

2024-05-17 01:49:47 UTC

I mean slack is sold off. Founders made money. For all intents and purposes it's dead software.

whywhywhywhy

0 replies

4h48m

2024-05-17 13:41:05 UTC

The data it has is incredibly valuable if they build their own product or sell it off. You essentially have org charts of entire companies and people asking for things and getting responses and working back and forth together long term.

In terms of building agents for doing real work this could be more valuable than things like Reddit.

smileysteve

0 replies

15h35m

2024-05-17 02:54:08 UTC

It's owned by Salesforce; if they stop growing it, it'll go the way of Heroku - and that breach didn't go well.

eggdaft

0 replies

12h14m

2024-05-17 06:14:40 UTC

I actually think Slack is great and it has improved over the last 12 months.

Liquix

0 replies

15h44m

2024-05-17 02:45:10 UTC

proprietary software that no one who cares about privacy or security should use? absolutely. dead? not exactly

bigstrat2003

1 replies

16h24m

2024-05-17 02:05:14 UTC

Don't forget Dropbox: https://twitter.com/Werner/status/1734890651378975007

BHSPitMonkey

0 replies

11h51m

2024-05-17 06:37:30 UTC

That thread was a mischaracterization and a misunderstanding. The toggle simply exposed UI entry points to AI integrations that users could then opt to use, with consent.

TechDebtDevin

0 replies

11h50m

2024-05-17 06:38:27 UTC

Meh, tbh I think these guys live in a bit of a dream world about how much their data is worth. While investors and corporate partners will rush to these companies for their data, I'm not really convinced random internet conversations are going to push anything forward but let them sell shovels. Most of the miners always go broke.

donfotto

5 replies

7h12m

2024-05-17 11:17:12 UTC

Emoji suggestion: Slack might suggest emoji reactions to messages using the content and sentiment of the message, the historic usage of the emoji and the frequency of use of the emoji in the team in various contexts. For instance, if [PARTY EMOJI] is a common reaction to celebratory messages in a particular channel, we will suggest that users react to new, similarly positive messages with [PARTY EMOJI].

Finally someone has figured out a sensible application for "AI". This is the future. Soon "AI" will have a similar connotation as "NFT".

apwell23

2 replies

7h8m

2024-05-17 11:21:09 UTC

"leadership" at my company tallies emoji reactions to their shitty slack messages and not reacting with emojis over a period of time is considered a slight against them.

I had to up my slack emoji game after joining my current employer

tifik

0 replies

6h24m

2024-05-17 12:04:36 UTC

That sounds literally the same as "flair" from Office Space: https://www.youtube.com/watch?v=F7SNEdjftno

berniedurfee

0 replies

5h38m

2024-05-17 12:50:34 UTC

Yikes! That's some pretty heavy insecurity signals from leadership. Please like us! Sad.

shultays

0 replies

6h18m

2024-05-17 12:10:44 UTC

Finally. I am all for this AI if it is going to learn and suggest my passive aggressive "here" emoji that I use when someone @here s on a public channel with hundreds of people for no good reason.

latexr

0 replies

7h6m

2024-05-17 11:22:27 UTC

And it continues:

To do this while protecting Customer Data, we might use an external model (not trained on Slack messages) to classify the sentiment of the message. Our model would then suggest an emoji only considering the frequency with which a particular emoji has been associated with messages of that sentiment in that workspace.

This is so stupid and needlessly complicated. And all it does is remove personality from messages, suggesting everyone conforms to the same reactions.

tylerrobinson

4 replies

19h22m

2024-05-16 23:06:59 UTC

If you want to exclude your Customer Data from helping train Slack global models, you can opt out.

Well gee whiz, you want to help your old pal Slack, don’t you?

It’s such a slap in the face. And the opt-out is only available by email where more friction could be introduced.

nyc_data_geek

1 replies

18h25m

2024-05-17 00:03:59 UTC

So slimy that this isn't instead an opt-in!

padolsey

0 replies

18h17m

2024-05-17 00:12:11 UTC

They know, as well, that opt-in simply wouldn’t give them the scale they’d need for meaningful training data. They’re being very intentionally self-interested and unconcerned with their customers’ best interests.

ttul

0 replies

18h21m

2024-05-17 00:08:04 UTC

Real change requires legislation.

bionhoward

0 replies

19h11m

2024-05-16 23:17:30 UTC

“Do not access the Services in order to build a similar or competitive product or service or copy any ideas, features, functions, or graphics of the Services;” https://slack.com/acceptable-use-policy

I wonder if OpenAI wishes they could train ChatGPT on their corporate chat history? How ironic? Or they don’t care?

light_hue_1

4 replies

18h31m

2024-05-16 23:57:35 UTC

This is the final push we needed to move to Discord. Bye slack. We won't miss you.

noman-land

2 replies

18h21m

2024-05-17 00:07:50 UTC

Discord is not a better option than Slack. They are basically the same thing. Matrix is a better option from a privacy standpoint, just not from a UX one.

FujiApple

1 replies

18h7m

2024-05-17 00:21:47 UTC

I recently tried Zulip [1] again after a few years and the UX is much improved on web and mobile, worth a look (it is OSS and you can self host).

[1] https://zulip.com/

noman-land

0 replies

18h1m

2024-05-17 00:27:43 UTC

Thanks! I'll check it out.

Shekelphile

0 replies

15h57m

2024-05-17 02:31:32 UTC

It is a bold assumption to think that discord hasn't always been using harvested text/voice/video data to train models.

dfcarney

4 replies

18h5m

2024-05-17 00:23:20 UTC

In case this is helpful to anyone else, I opted out earlier today with an email to feedback@slack.com

Subject: Slack Global Model opt-out request.

Body:

<my workspace>.slack.com

Please opt the above Slack Workspace out of training of Slack Global Models.

noman-land

3 replies

17h58m

2024-05-17 00:30:29 UTC

Make sure you put a period at the end of the subject line. Their quoted text includes a period at the end.

Please also scold them for behaving unethically and perhaps breaking the law.

jgalt212

0 replies

17h50m

2024-05-17 00:39:08 UTC

We just opted out. I told them our lawyers have been instructed to watch them like a hawk.

drcongo

0 replies

8h33m

2024-05-17 09:55:24 UTC

The period is outside the quotes though, are you suggesting we should have the quotes too?

dfcarney

0 replies

17h52m

2024-05-17 00:36:44 UTC

Updated!

chefandy

4 replies

16h7m

2024-05-17 02:21:40 UTC

I wonder how many people that are really mad about these guys or SE using their professional output to train models thought commercial artists were just being whiny sore losers when Deviant Art, Adobe, OpenAI, Stability, et al did it to them.

Liquix

3 replies

15h50m

2024-05-17 02:38:41 UTC

squarely in the former camp. there's something deeply abhorrent about creating a place that encourages people to share and build and collaborate, then turning around and using their creative output to put more money in shareholder pockets.

i deleted my reddit and github accounts when they decided the millions of dollars per month they're receiving from their users wasn't enough. don't have the power to move our shop off slack but rest assured many will as a result of this announcement.

chefandy

2 replies

15h35m

2024-05-17 02:53:36 UTC

Yeah I haven't put a new codebase on GH in years. It's kind of a PITA hosting my own gitea server for personal projects but letting MS copy my work to help make my professional skillset less valuable is far less palatable.

Companies doing this would make me much less angry if they used an opt-in model only for future data. I didn't have a crystal ball and I don't have a time machine, so I simply can't stop these companies from using my work for their gain.

8372049

1 replies

2h43m

2024-05-17 15:46:03 UTC

Why do you think it's a pain to host the Gitea?

chefandy

0 replies

2h1m

2024-05-17 16:27:35 UTC

Compared to hosting other things? Nothing! It's great.

Hosting my own service rather than using a free SaaS solution that is entirely someone else's problem? There's a significant difference there. I've been running Linux servers either professionally or personally for almost 25 years, so it's not like it's a giant problem... but my work has been increasingly non-technical over the past 5 years or so, so even minor hiccups require re-acclimating myself to the requisite constructs and tools (wait, how do cron time patterns work? How do I test a variable in bash for this one-liner? How do iptables rules work again?)

It's not a deal breaker, but given the context, it's definitely not not a pain in the ass, either.

WhiteNoiz3

4 replies

7h28m

2024-05-17 11:00:33 UTC

To add some nuance to this conversation, what they are using this for is Channel recommendations, Search results, Autocomplete, and Emoji suggestion and the model(s) they train are specific to your workspace (not shared between workspaces). All of which seem like they could be handled fairly privately using some sort of vector (embeddings) search.

I am not defending Slack, and I can think of a number of cases where training on slack messages could go very badly (i.e., exposing private conversations, data leakage between workspaces, etc), but I think it helps to understand the context before reacting. Personally, I do think we need better controls over how our data is used and slack should be able to do better than "Email us to opt out".

wolfwyrd

1 replies

6h31m

2024-05-17 11:57:33 UTC

The way it's written means this just isn't the case. They _MAY_ use it for what you have mentioned above. They explicitly say "...here are a few examples of improvements..." and "How Slack may use Customer Data" (emph mine). They also... may not? And use it for completely different things that can expose who knows what via prompt hacking.

WhiteNoiz3

0 replies

3h52m

2024-05-17 14:36:55 UTC

Agreed, and that is my concern as well that if people get too comfortable with it then companies will keep pushing the bounds of what is acceptable. We will need companies to be transparent about ALL the things they are using our data for.

JackC

1 replies

4h58m

2024-05-17 13:30:34 UTC

the model(s) they train are specific to your workspace (not shared between workspaces)

That's incorrect -- they're stating that they use your "messages, content, and files" to train "global models" that are used across workspaces.

They're also stating that they ensure no private information can leak from workspace to workspace in this way. It's up to you if you're comfortable with that.

WhiteNoiz3

0 replies

3h53m

2024-05-17 14:35:37 UTC

From the wording, it sounds like they are conscious of the potential for data leakage and have taken steps to avoid it. It really depends on how they are applying AI/ML. It can be done in a private way if you are thoughtful about how you do it. For example:

Their channel recommendations: "We use external models (not trained on Slack messages) to evaluate topic similarity, outputting numerical scores. Our global model only makes recommendations based on these numerical scores and non-Customer Data"

Meaning they use a non-slack trained model to generate embeddings for search. Then they apply a recommender system (which is mostly ML not an LLM). This sounds like it can be kept private.

Search results: "We do this based on historical search results and previous engagements without learning from the underlying text of the search query, result, or proxy" Again, this is probably a combination of non-slack trained embeddings with machine learning algos based on engagement. This sounds like it can be kept private and team specific.

Autocomplete: "These suggestions are local and sourced from common public message phrases in the user’s workspace." I would be concerned about private messages being leaked via autocomplete, but if it's based on public messages specific to your team, that should be ok?

Emoji suggestions: "using the content and sentiment of the message, the historic usage of the emoji [in your team]" Again, it sounds like they are using models for sentiment analysis (which they probably didn't train themselves and even if they did, don't really leak any training data) and some ML or other algos to pick common emojis specific to your team.

To me these are all standard applications of NLP / ML that have been around for a long time.
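The emoji case above can be sketched in a few lines. This is only an illustration under stated assumptions: the sentiment label is assumed to come from some external classifier (not shown), and `record_reaction`, `suggest_emoji`, and the workspace names are hypothetical, not Slack's API:

```python
from collections import Counter, defaultdict

# counts[workspace][sentiment] is a Counter of emoji seen on messages
# with that sentiment. Only the label and the emoji are stored, never
# the message text, and nothing is shared across workspaces.
counts = defaultdict(lambda: defaultdict(Counter))


def record_reaction(workspace, sentiment, emoji):
    """Tally one reaction for a message whose sentiment was already labeled."""
    counts[workspace][sentiment][emoji] += 1


def suggest_emoji(workspace, sentiment, k=1):
    """Return the k most frequent emoji for this sentiment in this workspace."""
    return [emoji for emoji, _ in counts[workspace][sentiment].most_common(k)]


# Hypothetical reaction history for two separate workspaces.
for _ in range(3):
    record_reaction("acme", "positive", ":tada:")
record_reaction("acme", "positive", ":thumbsup:")
record_reaction("acme", "negative", ":cry:")
for _ in range(2):
    record_reaction("other", "positive", ":rocket:")

top_acme = suggest_emoji("acme", "positive")
top_other = suggest_emoji("other", "positive")
```

Because the table is keyed by workspace first, a suggestion for one team can never draw on another team's reactions in this sketch.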

Rebuff5007

4 replies

18h9m

2024-05-17 00:19:56 UTC

How can anyone in their right mind think building AI for emoji selection is a remotely good use of time...

nextworddev

1 replies

16h37m

2024-05-17 01:51:28 UTC

it's just a justification for collecting tokens

TechDebtDevin

0 replies

11h47m

2024-05-17 06:42:08 UTC

Tokens (outside of a few trillion) are worthless imo, I think OAI has pushed that limit, let the others chase them with billions into the ocean of useless conversational data and drown.

dlandis

0 replies

17h31m

2024-05-17 00:57:46 UTC

"These types of thoughtful personalizations and improvements are only possible if we study and understand how our users interact with Slack."

LOL

barkbyte

0 replies

16h51m

2024-05-17 01:38:11 UTC

I’d use that, at work. It would be a welcome improvement to their product.

theyinwhy

3 replies

12h10m

2024-05-17 06:18:37 UTC

Good we moved to matrix already. I just hope they start putting more emphasis on Element X, whose message handling has been broken on iOS for weeks now.

Arathorn

2 replies

11h29m

2024-05-17 06:59:38 UTC

Element X is where all the effort is going, and should be working really well. How is msg handling broken?

theyinwhy

0 replies

6h16m

2024-05-17 12:12:30 UTC

I need to go back to the overview whenever I receive a new message, as the reply form is broken after each message received.

drcongo

0 replies

8h35m

2024-05-17 09:53:57 UTC

Not the OP here, but I've tried really hard to use Element X and it crashes constantly.

ramijames

3 replies

18h43m

2024-05-16 23:45:55 UTC

I bet Discord is next.

reportgunner

1 replies

8h29m

2024-05-17 09:59:52 UTC

I regularly get ads or content on tiktok based on what I discuss in DMs on discord. It takes about an hour or sometimes even less.

nurple

0 replies

2h15m

2024-05-17 16:13:26 UTC

Same, but on YouTube.

herpdyderp

0 replies

18h11m

2024-05-17 00:18:04 UTC

I bet Discord is already doing it.

r_thambapillai

3 replies

17h28m

2024-05-17 01:01:14 UTC

The incentive for first party tool providers to do this is going to be huge, whether it's Slack, Google, Microsoft, or really any other SaaS tool. Ultimately, if businesses want to avoid getting commoditized by their vendors, they need to be in control of their data, and their AI strategy. And that probably ultimately means turning off all of these small-utility-very-expensive-and-might-ruin-your-business features, and actually creating a centralized, access controlled, well governed knowledge base into which you can plug any open source or black box LLM, from any provider.

bn-l

1 replies

16h38m

2024-05-17 01:50:54 UTC

It’s definitely a moral hazard (/opportunity). As a reminder, by default on Windows 11 Microsoft syncs your files to their server.

Liquix

0 replies

15h39m

2024-05-17 02:50:06 UTC

all your files? no way that cozy of a blanket statement can be true. if you kept cycling in drives full of /dev/random you could fill up M$ servers with petabytes of junk? sounds like an appealing weekend project

jonnycomputer

0 replies

16h59m

2024-05-17 01:29:52 UTC

"commoditized by their vendors" is exactly the phrase I was looking for. It's why I wanted my co to self-host Mattermost instead of using Slack.

icoe

3 replies

19h0m

2024-05-16 23:28:34 UTC

Not to be glib, but this is why we built Tonic Textual (www.tonic.ai/textual). It’s both very challenging and very important to protect data in training workflows. We designed Textual to make it easy to both redact sensitive data and replace it with contextually relevant synthetic data.

Ephil012

2 replies

18h45m

2024-05-16 23:43:34 UTC

To add on to this: I think it should be mentioned that Slack says they'll prevent data leakage across workspaces in their model, but don't explain how they do this. They don't seem to go into any detail about their data safeguards and how they're excluding sensitive info from training. Textual is good for this purpose since it redacts PII thus preventing it from being leaked by the trained model.

Disclaimer: I work at Tonic

a2128

1 replies

15h41m

2024-05-17 02:47:29 UTC

How do you handle proprietary data being leaked? Sure you can easily detect and redact names and phone numbers and addresses, but without significant context it seems difficult to detect whether "11 spices - mix with 2 cups of white flour ... 2/3 teaspoons of salt, 1/2 teaspoons of thyme [...]" is just a normal public recipe or a trade secret kept closely guarded for 70 years

icoe

0 replies

2024-05-17 18:26:41 UTC

Fair question, but you have to consider the realistic alternatives. For most of our customers inaction isn't an option. The combination of NER models + synthesis LLMs actually handles these types of cases fairly well. I put your comment into our web app and this was the output:

How do you handle proprietary data being leaked? Sure you can easily detect and redact names and phone numbers and addresses, but without significant context it seems difficult to detect whether "17 spices - mix with 2lbs of white flour ... half teaspoon of salt, 1 tablespoon of thyme [...]" is just a normal public recipe or a trade secret kept closely guarded for 75 years.

blhack

3 replies

18h47m

2024-05-16 23:42:12 UTC

How could this possibly comply with European "right to be forgotten" legislation? In fact, how could any of these AI models comply with that? If a user requests to be forgotten, is the entire model retrained (I don't think so).

whimsicalism

1 replies

18h42m

2024-05-16 23:46:35 UTC

how could any of these AI models comply with that? If a user requests to be forgotten, is the entire model retrained (I don't think so).

I don't believe that is the current interpretation of GDPR, etc. - if the model is trained, it doesn't have to be deleted due to a RTBF request afaik. there is significant legal uncertainty here

Recent GDPR court decisions mean that this is probably still non-compliant due to the fact that it is opt-out rather than opt-in. Likely they are just filtering out all data produced in the EEA.

8372049

0 replies

2h56m

2024-05-17 15:32:19 UTC

Likely they are just filtering out all data produced in the EEA.

Likely they are just hoping to not get caught and/or consider it cost of doing business. GDPR has truly shown us (as if we didn't already know) that compliance must be enforced.

beefnugs

0 replies

17h34m

2024-05-17 00:54:27 UTC

This "ai" scam going on now is the ultimate convoluted process to hide sooo much tomfuckery: there's no such thing as copyright anymore! this isn't stealing anything, it's transforming it! you must opt out before we train our model on the entire internet! (and we still won't spits in our face) this isn't going to reduce any jobs at all! (every company on earth fires 15% of everyone immediately) you must return to office immediately or be fired! (so we get more car data teehee) this one weird trick will turn you into the ultimate productive programmer! (but we will be selling it to individuals not really making profitable products with it ourselves)

and finally the most egregious and dangerous: censorship at the lowest level of information before it can ever get anywhere near people's fingertips or eyeballs.

paulv

2 replies

14h56m

2024-05-17 03:32:23 UTC

It seems like we've entered an era where not only are you paying for software with money, you're also paying for software with your data, privacy implications be damned. I would love to see people picking f/oss instead.

eggdaft

1 replies

12h7m

2024-05-17 06:21:43 UTC

Problems with f/oss for business applications:

1. Great UX folks almost never work for free. So the UX of nearly all OSS is awful.

2. Great software comes from a close connection to users. When your software is an OS kernel that works just fine for programmers, but how many OSS folks want to spend their free time on zoom talking to hundreds of businesses and understanding their needs, so they can give them free software?