Number of Posts with negative sentiment, grouped by Topic
# 1 Result: Python Packaging
Checks out
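A tally like this is easy to reproduce once each post carries a topic and a sentiment label; here's a minimal sketch in plain Python (the records and field names are made-up stand-ins, not the article's actual schema):

```python
from collections import Counter

# Hypothetical post records; real data would come from the scraped JSONL.
posts = [
    {"topic": "Python Packaging", "post_sentiment": "negative"},
    {"topic": "Python Packaging", "post_sentiment": "negative"},
    {"topic": "Function Calling", "post_sentiment": "negative"},
    {"topic": "Python Packaging", "post_sentiment": "positive"},
]

# Count negative-sentiment posts per topic, most common first.
negative_by_topic = Counter(
    p["topic"] for p in posts if p["post_sentiment"] == "negative"
)
print(negative_by_topic.most_common())
# [('Python Packaging', 2), ('Function Calling', 1)]
```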
(disclaimer: I work for Discourse)
Discourse has an AI plugin that admins can run on their community to generate their own sentiment analysis (among other things), though it's not quite as thorough as this write up! https://meta.discourse.org/t/discourse-ai-plugin/259214
We're always interested to see how public data can be used like this. It's something that can be a lot more difficult on closed platforms.
helps you keep tabs on your community by analyzing posts and providing sentiment and emotional scores to give you an overall sense of your community for any period of time [...]
Toxicity can scan both new posts and chat messages and classify them on a toxicity score across a variety of labels
Is that within the defined data processing purposes of all Discourse setups? Does the tool warn admins they might need to update their policies before being able to run this tool, perhaps needing to seek consent (depending on their jurisdiction and ethics)? It sounds somewhat objectionable, trying to guess my mental state from what I write without opt-in
Edit: and apparently it also tries to flag NSFW chat messages, does Discourse have PM chats where this would flag private messages for admins to read or is it only public chats that this bot runs on?
tagging NSFW image content in posts and chat messages
Is that within the defined data processing purposes of all Discourse setups?
It's an optional plugin that can be enabled / disabled by the site admin. Those modules are all disabled by default, and each need to be enabled by the site owner.
Edit: and apparently it also tries to flag NSFW chat messages, does Discourse have PM chats where this would flag private messages for admins to read or is it only public chats that this bot runs on?
Discourse PMs can be read by admins, see the definition here: https://meta.discourse.org/t/guidance-and-best-practices-on-...
Of course an admin can always open up the database and read your forum PMs, that's not surprising. The very first line in the link you provided, however, is what I was worried about:
Moderators can read PMs that have an active flag.
This system is now setting nsfw flags in an automated fashion, specifically seeking out content that the persons involved wouldn't want others to see. Clearly a forum is the wrong place for that content, but people don't always make good decisions (especially kids; I was a kid on forums too and would be very surprised if nothing ever transpired there). The receiving person can already flag anything they deem inappropriate. A system making automated decisions about messages that were intended to be private creates problems and it is not clear to me who this serves
it is not clear to me who this serves
customers
More companies and communities than you think already do this without your knowledge let alone consent.
That doesn't mean we can't do better
Better at what though? I don't even think it's a problem to begin with.
Discourse is not a centralized platform, so it's up to individual sites to ensure they're compliant with data and privacy regulations.
I mention that in the first non-quote sentence
I don't think there's anything left for you to consent once you decide to post on a public forum. If I can read your post and guess your mental state so can any other bot.
If you park your car on the side of the road, that also doesn't allow anyone to do with it what they please
If you write an article and post it on your blog, people can't just come along and take the text verbatim
If you license your blog as public domain, then someone takes the content and does something objectionable with it, you can (in many countries) still make use of moral rights if you'd wish to correct the situation
If I post something publicly on a forum, I'm well aware I may have agreed or consented (depending on the forum) to terms that allow this type of processing, but that is not the default. There exist restrictions, both legally and morally (some legal ones are even called moral rights and are inalienable). Hence my question how this plugin handles extending the allowed data processing to cover taking the content and making automated decisions and claims that may or may not be accurate. I would not be comfortable with that being an automated behind-the-scenes process flagging my posts as good or bad towards the moderators, since they likely won't care to read back hundreds of comments and see whether the computer did a good job
I did a bit of data scraping for fun in the past, but I was never quite sure of the legality of what I was doing. What if I was breaking some law in some jurisdiction of some country? Was someone going to track me down and punish me?
OpenAI has taught me that no one gives a shit. Scrape the entire internet if you want, and use the data for whatever you feel like.
OpenAI has taught me that no one gives a shit. Scrape the entire internet if you want, and use the data for whatever you feel like.
Cloudflare gives a shit.
My household had to use our 5G internet for most things for a week or two until our IP reputation recovered.
Yeah it’s probably worth renting a server if there’s any doubt about whether it’s wholly appropriate to do something
Some sites just block the entire AWS/GCP ip address range.
Do you think it would be better if someone did track you down and punish you? Which world do you want to live in?
I think large companies should be punished for stealing from people to make themselves richer.
A precursor to this would have been that LinkedIn lawsuit Microsoft lost, which allowed that one company to scrape all of LinkedIn (technically "public information").
hiQ Labs v. LinkedIn
We were really heading someplace with The Semantic Web aka The Real Web 3.0 [1]
Alas, we have to fight against the machines in order to properly read the internet through machines.
I believe Discourse knowingly keeps its data easy to scrape though, so kudos to them!
Now train a gpt based on the data :D
But make sure to call it ClosedData or something so we know it’s not open source
(sorry, I think openai and sam are gross)
Maybe I don’t understand this sentiment, but are people really that hung up on the name?
I see this sort of thing posted a lot (i.e., “it should be ClosedAI instead of OpenAI, lol”)
What if it just means “Open for Business” instead of “Open Access for All”? Or maybe they should just make it an acronym?
I’m sorry for the confusion on my part, but there’s just been a lot of words dedicated toward expressing frustration with the company because they chose to use “open” in their name.
Personally, I don’t find it frustrating that Apple doesn’t sell fruit and Intel doesn’t actually give intelligence data.
Is the frustration because of the name, or because open [access] was part of their ethos at the beginning, and people think they've abandoned it?
OpenAI is supposed to be a nonprofit. But, when the nonprofit board tried to exercise control, it became very clear that the nonprofit arm is not, in fact in control any longer. The board was wiped out, nearly everyone in the company seemingly was willing to join Microsoft or Sam Altman or what not.
This doesn't seem compatible with continuing to loftily call themselves by the same name as the initial nonprofit mission.
It's a gimmick. When the nonprofit was organized in 2015, the name certainly did not mean open for business. It meant (loftily) undertaking the quasi-religious quasi-humanist mission "in the spirit of liberty" to generate a new kind of super wealth as "broadly and evenly distributed as possible".
As in prepare for the end... THE END OF HIGH PRICES!
to benefit humanity as a whole, unconstrained by a need to generate financial return
"What if it just means" -- I mean, we don't have to ask "what if". We can look at the original press release:
https://openai.com/blog/introducing-openai
« We’re hoping to grow OpenAI into such an institution. As a non-profit, our aim is to build value for everyone rather than shareholders. Researchers will be strongly encouraged to publish their work, whether as papers, blog posts, or code, and our patents (if any) will be shared with the world. We’ll freely collaborate with others across many institutions and expect to work with companies to research and deploy new technologies. »
They never give an explicit explanation for their name, but it's pretty obvious.
Love it, just for the sole reason of turning something OpenAI made into a dataset for everyone else :D
I don’t think OpenAI are gonna lose any sleep over this.
Isn’t a “community forum” like this basically just: “we’re not gonna spend money on providing adequate customer support so instead here is a forum where y’all can talk amongst yourselves and we’ll give you some badges and imaginary points for doing the customer support yourselves”?
They probably just sic a customer service GPT on it and use it to train the other ones...
I believe a community forum is absolutely vital for an "ecosystem" company. There needs to be a town square where people can discuss ideas and share feedback about that particular ecosystem.
OpenAI has a pretty active forum with moderators replying and helping out all the time.
I didn't even know they had community forums. Looking at the main homepage (openai.com), the only external links I can find are to ChatGPT and their docs hosted on platform.openai.com. The other links lead to their socials, GitHub and SoundCloud (of all places).
Maybe I'm not looking thoroughly enough, so I may be wrong, tho!
I would also love to see these forums both to post and to lurk
https://community.openai.com/ (when you are logged in on platform.openai.com, there is a link from the menu)
Thank you!
Gone are the days when you simply saw all the important links on the main page, it seems. :)
That's an interesting write-up, I wonder how this would look for other big Discourse communities such as NixOS.
This is definitely a workflow we can package into something open-source.
I wonder how the community moderators would like it.
I for one would love it!
What's the "Day Knowledge Direction" cluster in the Atlas view?
Neat find! That's actually a cluster of all the system messages notifying users about closing and re-opening of the thread. That's why they're so tightly clustered.
I believe the naming isn't perfect for this, but this was all automatic topic modelling!
Example: [1]: https://community.openai.com/t/read-this-before-posting-a-ne...
That's super cool, thanks for sharing! I will share this as an easy-to-follow example of what we can do with AI.
Allowing a Q&A interface using these embeddings over the post contents could speed up research over the community posts (if you know the right questions to ask :P). Let's view some posts similar to this one complaining about function calling
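Surfacing "similar posts" with embeddings reduces to a nearest-neighbour lookup over vectors; here's a toy sketch using cosine similarity in plain Python (the vectors are made-up stand-ins for real embedding-model output):

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "embeddings" standing in for real model output, keyed by post title.
post_embeddings = {
    "function calling broken": [0.9, 0.1, 0.0],
    "function call returns wrong args": [0.85, 0.2, 0.05],
    "billing question": [0.0, 0.1, 0.95],
}

def most_similar(query_vec, embeddings, top_k=2):
    # Rank posts by cosine similarity to the query vector, highest first.
    scored = sorted(
        embeddings.items(),
        key=lambda kv: cosine_similarity(query_vec, kv[1]),
        reverse=True,
    )
    return [title for title, _ in scored[:top_k]]

print(most_similar([0.88, 0.15, 0.02], post_embeddings))
```

A real system would precompute the post embeddings once and use an approximate-nearest-neighbour index instead of a full sort, but the ranking idea is the same.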
That's indeed a great thing to surface, and that's exactly how the OpenAI forum selects the "Related Topics" to show at the end of every topic. We use embeddings for this feature, and the entire thing is open-source: https://github.com/discourse/discourse-ai/blob/main/lib/embe...
We also use embeddings for suggesting tags, categories, HyDE search and more. It's by far my favorite tech of this new AI/ML gen so far in terms of applicability.
Using Twitter-roBERTa-base for sentiment analysis, we generated a post_sentiment label (negative, positive, neutral) and post_sentiment_score confidence score for each post.
We do the same, with even the same model, and conveniently show that information on the admin interface of the forum. Again all open source: https://github.com/discourse/discourse-ai/tree/main/lib/sent...
Disclaimer: I'm the tech lead on the AI parts of Discourse, the open source software that powers OpenAI's community forum.
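Turning that model's output into a (label, confidence) pair is just a softmax plus an argmax over three logits. A sketch in plain Python, assuming the negative/neutral/positive label ordering commonly documented for that model family (the real pipeline would get the logits from the `transformers` library):

```python
import math

LABELS = ["negative", "neutral", "positive"]  # assumed model label order

def to_sentiment(logits):
    # Softmax the raw logits into probabilities (shifted by max for stability)...
    exps = [math.exp(x - max(logits)) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # ...then take the argmax as the label and its probability as the score.
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]

label, score = to_sentiment([2.1, 0.3, -1.5])
print(label, round(score, 3))
```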
did they have the right to use all their data?
/s
This is really amazing. Pretty insightful. Thank you.
So epic, thank you for making this dataset available to everyone!
I saw this part:
Every Discourse Discussion returns data in JSON if you append .json to the URL.
then this:
Raw data was gathered into a single JSONL file by automating a browser using Playwright.
Kinda seems to me like having a whole browser instance for this isn't necessary? I would have been surprised if this .json pattern didn't continue for all pages, and it turns out that it does in fact also work for the topic list: https://community.openai.com/latest.json
The other place I've seen this sort of API pattern is reddit. For example, https://www.reddit.com/r/all.json or (randomly chosen) https://www.reddit.com/r/mildlyinfuriating/comments/1bqn3c0/...
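Given that pattern, a plain HTTP client covers the whole crawl without a browser; here's a minimal stdlib sketch (the URL shapes follow the observed `.json` convention, but pagination details are assumptions worth checking against the live API):

```python
import json
import urllib.request

BASE = "https://community.openai.com"

def topic_json_url(topic_id):
    # Individual topics follow the observed pattern: /t/<id>.json
    return f"{BASE}/t/{topic_id}.json"

def latest_json_url(page=0):
    # The topic list works the same way: /latest.json?page=<n>
    return f"{BASE}/latest.json?page={page}"

def fetch_json(url):
    # Plain HTTP GET; no browser automation needed.
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)

# Example usage (network call):
#   topics = fetch_json(latest_json_url())["topic_list"]["topics"]
#   for t in topics[:5]:
#       print(t["id"], t["title"])
```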
A pro-tip for using the OpenAI API is to not use the official Python package for interfacing with it. The REST API documentation is good, and just using it in your HTTP client of choice like requests is roughly the same LOC without unexpected issues, along with more control.
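For comparison, hitting the chat completions endpoint directly is only a few lines; a stdlib sketch below (the model name and the `OPENAI_API_KEY` environment variable are the usual conventions, but treat them as assumptions):

```python
import json
import os
import urllib.request

API_URL = "https://api.openai.com/v1/chat/completions"

def build_request(messages, model="gpt-3.5-turbo"):
    # Assemble the JSON body and auth headers the REST API expects.
    body = json.dumps({"model": model, "messages": messages}).encode()
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', '')}",
    }
    return urllib.request.Request(API_URL, data=body, headers=headers)

def chat(messages, model="gpt-3.5-turbo"):
    # One POST, parse the JSON response, return the first choice's text.
    with urllib.request.urlopen(build_request(messages, model)) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# Example usage (network call, needs a valid API key):
#   print(chat([{"role": "user", "content": "Say hi"}]))
```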
I've found this happens with a lot of first party clients. At work, we use LaunchDarkly for feature flags and use their code references tool to keep track of where flags are being referenced. The tool uses their first party Go client to interact with the API but the client doesn't handle rate limiting at all even though they have rate limiting headers clearly documented for their API.
First party clients are typically an afterthought, and you can't add features without getting a PM to sign off, which strangles the impulse to polish & sand down rough edges.
Agreed. Any in particular come to mind that you'd like to see improved?
(my company provides first-party clients with a lot of polish; maybe we could help)
Hey minimaxir, I help maintain the official OpenAI Python package. Mind sharing what issues you've had with it? (Have you used it since November, when the 1.0 was released?)
Keen for your feedback, either here or email: alex@stainlessapi.com
There's nothing wrong per se, it works as advertised. But as a result it's a somewhat redundant dependency.
Ah, gotcha. Thanks, that makes sense. FWIW, here are some things it provides which might be worth having:
1. typesafety (for those using pyright/mypy) and autocomplete/intellisense
2. auto-retry (w/ backoff, intelligently so w/ rate limits) and error handling
3. auto-pagination (can save a lot of code if you make list calls)
4. SSE parsing for streaming
5. (coming soon) richer streaming & function-calling helpers (can save / clean up a lot of code)
Not all of these matter to everybody (e.g., I imagine you're not moved by such benefits as "dot notation over dictionary access", which some devs might really like).
I would argue that auto-retry would benefit a pretty large percentage of users, though, especially since the 429 handling can paper over a lot of rate limits to the point that you never actually "feel" them. And spurious/temporary network connections or 500s also ~disappear.
For some simple use-cases, none of these would really matter, and I agree with you - especially if it's not production code and you don't use a type-aware editor.
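The auto-retry point is worth illustrating; here's a minimal exponential-backoff sketch in Python (the status codes and delays are illustrative defaults, not the package's actual policy):

```python
import time

RETRYABLE = {429, 500, 502, 503, 504}  # rate limits and transient server errors

def with_retries(call, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    # Retry transient failures with exponential backoff: 0.5s, 1s, 2s, ...
    for attempt in range(max_attempts):
        status, result = call()
        if status not in RETRYABLE:
            return status, result
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return status, result

# Example with a fake call that fails twice with 429, then succeeds:
attempts = []
def flaky():
    attempts.append(1)
    return (429, None) if len(attempts) < 3 else (200, "ok")

status, result = with_retries(flaky, sleep=lambda _: None)
print(status, result, len(attempts))  # 200 ok 3
```

The caller never "feels" the two rate-limit responses, which is exactly the papering-over effect described above.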
FWIW, here are the only links I could find in the article which were tagged "Python3 Package": https://community.openai.com/t/647723 and https://community.openai.com/t/586484 . Note they don't seem to have anything to do with the Python package whatsoever.
I was pretty disappointed to see this, as I work on the Python package and was hoping for a good place to find feedback (apart from the github issues, which I monitor pretty closely).
I'm not a data scientist; maybe someone from the Julep team could comment on the labeling? Or how I could find some more specific themes of problems with the Python package? (Was it just that people who have a problem of some kind just happen to also use the Python library?)
Hey! Happy to chat over email/X more closely and help you out.
Nomic Atlas automatically generates the labels here. There could be different variations of posts involving the Python Packages.
But I did some manual digging, and here's what I found: heading over to the map and filtering by posts around "Python Packages" leads to around 900 posts.
Sharing a few examples which do talk about people's posts related to the python package:
- https://community.openai.com/p/701058
- https://community.openai.com/p/652075
- https://community.openai.com/t/32442
- https://community.openai.com/p/143928
Note: My intuition is that most of the posts are very basic, probably user errors like "No API Key Found" etc.
gotcha, that makes sense - thank you!
The Python package is really well engineered, and the startup that builds the client from its OpenAPI spec, Stainless, is doing a good job.
This shows laypeople piling into a hype thing and running immediately into the roadblock of programming.
Normal people don't want to like, put in effort to feel like they are a part of something.
They are used to "just" having to turn on Netflix to feel like they are a part of the biggest TV show, or "just" having to click a button to buy a Stanley Cup, or "just" having to click a button to buy Bitcoin. The API and performance issues, IMO, they're not noise, but they are meaningless. To me this also signals how badly Grok and Stability are doing it, they are doubling and tripling down on popular opinions that have a strong, objective meaninglessness to them (like how fast the tokens come out and how much porn you're allowed to make). Whereas the Grok people are looking at this analysis and feeling very validated right now.
I have no dog in this race, but I would hope that the OpenAI people do not waste any time on Python APIs for dumb people; instead, they should definitely improve their store and have a firmer opinion on how that would look. They almost certainly have a developing opinion on a programming paradigm for chatbots, but I feel like they are hamstrung by needing to quantize their models to meet demand, not by decisions about the look and feel of Python APIs or the crappiness of the Python packaging ecosystem. Another POV is that the Apple development experience remains notoriously crappy, and yet they are the most valuable platform for most companies in the world right now; also, JetBrains could not sustain an audience for the AppCode IDE, because everyone uses middlewares anyway; so I really don't think Python APIs matter as much as the community says they do. It's a Nice to Have, but it Does Not Matter.
we may think more similarly than you seem to think...
this was more a slam on python packaging in general, than it is on the OpenAI implementation.
I wouldn't be surprised if many of the issues under this topic are more related to Python package version nightmares, than OpenAI's Python implementation itself.