
Show HN: Exploring HN by mapping and analyzing 40M posts and comments for fun

abe94
33 replies
1d5h

This is impressive work, especially for a one-man show!

One thing that stood out to me was the graph of sentiment over time; I hadn't seen something like that before, and it was interesting to see it for Rust. What were the most positive topics over time? And were there topics that saw very sudden drops?

I also found this sentence interesting, as it rings true to me about social media: "there seems to be a lot of negative sentiment on HN in general." It would be cool to see a comparison of sentiment across social media platforms and across time!

wilsonzlin
20 replies
1d4h

Thanks! Yeah I'd like to dive deeper into the sentiment aspect. As you say it'd be interesting to see some overview, instead of specific queries.

The negative sentiment stood out to me mostly because I was expecting a more "clear-cut" sentiment graph: largely neutral-positive, with spikes in the positive direction around positive posts and negative around negative posts. However, for almost all my queries, the sentiment was almost always negative. Even positive posts apparently attracted a lot of negativity (according to the model and my approach, both of which could be wrong). It's something I'd like to dive deeper into, perhaps in a future blog post.

deadbabe
8 replies
1d3h

Anecdotally, I think anyone who reads HN for a while will realize it to be a negative, cynical place.

Posts written in sweet syrupy tones wouldn’t do well here, and jokes are in short supply or outright banned. Most people here also seem to be men. There’s always someone shooting you down. And after a while, you start to shoot back.

xanderlewis
3 replies
1d2h

(Without wanting to sound negative or cynical) I don't think it is, but maybe I haven't been here long enough to notice. It skews towards technically and scientifically minded people, which automatically makes it a bit 'cynical', but I feel like 95% of commenters are commenting at least in good faith. The same cannot be said of many comparable discussion forums or social media websites.

Jokes are also not banned; I see plenty on here. Low-effort ones and chains of unfunny wordplay or banter seem to be frowned upon though. And that makes it cleaner.

sethammons
2 replies
1d2h

I've been here a hot minute and I agree with you. Lots of good faith. Lots of personal anecdotes presumably anchored in experience. Some jokes are really funny, just not reddit-style. Similarly, no slashdot quips generally, such as "first post" or "i, for one, welcome our new HN sentiment mapping robot overlords." Sometimes things get downvoted that shouldn't be, but most of the flags I see are well deserved, and I vouch for the ones that I think are not flag-worthy.

goles
1 replies
1d

I wonder how much of a person's impression of this is formed by their browsing habits.

As a parent comment mentions, big threads can be a bit of a mess, but usually only for the first couple of hours. Comments made in the spirit of HN tend to bubble up, and off-topic, rude comments and bad jokes tend to percolate down over the course of hours. Also, a number of threads that tend to spiral get manually detached, which takes time.

Someone who isn't familiar with how HN works and is consistently early to stories that attract a lot of comments is reading an almost entirely different site than someone who just catches up at the end of the day.

fragmede
0 replies
22h36m

Some of the more negative threads will get flagged and detached, and by the end of the day a casual browse through the comments won't even come across them. E.g. something about the situation in the Middle East is going to attract a lot of attention.

holoduke
0 replies
23h8m

Really? Hmm, I think HN is a place with, on average, people of above-average intelligence. People who understand that their opinion is not the only one. I rarely have issues with people here. It might also be because we are all in the same bubble here.

flir
0 replies
22h52m

I think it's the engineering mindset. You're always trying to figure out what's wrong with an idea, because you might be the poor bastard that ends up having to build it. Less costly all round if you can identify the flaw now, not halfway through sprint 7. After a while it bleeds into everything you do.

darby_eight
0 replies
1d

> Anecdotally, I think anyone who reads HN for a while will realize it to be a negative, cynical place.

I don't think this is particularly unique to HN. Anonymous forums tend to attract contrarian assholes. Perhaps this place is more, erm, poorly socially adapted than the general population, but I don't see it as very far outside the norm, apart from the average wealth of the posters.

chiefalchemist
0 replies
18h38m

> Anecdotally, I think anyone who reads HN for a while will realize it to be a negative, cynical place.

Sure, sometimes. But usually it's

Truth seeking > group thinking

There's a fine line between critical and cynical. Sometimes that line gets crossed. Sometimes the ambiguity of text-only comms muddies the water.

dylan604
5 replies
1d

The sentiment issue is a curious one to me. For example, a lot of the non-devs I interact with take my direct questioning or critical responses as "negative" when there is no negative intent at all. Pointing out that something doesn't work, or anything else the dev community encounters daily, isn't negative sentiment; it's just pointing out the issues. Is it that meme-like helicopter parenting, constantly doling out praise, makes anything different read as negativity? Not every piece of art needs to be hung on the fridge door, and constructive criticism aimed at improvement is so often framed as negative. That does the world no favors.

Essentially, I'm not familiar with Hugging Face or any of the models in this regard. But if they are trained on social media, then the sentiment baseline seems skewed from the start to me.

Also, fully aware that this comment will probably be viewed as negative based on stated assumptions.

edit: reading further down the comments, clearly I'm not the first with these sentiments.

flawsofar
1 replies
21h3m

Every helicopter gets a trophy

dylan604
0 replies
20h50m

wait, the parents get a trophy?

wilsonzlin
0 replies
17h40m

You may be right, a more tailored classifier for HN comments specifically may be more accurate. It'd be interesting to consider the classes: would it still be simply positive/negative? Perhaps constructive/unconstructive? Usefulness? Something more along the lines of HN guidelines?
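A tailored classifier along those lines could start as simply as a linear model over TF-IDF features, trained on hand-labeled HN comments. A minimal sketch with scikit-learn; the tiny inline dataset and the constructive/unconstructive labels are purely illustrative, not the author's actual approach:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set; a real classifier would need
# hundreds of hand-labeled HN comments per class.
comments = [
    "Great write-up, the section on embeddings was especially clear.",
    "Have you considered benchmarking against a simpler baseline?",
    "This is useless, nobody needs another framework.",
    "Did you even read the article before commenting?",
]
labels = ["constructive", "constructive", "unconstructive", "unconstructive"]

# TF-IDF over unigrams and bigrams feeding a linear classifier.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(comments, labels)

print(clf.predict(["Nice project, curious how it scales."])[0])
```

Swapping the label set for something closer to the HN guidelines (curious/uncurious, substantive/shallow) would only change the `labels` list.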

uyzstvqs
0 replies
10h59m

Speaking from experience, debate is easily misread as negative arguing by outsiders, even though all involved parties are enjoying challenging each other's ideas.

prox
0 replies
9h36m

Just one point of note: people are FAR more likely to respond in writing to something negative than to something positive. I don't know the exact numbers, but negativity just engages people more. People don't pick up the pen to write how good something is nearly as often.

walterbell
1 replies
1d4h

Great work! Would you consider adding support for search-via-URL, e.g. https://hn.wilsonl.in/?q=sentiment+analysis? It would enable sharing and bookmarking of stable queries.

luke-stanley
0 replies
1d3h

I did something related for my ChillTranslator project, which translates spicy HN comments into calm variations. It has a GGUF model that runs easily and quickly, but it's early days. I built it with a much smaller dataset: I used LLMs to generate calm variations and an algorithm to pick the closest, least spicy one to make the synthetic training data, then fine-tuned Phi 2. I used Detoxify, and since OpenAI's sentiment analysis is free, I used that to verify Detoxify had correctly identified spicy comments before generating a calm pair.

I do worry that HN could implode or degrade if a good balance isn't kept in the comments and posts that people come here for. Maybe I can use your sentiment data to mine faster and generate more pairs. I've only done an initial end-to-end test so far (which works!). The model isn't yet as high quality as I'd like, but I haven't tried Phi 3 and have only used a very small fine-tuning dataset so far. The file is here though: https://huggingface.co/lukestanley/ChillTranslator I've had no feedback from anyone on it, though I did have a 404 in my Show HN post!

al_hag
0 replies
16h52m

It will be a deep dive into the most essential of HN staples, the nitpick

abakker
0 replies
1d

It's so interesting that in Likert-scale surveys I tend to see a huge positivity/agreement bias, while comments tend to be critical/negative. I think something about the format of the feedback skews the graph in general.

On HN, my theory is that positivity is the upvotes, and negativity/criticality is the discussion.

Personally, my contribution to your effort is that I would love to see a tool that could do this analysis for me over a dataset/corpus of my choosing. The code is nice, but it is a bit beyond me to follow in your footsteps.

gieksosz
6 replies
18h6m

HN is a pretty toxic place indeed.

taco-hands
1 replies
17h54m

Perhaps... it can be toxic if you dip into the comments sometimes... Otherwise the content and links are the stuff of gold!

gieksosz
0 replies
8h55m

Links are indeed the best. It is hard not to click through to the comments, however, which is a roll of the dice.

swatcoder
1 replies
17h21m

How did you get from negative sentiment to toxicity? Are those the same to you?

It may be a cultural thing, but I think many people see negative sentiment as a constructive tool and a demonstration of trust and respect among people who recognize each other as robust and capable peers.

Avoiding it is something you do with people who you believe need special delicacy: whether because they've told you so, because they intimidate you, or because you sense something pitiable and fragile about them.

If you can trust that it's given in good faith, and by the guidelines of HN you are asked to, negative sentiment should be seen as an expression that someone thinks you're a fully capable adult and peer. Personally, I deeply appreciate that it's generally so comfortably shared and received here and would never include "toxicity" in one of my critiques of HN.

It's a surprising thing to read someone say!

(Unless you're thinking of the nastiness that can surface on flamewar topics, but there are numerous means by which those get downranked and displaced, and they're otherwise sparse and easy to avoid.)

gieksosz
0 replies
8h56m

Negative sentiment is more general than toxicity in my understanding, but it does include it. The fact that the study found HN consistently negative does not surprise me; one of the ways HN is negative (the most disruptive one, and the one that makes me post here less often) is indeed toxic comments. But I am still here (in the comments, no less), so the benefit still outweighs the pain.

Swizec
1 replies
17h36m

> HN is a pretty toxic place indeed

This may be a personal style difference, but I find HN to be the least toxic of all the social media I've tried. LinkedIn would be my example of ultra toxicity – the aggressive positivity there is unbearable. At least on HN people tell you what they think, and even use a constructive, decently argued approach to doing so.

HN to me feels like a good technical discussion where people tear apart ideas instead of each other.

But yeah if you put a lot of ego into your ideas, HN must be an awful place to visit.

rossant
0 replies
12h46m

I agree, HN is much less toxic than about any other place on the internet.

kcorbitt
1 replies
1d

I actually did a blog post a few months ago where I analyzed HN commenter sentiment across AI, blockchain, remote work and Rust. The final graph at the very end of the post is the relevant one on this topic!

https://openpipe.ai/blog/hn-ai-crypto

abe94
0 replies
14h6m

Thanks, the sentiment in these graphs seems more positive in comparison. Did you run the sentiment analysis on the whole corpus? What did that look like?

walterbell
0 replies
1d4h

> sentiment across social media platforms and across time!

Also time zones and weekday/weekend.

necovek
0 replies
1d

It's really unfortunate that the HN API does not provide votes on comments: I wonder if and how the sentiment analysis would change if comments were weighted by upvotes/downvotes?

My unsupported take is that engineers are mostly critical, but will +1 positive feedback instead of repeating it, as they might for criticism :)
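As a toy illustration of the idea (all numbers hypothetical, since the HN API exposes no comment votes), weighting per-comment sentiment by votes can flip the aggregate:

```python
import numpy as np

# Hypothetical per-comment sentiment scores in [-1, 1] and net upvotes;
# the HN API does not actually expose comment votes.
sentiment = np.array([-0.8, -0.5, 0.9, 0.7, -0.2])
votes = np.array([1, 2, 30, 12, 3])

unweighted = sentiment.mean()                    # 0.02: roughly neutral
weighted = np.average(sentiment, weights=votes)  # 0.6875: clearly positive
print(unweighted, weighted)
```

Here the two positive comments carry most of the votes, so the vote-weighted average is strongly positive even though the plain average is near zero, which is exactly the "+1 instead of repeating" hypothesis.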

moneywoes
0 replies
15h38m

Crypto, I imagine, is in that bucket.

oersted
12 replies
1d6h

This is a surprisingly big endeavour for what looks like an exploratory hobby project. Not to minimize the achievement, very cool, I'm just surprised by how much was invested into it.

They used 150 GPUs and developed two custom systems (db-rpc and queued) for inter-server communication, and this was just to compute the embeddings; there's a lot of other work and computation surrounding it.

I'm curious about the context of the project, and how someone gets this kind of funding and time for such research.

PS: Having done a lot of similar work professionally (mapping academic paper and patent landscapes), I'm not sure if 150 GPUs were really needed. If you end up just projecting to 2D and clustering, I think that traditional methods like bag-of-words and/or topic modelling would be much easier and cheaper, and the difference in quality would be unnoticeable. You can also use author and comment-thread graphs for similar results.
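For comparison, the cheaper classical route described above might look something like this in scikit-learn, with a toy corpus standing in for HN titles (the topic and cluster counts are arbitrary for the example):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.cluster import KMeans

titles = [
    "Rust 1.0 released", "Why we rewrote our service in Rust",
    "Show HN: A new JavaScript framework", "React vs Vue in 2024",
    "Postgres performance tuning tips", "Scaling our Postgres cluster",
]

# Bag-of-words features: no GPUs, no embedding model.
tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(titles)

# Topic model: each document becomes a mixture over a few topics.
topics = NMF(n_components=3, init="nndsvda", random_state=0).fit_transform(X)

# Cluster documents in the low-dimensional topic space.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(topics)
print(clusters)
```

On a corpus of short titles this runs in seconds on a laptop, which is the point of the comment: for 2D maps and cluster labels, the quality gap versus neural embeddings is often hard to notice.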

wilsonzlin
4 replies
1d5h

Hey, thanks for the kind words. I wasn't able to mention the costs in the post (might follow up in the future), but they were in the hundreds of dollars, so it was reasonably accessible as a hobby project. The GPUs were surprisingly cheap, and the cluster was only scaled up because I was impatient :) --- it only ran for a few hours.

Do you have any links to your work? They sound interesting and I'd like to read more about them.

oersted
3 replies
1d5h

"Hundreds of dollars" sounds a bit painful as an EU engineer and entrepreneur :), but I guess it's all relative. We would think twice about investing this much manpower and compute for such an exploratory project even in a commercial setting if it was not directly funded by a client.

But your technical skill is obvious and very impressive.

If you want to read more, my old bachelor's thesis is somewhat related, from when we only had word embeddings and document embeddings were quite experimental still: https://ad-publications.cs.uni-freiburg.de/theses/Bachelor_J...

I've done a lot of follow-up work in my startup Scitodate, which includes large-scale graph and embedding analysis, but we haven't published most of it for now.

wilsonzlin
0 replies
1d5h

Thanks for sharing, I'll have a read, looks very relevant and interesting!

gardenhedge
0 replies
9h52m

A golf membership can cost thousands of euros... any hobby costs money.

b800h
0 replies
21h46m

As an EU-based engineer, you wouldn't do this; it's a massive GDPR violation (failure to notify data subjects of data processing), which does actually have extraterritorial reach, although I somehow doubt the information commissioners are going to come after OP.

PaulHoule
4 replies
1d5h

(1) Definitely you could use a cheaper embedding and still get pretty good results

(2) I apply classical ML (say, a probability-calibrated SVM) to embeddings like that and get good results for classification and clustering at over 100x the speed of fine-tuning an LLM.
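A sketch of that recipe, with random Gaussian blobs standing in for real sentence embeddings; `CalibratedClassifierCV` adds the probability outputs that a plain SVM lacks:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

rng = np.random.default_rng(0)

# Stand-in for sentence embeddings: two Gaussian blobs in 384 dims.
X = np.vstack([rng.normal(0, 1, (100, 384)), rng.normal(0.5, 1, (100, 384))])
y = np.array([0] * 100 + [1] * 100)

# SVMs don't natively output probabilities; calibration fits a mapping
# from margin scores to probabilities via cross-validation.
clf = CalibratedClassifierCV(LinearSVC(), cv=3)
clf.fit(X, y)

proba = clf.predict_proba(X[:5])
print(proba)
```

Training this takes a fraction of a second, versus minutes to hours for fine-tuning a transformer, which is where the "over 100x" figure comes from.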

Karrot_Kream
3 replies
20h20m

I didn't think the OP used LLMs? They did use a BERT based sentiment classifier but that's not an LLM.

My HN recommender works fine just using decision trees and XGBoost FWIW. I'll bet SVM would work great too.

PaulHoule
2 replies
6h51m

Some of the SBERT models now are based on T5 and newer architectures, so there's not a hard line anymore. The FlagEmbedding model that the author uses

https://huggingface.co/BAAI/bge-base-en-v1.5

is described as an "LLM" by the people who created it. It can be used in the SBERT framework.

I tried quite a few models for my RSS feed recommender (applied after taking the embedding) and SVM came out ahead of everything else. Maybe with parameter tuning XGBoost would do better but it was not a winner for me.

If you look at the literature

https://arxiv.org/abs/2405.00704

you find that the fashionable LLMs are not world-beating at many tasks, and you can actually do very well at sentiment analysis by applying an LSTM to unpooled BERT output.

Karrot_Kream
1 replies
1h48m

> Some of the SBERT models now are based on T5 and newer architectures, so there's not a hard line anymore. The FlagEmbedding model that the author uses

Oh thanks! Right, I had heard about T5-based embeddings but didn't realize they were basically LLMs.

> I tried quite a few models for my RSS feed recommender (applied after taking the embedding) and SVM came out ahead of everything else. Maybe with parameter tuning XGBoost would do better but it was not a winner for me.

XGBoost worked the best for me but maybe I should retry with other techniques.

> you find that the fashionable LLMs are not world-beating at many tasks and actually you can do very well at sentiment analysis applying the LSTM to unpooled BERT output.

Definitely. Use the right tool for the right job. LLMs are probably massive overkill here. My non-LLM based embeddings work just fine for my own recommender so shrug.

PaulHoule
0 replies
1h35m

Are you applying an embedding to titles on HN, comment full-text or something else?

When it comes to titles I have a model that gets an AUC around 0.62 predicting if an article will get >10 votes and a much better one (AUC 0.72 or so) that predicts if an article that got > 10 votes will get a comment/vote ratio > 0.5, which is roughly the median. Both of these are bag-of-words and didn't improve when using an embedding. If I go back to that problem I'm expecting to try some kind of stacking (e.g. there are enough New York Times articles submitted to HN that I can train a model just for NYT articles.)
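A bag-of-words title model of that flavor might be sketched like this; the titles and labels below are made up, and the AUCs quoted above come from the commenter's real data, not this toy:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical (title, got_over_10_votes) pairs; real data would come
# from an HN dump.
titles = [
    "Show HN: My weekend project", "Ask HN: Best laptop for dev work",
    "Why Rust is eating systems programming", "Postgres internals explained",
    "My first blog post", "A deep dive into TCP congestion control",
    "New JS framework announcement", "How we cut our AWS bill by 80%",
]
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])

# Bag-of-words over unigrams and bigrams, then a linear classifier.
X = CountVectorizer(ngram_range=(1, 2)).fit_transform(titles)
model = LogisticRegression().fit(X, y)

# In-sample AUC; a real evaluation needs a held-out split.
auc = roc_auc_score(y, model.predict_proba(X)[:, 1])
print(auc)
```

The stacking idea mentioned above would add a second, site-specific model (e.g. NYT-only) whose prediction feeds into the general one.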

Also, I have heard the sentiment that "BERT is not an LLM" a lot from commenters on HN, but every expert source I've seen seems to treat BERT as an LLM. It is in this category on Wikipedia, for instance

https://en.wikipedia.org/wiki/Category:Large_language_models

and

https://www.google.com/search?client=firefox-b-1-e&q=is+bert...

gives an affirmative answer in 8 cases out of 10, one of which denies it is a language model at all on a technicality that has since been overturned.

alchemist1e9
1 replies
1d5h

The author is definitely very skilled. I find it interesting that they submit posts on HN but haven't commented since 2018! And then embarked on this project.

As far as funding/time goes, one possibility is that they are between endeavors/employment and it's self-funded, as they have had a financially successful career or business. They were very efficient with the GPU utilization, so it probably didn't cost that much.

wilsonzlin
0 replies
1d5h

Thanks! Haha yeah I'm trying to get into the habit of writing about and sharing the random projects I do more often. And yeah the cost was surprisingly low (in the hundreds of dollars), so it was pretty accessible as a hobby project.

thyrox
10 replies
1d6h

Very nice. Since HN data spawns so many fun projects, there should be a monthly or weekly updated zip file or torrent with this data, which hackers could just download instead of writing a scraper and starting from scratch every time.

zX41ZdbW
3 replies
1d3h

It is very easy to get this dataset directly from the HN API. Let me just post it here:

Table definition:

    CREATE TABLE hackernews_history
    (
        update_time DateTime DEFAULT now(),
        id UInt32,
        deleted UInt8,
        type Enum('story' = 1, 'comment' = 2, 'poll' = 3, 'pollopt' = 4, 'job' = 5),
        by LowCardinality(String),
        time DateTime,
        text String,
        dead UInt8,
        parent UInt32,
        poll UInt32,
        kids Array(UInt32),
        url String,
        score Int32,
        title String,
        parts Array(UInt32),
        descendants Int32
    )
    ENGINE = ReplacingMergeTree(update_time) ORDER BY id;
    
A shell script:

    BATCH_SIZE=1000

    TWEAKS="--optimize_trivial_insert_select 0 --http_skip_not_found_url_for_globs 1 --http_make_head_request 0 --engine_url_skip_empty_files 1 --http_max_tries 10 --max_download_threads 1 --max_threads $BATCH_SIZE"

    rm -f maxitem.json
    wget --no-verbose https://hacker-news.firebaseio.com/v0/maxitem.json

    clickhouse-local --query "
        SELECT arrayStringConcat(groupArray(number), ',') FROM numbers(1, $(cat maxitem.json))
        GROUP BY number DIV ${BATCH_SIZE} ORDER BY any(number) DESC" |
    while read ITEMS
    do
        echo $ITEMS
        clickhouse-client $TWEAKS --query "
            INSERT INTO hackernews_history SELECT * FROM url('https://hacker-news.firebaseio.com/v0/item/{$ITEMS}.json')"
    done
It takes a few hours to download the data and fill the table.

strooper
1 replies
6h19m

While trying the script, I am getting the following error -

<Trace> ReadWriteBufferFromHTTP: Failed to make request to 'https://hacker-news.firebaseio.com/v0/item/40298680.json'. Error: Timeout: connect timed out: 216.239.32.107:443. Failed at try 3/10. Will retry with current backoff wait is 200/10000 ms.

I googled with no luck. I was wondering if you have a solution for it.

zX41ZdbW
0 replies
1h15m

It makes many requests in parallel, and that's why some of them could be retried. It logs every retry, e.g., "Failed at try 3/10". It will throw an error only if it fails all ten tries. The number of retries is defined in the script.

Example of how it should work:

    $ ch -q "SELECT * FROM url('https://hacker-news.firebaseio.com/v0/item/40298680.json')" --format Vertical
    Row 1:
    ──────
    by:     octopoc
    id:     40298680
    parent: 40297716
    text:   Oops, thanks. I guess Marx was being referenced? I had thought Marx was English but apparently he was German-Jewish[1]<p>[1] <a href="https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Karl_Marx" rel="nofollow">https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;Karl_Marx</a>
    time:   1715179584
    type:   comment

noman-land
1 replies
1d5h

I very much support this idea. Put them on ipfs and/or torrents. Put them on HuggingFace.

pfarrell
0 replies
1d4h

I’ve had this same thought but was unsure what the licensing for the data would be.

pfarrell
0 replies
1d5h

I have a daily updated dataset with the HN data split out by month. I've published it on my web page, but it's served from my home server so I don't want to link to it directly. Each month is about 30 MB of compressed CSV. I've wanted to torrent it, but I don't know how to get enough seeders, since each month would produce a new torrent file (unless I'm mistaken). If you're interested, send me a message. My email is mrpatfarrell. Use gmail for the domain.

average_r_user
0 replies
1d6h

That's a nice idea.

minimaxir
7 replies
1d3h

A modern recommendation for UMAP is Parametric UMAP (https://umap-learn.readthedocs.io/en/latest/parametric_umap....), which instead trains a small Keras MLP to perform the dimensionality reduction down to 2D by minimizing the UMAP loss. The advantage is that this model is small and can be saved and reused to predict on unknown new data (a traditionally trained UMAP model is large), and training is theoretically much faster because GPUs are GPUs.

The downside is that the implementation in the Python UMAP package isn't great and creates/pushes the whole expanded node/edge dataset to the GPU, which means you can only train it on about 100k embeddings before going OOM.

The UMAP -> HDBSCAN -> AI cluster labeling pipeline that's all unsupervised is so useful that I'm tempted to figure out a more scalable implementation of Parametric UMAP.
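The shape of that unsupervised pipeline can be sketched with scikit-learn alone, with PCA and DBSCAN standing in for UMAP and HDBSCAN so the example stays dependency-light (the LLM cluster-labeling step is only indicated in a comment):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Stand-in for comment embeddings: three well-separated blobs in 384 dims.
X = np.vstack([rng.normal(c, 0.3, (50, 384)) for c in (0.0, 2.0, 4.0)])

# Step 1: reduce to 2D (PCA here; the pipeline above uses UMAP).
coords = PCA(n_components=2, random_state=0).fit_transform(X)

# Step 2: density-based clustering (DBSCAN here; HDBSCAN additionally
# handles varying densities without an eps parameter); -1 marks noise.
labels = DBSCAN(eps=3.0, min_samples=5).fit_predict(coords)

# Step 3 (not shown): sample a few texts per cluster and ask a language
# model to generate a human-readable name for each one.
print(sorted(set(labels)))
```

The appeal of the real pipeline is exactly this shape: every step is unsupervised, so it scales to 40M comments without any labeling effort.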

Der_Einzige
3 replies
1d3h

It exists in cuML with a fast GPU implementation. Not sure why cuML is so poorly known though…

minimaxir
2 replies
1d2h

I'll give that a look: the feature set of GPU-accelerated ops seems right up my alley for this pipeline: https://github.com/rapidsai/cuml

EDIT: looking through the docs, it's just GPU-accelerated UMAP, not a parametric UMAP that trains a NN model. That's easy to work around, though, by training a new NN model to predict the reduced-dimensionality values, minimizing RMSE.
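That workaround amounts to distilling the fitted reducer into a small regression network. A sketch, with PCA standing in for the cuML UMAP fit (any reducer producing 2D coordinates would slot in the same way):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))  # stand-in for precomputed embeddings

# Fit the non-parametric reducer once (PCA here; cuML UMAP in practice).
coords = PCA(n_components=2, random_state=0).fit_transform(X)

# Distill it into a small MLP by regressing embeddings -> 2D coordinates
# (MSE loss, i.e. minimizing RMSE), so new points can be projected
# without re-running the reducer.
mlp = MLPRegressor(hidden_layer_sizes=(64,), max_iter=2000, random_state=0)
mlp.fit(X, coords)

new_points = rng.normal(size=(3, 64))
print(mlp.predict(new_points).shape)
```

The distilled model is tiny and serializable, recovering the main advantage of Parametric UMAP without its memory-hungry training path.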

minimaxir
1 replies
23h29m

Tested it out, and the UMAP implementation in this library is very fast compared to Parametric UMAP: running it on 100k embeddings took about 7 seconds, where the same pipeline on the same GPU took about half an hour. I will definitely be playing around with it more.

lmeyerov
0 replies
21h28m

Yeah we advise Graphistry users to keep GPU umap training sets to < 100k rows, and instead focus on doing careful sampling within that, and multiple models for going beyond that. It'd be more accessible for teams if we could raise the limit, but quality wise, it's generally fine. Security logs, customer activity, genomes, etc.

RAPIDS UMAP is darn impressive, though. It did the job, so instead of focusing on improving it further, our bottleneck shifted to optimizing the ingest pipeline that feeds UMAP, and we released cu_cat as a GPU-accelerated automated feature engineering library to get all that data into UMAP. RAPIDS cudf helps take care of the intermediate IO and wrangling in between.

Downstream, we generally stopped doing DBSCAN, despite it being so pretty. We replace it with cugraph/GFQL on the UMAP similarity graph, to avoid quality issues we see in practice, and then visually and interactively investigate the similarity graph in pygraphistry. Once you can see the k-nn similarity edges - and the lack thereof - you realize why scatter-plot clusterings (visual or algorithmic) are so misleading to analysts, and you treat them with more caution. There are a variety of UMAP contenders nowadays, but with this pipeline we haven't felt the need to go beyond it. That's a multi-year testament to Leland and team.

The result is we can now umap and interactively visualize most real world large datasets, database query results, and LLM embeddings that pygraphistry & louie.ai users encounter in seconds. Many years to get here, and now it is so easy!

bravura
2 replies
1d3h

From a quick glance, it appears that it's because the implementation pushes the entire graph (all edges) to the GPU. Sampling of edges during training could alleviate this.

minimaxir
1 replies
1d3h

Indeed, TensorFlow likes pushing everything to the GPU by default whereas many PyTorch DL implementations encourage feeding data from the CPU to the GPU as needed with a DataLoader.

There have been attempts at a PyTorch port of Parametric UMAP (https://github.com/lmcinnes/umap/issues/580) but nothing as good.

rantymcrant
6 replies
16h38m

I'd like to see an analysis of the rise of self promotion on HN.

I define self promotion on HN as a "Show HN: I ..." post vs a "Show HN: Something ..." post.

Examples from the top 100 right now

* "Show HN: Exploring HN by mapping and analyzing 40M posts and comments for fun"

* "Show HN: Browser-based knitting (pattern) software"

These are not self promotional titles. The subjects are the exploration and the software respectively.

* "Show HN: I built a non-linear UI for ChatGPT"

* "Show HN: I created 3,800+ Open Source React Icons"

These are self promotional titles. The subject of each is "I"

My own simple check, just via Algolia search results for titles that start with "Show HN: I", gave these results for years starting April 1st, graphed as a share of the total number of results for that year:

    2023 ****************************************
    2022 ***********************************
    2021 ***************************
    2020 **************************************
    2019 *************************
    2018 *************
    2017 *******
    2016 **********
    2015 ********
    2014 ************
    2013 *********************
    2012 *****************
    2011 *********
    2010 ***
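A sketch of how the check could be reproduced against the HN Algolia search API (the `show_hn` tag and `numericFilters` parameter are part of that API; note Algolia matching is fuzzy, so the counts only approximate titles literally starting with "Show HN: I"):

```python
from urllib.parse import urlencode

ALGOLIA = "https://hn.algolia.com/api/v1/search_by_date"

def show_hn_count_url(query: str, start_ts: int, end_ts: int) -> str:
    """Build an Algolia query URL counting Show HN posts in a time window."""
    params = {
        "query": query,
        "tags": "show_hn",
        "numericFilters": f"created_at_i>={start_ts},created_at_i<{end_ts}",
        "hitsPerPage": 0,  # we only need nbHits from the JSON response
    }
    return f"{ALGOLIA}?{urlencode(params)}"

def bar(n_self: int, n_total: int, scale: int = 40) -> str:
    """Render a per-year share as an ASCII bar, like the graph above."""
    return "*" * round(scale * n_self / n_total) if n_total else ""

# Example: one query URL per year window, then bar(count_i, count_total).
print(show_hn_count_url("Show HN: I", 1680307200, 1711929600))
print(bar(120, 3000))
```

Fetching each URL and reading `nbHits` from the JSON (one request for the "Show HN: I" count, one for the all-Show-HN total per year) reproduces the bars.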
I feel like maybe I grew up in a time when self promotion was generally considered a bad character trait: your actions are supposed to be what promotes you, and calling attention to them yourself is not. But I feel that culture is changing.

I wonder if the rise in self promotion (assuming there is a rise) has to do with social media etc...

I perceive a similar rise on YouTube, but I have no data, just a feeling from the number of YouTube recommendations for videos titled "I ...".

Thorrez
3 replies
16h7m

Your definition of self promotion is a bit different from what I usually think. I usually consider self promotion to be someone promoting something that that same person did. Both of your non-self-promotion examples would be self promotion under my definition.

So what you consider to be self promotion vs non-self-promotion, I consider to be self promotion with a title that very clearly indicates that vs self promotion with a title that less clearly indicates that. However, the "Show HN" phrase is only used for self promotion I think, so even without the "I", anyone familiar with the convention will know it's self promotion.

rantymcrant
2 replies
15h25m

However, the "Show HN" phrase is only used for self promotion I think, so even without the "I", anyone familiar with the convention will know it's self promotion.

I think that's an extremely cynical view, though a common one. I've never thought of "Show HN" as self promotion if it doesn't include "I", unless I click through to the actual product/library/post and find it full of self promotion. I agree with you that a post that doesn't include "I" can be self promotion, but I don't think it always is, even if the person made/worked on it.

"Show HN: XYZ and LLM library in rust" to me is informational. It's point is, more often than not, to inform people of something they might get use out of. I know that's true when I've posted something like that. It's meaning is "here's a useful resource that was just created". Sure I get pleasure from knowing I helped people with something but I'm not trying to promote myself, I'm trying to promote the library/post/info.

"Show HN: I made an LLM Library in rust" to me is self promotional. It might be useful to others but it's intent was clearly self promotion given the subject is "I", not the library/post/product.

satvikpendem
0 replies
14h54m

Show HN is defined in the rules (as the sibling comment quotes) as something someone made to be shared, i.e. self promotion, regardless of whether they used "I" in the title. Your definition seems more arbitrary than what Hacker News itself intends.

wodenokoto
0 replies
14h26m

All Show HN posts have to be created by the author, so I'm not sure what is self promoting about making the implicit explicit.

They are all “look, I made something cool, what do you think?”

amitlevy49
0 replies
15h46m

This is talked about a lot in Walter Isaacson's Einstein biography, so people have been observing this trend for a long time (e.g. the Germans accusing Einstein of self promotion, in contrast with the US's celebrity culture); maybe it's cyclical.

graiz
4 replies
1d6h

Would be cool to see member similarity. Finding like-minded commenters/posters could help discover content of interest.

noman-land
1 replies
1d5h

Accidental dating app.

internetter
0 replies
1d5h

> Accidental dating app.

Possibly the greatest indicator of social startup success.

vsnf
0 replies
1d6h

Reminds me of a similar project a few months ago whose purpose was to unmask alt accounts. It wasn’t well received as I recall.

NeroVanbierv
4 replies
1d6h

Really love the island map! But the automatic zooming on the map doesn't seem very relevant. E.g. try typing "openai" - I can't see anything related to that query in that part of the map

wilsonzlin
1 replies
1d5h

Thanks! Yeah sometimes there are one or two "far" away results which make the auto zoom seem strange. It's something I'd like to tune, perhaps zooming to where most but not all results are.

luke-stanley
0 replies
1d3h

Often embeddings are not so good for comparing similarity of text. A cross-encoder might be a good alternative, perhaps as a second pass, since you already have the embeddings. https://www.sbert.net/docs/pretrained_cross-encoders.html Pairwise, this can be quite slow, but as a second pass it might be much higher quality. Obviously this gets into LLM territory, but the language models for this can be small and more reliable than cosine similarity on embeddings.
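The two-stage setup suggested here is easy to sketch. Keeping the scorer pluggable makes the rerank step testable without downloading a model; the CrossEncoder model name shown in the comment at the bottom is just an illustrative example, not something from the post:

```python
from typing import Callable, List, Tuple

def rerank(
    query: str,
    candidates: List[str],
    score_pairs: Callable[[List[Tuple[str, str]]], List[float]],
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Second pass: score each (query, candidate) pair and keep the best top_k.
    `score_pairs` is where a cross-encoder would plug in."""
    pairs = [(query, c) for c in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda cs: cs[1], reverse=True)
    return ranked[:top_k]

# With sentence-transformers installed, the scorer could look like:
#   from sentence_transformers import CrossEncoder
#   model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
#   rerank(query, candidates, lambda ps: model.predict(ps).tolist())
```

The embedding index would supply the candidate list (say, the top 100 by cosine similarity), and the cross-encoder only ever scores that shortlist, which keeps the pairwise cost bounded.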

oersted
0 replies
1d5h

Indeed, I've long been intrigued by the idea of rendering such clustering maps more like geographic maps for better readability.

It would be cool to have analogous continents, countries, sub-regions, roads, different-sized settlements, and significant landmarks... This version looks great at the highest zoom level, but rapidly becomes hard to interpret as you zoom in, same as most similar large embedding or graph visualizations.

NeroVanbierv
0 replies
1d6h

Ok I just noticed there is a region "OpenAI" in the north-west, but for some reason it zooms in somewhere close to "Apple" (middle of the island) when I type the query

seanlinehan
3 replies
1d6h

It was not obvious at first glance to me, but the actual app is here: https://hn.wilsonl.in/

uncertainrhymes
0 replies
1d4h

I'm curious if the link to the landing page was intentionally near the end. Only the people who actually read it would go to the site.

(That's not a dig, I think it's a good idea.)

oschvr
0 replies
1d1h

I found myself and my post there! Nice

bravura
0 replies
1d3h

1) It doesn’t appear that search links are shareable or have the query terms in them.

2) Are you embedding the search phrases word by word? And using the same model as the documents used? Because I searched for "lead generation", which any decent non-unigram embedding should understand, but I got results for lead poisoning.

CuriouslyC
3 replies
1d6h

Good example of data engineering/MLops for people who aren't familiar.

I'd suggest using HDBScan to generate hierarchical clusters for the points, then use a model to generate names for interior clusters. That'll make it easy to explore topics out to the leaves, as you can just pop up refinements based on the connectivity to the current node using the summary names.

The groups need more distinct coloring, which I think having clusters could help with. The individual article text size should depend on how important or relevant the article is, either in general or based on the current search. If you had more interior cluster summaries that'd also help cut down on some of the text clutter, as you could replace multiple posts with a group summary until more zoomed in.
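The cluster-naming step suggested above can be sketched simply (this assumes you already have per-point HDBSCAN labels, where -1 marks noise): pick each cluster's most over-represented title words, which a language model could then polish into a proper name.

```python
from collections import Counter
from typing import Dict, List

def name_clusters(labels: List[int], titles: List[str],
                  top_terms: int = 3) -> Dict[int, str]:
    """Give each cluster a crude name from its most over-represented title words."""
    overall = Counter(w for t in titles for w in t.lower().split())
    per_cluster: Dict[int, Counter] = {}
    for label, title in zip(labels, titles):
        if label == -1:  # HDBSCAN marks noise points as -1
            continue
        per_cluster.setdefault(label, Counter()).update(title.lower().split())
    names = {}
    for label, counts in per_cluster.items():
        # Score words by in-cluster frequency relative to global frequency,
        # so common filler words don't dominate the name.
        scored = sorted(counts, key=lambda w: counts[w] / overall[w], reverse=True)
        names[label] = " / ".join(scored[:top_terms])
    return names
```

The same idea extends up the hierarchy: name a parent cluster from the concatenated titles of its children, so every interior node in the HDBSCAN tree gets a label for drill-down navigation.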

jszymborski
0 replies
15h45m

Ooo thanks for this

wilsonzlin
0 replies
1d4h

Thanks for the great pointers! I didn't get the time to look into hierarchical clustering unfortunately, but it's on my TODO list. Your comment about making the map clearer is great, and I think there are a lot of low-hanging approaches for improving it. Another thing for the TODO list :)

gaauch
2 replies
1d4h

A long term side project of mine is to try to build a recommendation algorithm trained on HN data.

I trained a model to predict if a given post will reach the front page, get flagged, etc. I collected over 1,000 RSS feeds and rank the RSS entries with my ranking models.

I submit the high ranking entries on HN to test out my models and I can reach the front page consistently sometimes having multiple entries on the front page at a given time.

I also experiment with user->content recommendation, for that I use comment data for modeling interactions between users and entries, which seems to work fine.

The only problem I have is that I get a lot of 'out of distribution' content in my RSS feeds, which causes my ranking models to get 'confused'. For this, I trained models to predict whether a given entry belongs on HN or not. On top of that, I have some tagging models trained on data I scraped from lobste.rs and hand-annotated.

I've been working on this on and off for the last 2 years or so. This account is not my main, just one I created for testing.

AMA

saganus
1 replies
1d4h

Did you find whether submitted entries are more likely to reach the front page based on the title or the content?

i.e. do HN users upvote more based on the title of the article or on actually reading them?

gaauch
0 replies
1d4h

I tried making an LLM generate different titles for a given article and compared their ranking scores. There seems to be a lot of variation in the ranking scores based on the way the title is worded. Titles that are more likely to generate 'outrage' seem to get ranked higher, but at the same time that increases the is_hn_flagged score, which tries to predict whether an entry will get flagged.

ashu1461
2 replies
1d5h

This is pretty great.

Feature request: is it possible to show in the graph how popular the topic/subtopic/article is?

So that we can do an educated exploration of the graph around what was upvoted and what was not?

wilsonzlin
1 replies
1d4h

Thanks! Do you mean within the sentiment/popularity analysis graph? Or the points and topics within the map?

ashu1461
0 replies
18h50m

Points and topics within the map.

xnx
1 replies
1d5h

As a novice, is there a benefit to using a custom Node.js script as the downloader? When I did my download of the 40 million Hacker News API items I used "curl --parallel".

What I would like to figure out is the easiest way to go from the API straight into a parquet file.

wilsonzlin
0 replies
1d5h

I think your curl approach would work just as well if not better. My instinct was to reach for Node.js out of familiarity, but curl is fast and, given the IDs are sequential, something like `seq 0 $max_id | parallel curl 'https://hacker-news.firebaseio.com/v0/item/{}.json'` would be pretty simple and fast. I did end up needing more logic though, so Node.js did ultimately come in handy.

As for the Parquet file, I'm not sure unfortunately. I imagine there are some difficulties because the format is columnar, so it probably wants a batch of rows (when writing) instead of one item at a time.

c17r
0 replies
18h54m

An HN "item" is not just posts but everything: posts, comments, the parts of a poll, etc.

Still an impressive number

sourcepluck
1 replies
22h30m

Where is Lisp?! I thought it was a verifiable (urban) legend around these parts that this forum is obsessed with Lisp..?

pinkmuffinere
0 replies
22h16m

Maybe lisp is so niche that even a rather small interest makes HN relatively lispy?

paddycap
1 replies
1d6h

Adding a subscribe feature to get an email with the most recent posts in a topic/community would be really cool. One of my favorite parts of HN is the weekly digest I get in my inbox; it would be awesome if that were tailored to me.

What you've built is really impressive. I'm excited to see where this goes!

wilsonzlin
0 replies
1d5h

Thanks! Yeah if there's enough interested users I'd love to turn this into a live service. Would an email subscription to a set of communities you pick be something you'd be interested in?

jxy
1 replies
1d1h

We can see that in this case, where perhaps the X axis represents "more cat" and the Y axis "more dog", using the euclidean distance (i.e. physical line length), a pitbull is somehow more similar to a Siamese cat than to "dog", whereas intuitively we'd expect the opposite. The fact that a pitbull is "very dog" somehow makes it closer to "very cat". Instead, if we take the angle between the lines (i.e. cosine distance, or 1 minus the cosine of the angle), the world makes sense again.

Typically the vectors are normalized, unlike what's shown in this demonstration.

When using normalized vectors, the euclidean distance measures the distance between the endpoints of the two vectors, while the cosine similarity measures the length of one vector projected onto the other.
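A quick numeric check of this point (a sketch with numpy): for unit vectors, squared euclidean distance is exactly 2·(1 − cosine similarity), so the two metrics always produce the same nearest-neighbor ordering.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=8), rng.normal(size=8)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)  # normalize to unit length

cos_sim = float(a @ b)
euclid_sq = float(np.sum((a - b) ** 2))

# For unit vectors: ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b = 2 * (1 - cos_sim),
# so ranking neighbors by euclidean distance matches ranking by cosine distance.
assert abs(euclid_sq - 2 * (1 - cos_sim)) < 1e-9
```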

GeneralMayhem
0 replies
23h11m

The issue with normalization is that you lose a degree of freedom, which, when you're visualizing, effectively means losing a dimension. Normalized 2d vectors are really just 1d vectors; if you want to show a 2d relationship, now you have to use 3d vectors (so that you have 2 degrees of freedom again).

freediver
1 replies
1d6h

If you have a blog, add an RSS feed :)

breck
0 replies
1d4h

I tried to fetch his RSS too! :)

Turns out, there's only 1 post so far on his blog.

Hoping for more! This one is great.

ed_db
1 replies
1d6h

This is amazing, the amount of skill and knowledge involved is very impressive.

wilsonzlin
0 replies
1d4h

Thank you for the kind words!

coolspot
1 replies
1d

Absolutely wonderful project and even more so the writeup!

Feedback: on my iOS phone, once you select a dot on the map, there is no way to unselect it. Preview card of some articles takes full screen, so I can’t even click to another dot. Maybe add a “cross” icon for the preview card or make that when you tap outside of a card, it hides whole card strip?

wilsonzlin
0 replies
9h41m

Thank you! And thanks for raising that issue. I've pushed a fix that should hopefully mitigate this for you: it's possible to unselect, card images are hidden on mobile, and the invisible results area around a card (caused by the tallest card stretching the results area) should no longer intercept map touches. Let me know if it helps!

chossenger
1 replies
1d5h

Awesome visualisation, and great write-up. On mobile (in portrait), a lot of longer titles get culled as their origin scrolls off, with half of the text still on the other side of the screen - I wonder if it'd be worth continuing to render them until the entire text field is off screen (especially since you've already got a bounding box for them).

While using it, I stumbled upon [1], which reflects your comments on comment sentiment.

This also reminded me of [2] (for which the site itself had rotted away, incidentally) - analysing HN users' similarity by writing style.

[1] https://minimaxir.com/2014/10/hn-comments-about-comments/ [2] https://news.ycombinator.com/item?id=33755016

wilsonzlin
0 replies
1d4h

Thanks for the kind words, and for raising that problem; I've added it as an issue to fix.

Thanks for sharing that article, it was an interesting read. It was cool how deep the analysis went with a few simple statistical methods.

ComputerGuru
1 replies
17h15m

Amazing work, I'm impressed by the scope of your project!

I must say though, is it jina or bge-3/flag? The embeddings (and tokenizer?) do not do a good job on tech topics. It's fine for natural words, but searching for tech concepts like "xaml" or "simd" causes it to fall back to tokenizing the inputs and trying to grab similar-sounding words.

Also, just some constructive feedback: it would be nice if there were some way to stop it from showing the same "HN leaderboard" of results when a topic is too niche to have real matches. I get a lot of "Stephen Hawking has died" when searching for words the embeddings aren't familiar with.

Edit: I'm not so sure how well the sentiment analysis is working. I had the feeling that there was too much negative sentiment that didn't match up with reality, so I tried looking up things HN would feel overwhelmingly positive about, like "Mr Rogers" - I mean, who could feel negatively about him? The results show some serious negative spikes. Look up "Carter" and there's a massive negative peak associated with the passing of Rosalynn Carter - it was an HN submission talking about all the wonderful things the Carters did.

Also, I think the "popularity over time" needs to be scaled by the median number of votes a story got that month/year, because the trend lines just go up and up if you plot strictly the number of posts. Look at the popularity of "diesel" and you'll see what I mean - this is a term that peaked ten years ago! Or perhaps it should be some sort of keyword incidence rate, or the number of items with a cosine distance of less than x from the query, rather than post score?
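The scaling idea above can be sketched in a few lines (the data shapes here are hypothetical; a real version would pull months and scores from the dataset):

```python
from collections import defaultdict
from statistics import median

def monthly_scaled(matching, all_posts):
    """matching / all_posts: iterables of (month, score) pairs. Scale each
    month's total matching score by the site-wide median score for that
    month, so overall HN growth doesn't dominate the trend line."""
    by_month = defaultdict(list)
    for month, score in all_posts:
        by_month[month].append(score)
    baseline = {m: median(scores) for m, scores in by_month.items()}
    totals = defaultdict(float)
    for month, score in matching:
        totals[month] += score / baseline[month]
    return dict(totals)
```

With this normalization, a topic that merely tracks HN's overall growth plots flat, and only relative surges in interest show up as peaks.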

Edit2: The dynamic "click a post to remove and recalculate similarity threshold" is awesome.

tarasglek
0 replies
41m

How does one tell programmatically that any given embedding model doesn't recognize a term or word?
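One common heuristic: ask the model's tokenizer how many subword pieces a term fragments into; heavily fragmented terms usually lack a dedicated learned representation and will match on spelling rather than meaning. A sketch with a pluggable tokenizer (the Hugging Face usage in the comment is an illustrative assumption, not from the post):

```python
from typing import Callable, List

def looks_unknown(term: str, tokenize: Callable[[str], List[str]],
                  max_pieces: int = 2) -> bool:
    """Heuristic: a term that splits into many subword pieces probably has
    no dedicated embedding in the model's vocabulary."""
    return len(tokenize(term)) > max_pieces

# With Hugging Face transformers installed, this could be driven by e.g.:
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("bert-base-uncased")
#   looks_unknown("xaml", tok.tokenize)
```

Other complementary signals include unusually low embedding norms for the term, or nearest neighbors that are orthographically similar but semantically unrelated.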

tomthe
0 replies
14h30m

I made something very similar a few weeks ago. I also included usernames with the average of their comments: https://tomthe.github.io/hackmap/

replete
0 replies
1d4h

I think this is easily the coolest post I've seen on HN this year

redbell
0 replies
11h26m

Truly amazing work! Not only because of the final results, but also because of the whole process it took the author to bring this to life. If I could upvote this by giving points from my karma, I wouldn't hesitate to easily give a hundred points. Without a doubt, I would classify this on par with "40k HN comments mentioning books, extracted using deep learning" (https://news.ycombinator.com/item?id=28595967), which is the highest-voted "Show HN" project related to Hacker News so far with 1359 points.

I'm not in the ML/AI arena yet, so I couldn't fully understand the second half of the article except for having a general idea about Embeddings and their potential, but the first part is what interests me as a software engineer.

Following are some of the challenges the author came across; he was able to overcome each of them and published the full source code.

Downloading HN database

There's also a maxitem.json API, which gives the largest ID. As of this writing, the max item ID is over 40 million. Even with a very nice and low 10 ms mean response time, this would take over 4 days to crawl, so we need some parallelism.

I've exported the HN crawler [1] (in TypeScript) to its own project, if you ever need to fetch HN items.

Fetching and parsing linked URLs' HTML for metadata and text

For text posts and comments, the answer is simple. However, for the vast majority of link posts, this would mean crawling those pages being linked to. So I wrote up a quick Rust service [2] to fetch the URLs linked to and parse the HTML for metadata (title, picture, author, etc.) and text. This was CPU-intensive so an initial Node.js-based version was 10x slower and a Rust rewrite was worthwhile. Fortunately, other than that, it was mostly smooth and painless, likely because HN links are pretty good (responsive servers, non-pathological HTML, etc.).

Recovering missing/dead links

A lot of content even on Hacker News suffers from the well-known link rot: around 200K links resulted in a 404, DNS lookup failure, or connection timeout, which is a sizable "hole" in the dataset that would be nice to mend. Fortunately, the Internet Archive has an API that we can use to programmatically fetch archived copies of these pages. So, as a final push for a more "complete" dataset, I used the Wayback API to fetch the last few thousand articles, some dating back years, which was very annoying because IA has very, very low rate limits (around 5 per minute).

Finding a cost-effective cloud provider for GPUs

Fortunately, I discovered RunPod, a provider of machines with GPUs that you can deploy your containers onto, at a cost far cheaper than major cloud providers. They also have more cost-effective GPUs like RTX 4090, while still running in datacenters with fast Internet connections. This made scaling up a price-accessible option to mitigate the inference time required.

This is the type of content that makes HN stand out from the crowd.

_____________________________

1. https://github.com/wilsonzlin/crawler-toolkit-hn/

2. https://github.com/wilsonzlin/hackerverse/tree/master/crawle...

racosa
0 replies
6h1m

Very cool project. Thanks for sharing it!

pudiklubi
0 replies
7h26m

This is wild. I've been creating my own dataset of trending articles and ironically this is how I came across your post. I'm doing a similar project for my uni thesis.

I set out with similar hypotheses and goals like you (on a slightly different scale though, haha) but I've been completely stuck on the interactive map part. Definitely getting a lot of pointers from how you handled this!

Maybe one key difference in approach is that I've put more emphasis on trying to extract key topics as keywords.

For ex:

article (title): "Useful Uses of cat"

keywords: ['Software design', 'Contraction', 'Code changes', 'Modularity', 'Ease of extension']

My hypothesis is that this will be a faster search solution than using the embeddings, but potentially not as accurate. I'm not far enough along yet to really prove this though.

Would love to hear what you think! Any other cool ideas on what could be done with the keywords? I explain my process a bit more here if interested: https://hackernews-demo.streamlit.app/#data-aggregation-meth...

oersted
0 replies
1d5h

Here's a great tool that does almost exactly the same thing for any dataset: https://github.com/enjalot/latent-scope

Obviously the scale of OP's project adds a lot of interesting complexity that this tool cannot handle, but it's great for medium-sized datasets.

nojvek
0 replies
1d2h

I'm impressed with the map component in canvas. It's very smooth and Google Maps-like, with dynamic zoom.

Gonna dig more into it.

Exemplary Show HN! We need more of this.

kriro
0 replies
17h8m

Very nice project and documented really well. I learned a lot reading the post. The examples of the improved HN search are pretty awesome.

Any idea why password reuse is so far away from security? That was the only oddity of the map for me.

gsuuon
0 replies
1d1h

This is super cool! Both the writeup and the app. It'd be great if the search results linked to the HN story so we can check out the comments.

gitgud
0 replies
21h42m

Very cool! I was hoping to be able to navigate to the HN post from the map though? Is that possible?

gardenhedge
0 replies
9h26m

AI is the most popular topic (by far) that I could find. Is there anything more popular?

fancy_pantser
0 replies
1d4h

HN submissions and comments are very different on weekends (and US holidays). Your data could explore and quantify this in some very interesting ways!

datguyfromAT
0 replies
1d2h

What a great read! Thanks for taking the time and effort to provide this insight into your process.

dangoodmanUT
0 replies
21h12m

excellent work

cyclecount
0 replies
1d

I can’t tell from the documentation on GitHub: does the API expose the flagged/dead posts? It would be interesting to see statistics on what’s been censored lately.

celltalk
0 replies
13h50m

It would be cool to see yearly changes in the UMAP, either by individual year or as the overall evolution in pseudotime on the embedding. Such a cool side project!

callalex
0 replies
1d4h

“Cloud Computing” “us-east-1 down”

This gave me a belly laugh.

aeonik
0 replies
20h9m

I couldn't help but notice that Hy is on the map but Clojure isn't.

Am I out of touch?

https://hylang.org

Venkatesh10
0 replies
5h8m

This is the type of content I'm here for.

Lerc
0 replies
1d2h

A suggestion for analysis:

Compare topics/sentiment etc. by number of users and by number of posts.

Are some topics dominated by a few prolific posters? Positively or negatively.

Also, how does one separate negative/positive sentiment from criticism/advocacy?

How hard is it to detect positive criticism, or enthusiastic endorsement of an acknowledged bad thing?

Igor_Wiwi
0 replies
23h52m

How much did you pay to generate those embeddings?