This is great. This sentence struck a chord with me in particular:
Imagine applying these techniques on the Common Crawl
You would be able to produce a ... map of the internet.
Making maps of things not usually on maps has been my passion for years, and I have made many of them. One of the more popular ones that some of you might know is the Music-Map.
I have had the urge to make a map of the web for quite a while, and I have already registered the web-map.com domain for it. I did some experiments, built a custom crawler and an algorithm which finds related websites fast. They showed that the project would be feasible.
But I am holding back on doing it, because I already run multiple experimental maps and have yet to come up with a business model for "making maps of everything".
So cool, thanks for sharing! I see you've also done it for movies, which is pretty cool and useful.
I could not find any technical details on the input data / feature extraction / clustering method used in these tools. Do you mind sharing what you have used so far?
The Music-Map and the Movie-Map are based on user preferences. The Music-Map is based on https://www.gnoosic.com and the Movie-Map on https://www.gnovies.com, two AI projects I started before the maps.
The AI and the mapping algorithm are my own developments. I was mostly inspired by thinkers like Douglas Hofstadter and John R. Koza.
It is really cool and useful. Interesting that you were able to gather enough data from users to make it work. I guess it was much less useful in the beginning?
I thought of making something similar with data from https://musicbrainz.org/
Yes, in the beginning pretty much everybody hated it and thought the project was nuts. I got pretty much no positive feedback but lots of negative. I was like "But it's learning! It's learning!" :) Strangely, that convinced almost nobody, even among my friends.
Now that many millions of people have used it, I get a lot of great, often enthusiastic feedback on how Gnod makes the best recommendations.
That taught me that you can't convince people with just an idea. For most people, you have to deliver something which is already useful to them.
Your effort is appreciated, but the recommendations miss the mark by a considerable margin, to say the least.
I built a map of all the PDF URLs on the internet recently.
I used a tiny embeddings model and PCA for dimensionality reduction.
https://weblog.snats.xyz/posts/2024/03/20/
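For readers unfamiliar with the PCA step mentioned above, here is a minimal sketch of projecting embeddings down to 2D map coordinates. It is not the code from the linked post: the function name `pca_2d`, the random placeholder data and the 384-dimensional vector size are all assumptions made for illustration, and real embeddings would come from whatever model you use.

```python
import numpy as np


def pca_2d(embeddings: np.ndarray) -> np.ndarray:
    """Project row vectors onto their first two principal components."""
    # Center each feature so the components describe variance, not the mean.
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Placeholder data (assumption): random vectors standing in for real embeddings.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(1000, 384))

coords = pca_2d(fake_embeddings)
print(coords.shape)  # (1000, 2)
```

Each row of `coords` is an (x, y) position that can be fed to any scatter plot. PCA is linear, so it tends to preserve the global layout better than local neighborhoods.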
Interesting, did you also try using PaCMAP or UMAP for dimensionality reduction? It might result in a more meaningful representation of their underlying semantic structure: see the 'mammoth' example in my article.
No! I only tried PCA, but I still have the embeddings.
I'll try later and post results.
I had something similar once - it was a graph of connections between all the artists in my Spotify library to see who had collab'd with who. It was a lot of fun to see just how distantly connected two artists were through a long chain of collabs and collabs. Of course, like most human connection maps, it mostly came down to a handful of super-connectors who collaborate with hundreds of people, who in turn collaborate with their own niche groups. But there were some interesting groups revealed by it.
I was halfway expecting a "Six Degrees of Kevin Bacon" reference here. Disregarding the actual Bacon, I was almost hoping for a similar effect, where any 2 artists can be connected in 1 Bacon or less.
Great job! There is a form to report typos. Is there anywhere to report duplicates and more complicated errors?
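For anyone who wants to try the "how distantly connected are two artists" idea on their own library, here is a minimal sketch using a breadth-first search over a collaboration graph. The artist names and the `collabs` pair list are made up for illustration, and the helper names `build_graph` and `degrees_of_separation` are hypothetical. Real data would have to come from wherever you export your library.

```python
from collections import deque


def build_graph(collabs):
    """Turn (artist, artist) pairs into an undirected adjacency map."""
    graph = {}
    for a, b in collabs:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def degrees_of_separation(graph, start, goal):
    """Return the number of collaboration hops between two artists, or None."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        artist, dist = queue.popleft()
        for neighbor in graph.get(artist, ()):
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None  # The two artists are in disconnected groups.


# Made-up example data (assumption): "Hub" plays the super-connector.
collabs = [("A", "Hub"), ("Hub", "B"), ("B", "C"), ("D", "E")]
graph = build_graph(collabs)

print(degrees_of_separation(graph, "A", "C"))  # 3
print(degrees_of_separation(graph, "A", "D"))  # None
```

Because breadth-first search visits artists in order of distance, the first time it reaches the goal is guaranteed to be along a shortest chain of collaborations.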
What is the difference between a typo and a duplicate? If you mean that two ways of writing the same name are both legit, then you have to decide on one being the more "correct" one. After a while Gnod will figure out which one is the more common name.
And "complicated errors"?
You are the creator? Thank you for what you do! I've used it with pleasure for many years.
Yes. Happy you like it!
You might also like: https://everynoise.com/
This one is mesmerizing. Highly recommend checking it out.
Thanks for sharing that map, I’m going to start using it to discover new artists :)
I’d love to see a semantic map of the internet. I’m considering having a crack at it as well, but it’d be a monumental task. There is this cool map but it’s quite dated: http://internet-map.net/
I did something similar for fragrances a little while ago: https://observablehq.com/@55th/every-fragrance-at-once
Very cool. I've immediately found some music I really like that I've never heard before.