This plot shows a few million of the roughly 40 million stories, comments and authors on Hacker News. Similar stories and comments are placed close to each other, based on the content of the comments.
The data was downloaded from the HN API.
Since ~40 million items are too many for a browser,
I kept only the 2.6 million with at least some replies.
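A minimal sketch of such a filter, assuming items are fetched one at a time from the public Hacker News API, where each item is a JSON object whose `kids` field lists the ids of its direct replies. The helper names (`fetch_item`, `has_replies`, `keep_items_with_replies`) and the threshold of one reply are assumptions for illustration; the exact cut-off used for the map is not stated here.

```python
import json
from urllib.request import urlopen

# Public Hacker News API endpoint for a single item.
ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{id}.json"


def fetch_item(item_id):
    """Download one item (story, comment, ...) as a dict; None if it does not exist."""
    with urlopen(ITEM_URL.format(id=item_id)) as response:
        return json.load(response)


def has_replies(item, min_replies=1):
    """True if the item has at least `min_replies` direct replies.

    `kids` holds the ids of direct replies and is absent on items without any.
    The threshold is an assumption, not the value used for the map.
    """
    if not item:
        return False
    return len(item.get("kids", [])) >= min_replies


def keep_items_with_replies(items, min_replies=1):
    """Filter an iterable of already-downloaded items."""
    return [item for item in items if has_replies(item, min_replies)]
```

With items already on disk, `keep_items_with_replies(items)` would reduce the full dump to the subset that goes on to the embedding step.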
I fed the comments to a SentenceTransformers model (all-MiniLM-L6-v2)
to create text embeddings. Titles of submissions are often ambiguous, so I used the average embedding of their comments
to get a better representation of the content of each submission. The same was done for the users.
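The averaging step could look roughly like the sketch below. It assumes the comment embeddings are already available as plain lists of floats (in practice they would come from the model, for example via `SentenceTransformer("all-MiniLM-L6-v2").encode(texts)` from the sentence-transformers package). The function names and the dict-based data layout are hypothetical.

```python
from collections import defaultdict


def mean_vector(vectors):
    """Element-wise mean of equally long vectors."""
    count = len(vectors)
    return [sum(values) / count for values in zip(*vectors)]


def average_by_group(comment_embeddings, group_of_comment):
    """Average comment embeddings per group (submission id or user name).

    comment_embeddings: dict comment_id -> embedding (list of floats)
    group_of_comment:   dict comment_id -> group key (hypothetical layout)
    """
    grouped = defaultdict(list)
    for comment_id, embedding in comment_embeddings.items():
        group = group_of_comment.get(comment_id)
        if group is not None:
            grouped[group].append(embedding)
    return {group: mean_vector(vectors) for group, vectors in grouped.items()}
```

Calling it once with a comment-to-submission mapping and once with a comment-to-author mapping yields one vector per submission and one per user.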
Then I used UMAP to reduce the dimensionality of the embeddings
to 3, 2 and 1 dimensions: 3 for the colors, 2 for the placement of the nodes and 1 for a plot against time.
But the 3D colors didn't add much information, so I removed them.
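A hedged sketch of the reduction step, assuming the `umap-learn` package and its `umap.UMAP(n_components=...)` estimator. The wrapper names, the fixed `random_state` and the way the 1D output is paired with timestamps are assumptions, not the settings used for this map.

```python
def reduce_with_umap(embeddings, n_components, random_state=42):
    """Project high-dimensional embeddings down to `n_components` dimensions.

    Requires the third-party `umap-learn` package, imported lazily so the
    rest of this sketch can be used without it.
    """
    import umap

    reducer = umap.UMAP(n_components=n_components, random_state=random_state)
    return reducer.fit_transform(embeddings)


def timeline_points(timestamps, coords_1d):
    """Pair each item's timestamp (x axis) with its 1D UMAP coordinate (y axis)."""
    return [(t, float(row[0])) for t, row in zip(timestamps, coords_1d)]
```

For example, `reduce_with_umap(embeddings, 2)` would give the node positions, and `timeline_points(timestamps, reduce_with_umap(embeddings, 1))` the points for the time plot.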
I also used BERTopic to get clusters and names for these
clusters... but they don't add much information beyond the titles of the submissions either.
There are several implementations of maps like this:
the HathiTrust library,
20 million PubMed articles
and a few examples from datamapplot
(a library by Leland McInnes, the creator of UMAP):
Wikipedia and
20 Newsgroups
(todo: add links)
Some of them are very sophisticated, but they don't show the actual text on the canvas.
I think that showing as much information as possible, without overwhelming the user (or the browser...),
largely determines how much the user can get out of such a visualization of big data.
Another important aspect is that I wanted to host the whole thing on a static host,
which makes things much easier in the long term.
I mostly used vanilla JavaScript (a good decision for such a site: no build step
and no fighting against Svelte or React) and the excellent
force-graph library.
Since there are too many data points to show at once, the page fetches a base map with
the 40 000 most important nodes and then fetches additional data tiles when you zoom in.
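One way such tiles could be prepared and looked up is sketched below. It assumes a simple fixed-size grid over the 2D layout and a per-node `importance` value. The node schema, the grid scheme and the function names are hypothetical and not taken from the actual site.

```python
import math
from collections import defaultdict


def split_into_tiles(nodes, base_size=40_000, tile_size=1.0):
    """Split nodes into one base map and a grid of detail tiles.

    nodes: list of dicts with "x", "y" and "importance" keys (hypothetical schema).
    Returns (base_nodes, tiles), where tiles maps (column, row) -> list of nodes.
    """
    ranked = sorted(nodes, key=lambda n: n["importance"], reverse=True)
    base_nodes, rest = ranked[:base_size], ranked[base_size:]
    tiles = defaultdict(list)
    for node in rest:
        key = (math.floor(node["x"] / tile_size), math.floor(node["y"] / tile_size))
        tiles[key].append(node)
    return base_nodes, dict(tiles)


def tiles_in_view(x_min, y_min, x_max, y_max, tile_size=1.0):
    """Tile keys overlapping the visible rectangle, i.e. what a client would fetch."""
    cols = range(math.floor(x_min / tile_size), math.floor(x_max / tile_size) + 1)
    rows = range(math.floor(y_min / tile_size), math.floor(y_max / tile_size) + 1)
    return [(c, r) for c in cols for r in rows]
```

Each tile can be written to its own static file, so a static host is enough: the page only needs to map the current viewport to tile keys and request the matching files.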
Unfortunately, I couldn't find the time to implement a static search over all the data,
so the search currently only covers the base tile of 40 000 nodes.
The color of the nodes is based on the publication date. The size is based on the
score for submissions, and on the number of direct and indirect child comments for comments
and users.
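Counting direct and indirect child comments can be sketched as below, assuming each item stores the id of its parent (as the HN API does for comments). The function name and the dict-based input are hypothetical, and how the counts are aggregated per user is not shown.

```python
from collections import defaultdict


def count_descendants(parent_of):
    """Count direct and indirect replies for every item.

    parent_of: dict item_id -> parent_id (None for top-level submissions).
    Returns dict item_id -> number of descendants.
    """
    counts = defaultdict(int)
    for item_id in parent_of:
        # Walk up the reply chain and credit every ancestor with one descendant.
        ancestor = parent_of[item_id]
        while ancestor is not None:
            counts[ancestor] += 1
            ancestor = parent_of.get(ancestor)
    return {item_id: counts[item_id] for item_id in parent_of}
```

The resulting counts could then be passed through any monotonic scaling to obtain a node size.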
The biggest challenge in this project was that it worked so well that I constantly got
distracted by the stories and comments that I discovered while testing the plot.
This is why I am releasing it now in this work-in-progress state. Firefox doesn't render some nodes
when zoomed in too far; Chrome renders them, but has problems showing the correct tooltips.
Candos and Todos:
The code can be found on GitHub: github.com/tomthe/demographymap