Hack Map - 1.5 Million Hacker News stories, users and comments in one plot

This plot shows a few million of the roughly 40 million stories, comments and authors from Hacker News. Similar stories and comments are placed close to each other, based on the content of the comments.

How to use:

Use the search box to find a story, a user, a comment or a topic
Use the mouse wheel or two fingers to zoom in and out

Details

The data was downloaded from the HN API. Since ~40 million items are too many for a browser, I kept only the 2.6 million with at least some replies. I fed the comments to a SentenceTransformers model (all-MiniLM-L6-v2) to create text embeddings. Titles of submissions are often ambiguous, so I used the average embedding of their comments to get a better representation of the content of each submission. The same was done for the users. Then I used UMAP to reduce the dimensionality of the embeddings to 3, 2 and 1 dimensions: 3 for the colors, 2 for the placement of the nodes and 1 for a plot with the time dimension. The 3D colors didn't add much information, so I removed them.
I also used BERTopic to get clusters and names for these clusters, but they don't add much information beyond the titles of the submissions.
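The averaging step for submissions can be illustrated with a small sketch. This is not the project's actual code: the function name `mean_embeddings`, the `(story_id, embedding)` input format and the toy 2-dimensional vectors are assumptions made for illustration (all-MiniLM-L6-v2 produces 384-dimensional vectors).

```python
from collections import defaultdict


def mean_embeddings(comments):
    """Average comment embeddings per story.

    comments: iterable of (story_id, embedding) pairs, where each
    embedding is a list of floats of the same length.
    Returns a dict {story_id: mean embedding}.
    """
    sums = {}
    counts = defaultdict(int)
    for story_id, vec in comments:
        if story_id not in sums:
            sums[story_id] = [0.0] * len(vec)
        acc = sums[story_id]
        # Accumulate the component-wise sum for this story.
        for i, x in enumerate(vec):
            acc[i] += x
        counts[story_id] += 1
    # Divide each sum by the number of comments of that story.
    return {sid: [x / counts[sid] for x in acc] for sid, acc in sums.items()}


# Toy data: two comments on story 1, one comment on story 2.
toy = [(1, [1.0, 0.0]), (1, [0.0, 1.0]), (2, [2.0, 2.0])]
story_vectors = mean_embeddings(toy)
```

The same function would work for users if the first element of each pair were the author instead of the story.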
There are several implementations of maps like this: the Hathi Trust Library, 20 Million PubMed articles and a few examples from datamapplot (a library from Leland McInnes, the creator of UMAP), such as Wikipedia and 20 Newsgroups (todo: add links). Some of them are very sophisticated, but they don't show the actual text on the canvas. I think showing as much information as possible without overwhelming the user (and the browser...) is very important for how much the user can get out of such a visualization of big data.
Another important aspect is that I wanted to host the whole thing on a static host, which makes things much easier in the long term. I used mostly vanilla JavaScript (a good decision for such a site: no build step and no fighting against Svelte or React) and the excellent force-graph library.
Since there are too many data points to show at once, the page fetches a base map with the 40 000 most important nodes and then fetches additional data tiles when you zoom in. Unfortunately, I couldn't find the time to implement a static search over all the data, so the search currently only works for the base tile of 40 000 nodes.
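One way to prepare a base map plus static tiles is sketched here. The text does not say how the tiles are laid out, so everything in the sketch is an assumption: a regular grid over 2D coordinates scaled to [0, 1), a `weight` field that ranks importance, and the hypothetical helpers `build_tiles` and `visible_tiles`.

```python
from collections import defaultdict


def build_tiles(nodes, base_size, grid):
    """Split nodes into a base map and a grid of tiles.

    nodes: list of dicts with "id", "x", "y" (both in [0, 1)) and "weight".
    base_size: number of highest-weight nodes kept in the base map.
    grid: number of tiles per axis (assumed regular grid).
    Returns (base_ids, {(col, row): [node ids]}).
    """
    ranked = sorted(nodes, key=lambda n: n["weight"], reverse=True)
    base, rest = ranked[:base_size], ranked[base_size:]
    tiles = defaultdict(list)
    for n in rest:
        col = min(int(n["x"] * grid), grid - 1)
        row = min(int(n["y"] * grid), grid - 1)
        tiles[(col, row)].append(n["id"])
    return [n["id"] for n in base], dict(tiles)


def visible_tiles(x_min, y_min, x_max, y_max, grid):
    """Return the tile keys that overlap a viewport given in map coordinates."""
    def clamp(v):
        return max(0, min(grid - 1, int(v * grid)))
    return [(c, r)
            for c in range(clamp(x_min), clamp(x_max) + 1)
            for r in range(clamp(y_min), clamp(y_max) + 1)]


# Toy data: three nodes, a base map of one node and a 2 x 2 grid.
toy_nodes = [
    {"id": "a", "x": 0.1, "y": 0.1, "weight": 10},
    {"id": "b", "x": 0.9, "y": 0.9, "weight": 1},
    {"id": "c", "x": 0.6, "y": 0.2, "weight": 5},
]
base_ids, tiles = build_tiles(toy_nodes, base_size=1, grid=2)
needed = visible_tiles(0.4, 0.0, 0.6, 0.4, grid=2)
```

Each tile could then be written to its own static file, so the page only has to request the files whose keys are returned for the current viewport. No server-side logic is needed, which fits a static host.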
The color of the nodes is based on the publication date. The size is based on the score for submissions, and on the number of direct and indirect child comments for comments and users.
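Counting direct and indirect child comments could look roughly like the following sketch. It assumes the reply structure is available as a mapping from each comment id to its parent id and that it forms a tree (no cycles); the name `descendant_counts` and the toy ids are made up for illustration.

```python
def descendant_counts(parent_of):
    """Count direct and indirect child comments for every item.

    parent_of: dict {comment_id: parent_id}. Root stories do not
    appear as keys, only as parents.
    Returns a dict {item_id: number of descendants}.
    """
    counts = {}
    for item in parent_of:
        counts.setdefault(item, 0)
        node = parent_of[item]
        # Walk up to the root and credit every ancestor with one descendant.
        while node is not None:
            counts[node] = counts.get(node, 0) + 1
            node = parent_of.get(node)
    return counts


# Toy thread: story 1 has comments 2 and 4; comment 3 replies to comment 2.
toy_parents = {2: 1, 3: 2, 4: 1}
sizes = descendant_counts(toy_parents)
```

In this toy thread the story gets a count of 3 and comment 2 a count of 1. How such counts are then scaled to a node radius is a separate tuning question.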

The biggest challenge in this project was that it worked so well that I was constantly distracted by the stories and comments I discovered while testing the plot. This is why I am releasing it now in this work-in-progress state. Firefox doesn't render some nodes when zoomed in too far; Chrome renders them, but has problems showing the correct tooltips.

Candos and Todos:

better search
more levels of tiles
tuning of the size and show parameters
earlier data from HN
other datasets (MusicBrainz, bibliometrics, newspapers, ...)
better data loading (use Feather)
tweening between different embeddings or other representations
showing the text of the comments

tom@theilemail.de

The code can be found on GitHub: github.com/tomthe/demographymap