Aggregating blog comments
=========================
:date: 2022-03-09
IT-related stuff written in English tends to live a life of its own: when people
find something interesting, it ends up linked on various news
aggregators.footnote:[For example: I was writing an article about my attempts at implementing X11 GUIs with Go, https://blog.rfox.eu/[my friend] got the idea to post it on r/programming, and very soon it spontaneously ended up on more subreddits, Hacker News, and Lobsters, with lots of people reading it. To my mild surprise, the Acer laptop serving it over a 5 Mbit/s uplink on a residential DSL connection survived the traffic just fine, though I didn't have many images on my site at the time.]
Some years back, it could have been
http://hackles.org/cgi-bin/archives.pl?request=129[Slashdot]; nowadays it's mainly
Hacker News, Reddit, or Lobsters. Aside from assigning internet points, people
there like to discuss the contents, often disregarding whether it already has
a comment section of its own.
When I noticed Jeff Kaufman
https://www.jefftk.com/p/designing-low-upkeep-software[importing comments]
from various sources directly into his blog posts, I simply had to try this
out as well. It turns out it's rather easy!
So, how do you aggregate comments back from news aggregators?
Step 1: Know where the discussion happens
-----------------------------------------
One way to achieve this is to post your articles on all appropriate media
yourself, i.e., to be the first. There are two issues with that. First, on
both Hacker News and Lobsters this is generally frowned upon, in particular if
you do it too often, and you may be rewarded with a ban. Second, you don't
control subsequent reposts, and everything on these sites has its own, limited
lifetime.
The more sustainable approach is to search for your domain's appearances on
those sites of interest in an automated fashion:
- *Hacker News* has an easy-to-use
https://hn.algolia.com/api[specialized JSON search API].
- **Reddit**'s https://www.reddit.com/dev/api/[general-purpose API] doesn't
like cURL's default User-Agent, and uses very awkward data structures,
yet it's also fine to use.
- *Lobsters* doesn't have an API per se, but you can easily parse
  the search page.
In my case, I simply extended the link:personal-intelligence-service.html[
web-watching] script I launch once a day through systemd with a few more
requests. It reduces to:
```sh
#!/bin/sh -e
# Watch a handful of URLs for changes, and mail any diffs to root.
status=0 workdir=watch
mkdir -p $workdir

check() {
	local url=$1 f=$workdir/$(echo "$1" | sed 's|/|\\|g')
	if ! curl -A Skynet --no-progress-meter -Lo "$f.download" "$url"; then
		status=1
	else
		# Run the download through the passed filter command,
		# then compare the result against the previous run's.
		shift
		"$@" <"$f.download" >"$f.filtered" || status=1
		if [ -f "$f" ] && ! diff "$f.filtered" "$f" >"$f.diff"; then
			mail -s "$url updated" root <"$f.diff" || status=1
		fi
		mv "$f.filtered" "$f"
	fi
}

check 'https://hn.algolia.com/api/v1/search_by_date?query=p.janouch.name' \
	jq -r '.hits[] | "https://news.ycombinator.com/item?id=\(.objectID) \(.url)"'
check 'https://www.reddit.com/search.json?q=site%3Ap.janouch.name&sort=new' \
	jq -r '"https://reddit.com" + .data.children[].data.permalink'
check 'https://lobste.rs/domain/p.janouch.name' \
	perl -lne 'print "https://lobste.rs$&" if m|/s/\w+| && !$seen{$&}++'

exit $status
```
Thus, I get diffs by mail. You just have to love the Bourne shell, Perl, and jq.

Of course, the results can also be further processed directly, and with the
exception of Lobsters, which I'll talk about in a moment, the requests can even
be run from your reader's browser alone. That is, if you don't mind excluding
people with Javascript disabled.
So far I'm quite content with adding links to my static pages manually,
in the manner of this illustrative snippet:

```html
<!-- Hypothetical markup; the exact form is whatever your loader script
     looks for. -->
<p id="comments">Discussed on
<a href="https://news.ycombinator.com/item?id=12345678">Hacker News</a> and
<a href="https://lobste.rs/s/1a2b3c">Lobsters</a>.</p>
```
or declaratively, for my custom static site generator based on libasciidoc:
```
:hacker-news: 12345678
:lobsters: 1a2b3c
:reddit: 1a2b3c, 4d5e6f
:reddit-subs: r/linux, r/programming
```
Step 2: Steal all the replies
-----------------------------
Having enumerated the story/post IDs, the rest is easy: you just fetch some JSON
documents, slightly reprocess them to form a DOM tree to insert into your page,
come up with appropriate styling, and you're done. Roughly speaking, that is,
because each of the respective APIs comes with its own quirks.
Lobsters
~~~~~~~~
Such as Lobsters' CORS settings giving you the middle finger, which forces you
to set up an appropriately rate-limited and cached
proxy.footnote:[The relevant GitHub issues are: https://github.com/lobsters/lobsters/issues/1029[#1029], https://github.com/lobsters/lobsters/issues/361[#361].]
nginx has proven to be quite awkward and fiddly to configure, so I won't bother
quoting my hodge-podge reverse proxy rules in full. The most important lines
were:
```nginx
server {
	listen 443 ssl http2;
	server_name lobsters.your.domain;

	location / {
		proxy_pass https://lobste.rs;
		add_header Access-Control-Allow-Origin 'https://blog.your.domain';
		add_header Access-Control-Allow-Methods 'GET, OPTIONS';
	}
}
```
Luckily, the only remaining way in which this aggregator resists being
aggregated is that the comments you'll find at
_+++https://lobsters.your.domain/s/1a2b3c.json+++_ are all flattened into
a common array, along with their computed depth. Curiously, this makes the
obvious transformation code non-recursive, the exception among the three sites.
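For illustration, here's a minimal sketch of that transformation, keeping
a stack of lists indexed by depth and written against the proxy from above;
take the field names (`comments`, `depth`, `comment`) as assumptions to verify
against a live JSON document:

```js
// Sketch only: build nested lists from Lobsters' flat comment array.
async function lobstersComments(storyID) {
	const response = await fetch(
		`https://lobsters.your.domain/s/${storyID}.json`)
	const story = await response.json()

	const root = document.createElement('ul')
	const stack = [root]
	for (const c of story.comments) {
		// Unwind the stack to this comment's parent level.
		while (stack.length > c.depth + 1)
			stack.pop()

		const li = document.createElement('li')
		li.innerHTML = c.comment  // already rendered HTML

		// Each item gets a (possibly empty) list for its children.
		const children = document.createElement('ul')
		li.append(children)
		stack[stack.length - 1].append(li)
		stack.push(children)
	}
	return root
}
```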
Hacker News
~~~~~~~~~~~
The https://github.com/HackerNews/API[Hacker News API] is a bit awkward in that
you need to fetch _every single comment separately_. The good news is that even
for a modestly sized comment section, trivially spawning `fetch` calls in
a massively asynchronous fashion turns out to work acceptably. Also beware that
the API returns even deleted comments, and those don't have all fields set.
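That massively asynchronous fashion can be as simple as the following sketch;
the `kids`, `title`, and `descendants` fields are documented in the API
repository:

```js
// Sketch: recursively fetch an item and all of its descendants in parallel.
// Deleted comments still come back, just with a `deleted` flag set
// and most other fields missing.
const hnItem = async id => {
	const response = await fetch(
		`https://hacker-news.firebaseio.com/v0/item/${id}.json`)
	const item = await response.json()
	item.replies = await Promise.all((item.kids || []).map(hnItem))
	return item
}

hnItem(12345678).then(story => console.log(
	`fetched ${story.descendants} comments under "${story.title}"`))
```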
Note that https://news.ycombinator.com/item?id=32543023[this API may be
replaced with something else in the future].
Reddit
~~~~~~
Except for the already-mentioned insane data structures, confusing
documentation, and the `raw_json` oddity where their JSON normally contains
SGML-quoted HTML fields, there isn't much to talk about. I'm not sure if the
API returns all discussions in full, but it doesn't seem to be worth spending
much time on. It returns _enough_.
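To give an idea of those structures, here's a sketch that peels off the
Listing/thing envelopes; the URL shape and field names follow Reddit's public
JSON endpoints, but verify them before relying on this:

```js
// Sketch: unwrap Reddit's Listing/thing envelopes around a comment tree.
// Passing raw_json=1 stops the API from SGML-quoting its HTML fields.
async function redditComments(postID) {
	const response = await fetch(
		`https://www.reddit.com/comments/${postID}.json?raw_json=1`)
	// The first Listing describes the post itself, the second its comments.
	const [, listing] = await response.json()

	const unwrap = thing => ({
		author: thing.data.author,
		body: thing.data.body_html,
		// Missing replies are an empty string rather than an empty Listing.
		replies: (thing.data.replies || {data: {children: []}})
			.data.children
			.filter(t => t.kind === 't1')  // skip "more" stubs
			.map(unwrap),
	})
	return listing.data.children.filter(t => t.kind === 't1').map(unwrap)
}
```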
Summing up the parts
~~~~~~~~~~~~~~~~~~~~
Have a look at
https://p.janouch.name/common.js[the short Javascript code used on this page].
I'm an incorrigible code-golfer, and I try to make do without libraries, so it
should be quite readable, and even reusable. As a side note, barebones web
development is _easy_ these days, with all those browser APIs and ES6.
To see it actually load nearly 200 comments from various sources, jump to
link:/article-xgb.html#comments[the end of my X11 article]. Seeing as it made
the comment section much, much longer than the article itself, I decided to roll
them up once the loaded item count exceeds a threshold.
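The roll-up itself doesn't need to be anything fancier than the following
sketch; the 50-item threshold is arbitrary:

```js
// Sketch: collapse an oversized comment section behind a <details> element.
function maybeRollUp(container, count) {
	if (count <= 50)
		return

	const details = document.createElement('details')
	const summary = document.createElement('summary')
	summary.textContent = `${count} comments`
	details.append(summary, ...container.childNodes)
	container.append(details)
}
```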
Step 3: Profit
--------------
There's no profit, you're the commodity.
Is this a good idea?
--------------------
I don't know. Reading imported comments gets mildly confusing, because people
assume you can't see their reactions. I suppose it will be a bit better if you
clearly identify yourself as an intelligence agency at the end of your articles.
Then there's the problem of inviting pseudonymous _masses_ into your _home_ to
express opinions, shitpost, and ask weird questions. You will get a lot of
effectively worthless comments, though they tend to be outweighed by
interesting feedback.

In any case, it is fairly low-effort, and makes your place look a bit more
'`cozy`'.
And is this for me?
~~~~~~~~~~~~~~~~~~~
On a personal, tangential, mildly philosophical note: some of the past gossip
that I've reconnected with its origin this way may portray me unfavourably (or
other people, for that matter), and it is now right there for everyone to see,
presumably indexed by search engines as well. Interspersing otherwise sterile
works with strong views, immoral jokes, and/or rants also has a way of dragging
people into pointless, indignant discussions, all of which I've now pulled in.
But without these peculiarities, my writing would be inauthentic, lifeless, and
boring to read as well as to write. Do I embrace the resulting chaos?
All things considered... YOLO, I guess.