Aggregating blog comments

IT-related stuff written in English tends to live a life of its own: when people find something interesting, it ends up linked on various news aggregators.[1] Some years back that would have been Slashdot; nowadays it’s mainly Hacker News, Reddit, or Lobsters. Aside from assigning internet points, people there like to discuss the contents, often disregarding whether it already has a comment section of its own.

When I noticed Jeff Kaufman importing comments from various sources directly into his blog posts, I simply had to try this out as well. It turns out it’s rather easy!

So, how do you aggregate comments back from news aggregators?

Step 1: Know where the discussion happens

One way to achieve this is to post your articles on all appropriate media yourself, i.e., to be the first. There are two issues with that. First, on both Hacker News and Lobsters doing so is generally frowned upon, in particular if you do it too often, and you may be rewarded with a ban. Second, you don’t control subsequent reposts, and everything on these sites has its own, limited lifetime.

The more sustainable approach is to search for your domain’s appearances on those sites of interest in an automated fashion:

  • Hacker News has an easy-to-use specialized JSON search API.

  • Reddit's general-purpose API doesn’t like cURL’s default User-Agent, and uses very awkward data structures, yet it’s also fine to use.

  • Lobsters doesn’t have a search API at all, but you can easily parse the search page.

In my case, I simply extended the web-watching script that I run once a day through systemd with a few more requests. Boiled down to the essentials:

#!/bin/sh -e
status=0 workdir=watch
mkdir -p $workdir

# check URL FILTER [ARGS...]: fetch URL, pipe it through the given filter
# command, and mail a diff against the previously stored result.
check() {
  local url=$1 f=$workdir/$(echo "$1" | sed 's|/|\\|g')
  if ! curl -A Skynet --no-progress-meter -Lo "$f.download" "$url"; then
    status=1
  else
    shift
    "$@" <"$f.download" >"$f.filtered" || status=1
    if [ -f "$f" ] && ! diff "$f.filtered" "$f" >"$f.diff"; then
      mail -s "$url updated" root <"$f.diff" || status=1
    fi
    mv "$f.filtered" "$f"
  fi
}

check 'https://hn.algolia.com/api/v1/search_by_date?query=p.janouch.name' \
  jq -r '.hits[] | "https://news.ycombinator.com/item?id=\(.objectID) \(.url)"'
check 'https://www.reddit.com/search.json?q=site%3Ap.janouch.name&sort=new' \
  jq -r '"https://reddit.com" + .data.children[].data.permalink'
check 'https://lobste.rs/domain/p.janouch.name' \
  perl -lne 'print "https://lobste.rs$&" if m|/s/\w+| && !$seen{$&}++'
exit $status

Thus, I get diffs by mail. You just have to love the Bourne shell, Perl and jq.

Of course, the results can also be processed further directly, and with the exception of Lobsters, which I’ll get to in a moment, the requests can even be run purely from your reader’s browser. That is, if you don’t care about people who have Javascript disabled.
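As a minimal browser-side sketch, the same Hacker News search from the script above looks roughly like this (inside a module script or an async function; the queried domain is of course mine):

const found = await (await fetch(
  'https://hn.algolia.com/api/v1/search_by_date?query=p.janouch.name')).json()
for (const hit of found.hits)
  console.log(`https://news.ycombinator.com/item?id=${hit.objectID}`, hit.url)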

So far I’m quite content with adding links to my static pages manually, in the manner of:

<h3 class=hacker-news data-id=12345678>
<a href='https://news.ycombinator.com/item?id=12345678'>Hacker News</a></h3>
<h3 class=lobsters data-id=1a2b3c>
<a href='https://lobste.rs/s/1a2b3c'>Lobsters</a></h3>
<h3 class=reddit data-id=1a2b3c>
<a href='https://www.reddit.com/comments/1a2b3c/'>r/linux</a></h3>
<h3 class=reddit data-id=4d5e6f>
<a href='https://www.reddit.com/comments/4d5e6f/'>r/programming</a></h3>

or declaratively, for my custom static site generator based on libasciidoc:

:hacker-news: 12345678
:lobsters: 1a2b3c
:reddit: 1a2b3c, 4d5e6f
:reddit-subs: r/linux, r/programming

Step 2: Steal all the replies

Once you’ve enumerated the story/post IDs, the rest is easy: you just fetch some JSON documents, slightly reprocess them to form a DOM tree to insert into your page, come up with appropriate styling, and you’re done. Roughly speaking, that is, because each of the respective APIs comes with its own quirks.

Lobsters

For instance, Lobsters’ CORS settings give you the middle finger, so you need to set up an appropriately rate-limited and cached proxy.[2] nginx has proven to be quite awkward and fiddly to configure, so I won’t bother quoting my hodge-podge reverse proxy rules in full. The most important lines were:

server {
  listen  443 ssl http2;
  server_name  lobsters.your.domain;
  location / {
    proxy_pass  https://lobste.rs;
    add_header  Access-Control-Allow-Origin 'https://blog.your.domain';
    add_header  Access-Control-Allow-Methods 'GET, OPTIONS';
  }
}

Luckily, the only remaining way in which this aggregator resists being aggregated is that the comments you’ll find at https://lobsters.your.domain/s/1a2b3c.json all come flattened into a single array, along with their computed depth. Curiously, this makes the obvious transformation code non-recursive, unlike with the other two sites.
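A minimal sketch of that nesting step, assuming the proxy from above; the comments array and the depth field are what the flattened format gives you, while the comment field holding pre-rendered HTML is from memory and worth double-checking:

async function lobstersComments(id) {
  const story = await (
    await fetch(`https://lobsters.your.domain/s/${id}.json`)).json()
  const root = document.createElement('ul'), stack = [root]
  for (const c of story.comments) {
    // Depths only ever increase by one, so cut the stack back to the parent.
    stack.length = c.depth + 1
    const li = document.createElement('li')
    li.innerHTML = c.comment            // Pre-rendered HTML from the API.
    stack[c.depth].appendChild(li)
    const sublist = document.createElement('ul')
    li.appendChild(sublist)
    stack.push(sublist)
  }
  return root
}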

Hacker News

The Hacker News API is a bit awkward in that you need to fetch each comment separately. The good news is that even for a modestly sized comment section, trivially spawning fetch calls in a massively asynchronous fashion turns out to work acceptably. Also beware that the API returns even deleted comments, and those don’t have all fields set.
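A minimal sketch of that recursion through the official Firebase API, starting from the story ID and following each item’s kids concurrently:

const hnItem = async id => (await fetch(
  `https://hacker-news.firebaseio.com/v0/item/${id}.json`)).json()

async function hnThread(id) {
  const item = await hnItem(id)
  // Deleted comments are still returned, just with most fields missing.
  if (!item || item.deleted)
    return null
  // Fire off all child fetches at once rather than waiting for each in turn.
  const kids = await Promise.all((item.kids || []).map(hnThread))
  return {...item, children: kids.filter(k => k)}
}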

Reddit

Except for the already-mentioned insane data structures, confusing documentation, and the raw_json oddity, where their JSON normally contains SGML-escaped HTML fields, there isn’t much to talk about. I’m not sure if the API returns all discussions in full, but it doesn’t seem worth spending much time on. It returns enough.
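For completeness, a minimal sketch of walking the returned comment tree; passing raw_json=1 turns the escaping off, so body_html can be used as it is:

async function redditComments(id) {
  const [, listing] = await (await fetch(
    `https://www.reddit.com/comments/${id}.json?raw_json=1`)).json()
  const walk = children => (children || [])
    .filter(c => c.kind === 't1')       // Skip the "load more" stubs.
    .map(c => ({
      author: c.data.author,
      html: c.data.body_html,
      replies: walk(c.data.replies && c.data.replies.data.children),
    }))
  return walk(listing.data.children)
}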

Summing up the parts

Have a look at the short Javascript code used on this page. I’m an incorrigible code-golfer, and I try to make do without libraries, so it should be quite readable, and even reusable. As a side note, barebones web development is easy these days, with all those browser APIs and ES6.

To see it actually load nearly 200 comments from various sources, jump to the end of my X11 article. Seeing as it made the comment section much, much longer than the article itself, I decided to roll them up once the loaded item count exceeds a threshold.
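The roll-up itself doesn’t need much. One possible take, not necessarily what the script here does, and with purely illustrative names and threshold, is to wrap an overly long container in a <details> element:

function rollUpIfLong(container, threshold = 50) {
  if (container.childElementCount <= threshold)
    return
  const details = document.createElement('details')
  const summary = document.createElement('summary')
  summary.textContent =
    `${container.childElementCount} comments, click to expand`
  details.append(summary)
  container.replaceWith(details)
  details.append(container)
}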

Step 3: Profit

There’s no profit, you’re the commodity.

Is this a good idea?

I don’t know. Reading imported comments gets mildly confusing, because people assume you can’t see their reactions. I suppose it will be a bit better if you clearly identify yourself as an intelligence agency at the end of your articles.

Then there’s the problem of inviting pseudonymous masses into your home to express opinions, shitpost, and ask weird questions. You will get a lot of effectively worthless comments, even though this tends to be outweighed by interesting feedback.

In any case, it takes remarkably little effort, and makes your place look a bit more ‘cozy’.

And is this for me?

On a personal, tangential, mildly philosophical note: some past gossip that I’ve reconnected with its origin this way may also portray me unfavourably (or other people, for that matter), and it will now be right there for everyone to see, and, I assume, indexed by search engines as well. Interspersing otherwise sterile works with strong views, immoral jokes, and/or ranting apparently tends to make people get caught up in pointless indignant discussions, all of which I have now pulled in. But without these peculiarities, my writing would be inauthentic, lifeless, and boring to read as well as to write. Do I embrace the resulting chaos?

All things considered…​ YOLO, I guess.


1. For example: I was writing an article about my attempts at implementing X11 GUIs with Go, my friend got the idea to post it on r/programming, and very soon it spontaneously ended up on more subreddits, Hacker News, and Lobsters, with lots of people reading it. To my mild surprise, the Acer laptop serving it over a 5 Mbit/s uplink on a residential DSL connection survived the traffic just fine, though I didn’t have many images on my site at the time.
2. The relevant GitHub issues are: #1029, #361.
