My personal intelligence service

The subject can be slightly more accurately expressed in other languages:

🇨🇿 Osobní výzvědná služba
🇵🇱 Osobista służba wywiadowcza
🇷🇺 Личная разведывательная служба

All variants have remarkably different connotations.

View of the Říp mountain over the Czech Institute of Measurements
Figure 1. The Czech signals intelligence agency in Litoměřice…

Goals

To be notified when people talk about me, and more generally to delegate checking on various things, most of them publicly accessible (open-source intelligence).

Tools

At my first job, a colleague who stood behind much of the IT infrastructure showed me how Unix enables assembling and controlling a fairly large system with relative ease, using shell scripts and various standard utilities. So that’s what I use.

The prerequisites for replicating my experience are:

  • a Linux machine running non-stop as a multi-purpose server (BSD would also work),

  • having the mail program configured so that it can send out notifications (a quick check follows this list),[1]

  • some non-standard but still basic software: cURL, Perl, jq, yq (as a pretty-printer),

  • knowledge of how to put it all together.
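
A quick way to verify the mail prerequisite, assuming a working local MTA with a root alias that actually reaches you:

# This should arrive in whatever mailbox the "root" alias points to
echo 'It works.' | mail -s 'Test notification' root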

Вот и всё. (That’s all.)

A general-purpose script

It turns out that most of what I care about is one HTTP GET request away. In particular, I want to know when the contents of an address change, and most of the time only a subset of those contents matters. These requirements have led to a straightforward shell script:

#!/bin/sh -e
status=0 workdir=watch accept='*/*' subject=
mkdir -p $workdir

check() {
  # '/' is a forbidden filename character, so substitute it
  local url=$1 f=$workdir/$(echo "$1" | sed 's|/|\\|g')
  # Reddit's API doesn't like cURL's default User-Agent, so change it
  if ! curl --user-agent Skynet --header "Accept: $accept" \
    --no-progress-meter --location --output "$f.download" "$url"; then
    # Problems can be intermittent, don't let it abort right away
    status=1
  else
    shift
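    # Don't let a failing filter (e.g., grep with no matches) abort the -e script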
    "$@" <"$f.download" >"$f.filtered" || :
    if [ -f "$f" ] && ! diff -u "$f" "$f.filtered" >"$f.diff"; then
      mail -s "Updated: ${subject:-$url}" root <"$f.diff" || status=1
    fi
    mv "$f.filtered" "$f"
  fi
}

# Place for calls to retrieve and process stuff from the Internet

exit $status

I launch it once a day from a systemd timer. Unlike with cron, this won’t automatically send me an e-mail when the job fails, so I’ve wrapped the invocation in yet another little script:

#!/bin/sh
# Usage: mail-wrap.sh COMMAND ARGS...
"$@" >"$1.log" 2>&1 || mail -s "$1 failed" root <"$1.log"

Usage

As a minimal real-world example, to watch for a bugfix in EUC World, I would add a line like:

check 'https://euc.world/downloads/EUC%20World%20Release%20Notes.pdf' cat

The trailing filtering command isn’t optional, so I resort to passing the document unchanged through cat here.

With binary files like these, the intelligence of diff comes in handy, as it will merely state that the saved copy differs from the downloaded one. And if I desperately needed an actual plain-text diff in my mail, I would employ the conversion utility from Xpdf/Poppler:

check 'https://euc.world/downloads/EUC%20World%20Release%20Notes.pdf' \
  pdftotext - -

Are you in love with Unix yet?


The many parabolic antennas of the Czech Institute of Measurements
Figure 2. …are truly awesome listeners (but offer a low, pay-graded salary, such a clownery)

Resource catalogue

Even though I’m still building it up as obvious needs arise, I’ve already got a decent collection of ‘information extractors’. You’re more than welcome to reuse them.

I’ll be happy to learn of more generally useful public HTTP APIs and trivially scrapable pages.

MangaDex

New translated chapters of manga:

check_mangadex() {
  local url=https://api.mangadex.org/manga/$1
  check "$url"'/feed?order\[chapter\]=desc&translatedLanguage[]=en' \
    jq -r '.data[] |
      "https://mangadex.org/chapter/\(.id) \(
        .attributes.title // .attributes.chapter)"'
}

subject='Shimeji Simulation' \
check_mangadex '28b5d037-175d-4119-96f8-e860e408ebe9'

GitHub releases

New releases of projects on GitHub:

check_release() {
  accept=application/vnd.github+json \
    check "https://api.github.com/repos/$1/releases/latest" jq -r .name
}

check_release 'stedolan/jq'

Also worth noting is that GitHub repositories have Atom feeds (…/commits/OBJECT.atom).
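
For instance, the generic check function could watch a branch’s commit feed; the Perl one-liner is only a crude sketch for pulling the entry titles (the commit summary lines) out of the XML:

# Keep just the contents of the feed's <title> elements
check 'https://github.com/stedolan/jq/commits/master.atom' \
  perl -lne 'print $1 while m|<title[^>]*>(.*?)</title>|g'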

Czech Post package tracking

They have a simple public XML/JSON API that isn’t terribly well documented, but it’s more than enough. As soon as I get a package number from AliExpress, I put it in my script, and I know when to expect the postman.

check_cpost() {
  # The "unknown ID" state always contains the current date, filter it out
  check "https://b2c.cpost.cz/services/ParcelHistory/getDataAsJson?idParcel=$1" \
    yq -P 'del(.[].states.state[] | select(.id == "-3"))'
}

check_cpost LF9876543210F

Reformatting the results as YAML has proven to be a neat means of prettification.

Fio banka

Despite some annoying rate-limiting, this bank’s API can also spew JSON/… reports, and its documentation is fairly straightforward.
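
A sketch of a watcher might then look as follows; the endpoint and response structure come from that documentation (verify the details before relying on them), and the hypothetical FIO_TOKEN is generated in the bank’s web administration:

check_fio() {
  # The "last" endpoint returns transactions since the previous download;
  # each token only tolerates a request once in a while
  check "https://fioapi.fio.cz/v1/rest/last/$1/transactions.json" \
    yq -P '.accountStatement.transactionList.transaction'
}

check_fio "$FIO_TOKEN"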

Note for EU residents: it’s possible to open accounts in other member states, but you’ll need to visit a branch of that bank in person.

Hacker News, Lobsters, Reddit

I’ve dealt with these sites in depth in a separate article.

Note that F5Bot provides a similar service.

GitHub mentions

The search API allows 10 unauthenticated requests per minute, which is enough to just sleep between requests instead of adding extra logic to manage rate-limiting. And with results sorted from the newest, paging needn’t be handled either.

# As of version 4.21.1, yq doesn't fully implement jq's sort_by
prettyjq() (jq "[$1]" | yq -P '.[]')

check_github_1() {
  accept=application/vnd.github.v3.text-match+json \
    check "https://api.github.com/search/$1&order=desc&per_page=100" \
    prettyjq "$2"
  sleep $((60 / 10 + 1))
}

check_github() {
  local query=q=$(fragment=$1 jq -rn 'env.fragment | @uri')
  check_github_1 "issues?$query&sort=created" '.items[].text_matches'
  check_github_1 "commits?$query&sort=committer-date" \
    '.items | sort_by(.commit.committer.date, .repository.id) | reverse |
    .[].text_matches'
}

check_github '"Premysl Eric Janouch"'

yq conveniently unrolls text matches to multiple lines, which isn’t possible with just JSON, whose strings can’t contain literal newlines. It can be mildly problematic to work with, though.

Google

They will throw captchas at you once the algorithm starts to think you’re a bot, but the free tier of the semi-useful[2] Custom Search JSON API is nothing to frown at. To set up the Programmable Search Engine, enter a domain like example.com, then set Search the entire web to ON.

Due to the necessity of paging through results, and the generally different nature of this task, it has its own special script. Perhaps I should have gone for the Go module and spared myself some time, but I’m happy with what I ended up with (slightly abbreviated):

#!/bin/sh -e
status=0 workdir=google
mkdir -p $workdir
cd $workdir

fetch() {
  local IFS=\&
  # Problems can be intermittent, don't let it abort right away
  if ! curl "$*" --compressed --fail-with-body --no-progress-meter; then
    status=1
    return 1
  fi
}

quote() (fragment=$1 jq -rn 'env.fragment | @uri')

google() {
  local query=$1 qq=$(quote "$1")
  local results=results.$qq seen=seen.$qq update=update.$qq new=new.$qq
  local start=1
  # Truncate the collected-results file, since the loop below appends to it
  : >$results

  # "cx" is your "search engine ID", "key" is your API key
  while [ "$start" -gt 0 ] && fetch >download \
    "https://customsearch.googleapis.com/customsearch/v1?cx=XXX&key=XXX" \
    exactTerms=$qq sort=$(quote date:d:s) start=$start filter=0; do
    jq -r 'try .items[] | "\(.link) \(.title)"' download >>$results
    start=$(jq -r 'try .queries.nextPage[0].startIndex // 0' download)
  done

  # Collect a database of all links we've ever seen, notify about increments
  touch $seen
  sort -u $seen $results >$update
  comm -13 $seen $update >$new
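  # With -E, mail(1) discards the message if it ends up with an empty body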
  mail -E -s "New search results: $query" root <$new || status=1
  mv $update $seen
}

google 'weapons-grade uranium'
google 'Waldo'

exit $status

It produces fairly limited results even with the filtering turned off, so I’m not even close to exhausting the daily request quota. Should that change, cycling through queries based on the day of the week seems like a good workaround.
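
That rotation could be as simple as a switch on the weekday:

# A sketch: spread the queries over the week to stay under the quota
case $(date +%u) in
  [1-4]) google 'weapons-grade uranium' ;;
  *)     google 'Waldo' ;;
esac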

A simpler alternative is Google Alerts, which can do e-mail digests as well as RSS.


1. It’s a good idea to occasionally review the mail you send out. I discovered Gitea spammers this way, through account activation requests.
2. Disregarding its generally abysmal qualities in recent years.
