The subject can be slightly more accurately expressed in other languages:
🇨🇿 Osobní výzvědná služba
🇵🇱 Osobista służba wywiadowcza
🇷🇺 Личная разведывательная служба
All variants have remarkably different connotations.
To be notified when people talk about me, and more generally to delegate checking on various things—most of them publicly accessible (open-source intelligence).
In my first job, a colleague who was behind much of the IT infrastructure showed me how Unix enables assembling and controlling a fairly large system with relative ease, using shell scripts and various standard utilities. So that’s what I use.
The prerequisites for replicating my experience are:
That’s all.
It turns out that most of what I care about is one HTTP GET request away. In particular, I want to know when the contents of an address change, and most of the time only a subset of them matters. These requirements have led to a straightforward shell script:
#!/bin/sh -e
status=0 workdir=watch accept=*/* subject=
mkdir -p $workdir
check() {
	# '/' is a forbidden filename character, so substitute it
	local url=$1 f=$workdir/$(echo "$1" | sed 's|/|\\|g')
	# Reddit's API doesn't like cURL's default User-Agent, so change it
	if ! curl --user-agent Skynet --header "Accept: $accept" \
			--no-progress-meter --location --output "$f.download" "$url"; then
		# Problems can be intermittent, don't let it abort right away
		status=1
	else
		shift
		"$@" <"$f.download" >"$f.filtered" || :
		if [ -f "$f" ] && ! diff -u "$f" "$f.filtered" >"$f.diff"; then
			mail -s "Updated: ${subject:-$url}" root <"$f.diff" || status=1
		fi
		mv "$f.filtered" "$f"
	fi
}
# Place for calls to retrieve and process stuff from the Internet
exit $status
I launch it once a day from a systemd timer. Unlike with cron, this won’t automatically send me an e-mail when the job fails, so I’ve wrapped the invocation in yet another little script:
#!/bin/sh
# Usage: mail-wrap.sh COMMAND ARGS...
"$@" >"$1.log" 2>&1 || mail -s "$1 failed" root <"$1.log"
As a minimal real-world example, to watch for a bugfix in EUC World, I would add a line like:
check 'https://euc.world/downloads/EUC%20World%20Release%20Notes.pdf' cat
The trailing filtering command isn’t optional, so I resort to passing the document unchanged through cat here.
With binary files like these, the intelligence of diff comes in handy, as it will merely state that the saved copy differs from the downloaded one. And if I desperately needed an actual plain-text diff in my mail, I would employ the conversion utility from Xpdf/Poppler:
check 'https://euc.world/downloads/EUC%20World%20Release%20Notes.pdf' \
	pdftotext - -
Are you in love with Unix yet?
Even though I’m still in the process of building it up as obvious needs arise, I’ve already got a decent collection of ‘information extractors’. You’re more than free to reuse them.
I’ll be happy to learn of more generally useful public HTTP APIs and trivially scrapable pages.
New translated chapters of manga:
check_mangadex() {
	local url=https://api.mangadex.org/manga/$1
	# Escape the brackets so that cURL's URL globbing leaves them alone
	check "$url"'/feed?order\[chapter\]=desc&translatedLanguage\[\]=en' \
		jq -r '.data[] |
			"https://mangadex.org/chapter/\(.id) \(
				.attributes.title // .attributes.chapter)"'
}
subject='Shimeji Simulation' \
check_mangadex '28b5d037-175d-4119-96f8-e860e408ebe9'
New releases of projects on GitHub:
check_release() {
	accept=application/vnd.github+json \
	check "https://api.github.com/repos/$1/releases/latest" jq -r .name
}
check_release 'stedolan/jq'
Also worth noting is that GitHub repositories have Atom feeds (…/commits/OBJECT.atom).
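For example, a hypothetical watch on jq’s commit feed could look as follows; the sed filter keeps only the entry titles (assuming each <title> element fits on a single line), which makes for terse diffs:
check_commits() {
	# Keep only the entry titles, i.e., the commit subjects
	check "https://github.com/$1/commits/$2.atom" \
		sed -n 's|.*<title>\(.*\)</title>.*|\1|p'
}
check_commits 'stedolan/jq' 'master'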
Czech Post has a simple public XML/JSON API that isn’t terribly well documented, but it’s more than enough. As soon as I get a package number from AliExpress, I put it in my script, and I know when to expect the postman.
check_cpost() {
	# The "unknown ID" state always contains the current date, filter it out
	check "https://b2c.cpost.cz/services/ParcelHistory/getDataAsJson?idParcel=$1" \
		yq -P 'del(.[].states.state[] | select(.id == "-3"))'
}
check_cpost LF9876543210F
Reformatting the results as YAML has proven to be a neat means of prettification.
Despite some annoying rate-limiting, this bank's API can also spew JSON/… reports, and its documentation is fairly straightforward.
Note for EU residents: it’s possible to create accounts in other states, but you’ll need to personally visit a branch of that bank.
I’ve dealt with these sites in depth in a separate article.
Note that F5Bot provides a similar service.
GitHub’s search API allows 10 unauthenticated requests per minute, enough to just sleep between requests instead of adding extra logic to manage rate limiting. When results are received sorted from newest to oldest, paging needn’t be handled either.
# As of version 4.21.1, yq doesn't fully implement jq's sort_by
prettyjq() (jq "[$1]" | yq -P '.[]')
check_github_1() {
	accept=application/vnd.github.v3.text-match+json \
	check "https://api.github.com/search/$1&order=desc&per_page=100" \
		prettyjq "$2"
	sleep $((60 / 10 + 1))
}
check_github() {
	local query=q=$(fragment=$1 jq -rn 'env.fragment | @uri')
	check_github_1 "issues?$query&sort=created" '.items[].text_matches'
	check_github_1 "commits?$query&sort=committer-date" \
		'.items | sort_by(.commit.committer.date, .repository.id) | reverse |
			.[].text_matches'
}
check_github '"Premysl Eric Janouch"'
yq conveniently unrolls text matches to multiple lines, which isn’t possible with just JSON. It can be mildly problematic to work with, though.
Google will throw captchas at you once its algorithm starts to think you’re a bot, but the free tier of the semi-useful[2] Custom Search JSON API is nothing to frown at. To set up the Programmable Search Engine, enter a domain like example.com, then set Search the entire web to ON.
Due to the necessity of paging through results, and the generally different nature of this task, it has its own special script. Perhaps I should have gone for the Go module and spared myself some time, but I’m happy with what I ended up with (slightly abbreviated):
#!/bin/sh -e
status=0 workdir=google
mkdir -p $workdir
cd $workdir
fetch() {
	local IFS=\&
	# Problems can be intermittent, don't let it abort right away
	if ! curl "$*" --compressed --fail-with-body --no-progress-meter; then
		status=1
		return 1
	fi
}
quote() (fragment=$1 jq -rn 'env.fragment | @uri')
google() {
	local query=$1 qq=$(quote "$1")
	local results=results.$qq seen=seen.$qq update=update.$qq new=new.$qq
	local start=1 >$results
	# "cx" is your "search engine ID", "key" is your API key
	while [ "$start" -gt 0 ] && fetch >download \
			"https://customsearch.googleapis.com/customsearch/v1?cx=XXX&key=XXX" \
			exactTerms=$qq sort=$(quote date:d:s) start=$start filter=0; do
		jq -r 'try .items[] | "\(.link) \(.title)"' download >>$results
		start=$(jq -r 'try .queries.nextPage[0].startIndex // 0' download)
	done
	# Collect a database of all links we've ever seen, notify about increments
	touch $seen
	sort -u $seen $results >$update
	comm -13 $seen $update >$new
	mail -E -s "New search results: $query" root <$new || status=1
	mv $update $seen
}
google 'weapons-grade uranium'
google 'Waldo'
exit $status
It produces fairly limited results even with the filtering turned off, so I’m not even close to exhausting the daily request quota. Should that change, cycling through queries based on what day it is seems like a good workaround.
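Should it ever come to that, a possible sketch would alternate the two placeholder queries from above based on the day of the year, instead of running both unconditionally:
# Run only one of the queries per day
day=$(date +%j | sed 's/^0*//')  # strip the zero padding, which reads as octal
case $((day % 2)) in
	0) google 'weapons-grade uranium' ;;
	*) google 'Waldo' ;;
esac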
A simpler alternative is Google Alerts, which can do e-mail digests as well as RSS.