My personal intelligence service
================================
:date: 2022-03-09

The subject can be slightly more accurately expressed in other languages:

🇨🇿 _Osobní výzvědná služba_ +
🇵🇱 _Osobista służba wywiadowcza_ +
🇷🇺 _Личная разведывательная служба_

All variants have remarkably different connotations.

.The Czech signals intelligence agency in Litoměřice...
image::/files/nsa-litomerice.jpg[View of the Říp mountain over the Czech Institute of Measurements, 1280, 350]

Goals
-----
To be notified when people talk about me, and more generally to delegate
checking on various things--most of them publicly accessible (open-source
intelligence).

Tools
-----
In my first job, a colleague who stood behind much of the IT infrastructure
showed me how Unix enables assembling and controlling a fairly large system
with relative ease, using shell scripts and various standard utilities.
So that's what I use.

The prerequisites for replicating my experience are:

- a Linux machine running non-stop as a multi-purpose server
  (BSD would also work),
- having the `mail` program configured, so that it can send out
  notifications,footnote:[It's a good idea to occasionally review the mail you send out. I discovered Gitea spammers this way, through account activation requests.]
- some non-standard but still basic software: cURL, Perl, jq,
  https://github.com/mikefarah/yq[yq] (as a pretty-printer),
- knowledge of how to put it all together.

_Вот и всё._ (That's all.)

A general-purpose script
~~~~~~~~~~~~~~~~~~~~~~~~
It turns out that most of what I care about is one _HTTP GET request_ away.
In particular, I want to know when the contents of an address _change_,
and most of the time only a _subset_ of them is of interest.

These requirements have led to a straightforward shell script:

```sh
#!/bin/sh -e
status=0 workdir=watch accept=*/* subject=
mkdir -p $workdir

check() {
	# '/' is a forbidden filename character, so substitute it
	local url=$1 f=$workdir/$(echo "$1" | sed 's|/|\\|g')

	# Reddit's API doesn't like cURL's default User-Agent, so change it
	if ! curl --user-agent Skynet --header "Accept: $accept" \
		--no-progress-meter --location --output "$f.download" "$url"; then
		# Problems can be intermittent, don't let it abort right away
		status=1
	else
		shift
		"$@" <"$f.download" >"$f.filtered" || :
		if [ -f "$f" ] && ! diff -u "$f" "$f.filtered" >"$f.diff"; then
			mail -s "Updated: ${subject:-$url}" root <"$f.diff" || status=1
		fi
		mv "$f.filtered" "$f"
	fi
}

# Place for calls to retrieve and process stuff from the Internet

exit $status
```

I launch it once a day from a
link:internet-corpse-management.html#_stage_6_repeat[systemd timer].
Unlike with cron, this won't automatically send me an e-mail when the job
fails, so I've wrapped the invocation in yet another little script:

```sh
#!/bin/sh
# Usage: mail-wrap.sh COMMAND ARGS...
"$@" >"$1.log" 2>&1 || mail -s "$1 failed" root <"$1.log"
```

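For reference, the pair of units behind such a daily timer can be sketched
roughly as follows; the unit names, installation paths, and the exact
schedule below are only illustrative:

```sh
# A rough sketch, not a definitive setup: a oneshot service wrapped
# in mail-wrap.sh, fired once a day by a timer of the same name.
cat >/etc/systemd/system/watch.service <<'EOF'
[Unit]
Description=Personal intelligence service

[Service]
Type=oneshot
# The watch script uses relative paths, so give it a stable home.
WorkingDirectory=/opt/watch
ExecStart=/opt/watch/mail-wrap.sh /opt/watch/watch.sh
EOF

cat >/etc/systemd/system/watch.timer <<'EOF'
[Unit]
Description=Run the watch script daily

[Timer]
OnCalendar=daily
# Catch up on runs missed while the machine was down.
Persistent=true

[Install]
WantedBy=timers.target
EOF

systemctl daemon-reload
systemctl enable --now watch.timer
```

A timer activates the service of the same name by default,
so no further glue is needed.
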
Usage
~~~~~
As a minimal real-world example, to watch for a bugfix in EUC World,
I would add a line like:

```sh
check 'https://euc.world/downloads/EUC%20World%20Release%20Notes.pdf' cat
```

The trailing filtering command isn't optional, so I resort to passing the
document through `cat` unchanged here. With binary files like these, the
intelligence of `diff` comes in handy, as it will merely state that the
saved copy differs from the downloaded one.

And if I desperately needed an actual plain-text diff in my mail,
I would employ the conversion utility from Xpdf/Poppler:

```sh
check 'https://euc.world/downloads/EUC%20World%20Release%20Notes.pdf' \
	pdftotext - -
```

Are you in love with Unix yet?

:ellipsis: …

.{ellipsis}are truly awesome listeners (but offer a low, pay-graded salary--such a clownery)
image::/files/nsa-litomerice-south.jpg[The Czech Institute of Measurements's many parabolic antennas, 1280, 300]

Resource catalogue
------------------
Even though I'm still in the process of building it up as obvious needs
arise, I've already got a decent collection of '`information extractors`'.
You're more than free to reuse them. I'll be happy to learn of more
generally useful public HTTP APIs and trivially scrapable pages.

MangaDex
~~~~~~~~
New translated chapters of manga:

```sh
check_mangadex() {
	local url=https://api.mangadex.org/manga/$1
	check "$url"'/feed?order\[chapter\]=desc&translatedLanguage[]=en' \
		jq -r '.data[] | "https://mangadex.org/chapter/\(.id) \(
			.attributes.title // .attributes.chapter)"'
}

subject='Shimeji Simulation' \
check_mangadex '28b5d037-175d-4119-96f8-e860e408ebe9'
```

GitHub releases
~~~~~~~~~~~~~~~
New releases of projects on GitHub:

```sh
check_release() {
	accept=application/vnd.github+json \
	check "https://api.github.com/repos/$1/releases/latest" jq -r .name
}

check_release 'stedolan/jq'
```

Also worth noting is that GitHub repositories have Atom feeds
(_.../commits/OBJECT.atom_).

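To sketch how such a feed might plug into the same machinery (the repository,
the branch, and the crude Perl filter below are purely illustrative):

```sh
check_commits() {
	# Crudely pull out the <title> of every feed entry; entities stay
	# escaped, which is good enough for a notification mail.
	subject="Commits: $1" \
	check "https://github.com/$1/commits/$2.atom" \
		perl -0777 -ne 'print "$1\n" while m{<title>\s*(.*?)\s*</title>}gs'
}

check_commits 'mikefarah/yq' 'master'
```

The feed's own title gets caught as well, but since it never changes,
it will never show up in a diff.
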
Czech Post package tracking
~~~~~~~~~~~~~~~~~~~~~~~~~~~
They have a https://b2c.cpost.cz/[simple public XML/JSON API] that isn't
terribly well documented, but it's more than enough. As soon as I get
a package number from AliExpress, I put it in my script, and I know when
to expect a postman.

```sh
check_cpost() {
	# The "unknown ID" state always contains the current date, filter it out
	check "https://b2c.cpost.cz/services/ParcelHistory/getDataAsJson?idParcel=$1" \
		yq -P 'del(.[].states.state[] | select(.id == "-3"))'
}

check_cpost LF9876543210F
```

Reformatting the results as YAML has proven to be a neat means of
prettification.

Fio banka
~~~~~~~~~
Despite some annoying rate-limiting,
https://www.fio.cz/bank-services/internetbanking-api[this bank's API]
can also spew JSON/... reports, and its documentation is fairly
straightforward.

Note for EU residents: it's possible to create accounts in other member
states, but you'll need to personally visit a branch of that bank.

Hacker News, Lobsters, Reddit
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I've dealt with these sites in depth in
link:aggregating-blog-comments.html[a separate article].
Note that https://f5bot.com/[F5Bot] provides a similar service.

GitHub mentions
~~~~~~~~~~~~~~~
https://docs.github.com/en/rest/reference/search[The search API]
https://docs.github.com/en/rest/reference/search#rate-limit[allows]
10 unauthenticated requests per minute, which is enough to just `sleep`
between requests instead of adding extra logic to manage rate limiting.
When the results are requested sorted from the newest, paging needn't be
handled either.

```sh
# As of version 4.21.1, yq doesn't fully implement jq's sort_by
prettyjq() (jq "[$1]" | yq -P '.[]')

check_github_1() {
	accept=application/vnd.github.v3.text-match+json \
	check "https://api.github.com/search/$1&order=desc&per_page=100" \
		prettyjq "$2"
	sleep $((60 / 10 + 1))
}

check_github() {
	local query=q=$(fragment=$1 jq -rn 'env.fragment | @uri')
	check_github_1 "issues?$query&sort=created" '.items[].text_matches'
	check_github_1 "commits?$query&sort=committer-date" \
		'.items | sort_by(.commit.committer.date, .repository.id) | reverse
			| .[].text_matches'
}

check_github '"Premysl Eric Janouch"'
```

_yq_ conveniently unrolls text matches to multiple lines, which isn't
possible with just JSON. It can be mildly problematic to work with, though.

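As for the rate limit, should the fixed `sleep` ever stop being enough,
GitHub will report the remaining allowance directly; a small sketch using
its public rate-limit endpoint, which doesn't itself count against the quota:

```sh
# Show how many search requests remain in the current window,
# and when that window resets (as a Unix timestamp).
curl --no-progress-meter https://api.github.com/rate_limit |
	jq '.resources.search | {remaining, reset}'
```
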
////
Twitter
~~~~~~~
I've given up on this, because '`Essential access`' requires '`having
a verified phone number on file prior to submitting application`',
and without '`Elevated access`', the API won't let you search.
Could use a burner SIM, but it's still way too much work.
////

Google
~~~~~~
They will throw captchas at you once the algorithm starts to think you're
a bot, but the free tier of the
https://support.google.com/programmable-search/answer/70392[semi-useful]footnote:[Disregarding https://www.newyorker.com/culture/infinite-scroll/what-google-search-isnt-showing-you[its generally abysmal qualities in recent years].]
https://developers.google.com/custom-search/v1/overview[Custom Search JSON API]
is nothing to frown at. To set up the Programmable Search Engine, enter
a domain like `example.com`, then set _Search the entire web_ to _ON_.

Due to the necessity of paging through results, and the generally different
nature of this task, it has its own special script. Perhaps I should have
gone for the
https://pkg.go.dev/google.golang.org/api/customsearch/v1[Go module]
and spared myself some time, but I'm happy with what I ended up with
(slightly abbreviated):

```sh
#!/bin/sh -e
status=0 workdir=google
mkdir -p $workdir
cd $workdir

fetch() {
	local IFS=\&
	# Problems can be intermittent, don't let it abort right away
	if ! curl "$*" --compressed --fail-with-body --no-progress-meter; then
		status=1
		return 1
	fi
}

quote() (fragment=$1 jq -rn 'env.fragment | @uri')

google() {
	local query=$1 qq=$(quote "$1")
	local results=results.$qq seen=seen.$qq update=update.$qq new=new.$qq
	local start=1

	>$results
	# "cx" is your "search engine ID", "key" is your API key
	while [ "$start" -gt 0 ] && fetch >download \
		"https://customsearch.googleapis.com/customsearch/v1?cx=XXX&key=XXX" \
		exactTerms=$qq sort=$(quote date:d:s) start=$start filter=0; do
		jq -r 'try .items[] | "\(.link) \(.title)"' download >>$results
		start=$(jq -r 'try .queries.nextPage[0].startIndex // 0' download)
	done

	# Collect a database of all links we've ever seen, notify about increments
	touch $seen
	sort -u $seen $results >$update
	comm -13 $seen $update >$new
	mail -E -s "New search results: $query" root <$new || status=1
	mv $update $seen
}

google 'weapons-grade uranium'
google 'Waldo'

exit $status
```

It produces fairly limited results even with the filtering turned off,
so I'm not even close to exhausting the daily request quota. Should that
change, cycling through queries based on what day it is seems like a good
workaround.

A simpler alternative is https://www.google.com/alerts[Google Alerts],
which can do e-mail digests as well as RSS.

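The RSS variant can even be funneled through the `check` script from the
beginning of this article, keeping all the notifications in one pipeline.
A rough sketch, where the feed URL is a placeholder for the one the Alerts
page generates and the Perl one-liner merely keeps entry titles:

```sh
# The two numeric components of the feed address are placeholders; copy
# the real URL from the Google Alerts page. Titles arrive HTML-escaped,
# which is good enough for a notification mail.
subject='Google Alert: Waldo' \
check 'https://www.google.com/alerts/feeds/00000000000000000000/11111111111111111111' \
	perl -0777 -ne 'print "$1\n" while m{<title[^>]*>(.*?)</title>}gs'
```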