Managing dead links

The World Wide Web is a great thing, but one major problem inseparably joined with pointing to random web pages is how often that content either moves to a different location, or plainly disappears from the face of the Earth, never to be seen again.

For many years, I avoided tackling the issue head-on, until I finally got fed up with randomly discovering that yet another link in one of my numerous pages and documents leads nowhere. Playing detective to figure out what the heck I made a note of five years ago is no fun.

Here I’ll describe how I’ve come to deal with this nuisance in a semi-automated fashion, with help from my Unix-compatible server. Should you decide to take the same path, see the beginning of the personal intelligence service entry for a quick summary of required tools--we’re writing another shell script!

Stage 0: Boilerplate

Create a file named, for example, link-archiver.sh, and chmod it executable.

#!/bin/sh -e
workdir=archive
mkdir -p "$workdir"

# Just C wouldn't work, since mail(1) would reject UTF-8 input later
export LC_ALL=C.UTF-8

tempfile=$(mktemp) log=link-archiver.log fails=link-archiver.fails
lastlog=link-archiver.lastlog lastfails=link-archiver.lastfails
trap "rm -f -- '$tempfile'" HUP INT TERM EXIT

# To circumvent the read command's handling of sequences of whitespace,
# prefer the ASCII Unit Separator over, e.g., Horizontal Tabulation
us=$(printf "\037")
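
To see what the separator choice buys us, here is a minimal sketch, assuming a POSIX shell. The split_with helper and the sample record are made up for illustration and aren't part of the script. Tab counts as IFS whitespace, so read collapses a run of tabs and the empty middle field vanishes, while the Unit Separator keeps it in place.

```shell
us=$(printf '\037')
tab=$(printf '\t')

# Hypothetical helper: split a three-field record whose middle field
# is empty, using the given separator both in the data and as IFS.
split_with() {
  sep=$1
  printf 'a%s%sc\n' "$sep" "$sep" | {
    IFS=$sep read -r first second third
    printf '[%s][%s][%s]\n' "$first" "$second" "$third"
  }
}

with_tab=$(split_with "$tab")  # the empty field gets swallowed
with_us=$(split_with "$us")    # the empty field survives
echo "tab: $with_tab"
echo "us:  $with_us"
```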

Stage 1: Collect links

We’ll continue with a pair of functions that will form a pipeline.

collect_files() {
  find "$HOME/"Documents/ -type f
  find /srv/http/htdocs/ -type f -name '*.html'
  find "$HOME/"Projects/ -path "$HOME/Projects/.stversions" -prune -o \
    -type f \( -name '*.adoc' -o -name CMakeLists.txt -o \
      -name '*.sh' -o -name '*.go' -o -name '*.lua' -o \
      -name '*.c' -o -name '*.cpp' -o -name '*.h' \) -print
}

All files that I care about happen to be conveniently co-located on a single machine, and find makes it fairly easy to enumerate where to look, or what to leave out--such as back-up copies.
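
The pruning part is easiest to understand on a throwaway tree. The following is a rough sketch, assuming mktemp -d is available; the directory layout and file names are invented for the demonstration. Only the file outside the back-up directory that also matches the name filter should be listed.

```shell
# Build a disposable tree with one real file, one non-matching file,
# and one back-up copy that we want find to skip.
demo=$(mktemp -d)
mkdir -p "$demo/Projects/app" "$demo/Projects/.stversions/app"
: >"$demo/Projects/app/main.c"
: >"$demo/Projects/app/notes.txt"
: >"$demo/Projects/.stversions/app/main.c"

# -prune stops find from descending into the back-up directory;
# everything else falls through to the name filter and -print.
found=$(find "$demo/Projects" -path "$demo/Projects/.stversions" -prune -o \
  -type f \( -name '*.c' -o -name '*.sh' \) -print)
echo "$found"

rm -rf -- "$demo"
```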

extract_links() {
  perl -lne 'use strict; use warnings;
  my $code = /\.(c|cc|cpp|h|hpp|go|lua|sh|js|po|pot)$/i;
  my $sgml = /\.(xml|html)$/i;
  next if -B || !open(my $fh, "<", $_);

  my $linenumber = 0;
  while (my $line = <$fh>) {
    $linenumber++;

    # Source code likes to glue bullshit fragments together,
    # so only look at comments there
    next if $code && $line !~ m@^\s*(#|//)@;

    $line =~ s/&amp;/&/g if $sgml;
    while ($line =~ m@(https?://[^][(){}<>"'\''\s]+
      [^][(){}<>"'\''\s,.:])(.?)@gx) {
      print "$_:${linenumber}\037${1}";
    }
  }'
}

Next, we need to milk them for links. Perl has this beautiful -B file test operator to throw away any binary files. It’s also great at text processing, making short work of expressing how to extract what looks like URLs from various kinds of files.

In my case, I special-case source code and SGML-derived languages. What I’ve left out from this example is handling of line-wrapped oversized links, something I’m prone to doing.

Stage 2: Archive

While something bloated like ArchiveBox might be more appropriate for more serious usage, our downloader of choice will be the trusty cURL. It will play two roles: make a shallow copy of an address if one hasn’t been made yet, then check if it’s still up. Let’s factor out most of its command line switches to keep these invocations succinct.

invoke_curl() {
  # CloudFront and others don't like custom user-agents, so lie to them
  curl --no-progress-meter --compressed --location --connect-timeout 15 \
    --fail-with-body --insecure --speed-time 15 --user-agent \
    "Mozilla/5.0 (Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0" "$@"
}

Now, to join this all together:

# '/' is a forbidden filename character, so substitute it.
# gs_lcp is a terribly long protobuf argument used by Google.
archivename() (echo "$1" | sed 's|/|\\|g; s|&gs_lcp=[^&#]*||' | head -c 255)

collect_files | extract_links | sort -u | while IFS=$us read -r where url
do
  where=${where#$HOME/} archived=$workdir/$(archivename "$url")
  write_out=$(echo "$where" | sed 's|%|%%|g'
    )"$us%{url}$us%{http_code}$us%{errormsg}\n"

  printf -- "-- %s\t%s\n" "$where" "$url" >&2

  # If we've already downloaded it, simply check if it's still there.
  # Several sites (e.g., AliExpress, Amazon, Hacker News) don't support HEAD,
  # and may fail with more or less random codes (404, 405), so stick to GET.
  if [ -s "$archived" ]
  then invoke_curl -w "$write_out" --output /dev/null "$url" || :
  elif invoke_curl -w "$write_out" --output "$tempfile" "$url"
  then zstd --quiet -o "$archived" -- "$tempfile"
  fi
done >"$log"

All URLs undergo a simple transformation to turn them into valid filenames, and any content is subsequently stored compressed in the ‘archive’ directory.
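
As a quick illustration of that transformation, this sketch feeds a made-up search URL through the same sed expressions. The only deviation from the function above is printf in place of echo, which is an optional tweak to keep backslashes in the input from being interpreted by some shells.

```shell
# Same substitutions as in the script: slashes become backslashes,
# Google's gs_lcp parameter is dropped, the result is capped at 255 bytes.
archivename() (
  printf '%s\n' "$1" | sed 's|/|\\|g; s|&gs_lcp=[^&#]*||' | head -c 255
)

name=$(archivename 'https://www.google.com/search?q=curl&gs_lcp=Cgdnd3Mtd2l6&hl=en')
echo "$name"
```

A stored copy can later be read back with zstd -dc.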

Stage 3: Pick out failures

Let’s reprocess the log file we’ve made cURL provide us with, and insert any links we manage to gather through the Internet Archive's API.

# Truncate and reuse the temporary file for a human-readable report
: >"$tempfile"

while IFS=$us read -r where url code message
do
  # Non-200 success responses are weird, bring them to our attention
  [ "$code" = 200 ] && continue

  # The Internet Archive will happily and pointlessly recurse;
  # also boldly assume that nothing ever disappears from there
  case "$url" in
  *://localhost[:/]*) continue ;;
  *://web.archive.org/*) continue ;;
  esac

  printf -- "== %s\t%s\n" "$where" "$url" >&2

  ia=$(
    curl --no-progress-meter "https://archive.org/wayback/available?url=$(
      fragment=$url jq -rn 'env.fragment | @uri'
    )" | jq -r '.archived_snapshots.closest.url // empty' || echo ?)
  [ -s "$workdir/$(archivename "$url")" ] && notes="{saved}" || notes=

  # Line numbers may shift around, exclude them from comparisons
  echo "${where%:*} $(echo "$message" | sed '
    s/ after [0-9]* ms//; s/Operation too slow.*/Read timeout/
  ') ($code) $url [$ia] $notes" >>"$tempfile"

  # Make sure text editors find the file by making paths absolute again
  [ "${where#/}" = "$where" ] && where=$HOME/$where
  echo "$where:$message ($code) $url [$ia] $notes"
done <"$log" >link-archiver.quickfix
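
The two normalizations in that loop can be tried out in isolation. This is only a sketch: the normalize helper is a name I've made up here, and the sample messages and path are illustrative. The point, presumably, is that neither a shifted line number nor a slightly different millisecond count should make an old failure look new in the comparison that follows.

```shell
# Hypothetical wrapper around the sed expressions used in the loop above.
normalize() {
  sed 's/ after [0-9]* ms//; s/Operation too slow.*/Read timeout/'
}

connect=$(echo 'Failed to connect to example.com port 443 after 15002 ms: Timeout was reached' | normalize)
slow=$(echo 'Operation too slow. Less than 1 bytes/sec transferred the last 15 seconds' | normalize)

# Stripping the shortest trailing ":..." removes the line number.
where='Documents/notes.adoc:42'
file=${where%:*}

echo "$connect"
echo "$slow"
echo "$file"
```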

Stage 4: Notify

All that remains for the script to do is to figure out which lights have gone dark since the last time we checked (if ever), and send an e-mail report. Nothing could be simpler!

# comm(1) input needs to be ordered according to collation rules
sort <"$tempfile" >"$fails"
if [ -f "$lastfails" ]
then comm -13 -- "$lastfails" "$fails" | mail -E -s "Internet rot report" root
fi

rm -f -- "$tempfile"
mv -- "$fails" "$lastfails"
mv -- "$log" "$lastlog"
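
To show what comm -13 contributes, here is a minimal sketch with two made-up reports in a temporary directory; the file names and report lines are assumptions. Columns 1 (only in the old report) and 3 (in both) are suppressed, so only the newly failing link remains.

```shell
demo=$(mktemp -d)

# Last week's failures
printf '%s\n' \
  'Documents/a.adoc The requested URL returned error: 404 (404) https://example.com/old' \
  'Documents/b.adoc Read timeout (000) https://example.org/slow' |
  sort >"$demo/last"

# This week's failures: a.adoc got fixed, c.adoc is new
printf '%s\n' \
  'Documents/b.adoc Read timeout (000) https://example.org/slow' \
  'Documents/c.adoc The requested URL returned error: 410 (410) https://example.net/removed' |
  sort >"$demo/now"

# Keep only lines unique to the second file
new=$(comm -13 -- "$demo/last" "$demo/now")
echo "$new"

rm -rf -- "$demo"
```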

Stage 5: Deal with the carnage

As you might have noticed, we’ve also created a ‘quickfix’ file. This can be loaded into VIM through either its -q option, or the :cfile command.

[Image: VIM iterating quickfixes]

With some luck, you’ll be able to just copy over the Internet Archive’s search results and call it a day, though my success rate isn’t really that great with it, and I tend to check whether I can find the content’s new home address first, anyway.

It’s handy to map the commands for iterating through Quickfix List entries to simple keypresses. For example, to mimic Qt Creator:

nnoremap <F6> :cnext<CR>:cc<CR>
nnoremap <S-F6> :cprevious<CR>:cc<CR>

A less convenient but more portable option would be vim-unimpaired's ]q, [q bindings.

In general, this approach isn’t entirely optimal, seeing as line numbers may get outdated fast, especially within large files, and resolved links don’t disappear from the list. Still, I find it more than adequate for my purposes.

Stage 6: Repeat

As with my ‘personal intelligence service’, the resulting script is run from a systemd/cron timer. Because the product is a depressing report of how things fall off the Internets, I don’t want to receive it too often--once a week is more than enough.

~/.config/systemd/user/link-archiver.service

[Unit]
Description=Link archiver
[Service]
Type=oneshot
ExecStart=%h/Skynet/link-archiver.sh
WorkingDirectory=%h/Skynet

~/.config/systemd/user/link-archiver.timer

[Unit]
Description=Link archiver
[Timer]
OnCalendar=Fri 17:30:00
Persistent=true
[Install]
WantedBy=timers.target

systemctl --user enable --now link-archiver.timer

Optionally also enable session lingering or make it a system-wide unit so that it can run before that user logs in, and ensure you get notified should the unit fail.
