Rolling my own CI/CD

Software development includes a number of tasks that need to be done each time you make a change in a project, and they are to various degrees annoying, error-prone, and/or time-consuming to take care of manually: verifying that the code compiles or type-checks, running static analysis, unit tests, and integration tests, publishing a development build for others to use, or even deploying it to a system where users can play around with it. The larger the project, the worse this gets, especially when you support a number of different platforms.

The natural desire is to automate these processes—​formalize all the necessary steps, be it in some kind of a script, or perhaps through a more declarative approach, and make it so that they’re taken care of before (or after) someone concludes a unit of work.

Tooling that does this sort of automation is often labelled after two particular development practices that directly rely on it: ‘continuous integration’ and ‘continuous delivery’. People use ‘GitLab CI’ and talk about ‘CI/CD pipelines’. So while it’s not completely appropriate, I’ll be using continuous integration as a synonym for the tooling itself.


Motivations

Anyway, I write software. And my ever-growing collection of personal projects kept dragging along with it exactly the kinds of problems mentioned in the introduction. First of all, I kept finding out that they either had broken builds (typically on systems-that-are-not-mine), or more trivially, that some dependency had decided to produce new warnings. Second of all, if someone else wanted to use what I had made, he’d have a hard time, seeing as essentially nothing had found its way into a distribution yet. He’d need to follow my terse instructions, and build it from source code—​which is only easy on BTW-I-use-Arch, because I maintain scripts for development builds of my software in the Arch User Repository. It doesn’t take much effort to publish installable packages for non-rolling distros, such as Ubuntu, but it’s a ton of mind-numbing work to do it by hand with each project release, much less with each commit.

In short, a CI/CD solution had the potential to be of great help, mostly to my mental well-being. Though I had a few conditions: I wasn’t about to pay for this, since I didn’t actually need it, and it couldn’t stop providing service out of the blue, because that would bring me back to the beginning.

My experiences with public providers had generally been bad. In the past, I got acquainted with Travis CI, Open Build Service, Launchpad, and wanted to try out Wercker. Of these, the one with an ever-outdated instance of Ubuntu went all commercial, the packager-specific one was super convoluted, the pile of technical debt was a lot of pain to work with (then someone succeeded in having my account terminated), and the remaining one just disappeared.

On the other hand, the rather simple CI/CD that people had put together from shell scripts at my former job was more or less exactly what it needed to be. It was flexible, and the occasional problems had easy solutions.

I knew that the essence of this kind of automation is trivial: it runs shell scripts when triggered by a git push. The only hard parts are when and how to run them. And I believed I could do away with a lot of accidental complexity that comes with popular self-hosted solutions, so I set out to build something neat.

Overview

In the rest of this article, I’ll guide you through setting up the CI system I’ve put together. On the highest level, it looks like this:

Gitea → Queue → Virtual machines → {Commit statuses, Notifications, (Artifacts)}

Everything runs on the same physical machine (though SSH makes it possible to also delegate tasks remotely, to cover annoyances like macOS). When someone pushes to a repository, the event is recorded in a queue. The CI then launches any appropriate virtual machines from a snapshot, sends them work, and waits for the results. Those are attached to Gitea commits over its REST API. Build failures are forwarded to the administrator over e-mail or IRC. Eventually, build artifacts will also be stored for download.

Gitea just happens to be what I use to manage my repositories. Gogs or Forgejo integration would look nearly identical. And GitLab or GitHub users should delete that crap.

(Unmanaged remotes would need minor adjustments to acid so that it doesn’t rely on the existence of a Gitea. Then, if needed, one can relatively easily patch cgit to pull commit statuses from its database.)

While it might make some sense to use containers for all CI targets that are Linux distributions rather than full-blown virtual machines for everything, it would also add a lot of headaches concerning management.

System setup

The host machine is assumed to be Arch Linux, with a running instance of Gitea. You should have a few tens of gigabytes of spare disk space, and several gigabytes of free memory. The more RAM, the better, since it allows for running our VMs without writing to disk.

Memory

Speaking of which, you should review the limits of memory-backed filesystems. For example, the mount_setup function in mkinitcpio’s init_functions script causes /run and /dev to assume the default size limit of 50% of RAM, which means that filling both of them up at once will simply freeze your machine. I decided to go with systemd’s own unapplied values (see the table in mount-setup.c + mountpoint-util.h for limit macros), which look reasonable. If you put them in /etc/fstab, systemd will apply them later during boot.

run  /run  tmpfs     nosuid,nodev,mode=755,size=20%,nr_inodes=800k  0  0
dev  /dev  devtmpfs  nosuid,mode=755,size=4m,nr_inodes=1m           0  0

Then, of course, there’s /tmp, whose 50% limit is much more likely to cause issues, especially if you run swap-less. This can be edited with:

# systemctl edit tmp.mount
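
For illustration, here’s a sketch of the override drop-in that command opens, assuming you want to cap /tmp at, say, 2 GiB (the remaining options mirror what recent systemd versions ship by default; adjust to taste):

/etc/systemd/system/tmp.mount.d/override.conf
[Mount]
Options=mode=1777,strict-atime,nosuid,nodev,size=2G,nr_inodes=1m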

Proxy

Because we want to test repeatability of builds starting from more or less clean operating system installations, we’ll repeatedly install all dependencies on each build. To guard ourselves against various kinds of outages, to speed up the build process, as well as to generally avoid unnecessary traffic, we’ll set up a caching forward proxy.

The one-eyed king here sadly seems to be:

# pacman -S squid

Squid has its own group and user, both called ‘proxy’, and normally listens on port 3128. The default configuration’s ACL as of writing only allows access from localhost, which is fine for our purpose. In any case, you should review it.

What we absolutely need to adjust in /etc/squid/squid.conf are cache settings:

# 10G should be plenty of space.
# Distribution packages can get very large, so use a 256M limit.
cache_dir ufs /var/cache/squid 10000 16 256 max-size=256000000

and the shutdown timeout, because the service otherwise likes to wait for no good reason:

shutdown_lifetime 0 seconds

HTTP authentication

We don’t make the proxy Internet-accessible, but preventing misuse comes cheap:

auth_param basic program /lib/squid/basic_ncsa_auth /etc/squid/passwords
auth_param basic realm CI
acl auth proxy_auth REQUIRED
http_access allow auth

To create a password file (change PASSWORD to your liking):

# echo "ci:$(openssl passwd -5 -salt $(openssl rand -base64 9) 'PASSWORD')" \
  >/etc/squid/passwords
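
Should you want to double-check the result before wiring it up, the helper can be exercised directly; it reads ‘user password’ pairs on standard input and should answer OK or ERR:

$ /lib/squid/basic_ncsa_auth /etc/squid/passwords
ci PASSWORD
OK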

HTTPS

Next, to actually cache HTTPS requests (these use CONNECT) instead of blindly passing them through, we must sadly create a man-in-the-middle CA, and manually initialize a disk cache that will contain leaf certificates for proxied hosts:

# cd /etc/squid
# openssl req -newkey rsa:2048 -subj "/CN=Squid CA" -days 3650 -nodes \
  -keyout bump-ca.key.pem -x509 -out bump-ca.cert.pem
# openssl dhparam -out bump-dhparam.pem 2048
# /lib/squid/security_file_certgen -c -s /var/cache/squid/ssl_db -M 4MB
# chown -R proxy:proxy bump-* /var/cache/squid/ssl_db

The corresponding squid.conf snippet can be kept simple:

# HTTPS must also be cached.
http_port 3128 ssl-bump tls-cert=/etc/squid/bump-ca.cert.pem \
  tls-key=/etc/squid/bump-ca.key.pem tls-dh=/etc/squid/bump-dhparam.pem
ssl_bump stare all

Verification

All that remains is to check the configuration, enable the service, and to test that it works:

# squid -k parse
# systemctl enable squid --now
$ export http_proxy=http://ci:PASSWORD@localhost:3128
$ export https_proxy=http://ci:PASSWORD@localhost:3128
$ curl http://google.com
$ SSL_CERT_FILE=/etc/squid/bump-ca.cert.pem curl https://google.com

QEMU + KVM

Refer to the respective ArchWiki articles for more information; there shouldn’t be any gotchas, though, and it should just work.

The qemu-desktop package pulls in a lot of crap, so let’s cherry-pick what we’ll actually use:

# pacman -S qemu-img qemu-system-x86
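
If you want to double-check that KVM is actually usable before proceeding, a couple of quick tests (assuming an x86 host; run the second one as the user that will be launching the VMs):

$ grep -Ec 'vmx|svm' /proc/cpuinfo
$ test -w /dev/kvm && echo KVM is usable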

My minimalist daemon glue

The last piece of the puzzle is something to actually run a virtual machine upon receiving a push event. I hereby introduce you to my project named acid that takes care of the whole pipeline. I’ll talk about it a bit more at the end of the article. Right now, I’ll tell you how to set it up.

Let’s run everything under a special user, build a copy of the software, and create an initial configuration file (consult the project’s documentation for explanation):

# pacman -S git go
# useradd -m acid
# su - acid
$ git clone https://git.janouch.name/p/acid.git
$ make -C acid acid
$ cat > acid.yaml <<EOF
---
db: acid.db
listen: localhost:8080
root: https://acid.server

secret: $(openssl rand -base64 9)
gitea: https://gitea.server
token: INSERT-FROM-GITEA

notify: |
  {{if ne (print .State) `Success`}}
  mail -s '{{.FullName}} {{.Hash}}: {{.RunnerName}} {{.State}}' root <<END
  {{.URL}}
  END
  {{end}}
EOF

The Gitea access token, which you can create in user settings under Applications, must have read/write rights for every repository that acid will receive push events for.

Also don’t forget to create some Gitea webhooks, which for our example must target the URL of https://acid.server/push. For your convenience, I suggest creating a user-wide hook. Use the secret from your acid.yaml file.

If you don’t have mail configured on your system, feel free to remove the notification snippet, which serves only as an example. I use IRC instead for this purpose.

To run the service, create a straightforward systemd unit, and start it:

/etc/systemd/system/acid.service
[Service]
User=acid
WorkingDirectory=~
ExecStart=/home/acid/acid/acid acid.yaml
ProtectSystem=full
ProtectKernelTunables=true
ProtectControlGroups=true
SystemCallFilter=~@clock @cpu-emulation @debug @keyring @module @mount @obsolete @raw-io @reboot @swap

[Install]
WantedBy=default.target
# systemctl enable --now acid

Finally, since the daemon is not intended to be directly exposed to the Internet, and as such doesn’t support listening on HTTPS, you will probably want to set up your web server to proxy all requests to it. In the case of Nginx, the snippet will look like:

location /ci/ {
  proxy_pass  http://127.0.0.1:8080/;
  proxy_read_timeout  90;
}

which should make the https://acid.server specified in acid.yaml reach our little daemon.

Preparing target VMs

Luckily, there seems to be a thing called ‘cloud-init’ that unifies virtual machine setup across various operating systems, including BSD derivatives. You simply download a ‘cloud-enabled’ image which has that package pre-installed, and pass the VM a settings file to boot with. Let’s install a utility to enable passing the file as a virtual drive:

# pacman -S cloud-image-utils

Next, while it might seem sufficient to use passwords to log in to the machines, it also costs very little to create an SSH keypair. Additionally, it will be convenient to place proxy settings for the guests in a special file. Let’s put all these files in a subdirectory:

# su - acid
$ mkdir data
$ cd data
$ ssh-keygen -t ed25519 -f id_ed25519 -C acid -N "" -q
$ echo 'http://ci:PASSWORD@10.0.2.2:3128' > proxy

Some cloud images can be found linked in OpenStack documentation. For demonstration purposes, we will use Debian, which behaves very well, and doesn’t need special casing:

$ curl -LO https://cloud.debian.org/images/cloud/bookworm/latest/debian-12-genericcloud-amd64.qcow2
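
It doesn’t hurt to verify the download; at the time of writing, Debian publishes a SHA512SUMS file right next to the images:

$ curl -LO https://cloud.debian.org/images/cloud/bookworm/latest/SHA512SUMS
$ sha512sum --check --ignore-missing SHA512SUMS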

Making runner scripts

To configure the Debian image, add the following snippet to your acid.yaml:

runners:
  debian12:
    name: Debian Bookworm
    run: runners/debian12.sh
    ssh:
      user: ci
      address: localhost:8022
      identity: data/id_ed25519
    setup: |
      cloud-init status --wait
      set -ex
      sudo apt-get update
      sudo apt-get upgrade --yes
      sudo apt-get install --yes git
      git clone --recursive {{quote .CloneURL}} {{quote .Repo}}
      cd {{quote .Repo}}
      git -c advice.detachedHead=false checkout {{quote .Hash}}

acid trivially launches the given script or binary, and expects it to eventually make the target machine available over SSH at the specified address, keeping it that way for as long as it doesn’t exit.

Once connected, and once we’ve made sure that cloud-init has finished, we can make excellent use of shell options to get the behaviour we want: -e to exit on the first error, and -x to print commands as they’re being executed. The system upgrade that follows could also be done within cloud-init, but that would make its progress invisible, which isn’t particularly desirable for actions that can take a long time to finish. Finally, we set the stage for project scripts. Private repositories might need adjustments there for authentication.

Due to the repetitive nature of launching QEMU, let’s split out the common part first:

~acid/runners/qemu.sh
#!/bin/sh -xe
test -n "$ACID_ROOT"
tmp=/tmp/acid-qemu
#tmp=$ACID_ROOT/tmp

rm -rf -- "$tmp"
mkdir -p -- "$tmp"
cd -- "$tmp"
umask 026

qemu-img create -b "$ACID_ROOT/data/$acid_image" -F qcow2 \
  -f qcow2 overlay.qcow2 6G

cat > user-data <<EOF
#cloud-config
timezone: Europe/Prague
hostname: $acid_dist-ci
users:
  - name: ci
    sudo: ALL=(ALL) NOPASSWD:ALL
    groups: wheel
    ssh_authorized_keys:
      - $(cat -- "$ACID_ROOT"/data/id_ed25519.pub)
ssh_pwauth: False
ca_certs:
  trusted:
    - |
$(sed 's/^/      /' /etc/squid/bump-ca.cert.pem)
$(cat <&3)
EOF

touch meta-data
cloud-localds seed.img user-data meta-data

# Trade-off: running in "cache=unsafe" is faster, but the machine
# must be shut down cleanly if the overlay image is to be booted again.
echo 1000 > /proc/$$/oom_score_adj
exec qemu-system-x86_64 -enable-kvm -smp $(nproc) -m 2G \
  -drive file=overlay.qcow2,if=virtio,cache=unsafe \
  -drive file=seed.img,if=virtio,format=raw \
  -device virtio-net-pci,netdev=net0 \
  -netdev user,id=net0,hostfwd=tcp:127.0.0.1:8022-:22 \
  -audio none \
  -nographic

I’ll trust you, dear reader, to figure most of it out yourself. The gist of it is that it creates a throw-away overlay disk image in your /tmp, which on most Linux distributions is backed by system memory, generates a cloud-init configuration drive, and launches an instance of QEMU that forwards the host machine’s port 8022 to the guest’s SSH port.

Cloud images are distributed tiny, and blow up on boot to fill up the whole device, so we give them a bit more space to breathe. In practice, none of my builds require more than 6 gibibytes, and that includes a GTK+ project with two Win32 cross-builds. Unfortunately, it is not possible to keep the overlay image entirely in QEMU’s process memory—​it must be in some way backed by the filesystem. Meaning, you can choose between wearing down your SSD, if you have one, and risking a dangerous kind of memory pressure when you put it in a tmpfs.

2 gibibytes of RAM were never a problem either. Here we can make the process sacrifice itself to the OOM killer, at least.

That was the complicated part. The Debian-specific script is then just this:

~acid/runners/debian12.sh
#!/bin/sh -xe
test -n "$ACID_ROOT"
cd -- "$ACID_ROOT"

acid_dist=debian \
acid_image=debian-12-genericcloud-amd64.qcow2 \
exec runners/qemu.sh 3<<EOF
write_files:
  - path: /etc/environment
    content: |
      http_proxy=$(cat data/proxy)
      https_proxy=$(cat data/proxy)
    append: true
  - path: /etc/sudoers.d/90-proxy
    content: |
      Defaults:ci env_keep += "http_proxy https_proxy"
EOF

We use an extra file descriptor to pass the cloud-init script fragment, so that acid gets access to QEMU’s standard input stream.

Other Linux distributions and BSDs may require considerably more effort to get the MITM proxying to work, which will be your homework. Configuring certificates and global environment variables is apparently a hard problem.

Setting up projects

Congratulations, we’re almost done! You can now put stuff like the following contrived example in your acid.yaml file, and have it run automatically once you push to the repository:

projects:
  owner/repo:
    runners:
      debian12:
        setup: |
          sudo apt-get install --yes findutils coreutils
        build: |
          echo Computing line count...
          find . -not -path '*/.*' -type f -print0 | xargs -0 cat | wc -l

Just remember to make the daemon reload its configuration when you happen to change it:

# systemctl restart acid

Debugging

The easiest way to inspect failures is to put a very long sleep at the end of build scripts, then connect to the machine (assuming they’re all configured similarly, otherwise some yq magic would be in order here, as illustrated later on):

~acid/attach.sh
#!/bin/sh -e
ssh -i "$(dirname "$0")"/data/id_ed25519 -o UserKnownHostsFile=/dev/null \
  -o StrictHostKeyChecking=no ci@localhost -p 8022 "$@"
acid:~$ ./attach.sh

And you can certainly launch the runners independently, with the help of another simple script:

~acid/launch.sh
#!/bin/sh -e
ACID_ROOT=$(realpath "$(dirname "$0")") "$@"
acid:~$ ./launch.sh runners/debian12.sh

Evaluation

Truth be told, I was procrastinating over this endeavour for many years, almost scrapping the idea in favour of simple Nix-based build checking instead. But I’m happy about what I’ve ended up with. While a few areas remain that deserve more love, I’ve achieved my main goals. And I still think setting up something like buildbot the way I want would take me a similar amount of time.

I have deployed acid on most of my projects, so feel free to go have a look.


Implementation notes

I’ve skipped over how acid is actually put together. There is surprisingly little to it, because my language of choice—​Go—​comes with batteries included, and a ton of things can be achieved through shelling out.

Configuration

This is one thing where I went for an external package. While there is a lot of YAML hate, it is infinitely more convenient than, say, JSON, XML, or even TOML. Anchors and references are a very useful feature for deduplication. The yq processor also makes for easy scripting. For example, to clone all configured projects, I can do:

$ yq '. as $r | .projects | keys | .[] | "\($r.gitea)/\(.).git"' acid.yaml \
| while read -r url
  do git clone --recursive "$url"
  done

The only ugly thing is passing scripts through text/template, which makes proper quoting at best awkward.

Project configuration

In principle, it makes sense to store CI configuration alongside source code, such as in a committed .acid.yaml file; however, this results in a lot of commit spam, and going back in history still cannot account for external changes that ruin your build.

Because I develop on the master branch, and dependencies have a tendency to only increase in their numbers anyway, this spam is not worth it. Thus, all per-project configuration is included in the main configuration file. As I’ve already demonstrated, this also enables scripts to enumerate everything from a central location.

If it’s absolutely necessary, you can still make build instructions conditional, or even execute scripts from the cloned repository.
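
For instance, a hypothetical project entry along these lines would defer to a script committed to the repository whenever one is present (the ci/build.sh path is made up):

projects:
  owner/repo:
    runners:
      debian12:
        build: |
          # Prefer a build script from the repository itself, if there is one.
          if test -x ci/build.sh; then
            ./ci/build.sh
          else
            make all
          fi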

Terminal

Tasks have two kinds of outputs: the runner log, pertaining to the target system, and the task log, showing output from your scripts. Both are assumed to be displayed by some kind of terminal device.

The good news is that you will get very far by implementing just three control characters: BS (backspace) for going one character to the left, CR (carriage return) for going to the beginning of the line, and NL (new line) to move to the next line. The output won’t be particularly pretty, since a lot of things use several more ANSI escapes, but this subset already reduces a lot of the log spam you would get if you blindly assumed the output to be plain text.
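
To see why even this little bit matters, consider a progress meter of the sort many downloaders print: hundreds of CR-terminated updates that a terminal collapses into a single line, but which would turn into a wall of noise if rendered as plain text:

$ for i in $(seq 0 10 100); do printf 'downloading... %3d%%\r' "$i"; sleep 0.1; done; echo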

Script I/O

I’ve considered several options, and the simplest possible solution worked the best:

  • QEMU’s serial console doesn’t have a very clear moment of coming alive, requires a fuzzy expect-like approach to interaction, and problematically merges script lines with command standard input, unless you first pipe the script into a file. It also limits you to virtual machines.

  • Piping script lines directly into the shell launched by sshd had the problem of certain programs flushing their standard input. Once that happens, you’re stuck, because the remote shell doesn’t receive any EOF events afterwards.

  • It turned out that you can just pass the script in its entirety as the command to execute, which will get picked up as the -c argument to a Bourne-like shell. Program arguments can be fairly long, up to roughly 100 kibibytes.

acid will simply concatenate the runner’s setup script, and the project’s setup and build scripts together as strings. This granularity turns out to be enough, given that Bourne shell’s -x option exists. What you usually do with logs is scroll to the end to see what failed, and maybe read a bit up. Similarly, if builds have several stages, they’re usually interdependent.
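
To illustrate the mechanism with a stand-alone example (the script contents here are made up): the whole concatenated script travels as a single argument, which sshd hands over to the login shell as its -c parameter. The attach.sh helper from the Debugging section below works nicely for trying this out:

$ script='set -ex
uname -a
true'
$ ./attach.sh "$script"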

Database

Originally, I toyed around with the idea of using the filesystem to store tasks, but I was unhappy with the—​well—​ACID properties of that approach. So I went with SQLite instead, which seemed like a decent solution, despite depending on cgo or other wild ways of crossing the language barrier. I also had recent experience with it.

Thanks to its command line interface, SQLite composes rather well with UNIX. You can easily page through past logs without the need for a special frontend, in colour no less (through less):

$ sqlite3 acid.db "SELECT runlog || tasklog FROM task WHERE id = 42" | less -R

You can even insert new tasks externally. You just need to awaken any running instance of acid with its restart command, so that it picks them up:

$ sqlite3 acid.db <<END
  INSERT INTO task (owner, repo, hash, runner)
  VALUES ('p', 'acid', 'fd6959fff82a87e92d9e73cb07e210cebb675050', 'debian12')
END
$ ./acid/acid acid.yaml restart

The only problem here is quoting of the individual fields.

RPC

While direct DB access is indeed mostly alright, only the daemon itself has control over tasks that are currently running. Therefore, the restart command above is a remote procedure call, relayed through a preexisting interface, which is HTTP, and authenticated by signing the request with the same secret as the push hook endpoint.

This means of write access is again friendly to scripting, and allows the web interface to stay simple—​there’s no need to keep track of sessions.

As another example, rebuilding all projects can be achieved as follows:

$ yq '.projects | keys | .[]' acid.yaml \
| while read -r project
  do ./acid/acid acid.yaml enqueue "${project%/*}" "${project#*/}" master
  done

The enqueue command takes care to resolve branch or tag names, and avoids creating new tasks for a project/commit/runner combination where one already exists, restarting them instead. The main reason why it exists at all is that implementing the same functionality using yq, curl, sqlite3, and the restart command started reaching an unsettling level of complexity.

Web interface

All that you actually need is something to dump the contents of the SQL table, enriched with any intermediate progress from currently running tasks, and maybe set meta refresh to give it a resemblance of dynamicity. Right?

It works for me, at least—​information comes first. Although I acknowledge that a few lines of CSS and Javascript generally might not hurt. It’s just not particularly cost-effective.

Artifacts

This is still not implemented, however the general idea is to SFTP certain files out from the machine, then postprocess them with arbitrary scripts. For example, to pin them to a release in the respective Gitea repository.
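
In the meantime, nothing prevents doing this step by hand: leave the machine running at the end of a build (see Debugging), then pull the files out over the same forwarded SSH port. A sketch, assuming the build leaves a Debian package at a known path and that /srv/artifacts exists on the host:

acid:~$ scp -i data/id_ed25519 -P 8022 -o UserKnownHostsFile=/dev/null \
  -o StrictHostKeyChecking=no 'ci@localhost:REPO/*.deb' /srv/artifacts/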
