Brief

There are some incredible providers out there that make hosting your DNS records a breeze (Cloudflare, DigitalOcean, etc.). These offerings might be a great choice for a personal or company domain that should be accessible to the world. However, if you have some internal infrastructure, you might not want to leak those domains to the public. There is still a place for hosting your own authoritative nameservers, and in this post I’ll describe how I approached automating deployment of new records with Gitlab CI. I won’t be going over any DNSSEC configuration here; the PowerDNS documentation should have you covered.

DNS Server and Tooling

Forever ago, I used djbdns, with some patches for IPv6 support. The format is incredibly simple for humans to read, and for computers to spit out.

Today, I use PowerDNS for my authoritative server (and even run the recursor in front of it for internal domains). I don’t use a database backend, because I want history with a quick and easy rollback path (deploy the previous commit). I’m still using the tinydns backend, but that data is all generated for me out of YAML. I have a small utility I call dnsgen that adds a little more convenience than it probably should. Some of the YAML I write looks like this:

zone:
  # Name of the zone we're working on, this value
  # is also used anywhere that a name field is
  # simply a "." for convenience
  name: doot.net

  # Every record supports a `ttl` field.
  # If it isn't supplied, it will use this value.
  default_ttl: 600

  # Z in tinydns
  # Creates the SOA record for the domain, uses default TTL
  soa:
    primaryns: ns1.doot.net
    contact: toot@doot.net
    serial: '20220814001'
    refresh: 300
    retry: 300
    expire: 300
    min: 300

  # & in tinydns
  # Creates our NS records for this zone
  # Optional to supply A/AAAA records at the same time as
  # a convenience, but not required
  ns:
    - name: .
      ttl: 3600
      values:
        - name: ns1.doot.net
          a: [ 10.1.2.3 ]
          aaaa: [ fe80::123 ]
        - name: ns2.somewhere.com

  # @ in tinydns
  # Creates an MX record for the zone with the given priority
  # and uses the default TTL in this case
  mx:
    - name: mail.doot.net
      priority: 10

  # C in tinydns
  cname:
    - name: www
      target: doot.net

  # ' in tinydns
  txt:
    - name: .
      ttl: 3600
      values:
        - 'v=spf1 ip4:10.0.0.0/8 ip6:fe80::/16 ~all'

  # Combines +, ^, 3, and 6 as appropriate in tinydns
  # "a" is a poorly named field in the YAML
  a:
    # This creates an A, AAAA, and pair of ^ records
    # for "doot.net"
    - name: .
      ptr: true
      values: [ 10.1.1.1, fe80::100 ]

    # This creates a single A record for "foo.doot.net"
    - name: foo
      values: [ 10.3.2.1 ]

    # This creates a wildcard A/AAAA for "*.anyone.doot.net"
    - name: '*.anyone'
      values: [ 169.254.169.254, ::1 ]

I’ve commented the above to give an idea of what the definitions do for me. Where a trailing dot is required for a name, the tool will inject it for me. One additional convenience is that it will fail to generate if there are multiple PTR records for a single IP. While that’s not invalid, it’s not something I use, and it has tripped me up in the past.

I highly recommend using the GeoIP backend for a new deployment if you’re going to use flat files stored in git as I do here. Its headline feature is that it can change its response based on where the client IP is located (as reported by a GeoIP database you provide), but it doesn’t have to do that. Its records are written in YAML, just as with dnsgen, except there are fewer moving parts when using the GeoIP backend.
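
To give a flavour of the format, a zone in the GeoIP backend’s zones file looks roughly like the following (reconstructed from memory of the PowerDNS docs using the example zone above; double-check the GeoIP backend documentation for the exact schema):

domains:
  - domain: doot.net
    ttl: 600
    records:
      doot.net:
        - soa: ns1.doot.net toot.doot.net 20220814001 300 300 300 300
        - ns: ns1.doot.net
        - ns: ns2.somewhere.com
        - mx: 10 mail.doot.net
      www.doot.net:
        - cname: doot.net
      foo.doot.net:
        - a: 10.3.2.1
    # The optional "services" map is where geo-aware answers are configured;
    # it isn't needed if you only want static records.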

Alternatively, you can of course use the MariaDB backend, which might be easier to manage in general, but it makes tracking history with Git more difficult.

The path with dnsgen is as follows:

  • Build the tool (it’s ~200 lines of Go)
  • Generate the tinydns-data format with dnsgen gen -o deploy/data zones/*.yaml
  • Convert that with a very helpful tool (tinydns-data.py, which turns data into data.cdb)
  • Make use of the resulting data.cdb that can be sent to the pdns instances

That is a lot of moving pieces, when it would be easier to just send the source YAML to the nameservers at CI time instead of all of that.

Gitlab CI: Deploying DNS Changes

After going through my twists and turns above to end up with a data.cdb, I have to deploy that to running nameservers somehow from within CI. There are plenty of ways we could do this:

  • rsync or scp the file from CI
  • Have a small service that can take in the files on the nameservers from CI
  • Take the data.cdb and slap it into a Docker container, deploy the container
    • Further, deploy pdns with that data in Kubernetes

I’m taking the easy way out here, going with the first option: rsync the data.cdb file to the running nameservers from an internal Gitlab runner. My nameservers are still running as virtual machines, so this is the easiest option. With a Kubernetes deployment, this could be improved.

CI Pipeline

My CI pipeline is super simple, and handles both internal and external zones. The build-{gen,test} jobs simply build a couple of Docker images: one with dnsgen in it, and another that runs pdns_server alongside dig to test it.

The dnsgen job runs the tool as I laid out above, and then uploads the data and data.cdb files as artifacts for each set of zones separately. This way I can use one repo and one quick job to handle both internal and external domains.
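
In .gitlab-ci.yml terms, that job looks roughly like the following (a sketch rather than my exact job: the stage name and script lines are illustrative, and the data-to-data.cdb conversion step is elided):

dnsgen:
  stage: generate
  image: $CI_REGISTRY_IMAGE/dnsgen:$CI_COMMIT_REF_SLUG
  script:
    - dnsgen gen -o deploy/external/data zones/external/*.yaml
    - dnsgen gen -o deploy/internal/data zones/internal/*.yaml
    # ...then convert each data file into deploy/<set>/data.cdb...
  artifacts:
    paths:
      - deploy/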

The test step is trivial; the container starts pdns_server as a daemon using the data.cdb files we just generated, then ensures that some well-known entries resolve and return what is expected. I run some simple [ "$(dig my.name aaaa +short @localhost)" == "fe80::123" ] || exit 1 type of checks (including some multi-line ones) to make sure the basics are working. I also do some comparisons against the live nameservers to make sure we’re still reporting the records we expect to (again, only for names that shouldn’t change very often). The biggest benefit of the test step is making sure that dnsgen didn’t do something wrong; if I were using GeoIP, it would also verify that the server loaded the YAML correctly (though I could use yq to validate that, too).
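
Fleshed out slightly, the checks are just a pile of lines like this (the names and expected values are taken from the example zone above, and pdns_server is assumed to already be answering on localhost):

#!/bin/bash
set -euo pipefail

check() {
  # $1 = name, $2 = record type, $3 = expected answer
  local got
  got="$(dig "$1" "$2" +short @localhost)"
  [ "$got" == "$3" ] || { echo "FAIL: $1 $2 returned '$got', expected '$3'"; exit 1; }
}

check ns1.doot.net a     10.1.2.3
check ns1.doot.net aaaa  fe80::123
check www.doot.net cname doot.net.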

Finally, the deploy job is run, which is a bit more complicated. My job definition in .gitlab-ci.yml is really simple; this is all it is:

deploy:
  stage: deploy
  image: $CI_REGISTRY_IMAGE/dnsgen:$CI_COMMIT_REF_SLUG
  tags: [ internal ]
  only: [ main ]
  script:
    - time ./scripts/certs.sh
    - time ./scripts/sync-servers.sh external
    - time ./scripts/sync-servers.sh internal
    - time ./scripts/slack.sh

In the deploy stage, I pull down the dnsgen image that was built previously. I do this on a dedicated/tagged type of Gitlab runner which has access to the internal network, so that I can deploy to the internal nameservers which do not have a way in from the outside world (even their IPv6 /64 isn’t routed). This job only runs on the main branch, so it doesn’t happen in merge requests.

Then, I run a bunch of scripts:

  • certs.sh hasn’t been touched on yet, but it ensures that all new or deployed Let’s Encrypt certs are up to date
  • sync-servers.sh does the actual deployment of data.cdb files to the appropriate nameservers
  • slack.sh just gives me a heads up in Slack that DNS was deployed (we can get a diff of the last data files, which is convenient)!

The only tricky part about sync-servers.sh is that we need an SSH key available to us to rsync files around. This problem goes away in the world of tomorrow when this is deployed in Kubernetes, or at least in Docker (because the infrastructure it runs on will still want nameservers). I create a protected variable in Gitlab that’s only accessible to protected branches (the main branch) to accomplish this. Then it’s as simple as echo "$SSH_KEY" > /tmp/sshkey && chmod 0600 /tmp/sshkey in the script, and now the key can be used to SSH to our nameservers (this assumes it’s accepted on those servers, of course).

This deployment script is also incredibly simple (with some fluff removed for simplicity):

#!/bin/bash
set -e

# Before setting -x
if [[ -n "$SSH_KEY" ]]; then
  echo "$SSH_KEY" > /tmp/sshkey
  chmod 0600 /tmp/sshkey
fi

set -x

# One arg, the zone set we'll deploy this run
zone="$1"

# Iterate through all servers
for i in $(cat "zones/$zone/00-nameservers"); do
  args="-i /tmp/sshkey -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no"
  rsync -e "ssh $args" -avP --chown=pdns:pdns "deploy/$zone/data.cdb" pdns@$i:/var/lib/powerdns/
  ssh $args pdns@$i 'bash -c "pdns_control reload && pdns_control purge"'
  ssh $args pdns@$i 'bash -c "pidof pdns_recursor && rec_control reload-zones"' || true
done

I’ve omitted some simple error checking that ensures the data.cdb and 00-nameservers files exist. The 00-nameservers file is very simple: it’s a list of the nameservers we deploy to for the internal/external zones, so that the appropriate servers handle the appropriate domains, one IP address per line.
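
For example, a hypothetical zones/external/00-nameservers contains nothing but the deploy targets (the addresses here are documentation placeholders):

192.0.2.53
198.51.100.53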

The script then forces a cache flush after reloading. I’m not sure whether your backend requires this; feel free to play with it. I also handle the recursor here because my internal nameservers run it in front of the authoritative server on the same host.

Let’s Encrypt Certificates

This is perhaps the most interesting, and possibly the most convenient, part of this whole thing, in my opinion. While I have some apps and services that can just do a certbot renew in place, that’s not always the case. On the internal network I have my own CA chain which everything can trust, but that doesn’t hold true for the external domains.

When a new cert is required, I’ll generate a new private key and certificate signing request (CSR). The private key is then thrown into a secret storage engine, and authenticated apps can pull it down when they’re bootstrapped. The CSR is thrown into this git repository, and can then be used to work with Let’s Encrypt.
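
Generating that pair is a one-liner with openssl; something like this, where the key type and size are just my defaults (for a wildcard cert, the CN is the wildcard name):

openssl req -new -newkey rsa:2048 -nodes \
  -subj "/CN=$DOMAIN" \
  -keyout "$DOMAIN.key" \
  -out "certs/$DOMAIN.csr"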

The goal here is to use the CSR to get a signed certificate from Let’s Encrypt, and then put that certificate into object storage for services to consume. I only want to do this when a certificate is due for renewal, allowing some time for retries in case of failure and time for the services to pick up the new certificate.

I accomplish this by using acme.sh, awscli, and some openssl tricks along with the DNS deployment detailed above. For each certificate in the repository, we start by reading its domain name, implied by its filename (and then verify its common name matches). Next, we try to pull down the existing certificate from object storage, if it exists, so that we can find out when it expires. You can do some ugly jank like the below to figure out how many days a certificate is still valid for:

expires_date=$(curl -s "https://$objectstorage/$BUCKET/$DOMAIN.pem" | openssl x509 -noout -enddate | cut -d= -f2)
expires_sec=$(date -d "$expires_date" +%s)
now_sec=$(date +%s)
days=$(( (expires_sec - now_sec) / 86400 ))

Now you can test that for your renewal, or skip it if the cert is fresh enough for your liking. Let’s Encrypt’s rate limits are there to protect the service, and we don’t want to hit it more than we have to. If our certificate is fresh enough, we move on to the next one; in the common case, our deployment won’t actually renew any certificates at all. Note that you’ll always want to quote the domain variable, since it can contain a wildcard.
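
The skip itself is nothing fancy; inside the per-certificate loop it’s roughly this (the 30-day threshold and the CN extraction are my own choices, and the CN check is the one mentioned above):

# Sanity check: the CSR's CN should match the domain implied by the filename
# (assumes a CN-only subject on the CSR)
csr_cn="$(openssl req -in "certs/$DOMAIN.csr" -noout -subject | sed -n 's/.*CN *= *//p')"
[ "$csr_cn" == "$DOMAIN" ] || { echo "CN mismatch for $DOMAIN"; exit 1; }

# Skip renewal while the deployed certificate still has plenty of life left
if (( days > 30 )); then
  echo "Skipping \"$DOMAIN\": $days days remaining"
  continue
fi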

Moving on, if we do need to renew the certificate, we run acme.sh and ask it to request a certificate from Let’s Encrypt using our CSR.

acme.sh \
  --signcsr \
  --csr "./certs/$DOMAIN.csr" \
  --dns dns_mymethod \
  --dnssleep 0 \
  --domain "$DOMAIN" \
  --cert-file "./certs/$DOMAIN.pem" \
  --ca-file "./certs/$DOMAIN.ca.pem" \
  --fullchain-file "./certs/$DOMAIN.fullchain.pem"

There’s only one tricky thing to worry about here, and that’s the --dns dns_mymethod --dnssleep 0 part of the command. The --dnssleep 0 just means that acme.sh won’t wait for DNS to propagate; it’ll assume it’s ready immediately. The --dns dns_mymethod however is a bit more in depth: it tells acme.sh that a file dns_mymethod.sh is available and provides dns_mymethod_add() and dns_mymethod_rm() functions.

Starting with dns_mymethod_rm(), we have something kind of like this:

dns_mymethod_rm() {
  # This will be something like "_acme-challenge.mydomain"
  fulldomain=$1

  # Re-deploy the files just like we did above
  # in the CI section - this is because the normal
  # dnsgen stuff doesn't know about this challenge
  sync-servers.sh external || return 1
  return 0
}

Inserting the record is a little more involved, but only because we make it more involved. I’ll omit some of that complexity with comments below, but notice how simple it is to manipulate the tinydns-data format in plain text. The same could be accomplished with GeoIP and yq, of course.

dns_mymethod_add() {
  # Like above, same domain name
  fulldomain=$1

  # The challenge value to store in a TXT record for $fulldomain
  txtvalue=$2

  # First, re-generate the data.cdb file like above.
  # Then we'll very easily add the text record to the plain data file:
  echo "'${fulldomain}:${txtvalue}:60" >> data

  # Next use tinydns-data.py to turn that into a data.cdb file
  # and then re-deploy to the external servers
  sync-servers.sh external || return 1

  # Next is some complexity:
  #  - Loop for some amount of attempts
  #  - Check validity on the external nameservers for $fulldomain
  #  - Do this for each external nameserver
  #    - Remember we have the 00-nameservers file for deployment
  #  - If `dig $fulldomain txt @$ns +short` doesn't match $txtvalue, retry
  #    - Try this for all your nameservers for the domain
  #  - If we cannot confirm this worked, return 1 for error
  #  - When confirmed, we're good to go!
  return 0
}
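
The verification described in those comments boils down to a loop like this (a sketch; the attempt count and sleep interval are arbitrary):

# Wait until every external nameserver serves the new TXT record
verify_challenge() {
  local fulldomain=$1 txtvalue=$2 ns ok
  for ns in $(cat zones/external/00-nameservers); do
    ok=0
    for _ in $(seq 1 30); do
      # dig +short wraps TXT answers in quotes, so compare the quoted form
      if dig +short "$fulldomain" txt @"$ns" | grep -qx "\"$txtvalue\""; then
        ok=1
        break
      fi
      sleep 2
    done
    [ "$ok" -eq 1 ] || return 1
  done
  return 0
}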

Once acme.sh returns, the certificate is signed and ready to be used by your applications. We then upload the resulting files to object storage via aws --endpoint-url https://$objectstorage s3 sync, which allows services to start picking up those certificates immediately and make use of them as soon as they’re able to.
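
That sync is nothing more exciting than the following (the bucket layout and include patterns are illustrative):

aws --endpoint-url "https://$objectstorage" s3 sync \
  ./certs/ "s3://$BUCKET/" \
  --exclude '*' --include '*.pem'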

This isn’t all that tricky, it’s just a few moving parts to keep track of.

Future

There are some improvements I could make to this entire process. I could cut down a ton of moving parts by just using the GeoIP backend. I could also stop deploying via SSH and instead use Docker or Kubernetes. There is a chicken/egg problem there, of course: Kubernetes is going to need some nameservers to talk to before you deploy the first one (i.e. the nameserver is deployed on Kubernetes, but Kubernetes needs the nameserver). This really only matters if you anticipate deploying from scratch and want to go straight to that kind of deployment.

Closing

Using the GeoIP backend would be a lot cleaner here, because the format is documented (unlike mine), and it would reduce my CI pipeline times. The build for the dnsgen Docker image takes about a minute in CI, while every other job takes about ten seconds or less, except for the certificate signing. However, even when nothing needs to be signed, the deploy script takes about ten seconds, since it still has to verify the deployed certs.

Abusing Gitlab CI’s scheduled pipelines is very helpful here. I run this job daily now because it skips certs that aren’t due for renewal, and then I never have to think about it. Slack notifications tell me when certs are renewed, and when DNS is updated and deployed (along with a small snippet of info about what changed).

My solution works pretty well for my use case, but I’d be lying if I said it wouldn’t be easier to just have some JSON blobs in git, and hit some DNS provider’s API to do it all for me.