Brief

With a remote-only team, securely connecting to centralized infrastructure from home is an absolute requirement. Communicating with internal-only services and deploying new services both require access to things that should not be accessible to the world. If you have your own hardware, access to your internal underlay network and out-of-band networks needs to be securely gated.

The benefits of a VPN can’t be overstated, and I’ll write a bit about how we’ve built some orchestration around WireGuard.

Good Morning, Staff

When staff start their day, they need two things: a connection to the VPN, and a staff cert. I’ve previously written about how we do our staff certs with our PKI, but I haven’t touched on how we orchestrate having access to it in the first place.

tl;dr vpnctl connect

That single command is used when staff start their day, and everything is done for them. There’s a lot of work done behind that one command, but it starts with establishing a WireGuard tunnel, and finishes with an additional helper to conditionally get a staff cert.

The goal is that this is a once-per-day command required to start working, with as close to zero friction as possible so that getting started for the day is easy (while still remaining secure, of course).

Breaking It Down

So, what does vpnctl connect do?

-[~:$]- vpnctl connect --help
NAME:
   vpnctl connect - connect to wireguard vpn

USAGE:
   vpnctl connect [command options] [arguments...]

OPTIONS:
   --device value, --dev value
   --dns value [ --dns value ]
   --endpoint value
   --full-tunnel
   --idle value
   --ipv4
   --ipv6
   --keepalive value
   --network-manager, --nm
   --no-dns
   --print-only
   --region value
   --route value [ --route value ]
   --split-tunnel
   --sso
   --ttl value

While I haven’t provided the description text and I’ve removed a few options, we can talk about the options displayed above. Every option has a sensible default, which means every option is optional. First, I’ll describe what these do, and what we consider to be the sensible default for our case.

  • --device simply defines the tunnel device for WireGuard, the default for us is corp-$region-$id.
  • --dns will override the tunnel-provided DNS servers, which can be disabled with --no-dns. This is useful for split-tunnel DNS, allowing our internal zones to go over the company internal DNS servers, while everything else goes over the user’s normal network.
  • --endpoint will set a specific VPN tunnel server endpoint to use. By default we do some discovery to find the closest (or least latent) VPN tunnel server and region.
  • --full-tunnel will not create a split tunnel (which is the default), and instead requests that WireGuard send all traffic for the user over the tunnel. This is useful during travel.
  • --idle, --keepalive and --ttl are values that can be overridden by specific groups for debugging and define the WireGuard PersistentKeepalive value and an internal orchestration value for the lifetime the VPN tunnel can stay alive. More on this later.
  • --ipv4 and --ipv6 will force the tunnel to be created over a specific address family. Our tunnel servers are available with both IPv4 and IPv6.
  • --network-manager is true by default (on systems with it) and will use NetworkManager to start up the tunnel upon successful completion.
  • --print-only will simply print the resulting WireGuard config file (which would otherwise be fed to NetworkManager or the likes). This is useful for testing.
  • --region will force a tunnel to be created in a given datacenter region.
  • --route will override the set of routes allowed to be sent over the tunnel, and can be given multiple times.
  • --split-tunnel is the opposite of --full-tunnel, and will force the tunnel to be created for only the internal routes.
  • --sso will be discussed more later. It will disregard an existing staff cert, if one is available.
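To make those defaults concrete, here’s a sketch of the kind of WireGuard config --print-only might emit. The addresses, hostnames, and placeholder keys are invented for illustration, not our real layout:

```ini
# Hypothetical output of: vpnctl connect --print-only
[Interface]
# Device would be named corp-$region-$id, e.g. corp-use1-a3
PrivateKey = <generated locally; never leaves the machine>
Address = 10.64.12.7/32
DNS = 10.64.0.53

[Peer]
PublicKey = <tunnel server public key, returned by vpnd>
Endpoint = vpn-use1-a3.example.com:51820
# Split tunnel: only internal routes, not 0.0.0.0/0
AllowedIPs = 10.64.0.0/16, 10.80.0.0/16
PersistentKeepalive = 25
```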

That’s a lot of potential configuration! Luckily, the only change used in practice is specifying --full-tunnel during travel, and even that is something we try to infer when possible!

Starting with --device, I said that our default is corp-$region-$id. The $region might be obvious given the --region option, it is simply a datacenter that the tunnel is created in. The $id field is simply a short identifier to tell what tunnel server was picked. This makes it technically possible to connect to multiple regions or tunnel servers within the same region at the same time, though that doesn’t really make the most sense.

We tend to use dnsmasq for split-tunnel DNS configuration, allowing users to avoid extra latency on their normal traffic. Our default here is to accept and use the DNS servers advertised by the tunnel, and then the user’s local dnsmasq instance can direct DNS traffic towards them for our internal DNS zone.
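A minimal dnsmasq fragment for that arrangement might look like the following; the zone name and resolver address are placeholders, not our real ones:

```conf
# /etc/dnsmasq.d/corp-vpn.conf (hypothetical)
# Send queries for the internal zone to the tunnel-provided resolver...
server=/corp.example.com/10.64.0.53
# ...everything else falls through to the user's normal upstreams.
```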

The --idle and --ttl options are where this starts to get interesting! WireGuard has no concept of a “session” on its own; it simply keeps a tunnel available, and will move traffic as long as a peer with the appropriate crypto is able to communicate with it. For our uses, we need tunnels to have a cap on their lifetime, and to become invalid if the tunnel hasn’t had any activity on it in some amount of time. This is where these switches come into play!

The --ttl option can specify the requested length of time that the tunnel will exist for. By default, this is the length of time that our identity provider tells us the user’s access token is valid for. Alternatively, if the session was created with a staff cert instead, the lifetime of the tunnel expires when the staff cert expires. This value can be set shorter if desired, which really only makes sense in integration testing.

The --idle flag allows us to similarly set an amount of time that the session may go without passing a keepalive. Our default (and maximum) is 30 minutes. Once this duration has passed without any traffic on the tunnel, the tunnel server will tear down the peer, and the user will have to issue another vpnctl connect. Setting this to a lower value is also only really useful for integration testing.
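The two timers can be sketched as a small amount of per-session state on the tunnel server. The names and structure here are my own illustration, not vpnd’s actual internals:

```python
from dataclasses import dataclass

MAX_IDLE = 30 * 60  # our default (and maximum) idle window, in seconds


@dataclass
class Session:
    created_at: float  # when the tunnel was established
    last_seen: float   # last time traffic/keepalive was observed
    ttl: float         # lifetime granted at creation (token/cert validity)
    idle: float = MAX_IDLE

    def expired(self, now: float) -> bool:
        # Tear down when either the hard TTL passes or the peer has
        # been silent for longer than the idle window.
        hit_ttl = now - self.created_at >= self.ttl
        hit_idle = now - self.last_seen >= self.idle
        return hit_ttl or hit_idle
```

A reaper loop on the server would periodically call expired() for each peer and remove the corresponding WireGuard peer entry when it returns true.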

We default to --network-manager being enabled for systems with it available, and that basically tells vpnctl to invoke nmcli connection import type wireguard file /tmp/wg.conf or similar (with some additional logic to check for an existing one, and to clean it up if required). Most Linux distros won’t simply allow the user to invoke something like wg-quick without special permissions, but users with access to NetworkManager (for things like Wi-Fi) can benefit from this to reduce friction starting their day. The tool will fall back to attempting to use wg-quick when NetworkManager is not available.

Specifying --route (possibly many times, or comma separated) will override the traffic allowed to be sent over the tunnel. This might be useful to allow separate tunnels for different regions, in case the user’s connections from home to two different datacenters are less latent than going through our internal network via one of the VPN tunnels. This is something we also attempt to discover for the user automatically, but there’s a fudge factor because it’s simply based on latency response times from the tunnel servers.

--sso is another place this gets interesting. vpnctl connect will try to use a staff cert to talk to the vpnd tunnel server on the other end if one is available and valid. If it isn’t, the tool will do a dance with our identity provider identical to the one described in the previous PKI blog post. The difference is that we hold onto the access token for a bit longer after establishing the WireGuard tunnel, because we send it over the tunnel to pkid in order to request a staff cert.

The last thing to talk about are the --{full,split}-tunnel flags. Specifying these will do exactly as it says on the tin: it will create either a split or full tunnel for your traffic. If neither is specified, the tool will make an attempt to figure out whether the user is “at home” or not, based on the current local network configuration. We make this determination with the assumption that the user was at home when they connected to the tunnel for the first time. This is simply kept in a local config file that can be blown away or altered at any time, and there’s a helper vpnctl reset sub-command that can do this, as well. It’s clearly a best-effort attempt to make usage easier for staff, and not bulletproof by any means!
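One way to sketch that “am I at home?” heuristic: remember a fingerprint of the network seen on the first connect, and compare against it later. The fingerprint inputs and file layout are assumptions for illustration:

```python
import json
from pathlib import Path


def network_fingerprint(gateway_mac: str, ssid: str) -> str:
    # A crude fingerprint of the current local network; real inputs
    # might include the default gateway's MAC, the SSID, or the subnet.
    return f"{gateway_mac}/{ssid}"


def tunnel_mode(state_file: Path, current: str) -> str:
    """Return 'split' at home, 'full' elsewhere; first run defines home."""
    if state_file.exists():
        home = json.loads(state_file.read_text())["home"]
        return "split" if current == home else "full"
    # First connect: assume we're at home, and remember this network.
    state_file.write_text(json.dumps({"home": current}))
    return "split"
```

A vpnctl reset would then simply delete the state file, so the next connect re-learns “home”.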

Tunnel Server Discovery

I’ve implied that we do some discovery to figure out what tunnel server(s) to connect to, and it’s much simpler than it might at first seem, because we take the easy way out. We have a well-known set of DNS names where our tunnel servers reside, and through the use of CNAMEs and multiple A/AAAA DNS records, we can discover all of the potential tunnel servers.

Knowing all the individual tunnel servers, we can very naively check which ones respond the fastest, and connect to that one. Generally speaking, all of the individual tunnel servers in a given region should respond to this test within the same small window of time.

We hold onto the per-region latencies for after the tunnel is established, assuming of course we’re going down the happy path. Once the tunnel is established, we check latencies to other regions through the tunnel, and if a region is significantly worse through the tunnel than it was without, we establish additional tunnels to those regions to help with latency. We define “significantly worse” as a 20% hit in latency response time; the tunnel always adds a bit of overhead, though it shouldn’t be meaningful compared to the route taken over the internet to get from site to site.
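The selection and the “significantly worse” check boil down to a couple of small functions. Here the probe results are passed in; in practice they’d come from timing small requests to each discovered server:

```python
def pick_fastest(latencies_ms: dict) -> str:
    """Naively pick the tunnel server that responded the fastest."""
    return min(latencies_ms, key=latencies_ms.get)


def significantly_worse(direct_ms: float, tunneled_ms: float) -> bool:
    # The tunnel always adds a little overhead; only bother with an
    # extra tunnel when the detour costs more than 20% in latency.
    return tunneled_ms > direct_ms * 1.2
```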

Of course, this is all skipped if the user specified the --endpoint flag to override the target. Additionally, we do this discovery only for a specific region if the --region flag was given.

Authentication and Establishing a Tunnel

As touched on with the --sso flag, the authentication method for establishing a tunnel is exactly the same as described in the PKI blog post. A quick recap: we spin up a local HTTP server listening on localhost to handle an OAuth callback. We open up a browser pointing at our identity provider in order to get an access token, and the identity provider knows to redirect us back to localhost, which from the browser’s perspective is the staff member’s local machine, running our HTTP server. The tool now has the access token required to send over to the vpnd tunnel server, which will in turn trade it for the user information required to generate the tunnel, which we’ll go into more shortly. Perhaps interestingly, these tokens are specifically configured to be short lived (30 seconds) and good for two specific claims: one from vpnd, and then a follow-up from pkid for the staff cert. Token refreshing is disabled for these, as well.
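A minimal sketch of the localhost-callback portion of that dance follows. The port, path, and parameter names are assumptions, and a real implementation would also verify state and use PKCE:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def wait_for_callback():
    """Start a one-shot localhost server for the OAuth redirect.

    Returns (port, done_event, result_dict); the caller opens the
    browser at the identity provider with redirect_uri pointing at
    http://127.0.0.1:<port>/ and then blocks on the event.
    """
    result = {}
    done = threading.Event()

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            # The identity provider redirects the browser here with the
            # authorization response in the query string.
            result.update({k: v[0] for k, v in
                           parse_qs(urlparse(self.path).query).items()})
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"You may close this tab and get to work.")
            done.set()

        def log_message(self, *args):  # keep the CLI output quiet
            pass

    server = HTTPServer(("127.0.0.1", 0), Handler)  # ephemeral port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server.server_port, done, result
```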

Before making the request to create a tunnel, we first create a WireGuard key pair, so that we have a private key and a public key for the peer. We take the configuration based on the defaults explained above, with possible overrides given, as well as the WireGuard public key we generated, and the identity provider access token, and send that over to vpnd to request a tunnel to be established.
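WireGuard keys are Curve25519 keys, base64-encoded. As a sketch of the generation step, the private half is just 32 random bytes with the standard clamping applied (as wg genkey does); deriving the public half needs an X25519 scalar multiplication, which I’ve left out here:

```python
import base64
import os


def generate_private_key() -> str:
    """Generate a WireGuard-style Curve25519 private key (base64)."""
    key = bytearray(os.urandom(32))
    # Standard Curve25519 clamping: clear the low 3 bits and the top
    # bit, and set the second-highest bit.
    key[0] &= 248
    key[31] &= 127
    key[31] |= 64
    return base64.standard_b64encode(bytes(key)).decode()
```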

If the tunnel server responds with the happy path, it will send over its public key and some configuration options to set. vpnctl will then build the appropriate WireGuard configuration, and start the tunnel. Routes and DNS are configured, the tunnel is established, and the tool will do some follow up work to retrieve a staff cert if required.

This request is sent to vpnd via gRPC, either using normal TLS when no staff cert is available (and thus, the identity provider access token is required), or using mTLS with the staff cert if it’s valid.

Tunnel Server Work

The tunnel server running vpnd needs to do some work here that’s worth expanding on. When it is asked to establish a new tunnel, it will do a simple auth check dance using either the provided access token (required if no client cert was provided), or the staff cert when a client cert was given. It then looks up the groups the user belongs to and uses those to determine which internal routes the user is allowed to reach, while also allowing the user to route to the public internet through this tunnel if they decide they want to.
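The group-to-routes step can be sketched as a simple union over a policy table. The group names and CIDRs below are invented for illustration:

```python
# Hypothetical group -> internal routes policy.
GROUP_ROUTES = {
    "eng":    ["10.64.0.0/16"],                   # general internal services
    "sre":    ["10.64.0.0/16", "10.90.0.0/24"],   # plus out-of-band networks
    "office": ["10.70.0.0/24"],
}


def allowed_routes(groups: list, full_tunnel: bool) -> list:
    routes = set()
    for g in groups:
        routes.update(GROUP_ROUTES.get(g, []))
    if full_tunnel:
        # Let the user egress to the public internet through the tunnel.
        routes.add("0.0.0.0/0")
    return sorted(routes)
```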

A new ip{,6}tables chain is established to allow masquerading for IPv4 external communication, and as an extra safety net to ensure the user is only able to route to the places we intend to allow.

It will also keep some on-disk and in-memory state to make sure it knows how to handle the idle session timeout as well as the maximum session TTL, and will tear down sessions as these timers are triggered.

We also keep track of some information and expose metrics via Prometheus for each tunnel session, which is attributed to the user the session was created for.

Phase Two: Staff Cert

Now that we have an established tunnel, we have access to pkid, the service that will issue us a staff cert. Rather than require the user to also run certctl staff to request that a staff cert be issued (which is separately possible), we have vpnctl connect do this for us. With a monorepo, we’re able to share the exact same code path for each action across the tools, the one difference being that vpnctl already has an access token available to it if required.

This is a huge time saver for staff, even if it seems trivial. It literally cuts requirements to start your day in half, because it’s simply one command instead of two. Reducing friction is very important.

Phase Three: mTLS Authentication

Assume that we have an established tunnel, but then we take the laptop with us to lunch, and set up in a coffee shop afterwards. At this point, it’s safe to assume that the idle timeout has passed, and vpnd has torn down our tunnel on the server side.

We’ll need to re-establish a tunnel, and will use vpnctl connect again. This time, two things are different: we detect that we’re not “at home”, so we implicitly use a full tunnel (unless instructed otherwise), and we don’t have to do the identity provider dance because we have a valid staff cert.

This means that vpnctl connect should return much quicker this time, even if we’re tethered to a phone, or on public Wi-Fi. The authentication dance was already done for us, and that’s why we were granted a staff cert previously. It has an expiration date of its own, so we are able to make use of that to again further reduce staff friction when getting to work.

Of course, if the staff cert has expired by this point, the original flow will still be followed and you’ll need to log in to our identity provider (I hope you brought your hardware token for MFA!). This is discovered before the tool establishes a connection and makes the request to vpnd all without the user needing to specify flags - unless of course they don’t want to use their staff cert, and specify --sso which will require a new login, and request a new staff cert instead.
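That decision, made before the tool ever reaches out to vpnd, can be sketched as follows. The check here is just an expiry comparison; a real one would presumably also validate the cert chain:

```python
from datetime import datetime
from typing import Optional


def pick_auth(cert_expiry: Optional[datetime], force_sso: bool,
              now: datetime) -> str:
    """Return 'mtls' when a valid staff cert exists, otherwise 'sso'."""
    if force_sso:                # --sso: disregard the staff cert
        return "sso"
    if cert_expiry is not None and cert_expiry > now:
        return "mtls"            # fast path: no browser dance needed
    return "sso"                 # cert expired or missing: log in again
```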

Further Considerations

One thing we haven’t discussed yet is how end users end up with a copy of vpnctl in the first place. This will change for different shops, and you can get quite creative here. Consider not only staff, but contractors as well! Some options might include sending an S3 presigned URL (with expiry) via email to the user, or maybe a Slack bot that sends a link (or the binary directly). The Slack path would imply that the user already has access to Slack (or your messaging platform of choice), which is hopefully gated by your identity provider.

Another thing to consider is delivering updates to the vpnctl tool. We use a monorepo, which allows us to be pretty opinionated and also sprinkle commonality across many of our services and tools. One of these benefits is that every tool has an implicit update command available to it, such that vpnctl update and certctl update are available commands that aim to update the given tool.

Some tools may opt-in to running said update on their own at startup (or when finishing) if they desire, or even just alerting when an update is available and request the user run the update command. Our tooling requires access to the VPN to get both the manifest for any tooling update status, as well as getting the updated binaries, and so for vpnctl we will update the binary after a successful tunnel is established, and a staff cert is acquired.

This of course implies that we must support some old versions, as staff may be out of office for a few weeks at a time, and the last thing we want is to hamper their return to work by not being able to connect to the VPN! This is not a problem in practice, as the core logic of spinning up VPN tunnels is fairly trivial, and the only real updates to deliver to the tool are trying to make it a little smarter about any decisions it tries to make.

Closing

Hopefully I’ve been able to paint a fairly clear picture of what it’s like for us to start our work day and connect to our internal infrastructure with very little (if any) friction for our staff. I hope the effort put into staying out of our users’ way is obvious; we strongly believe that providing tooling to let them do their jobs is just as important as having that tooling be unintrusive.

With that, I suppose I can issue a vpnctl disconnect!