mTLS Chains in Go

Brief

Go gives us some really nice tools for handling TLS certificates, including the ability to manipulate the root CA certificates for your services and consumers. It also has everything we need to generate new certificates and sign them, which is what we’re going to focus on here.

We’ll first go over some quick basics about how certificates work, the chain of trust that we’re after, and mTLS. We’ll follow that up with some code snippets for how to build some of this stuff.

Certificates, and the chain of trust

X.509 certificates are what we use when we visit a website over TLS (still often called SSL), and also what we use with gRPC. A chain of trust is established from a common certificate that all parties trust, which we’ll get to shortly. The root certificate is generally a self-signed X.509 certificate: it has signed itself, and others choose to trust it. The private key used to sign that certificate can also sign other certificates, which will be trusted by any client that trusts the first self-signed cert.

That first self-signed cert is what we refer to as the root CA (certificate authority). The root CA can then be used to sign other certificates, establishing a chain of trust in which each certificate is vouched for by the one that signed it, all the way back up to the root CA, which the client is expected to trust. When we visit a website, the server normally sends us a chain of certificates, and our browser either trusts it or throws a big warning.

For example, either our local machine or our browser (currently) trusts the ISRG Root X1 certificate, a root certificate that Let’s Encrypt uses. From there, it (currently) signs the R3 certificate, which is used to sign end-user certificates, such as the one for vandersomething.com. The Let’s Encrypt page explains this much better than I can.

We can see a brief certificate chain for this site via the following:

-[~:$]- echo 'quit' | openssl s_client -connect vandersomething.com:443 | openssl x509 -enddate -noout
depth=2 C = US, O = Internet Security Research Group, CN = ISRG Root X1
verify return:1
depth=1 C = US, O = Let's Encrypt, CN = R3
verify return:1
depth=0 CN = vandersomething.com
verify return:1
DONE
notAfter=Nov 19 01:36:56 2021 GMT

The server has presented us with the end certificate at depth=0, which is the certificate for vandersomething.com. It was signed by the R3 certificate at depth=1 by Let’s Encrypt, which was in turn signed by the ISRG Root X1 certificate. My machine trusts that certificate, which allows OpenSSL to verify that the certificate is valid.

However, if we fire up a docker container that doesn’t have ca-certificates installed, we get something else:

-[~:$]- docker run --rm -it ubuntu:20.04 bash
root@f9a606473c33:/# apt-get update &> /dev/null && apt-get install -qqy openssl &> /dev/null
root@f9a606473c33:/# echo 'quit' | openssl s_client -connect vandersomething.com:443 | openssl x509 -enddate -noout
depth=2 C = US, O = Internet Security Research Group, CN = ISRG Root X1
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=1 C = US, O = Let's Encrypt, CN = R3
verify return:1
depth=0 CN = vandersomething.com
verify return:1
DONE
notAfter=Nov 19 01:36:56 2021 GMT

Notice we can no longer verify the certificate: verify error:num=20:unable to get local issuer certificate

However, if we install ca-certificates, we’ll trust all the certs in that package, which is why you see this a lot in docker images.
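The same effect can be reproduced in Go without a network connection. The following is a minimal sketch, assuming a throwaway self-signed certificate generated on the spot (the helper names are made up for illustration): verification fails against an empty pool and succeeds once the certificate is added to the pool, which is roughly what installing ca-certificates does for OpenSSL.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"errors"
	"fmt"
	"math/big"
	"time"
)

// newSelfSigned creates a throwaway self-signed CA certificate for the demo.
func newSelfSigned() (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "demo root"},
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, key.Public(), key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// isUnknownAuthority reports whether err means no trusted issuer was found.
func isUnknownAuthority(err error) bool {
	var ua x509.UnknownAuthorityError
	return errors.As(err, &ua)
}

// verifyBothWays verifies cert against an empty pool and against a pool
// that contains it, and returns both results.
func verifyBothWays(cert *x509.Certificate) (withoutRoot, withRoot error) {
	empty := x509.NewCertPool()
	_, withoutRoot = cert.Verify(x509.VerifyOptions{Roots: empty})

	trusted := x509.NewCertPool()
	trusted.AddCert(cert)
	_, withRoot = cert.Verify(x509.VerifyOptions{Roots: trusted})
	return withoutRoot, withRoot
}

func main() {
	cert, err := newSelfSigned()
	if err != nil {
		panic(err)
	}
	withoutRoot, withRoot := verifyBothWays(cert)
	fmt.Println("empty pool:  ", withoutRoot)
	fmt.Println("trusted pool:", withRoot)
}
```

Passing an explicit, non-nil pool in VerifyOptions.Roots keeps the system trust store out of the picture, so the result does not depend on what is installed on the machine.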

mTLS

In the above examples, we communicated with the server after verifying a certificate that it sent us. We did not send any certificate of our own to the server, and it was happy to communicate with us. If the server were enforcing mTLS, that is, any client authentication type other than NoClientCert, RequestClientCert, or VerifyClientCertIfGiven, it would refuse to communicate with us.

In Go, there is the ClientAuthType that can define how a service would handle incoming connections. The linked docs explain it quite clearly, and these values can be used to determine how to gate client access to a service.
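As a rough illustration of how those values differ, the sketch below sorts the five ClientAuthType constants by whether a client that sends no certificate can still complete the handshake. The function names are invented for this example; only the constants and their String method come from crypto/tls.

```go
package main

import (
	"crypto/tls"
	"fmt"
)

// acceptsAnonymousClients reports whether a server using the given
// ClientAuthType completes a handshake with a client that sends no certificate.
func acceptsAnonymousClients(t tls.ClientAuthType) bool {
	switch t {
	case tls.NoClientCert, tls.RequestClientCert, tls.VerifyClientCertIfGiven:
		return true
	default: // RequireAnyClientCert, RequireAndVerifyClientCert
		return false
	}
}

// summary maps each ClientAuthType name to whether anonymous clients get in.
func summary() map[string]bool {
	out := make(map[string]bool)
	for _, t := range []tls.ClientAuthType{
		tls.NoClientCert,
		tls.RequestClientCert,
		tls.RequireAnyClientCert,
		tls.VerifyClientCertIfGiven,
		tls.RequireAndVerifyClientCert,
	} {
		out[t.String()] = acceptsAnonymousClients(t)
	}
	return out
}

func main() {
	for name, ok := range summary() {
		fmt.Printf("%-28s accepts clients without a cert: %v\n", name, ok)
	}
}
```

Note that RequireAnyClientCert demands a certificate but does not verify it against ClientCAs, so RequireAndVerifyClientCert is the value that gives full mTLS.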

The idea behind mTLS is that each side of the communication has a certificate, and each side must trust the other to an implementation-specific degree. In practice, this is useful for trusting certificate chains signed by a specific root CA, regardless of the intermediate(s).

Use case

Our use case is pretty simple, and is as follows: We have one root CA for our product. We have many services, and we also have staff. For each service, we have many instances of the service running - and we also have an arbitrary number of staff.

We will lay out our certificate chain of trust as follows:

Root CA, self-signed, internally generated and environment (prod/staging/dev) specific
Intermediate certificate per service/role, signed by the root CA above
Leaf certificates signed by the intermediate above for its service/role, short lived

This means an individual instance of a service will spin up, obtain a certificate, and it will be signed by the intermediate for that service. That service can communicate with another type of service using the same cert, via mTLS, and they will accept each other’s communication via mutual trust of the root CA.

Staff use the same idea, but a different flow to get their certificates, which are longer-lived: long enough to cover the time an engineer might want that certificate during the work day.
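The root, intermediate, and leaf layout described above can be sketched with nothing but the standard library. This is an illustration rather than our actual tooling: the names, serial numbers, lifetimes, and the choice of ECDSA P-256 keys are all assumptions made to keep the example short. It builds the three levels and then verifies the leaf while trusting only the root.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// newTemplate returns a minimal certificate template. CA templates may sign
// other certificates; leaf templates may not.
func newTemplate(serial int64, cn string, isCA bool, ttl time.Duration) *x509.Certificate {
	t := &x509.Certificate{
		SerialNumber:          big.NewInt(serial),
		Subject:               pkix.Name{CommonName: cn},
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(ttl),
		BasicConstraintsValid: true,
		IsCA:                  isCA,
		KeyUsage:              x509.KeyUsageDigitalSignature,
		ExtKeyUsage:           []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth, x509.ExtKeyUsageClientAuth},
	}
	if isCA {
		t.KeyUsage |= x509.KeyUsageCertSign
	}
	return t
}

// issue generates a fresh key for tmpl and signs the certificate with
// parentKey. A nil parent produces a self-signed certificate.
func issue(tmpl, parent *x509.Certificate, parentKey *ecdsa.PrivateKey) (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	if parent == nil {
		parent, parentKey = tmpl, key
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, parent, key.Public(), parentKey)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

// verifiedChainLen builds root -> intermediate -> leaf and verifies the leaf
// while trusting only the root. It returns the length of the verified chain.
func verifiedChainLen() (int, error) {
	root, rootKey, err := issue(newTemplate(1, "demo root CA", true, 24*time.Hour), nil, nil)
	if err != nil {
		return 0, err
	}
	inter, interKey, err := issue(newTemplate(2, "demo service CA: billing", true, 12*time.Hour), root, rootKey)
	if err != nil {
		return 0, err
	}
	leaf, _, err := issue(newTemplate(3, "billing-7f9c", false, time.Hour), inter, interKey)
	if err != nil {
		return 0, err
	}

	roots := x509.NewCertPool()
	roots.AddCert(root)
	intermediates := x509.NewCertPool()
	intermediates.AddCert(inter)

	chains, err := leaf.Verify(x509.VerifyOptions{
		Roots:         roots,
		Intermediates: intermediates,
		KeyUsages:     []x509.ExtKeyUsage{x509.ExtKeyUsageClientAuth},
	})
	if err != nil {
		return 0, err
	}
	return len(chains[0]), nil
}

func main() {
	n, err := verifiedChainLen()
	if err != nil {
		panic(err)
	}
	fmt.Println("verified chain length:", n)
}
```

The verifier never needs to know which intermediate signed the leaf ahead of time. It only needs the root in its pool and the intermediate presented alongside the leaf, which is what makes one root per environment workable across many services.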

Code

We use a monorepo, which, other arguments aside, means that it’s super simple to drop in a single source of truth for this certificate. We can then provide a helper utility that supplies a *x509.CertPool that users can consume to figure out what the root CA is for their environment. This can either come from a hard coded PEM, files on disk, the environment, whatever we need, so long as we can turn it into that type in Go.

For example, if you hard code it and guard that by build tags for each environment, you can have a utility like the following:

package certs

import "crypto/x509"

// RootCertPool returns a new cert pool containing only the root CA.
// RootCACert is the hard-coded PEM string for the current environment
// (selected by build tag), so we panic if it's not valid.
func RootCertPool() *x509.CertPool {
	certPool := x509.NewCertPool()

	if !certPool.AppendCertsFromPEM([]byte(RootCACert)) {
		panic("hardcoded root CA cert invalid")
	}

	return certPool
}

Services and clients will now be able to pull that certs.RootCertPool() and feed it into their tls.Config for their own purposes. But, what if we want to generate this first cert in Go?

// Create an RSA key with a specified bit size
// which we'll use to self-sign our cert
key, _ := rsa.GenerateKey(rand.Reader, 4096)

// The cert, too, which is mostly boilerplate.
// You should read the docs to find out what these fields mean
cert := &x509.Certificate{
	SerialNumber:       big.NewInt(123),
	SignatureAlgorithm: x509.SHA512WithRSA,

	Subject: pkix.Name{
		Organization: []string{"Reenigne"},
		Country:      []string{"CA"},
		Province:     []string{"ON"},
		Locality:     []string{"Maple Leaf Village"},
	},

	// A root CA must be marked as a CA and be allowed to sign certificates.
	KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageDigitalSignature,
	ExtKeyUsage:           []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth, x509.ExtKeyUsageClientAuth},
	BasicConstraintsValid: true,
	IsCA:                  true,

	NotBefore: time.Now(),
	NotAfter:  time.Now().Add(10 * time.Minute),

	DNSNames:       []string{},
	EmailAddresses: []string{},
	IPAddresses:    []net.IP{},
	URIs:           []*url.URL{},
}

// Sign this cert, with itself, using the RSA key above
certBytes, _ := x509.CreateCertificate(rand.Reader, cert, cert, key.Public(), key)
signedCert, _ := x509.ParseCertificate(certBytes)

The above isn’t entirely useful, since we’re just discarding all of the output. However, if you saved the key in PEM format, and exported the cert in PEM format, you’ll be able to use the certificates in many applications.
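Saving both pieces as PEM only takes a few more lines. The following is a sketch under a couple of assumptions: the helper names are invented, the key is written as PKCS #8 (other encodings exist), and the key is 2048 bits purely to keep the demo quick. It finishes by loading the pair back with tls.X509KeyPair to show that the output is usable.

```go
package main

import (
	"crypto/rand"
	"crypto/rsa"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"time"
)

// encodeCertPEM wraps DER certificate bytes in a CERTIFICATE PEM block.
func encodeCertPEM(der []byte) []byte {
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
}

// encodeKeyPEM marshals the key as PKCS #8 and wraps it in a PRIVATE KEY block.
func encodeKeyPEM(key *rsa.PrivateKey) ([]byte, error) {
	der, err := x509.MarshalPKCS8PrivateKey(key)
	if err != nil {
		return nil, err
	}
	return pem.EncodeToMemory(&pem.Block{Type: "PRIVATE KEY", Bytes: der}), nil
}

// selfSignedPEM generates a key and a self-signed cert and returns both as PEM.
func selfSignedPEM() (certPEM, keyPEM []byte, err error) {
	// 2048 bits keeps the demo quick; the snippet above uses 4096.
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(123),
		Subject:               pkix.Name{Organization: []string{"Reenigne"}},
		NotBefore:             time.Now(),
		NotAfter:              time.Now().Add(10 * time.Minute),
		KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageDigitalSignature,
		BasicConstraintsValid: true,
		IsCA:                  true,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, key.Public(), key)
	if err != nil {
		return nil, nil, err
	}
	keyPEM, err = encodeKeyPEM(key)
	if err != nil {
		return nil, nil, err
	}
	return encodeCertPEM(der), keyPEM, nil
}

// loadPair proves the PEM output is usable by loading it back as a key pair.
func loadPair(certPEM, keyPEM []byte) error {
	_, err := tls.X509KeyPair(certPEM, keyPEM)
	return err
}

func main() {
	certPEM, keyPEM, err := selfSignedPEM()
	if err != nil {
		panic(err)
	}
	if err := loadPair(certPEM, keyPEM); err != nil {
		panic(err)
	}
	fmt.Printf("%s", certPEM)
}
```

In a real setup the key PEM would go to restricted storage and never be printed; only the certificate is safe to share.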

The cert boilerplate above contains some interesting fields, such as DNSNames and EmailAddresses, which have quite a few neat use cases. For example, the DNSNames field can hold a service or role name in an intermediate cert, or the hostname of a service instance in a leaf certificate. EmailAddresses might be used in staff certs as a whitelist of email addresses for AuthZ. DNSNames could also be abused for group names: setting it to something like []string{"codemonkey.groups.reenigne.net"} could indicate that the holder is in the codemonkey group for RBAC.
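To make the group idea concrete, here is one way the receiving side might pull group names back out of the DNS SANs. It is a hypothetical helper, not something from our codebase: it assumes groups are encoded as a single label in front of an agreed suffix, and that the names come from the DNSNames field of an already verified peer certificate.

```go
package main

import (
	"fmt"
	"strings"
)

// groupsFromDNSNames returns the group names encoded in SAN DNS entries of
// the form "<group>.<suffix>". Entries that do not match are ignored.
func groupsFromDNSNames(dnsNames []string, suffix string) []string {
	var groups []string
	for _, name := range dnsNames {
		if !strings.HasSuffix(name, "."+suffix) {
			continue
		}
		group := strings.TrimSuffix(name, "."+suffix)
		if group == "" || strings.Contains(group, ".") {
			continue // require exactly one label in front of the suffix
		}
		groups = append(groups, group)
	}
	return groups
}

func main() {
	sans := []string{"user.ops.reenigne.net", "codemonkey.groups.reenigne.net"}
	fmt.Println(groupsFromDNSNames(sans, "groups.reenigne.net")) // prints [codemonkey]
}
```

Whatever encoding is chosen, the important part is that the signing side controls these fields, so a group name in a certificate is only as trustworthy as the checks done before signing.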

Going further with PKI

We have a custom PKI (public key infrastructure) solution built in house that helps with a lot of the automation of this certificate management. All of our services require mTLS in order to communicate, except for two RPCs in our PKI service. These two RPCs are for first obtaining a new certificate, or more specifically, signing a certificate request. The two RPCs differ in the methods of signing the certificate, and the AuthN method.

The first is for our services. A service will spin up, and generate a private key that it will use throughout its lifetime. It does this in the way described above (but with error handling). It then generates a CertificateRequest that will be sent to our PKI service.

We do something along the lines of the following:

// dns would be the output from os.Hostname
// or empty if we're not a CSR for a service (i.e. staff cert)
dns := []string{}
csr := &x509.CertificateRequest{
	SignatureAlgorithm: x509.SHA512WithRSA,
	Subject: pkix.Name{
		Organization: []string{"Reenigne"},
		Country:      []string{"CA"},
		Province:     []string{"ON"},
		Locality:     []string{"Maple Leaf Village"},
	},

	DNSNames:       dns,
	EmailAddresses: []string{},
	IPAddresses:    []net.IP{},
	URIs:           []*url.URL{},
}

if len(dns) > 0 {
	csr.Subject.CommonName = dns[0]
}

This CSR is a template, returned to the caller so that it can override some of the fields. The caller might try to override the common name or DNS entries, but the PKI service will reject that (and raise some alarms). The CSR is then signed with the service’s private key, encoded to PEM, and sent in an RPC to the PKI service.
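Turning the template into something that can be sent over the wire goes through x509.CreateCertificateRequest, which signs the request with the private key. The sketch below shows that step plus what a receiver might do with the result. The function names and the hostname are placeholders, and the 2048-bit key is only there to keep the demo fast.

```go
package main

import (
	"crypto/rand"
	"crypto/rsa"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"errors"
	"fmt"
)

// newCSRPEM fills in a CSR template for host, signs it with key, and
// returns the result PEM-encoded, ready to send to a signing service.
func newCSRPEM(key *rsa.PrivateKey, host string) ([]byte, error) {
	tmpl := &x509.CertificateRequest{
		SignatureAlgorithm: x509.SHA512WithRSA,
		Subject: pkix.Name{
			Organization: []string{"Reenigne"},
			CommonName:   host,
		},
		DNSNames: []string{host},
	}
	der, err := x509.CreateCertificateRequest(rand.Reader, tmpl, key)
	if err != nil {
		return nil, err
	}
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE REQUEST", Bytes: der}), nil
}

// parseCSRPEM is what the receiving side might do: decode the PEM, parse the
// request, and check that it was signed by the key it claims to contain.
func parseCSRPEM(data []byte) (*x509.CertificateRequest, error) {
	block, _ := pem.Decode(data)
	if block == nil || block.Type != "CERTIFICATE REQUEST" {
		return nil, errors.New("input is not a PEM certificate request")
	}
	csr, err := x509.ParseCertificateRequest(block.Bytes)
	if err != nil {
		return nil, err
	}
	return csr, csr.CheckSignature()
}

// roundTrip generates a key, builds a CSR for host, and returns the common
// name the receiver sees after parsing it.
func roundTrip(host string) (string, error) {
	key, err := rsa.GenerateKey(rand.Reader, 2048) // 2048 keeps the demo quick
	if err != nil {
		return "", err
	}
	data, err := newCSRPEM(key, host)
	if err != nil {
		return "", err
	}
	csr, err := parseCSRPEM(data)
	if err != nil {
		return "", err
	}
	return csr.Subject.CommonName, nil
}

func main() {
	cn, err := roundTrip("billing-7f9c")
	if err != nil {
		panic(err)
	}
	fmt.Println("CSR common name:", cn)
}
```

Only the public half of the key is embedded in the CSR, so the private key never has to leave the process that generated it.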

This is largely how Let’s Encrypt and other SSL cert providers work. It allows the provider to have zero knowledge of the private key, which no one but our service instance should ever know.

Our PKI service will do some AuthN work to verify the client. In our case, we deploy to Kubernetes, and the service is able to ask about running pods, and their metadata. It can then check the caller, their CSR details, and sign accordingly, or reject and alarm. If it signs the cert, a service cert will be short-lived, though there are good reasons for it to live at least a few hours (consider an outage that leaves expiring certs in the field that can’t be renewed).
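The signing side might look something like the sketch below. It is not our PKI service: the AuthN step is reduced to an allowedHost string that stands in for whatever the service learned about the caller, the CA is a throwaway generated by a demo helper, and all function names are invented. What it does show is the shape of the check, sign, or reject decision.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"errors"
	"fmt"
	"math/big"
	"time"
)

// signCSR validates a CSR and issues a short-lived leaf certificate for it.
// allowedHost stands in for whatever the AuthN step learned about the caller.
func signCSR(csrDER []byte, allowedHost string, ca *x509.Certificate, caKey *ecdsa.PrivateKey, ttl time.Duration) (*x509.Certificate, error) {
	csr, err := x509.ParseCertificateRequest(csrDER)
	if err != nil {
		return nil, err
	}
	// Proves the caller holds the private key matching the CSR's public key.
	if err := csr.CheckSignature(); err != nil {
		return nil, err
	}
	if csr.Subject.CommonName != allowedHost || len(csr.DNSNames) != 1 || csr.DNSNames[0] != allowedHost {
		return nil, errors.New("CSR names do not match the authenticated caller")
	}
	serial, err := rand.Int(rand.Reader, new(big.Int).Lsh(big.NewInt(1), 120))
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: serial,
		Subject:      pkix.Name{CommonName: allowedHost},
		DNSNames:     []string{allowedHost},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(ttl),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth, x509.ExtKeyUsageClientAuth},
	}
	// Only the public key from the CSR is used; names come from the AuthN result.
	der, err := x509.CreateCertificate(rand.Reader, tmpl, ca, csr.PublicKey, caKey)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// must panics on error, which is acceptable for demo setup only.
func must[T any](v T, err error) T {
	if err != nil {
		panic(err)
	}
	return v
}

// demoInputs returns a throwaway CA, its key, and a DER-encoded CSR for host.
func demoInputs(host string) (*x509.Certificate, *ecdsa.PrivateKey, []byte) {
	caKey := must(ecdsa.GenerateKey(elliptic.P256(), rand.Reader))
	caTmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "demo service CA"},
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(24 * time.Hour),
		KeyUsage:              x509.KeyUsageCertSign,
		BasicConstraintsValid: true,
		IsCA:                  true,
	}
	caDER := must(x509.CreateCertificate(rand.Reader, caTmpl, caTmpl, caKey.Public(), caKey))
	svcKey := must(ecdsa.GenerateKey(elliptic.P256(), rand.Reader))
	csrDER := must(x509.CreateCertificateRequest(rand.Reader, &x509.CertificateRequest{
		Subject:  pkix.Name{CommonName: host},
		DNSNames: []string{host},
	}, svcKey))
	return must(x509.ParseCertificate(caDER)), caKey, csrDER
}

// demoIssue requests a cert for requestedHost while authenticated as allowedHost.
func demoIssue(requestedHost, allowedHost string) (*x509.Certificate, error) {
	ca, caKey, csrDER := demoInputs(requestedHost)
	leaf, err := signCSR(csrDER, allowedHost, ca, caKey, 4*time.Hour)
	if err != nil {
		return nil, err
	}
	return leaf, leaf.CheckSignatureFrom(ca)
}

func main() {
	_, err := demoIssue("billing-7f9c", "billing-7f9c")
	fmt.Println("matching CSR:", err)
	_, err = demoIssue("payments-1a2b", "billing-7f9c")
	fmt.Println("mismatched CSR:", err)
}
```

Building the certificate from the authenticated identity, rather than copying fields out of the CSR, means a caller cannot smuggle in extra names even if a check is missed.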

The staff cert flow works similarly, but we have a tool that does a few things first: it initiates the SSO flow, generates a local cert on disk if one doesn’t exist, and creates the CSR. This is the other RPC that doesn’t require mTLS in our infrastructure. We return a cert to the user that is signed by the staff intermediate cert. The leaf/staff cert that the user gets will look similar to the following (with most fields removed):

-[~/src/pcgworld/go-services:$]- ./bin/certctl inspect data/staff-cert.pem | jq .
{
  "signature_algorithm": "SHA512-RSA",
  "public_key_algorithm": "RSA",
  "issuer": {
    "common_name": "Reenigne DEVELOPMENT Service CA: staff"
  },
  "subject": {
    "common_name": "Reenigne staff cert"
  },
  "not_before": "2021-10-20T22:37:44Z",
  "not_after": "2021-10-21T12:37:44Z",
  "is_ca": false,
  "san_dns": [
    "user.ops.reenigne.net"
  ],
  "san_emails": [
    "user@reenigne.net"
  ]
}

The service certs look similar, but have a different issuer, common name, and DNS name. The services will then spin up a background job that routinely refreshes the certificate. Refreshing the certificate does require mTLS, and because of that, the PKI service can identify the certificate used. The PKI service will take the new CSR, sign it after another round of verification, and then, a few seconds later, add the cert that was used for the call to the CRL. New connections always use the latest certificate, so those few seconds are really just a grace period. Certificates age out of the CRL when they expire, as these certs don’t live for very long.
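The background refresh job can be sketched as a small store guarded by a read/write lock. This is an assumption about the shape of such a job, not our implementation: fetchFunc is a placeholder for the RPC to the PKI service, and the intervals are arbitrary. The certificate getters in a tls.Config can then read from a store like this on every handshake.

```go
package main

import (
	"context"
	"crypto/tls"
	"errors"
	"fmt"
	"sync"
	"time"
)

// fetchFunc obtains a fresh certificate, for example by sending a new CSR to
// the PKI service. It is a placeholder for that RPC.
type fetchFunc func(ctx context.Context) (*tls.Certificate, error)

// certStore guards the current certificate with a read/write lock.
type certStore struct {
	mu   sync.RWMutex
	cert *tls.Certificate
}

// current returns the certificate most recently stored, if any.
func (s *certStore) current() (*tls.Certificate, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	if s.cert == nil {
		return nil, errors.New("no cert loaded")
	}
	return s.cert, nil
}

// refreshOnce fetches a new certificate and swaps it in. On failure the
// previous certificate stays in place so the service keeps working.
func (s *certStore) refreshOnce(ctx context.Context, fetch fetchFunc) error {
	cert, err := fetch(ctx)
	if err != nil {
		return err
	}
	s.mu.Lock()
	s.cert = cert
	s.mu.Unlock()
	return nil
}

// refreshLoop refreshes on every tick until ctx is cancelled.
func (s *certStore) refreshLoop(ctx context.Context, every time.Duration, fetch fetchFunc) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			_ = s.refreshOnce(ctx, fetch) // a real service would log or alert here
		}
	}
}

// demo exercises the store with a fake fetch function and reports how many
// times a certificate was fetched before the context expired.
func demo() (int, error) {
	store := &certStore{}
	if _, err := store.current(); err == nil {
		return 0, errors.New("expected an empty store to report an error")
	}
	calls := 0
	fetch := func(context.Context) (*tls.Certificate, error) {
		calls++
		return &tls.Certificate{}, nil
	}
	if err := store.refreshOnce(context.Background(), fetch); err != nil {
		return 0, err
	}
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	store.refreshLoop(ctx, 10*time.Millisecond, fetch) // blocks until ctx expires
	_, err := store.current()
	return calls, err
}

func main() {
	n, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println("fetches:", n)
}
```

In a service, refreshLoop would run in its own goroutine with an interval comfortably shorter than the certificate lifetime, so a few failed refreshes do not leave the instance with an expired cert.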

PKI automation limits

This PKI stuff sounds pretty useful, and it is! However, it has some limits, which make sense, but aren’t always immediately obvious.

The intermediate certificates must be manually added

This is because the PKI service does not generate the intermediate certs; it can’t! The intermediate certs first need to be signed by the root CA, and the private key for that is not known to PKI.

Intermediate certificates need their private key exposed

In order for the PKI service to sign certificates for the leaf nodes, it needs the private key. This isn’t a pro or a con, but something to keep in mind when you choose your storage for where these go.

PKI must register itself with PKI

This is a silly one, but most of our infrastructure uses a common library to pull in a certificate from PKI. The PKI service can’t use that, because the first one to spin up needs a certificate before it can start its server. Therefore, the PKI service has a custom certificate loader, and signs itself, which is safe if it already has access to the intermediate key store.

Using the mTLS certs

So far we’ve seen a few things about how to generate the certs, but we haven’t seen how to actually configure a service using mTLS. This part is pretty trivial, given Go’s common use of the *tls.Config type. We can create a config struct similar to the one below (make sure you add everything you need, like acceptable cipher suites):

tlsConfig := &tls.Config{
	ServerName: "my-service-name",              // Only used by clients; a server ignores this field
	ClientAuth: tls.RequireAndVerifyClientCert, // This fully enforces valid mTLS certs
	ClientCAs:  certs.RootCertPool(),           // From the first example!

	// Dynamically get the certificate to present to the client.
	// You could go very far with this, but we just use it to
	// make sure we're sending an up-to-date certificate that
	// has not expired to the client.
	GetCertificate: func(chi *tls.ClientHelloInfo) (*tls.Certificate, error) {
		// Assume we have something refreshing the cert
		// so we'll do a read lock here.
		certLock.RLock()
		defer certLock.RUnlock()

		// Assume that lock is protecting a `cert`
		// which stores the current non-expired certificate.
		if cert == nil {
			return nil, errors.New("no cert loaded")
		}

		return cert, nil
	},
}

This *tls.Config can now be fed into a net/http server, or a gRPC server. A client uses one too, but ever so slightly differently:

tlsConfig := &tls.Config{
	ServerName: "target-service-name", // The virtual host we want to connect to
	RootCAs:    certs.RootCertPool(),  // From the first example!
	GetClientCertificate: func(cri *tls.CertificateRequestInfo) (*tls.Certificate, error) {
		certLock.RLock()
		defer certLock.RUnlock()
		if cert == nil {
			return nil, errors.New("no cert loaded")
		}

		return cert, nil
	},
}

Note that it is functionally equivalent to the server section, because we’d want to re-use the same certificate loading code for both servers and clients. An interface that returns the CA pool, those two certificate getters required in tls.Config, and maybe one helper that returns the certificate itself would be really helpful here. You could then implement that for your own PKI service, as well as loading the info from files, or the environment, or anywhere else.
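One possible shape for that interface is sketched below. CertSource, staticSource, and the two config builders are hypothetical names, not part of any library; the static implementation simply hands back one fixed pool and certificate, where a PKI-backed one would return whatever it last refreshed.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
)

// CertSource is a hypothetical interface, not part of any library. It bundles
// what both server and client configs need.
type CertSource interface {
	RootPool() *x509.CertPool
	Certificate() (*tls.Certificate, error)
	GetCertificate(*tls.ClientHelloInfo) (*tls.Certificate, error)
	GetClientCertificate(*tls.CertificateRequestInfo) (*tls.Certificate, error)
}

// staticSource is the simplest implementation: one fixed pool and certificate,
// for example loaded from files at startup.
type staticSource struct {
	pool *x509.CertPool
	cert *tls.Certificate
}

func (s *staticSource) RootPool() *x509.CertPool { return s.pool }

func (s *staticSource) Certificate() (*tls.Certificate, error) {
	if s.cert == nil {
		return nil, errors.New("no cert loaded")
	}
	return s.cert, nil
}

func (s *staticSource) GetCertificate(*tls.ClientHelloInfo) (*tls.Certificate, error) {
	return s.Certificate()
}

func (s *staticSource) GetClientCertificate(*tls.CertificateRequestInfo) (*tls.Certificate, error) {
	return s.Certificate()
}

// ServerConfig builds a server-side mTLS config from any CertSource.
func ServerConfig(src CertSource) *tls.Config {
	return &tls.Config{
		ClientAuth:     tls.RequireAndVerifyClientCert,
		ClientCAs:      src.RootPool(),
		GetCertificate: src.GetCertificate,
	}
}

// ClientConfig builds the matching client-side config.
func ClientConfig(src CertSource, serverName string) *tls.Config {
	return &tls.Config{
		ServerName:           serverName,
		RootCAs:              src.RootPool(),
		GetClientCertificate: src.GetClientCertificate,
	}
}

// demoSource returns a source holding an empty placeholder certificate.
func demoSource() *staticSource {
	return &staticSource{pool: x509.NewCertPool(), cert: &tls.Certificate{}}
}

// wired reports whether both configs hand out the source's certificate.
func wired(src CertSource) bool {
	want, err := src.Certificate()
	if err != nil {
		return false
	}
	fromServer, err1 := ServerConfig(src).GetCertificate(nil)
	fromClient, err2 := ClientConfig(src, "target-service-name").GetClientCertificate(nil)
	return err1 == nil && err2 == nil && fromServer == want && fromClient == want
}

func main() {
	fmt.Println("wired:", wired(demoSource()))
}
```

With this split, services only ever call ServerConfig or ClientConfig, and swapping file-based certificates for PKI-issued ones is a matter of passing a different CertSource.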

Closing

Through this, we’ve only touched on the basics of the certificate management available in Go’s built-in libraries. It has been really nice to use a language that includes this kind of easy-to-use functionality. Take a look through the rest of that tls.Config struct, and see what else might appeal to you. There are callbacks in there that let you do quite a bit per-connection.