Public Key Infrastructure
Brief
A while back, I wrote about mTLS in Go and talked a little about the in-house PKI service used to manage it. I wanted to go a little bit further in detail about how it works, how certs are issued and handed out to staff and services, and how clients and services make use of the certs transparently.
Every service is paired with client tooling in order to interact with it, and each of these tools needs to use a staff cert to communicate with the service. The services all need valid certs to communicate with other services and clients. The path to obtaining certs, for staff and services alike, also needs to not get in the way of productivity.
Quick Review
A whirlwind review of our cert setup: We have a root CA cert that was generated offline (the private key remains offline until a new intermediate cert needs to be issued and signed). There is a different intermediate cert for every service, each signed by the root CA and in turn used to sign the leaf certs for every instance of the service under it. The intermediate certs and their private keys are available to the PKI service so that leaf certs can be signed by these intermediates. Individual instances of services generate their private key when booting up, as well as a certificate signing request (CSR) to request their signed certificate, never allowing the private key to leave memory.
We don’t use any RSA keys; all keys are ECDSA (rather than Ed25519, but only because browser support for Ed25519 isn’t wide enough yet, and we sometimes present these certs to a browser). These keys are much faster to generate too, which is always nice.
Staff Certs
As mentioned, for staff to be able to communicate with any service, they need their own cert.
One of the intermediate certs in our PKI is staff, stored in the same way as the other service intermediates; however, it cannot be used to sign service certs (there is no actual service called staff).
Issuing a staff cert involves trading an access token from our identity provider with the PKI service; the PKI service will verify the token, obtain information about the user, and appropriately fill in the fields of the staff cert that it will then sign.
When our tooling is invoked to get a staff cert, it will first start up a local HTTP server listening on localhost, whose only job is to handle an OAuth2 callback. The tool generates a link to our OIDC-compliant identity provider and opens the user’s browser to that link, allowing them to sign in with multi-factor authentication. The identity provider knows that this client is only allowed to redirect the browser back to localhost, which works for the user because that’s where the tool is listening, and the HTTP server running in the tool is then able to gather the access token.
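A minimal sketch of that callback flow in Go (illustrative only; the real tool would also validate the OAuth2 state parameter and handle errors):

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"net/url"
)

// extractCode pulls the authorization code out of the callback's
// query string, e.g. "code=abc&state=xyz" yields "abc".
func extractCode(rawQuery string) string {
	vals, err := url.ParseQuery(rawQuery)
	if err != nil {
		return ""
	}
	return vals.Get("code")
}

func main() {
	codes := make(chan string, 1)

	// Listen on an ephemeral localhost port; the real tool would embed
	// this address in the redirect URI sent to the identity provider.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	go http.Serve(ln, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "Signed in; you can close this tab.")
		codes <- extractCode(r.URL.RawQuery)
	}))

	fmt.Printf("waiting for the OAuth2 callback on http://%s/\n", ln.Addr())
	// The real tool now opens the browser and blocks on <-codes before
	// trading the resulting access token (plus the CSR) with PKI.
}
```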
The tool also makes sure that the user has an ECDSA private key locally on disk at a convention-driven path, stored in PEM format. It will use this key to generate a CSR template with many empty fields, expecting that the PKI service will fill them out.
The tool also makes sure that the ECDSA private key is available in the user’s ~/.ssh directory as an OpenSSH private key. This will be useful for users who have SSH access to at least one group of hosts.
Now that the tool has the access token and the CSR, it can send a gRPC request to the PKI service over regular TLS with both in the message. The PKI service verifies the access token with the identity provider and retrieves some user information so it can accurately fill out the fields in the certificate. These fields include DNS names that identify the user, and the groups that they belong to.
The PKI service is also able to, using the same CSR, issue an SSH cert if the user is in any groups that allow them SSH access. This SSH cert is given principals tied to the user’s groups from the identity provider, and servers are configured to accept appropriately signed certs.
Because the identity provider relays how long the user’s access token is valid for, the PKI service knows exactly what lifetime to set on the cert. Both certificates expire when the access token does, which for us is approximately 14 hours, allowing for a long or split work day without the user needing to refresh their staff certs.
The user then receives their public staff cert used for mTLS, and the OpenSSH cert used for SSH.
The tool will store both certs alongside the previously stored private keys, and for convenience provides a standardized config file in the user’s ~/.ssh directory, suggesting that the user consider adding an include for it from their main ~/.ssh/config file for ease of use. (As a rule, we never want to touch the user’s ~/.ssh/config file; we instead supply the include line in the tool’s output if we were given an SSH cert to use.)
The user only needs to include it once, and will then be able to simply ssh some-server.datacenter to gain access with the proper username, using their staff SSH cert.
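For illustration, the generated config and the one-time include might look like this (all file names and host patterns here are hypothetical; the tool prints the exact include line):

```
# ~/.ssh/config — the single line the user adds themselves:
Include ~/.ssh/staff_config

# ~/.ssh/staff_config — written by the tool:
Host *.datacenter
    IdentityFile ~/.ssh/staff_key
    CertificateFile ~/.ssh/staff_key-cert.pub
```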
The X.509 cert returned to the user is a chain that includes the staff intermediate cert, giving the user the full chain required to present to tooling. This means that tooling that trusts our root CA cert is happy to trust the signed intermediate and, in turn, the signed staff leaf cert.
Service Certs
Service certs, on the other hand, are actually much simpler. There’s no SSO, and it must be fully automated. We deploy services to Kubernetes, including the PKI service, which means we have an authority on what services are deployed. The PKI service has read access to the Kubernetes API so it can query the service being requested and the pods deployed for it, and perform some checks to ensure that the service requesting a certificate is valid.
Once we’ve completed that dance, the service is presented with a certificate to use. Because this is fully automated, we keep a very short TTL on these certs: only one hour. Once a cert has half an hour of validity remaining, the service will ask the PKI service for a renewal every five minutes until it succeeds. Of course, this starts to alert quite loudly if any single cert drops below 15 minutes of validity without managing to renew. In practice, we haven’t seen this happen, though it may of course make sense to allow these certs to live longer, whether that’s six hours or seven days, to allow more time to react in case of failure (which also keeps existing services operable if PKI were to fail).
Service certs follow the same cert chain, off the root CA to an intermediate cert for its service, which then signs its leaf cert.
While the intermediate certs must be created manually, with the key staying online (accessible to PKI so it can sign leaf certs), this only happens during a new service deployment; it does, however, require the root CA key, which is kept entirely offline. Luckily, we have tooling that handles the arbitrary generation of keys and CSRs and can sign whatever it needs to, so this is a very low-effort action.
Certs in Development
We use a valid staff cert to work with services in development. This is required in order to exercise the full mTLS path from the start of any service, and it keeps the stack required to set these things up quite small.
For the smallest scale development, we have a static keypair that represents a development root CA. Services in development will trust both the production and development root CAs, while in production, services are built and deployed by CI and will not have that development cert available to them. This allows clients to need zero changes during development: they will trust the development CA from the service they are working on, and the service will trust the user’s normal staff cert. This of course implies that we must be able to prohibit this development cert from working in production, alerting if it’s ever found there.
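A sketch of how a service might assemble its trust pool under this scheme (the function and the dev-root path are hypothetical):

```go
package main

import (
	"crypto/x509"
	"fmt"
	"os"
)

// trustPool builds the CA pool a service verifies peers against: the
// production root always, plus the static development root only when
// it exists on disk. CI-built production images never include the dev
// root file, so the second append simply never happens there.
func trustPool(prodRootPEM []byte, devRootPath string) (*x509.CertPool, bool) {
	pool := x509.NewCertPool()
	ok := pool.AppendCertsFromPEM(prodRootPEM)
	if devPEM, err := os.ReadFile(devRootPath); err == nil {
		pool.AppendCertsFromPEM(devPEM)
	}
	return pool, ok
}

func main() {
	prodRootPEM := []byte("-----BEGIN CERTIFICATE-----\n...") // placeholder
	pool, _ := trustPool(prodRootPEM, "/etc/dev-root-ca.pem")  // hypothetical path
	fmt.Println(pool != nil)
}
```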
For anything beyond this smallest scale, we have automation that lets us run a small Kubernetes cluster locally, deploying just like we would in production, but with the development certs in use. The automation that builds this for us also handles all of the service certs, creating all the intermediate certs needed, and signing them with the development key, spinning up the PKI service in the cluster just like it would in production, but with this development root CA.
While the local Kubernetes cluster is easy to fire up and iterate on, it’s not entirely without friction (yet). Because of that, we will have a path for simply using the hardcoded cert in development when not running in Kubernetes. Another possibility: since the staff cert path is simply loading the certificate from a file, there’s nothing stopping a service from using the local staff cert on disk to present its endpoints during development, though there will be some hassle with hostname validation, of course.
CRLs
A Certificate Revocation List (CRL) is used to reject known-bad certificates, for whatever may be deemed “bad” in your environment. For us, a “bad” cert is one that has been rotated before it expired: when a service shuts down cleanly because it is being replaced by a new version, it will let PKI know that its cert is no longer useful.
CRLs are used for staff certs as well: if a user regenerates one for any reason before their old one has expired, we add the old one to the CRL.
These are only useful if they’re implemented, and most users don’t care about this detail. This is one of the reasons we want to make sure this feature is given for free with our tooling and frameworks for creating services and client tooling. In Go, this means having a helper that pre-populates things such as the tls.Config.
Client and Service Integration
Being able to create services and client tooling that take advantage of the PKI in a simple fashion is a requirement for frictionless adoption. When we create services, we make sure there’s a simple entrypoint that wraps all of the configuration required to gather certificates from PKI, and local client tooling can make use of staff certs via automatic discovery. There is a certificate management type that keeps a background job around to handle refreshing the certificate as needed, on both the client and service side. Clients refresh their certs in case of expiration (however rarely a client cert expires during use). Services also refresh their certificate with PKI when it’s near expiry to make sure that the certificate presented to clients is always valid.
The client side is quite simple. When connecting to a service, a client type is created, pre-configured to look in a well-known path for the user’s staff cert. This is the same place that the staff cert tooling will place the certificate when it is first issued to the user. Tooling will set its client key and certificate to this staff certificate and create the connection. We call the certificate management type for this the FileLoader, which simply loads the certificate keypair from files on disk.
Services are more involved. There are two methods: one helps handle legacy services in an identical fashion to the client tooling, by looking for certificates on disk. The other is for any service deployed to Kubernetes. These services use the PKILoader technique, which communicates with the PKI service, sending over a generated CSR while keeping the ephemeral private key in memory so it never leaves the instance.
Service to service communication requires no additional work, as service certs can be used as client certs. If service A calls an RPC on service B, all service B has to do is make sure that service A has access based on its certificate information.
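That check can be as small as matching the verified peer certificate’s DNS names against an allow list; a sketch (the service names are hypothetical):

```go
package main

import "fmt"

// authorized checks the DNS names from a verified peer certificate
// against the callers a service allows; the cert identity doubles as
// the client identity, so no extra credential is needed.
func authorized(peerDNSNames []string, allowed map[string]bool) bool {
	for _, name := range peerDNSNames {
		if allowed[name] {
			return true
		}
	}
	return false
}

func main() {
	// Service B allowing RPCs from service A.
	allowed := map[string]bool{"service-a.internal": true}
	fmt.Println(authorized([]string{"service-a.internal"}, allowed)) // true
}
```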
Closing
As you can see, there are a few moving parts here, but we leverage well-known technologies to keep implementations simple. As someone who has implemented this, the most important part is reducing friction as much as possible. If engineers who aren’t familiar with certificates are suddenly forced to deal with them, there will be hesitation and confusion. Introducing a system like the one I’ve described can be done gradually, starting with new services or by making mTLS preferred but not required. Automation is key to success, especially with a small team.