Building Kubernetes Controllers With Controller-Runtime: The Overview and the Warts

A few weeks ago I built a Kubernetes controller with controller-runtime, and was mostly pleasantly surprised at the experience, having only written controllers “from scratch” previously. I used kubebuilder to scaffold the controller-runtime project. I’m on the fence about the value of kubebuilder for future projects (mostly due to how it scaffolds YAML), but it definitely got me set up far faster than I would have been without it. Kubebuilder sets up the basic plumbing of a controller for you: the YAML for the controller itself, the main loop, blank reconcile functions for each resource, etc. The actual code it sets up uses controller-runtime, which I’ll talk about in the latter half of this post.

Building The API

Kubernetes APIs are built from nested types. PinnedDeployment has a spec of type PinnedDeploymentSpec, PinnedDeploymentSpec has… etc. You define these as structs in Go, and a nifty tool handles some code generation (and YAML generation of the CRD) for you.
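As a rough illustration of that nesting, here is a minimal sketch using only the standard library. ObjectMeta is a trimmed local stand-in for the real metav1.ObjectMeta, and the Replicas field is a hypothetical placeholder, not the actual PinnedDeployment spec.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ObjectMeta is a local stand-in for metav1.ObjectMeta, trimmed to a few fields.
type ObjectMeta struct {
	Name      string            `json:"name,omitempty"`
	Namespace string            `json:"namespace,omitempty"`
	Labels    map[string]string `json:"labels,omitempty"`
}

// PinnedDeploymentSpec is the nested spec type. Replicas is a hypothetical field.
type PinnedDeploymentSpec struct {
	Replicas int32 `json:"replicas,omitempty"`
}

// PinnedDeployment is the top-level type: embedded metadata plus a nested spec.
type PinnedDeployment struct {
	ObjectMeta `json:"metadata,omitempty"`
	Spec       PinnedDeploymentSpec `json:"spec,omitempty"`
}

// render serializes the object roughly the way an API client would see it.
func render(pd PinnedDeployment) string {
	out, err := json.Marshal(pd)
	if err != nil {
		panic(err)
	}
	return string(out)
}

func main() {
	pd := PinnedDeployment{
		ObjectMeta: ObjectMeta{Name: "example", Namespace: "default"},
		Spec:       PinnedDeploymentSpec{Replicas: 3},
	}
	// The embedded struct lets Go code write pd.Name, while the JSON tag
	// still nests those fields under "metadata".
	fmt.Println(pd.Name, render(pd))
}
```

In a real kubebuilder project the generated types also embed metav1.TypeMeta and carry a status type; those are left out here to keep the sketch short.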

This… has some wrinkles. Kubernetes has a number of common fields, implemented as promoted fields. EG ObjectMeta, contained in any Kubernetes object, adds the name, namespace, annotations, etc. In YAML, we put this in the metadata field:

metadata:
  name: example
  namespace: default
  labels:
    app: example
  # ...etc

This doesn’t translate over well in sub-objects (the ObjectMeta/metadata on our CRD itself is fine though). For example, suppose you’re including a PodTemplateSpec (template metadata and spec for creating pods), like I did in the PinnedDeployment. The metadata fields, like what labels to put on the pods, are preserved… because PreserveUnknownFields is set to true. This option is incompatible with using admission webhooks (used to validate or default an object). If we switch that to false, that app: example disappears.

There are ways around this, but they suck. For example, you can create your own object with the metadata fields made explicit. This is what I did for PinnedDeployments:

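// FakePodTemplateSpec mirrors corev1.PodTemplateSpec, with explicit metadata fields.
type FakePodTemplateSpec struct {
	Metadata BasicMetadata  `json:"metadata,omitempty"`
	Spec     corev1.PodSpec `json:"spec,omitempty"`
}

// BasicMetadata spells out the metadata fields we want to keep.
type BasicMetadata struct {
	Annotations map[string]string `json:"annotations,omitempty"`
	Labels      map[string]string `json:"labels,omitempty"`
}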

The upside is that the API behaves properly. I can submit my YAML, hit the REST API, etc, just like I could with a Deployment. However… Go isn’t duck-typed. My generated client does NOT take PodTemplateSpecs, it takes FakePodTemplateSpecs. This is opaque to the user, and can get messy. Say I wanted to embed a JobSpec in my CRD. I now also need to define a FakeJobSpec, which contains a FakePodTemplateSpec rather than a PodTemplateSpec. Also, anyone using my client will need to do similar wrangling to convert to/from a vanilla JobSpec, should they need to.
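To give a feel for that wrangling, here is a minimal sketch of the conversion in both directions. ObjectMeta, PodSpec and PodTemplateSpec are trimmed local stand-ins for the real metav1/corev1 types so the example compiles on its own, and ToPodTemplateSpec / FromPodTemplateSpec are hypothetical helper names, not something kubebuilder generates.

```go
package main

import "fmt"

// ObjectMeta, PodSpec and PodTemplateSpec are trimmed local stand-ins for the
// real metav1/corev1 types, so that this sketch compiles on its own.
type ObjectMeta struct {
	Annotations map[string]string
	Labels      map[string]string
}

type PodSpec struct {
	ServiceAccountName string
}

type PodTemplateSpec struct {
	ObjectMeta ObjectMeta
	Spec       PodSpec
}

// BasicMetadata and FakePodTemplateSpec mirror the CRD-side types.
type BasicMetadata struct {
	Annotations map[string]string
	Labels      map[string]string
}

type FakePodTemplateSpec struct {
	Metadata BasicMetadata
	Spec     PodSpec
}

// copyMap returns an independent copy so the two objects never share a map.
func copyMap(in map[string]string) map[string]string {
	if in == nil {
		return nil
	}
	out := make(map[string]string, len(in))
	for k, v := range in {
		out[k] = v
	}
	return out
}

// ToPodTemplateSpec (hypothetical helper) builds the vanilla type from the CRD type.
func ToPodTemplateSpec(f FakePodTemplateSpec) PodTemplateSpec {
	return PodTemplateSpec{
		ObjectMeta: ObjectMeta{
			Annotations: copyMap(f.Metadata.Annotations),
			Labels:      copyMap(f.Metadata.Labels),
		},
		Spec: f.Spec,
	}
}

// FromPodTemplateSpec (hypothetical helper) goes the other way, keeping only
// the metadata fields that BasicMetadata models.
func FromPodTemplateSpec(t PodTemplateSpec) FakePodTemplateSpec {
	return FakePodTemplateSpec{
		Metadata: BasicMetadata{
			Annotations: copyMap(t.ObjectMeta.Annotations),
			Labels:      copyMap(t.ObjectMeta.Labels),
		},
		Spec: t.Spec,
	}
}

func main() {
	fake := FakePodTemplateSpec{
		Metadata: BasicMetadata{Labels: map[string]string{"app": "example"}},
		Spec:     PodSpec{ServiceAccountName: "default"},
	}
	vanilla := ToPodTemplateSpec(fake)
	fmt.Println(vanilla.ObjectMeta.Labels["app"], FromPodTemplateSpec(vanilla).Spec.ServiceAccountName)
}
```

Note that the reverse conversion is lossy by design: anything in the real ObjectMeta beyond labels and annotations is dropped.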

I could also have a weirder API and not care as much about mimicking normal objects. For example, I could have:

type MyCrd struct {
	Job JobTemplate `json:"jobTemplate,omitempty"`
}

type JobTemplate struct {
	Spec batchv1.JobSpec `json:"spec,omitempty"`
	// BasicMetadata implemented like the prior example.
	PodMetadata BasicMetadata `json:"podMetadata,omitempty"`
}

And then superimpose the fields from MyCrd.Job.PodMetadata onto MyCrd.Job.Spec.Template.Metadata.
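A minimal sketch of that superimposing step might look like the following. The types are trimmed local stand-ins for the real batchv1/corev1 types, superimpose and merged are hypothetical helper names, and letting the pod metadata win on key conflicts is an assumption of this sketch rather than a rule from anywhere.

```go
package main

import "fmt"

// Trimmed local stand-ins for the real Kubernetes types.
type ObjectMeta struct {
	Annotations map[string]string
	Labels      map[string]string
}

type PodTemplateSpec struct {
	ObjectMeta ObjectMeta
}

type JobSpec struct {
	Template PodTemplateSpec
}

type BasicMetadata struct {
	Annotations map[string]string
	Labels      map[string]string
}

type JobTemplate struct {
	Spec        JobSpec
	PodMetadata BasicMetadata
}

// merged returns a new map holding base overlaid with overlay.
// Neither input map is modified.
func merged(base, overlay map[string]string) map[string]string {
	if len(base) == 0 && len(overlay) == 0 {
		return nil
	}
	out := make(map[string]string, len(base)+len(overlay))
	for k, v := range base {
		out[k] = v
	}
	for k, v := range overlay {
		out[k] = v // assumption: PodMetadata wins on conflicts
	}
	return out
}

// superimpose (hypothetical helper) returns the JobSpec the controller would
// actually submit, with PodMetadata folded into the pod template's metadata.
func superimpose(jt JobTemplate) JobSpec {
	spec := jt.Spec
	spec.Template.ObjectMeta.Labels = merged(jt.Spec.Template.ObjectMeta.Labels, jt.PodMetadata.Labels)
	spec.Template.ObjectMeta.Annotations = merged(jt.Spec.Template.ObjectMeta.Annotations, jt.PodMetadata.Annotations)
	return spec
}

func main() {
	jt := JobTemplate{
		PodMetadata: BasicMetadata{Labels: map[string]string{"app": "example"}},
	}
	fmt.Println(superimpose(jt).Template.ObjectMeta.Labels)
}
```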

Putting the pod metadata there is a clear break from common patterns, but it is easier to reason about when the metadata is nested multiple objects deep in my CRD.

One thing I haven’t tried yet, but intend to, is to hack something together that inserts explicit Annotations/Labels fields into the generated CRD YAML. While messy in implementation, this would presumably allow both typed and untyped API users to put metadata on subobjects in exactly the same way as they would on a non-CRD subobject.

The Controller And Reconciliation

Kubebuilder gives you 1 controller Deployment per API group (EG the Deployment would also manage other CRDs in that group, if I made more). Within that, you have a distinct controller package for each CRD. controller-runtime allows you to build very simple reconcile functions, at a slightly increased performance cost.

Recall that Kubernetes has owner references on objects. Here’s an example from a Job object, referencing the parent CronJob.

  ownerReferences:
  - apiVersion: batch/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: CronJob
    name: athenabot-k8sissues
    uid: 3cb56a0c-f242-11e9-8bf4-42010a8a0147

The controller-runtime machinery listens for a defined set of resources. EG, the PinnedDeployment controller listens for PinnedDeployments and ReplicaSets. If any create/update/delete event occurs, controller-runtime triggers the reconcile function, and passes in the namespace/name of the PinnedDeployment in question (insert your CRD here). This encourages developers to write simpler, but less efficient control loops. For example, suppose I have a controller that manages pods. Pods have many status fields that change frequently, especially during startup/shutdown. If something random happens that I don’t care about (EG an individual readiness gate flips to true while the pod is overall not ready), it will do a full reconcile cycle. It will check all the pods to see if anything needs to be updated, and if my CRD manages other objects too, it will check them as well. Luckily, controller-runtime fetches out of a cache, but this can still be nontrivial. Writing a reconcile loop like this is easier to do, easier to understand, and easier to test. Yourself, or any child changes? Sync everything.
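The "sync everything" shape can be sketched with a toy in-memory model. This is not the controller-runtime API: Cluster is a hypothetical stand-in for the cached cluster state, and the "spec" is reduced to a desired child count. The point is only that the function receives a name, recomputes the full desired state, and diffs it against whatever exists, no matter which event triggered it.

```go
package main

import "fmt"

// Cluster is a toy stand-in for the cached API state.
type Cluster struct {
	Parents  map[string]int             // parent name -> desired number of children
	Children map[string]map[string]bool // parent name -> set of existing child names
}

func NewCluster() *Cluster {
	return &Cluster{
		Parents:  map[string]int{},
		Children: map[string]map[string]bool{},
	}
}

// Reconcile is handed only the parent's name. It does not know what changed,
// so it rebuilds the desired set of children and fixes every difference.
func (c *Cluster) Reconcile(parent string) (created, deleted int) {
	desired := map[string]bool{}
	if want, ok := c.Parents[parent]; ok {
		for i := 0; i < want; i++ {
			desired[fmt.Sprintf("%s-%d", parent, i)] = true
		}
	}

	actual := c.Children[parent]
	if actual == nil {
		actual = map[string]bool{}
		c.Children[parent] = actual
	}

	// Create whatever is missing.
	for name := range desired {
		if !actual[name] {
			actual[name] = true
			created++
		}
	}
	// Delete whatever should not exist (including everything, if the parent is gone).
	for name := range actual {
		if !desired[name] {
			delete(actual, name)
			deleted++
		}
	}
	return created, deleted
}

func main() {
	c := NewCluster()
	c.Parents["web"] = 3
	fmt.Println(c.Reconcile("web")) // first pass creates the children
	fmt.Println(c.Reconcile("web")) // second pass finds nothing to do
}
```

Because the function is idempotent, calling it on an irrelevant event costs a full comparison but never does the wrong thing, which is the trade-off described above.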

This does fail you a bit if you don’t have a parent:child relationship. For example, an Ingress controller watches pods, but does not own them. You can still use controller-runtime, but you need to add your own watcher in the controller, and must map pods to (one or more) Ingress objects yourself. You could still directly call the reconcile function for each Ingress if you wanted, and maintain the general pattern.
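The mapping step might look something like the sketch below. All of the types are simplified stand-ins, and the Selector field on Ingress is a hypothetical shortcut (a real Ingress reaches pods indirectly, through Services). The output is a list of namespace/name keys, which is the same shape of information the reconcile function receives.

```go
package main

import "fmt"

// NamespacedName identifies the object to reconcile.
type NamespacedName struct {
	Namespace string
	Name      string
}

// Pod and Ingress are simplified stand-ins for the real API types.
type Pod struct {
	Namespace string
	Name      string
	Labels    map[string]string
}

type Ingress struct {
	Namespace string
	Name      string
	Selector  map[string]string // hypothetical: labels a backing pod must carry
}

// matches reports whether labels satisfy every key/value in selector.
// An empty selector matches nothing in this sketch.
func matches(selector, labels map[string]string) bool {
	if len(selector) == 0 {
		return false
	}
	for k, v := range selector {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// ingressesForPod maps a changed pod to every Ingress that should be reconciled.
func ingressesForPod(pod Pod, ingresses []Ingress) []NamespacedName {
	var out []NamespacedName
	for _, ing := range ingresses {
		if ing.Namespace == pod.Namespace && matches(ing.Selector, pod.Labels) {
			out = append(out, NamespacedName{Namespace: ing.Namespace, Name: ing.Name})
		}
	}
	return out
}

func main() {
	pod := Pod{Namespace: "default", Name: "web-0", Labels: map[string]string{"app": "web"}}
	ingresses := []Ingress{
		{Namespace: "default", Name: "public", Selector: map[string]string{"app": "web"}},
		{Namespace: "default", Name: "admin", Selector: map[string]string{"app": "admin"}},
	}
	fmt.Println(ingressesForPod(pod, ingresses))
}
```

Each returned key can then be fed to the same per-Ingress reconcile function, so the "sync everything for this object" pattern stays intact.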

Since the individual per-CRD-type controller is a struct, you can add arbitrary watchers and state. To take another ingress controller example, you could store remote load balancer state in the controller too - no need to constantly re-query it.

Integration testing is recommended, with a full cluster or minimal apiserver+etcd setup. Personally, I’ve managed to avoid this so far, using client-go fakes. As far as I can tell, running all the controller-runtime machinery with a fake client isn’t realistic. However, you can make the individual controller, and manually call and step through the reconcile function.

Reconciling Individual Objects

So, you have a desired state for an object, and you have an actual object (let’s say ReplicaSet, in keeping with the references to the PinnedDeployment Controller). Always applying an update is wasteful, especially with that “all reconciliation done in the same loop” thing I described.

The naive solution is to check !reflect.DeepEqual(desired, actual), given we’re working with structs.

Here’s the problem: say I create a PinnedDeployment, and I don’t define an imagePullPolicy on the pod spec. When the controller creates a ReplicaSet, the Kubernetes API sets a default value for imagePullPolicy (PullAlways if the image tag is “latest”, PullIfNotPresent otherwise). My desired ReplicaSetSpec will never equal the actual ReplicaSetSpec, because even though they are equivalent, the field is empty in one object and not the other.
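Here is a small sketch that reproduces the problem. Container is a trimmed stand-in for the real corev1.Container, and applyDefaults is a hypothetical, simplified imitation of the API server's defaulting for this one field (it only looks for a literal ":latest" suffix).

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// Container is a trimmed stand-in for corev1.Container.
type Container struct {
	Image           string
	ImagePullPolicy string
}

// applyDefaults imitates (in simplified form) what the API server does to
// an object on write: empty fields come back filled in.
func applyDefaults(c Container) Container {
	if c.ImagePullPolicy == "" {
		if strings.HasSuffix(c.Image, ":latest") {
			c.ImagePullPolicy = "Always"
		} else {
			c.ImagePullPolicy = "IfNotPresent"
		}
	}
	return c
}

// naiveNeedsUpdate is the DeepEqual check from the text.
func naiveNeedsUpdate(desired, actual Container) bool {
	return !reflect.DeepEqual(desired, actual)
}

func main() {
	desired := Container{Image: "nginx:1.25"} // pull policy left unset
	actual := applyDefaults(desired)          // what the API hands back

	// The two are equivalent, yet the naive check asks for an update on
	// every single reconcile.
	fmt.Println(naiveNeedsUpdate(desired, actual))
}
```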

Most upstream Kubernetes code doesn’t hit this, because it can call all the magic semi-generated defaulting functions. Those functions aren’t importable to a consumer, unless you want to do some crazy stuff. And even if you did import (or replicate) them, you’d be tightly tying your controller to a particular Kubernetes release (as a change in logic would break your comparison).

In general, the solution around this is to compare an explicit subset of fields. Here are some things I have seen or have tried:

  1. Invent and set defaults when building the desired state. Explicitly check that every known field matches (which future-proofs against being tripped up by new, unknown fields).
  2. Write an annotation with the parent’s object generation.
  3. Write an annotation containing a json version of the spec, a hash of the spec, or something similarly deterministic.

#1 only makes sense in specific use cases. For example, it’s too hands-on to use when a PodSpec is involved. But, I like the simplicity, and the fact that it looks at real state, not a spoofable annotation. I partially used this approach in the initial PinnedDeployment controller (all ReplicaSet fields, barring the template, were explicitly set and compared).
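A minimal sketch of approach #1, under some assumptions: ReplicaSetSpec is a trimmed stand-in for the real type, the Template field is reduced to a string and deliberately left out of the comparison, and withDefaults / needsUpdate are hypothetical helper names. This is not the actual PinnedDeployment code.

```go
package main

import "fmt"

// ReplicaSetSpec is a trimmed stand-in for appsv1.ReplicaSetSpec.
type ReplicaSetSpec struct {
	Replicas        *int32
	MinReadySeconds int32
	Template        string // stand-in for the pod template; not compared here
}

// withDefaults sets the defaults ourselves, so the desired state never has
// an empty field where the API server would have filled one in.
func withDefaults(s ReplicaSetSpec) ReplicaSetSpec {
	if s.Replicas == nil {
		one := int32(1)
		s.Replicas = &one
	}
	return s
}

// needsUpdate compares only the fields this controller knows about and sets.
// Fields it has never heard of cannot cause a spurious difference.
func needsUpdate(desired, actual ReplicaSetSpec) bool {
	desired = withDefaults(desired)
	if actual.Replicas == nil || *desired.Replicas != *actual.Replicas {
		return true
	}
	return desired.MinReadySeconds != actual.MinReadySeconds
}

func main() {
	one, three := int32(1), int32(3)
	desired := ReplicaSetSpec{} // replicas left unset
	fmt.Println(needsUpdate(desired, ReplicaSetSpec{Replicas: &one}))   // defaulted value matches
	fmt.Println(needsUpdate(desired, ReplicaSetSpec{Replicas: &three})) // real drift
}
```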

I often see #2 used, and I think it makes sense if almost any parent change requires changing the children. If this isn’t the case (such as a CRD that has multiple desired children), this can over-trigger writes, and #3 might make more sense.
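For the hash flavour of #3, a minimal sketch could look like this. The annotation key example.com/spec-hash is hypothetical, Spec is a stand-in for whatever the controller renders as desired state, and SHA-256 over the JSON encoding is just one reasonable choice of deterministic fingerprint.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// specHashAnnotation is a hypothetical annotation key.
const specHashAnnotation = "example.com/spec-hash"

// Spec is a stand-in for the desired child spec.
type Spec struct {
	Image    string            `json:"image"`
	Replicas int32             `json:"replicas"`
	Labels   map[string]string `json:"labels,omitempty"`
}

// specHash fingerprints the desired spec. encoding/json writes struct fields
// in declaration order and sorts map keys, so equal specs hash equally.
func specHash(s Spec) (string, error) {
	raw, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:]), nil
}

// needsUpdate compares the hash of what we want against the hash recorded on
// the child when it was last written. Server-side defaults never enter into it.
func needsUpdate(desired Spec, childAnnotations map[string]string) (bool, error) {
	h, err := specHash(desired)
	if err != nil {
		return false, err
	}
	return childAnnotations[specHashAnnotation] != h, nil
}

func main() {
	desired := Spec{Image: "nginx:1.25", Replicas: 2}
	h, _ := specHash(desired)
	child := map[string]string{specHashAnnotation: h}

	same, _ := needsUpdate(desired, child)
	desired.Replicas = 3
	changed, _ := needsUpdate(desired, child)
	fmt.Println(same, changed)
}
```

As with any annotation-based check, this compares against what was recorded rather than against the child's real state, so a child edited by hand without touching the annotation would go unnoticed.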

Vallery Lancey
Software/Reliability Engineer

I work on distributed systems and reliability.