
Spaghetti 🍝 (blog)

Patching and overlays in GitOps suck

Most Kubernetes cluster implementations I have come across have migrated to GitOps as the preferred method of deploying manifests, Helm charts and more, and for good reason too. This isn't a rant about how great GitOps is, but rather a discussion on how to share the same configuration across multiple clusters without extensive patching and lots of overlays.

Why I don't like patching and overlays

If a configuration needs to be shared across many Kubernetes clusters, we tend to use the pattern of creating a base configuration and then applying individual patches in overlays that reference the base. This approach is pretty normal and probably works if you are really strict about what kind of patching you allow, but that is also where the pitfall lies. Patching lets you do almost anything to the base configuration, so not only is the interface potentially massive, it is also not well defined. You are constantly running the risk of changing something in the base that is patched somewhere in an overlay anyway, so all overlays and patches must always be taken into account, probably leading to a small base and a massive overlay. With patching we also need to know the resource kind, name and possibly namespace, plus the YAML path in the spec. But remember, we wanted to share as much configuration as possible, so we want as much as possible to go into the base.

On top of all this I also find patches hard to read, and not knowing your implementation up front has other drawbacks. I'm not saying that patches and overlays don't have their use cases, but limiting them can certainly help.

Envsubst might be the solution

If you haven't heard of envsubst, it is a GNU package released in 1995, designed to substitute environment variable references in a given text file or string. In other words, we can parameterize e.g. a Kubernetes manifest, which means we know exactly where the manifest will change. Time to compare.

# base/sa.yaml for envsubst
---
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: ${KARPENTER_ROLE_ARN} # envsubst notation
  name: karpenter
  namespace: karpenter

and

# base/sa.yaml for patching
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: karpenter
  namespace: karpenter

# kustomization
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ./base/sa.yaml
patches:
  - target:
      kind: ServiceAccount
      name: karpenter
      namespace: karpenter
    patch: |-
      - op: add
        path: /metadata/annotations/eks.amazonaws.com~1role-arn
        value: arn:aws:iam::1234:role/karpenter-role

The examples might not look so different on the surface, and I also did the patching example a disservice by not using a placeholder and replacing that. But I wanted to show that from the base configuration's point of view there is no knowledge of an annotation, so if I were to add a similar annotation to the base it would eventually be overruled by the patch, even though there is no indication of it being a parameter. The overhead of knowing the exact resource and YAML path isn't great either. Of course the envsubst example needs to be run through the envsubst command, as shown below, but the difference is that my base configuration clearly expects a variable, and I do not need to know the resource kind, name, possible namespace and YAML path to replace it.
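
For completeness, here is the rendering step, using the same example role ARN as in the patching example.

# render the base manifest with the variable substituted
KARPENTER_ROLE_ARN="arn:aws:iam::1234:role/karpenter-role" envsubst < base/sa.yaml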

postBuild.substitute: the FluxCD equivalent of envsubst

FluxCD (and probably also ArgoCD) has a trick up its sleeve called postBuild.substitute (see here), which works like envsubst. Let's take the example above and try this feature with the Kustomization custom resource provided by Flux.

---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
spec:
  # ...omitted for simplicity
  postBuild:
    substitute:
      KARPENTER_ROLE_ARN: "arn:aws:iam::1234:role/karpenter-role"

We are provisioning the above resource anyway if we are doing patching and overlays, so in my mind using post build variable substitution is just simpler and can certainly help minimize the need for overlays and patching, or even remove it completely. I have run ~10 clusters at the same time, all sharing the same configuration and only using post build variable substitution, and life was just so much simpler.
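
As a side note, the values don't have to be inlined in the Kustomization. Flux also supports postBuild.substituteFrom, so per-cluster values can live in a ConfigMap or Secret. A minimal sketch, assuming a hypothetical cluster-vars ConfigMap in the same namespace:

---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
spec:
  # ...omitted for simplicity
  postBuild:
    substituteFrom:
      - kind: ConfigMap
        name: cluster-vars # hypothetical ConfigMap holding e.g. KARPENTER_ROLE_ARN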

Automate CRD updates

In the previous post, Handle CRDs with GitOps, I showed that the CRDs for a Helm chart can be generated using the helm CLI, which enabled us to manage the full lifecycle of CRDs. In this post I will show automation around this, so that a Helm chart version update triggers updates to the CRDs as well.

Requirements

  • helm
  • yq

Creating recipes for CRD generation

Considering our previous example of a HelmRelease for cert-manager

# helm.yaml
---
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: jetstack
  namespace: cert-manager
spec:
  interval: 15m
  url: https://charts.jetstack.io

---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: cert-manager
  namespace: cert-manager
spec:
  interval: 5m
  targetNamespace: cert-manager
  chart:
    spec:
      chart: cert-manager
      version: "v1.18.2"
      sourceRef:
        kind: HelmRepository
        name: jetstack
      interval: 15m
  install:
    crds: Skip
  values:
    installCRDs: false
    ...

we can create the following recipe for generating CRDs using a Makefile (other tools could be used as well)

version := $(shell yq '. | select(.kind == "HelmRelease") | .spec.chart.spec.version' helm.yaml)
url := $(shell yq '. | select(.kind == "HelmRepository") | .spec.url' helm.yaml)
chart := $(shell yq '. | select(.kind == "HelmRelease") | .spec.chart.spec.chart' helm.yaml)
kube_version := v1.22.0 # required >= 1.22.0
release_name := $(shell yq '. | select(.kind == "HelmRelease") | .metadata.name' helm.yaml)

crds.yaml: helm.yaml
    helm template $(release_name) $(chart) --repo $(url) --version $(version) --set installCRDs=true --kube-version $(kube_version) | yq '. | select(.kind == "CustomResourceDefinition")' > $@

Now any update to helm.yaml will trigger regeneration and overwrite of crds.yaml the next time make is run.
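
A typical flow after bumping the chart version could then look something like this (a sketch; the actual diff depends on the chart):

# bump .spec.chart.spec.version in helm.yaml, then regenerate the CRDs
make
git diff --stat crds.yaml  # inspect what changed
git add helm.yaml crds.yaml
git commit -m "Bump cert-manager and update CRDs"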

Creating a GitHub Action

I have now shown that we can update CRDs manually with make when our Helm chart version changes. The choice of make does make our lives a little bit more difficult, though, since we need to run it for every chart directory. In the root of the project I will create one Makefile that calls make on all the other Makefiles.

base_dir := apps # path to our collection of HelmReleases
charts_with_crds := $(shell find $(base_dir) -name 'Makefile' -printf "%h\n")

all: $(charts_with_crds)

$(charts_with_crds):
    @$(MAKE) -C $@

.PHONY: all $(charts_with_crds)

The above Makefile is really complex and wasn't fun to write at all. Let's finish with the GitHub Action.

# .github/workflows/update-crds.yaml
name: Update CRDs
on:
  pull_request:
    paths:
      - "apps/*/**.yaml"
      - "apps/*/**.yml"
jobs:
  update-crds:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
      with:
        ref: ${{ github.event.pull_request.head.ref }}

    - uses: alexellis/arkade-get@master
      with:
        helm: latest
        yq: latest

    - name: Get changed helm files
      id: changed-files
      uses: tj-actions/changed-files@v46
      with:
        files: |
          apps/*/helm.yaml

    - name: Touch all changed helm files
      env:
        ALL_CHANGED_FILES: ${{ steps.changed-files.outputs.all_changed_files }}
      run: |
        for file in ${ALL_CHANGED_FILES}; do
          echo "touching file: ${file}"
          touch ${file}
        done

    - name: Update CRDs
      run: make

    - uses: EndBug/add-and-commit@v9
      with:
        add: ./apps/*/crds.yaml
        message: "Update CRDs"

So now, when updates are made to our helm.yaml files, updated crds.yaml files are committed back by the above workflow. The benefit is that changes to CRDs become really transparent.

Handle CRDs with GitOps

In this post we will discuss caveats with full lifecycle management of Custom Resource Definitions (CRDs) with Helm in a GitOps context and give a possible solution.

Helm caveats with CRD lifecycle management

Helm is very good at getting CRDs into the cluster at install time, but updating and deleting them is where problems tend to arise. There are solutions in place, like a separate chart for CRDs, but it is very much dependent on how the chart is structured and implemented. The Helm documentation has a full section on this here. In summary, they write:

There is no support at this time for upgrading or deleting CRDs using Helm. This was an explicit decision after much community discussion due to the danger for unintentional data loss. Furthermore, there is currently no community consensus around how to handle CRDs and their lifecycle. As this evolves, Helm will add support for those use cases

This means that for some charts it might work and for some it might be more challenging, leading to an incoherent experience that could be error prone.

A possible solution

Separating the handling of CRDs out from Helm and just referencing the manifests directly via kustomize seems like an obvious solution, and it works great. In GitOps with FluxCD, the HelmRelease has the options .spec.install.crds and .spec.upgrade.crds; the latter's default is Skip, so we really only have to set it on the install portion. Some charts expose an installCRDs value or similar, so we will also set this to false for clarity, even though it is strictly not necessary. Let's see an example of the HelmRelease custom resource

# helm.yaml
---
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: jetstack
  namespace: cert-manager
spec:
  interval: 15m
  url: https://charts.jetstack.io

---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: cert-manager
  namespace: cert-manager
spec:
  interval: 5m
  targetNamespace: cert-manager
  chart:
    spec:
      chart: cert-manager
      version: "v1.18.2"
      sourceRef:
        kind: HelmRepository
        name: jetstack
      interval: 15m
  install:
    crds: Skip
  values:
    installCRDs: false
    ...

and the kustomization file

# kustomization.yaml
---
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - namespaces.yaml
  - crds.yaml # the file containing all crds
  - helm.yaml
  ...

The CRDs can normally be generated in one of the following two ways using the helm CLI

helm show crds CHART --repo URL --version VERSION > crds.yaml
# or
helm template RELEASE_NAME CHART --repo URL --version VERSION --set installCRDs=true --kube-version KUBE_VERSION | yq '. | select(.kind == "CustomResourceDefinition")' > crds.yaml

For the cert-manager example above, the latter option is the one that works, but it is rare that you need the more complex of the two. Since we have now documented how to generate the CRDs, automation can easily be added, and how I have done that will be shown in a later post.

To quickly conclude: a solution has been presented that manages the full lifecycle (install, update and deletion) of CRDs for Helm charts with GitOps, assuming pruning of resources is enabled.

Kubernetes resources

I find myself explaining how I approach setting Kubernetes resources over and over again, and I always struggle to rediscover the good references. So this post serves as a reminder for myself, and hopefully it can also help you. I always recommend this 3-part post by Shon Lev-Ran.

But to make it really easy, I always set resources using the following guidelines (illustrated in the sketch below):

  1. Only set a resource request for CPU and never set a CPU limit.
  2. Always set a resource request and limit for memory and make sure they are equal.
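
As a minimal sketch, a container resources block following these guidelines could look like this (the numbers are placeholders, not recommendations):

# container resources following the guidelines above
resources:
  requests:
    cpu: 250m      # CPU request only, no CPU limit
    memory: 512Mi  # memory request...
  limits:
    memory: 512Mi  # ...equal to the memory limit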

Client side validations of Kubernetes manifests

To be honest, writing Kubernetes manifests can be tedious and is prone to misconfiguration. Of course everything will in the end be validated server side, but we would like to catch most errors before we hand the manifests off to the API server. This is particularly helpful when utilizing GitOps, since the changes are consumed asynchronously. To achieve this we will use the following tooling:

  • kustomize
  • kubeconform
  • pre-commit

Let's start with kustomize and make sure that we can actually build our manifest bundle.

kustomize build path-to-kustomization-file

We can now add this to a .pre-commit-config.yaml file in the root of the project to have it run every time we commit.

repos:
- repo: local
  hooks:
  - id: kustomize
    name: validate kustomizations
    language: system
    entry: kustomize
    args:
    - build
    - path-to-kustomization-file
    always_run: true
    pass_filenames: false

Now on to kubeconform for validating our manifests.

kubeconform -strict -skip CustomResourceDefinition,Kustomization \
  -kubernetes-version 1.33.0 \
  -schema-location default \
  -schema-location 'https://raw.githubusercontent.com/datreeio/CRDs-catalog/main/{{.Group}}/{{.ResourceKind}}_{{.ResourceAPIVersion}}.json' \
  path-to-your-manifests

We of course depend on the CRDs catalog containing our CRs and keeping them up to date, but it is relatively easy to contribute to the catalog; see PRs #453 and #600.

We can now also add this to our pre-commit config file like so.

repos:
...
- repo: local
  hooks:
  - id: kubeconform
    name: validate kubernetes manifests
    language: system
    entry: kubeconform
    args:
    - -strict
    - -kubernetes-version
    - 1.33.0
    - -skip
    - CustomResourceDefinition,Kustomization
    - -schema-location
    - default
    - -schema-location
    - 'https://raw.githubusercontent.com/datreeio/CRDs-catalog/main/{{.Group}}/{{.ResourceKind}}_{{.ResourceAPIVersion}}.json'
    files: ^path-to-your-manifests/.*

Using pre-commit is nice to validate your commits, but it requires everybody to install it and to run pre-commit install. So to enforce the above validations we can add a CI step in the form of a GitHub Action.

name: Pre-commit
on:
  - pull_request
jobs:
  pre-commit:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-python@v5
    - uses: alexellis/arkade-get@master
      with:
        kustomize: latest
        kubeconform: latest
    - uses: pre-commit/action@v3.0.1

This setup is not bulletproof, but it does add some extra confidence and it is very low effort to get going.


  1. This action is in maintenance-only mode and you should support the project by using pre-commit.ci instead. But so that everyone can follow along, the other option is used here. 

EBS CSI driver and AL2023

After upgrading to Amazon Linux 2023 (AL2023) we started seeing errors from the aws-ebs-csi-driver running in our clusters.

ebs-plugin I0626 06:40:25.662215       1 main.go:154] "Initializing metadata"
ebs-plugin I0626 06:40:25.662374       1 metadata.go:66] "Attempting to retrieve instance metadata from IMDS"
ebs-plugin E0626 06:40:30.665263       1 metadata.go:72] "Retrieving IMDS metadata failed" err="could not get IMDS metadata: operation error ec2imds: GetInstanceIdentityDocument, canceled, context deadline exceeded"
ebs-plugin I0626 06:40:30.665357       1 metadata.go:75] "Attempting to retrieve instance metadata from Kubernetes API"

This is due to AL2023's improved security features blocking pods from calling the metadata service on the nodes, because of a network hop limit of 1. The aws-ebs-csi-driver eventually falls back to using the Kubernetes API, but we are waiting ~5 seconds for the IMDS call to time out. With the release of aws-ebs-csi-driver v1.45.0 a flag (--metadata-sources) has been implemented, allowing us to set a priority order or choose a specific way of getting metadata. In our case it would be set to "kubernetes".
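
As a sketch of how this could be wired up if the driver is deployed through its Helm chart, assuming the chart version in use exposes an additionalArgs-style value for the controller and node components (the exact key should be verified against the chart's values):

# excerpt of assumed Helm values for aws-ebs-csi-driver
values:
  controller:
    additionalArgs:
      - --metadata-sources=kubernetes
  node:
    additionalArgs:
      - --metadata-sources=kubernetes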

This should prevent the errors shown above.

Upgrade from AL2 to AL2023 learnings

Ever since AWS announced that the Amazon Linux 2023 (AL2023) AMI type is replacing Amazon Linux 2 (AL2), I have been excited about it, mainly because of the cgroup v2 upgrade and the improved security with IMDSv2. To explain it quickly:

  • cgroup v2 should provide more transparency when container sub-processes are OOM killed.
  • IMDSv2 will block pods calling the metadata service on the nodes (getting an AWS context) due to a network hop limit.

The AMI upgrade is needed for upgrading worker nodes on EKS from 1.32 to 1.33, since no AL2 AMI is built for 1.33.

Upon testing we found a few things breaking, but nothing major. The AWS Load Balancer Controller broke, but only needed the --aws-vpc-id and --aws-region flags set to work again. We ended up removing the spot-termination-exporter (which supplied insight into spot instance interruptions), since it relies heavily on the metadata service, which is now blocked. Sad, but we have lived without it before.
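
For reference, a minimal sketch of how those flags can be set when the controller is installed via its Helm chart, assuming the chart maps the region and vpcId values to --aws-region and --aws-vpc-id (worth verifying against the chart version in use):

# excerpt of assumed Helm values for aws-load-balancer-controller
values:
  clusterName: CLUSTER_NAME
  region: eu-west-1             # maps to --aws-region
  vpcId: vpc-0123456789abcdef0  # maps to --aws-vpc-id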

We then went on to upgrade all clusters and worker nodes to version 1.33. The upgrade went smoothly, except for one thing that we overlooked: we rely on the Flux image-reflector-controller to scan container registries, and it also uses the metadata service to get the AWS context of the nodes. Luckily this was a fairly easy fix, where we ended up patching an IRSA role annotation onto the image-reflector-controller ServiceAccount in the following way.

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - patch: |
      apiVersion: v1
      kind: ServiceAccount
      metadata:
        name: image-reflector-controller
        annotations:
          eks.amazonaws.com/role-arn: arn:aws:iam::ACCOUNT_ID:role/eks_CLUSTER_NAME_flux-image-reflector
    target:
      kind: ServiceAccount
      name: image-reflector-controller

We are now enjoying AL2023 and are so far happy with the upgrade.

Vertical pod autoscaler flush historical data

We recently had to roll out a new release of the DCGM exporter, a tool that monitors Nvidia GPU performance and outputs metrics. It runs as a DaemonSet on all GPU Kubernetes nodes. The new release comes with a significant increase in memory consumption; normally this would be easy to handle by increasing resource requests and limits, but what happens if you have decided to let the Vertical Pod Autoscaler (VPA) manage resources through its auto mode?

Introduction to Vertical Pod Autoscaler

Have you ever deployed a new and shiny thing, no matter if it's custom or something off the shelf, and felt totally unqualified to choose resource requests and limits? This is where the Vertical Pod Autoscaler comes into the picture: it can free users from setting or guessing resource requests and limits on the containers in their pods and from updating them when requirements change.

VPA can run in two modes, recommendation or auto. Recommendation mode has a lot of value by itself, analyzing current and historical resource usage, but it requires you to make manual changes to follow the recommended resource settings. Auto mode uses the recommendations, but can also adjust resources on the fly. This is great and has a lot of benefits, among them not wasting resources on services that fluctuate and cannot scale horizontally.

We run a lot of services in VPA auto mode, among them the DCGM exporter.
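
For illustration, a minimal sketch of what a VerticalPodAutoscaler in auto mode could look like for the DCGM exporter; the target name, namespace and maxAllowed value are hypothetical:

---
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: dcgm-exporter
  namespace: dcgm-exporter
spec:
  targetRef:
    apiVersion: apps/v1
    kind: DaemonSet
    name: dcgm-exporter
  updatePolicy:
    updateMode: Auto
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        maxAllowed:
          memory: 1Gi # upper bound on what the VPA may assign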

Roll out new release of DCGM exporter

We already knew from testing that the DCGM exporter had a significant increase in memory consumption, so we changed the maxAllowed.memory specification on the VerticalPodAutoscaler custom resource. The hope was that VPA would adjust the resources for the DCGM exporter rather quickly, but that didn't happen. The DCGM exporter went into OOMKill crashlooping while the recommended memory from the VPA slowly crawled upwards. The OOMKill was expected, but the slow adjustment from VPA was a surprise. There were probably many contributing factors, but the crashloop backoff didn't help.

So how did we solve it?

Flushing VPA historical data

In the end we ended up deleting the appropriate VPACheckpoint resource and flushing the memory of the VPA recommender component.

kubectl delete vpacheckpoint -n dcgm-exporter dcgm-exporter
kubectl delete pod -n kube-system -l app=vpa-recommender

This almost immediately got the dcgm-exporter to the appropriate resources and out of OOMKill crashlooping.

Docker Hub rate limits

Docker Hub recently announced a limit of 10 pulls/hour for unauthenticated users. This has a pretty significant impact on container orchestration, e.g. Kubernetes. I will not cover whether it is fair or not, but give credit to Docker Hub for its contributions to the community.

So how does this rate limit impact Kubernetes?

From an operator/administrator perspective, it can be hard to predict how many images will be pulled when a new node joins the cluster.

How could you solve it?

We've opted for implementing an AWS ECR pull-through cache. It is easy to set up and works like a charm.
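
For reference, creating the cache rule is a single AWS CLI call; a sketch where the repository prefix and secret name are just examples (Docker Hub upstreams require credentials stored in Secrets Manager under the ecr-pullthroughcache/ prefix):

# create a pull-through cache rule for Docker Hub (example prefix and secret)
aws ecr create-pull-through-cache-rule \
  --ecr-repository-prefix docker-hub \
  --upstream-registry-url registry-1.docker.io \
  --credential-arn arn:aws:secretsmanager:REGION:ACCOUNT_ID:secret:ecr-pullthroughcache/docker-hub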

Were there any side effects?

  1. All image references in manifests have to change from nginx:latest to ACCOUNT_ID.dkr.ecr.REGION.amazonaws.com/docker-hub/nginx:latest (don't use latest)

  2. Flux GitOps ImageUpdateAutomation breaks for CRD resources that reference images

  3. Renovate updates break because the cache doesn't have knowledge of new tags

I will try to cover possible solutions for the above side effects in future posts.

Run and debug Renovate locally

Last time I gave a quick introduction to Renovate and how to run it in a centralised configuration. Today we will go over how to run Renovate locally for debugging and for extending its configuration, which is very handy.

npx --yes --package renovate -- renovate --dry-run=full --token="GITHUB_TOKEN" wcarlsen/repository0

This requires only a GitHub token. To change the log level, just set the LOG_LEVEL environment variable to debug, as shown below.
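
For example, the same dry run with debug logging enabled could look like this (same placeholder token and repository as above):

LOG_LEVEL=debug npx --yes --package renovate -- renovate --dry-run=full --token="GITHUB_TOKEN" wcarlsen/repository0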

Now go customise your config.js or renovate.json config files to get the best out of Renovate.