Automated OS update with kured for servers running k8s

Posted on April 20, 2021 by Adrian Wyssmann ‐ 6 min read

If you are running a bare-metal cluster, you probably run Kubernetes on top of some Linux OS, and these systems have to be updated regularly. An update sometimes means that you have to reboot your servers, and during a reboot that particular node is not available to schedule workload.

More importantly, before rebooting you should cordon the node, which marks it unschedulable and thus ensures that workloads are scheduled on the remaining, un-cordoned nodes.
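
For illustration, cordoning and un-cordoning is a single kubectl call per node (the node name is just an example):

    kubectl cordon server0095    # mark the node unschedulable
    kubectl uncordon server0095  # make it schedulable again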

Current state

Even though we use Saltstack to manage our nodes and [Rancher] for the clusters, we still have manual steps. For example, we still manually cordon and un-cordon nodes - with 7 clusters and dozens of nodes that is cumbersome. As we cannot update (or rather reboot) all servers at the same time, we have added a custom grain called orch_seq which allows us to perform the upgrade in a certain sequence, ensuring enough worker nodes are available. The orch_seq is a number from 1 to 7, and these sequences can be done together:

  • 1 (x) and 4 (y)
  • 2 (x) and 5 (y)
  • 3 (x) and 6 (y)

Then we do this cluster by cluster using the grain named context. Usually we start with the dev cluster, which is the least sensitive, and then work our way up to the production clusters.
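
The grains themselves can be set per minion, e.g. with Salt's grains.setval execution module (minion ID and values below are just examples):

    # assign this node to orchestration sequence 1 in the dev cluster
    sudo salt 'server0095*' grains.setval orch_seq 1
    sudo salt 'server0095*' grains.setval context dev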

  1. Drain the nodes of the sequence x and y which have the roles k8s_node and rancher

    1. Check which roles the nodes have

      sudo salt -C '( G@orch_seq:x or G@orch_seq:y )' grains.get roles
    2. Drain all nodes listed which are k8s_node and rancher:
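
      For example (hypothetical node name; --ignore-daemonsets is required because daemonset-managed pods cannot be evicted):

      kubectl drain server0095 --ignore-daemonsets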

  2. Upgrade the nodes of the sequence x and y using salt

    sudo salt -C '( G@orch_seq:x or G@orch_seq:y ) and G@context:xxxxx' pkg.upgrade

    We could run `pkg.upgrade` for all nodes in parallel, but we deliberately do this in sequence, because in case something goes wrong with the package upgrade, we still have enough nodes available
    
  3. If there is a kernel update - restart the nodes of the current sequence(s)

    sudo salt -C '( G@orch_seq:x or G@orch_seq:y )' system.reboot
  4. Afterwards, we check that the nodes are properly running and the kernel upgrade was successful

    sudo salt -C '( G@orch_seq:x or G@orch_seq:y )' cmd.run "cat /etc/centos-release"
  5. Un-drain the recently drained nodes
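
    Again a manual kubectl call per node (hypothetical node name):

    kubectl uncordon server0095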

  6. Repeat the steps for the next sequence(s) of the servers

This works fine but as you can see there are certainly steps which can be improved.

What can we do better?

As a first step we might want to solve the problem of manually draining and un-draining. I found some interesting projects:

  • system-upgrade-controller aims to provide a general-purpose, Kubernetes-native upgrade controller (for nodes).

    • It introduces a new CRD, the Plan, for defining any and all of your upgrade policies/requirements. A Plan is an outstanding intent to mutate nodes in your cluster.
  • Kured (KUbernetes REboot Daemon) is a Kubernetes daemonset that performs safe automatic node reboots when the need to do so is indicated by the package management system of the underlying OS.

    • Watches for the presence of a reboot sentinel file e.g. /var/run/reboot-required or the successful run of a sentinel command.
    • Utilises a lock in the API server to ensure only one node reboots at a time
    • Optionally defers reboots in the presence of active Prometheus alerts or selected pods
    • Cordons & drains worker nodes before reboot, uncordoning them after

Install and Test Kured

Kured sounds very promising, so I gave it a chance. The installation is pretty straightforward:

  1. Download the kured-manifest which is compatible with our existing Kubernetes version - see Kubernetes & OS Compatibility

  2. Modify the manifest as follows

    1. Replace docker.io with your internal registry
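
      One hypothetical way to do this is a simple search-and-replace over the manifest (registry host is an example):

      sed -i 's#docker.io#registry.example.com#g' kured-1.x.x-dockerhub.yaml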

    2. Update the command to your needs

      Note: The documentation in the main branch refers to the latest code changes, not necessarily the version you are using. In my case, version 1.6.1 is missing some of the configuration options listed there.
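
      As a rough sketch, the resulting container command for the reboot window shown in the logs further below could look like this (flag names as documented for kured 1.6.x - verify them against your version):

      command:
      - /usr/bin/kured
      - --period=10m
      - --reboot-days=mon,tue,wed,thu,fri
      - --start-time=07:00
      - --end-time=17:00
      - --time-zone=UTC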

    3. Update the tolerations so that the daemonset is also scheduled on your master nodes. We added this

      tolerations:
      - operator: Exists
    4. Then apply the manifest to all your clusters

      kubectl apply -f kured-1.x.x-dockerhub.yaml

How does it work

kured checks for the existence of a sentinel-file every 1h. By default the sentinel-file is /var/run/reboot-required, but these parameters can be changed. If the sentinel-file is detected, the daemon - which uses a random offset derived from the period on startup so that nodes don’t all contend for the lock simultaneously - will reboot the underlying node.
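
To test the setup without waiting for a real update, you can create the sentinel-file by hand on one node and watch the kured log:

    # hypothetical test: this will cause kured to actually drain and reboot the node
    sudo touch /var/run/reboot-required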

Alternatively - for versions > 1.6.1 - a reboot sentinel command can be used. If a reboot sentinel command is used, the reboot sentinel file presence will be ignored.

This is helpful, because not all Linux distributions have such a sentinel file. Below you can see an example log of the kured-daemon after I applied the patches to the machine server0095 at around 11:50 - I’ve set the check period from 1h to 10m for that:

time="2021-04-19T11:19:50Z" level=info msg="Kubernetes Reboot Daemon: 1.6.1"
time="2021-04-19T11:19:50Z" level=info msg="Node ID: server0095"
time="2021-04-19T11:19:50Z" level=info msg="Lock Annotation: kube-system/kured:weave.works/kured-node-lock"
time="2021-04-19T11:19:50Z" level=info msg="Lock TTL not set, lock will remain until being released"
time="2021-04-19T11:19:50Z" level=info msg="Reboot Sentinel: /var/run/reboot-required every 10m0s"
time="2021-04-19T11:19:50Z" level=info msg="Blocking Pod Selectors: []"
time="2021-04-19T11:19:50Z" level=info msg="Reboot on: ---MonTueWedThuFri--- between 07:00 and 17:00 UTC"
time="2021-04-19T11:28:28Z" level=info msg="Reboot not required"
time="2021-04-19T11:38:28Z" level=info msg="Reboot not required"
time="2021-04-19T11:48:28Z" level=info msg="Reboot not required"
time="2021-04-19T11:58:28Z" level=info msg="Reboot required"
time="2021-04-19T11:58:28Z" level=info msg="Acquired reboot lock"
time="2021-04-19T11:58:28Z" level=info msg="Draining node server0095"

As you can see, the node is drained and cordoned.

After the node is restarted, the daemon is scheduled again and will un-cordon the node:

time="2021-04-19T11:33:55Z" level=info msg="Kubernetes Reboot Daemon: 1.6.1"
time="2021-04-19T11:33:55Z" level=info msg="Node ID: server0092"
time="2021-04-19T11:33:55Z" level=info msg="Lock Annotation: kube-system/kured:weave.works/kured-node-lock"
time="2021-04-19T11:33:55Z" level=info msg="Lock TTL not set, lock will remain until being released"
time="2021-04-19T11:33:55Z" level=info msg="Reboot Sentinel: /var/run/reboot-required every 10m0s"
time="2021-04-19T11:33:55Z" level=info msg="Blocking Pod Selectors: []"
time="2021-04-19T11:33:55Z" level=info msg="Reboot on: ---MonTueWedThuFri--- between 07:00 and 17:00 UTC"
time="2021-04-19T11:33:55Z" level=info msg="Holding lock"
time="2021-04-19T11:33:55Z" level=info msg="Uncordoning node server0092"
time="2021-04-19T11:33:55Z" level=info msg="Releasing lock"

And now?

Using kured we no longer need to care about reboots and cordoning nodes. We still do the upgrade in sequence, as this reduces the risk in case pkg.upgrade itself fails and leaves a node in an inconsistent state. We also need to “mimic” the existence of the sentinel-file, as we use CentOS. So the upgrade procedure now looks like this:

  1. Upgrade the nodes of the sequence x and y using salt

    sudo salt -C '( G@orch_seq:x or G@orch_seq:y ) and G@context:xxxxx' pkg.upgrade
    sudo salt -C '( G@orch_seq:x or G@orch_seq:y )' cmd.run 'needs-restarting -r; if [ $? -ne 0 ]; then sudo touch /var/run/reboot-required; fi'
    The second command ensures that the sentinel-file is there, so `kured` knows a reboot is required: `needs-restarting -r` exits non-zero when a reboot is needed.
    
  2. Repeat the steps for the next sequence(s) of the servers

Much simpler, isn’t it?