Automated OS update with kured for servers running k8s

Posted April 20, 2021 by Adrian Wyssmann ‐ 6 min read

If you are running a bare-metal cluster, you probably run Kubernetes on top of some Linux OS, and these systems have to be updated regularly. An update sometimes means that you have to reboot your servers, and during a reboot that particular node is not available to schedule workloads.

More importantly, before a reboot you should cordon the node, which marks it unschedulable and thus ensures that workloads are scheduled on the remaining, un-cordoned nodes.

Current state

Even though we use Saltstack to manage our nodes and [Rancher] for the cluster, we still have manual steps. For example, we still manually cordon and un-cordon nodes - with 7 clusters and dozens of nodes that is cumbersome. As we cannot update (or rather, reboot) all servers at the same time, we have added a custom grain called orch_seq, which allows us to perform the upgrade in a certain sequence, ensuring enough worker nodes remain available. The orch_seq is a number from 1 to 7, and these sequences can be done together:

  • 1 (x) and 4 (y)
  • 2 (x) and 5 (y)
  • 3 (x) and 6 (y)

Then we do this cluster by cluster using the grain named context. Usually we start with the dev cluster, which is the least sensitive, and then work our way up to the production clusters:

  1. Drain the nodes of the sequence x and y which are k8s_node and rancher

    1. Check which roles the nodes have

      sudo salt -C '( G@orch_seq:x or G@orch_seq:y )' grains.get roles
      
    2. Drain all listed nodes which have the k8s_node and rancher roles

  2. Upgrade the nodes of the sequence x and y using salt

    sudo salt -C '( ( G@orch_seq:x or G@orch_seq:y ) and G@context:xxxxx )' pkg.upgrade
    
  3. If there is a kernel update, restart the nodes of the current sequence(s)

    sudo salt -C '( G@orch_seq:x or G@orch_seq:y )' system.reboot
    
  4. Afterwards, check that the node is running properly and the kernel upgrade was successful

    sudo salt -C '( G@orch_seq:x or G@orch_seq:y )' cmd.run "cat /etc/centos-release"
    
  5. Un-drain the recently drained nodes

  6. Repeat the steps for the next sequence(s) of the servers
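
The manual loop above can be sketched as a small wrapper script. This is a hedged sketch only: the SALT and CONTEXT variables, the upgrade_pair function name, and the dev context value are placeholders for illustration, not part of our actual tooling.

```shell
#!/bin/sh
# Hypothetical wrapper around the manual steps above.
# SALT and CONTEXT are illustrative knobs; override SALT (e.g. SALT=echo)
# for a dry run that only prints the compound targets.
SALT="${SALT:-sudo salt}"
CONTEXT="${CONTEXT:-dev}"

# Upgrade, reboot and verify one pair of sequences, e.g. upgrade_pair 1 4
upgrade_pair() {
  target="( G@orch_seq:$1 or G@orch_seq:$2 ) and G@context:$CONTEXT"
  $SALT -C "$target" pkg.upgrade
  $SALT -C "$target" system.reboot
  $SALT -C "$target" cmd.run "cat /etc/centos-release"
}
```

With a dry run (SALT=echo) you can verify the compound targets before pointing it at a real cluster; draining and un-draining still happen manually around each upgrade_pair call.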

This works fine but as you can see there are certainly steps which can be improved.

What can we do better?

As a first step, we might solve the problem of manually draining and un-draining. I found some interesting projects:

  • system-upgrade-controller aims to provide a general-purpose, Kubernetes-native upgrade controller (for nodes).

    • It introduces a new CRD, the Plan, for defining any and all of your upgrade policies/requirements. A Plan is an outstanding intent to mutate nodes in your cluster.
  • Kured (KUbernetes REboot Daemon) is a Kubernetes daemonset that performs safe automatic node reboots when the need to do so is indicated by the package management system of the underlying OS.

    • Watches for the presence of a reboot sentinel file e.g. /var/run/reboot-required or the successful run of a sentinel command.
    • Utilises a lock in the API server to ensure only one node reboots at a time
    • Optionally defers reboots in the presence of active Prometheus alerts or selected pods
    • Cordons & drains worker nodes before reboot, uncordoning them after

Install and Test Kured

Kured sounds very promising, so I gave it a chance. The installation is pretty straightforward:

  1. Download the kured manifest which is compatible with your existing Kubernetes version - see Kubernetes & OS Compatibility

  2. Modify the manifest as follows

    1. Replace docker.io with your internal registry

    2. Update the command to your needs

    3. Update the tolerations so that the daemonset is also scheduled on your master nodes. We added this

      tolerations:
      - operator: Exists
      
    4. Then apply the manifest to all your clusters

      kubectl apply -f kured-1.x.x-dockerhub.yaml
      
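The registry swap in step 2.1 can be scripted instead of edited by hand. This is a minimal sketch; the internal registry host registry.example.com and the patch_registry name are placeholders for your own environment.

```shell
#!/bin/sh
# Rewrite all docker.io image references in a kured manifest so they
# point at an internal registry (the host here is a placeholder).
REGISTRY="${REGISTRY:-registry.example.com}"

patch_registry() {
  sed "s|docker\.io|$REGISTRY|g" "$1"
}

# Example: patch_registry kured-1.x.x-dockerhub.yaml > kured-1.x.x-internal.yaml
```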

How does it work

kured checks for the existence of a sentinel-file every 1h. Per default the sentinel-file is /var/run/reboot-required but these parameters can be changed. If the sentinel-file is detected, the daemon - which uses a random offset derived from the period on startup so that nodes don’t all contend for the lock simultaneously - will reboot the underlying node.

Alternatively - for versions > 1.6.1 - a reboot sentinel command can be used. If a reboot sentinel command is configured, the presence of the reboot sentinel file is ignored.
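
Conceptually, the decision kured makes boils down to something like the following. This is a simplified illustration, not kured's actual code; the reboot_required name and the SENTINEL/SENTINEL_COMMAND variables are made up for this sketch.

```shell
#!/bin/sh
# Simplified illustration of kured's decision logic: a reboot is due if
# the sentinel file exists, or - when a sentinel command is configured -
# if that command exits successfully. Not kured's real implementation.
SENTINEL="${SENTINEL:-/var/run/reboot-required}"

reboot_required() {
  if [ -n "$SENTINEL_COMMAND" ]; then
    $SENTINEL_COMMAND >/dev/null 2>&1   # the command takes precedence over the file
  else
    [ -f "$SENTINEL" ]
  fi
}
```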

This is helpful because not all Linux distributions have such a sentinel file. Below you can see an example log of the kured daemon after I applied the patches to the machine devs0295 at around 11:50 - I’ve set the check period from 1h to 10m for that:

time="2021-04-19T11:19:50Z" level=info msg="Kubernetes Reboot Daemon: 1.6.1"
time="2021-04-19T11:19:50Z" level=info msg="Node ID: devs0295"
time="2021-04-19T11:19:50Z" level=info msg="Lock Annotation: kube-system/kured:weave.works/kured-node-lock"
time="2021-04-19T11:19:50Z" level=info msg="Lock TTL not set, lock will remain until being released"
time="2021-04-19T11:19:50Z" level=info msg="Reboot Sentinel: /var/run/reboot-required every 10m0s"
time="2021-04-19T11:19:50Z" level=info msg="Blocking Pod Selectors: []"
time="2021-04-19T11:19:50Z" level=info msg="Reboot on: ---MonTueWedThuFri--- between 07:00 and 17:00 UTC"
time="2021-04-19T11:28:28Z" level=info msg="Reboot not required"
time="2021-04-19T11:38:28Z" level=info msg="Reboot not required"
time="2021-04-19T11:48:28Z" level=info msg="Reboot not required"
time="2021-04-19T11:58:28Z" level=info msg="Reboot required"
time="2021-04-19T11:58:28Z" level=info msg="Acquired reboot lock"
time="2021-04-19T11:58:28Z" level=info msg="Draining node devs0295"

As you can see the node is drained and cordoned:

Node is cordoned by `kured` before it's restarted

After the node is restarted, the daemon is scheduled again and will un-cordon the node:

time="2021-04-19T11:33:55Z" level=info msg="Kubernetes Reboot Daemon: 1.6.1"
time="2021-04-19T11:33:55Z" level=info msg="Node ID: devs0292"
time="2021-04-19T11:33:55Z" level=info msg="Lock Annotation: kube-system/kured:weave.works/kured-node-lock"
time="2021-04-19T11:33:55Z" level=info msg="Lock TTL not set, lock will remain until being released"
time="2021-04-19T11:33:55Z" level=info msg="Reboot Sentinel: /var/run/reboot-required every 10m0s"
time="2021-04-19T11:33:55Z" level=info msg="Blocking Pod Selectors: []"
time="2021-04-19T11:33:55Z" level=info msg="Reboot on: ---MonTueWedThuFri--- between 07:00 and 17:00 UTC"
time="2021-04-19T11:33:55Z" level=info msg="Holding lock"
time="2021-04-19T11:33:55Z" level=info msg="Uncordoning node devs0292"
time="2021-04-19T11:33:55Z" level=info msg="Releasing lock"

And now?

Using kured, we no longer need to care about reboots and cordoning nodes. We still do the upgrade in sequence, which reduces the risk in case pkg.upgrade itself fails and leaves a node in an inconsistent state. We also need to “mimic” the existence of the sentinel file, as CentOS does not create one. So the upgrade procedure now looks like this:

  1. Upgrade the nodes of the sequence x and y using salt

    sudo salt -C '( ( G@orch_seq:x or G@orch_seq:y ) and G@context:xxxxx )' pkg.upgrade
    sudo salt -C '( G@orch_seq:x or G@orch_seq:y )' cmd.run 'needs-restarting -r > /dev/null; if [ $? -eq 1 ]; then sudo touch /var/run/reboot-required; fi'
    
  2. Repeat the steps for the next sequence(s) of the servers

Much simpler, isn’t it?
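
One caveat on the needs-restarting step: on CentOS/RHEL, needs-restarting -r exits 1 when a reboot is needed and 0 when it is not, so the exit code must be compared explicitly - a bare `if [ $? ]` is always true. The sketch below shows the corrected logic; the mark_reboot_needed name and the CHECK variable are made up here, the latter only so the logic can be exercised without the real tool.

```shell
#!/bin/sh
# Touch the kured sentinel file only when the check command reports a
# pending reboot. needs-restarting -r exits 1 in that case, so we test
# for exit code 1 explicitly instead of the always-true 'if [ $? ]'.
CHECK="${CHECK:-needs-restarting -r}"
SENTINEL="${SENTINEL:-/var/run/reboot-required}"

mark_reboot_needed() {
  $CHECK >/dev/null 2>&1
  if [ $? -eq 1 ]; then
    touch "$SENTINEL"
  fi
}
```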