If you are running a bare-metal cluster, you probably run Kubernetes on top of some Linux OS, and these systems have to be updated regularly. An update sometimes means you have to reboot your servers, and during a reboot that particular node is not available to schedule workloads.
More importantly, you should first cordon the node, which marks it unschedulable and thus ensures that workloads are only scheduled on un-cordoned nodes.
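For reference, cordoning and un-cordoning is done with two standard kubectl commands (the node name is just a placeholder):

```
# mark the node unschedulable so no new pods land on it
kubectl cordon server0095

# make it schedulable again after the maintenance
kubectl uncordon server0095
```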
Current state
Even though we use SaltStack to manage our nodes and [Rancher] for the clusters, we still have manual steps. For example, we still manually cordon and un-cordon nodes - with 7 clusters and dozens of nodes that is cumbersome. As we cannot update (or rather reboot) all servers at the same time, we have added a custom grain called `orch_seq`, which allows us to run the upgrade in a certain sequence, ensuring enough worker nodes stay available. The `orch_seq` is a number from 1 to 7, and these sequences can be done together:
- 1 (x) and 4 (y)
- 2 (x) and 5 (y)
- 3 (x) and 6 (y)
Then we do this cluster by cluster using the grain named `context`, as sketched below. Usually we start with the `dev` cluster, which is the least sensitive, and then work our way up to the production clusters.
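A minimal sketch of how such a run can be targeted with Salt - the grain names are ours, the compound matches below are just illustrations:

```
# set the sequence grain once per node (value 1..7)
salt 'server0095*' grains.setval orch_seq 1

# target all nodes of sequence 1 and 4 in the dev cluster
salt -C 'G@context:dev and ( G@orch_seq:1 or G@orch_seq:4 )' test.ping
```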
1. Drain the nodes of the sequence x and y which are `k8s_node` and `rancher`:
   - Check which role the nodes have
   - Drain all listed nodes which are `k8s_node` and `rancher`
2. Upgrade the nodes of the sequence x and y using Salt (we could run `pkg.upgrade` for all nodes in parallel, but we deliberately do this in sequence - in case something goes wrong with the package upgrade, we still have enough nodes available)
3. If there is a kernel update, restart the nodes of the current sequence(s)
4. Check that the nodes are running properly and the kernel upgrade was successful
5. Un-drain the recently drained nodes
6. Repeat the steps for the next sequence(s) of servers
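A rough sketch of what these manual steps look like, with placeholder node names and the grain targeting from above (the exact commands in our runbook may differ):

```
# drain the worker nodes of the current sequence
kubectl drain server0095 --ignore-daemonsets
kubectl drain server0098 --ignore-daemonsets

# upgrade the packages on those nodes, one sequence at a time
salt -C 'G@context:dev and ( G@orch_seq:1 or G@orch_seq:4 )' pkg.upgrade

# if a new kernel was installed, reboot them
salt -C 'G@context:dev and ( G@orch_seq:1 or G@orch_seq:4 )' system.reboot

# once the nodes are back and healthy, make them schedulable again
kubectl uncordon server0095
kubectl uncordon server0098
```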
This works fine but as you can see there are certainly steps which can be improved.
What can we do better?
As a first step we might want to solve the problem of manually draining and un-draining. I found some interesting projects:
system-upgrade-controller aims to provide a general-purpose, Kubernetes-native upgrade controller (for nodes).
- It introduces a new CRD, the Plan, for defining any and all of your upgrade policies/requirements. A Plan is an outstanding intent to mutate nodes in your cluster.
Kured (KUbernetes REboot Daemon) is a Kubernetes daemonset that performs safe automatic node reboots when the need to do so is indicated by the package management system of the underlying OS.
- Watches for the presence of a reboot sentinel file e.g. `/var/run/reboot-required` or the successful run of a sentinel command
- Utilises a lock in the API server to ensure only one node reboots at a time
- Optionally defers reboots in the presence of active Prometheus alerts or selected pods
- Cordons & drains worker nodes before reboot, uncordoning them after
Install and Test Kured
Kured sounds very promising, so I gave it a chance. The installation is pretty straightforward:
1. Download the `kured` manifest which is compatible with our existing Kubernetes version - see Kubernetes & OS Compatibility
2. Modify the manifest as follows (a sketch of these steps follows below):
   - Replace `docker.io` with your internal registry
   - Update the `command` to your needs. Note that the documentation in the main branch refers to the latest code changes, not necessarily the version you are using - in my case version 1.6.1 is missing some of the configuration options listed there
   - Update the `tolerations` so that the daemonset is also scheduled on your master nodes (we added a toleration for the master node taint)
3. Apply the manifest to all your clusters
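A rough sketch of these steps, assuming the release URL pattern from the kured project page and a placeholder registry name - verify both for your environment:

```
# download the manifest for the kured version matching your cluster
curl -LO https://github.com/weaveworks/kured/releases/download/1.6.1/kured-1.6.1-dockerhub.yaml

# point the image at the internal registry (registry name is a placeholder)
sed -i 's|docker.io|registry.example.com|' kured-1.6.1-dockerhub.yaml

# adjust the command flags and tolerations by hand, then roll it out
kubectl apply -f kured-1.6.1-dockerhub.yaml
```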
How does it work
`kured` checks for the existence of a sentinel file every hour. By default the sentinel file is `/var/run/reboot-required`, but these parameters can be changed. If the sentinel file is detected, the daemon - which uses a random offset derived from the period on startup, so that nodes don't all contend for the lock simultaneously - will reboot the underlying node.
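The relevant settings are passed as flags to the kured container in the daemonset. A short excerpt of what we ended up with - the flag names are taken from the kured documentation, the values are just our settings:

```
# flags passed to /usr/bin/kured in the daemonset command
--period=10m                                # how often to check (default 1h)
--reboot-sentinel=/var/run/reboot-required  # file whose presence triggers a reboot
--reboot-days=mon,tue,wed,thu,fri           # only reboot on weekdays
--start-time=07:00                          # earliest reboot time
--end-time=17:00                            # latest reboot time
```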
Alternatively - for versions > 1.6.1 - a reboot sentinel command can be used. If a reboot sentinel command is used, the presence of the reboot sentinel file is ignored. This is helpful because not all Linux distributions have such a sentinel file.

Below you can see an example log of the kured daemon after I applied the patches to the machine server0095 at around 11:50 - I've set the check period from 1h to 10m for that:
time="2021-04-19T11:19:50Z" level=info msg="Kubernetes Reboot Daemon: 1.6.1"
time="2021-04-19T11:19:50Z" level=info msg="Node ID: server0095"
time="2021-04-19T11:19:50Z" level=info msg="Lock Annotation: kube-system/kured:weave.works/kured-node-lock"
time="2021-04-19T11:19:50Z" level=info msg="Lock TTL not set, lock will remain until being released"
time="2021-04-19T11:19:50Z" level=info msg="Reboot Sentinel: /var/run/reboot-required every 10m0s"
time="2021-04-19T11:19:50Z" level=info msg="Blocking Pod Selectors: []"
time="2021-04-19T11:19:50Z" level=info msg="Reboot on: ---MonTueWedThuFri--- between 07:00 and 17:00 UTC"
time="2021-04-19T11:28:28Z" level=info msg="Reboot not required"
time="2021-04-19T11:38:28Z" level=info msg="Reboot not required"
time="2021-04-19T11:48:28Z" level=info msg="Reboot not required"
time="2021-04-19T11:58:28Z" level=info msg="Reboot required"
time="2021-04-19T11:58:28Z" level=info msg="Acquired reboot lock"
time="2021-04-19T11:58:28Z" level=info msg="Draining node server0095"
As you can see the node is drained and cordoned:
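If you don't have a dashboard at hand, the same state is visible with kubectl (the status column shows `Ready,SchedulingDisabled` for a cordoned node):

```
kubectl get node server0095
```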
After the node is restarted, the daemon is scheduled again and will un-cordon the node:
time="2021-04-19T11:33:55Z" level=info msg="Kubernetes Reboot Daemon: 1.6.1"
time="2021-04-19T11:33:55Z" level=info msg="Node ID: server0092"
time="2021-04-19T11:33:55Z" level=info msg="Lock Annotation: kube-system/kured:weave.works/kured-node-lock"
time="2021-04-19T11:33:55Z" level=info msg="Lock TTL not set, lock will remain until being released"
time="2021-04-19T11:33:55Z" level=info msg="Reboot Sentinel: /var/run/reboot-required every 10m0s"
time="2021-04-19T11:33:55Z" level=info msg="Blocking Pod Selectors: []"
time="2021-04-19T11:33:55Z" level=info msg="Reboot on: ---MonTueWedThuFri--- between 07:00 and 17:00 UTC"
time="2021-04-19T11:33:55Z" level=info msg="Holding lock"
time="2021-04-19T11:33:55Z" level=info msg="Uncordoning node server0092"
time="2021-04-19T11:33:55Z" level=info msg="Releasing lock"
And now?
Using `kured` we no longer need to care about reboots and cordoning nodes. We still do the upgrade in sequence, which reduces the risk in case `pkg.upgrade` itself fails and leaves a node in an inconsistent state. We also need to "mimic" the existence of the sentinel file, as we use CentOS. So the upgrade procedure now looks like this:
1. Upgrade the nodes of the sequence x and y using Salt (see the sketch below) - the second command ensures that the sentinel file is there, so `kured` knows a reboot is required
2. Repeat the steps for the next sequence(s) of servers
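A sketch of the two Salt commands, assuming the grain targeting from above and `needs-restarting` from yum-utils as the reboot check (the exact commands we use may differ):

```
# upgrade all packages on the nodes of the current sequence
salt -C 'G@orch_seq:1 or G@orch_seq:4' pkg.upgrade

# CentOS has no /var/run/reboot-required, so create it whenever a reboot is needed
# (needs-restarting -r exits non-zero if a reboot is required)
salt -C 'G@orch_seq:1 or G@orch_seq:4' cmd.run 'needs-restarting -r || touch /var/run/reboot-required'
```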
Much simpler, isn't it?