How I messed up a kubernetes upgrade of the Rancher local cluster
Posted on February 10, 2022 by Adrian Wyssmann ‐ 5 min read
Today I wanted to update the local Rancher cluster to the latest kubernetes version, which went horribly wrong. Let me explain what happened and what we learned from it.
How it all started
If you are using Rancher, you have a “local” or management server or cluster, which manages 1 to n “Downstream Clusters”. Today we planned to upgrade the local Rancher cluster in “non-production” to a more recent kubernetes version, as we are currently still running 1.18.
We have performed the upgrade procedure several times in the past years, so we have all steps documented - or so we believed. Unfortunately, the last upgrade was done months ago. We are using RKE to manage the local cluster, so our upgrade procedure looks like this:
- Run an etcd snapshot - we do this despite the fact that we also have regular etcd backups set up (a quick sanity check of the resulting snapshot follows after this list):
  ./rke etcd snapshot-save --config rancher-cluster.yaml --name $(date '+%Y%m%d%H%M')
- When the backup has completed successfully, use rke to upgrade the Kubernetes cluster:
  - Update the kubernetes version in rancher-cluster.yaml (run ./rke config --list-version --all to see all available versions):
    sudo sed -i -- 's/v1.*-[0-9a-z]/v1.18.10-rancher1-1/g' rancher-cluster.yaml
  - Check the file:
    cat rancher-cluster.yaml | grep "kubernetes_version"
    kubernetes_version: "v1.x.x-rancherx-x"
  - Run the update:
    ./rke_1.x.x up --config rancher-cluster.yaml
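As a small sanity check - not part of the documented steps above - the snapshot from the first step should show up on the etcd nodes. By default rke stores local snapshots under /opt/rke/etcd-snapshots, so something along these lines confirms it is actually there:
ssh rancher@10.123.45.32 'ls -lh /opt/rke/etcd-snapshots/'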
For completeness, the rancher-cluster.yaml looks like this:
ignore_docker_version: true
nodes:
  - address: 10.123.45.32
    user: rancher
    role: [controlplane,etcd,worker]
    ssh_key_path: server0005/id_rsa
  - address: 10.123.45.31
    user: rancher
    role: [controlplane,etcd,worker]
    ssh_key_path: server0004/id_rsa
  - address: 10.123.45.33
    user: rancher
    role: [controlplane,etcd,worker]
    ssh_key_path: server0006/id_rsa
private_registries:
  - url: docker.intra
    is_default: true
services:
  etcd:
    snapshot: true   # enables recurring etcd snapshots
    creation: 6h0s   # time increment between snapshots
    retention: 72h   # time increment before snapshots expire
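Not something our procedure prescribes, but a quick way to sanity-check the result of rke up is to look at the nodes and the Rancher pods via the kubeconfig that rke generates next to the cluster config (kube_config_rancher-cluster.yaml):
kubectl --kubeconfig kube_config_rancher-cluster.yaml get nodes
kubectl --kubeconfig kube_config_rancher-cluster.yaml -n cattle-system get pods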
The upgrade itself seemed fine and there were no obvious errors. However, after rke finished, the Rancher API was not reachable, hence no access to the Rancher UI and no access to the downstream clusters via kubectl.
Restore issues
While we were trying to restore the state from the backup, we faced several issues.
At first, restoring did not work with the most recent backup. While trying to restore, rke did not find the cluster.rkestate from within the backup, so it falls back to the local file - the one in the same folder from which you run rke. This file, however, already contains the state from the recent upgrade, so the restore just does not work. We noticed that it still used the docker containers for v1.22.6-rancher1-1 instead of v1.18.x-rancher1-1.
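For context, the restore we attempted follows the usual rke snapshot restore, roughly like this (the snapshot name is a placeholder):
./rke etcd snapshot-restore --config rancher-cluster.yaml --name <snapshot-name>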
As per our restore documentation, we sync this particular rke folder daily (at 00:00) to a separate file share, so we should find the desired cluster.rkestate (the one from before we ran the upgrade this morning) there. Sadly, the sync job did not seem to work properly, as it only partly synced some of the files: we had a cluster.rkestate from 2019. We still have to figure out why, but first we needed to restore the cluster.
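For illustration only - our actual sync job and paths look different, so treat this as a hypothetical sketch - such a daily sync can be as simple as a cron entry that rsyncs the rke folder to the file share:
# /etc/cron.d/rke-sync - hypothetical paths
0 0 * * * rancher rsync -a /home/rancher/rke/ /mnt/fileshare/rke-backup/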
Luckily we also have regular backups (every 6h) of the cluster state, and as there were no recent changes on the cluster (i.e. no new projects, changes in the user config or whatsoever) we could use one of these. Unfortunately, while we tried to restore this, the restore used the correct docker images, i.e. v1.19.x-rancher1-1, but the installation timed out. We then remembered one of our old support cases, “coredns does not start unless we disable firewalld on the nodes of the new cluster”, which instructed us to do the following:
One user reports success by creating a separate firewalld zone with a policy of ACCEPT for the Pod CIDR (https://github.com/rancher/rancher/issues/28840#issuecomment-787404822).
We manage our servers via Salt and had implemented this change on 21 Dec 2020 and applied it to the servers. Unfortunately, the change was later (06 Jan 2021) removed again, so when the respective salt state was applied on the rancher nodes after 06 Jan 2021, the firewalld changes were gone. To fix that, we re-added the change and re-applied it. After that, we finally could restore the old state.
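For reference, the workaround from the linked issue boils down to a dedicated firewalld zone whose target is ACCEPT and which covers the Pod CIDR. A minimal sketch - the zone name is arbitrary and the Pod CIDR is assumed to be the rke default of 10.42.0.0/16, so adjust it to your cluster - looks like this:
firewall-cmd --permanent --new-zone=rke-pods     # hypothetical zone name
firewall-cmd --permanent --zone=rke-pods --set-target=ACCEPT
firewall-cmd --permanent --zone=rke-pods --add-source=10.42.0.0/16   # assumed default Pod CIDR
firewall-cmd --reload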
What went wrong with the upgrade
Well, even though we had proper documentation, there was one tiny little thing: “get the latest rke”. If you do upgrades regularly this may seem ok, but as we had a big gap between the last upgrade and now, rke has evolved, so the latest version is not ok.
Unfortunately, the latest rke binary v1.3.7 has kubernetes v1.22.6-rancher1-1 as a default, but
- this version is marked experimental in the Release notes
- it is not supported by the Rancher version 2.5.11, which we are running
As we were made aware by SUSE Support, there is actually a Compatibility Matrix.
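To avoid this in the future, a quick pre-upgrade check is to compare what the rke binary ships with against that matrix - a minimal sketch, assuming the binary sits in the current folder:
./rke --version                      # e.g. rke version v1.3.7
./rke config --list-version --all    # kubernetes versions this rke binary can deploy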
Lessons learned
So there are several things we learned and will improve in the future. I guess none of this is new, but it is a good reminder to keep good practices alive:
Make your documentation bulletproof
Everybody should be able to perform the tasks, hence the documentation should contain all relevant details. Also ensure the knowledge of such tasks is shared and the documentation is peer-reviewed.
Check the consistency of your backups regularly
Don’t assume your backups are fine - verify them from time to time.
Use support if you have it
If you have a support contract, use it. For future upgrades we will notify SUSE in time so they have an engineer ready in case something goes wrong, and so they can give us a hint about compatibility beforehand.
Our documentation has already been updated to reflect today's findings, so this should not happen again in the future. Btw, by using the correct rke version, we could successfully upgrade our cluster afterwards.