How I messed up a kubernetes upgrade of the Rancher local cluster

Posted on February 10, 2022 by Adrian Wyssmann ‐ 5 min read

Today I wanted to update the local Rancher cluster to the latest Kubernetes version, which went horribly wrong. Let me explain what happened and what we learned from it.

How it all started

If you are using Rancher, you have a “local” or management server or cluster, which manages 1 to n “downstream clusters”. Today we planned to upgrade the local Rancher cluster in “non-production” to a more recent Kubernetes version, as we are currently still running on 1.18.

We have performed the upgrade procedure several times over the past years, so we have all steps documented - or so we believed. Unfortunately, the last upgrade was done months ago. We are using RKE to manage the local cluster, so our upgrade procedure looks like this:

  • get the latest rke binary from GitHub (see the download sketch after this list)

  • run an etcd snapshot - we do this despite the fact that we also have regular etcd backups set up

     ./rke etcd snapshot-save --config rancher-cluster.yaml --name $(date '+%Y%m%d%H%M')
  • When the backup has been completed successfully, use rke to upgrade the Kubernetes cluster.

    • Update the Kubernetes version in rancher-cluster.yaml (run ./rke config --list-version --all to see all available versions):

      sudo sed -i -- 's/v1.*-[0-9a-z]/v1.18.10-rancher1-1/g' rancher-cluster.yaml
    • Check the file

      cat rancher-cluster.yaml | grep "kubernetes_version"
      kubernetes_version: "v1.x.x-rancherx-x"
    • Run update

      ./rke_1.x.x up --config rancher-cluster.yaml
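
For reference, fetching a pinned rke release from GitHub (rather than blindly taking the latest - more on that below) could look like the following sketch; the version is a placeholder and should be picked according to the compatibility matrix for your Rancher version:

  # Download a specific rke release and make it executable
  # (the version here is a placeholder - pick the one matching your Rancher version)
  RKE_VERSION=v1.x.x
  wget -O rke_${RKE_VERSION} \
    https://github.com/rancher/rke/releases/download/${RKE_VERSION}/rke_linux-amd64
  chmod +x rke_${RKE_VERSION}
  ./rke_${RKE_VERSION} --version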

For completeness, the rancher-cluster.yaml looks like this:

ignore_docker_version: true

nodes:
   - address: 10.123.45.32
     user: rancher
     role: [controlplane,etcd,worker]
     ssh_key_path: server0005/id_rsa
   - address: 10.123.45.31
     user: rancher
     role: [controlplane,etcd,worker]
     ssh_key_path: server0004/id_rsa
   - address: 10.123.45.33
     user: rancher
     role: [controlplane,etcd,worker]
     ssh_key_path: server0006/id_rsa
 
private_registries:
  - url: docker.intra
    is_default: true
 
services:
  etcd:
    snapshot: true # enables recurring etcd snapshots
    creation: 6h0s # time increment between snapshots
    retention: 72h # how long snapshots are retained before deletion

The upgrade itself seemed fine and there were no obvious errors. However, after rke finished, the Rancher API was not reachable, hence no access to the Rancher UI and no access to the downstream clusters via kubectl.
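
A quick way to verify that symptom - assuming you still have the kubeconfig rke generated for the local cluster, and with the Rancher URL being a placeholder - is something like:

  # Rancher's ping endpoint normally answers with "pong"
  curl -ks https://rancher.example.intra/ping

  # Check the Rancher pods on the local cluster directly
  kubectl --kubeconfig kube_config_rancher-cluster.yaml get pods -n cattle-system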

Restore issues

While we were trying to restore the state from the backup, we faced several issues.

At first, restoring from the most recent backup did not work. While trying to restore, rke did not find a cluster.rkestate within the backup, so it falls back to the local file - the one in the same folder from which you run rke. This file, however, already contains the state from the recent upgrade, so the restore just does not work: we noticed that it still uses the docker containers for v1.22.6-rancher1-1 instead of v1.18.x-rancher1-1.
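
For context, the restore itself is a single rke command (the snapshot name is a placeholder); as described above, rke falls back to the local cluster.rkestate when the snapshot does not contain one, which is exactly what bit us here:

  # Restore the local cluster from a named etcd snapshot
  ./rke etcd snapshot-restore --config rancher-cluster.yaml --name <snapshot-name>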

As per our restore documentation, we sync this particular rke folder daily (at 00:00) to a separate file share, so we should find the desired cluster.rkestate (the one from before we ran the upgrade this morning) there. Sadly, the sync job did not seem to work properly, as it only partially synced the files: we had a cluster.rkestate from 2019. We still have to figure out why, but first we needed to restore the cluster.
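
For illustration, such a daily sync job is essentially nothing more than a cron entry like the one below; the user, paths and file share are placeholders and not necessarily how our job is implemented:

  # /etc/cron.d/rke-state-sync - sync the rke working directory to the file share at midnight
  0 0 * * * rancher rsync -a --delete /home/rancher/rke/ /mnt/fileshare/rke-backup/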

Luckily, we also have regular backups (every 6h) of the cluster state, and as there were no recent changes on the cluster (i.e. no new projects, changes in the user config or whatsoever) we could use one of these. Unfortunately, while we tried to restore from it, rke this time used the correct docker images, i.e. v1.19.x-rancher1-1, but the installation timed out. We then remembered one of our old support cases, “coredns does not start unless we disable firewalld on the nodes of the new cluster”, which pointed us to the following suggestion:

One user reports success by creating a separate firewalld zone with a policy of ACCEPT for the Pod CIDR (https://github.com/rancher/rancher/issues/28840#issuecomment-787404822).

We manage our servers via Salt and implemented this change on 21 Dec 2020 and applied it to the servers. Unfortunately, the change was later (on 06 Jan 2021) removed again. So whenever the respective Salt state was applied on the Rancher nodes after 06 Jan 2021, the firewalld changes were removed again. To fix that, we re-added the change and re-applied it. After that we could finally restore the old state.
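
Applied by hand (outside of Salt), the suggested firewalld change boils down to something like this; the zone name is arbitrary and 10.42.0.0/16 is only the default Pod CIDR of RKE/Canal, so adjust it to your cluster:

  # Create a dedicated zone that accepts all traffic from the Pod CIDR
  sudo firewall-cmd --permanent --new-zone=rke-pods
  sudo firewall-cmd --permanent --zone=rke-pods --set-target=ACCEPT
  sudo firewall-cmd --permanent --zone=rke-pods --add-source=10.42.0.0/16
  sudo firewall-cmd --reload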

What went wrong with the upgrade

Well, even though we had proper documentation, there was one tiny detail: “get the latest rke”. If you upgrade regularly this may be fine, but as we had a big gap between the last upgrade and now, rke had evolved, so the latest version was not fine.

Unfortunately, the latest rke binary, v1.3.7, has Kubernetes v1.22.6-rancher1-1 as its default, but

  • this version is marked experimental in the Release notes
  • it is not supported by Rancher 2.5.11, which we are running

As SUSE Support made us aware, there is actually a Compatibility Matrix.
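
A simple pre-flight check before any future upgrade is therefore to compare the rke binary and the Kubernetes versions it supports against that matrix, for example:

  # Which rke binary am I about to use?
  ./rke --version

  # Which Kubernetes versions does this rke release support?
  ./rke config --list-version --all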

Lessons learned

So there are several things we learned and will improve in the future. I guess none of this is new, but it is a good reminder to keep good practices alive:

  • Make your documentation bullet proof

    Everybody should be able to perform the tasks, hence the documentation should contain all relevant details. Also ensure that the knowledge of such tasks is shared and the documentation is peer-reviewed.

  • Check the consistency of your backups regularly

    Don’t assume your backups are fine; check them from time to time (see the sketch after this list).

  • Use support if you have it

    If you have a support contract, use it. For future upgrades we will notify SUSE in time, so they have an engineer ready in case something goes wrong and can also give us a hint about compatibility.
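
Regarding the backup check mentioned above, a small cron-able check that warns when the synced cluster.rkestate on the file share grows stale could look like this; the path and mail address are placeholders, and it assumes a working local mail command:

  #!/bin/sh
  # Warn if the synced cluster.rkestate is missing or older than 24h
  STATE=/mnt/fileshare/rke-backup/cluster.rkestate
  if [ -z "$(find "$STATE" -mtime -1 2>/dev/null)" ]; then
    echo "WARNING: $STATE is missing or older than 24h" \
      | mail -s "rke state sync is stale" ops@example.intra
  fi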

Our documentation has already been updated to reflect today’s findings, so this should not happen again in the future. By the way, using the correct rke version, we could successfully upgrade our cluster afterwards.