I did it again, or: how I messed up a Kubernetes upgrade of a Rancher downstream cluster

Posted February 15, 2022 by Adrian Wyssmann ‐ 4 min read

Last week I upgraded the local Rancher cluster in non-production, not without issues; today it was time for one of the downstream clusters. Let me explain what happened and what we learned from it.

How it all started

You remember last week's post How I messed up a kubernetes upgrade of the Rancher local cluster. The upgrade of downstream clusters is done via the Rancher UI, where you select the desired Kubernetes version. Here too, we did an upgrade from 1.18 to 1.20:

  1. Open Cluster Management

    Cluster Management in Rancher UI
  2. Create an etcd snapshot of the managed cluster

    Create etcd snapshot in Rancher UI
  3. Click “Edit Config” and in the drop-down select the desired kubernetes version:

    Select kubernetes version in Rancher UI

After that Rancher performs the upgrade.
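Before and after such an upgrade, a quick sweep for unhealthy pods is a cheap sanity check. A minimal sketch (it simply filters on the STATUS column of `kubectl get pods`):

```shell
# unhealthy_pods: filter `kubectl get pods -A --no-headers` output down to
# pods whose STATUS (column 4) is neither Running nor Completed.
unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed"'
}

# usage against a live cluster:
#   kubectl get pods -A --no-headers | unhealthy_pods
```

If this prints anything after the upgrade, something needs a closer look - which is exactly how the calico problem below showed up.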

Issue with calico

First we saw that some of the Rancher-related pods did not start up but threw this error:

E0215 14:38:43.044065 16032 pod_workers.go:191] Error syncing pod 84cf06e8-2436-43c9-a994-2f60f15fe08a ("cattle-cluster-agent-68b7fd847-jn7kf_cattle-system(84cf06e8-2436-43c9-a994-2f60f15fe08a)"), skipping: failed to "KillPodSandbox" for "84cf06e8-2436-43c9-a994-2f60f15fe08a" with KillPodSandboxError: "rpc error: code = Unknown desc = networkPlugin cni failed to teardown pod \"cattle-cluster-agent-68b7fd847-jn7kf_cattle-system\" network: error getting ClusterInformation: connection is unauthorized: clusterinformations.crd.projectcalico.org \"default\" is forbidden: User \"system:node\" cannot get resource \"clusterinformations\" in API group \"crd.projectcalico.org\" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io \"calico-node\" not found"

We found this change in the config map related to the rke-network-plugin-deploy-job:

Config map related to rke-network-plugin

Further checking revealed that the clusterrole was not properly renamed, so we applied the change manually, which eventually solved the issue above. We don't yet know what exactly caused the problem, so we have to investigate further.
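Since we applied the fix by hand, a small guard could make that step repeatable. This is a sketch only: `calico-clusterrole.yaml` is a hypothetical export of the corrected role, not a file from our setup.

```shell
# ensure_clusterrole NAME FILE: re-apply FILE if the clusterrole NAME
# (the one the error message says is missing) does not exist.
ensure_clusterrole() {
  if ! kubectl get clusterrole "$1" >/dev/null 2>&1; then
    kubectl apply -f "$2"
  fi
}

# usage:
#   ensure_clusterrole calico-node calico-clusterrole.yaml
```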

Storage issue

We are currently using HPE 3Par, which will eventually be replaced by another solution in 2022. We have multiple downstream clusters, and most of them don't have persistent storage, so as part of our cluster overhaul we created new (v2) clusters - except for the "tooling" cluster. The "tooling" cluster is a bit special, as it runs a lot of development tooling with persistent storage, like Artifactory. On the other hand, we don't use istio on this cluster, which makes it possible to do an in-place upgrade rather than provision a separate cluster and migrate to it. Additionally, this way we don't have to worry about migrating the storage yet - we have to do that anyway once the new storage solution is available.

So while the new v2 clusters are set up from scratch with the official hpe-csi-driver, in "tooling" we can't use the official one. When we initially installed csi-driver version 1.2.0, we had to make some modifications to the helm chart, as we ran into performance issues:

Since we are using multiple VLANs for iSCSI, which was not supported by the driver, HPE made some changes for us. This is not implemented in the helm chart at the time of writing. To include this in the helm chart, an env variable has to be added to the primera-csp deployment template.

Custom helm chart modification in hpe-csi chart

As you can imagine, we did not really keep our custom helm chart up to date with the changes made in the official one. Unfortunately the developers at HPE also don't care to add this change to the official helm chart, for reasons I don't understand: Allow injection of environment variables for containers by papanito · Pull Request #219 · hpe-storage/co-deployments. Whatever - so we kept the hpe-csi driver in the "tooling" cluster in its current state. Sadly, we only realized after the upgrade that driver version 1.2 is not compatible with Kubernetes 1.20, hence we did the following to get the storage working again:

  • removed the old driver
  • installed v1.4.2 as on the other clusters
  • manually added the env variables
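In commands, those three steps looked roughly like this. It is a dry-run sketch: the release name, namespace and chart coordinates are assumptions based on our setup, and `SOME_ENV=value` stands in for the multi-VLAN variable from our custom chart, whose name the post doesn't spell out.

```shell
# Dry-run sketch of the recovery; run() only echoes the commands.
# Replace with run() { "$@"; } to actually execute them.
run() { echo "+ $*"; }

recover_csi_driver() {
  # 1. remove the old, customized driver release
  run helm uninstall hpe-csi-driver -n hpe-storage
  # 2. install the official chart at the version used on the other clusters
  run helm install hpe-csi-driver hpe-storage/hpe-csi-driver \
      -n hpe-storage --version 1.4.2
  # 3. re-add the env variable to the primera-csp deployment
  #    (SOME_ENV is a placeholder, not the real variable name)
  run kubectl set env deployment/primera-csp -n hpe-storage SOME_ENV=value
}

# recover_csi_driver
```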

Lessons learned

So there are several things we learned and will improve in the future. I guess none of this is new, but it's a reminder to keep up with good practices:

  • Check the compatibility of your tooling with the desired kubernetes version

    This point may be obvious but it is easily forgotten, especially if you have a lot of tools. Keep track of your tooling and check release notes before upgrading anything.

  • Always perform sufficient health checks to ensure the upgrade went smoothly

    We are not yet sure why the calico issue occurred - whether it's a bug or something we did wrong. Either way, enough health checks should be performed to ensure everything went smoothly. Also ensure this is reflected in your update procedure and documentation.

  • Update your tooling regularly

    Not only for security reasons: in general, patching and upgrading should happen regularly, because the bigger the gap between what you have installed and what is available, the more problematic an update can get.
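For the first lesson above, even a tiny version guard in the upgrade runbook helps catch a driver that is too old before you click "Edit Config". A sketch using `sort -V`; the function and the hpe-csi example threshold are illustrative:

```shell
# version_ge A B: succeed when version A is greater than or equal to B.
# Relies on sort -V (version sort, available in GNU coreutils and busybox).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# illustrative guard: our hpe-csi driver needed >= 1.4 for Kubernetes 1.20
#   version_ge "1.2.0" "1.4.0" || echo "upgrade the CSI driver first"
```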