Upgrade Istio, monitoring and logging from Rancher 2.4.x to 2.5.x

Posted on August 5, 2021 by Adrian Wyssmann ‐ 4 min read

While we are using Rancher 2.5.x as our current cluster management solution, we are actually still using the old Istio and monitoring stack (v1) rather than v2. We want to change that, but it is not as easy as we thought.

What is Rancher?

For those who don’t know, Rancher is a tool that unifies cluster management for on-prem and in-cloud clusters and ensures consistent operations, workload management and enterprise-grade security. We have chosen it to

  • Cut down on operational complexity
  • Grow capacity when the need arises
  • Use off-the-shelf, reasonably priced commodity hardware

while

  • Remaining within the same know-how domain
  • Utilizing de facto standards propagated by a large part of the global user base
  • Lowering the lifecycle effort
  • Improving uptime with implicit redundancy

This is what our setup looks like:

Architecture overview of Rancher in our stack

One cool thing about Rancher is that you can quite easily install Istio as well as monitoring and logging.

What happened between 2.4.x and 2.5.x?

With Rancher 2.5, the approach to how Istio, monitoring and logging are integrated has changed (see Rancher Docs: Deprecated Features in Rancher v2.5). The changes have a huge impact on running clusters:

  • Rancher Docs: Monitoring in Rancher v2.5
    • You can migrate any dashboard added to Grafana in Monitoring V1 to Monitoring V2.
    • It is only possible to directly migrate expression-based alerts to Monitoring V2.
    • There is no direct equivalent for how notifiers work in Monitoring V1. Instead, you have to replicate the desired setup with Routes and Receivers in Monitoring V2 (see the Alertmanager sketch after this list).
    • Project owners and members no longer get access to Grafana or Prometheus by default, since in Rancher 2.5.x Grafana and Prometheus are only created on cluster level. The reason is access scope: if view-only users had access to Grafana, they would be able to see data from any namespace, and in Kiali any user can edit things they don’t own in any namespace.
    • Istio has to be disabled before cluster monitoring can be disabled.
  • Rancher Docs: Migrate istio from previous istio version
    • Istio v2 (in Rancher v2.5.x) uses a newer approach, in line with the upstream move away from Helm towards istioctl and an operator model. A Helm chart is maintained to provide an easy approach to versioning and installing Istio with Rancher (see the IstioOperator sketch after this list).
    • There is no direct upgrade path, so related objects have to be backed up, the Istio app has to be deleted in Cluster Manager, and then the Istio app has to be installed in Cluster Explorer.
  • Rancher Docs: Logging in Rancher v2.5
    • The Banzai Cloud Logging operator now powers Rancher’s logging solution in place of the former in-house solution.
    • Fluent Bit is now used to aggregate the logs, and Fluentd is used for filtering the messages and routing them to the outputs. Previously, only Fluentd was used.
    • Logging can be configured with a Kubernetes manifest, because logging now uses a Kubernetes operator with Custom Resource Definitions.
    • We now support filtering logs.
    • We now support writing logs to multiple outputs.
    • We now always collect Control Plane and etcd logs.
    • To configure cluster-wide logging for v2.5+ logging, one needs to set up a ClusterFlow. This object defines the source of logs, any transformations or filters to be applied, and finally the output(s) for the logs. This results in logs from all sources in the cluster (all pods and all system components) being collected and sent to the output(s) defined in the ClusterFlow. Logging in v2.5+ is not project-aware, which means that in order to collect logs from pods running in project namespaces, you will need to define Flows for those namespaces (see the sketch after this list).
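
For the notifier point above, it helps to know that Monitoring V2 is built on the Prometheus stack, so Routes and Receivers ultimately end up in an Alertmanager configuration (in Rancher this is managed through the Cluster Explorer UI or the Alertmanager config secret). Below is a minimal sketch of what a V1 Slack notifier could look like as a Route/Receiver pair; the webhook URL and channel name are placeholders, not our actual setup.

```yaml
# Minimal Alertmanager configuration sketch for Monitoring V2.
# The Slack webhook URL and channel below are placeholders.
route:
  receiver: team-platform            # default receiver for all alerts
  group_by: ['alertname', 'namespace']
  routes:
    - receiver: team-platform        # child route matching critical alerts only
      match:
        severity: critical
receivers:
  - name: team-platform
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/ME
        channel: '#alerts'
        send_resolved: true
```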
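
For Istio, the operator model means the installation is described declaratively by an IstioOperator resource instead of plain Helm values. The sketch below is a minimal example assuming the default upstream profile; the component settings are illustrative only, and Rancher’s chart wraps such a spec for you (it can be passed as an overlay) rather than you applying it by hand.

```yaml
# Minimal IstioOperator sketch, as used by the upstream operator model.
# Profile and component settings here are illustrative only.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: istiocontrolplane
  namespace: istio-system
spec:
  profile: default                   # upstream installation profile
  components:
    ingressGateways:
      - name: istio-ingressgateway
        enabled: true
```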
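
To make the logging CRDs concrete, here is a minimal sketch of a ClusterOutput (the Elasticsearch host is a placeholder), a ClusterFlow that ships all cluster logs to it, and a namespace-scoped Flow as needed for project namespaces. Namespaces, labels and hostnames are made-up examples, and depending on the operator version the reference field may be called outputRefs instead of globalOutputRefs.

```yaml
# ClusterOutput: where logs are shipped. Cluster-scoped resources live in the
# logging operator's control namespace (cattle-logging-system for Rancher).
apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterOutput
metadata:
  name: es-output
  namespace: cattle-logging-system
spec:
  elasticsearch:
    host: elasticsearch.example.internal   # placeholder host
    port: 9200
    buffer:
      timekey: 1m
      timekey_wait: 30s
      timekey_use_utc: true
---
# ClusterFlow: with no match rules it selects logs from all sources in the
# cluster and routes them to the output above.
apiVersion: logging.banzaicloud.io/v1beta1
kind: ClusterFlow
metadata:
  name: all-logs
  namespace: cattle-logging-system
spec:
  globalOutputRefs:
    - es-output
---
# Flow: namespace-scoped equivalent. Because v2.5+ logging is not
# project-aware, a project namespace needs its own Flow like this one.
apiVersion: logging.banzaicloud.io/v1beta1
kind: Flow
metadata:
  name: my-app-logs
  namespace: my-project-namespace          # placeholder project namespace
spec:
  match:
    - select:
        labels:
          app: my-app                      # placeholder label selector
  globalOutputRefs:
    - es-output
```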

Most of the changes are great news, as they give us more flexibility. However, the change to Grafana and Prometheus access is a bit worrying: previously, with project-specific monitoring, each team had admin access to their own monitoring stack and could perform all necessary actions themselves. But that’s something we will talk about later.

Our Approach

We actually wanted to migrate each component - monitoring, Istio, logging - one by one per cluster, but that is not possible. Since we run critical applications - especially in production - and given the fundamental changes introduced to logging, monitoring and Istio, not everything can be achieved without major interruptions and risks when done in-place. We therefore agreed to re-create all 7 clusters one by one from scratch, using the latest tooling versions (k8s, monitoring, logging, Istio, etc.), then migrate the workloads from the existing clusters to the new ones, and finally decommission the old clusters. This minimizes the risks substantially, there is an easy fallback in the worst case, and the outage time will be limited to a minimum during the actual migration of the individual workloads.

A side benefit of this approach is that it allows us to streamline the current setup and iron out a couple of inconsistencies that have accumulated in the short history of the platform. For instance, all VMs will be set up from scratch as well and sized consistently, strictly according to our requirements.

I will write some future posts about the specifics, which might be useful for others.