Flaky service connectivity on a k8s production cluster while using Fleet

Posted February 25, 2022 by Adrian Wyssmann ‐ 5 min read

All of a sudden, we started experiencing connectivity issues in our production cluster. This post explains what happened and how we worked around it.

Problem

Today, our developers notified us about strange behavior in our production cluster. We also noticed that these connection issues affect our monitoring stack:

[Figure: Grafana ingress success rate. Missing metrics in Grafana]
[Figure: Grafana ingress success rate. Metrics re-appearing in Grafana]

Even though we use the same tooling in non-production, we don’t see the issue there. At first we suspected a firewall issue, but further checks did not reveal any problem with the firewall configuration:

  • external firewall is ok
  • firewalld on the hosts is configured as expected

After digging further through the logs, we eventually found entries from our MetalLB speakers indicating that the service objects were updated. We can clearly confirm that the ingress objects are regularly re-created:

[Figure: Ingress objects were re-created]

This is odd, as no updates were made to these objects in our code repo. Apparently MetalLB has some issues when the objects are re-created too frequently. That is definitely not ok, but the main question for us is why the re-creation happens even though no changes were made.

As we are using Fleet to manage the tooling running on the clusters, let’s look at the Fleet logs, where we can see that the load balancer is regularly provisioned. The same logs also show different naming of the bundles:

W0224 09:07:14.191165 1 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
Thu, Feb 24 2022 10:07:14 am | W0224 09:07:14.206061 1 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRole is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRole
Thu, Feb 24 2022 10:07:14 am | time="2022-02-24T09:07:14Z" level=info msg="getting history for release demo-prd-v2-demo-prd-v2"
Thu, Feb 24 2022 10:07:14 am | W0224 09:07:14.212705 1 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRoleBinding is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRoleBinding
Thu, Feb 24 2022 10:07:14 am | time="2022-02-24T09:07:14Z" level=info msg="getting history for release tooling-prd-v2-cronjobber"
Thu, Feb 24 2022 10:07:14 am | time="2022-02-24T09:07:14Z" level=info msg="getting history for release hpe-csi"
Thu, Feb 24 2022 10:07:14 am | time="2022-02-24T09:07:14Z" level=info msg="getting history for release tooling-prd-v2-rancher-hpe-csi-prd-v2"
Thu, Feb 24 2022 10:07:14 am | time="2022-02-24T09:07:14Z" level=info msg="getting history for release tooling-prd-v2-rancher-hpe-csi-prd-v2"
Thu, Feb 24 2022 10:07:14 am | time="2022-02-24T09:07:14Z" level=info msg="getting history for release metallb"
Thu, Feb 24 2022 10:07:14 am | time="2022-02-24T09:07:14Z" level=info msg="getting history for release hpe-csi"
Thu, Feb 24 2022 10:07:14 am | time="2022-02-24T09:07:14Z" level=info msg="getting history for release tooling-prd-v2-rancher-hpe-csi-prd-v2"
Thu, Feb 24 2022 10:07:14 am | time="2022-02-24T09:07:14Z" level=info msg="getting history for release metallb"
Thu, Feb 24 2022 10:07:14 am | W0224 09:07:14.418112 1 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
Thu, Feb 24 2022 10:07:14 am | time="2022-02-24T09:07:14Z" level=info msg="getting history for release tooling-prd-v2-rancher-nginx-ingress-lb-prd-46957"
Thu, Feb 24 2022 10:07:14 am | time="2022-02-24T09:07:14Z" level=info msg="getting history for release fleet-agent-c-btrw9"
Thu, Feb 24 2022 10:07:14 am | W0224 09:07:14.504415 1 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRole is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRole
Thu, Feb 24 2022 10:07:14 am | W0224 09:07:14.516338 1 warnings.go:70] rbac.authorization.k8s.io/v1beta1 ClusterRoleBinding is deprecated in v1.17+, unavailable in v1.22+; use rbac.authorization.k8s.io/v1 ClusterRoleBinding
Thu, Feb 24 2022 10:07:14 am | time="2022-02-24T09:07:14Z" level=info msg="getting history for release tooling-prd-v2-rancher-nginx-ingress-lb-prd-46957"
Thu, Feb 24 2022 10:07:14 am | time="2022-02-24T09:07:14Z" level=info msg="getting history for release falcon-sensor"
Thu, Feb 24 2022 10:07:14 am | time="2022-02-24T09:07:14Z" level=info msg="getting history for release fleet-agent-c-btrw9"
Thu, Feb 24 2022 10:07:14 am | time="2022-02-24T09:07:14Z" level=info msg="getting history for release falcon-sensor"
Thu, Feb 24 2022 10:07:14 am | time="2022-02-24T09:07:14Z" level=info msg="getting history for release tooling-prd-v2-rancher-nginx-ingress-lb-prd-46957"
Thu, Feb 24 2022 10:07:14 am | time="2022-02-24T09:07:14Z" level=info msg="getting history for release kured"
Thu, Feb 24 2022 10:07:14 am | time="2022-02-24T09:07:14Z" level=info msg="getting history for release fleet-agent-c-btrw9"
Thu, Feb 24 2022 10:07:14 am | time="2022-02-24T09:07:14Z" level=info msg="getting history for release metallb"
Thu, Feb 24 2022 10:07:14 am | time="2022-02-24T09:07:14Z" level=info msg="getting history for release hpe-csi"
Thu, Feb 24 2022 10:07:14 am | time="2022-02-24T09:07:14Z" level=info msg="getting history for release kured"
Thu, Feb 24 2022 10:07:14 am | time="2022-02-24T09:07:14Z" level=info msg="getting history for release falcon-sensor"
Thu, Feb 24 2022 10:07:14 am | time="2022-02-24T09:07:14Z" level=info msg="getting history for release kured"
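With a log dump this repetitive, it can help to extract the release names and flag any that look as if a hash was appended. The following is only a minimal sketch: the function names are made up for illustration, and the "dash plus five hex characters" pattern is a heuristic assumption based on the one suspicious name seen here, not a documented Fleet format.

```python
import re
from collections import Counter

# Matches the release name in lines such as:
#   ... level=info msg="getting history for release metallb"
RELEASE_RE = re.compile(r'msg="getting history for release ([^"]+)"')

# Heuristic (assumption): a trailing dash followed by exactly five hex
# characters looks like an appended hash.
HASH_SUFFIX_RE = re.compile(r"-[0-9a-f]{5}$")


def release_counts(log_text):
    """Count how often each release name appears in the log text."""
    return Counter(RELEASE_RE.findall(log_text))


def suspicious_releases(log_text):
    """Return the sorted release names that end in what looks like a short hash."""
    return sorted(
        name for name in release_counts(log_text) if HASH_SUFFIX_RE.search(name)
    )


if __name__ == "__main__":
    # A few lines taken from the log excerpt above.
    sample = "\n".join([
        'time="2022-02-24T09:07:14Z" level=info msg="getting history for release metallb"',
        'time="2022-02-24T09:07:14Z" level=info msg="getting history for release fleet-agent-c-btrw9"',
        'time="2022-02-24T09:07:14Z" level=info msg="getting history for release tooling-prd-v2-rancher-nginx-ingress-lb-prd-46957"',
        'time="2022-02-24T09:07:14Z" level=info msg="getting history for release kured"',
    ])
    print(suspicious_releases(sample))
```

Note that fleet-agent-c-btrw9 is not flagged, because its suffix is not purely hexadecimal. A hash that happened to look like a normal word, or a normal name that happened to look like a hash, would fool this check.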

As we can see, the bundle related to the ingress and load balancer has an ID at the end (tooling-prd-v2-rancher-nginx-ingress-lb-prd-46957), while the others do not. Here is the respective GitRepo config:

apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: tooling-prd-v2
  namespace: fleet-default
spec:
  branch: master
  clientSecretName: gitrepo-auth-bitbucket
  helmSecretName: helm-repo
  caBundle: XXXXX
  insecureSkipTLSVerify: false
  paths:
    - rancher/metallb/prd
    - rancher/nginx-ingress-lb/prd-v2
    - ...
  repo: https://git.intra/k8s/config.git
  targets:
    - clusterSelector:
        matchLabels:
          management.cattle.io/cluster-name: c-btrw9

As the documentation states:

each path is scanned independently. Internally each scanned path will become a bundle that Fleet will manage

As far as I can tell, the path rancher/nginx-ingress-lb/prd-v2, which only contains YAML files, should get a bundle called GITREPO_NAME-REPO_PATH, i.e. tooling-prd-v2-rancher-nginx-ingress-lb-prd-v2, but as the logs above show, the actual name is different.
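The GITREPO_NAME-REPO_PATH naming can be sketched as a small helper. This only mirrors the behaviour observed in this post (path separators become dashes). The function name is hypothetical, and Fleet may apply further normalization that is not modelled here.

```python
def expected_bundle_name(gitrepo_name, repo_path):
    """Join the GitRepo name and the repo path, turning "/" into "-".

    Assumption: no normalization beyond replacing path separators.
    """
    return f"{gitrepo_name}-{repo_path.strip('/').replace('/', '-')}"


if __name__ == "__main__":
    print(expected_bundle_name("tooling-prd-v2", "rancher/nginx-ingress-lb/prd-v2"))
    # prints: tooling-prd-v2-rancher-nginx-ingress-lb-prd-v2
```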

Root Cause and Solution

Searching for the message from above in Slack #fleet and on GitHub led me to this issue: fleet-agent cleanup continually deleting releases · Issue #523 · rancher/fleet.

Apparently, if a bundle name is longer than 53 characters, Fleet shortens it and appends a hash value, hence tooling-prd-v2-rancher-nginx-ingress-lb-prd-46957. As the issue is still open, we worked around the problem by changing the folder structure in the Git repos so that the bundle name is shorter. So I changed

...
  paths:
    - rancher/nginx-ingress-lb/prd-v2
...

Hence the bundle name will be tooling-prd-v2-rancher-nginx-ingress-lb-prd-v2. The issue seems to be gone: before the change, the cleanup happened roughly every 24 minutes; now, after 1 hour, the objects are still there and have not been re-created.
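To see in advance which names would be affected, the truncate-and-hash behaviour described in the issue can be approximated as below. This is a sketch under assumptions, not Fleet's actual code. The limit of 53 and the five-character suffix come from this post, while the hash function (SHA-256 here), the exact cut-off position and all function names are placeholders chosen for illustration.

```python
import hashlib

NAME_LIMIT = 53  # limit mentioned in the issue
HASH_LEN = 5     # length of the suffix seen in the logs


def shorten(name, limit=NAME_LIMIT, hash_len=HASH_LEN):
    """Return name unchanged if it fits, else a truncated prefix plus a short hash."""
    if len(name) <= limit:
        return name
    # Placeholder hash: the real algorithm may differ.
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()[:hash_len]
    prefix = name[: limit - hash_len - 1]  # leave room for "-" and the hash
    return f"{prefix}-{digest}"


def names_to_review(names, limit=NAME_LIMIT):
    """Return the names that would be shortened, i.e. those longer than limit."""
    return [n for n in names if len(n) > limit]


if __name__ == "__main__":
    # Hypothetical names, not taken from the cluster in this post.
    candidates = [
        "metallb",
        "hypothetical-gitrepo-name-some-really-long-folder-structure-prd",
    ]
    for name in names_to_review(candidates):
        print(f"{name} ({len(name)} chars) -> {shorten(name)}")
```

Running a check like this against the planned GitRepo name and paths before committing a new folder structure gives an early hint that a bundle name may end up shortened.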