Kubernetes node is unavailable due to permanently crashing rancher-agent pod

Posted on October 22, 2021 by Adrian Wyssmann ‐ 4 min read

After our Kubernetes cluster had been running fine for some weeks, one of our nodes suddenly became unavailable due to a permanently restarting rancher-agent.

Problem

The rancher-agent and kubelet pods crash shortly after starting:

CONTAINER ID        IMAGE                                COMMAND                  CREATED             STATUS                          PORTS               NAMES
4beeef472f88        rancher/rancher-agent:v2.5.9         "run.sh --server htt…"   35 minutes ago      Restarting (2) 22 seconds ago                       serene_wright
0acf7d72d895        rancher/rke-tools:v0.1.77            "nginx-proxy CP_HOST…"   2 months ago        Up 36 minutes                                       nginx-proxy
648a9ad32eeb        rancher/hyperkube:v1.20.9-rancher1   "/opt/rke-tools/entr…"   2 months ago        Up 2 seconds                                        kubelet
4044b7318619        rancher/hyperkube:v1.20.9-rancher1   "/opt/rke-tools/entr…"   2 months ago        Up 36 minutes                                       kube-proxy
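
To see why the containers keep restarting, their logs can be inspected directly with docker on the node; a quick sketch, using the container names from the docker ps output above:

# tail the logs of the crash-looping rancher-agent container (here named serene_wright)
docker logs --tail 100 serene_wright
# the kubelet container can be checked the same way
docker logs --tail 100 kubelet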

Both the rancher-agent pod …

INFO: Arguments: --server https://rancher.intra --token REDACTED --ca-checksum 13ae6978cb77386108988207d27cf2e149a49adbc7db5748b2fc4179df9b5c89 --worker
INFO: Environment: CATTLE_ADDRESS=10.123.45.141 CATTLE_INTERNAL_ADDRESS= CATTLE_NODE_NAME=server0072 CATTLE_ROLE=,worker CATTLE_SERVER=https://rancher.intra CATTLE_TOKEN=REDACTED
INFO: Using resolv.conf: search dev.intra wyssmann.intra nameserver 10.110.10.11 nameserver 10.110.10.12 nameserver 10.110.10.13
INFO: https://rancher.intra/ping is accessible
INFO: rancher.intra resolves to 10.123.45.38
INFO: Value from https://rancher.intra/v3/settings/cacerts is an x509 certificate
time="2021-10-18T09:59:30Z" level=info msg="node server0072 is not registered, restarting kubelet now"
time="2021-10-18T09:59:32Z" level=info msg="Listening on /tmp/log.sock"
time="2021-10-18T09:59:32Z" level=info msg="Rancher agent version v2.5.9 is starting"
time="2021-10-18T09:59:32Z" level=info msg="Option controlPlane=false"
time="2021-10-18T09:59:32Z" level=info msg="Option worker=true"
time="2021-10-18T09:59:32Z" level=info msg="Option requestedHostname=server0072"
time="2021-10-18T09:59:32Z" level=info msg="Option customConfig=map[address:10.123.45.141 internalAddress: label:map[] roles:[worker] taints:[]]"
time="2021-10-18T09:59:32Z" level=info msg="Option etcd=false"
time="2021-10-18T09:59:32Z" level=info msg="Connecting to wss://rancher.intra/v3/connect with token xxxxxxxxxxxxxxxxxxxxx"
time="2021-10-18T09:59:32Z" level=info msg="Connecting to proxy" url="wss://rancher.intra/v3/connect"
time="2021-10-18T09:59:34Z" level=error msg="Error parsing old certificate PEM for [kube-kubelet-10-123-45-142]: data does not contain any valid RSA or ECDSA certificates"
panic: runtime error: index out of range [0] with length 0

goroutine 105 [running]:
github.com/rancher/rancher/pkg/rkecerts.(*Bundle).Changed(0xc00013e408, 0x141bf)
/go/src/github.com/rancher/rancher/pkg/rkecerts/certs.go:236 +0xfa1
github.com/rancher/rancher/pkg/rkeworker.ExecutePlan(0x51c0340, 0xc0007a2940, 0xc0017140a0, 0x0, 0xc00191fe18, 0x1)
/go/src/github.com/rancher/rancher/pkg/rkeworker/execute.go:24 +0x509
github.com/rancher/rancher/pkg/rkenodeconfigclient.ConfigClient(0x51c0340, 0xc0007a2940, 0xc0005c8630, 0x2e, 0xc0018c28a0, 0xc0005c8600, 0x2e, 0x1, 0x5e5feb)
/go/src/github.com/rancher/rancher/pkg/rkenodeconfigclient/client.go:80 +0x2b7
main.run.func1(0x51c0340, 0xc0007a2940, 0xc0013ae460, 0x0, 0x1)
/go/src/github.com/rancher/rancher/cmd/agent/main.go:283 +0x12d
github.com/rancher/remotedialer.ConnectToProxy.func1(0xc000cf0090, 0x51c0340, 0xc0007a2940, 0xc0013ae460, 0xc0010f7980)
/go/pkg/mod/github.com/rancher/[email protected]/client.go:60 +0x44
created by github.com/rancher/remotedialer.ConnectToProxy
/go/pkg/mod/github.com/rancher/[email protected]/client.go:59 +0xb39

… and the kubelet pod crash with the same problem:

I1020 12:30:17.354258   25330 plugins.go:638] Loaded volume plugin "kubernetes.io/local-volume"
I1020 12:30:17.354267   25330 plugins.go:638] Loaded volume plugin "kubernetes.io/storageos"
I1020 12:30:17.354302   25330 plugins.go:638] Loaded volume plugin "kubernetes.io/csi"
I1020 12:30:17.354512   25330 server.go:1176] Started kubelet
I1020 12:30:17.354714   25330 server.go:148] Starting to listen on 0.0.0.0:10250
E1020 12:30:17.354929   25330 kubelet.go:1291] Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data in memory cache
I1020 12:30:17.355899   25330 server.go:410] Adding debug handlers to kubelet server.
I1020 12:30:17.360743   25330 fs_resource_analyzer.go:64] Starting FS ResourceAnalyzer
I1020 12:30:17.361410   25330 volume_manager.go:269] The desired_state_of_world populator starts
I1020 12:30:17.361421   25330 volume_manager.go:271] Starting Kubelet Volume Manager
I1020 12:30:17.361508   25330 desired_state_of_world_populator.go:142] Desired state populator starts to run
I1020 12:30:17.361663   25330 reflector.go:219] Starting reflector *v1.CSIDriver (0s) from k8s.io/client-go/informers/factory.go:134
F1020 12:30:17.367675   25330 server.go:162] tls: failed to find any PEM data in certificate input
goroutine 952 [running]:
k8s.io/kubernetes/vendor/k8s.io/klog/v2.stacks(0xc000010001, 0xc001822800, 0x63, 0x1e1)
    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:1026 +0xb9
k8s.io/kubernetes/vendor/k8s.io/klog/v2.(*loggingT).output(0x70ef5e0, 0xc000000003, 0x0, 0x0, 0xc000a5c770, 0x6f7aef1, 0x9, 0xa2, 0x411b00)
    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:975 +0x19b
k8s.io/kubernetes/vendor/k8s.io/klog/v2.(*loggingT).printDepth(0x70ef5e0, 0xc000000003, 0x0, 0x0, 0x0, 0x0, 0x1, 0xc00102e170, 0x1, 0x1)
    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:732 +0x16f
k8s.io/kubernetes/vendor/k8s.io/klog/v2.(*loggingT).print(...)
    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:714
k8s.io/kubernetes/vendor/k8s.io/klog/v2.Fatal(...)
    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:1482
k8s.io/kubernetes/pkg/kubelet/server.ListenAndServeKubeletServer(0x500de00, 0xc000d57500, 0x4fa2240, 0xc0008c9940, 0xc000c0a860, 0x10, 0x10, 0x280a, 0xc000bfdce0, 0x4f84900, ...)
    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/server/server.go:162 +0x4fa
k8s.io/kubernetes/pkg/kubelet.(*Kubelet).ListenAndServe(0xc000d57500, 0xc000c0a860, 0x10, 0x10, 0x280a, 0xc000bfdce0, 0x4f84900, 0xc0012ba2a0, 0xc001000100)
    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/kubelet.go:2221 +0xdb
created by k8s.io/kubernetes/cmd/kubelet/app.startKubelet
    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubelet/app/server.go:1187 +0x226

The error message tls: failed to find any PEM data in certificate input looks similar to Rancher Issue #15801, but the certificates mentioned in that issue are fine on our side. However, if we look further, we can see this:

time="2021-10-20T15:40:40Z" level=error msg="Error parsing old certificate PEM for [kube-kubelet-10-123-45-141]: data does not contain any valid RSA or ECDSA certificates"
    panic: runtime error: index out of range [0] with length 0

If we check the content of the certificate on the failing node, we can see that it is empty (the command returns no output):

[papanito@server0072 ~]$ sudo salt -L "server0072" cmd.run "cat /etc/kubernetes/ssl/kube-kubelet-10-123-45-141.pem"

while on a healthy node, the certificate is fine:

[papanito@server0073 ~]$ sudo salt -L "server0073" cmd.run "cat /etc/kubernetes/ssl/kube-kubelet-10-123-45-141.pem"
-----BEGIN CERTIFICATE-----
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
-----END CERTIFICATE-----

Going through all certificates under /etc/kubernetes/ssl/ showed that several of them were affected.
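
Rather than cat-ing each file by hand, the broken ones can be spotted in one go via salt; a sketch, assuming the standard RKE certificate location on the unhealthy node:

# list pem files that are completely empty
sudo salt -L "server0072" cmd.run 'find /etc/kubernetes/ssl -name "*.pem" -empty'
# additionally flag certificates (skipping the *-key.pem private keys) that openssl cannot parse
sudo salt -L "server0072" cmd.run 'for f in /etc/kubernetes/ssl/kube-*.pem; do case "$f" in *-key.pem) continue;; esac; openssl x509 -noout -in "$f" >/dev/null 2>&1 || echo "invalid: $f"; done'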

Solution

The simplest fix is to copy all affected pem files from a healthy node. As we do not have direct ssh access to the nodes, but only access through the salt master, these are the steps to take, assuming server0072 is the unhealthy node and server0073 the healthy one.

  1. set the filename of the affected pem file

    filename=kube-kubelet-10-123-45-156.pem
  2. get the healthy pem file from the healthy node and copy it to the master

    sudo salt -L "server0073" cp.push "/etc/kubernetes/ssl/$filename"
  3. copy the healthy pem file to the unhealthy node

    sudo salt-cp "server0072" /var/cache/salt/master/minions/server0073/files/etc/kubernetes/ssl/$filename /etc/kubernetes/ssl/$filename
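
After copying, it is worth checking that the file on server0072 now actually contains a valid certificate; the crash-looping rancher-agent should then pick it up on one of its next restarts. A small verification sketch, reusing the $filename variable from step 1:

# confirm the copied file parses as an x509 certificate and show its subject and expiry
sudo salt -L "server0072" cmd.run "openssl x509 -noout -subject -enddate -in /etc/kubernetes/ssl/$filename"
# optionally restart the kubelet container so it reloads the repaired certificate right away
sudo salt -L "server0072" cmd.run "docker restart kubelet"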