Kubernetes node is unavailable due to permanently crashing rancher-agent pod

Posted October 22, 2021 by Adrian Wyssmann ‐ 4 min read

After our Kubernetes cluster had been running fine for some weeks, one of our nodes suddenly became unavailable due to a permanently restarting rancher-agent

Problem

The rancher-agent and kubelet containers crash shortly after start, as docker ps on the node shows:

CONTAINER ID        IMAGE                                COMMAND                  CREATED             STATUS                          PORTS               NAMES
4beeef472f88        rancher/rancher-agent:v2.5.9         "run.sh --server htt…"   35 minutes ago      Restarting (2) 22 seconds ago                       serene_wright
0acf7d72d895        rancher/rke-tools:v0.1.77            "nginx-proxy CP_HOST…"   2 months ago        Up 36 minutes                                       nginx-proxy
648a9ad32eeb        rancher/hyperkube:v1.20.9-rancher1   "/opt/rke-tools/entr…"   2 months ago        Up 2 seconds                                        kubelet
4044b7318619        rancher/hyperkube:v1.20.9-rancher1   "/opt/rke-tools/entr…"   2 months ago        Up 36 minutes                                       kube-proxy

Both the rancher-agent container …

INFO: Arguments: --server https://rancher.nop.sc.intra --token REDACTED --ca-checksum 13ae6978cb77386108988207d27cf2e149a49adbc7db5748b2fc4179df9b5c89 --worker
INFO: Environment: CATTLE_ADDRESS=10.120.46.141 CATTLE_INTERNAL_ADDRESS= CATTLE_NODE_NAME=devs0472 CATTLE_ROLE=,worker CATTLE_SERVER=https://rancher.nop.sc.intra CATTLE_TOKEN=REDACTED
INFO: Using resolv.conf: search dev.sc.intra sc.intra nameserver 10.128.12.11 nameserver 10.128.12.12 nameserver 10.128.12.13
INFO: https://rancher.nop.sc.intra/ping is accessible
INFO: rancher.nop.sc.intra resolves to 10.120.60.38
INFO: Value from https://rancher.nop.sc.intra/v3/settings/cacerts is an x509 certificate
time="2021-10-18T09:59:30Z" level=info msg="node devs0472 is not registered, restarting kubelet now"
time="2021-10-18T09:59:32Z" level=info msg="Listening on /tmp/log.sock"
time="2021-10-18T09:59:32Z" level=info msg="Rancher agent version v2.5.9 is starting"
time="2021-10-18T09:59:32Z" level=info msg="Option controlPlane=false"
time="2021-10-18T09:59:32Z" level=info msg="Option worker=true"
time="2021-10-18T09:59:32Z" level=info msg="Option requestedHostname=devs0472"
time="2021-10-18T09:59:32Z" level=info msg="Option customConfig=map[address:10.120.46.141 internalAddress: label:map[] roles:[worker] taints:[]]"
time="2021-10-18T09:59:32Z" level=info msg="Option etcd=false"
time="2021-10-18T09:59:32Z" level=info msg="Connecting to wss://rancher.nop.sc.intra/v3/connect with token REDACTED"
time="2021-10-18T09:59:32Z" level=info msg="Connecting to proxy" url="wss://rancher.nop.sc.intra/v3/connect"
time="2021-10-18T09:59:34Z" level=error msg="Error parsing old certificate PEM for [kube-kubelet-10-120-46-142]: data does not contain any valid RSA or ECDSA certificates"
panic: runtime error: index out of range [0] with length 0

goroutine 105 [running]:
github.com/rancher/rancher/pkg/rkecerts.(*Bundle).Changed(0xc00013e408, 0x141bf)
/go/src/github.com/rancher/rancher/pkg/rkecerts/certs.go:236 +0xfa1
github.com/rancher/rancher/pkg/rkeworker.ExecutePlan(0x51c0340, 0xc0007a2940, 0xc0017140a0, 0x0, 0xc00191fe18, 0x1)
/go/src/github.com/rancher/rancher/pkg/rkeworker/execute.go:24 +0x509
github.com/rancher/rancher/pkg/rkenodeconfigclient.ConfigClient(0x51c0340, 0xc0007a2940, 0xc0005c8630, 0x2e, 0xc0018c28a0, 0xc0005c8600, 0x2e, 0x1, 0x5e5feb)
/go/src/github.com/rancher/rancher/pkg/rkenodeconfigclient/client.go:80 +0x2b7
main.run.func1(0x51c0340, 0xc0007a2940, 0xc0013ae460, 0x0, 0x1)
/go/src/github.com/rancher/rancher/cmd/agent/main.go:283 +0x12d
github.com/rancher/remotedialer.ConnectToProxy.func1(0xc000cf0090, 0x51c0340, 0xc0007a2940, 0xc0013ae460, 0xc0010f7980)
/go/pkg/mod/github.com/rancher/[email protected]/client.go:60 +0x44
created by github.com/rancher/remotedialer.ConnectToProxy
/go/pkg/mod/github.com/rancher/[email protected]/client.go:59 +0xb39

… and the kubelet container, which fails with the same problem:

I1020 12:30:17.354258   25330 plugins.go:638] Loaded volume plugin "kubernetes.io/local-volume"
I1020 12:30:17.354267   25330 plugins.go:638] Loaded volume plugin "kubernetes.io/storageos"
I1020 12:30:17.354302   25330 plugins.go:638] Loaded volume plugin "kubernetes.io/csi"
I1020 12:30:17.354512   25330 server.go:1176] Started kubelet
I1020 12:30:17.354714   25330 server.go:148] Starting to listen on 0.0.0.0:10250
E1020 12:30:17.354929   25330 kubelet.go:1291] Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data in memory cache
I1020 12:30:17.355899   25330 server.go:410] Adding debug handlers to kubelet server.
I1020 12:30:17.360743   25330 fs_resource_analyzer.go:64] Starting FS ResourceAnalyzer
I1020 12:30:17.361410   25330 volume_manager.go:269] The desired_state_of_world populator starts
I1020 12:30:17.361421   25330 volume_manager.go:271] Starting Kubelet Volume Manager
I1020 12:30:17.361508   25330 desired_state_of_world_populator.go:142] Desired state populator starts to run
I1020 12:30:17.361663   25330 reflector.go:219] Starting reflector *v1.CSIDriver (0s) from k8s.io/client-go/informers/factory.go:134
F1020 12:30:17.367675   25330 server.go:162] tls: failed to find any PEM data in certificate input
goroutine 952 [running]:
k8s.io/kubernetes/vendor/k8s.io/klog/v2.stacks(0xc000010001, 0xc001822800, 0x63, 0x1e1)
    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:1026 +0xb9
k8s.io/kubernetes/vendor/k8s.io/klog/v2.(*loggingT).output(0x70ef5e0, 0xc000000003, 0x0, 0x0, 0xc000a5c770, 0x6f7aef1, 0x9, 0xa2, 0x411b00)
    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:975 +0x19b
k8s.io/kubernetes/vendor/k8s.io/klog/v2.(*loggingT).printDepth(0x70ef5e0, 0xc000000003, 0x0, 0x0, 0x0, 0x0, 0x1, 0xc00102e170, 0x1, 0x1)
    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:732 +0x16f
k8s.io/kubernetes/vendor/k8s.io/klog/v2.(*loggingT).print(...)
    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:714
k8s.io/kubernetes/vendor/k8s.io/klog/v2.Fatal(...)
    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:1482
k8s.io/kubernetes/pkg/kubelet/server.ListenAndServeKubeletServer(0x500de00, 0xc000d57500, 0x4fa2240, 0xc0008c9940, 0xc000c0a860, 0x10, 0x10, 0x280a, 0xc000bfdce0, 0x4f84900, ...)
    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/server/server.go:162 +0x4fa
k8s.io/kubernetes/pkg/kubelet.(*Kubelet).ListenAndServe(0xc000d57500, 0xc000c0a860, 0x10, 0x10, 0x280a, 0xc000bfdce0, 0x4f84900, 0xc0012ba2a0, 0xc001000100)
    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/kubelet.go:2221 +0xdb
created by k8s.io/kubernetes/cmd/kubelet/app.startKubelet
    /workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubelet/app/server.go:1187 +0x226

The error message "tls: failed to find any PEM data in certificate input" resembles Rancher issue #15801, but the certificates mentioned in that issue are fine on our side. However, looking further, we can see this:

time="2021-10-20T15:40:40Z" level=error msg="Error parsing old certificate PEM for [kube-kubelet-10-120-46-141]: data does not contain any valid RSA or ECDSA certificates"
    panic: runtime error: index out of range [0] with length 0

If we check the content of the certificate on the failing node, we can see that it is empty; the command produces no output:

[[email protected] ~]$ sudo salt -L "devs0472" cmd.run "cat /etc/kubernetes/ssl/kube-kubelet-10-120-46-141.pem"

while on a healthy node the certificate is fine:

[[email protected] ~]$ sudo salt -L "devs0473" cmd.run "cat /etc/kubernetes/ssl/kube-kubelet-10-120-46-141.pem"
-----BEGIN CERTIFICATE-----
MIIDBTCCAe2gAwIBAgIIHGCl/5CYxlIwDQYJKoZIhvcNAQELBQAwEjEQMA4GA1UE
AxMHa3ViZS1jYTAeFw0yMTA2MjUwOTAyNTRaFw0zMTEwMTgxNTA1MDVaMCUxIzAh
BgNVBAMTGmt1YmUta3ViZWxldC0xMC0xMjAtNDYtMTQxMIIBIjANBgkqhkiG9w0B
AQEFAAOCAQ8AMIIBCgKCAQEAragsOjBA4jQ59uS1QBh7c2WozFe6RQL8/l0E7ex3
GmlVRC/O8zL7bCde9aAR4HLe7qojJ46PUWdtKwfXuT7o/A+g/02NUR/DE11cREQt
5yR5lqm/PBkEU+WZw+VB5RKd9EePv354H20fq86pbL6JkQ7DbvlFQvmTqyvE3BfL
drZ3xHp7gR8FQchDVU6nRS5g73uOKn6Pbej5mcm+/V05Tv1RGrcxzdfgggk47GZn
uCajRFs8ElJqXE5C1WpJe9Zj3u8/F6IbnA9tF2VtFjZkCfk3R1PH6v9RYrBCyYa3
ZrLCTHpCLcXsSC9AXIq2+lyjx7F2FmAe9uLwW+8RUWP0EQIDAQABo0wwSjAOBgNV
HQ8BAf8EBAMCBaAwHQYDVR0lBBYwFAYIKwYBBQUHAwIGCCsGAQUFBwMBMBkGA1Ud
EQQSMBCCCGRldnMwNDcyhwQKeC6NMA0GCSqGSIb3DQEBCwUAA4IBAQCx+1XRcspv
AVFmbG9eO6rUstptXg5ZCnOqWhTtY7TwXlMVKHpmdlTELGOhGP/IAaP3U7CFhs+f
QNN6VhC63oRVC+srP8eosdZvuVT4wGEqcHZPQUHeoBMISt+MonMtePjRvkHWAqcD
qK6jMUuTm8r1EcKC95M97BhRMn26sgXvINCLFghnc3wr7j8x09oslyCTojN7FEqj
6cdLR8GvSlJ1vTYJAxskb4kmzORNpEfFhe1N6hSmqlqfisLyfeb1UHhM7v+u++JZ
wZXvyknLQ48c/9tlYPYepX160ygHi9IWDi0F9cMbhQxYGfmj95SrTgZYw1UxG5sn
wEpLGkftbA8G
-----END CERTIFICATE-----

Going through all certificates under /etc/kubernetes/ssl/ showed that several of them were affected.
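To find all affected files in one go, a small check over the directory helps. The following is a sketch (the function name check_pems is my own; run it on the node, e.g. shipped via salt cmd.run): it reports every pem file that is empty or that openssl cannot parse as a certificate.

```shell
# check_pems DIR: report any .pem under DIR that is empty or that
# openssl cannot parse as an x509 certificate (private keys are skipped).
check_pems() {
  dir="$1"
  for f in "$dir"/*.pem; do
    [ -e "$f" ] || continue                   # no .pem files at all
    case "$f" in *-key.pem) continue ;; esac  # skip private keys
    if [ ! -s "$f" ]; then
      echo "EMPTY: $f"
    elif ! openssl x509 -noout -in "$f" 2>/dev/null; then
      echo "INVALID: $f"
    fi
  done
}
```

On the failing node, running check_pems /etc/kubernetes/ssl would flag kube-kubelet-10-120-46-141.pem as empty, along with the other affected certificates.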

Solution

The simplest fix is to copy all affected pem files from a healthy node. As we do not have direct SSH access to the nodes, but only access through the salt master, these are the steps to take, assuming devs0472 is the unhealthy node and devs0473 the healthy one:

  1. Set the filename of the affected file:

    filename=kube-kubelet-10-120-46-156.pem
    
  2. Pull the healthy pem file from the healthy node onto the salt master:

    sudo salt -L "devs0473" cp.push "/etc/kubernetes/ssl/$filename"
    
  3. Copy the healthy pem file to the unhealthy node:

    sudo salt-cp "devs0472" /var/cache/salt/master/minions/devs0473/files/etc/kubernetes/ssl/$filename /etc/kubernetes/ssl/$filename
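Since several certificates were affected, steps 2 and 3 can be wrapped in a small helper and repeated per file. This is a sketch under the assumptions above (the sync_cert helper is my own; node names and the salt master cache path follow our setup); with DRY_RUN=1 the commands are only printed instead of executed, which is handy for a first check.

```shell
# sync_cert FILE HEALTHY UNHEALTHY: pull FILE from the healthy minion to
# the salt master cache, then push it to the unhealthy minion.
# With DRY_RUN=1 the salt commands are printed instead of executed.
sync_cert() {
  filename="$1"; healthy="$2"; unhealthy="$3"
  run() { if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi; }
  run sudo salt -L "$healthy" cp.push "/etc/kubernetes/ssl/$filename"
  run sudo salt-cp "$unhealthy" \
    "/var/cache/salt/master/minions/$healthy/files/etc/kubernetes/ssl/$filename" \
    "/etc/kubernetes/ssl/$filename"
}

# Repeat for every broken certificate, e.g.:
# for f in kube-kubelet-10-120-46-156.pem; do sync_cert "$f" devs0473 devs0472; done
```

After the certificates are back in place, the rancher-agent and kubelet containers should come up cleanly again on their next restart.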