Kubernetes node is unavailable due to permanently crashing rancher-agent pod
Posted on October 22, 2021 by Adrian Wyssmann ‐ 4 min read
After our Kubernetes cluster had been running fine for some weeks, one of our nodes suddenly became unavailable due to a permanently restarting rancher-agent
Problem
The rancher-agent and kubelet pods crash shortly after start:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
4beeef472f88 rancher/rancher-agent:v2.5.9 "run.sh --server htt…" 35 minutes ago Restarting (2) 22 seconds ago serene_wright
0acf7d72d895 rancher/rke-tools:v0.1.77 "nginx-proxy CP_HOST…" 2 months ago Up 36 minutes nginx-proxy
648a9ad32eeb rancher/hyperkube:v1.20.9-rancher1 "/opt/rke-tools/entr…" 2 months ago Up 2 seconds kubelet
4044b7318619 rancher/hyperkube:v1.20.9-rancher1 "/opt/rke-tools/entr…" 2 months ago Up 36 minutes kube-proxy
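For reference, the listing above is simply docker ps output from the affected node; since we only have access through the salt master (see the solution below), it can be gathered remotely, for example:
# List all containers, including the restarting ones, on the unhealthy node via the salt master
sudo salt -L "server0072" cmd.run "docker ps -a"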
Both show the problem: the rancher-agent pod …
INFO: Arguments: --server https://rancher.intra --token REDACTED --ca-checksum 13ae6978cb77386108988207d27cf2e149a49adbc7db5748b2fc4179df9b5c89 --worker
INFO: Environment: CATTLE_ADDRESS=10.123.45.141 CATTLE_INTERNAL_ADDRESS= CATTLE_NODE_NAME=server0072 CATTLE_ROLE=,worker CATTLE_SERVER=https://rancher.intra CATTLE_TOKEN=REDACTED
INFO: Using resolv.conf: search dev.intra wyssmann.intra nameserver 10.110.10.11 nameserver 10.110.10.12 nameserver 10.110.10.13
INFO: https://rancher.intra/ping is accessible
INFO: rancher.intra resolves to 10.123.45.38
INFO: Value from https://rancher.intra/v3/settings/cacerts is an x509 certificate
time="2021-10-18T09:59:30Z" level=info msg="node server0072 is not registered, restarting kubelet now"
time="2021-10-18T09:59:32Z" level=info msg="Listening on /tmp/log.sock"
time="2021-10-18T09:59:32Z" level=info msg="Rancher agent version v2.5.9 is starting"
time="2021-10-18T09:59:32Z" level=info msg="Option controlPlane=false"
time="2021-10-18T09:59:32Z" level=info msg="Option worker=true"
time="2021-10-18T09:59:32Z" level=info msg="Option requestedHostname=server0072"
time="2021-10-18T09:59:32Z" level=info msg="Option customConfig=map[address:10.123.45.141 internalAddress: label:map[] roles:[worker] taints:[]]"
time="2021-10-18T09:59:32Z" level=info msg="Option etcd=false"
time="2021-10-18T09:59:32Z" level=info msg="Connecting to wss://rancher.intra/v3/connect with token xxxxxxxxxxxxxxxxxxxxx"
time="2021-10-18T09:59:32Z" level=info msg="Connecting to proxy" url="wss://rancher.intra/v3/connect"
time="2021-10-18T09:59:34Z" level=error msg="Error parsing old certificate PEM for [kube-kubelet-10-123-45-142]: data does not contain any valid RSA or ECDSA certificates"
panic: runtime error: index out of range [0] with length 0
goroutine 105 [running]:
github.com/rancher/rancher/pkg/rkecerts.(*Bundle).Changed(0xc00013e408, 0x141bf)
/go/src/github.com/rancher/rancher/pkg/rkecerts/certs.go:236 +0xfa1
github.com/rancher/rancher/pkg/rkeworker.ExecutePlan(0x51c0340, 0xc0007a2940, 0xc0017140a0, 0x0, 0xc00191fe18, 0x1)
/go/src/github.com/rancher/rancher/pkg/rkeworker/execute.go:24 +0x509
github.com/rancher/rancher/pkg/rkenodeconfigclient.ConfigClient(0x51c0340, 0xc0007a2940, 0xc0005c8630, 0x2e, 0xc0018c28a0, 0xc0005c8600, 0x2e, 0x1, 0x5e5feb)
/go/src/github.com/rancher/rancher/pkg/rkenodeconfigclient/client.go:80 +0x2b7
main.run.func1(0x51c0340, 0xc0007a2940, 0xc0013ae460, 0x0, 0x1)
/go/src/github.com/rancher/rancher/cmd/agent/main.go:283 +0x12d
github.com/rancher/remotedialer.ConnectToProxy.func1(0xc000cf0090, 0x51c0340, 0xc0007a2940, 0xc0013ae460, 0xc0010f7980)
/go/pkg/mod/github.com/rancher/[email protected]/client.go:60 +0x44
created by github.com/rancher/remotedialer.ConnectToProxy
/go/pkg/mod/github.com/rancher/[email protected]/client.go:59 +0xb39
… as well as the kubelet pod, with the same underlying issue:
I1020 12:30:17.354258 25330 plugins.go:638] Loaded volume plugin "kubernetes.io/local-volume"
I1020 12:30:17.354267 25330 plugins.go:638] Loaded volume plugin "kubernetes.io/storageos"
I1020 12:30:17.354302 25330 plugins.go:638] Loaded volume plugin "kubernetes.io/csi"
I1020 12:30:17.354512 25330 server.go:1176] Started kubelet
I1020 12:30:17.354714 25330 server.go:148] Starting to listen on 0.0.0.0:10250
E1020 12:30:17.354929 25330 kubelet.go:1291] Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data in memory cache
I1020 12:30:17.355899 25330 server.go:410] Adding debug handlers to kubelet server.
I1020 12:30:17.360743 25330 fs_resource_analyzer.go:64] Starting FS ResourceAnalyzer
I1020 12:30:17.361410 25330 volume_manager.go:269] The desired_state_of_world populator starts
I1020 12:30:17.361421 25330 volume_manager.go:271] Starting Kubelet Volume Manager
I1020 12:30:17.361508 25330 desired_state_of_world_populator.go:142] Desired state populator starts to run
I1020 12:30:17.361663 25330 reflector.go:219] Starting reflector *v1.CSIDriver (0s) from k8s.io/client-go/informers/factory.go:134
F1020 12:30:17.367675 25330 server.go:162] tls: failed to find any PEM data in certificate input
goroutine 952 [running]:
k8s.io/kubernetes/vendor/k8s.io/klog/v2.stacks(0xc000010001, 0xc001822800, 0x63, 0x1e1)
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:1026 +0xb9
k8s.io/kubernetes/vendor/k8s.io/klog/v2.(*loggingT).output(0x70ef5e0, 0xc000000003, 0x0, 0x0, 0xc000a5c770, 0x6f7aef1, 0x9, 0xa2, 0x411b00)
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:975 +0x19b
k8s.io/kubernetes/vendor/k8s.io/klog/v2.(*loggingT).printDepth(0x70ef5e0, 0xc000000003, 0x0, 0x0, 0x0, 0x0, 0x1, 0xc00102e170, 0x1, 0x1)
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:732 +0x16f
k8s.io/kubernetes/vendor/k8s.io/klog/v2.(*loggingT).print(...)
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:714
k8s.io/kubernetes/vendor/k8s.io/klog/v2.Fatal(...)
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/k8s.io/klog/v2/klog.go:1482
k8s.io/kubernetes/pkg/kubelet/server.ListenAndServeKubeletServer(0x500de00, 0xc000d57500, 0x4fa2240, 0xc0008c9940, 0xc000c0a860, 0x10, 0x10, 0x280a, 0xc000bfdce0, 0x4f84900, ...)
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/server/server.go:162 +0x4fa
k8s.io/kubernetes/pkg/kubelet.(*Kubelet).ListenAndServe(0xc000d57500, 0xc000c0a860, 0x10, 0x10, 0x280a, 0xc000bfdce0, 0x4f84900, 0xc0012ba2a0, 0xc001000100)
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/pkg/kubelet/kubelet.go:2221 +0xdb
created by k8s.io/kubernetes/cmd/kubelet/app.startKubelet
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubelet/app/server.go:1187 +0x226
The error message tls: failed to find any PEM data in certificate input sounds similar to Rancher Issue #15801, but the certificates mentioned in that issue are fine on our side. However, if we look further into the rancher-agent log, we can see this:
time="2021-10-20T15:40:40Z" level=error msg="Error parsing old certificate PEM for [kube-kubelet-10-123-45-141]: data does not contain any valid RSA or ECDSA certificates"
panic: runtime error: index out of range [0] with length 0
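This output comes straight from the logs of the rancher-agent container; a quick way to pull them remotely, using the container ID from the docker ps listing above:
# Tail the last 50 log lines of the crash-looping rancher-agent container (ID 4beeef472f88 above)
sudo salt -L "server0072" cmd.run "docker logs --tail 50 4beeef472f88"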
If we check the content of the certificate on the failing node, we can see that it is empty:
[papanito@server0072 ~]$ sudo salt -L "server0072" cmd.run "cat /etc/kubernetes/ssl/kube-kubelet-10-123-45-141.pem"
while on a healthy node the certificate is fine:
[papanito@server0073 ~]$ sudo salt -L "server0073" cmd.run "cat /etc/kubernetes/ssl/kube-kubelet-10-123-45-141.pem"
-----BEGIN CERTIFICATE-----
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
-----END CERTIFICATE-----
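Instead of eyeballing the PEM content, openssl can also be used to validate each certificate; a minimal sketch, again run through the salt master:
# Print subject and expiry date; an empty or corrupt file fails with "unable to load certificate"
sudo salt -L "server0072" cmd.run "openssl x509 -in /etc/kubernetes/ssl/kube-kubelet-10-123-45-141.pem -noout -subject -enddate"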
Going through all certificates under /etc/kubernetes/ssl/ showed that several of them were affected.
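To find all affected files in one go, one can search for empty .pem files, assuming the broken certificates are empty as shown above; a minimal sketch:
# List all empty certificate files under /etc/kubernetes/ssl on the unhealthy node
sudo salt -L "server0072" cmd.run "find /etc/kubernetes/ssl -name '*.pem' -size 0"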
Solution
The simplest fix is to copy all affected .pem files from a healthy node. As we do not have direct SSH access to the nodes, but only access through the salt master, these are the steps to take, assuming server0072 is the unhealthy node and server0073 the healthy one:
Set the filename of the affected certificate:
filename=kube-kubelet-10-123-45-156.pem
Get the healthy .pem file from the healthy node and copy it to the salt master:
sudo salt -L "server0073" cp.push "/etc/kubernetes/ssl/$filename"
Copy the healthy .pem file to the unhealthy node:
sudo salt-cp "server0072" /var/cache/salt/master/minions/server0073/files/etc/kubernetes/ssl/$filename /etc/kubernetes/ssl/$filename
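Note that cp.push only works if file_recv: True is set in the salt master configuration. Once the file is replaced, it is worth verifying the copy and restarting the crash-looping containers; a minimal sketch (the explicit restart is an assumption on my part, the containers may also recover on their own since they are restarting anyway):
# Verify the restored certificate is now valid on the unhealthy node
sudo salt -L "server0072" cmd.run "openssl x509 -in /etc/kubernetes/ssl/$filename -noout -subject -enddate"
# Restart the kubelet container (named kubelet in the docker ps output above) so it picks up the certificate
sudo salt -L "server0072" cmd.run "docker restart kubelet"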