We recently ran into a serious problem when restoring a Rancher cluster that had OPA Gatekeeper installed.
Issues with cluster restore and Gatekeeper
I installed Gatekeeper a few weeks ago. Today we had an issue that led me to restore the cluster to its state from a few minutes earlier. Usually that is not a problem, or at least it was not before OPA Gatekeeper was installed. The restore itself worked, but the cluster did not come back up properly. Some pods were stuck in Terminating while others were stuck in Pending:
We can see the events in the namespace cattle-system:
Furthermore, we can also see these errors in the logs:
We eventually got out of this misery by deleting the Gatekeeper webhooks.
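Deleting the webhooks can be sketched roughly as follows. The two webhook configuration names are the upstream Gatekeeper defaults and are an assumption here; a Rancher-packaged chart may name them differently, so it is worth listing the existing configurations first.

```shell
# Sketch: remove Gatekeeper's admission webhooks so the API server stops calling them.
# The names below are the upstream defaults (an assumption); verify them first with:
#   kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations
delete_gatekeeper_webhooks() {
  kubectl delete validatingwebhookconfiguration \
    gatekeeper-validating-webhook-configuration --ignore-not-found
  kubectl delete mutatingwebhookconfiguration \
    gatekeeper-mutating-webhook-configuration --ignore-not-found
}
```

Calling `delete_gatekeeper_webhooks` runs both deletions; `--ignore-not-found` keeps the call from failing if one of the configurations does not exist.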
After that, we also had to restart some of the nodes on which pods were stuck in Pending.
If you use the --exempt-namespace flag together with the admission.gatekeeper.sh/ignore label on a namespace, the API server will not call Gatekeeper’s webhook for any resource in that namespace. As a result, Gatekeeper being down should have no effect on requests for that namespace.
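One way to check this behaviour is to look at the namespaceSelector of the webhook itself. The sketch below assumes the upstream default name for the validating webhook configuration; adjust it if your installation uses a different one.

```shell
# Sketch: print the namespaceSelector of Gatekeeper's validating webhook.
# The configuration name is the upstream default and an assumption here.
show_gatekeeper_namespace_selector() {
  kubectl get validatingwebhookconfiguration \
    gatekeeper-validating-webhook-configuration \
    -o jsonpath='{.webhooks[*].namespaceSelector}'
}
```

If the selector contains an expression on the key admission.gatekeeper.sh/ignore with the operator DoesNotExist, namespaces carrying that label should be skipped by the API server.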
When looking at the Helm chart, more specifically at gatekeeper-controller-manager-deployment.yaml, we can see that one has to provide exemptNamespaces and exemptNamespacePrefixes:
Hence the values file has to be extended to suit your needs; in particular, you may need to add further namespaces under exemptNamespaces:
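A minimal sketch of such a values file, written from the shell. The file name and the namespace list are illustrative assumptions, and the nesting under controllerManager follows the upstream Gatekeeper chart; check the values.yaml of the chart you actually deploy before relying on it.

```shell
# Sketch: write a Helm values file that exempts critical namespaces.
# File name and namespace list are illustrative assumptions, not the original values.
VALUES_FILE="${VALUES_FILE:-gatekeeper-values.yaml}"

cat > "$VALUES_FILE" <<'EOF'
controllerManager:
  exemptNamespaces:
    - kube-system
    - cattle-system
  exemptNamespacePrefixes: []
EOF
```

The resulting file can then be passed to the chart with the -f option of helm upgrade.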
Before continuing, let’s add back the webhooks we deleted above:
Finally, we have to add the missing label to all namespaces. I ran this script:
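A minimal sketch of such a labeling script, taking the namespaces as arguments. The label value no-self-managing is an assumption; the webhook selector is assumed to match on the label key only, so any value should work.

```shell
# Sketch: add the admission.gatekeeper.sh/ignore label to every namespace passed as an argument.
# The label value is an assumption; only the presence of the key is expected to matter.
label_namespaces_ignore() {
  for ns in "$@"; do
    kubectl label namespace "$ns" \
      admission.gatekeeper.sh/ignore=no-self-managing --overwrite
  done
}
```

For example, `label_namespaces_ignore kube-system cattle-system` labels those two namespaces; `--overwrite` makes the call safe to repeat.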
Note
Ensure that the namespaces are exempt before you run the script above; otherwise you get this error:
Error from server (Only exempt namespace can have the admission.gatekeeper.sh/ignore label): admission webhook "check-ignore-label.gatekeeper.sh" denied the request: Only exempt namespace can have the admission.gatekeeper.sh/ignore label
Once everything was in place, we created a new snapshot and then restored it. Unfortunately, some pods were still stuck in Pending and Terminating:
We still observed the same events in the namespace kube-system …
… and in namespace cattle-system:
Ultimately, it helped to restart the nodes on which pods were stuck in Pending.
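To find out which nodes are affected, something along these lines may help. It is only a sketch: it lists the distinct nodes that have Pending pods assigned to them and skips pods that have not been scheduled to any node yet.

```shell
# Sketch: list the distinct nodes that currently have Pending pods assigned.
# Pods without an assigned node show up as "<none>" and are filtered out.
nodes_with_pending_pods() {
  kubectl get pods --all-namespaces \
    --field-selector=status.phase=Pending \
    --no-headers \
    -o custom-columns=NODE:.spec.nodeName |
    awk '$1 != "<none>" { print $1 }' | sort -u
}
```

Running `nodes_with_pending_pods` prints one node name per line, which gives a list of candidates for a restart.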
Conclusion
When using Gatekeeper, it is essential to exempt the important namespaces and to add the admission.gatekeeper.sh/ignore label to all namespaces in the system project.