In Kubernetes, certificate expiry is rarely “gradual degradation”; usually it’s an incident where everyone hits the wall at once: kubectl stops working, controllers can’t talk to the API, nodes can’t update status, and the system looks like it has “completely collapsed.”
This post is a practical runbook for the certificate-expiry scenario specifically on kubeadm-based (self-managed) clusters. The approach is different for managed Kubernetes (EKS/AKS/GKE).
Symptom set (quick triage)
The most frequent errors:
- x509: certificate has expired or is not yet valid
- Unable to connect to the server: x509: certificate signed by unknown authority
- Unauthorized (the certificate was renewed but the kubeconfig is stale)
During incident triage, answer these two questions fast:
- Is the error on the client side (kubeconfig / local) or on the cluster side?
- Does the error affect a single component or the control plane in general?
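One way to speed up the first question is to bucket the error text before touching anything. The sketch below is a hypothetical helper (the function name and bucket labels are illustrative, not part of kubectl) that maps the three error strings listed above to a likely starting point.

```shell
# Hypothetical helper: map an error line to the most likely triage bucket.
# The bucket names are illustrative labels, not kubectl or kubeadm output.
classify_x509_error() {
  case "$1" in
    *"certificate has expired or is not yet valid"*)
      # Either a certificate really expired, or a clock is wrong ("not yet valid").
      echo "expired-or-clock-skew" ;;
    *"certificate signed by unknown authority"*)
      # The CA trusted by the client does not match the server certificate.
      echo "ca-mismatch" ;;
    *"Unauthorized"*)
      # Certificates were renewed but the client still presents old credentials.
      echo "stale-kubeconfig" ;;
    *)
      echo "unknown" ;;
  esac
}

# Example: classify the last line kubectl prints.
# classify_x509_error "$(kubectl get nodes 2>&1 | tail -n 1)"
```

If the same bucket shows up from more than one workstation, the problem is more likely on the cluster side than in a single kubeconfig.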
Scope: Which setups is this runbook for?
- Clusters built with kubeadm
- You have SSH access to the control-plane nodes
- etcd runs on the same nodes (stacked), or externally but under your control
If you’re on managed Kubernetes: certificate renewal is generally handled by the provider; use this runbook for the “kubeconfig / client cert” parts.
Step 0 — Change management (even in incident mode)
Two panic-driven mistakes are very expensive in this incident:
- “I tried a few things” and ended up running different operations on different nodes (inconsistency)
- Renewing certificates on multiple nodes simultaneously (amplifies split-brain risk)
So:
- A single Incident Commander + a single operator
- Every command run gets a note in the decision log
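A decision log is easier to keep when the shell writes it for you. The following is a minimal sketch, assuming a plain text file is acceptable as the log; the `logged` function name and the `DECISION_LOG` path are hypothetical choices.

```shell
# DECISION_LOG is an assumed path; point it at wherever your incident notes live.
DECISION_LOG="${DECISION_LOG:-/tmp/incident-decision-log.txt}"

# Record who ran what, where, and when; run the command; record its exit code.
logged() {
  printf '%s %s@%s RUN: %s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$(id -un)" "$(uname -n)" "$*" >> "$DECISION_LOG"
  _rc=0
  "$@" || _rc=$?
  printf '%s EXIT: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$_rc" >> "$DECISION_LOG"
  return "$_rc"
}

# Example:
# logged sudo kubeadm certs check-expiration
```

Because the hostname is part of every entry, the log also makes it visible when commands start landing on more than one node.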
Step 1 — Check the certificate state (kubeadm)
On the control-plane node:
sudo kubeadm certs check-expiration
If the EXPIRES field in the output is in the past, that’s most likely your problem.
Even if only certain certificates have expired, do the renewal in a controlled way rather than “piece by piece.”
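To pull just the expired entries out of that table, a small filter can help. This sketch assumes that kubeadm prints `<invalid>` in the RESIDUAL TIME column for certificates that have already expired; check this against the output of your kubeadm version before relying on it. The `expired_certs` name is hypothetical.

```shell
# Print the first column (certificate name) of every row marked "<invalid>".
# Assumption: kubeadm shows "<invalid>" as the residual time of expired entries.
expired_certs() {
  awk '/<invalid>/ { print $1 }'
}

# Usage on a control-plane node:
# sudo kubeadm certs check-expiration | expired_certs
```

As a cross-check on a single file, `openssl x509 -noout -enddate -in /etc/kubernetes/pki/apiserver.crt` prints the expiry date directly (the path is the kubeadm default).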
Step 2 — Renew the certificates
The most practical path on kubeadm:
sudo kubeadm certs renew all
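This also renews the client certificates embedded in the kubeconfig files under /etc/kubernetes, including admin.conf. Copies made earlier (for example $HOME/.kube/config) are not updated, so still validate the kubeconfig in Step 4.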
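In keeping with the change-management step, it may be worth copying the current state aside before renewing, so there is something to compare against afterwards. A minimal sketch, assuming the kubeadm default directory /etc/kubernetes; the `backup_dir` helper and the backup location are hypothetical.

```shell
# Copy a directory to a timestamped directory under dest_root and print the new path.
# The body runs in a subshell so its variables do not leak into the caller.
backup_dir() (
  src="$1"
  dest_root="$2"
  dest="$dest_root/$(basename "$src")-$(date -u +%Y%m%dT%H%M%SZ)"
  mkdir -p "$dest_root" && cp -Rp "$src" "$dest" && echo "$dest"
)

# Usage from a root shell (the backup root is an arbitrary choice):
# backup_dir /etc/kubernetes /root/k8s-backup && kubeadm certs renew all
```

The copy contains private keys, so it deserves the same file permissions and handling as the original directory.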
Step 3 — Restart the control-plane components safely
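In most kubeadm setups, control-plane components run as static pods. Renewing certificates does not change the static pod manifests, so the running components keep using the old certificates until their containers are restarted. If the containers are not recreated after the step below, the kubeadm documentation describes temporarily moving the manifests out of /etc/kubernetes/manifests and then back. Start with a kubelet restart: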
sudo systemctl restart kubelet
Then:
sudo crictl ps | grep -E "kube-apiserver|kube-controller-manager|kube-scheduler"
sudo crictl logs $(sudo crictl ps -q --name kube-apiserver) 2>&1 | tail -n 40
The goal: confirm the API server has come back healthy.
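Rather than rereading logs by hand, the readiness check can be polled. The sketch below is a generic retry loop (`retry_until` is a hypothetical name); the usage line assumes the API server listens on the kubeadm default port 6443 and that anonymous access to `/readyz` has been left at its default.

```shell
# Retry a command until it succeeds or the attempt budget runs out.
# Arguments: max attempts, seconds between attempts, then the command to run.
# At least one attempt is always made.
retry_until() {
  _max="$1"; _delay="$2"; shift 2
  _n=1
  while :; do
    if "$@" >/dev/null 2>&1; then
      echo "ok after $_n attempt(s)"
      return 0
    fi
    if [ "$_n" -ge "$_max" ]; then
      break
    fi
    _n=$((_n + 1))
    sleep "$_delay"
  done
  echo "gave up after $_max attempt(s)" >&2
  return 1
}

# Usage: poll the local API server for up to about a minute.
# retry_until 30 2 curl -fsk https://127.0.0.1:6443/readyz
```

A bounded loop keeps the operator from waiting indefinitely; if it gives up, go back to the API server logs instead of raising the budget.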
Step 4 — Validate the kubeconfig (client side)
If you’re still seeing x509 from kubectl:
- Check whether the kubeconfig you’re using is up to date
- Re-fetch / copy the admin config
The typical path on a control-plane node:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
kubectl get nodes
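To confirm which certificate kubectl is actually presenting, the embedded client certificate can be extracted and inspected. This is a rough sketch that assumes the certificate is inlined as `client-certificate-data` (as in kubeadm's admin.conf) and that the first such entry in the file is the one in use; `kubeconfig_client_cert` is a hypothetical helper.

```shell
# Print the first embedded client certificate (PEM) found in a kubeconfig file.
# Assumes the certificate is inlined as base64 under client-certificate-data.
kubeconfig_client_cert() {
  awk '$1 == "client-certificate-data:" { print $2; exit }' "$1" | base64 -d
}

# Usage: show the expiry date of the certificate in your working kubeconfig.
# kubeconfig_client_cert "$HOME/.kube/config" | openssl x509 -noout -enddate
```

If the date printed for `$HOME/.kube/config` is older than the one for /etc/kubernetes/admin.conf, the working copy is stale and needs to be re-copied.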
Step 5 — Confirm etcd and the controllers have recovered
Don’t close out this incident just because “kubectl works.” Verify:
- Are the nodes Ready?
- Is anything in kube-system in crashloop?
- Are controllers emitting new events?
kubectl get nodes -o wide
kubectl -n kube-system get pods
kubectl get events -A --sort-by=.lastTimestamp | tail -n 30
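On larger clusters, scanning that output by eye is error-prone. The filters below are a sketch that assumes the default column layout of `kubectl get ... --no-headers` (NAME and STATUS first for nodes; NAME, READY, STATUS for pods in a single namespace); both function names are hypothetical.

```shell
# Read `kubectl get nodes --no-headers` on stdin.
# Print nodes whose STATUS does not start with "Ready".
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Read `kubectl -n kube-system get pods --no-headers` on stdin.
# Print pods that are neither Completed nor Running with all containers ready.
unhealthy_pods() {
  awk '{
    split($2, ready, "/")
    if ($3 != "Completed" && ($3 != "Running" || ready[1] != ready[2]))
      print $1, $2, $3
  }'
}

# Usage:
# kubectl get nodes --no-headers | not_ready_nodes
# kubectl -n kube-system get pods --no-headers | unhealthy_pods
```

Empty output from both filters is a reasonable signal that the first two questions are answered; events still need a human look.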
Preventive controls (don’t let this incident repeat)
The most effective prevention is producing an alert before the certificate expires.
- Collect the kubeadm certs check-expiration output via a daily job
- EXPIRES < 30 days -> warning, < 7 days -> critical
- Treat time synchronization (NTP) as a “critical service” for the control plane
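The thresholds above can be expressed as a small check that a daily job calls per certificate. This is a sketch, not a finished monitor: both function names are hypothetical, `cert_days_left` relies on openssl and GNU `date -d`, and how the result reaches your alerting system is left open.

```shell
# Map days remaining to an alert level using the 30-day and 7-day thresholds.
# Already-expired certificates (zero or negative days) are also critical.
severity_for_days() {
  if [ "$1" -lt 7 ]; then
    echo critical
  elif [ "$1" -lt 30 ]; then
    echo warning
  else
    echo ok
  fi
}

# Whole days until a certificate file expires (requires openssl and GNU date).
cert_days_left() {
  _end="$(openssl x509 -noout -enddate -in "$1" | cut -d= -f2)"
  echo $(( ( $(date -d "$_end" +%s) - $(date +%s) ) / 86400 ))
}

# Usage in a daily job (path is the kubeadm default):
# severity_for_days "$(cert_days_left /etc/kubernetes/pki/apiserver.crt)"
```

Reading the certificate files directly keeps the check working even when the API server is already unreachable.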
Final word
When a control-plane certificate expires, the goal isn’t to “find the kubeadm command”; it’s to run a controlled, single-operator, validation-driven recovery. With a good runbook plus early warning, this incident can be handled as a planned 15-30 minute maintenance window rather than an outage stretching into hours.