Mustafa Erbay

Kubernetes Control Plane Certificate Expiry: A Runbook

What to do when API server access suddenly breaks with x509 errors: certificate renewal and safe recovery steps for kubeadm-based clusters.

In Kubernetes, certificate expiry is rarely “gradual degradation”; usually it’s an incident where everyone hits the wall at once: kubectl stops working, controllers can’t talk to the API, nodes can’t update status, and the system looks like it has “completely collapsed.”

This post is a practical runbook for the certificate-expiry scenario specifically on kubeadm-based (self-managed) clusters. The approach is different for managed Kubernetes (EKS/AKS/GKE).

Symptom set (quick triage)

The most frequent errors:

  • x509: certificate has expired or is not yet valid
  • Unable to connect to the server: x509: certificate signed by unknown authority
  • Unauthorized (the certificate was renewed but the kubeconfig is stale)

During incident triage, answer these two questions fast:

  1. Is the error on the client side (kubeconfig / local) or on the cluster side?
  2. Does the error affect a single component or the control plane in general?
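As a rough aid for the first question, the symptom list above can be turned into a small lookup. This is only a sketch: triage_hint is a hypothetical helper name, and the hints are starting hypotheses (the first two are my reading of the error text, not a diagnosis).

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a kubectl error string to a first triage hypothesis.
# The hints are starting points for investigation, not a diagnosis.
triage_hint() {
  local msg="$1"
  case "$msg" in
    *"certificate has expired or is not yet valid"*)
      echo "expiry-or-clock" ;;      # check certificate dates and node time sync
    *"certificate signed by unknown authority"*)
      echo "ca-mismatch" ;;          # compare the kubeconfig CA with the cluster CA
    *"Unauthorized"*)
      echo "stale-kubeconfig" ;;     # certificate renewed, kubeconfig copy is old
    *)
      echo "unknown" ;;              # collect more logs before acting
  esac
}
```

On a workstation you could feed it the live error with triage_hint "$(kubectl get nodes 2>&1)". If the same kubeconfig fails from every machine, the cluster side is the more likely culprit.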

Scope: Which setups is this runbook for?

  • Clusters built with kubeadm
  • You have SSH access to the control-plane nodes
  • etcd runs on the same node, or on separate hosts that you can still manage

If you’re on managed Kubernetes: control-plane certificate renewal is generally handled by the provider, so only the “kubeconfig / client cert” parts of this runbook apply.

Step 0 — Change management (even in incident mode)

Two panic-driven mistakes are very expensive in this incident:

  • Running ad hoc, unrecorded commands (“I tried a few things”) that end up different on each node, leaving the cluster inconsistent
  • Renewing certificates on multiple nodes simultaneously (amplifies split-brain risk)

So:

  • A single Incident Commander + a single operator
  • Every command run gets a note in the decision log
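One low-effort way to keep the decision log honest is to run commands through a wrapper that records them. The sketch below assumes a hypothetical logrun function and a DECISION_LOG path of your choosing; neither comes from kubeadm.

```shell
#!/usr/bin/env bash
# Hypothetical helper: record each command in a decision log, then run it.
# DECISION_LOG is an assumed path; point it wherever your incident notes live.
DECISION_LOG="${DECISION_LOG:-./incident-decision-log.txt}"

logrun() {
  # UTC timestamp plus operator and host, so entries from different nodes can be ordered.
  printf '%s %s@%s $ %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$(id -un)" "$(uname -n)" "$*" >> "$DECISION_LOG"
  "$@"
  local rc=$?
  # Record the exit code so the log shows what actually happened, not just what was tried.
  printf '%s exit=%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$rc" >> "$DECISION_LOG"
  return "$rc"
}
```

Usage would look like logrun sudo kubeadm certs check-expiration. The wrapper only logs the command line and exit code; the reasoning behind each decision still has to be written down by the operator.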

Step 1 — Check the certificate state (kubeadm)

On the control-plane node:

sudo kubeadm certs check-expiration

If the EXPIRES field in the output is in the past, that’s most likely your problem. Even if only certain certificates have expired, do the renewal in a controlled way rather than “piece by piece.”
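Part of doing the renewal in a controlled way is being able to get back to the previous state. A minimal sketch, assuming you want a timestamped archive of /etc/kubernetes before changing anything; snapshot_dir and the backup location are hypothetical names, not part of kubeadm.

```shell
#!/usr/bin/env bash
# Hypothetical helper: archive a config directory before changing it.
# Usage: snapshot_dir <source-dir> <backup-dir>; prints the archive path.
snapshot_dir() {
  local src="$1" dest="$2" archive
  [ -d "$src" ] || { echo "snapshot_dir: $src is not a directory" >&2; return 1; }
  mkdir -p "$dest" || return 1
  archive="$dest/$(basename "$src")-$(date -u +%Y%m%dT%H%M%SZ).tar.gz"
  # -C keeps paths inside the archive relative to the parent of the source.
  tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
  # Private keys live under pki/, so keep the archive readable by the owner only.
  chmod 600 "$archive"
  echo "$archive"
}
```

From a root shell on the control-plane node this could be called as snapshot_dir /etc/kubernetes /root/k8s-cert-backup. The archive contains private keys, so treat it like the keys themselves.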

Step 2 — Renew the certificates

The most practical path on kubeadm:

sudo kubeadm certs renew all

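This renews the certificates under /etc/kubernetes/pki as well as the client certificates embedded in the kubeconfig files that kubeadm manages in /etc/kubernetes (admin.conf, controller-manager.conf, scheduler.conf). It does not touch copies of admin.conf made earlier, such as $HOME/.kube/config, so still validate the kubeconfig at the final step. On a multi-node control plane, run the command on each control-plane node, one node at a time.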

Step 3 — Restart the control-plane components safely

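In most kubeadm setups, control-plane components run as static pods. The renewal changes certificate files, not the manifests, so kubelet sees no manifest change to react to, and the running containers can keep the old certificates loaded until they are restarted. In practice, restarting kubelet is a useful first step; if the components still serve the old certificates afterwards, the static pods themselves need to be recreated: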

sudo systemctl restart kubelet

Then:

sudo crictl ps | grep -E "kube-apiserver|kube-controller-manager|kube-scheduler"
sudo crictl logs --tail 40 "$(sudo crictl ps -q --name kube-apiserver)" 2>&1

The goal: confirm the API server has come back healthy.
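If the API server still presents the old certificate after the kubelet restart, the approach described in the kubeadm documentation is to move the static pod manifest out of the manifests directory for a short while and then put it back. The sketch below wraps that in a hypothetical bounce_static_pod function. The 20-second default mirrors kubelet's default file check interval as I understand it, and the holding directory is an assumption; it must sit outside the manifests directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: force kubelet to recreate one static pod by moving its
# manifest out of the manifests directory and back.
# Usage: bounce_static_pod <manifest-dir> <component> <holding-dir> [wait-seconds]
bounce_static_pod() {
  local dir="$1" name="$2" hold="$3" wait_s="${4:-20}"
  local manifest="$dir/$name.yaml"
  [ -f "$manifest" ] || { echo "bounce_static_pod: $manifest not found" >&2; return 1; }
  mkdir -p "$hold" || return 1
  mv "$manifest" "$hold/" || return 1
  # Give kubelet time to notice the manifest is gone and stop the pod.
  sleep "$wait_s"
  # Restoring the manifest makes kubelet start a fresh pod that loads the new certificates.
  mv "$hold/$name.yaml" "$manifest"
}
```

From a root shell: bounce_static_pod /etc/kubernetes/manifests kube-apiserver /root/manifest-hold. Do one component at a time on one node at a time, and check crictl ps between each.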

Step 4 — Validate the kubeconfig (client side)

If you’re still seeing x509 from kubectl:

  • Check whether the kubeconfig you’re using is up to date
  • Re-fetch / copy the admin config

The typical path on a control-plane node:

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
kubectl get nodes
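Before overwriting anything, it can help to know whether the local copy actually differs from the admin config on the node. A minimal sketch with a hypothetical kubeconfig_state helper; it does a byte-for-byte comparison, so a kubeconfig you have customized (extra contexts, renamed clusters) will report "differs" even if its credentials are fine.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a local kubeconfig copy differs from a reference file.
# Usage: kubeconfig_state <local-copy> <reference>
kubeconfig_state() {
  local copy="$1" ref="$2"
  if [ ! -f "$copy" ]; then
    echo "missing"
  elif cmp -s "$copy" "$ref"; then
    echo "current"
  else
    echo "differs"
  fi
}
```

For example, kubeconfig_state "$HOME/.kube/config" /etc/kubernetes/admin.conf, run as a user that can read admin.conf. A result of "differs" right after a renewal is the usual sign that the copy predates the new certificates.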

Step 5 — Confirm etcd and the controllers have recovered

Don’t close out this incident just because “kubectl works.” Verify:

  • Are the nodes Ready?
  • Is anything in kube-system in crashloop?
  • Are controllers emitting new events?

kubectl get nodes -o wide
kubectl -n kube-system get pods
kubectl get events --sort-by=.lastTimestamp | tail -n 30
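Reading these outputs by eye is error-prone at 3 a.m., so a couple of filters can help. The helpers below are hypothetical names and only a sketch; they assume the default column layout of kubectl get with --no-headers (STATUS in column 2 for nodes, column 3 for pods).

```shell
#!/usr/bin/env bash
# Hypothetical helpers: read `kubectl ... --no-headers` output on stdin.

# Count nodes whose STATUS column (field 2) does not start with "Ready".
# "Ready,SchedulingDisabled" still counts as ready; "NotReady" does not.
count_not_ready_nodes() {
  awk '$2 !~ /^Ready/ { n++ } END { print n + 0 }'
}

# Print name and status of pods whose STATUS column (field 3) is neither Running nor Completed.
list_unhealthy_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}
```

Typical use: kubectl get nodes --no-headers | count_not_ready_nodes and kubectl -n kube-system get pods --no-headers | list_unhealthy_pods. A count of 0 and an empty pod list are necessary for closing the incident, not sufficient; still look at the events.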

Preventive controls (don’t let this incident repeat)

The most effective prevention is producing an alert before the certificate expires.

  • Collect the kubeadm certs check-expiration output via a daily job
  • Remaining validity < 30 days -> warning, < 7 days -> critical
  • Treat time synchronization (NTP) as a “critical service” for the control plane
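The thresholds above are simple enough to script. The sketch below reads the expiry date from the certificate files with openssl rather than parsing the kubeadm table, which is a swapped-in approach and not the one listed above. The function names are hypothetical, and cert_days_left assumes openssl and GNU date are present on the node.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for a daily expiry check, using the 30/7-day thresholds.

# Map remaining days to a severity label.
classify_days_left() {
  local days="$1"
  if [ "$days" -lt 7 ]; then
    echo "critical"
  elif [ "$days" -lt 30 ]; then
    echo "warning"
  else
    echo "ok"
  fi
}

# Whole days until a PEM certificate's notAfter date (zero or negative once expired).
# Assumes openssl and GNU date are available on the node.
cert_days_left() {
  local end end_s now_s
  end="$(openssl x509 -enddate -noout -in "$1")" || return 1
  end_s="$(date -d "${end#notAfter=}" +%s)" || return 1
  now_s="$(date +%s)"
  echo $(( (end_s - now_s) / 86400 ))
}
```

A daily job could loop over /etc/kubernetes/pki/*.crt, call classify_days_left "$(cert_days_left "$crt")" for each file, and forward anything that is not "ok" to your alerting. Note that this covers only the certificate files; the client certificates embedded in the kubeconfig files would need to be extracted separately.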

Final word

When a control-plane certificate expires, the goal isn’t to “find the kubeadm command”; it’s to run a controlled, single-operator, validation-driven recovery. With a good runbook plus early warning, this incident can be handled as a planned 15-30 minute maintenance window rather than an outage stretching into hours.

Mustafa Erbay

System Architecture · Network Specialist · Infrastructure, Security and Software

Since 2006 I have been working on system architecture, networking, server infrastructure, large-scale deployments, software, and system security. On this blog I share technical experience that has held up in the field.
