One of the most misunderstood concepts on the Kubernetes side is the Secret object. A Secret is not an encrypted vault; in most installations it’s just base64-encoded data. If that data sits in ETCD essentially in plaintext (other than the encoding wrapper), then a compromised control plane or a leaked ETCD snapshot becomes a much bigger problem than it needs to be.
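To make the "just base64" point concrete, here is a minimal sketch showing that the encoding can be reversed without any key. The secret value is made up for illustration.

```python
import base64

# Hypothetical secret value, used only for illustration.
plaintext = b"s3cr3t-password"

# What the `data` field of a Secret manifest carries: base64, not ciphertext.
encoded = base64.b64encode(plaintext)

# Anyone holding the encoded value can reverse it; no key is involved.
decoded = base64.b64decode(encoded)

print(encoded.decode())
print(decoded == plaintext)
```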
In this article I have two goals:
- Use Encryption at Rest to actually encrypt sensitive data sitting in ETCD
- Make that encryption operationally viable through KMS, instead of dropping a key file on disk
The problem: where do Secrets live, and who can read them?
The risk surface usually opens up through three channels:
- ETCD access (disk, snapshot, backup)
- Control-plane node compromise (encryption config plus access to certs/keys)
- Backup chain (S3 buckets, repositories, backup agents)
Encryption at rest doesn’t make the first two risks vanish; what it does is take the ETCD data itself out of the “directly readable” category.
Kubernetes Encryption at Rest: the logic, briefly
Kubernetes can encrypt certain resource types (such as secrets and configmaps) before writing them to ETCD. The core building blocks are:
- The EncryptionConfiguration file
- The API Server applying encryption on the write path based on that file
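As a rough sketch of what such a file can contain, the snippet below builds an EncryptionConfiguration for secrets with a KMS v2 provider first and the identity provider last. The provider name and socket path are hypothetical placeholders, and the timeout is an arbitrary assumption; the output is printed as JSON, which a YAML parser should also accept, though that is worth confirming on your own version.

```python
import json


def build_encryption_config(kms_name, kms_socket, timeout="3s"):
    """Return an EncryptionConfiguration as a plain dict."""
    return {
        "apiVersion": "apiserver.config.k8s.io/v1",
        "kind": "EncryptionConfiguration",
        "resources": [
            {
                "resources": ["secrets"],
                "providers": [
                    # First provider in the list: used for every new write.
                    {
                        "kms": {
                            "apiVersion": "v2",
                            "name": kms_name,
                            "endpoint": kms_socket,
                            "timeout": timeout,
                        }
                    },
                    # Last provider: keeps objects stored before
                    # encryption was enabled readable.
                    {"identity": {}},
                ],
            }
        ],
    }


# Hypothetical provider name and socket path, for illustration only.
config = build_encryption_config(
    "example-kms", "unix:///var/run/kms/example.sock"
)
print(json.dumps(config, indent=2))
```

The API Server is pointed at the resulting file with its `--encryption-provider-config` flag.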
A critical fact in this model:
- Encryption happens at write time.
- Objects that were already written can stay in the old format until a separate process re-writes them.
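One way to see which format an object is currently stored in is to look at the leading bytes of its raw ETCD value: values written through an encryption provider start with `k8s:enc:` followed by the provider type. The sketch below assumes you have already obtained the raw bytes (for example with `etcdctl get` on a `/registry/secrets/...` key); the sample values are invented.

```python
def storage_format(raw_value: bytes) -> str:
    """Classify a raw etcd value by its leading bytes."""
    prefix = b"k8s:enc:"
    if not raw_value.startswith(prefix):
        return "unencrypted"
    # After the prefix comes the provider type,
    # e.g. b"kms:v2:<name>:" or b"aescbc:v1:<key>:".
    provider = raw_value[len(prefix):].split(b":", 1)[0]
    return provider.decode("ascii", errors="replace")


# Invented sample values, for illustration only.
samples = {
    "old-secret": b"k8s\x00\n\x0c\n\x02v1\x12\x06Secret",
    "new-secret": b"k8s:enc:kms:v2:example-kms:\x0a\x10\x99",
}
for name, raw in samples.items():
    print(name, storage_format(raw))
```

An object reported as unencrypted here was written before encryption was enabled and has not been re-written since.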
Why KMS is non-negotiable
Static keys (an AES key sitting in a file) are quick in the short run but weak for enterprise operations:
- Key rotation is painful
- Access control and audit are weak
- “Who used the key, and when?” has no clean answer
The goal of bringing in KMS:
- Centrally manage the key lifecycle
- Audit key usage
- Bring rotation into a regular “planned maintenance” rhythm
Design: how do you make the KMS integration “operational”?
1) Highly available KMS endpoint
If KMS is unreachable, the API Server’s write path takes the hit. So:
- Plan for at least two endpoints (or an HA service)
- Measure timeout and retry behavior
- Tie KMS maintenance windows to the platform’s maintenance calendar
2) Decide on the failure mode up front
There are typically three approaches:
- Fail-closed: no KMS, no writes (high security, high operational risk)
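- Fail-open: write unencrypted when KMS is gone (easy operations, high risk). The API Server encrypts with the first provider in the list and rejects the write if that provider fails, so this mode only exists as a deliberate config change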
- Controlled degradation: fail-closed only on selected resources
The most realistic model for enterprise practice:
- Fail-closed for secrets (HA and a runbook are mandatory)
- Controlled tolerance for lower-risk resources
3) Key rotation: “planned and measured”
Rotation goals:
- New writes are encrypted with the new key
- Older data is safely re-encrypted over time
Operational suggestions:
- Before rotation: check the trend of API latency and error budget
- Canary: change the key order on a single cluster or segment first
- Spread: gradual restart, controlled rollout
- Post-rotation: handle re-encryption “in installments”
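Re-encryption "in installments" can be sketched as splitting namespaces into small batches and re-writing the Secrets of one batch at a time. The snippet below only prints a plan; the namespace names and batch size are hypothetical, and the printed command is the usual read-and-replace pattern that forces each Secret through the write path again.

```python
from typing import Iterator, List


def batches(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def rewrite_command(namespace: str) -> str:
    """Command that re-writes every Secret in one namespace."""
    return (
        f"kubectl get secrets --namespace {namespace} -o json"
        " | kubectl replace -f -"
    )


# Hypothetical namespaces and batch size, for illustration only.
namespaces = ["team-a", "team-b", "team-c", "team-d", "team-e"]
plan = [
    [rewrite_command(ns) for ns in batch]
    for batch in batches(namespaces, 2)
]

for number, commands in enumerate(plan, start=1):
    print(f"installment {number}:")
    for command in commands:
        print("  " + command)
```

Running one installment, then checking API latency and KMS error ratio before starting the next, keeps the extra write load bounded.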
Minimum viable runbook: when KMS misbehaves
Symptoms:
- API Server 5xx / write timeouts
- secrets create/update problems
Triage questions:
- Is the KMS endpoint reachable? (network, DNS, mTLS)
- Has KMS latency spiked? (throttling, quota)
- Is there a KMS plugin error in the API Server logs?
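The first triage question can be answered quickly with a plain TCP connect that is timed. This is a minimal sketch: it covers name resolution and network reachability only, not mTLS or the health of the KMS plugin itself, and the host name in the usage note is a hypothetical placeholder.

```python
import socket
import time
from typing import Optional, Tuple


def probe_tcp(
    host: str, port: int, timeout: float = 2.0
) -> Tuple[bool, Optional[float]]:
    """Try a TCP connect; return (reachable, seconds taken or None)."""
    started = time.monotonic()
    try:
        # DNS failures, refusals and timeouts all surface as OSError.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - started
    except OSError:
        return False, None
```

For example, `probe_tcp("kms.example.internal", 443)` returns `(False, None)` when the name does not resolve or the port does not answer within the timeout, and `(True, <seconds>)` otherwise.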
Initial response:
- Bring the KMS endpoint back to health (the cleanest answer)
- If the incident is escalating and the risk is acceptable: follow the pre-written break-glass plan to temporarily simplify the encryption config
Observation: don’t go blind just because there’s encryption
Signals to watch for this design to work properly:
- API server latency (especially the write path)
- KMS request rate, latency, and error ratio
- Checks along the ETCD backup/snapshot pipeline that answer “is the data actually encrypted?”
- Key rotation dates and access audit records
Conclusion
Encryption at rest in Kubernetes is not a “compliance checkbox”; it’s a serious architectural decision that reduces control-plane risk. But it’s not enough by itself: without KMS availability, a rotation rhythm, failover, and a break-glass runbook, that security gain turns into operational fragility. Done properly, the design makes security part of the platform’s behavior, with no surprises during an incident.