When teams talk about Kubernetes security, most of them focus on RBAC, network policy, and image signing. Those are the right places to start. But if you can’t give a clean answer to “who did what, and when?” you’re flying blind during incident response. One of the main signal sources that closes that blind spot is the API Server audit log.
In this post my goal isn’t just “we turned audit logging on.” It’s to design an operable audit log pipeline:
- Keeping noise under control (cost + signal quality)
- Masking sensitive data (response body / secret leakage)
- Building correlation in the SIEM (IDP + cluster + node)
- Sharpening the operational runbook (validation, rollback, retention)
1) What do we actually want from the audit log?
The biggest value of the audit log is being able to tie an action that the security team flags as suspicious back to evidence:
- A spike in `create tokenreviews` / `create subjectaccessreviews`
- Privilege escalation moves like `create/patch clusterrolebinding`
- “Interactive” access patterns like `exec`/`portforward`
- Reads of sensitive objects like `get secrets`
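The four patterns above can be sketched as a small tagging function over a parsed audit event. This is a minimal illustration, not a detection product: the function name `classify` and the category labels are hypothetical, and it assumes events follow the `audit.k8s.io/v1` shape, where `exec` and `portforward` appear as `objectRef.subresource` on the `pods` resource.

```python
from typing import Optional

# Hypothetical category labels; rename them to match your SIEM taxonomy.
AUTH_REVIEW_RESOURCES = {"tokenreviews", "subjectaccessreviews"}
INTERACTIVE_SUBRESOURCES = {"exec", "portforward"}


def classify(event: dict) -> Optional[str]:
    """Return a coarse category for an audit event, or None if it is not of interest."""
    ref = event.get("objectRef") or {}
    verb = event.get("verb", "")
    resource = ref.get("resource", "")
    subresource = ref.get("subresource", "")

    if verb == "create" and resource in AUTH_REVIEW_RESOURCES:
        return "auth-review"
    if resource == "clusterrolebindings" and verb in {"create", "patch"}:
        return "privilege-escalation"
    # The verb is not checked here: depending on the client, exec can show up
    # as either "create" or "get".
    if resource == "pods" and subresource in INTERACTIVE_SUBRESOURCES:
        return "interactive-access"
    if resource == "secrets" and verb == "get":
        return "sensitive-read"
    return None
```

Tagging events this way early in the pipeline keeps the later SIEM queries short, because they can filter on one field instead of repeating the verb/resource logic.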
2) Audit policy design: the “log everything” trap
There are four basic levels in Kubernetes audit:
- `None`: no logging
- `Metadata`: who/what/where (no body)
- `Request`: includes the request body
- `RequestResponse`: includes both request and response body
A practical production approach generally looks like this:
- Default: `Metadata`
- Very chatty endpoints: `None`
- A handful of very critical actions: `Request` (rarely)
- `RequestResponse`: highly exceptional (in most organizations it’s an unnecessary risk)
Sample audit policy (starting point)
This file is a “core” example of a policy; expand it to fit your environment:
apiVersion: audit.k8s.io/v1
kind: Policy
omitStages:
  - "RequestReceived"
# Rules are evaluated in order; the first matching rule decides the level.
rules:
  # 1) Cut the noise: healthz/readyz/livez
  - level: None
    nonResourceURLs:
      - "/healthz*"
      - "/readyz*"
      - "/livez*"
      - "/version"
  # 2) System components: default metadata
  - level: Metadata
    userGroups:
      - "system:authenticated"
  # 3) Keep secret access visible: metadata (no body)
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets"]
  # 4) Permission changes: metadata + tighter monitoring
  - level: Metadata
    resources:
      - group: "rbac.authorization.k8s.io"
        resources: ["roles", "rolebindings", "clusterroles", "clusterrolebindings"]
  # 5) Interactive actions: metadata
  - level: Metadata
    resources:
      - group: ""
        resources: ["pods/exec", "pods/portforward", "pods/attach"]
  # 6) Fallback: metadata
  - level: Metadata
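Because the first matching rule decides the level, rule order matters more than it looks. The sketch below is a deliberately simplified model of that first-match behavior, not the API server's actual implementation: `first_match`, `rule_matches`, and the request dict shape are hypothetical, and features such as verbs, namespaces, and empty resource lists are left out. It shows that, in the sample policy, rule 2 already catches every authenticated request, so rules 3 to 5 only start to matter once you give them a different level or move them above it.

```python
from typing import Optional, Tuple

# The sample policy expressed as plain dicts (rule numbers in trailing comments).
RULES = [
    {"level": "None", "nonResourceURLs": ["/healthz*", "/readyz*", "/livez*", "/version"]},  # 1
    {"level": "Metadata", "userGroups": ["system:authenticated"]},  # 2
    {"level": "Metadata", "resources": [{"group": "", "resources": ["secrets"]}]},  # 3
    {"level": "Metadata", "resources": [{"group": "rbac.authorization.k8s.io",
        "resources": ["roles", "rolebindings", "clusterroles", "clusterrolebindings"]}]},  # 4
    {"level": "Metadata", "resources": [{"group": "",
        "resources": ["pods/exec", "pods/portforward", "pods/attach"]}]},  # 5
    {"level": "Metadata"},  # 6
]


def url_matches(pattern: str, path: str) -> bool:
    """A trailing '*' is treated as a prefix match; anything else must match exactly."""
    if pattern.endswith("*"):
        return path.startswith(pattern[:-1])
    return path == pattern


def rule_matches(rule: dict, req: dict) -> bool:
    groups = rule.get("userGroups")
    if groups and not set(groups) & set(req.get("groups", [])):
        return False
    urls = rule.get("nonResourceURLs")
    if urls is not None:
        path = req.get("path")
        return path is not None and any(url_matches(p, path) for p in urls)
    resources = rule.get("resources")
    if resources is not None:
        return any(
            gr.get("group", "") == req.get("group", "")
            and req.get("resource") in gr.get("resources", [])
            for gr in resources
        )
    return True  # no resource or URL constraint: the rule matches everything left


def first_match(rules: list, req: dict) -> Tuple[Optional[int], str]:
    """Return (1-based rule number, level) of the first matching rule."""
    for number, rule in enumerate(rules, start=1):
        if rule_matches(rule, req):
            return number, rule["level"]
    return None, "None"  # nothing matched: the event is not logged
```

Running a few representative requests through a model like this before touching the real policy is a cheap way to spot rules that can never fire.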
3) The pipeline: API Server → collector → SIEM
The most stable approach is to write the audit log to the node disk, pick it up with an agent, normalize it, and ship it.
A simple architecture:
- API Server: audit file output (rotation enabled)
- Node: log collector (Vector/Fluent Bit/Alloy) tailing the file
- Pipeline: parse + normalize + redaction
- SIEM: index + correlation + alerting
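The redaction step in the pipeline can be sketched as a defensive filter that runs even when the policy is supposed to log secrets at `Metadata` only. This is a minimal sketch under assumptions: `redact`, `SENSITIVE_RESOURCES`, and the `{"redacted": True}` marker are hypothetical names, and a real collector (Vector, Fluent Bit, Alloy) would express the same idea in its own transform language.

```python
import copy

# Resources whose request/response bodies must never reach the SIEM (assumed list).
SENSITIVE_RESOURCES = {"secrets"}
BODY_FIELDS = ("requestObject", "responseObject")


def redact(event: dict) -> dict:
    """Return a copy of the event with bodies replaced for sensitive resources."""
    ref = event.get("objectRef") or {}
    if ref.get("resource") not in SENSITIVE_RESOURCES:
        return event
    cleaned = copy.deepcopy(event)  # leave the caller's event untouched
    for field in BODY_FIELDS:
        if field in cleaned:
            # Replace the whole body: this also covers list responses with many items.
            cleaned[field] = {"redacted": True}
    return cleaned
```

Replacing the whole body rather than individual keys is the conservative choice: it cannot miss a field, at the price of losing non-sensitive metadata inside the body.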
Normalization: the fields a SIEM likes
Standardizing the following fields will pay off a lot when it comes to search and correlation in the SIEM:
- `cluster`: a stable identifier such as `prod-eu-1`
- `user.username`, `user.groups`
- `sourceIPs` (the real client IP if the LB/ingress carries it)
- `verb`, `objectRef.resource`, `objectRef.namespace`, `objectRef.name`
- `responseStatus.code`
- `requestURI`
- `userAgent`
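As an illustration of the normalization step, the sketch below flattens one JSON audit line into exactly the fields listed above. The function name `normalize` and the flat dotted key names are assumptions for this example; the `cluster` value is not part of the audit event and has to be injected by the collector.

```python
import json


def normalize(raw_line: str, cluster: str) -> dict:
    """Flatten one audit log line (JSON) into SIEM-friendly fields."""
    event = json.loads(raw_line)
    user = event.get("user") or {}
    ref = event.get("objectRef") or {}      # absent for non-resource requests
    status = event.get("responseStatus") or {}
    return {
        "cluster": cluster,                  # injected, e.g. "prod-eu-1"
        "user.username": user.get("username"),
        "user.groups": user.get("groups", []),
        "sourceIPs": event.get("sourceIPs", []),
        "verb": event.get("verb"),
        "objectRef.resource": ref.get("resource"),
        "objectRef.namespace": ref.get("namespace"),
        "objectRef.name": ref.get("name"),
        "responseStatus.code": status.get("code"),
        "requestURI": event.get("requestURI"),
        "userAgent": event.get("userAgent"),
    }
```

Missing fields come back as `None` or an empty list rather than raising, so one malformed or non-resource event does not stall the pipeline.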
4) Retention and cost: don’t let the logs burn your budget
Audit log cost grows quickly. So:
- Drop noisy non-resource URLs to `None`
- Omit the `RequestReceived` stage (most of the time it’s not required for correlation)
- Treat only the critical rules at alert level and keep the rest “searchable”
- Tier retention: hot (7-14 days), cold (30-90 days), archive (per compliance need)
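The tiering rule can be written down as a small age-based function. This is only a sketch: it assumes the upper bounds of the ranges above (14 days hot, 90 days cold) as cut-offs, and the names `retention_tier`, `HOT_DAYS`, and `COLD_DAYS` are hypothetical. Your compliance requirements decide the real values.

```python
from datetime import datetime, timedelta, timezone

HOT_DAYS = 14    # assumed: upper end of the 7-14 day hot range
COLD_DAYS = 90   # assumed: upper end of the 30-90 day cold range


def retention_tier(event_time: datetime, now: datetime) -> str:
    """Map an event's age to a storage tier: hot, cold, or archive."""
    age = now - event_time
    if age <= timedelta(days=HOT_DAYS):
        return "hot"
    if age <= timedelta(days=COLD_DAYS):
        return "cold"
    return "archive"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(retention_tier(now - timedelta(days=3), now))
```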
5) Alert ideas (things that work in the field)
Practical signals to start with:
- `clusterrolebinding` create/patch (especially `cluster-admin` bindings)
- A jump in `secrets` list/get (per-user anomaly)
- Repeated `pods/exec` (especially in production namespaces)
- A flood of `tokenreviews` (suspicious auth probing)
- Actions performed by a “break-glass” user (expected, but should be visible)
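To make the per-user secrets signal concrete, here is a minimal counting sketch. It assumes the caller passes in the events of a single time window (for example, the last five minutes) and that a fixed threshold is good enough to start with; `secret_read_spikes` and the threshold value are hypothetical, and a real SIEM rule would more likely compare against each user's baseline.

```python
from collections import Counter
from typing import Dict, Iterable


def secret_read_spikes(events: Iterable[dict], threshold: int) -> Dict[str, int]:
    """Count secrets get/list calls per user and return users at or above the threshold."""
    counts: Counter = Counter()
    for event in events:
        ref = event.get("objectRef") or {}
        if ref.get("resource") != "secrets":
            continue
        if event.get("verb") not in {"get", "list"}:
            continue
        username = (event.get("user") or {}).get("username", "unknown")
        counts[username] += 1
    return {user: n for user, n in counts.items() if n >= threshold}
```

The same counting pattern carries over to the other signals by swapping the filter, for example `objectRef.subresource == "exec"` for repeated `pods/exec`.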
6) Validation and runbook
A mini runbook for rolling changes out safely:
- Activate the policy on a canary control-plane node (where possible)
- Watch log volume and SIEM ingestion for 15-30 minutes
- Confirm there are no secret body leaks (sample search: `\"kind\":\"Secret\"` + `data:`)
- Identify the noisy points and set them to `None`/`Metadata` in the policy
- Have a rollback path ready: revert the policy file + apiserver restart procedure
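The leak check in the runbook can also be run directly against a sample of the audit file instead of relying on a text search. The sketch below assumes one JSON event per line and flags any event whose request or response body is a `Secret` (or `SecretList`) that still carries `data` or `stringData`; the function names are hypothetical.

```python
import json
from typing import Iterable, List


def _has_secret_data(body: object) -> bool:
    """True if a request/response body is a Secret or SecretList that carries payload."""
    if not isinstance(body, dict):
        return False
    if body.get("kind") == "Secret":
        return "data" in body or "stringData" in body
    if body.get("kind") == "SecretList":
        return any(
            isinstance(item, dict) and ("data" in item or "stringData" in item)
            for item in body.get("items") or []
        )
    return False


def find_secret_body_leaks(lines: Iterable[str]) -> List[str]:
    """Return the auditID of every event that contains secret payload in a body."""
    leaks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial lines, e.g. around log rotation
        if not isinstance(event, dict):
            continue
        if any(_has_secret_data(event.get(f)) for f in ("requestObject", "responseObject")):
            leaks.append(event.get("auditID", "<no auditID>"))
    return leaks
```

An empty result on the canary node during the observation window is the signal to continue the rollout; any hit means the policy still logs a body somewhere and should be reverted first.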
Designed well, the audit log is a “signal generation pipeline” for the security team and an “operational memory” for the platform team. Designed badly, it just produces cost and risk. So build policy + pipeline + runbook together.