Infrastructure security is usually discussed in terms of “network perimeter” and “OS hardening”. Yet for certain classes of attack the real problem sits much deeper: the boot chain. If the firmware/bootloader/kernel layer can be tampered with, every control above it is running on a foundation you cannot trust.
In this piece I look at two concepts from an operational, in-the-field perspective:
- Secure Boot: “only signed components are allowed to boot”
- TPM + Measured Boot: “measure what you booted and prove it”
What does Secure Boot actually solve, and what does it not?
When Secure Boot is configured correctly:
- Bootloader / kernel / driver components do not run unless they carry a signature that chains to a trusted key (db) and is not revoked (dbx)
- The bar for “bootkit”-class persistence is raised significantly
What it does not solve:
- Anything an actor with elevated privileges does on the running OS
- Signed but malicious or compromised components
For these reasons Secure Boot has to be considered together with TPM-based measurement.
TPM and Measured Boot: the “evidence”
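During boot, each stage hashes (measures) the next component before handing over control, and that hash is extended into TPM registers called PCRs (Platform Configuration Registers). A PCR cannot be set to an arbitrary value; it can only be extended, so its final value depends on every measurement and on their order. In short: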
- Every boot leaves a measurement trail
- That trail can be compared against the “expected” values
Which means you can technically answer this question:
“Did this server actually boot with the boot chain I expected?”
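To make the “measurement trail” concrete, here is a minimal sketch of the extend operation and of replaying a list of measurements to get the expected PCR value. It assumes a SHA-256 PCR bank that starts at all zeros after reset; the component byte strings are hypothetical stand-ins for real boot images, and a real verifier would read the digests from the firmware event log rather than compute them itself.

```python
import hashlib

PCR_SIZE = 32  # bytes, assuming the SHA-256 PCR bank


def extend(pcr: bytes, measurement: bytes) -> bytes:
    """Extend: new PCR value = SHA-256(old PCR value || measurement digest)."""
    return hashlib.sha256(pcr + measurement).digest()


def replay(event_digests):
    """Replay measurements in boot order to compute the resulting PCR value."""
    pcr = bytes(PCR_SIZE)  # assumed initial state: all zeros
    for digest in event_digests:
        pcr = extend(pcr, digest)
    return pcr


# Hypothetical stand-ins for the real shim, grub and kernel images.
chain = [b"shim image", b"grub image", b"kernel image"]
digests = [hashlib.sha256(c).digest() for c in chain]
expected = replay(digests)

# Swapping a single component changes the final value.
tampered_chain = [b"shim image", b"modified grub", b"kernel image"]
tampered = [hashlib.sha256(c).digest() for c in tampered_chain]
print(replay(tampered) == expected)  # False
```

Because each value folds in the previous one, an attacker who changes one component cannot “fix up” the register afterwards without finding a hash collision. That property is what makes the comparison against expected values meaningful.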
Operational model: how is this managed across a fleet?
Turning Secure Boot on for a single machine is easy; the hard part is fleet management. The model that has worked for me in the field:
- Golden boot profile (the reference)
- Attestation policy (acceptance criteria)
- Rollout ring (canary → expansion)
- Break-glass (recovery without bricking)
1) Golden boot profile
Define a “reference”:
- Firmware/UEFI version
- Secure Boot key set (PK/KEK/db/dbx)
- Bootloader (shim/grub) version
- Kernel + initramfs + signing process
Whenever this profile changes, that change is a release.
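One way to enforce “a change is a release” is to keep the profile as data and derive a fingerprint from it. The sketch below is an illustration only: the field names, version strings and the `fingerprint` helper are my assumptions, not a standard format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class GoldenBootProfile:
    # All field names and values here are hypothetical examples.
    firmware_version: str
    key_set_fingerprint: str  # digest over the PK/KEK/db/dbx contents
    shim_version: str
    grub_version: str
    kernel_version: str
    initramfs_sha256: str

    def fingerprint(self) -> str:
        """Stable digest over all fields; any field change yields a new value."""
        canonical = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(canonical).hexdigest()


current = GoldenBootProfile(
    firmware_version="1.2.3",
    key_set_fingerprint="example-keyset-digest",
    shim_version="15.7",
    grub_version="2.06",
    kernel_version="6.1.0-example",
    initramfs_sha256="example-initramfs-digest",
)

# A kernel bump produces a different fingerprint, i.e. a new release.
candidate = replace(current, kernel_version="6.1.1-example")
print(candidate.fingerprint() != current.fingerprint())  # True
```

Storing the fingerprint next to the release tag gives the attestation side a single identifier to ask “which profile was this host supposed to boot?”.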
2) Attestation policy
The point of attestation is not to declare “everything is perfect”; it is to do risk classification.
Sample policy levels:
- Green: PCR set matches expectation, host is accepted in prod
- Yellow: measurements differ only by a known version delta (a planned update in progress); accepted in prod, but a ticket is opened
- Red: unexpected measurement, do not accept in prod / quarantine the host
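The three levels above can be sketched as a small classification function. This is a simplified illustration: the PCR indices and digest strings are hypothetical placeholders, and I am assuming that “planned update” means a PCR set that has been pre-computed and approved for a pending release.

```python
from enum import Enum


class Verdict(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"


def classify(measured, golden, planned_updates):
    """Classify a host's PCR set (PCR index -> hex digest).

    golden: the PCR set of the current golden boot profile.
    planned_updates: PCR sets approved for pending releases (assumption).
    """
    if measured == golden:
        return Verdict.GREEN
    if any(measured == planned for planned in planned_updates):
        return Verdict.YELLOW
    return Verdict.RED


# Hypothetical placeholder digests.
golden = {0: "aa11", 4: "bb22", 7: "cc33"}
planned = [{0: "aa11", 4: "bb99", 7: "cc33"}]  # e.g. a bootloader update

print(classify(golden, golden, planned).value)      # green
print(classify(planned[0], golden, planned).value)  # yellow
print(classify({0: "aa11", 4: "ffff", 7: "cc33"}, golden, planned).value)  # red
```

The important design choice is that Yellow is an explicit allow-list, not a fuzzy “close enough” match: anything that is neither the reference nor a pre-approved delta is Red.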
3) Rollout ring
Secure Boot/TPM rollout has to be treated like a deployment:
- Canary: a low-criticality host set
- Pilot: a selected service group
- General rollout: the whole fleet
At every stage these metrics matter:
- Boot success rate
- Average reboot duration
- Recovery duration
- Red/Yellow ratio
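As a rough sketch, these metrics can be computed per ring and used as a promotion gate. The record fields and the thresholds (99% boot success, zero Red hosts) are assumptions for illustration; pick values that match your own risk tolerance.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class HostResult:
    # Hypothetical per-host record collected after the ring's reboot wave.
    booted: bool
    reboot_seconds: Optional[float]    # None if the host never came back
    recovery_seconds: Optional[float]  # None if no recovery was needed
    verdict: str                       # "green", "yellow" or "red"


def ring_metrics(results):
    """Aggregate the four ring metrics from a non-empty list of results."""
    if not results:
        raise ValueError("no results for this ring")
    total = len(results)
    reboots = [r.reboot_seconds for r in results if r.reboot_seconds is not None]
    recoveries = [r.recovery_seconds for r in results if r.recovery_seconds is not None]
    return {
        "boot_success_rate": sum(r.booted for r in results) / total,
        "avg_reboot_seconds": mean(reboots) if reboots else None,
        "avg_recovery_seconds": mean(recoveries) if recoveries else None,
        "red_ratio": sum(r.verdict == "red" for r in results) / total,
        "yellow_ratio": sum(r.verdict == "yellow" for r in results) / total,
    }


def may_promote(metrics, min_success=0.99, max_red=0.0):
    """Gate for moving to the next ring; thresholds are example values."""
    return (metrics["boot_success_rate"] >= min_success
            and metrics["red_ratio"] <= max_red)


canary = [
    HostResult(True, 60.0, None, "green"),
    HostResult(True, 80.0, None, "green"),
    HostResult(True, 100.0, None, "yellow"),
    HostResult(False, None, 900.0, "red"),
]
m = ring_metrics(canary)
print(m["boot_success_rate"], may_promote(m))  # 0.75 False
```

One failed host out of four is enough to block promotion here, which is the intended behaviour for a canary ring: the ring exists to make that failure cheap.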
4) Break-glass: the lockout scenario
When Secure Boot is wired up wrong, the worst day looks like this: “We pushed an update and the system won’t boot.”
A break-glass plan needs:
- Physical/remote console access (OOB management)
- Recovery media (signed)
- A key rollback procedure
- A way to revert dbx updates (within safe limits)
Do not enable Secure Boot in production until that plan is written and has been rehearsed at least once.
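If you want the rule to be more than a slogan, it can be encoded as a preflight check in the rollout tooling. The sketch below is deliberately trivial; the item names are my own labels for the four requirements listed above, and how each flag gets verified (runbook link, drill record, and so on) is left open.

```python
# Hypothetical labels for the four break-glass requirements.
REQUIRED_ITEMS = (
    "oob_console_access",
    "signed_recovery_media",
    "key_rollback_procedure",
    "dbx_revert_procedure",
)


def missing_items(plan):
    """Return the required items that the plan does not mark as ready."""
    return [item for item in REQUIRED_ITEMS if not plan.get(item, False)]


def may_enable_secure_boot(plan):
    """Allow enabling only when every break-glass item is in place."""
    return not missing_items(plan)


plan = {
    "oob_console_access": True,
    "signed_recovery_media": True,
    "key_rollback_procedure": False,
}
print(missing_items(plan))
# ['key_rollback_procedure', 'dbx_revert_procedure']
print(may_enable_secure_boot(plan))  # False
```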
The 5 mistakes I see most often
- Not documenting key management (who, where, with what process?)
- Going forward with firmware updates without a ring strategy
- Treating attestation as binary (all or nothing)
- Leaving OOB management weak (no recovery path)
- Enabling Secure Boot but leaving TPM as decoration (no measurement/policy)
Closing: a root of trust is also a leadership topic
Secure Boot + TPM is just as much an organizational effort as a technical one: enabling it without change management, ring rollout, runbooks and a break-glass plan is risky. Done right, however, it gives the infrastructure something genuinely valuable: provable state.