My Cleanup Script Killed the GitHub Runner: A Self-Inflicted Incident

When I woke up, 16 crons had failed back-to-back

My disk-cleanup timer ran last night at 03:30, right on schedule. When I checked in the morning, the GitHub Actions panel was bright red: 16 crons had failed, and not one of them had produced fresh content. The pipeline-health monitor had sent a single “DEGRADED” email.

I opened the run logs. Every run died in exactly the same place: the Checkout, Verify Node, and Install dependencies steps succeeded, then the workflow died before the next step even started. Not normal at all.

GitHub’s UI showed no detail: the runner-side error never made it into the job log, which just said “Job failed”. I SSH’d into the VPS:

$ ssh vps 'sudo journalctl -u "actions.runner.*" --since "8 hours ago" | grep -iE "error|fail|missing"' | head -10
Missing file: /home/github-runner/runner-mustafaerbay/_work/_temp/_runner_file_commands/set_output_xyz123
Missing file: /home/github-runner/runner-mustafaerbay/_work/_temp/_runner_file_commands/set_output_abc456
Missing file: ...

Missing file: set_output_*. These are the files the GitHub Actions runner uses to pass state between steps.

GitHub Actions steps pass outputs through a file whose path the runner puts in the $GITHUB_OUTPUT environment variable:

echo "new_slugs=$NEW_SLUGS" >> "$GITHUB_OUTPUT"

This file lives under _work/_temp/_runner_file_commands/. The runner creates a unique file for each step. The step appends name=value lines to it, and once the step finishes the runner reads the file and exposes the values to later steps as ${{ steps.<id>.outputs.<name> }}.

If the file isn’t there when the runner goes looking for it, the runner has nothing to fall back on and the job fails. That’s what broke the workflows.
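To make the handoff concrete, here is a minimal simulation of it. It is a sketch under stated assumptions: a scratch directory stands in for _runner_file_commands, the file name set_output_demo is made up, and the read_output helper is hypothetical. It only mimics a name=value lookup and is not the runner's actual parser.

```shell
#!/usr/bin/env bash
# Simulation only: a scratch directory stands in for
# _work/_temp/_runner_file_commands; nothing here touches a real runner.
set -u

cmd_dir="$(mktemp -d)"
GITHUB_OUTPUT="$cmd_dir/set_output_demo"   # hypothetical file name
: > "$GITHUB_OUTPUT"                       # the file exists before the step runs

# What a step does: append name=value lines.
NEW_SLUGS="post-a post-b"
echo "new_slugs=$NEW_SLUGS" >> "$GITHUB_OUTPUT"

# Hypothetical reader: print the value stored under a name,
# and fail loudly if the file has disappeared.
read_output() {
  local file="$1" name="$2"
  if [ ! -f "$file" ]; then
    echo "Missing file: $file" >&2
    return 1
  fi
  sed -n "s/^${name}=//p" "$file"
}

value="$(read_output "$GITHUB_OUTPUT" new_slugs)"
echo "before cleanup: $value"

# Simulate a cleanup job removing the file between write and read.
rm -f "$GITHUB_OUTPUT"
if read_output "$GITHUB_OUTPUT" new_slugs 2>/dev/null; then
  status="ok"
else
  status="missing"
fi
echo "after cleanup: $status"
rm -rf "$cmd_dir"
```

The point of the simulation is the second read: once the file is gone there is no value to recover, which matches the “Missing file” lines in the journal.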

I tracked down the cause with a sinking feeling

The day before, I had written my disk-cleanup.sh script. It had this line in it:

find /home/github-runner -path '*/_work/_temp/*' -mtime +7 -delete

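That single line deletes everything under _work/_temp that was last modified more than 7 days ago, and it makes no distinction between files and directories. Note that -mtime tests modification time, not creation time: anything the runner created a while back and hasn’t rewritten since looks stale, even if the runner is still actively relying on it.

The -delete action tells find to remove files AND directories (it also implies -depth, so a directory’s contents are processed before the directory itself). It won’t delete non-empty directories, but an empty directory whose modification time is more than a week old gets treated as stale, whether or not the runner expects it to be there.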
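One way to see what an expression like this matches, without risking anything, is to swap -delete for -print in a throwaway directory. The sketch below assumes GNU find and GNU touch (for `touch -d`); the sandbox layout and file names are made up for illustration.

```shell
#!/usr/bin/env bash
# Sandbox demo, assuming GNU find and GNU touch (for `touch -d`).
set -u

sandbox="$(mktemp -d)"
temp="$sandbox/_work/_temp"
mkdir -p "$temp/_runner_file_commands" "$temp/old_empty_dir"

touch "$temp/_runner_file_commands/set_output_fresh"                # modified just now
touch -d '10 days ago' "$temp/_runner_file_commands/set_output_old" # old file
touch -d '10 days ago' "$temp/old_empty_dir"                        # old, empty directory

# Same expression as the dangerous line, with -print instead of -delete.
# -depth is spelled out because -delete implies it, so the visit order matches.
matches="$(find "$sandbox" -depth -path '*/_work/_temp/*' -mtime +7 -print)"
printf 'would delete:\n%s\n' "$matches"
rm -rf "$sandbox"
```

In this sandbox the old file and the old empty directory are both listed, while the fresh file and the recently modified _runner_file_commands directory are not. A dry run like this on the real path would have shown the blast radius before the timer ever fired.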

First thing I did was figure out the actual state. I checked over SSH:

$ ls -la /home/github-runner/runner-mustafaerbay/_work/_temp/_runner_file_commands/
total 8
drwxr-xr-x 2 github-runner github-runner 4096 May  3 03:30 .
drwxr-xr-x 5 github-runner github-runner 4096 May  3 03:30 ..

Empty. Just “.” and “..”. The directory itself is still there, but the files inside are gone. When the hourly cron arrives, the runner expects to find its file-command files in that directory, doesn’t, and the job fails.

A runner restart would actually have avoided the problem, because the runner re-initializes this state on startup. But at 03:30 it was idle, so nothing restarted it. When the next cron triggered, it looked for its state, couldn’t find it, and failed.

The fast recovery

I restarted the runner service:

$ ssh vps 'sudo systemctl restart actions.runner.<repo-slug>.<runner-name>.service'

After the restart, the runner rebuilt its state directory. The next cron passed cleanly.

Then I fixed disk-cleanup.sh. Now only files get deleted; directories stay untouched:

# Old (DANGEROUS)
find /home/github-runner -path '*/_work/_temp/*' -mtime +7 -delete

# New (safe — only known single-use file patterns)
find /home/github-runner -path '*/_work/_temp/*' -type f \
  \( -name 'set_output_*' -o -name 'set_env_*' -o -name 'add_path_*' -o -name '*.tmp' -o -name '*.log' \) \
  -mtime +7 -delete

Two key differences:

-type f — files only, not directories
A -name whitelist — only known single-use filenames

Runner state directories (like _runner_file_commands) are now off-limits. Old single-use set_output_* files get cleaned up (these are created once per step and are useless after they’re consumed).
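The whitelist variant can be checked the same way before it goes anywhere near the real runner. This is a sandbox sketch assuming GNU find and GNU touch; the names unknown_state and old_state_dir are invented stand-ins for "state I did not know about".

```shell
#!/usr/bin/env bash
# Sandbox check of the whitelist variant, assuming GNU find and GNU touch.
set -u

sandbox="$(mktemp -d)"
cmds="$sandbox/_work/_temp/_runner_file_commands"
mkdir -p "$cmds" "$sandbox/_work/_temp/old_state_dir"

touch -d '10 days ago' "$cmds/set_output_old"               # whitelisted and old
touch -d '10 days ago' "$cmds/unknown_state"                # old, but not on the list
touch "$cmds/set_output_fresh"                              # whitelisted, but recent
touch -d '10 days ago' "$sandbox/_work/_temp/old_state_dir" # old empty directory

# The "New" command from the script, pointed at the sandbox.
find "$sandbox" -path '*/_work/_temp/*' -type f \
  \( -name 'set_output_*' -o -name 'set_env_*' -o -name 'add_path_*' -o -name '*.tmp' -o -name '*.log' \) \
  -mtime +7 -delete

# List everything that survived below _work/_temp.
remaining="$(cd "$sandbox" && find . -mindepth 3 | sort)"
printf 'left in place:\n%s\n' "$remaining"
rm -rf "$sandbox"
```

Only set_output_old is removed here. The old directory and the unrecognized file stay, which is the behavior the two changes are meant to guarantee.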

The deeper lesson

The real reason I’m writing this isn’t to share the embarrassment of an incident I caused myself. The reason runs deeper:

“When you set up automation, changing its inputs without understanding them is more dangerous than the automation itself.”

When I wrote disk-cleanup.sh, I thought of _work/_temp as “temporary files”. The word temp literally means “temporary”. Anything older than 7 days is probably leftover. Sounds reasonable.

But it isn’t. _work/_temp is the runner’s active state storage. Despite the name temp, things like _runner_file_commands hold critical state: files there are created at the start of each step and consumed when the step ends. That state can still be in place after 7 days, because the runner may sit idle for long stretches.

The bottom line: before you set up automation, learn the contracts of the system you’re touching.

For disk-cleanup.sh, the new principles are:

Whitelist > blacklist (list patterns to delete, don’t say “everything old”)
Files > directories (a directory continuing to exist may be critical to state)
Bump the time window (7 days might be too short because the runner can idle for a long time; I bumped it to 14)
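The three principles can be folded into one small helper. What follows is a hypothetical sketch, not the actual disk-cleanup.sh: the function name cleanup_runner_temp, the PATTERNS array, and the dry-run default are all assumptions made for illustration. It assumes bash, GNU find, and GNU touch.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not the author's script:
# whitelist + files only + configurable age, dry run by default.
set -u

# Patterns considered safe to delete. Extend deliberately, never with '*'.
PATTERNS=('set_output_*' 'set_env_*' 'add_path_*' '*.tmp' '*.log')

cleanup_runner_temp() {
  local root="$1" days="${2:-14}" action="${3:--print}"   # -print means dry run
  local name_expr=() p
  # Build: -name P1 -o -name P2 -o ...
  for p in "${PATTERNS[@]}"; do
    [ "${#name_expr[@]}" -gt 0 ] && name_expr+=(-o)
    name_expr+=(-name "$p")
  done
  find "$root" -path '*/_work/_temp/*' -type f \
    \( "${name_expr[@]}" \) -mtime +"$days" "$action"
}

# Demo in a sandbox: one file is 10 days old, one is 20 days old.
sandbox="$(mktemp -d)"
cmds="$sandbox/_work/_temp/_runner_file_commands"
mkdir -p "$cmds"
touch -d '10 days ago' "$cmds/set_output_10d"
touch -d '20 days ago' "$cmds/set_output_20d"

dry_run="$(cleanup_runner_temp "$sandbox" 14)"   # lists candidates, deletes nothing
cleanup_runner_temp "$sandbox" 14 -delete        # actually deletes
survivors="$(ls "$cmds")"
echo "dry run listed: $dry_run"
echo "survivors: $survivors"
rm -rf "$sandbox"
```

Making the dry run the default means a forgotten argument prints a list instead of deleting anything, which seems like the safer failure mode for a script that runs unattended at 03:30.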

Wrap-up

This event was a calibration error for me. disk-cleanup.sh was a useful script, but a scope mistake by its owner broke that owner’s own system. I’m writing this openly because the cause of the 16-hour downtime is me: not GitHub, not AI, not a third-party bug.

I made a note to myself: when writing runbooks, ask three more questions:

What does this script delete? (Exactly. List it.)
Who owns the things it deletes? (A system service? The runner? Application data?)
Does that owner have a list of patterns it accepts being cleaned? (If not, I don’t have permission to delete.)
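The first two questions can be answered mechanically before anything is deleted. Here is a minimal audit sketch, assuming GNU find (for -printf); the audit_candidates function and the sandbox contents are hypothetical. The third question still needs a human to read the owner's documentation.

```shell
#!/usr/bin/env bash
# Audit sketch, assuming GNU find (-printf). Lists candidates instead of deleting them.
set -u

# Print owner, type (f = file, d = directory), last-modified date and path
# for everything the given find expression matches.
audit_candidates() {
  local root="$1"; shift
  find "$root" "$@" -printf '%u\t%y\t%TY-%Tm-%Td\t%p\n'
}

# Demo in a sandbox with one old file, one old empty directory, one fresh file.
sandbox="$(mktemp -d)"
temp="$sandbox/_work/_temp"
mkdir -p "$temp/_runner_file_commands" "$temp/old_dir"
touch -d '10 days ago' "$temp/_runner_file_commands/set_output_old" "$temp/old_dir"
touch "$temp/_runner_file_commands/set_output_fresh"

report="$(audit_candidates "$sandbox" -path '*/_work/_temp/*' -mtime +7)"
printf 'owner\ttype\tmodified\tpath\n%s\n' "$report"
rm -rf "$sandbox"
```

On a real host, a d in the type column or a service account such as github-runner in the owner column is the cue to stop and ask who depends on that path.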

The find ... -delete I wrote without asking any of these three questions cost 16 hours of outage. Asking takes 30 seconds. The trade-off is now extremely obvious.

Frequently Asked Questions

Common questions readers have about this article.

How can I configure my disk‑cleanup script so it never wipes the GitHub Actions runner’s temporary folders?

I solved this by narrowing what the script is allowed to touch inside the runner’s `_work/_temp` tree. `disk-cleanup.sh` now works from a whitelist: under `/home/github-runner/*/_work/_temp/*` it deletes only regular files (`-type f`) whose names match known single-use patterns such as `set_output_*`, and it never removes directories. I also switched the cleanup to run from a dedicated systemd timer as a non-root user, with `ProtectSystem=strict` and `ReadOnlyPaths=` limiting where the service can write. With both layers in place, a typo in the `find … -delete` command is far less likely to reach the runner’s state files.

What steps should I take when a workflow fails with “Missing file: set_output_*” errors?

When I saw that message I immediately SSH’d into the runner host and inspected the journal for the `actions.runner` service. The missing files live under `_work/_temp/_runner_file_commands/`, so the first clue is that something deleted files in that directory. I stopped the runner (`./svc.sh stop`), cleared the stale `_temp` folder, and then restarted the service. After that I re‑ran a single failed workflow to confirm the runner could recreate the files. If the error persists, I reinstall the runner to rule out a damaged installation.

Is it safe to schedule regular disk‑cleanup jobs on the same VM that hosts a self‑hosted GitHub runner?

I learned the hard way that it’s risky unless you isolate the runner’s workspace. The runner stores transient data in `_work/_temp`, which is essential for step‑to‑step communication. A broad `find … -mtime +7 -delete`, or a generic `rm -rf` whose path happens to cover the runner’s home, can unintentionally target those files. My current practice is to place the runner on a separate mount point and configure the cleanup timer to operate only on non‑runner partitions, with explicit path exclusions in the cleanup command (for `find`, something like `-not -path`). Additionally, I keep a small health‑check cron that verifies the presence of the `_work/_temp` directory before each cleanup run.
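A guard of that kind could look roughly like the sketch below. The guarded_cleanup function is hypothetical, and treating the _runner_file_commands directory as the thing to check is my assumption based on the paths shown earlier. It runs a cleanup command only if the directory exists, then confirms the directory survived.

```shell
#!/usr/bin/env bash
# Hypothetical guard around a cleanup command; directory names follow the article.
set -u

# Run "$@" only if the runner temp tree looks intact, then confirm that
# the _runner_file_commands directory is still there afterwards.
guarded_cleanup() {
  local temp_dir="$1"; shift
  if [ ! -d "$temp_dir/_runner_file_commands" ]; then
    echo "skip: $temp_dir/_runner_file_commands not found" >&2
    return 2
  fi
  "$@"
  if [ ! -d "$temp_dir/_runner_file_commands" ]; then
    echo "ALERT: cleanup removed $temp_dir/_runner_file_commands" >&2
    return 1
  fi
}

# Demo in a sandbox: a harmless cleanup, a destructive one, then a rerun.
sandbox="$(mktemp -d)"
temp="$sandbox/_work/_temp"
mkdir -p "$temp/_runner_file_commands"

guarded_cleanup "$temp" true; rc_ok=$?
guarded_cleanup "$temp" rm -rf "$temp/_runner_file_commands" 2>/dev/null; rc_removed=$?
guarded_cleanup "$temp" true 2>/dev/null; rc_missing=$?
echo "harmless=$rc_ok destructive=$rc_removed rerun=$rc_missing"
rm -rf "$sandbox"
```

The distinct exit codes let a timer or monitor tell "cleanup skipped" apart from "cleanup broke something", so the second case can page someone instead of surfacing hours later as failed workflows.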

What’s the quickest way to recover a self‑hosted runner after its temporary files have been erased?

The fastest recovery I use is to stop the runner, delete the corrupted `_temp` contents, and let the runner recreate them on start‑up. I run `./svc.sh stop`, then `rm -rf /home/github-runner/runner-mustafaerbay/_work/_temp/*`, followed by `./svc.sh start`. If the runner installation itself was affected, I pull the latest runner package from GitHub, extract it over the existing installation, and re‑register the runner with a fresh token. After the service is back, I trigger a lightweight workflow (e.g., an `echo hello`) to confirm the file‑command mechanism works. This brings the runner back online in under five minutes.
