perf(ci): trim stale GOCACHE entries between runs#19425
Conversation
GOCACHE grows over time as each commit adds new entries for recompiled packages while old entries persist. Go's built-in trim deletes entries unused for 5 days, but only checks once per 24h — and since CI restores the cache (including trim.txt) fresh each run, the trim rarely fires. Fix: after restoring GOCACHE, backdate all entry mtimes to year 2000. Go's markUsed() updates accessed entries to "now" during the build (always fires since year-2000 mtime is >1h old). After the build, a trim step deletes entries still at year 2000. The cache auto-save post-step then saves only entries used by this build. trim.txt must be set to "now" after backdating — otherwise Go's Trim() sees "last trim in year 2000" and deletes all backdated entries before the build starts. Locally verified: 27% reduction (2026MB → 1531MB) with ~30 commits of real source changes between cache save and restore. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Skipping CI for Draft Pull Request. |
There was a problem hiding this comment.
Hey - I've left some high level feedback:
- In the
trim-go-cachecomposite action, consider guarding thedu,find, andmktempcalls with a check that$(go env GOCACHE)is a non-empty existing directory (similar to the-dguard incache-go-dependencies) to avoid accidentally operating on the wrong path when Go or GOCACHE is misconfigured. - Since
trim-go-cacherelies oncache-go-dependencieshaving run earlier in the job, it may be safer to make that dependency explicit (e.g., by checking for the presence of the year-2000 marker ortrim.txt) and no-op otherwise, to avoid unexpected deletions if the action is reused in a job without the marking step.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In the `trim-go-cache` composite action, consider guarding the `du`, `find`, and `mktemp` calls with a check that `$(go env GOCACHE)` is a non-empty existing directory (similar to the `-d` guard in `cache-go-dependencies`) to avoid accidentally operating on the wrong path when Go or GOCACHE is misconfigured.
- Since `trim-go-cache` relies on `cache-go-dependencies` having run earlier in the job, it may be safer to make that dependency explicit (e.g., by checking for the presence of the year-2000 marker or `trim.txt`) and no-op otherwise, to avoid unexpected deletions if the action is reused in a job without the marking step.Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
|
Images are ready for the commit at 90324be. To use with deploy scripts, first |
build.yaml's pre-build-go-binaries and build-and-push-operator have matrix variants (default, prerelease, race-condition-debug) that share the same github.job cache key. Each variant writes its own entries into one cache, causing it to grow to 3x the single-variant size (5.2GB on master for pre-build-go-binaries). Fix: pass matrix.name as key-suffix to cache-go-dependencies, giving each variant its own cache key. Also adds key-suffix input to the cache action and trim steps to all build.yaml Go jobs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Address SourceryAI review feedback: - Guard against missing/empty GOCACHE directory - Check for trim.txt before trimming to ensure the mark step ran - Show removed MB in output Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use gacts/run-and-post-run to register a post-step that trims stale GOCACHE entries before the cache save post-step runs (GHA executes post-steps in reverse registration order). This eliminates the need for explicit trim-go-cache steps in every workflow that uses cache-go-dependencies — the trim is now automatic. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #19425 +/- ##
=======================================
Coverage 49.25% 49.26%
=======================================
Files 2727 2727
Lines 205788 205788
=======================================
+ Hits 101364 101376 +12
+ Misses 96888 96879 -9
+ Partials 7536 7533 -3
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
- Remove extra blank lines left from earlier trim step removal - Remove no-op `run:` from gacts/run-and-post-run (only `post:` needed) - Add `set +e` to trim post-step so cache trim errors don't fail the job Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
gacts/run-and-post-run executes each line separately. The if/then/fi block with indented lines caused YAML > to preserve newlines, splitting the script into fragments. Replace with single-line guard clause. Also: set +e prevents trim errors from failing the job, and the exit 0 in the guard clause is safe because gacts processes lines independently. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prevent the mark step from hanging indefinitely on I/O delays. If timeout fires, || true ensures the step doesn't fail — the trim just won't remove entries that weren't backdated (safe, keeps more). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Hey - I've found 1 issue, and left some high level feedback:
- In the post-step trim logic, consider reusing the previously computed
${{ steps.cache-paths.outputs.GOCACHE }}instead of callinggo env GOCACHEagain to avoid divergence ifGOCACHEis overridden in the environment. - The
find ... -exec touch -t 200001010000over the entire GOCACHE could be expensive on very large caches despite thetimeout 120; you might want to restrict it (e.g., skip files already older than the cutoff or make the timeout configurable) to avoid rare but pathological runs.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In the post-step trim logic, consider reusing the previously computed `${{ steps.cache-paths.outputs.GOCACHE }}` instead of calling `go env GOCACHE` again to avoid divergence if `GOCACHE` is overridden in the environment.
- The `find ... -exec touch -t 200001010000` over the entire GOCACHE could be expensive on very large caches despite the `timeout 120`; you might want to restrict it (e.g., skip files already older than the cutoff or make the timeout configurable) to avoid rare but pathological runs.
## Individual Comments
### Comment 1
<location path=".github/actions/cache-go-dependencies/action.yaml" line_range="68" />
<code_context>
+ # Go's markUsed() updates mtimes of accessed entries to "now"
+ # (it always updates when mtime is >1 hour old). The post-step
+ # below trims entries still at year 2000 before cache save.
+ timeout 120 find "$gocache" -type f -exec touch -t 200001010000 {} + || true
+ # Protect trim.txt: if backdated, Go's built-in Trim() sees
+ # "last trim was in year 2000" and deletes ALL backdated entries
</code_context>
<issue_to_address>
**issue (bug_risk):** Using `timeout` may break if this composite action is ever run on non-Linux runners.
Composite actions are often reused across workflows with different runners, and `timeout` is only guaranteed on Ubuntu (macOS has `gtimeout`, Windows lacks it). If this is Linux-only, please enforce or document that. Otherwise, consider a portable alternative (e.g., a manual `find` depth/limit strategy or guarding this block by `runner.os`).
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Post-merge resultsThe trim PR merged to master at commit 47557b3. Here are the measured results. Cache sizes: before vs after trim (compressed, from GHA cache)
First PR using trimmed cachesPR run 23504291244 (ROX-33771, ~23 min after merge) restored the trimmed sensor-integration cache at 620MB (was 1487MB). Other jobs restored pre-merge caches because the merge commit's trimmed cache was saved after this PR's restore step ran. key-suffix workingpre-build-go-binaries shows 0MB trimmed across all 9 matrix variants (default/prerelease/race-condition-debug × 4 architectures). Each variant now has its own cache key — no more 3x bloat from matrix collision. Context: why caches were growingCache restore times on master doubled from ~100s to ~340s over 8 days after per-commit cache saves were enabled (#19173 Mar 3 for builds, #19116 Mar 12 for tests). Go's built-in 5-day trim ( |
Cache restore times: before vs after trimMeasured from PR runs restoring the trimmed master caches (23505024475, 23504968839):
Cache restore times are back to the ~100s baseline from early March (before per-commit saves caused unbounded growth). The |
Three matrix jobs share GOCACHE keys across their variants, causing each variant to write into the same cache and the GOCACHE trim (#19425) to delete entries used by other variants: - go (unit-tests.yaml): GOTAGS="" and GOTAGS=release - go-postgres (unit-tests.yaml): same - pre-build-scanner-go-binary (scanner-build.yaml): default/prerelease/race-condition-debug Fix: pass the matrix discriminator as key-suffix so each variant gets its own cache key. Found by auditing all cache-go-dependencies usages across all workflows. build.yaml's pre-build-go-binaries and build-and-push-operator were already fixed in #19425. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove unused GOCACHE entries before saving to GHA cache.
Why?
Go deletes unused GOCACHE entries, at most once per hour, and if not used for 5 days based on the cache file modified-time: https://github.com/golang/go/blob/224489f11c2e0b394e93980fc8292c52f60b18a8/src/cmd/go/internal/cache/cache.go#L337-L346
In CI, this means that we accumulate unused cached builds for a trailing 5 days of history. This is not useful because we are unlikely to re-use the cached builds that are not hit for a job on master (when the cache is saved for use by PRs), and the GHA cache downloads are slow (usually under 200mb/s).
Cache Size
Before: on master, gocache of 29GB uncompressed (7GB compressed) for pre-build-cli GOCACHE https://github.com/stackrox/stackrox/actions/runs/23369362079/job/67989828965#step:5:129
After: on this branch, trimmed to
11 GB uncompressed
https://github.com/stackrox/stackrox/actions/runs/23103717499/job/67109229560#step:15:5