Skip to content

perf(ci): trim stale GOCACHE entries between runs#19425

Merged
davdhacs merged 10 commits intomasterfrom
davdhacs/gocache-trim
Mar 24, 2026
Merged

perf(ci): trim stale GOCACHE entries between runs#19425
davdhacs merged 10 commits intomasterfrom
davdhacs/gocache-trim

Conversation

@davdhacs
Copy link
Copy Markdown
Contributor

@davdhacs davdhacs commented Mar 14, 2026

Remove unused GOCACHE entries before saving to GHA cache.

Why?

Go deletes unused GOCACHE entries, at most once per hour, and if not used for 5 days based on the cache file modified-time: https://github.com/golang/go/blob/224489f11c2e0b394e93980fc8292c52f60b18a8/src/cmd/go/internal/cache/cache.go#L337-L346

In CI, this means that we accumulate unused cached builds for a trailing 5 days of history. This is not useful because we are unlikely to re-use the cached builds that are not hit for a job on master (when the cache is saved for use by PRs), and the GHA cache downloads are slow (usually under 200mb/s).

Cache Size

Before: on master, gocache of 29GB uncompressed (7GB compressed) for pre-build-cli GOCACHE https://github.com/stackrox/stackrox/actions/runs/23369362079/job/67989828965#step:5:129

After: on this branch, trimmed to
11 GB uncompressed
https://github.com/stackrox/stackrox/actions/runs/23103717499/job/67109229560#step:15:5

GOCACHE grows over time as each commit adds new entries for recompiled
packages while old entries persist. Go's built-in trim deletes entries
unused for 5 days, but only checks once per 24h — and since CI restores
the cache (including trim.txt) fresh each run, the trim rarely fires.

Fix: after restoring GOCACHE, backdate all entry mtimes to year 2000.
Go's markUsed() updates accessed entries to "now" during the build
(always fires since year-2000 mtime is >1h old). After the build, a
trim step deletes entries still at year 2000. The cache auto-save
post-step then saves only entries used by this build.

trim.txt must be set to "now" after backdating — otherwise Go's Trim()
sees "last trim in year 2000" and deletes all backdated entries before
the build starts.

Locally verified: 27% reduction (2026MB → 1531MB) with ~30 commits of
real source changes between cache save and restore.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@openshift-ci
Copy link
Copy Markdown

openshift-ci bot commented Mar 14, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Copy link
Copy Markdown
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've left some high level feedback:

  • In the trim-go-cache composite action, consider guarding the du, find, and mktemp calls with a check that $(go env GOCACHE) is a non-empty existing directory (similar to the -d guard in cache-go-dependencies) to avoid accidentally operating on the wrong path when Go or GOCACHE is misconfigured.
  • Since trim-go-cache relies on cache-go-dependencies having run earlier in the job, it may be safer to make that dependency explicit (e.g., by checking for the presence of the year-2000 marker or trim.txt) and no-op otherwise, to avoid unexpected deletions if the action is reused in a job without the marking step.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In the `trim-go-cache` composite action, consider guarding the `du`, `find`, and `mktemp` calls with a check that `$(go env GOCACHE)` is a non-empty existing directory (similar to the `-d` guard in `cache-go-dependencies`) to avoid accidentally operating on the wrong path when Go or GOCACHE is misconfigured.
- Since `trim-go-cache` relies on `cache-go-dependencies` having run earlier in the job, it may be safer to make that dependency explicit (e.g., by checking for the presence of the year-2000 marker or `trim.txt`) and no-op otherwise, to avoid unexpected deletions if the action is reused in a job without the marking step.

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@rhacs-bot
Copy link
Copy Markdown
Contributor

rhacs-bot commented Mar 14, 2026

Images are ready for the commit at 90324be.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.11.x-425-g90324be4eb.

davdhacs and others added 3 commits March 14, 2026 00:00
build.yaml's pre-build-go-binaries and build-and-push-operator have
matrix variants (default, prerelease, race-condition-debug) that share
the same github.job cache key. Each variant writes its own entries into
one cache, causing it to grow to 3x the single-variant size (5.2GB on
master for pre-build-go-binaries).

Fix: pass matrix.name as key-suffix to cache-go-dependencies, giving
each variant its own cache key. Also adds key-suffix input to the cache
action and trim steps to all build.yaml Go jobs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Address SourceryAI review feedback:
- Guard against missing/empty GOCACHE directory
- Check for trim.txt before trimming to ensure the mark step ran
- Show removed MB in output

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use gacts/run-and-post-run to register a post-step that trims stale
GOCACHE entries before the cache save post-step runs (GHA executes
post-steps in reverse registration order).

This eliminates the need for explicit trim-go-cache steps in every
workflow that uses cache-go-dependencies — the trim is now automatic.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@codecov
Copy link
Copy Markdown

codecov bot commented Mar 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 49.26%. Comparing base (52b7509) to head (90324be).
⚠️ Report is 15 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #19425   +/-   ##
=======================================
  Coverage   49.25%   49.26%           
=======================================
  Files        2727     2727           
  Lines      205788   205788           
=======================================
+ Hits       101364   101376   +12     
+ Misses      96888    96879    -9     
+ Partials     7536     7533    -3     
Flag Coverage Δ
go-unit-tests 49.26% <ø> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

davdhacs and others added 3 commits March 14, 2026 09:23
- Remove extra blank lines left from earlier trim step removal
- Remove no-op `run:` from gacts/run-and-post-run (only `post:` needed)
- Add `set +e` to trim post-step so cache trim errors don't fail the job

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
gacts/run-and-post-run executes each line separately. The if/then/fi
block with indented lines caused YAML > to preserve newlines, splitting
the script into fragments. Replace with single-line guard clause.

Also: set +e prevents trim errors from failing the job, and the exit 0
in the guard clause is safe because gacts processes lines independently.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prevent the mark step from hanging indefinitely on I/O delays.
If timeout fires, || true ensures the step doesn't fail — the trim
just won't remove entries that weren't backdated (safe, keeps more).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@davdhacs davdhacs marked this pull request as ready for review March 15, 2026 00:23
@davdhacs davdhacs requested a review from a team as a code owner March 15, 2026 00:23
@davdhacs davdhacs requested a review from janisz March 15, 2026 00:24
Copy link
Copy Markdown
Contributor

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 1 issue, and left some high level feedback:

  • In the post-step trim logic, consider reusing the previously computed ${{ steps.cache-paths.outputs.GOCACHE }} instead of calling go env GOCACHE again to avoid divergence if GOCACHE is overridden in the environment.
  • The find ... -exec touch -t 200001010000 over the entire GOCACHE could be expensive on very large caches despite the timeout 120; you might want to restrict it (e.g., skip files already older than the cutoff or make the timeout configurable) to avoid rare but pathological runs.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In the post-step trim logic, consider reusing the previously computed `${{ steps.cache-paths.outputs.GOCACHE }}` instead of calling `go env GOCACHE` again to avoid divergence if `GOCACHE` is overridden in the environment.
- The `find ... -exec touch -t 200001010000` over the entire GOCACHE could be expensive on very large caches despite the `timeout 120`; you might want to restrict it (e.g., skip files already older than the cutoff or make the timeout configurable) to avoid rare but pathological runs.

## Individual Comments

### Comment 1
<location path=".github/actions/cache-go-dependencies/action.yaml" line_range="68" />
<code_context>
+          # Go's markUsed() updates mtimes of accessed entries to "now"
+          # (it always updates when mtime is >1 hour old). The post-step
+          # below trims entries still at year 2000 before cache save.
+          timeout 120 find "$gocache" -type f -exec touch -t 200001010000 {} + || true
+          # Protect trim.txt: if backdated, Go's built-in Trim() sees
+          # "last trim was in year 2000" and deletes ALL backdated entries
</code_context>
<issue_to_address>
**issue (bug_risk):** Using `timeout` may break if this composite action is ever run on non-Linux runners.

Composite actions are often reused across workflows with different runners, and `timeout` is only guaranteed on Ubuntu (macOS has `gtimeout`, Windows lacks it). If this is Linux-only, please enforce or document that. Otherwise, consider a portable alternative (e.g., a manual `find` depth/limit strategy or guarding this block by `runner.os`).
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

davdhacs and others added 2 commits March 14, 2026 22:35
@davdhacs davdhacs requested a review from tommartensen March 16, 2026 14:28
@davdhacs davdhacs merged commit 47557b3 into master Mar 24, 2026
92 checks passed
@davdhacs davdhacs deleted the davdhacs/gocache-trim branch March 24, 2026 17:28
@davdhacs
Copy link
Copy Markdown
Contributor Author

davdhacs commented Mar 24, 2026

Post-merge results

The trim PR merged to master at commit 47557b3. Here are the measured results.

Cache sizes: before vs after trim (compressed, from GHA cache)

Job Before trim After trim Reduction
go (unit tests) 7218MB 2021MB 72%
go-postgres 4354MB 936MB 78%
style-check 6617MB 2173MB 67%
pre-build-cli 5048MB 2569MB 49%
pre-build-go-binaries 3640MB 920MB 75%
go-bench 1875MB 623MB 67%
sensor-integration 1487MB 620MB 58%
local-roxctl 2495MB 1165MB 53%
check-generated-files 1596MB 799MB 50%

First PR using trimmed caches

PR run 23504291244 (ROX-33771, ~23 min after merge) restored the trimmed sensor-integration cache at 620MB (was 1487MB). Other jobs restored pre-merge caches because the merge commit's trimmed cache was saved after this PR's restore step ran.

key-suffix working

pre-build-go-binaries shows 0MB trimmed across all 9 matrix variants (default/prerelease/race-condition-debug × 4 architectures). Each variant now has its own cache key — no more 3x bloat from matrix collision.

Context: why caches were growing

Cache restore times on master doubled from ~100s to ~340s over 8 days after per-commit cache saves were enabled (#19173 Mar 3 for builds, #19116 Mar 12 for tests). Go's built-in 5-day trim (cache.Trim()) meant we were keeping 5-day old cache entries.

@davdhacs
Copy link
Copy Markdown
Contributor Author

Cache restore times: before vs after trim

Measured from PR runs restoring the trimmed master caches (23505024475, 23504968839):

Job Before trim (Mar 20 peak) After trim (PR runs) Improvement
go (GOTAGS="") 343s 99-107s 3.2-3.5x
go-postgres 231s 73-88s 2.6-3.2x
go-bench 131s 68-87s 1.5-1.9x
local-roxctl 148s 78-89s 1.7-1.9x
sensor-integration 96s 61-81s 1.2-1.6x

Cache restore times are back to the ~100s baseline from early March (before per-commit saves caused unbounded growth). The go job restore dropped from 343s peak back to ~100s.

davdhacs added a commit that referenced this pull request Mar 24, 2026
Three matrix jobs share GOCACHE keys across their variants, causing
each variant to write into the same cache and the GOCACHE trim (#19425)
to delete entries used by other variants:

- go (unit-tests.yaml): GOTAGS="" and GOTAGS=release
- go-postgres (unit-tests.yaml): same
- pre-build-scanner-go-binary (scanner-build.yaml): default/prerelease/race-condition-debug

Fix: pass the matrix discriminator as key-suffix so each variant gets
its own cache key.

Found by auditing all cache-go-dependencies usages across all workflows.
build.yaml's pre-build-go-binaries and build-and-push-operator were
already fixed in #19425.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants