ci(experiment): CPU/cache diagnostics matrix #21044
Investigate whether Intel Xeon vs AMD EPYC runners produce different Go compile actionIDs, causing test cache misses across runner types.

Changes to the go job in unit-tests.yaml:
- Scale to 10 matrix copies (GOTAGS="" only) to sample runner hardware
- Add post-test diagnostics step logging: CPU model, GOCACHE size, test cache hit rate, and compile actionIDs for canary packages
- Strip non-essential steps (codecov, junit2jira, operator/integration) to keep experiment focused and fast

Results will appear in each job's Step Summary for easy comparison.

Prompt: set up a branch for collecting data on build time, cpuinfo, compilation cache, and used cached test results with a matrix dimension to sample CPU types

Partially generated by AI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
🚀 Build Images Ready
Images are ready for commit 8bf0f81. To use with deploy scripts: export MAIN_IMAGE_TAG=4.12.x-134-g8bf0f814cb
Runs 3 test packages (pkg/set, central/cluster/util, central/notifiers/slack) with GODEBUG=gocachetest=1 on copy 1 before the main test run. The GODEBUG output goes to the Step Summary and shows the Phase 1 vs Phase 2 miss reason. Also removes -trimpath from the main test run (back to matching CI behavior), since we need to diagnose the current 78% hit rate, not test a fix.
The previous fix only stabilized BUILD_TAG and SHORTCOMMIT, leaving CollectorVersion, FactVersion, and ScannerVersion changing per commit. These unstable ldflags propagate through go-test.sh → status.sh → -X flags → link actionID → test binary ID, invalidating the test cache for every package that transitively depends on pkg/version/internal.

This was the root cause of the 78% cache hit rate on stale branches (vs 96% on fresh branches where the versions happen to match the cache).

Fix: add env var overrides in status.sh for STABLE_COLLECTOR_VERSION, STABLE_FACT_VERSION, STABLE_SCANNER_VERSION, set to 0.0.0 in the unit-tests workflow. Normal builds (without the env vars) are unaffected.

Expected result: cache hit rate should match the 96% baseline regardless of how far behind master the branch is.

Partially generated by AI.
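The propagation described above can be illustrated with a toy model. This is only a sketch, not Go's real actionID computation and not the contents of status.sh: `toy_action_id` simply hashes the linker flags, the import path `example.com/pkg/version/internal` and all version numbers are made-up placeholders, and `resolve_version` is a hypothetical stand-in for the override logic. Only the `STABLE_*_VERSION` variable names and the `0.0.0` value come from the commit message.

```python
import hashlib


def resolve_version(name, computed, env):
    """Return the STABLE_<NAME>_VERSION override if set, else the computed value."""
    return env.get(f"STABLE_{name}_VERSION") or computed


def toy_action_id(ldflags):
    """Toy stand-in for a link actionID: a hash over the ordered linker flags."""
    return hashlib.sha256("\x00".join(ldflags).encode()).hexdigest()[:16]


def ldflags_for(computed_versions, env):
    """Build -X flags; the import path here is a placeholder, not the real one."""
    return [
        f"-X=example.com/pkg/version/internal.{var}={resolve_version(key, value, env)}"
        for key, var, value in computed_versions
    ]


# Made-up versions for two commits; only the collector version differs.
commit_a = [
    ("COLLECTOR", "CollectorVersion", "1.0.0"),
    ("FACT", "FactVersion", "2.0.0"),
    ("SCANNER", "ScannerVersion", "3.0.0"),
]
commit_b = [
    ("COLLECTOR", "CollectorVersion", "1.0.1"),
    ("FACT", "FactVersion", "2.0.0"),
    ("SCANNER", "ScannerVersion", "3.0.0"),
]

# Overrides as the commit message says the unit-tests workflow sets them.
STABLE_ENV = {
    "STABLE_COLLECTOR_VERSION": "0.0.0",
    "STABLE_FACT_VERSION": "0.0.0",
    "STABLE_SCANNER_VERSION": "0.0.0",
}

# Without overrides, one changed version changes the toy ID (cache miss).
print(toy_action_id(ldflags_for(commit_a, {})) == toy_action_id(ldflags_for(commit_b, {})))
# With overrides, both commits produce the same flags and the same toy ID.
print(toy_action_id(ldflags_for(commit_a, STABLE_ENV)) == toy_action_id(ldflags_for(commit_b, STABLE_ENV)))
```

In this model an unset override falls back to the computed value, which mirrors the statement that normal builds without the env vars are unaffected.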
1448025 to 7eea672
Description
Experiment to investigate whether Intel Xeon vs AMD EPYC GHA runners produce
different Go compile actionIDs, causing test cache misses when the GOCACHE is
shared across runner types. With identical code and identical cache keys, we
observed an 11% vs 94% test cache hit rate depending on the runner's CPU.
Scales the go job to 10 matrix copies (GOTAGS="" only) to sample runner
hardware diversity. Each copy logs CPU model, GOCACHE size, test cache hit
rate, and compile actionIDs for canary packages to the Step Summary.
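As a rough sketch of two of the logged values, the snippet below extracts the CPU model from `/proc/cpuinfo`-style text and counts cached results in `go test` output. It is not the workflow's actual diagnostics step: the function names are hypothetical, the sample inputs are made up, and it assumes the usual `go test` summary lines where a cached package shows `(cached)` in place of a duration.

```python
def cpu_model(cpuinfo_text):
    """Return the first 'model name' value from /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model name":
            return value.strip()
    return "unknown"


def cache_hit_counts(go_test_output):
    """Return (cached, total) over the 'ok' package lines of go test output."""
    cached = total = 0
    for line in go_test_output.splitlines():
        fields = line.split()
        # Only passing packages count; '?' (no test files) and FAIL are skipped.
        if len(fields) >= 3 and fields[0] == "ok":
            total += 1
            if "(cached)" in fields[2:]:
                cached += 1
    return cached, total


# Made-up sample inputs, for illustration only.
SAMPLE_CPUINFO = "processor\t: 0\nmodel name\t: Example CPU 9000\ncpu MHz\t\t: 2445.0\n"
SAMPLE_GO_TEST = (
    "ok  \texample.com/pkg/set\t(cached)\n"
    "ok  \texample.com/central/cluster/util\t0.412s\n"
    "ok  \texample.com/central/notifiers/slack\t(cached)\n"
    "?   \texample.com/tools\t[no test files]\n"
)

cached, total = cache_hit_counts(SAMPLE_GO_TEST)
print(f"CPU: {cpu_model(SAMPLE_CPUINFO)}; test cache hits: {cached}/{total}")
```

On a real runner the inputs would come from reading `/proc/cpuinfo` and from the captured test log, and the result line could be appended to the Step Summary.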
Not intended to merge. Experiment only.
User-facing documentation
Testing and quality
Automated testing
How I validated my change
Experiment PR — validation is the CI run output itself. Will compare Step
Summaries across the 10 copies to correlate CPU model with cache hit rate
and compile actionID differences.
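One possible way to do that comparison, once the per-copy values are collected by hand or by script, is to group them by CPU model. The sketch below assumes a hypothetical record layout (`cpu`, `cached`, `total`, `canary_action_id`); all values shown are placeholders, not results from this experiment.

```python
from collections import defaultdict


def summarize(records):
    """Group per-copy records by CPU model: mean hit rate and distinct actionIDs."""
    groups = defaultdict(lambda: {"rates": [], "action_ids": set()})
    for rec in records:
        group = groups[rec["cpu"]]
        group["rates"].append(rec["cached"] / rec["total"])
        group["action_ids"].add(rec["canary_action_id"])
    return {
        cpu: {
            "copies": len(g["rates"]),
            "mean_hit_rate": sum(g["rates"]) / len(g["rates"]),
            "action_ids": sorted(g["action_ids"]),
        }
        for cpu, g in groups.items()
    }


def action_ids_differ_by_cpu(summary):
    """True if the sets of canary actionIDs are not identical across CPU models."""
    return len({tuple(v["action_ids"]) for v in summary.values()}) > 1


# Placeholder records, NOT measured data.
records = [
    {"cpu": "cpu-model-A", "cached": 9, "total": 10, "canary_action_id": "aaaa"},
    {"cpu": "cpu-model-A", "cached": 7, "total": 10, "canary_action_id": "aaaa"},
    {"cpu": "cpu-model-B", "cached": 1, "total": 10, "canary_action_id": "bbbb"},
]

summary = summarize(records)
for cpu, stats in summary.items():
    print(cpu, stats["copies"], f"{stats['mean_hit_rate']:.0%}", stats["action_ids"])
print("actionIDs differ by CPU:", action_ids_differ_by_cpu(summary))
```

If the actionIDs turn out identical across CPU models while hit rates still differ, the hardware hypothesis would not explain the misses and another input to the cache key would need checking.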