ci(experiment): CPU/cache diagnostics matrix #21044

Draft
davdhacs wants to merge 12 commits into master from davdhacs/cpu-cache-experiment

davdhacs commented Jun 9, 2026

Description

Experiment to investigate whether Intel Xeon and AMD EPYC GitHub Actions (GHA)
runners produce different Go compile actionIDs, causing test cache misses when
the GOCACHE is shared across runner types. With identical code and identical
cache keys, we observed test cache hit rates of 11% vs 94% depending on which
CPU the runner had.

Scales the go job to 10 matrix copies (GOTAGS="" only) to sample runner
hardware diversity. Each copy logs CPU model, GOCACHE size, test cache hit
rate, and compile actionIDs for canary packages to the Step Summary.
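A minimal sketch of how two of the logged values could be collected. The function names `cpu_model` and `cache_hit_rate` are hypothetical and are not taken from this PR. The sketch assumes an x86 Linux runner where `/proc/cpuinfo` has a `model name` field, and plain `go test` output where each passing package line starts with `ok` and a cached result shows `(cached)` in the third column.

```shell
# Hypothetical helper: print the CPU model from a cpuinfo-style file.
# Defaults to /proc/cpuinfo (assumes an x86 Linux runner).
cpu_model() {
  awk -F': *' '/^model name/ { print $2; exit }' "${1:-/proc/cpuinfo}"
}

# Hypothetical helper: read `go test` output on stdin and print the
# percentage of passing packages whose result came from the test cache.
# Assumes result lines look like "ok <pkg> (cached)" or "ok <pkg> 0.123s".
cache_hit_rate() {
  awk '
    $1 == "ok" { total++; if ($3 == "(cached)") cached++ }
    END {
      if (total == 0) { print 0; exit }
      printf "%d\n", cached * 100 / total
    }
  '
}
```

With a saved test log, `cache_hit_rate < test.log` prints an integer percentage (truncated, not rounded). The GOCACHE size could be read alongside it with `du -sh "$(go env GOCACHE)"`.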

Not intended to merge. Experiment only.

User-facing documentation

Testing and quality

Automated testing

  • modified existing tests

How I validated my change

Experiment PR: validation is the CI run output itself. I will compare the Step
Summaries across the 10 copies to correlate CPU model with cache hit rate
and compile actionID differences.
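One way the comparison across copies could be done once the values are copied out of the Step Summaries. This is only a sketch: the input format (one line per matrix copy, CPU model and hit rate separated by a tab) and the function name `summarize_by_cpu` are assumptions made for illustration.

```shell
# Hypothetical helper: group per-copy results by CPU model.
# Input (stdin): "<cpu model><TAB><hit rate percent>", one line per copy.
# Output: "<cpu model><TAB><copies><TAB><mean hit rate>", sorted by model.
summarize_by_cpu() {
  awk -F'\t' '
    { sum[$1] += $2; n[$1]++ }
    END {
      for (m in sum) printf "%s\t%d\t%.1f\n", m, n[m], sum[m] / n[m]
    }
  ' | sort
}
```

A large gap between the per-model means would be consistent with the CPU hypothesis; it would not by itself rule out other causes such as differences in cached inputs.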

Investigate whether Intel Xeon vs AMD EPYC runners produce different Go
compile actionIDs, causing test cache misses across runner types.

Changes to the go job in unit-tests.yaml:
- Scale to 10 matrix copies (GOTAGS="" only) to sample runner hardware
- Add post-test diagnostics step logging: CPU model, GOCACHE size,
  test cache hit rate, and compile actionIDs for canary packages
- Strip non-essential steps (codecov, junit2jira, operator/integration)
  to keep experiment focused and fast

Results will appear in each job's Step Summary for easy comparison.
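A rough sketch of how a diagnostics step might append its results to the job's Step Summary. `GITHUB_STEP_SUMMARY` is the file path GitHub Actions provides for this purpose; the function name `write_diagnostics_summary`, its arguments, and the table layout are hypothetical, and compile actionIDs are left out because the PR text does not say how they are obtained.

```shell
# Hypothetical helper: append a small Markdown table to the Step Summary.
# Falls back to stdout when GITHUB_STEP_SUMMARY is not set (local runs).
write_diagnostics_summary() {
  local cpu="$1" cache_size="$2" hit_rate="$3"
  local out="${GITHUB_STEP_SUMMARY:-/dev/stdout}"
  {
    echo "| Metric | Value |"
    echo "| --- | --- |"
    echo "| CPU model | ${cpu} |"
    echo "| GOCACHE size | ${cache_size} |"
    echo "| Test cache hit rate | ${hit_rate}% |"
  } >> "$out"
}
```

Keeping the same row order in every matrix copy makes the ten summaries easy to compare side by side.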

Prompt: set up a branch for collecting data on build time, cpuinfo,
compilation cache, and used cached test results with a matrix
dimension to sample CPU types

Partially generated by AI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

openshift-ci Bot commented Jun 9, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

github-actions Bot commented Jun 9, 2026

🚀 Build Images Ready

Images are ready for commit 8bf0f81. To use with deploy scripts:

export MAIN_IMAGE_TAG=4.12.x-134-g8bf0f814cb

davdhacs added 8 commits June 9, 2026 12:01
Runs 3 test packages (pkg/set, central/cluster/util, central/notifiers/slack)
with GODEBUG=gocachetest=1 on copy 1 before the main test run. The GODEBUG output
goes to the Step Summary, showing the Phase 1 vs Phase 2 miss reason.
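A sketch of what such a canary run could look like. `GODEBUG=gocachetest=1` is the Go setting named above; the function names, the `./` package path form (which assumes the command runs from the repository root), and the assumption that the debug lines are prefixed with `testcache:` are illustrative and not confirmed by this PR.

```shell
# Hypothetical helper: keep only the test-cache debug lines.
# Assumes those lines start with "testcache:".
filter_testcache() {
  grep '^testcache:' || true  # no matches is not an error here
}

# Hypothetical helper: run the given packages with test-cache debugging on
# and print only the cache decision lines (they are written to stderr).
run_cache_canaries() {
  GODEBUG=gocachetest=1 go test "$@" 2>&1 | filter_testcache
}
```

For the three packages listed, a call might be `run_cache_canaries ./pkg/set/... ./central/cluster/util/... ./central/notifiers/slack/...`, with the output appended to the Step Summary.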

Also removes -trimpath from the main test (back to matching CI behavior)
since we need to diagnose the current 78% hit rate, not test a fix.

The previous fix only stabilized BUILD_TAG and SHORTCOMMIT, leaving
CollectorVersion, FactVersion, and ScannerVersion changing per commit.
These unstable ldflags propagate through go-test.sh → status.sh → -X flags
→ link actionID → test binary ID, invalidating the test cache for every
package that transitively depends on pkg/version/internal.

This was the root cause of the 78% cache hit rate on stale branches
(vs 96% on fresh branches where the versions happen to match the cache).

Fix: add env var overrides in status.sh for STABLE_COLLECTOR_VERSION,
STABLE_FACT_VERSION, STABLE_SCANNER_VERSION, set to 0.0.0 in the
unit-tests workflow. Normal builds (without the env vars) are unaffected.
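The override pattern described here could be sketched as follows. This is not the actual status.sh code: the helper name `resolve_version` and the fallback command shown in the usage note are hypothetical, and only the `STABLE_*_VERSION` variable names come from the commit message.

```shell
# Hypothetical helper: use the override when it is set and non-empty,
# otherwise run the given command to compute the real value.
# Usage: resolve_version <override-value> <command> [args...]
resolve_version() {
  local override="$1"
  shift
  if [ -n "$override" ]; then
    echo "$override"
  else
    "$@"
  fi
}
```

A caller might write `resolve_version "${STABLE_COLLECTOR_VERSION:-}" cat COLLECTOR_VERSION`, where `cat COLLECTOR_VERSION` stands in for however the real version is computed. With the variable set to 0.0.0 in the unit-tests workflow the value passed via -X stays constant across commits; with it unset, the fallback command runs as before.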

Expected result: cache hit rate should match the 96% baseline regardless
of how far behind master the branch is.

Partially generated by AI.
davdhacs force-pushed the davdhacs/cpu-cache-experiment branch from 1448025 to 7eea672 on June 10, 2026 03:48