
fix(ci): increase GKE disk to 120GB#19218

Closed
davdhacs wants to merge 19 commits into master from davdhacs/rox-33305-gc-8h

Conversation

@davdhacs

@davdhacs davdhacs commented Feb 26, 2026

Contributor

The default image garbage-collection expiration is 2 minutes (and cannot be increased). We're exceeding 85% disk usage, so GC was removing prefetched images.

WIP: The latest commit tests setting a containerd "don't delete me" tag on the images after the prefetcher pulls them.

  • I removed the SHAs from the image refs for the test pulls because the prefetcher fetches by the multi-arch SHA, which doesn't match the arch-specific image SHA used in the test(s).

  • I increased the node instance's disk from 80GB to 120GB, but some tests still failed with images not found:

  - 22:29:04 — risk-image pod: "already present on machine"
  - 22:29:38 — risk-image pod deleted
  - ~23:00 — K8sEventDetectionTest creates k8seventprivnginx2 — ErrImageNeverPull

logs showing image delete: https://console.cloud.google.com/logs/query;query=resource.labels.cluster_name%3D%22rox-ci-qa-e2e-test-2027140122893357056%22%0ASEARCH%2528%22'qa-image-management'%22%2529;cursorTimestamp=2026-02-26T22:29:38.105656477Z;duration=PT12H?authuser=0&project=acs-san-stackroxci

metrics showing used_bytes: https://console.cloud.google.com/monitoring/metrics-explorer;duration=PT12H?project=acs-san-stackroxci&pageState=%7B%22xyChart%22:%7B%22constantLines%22:%5B%5D,%22dataSets%22:%5B%7B%22plotType%22:%22LINE%22,%22pointConnectionMethod%22:%22GAP_DETECTION%22,%22prometheusQuery%22:%22max%20by%20(%5C%22node_name%5C%22)(max_over_time(%7B%5C%22__name__%5C%22%3D%5C%22kubernetes.io%2Fnode%2Fephemeral_storage%2Fused_bytes%5C%22,%5C%22monitored_resource%5C%22%3D%5C%22k8s_node%5C%22,%5C%22cluster_name%5C%22%3D~%5C%22rox-ci-qa-e2e-test-2027140122893357056%5C%22%7D%5B$%7B__interval%7D%5D))%22,%22targetAxis%22:%22Y1%22,%22unitOverride%22:%22%22%7D%5D,%22options%22:%7B%22mode%22:%22COLOR%22%7D,%22y1Axis%22:%7B%22label%22:%22%22,%22scale%22:%22LINEAR%22%7D%7D%7D

default is 2 minutes; we're exceeding 85% disk
and so GC was removing prefetched images
@davdhacs

Contributor Author

/test gke-latest-qa-e2e-tests

@rhacs-bot

rhacs-bot commented Feb 26, 2026

Contributor

Images are ready for the commit at 8961197.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.11.x-250-g8961197c7d.

node-accessible is disk minus OS/etc (~42GB gke-latest)
GC hits at 85% and evicts images not used in 2 minutes.
Neither the 85% nor the 2 minutes can be increased.
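
The threshold rule described in this commit message can be sketched as a small shell check. This is only an illustration: `gc_threshold_reached` is a hypothetical helper that is not part of `scripts/ci/lib.sh`, and the 85% default mirrors the figure quoted above (which appears to correspond to kubelet's `imageGCHighThresholdPercent` default).

```shell
# Hypothetical helper (not from the PR): succeeds when disk usage is at or
# above the image GC threshold. Arguments: used bytes, total bytes, and an
# optional threshold percentage (defaults to the 85% quoted above).
gc_threshold_reached() {
    local used_bytes="$1" total_bytes="$2" threshold_pct="${3:-85}"
    # Integer-only comparison: used/total >= pct/100  <=>  used*100 >= total*pct
    [ $(( used_bytes * 100 )) -ge $(( total_bytes * threshold_pct )) ]
}
```

For example, `gc_threshold_reached "$used" "$total" && echo "GC may evict idle images"` could be used when inspecting the `used_bytes` metric linked in the description.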
@davdhacs

Contributor Author

/test gke-latest-qa-e2e-tests

@davdhacs

Contributor Author

/test gke-latest-qa-e2e-tests

@davdhacs davdhacs changed the title from "fix(ci): extend gke image GC time to 8h" to "fix(ci): increase GKE disk to 100GB" Feb 26, 2026
@davdhacs davdhacs changed the title from "fix(ci): increase GKE disk to 100GB" to "fix(ci): increase GKE disk to 120GB" Feb 26, 2026
The prefetcher pulls images by tag via the CRI API, which stores them
indexed by tag name. When tests reference the image as
tag@sha256:<manifest-list-digest>, containerd 2.x cannot resolve it
with imagePullPolicy: Never because the manifest list digest is not
indexed as a named image by the CRI pull-by-tag path.

This caused ErrImageNeverPull on every node regardless of disk size,
as the image was present on disk but not findable by digest. Images
referenced by tag only (busybox-1-33-1, nginx-1-12-1, etc.) worked
fine with the same Never pull policy.

Remove the @sha256: digest from TEST_IMAGE so it matches how the
prefetcher stores the image. Keep TEST_IMAGE_SHA available for API
queries that need the digest.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
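
The digest removal described in this commit was made in the Groovy test constants; the same transformation can be sketched in shell for scripts that handle image references. `strip_digest` is a hypothetical helper and the image names in the usage note are made up for illustration.

```shell
# Hypothetical helper (not from the PR): drop a trailing "@sha256:<digest>"
# suffix so the reference matches the tag-only form the prefetcher stores.
strip_digest() {
    # "@" only appears in an image reference as the digest separator, so
    # removing everything from the first "@" onward leaves name:tag intact.
    printf '%s\n' "${1%%@*}"
}
```

Called as `strip_digest "quay.io/example/nginx:1-12-1@sha256:abc123"`, this prints `quay.io/example/nginx:1-12-1`; a tag-only reference passes through unchanged.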
@davdhacs davdhacs requested a review from janisz as a code owner February 26, 2026 21:53
@davdhacs

Contributor Author

/test gke-latest-qa-e2e-tests

After the prefetcher completes, deploy a short-lived DaemonSet that
runs ctr on each node to label all prefetched images with
io.cri-containerd.pinned=pinned. This tells kubelet's image GC to
skip these images regardless of disk pressure.

The DaemonSet uses an init container for the actual work and a main
container that exits immediately. The DaemonSet and its ConfigMap are
deleted after completion to avoid leaving pods running.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@davdhacs

Contributor Author

/test gke-latest-qa-e2e-tests

@codecov

codecov Bot commented Feb 27, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 49.62%. Comparing base (009726f) to head (8961197).

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #19218   +/-   ##
=======================================
  Coverage   49.61%   49.62%           
=======================================
  Files        2680     2680           
  Lines      202195   202195           
=======================================
+ Hits       100327   100332    +5     
+ Misses      94390    94387    -3     
+ Partials     7478     7476    -2     
| Flag | Coverage Δ |
|---|---|
| go-unit-tests | 49.62% <ø> (+<0.01%) ⬆️ |


The previous commit used `apk add containerd-ctr` but the package is
in alpine's community repo which isn't enabled by default. The install
silently failed (stderr redirected to /dev/null), so ctr was never
installed and images were never actually pinned.

Add the community repo URL explicitly via -X flag and remove the
stderr suppression so failures are visible.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread scripts/ci/lib.sh Outdated
failed=0
while IFS= read -r img || [ -n "$img" ]; do
case "$img" in "#"*|"") continue ;; esac
if ctr -a "$socket" -n k8s.io images label "$img" "io.cri-containerd.pinned=pinned" >/dev/null 2>&1; then

Contributor


Why can the image-prefetcher not do this?

Contributor Author


I'm changing it here to test quickly, so that I don't need to alter the prefetcher and then use a dev build of it.
If this fixes it, I imagine we'd move it into the prefetcher.

  • I forgot to set this as a work-in-progress/draft. (I've set it as a draft now.)

@davdhacs

Contributor Author

/test gke-latest-qa-e2e-tests

@davdhacs davdhacs marked this pull request as draft February 27, 2026 15:29
…time

Installing containerd-ctr via apk at runtime is too slow (pulls full
containerd package + deps from community repo), causing the 5-minute
rollout timeout to be exceeded.

Use ghcr.io/containerd/containerd:2.0 which ships with ctr already
installed, eliminating the package install step entirely.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@davdhacs

Contributor Author

/test gke-latest-qa-e2e-tests

Previous attempts to install/provide ctr in the pinning DaemonSet
failed: apk add was too slow, and ghcr.io/containerd/containerd:2.0
was too large to pull within the 5-minute timeout.

Instead, use the image-prefetcher image (already cached on every node
from the prefetch step) with hostPID and nsenter to execute the host's
own ctr binary. This requires no image pull and no package install.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@davdhacs

Contributor Author

/test gke-latest-qa-e2e-tests

@davdhacs

Contributor Author

/test gke-latest-qa-e2e-tests

Previous approaches failed because:
- apk add containerd-ctr: too slow (>5min timeout)
- ghcr.io/containerd/containerd:2.0: too large to pull in time
- nsenter via image-prefetcher image: no nsenter/sh available

Use kubectl debug node/ which mounts the host filesystem at /host,
giving access to the host's ctr binary via chroot. No image pull
delays since busybox:1.36 is tiny, and no DaemonSet rollout needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@davdhacs

Contributor Author

/test gke-latest-qa-e2e-tests

The previous approach tried to pin images by the tag names from the
prefetch list, but containerd stores multi-arch images under different
references (manifest list digests, platform digests). Only 15-19 of 72
images were found by tag name.

Instead, list ALL images in containerd's k8s.io namespace via
`ctr images list -q` and pin every one. This catches all references
regardless of how containerd indexed them.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
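
A minimal sketch of the "pin everything" loop described in this commit, assuming `ctr` can reach containerd on the node. `pin_all_images` and the `CTR_BIN` override are hypothetical names introduced for illustration, and the default socket path is an assumption; the `-n k8s.io` namespace and the `io.cri-containerd.pinned=pinned` label come from the PR itself.

```shell
# Hypothetical sketch (not the PR's code): label every image in containerd's
# k8s.io namespace as pinned. CTR_BIN lets the ctr binary be swapped out,
# e.g. for a stub in tests; the socket path default is an assumption.
pin_all_images() {
    local socket="${1:-/run/containerd/containerd.sock}"
    local ctr_bin="${CTR_BIN:-ctr}"
    local pinned=0 failed=0 img
    # A here-document (instead of a pipe) keeps the loop in the current
    # shell, so the counters are still set after the loop ends.
    while IFS= read -r img; do
        [ -n "$img" ] || continue
        # Only stdout is discarded, so ctr errors stay visible on stderr.
        if "$ctr_bin" -a "$socket" -n k8s.io images label "$img" \
                "io.cri-containerd.pinned=pinned" >/dev/null; then
            pinned=$((pinned + 1))
        else
            failed=$((failed + 1))
        fi
    done <<EOF
$("$ctr_bin" -a "$socket" -n k8s.io images list -q)
EOF
    echo "pinned=$pinned failed=$failed"
    [ "$failed" -eq 0 ]
}
```

The summary line makes it easy to spot partial results such as the "15-19 of 72" case mentioned above.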
@davdhacs

Contributor Author

/test gke-latest-qa-e2e-tests

kubectl debug with -it requires a TTY which is not available in CI,
causing output capture to silently fail. Remove -it so the command
runs non-interactively and its output is properly captured.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@davdhacs

Contributor Author

/test gke-latest-qa-e2e-tests

kubectl debug node/ without -it doesn't execute the command (just
creates the pod and returns). With -it it needs a TTY unavailable in CI.

Instead, use kubectl run with --overrides to create a pod per node with:
- nodeName: targets specific node
- hostPID: true: enables nsenter to enter host namespaces
- nsenter -t 1 -m -u -n -p: runs the host's ctr directly
- busybox:1.36: tiny image (~4MB), has nsenter built in

Pods are launched in parallel, then we kubectl wait for completion
and collect logs. This gives proper output capture and error handling.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@davdhacs

davdhacs commented Mar 1, 2026

Contributor Author

/test gke-latest-qa-e2e-tests

davdhacs and others added 2 commits February 28, 2026 22:52
kubectl run --overrides had shell quoting issues: $img, $p, $f in
the pin_cmd variable were expanded as empty strings when embedded
in the JSON. Also errors were suppressed with >/dev/null 2>&1.

Switch to kubectl apply with a heredoc YAML manifest which avoids
all quoting issues. Shell variables in the script body are escaped
with \$ so they're interpreted by the container, not the CI shell.

Also add kubectl describe on failure for better diagnostics.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
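
The heredoc quoting described in this commit can be sketched as follows. This is an illustration rather than the PR's actual manifest: `render_pin_pod` is a hypothetical function, and the `privileged: true` setting is an assumption about what `nsenter` needs. Variables written as `${node}` are expanded by the CI shell when the manifest is rendered, while `\$img` and `\$(...)` reach the container as literal `$img` and `$(...)`.

```shell
# Hypothetical sketch (not the PR's code): render a per-node pin pod manifest.
# Unescaped ${...} is expanded here by the CI shell; anything written with
# \$ is passed through literally and evaluated later inside the container.
render_pin_pod() {
    local node="$1" ns="$2"
    cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: pin-images-${node}
  namespace: ${ns}
spec:
  nodeName: ${node}
  hostPID: true
  restartPolicy: Never
  containers:
  - name: pin
    image: busybox:1.36
    securityContext:
      privileged: true   # assumption: required for nsenter into host namespaces
    command:
    - sh
    - -c
    - |
      for img in \$(nsenter -t 1 -m -u -n -p ctr -n k8s.io images list -q); do
        nsenter -t 1 -m -u -n -p ctr -n k8s.io images label "\$img" io.cri-containerd.pinned=pinned
      done
EOF
}
```

The rendered manifest could then be piped to `kubectl apply -f -`, which is where the original `--overrides` JSON lost `$img` to premature expansion.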
@davdhacs

davdhacs commented Mar 1, 2026

Contributor Author

/test gke-latest-qa-e2e-tests

@sourcery-ai sourcery-ai Bot left a comment

Contributor


Hey - I've found 1 issue, and left some high level feedback:

  • _image_prefetcher_pin_images currently pins all images in the containerd k8s.io namespace on every node; consider scoping this to just the prefetched images (e.g., via labels or a known list) to avoid unbounded pinning and unexpected interference with kubelet GC.
  • In _image_prefetcher_pin_images, the second parameter (name) is unused and pod names are derived only from the node name suffix (${node##*-}), which can lead to confusion or collisions; consider either using the full node name or including the prefetch set name to make pod names unique and the signature meaningful.
  • In BaseSpecification.groovy, TEST_IMAGE was changed to a tag-only reference while TEST_IMAGE_NAME_WITH_SHA still points to it and TEST_IMAGE_SHA remains separate, which makes the constant names misleading; consider renaming or restructuring these constants so that the "*_WITH_SHA" variant actually includes the digest and usages remain clear.
## Individual Comments

### Comment 1
<location path="scripts/ci/lib.sh" line_range="820" />
<code_context>
+
+    # Launch pin pods on all nodes in parallel
+    for node in $nodes; do
+        local pod_name="pin-images-${node##*-}"
+        cat <<PINEOF | kubectl apply -n "$ns" -f -
+apiVersion: v1
</code_context>
<issue_to_address>
**issue (bug_risk):** Using only the node name suffix for pod_name risks collisions across nodes with similar suffixes.

With `${node##*-}`, different nodes that share a suffix (e.g., `gke-cluster-a-pool-1-abc` and `gke-cluster-b-pool-2-abc`) will generate the same `pod_name` (`pin-images-abc`). In that case `kubectl apply` will update a single pod instead of one per node. Consider using the full node name or a stable hash of it in `pod_name` to ensure uniqueness while keeping names deterministic.
</issue_to_address>
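
The collision described in this comment can be reproduced with a short sketch. Both functions are hypothetical names for illustration; the first mirrors the `${node##*-}` expression from the reviewed code, and the second uses the full node name as suggested. The node names are the ones from the comment above.

```shell
# Hypothetical illustration of the naming issue (not the PR's code).
# Suffix-only naming, as in the reviewed line: keeps only the text after
# the last "-" in the node name.
pod_name_from_suffix() {
    printf 'pin-images-%s\n' "${1##*-}"
}

# Full-node-name naming: unique as long as node names are unique.
pod_name_from_node() {
    printf 'pin-images-%s\n' "$1"
}
```

With the two example nodes, `pod_name_from_suffix` yields `pin-images-abc` for both, while `pod_name_from_node` keeps them distinct.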


Comment thread scripts/ci/lib.sh Outdated
@davdhacs

davdhacs commented Mar 2, 2026

Contributor Author

/test gke-latest-qa-e2e-tests

@davdhacs davdhacs force-pushed the davdhacs/rox-33305-gc-8h branch from 66d3fd4 to b634840 Compare March 2, 2026 14:42
@davdhacs

davdhacs commented Mar 2, 2026

Contributor Author

/test gke-latest-qa-e2e-tests

@davdhacs davdhacs force-pushed the davdhacs/rox-33305-gc-8h branch from b634840 to 332545a Compare March 2, 2026 17:31
@davdhacs

davdhacs commented Mar 2, 2026

Contributor Author

/test gke-latest-qa-e2e-tests

@davdhacs davdhacs force-pushed the davdhacs/rox-33305-gc-8h branch from 332545a to 6559da1 Compare March 2, 2026 20:18
@davdhacs

davdhacs commented Mar 2, 2026

Contributor Author

/test gke-latest-qa-e2e-tests

io.cri-containerd.pinned has known bugs (containerd#9328, #10270)
that make it unreliable for preventing image GC.

Replace the pinning approach with a re-pull step: after the prefetcher
completes, run ctr images pull on each node for any images whose tag
reference was lost to GC. Since layers are still cached, re-pulls are
near-instant. Uses the same image list configmap as the prefetcher.

Only re-pulls images that are missing (ctr images check); skips images
that are still present.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
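
A rough sketch of the re-pull step described in this commit. It is not the PR's implementation: `repull_missing_images` and `CTR_BIN` are hypothetical names, and the presence test uses `ctr images list -q` with a `name==` filter as a stand-in for the `ctr images check` call mentioned above, so treat that part as an assumption. The list format (one image per line, `#` comments and blank lines skipped) follows the loop shown in the `scripts/ci/lib.sh` review thread.

```shell
# Hypothetical sketch (not the PR's code): re-pull only the images whose
# reference is missing from containerd's k8s.io namespace.
repull_missing_images() {
    local list_file="$1"
    local ctr_bin="${CTR_BIN:-ctr}"
    local pulled=0 img
    while IFS= read -r img || [ -n "$img" ]; do
        # Skip comments and blank lines in the image list.
        case "$img" in "#"*|"") continue ;; esac
        # Assumption: a name filter on "images list" stands in for the
        # "ctr images check" call used in the PR.
        if [ -n "$("$ctr_bin" -n k8s.io images list -q "name==$img")" ]; then
            continue
        fi
        "$ctr_bin" -n k8s.io images pull "$img" >/dev/null || return 1
        pulled=$((pulled + 1))
    done < "$list_file"
    echo "repulled=$pulled"
}
```

Because the layers are expected to still be cached, each pull in this loop should mostly restore the missing reference rather than download data again.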
@davdhacs davdhacs force-pushed the davdhacs/rox-33305-gc-8h branch from 6559da1 to 3092616 Compare March 2, 2026 23:20
@davdhacs

davdhacs commented Mar 2, 2026

Contributor Author

/test gke-latest-qa-e2e-tests

Point to the branch-rox-33305-prevent-gc image of image-prefetcher
which pins images via the containerd native API immediately after
each CRI pull, preventing kubelet GC from evicting them.

This replaces the post-hoc repull/pin approaches which all failed due
to various containerd/CRI issues.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@davdhacs

davdhacs commented Mar 3, 2026

Contributor Author

/test gke-latest-qa-e2e-tests

@davdhacs

davdhacs commented Mar 4, 2026

Contributor Author

/test gke-latest-qa-e2e-tests

@openshift-ci

openshift-ci Bot commented Mar 4, 2026


@davdhacs: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/gke-operator-e2e-tests | 3a8ab48 | link | false | /test gke-operator-e2e-tests |
| ci/prow/gke-scanner-v4-install-tests | 3a8ab48 | link | false | /test gke-scanner-v4-install-tests |
| ci/prow/ocp-4-20-qa-e2e-tests | 3a8ab48 | link | false | /test ocp-4-20-qa-e2e-tests |
| ci/prow/ocp-4-12-operator-e2e-tests | 3a8ab48 | link | false | /test ocp-4-12-operator-e2e-tests |
| ci/prow/ocp-4-12-scanner-v4-install-tests | 3a8ab48 | link | false | /test ocp-4-12-scanner-v4-install-tests |
| ci/prow/ocp-4-12-qa-e2e-tests | 3a8ab48 | link | false | /test ocp-4-12-qa-e2e-tests |
| ci/prow/ocp-4-20-operator-e2e-tests | 3a8ab48 | link | false | /test ocp-4-20-operator-e2e-tests |
| ci/prow/ocp-4-20-scanner-v4-install-tests | 3a8ab48 | link | false | /test ocp-4-20-scanner-v4-install-tests |
| ci/prow/gke-latest-qa-e2e-tests | 8961197 | link | false | /test gke-latest-qa-e2e-tests |

