fix(ci): increase GKE disk to 120GB by davdhacs · Pull Request #19218 · stackrox/stackrox

davdhacs · 2026-02-26T18:16:25Z

The default image garbage-collection expiration is 2 minutes (and cannot be increased). We're exceeding 85% disk usage, so GC was removing prefetched images.

WIP: The latest commit tests setting a containerd "don't delete me" tag on the images after the prefetcher pulls them.

I removed the SHAs from the image refs for the test pulls because the prefetcher fetches by the multi-arch SHA, which doesn't match the arch-specific image SHA used in the tests.
I increased the node instance's disk from 80GB to 120GB, but some tests still failed with images not found:

  - 22:29:04 — risk-image pod: "already present on machine"
  - 22:29:38 — risk-image pod deleted
  - ~23:00 — K8sEventDetectionTest creates k8seventprivnginx2 — ErrImageNeverPull

logs showing image delete: https://console.cloud.google.com/logs/query;query=resource.labels.cluster_name%3D%22rox-ci-qa-e2e-test-2027140122893357056%22%0ASEARCH%2528%22'qa-image-management'%22%2529;cursorTimestamp=2026-02-26T22:29:38.105656477Z;duration=PT12H?authuser=0&project=acs-san-stackroxci

metrics showing used_bytes: https://console.cloud.google.com/monitoring/metrics-explorer;duration=PT12H?project=acs-san-stackroxci&pageState=%7B%22xyChart%22:%7B%22constantLines%22:%5B%5D,%22dataSets%22:%5B%7B%22plotType%22:%22LINE%22,%22pointConnectionMethod%22:%22GAP_DETECTION%22,%22prometheusQuery%22:%22max%20by%20(%5C%22node_name%5C%22)(max_over_time(%7B%5C%22__name__%5C%22%3D%5C%22kubernetes.io%2Fnode%2Fephemeral_storage%2Fused_bytes%5C%22,%5C%22monitored_resource%5C%22%3D%5C%22k8s_node%5C%22,%5C%22cluster_name%5C%22%3D~%5C%22rox-ci-qa-e2e-test-2027140122893357056%5C%22%7D%5B$%7B__interval%7D%5D))%22,%22targetAxis%22:%22Y1%22,%22unitOverride%22:%22%22%7D%5D,%22options%22:%7B%22mode%22:%22COLOR%22%7D,%22y1Axis%22:%7B%22label%22:%22%22,%22scale%22:%22LINEAR%22%7D%7D%7D

default is 2 minutes; we're exceeding 85% disk and so GC was removing prefetched images

davdhacs · 2026-02-26T18:16:45Z

/test gke-latest-qa-e2e-tests

rhacs-bot · 2026-02-26T18:38:39Z

Images are ready for the commit at 8961197.

To use with deploy scripts, first export MAIN_IMAGE_TAG=4.11.x-250-g8961197c7d.

Node-accessible storage is the disk size minus OS and other overhead (~42GB on gke-latest). GC triggers at 85% usage and evicts images not used in the last 2 minutes. Neither the 85% threshold nor the 2-minute window can be increased.
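
To make the threshold arithmetic above concrete, here is a minimal sketch. The helper name `gc_threshold_gb` and the sample capacities are illustrative assumptions, not values taken from the CI configuration; only the 85% default comes from the message above.

```shell
#!/bin/sh
# gc_threshold_gb CAPACITY_GB [PERCENT]
# Prints the usage level (in whole GB) at which image GC would start for a
# node with CAPACITY_GB of node-accessible storage.
# Hypothetical helper for illustration; PERCENT defaults to the 85% quoted above.
gc_threshold_gb() {
    echo $(( $1 * ${2:-85} / 100 ))
}

# Compare a few hypothetical node-accessible capacities (assumed numbers).
for capacity in 40 60 100; do
    echo "capacity=${capacity}GB gc_starts_at=$(gc_threshold_gb "$capacity")GB"
done
```

The point of the sketch is only that a larger disk raises the absolute threshold; it does not change the percentage or the 2-minute window.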

davdhacs · 2026-02-26T18:56:38Z

/test gke-latest-qa-e2e-tests

This reverts commit 5aab5eb.

davdhacs · 2026-02-26T19:31:42Z

/test gke-latest-qa-e2e-tests

@sha256

The prefetcher pulls images by tag via the CRI API, which stores them indexed by tag name. When tests reference the image as tag@sha256:<manifest-list-digest>, containerd 2.x cannot resolve it with imagePullPolicy: Never because the manifest list digest is not indexed as a named image by the CRI pull-by-tag path. This caused ErrImageNeverPull on every node regardless of disk size, as the image was present on disk but not findable by digest. Images referenced by tag only (busybox-1-33-1, nginx-1-12-1, etc.) worked fine with the same Never pull policy. Remove the @sha256: digest from TEST_IMAGE so it matches how the prefetcher stores the image. Keep TEST_IMAGE_SHA available for API queries that need the digest. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
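
The tag-only form described in this commit message can be derived from a digest-qualified reference with plain parameter expansion. This is a minimal sketch: `strip_digest` is a hypothetical helper name and the sample reference is made up, neither comes from the repository.

```shell
#!/bin/sh
# strip_digest REF
# Prints REF without a trailing "@sha256:<digest>".
# Tag-only references pass through unchanged.
strip_digest() {
    printf '%s\n' "${1%@sha256:*}"
}

# Hypothetical reference, not one of the real test images.
strip_digest 'registry.example/qa/nginx:1-12-1@sha256:0123456789abcdef'
```

The digest itself can still be kept in a separate variable for API queries, as the commit message suggests for TEST_IMAGE_SHA.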

davdhacs · 2026-02-26T21:53:51Z

/test gke-latest-qa-e2e-tests

After the prefetcher completes, deploy a short-lived DaemonSet that runs ctr on each node to label all prefetched images with io.cri-containerd.pinned=pinned. This tells kubelet's image GC to skip these images regardless of disk pressure. The DaemonSet uses an init container for the actual work and a main container that exits immediately. The DaemonSet and its ConfigMap are deleted after completion to avoid leaving pods running. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

davdhacs · 2026-02-27T05:20:28Z

/test gke-latest-qa-e2e-tests

codecov · 2026-02-27T05:43:25Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 49.62%. Comparing base (009726f) to head (8961197).

Additional details and impacted files

@@           Coverage Diff           @@
##           master   #19218   +/-   ##
=======================================
  Coverage   49.61%   49.62%           
=======================================
  Files        2680     2680           
  Lines      202195   202195           
=======================================
+ Hits       100327   100332    +5     
+ Misses      94390    94387    -3     
+ Partials     7478     7476    -2

Flag	Coverage Δ
go-unit-tests	`49.62% <ø> (+<0.01%)`	⬆️

The previous commit used `apk add containerd-ctr` but the package is in alpine's community repo which isn't enabled by default. The install silently failed (stderr redirected to /dev/null), so ctr was never installed and images were never actually pinned. Add the community repo URL explicitly via -X flag and remove the stderr suppression so failures are visible. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

tommartensen · 2026-02-27T09:52:07Z

+failed=0
+while IFS= read -r img || [ -n "$img" ]; do
+    case "$img" in "#"*|"") continue ;; esac
+    if ctr -a "$socket" -n k8s.io images label "$img" "io.cri-containerd.pinned=pinned" >/dev/null 2>&1; then


Why can the image-prefetcher not do this?

I'm changing it here to test quickly, so I don't need to alter the prefetcher and then use a dev build of it.
If it fixes the problem, I imagine we'd move it into the prefetcher.

I forgot to mark this as a work-in-progress/draft. (It's set as a draft now.)

davdhacs · 2026-02-27T15:29:43Z

/test gke-latest-qa-e2e-tests

…time Installing containerd-ctr via apk at runtime is too slow (pulls full containerd package + deps from community repo), causing the 5-minute rollout timeout to be exceeded. Use ghcr.io/containerd/containerd:2.0 which ships with ctr already installed, eliminating the package install step entirely. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

davdhacs · 2026-02-27T17:28:23Z

/test gke-latest-qa-e2e-tests

Previous attempts to install/provide ctr in the pinning DaemonSet failed: apk add was too slow, and ghcr.io/containerd/containerd:2.0 was too large to pull within the 5-minute timeout. Instead, use the image-prefetcher image (already cached on every node from the prefetch step) with hostPID and nsenter to execute the host's own ctr binary. This requires no image pull and no package install. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

davdhacs · 2026-02-27T23:39:31Z

/test gke-latest-qa-e2e-tests

davdhacs · 2026-02-28T04:09:29Z

/test gke-latest-qa-e2e-tests

Previous approaches failed because:

  - apk add containerd-ctr: too slow (>5min timeout)
  - ghcr.io/containerd/containerd:2.0: too large to pull in time
  - nsenter via image-prefetcher image: no nsenter/sh available

Use kubectl debug node/ which mounts the host filesystem at /host, giving access to the host's ctr binary via chroot. No image pull delays since busybox:1.36 is tiny, and no DaemonSet rollout needed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

davdhacs · 2026-02-28T15:17:01Z

/test gke-latest-qa-e2e-tests

The previous approach tried to pin images by the tag names from the prefetch list, but containerd stores multi-arch images under different references (manifest list digests, platform digests). Only 15-19 of 72 images were found by tag name. Instead, list ALL images in containerd's k8s.io namespace via `ctr images list -q` and pin every one. This catches all references regardless of how containerd indexed them. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
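
The list-everything-and-label loop described in this commit message could look roughly like the sketch below. The function name `pin_all_images` and the socket path in the usage comment are assumptions. The `ctr` invocations mirror the ones quoted elsewhere in this PR (`images list -q`, `images label`). A later commit in this PR reports that the pinned label is unreliable, so treat this as an illustration of the attempt, not a recommended fix.

```shell
#!/bin/sh
# pin_all_images SOCKET
# Labels every image reference in containerd's k8s.io namespace as pinned.
# Hypothetical helper; prints a failure count and returns non-zero on any failure.
pin_all_images() {
    local socket="$1" failed=0 img
    # Read from a here-document instead of a pipe so the loop runs in the
    # current shell and the failure counter survives the loop.
    while IFS= read -r img || [ -n "$img" ]; do
        [ -n "$img" ] || continue
        if ! ctr -a "$socket" -n k8s.io images label "$img" "io.cri-containerd.pinned=pinned" >/dev/null; then
            echo "failed to pin: $img" >&2
            failed=$((failed + 1))
        fi
    done <<EOF
$(ctr -a "$socket" -n k8s.io images list -q)
EOF
    echo "pin failures: $failed"
    [ "$failed" -eq 0 ]
}

# Usage (socket path is an assumption about the node):
# pin_all_images /run/containerd/containerd.sock
```

Unlike the earlier snippet in this PR, stderr is left visible so labeling failures show up in CI logs.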

davdhacs · 2026-02-28T17:59:18Z

/test gke-latest-qa-e2e-tests

kubectl debug with -it requires a TTY which is not available in CI, causing output capture to silently fail. Remove -it so the command runs non-interactively and its output is properly captured. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

davdhacs · 2026-02-28T20:35:14Z

/test gke-latest-qa-e2e-tests

kubectl debug node/ without -it doesn't execute the command (just creates the pod and returns). With -it it needs a TTY unavailable in CI. Instead, use kubectl run with --overrides to create a pod per node with:

  - nodeName: targets specific node
  - hostPID: true: enables nsenter to enter host namespaces
  - nsenter -t 1 -m -u -n -p: runs the host's ctr directly
  - busybox:1.36: tiny image (~4MB), has nsenter built in

Pods are launched in parallel, then we kubectl wait for completion and collect logs. This gives proper output capture and error handling. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

davdhacs · 2026-03-01T05:52:13Z

/test gke-latest-qa-e2e-tests

kubectl run --overrides had shell quoting issues: $img, $p, $f in the pin_cmd variable were expanded as empty strings when embedded in the JSON. Also errors were suppressed with >/dev/null 2>&1. Switch to kubectl apply with a heredoc YAML manifest which avoids all quoting issues. Shell variables in the script body are escaped with \$ so they're interpreted by the container, not the CI shell. Also add kubectl describe on failure for better diagnostics. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
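
The quoting behaviour described in this commit message can be reproduced without a cluster. In the minimal sketch below, `render_pin_script` is a hypothetical helper that only prints the generated script body. In an unquoted here-document, `$node` is expanded by the outer shell, while `\$img` and `\$(...)` are emitted literally for the container's shell to interpret later.

```shell
#!/bin/sh
# render_pin_script NODE
# Prints a script body the way an unquoted here-document would embed it in a
# manifest. Hypothetical helper for illustration only.
render_pin_script() {
    local node="$1"
    cat <<EOF
# generated for node: $node
for img in \$(ctr -n k8s.io images list -q); do
    echo "pinning \$img"
done
EOF
}

render_pin_script example-node-1
```

Dropping the backslashes would make the outer shell expand `$img` to an empty string, which is the failure mode the commit message describes.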

davdhacs · 2026-03-01T15:18:47Z

/test gke-latest-qa-e2e-tests

sourcery-ai

Hey - I've found 1 issue, and left some high level feedback:

_image_prefetcher_pin_images currently pins all images in the containerd k8s.io namespace on every node; consider scoping this to just the prefetched images (e.g., via labels or a known list) to avoid unbounded pinning and unexpected interference with kubelet GC.
In _image_prefetcher_pin_images, the second parameter (name) is unused and pod names are derived only from the node name suffix (${node##*-}), which can lead to confusion or collisions; consider either using the full node name or including the prefetch set name to make pod names unique and the signature meaningful.
In BaseSpecification.groovy, TEST_IMAGE was changed to a tag-only reference while TEST_IMAGE_NAME_WITH_SHA still points to it and TEST_IMAGE_SHA remains separate, which makes the constant names misleading; consider renaming or restructuring these constants so that the "*_WITH_SHA" variant actually includes the digest and usages remain clear.

## Individual Comments

### Comment 1
<location path="scripts/ci/lib.sh" line_range="820" />
<code_context>
+
+    # Launch pin pods on all nodes in parallel
+    for node in $nodes; do
+        local pod_name="pin-images-${node##*-}"
+        cat <<PINEOF | kubectl apply -n "$ns" -f -
+apiVersion: v1
</code_context>
<issue_to_address>
**issue (bug_risk):** Using only the node name suffix for pod_name risks collisions across nodes with similar suffixes.

With `${node##*-}`, different nodes that share a suffix (e.g., `gke-cluster-a-pool-1-abc` and `gke-cluster-b-pool-2-abc`) will generate the same `pod_name` (`pin-images-abc`). In that case `kubectl apply` will update a single pod instead of one per node. Consider using the full node name or a stable hash of it in `pod_name` to ensure uniqueness while keeping names deterministic.
</issue_to_address>
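
One way to follow this suggestion, sketched under assumptions: hash the full node name so the pod name stays short and deterministic. `pin_pod_name` is a hypothetical helper, and `cksum` is used only because it is available in any POSIX environment. A stronger hash could be swapped in.

```shell
#!/bin/sh
# pin_pod_name NODE
# Derives a deterministic pod name from the FULL node name by hashing it with
# POSIX cksum, instead of keeping only the text after the last dash.
# Hypothetical helper, not the function used in scripts/ci/lib.sh.
pin_pod_name() {
    local node="$1" crc
    crc="$(printf '%s' "$node" | cksum | cut -d' ' -f1)"
    printf 'pin-images-%s\n' "$crc"
}

# The two node names from the review comment share the suffix "abc".
for node in gke-cluster-a-pool-1-abc gke-cluster-b-pool-2-abc; do
    echo "suffix-only: pin-images-${node##*-}  hashed: $(pin_pod_name "$node")"
done
```

The suffix-only names are identical for both nodes, while the hashed names differ and are stable across runs.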

davdhacs · 2026-03-02T03:55:22Z

/test gke-latest-qa-e2e-tests

davdhacs · 2026-03-02T14:43:08Z

/test gke-latest-qa-e2e-tests

davdhacs · 2026-03-02T17:31:26Z

/test gke-latest-qa-e2e-tests

davdhacs · 2026-03-02T20:19:02Z

/test gke-latest-qa-e2e-tests

io.cri-containerd.pinned has known bugs (containerd#9328, #10270) that make it unreliable for preventing image GC. Replace the pinning approach with a re-pull step: after the prefetcher completes, run ctr images pull on each node for any images whose tag reference was lost to GC. Since layers are still cached, re-pulls are near-instant. Uses the same image list configmap as the prefetcher. Only re-pulls images that are missing (ctr images check); skips images that are still present. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

davdhacs · 2026-03-02T23:21:04Z

/test gke-latest-qa-e2e-tests

Point to the branch-rox-33305-prevent-gc image of image-prefetcher which pins images via the containerd native API immediately after each CRI pull, preventing kubelet GC from evicting them. This replaces the post-hoc repull/pin approaches which all failed due to various containerd/CRI issues. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

davdhacs · 2026-03-03T19:30:22Z

/test gke-latest-qa-e2e-tests

davdhacs · 2026-03-04T05:41:08Z

/test gke-latest-qa-e2e-tests

openshift-ci · 2026-03-04T07:42:40Z

@davdhacs: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/gke-operator-e2e-tests	`3a8ab48`	link	false	`/test gke-operator-e2e-tests`
ci/prow/gke-scanner-v4-install-tests	`3a8ab48`	link	false	`/test gke-scanner-v4-install-tests`
ci/prow/ocp-4-20-qa-e2e-tests	`3a8ab48`	link	false	`/test ocp-4-20-qa-e2e-tests`
ci/prow/ocp-4-12-operator-e2e-tests	`3a8ab48`	link	false	`/test ocp-4-12-operator-e2e-tests`
ci/prow/ocp-4-12-scanner-v4-install-tests	`3a8ab48`	link	false	`/test ocp-4-12-scanner-v4-install-tests`
ci/prow/ocp-4-12-qa-e2e-tests	`3a8ab48`	link	false	`/test ocp-4-12-qa-e2e-tests`
ci/prow/ocp-4-20-operator-e2e-tests	`3a8ab48`	link	false	`/test ocp-4-20-operator-e2e-tests`
ci/prow/ocp-4-20-scanner-v4-install-tests	`3a8ab48`	link	false	`/test ocp-4-20-scanner-v4-install-tests`
ci/prow/gke-latest-qa-e2e-tests	`8961197`	link	false	`/test gke-latest-qa-e2e-tests`

fix(ci): extend gke image GC time to 8h

5aab5eb

increase GKE CI node disk to 100GB

fb97161

Revert "fix(ci): extend gke image GC time to 8h"

dcafbe4

davdhacs changed the title ~~fix(ci): extend gke image GC time to 8h~~ fix(ci): increase GKE disk to 100GB Feb 26, 2026

davdhacs changed the title ~~fix(ci): increase GKE disk to 100GB~~ fix(ci): increase GKE disk to 120GB Feb 26, 2026

davdhacs requested a review from janisz as a code owner February 26, 2026 21:53

davdhacs mentioned this pull request Feb 27, 2026

fix(ROX-33305): add StackRox image pull secret to qa-image-scanning-test namespace #19182

Draft

9 tasks

tommartensen reviewed Feb 27, 2026

davdhacs marked this pull request as draft February 27, 2026 15:29

openshift-ci Bot added the do-not-merge/work-in-progress label Feb 27, 2026

davdhacs and others added 2 commits February 28, 2026 22:52

empty commit: warm run to verify v6 mtime-fixed cache

18214d4

github-actions Bot added area/ci ai-review labels Mar 1, 2026

sourcery-ai Bot reviewed Mar 1, 2026

Comment thread scripts/ci/lib.sh Outdated

davdhacs force-pushed the davdhacs/rox-33305-gc-8h branch from 66d3fd4 to b634840 Compare March 2, 2026 14:42

davdhacs force-pushed the davdhacs/rox-33305-gc-8h branch from b634840 to 332545a Compare March 2, 2026 17:31

davdhacs force-pushed the davdhacs/rox-33305-gc-8h branch from 332545a to 6559da1 Compare March 2, 2026 20:18

davdhacs force-pushed the davdhacs/rox-33305-gc-8h branch from 6559da1 to 3092616 Compare March 2, 2026 23:20

davdhacs mentioned this pull request Mar 4, 2026

ROX-33305 prevent gc stackrox/image-prefetcher#166

Closed

davdhacs added 3 commits March 3, 2026 22:38

Merge branch 'master' into davdhacs/rox-33305-gc-8h

c3dadb4

restore test_image @sha

f86e2bf

try larger machine and 200gb disk

8961197

davdhacs closed this Mar 26, 2026

porridge mentioned this pull request Apr 15, 2026

fix(ci): work around preloaded image use problem in recent k8s #19287

Merged

8 tasks
