Skip to content

Conversation

@HirazawaUi
Copy link
Contributor

@HirazawaUi HirazawaUi commented Nov 4, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

The dryrun will fail once NodeLocalCRISocket reaches GA.

Specifically, if we remove the kubeadm.alpha.kubernetes.io/cri-socket annotation upon GA, the dryrun will fail in the GetNodeRegistration function. This occurs because the dryrun process cannot access the instance-config.yaml file, and we will have also removed the annotation as planned.

Which issue(s) this PR is related to:

Fixes kubernetes/kubeadm#3257
Fixes #135086

Special notes for your reviewer:

I was contemplating a minimal solution by using the following approach before calling FetchInitConfigurationFromCluster:

getNodeRegistration := true
getAPIEndpoint := isControlPlaneNode
getComponentConfigs := true
if *dryRun {
	getNodeRegistration = false
}
initCfg, err := configutil.FetchInitConfigurationFromCluster(client, nil, "upgrade", getNodeRegistration, getAPIEndpoint, getComponentConfigs, *dryRun)

However, after implementing the change, I found it made the code more cumbersome. So I rolled it back and ended up using a more straightforward approach to implement it. :)

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 4, 2025
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/kubeadm sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 4, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: HirazawaUi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Nov 4, 2025
@HirazawaUi
Copy link
Contributor Author

I've opened a draft PR where I removed the kubeadm.alpha.kubernetes.io/cri-socket annotation from the fake node object and executed the test job: https://github.com/neolit123/kubeadm-test/actions/runs/19072824492

Copy link
Member

@neolit123 neolit123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i said the same on @carlory 's PR.
i really don't think GetNodeRegistration should take dry-run into account.

IMO it should do the following

  • get from annotation (after the NodeLocalCRISocket goes GA this is removed)
  • if there is no annotation, get from file
  • if there is no file, print a warning and try to auto-detect
  • if it can't autodetect, print a warning and return default

use klog.Warningf.

@carlory
Copy link
Member

carlory commented Nov 4, 2025

if it can't autodetect, print a warning and return default

If so, ContainerRuntimeVersion cannot distinguish the difference between the dry run and the real run via comparing sock name. It will print an error message.
Our job needs to skip the ContainerRuntimeVersion error.

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 4, 2025
@neolit123
Copy link
Member

NVM, i deleted my last comment.

@HirazawaUi
Copy link
Contributor Author

Haha, maybe sometimes it's better to communicate a bit slower. The reason @carlory and I were so quick to respond might be that it's very late in our time zone :)

@HirazawaUi
Copy link
Contributor Author

Given the latest changes in this PR, we no longer need PR #135090, right? Because a "real" socket will always be returned now regardless.

@neolit123
Copy link
Member

if it can't autodetect, print a warning and return default

If so, ContainerRuntimeVersion cannot distinguish the difference between the dry run and the real run via comparing sock name. It will print an error message. Our job needs to skip the ContainerRuntimeVersion error.

@carlory in that case you should revert your PR to not compare socket names and use the dryRun argument passed to the preflight check, like you did before

...what a mess.

i think making the GetNodeRegistration dry-run aware just feels wrong.
it's an utility function.

@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: b1ba6edfec8431917af59cc57e8b70f62872ba96

@neolit123
Copy link
Member

@neolit123
Copy link
Member

neolit123 commented Nov 4, 2025

Hi @neolit123, sorry for inconvenience after my PR was merged. I'm considering whether we should add a optional job for the k/k repo. Once the cmd/kubeadm code is changed, the dryrun job will be triggered.

The test result of this job doesn't block any other PRs from merging. Is it possible?

no worries.

yes, we can have non-blocking presubmit e2e jobs for k/k.
instead of running the external e2e each time.
https://github.com/neolit123/kubeadm-test/actions/workflows/k8s-test-pr.yaml

two jobs that we can add are
dryrun-latest
presubmit-upgrade-latest

@carlory
Copy link
Member

carlory commented Nov 4, 2025

two jobs that we can add are

Okay, I will do it tomorrow

@neolit123
Copy link
Member

neolit123 commented Nov 4, 2025

I1104 16:03:56.269433     259 checks.go:120] validating the container runtime version compatibility
[preflight] Some fatal errors occurred:
	[ERROR ContainerRuntimeVersion]: could not connect to the container runtime: failed to create new CRI runtime service: validate service connection: validate CRI v1 runtime API for endpoint "unix://dry-run-cri-socket": rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix: missing address"
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
error: error execution phase preflight: preflight checks failed
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
	k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:262
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll

https://github.com/neolit123/kubeadm-test/actions/runs/19074447580/job/54485941263

options:

  • skip the preflight check on dry-run
  • or only add the annotation on the fake node if the FG is disabled (i don't think this option works)

let's continue tomorrow

@carlory
Copy link
Member

carlory commented Nov 5, 2025

or only add the annotation on the fake node if the FG is disabled (i don't think this option works)

Based on this PR, I tested it with this option. The ci is green. https://github.com/neolit123/kubeadm-test/actions/runs/19088637811/job/54534316923

@pacoxu

This comment was marked as off-topic.

@neolit123
Copy link
Member

neolit123 commented Nov 5, 2025

or only add the annotation on the fake node if the FG is disabled (i don't think this option works)

Based on this PR, I tested it with this option. The ci is green. https://github.com/neolit123/kubeadm-test/actions/runs/19088637811/job/54534316923

ok, let's use this option then, because it solves the issue with respect to the FG and doesn't force us to introduce the dryrun option for the preflight check.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Nov 5, 2025
@HirazawaUi HirazawaUi force-pushed the fix-3257 branch 2 times, most recently from e50c1ee to 8ee3fe2 Compare November 5, 2025 14:45
@HirazawaUi
Copy link
Contributor Author

HirazawaUi commented Nov 5, 2025

I directly removed the kubeadm.alpha.kubernetes.io/cri-socket annotation from the fake node object. Thanks to the defensive mechanism in GetNodeRegistration, even if we delete it, there won't be any side effects, and the test job runs perfectly.
ref: https://github.com/kubernetes/kubernetes/pull/135100/files#diff-d9520473ad3de82e8118c7fb75b34ac28d71f3701ed91ca973e3d4c2e19e61ebR179-R183

Are there any other potential failure scenarios? I haven't thought of any at the moment.

Although there are only two days left until the code freeze, I'm already confident enough to merge this PR. Alternatively, should we keep it on hold and wait until we've thought it through more thoroughly before merging?

@neolit123 @carlory WDYT?

@neolit123
Copy link
Member

neolit123 commented Nov 5, 2025

I directly removed the kubeadm.alpha.kubernetes.io/cri-socket annotation from the fake node object. Thanks to the defensive mechanism in GetNodeRegistration, even if we delete it, there won't be any side effects, and the test job runs perfectly. ref: https://github.com/kubernetes/kubernetes/pull/135100/files#diff-d9520473ad3de82e8118c7fb75b34ac28d71f3701ed91ca973e3d4c2e19e61ebR179-R183

Are there any other potential failure scenarios? I haven't thought of any at the moment.

Although there are only two days left until the code freeze, I'm already confident enough to merge this PR. Alternatively, should we keep it on hold and wait until we've thought it through more thoroughly before merging?

@neolit123 @carlory WDYT?

sgtm, thanks for testing and updating it
let's merge this soon.

the only thing that i am not so sure about is having the autodetect in the function. iirc, we explicitly didn't want to have something like that in the past.

the purpose of the cluster functions was to get existing config, so if we can't get existing config for the socket from annotation or instance file, then something must be wrong.

yet, i don't think this fallback to autodetect logic is too problematic.

@HirazawaUi
Copy link
Contributor Author

I also ran a presubmit-upgrade-latest test: https://github.com/neolit123/kubeadm-test/actions/runs/19107770508

Copy link
Member

@neolit123 neolit123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 5, 2025
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: b323eaa06227efe8d8aa6d2fc5c611127bbe44ae

@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Nov 5, 2025

@HirazawaUi: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubernetes-node-kubelet-serial-crio 2fa37f3 link false /test pull-kubernetes-node-kubelet-serial-crio
pull-kubernetes-node-kubelet-serial-containerd 2fa37f3 link false /test pull-kubernetes-node-kubelet-serial-containerd

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@neolit123
Copy link
Member

/retest
/hold cancel
/milestone v1.35

(we need to land this fix in 1.35)

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 5, 2025
@k8s-ci-robot k8s-ci-robot added this to the v1.35 milestone Nov 5, 2025
@k8s-ci-robot k8s-ci-robot merged commit 189b005 into kubernetes:master Nov 5, 2025
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubeadm cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Potential dryrun failure when NodeLocalCRISocket reaches GA [Failing Test] ci-kubernetes-e2e-kubeadm-kinder-dryrun-latest

5 participants