daemon: Wait between high availability control plane node updates #3586

cgwalters · 2023-03-06T23:25:47Z

This is aiming to test a theory in https://issues.redhat.com/browse/OCPBUGS-8426 that we are spinning control plane node updates too quickly in general, but that this particularly pushes us over a line on our default Azure setup.

Builds on the PR for rhcos9 because we have strong failing signal in that PR, and any positive difference in this PR will be strong evidence.

De-duplicate calls to `canonicalizeKernelType` to make the logic easier to read. Also add a few comments.

In prep for usage in MCD.

This is prep for fixing RHEL9 upgrades while maintaining `kernel-rt`. Previously the `switchKernel` logic tried to carefully handle all 4 cases (default -> default, default -> rt, rt -> default, rt -> rt). But, the last one (rt -> rt) never actually did anything because in fact the previous `rpm-ostree rebase` command already did it, and invoking `rpm-ostree upgrade` here was always a no-op. To say this another way: when doing a RHEL9 update, it's actually the first `rpm-ostree rebase` command which fails before we even get to `switchKernel`. And the reason is due to the introduction of a new `-core` subpackage; xref https://issues.redhat.com/browse/OCPBUGS-8113 So here's the new logic to handle this: - Before we do the `rebase` operation to the new OS, we detect any previous overrides of any packages starting with `kernel-rt` and we remove them. - The `rebase` operation will hence start out by deploying the stock image i.e. with throughput kernel (though note we *are* carefully preserving other local overrides) - The `switchKernel` function now longer needs to take the *previous* machineconfig state into account! Instead, we just detect if the target is RT, and if so we then do the overrides there. This significantly simplifies the logic in `switchKernel`, and will help fix RHEL9 upgrades.

This attempts to build on https://issues.redhat.com/browse/COS-1983 to test switching to RHEL9 (CoreOS) by default.

Unfortunately rpm-ostree requires this right now; we have an issue and code to provide a better API in coreos/rpm-ostree#2542 But using that will require shipping the updated rpm-ostree in RHEL 8.6.z or at least OCP 4.12.z, which is problematic. Because we know the new MCD will always be upgrading to RHEL9, for now let's update this hardcoded list. In the future we can detect when the running host has `--remove-installed-kernel` and use it instead.

Rapid file changes triggering the path unit can start the service here frequently, and then this can cause the start limit to be hit, and then systemd will refuse further activations (unless we bumped the limit). I don't think we need to synchronize the iptables rules more than once every 3 seconds.

When we move from RHCOS 8 -> RHCOS 9, the SSH keys are not being written to the new location because: 1. When the upgrade configs are written to the node, it is still running RHCOS 8, so the keys are not being written to the new location. 2. The node reboots into RHCOS 9 to complete the upgrade. 3. The "are we on the latest config" functions detect that we are indeed on the latest config and so it does not attempt to perform an update.

openshift-ci · 2023-03-06T23:26:22Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci · 2023-03-06T23:26:46Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [cgwalters]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

cgwalters · 2023-03-06T23:27:59Z

/payload-job periodic-ci-openshift-release-master-ci-4.13-e2e-azure-ovn-upgrade

openshift-ci · 2023-03-06T23:28:03Z

@cgwalters: trigger 1 job(s) for the /payload-(job|aggregate) command

periodic-ci-openshift-release-master-ci-4.13-e2e-azure-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/884ddb40-bc76-11ed-9864-d1a40dd312db-0

The theory is that we're moving on to draining other control plane nodes once we've simply landed the MCD on a control plane node and had it perform basic validation. But we want to give some time for the previously updated control plane node to quiecese and reach a steady state. TODO: replace this with something better based on etcd information or so

cgwalters · 2023-03-07T15:37:43Z

/payload-job periodic-ci-openshift-release-master-ci-4.13-e2e-azure-ovn-upgrade

openshift-ci · 2023-03-07T15:37:47Z

@cgwalters: trigger 1 job(s) for the /payload-(job|aggregate) command

periodic-ci-openshift-release-master-ci-4.13-e2e-azure-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/00878570-bcfe-11ed-997d-c782766ecaf1-0

cgwalters · 2023-03-07T22:53:53Z

The sleeping here didn't help.

cgwalters and others added 10 commits March 4, 2023 18:07

daemon: Clean up switchKernel a bit

f29dcbb

De-duplicate calls to `canonicalizeKernelType` to make the logic easier to read. Also add a few comments.

vendor: Bump coreos/rpm-ostree-client-go

21f32c0

In prep for usage in MCD.

Add rhel-coreos-9 references

7e16837

This attempts to build on https://issues.redhat.com/browse/COS-1983 to test switching to RHEL9 (CoreOS) by default.

ensures that RHCOS 9 SSH keys are in the right place

f4929fc

OKD release controller is out-of-date

428ddd3

teaches TestIgn3Cfg about the new RHCOS 9 key path

44955f4

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 6, 2023

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 6, 2023

cgwalters force-pushed the rhel-coreos-9-controlplane-sleeps branch from b2b9172 to 5da6a5a Compare March 7, 2023 15:37

cgwalters changed the title ~~daemon: Wait 90s between high availability control plane node updates~~ daemon: Wait between high availability control plane node updates Mar 7, 2023

cgwalters closed this Mar 7, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

daemon: Wait between high availability control plane node updates #3586

daemon: Wait between high availability control plane node updates #3586

Uh oh!

cgwalters commented Mar 6, 2023

Uh oh!

openshift-ci bot commented Mar 6, 2023

Uh oh!

openshift-ci bot commented Mar 6, 2023

Uh oh!

cgwalters commented Mar 6, 2023

Uh oh!

openshift-ci bot commented Mar 6, 2023

Uh oh!

cgwalters commented Mar 7, 2023

Uh oh!

openshift-ci bot commented Mar 7, 2023

Uh oh!

cgwalters commented Mar 7, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

daemon: Wait between high availability control plane node updates #3586

daemon: Wait between high availability control plane node updates #3586

Uh oh!

Conversation

cgwalters commented Mar 6, 2023

Uh oh!

openshift-ci bot commented Mar 6, 2023

Uh oh!

openshift-ci bot commented Mar 6, 2023

Uh oh!

cgwalters commented Mar 6, 2023

Uh oh!

openshift-ci bot commented Mar 6, 2023

Uh oh!

cgwalters commented Mar 7, 2023

Uh oh!

openshift-ci bot commented Mar 7, 2023

Uh oh!

cgwalters commented Mar 7, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants