daemon: Wait between high availability control plane node updates #3586
De-duplicate calls to `canonicalizeKernelType` to make the logic easier to read. Also add a few comments.
In prep for usage in MCD.
This is prep for fixing RHEL9 upgrades while maintaining `kernel-rt`.

Previously the `switchKernel` logic tried to carefully handle all 4 cases (default -> default, default -> rt, rt -> default, rt -> rt). But the last one (rt -> rt) never actually did anything, because the previous `rpm-ostree rebase` command had already done it, and invoking `rpm-ostree upgrade` here was always a no-op.

To say this another way: when doing a RHEL9 update, it's actually the first `rpm-ostree rebase` command which fails, before we even get to `switchKernel`. The reason is the introduction of a new `-core` subpackage; xref https://issues.redhat.com/browse/OCPBUGS-8113

So here's the new logic to handle this:

- Before we do the `rebase` operation to the new OS, we detect any previous overrides of any packages starting with `kernel-rt` and we remove them.
- The `rebase` operation will hence start out by deploying the stock image, i.e. with the throughput kernel (though note we *are* carefully preserving other local overrides).
- The `switchKernel` function no longer needs to take the *previous* machineconfig state into account! Instead, we just detect if the target is RT, and if so we then do the overrides there.

This significantly simplifies the logic in `switchKernel`, and will help fix RHEL9 upgrades.
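The new flow described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper names `rtOverridesToRemove` and `needsRTOverride` are hypothetical, and the `"realtime"` kernel type string is an assumption about how the target config names the RT kernel. The sketch only shows the two decisions (which overrides to drop before the rebase, and whether to apply RT overrides afterwards) and does not invoke `rpm-ostree`.

```go
package main

import (
	"fmt"
	"strings"
)

// kernelRTPrefix is the package-name prefix mentioned in the commit message.
const kernelRTPrefix = "kernel-rt"

// rtOverridesToRemove (hypothetical helper) returns the locally overridden
// packages whose names start with "kernel-rt". These would be removed before
// the rebase; every other local override is left alone.
func rtOverridesToRemove(overrides []string) []string {
	var remove []string
	for _, pkg := range overrides {
		if strings.HasPrefix(pkg, kernelRTPrefix) {
			remove = append(remove, pkg)
		}
	}
	return remove
}

// needsRTOverride (hypothetical helper) decides whether RT overrides must be
// applied after the rebase. Because the rebase always lands on the stock
// kernel, only the target kernel type matters, not the previous one.
// The "realtime" value is an assumption for illustration.
func needsRTOverride(targetKernelType string) bool {
	return targetKernelType == "realtime"
}

func main() {
	overrides := []string{"kernel-rt-core", "usbguard", "kernel-rt-kvm"}
	fmt.Println("remove before rebase:", rtOverridesToRemove(overrides))
	fmt.Println("apply RT overrides after rebase:", needsRTOverride("realtime"))
}
```

Dropping the dependency on the previous state is what collapses the four transition cases into a single check on the target.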
This attempts to build on https://issues.redhat.com/browse/COS-1983 to test switching to RHEL9 (CoreOS) by default.
Unfortunately rpm-ostree requires this right now; we have an issue and code to provide a better API in coreos/rpm-ostree#2542. But using that will require shipping the updated rpm-ostree in RHEL 8.6.z or at least OCP 4.12.z, which is problematic. Because we know the new MCD will always be upgrading to RHEL9, for now let's update this hardcoded list. In the future we can detect when the running host has `--remove-installed-kernel` and use it instead.
Rapid file changes triggering the path unit can start the service here frequently, and then this can cause the start limit to be hit, and then systemd will refuse further activations (unless we bumped the limit). I don't think we need to synchronize the iptables rules more than once every 3 seconds.
When we move from RHCOS 8 -> RHCOS 9, the SSH keys are not being written to the new location because:

1. When the upgrade configs are written to the node, it is still running RHCOS 8, so the keys are not being written to the new location.
2. The node reboots into RHCOS 9 to complete the upgrade.
3. The "are we on the latest config" functions detect that we are indeed on the latest config, so they do not attempt to perform an update.
Skipping CI for Draft Pull Request.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: cgwalters
The full list of commands accepted by this bot can be found here. The pull request process is described here.
/payload-job periodic-ci-openshift-release-master-ci-4.13-e2e-azure-ovn-upgrade
@cgwalters: trigger 1 job(s) for the /payload-(job|aggregate) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/884ddb40-bc76-11ed-9864-d1a40dd312db-0
The theory is that we're moving on to draining other control plane nodes once we've simply landed the MCD on a control plane node and had it perform basic validation. But we want to give some time for the previously updated control plane node to quiesce and reach a steady state. TODO: replace this with something better based on etcd information or so
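A minimal sketch of the kind of wait this describes, under stated assumptions: the function name `controlPlaneWait`, the `settleDelay` value, and treating "high availability" as "more than one control plane node" are all hypothetical choices for illustration and are not taken from the PR's diff.

```go
package main

import (
	"fmt"
	"time"
)

// settleDelay is a hypothetical pause length; the text above does not state
// the duration actually used.
const settleDelay = 2 * time.Minute

// controlPlaneWait (hypothetical helper) returns how long to pause after a
// node has been updated. It only delays for control plane nodes in a
// high availability setup, assumed here to mean more than one such node.
func controlPlaneWait(isControlPlane bool, controlPlaneCount int, delay time.Duration) time.Duration {
	if isControlPlane && controlPlaneCount > 1 {
		return delay
	}
	return 0
}

func main() {
	wait := controlPlaneWait(true, 3, settleDelay)
	// A real caller would pass this to time.Sleep (or a context-aware timer)
	// before the next control plane node is drained.
	fmt.Printf("would wait %s before the next control plane update\n", wait)
}
```

A fixed delay is the simplest possible stand-in; as the TODO notes, a check based on etcd health would be a better signal than elapsed time.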
b2b9172 to 5da6a5a
/payload-job periodic-ci-openshift-release-master-ci-4.13-e2e-azure-ovn-upgrade
@cgwalters: trigger 1 job(s) for the /payload-(job|aggregate) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/00878570-bcfe-11ed-997d-c782766ecaf1-0
The sleeping here didn't help.
This is aiming to test a theory in https://issues.redhat.com/browse/OCPBUGS-8426 that we are spinning control plane node updates too quickly in general, but that this particularly pushes us over a line on our default Azure setup.
Builds on the PR for rhcos9 because we have strong failing signal in that PR, and any positive difference in this PR will be strong evidence.