Skip to content

Conversation

@cgwalters
Copy link
Member

This is aiming to test a theory in https://issues.redhat.com/browse/OCPBUGS-8426 that we are spinning control plane node updates too quickly in general, but that this particularly pushes us over a line on our default Azure setup.

Builds on the PR for rhcos9 because we have strong failing signal in that PR, and any positive difference in this PR will be strong evidence.

cgwalters and others added 10 commits March 4, 2023 18:07
De-duplicate calls to `canonicalizeKernelType` to make the
logic easier to read.  Also add a few comments.
In prep for usage in MCD.
This is prep for fixing RHEL9 upgrades while maintaining `kernel-rt`.

Previously the `switchKernel` logic tried to carefully handle
all 4 cases (default -> default, default -> rt, rt -> default, rt -> rt).

But, the last one (rt -> rt) never actually did anything because
in fact the previous `rpm-ostree rebase` command already did it,
and invoking `rpm-ostree upgrade` here was always a no-op.

To say this another way: when doing a RHEL9 update, it's actually
the first `rpm-ostree rebase` command which fails before we
even get to `switchKernel`.

And the reason is due to the introduction of a new `-core` subpackage;
xref https://issues.redhat.com/browse/OCPBUGS-8113

So here's the new logic to handle this:

- Before we do the `rebase` operation to the new OS, we detect
  any previous overrides of any packages starting with `kernel-rt`
  and we remove them.
- The `rebase` operation will hence start out by deploying the
  stock image i.e. with throughput kernel (though note we *are*
  carefully preserving other local overrides)
- The `switchKernel` function now longer needs to take the *previous*
  machineconfig state into account!  Instead, we just detect if
  the target is RT, and if so we then do the overrides there.

This significantly simplifies the logic in `switchKernel`, and will
help fix RHEL9 upgrades.
This attempts to build on https://issues.redhat.com/browse/COS-1983
to test switching to RHEL9 (CoreOS) by default.
Unfortunately rpm-ostree requires this right now; we have an issue
and code to provide a better API in coreos/rpm-ostree#2542
But using that will require shipping the updated rpm-ostree in RHEL 8.6.z
or at least OCP 4.12.z, which is problematic.

Because we know the new MCD will always be upgrading to RHEL9,
for now let's update this hardcoded list.  In the future we can
detect when the running host has `--remove-installed-kernel` and
use it instead.
Rapid file changes triggering the path unit can start the
service here frequently, and then this can cause the start
limit to be hit, and then systemd will refuse further
activations (unless we bumped the limit).

I don't think we need to synchronize the iptables
rules more than once every 3 seconds.
When we move from RHCOS 8 -> RHCOS 9, the SSH keys are not being written
to the new location because:

1. When the upgrade configs are written to the node, it is still running RHCOS 8, so the keys are not being written to the new location.
2. The node reboots into RHCOS 9 to complete the upgrade.
3. The "are we on the latest config" functions detect that we are indeed on the latest config and so it does not attempt to perform an update.
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 6, 2023
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 6, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 6, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 6, 2023
@cgwalters
Copy link
Member Author

/payload-job periodic-ci-openshift-release-master-ci-4.13-e2e-azure-ovn-upgrade

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 6, 2023

@cgwalters: trigger 1 job(s) for the /payload-(job|aggregate) command

  • periodic-ci-openshift-release-master-ci-4.13-e2e-azure-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/884ddb40-bc76-11ed-9864-d1a40dd312db-0

The theory is that we're moving on to draining other control plane
nodes once we've simply landed the MCD on a control plane node and
had it perform basic validation.  But we want to give some time
for the previously updated control plane node to quiecese and reach
a steady state.

TODO: replace this with something better based on etcd information
or so
@cgwalters cgwalters force-pushed the rhel-coreos-9-controlplane-sleeps branch from b2b9172 to 5da6a5a Compare March 7, 2023 15:37
@cgwalters
Copy link
Member Author

/payload-job periodic-ci-openshift-release-master-ci-4.13-e2e-azure-ovn-upgrade

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Mar 7, 2023

@cgwalters: trigger 1 job(s) for the /payload-(job|aggregate) command

  • periodic-ci-openshift-release-master-ci-4.13-e2e-azure-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/00878570-bcfe-11ed-997d-c782766ecaf1-0

@cgwalters cgwalters changed the title daemon: Wait 90s between high availability control plane node updates daemon: Wait between high availability control plane node updates Mar 7, 2023
@cgwalters
Copy link
Member Author

The sleeping here didn't help.

@cgwalters cgwalters closed this Mar 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants