Conversation

@nvazquez (Contributor) commented Sep 25, 2019:

Description

Feature Specification: https://cwiki.apache.org/confluence/display/CLOUDSTACK/%5BKVM%5D+Rolling+Maintenance+of+hosts

This feature allows automating the upgrade/patch process of KVM hosts within a zone, pod or cluster by executing custom scripts.

In a typical scenario prior to this feature, the administrator needed to automate the process of putting hosts into maintenance before performing the upgrade on each host. This is commonly achieved using external automation tools.

This feature allows administrators to perform the automation process within CloudStack, providing a flexible framework in which custom scripts can be defined and executed on each host. CloudStack executes these scripts within the context of stages. The feature defines four stages for a host in the rolling maintenance process:

  • Pre-flight: The pre-flight script is run on all hosts as part of the pre-flight checks carried out before commencing the rolling maintenance. If a pre-flight check script returns an error on any host, rolling maintenance is cancelled with no actions taken, and an error is returned. If no pre-flight scripts are defined, no checks are performed on the hosts.

  • Pre-maintenance: The pre-maintenance script runs before a specific host is put into maintenance. If no pre-maintenance script is defined, or if the pre-flight script on a given host determines that no pre-maintenance is required on that host, no pre-maintenance actions are performed, and the management server moves straight to putting the host into maintenance, followed by requesting the agent to run the maintenance script.

  • Maintenance: The maintenance script runs after a host has been put into maintenance. If no maintenance script is defined, or if the pre-flight or pre-maintenance scripts on a given host determine that no maintenance is required on that host, the host will not be put into maintenance; the completion of the pre-maintenance scripts signals the end of all maintenance tasks, and the KVM agent hands the host back to the management server. Once the maintenance scripts have signalled that they have completed, the host agent signals to the management server that the maintenance tasks have finished and the host is therefore ready to exit maintenance mode; any information which was collected (such as processing times) is returned to the management server.

  • Post-maintenance: The post-maintenance script is expected to perform validations after the host exits maintenance. These scripts help to detect any problems during the maintenance process, including reboots or restarts within scripts.
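The hook-script contract can be sketched with a minimal pre-maintenance hook. The exit-code semantics are taken from the test runs below (0 = success, other non-zero codes = stage failure, and 70 = skip the Maintenance stage on this host); the `needs_maintenance` function and the `PENDING_UPDATES` variable are hypothetical stand-ins for a real host check:

```shell
#!/bin/bash
# Illustrative PreMaintenance hook logic. Exit code 70 (seen in the test
# runs below) tells CloudStack to skip the Maintenance stage on this host;
# 0 means proceed; any other code fails the stage.

premaintenance() {
  # Hypothetical check: treat a non-empty PENDING_UPDATES variable as
  # "this host still needs maintenance work".
  if [ -z "${PENDING_UPDATES:-}" ]; then
    return 70   # nothing to do -> skip the Maintenance stage
  fi
  # ...real preparation would go here (e.g. draining local services)...
  return 0      # proceed: the host will be put into maintenance next
}

# Demonstrate both outcomes in subshells so the exit codes are visible:
( premaintenance ) && rc=0 || rc=$?
echo "no updates pending -> exit $rc"
( PENDING_UPDATES=1; premaintenance ) && rc=0 || rc=$?
echo "updates pending -> exit $rc"
```

In a real deployment this logic would live in the stage script inside the configured hooks directory, one script per stage.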

The administrator is responsible for maintaining the hook scripts and copying them across all KVM hosts.

On all KVM hosts that will undergo rolling maintenance, a maintenance hooks directory must be defined in the agent.properties file.

Administrators must define only one script per stage. If a stage does not contain a script, it is skipped and processing continues with the next stage. Administrators are responsible for defining the scripts and copying them onto the hosts.

On the KVM hosts undergoing rolling maintenance, two script execution approaches are available:

  • Systemd service executor: This approach uses a systemd service to invoke a script execution. Once a script finishes executing, it writes its result to a file, which the agent reads and sends back to the management server.

  • Agent executor: The CloudStack agent invokes the script execution within its own JVM. If the agent is stopped or restarted, the management server will assume the stage was completed when the agent reconnects. This approach does not keep state in a file.
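The file-based handshake used by the systemd executor can be sketched in a few lines of shell. This is illustrative only: the real executor ships with the agent, and the "stage,exitcode" line format is an assumption (the file name mirrors the `rolling-maintenance-results` path visible in the agent logs in the test section):

```shell
#!/bin/bash
# Illustrative sketch: a wrapper runs a stage script and persists the
# outcome to a results file, so the result survives an agent restart.
# The "stage,exitcode" line format is an assumption for illustration.

RESULTS_FILE=$(mktemp)   # in production: a fixed rolling-maintenance-results path

run_stage() {
  local stage="$1" script="$2" rc=0
  "$script" || rc=$?
  echo "$stage,$rc" > "$RESULTS_FILE"   # the agent later reads this file
}

# Demo with a throwaway hook that succeeds:
HOOK=$(mktemp)
printf '#!/bin/bash\nexit 0\n' > "$HOOK"
chmod +x "$HOOK"
run_stage PreFlight "$HOOK"
cat "$RESULTS_FILE"
```

Because the outcome is on disk rather than in the JVM, a crash or restart of the agent process does not lose the stage result, which is the key difference from the agent executor.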

The API command to commence rolling maintenance will allow for multiple hosts or clusters or pods or zones to be specified (though each type is mutually exclusive). Before commencing any rolling maintenance actions, pre-flight checks will be run. These fall into two categories:

  • State and capacity checks on the hosts and clusters, to verify that a successful run should be possible 'at this time'.

  • The pre-flight scripts on the hosts, which are created by the administrator to check that a successful run should be possible 'at this time' from the context of the specific actions of the scripts (e.g. checking that each host can access the yum repository).

If maintenance scripts have been defined, then prior to running any scripts on a host, the capacity within the cluster to put that host into maintenance is re-checked. If there is not enough capacity in the cluster for the host to successfully go into maintenance, rolling maintenance stops immediately and an error is written to the logs.

Given that compute demands on any cluster are dynamic (i.e. virtual machines can be started, stopped, or created at any time), a cluster is disabled once the pre-flight checks have completed successfully, and re-enabled upon completion of the rolling maintenance on the cluster, or upon a failure during the maintenance of a host, to minimise the impact on end users.
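Putting the pieces together, the per-cluster flow described above can be outlined as follows. This is a sketch only: the actual orchestration lives in the management server (in Java), and the helper functions here are stubs that simulate three hosts:

```shell
#!/bin/bash
# Sketch of the per-cluster rolling maintenance flow. All helpers are
# stubs for illustration; the real logic runs in the management server.
hosts_in_cluster()   { echo kvm1 kvm2 kvm3; }
check_capacity_for() { true; }                 # stand-in for the per-host capacity re-check
run_stages_on()      { echo "maintained $1"; } # pre-maintenance -> maintenance -> post

rolling_maintenance_cluster() {
  echo "cluster disabled"        # freeze new allocations for the duration
  for host in $(hosts_in_cluster); do
    if ! check_capacity_for "$host"; then
      echo "cluster re-enabled"  # restore the cluster even on failure
      return 1
    fi
    run_stages_on "$host"
  done
  echo "cluster re-enabled"
}

rolling_maintenance_cluster
```

The important property mirrored here is that the cluster is re-enabled on both the success and the failure paths, so a mid-run error does not leave the cluster disabled.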

Management server

A new API method is created to start the automated rolling maintenance process on hosts, ‘startRollingMaintenance’, with the following parameters:

  • ‘hostid’, ‘clusterid’, ‘podid’ and ‘zoneid’ are mutually exclusive; exactly one of them must be passed.
  • ‘forced’: false by default. When enabled, does not stop iterating through hosts in case of any error in the rolling maintenance process.
  • ‘timeout’: defines a timeout in seconds for a stage to be completed on a host.
  • ‘payload’: extra parameters to be passed to the scripts.
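The mutual-exclusivity rule for the scoping parameters can be illustrated with a small validation routine. This is a sketch, not the server-side code; note also that the CloudMonkey invocations in the test section below use the plural forms `hostids`, `clusterids` and `zoneids`:

```shell
#!/bin/bash
# Sketch of the scoping rule for startRollingMaintenance: exactly one of
# hostid / clusterid / podid / zoneid must be supplied (illustrative only).

validate_scope() {
  local count=0 p
  for p in "$@"; do
    if [ -n "$p" ]; then
      count=$((count + 1))
    fi
  done
  if [ "$count" -ne 1 ]; then
    echo "error: pass exactly one of hostid, clusterid, podid, zoneid" >&2
    return 1
  fi
  return 0
}

# One scope -> accepted; two scopes -> rejected:
validate_scope "host-uuid" "" "" ""          && echo "host scope accepted"
validate_scope "host-uuid" "" "" "zone-uuid" 2>/dev/null || echo "two scopes rejected"
```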

KVM hosts

Two new properties must be set in the agent.properties file:

  • ‘rolling.maintenance.hooks.dir’: Pointing to the directory in which the custom scripts are defined
  • ‘rolling.maintenance.service.executor.disabled’: false by default. When enabled, the service execution is disabled, using the CloudStack agent as the scripts’ executor.
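A minimal agent.properties fragment might look like this (the hooks directory value matches the /root/scripts path used in the test runs below; adjust it per host):

```properties
# Rolling maintenance settings in agent.properties
rolling.maintenance.hooks.dir=/root/scripts

# false (the default): scripts run via the systemd service executor.
# true: scripts run inside the CloudStack agent JVM instead.
rolling.maintenance.service.executor.disabled=false
```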

A new systemd service is defined to handle script execution. This service is started by the CloudStack agent when executing a script, allowing the script to run outside of the JVM in which the CloudStack agent runs. With this approach, script execution is not terminated if the CloudStack agent is terminated, since the two processes are independent. The service invokes an executor script which simply invokes the custom script at a given path.
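For illustration, a templated unit along these lines could back the `cloudstack-rolling-maintenance@` instances visible in the agent logs in the test section. The ExecStart path is an assumption; as the logs show, the agent systemd-escapes a "stage,script,timeout,results-file,output-file" tuple into the instance name:

```ini
# /etc/systemd/system/cloudstack-rolling-maintenance@.service (sketch;
# the real unit ships with the agent packages)
[Unit]
Description=CloudStack rolling maintenance executor (%i)

[Service]
Type=oneshot
# %i carries the systemd-escaped stage/script/timeout/result parameters;
# the executor script (path assumed here) unescapes them and runs the hook.
ExecStart=/usr/share/cloudstack-agent/rolling-maintenance-executor %i
```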

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Screenshots (if appropriate):


How Has This Been Tested?

@nvazquez nvazquez self-assigned this Sep 25, 2019
@nvazquez nvazquez changed the title KVM: Rolling maintenance [WIP DO NOT MERGE] KVM: Rolling maintenance Sep 25, 2019
@nvazquez nvazquez added this to the 4.14.0.0 milestone Sep 25, 2019
@rohityadavcloud (Member):

@nvazquez do you have a doc PR or spec for this feature?

@rohityadavcloud (Member):

@blueorangutan package

@blueorangutan:

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan:

Packaging result: ✖centos6 ✖centos7 ✔debian. JID-293

@rohityadavcloud (Member):

@nvazquez can you check the build/pkg failure?

@rohityadavcloud (Member):

@blueorangutan package

@blueorangutan:

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan:

Packaging result: ✖centos6 ✔centos7 ✔debian. JID-297

@nvazquez (Contributor, PR author):

@rhtyd please find the description updated

@ravening (Member) commented Oct 3, 2019:

@nvazquez Do you have a document which explains the algorithm for how it performs upgrades on the hosts, and also the error scenarios when one host can't be upgraded?

We have a similar application to perform all these tasks, plus a centralized dashboard to control upgrading hypervisors on all platforms in different locations.

Update:
Nvm, I went through the code and understood what it's doing. It enables maintenance mode on each individual hypervisor. This can be really bad in most cases because, in the worst case, a VM can be migrated multiple times.

@rohityadavcloud (Member):

@nvazquez do you have a doc PR, or wiki spec link? I'll probably do a full review next week (sorry busy this week)
@blueorangutan package

@nvazquez nvazquez force-pushed the kvm-rolling-maintenance branch 2 times, most recently from 59f38f9 to 26fedcc Compare November 14, 2019 04:44
@nvazquez nvazquez force-pushed the kvm-rolling-maintenance branch 4 times, most recently from 67ebb44 to b1baab4 Compare November 24, 2019 22:30
@nvazquez nvazquez force-pushed the kvm-rolling-maintenance branch from 7a59c9d to a40bca0 Compare December 5, 2019 13:14
@nvazquez nvazquez force-pushed the kvm-rolling-maintenance branch from a40bca0 to 196d7ae Compare December 20, 2019 18:36
@nvazquez nvazquez force-pushed the kvm-rolling-maintenance branch from 196d7ae to 3d4d7d4 Compare January 2, 2020 17:37
@nvazquez nvazquez force-pushed the kvm-rolling-maintenance branch from 791e9ad to 50a3a2b Compare March 9, 2020 05:58
@nvazquez (Contributor, PR author) commented Mar 9, 2020:

Done @rhtyd
@blueorangutan package

@blueorangutan:

@nvazquez a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan:

Packaging result: ✖centos6 ✔centos7 ✔debian. JID-1027

@nvazquez (Contributor, PR author) commented Mar 9, 2020:

@blueorangutan test

@blueorangutan:

@nvazquez a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan:

Trillian test result (tid-1218)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 34030 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3610-t1218-kvm-centos7.zip
Smoke tests completed. 83 look OK, 0 have error(s)
Only failed test results are shown below:

Test Result Time (s) Test File

@nvazquez (Contributor, PR author):

@blueorangutan package


@blueorangutan:

@nvazquez a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan:

Packaging result: ✖centos6 ✔centos7 ✔debian. JID-1040

@andrijapanicsb (Contributor):

@blueorangutan test

@blueorangutan:

@andrijapanicsb a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan:

Trillian test result (tid-1233)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 32742 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3610-t1233-kvm-centos7.zip
Smoke tests completed. 83 look OK, 0 have error(s)
Only failed test results are shown below:

Test Result Time (s) Test File

@andrijapanicsb andrijapanicsb self-requested a review March 12, 2020 11:55
@andrijapanicsb (Contributor) left a review:

LGTM after extensive testing.

@andrijapanicsb (Contributor):

TestID | Test Name | Steps | Expected Result | Status
1 Scenario where the PreMaintenance script “informs” CloudStack that the Maintenance stage should not be done On one of the hosts, make sure that the PreMaintenance.sh script exits with code 70, i.e.:   #!/bin/bash echo "############# This is PreMaintenance script" exit 70   Run the rolling maintenance API against that host and confirm that no Maintenance stage will be executed Confirm that the maintenance was skipped on the host:   (localcloud) SBCM5> > start rollingmaintenance hostids=0bffbfcb-fc6f-4c6d-9604-35e494439a33 {   "rollingmaintenance": {     "details": "OK",     "hostsskipped": [       {         "hostid": "0bffbfcb-fc6f-4c6d-9604-35e494439a33",         "hostname": "ref-trl-478-k-M7-apanic-kvm3",         "reason": "Pre-maintenance stage set to avoid maintenance"       }     ],     "hostsupdated": [],     "success": true   } }   Confirm that the rolling-maintenance.log on the KVM host confirms no script was run after the “PreMaintenance.sh” script:   root@ref-trl-478-k-M7-apanic-kvm3:~/scripts# grep -ir "INFO Executing script" /var/log/cloudstack/agent/rolling-maintenance.log   11:34:53,381 rolling-maintenance INFO Executing script: /root/scripts/PreFlight.sh for stage: PreFlight 11:35:03,662 rolling-maintenance INFO Executing script: /root/scripts/PreMaintenance.sh for stage: PreMaintenance Pass
2 Confirm that the “forced” parameter doesn’t influence the scenario where the PreMaintenance script “informs” CloudStack that the Maintenance stage should not be done On one of the hosts, make sure that the PreMaintenance.sh scripts exits with code 70, i.e.:   #!/bin/bash echo "############# This is PreMaintenance script" exit 70   Run the rolling maintenance api against that host with the “forced=true” parameter and confirm that no Maintenance stage will be executed (localcloud) SBCM5> > start rollingmaintenance hostids=0bffbfcb-fc6f-4c6d-9604-35e494439a33 forced=true {   "rollingmaintenance": {     "details": "OK",     "hostsskipped": [       {         "hostid": "0bffbfcb-fc6f-4c6d-9604-35e494439a33",         "hostname": "ref-trl-478-k-M7-apanic-kvm3",         "reason": "Pre-maintenance stage set to avoid maintenance"       }     ],     "hostsupdated": [],     "success": true   } }   Confirm that the rolling-maintenance.log on KVM host confirms no script was run after the “PreMaintenance.sh” script:   root@ref-trl-478-k-M7-apanic-kvm3:~/scripts# grep -ir "INFO Executing script" /var/log/cloudstack/agent/rolling-maintenance.log 11:42:02,602 rolling-maintenance INFO Executing script: /root/scripts/PreFlight.sh for stage: PreFlight 11:42:13,1 rolling-maintenance INFO Executing script: /root/scripts/PreMaintenance.sh for stage: PreMaintenance Pass
3 Confirm that a single stage failure DOES cause abortion of the rolling maintenance API call against the rest of the cluster, when “forced” parameter NOT specified On the first hosts in the cluster, make sure that the PreMaintenance.sh scripts exits with code 1, i.e.:   #!/bin/bash echo "############# This is PreMaintenance script" exit 1   Run the rolling maintenance API against the cluster and observe that since there is a failure on the first host, no other hosts will be attempted for maintenance ((localcloud) SBCM5> > start rollingmaintenance clusterids=333fc22b-7189-4f67-a691-d95051a1b0f5 {   "rollingmaintenance": {     "details": "Error starting rolling maintenance: Stage: PreMaintenance failed on host d55eb212-357a-41fa-9bd3-99b5492b94d2: ############################## This is PreMaintenance script\n",     "hostsskipped": [],     "hostsupdated": [],     "success": false   } } Pass
4 Confirm that a single stage failure on one host does NOT cause abortion of the rolling maintenance API against the cluster, when “forced” parameter is specified On the first hosts in the cluster, make sure that the PreMaintenance.sh scripts exits with code 1, i.e.:   #!/bin/bash echo "############# This is PreMaintenance script" exit 1   Run the rolling maintenance API against the cluster and observe that even though there is a failure on the first host, other (2) hosts have been processed for Maintenance (localcloud) SBCM5> > start rollingmaintenance clusterids=333fc22b-7189-4f67-a691-d95051a1b0f5 forced=true {   "rollingmaintenance": {     "details": "Error starting rolling maintenance: Maintenance state expected, but got ErrorInPrepareForMaintenance",     "hostsskipped": [       {         "hostid": "d55eb212-357a-41fa-9bd3-99b5492b94d2",         "hostname": "ref-trl-478-k-M7-apanic-kvm1",         "reason": "Pre-maintenance script failed: ############################## This is PreMaintenance script\n"       }     ],     "hostsupdated": [       {         "enddate": "2020-01-23'T'18:42:57+00:00",         "hostid": "b0220ceb-c302-4102-b4eb-f12a72f9769b",         "hostname": "ref-trl-478-k-M7-apanic-kvm2",         "startdate": "2020-01-23'T'18:42:37+00:00"       }     ],         "enddate": "2020-01-23'T'18:44:17+00:00",         "hostid": "b0220ceb-c302-4102-b4eb-f12a72f9769b",         "hostname": "ref-trl-478-k-M7-apanic-kvm3",         "startdate": "2020-01-23'T'18:43:10+00:00"       }     ]     "success": false   } } Pass
5 Confirm that the scripts are executed by the agent Make sure that the agent.properties contains the setting to disable the service mode executor:   rolling.maintenance.service.executor.disabled=true Restart cloudstack-agent, start the rolling maintenance of a single host and observe that the agent.log on the KVM host doesn’t mention the script being executed by systemd (systemctl) 2020-03-11 16:59:18,670 INFO  [resource.wrapper.LibvirtRollingMaintenanceCommandWrapper] (agentRequest-Handler-2:null) (logid:9c201ba1) Processing stage PreFlight 2020-03-11 16:59:18,670 INFO  [rolling.maintenance.RollingMaintenanceAgentExecutor] (agentRequest-Handler-2:null) (logid:9c201ba1) Executing stage: PreFlight script: /root/scripts/PreFlight 2020-03-11 16:59:18,671 DEBUG [rolling.maintenance.RollingMaintenanceAgentExecutor] (agentRequest-Handler-2:null) (logid:9c201ba1) Executing: /root/scripts/PreFlight 2020-03-11 16:59:18,673 DEBUG [rolling.maintenance.RollingMaintenanceAgentExecutor] (agentRequest-Handler-2:null) (logid:9c201ba1) Executing while with timeout : 1800000 2020-03-11 16:59:18,675 DEBUG [rolling.maintenance.RollingMaintenanceAgentExecutor] (agentRequest-Handler-2:null) (logid:9c201ba1) Execution is successful. 2020-03-11 16:59:18,680 INFO  [rolling.maintenance.RollingMaintenanceAgentExecutor] (agentRequest-Handler-2:null) (logid:9c201ba1) Execution finished for stage: PreFlight script: /root/scripts/PreFlight : 0 Pass
6 Confirm that the scripts are executed by the service executor Make sure that the agent.properties contains the setting to enable the service mode executor (the default):   rolling.maintenance.service.executor.disabled=false Restart cloudstack-agent, start the rolling maintenance of a single host and observe that the agent.log on the KVM host mentions systemd being invoked 2020-03-11 17:04:03,168 INFO  [resource.wrapper.LibvirtRollingMaintenanceCommandWrapper] (agentRequest-Handler-3:null) (logid:525cb47d) Processing stage PreFlight 2020-03-11 17:04:03,168 DEBUG [rolling.maintenance.RollingMaintenanceServiceExecutor] (agentRequest-Handler-3:null) (logid:525cb47d) Invoking rolling maintenance service for stage: PreFlight and file /root/scripts/PreFlight with action: start 2020-03-11 17:04:03,170 DEBUG [utils.script.Script] (agentRequest-Handler-3:null) (logid:525cb47d) Executing: /bin/bash -c systemd-escape 'PreFlight,/root/scripts/PreFlight,1800,/root/scripts/rolling-maintenance-results,/root/scripts/rolling-maintenance-output' 2020-03-11 17:04:03,171 DEBUG [utils.script.Script] (agentRequest-Handler-3:null) (logid:525cb47d) Executing while with timeout : 3600000 2020-03-11 17:04:03,176 DEBUG [utils.script.Script] (agentRequest-Handler-3:null) (logid:525cb47d) Execution is successful. 2020-03-11 17:04:03,177 DEBUG [rolling.maintenance.RollingMaintenanceServiceExecutor] (agentRequest-Handler-3:null) (logid:525cb47d) Executing: /bin/systemctl start cloudstack-rolling-maintenance@PreFlight\x2c-root-scripts-PreFlight\x2c1800\x2c-root-scripts-rolling\x2dmaintenance\x2dresults\x2c-root-scripts-rolling\x2dmaintenance\x2doutput Pass
7 Confirm capacity checks are in place Out of 6 hosts, disable hosts 2,3,4,5, dedicate a single host (host6) to an account, and then execute the rolling maintenance against the remaining 1 host (host 1) which is hosting VMS that does NOT belong the account for which host6 was dedicated   Having NO VMs on the host1, host1 can be put into maintenance as there are no capacities needed to be available elsewhere, as no VM will be migrated away from host1 (localcloud) SBCM5> > start rollingmaintenance hostids=86c0b59f-89de-40db-9b30-251f851e869f {   "rollingmaintenance": {     "details": "OK",     "hostsskipped": [],     "hostsupdated": [       {         "enddate": "2020-03-11'T'17:21:38+00:00",         "hostid": "86c0b59f-89de-40db-9b30-251f851e869f",         "hostname": "ref-trl-711-k-M7-apanic-kvm1",         "output": "null ",         "startdate": "2020-03-11'T'17:21:07+00:00"       }     ],     "success": true   } } Pass
8 Confirm capacity checks are in place Out of 6 hosts, disable hosts 2,3,4,5, dedicate a single host (host6) to an account, and then execute the rolling maintenance against the remaining 1 host (host 1) which is hosting VMS that does NOT belong the account for which host6 was dedicated   Having at least 1 VM on host1, try to execute rolling maintenance against it -it will fail as there are no free hosts (non-disabled, non-dedicated) that can host VMs from host1. Due to the nature of putting the host into the Maintenance mode, after the first attempt is failed, management server will retry to migrate VMs away for 5 times and will then fail permanently (give up) while the startRollingMaintenance API call will be running until the timeout defined by “kvm.rolling.maintenance.wait.maintenance.timeout                  ” is reached, which defaults to 1800 seconds, after which the API call will fail as well.   The failure to migrate VMs away can be observed in the management-server.log   2020-03-11 17:24:25,761 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-13:ctx-f1d47d29 job-1214/job-1215 ctx-bfd95fa0) (logid:7307421c) No suitable hosts found 2020-03-11 17:24:25,761 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-13:ctx-f1d47d29 job-1214/job-1215 ctx-bfd95fa0) (logid:7307421c) No suitable hosts found under this Cluster: 1 2020-03-11 17:24:25,761 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-13:ctx-f1d47d29 job-1214/job-1215 ctx-bfd95fa0) (logid:7307421c) Could not find suitable Deployment Destination for this VM under any clusters, returning. 2020-03-11 17:24:25,761 DEBUG [c.c.d.FirstFitPlanner] (Work-Job-Executor-13:ctx-f1d47d29 job-1214/job-1215 ctx-bfd95fa0) (logid:7307421c) Searching resources only under specified Cluster: 1 2020-03-11 17:24:25,762 DEBUG [c.c.d.FirstFitPlanner] (Work-Job-Executor-13:ctx-f1d47d29 job-1214/job-1215 ctx-bfd95fa0) (logid:7307421c) The specified cluster is in avoid set, returning. 
2020-03-11 17:24:25,762 DEBUG [c.c.v.VirtualMachineManagerImpl] (Work-Job-Executor-13:ctx-f1d47d29 job-1214/job-1215 ctx-bfd95fa0) (logid:7307421c) Unable to find destination for migrating the vm VM[User|i-2-27-VM] Pass
9 The pre-flight scripts must be executed on each host before any maintenance actions Start the cluster-level rolling maintenance and “tail -f” the /var/log/cloudstack/agent/rolling-maintenance.log on each KVM host simultaneously – observe that the PreFlight stage/script is executed on all hosts in the cluster before the PreMaintenance stage/script is executed on the first host in the cluster:   17:05:56,710 rolling-maintenance INFO Successful execution of /root/scripts/PreFlight Pass
10 Failure of PreFlight check on hosts halts the API when forced=false is set On the first host in the cluster, make sure that the PreFlight script exits with code 1, i.e.:   #!/bin/bash echo "############# This is PreMaintenance script" exit 1   Run the rolling maintenance API against the cluster and observe that since there is a failure on the first host, no other hosts will be attempted for maintenance (localcloud) SBCM5> > start rollingmaintenance clusterids=a0c249d2-e020-4f2b-ab9c-1e05bbe68b64 {   "rollingmaintenance": {     "details": "Error starting rolling maintenance: Stage: PreFlight failed on host 86c0b59f-89de-40db-9b30-251f851e869f:  null",     "hostsskipped": [],     "hostsupdated": [],     "success": false   } } Pass
11 In absence of Maintenance script on a host, that host will be skipped On a single host, make sure that there is no script named “Maintenance”, “Maintenance.sh” or “Maintenance.py” present in the configured script folder and execute the rolling maintenance call against this host and another one   The first host will be skipped and a proper message is shown (localcloud) SBCM5> > start rollingmaintenance hostids=86c0b59f-89de-40db-9b30-251f851e869f,ef10dacd-ac4e-4ec0-bc8d-7fb5bb461c9d {   "rollingmaintenance": {     "details": "OK",     "hostsskipped": [       {         "hostid": "86c0b59f-89de-40db-9b30-251f851e869f",         "hostname": "ref-trl-711-k-M7-apanic-kvm1",         "reason": "There is no maintenance script on the host"       }     ],     "hostsupdated": [       {         "enddate": "2020-03-11'T'17:58:58+00:00",         "hostid": "ef10dacd-ac4e-4ec0-bc8d-7fb5bb461c9d",         "hostname": "ref-trl-711-k-M7-apanic-kvm2",         "output": "",         "startdate": "2020-03-11'T'17:58:18+00:00"       }     ],     "success": true   } } Pass
12 Capacity checks are also done before putting host into Maintenance Before executing rolling maintenance on host1, make sure to, out of 6 hosts, disable hosts 2,3,4,5, while host 6 is NOT disabled and does have enough capacities for VMs that exist on host1 – the capacity checks during PreFligh stage will not fail.   On host 1 make sure the PreMaintenance script has the equivalent of “sleep 30” command inside it, so the script will take at least 30 seconds to execute (PreFlight capacity checks have completed by now and there is host6 with enough capacities to host VMs from host1) and there is enough time for test-operator to go and disable host6 during those 30 seconds.   Execute “tail -f the /var/log/cloudstack/agent/rolling-maintenance.log” on the host1. When the line “Executing script: /root/scripts/PreMaintenance.sh for stage: PreMaintenance” appears (your script location might be different, as well as script extension) Quickly go and disable host6 during those 30 seconds. Observe that the rolling maintenance call will fail (localcloud) SBCM5> > start rollingmaintenance hostids=86c0b59f-89de-40db-9b30-251f851e869f {   "rollingmaintenance": {     "details": "Error starting rolling maintenance: No host available in cluster a0c249d2-e020-4f2b-ab9c-1e05bbe68b64 (p1-c1) to support host 86c0b59f-89de-40db-9b30-251f851e869f (ref-trl-711-k-M7-apanic-kvm1) in maintenance",     "hostsskipped": [],     "hostsupdated": [],     "success": false   } } (localcloud) SBCM5> > Pass
13 When a stage does not contain a script for execution, it is skipped On a single host, make sure that there is no script named “PreMaintenance”, “PreMaintenance.sh” or “PreMaintenance.py” present in the configured script folder and execute the rolling maintenance call against this host.   On the KVM host, observe that the lines in “rolling-maintenance.log” do not contain PreMaintenance script, but all the other scripts/stages have run normally grep "Executing script" /var/log/cloudstack/agent/rolling-maintenance.log   18:42:37,527 rolling-maintenance INFO Executing script: /root/scripts/PreFlight for stage: PreFlight 18:43:47,961 rolling-maintenance INFO Executing script: /root/scripts/Maintenance for stage: Maintenance 18:43:58,133 rolling-maintenance INFO Executing script: /root/scripts/PostMaintenance.sh for stage: PostMaintenance Pass
14 Execute rolling maintenance against the whole zone Ensure to have at least 2 clusters in a zone. Perform rolling maintenance of the whole zone.   Observe that clusters are processed one after the another – first all host from the first cluster, then all hosts from the second cluster   NOTE: in these tests, we have remove kvm4/kvm5/kvm6 hosts from the first cluster and added them to the new cluster (in order of kvm6/kvm5/kvm4).   Expected order of clusters/hosts processed is: - Cluster1 (p1-c1 in our case) à host kvm1/kvm2/kvm3 - Cluster2 (cluster2 in our case) à kvm6/kvm5/kvm4 (since the hosts within a cluster are processed by the order as they appear in the DB) Observe the “startdate” reported by the API, that confirms all hosts across both clusters are processed in the expected order:   (localcloud) SBCM5> > start rollingmaintenance zoneids=ce831d12-c2df-4b11-bec9-684dcc292c18 {   "rollingmaintenance": {     "details": "OK",     "hostsskipped": [],     "hostsupdated": [       {         "enddate": "2020-03-11'T'20:06:09+00:00",         "hostid": "86c0b59f-89de-40db-9b30-251f851e869f",         "hostname": "ref-trl-711-k-M7-apanic-kvm1",         "output": "",         "startdate": "2020-03-11'T'20:05:28+00:00"       },       {         "enddate": "2020-03-11'T'20:08:09+00:00",         "hostid": "ef10dacd-ac4e-4ec0-bc8d-7fb5bb461c9d",         "hostname": "ref-trl-711-k-M7-apanic-kvm2",         "output": "",         "startdate": "2020-03-11'T'20:06:29+00:00"       },       {         "enddate": "2020-03-11'T'20:10:10+00:00",         "hostid": "fcc8b96e-1c29-492e-a074-96babec70ecc",         "hostname": "ref-trl-711-k-M7-apanic-kvm3",         "output": "",         "startdate": "2020-03-11'T'20:08:30+00:00"       },       {         "enddate": "2020-03-11'T'20:12:11+00:00",         "hostid": "4a732078-2f5d-4bf1-8425-2135004a6b1a",         "hostname": "ref-trl-711-k-M7-apanic-kvm6",         "output": "",         "startdate": "2020-03-11'T'20:11:01+00:00"       },       {   
      "enddate": "2020-03-11'T'20:13:12+00:00",         "hostid": "8f27f11a-9c60-4c30-8622-0e1bce718adc",         "hostname": "ref-trl-711-k-M7-apanic-kvm5",         "output": "",         "startdate": "2020-03-11'T'20:12:32+00:00"       },       {         "enddate": "2020-03-11'T'20:14:13+00:00",         "hostid": "adbbfc34-9369-4a15-93dc-7ed85756c24e",         "hostname": "ref-trl-711-k-M7-apanic-kvm4",         "output": "",         "startdate": "2020-03-11'T'20:13:33+00:00"       }     ],     "success": true   } } Pass
15 Execute rolling maintenance against hosts from different clusters/zones While having multiple zones, execute the rolling maintenance by specifying at least hosts from different  zones (localcloud) SBCM5> > start rollingmaintenance hostids=86c0b59f-89de-40db-9b30-251f851e869f,b0f54409-4874-4573-9c24-8efac5b07f6f {   "rollingmaintenance": {     "details": "OK",     "hostsskipped": [],     "hostsupdated": [       {         "enddate": "2020-03-12'T'12:33:04+00:00",         "hostid": "86c0b59f-89de-40db-9b30-251f851e869f",         "hostname": "ref-trl-711-k-M7-apanic-kvm1",         "output": "",         "startdate": "2020-03-12'T'12:32:24+00:00"       },       {         "enddate": "2020-03-12'T'12:35:15+00:00",         "hostid": "b0f54409-4874-4573-9c24-8efac5b07f6f",         "hostname": "ref-trl-714-k-M7-apanic-kvm1",         "output": "",         "startdate": "2020-03-12'T'12:33:35+00:00"       }     ],     "success": true   } } ( Pass
16 Execute rolling maintenance against multiple zones Having multiple zones, execute the rolling maintenance by specifying at least 2 zones, and notice that first all hosts in one zone will be processed (all hosts in a single cluster, then all hosts from another cluster), and only then the hosts from another zone (localcloud) SBCM5> > start rollingmaintenance zoneids=6f3c9827-6e99-4c63-b7d5-e8f427f6dcff,ce831d12-c2df-4b11-bec9-684dcc292c18 {   "rollingmaintenance": {     "details": "OK",     "hostsskipped": [],     "hostsupdated": [       {         "enddate": "2020-03-12'T'12:41:24+00:00",         "hostid": "86c0b59f-89de-40db-9b30-251f851e869f",         "hostname": "ref-trl-711-k-M7-apanic-kvm1",         "output": "",         "startdate": "2020-03-12'T'12:40:44+00:00"       },       {         "enddate": "2020-03-12'T'12:43:25+00:00",         "hostid": "ef10dacd-ac4e-4ec0-bc8d-7fb5bb461c9d",         "hostname": "ref-trl-711-k-M7-apanic-kvm2",         "output": "",         "startdate": "2020-03-12'T'12:41:45+00:00"       },       {         "enddate": "2020-03-12'T'12:45:26+00:00",         "hostid": "fcc8b96e-1c29-492e-a074-96babec70ecc",         "hostname": "ref-trl-711-k-M7-apanic-kvm3",         "output": "",         "startdate": "2020-03-12'T'12:43:46+00:00"       },       {         "enddate": "2020-03-12'T'12:47:27+00:00",         "hostid": "4a732078-2f5d-4bf1-8425-2135004a6b1a",         "hostname": "ref-trl-711-k-M7-apanic-kvm6",         "output": "",         "startdate": "2020-03-12'T'12:46:17+00:00"       },       {         "enddate": "2020-03-12'T'12:49:28+00:00",         "hostid": "8f27f11a-9c60-4c30-8622-0e1bce718adc",         "hostname": "ref-trl-711-k-M7-apanic-kvm5",         "output": "",         "startdate": "2020-03-12'T'12:47:48+00:00"       },       {         "enddate": "2020-03-12'T'12:51:29+00:00",         "hostid": "adbbfc34-9369-4a15-93dc-7ed85756c24e",         "hostname": "ref-trl-711-k-M7-apanic-kvm4",         "output": "",         "startdate": 
"2020-03-12'T'12:49:48+00:00"       },       {         "enddate": "2020-03-12'T'12:53:00+00:00",         "hostid": "59159ade-f5c3-4606-9174-e501301f59d4",         "hostname": "ref-trl-714-k-M7-apanic-kvm3",         "output": "",         "startdate": "2020-03-12'T'12:52:19+00:00"       },       {         "enddate": "2020-03-12'T'12:54:00+00:00",         "hostid": "b0f54409-4874-4573-9c24-8efac5b07f6f",         "hostname": "ref-trl-714-k-M7-apanic-kvm1",         "output": "",         "startdate": "2020-03-12'T'12:53:20+00:00"       },       {         "enddate": "2020-03-12'T'12:55:01+00:00",         "hostid": "02228e26-a0d6-4607-824d-501ae5ac8dab",         "hostname": "ref-trl-714-k-M7-apanic-kvm2",         "output": "",         "startdate": "2020-03-12'T'12:54:21+00:00"       }     ],     "success": true   } } Pass

@DaanHoogland (Contributor) left a review:

LGTM after several iterations of code review (methods can still be smaller/less complex).
Reviewed the test scheme by @andrijapanicsb; looks good.

@DaanHoogland DaanHoogland merged commit efe00aa into apache:master Mar 12, 2020
@DaanHoogland DaanHoogland deleted the kvm-rolling-maintenance branch March 12, 2020 15:59