Conversation

@nvazquez (Contributor) commented Sep 25, 2019:

Description

Feature Specification: https://cwiki.apache.org/confluence/display/CLOUDSTACK/%5BKVM%5D+Rolling+Maintenance+of+hosts

This feature allows automating the upgrade/patch process of KVM hosts within a zone, pod or cluster by executing custom scripts.

In a typical scenario prior to this feature, the administrator needed to automate the process of putting hosts into maintenance before performing the upgrade on each host. This is commonly achieved using external automation tools.

This feature allows administrators to perform the automation process within CloudStack, providing a flexible framework in which custom scripts can be defined and executed on each host. CloudStack executes these scripts within the context of stages. The feature defines four stages for a host in the rolling maintenance process:

  • Pre-flight: The pre-flight script is run on all hosts as part of the pre-flight checks carried out before commencing the rolling maintenance. If a pre-flight check script returns an error on any host, rolling maintenance is cancelled with no actions taken, and an error is returned. If no pre-flight scripts are defined, no checks are performed on the hosts.

  • Pre-maintenance: The pre-maintenance script runs before a specific host is put into maintenance. If no pre-maintenance script is defined, or if the pre-flight script on a given host determines that no pre-maintenance is required on that host, no pre-maintenance actions are performed, and the management server moves straight to putting the host into maintenance, followed by requesting the agent to run the maintenance script.

  • Maintenance: The maintenance script runs after a host has been put into maintenance. If no maintenance script is defined, or if the pre-flight or pre-maintenance scripts on a given host determine that no maintenance is required on that host, the host will not be put into maintenance; the completion of the pre-maintenance scripts signals the end of all maintenance tasks, and the KVM agent hands the host back to the management server. Once the maintenance scripts have signalled that they have completed, the host agent signals to the management server that the maintenance tasks have finished and the host is therefore ready to exit maintenance mode; any information which was collected (such as processing times) is returned to the management server.

  • Post-maintenance: The post-maintenance script is expected to perform validations after the host exits maintenance. These scripts help to detect any problems during the maintenance process, including reboots or restarts within scripts.
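The hook-script contract can be sketched with a minimal pre-maintenance hook. The exit-code semantics are taken from the test runs below (0 = success, other non-zero codes = stage failure, and 70 = skip the Maintenance stage on this host); the `needs_maintenance` function and the `PENDING_UPDATES` variable are hypothetical stand-ins for a real host check:

```shell
#!/bin/bash
# Illustrative PreMaintenance hook logic. Exit code 70 (seen in the test
# runs below) tells CloudStack to skip the Maintenance stage on this host;
# 0 means proceed; any other code fails the stage.

premaintenance() {
  # Hypothetical check: treat a non-empty PENDING_UPDATES variable as
  # "this host still needs maintenance work".
  if [ -z "${PENDING_UPDATES:-}" ]; then
    return 70   # nothing to do -> skip the Maintenance stage
  fi
  # ...real preparation would go here (e.g. draining local services)...
  return 0      # proceed: the host will be put into maintenance next
}

# Demonstrate both outcomes in subshells so the exit codes are visible:
( premaintenance ) && rc=0 || rc=$?
echo "no updates pending -> exit $rc"
( PENDING_UPDATES=1; premaintenance ) && rc=0 || rc=$?
echo "updates pending -> exit $rc"
```

In a real deployment this logic would live in the stage script inside the configured hooks directory, one script per stage.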

The administrator is responsible for maintaining the hook scripts and copying them across all KVM hosts.

On all KVM hosts that will undergo rolling maintenance, a maintenance hooks directory must be defined in the agent.properties file.

Administrators must define only one script per stage. If a stage does not contain a script, it is skipped and processing continues with the next stage. Administrators are responsible for defining the scripts and copying them onto the hosts.

On the KVM hosts undergoing rolling maintenance, two script execution approaches are available:

  • Systemd service executor: This approach uses a systemd service to invoke a script execution. Once a script finishes executing, it writes its result to a file, which the agent reads and sends back to the management server.

  • Agent executor: The CloudStack agent invokes the script execution within its own JVM. If the agent is stopped or restarted, the management server will assume the stage was completed when the agent reconnects. This approach does not keep state in a file.
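The file-based handshake used by the systemd executor can be sketched in a few lines of shell. This is illustrative only: the real executor ships with the agent, and the "stage,exitcode" line format is an assumption (the file name mirrors the `rolling-maintenance-results` path visible in the agent logs in the test section):

```shell
#!/bin/bash
# Illustrative sketch: a wrapper runs a stage script and persists the
# outcome to a results file, so the result survives an agent restart.
# The "stage,exitcode" line format is an assumption for illustration.

RESULTS_FILE=$(mktemp)   # in production: a fixed rolling-maintenance-results path

run_stage() {
  local stage="$1" script="$2" rc=0
  "$script" || rc=$?
  echo "$stage,$rc" > "$RESULTS_FILE"   # the agent later reads this file
}

# Demo with a throwaway hook that succeeds:
HOOK=$(mktemp)
printf '#!/bin/bash\nexit 0\n' > "$HOOK"
chmod +x "$HOOK"
run_stage PreFlight "$HOOK"
cat "$RESULTS_FILE"
```

Because the outcome is on disk rather than in the JVM, a crash or restart of the agent process does not lose the stage result, which is the key difference from the agent executor.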

The API command to commence rolling maintenance will allow for multiple hosts or clusters or pods or zones to be specified (though each type is mutually exclusive). Before commencing any rolling maintenance actions, pre-flight checks will be run. These fall into two categories:

  • State and capacity checks on the hosts and clusters, to verify that a successful run should be possible 'at this time'.

  • The pre-flight scripts on the hosts, which are created by the administrator to check that a successful run should be possible 'at this time' from the context of the specific actions of the scripts (e.g. checking that each host can access the yum repository).

If maintenance scripts have been defined, then prior to running any scripts on a host, the capacity within the cluster to put that host into maintenance is re-checked. If there is not enough capacity in the cluster for the host to successfully go into maintenance, rolling maintenance stops immediately and an error is written to the logs.

Given that compute demands on any cluster are dynamic (i.e. virtual machines can be started, stopped, or created at any time), a cluster is disabled once the pre-flight checks have completed successfully, and re-enabled upon completion of the rolling maintenance on the cluster, or upon a failure during the maintenance of a host, to minimise the impact on end users.
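Putting the pieces together, the per-cluster flow described above can be outlined as follows. This is a sketch only: the actual orchestration lives in the management server (in Java), and the helper functions here are stubs that simulate three hosts:

```shell
#!/bin/bash
# Sketch of the per-cluster rolling maintenance flow. All helpers are
# stubs for illustration; the real logic runs in the management server.
hosts_in_cluster()   { echo kvm1 kvm2 kvm3; }
check_capacity_for() { true; }                 # stand-in for the per-host capacity re-check
run_stages_on()      { echo "maintained $1"; } # pre-maintenance -> maintenance -> post

rolling_maintenance_cluster() {
  echo "cluster disabled"        # freeze new allocations for the duration
  for host in $(hosts_in_cluster); do
    if ! check_capacity_for "$host"; then
      echo "cluster re-enabled"  # restore the cluster even on failure
      return 1
    fi
    run_stages_on "$host"
  done
  echo "cluster re-enabled"
}

rolling_maintenance_cluster
```

The important property mirrored here is that the cluster is re-enabled on both the success and the failure paths, so a mid-run error does not leave the cluster disabled.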

Management server

A new API method is created to start the automated rolling maintenance process on hosts, ‘startRollingMaintenance’, with the following parameters:

  • ‘hostid’, ‘clusterid’, ‘podid’ and ‘zoneid’ are mutually exclusive; exactly one of them must be passed.
  • ‘forced’: false by default. When enabled, does not stop iterating through hosts in case of any error in the rolling maintenance process.
  • ‘timeout’: defines a timeout in seconds for a stage to be completed on a host.
  • ‘payload’: extra parameters to be passed to the scripts.
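The mutual-exclusivity rule for the scoping parameters can be illustrated with a small validation routine. This is a sketch, not the server-side code; note also that the CloudMonkey invocations in the test section below use the plural forms `hostids`, `clusterids` and `zoneids`:

```shell
#!/bin/bash
# Sketch of the scoping rule for startRollingMaintenance: exactly one of
# hostid / clusterid / podid / zoneid must be supplied (illustrative only).

validate_scope() {
  local count=0 p
  for p in "$@"; do
    if [ -n "$p" ]; then
      count=$((count + 1))
    fi
  done
  if [ "$count" -ne 1 ]; then
    echo "error: pass exactly one of hostid, clusterid, podid, zoneid" >&2
    return 1
  fi
  return 0
}

# One scope -> accepted; two scopes -> rejected:
validate_scope "host-uuid" "" "" ""          && echo "host scope accepted"
validate_scope "host-uuid" "" "" "zone-uuid" 2>/dev/null || echo "two scopes rejected"
```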

KVM hosts

Two new properties must be set in the agent.properties file:

  • ‘rolling.maintenance.hooks.dir’: Pointing to the directory in which the custom scripts are defined
  • ‘rolling.maintenance.service.executor.disabled’: false by default. When enabled, the service execution is disabled, using the CloudStack agent as the scripts’ executor.
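A minimal agent.properties fragment might look like this (the hooks directory value matches the /root/scripts path used in the test runs below; adjust it per host):

```properties
# Rolling maintenance settings in agent.properties
rolling.maintenance.hooks.dir=/root/scripts

# false (the default): scripts run via the systemd service executor.
# true: scripts run inside the CloudStack agent JVM instead.
rolling.maintenance.service.executor.disabled=false
```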

A new systemd service is defined to handle script execution. This service is started by the CloudStack agent when executing a script, allowing the script to run outside of the JVM in which the CloudStack agent runs. With this approach, script execution is not terminated if the CloudStack agent is terminated, since the two processes are independent. The service invokes an executor script which simply invokes the custom script at a given path.
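For illustration, a templated unit along these lines could back the `cloudstack-rolling-maintenance@` instances visible in the agent logs in the test section. The ExecStart path is an assumption; as the logs show, the agent systemd-escapes a "stage,script,timeout,results-file,output-file" tuple into the instance name:

```ini
# /etc/systemd/system/cloudstack-rolling-maintenance@.service (sketch;
# the real unit ships with the agent packages)
[Unit]
Description=CloudStack rolling maintenance executor (%i)

[Service]
Type=oneshot
# %i carries the systemd-escaped stage/script/timeout/result parameters;
# the executor script (path assumed here) unescapes them and runs the hook.
ExecStart=/usr/share/cloudstack-agent/rolling-maintenance-executor %i
```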

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Screenshots (if appropriate):


How Has This Been Tested?

@nvazquez nvazquez self-assigned this Sep 25, 2019
@nvazquez nvazquez changed the title KVM: Rolling maintenance [WIP DO NOT MERGE] KVM: Rolling maintenance Sep 25, 2019
@nvazquez nvazquez added this to the 4.14.0.0 milestone Sep 25, 2019
@rohityadavcloud (Member):

@nvazquez do you have a doc PR or spec for this feature?

@rohityadavcloud (Member):

@blueorangutan package

@blueorangutan:

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan:

Packaging result: ✖centos6 ✖centos7 ✔debian. JID-293

@rohityadavcloud (Member):

@nvazquez can you check the build/pkg failure?

@rohityadavcloud (Member):

@blueorangutan package

@blueorangutan:

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan:

Packaging result: ✖centos6 ✔centos7 ✔debian. JID-297

@nvazquez (Contributor, PR author):

@rhtyd please find the description updated

@ravening (Member) commented Oct 3, 2019:

@nvazquez Do you have a document which explains the algorithm for how it performs upgrades on the hosts, and also the error scenarios when one host can't be upgraded?

We have a similar application to perform all these tasks, plus a centralized dashboard to control upgrading hypervisors on all platforms in different locations.

Update:
Nvm, I went through the code and understood what it's doing. It enables maintenance mode on each individual hypervisor. This can be really bad in most cases because, in the worst case, a VM can be migrated multiple times.

@rohityadavcloud (Member):

@nvazquez do you have a doc PR, or wiki spec link? I'll probably do a full review next week (sorry busy this week)
@blueorangutan package

@nvazquez nvazquez force-pushed the kvm-rolling-maintenance branch 2 times, most recently from 59f38f9 to 26fedcc Compare November 14, 2019 04:44
@nvazquez nvazquez force-pushed the kvm-rolling-maintenance branch 4 times, most recently from 67ebb44 to b1baab4 Compare November 24, 2019 22:30
@nvazquez nvazquez force-pushed the kvm-rolling-maintenance branch from 7a59c9d to a40bca0 Compare December 5, 2019 13:14
@nvazquez nvazquez force-pushed the kvm-rolling-maintenance branch from a40bca0 to 196d7ae Compare December 20, 2019 18:36
@nvazquez nvazquez force-pushed the kvm-rolling-maintenance branch from 196d7ae to 3d4d7d4 Compare January 2, 2020 17:37
@nvazquez nvazquez force-pushed the kvm-rolling-maintenance branch from 791e9ad to 50a3a2b Compare March 9, 2020 05:58
@nvazquez (Contributor, PR author) commented Mar 9, 2020:

Done @rhtyd
@blueorangutan package

@blueorangutan:

@nvazquez a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan:

Packaging result: ✖centos6 ✔centos7 ✔debian. JID-1027

@nvazquez (Contributor, PR author) commented Mar 9, 2020:

@blueorangutan test

@blueorangutan:

@nvazquez a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan:

Trillian test result (tid-1218)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 34030 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3610-t1218-kvm-centos7.zip
Smoke tests completed. 83 look OK, 0 have error(s)
Only failed test results are shown below:

Test Result Time (s) Test File

@nvazquez (Contributor, PR author):

@blueorangutan package


@blueorangutan:

@nvazquez a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan:

Packaging result: ✖centos6 ✔centos7 ✔debian. JID-1040

@andrijapanicsb (Contributor):

@blueorangutan test

@blueorangutan:

@andrijapanicsb a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan:

Trillian test result (tid-1233)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 32742 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3610-t1233-kvm-centos7.zip
Smoke tests completed. 83 look OK, 0 have error(s)
Only failed test results are shown below:

Test Result Time (s) Test File

@andrijapanicsb andrijapanicsb self-requested a review March 12, 2020 11:55
@andrijapanicsb (Contributor) left a review:

LGTM after extensive testing.

@andrijapanicsb (Contributor):

TestID | Test Name | Steps | Expected Result | Status
1 Scenario where the PreMaintenance script “informs” CloudStack that the Maintenance stage should not be done On one of the hosts, make sure that the PreMaintenance.sh script exits with code 70, i.e.:   #!/bin/bash echo "############# This is PreMaintenance script" exit 70   Run the rolling maintenance API against that host and confirm that no Maintenance stage will be executed Confirm that the maintenance was skipped on the host:   (localcloud) SBCM5> > start rollingmaintenance hostids=0bffbfcb-fc6f-4c6d-9604-35e494439a33 {   "rollingmaintenance": {     "details": "OK",     "hostsskipped": [       {         "hostid": "0bffbfcb-fc6f-4c6d-9604-35e494439a33",         "hostname": "ref-trl-478-k-M7-apanic-kvm3",         "reason": "Pre-maintenance stage set to avoid maintenance"       }     ],     "hostsupdated": [],     "success": true   } }   Confirm that the rolling-maintenance.log on the KVM host confirms no script was run after the “PreMaintenance.sh” script:   root@ref-trl-478-k-M7-apanic-kvm3:~/scripts# grep -ir "INFO Executing script" /var/log/cloudstack/agent/rolling-maintenance.log   11:34:53,381 rolling-maintenance INFO Executing script: /root/scripts/PreFlight.sh for stage: PreFlight 11:35:03,662 rolling-maintenance INFO Executing script: /root/scripts/PreMaintenance.sh for stage: PreMaintenance Pass
2 Confirm that the “forced” parameter doesn’t influence the scenario where the PreMaintenance script “informs” CloudStack that the Maintenance stage should not be done On one of the hosts, make sure that the PreMaintenance.sh scripts exits with code 70, i.e.:   #!/bin/bash echo "############# This is PreMaintenance script" exit 70   Run the rolling maintenance api against that host with the “forced=true” parameter and confirm that no Maintenance stage will be executed (localcloud) SBCM5> > start rollingmaintenance hostids=0bffbfcb-fc6f-4c6d-9604-35e494439a33 forced=true {   "rollingmaintenance": {     "details": "OK",     "hostsskipped": [       {         "hostid": "0bffbfcb-fc6f-4c6d-9604-35e494439a33",         "hostname": "ref-trl-478-k-M7-apanic-kvm3",         "reason": "Pre-maintenance stage set to avoid maintenance"       }     ],     "hostsupdated": [],     "success": true   } }   Confirm that the rolling-maintenance.log on KVM host confirms no script was run after the “PreMaintenance.sh” script:   root@ref-trl-478-k-M7-apanic-kvm3:~/scripts# grep -ir "INFO Executing script" /var/log/cloudstack/agent/rolling-maintenance.log 11:42:02,602 rolling-maintenance INFO Executing script: /root/scripts/PreFlight.sh for stage: PreFlight 11:42:13,1 rolling-maintenance INFO Executing script: /root/scripts/PreMaintenance.sh for stage: PreMaintenance Pass
3 Confirm that a single stage failure DOES cause abortion of the rolling maintenance API call against the rest of the cluster, when “forced” parameter NOT specified On the first hosts in the cluster, make sure that the PreMaintenance.sh scripts exits with code 1, i.e.:   #!/bin/bash echo "############# This is PreMaintenance script" exit 1   Run the rolling maintenance API against the cluster and observe that since there is a failure on the first host, no other hosts will be attempted for maintenance ((localcloud) SBCM5> > start rollingmaintenance clusterids=333fc22b-7189-4f67-a691-d95051a1b0f5 {   "rollingmaintenance": {     "details": "Error starting rolling maintenance: Stage: PreMaintenance failed on host d55eb212-357a-41fa-9bd3-99b5492b94d2: ############################## This is PreMaintenance script\n",     "hostsskipped": [],     "hostsupdated": [],     "success": false   } } Pass
4 Confirm that a single stage failure on one host does NOT cause abortion of the rolling maintenance API against the cluster, when “forced” parameter is specified On the first hosts in the cluster, make sure that the PreMaintenance.sh scripts exits with code 1, i.e.:   #!/bin/bash echo "############# This is PreMaintenance script" exit 1   Run the rolling maintenance API against the cluster and observe that even though there is a failure on the first host, other (2) hosts have been processed for Maintenance (localcloud) SBCM5> > start rollingmaintenance clusterids=333fc22b-7189-4f67-a691-d95051a1b0f5 forced=true {   "rollingmaintenance": {     "details": "Error starting rolling maintenance: Maintenance state expected, but got ErrorInPrepareForMaintenance",     "hostsskipped": [       {         "hostid": "d55eb212-357a-41fa-9bd3-99b5492b94d2",         "hostname": "ref-trl-478-k-M7-apanic-kvm1",         "reason": "Pre-maintenance script failed: ############################## This is PreMaintenance script\n"       }     ],     "hostsupdated": [       {         "enddate": "2020-01-23'T'18:42:57+00:00",         "hostid": "b0220ceb-c302-4102-b4eb-f12a72f9769b",         "hostname": "ref-trl-478-k-M7-apanic-kvm2",         "startdate": "2020-01-23'T'18:42:37+00:00"       }     ],         "enddate": "2020-01-23'T'18:44:17+00:00",         "hostid": "b0220ceb-c302-4102-b4eb-f12a72f9769b",         "hostname": "ref-trl-478-k-M7-apanic-kvm3",         "startdate": "2020-01-23'T'18:43:10+00:00"       }     ]     "success": false   } } Pass
5 Confirm that the scripts are executed by the agent Make sure that the agent.properties contains the setting to disable the service mode executor:   rolling.maintenance.service.executor.disabled=true Restart cloudstack-agent, start the rolling maintenance of a single host and observe that the agent.log on the KVM host doesn’t mention the script being executed by systemd (systemctl) 2020-03-11 16:59:18,670 INFO  [resource.wrapper.LibvirtRollingMaintenanceCommandWrapper] (agentRequest-Handler-2:null) (logid:9c201ba1) Processing stage PreFlight 2020-03-11 16:59:18,670 INFO  [rolling.maintenance.RollingMaintenanceAgentExecutor] (agentRequest-Handler-2:null) (logid:9c201ba1) Executing stage: PreFlight script: /root/scripts/PreFlight 2020-03-11 16:59:18,671 DEBUG [rolling.maintenance.RollingMaintenanceAgentExecutor] (agentRequest-Handler-2:null) (logid:9c201ba1) Executing: /root/scripts/PreFlight 2020-03-11 16:59:18,673 DEBUG [rolling.maintenance.RollingMaintenanceAgentExecutor] (agentRequest-Handler-2:null) (logid:9c201ba1) Executing while with timeout : 1800000 2020-03-11 16:59:18,675 DEBUG [rolling.maintenance.RollingMaintenanceAgentExecutor] (agentRequest-Handler-2:null) (logid:9c201ba1) Execution is successful. 2020-03-11 16:59:18,680 INFO  [rolling.maintenance.RollingMaintenanceAgentExecutor] (agentRequest-Handler-2:null) (logid:9c201ba1) Execution finished for stage: PreFlight script: /root/scripts/PreFlight : 0 Pass
6 Confirm that the scripts are executed by the service executor Make sure that the agent.properties contains the setting to enable the service mode executor (the default):   rolling.maintenance.service.executor.disabled=false Restart cloudstack-agent, start the rolling maintenance of a single host and observe that the agent.log on the KVM host mentions systemd being invoked 2020-03-11 17:04:03,168 INFO  [resource.wrapper.LibvirtRollingMaintenanceCommandWrapper] (agentRequest-Handler-3:null) (logid:525cb47d) Processing stage PreFlight 2020-03-11 17:04:03,168 DEBUG [rolling.maintenance.RollingMaintenanceServiceExecutor] (agentRequest-Handler-3:null) (logid:525cb47d) Invoking rolling maintenance service for stage: PreFlight and file /root/scripts/PreFlight with action: start 2020-03-11 17:04:03,170 DEBUG [utils.script.Script] (agentRequest-Handler-3:null) (logid:525cb47d) Executing: /bin/bash -c systemd-escape 'PreFlight,/root/scripts/PreFlight,1800,/root/scripts/rolling-maintenance-results,/root/scripts/rolling-maintenance-output' 2020-03-11 17:04:03,171 DEBUG [utils.script.Script] (agentRequest-Handler-3:null) (logid:525cb47d) Executing while with timeout : 3600000 2020-03-11 17:04:03,176 DEBUG [utils.script.Script] (agentRequest-Handler-3:null) (logid:525cb47d) Execution is successful. 2020-03-11 17:04:03,177 DEBUG [rolling.maintenance.RollingMaintenanceServiceExecutor] (agentRequest-Handler-3:null) (logid:525cb47d) Executing: /bin/systemctl start cloudstack-rolling-maintenance@PreFlight\x2c-root-scripts-PreFlight\x2c1800\x2c-root-scripts-rolling\x2dmaintenance\x2dresults\x2c-root-scripts-rolling\x2dmaintenance\x2doutput Pass
7 Confirm capacity checks are in place Out of 6 hosts, disable hosts 2,3,4,5, dedicate a single host (host6) to an account, and then execute the rolling maintenance against the remaining 1 host (host 1) which is hosting VMS that does NOT belong the account for which host6 was dedicated   Having NO VMs on the host1, host1 can be put into maintenance as there are no capacities needed to be available elsewhere, as no VM will be migrated away from host1 (localcloud) SBCM5> > start rollingmaintenance hostids=86c0b59f-89de-40db-9b30-251f851e869f {   "rollingmaintenance": {     "details": "OK",     "hostsskipped": [],     "hostsupdated": [       {         "enddate": "2020-03-11'T'17:21:38+00:00",         "hostid": "86c0b59f-89de-40db-9b30-251f851e869f",         "hostname": "ref-trl-711-k-M7-apanic-kvm1",         "output": "null ",         "startdate": "2020-03-11'T'17:21:07+00:00"       }     ],     "success": true   } } Pass
8 Confirm capacity checks are in place Out of 6 hosts, disable hosts 2,3,4,5, dedicate a single host (host6) to an account, and then execute the rolling maintenance against the remaining 1 host (host 1) which is hosting VMS that does NOT belong the account for which host6 was dedicated   Having at least 1 VM on host1, try to execute rolling maintenance against it -it will fail as there are no free hosts (non-disabled, non-dedicated) that can host VMs from host1. Due to the nature of putting the host into the Maintenance mode, after the first attempt is failed, management server will retry to migrate VMs away for 5 times and will then fail permanently (give up) while the startRollingMaintenance API call will be running until the timeout defined by “kvm.rolling.maintenance.wait.maintenance.timeout                  ” is reached, which defaults to 1800 seconds, after which the API call will fail as well.   The failure to migrate VMs away can be observed in the management-server.log   2020-03-11 17:24:25,761 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-13:ctx-f1d47d29 job-1214/job-1215 ctx-bfd95fa0) (logid:7307421c) No suitable hosts found 2020-03-11 17:24:25,761 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-13:ctx-f1d47d29 job-1214/job-1215 ctx-bfd95fa0) (logid:7307421c) No suitable hosts found under this Cluster: 1 2020-03-11 17:24:25,761 DEBUG [c.c.d.DeploymentPlanningManagerImpl] (Work-Job-Executor-13:ctx-f1d47d29 job-1214/job-1215 ctx-bfd95fa0) (logid:7307421c) Could not find suitable Deployment Destination for this VM under any clusters, returning. 2020-03-11 17:24:25,761 DEBUG [c.c.d.FirstFitPlanner] (Work-Job-Executor-13:ctx-f1d47d29 job-1214/job-1215 ctx-bfd95fa0) (logid:7307421c) Searching resources only under specified Cluster: 1 2020-03-11 17:24:25,762 DEBUG [c.c.d.FirstFitPlanner] (Work-Job-Executor-13:ctx-f1d47d29 job-1214/job-1215 ctx-bfd95fa0) (logid:7307421c) The specified cluster is in avoid set, returning. 
2020-03-11 17:24:25,762 DEBUG [c.c.v.VirtualMachineManagerImpl] (Work-Job-Executor-13:ctx-f1d47d29 job-1214/job-1215 ctx-bfd95fa0) (logid:7307421c) Unable to find destination for migrating the vm VM[User|i-2-27-VM] Pass
9 The pre-flight scripts must be executed on each host before any maintenance actions Start the cluster-level rolling maintenance and “tail -f” the /var/log/cloudstack/agent/rolling-maintenance.log on each KVM host simultaneously – observe that the PreFlight stage/script is executed on all hosts in the cluster before the PreMaintenance stage/script is executed on the first host in the cluster:   17:05:56,710 rolling-maintenance INFO Successful execution of /root/scripts/PreFlight Pass
10 Failure of PreFlight check on hosts halts the API when forced=false is set On the first host in the cluster, make sure that the PreFlight script exits with code 1, i.e.:   #!/bin/bash echo "############# This is PreMaintenance script" exit 1   Run the rolling maintenance API against the cluster and observe that since there is a failure on the first host, no other hosts will be attempted for maintenance (localcloud) SBCM5> > start rollingmaintenance clusterids=a0c249d2-e020-4f2b-ab9c-1e05bbe68b64 {   "rollingmaintenance": {     "details": "Error starting rolling maintenance: Stage: PreFlight failed on host 86c0b59f-89de-40db-9b30-251f851e869f:  null",     "hostsskipped": [],     "hostsupdated": [],     "success": false   } } Pass
11 In absence of Maintenance script on a host, that host will be skipped On a single host, make sure that there is no script named “Maintenance”, “Maintenance.sh” or “Maintenance.py” present in the configured script folder and execute the rolling maintenance call against this host and another one   The first host will be skipped and a proper message is shown (localcloud) SBCM5> > start rollingmaintenance hostids=86c0b59f-89de-40db-9b30-251f851e869f,ef10dacd-ac4e-4ec0-bc8d-7fb5bb461c9d {   "rollingmaintenance": {     "details": "OK",     "hostsskipped": [       {         "hostid": "86c0b59f-89de-40db-9b30-251f851e869f",         "hostname": "ref-trl-711-k-M7-apanic-kvm1",         "reason": "There is no maintenance script on the host"       }     ],     "hostsupdated": [       {         "enddate": "2020-03-11'T'17:58:58+00:00",         "hostid": "ef10dacd-ac4e-4ec0-bc8d-7fb5bb461c9d",         "hostname": "ref-trl-711-k-M7-apanic-kvm2",         "output": "",         "startdate": "2020-03-11'T'17:58:18+00:00"       }     ],     "success": true   } } Pass
12 Capacity checks are also done before putting host into Maintenance Before executing rolling maintenance on host1, make sure to, out of 6 hosts, disable hosts 2,3,4,5, while host 6 is NOT disabled and does have enough capacities for VMs that exist on host1 – the capacity checks during PreFligh stage will not fail.   On host 1 make sure the PreMaintenance script has the equivalent of “sleep 30” command inside it, so the script will take at least 30 seconds to execute (PreFlight capacity checks have completed by now and there is host6 with enough capacities to host VMs from host1) and there is enough time for test-operator to go and disable host6 during those 30 seconds.   Execute “tail -f the /var/log/cloudstack/agent/rolling-maintenance.log” on the host1. When the line “Executing script: /root/scripts/PreMaintenance.sh for stage: PreMaintenance” appears (your script location might be different, as well as script extension) Quickly go and disable host6 during those 30 seconds. Observe that the rolling maintenance call will fail (localcloud) SBCM5> > start rollingmaintenance hostids=86c0b59f-89de-40db-9b30-251f851e869f {   "rollingmaintenance": {     "details": "Error starting rolling maintenance: No host available in cluster a0c249d2-e020-4f2b-ab9c-1e05bbe68b64 (p1-c1) to support host 86c0b59f-89de-40db-9b30-251f851e869f (ref-trl-711-k-M7-apanic-kvm1) in maintenance",     "hostsskipped": [],     "hostsupdated": [],     "success": false   } } (localcloud) SBCM5> > Pass
13 When a stage does not contain a script for execution, it is skipped On a single host, make sure that there is no script named “PreMaintenance”, “PreMaintenance.sh” or “PreMaintenance.py” present in the configured script folder and execute the rolling maintenance call against this host.   On the KVM host, observe that the lines in “rolling-maintenance.log” do not contain PreMaintenance script, but all the other scripts/stages have run normally grep "Executing script" /var/log/cloudstack/agent/rolling-maintenance.log   18:42:37,527 rolling-maintenance INFO Executing script: /root/scripts/PreFlight for stage: PreFlight 18:43:47,961 rolling-maintenance INFO Executing script: /root/scripts/Maintenance for stage: Maintenance 18:43:58,133 rolling-maintenance INFO Executing script: /root/scripts/PostMaintenance.sh for stage: PostMaintenance Pass
14 Execute rolling maintenance against the whole zone Ensure to have at least 2 clusters in a zone. Perform rolling maintenance of the whole zone.   Observe that clusters are processed one after the another – first all host from the first cluster, then all hosts from the second cluster   NOTE: in these tests, we have remove kvm4/kvm5/kvm6 hosts from the first cluster and added them to the new cluster (in order of kvm6/kvm5/kvm4).   Expected order of clusters/hosts processed is: - Cluster1 (p1-c1 in our case) à host kvm1/kvm2/kvm3 - Cluster2 (cluster2 in our case) à kvm6/kvm5/kvm4 (since the hosts within a cluster are processed by the order as they appear in the DB) Observe the “startdate” reported by the API, that confirms all hosts across both clusters are processed in the expected order:   (localcloud) SBCM5> > start rollingmaintenance zoneids=ce831d12-c2df-4b11-bec9-684dcc292c18 {   "rollingmaintenance": {     "details": "OK",     "hostsskipped": [],     "hostsupdated": [       {         "enddate": "2020-03-11'T'20:06:09+00:00",         "hostid": "86c0b59f-89de-40db-9b30-251f851e869f",         "hostname": "ref-trl-711-k-M7-apanic-kvm1",         "output": "",         "startdate": "2020-03-11'T'20:05:28+00:00"       },       {         "enddate": "2020-03-11'T'20:08:09+00:00",         "hostid": "ef10dacd-ac4e-4ec0-bc8d-7fb5bb461c9d",         "hostname": "ref-trl-711-k-M7-apanic-kvm2",         "output": "",         "startdate": "2020-03-11'T'20:06:29+00:00"       },       {         "enddate": "2020-03-11'T'20:10:10+00:00",         "hostid": "fcc8b96e-1c29-492e-a074-96babec70ecc",         "hostname": "ref-trl-711-k-M7-apanic-kvm3",         "output": "",         "startdate": "2020-03-11'T'20:08:30+00:00"       },       {         "enddate": "2020-03-11'T'20:12:11+00:00",         "hostid": "4a732078-2f5d-4bf1-8425-2135004a6b1a",         "hostname": "ref-trl-711-k-M7-apanic-kvm6",         "output": "",         "startdate": "2020-03-11'T'20:11:01+00:00"       },       {   
      "enddate": "2020-03-11'T'20:13:12+00:00",         "hostid": "8f27f11a-9c60-4c30-8622-0e1bce718adc",         "hostname": "ref-trl-711-k-M7-apanic-kvm5",         "output": "",         "startdate": "2020-03-11'T'20:12:32+00:00"       },       {         "enddate": "2020-03-11'T'20:14:13+00:00",         "hostid": "adbbfc34-9369-4a15-93dc-7ed85756c24e",         "hostname": "ref-trl-711-k-M7-apanic-kvm4",         "output": "",         "startdate": "2020-03-11'T'20:13:33+00:00"       }     ],     "success": true   } } Pass
15 Execute rolling maintenance against hosts from different clusters/zones While having multiple zones, execute the rolling maintenance by specifying at least hosts from different  zones (localcloud) SBCM5> > start rollingmaintenance hostids=86c0b59f-89de-40db-9b30-251f851e869f,b0f54409-4874-4573-9c24-8efac5b07f6f {   "rollingmaintenance": {     "details": "OK",     "hostsskipped": [],     "hostsupdated": [       {         "enddate": "2020-03-12'T'12:33:04+00:00",         "hostid": "86c0b59f-89de-40db-9b30-251f851e869f",         "hostname": "ref-trl-711-k-M7-apanic-kvm1",         "output": "",         "startdate": "2020-03-12'T'12:32:24+00:00"       },       {         "enddate": "2020-03-12'T'12:35:15+00:00",         "hostid": "b0f54409-4874-4573-9c24-8efac5b07f6f",         "hostname": "ref-trl-714-k-M7-apanic-kvm1",         "output": "",         "startdate": "2020-03-12'T'12:33:35+00:00"       }     ],     "success": true   } } ( Pass
16 Execute rolling maintenance against multiple zones Having multiple zones, execute the rolling maintenance by specifying at least 2 zones, and notice that first all hosts in one zone will be processed (all hosts in a single cluster, then all hosts from another cluster), and only then the hosts from another zone (localcloud) SBCM5> > start rollingmaintenance zoneids=6f3c9827-6e99-4c63-b7d5-e8f427f6dcff,ce831d12-c2df-4b11-bec9-684dcc292c18 {   "rollingmaintenance": {     "details": "OK",     "hostsskipped": [],     "hostsupdated": [       {         "enddate": "2020-03-12'T'12:41:24+00:00",         "hostid": "86c0b59f-89de-40db-9b30-251f851e869f",         "hostname": "ref-trl-711-k-M7-apanic-kvm1",         "output": "",         "startdate": "2020-03-12'T'12:40:44+00:00"       },       {         "enddate": "2020-03-12'T'12:43:25+00:00",         "hostid": "ef10dacd-ac4e-4ec0-bc8d-7fb5bb461c9d",         "hostname": "ref-trl-711-k-M7-apanic-kvm2",         "output": "",         "startdate": "2020-03-12'T'12:41:45+00:00"       },       {         "enddate": "2020-03-12'T'12:45:26+00:00",         "hostid": "fcc8b96e-1c29-492e-a074-96babec70ecc",         "hostname": "ref-trl-711-k-M7-apanic-kvm3",         "output": "",         "startdate": "2020-03-12'T'12:43:46+00:00"       },       {         "enddate": "2020-03-12'T'12:47:27+00:00",         "hostid": "4a732078-2f5d-4bf1-8425-2135004a6b1a",         "hostname": "ref-trl-711-k-M7-apanic-kvm6",         "output": "",         "startdate": "2020-03-12'T'12:46:17+00:00"       },       {         "enddate": "2020-03-12'T'12:49:28+00:00",         "hostid": "8f27f11a-9c60-4c30-8622-0e1bce718adc",         "hostname": "ref-trl-711-k-M7-apanic-kvm5",         "output": "",         "startdate": "2020-03-12'T'12:47:48+00:00"       },       {         "enddate": "2020-03-12'T'12:51:29+00:00",         "hostid": "adbbfc34-9369-4a15-93dc-7ed85756c24e",         "hostname": "ref-trl-711-k-M7-apanic-kvm4",         "output": "",         "startdate": 
"2020-03-12'T'12:49:48+00:00"       },       {         "enddate": "2020-03-12'T'12:53:00+00:00",         "hostid": "59159ade-f5c3-4606-9174-e501301f59d4",         "hostname": "ref-trl-714-k-M7-apanic-kvm3",         "output": "",         "startdate": "2020-03-12'T'12:52:19+00:00"       },       {         "enddate": "2020-03-12'T'12:54:00+00:00",         "hostid": "b0f54409-4874-4573-9c24-8efac5b07f6f",         "hostname": "ref-trl-714-k-M7-apanic-kvm1",         "output": "",         "startdate": "2020-03-12'T'12:53:20+00:00"       },       {         "enddate": "2020-03-12'T'12:55:01+00:00",         "hostid": "02228e26-a0d6-4607-824d-501ae5ac8dab",         "hostname": "ref-trl-714-k-M7-apanic-kvm2",         "output": "",         "startdate": "2020-03-12'T'12:54:21+00:00"       }     ],     "success": true   } } Pass

@DaanHoogland (Contributor) left a review:

LGTM after several iterations of code review (methods can still be smaller/less complex).
Reviewed the test scheme by @andrijapanicsb; looks good.

@DaanHoogland DaanHoogland merged commit efe00aa into apache:master Mar 12, 2020
@DaanHoogland DaanHoogland deleted the kvm-rolling-maintenance branch March 12, 2020 15:59