@slavkap
Contributor

@slavkap slavkap commented Nov 28, 2019

Description

CloudStack, with KVM as the hypervisor, provides live VM snapshots only for volumes whose image format is QCOW.
This is a limitation for storage providers with disks in RAW format.

link to ML: http://mail-archives.apache.org/mod_mbox/cloudstack-dev/201911.mbox/%3cCAA6FghF7eaY-A3XGN5zSwKVvp7zrcuNzv5nFAQKaF+et3zH-ag@mail.gmail.com%3e#archives

With this feature

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Documentation

apache/cloudstack-documentation#153

Prerequisites

qemu-guest-agent installed and enabled in the guest - used to freeze/thaw the VM
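
As a rough illustration of the freeze/thaw prerequisite, the sketch below wraps a storage snapshot between the QEMU guest agent commands `guest-fsfreeze-freeze` and `guest-fsfreeze-thaw`. The `GuestAgentChannel` interface and the `snapshotWhileFrozen` helper are hypothetical names introduced only for this example; they are not the code in this PR, and how the commands actually reach the guest is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical transport to the guest agent; NOT a CloudStack or libvirt API. */
interface GuestAgentChannel {
    String send(String jsonCommand) throws Exception;
}

final class FreezeThawSketch {
    // QEMU guest agent commands for filesystem freeze and thaw.
    static final String FREEZE = "{\"execute\":\"guest-fsfreeze-freeze\"}";
    static final String THAW = "{\"execute\":\"guest-fsfreeze-thaw\"}";

    /** Freezes guest filesystems, runs the snapshot action, and always thaws. */
    static void snapshotWhileFrozen(GuestAgentChannel agent, Runnable takeStorageSnapshot)
            throws Exception {
        agent.send(FREEZE);
        try {
            takeStorageSnapshot.run();
        } finally {
            // Thaw even if the snapshot failed, so the guest is never left frozen.
            agent.send(THAW);
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> log = new ArrayList<>();
        // Fake channel that only records the commands it receives.
        GuestAgentChannel fake = cmd -> {
            log.add(cmd);
            return "{\"return\":{}}";
        };
        snapshotWhileFrozen(fake, () -> log.add("storage snapshot taken"));
        log.forEach(System.out::println);
    }
}
```

The `finally` block is the point of the sketch: a failed snapshot should not leave the guest filesystems frozen.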

How Has This Been Tested?

Environment:

  • CloudStack - management and agent version - main
  • OS for management and hypervisor - CentOS7
  • Hypervisor - KVM
  • Primary storages - NFS, StorPool, Local storage and Ceph.

Enable the global configuration setting - kvm.vmstoragesnapshot.enabled

Take snapshot:

  • Take VM snapshot without memory on running virtual machine
  • We have tested this feature with these storage providers - NFS, StorPool, Local storage and Ceph.
  • There is one smoke test, which creates VM, installs qemu-guest-agent, takes, reverts and deletes VM snapshot.

Revert snapshot

  • Stop the virtual machine and revert the VM snapshot
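
The tested flow above (snapshot without memory on a running VM, revert only after the VM is stopped) can be written down as simple precondition checks. This is a sketch of the test scenario only: the class, enum, and method names are hypothetical, and the rules the PR actually enforces may differ.

```java
final class VmSnapshotFlowSketch {
    enum VmState { RUNNING, STOPPED }

    /** Take: the tested case is a running VM and a snapshot without memory. */
    static void checkTake(VmState state, boolean includeMemory) {
        if (includeMemory) {
            throw new IllegalArgumentException("tested flow takes the snapshot without memory");
        }
        if (state != VmState.RUNNING) {
            throw new IllegalStateException("tested flow takes the snapshot on a running VM");
        }
    }

    /** Revert: the tested case stops the VM first. */
    static void checkRevert(VmState state) {
        if (state != VmState.STOPPED) {
            throw new IllegalStateException("tested flow reverts only after the VM is stopped");
        }
    }

    public static void main(String[] args) {
        checkTake(VmState.RUNNING, false);
        checkRevert(VmState.STOPPED);
        System.out.println("tested flow preconditions hold");
    }
}
```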

@DaanHoogland
Contributor

@blueorangutan package

@GabrielBrascher GabrielBrascher added this to the 4.14.0.0 milestone Nov 29, 2019
@andrijapanicsb
Contributor

@blueorangutan package

@blueorangutan

@andrijapanicsb a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✖centos6 ✔centos7 ✔debian. JID-533

@andrijapanicsb
Contributor

@blueorangutan test

@blueorangutan

@andrijapanicsb a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-694)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 28199 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3724-t694-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_vm_snapshot_kvm.py
Intermittent failure detected: /marvin/tests/smoke/test_hostha_kvm.py
Smoke tests completed. 76 look OK, 2 have error(s)
Only failed tests results shown below:

Test                                            Result  Time (s)  Test File
runTest                                         Error   0.00      test_vm_snapshot_kvm.py
test_hostha_enable_ha_when_host_disabled        Error   3.63      test_hostha_kvm.py
test_hostha_enable_ha_when_host_in_maintenance  Error   302.75    test_hostha_kvm.py

@weizhouapache
Member

@slavkap @andrijapanicsb
This is a nice feature.
I will review and test it. It might take some days, depending on my availability.

@slavkap
Contributor Author

slavkap commented Jan 9, 2020

Thank you @weizhouapache for spending time on this! I am available if you have any comments or questions.

@andrijapanicsb
Contributor

@blueorangutan test

@blueorangutan

@andrijapanicsb a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@andrijapanicsb andrijapanicsb requested a review from wido January 13, 2020 21:40
@andrijapanicsb
Contributor

andrijapanicsb commented Jan 13, 2020

@slavkap - nice stuff as already commented!

  1. I see "lots" of dependency which has to be documented properly in upstream documentation (preferably) but also I would like to see it documented in the description of the global setting - "...requires qemu 1.6+ and the qemu-guest-agent installed inside a VM"

  2. Do you know how well qemu-guest-agent is supported/possible on Windows guests - lots of people will be running Windows VMs as well - covering just Linux workloads would be suboptimal.

  3. You are mentioning QCOW format and behaviour about it. Ceph uses RAW format from Qemu/Libvirt perspective, but possibly is marked as QCOW2 in DB (I recall some hassle around DB format in past) - can you please elaborate about this - how does it work with Ceph vs other "real" QCOW2 formats? Store

  4. Currently, due to someone noticing bugs in QCOW2 snapshots (or restore of) where the volume/filesystem of the guest gets corrupted - volume snapshots have been disabled for running VMs for KVM. @weizhouapache can you comment on this? I'm wondering if we might hit the same issue here as well

  5. Did you guys test this on busy VMs (IO and CPU heavy workloads)?

thx
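
The "qemu 1.6+" requirement from point 1 comes down to a numeric version comparison. The sketch below shows one way to check it; `QemuVersionSketch` and `isAtLeast` are hypothetical names, and how the version string is obtained from the host is deliberately left out.

```java
final class QemuVersionSketch {
    /** Returns true if a dotted version such as "2.12.0" is at least major.minor. */
    static boolean isAtLeast(String version, int major, int minor) {
        String[] parts = version.trim().split("\\.");
        int actualMajor = Integer.parseInt(parts[0]);
        int actualMinor = parts.length > 1 ? Integer.parseInt(parts[1]) : 0;
        // Compare numerically so that "1.10" is correctly treated as newer than "1.6".
        return actualMajor > major || (actualMajor == major && actualMinor >= minor);
    }

    public static void main(String[] args) {
        for (String v : new String[] {"1.5.3", "1.6", "1.10.0", "2.12.0"}) {
            System.out.println(v + " meets 1.6+: " + isAtLeast(v, 1, 6));
        }
    }
}
```

Comparing the components as integers avoids the usual trap of string comparison, where "1.10" would sort before "1.6".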

wido previously requested changes Jan 14, 2020
Contributor

@wido wido left a comment

I left some comments about executing commands from inside Java

@andrijapanicsb andrijapanicsb modified the milestones: 4.14.0.0, 4.15.0.0 Jan 14, 2020
@andrijapanicsb
Contributor

moving to 4.15, I'm afraid we won't be able to test in time.

@weizhouapache
Member

4. Currently, due to someone noticing bugs in QCOW2 snapshots (or restore of) where the volume/filesystem of the guest gets corrupted - volume snapshots have been disabled for running VMs for KVM. @weizhouapache can you comment on this? I'm wondering if we might hit the same issue here as well

@andrijapanicsb as far as I remember, the VM is paused not only when we create a snapshot but also when we delete one. I think the corrupted volume issue should be fixed.

@slavkap
Contributor Author

slavkap commented Feb 5, 2020

Hello @andrijapanicsb, sorry for the late response! Below you can find answers to your questions.

I see "lots" of dependency which has to be documented properly in upstream documentation (preferably) but also I would like to see it documented in the description of the global setting - "...requires qemu 1.6+ and the qemu-guest-agent installed inside a VM"

We'll document the functionality, but could you please tell me what should be included? Also I will update the global setting with:

I would like to see it documented in the description of the global setting - "...requires qemu 1.6+ and the qemu-guest-agent installed inside a VM"

Do you know how well qemu-guest-agent is supported/possible on Windows guests - lots of people will be running Windows VMs as well - covering just Linux workloads would be suboptimal.

For Windows, freeze/thaw support in the guest agent has been included in versions since 2013.

You are mentioning QCOW format and behaviour about it. Ceph uses RAW format from Qemu/Libvirt perspective, but possibly is marked as QCOW2 in DB (I recall some hassle around DB format in past) - can you please elaborate about this - how does it work with Ceph vs other "real" QCOW2 formats? Store

Ceph is marked in the DB as RAW format, but this doesn't affect snapshot and revert of Ceph volumes, because we use Ceph's own implementation for this. For each primary storage we use the appropriate plugin with its own take/revert snapshot implementation.
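
The per-storage approach described here can be pictured as a small dispatcher that routes take/revert calls to whichever plugin is registered for the volume's primary storage. This is only a sketch of the idea: `StorageSnapshotPlugin`, `SnapshotDispatcherSketch`, and `DescribingPlugin` are hypothetical names, not the interfaces used in the PR.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-storage snapshot operations; not CloudStack's real interface. */
interface StorageSnapshotPlugin {
    String takeSnapshot(String volumeId);
    String revertSnapshot(String volumeId, String snapshotId);
}

final class SnapshotDispatcherSketch {
    private final Map<String, StorageSnapshotPlugin> plugins = new HashMap<>();

    void register(String storageType, StorageSnapshotPlugin plugin) {
        plugins.put(storageType, plugin);
    }

    String take(String storageType, String volumeId) {
        return pluginFor(storageType).takeSnapshot(volumeId);
    }

    String revert(String storageType, String volumeId, String snapshotId) {
        return pluginFor(storageType).revertSnapshot(volumeId, snapshotId);
    }

    private StorageSnapshotPlugin pluginFor(String storageType) {
        StorageSnapshotPlugin plugin = plugins.get(storageType);
        if (plugin == null) {
            throw new IllegalArgumentException("No snapshot plugin registered for " + storageType);
        }
        return plugin;
    }

    /** Stand-in plugin that only describes what a real backend would do. */
    static final class DescribingPlugin implements StorageSnapshotPlugin {
        private final String backend;

        DescribingPlugin(String backend) {
            this.backend = backend;
        }

        @Override
        public String takeSnapshot(String volumeId) {
            return backend + " snapshot of " + volumeId;
        }

        @Override
        public String revertSnapshot(String volumeId, String snapshotId) {
            return backend + " revert of " + volumeId + " to " + snapshotId;
        }
    }

    public static void main(String[] args) {
        SnapshotDispatcherSketch dispatcher = new SnapshotDispatcherSketch();
        dispatcher.register("ceph", new DescribingPlugin("ceph"));
        dispatcher.register("nfs", new DescribingPlugin("nfs"));
        System.out.println(dispatcher.take("ceph", "vol-1"));
        System.out.println(dispatcher.revert("nfs", "vol-2", "snap-7"));
    }
}
```

With this shape, the image format recorded in the database never enters the decision; only the storage type selects the implementation.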

Did you guys test this on busy VMs (IO and CPU heavy workloads)?

We've completed our tests including on busy VMs with different primary storages. They've been successfully reverted and it works as expected.

moving to 4.15, I'm afraid we won't be able to test in time.

Unfortunately, we'll need time to implement the qemu-agent and qemu-monitor command functionality in the libvirt Java API, and we don't know when it will be accepted.

@DaanHoogland DaanHoogland marked this pull request as draft May 7, 2020 08:59
@nvazquez
Contributor

@blueorangutan package

@blueorangutan

@nvazquez a Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 2921

@nvazquez
Contributor

@blueorangutan test

@blueorangutan

@nvazquez a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-3656)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 30801 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3724-t3656-kvm-centos7.zip
Smoke tests completed. 93 look OK, 0 have errors
Only failed tests results shown below:

Test Result Time (s) Test File

@rp-
Contributor

rp- commented Mar 22, 2022

I tested this PR today with the Linstor plugin and creating/reverting and deleting works with Linstor.

@nvazquez
Contributor

nvazquez commented Mar 22, 2022

Thanks @rp- can you please approve on the Files changed tab -> Review changes -> Submit review?

Contributor

@rp- rp- left a comment

Tested with Linstor primary storage:
create/revert and delete worked.

public static final ConfigKey<Boolean> BackupSnapshotAfterTakingSnapshot = new ConfigKey<Boolean>(Boolean.class, "snapshot.backup.to.secondary", "Snapshots", "true",
"Indicates whether to always backup primary storage snapshot to secondary storage. Keeping snapshots only on Primary storage is applicable for KVM + Ceph only.", false, ConfigKey.Scope.Global, null);

public static final ConfigKey<Boolean> VMsnapshotKVM = new ConfigKey<>(Boolean.class, "kvm.vmstoragesnapshot.enabled", "Snapshots", "false", "For live snapshot of virtual machine instance on KVM hypervisor without memory. Requires qemu version 1.6+ (on NFS or Local file system) and qemu-guest-agent installed on guest VM", true, ConfigKey.Scope.Global, null);
Contributor

Could it be renamed to something like 'VMStorageSnapshotKVM'? Could the scope be reduced to zone?

Contributor Author

@nvazquez, I will rename it, but why limit this to a Zone?

Contributor

@slavkap I was thinking it would make sense for admins to enable/disable the feature for a zone and not for all of them at the same time - just a suggestion
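
To make the zone-scope suggestion concrete, the sketch below models a boolean setting that has a global value and optional per-zone overrides. It does not use CloudStack's `ConfigKey` class; `ScopedSettingSketch` and its methods are hypothetical and only illustrate the fallback behaviour being discussed.

```java
import java.util.HashMap;
import java.util.Map;

final class ScopedSettingSketch {
    private final boolean globalValue;
    private final Map<Long, Boolean> zoneOverrides = new HashMap<>();

    ScopedSettingSketch(boolean globalValue) {
        this.globalValue = globalValue;
    }

    /** Overrides the setting for one zone only. */
    void setForZone(long zoneId, boolean value) {
        zoneOverrides.put(zoneId, value);
    }

    /** Returns the zone's override if present, otherwise the global value. */
    boolean valueIn(long zoneId) {
        return zoneOverrides.getOrDefault(zoneId, globalValue);
    }

    public static void main(String[] args) {
        // Feature disabled globally, enabled only in zone 1.
        ScopedSettingSketch setting = new ScopedSettingSketch(false);
        setting.setForZone(1L, true);
        System.out.println("zone 1: " + setting.valueIn(1L));
        System.out.println("zone 2: " + setting.valueIn(2L));
    }
}
```

With a zone scope, an admin could trial the feature in a single zone while every other zone keeps the global default.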

Contributor

@nvazquez nvazquez left a comment

Thanks @slavkap - code LGTM

@nvazquez
Contributor

nvazquez commented Mar 30, 2022

Hi @wido @GabrielBrascher @svenvogel would it be possible for you to test this PR on Ceph or Solidfire?

@nvazquez
Contributor

@blueorangutan package

@blueorangutan

@nvazquez a Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔️ el7 ✔️ el8 ✔️ debian ✔️ suse15. SL-JID 3027

@nvazquez
Contributor

nvazquez commented Apr 4, 2022

@blueorangutan test

@blueorangutan

@nvazquez a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-3805)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 31833 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3724-t3805-kvm-centos7.zip
Smoke tests completed. 93 look OK, 0 have errors
Only failed tests results shown below:

Test Result Time (s) Test File

@nvazquez
Contributor

nvazquez commented Apr 8, 2022

Merging based on approvals and tests results

@nvazquez nvazquez merged commit 2b075ed into apache:main Apr 8, 2022
@rohityadavcloud
Member

Fantastic, this is finally merged! Thanks for your work and patience @slavkap - do raise a documentation PR as required - https://github.com/apache/cloudstack-documentation

@slavkap
Contributor Author

slavkap commented Apr 8, 2022

yes, @rohityadavcloud 🎉 ❤️
Thank you all for your help!
I'll update the PR in docs - 153 - with the changes I made

slavkap added a commit to slavkap/cloudstack-documentation that referenced this pull request Apr 12, 2022
nvazquez pushed a commit to apache/cloudstack-documentation that referenced this pull request Apr 21, 2022
* [WIP] Storage-based VM snapshots on KVM

Documentation for pull request - apache/cloudstack#3724

* Removed information for QCOW2 support

* Fix reference

* Fix section
weizhouapache pushed a commit to weizhouapache/cloudstack-documentation that referenced this pull request Jul 24, 2023
* [WIP] Storage-based VM snapshots on KVM

Documentation for pull request - apache/cloudstack#3724

* Removed information for QCOW2 support

* Fix reference

* Fix section

Development

Successfully merging this pull request may close these issues.

Improve kvm vm snapshot