Skip to content

Conversation

@GutoVeronezi
Copy link
Contributor

Description

PR #4640 added a condition validating if the VM is from hypervisor VMWare, in AbstractStoragePoolAllocator::reorderPools(List<StoragePool>, VirtualMachineProfile, DeploymentPlan, DiskProfile).

if (vmProfile.getHypervisorType() == HypervisorType.VMware &&
!storageMgr.DiskProvisioningStrictness.valueIn(plan.getDataCenterId())) {
pools = reorderPoolsByDiskProvisioningType(pools, dskCh);
}

However, when creating a volume from a snapshot (without VM's id), ACS instantiates the VirtualMachineProfile without a VM:

while ((pool = findStoragePool(dskCh, dc, pod.first(), null, null, null, poolsToAvoid)) != null) {

Which cause a NPE on:

The NPE is catch and throwed as a CloudRuntimeException.

This PR intends to add a validation if the VM is null before validating its hypervisor.

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)

Feature/Enhancement Scale or Bug Severity

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

How Has This Been Tested?

It has been tested locally in a test lab.

  1. I created VM;
  2. I took a snapshot;
  3. I tried to create a volume from the snapshot.
    • Before the nulls validation, it would throw an exception:
ERROR [c.c.a.ApiAsyncJobDispatcher] (API-Job-Executor-1:ctx-7ec7e521 job-18849) (logid:43e9eb75) Unexpected exception while executing org.apache.cloudstack.api.command.admin.volume.CreateVolumeCmdByAdmin
com.cloud.utils.exception.CloudRuntimeException: Failed to create volume: xxxx
    at com.cloud.storage.VolumeApiServiceImpl.createVolume(VolumeApiServiceImpl.java:901)
    at com.cloud.storage.VolumeApiServiceImpl.createVolume(VolumeApiServiceImpl.java:201)
  • After the nulls validation, it created the volume.

@DaanHoogland
Copy link
Contributor

@GutoVeronezi a remark and a question;
q: can you add the stack trace. I've been looking at this simple change for 15 minutes and don't see the dynamic model (dumb me)
r: the extra block is not needed, the condition vmProfile.getVirtualMachine() == null can be reversed and added to the block doing the reordering.

@GutoVeronezi
Copy link
Contributor Author

@DaanHoogland this is the full log of the error:

2021-08-05 00:16:12,138 ERROR [c.c.a.ApiAsyncJobDispatcher] (API-Job-Executor-2:ctx-1ec8a162 job-18853) (logid:ad059be6) Unexpected exception while executing org.apache.cloudstack.api.command.admin.volume.CreateVolumeCmdByAdmin
com.cloud.utils.exception.CloudRuntimeException: Failed to create volume: 5786
        at com.cloud.storage.VolumeApiServiceImpl.createVolume(VolumeApiServiceImpl.java:901)
        at com.cloud.storage.VolumeApiServiceImpl.createVolume(VolumeApiServiceImpl.java:201)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:344)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:198)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
        at org.apache.cloudstack.network.contrail.management.EventUtils$EventInterceptor.invoke(EventUtils.java:107)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:175)
        at com.cloud.event.ActionEventInterceptor.invoke(ActionEventInterceptor.java:51)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:175)
        at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:97)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
        at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:215)
        at com.sun.proxy.$Proxy227.createVolume(Unknown Source)
        at org.apache.cloudstack.api.command.user.volume.CreateVolumeCmd.execute(CreateVolumeCmd.java:225)
        at com.cloud.api.ApiDispatcher.dispatch(ApiDispatcher.java:156)
        at com.cloud.api.ApiAsyncJobDispatcher.runJob(ApiAsyncJobDispatcher.java:108)
        at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.runInContext(AsyncJobManagerImpl.java:620)
        at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:48)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:55)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:102)
        at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:52)
        at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:45)
        at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.run(AsyncJobManagerImpl.java:568)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: com.cloud.utils.exception.CloudRuntimeException: Failed to create volume from snapshot:com.cloud.utils.exception.CloudRuntimeException: Failed to copy /mnt/e9ad8c2e-59f6-3886-aebd-a1c7da36028a/5769 to e5f21e96-9ee9-4f87-8b20-10b82d7dbcf6
        at org.apache.cloudstack.engine.orchestration.VolumeOrchestrator.createVolumeFromSnapshot(VolumeOrchestrator.java:494)
        at com.cloud.storage.VolumeApiServiceImpl.createVolumeFromSnapshot(VolumeApiServiceImpl.java:923)
        at com.cloud.storage.VolumeApiServiceImpl.createVolume(VolumeApiServiceImpl.java:873)
        ... 31 more
2021-08-05 00:16:12,150 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-2:ctx-1ec8a162 job-18853) (logid:ad059be6) Complete async job-18853, jobStatus: FAILED, resultCode: 530, result: org.apache.cloudstack.api.response.ExceptionResponse/null/{"uuidList":[],"errorcode":"530","errortext":"Failed to create volume: 5786"}

The NPE does not appear in the stack trace because it's being hidden; Therefore I mapped the hierarchy to facilitate the comprehension:

  • VolumeApiServiceImpl::createVolume(CreateVolumeCmd), L873, cmd.getVirtualMachineId() is null, therefore it calls:
  • VolumeApiServiceImpl::createVolumeFromSnapshot(VolumeVO, long, Long) with a null vmId. As vmId is null, it will not instantiate vm. In L923, it calls:
  • VolumeOrchestrator::createVolumeFromSnapshot(Volume, Snapshot, UserVm) with a null vm, then, in L434 it calls:
  • VolumeOrchestrator::findStoragePool(DiskProfile, DataCenter, Pod, Long, Long, VirtualMachine, Set<StoragePool>) with vm parameter null. It will instantiate a VirtualMachineProfile with null vm in L315 and in L324 it calls:
  • AbstractStoragePoolAllocator::allocateToPool(DiskProfile, VirtualMachineProfile, DeploymentPlan, ExcludeList, int). In L97 it calls:
  • AbstractStoragePoolAllocator::allocateToPool(DiskProfile, VirtualMachineProfile, DeploymentPlan, ExcludeList, int, boolean) which, in L103, will call:
  • AbstractStoragePoolAllocator::reorderPools(List<StoragePool>, VirtualMachineProfile, DeploymentPlan, DiskProfile). Here vmProfile is not null, but its property _vm is, therefore, when, in L193, it calls:
  • VirtualMachineProfileImpl::getHypervisorType(), it will throw a NPE due to the null _vm.

Also, I added a commit to log the exception before throwing the CloudRuntimeException.


About the remark, we choose to validate the getVirtualMachine() before the reorder block to add the log.

Copy link
Contributor

@nvazquez nvazquez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code LGTM, great catch @GutoVeronezi

Comment on lines 902 to 903
s_logger.error(message, e);
throw new CloudRuntimeException(message, e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why log and throw? we can do with either.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DaanHoogland as the real problem was not being logged, I thought it would be better to log it and keep the previous behavior.

What's your suggestion?

Copy link
Contributor

@DaanHoogland DaanHoogland Aug 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A CloudRuntimeException is caught and handled somewhere so I'd just throw. The logging will be hard to find and link to the proper occurrence. If the handling code doesn't log the exception than we should fix that. We potentially get the same stacktrace twice in the log now.
This is no 👎 by any means by the way. I'm just not so keen on overfilling the log with the same info multiple times, It's big enough as is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems fair, I will undo the last change.

As the focus of this PR is not the exception handling code, I think we should create an issue to investigate it.

@DaanHoogland
Copy link
Contributor

good investigative work @GutoVeronezi , i have some preferences about the implementation (log when we do reorder and not when we don't) but the jist is, this is great.

This reverts commit 70e6556.
@nvazquez
Copy link
Contributor

nvazquez commented Aug 6, 2021

@blueorangutan package

@blueorangutan
Copy link

@nvazquez a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan
Copy link

Packaging result: ✔️ el7 ✔️ el8 ✔️ debian. SL-JID 785

@nvazquez
Copy link
Contributor

nvazquez commented Aug 6, 2021

@blueorangutan test

@blueorangutan
Copy link

@nvazquez a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan
Copy link

Trillian test result (tid-1518)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 55603 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr5282-t1518-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_internal_lb.py
Intermittent failure detected: /marvin/tests/smoke/test_kubernetes_clusters.py
Intermittent failure detected: /marvin/tests/smoke/test_network.py
Intermittent failure detected: /marvin/tests/smoke/test_routers_network_ops.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Smoke tests completed. 87 look OK, 2 have error(s)
Only failed tests results shown below:

Test Result Time (s) Test File
test_01_internallb_roundrobin_1VPC_3VM_HTTP_port80 Failure 488.53 test_internal_lb.py
test_03_vpc_internallb_haproxy_stats_on_all_interfaces Error 239.44 test_internal_lb.py
test_01_create_redundant_VPC_2tiers_4VMs_4IPs_4PF_ACL Failure 628.33 test_vpc_redundant.py
test_03_create_redundant_VPC_1tier_2VMs_2IPs_2PF_ACL_reboot_routers Failure 603.10 test_vpc_redundant.py
test_05_rvpc_multi_tiers Failure 610.70 test_vpc_redundant.py

@rohityadavcloud rohityadavcloud added this to the 4.16.0.0 milestone Aug 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants