@wido
Contributor

@wido wido commented Nov 30, 2018

Description

This PR refactors the modifyvxlan.sh script for VXLAN support on KVM Hypervisors.

Highlights:

  • Only use iproute2 commands for all links, bridges and routes
  • Use proper Bash syntax in script
  • Add IPv6 underlay support

The script does not change existing functionality; running KVM hypervisors will keep functioning as they do now.

This script hasn't been touched in 5 years (last commit in 2013) and needed some attention.

There are multiple commits in this PR to see how I got to the current code.

Types of changes

  • Enhancement (improves an existing feature and functionality)

Screenshots (if appropriate):

How Has This Been Tested?

Tested on a local setup with CloudStack 4.12 (master) running.

wido added 3 commits November 29, 2018 17:07
This script was using tabs instead of 4 spaces and had many blank
lines containing whitespace.

This commit also fixes some Bash styling, but it does not touch the
functionality of the script.

Signed-off-by: Wido den Hollander <wido@widodh.nl>
Bash suggests using double brackets instead of single brackets in
if-statement test logic

Signed-off-by: Wido den Hollander <wido@widodh.nl>
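
To illustrate the point of this commit, here is a minimal Bash sketch. The variable `OP` and both function names are hypothetical and are not taken from modifyvxlan.sh. With an empty, unquoted variable the single-bracket test turns into a syntax error, while the double-bracket form simply evaluates to false.

```bash
#!/usr/bin/env bash
# Illustrative only: OP and the function names are hypothetical.

is_add_single() {
    local OP=$1
    # Unquoted and empty, $OP disappears and [ sees only: == add
    # Bash reports "unary operator expected" and returns status 2.
    [ $OP == "add" ] 2>/dev/null
}

is_add_double() {
    local OP=$1
    # [[ ]] does not word-split, so an empty $OP is compared as "".
    [[ $OP == "add" ]]
}
```

Calling `is_add_double ""` returns 1 (false), whereas `is_add_single ""` returns 2 (error), which is easy to mistake for a plain mismatch.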
The bridge and VXLAN devices are only transport devices and should not
take part in IPv6 traffic.

If IPv6 is enabled, Instances can connect to the Hypervisor over
Link-Local IPv6, which is a potential security issue.

By disabling IPv6 on the Bridge and VXLAN device, they still forward
Layer 2 packets as intended, but they do not respond to anything.

IPv4 and IPv6 traffic towards the Instances is untouched and works
as before.

Signed-off-by: Wido den Hollander <wido@widodh.nl>
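
One possible way to disable IPv6 on a single device is the standard Linux per-interface `disable_ipv6` setting. The sketch below is an assumption about how this could be done, not a copy of the merged script (which may call `sysctl` instead); the helper names are hypothetical.

```bash
# disable_ipv6_path DEV: print the procfs file that controls IPv6 for DEV.
# Writing through /proc keeps dotted names such as bond0.151 intact,
# whereas a dotted sysctl key would treat the dot as a separator.
disable_ipv6_path() {
    printf '/proc/sys/net/ipv6/conf/%s/disable_ipv6\n' "$1"
}

# disable_ipv6 DEV: turn IPv6 off on DEV (needs root and an existing device).
disable_ipv6() {
    echo 1 > "$(disable_ipv6_path "$1")"
}
```

For example, `disable_ipv6 vxlan3000` followed by `disable_ipv6 brvx-3000` would cover both devices of one VXLAN segment.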
@rohityadavcloud
Member

@blueorangutan package

@blueorangutan

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2476

@rohityadavcloud
Member

@blueorangutan test

@blueorangutan

@rhtyd a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@blueorangutan

Trillian test result (tid-3242)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 13474 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3070-t3242-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_multipleips_per_nic.py
Intermittent failure detected: /marvin/tests/smoke/test_public_ip_range.py
Intermittent failure detected: /marvin/tests/smoke/test_reset_vm_on_reboot.py
Intermittent failure detected: /marvin/tests/smoke/test_resource_accounting.py
Intermittent failure detected: /marvin/tests/smoke/test_router_dhcphosts.py
Intermittent failure detected: /marvin/tests/smoke/test_router_dns.py
Intermittent failure detected: /marvin/tests/smoke/test_router_dnsservice.py
Intermittent failure detected: /marvin/tests/smoke/test_routers_iptables_default_policy.py
Intermittent failure detected: /marvin/tests/smoke/test_routers_network_ops.py
Intermittent failure detected: /marvin/tests/smoke/test_routers.py
Intermittent failure detected: /marvin/tests/smoke/test_secondary_storage.py
Intermittent failure detected: /marvin/tests/smoke/test_service_offerings.py
Intermittent failure detected: /marvin/tests/smoke/test_snapshots.py
Intermittent failure detected: /marvin/tests/smoke/test_ssvm.py
Intermittent failure detected: /marvin/tests/smoke/test_templates.py
Intermittent failure detected: /marvin/tests/smoke/test_usage.py
Intermittent failure detected: /marvin/tests/smoke/test_vm_life_cycle.py
Intermittent failure detected: /marvin/tests/smoke/test_vm_snapshots.py
Intermittent failure detected: /marvin/tests/smoke/test_volumes.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_router_nics.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_vpn.py
Smoke tests completed. 49 look OK, 21 have error(s)
Only failed tests results shown below:

| Test | Result | Time (s) | Test File |
| --- | --- | --- | --- |
| test_nic_secondaryip_add_remove | Error | 23.05 | test_multipleips_per_nic.py |
| ContextSuite context=TestResetVmOnReboot>:setup | Error | 0.00 | test_reset_vm_on_reboot.py |
| test_01_so_removal_resource_update | Error | 1.24 | test_resource_accounting.py |
| ContextSuite context=TestISOUsage>:setup | Error | 0.00 | test_usage.py |
| ContextSuite context=TestLBRuleUsage>:setup | Error | 0.00 | test_usage.py |
| ContextSuite context=TestNatRuleUsage>:setup | Error | 0.00 | test_usage.py |
| ContextSuite context=TestPublicIPUsage>:setup | Error | 0.00 | test_usage.py |
| ContextSuite context=TestSnapshotUsage>:setup | Error | 0.00 | test_usage.py |
| ContextSuite context=TestVmUsage>:setup | Error | 0.00 | test_usage.py |
| ContextSuite context=TestVolumeUsage>:setup | Error | 0.00 | test_usage.py |
| ContextSuite context=TestVpnUsage>:setup | Error | 0.00 | test_usage.py |
| ContextSuite context=TestRouterDHCPHosts>:setup | Error | 0.00 | test_router_dhcphosts.py |
| ContextSuite context=TestRouterDHCPOpts>:setup | Error | 0.00 | test_router_dhcphosts.py |
| ContextSuite context=TestRouterDns>:setup | Error | 0.00 | test_router_dns.py |
| ContextSuite context=TestRouterDnsService>:setup | Error | 0.00 | test_router_dnsservice.py |
| test_02_routervm_iptables_policies | Error | 1.18 | test_routers_iptables_default_policy.py |
| test_01_single_VPC_iptables_policies | Error | 5.24 | test_routers_iptables_default_policy.py |
| test_01_isolate_network_FW_PF_default_routes_egress_true | Error | 1.24 | test_routers_network_ops.py |
| test_02_isolate_network_FW_PF_default_routes_egress_false | Error | 1.24 | test_routers_network_ops.py |
| test_01_RVR_Network_FW_PF_SSH_default_routes_egress_true | Error | 4.77 | test_routers_network_ops.py |
| test_02_RVR_Network_FW_PF_SSH_default_routes_egress_false | Error | 6.00 | test_routers_network_ops.py |
| test_03_RVR_Network_check_router_state | Error | 6.00 | test_routers_network_ops.py |
| ContextSuite context=TestRouterServices>:setup | Error | 0.00 | test_routers.py |
| test_01_sys_vm_start | Failure | 0.07 | test_secondary_storage.py |
| ContextSuite context=TestCpuCapServiceOfferings>:setup | Error | 0.00 | test_service_offerings.py |
| ContextSuite context=TestServiceOfferings>:setup | Error | 1.78 | test_service_offerings.py |
| ContextSuite context=TestCreateVolume>:setup | Error | 0.00 | test_volumes.py |
| ContextSuite context=TestVolumes>:setup | Error | 0.00 | test_volumes.py |
| ContextSuite context=TestSnapshotRootDisk>:setup | Error | 0.00 | test_snapshots.py |
| test_01_list_sec_storage_vm | Failure | 0.02 | test_ssvm.py |
| test_02_list_cpvm_vm | Failure | 0.02 | test_ssvm.py |
| test_03_ssvm_internals | Failure | 0.03 | test_ssvm.py |
| test_04_cpvm_internals | Failure | 0.02 | test_ssvm.py |
| test_05_stop_ssvm | Failure | 0.02 | test_ssvm.py |
| test_06_stop_cpvm | Failure | 0.02 | test_ssvm.py |
| test_07_reboot_ssvm | Failure | 0.02 | test_ssvm.py |
| test_08_reboot_cpvm | Failure | 0.02 | test_ssvm.py |
| test_09_destroy_ssvm | Failure | 0.02 | test_ssvm.py |
| test_10_destroy_cpvm | Failure | 0.02 | test_ssvm.py |
| test_02_create_template_with_checksum_sha1 | Error | 65.31 | test_templates.py |
| test_03_create_template_with_checksum_sha256 | Error | 65.33 | test_templates.py |
| test_04_create_template_with_checksum_md5 | Error | 65.32 | test_templates.py |
| test_05_create_template_with_no_checksum | Error | 65.33 | test_templates.py |
| test_02_deploy_vm_from_direct_download_template | Error | 1.20 | test_templates.py |
| test_03_deploy_vm_wrong_checksum | Error | 1.25 | test_templates.py |
| ContextSuite context=TestTemplates>:setup | Error | 16.73 | test_templates.py |
| ContextSuite context=Test01DeployVM>:setup | Error | 0.00 | test_vm_life_cycle.py |
| ContextSuite context=Test02VMLifeCycle>:setup | Error | 0.00 | test_vm_life_cycle.py |
| test_14_secure_to_secure_vm_migration | Error | 11.28 | test_vm_life_cycle.py |
| test_15_secured_to_nonsecured_vm_migration | Error | 84.33 | test_vm_life_cycle.py |
| test_16_nonsecured_to_secured_vm_migration | Error | 1.19 | test_vm_life_cycle.py |
| ContextSuite context=TestVmSnapshot>:setup | Error | 1.73 | test_vm_snapshots.py |
| test_01_create_redundant_VPC_2tiers_4VMs_4IPs_4PF_ACL | Error | 5.72 | test_vpc_redundant.py |
| test_02_redundant_VPC_default_routes | Error | 7.73 | test_vpc_redundant.py |
| test_03_create_redundant_VPC_1tier_2VMs_2IPs_2PF_ACL_reboot_routers | Error | 7.74 | test_vpc_redundant.py |
| test_04_rvpc_network_garbage_collector_nics | Error | 8.79 | test_vpc_redundant.py |
| test_05_rvpc_multi_tiers | Error | 7.72 | test_vpc_redundant.py |
| test_01_VPC_nics_after_destroy | Error | 4.66 | test_vpc_router_nics.py |
| test_02_VPC_default_routes | Error | 4.67 | test_vpc_router_nics.py |
| test_01_redundant_vpc_site2site_vpn | Failure | 7.35 | test_vpc_vpn.py |
| test_01_vpc_site2site_vpn_multiple_options | Failure | 5.28 | test_vpc_vpn.py |
| test_01_vpc_remote_access_vpn | Failure | 3.18 | test_vpc_vpn.py |
| test_01_vpc_site2site_vpn | Failure | 4.25 | test_vpc_vpn.py |

This commit refactors the modifyvxlan.sh script by using only iproute2,
the 'ip' command for all functions.

brctl is deprecated and most bridge functionality can be performed with
the 'ip' command.

This commit also fixes various Bash coding issues and removes a lot of exit
status checking which was redundant.

In addition it adds IPv6 underlay support for VXLAN transport. If the caller (KVM Agent)
adds the '-6' flag it will generate IPv6 multicast groups and routes which will
transport the VXLAN encapsulated packets over IPv6 multicast groups.

Signed-off-by: Wido den Hollander <wido@widodh.nl>
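
As a rough illustration of what "only iproute2" can look like, the sketch below prints (rather than runs) a plausible set of `ip` commands for one VXLAN segment. The function names are hypothetical, `ttl 10` is an assumption, and the exact commands in the merged script may differ. The comments note which legacy `brctl` call each `ip` command stands in for.

```bash
# vxlan_add_cmds VNI PIF BRIDGE GROUP
# Print the iproute2 commands that create a VXLAN device and its bridge.
vxlan_add_cmds() {
    local vni=$1 pif=$2 br=$3 group=$4
    local dev="vxlan${vni}"
    # "ip link add ... type bridge" stands in for "brctl addbr";
    # "ip link set ... master" stands in for "brctl addif".
    # ttl 10 is an assumed value for the multicast underlay.
    cat <<EOF
ip link add ${dev} type vxlan id ${vni} group ${group} ttl 10 dev ${pif}
ip link set ${dev} up
ip link add ${br} type bridge
ip link set ${br} up
ip link set ${dev} master ${br}
EOF
}

# vxlan_del_cmds VNI BRIDGE
# Print the commands that remove both devices again.
vxlan_del_cmds() {
    local vni=$1 br=$2
    cat <<EOF
ip link delete vxlan${vni}
ip link delete ${br}
EOF
}
```

Printing the commands keeps the sketch safe to run as an unprivileged user; piping the output to a root shell would apply them.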
@rohityadavcloud
Member

@blueorangutan package

@blueorangutan

@rhtyd a Jenkins job has been kicked to build packages. I'll keep you posted as I make progress.

@blueorangutan

Packaging result: ✔centos6 ✔centos7 ✔debian. JID-2478

@rohityadavcloud
Member

@blueorangutan test

@blueorangutan

@rhtyd a Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests

@wido
Contributor Author

wido commented Dec 3, 2018

A manual test of this script:

Create a VXLAN device and a bridge.

root@n01:~# /usr/share/cloudstack-common/scripts/vm/network/vnet/modifyvxlan.sh -v 3000 -p cloudbr0 -b brvx-3000 -o add
multicast 239.0.11.184 for VNI 3000 on cloudbr0
vxlan: destination port not specified
Will use Linux kernel default (non-standard value)
Use 'dstport 4789' to get the IANA assigned value
Use 'dstport 0' to get default and quiet this message
root@n01:~# ip link show dev vxlan3000
182: vxlan3000:  mtu 8950 qdisc noqueue master brvx-3000 state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 76:e4:c1:bb:73:5c brd ff:ff:ff:ff:ff:ff
root@n01:~# bridge link show|grep brvx-3000
182: vxlan3000 state UNKNOWN :  mtu 8950 master brvx-3000 state forwarding priority 32 cost 100 
root@n01:~#

Delete the interfaces again

root@n01:~# /usr/share/cloudstack-common/scripts/vm/network/vnet/modifyvxlan.sh -v 3000 -p cloudbr0 -b brvx-3000 -o delete
root@n01:~# 
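
The multicast group printed in the run above (239.0.11.184 for VNI 3000) is consistent with placing the three bytes of the 24-bit VNI after a fixed `239.` prefix; a later log line in this thread (239.0.30.87 for VNI 7767) fits the same pattern. The sketch below reproduces that IPv4 mapping as inferred from those two outputs. The function name is hypothetical, and the IPv6 group format is not shown in this thread, so it is left out.

```bash
# vni_to_mcast VNI: derive the IPv4 multicast group for a VXLAN VNI,
# as inferred from the outputs shown in this thread.
vni_to_mcast() {
    local vni=$1
    printf '239.%d.%d.%d\n' \
        $(( (vni >> 16) & 255 )) \
        $(( (vni >> 8) & 255 )) \
        $(( vni & 255 ))
}

vni_to_mcast 3000   # prints 239.0.11.184
vni_to_mcast 7767   # prints 239.0.30.87
```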

@blueorangutan

Trillian test result (tid-3243)
Environment: kvm-centos7 (x2), Advanced Networking with Mgmt server 7
Total time taken: 23490 seconds
Marvin logs: https://github.com/blueorangutan/acs-prs/releases/download/trillian/pr3070-t3243-kvm-centos7.zip
Intermittent failure detected: /marvin/tests/smoke/test_internal_lb.py
Intermittent failure detected: /marvin/tests/smoke/test_multipleips_per_nic.py
Intermittent failure detected: /marvin/tests/smoke/test_vpc_redundant.py
Smoke tests completed. 68 look OK, 2 have error(s)
Only failed tests results shown below:

| Test | Result | Time (s) | Test File |
| --- | --- | --- | --- |
| test_nic_secondaryip_add_remove | Error | 30.25 | test_multipleips_per_nic.py |
| test_04_rvpc_network_garbage_collector_nics | Failure | 496.29 | test_vpc_redundant.py |

@wido
Contributor Author

wido commented Dec 12, 2018

Is there somebody who can test this? I verified it works for us, but I would like a second pair of eyes.

@wido
Contributor Author

wido commented Dec 12, 2018

@kiwiflyer Can you comment?

@wido
Contributor Author

wido commented Dec 12, 2018

@andrijapanic Can you also take a look?

@rohityadavcloud
Member

@wido I've a vxlan+kvm based ci env, I'll try to test this week/soon

@kiwiflyer
Contributor

kiwiflyer commented Dec 12, 2018 via email

@andrijapanic-dont-use-this-one

Need some time here, but will check.

@wido
Contributor Author

wido commented Dec 28, 2018

@kiwiflyer and @andrijapanic Have you been able to look at this?

@andrijapanicsb
Contributor

andrijapanicsb commented Dec 28, 2018

hm...did we drop support for Ubuntu 14.04 for cloudstack-agent ? @wido

@wido
Contributor Author

wido commented Dec 28, 2018

Which part of this PR suggests that to you?

All these commands should work on 14.04. Keep in mind, 14.04 is almost EOL.

@andrijapanicsb
Contributor

Sorry, perhaps I was not clear - it's 4.12/master with your PR....

cloudstack-agent : Depends: lsb-base (>= 9) but 4.1+Debian11ubuntu6.2 is to be installed

So in 4.12 it seems Ubuntu 14.04 was deprecated... I can't easily test this...will try to give it some more effort possibly, but can't promise.

@kiwiflyer
Contributor

@wido Sorry, I've been really busy the last few days. I'm trying to get to this really soon.

@andrijapanicsb
Contributor

LGTM

@wido - I replaced the script in 4.11.2 with your version before the zone was built. The whole automation of zone deployment, various operations, and a series of Marvin tests all passed, with no difference from the original/vanilla 4.11.2 setup.

I did not test anything ipv6 related.

@wido
Contributor Author

wido commented Dec 29, 2018

Thanks! The IPv6 support can't be tested at the moment, nor is it used. The foundation has been added so that IPv6 multicast groups can be used if needed.

@kiwiflyer
Contributor

Tested on CentOS 7.5 - LGTM

2019-01-02 08:38:45,851 DEBUG [kvm.resource.BridgeVifDriver] (agentRequest-Handler-3:null) (logid:1d5cf06b) nic=[Nic:Guest-10.1.1.57-vxlan://7767]
2019-01-02 08:38:45,851 DEBUG [kvm.resource.BridgeVifDriver] (agentRequest-Handler-3:null) (logid:1d5cf06b) creating a vNet dev and bridge for guest traffic per traffic label bond0.151
2019-01-02 08:38:45,852 DEBUG [kvm.resource.BridgeVifDriver] (agentRequest-Handler-3:null) (logid:1d5cf06b) Executing: /usr/share/cloudstack-common/scripts/vm/network/vnet/modifyvxlan.sh -v 7767 -p bond0.151 -b brvx-7767 -o add
2019-01-02 08:38:45,884 DEBUG [kvm.resource.BridgeVifDriver] (agentRequest-Handler-3:null) (logid:1d5cf06b) Execution is successful.
2019-01-02 08:38:45,884 DEBUG [kvm.resource.BridgeVifDriver] (agentRequest-Handler-3:null) (logid:1d5cf06b) multicast 239.0.30.87 for VNI 7767 on bond0.151

ip a|grep 7767
393: vxlan7767: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc noqueue master brvx-7767 state UNKNOWN group default qlen 1000
394: brvx-7767: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc noqueue state UP group default qlen 1000
433: vnet18: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc htb master brvx-7767 state UNKNOWN group default qlen 1000
455: vnet1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc htb master brvx-7767 state UNKNOWN group default qlen 1000
458: vnet13: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 8950 qdisc htb master brvx-7767 state UNKNOWN group default qlen 1000

brctl show brvx-7767
bridge name     bridge id               STP enabled     interfaces
brvx-7767       8000.92c24205f646       no              vnet1
                                                        vnet13
                                                        vnet18
                                                        vxlan7767

@wido
Contributor Author

wido commented Jan 2, 2019

Thanks! Could you ack the PR here in Github?

Btw, 'brctl' is deprecated; it has been replaced by the 'bridge' command :)
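
For reference, a small sketch of how the member list that `brctl show` prints could be obtained with iproute2 alone. It relies on `ip -o link show master <bridge>`; the two helper names are hypothetical.

```bash
# parse_link_names: read `ip -o link` output on stdin, print interface names.
# Any "@parent" suffix (as shown for VLAN or veth devices) is stripped.
parse_link_names() {
    awk -F': ' '{ sub(/@.*/, "", $2); print $2 }'
}

# bridge_members BRIDGE: list the interfaces attached to BRIDGE.
bridge_members() {
    ip -o link show master "$1" | parse_link_names
}
```

Running `bridge_members brvx-7767` on the host above should list the same interfaces as the `brctl show` output; `bridge link show`, used earlier in this thread, is another option.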

@kiwiflyer
Contributor

yeah, call me old school ;-)

Contributor

@DaanHoogland DaanHoogland left a comment

looks nice, any relevant integration tests?

@wido
Contributor Author

wido commented Jan 8, 2019

@DaanHoogland Thanks! We do not have integration tests for this. That's why I asked users for manual testing, and all of them came back with LGTM.

@wido wido removed the request for review from rohityadavcloud January 8, 2019 12:24
@GabrielBrascher GabrielBrascher merged commit d3e95b9 into apache:master Jan 9, 2019
LOCKFILE=/var/run/cloud/vxlan.lock

(
flock -x -w 10 200 || exit 1
Member

Can we confirm that flock is installed/available by default on CentOS 6 and CentOS 7? Otherwise this will break on both distros. @wido /cc @DagSonsteboSB @borisstoyanov @PaulAngus @dhlaluku @anuragaw @shwstppr

Member

I've confirmed that flock comes default on CentOS7 and on Ubuntu 18.04. Not sure about CentOS6 and Ubuntu 16.04.

Member

I just checked, and I can confirm that flock is available by default on Ubuntu 16.04 and CentOS 6.

Member

Good! Thanks for checking that @rhtyd @rafaelweingartner.
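
For readers unfamiliar with the locking idiom under review, here is a minimal Bash sketch of the same pattern with an explicit availability check added. The `with_lock` wrapper and its error message are hypothetical; only the `flock -x -w 10 200` call and the use of a lock file come from the snippet above.

```bash
# with_lock LOCKFILE CMD...: run CMD while holding an exclusive lock.
with_lock() {
    local lockfile=$1; shift
    # Fail early with a clear message if flock is not installed.
    command -v flock >/dev/null 2>&1 || {
        echo "flock not found" >&2
        return 127
    }
    (
        # Wait up to 10 seconds for an exclusive lock on file descriptor 200.
        flock -x -w 10 200 || exit 1
        "$@"
    ) 200>"$lockfile"
}
```

A call such as `with_lock /var/run/cloud/vxlan.lock some_command` would serialize concurrent invocations; the lock is released when the subshell exits and descriptor 200 is closed.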
