@wilderrodrigues
Contributor

This PR fixes a blocker issue!

  • Just like with RVRs, use the fixed VRID 51 instead of making it dependent on the VPCID
  • Reason: the VRID is an arbitrary unique number in 0..255, used to differentiate multiple instances of vrrpd running on the same NIC (and hence the same socket). The template default is virtual_router_id 51.

@wilderrodrigues
Contributor Author

Ping @remibergsma @DaanHoogland @borisroman

Could you guys test it before the RC? I just fixed it, but I have to go to a concert now.

@remibergsma
Contributor

Thanks for the quick fix @wilderrodrigues.

A bit of explanation: redundant routers worked fine in our 4.7 cloud, then all of a sudden they broke. The root cause was that virtual_router_id was set to the vpc_id. Once the VPC ID went above 255 it broke, because this keepalived setting only accepts 0-255. This PR keeps the default from the template, keepalived.conf.templ:

    virtual_router_id 51

Thanks @fborn for discovering the issue!
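
To make the failure mode concrete, here is a minimal sketch of the range check, assuming the 0..255 limit quoted in this thread. The function and constant names are hypothetical and are not taken from the CloudStack code; only the value 51 and the file name keepalived.conf.templ come from the discussion.

```python
# Hypothetical illustration, not CloudStack code.
# The 0..255 range is the one quoted from man keepalived.conf in this thread.
KEEPALIVED_VRID_MIN = 0
KEEPALIVED_VRID_MAX = 255
TEMPLATE_DEFAULT_VRID = 51  # default shipped in keepalived.conf.templ


def is_valid_vrid(vrid):
    """Return True if vrid fits the range keepalived accepts."""
    return KEEPALIVED_VRID_MIN <= vrid <= KEEPALIVED_VRID_MAX


def vrid_from_vpc_id(vpc_id):
    """Old behaviour as described above: reuse the VPC ID as the VRID."""
    return vpc_id


def vrid_fixed(_vpc_id):
    """Behaviour after this PR: keep the template default for every VPC."""
    return TEMPLATE_DEFAULT_VRID
```

With these helpers, `is_valid_vrid(vrid_from_vpc_id(256))` is false, while `is_valid_vrid(vrid_fixed(256))` stays true no matter how many VPCs exist.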

@remibergsma
Contributor

You can read the following from man keepalived.conf:

# arbitary unique number 0..255
# used to differentiate multiple instances of vrrpd
# running on the same NIC (and hence same socket).
virtual_router_id 51

Since we run only one pair of keepalived on each nic, this default is fine.
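
The man page constraint is about uniqueness per NIC, so a quick way to check a given config is to collect the (interface, virtual_router_id) pairs and look for repeats. The sketch below is a hypothetical helper, not part of keepalived or CloudStack, and it assumes a simple layout where `interface` appears before `virtual_router_id` inside each `vrrp_instance` block.

```python
from collections import Counter


def vrids_per_interface(conf_text):
    """Collect (interface, virtual_router_id) pairs from keepalived.conf text.

    Simplified parser (an assumption of this sketch): one keyword per line,
    and 'interface' precedes 'virtual_router_id' in each vrrp_instance block.
    """
    pairs = []
    interface = None
    for line in conf_text.splitlines():
        tokens = line.split()
        if len(tokens) < 2:
            continue
        if tokens[0] == "vrrp_instance":
            interface = None  # a new block starts, forget the previous NIC
        elif tokens[0] == "interface":
            interface = tokens[1]
        elif tokens[0] == "virtual_router_id":
            pairs.append((interface, int(tokens[1])))
    return pairs


def clashing_vrids(conf_text):
    """Return the (interface, vrid) pairs that occur more than once."""
    counts = Counter(vrids_per_interface(conf_text))
    return [pair for pair, n in counts.items() if n > 1]
```

For a config with a single `vrrp_instance` on one NIC, `clashing_vrids` returns an empty list, which is the situation described above.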

@rohityadavcloud
Member

LGTM, perhaps we can use VPCID % 255 to get a value that is less than 255 but greater than 0?
@remibergsma @fborn @wilderrodrigues Do you think using a static/fixed value can cause any issue (maybe in future)?

@remibergsma
Contributor

@bhaisaab nice suggestion! Not sure if it is needed though. The vrrp is done over the first guest network, so it cannot clash with other router pairs. Other tiers are handled by the same keepalived/vrrp instance so that's also fine. I cannot think of a way it'd clash. And even when we make it 0-255, it could still clash (and be harder to spot).

First testing it now to see if it resolves the issues we see. Will report back soon.
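
The modulo suggestion discussed above can be sketched as follows. This is only an illustration of the arithmetic with hypothetical function names; it is not code from this PR. It shows two things: `vpc_id % 255` yields values in 0..254 (so it can be 0), and distinct VPC IDs can map to the same VRID.

```python
from collections import defaultdict


def vrid_modulo(vpc_id, modulus=255):
    """Map a VPC ID into 0..modulus-1 using the suggested modulo."""
    return vpc_id % modulus


def colliding_vpc_ids(vpc_ids, modulus=255):
    """Group VPC IDs by derived VRID and keep only groups that collide."""
    groups = defaultdict(list)
    for vpc_id in vpc_ids:
        groups[vrid_modulo(vpc_id, modulus)].append(vpc_id)
    return {vrid: ids for vrid, ids in groups.items() if len(ids) > 1}
```

For example, VPC IDs 1 and 256 both map to VRID 1, and VPC ID 255 maps to 0. A fixed value avoids that ambiguity, and as noted above it is safe as long as each router pair runs VRRP on its own guest network.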

@remibergsma
Contributor

Jenkins error unrelated to PR change:

Build timed out (after 120 minutes). Marking the build as aborted.

@remibergsma
Contributor

First test results:

keepalived.conf looks as expected:

vrrp_instance inside_network {
    state EQUAL
    interface eth2
    virtual_router_id 51
    nopreempt

Service is running:

root@r-11-VM:/etc/keepalived# ps aux | grep keepalived
root      4058  0.1  0.4  47040  1032 ?        Ss   17:56   0:00 /usr/sbin/keepalived
root      4059  0.1  0.9  53308  2368 ?        S    17:56   0:00 /usr/sbin/keepalived
root      4060  0.2  0.7  53308  1768 ?        S    17:56   0:00 /usr/sbin/keepalived
root      5994  0.0  0.3   8076   852 pts/1    S+   18:00   0:00 grep keepalived

Logs:

Dec 12 17:56:12 r-11-VM Keepalived_vrrp[4060]: Registering Kernel netlink reflector
Dec 12 17:56:12 r-11-VM Keepalived_vrrp[4060]: Registering Kernel netlink command channel
Dec 12 17:56:12 r-11-VM Keepalived_vrrp[4060]: Registering gratuitous ARP shared channel
Dec 12 17:56:12 r-11-VM Keepalived_vrrp[4060]: Opening file '/etc/keepalived/keepalived.conf'.
Dec 12 17:56:12 r-11-VM Keepalived_vrrp[4060]: Truncating auth_pass to 8 characters
Dec 12 17:56:12 r-11-VM Keepalived_vrrp[4060]: Configuration is using : 64669 Bytes
Dec 12 17:56:12 r-11-VM Keepalived_vrrp[4060]: Using LinkWatch kernel netlink reflector...
Dec 12 17:56:12 r-11-VM Keepalived_vrrp[4060]: VRRP_Instance(inside_network) Entering BACKUP STATE
Dec 12 17:56:13 r-11-VM Keepalived_vrrp[4060]: VRRP_Script(heartbeat) succeeded
Dec 12 17:56:16 r-11-VM Keepalived_vrrp[4060]: VRRP_Instance(inside_network) Transition to MASTER STATE
Dec 12 17:56:17 r-11-VM Keepalived_vrrp[4060]: VRRP_Instance(inside_network) Entering MASTER STATE

[Screenshot: 2015-12-12 at 19:01:24]

@DaanHoogland
Contributor

As an operator I want this in :p I have only read the feature description in Jira and the diff, but LGTM based on that and @remibergsma's test results.

@remibergsma
Contributor

@DaanHoogland Thanks. Deploying to a real cloud as we speak. Will verify there too.

@remibergsma
Contributor

Update: Verified the same as above in our pre-production environment (aka Employee Cloud). Will now deploy to production as it works as expected. When the integration tests are done and show nothing broke, we will merge.

@remibergsma
Contributor

Update: this resolved our production problem. It now works fine in master+this PR.

LGTM 👍

@borisroman
Contributor

LGTM 👍

Environment

  • 2 KVM hosts on CentOS 7.1
  • 1 Management Server on CentOS 7.1
  • Agent + Common RPMs built from source

Integration test suite 1

nosetests --with-marvin --marvin-config=${marvinCfg} -s -a tags=advanced,required_hardware=true \
    component/test_password_server.py \
    smoke/test_vpc_redundant.py \
    smoke/test_routers_iptables_default_policy.py \
    smoke/test_routers_network_ops.py \
    smoke/test_vpc_router_nics.py \
    smoke/test_router_dhcphosts.py \
    smoke/test_loadbalance.py \
    smoke/test_internal_lb.py \
    smoke/test_ssvm.py \
    smoke/test_vpc_vpn.py \
    smoke/test_privategw_acl.py \
    smoke/test_network.py

Result test suite 1

Check the password file in the Router VM ... === TestName: test_isolate_network_password_server | Status : SUCCESS ===
ok
Create a redundant VPC with two networks with two VMs in each network ... === TestName: test_01_create_redundant_VPC_2tiers_4VMs_4IPs_4PF_ACL | Status : SUCCESS ===
ok
Create a redundant VPC with two networks with two VMs in each network and check default routes ... === TestName: test_02_redundant_VPC_default_routes | Status : SUCCESS ===
ok
Create a redundant VPC with two networks with two VMs in each network ... === TestName: test_03_create_redundant_VPC_1tier_2VMs_2IPs_2PF_ACL_reboot_routers | Status : SUCCESS ===
ok
Test iptables default INPUT/FORWARD policy on RouterVM ... === TestName: test_02_routervm_iptables_policies | Status : SUCCESS ===
ok
Test iptables default INPUT/FORWARD policies on VPC router ... === TestName: test_01_single_VPC_iptables_policies | Status : SUCCESS ===
ok
Test redundant router internals ... === TestName: test_01_isolate_network_FW_PF_default_routes_egress_true | Status : SUCCESS ===
ok
Test redundant router internals ... === TestName: test_02_isolate_network_FW_PF_default_routes_egress_false | Status : SUCCESS ===
ok
Test redundant router internals ... === TestName: test_01_RVR_Network_FW_PF_SSH_default_routes_egress_true | Status : SUCCESS ===
ok
Test redundant router internals ... === TestName: test_02_RVR_Network_FW_PF_SSH_default_routes_egress_false | Status : SUCCESS ===
ok
Test redundant router internals ... === TestName: test_03_RVR_Network_check_router_state | Status : SUCCESS ===
ok
Create a VPC with two networks with one VM in each network and test nics after destroy ... === TestName: test_01_VPC_nics_after_destroy | Status : SUCCESS ===
ok
Create a VPC with two networks with one VM in each network and test default routes ... === TestName: test_02_VPC_default_routes | Status : SUCCESS ===
ok
Check that the /etc/dhcphosts.txt doesn't contain duplicate IPs ... === TestName: test_router_dhcphosts | Status : SUCCESS ===
ok
Test to create Load balancing rule with source NAT ... === TestName: test_01_create_lb_rule_src_nat | Status : SUCCESS ===
ok
Test to create Load balancing rule with non source NAT ... === TestName: test_02_create_lb_rule_non_nat | Status : SUCCESS ===
ok
Test for assign & removing load balancing rule ... === TestName: test_assign_and_removal_lb | Status : SUCCESS ===
ok
Test to verify access to loadbalancer haproxy admin stats page ... === TestName: test02_internallb_haproxy_stats_on_all_interfaces | Status : SUCCESS ===
ok
Test create, assign, remove of an Internal LB with roundrobin http traffic to 3 vm's ... === TestName: test_01_internallb_roundrobin_1VPC_3VM_HTTP_port80 | Status : SUCCESS ===
ok
Test SSVM Internals ... === TestName: test_03_ssvm_internals | Status : SUCCESS ===
ok
Test CPVM Internals ... === TestName: test_04_cpvm_internals | Status : SUCCESS ===
ok
Test stop SSVM ... === TestName: test_05_stop_ssvm | Status : SUCCESS ===
ok
Test stop CPVM ... === TestName: test_06_stop_cpvm | Status : SUCCESS ===
ok
Test reboot SSVM ... === TestName: test_07_reboot_ssvm | Status : SUCCESS ===
ok
Test reboot CPVM ... === TestName: test_08_reboot_cpvm | Status : SUCCESS ===
ok
Test destroy SSVM ... === TestName: test_09_destroy_ssvm | Status : SUCCESS ===
ok
Test destroy CPVM ... === TestName: test_10_destroy_cpvm | Status : SUCCESS ===
ok
Test Remote Access VPN in VPC ... === TestName: test_vpc_remote_access_vpn | Status : SUCCESS ===
ok
Test VPN in VPC ... === TestName: test_vpc_site2site_vpn | Status : SUCCESS ===
ok
test_01_vpc_privategw_acl (integration.smoke.test_privategw_acl.TestPrivateGwACL) ... === TestName: test_01_vpc_privategw_acl | Status : SUCCESS ===
ok
test_02_vpc_privategw_static_routes (integration.smoke.test_privategw_acl.TestPrivateGwACL) ... === TestName: test_02_vpc_privategw_static_routes | Status : SUCCESS ===
ok
test_03_rvpc_privategw_static_routes (integration.smoke.test_privategw_acl.TestPrivateGwACL) ... === TestName: test_03_rvpc_privategw_static_routes | Status : SUCCESS ===
ok
Test for port forwarding on source NAT ... === TestName: test_01_port_fwd_on_src_nat | Status : SUCCESS ===
ok
Test for port forwarding on non source NAT ... === TestName: test_02_port_fwd_on_non_src_nat | Status : SUCCESS ===
ok
Test for reboot router ... === TestName: test_reboot_router | Status : SUCCESS ===
ok
Test for Router rules for network rules on acquired public IP ... === TestName: test_network_rules_acquired_public_ip_1_static_nat_rule | Status : SUCCESS ===
ok
Test for Router rules for network rules on acquired public IP ... === TestName: test_network_rules_acquired_public_ip_2_nat_rule | Status : SUCCESS ===
ok
Test for Router rules for network rules on acquired public IP ... === TestName: test_network_rules_acquired_public_ip_3_Load_Balancer_Rule | Status : SUCCESS ===
ok

----------------------------------------------------------------------
Ran 38 tests in 21961.169s

OK

Integration test suite 2

nosetests --with-marvin --marvin-config=${marvinCfg} -s -a tags=advanced,required_hardware=false \
    smoke/test_routers.py \
    smoke/test_network_acl.py \
    smoke/test_reset_vm_on_reboot.py \
    smoke/test_vm_life_cycle.py \
    smoke/test_service_offerings.py \
    smoke/test_network.py \
    component/test_vpc_offerings.py \
    component/test_vpc_routers.py

Result test suite 2

Test router internal advanced zone ... === TestName: test_02_router_internal_adv | Status : SUCCESS ===
ok
Test restart network ... === TestName: test_03_restart_network_cleanup | Status : SUCCESS ===
ok
Test router basic setup ... === TestName: test_05_router_basic | Status : SUCCESS ===
ok
Test router advanced setup ... === TestName: test_06_router_advanced | Status : SUCCESS ===
ok
Test stop router ... === TestName: test_07_stop_router | Status : SUCCESS ===
ok
Test start router ... === TestName: test_08_start_router | Status : SUCCESS ===
ok
Test reboot router ... === TestName: test_09_reboot_router | Status : SUCCESS ===
ok
Test reset virtual machine on reboot ... === TestName: test_01_reset_vm_on_reboot | Status : SUCCESS ===
ok
Test advanced zone virtual router ... === TestName: test_advZoneVirtualRouter | Status : SUCCESS ===
ok
Test Deploy Virtual Machine ... === TestName: test_deploy_vm | Status : SUCCESS ===
ok
Test Multiple Deploy Virtual Machine ... === TestName: test_deploy_vm_multiple | Status : SUCCESS ===
ok
Test Stop Virtual Machine ... === TestName: test_01_stop_vm | Status : SUCCESS ===
ok
Test Start Virtual Machine ... === TestName: test_02_start_vm | Status : SUCCESS ===
ok
Test Reboot Virtual Machine ... === TestName: test_03_reboot_vm | Status : SUCCESS ===
ok
Test destroy Virtual Machine ... === TestName: test_06_destroy_vm | Status : SUCCESS ===
ok
Test recover Virtual Machine ... === TestName: test_07_restore_vm | Status : SUCCESS ===
ok
Test migrate VM ... === TestName: test_08_migrate_vm | Status : SUCCESS ===
ok
Test destroy(expunge) Virtual Machine ... === TestName: test_09_expunge_vm | Status : SUCCESS ===
ok
Test to create service offering ... === TestName: test_01_create_service_offering | Status : SUCCESS ===
ok
Test to update existing service offering ... === TestName: test_02_edit_service_offering | Status : SUCCESS ===
ok
Test to delete service offering ... === TestName: test_03_delete_service_offering | Status : SUCCESS ===
ok
Test for delete account ... === TestName: test_delete_account | Status : SUCCESS ===
ok
Test for Associate/Disassociate public IP address for admin account ... === TestName: test_public_ip_admin_account | Status : SUCCESS ===
ok
Test for Associate/Disassociate public IP address for user account ... === TestName: test_public_ip_user_account | Status : SUCCESS ===
ok
Test for release public IP address ... === TestName: test_releaseIP | Status : SUCCESS ===
ok
Test create VPC offering ... === TestName: test_01_create_vpc_offering | Status : SUCCESS ===
ok
Test VPC offering without load balancing service ... === TestName: test_03_vpc_off_without_lb | Status : SUCCESS ===
ok
Test VPC offering without static NAT service ... === TestName: test_04_vpc_off_without_static_nat | Status : SUCCESS ===
ok
Test VPC offering without port forwarding service ... === TestName: test_05_vpc_off_without_pf | Status : SUCCESS ===
ok
Test VPC offering with invalid services ... === TestName: test_06_vpc_off_invalid_services | Status : SUCCESS ===
ok
Test update VPC offering ... === TestName: test_07_update_vpc_off | Status : SUCCESS ===
ok
Test list VPC offering ... === TestName: test_08_list_vpc_off | Status : SUCCESS ===
ok
test_09_create_redundant_vpc_offering (integration.component.test_vpc_offerings.TestVPCOffering) ... === TestName: test_09_create_redundant_vpc_offering | Status : SUCCESS ===
ok
Test start/stop of router after addition of one guest network ... === TestName: test_01_start_stop_router_after_addition_of_one_guest_network | Status : SUCCESS ===
ok
Test reboot of router after addition of one guest network ... === TestName: test_02_reboot_router_after_addition_of_one_guest_network | Status : SUCCESS ===
ok
Test to change service offering of router after addition of one guest network ... === TestName: test_04_chg_srv_off_router_after_addition_of_one_guest_network | Status : SUCCESS ===
ok
Test destroy of router after addition of one guest network ... === TestName: test_05_destroy_router_after_addition_of_one_guest_network | Status : SUCCESS ===
ok
Test to stop and start router after creation of VPC ... === TestName: test_01_stop_start_router_after_creating_vpc | Status : SUCCESS ===
ok
Test to reboot the router after creating a VPC ... === TestName: test_02_reboot_router_after_creating_vpc | Status : SUCCESS ===
ok
Tests to change service offering of the Router after ... === TestName: test_04_change_service_offerring_vpc | Status : SUCCESS ===
ok
Test to destroy the router after creating a VPC ... === TestName: test_05_destroy_router_after_creating_vpc | Status : SUCCESS ===
ok

----------------------------------------------------------------------
Ran 41 tests in 8632.326s

OK

@asfgit merged commit 2bebb7f into apache:4.6 on Dec 12, 2015
asfgit pushed a commit that referenced this pull request Dec 12, 2015
CLOUDSTACK-9151 - As a Developer I want the VRID to be set within the limits of KeepaliveD

This PR fixes a blocker issue!

   - Just like with RVRs, use the VRID 51 instead of making it dependent on the VPCID
   - Reason: arbitrary unique number 0..255 used to differentiate multiple instances of vrrpd running on the same NIC (and hence same socket). virtual_router_id 51

* pr/1231:
  CLOUDSTACK-9151 - Removes the replacement of the VRID in the CsRedundant file

Signed-off-by: Remi Bergsma <github@remi.nl>
@wilderrodrigues
Contributor Author

Thanks @DaanHoogland @remibergsma @bhaisaab and @borisroman for reacting very quickly!
