Hi, I've spent most of the morning trying to debug an issue with our

Is it possible that no GARP was sent from a master that received a lower prio advert? about keepalived HOT 10 CLOSED

acassen commented on May 19, 2024

Is it possible that no GARP was sent from a master that received a lower prio advert?

from keepalived.

Comments (10)

ashak commented on May 19, 2024

OK, what I've described above definitely seems to be what's happening, as I just had the issue again.

Stuff broke just as described above. After the 're-election', the state of my system was this:

My router-001:
8: eth6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 1000
    link/ether 52:54:00:76:99:24 brd ff:ff:ff:ff:ff:ff
    inet 10.1.3.240/24 brd 10.1.3.255 scope global eth6
    inet 10.1.3.254/24 scope global secondary eth6
    inet6 fe80::5054:ff:fe76:9924/64 scope link
       valid_lft forever preferred_lft forever

router-002:
8: eth6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 1000
    link/ether 52:54:00:9f:9d:41 brd ff:ff:ff:ff:ff:ff
    inet 10.1.3.241/24 brd 10.1.3.255 scope global eth6
    inet6 fe80::5054:ff:fe9f:9d41/64 scope link
       valid_lft forever preferred_lft forever

From the arp cache of one of my app servers:
10.1.3.254 ether 52:54:00:9f:9d:41 C eth1

10.1.3.254 is its default gateway. So it was receiving request packets via router-001, but was trying to send the replies back via router-002.

This seems very badly broken :(
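
The mismatch described above can be checked mechanically. The following is a minimal sketch, not part of keepalived: it parses the relevant lines of the `ip addr` output quoted in this comment and compares the router that holds 10.1.3.254 with the router whose MAC sits in the app server's ARP cache. The helper names `interface_info` and `compare_vip_owner` are hypothetical.

```python
import re

# Relevant lines copied from the command output quoted in this comment.
ROUTERS = {
    "router-001": """link/ether 52:54:00:76:99:24 brd ff:ff:ff:ff:ff:ff
inet 10.1.3.240/24 brd 10.1.3.255 scope global eth6
inet 10.1.3.254/24 scope global secondary eth6""",
    "router-002": """link/ether 52:54:00:9f:9d:41 brd ff:ff:ff:ff:ff:ff
inet 10.1.3.241/24 brd 10.1.3.255 scope global eth6""",
}


def interface_info(ip_addr_output):
    """Return (MAC, set of IPv4 addresses) found in `ip addr` output."""
    mac = re.search(r"link/ether\s+([0-9a-f:]{17})", ip_addr_output).group(1)
    # `inet` followed by whitespace skips the `inet6` lines.
    addrs = set(re.findall(r"\binet\s+(\d+(?:\.\d+){3})/", ip_addr_output))
    return mac, addrs


def compare_vip_owner(vip, cached_mac, routers):
    """Return (router that has the VIP configured, router that owns cached_mac)."""
    owner = cached = None
    for name, output in routers.items():
        mac, addrs = interface_info(output)
        if vip in addrs:
            owner = name
        if mac == cached_mac:
            cached = name
    return owner, cached


# 52:54:00:9f:9d:41 is the MAC the app server had cached for 10.1.3.254.
owner, cached = compare_vip_owner("10.1.3.254", "52:54:00:9f:9d:41", ROUTERS)
print(f"VIP configured on {owner}, ARP cache points at {cached}")
# prints: VIP configured on router-001, ARP cache points at router-002
```

When the two names differ, the host is sending gateway traffic to a router that no longer holds the virtual IP.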

ricbartm commented on May 19, 2024

Any update on this? We are suffering from this issue using Keepalived 1.2.8.

ashak commented on May 19, 2024

I don't think so, I had no response :(

We spent a short time trying to work out if we could use the script running functionality to produce GARPs ourselves but ran out of time and had other work prioritised over it.

We have ended up in a situation where, if one instance goes into a fault state, we simply have that keepalived stop itself using the script running functionality. This has mostly worked, although once or twice we have still ended up in a situation where it caused some downtime of the services behind it. But so far it's better than it breaking every time there's a fault.

acassen commented on May 19, 2024

hi guys,

Hmm, sounds strange. During a master transition, the code sends "updates", which are GARPs for IPv4 and Unsolicited Neighbor Advertisements for IPv6. The first thing to try is to run tcpdump during the master state transition to see if GARP packets are sent on the wire... (the daemon will log "VRRP_Instance(%s) Sending gratuitous ARPs on %s for %s"). If you see those logs and the packets on the wire... then maybe (for sure) your layer 2 remote party is not honouring GARP :/ which is bad.

if so, maybe an ICMP will fix the issue... I was considering some time ago adding the ability to send ICMP in addition to GARP.

Please, let me know your debug.

Regs,
Alexandre
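
When reading a capture taken during the transition (for example with `tcpdump -n -e -i eth6 arp`), it helps to know what makes an ARP packet gratuitous: the sender and target protocol addresses are the same. The following is a minimal sketch of that check on a raw Ethernet frame. The function name `is_gratuitous_arp` and the two sample frames are illustrative, built from the addresses in this thread rather than taken from a real capture.

```python
import struct


def is_gratuitous_arp(frame: bytes) -> bool:
    """Return True if an untagged Ethernet frame carries a gratuitous IPv4 ARP."""
    if len(frame) < 42:  # 14-byte Ethernet header + 28-byte ARP payload
        return False
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != 0x0806:  # not ARP
        return False
    htype, ptype, hlen, plen, _oper = struct.unpack("!HHBBH", frame[14:22])
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4):  # Ethernet / IPv4 only
        return False
    sender_ip = frame[28:32]
    target_ip = frame[38:42]
    # Gratuitous ARP: the sender announces its own address.
    return sender_ip == target_ip


# Illustrative frame: router-001's MAC announcing 10.1.3.254 to broadcast.
garp = bytes.fromhex(
    "ffffffffffff" "525400769924" "0806"  # dst, src, ethertype
    "0001" "0800" "06" "04" "0001"        # htype, ptype, hlen, plen, oper
    "525400769924" "0a0103fe"             # sender MAC, sender IP
    "000000000000" "0a0103fe"             # target MAC, target IP
)
# Ordinary ARP request from 10.1.3.240 asking who has 10.1.3.254.
plain = garp[:28] + bytes.fromhex("0a0103f0") + garp[32:]

print(is_gratuitous_arp(garp), is_gratuitous_arp(plain))  # prints: True False
```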

acassen commented on May 19, 2024

Could someone experiencing the same issue help me reproduce it in my lab so I can check it and fix it? (I am currently working through fixes for all the reported issues.)

Regs,
Alexandre

acassen commented on May 19, 2024

Hi,

I spent time on this issue and extended gratuitous ARP handling to work around some corner cases.

I just committed a patch fixing this issue under "vrrp: fix/extend gratuitous ARP handling"

Please give it a try and report.

Best regs,
Alexandre

ricbartm commented on May 19, 2024

Hello,

We fixed the issue in our scenario with a workaround: a small daemon that sends gratuitous ARPs while the node is the master, rather than messing with the Keepalived code, which would probably not be merged.

For several reasons we can't spend time testing this patch now, but I can say it's a step in the right direction. Any feedback from anyone else would be highly appreciated.
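
The daemon mentioned in this comment was not posted, so the following is only a sketch of how such a workaround could look, under stated assumptions: Linux, root privileges, and a caller-supplied `is_master` callable (hypothetical) that reports whether this node currently holds the virtual IP. The names `build_garp`, `send_frame` and `refresh_loop` are invented for illustration.

```python
import socket
import struct
import time


def build_garp(mac: str, ip: str) -> bytes:
    """Build a broadcast gratuitous ARP request announcing `ip` at `mac`."""
    hw = bytes.fromhex(mac.replace(":", ""))
    addr = socket.inet_aton(ip)
    ethernet = b"\xff" * 6 + hw + struct.pack("!H", 0x0806)
    # htype=Ethernet, ptype=IPv4, hlen=6, plen=4, oper=1 (request)
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    # Sender and target IP are identical: that is what makes it gratuitous.
    arp += hw + addr + b"\x00" * 6 + addr
    return ethernet + arp


def send_frame(ifname: str, frame: bytes) -> None:
    """Send a raw Ethernet frame (Linux only, needs root / CAP_NET_RAW)."""
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as sock:
        sock.bind((ifname, 0))
        sock.send(frame)


def refresh_loop(ifname, mac, vip, is_master, interval=60.0):
    """Re-announce the VIP every `interval` seconds while is_master() is true.

    `is_master` is a hypothetical callable supplied by the caller, e.g. one
    that checks whether the VIP is currently configured on `ifname`.
    """
    frame = build_garp(mac, vip)
    while True:
        if is_master():
            send_frame(ifname, frame)
        time.sleep(interval)


frame = build_garp("52:54:00:76:99:24", "10.1.3.254")
```

Only `build_garp` runs here. A deployment would call something like `refresh_loop("eth6", "52:54:00:76:99:24", "10.1.3.254", is_master)` from a supervised process.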

acassen commented on May 19, 2024

hello,

This is exactly the code I included in mainline: adding the possibility to periodically send GARPs while in MASTER state (using garp_master_refresh), because in some corner cases sending gratuitous ARPs only during the MASTER transition (as specified by the RFC) may not be enough.

regs,
Alexandre

sim- commented on May 19, 2024

Just FYI, over the years, we have often seen artifacts of various switch problems or network topology changes, where STP or similar could eat the packets from the master for some seconds, isolating it from the backup node(s). The backup nodes would become master and GARP, while the master thought nothing changed and would continue VRRPing to itself and possibly some servers on that switch. Once the network converges, the backup sees the master's advertisements and stops, but no other GARPs occur, so the isolated segment sees no recovery GARP and the two sets of servers can stay unmatched.

Combined with the way positive feedback can be used from higher layers in Linux (and now newer versions of Windows), certain situations can cause the hosts to stay that way even if the traffic is partially broken as a result of the gateway mismatches. I have actually influenced the ARP behaviour (and fixed the problem) at this point by adding a UDP iptables (i.e. not ARP-layer) firewall rule on the backup node to stop the positive feedback.

So, I think periodic GARPs are required in many cases...or static MACs, where the MAC address of the multicast advertisement is enough to update the CAM tables on all of the networking equipment. In the latter case, the above scenario would mend itself purely as the VRRP packets pass from the new master.
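
The sequence described in this comment can be traced with a toy model. This is a simplified sketch, not a simulation of keepalived itself: it tracks only the ARP cache of one host on each side of the split, uses the two router MACs from earlier in the thread, and the host names `host-a` and `host-b` are hypothetical.

```python
MASTER_MAC = "52:54:00:76:99:24"  # router-001
BACKUP_MAC = "52:54:00:9f:9d:41"  # router-002


def simulate(periodic_refresh: bool) -> dict:
    """Return each host's cached MAC for the VIP after a split and recovery."""
    # Before the fault, every host has learned the master's MAC.
    cache = {"host-a": MASTER_MAC, "host-b": MASTER_MAC}

    # The network splits: host-a stays with the master, host-b with the backup.
    # The backup stops hearing adverts, becomes master and sends GARPs,
    # which only reach its own side of the split.
    cache["host-b"] = BACKUP_MAC

    # The network converges: the backup sees the master's adverts and steps
    # down. The original master never changed state, so nobody sends a GARP.
    if periodic_refresh:
        # With a periodic refresh, the master re-announces itself to everyone.
        for host in cache:
            cache[host] = MASTER_MAC
    return cache


print(simulate(periodic_refresh=False)["host-b"] == BACKUP_MAC)  # prints: True
print(simulate(periodic_refresh=True)["host-b"] == MASTER_MAC)   # prints: True
```

Without the refresh, `host-b` keeps pointing at the node that has already returned to backup, which matches the stale ARP entry reported at the top of the thread.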

acassen commented on May 19, 2024

Hi Simon !

Agreed... At first I chose to disable garp_master_refresh by default... maybe we need to turn it on by default with a long timer (say, every 5 min).
