ashak commented on May 19, 2024

OK, what I've described above definitely seems to be what's happening, as I just had the issue again.

Stuff broke just as described above. After the 're-election', the state of my system was:

My router-001:
8: eth6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 1000
    link/ether 52:54:00:76:99:24 brd ff:ff:ff:ff:ff:ff
    inet 10.1.3.240/24 brd 10.1.3.255 scope global eth6
    inet 10.1.3.254/24 scope global secondary eth6
    inet6 fe80::5054:ff:fe76:9924/64 scope link
       valid_lft forever preferred_lft forever

router-002:
8: eth6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 1000
    link/ether 52:54:00:9f:9d:41 brd ff:ff:ff:ff:ff:ff
    inet 10.1.3.241/24 brd 10.1.3.255 scope global eth6
    inet6 fe80::5054:ff:fe9f:9d41/64 scope link
       valid_lft forever preferred_lft forever

From the ARP cache of one of my app servers:
10.1.3.254 ether 52:54:00:9f:9d:41 C eth1

10.1.3.254 is its default gateway, so requests were reaching it via router-001 (which holds the VIP), but it was trying to send packets back via router-002's MAC address.

This seems very badly broken :(
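The diagnosis above comes down to comparing the MAC an app server has cached for the VIP with the MAC of the router that actually has the VIP configured. A minimal Python sketch, assuming the `arp -n`-style line format shown above (address, HW type, HW address, flags, interface); `parse_arp_line` and `gateway_mac_mismatch` are hypothetical helper names, not part of keepalived.

```python
from typing import Tuple


def parse_arp_line(line: str) -> Tuple[str, str]:
    """Split an `arp -n`-style entry such as
    '10.1.3.254 ether 52:54:00:9f:9d:41 C eth1' into (ip, mac)."""
    fields = line.split()
    if len(fields) < 3 or fields[1] != "ether":
        raise ValueError(f"unexpected ARP cache line: {line!r}")
    return fields[0], fields[2].lower()


def gateway_mac_mismatch(arp_line: str, vip: str, vip_owner_mac: str) -> bool:
    """Return True if the cached MAC for `vip` differs from the MAC of the
    router that currently has `vip` configured (hypothetical helper)."""
    ip, cached_mac = parse_arp_line(arp_line)
    if ip != vip:
        raise ValueError(f"ARP entry is for {ip}, not {vip}")
    return cached_mac != vip_owner_mac.lower()


if __name__ == "__main__":
    # Values from the outputs above: router-001 holds the VIP,
    # but the app server has cached router-002's MAC.
    entry = "10.1.3.254 ether 52:54:00:9f:9d:41 C eth1"
    print(gateway_mac_mismatch(entry, "10.1.3.254", "52:54:00:76:99:24"))  # True
```

Run against every app server's cache, a check like this would show which hosts are still pointed at the old master after a failover.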

from keepalived.

ricbartm commented on May 19, 2024

Any update on this? We are suffering this issue using Keepalived 1.2.8.

ashak commented on May 19, 2024

I don't think so; I had no response :(

We spent a short time trying to work out whether we could use the script-running functionality to send GARPs ourselves, but ran out of time and other work was prioritised over it.

We have ended up with a setup where, if one instance goes into a fault state, that keepalived simply stops itself using the script-running functionality. This has mostly worked, although once or twice it has still caused some downtime for the services behind it. So far, though, it's better than things breaking every time there's a fault.

acassen commented on May 19, 2024

hi guys,

Hmm, that sounds strange. During the MASTER transition, the code sends "updates": gratuitous ARPs (GARPs) for IPv4 and unsolicited neighbour advertisements for IPv6. The first thing to try is running tcpdump during the MASTER state transition to see whether GARP packets are sent on the wire (the daemon will log "VRRP_Instance(%s) Sending gratuitous ARPs on %s for %s"). If you see those logs and the packets on the wire, then your remote layer-2 party is most likely not honouring GARP :/ which is bad.

If so, maybe an ICMP packet would fix the issue... Some time ago I was considering adding the ability to send ICMP in addition to GARP.
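When checking a tcpdump capture for the GARPs mentioned above, it helps to know what one looks like on the wire. Below is a minimal Python sketch of one common form: a broadcast Ethernet frame carrying an ARP request (RFC 826 layout) whose sender and target protocol addresses are both the VIP. This is an illustration of the packet layout only, not keepalived's own code; `build_garp` is a hypothetical helper, and keepalived may fill some fields (such as the target hardware address) differently. Actually transmitting the frame would additionally need a raw packet socket and root privileges, which are not shown.

```python
import ipaddress
import struct

ETH_BROADCAST = b"\xff" * 6
ETH_P_ARP = 0x0806      # EtherType for ARP
ETH_P_IP = 0x0800       # ARP protocol type: IPv4
ARPHRD_ETHER = 1        # ARP hardware type: Ethernet
ARPOP_REQUEST = 1       # ARP opcode: request


def mac_to_bytes(mac: str) -> bytes:
    raw = bytes(int(octet, 16) for octet in mac.split(":"))
    if len(raw) != 6:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return raw


def build_garp(src_mac: str, vip: str) -> bytes:
    """Build a broadcast gratuitous ARP request announcing `vip` at `src_mac`.

    Sender and target protocol addresses are both the VIP; the target
    hardware address is left as zeros.
    """
    sha = mac_to_bytes(src_mac)
    spa = ipaddress.IPv4Address(vip).packed
    eth_header = ETH_BROADCAST + sha + struct.pack("!H", ETH_P_ARP)
    arp_header = struct.pack("!HHBBH", ARPHRD_ETHER, ETH_P_IP, 6, 4, ARPOP_REQUEST)
    arp_body = sha + spa + b"\x00" * 6 + spa
    return eth_header + arp_header + arp_body


if __name__ == "__main__":
    frame = build_garp("52:54:00:76:99:24", "10.1.3.254")
    print(len(frame), frame.hex())
```

In a capture, the giveaway is that the "who-has" and "tell" addresses are the same IP, sent to ff:ff:ff:ff:ff:ff from the new master's MAC.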

Please let me know what your debugging shows.

Regs,
Alexandre

acassen commented on May 19, 2024

Could someone experiencing the same issue help me reproduce it in my lab so I can check and fix it? (I am currently working on fixes for all the reported issues.)

Regs,
Alexandre

acassen commented on May 19, 2024

Hi,

I spent some time on this issue and extended the gratuitous ARP handling to work around some corner cases.

I have just committed a patch fixing this issue: "vrrp: fix/extend gratuitous ARP handling".

Please give it a try and report back.

Best regs,
Alexandre

ricbartm commented on May 19, 2024

Hello,

We fixed the issue in our scenario with a workaround: a small daemon that sends gratuitous ARPs while the node is master, rather than modifying the Keepalived code in a way that would probably not be merged.

For several reasons we can't spend time testing this patch right now, but I can say it's going in the right direction. Any feedback from anyone else would be highly appreciated.
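The workaround described here, a separate process that keeps re-announcing the VIP while the node is master, can be sketched as a simple loop. This is a hypothetical illustration, not ricbartm's actual daemon: `is_master` and `send_garp` are hooks the caller would have to supply (for example, checking whether the VIP is configured locally, and writing a GARP frame to a raw socket), and `interval` is an assumed refresh period.

```python
import time
from typing import Callable, Optional


def garp_refresh_loop(
    is_master: Callable[[], bool],
    send_garp: Callable[[], None],
    interval: float,
    max_rounds: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Re-announce the VIP every `interval` seconds while this node is master.

    `is_master` and `send_garp` are caller-supplied (hypothetical) hooks.
    `max_rounds` bounds the loop for testing; None means run forever.
    Returns the number of GARPs sent.
    """
    sent = 0
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        if is_master():
            send_garp()
            sent += 1
        rounds += 1
        sleep(interval)
    return sent
```

Injecting `sleep` keeps the loop testable without real delays; the key design point is that the GARP is repeated on a timer rather than only on a state transition, so hosts that missed or ignored the transition GARP are eventually corrected.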

acassen commented on May 19, 2024

hello,

This is exactly what I included in mainline: the ability to periodically send GARPs while in MASTER state (using garp_master_refresh), because in some corner cases sending gratuitous ARPs only during the MASTER transition (as specified by the RFC) may not be enough.

regs,
Alexandre

sim- commented on May 19, 2024

Just FYI, over the years we have often seen artifacts of various switch problems or network topology changes, where STP or similar could eat the packets from the master for some seconds, isolating it from the backup node(s). The backup nodes would become master and send GARPs, while the master thought nothing had changed and would continue sending VRRP advertisements to itself and possibly to some servers on that switch. Once the network converges, the backup sees the master's advertisements and steps down, but no further GARPs are sent, so the isolated segment sees no recovery GARP and the two sets of servers can remain mismatched.

Combined with the way Linux (and now newer versions of Windows) can use positive feedback from higher layers to keep neighbour entries alive, certain situations can cause hosts to stay on the wrong gateway MAC even when traffic is partially broken as a result of the mismatch. At one point I actually influenced the ARP behaviour (and fixed the problem) by adding an iptables firewall rule for UDP (i.e. not at the ARP layer) on the backup node to stop the positive feedback.

So I think periodic GARPs are required in many cases... or static MACs, where the MAC address of the multicast advertisement is enough to update the CAM tables on all of the networking equipment. In the latter case, the above scenario would mend itself purely as the VRRP packets pass from the new master.
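The "static MAC" approach corresponds to VRRP's virtual router MAC: RFC 5798 defines it as 00-00-5E-00-01-{VRID} for IPv4 and 00-00-5E-00-02-{VRID} for IPv6. Whichever node is master sources its advertisements from that same MAC, so switches relearn the port from the advertisements alone and hosts never need to change their ARP entry. A small Python helper to compute it, as a sketch; the function name is hypothetical, and in keepalived this mode is enabled with the `use_vmac` option.

```python
def vrrp_virtual_mac(vrid: int, ipv6: bool = False) -> str:
    """Return the VRRP virtual router MAC for `vrid` (RFC 5798).

    IPv4: 00-00-5E-00-01-{VRID}; IPv6: 00-00-5E-00-02-{VRID}.
    """
    if not 1 <= vrid <= 255:
        raise ValueError("VRID must be between 1 and 255")
    family = 0x02 if ipv6 else 0x01
    return "00:00:5e:00:{:02x}:{:02x}".format(family, vrid)


if __name__ == "__main__":
    print(vrrp_virtual_mac(51))  # 00:00:5e:00:01:33
```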

acassen commented on May 19, 2024

Hi Simon!

Agreed... At first I chose to disable garp_master_refresh by default... maybe we need to enable it by default with a long timer (say, every 5 minutes).
