excessive failover pool balancing, leases files getting out of sync
alexm at ndtel.com
Fri Jun 17 18:54:40 UTC 2011
On Jun 17, 2011, at 9:55 AM, Marc Perea wrote:
> >From: "Gordon A. Lang" glang at goalex.com
> >While most clients are happily getting leases, many clients keep
> >retrying as if they never got the offer/acks or else they simply
> >don't like what they are getting.
> >The clients who experience trouble one day are not typically the
> >same clients who experience trouble the next day -- the problems
> >seem to be randomly and uniformly distributed across all users
> >(thousands of users) and all subnets (hundreds of subnets).
> >Does this ring a bell with anyone?
> Hi Gordon,
> this sounds exactly like a problem we are currently investigating.
> We've looked into our core, BRAS, transport, access, and CPE vendors
> alike. I wonder whether our setups have any similarities? We
> don't use failover, but instead have a couple of DHCP servers with the
> same config handing back static host IPs.
> For us, the problem appears to be that at some point, just like you
> are seeing, it looks like traffic starts to fail in the downstream
> path, in that ACKs and OFFERs are either not getting to the
> client, or the client is unhappy with what it is receiving. Things
> will be going along fine, with renewals occurring at 1/2 of the lease
> duration, and then at some point the backoff algorithm will get
> invoked, sending more and more renewal attempts until the lease
> expires, after which we'll see a DISCOVER - OFFER loop continue
> indefinitely. We're a DSL ISP, and a modem reboot or retrain fixes
> it, at which point a full DORA occurs. Our CPE vendor provided us
> custom firmware so that if the router sees a retrain it reboots the
> dhcp daemon, and it's interesting that a down/up on our access shelf
> for the port will fix it too.
> We've worked with our CPE vendor, who claims that the CPE isn't
> receiving any OFFERs when it gets stuck in the D-O loop, and our
> access vendor claims they are absolutely certain they are putting
> the OFFER (which we are certain is being ingressed by the access
> shelf) out on the wire to the customer as ATM. We're thinking that possibly
> the line conditions have changed enough that the lower frequencies
> on the adsl (upstream) are still good enough to pass our traffic
> upstream, but the higher frequencies (downstream) are so poor by
> this point that they are causing the CPE to discard the frames, but
> we are still working on proving this theory.
> Any of this sound familiar?
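For reference, the renewal pattern Marc describes (renewals at half the
lease, then more and more frequent attempts until expiry) is roughly what
RFC 2131 section 4.4.5 suggests a client do when its DHCPREQUESTs go
unanswered. Below is a minimal sketch of that timing, assuming the default
T1 = 0.5 and T2 = 0.875 of the lease and the 60-second minimum wait; the
function name and the 3600-second lease are made up for illustration, and
real clients (and servers that override T1/T2) may well behave differently.

```python
def renewal_schedule(lease_seconds, min_wait=60.0):
    """Times (seconds after lease grant) at which an unanswered client
    would send DHCPREQUESTs, following the RFC 2131 guidance.

    Assumes the default T1 = 0.5 * lease and T2 = 0.875 * lease.
    """
    t1 = 0.5 * lease_seconds
    t2 = 0.875 * lease_seconds
    attempts = []

    # RENEWING: unicast to the leasing server, waiting half of the
    # remaining time until T2 (but at least min_wait) between tries.
    t = t1
    while t < t2:
        attempts.append((t, "RENEWING"))
        t += max((t2 - t) / 2, min_wait)

    # REBINDING: broadcast from T2 on, waiting half of the remaining
    # lease time (but at least min_wait) between tries.
    t = t2
    while t < lease_seconds:
        attempts.append((t, "REBINDING"))
        t += max((lease_seconds - t) / 2, min_wait)

    # After this the lease has expired and the client starts over
    # with DISCOVER.
    return attempts


if __name__ == "__main__":
    # Hypothetical one-hour lease, purely for illustration.
    for when, state in renewal_schedule(3600):
        print(f"{when:8.1f}s  {state}")
```

Comparing timestamps from a capture against a schedule like this can help
tell whether the client is simply following its retry schedule or doing
something odd of its own.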
Just curious, guys, if you are using the access equipment (DSL modem
or ONT) as the firewall? If not, you could sniff between the modem/
ont and the firewall WAN port to prove or disprove whether the OFFER
is being sent down to the firewall.
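If you do capture at that point, something like the following could help
sift the results. It is only a rough sketch: it assumes you have already
pulled the raw UDP payloads (ports 67/68) out of the capture with whatever
tool you prefer, and the function names are my own invention. It reads the
DHCP message type (option 53) and the transaction ID from each payload and
reports which DISCOVERs never saw a matching OFFER.

```python
import struct

MAGIC_COOKIE = b"\x63\x82\x53\x63"  # marks the start of the DHCP options
MESSAGE_TYPES = {
    1: "DISCOVER", 2: "OFFER", 3: "REQUEST", 4: "DECLINE",
    5: "ACK", 6: "NAK", 7: "RELEASE", 8: "INFORM",
}


def dhcp_message_type(payload):
    """Return the DHCP message type name from a UDP payload, or None."""
    # 236-byte fixed BOOTP header, then the 4-byte magic cookie.
    if len(payload) < 240 or payload[236:240] != MAGIC_COOKIE:
        return None
    i = 240
    while i < len(payload):
        code = payload[i]
        if code == 0:        # pad option, no length byte
            i += 1
            continue
        if code == 255:      # end option
            break
        if i + 1 >= len(payload):
            break            # truncated option
        length = payload[i + 1]
        value = payload[i + 2:i + 2 + length]
        if code == 53 and len(value) == 1:
            return MESSAGE_TYPES.get(value[0])
        i += 2 + length
    return None


def transaction_id(payload):
    """Return the xid field (bytes 4..7 of the BOOTP header)."""
    return struct.unpack("!I", payload[4:8])[0]


def unanswered_discovers(payloads):
    """Return the set of xids that had a DISCOVER but no OFFER."""
    discovers, offers = set(), set()
    for p in payloads:
        kind = dhcp_message_type(p)
        if kind == "DISCOVER":
            discovers.add(transaction_id(p))
        elif kind == "OFFER":
            offers.add(transaction_id(p))
    return discovers - offers
```

Run against payloads captured on the customer side of the modem/ont, a
non-empty result would suggest the OFFERs are not making it that far; an
empty one would point back at the firewall or CPE.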
We made a conscious decision to *not* utilize a modem/ont firewall in
any installation; rather, to recommend/sell/give an off-the-shelf
inexpensive firewall to the customer for this express reason. That
way, we have a definite demarc and are not dealing with any liabilities
related to network security, or relying on a modem/ont vendor to make
proper firewalls. Also, it makes troubleshooting much easier when it
comes to a situation exactly like this.
If you are using a built-in firewall, you could try switching to
bridged mode and monitoring the link between the bridged modem/ont
and a "real" firewall. It could be that the built-in
firewall is just hitting a bug that causes this behavior.
We have a couple of dsl/fttp vendors in place. I have seen this
behavior on one of them. Typically, rebooting the access card or
swapping activity on the management cards will clear the problem
up... Marc, you probably can guess which vendor I am talking about.
Just my $.02...