Marc Perea marccp at
Fri May 20 14:12:07 UTC 2011

Hello list,
I have a pretty strange scenario occurring that I don't expect to get any responses on, but it's worth a try and I can hope, right?
We're an ISP, and our DHCP server hands out 1 IP per agent.circuit-id. We're at the end of months of troubleshooting a pretty weird issue, and I was just wondering if anyone else had perhaps run into this situation or something similar, and had any tips or clues.
What I'm seeing is that sometimes our L3 RG out at the customer prem will lose connectivity. At first we thought it was a BRAS core router problem, because we've seen similar issues at the core (which also performs DHCP-relay) in the past. This time, though, our packet sniffs lead us further downstream toward the client: we can validate that the DISCOVER is hitting the server, and that the OFFER with the matching transaction ID is making it past our BRAS out into the core and is being seen by the access shelf. What we don't know is whether the access shelf is dropping (or modifying) the OFFER, or whether the RG is failing to process the OFFER and move on to the REQUEST. Since DORA never completes, we see thousands of D-O exchanges each day for a particular broken customer.
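
For anyone who wants to look for the same pattern in their own captures, here is a rough sketch of how the stuck D-O loops could be counted per circuit. It assumes the sniff or server log has already been parsed into simple records; the DhcpEvent type, the stalled_circuits function, the threshold parameter, and the circuit-id strings are all made up for illustration and are not part of any DHCP server or capture tool. It counts transaction IDs that got an OFFER but never moved on to a REQUEST.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple


class DhcpEvent(NamedTuple):
    """One parsed DHCP message (hypothetical record layout)."""
    circuit_id: str  # option 82 agent.circuit-id as seen at the server
    xid: int         # DHCP transaction ID
    msg_type: str    # "DISCOVER", "OFFER", "REQUEST", or "ACK"


def stalled_circuits(events: Iterable[DhcpEvent], threshold: int = 3) -> Dict[str, int]:
    """Return circuits with at least `threshold` OFFERs that never led to a REQUEST."""
    offered = set()    # (circuit_id, xid) pairs for which an OFFER was seen
    requested = set()  # (circuit_id, xid) pairs that progressed to REQUEST
    for ev in events:
        key = (ev.circuit_id, ev.xid)
        if ev.msg_type == "OFFER":
            offered.add(key)
        elif ev.msg_type == "REQUEST":
            requested.add(key)
    # A transaction is "stalled" if it was offered but never requested.
    stalled = Counter(cid for cid, _xid in offered - requested)
    return {cid: n for cid, n in stalled.items() if n >= threshold}


if __name__ == "__main__":
    # Made-up sample data: one broken circuit, one healthy circuit.
    sample = [
        DhcpEvent("shelf1/1", 1, "DISCOVER"), DhcpEvent("shelf1/1", 1, "OFFER"),
        DhcpEvent("shelf1/1", 2, "DISCOVER"), DhcpEvent("shelf1/1", 2, "OFFER"),
        DhcpEvent("shelf1/1", 3, "DISCOVER"), DhcpEvent("shelf1/1", 3, "OFFER"),
        DhcpEvent("shelf2/7", 10, "DISCOVER"), DhcpEvent("shelf2/7", 10, "OFFER"),
        DhcpEvent("shelf2/7", 10, "REQUEST"), DhcpEvent("shelf2/7", 10, "ACK"),
    ]
    print(stalled_circuits(sample))
```

This relies on the client reusing the OFFER's transaction ID in its REQUEST, which is what RFC 2131 specifies. A client that deviates from that would show up as a false positive here.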

Unfortunately, the access shelf is Ethernet on one side and ATM on the other, providing ADSL service to our customer, so we can't sniff between the shelf and the RG to determine which one is causing the problem. Additionally, when we split the RG, which is doing both bridging and routing/NAT, into two separate boxes so that we can have an Ethernet sniff point between them, it doesn't seem to break anymore.
Finally, a modem reboot will fix the problem: DORA completes just fine when the RG comes back up. We also encouraged the RG vendor to develop and implement a firmware feature whereby if the modem sees a down/up on its ATM interface (ADSL retrain), it would re-DHCP, much like an Ethernet interface would. A down/up from the access shelf side does indeed also fix the problem.
We've had tickets open with every vendor from the edge to the core and haven't been able to isolate the problem any further than what I've explained here. A ticket remains open with the RG vendor, but it doesn't look very promising now that I've also seen the problem occur on a Belkin router.
Ideas, anyone?
