lwresd performance with server down
romash at sonusnet.com
Fri Mar 30 23:02:58 UTC 2012
After further experimentation, I've learned some more about how this behaves and made some progress toward getting the performance we need.
9.2.6 lwresd appears to send every request to the first server in the list, and to send a request to the second server only when the request to the first server times out.
9.6 lwresd appears to send only a few requests to the first server in the list if it is down, then switch all requests to the second server for a while, retrying a few requests (4?) to the first server every so often. Also, my earlier performance comments regarding 9.6 lwresd were wrong: we can run it at much higher rates without dropping calls, even with the first server down.
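To make the two observed behaviors easier to compare, here is a toy Python model of the 9.6-style pattern: use the first server, give up on it after a few consecutive timeouts, and probe it again after a hold-down period. This is only a sketch of what I see on the wire, not lwresd's actual code. The class name, `max_failures` and `retry_after` are hypothetical, and the values are guesses.

```python
class FailoverSelector:
    """Toy model: prefer the first server, fall back after repeated timeouts."""

    def __init__(self, servers, max_failures=4, retry_after=30.0):
        self.servers = list(servers)
        self.max_failures = max_failures  # hypothetical: timeouts before giving up
        self.retry_after = retry_after    # hypothetical: hold-down in seconds
        self.failures = 0
        self.down_since = None            # time the first server was marked down

    def pick(self, now):
        """Return the server the next request should go to."""
        if self.down_since is not None and now - self.down_since >= self.retry_after:
            # Hold-down expired: start probing the first server again.
            self.down_since = None
            self.failures = 0
        if self.down_since is not None:
            return self.servers[1]
        return self.servers[0]

    def report(self, server, ok, now):
        """Record the outcome of a request sent to `server`."""
        if server != self.servers[0]:
            return
        if ok:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.down_since = now
```

In this model the 9.2.6 behavior would correspond to never marking the first server down, so every request pays the timeout before reaching the second server.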
However, BOTH versions appear to send only four requests at a time to either server, then wait for a response. This limits lwresd to four requests in flight and makes performance depend on the network round-trip time to the servers. This appears to me to be due to NTASKS and NRECVS being set to 2 in lwresd.c.
In conjunction with the 9.2.6 behavior, we can't run more than a very small number of requests per second, since the timeout to the first server effectively becomes part of the network round-trip time for each request.
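The arithmetic behind that limit is simple: with a fixed number of requests in flight, the request rate cannot exceed that number divided by the time each request takes to come back. A minimal Python sketch follows. The function name and the round-trip times are made up for illustration, and only the window of four comes from what I observed.

```python
def max_request_rate(in_flight: int, round_trip_s: float) -> float:
    """Upper bound on requests per second when at most `in_flight`
    requests may be outstanding and each takes `round_trip_s` to answer."""
    if in_flight <= 0 or round_trip_s <= 0:
        raise ValueError("in_flight and round_trip_s must be positive")
    return in_flight / round_trip_s


# Hypothetical round-trip times; only the window of 4 is an observation.
for rtt in (0.005, 0.020, 0.100):
    rate = max_request_rate(4, rtt)
    print(f"rtt={rtt * 1000:.0f} ms -> at most {rate:.0f} requests/s")
```

If a timeout to a dead server is added to every request, it goes straight into `round_trip_s`, which is why the ceiling collapses when the first server is down.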
Has anyone had any experience with playing with NTASKS and NRECVS in lwresd?
We are using lwresd to resolve DNS ENUM queries with the cache TTL set to 1 second (effectively off) and only two servers on a Solaris 10 Netra 5220 system. Performance is reasonable if the first server is up, but when the first server stops responding, we get unreasonably bad performance.
With a 9.2.6 lwresd, we see only 4 requests leave lwresd, and no further requests sent or processed until these 4 complete. Anything much above 8 requests per second ultimately leads to greater than 10 second responses from lwresd to our application.
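The runaway response times are what you would expect once the offered load exceeds what four slots can serve, because the backlog then grows without bound. The small Python simulation below illustrates this under stated assumptions: evenly spaced arrivals, four parallel slots, and a fixed per-request service time. The 0.5 second service time is chosen only so that capacity works out to 8 requests per second. It is not a measured lwresd timeout, and the function name is hypothetical.

```python
import heapq


def simulate_latencies(rate, service_time, slots=4, duration=60.0):
    """Latency of each request when evenly spaced arrivals at `rate` per
    second are served by `slots` parallel slots, first come first served."""
    free_at = [0.0] * slots  # time at which each slot next becomes free
    heapq.heapify(free_at)
    latencies = []
    for i in range(int(rate * duration)):
        arrival = i / rate
        slot_free = heapq.heappop(free_at)  # earliest available slot
        start = max(arrival, slot_free)     # wait if every slot is busy
        finish = start + service_time
        heapq.heappush(free_at, finish)
        latencies.append(finish - arrival)
    return latencies


# Assumed numbers: 4 slots at 0.5 s each gives a capacity of 8 requests/s.
for rate in (4, 8, 12):
    worst = max(simulate_latencies(rate, 0.5))
    print(f"{rate} requests/s -> worst latency {worst:.1f} s over 60 s")
```

Below capacity the latency stays at the service time. Above it, the worst latency keeps climbing for as long as the load is applied.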
With a 9.6 lwresd, performance is better, but at 30 requests per second we are still seeing greater than 1% of the requests exceeding 3 seconds. At 60 requests per second, this goes up to over 10%.
Can anyone explain either of these behaviors and what we might do to improve this?
I'd also welcome an explanation of the retransmission behavior of lwresd in this sort of situation (or a pointer to documentation that might describe such...).