Bind9 on VMWare

Wed Jan 13 18:54:13 UTC 2016

On 1/13/16, 10:28 AM, "bind-users-bounces at lists.isc.org on behalf of
Reindl Harald" <bind-users-bounces at lists.isc.org on behalf of
h.reindl at thelounge.net> wrote:

>
>
>Am 13.01.2016 um 16:19 schrieb Lightner, Jeff:
>> We chose to do BIND on physical for our externally authoritative
>>servers.
>>
>> We use Windows DNS for internal.
>>
>> One thing you should do if you're doing virtual is be sure you don't
>>have your guests running on the same node of a cluster.   If that node
>>fails your DNS is going down.   Ideally if you have multiple VMWare
>>clusters you'd put your guests on separate clusters.
>
>while for sure you should run them on different nodes (except for
>upgrades where you move them together to get one machine free of guests
>for a short timeframe) a VMware cluster which can be called so has a
>feature "VMware HA" which would start the VMs automatically on the other
>node after a short period of time (node exploded or isolated from the
>network for whatever reason)
>
>it would also restart a crashed guest automatically
>
>https://www.vmware.com/products/vsphere/features/high-availability
>
>one of the things which is much more harder to implement correctly with
>physical setups

I'll be the canary in the coal mine...  having went down this road before,
I felt like dying as a result.

I've ran several large DNS infras over the years.  Back in 2005/6 I
finally drank the koolaid and migrated a large caching infra
(authoritative was kept on bare metal) to VMWare+Linux.  It worked well
for awhile, and we did all the usual VMware BCPs (anti-affinity, full
redundancy across storage/multipathing, etc).  However, even with all the
OCD nits we picked, there were still edge cases that just never performed
as well (mostly high PPS) and misbehaviors stemming from VMWare or
supporting infrastructure.

After spending weeks tweaking every possible VMware setting, adding more
VMs spread across more hosts, backend network and storage upgrades, etc we
would still find or worse have end users report anomalies we hadn't seen
before on the physical infra.  I was devoted to making it work, and spent
a lot of time including nights and weekends scouring usenet groups,
talking to VMware support, etc.  It never got completely better.

Finally after babysitting that for a few years, we moved everything back
to bare metal in the name of "dependency reduction" -- we didn't want core
things like DNS relying on anything more than absolutely necessary (I'd
argue this is a sound engineering principle for any infrastructure admin
to fight for, despite the fact most pointy hairs will value cost savings
more and it flies in the face of NFV hotness).  Guess what?  No more
mystery behaviors, slow queries, etc.  Hmm.  Of course we still have
issues, but now they are much more concrete (traceable to a known bug or
other issue where the resolution is well understood).

This probably wouldn't be an issue in most environments...as I said we ran
virtual caches for years, and really only started seeing issues as clients
ramped.  However, is the cost savings really worth another complex
dependency (quite possibly relying on another team based on your org
structure), or risk you might have to back out some day as the size of
your environment increases?  Your call, but I've learned the hard way not
to virtualize core infrastructure functions just because a whitepaper or
exec says it should work.  I also learned not to trust my own testing...
because I spent a month with tools like dnsperf and real-world queryfiles
from our environments pounding on VMware+Linux+BIND and even though
testing didn't reveal any obvious problems, real world usage did.

Again it worked for awhile, I understand the many justifications, it could
make sense in some environments, the past is not necessarily the key to
the future, and I even have colleagues still doing this...  just had to
rant a bit since it has caused me much pain and suffering.  :-)