administering 1,000 zone files

Fri Dec 31 15:14:22 UTC 2004

On Fri, Dec 31, 2004 at 12:07:48PM +0000, Jim Reid wrote:

Jim,

> You miss the point completely. It's the fact that the server's
> configuration is split across many files that's the problem.

Then I strongly suggest that BIND drop the include feature
altogether if it makes the server unstable. Not!

In fact, the include feature makes it more stable because you
can split the configuration into a global part with options,
logging, ACLs etc. and a part that contains only the zone
statements.

Generating and validating a file with only a limited syntax
reduces complexity significantly.
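
To illustrate what "limited syntax" buys you, here is a minimal Python
sketch of a generator and checker for a zones-only include file. The
single allowed statement form, the "slave/" file layout and the
function names are assumptions made for this example, not the actual
tooling.

```python
import ipaddress
import re

# Assumption: every line of the include file has exactly this one form.
ZONE_NAME = re.compile(
    r"[a-z0-9]([a-z0-9-]*[a-z0-9])?(\.[a-z0-9]([a-z0-9-]*[a-z0-9])?)*")
STATEMENT = re.compile(
    r'zone "([a-z0-9.-]+)" \{ type slave; file "slave/\1"; '
    r'masters \{ [0-9.]+; \}; \};')


def zone_statement(name, master_ip):
    """Return one slave zone statement, refusing suspicious input."""
    if not ZONE_NAME.fullmatch(name):
        raise ValueError("bad zone name: %r" % name)
    ipaddress.IPv4Address(master_ip)  # raises ValueError if malformed
    return 'zone "%s" { type slave; file "slave/%s"; masters { %s; }; };' % (
        name, name, master_ip)


def invalid_lines(text):
    """Return (line number, line) for every line outside the allowed form."""
    return [(n, line)
            for n, line in enumerate(text.splitlines(), 1)
            if line and not STATEMENT.fullmatch(line)]
```

A check like this is only practical because the included file contains
nothing but zone statements; the global part with options and logging
stays out of reach of the generator.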

>     Michael> But the program dealing with the files had the advantage
>     Michael> that instead of ploughing through a few ten MBytes it
>     Michael> only had to work on a few hundred kilobytes, making it
>     Michael> 10-20 times as fast.
> 
> That's a curious design choice.

Making something work is a curious design choice to you?

Let me tell you again: writing or even transporting the whole
configuration file for every single tiny change is TOO SLOW.
It could never complete all changes in the required time; it
is the difference between handling a customer request in
minutes and handling it in hours.

> IMO it's the wrong trade-off. It
> should be more important to have a simple, stable DNS configuration
> than a faster way of generating fragments of named.conf. Which doesn't
> appear to be "faster" BTW: it only generates around 1% of named.conf
> in 5-10% of the time to generate the whole file. What's the point of
> generating snippets of named.conf "faster" when that makes it harder
> to debug the server's configuration and troubleshoot problems?

As a matter of fact it is 10-20 times faster and not "faster",
whatever you try to imply. It is also a simple and stable DNS
configuration because the changes affect only that part of
the configuration that is simple.

It also doesn't make it any more difficult to debug a server's
configuration. Even you can operate a simple 'cat' and 'named-checkconf'
will read include files pretty well. But I guess that feature makes
it brittle too.

>     Michael> The slave servers (running BIND8 at that time!) use
>     Michael> configuration and zone files stored on a local disk. Even
>     Michael> BIND9 cannot do much better (unless you add the DLZ
>     Michael> patches), but why should I rely on this, really huge,
>     Michael> complexity ? 
> 
> BIND-DLZ isn't needed to solve this. Neither is BIND for that matter.

So you propose different software altogether? Then I guess
BIND by itself is so brittle that you want to avoid it.

>     >>  Yuk! What if that small program fails?
> 
>     Michael> What if a large big program fails ? I don't see a point
>     Michael> here.  If any program fails that modifies the nameserver
>     Michael> configuration then the result can be disastrous.
> 
> You have many more components in your setup.

I don't. I have a few more components in my setup that provide
functionality beyond that of BIND. Any other system would need to
provide the same. Everything you suggested also added 'more components'
to the setup, some of them pretty complex, like a database.

> Some of these appear to
> be autonomous, independent agents on the slave servers.

There is an ssh server and cron. sshd is surely not autonomous,
cron is. But then it only runs a tiny script that checks a flag,
runs ndc if the flag was set and clears the flag. This makes
it only reactive to the mechanism setting the flag.
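
A rough Python sketch of such a cron job follows. The flag path and the
function name are made up for the example, and the reload command is
passed in rather than hard-coded. Note that this sketch clears the flag
before reloading, so that a change arriving during the reload sets the
flag again; that ordering is a choice of the sketch and differs from
the order described above.

```python
import os
import subprocess


def poll_once(flag_path, reload_cmd, run=subprocess.call):
    """Reload the nameserver once if the flag file exists.

    Returns True if a reload was attempted, False if nothing was pending.
    """
    if not os.path.exists(flag_path):
        return False  # nothing changed since the last poll
    # Assumption of this sketch: clear first, then reload, so a change
    # that arrives during the reload is picked up by the next poll.
    os.remove(flag_path)
    run(reload_cmd)
    return True


# Called from cron, e.g. (hypothetical path):
#   poll_once("/var/named/reload.flag", ["ndc", "reload"])
```

However many changes set the flag within one poll period, the result is
a single reload.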

The version for the hidden primary server is slightly more complex
and can also issue a series of 'reload zone' commands.

There are two reasons for this mechanism:

It limits the number of reloads to 1 per poll period while batching
all changes that occur within this poll period. This ensures that
the nameserver does its primary job, answering requests, and is never
overloaded by too-frequent reloads.

It also escalates the reload operation to a different privilege level.
Remember that in BIND8 times the reload required root permission.
Even in BIND9 times the owner of the rndc key can issue more
commands than a simple reload but having write access to the
configuration file makes this difference negligible.

So much for your autonomous, independent agents.

> This means
> there's a much higher probability of a failure and subtle interactions
> between these components. And let's not forget the complicated
> synchronisation issues that have been introduced.

Only one answer here: this is pure theory. You have no facts at
all about the probabilities involved, yet you still claim that
the effects are huge and the interactions subtle.

I can tell you that the effects would be much larger and
would be more subtle if I followed your suggestion and
used a complex system, such as a database.

Interestingly enough, you also suggest just copying the configuration
files to the slave servers (which is just the opposite), despite
the fact that exactly this wasn't possible because of the sheer
volume.

>     Michael> So how do you suggest to modify the configuration files?
>     Michael> Manually?  Do you suggest to run a database replica on
>     Michael> each slave server and believe that this will never fail,
>     Michael> but insist that the small program does?
> 
> Generate the 2 config file centrally: one for the stealth master and
> one for the slaves. I don't care how you do this. Though with 200K
> zones, some tooling is advisable. Once they have been checked, copy
> the files to their destination. Then activate them during the
> designated service window(s).

Do you believe there is such a thing as a 'designated service window'
for 200000 zones? Please explain what such a thing would look like.
Would it include downtime for the servers?

The fact is that the nameserver has to operate 24x7 without downtime
and without a service window (and we are talking about the service,
not the machine). It also has to accommodate a large number (about
2000) of changes to the configuration _per day_, which gives 2000
'designated service windows' of 43 seconds each.
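
For the record, the 43 seconds is simply the length of a day divided by
the number of changes. In Python (variable names are mine):

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86400
changes_per_day = 2000           # approximate figure from above

# Average time available per change if each one needed its own window.
seconds_per_change = SECONDS_PER_DAY / changes_per_day
```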

But we can use the same words for what happens, although I would
never call it that:

There are 'designated service windows' for each slave every 15-30
minutes. The windows of course do not overlap, to avoid a disastrous
outage of the redundant slave server setup.
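
One simple way to get non-overlapping windows is to spread the slaves
evenly over the poll period. The following Python sketch shows the
idea; the function name, the server names and the numbers in the
comment are illustrative assumptions, not the real schedule.

```python
def reload_offsets(slaves, period_s, window_s):
    """Give each slave a start offset (in seconds) within the period.

    Offsets are spaced evenly; an error is raised if the windows cannot
    fit into the period without overlapping.
    """
    if not slaves:
        return {}
    if len(slaves) * window_s > period_s:
        raise ValueError("windows do not fit into the period")
    step = period_s / len(slaves)  # step >= window_s, so no overlap
    return {name: i * step for i, name in enumerate(slaves)}


# Example with hypothetical numbers: three slaves, a 15 minute period
# and a 2 minute reload window each:
#   reload_offsets(["ns1", "ns2", "ns3"], 900, 120)
```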

The configuration itself is not uploaded; only the incremental
differences from the previous configuration are. The differences are
applied to the on-disk configuration and activated during the
'designated service windows'.
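
As a sketch of what applying such differences can look like, assume the
zone part of the configuration is modelled as a mapping from zone name
to master address and a change set is a list of add/remove operations.
That data model and the function name are assumptions for this example.

```python
def apply_changes(zones, changes):
    """Apply a list of ("add", name, master) / ("remove", name) changes.

    Works on a copy: if any change is inconsistent with the current
    state, an error is raised and the original mapping stays untouched.
    """
    updated = dict(zones)
    for change in changes:
        op, name = change[0], change[1]
        if op == "add":
            if name in updated:
                raise ValueError("zone already configured: " + name)
            updated[name] = change[2]
        elif op == "remove":
            if name not in updated:
                raise ValueError("zone not configured: " + name)
            del updated[name]
        else:
            raise ValueError("unknown operation: " + op)
    return updated
```

Rejecting a change set that does not match the current state is what
keeps small errors from accumulating silently.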

You may argue that incrementally changing the configuration can
accumulate small errors. True. But rewriting the configuration
from scratch can cause one very large error. I prefer the small
errors.

Needless to say, the incremental updates never had a single
failure during several years of operation. However, verifying
the nameserver operation on the slave (see below) for the
incremental changes (something you cannot do for the full
configuration) prevented lots of errors.

>     >> Wouldn't it be better to generate the configuration data in one
>     >> place?
> 
>     Michael> The configuration data can be easily regenerated (let's
>     Michael> say if the slave servers had both system disks fail at
>     Michael> the same time).  In fact, this is absolutely
>     Michael> necessary. How would you (== some automated system) know
>     Michael> that a slave server needs to be configured when there is
>     Michael> no data available ?
> 
> Name servers should do what they are told. IMO they have no business
> fetching or generating their configurations. This should be a push
> operation, not a pull.

I have always described this as a push operation.

>     Michael> BIND doesn't store the zones in a single file
>     Michael> either, why do you insist on storing the configuration in
>     Michael> a single file?
> 
> Looking after 1 file is a whole lot easier than looking after 10 or
> 100.

It isn't for a program. But it is a whole lot easier to validate
files with a simple structure, which a generic named.conf does not
have.

> BTW, all zones are unique (by definition) so they should all have
> their own zone files to define their content.

How is this related?

And no, sharing zone files does have its uses. But since it increases
complexity and destroys flexibility I never did such a thing.

>     Michael> Be assured that the configurations do drift, for some
>     Michael> time, because the slave servers aren't always
>     Michael> reachable. Of course this hardly matters, DNS zone data
>     Michael> also "drifts" because zone transfers do not happen
>     Michael> instantly or servers are not reachable. The only thing
>     Michael> that matters is that the differences are automatically
>     Michael> corrected.
> 
> Eventually. Maybe.

Please clarify your doubts.

Nobody can keep the configurations identical at all times. Whenever
there is a change there are differences. This is completely
independent of how you make the changes; it lies in the nature of
communication.

> There's also a big difference between zone
> coherency and server configuration coherency.

You could explain what that difference is and what method exists
to prevent server configuration incoherency.

You could also explain why a failure of whole nameserver sets
(due to short service interruptions caused by an almost synchronous
reload of all slave servers) is negligible.

You should also explain why there is such a huge problem with an
incoherent configuration that only affects added or removed zones
which are not even delegated to the servers in question (first
configure then delegate, first remove delegation then unconfigure).

Let's not forget, this part of the discussion is only about
me suggesting to offset the reload of each slave within a
time interval of 15-30 minutes.

> Please don't confuse
> these. Or assume that because it's tolerable for zone contents to be
> inconsistent between servers for the zone refresh interval, that it's
> OK for a set of servers that should have the same configuration to be
> inconsistent. 

From the outside there is no difference: either the server gives
correct answers or it does not. If you have different zone contents
on each slave server (which easily happens) then you and everybody
else tolerate it, because that is how the DNS protocol works.

There is of course a difference in the means. Zone updates are done
within the DNS protocol. Configuration updates are done outside of DNS.
Obviously you tolerate the limitations of the DNS protocol but do
not tolerate the limitations of anything else.

>     Michael> The on-disk configuration is always more up to
>     Michael> date. Whenever you modify it, the nameserver doesn't know
>     Michael> it until reloaded. So where is the problem ?
> 
> You're making this interval longer as a deliberate design decision.

No. I make this interval longer to handle the limitations of
hardware that is not infinitely fast. I also ensure (to some
degree) that the nameservers are available to answer requests
and are not overwhelmed with reconfigurations.

>     >> It can also mean lame delegations because the master server
>     >> (say) knows about a newly-added zone while one of that zone's
>     >> slaves knows nothing about it.
> 
>     Michael> Lame delegations happen when a nameserver is not
>     Michael> configured for a zone but which is delegated to it. I
>     Michael> don't understand how this is related.
> 
> Suppose this setup of yours provides DNS service for the newly created
> serpens.de zone. It's served by ns[123].serpens.de. The master server
> loads this zone with its 3 NS records. The slave servers ns2 and ns3
> are still waiting for some program of yours to wake up and give their
> name servers a kick. Until they re-read their configuration files and
> load the zone, they don't know anything about serpens.de. They're lame
> for the zone.

They are not lame for the zone because at that time there is no
delegation.

If there were a delegation before the configuration then these
servers would already have been lame.

Even when all servers are reloaded at the same time there is an
interval during which they do not answer correctly, because the
time BIND needs before a zone is loaded differs from server to
server. It depends on the machine load, the number of zones that
require updates and the maximum number of parallel zone transfers.

But you tell me that forcing a reload at exactly the same time
is absolutely necessary.

The most critical part about the delays is that I have to wait
for the last server to finish the reload before I can delegate
a zone (or rather ask the registry to do it).
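
The "configure first, then delegate" rule boils down to a check like
the following Python sketch. How the set of loaded zones per server is
obtained is left open here, and the function and parameter names are
invented for the example.

```python
def ready_to_delegate(zone, loaded_zones_by_server):
    """Return True only if every server reports the zone as loaded.

    loaded_zones_by_server maps a server name to the set of zones it
    currently serves. An empty mapping is never "ready".
    """
    if not loaded_zones_by_server:
        return False
    return all(zone in loaded for loaded in loaded_zones_by_server.values())
```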

And one thing you didn't even consider is that a change
to the slave server configuration doesn't mean that the server
is serving the zone, because the zone hasn't been transferred yet.
It can take hours before there is a retry.

That's why the 'small program' that changes the configuration
files can also run named-xfer to fetch the zone file before
the nameserver needs it. If the transfer doesn't succeed, the
program fails synchronously without changing the configuration,
instead of leaving the server 'lame'.
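
The fetch-before-configure order can be sketched in Python like this.
The transfer itself is passed in as a callable (for instance a wrapper
around named-xfer); that callable, the mapping used for the
configuration and the function name are assumptions of the sketch.

```python
def add_slave_zone(name, master, config, fetch_zone):
    """Add a zone to the slave configuration only after a transfer worked.

    fetch_zone(name, master) is a hypothetical callable that returns
    True if the zone file was fetched successfully. On failure an error
    is raised and config is left exactly as it was.
    """
    if not fetch_zone(name, master):
        raise RuntimeError("zone transfer failed for " + name)
    config[name] = master  # only reached once the zone data is on disk
    return config
```

The caller gets the failure immediately and can report it, rather than
finding out later that a server was configured for a zone it could not
load.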

> If this program on the slaves dies or fails, they could
> be lame for a long time.

If BIND fails (or whatever nameserver software you suggest) then
they could be lame for a long time. Are you telling me that this
cannot happen to BIND but is likely for the 'program on the slaves'?

Greetings,
-- 
                                Michael van Elst
Internet: mlelstv at serpens.de
                                "A potential Snark may lurk in every tree."