Host configuration

Thu Feb 14 20:39:50 UTC 2008

David W. Hankins wrote:
> On Mon, Feb 11, 2008 at 05:42:40PM -0500, Frank Sweetser wrote:
>> By setting servers up to not require DHCP, this also makes your service
>> dependencies simpler.  There's nothing quite so much fun as creating a
>> circular service dependency between two servers, where neither one may be
>> turned on until the other is fully booted...
> 
> I've heard this from multiple people.  It's not...entirely...wrong.
> I think it is misguided by a lack of understanding that DHCP simply
> needs to be implemented differently in such an environment, and further
> confused since ISC DHCP is not IETF DHCP.  How we did it is not the
> only way it may be done.

Let me clarify exactly what I was talking about before I go on...

Probably one of the better known examples of a service loop is crossed NFS
mounts.  Server A mounts NFS shares off of server B, and also vice versa.  If
both servers are shut down, neither one can come up cleanly.  Both must be
booted with errors, requiring manual cleanup to restore normal operations.
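
For illustration, a loop like this can be caught mechanically if the
dependencies are written down somewhere.  The following is only a rough
sketch in Python: the host names and the shape of the dependency table are
made up for the example, and it is a plain depth-first search rather than
any particular tool's method.

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of names, or None.

    deps maps each host/service to the things it needs before it can start.
    """
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current path, finished
    colour = {}
    path = []

    def visit(node):
        colour[node] = GREY
        path.append(node)
        for dep in deps.get(node, ()):
            state = colour.get(dep, WHITE)
            if state == GREY:
                # dep is already on the current path, so we have a loop
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        colour[node] = BLACK
        return None

    for node in deps:
        if colour.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical crossed NFS mounts: A mounts from B, and B mounts from A.
crossed_nfs = {"serverA": ["serverB"], "serverB": ["serverA"]}
print(find_cycle(crossed_nfs))   # ['serverA', 'serverB', 'serverA']
```

The same check works no matter how many layers of indirection sit between
the two ends of the loop, which is where it tends to be more useful than
eyeballing a diagram.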

The typical reaction to this should be, "well, that's stupid - don't do it!"
While this is certainly a reaction I share, I've seen other cases of equally
disastrously laid out services that were simply hidden behind a little more
complexity and a few more layers of indirection.

The *simplest* (not necessarily best, especially in all circumstances - just
the simplest) solution is to flat out avoid requiring DHCP on a small number
of core servers that must be brought up to provide essential services (login
authentication, DNS, etc).  That way, when the 3am service call comes in
(which it always does, sooner or later) there's one less important detail that
you have to remember when fixing things.

> That misunderstanding kind of frustrates me.
> 
> Let us characterize DHCP in this particular use case as a dynamic
> process that reaches a fixed (or semi-fixed) ends.  This is a
> redundant operation.  Redundant operations introduce complexities;
> more components of the system may fail ("unknown flaws").  The
> conclusion is to remove the redundancy to avoid phonecalls at 3 am.

On the other hand, throwing out all the damned machines in the first place
would save a lot of time and effort =)  In the end, you have to balance the
TCO of each piece against the benefit.  A 10 desktop and 1 server network
doesn't need multiple top end routers and load balancers, while a low cost
SOHO router has virtually no benefit to a company with tens of thousands of nodes.

> Let us describe one other redundant operation in server farms; the
> use of network interface speed/duplex autonegotiation.  Your server is
> not going to swap out its nic for one that cannot do full duplex on
> a reboot.  Your switch is similarly not going to physically change on
> a restart.  So this wire-protocol dynamic process always reaches the
> same conclusion:  full speed, full duplex.  It is therefore redundant,
> and by the same argument, must be removed in order for "undefined
> flaws" in the process to keep from affecting service.
> 
> I certainly know of many networks whose server farms do not use DHCP.
> I also know of farms that do use DHCP (and even dynamic DNS).
> 
> Although I do know of folks who disable ethernet link autoneg for
> the reasons given, what troubles me is that most folks I know who
> choose to disable DHCP do not choose to disable ethernet link auto-
> negotiation.  This means they give DHCP special consideration outside
> the norm.  It is somehow an extra special automated process (or at
> least one whose parameters are not understood).  The line drawn is
> arbitrary and fuzzy - whatever the individual chooses to be or not be
> an acceptable risk according to their own sense of comfort -  rather
> than clear and consistent - something drawn from a definition.

There is one very critical difference between DHCP and autoneg, though.  Your
typical ethernet switch does not have any external service dependencies for it
to come up to normal operations.  You turn it on, it loads its internal
configuration, and starts switching and autonegotiation.  Other components
such as routers may have to be brought up for it to do anything useful, but
the switch doesn't care what order things happen in.

The same may or may not be true for a DHCP server.  Regardless of what the
actual DHCP service requires to operate, the server itself may require DNS,
SANs, NIS+, NFS shares off of other servers, etc just to boot in the first
place.  Likewise, the DHCP service may have external requirements to get to
its data required to hand out leases - NFS shares, AFS shares, LDAP, etc.

Bringing the DHCP server up first may result in processes looking for files
that aren't there and trying to look up hostnames from DNS that isn't
answering.  If that DNS server won't boot without DHCP, you've dug yourself
into a nice little hole.  It's quite avoidable; you just have to be aware of
the hole when designing your architecture.

Again, statically configured IP addresses are just one way to handle this.
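
To make the cold-boot problem concrete, here is a small sketch using
Python's standard graphlib module (3.9 or later).  The two-host layout is
hypothetical, and real dependency data would obviously be messier; the
point is only that removing the DHCP requirement from one core server is
enough to turn "no valid order" into a usable boot order.

```python
from graphlib import TopologicalSorter, CycleError


def boot_order(deps):
    """Return a start order in which every host comes after what it needs,
    or None if the dependencies contain a loop."""
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError:
        return None


# Hypothetical layout: the DHCP server needs DNS to boot cleanly, and the
# DNS server needs a DHCP lease to get on the network.
with_dhcp = {"dhcp-server": {"dns-server"}, "dns-server": {"dhcp-server"}}

# Same layout, but the DNS server has a static address and needs nothing.
with_static = {"dhcp-server": {"dns-server"}, "dns-server": set()}

print(boot_order(with_dhcp))     # None
print(boot_order(with_static))   # ['dns-server', 'dhcp-server']
```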

> But, if a group has elected _arbitrarily_ not to use DHCP to configure
> their servers, then the implementation of that election is rather
> obvious.
> 
> While we're on this topic however, we could discuss how one might
> elect to use automation (for all the benefits it conveys) in the best
> way - so as to minimize phonecalls at 3am.
> 
> Foremost and most obvious is to use failover or some other means to
> get redundancy in your DHCP service itself.  But this becomes optional
> by the time you've reached the third point below.  It actually does
> not (or should not) matter if your DHCP service works or not, except
> that you'd tend to prefer it did.

Agreed.  The scenario I'm mostly concerned about is bringing all
infrastructure up from a cold boot - something we've come pretty close to a
couple times due to required maintenance, mostly power related.

> Second and equally obvious is to use long lease times, so any failure
> in the DHCP service is unlikely, or impossible to be noticed, except
> by brand new servers that have never before been put online.  Note
> that valid lease times range from 1 second to 2^32-2, with a big
> gap between 2^32-2 and 'infinity' (meaning the lease simply never
> expires...at least until an operator resets it).  Note that long lease
> times do not necessarily require long renewal times, at least so far
> as the protocol is concerned.  ISC dhcpd could stand to let the renew
> time be configurable.  Note that the server farms I'm aware of use
> 90-120 day lease times.

This would certainly help out with riding out failures of all DHCP servers,
though with an extra DHCP server or two and static leases this should not be a
problem very often.
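
To put rough numbers on the lease times mentioned above: RFC 2131 gives
default renewal (T1) and rebinding (T2) times of 0.5 and 0.875 of the lease
duration, and reserves the value 0xffffffff for an infinite lease.  The
sketch below just does that arithmetic for 90 and 120 day leases.  The
function name is made up, and a server is free to send different T1/T2
values than these defaults.

```python
INFINITE_LEASE = 0xFFFFFFFF   # RFC 2131: this value means "never expires"
DAY = 86400                   # seconds


def default_timers(lease_seconds):
    """Return (T1, T2) in seconds using the RFC 2131 default fractions."""
    if lease_seconds == INFINITE_LEASE:
        return (None, None)           # nothing to renew or rebind
    t1 = lease_seconds * 0.5          # client starts renewing
    t2 = lease_seconds * 0.875        # client starts rebinding
    return (t1, t2)


for days in (90, 120):
    t1, t2 = default_timers(days * DAY)
    print(f"{days}-day lease: renew after {t1 / DAY:g} days, "
          f"rebind after {t2 / DAY:g} days")
```

With those defaults a client on a 90 day lease first tries to renew at 45
days, so a DHCP outage has to last a long time before an existing server
would lose its address.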

> Third and, it seems, completely non-obvious, is to vet the use of DHCP
> client software which implements RFC2131 section 3.2, "reusing a
> previously allocated network address", specifically reading between
> the lines on how to optimize this process for a non-nomadic lifestyle.
> This non-nomadic interpretation of this and section 3.7's SHOULD
> (don't) lets a client _immediately_ use any previous valid lease upon
> rebooting, although it probably SHOULD also attempt to contact a DHCP
> server in parallel and reconfigure if necessary.  This effectively
> makes DHCP during the boot sequence a non-blocking operation.  My
> memory is that ISC dhclient is very optimized for the nomadic
> lifestyle, such that it is not capable of operating in this server-
> farm-desirable fashion*.  Improvement would be trivial.

Out of curiosity, anyone know if anything like that has been implemented in any
Windows or Mac server platforms?  We've got a mix of just about everything
here, so we'd have to worry about all of 'em.
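
For what it's worth, the boot-time decision being described is small.  This
is a sketch of the logic only, not how dhclient or any other client
actually behaves: the saved-lease format and the action names are invented
for the example, and the background confirmation with a server is left out.

```python
import time


def boot_action(saved_lease, now=None):
    """Decide what to do at boot, given a lease saved before the reboot.

    saved_lease is a hypothetical dict with 'address' and 'expires'
    (absolute expiry time in seconds since the epoch), or None.
    """
    now = time.time() if now is None else now
    if saved_lease and saved_lease["expires"] > now:
        # Lease still valid: configure the interface right away and let
        # the confirmation with a DHCP server happen in parallel.
        return ("use-saved-then-confirm", saved_lease["address"])
    # No usable lease: the boot has to wait for a server, as for a new host.
    return ("wait-for-server", None)


# Example address is made up (documentation range 192.0.2.0/24).
lease = {"address": "192.0.2.10", "expires": 2000}
print(boot_action(lease, now=1000))   # lease valid, boot does not block
print(boot_action(lease, now=3000))   # lease expired, boot must wait
print(boot_action(None, now=1000))    # never had a lease
```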

> Fourth, use DHCP client identification to match the individual
> service.  In this way, the hard drive inside a server, with its
> generated (RFC3942) or configured client id, consistently identifies
> itself for resources like dynamic DNS, and its IPv4 addresses.

Maybe I've missed something here, but I don't see how a choice of client
identifier would make a difference.

> In this way DHCP can help an operator make large changes such as
> network re-addressing or complete domain renaming (along with all the
> little changes, such as nameserver, domain-search, or ntp changes)
> without resorting to brute force and without introducing a blocking
> event in the system startup sequence.  The risk and rewards are
> identical to the use of ethernet autonegotiation; the software itself
> can have a fault, just like the (possibly upgraded) firmware on either
> side of your ethernet cable, always a risk, but you gain the many
> advantages of automation for trusting in code.

For this purpose, you again have to look at which of the multiple available
tools will do the best job.

DHCP has the advantage of already being present and well supported out of the
box on virtually every client out there.  Mac, Windows, Linux - all of them
can pick up gateway, netmask, and DNS server list off of DHCP without having
to install or configure anything on the client.  For clients where you don't
have to do a whole lot of complex configuration, just hitting the basics with
DHCP may be plenty.
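
As a reminder of just how basic those basics are, each of these settings
travels as a DHCP option: a one-byte code, a one-byte length, and the data.
The sketch below packs netmask, gateway, and DNS servers that way using the
option codes from RFC 2132.  The function names and the addresses are made
up, and this only builds the options field, not a complete DHCP packet.

```python
import ipaddress

# Option codes from RFC 2132.
SUBNET_MASK, ROUTER, DNS_SERVERS, END = 1, 3, 6, 255


def encode_option(code, addresses):
    """Encode one option as code, length, then packed IPv4 addresses."""
    data = b"".join(ipaddress.IPv4Address(a).packed for a in addresses)
    return bytes([code, len(data)]) + data


def encode_basics(netmask, gateway, dns_servers):
    """Build an options field carrying just the three basic settings."""
    return (encode_option(SUBNET_MASK, [netmask])
            + encode_option(ROUTER, [gateway])
            + encode_option(DNS_SERVERS, dns_servers)
            + bytes([END]))


# Example addresses are made up (documentation range 192.0.2.0/24).
blob = encode_basics("255.255.255.0", "192.0.2.1",
                     ["192.0.2.53", "192.0.2.54"])
print(blob.hex())
```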

On the other hand, as soon as you need to go beyond the simple key-value pairs
that happen to be supported by the relevant DHCP clients, you're pretty much
out of luck.  DHCP won't help you much if you need to manage user accounts,
enable/disable services, or install a package.  Likewise, there's no
authentication, no good way to handle sensitive data, minimal extensibility,
no audit trail or notifications of what actions were taken, or ability to push
out updates rather than waiting for the client to pull.  That's nothing
against DHCP - these are all normal system administration tasks and
requirements that simply fall well outside of the scope of DHCP.

Once you get past a handful of relatively trivial settings, DHCP just won't
cut it anymore, and you need to bring in different code designed to handle
that kind of automation instead.

-- 
Frank Sweetser fs at wpi.edu  |  For every problem, there is a solution that
WPI Senior Network Engineer   |  is simple, elegant, and wrong. - HL Mencken
    GPG fingerprint = 6174 1257 129E 0D21 D8D4  E8A3 8E39 29E3 E2E8 8CEC