[kea-dev] On persistent lease file clean-up mechanism
marcin at isc.org
Mon Nov 24 18:41:40 UTC 2014
Thanks for the example. It clarified a lot.
IMO, this approach has a major flaw: the clean-up timer is effectively
driven by the lifetime of the longest possible lease. If the DHCP server
is serving leases to clients in multiple subnets and one of the subnets
has long valid-lifetimes (in cable networks customers could get leases
for 7 days so that their IP address at home doesn't vary), the lease
file would rotate only every ~7 days, and it could grow to a significant
size if the valid-lifetimes for the other subnets were significantly
lower.
There is also some complexity in how the server determines the longest
lease time. Checking valid lifetimes for all configured subnets is not
sufficient, because clients may already hold leases with different
valid lifetimes that were granted before the server was reconfigured.
So, the configuration mechanism would need to iterate over all leases in
the lease database and check their valid lifetimes. If any of them is
greater than the highest valid-lifetime across all configured subnets,
that lifetime would need to be used as the timer interval. But this
causes an operational problem: the server's administrator cannot know
what the rotation timer is going to be, because it depends on dynamic
data in the lease database. Additionally, if the lease with the highest
valid lifetime expires or is released, the timer should be recalculated,
which has its own problems and complexity.
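To make the point concrete, here is a minimal sketch in Python of how
the interval would have to be derived. The Lease class and the
rotation_interval() function are hypothetical names used only for
illustration; this is not Kea code.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    address: str
    valid_lifetime: int  # seconds


def rotation_interval(subnet_lifetimes, leases):
    """Return the longest valid lifetime found in either the configured
    subnets or the leases already stored in the database."""
    configured = max(subnet_lifetimes, default=0)
    stored = max((lease.valid_lifetime for lease in leases), default=0)
    return max(configured, stored)


# Configuration now says 30 min and 1 h, but one lease granted under an
# earlier configuration still carries a 7-day lifetime.
subnets = [1800, 3600]
leases = [Lease("192.0.2.10", 1800), Lease("192.0.2.77", 604800)]
interval = rotation_interval(subnets, leases)
```

Here the old 7-day lease, not the configuration, decides the interval,
and the result changes as soon as that lease goes away.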
One more thing to keep in mind is that the client can renew at any time
before the lease expires, e.g. as a result of a machine reboot. A broken
client can even send a Request every 1s, even though T1 is set to 1800s.
Each lease update would be recorded in the lease file and would cause it
to grow. This could even be used as an attack vector to quickly fill the
disk space reserved for the lease file (though I admit that there are
many similar ways to hammer a DHCP server anyway). Having said that, the
lease file compression (clean-up) is something that must be done anyway,
so that redundant information (e.g. dozens of Renews from the same
client) is removed from the lease file. Relying on clients sending
Renews at predictable times may turn against us pretty quickly, as
clients tend to do odd things.
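As a rough illustration of what the compression has to do, here is a
sketch in Python. It assumes a simplified record of (address, expiry
timestamp), which is not the actual lease file format, and compact() is
a hypothetical name.

```python
def compact(records, now):
    """Keep only the most recent record per address and drop leases
    that have already expired at time `now`."""
    latest = {}
    for address, expire in records:
        latest[address] = expire  # a later record overwrites an earlier one
    return [(addr, exp) for addr, exp in latest.items() if exp > now]


# One client renewed twice, another lease has expired in the meantime.
records = [
    ("192.0.2.1", 100),
    ("192.0.2.1", 200),
    ("192.0.2.2", 50),
    ("192.0.2.1", 300),
]
compacted = compact(records, now=120)
```

The four records collapse to a single one for 192.0.2.1, regardless of
how often or how regularly the client renewed.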
Regarding options 1 and 2 that I described: the server doesn't have to
wait for the lease file to be re-written. The only time the server needs
to wait is while the file it is using is renamed and a new file is
created for it to use. The rest of the operation should be performed in
the background (a separate thread or process). Note that the server
doesn't need lease information from the file being cleaned up as long as
it is not restarted, because it has all the information in memory. So,
the server can use an empty lease file while the clean-up is performed.
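A minimal sketch of that split, in Python: only rotate() runs in the
server's main flow, and the clean-up itself is handed to a worker
thread. The function names, the ".old" suffix and the file contents are
assumptions made for the example, not an existing implementation.

```python
import os
import tempfile
import threading


def rotate(path):
    """Rename the live lease file and create an empty replacement.
    This is the only step the server has to wait for."""
    old_path = path + ".old"  # hypothetical naming convention
    os.replace(path, old_path)
    open(path, "w").close()
    return old_path


def clean_in_background(old_path, cleanup):
    """Run the (possibly slow) clean-up on the renamed file in a thread."""
    worker = threading.Thread(target=cleanup, args=(old_path,))
    worker.start()
    return worker


# Demo in a temporary directory.
workdir = tempfile.mkdtemp()
lease_file = os.path.join(workdir, "leases.csv")
with open(lease_file, "w") as f:
    f.write("192.0.2.1,100\n192.0.2.1,200\n")

cleaned = []  # stand-in for real clean-up work
old_file = rotate(lease_file)
worker = clean_in_background(old_file, cleaned.append)
worker.join()
```

After rotate() returns, the server can keep appending to the empty
lease_file while the worker processes old_file.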
While the clean-up is being performed, the server still serves clients
and writes lease updates to the newly created (empty) file. In the case
of option 2, the server will additionally have to wait once the clean-up
is done: it will wait for the data from the currently used lease file to
be *appended* to the cleaned-up lease file. The append operation (e.g.
cat file1 >> file2) should be pretty fast, as there is no data
interpretation, just a simple I/O operation.
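For illustration, the final step of option 2 could look roughly like
this in Python. The file names and the append_file() helper are
hypothetical, and replacing the current file with the combined one is
just one possible way for the server to "switch" files.

```python
import os
import shutil
import tempfile


def append_file(src, dst):
    """Append the raw bytes of src to dst without parsing any records."""
    with open(src, "rb") as source, open(dst, "ab") as target:
        shutil.copyfileobj(source, target)


# Demo in a temporary directory.
workdir = tempfile.mkdtemp()
cleaned = os.path.join(workdir, "leases.cleaned")  # result of the clean-up
current = os.path.join(workdir, "leases.current")  # updates written meanwhile
with open(cleaned, "w") as f:
    f.write("192.0.2.1,200\n")
with open(current, "w") as f:
    f.write("192.0.2.9,900\n")

# The short pause in service covers only these two calls.
append_file(current, cleaned)
os.replace(cleaned, current)  # server continues with the combined file
```

The pause grows with the size of the updates gathered during the
clean-up, not with the size of the whole lease file.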
On 11/24/14 18:20, Chaigneau, Nicolas wrote:
> Hello Marcin,
> Thanks for your feedback!
> Both option 1 and 2 you describe have, in my opinion, one common
> issue: the server must halt processing the DHCP packets in order to
> rewrite the lease file, which can take some time when dealing with
> very large lease files.
> The main point of the algorithm I proposed was to eliminate the file
> clean-up entirely, while ensuring data integrity (no loss of data).
> This is achieved through a time-triggered rotation of the lease file.
> I'll take an example, hopefully making myself clearer :)
> With a lease time of 30 minutes, and a rotation of the lease file
> triggered every 30 min:
> Server is started at t0.
> At t0 + 1 min, lease L1 is allocated, and written in <lease file>.
> At t0 + 29 min, lease L2 is allocated, and written in <lease file>.
> At t0 + 30 min, the clean-up mechanism is triggered:
> <lease file> is renamed to <old lease file> (which now contains leases
> L1 + L2).
> <lease file> is recreated, empty.
> At t0 + 31 min, lease L1 (not renewed) expires.
> At t0 + 44 min (considering a renew-timer of 15 min), lease L2 is renewed.
> At t0 + 60 min, the clean-up mechanism is triggered again:
> <lease file> is renamed to <old lease file>. Previous <old lease file>
> is simply overwritten.
> It's alright because any lease that is not expired has necessarily
> been renewed, hence is in <lease file>.
> This property is verified as long as we ensure that the clean-up
> interval is at least equal to the maximum lease time value.
> At any time if the server is restarted, all non-expired leases can be
> retrieved from the aggregation of <lease file> and <old lease file>.
> Hence, no data loss, at the cost of a slightly more complicated
> start-up processing.
> This is a simple scenario where the lease file is cleaned-up only on a
> regular time trigger.
> I understand that combining this with other clean-up triggers, such as
> those you describe, will make things much more complicated... possibly
> with multiple lease files, and the need to decide when an older lease
> file can be safely deleted. I really haven't given this much thought,
> I only wanted to propose something that would be sufficient for my needs.
> Anyway, thanks for reading and considering this :)
> > Nicolas,
> > Thanks for the write-up. In general, I feel it is a right direction.
> > But I also agree it is not a trivial matter as the solution should not
> > impair the DHCP service while the clean-up is in place. Also, the
> > solution should prevent race conditions and data loss.
> > As a result of the discussions on the mail list we will have to
> > create a page on the trac wiki which will contain requirements for
> > this feature as well as some little design. I will make sure such page
> > is created once we have first thoughts exchanged.
> > We haven't yet gone through the phase of discussing this feature
> > because it was somewhat out of scope for the 0.9.1 release. So, all my
> > thoughts here are preliminary and I may be wrong.
> > I think we should consider other triggers for the clean-up apart
> > from the timer. Though, the timer could be the first one to implement
> > and other could wait. We could consider triggering clean-up when the
> > lease expiration counter is X, or number of renewed leases is Y etc.
> > I agree that the clean-up should be performed on a renamed file and
> > the server should use a different/empty file for new lease updates
> > while the clean-up is in place. However, there is a question how the
> > data is combined when the clean-up is done. Keeping the compressed
> > (cleaned-up) file separately from the currently used file is an
> > option, but it means that each clean-up operation results in creation
> > of one new file holding partial information about the leases. I
> > believe that in many cases administrators would rather want lease
> > information be stored in a single file, not multiple. Also, the
> > subsequent clean-up operations would need to process all saved lease
> > files, since each of them potentially holds some lease information
> > which may have expired.
> > I tend to think that the lease data aggregation should rather take
> > place at the end of the clean-up phase and should result in having at
> > most two files: one currently used by the server to write lease
> > updates until the next clean-up is triggered. Another one, holding all
> > remaining (historical) and cleaned-up lease information. Ideally, they
> > should be combined into a single file, but appending the contents of
> > the currently used file may have some impact on the service
> > availability for a period of time when the append is being done.
> > So, the option 1 would be...
> > When the clean-up is triggered, the lease file used currently by the
> > server is renamed. The new file is created for the server to use. The
> > renamed file contents are appended to the lease file holding
> > historical data. Then, clean-up is performed on this file with
> > appended information. The server keeps using the same lease file
> > (other than the one on which the clean-up was performed) until next
> > clean-up comes.
> > And option 2 ...
> > There is only one lease file. When the clean-up is triggered this
> > file is renamed and the new file is created for the server to write
> > new lease updates. When the clean-up is completed on the renamed lease
> > file, the DHCP service is ceased for a short period of time when the
> > most recent lease updates (gathered during clean-up) are appended to
> > the cleaned-up file and the server switches to use this file.
> > Both solutions are similar, but second option results in use of a
> > single lease file (which I personally prefer). The first option has an
> > advantage that there is no need to copy over the most recent
> > information to the single lease file.
> > I am not sure I fully understand the proposal of keeping the
> > clean-up interval "at least equal to maximum lease time value". Are
> > you proposing that clean-ups are triggered no more frequently than
> > maximum valid-lifetime that may occur for any lease in a lease file?
> > Why server restart is associated with the lease file cleanup in this
> > context? The clean-up should not be triggered until the server starts
> > up and loads lease information from the existing lease files, at which
> > point the server records all updates to leases in an available lease
> > file. And the clean-up should ensure that the server always has a
> > lease file to write to.
> > Marcin
> > On 11/19/14 15:27, Chaigneau, Nicolas wrote:
> >> Hello,
> >> I'd like to discuss the topic of cleaning-up the lease file in the
> >> case of a "memfile" back-end.
> >> That's probably something you've already thought of, but I didn't
> >> find a specific implementation discussed on the mailing list archive.
> >> Since it's still yet to be implemented, I'd like to share my 2
> >> cents on the subject.
> >> This is not a trivial matter: in the context of high availability
> >> and very large lease files involved, the situation of a server being
> >> unresponsive for several seconds while it handles rewriting the lease
> >> file is not something we can afford.
> >> So here's how I would do it:
> >> - define a lease file clean-up interval; for instance 1H.
> >> - every 1H, trigger the clean-up mechanism:
> >>   - rename <lease file> to <old lease file> (<lease file>~ or any
> >>     other naming convention)
> >>   - recreate an empty <lease file>
> >>   - close and reopen Kea's file handle on <lease file>
> >> - at server startup, load in memory both <lease file> and <old lease
> >>   file> (aggregating their content)
> >> If we ensure that the clean-up interval value is *at least* equal
> >> to the maximum lease time value, this guarantees no data can be lost
> >> during a server restart: any lease not expired is necessarily either
> >> in the current or old lease file (possibly both).
> >> Both <lease file> and <old lease file> being in the same directory,
> >> hence on the same filesystem, the rename operation is immediate. The
> >> file's inode and content are unchanged. Only the file name is modified.
> >> This probably would not be suitable for everyone's needs, so maybe
> >> this could be an optional mechanism.
> >> What do you think ?
> >> Regards,
> >> Nicolas.