[bind10-dev] Receptionist's guts
Michal 'vorner' Vaner
michal.vaner at nic.cz
Wed Mar 13 08:38:11 UTC 2013
Hello
As asked on the call yesterday, I'm sending a description of the internal data
structures and logic the receptionist would have.
Stored state
============
First, it would have several kinds of sockets with accompanying information; a
rough code sketch of these structures follows the lists below:
Provider:
---------
This would be the connection to the answering bind10 server. It would be a
SOCK_STREAM connection, probably over a unix-domain socket, but IP would work
too, for future extensions where the servers are spread across multiple
machines. It would hold:
• The socket itself
• Some kind of ACL deciding whether a query may be sent to this server (with
lightweight checks, like mask+value on the start of the packet)
• Queue of queries that were not yet sent to the server
• A batch being built
UDP client:
-----------
This one waits for queries from clients. It would contain:
• The socket itself
• Queue of answers not yet passed to the OS
• An overflow flag (see below)
TCP client:
-----------
This one is an open connection from a client. It would contain the same as the
UDP one, plus the time of the last activity on the connection.
TCP acceptor:
-------------
This one just accepts connections from clients and creates TCP clients. It
doesn't need any internal state.
Global state:
-------------
• Count of open TCP connections
• Total size of the send queues to the providers
• Time when the queues to the providers overflowed
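To make this a bit more concrete, here is a rough C++ sketch of the state
above. All the names, types and default values are made up for illustration,
not taken from any actual code:

  #include <chrono>
  #include <cstddef>
  #include <deque>
  #include <vector>

  using Clock = std::chrono::steady_clock;
  using Bytes = std::vector<unsigned char>;

  struct Provider {
      int fd = -1;                    // SOCK_STREAM connection to the server
      struct AclRule { Bytes mask, value; };  // matched against the packet start
      std::vector<AclRule> acl;       // lightweight ACL for this server
      std::deque<Bytes> send_queue;   // queries not yet written to the server
      Bytes batch;                    // the batch currently being built
  };

  struct UdpClient {
      int fd = -1;
      std::deque<Bytes> answer_queue; // answers not yet passed to the OS
      bool overflow = false;          // soft limit hit, providers told to pause
  };

  struct TcpClient : UdpClient {
      Clock::time_point last_activity; // for killing idle connections
  };

  struct GlobalState {
      std::size_t open_tcp_connections = 0;
      std::size_t provider_queues_total = 0;     // sum of all provider send queues
      Clock::time_point provider_overflow_since; // when that total crossed the limit
  };

The TCP acceptor is not in the sketch, since, as said above, it has no state
of its own.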
Handling of the queues
======================
Now, what is the overflow flag for? We need to limit the size of the send
queues somehow. We do so in two ways. First, when the size reaches a soft
limit, we set the overflow flag and notify all the providers not to send any
more data for that socket. We also stop reading data from that socket (so we
don't get more queries). Once the size drops again, we clear the flag and
notify them that they can start sending again. This flag is most important for
XfrOut, since that one is able to generate a lot of messages from just one
query; the others would simply stop sending answers for the socket because
there would be no more queries from it.
The second way is a hard limit (which would have to be reasonably larger than
the soft one, by at least several batches); when it is reached, we just drop
the connection.
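A small self-contained sketch of how the two limits could work on a client's
queue. The limit values and the byte counter are made up, and the
notifications to the providers are only hinted at in the comments:

  #include <cstddef>
  #include <deque>
  #include <utility>
  #include <vector>

  using Bytes = std::vector<unsigned char>;

  constexpr std::size_t kSoftLimit = 64 * 1024;   // pause providers above this
  constexpr std::size_t kHardLimit = 512 * 1024;  // drop the connection above this

  struct ClientSendQueue {
      std::deque<Bytes> answers;  // answers not yet written to the client
      std::size_t bytes = 0;      // total size of the queued answers
      bool overflow = false;      // providers told not to send for this socket
      bool dropped = false;       // hard limit reached, connection to be closed
  };

  // Append one answer, applying the soft and hard limits described above.
  void enqueue_answer(ClientSendQueue& q, Bytes answer) {
      if (q.bytes + answer.size() > kHardLimit) {
          q.dropped = true;       // hard limit: just drop the connection
          return;
      }
      q.bytes += answer.size();
      q.answers.push_back(std::move(answer));
      if (!q.overflow && q.bytes > kSoftLimit) {
          q.overflow = true;      // soft limit: notify the providers to pause
                                  // and stop reading queries from this socket
      }
  }

  // Called after part of the queue has been written out to the socket.
  void on_written(ClientSendQueue& q, std::size_t written) {
      q.bytes -= written;
      if (q.overflow && q.bytes < kSoftLimit) {
          q.overflow = false;     // notify the providers they may send again
      }
  }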
For the queues on the provider side, we would track the total size of all the
queues and limit that. If it reached some level, we would stop reading from
clients until the servers catch up. If they don't catch up within some
reasonable time, it means some of them are stuck and their queues are taking up
most of the space. This should be rare, as it indicates a programmer bug, so we
just take the one with the largest queue and kill it (that one being the
culprit; the others probably have empty queues by that time).
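Picking the victim could be as simple as this (again just a sketch with
made-up names):

  #include <algorithm>
  #include <cstddef>
  #include <vector>

  struct ProviderBacklog {
      int fd;                    // connection to the server process
      std::size_t queued_bytes;  // unsent data destined for this provider
  };

  // If the total of all provider queues stays over the limit for too long,
  // pick the provider hoarding the most data (most likely the stuck one).
  int pick_provider_to_kill(const std::vector<ProviderBacklog>& providers) {
      auto worst = std::max_element(
          providers.begin(), providers.end(),
          [](const ProviderBacklog& a, const ProviderBacklog& b) {
              return a.queued_bytes < b.queued_bytes;
          });
      return worst == providers.end() ? -1 : worst->fd;
  }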
General processing
==================
We have some kind of event loop (be it ASIO, or something lower-level like
epoll/poll/kqueue for better performance). In each iteration, we ask for:
• Sockets that are readable and from which we haven't stopped reading.
• Sockets that are writable and have non-empty queue.
• Timeout on the longest-inactive TCP connection + some time.
• Short timeout to send batches, if there's any unsent batch.
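The two timeouts in the list could be folded into a single poll timeout
roughly like this; the concrete values are only placeholders:

  #include <algorithm>
  #include <chrono>

  using Clock = std::chrono::steady_clock;
  using Ms = std::chrono::milliseconds;

  constexpr Ms kTcpIdleLimit{30000};  // kill TCP clients idle longer than this
  constexpr Ms kBatchDelay{20};       // flush unsent batches after this long

  // oldest_activity: last activity of the longest-inactive TCP client
  // have_unsent_batch: true if any provider has a batch waiting to be flushed
  Ms poll_timeout(Clock::time_point now, Clock::time_point oldest_activity,
                  bool have_unsent_batch) {
      // Time until the longest-inactive connection becomes killable...
      Ms timeout = std::chrono::duration_cast<Ms>(
          oldest_activity + kTcpIdleLimit - now);
      // ...but wake up sooner if there is a batch waiting to be sent.
      if (have_unsent_batch) {
          timeout = std::min(timeout, kBatchDelay);
      }
      return std::max(timeout, Ms{0});  // never a negative timeout
  }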
Readable event
--------------
If it is a client, we read the messages available, sort them to the respective
providers and append them to the batches there. We reset the last activity
time.
If it is the TCP acceptor, we accept the connection and create a new client. If
we have reached the limit of open TCP connections, we kill the longest-inactive
connection. Here, we might want to impose some minimal time of inactivity for a
kill, and if there's no such connection (all the connections we have are
active), we would simply stop asking for readability of the acceptor.
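The "kill the longest-inactive connection, but only if it is idle enough" part
might look roughly like this (names and limits are illustrative):

  #include <algorithm>
  #include <chrono>
  #include <vector>

  using Clock = std::chrono::steady_clock;

  constexpr std::chrono::seconds kMinIdleForKill{5};  // don't kill active clients

  struct TcpClientInfo {
      int fd;
      Clock::time_point last_activity;
  };

  // Called when the acceptor is readable but the TCP connection limit has been
  // reached. Returns the fd of the longest-inactive client if it is idle
  // enough to be killed, or -1 if all connections are active; in that case we
  // simply stop asking for readability on the acceptor for a while.
  int victim_for_accept(const std::vector<TcpClientInfo>& clients,
                        Clock::time_point now) {
      auto oldest = std::min_element(
          clients.begin(), clients.end(),
          [](const TcpClientInfo& a, const TcpClientInfo& b) {
              return a.last_activity < b.last_activity;
          });
      if (oldest == clients.end() ||
          now - oldest->last_activity < kMinIdleForKill) {
          return -1;
      }
      return oldest->fd;
  }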
If it is a provider, we take the answers from it and sort them into the client
socket queues. If any queue was empty before, we try to send from it directly,
otherwise we just wait for a writable event on the socket. If a queue
overflows, we notify about it (the notification is put as early in the
providers' queues as possible without landing in the middle of a batch).
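A sketch of dispatching one answer coming from a provider; for brevity the
socket ID is just an int key here and error handling is mostly left out:

  #include <cstddef>
  #include <deque>
  #include <map>
  #include <unistd.h>
  #include <utility>
  #include <vector>

  using Bytes = std::vector<unsigned char>;

  struct ClientState {
      int fd;
      std::deque<Bytes> queue;  // answers waiting for a writable event
  };

  // socket_id is taken from the message header (FD + timestamp in the real
  // design; a plain int key keeps the sketch short).
  void dispatch_answer(std::map<int, ClientState>& clients, int socket_id,
                       Bytes answer) {
      auto it = clients.find(socket_id);
      if (it == clients.end()) {
          return;  // stale ID: the client is already gone, drop the answer
      }
      ClientState& c = it->second;
      if (c.queue.empty()) {
          // The queue was empty, so try to push the answer out directly and
          // only queue whatever did not fit.
          ssize_t sent = ::write(c.fd, answer.data(), answer.size());
          if (sent < 0) {
              sent = 0;  // would block; real code would check errno
          }
          if (static_cast<std::size_t>(sent) == answer.size()) {
              return;  // the whole answer went out, nothing to queue
          }
          answer.erase(answer.begin(), answer.begin() + sent);
      }
      c.queue.push_back(std::move(answer));  // wait for a writable event
  }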
Writable event
--------------
With the writable sockets, we just push part of the queue to the socket and
reset the last activity to the current time (if it was a client socket).
Timeout on the longest-inactive TCP connection
----------------------------------------------
If the connection wasn't active by that time, we just kill it.
Batches timeout
---------------
We put all the pending batches into the providers' queues.
At the end of loop iteration
----------------------------
At the end of the iteration, we put all the batches that have reached some size
into the providers' queues.
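So a batch gets moved to the provider's queue either when the batch timeout
fires or when it grows past some size at the end of an iteration; roughly:

  #include <cstddef>
  #include <deque>
  #include <utility>
  #include <vector>

  using Bytes = std::vector<unsigned char>;

  constexpr std::size_t kBatchFlushSize = 16 * 1024;  // made-up threshold

  struct ProviderBatch {
      Bytes batch;                   // messages accumulated so far
      std::deque<Bytes> send_queue;  // batches waiting to be written out
  };

  // Called both from the batch timeout and at the end of each iteration.
  void maybe_flush(ProviderBatch& p, bool batch_timeout_fired) {
      if (p.batch.empty()) {
          return;
      }
      if (batch_timeout_fired || p.batch.size() >= kBatchFlushSize) {
          p.send_queue.push_back(std::move(p.batch));
          p.batch.clear();  // leave the moved-from buffer in a known state
      }
  }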
Protocol for communication with a provider
==========================================
Each batch would be either a control message (don't send for this socket / you
can send for the socket now) or a group of messages.
A group of messages would be preceded by the total length of the group. Then,
each message would have a header and a body; the header would contain:
• Length of the body
• Protocol it arrived over (TCP/UDP)
• Local/remote address/port
• ID of the socket it came over (let's say an FD + timestamp; using the FD
alone could cause an old stray message to be sent over the wrong connection if
the connection were closed and a new one with the same FD opened in the
meantime).
We use the same format for messages to and from the provider. The provider just
copies the header (except for the length), and that information is used to send
the message to the right destination and over the right UDP/TCP socket.
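The header could look something like this; the field sizes, order and
byte-order handling are not meant as a specification, just to give the idea:

  #include <cstdint>
  #include <vector>

  struct MessageHeader {
      std::uint32_t body_length;      // length of the message body that follows
      std::uint8_t  protocol;         // 0 = UDP, 1 = TCP
      std::uint8_t  local_addr[16];   // local address (IPv4-mapped or IPv6)
      std::uint8_t  remote_addr[16];  // remote address
      std::uint16_t local_port;
      std::uint16_t remote_port;
      std::int32_t  socket_fd;        // FD part of the socket ID...
      std::uint64_t socket_stamp;     // ...plus a timestamp, so a reused FD is
                                      // not mistaken for the old connection
  };

  // Append the header to an outgoing buffer. A real implementation would pick
  // a stable byte order and layout; this just copies the in-memory struct.
  void append_header(std::vector<std::uint8_t>& out, const MessageHeader& h) {
      const std::uint8_t* raw = reinterpret_cast<const std::uint8_t*>(&h);
      out.insert(out.end(), raw, raw + sizeof(h));
  }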
The receptionist does _not_ track query-answer pairs. This means less state,
and it allows a query to have no answer (if we have a DROP in the provider's
ACL, or it is malformed, or whatever) or many answers (XfrOut).
With regards
--
"It can be done in C++" says nothing. "It can be done in C++ easily" says nobody.
Michal 'vorner' Vaner