loss of masters over ipsec hoses bind

Thu Jan 10 06:13:26 UTC 2008

On Jan 9, 2008 8:45 AM, Adam Tkac <atkac at redhat.com> wrote:
> On Wed, Jan 09, 2008 at 07:33:31AM -0600, Matt LaPlante wrote:
> > > > > >
> > > > > >
> > > > > >       I would say that some I/O is blocking when it shouldn't
> > > > > >       with sockets which use ipsec.  If this is the case it is
> > > > > >       a kernel bug and named can't do anything to prevent it.
> > > > > >       Named marks all sockets as non-blocking.
> > > > > >
> > > > > >       Mark
>
> I also expect a kernel bug.
>
> >
> > Ping...
> >
> > I'm still seeing this any time one of the ipsec endpoints goes away
> > (and it happens on either end, so it's definitely repeatable).
> >
>
> I've run into the same problem in RH
> (https://bugzilla.redhat.com/show_bug.cgi?id=427629). Would it be
> possible to send (to me or here) stack traces showing exactly where named hangs?

I attempted to follow the instructions in the redhat bug and got the
following output:

(gdb) info threads
  4 Thread -1213547632 (LWP 3040)  0xb7c612a1 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
  3 Thread -1221936240 (LWP 3041)  0xb7c61512 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/libpthread.so.0
  2 Thread -1230324848 (LWP 3042)  0xb7bdfad7 in select () from /lib/libc.so.6
  1 Thread -1213163856 (LWP 3039)  0xb7b4ddfd in sigsuspend () from /lib/libc.so.6
(gdb) bt 1
#0  0xb7b4ddfd in sigsuspend () from /lib/libc.so.6
(More stack frames follow...)
(gdb) bt 2
#0  0xb7b4ddfd in sigsuspend () from /lib/libc.so.6
#1  0xb7cb416c in isc_app_run () from /usr/lib/libisc.so.32
(More stack frames follow...)
(gdb) bt 3
#0  0xb7b4ddfd in sigsuspend () from /lib/libc.so.6
#1  0xb7cb416c in isc_app_run () from /usr/lib/libisc.so.32
#2  0x0806950b in ?? ()
(More stack frames follow...)
(gdb) bt 4
#0  0xb7b4ddfd in sigsuspend () from /lib/libc.so.6
#1  0xb7cb416c in isc_app_run () from /usr/lib/libisc.so.32
#2  0x0806950b in ?? ()
#3  0x00000000 in ?? ()
(gdb)

I don't have a lot of gdb-fu to draw on, so feel free to give more
extensive instructions and I'll be glad to run through them.

> It
> will point us to where the problem is. Mark's patch also made me notice
> that the internal_connect function uses errno directly (something like a
> switch (errno) statement). I'm not sure whether something modifies errno
> in between, so that the socket code behaves unexpectedly. The code should
> start using statements like
>
> err = errno;
> switch (err) ...
>
> instead of using errno directly.
>
> Adam
>
> --
> Adam Tkac, Red Hat, Inc.
>
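
For illustration, the errno-saving pattern Adam describes could look
roughly like the C sketch below. This is not BIND's internal_connect
code: the names conn_result, classify_connect_error and try_connect are
hypothetical, and the grouping of error codes is an assumption made only
for the example. The point is that errno is copied into a local variable
immediately after connect() fails, before any other library call has a
chance to overwrite it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical result codes for this sketch (not BIND's isc_result_t). */
typedef enum {
    CONN_OK,
    CONN_IN_PROGRESS,
    CONN_RETRY,
    CONN_FAILED
} conn_result;

/* Classify a *saved* errno value from a non-blocking connect().
 * The grouping of codes here is an assumption for illustration. */
conn_result classify_connect_error(int err)
{
    switch (err) {
    case 0:
        return CONN_OK;
    case EINPROGRESS:
        return CONN_IN_PROGRESS;   /* normal for a non-blocking socket */
    case EINTR:
    case EAGAIN:
        return CONN_RETRY;
    default:
        return CONN_FAILED;
    }
}

/* Attempt a connect and report the outcome based on the saved errno. */
conn_result try_connect(int fd, const struct sockaddr *sa, socklen_t len)
{
    int err;

    assert(sa != NULL);

    if (connect(fd, sa, len) == 0)
        return CONN_OK;

    /* Save errno first: any later call may change it. */
    err = errno;

    /* Logging is exactly the kind of call that may clobber errno. */
    fprintf(stderr, "connect failed: %s\n", strerror(err));

    /* Switch on the saved copy, never on errno itself. */
    return classify_connect_error(err);
}
```

With the local copy, whatever happens between the failing call and the
switch statement (logging, cleanup, a signal handler) cannot change which
branch is taken.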