BIND9.2.0 not killing on SIGTERM signal

Wed Jun 5 14:44:14 UTC 2002

Hello,

    Has anyone had a chance to look into this mail?

    Any additional information on this problem would be greatly
    appreciated.

    Thanks in advance,

With Regards,
D.Dhana.

Dhanasekaran wrote:

> Hi,
>
>     I am running a BIND 9.2.0 nameserver on a multiprocessor system (8 CPUs).
>     When named receives a SIGTERM signal, it sometimes does not exit. The
>     problem is intermittent; most of the time named exits properly.
>
>     I instrumented the code and found the following in the syslog.
>
>  May 24 11:34:17 tst1 named[3795]: isc_app_run: Return value of sigwait() :0
>  May 24 11:34:17 tst1 named[3795]: isc_app_run: signal received :15
>  May 24 11:34:17 tst1 named[3795]: isc_app_run: Setting want_shutdown to ISC_TRUE
>  May 24 11:34:17 tst1 named[3795]: isc_app_run: Returning SUCCESS
>  May 24 11:34:17 tst1 named[3795]: main: Return value of isc_app_run(): 0
>  May 24 11:34:17 tst1 named[3795]: main: Calling cleanup()
>  May 24 11:34:17 tst1 named[3795]: shutting down: flushing changes
>  May 24 11:34:17 tst1 named[3795]: stopping command channel on 127.0.0.1#953
>  May 24 11:34:17 tst1 named[3795]: no longer listening on 192.1.1.152#53
>  May 24 11:34:17 tst1 named[3795]: no longer listening on 15.14.148.121#53
>  May 24 11:34:17 tst1 named[3795]: no longer listening on 127.0.0.1#53
>
>     The log above confirms that the main thread receives the SIGTERM
>     signal and that the cleanup() function is called. In cleanup(),
>     destroy_managers() is called to destroy the task, timer and socket
>     managers.
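
For readers following the log, here is a minimal sketch of the sigwait()
pattern it reflects: SIGTERM is blocked and then collected synchronously,
after which a shutdown flag is set. This is not the BIND source. The names
block_sigterm() and wait_for_shutdown() are hypothetical, and the
want_shutdown variable only borrows its name from the log message.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical stand-in for the want_shutdown flag named in the log. */
bool want_shutdown = false;

/*
 * Block SIGTERM in the calling thread so it can only be collected
 * synchronously with sigwait(). Threads created later inherit the mask.
 */
int block_sigterm(sigset_t *set)
{
    assert(set != NULL);
    sigemptyset(set);
    sigaddset(set, SIGTERM);
    return pthread_sigmask(SIG_BLOCK, set, NULL);
}

/*
 * Wait for one signal from the set. Returns the sigwait() result
 * (0 on success) and stores the signal number in *sig.
 */
int wait_for_shutdown(const sigset_t *set, int *sig)
{
    int result = sigwait(set, sig);

    if (result == 0 && *sig == SIGTERM)
        want_shutdown = true;
    return result;
}
```

A caller would run block_sigterm() before creating any worker threads and
then sit in wait_for_shutdown() from the main thread, which is what the
"Return value of sigwait()" and "signal received" log lines suggest.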
>
>     In isc_taskmgr_destroy(), the main thread sends a shutdown message
>     to all the tasks and then waits for all the threads it created during
>     isc_taskmgr_create() to exit.
> ----
>         /*
>          * Wake up any sleeping workers.  This ensures we get work done if
>          * there's work left to do, and if there are already no tasks left
>          * it will cause the workers to see manager->exiting.
>          */
>         BROADCAST(&manager->work_available);
>
>         /*
>          * Wait for all the worker threads to exit.
>          */
>         for (i = 0; i < manager->workers; i++)
>                 (void)isc_thread_join(manager->threads[i], NULL);
> ----
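
The wake-and-join shutdown quoted above can be sketched with plain pthreads.
This is a toy model under stated assumptions, not the BIND implementation:
struct toy_manager, its fields and both functions are hypothetical, and
"running a task" is reduced to decrementing a counter. The point it shows is
that a worker leaves its loop only when the manager is exiting and no tasks
remain, so a single task that never finishes keeps pthread_join() waiting.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define NWORKERS 4

/* Hypothetical toy manager; field names are illustrative only. */
struct toy_manager {
    pthread_mutex_t lock;
    pthread_cond_t work_available;
    bool exiting;
    int live_tasks;   /* tasks still linked into the manager */
    int ready_tasks;  /* tasks with a pending (shutdown) event */
    pthread_t threads[NWORKERS];
};

/* A worker may exit only when shutdown was requested and no tasks remain. */
static bool finished(const struct toy_manager *m)
{
    return m->exiting && m->live_tasks == 0;
}

static void *worker(void *arg)
{
    struct toy_manager *m = arg;

    pthread_mutex_lock(&m->lock);
    while (!finished(m)) {
        /* Sleep until there is work or the manager is finished. */
        while (m->ready_tasks == 0 && !finished(m))
            pthread_cond_wait(&m->work_available, &m->lock);
        if (m->ready_tasks > 0) {
            /* Stand-in for running a task's shutdown event to completion. */
            m->ready_tasks--;
            m->live_tasks--;
            if (finished(m))  /* last task gone: wake the other sleepers */
                pthread_cond_broadcast(&m->work_available);
        }
    }
    pthread_mutex_unlock(&m->lock);
    return NULL;
}

int toy_manager_create(struct toy_manager *m, int ntasks)
{
    assert(ntasks >= 0);
    pthread_mutex_init(&m->lock, NULL);
    pthread_cond_init(&m->work_available, NULL);
    m->exiting = false;
    m->live_tasks = ntasks;
    m->ready_tasks = 0;
    for (int i = 0; i < NWORKERS; i++)
        if (pthread_create(&m->threads[i], NULL, worker, m) != 0)
            return -1;
    return 0;
}

/* Request shutdown, wake every worker, then join them all. */
int toy_manager_destroy(struct toy_manager *m)
{
    int joined = 0;

    pthread_mutex_lock(&m->lock);
    m->exiting = true;
    m->ready_tasks = m->live_tasks;  /* post a shutdown event to each task */
    pthread_cond_broadcast(&m->work_available);
    pthread_mutex_unlock(&m->lock);

    for (int i = 0; i < NWORKERS; i++)
        if (pthread_join(m->threads[i], NULL) == 0)
            joined++;
    return joined;
}
```

If one task in this model were never made ready or never counted down,
finished() would stay false and every worker would go back to sleep on the
condition variable, which matches the sleeping threads in the listings below.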
>
>     I think named is hanging in the isc_thread_join() loop, since each
>     call returns only when the joined thread has exited. Using a tool, I
>     collected the following information on the thread states before and
>     after sending the SIGTERM signal.
>
> Before sending the signal:
>
> Thread  PID  PPID  Ticks     Ticks
>                    since     since PRI  KT_STAT COMMAND        WCHAN
>                     run      idle
>
> 4477    3795  1     50         666 0   TSSLEEP  named          ksleep_one(0x21)
> 4481    3795  1     50         666 4   TSSLEEP  named          ksleep_one(0x21)
> 4472    3795  1     50         666 3   TSSLEEP  named          ksleep_one(0x21)
> 4482    3795  1     88         666 7   TSSLEEP  named          per_processor_selects+0x1c0
> 4479    3795  1     88         666 6   TSSLEEP  named          ksleep_one(0x21)
> 4480    3795  1     88         666 5   TSSLEEP  named          ksleep_one(0x21)
> 4473    3795  1     88         666 2   TSSLEEP  named          ksleep_one(0x21)
> 4474    3795  1     100        666 6   TSSLEEP  named          ksleep_one(0x21)
> 4476    3795  1     100        666 1   TSSLEEP  named          ksleep_one(0x21)
> 4475    3795  1     100        666 4   TSSLEEP  named          ksleep_one(0x21)
> 4471    3795  1     6717       680 3   TSSLEEP  named          pm_sigwait(0x400003ffffff0e6c)
>
> After sending the signal:
>
> Thread  PID  PPID  Ticks     Ticks
>                    since     since PRI  KT_STAT COMMAND        WCHAN
>                     run      idle
>
> 4474    3795  1     91         666 1   TSSLEEP  named          ksleep_one(0x21)
> 4472    3795  1     91         666 1   TSSLEEP  named          ksleep_one(0x21)
> 4479    3795  1     91         666 6   TSSLEEP  named          ksleep_one(0x21)
> 4481    3795  1     91         666 0   TSSLEEP  named          ksleep_one(0x21)
> 4477    3795  1     91         666 7   TSSLEEP  named          ksleep_one(0x21)
> 4473    3795  1     91         666 5   TSSLEEP  named          ksleep_one(0x21)
> 4476    3795  1     91         666 3   TSSLEEP  named          ksleep_one(0x21)
> 4475    3795  1     91         666 4   TSSLEEP  named          ksleep_one(0x21)
> 4480    3795  1     91         666 2   TSSLEEP  named          ksleep_one(0x21)
> 4482    3795  1     91         666 3   TSSLEEP  named          per_processor_selects+0xc0
> 4471    3795  1     92         666 3   TSSLEEP  named          thread_wait(0x1178)
>
>    So both before and after named receives the SIGTERM signal, one thread is
>    in select(). The difference is thread 4471, which went from pm_sigwait()
>    to thread_wait(). Thread 4471 is now waiting on thread 0x1178 (4472),
>    which is sleeping.
>
>    For some reason, the threads which are created in
>    isc_taskmgr_create() are not exiting.
>
>     I suspect the problem is in the following code
>     in the lib/isc/task.c file.
>
>         if (task->references == 0 &&
>             TASK_SHUTTINGDOWN(task)) {
>                 /*
>                  * The task is done.
>                  */
>                 XTRACE(isc_msgcat_get(isc_msgcat,
>                                       ISC_MSGSET_TASK,
>                                       ISC_MSG_DONE,
>                                       "done"));
>                 finished = ISC_TRUE;
>                 task->state = task_state_done;
>         }
>
>     When the task manager is created, the dispatch() function is run by each
>     worker thread that named creates. So in this case, 8 threads will be
>     executing the dispatch() function.
>
>     I found that the following is the call sequence for any task that
>     is created by named.
>
>     isc_task_create()
>     isc_task_attach()
>     isc_task_sendanddetach()
>     task_send()
>     task_ready()
>     ......
>     ......
>     task_finished()
>
>     Only if the 'finished' variable is set to ISC_TRUE is task_finished()
>     called and the respective task unlinked from the linked list.
>     Only after all the tasks are unlinked from the list does a worker come
>     out of the first while() loop in the dispatch() function, which causes
>     the respective thread to exit.
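
The dependency on task->references can be illustrated with a small
reference-counting sketch. It is a single-threaded toy with no locking, and
every name in it (struct toy_task, toy_task_attach() and so on) is
hypothetical; only the finished test is modelled on the code quoted above.
It assumes the creator holds one reference from the start.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_state { toy_state_idle, toy_state_done };

/* Hypothetical toy task; not the isc_task_t layout. */
struct toy_task {
    unsigned int references;
    bool shuttingdown;
    enum toy_state state;
};

/* Assumption: creating a task hands the creator one reference. */
void toy_task_init(struct toy_task *t)
{
    t->references = 1;
    t->shuttingdown = false;
    t->state = toy_state_idle;
}

void toy_task_attach(struct toy_task *t)
{
    t->references++;
}

void toy_task_detach(struct toy_task *t)
{
    assert(t->references > 0);
    t->references--;
}

void toy_task_shutdown(struct toy_task *t)
{
    t->shuttingdown = true;
}

/*
 * Same shape as the quoted test: the task is marked done, and may be
 * unlinked, only when it is shutting down and nobody holds a reference.
 */
bool toy_task_check_finished(struct toy_task *t)
{
    bool finished = false;

    if (t->references == 0 && t->shuttingdown) {
        finished = true;
        t->state = toy_state_done;
    }
    return finished;
}
```

In this model, one attach without a matching detach leaves references at 1
after shutdown, so toy_task_check_finished() keeps returning false and the
task would never be unlinked.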
>
>     So, I suspect the problem might be one of the following:
>
>     1. For some reason, some task is not getting detached, so
>         task->references never reaches 0.
>                                 OR
>     2. When the main thread sends the broadcast message, some tasks might
>        not be receiving and executing the shutdown events.
>
>     Are my assumptions right?
>
>     It would be great if anyone could give some additional information which
>     will help in identifying the exact problem.
>
>     Thanks,
>
> With Regards,
> D.Dhana.