BIND 9.2.0 named not exiting on SIGTERM signal

Mon Jun 3 14:55:23 UTC 2002

Hi,

    I am running the BIND 9.2.0 nameserver on a multiprocessor system (8 CPUs).
    When named receives a SIGTERM signal, it sometimes fails to exit. The
    problem is intermittent; most of the time named exits properly.

    I instrumented the code and found the following in the syslog.

 May 24 11:34:17 tst1 named[3795]: isc_app_run: Return value of sigwait() :0
 May 24 11:34:17 tst1 named[3795]: isc_app_run: signal received :15
 May 24 11:34:17 tst1 named[3795]: isc_app_run: Setting want_shutdown to ISC_TRUE
 May 24 11:34:17 tst1 named[3795]: isc_app_run: Returning SUCCESS
 May 24 11:34:17 tst1 named[3795]: main: Return value of isc_app_run(): 0
 May 24 11:34:17 tst1 named[3795]: main: Calling cleanup()
 May 24 11:34:17 tst1 named[3795]: shutting down: flushing changes
 May 24 11:34:17 tst1 named[3795]: stopping command channel on 127.0.0.1#953
 May 24 11:34:17 tst1 named[3795]: no longer listening on 192.1.1.152#53
 May 24 11:34:17 tst1 named[3795]: no longer listening on 15.14.148.121#53
 May 24 11:34:17 tst1 named[3795]: no longer listening on 127.0.0.1#53

    The log above confirms that the main thread receives the SIGTERM
    signal and calls cleanup(). cleanup() in turn calls
    destroy_managers() to destroy the task, timer and socket managers.

    In isc_taskmgr_destroy(), the main thread sends a shutdown message
    to all the tasks and then waits for every thread it created in
    isc_taskmgr_create() to exit.
----
        /*
         * Wake up any sleeping workers.  This ensures we get work done if
         * there's work left to do, and if there are already no tasks left
         * it will cause the workers to see manager->exiting.
         */
        BROADCAST(&manager->work_available);

        /*
         * Wait for all the worker threads to exit.
         */
        for (i = 0; i < manager->workers; i++)
                (void)isc_thread_join(manager->threads[i], NULL);
----
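
    To illustrate the broadcast-then-join step, here is a minimal sketch
    using plain POSIX threads instead of the ISC wrappers. The names
    toy_manager, toy_worker and toy_manager_destroy_demo are made up, and
    the worker's exit condition (exiting set and no tasks left) is my
    simplified reading of the code, not a copy of it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define TOY_WORKERS 4

/* Hypothetical stand-in for the task manager; not the ISC structure. */
struct toy_manager {
    pthread_mutex_t lock;
    pthread_cond_t  work_available;
    bool            exiting;    /* set by the destroying thread */
    unsigned        live_tasks; /* tasks still linked into the manager */
    unsigned        exited;     /* workers that have left their loop */
    pthread_t       threads[TOY_WORKERS];
};

static void *toy_worker(void *arg)
{
    struct toy_manager *m = arg;

    pthread_mutex_lock(&m->lock);
    /* Sleep until the manager is exiting AND no tasks remain. */
    while (!(m->exiting && m->live_tasks == 0))
        pthread_cond_wait(&m->work_available, &m->lock);
    m->exited++;
    pthread_mutex_unlock(&m->lock);
    return NULL;
}

/* Start the workers, shut down, join; returns how many workers exited. */
static unsigned toy_manager_destroy_demo(void)
{
    struct toy_manager m;
    unsigned i;

    pthread_mutex_init(&m.lock, NULL);
    pthread_cond_init(&m.work_available, NULL);
    m.exiting = false;
    m.live_tasks = 1;
    m.exited = 0;

    for (i = 0; i < TOY_WORKERS; i++) {
        int rc = pthread_create(&m.threads[i], NULL, toy_worker, &m);
        assert(rc == 0);
        (void)rc;
    }

    pthread_mutex_lock(&m.lock);
    m.live_tasks = 0;  /* the last task finished and was unlinked */
    m.exiting = true;
    /* Wake every sleeping worker so it re-checks the exit condition. */
    pthread_cond_broadcast(&m.work_available);
    pthread_mutex_unlock(&m.lock);

    /* Each join returns only when that particular worker has exited. */
    for (i = 0; i < TOY_WORKERS; i++)
        (void)pthread_join(m.threads[i], NULL);

    pthread_cond_destroy(&m.work_available);
    pthread_mutex_destroy(&m.lock);
    return m.exited;
}
```

    In this sketch, if live_tasks were left non-zero the workers would go
    back to sleep after the broadcast and the join loop would block
    forever, which is the kind of hang I am seeing.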

    I think named is hanging in isc_thread_join(), which returns only
    once the thread being joined has exited. Using a tool, I collected
    the following information on the thread states before and after
    sending the SIGTERM signal.

Before sending the signal:

Thread  PID  PPID  Ticks     Ticks
                   since     since PRI  KT_STAT COMMAND        WCHAN
                    run      idle

4477    3795  1     50         666 0   TSSLEEP  named          ksleep_one(0x21)
4481    3795  1     50         666 4   TSSLEEP  named          ksleep_one(0x21)
4472    3795  1     50         666 3   TSSLEEP  named          ksleep_one(0x21)
4482    3795  1     88         666 7   TSSLEEP  named          per_processor_selects+0x1c0
4479    3795  1     88         666 6   TSSLEEP  named          ksleep_one(0x21)
4480    3795  1     88         666 5   TSSLEEP  named          ksleep_one(0x21)
4473    3795  1     88         666 2   TSSLEEP  named          ksleep_one(0x21)
4474    3795  1     100        666 6   TSSLEEP  named          ksleep_one(0x21)
4476    3795  1     100        666 1   TSSLEEP  named          ksleep_one(0x21)
4475    3795  1     100        666 4   TSSLEEP  named          ksleep_one(0x21)
4471    3795  1     6717       680 3   TSSLEEP  named          pm_sigwait(0x400003ffffff0e6c)

After sending the signal:

Thread  PID  PPID  Ticks     Ticks
                   since     since PRI  KT_STAT COMMAND        WCHAN
                    run      idle

4474    3795  1     91         666 1   TSSLEEP  named          ksleep_one(0x21)
4472    3795  1     91         666 1   TSSLEEP  named          ksleep_one(0x21)
4479    3795  1     91         666 6   TSSLEEP  named          ksleep_one(0x21)
4481    3795  1     91         666 0   TSSLEEP  named          ksleep_one(0x21)
4477    3795  1     91         666 7   TSSLEEP  named          ksleep_one(0x21)
4473    3795  1     91         666 5   TSSLEEP  named          ksleep_one(0x21)
4476    3795  1     91         666 3   TSSLEEP  named          ksleep_one(0x21)
4475    3795  1     91         666 4   TSSLEEP  named          ksleep_one(0x21)
4480    3795  1     91         666 2   TSSLEEP  named          ksleep_one(0x21)
4482    3795  1     91         666 3   TSSLEEP  named          per_processor_selects+0xc0
4471    3795  1     92         666 3   TSSLEEP  named          thread_wait(0x1178)

    So both before and after named receives the SIGTERM signal, one thread
    is in select(). The difference is thread 4471, which went from
    pm_sigwait() to thread_wait(). Thread 4471 is now waiting on thread
    0x1178 (4472 in decimal), which is sleeping.

    For some reason, the threads created in isc_taskmgr_create() are
    not exiting.

    I suspect the problem is in the following code
    in the lib/isc/task.c file.

        if (task->references == 0 &&
            TASK_SHUTTINGDOWN(task)) {
                /*
                 * The task is done.
                 */
                XTRACE(isc_msgcat_get(
                               isc_msgcat,
                               ISC_MSGSET_TASK,
                               ISC_MSG_DONE,
                               "done"));
                finished = ISC_TRUE;
                task->state = task_state_done;
        }

    Each worker thread that named creates (in isc_taskmgr_create())
    executes the dispatch() function, so in this case 8 threads will be
    running dispatch().

    I found that any task created by named goes through the following
    sequence:

    isc_task_create()
    isc_task_attach()
    isc_task_sendanddetach()
    task_send()
    task_ready()
    ......
    ......
    task_finished()

    task_finished() is called, and the task unlinked from the linked
    list, only if the 'finished' variable is set to ISC_TRUE. A worker
    thread leaves the first while() loop in dispatch() only after all
    the tasks have been unlinked from the list, and only then does the
    thread exit.

    So I suspect the problem might be one of the following:

    1. For some reason, some task is never detached, so its
       task->references never reaches 0.
                                OR
    2. When the main thread broadcasts the shutdown, some tasks do not
       receive and execute their shutdown events.
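
    Both cases can be expressed with a small single-threaded model of the
    check quoted above. This is only a sketch under my own assumptions:
    toy_task, toy_task_try_finish() and toy_worker_may_exit() are invented
    names, and falling back to an idle state when the check fails is an
    assumption on my part.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_state { TOY_IDLE, TOY_DONE };

/* Hypothetical, stripped-down task; not the real isc_task_t. */
struct toy_task {
    unsigned       references;    /* attach() calls not yet detached */
    bool           shutting_down; /* shutdown event was delivered */
    enum toy_state state;
};

/*
 * Mirrors the quoted condition: a task is done only when it has no
 * references left AND it is shutting down.
 */
static bool toy_task_try_finish(struct toy_task *t)
{
    if (t->references == 0 && t->shutting_down) {
        t->state = TOY_DONE;
        return true;
    }
    t->state = TOY_IDLE; /* assumed fallback state */
    return false;
}

/*
 * A worker may leave its loop only when the manager is exiting and
 * every task has reached the done state.
 */
static bool toy_worker_may_exit(bool exiting,
                                const struct toy_task *tasks, unsigned n)
{
    unsigned i;

    if (!exiting)
        return false;
    for (i = 0; i < n; i++)
        if (tasks[i].state != TOY_DONE)
            return false;
    return true;
}
```

    In this model, a task with one leaked reference (case 1) or a task
    that never saw its shutdown event (case 2) stays out of the done
    state, so toy_worker_may_exit() keeps returning false and the thread
    joining the workers would wait indefinitely.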

    Are my assumptions right?

    It would be great if anyone could give some additional information which
    will help in identifying the exact problem.

    Thanks,

With Regards,
D.Dhana.