<div style="font-family:Microsoft YaHei;font-size:14px;color:#000000;" class="l_node_has_color"><p data-start="68" data-end="150"><strong data-start="68" data-end="80">Subject:</strong> [PATCH] Fix potential deadlock between PingCheckMgr and PingChannel</p>
<p data-start="152" data-end="163"><strong data-start="152" data-end="161">Body:</strong></p>
<p data-start="165" data-end="188">Hello Kea developers,</p>
<p data-start="190" data-end="329">I’d like to draw your attention to a potential deadlock issue we encountered in the interaction between <code data-start="294" data-end="308">PingCheckMgr</code> and <code data-start="313" data-end="326">PingChannel</code>.</p>
<ul data-start="331" data-end="478">
<li data-start="331" data-end="406">
<p data-start="333" data-end="406"><strong data-start="333" data-end="350">Issue report:</strong> <a href="https://gitlab.isc.org/isc-projects/kea/-/issues/4140" target="_new" rel="noopener" data-start="351" data-end="404">https://gitlab.isc.org/isc-projects/kea/-/issues/4140</a></p>
</li>
<li data-start="407" data-end="478">
<p data-start="409" data-end="478"><strong data-start="409" data-end="431">Proposed fix (PR):</strong> <a href="https://github.com/isc-projects/kea/pull/151" target="_new" rel="noopener" data-start="432" data-end="476">https://github.com/isc-projects/kea/pull/151</a></p>
</li>
</ul>
<h3 data-start="480" data-end="506">Summary of the problem</h3>
<p data-start="507" data-end="598">There is a lock order inversion between <code data-start="547" data-end="569">PingCheckMgr::mutex_</code> and <code data-start="574" data-end="595">PingChannel::mutex_</code>:</p>
<ul data-start="600" data-end="890">
<li data-start="600" data-end="742">
<p data-start="602" data-end="742"><code data-start="602" data-end="627">PingChannel::sendNext()</code> holds the channel lock and may call back into <code data-start="674" data-end="706">PingCheckMgr::checkSuspended()</code>, which requires the manager lock.</p>
</li>
<li data-start="743" data-end="890">
<p data-start="745" data-end="890"><code data-start="745" data-end="781">PingCheckMgr::expirationTimedOut()</code> holds the manager lock and then calls <code data-start="820" data-end="855">channel_->startSend()/startRead()</code>, which require the channel lock.</p>
</li>
</ul>
<p data-start="892" data-end="1005">This creates a circular wait and can lead to deadlock under high concurrency with many <code data-start="979" data-end="991">ping_check</code> operations.</p>
<h3 data-start="1007" data-end="1014">Fix</h3>
<p data-start="1015" data-end="1170">The PR moves <code data-start="1028" data-end="1041">startSend()</code> / <code data-start="1044" data-end="1057">startRead()</code> calls outside the <code data-start="1076" data-end="1090">PingCheckMgr</code> lock to enforce consistent lock ordering.<br data-start="1132" data-end="1135">Correctness is preserved because:</p>
<ul data-start="1171" data-end="1320">
<li data-start="1171" data-end="1250">
<p data-start="1173" data-end="1250"><code data-start="1173" data-end="1185">more_pings</code> is computed under lock and used outside only as a wakeup hint.</p>
</li>
<li data-start="1251" data-end="1320">
<p data-start="1253" data-end="1320"><code data-start="1253" data-end="1263">channel_</code> access is null-checked, so concurrent closure is safe.</p>
</li>
</ul>
<h3 data-start="1322" data-end="1344">Request for review</h3>
<p data-start="1345" data-end="1530">I’d appreciate it if you could review the proposed change and share any feedback. In particular, whether there are scenarios I might have overlooked regarding concurrency correctness.</p>
<p data-start="1532" data-end="1552">Thanks in advance!</p>
<p data-start="1554" data-end="1583">Best regards,<br data-start="1567" data-end="1570">liyunqing</p>
<p> </p>
<p style="color: #000; font-size: 16px;">---</p>
<div id="cs2c_mail_sigature" style="color: #000; font-size: 16px; font-family: Microsoft YaHei;"></div>
<p> </p></div>