Conversation

ryanofsky
Collaborator

@ryanofsky ryanofsky commented Sep 2, 2025

This bug is currently causing mptest "disconnecting and blocking" test to occasionally hang as reported by maflcko in bitcoin/bitcoin#33244.

The bug was actually first reported by Sjors in Sjors/bitcoin#90 (comment) and there are more details about it in #189.

The bug is caused by the "disconnecting and blocking" test triggering a disconnect right before a server IPC call returns. This results in a race between the IPC server thread and the onDisconnect handler in the event loop thread both trying to destroy the server's request_threads ProxyClient<Thread> object when the IPC call is done.

There was a lack of synchronization in this case, fixed here by adding loop->sync() in various places. There were also lock order issues where Waiter::m_mutex could be incorrectly locked before EventLoop::m_mutex, resulting in a deadlock.
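
For illustration, the general shape of the added synchronization is sketched below. This is a simplified sketch, not the exact patch; it reuses the cleanup lambda that also appears in the stack traces later in this thread.

    // Simplified sketch (not the exact patch): cleanup that can race with the
    // onDisconnect handler is funneled through EventLoop::sync(), so it always
    // runs on the event loop thread, serialized with the disconnect handler.
    m_context.loop->sync([&] {
        // Executes on the event loop thread, so the onDisconnect handler cannot
        // concurrently destroy the same ProxyClient<Thread> state.
        if (m_context.connection) m_context.connection->removeSyncCleanup(disconnect_cb);
    });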

@DrahtBot

DrahtBot commented Sep 2, 2025

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Reviews

See the guideline for information on the review process.

Type Reviewers
ACK Sjors
Stale ACK Eunovo

If your review is incorrectly listed, please react with 👎 to this comment and the bot will ignore it on the next update.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #210 (Replace KJ_DEFER with kj::defer by Sjors)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@ryanofsky ryanofsky changed the title from "bug: fix ProxyClient<Thread> deadlock if disconnected as IPC call is returning" to "bug: fix mptest hang, ProxyClient<Thread> deadlock in disconnect handler" Sep 3, 2025
@ryanofsky
Collaborator Author

I captured two stack traces of the mptest hang on master (1b8d4a6) to be more confident this PR fixes the bug. I might also be able to update the test to make it trigger this bug more reliably, but I would need to think more about how to do that. Both stack traces were identical and showed a deadlock between Waiter::m_mutex and EventLoop::post. The stack trace, annotated with source code, is below.

  • Thread 1 is the main test thread. It is stuck at the end of the "disconnecting and blocking" test trying to destroy ProxyClient<FooInterface>, which is blocked waiting on EventLoop::post and EventLoop::m_mutex in order to free the client IPC handle it owns.
  • Thread 2 is the server thread created by ProxyServer<ThreadMap>::makeThread, which executes async IPC calls. It is blocked in a similar place as thread 1, trying to destroy a ProxyClient<Thread> and waiting for EventLoop::post. But additionally, further up the stack in frame 22, in the type-context.h PassField function, it has locked Waiter::m_mutex to delete the ThreadContext::request_threads map entry containing the object being destroyed. This lock is the bug, because Waiter::m_mutex can't be held while calling EventLoop::post; doing so violates the lock ordering.
  • Thread 3 is the EventLoop thread, which is processing the disconnected event. It is running the ProxyClient<Thread>::m_disconnect_cb callback, which is stuck trying to acquire Waiter::m_mutex in order to delete its ThreadContext::request_threads entry. It can't acquire the mutex because thread 2 holds it. Thread 3 is also trying to delete the same map entry as thread 2, which is an additional bug fixed in this PR by always deleting these entries on the event loop thread instead of letting thread 2 access the map directly (a simplified sketch of this ordering follows after this list).
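
A simplified sketch of the problematic ordering, condensed from the traces below (the fix routes both erases through loop->sync() on the event loop thread and avoids holding Waiter::m_mutex while the ProxyClient<Thread> is destroyed):

    // Thread 2 (IPC server thread), before the fix: locks Waiter::m_mutex, then the
    // erase runs ~ProxyClient<Thread>, which calls EventLoop::post() and blocks
    // waiting for the event loop thread to service the post.
    std::unique_lock<std::mutex> lock(thread_context.waiter->m_mutex);
    request_threads.erase(server.m_context.connection);

    // Thread 3 (event loop thread), disconnect handler: needs the same mutex, so it
    // can never service the post above -> deadlock.
    const std::unique_lock<std::mutex> lock(mutex); // mutex is thread_context.waiter->m_mutex
    thread->second.m_disconnect_cb.reset();
    threads.erase(thread);
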
thread 1

#0  0x00007f480509438e in __futex_abstimed_wait_common () from /nix/store/cg9s562sa33k78m63njfn1rw47dp9z0i-glibc-2.40-66/lib/libc.so.6
#1  0x00007f4805097010 in pthread_cond_wait@@GLIBC_2.3.2 () from /nix/store/cg9s562sa33k78m63njfn1rw47dp9z0i-glibc-2.40-66/lib/libc.so.6
#2  0x000055df0ff8bdbb in std::condition_variable::wait<mp::EventLoop::post(kj::Function<void ()>)::$_2>(std::unique_lock<std::mutex>&, mp::EventLoop::post(kj::Function<void ()>)::$_2) (this=0x7f4804ffed08, __lock=..., __p=...)
    at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/condition_variable:105
#3  mp::EventLoop::post (this=0x7f4804ffec90, fn=...) at mp/src/mp/proxy.cpp:273

   269      Unlock(lock, [&] {
   270          char buffer = 0;
   271          KJ_SYSCALL(write(post_fd, &buffer, 1));
   272      });
>  273      m_cv.wait(lock.m_lock, [this, &fn]() MP_REQUIRES(m_mutex) { return m_post_fn != &fn; });
   274  }

#4  0x000055df0feefd7d in mp::EventLoop::sync<mp::ProxyClientBase<mp::test::messages::FooInterface, mp::test::FooImplementation>::ProxyClientBase(mp::test::messages::FooInterface::Client, mp::Connection*, bool)::{lambda()#2}::operator()() const::{lambda()#1}>(mp::ProxyClientBase<mp::test::messages::FooInterface, mp::test::FooImplementation>::ProxyClientBase(mp::test::messages::FooInterface::Client, mp::Connection*, bool)::{lambda()#2}::operator()() const::{lambda()#1}&&) (this=0x7f4804ffec90, callable=...) at mp/include/mp/proxy-io.h:182

   179      template <typename Callable>
   180      void sync(Callable&& callable)
   181      {
 > 182          post(std::forward<Callable>(callable));
   183      }

#5  0x000055df0feefc52 in mp::ProxyClientBase<mp::test::messages::FooInterface, mp::test::FooImplementation>::ProxyClientBase(mp::test::messages::FooInterface::Client, mp::Connection*, bool)::{lambda()#2}::operator()() const (this=<optimized out>)
    at mp/include/mp/proxy-io.h:434

   431          Sub::destroy(*this);
   432
   433          // FIXME: Could just invoke removed addCleanup fn here instead of duplicating code
 > 434          m_context.loop->sync([&]() {
   435              // Remove disconnect callback on cleanup so it doesn't run and try
   436              // to access this object after it's destroyed. This call needs to
   437              // run inside loop->sync() on the event loop thread because
   438              // otherwise, if there were an ill-timed disconnect, the
   439              // onDisconnect handler could fire and delete the Connection object
   440              // before the removeSyncCleanup call.
   441              if (m_context.connection) m_context.connection->removeSyncCleanup(disconnect_cb);

#6  std::__invoke_impl<void, mp::ProxyClientBase<mp::test::messages::FooInterface, mp::test::FooImplementation>::ProxyClientBase(mp::test::messages::FooInterface::Client, mp::Connection*, bool)::{lambda()#2}&>(std::__invoke_other, mp::ProxyClientBase<mp::test::messages::FooInterface, mp::test::FooImplementation>::ProxyClientBase(mp::test::messages::FooInterface::Client, mp::Connection*, bool)::{lambda()#2}&) (__f=...) at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/bits/invoke.h:61
#7  std::__invoke_r<void, mp::ProxyClientBase<mp::test::messages::FooInterface, mp::test::FooImplementation>::ProxyClientBase(mp::test::messages::FooInterface::Client, mp::Connection*, bool)::{lambda()#2}&>(mp::ProxyClientBase<mp::test::messages::FooInterface, mp::test::FooImplementation>::ProxyClientBase(mp::test::messages::FooInterface::Client, mp::Connection*, bool)::{lambda()#2}&) (__fn=...) at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/bits/invoke.h:111
#8  std::_Function_handler<void (), mp::ProxyClientBase<mp::test::messages::FooInterface, mp::test::FooImplementation>::ProxyClientBase(mp::test::messages::FooInterface::Client, mp::Connection*, bool)::{lambda()#2}>::_M_invoke(std::_Any_data const&) (__functor=...)
    at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/bits/std_function.h:290
#9  0x000055df0ff52d26 in std::function<void()>::operator() (this=0x7fffb7c1b700) at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/bits/std_function.h:591
#10 mp::CleanupRun (fns=empty std::__cxx11::list) at mp/include/mp/proxy.h:43

    39  inline void CleanupRun(CleanupList& fns) {
    40      while (!fns.empty()) {
    41          auto fn = std::move(fns.front());
    42          fns.pop_front();
>   43          fn();
    44      }
    45  }

#11 0x000055df0ff872e7 in mp::ProxyClientBase<mp::test::messages::FooInterface, mp::test::FooImplementation>::~ProxyClientBase (this=0x7f4800027cf0) at mp/include/mp/proxy-io.h:460

   457  template <typename Interface, typename Impl>
   458  ProxyClientBase<Interface, Impl>::~ProxyClientBase() noexcept
   459  {
>  460      CleanupRun(m_context.cleanup_fns);
   461  }

#12 0x000055df0feebfc4 in std::default_delete<mp::ProxyClient<mp::test::messages::FooInterface> >::operator() (this=0x7fffb7c1b830, __ptr=0x7f4800027cf0) at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/bits/unique_ptr.h:93
#13 std::__uniq_ptr_impl<mp::ProxyClient<mp::test::messages::FooInterface>, std::default_delete<mp::ProxyClient<mp::test::messages::FooInterface> > >::reset (this=0x7fffb7c1b830, __p=0x0)
    at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/bits/unique_ptr.h:205
#14 std::unique_ptr<mp::ProxyClient<mp::test::messages::FooInterface>, std::default_delete<mp::ProxyClient<mp::test::messages::FooInterface> > >::reset (this=0x7fffb7c1b830, __p=0x0)
    at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/bits/unique_ptr.h:504
#15 mp::test::TestSetup::~TestSetup (this=this@entry=0x7fffb7c1b7d8) at mp/test/mp/test/test.cpp:103

    98      ~TestSetup()
    99      {
   100          // Test that client cleanup_fns are executed.
   101          bool destroyed = false;
   102          client->m_context.cleanup_fns.emplace_front([&destroyed] { destroyed = true; });
>  103          client.reset();
   104          KJ_EXPECT(destroyed);
   105
   106          thread.join();
   107      }

#16 0x000055df0fee997f in mp::test::TestCase251::run (this=<optimized out>) at mp/test/mp/test/test.cpp:298

   292      // Now that the disconnect has been detected, set signal allowing the
   293      // callFnAsync() IPC call to return. Since signalling may not wake up the
   294      // thread right away, it is important for the signal variable to be declared
   295      // *before* the TestSetup variable so is not destroyed while
   296      // signal.get_future().get() is called.
   297      signal.set_value();
>  298  }

#17 0x00007f4805a34291 in kj::Maybe<kj::Exception> kj::runCatchingExceptions<kj::TestRunner::run()::{lambda()#1}>(kj::TestRunner::run()::{lambda()#1}&&) () from /nix/store/46kiq9naswgbqfc14kc9nxcbgd0rv0m2-capnproto-1.1.0/lib/libkj-test.so.1.1.0
#18 0x00007f4805a33a08 in kj::TestRunner::run() () from /nix/store/46kiq9naswgbqfc14kc9nxcbgd0rv0m2-capnproto-1.1.0/lib/libkj-test.so.1.1.0
#19 0x00007f4805a337bd in kj::Function<kj::MainBuilder::Validity ()>::Impl<kj::_::BoundMethod<kj::TestRunner&, kj::TestRunner::getMain()::{lambda(auto:1&, (auto:2&&)...)#7}, kj::TestRunner::getMain()::{lambda(auto:1&, (auto:2&&)...)#8}> >::operator()() ()
   from /nix/store/46kiq9naswgbqfc14kc9nxcbgd0rv0m2-capnproto-1.1.0/lib/libkj-test.so.1.1.0
#20 0x00007f480572f93d in kj::MainBuilder::MainImpl::operator()(kj::StringPtr, kj::ArrayPtr<kj::StringPtr const>) () from /nix/store/46kiq9naswgbqfc14kc9nxcbgd0rv0m2-capnproto-1.1.0/lib/libkj.so.1.1.0
#21 0x00007f480572df56 in kj::Maybe<kj::Exception> kj::runCatchingExceptions<kj::runMainAndExit(kj::ProcessContext&, kj::Function<void (kj::StringPtr, kj::ArrayPtr<kj::StringPtr const>)>&&, int, char**)::$_0>(kj::runMainAndExit(kj::ProcessContext&, kj::Function<void (kj::StringPtr, kj::ArrayPtr<kj::StringPtr const>)>&&, int, char**)::$_0&&) () from /nix/store/46kiq9naswgbqfc14kc9nxcbgd0rv0m2-capnproto-1.1.0/lib/libkj.so.1.1.0
#22 0x00007f480572dc7f in kj::runMainAndExit(kj::ProcessContext&, kj::Function<void (kj::StringPtr, kj::ArrayPtr<kj::StringPtr const>)>&&, int, char**) () from /nix/store/46kiq9naswgbqfc14kc9nxcbgd0rv0m2-capnproto-1.1.0/lib/libkj.so.1.1.0
#23 0x00007f4805a329b9 in main () from /nix/store/46kiq9naswgbqfc14kc9nxcbgd0rv0m2-capnproto-1.1.0/lib/libkj-test.so.1.1.0
#24 0x00007f480502a47e in __libc_start_call_main () from /nix/store/cg9s562sa33k78m63njfn1rw47dp9z0i-glibc-2.40-66/lib/libc.so.6
#25 0x00007f480502a539 in __libc_start_main_impl () from /nix/store/cg9s562sa33k78m63njfn1rw47dp9z0i-glibc-2.40-66/lib/libc.so.6
#26 0x000055df0fee6eb5 in _start ()

thread 2

#0  0x00007f480509438e in __futex_abstimed_wait_common () from /nix/store/cg9s562sa33k78m63njfn1rw47dp9z0i-glibc-2.40-66/lib/libc.so.6
#1  0x00007f4805097010 in pthread_cond_wait@@GLIBC_2.3.2 () from /nix/store/cg9s562sa33k78m63njfn1rw47dp9z0i-glibc-2.40-66/lib/libc.so.6
#2  0x000055df0ff8bd0b in std::condition_variable::wait<mp::EventLoop::post(kj::Function<void ()>)::$_0>(std::unique_lock<std::mutex>&, mp::EventLoop::post(kj::Function<void ()>)::$_0) (this=0x7f4804ffed08, __lock=..., __p=...)
    at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/condition_variable:105
#3  mp::EventLoop::post (this=this@entry=0x7f4804ffec90, fn=...) at mp/src/mp/proxy.cpp:266

   258  void EventLoop::post(kj::Function<void()> fn)
   259  {
   260      if (std::this_thread::get_id() == m_thread_id) {
   261          fn();
   262          return;
   263      }
   264      Lock lock(m_mutex);
   265      EventLoopRef ref(*this, &lock);
>  266      m_cv.wait(lock.m_lock, [this]() MP_REQUIRES(m_mutex) { return m_post_fn == nullptr; });
   267      m_post_fn = &fn;
   268      int post_fd{m_post_fd};
   269      Unlock(lock, [&] {
   270          char buffer = 0;
   271          KJ_SYSCALL(write(post_fd, &buffer, 1));
   272      });
   273      m_cv.wait(lock.m_lock, [this, &fn]() MP_REQUIRES(m_mutex) { return m_post_fn != &fn; });
   274  }

#4  0x000055df0ff8f22d in mp::EventLoop::sync<mp::ProxyClientBase<mp::Thread, capnp::Void>::ProxyClientBase(mp::Thread::Client, mp::Connection*, bool)::{lambda()#2}::operator()() const::{lambda()#1}>(mp::ProxyClientBase<mp::Thread, capnp::Void>::ProxyClientBase(mp::Thread::Client, mp::Connection*, bool)::{lambda()#2}::operator()() const::{lambda()#1}&&) (this=0x7f4804ffec90, callable=...) at mp/include/mp/proxy-io.h:182

   179      template <typename Callable>
   180      void sync(Callable&& callable)
   181      {
>  182          post(std::forward<Callable>(callable));
   183      }

#5  0x000055df0ff8f102 in mp::ProxyClientBase<mp::Thread, capnp::Void>::ProxyClientBase(mp::Thread::Client, mp::Connection*, bool)::{lambda()#2}::operator()() const (this=<optimized out>) at mp/include/mp/proxy-io.h:434

   431          Sub::destroy(*this);
   432
   433          // FIXME: Could just invoke removed addCleanup fn here instead of duplicating code
>  434          m_context.loop->sync([&]() {
   435              // Remove disconnect callback on cleanup so it doesn't run and try
   436              // to access this object after it's destroyed. This call needs to
   437              // run inside loop->sync() on the event loop thread because
   438              // otherwise, if there were an ill-timed disconnect, the
   439              // onDisconnect handler could fire and delete the Connection object
   440              // before the removeSyncCleanup call.
   441              if (m_context.connection) m_context.connection->removeSyncCleanup(disconnect_cb);

#6  std::__invoke_impl<void, mp::ProxyClientBase<mp::Thread, capnp::Void>::ProxyClientBase(mp::Thread::Client, mp::Connection*, bool)::{lambda()#2}&>(std::__invoke_other, mp::ProxyClientBase<mp::Thread, capnp::Void>::ProxyClientBase(mp::Thread::Client, mp::Connection*, bool)::{lambda()#2}&) (__f=...) at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/bits/invoke.h:61
#7  std::__invoke_r<void, mp::ProxyClientBase<mp::Thread, capnp::Void>::ProxyClientBase(mp::Thread::Client, mp::Connection*, bool)::{lambda()#2}&>(mp::ProxyClientBase<mp::Thread, capnp::Void>::ProxyClientBase(mp::Thread::Client, mp::Connection*, bool)::{lambda()#2}&) (__fn=...)
    at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/bits/invoke.h:111
#8  std::_Function_handler<void (), mp::ProxyClientBase<mp::Thread, capnp::Void>::ProxyClientBase(mp::Thread::Client, mp::Connection*, bool)::{lambda()#2}>::_M_invoke(std::_Any_data const&) (__functor=...)
    at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/bits/std_function.h:290
#9  0x000055df0ff52d26 in std::function<void()>::operator() (this=0x7f47ffffe870) at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/bits/std_function.h:591
#10 mp::CleanupRun (fns=empty std::__cxx11::list) at mp/include/mp/proxy.h:43

    39  inline void CleanupRun(CleanupList& fns) {
    40      while (!fns.empty()) {
    41          auto fn = std::move(fns.front());
    42          fns.pop_front();
>   43          fn();
    44      }
    45  }

#11 0x000055df0ff8e627 in mp::ProxyClientBase<mp::Thread, capnp::Void>::~ProxyClientBase (this=0x7f47f8000e58) at mp/include/mp/proxy-io.h:460

   457  template <typename Interface, typename Impl>
   458  ProxyClientBase<Interface, Impl>::~ProxyClientBase() noexcept
   459  {
>  460      CleanupRun(m_context.cleanup_fns);
   461  }

#12 0x000055df0ff53129 in std::pair<mp::Connection* const, mp::ProxyClient<mp::Thread> >::~pair (this=0x7f47f8000e50) at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/bits/stl_iterator.h:3013
#13 std::destroy_at<std::pair<mp::Connection* const, mp::ProxyClient<mp::Thread> > > (__location=0x7f47f8000e50) at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/bits/stl_construct.h:88
#14 std::allocator_traits<std::allocator<std::_Rb_tree_node<std::pair<mp::Connection* const, mp::ProxyClient<mp::Thread> > > > >::destroy<std::pair<mp::Connection* const, mp::ProxyClient<mp::Thread> > > (__p=0x7f47f8000e50, __a=...)
    at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/bits/alloc_traits.h:599
#15 std::_Rb_tree<mp::Connection*, std::pair<mp::Connection* const, mp::ProxyClient<mp::Thread> >, std::_Select1st<std::pair<mp::Connection* const, mp::ProxyClient<mp::Thread> > >, std::less<mp::Connection*>, std::allocator<std::pair<mp::Connection* const, mp::ProxyClient<mp::Thread> > > >::_M_destroy_node (this=0x7f47fffff680, __p=0x7f47f8000e30) at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/bits/stl_tree.h:621
#16 std::_Rb_tree<mp::Connection*, std::pair<mp::Connection* const, mp::ProxyClient<mp::Thread> >, std::_Select1st<std::pair<mp::Connection* const, mp::ProxyClient<mp::Thread> > >, std::less<mp::Connection*>, std::allocator<std::pair<mp::Connection* const, mp::ProxyClient<mp::Thread> > > >::_M_drop_node (this=0x7f47fffff680, __p=0x7f47f8000e30) at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/bits/stl_tree.h:629
#17 std::_Rb_tree<mp::Connection*, std::pair<mp::Connection* const, mp::ProxyClient<mp::Thread> >, std::_Select1st<std::pair<mp::Connection* const, mp::ProxyClient<mp::Thread> > >, std::less<mp::Connection*>, std::allocator<std::pair<mp::Connection* const, mp::ProxyClient<mp::Thread> > > >::_M_erase (this=this@entry=0x7f47fffff680, __x=0x7f47f8000e30) at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/bits/stl_tree.h:1934
#18 0x000055df0ff530c9 in std::_Rb_tree<mp::Connection*, std::pair<mp::Connection* const, mp::ProxyClient<mp::Thread> >, std::_Select1st<std::pair<mp::Connection* const, mp::ProxyClient<mp::Thread> > >, std::less<mp::Connection*>, std::allocator<std::pair<mp::Connection* const, mp::ProxyClient<mp::Thread> > > >::clear (this=0x7f47fffff680) at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/bits/stl_tree.h:1251
#19 std::_Rb_tree<mp::Connection*, std::pair<mp::Connection* const, mp::ProxyClient<mp::Thread> >, std::_Select1st<std::pair<mp::Connection* const, mp::ProxyClient<mp::Thread> > >, std::less<mp::Connection*>, std::allocator<std::pair<mp::Connection* const, mp::ProxyClient<mp::Thread> > > >::_M_erase_aux (this=0x7f47fffff680, __first={...}, __last={...}) at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/bits/stl_tree.h:2505
#20 0x000055df0ff70fa6 in std::_Rb_tree<mp::Connection*, std::pair<mp::Connection* const, mp::ProxyClient<mp::Thread> >, std::_Select1st<std::pair<mp::Connection* const, mp::ProxyClient<mp::Thread> > >, std::less<mp::Connection*>, std::allocator<std::pair<mp::Connection* const, mp::ProxyClient<mp::Thread> > > >::erase (this=0x7f4804ffed30, __x=<optimized out>) at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/bits/stl_tree.h:2519
#21 std::map<mp::Connection*, mp::ProxyClient<mp::Thread>, std::less<mp::Connection*>, std::allocator<std::pair<mp::Connection* const, mp::ProxyClient<mp::Thread> > > >::erase (this=0x7f4804ffed30, __x=<optimized out>)
    at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/bits/stl_map.h:1118
#22 operator() (this=this@entry=0x7f47ffffeaf8) at mp/include/mp/type-context.h:103

   102                      const bool erase_thread{inserted};
   103                      KJ_DEFER(if (erase_thread) {
   104                          std::unique_lock<std::mutex> lock(thread_context.waiter->m_mutex);
   105                          // Call erase here with a Connection* argument instead
   106                          // of an iterator argument, because the `request_thread`
   107                          // iterator may be invalid if the connection is closed
   108                          // during this function call. More specifically, the
   109                          // iterator may be invalid because SetThread adds a
   110                          // cleanup callback to the Connection destructor that
   111                          // erases the thread from the map, and also because the
   112                          // ProxyServer<Thread> destructor calls
   113                          // request_threads.clear().
>  114                          request_threads.erase(server.m_context.connection);
   115                      });
   116                      fn.invoke(server_context, args...);
   117                  }

#23 0x000055df0ff709c7 in run (this=<optimized out>) at /nix/store/46kiq9naswgbqfc14kc9nxcbgd0rv0m2-capnproto-1.1.0/include/kj/common.h:2007
#24 ~Deferred (this=<optimized out>) at /nix/store/46kiq9naswgbqfc14kc9nxcbgd0rv0m2-capnproto-1.1.0/include/kj/common.h:1996
#25 operator() (this=0x7f4800034628) at mp/include/mp/type-context.h:117

>  117                  }

#26 0x000055df0ff04920 in kj::Function<void()>::operator() (this=0x7f47ffffeda0) at /nix/store/46kiq9naswgbqfc14kc9nxcbgd0rv0m2-capnproto-1.1.0/include/kj/function.h:119
#27 mp::Unlock<std::unique_lock<std::mutex>, kj::Function<void()>&> (lock=..., callback=...) at mp/include/mp/util.h:198

   194  template <typename Lock, typename Callback>
   195  void Unlock(Lock& lock, Callback&& callback)
   196  {
   197      const UnlockGuard<Lock> unlock(lock);
>  198      callback();
   199  }

#28 0x000055df0ff8d979 in mp::Waiter::wait<mp::ProxyServer<mp::ThreadMap>::makeThread(capnp::CallContext<mp::ThreadMap::MakeThreadParams, mp::ThreadMap::MakeThreadResults>)::$_0::operator()() const::{lambda()#1}>(std::unique_lock<std::mutex>&, mp::ProxyServer<mp::ThreadMap>::makeThread(capnp::CallContext<mp::ThreadMap::MakeThreadParams, mp::ThreadMap::MakeThreadResults>)::$_0::operator()() const::{lambda()#1})::{lambda()#1}::operator()() const (this=<optimized out>) at mp/include/mp/proxy-io.h:296

   284      template <class Predicate>
   285      void wait(std::unique_lock<std::mutex>& lock, Predicate pred)
   286      {
   287          m_cv.wait(lock, [&] {
   288              // Important for this to be "while (m_fn)", not "if (m_fn)" to avoid
   289              // a lost-wakeup bug. A new m_fn and m_cv notification might be sent
   290              // after the fn() call and before the lock.lock() call in this loop
   291              // in the case where a capnp response is sent and a brand new
   292              // request is immediately received.
   293              while (m_fn) {
   294                  auto fn = std::move(*m_fn);
   295                  m_fn.reset();
>  296                  Unlock(lock, fn);
   297              }
   298              const bool done = pred();
   299              return done;
   300          });
   301      }

#29 std::condition_variable::wait<mp::Waiter::wait<mp::ProxyServer<mp::ThreadMap>::makeThread(capnp::CallContext<mp::ThreadMap::MakeThreadParams, mp::ThreadMap::MakeThreadResults>)::$_0::operator()() const::{lambda()#1}>(std::unique_lock<std::mutex>&, mp::ProxyServer<mp::ThreadMap>::makeThread(capnp::CallContext<mp::ThreadMap::MakeThreadParams, mp::ThreadMap::MakeThreadResults>)::$_0::operator()() const::{lambda()#1})::{lambda()#1}>(std::unique_lock<std::mutex>&, mp::Waiter::wait<mp::ProxyServer<mp::ThreadMap>::makeThread(capnp::CallContext<mp::ThreadMap::MakeThreadParams, mp::ThreadMap::MakeThreadResults>)::$_0::operator()() const::{lambda()#1}>(std::unique_lock<std::mutex>&, mp::ProxyServer<mp::ThreadMap>::makeThread(capnp::CallContext<mp::ThreadMap::MakeThreadParams, mp::ThreadMap::MakeThreadResults>)::$_0::operator()() const::{lambda()#1})::{lambda()#1}) (this=0x7f47f8000dd8, __lock=..., __p=...) at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/condition_variable:104
#30 mp::Waiter::wait<mp::ProxyServer<mp::ThreadMap>::makeThread(capnp::CallContext<mp::ThreadMap::MakeThreadParams, mp::ThreadMap::MakeThreadResults>)::$_0::operator()() const::{lambda()#1}>(std::unique_lock<std::mutex>&, mp::ProxyServer<mp::ThreadMap>::makeThread(capnp::CallContext<mp::ThreadMap::MakeThreadParams, mp::ThreadMap::MakeThreadResults>)::$_0::operator()() const::{lambda()#1}) (this=0x7f47f8000db0, lock=..., pred=...) at mp/include/mp/proxy-io.h:287

>  287          m_cv.wait(lock, [&] {

#31 mp::ProxyServer<mp::ThreadMap>::makeThread(capnp::CallContext<mp::ThreadMap::MakeThreadParams, mp::ThreadMap::MakeThreadResults>)::$_0::operator()() const (this=<optimized out>) at mp/src/mp/proxy.cpp:404

   393  kj::Promise<void> ProxyServer<ThreadMap>::makeThread(MakeThreadContext context)
   394  {
   395      const std::string from = context.getParams().getName();
   396      std::promise<ThreadContext*> thread_context;
   397      std::thread thread([&thread_context, from, this]() {
   398          g_thread_context.thread_name = ThreadName(m_connection.m_loop->m_exe_name) + " (from " + from + ")";
   399          g_thread_context.waiter = std::make_unique<Waiter>();
   400          thread_context.set_value(&g_thread_context);
   401          std::unique_lock<std::mutex> lock(g_thread_context.waiter->m_mutex);
   402          // Wait for shutdown signal from ProxyServer<Thread> destructor (signal
   403          // is just waiter getting set to null.)
>  404          g_thread_context.waiter->wait(lock, [] { return !g_thread_context.waiter; });
   405      });
   406      auto thread_server = kj::heap<ProxyServer<Thread>>(*thread_context.get_future().get(), std::move(thread));
   407      auto thread_client = m_connection.m_threads.add(kj::mv(thread_server));
   408      context.getResults().setResult(kj::mv(thread_client));
   409      return kj::READY_NOW;
   410  }

#32 std::__invoke_impl<void, mp::ProxyServer<mp::ThreadMap>::makeThread(capnp::CallContext<mp::ThreadMap::MakeThreadParams, mp::ThreadMap::MakeThreadResults>)::$_0>(std::__invoke_other, mp::ProxyServer<mp::ThreadMap>::makeThread(capnp::CallContext<mp::ThreadMap::MakeThreadParams, mp::ThreadMap::MakeThreadResults>)::$_0&&) (__f=...) at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/bits/invoke.h:61
#33 std::__invoke<mp::ProxyServer<mp::ThreadMap>::makeThread(capnp::CallContext<mp::ThreadMap::MakeThreadParams, mp::ThreadMap::MakeThreadResults>)::$_0>(mp::ProxyServer<mp::ThreadMap>::makeThread(capnp::CallContext<mp::ThreadMap::MakeThreadParams, mp::ThreadMap::MakeThreadResults>)::$_0&&) (__fn=...) at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/bits/invoke.h:96
#34 std::thread::_Invoker<std::tuple<mp::ProxyServer<mp::ThreadMap>::makeThread(capnp::CallContext<mp::ThreadMap::MakeThreadParams, mp::ThreadMap::MakeThreadResults>)::$_0> >::_M_invoke<0ul> (this=<optimized out>)
    at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/bits/std_thread.h:301
#35 std::thread::_Invoker<std::tuple<mp::ProxyServer<mp::ThreadMap>::makeThread(capnp::CallContext<mp::ThreadMap::MakeThreadParams, mp::ThreadMap::MakeThreadResults>)::$_0> >::operator() (this=<optimized out>)
    at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/bits/std_thread.h:308
#36 std::thread::_State_impl<std::thread::_Invoker<std::tuple<mp::ProxyServer<mp::ThreadMap>::makeThread(capnp::CallContext<mp::ThreadMap::MakeThreadParams, mp::ThreadMap::MakeThreadResults>)::$_0> > >::_M_run (this=<optimized out>)
    at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/bits/std_thread.h:253
#37 0x00007f48054ed064 in execute_native_thread_routine () from /nix/store/7c0v0kbrrdc2cqgisi78jdqxn73n3401-gcc-14.2.1.20250322-lib/lib/libstdc++.so.6
#38 0x00007f4805097e63 in start_thread () from /nix/store/cg9s562sa33k78m63njfn1rw47dp9z0i-glibc-2.40-66/lib/libc.so.6
#39 0x00007f480511bdbc in __clone3 () from /nix/store/cg9s562sa33k78m63njfn1rw47dp9z0i-glibc-2.40-66/lib/libc.so.6

thread 3

#0  0x00007f480509463f in __lll_lock_wait () from /nix/store/cg9s562sa33k78m63njfn1rw47dp9z0i-glibc-2.40-66/lib/libc.so.6
#1  0x00007f480509b392 in pthread_mutex_lock@@GLIBC_2.2.5 () from /nix/store/cg9s562sa33k78m63njfn1rw47dp9z0i-glibc-2.40-66/lib/libc.so.6
#2  0x000055df0ff8d324 in __gthread_mutex_lock (__mutex=0x7f47f8000db0) at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/x86_64-unknown-linux-gnu/bits/gthr-default.h:762
#3  std::mutex::lock (this=0x7f47f8000db0) at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/bits/std_mutex.h:113
#4  std::unique_lock<std::mutex>::lock (this=<optimized out>) at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/bits/unique_lock.h:147
#5  std::unique_lock<std::mutex>::unique_lock (__m=..., this=<optimized out>) at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/bits/unique_lock.h:73
#6  mp::SetThread(std::map<mp::Connection*, mp::ProxyClient<mp::Thread>, std::less<mp::Connection*>, std::allocator<std::pair<mp::Connection* const, mp::ProxyClient<mp::Thread> > > >&, std::mutex&, mp::Connection*, std::function<mp::Thread::Client ()> const&)::$_0::operator()() const (
    this=0x7f47f8001170) at mp/src/mp/proxy.cpp:326

   308  std::tuple<ConnThread, bool> SetThread(ConnThreads& threads, std::mutex& mutex, Connection* connection, const std::function<Thread::Client()>& make_thread)
   309  {
   310      const std::unique_lock<std::mutex> lock(mutex);
   311      auto thread = threads.find(connection);
   312      if (thread != threads.end()) return {thread, false};
   313      thread = threads.emplace(
   314          std::piecewise_construct, std::forward_as_tuple(connection),
   315          std::forward_as_tuple(make_thread(), connection, /* destroy_connection= */ false)).first;
   316      thread->second.setDisconnectCallback([&threads, &mutex, thread] {
   317          // Note: it is safe to use the `thread` iterator in this cleanup
   318          // function, because the iterator would only be invalid if the map entry
   319          // was removed, and if the map entry is removed the ProxyClient<Thread>
   320          // destructor unregisters the cleanup.
   321
   322          // Connection is being destroyed before thread client is, so reset
   323          // thread client m_disconnect_cb member so thread client destructor does not
   324          // try to unregister this callback after connection is destroyed.
   325          // Remove connection pointer about to be destroyed from the map
>  326          const std::unique_lock<std::mutex> lock(mutex);
   327          thread->second.m_disconnect_cb.reset();
   328          threads.erase(thread);
   329      });
   330      return {thread, true};
   331  }

#7  std::__invoke_impl<void, mp::SetThread(std::map<mp::Connection*, mp::ProxyClient<mp::Thread>, std::less<mp::Connection*>, std::allocator<std::pair<mp::Connection* const, mp::ProxyClient<mp::Thread> > > >&, std::mutex&, mp::Connection*, std::function<mp::Thread::Client ()> const&)::$_0&>(std::__invoke_other, mp::SetThread(std::map<mp::Connection*, mp::ProxyClient<mp::Thread>, std::less<mp::Connection*>, std::allocator<std::pair<mp::Connection* const, mp::ProxyClient<mp::Thread> > > >&, std::mutex&, mp::Connection*, std::function<mp::Thread::Client ()> const&)::$_0&)
    (__f=...) at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/bits/invoke.h:61
#8  std::__invoke_r<void, mp::SetThread(std::map<mp::Connection*, mp::ProxyClient<mp::Thread>, std::less<mp::Connection*>, std::allocator<std::pair<mp::Connection* const, mp::ProxyClient<mp::Thread> > > >&, std::mutex&, mp::Connection*, std::function<mp::Thread::Client ()> const&)::$_0&>(mp::SetThread(std::map<mp::Connection*, mp::ProxyClient<mp::Thread>, std::less<mp::Connection*>, std::allocator<std::pair<mp::Connection* const, mp::ProxyClient<mp::Thread> > > >&, std::mutex&, mp::Connection*, std::function<mp::Thread::Client ()> const&)::$_0&) (__fn=...)
    at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/bits/invoke.h:111
#9  std::_Function_handler<void(), mp::SetThread(std::map<mp::Connection*, mp::ProxyClient<mp::Thread>, std::less<mp::Connection*>, std::allocator<std::pair<mp::Connection* const, mp::ProxyClient<mp::Thread> > > >&, std::mutex&, mp::Connection*, std::function<mp::Thread::Client()> const&)::$_0>::_M_invoke (__functor=...) at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/bits/std_function.h:290
#10 0x000055df0ff8de15 in std::function<void()>::operator() (this=0x7f47f8000fb0) at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/bits/std_function.h:591
#11 mp::Unlock<mp::Lock, std::function<void()>&> (lock=..., callback=...) at mp/include/mp/util.h:198

   194  template <typename Lock, typename Callback>
   195  void Unlock(Lock& lock, Callback&& callback)
   196  {
   197      const UnlockGuard<Lock> unlock(lock);
>  198      callback();
   199  }

#12 0x000055df0ff8a70f in mp::Connection::~Connection (this=0x7f480002a810) at mp/src/mp/proxy.cpp:139

   135      Lock lock{m_loop->m_mutex};
   136      while (!m_sync_cleanup_fns.empty()) {
   137          CleanupList fn;
   138          fn.splice(fn.begin(), m_sync_cleanup_fns, m_sync_cleanup_fns.begin());
>  139          Unlock(lock, fn.front());
   140      }
   141  }

#13 0x000055df0feeedd3 in std::default_delete<mp::Connection>::operator() (__ptr=0x7f480002a810, this=<optimized out>) at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/bits/unique_ptr.h:93
#14 std::__uniq_ptr_impl<mp::Connection, std::default_delete<mp::Connection> >::reset (this=0xfffffffffffffe00, __p=0x0) at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/bits/unique_ptr.h:205
#15 std::unique_ptr<mp::Connection, std::default_delete<mp::Connection> >::reset (this=0xfffffffffffffe00, __p=0x0) at /nix/store/9ds850ifd4jwcccpp3v14818kk74ldf2-gcc-14.2.1.20250322/include/c++/14.2.1.20250322/bits/unique_ptr.h:504
#16 mp::test::TestSetup::TestSetup(bool)::{lambda()#1}::operator()() const::{lambda()#2}::operator()() const (this=<optimized out>) at mp/test/mp/test/test.cpp:79

    76                server_disconnect = [&] { loop.sync([&] { server_connection.reset(); }); };
    77                // Set handler to destroy the server when the client disconnects. This
    78                // is ignored if server_disconnect() is called instead.
>   79                server_connection->onDisconnect([&] { server_connection.reset(); });
    80
    81                auto client_connection = std::make_unique<Connection>(loop, kj::mv(pipe.ends[1]));

#17 kj::_::MaybeVoidCaller<kj::_::Void, kj::_::Void>::apply<mp::test::TestSetup::TestSetup(bool)::{lambda()#1}::operator()() const::{lambda()#2}>(mp::test::TestSetup::TestSetup(bool)::{lambda()#1}::operator()() const::{lambda()#2}&, kj::_::Void&&) (func=..., in=...)
    at /nix/store/46kiq9naswgbqfc14kc9nxcbgd0rv0m2-capnproto-1.1.0/include/kj/async-prelude.h:195
#18 kj::_::TransformPromiseNode<kj::_::Void, kj::_::Void, mp::test::TestSetup::TestSetup(bool)::{lambda()#1}::operator()() const::{lambda()#2}, kj::_::PropagateException>::getImpl(kj::_::ExceptionOrValue&) (this=<optimized out>, output=...)
    at /nix/store/46kiq9naswgbqfc14kc9nxcbgd0rv0m2-capnproto-1.1.0/include/kj/async-inl.h:739
#19 0x00007f48057c3a3e in kj::_::TransformPromiseNodeBase::get(kj::_::ExceptionOrValue&) () from /nix/store/46kiq9naswgbqfc14kc9nxcbgd0rv0m2-capnproto-1.1.0/lib/libkj-async.so.1.1.0
#20 0x00007f48057d102a in kj::TaskSet::Task::fire() () from /nix/store/46kiq9naswgbqfc14kc9nxcbgd0rv0m2-capnproto-1.1.0/lib/libkj-async.so.1.1.0
#21 0x00007f48057d143d in non-virtual thunk to kj::TaskSet::Task::fire() () from /nix/store/46kiq9naswgbqfc14kc9nxcbgd0rv0m2-capnproto-1.1.0/lib/libkj-async.so.1.1.0
#22 0x00007f48057c8cb2 in kj::_::waitImpl(kj::Own<kj::_::PromiseNode, kj::_::PromiseDisposer>&&, kj::_::ExceptionOrValue&, kj::WaitScope&, kj::SourceLocation)::$_2::operator()() const () from /nix/store/46kiq9naswgbqfc14kc9nxcbgd0rv0m2-capnproto-1.1.0/lib/libkj-async.so.1.1.0
#23 0x00007f48057c1d68 in kj::_::waitImpl(kj::Own<kj::_::PromiseNode, kj::_::PromiseDisposer>&&, kj::_::ExceptionOrValue&, kj::WaitScope&, kj::SourceLocation) () from /nix/store/46kiq9naswgbqfc14kc9nxcbgd0rv0m2-capnproto-1.1.0/lib/libkj-async.so.1.1.0
#24 0x000055df0ff8e44c in kj::Promise<unsigned long>::wait (this=<optimized out>, waitScope=..., location=...) at /nix/store/46kiq9naswgbqfc14kc9nxcbgd0rv0m2-capnproto-1.1.0/include/kj/async-inl.h:1357
#25 0x000055df0ff8b722 in mp::EventLoop::loop (this=0x7f4804ffec90) at mp/src/mp/proxy.cpp:231

>  231          const size_t read_bytes = wait_stream->read(&buffer, 0, 1).wait(m_io_context.waitScope);


#26 0x000055df0feed376 in mp::test::TestSetup::TestSetup(bool)::{lambda()#1}::operator()() const (this=0x7f480002d1e8) at mp/test/mp/test/test.cpp:92

    91                client_promise.set_value(std::move(client_proxy));
>   92                loop.loop();

#27 0x00007f48054ed064 in execute_native_thread_routine () from /nix/store/7c0v0kbrrdc2cqgisi78jdqxn73n3401-gcc-14.2.1.20250322-lib/lib/libstdc++.so.6
#28 0x00007f4805097e63 in start_thread () from /nix/store/cg9s562sa33k78m63njfn1rw47dp9z0i-glibc-2.40-66/lib/libc.so.6
#29 0x00007f480511bdbc in __clone3 () from /nix/store/cg9s562sa33k78m63njfn1rw47dp9z0i-glibc-2.40-66/lib/libc.so.6

@ryanofsky
Collaborator Author

Updated fe1cbed -> 1d6d854 (pr/context.1 -> pr/context.2, compare) fixing an IWYU error and making minor cleanups

Contributor

@Eunovo Eunovo left a comment


Tested ACK 1d6d854

As I understand it, there are two problems that are solved by this PR:

  • a kind of use-after-free bug where the disconnect_cb is deleted in one thread while it is being executed in another
  • a deadlock caused by two threads competing for the same waiter mutex

I left comments with more details under this review.

@ryanofsky ryanofsky force-pushed the pr/context branch 2 times, most recently from 564e353 to f219f1b on September 5, 2025 at 20:39
Collaborator Author

@ryanofsky ryanofsky left a comment


Thanks for the review! Updated with suggested changes

re: #201 (review)

As I understand it, there are two problems that are solved by this PR:

Yes, basically there was the deadlock bug, which happened most commonly, plus various race bugs that could occur because there was previously no real synchronization between the disconnect handler and the type-context.h cleanup code that runs after asynchronous IPC calls.


Updated 1d6d854 -> 564e353 (pr/context.2 -> pr/context.3, compare) updating comments and adding an assert.

Updated 564e353 -> f219f1b (pr/context.3 -> pr/context.4, compare) fixing LLM linter issues

@TheCharlatan
Collaborator

Can confirm that this fixes the observed test hang.

@Sjors
Member

Sjors commented Sep 10, 2025

utACK f219f1b

I left some suggestions to make the intermediate commits clearer, which might benefit other reviewers.

@DrahtBot DrahtBot requested a review from Eunovo September 10, 2025 11:51
ryanofsky and others added 6 commits September 10, 2025 15:56
Use -Wthread-safety not -Wthread-safety-analysis in llvm and sanitize jobs.
-Wthread-safety is a more general flag which adds additional checks outside the
core thread safety analysis.

Credit to hodlinator for noticing the Bitcoin Core build being stricter about
thread safety checks than the libmultiprocess build here:
bitcoin-core#201 (comment)
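
As a generic, standalone illustration (not libmultiprocess code) of the kind of check this flag enables: clang warns when a member annotated as guarded is accessed without holding its mutex. libmultiprocess wraps such attributes in macros like the MP_REQUIRES(m_mutex) visible in the stack traces above; raw clang attributes are used here.

    // demo.cpp -- check with: clang++ -std=c++17 -fsyntax-only -Wthread-safety demo.cpp
    class __attribute__((capability("mutex"))) Mutex
    {
    public:
        void lock() __attribute__((acquire_capability()));
        void unlock() __attribute__((release_capability()));
    };

    class Counter
    {
        Mutex m_mutex;
        int m_value __attribute__((guarded_by(m_mutex))) = 0;

    public:
        void Increment()
        {
            // warning: writing variable 'm_value' requires holding mutex 'm_mutex' exclusively
            ++m_value;
        }
    };
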
This is just a refactoring; no behavior is changing. This is only adding
annotations and switching from unannotated to annotated types.

In SetThread, the first two parameters are combined into a single GuardedRef
parameter to avoid -Wthread-safety-reference warnings at call sites like
"warning: passing variable 'callback_threads' by reference requires holding
mutex 'thread_context.waiter->m_mutex'"
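
A minimal sketch of what a GuardedRef-style helper could look like (an assumed shape for illustration; the real definition in mp/include/mp/util.h may differ):

    #include <mutex>

    // Sketch only: carry a reference together with the mutex that guards it, so
    // call sites pass one bundled argument instead of a separately guarded reference.
    template <typename T>
    struct GuardedRef
    {
        std::mutex& mutex;
        T& ref;
    };

    // Hypothetical helper showing the intended usage pattern.
    template <typename T, typename Fn>
    void WithLock(GuardedRef<T> guarded, Fn&& fn)
    {
        const std::lock_guard<std::mutex> lock(guarded.mutex);
        fn(guarded.ref);
    }

Constructing it as GuardedRef{mutex, map} relies on C++20 aggregate class template argument deduction, which becomes relevant to the OpenBSD build failure discussed further down.
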
This is just a refactoring changing the ConnThreads data type. The optional
value is not actually left in an unset state until the next commit.
This commit is easiest to review ignoring whitespace (git diff -w). This is a
minor change in behavior, but the only change is shortening the duration that
threads.mutex is locked while inserting a new entry in the threads.ref map. The
lock is now only held while the entry is created and is released while the
ProxyClient<Thread> object is initialized.

This change doesn't really fix any problems but it simplifies the next commit
which deals with race conditions and deadlocks in this code, so it has been
split out.
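
Roughly, the narrowed locking described above looks like the following sketch (hypothetical simplification; the real SetThread code differs in detail):

    // Hold threads.mutex only while the map slot is reserved...
    std::unique_lock<std::mutex> lock(threads.mutex);
    auto [it, inserted] = threads.ref.try_emplace(connection); // mapped value: empty std::optional
    lock.unlock();

    // ...and construct the ProxyClient<Thread> after the lock is released, since
    // building it may itself need to interact with the event loop.
    if (inserted) it->second.emplace(make_thread(), connection, /*destroy_connection=*/false);
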
…returning

This bug is currently causing mptest "disconnecting and blocking" test to
occasionally hang as reported by maflcko in
bitcoin/bitcoin#33244.

The bug was actually first reported by Sjors in
Sjors/bitcoin#90 (comment) and there are
more details about it in
bitcoin-core#189.

The bug is caused by the "disconnecting and blocking" test triggering a
disconnect right before a server IPC call returns. This results in a race
between the IPC server thread and the onDisconnect handler in the event loop
thread both trying to destroy the server's request_threads ProxyClient<Thread>
object when the IPC call is done.

There was a lack of synchronization in this case, fixed here by adding
loop->sync() in a few places. Specifically, the fixes were to:

- Always call SetThread on the event loop thread using the loop->sync() method,
  to prevent a race between the ProxyClient<Thread> creation code and the
  connection shutdown code if there was an ill-timed disconnect.

- Similarly, in ~ProxyClient<Thread> and type-context.h PassField(), use
  loop->sync() when destroying the thread object, in case a disconnect happens
  at that time.

A few other changes were made in this commit to make the resulting code safer
and simpler, even though they are not technically necessary for the fix:

- In type-context.h PassField(), Waiter::m_mutex is now unlocked while
  destroying ProxyClient<Thread> just to respect EventLoop::m_mutex and
  Waiter::m_mutex lock order and never lock the Waiter first. This is just for
  consistency. There is no actual possibility of a deadlock here due to the
  new sync() call.

- This adds asserts to make sure functions expected to run on the event
  loop thread are only called on that thread (a minimal sketch of this pattern
  follows after this list).

- This inlines the ProxyClient<Thread>::setDisconnectCallback function, just
  because it was a short function only called in a single place.
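
The assertion mentioned in the list above is roughly the following pattern (sketch only; m_thread_id is the EventLoop member already visible in the EventLoop::post() excerpt earlier, while the function shown here is hypothetical):

    void EventLoop::someLoopOnlyFunction() // hypothetical name, for illustration
    {
        // Must be called on the event loop thread; m_thread_id stores that
        // thread's id, as seen in EventLoop::post() above.
        assert(std::this_thread::get_id() == m_thread_id);
        // ...
    }
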
Collaborator Author

@ryanofsky ryanofsky left a comment


Thanks for the reviews! I took Sjors' commits to break the change up into smaller parts and addressed hodlinator's suggestions to add more thread safety annotations and make them stricter. I think both changes should make this PR easier to review.

Updated f219f1b -> b6321bb (pr/context.4 -> pr/context.5, compare) with suggested changes to split commits and improve thread safety annotations. The actual fix is not changed.

Updated b6321bb -> d86823e (pr/context.5 -> pr/context.6, compare) fixing typos

@Sjors
Member

Sjors commented Sep 11, 2025

d86823e looks good once CI is happy...

Thanks for splitting the commits. The new struct GuardedRef introduced in 6d6096a is nice.

When I ran make locally before, it didn't emit any warnings about locks. Given that you only changed the CI configuration, I assume vanilla make still won't. It would be good to have a CMake developer preset that catches as much as possible (different PR of course).

Additionally, maybe one of @DrahtBot's LLMs can look through the CI logs (when all jobs pass) for anything unusual compared to master.

The OpenBSD failure seems real:

In file included from /home/runner/work/libmultiprocess/libmultiprocess/test/mp/test/foo-types.h:15:
/home/runner/work/libmultiprocess/libmultiprocess/include/mp/type-context.h:96:29: error: no viable constructor or deduction guide for deduction of template arguments of 'GuardedRef'
                            GuardedRef{thread_context.waiter->m_mutex, request_threads}, server.m_context.connection,
                            ^
/home/runner/work/libmultiprocess/libmultiprocess/include/mp/util.h:186:8: note: candidate function template not viable: requires 1 argument, but 2 were provided
struct GuardedRef

https://github.com/bitcoin-core/libmultiprocess/actions/runs/17628430545/job/50090729622?pr=201

Copilot seems to think this is due to incomplete C++20 support, and we could pre-empt that by adding target_compile_features(multiprocess PUBLIC cxx_std_20) after the three add_library( statements. Though @purpleKarrot recently pointed out that LLMs tend to be wrong when it comes to CMake.
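
For context, the quoted "no viable constructor or deduction guide" error is the typical symptom of a compiler that lacks C++20 aggregate class template argument deduction. One common portable workaround is an explicit deduction guide, sketched below; this is only a guess at the shape of a fix, assuming GuardedRef is an aggregate holding a mutex reference and a data reference.

    // Explicit deduction guide so GuardedRef{mutex, request_threads} still deduces T
    // on compilers without C++20 aggregate CTAD (e.g. older clang releases).
    template <typename T>
    GuardedRef(std::mutex&, T&) -> GuardedRef<T>;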

@ryanofsky
Collaborator Author

Thanks, the OpenBSD failure is probably just happening because OpenBSD is using an old version of clang. It should be possible to reproduce this locally using the llvm job and the following change in shell.nix:

-  llvm = crossPkgs.llvmPackages_20;
+  llvm = crossPkgs.llvmPackages_16;

But I'm having some kind of network issue and can't download the llvm 16 packages.

You are right that it would be nice to have a preset or something that makes it easier to see thread safety errors locally. FWIW I just have a configuration script that does:

MP_CXX_FLAGS="-Werror -ftemplate-backtrace-limit=0 -Wthread-safety"
CC=clang CXX=clang++ cmake -DCMAKE_INSTALL_PREFIX=$HOME/work/mp/build/prefix -DCapnProto_DEBUG=1 -DCMAKE_BUILD_TYPE=Debug -DCMAKE_CXX_FLAGS="$MP_CXX_FLAGS" -DMULTIPROCESS_RUN_CLANG_TIDY=1 ..

And sometimes I'll add other things like -fsanitize=address to MP_CXX_FLAGS as well.

As far as DrahtBot/CI goes, the llvm job in ci/configs/llvm.bash uses -Werror and -Wthread-safety together, so it will fail if there are any compile-time thread safety issues. So I think there is no risk of compile-time errors slipping through anymore (now that it is using the broader -Wthread-safety instead of the narrower -Wthread-safety-analysis).


Updated d86823e -> 4a269b2 (pr/context.6 -> pr/context.7, compare) with an attempted fix for the OpenBSD failure

@ryanofsky
Collaborator Author

ryanofsky commented Sep 11, 2025

The OpenBSD / clang-16 fix seems to have worked!

@Sjors
Member

Sjors commented Sep 11, 2025

ACK 4a269b2

@ryanofsky
Collaborator Author

I've made progress on implementing a more comprehensive disconnect test that can detect bugs like bitcoin/bitcoin#33244 and #189 more reliably, instead of only intermittently when the test is run thousands of times. If I can finish it soon, I'll add it to this PR; otherwise it can be a followup.

The test is implemented in commit ea4976f (tag) and I think I'm maybe 70% finished with the implementation. The test passes on master because not all of the checks are enabled yet and many aren't implemented (these are noted in FIXME comments in the code). The commit combines the 4 existing disconnect test cases into a more general test and follows up on earlier ideas and efforts from:

@Sjors
Member

Sjors commented Sep 13, 2025

@ryanofsky that commit also touches non-test code, so I think it would be better as a standalone PR.

@ryanofsky
Collaborator Author

@ryanofsky that commit also touches non-test code, so I think it would be better as a standalone PR.

That's reasonable. But to be fair, the only changes to non-test code are adding 4 lines like:

if (connection->m_loop->m_signal) connection->m_loop->m_signal(SignalCallSetup{});

so the test is able to disconnect at specific points. m_signal is always null unless the test sets it, so it should be pretty clear that these lines only affect tests. In practice I'd also split up the change into non-test and test-only commits. And I'd add hasSignal()/sendSignal() helpers so the signal mechanism wouldn't be exposed and this would look more like:

if (connection->m_loop->hasSignal<CallSetup>()) connection->m_loop->sendSignal<CallSetup>();
