Add unhandled supervision error hook to crash the client (#1637)
Summary:
Part of #1209
Make two variants of the "actor_states_monitor" watchdog: one for Owned ActorMesh, which sends
a message to the owner if one exists, and one for Ref ActorMesh, which does not. This way, Ref
actor meshes generate liveness exceptions without propagation, and Owned actor meshes send a
SupervisionFailureMessage to their owning actor. Since every Owned mesh does this, an event that
is not handled will always reach the client.
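The propagation rule above can be illustrated with a toy model. This is only a sketch and not Monarch's implementation: `Owner`, `RefMonitor`, `monitor_owned`, and the `handles_failures` flag are hypothetical names invented for illustration, and failure events are plain strings.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Owner:
    """Hypothetical stand-in for an owning actor; the root owner plays the client."""
    name: str
    parent: Optional["Owner"] = None
    handles_failures: bool = False
    received: List[str] = field(default_factory=list)

    def on_supervision_failure(self, event: str) -> str:
        """Record the event and return the name of the owner where it stopped."""
        self.received.append(event)
        if self.handles_failures or self.parent is None:
            # Either handled here, or we are the root (client) and it is unhandled.
            return self.name
        # Not handled: pass it on to this owner's own owner.
        return self.parent.on_supervision_failure(event)


def monitor_owned(owner: Optional[Owner], event: str) -> Optional[str]:
    """Owned variant: send the failure to the owner if one exists."""
    if owner is None:
        return None
    return owner.on_supervision_failure(event)


@dataclass
class RefMonitor:
    """Ref variant: remember the failure locally and never message anyone."""
    failure: Optional[str] = None

    def on_failure(self, event: str) -> None:
        self.failure = event

    def check_liveness(self) -> None:
        # Users of the ref see an exception; nothing propagates to an owner.
        if self.failure is not None:
            raise RuntimeError(f"mesh is not alive: {self.failure}")
```

In this model, a failure reported through `monitor_owned` walks from owner to owner until some owner handles it or it arrives at the root, while a `RefMonitor` only raises when its liveness is checked.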
Add a `monarch.actor.unhandled_fault_hook` function, which is called when an unhandled supervision
error reaches the client. It takes one argument, a MeshFailure object, and is expected
to halt the process. By default it calls `sys.exit(1)` after logging the error.
Raising an exception is not sufficient, as the hook is called outside of a Python thread (by a tokio task).
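As a rough illustration of the described default behavior (log the error, then exit with status 1), a hook of that shape might look like the sketch below. The `MeshFailure` class here is a hypothetical stand-in with a made-up `description` field, since the real type's fields are not described in this commit, and `example_fault_hook` is an illustrative name. How a custom hook is installed in Monarch is not shown.

```python
import logging
import sys
from dataclasses import dataclass

logger = logging.getLogger("fault_hook_sketch")


@dataclass
class MeshFailure:
    """Hypothetical stand-in; not the real Monarch MeshFailure type."""
    description: str


def example_fault_hook(failure: MeshFailure) -> None:
    """Log the unhandled supervision error, then halt the process."""
    logger.error("Unhandled supervision error: %s", failure)
    # Exit explicitly rather than raise an ordinary exception: per the commit
    # message, raising is not sufficient because of where the hook is called.
    sys.exit(1)
```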
Note that propagation will not happen if an ActorMesh and all endpoints are unreachable and garbage
collected, but the actors are still running something that generates an error. We'll want to fix this
eventually.
Reviewed By: mariusae
Differential Revision: D85163744