dumbbell
Collaborator

Why

Links are started by the plugins but placed under the rabbit supervision tree. Unfortunately, the federation plugins' own supervision trees are empty.

Links are stopped by a boot step executed by rabbit, as a consequence of unregistering the plugins' parameters.

Unfortunately, links can also be terminated if their channel, and implicitly the connection, stops. This happens when the amqp_client application stops.

We end up with a race here:

  • Because the federation plugins' supervision trees are empty and the application stop functions merely stop the pg group (which doesn't terminate the group members), nothing waits for the links to stop. Therefore, rabbit can stop `amqp_client`, which is a dependency of the federation plugins. As a result, the links' underlying channels and connections are stopped.

  • rabbit unregisters the federation parameters, terminating the links. The exchange link's terminate/2 function needs the channel to delete the remote queue, but the channel and the underlying connection might already be gone.

When the link loses this race, the failure is only logged, as a badmatch exception:

[error] <0.884.0> Federation link could not create a disposable (one-off) channel due to an error error: {badmatch,
[error] <0.884.0>                                                                                         {error,
[error] <0.884.0>                                                                                          {noproc,
[error] <0.884.0>                                                                                           {gen_server,
[error] <0.884.0>                                                                                            call,
[error] <0.884.0>                                                                                            [<0.911.0>,
[error] <0.884.0>                                                                                             {command,
[error] <0.884.0>                                                                                              {open_channel,
[error] <0.884.0>                                                                                               none,
[error] <0.884.0>                                                                                               {amqp_selective_consumer,
[error] <0.884.0>                                                                                                []}}},
[error] <0.884.0>                                                                                             130000]}}}}

How

The solution is to make sure links are stopped when the plugins themselves stop.

rabbit_federation_pg:stop_scope/1 is expanded to stop all members of all groups in this scope, before terminating the pg scope itself. The new code waits for the stopped processes to exit.
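
The "stop every member, then wait" step can be sketched roughly as below. This is not the code from this pull request: the module name, the stop_members/2 function, the timeout argument and the use of a shutdown exit signal are all assumptions made for illustration. It also assumes the pg scope is still running, and it leaves out terminating the scope process itself.

```erlang
-module(pg_scope_stop_sketch).
-export([stop_members/2]).

%% Hypothetical helper, not the code from this pull request.
%% Asks every local member of every group in Scope to stop, then waits
%% up to Timeout milliseconds for each of them to exit.
-spec stop_members(atom(), timeout()) -> ok | {error, {timeout, pid()}}.
stop_members(Scope, Timeout) ->
    Members = lists:usort(
                [Pid || Group <- pg:which_groups(Scope),
                        Pid <- pg:get_local_members(Scope, Group)]),
    %% Monitor before sending the exit signal so that no exit is missed.
    Monitors = [{erlang:monitor(process, Pid), Pid} || Pid <- Members],
    %% Assumption: members trap exits and stop cleanly on `shutdown`.
    lists:foreach(fun({_Ref, Pid}) -> exit(Pid, shutdown) end, Monitors),
    wait_for_exits(Monitors, Timeout).

wait_for_exits([], _Timeout) ->
    ok;
wait_for_exits([{Ref, Pid} | Rest] = Pending, Timeout) ->
    receive
        {'DOWN', Ref, process, Pid, _Reason} ->
            wait_for_exits(Rest, Timeout)
    after Timeout ->
        %% Give up: drop the remaining monitors and report the straggler.
        lists:foreach(fun({R, _}) -> erlang:demonitor(R, [flush]) end,
                      Pending),
        {error, {timeout, Pid}}
    end.
```

Monitoring before signalling means a 'DOWN' message is delivered even for a member that has already exited, so the wait cannot hang on a process that died in between.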

We have to handle the EXIT signal in the link processes and change their restart strategy in their parent supervisor from permanent to transient. This ensures they are restarted only if they crash. It also avoids an error log message for each stopped link.
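
As a minimal sketch of that combination (not the actual link module: link_sketch, its child spec values and its empty state are hypothetical), a gen_server that traps exits and is declared transient could look like this:

```erlang
-module(link_sketch).
-behaviour(gen_server).

-export([start_link/0, child_spec/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2, terminate/2]).

%% Hypothetical child spec. `transient` means the supervisor restarts the
%% child only when it exits with a reason other than normal, shutdown or
%% {shutdown, Term}.
child_spec() ->
    #{id => ?MODULE,
      start => {?MODULE, start_link, []},
      restart => transient,
      shutdown => 5000,
      type => worker}.

start_link() ->
    gen_server:start_link(?MODULE, [], []).

init([]) ->
    %% Exit signals from processes other than the parent now arrive as
    %% {'EXIT', From, Reason} messages in handle_info/2.
    process_flag(trap_exit, true),
    {ok, #{}}.

handle_call(_Request, _From, State) ->
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% Assumption: the stopping code sends a `shutdown` exit signal.
handle_info({'EXIT', _From, shutdown}, State) ->
    {stop, shutdown, State};
handle_info(_Info, State) ->
    {noreply, State}.

terminate(_Reason, _State) ->
    %% Cleanup that still needs the channel would go here.
    ok.
```

Stopping with reason shutdown counts as a normal termination for a transient child, so the supervisor neither restarts it nor reports an error.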

@dumbbell dumbbell requested review from dcorbacho and mkuratczyk June 10, 2025 12:25
@dumbbell dumbbell self-assigned this Jun 10, 2025
@dumbbell dumbbell force-pushed the terminate-links-when-federation-plugins-stop branch from bdf095c to 033ab45 Compare June 11, 2025 06:21
@dumbbell dumbbell marked this pull request as ready for review June 11, 2025 07:17
@dumbbell dumbbell merged commit f84828e into main Jun 11, 2025
564 of 565 checks passed
@dumbbell dumbbell deleted the terminate-links-when-federation-plugins-stop branch June 11, 2025 07:17
@michaelklishin michaelklishin added this to the 4.2.0 milestone Jun 11, 2025