158 changes: 158 additions & 0 deletions proposals/4349-causal-barriers-and-enforcement.md
Member

Implementation requirements:

  • General sense of agreement among protocol developers towards adopting definitions
    • Metric: FCP starts successfully (likely ignoring the checkbox for "has implementation").

@@ -0,0 +1,158 @@
# MSC4349: Causal barriers and enforcement

_Note: This MSC exists because
[MSC4345](https://github.com/matrix-org/matrix-spec-proposals/pull/4345)
described an informal version of the deferred authorization causal
enforcement model discussed in this MSC, which uncovered the problem
of causal barrier conflicts._

Matrix currently uses _causal barriers_, enforced locally by soft
failure, to exclude concurrent events. Causal barriers work in
Matrix because `prev_events` and `auth_events` specify the _causal
predecessors_ of each event, so with a full synchronisation it is
possible to determine which events are concurrent. A _causal barrier_
is therefore an event which excludes all concurrent events from
_consideration_, where consideration usually means what is shown to an
end user.
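
As a rough illustration of how concurrency can be determined from causal predecessors, the sketch below treats the DAG as a plain mapping from event ID to the IDs of its direct predecessors. The `Dag` type, the function names, and the example event IDs are all hypothetical, and the sketch assumes the DAG is fully synchronised (every referenced event is present locally).

```python
from typing import Dict, List, Set

# Hypothetical, simplified model: event ID -> IDs of its direct causal
# predecessors (the union of prev_events and auth_events).
Dag = Dict[str, List[str]]


def causal_predecessors(dag: Dag, event_id: str) -> Set[str]:
    """Return every event reachable from event_id, excluding itself."""
    seen: Set[str] = set()
    stack = list(dag[event_id])
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(dag[current])
    return seen


def are_concurrent(dag: Dag, a: str, b: str) -> bool:
    """Two distinct events are concurrent when neither precedes the other."""
    if a == b:
        return False
    return (a not in causal_predecessors(dag, b)
            and b not in causal_predecessors(dag, a))


# Yellow's message and the ban on yellow both build on yellow's join,
# but neither references the other.
dag: Dag = {
    "create": [],
    "join_yellow": ["create"],
    "msg_yellow": ["join_yellow"],
    "ban_yellow": ["join_yellow"],
}

print(are_concurrent(dag, "msg_yellow", "ban_yellow"))   # concurrent
print(are_concurrent(dag, "join_yellow", "ban_yellow"))  # join precedes ban
```

With a partial DAG the lookup `dag[current]` would fail, which is exactly the situation in which a server cannot yet decide concurrency reliably.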

The `m.room.member` event with a _membership_ of `ban` is a causal
barrier that prevents the banned sender from adding events that are
concurrent with the ban, that is, events that build on the ban event's
causal predecessors without acknowledging the ban itself. This is
enforced locally by each server through soft failure.

In distributed systems, _vector clocks_ are used to reason about
causal order. Causal history is already encoded in Matrix PDUs
through the `prev_events` and `auth_events` of each PDU, so a
complete DAG provides us with an implicit vector clock.

Further, once a server has all of the causal predecessors for a given
event, a precise vector clock frontier for the sender of the event can
be constructed by walking the DAG.
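
One possible reading of "walking the DAG" is sketched below: starting from an event, visit every causal predecessor once and count events per sender. The `Event` class, its `predecessors` field, and `vector_clock` are hypothetical names for illustration. Representing the clock as per-sender counts is also an assumption, and it is only meaningful if each sender's own events form a single chain.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Event:
    event_id: str
    sender: str
    # prev_events and auth_events merged into one list for brevity.
    predecessors: List[str] = field(default_factory=list)


def vector_clock(events: Dict[str, Event], event_id: str) -> Dict[str, int]:
    """Count, per sender, the events in the causal history of event_id
    (the event itself included)."""
    seen = set()
    stack = [event_id]
    clock: Counter = Counter()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        event = events[current]  # a KeyError here means the DAG is incomplete
        clock[event.sender] += 1
        stack.extend(event.predecessors)
    return dict(clock)


events = {
    e.event_id: e
    for e in [
        Event("create", "@blue"),
        Event("join_yellow", "@yellow", ["create"]),
        Event("msg_yellow", "@yellow", ["join_yellow"]),
        Event("ban_yellow", "@blue", ["join_yellow"]),
    ]
}

# The ban has only seen one of yellow's two events.
print(vector_clock(events, "ban_yellow"))
print(vector_clock(events, "msg_yellow"))
```

In this example the ban's clock records a single event from `@yellow`, so `msg_yellow` lies outside the history the ban acknowledges.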

## Soft failure as causal enforcement

Soft failure provides local enforcement of _causal barriers_ that is
necessary while knowledge of causal predecessors is incomplete. Soft
failure allows servers to make an immediate decision about concurrent
events without a full synchronisation of the DAG.

We identify two major problems with soft failure:

1. Unbounded DAG growth: Soft-failed events are still accepted into
the DAG, and there is no in-DAG solution to prevent this.

2. Hidden causal barrier conflicts: Soft-failure hides cases where
causal barriers conflict with a sender's previously acknowledged
history.

## Hidden causal barrier conflicts

The problem of hidden causal barrier conflicts does not seem to be
widely known or explored. To illustrate the problem, imagine that
Matrix had a different enforcement model for causal barriers. We name
this causal enforcement system _deferred authorization_.

### Deferred authorization as causal enforcement

When a server receives an event which is concurrent with a _causal
barrier_, the receiving server should determine whether it has complete
knowledge of all of the _causal barrier_ event's causal predecessors.

If knowledge is incomplete, the server makes an immediate decision to
soft fail the concurrent event (which is what happens currently in
Matrix).
Comment on lines +62 to +64
Contributor Author

@Gnuxie Gnuxie Sep 10, 2025

There's more nuance to this: the decision is made by checking whether the event is authorizable with respect to the current authorization state, rather than the state cited in `auth_events`.


However, once the server has all of the _causal predecessors_ for a
_causal barrier_, it can in fact revisit the decision and make a final
decision.

To do this, the server consults the predecessors of the causal barrier
event to form a vector clock frontier for the sender of the
conflicting event.

The server can then either accept events that were soft-failed or
reject them, effectively resolving the soft failure.
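
The control flow described above can be sketched as follows. This is only an outline under stated assumptions: the `Decision` enum, `has_complete_history`, and `decide` are hypothetical names, and the rule that makes the final decision is deliberately left as a caller-supplied `violates_barrier` predicate, because this MSC does not define it.

```python
from enum import Enum
from typing import Callable, Dict, List

# Hypothetical model: event ID -> IDs of its direct causal predecessors.
Dag = Dict[str, List[str]]


class Decision(Enum):
    SOFT_FAILED = "soft_failed"  # provisional: knowledge is incomplete
    ACCEPTED = "accepted"        # final
    REJECTED = "rejected"        # final


def has_complete_history(dag: Dag, event_id: str) -> bool:
    """True when event_id and all of its causal predecessors are held locally."""
    seen = set()
    stack = [event_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        if current not in dag:
            return False
        seen.add(current)
        stack.extend(dag[current])
    return True


def decide(
    dag: Dag,
    barrier_id: str,
    concurrent_id: str,
    violates_barrier: Callable[[Dag, str, str], bool],
) -> Decision:
    """Soft fail while the barrier's history is incomplete, then revisit."""
    if not has_complete_history(dag, barrier_id):
        return Decision.SOFT_FAILED
    if violates_barrier(dag, barrier_id, concurrent_id):
        return Decision.REJECTED
    return Decision.ACCEPTED


# The ban references "join_yellow", which this server has not fetched yet.
partial: Dag = {"ban_yellow": ["join_yellow"], "msg_yellow": ["join_yellow"]}
complete: Dag = {**partial, "join_yellow": ["create"], "create": []}

always_violates = lambda dag, barrier, event: True  # placeholder predicate
print(decide(partial, "ban_yellow", "msg_yellow", always_violates))
print(decide(complete, "ban_yellow", "msg_yellow", always_violates))
```

The same event moves from a provisional soft failure to a final decision once the missing predecessors arrive, which is the difference from soft failure as it exists today.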

Such a _causal barrier enforcement model_ uncovers a side-effect of
soft-failure which is less obvious.

If the sender of a ban event were to deliberately ignore the history of
the target, they would now have the power to erase the target's history
for all room participants. Unfortunately, this can already happen today:
room participants with a more complete synchronisation are simply
unable to notice, while anyone with a partial synchronisation, such as a
new joiner, can.
Comment on lines +82 to +85
Contributor Author

To be honest, I get confused every time I revisit this. It's not actually clear to me what currently happens here.


### Soft-failure hides causal barrier conflicts

- Soft-failure currently obscures the fact that Matrix allows causal
barriers to conflict with a sender’s previously acknowledged
history. This can result in parts of a room’s history effectively
disappearing for any room participant that did not have the same
causal frontier at the time of the event.

- In practice, this means that malicious or careless use of barriers
can erase history for new joiners or for any server in the room that
had only a partial view of the DAG upon receipt of the
event. Servers with more complete history retain the missing events,
but these are hidden by soft-failure. In other words, the behaviour
is not theoretical: it is possible today, and its impact depends on
how complete a server's DAG synchronisation is at the time of the
event.

- The community has accepted this by relying on full-sync
assumptions and by prioritising the continuity of existing
participants’ views over the consistency of new joiners.

- If soft-failure is to be reframed as enforcement of causal barriers,
then this issue must be acknowledged and addressed. It is not
sufficient to treat the current behaviour as sound; we need to
recognise the limitations of the existing model and decide
explicitly how much power barriers should have to override history.

- In other words, soft-failure provides an illusion that Matrix is
sound.

## Proposal

We propose that the terminology for causal barriers from this MSC is
adopted so that the problem space can be discussed. There are a couple of
solutions that can be explored:

- Policy servers as causal authorities: Policy servers can act as
causal authorities. They could do this by performing a
full synchronisation for each event they receive, and then only
signing events that are consistent with the policy server's vector
clock frontier.

- Deferred authorization as causal enforcement (described earlier in
this MSC), combined with a system that allows distrust in senders to
be expressed, in order to provide a manual resolution to forks that
would constitute a political problem.

This would allow us to immediately begin to use consistent terminology
within the context of [MSC4345: Server key identity and room
membership](https://github.com/matrix-org/matrix-spec-proposals/pull/4345#discussion_r2336627872)
and other MSC's which may have security concerns that cannot be
expressed without this understanding.
Comment on lines +134 to +138
Contributor Author

@Gnuxie Gnuxie Sep 10, 2025

This should come before the suggested solutions. Also the link is broken and the apostrophe shouldn't be there


## Potential issues

- None considered

## Alternatives

- None considered

## Security considerations
Contributor Author

It might not be safe to use auth_events to create a vector clock frontier for a sender.

This is because if the sender joined via a partial synchronisation and only has a snapshot of the current auth state from `/make_join`, then they can be forced to indirectly reference unauthorized events. Implementations may also be lazy about recursively verifying auth state from `prev_events` when merging branches, which absolutely removes the ability to use `prev_events` to form a vector clock frontier.

Contributor Author

@Gnuxie Gnuxie Sep 14, 2025

Effectively, this makes it impossible for someone rebuilding the entire auth chain to form a vector clock frontier from the `auth_events` and `prev_events` in order to reason about the causal knowledge of a sender, because the frontier can be full of random events that the sender hasn't bothered to either verify or authorize.

Contributor Author

@Gnuxie Gnuxie Sep 14, 2025

And that stops ANYONE from being able to reliably reason about causal barrier violations.

Contributor Author

[image]

Contributor Author

@Gnuxie Gnuxie Sep 14, 2025

Blue is trying to ban yellow in order to erase yellow's events. But new joiners can't tell wtf is going on here and can't even tell that there is a causal barrier violation because Blue has been tricked by Green into tying in the events Blue tried to erase.

Essentially, Yellow's membership is a causal predecessor of Blue's ban on Yellow. The events are not conflicting. But Blue has been tricked into referencing a Yellow event which would normally be soft failed because of the causal barrier from Blue's ban on Yellow, a ban that doesn't include Yellow's event as a causal predecessor.

Contributor Author

@Gnuxie Gnuxie Sep 14, 2025

probably easier to understand if it looks like this

[image]

Contributor Author

@Gnuxie Gnuxie Sep 14, 2025

There's no reliable way to prevent this. It's impossible.

Contributor Author

[image] [image]

Contributor Author

Problem discussion moved to #4353


- None considered

## Unstable prefix

- Not applicable

## Dependencies

- None