
Conversation

Contributor

@bkirwi bkirwi commented Nov 4, 2025

This PR:

  • changes the semantics of the Persist shard's version field... instead of indicating the highest version that has ever touched a bit of state, it indicates the lowest version which is guaranteed to be compatible with the existing state.
  • changes the version checks: we now disallow versions from the future, and cap the number of versions from the past we will maintain write compatibility with.
  • moves the version field into StateCollections, to make it available to individual Persist commands. This will allow future Persist changes to check the version before making changes, which is essential for flexible backwards compatibility.
  • adds a new command to upgrade the compatibility version of the shard. This is not yet invoked everywhere it should be; I intend to add the full set of usages in a followup PR.
  • adds metrics that report whether we're interacting with an older version of state, which (among other things) will help me track whether we're reliably upgrading shards.
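The version checks described above can be sketched roughly as follows. This is a simplified model with a single numeric version and a hypothetical `max_back` cap; the real Persist code works with full semantic versions and its own error types:

```rust
/// Decide whether code at version `code` may operate on state whose
/// compatibility version is `state`. A toy model of the checks above:
/// versions from the future are rejected outright, and we only maintain
/// write compatibility with a bounded number of past versions.
/// (`max_back` and the single-number version are illustrative
/// assumptions, not the real Persist representation.)
fn check_compat(code: u64, state: u64, max_back: u64) -> Result<(), String> {
    if state > code {
        // State written by a future version: disallowed.
        return Err(format!(
            "code at version {code} cannot read data with version {state}"
        ));
    }
    if code - state > max_back {
        // State too far in the past: outside the compatibility window.
        return Err(format!(
            "state version {state} is older than the {max_back}-version \
             compatibility window of code version {code}"
        ));
    }
    Ok(())
}
```

Under this model, the `upgrade_version` command's job is to bump `state` forward so it never falls out of the window.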

Motivation

https://github.com/MaterializeInc/database-issues/issues/9870

@bkirwi bkirwi force-pushed the versioning branch 10 times, most recently from 751819c to e3a343c on November 6, 2025 at 23:02
Contributor

@teskje teskje left a comment


This looks good to me, at least as far as I'm grokking the persist details. There is still the piece missing where we call upgrade_version on all existing shards in the environment during leader startup, right?

}
}

if cfg!(debug_assertions) {
Contributor


Should we instead use a soft assert here? Debug asserts are less useful because they don't run in CI.

Contributor Author


This adds a whole other set of network roundtrips, which seems too expensive even for a soft assert - it can add brand new failure modes to the code even when the check succeeds. I found this helpful when developing the unit tests, but we can delete it entirely now if it feels awkward...

Contributor


For checks we don't want to run in production, the _no_log soft assert variants are appropriate. But deleting is also fine with me!
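For context, the distinction under discussion can be sketched like so. This is a toy stand-in; Materialize's actual `soft_assert` macros and their `_no_log` variants live in its own crates and differ in detail:

```rust
/// A toy soft assertion: in debug builds (and thus in tests/CI) a
/// failure panics, while in release builds it is only logged, so
/// production keeps running. Returns the condition so callers can
/// still branch on it.
fn soft_assert(cond: bool, msg: &str) -> bool {
    if !cond {
        if cfg!(debug_assertions) {
            panic!("soft assertion failed: {msg}");
        } else {
            eprintln!("soft assertion failed: {msg}");
        }
    }
    cond
}
```

The point raised above is that even the release-mode path still has to *evaluate* the condition, which is why a check requiring extra network roundtrips is expensive regardless of the assert flavor.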

Contributor Author


Totally... I just don't want to have to debug a benchmark regression if this code happens to be invoked somewhere an extra roundtrip matters, or to generate extra load on CRDB in Staging. Sounds like it's best to delete, then!

panic!("code at version {code_version} cannot read data with version {data_version}");
} else {
halt!("{msg}");
halt!("code at version {code_version} cannot read data with version {data_version}");
Contributor


Ooc, what's the reason for doing a halt! here instead of a panic!? Are there cases in normal operation where we'd expect to see incompatible readers/writers?

Contributor Author


It looks like this was introduced in the original halt! PR. I don't see explicit discussion, but it's true that it's possible for this to happen without a bug... for example, a zombie process after an upgrade.
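In other words, the choice between halt! and panic! turns on whether the situation can arise without a bug. A minimal sketch of that distinction (the predicate name and logic here are illustrative, not the real Persist code):

```rust
/// Can an incompatible version pair occur in normal operation?
/// Seeing *newer* data than our code version is expected when, e.g.,
/// a zombie process keeps running across an upgrade; that warrants an
/// orderly halt rather than a panic, which would signal a genuine bug.
fn expected_in_normal_operation(code_version: u64, data_version: u64) -> bool {
    data_version > code_version
}
```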

@def- def- mentioned this pull request Nov 7, 2025
Contributor

def- commented Nov 7, 2025

I think this Cargo test failure might be caused by this PR? At least I haven't seen it before: https://buildkite.com/materialize/test/builds/111438:

    FAIL [   4.451s] mz-catalog::open test_persist_open
    thread 'test_persist_open' panicked at src/catalog/tests/open.rs:549:9:
    assertion `left == right` failed

Contributor Author

bkirwi commented Nov 7, 2025

I think this Cargo test failure might be caused by this PR?

Probably related and probably not concerning - looks like a snapshot test that doesn't fully reflect the state of the merged branch - but thanks for linking; will keep an eye on it!

A downside of repurposing the applier_version as the actual state
version is that we can no longer determine the actual version of the
code that made a change without a serious investigation. Adding it to
this string means it will be included in various places in state.
@bkirwi bkirwi changed the title [persist] Versioning [persist] Compatibility versioning Nov 7, 2025
@bkirwi bkirwi marked this pull request as ready for review November 7, 2025 22:24
@bkirwi bkirwi requested review from a team as code owners November 7, 2025 22:24
@bkirwi bkirwi requested a review from aljoscha November 7, 2025 22:24
Contributor Author

bkirwi commented Nov 7, 2025

This doesn't yet include all the necessary upgrade_version calls, but it seems to contain enough to make CI happy. Hunting down everywhere that we open a shard may take a while, though, so I'd love to get the first batch merged and unblock some other folks.

@bkirwi bkirwi requested a review from DAlperin November 7, 2025 22:26