
Conversation

Contributor

@bkirwi bkirwi commented Nov 4, 2025

This PR:

  • changes the semantics of the Persist shard's version field... instead of indicating the highest version that has ever touched a bit of state, it indicates the lowest version which is guaranteed to be compatible with the existing state.
  • changes the version checks: we now disallow versions from the future, and cap the number of versions from the past we will maintain write compatibility with.
  • moves the version field into StateCollections, to make it available to individual Persist commands. This will allow future Persist changes to check the version before making changes, which is essential for flexible backwards compatibility.
  • adds a new command to upgrade the compatibility version of the shard. This is not yet invoked everywhere it should be; I intend to add the full set of usages in a followup PR.
  • adds metrics that report whether we're interacting with an older version of state, which (among other things) will help me track whether we're reliably upgrading shards.
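The version checks described above can be sketched roughly as follows. This is a simplified model with a single numeric version and a hypothetical `max_back` cap; the real Persist code works with full semantic versions and its own error types:

```rust
/// Decide whether code at version `code` may operate on state whose
/// compatibility version is `state`. A toy model of the checks above:
/// versions from the future are rejected outright, and we only maintain
/// write compatibility with a bounded number of past versions.
/// (`max_back` and the single-number version are illustrative
/// assumptions, not the real Persist representation.)
fn check_compat(code: u64, state: u64, max_back: u64) -> Result<(), String> {
    if state > code {
        // State written by a future version: disallowed.
        return Err(format!(
            "code at version {code} cannot read data with version {state}"
        ));
    }
    if code - state > max_back {
        // State too far in the past: outside the compatibility window.
        return Err(format!(
            "state version {state} is older than the {max_back}-version \
             compatibility window of code version {code}"
        ));
    }
    Ok(())
}
```

Under this model, the `upgrade_version` command's job is to bump `state` forward so it never falls out of the window.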

Motivation

https://github.com/MaterializeInc/database-issues/issues/9870

@bkirwi bkirwi force-pushed the versioning branch 10 times, most recently from 751819c to e3a343c on November 6, 2025 at 23:02
Contributor

@teskje teskje left a comment


This looks good to me, at least as far as I'm grokking the persist details. There is still the piece missing where we call upgrade_version on all existing shards in the environment during leader startup, right?

}
}

if cfg!(debug_assertions) {
Contributor


Should we instead use a soft assert here? Debug asserts are less useful because they don't run in CI.

Contributor Author


This adds a whole other set of network roundtrips, which seems too expensive even for a soft assert - it can add brand new failure modes to the code even when the check succeeds. I found this helpful when developing the unit tests, but we can delete it entirely now if it feels awkward...

Contributor


For checks we don't want to run in production, the _no_log soft assert variants are appropriate. But deleting is also fine with me!
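For context, the distinction under discussion can be sketched like so. This is a toy stand-in; Materialize's actual `soft_assert` macros and their `_no_log` variants live in its own crates and differ in detail:

```rust
/// A toy soft assertion: in debug builds (and thus in tests/CI) a
/// failure panics, while in release builds it is only logged, so
/// production keeps running. Returns the condition so callers can
/// still branch on it.
fn soft_assert(cond: bool, msg: &str) -> bool {
    if !cond {
        if cfg!(debug_assertions) {
            panic!("soft assertion failed: {msg}");
        } else {
            eprintln!("soft assertion failed: {msg}");
        }
    }
    cond
}
```

The point raised above is that even the release-mode path still has to *evaluate* the condition, which is why a check requiring extra network roundtrips is expensive regardless of the assert flavor.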

Contributor Author


Totally... I just don't want to have to debug a benchmark regression if this code happens to be invoked somewhere an extra roundtrip matters, or to generate extra load on CRDB in Staging. Sounds like it's best to delete, then!

panic!("code at version {code_version} cannot read data with version {data_version}");
} else {
halt!("{msg}");
halt!("code at version {code_version} cannot read data with version {data_version}");
Contributor


Ooc, what's the reason for doing a halt! here instead of a panic!? Are there cases in normal operation where we'd expect to see incompatible readers/writers?

Contributor Author


It looks like this was introduced in the original halt! PR. I don't see explicit discussion, but it's true that it's possible for this to happen without a bug... for example, a zombie process after an upgrade.
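In other words, the choice between halt! and panic! turns on whether the situation can arise without a bug. A minimal sketch of that distinction (the predicate name and logic here are illustrative, not the real Persist code):

```rust
/// Can an incompatible version pair occur in normal operation?
/// Seeing *newer* data than our code version is expected when, e.g.,
/// a zombie process keeps running across an upgrade; that warrants an
/// orderly halt rather than a panic, which would signal a genuine bug.
fn expected_in_normal_operation(code_version: u64, data_version: u64) -> bool {
    data_version > code_version
}
```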

@def- def- mentioned this pull request Nov 7, 2025
Contributor

def- commented Nov 7, 2025

I think this Cargo test failure might be caused by this PR? At least I haven't seen it before: https://buildkite.com/materialize/test/builds/111438:

    FAIL [   4.451s] mz-catalog::open test_persist_open
    thread 'test_persist_open' panicked at src/catalog/tests/open.rs:549:9:
    assertion `left == right` failed

Contributor Author

bkirwi commented Nov 7, 2025

I think this Cargo test failure might be caused by this PR?

Probably related and probably not concerning - looks like a snapshot test that doesn't fully reflect the state of the merged branch - but thanks for linking; will keep an eye on it!

A downside of repurposing the applier_version as the actual state
version is that we can no longer determine the actual version of the
code that made a change without a serious investigation. Adding it to
this string means it will be included in various places in state.
@bkirwi bkirwi changed the title [persist] Versioning [persist] Compatibility versioning Nov 7, 2025
@bkirwi bkirwi marked this pull request as ready for review November 7, 2025 22:24
@bkirwi bkirwi requested review from a team as code owners November 7, 2025 22:24
@bkirwi bkirwi requested a review from aljoscha November 7, 2025 22:24
Contributor Author

bkirwi commented Nov 7, 2025

This doesn't yet include all the necessary upgrade_version calls, but it seems to contain enough to make CI happy. Hunting down everywhere that we open a shard may take a while, though, so I'd love to get the first batch merged and unblock some other folks.

@bkirwi bkirwi requested a review from DAlperin November 7, 2025 22:26