[persist] Compatibility versioning #34027
base: main
751819c to
e3a343c
Compare
teskje
left a comment
This looks good to me, at least as far as I'm grokking the persist details. There is still the piece missing where we call upgrade_version on all existing shards in the environment during leader startup, right?
src/catalog/src/durable/persist.rs
Outdated
}
}

if cfg!(debug_assertions) {
Should we instead use a soft assert here? Debug asserts are less useful because they don't run in CI.
This adds a whole other set of network roundtrips, which seems too expensive even for a soft assert - it can add brand new failure modes to the code even when the check succeeds. I found this helpful when developing the unit tests, but we can delete it entirely now if it feels awkward...
For checks we don't want to run in production, the _no_log soft assert variants are appropriate. But deleting is also fine with me!
Totally... I just eg. don't want to have to debug a benchmark regression if this code happens to be invoked somewhere where an extra roundtrip matters, for example, or generate extra load on CRDB in Staging. Sounds like it's best to delete then!
panic!("code at version {code_version} cannot read data with version {data_version}");
} else {
halt!("{msg}");
halt!("code at version {code_version} cannot read data with version {data_version}");
Ooc, what's the reason for doing a halt! here instead of a panic!? Are there cases in normal operation where we'd expect to see incompatible readers/writers?
It looks like this was introduced in the original halt! PR. I don't see explicit discussion, but it's true that it's possible for this to happen without a bug... for example, a zombie process after an upgrade.
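To make the halt-versus-panic distinction above concrete, here is a minimal sketch of a version check that reports incompatibility as a value, so the caller can choose a clean shutdown rather than treating it as a bug. The `Version` and `Outcome` types, the `check_data_version` function, and the rule "code can read data at the same or an older version" are all assumptions for illustration; none of them are taken from the Persist codebase.

```rust
use std::fmt;

/// Hypothetical (major, minor) version; not the real Persist version type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Version {
    pub major: u64,
    pub minor: u64,
}

impl fmt::Display for Version {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}.{}", self.major, self.minor)
    }
}

/// What the caller should do after comparing versions.
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    /// The data is readable; carry on.
    Proceed,
    /// The data is too new. This can happen without a bug (for example, a
    /// zombie process left over after an upgrade), so the caller is expected
    /// to exit cleanly instead of panicking.
    Halt(String),
}

/// Assumed rule: code can read data written at the same or an older version.
pub fn check_data_version(code_version: Version, data_version: Version) -> Outcome {
    if data_version <= code_version {
        Outcome::Proceed
    } else {
        Outcome::Halt(format!(
            "code at version {code_version} cannot read data with version {data_version}"
        ))
    }
}
```

Returning the outcome instead of exiting inside the check keeps the sketch testable; a real caller would map `Outcome::Halt` to whatever shutdown mechanism it uses.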
I think this Cargo test failure might be caused by this PR? At least I haven't seen it before: https://buildkite.com/materialize/test/builds/111438
Probably related and probably not concerning - looks like a snapshot test that doesn't fully reflect the state of the merged branch - but thanks for linking; will keep an eye on it!
A downside of repurposing the applier_version as the actual state version is that we can no longer determine which version of the code actually made the change without a serious investigation. Adding it to this string means it will be included in various places in state.
This doesn't yet include all the necessary |
This PR:
- version field... instead of indicating the highest version that has ever touched a bit of state, it indicates the lowest version which is guaranteed to be compatible with the existing state.
- StateCollections, to make it available to individual Persist commands. This will allow future Persist changes to check the version before making changes, which is essential for flexible backwards compatibility.

Motivation
https://github.com/MaterializeInc/database-issues/issues/9870
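
As a rough illustration of the version semantics in the PR summary, the sketch below models a state whose version field records the lowest code version guaranteed to be compatible with it, and a command that consults that field before writing in a newer format. `ShardState`, `apply_write`, and the `uses_new_encoding` flag are hypothetical. The name `upgrade_version` appears in the review discussion, but its signature and behavior here are assumptions.

```rust
/// Hypothetical (major, minor) version; tuple ordering is lexicographic.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Version(pub u64, pub u64);

/// Hypothetical stand-in for a shard's state.
pub struct ShardState {
    /// Lowest code version guaranteed to be compatible with this state.
    pub version: Version,
    /// Illustrative flag: whether the last write used a newer format.
    pub uses_new_encoding: bool,
}

impl ShardState {
    /// Raise the state version, never lower it. Returns true if it changed.
    /// (Assumed behavior; the real `upgrade_version` may differ.)
    pub fn upgrade_version(&mut self, to: Version) -> bool {
        if to > self.version {
            self.version = to;
            true
        } else {
            false
        }
    }

    /// A command that checks the state version before making a change:
    /// the newer format is used only once every compatible reader is
    /// known to understand it.
    pub fn apply_write(&mut self, feature_min: Version) {
        self.uses_new_encoding = self.version >= feature_min;
    }
}
```

Under these assumptions, a write against a state at an older version keeps the old format, and the same write after `upgrade_version` switches to the new one.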