Skip to content

Conversation

pav-kv
Copy link
Collaborator

@pav-kv pav-kv commented Oct 9, 2025

This PR fixes the replica storage race that may occur when applying a split in the rare case there is a concurrent creation of a higher-ReplicaID RHS. The race is described in #152199, and stems from the fact that a higher-ReplicaID uninitialized replica is not locked for the entirety of splitPreApply, so it can be created and make some limited progress concurrently.

We remove the clearing/rewriting of the unreplicated state which belongs to that RHS, to let it progress untouched. This also removes a blocker towards raft/state-machine storage separation: we don't want to be touching raft state of the RHS in the same batch with the state machine updates.

Fixes #152199

Release note (bug fix): fixed a race in range splits that can result in a regressed raft state of a post-split range. The race conditions are very rare / nearly impossible, and we haven't seen it in the wild.

Copy link

blathers-crl bot commented Oct 9, 2025

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@pav-kv pav-kv requested review from tbg and arulajmani October 9, 2025 10:50

// ClearRaftTruncatedState clears the RaftTruncatedState.
func (sl StateLoader) ClearRaftTruncatedState(writer storage.Writer) error {
return writer.ClearUnversioned(sl.RaftTruncatedStateKey(), storage.ClearOptions{})
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to double-check that this ClearUnversioned is the right way to remove this key. I'm never sure with the various MVCC/unversioned/engine keys etc.

//
// TODO(#152847): remove this line when there are no historical proposals with
// RaftTruncatedState, e.g. after a below-raft migration.
if err := rsl.ClearRaftTruncatedState(readWriter); err != nil {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could also read the key here, and clear it only if it exists. If it does exist, it necessarily comes from the batch; it can't exist in storage because the RHS replica does not exist or is uninitialized.

The purpose of this dance is to make the batch state-machine-only.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems worth doing IMO, but your call.

@pav-kv pav-kv marked this pull request as ready for review October 9, 2025 13:28
@pav-kv pav-kv requested a review from a team as a code owner October 9, 2025 13:28
@pav-kv pav-kv added the backport-all Flags PRs that need to be backported to all supported release branches label Oct 9, 2025
@pav-kv
Copy link
Collaborator Author

pav-kv commented Oct 9, 2025

Would be good to test this, though it's a bit non-trivial because splitPreApply needs full Replica and Store. However, after this change there is much less needed (e.g. no more ad-hoc raftMu-locking of the RHS replica), so perhaps we could librarify splitPreApply too and make it testable in isolation. We need to do it anyway for RSE.

Need to think how to make this change+test backportable, since it's a bug fix. We probably already have a TestCluster-based test for this corner case though (TODO: find it).

@arulajmani does your split testing work cover splitPreApply too, or mostly the evaluation side?

@pav-kv pav-kv removed the backport-all Flags PRs that need to be backported to all supported release branches label Oct 9, 2025
@pav-kv
Copy link
Collaborator Author

pav-kv commented Oct 9, 2025

Holding off with backport-all, since we only recently stopped writing RaftTruncatedState into split eval batch, and this change depends on that. Overall, not worth going extra mile to backport too far, since it's a rare/unseen bug.

This commit fixes the replica storage race that may occur when applying
a split in the rare case there is a concurrent creation of a
higher-ReplicaID RHS. The race stems from the fact that a
higher-ReplicaID uninitialized replica is not locked for the entirety of
splitPreApply, so it can be created and make some limited progress
concurrently.

We remove the clearing/rewriting of the unreplicated state which belongs
to that RHS, to let it progress untouched. This also removes a blocker
towards raft/state-machine storage separation: we don't want to be
touching raft state of the RHS in the same batch with the state machine
updates.

Epic: none

Release note (bug fix): fixed a race in range splits that can result in
a regressed raft state of a post-split range. The race conditions are
very rare / nearly impossible, and we haven't seen it in the wild.
Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good from a high level. I'm too over the place right now to do a detailed review, will leave that to @arulajmani. Nice to see this big wart getting removed!

//
// TODO(#152847): remove this line when there are no historical proposals with
// RaftTruncatedState, e.g. after a below-raft migration.
if err := rsl.ClearRaftTruncatedState(readWriter); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems worth doing IMO, but your call.

// machine keys, and never contains unreplicated / raft keys. One exception:
// there can still be historical split proposals that write the initial
// RaftTruncatedState of the RHS. Remove this key (if exists), and set it
// below only if necessary.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"set it below only if necessary" - this is referring to SynthesizeRaftState, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. We set it only in the common case that the ReplicaID is right and is locked. Then we do the RHS initialization, which entails initializing the RaftTruncatedState.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correction: the reference though is to the SetRaftTruncatedState below, right after SynthesizeRaftState.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kvserver: race in split application
3 participants