Skip to content

Conversation

@joshua-adams-1
Copy link
Contributor

@joshua-adams-1 joshua-adams-1 commented Nov 10, 2025

Modifies the SnapshotShutdownProgressTracker to only log the periodic progress report while there are still snapshots left to process. This removes the existing behaviour where the report will log indefinitely, even once all snapshots are completed or paused.

Closes: ES-12794

Modifies the SnapshotShutdownProgressTracker to only log the periodic
progress report when there are still snapshots left to process. This
removes the existing behaviour where the report will log indefinitely,
even once all snapshots are completed or paused.

Closes: ES-12794
// Simulate starting shutdown -- should reset the completion stats and start logging
// Register each snapshot as in progress
for (int i = 0; i < 5; ++i) {
tracker.incNumberOfShardSnapshotsInProgress(dummyShardId, dummySnapshot);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mimics production behaviour where snapshots are added to the tracker prior to the node shutting down. If I didn't, when calling tracker.onClusterStateAddShutdown(); the periodic logger would begin and then immediately end since there would be no in-progress snapshots

@joshua-adams-1 joshua-adams-1 self-assigned this Nov 10, 2025
@joshua-adams-1 joshua-adams-1 marked this pull request as ready for review November 10, 2025 15:50
@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label Nov 10, 2025
@DiannaHohensee DiannaHohensee added :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed Coordination Meta label for Distributed Coordination team labels Nov 11, 2025
@elasticsearchmachine elasticsearchmachine removed the needs:triage Requires assignment of a team area label label Nov 11, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

Copy link
Contributor

@DiannaHohensee DiannaHohensee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quite a few nits, but superficial stuff.

Comments are great to have, but they should explain why something is happening, not only say what’s happening. ‘What’ can generally be understood with some effort from reading the code, but ‘why’ is the difficult part for new readers to grasp. I’d use ‘what’ comments for high level what’s happening in a section of code, when there's a series of steps. But having the ‘what’ without a clear ‘why’ is often frustrating for a reader. E.g., “I wrote this thing and this is what it does, but for the life of me I can’t remember why I wrote it to do that thing”, becomes a problem.

pausedCount.get()
);
// If there are no more snapshots in progress, then stop logging periodic progress reports
if (numberOfShardSnapshotsInProgressOnDataNode.get() == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we safely add asserts in this if-statement that the other stat values are zero / empty?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm what would you be looking for specifically here? I don't think there's anything to assert but please correct me if I'm wrong! Running through the values outputted in the log:

  • shutdownStartMillis should be any non -1 value, (since this code only runs when the node is shutting down) but that's the same as in the below log case too
  • shutdownFinishedSignallingPausingMillis could be any value, including -1, right, because this doesn't necessarily have to be set if the snapshots aren't paused but are instead completed?
  • numberOfShardSnapshotsInProgressOnDataNode is already 0
  • shardSnapshotRequests can be anything, right? So nothing to assert there
  • done, failed, aborted and paused are dependent on the snapshot status so nothing to check

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My expectation was that shardSnapshotRequests should be empty when the number of numberOfShardSnapshotsInProgressOnDataNode reaches zero. The snapshotting begins here with a listener runAfter hook to decrement the numSnapshots. The decrement runs after this listener, which is responsible for notifying the master of the outcome of the shard snapshot, but it doesn't appear to do that synchronously....

If there's an outstanding request to the master node to inform it about the snapshot's completion that has not returned, then this logger shouldn't be turned off, I think.

The if statement could become

if (numberOfShardSnapshotsInProgressOnDataNode.get() == 0 && shardSnapshotRequests.isEmpty()) {
    ....
}

Then we don't have to worry about it?

// Register each snapshot as in progress
for (int i = 0; i < 5; ++i) {
tracker.incNumberOfShardSnapshotsInProgress(dummyShardId, dummySnapshot);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you throw this into a helper method, since this is common across tests?

It'd be fancier to randomize (and/or un-hardcode) the number, but I believe that would require another parameter given to testPeriodicLogReport: up to you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I considered this when writing the tests. Happy to randomise if you think it would add any benefit, but I was copying the existing precedent set by the tests that had 5 in-progress snapshots, and I didn't want to bloat the tests with unnecessary code modifying all the expectations to match a random number of snapshots and then lose the clarity / meaning of the tests. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The undesirable part is the magic number that's presumed across setInFlightSnapshots and ensurePeriodicLogMessagesAsSnapshotsComplete. Passing a parameter of 5 would be better code practice than two methods magically knowing about what the other does.

I didn't want to bloat the tests with unnecessary code modifying all the expectations to match a random number of snapshots

IIUC, you've localized all the relevant code in ensurePeriodicLogMessagesAsSnapshotsComplete. I think little, if anything, would be changed outside that method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have added in e2151a3 👍

tracker.incNumberOfShardSnapshotsInProgress(dummyShardId, dummySnapshot);
}

// 1. Simulate a node shutdown
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you lost a few numbers while refactoring? There's a 1 and a 3.

Suggested change
// 1. Simulate a node shutdown
// Simulate a node shutdown so the tracker starts logging.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls add why

@joshua-adams-1
Copy link
Contributor Author

Hey, thank you for reviewing. I've updated in line with the comments, let me know if there's anything else you want me to clarify / improve

}
logger.info("---> Generated shard snapshot status in stage (" + status.getStage() + ") for switch case (" + (i % 4) + ")");
tracker.decNumberOfShardSnapshotsInProgress(dummyShardId, dummySnapshot, status);
// Advance time to the next periodic log message and expect it to include the new completion stats.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Advance time to the next periodic log message and expect it to include the new completion stats.
// Advance time to the next scheduled periodic log message and expect it to include the new completion stats.

}

private void ensurePeriodicLogMessagesAsSnapshotsComplete(SnapshotShutdownProgressTracker tracker, MockLog mockLog) {
// We are going to simulate 5 in-progress snapshots while the node is shutting down, so we expect the periodic logger
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// We are going to simulate 5 in-progress snapshots while the node is shutting down, so we expect the periodic logger
// We are going to simulate completing 5 in-progress snapshots while the node is shutting down, so we expect the periodic logger

setLogPeriodicProgressReportSeenEventExpectation(mockLog, 2, 1, 1, 1, 0);
setLogPeriodicProgressReportSeenEventExpectation(mockLog, 1, 1, 1, 1, 1);

// When there are no more in-progress snapshots, the tracker emits a final log message, and terminates the periodic logger.
Copy link
Contributor

@DiannaHohensee DiannaHohensee Nov 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment suggests it's looking for another log message, when in fact the check is that there is not a log message.

Suggested change
// When there are no more in-progress snapshots, the tracker emits a final log message, and terminates the periodic logger.
// When the last snapshot completes, the tracker will emit a different log message, saying that it is finished, before terminating the periodic logger. We should not see the regular message with the final stats update.

)
);

// Adds in-progress snapshots so that the tracker doesn't immediately exit on seeing that all snapshots are finished,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Adds in-progress snapshots so that the tracker doesn't immediately exit on seeing that all snapshots are finished,
// Adds an in-progress snapshot so that the tracker doesn't immediately exit on seeing that all snapshots are finished.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto below using this comment.

);

// Adds in-progress snapshots so that the tracker doesn't immediately exit on seeing that all snapshots are finished,
// and should begin logging the periodic progress report
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is rather confusing, since incrementing the tracker doesn't make logging start -- and the subsequent lines test that the tracker doesn't log anything because it is disabled

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto below using this comment.

}

// Clear start and pause timestamps
// 3. Clear start and pause timestamps, and expect the periodic logger to reset
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does reset mean? The logger stops.

* Ensure that we only log a periodic progress report while there are in progress snapshots,
* and that we do not continue to log after all snapshots have completed
*/
private boolean shouldLogPeriodicProgressReport = false;
Copy link
Contributor

@DiannaHohensee DiannaHohensee Nov 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be volatile. I also don't think this is safe to change before canceling the scheduled logProgressReport(): you could set it to false before cancelling, and race with logProgressReport()'s assert that it is true.

This boolean need only represent completion of exit log message, with an assert in that if-statement, to prevent a double exit message. Other concurrency is already handled by the other methods.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >non-issue Team:Distributed Coordination Meta label for Distributed Coordination team v9.3.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants