Skip to content

Conversation

jonathan-buttner
Copy link
Contributor

@jonathan-buttner jonathan-buttner commented Sep 12, 2025

This PR changes the behavior to have deployments be gracefully shutdown (queued requests completed) in the situation where a node has been removed from an assignment's routing table.

There are a few other places where we should determine if we want to perform a graceful shutdown:

  • Should we shutdown gracefully when the assignment is missing/no longer exists?

https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/inference/assignment/TrainedModelAssignmentNodeService.java#L424

  • Should we shutdown gracefully when we failed to load the model?
    -It’s probably unlikely that we have anything to drain from the queues, but we could try

https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/inference/assignment/TrainedModelAssignmentNodeService.java#L793

After chatting with Dave we decided not to change these other places. If a deployment fails while trying to load, there shouldn't be any requests in the queue. For the situation where an assignment does not exist, that should only occur if the assignment is stopped which should trigger the logic already.

@jonathan-buttner jonathan-buttner added >bug :ml Machine learning Team:ML Meta label for the ML team auto-backport Automatically create backport pull requests when merged cloud-deploy Publish cloud docker image for Cloud-First-Testing v9.2.0 v8.19.5 v9.1.5 v9.0.8 v8.18.9 labels Sep 12, 2025
@elasticsearchmachine
Copy link
Collaborator

Hi @jonathan-buttner, I've created a changelog YAML for you.

@jonathan-buttner jonathan-buttner marked this pull request as ready for review September 15, 2025 17:27
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

@@ -573,11 +573,18 @@ private void stopUnreferencedDeployment(String deploymentId, String currentNode)
// This model is not routed to the current node at all
TrainedModelDeploymentTask task = deploymentIdToTask.remove(deploymentId);
if (task == null) {
logger.debug(
() -> format(
"[%s] Unable to stop unreferenced deployment for node %s because task does not exit",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be "task does not exist"?

trainedModelAssignmentNodeService.prepareModelToLoad(taskParams);
trainedModelAssignmentNodeService.clusterChanged(event);

assertBusy(() -> { verify(deploymentManager, times(1)).stopAfterCompletingPendingWork(any()); });
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick, are the curly brackets needed around the lambda here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Automatically create backport pull requests when merged >bug cloud-deploy Publish cloud docker image for Cloud-First-Testing :ml Machine learning Team:ML Meta label for the ML team v8.18.9 v8.19.5 v9.0.8 v9.1.5 v9.2.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants