-
Notifications
You must be signed in to change notification settings - Fork 25.5k
[ML] Gracefully shutdown model deployment when node is removed from assignment routing #134673
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Hi @jonathan-buttner, I've created a changelog YAML for you. |
Pinging @elastic/ml-core (Team:ML) |
@@ -573,11 +573,18 @@ private void stopUnreferencedDeployment(String deploymentId, String currentNode) | |||
// This model is not routed to the current node at all | |||
TrainedModelDeploymentTask task = deploymentIdToTask.remove(deploymentId); | |||
if (task == null) { | |||
logger.debug( | |||
() -> format( | |||
"[%s] Unable to stop unreferenced deployment for node %s because task does not exit", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be "task does not exist"?
trainedModelAssignmentNodeService.prepareModelToLoad(taskParams); | ||
trainedModelAssignmentNodeService.clusterChanged(event); | ||
|
||
assertBusy(() -> { verify(deploymentManager, times(1)).stopAfterCompletingPendingWork(any()); }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick, are the curly brackets needed around the lambda here?
This PR changes the behavior to have deployments be gracefully shutdown (queued requests completed) in the situation where a node has been removed from an assignment's routing table.
There are a few other places where we should determine if we want to perform a graceful shutdown:
https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/inference/assignment/TrainedModelAssignmentNodeService.java#L424
-It’s probably unlikely that we have anything to drain from the queues, but we could try
https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/inference/assignment/TrainedModelAssignmentNodeService.java#L793
After chatting with Dave we decided not to change these other places. If a deployment fails while trying to load, there shouldn't be any requests in the queue. For the situation where an assignment does not exist, that should only occur if the assignment is stopped which should trigger the logic already.