fix(test): Stabilize flaky test by awaiting model deployment #1617
base: main
Conversation
Can you update the change log?
```java
modelId = prepareModel();
assertTrue("Timed out waiting for model to deploy", waitUntil(() -> {
    try {
        Response modelGetResponse = getMlModel(modelId);
        String responseBody = EntityUtils.toString(modelGetResponse.getEntity());
        return responseBody.contains("\"model_state\":\"" + MLModelState.DEPLOYED + "\"");
    } catch (Exception e) {
        return false;
    }
}, 60, TimeUnit.SECONDS));
```
Can you add this to the `prepareModel()` function? I think there may be other models that would benefit from this check.
Thanks for creating this PR. Can you take a look at this function: `isModelReadyForInference`? You can try modifying it to only accept the "DEPLOYED" status. I'll trigger the workflow once my comments get addressed.
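The wait-and-poll pattern under discussion can be sketched generically. The `waitUntil` below is a minimal, hypothetical stand-in for the test framework's helper, not the actual OpenSearch implementation; it only illustrates the deadline-based polling the fix relies on.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

public class WaitUntilSketch {
    // Stand-in for the test framework's waitUntil helper: polls the
    // condition until it returns true or the timeout elapses.
    static boolean waitUntil(BooleanSupplier condition, long timeout, TimeUnit unit) throws InterruptedException {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        while (System.nanoTime() < deadline) {
            if (condition.getAsBoolean()) {
                return true;
            }
            Thread.sleep(100); // poll interval
        }
        return condition.getAsBoolean();
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate a model that becomes "deployed" roughly 300 ms after loading starts.
        long start = System.currentTimeMillis();
        boolean deployed = waitUntil(() -> System.currentTimeMillis() - start > 300, 5, TimeUnit.SECONDS);
        System.out.println(deployed); // true: condition became true before the timeout

        boolean never = waitUntil(() -> false, 1, TimeUnit.SECONDS);
        System.out.println(never); // false: timed out
    }
}
```

In the real test, the condition would be the model-state check against the `getMlModel` response shown above.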
Force-pushed 53c77f1 to 791f9f4 (Compare)
Thanks for the great feedback, @yuye-aws! I've moved the wait logic into the …
.changelog/1617.json (Outdated)
```json
{
    "category": "bug",
    "description": "Fixes a flaky test (`NeuralQueryEnricherProcessorIT`) by ensuring the test framework waits for a model to be fully deployed."
}
```
Check this PR for change log update: https://github.com/opensearch-project/neural-search/pull/1589/files
Running the CI now.
Let's make this PR concise and reduce unnecessary code changes in `NeuralQueryEnricherProcessorIT`.
```java
import lombok.SneakyThrows;
import org.opensearch.neuralsearch.stats.events.EventStatName;
import org.opensearch.neuralsearch.stats.info.InfoStatName;
import org.opensearch.ml.common.transport.model.MLModelState;
```
Where do we need these 2 imports?
Force-pushed 791f9f4 to 9f8273b (Compare)
```java
    .queryText("Hello World")
    .k(1)
    .build();
    .fieldName(TEST_KNN_VECTOR_FIELD_NAME_1)
```
Could you remove all these unnecessary changes?
You can run `./gradlew spotlessApply` to apply formatting.
```diff
 protected boolean isModelReadyForInference(@NonNull final String modelId) throws IOException, ParseException {
     MLModelState state = getModelState(modelId);
-    return MLModelState.LOADED.equals(state) || MLModelState.DEPLOYED.equals(state);
+    return MLModelState.DEPLOYED.equals(state);
 }
```
Could you add a comment on why we don't check for LOADED state?
LOADED is the same as DEPLOYED, but it was deprecated in OS 2.7. As Neural Search was introduced after OS 2.7, we can exclude that state.
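To illustrate the requested comment, here is a hedged, self-contained sketch. The `MLModelState` enum below is a stand-in containing only the states relevant to this discussion, not the real `org.opensearch.ml` class.

```java
public class ModelStateCheckSketch {
    // Stand-in for the ML Commons model-state enum; only the states
    // mentioned in this review thread are included.
    enum MLModelState { LOADED, DEPLOYED, DEPLOYING, UNDEPLOYED }

    // LOADED is equivalent to DEPLOYED but was deprecated in OpenSearch 2.7.
    // Since neural-search was introduced after 2.7, checking DEPLOYED alone
    // is sufficient, per the review discussion above.
    static boolean isModelReadyForInference(MLModelState state) {
        return MLModelState.DEPLOYED.equals(state);
    }

    public static void main(String[] args) {
        System.out.println(isModelReadyForInference(MLModelState.DEPLOYED)); // true
        System.out.println(isModelReadyForInference(MLModelState.LOADED));   // false: deprecated state excluded
    }
}
```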
Thanks for your contribution @umaarov! Added a few comments.
```java
String requestBody = Files.readString(Path.of(classLoader.getResource("processor/UploadModelRequestBody.json").toURI()));
String modelId = registerModelGroupAndUploadModel(requestBody);
loadModel(modelId);
waitForModelToBeReady(modelId);
```
Instead of L387-389 we can have:

```diff
-waitForModelToBeReady(modelId);
+loadAndWaitForModelToBeReady(modelId);
```
It seems that `loadModel` also has logic to wait for the model deployment to be ready — why doesn't that logic work? Can we just fix that part, since it already waits 30s for model deployment?
loadModel does not have that logic.
`loadModel` also waits for deployment:

```java
Map<String, Object> taskQueryResult = getTaskQueryResponse(taskId);
boolean isComplete = checkComplete(taskQueryResult);
for (int i = 0; !isComplete && i < MAX_TASK_RETRIES; i++) {
    taskQueryResult = getTaskQueryResponse(taskId);
    isComplete = checkComplete(taskQueryResult);
    Thread.sleep(DEFAULT_TASK_RESULT_QUERY_INTERVAL_IN_MILLISECOND);
}
```
It does not check the status of the model — it only checks whether the load task has completed. There might be a slight delay between task completion and the model status update, I guess?
Yeah, my point is: we already have logic here to wait for model deployment, so why not just use it to wait until the model is ready? That way we don't have to call multiple wait functions — whenever `loadModel` is called, we want the model to be both deployed and ready.
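A combined helper along the lines suggested might look like the following sketch. `getTaskState` and `getModelState` are simulated stand-ins (the real helpers poll the cluster over REST), and the simulation deliberately makes the model state lag task completion to mimic the race described above.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class LoadAndWaitSketch {
    // Shared poll counter driving the simulated cluster responses below.
    static final AtomicInteger polls = new AtomicInteger();

    static String getTaskState() {
        // Simulated: the load task completes on the second poll.
        return polls.incrementAndGet() >= 2 ? "COMPLETED" : "RUNNING";
    }

    static String getModelState() {
        // Simulated: the model state lags task completion, mimicking the race.
        return polls.incrementAndGet() >= 4 ? "DEPLOYED" : "DEPLOYING";
    }

    // Hypothetical combined helper, per the review suggestion: first wait for
    // the load task to complete, then poll the model state until DEPLOYED.
    static boolean loadAndWaitForModelToBeReady(int maxRetries) throws InterruptedException {
        for (int i = 0; i < maxRetries && !"COMPLETED".equals(getTaskState()); i++) {
            Thread.sleep(10);
        }
        // Task completion alone is not enough: also wait for the DEPLOYED state.
        for (int i = 0; i < maxRetries; i++) {
            if ("DEPLOYED".equals(getModelState())) {
                return true;
            }
            Thread.sleep(10);
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(loadAndWaitForModelToBeReady(10)); // true
    }
}
```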
```java
);
String modelId = registerModelGroupAndUploadModel(requestBody);
loadModel(modelId);
waitForModelToBeReady(modelId);
```
Same as above
```java
String requestBody = Files.readString(Path.of(classLoader.getResource("highlight/LocalQuestionAnsweringModel.json").toURI()));
String modelId = registerModelGroupAndUploadModel(requestBody);
loadModel(modelId);
waitForModelToBeReady(modelId);
```
Same as above
```java
    .queryText("Hello World")
    .k(1)
    .build();
    .fieldName(TEST_KNN_VECTOR_FIELD_NAME_1)
```
You can run `./gradlew spotlessApply` to apply formatting.
The test `testNeuralQueryEnricherProcessor_whenNoModelIdPassed_statsEnabled_thenSuccess` was failing intermittently due to a race condition where the search query was executed before the model finished deploying. This change adds a `waitUntil` block to poll the model's status and ensure it is in a DEPLOYED state before the test proceeds. The same fix was applied to other tests in the file to prevent them from failing in the future.

Fixes opensearch-project#1599

Signed-off-by: Umarov Ismoiljon <[email protected]>
Force-pushed 9f8273b to f35a142 (Compare)
@umaarov Can you update the PR according to the comments?
Hi @yuye-aws, @heemin32, @owaiskazi19 — following up on the CI failure, I manually checked the test data being ingested (…). This confirms the test assertion …. However, the terms aggregation is returning a …. Could you please advise on how to proceed? Should this PR be put on hold while this potential aggregation bug is investigated separately, or is there a way to temporarily adjust the test? The other …
Good to hear from you @umaarov. Can you first update the PR to address the code review comments?
The test testNeuralQueryEnricherProcessor_whenNoModelIdPassed_statsEnabled_thenSuccess was failing intermittently due to a race condition where the search query was executed before the model finished deploying.
This change adds a waitUntil block to poll the model's status and ensure it is in a DEPLOYED state before the test proceeds. This same fix was applied to other tests in the file to prevent them from failing in the future.
Fixes #1599
Check List
Commits are signed per the DCO using `--signoff`. By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following the Developer Certificate of Origin and signing off your commits, please check here.