
Conversation

@umaarov

@umaarov umaarov commented Oct 15, 2025

The test testNeuralQueryEnricherProcessor_whenNoModelIdPassed_statsEnabled_thenSuccess was failing intermittently due to a race condition where the search query was executed before the model finished deploying.

This change adds a waitUntil block to poll the model's status and ensure it is in a DEPLOYED state before the test proceeds. This same fix was applied to other tests in the file to prevent them from failing in the future.

Fixes #1599

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@yuye-aws yuye-aws (Member) left a comment

Can you update the changelog?

Comment on lines 55 to 67
modelId = prepareModel();
// Poll the model's status and proceed only once it reports DEPLOYED, so the
// search query is not executed before the model finishes deploying.
assertTrue("Timed out waiting for model to deploy", waitUntil(() -> {
    try {
        Response modelGetResponse = getMlModel(modelId);
        String responseBody = EntityUtils.toString(modelGetResponse.getEntity());
        return responseBody.contains("\"model_state\":\"" + MLModelState.DEPLOYED + "\"");
    } catch (Exception e) {
        // Treat transient request failures as "not deployed yet" and keep polling.
        return false;
    }
}, 60, TimeUnit.SECONDS));
Member

Can you add this to the prepareModel() function? I think there may be other models benefiting from this check.
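
A minimal sketch of that refactor, assuming prepareModel() currently uploads and loads the model (uploadAndLoadModel() is a hypothetical placeholder for those existing steps; getMlModel(), waitUntil(), and the DEPLOYED check come from the snippet above):

// Hypothetical sketch: have prepareModel() return only after the model is DEPLOYED,
// so every test that calls it gets the wait for free.
protected String prepareModel() throws Exception {
    String modelId = uploadAndLoadModel(); // placeholder for the existing upload/load steps
    assertTrue("Timed out waiting for model to deploy", waitUntil(() -> {
        try {
            Response modelGetResponse = getMlModel(modelId);
            String responseBody = EntityUtils.toString(modelGetResponse.getEntity());
            return responseBody.contains("\"model_state\":\"" + MLModelState.DEPLOYED + "\"");
        } catch (Exception e) {
            // Treat transient request failures as "not deployed yet" and keep polling.
            return false;
        }
    }, 60, TimeUnit.SECONDS));
    return modelId;
}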

@yuye-aws
Member

Thanks for creating this PR. Can you take a look at this function: isModelReadyForInference? You can try modifying it to only accept the "DEPLOYED" status. I'll trigger the workflow once my comments get addressed.

@umaarov umaarov force-pushed the fix/flaky-test-1599 branch from 53c77f1 to 791f9f4 on October 15, 2025 at 09:28
@umaarov
Author

umaarov commented Oct 15, 2025

Thanks for the great feedback, @yuye-aws! I've moved the wait logic into the BaseNeuralSearchIT helper methods and created the changelog entry. The PR should be all set now.

Comment on lines 1 to 4
{
  "category": "bug",
  "description": "Fixes a flaky test (`NeuralQueryEnricherProcessorIT`) by ensuring the test framework waits for a model to be fully deployed."
}

@yuye-aws
Member

Running the CI now

@yuye-aws yuye-aws (Member) left a comment

Let's make this PR concise and reduce unnecessary code changes in NeuralQueryEnricherProcessorIT.

import lombok.SneakyThrows;
import org.opensearch.neuralsearch.stats.events.EventStatName;
import org.opensearch.neuralsearch.stats.info.InfoStatName;
import org.opensearch.ml.common.transport.model.MLModelState;
Member

Where do we need these 2 imports?

@umaarov umaarov force-pushed the fix/flaky-test-1599 branch from 791f9f4 to 9f8273b on October 15, 2025 at 10:01
.queryText("Hello World")
.k(1)
.build();
.fieldName(TEST_KNN_VECTOR_FIELD_NAME_1)
Collaborator

Could you remove all these unnecessary changes?

Member

You can run ./gradlew spotlessApply to apply formatting.

 protected boolean isModelReadyForInference(@NonNull final String modelId) throws IOException, ParseException {
     MLModelState state = getModelState(modelId);
-    return MLModelState.LOADED.equals(state) || MLModelState.DEPLOYED.equals(state);
+    return MLModelState.DEPLOYED.equals(state);
Collaborator

Could you add a comment on why we don't check for LOADED state?

LOADED is the same as DEPLOYED, but it was deprecated in OS 2.7. As Neural Search was introduced after OS 2.7, we can exclude that state.
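
The requested comment could read roughly like this (a sketch of the resulting method, matching the diff above):

protected boolean isModelReadyForInference(@NonNull final String modelId) throws IOException, ParseException {
    MLModelState state = getModelState(modelId);
    // LOADED means the same as DEPLOYED but was deprecated in OpenSearch 2.7.
    // Neural Search was introduced after 2.7, so checking DEPLOYED alone is sufficient.
    return MLModelState.DEPLOYED.equals(state);
}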

@owaiskazi19 owaiskazi19 (Member) left a comment

Thanks for your contribution @umaarov! Added a few comments.

String requestBody = Files.readString(Path.of(classLoader.getResource("processor/UploadModelRequestBody.json").toURI()));
String modelId = registerModelGroupAndUploadModel(requestBody);
loadModel(modelId);
waitForModelToBeReady(modelId);
Member

Instead of L387-389 we can have

Suggested change:
-    waitForModelToBeReady(modelId);
+    loadAndWaitForModelToBeReady(modelId);
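
If the combined helper does not exist yet, a minimal sketch (the name is taken from the suggestion; loadModel() and waitForModelToBeReady() are the existing helpers referenced above):

// Hypothetical combined helper: load the model, then block until it is ready.
protected void loadAndWaitForModelToBeReady(String modelId) throws Exception {
    loadModel(modelId);
    waitForModelToBeReady(modelId);
}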

Collaborator

It seems that loadModel also has logic to wait for model deployment to be ready; why doesn't that logic work? Can we just fix that part, since it already waits 30s for model deployment?

Collaborator

loadModel does not have that logic.

Collaborator

loadModel also waits for deployment:

        // Poll the load task until it reports complete (or MAX_TASK_RETRIES is reached).
        Map<String, Object> taskQueryResult = getTaskQueryResponse(taskId);
        boolean isComplete = checkComplete(taskQueryResult);
        for (int i = 0; !isComplete && i < MAX_TASK_RETRIES; i++) {
            taskQueryResult = getTaskQueryResponse(taskId);
            isComplete = checkComplete(taskQueryResult);
            Thread.sleep(DEFAULT_TASK_RESULT_QUERY_INTERVAL_IN_MILLISECOND);
        }

@heemin32 heemin32 (Collaborator) Oct 29, 2025

It does not check the status of the model; it only checks whether the load task is completed. There might be a slight delay between task completion and the model status update, I guess?

Collaborator

Yeah, my point is that we already have logic here to wait for model deployment, so why not just use it to wait until the model is ready? That way we don't have to call multiple wait functions, since whenever loadModel is called we want the model to be both deployed and ready.
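
One way to act on that, sketched under the assumption that the polling loop above lives in loadModel() and that getModelState() is reachable there: tighten the loop's exit condition so it also requires the DEPLOYED state.

// Hypothetical tweak to the existing loop: the model counts as ready only when the
// load task is complete AND the model itself reports DEPLOYED, closing the gap
// between task completion and the model-state update.
boolean isReady = checkComplete(taskQueryResult) && MLModelState.DEPLOYED.equals(getModelState(modelId));
for (int i = 0; !isReady && i < MAX_TASK_RETRIES; i++) {
    taskQueryResult = getTaskQueryResponse(taskId);
    isReady = checkComplete(taskQueryResult) && MLModelState.DEPLOYED.equals(getModelState(modelId));
    Thread.sleep(DEFAULT_TASK_RESULT_QUERY_INTERVAL_IN_MILLISECOND);
}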

);
String modelId = registerModelGroupAndUploadModel(requestBody);
loadModel(modelId);
waitForModelToBeReady(modelId);
Member

Same as above

String requestBody = Files.readString(Path.of(classLoader.getResource("highlight/LocalQuestionAnsweringModel.json").toURI()));
String modelId = registerModelGroupAndUploadModel(requestBody);
loadModel(modelId);
waitForModelToBeReady(modelId);
Member

Same as above

.queryText("Hello World")
.k(1)
.build();
.fieldName(TEST_KNN_VECTOR_FIELD_NAME_1)
Member

You can run ./gradlew spotlessApply to apply formatting.

@umaarov umaarov force-pushed the fix/flaky-test-1599 branch from 9f8273b to f35a142 on October 18, 2025 at 10:25
@yuye-aws
Member

@umaarov Can you update the PR according to the comments?

@umaarov
Author

umaarov commented Oct 29, 2025

Hi @yuye-aws, @heemin32, @owaiskazi19,

Following up on the CI failure AssertionError: expected:<42> but was:<84> in HybridQueryAggregationsIT:

I manually checked the test data being ingested (processor/ingest_bulk.json). There are exactly 42 documents that match the filter criteria (actor: "anil" and imdb between 1.0 and 10.0).

This confirms the test assertion assertEquals(42, ...) is correct.

However, the terms aggregation is returning a doc_count of 84 for the "anil" bucket in this specific multi-shard test (testNestedAggs_whenMultipleShardsAndConcurrentSearchDisabled_thenSuccessful). This doubling suggests a potential bug in the aggregation calculation or merging process when used with hybrid queries across multiple shards, which seems unrelated to my PR's changes regarding model loading waits.

Could you please advise on how to proceed? Should this PR be put on hold while this potential aggregation bug is investigated separately, or is there a way to temporarily adjust the test?

The other Timeout and Connection refused errors also occurred in the latest run.

@yuye-aws
Member

Good to hear from you, @umaarov. Can you first update the PR to address the code review comments? HybridQueryAggregationsIT is another problem; we can get it fixed in another PR.


Labels

bug (Something isn't working), flaky-test

Development

Successfully merging this pull request may close these issues:

[BUG] testNeuralQueryEnricherProcessor_whenNoModelIdPassed_statsEnabled_thenSuccess is failing intermittently

5 participants