
Conversation

@umaarov

@umaarov umaarov commented Oct 15, 2025

The test testNeuralQueryEnricherProcessor_whenNoModelIdPassed_statsEnabled_thenSuccess was failing intermittently due to a race condition where the search query was executed before the model finished deploying.

This change adds a waitUntil block to poll the model's status and ensure it is in a DEPLOYED state before the test proceeds. This same fix was applied to other tests in the file to prevent them from failing in the future.

Fixes #1599

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@yuye-aws yuye-aws (Member) left a comment

Can you update the changelog?

Comment on lines 55 to 67
modelId = prepareModel();
// Poll the model's status and proceed only once it reports DEPLOYED, so the
// search query is not executed before the model finishes deploying.
assertTrue("Timed out waiting for model to deploy", waitUntil(() -> {
    try {
        Response modelGetResponse = getMlModel(modelId);
        String responseBody = EntityUtils.toString(modelGetResponse.getEntity());
        return responseBody.contains("\"model_state\":\"" + MLModelState.DEPLOYED + "\"");
    } catch (Exception e) {
        // Treat transient request failures as "not deployed yet" and keep polling.
        return false;
    }
}, 60, TimeUnit.SECONDS));
Member

Can you add this to the prepareModel() function? I think there may be other models benefiting from this check.
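
A minimal sketch of that refactor, assuming prepareModel() currently uploads and loads the model (uploadAndLoadModel() is a hypothetical placeholder for those existing steps; getMlModel(), waitUntil(), and the DEPLOYED check come from the snippet above):

// Hypothetical sketch: have prepareModel() return only after the model is DEPLOYED,
// so every test that calls it gets the wait for free.
protected String prepareModel() throws Exception {
    String modelId = uploadAndLoadModel(); // placeholder for the existing upload/load steps
    assertTrue("Timed out waiting for model to deploy", waitUntil(() -> {
        try {
            Response modelGetResponse = getMlModel(modelId);
            String responseBody = EntityUtils.toString(modelGetResponse.getEntity());
            return responseBody.contains("\"model_state\":\"" + MLModelState.DEPLOYED + "\"");
        } catch (Exception e) {
            // Treat transient request failures as "not deployed yet" and keep polling.
            return false;
        }
    }, 60, TimeUnit.SECONDS));
    return modelId;
}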

@yuye-aws
Member

Thanks for creating this PR. Can you take a look at this function: isModelReadyForInference? You can try modifying it to only accept the "DEPLOYED" status. I'll trigger the workflow once my comments get addressed.

@umaarov umaarov force-pushed the fix/flaky-test-1599 branch from 53c77f1 to 791f9f4 on October 15, 2025 at 09:28
@umaarov
Author

umaarov commented Oct 15, 2025

Thanks for the great feedback, @yuye-aws! I've moved the wait logic into the BaseNeuralSearchIT helper methods and created the changelog entry. The PR should be all set now.

Comment on lines 1 to 4
{
  "category": "bug",
  "description": "Fixes a flaky test (`NeuralQueryEnricherProcessorIT`) by ensuring the test framework waits for a model to be fully deployed."
}

@yuye-aws
Member

Running the CI now

@yuye-aws yuye-aws (Member) left a comment

Let's make this PR concise and reduce unnecessary code changes in NeuralQueryEnricherProcessorIT.

import lombok.SneakyThrows;
import org.opensearch.neuralsearch.stats.events.EventStatName;
import org.opensearch.neuralsearch.stats.info.InfoStatName;
import org.opensearch.ml.common.transport.model.MLModelState;
Member

Where do we need these 2 imports?

@umaarov umaarov force-pushed the fix/flaky-test-1599 branch from 791f9f4 to 9f8273b on October 15, 2025 at 10:01
.queryText("Hello World")
.k(1)
.build();
.fieldName(TEST_KNN_VECTOR_FIELD_NAME_1)
Collaborator

Could you remove all these unnecessary changes?

Member

You can run ./gradlew spotlessApply to apply formatting.

 protected boolean isModelReadyForInference(@NonNull final String modelId) throws IOException, ParseException {
     MLModelState state = getModelState(modelId);
-    return MLModelState.LOADED.equals(state) || MLModelState.DEPLOYED.equals(state);
+    return MLModelState.DEPLOYED.equals(state);
Collaborator

Could you add a comment on why we don't check for LOADED state?

LOADED is the same as DEPLOYED, but it was deprecated in OS 2.7. As Neural Search was introduced after OS 2.7, we can exclude that state.
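
The requested comment could read roughly like this (a sketch of the resulting method, matching the diff above):

protected boolean isModelReadyForInference(@NonNull final String modelId) throws IOException, ParseException {
    MLModelState state = getModelState(modelId);
    // LOADED means the same as DEPLOYED but was deprecated in OpenSearch 2.7.
    // Neural Search was introduced after 2.7, so checking DEPLOYED alone is sufficient.
    return MLModelState.DEPLOYED.equals(state);
}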

@owaiskazi19 owaiskazi19 (Member) left a comment

Thanks for your contribution @umaarov! Added a few comments.

String requestBody = Files.readString(Path.of(classLoader.getResource("processor/UploadModelRequestBody.json").toURI()));
String modelId = registerModelGroupAndUploadModel(requestBody);
loadModel(modelId);
waitForModelToBeReady(modelId);
Member

Instead of L387-389 we can have

Suggested change:
-    waitForModelToBeReady(modelId);
+    loadAndWaitForModelToBeReady(modelId);
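
If the combined helper does not exist yet, a minimal sketch (the name is taken from the suggestion; loadModel() and waitForModelToBeReady() are the existing helpers referenced above):

// Hypothetical combined helper: load the model, then block until it is ready.
protected void loadAndWaitForModelToBeReady(String modelId) throws Exception {
    loadModel(modelId);
    waitForModelToBeReady(modelId);
}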

Collaborator

It seems that loadModel also has logic to wait for model deployment to be ready; why doesn't that logic work? Can we just fix that part, since it already waits 30s for model deployment?

Collaborator

loadModel does not have that logic.

Collaborator

loadModel also waits for deployment:

        // Poll the load task until it reports complete (or MAX_TASK_RETRIES is reached).
        Map<String, Object> taskQueryResult = getTaskQueryResponse(taskId);
        boolean isComplete = checkComplete(taskQueryResult);
        for (int i = 0; !isComplete && i < MAX_TASK_RETRIES; i++) {
            taskQueryResult = getTaskQueryResponse(taskId);
            isComplete = checkComplete(taskQueryResult);
            Thread.sleep(DEFAULT_TASK_RESULT_QUERY_INTERVAL_IN_MILLISECOND);
        }

@heemin32 heemin32 (Collaborator) Oct 29, 2025

It does not check the status of the model; it only checks whether the load task is completed. There might be a slight delay between task completion and the model status update, I guess?

Collaborator

Yeah, my point is that we already have logic here to wait for model deployment, so why not just use it to wait until the model is ready? That way we don't have to call multiple wait functions, since whenever loadModel is called we want the model to be both deployed and ready.
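
One way to act on that, sketched under the assumption that the polling loop above lives in loadModel() and that getModelState() is reachable there: tighten the loop's exit condition so it also requires the DEPLOYED state.

// Hypothetical tweak to the existing loop: the model counts as ready only when the
// load task is complete AND the model itself reports DEPLOYED, closing the gap
// between task completion and the model-state update.
boolean isReady = checkComplete(taskQueryResult) && MLModelState.DEPLOYED.equals(getModelState(modelId));
for (int i = 0; !isReady && i < MAX_TASK_RETRIES; i++) {
    taskQueryResult = getTaskQueryResponse(taskId);
    isReady = checkComplete(taskQueryResult) && MLModelState.DEPLOYED.equals(getModelState(modelId));
    Thread.sleep(DEFAULT_TASK_RESULT_QUERY_INTERVAL_IN_MILLISECOND);
}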

);
String modelId = registerModelGroupAndUploadModel(requestBody);
loadModel(modelId);
waitForModelToBeReady(modelId);
Member

Same as above

String requestBody = Files.readString(Path.of(classLoader.getResource("highlight/LocalQuestionAnsweringModel.json").toURI()));
String modelId = registerModelGroupAndUploadModel(requestBody);
loadModel(modelId);
waitForModelToBeReady(modelId);
Member

Same as above

.queryText("Hello World")
.k(1)
.build();
.fieldName(TEST_KNN_VECTOR_FIELD_NAME_1)
Member

You can run ./gradlew spotlessApply to apply formatting.

@umaarov umaarov force-pushed the fix/flaky-test-1599 branch from 9f8273b to f35a142 on October 18, 2025 at 10:25
@yuye-aws
Member

@umaarov Can you update the PR according to the comments?

@umaarov
Author

umaarov commented Oct 29, 2025

Hi @yuye-aws, @heemin32, @owaiskazi19,

Following up on the CI failure AssertionError: expected:<42> but was:<84> in HybridQueryAggregationsIT:

I manually checked the test data being ingested (processor/ingest_bulk.json). There are exactly 42 documents that match the filter criteria (actor: "anil" and imdb between 1.0 and 10.0).

This confirms the test assertion assertEquals(42, ...) is correct.

However, the terms aggregation is returning a doc_count of 84 for the "anil" bucket in this specific multi-shard test (testNestedAggs_whenMultipleShardsAndConcurrentSearchDisabled_thenSuccessful). This doubling suggests a potential bug in the aggregation calculation or merging process when used with hybrid queries across multiple shards, which seems unrelated to my PR's changes regarding model loading waits.

Could you please advise on how to proceed? Should this PR be put on hold while this potential aggregation bug is investigated separately, or is there a way to temporarily adjust the test?

The other Timeout and Connection refused errors also occurred in the latest run.

@yuye-aws
Member

Good to hear from you, @umaarov. Can you first update the PR to address the code review comments? HybridQueryAggregationsIT is another problem; we can get it fixed in another PR.


Labels

bug (Something isn't working), flaky-test

Development

Successfully merging this pull request may close these issues:

[BUG] testNeuralQueryEnricherProcessor_whenNoModelIdPassed_statsEnabled_thenSuccess is failing intermittently

5 participants