global semantic search: neural sparse search #10696

s-zx · 2025-10-09T08:01:23Z

Description

Neural Sparse Search offers a lightweight yet effective approach for semantic search by representing text as sparse vectors where most elements are zero. This method bridges the gap between traditional keyword matching and dense neural embeddings.

Neural Sparse Search works in two phases:

- Document Processing: Documents are tokenized and converted into sparse vector representations where only meaningful tokens have non-zero values.
- Query Processing: User queries undergo the same tokenization process, creating sparse vectors that can be efficiently compared with document vectors.
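
For readers unfamiliar with the approach, here is a minimal sketch of how two sparse vectors can be compared. It is not the code in this PR: the `SparseVector` type and the `dotProduct` and `rankDocuments` helpers are hypothetical names chosen for illustration, and it assumes each vector is stored as a token-to-weight map.

```typescript
// A sparse vector keeps only non-zero entries: token -> weight.
// (Hypothetical shape for illustration; the PR's stored format may differ.)
type SparseVector = Record<string, number>;

// Dot product over the tokens both vectors share.
// Iterating the smaller vector keeps the cost proportional to its size.
function dotProduct(a: SparseVector, b: SparseVector): number {
  const [small, large] =
    Object.keys(a).length <= Object.keys(b).length ? [a, b] : [b, a];
  let score = 0;
  for (const [token, weight] of Object.entries(small)) {
    // Own-property check so tokens such as "constructor" are not
    // resolved through the object prototype.
    if (Object.prototype.hasOwnProperty.call(large, token)) {
      score += weight * large[token];
    }
  }
  return score;
}

interface ScoredDoc {
  id: string;
  score: number;
}

// Score every document against the query and return the best matches.
function rankDocuments(
  query: SparseVector,
  docs: Record<string, SparseVector>,
  topK = 5
): ScoredDoc[] {
  return Object.entries(docs)
    .map(([id, vec]) => ({ id, score: dotProduct(query, vec) }))
    .filter((doc) => doc.score > 0)
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```

Documents that share no token with the query score zero and are dropped, which is what makes the comparison cheap when most dimensions are empty.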

Issues Resolved

Screenshot

Testing the changes

Changelog

Check List

All tests pass
- yarn test:jest
- yarn test:jest_integration
New functionality includes testing.
New functionality has been documented.
Update CHANGELOG.md
Commits are signed per the DCO using --signoff

github-actions · 2025-10-09T08:01:55Z

❌ Empty Changelog Section

The Changelog section in your PR description is empty. Please add a valid changelog entry or entries. If you did add a changelog entry, check to make sure that it was not accidentally included inside the comment block in the Changelog section.

codecov · 2025-10-09T08:22:54Z

Codecov Report

❌ Patch coverage is 17.80822% with 60 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.77%. Comparing base (dfc3f48) to head (91d6063).

Files with missing lines	Patch %	Lines
src/plugins/workspace/server/sparse_search.ts	0.00%	32 Missing ⚠️
src/plugins/workspace/server/routes/index.ts	5.88%	16 Missing ⚠️
src/core/public/chrome/utils.ts	73.33%	4 Missing ⚠️
...core/public/chrome/ui/header/header_search_bar.tsx	0.00%	3 Missing ⚠️
...c/chrome/ui/global_search/search_pages_command.tsx	0.00%	2 Missing ⚠️
src/core/public/chrome/chrome_service.tsx	0.00%	1 Missing ⚠️
src/core/public/core_system.ts	0.00%	1 Missing ⚠️
.../components/global_search/search_pages_command.tsx	50.00%	1 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (dfc3f48) and HEAD (91d6063).

HEAD has 4 uploads less than BASE

Flag	BASE (dfc3f48)	HEAD (91d6063)
Linux_2	1	0
Linux_1	1	0
Linux_3	1	0
Windows_2	1	0

Additional details and impacted files

@@                               Coverage Diff                                @@
##           feature/global-semantic-search-neural-sparse   #10696      +/-   ##
================================================================================
- Coverage                                         60.25%   52.77%   -7.49%     
================================================================================
  Files                                              4385     4099     -286     
  Lines                                            116753   112834    -3919     
  Branches                                          19010    18387     -623     
================================================================================
- Hits                                              70346    59543   -10803     
- Misses                                            41568    48981    +7413     
+ Partials                                           4839     4310     -529

Flag	Coverage Δ
Linux_1	`?`
Linux_2	`?`
Linux_3	`?`
Linux_4	`32.59% <0.00%> (-0.01%)`	⬇️
Windows_1	`26.63% <17.80%> (-0.04%)`	⬇️
Windows_2	`?`
Windows_3	`38.42% <0.00%> (-0.01%)`	⬇️
Windows_4	`32.59% <0.00%> (-0.01%)`	⬇️

FriedhelmWS · 2025-10-14T08:04:43Z

src/core/public/chrome/ui/header/header_search_bar.tsx

  );

+  const debouncedSearch = useMemo(() => {
+    return debounce(search, 500); // 300ms delay, adjust as needed


Nit: the comment needs updating. It says 300ms, but the delay passed to debounce is 500.

Thanks for the reminder!
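
One way to keep the comment and the value from drifting apart is to name the delay. This is only a sketch: `SEARCH_DEBOUNCE_MS` and the placeholder `search` function are hypothetical, and the small `debounce` below is a stand-in written for this example so that it runs on its own, not the helper the PR imports.

```typescript
// Hypothetical constant: the name documents the unit, so no inline
// comment is needed to explain the number.
const SEARCH_DEBOUNCE_MS = 500;

// Stand-in trailing-edge debounce for illustration only.
function debounce<A extends unknown[]>(
  fn: (...args: A) => void,
  waitMs: number
): (...args: A) => void {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: A) => {
    // Restart the countdown on every call; only the last call fires.
    if (timer !== undefined) clearTimeout(timer);
    timer = setTimeout(() => fn(...args), waitMs);
  };
}

// Placeholder for the real search handler.
function search(query: string): void {
  console.log('searching for', query);
}

const debouncedSearch = debounce(search, SEARCH_DEBOUNCE_MS);
```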

yuye-aws · 2025-10-14T08:05:18Z

src/plugins/workspace/server/sparse_search.ts

+    const nlpBertTokenizer = new BertWordPieceTokenizer({ vocabContent: Object.keys(this.vocab) });
+    const tokenizedResult = nlpBertTokenizer.tokenizeSentence(query);
+    const tokensArray = tokenizedResult.tokens;
+    console.log('Tokenization: ', tokensArray);
+
+    const queryVec = this.buildQueryVector(tokensArray);
+    console.log('Non-zero query dimensions count: ', Object.keys(queryVec).length);
+    console.log('Non-zero query vector: ', queryVec);


Neural sparse search supports natural-language queries. You can search with an existing model_id.

Cool! We will try this approach in the future.

Maybe you can provide some context on why you're doing tokenization here? At first glance, you're tokenizing to obtain the sparse query vector with IDF values. This is the doc-only search mode: https://docs.opensearch.org/latest/query-dsl/specialized/neural-sparse/

Yes, you are right. We want to use doc-only neural sparse search to achieve semantic search in the frontend without using a backend service. We generate the document vectors in advance and store them on the frontend side. Then we tokenize the query to obtain the sparse query vector with IDF values. After that, relevance is calculated as the dot product between the query and document vectors.

Doc-only queries support natural-language queries.

I see. That way we don't need tokenization anymore.
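
For reference, a sketch of what the request body for such a query could look like, based on the `neural_sparse` query described in the docs linked above (`query_text` plus `model_id` under the vector field name). The `buildNeuralSparseQuery` helper and the field name and model id used with it are placeholders, not values from this PR.

```typescript
// Shape of a neural_sparse search body that sends raw query text and
// lets the cluster turn it into sparse tokens using the given model.
interface NeuralSparseQueryBody {
  query: {
    neural_sparse: Record<string, { query_text: string; model_id: string }>;
  };
}

// Hypothetical helper: vectorField is the field holding the document
// sparse vectors, modelId is an already-deployed model or tokenizer id.
function buildNeuralSparseQuery(
  vectorField: string,
  queryText: string,
  modelId: string
): NeuralSparseQueryBody {
  return {
    query: {
      neural_sparse: {
        [vectorField]: { query_text: queryText, model_id: modelId },
      },
    },
  };
}

// Placeholder values for illustration only.
const exampleBody = buildNeuralSparseQuery(
  'description_embedding',
  'where can I explore my logs',
  '<model_id>'
);
console.log(JSON.stringify(exampleBody, null, 2));
```

Note that this moves scoring to the cluster, which is a different trade-off from the frontend-only approach described above.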

FriedhelmWS · 2025-10-14T08:08:14Z

src/core/public/chrome/utils.ts

  });
+
+  try {
+    console.log('All links:', allSearchAbleLinks);


Nit: allSearchableLinks (lowercase "a" in "able").

FriedhelmWS · 2025-10-14T08:17:31Z

src/plugins/discover/public/plugin.ts

    core.application.register({
      id: PLUGIN_ID,
      title: 'Discover',
+      description: 'Analyze your data in OpenSearch and visualize key metrics.',


Just out of curiosity, would semantic search be more accurate if we had a more explanatory and detailed description for each application?

Yes, it will improve the relevance. We can enrich them in the future.

FriedhelmWS · 2025-10-14T08:30:33Z

src/plugins/workspace/server/routes/index.ts

  ...workspaceOptionalAttributesSchema,
 });

+let jsonData: {


Is it safe to make this a global variable in Node.js? Would it be better to create a class, make jsonData a private field of that class, and provide dedicated methods to manipulate it?

Yes, this is a valid concern. We should make it private.
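
A possible shape for that class, as a sketch only. `LazyJsonStore` and its `get` method are hypothetical names, the loader is injected so the class does not assume where the data comes from, and the type parameter stands in for the jsonData shape, which is not shown in full in this hunk.

```typescript
// Holds lazily loaded data behind a private field instead of a
// module-level variable.
class LazyJsonStore<T> {
  // Caching the promise (not the value) lets concurrent first
  // requests share a single load.
  private pending?: Promise<T>;
  private readonly load: () => Promise<T>;

  constructor(load: () => Promise<T>) {
    this.load = load;
  }

  public get(): Promise<T> {
    if (!this.pending) {
      this.pending = this.load().catch((err) => {
        // Forget the failed attempt so a later request can retry.
        this.pending = undefined;
        throw err;
      });
    }
    return this.pending;
  }
}
```

In the route handler, the `if (!jsonData)` check would then become a single `await store.get()` call.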

FriedhelmWS · 2025-10-14T08:31:01Z

src/plugins/workspace/server/routes/index.ts

+
+        if (!jsonData) {
+          const filePath = path.join(__dirname, 'doc_vectors.json');
+          const data = await readFile(filePath, 'utf8');


It would be nice to have error handling for the file read here.

We already have a try/catch here, and I think it will handle the error.
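
If more specific errors are wanted, the read and the parse can be separated so the logs show which step failed. A minimal sketch, assuming Node's `fs/promises`; the `readJsonFile` helper is a hypothetical name and is not part of this PR.

```typescript
import { readFile } from 'fs/promises';

// Reads and parses a JSON file, reporting read failures (missing file,
// permissions) separately from malformed content.
async function readJsonFile(filePath: string): Promise<unknown> {
  let raw: string;
  try {
    raw = await readFile(filePath, 'utf8');
  } catch (err) {
    throw new Error(`Failed to read ${filePath}: ${(err as Error).message}`);
  }
  try {
    return JSON.parse(raw);
  } catch (err) {
    throw new Error(`Invalid JSON in ${filePath}: ${(err as Error).message}`);
  }
}
```

The surrounding try/catch in the route would still catch both cases; the difference is only in how informative the message is.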

github-actions bot added the repeat-contributor label Oct 9, 2025

github-actions bot added the failed changeset label Oct 9, 2025

Merge branch 'feature/global-semantic-search-neural-sparse' into spar…

1b9340b

…se-search Signed-off-by: Zhenxing Shen <[email protected]>

s-zx force-pushed the sparse-search branch from 5d3b696 to 1b9340b Compare October 13, 2025 02:08

s-zx changed the title ~~Sparse search~~ global semantic search: neural sparse Oct 13, 2025

s-zx changed the title ~~global semantic search: neural sparse~~ global semantic search: neural sparse search Oct 13, 2025

remove unnecessary code

91d6063

Signed-off-by: Zhenxing Shen <[email protected]>

FriedhelmWS reviewed Oct 14, 2025

yuye-aws reviewed Oct 14, 2025

FriedhelmWS reviewed Oct 14, 2025
