global semantic search: neural sparse search #10696
| ❌ Empty Changelog Section: The Changelog section in your PR description is empty. Please add a valid changelog entry or entries. If you did add a changelog entry, check to make sure that it was not accidentally included inside the comment block in the Changelog section. | 
| Codecov Report: ❌ Patch coverage is  
Additional details and impacted files:
@@                               Coverage Diff                                @@
##           feature/global-semantic-search-neural-sparse   #10696      +/-   ##
================================================================================
- Coverage                                         60.25%   52.77%   -7.49%     
================================================================================
  Files                                              4385     4099     -286     
  Lines                                            116753   112834    -3919     
  Branches                                          19010    18387     -623     
================================================================================
- Hits                                              70346    59543   -10803     
- Misses                                            41568    48981    +7413     
+ Partials                                           4839     4310     -529     
 | 
…se-search Signed-off-by: Zhenxing Shen <[email protected]>
Signed-off-by: Zhenxing Shen <[email protected]>
| ); | ||
|  | ||
| const debouncedSearch = useMemo(() => { | ||
| return debounce(search, 500); // 300ms delay, adjust as needed | 
Nit: the comment needs updating (the delay is 500 ms, not 300 ms).
Thanks for reminding!
| const nlpBertTokenizer = new BertWordPieceTokenizer({ vocabContent: Object.keys(this.vocab) }); | ||
| const tokenizedResult = nlpBertTokenizer.tokenizeSentence(query); | ||
| const tokensArray = tokenizedResult.tokens; | ||
| console.log('Tokenization: ', tokensArray); | ||
|  | ||
| const queryVec = this.buildQueryVector(tokensArray); | ||
| console.log('Non-zero query dimensions count: ', Object.keys(queryVec).length); | ||
| console.log('Non-zero query vector: ', queryVec); | 
Neural sparse search supports natural-language queries. You can search with an existing model_id.
Cool! We will try this way in the future.
Maybe you can provide some context on why you're doing tokenization here? At first glance, you're doing tokenization to obtain the sparse query vector with IDF value. This is the doc-only search mode: https://docs.opensearch.org/latest/query-dsl/specialized/neural-sparse/
Yes, you are right. We want to use doc-only neural sparse search to provide semantic search in the frontend without a backend service. We generate the document vectors in advance and store them on the frontend side. We then tokenize the query to obtain a sparse query vector weighted by IDF values. Relevance is then calculated as the dot product between the query vector and each document vector.
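For readers following the thread, the scoring step described above can be sketched roughly as below. This is only an illustration under assumptions, not the code in this PR: the `SparseVector` type, the IDF lookup table, `dotProduct`, `rankDocuments`, and this body for `buildQueryVector` are all hypothetical, and the real vocabulary and document vectors come from the precomputed files.

```typescript
// A sparse vector maps a token to its weight; missing tokens are implicitly zero.
// (Hypothetical type, chosen for this sketch.)
type SparseVector = Map<string, number>;

// Assumed behavior: each known query token gets its IDF weight; unknown tokens are dropped.
function buildQueryVector(tokens: string[], idf: SparseVector): SparseVector {
  const queryVec: SparseVector = new Map();
  for (const token of tokens) {
    const weight = idf.get(token);
    if (weight !== undefined) {
      queryVec.set(token, weight);
    }
  }
  return queryVec;
}

// Dot product over the query's non-zero dimensions only.
function dotProduct(query: SparseVector, doc: SparseVector): number {
  let score = 0;
  query.forEach((weight, token) => {
    const docWeight = doc.get(token);
    if (docWeight !== undefined) {
      score += weight * docWeight;
    }
  });
  return score;
}

// Score every document, drop non-matches, and sort by descending relevance.
function rankDocuments(
  query: SparseVector,
  docs: Array<{ id: string; vector: SparseVector }>
): Array<{ id: string; score: number }> {
  return docs
    .map((doc) => ({ id: doc.id, score: dotProduct(query, doc.vector) }))
    .filter((result) => result.score > 0)
    .sort((a, b) => b.score - a.score);
}
```

Iterating over the query's entries keeps the cost proportional to the number of non-zero query dimensions, which is typically small for a short search string.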
Doc-only queries support natural-language queries.
I see, this way we don't need tokenization anymore.
| }); | ||
|  | ||
| try { | ||
| console.log('All links:', allSearchAbleLinks); | 
Nit: AllSearchableLinks.
Thanks!
| core.application.register({ | ||
| id: PLUGIN_ID, | ||
| title: 'Discover', | ||
| description: 'Analyze your data in OpenSearch and visualize key metrics.', | 
Just out of curiosity: would semantic search return more accurate results if each application had a more explanatory, detailed description?
Yes, it will improve the relevance. We can enrich them in the future.
| ...workspaceOptionalAttributesSchema, | ||
| }); | ||
|  | ||
| let jsonData: { | 
Is it safe to make this a global variable in Node.js? Would it be better to create a class, make jsonData a private field of that class, and provide dedicated methods to manipulate it?
Yes, this is a good concern. We should make it private.
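One possible shape for the class suggested above is sketched here. Everything in it is an assumption for illustration: the `DocVectors` shape, the `DocVectorStore` name, and the injected `load` callback are hypothetical and not part of this PR. The store caches the in-flight promise so that concurrent callers trigger only one load, and it clears the cache on failure so a later call can retry.

```typescript
// Assumed shape of the precomputed data: document id -> (token -> weight).
type DocVectors = Record<string, Record<string, number>>;

class DocVectorStore {
  // Private cache: nothing outside the class can read or mutate it directly.
  private pending: Promise<DocVectors> | undefined;
  private readonly load: () => Promise<DocVectors>;

  // The loader is injected so the store can be tested without touching the file system.
  constructor(load: () => Promise<DocVectors>) {
    this.load = load;
  }

  // Returns the cached data, loading it on first use.
  get(): Promise<DocVectors> {
    if (this.pending === undefined) {
      this.pending = this.load().catch((err: unknown) => {
        // Forget the failed attempt so the next call retries the load.
        this.pending = undefined;
        throw err;
      });
    }
    return this.pending;
  }

  // Drops the cached data, for example after the vector file is regenerated.
  reset(): void {
    this.pending = undefined;
  }
}
```

A route handler could then hold a single `DocVectorStore` instance and call `await store.get()` per request instead of checking a module-level `jsonData` variable.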
|  | ||
| if (!jsonData) { | ||
| const filePath = path.join(__dirname, 'doc_vectors.json'); | ||
| const data = await readFile(filePath, 'utf8'); | 
Would be nice to have error handling for file reading here I guess.
We already have a try/catch here; I think it will handle the error.
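If more specific errors turn out to be useful, a dedicated loader could separate read failures from malformed JSON. The sketch below is a hypothetical illustration, not the PR's code: `loadDocVectors` and `parseDocVectors` are invented names, and the only library call used is `readFile` from Node's `fs/promises`, as in the snippet above.

```typescript
import { readFile } from "node:fs/promises";

// Parses the raw file content and checks that it is a plain JSON object.
// `source` is only used to make error messages easier to trace.
function parseDocVectors(raw: string, source: string): Record<string, unknown> {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    throw new Error(`Invalid JSON in ${source}: ${reason}`);
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error(`Expected a JSON object in ${source}`);
  }
  return parsed as Record<string, unknown>;
}

// Reads the vector file and reports read failures (missing file, permissions)
// separately from parse failures.
async function loadDocVectors(filePath: string): Promise<Record<string, unknown>> {
  let raw: string;
  try {
    raw = await readFile(filePath, "utf8");
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    throw new Error(`Failed to read ${filePath}: ${reason}`);
  }
  return parseDocVectors(raw, filePath);
}
```

The existing outer try/catch would still catch these errors; the difference is only that the message says which step failed and for which file.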
Description
Neural Sparse Search offers a lightweight yet effective approach for semantic search by representing text as sparse vectors where most elements are zero. This method bridges the gap between traditional keyword matching and dense neural embeddings.
Neural Sparse Search works in two phases:
Issues Resolved
Screenshot
Testing the changes
Changelog
Check List
yarn test:jest
yarn test:jest_integration