CNDB-15640: Determine if vectors are unit length at insert #2059
base: main
Conversation
Checklist before you submit for review
❌ Build ds-cassandra-pr-gate/PR-2059 rejected by Butler: 1 regression found (1 new test failure; no known test failures found).
What happens if the user declares that their embeddings use a source_model with unit-length vectors, but this is not really the case?
This may happen by mistake, or when people are experimenting with the API.
If a user configures the wrong source model, we'll do other things wrong too. We use the source model to apply optimizations that are only valid because we know which model produced the embeddings. I personally do not think we should plan for such scenarios, but if that is required, we can remove the model-based optimization for the unit-length check and just run the check on every inserted vector.
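For illustration, such a per-vector check could look roughly like the sketch below. This is a minimal sketch, not the PR's actual code; the `UnitLengthCheck` class name and the `EPSILON` tolerance are assumptions.

```java
// Minimal sketch, not the PR's actual code: a per-vector unit-length check
// with a tolerance for floating-point error. EPSILON is an assumed value.
final class UnitLengthCheck
{
    private static final float EPSILON = 1e-4f;

    static boolean isUnitLength(float[] vector)
    {
        double squaredSum = 0;
        for (float v : vector)
            squaredSum += (double) v * v;
        // A unit vector has L2 norm 1, hence squared norm 1 as well.
        return Math.abs(squaredSum - 1.0) < EPSILON;
    }
}
```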
What is the issue
Fixes: https://github.com/riptano/cndb/issues/15640
What does this PR fix and why was it fixed
In order to lay the groundwork for Fused ADC, I want to refactor some of the PQ/BQ logic. The unit-length computation needs to move, so I decided to split it out into its own PR.
The core idea is that we determine at insert time whether every vector is unit length, so that result can be passed to the `writePQ` method as needed.
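A rough illustration of that idea, assuming a per-index flag that downstream code can read; the class and method names are hypothetical, and `UnitLengthCheck` is the helper sketched above:

```java
// Hypothetical sketch of how an index writer might track unit-length status
// across inserts; field and method names are illustrative, not from the PR.
final class UnitLengthTracker
{
    private boolean allUnitLength = true;

    void onInsert(float[] vector)
    {
        // One non-unit vector is enough to disqualify the whole index;
        // downstream code (e.g. writePQ) reads the final flag.
        if (allUnitLength && !UnitLengthCheck.isUnitLength(vector))
            allUnitLength = false;
    }

    boolean allUnitLength()
    {
        return allUnitLength;
    }
}
```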
Embedding normalization notes

(I asked ChatGPT to provide proof for the config changes proposed in this PR. Here is its generated description.)
Quick rundown of which models spit out normalized vectors (so cosine == dot product: cos(a, b) = a·b / (‖a‖‖b‖), and for unit vectors the denominator is 1):
Models that end with a `Normalize` layer are fine; vanilla BERT doesn't.

TL;DR: OpenAI + Gecko are definitely safe; Cohere/BERT/NV need manual normalization due to lack of documentation.
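For the models that need it, manual normalization before insert is straightforward. A sketch under the assumption that embeddings arrive as `float[]`; the `Normalizer` class is hypothetical:

```java
// Hypothetical sketch: client-side L2 normalization before insert, for models
// (e.g. vanilla BERT) that do not emit unit-length embeddings.
final class Normalizer
{
    static float[] normalize(float[] vector)
    {
        double squaredSum = 0;
        for (float v : vector)
            squaredSum += (double) v * v;
        double norm = Math.sqrt(squaredSum);
        if (norm == 0)
            return vector; // a zero vector cannot be normalized
        float[] out = new float[vector.length];
        for (int i = 0; i < vector.length; i++)
            out[i] = (float) (vector[i] / norm);
        return out;
    }
}
```

After normalizing like this, cosine and dot-product comparisons agree, which is exactly the unit-length property the insert-time check looks for.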