CNDB-15640: Determine if vectors are unit length at insert #2059
base: main
Conversation
Checklist before you submit for review
❌ Build ds-cassandra-pr-gate/PR-2059 rejected by Butler: 1 regression found (1 new test failure; no known test failures found).
What happens if the user declares that their embeddings use a source_model with unit-length vectors, but this is not really the case?
This may happen by mistake, or when people are experimenting with the API.
If a user configures the wrong source model, we'll do other things wrong too. We use the source model to apply optimizations that are only valid because we know which model produced the embeddings. I personally do not think we should plan for such scenarios, but if that is required, we can remove the model-based optimization for the unit-length check and just run the check on every inserted vector.
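For illustration, such a per-vector check could look roughly like the sketch below. This is a minimal sketch, not the PR's actual code; the `UnitLengthCheck` class name and the `EPSILON` tolerance are assumptions.

```java
// Minimal sketch, not the PR's actual code: a per-vector unit-length check
// with a tolerance for floating-point error. EPSILON is an assumed value.
final class UnitLengthCheck
{
    private static final float EPSILON = 1e-4f;

    static boolean isUnitLength(float[] vector)
    {
        double squaredSum = 0;
        for (float v : vector)
            squaredSum += (double) v * v;
        // A unit vector has L2 norm 1, hence squared norm 1 as well.
        return Math.abs(squaredSum - 1.0) < EPSILON;
    }
}
```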
What is the issue
Fixes: https://github.com/riptano/cndb/issues/15640
What does this PR fix and why was it fixed
In order to lay the groundwork for Fused ADC, I want to refactor some of the PQ/BQ logic. The unit-length computation needs to move, so I decided to split it out into its own PR.
The core idea is that we determine at insert time whether every vector is unit length, so that result can be passed to the `writePQ` method as needed.
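A rough illustration of that idea, assuming a per-index flag that downstream code can read; the class and method names are hypothetical, and `UnitLengthCheck` is the helper sketched above:

```java
// Hypothetical sketch of how an index writer might track unit-length status
// across inserts; field and method names are illustrative, not from the PR.
final class UnitLengthTracker
{
    private boolean allUnitLength = true;

    void onInsert(float[] vector)
    {
        // One non-unit vector is enough to disqualify the whole index;
        // downstream code (e.g. writePQ) reads the final flag.
        if (allUnitLength && !UnitLengthCheck.isUnitLength(vector))
            allUnitLength = false;
    }

    boolean allUnitLength()
    {
        return allUnitLength;
    }
}
```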
Embedding normalization notes

(I asked ChatGPT to provide proof for the config changes proposed in this PR. Here is its generated description.)
Quick rundown of which models spit out normalized vectors (so cosine == dot product: cos(a, b) = a·b / (‖a‖‖b‖), and for unit vectors the denominator is 1):
Models that end with a `Normalize` layer are fine; vanilla BERT doesn't.

TL;DR: OpenAI + Gecko are definitely safe; Cohere/BERT/NV need manual normalization due to lack of documentation.
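For the models that need it, manual normalization before insert is straightforward. A sketch under the assumption that embeddings arrive as `float[]`; the `Normalizer` class is hypothetical:

```java
// Hypothetical sketch: client-side L2 normalization before insert, for models
// (e.g. vanilla BERT) that do not emit unit-length embeddings.
final class Normalizer
{
    static float[] normalize(float[] vector)
    {
        double squaredSum = 0;
        for (float v : vector)
            squaredSum += (double) v * v;
        double norm = Math.sqrt(squaredSum);
        if (norm == 0)
            return vector; // a zero vector cannot be normalized
        float[] out = new float[vector.length];
        for (int i = 0; i < vector.length; i++)
            out[i] = (float) (vector[i] / norm);
        return out;
    }
}
```

After normalizing like this, cosine and dot-product comparisons agree, which is exactly the unit-length property the insert-time check looks for.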