
Conversation

Contributor

@mccullocht mccullocht commented Sep 9, 2025

Unlike the existing ScalarQuantizer, this format selects a mode based on an enum to quantize to either unsigned bytes or
packed nibbles, using the same packing scheme as the existing scalar quantized codec. Seven bits is also
supported for anyone interested in backward compatibility, but that setting is discouraged.
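To illustrate the packed-nibble mode, here is a minimal sketch of packing two 4-bit quantized values per byte. This is only an illustration of the idea; the codec's actual on-disk packing order may differ, and the class name `NibblePack` is hypothetical.

```java
// Illustrative sketch: two 4-bit quantized values per byte.
// Not the codec's exact on-disk layout.
public final class NibblePack {
  // Pack quantized values (each must be in [0, 15]) into half the bytes.
  static byte[] pack(byte[] quantized) {
    byte[] packed = new byte[(quantized.length + 1) / 2];
    for (int i = 0; i < quantized.length; i++) {
      int shift = (i & 1) == 0 ? 0 : 4; // even index -> low nibble
      packed[i / 2] |= (byte) ((quantized[i] & 0x0F) << shift);
    }
    return packed;
  }

  // Recover the original 4-bit values.
  static byte[] unpack(byte[] packed, int dims) {
    byte[] out = new byte[dims];
    for (int i = 0; i < dims; i++) {
      int shift = (i & 1) == 0 ? 0 : 4;
      out[i] = (byte) ((packed[i / 2] >> shift) & 0x0F);
    }
    return out;
  }
}
```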

This is separate from Lucene102BinaryQuantizedVectorsFormat as we need a larger value to store
the component sum for each vector owing to larger quantized values.

This closes #15064

luceneutil benchmark results; the OSQ results are the rows labeled -4 and -8 bits.

Results:
recall  latency(ms)  netCPU  avgCpuCount     nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.875        0.858   0.855        0.996  1000000    10     100       32        250    -4 bits    203.69       4909.49          168.11             1         3349.44      3311.157      381.470       HNSW
 0.954        1.222   1.217        0.996  1000000    10     100       32        250    -8 bits    333.87       2995.20          193.90             1         3717.10      3677.368      747.681       HNSW
 0.450        1.881   1.816        0.965  1000000    10     100       32        250     4 bits    510.04       1960.65          285.58             1         3346.30      3299.713      370.026       HNSW
 0.928        1.245   1.241        0.997  1000000    10     100       32        250     8 bits    325.14       3075.58          207.41             1         3705.70      3665.924      736.237       HNSW

@benwtrent benwtrent self-requested a review September 9, 2025 16:50
@benwtrent benwtrent added this to the 10.4.0 milestone Sep 9, 2025
@mccullocht mccullocht marked this pull request as ready for review September 10, 2025 20:27
Member

@benwtrent benwtrent left a comment


This is awesome! Thank you for going through the slog. Doing format changes is always a challenge and requires a ton of ceremony.

I will need to read through this a couple of times to grok all of it (but most of it seems pretty standard for a knn format)

I realize we support unsigned 8 bit now, but for BWC, I would still like to provide 7 bit if possible.

With this change, we should also move all existing quantized formats to BWC. But that would be yet another thousand+ LOC. So, maybe that can be done in a follow up.

@benwtrent
Member

It is interesting to me that int4 in the new format has such better latency as well!

I wonder why that is? Is it simply because the scoring quality is higher, and thus we can exit searching the graph more quickly?

Contributor Author

@mccullocht mccullocht left a comment


With this change, we should also move all existing quantized formats to BWC. But that would be yet another thousand+ LOC. So, maybe that can be done in a follow up.

Marked the old codecs as deprecated. I'd prefer not to do backwards codecs here; this change is already much larger than I'd like, but I couldn't figure out how to factor it into smaller pieces.

It is interesting to me that int4 in the new format has such better latency as well!
I wonder why that is? Is it simply because the scoring quality is higher, and thus we can exit searching the graph more quickly?

I re-ran the luceneutil benchmarks because I guess the previous run had only partially finished. Indexing performance is also much better. I think of this in terms of lossy score compression -- for example, trivial bit quantization with Hamming distance produces just (dimensions + 1) possible scores, so it can be difficult to distinguish between results in some cases. It could be that the old quantizer produces fewer distinct vector representations and output scores.
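The score-compression point can be made concrete with a small sketch: over d binary dimensions, Hamming distance can only take d + 1 distinct values (0..d), so many candidates tie at the same score. The class name here is hypothetical, not part of Lucene.

```java
// Sketch of why 1-bit quantization is lossy score compression:
// Hamming distance over d binary dims has only d + 1 possible values.
public final class HammingScores {
  // Hamming distance between two bit vectors stored as long words.
  static int hamming(long[] a, long[] b) {
    int dist = 0;
    for (int i = 0; i < a.length; i++) {
      dist += Long.bitCount(a[i] ^ b[i]);
    }
    return dist;
  }

  // Number of distinct Hamming scores for d-dimensional binary vectors.
  static int distinctScores(int dims) {
    return dims + 1;
  }
}
```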

More generally I think it would be useful to track and surface KnnCollector stats in indexing and search paths. Being able to distinguish between "comparisons are faster" and "comparisons are fewer" would be helpful for analyzing this, and also other algorithmic and data layout changes (would be really curious to see this for binary partitioning reordering).
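One way such stats could be surfaced is a small counter updated by the search path; this is purely a hypothetical sketch (no such Lucene API exists), just to show how "fewer comparisons" could be separated from "cheaper comparisons".

```java
// Hypothetical sketch, not an existing Lucene API: a counter the
// search path could update so benchmarks can distinguish
// "comparisons are fewer" from "comparisons are faster".
public final class KnnSearchStats {
  private long comparisons;
  private long comparisonNanos;

  // Record one vector comparison and how long it took.
  public void record(long nanos) {
    comparisons++;
    comparisonNanos += nanos;
  }

  public long comparisons() {
    return comparisons;
  }

  public double avgNanosPerComparison() {
    return comparisons == 0 ? 0.0 : (double) comparisonNanos / comparisons;
  }
}
```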

@benwtrent
Member

Marked the old codecs as deprecated. I'd prefer not to do backwards codecs here; this change is already much larger than I'd like, but I couldn't figure out how to factor it into smaller pieces.

Yeah, I am cool with that. Format changes are always huge!

More generally I think it would be useful to track and surface KnnCollector stats in indexing and search paths. Being able to distinguish between "comparisons are faster" and "comparisons are fewer" would be helpful for analyzing this, and also other algorithmic and data layout changes (would be really curious to see this for binary partitioning reordering).

We don't really track this during indexing, for sure. But you can expose the vector comparisons in luceneutil.

@mccullocht
Contributor Author

Average visited count in the query path is actually exposed in luceneutil today; it just appears in the iteration summary and not the overall summary. TIL. I've extracted it for my most recent run here:

bits  avgVisited
  -4        3794
  -8        3857
   4        4294
   8        3849

4 bit is doing ~10% more comparisons than 8 bit for the same fanout. That is more work for 4 bit, but it doesn't explain the size of the win for OSQ4. Workload CPU usage and latency are mostly driven by scoring costs, so let's start by looking at microbenchmark results for dot product on the same hardware:

VectorUtilBenchmark.binaryDotProductVector         128  thrpt   15   47.741 ±  0.502  ops/us
VectorUtilBenchmark.binaryDotProductVector         256  thrpt   15   26.198 ±  0.608  ops/us
VectorUtilBenchmark.binaryDotProductVector         300  thrpt   15   22.857 ±  0.130  ops/us
VectorUtilBenchmark.binaryDotProductVector         512  thrpt   15   13.864 ±  0.443  ops/us
VectorUtilBenchmark.binaryDotProductVector         702  thrpt   15   10.003 ±  0.205  ops/us
VectorUtilBenchmark.binaryDotProductVector        1024  thrpt   15    6.795 ±  0.098  ops/us
VectorUtilBenchmark.binaryHalfByteVectorPacked     128  thrpt   15   72.590 ±  1.089  ops/us
VectorUtilBenchmark.binaryHalfByteVectorPacked     256  thrpt   15   50.962 ±  0.197  ops/us
VectorUtilBenchmark.binaryHalfByteVectorPacked     300  thrpt   15   39.587 ±  0.159  ops/us
VectorUtilBenchmark.binaryHalfByteVectorPacked     512  thrpt   15   31.660 ±  0.187  ops/us
VectorUtilBenchmark.binaryHalfByteVectorPacked     702  thrpt   15   22.251 ±  0.135  ops/us
VectorUtilBenchmark.binaryHalfByteVectorPacked    1024  thrpt   15   18.117 ±  0.129  ops/us

At 702 dimensions (the closest to our test data set) half byte is about twice as fast. The performance difference between -4 and -8 makes sense with this context. I don't know why 4 is so slow; these numbers suggest it shouldn't be worse than 8, yet somehow it is 🤷. Not sure this is worth figuring out.
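For intuition on the throughput gap, here are simplified scalar versions of the two kernels. These are illustrative only (the real VectorUtil paths are SIMD-vectorized, and the class name is hypothetical): the packed-nibble variant reads half the bytes per comparison, which is consistent with the roughly 2x difference in the benchmark numbers.

```java
// Illustrative scalar dot-product kernels; the real implementations
// are SIMD-vectorized. Values are treated as unsigned quantized codes.
public final class DotKernels {
  // One byte per dimension.
  static int dotBytes(byte[] a, byte[] b) {
    int sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += (a[i] & 0xFF) * (b[i] & 0xFF);
    }
    return sum;
  }

  // Two 4-bit dimensions per byte: half the memory traffic per comparison.
  static int dotPackedNibbles(byte[] a, byte[] b) {
    int sum = 0;
    for (int i = 0; i < a.length; i++) {
      sum += (a[i] & 0x0F) * (b[i] & 0x0F);          // low nibbles
      sum += ((a[i] >> 4) & 0x0F) * ((b[i] >> 4) & 0x0F); // high nibbles
    }
    return sum;
  }
}
```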

Member

@benwtrent benwtrent left a comment


Really great stuff.

The existing scalar formats can go to the backwards codecs in a separate PR.


Successfully merging this pull request may close these issues.

Switch over current scalar quantization formats to use OptimizedScalarQuantizer