Skip to content

Conversation

RamakrishnaChilaka
Copy link
Contributor

@RamakrishnaChilaka RamakrishnaChilaka commented Sep 10, 2025

The previous code compared numBitsNextBitsPerValue ≤ docRange to decide whether to store the doc-delta block with PackedInts or as a bit-set. docRange is the number of documents, not the number of bits the bit-set would occupy, so the test did not reflect the actual space that would be written.

Change

Replace the comparison with numBitsNextBitsPerValue ≤ numBitSetLongs * Long.SIZE
so the decision is based on the exact bit footprint of each alternative. The bit-set path is now taken only when it is strictly smaller on disk.

We can skip change-log as it is a marginal change, I will add it, if you prefer

Copy link
Contributor

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine with me.

I'm trying to figure out if this could result in different choices of encoding, but I believe not since this essentially rounds up to the next multiple of 64 while the number of bits required when encoding with FOR are always multiples of 256?

@RamakrishnaChilaka
Copy link
Contributor Author

RamakrishnaChilaka commented Sep 11, 2025

Thanks for the review @jpountz.

I think in some cases this will change encoding choice.

For example, Please consider

  • If maxBitsPerValue is 4, largest delta can be 15 i.e., cost would be 1280 ((4 + 1) * 256).
  • The smallest possible docRange is then 269 (15 + 254 × 1) and the largest is 3825 (15 × 255).

Therefore, blocks that were previously encoded with FOR will continue to be encoded with FOR. However, some blocks that were previously encoded with BitSet may now switch to FOR, because we now account for the true size of the BitSet.

@jpountz
Copy link
Contributor

jpountz commented Sep 11, 2025

Ahh, you are right, thanks for explaining. I suggest we wait a couple days so that the performance impact on nightly benchmarks can be easily distinguished from the impact of bumping the block size to 256.

@jpountz
Copy link
Contributor

jpountz commented Sep 17, 2025

I think we've waited long enough, feel free to merge!

@RamakrishnaChilaka RamakrishnaChilaka merged commit b9288ae into apache:main Sep 17, 2025
8 checks passed
@RamakrishnaChilaka RamakrishnaChilaka deleted the bug_fix_postings_writer branch September 17, 2025 19:31
@RamakrishnaChilaka RamakrishnaChilaka added the skip-changelog Apply to PRs that don't need a changelog entry, stopping the automated changelog check. label Sep 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module:core/codecs skip-changelog Apply to PRs that don't need a changelog entry, stopping the automated changelog check.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants