Switching to a fixed CFS threshold #15295
base: main
Conversation
}

// Apply appropriate threshold based on merge policy type
if (mergePolicy instanceof LogDocMergePolicy) {
It would be great if we could avoid customizing it for specific policies; otherwise it might be tricky to maintain in the future if, e.g., there is another policy that is based on doc count rather than bytes.
Maybe we could add an enum and a method to MergePolicy that returns its unit (bytes/docs), and use it here to decide which threshold to apply?
Or do we want to always choose the compound format based on size in bytes, even for LogDocMergePolicy? In that case we might be able to use merge.getMergeInfo().sizeInBytes() when we call this method and avoid relying on MergePolicy#size altogether?
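For illustration, here is a self-contained sketch of the enum idea. MergeUnit, getMergeUnit-style dispatch, and the byte threshold value are hypothetical; only DEFAULT_CFS_THRESHOLD_DOC_SIZE = 65536 comes from this PR's diff.

public class CfsThresholdSketch {
  // Hypothetical unit a MergePolicy could report (not existing Lucene API).
  enum MergeUnit { BYTES, DOCS }

  // The doc threshold is the constant quoted in this PR's diff; the byte value is made up for the sketch.
  static final int DEFAULT_CFS_THRESHOLD_DOC_SIZE = 65536; // docs
  static final long DEFAULT_CFS_THRESHOLD_BYTE_SIZE = 64L * 1024 * 1024; // bytes, illustrative only

  // The unit, rather than an instanceof check, decides which threshold applies.
  static boolean useCompoundFile(MergeUnit unit, long segmentSize) {
    long threshold =
        unit == MergeUnit.DOCS ? DEFAULT_CFS_THRESHOLD_DOC_SIZE : DEFAULT_CFS_THRESHOLD_BYTE_SIZE;
    return segmentSize <= threshold;
  }

  public static void main(String[] args) {
    System.out.println(useCompoundFile(MergeUnit.DOCS, 10_000));    // true: below the doc threshold
    System.out.println(useCompoundFile(MergeUnit.BYTES, 1L << 30)); // false: above the byte threshold
  }
}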
+1 to the idea of enums. I will wait to see if anyone else has other suggestions here, but having an enum makes the most sense to me.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!
stefanvodita left a comment
I had looked at this PR before, but I see it's gone stale. Hoping to revive it.
Surprised how many changes we needed for this!
// we can add 'producer' classes.

/** Default document count threshold for using compound files with LogDocMergePolicy */
static final int DEFAULT_CFS_THRESHOLD_DOC_SIZE = 65536; // docs
Why do we believe these are good thresholds? Do we need a follow-up item about tweaking them?
* <p><b>Note: To control compound file usage during segment merges see {@link
* MergePolicy#setNoCFSRatio(double)} and {@link MergePolicy#setMaxCFSSegmentSizeMB(double)}. This
* setting only applies to newly created segments.</b>
* <p><b>Note: To control compound file usage during segment merges.</b>
Do we want to say more in this comment?
.info
.getCodec()
.compoundFormat()
.useCompoundFile(mergePolicy.size(merge.info, this), mergePolicy);
Nit: Sometimes we call getMergeInfo, sometimes we address info directly.
dir,
newIndexWriterConfig().setMergePolicy(newLogMergePolicy(random().nextBoolean())));
random(), dir, newIndexWriterConfig().setMergePolicy(newLogMergePolicy()));
writer
A lot of the time we chain these access methods to eventually set compound files. I don't think it's a blocker, but maybe in the future we could add some convenience methods. No strong opinion on it though; I might actually prefer the way it's done now.
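For illustration, a convenience method along these lines is roughly what's meant. The name setCompoundFileEnabled is hypothetical; only the delegation target, setShouldUseCompoundFile on the codec's CompoundFormat, appears in this PR's diffs.

// Hypothetical helper on IndexWriterConfig (not part of this PR); it just hides the chain.
public IndexWriterConfig setCompoundFileEnabled(boolean enabled) {
  getCodec().compoundFormat().setShouldUseCompoundFile(enabled);
  return this;
}

// Callers would then write
//   conf.setCompoundFileEnabled(true);
// instead of
//   conf.getCodec().compoundFormat().setShouldUseCompoundFile(true);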
lmp.setMaxCFSSegmentSizeMB(Double.POSITIVE_INFINITY);
conf.getCodec().compoundFormat().setShouldUseCompoundFile(true);
conf.getCodec().compoundFormat().setMaxCFSSegmentSizeMB(Double.POSITIVE_INFINITY);
conf.setUseCompoundFile(true);
Wouldn't we still benefit from the true default?
.setMergePolicy(newLogMergePolicy()));

MergePolicy lmp = modifier.getConfig().getMergePolicy();
lmp.setNoCFSRatio(1.0);
We don't need to replicate this?
// and might use many per-field codecs. turn on CFS for IW flushes
// and ensure CFS ratio is reasonable to keep it contained.
conf.setUseCompoundFile(true);
mp.setNoCFSRatio(Math.max(0.25d, mp.getNoCFSRatio()));
Do we have an equivalent setting?
Problem
Lucene uses the CFS ratio (10% by default) to determine whether to use compound files.
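For context, the ratio-based decision being replaced works roughly like this. This is a simplified, standalone paraphrase of MergePolicy#useCompoundFile, not the verbatim Lucene code.

// Simplified sketch of the current ratio-based check: a merged segment gets a compound
// file only if it is small relative to the whole index and under the max CFS segment size.
static boolean useCompoundFileByRatio(
    long mergedSegmentBytes, long totalIndexBytes, double noCFSRatio, long maxCFSSegmentBytes) {
  if (noCFSRatio == 0.0) {
    return false; // CFS disabled entirely
  }
  if (mergedSegmentBytes > maxCFSSegmentBytes) {
    return false; // segment too large to pack into a compound file
  }
  if (noCFSRatio >= 1.0) {
    return true; // always use CFS
  }
  // The default noCFSRatio is 0.1, i.e. the 10% mentioned above.
  return mergedSegmentBytes <= (long) (noCFSRatio * totalIndexBytes);
}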
Solution
In this PR we are doing the following (see the sketch after this list):
- Switching from a CFS ratio to a fixed threshold, which is:
- Moving CFS configuration from the merge policies to CompoundFormat.java.
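Based on the test diffs quoted in the review above, the change looks roughly like this from a caller's perspective (conf is an IndexWriterConfig and lmp a LogMergePolicy; a sketch, not an exhaustive API listing):

// Before: compound-file behaviour configured via the merge policy and IndexWriterConfig.
conf.setUseCompoundFile(true);
lmp.setNoCFSRatio(1.0);
lmp.setMaxCFSSegmentSizeMB(Double.POSITIVE_INFINITY);

// After (per this PR's diffs): configured on the codec's CompoundFormat instead.
conf.getCodec().compoundFormat().setShouldUseCompoundFile(true);
conf.getCodec().compoundFormat().setMaxCFSSegmentSizeMB(Double.POSITIVE_INFINITY);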
Fixes #14959