Skip to content

Conversation

gaobinlong
Copy link
Contributor

Description

Resolve a TODO in FirstPassGroupingCollector, introduce a new parameter ignoreDocsWithoutGroupField, if it's true, ignore documents that don't have the group field instead of putting them into a null group.

This change can also improve the field collapsing performance, I did some benchmark test against OpenSearch, here are the result:

Query DSL:
GET big5/_search?size=10
{
  "query": {
    "match": {
      "message": "50-136-239-27"
    }
  },
  "collapse": {
    "field": "count"
  }
}

Before:

|                                   50th percentile service time | collapsing |     217.497 |     ms |
|                                   90th percentile service time | collapsing |     221.677 |     ms |
|                                   99th percentile service time | collapsing |     236.481 |     ms |
|                                 99.9th percentile service time | collapsing |     266.285 |     ms |
|                                  100th percentile service time | collapsing |     271.294 |     ms |


After:
|                                    50th percentile service time | collapsing |     180.974 |     ms |
|                                   90th percentile service time | collapsing |     185.945 |     ms |
|                                   99th percentile service time | collapsing |     205.012 |     ms |
|                                 99.9th percentile service time | collapsing |     222.983 |     ms |
|                                  100th percentile service time | collapsing |     259.105 |     ms |

Copy link
Contributor

github-actions bot commented Sep 9, 2025

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

Signed-off-by: Binlong Gao <[email protected]>
@github-actions github-actions bot added this to the 10.3.0 milestone Sep 9, 2025
Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some great numbers! I have some minor comments.

Signed-off-by: Binlong Gao <[email protected]>
@github-actions github-actions bot modified the milestones: 10.3.0, 10.4.0 Sep 9, 2025
@benwtrent
Copy link
Member

I am not 100% familiar with this part of the code base, but it seems like a pretty safe change, and tests prove it out.

I will leave this PR open for a little bit for others to comment if they find issue with it. I will likely merge & backport this next week.

@alexmm-amzn
Copy link

alexmm-amzn commented Sep 9, 2025

Should this use the return value of GroupSelector#advanceTo(int) (returns an enum with SKIP or ACCEPT) instead of checking for a null value? The collector currently ignores it, but it looks like the intended way of skipping a hit.

@gaobinlong
Copy link
Contributor Author

Should this use the return value of GroupSelector#advanceTo(int) (returns an enum with SKIP or ACCEPT) instead of checking for a null value? The collector currently ignores it, but it looks like the intended way of skipping a hit.

That doesn't work, ValueSourceGroupSelector.advanceTo() always return ACCEPT:
https://github.com/apache/lucene/blob/main/lucene/grouping/src/java/org/apache/lucene/search/grouping/ValueSourceGroupSelector.java#L69

@alexmm-amzn
Copy link

That's the current behavior of ValueSourceGroupSelector. Is it the expected one though? What's the purpose of this return value?

@gaobinlong
Copy link
Contributor Author

That's the current behavior of ValueSourceGroupSelector. Is it the expected one though? What's the purpose of this return value?

By checking the usage, the return value is designed to be used in second pass grouping selector, not for the FirstPassGroupingCollector.

Signed-off-by: Binlong Gao <[email protected]>
@romseygeek
Copy link
Contributor

I agree with @alexmm-amzn, I think this needs to use the return value of GroupSelector#advanceTo(int) to distinguish between actual null values and missing values, and that ValueSourceGroupSelector should be updated to behave more like the other implementations.

This also reminds me that at some point I should get round to deprecating ValueSource...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants