
Conversation

@sgup432 (Contributor) commented Aug 26, 2025

Description

Related issue - #15097

Instead of using the default 1024 bytes for query size, we use RamUsageEstimator to calculate an approximate size. As seen in the issue above, using 1024 as a default can cause problems like heap exhaustion when it underestimates, or fewer queries getting cached when it overestimates.

RamUsageEstimator.sizeOf() is usually pretty fast: I made a sample BooleanQuery with 15 clauses and a size of around 3 KB, and it took around 14,959 ns to calculate its size via RamUsageEstimator.sizeOf().
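
To make the description concrete, here is a rough sketch of what the cache's size accounting looks like with this change. This is my illustration, not the PR's exact diff: the standalone class and the HASHTABLE_RAM_BYTES_PER_ENTRY constant are stand-ins for the real logic inside LRUQueryCache.getRamBytesUsed() (the actual changed lines appear in the review excerpt further down).

import org.apache.lucene.search.Query;
import org.apache.lucene.util.Accountable;
import org.apache.lucene.util.RamUsageEstimator;

// Illustrative stand-in for the cache's per-entry accounting.
final class QuerySizeEstimate {
  // Hypothetical per-entry overhead constant, standing in for
  // LRUQueryCache's LINKED_HASHTABLE_RAM_BYTES_PER_ENTRY.
  static final long HASHTABLE_RAM_BYTES_PER_ENTRY = 4L * RamUsageEstimator.NUM_BYTES_OBJECT_REF;

  static long getRamBytesUsed(Query query) {
    if (query instanceof Accountable) {
      // Queries that already track their own memory usage report it directly.
      return HASHTABLE_RAM_BYTES_PER_ENTRY + ((Accountable) query).ramBytesUsed();
    }
    // Before this PR: a flat 1024-byte default for non-accountable queries.
    // With this PR: walk the query and estimate its actual footprint.
    return HASHTABLE_RAM_BYTES_PER_ENTRY + RamUsageEstimator.sizeOf(query, 0);
  }
}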

github-actions bot commented:

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

@sgup432 sgup432 changed the title Use RamUsageEstimator to estimate query size instead of defaulting to 1024 bytes for non-accountable queries. [QueryCache] Use RamUsageEstimator instead of defaulting to 1024 bytes for non-accountable query size Aug 26, 2025
@sgup432 (Contributor Author) commented Aug 26, 2025

@msfroh Would you mind taking a look at this?


Comment on lines 428 to 429
long queryRamBytesUsed = RamUsageEstimator.sizeOf(query, 0);
return LINKED_HASHTABLE_RAM_BYTES_PER_ENTRY + queryRamBytesUsed;
Member commented:

I realize you did a small benchmark, but real-world performance is still a concern of mine here.

I am not sure how to benchmark this, but my concern is that non-accountable queries just got more expensive; we calculate the memory used for a query on both caching and eviction.

@sgup432 (Contributor Author) commented Aug 27, 2025

Yeah, I just timed one 3 KB boolean query.

I am not sure how to benchmark this, but my concern is that non-accountable queries just got more expensive

The only way I see is to micro-benchmark RamUsageEstimator with different query types and sizes; that might give a better idea (a rough sketch of such a benchmark is below). I had tried to do the same, but only with term and boolean queries.

but my concern is that non-accountable queries just got more expensive

I am not sure whether it would be that expensive (at least in absolute terms). For TermQuery and similarly cheap queries, RamUsageEstimator should be pretty fast, considering we already cache ramBytesUsed for the underlying terms etc. In my opinion, complex BooleanQueries with many nested clauses might be more expensive, as we need to visit all the underlying children for the size calculation. To handle that, we already have a limit of 16 clauses beyond which we cannot cache, and it can be refined further if needed.

I don't see any other good way of calculating query size until we implement Accountable for all queries, but that would be too intrusive for just the caching use case.
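
For illustration, a micro-benchmark along those lines could look roughly like this. This is a JMH sketch added for clarity, not code from the PR; the class and method names are made up.

import java.util.concurrent.TimeUnit;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.util.RamUsageEstimator;
import org.openjdk.jmh.annotations.*;

// Hypothetical JMH benchmark comparing RamUsageEstimator.sizeOf() cost per query shape.
@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class QuerySizeOfBenchmark {
  Query termQuery;
  Query booleanQuery;

  @Setup
  public void setup() {
    termQuery = new TermQuery(new Term("foo", "bar"));
    BooleanQuery.Builder bq = new BooleanQuery.Builder();
    for (int i = 0; i < 15; i++) {
      bq.add(new TermQuery(new Term("foo", "bar" + i)), BooleanClause.Occur.FILTER);
    }
    booleanQuery = bq.build();
  }

  @Benchmark
  public long sizeOfTermQuery() {
    return RamUsageEstimator.sizeOf(termQuery, 0);
  }

  @Benchmark
  public long sizeOfBooleanQuery() {
    return RamUsageEstimator.sizeOf(booleanQuery, 0);
  }
}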

@sgup432 (Contributor Author) commented:

@benwtrent So I wrote a simple code to micro-benchmark putIfAbsent method which is the one which calls getRamBytesUsed(Query query) during caching and eviction.
Created a cache with MAX_SIZE = 10000 and MAX_SIZE_IN_BYTES = 1048576, and created N sample boolean queries and the logic looks like something below

for (int i = 0; i < MAX_SIZE; i++) {
  TermQuery must = new TermQuery(new Term("foo", "bar" + i));
  TermQuery filter = new TermQuery(new Term("foo", "quux" + i));
  TermQuery mustNot = new TermQuery(new Term("foo", "foo" + i));
  BooleanQuery.Builder bq = new BooleanQuery.Builder();
  bq.add(must, BooleanClause.Occur.FILTER);
  bq.add(filter, BooleanClause.Occur.FILTER);
  bq.add(mustNot, BooleanClause.Occur.MUST_NOT);
  queries[i] = bq.build();
}

JMH method to test a 100% write workload with 10 threads:

@Benchmark
@Group("concurrentPutOnly")
@GroupThreads(10)
public void testConcurrentPuts() {
  int random = ThreadLocalRandom.current().nextInt(MAX_SIZE);
  queryCache.putIfAbsent(
      queries[random], this.sampleCacheAndCount, cacheHelpers[random & (SEGMENTS - 1)]);
}

Baseline numbers

Benchmark                               Mode  Cnt        Score       Error  Units
QueryCacheBenchmark.concurrentPutOnly  thrpt   15  4102080.220 ± 80816.546  ops/s

My changes

Benchmark                               Mode  Cnt       Score       Error  Units
QueryCacheBenchmark.concurrentPutOnly  thrpt   15  901925.345 ± 38059.230  ops/s

So it became ~4.5x slower, and there were a lot of evictions as well. This is probably one of the worst-case scenarios, though: a write-only workload with boolean queries. For a mixed read/write workload or simple filter queries, the impact might be much smaller.
And this is the method profile (taken from JFR). Kind of expected.

[Screenshot: JFR method profile, 2025-09-17]

I don't know if there is a way around this, or whether we can further improve the query-visitor logic itself.

@sgup432 (Contributor Author) commented Sep 18, 2025

Seems like I was doing

long queryRamBytesUsed = RamUsageEstimator.sizeOf(query, 0);

The signature is long sizeOf(Query q, long defSize), where defSize represents the shallow size of the query. If 0 is passed, the shallow size is computed at runtime, which was doing a lot of object allocations (driving up CPU%) and therefore adding to latency.

I changed it to RamUsageEstimator.sizeOf(query, 32); here 32 is a rough shallow size in bytes for a term/boolean query, which I calculated manually by calling that method.

After running with the above change, performance improved but is still ~2.5x slower than the baseline.

Benchmark                               Mode  Cnt        Score        Error  Units
QueryCacheBenchmark.concurrentPutOnly  thrpt   15  1608680.127 ± 229218.026  ops/s
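
As an aside, a constant like the hard-coded 32 could presumably be derived once rather than measured by hand. The sketch below is my illustration of that idea, not part of the PR; the class and field names are made up.

import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.util.RamUsageEstimator;

// Illustrative: compute rough shallow sizes once and reuse them as the defSize
// argument, instead of hard-coding 32 or passing 0 (which recomputes per call).
final class ShallowQuerySizes {
  static final long TERM_QUERY_SHALLOW =
      RamUsageEstimator.shallowSizeOfInstance(TermQuery.class);
  static final long BOOLEAN_QUERY_SHALLOW =
      RamUsageEstimator.shallowSizeOfInstance(BooleanQuery.class);
  // A rough default for when the concrete query type isn't known.
  static final long DEFAULT_QUERY_SHALLOW =
      Math.max(TERM_QUERY_SHALLOW, BOOLEAN_QUERY_SHALLOW);
}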

Member commented:

I wonder if the query cache should "cache" the query sizes? It seems we do this more expensive calculation twice for no reason. Why not keep track of the query size in the cache as well, only calculating it when adding the entry?
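
Roughly, that idea could look like the sketch below. This is my illustration with hypothetical names and a standalone map; the PR ends up storing the size in the cache's existing map instead, and the real cache is guarded by its own lock rather than a ConcurrentHashMap.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.RamUsageEstimator;

// Hypothetical sketch: remember each cached query's estimated size so that
// eviction can reuse it instead of re-running RamUsageEstimator.
final class QuerySizeMemo {
  private final Map<Query, Long> sizes = new ConcurrentHashMap<>();

  // Called when a query is added to the cache: estimate once and remember it.
  long onCache(Query query) {
    return sizes.computeIfAbsent(query, q -> RamUsageEstimator.sizeOf(q, 0));
  }

  // Called when a query is evicted: reuse the remembered size, then forget it.
  long onEvict(Query query) {
    Long size = sizes.remove(query);
    return size != null ? size : RamUsageEstimator.sizeOf(query, 0);
  }
}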

@sgup432 (Contributor Author) commented Sep 18, 2025

I wonder if the query cache should "cache" the query sizes?

Yeah, good point. Let me try that approach as well and update here.
I was able to further improve performance earlier by making changes in BooleanQuery.visit() so that we store clause-level queries explicitly in a list beforehand, which made it 62% faster (overall ~1.5x slower than baseline).

Member commented:

1.5x is REALLY good!

@sgup432 (Contributor Author) commented Sep 18, 2025

Okay, I made the change to cache the query size as well in the existing map.

Baseline (100% write)

Benchmark                               Mode  Cnt        Score        Error  Units
QueryCacheBenchmark.concurrentPutOnly  thrpt    9  4057602.444 ± 380153.561  ops/s

My changes (100% write) - 25% slower than baseline

Benchmark                               Mode  Cnt        Score        Error  Units
QueryCacheBenchmark.concurrentPutOnly  thrpt    9  3028840.412 ± 483817.248  ops/s

Baseline - 60% read, 40% write (6 read threads, 4 write threads)

Benchmark                                                           Mode  Cnt        Score         Error  Units
QueryCacheBenchmark.concurrentGetAndPuts                           thrpt    9  8074623.026 ± 1229792.902  ops/s

My changes - 60% read, 40% write

Benchmark                                                           Mode  Cnt        Score         Error  Units
QueryCacheBenchmark.concurrentGetAndPuts                           thrpt    9  8262227.277 ± 1224929.417  ops/s

So with 100% writes, which is the worst case for any cache, my changes (adding RamUsageEstimator plus the BooleanQuery.visit() changes) perform 25% worse than the baseline. But this isn't an ideal workload for any cache.

So I ran another benchmark with 60% reads and 40% writes, which is a more realistic distribution for the cache.

Here both are almost equivalent, and no degradation is seen.

Member commented:

so great! I really like this.

@sgup432 (Contributor Author) commented Sep 18, 2025

@benwtrent I have pushed my final changes, let me know how it looks.
On a side note, I am also going to raise a PR soon (based on my idea here) to refactor the query cache to improve performance by at least 2-3x :)

github-actions bot commented:

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

@github-actions github-actions bot added Stale and removed Stale labels Sep 11, 2025
…mpact of repeated visit() calls during RamUsageEstimator.sizeOf()

@github-actions github-actions bot added this to the 11.0.0 milestone Sep 18, 2025