Skip to content

Conversation

flaming-archer
Copy link
Contributor

@flaming-archer flaming-archer commented Sep 4, 2025

The previous filestatus will not be cached, as its source hivetable will be created every time, and the filestatus will also be created, resulting in different client IDs for the cached object's key, leading to cache invalidation.

It can cause two problems:

  1. Cache failure, which greatly affects query performance because obtaining filestatus from HDFS is a slow process.
  2. Memory leakage occurs because objects are constantly added to the cache, and since the default cache time is -1, the memory will become larger and larger.

Why are the changes needed?

Improve perfomance.

How was this patch tested?

UT and spark sql query.

Was this patch authored or co-authored using generative AI tooling?

No.

cache version 2

cache ut

change ut
@flaming-archer flaming-archer changed the title fix filestatus not cached [KYUUBI #7192]fix filestatus not cached Sep 4, 2025
@flaming-archer flaming-archer changed the title [KYUUBI #7192]fix filestatus not cached [KYUUBI #7192] Fix filestatus not cached Sep 4, 2025
@codecov-commenter
Copy link

Codecov Report

❌ Patch coverage is 0% with 48 lines in your changes missing coverage. Please review.
✅ Project coverage is 0.00%. Comparing base (ea75fa8) to head (9353f0c).
⚠️ Report is 3 commits behind head on master.

Files with missing lines Patch % Lines
...park/connector/hive/read/HiveFileStatusCache.scala 0.00% 42 Missing ⚠️
...kyuubi/spark/connector/hive/HiveTableCatalog.scala 0.00% 4 Missing ⚠️
...uubi/spark/connector/hive/read/HiveFileIndex.scala 0.00% 2 Missing ⚠️
Additional details and impacted files
@@          Coverage Diff           @@
##           master   #7191   +/-   ##
======================================
  Coverage    0.00%   0.00%           
======================================
  Files         695     696    +1     
  Lines       43433   43479   +46     
  Branches     5887    5902   +15     
======================================
- Misses      43433   43479   +46     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@flaming-archer
Copy link
Contributor Author

@pan3793 could u pls take a look at it

}

/**
* An implementation that caches partition file statuses in memory.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the code is forked from spark, clarify where and which version it comes from, and briefly explain your modification and expectation

import org.apache.spark.util.SizeEstimator

/**
* Use [[HiveFileStatusCache.getOrCreate()]] to construct a globally shared file status cache.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH, the "globally shared" concept does not match the Spark's multi-session architecture, especially for Kyuubi use cases, it's possible that multi users share one Spark application.

I know that there are many hive-related instances are globally shared in Spark, as we are improving this part, let's make it possible to be session shared, and have a config to allow it to be global shared.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants