[KYUUBI #7192] Fix filestatus not cached #7191
base: master
cache version 2 cache ut change ut
Codecov Report
❌ Patch coverage is
@@           Coverage Diff            @@
##            master   #7191    +/-   ##
========================================
  Coverage     0.00%   0.00%
========================================
  Files          695     696     +1
  Lines        43433   43479    +46
  Branches      5887    5902    +15
========================================
- Misses       43433   43479    +46
@pan3793 could you please take a look at it?
}

/**
 * An implementation that caches partition file statuses in memory.
If the code is forked from Spark, clarify where it comes from and which version, and briefly explain your modifications and what you expect from them.
import org.apache.spark.util.SizeEstimator

/**
 * Use [[HiveFileStatusCache.getOrCreate()]] to construct a globally shared file status cache.
TBH, the "globally shared" concept does not match Spark's multi-session architecture. This matters especially for Kyuubi use cases, where multiple users may share one Spark application.
I know that many Hive-related instances are globally shared in Spark. Since we are improving this part, let's make the cache session-shared by default, with a config that allows it to be globally shared.
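One way to picture the scoping suggested in this comment is the minimal sketch below. It is not the connector's implementation: `FileStatusCacheStub`, `cacheFor`, and the `shareGlobally` flag are hypothetical names standing in for the real cache class and for whatever config key the PR ends up using.

```scala
import scala.collection.concurrent.TrieMap

object CacheScopeDemo {
  // Hypothetical placeholder for the real file status cache class.
  final class FileStatusCacheStub

  // One cache for the whole Spark application.
  private val globalCache = new FileStatusCacheStub

  // One cache per session, created lazily on first use.
  private val sessionCaches = TrieMap.empty[String, FileStatusCacheStub]

  // `shareGlobally` models an assumed boolean config; the actual key name
  // is not given in this thread.
  def cacheFor(sessionId: String, shareGlobally: Boolean): FileStatusCacheStub =
    if (shareGlobally) globalCache
    else sessionCaches.getOrElseUpdate(sessionId, new FileStatusCacheStub)
}
```

With the flag off, two users on the same Spark application get separate caches; with it on, they share one.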
Previously, file statuses were not effectively cached: the source Hive table is re-created every time, and its file status cache along with it, so the key of each cached object carries a different client ID and lookups never hit the cache.
It can cause two problems:
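The cache-miss mechanism described above can be reproduced with a small, self-contained sketch. It assumes a shared store keyed by `(clientId, path)`, where the client ID is a plain object compared by identity. `SharedStore`, the path, and the table name are hypothetical and are not taken from the patch.

```scala
import scala.collection.mutable

object CacheKeyDemo {
  // Hypothetical shared store keyed by (clientId, path);
  // not the actual Kyuubi or Spark class.
  final class SharedStore {
    private val entries = mutable.Map.empty[(AnyRef, String), Seq[String]]
    var misses = 0

    def getOrLoad(clientId: AnyRef, path: String)(load: => Seq[String]): Seq[String] =
      entries.getOrElseUpdate((clientId, path), {
        misses += 1 // only evaluated when the key is absent
        load
      })
  }

  // Looks up the same path n times with a fresh client ID per lookup,
  // as happens when the table object is rebuilt for every query.
  def missesWithFreshClientIds(n: Int): Int = {
    val store = new SharedStore
    (1 to n).foreach { _ =>
      val clientId = new Object // new identity each time, so the key never matches
      store.getOrLoad(clientId, "/warehouse/t/p=1")(Seq("part-00000"))
    }
    store.misses
  }

  // Same lookups, keyed by a stable identifier (here an assumed table name).
  def missesWithStableKey(n: Int): Int = {
    val store = new SharedStore
    (1 to n).foreach { _ =>
      store.getOrLoad("db.t", "/warehouse/t/p=1")(Seq("part-00000"))
    }
    store.misses
  }
}
```

With fresh client IDs every lookup is a miss, while a stable key misses only on the first lookup, which is the behavior a fix along these lines would aim for.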
Why are the changes needed?
Improve performance.
How was this patch tested?
Unit tests and Spark SQL queries.
Was this patch authored or co-authored using generative AI tooling?
No.