feat: get the delta table row count based on the table history #3732
ACTION NEEDED: delta-rs follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.
Thanks for your PR @ohadmata. Unfortunately, this wouldn't be a stable interface for getting the row count, since operations and metrics are optional and not agreed upon in the Delta spec. You can, however, sum the numRecords of each file's add action.
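For illustration, summing numRecords over add actions could look roughly like the sketch below. It assumes each add action is available as a plain dict whose optional `stats` field is a JSON-encoded string, as in the JSON log; the helper names `num_records` and `sum_num_records` are hypothetical and are not part of delta-rs.

```python
import json
from typing import Iterable, Optional


def num_records(add_action: dict) -> Optional[int]:
    """Return numRecords from an add action's stats, or None if absent."""
    stats = add_action.get("stats")
    if not stats:
        return None
    # Assumption: `stats` is itself a JSON-encoded string, as in the JSON log.
    return json.loads(stats).get("numRecords")


def sum_num_records(add_actions: Iterable[dict]) -> tuple[int, int]:
    """Sum numRecords over add actions and count the files lacking stats."""
    total, missing = 0, 0
    for action in add_actions:
        n = num_records(action)
        if n is None:
            missing += 1
        else:
            total += n
    return total, missing


# Hypothetical add actions; the third file was written without stats.
adds = [
    {"path": "part-0.parquet", "stats": json.dumps({"numRecords": 3})},
    {"path": "part-1.parquet", "stats": json.dumps({"numRecords": 5})},
    {"path": "part-2.parquet"},
]
print(sum_num_records(adds))  # (8, 1)
```

Returning the number of files without stats alongside the total makes it visible when the sum is only a lower bound.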
@ion-elgreco 🤷 this seems like a kind of useful thing to have, right? If I put an interface for counting file stats down in the 🦀 layer, would that be cool to expose upwards?
Definitely! As long as it's done based on the file stats.
@ion-elgreco @rtyler I have changed the function to get the count from the Parquet files' metadata. WDYT?
This won't work on remote object storage. The most efficient and correct way is to base it on the "numRecords" stats field in the Delta log.
@ion-elgreco I updated it again to retrieve the number of records from the add action in the Delta log. This approach is much more efficient and avoids opening all the Parquet files.
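As a rough sketch of the log-based approach (not the code in this PR), the count can be derived by replaying commits and keeping only files that are still live. It assumes each commit is newline-delimited JSON with `add` and `remove` actions, and it ignores checkpoints and deletion vectors; `count_rows_from_log` is a hypothetical name.

```python
import json


def count_rows_from_log(commits):
    """Replay commits in order and sum numRecords over the live files."""
    live = {}  # path -> numRecords, or None when the add action has no stats
    for commit in commits:
        for line in commit.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                add = action["add"]
                stats = add.get("stats")
                # Assumption: `stats` is a JSON-encoded string and may be absent.
                live[add["path"]] = (
                    json.loads(stats).get("numRecords") if stats else None
                )
            elif "remove" in action:
                # A removed file no longer contributes to the count.
                live.pop(action["remove"]["path"], None)
    # Files without stats are skipped, so the result is a lower bound.
    return sum(n for n in live.values() if n is not None)


def _add(path, n):
    """Build a hypothetical add action line for the example."""
    return json.dumps({"add": {"path": path, "stats": json.dumps({"numRecords": n})}})


commit_0 = "\n".join([_add("a.parquet", 2), _add("b.parquet", 3)])
commit_1 = "\n".join([json.dumps({"remove": {"path": "a.parquet"}}), _add("c.parquet", 4)])
print(count_rows_from_log([commit_0, commit_1]))  # 7
```

Only the log is read here, so no Parquet file has to be opened, which is what makes the approach workable on remote object storage.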
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Coverage diff (main vs. #3732):
Metric      main      #3732     +/-
Coverage    76.09%    76.08%    -0.01%
Files       145       145
Lines       45117     45117
Branches    45117     45117
Hits        34330     34328     -2
Misses      9098      9100      +2
Partials    1689      1689
@ohadmata can you add a small test? Then we are good to go!
@ion-elgreco I have added a regression test and some improved docstrings. I think this is ready to go!
get the delta table row count
Get the number of rows from the underlying list of parquet files Signed-off-by: ohadmata <[email protected]>
Get the number of records from the add actions
Since count is approximate, I made sure to update the docstrings for the function to helpfully inform users Signed-off-by: R. Tyler Croy <[email protected]>
Hi, I just got back from vacation and saw that you added a test and merged it. Thanks!
Description
Currently, delta-rs does not have an option to get the number of rows for a specific table.
I want to get the number of rows from the Delta log without reading the whole table into my application's memory.
The idea is to keep a running row counter while iterating over the log to compute the total.
Related Issue(s)
- closes #3731