ohadmata

@ohadmata ohadmata commented Sep 3, 2025

Description

Currently, delta-rs has no option to get the number of rows in a specific table.
I want to get the number of rows from the delta log without reading the whole table into my application's memory.
The idea is to keep a running row counter while iterating over the log.

Related Issue(s)

- closes #3731

@github-actions github-actions bot added the binding/python Issues for the Python package label Sep 3, 2025

github-actions bot commented Sep 3, 2025

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@ohadmata ohadmata changed the title Get the delta table row count based on the table history feat: Get the delta table row count based on the table history Sep 3, 2025
@ohadmata ohadmata changed the title feat: Get the delta table row count based on the table history feat: get the delta table row count based on the table history Sep 3, 2025
@ion-elgreco
Collaborator

Thanks for your PR, @ohadmata. Unfortunately, this wouldn't be a stable interface for getting the row count, since operations and metrics are optional and not agreed upon in the Delta spec.

You can, however, sum the numRecords of each file's add action.
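
To illustrate the suggestion, here is a minimal sketch of summing `numRecords` over add actions. It assumes each add action is available as a plain dict whose optional `stats` field is a JSON-encoded string, as in the Delta log. The function name `sum_num_records` is hypothetical and is not part of delta-rs.

```python
import json
from typing import Iterable, Optional


def sum_num_records(add_actions: Iterable[dict]) -> Optional[int]:
    """Sum numRecords over add actions (hypothetical helper, not a delta-rs API).

    Returns None when any file lacks stats, because file statistics are
    optional and an exact count cannot be derived without them.
    """
    total = 0
    for add in add_actions:
        raw_stats = add.get("stats")
        if not raw_stats:
            return None
        # "stats" is stored as a JSON string inside the add action.
        stats = json.loads(raw_stats)
        if "numRecords" not in stats:
            return None
        total += stats["numRecords"]
    return total
```

Returning `None` rather than a partial sum is one possible design choice; a caller could instead fall back to scanning the files that have no stats.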

@rtyler
Member

rtyler commented Sep 6, 2025

@ion-elgreco 🤷 this seems like a kind of useful thing to have right? If I put an interface for counting file stats down in the 🦀 layer, would that be cool to expose upwards?

@ion-elgreco
Collaborator

> @ion-elgreco 🤷 this seems like a kind of useful thing to have right? If I put an interface for counting file stats down in the 🦀 layer, would that be cool to expose upwards?

Definitely! As long as it's done based on the file stats.

@ohadmata
Author

ohadmata commented Sep 7, 2025

@ion-elgreco @rtyler I have changed the function to get the count from the Parquet files' metadata. WDYT?

@ohadmata ohadmata force-pushed the delta-table-row-count-function branch 3 times, most recently from 95d4bdd to 8fb6bea Compare September 7, 2025 08:44
@ion-elgreco
Collaborator

> @ion-elgreco @rtyler I have changed the function to get the count from the Parquet files' metadata. WDYT?

This won't work on remote object storage. The most efficient and correct way is to base it on the "numRecords" stats field in the delta log.

@ohadmata
Author

ohadmata commented Sep 8, 2025

@ion-elgreco I updated it again to retrieve the number of records from the add action in the Delta log. This approach is much more efficient and avoids opening all the Parquet files.
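
As a rough illustration of why this avoids opening any Parquet files, the sketch below replays the JSON commit files of a `_delta_log` directory and sums `numRecords` for the files that are still active. It is not the code merged in this PR. It assumes a local log directory with every commit file present and no checkpoints, and the name `count_rows_from_log` is hypothetical.

```python
import json
import re
from pathlib import Path
from typing import Dict, Optional, Tuple

# Commit files are named by a zero-padded 20-digit version, e.g. 00000000000000000000.json
COMMIT_FILE = re.compile(r"\d{20}\.json")


def count_rows_from_log(log_dir: str) -> Tuple[int, int]:
    """Replay add/remove actions; return (row_count, files_without_stats).

    Hypothetical helper: ignores checkpoints and assumes all commits exist.
    """
    active: Dict[str, Optional[int]] = {}
    commits = sorted(
        p for p in Path(log_dir).iterdir() if COMMIT_FILE.fullmatch(p.name)
    )
    for commit in commits:
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)  # one JSON action per line
            if "add" in action:
                add = action["add"]
                # "stats" is an optional JSON string inside the add action.
                stats = json.loads(add["stats"]) if add.get("stats") else {}
                active[add["path"]] = stats.get("numRecords")
            elif "remove" in action:
                active.pop(action["remove"]["path"], None)
    known = [n for n in active.values() if n is not None]
    return sum(known), len(active) - len(known)
```

When the second value is non-zero, some active files carry no stats, so the first value is only a lower bound on the row count.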


codecov bot commented Sep 8, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.08%. Comparing base (9b35849) to head (4ba2865).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3732      +/-   ##
==========================================
- Coverage   76.09%   76.08%   -0.01%     
==========================================
  Files         145      145              
  Lines       45117    45117              
  Branches    45117    45117              
==========================================
- Hits        34330    34328       -2     
- Misses       9098     9100       +2     
  Partials     1689     1689              


@ion-elgreco ion-elgreco force-pushed the delta-table-row-count-function branch from e4b2f21 to 1a0a6fb Compare September 21, 2025 19:21
@ion-elgreco
Collaborator

@ohadmata can you add a small test? Then we are good to go!

@rtyler rtyler force-pushed the delta-table-row-count-function branch from 1a0a6fb to 931c3ff Compare September 23, 2025 14:02
@rtyler
Member

rtyler commented Sep 23, 2025

@ion-elgreco I have added a regression test and some improved docstrings. I think this is ready to go!

ohadmata and others added 4 commits September 24, 2025 08:20
get the delta table row count
Get the number of rows from the underlying list of parquet files

Signed-off-by: ohadmata <[email protected]>
Get the number of records from the add actions
Since count is approximate, I made sure to update the docstrings for the
function to helpfully inform users

Signed-off-by: R. Tyler Croy <[email protected]>
@ion-elgreco ion-elgreco force-pushed the delta-table-row-count-function branch from 931c3ff to 4ba2865 Compare September 24, 2025 06:20
@fvaleye fvaleye enabled auto-merge (rebase) September 24, 2025 08:07
@fvaleye fvaleye merged commit d30b11f into delta-io:main Sep 24, 2025
27 of 29 checks passed
@ohadmata
Author

Hi, I just got back from vacation and saw that you added a test and merged it. Thanks!
