@roeap roeap commented Oct 14, 2025

Description

This PR adds a new table provider that applies what we have learned so far and leverages modern datafusion APIs. There are several aspects we need to consider.

Implementations

So far we implement TableProvider both for DeltaTable and for a dedicated DeltaTableProvider. The implementation for DeltaTable in particular is problematic, since we do not (or at least may not) know important information (i.e. the schema) about the table.

For log replay we implement ScanFileStream, which consumes the kernel ScanMetadata stream and processes it to collect file skipping stats and extract datafusion Statistics to include in parquet execution planning.

Statistics & file skipping

Both delta-kernel and datafusion's parquet handling allow optimising queries via predicates. We pass the predicate into the kernel scan to leverage the kernel's file skipping. We also add statistics to the PartitionedFiles that get passed into the parquet plan to allow datafusion to do its thing.

However, we no longer expose statistics on the TableProvider, since this would always require a full log replay before constructing it, which we want to move away from. ListingTable in datafusion, which is likely most similar to our provider, takes a similar approach.
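
To illustrate the idea behind statistics-based file skipping, here is a minimal sketch in plain Rust. It does not use the delta-kernel or datafusion APIs; `FileStats`, `Predicate`, `may_match` and `prune_files` are hypothetical names invented for this example, and it assumes a single integer column with min/max statistics per file.

```rust
/// Hypothetical per-file min/max statistics for one integer column.
#[derive(Debug, Clone, PartialEq)]
struct FileStats {
    path: String,
    min: i64,
    max: i64,
}

/// Hypothetical predicate on that column (assumption: one column, three operators).
#[derive(Debug, Clone, Copy)]
enum Predicate {
    GreaterThan(i64),
    LessThan(i64),
    Equals(i64),
}

/// Returns true if the file *might* contain matching rows.
/// A `false` result means the file can be skipped without reading it.
fn may_match(stats: &FileStats, predicate: Predicate) -> bool {
    match predicate {
        Predicate::GreaterThan(v) => stats.max > v,
        Predicate::LessThan(v) => stats.min < v,
        Predicate::Equals(v) => stats.min <= v && v <= stats.max,
    }
}

/// Keeps only the files whose statistics cannot rule out a match.
fn prune_files(files: &[FileStats], predicate: Predicate) -> Vec<&FileStats> {
    files.iter().filter(|f| may_match(f, predicate)).collect()
}
```

Pruning of this kind is conservative: a file that passes the check may still contain no matching rows, so the predicate still has to be evaluated on the data that is actually read.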

Execution metrics

Thus far we collect operation statistics in several ways, including the custom MetricsObserver node. While we likely need to retain this functionality, there are several stats we can collect more efficiently. Specifically we track files skipped and scanned when we do the log replay to plan the scan.
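
As a rough sketch of how skipped and scanned file counts can fall out of log replay at no extra cost, consider the following. It is not the PR's ScanFileStream: it uses a plain iterator instead of an async stream, and `ScanBatch`, `ReplayMetrics` and `plan_scan` are hypothetical names, with each batch assumed to carry a per-file selection flag.

```rust
/// Hypothetical counters gathered while planning the scan.
#[derive(Debug, Default, PartialEq)]
struct ReplayMetrics {
    files_scanned: usize,
    files_skipped: usize,
}

/// Hypothetical batch of candidate files from log replay.
/// `selected[i]` says whether `paths[i]` survived file skipping.
struct ScanBatch {
    paths: Vec<String>,
    selected: Vec<bool>,
}

/// Consumes the batches once, returning the files to read together with
/// the metrics, so no second pass over the log is needed.
fn plan_scan(batches: impl IntoIterator<Item = ScanBatch>) -> (Vec<String>, ReplayMetrics) {
    let mut metrics = ReplayMetrics::default();
    let mut to_read = Vec::new();
    for batch in batches {
        // `zip` stops at the shorter of the two vectors; this sketch
        // assumes they have equal length.
        for (path, keep) in batch.paths.into_iter().zip(batch.selected) {
            if keep {
                metrics.files_scanned += 1;
                to_read.push(path);
            } else {
                metrics.files_skipped += 1;
            }
        }
    }
    (to_read, metrics)
}
```

Calling `plan_scan` on the replayed batches yields both the file list for the parquet plan and the counters, which could then be surfaced as execution metrics.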

Future work

push deletion vectors into parquet read

Currently we process deletion vectors after loading the data from the parquet file. This is due to uncertainties in handling row ids and other features that might be affected by skipping individual rows.
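
For illustration, applying a deletion vector after loading amounts to filtering the decoded rows by position. The sketch below assumes the deletion vector is available as a set of zero-based row indices within the file; `apply_deletion_vector` is a hypothetical helper, not the PR's code, and real implementations would operate on Arrow batches rather than a `Vec`.

```rust
use std::collections::HashSet;

/// Drops every row whose position in the file is marked as deleted.
/// Assumption: `deleted` holds zero-based row indices for this file.
fn apply_deletion_vector<T>(rows: Vec<T>, deleted: &HashSet<usize>) -> Vec<T> {
    rows.into_iter()
        .enumerate()
        .filter(|(idx, _)| !deleted.contains(idx))
        .map(|(_, row)| row)
        .collect()
}
```

Because the filter runs after decoding, every row is still read from parquet; pushing the deletion vector into the read would avoid that work, but changes which row positions downstream features observe.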

@github-actions github-actions bot added the binding/rust Issues for the Rust crate label Oct 14, 2025
codecov bot commented Oct 14, 2025

Codecov Report

❌ Patch coverage is 80.75314% with 322 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.26%. Comparing base (a22a97e) to head (7130d3c).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...c/delta_datafusion/engine/expressions/to_kernel.rs | 68.85% | 92 Missing and 17 partials ⚠️ |
| ...re/src/delta_datafusion/table_provider/next/mod.rs | 72.03% | 51 Missing and 22 partials ⚠️ |
| ...e/src/delta_datafusion/table_provider/next/scan.rs | 65.31% | 53 Missing and 7 partials ⚠️ |
| ...src/delta_datafusion/table_provider/next/replay.rs | 80.20% | 32 Missing and 7 partials ⚠️ |
| ...e/src/delta_datafusion/engine/expressions/to_df.rs | 94.39% | 22 Missing and 14 partials ⚠️ |
| crates/core/src/operations/load.rs | 76.47% | 0 Missing and 4 partials ⚠️ |
| ...tes/core/src/kernel/snapshot/iterators/scan_row.rs | 91.66% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3849      +/-   ##
==========================================
+ Coverage   73.76%   74.26%   +0.50%     
==========================================
  Files         151      156       +5     
  Lines       39396    41033    +1637     
  Branches    39396    41033    +1637     
==========================================
+ Hits        29061    30474    +1413     
- Misses       9023     9175     +152     
- Partials     1312     1384      +72     

@roeap roeap force-pushed the feat/table-provider branch 2 times, most recently from 9477333 to 17b9768 Compare October 17, 2025 23:08
@roeap roeap moved this to Backlog in delta-rust Oct 17, 2025
@roeap roeap force-pushed the feat/table-provider branch from 17b9768 to 9c7a5b5 Compare October 18, 2025 00:52
@roeap roeap moved this from Backlog to In progress in delta-rust Oct 18, 2025
@roeap roeap force-pushed the feat/table-provider branch 5 times, most recently from cad121f to c661711 Compare October 20, 2025 04:56
@github-actions github-actions bot added the binding/python Issues for the Python package label Oct 20, 2025
@roeap roeap force-pushed the feat/table-provider branch from c661711 to f05fc48 Compare October 20, 2025 22:52
@github-actions github-actions bot removed the binding/python Issues for the Python package label Oct 20, 2025
@roeap roeap force-pushed the feat/table-provider branch 5 times, most recently from 23c9db4 to 5519a70 Compare October 21, 2025 18:46
@roeap roeap force-pushed the feat/table-provider branch from 5519a70 to af2e475 Compare October 21, 2025 18:49
rtyler and others added 7 commits October 24, 2025 11:38
# Description

Creates a new cargo profile for the python build, applying a few
techniques from
[`min-sized-rust`](https://github.com/johnthagen/min-sized-rust). Sets
up maturin to build with said profile.

# Related Issue(s)

Relates to delta-io#3876 

# Documentation

The size of the wheel on my system is:

```
❯ du -sh *
20M     deltalake-1.2.1-cp39-abi3-macosx_11_0_arm64.whl
```

The equivalent wheel in pypi is `43.3 MB`

---------

Signed-off-by: Abhi Agarwal <[email protected]>
Co-authored-by: R. Tyler Croy <[email protected]>
This will include the sso configuration by default

Closes delta-io#3897

Signed-off-by: R. Tyler Croy <[email protected]>
…sumption without cloning

Signed-off-by: Florian Valeye <[email protected]>
rtyler and others added 20 commits November 18, 2025 17:37
I do not have a great test environment to validate that the manifests
are _correct_ so that they can be used by an engine like Presto. As such
we will have to depend on users to validate this

Signed-off-by: R. Tyler Croy <[email protected]>
…st files

The closest thing to a "spec" I have found is the Hive
SymlinkTextInputFormat which is poorly documented. Instead I've just
generated manifests for our test tables using pyspark and am more or
less using that as our "spec" here

Signed-off-by: R. Tyler Croy <[email protected]>
We should not be littering the logs with warnings until we have an
immediate plan to remove the deprecated functions. `new_metadata` is not
able to be replaced until an actual replacement with kernel is available.
At that time we can mark this as deprecated

Signed-off-by: R. Tyler Croy <[email protected]>
Signed-off-by: DrakeLin <[email protected]>
# Description

As of today, the Rust meta-crate re-exports the storage crates but never
calls their register_handlers helpers. That means every Rust binary
still has to remember to call deltalake::gcp::register_handlers(None)
(and the equivalents for S3/Azure/etc.) before using cloud URIs, even
though the Python bindings auto-register. This PR brings the meta-crate
to parity so DeltaOps::try_from_uri("gs://…") works out of the box when
the gcs feature is enabled.

# Problem

- Users of the deltalake crate must manually register each storage
backend before working with gs://, s3://, abfss://, etc.

- Forgetting the call leads to
DeltaTableError::InvalidTableLocation("Unknown scheme: gs"), which
blocks workflows like DataFusion writers on GCS.

- Docs/examples didn’t make it obvious when manual registration was
still required.

# Solution

- Add feature-gated ctor hooks in crates/deltalake/src/lib.rs that call
register_handlers(None) for AWS, Azure, GCS, HDFS, LakeFS, and Unity as
soon as their features are enabled.

- Pull in the lightweight ctor = "0.2" dependency so the hooks run at
startup.

- Add a small regression test that exercises
DeltaTableBuilder::from_uri("gs://…") with the gcs feature to guard
against regressions.

- Update the GCS integration docs and changelog to explain that the
meta-crate now auto-registers backends while deltalake-core users still
need to call the storage crates explicitly.

# Changes

- crates/deltalake/src/lib.rs: new #[ctor::ctor] modules for s3, azure,
gcs, hdfs, lakefs, and unity.

- crates/deltalake/Cargo.toml: add ctor dependency.

- crates/deltalake/tests/gcs_auto_registration.rs: new smoke tests for
gs:// URI recognition when the gcs feature is enabled.

- docs/integrations/object-storage/gcs.md & CHANGELOG.md: document the
auto-registration behavior.

# Testing

- cargo check -p deltalake --all-features
- cargo test -p deltalake --features gcs
- cargo test --test gcs_auto_registration --features gcs
- cargo build --example pharma_pipeline_gcs --features gcs,datafusion

# Documentation

- docs/integrations/object-storage/gcs.md
- CHANGELOG.md

---------

Signed-off-by: Ethan Urbanski <[email protected]>
…n and Rust

Profiles for different use cases:
- dev: Fast local development with good debug experience (Cargo defaults)
- test: test builds with minimal debug info to save disk/RAM (custom)
- release: Production Rust crates - maximum performance (Cargo defaults)
- bench: For benchmarking with flamegraphs (cargo bench)
- profiling: For performance profiling with release opts + debug info
- ci: CI/CD optimized - fast builds, release-like performance
- python-release: Python wheel builds - portable, reproducible (PyPI releases)

Signed-off-by: Florian Valeye <[email protected]>
---
updated-dependencies:
- dependency-name: ctor
  dependency-version: 0.6.1
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
This work was really cool when @houqp explored it. In 2025 we're not
really reliant on dynamodb locking and at some point in the distant
future maybe we'll not need a DynamoDBLogStore either.

Signed-off-by: R. Tyler Croy <[email protected]>
---
updated-dependencies:
- dependency-name: convert_case
  dependency-version: 0.9.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <[email protected]>
- Fix panic when get_add_actions() is called on tables with no add actions
- Return empty RecordBatch with correct schema instead of panicking
- Add unit test to verify get_add_actions() works after delete and vacuum

Fixes delta-io#3918

Signed-off-by: Manish Sogiyawar <[email protected]>
@github-actions github-actions bot added binding/python Issues for the Python package proofs labels Nov 18, 2025
… overwrite (delta-io#3912)

Remove references in docs that suggest using `partition_filters` for
selectively overwriting partitions, which has been removed from the
`write_deltalake` API.

fixes delta-io#3904

Signed-off-by: zyd14 <[email protected]>
@roeap roeap force-pushed the feat/table-provider branch from 3f2e8b1 to 1a29c70 Compare November 18, 2025 16:39
Labels

binding/python Issues for the Python package binding/rust Issues for the Rust crate proofs

Projects

Status: In progress

8 participants