feat: datafusion table provider next #3849
Draft
roeap wants to merge 29 commits into delta-io:main from roeap:feat/table-provider
+4,447 −4,896
Conversation
Codecov Report

Additional details and impacted files:

@@           Coverage Diff            @@
##             main    #3849    +/-   ##
==========================================
+ Coverage   73.76%   74.26%   +0.50%
==========================================
  Files         151      156       +5
  Lines       39396    41033    +1637
  Branches    39396    41033    +1637
==========================================
+ Hits        29061    30474    +1413
- Misses       9023     9175     +152
- Partials     1312     1384      +72

☔ View full report in Codecov by Sentry.
# Description
Creates a new cargo profile for the python build, applying a few techniques from [`min-sized-rust`](https://github.com/johnthagen/min-sized-rust). Sets up maturin to build with said profile.
# Related Issue(s)
Relates to delta-io#3876
# Documentation
The size of the wheel on my system is:
```
❯ du -sh *
20M  deltalake-1.2.1-cp39-abi3-macosx_11_0_arm64.whl
```
The equivalent wheel in pypi is `43.3 MB`
---------
Signed-off-by: Abhi Agarwal <[email protected]>
Co-authored-by: R. Tyler Croy <[email protected]>
This will include the sso configuration by default Closes delta-io#3897 Signed-off-by: R. Tyler Croy <[email protected]>
…sumption without cloning Signed-off-by: Florian Valeye <[email protected]>
…sumption without cloning Signed-off-by: Florian Valeye <[email protected]>
Signed-off-by: JustinRush80 <[email protected]>
Signed-off-by: JustinRush80 <[email protected]>
I do not have a great test environment to validate that the manifests are _correct_ so that they can be used by an engine like Presto. As such we will have to depend on users to validate this Signed-off-by: R. Tyler Croy <[email protected]>
…st files The closest thing to a "spec" I have found is the Hive SymlinkTextInputFormat which is poorly documented. Instead I've just generated manifests for our test tables using pyspark and am more or less using that as our "spec" here Signed-off-by: R. Tyler Croy <[email protected]>
Signed-off-by: Florian Valeye <[email protected]>
…nity crate Signed-off-by: Florian Valeye <[email protected]>
We should not be littering the logs with warnings until we have an immediate plan to remove the deprecated functions. `new_metadata` is not able to be replaced until an actual replacement with kernel is available. At that time we can mark this as deprecated Signed-off-by: R. Tyler Croy <[email protected]>
Signed-off-by: R. Tyler Croy <[email protected]>
Signed-off-by: DrakeLin <[email protected]>
# Description
As of today, the Rust meta-crate re-exports the storage crates but never
calls their register_handlers helpers. That means every Rust binary
still has to remember to call deltalake::gcp::register_handlers(None)
(and the equivalents for S3/Azure/etc.) before using cloud URIs, even
though the Python bindings auto-register. This PR brings the meta-crate
to parity so DeltaOps::try_from_uri("gs://…") works out of the box when
the gcs feature is enabled.
# Problem
- Users of the deltalake crate must manually register each storage
backend before working with gs://, s3://, abfss://, etc.
- Forgetting the call leads to
DeltaTableError::InvalidTableLocation("Unknown scheme: gs"), which
blocks workflows like DataFusion writers on GCS.
- Docs/examples didn’t make it obvious when manual registration was
still required.
# Solution
- Add feature-gated ctor hooks in crates/deltalake/src/lib.rs that call
register_handlers(None) for AWS, Azure, GCS, HDFS, LakeFS, and Unity as
soon as their features are enabled (a sketch follows this list).
- Pull in the lightweight ctor = "0.2" dependency so the hooks run at
startup.
- Add a small regression test that exercises
DeltaTableBuilder::from_uri("gs://…") with the gcs feature to guard
against regressions.
- Update the GCS integration docs and changelog to explain that the
meta-crate now auto-registers backends while deltalake-core users still
need to call the storage crates explicitly.
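A minimal sketch of what such a feature-gated hook can look like (illustrative only; the actual module layout and names in crates/deltalake/src/lib.rs may differ):

```rust
// Illustrative: run GCS handler registration at program startup when the
// `gcs` feature is enabled, so `gs://` URIs resolve without a manual call.
#[cfg(feature = "gcs")]
mod gcs_auto_register {
    #[ctor::ctor]
    fn register() {
        // `gcp` is the meta-crate's re-export of the deltalake-gcp storage crate.
        crate::gcp::register_handlers(None);
    }
}
```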
# Changes
- crates/deltalake/src/lib.rs: new #[ctor::ctor] modules for s3, azure,
gcs, hdfs, lakefs, and unity.
- crates/deltalake/Cargo.toml: add ctor dependency.
- crates/deltalake/tests/gcs_auto_registration.rs: new smoke tests for
gs:// URI recognition when the gcs feature is enabled.
- docs/integrations/object-storage/gcs.md & CHANGELOG.md: document the
auto-registration behavior.
# Testing
- cargo check -p deltalake --all-features
- cargo test -p deltalake --features gcs
- cargo test --test gcs_auto_registration --features gcs
- cargo build --example pharma_pipeline_gcs --features gcs,datafusion
# Documentation
- docs/integrations/object-storage/gcs.md
- CHANGELOG.md
---------
Signed-off-by: Ethan Urbanski <[email protected]>
Signed-off-by: R. Tyler Croy <[email protected]>
…n and Rust
Profiles for different use cases:
- dev: Fast local development with good debug experience (Cargo defaults)
- test: test builds with minimal debug info to save disk/RAM (custom)
- release: Production Rust crates - maximum performance (Cargo defaults)
- bench: For benchmarking with flamegraphs (cargo bench)
- profiling: For performance profiling with release opts + debug info
- ci: CI/CD optimized - fast builds, release-like performance
- python-release: Python wheel builds - portable, reproducible (PyPI releases)
Signed-off-by: Florian Valeye <[email protected]>
Signed-off-by: Florian Valeye <[email protected]>
…om_uri Signed-off-by: R. Tyler Croy <[email protected]>
Signed-off-by: R. Tyler Croy <[email protected]>
---
updated-dependencies:
- dependency-name: ctor
  dependency-version: 0.6.1
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <[email protected]>
This work was really cool when @houqp explored it. In 2025 we're not really reliant on dynamodb locking and at some point in the distant future maybe we'll not need a DynamoDBLogStore either. Signed-off-by: R. Tyler Croy <[email protected]>
---
updated-dependencies:
- dependency-name: convert_case
  dependency-version: 0.9.0
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <[email protected]>
Signed-off-by: R. Tyler Croy <[email protected]>
- Fix panic when get_add_actions() is called on tables with no add actions
- Return empty RecordBatch with correct schema instead of panicking (see the sketch below)
- Add unit test to verify get_add_actions() works after delete and vacuum

Fixes delta-io#3918
Signed-off-by: Manish Sogiyawar <[email protected]>
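A minimal sketch of the empty-batch behaviour (the schema here is illustrative; the real one is the add-action schema built by delta-rs):

```rust
use std::sync::Arc;

use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() {
    // Illustrative columns standing in for the add-action schema.
    let schema = Arc::new(Schema::new(vec![
        Field::new("path", DataType::Utf8, false),
        Field::new("size_bytes", DataType::Int64, false),
    ]));

    // With no add actions, return an empty batch that still carries the
    // expected schema instead of panicking.
    let empty = RecordBatch::new_empty(schema.clone());
    assert_eq!(empty.num_rows(), 0);
    assert_eq!(empty.schema(), schema);
}
```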
Signed-off-by: Manish Sogiyawar <[email protected]>
See delta-io#3918 Signed-off-by: R. Tyler Croy <[email protected]>
… overwrite (delta-io#3912) Remove references in docs that suggest using `partition_filters` for selectively overwriting partitions, which has been removed from the `write_deltalake` API. fixes delta-io#3904 Signed-off-by: zyd14 <[email protected]>
Description
This PR adds a new table provider, in the hope of applying the learnings we have gathered so far and leveraging modern datafusion APIs. There are several aspects we need to consider.
Implementations
Thus far we implement `TableProvider` for `DeltaTable` and a dedicated `DeltaTableProvider`. Specifically, the implementation for `DeltaTable` is problematic, since we do not (or at least may not) know important information (i.e. the schema) about the table.

For log replay we implement `ScanFileStream`, which consumes the kernel `ScanMetadata` stream and processes it to collect file skipping stats and extract datafusion `Statistics` to include in parquet execution planning.
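For orientation, a minimal sketch of the shape such a provider takes is below. The struct, its fields, and the method bodies are illustrative rather than the types added in this PR, and the `scan` signature shown matches recent datafusion releases (older versions take `&SessionState` instead of `&dyn Session`):

```rust
use std::any::Any;
use std::sync::Arc;

use async_trait::async_trait;
use datafusion::arrow::datatypes::SchemaRef;
use datafusion::catalog::Session;
use datafusion::datasource::{TableProvider, TableType};
use datafusion::error::Result;
use datafusion::logical_expr::{Expr, TableProviderFilterPushDown};
use datafusion::physical_plan::ExecutionPlan;

/// Illustrative stand-in for a snapshot-backed provider; not the delta-rs type.
struct DeltaScanProvider {
    schema: SchemaRef,
}

#[async_trait]
impl TableProvider for DeltaScanProvider {
    fn as_any(&self) -> &dyn Any {
        self
    }

    fn schema(&self) -> SchemaRef {
        // The schema has to be available up front, which is what makes building
        // a provider directly from `DeltaTable` (possibly without a loaded
        // snapshot) problematic.
        self.schema.clone()
    }

    fn table_type(&self) -> TableType {
        TableType::Base
    }

    fn supports_filters_pushdown(
        &self,
        filters: &[&Expr],
    ) -> Result<Vec<TableProviderFilterPushDown>> {
        // Report filters as inexact: they are forwarded to the kernel scan for
        // file skipping, but datafusion still re-applies them after the scan.
        Ok(vec![TableProviderFilterPushDown::Inexact; filters.len()])
    }

    async fn scan(
        &self,
        _state: &dyn Session,
        _projection: Option<&Vec<usize>>,
        _filters: &[Expr],
        _limit: Option<usize>,
    ) -> Result<Arc<dyn ExecutionPlan>> {
        // This is where the kernel scan (log replay) would produce the files to
        // read and a parquet execution plan would be assembled from them.
        todo!("drive the kernel scan and build the parquet plan")
    }
}
```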
Statistics & file skipping

Both delta-kernel and datafusion's parquet handling allow optimising queries via predicates. We pass the predicate into the kernel scan to leverage the kernel's file skipping. We also add statistics to the `PartitionedFile`s that get passed into the parquet plan to allow datafusion to do its thing. However, we no longer expose statistics on the `TableProvider`, since this would always require a full log replay prior to constructing the `TableProvider`, which we want to move away from. `ListingTable` in datafusion - which is likely most similar to our provider - takes a similar approach.
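As a rough illustration of the statistics side (not the actual delta-rs code; the helper, schema, and values are made up), per-file statistics can be expressed with datafusion's `Statistics`/`Precision` types and then attached to the corresponding `PartitionedFile`:

```rust
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::common::stats::Precision;
use datafusion::common::{ScalarValue, Statistics};

/// Hypothetical helper: translate per-file stats parsed from the Delta log
/// (add-action stats) into datafusion `Statistics` for one parquet file.
fn file_statistics(schema: &Schema, num_rows: usize, size_bytes: usize) -> Statistics {
    // Start from "unknown" and fill in only what the log actually tells us.
    let mut stats = Statistics::new_unknown(schema);
    stats.num_rows = Precision::Exact(num_rows);
    stats.total_byte_size = Precision::Exact(size_bytes);

    // Per-column null counts and min/max come from the add action's stats blob.
    if let Some(col) = stats.column_statistics.get_mut(0) {
        col.null_count = Precision::Exact(0);
        col.min_value = Precision::Exact(ScalarValue::Int64(Some(1)));
        col.max_value = Precision::Exact(ScalarValue::Int64(Some(100)));
    }
    stats
}

fn main() {
    let schema = Schema::new(vec![Field::new("id", DataType::Int64, false)]);
    // These statistics would then be attached to the `PartitionedFile` handed
    // to the parquet plan, so pruning can happen per file.
    let stats = file_statistics(&schema, 100, 4096);
    println!("{stats:?}");
}
```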
Execution metrics

Thus far we collect operation statistics in several ways, including the custom `MetricsObserver` node. While we likely need to retain this functionality, there are several stats we can collect more efficiently. Specifically, we track files skipped and scanned when we do the log replay to plan the scan.
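A sketch of how such counters could be tracked with datafusion's metrics primitives (illustrative only; the counter names are hypothetical and the existing `MetricsObserver` machinery is unchanged):

```rust
use datafusion::physical_plan::metrics::{Count, ExecutionPlanMetricsSet, MetricBuilder};

/// Hypothetical counters bumped while replaying the log to plan the scan.
struct ScanPlanningMetrics {
    files_scanned: Count,
    files_skipped: Count,
}

impl ScanPlanningMetrics {
    fn new(metrics: &ExecutionPlanMetricsSet) -> Self {
        Self {
            files_scanned: MetricBuilder::new(metrics).global_counter("files_scanned"),
            files_skipped: MetricBuilder::new(metrics).global_counter("files_skipped"),
        }
    }
}

fn main() {
    let metrics = ExecutionPlanMetricsSet::new();
    let planning = ScanPlanningMetrics::new(&metrics);

    // During log replay, bump the counters as files survive or fail pruning.
    planning.files_scanned.add(8);
    planning.files_skipped.add(3);

    // The counters show up in the plan's metrics alongside other node metrics.
    println!("{}", metrics.clone_inner());
}
```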
Future work

- push deletion vectors into parquet read

Currently we process deletion vectors after loading the data from the parquet file. This is due to uncertainties in handling row ids and other features that might be affected by skipping individual rows.
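For illustration, the post-read approach amounts to applying the deletion vector as a boolean mask over the decoded batches, roughly like the following sketch using arrow's filter kernel (the data and mask here are made up):

```rust
use std::sync::Arc;

use arrow::array::{BooleanArray, Int64Array};
use arrow::compute::filter_record_batch;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), arrow::error::ArrowError> {
    // Illustrative batch standing in for data read back from a parquet file.
    let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int64, false)]));
    let batch = RecordBatch::try_new(
        schema,
        vec![Arc::new(Int64Array::from(vec![1, 2, 3, 4]))],
    )?;

    // A deletion vector marks rows 1 and 3 (0-based) as deleted; keep the rest.
    // Today this masking happens after the parquet read; pushing it into the
    // read itself is the future work described above.
    let keep = BooleanArray::from(vec![true, false, true, false]);
    let filtered = filter_record_batch(&batch, &keep)?;
    assert_eq!(filtered.num_rows(), 2);
    Ok(())
}
```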