Replies: 2 comments
First, I'll say I'd be in favor of splitting things out. Overall these splits seem sensible.

I understand the goal, but I'm afraid this will often be impossible due to breaking changes in Arrow. Additionally, I wonder if there's a split we can make between DataFusion-powered modules (where DataFusion is used internally but not exposed at the interface) and DataFusion plugins (e.g. …)
Agreed, splitting up into small crates can make life easier and reduce cognitive load when reasoning about the code. I do think we may want to consider a few more things, especially if we consider the larger ecosystem like delta-sharing etc.

When it comes to the cloud crates I do have some doubts, since essentially only AWS has any special needs, due to locking. So not even all S3 APIs require any special dependencies. Factoring out the locking logic and mirroring object_store's features may be a way to go. Over there the specific cloud features (aws, azure, gcs) are essentially just legacy, since they more or less just reference the "cloud" feature.

Moving towards logical plans, I also have some drafts flying around for `deltalake-sql` (logical planning / SQL parsing) and `deltalake-execution`, which has physical operators for DataFusion.

All in all I feel we may want to start small, where we are sure, and move to individual crates step by step?
Lately I have been thinking about how to improve the way the `deltalake` crate is packaged and delivered to users. I believe it is time for us to create sub-crates and convert `deltalake` to a meta-package.

This would be not too dissimilar to how `arrow` and `datafusion` are packaged and delivered today. In the case of the arrow package, one can pull `arrow` directly and get all the bells and whistles. However, if a user only requires the Arrow types, they can pull the `arrow-schema` sub-crate, which is much smaller in code and dependency footprint.
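As a small illustration of that analogy (not part of the proposal itself), depending on `arrow-schema` alone is enough to build and inspect schemas, without pulling in compute kernels, IPC, or the rest of `arrow`:

```rust
// Building a schema with only the arrow-schema sub-crate on the
// dependency list; no compute kernels or IPC code gets compiled in.
use arrow_schema::{DataType, Field, Schema};

fn main() {
    let schema = Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("value", DataType::Utf8, true),
    ]);
    println!("{schema:#?}");
}
```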
Benefits:

- `deltalake-core` will be easier to orient around the kernel of functionality needed to implement the Delta protocol

## New Crates
I am proposing the addition of the following sub-crates which should be considered as dependencies (optional ❔ and non-optional 🔐 as noted) of the `deltalake` crate:

### deltalake-core 🔐
This crate would have much of the existing traits and implementations needed to do things like log processing, and provide the key APIs that Python and other users depend on, such as the writer and `DeltaOps` implementations.

The key distinction here is that this would not contain the cloud-specific or engine-specific functionality, such as the DataFusion integration and dependency. One of the potential benefits of this refactoring is that we might be able to advocate for the inclusion of `deltalake-core` as a `datafusion` dependency, and move the `TableProvider` for `delta` upstream into `datafusion` so that it can have native Delta support.

#### Dependencies
I believe this crate would need to take the least necessary dependencies to do its job, which would likely be:

- `arrow-*` sub-crates (I think we can avoid some of our current dependency tree here)
- `object_store`
- `tokio`
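As a rough sketch of how far that minimal set goes (plus `futures` for the stream combinators), listing a table's `_delta_log` commits needs nothing beyond `object_store` and `tokio`; the local path here is hypothetical, and the async `list` signature assumes the object_store releases current at the time of writing:

```rust
// A minimal sketch of core log processing on top of object_store alone:
// list the numbered JSON commit files under a table's _delta_log/ directory.
// The local path is hypothetical; any ObjectStore backend would work.
use futures::StreamExt;
use object_store::{local::LocalFileSystem, path::Path, ObjectStore};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let store = LocalFileSystem::new_with_prefix("/tmp/my_delta_table")?;
    let log_dir = Path::from("_delta_log");

    // Every commit to a Delta table is a monotonically numbered .json file.
    let mut commits = store.list(Some(&log_dir)).await?;
    while let Some(meta) = commits.next().await {
        let meta = meta?;
        if meta.location.as_ref().ends_with(".json") {
            println!("commit: {} ({} bytes)", meta.location, meta.size);
        }
    }
    Ok(())
}
```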
### deltalake-aws ❔

This crate would contain the DynamoDB locking code and other special-case storage logic related to AWS/S3. Right now this code is kind of a mess (IMHO) in the Rust crate. There's some refactoring that @roeap has tried here but we've not merged it. Additionally there are some changes that need to be made in #1601, which would be a good time to break the AWS code out.
### deltalake-azure ❔
Similar motivation to the above with AWS. There's some Azure- and OneLake-specific code we have floating around which could/should be moved over. Hopefully having a small and narrow Azure-specific section of the tree would make it easier for contributors to help improve our support.
### deltalake-gcp ❔
☝️
### deltalake-datafusion ❔
The `deltalake-datafusion` crate would start out with the DataFusion `TableProvider` and some other DataFusion-powered operations, but I would hope to get the `TableProvider` upstream. At that point this crate would become a place to host the DataFusion-powered extensions to core functionality we want to support, as well as other optimizations to make Delta nice in DataFusion land.

❗ This would probably be the hardest dependency to get right since there's a tight coupling between major releases of arrow, datafusion, and in turn the existing `deltalake` crate. By separating this crate out, I am hopeful that `deltalake-core` would be able to adopt newer Arrow versions at a more rapid pace, rather than being strung along downstream from the `arrow` -> `datafusion` -> `deltalake` release cycles.
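To make that coupling concrete, here is a minimal sketch of what this crate would package up, assuming the existing `datafusion` feature of `deltalake` where `DeltaTable` implements DataFusion's `TableProvider`; the table path is hypothetical:

```rust
// Querying a Delta table through DataFusion; DeltaTable implements
// TableProvider, so it registers like any other table source.
use std::sync::Arc;
use datafusion::prelude::SessionContext;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let table = deltalake::open_table("/tmp/my_delta_table").await?;
    let ctx = SessionContext::new();

    ctx.register_table("demo", Arc::new(table))?;
    ctx.sql("SELECT count(*) FROM demo").await?.show().await?;
    Ok(())
}
```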
### deltalake-catalog-glue ❔

Assuming the `deltalake-core` package has a trait which defines what a "Catalog" should look like and how that's used, this would contain the AWS Glue Data Catalog specific code.

I do not believe this would need to take a dependency on `deltalake-aws` so long as both crates share the same version range for the `aws_config` crate once they have adopted the AWS SDK for Rust, which also distributes service-specific crates.
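Hypothetically, such a trait could be as small as resolving a database/table pair to a storage location; the trait name, method, and error type below are assumptions rather than an existing API:

```rust
// A hypothetical sketch of a Catalog trait in deltalake-core; nothing
// here is a settled API, only an illustration of how small it could be.
use async_trait::async_trait;

#[derive(Debug)]
pub struct CatalogError(pub String);

#[async_trait]
pub trait Catalog: Send + Sync {
    /// Resolve a database/table pair to the table's storage location (a URI).
    async fn get_table_storage_location(
        &self,
        database: &str,
        table: &str,
    ) -> Result<String, CatalogError>;
}

// deltalake-catalog-glue would then provide something like a
// `GlueCatalog { client: aws_sdk_glue::Client }` implementing this trait.
```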
### deltalake-catalog-unity ❔
Assuming the `Catalog` trait in `deltalake-core`, this would contain the Databricks Unity Catalog specific code.

### deltalake-testing ❔
This package would be more for the development of the delta-rs project itself, but would provide all the test utilities and interfaces we find so helpful for writing integration tests.
## The Meta Package
The `deltalake` crate would continue to be released as it is today, and maintain its feature flags which dictate what versions and configuration of the sub-crates it includes. I am hopeful that the `deltalake` crate simply becomes a "shell", that is, a `Cargo.toml` and a `src/lib.rs` which re-exports a number of symbols for users' convenience.
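Concretely, the entire `src/lib.rs` of that shell could amount to a handful of feature-gated re-exports; a minimal sketch, assuming hypothetical feature and alias names:

```rust
// A sketch of the deltalake "shell" crate's src/lib.rs; the feature
// names and module aliases are assumptions, not final.
pub use deltalake_core::*;

#[cfg(feature = "s3")]
pub use deltalake_aws as aws;

#[cfg(feature = "azure")]
pub use deltalake_azure as azure;

#[cfg(feature = "gcs")]
pub use deltalake_gcp as gcp;

#[cfg(feature = "datafusion")]
pub use deltalake_datafusion as datafusion;

#[cfg(feature = "glue")]
pub use deltalake_catalog_glue as glue;
```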
## Semantic Versioning

I think all the crates should follow semantic versioning of course, but should share a major version. The `deltalake-core` crate should not have public API changes within the major ranges, so that `deltalake-core` 0.20.0, 0.21.0, 0.22.0, etc. can be used with `deltalake-aws` 0.1.0 and so on.

## Conclusion
I am volunteering to take on this work, and would target `0.20` as a good version to set as a milestone marker for such a release. If others are game to try this out, I can start pulling the scope into a milestone for the work.