-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Refactor FileSource constructors to take a schema
#18386
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| let source = Arc::new(ParquetSource::new(schema.clone())); | ||
|
|
||
| let scan_config = | ||
| FileScanConfigBuilder::new(ObjectStoreUrl::local_filesystem(), schema, source) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the source already has a schema, can we avoid passing it into the FileScanConfigBuilder too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so, I'll give it a shot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got it to work! It's more code churn but I do think it's better. Now there is a single source of truth for the schema.
This does mean more and more "stuff" is being moved from FileScanConfig into FileSource but I think that's in line with the big picture goal.
FWIW what I envision FileScanConfig doing in the end is:
- Statistics. It has the list of files and partitioning so it can compute statistics, EqProperties, etc. It will have to get the filter, projection and unprojected schema from the FileSource.
- Some shared configuration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb would appreciate another look to see if you're okay with the larger diff that change brought with it
Can we set the schema for filesource at the FileScanConfigBuilder level? diff --git a/datafusion/datasource/src/file_scan_config.rs b/datafusion/datasource/src/file_scan_config.rs
index 5847a8cf5..660ba4615 100644
--- a/datafusion/datasource/src/file_scan_config.rs
+++ b/datafusion/datasource/src/file_scan_config.rs
@@ -290,10 +290,11 @@ impl FileScanConfigBuilder {
file_schema: SchemaRef,
file_source: Arc<dyn FileSource>,
) -> Self {
+ let table_schema = TableSchema::from_file_schema(file_schema.clone());
Self {
object_store_url,
- table_schema: TableSchema::from_file_schema(file_schema),
- file_source,
+ table_schema: table_schema.clone(),
+ file_source: file_source.with_schema(table_schema),
file_groups: vec![],
statistics: None,
output_ordering: vec![],
``` |
That's exactly how it was done before! The issue with this is that the |
5955032 to
ef73b24
Compare
|
A small followup to do after this: #18453 |
612d25f to
490da94
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense to me -- thank you @adriangb and @xudong963
Note that I believe this is also an API change so it may make sense to wait until we cut the 51 release branch (#17558 ) and then merge this one in
Is the concern that this would otherwise interfere with a |
490da94 to
d9cf3d8
Compare
I think I need clarification on what you meant @alamb: do you want to wait to merge this until after we cut the 51 release branch so that it doesn't end up in the 51 release or so that it does end up in the 51 release? Either one is fine by me but I think it makes sense to try and "group" breaking changes within the same area of code into a single release so that users can upgrade their code once instead of multiple times. |
I was thinking that we would wait to make the branch-51 before merging this PR in -- therefore the first release that would have this change would be 52.0.0 As you say, given we are planning some larger churn of the code in this area, I figured we could "batch" them all into the 52 release |
|
Okay we're on the same page then - happy to wait :) |
Most of these file source implementations cannot operate without schema, they all have
.expect("schema must be set")s that violate using the language to enforce correctness.This is an attempt to rework that by making it so you have to pass in a schema to construct them.
That said there are downsides:
FileScanConfigandFileSourcematch.I feel like there's another twist to this needed... maybe moving the schema out of
FileScanConfig? That's not currently possible, it's used in both places. Maybe having aFileScanand aFileScanConfigand having construction beFileScan::new(FileSource::new(config), config)?