@adriangb
Contributor
Most of these file source implementations cannot operate without a schema; they all have .expect("schema must be set") calls, which undermines using the language to enforce correctness.

This is an attempt to rework that by making it so you have to pass in a schema to construct them.

That said there are downsides:

  1. More boilerplate.
  2. Requires that the schema passed into FileScanConfig and FileSource match.

I feel like there's another twist needed here... maybe moving the schema out of FileScanConfig? That's not currently possible since it's used in both places. Maybe having a FileScan and a FileScanConfig, with construction being FileScan::new(FileSource::new(config), config)?
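The shape of the change can be sketched with minimal stand-in types (all names and types below are illustrative, not the real DataFusion definitions): once the schema is a required constructor argument, every method can rely on it without an Option or a runtime expect.

```rust
use std::sync::Arc;

// Illustrative stand-in; pretend a schema is just a list of column names.
type SchemaRef = Arc<Vec<String>>;

// After the rework: the schema is required at construction, so the
// invariant "a source always has a schema" is enforced by the type system.
struct ParquetSourceSketch {
    schema: SchemaRef,
}

impl ParquetSourceSketch {
    fn new(schema: SchemaRef) -> Self {
        Self { schema }
    }

    // Infallible: no .expect() needed anywhere downstream.
    fn schema(&self) -> &SchemaRef {
        &self.schema
    }
}
```

A caller simply cannot construct the source without a schema, which is the whole point of the rework.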

@github-actions github-actions bot added core Core DataFusion crate substrait Changes to the substrait crate catalog Related to the catalog crate proto Related to proto crate datasource Changes to the datasource crate labels Oct 30, 2025
@adriangb adriangb marked this pull request as ready for review October 30, 2025 21:17
```rust
let source = Arc::new(ParquetSource::new(schema.clone()));

let scan_config =
    FileScanConfigBuilder::new(ObjectStoreUrl::local_filesystem(), schema, source)
```
Contributor
If the source already has a schema, can we avoid passing it into the FileScanConfigBuilder too?

Contributor Author
I think so, I'll give it a shot

Contributor Author
I got it to work! It's more code churn but I do think it's better. Now there is a single source of truth for the schema.

This does mean more and more "stuff" is being moved from FileScanConfig into FileSource but I think that's in line with the big picture goal.

FWIW what I envision FileScanConfig doing in the end is:

  • Statistics. It has the list of files and partitioning so it can compute statistics, EqProperties, etc. It will have to get the filter, projection and unprojected schema from the FileSource.
  • Some shared configuration.
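That envisioned split could be loosely sketched like this (every name here is hypothetical, not the actual DataFusion API): the config keeps the file list and derives statistics-style information from it, while anything schema-related is delegated to the source.

```rust
use std::sync::Arc;

// Hypothetical stand-ins for the envisioned end state, not real DataFusion types.
type SchemaRef = Arc<Vec<String>>;

trait FileSourceSketch {
    fn schema(&self) -> SchemaRef;              // single source of truth
    fn projection(&self) -> Option<Vec<usize>>; // pushed-down projection, if any
}

struct ScanConfigSketch {
    source: Arc<dyn FileSourceSketch>,
    file_groups: Vec<Vec<String>>, // list of files, grouped by partition
}

impl ScanConfigSketch {
    // Derived from the file list the config owns...
    fn num_files(&self) -> usize {
        self.file_groups.iter().map(|g| g.len()).sum()
    }

    // ...while schema questions are answered by asking the source.
    fn projected_len(&self) -> usize {
        match self.source.projection() {
            Some(p) => p.len(),
            None => self.source.schema().len(),
        }
    }
}
```

The design choice is that the config never stores a second copy of the schema, so the two can never disagree.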

Contributor Author
@alamb would appreciate another look to see if you're okay with the larger diff that change brought with it

@xudong963
Member

Most of these file source implementations cannot operate without a schema; they all have .expect("schema must be set") calls, which undermines using the language to enforce correctness.

Can we set the schema for filesource at the FileScanConfigBuilder level?

```diff
diff --git a/datafusion/datasource/src/file_scan_config.rs b/datafusion/datasource/src/file_scan_config.rs
index 5847a8cf5..660ba4615 100644
--- a/datafusion/datasource/src/file_scan_config.rs
+++ b/datafusion/datasource/src/file_scan_config.rs
@@ -290,10 +290,11 @@ impl FileScanConfigBuilder {
         file_schema: SchemaRef,
         file_source: Arc<dyn FileSource>,
     ) -> Self {
+        let table_schema = TableSchema::from_file_schema(file_schema.clone());
         Self {
             object_store_url,
-            table_schema: TableSchema::from_file_schema(file_schema),
-            file_source,
+            table_schema: table_schema.clone(),
+            file_source: file_source.with_schema(table_schema),
             file_groups: vec![],
             statistics: None,
             output_ordering: vec![],
```

@adriangb
Contributor Author

adriangb commented Nov 2, 2025

Most of these file source implementations cannot operate without a schema; they all have .expect("schema must be set") calls, which undermines using the language to enforce correctness.

Can we set the schema for filesource at the FileScanConfigBuilder level?

```diff
diff --git a/datafusion/datasource/src/file_scan_config.rs b/datafusion/datasource/src/file_scan_config.rs
index 5847a8cf5..660ba4615 100644
--- a/datafusion/datasource/src/file_scan_config.rs
+++ b/datafusion/datasource/src/file_scan_config.rs
@@ -290,10 +290,11 @@ impl FileScanConfigBuilder {
         file_schema: SchemaRef,
         file_source: Arc<dyn FileSource>,
     ) -> Self {
+        let table_schema = TableSchema::from_file_schema(file_schema.clone());
         Self {
             object_store_url,
-            table_schema: TableSchema::from_file_schema(file_schema),
-            file_source,
+            table_schema: table_schema.clone(),
+            file_source: file_source.with_schema(table_schema),
             file_groups: vec![],
             statistics: None,
             output_ordering: vec![],
```

That's exactly how it was done before! The issue with this is that the FileSource then has schema: Option<TableSchema>, which is problematic because e.g. when we push down projections, filters, etc. we need the schema, so we end up doing self.schema.expect("you have to call with_schema() first"), which is very non-idiomatic.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Nov 2, 2025
@adriangb adriangb force-pushed the refactor-source-constructor branch from 5955032 to ef73b24 Compare November 2, 2025 21:17
@adriangb
Contributor Author

adriangb commented Nov 2, 2025

A small followup to do after this: #18453

@adriangb adriangb force-pushed the refactor-source-constructor branch 4 times, most recently from 612d25f to 490da94 Compare November 2, 2025 21:44
@adriangb adriangb requested a review from alamb November 2, 2025 22:31
@alamb alamb added the api change Changes the API exposed to users of the crate label Nov 3, 2025
Contributor

@alamb alamb left a comment

This makes sense to me -- thank you @adriangb and @xudong963

Note that I believe this is also an API change, so it may make sense to wait until we cut the 51 release branch (#17558) and then merge this one in

@adriangb
Contributor Author

adriangb commented Nov 3, 2025

Note that I believe this is also an API change so it may make sense to wait until we cut the 51 release branch (#17558 ) and then merge this one in

Is the concern that this would otherwise interfere with a 50.4.0 release? I am actually concerned about something else:
I think this is not going to be the last breaking change we make in this area of the code (I'm hoping others are more minor though) and it might be nice to get all of those in 1 release so that the upgrade is not fragmented for users and they can do it all in one go, with good pre/post examples, etc.

@adriangb adriangb force-pushed the refactor-source-constructor branch from 490da94 to d9cf3d8 Compare November 3, 2025 22:18
@adriangb
Contributor Author

adriangb commented Nov 4, 2025

Note that I believe this is also an API change so it may make sense to wait until we cut the 51 release branch (#17558 ) and then merge this one in

Is the concern that this would otherwise interfere with a 50.4.0 release? I am actually concerned about something else: I think this is not going to be the last breaking change we make in this area of the code (I'm hoping others are more minor though) and it might be nice to get all of those in 1 release so that the upgrade is not fragmented for users and they can do it all in one go, with good pre/post examples, etc.

I think I need clarification on what you meant @alamb: do you want to wait to merge this until after we cut the 51 release branch so that it doesn't end up in the 51 release or so that it does end up in the 51 release? Either one is fine by me but I think it makes sense to try and "group" breaking changes within the same area of code into a single release so that users can upgrade their code once instead of multiple times.

@alamb
Contributor

alamb commented Nov 4, 2025

Note that I believe this is also an API change so it may make sense to wait until we cut the 51 release branch (#17558 ) and then merge this one in

Is the concern that this would otherwise interfere with a 50.4.0 release? I am actually concerned about something else: I think this is not going to be the last breaking change we make in this area of the code (I'm hoping others are more minor though) and it might be nice to get all of those in 1 release so that the upgrade is not fragmented for users and they can do it all in one go, with good pre/post examples, etc.

I think I need clarification on what you meant @alamb: do you want to wait to merge this until after we cut the 51 release branch so that it doesn't end up in the 51 release or so that it does end up in the 51 release? Either one is fine by me but I think it makes sense to try and "group" breaking changes within the same area of code into a single release so that users can upgrade their code once instead of multiple times.

I was thinking that we would wait to make the branch-51 before merging this PR in -- therefore the first release that would have this change would be 52.0.0

As you say, given we are planning some larger churn of the code in this area, I figured we could "batch" them all into the 52 release

@adriangb
Contributor Author

adriangb commented Nov 4, 2025

Okay we're on the same page then - happy to wait :)
