Commit 612d25f

committed
add upgrading guide
1 parent 2a3b131 commit 612d25f

1 file changed

+80
-0
lines changed

docs/source/library-user-guide/upgrading.md

Lines changed: 80 additions & 0 deletions
@@ -182,6 +182,86 @@ let indices = projection_exprs.column_indices();
_execution plan_ of the query. With this release, `DESCRIBE query` now outputs
the computed _schema_ of the query, consistent with the behavior of `DESCRIBE table_name`.

### Refactoring of `FileSource` constructors and `FileScanConfigBuilder` to accept schemas upfront

The way schemas are passed to file sources and scan configurations has been refactored. File sources now require the schema (including partition columns) to be provided at construction time, and `FileScanConfigBuilder` no longer takes a separate schema parameter.

**Who is affected:**

- Users who create `FileScanConfig` or file sources (`ParquetSource`, `CsvSource`, `JsonSource`, `AvroSource`) directly
- Users who provide custom `FileFormat` implementations

**Key changes:**

1. **`FileSource` constructors now require `TableSchema`**: All built-in file sources now take the schema in their constructor:
   ```diff
   - let source = ParquetSource::default();
   + let source = ParquetSource::new(table_schema);
   ```

2. **`FileScanConfigBuilder` no longer takes schema as a parameter**: The schema is now passed via the `FileSource`:
   ```diff
   - FileScanConfigBuilder::new(url, schema, source)
   + FileScanConfigBuilder::new(url, source)
   ```

3. **Partition columns are now part of `TableSchema`**: The `with_table_partition_cols()` method has been removed from `FileScanConfigBuilder`. Partition columns are now passed as part of the `TableSchema` to the `FileSource` constructor:
   ```diff
   + let table_schema = TableSchema::new(
   +     file_schema,
   +     vec![Arc::new(Field::new("date", DataType::Utf8, false))],
   + );
   + let source = ParquetSource::new(table_schema);
     let config = FileScanConfigBuilder::new(url, source)
   -     .with_table_partition_cols(vec![Field::new("date", DataType::Utf8, false)])
         .with_file(partitioned_file)
         .build();
   ```

4. **`FileFormat::file_source()` now takes a `TableSchema` parameter**: Custom `FileFormat` implementations must be updated:
   ```diff
     impl FileFormat for MyFileFormat {
   -     fn file_source(&self) -> Arc<dyn FileSource> {
   +     fn file_source(&self, table_schema: TableSchema) -> Arc<dyn FileSource> {
   -         Arc::new(MyFileSource::default())
   +         Arc::new(MyFileSource::new(table_schema))
         }
     }
   ```
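
The ownership change described in the key changes can be illustrated with a small standalone model. This is a rough sketch using only the standard library: `Field`, `ToyTableSchema`, `ToySource`, and `ToyScanConfigBuilder` are hypothetical stand-ins, not the DataFusion types, and the assumption that partition columns follow the file columns in the combined schema is made here for illustration.

```rust
use std::sync::Arc;

// Hypothetical stand-in types for illustration; these are NOT DataFusion's types.
#[derive(Debug, Clone)]
struct Field {
    name: String,
    data_type: String,
}

impl Field {
    fn new(name: &str, data_type: &str) -> Self {
        Self { name: name.to_string(), data_type: data_type.to_string() }
    }
}

/// Toy analogue of `TableSchema`: file columns plus partition columns.
#[derive(Debug, Clone)]
struct ToyTableSchema {
    file_fields: Vec<Arc<Field>>,
    partition_fields: Vec<Arc<Field>>,
}

impl ToyTableSchema {
    fn new(file_fields: Vec<Arc<Field>>, partition_fields: Vec<Arc<Field>>) -> Self {
        Self { file_fields, partition_fields }
    }

    /// File columns followed by partition columns (ordering assumed by this sketch).
    fn table_fields(&self) -> Vec<Arc<Field>> {
        self.file_fields
            .iter()
            .chain(self.partition_fields.iter())
            .cloned()
            .collect()
    }
}

/// Toy analogue of a file source: it owns the schema from construction onward.
struct ToySource {
    table_schema: ToyTableSchema,
}

impl ToySource {
    fn new(table_schema: ToyTableSchema) -> Self {
        Self { table_schema }
    }
}

/// Toy analogue of the scan-config builder: it takes no schema parameter
/// and reads the schema back from the source instead.
struct ToyScanConfigBuilder {
    url: String,
    source: Arc<ToySource>,
}

impl ToyScanConfigBuilder {
    fn new(url: &str, source: Arc<ToySource>) -> Self {
        Self { url: url.to_string(), source }
    }

    /// Render every column visible to the scan as "name: type".
    fn columns(&self) -> Vec<String> {
        self.source
            .table_schema
            .table_fields()
            .iter()
            .map(|f| format!("{}: {}", f.name, f.data_type))
            .collect()
    }
}

/// Build a scan over two file columns and one partition column.
fn build_example() -> ToyScanConfigBuilder {
    let table_schema = ToyTableSchema::new(
        vec![Arc::new(Field::new("id", "Int64")), Arc::new(Field::new("value", "Utf8"))],
        vec![Arc::new(Field::new("date", "Utf8"))],
    );
    let source = Arc::new(ToySource::new(table_schema));
    ToyScanConfigBuilder::new("file:///data/", source)
}

fn main() {
    let builder = build_example();
    println!("{} -> {:?}", builder.url, builder.columns());
}
```

In this model the builder has a single source of truth for the schema, so the file schema and the partition columns cannot drift apart between the source and the scan configuration.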

**Migration examples:**

For Parquet files:
```diff
- let source = Arc::new(ParquetSource::default());
- let config = FileScanConfigBuilder::new(url, schema, source)
+ let table_schema = TableSchema::new(schema, vec![]);
+ let source = Arc::new(ParquetSource::new(table_schema));
+ let config = FileScanConfigBuilder::new(url, source)
      .with_file(partitioned_file)
      .build();
```

For CSV files with partition columns:
```diff
- let source = Arc::new(CsvSource::new(true, b',', b'"'));
- let config = FileScanConfigBuilder::new(url, file_schema, source)
-     .with_table_partition_cols(vec![Field::new("year", DataType::Int32, false)])
+ let options = CsvOptions {
+     has_header: Some(true),
+     delimiter: b',',
+     quote: b'"',
+     ..Default::default()
+ };
+ let table_schema = TableSchema::new(
+     file_schema,
+     vec![Arc::new(Field::new("year", DataType::Int32, false))],
+ );
+ let source = Arc::new(CsvSource::new(table_schema).with_csv_options(options));
+ let config = FileScanConfigBuilder::new(url, source)
      .build();
```
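
Note that the partition columns in the examples above change shape as well: the removed `with_table_partition_cols()` call received plain `Field` values, while `TableSchema::new` is shown receiving each field wrapped in `Arc`. If existing code already holds a `Vec` of owned fields, the conversion might look like the following sketch. The `Field` struct here is a hypothetical stand-in rather than the real Arrow type, and `into_shared_fields` is an illustrative helper name, not part of any library.

```rust
use std::sync::Arc;

// Hypothetical stand-in for an Arrow field; NOT the real `arrow` type.
#[derive(Debug, Clone)]
struct Field {
    name: String,
    nullable: bool,
}

/// Wrap each owned field in `Arc`, mirroring the
/// `vec![Arc::new(Field::new(..))]` shape used in the migration examples.
fn into_shared_fields(fields: Vec<Field>) -> Vec<Arc<Field>> {
    fields.into_iter().map(Arc::new).collect()
}

fn main() {
    // Partition columns as they might have been held for the old builder call.
    let old_style = vec![Field { name: "year".to_string(), nullable: false }];
    let shared = into_shared_fields(old_style);
    println!("{:?}", shared);
}
```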

### Introduction of `TableSchema` and changes to `FileSource::with_schema()` method

A new `TableSchema` struct has been introduced in the `datafusion-datasource` crate to better manage table schemas with partition columns. This struct helps distinguish between:
