Fix: ListingTableFactory hive column detection #17232

BlakeOrth · 2025-08-18T21:32:04Z

Which issue does this PR close?

Closes Bug auto detecting partitions with ListingTableFactory on Hive partitioned datasets #17049

What changes are included in this PR?

Fixes an issue in the ListingTableFactory where hive columns are not detected and incorporated into the table schema when an explicit schema has not been set by the user
Fixes an issue where subdirectories that do not follow Hive formatting (e.g. key=value) could be erroneously interpreted as contributing to the table schema

Are these changes tested?

Yes, the PR includes both unit tests and sqllogic tests that exercise the fixes

Are there any user-facing changes?

I think there's a possibility for the changes to impact users of the ListingTableFactory. In my mind the behavior represented here is what I would think the "expected" behavior should be, but there's a good possibility users are relying on the previous behavior and could get unexpected results if this PR is merged.

cc @alamb

- Fixes an issue in the ListingTableFactory where hive columns are not detected and incorporated into the table schema when an explicit schema has not been set by the user - Fixes an issue where subdirectories that do not follow Hive formatting (e.g. key=value) could be erroneously interpreted as contributing to the table schema

BlakeOrth · 2025-08-18T21:33:22Z

datafusion/core/src/datasource/listing/table.rs

+                    // Partitions are expected to follow the format "column_name=value", so we
+                    // should ignore any path part that cannot be parsed into the expected format
+                    .filter(|s| s.contains('='))


This code is necessary to prevent regressions exposed by existing sqllogic tests. Here's the failures from existing tests for this PR if this fix is not included

Completed 345 test files in 17 seconds External error: 6 errors in file /home/blake/open_source_src/datafusion-BlakeOrth/datafusion/sqllogictest/test_files/parquet.slt 1. statement failed: DataFusion error: Error during planning: Found mixed partition values on disk [[], ["subdir"]] [SQL] CREATE EXTERNAL TABLE listing_table_folder_0 STORED AS PARQUET LOCATION 'test_files/scratch/parquet/test_table/'; at /home/blake/open_source_src/datafusion-BlakeOrth/datafusion/sqllogictest/test_files/parquet.slt:306 2. query failed: DataFusion error: Error during planning: table 'datafusion.public.listing_table_folder_0' not found [SQL] select count(*) from listing_table_folder_0; at /home/blake/open_source_src/datafusion-BlakeOrth/datafusion/sqllogictest/test_files/parquet.slt:315 3. query failed: DataFusion error: Error during planning: table 'datafusion.public.listing_table_folder_0' not found [SQL] select count(*) from listing_table_folder_0; at /home/blake/open_source_src/datafusion-BlakeOrth/datafusion/sqllogictest/test_files/parquet.slt:324 4. statement failed: DataFusion error: Error during planning: Found mixed partition values on disk [[], ["subdir"]] [SQL] CREATE EXTERNAL TABLE listing_table_folder_1 STORED AS PARQUET LOCATION 'test_files/scratch/parquet/test_table'; at /home/blake/open_source_src/datafusion-BlakeOrth/datafusion/sqllogictest/test_files/parquet.slt:330 5. query failed: DataFusion error: Error during planning: table 'datafusion.public.listing_table_folder_1' not found [SQL] select count(*) from listing_table_folder_1; at /home/blake/open_source_src/datafusion-BlakeOrth/datafusion/sqllogictest/test_files/parquet.slt:339 6. query failed: DataFusion error: Error during planning: table 'datafusion.public.listing_table_folder_1' not found [SQL] select count(*) from listing_table_folder_1; at /home/blake/open_source_src/datafusion-BlakeOrth/datafusion/sqllogictest/test_files/parquet.slt:348

alamb

THanks @BlakeOrth

I played with this PR for a while, and I think this PR basically enables automatic partition detection for all listing tables which is quite cool.

For example with the following setup:

DataFusion CLI v49.0.0
> copy (values ('foo'), ('bar')) to '/tmp/table/a=1/file1.parquet';

+-------+
| count |
+-------+
| 2     |
+-------+
1 row(s) fetched.
Elapsed 0.034 seconds.

> copy (values ('baz')) to '/tmp/table/a=1/file2.parquet';
+-------+
| count |
+-------+
| 1     |
+-------+
1 row(s) fetched.
Elapsed 0.002 seconds.

> create external table t stored as parquet location '/tmp/table';
0 row(s) fetched.
Elapsed 0.010 seconds.

On main, no partition columns are recognized:

Main

> select * from t;
+---------+
| column1 |
+---------+
| foo     |
| bar     |
| baz     |
+---------+
3 row(s) fetched.
Elapsed 0.012 seconds.

This PR

> select * from t;
+---------+---+
| column1 | a |
+---------+---+
| baz     | 1 |
| foo     | 1 |
| bar     | 1 |
+---------+---+
3 row(s) fetched.
Elapsed 0.015 seconds.

I think this change is actually a quite a nice improvement but is really a feature, not a bug fix, and I think the code you have here actually looks good 🏆

In order to proceed I think we need a few things:

More tests (see below)
Add a configuration setting to disable this new behavior (let people opt out of automatic partition inference). Perhaps datafusion.execution.infer_partitioning, similarly to datafusion.execution.collect_statistics
Document the behavior (with an example) in https://datafusion.apache.org/user-guide/sql/ddl.html#create-external-table
Add a note to the upgrade guide https://datafusion.apache.org/library-user-guide/upgrading.html

New tests

I looked around and it seems we don't actually have any tests focused on reading hive partitioned datasets in .slt, nor automatically detecting hive partitions on table creation

Would you be willing to write some tests in a new file, such as datafusion/sqllogictest/test_files/listing_table_partitions.slt (following the naming convention of datafusion/sqllogictest/test_files/listing_table_statistics.slt)

I think you could start with my example SQL commands above and write the related tests

I know this is a lot. I am sorry for the delayed feedback. Please let me know what you think.

alamb · 2025-08-22T21:54:08Z

FYI @adrian and @nuno-faria

BlakeOrth · 2025-08-23T00:00:07Z

@alamb

I know this is a lot. I am sorry for the delayed feedback. Please let me know what you think.

No worries on the delay, I can absolutely fill in the extra blanks. I just have a couple of quick questions on how to proceed.

More tests (see below)

The sqllogic test I added actually does work on a hive partitioned table. I setup a "readback" using an existing table that's created with the following setup:

datafusion/datafusion/sqllogictest/test_files/insert_to_external.slt

Lines 155 to 160 in cb571d9

    
           statement ok 
        
           CREATE EXTERNAL TABLE 
        
           partitioned_insert_test(a string, b string, c bigint) 
        
           STORED AS csv 
        
           LOCATION 'test_files/scratch/insert_to_external/insert_to_partitioned/' 
        
           PARTITIONED BY (a, b);

This results in the following directory/file structure:

datafusion/sqllogictest/test_files/scratch/insert_to_external/insert_to_partitioned$ tree .
.
├── a=10
│   ├── b=100
│   │   └── ImzsRq5BY2218Z0v.csv
│   └── b=200
│       └── ImzsRq5BY2218Z0v.csv
├── a=20
│   ├── b=100
│   │   └── ImzsRq5BY2218Z0v.csv
│   └── b=200
│       └── ImzsRq5BY2218Z0v.csv
└── a=30
    └── b=300
        └── YGMRSbmO08FmNfDs.csv

9 directories, 5 files

The existing tests prior to this PR all passed because the table was created with an explicit schema, so DataFusion recognized/parsed all the partition columns. The test I implemented for this PR starts a new table, and fails on main with the following errors:

1. query result mismatch:
[SQL] describe partitioned_insert_test_readback;
[Diff] (-expected|+actual)
-   c Int64 YES
-   a Dictionary(UInt16, Utf8) NO
-   b Dictionary(UInt16, Utf8) NO
+   c Int64 YES
at /home/blake/open_source_src/datafusion/datafusion/sqllogictest/test_files/insert_to_external.slt:184


2. query failed: DataFusion error: Schema error: No field named a. Valid fields are partitioned_insert_test_readback.c, partitioned_insert_test_readback.c.
[SQL] select * from partitioned_insert_test_readback order by a,b,c;
at /home/blake/open_source_src/datafusion/datafusion/sqllogictest/test_files/insert_to_external.slt:191


3. query failed: DataFusion error: Schema error: No field named b. Valid fields are partitioned_insert_test_readback.c.
[SQL] select count(*) from partitioned_insert_test_readback where b=100;
at /home/blake/open_source_src/datafusion/datafusion/sqllogictest/test_files/insert_to_external.slt:201

I believe this sequence of tests successfully exercises both the creation of hive partitioned tables, as well as the expected outcome when reading that data.

I'm happy to move this test, and the associated test setup, to a new file and/or augment it with the SQL commands you've laid out above. Given the above explanation, let me know if you'd still prefer to proceed with a new test file or if what's already there is sufficient.

Add a configuration setting to disable this new behavior (let people opt out of automatic partition inference). Perhaps datafusion.execution.infer_partitioning, similarly to datafusion.execution.collect_statistics

I think it's prudent to note that the changes here only effect the ListingTableFactory and low level ListingTable users wouldn't experience any change. Is an execution option stil the right place for the configuration setting? datafusion.execution.listing_table_factory_infer_partitions with a default of true?

Document the behavior (with an example) in https://datafusion.apache.org/user-guide/sql/ddl.html#create-external-table

Add a note to the upgrade guide https://datafusion.apache.org/library-user-guide/upgrading.html

No questions here, I can take care of this.

alamb · 2025-08-23T09:46:05Z

I believe this sequence of tests successfully exercises both the creation of hive partitioned tables, as well as the expected outcome when reading that data.

I'm happy to move this test, and the associated test setup, to a new file and/or augment it with the SQL commands you've laid out above. Given the above explanation, let me know if you'd still prefer to proceed with a new test file or if what's already there is sufficient.

I think give we are about to add non trivial new functionality (even if the code is not new), we should make the tests I suggested

I think it's prudent to note that the changes here only effect the ListingTableFactory and low level ListingTable users wouldn't experience any change. Is an execution option stil the right place for the configuration setting?

Yes, I think so (unless you can find another location for Listingtable settings -- maybe we should make a new category 🤔 )

nuno-faria · 2025-08-23T15:08:03Z

Thanks for the changes @BlakeOrth.
I also believe that automatically resolving the partitioning is the best approach. Like @alamb said, it would be nice to document and add an option to toggle this off if needed. In addition to adding documentation at https://datafusion.apache.org/user-guide/sql/ddl.html#create-external-table, I think we should also have additional documentation here: https://datafusion.apache.org/user-guide/cli/datasources.html#create-external-table.

One thing to note is that I believe that CREATE EXTERNAL TABLE now has a different behavior from SessionContext::register_parquet, which still requires partitions to be manually specified. I also feel register_parquet could use a form of automatic partition detection, maybe in a flag passed to ParquetReadOptions.

alamb · 2025-08-24T10:46:03Z

One thing to note is that I believe that CREATE EXTERNAL TABLE now has a different behavior from SessionContext::register_parquet, which still requires partitions to be manually specified. I also feel register_parquet could use a form of automatic partition detection, maybe in a flag passed to ParquetReadOptions.

yes please!

Would it help if I made a "mini" epic and broke this project down into some smaller parts / tickets so people could work on it in parallel?

(I am on vacation this week, so my responses will be delayed(

nuno-faria · 2025-08-24T11:19:02Z

Would it help if I made a "mini" epic and broke this project down into some smaller parts / tickets so people could work on it in parallel?

I think that would be a good idea.

(I am on vacation this week, so my responses will be delayed(

I also think it could wait since its not urgent.

github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Aug 18, 2025

BlakeOrth commented Aug 18, 2025

View reviewed changes

BlakeOrth mentioned this pull request Aug 18, 2025

Fix: ListingTableFactory hive column detection #17050

Closed

alamb reviewed Aug 22, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix: ListingTableFactory hive column detection #17232

Fix: ListingTableFactory hive column detection #17232

BlakeOrth commented Aug 18, 2025

Uh oh!

BlakeOrth Aug 18, 2025

Uh oh!

alamb left a comment

Uh oh!

alamb commented Aug 22, 2025

Uh oh!

BlakeOrth commented Aug 23, 2025

Uh oh!

alamb commented Aug 23, 2025

Uh oh!

nuno-faria commented Aug 23, 2025

Uh oh!

alamb commented Aug 24, 2025

Uh oh!

nuno-faria commented Aug 24, 2025

Uh oh!

Uh oh!

Fix: ListingTableFactory hive column detection #17232

Are you sure you want to change the base?

Fix: ListingTableFactory hive column detection #17232

Conversation

BlakeOrth commented Aug 18, 2025

Which issue does this PR close?

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

BlakeOrth Aug 18, 2025

Choose a reason for hiding this comment

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

New tests

Uh oh!

alamb commented Aug 22, 2025

Uh oh!

BlakeOrth commented Aug 23, 2025

Uh oh!

alamb commented Aug 23, 2025

Uh oh!

nuno-faria commented Aug 23, 2025

Uh oh!

alamb commented Aug 24, 2025

Uh oh!

nuno-faria commented Aug 24, 2025

Uh oh!

Uh oh!