Skip to content

Fix: ListingTableFactory hive column detection #17232

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

BlakeOrth
Copy link
Contributor

Which issue does this PR close?

What changes are included in this PR?

  • Fixes an issue in the ListingTableFactory where hive columns are not detected and incorporated into the table schema when an explicit schema has not been set by the user
  • Fixes an issue where subdirectories that do not follow Hive formatting (e.g. key=value) could be erroneously interpreted as contributing to the table schema

Are these changes tested?

Yes, the PR includes both unit tests and sqllogic tests that exercise the fixes

Are there any user-facing changes?

I think there's a possibility for the changes to impact users of the ListingTableFactory. In my mind the behavior represented here is what I would think the "expected" behavior should be, but there's a good possibility users are relying on the previous behavior and could get unexpected results if this PR is merged.

cc @alamb

 - Fixes an issue in the ListingTableFactory where hive columns are not
   detected and incorporated into the table schema when an explicit
   schema has not been set by the user
 - Fixes an issue where subdirectories that do not follow Hive
   formatting (e.g. key=value) could be erroneously interpreted as
   contributing to the table schema
@github-actions github-actions bot added core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) labels Aug 18, 2025
Comment on lines +805 to +807
// Partitions are expected to follow the format "column_name=value", so we
// should ignore any path part that cannot be parsed into the expected format
.filter(|s| s.contains('='))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is necessary to prevent regressions exposed by existing sqllogic tests. Here's the failures from existing tests for this PR if this fix is not included

Completed 345 test files in 17 seconds                                                                                                                                      External error: 6 errors in file /home/blake/open_source_src/datafusion-BlakeOrth/datafusion/sqllogictest/test_files/parquet.slt

1. statement failed: DataFusion error: Error during planning: Found mixed partition values on disk [[], ["subdir"]]
[SQL] CREATE EXTERNAL TABLE listing_table_folder_0
STORED AS PARQUET
LOCATION 'test_files/scratch/parquet/test_table/';
at /home/blake/open_source_src/datafusion-BlakeOrth/datafusion/sqllogictest/test_files/parquet.slt:306


2. query failed: DataFusion error: Error during planning: table 'datafusion.public.listing_table_folder_0' not found
[SQL] select count(*) from listing_table_folder_0;
at /home/blake/open_source_src/datafusion-BlakeOrth/datafusion/sqllogictest/test_files/parquet.slt:315


3. query failed: DataFusion error: Error during planning: table 'datafusion.public.listing_table_folder_0' not found
[SQL] select count(*) from listing_table_folder_0;
at /home/blake/open_source_src/datafusion-BlakeOrth/datafusion/sqllogictest/test_files/parquet.slt:324


4. statement failed: DataFusion error: Error during planning: Found mixed partition values on disk [[], ["subdir"]]
[SQL] CREATE EXTERNAL TABLE listing_table_folder_1
STORED AS PARQUET
LOCATION 'test_files/scratch/parquet/test_table';
at /home/blake/open_source_src/datafusion-BlakeOrth/datafusion/sqllogictest/test_files/parquet.slt:330


5. query failed: DataFusion error: Error during planning: table 'datafusion.public.listing_table_folder_1' not found
[SQL] select count(*) from listing_table_folder_1;
at /home/blake/open_source_src/datafusion-BlakeOrth/datafusion/sqllogictest/test_files/parquet.slt:339


6. query failed: DataFusion error: Error during planning: table 'datafusion.public.listing_table_folder_1' not found
[SQL] select count(*) from listing_table_folder_1;
at /home/blake/open_source_src/datafusion-BlakeOrth/datafusion/sqllogictest/test_files/parquet.slt:348

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

THanks @BlakeOrth

I played with this PR for a while, and I think this PR basically enables automatic partition detection for all listing tables which is quite cool.

For example with the following setup:

DataFusion CLI v49.0.0
> copy (values ('foo'), ('bar')) to '/tmp/table/a=1/file1.parquet';

+-------+
| count |
+-------+
| 2     |
+-------+
1 row(s) fetched.
Elapsed 0.034 seconds.

> copy (values ('baz')) to '/tmp/table/a=1/file2.parquet';
+-------+
| count |
+-------+
| 1     |
+-------+
1 row(s) fetched.
Elapsed 0.002 seconds.

> create external table t stored as parquet location '/tmp/table';
0 row(s) fetched.
Elapsed 0.010 seconds.

On main, no partition columns are recognized:

Main

> select * from t;
+---------+
| column1 |
+---------+
| foo     |
| bar     |
| baz     |
+---------+
3 row(s) fetched.
Elapsed 0.012 seconds.

This PR

> select * from t;
+---------+---+
| column1 | a |
+---------+---+
| baz     | 1 |
| foo     | 1 |
| bar     | 1 |
+---------+---+
3 row(s) fetched.
Elapsed 0.015 seconds.

I think this change is actually a quite a nice improvement but is really a feature, not a bug fix, and I think the code you have here actually looks good 🏆

In order to proceed I think we need a few things:

  1. More tests (see below)
  2. Add a configuration setting to disable this new behavior (let people opt out of automatic partition inference). Perhaps datafusion.execution.infer_partitioning, similarly to datafusion.execution.collect_statistics
  3. Document the behavior (with an example) in https://datafusion.apache.org/user-guide/sql/ddl.html#create-external-table
  4. Add a note to the upgrade guide https://datafusion.apache.org/library-user-guide/upgrading.html

New tests

I looked around and it seems we don't actually have any tests focused on reading hive partitioned datasets in .slt, nor automatically detecting hive partitions on table creation

Would you be willing to write some tests in a new file, such as datafusion/sqllogictest/test_files/listing_table_partitions.slt (following the naming convention of datafusion/sqllogictest/test_files/listing_table_statistics.slt)

I think you could start with my example SQL commands above and write the related tests

I know this is a lot. I am sorry for the delayed feedback. Please let me know what you think.

@alamb
Copy link
Contributor

alamb commented Aug 22, 2025

FYI @adrian and @nuno-faria

@BlakeOrth
Copy link
Contributor Author

@alamb

I know this is a lot. I am sorry for the delayed feedback. Please let me know what you think.

No worries on the delay, I can absolutely fill in the extra blanks. I just have a couple of quick questions on how to proceed.

  1. More tests (see below)

The sqllogic test I added actually does work on a hive partitioned table. I setup a "readback" using an existing table that's created with the following setup:

statement ok
CREATE EXTERNAL TABLE
partitioned_insert_test(a string, b string, c bigint)
STORED AS csv
LOCATION 'test_files/scratch/insert_to_external/insert_to_partitioned/'
PARTITIONED BY (a, b);

This results in the following directory/file structure:

datafusion/sqllogictest/test_files/scratch/insert_to_external/insert_to_partitioned$ tree .
.
├── a=10
│   ├── b=100
│   │   └── ImzsRq5BY2218Z0v.csv
│   └── b=200
│       └── ImzsRq5BY2218Z0v.csv
├── a=20
│   ├── b=100
│   │   └── ImzsRq5BY2218Z0v.csv
│   └── b=200
│       └── ImzsRq5BY2218Z0v.csv
└── a=30
    └── b=300
        └── YGMRSbmO08FmNfDs.csv

9 directories, 5 files

The existing tests prior to this PR all passed because the table was created with an explicit schema, so DataFusion recognized/parsed all the partition columns. The test I implemented for this PR starts a new table, and fails on main with the following errors:

1. query result mismatch:
[SQL] describe partitioned_insert_test_readback;
[Diff] (-expected|+actual)
-   c Int64 YES
-   a Dictionary(UInt16, Utf8) NO
-   b Dictionary(UInt16, Utf8) NO
+   c Int64 YES
at /home/blake/open_source_src/datafusion/datafusion/sqllogictest/test_files/insert_to_external.slt:184


2. query failed: DataFusion error: Schema error: No field named a. Valid fields are partitioned_insert_test_readback.c, partitioned_insert_test_readback.c.
[SQL] select * from partitioned_insert_test_readback order by a,b,c;
at /home/blake/open_source_src/datafusion/datafusion/sqllogictest/test_files/insert_to_external.slt:191


3. query failed: DataFusion error: Schema error: No field named b. Valid fields are partitioned_insert_test_readback.c.
[SQL] select count(*) from partitioned_insert_test_readback where b=100;
at /home/blake/open_source_src/datafusion/datafusion/sqllogictest/test_files/insert_to_external.slt:201

I believe this sequence of tests successfully exercises both the creation of hive partitioned tables, as well as the expected outcome when reading that data.

I'm happy to move this test, and the associated test setup, to a new file and/or augment it with the SQL commands you've laid out above. Given the above explanation, let me know if you'd still prefer to proceed with a new test file or if what's already there is sufficient.

  1. Add a configuration setting to disable this new behavior (let people opt out of automatic partition inference). Perhaps datafusion.execution.infer_partitioning, similarly to datafusion.execution.collect_statistics

I think it's prudent to note that the changes here only effect the ListingTableFactory and low level ListingTable users wouldn't experience any change. Is an execution option stil the right place for the configuration setting? datafusion.execution.listing_table_factory_infer_partitions with a default of true?

  1. Document the behavior (with an example) in https://datafusion.apache.org/user-guide/sql/ddl.html#create-external-table
  2. Add a note to the upgrade guide https://datafusion.apache.org/library-user-guide/upgrading.html

No questions here, I can take care of this.

@alamb
Copy link
Contributor

alamb commented Aug 23, 2025

I believe this sequence of tests successfully exercises both the creation of hive partitioned tables, as well as the expected outcome when reading that data.

I'm happy to move this test, and the associated test setup, to a new file and/or augment it with the SQL commands you've laid out above. Given the above explanation, let me know if you'd still prefer to proceed with a new test file or if what's already there is sufficient.

I think give we are about to add non trivial new functionality (even if the code is not new), we should make the tests I suggested

I think it's prudent to note that the changes here only effect the ListingTableFactory and low level ListingTable users wouldn't experience any change. Is an execution option stil the right place for the configuration setting?

Yes, I think so (unless you can find another location for Listingtable settings -- maybe we should make a new category 🤔 )

@nuno-faria
Copy link
Contributor

Thanks for the changes @BlakeOrth.
I also believe that automatically resolving the partitioning is the best approach. Like @alamb said, it would be nice to document and add an option to toggle this off if needed. In addition to adding documentation at https://datafusion.apache.org/user-guide/sql/ddl.html#create-external-table, I think we should also have additional documentation here: https://datafusion.apache.org/user-guide/cli/datasources.html#create-external-table.

One thing to note is that I believe that CREATE EXTERNAL TABLE now has a different behavior from SessionContext::register_parquet, which still requires partitions to be manually specified. I also feel register_parquet could use a form of automatic partition detection, maybe in a flag passed to ParquetReadOptions.

@alamb
Copy link
Contributor

alamb commented Aug 24, 2025

One thing to note is that I believe that CREATE EXTERNAL TABLE now has a different behavior from SessionContext::register_parquet, which still requires partitions to be manually specified. I also feel register_parquet could use a form of automatic partition detection, maybe in a flag passed to ParquetReadOptions.

yes please!

Would it help if I made a "mini" epic and broke this project down into some smaller parts / tickets so people could work on it in parallel?

(I am on vacation this week, so my responses will be delayed(

@nuno-faria
Copy link
Contributor

Would it help if I made a "mini" epic and broke this project down into some smaller parts / tickets so people could work on it in parallel?

I think that would be a good idea.

(I am on vacation this week, so my responses will be delayed(

I also think it could wait since its not urgent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug auto detecting partitions with ListingTableFactory on Hive partitioned datasets
3 participants