-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Fix: ListingTableFactory hive column detection #17232
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
- Fixes an issue in the ListingTableFactory where hive columns are not detected and incorporated into the table schema when an explicit schema has not been set by the user - Fixes an issue where subdirectories that do not follow Hive formatting (e.g. key=value) could be erroneously interpreted as contributing to the table schema
// Partitions are expected to follow the format "column_name=value", so we | ||
// should ignore any path part that cannot be parsed into the expected format | ||
.filter(|s| s.contains('=')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code is necessary to prevent regressions exposed by existing sqllogic
tests. Here's the failures from existing tests for this PR if this fix is not included
Completed 345 test files in 17 seconds External error: 6 errors in file /home/blake/open_source_src/datafusion-BlakeOrth/datafusion/sqllogictest/test_files/parquet.slt
1. statement failed: DataFusion error: Error during planning: Found mixed partition values on disk [[], ["subdir"]]
[SQL] CREATE EXTERNAL TABLE listing_table_folder_0
STORED AS PARQUET
LOCATION 'test_files/scratch/parquet/test_table/';
at /home/blake/open_source_src/datafusion-BlakeOrth/datafusion/sqllogictest/test_files/parquet.slt:306
2. query failed: DataFusion error: Error during planning: table 'datafusion.public.listing_table_folder_0' not found
[SQL] select count(*) from listing_table_folder_0;
at /home/blake/open_source_src/datafusion-BlakeOrth/datafusion/sqllogictest/test_files/parquet.slt:315
3. query failed: DataFusion error: Error during planning: table 'datafusion.public.listing_table_folder_0' not found
[SQL] select count(*) from listing_table_folder_0;
at /home/blake/open_source_src/datafusion-BlakeOrth/datafusion/sqllogictest/test_files/parquet.slt:324
4. statement failed: DataFusion error: Error during planning: Found mixed partition values on disk [[], ["subdir"]]
[SQL] CREATE EXTERNAL TABLE listing_table_folder_1
STORED AS PARQUET
LOCATION 'test_files/scratch/parquet/test_table';
at /home/blake/open_source_src/datafusion-BlakeOrth/datafusion/sqllogictest/test_files/parquet.slt:330
5. query failed: DataFusion error: Error during planning: table 'datafusion.public.listing_table_folder_1' not found
[SQL] select count(*) from listing_table_folder_1;
at /home/blake/open_source_src/datafusion-BlakeOrth/datafusion/sqllogictest/test_files/parquet.slt:339
6. query failed: DataFusion error: Error during planning: table 'datafusion.public.listing_table_folder_1' not found
[SQL] select count(*) from listing_table_folder_1;
at /home/blake/open_source_src/datafusion-BlakeOrth/datafusion/sqllogictest/test_files/parquet.slt:348
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
THanks @BlakeOrth
I played with this PR for a while, and I think this PR basically enables automatic partition detection for all listing tables which is quite cool.
For example with the following setup:
DataFusion CLI v49.0.0
> copy (values ('foo'), ('bar')) to '/tmp/table/a=1/file1.parquet';
+-------+
| count |
+-------+
| 2 |
+-------+
1 row(s) fetched.
Elapsed 0.034 seconds.
> copy (values ('baz')) to '/tmp/table/a=1/file2.parquet';
+-------+
| count |
+-------+
| 1 |
+-------+
1 row(s) fetched.
Elapsed 0.002 seconds.
> create external table t stored as parquet location '/tmp/table';
0 row(s) fetched.
Elapsed 0.010 seconds.
On main, no partition columns are recognized:
Main
> select * from t;
+---------+
| column1 |
+---------+
| foo |
| bar |
| baz |
+---------+
3 row(s) fetched.
Elapsed 0.012 seconds.
This PR
> select * from t;
+---------+---+
| column1 | a |
+---------+---+
| baz | 1 |
| foo | 1 |
| bar | 1 |
+---------+---+
3 row(s) fetched.
Elapsed 0.015 seconds.
I think this change is actually a quite a nice improvement but is really a feature, not a bug fix, and I think the code you have here actually looks good 🏆
In order to proceed I think we need a few things:
- More tests (see below)
- Add a configuration setting to disable this new behavior (let people opt out of automatic partition inference). Perhaps
datafusion.execution.infer_partitioning
, similarly todatafusion.execution.collect_statistics
- Document the behavior (with an example) in https://datafusion.apache.org/user-guide/sql/ddl.html#create-external-table
- Add a note to the upgrade guide https://datafusion.apache.org/library-user-guide/upgrading.html
New tests
I looked around and it seems we don't actually have any tests focused on reading hive partitioned datasets in .slt, nor automatically detecting hive partitions on table creation
Would you be willing to write some tests in a new file, such as datafusion/sqllogictest/test_files/listing_table_partitions.slt
(following the naming convention of datafusion/sqllogictest/test_files/listing_table_statistics.slt
)
I think you could start with my example SQL commands above and write the related tests
I know this is a lot. I am sorry for the delayed feedback. Please let me know what you think.
FYI @adrian and @nuno-faria |
No worries on the delay, I can absolutely fill in the extra blanks. I just have a couple of quick questions on how to proceed.
The datafusion/datafusion/sqllogictest/test_files/insert_to_external.slt Lines 155 to 160 in cb571d9
This results in the following directory/file structure: datafusion/sqllogictest/test_files/scratch/insert_to_external/insert_to_partitioned$ tree .
.
├── a=10
│ ├── b=100
│ │ └── ImzsRq5BY2218Z0v.csv
│ └── b=200
│ └── ImzsRq5BY2218Z0v.csv
├── a=20
│ ├── b=100
│ │ └── ImzsRq5BY2218Z0v.csv
│ └── b=200
│ └── ImzsRq5BY2218Z0v.csv
└── a=30
└── b=300
└── YGMRSbmO08FmNfDs.csv
9 directories, 5 files The existing tests prior to this PR all passed because the table was created with an explicit schema, so DataFusion recognized/parsed all the partition columns. The test I implemented for this PR starts a new table, and fails on 1. query result mismatch:
[SQL] describe partitioned_insert_test_readback;
[Diff] (-expected|+actual)
- c Int64 YES
- a Dictionary(UInt16, Utf8) NO
- b Dictionary(UInt16, Utf8) NO
+ c Int64 YES
at /home/blake/open_source_src/datafusion/datafusion/sqllogictest/test_files/insert_to_external.slt:184
2. query failed: DataFusion error: Schema error: No field named a. Valid fields are partitioned_insert_test_readback.c, partitioned_insert_test_readback.c.
[SQL] select * from partitioned_insert_test_readback order by a,b,c;
at /home/blake/open_source_src/datafusion/datafusion/sqllogictest/test_files/insert_to_external.slt:191
3. query failed: DataFusion error: Schema error: No field named b. Valid fields are partitioned_insert_test_readback.c.
[SQL] select count(*) from partitioned_insert_test_readback where b=100;
at /home/blake/open_source_src/datafusion/datafusion/sqllogictest/test_files/insert_to_external.slt:201 I believe this sequence of tests successfully exercises both the creation of hive partitioned tables, as well as the expected outcome when reading that data. I'm happy to move this test, and the associated test setup, to a new file and/or augment it with the SQL commands you've laid out above. Given the above explanation, let me know if you'd still prefer to proceed with a new test file or if what's already there is sufficient.
I think it's prudent to note that the changes here only effect the
No questions here, I can take care of this. |
I think give we are about to add non trivial new functionality (even if the code is not new), we should make the tests I suggested
Yes, I think so (unless you can find another location for Listingtable settings -- maybe we should make a new category 🤔 ) |
Thanks for the changes @BlakeOrth. One thing to note is that I believe that |
yes please! Would it help if I made a "mini" epic and broke this project down into some smaller parts / tickets so people could work on it in parallel? (I am on vacation this week, so my responses will be delayed( |
I think that would be a good idea.
I also think it could wait since its not urgent. |
Which issue does this PR close?
ListingTableFactory
on Hive partitioned datasets #17049What changes are included in this PR?
ListingTableFactory
where hive columns are not detected and incorporated into the table schema when an explicit schema has not been set by the userAre these changes tested?
Yes, the PR includes both unit tests and
sqllogic
tests that exercise the fixesAre there any user-facing changes?
I think there's a possibility for the changes to impact users of the
ListingTableFactory
. In my mind the behavior represented here is what I would think the "expected" behavior should be, but there's a good possibility users are relying on the previous behavior and could get unexpected results if this PR is merged.cc @alamb