-
Notifications
You must be signed in to change notification settings - Fork 526
Open
Labels
binding/rustIssues for the Rust crateIssues for the Rust cratebugSomething isn't workingSomething isn't working
Description
Environment
Delta-rs version:
deltalake-1.1.4
Binding:
python
Bug
When calling cleanup_metadata
it incorrectly only looks for the latest checkpoint. As a result the until_version
is not honored and old versions of the table can get incorrectly deleted from the log. Below is a program to reproduce the problem, here we want to delete log entries before version 5.
Output:
################################################
Failed cleanup test:
log contents before cleanup:
Delta log contents:
00000000000000000000.json
00000000000000000001.json
00000000000000000002.json
00000000000000000003.json
00000000000000000004.json
00000000000000000005.json
00000000000000000006.json
00000000000000000007.json
00000000000000000008.json
00000000000000000009.checkpoint.parquet
00000000000000000009.json
_last_checkpoint
Expected exception occurred: Generic DeltaTable error: Cannot read metadata from log segment
Traceback (most recent call last):
File "/home/cjoy/src/delta_rs_cleanup/test_cleanup.py", line 67, in <module>
failed_cleanup_test()
File "/home/cjoy/src/delta_rs_cleanup/test_cleanup.py", line 46, in failed_cleanup_test
table.cleanup_metadata()
File "/home/cjoy/src/delta_rs_cleanup/.venv/lib/python3.12/site-packages/deltalake/table.py", line 981, in cleanup_metadata
self._table.cleanup_metadata()
_internal.DeltaError: Generic DeltaTable error: Cannot read metadata from log segment
With a manual checkpoint at version 5 (See line 45 of the example) we get
################################################
Failed cleanup test:
log contents before cleanup:
Delta log contents:
00000000000000000000.json
00000000000000000001.json
00000000000000000002.json
00000000000000000003.json
00000000000000000004.json
00000000000000000005.json
00000000000000000006.json
00000000000000000007.json
00000000000000000008.json
00000000000000000009.checkpoint.parquet
00000000000000000009.json
_last_checkpoint
################################################
log contents after cleanup:
Delta log contents:
00000000000000000005.checkpoint.parquet
00000000000000000005.json
00000000000000000006.json
00000000000000000007.json
00000000000000000008.json
00000000000000000009.checkpoint.parquet
00000000000000000009.json
_last_checkpoint
################################################
Version 5 of the data:
id value
0 50 str_0
1 51 str_1
2 52 str_2
3 53 str_3
4 54 str_4
5 55 str_5
6 56 str_6
7 57 str_7
8 58 str_8
9 59 str_9
I believe that this is a bug and that there is a logic error in cleanup_expired_logs_for
.
Specifically, this line:
let maybe_last_checkpoint = read_last_checkpoint(&object_store, log_path).await?; |
The value of
maybe_last_checkpoint
should be the last checkpoint with a version <= until_version
and created on or before cutoff_timestamp
. Otherwise, while the files in the log may be kept, their dependencies will not, and these versions will no longer be loadable / usable.
So, I propose that either:
- This logic be fixed since it seems like a hidden bug to me, and it bit some of our users.
- Or, if this is really working as intended, then the docs should be updated with a warning since the versions below the latest checkpoint may be rendered unusable.
Metadata
Metadata
Assignees
Labels
binding/rustIssues for the Rust crateIssues for the Rust cratebugSomething isn't workingSomething isn't working