Skip to content

bug: table.cleanup_metadata does not correctly preserve files #3692

@corwinjoy

Description

@corwinjoy

Environment

Delta-rs version:
deltalake-1.1.4

Binding:
python


Bug

When calling cleanup_metadata it incorrectly only looks for the latest checkpoint. As a result the until_version is not honored and old versions of the table can get incorrectly deleted from the log. Below is a program to reproduce the problem, here we want to delete log entries before version 5.

test_cleanup.py

Output:

################################################
Failed cleanup test:
log contents before cleanup:
Delta log contents:
  00000000000000000000.json
  00000000000000000001.json
  00000000000000000002.json
  00000000000000000003.json
  00000000000000000004.json
  00000000000000000005.json
  00000000000000000006.json
  00000000000000000007.json
  00000000000000000008.json
  00000000000000000009.checkpoint.parquet
  00000000000000000009.json
  _last_checkpoint

Expected exception occurred: Generic DeltaTable error: Cannot read metadata from log segment
Traceback (most recent call last):
  File "/home/cjoy/src/delta_rs_cleanup/test_cleanup.py", line 67, in <module>
    failed_cleanup_test()
  File "/home/cjoy/src/delta_rs_cleanup/test_cleanup.py", line 46, in failed_cleanup_test
    table.cleanup_metadata()
  File "/home/cjoy/src/delta_rs_cleanup/.venv/lib/python3.12/site-packages/deltalake/table.py", line 981, in cleanup_metadata
    self._table.cleanup_metadata()
_internal.DeltaError: Generic DeltaTable error: Cannot read metadata from log segment

With a manual checkpoint at version 5 (See line 45 of the example) we get

################################################
Failed cleanup test:
log contents before cleanup:
Delta log contents:
  00000000000000000000.json
  00000000000000000001.json
  00000000000000000002.json
  00000000000000000003.json
  00000000000000000004.json
  00000000000000000005.json
  00000000000000000006.json
  00000000000000000007.json
  00000000000000000008.json
  00000000000000000009.checkpoint.parquet
  00000000000000000009.json
  _last_checkpoint

################################################
log contents after cleanup:
Delta log contents:
  00000000000000000005.checkpoint.parquet
  00000000000000000005.json
  00000000000000000006.json
  00000000000000000007.json
  00000000000000000008.json
  00000000000000000009.checkpoint.parquet
  00000000000000000009.json
  _last_checkpoint

################################################
Version 5 of the data:
   id  value
0  50  str_0
1  51  str_1
2  52  str_2
3  53  str_3
4  54  str_4
5  55  str_5
6  56  str_6
7  57  str_7
8  58  str_8
9  59  str_9

I believe that this is a bug and that there is a logic error in cleanup_expired_logs_for.
Specifically, this line:

let maybe_last_checkpoint = read_last_checkpoint(&object_store, log_path).await?;

The value of maybe_last_checkpoint should be the last checkpoint with a version <= until_version and created on or before cutoff_timestamp. Otherwise, while the files in the log may be kept, their dependencies will not, and these versions will no longer be loadable / usable.

So, I propose that either:

  1. This logic be fixed since it seems like a hidden bug to me, and it bit some of our users.
  2. Or, if this is really working as intended, then the docs should be updated with a warning since the versions below the latest checkpoint may be rendered unusable.

Metadata

Metadata

Assignees

Labels

binding/rustIssues for the Rust cratebugSomething isn't working

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions