-
Notifications
You must be signed in to change notification settings - Fork 525
Description
Environment
Delta-rs version: Git, June 1, 2025
Bug
Let's say you're using without_files=True
/ require_files=false
with the goal of not loading files into memory.
In particular, you have two processes appending at same time. If they don't stomp on each other, all is well, the files don't get loaded into memory, as EagerSnapshot::try_new_with_visitor()
will see that require_files
is false, and not load them.
However, if they conflict in version numbers, EagerSnapshot.update()
is called... and this loads files unconditionally, ignoring the require_files=false
the user originally passed in which is respected by EagerSnapshot::try_new_with_visitor()
:
Compare:
let files = match config.require_files { |
to
let files = |
So now you have much larger memory usage, and also more CPU impact.
At scale, if you have many processes appending, this can have a significant impact on performance. With make develop
here's the performance I got with git, compared to a prototype of options 2/3 mentioned below:
Version of code | # of processes | Ops / sec |
---|---|---|
Git | 8 | 10 |
Git | 16 | 7 |
Git w/prototype fix | 8 | 14 |
Git w/prototype fix | 16 | 10 |
(My computer has 12 CPU cores).
How to fix?
Options:
- It seems a little weird to have
EagerSnapshot
that isn't eager. And a bunch of its methods just lie ifrequire_files
is false, essentially claiming no files. So one option is to ... have a snapshot trait, and use the appropriate snapshot type depending on whether it's append operation not? This may be major refactoring though. - Change
EagerSnapshot.update()
to respectrequire_files: false
. I am not sure if this is semantically correct, though. - Change
EagerSnapshot.update()
to respectrequire_files: false
, but only when appending.
As you can see in table above, options 2/3 definitely speed things up.