-
Notifications
You must be signed in to change notification settings - Fork 682
fix: link_remap files get left over on failed dumps, unique file names to avoid collisions #2681
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: criu-dev
Are you sure you want to change the base?
fix: link_remap files get left over on failed dumps, unique file names to avoid collisions #2681
Conversation
Fixes: checkpoint-restore#2121 Signed-off-by: Pengda Yang <[email protected]>
This patch fixes the following warnings that appear when building an RPM package: + /usr/lib/rpm/redhat/brp-mangle-shebangs *** WARNING: ./usr/src/debug/criu-4.0-1.fc42.x86_64/plugins/amdgpu/amdgpu_plugin_util.c is executable but has no shebang, removing executable bit *** WARNING: ./usr/src/debug/criu-4.0-1.fc42.x86_64/plugins/amdgpu/amdgpu_plugin_util.h is executable but has no shebang, removing executable bit Signed-off-by: Radostin Stoyanov <[email protected]>
By default, CRIU uses the path "/usr/lib/criu" to install and load plugins at runtime. This path is defined by the `PLUGINDIR` variable in Makefile.install and `CR_PLUGIN_DEFAULT` in `criu/include/plugin.h`. However, some distribution packages might install the CRIU plugins at "/usr/lib64/criu" instead. This patch updates the makefile to align the path defined by `CR_PLUGIN_DEFAULT` with the value of `PLUGINDIR`. Signed-off-by: Radostin Stoyanov <[email protected]>
We only use the last pid from the list in NSpid entry (from /proc/<pid>/fdinfo/<pidfd>) while restoring pidfds. The last pid refers to the pid of the process in the most deeply nested pid namespace. Since CRIU does not currently support nested pid namespaces, this entry is the one we want. After Linux 6.9, inode numbers can be used to compare pidfds. pidfds referring to the same process will have the same inode numbers. We use inode numbers to restore pidfds that point to dead processes. Signed-off-by: Bhavik Sachdev <[email protected]>
Process file descriptors (pidfds) were introduced to provide a stable handle on a process. They solve the problem of pid recycling. For a detailed explanation, see https://lwn.net/Articles/801319/ and http://www.corsix.org/content/what-is-a-pidfd Before Linux 6.9, anonymous inodes were used for the implementation of pidfds. So, we detect them in a fashion similiar to other fd types that use anonymous inodes by calling `readlink()`. After 6.9, pidfs (a file system for pidfds) was introduced. In 6.9 `S_ISREG()` returned true for pidfds, but this again changed with 6.10. (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/pidfs.c?h=v6.11-rc2#n285) After this change, pidfs inodes have no file type in st_mode in userspace. We use `PID_FS_MAGIC` to detect pidfds for kernel >= 6.9 Hence, check for pidfds occurs before the check for regular files. For pidfds that refer to dead processes, we lose the pid of the process as the Pid and NSpid fields in /proc/<pid>/fdinfo/<pidfd> change to -1. So, we create a temporary process for each unique inode and open pidfds that refer to this process. After all pidfds have been opened we kill this temporary process. This commit does not include support for pidfds that point to a specific thread, i.e pidfds opened with `PIDFD_THREAD` flag. Fixes: checkpoint-restore#2258 Signed-off-by: Bhavik Sachdev <[email protected]>
Ensures that entries in /proc/<pid>/fdinfo/<pidfd> are same. Signed-off-by: Bhavik Sachdev <[email protected]>
Ensure `pidfd_send_signal()` syscall works as expected after C/R. Signed-off-by: Bhavik Sachdev <[email protected]>
Validate that pidfds can been used to send signals to different processes after C/R using the `pidfd_send_signal()` syscall. Signed-off-by: Bhavik Sachdev <[email protected]>
After, C/R of pidfds that point to dead processes their inodes might change. But if two pidfds point to same dead process they should continue to do so after C/R. This test ensures that this happens by calling `statx()` on pidfds after C/R and then comparing their inode numbers. Support for comparing pidfds by using `statx()` and inode numbers was introduced alongside pidfs. So if `f_type` of pidfd is not equal to `PID_FS_MAGIC` then we skip this test. signed-off-by: Bhavik Sachdev <[email protected]>
We get the read end of a pipe using `pidfd_getfd` and check if we can read from it after C/R. signed-off-by: Bhavik Sachdev <[email protected]>
We open a pidfd to a thread using `PIDFD_THREAD` flag and after C/R ensure that we can send signals using it with `PIDFD_SIGNAL_THREAD`. signed-off-by: Bhavik Sachdev <[email protected]>
The command `ruff <path>` has been deprecated and removed: https://astral.sh/blog/ruff-v0.5.0#removed-deprecated-features Signed-off-by: Radostin Stoyanov <[email protected]>
This patch fixes the following errors reported by ruff:
lib/pycriu/images/pb2dict.py:307:24: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
|
305 | elif field.type in _basic_cast:
306 | cast = _basic_cast[field.type]
307 | if pretty and (cast == int):
| ^^^^^^^^^^^ E721
308 | if is_hex:
309 | # Fields that have (criu).hex = true option set
|
lib/pycriu/images/pb2dict.py:379:13: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
|
377 | elif field.type in _basic_cast:
378 | cast = _basic_cast[field.type]
379 | if (cast == int) and is_string(value):
| ^^^^^^^^^^^ E721
380 | if _marked_as_dev(field):
381 | return encode_dev(field, value)
|
Signed-off-by: Radostin Stoyanov <[email protected]>
This patch extends the inventory image with a `plugins` field that contains an array of plugins which were used during checkpoint, for example, to save GPU state. In particular, the CUDA and AMDGPU plugins are added to this field only when the checkpoint contains GPU state. This allows to disable unnecessary plugins during restore, show appropriate error messages if required CRIU plugin are missing, and migrate a process that does not use GPU from a GPU-enabled system to CPU-only environment. We use the `optional plugins_entry` for backwards compatibility. This entry allows us to distinguish between *unset* and *missing* field: - When the field is missing, it indicates that the checkpoint was created with a previous version of CRIU, and all plugins should be *enabled* during restore. - When the field is empty, it indicates that no plugins were used during checkpointing. Thus, all plugins can be *disabled* during restore. Signed-off-by: Radostin Stoyanov <[email protected]>
This patch adds two test plugins to verify that CRIU plugins listed in the inventory image are enabled, while those that are not listed can be disabled. Signed-off-by: Radostin Stoyanov <[email protected]>
This patch blocks SIGCHLD during temporary process creation to prevent a race condition between kill() and waitpid() where sigchld_handler() causes `criu restore` to fail with an error. Fixes: checkpoint-restore#2490 Signed-off-by: Bhavik Sachdev <[email protected]> Signed-off-by: Radostin Stoyanov <[email protected]>
Co-authored-by: Yixue Zhao <[email protected]> Co-authored-by: stove <[email protected]> Signed-off-by: Haorong Lu <[email protected]> --- - rebased - imported a page_size() type fix (authored by Cryolitia PukNgae) Signed-off-by: PukNgae Cryolitia <[email protected]> Signed-off-by: Alexander Mikhalitsyn <[email protected]>
Co-authored-by: Yixue Zhao <[email protected]> Co-authored-by: stove <[email protected]> Signed-off-by: Haorong Lu <[email protected]> --- - rebased - added a membarrier() to syscall table (fix authored by Cryolitia PukNgae) Signed-off-by: PukNgae Cryolitia <[email protected]> Signed-off-by: Alexander Mikhalitsyn <[email protected]>
Co-authored-by: Yixue Zhao <[email protected]> Co-authored-by: stove <[email protected]> Signed-off-by: Haorong Lu <[email protected]>
Co-authored-by: Yixue Zhao <[email protected]> Co-authored-by: stove <[email protected]> Signed-off-by: Haorong Lu <[email protected]>
Signed-off-by: Haorong Lu <[email protected]>
Signed-off-by: Haorong Lu <[email protected]>
Link: SerenityOS/serenity@e300da4 Signed-off-by: PukNgae Cryolitia <[email protected]> --- - cherry-picked Signed-off-by: Alexander Mikhalitsyn <[email protected]>
After a fork, both the child and parent processes may trigger a page fault (#PF) at the same virtual address, referencing the same position in the page image. If deduplication is enabled, the last process to trigger the page fault will fail. Therefore, deduplication should be disabled after a fork to prevent this issue. Signed-off-by: Liu Hua <[email protected]>
When restoring dumps in new mount + pid namespaces where multiple dumps share the same network namespace, CRIU may fail due to conflicting unix socket names. This happens because the service worker creates sockets using a pattern that includes criu_run_id, but util_init() is called after cr_service_work() starts. The socket naming pattern "crtools-fd-%d-%d" uses the restore PID and criu_run_id, however criu_run_id is always 0 when not initialized, leading to conflicts when multiple restores run simultaneously either in the same CRIU process or because of multiple CRIU processes doing the same operation in different PID namespaces. Fix this by: - Moving util_init() before cr_service_work() starts - Adding a second util_init() call in the service worker fork to ensure unique IDs across multiple worker runs - Making sure that dump and restore operations have util_init() called early to generate unique socket names With this fix, socket names always include the namespace ID, preventing conflicts when multiple processes with the same pid share a network namespace. Fixes checkpoint-restore#2499 [ avagin: minore code changes ] Signed-off-by: Lorenzo Fontana <[email protected]> Signed-off-by: Andrei Vagin <[email protected]>
When `check_freezer_cgroup()` has non-zero return value, `goto err` calls `return ret`. However, the value of `ret` has been set to `0` in the lines above and CRIU does not handle the error properly. This problem is related to checkpoint-restore#2508 Signed-off-by: Radostin Stoyanov <[email protected]>
Container runtimes like CRI-O and containerd utilize the freezer cgroup to create a consistent snapshot of container root filesystem (rootfs) changes. In this case, the container is frozen before invoking CRIU. After CRIU successfully completes, a copy of the container rootfs diff is saved, and the container is then unfrozen. However, the `cuda-checkpoint` tool is not able to perform a 'lock' action on frozen threads. To support GPU checkpointing with these container runtimes, we need to unfreeze the cgroup and return it to its original state once the checkpointing is complete. To reflect this new behavior, the following changes are applied: - `dont_use_freeze_cgroup(void)` -> `set_compel_interrupt_only_mode(void)` - `bool freeze_cgroup_disabled` -> `bool compel_interrupt_only_mode` - `check_freezer_cgroup(void)` -> `prepare_freezer_for_interrupt_only_mode(void)` Note that when `compel_interrupt_only_mode` is set to `true`, `compel_interrupt_task()` is used instead of `freeze_processes()` to prevent tasks from running during `criu dump`. Fixes: checkpoint-restore#2508 Signed-off-by: Radostin Stoyanov <[email protected]>
Signed-off-by: Radostin Stoyanov <[email protected]>
The check for `/dev/nvidiactl` to determine if the CUDA plugin can be used is unreliable because in some cases the default path for driver installation is different [1]. This patch changes the logic to check if a GPU device is available in `/proc/driver/nvidia/gpus/`. This approach is similar to `torch.cuda.is_available()` and it is a more accurate indicator. The subsequent check for support of the `cuda-checkpoint --action` option would confirm if the driver supports checkpoint/restore. [1] https://github.com/NVIDIA/gpu-operator Fixes: checkpoint-restore#2509 Signed-off-by: Radostin Stoyanov <[email protected]>
Currently, the `waitpid()` call on the tmp process can be made by a process which is not its parent. This causes restore to fail. This patch instead selects one process to create the tmp process and open all the fds that point to it. These fds are sent to the correct process(es). Fixes: checkpoint-restore#2496 Signed-off-by: Andrei Vagin <[email protected]> Signed-off-by: Bhavik Sachdev <[email protected]>
In this test we want to ensure that contents of droppable mappings and mappings with MADV_WIPEONFORK is properly restored in parent/child processes. Signed-off-by: Alexander Mikhalitsyn <[email protected]>
CRIU currently requires a number of dependencies in order to build from source. The package names vary across distributions and package managers. A Nix flake allows developers to spin up a dev environment with `nix develop`, eliminating the hassle of manual dependency management. It also prevents polluting the global package set on the machine. Signed-off-by: Prajwal S N <[email protected]>
The cpuinfo command requires a "dump" or "check" subcommand. Thus, we replace `CR_CPUINFO` with `CR_CPUINFO_DUMP` and `CR_CPUINFO_CHECK`. This allows us to remove unnecessary subcommand check in `image_dir_mode()` and perform all parsing in `parse_criu_mode()`. With this change the check for validating the cpuinfo subcommand is now done only once with `CR_CPUINFO_DUMP` or `CR_CPUINFO_CHECK` enum. Signed-off-by: Liana Koleva <[email protected]> Signed-off-by: Radostin Stoyanov <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
The `criu cpuinfo check` command calls cpu_validate_cpuinfo(), which attempts to open the cpuinfo.img file using `open_image()`. If the image file is not found, `open_image()` returns an "empty image" object. As a result, `cpu_validate_cpuinfo()` tries to read from it and fails with the following error: (00.002473) Error (criu/protobuf.c:72): Unexpected EOF on (empty-image) This patch adds a check for an empty image and appropriate error message. Signed-off-by: Radostin Stoyanov <[email protected]>
Fixes a clang compile-time error: "argument unused during compilation: '-c'". Signed-off-by: Andrei Vagin <[email protected]>
Use shared first error buffer to return correct first error in rpc. Fixes: checkpoint-restore#338 Signed-off-by: Ivan Pravdin <[email protected]>
Having CTL_FLAGS_IPC_EACCES_SKIP == (CTL_FLAGS_OPTIONAL | CTL_FLAGS_READ_EIO_SKIP) is probably not what we want. So let's make it a real distinct flag. Fixes: 840735a ("ipc_sysctl: Prioritize restoring IPC variables using non usernsd approach") Signed-off-by: Pavel Tikhomirov <[email protected]>
Fixes: f38e588 ("net/sysctl: c/r ipv4/ping_group_range value") Signed-off-by: Pavel Tikhomirov <[email protected]>
We have ability to skip sysctl if there is no value, but we still give n requests to sysctl_op, that is not correct and probably can segfault on nullptr access. Fix it by adding ri to count non skipped requests. To be on the safe side, let's add a check that ri == n on read, as we should not do any skips there. While on it lets fix bad error message prefix: s/unix/ipv4/. Remove excess has_iarg set, and add sarg reset to NULL for the case sysctl_op skipped it. Signed-off-by: Andrei Vagin <[email protected]> Signed-off-by: Pavel Tikhomirov <[email protected]>
We dump sysctls from criu user namespace, but restore from restored user namespace. So group id values should be mapped to the restored user namespace gid space to restore correctly. Signed-off-by: Andrei Vagin <[email protected]> Signed-off-by: Pavel Tikhomirov <[email protected]>
net/unix/max_dgram_qlen can't be tuned from non-root userns before:
v5.17-rc1~170^2~215 ("net: Enable max_dgram_qlen unix sysctl to be
configurable by non-init user namespaces")
Signed-off-by: Andrei Vagin <[email protected]>
Currently there is no option to checkpoint/restore programs that use ICMP sockets, such as `ping`. This patch adds support for the same. Fixes checkpoint-restore#2557 Signed-off-by: समीर सिंह Sameer Singh <[email protected]>
Add ZDTM static tests for IP4/ICMP and IP6/ICMP socket feature. Signed-off-by: समीर सिंह Sameer Singh <[email protected]> Signed-off-by: Andrei Vagin <[email protected]>
…que, name is now unique Signed-off-by: Brandon Smith <[email protected]>
|
Do you think you could also add a zdtm test? You mentioned that you have a test that is able to trigger the problem you described. |
|
|
||
| mntns_root = mntns_get_root_fd(nsid); | ||
|
|
||
| fd = open("/dev/urandom", O_RDONLY); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we generate this uid once for the entire dump?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change introduces generating a random suffix to prevent link_remap collisions when a dump fails and another dump is attempted on the same process
should we generate this uid once for the entire dump?
We already do this for dump_criu_run_id. Can we use this value instead of generating a new one?
| } | ||
|
|
||
| // create file link_remap.fd_id.random_suffix id | ||
| snprintf(tmp + 1, sizeof(link_name) - (size_t)(tmp - link_name) - 1, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we customize a link name if linkat_hard returns EEXIST?
| unsigned char rand_bytes[8]; | ||
| if (read(fd, rand_bytes, sizeof(rand_bytes)) == sizeof(rand_bytes)) { | ||
| snprintf(random_suffix, sizeof(random_suffix), "%02x%02x%02x%02x", | ||
| rand_bytes[0], rand_bytes[1], rand_bytes[2], rand_bytes[3]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you read 8 bytes, if you use just four of them?
| (unsigned long)(ts.tv_sec ^ ts.tv_nsec)); | ||
| } | ||
|
|
||
| // create file link_remap.fd_id.random_suffix id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please use tabs for indentation
| } | ||
|
|
||
| /* Use grand parent, if parent directory does not exist. */ | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
white spaces...
Would you be able to include this test, for example, as part of ZDTM or |
|
#2683
I have a PR with a zdtm test open. The specific use case was a multi
process training job where multiple processes share the same semaphore, but
you’re migrating the whole training job.
…On Tue, Jun 17, 2025 at 2:18 AM Radostin Stoyanov ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In criu/files-reg.c
<#2681 (comment)>
:
> mntns_root = mntns_get_root_fd(nsid);
+ fd = open("/dev/urandom", O_RDONLY);
I have another branch directly related to this PR and fix ...
@bsmithai <https://github.com/bsmithai> I believe that you are referring
to the following branch: cedana#4 <cedana#4>
It would be good to open an issue or draft PR describing the use-case with
an example (of a real) application that you are trying to migrate. This
will help us to understand why this functionality is important and discuss
what might be the best solution.
—
Reply to this email directly, view it on GitHub
<#2681 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/BBXBXMGAQ4ZVUPNXZFHDR2D3D66MDAVCNFSM6AAAAAB7D6ESQ2VHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDSMZUGQZTCMZXGI>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
|
A friendly reminder that this PR had no activity for 30 days. |
Problem:
create_link_remaplogic specifies before creating the hard link that these file names can be truly unique. Currently these files are generated w/ ids matching sequentially with file descriptor ids. This change introduces generating a random suffix to prevent link_remap collisions when a dump fails and another dump is attempted on the same process that has open fds with links that were remapped.Tested:
Tested w/ a process that opens and uses semaphores, the test creates remap files in
/dev/shmthat would have been created by criu on dump. A dump is done on this process and collisions are avoided. A restore is then done to verify nothing was damaged in C/R.Very simple solution but I wanted to verify e2e.
Results:
Original bug:
This branch: