Skip to content

Conversation

@bsmithai
Copy link

Problem:

create_link_remap logic specifies before creating the hard link that these file names can be truly unique. Currently these files are generated w/ ids matching sequentially with file descriptor ids. This change introduces generating a random suffix to prevent link_remap collisions when a dump fails and another dump is attempted on the same process that has open fds with links that were remapped.

Tested:

Tested w/ a process that opens and uses semaphores, the test creates remap files in /dev/shm that would have been created by criu on dump. A dump is done on this process and collisions are avoided. A restore is then done to verify nothing was damaged in C/R.

Very simple solution but I wanted to verify e2e.

Results:

Original bug:

(00.022698) Strip ' (deleted)' tag from './dev/shm/sem.8NKp2J (deleted)'
(00.022703) Warn  (criu/files-reg.c:1881): Can't link  -> ./dev/shm/link_remap.5
(00.022707) Error (criu/files-reg.c:1159): Can't link remap to /dev/shm/sem.8NKp2J: File exists
(00.022711) Error (criu/cr-dump.c:1570): Collect mappings (pid: 1401204) failed with -1
(00.022753) net: Unlock network

This branch:

(00.043028) Found regular file mapping, OK
(00.043041) Dumping path for -3 fd via self 12 [/dev/shm/sem.8NKp2J (deleted)]
(00.043046) Strip ' (deleted)' tag from './dev/shm/sem.8NKp2J (deleted)'
(00.043081) Only file size could be stored for validation for file /dev/shm/sem.8NKp2J
(00.043087) Handling VMA with the following smaps entry: 745051de5000-745051de7000 r--p 00037000 fc:01 17565282                   /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
(00.043097) Found regular file mapping, OK
(00.043120) Handling VMA with the following smaps entry: 745051de7000-745051de9000 rw-p 00039000 fc:01 17565282                   /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
(00.043130) vma 745051de7000 borrows vfi from previous 745051de5000
(00.043134) Handling VMA with the following smaps entry: 7ffcde225000-7ffcde247000 rw-p 00000000 00:00 0                          [stack]
(00.043143) Handling VMA with the following smaps entry: ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0                  [vsyscall]
(00.043150) Collected, longest area occupies 405 pages
(00.043153) 0x64e115337000-0x64e115338000 (4K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0 reg fp  shmid: 0x1
(00.043156) 0x64e115338000-0x64e115339000 (4K) prot 0x5 flags 0x2 fdflags 0 st 0x41 off 0x1000 reg fp  shmid: 0x1
(00.043158) 0x64e115339000-0x64e11533a000 (4K) prot 0x1 flags 0x2 fdflags 0 st 0x41 off 0x2000 reg fp  shmid: 0x1
Original CRIU (/usr/local/bin/criu): Exit code 1
Modified CRIU (./criu/criu):         Exit code 0
Modified CRIU Restore:               Exit code 0

Daz-3ux and others added 30 commits October 26, 2024 22:17
This patch fixes the following warnings that appear
when building an RPM package:

+ /usr/lib/rpm/redhat/brp-mangle-shebangs
*** WARNING: ./usr/src/debug/criu-4.0-1.fc42.x86_64/plugins/amdgpu/amdgpu_plugin_util.c is executable but has no shebang, removing executable bit
*** WARNING: ./usr/src/debug/criu-4.0-1.fc42.x86_64/plugins/amdgpu/amdgpu_plugin_util.h is executable but has no shebang, removing executable bit

Signed-off-by: Radostin Stoyanov <[email protected]>
By default, CRIU uses the path "/usr/lib/criu" to install and load
plugins at runtime. This path is defined by the `PLUGINDIR` variable
in Makefile.install and `CR_PLUGIN_DEFAULT` in `criu/include/plugin.h`.
However, some distribution packages might install the CRIU plugins at
"/usr/lib64/criu" instead. This patch updates the makefile to align
the path defined by `CR_PLUGIN_DEFAULT` with the value of `PLUGINDIR`.

Signed-off-by: Radostin Stoyanov <[email protected]>
We only use the last pid from the list in NSpid entry (from
/proc/<pid>/fdinfo/<pidfd>) while restoring pidfds.
The last pid refers to the pid of the process in the most deeply nested
pid namespace. Since CRIU does not currently support nested pid
namespaces, this entry is the one we want.

After Linux 6.9, inode numbers can be used to compare pidfds. pidfds
referring to the same process will have the same inode numbers. We use
inode numbers to restore pidfds that point to dead processes.

Signed-off-by: Bhavik Sachdev <[email protected]>
Process file descriptors (pidfds) were introduced to provide a stable
handle on a process. They solve the problem of pid recycling.

For a detailed explanation, see https://lwn.net/Articles/801319/ and
http://www.corsix.org/content/what-is-a-pidfd

Before Linux 6.9, anonymous inodes were used for the implementation of
pidfds. So, we detect them in a fashion similiar to other fd types that
use anonymous inodes by calling `readlink()`.
After 6.9, pidfs (a file system for pidfds) was introduced.
In 6.9 `S_ISREG()` returned true for pidfds, but this again changed with
6.10.
(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/pidfs.c?h=v6.11-rc2#n285)
After this change, pidfs inodes have no file type in st_mode in
userspace.
We use `PID_FS_MAGIC` to detect pidfds for kernel >= 6.9
Hence, check for pidfds occurs before the check for regular files.

For pidfds that refer to dead processes, we lose the pid of the process
as the Pid and NSpid fields in /proc/<pid>/fdinfo/<pidfd> change to -1.
So, we create a temporary process for each unique inode and open pidfds
that refer to this process. After all pidfds have been opened we kill
this temporary process.

This commit does not include support for pidfds that point to a specific
thread, i.e pidfds opened with `PIDFD_THREAD` flag.

Fixes: checkpoint-restore#2258

Signed-off-by: Bhavik Sachdev <[email protected]>
Ensures that entries in /proc/<pid>/fdinfo/<pidfd> are same.

Signed-off-by: Bhavik Sachdev <[email protected]>
Ensure `pidfd_send_signal()` syscall works as expected after C/R.

Signed-off-by: Bhavik Sachdev <[email protected]>
Validate that pidfds can been used to send signals to different
processes after C/R using the `pidfd_send_signal()` syscall.

Signed-off-by: Bhavik Sachdev <[email protected]>
After, C/R of pidfds that point to dead processes their inodes might
change. But if two pidfds point to same dead process they should
continue to do so after C/R.

This test ensures that this happens by calling `statx()` on pidfds after
C/R and then comparing their inode numbers.

Support for comparing pidfds by using `statx()` and inode numbers was
introduced alongside pidfs. So if `f_type` of pidfd is not equal to
`PID_FS_MAGIC` then we skip this test.

signed-off-by: Bhavik Sachdev <[email protected]>
We get the read end of a pipe using `pidfd_getfd` and check if we can
read from it after C/R.

signed-off-by: Bhavik Sachdev <[email protected]>
We open a pidfd to a thread using `PIDFD_THREAD` flag and after C/R
ensure that we can send signals using it with `PIDFD_SIGNAL_THREAD`.

signed-off-by: Bhavik Sachdev <[email protected]>
The command `ruff <path>` has been deprecated and removed:
https://astral.sh/blog/ruff-v0.5.0#removed-deprecated-features

Signed-off-by: Radostin Stoyanov <[email protected]>
This patch fixes the following errors reported by ruff:

lib/pycriu/images/pb2dict.py:307:24: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
    |
305 |     elif field.type in _basic_cast:
306 |         cast = _basic_cast[field.type]
307 |         if pretty and (cast == int):
    |                        ^^^^^^^^^^^ E721
308 |             if is_hex:
309 |                 # Fields that have (criu).hex = true option set
    |

lib/pycriu/images/pb2dict.py:379:13: E721 Use `is` and `is not` for type comparisons, or `isinstance()` for isinstance checks
    |
377 |     elif field.type in _basic_cast:
378 |         cast = _basic_cast[field.type]
379 |         if (cast == int) and is_string(value):
    |             ^^^^^^^^^^^ E721
380 |             if _marked_as_dev(field):
381 |                 return encode_dev(field, value)
    |

Signed-off-by: Radostin Stoyanov <[email protected]>
This patch extends the inventory image with a `plugins` field that
contains an array of plugins which were used during checkpoint,
for example, to save GPU state. In particular, the CUDA and AMDGPU
plugins are added to this field only when the checkpoint contains
GPU state. This allows to disable unnecessary plugins during restore,
show appropriate error messages if required CRIU plugin are missing,
and migrate a process that does not use GPU from a GPU-enabled system
to CPU-only environment.

We use the `optional plugins_entry` for backwards compatibility. This
entry allows us to distinguish between *unset* and *missing* field:

- When the field is missing, it indicates that the checkpoint was
  created with a previous version of CRIU, and all plugins should be
  *enabled* during restore.

- When the field is empty, it indicates that no plugins were used during
  checkpointing. Thus, all plugins can be *disabled* during restore.

Signed-off-by: Radostin Stoyanov <[email protected]>
This patch adds two test plugins to verify that CRIU plugins listed
in the inventory image are enabled, while those that are not listed
can be disabled.

Signed-off-by: Radostin Stoyanov <[email protected]>
This patch blocks SIGCHLD during temporary process creation to prevent a
race condition between kill() and waitpid() where sigchld_handler()
causes `criu restore` to fail with an error.

Fixes: checkpoint-restore#2490

Signed-off-by: Bhavik Sachdev <[email protected]>
Signed-off-by: Radostin Stoyanov <[email protected]>
Co-authored-by: Yixue Zhao <[email protected]>
Co-authored-by: stove <[email protected]>
Signed-off-by: Haorong Lu <[email protected]>
---
- rebased
- imported a page_size() type fix (authored by Cryolitia PukNgae)
Signed-off-by: PukNgae Cryolitia <[email protected]>
Signed-off-by: Alexander Mikhalitsyn <[email protected]>
Co-authored-by: Yixue Zhao <[email protected]>
Co-authored-by: stove <[email protected]>
Signed-off-by: Haorong Lu <[email protected]>
---
- rebased
- added a membarrier() to syscall table (fix authored by Cryolitia PukNgae)
Signed-off-by: PukNgae Cryolitia <[email protected]>
Signed-off-by: Alexander Mikhalitsyn <[email protected]>
Co-authored-by: Yixue Zhao <[email protected]>
Co-authored-by: stove <[email protected]>
Signed-off-by: Haorong Lu <[email protected]>
Co-authored-by: Yixue Zhao <[email protected]>
Co-authored-by: stove <[email protected]>
Signed-off-by: Haorong Lu <[email protected]>
Link: SerenityOS/serenity@e300da4

Signed-off-by: PukNgae Cryolitia <[email protected]>
---
- cherry-picked
Signed-off-by: Alexander Mikhalitsyn <[email protected]>
After a fork, both the child and parent processes may trigger a page fault (#PF)
at the same virtual address, referencing the same position in the page image.
If deduplication is enabled, the last process to trigger the page fault will fail.

Therefore, deduplication should be disabled after a fork to prevent this issue.

Signed-off-by: Liu Hua <[email protected]>
When restoring dumps in new mount + pid namespaces where multiple dumps
share the same network namespace, CRIU may fail due to conflicting
unix socket names. This happens because the service worker creates
sockets using a pattern that includes criu_run_id, but util_init()
is called after cr_service_work() starts.

The socket naming pattern "crtools-fd-%d-%d" uses the restore PID
and criu_run_id, however criu_run_id is always 0 when not initialized,
leading to conflicts when multiple restores run simultaneously either
in the same CRIU process or because of multiple CRIU processes
doing the same operation in different PID namespaces.

Fix this by:

- Moving util_init() before cr_service_work() starts
- Adding a second util_init() call in the service worker fork
to ensure unique IDs across multiple worker runs
- Making sure that dump and restore operations have util_init() called
early to generate unique socket names

With this fix, socket names always include the namespace ID, preventing
conflicts when multiple processes with the same pid share a network
namespace.

Fixes checkpoint-restore#2499

[ avagin: minore code changes ]

Signed-off-by: Lorenzo Fontana <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
When `check_freezer_cgroup()` has non-zero return value, `goto err` calls
`return ret`. However, the value of `ret` has been set to `0` in the lines
above and CRIU does not handle the error properly.

This problem is related to checkpoint-restore#2508

Signed-off-by: Radostin Stoyanov <[email protected]>
Container runtimes like CRI-O and containerd utilize the freezer cgroup
to create a consistent snapshot of container root filesystem (rootfs)
changes. In this case, the container is frozen before invoking CRIU.
After CRIU successfully completes, a copy of the container rootfs diff
is saved, and the container is then unfrozen.

However, the `cuda-checkpoint` tool is not able to perform a 'lock'
action on frozen threads.  To support GPU checkpointing with these
container runtimes, we need to unfreeze the cgroup and return it to its
original state once the checkpointing is complete.

To reflect this new behavior, the following changes are applied:
 - `dont_use_freeze_cgroup(void)` -> `set_compel_interrupt_only_mode(void)`
 - `bool freeze_cgroup_disabled` -> `bool compel_interrupt_only_mode`
 - `check_freezer_cgroup(void)` -> `prepare_freezer_for_interrupt_only_mode(void)`

Note that when `compel_interrupt_only_mode` is set to `true`,
`compel_interrupt_task()` is used instead of `freeze_processes()`
to prevent tasks from running during `criu dump`.

Fixes: checkpoint-restore#2508

Signed-off-by: Radostin Stoyanov <[email protected]>
The check for `/dev/nvidiactl` to determine if the CUDA plugin can be
used is unreliable because in some cases the default path for driver
installation is different [1]. This patch changes the logic to check
if a GPU device is available in `/proc/driver/nvidia/gpus/`. This
approach is similar to `torch.cuda.is_available()` and it is a more
accurate indicator.

The subsequent check for support of the `cuda-checkpoint --action`
option would confirm if the driver supports checkpoint/restore.

[1] https://github.com/NVIDIA/gpu-operator

Fixes: checkpoint-restore#2509

Signed-off-by: Radostin Stoyanov <[email protected]>
Currently, the `waitpid()` call on the tmp process can be made by a
process which is not its parent. This causes restore to fail.

This patch instead selects one process to create the tmp process and
open all the fds that point to it. These fds are sent to the correct
process(es).

Fixes: checkpoint-restore#2496

Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Bhavik Sachdev <[email protected]>
mihalicyn and others added 15 commits May 19, 2025 15:33
In this test we want to ensure that contents of droppable mappings
and mappings with MADV_WIPEONFORK is properly restored in
parent/child processes.

Signed-off-by: Alexander Mikhalitsyn <[email protected]>
CRIU currently requires a number of dependencies in order to build from
source. The package names vary across distributions and package
managers. A Nix flake allows developers to spin up a dev environment
with `nix develop`, eliminating the hassle of manual dependency
management. It also prevents polluting the global package set on the
machine.

Signed-off-by: Prajwal S N <[email protected]>
The cpuinfo command requires a "dump" or "check" subcommand. Thus, we
replace `CR_CPUINFO` with `CR_CPUINFO_DUMP` and `CR_CPUINFO_CHECK`.
This allows us to remove unnecessary subcommand check in
`image_dir_mode()` and perform all parsing in `parse_criu_mode()`.

With this change the check for validating the cpuinfo subcommand is
now done only once with `CR_CPUINFO_DUMP` or `CR_CPUINFO_CHECK` enum.

Signed-off-by: Liana Koleva <[email protected]>
Signed-off-by: Radostin Stoyanov <[email protected]>
The `criu cpuinfo check` command calls cpu_validate_cpuinfo(), which
attempts to open the cpuinfo.img file using `open_image()`. If the
image file is not found, `open_image()` returns an "empty image"
object. As a result, `cpu_validate_cpuinfo()` tries to read from it
and fails with the following error:

(00.002473) Error (criu/protobuf.c:72): Unexpected EOF on (empty-image)

This patch adds a check for an empty image and appropriate error message.

Signed-off-by: Radostin Stoyanov <[email protected]>
Fixes a clang compile-time error:
"argument unused during compilation: '-c'".

Signed-off-by: Andrei Vagin <[email protected]>
Use shared first error buffer to return correct
first error in rpc.

Fixes: checkpoint-restore#338

Signed-off-by: Ivan Pravdin <[email protected]>
Having CTL_FLAGS_IPC_EACCES_SKIP == (CTL_FLAGS_OPTIONAL |
CTL_FLAGS_READ_EIO_SKIP) is probably not what we want. So let's make it
a real distinct flag.

Fixes: 840735a ("ipc_sysctl: Prioritize restoring IPC variables using non usernsd approach")
Signed-off-by: Pavel Tikhomirov <[email protected]>
Fixes: f38e588 ("net/sysctl: c/r ipv4/ping_group_range value")
Signed-off-by: Pavel Tikhomirov <[email protected]>
We have ability to skip sysctl if there is no value, but we still give
n requests to sysctl_op, that is not correct and probably can segfault
on nullptr access. Fix it by adding ri to count non skipped requests.

To be on the safe side, let's add a check that ri == n on read, as we
should not do any skips there.

While on it lets fix bad error message prefix: s/unix/ipv4/.

Remove excess has_iarg set, and add sarg reset to NULL for the case
sysctl_op skipped it.

Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Pavel Tikhomirov <[email protected]>
We dump sysctls from criu user namespace, but restore from restored user
namespace. So group id values should be mapped to the restored user
namespace gid space to restore correctly.

Signed-off-by: Andrei Vagin <[email protected]>
Signed-off-by: Pavel Tikhomirov <[email protected]>
net/unix/max_dgram_qlen can't be tuned from non-root userns before:
v5.17-rc1~170^2~215 ("net: Enable max_dgram_qlen unix sysctl to be
configurable by non-init user namespaces")

Signed-off-by: Andrei Vagin <[email protected]>
Currently there is no option to checkpoint/restore programs that use
ICMP sockets, such as `ping`. This patch adds support for the same.

Fixes checkpoint-restore#2557

Signed-off-by: समीर सिंह Sameer Singh <[email protected]>
Add ZDTM static tests for IP4/ICMP and IP6/ICMP
socket feature.

Signed-off-by: समीर सिंह Sameer Singh <[email protected]>
Signed-off-by: Andrei Vagin <[email protected]>
@adrianreber
Copy link
Member

Do you think you could also add a zdtm test? You mentioned that you have a test that is able to trigger the problem you described.

@avagin avagin self-assigned this Jun 17, 2025

mntns_root = mntns_get_root_fd(nsid);

fd = open("/dev/urandom", O_RDONLY);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we generate this uid once for the entire dump?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change introduces generating a random suffix to prevent link_remap collisions when a dump fails and another dump is attempted on the same process

should we generate this uid once for the entire dump?

We already do this for dump_criu_run_id. Can we use this value instead of generating a new one?

}

// create file link_remap.fd_id.random_suffix id
snprintf(tmp + 1, sizeof(link_name) - (size_t)(tmp - link_name) - 1,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we customize a link name if linkat_hard returns EEXIST?

unsigned char rand_bytes[8];
if (read(fd, rand_bytes, sizeof(rand_bytes)) == sizeof(rand_bytes)) {
snprintf(random_suffix, sizeof(random_suffix), "%02x%02x%02x%02x",
rand_bytes[0], rand_bytes[1], rand_bytes[2], rand_bytes[3]);
Copy link
Member

@avagin avagin Jun 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you read 8 bytes, if you use just four of them?

(unsigned long)(ts.tv_sec ^ ts.tv_nsec));
}

// create file link_remap.fd_id.random_suffix id
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please use tabs for indentation

}

/* Use grand parent, if parent directory does not exist. */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

white spaces...

@rst0git
Copy link
Member

rst0git commented Jun 17, 2025

Tested w/ a process that opens and uses semaphores, the test creates remap files in /dev/shm that would have been created by criu on dump. A dump is done on this process and collisions are avoided. A restore is then done to verify nothing was damaged in C/R.

Would you be able to include this test, for example, as part of ZDTM or test/others?

@bsmithai
Copy link
Author

bsmithai commented Jun 17, 2025 via email

@github-actions
Copy link

A friendly reminder that this PR had no activity for 30 days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.