Alexey-Rivkin
Contributor

What?

Switch the GPU tests to the Ubuntu 24.04 DOCA 3.1 image, which includes the GPUNetIO runtime and headers.
Follow-up to #10849

Why?

The previous RHEL 9 image lacked doca-gpunetio and its headers, blocking GPUNetIO-enabled tests.

Signed-off-by: Alexey Rivkin <[email protected]>
@Alexey-Rivkin Alexey-Rivkin marked this pull request as ready for review September 8, 2025 14:24
dpressle previously approved these changes Sep 9, 2025
Contributor

@yosefe yosefe left a comment

This container cannot build the UCX device API:

  • the DOCA verbs library is missing
  • it ships CUDA 13, while DOCA GPUNetIO requires CUDA 12
configure:30577: gcc -o conftest    -I/opt/mellanox/doca/include    -L/opt/mellanox/doca/lib  -Wl,-rpath-link,/opt/mellanox/doca/lib -L/opt/mellanox/doca/lib/x86_64-linux-gnu  -Wl,-rpath-link,/opt/mellanox/doca/lib/x86_64-linux-gnu -Wl,-rpath-link, conftest.c -ldoca_gpunetio -ldoca_gpunetio -lrt -lrt  >&5
/usr/bin/ld: warning: libdoca_verbs.so.2, needed by /opt/mellanox/doca/lib/x86_64-linux-gnu/libdoca_gpunetio.so, not found (try using -rpath or -rpath-link)
/usr/bin/ld: warning: libcuda.so.1, needed by /opt/mellanox/doca/lib/x86_64-linux-gnu/libdoca_gpunetio.so, not found (try using -rpath or -rpath-link)
/usr/bin/ld: /opt/mellanox/doca/lib/x86_64-linux-gnu/libdoca_gpunetio.so: undefined reference to `cuDeviceGet'
/usr/bin/ld: /opt/mellanox/doca/lib/x86_64-linux-gnu/libdoca_gpunetio.so: undefined reference to `doca_verbs_cq_get_dbr_addr@EXPERIMENTAL'
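
The two linker warnings above name shared libraries that could not be resolved for libdoca_gpunetio.so. A minimal sketch of how such gaps could be listed in a candidate image before running configure, assuming standard `ldd` output; `list_missing_deps` is a hypothetical helper, not part of the UCX CI scripts:

```shell
#!/bin/bash
# Sketch only: list_missing_deps is a hypothetical helper name.
# It reads `ldd` output on stdin and prints the name of every
# dependency that ldd reports as "not found", one per line.
list_missing_deps() {
    awk '/=> not found/ { print $1 }'
}

# Example invocation (library path taken from the log above):
#   ldd /opt/mellanox/doca/lib/x86_64-linux-gnu/libdoca_gpunetio.so | list_missing_deps
```

An empty result only means the direct dependencies resolve. It does not catch a CUDA major-version mismatch, which still has to be checked separately.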

@yosefe yosefe mentioned this pull request Sep 9, 2025
Alexey-Rivkin added a commit to Alexey-Rivkin/ucx that referenced this pull request Sep 9, 2025
Why: GPU CI builds now require GPUNetIO; silently skipping it can pass CI and fail at runtime.
Refs: openucx#10861 openucx#10865

Signed-off-by: Alexey Rivkin <[email protected]>
- az-helpers.sh: try_load_cuda_env() sets have_cuda=yes when nvcc is on PATH.
  Falls back to loading the CUDA module only if nvcc is absent.
Why: CUDA-enabled builder images already include the correct toolchain.
Preferring local CUDA avoids module mismatches and supports DOCA GPUNetIO builds.
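
The behavior described in this commit message can be sketched roughly as follows. This is an illustration, not the actual az-helpers.sh code: `CUDA_MODULE`, its default value, and `load_cuda_module` are hypothetical stand-ins for whatever module-loading mechanism the CI uses.

```shell
#!/bin/bash
# Sketch only: prefer a CUDA toolchain that is already on PATH and
# fall back to loading a module only when nvcc is absent.

CUDA_MODULE="${CUDA_MODULE:-dev/cuda}"   # hypothetical module name

load_cuda_module() {
    # Hypothetical fallback: succeeds only if a `module` command
    # exists and loading the requested module works.
    command -v module >/dev/null 2>&1 && module load "$1" >/dev/null 2>&1
}

try_load_cuda_env() {
    have_cuda=no
    if command -v nvcc >/dev/null 2>&1; then
        # The builder image already ships its own CUDA toolchain.
        have_cuda=yes
    elif load_cuda_module "$CUDA_MODULE" && command -v nvcc >/dev/null 2>&1; then
        # Fallback: the module provided nvcc.
        have_cuda=yes
    fi
}
```

Checking for nvcc first means an image that bundles a specific CUDA version is never overridden by a module carrying a different one.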

Signed-off-by: Alexey Rivkin <[email protected]>
@yosefe
Contributor

yosefe commented Sep 16, 2025

move the changes to #10865

@yosefe yosefe closed this Sep 16, 2025
3 participants