Conversation

@chivatam commented Sep 7, 2025

Description

Added rocSHMEM dependencies to the Dockerfile.

@msaroufim

chivatam marked this pull request as draft September 7, 2025 14:29
chivatam marked this pull request as ready for review September 7, 2025 14:29
@msaroufim (Member)

Could you share a toy user submission as well using rocshmem? Just wanna get a sense of what things will look like e2e.

@msaroufim (Member)

Also @saienduri to sanity check

@chivatam (Author) commented Sep 7, 2025

> Could you share a toy user submission as well using rocshmem? Just wanna get a sense of what things will look like e2e.

```python
import os
from typing import Any

from torch.utils.cpp_extension import load_inline


ROCSHMEM_INSTALL_DIR = os.environ.get("ROCSHMEM_INSTALL_DIR", "/opt/rocshmem")
OMPI_INSTALL_DIR = os.environ.get("OMPI_INSTALL_DIR", "/opt/openmpi")


EXT_NAME = "rocshmem_all2all_ext"

CUDA_SRC = r"""
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>
#include <cstdlib>
#include <vector>

#include <hip/hip_runtime.h>
#include <roc_shmem.hpp>

namespace py = pybind11;

__global__ void all2all_kernel(int* symm, int npes) {
    if (threadIdx.x == 0) {
        int me = roc_shmem_my_pe();

        // Initialize local symmetric buffer
        for (int i = 0; i < npes; ++i) symm[i] = -1;
        roc_shmem_barrier_all();

        // Put my rank into every PE's symmetric buffer at index 'me'
        for (int dst = 0; dst < npes; ++dst) {
            roc_shmem_int_p(symm + me, me, dst);
        }
        roc_shmem_barrier_all();
    }
}

static void hip_check(hipError_t err, const char* where) {
    if (err != hipSuccess) {
        throw std::runtime_error(std::string("HIP error at ") + where + ": " + hipGetErrorString(err));
    }
}

void bind_and_init() {
    // Bind device based on rank
    int dev_count = 0;
    hip_check(hipGetDeviceCount(&dev_count), "hipGetDeviceCount");

    int rank = 0;
    if (const char* s = std::getenv("OMPI_COMM_WORLD_RANK")) {
        rank = std::atoi(s);
    }
    hip_check(hipSetDevice(dev_count == 0 ? 0 : (rank % dev_count)), "hipSetDevice");

    // Initialize rocSHMEM after device selection
    roc_shmem_init();
}

std::vector<int> run_all2all() {
    int me   = roc_shmem_my_pe();
    int npes = roc_shmem_n_pes();

    int* symm = (int*)roc_shmem_malloc(sizeof(int) * npes);
    if (!symm) throw std::runtime_error("roc_shmem_malloc failed");

    // Launch one-thread kernel to do the collective
    all2all_kernel<<<1, 1>>>(symm, npes);
    hip_check(hipDeviceSynchronize(), "hipDeviceSynchronize");

    // Copy local symmetric buffer back to host
    std::vector<int> out(npes, -1);
    hip_check(hipMemcpy(out.data(), symm, sizeof(int) * npes, hipMemcpyDeviceToHost), "hipMemcpy D2H");

    roc_shmem_free(symm);
    return out;
}

void finalize() {
    roc_shmem_finalize();
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("bind_and_init", &bind_and_init);
    m.def("run_all2all",  &run_all2all);
    m.def("finalize",     &finalize);
}
"""


def _build_ext():
    return load_inline(
        name=EXT_NAME,
        # Bindings are defined manually via PYBIND11_MODULE in the HIP source,
        # so pass an empty cpp source and skip `functions=` (which would otherwise
        # generate a second, conflicting PYBIND11_MODULE).
        cpp_sources="",
        cuda_sources=[CUDA_SRC],
        with_cuda=True,
        extra_cflags=["-std=c++17"],
        extra_cuda_cflags=["-std=c++17"],
        extra_include_paths=[f"{ROCSHMEM_INSTALL_DIR}/include"],
        extra_ldflags=[
            f"-L{ROCSHMEM_INSTALL_DIR}/lib", "-lrocshmem",
            f"-L{OMPI_INSTALL_DIR}/lib", "-lmpi",
            f"-Wl,-rpath,{ROCSHMEM_INSTALL_DIR}/lib:{OMPI_INSTALL_DIR}/lib",
        ],
        verbose=True,
    )


# --- Optional: type-compatible stub for the Python leaderboard pattern ---
def custom_kernel(data: Any):  # input_t -> output_t, toy no-op to fit signature
    return data


def _rank_and_world():
    r = int(os.environ.get("OMPI_COMM_WORLD_RANK", "0"))
    w = int(os.environ.get("OMPI_COMM_WORLD_SIZE", "1"))
    return r, w


if __name__ == "__main__":
    rank, world = _rank_and_world()
    ext = _build_ext()
    ext.bind_and_init()
    out = ext.run_all2all()
    print(f"Rank {rank}/{world} all2all -> {out}")
    ext.finalize()
```

Vibe coded this, but it's gonna look similar to HIP kernels in Python.
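For context on what this toy should produce (a sketch; it assumes the script is launched with an MPI-style launcher such as `mpirun -np <world> python submission.py`, so the OMPI_COMM_WORLD_* variables are set): with the kernel above, each PE writes its rank into slot `me` of every peer's symmetric buffer, so every rank should come back with `[0, 1, ..., world-1]`. A quick self-check after `run_all2all()` could be:

```python
# Sanity check (sketch): given the all2all kernel above, every rank's local
# symmetric buffer should end up as [0, 1, ..., world-1].
expected = list(range(world))
assert out == expected, f"rank {rank}: expected {expected}, got {out}"
```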

@msaroufim

@saienduri (Collaborator)

Looks good to me. Starting a test docker build here to check status: https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17545534459.

@chivatam (Author) commented Sep 8, 2025

Ooo! Looks like there is some issue with UCX. I'll debug it today!

@chivatam (Author) commented Sep 9, 2025

@saienduri I made some changes, but I'm not sure if they work. Is there a way to test the workflow without approval? I don't have an MI300X to test on 😅

msaroufim requested a review from saienduri September 9, 2025 18:17
@saienduri (Collaborator) commented Sep 13, 2025

Thanks, trying a build here now: https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17701378282. You can also try building the Docker image locally just to see if the build passes.

@saienduri (Collaborator)

Cool, the build passed and a sanity test passed here: https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17702258708
I was using this test payload: https://github.com/gpu-mode/discord-cluster-manager/blob/saienduri/fix-payload/scripts/github_test_payload.json
Can you also share a small payload for testing whether rocSHMEM works before we merge this PR?

@chivatam (Author)

@saienduri added one, lmk if it works!

@saienduri (Collaborator)

Hmm, getting ValueError: Invalid language cpp (https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17714138325)

@msaroufim (Member) commented Sep 14, 2025

You want the example working with load_inline in PyTorch

@chivatam (Author)

done but idk if it works 😬

@msaroufim (Member)

@saienduri can we test the provided payload example on the server directly? If it's fine then we should be good to merge

@saienduri (Collaborator) commented Sep 17, 2025

OK, running the payload in GitHub Actions yielded the following (https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17790562194):

"stdout": "=== ROCshmem PyTorch Inline Test ===\nROCshmem test failed: module 'torch.utils' has no attribute 'cpp_extension'\n"

I'll try on the server itself, but I'm pretty sure it will be the same error.
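For what it's worth, that AttributeError usually points at a missing submodule import rather than a rocSHMEM problem: `import torch` alone does not load `torch.utils.cpp_extension`, so accessing it as an attribute fails. A minimal sketch of the likely fix in the payload (the payload's actual structure is an assumption here):

```python
# Import the submodule explicitly; `import torch` by itself does not expose
# torch.utils.cpp_extension as an attribute.
from torch.utils.cpp_extension import load_inline

# Or, equivalently:
# import torch.utils.cpp_extension
# load_inline = torch.utils.cpp_extension.load_inline
```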
