rocshmem dependencies #349
Conversation
Could you share a toy user submission as well using rocshmem? Just wanna get a sense of what things will look like e2e.
Also @saienduri to sanity check.
Vibe coded this, but it's gonna look similar to HIP kernels in Python.
Looks good to me. Starting a test docker build here to check status: https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17545534459.
ooo! Looks like there is some issue with UCX. I'll debug it today!
@saienduri I made some changes but I'm not sure if they work. Is there a way to test the workflow without approval? I don't have an MI300X to test on 😅
Thanks, trying a build here now: https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17701378282. You can try building the docker image locally just to see if the build passes.
Cool, the build passed and a sanity test passed here: https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17702258708
@saienduri added one, lmk if it works!
Hmm getting
You want the example working with load_inline in PyTorch
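For reference, a minimal sketch of what a `load_inline` payload could look like. None of this is from the PR: the `/opt/rocm` prefix, the `-lrocshmem` library name, the module name `rocshmem_toy`, and the `add_one` kernel are all assumptions for illustration, and the kernel is plain HIP with no rocSHMEM calls, so it only shows where the include and link flags would go. A real rocSHMEM build will likely need more flags than this.

```python
from typing import Any, Dict

# Hypothetical install prefix; the real location depends on the Docker image.
ROCM_HOME = "/opt/rocm"

# Plain HIP kernel, deliberately free of rocSHMEM calls so the sketch
# does not guess at the rocSHMEM device API.
HIP_SOURCE = r"""
#include <torch/extension.h>
#include <hip/hip_runtime.h>

__global__ void add_one_kernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

void add_one(torch::Tensor x) {
    int n = static_cast<int>(x.numel());
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    hipLaunchKernelGGL(add_one_kernel, dim3(blocks), dim3(threads), 0, 0,
                       x.data_ptr<float>(), n);
}
"""

# Declaration that load_inline uses to generate the Python binding.
CPP_SOURCE = "void add_one(torch::Tensor x);"


def build_load_inline_kwargs(rocm_home: str = ROCM_HOME) -> Dict[str, Any]:
    """Collect keyword arguments for torch.utils.cpp_extension.load_inline."""
    return {
        "name": "rocshmem_toy",                      # hypothetical module name
        "cpp_sources": [CPP_SOURCE],
        "cuda_sources": [HIP_SOURCE],                # compiled as HIP on ROCm builds of PyTorch
        "functions": ["add_one"],
        "extra_include_paths": [f"{rocm_home}/include"],
        # Assumed library name and path; adjust to the actual install.
        "extra_ldflags": [f"-L{rocm_home}/lib", "-lrocshmem"],
        "with_cuda": True,
        "verbose": True,
    }


def compile_module(rocm_home: str = ROCM_HOME):
    """Compile the extension; needs a ROCm build of PyTorch and a GPU toolchain."""
    from torch.utils.cpp_extension import load_inline  # imported lazily
    return load_inline(**build_load_inline_kwargs(rocm_home))
```

Calling `compile_module()` on a machine with ROCm PyTorch would build the extension; `build_load_inline_kwargs()` alone can be inspected anywhere to check the flags.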
Done, but idk if it works 😬
@saienduri can we test the provided payload example on the server directly? If it's fine then we should be good to merge.
ok running the payload in github actions yielded the following (https://github.com/gpu-mode/discord-cluster-manager/actions/runs/17790562194):
I'll try on the server itself, but pretty sure it will be the same error.
Description
Added rocshmem dependencies to the Dockerfile.
@msaroufim