Conversation

dstaay-fb (Contributor)
Summary: Enable device='cuda' for RDMABuffer

Reviewed By: zdevito

Differential Revision: D82331433

meta-cla bot added the "CLA Signed" label (managed by the Meta Open Source bot) on Sep 16, 2025
facebook-github-bot (Contributor)

@dstaay-fb has exported this pull request. If you are a Meta employee, you can view the originating diff in D82331433.


Summary:

This diff integrates Monarch RDMA with PyTorch's caching allocator for CUDA devices.

Typical users will want to create tensors (managed by PyTorch's caching allocator) and simply pass them to RdmaBuffer(..). However, we can't keep generating BAR1 memory registrations on every call, given the limits on total registration space (~128GB estimated). So in this diff:
- Monarch RDMA inspects the actual memory segments registered inside the caching allocator, and only creates an MR when needed.
- If a segment expands during the program's lifecycle, we can use mlx5dv advanced APIs to build RDMA regions backed by multiple MR regions, and simply hand out a new lkey/rkey for a new tensor within the expanded segment (old keys remain valid as well).
- To inspect the PyTorch caching allocator, we leverage the snapshot API whenever we encounter a register_buffer request whose virtual address range falls outside previously known segments.
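The segment-tracking decision described above can be sketched roughly as follows. This is a hypothetical illustration, not Monarch's actual code: the `SegmentTracker` class and its method names are invented here, and the snapshot dictionary shape (`"address"`, `"total_size"` keys per segment) mirrors what `torch.cuda.memory._snapshot()["segments"]` reports, but should be verified against the PyTorch version in use.

```python
# Hypothetical sketch of the "only create an MR when needed" logic.
# SegmentTracker and its methods are illustrative names, not Monarch's API.

class SegmentTracker:
    """Tracks caching-allocator segments for which an MR already exists."""

    def __init__(self):
        # Known segments as (start, end) virtual address ranges.
        self.segments = []

    def refresh(self, snapshot_segments):
        """Re-read the segment layout, e.g. from
        torch.cuda.memory._snapshot()["segments"] (key names assumed)."""
        self.segments = [
            (seg["address"], seg["address"] + seg["total_size"])
            for seg in snapshot_segments
        ]

    def covers(self, addr, size):
        """True if [addr, addr + size) lies entirely inside a known segment."""
        return any(start <= addr and addr + size <= end
                   for start, end in self.segments)

    def needs_registration(self, addr, size):
        """A new MR (or snapshot refresh) is needed only when the buffer
        falls outside all previously known segments."""
        return not self.covers(addr, size)
```

A register_buffer request would first ask `needs_registration(...)`; only on a miss would the allocator snapshot be re-taken and, if the range is still uncovered, a fresh memory registration created. This keeps the number of BAR1 registrations bounded by the number of allocator segments rather than the number of tensors.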

Other changes:
- It doesn't make sense to tie MRs to RdmaDomain; they move up to the Manager level.

Reviewed By: zdevito

Differential Revision: D81736949
dstaay-fb added a commit to dstaay-fb/monarch that referenced this pull request Sep 16, 2025

Labels: CLA Signed (managed by the Meta Open Source bot), fb-exported, meta-exported