
Conversation

frost-intel
Contributor

Fixes #1978

In ProcessGroupNCCL, globalRank() returns a static int globalRank that is first initialized during ProcessGroup setup, so the globalRank assigned to each thread matches the id of its CUDA device. We were not using this pattern for XCCL; instead, we used the thread's assigned (group-local) rank, which is not necessarily the same as the globalRank.

The failing test test_barrier created two separate groups of 2 ranks each, and then all 4 threads called barrier, but all of them on the same 2-rank group. Because no device id is specified in this barrier call, each thread attempts to "guess" its device index. In the previous code this guess would be 0 or 1, since each thread's rank was the group-local rank rather than the globalRank. In barrier, the guessed id was used to initialize an XCCLComm object and then call allreduce on a single-element tensor. As a result, allreduce was issued twice on each device, which could hang depending on the execution order of the 4 threads.
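
For illustration, the guess described above boils down to a rank-modulo-device-count mapping along these lines (a simplified sketch; guessDeviceIndex and its parameters are hypothetical names, not the actual ProcessGroupXCCL code):

// Simplified sketch of the device-index fallback barrier() uses when no
// device id is supplied (hypothetical helper, not the PyTorch source).
int guessDeviceIndex(int rankUsedForGuess, int numDevices) {
  // Rank-based guess: each rank is expected to land on a distinct device.
  return rankUsedForGuess % numDevices;
}

With the group-local rank, both 2-rank subgroups guess devices 0 and 1, so two threads end up issuing the barrier allreduce on the same device; with the global rank, each thread's guess maps to a unique device.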

With the update in this PR, ProcessGroupXCCL now uses the static globalRank in the same places as ProcessGroupNCCL, so the initialized XCCLComm objects map to unique devices and allreduce is no longer issued on the same device multiple times.
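
Concretely, the pattern borrowed from ProcessGroupNCCL is a function-local static captured on first use; a minimal sketch (the body matches the diff reviewed below, the comments are interpretive):

// Returns the rank this process holds in the default (global) process group.
// The function-local static is initialized on the first call, which happens
// during ProcessGroup setup while rank_ still holds the global rank, so later
// calls made from sub-groups keep returning the global value instead of the
// group-local rank.
const int& ProcessGroupXCCL::globalRank() const {
  static int globalRank = rank_;
  return globalRank;
}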

@frost-intel frost-intel requested review from Copilot, Chao1Han and daisyden and removed request for Copilot September 12, 2025 19:07
@frost-intel frost-intel changed the title Fix test_barrier hang by using static global rank in ProcessGroupXCCL [Do not merge] Fix test_barrier hang by using static global rank in ProcessGroupXCCL Sep 15, 2025
@frost-intel
Contributor Author

Please wait on merging, I want to confirm behavior in scale-out test.

@Copilot Copilot AI review requested due to automatic review settings September 19, 2025 15:12
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR fixes a hang in the test_barrier test by making ProcessGroupXCCL use static global ranks consistently, matching the pattern used in ProcessGroupNCCL. The issue occurred when multiple threads used the same device ID for barrier operations, causing duplicate allreduce calls that could hang based on execution order.

Key changes:

  • Added globalRank() method that returns a static global rank value
  • Updated device ID guessing logic to use global rank instead of thread rank
  • Updated logging and debugging to reference global rank consistently

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File | Description
ProcessGroupXCCL.hpp | Added declaration for the globalRank() method
ProcessGroupXCCL.cpp | Implemented globalRank() method and updated references to use it


Comment on lines +355 to +357
const int& ProcessGroupXCCL::globalRank() const {
  static int globalRank = rank_;
  return globalRank;

Copilot AI Sep 19, 2025


The static variable globalRank is initialized from rank_ on first call, but this could lead to race conditions if multiple threads call this method simultaneously during initialization. Consider using std::call_once or atomic operations to ensure thread-safe initialization.

Suggested change
- const int& ProcessGroupXCCL::globalRank() const {
-   static int globalRank = rank_;
-   return globalRank;
+ int ProcessGroupXCCL::globalRank() const {
+   return rank_;
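
For reference, the std::call_once approach mentioned in the comment could look like the following (illustrative sketch only, not part of the PR; note that C++11 already guarantees thread-safe initialization of function-local statics, so this spells out the same behavior explicitly):

#include <mutex>

const int& ProcessGroupXCCL::globalRank() const {
  // Initialize the cached value exactly once, even under concurrent calls.
  static std::once_flag onceFlag;
  static int cachedGlobalRank = 0;
  std::call_once(onceFlag, [this] { cachedGlobalRank = rank_; });
  return cachedGlobalRank;
}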



Successfully merging this pull request may close these issues.

[distributed][checkpoint]test_utils.py::TestDistWrapper::test_barrier random RuntimeError