Skip to content

Conversation

@frost-intel
Copy link
Contributor

As a follow-up to #1867 , this PR includes tests for the FlightRecorder on XCCL, as well as moving some definitions from ProcessGroupXCCL::Options to Backend::Options.

These tests are largely based on pytorch/test/distributed/test_c10d_nccl.py, but doesn't include some tests:

  • test_short_json since json dumps are not supported in ProcessGroupXCCL
  • test_trace_while_all_works_retired: _wait_for_pending_works isn't supported by XCCL
  • test_trace_while_active: XCCL hangs when op is called on only one rank
  • test_trace_while_stuck: XCCL hangs when op is called on only one rank

@frost-intel frost-intel marked this pull request as ready for review August 27, 2025 20:36
Copilot AI review requested due to automatic review settings August 28, 2025 12:32
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds FlightRecorder tests for XCCL (Intel XPU Collective Communications Library) as a follow-up to #1867. The tests validate flight recording functionality for distributed operations on Intel XPU devices, including trace dumping, timing, and various collective operations.

  • Adds comprehensive test suite for XCCL FlightRecorder functionality based on NCCL tests
  • Moves global_ranks_in_group and group_name from ProcessGroupXCCL::Options to Backend::Options
  • Adds conditional recording parameter to initWork method to control when flight recording occurs

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
test/xpu/distributed/test_c10d_xccl.py Adds XCCLTraceTestBase and XCCLTraceTest classes with comprehensive flight recorder tests
src/xccl/ProcessGroupXCCL.hpp Removes group-specific options and adds record parameter to initWork method
src/xccl/ProcessGroupXCCL.cpp Implements conditional flight recording and fixes sequence counting logic

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@Chao1Han
Copy link
Contributor

Chao1Han commented Aug 29, 2025

Hi @frost-intel, could you apt install clang-format and format ProcessGroupXCCL.cpp/ProcessGroupXCCL.hpp before push code.
Avoid periodic code cleanup like https://github.com/intel/torch-xpu-ops/pull/1960/files

@guangyey
Copy link
Contributor

guangyey commented Sep 3, 2025

@frost-intel The UT failure is related?

@frost-intel
Copy link
Contributor Author

@guangyey Yes, the UT is related. I'm unable to reproduce the error on Borealis/Aurora. It seems like the final collective isn't being registered as complete

@zhangxiaoli73 zhangxiaoli73 self-requested a review September 23, 2025 01:44
@frost-intel
Copy link
Contributor Author

@zhangxiaoli73 This is also ready for merge. Do we need a 2nd reviewer?

@frost-intel
Copy link
Contributor Author

@zhangxiaoli73 I don't have permission to merge. I assumed you did.

@guangyey Can you merge this?

@dvrogozh
Copy link
Contributor

dvrogozh commented Oct 3, 2025

I wish we had XCCL related code and tests right in the pytorch repo. This way we could work to unify the test code. As of now this test seems to be a huge duplication which is hardly reviewable thoroughly. But to make things going, I guess we need to merge this in.

@dvrogozh dvrogozh added this pull request to the merge queue Oct 3, 2025
Merged via the queue into main with commit cae6ba3 Oct 3, 2025
51 of 55 checks passed
@dvrogozh dvrogozh deleted the frost/fr_tests branch October 3, 2025 15:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants