-
Notifications
You must be signed in to change notification settings - Fork 63
Add FlightRecorder tests #1971
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add FlightRecorder tests #1971
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds FlightRecorder tests for XCCL (Intel XPU Collective Communications Library) as a follow-up to #1867. The tests validate flight recording functionality for distributed operations on Intel XPU devices, including trace dumping, timing, and various collective operations.
- Adds comprehensive test suite for XCCL FlightRecorder functionality based on NCCL tests
- Moves
global_ranks_in_groupandgroup_namefrom ProcessGroupXCCL::Options to Backend::Options - Adds conditional recording parameter to
initWorkmethod to control when flight recording occurs
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| test/xpu/distributed/test_c10d_xccl.py | Adds XCCLTraceTestBase and XCCLTraceTest classes with comprehensive flight recorder tests |
| src/xccl/ProcessGroupXCCL.hpp | Removes group-specific options and adds record parameter to initWork method |
| src/xccl/ProcessGroupXCCL.cpp | Implements conditional flight recording and fixes sequence counting logic |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
|
Hi @frost-intel, could you apt install clang-format and format ProcessGroupXCCL.cpp/ProcessGroupXCCL.hpp before push code. |
|
@frost-intel The UT failure is related? |
|
@guangyey Yes, the UT is related. I'm unable to reproduce the error on Borealis/Aurora. It seems like the final collective isn't being registered as complete |
|
@zhangxiaoli73 This is also ready for merge. Do we need a 2nd reviewer? |
|
@zhangxiaoli73 I don't have permission to merge. I assumed you did. @guangyey Can you merge this? |
Co-authored-by: Yu, Guangye <[email protected]>
Co-authored-by: Yu, Guangye <[email protected]>
|
I wish we had XCCL related code and tests right in the pytorch repo. This way we could work to unify the test code. As of now this test seems to be a huge duplication which is hardly reviewable thoroughly. But to make things going, I guess we need to merge this in. |
As a follow-up to #1867 , this PR includes tests for the FlightRecorder on XCCL, as well as moving some definitions from ProcessGroupXCCL::Options to Backend::Options.
These tests are largely based on
pytorch/test/distributed/test_c10d_nccl.py, but doesn't include some tests:test_short_jsonsince json dumps are not supported in ProcessGroupXCCLtest_trace_while_all_works_retired:_wait_for_pending_worksisn't supported by XCCLtest_trace_while_active: XCCL hangs when op is called on only one ranktest_trace_while_stuck: XCCL hangs when op is called on only one rank