Closed
Labels
module: distributed (For distributed feature issue)
skipped (Used for temp UT failure to parallel fix)
Description
Cases:
unknown,third_party.torch-xpu-ops.test.xpu.distributed.test_c10d_xccl.XCCLTraceTest,test_short_pickle_include_collectives_False
unknown,third_party.torch-xpu-ops.test.xpu.distributed.test_c10d_xccl.XCCLTraceTest,test_short_pickle_include_collectives_True
Some tests added in #1971 are failing:
distributed/test_c10d_xccl.py::XCCLTraceTest::test_short_pickle_include_collectives_False
distributed/test_c10d_xccl.py::XCCLTraceTest::test_short_pickle_include_collectives_True
Log snapshot:
2025-10-03T01:18:25.3010246Z Process 1 exited with error code 10 and exception:
2025-10-03T01:18:25.3010485Z Traceback (most recent call last):
2025-10-03T01:18:25.3010903Z File "/tmp/xpu-tool/Python/3.10.18/x64/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 863, in run_test
2025-10-03T01:18:25.3011301Z getattr(self, test_name)()
2025-10-03T01:18:25.3011700Z File "/tmp/xpu-tool/Python/3.10.18/x64/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 711, in wrapper
2025-10-03T01:18:25.3012069Z fn()
2025-10-03T01:18:25.3012418Z File "/tmp/xpu-tool/Python/3.10.18/x64/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3268, in wrapper
2025-10-03T01:18:25.3012863Z method(*args, **kwargs)
2025-10-03T01:18:25.3013271Z File "/tmp/xpu-tool/Python/3.10.18/x64/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 578, in instantiated_test
2025-10-03T01:18:25.3013670Z test(self, **param_kwargs)
2025-10-03T01:18:25.3014168Z File "/home/jenkins/actions-runner/_work/torch-xpu-ops/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/test/xpu/distributed/test_c10d_xccl.py", line 945, in test_short_pickle
2025-10-03T01:18:25.3014653Z self._verify_trace(
2025-10-03T01:18:25.3015185Z File "/home/jenkins/actions-runner/_work/torch-xpu-ops/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/test/xpu/distributed/test_c10d_xccl.py", line 841, in _verify_trace
2025-10-03T01:18:25.3015676Z default_pg_info = pg_config["0"]
2025-10-03T01:18:25.3016363Z KeyError: '0\n\nTo execute this test, run the following from the base repo dir:\n PYTORCH_TEST_WITH_SLOW=1 python test/xpu/distributed/test_c10d_xccl.py XCCLTraceTest.test_short_pickle_include_collectives_True\n\nThis message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0'
Full log: https://github.com/intel/torch-xpu-ops/actions/runs/18164761505/job/51849163236?pr=1971
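When triaging the `KeyError: '0'` above, it may help to see which keys `pg_config` actually contains at the point of failure. The following is only a sketch for debugging, not the test's actual code: `lookup_pg_info` is a hypothetical helper, and it assumes `pg_config` is a plain mapping keyed by process-group name, which the traceback suggests but does not confirm.

```python
def lookup_pg_info(pg_config, pg_name="0"):
    """Return pg_config[pg_name]; on a miss, report the keys that are present.

    Hypothetical debugging helper. It assumes pg_config behaves like a dict
    keyed by process-group name, as the failing line pg_config["0"] suggests.
    """
    try:
        return pg_config[pg_name]
    except KeyError:
        available = sorted(str(key) for key in pg_config)
        raise KeyError(
            f"process group {pg_name!r} not in pg_config; available keys: {available}"
        ) from None


if __name__ == "__main__":
    # Made-up data for illustration only; the real trace contents are not shown in the log.
    example_config = {"1": {"desc": "placeholder"}}
    try:
        lookup_pg_info(example_config, "0")
    except KeyError as err:
        print(err.args[0])
```

Swapping a helper like this in for the direct `pg_config["0"]` lookup in `_verify_trace` would show whether the default process group is missing from the trace entirely or recorded under a different name.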
Note that test status reporting is affected by #2134.
CC: @frost-intel