Closed
Description
#1971 added some new tests. We noticed that a couple of those tests were actually failing but were not reported in the CI summary: all jobs have a passed status, and the summary is missing the failures. Job:
Look at the linux-distributed cases:
2025-10-03T01:18:25.3010246Z Process 1 exited with error code 10 and exception:
2025-10-03T01:18:25.3010485Z Traceback (most recent call last):
2025-10-03T01:18:25.3010903Z File "/tmp/xpu-tool/Python/3.10.18/x64/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 863, in run_test
2025-10-03T01:18:25.3011301Z getattr(self, test_name)()
2025-10-03T01:18:25.3011700Z File "/tmp/xpu-tool/Python/3.10.18/x64/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 711, in wrapper
2025-10-03T01:18:25.3012069Z fn()
2025-10-03T01:18:25.3012418Z File "/tmp/xpu-tool/Python/3.10.18/x64/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3268, in wrapper
2025-10-03T01:18:25.3012863Z method(*args, **kwargs)
2025-10-03T01:18:25.3013271Z File "/tmp/xpu-tool/Python/3.10.18/x64/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 578, in instantiated_test
2025-10-03T01:18:25.3013670Z test(self, **param_kwargs)
2025-10-03T01:18:25.3014168Z File "/home/jenkins/actions-runner/_work/torch-xpu-ops/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/test/xpu/distributed/test_c10d_xccl.py", line 945, in test_short_pickle
2025-10-03T01:18:25.3014653Z self._verify_trace(
2025-10-03T01:18:25.3015185Z File "/home/jenkins/actions-runner/_work/torch-xpu-ops/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/test/xpu/distributed/test_c10d_xccl.py", line 841, in _verify_trace
2025-10-03T01:18:25.3015676Z default_pg_info = pg_config["0"]
2025-10-03T01:18:25.3016363Z KeyError: '0\n\nTo execute this test, run the following from the base repo dir:\n PYTORCH_TEST_WITH_SLOW=1 python test/xpu/distributed/test_c10d_xccl.py XCCLTraceTest.test_short_pickle_include_collectives_True\n\nThis message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0'
2025-10-03T01:18:25.3017076Z =================== 2 failed, 31 passed in 352.08s (0:05:52) ===================
And in the summary:
[New failed cases Summary]
No new failed cases found
[PASS] UT xpu_distributed test Pass
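One possible way to catch this kind of mismatch is to cross-check the summary against pytest's own final line in the raw log. The sketch below is only an illustration, not the actual summary script used by the CI: the function names `parse_pytest_summary` and `has_unreported_failures` are hypothetical, and it assumes the log contains the standard pytest result line (`N failed, M passed in ...s`) as in the excerpt above.

```python
import re

# Matches pytest's final result line, with or without a CI timestamp prefix, e.g.
# "=================== 2 failed, 31 passed in 352.08s (0:05:52) ==================="
SUMMARY_RE = re.compile(r"=+ (?P<counts>.+?) in [\d.]+s(?: \([\d:]+\))? =+")
COUNT_RE = re.compile(r"(\d+) (\w+)")


def parse_pytest_summary(log_text):
    """Sum the outcome counts from every pytest result line found in the log."""
    totals = {}
    for line in log_text.splitlines():
        match = SUMMARY_RE.search(line)
        if not match:
            continue
        for number, outcome in COUNT_RE.findall(match.group("counts")):
            totals[outcome] = totals.get(outcome, 0) + int(number)
    return totals


def has_unreported_failures(log_text, reported_failures):
    """Return True if the log shows more failures than the summary reported.

    `reported_failures` is the failure count taken from the CI summary
    (hypothetical input; how it is obtained depends on the summary script).
    """
    totals = parse_pytest_summary(log_text)
    actual = (
        totals.get("failed", 0)
        + totals.get("error", 0)
        + totals.get("errors", 0)
    )
    return actual > reported_failures
```

For the log line above and a summary that reports zero failed cases, `has_unreported_failures` would flag the job, since the log itself records 2 failed tests.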