[CI][distributed] test failures are not reported in the summary #2134

@dvrogozh

#1971 added some new tests. We noticed that a couple of them were actually failing but were not reported in the CI summary: all jobs show a passed status and the summary lists no failures. Job:

Look at the log for the linux-distributed cases:

2025-10-03T01:18:25.3010246Z Process 1 exited with error code 10 and exception:
2025-10-03T01:18:25.3010485Z Traceback (most recent call last):
2025-10-03T01:18:25.3010903Z   File "/tmp/xpu-tool/Python/3.10.18/x64/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 863, in run_test
2025-10-03T01:18:25.3011301Z     getattr(self, test_name)()
2025-10-03T01:18:25.3011700Z   File "/tmp/xpu-tool/Python/3.10.18/x64/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 711, in wrapper
2025-10-03T01:18:25.3012069Z     fn()
2025-10-03T01:18:25.3012418Z   File "/tmp/xpu-tool/Python/3.10.18/x64/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3268, in wrapper
2025-10-03T01:18:25.3012863Z     method(*args, **kwargs)
2025-10-03T01:18:25.3013271Z   File "/tmp/xpu-tool/Python/3.10.18/x64/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 578, in instantiated_test
2025-10-03T01:18:25.3013670Z     test(self, **param_kwargs)
2025-10-03T01:18:25.3014168Z   File "/home/jenkins/actions-runner/_work/torch-xpu-ops/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/test/xpu/distributed/test_c10d_xccl.py", line 945, in test_short_pickle
2025-10-03T01:18:25.3014653Z     self._verify_trace(
2025-10-03T01:18:25.3015185Z   File "/home/jenkins/actions-runner/_work/torch-xpu-ops/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/test/xpu/distributed/test_c10d_xccl.py", line 841, in _verify_trace
2025-10-03T01:18:25.3015676Z     default_pg_info = pg_config["0"]
2025-10-03T01:18:25.3016363Z KeyError: '0\n\nTo execute this test, run the following from the base repo dir:\n    PYTORCH_TEST_WITH_SLOW=1 python test/xpu/distributed/test_c10d_xccl.py XCCLTraceTest.test_short_pickle_include_collectives_True\n\nThis message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0'
2025-10-03T01:18:25.3017076Z =================== 2 failed, 31 passed in 352.08s (0:05:52) ===================

And in the summary:

[New failed cases Summary]
No new failed cases found
[PASS] UT xpu_distributed test Pass
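
For illustration, here is a minimal sketch of one cross-check a summary step could apply: parse pytest's closing line (like the `2 failed, 31 passed` line in the log above) and refuse to print a pass when it reports failures or errors. This is not the actual CI summary script. The names `parse_pytest_summary` and `has_failures` are hypothetical, and the sketch assumes the raw job log text is available to the summary step.

```python
import re

# Matches pytest's closing summary line, with or without a timestamp prefix, e.g.
# "=================== 2 failed, 31 passed in 352.08s (0:05:52) ==================="
SUMMARY_RE = re.compile(r"=+ (?P<body>.*?) in [\d.]+s(?: \([\d:]+\))? =+")

# Matches each "<count> <outcome>" pair inside that line.
COUNT_RE = re.compile(
    r"(\d+) (failed|passed|skipped|error|xfailed|xpassed|deselected|warning)s?\b"
)


def parse_pytest_summary(log_text):
    """Hypothetical helper: sum outcome counts over all pytest summary lines."""
    counts = {}
    for line in log_text.splitlines():
        match = SUMMARY_RE.search(line)
        if match is None:
            continue
        for number, outcome in COUNT_RE.findall(match.group("body")):
            counts[outcome] = counts.get(outcome, 0) + int(number)
    return counts


def has_failures(counts):
    """Hypothetical helper: True if any test failed or errored."""
    return counts.get("failed", 0) + counts.get("error", 0) > 0


if __name__ == "__main__":
    log = (
        "2025-10-03T01:18:25.3017076Z =================== "
        "2 failed, 31 passed in 352.08s (0:05:52) ==================="
    )
    counts = parse_pytest_summary(log)
    print(counts, "FAIL" if has_failures(counts) else "PASS")
```

A check like this would only be a safety net. It does not explain why the two failures were missing from the "New failed cases" list in the first place.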

CC: @mengfei25, @chuanqi129, @frost-intel
