[distributed] test_short_pickle_include_collectives tests fail #2135

@dvrogozh

Cases:
unknown,third_party.torch-xpu-ops.test.xpu.distributed.test_c10d_xccl.XCCLTraceTest,test_short_pickle_include_collectives_False
unknown,third_party.torch-xpu-ops.test.xpu.distributed.test_c10d_xccl.XCCLTraceTest,test_short_pickle_include_collectives_True

Some tests added in #1971 are failing:

  • distributed/test_c10d_xccl.py::XCCLTraceTest::test_short_pickle_include_collectives_False
  • distributed/test_c10d_xccl.py::XCCLTraceTest::test_short_pickle_include_collectives_True

Log snapshot:

2025-10-03T01:18:25.3010246Z Process 1 exited with error code 10 and exception:
2025-10-03T01:18:25.3010485Z Traceback (most recent call last):
2025-10-03T01:18:25.3010903Z   File "/tmp/xpu-tool/Python/3.10.18/x64/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 863, in run_test
2025-10-03T01:18:25.3011301Z     getattr(self, test_name)()
2025-10-03T01:18:25.3011700Z   File "/tmp/xpu-tool/Python/3.10.18/x64/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 711, in wrapper
2025-10-03T01:18:25.3012069Z     fn()
2025-10-03T01:18:25.3012418Z   File "/tmp/xpu-tool/Python/3.10.18/x64/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3268, in wrapper
2025-10-03T01:18:25.3012863Z     method(*args, **kwargs)
2025-10-03T01:18:25.3013271Z   File "/tmp/xpu-tool/Python/3.10.18/x64/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 578, in instantiated_test
2025-10-03T01:18:25.3013670Z     test(self, **param_kwargs)
2025-10-03T01:18:25.3014168Z   File "/home/jenkins/actions-runner/_work/torch-xpu-ops/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/test/xpu/distributed/test_c10d_xccl.py", line 945, in test_short_pickle
2025-10-03T01:18:25.3014653Z     self._verify_trace(
2025-10-03T01:18:25.3015185Z   File "/home/jenkins/actions-runner/_work/torch-xpu-ops/torch-xpu-ops/pytorch/third_party/torch-xpu-ops/test/xpu/distributed/test_c10d_xccl.py", line 841, in _verify_trace
2025-10-03T01:18:25.3015676Z     default_pg_info = pg_config["0"]
2025-10-03T01:18:25.3016363Z KeyError: '0\n\nTo execute this test, run the following from the base repo dir:\n    PYTORCH_TEST_WITH_SLOW=1 python test/xpu/distributed/test_c10d_xccl.py XCCLTraceTest.test_short_pickle_include_collectives_True\n\nThis message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0'

Full log: https://github.com/intel/torch-xpu-ops/actions/runs/18164761505/job/51849163236?pr=1971
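
The `KeyError` in the log comes from the `pg_config["0"]` lookup in `_verify_trace`: the key that was not found is the string `"0"`, and the rest of the quoted message appears to be the repro hint appended by the test harness. As a debugging aid, the lookup could be wrapped so that the error reports which keys are actually present. This is a minimal sketch, not the test's code: `lookup_pg_info` is a hypothetical helper, and the sample dictionaries are made up for illustration rather than taken from the failing run.

```python
def lookup_pg_info(pg_config, pg_id="0"):
    """Return pg_config[pg_id]; on a miss, raise a KeyError listing the keys present.

    Hypothetical helper for illustration only; it is not part of test_c10d_xccl.py.
    """
    try:
        return pg_config[pg_id]
    except KeyError:
        raise KeyError(
            f"process group {pg_id!r} not in pg_config; "
            f"available keys: {list(pg_config)}"
        ) from None


# Made-up data: an empty pg_config reproduces the same kind of failure.
empty_pg_config = {}
try:
    lookup_pg_info(empty_pg_config)
except KeyError as exc:
    message = exc.args[0]

# Made-up data: a populated pg_config returns the entry unchanged.
found = lookup_pg_info({"0": {"name": "example"}})
```

With an error like this, the log would show directly whether `pg_config` is empty or keyed differently (for example by an integer instead of a string).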

Note that test status reporting is affected by #2134.

CC: @frost-intel

Labels

  • module: distributed (For distributed feature issue)
  • skipped (Used for temp UT failure to parallel fix)
