Logging Metrics Manually in DDP mode with 2 GPUs #3088
The torchmetrics docs for PyTorch Lightning describe how to manually call `compute()` and `reset()` at the end of the train/val epochs when logging a metric manually instead of letting `self.log(...)` do the work. This works for me on a single GPU, but with 2 GPUs and the DDP strategy, the `metric.compute()` call hangs for a long time and then a timeout error occurs. Is there a different set of instructions for manual logging in DDP mode when using more than 1 GPU?
Replies: 1 comment
When using torchmetrics with PyTorch Lightning in DDP mode (multi-GPU), metrics need to be synchronized across all processes before calling `compute()`. Each GPU accumulates partial metric states locally, and `compute()` waits for all ranks to sync, which can cause hangs if not done properly.

A recommended approach: pass the `sync_on_compute=True` argument when initializing your metric. This ensures the metric automatically syncs across processes before computing:

```python
from torchmetrics import Accuracy

# Note: on torchmetrics >= 0.11 you also need the task arguments,
# e.g. Accuracy(task="multiclass", num_classes=..., sync_on_compute=True)
metric = Accuracy(sync_on_compute=True)
```

Then you can safely call:

```python
metric.update(preds, target)
result = metric.compute()
metric.reset()
```

Alternatively, you can manually call `metric.sync()` before `compute()`.

Also, make sure every rank calls `compute()` — it is a collective operation, so if only one process reaches it, the others block until the timeout you are seeing.

This should fix the hanging issue in DDP mode. Let me know if you want a minimal example to demonstrate this!